Search engine (computing)

In computing, a search engine is an information retrieval software system designed to help find information stored on one or more computer systems. Search engines discover, crawl, transform, and store information for retrieval and presentation in response to user queries. The search results are usually presented in a list and are commonly called ''hits''. The most widely used type of search engine is a web search engine, which searches for information on the World Wide Web.

A search engine normally consists of four components: a search interface, a crawler (also known as a spider or bot), an indexer, and a database. The crawler traverses a document collection, deconstructs document text, and assigns surrogates for storage in the search engine index. Online search engines also store images, link data, and metadata for the document.
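To make the division of labor among these components concrete, here is a minimal sketch in Python. The class and function names are invented for this illustration and do not correspond to any real engine's internals.

<syntaxhighlight lang="python">
# Minimal sketch of the four classic components of a search engine.
# All names here are illustrative, not taken from any real system.

class Crawler:
    """Traverses a document collection, yielding (doc_id, text) pairs."""
    def __init__(self, documents):
        self.documents = documents  # stands in for pages fetched from the web

    def crawl(self):
        for doc_id, text in self.documents.items():
            yield doc_id, text

class Indexer:
    """Deconstructs document text into term surrogates for the index."""
    def build(self, crawled):
        index = {}  # term -> set of doc_ids: a toy stand-in for the database
        for doc_id, text in crawled:
            for term in text.lower().split():
                index.setdefault(term, set()).add(doc_id)
        return index

def search_interface(index, query):
    """Returns the doc_ids (the 'hits') containing every query term."""
    results = None
    for term in query.lower().split():
        docs = index.get(term, set())
        results = docs if results is None else results & docs
    return sorted(results or [])

docs = {1: "search engines crawl the web",
        2: "an indexer stores terms in a database",
        3: "the web is crawled by spiders"}
index = Indexer().build(Crawler(docs).crawl())
print(search_interface(index, "the web"))  # -> [1, 3]
</syntaxhighlight>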


How search engines work

Search engines provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. The criteria are referred to as a search query. In the case of text search engines, the search query is typically expressed as a set of words that identify the desired concept that one or more documents may contain. There are several styles of search query syntax that vary in strictness. Whereas some text search engines require users to enter two or three words separated by white space, other search engines may enable users to specify entire documents, pictures, sounds, and various forms of natural language. Some search engines apply improvements to search queries to increase the likelihood of providing a quality set of items, through a process known as query expansion, and query understanding methods may be used to interpret the query.
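As an illustration of query expansion, the sketch below grows a query with synonyms drawn from a hand-made table. The synonym table and function name are assumptions invented for this example; production systems derive expansions from thesauri, query logs, or statistical models.

<syntaxhighlight lang="python">
# Toy query expansion: add known synonyms to the user's terms so that
# documents using different wording can still match. The synonym table
# below is invented for illustration.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}

def expand_query(query):
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("classic car film"))
# -> ['classic', 'car', 'film', 'automobile', 'vehicle', 'movie']
</syntaxhighlight>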
The list of items that meet the criteria specified by the query is typically sorted, or ranked; ranking items by relevance (from highest to lowest) reduces the time required to find the desired information. Probabilistic search engines rank items based on measures of similarity between each item and the query, typically on a scale of 0 to 1 (with 1 being most similar), and sometimes on popularity or authority (see bibliometrics), or they use relevance feedback. Boolean search engines typically only return items that match exactly, without regard to order, although the term ''Boolean search engine'' may simply refer to the use of Boolean-style syntax (the operators AND, OR, NOT, and XOR) in a probabilistic context.
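Over a precomputed index of posting lists (term-to-item mappings), the Boolean operators just mentioned reduce to set algebra. A toy sketch, assuming a tiny in-memory document collection:

<syntaxhighlight lang="python">
# Build posting lists once (indexing), then answer Boolean-style queries
# with set operations. A simplified illustration on invented data.
docs = {
    1: "apples and oranges",
    2: "oranges and bananas",
    3: "apples alone",
}

postings = {}
for doc_id, text in docs.items():
    for term in text.split():
        postings.setdefault(term, set()).add(doc_id)

all_ids = set(docs)
apples, oranges = postings["apples"], postings["oranges"]
print(apples & oranges)   # apples AND oranges -> {1}
print(apples | oranges)   # apples OR oranges  -> {1, 2, 3}
print(all_ids - oranges)  # NOT oranges        -> {3}
print(apples ^ oranges)   # apples XOR oranges -> {2, 3}
</syntaxhighlight>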
To provide a set of matching items quickly, sorted according to some criteria, a search engine typically collects metadata about the group of items under consideration beforehand, through a process referred to as indexing. The index typically requires a smaller amount of computer storage, which is why some search engines store only the indexed information and not the full content of each item, and instead provide a method of navigating to the items from the search engine results page. Alternatively, the search engine may store a copy of each item in a cache, so that users can see the state of the item at the time it was indexed, for archival purposes, or to make repetitive processes work more efficiently and quickly.

Other types of search engines do not store an index. Crawler, or spider, type search engines (also known as real-time search engines) may collect and assess items at the time of the search query, dynamically considering additional items based on the contents of a starting item (known as a seed, or seed URL in the case of an Internet crawler). Metasearch engines store neither an index nor a cache; instead, they simply reuse the index or results of one or more other search engines to provide an aggregated, final set of results.

Database size, which had been a significant marketing feature through the early 2000s, was displaced by emphasis on relevancy ranking: the methods by which search engines attempt to sort the best results first. Relevancy ranking first became a major issue once it became apparent that it was impractical to review full lists of results. Consequently, algorithms for relevancy ranking have continuously improved.
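The similarity-based ranking described above can be sketched with cosine similarity between raw term-count vectors, which yields scores on the 0-to-1 scale mentioned earlier. Real engines use far richer term weighting; everything here is simplified for illustration.

<syntaxhighlight lang="python">
# Rank documents by cosine similarity to the query. Term-count vectors
# are nonnegative, so scores fall between 0 and 1.
import math
from collections import Counter

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query, documents):
    """Return (document, score) pairs sorted from most to least similar."""
    q = Counter(query.lower().split())
    scored = [(doc, cosine_similarity(q, Counter(doc.lower().split())))
              for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

documents = ["the quick brown fox", "quick searches rank quickly", "a brown cow"]
for doc, score in rank("quick brown", documents):
    print(f"{score:.2f}  {doc}")
</syntaxhighlight>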
Google's PageRank method for ordering search results has received the most press, but all major search engines continually refine their ranking methodologies with a view toward improving the ordering of results. As of 2006, search engine rankings were more important than ever, so much so that an industry of "search engine optimizers" (SEOs) developed to help web developers improve their search rankings, and an entire body of case law has developed around matters that affect search engine rankings, such as the use of trademarks in metatags. The sale of search rankings by some search engines has also created controversy among librarians and consumer advocates.

The search experience for users continues to be enhanced. Google's addition of the Google Knowledge Graph has had wider ramifications for the Internet, possibly even limiting the traffic of certain websites such as Wikipedia. By pulling information and presenting it on Google's own results page, some argue, the Knowledge Graph can negatively affect other sites, although no major harm has been demonstrated.


Search engine categories


Web search engines

Search engines designed expressly for searching web pages, documents, and images were developed to facilitate searching through a large, nebulous mass of unstructured resources. They are engineered to follow a multi-stage process: crawling the vast collection of pages and documents, indexing salient terms and metadata from their contents in a semi-structured form, and finally resolving user queries to return relevant results and links to the indexed documents or pages.


Crawl

In the case of a wholly textual search, the first step in classifying web pages is to find an ‘index item’ that might relate expressly to the ‘search term.’ In the past, search engines began with a small list of URLs (a so-called seed list), fetched the content, and parsed the links on those pages for relevant information, which subsequently provided new links. The process was highly cyclical and continued until enough pages were found for the searcher's use. These days, a continuous crawl method is employed, as opposed to incidental discovery from a seed list; the crawl method is an extension of the discovery method just described. Most search engines use sophisticated scheduling algorithms to “decide” when to revisit a particular page in order to keep its index entry current. These algorithms range from a constant visit interval, with higher priority given to more frequently changing pages, to an adaptive visit interval based on several criteria such as frequency of change, popularity, and overall quality of the site. The speed of the web server hosting the page, as well as resource constraints like the amount of hardware or bandwidth, also figure in.
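The scheduling ideas above can be sketched with a toy crawler that holds pending visits in a priority queue and shortens the revisit interval for pages that change often. The link graph, change rates, and policy constants are all invented for this example; a real crawler fetches pages over HTTP and schedules far more carefully.

<syntaxhighlight lang="python">
# Toy seed-list crawler with a naive adaptive revisit scheduler.
import heapq

LINKS = {  # stand-in for pages and the URLs they link to
    "seed.example": ["a.example", "b.example"],
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": [],
}
CHANGE_RATE = {"seed.example": 5, "a.example": 1, "b.example": 3, "c.example": 1}

def crawl(seed, steps=8):
    # Priority queue of (next_visit_time, url); frequently changing
    # pages are revisited sooner.
    queue = [(0, seed)]
    seen = set()
    while queue and steps > 0:
        time, url = heapq.heappop(queue)
        steps -= 1
        print(f"t={time}: fetching {url}")
        for link in LINKS.get(url, []):
            if link not in seen:  # newly discovered page
                seen.add(link)
                heapq.heappush(queue, (time + 1, link))
        # Schedule a revisit: the interval shrinks with change frequency.
        interval = max(1, 10 // CHANGE_RATE.get(url, 1))
        heapq.heappush(queue, (time + interval, url))

crawl("seed.example")
</syntaxhighlight>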


Link map

Pages discovered by web crawls are often distributed and fed into another computer that creates a map of the resources uncovered. This map resembles a graph, in which the different pages are represented as nodes connected by the links between them. The data is stored in multiple data structures that permit quick access by algorithms that compute a popularity score for pages on the web based on how many links point to a given page. Such scoring determines, for example, whether a user who simply enters ‘Egypt’ as a search term is shown web pages about Mohamed Morsi or about the very best attractions to visit in Cairo. One such algorithm, PageRank, proposed by Google founders Larry Page and Sergey Brin, is well known and has attracted a lot of attention. The idea of doing link analysis to compute a popularity rank is older than PageRank, and other variants of the same idea remain in use; in October 2014, Google's John Mueller confirmed that Google would no longer be updating the PageRank scores it exposed publicly. These ideas fall into two main categories: the rank of individual pages and the nature of web site content. Search engines often differentiate between internal links and external links, on the reasoning that web content creators are not strangers to shameless self-promotion. Link map data structures typically store the anchor text embedded in the links as well, because anchor text can often provide a very good summary of a web page's content.
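PageRank itself can be sketched in a few lines of Python as a power iteration over a link graph. The graph, the handling of dangling pages, and the parameter values below are illustrative assumptions, not Google's production implementation.

<syntaxhighlight lang="python">
# Minimal PageRank by power iteration over a tiny invented link graph.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:  # distribute rank across the page's outgoing links
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(links).items(), key=lambda x: -x[1]):
    print(f"{page}: {score:.3f}")
</syntaxhighlight>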


Database search engines

Searching for text-based content in databases presents a few special challenges, from which a number of specialized search engines have flourished. Databases can be slow when solving complex queries (with multiple logical or string-matching arguments). Databases allow pseudo-logical queries, which full-text searches do not use. No crawling is necessary for a database, since the data is already structured. However, it is often necessary to index the data in a more economized form to allow a more expeditious search.
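The idea of indexing the data in a more economized form can be sketched as follows: rather than scanning every row of a table per query, an auxiliary index over one column is built once and consulted thereafter. The table and column names are invented for this example.

<syntaxhighlight lang="python">
# Index a column of a small in-memory "table" so lookups avoid a full scan.
rows = [
    {"id": 1, "title": "Introduction to Databases"},
    {"id": 2, "title": "Search Engine Design"},
    {"id": 3, "title": "Database Indexing"},
]

# Build once: map each title word to the ids of rows containing it.
title_index = {}
for row in rows:
    for word in row["title"].lower().split():
        title_index.setdefault(word, []).append(row["id"])

# Query via the index instead of scanning all rows.
print(title_index.get("database", []))   # -> [3]
print(title_index.get("databases", []))  # -> [1]
</syntaxhighlight>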


Mixed search engines

Sometimes, the data searched contains both database content and web pages or documents, and search engine technology has developed to respond to both sets of requirements. Most mixed search engines are large web search engines, like Google, that search both structured and unstructured data sources. Take, for example, the word ‘ball’: in its simplest terms, it returns more than 40 variations on Wikipedia alone. Did the user mean a ball, as in the social gathering or dance? A soccer ball? The ball of the foot? Pages and documents are crawled and indexed in a separate index; databases are also indexed from various sources. Search results are then generated for users by querying these multiple indices in parallel and compounding the results according to “rules.”
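A minimal sketch of that last step, querying a web index and a database index in parallel and compounding the results with a simple rule; the indices, scores, and merging rule are all invented for illustration.

<syntaxhighlight lang="python">
# Toy "mixed" search: query two indices in parallel, merge by score.
from concurrent.futures import ThreadPoolExecutor

WEB_INDEX = {"ball": [("soccer-ball.html", 0.9), ("ballroom-dance.html", 0.7)]}
DB_INDEX = {"ball": [("anatomy: ball of the foot", 0.8)]}

def search_web(term):
    return WEB_INDEX.get(term, [])

def search_db(term):
    return DB_INDEX.get(term, [])

def mixed_search(term):
    with ThreadPoolExecutor() as pool:
        web = pool.submit(search_web, term)
        db = pool.submit(search_db, term)
        combined = web.result() + db.result()
    # Compounding "rule": interleave all hits by descending score.
    return sorted(combined, key=lambda hit: hit[1], reverse=True)

for hit, score in mixed_search("ball"):
    print(f"{score:.1f}  {hit}")
</syntaxhighlight>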


History of search technology


The Memex

The concept of hypertext and a memory extension originates from an article published in ''The Atlantic Monthly'' in July 1945, written by Vannevar Bush and titled "As We May Think". In this article Bush urged scientists to work together to help build a body of knowledge for all mankind. He then proposed the idea of a virtually limitless, fast, reliable, extensible, associative memory storage and retrieval system, and he named this device a memex.

Bush regarded the notion of “associative indexing” as his key conceptual contribution. As he explained, this was “a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.” All of the documents used in the memex would be in the form of microfilm copy acquired as such or, in the case of personal records, transformed to microfilm by the machine itself. The memex would also employ new retrieval techniques based on a new kind of associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another, creating personal "trails" through linked documents. The new procedures that Bush anticipated facilitating information storage and retrieval would, he believed, lead to the development of wholly new forms of the encyclopedia.

The most important mechanism conceived by Bush is the associative trail: a way to create a new linear sequence of microfilm frames across any arbitrary sequence of microfilm frames, by creating a chained sequence of links in the way just described, along with personal comments and side trails. In 1965, Bush took part in MIT's project INTREX, which developed technology for mechanizing the processing of information for library use. In his 1967 essay "Memex Revisited", he pointed out that the development of the digital computer, the transistor, the video, and other similar devices had heightened the feasibility of such mechanization, but that costs would delay its achievements.


SMART

Gerard Salton, who died on August 28, 1995, was the father of modern search technology. His teams at Harvard and Cornell developed the SMART information retrieval system. SMART (Salton's Magic Automatic Retriever of Text) included important concepts like the vector space model, inverse document frequency (IDF), term frequency (TF), term discrimination values, and relevance feedback mechanisms. He authored a 56-page book called ''A Theory of Indexing'', which explains many of his tests, upon which search is still largely based.
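A minimal sketch of TF and IDF combined into a term weight, in the spirit of the vector space model. The particular formula shown (raw term frequency times the logarithm of inverse document frequency) is one common variant among several, and the corpus is invented for illustration.

<syntaxhighlight lang="python">
# TF-IDF weights for a tiny corpus: weight = tf(term, doc) * log(N / df(term)).
import math
from collections import Counter

corpus = [
    "information retrieval and search",
    "vector space model for retrieval",
    "term frequency and inverse document frequency",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def tf_idf(term, doc_tokens):
    tf = Counter(doc_tokens)[term]                 # term frequency in this doc
    df = sum(1 for d in docs if term in d)         # document frequency
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

print(round(tf_idf("retrieval", docs[0]), 3))  # common term, lower weight
print(round(tf_idf("vector", docs[1]), 3))     # rare term, higher weight
</syntaxhighlight>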


String search engines

In 1987, an article was published detailing the development of a character-string search engine (SSE) for rapid text retrieval on a double-metal 1.6-μm n-well CMOS solid-state circuit, with 217,600 transistors laid out on an 8.62 × 12.76 mm die. The SSE accommodated a novel string-search architecture that combines 512-stage finite-state automaton (FSA) logic with a content-addressable memory (CAM) to achieve an approximate string comparison of 80 million strings per second. The CAM cell consisted of four conventional static RAM (SRAM) cells and a read/write circuit. Concurrent comparison of 64 stored strings of variable length was achieved in 50 ns for an input text stream of 10 million characters/s, maintaining performance despite the presence of single-character errors in the form of character codes. Furthermore, the chip allowed nonanchored string search and variable-length ‘don't care’ (VLDC) string search.
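The chip itself is a hardware design, but the finite-state-automaton idea can be sketched in software: the toy matcher below compiles a pattern into states via the classic Knuth-Morris-Pratt failure function and scans the text one character at a time, as an FSA would. It performs exact matching only, unlike the chip's error-tolerant comparison.

<syntaxhighlight lang="python">
# Toy finite-state string matcher: the current state is the number of
# pattern characters matched so far; reaching len(pattern) accepts.
def build_failure(pattern):
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def fsa_search(text, pattern):
    fail = build_failure(pattern)
    state = 0
    hits = []
    for pos, char in enumerate(text):
        while state > 0 and char != pattern[state]:
            state = fail[state - 1]  # fall back to a shorter match
        if char == pattern[state]:
            state += 1
        if state == len(pattern):    # accepting state: full match found
            hits.append(pos - len(pattern) + 1)
            state = fail[state - 1]
    return hits

print(fsa_search("abracadabra", "abra"))  # -> [0, 7]
</syntaxhighlight>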


See also


By source

* Database search engine
* Desktop search
* Enterprise search
* Federated search
* Human search engine
* Metasearch engine
* Multisearch
* Search aggregator
* Web search engine


By content type

* Audio search engine
* Full-text search
* Image search
* Video search engine


By interface

* Incremental search
* Instant answer
* Semantic search
* Selection-based search
* Voice search


By topic

* Bibliographic database
* Enterprise search
* Medical literature retrieval
* Vertical search


Others

* Automatic summarization
* Emanuel Goldberg (inventor of an early search engine)
* Index (search engine)
* Inverted index
* List of search engines
* Search as a service
* Search engine indexing
* Search engine optimization
* Search suggest drop-down list
* Solver (computer science)
* Spamdexing
* SQL
* Text mining
* Web crawler
* Word-sense disambiguation (dealing with ambiguity)

