HOME

TheInfoList



OR:

In
text processing In computing, the term text processing refers to the theory and practice of automating the creation or manipulation of electronic text. ''Text'' usually refers to all the alphanumeric characters specified on the keyboard of the person engaging th ...
, a proximity search looks for documents where two or more separately matching term occurrences are within a specified
distance Distance is a numerical or occasionally qualitative measurement of how far apart objects, points, people, or ideas are. In physics or everyday usage, distance may refer to a physical length or an estimation based on other criteria (e.g. "two co ...
, where distance is the number of intermediate words or characters. In addition to proximity, some implementations may also impose a constraint on the word order, in that the order in the searched text must be identical to the order of the search query. Proximity searching goes beyond the simple matching of words by adding the constraint of proximity and is generally regarded as a form of advanced search. For example, a search could be used to find "red brick house", and match phrases such as "red house of brick" or "house made of red brick". By limiting the proximity, these phrases can be matched while avoiding documents where the words are scattered or spread across a page or in unrelated articles in an anthology.


Rationale

The basic linguistic assumption of proximity searching is that the proximity of the words in a document implies a relationship between the words. Given that authors of documents try to formulate sentences which contain a single idea, or cluster of related ideas within neighboring sentences or organized into paragraphs, there is an inherent, relatively high, probability within the document structure that words used together are related. On the other hand, when two words are on the opposite ends of a book, the probability of a relationship between the words is relatively weak. By limiting search results to only include matches where the words are within the specified maximum proximity, or distance, the search results are assumed to be of higher
relevance Relevance is the connection between topics that makes one useful for dealing with the other. Relevance is studied in many different fields, including cognitive science, logic, and library and information science. Epistemology studies it in gener ...
than the matches where the words are scattered. Commercial internet search engines tend to produce too many matches (known as recall) for the average search query. Proximity searching is one method of reducing the number of pages matches, and to improve the relevance of the matched pages by using word proximity to assist in ranking. As an added benefit, proximity searching helps combat
spamdexing Spamdexing (also known as search engine spam, search engine poisoning, black-hat search engine optimization, search spam or web spam) is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building ...
by avoiding webpages which contain dictionary lists or shotgun lists of thousands of words, which would otherwise rank highly if the search engine was heavily biased toward word frequency.


Boolean syntax and operators

Note that a proximity search can designate that only some keywords must be within a specified distance. Proximity searching can be used with other search syntax and/or controls to allow more articulate search queries. Sometimes query operators like NEAR, NOT NEAR, FOLLOWED BY, NOT FOLLOWED BY, SENTENCE or FAR are used to indicate a proximity-search limit between specified keywords: for example, "brick NEAR house".


Usage in commercial search engines

In regards to implicit/automatic versus explicit proximity search, as of November 2008, most Internet
search engine A search engine is a software system that provides hyperlinks to web pages, and other relevant information on World Wide Web, the Web in response to a user's web query, query. The user enters a query in a web browser or a mobile app, and the sea ...
s only implement an implicit proximity search functionality. That is, they automatically rank those search results higher where the user keywords have a good "overall proximity score" in such results. If only two keywords are in the search query, this has no difference from an explicit proximity search which puts a NEAR operator between the two keywords. However, if three or more than three keywords are present, it is often important for the user to specify which subsets of these keywords expect a proximity in search results. This is useful if the user wants to do a
prior art Prior art (also known as state of the art or background art) is a concept in patent law used to determine the patentability of an invention, in particular whether an invention meets the novelty and the inventive step or non-obviousness criteria f ...
search (e.g. finding an existing approach to complete a specific task, finding a document that discloses a system that exhibits a procedural behavior collaboratively conducted by several components and links between these components).
Web search engine A search engine is a software system that provides hyperlinks to web pages, and other relevant information on World Wide Web, the Web in response to a user's web query, query. The user enters a query in a web browser or a mobile app, and the sea ...
s which support proximity search via an explicit proximity operator in their query language include Walhello, Exalead,
Yandex Yandex LLC ( rus, Яндекс, r=Yandeks, p=ˈjandəks) is a Russian technology company that provides Internet-related products and services including a web browser, search engine, cloud computing, web mapping, online food ordering, streaming ...
,
Yahoo! Yahoo (, styled yahoo''!'' in its logo) is an American web portal that provides the search engine Yahoo Search and related services including My Yahoo, Yahoo Mail, Yahoo News, Yahoo Finance, Yahoo Sports, y!entertainment, yahoo!life, and its a ...
,
Altavista AltaVista was a web search engine established in 1995. It became one of the most-used early search engines, but lost ground to Google and was purchased by Yahoo! in 2003, which retained the brand, but based all AltaVista searches on its own sear ...
, and
Bing Bing most often refers to: * Bing Crosby (1903–1977), American singer * Microsoft Bing, a web search engine Bing may also refer to: Food and drink * Bing (bread), a Chinese flatbread * Bing (soft drink), a UK brand * Bing cherry, a varie ...
: * When using the Walhello search-engine, the proximity can be defined by the number of characters between the keywords. * The search engine Exalead allows the user to specify the required proximity, as the maximum number of words between keywords. The syntax is (keyword1 NEAR/n keyword2) where n is the number of words. *
Yandex Yandex LLC ( rus, Яндекс, r=Yandeks, p=ˈjandəks) is a Russian technology company that provides Internet-related products and services including a web browser, search engine, cloud computing, web mapping, online food ordering, streaming ...
uses the syntax keyword1 /n keyword2 to search for two keywords separated by at most n - 1 words, and supports a few other variations of this syntax. *
Yahoo! Yahoo (, styled yahoo''!'' in its logo) is an American web portal that provides the search engine Yahoo Search and related services including My Yahoo, Yahoo Mail, Yahoo News, Yahoo Finance, Yahoo Sports, y!entertainment, yahoo!life, and its a ...
and
Altavista AltaVista was a web search engine established in 1995. It became one of the most-used early search engines, but lost ground to Google and was purchased by Yahoo! in 2003, which retained the brand, but based all AltaVista searches on its own sear ...
both support an undocumented NEAR operator. The syntax is keyword1 NEAR keyword2. *
Google Search Google Search (also known simply as Google or Google.com) is a search engine operated by Google. It allows users to search for information on the World Wide Web, Web by entering keywords or phrases. Google Search uses algorithms to analyze an ...
supports AROUND(#). *
Bing Bing most often refers to: * Bing Crosby (1903–1977), American singer * Microsoft Bing, a web search engine Bing may also refer to: Food and drink * Bing (bread), a Chinese flatbread * Bing (soft drink), a UK brand * Bing cherry, a varie ...
supports NEAR. The syntax is keyword1 near:n keyword2 where n=the number of maximum separating words. Ordered search within the
Google Google LLC (, ) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, consumer electronics, and artificial ...
and
Yahoo! Yahoo (, styled yahoo''!'' in its logo) is an American web portal that provides the search engine Yahoo Search and related services including My Yahoo, Yahoo Mail, Yahoo News, Yahoo Finance, Yahoo Sports, y!entertainment, yahoo!life, and its a ...
search engines is possible using the asterisk (*) full-word wildcards: in Google this matches one or more words, and an in Yahoo! Search this matches exactly one word."Review of Yahoo! Search", by Search Engine Showdown, visited 23 December 2009
/ref> (This is easily verified by searching for the following phrase in both Google and Yahoo!: "addictive * of biblioscopy".) To emulate unordered search of the NEAR operator can be done using a combination of ordered searches. For example, to specify a close co-occurrence of "house" and "dog", the following search-expression could be specified: "house dog" OR "dog house" OR "house * dog" OR "dog * house" OR "house * * dog" OR "dog * * house".


See also

*
Compound term processing Compound-term processing, in information-retrieval, is search result matching on the basis of compound terms. Compound terms are built by combining two or more simple terms; for example, "triple" is a single word term, but "triple heart bypass" is ...
*
Edit distance In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two String (computing), strings (e.g., words) are to one another, that is measured by counting the minimum number of opera ...
*
Information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
*
Search engine A search engine is a software system that provides hyperlinks to web pages, and other relevant information on World Wide Web, the Web in response to a user's web query, query. The user enters a query in a web browser or a mobile app, and the sea ...
*
Search engine indexing Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, an ...
- how texts are indexed to support proximity search * Semantic proximity


Notes

{{Reflist Information retrieval techniques Internet search algorithms