Proximity search (text)

Last updated February 09, 2024

In text processing, a proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters. In addition to proximity, some implementations may also impose a constraint on the word order, in that the order in the searched text must be identical to the order of the search query. Proximity searching goes beyond the simple matching of words by adding the constraint of proximity and is generally regarded as a form of advanced search.

For example, a search could be used to find "red brick house", and match phrases such as "red house of brick" or "house made of red brick". By limiting the proximity, these phrases can be matched while avoiding documents where the words are scattered or spread across a page or in unrelated articles in an anthology.
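The "red brick house" example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular search engine works: the function name `within_proximity`, the simple word tokenizer, and the choice to measure distance as the number of non-query words inside the smallest span containing all terms are all hypothetical.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def within_proximity(text, terms, max_gap):
    """Return True if all (distinct) terms occur in one span of the text
    that contains at most max_gap words other than the terms themselves."""
    tokens = tokenize(text)
    # Positions at which each query term occurs in the text.
    positions = [
        [i for i, tok in enumerate(tokens) if tok == term.lower()]
        for term in terms
    ]
    if any(not p for p in positions):
        return False  # at least one term is missing entirely
    # Brute force: try every combination of one occurrence per term.
    for combo in product(*positions):
        span_length = max(combo) - min(combo) + 1
        if span_length - len(terms) <= max_gap:
            return True
    return False


query = ["red", "brick", "house"]
print(within_proximity("red house of brick", query, 2))
print(within_proximity("house made of red brick", query, 2))
print(within_proximity(
    "the red car passed a house that was far away from the old brick wall",
    query, 2))
```

With a limit of two intermediate words, the first two phrases match, while the third, in which the words are scattered, does not.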

Rationale

The basic linguistic assumption of proximity searching is that the proximity of the words in a document implies a relationship between the words. Given that authors of documents try to formulate sentences which contain a single idea, or cluster of related ideas within neighboring sentences or organized into paragraphs, there is an inherent, relatively high probability within the document structure that words used together are related. On the other hand, when two words are at opposite ends of a book, the probability of a relationship between the words is relatively low. By limiting search results to only include matches where the words are within the specified maximum proximity, or distance, the search results are assumed to be of higher relevance than the matches where the words are scattered.

Commercial internet search engines tend to return too many matches for the average search query (high recall at the expense of precision). Proximity searching is one method of reducing the number of page matches and of improving the relevance of the matched pages by using word proximity to assist in ranking. As an added benefit, proximity searching helps combat spamdexing by avoiding webpages which contain dictionary lists or shotgun lists of thousands of words, which would otherwise rank highly if the search engine were heavily biased toward word frequency.

Boolean syntax and operators

Note that a proximity search can designate that only some keywords must be within a specified distance. Proximity searching can be used with other search syntax and/or controls to allow more articulate search queries. Sometimes query operators like NEAR, NOT NEAR, FOLLOWED BY, NOT FOLLOWED BY, SENTENCE or FAR are used to indicate a proximity-search limit between specified keywords: for example, "brick NEAR house".
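The difference between an unordered operator such as NEAR and an ordered one such as FOLLOWED BY can be sketched as follows. The exact semantics of these operators vary between systems; the function names and the convention that the distance counts intermediate words are assumptions made for this illustration.

```python
def positions(tokens, term):
    """Indexes at which term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def near(tokens, a, b, max_gap):
    """Unordered: a and b occur with at most max_gap words between them."""
    return any(
        i != j and abs(i - j) - 1 <= max_gap
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


def followed_by(tokens, a, b, max_gap):
    """Ordered: b occurs after a with at most max_gap words between them."""
    return any(
        0 < j - i <= max_gap + 1
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


def not_near(tokens, a, b, max_gap):
    """Negation of near(): no occurrence pair is within the limit."""
    return not near(tokens, a, b, max_gap)


tokens = "the house was built of red brick".split()
print(near(tokens, "brick", "house", 4))         # order is ignored
print(followed_by(tokens, "brick", "house", 4))  # "brick" must come first
print(followed_by(tokens, "house", "brick", 4))
```

Here "brick NEAR house" matches regardless of word order, while "brick FOLLOWED BY house" does not, because "house" precedes "brick" in the text.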

Usage in commercial search engines

With regard to implicit (automatic) versus explicit proximity search, as of November 2008 most Internet search engines implemented only implicit proximity search. That is, they automatically rank higher those search results in which the user's keywords have a good "overall proximity score". If the search query contains only two keywords, this is no different from an explicit proximity search that puts a NEAR operator between them. However, if three or more keywords are present, it is often important for the user to specify which subsets of these keywords are expected to be in proximity in the search results. This is useful if the user wants to do a prior art search (e.g. finding an existing approach to complete a specific task, or finding a document that discloses a system that exhibits a procedural behavior collaboratively conducted by several components and links between these components).
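One way to picture an implicit "overall proximity score" is to rank documents by the length of the shortest span of text that contains every query keyword. The sketch below is only an assumption about how such a score might be computed; real search engines do not publish their ranking formulas, and the names `smallest_window` and `rank_by_proximity` are hypothetical.

```python
from collections import Counter


def smallest_window(tokens, terms):
    """Length in words of the shortest span containing every term,
    or None if some term does not occur at all."""
    wanted = set(terms)
    counts = Counter()  # query terms currently inside the window
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in wanted:
            counts[tok] += 1
        # Shrink the window from the left while it still covers all terms.
        while len(counts) == len(wanted):
            length = right - left + 1
            if best is None or length < best:
                best = length
            if tokens[left] in wanted:
                counts[tokens[left]] -= 1
                if counts[tokens[left]] == 0:
                    del counts[tokens[left]]
            left += 1
    return best


def rank_by_proximity(documents, terms):
    """Return the documents containing all terms, tightest match first."""
    scored = []
    for doc in documents:
        window = smallest_window(doc.lower().split(), terms)
        if window is not None:
            scored.append((window, doc))
    return [doc for window, doc in sorted(scored, key=lambda pair: pair[0])]


docs = [
    "the house at the end of the long road was made of brick",
    "a brick house",
    "no match here",
]
for doc in rank_by_proximity(docs, ["brick", "house"]):
    print(doc)
```

The document in which "brick" and "house" are adjacent is ranked first, and the document that lacks the keywords is omitted. An explicit proximity query would instead let the user state which keyword pairs must be close.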

Web search engines which support proximity search via an explicit proximity operator in their query language include Walhello, Exalead, Yandex, Yahoo!, Altavista, Google Search, and Bing:

When using the Walhello search-engine, the proximity can be defined by the number of characters between the keywords.^[1]
The search engine Exalead allows the user to specify the required proximity, as the maximum number of words between keywords. The syntax is (keyword1 NEAR/n keyword2) where n is the number of words.^[2]
Yandex uses the syntax keyword1 /n keyword2 to search for two keywords separated by at most $n-1$ words, and supports a few other variations of this syntax.^[3]
Yahoo! and Altavista both support an undocumented NEAR operator.^[4]^[5] The syntax is keyword1 NEAR keyword2.
Google Search supports AROUND(#).^[6]^[7]
Bing supports NEAR.^[8] The syntax is keyword1 near:n keyword2, where n is the maximum number of separating words.
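The operator syntaxes listed above can be compared side by side with a small query-string builder. This is a sketch only: the function `proximity_query` is hypothetical, the placement of Google's AROUND(n) between the two keywords is an assumption not spelled out above, and several of these engines or operators may no longer be available.

```python
def proximity_query(engine, first, second, max_words_between):
    """Format a two-keyword proximity query allowing at most
    max_words_between words between the keywords."""
    n = max_words_between
    if engine == "exalead":
        return f"({first} NEAR/{n} {second})"
    if engine == "yandex":
        # Yandex's /n allows at most n-1 intermediate words.
        return f"{first} /{n + 1} {second}"
    if engine == "bing":
        return f"{first} near:{n} {second}"
    if engine == "google":
        # Assumed placement: the operator sits between the keywords.
        return f"{first} AROUND({n}) {second}"
    if engine in ("yahoo", "altavista"):
        # The undocumented NEAR operator takes no distance argument.
        return f"{first} NEAR {second}"
    raise ValueError(f"no known proximity syntax for {engine!r}")


for engine in ("exalead", "yandex", "bing", "google", "yahoo"):
    print(engine, "->", proximity_query(engine, "brick", "house", 3))
```

Note the off-by-one difference for Yandex: allowing three intermediate words requires /4 rather than /3.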

Ordered search within the Google and Yahoo! search engines is possible using the asterisk (*) full-word wildcard: in Google this matches one or more words,^[9] and in Yahoo! Search this matches exactly one word.^[10] (This is easily verified by searching for the following phrase in both Google and Yahoo!: "addictive * of biblioscopy".)

The unordered search of the NEAR operator can be emulated using a combination of ordered searches. For example, to specify a close co-occurrence of "house" and "dog", the following search-expression could be specified: "house dog" OR "dog house" OR "house * dog" OR "dog * house" OR "house * * dog" OR "dog * * house".
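An expression of this kind can be generated mechanically. The following is a minimal sketch; the function name `emulate_near` and its parameters are hypothetical.

```python
def emulate_near(first, second, max_wildcards):
    """Build an OR-expression of quoted phrases that covers both word
    orders with 0 to max_wildcards full-word wildcards in between."""
    phrases = []
    for gap in range(max_wildcards + 1):
        for a, b in ((first, second), (second, first)):
            words = [a] + ["*"] * gap + [b]
            phrases.append('"' + " ".join(words) + '"')
    return " OR ".join(phrases)


print(emulate_near("house", "dog", 2))
```

Calling `emulate_near("house", "dog", 2)` reproduces the six-phrase expression given above. Because the wildcard matches exactly one word in Yahoo! but one or more words in Google, the effective distance limit of the generated query differs between the two engines.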

Notes

[1] "About Walhello", archived 2012-05-01 at archive.today, visited 23 December 2009
[2] "Web Search Syntax", visited 23 December 2009
[3] Yandex help page on query language (in Russian)
[4] "Successful Yahoo! proximity query" (22 Feb 2010)
[5] "Unsuccessful Yahoo! proximity query" (22 Feb 2010)
[6] "GuidingTech: Meet Google Search's Little Known AROUND Operator"
[7] "Google Offers Proximity Search" (8 Feb 2011)
[8] "How to Use Bing’s Advanced Search Operators"
[9] "More Google Search Help", visited 23 December 2009
[10] "Review of Yahoo! Search", by Search Engine Showdown, visited 23 December 2009


This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
