Metasearch engine

[Figure: Architecture of a metasearch engine]

A metasearch engine (or search aggregator) is an online information retrieval tool that uses the data of web search engines to produce its own results. [1] [2] Metasearch engines take input from a user and immediately query search engines [3] for results. Once sufficient data has been gathered, it is ranked and presented to the user.

Problems such as spamming reduce the accuracy and precision of results. [4] The process of fusion aims to improve the engineering of a metasearch engine. [5]

Examples of metasearch engines include Skyscanner and Kayak.com, which aggregate search results from online travel agencies and provider websites, and Searx, a free and open-source search engine that aggregates results from internet search engines.

History

The first person to incorporate the idea of metasearching was Daniel Dreilinger of Colorado State University. He developed SearchSavvy, which let users search up to 20 different search engines and directories at once. Although fast, the search engine was restricted to simple searches and thus was not reliable. University of Washington student Erik Selberg released a more "updated" version called MetaCrawler. This search engine improved on SearchSavvy's accuracy by adding its own search syntax behind the scenes and matching that syntax to the syntax of each search engine it was probing. MetaCrawler reduced the number of search engines queried to six, but although it produced more accurate results, it still was not considered as accurate as running the query in an individual engine. [6]

Launched on May 20, 1996, HotBot, then owned by Wired, was a search engine whose results came from the Inktomi and Direct Hit databases. It was known for its fast results and for the ability to search within search results. After it was bought by Lycos in 1998, development of the search engine stagnated and its market share fell drastically. After going through a few alterations, HotBot was redesigned into a simplified search interface, with its features being incorporated into Lycos' website redesign. [7]

A metasearch engine called Anvish was developed by Bo Shu and Subhash Kak in 1999; the search results were sorted using instantaneously trained neural networks. [8] This was later incorporated into another metasearch engine called Solosearch. [9]

In August 2000, India got its first metasearch engine when HumHaiIndia.com was launched. [10] It was developed by the then 16-year-old Sumeet Lamba. [11] The website was later rebranded as Tazaa.com. [12]

Ixquick is a search engine known for its privacy policy statement. Developed and launched in 1998 by David Bodnick, it is owned by Surfboard Holding BV. In June 2006, Ixquick began to delete private details of its users, following the same practice as Scroogle. Ixquick's privacy policy includes no recording of users' IP addresses, no identifying cookies, no collection of personal data, and no sharing of personal data with third parties. [13] It also uses a unique ranking system in which each result is rated with stars: the more stars a result has, the more search engines agreed on that result.
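The star idea described above can be illustrated with a minimal sketch, assuming one star is awarded per engine that returned a result. The function names, engine names, and URLs are hypothetical; this is not Ixquick's actual implementation.

```python
from collections import Counter

def star_ratings(results_by_engine):
    """Return {url: stars}, where stars is the number of engines that returned the url."""
    stars = Counter()
    for urls in results_by_engine.values():
        # set() so that an engine listing a URL twice still contributes one star
        stars.update(set(urls))
    return dict(stars)

def rank_by_stars(results_by_engine):
    """Order results by star count, most stars first."""
    stars = star_ratings(results_by_engine)
    # Ties are broken alphabetically so the order is deterministic
    return sorted(stars.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical result lists from three engines
example = {
    "engine_a": ["https://a.example", "https://b.example"],
    "engine_b": ["https://b.example", "https://c.example"],
    "engine_c": ["https://b.example", "https://a.example"],
}
ranked = rank_by_stars(example)
```

Here the result returned by all three engines ends up first with three stars, and the result returned by only one engine ends up last.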

In April 2005, Dogpile, then owned and operated by InfoSpace, Inc., collaborated with researchers from the University of Pittsburgh and Pennsylvania State University to measure the overlap and ranking differences of leading Web search engines in order to gauge the benefits of using a metasearch engine to search the web. Results found that from 10,316 random user-defined queries from Google, Yahoo!, and Ask Jeeves, only 3.2% of first page search results were the same across those search engines for a given query. Another study later that year using 12,570 random user-defined queries from Google, Yahoo!, MSN Search, and Ask Jeeves found that only 1.1% of first page search results were the same across those search engines for a given query. [14]

Advantages

By sending a query to several search engines at once, a metasearch engine extends the coverage of a topic and allows more information to be found. Metasearch engines use the indexes built by other search engines, aggregating and often post-processing results in unique ways. A metasearch engine has an advantage over a single search engine because more results can be retrieved with the same amount of effort. [2] It also spares users from having to type the same search into different engines individually to look for resources. [2]

Metasearching is also a useful approach if the purpose of the user's search is to get an overview of the topic or to get quick answers. Instead of having to go through multiple search engines such as Yahoo! or Google and compare results, users can rely on a metasearch engine to compile and combine results quickly. It can do so either by listing results from each engine queried with no additional post-processing (Dogpile) or by analyzing the results and ranking them by its own rules (Ixquick, MetaCrawler, and Vivisimo).

A metasearch engine can also hide the searcher's IP address from the search engines queried, thus providing privacy to the search.

Disadvantages

Metasearch engines are not capable of parsing query forms or of fully translating query syntax. The number of hyperlinks generated by a metasearch engine is limited, so the user does not receive the complete results of a query. [15]

The majority of metasearch engines do not return more than ten linked files from any single search engine, and generally do not interact with larger search engines for results. Pay-per-click links are prioritised and are normally displayed first. [16]

Metasearching also gives the illusion that there is more coverage of the topic queried, particularly if the user is searching for popular or commonplace information. It is common to end up with multiple identical results from the queried engines. It is also harder for users to send advanced search syntax with the query, so results may not be as precise as when a user works with the advanced search interface of a specific engine. As a result, many metasearch engines support only simple searching. [17]

Operation

A metasearch engine accepts a single search request from the user. This search request is then passed on to the databases of other search engines. A metasearch engine does not create its own database of web pages but instead builds a federated database system that integrates data from multiple sources. [18] [19] [20]
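The request flow above can be sketched as a simple fan-out: one query goes to every underlying engine, and the responses are collected per engine. This is a minimal sketch under stated assumptions. `engine_a` and `engine_b` are hypothetical stand-ins that return canned URLs; a real system would call each engine's own interface instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real search engines: each takes a query string
# and returns a ranked list of result URLs.
def engine_a(query):
    return [f"https://a.example/{query}/1", f"https://shared.example/{query}"]

def engine_b(query):
    return [f"https://shared.example/{query}", f"https://b.example/{query}/1"]

def metasearch(query, engines):
    """Send one query to every engine in parallel and return {engine_name: results}."""
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        # result() blocks until that engine has answered
        return {name: future.result() for name, future in futures.items()}

responses = metasearch("python", {"a": engine_a, "b": engine_b})
```

Note that the metasearch engine stores no index of its own here: every result comes from an underlying engine at query time.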

Every search engine is unique and uses different algorithms to generate ranked data, and because several engines answer the same query, the combined results will contain duplicates. To remove duplicates, a metasearch engine processes this data and applies its own algorithm. A revised list is produced as an output for the user. [citation needed] When a metasearch engine contacts other search engines, these search engines will respond in three ways:
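Duplicate removal can be sketched as follows. This is a minimal illustration, not the algorithm of any particular metasearch engine: it assumes two results are duplicates when their URLs match after ignoring the scheme, a leading "www.", letter case in the host name, and a trailing slash. Those normalization rules are assumptions chosen for the example.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Build a comparison key for a URL (assumed rules, see the lead-in)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return (host, path, parts.query)

def merge_without_duplicates(result_lists):
    """Concatenate result lists, keeping only the first occurrence of each page."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged

merged = merge_without_duplicates([
    ["https://www.example.org/page/", "https://example.net/a"],
    ["http://example.org/page", "https://example.com/b"],
])
```

In this example the second engine's first result is dropped because it points to the same page as the first engine's first result.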

Architecture of ranking

Web pages that are highly ranked on many search engines are likely to be more relevant in providing useful information. [21] However, search engines assign different ranking scores to each website, and most of the time these scores are not the same. This is because search engines prioritise different criteria and scoring methods, so a website might be ranked highly on one search engine and poorly on another. This is a problem because metasearch engines rely heavily on the consistency of this data to generate reliable results. [21]
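One way to combine rankings without relying on each engine's incomparable scores is to use rank positions only. The sketch below uses a Borda count, a generic rank-aggregation technique chosen here for illustration; the source does not say which method any given metasearch engine uses. It assumes each input list contains no repeated entries, and a page missing from an engine's list simply receives no points from that engine.

```python
def borda_fuse(rankings):
    """Fuse several ranked lists into one using a Borda count."""
    candidates = {url for ranking in rankings for url in ranking}
    n = len(candidates)
    scores = dict.fromkeys(candidates, 0)
    for ranking in rankings:
        for position, url in enumerate(ranking):
            # Top position earns n points, the next n - 1, and so on
            scores[url] += n - position
    # Highest total first; ties broken alphabetically for determinism
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical rankings from three engines
fused_ranking = borda_fuse([
    ["x", "y", "z"],
    ["y", "x"],
    ["y", "w", "x"],
])
```

Page "y" wins because it is ranked first or second by every engine, which matches the intuition that agreement across engines signals relevance.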

Fusion

[Figure: Data Fusion Model]

A metasearch engine uses the process of fusion to filter data for more efficient results. The two main fusion methods are collection fusion and data fusion.
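As a hedged illustration of score-based data fusion, the sketch below rescales each engine's scores to the range 0 to 1 with min-max normalization and then sums them per document, a scheme commonly called CombSUM. The document identifiers and score values are made up, and this is one possible approach rather than the method of any specific engine.

```python
def min_max(scores):
    """Rescale one engine's scores to [0, 1] so engines become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally good
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def comb_sum(score_lists):
    """Sum normalized scores per document (CombSUM) and sort by the total."""
    fused = {}
    for scores in score_lists:
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

fused = comb_sum([
    {"d1": 12.0, "d2": 8.0, "d3": 4.0},  # hypothetical engine with scores on a 4-12 scale
    {"d2": 0.9, "d3": 0.5, "d4": 0.1},   # hypothetical engine with scores on a 0-1 scale
])
```

Without the normalization step, the first engine's larger raw scores would dominate the sum regardless of how relevant its results are.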

Spamdexing

Spamdexing is the deliberate manipulation of search engine indexes. It uses a number of methods to manipulate the relevance or prominence of indexed resources in a manner that is not aligned with the intention of the indexing system. Spamdexing can be very distressing for users and problematic for search engines because the returned search results have poor precision. [citation needed] Eventually the search engine becomes unreliable for the user. To tackle spamdexing, search robot algorithms are made more complex and are changed almost every day. [24]

Spamdexing is a major problem for metasearch engines because it tampers with the web crawler's indexing criteria, which are heavily relied upon to format ranking lists. Spamdexing manipulates the natural ranking system of a search engine and places websites higher on the ranking list than they would naturally be placed. [25] There are three primary methods used to achieve this:

Content spam

Content spam comprises techniques that alter the logical view that a search engine has of a page's contents. Techniques include:

Link spam

Link spam consists of links between pages that are present for reasons other than merit. Techniques include:

Cloaking

Cloaking is an SEO technique in which different materials and information are sent to the web crawler and to the web browser. [26] It is commonly used as a spamdexing technique because it can trick search engines into either visiting a site that is substantially different from the search engine description or giving a certain site a higher ranking.

References

  1. Berger, Sandy (2005). "Sandy Berger's Great Age Guide to the Internet" (Document). Que Publishing. ISBN 0-7897-3442-7.
  2. "Architecture of a Metasearch Engine that Supports User Information Needs". 1999.
  3. Ride, Onion (2021). "How search Engine work". onionride.
  4. Lawrence, Stephen R.; Lee Giles, C. (October 10, 1997). "Patent US6999959 - Meta search engine" via Google Books.
  5. Voorhees, Ellen M.; Gupta, Narendra; Johnson-Laird, Ben (April 2000). "The collection fusion problem".
  6. "The Meta-search — Search Engine History". Archived from the original on 2020-01-30. Retrieved 2014-12-02.
  7. "Search engine rankings on HotBot: a brief history of the HotBot search engine".
  8. Shu, Bo; Kak, Subhash (1999). "A neural network based intelligent metasearch engine". Information Sciences. 120 (4): 1–11. CiteSeerX 10.1.1.84.6837. doi:10.1016/S0020-0255(99)00062-6.
  9. Kak, Subhash (November 1999). "Better Web searches and prediction with instantaneously trained neural networks" (PDF). IEEE Intelligent Systems.
  10. "New kid in town". India Today. Retrieved 2024-03-14.
  11. "What is Metasearch Engine?". GeeksforGeeks. 2020-08-01. Retrieved 2024-03-14.
  12. "www.metaseek.nl". www.metaseek.nl. Retrieved 2024-03-14.
  13. "ABOUT US – Our history".
  14. Spink, Amanda; Jansen, Bernard J.; Kathuria, Vinish; Koshman, Sherry (2006). "Overlap among major web search engines" (PDF). Emerald.
  15. "Department of Informatics". University of Fribourg.
  16. "Intelligence Exploitation of the Internet" (PDF). 2002.
  17. Hennegar, Anne (16 September 2009). "Metasearch Engines Expands your Horizon".
  18. Meng, Weiyi (May 5, 2008). "Metasearch Engines" (PDF).
  19. Selberg, Erik; Etzioni, Oren (1997). "The MetaCrawler architecture for resource aggregation on the Web". IEEE expert. pp. 11–14.
  20. Manoj, M; Jacob, Elizabeth (July 2013). "Design and Development of a Programmable Meta Search Engine" (PDF). Foundation of Computer Science. pp. 6–11.
  21. Manoj, M.; Jacob, Elizabeth (October 2008). "Information retrieval on Internet using meta-search engines: A review" (PDF). Council of Scientific and Industrial Research.
  22. Wu, Shengli; Crestani, Fabio; Bi, Yaxin (2006). "Evaluating Score Normalization Methods in Data Fusion". Information Retrieval Technology. Lecture Notes in Computer Science. Vol. 4182. pp. 642–648. CiteSeerX 10.1.1.103.295. doi:10.1007/11880592_57. ISBN 978-3-540-45780-0.
  23. Manmatha, R.; Sever, H. (2014). "A Formal Approach to Score Normalization for Meta-search" (PDF). Archived from the original (PDF) on 2019-09-30. Retrieved 2014-10-27.
  24. Najork, Marc (2014). "Web Spam Detection". Microsoft.
  25. Vandendriessche, Gerrit (February 2009). "A few legal comments on spamdexing".
  26. Wang, Yi-Min; Ma, Ming; Niu, Yuan; Chen, Hao (May 8, 2007). "Connecting Web Spammers with Advertisers" (PDF).