Vertical search

A vertical search engine, also called a specialty or topical search engine, is distinct from a general web search engine in that it focuses on a specific segment of online content. The vertical content area may be based on topicality, media type, or genre of content. Common verticals include shopping, the automotive industry, legal information, medical information, scholarly literature, job search, and travel. Examples of vertical search engines include the Library of Congress, Mocavo, Nuroa, Trulia, and Yelp.

In contrast to general web search engines, which attempt to index large portions of the World Wide Web using a web crawler, vertical search engines typically use a focused crawler that attempts to index only web pages relevant to a pre-defined topic or set of topics. Some vertical search sites focus on a single vertical, while others combine multiple vertical searches within one search engine.

Benefits

Vertical search offers several potential benefits over general search engines: greater precision due to its limited scope, the ability to leverage domain knowledge such as taxonomies and ontologies, and support for specific user tasks.

Vertical search can be viewed as similar to enterprise search where the domain of focus is the enterprise, such as a company, government or other organization. In 2013, consumer price comparison websites with integrated vertical search engines such as FindTheBest drew large rounds of venture capital funding, indicating a growth trend for these applications of vertical search technology. [1] [2]

Domain-specific verticals focus on a specific topic. John Battelle describes this in his book The Search (2005):

Domain-specific search solutions focus on one area of knowledge, creating customized search experiences, that because of the domain's limited corpus and clear relationships between concepts, provide extremely relevant results for searchers. [3]

A general search engine typically collects documents by indexing pages in a breadth-first manner. The spidering in a domain-specific search engine instead concentrates on a small, focused subset of documents, making the crawl more efficient. Spidering guided by a reinforcement-learning framework has been found to be three times more efficient than breadth-first search. [4]
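The contrast above can be sketched with a toy link graph. The graph, topic labels, and expansion rule below are invented for illustration; the cited work uses reinforcement learning to estimate the value of following each link, which a simple "expand only on-topic pages" rule only approximates.

```python
from collections import deque

# Toy link graph: page -> (topic, outlinks). Invented for illustration.
GRAPH = {
    "home":    ("misc", ["cars", "news", "sports"]),
    "cars":    ("auto", ["reviews", "news"]),
    "reviews": ("auto", ["dealers"]),
    "dealers": ("auto", []),
    "news":    ("misc", ["sports", "weather"]),
    "weather": ("misc", []),
    "sports":  ("misc", []),
}

def breadth_first_crawl(seed):
    """General-purpose crawl: fetch every reachable page in FIFO order."""
    frontier, seen, fetched = deque([seed]), {seed}, []
    while frontier:
        page = frontier.popleft()
        fetched.append(page)
        for link in GRAPH[page][1]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

def focused_crawl(seed, topic):
    """Focused crawl: only expand links out of on-topic pages (plus the seed)."""
    frontier, seen, fetched = deque([seed]), {seed}, []
    while frontier:
        page = frontier.popleft()
        fetched.append(page)
        if page != seed and GRAPH[page][0] != topic:
            continue  # off-topic page: fetch once, but do not follow its links
        for link in GRAPH[page][1]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

print(len(breadth_first_crawl("home")))          # 7: every reachable page
print(len(focused_crawl("home", topic="auto")))  # 6: "weather" is never reached
```

Even on this tiny graph the focused crawl skips pages reachable only through off-topic regions; on the real web, where off-topic regions dominate, the saving is far larger.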

DARPA's Memex program

In early 2014, the Defense Advanced Research Projects Agency (DARPA) released a statement on its website outlining the preliminary details of the "Memex program", which aims to develop new search technologies that overcome some limitations of text-based search. [5] DARPA wants the Memex technology developed in this research to be usable in search engines that can find information on the deep web – the part of the Internet that is largely unreachable by commercial search engines such as Google or Yahoo. DARPA's website states that "The goal is to invent better methods for interacting with and sharing information, so users can quickly and thoroughly organize and search subsets of information relevant to their individual interests". [6] As reported in a 2015 Wired article, the search technology being developed in the Memex program "aims to shine a light on the dark web and uncover patterns and relationships in online data to help law enforcement and others track illegal activity". [7] DARPA intends for the program to replace the centralized procedures used by commercial search engines, stating that the "creation of a new domain-specific indexing and search paradigm will provide mechanisms for improved content discovery, information extraction, information retrieval, user collaboration, and extension of current search capabilities to the deep web, the dark web, and nontraditional (e.g. multimedia) content". [8] In its description of the program, DARPA explains that the program's name is a tribute to Vannevar Bush's original memex invention, which served as an inspiration. [5]

In April 2015, it was announced that parts of Memex would be open-sourced, [9] and modules were made available for download. [8]

Related Research Articles

Memex: Hypothetical proto-hypertext system that was first described by Vannevar Bush in 1945

Memex is a hypothetical electromechanical device for interacting with microform documents, described in Vannevar Bush's 1945 article "As We May Think". Bush envisioned the memex as a device in which individuals would compress and store all of their books, records, and communications, "mechanized so that it may be consulted with exceeding speed and flexibility". The individual was supposed to use the memex as an automatic personal filing system, making the memex "an enlarged intimate supplement to his memory". The name memex is a portmanteau of memory and expansion.

A search engine is an information retrieval system designed to help find information stored on a computer system. It is an information retrieval software program that discovers, crawls, transforms, and stores information for retrieval and presentation in response to user queries. The search results are usually presented in a list and are commonly called hits. A search engine normally consists of four components, as follows: a search interface, a crawler, an indexer, and a database. The crawler traverses a document collection, deconstructs document text, and assigns surrogates for storage in the search engine index. Online search engines store images, link data and metadata for the document as well.
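The indexer and search-interface components described above can be sketched with an inverted index, the core data structure behind the "surrogates for storage" mentioned in the paragraph. This is a minimal illustration only: the documents are invented, and real engines add ranking, stemming, and persistent storage.

```python
from collections import defaultdict

def build_index(docs):
    """Indexer: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Search interface: ids of documents containing every query term (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# A tiny document collection standing in for crawled pages.
docs = {
    1: "used cars for sale",
    2: "new cars reviewed",
    3: "apartment rentals",
}
index = build_index(docs)
print(search(index, "cars"))       # {1, 2}
print(search(index, "used cars"))  # {1}
```

The "hits" returned by a real engine are this result set after ranking and presentation.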

Web crawler: Software which systematically browses the World Wide Web

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.

Spamdexing is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building and repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system.

A web portal is a specially designed website that brings information from diverse sources, like emails, online forums and search engines, together in a uniform way. Usually, each information source gets its dedicated area on the page for displaying information; often, the user can configure which ones to display. Variants of portals include mashups and intranet dashboards for executives and managers. The extent to which content is displayed in a "uniform way" may depend on the intended user and the intended purpose, as well as the diversity of the content. Very often design emphasis is on a certain "metaphor" for configuring and customizing the presentation of the content and the chosen implementation framework or code libraries. In addition, the role of the user in an organization may determine which content can be added to the portal or deleted from the portal configuration.

Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic. Unpaid traffic may originate from different kinds of searches, including image search, video search, academic search, news search, and industry-specific vertical search engines.

Internet research is the practice of using Internet information, especially free information on the World Wide Web, or Internet-based resources in research.

Deep web: Content of the World Wide Web that is not indexed by search engines

The deep web, invisible web, or hidden web is the part of the World Wide Web whose contents are not indexed by standard web search-engine programs. This is in contrast to the "surface web", which is accessible to anyone using the Internet. Computer scientist Michael K. Bergman is credited with coining the term in 2001 as a search-indexing term.

Dogpile: Metasearch engine

Dogpile is a metasearch engine for information on the World Wide Web that fetches results from Google, Yahoo!, Yandex, Bing, and other popular search engines, including those from audio and video content providers such as Yahoo!.

Metasearch engine: Online information retrieval tool

A metasearch engine is an online information retrieval tool that uses the data of web search engines to produce its own results. Metasearch engines take input from a user and immediately query search engines for results; the gathered data is then ranked and presented to the user.

Findability is the ease with which information contained on a website can be found, both from outside the website and by users already on the website. Although findability has relevance outside the World Wide Web, the term is usually used in that context. Most relevant websites do not come up in the top results because designers and engineers do not cater to the way ranking algorithms work currently. Its importance can be determined from the first law of e-commerce, which states "If the user can’t find the product, the user can’t buy the product." As of December 2014, out of 10.3 billion monthly Google searches by Internet users in the United States, an estimated 78% are made to research products and services online.

Yahoo! Search is a Yahoo! web search provider that uses Microsoft's Bing search engine to power results.

Federated search retrieves information from a variety of sources via a search application built on top of one or more search engines. A user makes a single query request which is distributed to the search engines, databases or other query engines participating in the federation. The federated search then aggregates the results that are received from the search engines for presentation to the user. Federated search can be used to integrate disparate information resources within a single large organization ("enterprise") or for the entire web.
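The fan-out-and-merge pattern described above can be sketched as follows. The member "engines" here are mock functions returning invented (URL, score) pairs; a real federation would issue network queries and normalize scores across heterogeneous sources.

```python
# Mock member engines standing in for real search backends.
def engine_a(query):
    return [("http://a.example/1", 0.9), ("http://shared.example", 0.7)]

def engine_b(query):
    return [("http://shared.example", 0.8), ("http://b.example/2", 0.6)]

def federated_search(query, engines):
    """Send the query to every member engine, merge results by URL
    (keeping each URL's best score), and rank the merged list."""
    merged = {}
    for engine in engines:
        for url, score in engine(query):
            merged[url] = max(score, merged.get(url, 0.0))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

results = federated_search("vertical search", [engine_a, engine_b])
for url, score in results:
    print(url, score)
```

De-duplication across sources (here, keeping the best score for `http://shared.example`) is one of the harder problems in practice, since the same document may appear under different URLs and scoring scales.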

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
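A minimal scraping step, extracting structured records from HTML, can be sketched with only the standard library. In practice the page would be fetched over HTTP (e.g. with `urllib.request`); here the HTML is inlined so the example is self-contained, and the markup is invented for illustration.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._in_anchor = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False

html = '<p>See <a href="/jobs">job listings</a> and <a href="/cars">cars</a>.</p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # [('/jobs', 'job listings'), ('/cars', 'cars')]
```

The extracted pairs are exactly the kind of records that would be copied "into a central local database or spreadsheet" for later analysis.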

Search engine: Software system that is designed to search for information on the World Wide Web

A search engine is a software system that finds web pages matching a web search query. It searches the World Wide Web in a systematic way for particular information specified in a textual query. The search results are generally presented in a line of results, often referred to as search engine results pages (SERPs). The information may be a mix of hyperlinks to web pages, images, videos, infographics, articles, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories and social bookmarking sites, which are maintained by human editors, search engines also maintain real-time information by running an algorithm on a web crawler. Any internet-based content that cannot be indexed and searched by a web search engine falls under the category of deep web.

A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy while minimizing resources spent fetching pages on other topics. Crawl frontier management may not be the only device used by focused crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact.
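The crawl-frontier prioritization described above can be sketched as a priority queue of URLs ordered by an estimated relevance. The URLs and scores below are invented; the scoring function is a stand-in for the classifiers or link-value estimators real topical crawlers use.

```python
import heapq

class CrawlFrontier:
    """Priority-ordered crawl frontier: most promising URL comes out first."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, relevance):
        """Queue a URL with an estimated relevance in [0, 1]; duplicates are ignored."""
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score
            heapq.heappush(self._heap, (-relevance, url))

    def next_url(self):
        """Pop the most promising URL, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[1]

frontier = CrawlFrontier()
frontier.add("http://example.org/solar-power", 0.95)
frontier.add("http://example.org/weather", 0.20)
frontier.add("http://example.org/solar-panels", 0.90)
frontier.add("http://example.org/solar-power", 0.95)  # duplicate, ignored

print(frontier.next_url())  # http://example.org/solar-power
print(frontier.next_url())  # http://example.org/solar-panels
```

Fetching high-scoring links first is what lets a topical crawler collect pages about, say, solar power while minimizing resources spent on other topics.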

Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience.

GenieKnows Inc. was a privately owned vertical search engine company based in Halifax, Nova Scotia. It was started by Rami Hamodah who also started SwiftlyLabs.com and Salesboom.com. Like many internet search engines, its revenue model centers on an online advertising platform and B2B transactions. It focuses on a set of search markets, or verticals, including health search, video games search, and local business directory search.

DeepPeep was a search engine that aimed to crawl and index every database on the public Web. Unlike traditional search engines, which crawl existing webpages and their hyperlinks, DeepPeep aimed to allow access to the so-called Deep web, World Wide Web content only available via, for instance, typed queries into databases. The project started at the University of Utah and was overseen by Juliana Freire, an associate professor at the university's School of Computing WebDB group. The goal was to make 90% of all WWW content accessible, according to Freire. The project ran a beta search engine and was sponsored by the University of Utah and a $243,000 grant from the National Science Foundation. It generated worldwide interest.

Chris Mattmann: American data scientist

Chris Mattmann is an American data scientist currently working as the Principal Data Scientist and Chief Technology and Innovation Officer in the Office of the Chief Information Officer (OCIO) at the NASA Jet Propulsion Laboratory (JPL) in Pasadena, California. He is also the manager of JPL's Open Source Applications office. Mattmann was formerly Chief Architect in the Instrument and Data Systems section at the laboratory.

References

  1. Rao, Leena (5 March 2013). "Data-Driven Comparison Shopping Platform FindTheBest Raises $11M From New World, Kleiner Perkins And Others". TechCrunch. Archived from the original on 1 June 2013. Retrieved 27 May 2013.
  2. Ho, Victoria (11 May 2013). "Asian Price Comparison Site Save 22 Gets Angel Round Of "Mid Six Figures"". Archived from the original on 7 June 2013. Retrieved 27 May 2013.
  3. Battelle, John (2005). The Search: How Google and its Rivals Rewrote the Rules of Business and Transformed Our Culture. New York: Portfolio.
  4. McCallum, Andrew (1999). "A Machine Learning Approach to Building Domain-Specific Search Engines". IJCAI. 99: 662–667. CiteSeerX 10.1.1.88.3818.
  5. "Memex Aims to Create a New Paradigm for Domain-Specific Search" (Press release). DARPA. February 9, 2014. Archived from the original on February 11, 2015. Retrieved February 11, 2015.
  6. "Memex (Domain-Specific Search)". www.darpa.mil. Archived from the original on 2016-09-16. Retrieved 2016-09-21.
  7. Zetter, Kim (February 2, 2015). "Darpa Is Developing a Search Engine for the Dark Web". Wired. Archived from the original on June 29, 2023. Retrieved November 19, 2020.
  8. "Memex (Domain-Specific Search)". DARPA. Archived from the original on June 10, 2015. Retrieved April 20, 2015.
  9. Forbes (April 17, 2015). "Watch Out Google, DARPA Just Open Sourced All This Swish 'Dark Web' Search Tech". Forbes. Archived from the original on April 20, 2015. Retrieved April 20, 2015.