CLEVER project


The CLEVER project was a research project in Web search led by Jon Kleinberg at IBM's Almaden Research Center. Techniques developed in CLEVER included various forms of link analysis, including the HITS algorithm.

Jon Kleinberg

Jon Michael Kleinberg is an American computer scientist and the Tisch University Professor of Computer Science at Cornell University, known for his work in algorithms and networks. He is a recipient of the Nevanlinna Prize, awarded by the International Mathematical Union.

IBM

International Business Machines Corporation (IBM) is an American multinational information technology company headquartered in Armonk, New York, with operations in over 170 countries. The company was founded in 1911 in Endicott, New York, as the Computing-Tabulating-Recording Company (CTR) and was renamed "International Business Machines" in 1924.

In network theory, link analysis is a data-analysis technique used to evaluate relationships (connections) between nodes. Relationships may be identified among various types of nodes (objects), including organizations, people and transactions. Link analysis has been used for investigation of criminal activity, computer security analysis, search engine optimization, market research, medical research, and art.


Features

The CLEVER search engine incorporates several algorithms that exploit the Web's hyperlink structure to discover high-quality information. It can be exceedingly difficult to locate resources on the World Wide Web that are both high-quality and relevant to a user's information needs. Traditional automated search methods for locating information on the Web are easily overwhelmed by low-quality and unrelated content. Second-generation search engines need effective methods for focusing on the most authoritative documents. The rich structure implicit in the hyperlinks among Web documents offers a simple and effective means of dealing with many of these problems.

Algorithm

In mathematics and computer science, an algorithm is an unambiguous specification of how to solve a class of problems. Algorithms can perform calculation, data processing, automated reasoning, and other tasks.

Hyperlink

In computing, a hyperlink, or simply a link, is a reference to data that the reader can follow by clicking or tapping. A hyperlink points to a whole document or to a specific element within a document. Hypertext is text with hyperlinks. The text that is linked from is called anchor text. A software system that is used for viewing and creating hypertext is a hypertext system, and to create a hyperlink is to hyperlink. A user following hyperlinks is said to navigate or browse the hypertext.

Members of the CLEVER project devised a mathematical algorithm that views the Web simply as pages pointing at one another. It also takes into account the notion of hubs, which point to quality content and tie related information together, and the idea of authority pages, which are often written by specialists in particular fields.
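This mutually reinforcing relationship between hubs and authorities is the core of Kleinberg's HITS computation (described further under Related Research Articles). The following is a minimal, illustrative sketch of that iteration; the link graph and page names in it are hypothetical and are not data from the CLEVER project.

```python
# Illustrative sketch of a hub/authority (HITS-style) iteration.
# The tiny link graph below is hypothetical; real systems run this on a
# subgraph of the Web assembled from search results and their neighbors.

def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A page's authority score is the sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # A page's hub score is the sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        auth_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        hub_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical example: two directory-like pages point at the same specialist pages.
links = {
    "directory_a": ["expert_1", "expert_2"],
    "directory_b": ["expert_1", "expert_2", "expert_3"],
    "expert_1": [],
    "expert_2": ["expert_1"],
    "expert_3": [],
}
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))  # specialist pages rank as authorities
```

In the toy graph, the directory-like pages end up with the high hub scores and the specialist pages with the high authority scores, mirroring the distinction described above.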

Bill Cody, senior manager of exploratory data management research at IBM's Almaden Research Center, said: "Web searches provide a lot of information, some good, some bad. But the people providing good hubs usually point to authority pages and authority pages generally know of good hubs. The algorithm enables us to find them and so provide users with quality information rather than the regular list of irrelevant web pages." He added that the algorithm had also been used to find Internet-based communities using the same principle of finding links between like and like. It was used to discover 300,000 communities worldwide, only four per cent of which turned out to be spurious. About two thirds of these still existed, he claimed, with about half now appearing on Yahoo as mature communities.
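Work associated with CLEVER described finding such link-defined communities by looking for small groups of pages that all point to the same set of target pages (dense bipartite cores). The article does not spell out the exact procedure behind the figures Cody cites, so the sketch below is only an illustration of the general idea, and every page name in it is hypothetical.

```python
# Illustrative sketch: detect a "community signature" as a small complete
# bipartite subgraph -- a set of fan pages that all link to the same set of
# center pages. The graph below is made up for illustration.
from itertools import combinations

def bipartite_cores(links, fans=2, centers=2):
    """Return (fan_group, center_group) pairs where every fan links to every center."""
    cores = []
    for fan_group in combinations(sorted(links), fans):
        # Targets shared by every fan in the group.
        shared = set(links[fan_group[0]])
        for fan in fan_group[1:]:
            shared &= set(links[fan])
        for center_group in combinations(sorted(shared), centers):
            cores.append((fan_group, center_group))
    return cores

# Hypothetical pages: two fan pages both cite the same two hobbyist sites.
links = {
    "fan_a": ["hobby_site_1", "hobby_site_2"],
    "fan_b": ["hobby_site_1", "hobby_site_2", "news_site"],
    "fan_c": ["hobby_site_2"],
}
for fans_found, centers_found in bipartite_cores(links):
    print(fans_found, "->", centers_found)
```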

Cody said that the tool could be used for targeted advertising or to help users find out more about incipient communities, but declined to say whether IBM had plans to turn CLEVER into a commercial product.

Related Research Articles

Web crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
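As a rough illustration of what such a bot does, the sketch below fetches pages, extracts their links, and follows them breadth-first up to a small limit. It is a simplification: the seed URL is a placeholder, and a real crawler must also respect robots.txt, throttle its requests, and normalize and deduplicate URLs.

```python
# Minimal breadth-first crawler sketch (illustrative only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to download
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# crawl("https://example.org/")  # placeholder seed URL
```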

World Wide Web

The World Wide Web (WWW), commonly known as the Web, is an information system in which documents and other web resources are identified by Uniform Resource Locators, may be interlinked by hypertext, and are accessible over the Internet. Users access the resources of the WWW through a software application called a web browser.

In digital marketing and online advertising, spamdexing is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building and repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed, in a manner inconsistent with the purpose of the indexing system.

Search engine optimization (SEO) is the process of increasing the quality and quantity of website traffic by increasing the visibility of a website or a web page to users of a web search engine. SEO refers to the improvement of unpaid results and excludes the purchase of paid placement.

Web syndication

Web syndication is a form of syndication in which content is made available from one website to other sites. Most commonly, a website makes available either summaries or full renditions of its recently added content. The term may also describe other kinds of content licensing for reuse.

Internet research is the practice of using Internet information, especially free information on the World Wide Web, or Internet-based resources in research.

A backlink for a given web resource is a link from some other website to that web resource. A web resource may be a website, web page, or web directory.

Findability is the ease with which information contained on a website can be found, both from outside the website and by users already on the website. Although findability has relevance outside the World Wide Web, the term is usually used in that context. Most relevant websites do not come up in the top results because designers and engineers do not cater to the way ranking algorithms currently work. Its importance can be gauged from the first law of e-commerce, which states, "If the user can’t find the product, the user can’t buy the product." As of December 2014, of the 10.3 billion monthly Google searches made by Internet users in the United States, an estimated 78% were made to research products and services online.

The anchor text, link label, link text, or link title is the visible, clickable text in a hyperlink. The words contained in the anchor text can influence the ranking that the page receives from search engines. Since 1998, some web browsers have added the ability to show a tooltip for a hyperlink before it is selected. Not all links have anchor text, because it may be obvious where the link leads from the context in which it is used. Anchor text normally remains below 50 characters. Different browsers display anchor text differently. Web search engines usually analyze the anchor text of hyperlinks on web pages. Other services apply the basic principles of anchor-text analysis as well; for instance, academic search engines may use citation context to classify academic articles, and anchor text from documents linked in mind maps has also been used.
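To make the idea concrete, the sketch below collects (anchor text, target URL) pairs from a fragment of HTML, which is the raw material an anchor-text analysis would work on; the markup and URL in it are made up for illustration.

```python
# Illustrative sketch: collect (anchor text, href) pairs from HTML.
from html.parser import HTMLParser

class AnchorTextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []      # (anchor text, href) pairs found so far
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append(("".join(self._text).strip(), self._href))
            self._href = None

collector = AnchorTextCollector()
collector.feed('<p>See the <a href="https://example.org/guide">beginner guide</a>.</p>')
print(collector.pairs)  # [('beginner guide', 'https://example.org/guide')]
```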

Hyperlink-Induced Topic Search is a link analysis algorithm, developed by Jon Kleinberg, that rates Web pages. The idea behind Hubs and Authorities stemmed from a particular insight into how web pages were created when the Internet was originally forming: certain web pages, known as hubs, served as large directories that were not actually authoritative in the information they held, but were compilations of a broad catalog of information that led users directly to other, authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked to by many different hubs.
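Stated a little more formally, HITS repeatedly applies the following standard update rules, with the scores renormalized after each pass, where q → p means that page q links to page p:

```latex
\mathrm{auth}(p) = \sum_{q \to p} \mathrm{hub}(q),
\qquad
\mathrm{hub}(p) = \sum_{p \to q} \mathrm{auth}(q)
```

Pages pointed to by many good hubs accumulate authority weight, and pages pointing to many good authorities accumulate hub weight, which is exactly the mutual reinforcement described above.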

A video search engine is a web-based search engine which crawls the web for video content. Some video search engines parse externally hosted content while others allow content to be uploaded and hosted on their own servers. Some engines also allow users to search by video format type and by length of the clip. The video search results are usually accompanied by a thumbnail view of the video.

Web content

Web content is the textual, visual, or aural content that is encountered as part of the user experience on websites. It may include—among other things—text, images, sounds, videos, and animations.

Web content development is the process of researching, writing, gathering, organizing, and editing information for publication on websites. Website content may consist of prose, graphics, pictures, recordings, movies, or other digital assets that could be distributed by a hypertext transfer protocol server, and viewed by a web browser.

A squeeze page is a landing page created to solicit opt-in email addresses from prospective subscribers.

Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience.

In the field of search engine optimization (SEO), link building describes actions aimed at increasing the number and quality of inbound links to a webpage with the goal of increasing the search engine rankings of that page or website. Briefly, link building is the process of establishing relevant hyperlinks to a website from external sites. Link building can increase the number of high-quality links pointing to a website, in turn increasing the likelihood of the website ranking highly in search engine results. Link building is also a proven marketing tactic for increasing brand awareness.

DeepPeep was a search engine that aimed to crawl and index every database on the public Web. Unlike traditional search engines, which crawl existing webpages and their hyperlinks, DeepPeep aimed to provide access to the so-called Deep Web: World Wide Web content that is available only through, for example, typed queries against databases. The project started at the University of Utah and was overseen by Juliana Freire, an associate professor in the university's School of Computing WebDB group. The goal was to make 90% of all WWW content accessible, according to Freire. The project ran a beta search engine and was sponsored by the University of Utah and a $243,000 grant from the National Science Foundation. It generated worldwide interest.

Ubiquitous Knowledge Processing Lab is a research lab in the Department of Computer Science at the Technische Universität Darmstadt. It was founded in 2006 by Prof. Dr. Iryna Gurevych.

The following outline is provided as an overview of and topical guide to search engines.

Google Hummingbird

Hummingbird is the codename given to a significant algorithm change in Google Search in 2013. Its name was derived from the speed and accuracy of the hummingbird. The change was announced on September 26, 2013, having already been in use for a month. "Hummingbird" places greater emphasis on natural language queries, considering context and meaning over individual keywords. It also looks deeper at content on individual pages of a website, with improved ability to lead users directly to the most appropriate page rather than just a website's homepage.

References