Focused crawler

Last updated

A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. [1] Some predicates may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy [2] while minimizing resources spent fetching pages on other topics. Crawl frontier management may not be the only device used by focused crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact.

A focused crawler must predict the probability that an unvisited page will be relevant before actually downloading the page. [3] A possible predictor is the anchor text of links; this was the approach taken by Pinkerton [4] in a crawler developed in the early days of the Web. Topical crawling was first introduced by Filippo Menczer. [5] [6] Chakrabarti et al. coined the term 'focused crawler' and used a text classifier [7] to prioritize the crawl frontier. Andrew McCallum and co-authors also used reinforcement learning [8] [9] to focus crawlers. Diligenti et al. traced the context graph [10] leading up to relevant pages, and their text content, to train classifiers. A form of online reinforcement learning has been used, along with features extracted from the DOM tree and text of linking pages, to continually train [11] classifiers that guide the crawl. In a review of topical crawling algorithms, Menczer et al. [12] show that such simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning and evolutionary adaptation can give the best performance over longer crawls. It has been shown that spatial information is important to classify Web documents. [13]

Another type of focused crawlers is semantic focused crawler, which makes use of domain ontologies to represent topical maps and link Web pages with relevant ontological concepts for the selection and categorization purposes. [14] In addition, ontologies can be automatically updated in the crawling process. Dong et al. [15] introduced such an ontology-learning-based crawler using support vector machine to update the content of ontological concepts when crawling Web Pages.

Crawlers are also focused on page properties other than topics. Cho et al. [16] study a variety of crawl prioritization policies and their effects on the link popularity of fetched pages. Najork and Weiner [17] show that breadth-first crawling, starting from popular seed pages, leads to collecting large-PageRank pages early in the crawl. Refinements involving detection of stale (poorly maintained) pages have been reported by Eiron et al. [18] A kind of semantic focused crawler, making use of the idea of reinforcement learning has been introduced by Meusel et al. [19] using online-based classification algorithms in combination with a bandit-based selection strategy to efficiently crawl pages with markup languages like RDFa, Microformats, and Microdata.

The performance of a focused crawler depends on the richness of links in the specific topic being searched, and focused crawling usually relies on a general web search engine for providing starting points. Davison [20] presented studies on Web links and text that explain why focused crawling succeeds on broad topics; similar studies were presented by Chakrabarti et al. [21] Seed selection can be important for focused crawlers and significantly influence the crawling efficiency. [22] A whitelist strategy is to start the focus crawl from a list of high quality seed URLs and limit the crawling scope to the domains of these URLs. These high quality seeds should be selected based on a list of URL candidates which are accumulated over a sufficiently long period of general web crawling. The whitelist should be updated periodically after it is created.

Related Research Articles

<span class="mw-page-title-main">Web crawler</span> Software which systematically browses the World Wide Web

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.

Natural-language understanding (NLU) or natural-language interpretation (NLI) is a subtopic of natural-language processing in artificial intelligence that deals with machine reading comprehension. Natural-language understanding is considered an AI-hard problem.

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. Such systems may allow for users to voluntarily offer their own computing and bandwidth resources towards crawling web pages. By spreading the load of these tasks across many computers, costs that would otherwise be spent on maintaining large computing clusters are avoided.

Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP) that is concerned with building systems that automatically answer questions that are posed by humans in a natural language.

An annotation is extra information associated with a particular point in a document or other piece of information. It can be a note that includes a comment or explanation. Annotations are sometimes presented in the margin of book pages. For annotations of different digital media, see web annotation and text annotation.

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".

Vasant G. Honavar is an Indian born American computer scientist, and artificial intelligence, machine learning, big data, data science, causal inference, knowledge representation, bioinformatics and health informatics researcher and professor.

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, also more difficult data domains can be analyzed, e.g., news texts where authors typically express their opinion/sentiment less explicitly.

In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

Terminology extraction is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus.

A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts, typically from text or XML documents. The task is very similar to that of information extraction (IE), but IE additionally requires the removal of repeated relations (disambiguation) and generally refers to the extraction of many different relationships.

Co-training is a machine learning algorithm used when there are only small amounts of labeled data and large amounts of unlabeled data. One of its uses is in text mining for search engines. It was introduced by Avrim Blum and Tom Mitchell in 1998.

DeepPeep was a search engine that aimed to crawl and index every database on the public Web. Unlike traditional search engines, which crawl existing webpages and their hyperlinks, DeepPeep aimed to allow access to the so-called Deep web, World Wide Web content only available via for instance typed queries into databases. The project started at the University of Utah and was overseen by Juliana Freire, an associate professor at the university's School of Computing WebDB group. The goal was to make 90% of all WWW content accessible, according to Freire. The project ran a beta search engine and was sponsored by the University of Utah and a $243,000 grant from the National Science Foundation. It generated worldwide interest.

<span class="mw-page-title-main">Entity linking</span>

In natural language processing, entity linking, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalization (NEN) is the task of assigning a unique identity to entities mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris". Entity linking is different from named-entity recognition (NER) in that NER identifies the occurrence of a named entity in text but it does not identify which specific entity it is.

<span class="mw-page-title-main">Infobox</span> Template used to collect and present a subset of information about a subject

An infobox is a digital or physical table used to collect and present a subset of information about its subject, such as a document. It is a structured document containing a set of attribute–value pairs, and in Wikipedia represents a summary of information about the subject of an article. In this way, they are comparable to data tables in some aspects. When presented within the larger document it summarizes, an infobox is often presented in a sidebar format.

<span class="mw-page-title-main">Filippo Menczer</span> American and Italian computer scientist

Filippo Menczer is an American and Italian academic. He is a University Distinguished Professor and the Luddy Professor of Informatics and Computer Science at the Luddy School of Informatics, Computing, and Engineering, Indiana University. Menczer is the Director of the Observatory on Social Media, a research center where data scientists and journalists study the role of media and technology in society and build tools to analyze and counter disinformation and manipulation on social media. Menczer holds courtesy appointments in Cognitive Science and Physics, is a founding member and advisory council member of the IU Network Science Institute, a former director the Center for Complex Networks and Systems Research, a senior research fellow of the Kinsey Institute, a fellow of the Center for Computer-Mediated Communication, and a former fellow of the Institute for Scientific Interchange in Turin, Italy. In 2020 he was named a Fellow of the ACM.

Soumen Chakrabarti is an Indian computer scientist and professor in the Department of Computer Science and Engineering at IIT Bombay. He is known for his work on

In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near duplicate pages. It was created by Moses Charikar. In 2021 Google announced its intent to also use the algorithm in their newly created FLoC system.

Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing can thus be understood as extracting the precise meaning of an utterance. Applications of semantic parsing include machine translation, question answering, ontology induction, automated reasoning, and code generation. The phrase was first used in the 1970s by Yorick Wilks as the basis for machine translation programs working with only semantic representations.

The Computer Science Ontology (CSO) is an automatically generated taxonomy of research topics in the field of Computer Science. It was produced by the Open University in collaboration with Springer Nature by running an information extraction system over a large corpus of scientific articles. Several branches were manually improved by domain experts. The current version includes about 14K research topics and 160K semantic relationships.

References

  1. Soumen Chakrabarti, Focused Web Crawling, in the Encyclopedia of Database Systems.
  2. Controversial topics
  3. Improving the Performance of Focused Web Crawlers , Sotiris Batsakis, Euripides G. M. Petrakis, Evangelos Milios, 2012-04-09
  4. Pinkerton, B. (1994). Finding what people want: Experiences with the WebCrawler. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland.
  5. Menczer, F. (1997). ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery Archived 2012-12-21 at the Wayback Machine . In D. Fisher, ed., Proceedings of the 14th International Conference on Machine Learning (ICML97). Morgan Kaufmann.
  6. Menczer, F. and Belew, R.K. (1998). Adaptive Information Agents in Distributed Textual Environments Archived 2012-12-21 at the Wayback Machine . In K. Sycara and M. Wooldridge (eds.) Proceedings of the 2nd International Conference on Autonomous Agents (Agents '98). ACM Press.
  7. Focused crawling: a new approach to topic-specific Web resource discovery, Soumen Chakrabarti, Martin van den Berg and Byron Dom, WWW 1999.
  8. A machine learning approach to building domain-specific search engines, Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore, IJCAI 1999.
  9. Using Reinforcement Learning to Spider the Web Efficiently, Jason Rennie and Andrew McCallum, ICML 1999.
  10. Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). Focused crawling using context graphs Archived 2008-03-07 at the Wayback Machine . In Proceedings of the 26th International Conference on Very Large Databases (VLDB), pages 527-534, Cairo, Egypt.
  11. Accelerated focused crawling through online relevance feedback, Soumen Chakrabarti, Kunal Punera, and Mallela Subramanyam, WWW 2002.
  12. Menczer, F., Pant, G., and Srinivasan, P. (2004). Topical Web Crawlers: Evaluating Adaptive Algorithms. ACM Trans. on Internet Technology 4(4): 378–419.
  13. Recognition of common areas in a Web page using visual information: a possible application in a page classification, Milos Kovacevic, Michelangelo Diligenti, Marco Gori, Veljko Milutinovic, Data Mining, 2002. ICDM 2003.
  14. Dong, H., Hussain, F.K., Chang, E.: State of the art in semantic focused crawlers. Computational Science and Its Applications – ICCSA 2009. Springer-Verlag, Seoul, Korea (July 2009) pp. 910-924
  15. Dong, H., Hussain, F.K.: SOF: A semi-supervised ontology-learning-based focused crawler. Concurrency and Computation: Practice and Experience. 25(12) (August 2013) pp. 1623-1812
  16. Junghoo Cho, Hector Garcia-Molina, Lawrence Page: Efficient Crawling Through URL Ordering. Computer Networks 30(1-7): 161-172 (1998)
  17. Marc Najork, Janet L. Wiener: Breadth-first crawling yields high-quality pages. WWW 2001: 114-118
  18. Nadav Eiron, Kevin S. McCurley, John A. Tomlin: Ranking the web frontier. WWW 2004: 309-318.
  19. Meusel R., Mika P., Blanco R. (2014). Focused Crawling for Structured Data. ACM International Conference on Information and Knowledge Management, Pages 1039-1048.
  20. Brian D. Davison: Topical locality in the Web. SIGIR 2000: 272-279.
  21. Soumen Chakrabarti, Mukul Joshi, Kunal Punera, David M. Pennock: The structure of broad topics on the Web. WWW 2002: 251-262.
  22. Jian Wu, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Prasenjit Mitra, Shuyi Zheng, C. Lee Giles, The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists, In proceedings of the 3rd Annual ACM Web Science Conference Pages 340-343, Evanston, IL, USA, June 2012.