Distributed web crawling

Last updated

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. Such systems may allow for users to voluntarily offer their own computing and bandwidth resources towards crawling web pages. By spreading the load of these tasks across many computers, costs that would otherwise be spent on maintaining large computing clusters are avoided.

Contents

Types

Cho [1] and Garcia-Molina studied two types of policies:

Dynamic assignment

With this type of policy, a central server assigns new URLs to different crawlers dynamically. This allows the central server to, for instance, dynamically balance the load of each crawler. [2]

With dynamic assignment, typically the systems can also add or remove downloader processes. The central server may become the bottleneck, so most of the workload must be transferred to the distributed crawling processes for large crawls.

There are two configurations of crawling architectures with dynamic assignments that have been described by Shkapenyuk and Suel: [3]

Static assignment

With this type of policy, there is a fixed rule stated from the beginning of the crawl that defines how to assign new URLs to the crawlers.

For static assignment, a hashing function can be used to transform URLs (or, even better, complete website names) into a number that corresponds to the index of the corresponding crawling process. [4] As there are external links that will go from a Web site assigned to one crawling process to a website assigned to a different crawling process, some exchange of URLs must occur.

To reduce the overhead due to the exchange of URLs between crawling processes, the exchange should be done in batch, several URLs at a time, and the most cited URLs in the collection should be known by all crawling processes before the crawl (e.g.: using data from a previous crawl). [1]

Implementations

As of 2003, most modern commercial search engines use this technique. Google and Yahoo use thousands of individual computers to crawl the Web.

Newer projects are attempting to use a less structured, more ad hoc form of collaboration by enlisting volunteers to join the effort using, in many cases, their home or personal computers. LookSmart is the largest search engine to use this technique, which powers its Grub distributed web-crawling project. Wikia (now known as Fandom) acquired Grub from LookSmart in 2007. [5]

This solution uses computers that are connected to the Internet to crawl Internet addresses in the background. Upon downloading crawled web pages, they are compressed and sent back, together with a status flag (e.g. changed, new, down, redirected) to the powerful central servers. The servers, which manage a large database, send out new URLs to clients for testing.

Drawbacks

According to the FAQ about Nutch, an open-source search engine website, the savings in bandwidth by distributed web crawling are not significant, since "A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages...". [6]

See also

Sources

  1. 1 2 Cho, Junghoo; Garcia-Molina, Hector (2002). "Parallel crawlers". Proceedings of the 11th international conference on World Wide Web. ACM. pp. 124–135. doi:10.1145/511446.511464. ISBN   1-58113-449-5 . Retrieved 2015-10-13.
  2. Guerriero, A.; Ragni, F.; Martines, C. (2010). "A dynamic URL assignment method for parallel web crawler". 2010 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications. pp. 119–123. doi:10.1109/CIMSA.2010.5611764. ISBN   978-1-4244-7228-4. S2CID   14817039.
  3. Shkapenyuk, Vladislav; Suel, Torsten (2002). "Design and implementation of a high-performance distributed web crawler". Data Engineering, 2002. Proceedings. 18th International Conference on. IEEE. pp. 357–368. Retrieved 2015-10-13.
  4. Wan, Yuan; Tong, Hengqing (2008). "URL Assignment Algorithm of Crawler in Distributed System Based on Hash". 2008 IEEE International Conference on Networking, Sensing and Control. pp. 1632–1635. doi:10.1109/icnsc.2008.4525482. ISBN   978-1-4244-1685-1. S2CID   39188334.{{cite book}}: |journal= ignored (help)
  5. "Wikia Acquires Distributed Web Crawler Grub". TechCrunch. 2007-07-27. Retrieved 2022-10-08.
  6. "Nutch: faq". nutch.sourceforge.net. Retrieved 2022-10-08.

Related Research Articles

<span class="mw-page-title-main">Peer-to-peer</span> Type of decentralized and distributed network architecture

Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the network, forming a peer-to-peer network of nodes.

In computing, a search engine is an information retrieval software system designed to help find information stored on one or more computer systems. Search engines discover, crawl, transform, and store information for retrieval and presentation in response to user queries. The search results are usually presented in a list and are commonly called hits. The most widely used type of search engine is a web search engine, which searches for information on the World Wide Web.

<span class="mw-page-title-main">Web crawler</span> Software which systematically browses the World Wide Web

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.

<span class="mw-page-title-main">Load balancing (computing)</span> Set of techniques to improve the distribution of workloads across multiple computing resources

In computing, load balancing is the process of distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient. Load balancing can optimize response time and avoid unevenly overloading some compute nodes while other compute nodes are left idle.

robots.txt Internet protocol

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

<span class="mw-page-title-main">Grub (search engine)</span>

Grub was an open source distributed search crawler platform.

BitTorrent, also referred to as simply torrent, is a communication protocol for peer-to-peer file sharing (P2P), which enables users to distribute data and electronic files over the Internet in a decentralized manner. The protocol is developed and maintained by Rainberry, Inc., and was first released in 2001.

<span class="mw-page-title-main">Googlebot</span> Web crawler used by Google

Googlebot is the web crawler software used by Google that collects documents from the web to build a searchable index for the Google Search engine. This name is actually used to refer to two different types of web crawlers: a desktop crawler and a mobile crawler.

<span class="mw-page-title-main">Apache Nutch</span> Open source web crawler

Apache Nutch is a highly extensible and scalable open source web crawler software project.

The deep web, invisible web, or hidden web are parts of the World Wide Web whose contents are not indexed by standard web search-engine programs. This is in contrast to the "surface web", which is accessible to anyone using the Internet. Computer scientist Michael K. Bergman is credited with inventing the term in 2001 as a search-indexing term.

<span class="mw-page-title-main">YaCy</span> Peer-to-peer search engine

YaCy is a free distributed search engine built on the principles of peer-to-peer (P2P) networks, created by Michael Christen in 2003. The engine is written in Java and distributed on several hundred computers, as of September 2006, so-called YaCy-peers.

Sitemaps is a protocol in XML format meant for a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.

A spider trap is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash. Web crawlers are also called web spiders, from which the name is derived. Spider traps may be created to "catch" spambots or other crawlers that waste a website's bandwidth. They may also be created unintentionally by calendars that use dynamic pages with links that continually point to the next day or year.

<span class="mw-page-title-main">Search engine</span> Software system for finding relevant information on the Web

A search engine is a software system that provides hyperlinks to web pages and other relevant information on the Web in response to a user's query. The user inputs a query within a web browser or a mobile app, and the search results are often a list of hyperlinks, accompanied by textual summaries and images. Users also have the option of limiting the search to a specific type of results, such as images, videos, or news.

Peer-to-peer file sharing (P2P) systems like Gnutella, KaZaA, and eDonkey/eMule, have become extremely popular in recent years, with the estimated user population in the millions. An academic research paper analyzed Gnutella and eMule protocols and found weaknesses in the protocol; many of the issues found in these networks are fundamental and probably common on other P2P networks. Users of file sharing networks, such as eMule and Gnutella, are subject to monitoring of their activity. Clients may be tracked by IP address, DNS name, software version they use, files they share, queries they initiate, and queries they answer to. Clients may also share their private files to the network without notice due to inappropriate settings.

A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy while minimizing resources spent fetching pages on other topics. Crawl frontier management may not be the only device used by focused crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact.

A distributed search engine is a search engine where there is no central server. Unlike traditional centralized search engines, work such as crawling, data mining, indexing, and query processing is distributed among several peers in a decentralized manner where there is no single point of control.

A single-page application (SPA) is a web application or website that interacts with the user by dynamically rewriting the current web page with new data from the web server, instead of the default method of a web browser loading entire new pages. The goal is faster transitions that make the website feel more like a native app.

Torsten Suel is a professor in the Department of Computer Science and Engineering at the New York University Tandon School of Engineering. He received his Ph.D. in 1994 from the University of Texas at Austin under the supervision of Greg Plaxton. He works on the subjects of implementation of bulk synchronous parallel computation, streaming algorithms for histograms, join operations in databases, distributed algorithms for dominating sets, and web crawler algorithms. A conference paper he co-authored in 2011 introduces fast retrieval techniques that were integrated into the Apache Lucene search engine library.

<span class="mw-page-title-main">80legs</span> Web crawling service

80legs is a web crawling service that allows its users to create and run web crawls through its software as a service platform.