80legs

Type of site: Web crawling service
Available in: English
Owner: Datafiniti, LLC
Created by: Shion Deysarkar
URL: www.80legs.com
Launched: September 2009

80legs is a web crawling service that allows its users to create and run web crawls through its software-as-a-service platform.

History

80legs was created by Computational Crawling, a company in Houston, Texas. The company launched the private beta of 80legs in April 2009 and publicly launched the service at the DEMOfall 09 conference. At the time of its public launch, 80legs offered customized web crawling and scraping services. It has since added subscription plans and other product offerings.[1][2]

Technology

80legs is built on top of a distributed grid computing network.[3] This grid consists of approximately 50,000 individual computers, distributed across the world, and uses bandwidth monitoring technology to prevent bandwidth cap overages.[4]
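
For illustration only, the Python sketch below shows one way a crawl node could respect a per-node bandwidth budget. It is a hypothetical sketch of the general idea, not 80legs's actual implementation; the byte budget and URLs are placeholders.

    # Hypothetical sketch of bandwidth-capped fetching (not 80legs's code).
    import urllib.request

    BYTE_BUDGET = 5 * 1024 * 1024  # assumed per-node cap: 5 MiB

    def fetch_within_budget(urls, budget=BYTE_BUDGET):
        """Download URLs until the byte budget is exhausted."""
        used = 0
        pages = {}
        for url in urls:
            if used >= budget:
                break  # stop before exceeding the node's bandwidth cap
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read(budget - used)  # never read past the remaining budget
                used += len(body)
                pages[url] = body
        return pages, used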

80legs has been criticised by numerous site owners for its technology effectively acting as a distributed denial-of-service (DDoS) attack and for not obeying robots.txt.[5][6][7][8] Because the average webmaster is unaware that 80legs exists, its crawler is typically blocked only after the damage is done: the server has already been overwhelmed, and the responsible crawler is identified only after a time-consuming analysis of the log files.
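
As an illustration of that kind of after-the-fact log analysis, the Python sketch below tallies requests per user agent in a combined-format access log. The log path is a placeholder and the parsing is deliberately simplified.

    # Simplified sketch: count requests per user agent in a combined-format log.
    import collections
    import re

    # The combined log format ends with "referer" "user-agent".
    UA_PATTERN = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

    def top_user_agents(log_path, limit=10):
        counts = collections.Counter()
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                match = UA_PATTERN.search(line)
                if match:
                    counts[match.group("ua")] += 1
        return counts.most_common(limit)

    # Example (placeholder path); 80legs shows up under the "008" user agent.
    # print(top_user_agents("/var/log/nginx/access.log"))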

Some rule sets for ModSecurity (such as the one from Atomicorp[9]) block all access to the web server from 80legs in order to prevent a DDoS, and WebKnight also blocks 80legs by default. Because the crawler is distributed, blocking it by IP address is impractical; the most effective way found to block 80legs is by its user agent, "008".[10] Wrecksite blocks 80legs by default.
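
The cited ModSecurity and WebKnight rules are not reproduced here; as a generic illustration of blocking by user agent, the hypothetical Python WSGI middleware below returns 403 Forbidden for the "008" agent. It is a sketch of the technique, not the rule sets mentioned above.

    # Hypothetical WSGI middleware that rejects requests from the "008" user agent.
    BLOCKED_AGENTS = {"008"}  # 80legs identifies itself with this string

    class BlockUserAgent:
        def __init__(self, app, blocked=BLOCKED_AGENTS):
            self.app = app
            self.blocked = blocked

        def __call__(self, environ, start_response):
            agent = environ.get("HTTP_USER_AGENT", "")
            if any(bad in agent for bad in self.blocked):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Crawler blocked\n"]
            return self.app(environ, start_response)

    # Usage: wrap any WSGI application, e.g. app = BlockUserAgent(app).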

Related Research Articles

<span class="mw-page-title-main">Web crawler</span> Software which systematically browses the World Wide Web

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.
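
As a minimal sketch of the idea, the standard-library Python crawler below fetches pages breadth-first and follows links within a single host; the start URL and page limit are placeholders.

    # Minimal breadth-first crawler sketch (standard library only).
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=20):
        host = urlparse(start_url).netloc
        seen, queue = {start_url}, deque([start_url])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip unreachable pages
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return seen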

<span class="mw-page-title-main">Denial-of-service attack</span> Cyber attack disrupting service by overloading the provider of the service

In computing, a denial-of-service attack is a cyber-attack in which the perpetrator seeks to make a machine or network resource unavailable to its intended users by temporarily or indefinitely disrupting services of a host connected to a network. Denial of service is typically accomplished by flooding the targeted machine or resource with superfluous requests in an attempt to overload systems and prevent some or all legitimate requests from being fulfilled.

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
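
Python's standard library includes a parser for this protocol; the sketch below checks whether a given user agent may fetch a URL (the domain is a placeholder).

    # Checking robots.txt with the standard-library parser.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder site
    rp.read()

    # A well-behaved crawler consults this before fetching; "008" is 80legs's user agent.
    print(rp.can_fetch("008", "https://example.com/some/page"))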

Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic. Unpaid traffic may originate from different kinds of searches, including image search, video search, academic search, news search, and industry-specific vertical search engines.

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. Such systems may allow for users to voluntarily offer their own computing and bandwidth resources towards crawling web pages. By spreading the load of these tasks across many computers, costs that would otherwise be spent on maintaining large computing clusters are avoided.
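
One common partitioning scheme (not specific to any one crawler) assigns each host to a fixed worker by hashing the hostname, so per-site politeness limits stay with a single machine; a minimal sketch:

    # Sketch: route each URL to a worker by hashing its hostname.
    import hashlib
    from urllib.parse import urlparse

    def worker_for(url, num_workers):
        host = urlparse(url).netloc
        digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_workers

    urls = [
        "https://example.com/a",
        "https://example.org/b",
        "https://example.net/c",
    ]
    for url in urls:
        print(url, "-> worker", worker_for(url, num_workers=4))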

<span class="mw-page-title-main">Googlebot</span> Web crawler used by Google

Googlebot is the web crawler software used by Google that collects documents from the web to build a searchable index for the Google Search engine. This name is actually used to refer to two different types of web crawlers: a desktop crawler and a mobile crawler.

<span class="mw-page-title-main">Zombie (computing)</span> Compromised computer used for malicious tasks on a network

In computing, a zombie is a computer connected to the Internet that has been compromised by a hacker via a computer virus, computer worm, or trojan horse program and can be used to perform malicious tasks under the remote direction of the hacker. Zombie computers often coordinate together in a botnet controlled by the hacker, and are used for activities such as spreading e-mail spam and launching distributed denial-of-service attacks against web servers. Most victims are unaware that their computers have become zombies. The concept is similar to the zombie of Haitian Voodoo folklore, which refers to a corpse resurrected by a sorcerer via magic and enslaved to the sorcerer's commands, having no free will of its own. A coordinated DDoS attack by multiple botnet machines also resembles a "zombie horde attack", as depicted in fictional zombie films.

<span class="mw-page-title-main">Dogpile</span> Metasearch engine

Dogpile is a metasearch engine for information on the World Wide Web that fetches results from Google, Yahoo!, Yandex, Bing, and other popular search engines, including those from audio and video content providers such as Yahoo!.

<span class="mw-page-title-main">Twisted (software)</span> Event-driven network programming framework

Twisted is an event-driven network programming framework written in Python and licensed under the MIT License.

<span class="mw-page-title-main">Xfinity</span> American cable provider

Comcast Cable Communications, LLC, doing business as Xfinity, is an American telecommunications business segment and division of Comcast Corporation used to market consumer cable television, internet, telephone, and wireless services provided by the company. The brand was first introduced in 2010; prior to that, these services were marketed primarily under the Comcast name.

<span class="mw-page-title-main">YaCy</span>

YaCy is a free distributed search engine, built on the principles of peer-to-peer (P2P) networks, created by Michael Christen in 2003. The engine is written in Java and, as of September 2006, was distributed across several hundred computers, so-called YaCy peers. Each YaCy peer independently crawls the Internet, analyzes and indexes the web pages it finds, and stores the indexing results in a common database shared with other YaCy peers on peer-to-peer principles. Anyone can use it to build a search portal for their intranet or to help search the public Internet.

In web archiving, an archive site is a website that stores information on webpages from the past for anyone to view.

Sitemaps is a protocol in XML format meant for a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.
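
For illustration, the Python sketch below parses a small inline sitemap with the standard library and lists each URL with its last-modified date; the URLs and dates are placeholders.

    # Parsing a Sitemaps-protocol file with the standard library.
    import xml.etree.ElementTree as ET

    SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/</loc>
        <lastmod>2023-01-01</lastmod>
      </url>
    </urlset>"""

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(SITEMAP_XML)
    for url in root.findall("sm:url", NS):
        print(url.findtext("sm:loc", namespaces=NS),
              url.findtext("sm:lastmod", namespaces=NS))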

NoScript is a free and open-source extension for Firefox- and Chromium-based web browsers, written and maintained by Giorgio Maone, an Italian software developer and member of the Mozilla Security Group.

<span class="mw-page-title-main">Search engine</span> Software system that is designed to search for information on the World Wide Web

A search engine is a software system that finds web pages that match a web search. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a line of results, often referred to as search engine results pages (SERPs). The information may be a mix of hyperlinks to web pages, images, videos, infographics, articles, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories and social bookmarking sites, which are maintained by human editors, search engines also maintain real-time information by running an algorithm on a web crawler. Any internet-based content that cannot be indexed and searched by a web search engine falls under the category of deep web.
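
As a toy illustration of the indexing step described above (a sketch, not any particular engine's implementation), the Python snippet below builds an inverted index and answers a two-word query.

    # Toy inverted index: map each word to the documents that contain it.
    from collections import defaultdict

    documents = {
        1: "web crawlers feed search engines",
        2: "search engines rank indexed pages",
    }

    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)

    # A query returns the documents containing every query term.
    query = ["search", "engines"]
    print(sorted(set.intersection(*(index[word] for word in query))))  # -> [1, 2]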

<span class="mw-page-title-main">Zoho Office Suite</span> Online suite of business management software

Zoho Office Suite is an Indian web-based online office suite containing word processing, spreadsheets, presentations, databases, note-taking, wikis, web conferencing, customer relationship management (CRM), project management, invoicing and other applications. It is developed by Zoho Corporation.

A distributed search engine is a search engine where there is no central server. Unlike traditional centralized search engines, work such as crawling, data mining, indexing, and query processing is distributed among several peers in a decentralized manner where there is no single point of control.

Bingbot is a web-crawling robot deployed by Microsoft in October 2010 to supply Bing. It collects documents from the web to build a searchable index for the Bing search engine, performing the same function as Google's Googlebot.

Hola is a freemium web and mobile application which provides a form of VPN service to its users through a peer-to-peer network. It also uses peer-to-peer caching. When a user accesses certain domains that are known to use geo-blocking, the Hola application redirects the request to go through the computers and Internet connections of other users in non-blocked areas, thereby circumventing the blocking. Users of the free service share a portion of their idle upload bandwidth to be used for serving cached data to other users. Paying users can choose to redirect all requests to peers but are themselves never used as peers.

<span class="mw-page-title-main">Web shell</span> Interface enabling remote access to a web server

A web shell is a shell-like interface that enables a web server to be remotely accessed, often for the purposes of cyberattacks. A web shell is unique in that a web browser is used to interact with it.

References

  1. "80legs sets its web crawler free". VentureBeat. https://venturebeat.com/2009/12/21/80legs-web-crawler-free/
  2. "Thoughts From the Man Who Would Sell The World, Nicely". ReadWriteWeb. http://www.readwriteweb.com/archives/bulk_social_data_80legs.php Archived 2010-07-22 at the Wayback Machine.
  3. "80legs is Where SETI@home Meets Google". GigaOM. http://gigaom.com/2009/09/22/80legs-is-where-setihome-meets-google/
  4. "80legs Cares About Your Bandwidth Cap". GigaOM. http://gigaom.com/2009/05/14/80legs-cares-about-your-bandwidth-cap/
  5. "DDOSed by 80legs". http://www.datamadness.com/2012/01/ddosed-by-80legs/ Archived 2012-01-16 at the Wayback Machine.
  6. Hacker News thread. http://news.ycombinator.com/item?id=1056960
  7. WebmasterWorld thread. http://www.webmasterworld.com/search_engine_spiders/4457359.htm
  8. Complaint from OpenStreetMap. https://twitter.com/openstreetmap/status/221188821721681920
  9. "Atomicorp". Retrieved 2013-02-05. (Permanent dead link.)
  10. "80legs - Most Powerful Web Crawler Ever". Archived from the original on 2013-10-31. Retrieved 2013-11-06.