StormCrawler

Developer(s): DigitalPebble Ltd.
Initial release: September 11, 2014
Stable release: 2.8 / March 29, 2023
Repository: github.com/DigitalPebble/storm-crawler
Written in: Java
Type: Web crawler
License: Apache License
Website: stormcrawler.net

StormCrawler is an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License and is written mostly in Java.

StormCrawler is modular and consists of a core module, which provides the basic building blocks of a web crawler such as fetching, parsing, and URL filtering. Apart from the core components, the project also provides external resources, such as spouts and bolts for Elasticsearch and Apache Solr, and a ParserBolt that uses Apache Tika to parse various document formats.
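
These components are assembled into an Apache Storm topology. The following is a minimal sketch of such a topology in Java, loosely modelled on the example generated by the project's Maven archetype; the class and package names shown (MemorySpout, URLPartitionerBolt, FetcherBolt, JSoupParserBolt, StdOutIndexer, StdOutStatusUpdater) are assumed to match the 2.x core module, and the exact wiring may differ between releases.

    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    import com.digitalpebble.stormcrawler.ConfigurableTopology;
    import com.digitalpebble.stormcrawler.Constants;
    import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
    import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
    import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
    import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
    import com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater;
    import com.digitalpebble.stormcrawler.spout.MemorySpout;

    // Minimal crawl topology: spout -> partitioner -> fetcher -> parser -> indexer,
    // with URL status updates collected on a dedicated stream.
    public class CrawlTopology extends ConfigurableTopology {

        public static void main(String[] args) throws Exception {
            ConfigurableTopology.start(new CrawlTopology(), args);
        }

        @Override
        protected int run(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // Seed URLs kept in memory; a production crawl would typically
            // use an Elasticsearch or Solr spout instead.
            builder.setSpout("spout", new MemorySpout("http://stormcrawler.net/"));

            // Assign a partition key (e.g. the host) to each URL.
            builder.setBolt("partitioner", new URLPartitionerBolt())
                    .shuffleGrouping("spout");

            builder.setBolt("fetch", new FetcherBolt())
                    .fieldsGrouping("partitioner", new Fields("key"));

            builder.setBolt("parse", new JSoupParserBolt())
                    .localOrShuffleGrouping("fetch");

            // Writes extracted text to stdout; real deployments would index
            // into a search backend such as Elasticsearch or Solr.
            builder.setBolt("index", new StdOutIndexer())
                    .localOrShuffleGrouping("parse");

            // Collect the status of every URL (fetched, redirected, error, ...).
            builder.setBolt("status", new StdOutStatusUpdater())
                    .localOrShuffleGrouping("fetch", Constants.StatusStreamName)
                    .localOrShuffleGrouping("parse", Constants.StatusStreamName)
                    .localOrShuffleGrouping("index", Constants.StatusStreamName);

            return submit("crawl", conf, builder);
        }
    }

Grouping the fetcher on the partitioner's key sends URLs from the same host to the same fetcher instance, which is what allows politeness settings to be enforced per site.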

The project is used by various organisations, [1] notably Common Crawl [2] for generating a large and publicly available dataset of news.

Linux.com published a Q&A with the author of StormCrawler in October 2016. [3] InfoQ ran one in December 2016. [4] A comparative benchmark with Apache Nutch was published on DZone in January 2017. [5]

Several research papers have mentioned the use of StormCrawler, in particular:

  - "Crawling the German Health Web: Exploratory Study and Graph Analysis" [6]
  - "MirasText: An Automatically Generated Text Corpus for Persian" [7]
  - "SIREN – Security Information Retrieval and Extraction eNgine" [8]

The project Wiki contains a list of videos and slides available online. [9]

Related Research Articles

<span class="mw-page-title-main">Web crawler</span> Software which systematically browses the World Wide Web

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
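
A short, hypothetical robots.txt illustrating the directives defined by the protocol, which well-behaved crawlers honour (the agent name and paths below are invented for the example):

    # Block one crawler entirely, keep everyone else out of /private/
    User-agent: BadBot
    Disallow: /

    User-agent: *
    Disallow: /private/
    Allow: /private/public-report.html

    Sitemap: https://www.example.com/sitemap.xml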

<span class="mw-page-title-main">Googlebot</span> Web crawler used by Google

Googlebot is the web crawler software used by Google that collects documents from the web to build a searchable index for the Google Search engine. This name is actually used to refer to two different types of web crawlers: a desktop crawler and a mobile crawler.

<span class="mw-page-title-main">Apache Nutch</span> Open source web crawler

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a standard foundation for production search applications.

<span class="mw-page-title-main">YaCy</span>

YaCy is a free distributed search engine built on the principles of peer-to-peer (P2P) networks, created by Michael Christen in 2003. The engine is written in Java and, as of September 2006, was distributed across several hundred computers, so-called YaCy-peers. Each YaCy-peer independently crawls the Internet, analyzes and indexes the web pages it finds, and stores the indexing results in a common database that is shared with other YaCy-peers using peer-to-peer principles. It is a search engine that anyone can use to build a search portal for their intranet or to help search the public Internet.

<span class="mw-page-title-main">Heritrix</span> Web crawler designed for web archiving

Heritrix is a web crawler designed for web archiving. It was written by the Internet Archive. It is available under a free software license and written in Java. The main interface is accessible using a web browser, and there is a command-line tool that can optionally be used to initiate crawls.

A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy while minimizing resources spent fetching pages on other topics. Crawl frontier management may not be the only device used by focused crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact.
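
As a toy illustration of the first kind of predicate mentioned above (restricting a crawl to the .jp domain), the following self-contained Java sketch accepts or rejects URLs by host suffix; the class is hypothetical and not part of any particular crawler:

    import java.net.URI;
    import java.net.URISyntaxException;

    // A toy crawl predicate: accept a URL only if its host is in the .jp domain.
    public final class JpDomainFilter {

        public static boolean accepts(String url) {
            try {
                String host = new URI(url).getHost();
                return host != null
                        && (host.equals("jp") || host.endsWith(".jp"));
            } catch (URISyntaxException e) {
                // Malformed URLs are simply rejected.
                return false;
            }
        }

        public static void main(String[] args) {
            System.out.println(accepts("https://www.example.jp/news"));  // true
            System.out.println(accepts("https://www.example.com/news")); // false
        }
    }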

A Contributor License Agreement (CLA) defines the terms under which intellectual property has been contributed to a company/project, typically software under an open source license.

A document-oriented database, or document store, is a computer program and data storage system designed for storing, retrieving and managing document-oriented information, also known as semi-structured data.

HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

<span class="mw-page-title-main">Elasticsearch</span> Search engine

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-licensed under the source-available Server Side Public License and the Elastic License. Official clients are available in Java, .NET (C#), PHP, Python, Ruby and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine.

Scrapy is a free and open-source web-crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler. It is currently maintained by Zyte, a web-scraping development and services company.

<span class="mw-page-title-main">OpenRefine</span> Application for data cleanup and data transformation

OpenRefine is an open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling. It is similar to spreadsheet applications, and can handle spreadsheet file formats such as CSV, but it behaves more like a database.

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2008. It generally completes crawls every month.

<span class="mw-page-title-main">Kibana</span> Data visualization plugin for Elasticsearch

Kibana is a source-available data visualization dashboard software for Elasticsearch.

<span class="mw-page-title-main">Apache Tika</span> Open-source content analysis framework

Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types. As well as providing a Java library, it has server and command-line editions suitable for use from other programming languages.
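
A minimal sketch of using the Tika facade from Java, of the kind a parsing component might rely on (the input file name is hypothetical):

    import java.io.File;
    import java.io.IOException;

    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;

    public class TikaExample {

        public static void main(String[] args) throws IOException, TikaException {
            Tika tika = new Tika();
            File document = new File("report.pdf"); // hypothetical input file

            // Detect the MIME type, then extract the plain text content.
            String mimeType = tika.detect(document);
            String text = tika.parseToString(document);

            System.out.println("Type: " + mimeType);
            System.out.println(text);
        }
    }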

<span class="mw-page-title-main">Octopussy (software)</span> Log analysis software

Octopussy, also known as 8Pussy, is a free and open-source computer software that monitors systems by constantly analyzing the syslog data they generate and transmit to a central Octopussy server. Software like Octopussy therefore plays an important role in maintaining an information security management system within ISO/IEC 27001-compliant environments.

References

  1. "Powered By · DigitalPebble/storm-crawler Wiki · GitHub". Github.com. 2017-03-02. Retrieved 2017-04-19.
  2. "News Dataset Available – Common Crawl".
  3. "StormCrawler: An Open Source SDK for Building Web Crawlers with ApacheStorm | Linux.com | The source for Linux information". Linux.com. 2016-10-12. Retrieved 2017-04-19.
  4. "Julien Nioche on StormCrawler, Open-Source Crawler Pipelines Backed by Apache Storm". Infoq.com. 2016-12-15. Retrieved 2017-04-19.
  5. "The Battle of the Crawlers: Apache Nutch vs. StormCrawler - DZone Big Data". Dzone.com. Retrieved 2017-04-19.
  6. Zowalla, Richard; Wetter, Thomas; Pfeifer, Daniel (2020). "Crawling the German Health Web: Exploratory Study and Graph Analysis". Journal of Medical Internet Research. 22 (7): e17853. doi:10.2196/17853. PMC 7414401. PMID 32706701.
  7. "MirasText: An Automatically Generated Text Corpus for Persian".
  8. Sanagavarapu, Lalit Mohan; Mathur, Neeraj; Agrawal, Shriyansh; Reddy, Y. Raghu (2018). "SIREN - Security Information Retrieval and Extraction eNgine". Advances in Information Retrieval. Lecture Notes in Computer Science. Vol. 10772. pp. 811–814. doi:10.1007/978-3-319-76941-7_81. ISBN 978-3-319-76940-0.
  9. "Presentations · DigitalPebble/storm-crawler Wiki · GitHub". Github.com. 2017-04-04. Retrieved 2017-04-19.