Apache Nutch

Original author(s) Doug Cutting, Mike Cafarella
Developer(s) Apache Software Foundation
Stable release (1.x) 1.19 / 22 August 2022 [1]
Stable release (2.x) 2.4 / 11 October 2019 [1]
Repository Nutch Repository
Written in Java
Operating system Cross-platform
Type Web crawler
License Apache License 2.0
Website nutch.apache.org

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Features

[Image: Nutch robot mascot]

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
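
As an illustration of this plug-in style, the sketch below shows how a media-type parser might be registered and selected by MIME type. It is a minimal sketch only: the MediaTypeParser interface and ParserRegistry class are hypothetical simplifications invented for this illustration, not Nutch's actual extension-point API (real Nutch plugins are declared through per-plugin plugin.xml descriptors).

    // Illustrative sketch of a plug-in architecture; the names below are
    // hypothetical simplifications, not Nutch's real extension points.
    import java.util.HashMap;
    import java.util.Map;

    /** Hypothetical extension point for media-type parsing. */
    interface MediaTypeParser {
        /** Extracts plain text from raw fetched bytes. */
        String parse(byte[] rawContent);
    }

    /** Example plug-in: a trivial parser for text/plain content. */
    class PlainTextParser implements MediaTypeParser {
        @Override
        public String parse(byte[] rawContent) {
            return new String(rawContent, java.nio.charset.StandardCharsets.UTF_8);
        }
    }

    /** The host application picks a parser by MIME type at runtime. */
    class ParserRegistry {
        private final Map<String, MediaTypeParser> parsers = new HashMap<>();

        void register(String mimeType, MediaTypeParser parser) {
            parsers.put(mimeType, parser);
        }

        MediaTypeParser lookup(String mimeType) {
            return parsers.get(mimeType); // null if no plug-in handles this type
        }
    }

New media types can then be supported by adding implementations rather than changing the host application; Nutch applies the same idea, activating plugins through configuration instead of hard-coded registration.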

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.
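
In outline, such a fetcher repeatedly takes a URL from a frontier queue, downloads the page, and records newly discovered links. The following is a minimal single-threaded sketch of that loop (using nutch.apache.org as an example seed), not Nutch's actual fetcher; it omits robots.txt handling, per-host politeness delays, retries, and link extraction, all of which a production crawler must provide.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    /** Minimal illustrative fetch loop; not Nutch's actual fetcher. */
    public class TinyFetcher {
        public static void main(String[] args) {
            Queue<String> frontier = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            frontier.add("https://nutch.apache.org/"); // example seed URL

            int budget = 10; // stop after a handful of pages
            while (!frontier.isEmpty() && budget-- > 0) {
                String url = frontier.poll();
                if (!seen.add(url)) {
                    continue; // already fetched this URL
                }
                try (InputStream in = new URL(url).openStream()) {
                    byte[] body = in.readAllBytes();
                    System.out.println(url + " -> " + body.length + " bytes");
                    // A real crawler would parse the body here, extract
                    // outlinks, and add them back onto the frontier.
                } catch (IOException e) {
                    System.err.println("fetch failed: " + url);
                }
            }
        }
    }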

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities were spun out into their own subproject, called Hadoop.

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation. [2]

In February 2014 the Common Crawl project adopted Nutch for its open, large-scale web crawl. [3]

While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case.[citation needed]

Release history

Releases from the 1.x and 2.x branches are listed below in chronological order.

1.1 (2010-06-06): This release includes several major upgrades of existing libraries (Hadoop, Solr, Tika, etc.) on which Nutch depends. Various bug fixes and speedups (e.g., to Fetcher2) are also included.
1.2 (2010-10-24): This release includes several improvements (addition of parse-html as a selectable parser again, configurable per-field indexing), new features (including timing information for all Tool classes and implementation of parser timeouts), and bug fixes (an NPE in distributed search, XML formatting issues in Document fields).
1.3 (2011-06-07): This release includes several improvements: improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification, and an order-of-magnitude smaller source release tarball (only about 2 MB).
1.4 (2011-11-26): This release includes several improvements, including allowing parsers to declare support for multiple MIME types, a configurable fetcher queue depth, fetcher speed improvements, tighter Tika integration, and support for HTTP authentication in Solr indexing.
1.5 (2012-06-07): This release includes upgrades of several major components, including Tika 1.1 and Hadoop 1.0.0, improvements to the LinkRank and WebGraph elements, and a number of new plugins covering blacklisting, filtering and parsing, to name a few.
2.0 (2012-07-07): This release offers an edition focused on large-scale crawling, built on storage abstraction (via Apache Gora) for big data stores such as Apache Accumulo, Apache Avro, Apache Cassandra, Apache HBase, HDFS, an in-memory data store, and various high-profile SQL stores.
1.5.1 (2012-07-10): A maintenance release of the popular 1.5.x mainstream version of Nutch, which has been widely adopted within the community.
2.1 (2012-10-05): This release continues the 2.x development drive, which is growing in popularity amongst the community. As well as addressing about 20 bugs, it offers improved properties for better Solr configuration, upgrades to various Gora dependencies, and the option to build indexes in Elasticsearch.
1.6 (2012-12-06): This release includes over 20 bug fixes and as many improvements, as well as new functionality including a new HostNormalizer, the ability to set fetchInterval dynamically by MIME type, and enhancements to the Indexer API, including URL normalization and the deletion of robots noindex documents. Other notable improvements include upgrades of key dependencies to Tika 1.2 and Automaton 1.11-8.
2.2 (2013-06-08): The third release of the increasingly popular 2.x series, with over 30 bug fixes and over 25 improvements. It features the inclusion of Crawler-Commons, which Nutch now uses for improved robots.txt parsing, and library upgrades to Apache Hadoop 1.1.1, Apache Gora 0.3, Apache Tika 1.2 and Automaton 1.11-8.
1.7 (2013-06-24): This release includes over 20 bug fixes and as many improvements, most notably a new pluggable indexing architecture that currently supports Apache Solr and Elasticsearch. Shadowing the recent Nutch 2.2 release, robots.txt parsing is now delegated to Crawler-Commons. Key library upgrades have been made to Apache Hadoop 1.2.0 and Apache Tika 1.3.
2.2.1 (2013-07-02): This release includes library upgrades to Apache Hadoop 1.2.0 and Apache Tika 1.3; it is predominantly a bug-fix release for NUTCH-1591 (incorrect conversion of ByteBuffer to String).
1.8 (2014-03-17): Although this release includes library upgrades to Crawler-Commons 0.3 and Apache Tika 1.5, it also provides over 30 bug fixes and 18 improvements.
2.3 (2015-01-22): Nutch 2.3 comes packaged with a self-contained Apache Wicket-based web application. The SQL backend for Gora has been deprecated. [4]
1.10 (2015-05-06): This release includes a library upgrade to Tika 1.6 and provides over 46 bug fixes, 37 improvements and 12 new features. [5]
1.11 (2015-12-07): This release includes library upgrades to Hadoop 2.x and Tika 1.11 and provides over 32 bug fixes, 35 improvements and 14 new features. [6]
2.3.1 (2016-01-21): This bug-fix release addresses around 40 issues.
1.12 (2016-06-18)
1.13 (2017-04-02)
1.14 (2017-12-23)
1.15 (2018-08-09)
1.16 (2019-10-11)
2.4 (2019-10-11): Expected to be the last release of the 2.x series, as "no committer is actively working on it". [7]
1.17 (2020-07-02)
1.18 (2021-01-24)

Scalability

IBM Research studied the performance [8] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. [9] Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5.

The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second. [10]

Search engines built with Nutch

Creative Commons Search, launched in 2004 and later integrated into Firefox 1.0, was built on Nutch. [11] [12] [13]
Wikia Search, Wikia's open-source search engine, was also built on Nutch. [14] [15]

Related Research Articles

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.

The Jakarta Project created and maintained open source software for the Java platform. It operated as an umbrella project under the auspices of the Apache Software Foundation, and all Jakarta products were released under the Apache License. On December 21, 2011, the Jakarta project was retired because no subprojects remained.

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. Such systems may allow for users to voluntarily offer their own computing and bandwidth resources towards crawling web pages. By spreading the load of these tasks across many computers, costs that would otherwise be spent on maintaining large computing clusters are avoided.

Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a standard foundation for production search applications.

Heritrix is a web crawler designed for web archiving. It was written by the Internet Archive. It is available under a free software license and written in Java. The main interface is accessible using a web browser, and there is a command-line tool that can optionally be used to initiate crawls.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Douglass Read Cutting is a software designer, advocate, and creator of open-source search technology. He founded two technology projects, Lucene, and Nutch, with Mike Cafarella. Both projects are now managed through the Apache Software Foundation. Cutting and Cafarella are also the co-founders of Apache Hadoop.

Apache Felix is an open source implementation of the OSGi Core Release 6 framework specification. The initial codebase was donated from the Oscar project at ObjectWeb. The developers worked on Felix for a full year and made various improvements while retaining the original footprint and performance. On June 21, 2007, the project graduated from incubation to become a top-level project, and it is considered one of the smallest software projects at the Apache Software Foundation.

Solr is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on linear algebra. In the past, many of the implementations used the Apache Hadoop platform; today, the project is primarily focused on Apache Spark. Mahout also provides Java/Scala libraries for common math operations and primitive Java collections. Mahout is a work in progress; a number of algorithms have been implemented.

Pentaho is business intelligence (BI) software that provides data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities. Its headquarters are in Orlando, Florida. Pentaho was acquired by Hitachi Data Systems in 2015 and in 2017 became part of Hitachi Vantara.

Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications. It is a project of the Apache Software Foundation.

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data being encoded. It has two schema languages: one intended for human editing and another, based on JSON, that is more machine-readable.

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

The Apache Object Oriented Data Technology (OODT) is an open source data management system framework that is managed by the Apache Software Foundation. OODT was originally developed at NASA Jet Propulsion Laboratory to support capturing, processing and sharing of data for NASA's scientific archives.

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

The Apache Ambari project intends to simplify the management of Apache Hadoop clusters using a web UI. It also integrates with other existing applications using Ambari REST APIs.

StormCrawler is an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License and is written mostly in Java.

References

  1. "Apache Nutch™ - Downloads". Retrieved 27 September 2022.
  2. "Apache Nutch". nutch.apache.org.
  3. "Common Crawl's Move to Nutch – Common Crawl – Blog". blog.commoncrawl.org. Retrieved 2015-10-14.
  4. "Nutch 2.3 Release". Apache Nutch News. The Apache Software Foundation. 22 January 2015. Retrieved 18 January 2016.
  5. "Nutch 1.10 Release Notes". ASF JIRA. The Apache Software Foundation. 6 May 2015. Retrieved 18 January 2016.
  6. "Nutch 1.11 Release Notes". ASF JIRA. The Apache Software Foundation. 7 December 2015. Retrieved 18 January 2016.
  7. "Nutch 2.4 Release". Apache Nutch News. The Apache Software Foundation. 11 October 2019. Retrieved 20 May 2022.
  8. "Scalability of the Nutch search engine" (PDF).
  9. "Base Operating System Provisioning and Bringup for a Commercial Supercomputer" (PDF). Archived from the original (PDF) on December 3, 2008.
  10. "The Sapphire Web Crawler - Crawl Statistics". boston.lti.cs.cmu.edu (2008-10-01). Retrieved 2013-07-21.
  11. "Our Updated Search". Creative Commons. 2004-09-03.
  12. "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0". Creative Commons. 2004-11-22. Archived from the original on 2010-01-07.
  13. "New CC search UI". Creative Commons. 2006-08-02.
  14. "Where can I get the source code for Wikia Search?". Archived from the original on 2011-11-04. Retrieved 2010-02-12.
  15. "Update on Wikia – doing more of what's working | Jimmy Wales". 31 March 2009.
