Norconex Web Crawler

Other names: Norconex HTTP Collector
Developer(s): Norconex Inc.
Initial release: 2016
Stable release: 3.0.2 / 2022-01-05
Repository: GitHub Repository
Written in: Java
Operating system: Cross-platform
License: Apache License
Website: Norconex Web Crawler

Norconex Web Crawler is free and open-source web crawling and web scraping software written in Java and released under an Apache License. It can export data to many repositories, including Apache Solr, Elasticsearch, Microsoft Azure Cognitive Search, and Amazon CloudSearch. [1] [2] [3]

The crawler can run on its own or be embedded in a Java application. [4] [5]
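As a general illustration of what a web crawler embedded in a Java program does, the following minimal sketch walks a set of linked pages breadth-first and hands each one to an exporter. It does not use Norconex's API: the class `CrawlSketch`, the method `crawl`, the in-memory `linkGraph` (a stand-in for fetching pages and extracting links), and the `exporter` callback (a stand-in for sending documents to a search repository) are all hypothetical names chosen for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch of a generic crawl loop; not Norconex code.
class CrawlSketch {

    /**
     * Breadth-first crawl over an in-memory link graph.
     * linkGraph stands in for downloading a page and extracting its links;
     * exporter stands in for committing each document to a repository.
     * Returns the URLs in the order they were visited.
     */
    static List<String> crawl(String startUrl,
                              Map<String, List<String>> linkGraph,
                              Consumer<String> exporter) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be visited
        Set<String> seen = new HashSet<>();          // URLs already queued
        List<String> visited = new ArrayList<>();

        frontier.add(startUrl);
        seen.add(startUrl);

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            visited.add(url);
            exporter.accept(url);
            // Queue each outgoing link once, so cycles do not cause repeats.
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny made-up site with a cycle back to the start page.
        Map<String, List<String>> site = Map.of(
            "https://example.com/",
                List.of("https://example.com/a", "https://example.com/b"),
            "https://example.com/a",
                List.of("https://example.com/b", "https://example.com/"),
            "https://example.com/b",
                List.of("https://example.com/c"));

        List<String> order = crawl("https://example.com/", site,
            url -> System.out.println("export " + url));
        System.out.println(order);
    }
}
```

A real crawler would add, among other things, HTTP fetching, politeness delays, robots.txt handling, and content parsing, none of which are modelled here.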

Some key features are:

Organizations and products using Norconex Web Crawler include the Apache Solr ecosystem, the Department of National Defence, Universities Canada, and the U.S. Department of Education. [6] [7]

History

Norconex Web Crawler was released as free and open-source software in 2013. [8]

Related Research Articles

Web crawler: Software which systematically browses the World Wide Web

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.

CiteSeerX is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science.

Apache Nutch: Open source web crawler

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a standard foundation for production search applications.

YaCy

YaCy is a free distributed search engine built on the principles of peer-to-peer (P2P) networks, created by Michael Christen in 2003. The engine is written in Java and, as of September 2006, was distributed across several hundred computers, called YaCy-peers. Each YaCy-peer independently crawls the Internet, analyzes and indexes the web pages it finds, and stores the indexing results in a common database shared with other YaCy-peers over the peer-to-peer network. Anyone can use it to build a search portal for their intranet or to help search the public internet.

Alfresco Software: Information management software

Alfresco Software is a collection of information management software products for Microsoft Windows and Unix-like operating systems, developed by Alfresco Software Inc. using Java technology. The software, branded as a Digital Business Platform, is principally a proprietary, commercially licensed open source platform that supports open standards and provides enterprise scale. Open source Community Editions are also available, licensed under LGPLv3.

DSpace: Repository software package

DSpace is an open source repository software package typically used for creating open access repositories for scholarly and/or published digital content. While DSpace shares some feature overlap with content management systems and document management systems, the DSpace repository software serves a specific need as a digital archives system, focused on the long-term storage, access and preservation of digital content. The optional DSpace registry lists almost three thousand repositories all over the world.

Heritrix: Web crawler designed for web archiving

Heritrix is a web crawler designed for web archiving. It was written by the Internet Archive. It is available under a free software license and written in Java. The main interface is accessible using a web browser, and there is a command-line tool that can optionally be used to initiate crawls.

Apache Solr: Open-source enterprise-search platform

Solr is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.

NewGenLib

NewGenLib is an integrated library management system developed by Verus Solutions Pvt Ltd. Domain expertise is provided by Kesavan Institute of Information and Knowledge Management in Hyderabad, India. NewGenLib version 1.0 was released in March 2005. On 9 January 2008, NewGenLib was declared free and open-source under the GNU GPL. The latest version of NewGenLib is 3.1.1, released on 16 April 2015. Many libraries across the globe use NewGenLib as their primary integrated library management system, as seen from the NewGenLib discussion forum.

SVNKit is an open-source, pure Java software library for working with the Subversion version control system. It is free to use in open-source projects but requires a commercial license for use in developing proprietary software. It implements virtually all Subversion features and provides an API to work with Subversion working copies and to access and manipulate Subversion repositories.

parboiled is an open-source Java library released under an Apache License. It provides support for defining PEG parsers directly in Java source code.

Elasticsearch: Search engine

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-licensed under the source-available Server Side Public License and the Elastic License, while other parts fall under the proprietary (source-available) Elastic License. Official clients are available in Java, .NET (C#), PHP, Python, Ruby and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine.

Algolia is a proprietary search-as-a-service platform designed for use cases that require high quality and relevant search.

Apache Tika: Open-source content analysis framework

Apache Tika is a content detection and analysis framework, written in Java and stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it has server and command-line editions suitable for use from other programming languages.

StormCrawler is an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under Apache License and is written mostly in Java.

Blacklight is an open-source Ruby on Rails engine for creating search interfaces on top of Apache Solr indices. The software is used by libraries to create discovery layers or institutional repositories; by museums and archives to highlight digital collections; and by other information retrieval projects.

References

  1. "Committers". opensource.norconex.com.
  2. Hoppa, Jocelyn (10 February 2020). "Importing Data from the Web with Norconex & Neo4j". Graph Database & Analytics.
  3. "Deploy a Norconex HTTP Collector Indexer Plugin | Cloud Search". Google for Developers.
  4. Valcheva, Silvia (11 February 2018). "10 Best Open Source Web Crawlers: Web Data Extraction Software". Blog For Data-Driven Business.
  5. "Norconex HTTP Collector". Softpedia. Retrieved 25 September 2023.
  6. "SolrEcosystem - Solr - Apache Software Foundation". cwiki.apache.org.
  7. "Norconex Crawler Users". opensource.norconex.com.
  8. "Norconex Gives Back to Open-Source – Norconex Inc". Retrieved 25 September 2023.
