Apache Tika

Tika
Developer(s) Apache Software Foundation
Stable release 2.6.0 [1] / 7 November 2022
Repository Tika Repository
Written in Java
Operating system Cross-platform
Type Search and index API
License Apache License 2.0
Website tika.apache.org

Apache Tika is a content detection and analysis framework, written in Java and stewarded by the Apache Software Foundation. [2] It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it provides server and command-line editions suitable for use from other programming languages.

History

The project originated as part of the Apache Nutch codebase, where it provided content identification and extraction during crawling. In 2007 it was separated out to make it more extensible and usable by content management systems, other Web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron, Chris Mattmann and Jukka Zitting. [3] In 2011, Chris Mattmann and Jukka Zitting released the Manning book "Tika in Action", and the project released version 1.0.

Features

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types. For most of the more common and popular formats, [4] Tika then provides content extraction, metadata extraction and language identification capabilities.

It can also get text from images by using the OCR software Tesseract. [5]

While Tika is written in Java, it is widely used from other languages. [6] The RESTful server and the command-line tool allow non-Java programs to access Tika's functionality.
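As an illustration of access from a non-Java program, the following is a minimal Python sketch that talks to a Tika server over HTTP using only the standard library. It assumes a server is already running on `localhost` at port 9998 and that the `/detect/stream`, `/tika` and `/meta` endpoints behave as described in the Tika server documentation; the helper names `build_request` and `call_tika` are hypothetical and are not part of Tika.

```python
"""Sketch: calling a Tika server over HTTP from a non-Java program."""
import urllib.request

# Assumption: a Tika server is running locally on port 9998.
TIKA_SERVER = "http://localhost:9998"

# Assumed endpoint paths and response types; verify them against the
# documentation of the Tika server version you deploy.
ENDPOINTS = {
    "detect": ("/detect/stream", "text/plain"),  # MIME type identification
    "text": ("/tika", "text/plain"),             # content extraction
    "metadata": ("/meta", "application/json"),   # metadata extraction
}


def build_request(task, payload, base_url=TIKA_SERVER):
    """Build (but do not send) a PUT request for one of the tasks above."""
    path, accept = ENDPOINTS[task]
    return urllib.request.Request(
        base_url + path,
        data=payload,              # raw bytes of the document
        method="PUT",
        headers={"Accept": accept},
    )


def call_tika(task, payload, base_url=TIKA_SERVER, timeout=30):
    """Send the request and return the response body decoded as UTF-8."""
    request = build_request(task, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server reachable at the assumed address, reading a file in binary mode and passing its bytes to `call_tika("detect", data)` or `call_tika("text", data)` would return the server's response as a string. Separating `build_request` from `call_tika` keeps the request construction inspectable without a network connection.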

Notable uses

Tika is used by financial institutions including the Fair Isaac Corporation (FICO) [7] and Goldman Sachs, [8] by NASA and academic researchers, [9] and by major content management systems including Drupal [10] and Alfresco [11] to analyze large amounts of content and to make it available in common formats using information retrieval techniques.

On April 4, 2016, [12] Forbes published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that exposed an international scandal involving world leaders storing money in offshore shell corporations. The leaked documents and the project to analyze them are referred to as the Panama Papers.

References

  1. https://gitbox.apache.org/repos/asf?p=tika.git;a=tag;h=5489ec85f8b5ead67fd337d663ee815c93514666.
  2. "Apache Tika". Retrieved 2016-04-15.
  3. "Tika Proposal". Retrieved 2016-04-15.
  4. "The Apache Software Foundation". Apache Tika formats page. Retrieved 16 April 2016.
  5. "TikaOCR". Apache Tika. 2019-03-26. Retrieved 2019-12-02.
  6. "API Bindings for Tika". Apache Tika. Retrieved 2016-04-17.
  7. "FICO to Engage Kaggle's Community of 180,000 Data Scientists to Drive Innovation in the FICO Analytic Cloud | FICO". FICO | Decisions. Archived from the original on 2016-06-03. Retrieved 2016-04-15.
  8. "Goldman Sachs Puts Elasticsearch To Work - InformationWeek". InformationWeek. Retrieved 2017-06-21.
  9. "Studying polar data with the help of Apache Tika". Opensource.com. Retrieved 2016-04-15.
  10. "Text Extract for Drupal using Tika | Drupal.org". www.drupal.org. 30 July 2012. Retrieved 2016-04-15.
  11. "Content Transformation and Metadata Extraction with Apache Tika - alfrescowiki". wiki.alfresco.com. 5 June 2015. Retrieved 2016-04-15.
  12. Fox-Brewster, Thomas. "From Encrypted Drives To Amazon's Cloud -- The Amazing Flight Of The Panama Papers". Forbes. Retrieved 2016-04-15.