Apache PDFBox

PDFBox
Developer(s): Apache Software Foundation
Stable release:
  1.8.x: 1.8.17 / 15 September 2022 [1]
  2.0.x: 2.0.29 / 1 July 2023 [1]
  3.0.x: 3.0.0 / 18 August 2023 [1]
Repository: PDFBox Repository (Mirror)
Written in: Java
Operating system: Cross-platform
Type: Portable Document Format (PDF)
License: Apache License 2.0
Website: pdfbox.apache.org

Apache PDFBox is an open-source, pure-Java library for creating, rendering, printing, splitting, merging, altering and verifying PDF files, and for extracting their text and metadata.

Open Hub reports more than 11,000 commits (counted since PDFBox became an Apache project) by 18 contributors, representing more than 140,000 lines of code. It characterizes PDFBox as having a well-established, mature codebase maintained by an average-size development team with increasing year-over-year commits. Using the COCOMO model, Open Hub estimates the development effort at 46 person-years. [2]
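As a rough illustration of how a COCOMO-style estimate is derived from a line count, the sketch below applies the Basic COCOMO formula, effort = a × KLOC^b person-months. The coefficients a = 2.4 and b = 1.05 (the "organic" mode) and the class name `CocomoEstimate` are assumptions made for this example; the article does not state which COCOMO variant, parameters or line count Open Hub used.

```java
/** Basic COCOMO effort sketch; the coefficients are assumed, not taken from the article. */
final class CocomoEstimate {
    // Assumed Basic COCOMO "organic" mode coefficients (illustrative choice).
    static final double A = 2.4;
    static final double B = 1.05;

    /** Estimated effort in person-months for a codebase of the given size in KLOC. */
    static double personMonths(double kloc) {
        if (kloc < 0) {
            throw new IllegalArgumentException("kloc must be non-negative");
        }
        return A * Math.pow(kloc, B);
    }

    /** Same estimate expressed in person-years (12 person-months per year). */
    static double personYears(double kloc) {
        return personMonths(kloc) / 12.0;
    }

    public static void main(String[] args) {
        double kloc = 140.0; // "more than 140,000 lines of code" from the text
        System.out.printf("%.0f KLOC -> %.1f person-months (%.1f person-years)%n",
                kloc, personMonths(kloc), personYears(kloc));
    }
}
```

With these assumed coefficients, 140 KLOC works out to roughly 36 person-years, so the 46 person-year figure cited above presumably rests on a larger line count or different parameters.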

Structure

Apache PDFBox has these components:

History

PDFBox was started in 2002 on SourceForge by Ben Litchfield, who wanted to extract text from PDF files for Lucene. [3] It became an Apache Incubator project in 2008 and an Apache top-level project in 2009. [4]

Preflight was originally named PaDaF and was developed by Atos Worldline, which donated it to the project in 2011. [5]

In February 2015, Apache PDFBox was named an Open Source Partner Organization of the PDF Association. [6]

References

  1. "Apache PDFBox - Blog". pdfbox.apache.org. Apache Software Foundation. Retrieved 2022-09-27.
  2. "The Apache PDFBox Open Source Project on Open Hub". openhub.net. 2017-03-18. Retrieved 2017-03-18.
  3. Apache PDFBox and FontBox 1.0.0 released, The H Open, 16 February 2010
  4. PDFBox Project Incubation Status
  5. PaDaF Preflight Codebase Intellectual Property (IP) Clearance Status
  6. Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association, February 3, 2015