PDF/A

PDF/Archive
Filename extension: .pdf
Internet media type: application/pdf
Type code: 'PDF ' (including a single trailing space)
Uniform Type Identifier (UTI): com.adobe.pdf
Magic number: %PDF
Developed by: ISO
Initial release: October 1, 2005
Extended from: PDF
Standard: ISO 19005

PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for use in the archiving and long-term preservation of electronic documents. PDF/A differs from PDF by prohibiting features unsuitable for long-term archiving, such as font linking (as opposed to font embedding) and encryption. [1] The ISO requirements for PDF/A file viewers include color management guidelines, support for embedded fonts, and a user interface for reading embedded annotations.

Standards

ISO 19005 – Document management – Electronic document file format for long-term preservation (PDF/A)
| Abbr. | Subtitle | Published | Standard | Based on | Ref. |
|---|---|---|---|---|---|
| PDF/A-1 | Part 1: Use of PDF 1.4 | 2005-09-28 | ISO 19005-1 | PDF 1.4 (Adobe Systems, PDF Reference, third edition) | [2] |
| PDF/A-2 | Part 2: Use of ISO 32000-1 | 2011-06-20 | ISO 19005-2 | PDF 1.7 (ISO 32000-1:2008) | [3] |
| PDF/A-3 | Part 3: Use of ISO 32000-1 with support for embedded files | 2012-10-15 | ISO 19005-3 | PDF 1.7 (ISO 32000-1:2008) | [4] |
| PDF/A-4 | Part 4: Use of ISO 32000-2 | 2020-11 | ISO 19005-4 | PDF 2.0 (ISO 32000-2:2020) | [5] |

Background

PDF is a standard for encoding documents in an "as printed" form that is portable between systems. However, the suitability of a PDF file for archival preservation depends on options chosen when the PDF is created: most notably, whether to embed the necessary fonts for rendering the document; whether to use encryption; and whether to preserve additional information from the original document beyond what is needed to print it.

PDF/A began as a joint activity between the Association for Suppliers of Printing, Publishing and Converting Technologies (NPES) and the Association for Information and Image Management (AIIM), in conjunction with Adobe, to develop an international standard defining the use of the Portable Document Format (PDF) for archiving documents. [6] The goal was to address the growing need to archive documents electronically in a way that preserves their contents over an extended period and ensures that they can be retrieved and rendered with consistent, predictable results in the future. [7] This need exists in a wide variety of government, industry and academic areas worldwide, including legal systems, libraries, newspapers, and regulated industries. [8]

Description

The PDF/A standard does not define an archiving strategy or the goals of an archiving system. It identifies a "profile" for electronic documents that ensures the documents can be reproduced in exactly the same way by different software in years to come. A key element of this reproducibility is the requirement that PDF/A documents be entirely self-contained: all of the information necessary for displaying the document in the same manner is embedded in the file. This includes, but is not limited to, all content (text, raster images and vector graphics), fonts, and color information. A PDF/A document may not rely on information from external sources (e.g., font programs and data streams), but may include annotations (e.g., hypertext links) that link to external documents. [9]
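The self-containment rule for fonts can be illustrated with a small sketch. It assumes the fonts of a page have already been parsed into plain Python dictionaries keyed by PDF names; the function names are hypothetical, and a real tool would obtain these dictionaries from a PDF library and resolve indirect references first. The keys `/FontFile`, `/FontFile2` and `/FontFile3` are the font descriptor entries under which PDF stores an embedded font program.

```python
# Entries of a PDF font descriptor that hold an embedded font program.
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def font_is_self_contained(font: dict) -> bool:
    """Return True if the (already parsed) font dictionary carries its glyph data."""
    subtype = font.get("/Subtype")
    if subtype == "/Type3":
        # Type 3 glyphs are drawn by content streams stored in the file itself.
        return True
    if subtype == "/Type0":
        # Composite fonts delegate to their descendant CIDFonts.
        descendants = font.get("/DescendantFonts", [])
        return bool(descendants) and all(
            font_is_self_contained(d) for d in descendants
        )
    descriptor = font.get("/FontDescriptor", {})
    return any(key in descriptor for key in FONT_FILE_KEYS)


def fonts_relying_on_external_data(fonts: dict) -> list:
    """Return the sorted resource names of fonts that are not embedded."""
    return sorted(
        name for name, font in fonts.items() if not font_is_self_contained(font)
    )
```

A font that is only referenced by name, with no embedded program, would be reported by `fonts_relying_on_external_data`; this is a simplification and not a substitute for a full conformance check.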

Other key elements to PDF/A conformance include: [10] [11] [12]

Conformance levels and versions

PDF/A-1

Part 1 of the standard was first published on September 28, 2005, [2] and specifies two levels of conformance for PDF files: [13]

Level B conformance requires only that standards necessary for the reliable reproduction of a document's visual appearance be followed, while Level A conformance includes all Level B requirements in addition to features intended to improve a document's digital accessibility.

Additional Level A requirements:

Level A conformance was intended to increase the accessibility of conforming files for physically impaired users by allowing assistive software, such as screen readers, to more precisely extract and interpret a file's contents. [13] A later standard, PDF/UA, was developed to address what came to be seen as shortcomings of PDF/A, replacing many of its general guidelines with more detailed technical specifications. [14]

PDF/A-2

Part 2 of the standard, published on June 20, 2011, [3] addresses some of the new features added with versions 1.5, 1.6 and 1.7 of the PDF Reference. PDF/A-1 files will not necessarily conform to PDF/A-2, and PDF/A-2 compliant files will not necessarily conform to PDF/A-1.

Part 2 of the PDF/A standard is based on PDF 1.7 (ISO 32000-1) rather than PDF 1.4, and offers several new features:

Part 2 defines three conformance levels. PDF/A-2a and PDF/A-2b correspond to conformance levels a and b in PDF/A-1. A new conformance level, PDF/A-2u, represents Level B conformance (PDF/A-2b) with the additional requirement that all text in the document have Unicode mapping. [13] [15]

PDF/A-3

Part 3 of the standard, published on October 15, 2012, [4] differs from PDF/A-2 in only one regard: it allows embedding of arbitrary file formats (such as XML, CSV, CAD, word-processing documents, spreadsheet documents, and others) into PDF/A conforming documents. [16]
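The combinations of part and conformance level described so far can be captured in a small lookup table. The following is only a sketch: the function name `parse_flavour` and the label pattern are illustrative assumptions, the levels for Part 3 are inferred from the statement that it differs from Part 2 in only one regard, and Part 4 is left out because its levels are not described here.

```python
import re

# Conformance levels per part, as described in the text.
# Part 3 is assumed to share the levels of Part 2; Part 4 is omitted.
LEVELS_BY_PART = {
    1: {"a", "b"},
    2: {"a", "b", "u"},
    3: {"a", "b", "u"},
}


def parse_flavour(label: str):
    """Parse a label such as 'PDF/A-2u' into (part, level), or return None."""
    match = re.fullmatch(r"PDF/A-(\d)([a-z])", label.strip(), re.IGNORECASE)
    if match is None:
        return None
    part, level = int(match.group(1)), match.group(2).lower()
    if level in LEVELS_BY_PART.get(part, ()):
        return part, level
    return None
```

With this table, `parse_flavour("PDF/A-1u")` is rejected because level u was only introduced with Part 2.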

PDF/A-4

Part 4 of the standard, based on PDF 2.0, was published in late 2020. [17]

How to create a PDF/A File

Archives sometimes ask their users to submit PDF/A files and therefore provide instructions on how to convert files to PDF/A. Several methods using standard software exist; they differ in computation time and in how well they preserve links, equations, vector graphics and special characters. [18]

When documents are converted to PDF/A, visual inspection is needed because errors in the visual content are common. In one test sample, 11 percent of the produced PDF/A-1b documents contained visual artefacts. These reproducibility errors included vector graphics issues (transparent objects), loss of links, loss of other document content (unreadable characters, missing text, missing document parts), updated fields (reflecting the time or folder of conversion) and spelling errors. [19] Archives therefore usually do not convert to PDF/A themselves; instead, some ask their users to provide a PDF/A document. Typical computer setups offer several methods for converting documents to PDF/A, each with its own advantages and drawbacks. [18]

Converting a simple PDF (up to version 1.4) into PDF/A-2 usually works as expected, except for problems with glyphs. According to the PDF Association, "Problems can occur before and/or during the generation of PDFs. A PDF/A file can be formally correct yet still have incorrect glyphs. Only a careful visual check can uncover this problem. Because generation problems also affect Unicode mapping, the problem attracts the attention when a visual check is carried out on the extracted text. In PDF/A, text/font usage is specified uniquely enough to ensure that it cannot be incorrect. If viewers or printers do not offer complete support for encoding systems, this can result in problems with regard to PDF/A." [20] In other words, a document can be fully compliant with the standard and internally correct while the system used to view or print it still produces undesired results.

A document produced by optical character recognition (OCR) and converted into PDF/A-2 or PDF/A-3 does not support the notdefglyph flag. This type of conversion can therefore result in unrendered content.

PDF/A standard documents can be created with the following software: SoftMaker Office 2021, MS Word 2010 and newer, Adobe Acrobat Distiller, PDF Creator, OpenOffice or LibreOffice since release 3.0, LaTeX with pdfx or pdfTeX addons, or by using a virtual PDF printer (Adobe Acrobat Pro, PDF24, FreePDF + Ghostscript). [21]

Identification

A PDF/A document can be identified as such through PDF/A-specific metadata located in the "http://www.aiim.org/pdfa/ns/id/" namespace. This metadata represents a claim of conformance; in itself it does not ensure conformance:
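A minimal sketch of reading such a claim follows. It assumes the XMP packet has already been extracted from the PDF as an XML string, and it assumes the claim is carried by the properties `part` and `conformance` in the namespace named above (conventionally written `pdfaid:part` and `pdfaid:conformance`); these property names and the function name `read_pdfa_claim` do not appear in the text and are stated here as assumptions. XMP allows simple properties to be serialized either as child elements or as attributes, so the sketch checks both.

```python
import xml.etree.ElementTree as ET

PDFA_ID_NS = "http://www.aiim.org/pdfa/ns/id/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def read_pdfa_claim(xmp_xml: str):
    """Return (part, conformance) claimed in an XMP packet, or None.

    The result is only the file's own claim; it says nothing about
    whether the file actually conforms.
    """
    root = ET.fromstring(xmp_xml)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        found = {}
        for name in ("part", "conformance"):
            key = f"{{{PDFA_ID_NS}}}{name}"
            child = desc.find(key)  # element form
            if child is not None and child.text:
                found[name] = child.text.strip()
            elif key in desc.attrib:  # attribute form
                found[name] = desc.attrib[key].strip()
        if "part" in found:
            # Conformance may be absent; report it as None in that case.
            return found["part"], found.get("conformance")
    return None
```

Because the metadata is only a claim, a result such as `("2", "B")` still has to be confirmed by a validator.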

Validation

Validation checks whether a produced file actually conforms to PDF/A. PDF/A validators often disagree, however, because the interpretation of the PDF/A standards is not always clear. [19]

Isartor Test Suite

Industry collaboration in the original PDF/A Competence Center led to the development of the Isartor Test Suite in 2007 and 2008. The test suite consists of 204 PDF files intentionally constructed to systematically fail each of the requirements for PDF/A-1b conformance, allowing developers to test the ability of their software to validate against the standard's most basic level of conformance. [23] [24] By mid-2009 the test suite had already made an appreciable difference in the general quality of PDF/A validation software. [25]
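The way such a suite of deliberately failing files is used can be sketched as a small harness. Everything here is illustrative: `validate` stands for any validator wrapped as a function returning True when it accepts a file, and `wrongly_accepted` is a hypothetical helper, not part of the test suite or of any particular product.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def wrongly_accepted(
    validate: Callable[[Path], bool],
    failing_files: Iterable[Path],
) -> List[Path]:
    """Return the files a validator accepted although each must fail.

    Every file in the suite violates one requirement on purpose, so
    any accepted file points to a check the validator is missing.
    """
    return [path for path in failing_files if validate(path)]
```

An empty result means the validator rejected every test file; it does not show that the validator accepts all conforming files, which needs a separate set of known-good documents.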

veraPDF

The veraPDF consortium, led by the Open Preservation Foundation [26] and the PDF Association, was created in response to the EU Commission's PREFORMA challenge [27] to develop an open-source validator for the PDF/A format. The PDF Association launched the PDF Validation Technical Working Group in November 2014 to articulate a plan for developing an industry-supported PDF/A validator. [28]

The veraPDF consortium subsequently won phase 2 of the PREFORMA contract in April 2015. [29] Development continued throughout 2016, [30] with phase 2 completed on schedule by December 2016. The phase 3 testing and acceptance period concluded in July 2017. veraPDF now covers parts 1, 2 and 3 and conformance levels a, b and u of PDF/A.

veraPDF is available for installation on Windows, macOS, or Linux using a PDFBox-based or "Greenfields" PDF parser. [31]

PDF/A viewers

The PDF/A specification also states some requirements for a conforming PDF/A viewer, which must

When encountering a file that claims conformance with PDF/A, some PDF viewers will default to a special "PDF/A viewing mode" to fulfill conforming reader requirements. To take one example, Adobe Acrobat and Adobe Reader 9 include an alert to advise the user that PDF/A viewing mode has been activated. Some PDF viewers allow users to disable the PDF/A viewing mode or to remove the PDF/A information from a file. [32] [33]

Reception

A PDF/A document must embed all fonts in use; accordingly, a PDF/A file will often be larger than an equivalent PDF file that does not include embedded fonts.

The use of transparency is forbidden in PDF/A-1. The majority of PDF generation tools that allow for PDF/A document compliance, such as the PDF export in OpenOffice.org or PDF export tool in Microsoft Office 2007 suites, will also make any transparent images in a given document non-transparent. That restriction was removed in PDF/A-2. [10]

Some archivists have voiced concerns that PDF/A-3, which allows arbitrary files to be embedded in PDF/A documents, could result in circumvention of memory institution procedures and restrictions on archived formats. [34]

The PDF Association has addressed various misconceptions [35] regarding PDF/A in its publication "PDF/A in a Nutshell 2.0". [36]

References

  1. Oettler, Alexandra (2013). "PDF/A facts – an introduction to the standard". PDF/A in a Nutshell 2.0 (PDF). Archived (PDF) from the original on 2021-07-29. Retrieved 2021-07-29.
  2. "ISO 19005-1:2005". ISO. Archived from the original on 2016-08-18. Retrieved 2016-07-27.
  3. "ISO 19005-2:2011". ISO. Archived from the original on 2016-08-17. Retrieved 2016-07-27.
  4. "ISO 19005-3:2012". ISO. Archived from the original on 2016-08-17. Retrieved 2016-07-27.
  5. "ISO 19005-4:2020". ISO. Archived from the original on 2021-02-09. Retrieved 2021-02-04.
  6. "A short history of PDF/A". PDF Association. 2013-02-07. Archived from the original on 2014-07-14. Retrieved 2014-07-11.
  7. Oettler, Alexandra (2013-02-07). "The most important reasons to use PDF/A". PDF Association. Archived from the original on 2014-07-14. Retrieved 2014-07-11.
  8. Oettler, Alexandra (2013-02-07). "Typical uses for PDF/A". PDF Association. Archived from the original on 2014-07-14. Retrieved 2014-07-11.
  9. Oettler, Alexandra (2013-02-07). "The technical side of the PDF/A standard". PDF Association. Archived from the original on 2015-07-02. Retrieved 2017-08-07.
  10. "PDF/A – A Look at the Technical Side". Archived from the original on 2011-07-26. Retrieved 2011-07-06.
  11. "PDF/A-2 Standard Published by ISO! The New Standard Includes Great Technical Enhancements". 2011-07-01. Archived from the original on 2012-01-11. Retrieved 2011-07-06.
  12. Frequently Asked Questions (FAQs) – ISO 19005-1:2005 – PDF/A-1, Date: July 10, 2006 (PDF), 2006-07-10, archived from the original (PDF) on January 18, 2012, retrieved 2011-07-06
  13. "Improved PDF/A-1b". PDF Association. 2011-08-05. Archived from the original on 2012-09-15. Retrieved 2012-09-26.
  14. Oettler, Alexandra (2013-02-07). "PDF/A and the other PDF standards". PDF Association. Archived from the original on 2014-07-14. Retrieved 2014-07-12.
  15. PDF/A-2, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7), Library of Congress, archived from the original on 2013-01-30, retrieved 2012-09-26
  16. "PDF Association Arranges Its First Seminar on PDF/A to Include Standards 1 to 3". PDF Association. 2012-03-29. Archived from the original on 2012-09-15.
  17. "The new PDF 2.0 and subset standards (PDF Association)". Archived from the original on 2021-01-27. Retrieved 2021-02-04.
  18. Suri, Roland Erwin (February 15, 2017). "How do I create a PDF/A file?". doi:10.16911/ethz-ib-2722-de. Archived from the original on May 25, 2022. Retrieved May 30, 2022.
  19. Suri, Roland Erwin; El-Saad, Mohamed (2018-06-06). "Lost in migration: document quality for batch conversion to PDF/A". Library Hi Tech. 39 (2): LHT–10–2017-0220. doi:10.1108/LHT-10-2017-0220. hdl:20.500.11850/269199. ISSN 0737-8831. S2CID 67441801.
  20. Drümmer, Olaf (22 September 2007). "PDF/A – A Look at the Technical Side" (PDF). PDF/A Competence Center. PDF Association. p. 5. Archived (PDF) from the original on 2022-08-19. Retrieved 15 June 2022.
  21. "INSTRUCTIONS FOR CREATING PDF/A-COMPLIANT FILES FOR ONLINE PUBLISHING AT THE TU BERLIN" (PDF). Archived from the original (PDF) on 2020-07-11. Retrieved 2020-07-08.
  22. Oettler, Alexandra (2013-02-07). "Validation: is it really PDF/A?". PDF Association. Archived from the original on 2016-09-21. Retrieved 2014-07-11.
  23. Isartor Test Suite (PDF). PDF/A Competence Center. 2008-08-12. Archived (PDF) from the original on 2015-06-22. Retrieved 2016-09-23.
  24. "Isartor Test Suite". PDF Association. 2011-08-03. Archived from the original on 2016-09-23. Retrieved 2016-09-23.
  25. "Bavaria Report". PDFlib. 2009. Archived from the original on 2015-04-21. Retrieved 2015-04-30.
  26. "Open Preservation Foundation veraPDF project". Open Preservation Foundation. Archived from the original on 2015-04-28. Retrieved 2015-04-30.
  27. PREFORMA, an EU Commission funded project, archived from the original on 2015-04-27, retrieved 2015-04-30
  28. "A consortium including the PDF Association wins phase 1 of an EU Commission tender to create an open-source PDF/A validator". PDF Association. 2014-11-13. Archived from the original on 2015-04-21. Retrieved 2015-04-30.
  29. PREFORMA starts prototyping phase, archived from the original on 2015-04-27, retrieved 2015-04-30
  30. "veraPDF 0.22 released". 8 September 2016. Archived from the original on 24 September 2016. Retrieved 23 September 2016.
  31. "Software". veraPDF. 30 June 2015. Archived from the original on 2017-03-15. Retrieved 2017-03-15. Page for downloading the platform-specific installer.
  32. "How to Remove PDF/A Information from a file". Archived from the original on 2014-04-13. Retrieved 2014-04-10.
  33. "Change the PDF/A viewing mode". Archived from the original on 2014-04-13. Retrieved 2014-04-10.
  34. Archivists: No flowers for PDF/A-3, archived from the original on 2014-08-14, retrieved 2014-07-12
  35. The myths and legends surrounding PDF/A, archived from the original on 2018-02-16, retrieved 2018-02-15
  36. "PDF/A in a Nutshell 2.0". 23 May 2013. Archived from the original on 3 June 2019. Retrieved 3 June 2019.