DjVu

Filename extensions: .djvu, .djv
Internet media type: image/vnd.djvu, image/x-djvu
Magic number: AT&T
Developed by: AT&T Labs – Research
Initial release: 1998
Latest release: Version 26 [1] (April 2005)
Type of format: Image file format
Contained by: Interchange File Format
Open format?: Yes

DjVu [a] is a computer file format designed primarily to store scanned documents, especially those containing a combination of text, line drawings, indexed color images, and photographs. It uses technologies such as image layer separation of text and background/images, progressive loading, arithmetic coding, and lossy compression for bitonal (monochrome) images. This allows high-quality, readable images to be stored in a minimum of space, so that they can be made available on the web.

DjVu has been promoted as providing smaller files than PDF for most scanned documents. [3] The DjVu developers report that color magazine pages compress to 40–70 kB, black-and-white technical papers compress to 15–40 kB, and ancient manuscripts compress to around 100 kB; a satisfactory JPEG image typically requires 500 kB. [4] Like PDF, DjVu can contain an OCR text layer, making it easy to perform copy and paste and text search operations.

Free creators, manipulators, converters, web browser plug-ins, and desktop viewers are available. [2] DjVu is supported by a number of multi-format document viewers and e-book reader software on Linux (Okular, Evince, Zathura), Windows (Okular, SumatraPDF), and Android (Document Viewer, [5] FBReader, EBookDroid, PocketBook).

History

The DjVu technology was originally developed by Yann LeCun, Léon Bottou, Patrick Haffner, Paul G. Howard, Patrice Simard, and Yoshua Bengio at AT&T Labs from 1996 to 2001. [4]

Prior to the standardization of PDF in 2008, [6] [7] DjVu had been considered superior to PDF because it was an open file format, whereas PDF was proprietary at the time. The declared higher compression ratio (and thus smaller file size) and the claimed ease of converting large volumes of text into DjVu format were further arguments made for DjVu's superiority over PDF in the technology landscape of 2004. Independent technologist Brewster Kahle discussed the benefits of allowing easier access to DjVu files in a 2004 talk on IT Conversations. [8] [9]

The DjVu library distributed as part of the open-source package DjVuLibre has become the reference implementation for the DjVu format. DjVuLibre has been maintained and updated by the original developers of DjVu since 2002. [10]

The DjVu file format specification has gone through a number of revisions, the most recent being from 2005.

Revision history
Version | Release date | Status | Notes
1–19 [citation needed] | 1996–1999 | No longer maintained | Developmental versions by AT&T Labs preceding the sale of the format to LizardTech.
20 [1] | April 1999 | No longer maintained | DjVu version 3. DjVu changed from a single-page format to a multipage format.
21 [1] | September 1999 | Still maintained | Indirect storage format replaced. The searchable text layer was added.
22 [1] | April 2001 | Still maintained | Page orientation, color JB2.
23 [1] | July 2002 | No longer maintained | CID chunk.
24 [1] | February 2003 | No longer maintained | LTAnno chunk.
25 [1] | May 2003 | Still maintained | NAVM chunk. Support for DjVu bookmarks (outlines) was added. Changes made by Versions 23 and 24 were made obsolete.
26 [1] | April 2005 | Current stable version | Text/line annotations.

Role in the software ecosystem

The primary use of the DjVu format has been the electronic distribution of documents with a quality comparable to that of printed documents. Because that niche is also the primary use of PDF, the two formats inevitably became competitors. They nevertheless approach the problem of delivering high-resolution documents in very different ways: PDF primarily encodes graphics and text as vectorised data, whereas DjVu primarily encodes them as pixmap images. This means PDF places the burden of rendering the document on the reader, whereas DjVu places that burden on the creator.

For a number of years, significantly overlapping with the period when DjVu was being developed, there were no PDF viewers for free operating systems. A particular stumbling block was the rendering of vectorised fonts, which are essential for combining small file size with high resolution in PDF. Since displaying DjVu was a simpler problem for which free software was available, there were suggestions that the free software movement should employ DjVu instead of PDF for distributing documentation. Rendering a document to create a DjVu file is in principle not much different from rendering for a device-specific printer driver, and DjVu can as a last resort be generated from scans of paper media. However, when FreeType 2.0 began in 2000 to provide rendering of all major vectorised font formats, that specific advantage of DjVu began to erode.

In the 2000s, with the growth of the World Wide Web and before the widespread adoption of broadband, DjVu was often adopted by digital libraries as their format of choice. Reasons included its integration with software such as Greenstone [11] and the Internet Archive, [12] browser plugins that allowed advanced online browsing, smaller file sizes for comparable quality of book scans and other image-heavy documents, [13] and support for embedding and searching full text from OCR. [14] [15] Some features, such as thumbnail previews, were later integrated into the Internet Archive's BookReader, [16] and DjVu browsing was deprecated in its favour when, around 2015, some major browsers stopped supporting NPAPI and, with it, DjVu plugins. [17]

DjVu.js Viewer attempts to replace the missing plugins.

Technical overview

File structure

The DjVu file format is based on the Interchange File Format (IFF) and is composed of hierarchically organized chunks. The IFF structure is preceded by a 4-byte magic number, AT&T. It is followed by a single FORM chunk with a secondary identifier of either DJVU or DJVM, for a single-page or a multi-page document, respectively.

All the chunks can be contained in a single file, in the case of so-called bundled documents, or can be spread over several files: one file for every page plus some files with shared chunks.
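The layout described above (magic number, FORM chunk, secondary identifier) can be checked with a few lines of Python. This is a minimal sketch, not the DjVuLibre API: the function name `read_djvu_header` and the returned dictionary are illustrative assumptions, and the 4-byte length field is assumed to be big-endian, as is conventional for IFF.

```python
import struct


def read_djvu_header(data: bytes) -> dict:
    """Parse the first 16 bytes of a DjVu file (illustrative helper)."""
    if len(data) < 16:
        raise ValueError("too short to be a DjVu file")
    # 4-byte magic, 4-byte "FORM", 4-byte big-endian length, 4-byte secondary ID.
    magic, form, length, kind = struct.unpack(">4s4sI4s", data[:16])
    if magic != b"AT&T":
        raise ValueError("missing AT&T magic number")
    if form != b"FORM":
        raise ValueError("expected a FORM chunk after the magic number")
    kinds = {b"DJVU": "single-page", b"DJVM": "multi-page"}
    # Identifiers other than DJVU/DJVM are reported as "other", not rejected.
    return {"kind": kinds.get(kind, "other"), "form_length": length}


# A hand-built header for an (otherwise empty) multi-page document.
sample = b"AT&T" + b"FORM" + struct.pack(">I", 4) + b"DJVM"
print(read_djvu_header(sample))
```

On a real file, the same function could be applied to the first 16 bytes read in binary mode.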

Chunk types

Chunk types in DjVu files
Chunk identifier | Contained by | Description
FORM:DJVU | FORM:DJVM | Describes a single page. Can either be at the root of a document, forming a single-page document, or be referred to from a DIRM chunk.
FORM:DJVM | (none) | Describes a multi-page document. Is the document's root chunk.
FORM:DJVI | FORM:DJVM | Contains data shared by multiple pages.
FORM:THUM | FORM:DJVM | Contains thumbnails.
INFO | FORM:DJVU | Must be the first chunk. Describes the page width, height, format version, resolution, gamma, and rotation.
DIRM | FORM:DJVM | Must be the first chunk. References other FORM chunks. These chunks can either follow this chunk inside the FORM:DJVM chunk or be contained in external files. These types of documents are referred to as bundled or indirect, respectively.
NAVM | FORM:DJVM | If present, must immediately follow the DIRM chunk. Contains a BZZ-compressed outline of the document.
ANTa, ANTz | FORM:DJVI or FORM:DJVU | Annotations.
TXTa, TXTz | FORM:DJVU | Unicode text and layout information.
INCL | FORM:DJVU | The ID of an included FORM:DJVI chunk.
Sjbz | FORM:DJVU | BZZ-compressed JB2 bitonal data used to store the mask.
Djbz | FORM:DJVI or FORM:DJVU | Shared shape table.
WMRM | ? | JB2 data required to remove a watermark.
CIDa | FORM:DJVU | Obsolete chunk with unknown content.
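The chunks listed above can be enumerated with a simple loop over the body of a FORM chunk. The sketch below assumes the usual IFF conventions (a 4-byte identifier, a 4-byte big-endian payload length, and padding so that each chunk starts on an even offset) and assumes the body itself starts on an even offset. The names `iter_chunks` and `make_chunk` are hypothetical helpers, not part of any DjVu library.

```python
import struct
from typing import Iterator, Tuple


def iter_chunks(body: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (chunk_id, payload) pairs from the body of a FORM chunk."""
    pos = 0
    while pos + 8 <= len(body):
        chunk_id, size = struct.unpack(">4sI", body[pos:pos + 8])
        yield chunk_id.decode("ascii"), body[pos + 8:pos + 8 + size]
        pos += 8 + size
        if pos % 2:  # assumed IFF rule: skip one pad byte after odd-sized payloads
            pos += 1


def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Build a chunk for the demo, padding odd-sized payloads with one zero byte."""
    pad = b"\x00" if len(payload) % 2 else b""
    return chunk_id + struct.pack(">I", len(payload)) + payload + pad


# Synthetic page body: the payloads are placeholders, not valid DjVu data.
body = (make_chunk(b"INFO", bytes(10))
        + make_chunk(b"Sjbz", b"abc")
        + make_chunk(b"TXTz", b"hi"))

for chunk_id, payload in iter_chunks(body):
    print(chunk_id, len(payload))
```

Nested FORM chunks (for example the pages inside a bundled FORM:DJVM) could be handled by calling the same function on a FORM payload after skipping its 4-byte secondary identifier.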

Compression

DjVu divides a single image into many different images, then compresses them separately. To create a DjVu file, the initial image is first separated into three images: a background image, a foreground image, and a mask image. The background and foreground images are typically lower-resolution color images (e.g., 100 dpi); the mask image is a high-resolution bilevel image (e.g., 300 dpi) and is typically where the text is stored. The background and foreground images are then compressed using a wavelet-based compression algorithm named IW44. [4] The mask image is compressed using a method called JB2 (similar to JBIG2). The JB2 encoding method identifies nearly identical shapes on the page, such as multiple occurrences of a particular character in a given font, style, and size. It compresses the bitmap of each unique shape separately, and then encodes the locations where each shape appears on the page. Thus, instead of compressing a letter "e" in a given font multiple times, it compresses the letter "e" once (as a compressed bit image) and then records every place on the page it occurs.
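The three-layer model described above can be illustrated by recombining the layers into a page: wherever the high-resolution mask is set, the pixel takes its colour from the foreground layer, and elsewhere from the background layer. This is a toy sketch under stated assumptions. The function `compose` and its `scale` parameter are hypothetical, the low-resolution layers are upsampled by simple nearest-neighbour lookup, and no IW44 or JB2 decoding is involved.

```python
def compose(mask, foreground, background, scale):
    """Rebuild a page from a high-resolution mask and two low-resolution layers.

    mask:        rows of 0/1 values at full resolution
    foreground:  rows of pixel values at 1/scale of the mask resolution
    background:  rows of pixel values at 1/scale of the mask resolution
    """
    page = []
    for y, mask_row in enumerate(mask):
        row = []
        for x, bit in enumerate(mask_row):
            layer = foreground if bit else background
            # Nearest-neighbour upsampling of the low-resolution layer.
            row.append(layer[y // scale][x // scale])
        page.append(row)
    return page


# A 4x2 page: "K" is dark ink, "w" and "y" are paper colours.
mask = [[0, 1, 1, 0],
        [0, 0, 1, 0]]
fg = [["K", "K"]]   # one low-resolution row covering both mask rows
bg = [["w", "y"]]

for row in compose(mask, fg, bg, scale=2):
    print("".join(row))
```

The sharp edges of the text come entirely from the mask, which is why the colour layers can be stored at a much lower resolution without making the text harder to read.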

Optionally, these shapes may be mapped to UTF-8 codes (either by hand or potentially by a text recognition system) and stored in the DjVu file. If this mapping exists, it is possible to select and copy text.

Since JB2 (also called DjVuBitonal) is a variation on JBIG2, working on the same principles, [18] both compression methods have the same problems when performing lossy compression. In 2013 it emerged that Xerox photocopiers and scanners had been substituting digits with similar-looking ones, for example replacing a 6 with an 8. [19] A DjVu document has been spotted in the wild with character substitutions, such as an n with bleeding serifs turning into a u and an o with a spot inside turning into an e. [20] Whether lossy compression has occurred is not stored in the file. [1] Thus the DjView viewing application cannot warn the user that glyph substitutions might have occurred, either when opening a lossily compressed file or in the Information or Metadata dialogue boxes. [21]
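The substitution risk can be illustrated with a toy shape-dictionary encoder. The sketch below is not the JB2 algorithm (it omits arithmetic coding and everything else in the real codec); it only shows the matching principle: each glyph is compared with the shapes already stored, and a glyph that differs from a stored shape by no more than `tolerance` pixels is recorded as a reference to that shape. All names (`hamming`, `encode`, `decode`, `tolerance`) are assumptions made for the example, and all bitmaps are assumed to have the same dimensions.

```python
def hamming(a, b):
    """Count differing pixels between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs, tolerance):
    """Return (shapes, placements) for a list of (position, bitmap) glyphs."""
    shapes, placements = [], []
    for pos, bitmap in glyphs:
        for idx, shape in enumerate(shapes):
            if hamming(shape, bitmap) <= tolerance:
                break  # reuse an existing, "close enough" shape
        else:
            shapes.append(bitmap)  # no match: store a new shape
            idx = len(shapes) - 1
        placements.append((pos, idx))
    return shapes, placements


def decode(shapes, placements):
    """Rebuild the (position, bitmap) list from the dictionary."""
    return [(pos, shapes[idx]) for pos, idx in placements]


SIX = ("###", "#..", "###", "#.#", "###")
EIGHT = ("###", "#.#", "###", "#.#", "###")  # differs from SIX by one pixel
page = [((0, 0), SIX), ((4, 0), EIGHT), ((8, 0), SIX)]

exact_shapes, exact_places = encode(page, tolerance=0)
lossy_shapes, lossy_places = encode(page, tolerance=1)

print(len(exact_shapes), "shapes stored with exact matching")
print(len(lossy_shapes), "shapes stored with a one-pixel tolerance")
print("8 preserved:", decode(lossy_shapes, lossy_places)[1][1] == EIGHT)
```

With exact matching the page round-trips unchanged. With a one-pixel tolerance the dictionary shrinks, but the 8 is decoded as a 6, and nothing in the encoded output records that a substitution took place.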

Format licensing

DjVu is an open file format with patents. [3] The file format specification is published, as is the source code for the reference library. [3] The original authors distribute an open-source implementation named "DjVuLibre" under the GNU General Public License and a patent grant. [22] The rights to the commercial development of the encoding software have been transferred to different companies over the years, including AT&T Corporation, LizardTech, [23] Celartem [24] and ePapyrus Solutions K.K. (formerly Cuminas, [25] before it joined ePapyrus Solutions, Inc. [26]). [27] Patents typically have an expiry term of about 20 years.

Celartem acquired LizardTech and Extensis. [28] [29] [24] [30] [31]

Support

The selection of downloadable DjVu viewers is wider on Linux distributions than it is on Windows or macOS. Additionally, the format is rarely supported by proprietary scanning software.

In 2002, the DjVu file format was chosen by the Internet Archive as a format in which its Million Book Project provides scanned public-domain books online (along with TIFF and PDF). [32] In February 2016, the Internet Archive announced that DjVu would no longer be used for new uploads, among other reasons citing the format's declining use and the difficulty of maintaining their Java applet based viewer for the format. [17]

Wikimedia Commons, a media repository used by Wikipedia among others, conditionally permits PDF and DjVu media files. [33]

Notes

  1. Although usually pronounced as an initialism, "D-J-V-U", the file type was intended to have the pronunciation DAY-zhah-VOO (/ˌdeɪʒɑːˈvuː/), after French déjà vu. [2]

References

  1. "Lizardtech DjVu Reference" (PDF). Cuminas.jp. p. 25. Retrieved 7 December 2021.
  2. "DjVu.org – the premier menu for djvu resources". djvu.org. Archived from the original on 2017-06-29. Retrieved 2017-07-02.
  3. "What is DjVu – DjVu.org". DjVu.org. Archived from the original on 2019-01-21. Retrieved 2009-03-05.
  4. Léon Bottou; Patrick Haffner; Paul G. Howard; Patrice Simard; Yoshua Bengio; Yann Le Cun (1998). "High Quality Document Image Compression with DjVu" (PDF). Journal of Electronic Imaging. 7 (3): 410–425.
  5. Document Viewer, Sufficiently Secure, 2022-04-04, retrieved 2022-04-09
  6. "ISO 32000-1:2008 – Document management – Portable document format – Part 1: PDF 1.7". Iso.org. 2008-07-01. Retrieved 2010-02-21.
  7. Orion, Egan (2007-12-05). "PDF 1.7 is approved as ISO 32000". The Inquirer . Incisive Media. Archived from the original on December 13, 2007. Retrieved 2007-12-05.
  8. Brewster Kahle (December 16, 2004). "Universal Access to All Knowledge" (Audio; Speech at 1h:31 m:20s). Conversations Network.
  9. "LizardTech To Open Source A DjVu Java Viewer". ECM Connection. 7 December 2004. Retrieved 18 August 2017.
  10. "DjVuLibre: Open Source DjVu library and viewer". djvu.sourceforge.net.
  11. "nzdl:projects - Greenstone". Wiki.greenstone.org. Retrieved 7 December 2021.
  12. Eric Rumsey (2018-09-05). "Google Books vs DjVu in Internet Archive". Blog.libuiowa.edu. Archived from the original on 2018-08-22. Retrieved 2018-08-21.
  13. Eric Rumsey (2018-09-10). "DjVu again". Blog.libuiowa.edu.
  14. Jeff Kaplan (2004-12-09). "New book collection: color scans, djvu, some pdf" (PDF). Blog.archive.org.
  15. Janusz S. Bień (2011-09-12). "Efficient search in hidden text of large DjVu documents". Advanced Language Technologies for Digital Libraries (PDF). Lecture Notes in Computer Science. Vol. 6699. pp. 1–14. doi:10.1007/978-3-642-23160-5_1. ISBN   978-3-642-23159-9. S2CID   3095526.
  16. Eric Rumsey (2010-09-10). "Internet Archive's BookReader Thumbnail View". Blog.libuiowa.edu.
  17. Brewster Kahle; Jeff Kaplan (2016-02-26). "DjVu files for new uploads". Archive.org.
  18. Artem Mikheev, Luc Vincent, Mike Hawrylycz & Léon Bottou: Electronic Document Publishing Using DjVu
  19. See the JBIG2 article for more details and references.
  20. "This document caused me a fair bit of consternation transcribing it on a site th... | Hacker News". News.ycombinator.com. Retrieved 7 December 2021.
  21. "DjVuLibre". SourceForge.net. Retrieved 7 December 2021.
  22. "DjVuLibre: Open Source DjVu library and viewer".
  23. Extensis. "Company – About – LizardTech". Lizardtech.com.
  24. "Celartem, Inc.: Private Company Information – Bloomberg". Bloomberg.com.
  25. "会社情報 - Cuminas Corporation". Cuminas.jp. Archived from the original on 2018-01-15. Retrieved 2018-01-14.
  26. 株式譲渡および完全子会社化のお知らせ [Notice regarding share transfer and becoming a wholly owned subsidiary]. epapyrus.jp (in Japanese). 2022-06-03. Retrieved 2024-12-08.
  27. 会社名変更のお知らせ [Notice of company name change]. epapyrus.jp (in Japanese). 2023-11-06. Retrieved 2024-12-08.
  28. "Company Overview – Celartem Technology, Inc". Celartem.com. Archived from the original on 27 May 2019. Retrieved 7 December 2021.
  29. "Celartem Technology Announces Merger of US Holdings – Extensis.com". Archived from the original on 2018-01-15. Retrieved 2018-01-14.
  30. "Celartem Technology Inc.: Private Company Information – Bloomberg". Bloomberg.com.
  31. "Celartem Sells Extensis and LizardTech Plugins and XTensions to onOne Software – Big Picture – Wide Format Printing". bigpicture.net. 28 July 2005.
  32. "Image file formats – OLPC". Wiki.laptop.org. Retrieved 2008-09-09.
  33. Wikimedia Commons. Project scope: PDF and DjVu.