BagIt

BagIt is a set of hierarchical file system conventions designed to support disk-based storage and network transfer of arbitrary digital content. A "bag" consists of a "payload" (the arbitrary content) and "tags", which are metadata files intended to document the storage and transfer of the bag. A required tag file contains a manifest listing every file in the payload together with its corresponding checksum. The name, BagIt, is inspired by the "enclose and deposit" method, [1] sometimes referred to as "bag it and tag it".

Bags are ideal for digital content normally kept as a collection of files. They are also well-suited to the export, for archival purposes, of content normally kept in database structures that receiving parties are unlikely to support. Relying on cross-platform (Windows and Unix) filesystem naming conventions, a bag's payload may include any number of directories and sub-directories (folders and sub-folders). A bag can specify payload content indirectly via a "fetch.txt" file that lists URLs for content that can be fetched over the network to complete the bag; simple parallelization (e.g. running 10 instances of Wget) can exploit this feature to transfer large bags very quickly. Benefits of bags include:

Specification

BagIt is currently defined in RFC 8493. [2] It defines a simple file naming convention used by the digital curation community for packaging up arbitrary digital content, so that it can be reliably transported via both physical media (hard disk drive, CD-ROM, DVD) and network transfers (FTP, HTTP, rsync, etc.). BagIt is also used for managing the digital preservation of content over time. Discussion about the specification and its future directions takes place on the Digital Curation discussion list.

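The BagIt specification is organized around the notion of a “bag”. A bag is a named file system directory that minimally contains a “data” directory holding the payload, at least one manifest file listing the payload files together with their checksums, and a “bagit.txt” file that declares the BagIt version and the character encoding of the tag files.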

On receipt of a bag, a piece of software can examine the manifest file to make sure that the payload files are present and that their checksums are correct. This allows accidentally removed or corrupted files to be identified. Below is an example of a minimal bag “myfirstbag” that encloses two files of payload. The contents of the tag files are included below their filenames.

myfirstbag/
|-- data
|   \-- 27613-h
|       \-- images
|           \-- q172.png
|           \-- q172.txt
|-- manifest-md5.txt
|     49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png
|     408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt
\-- bagit.txt
      BagIt-Version: 0.97
      Tag-File-Character-Encoding: UTF-8

In this example the payload happens to consist of a Portable Network Graphics image file and an Optical Character Recognition text file. In general, the identification and definition of file formats is outside the scope of the BagIt specification; file attributes are likewise out of scope.
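The receipt check described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names `read_manifest` and `verify_bag` are hypothetical, the sketch assumes a single `manifest-<algorithm>.txt` file with one "checksum path" pair per line as in the example bag, and it ignores tag manifests and "fetch.txt".

```python
import hashlib
from pathlib import Path


def read_manifest(manifest_path):
    """Parse a manifest file into {relative_path: expected_hex_digest}."""
    entries = {}
    text = Path(manifest_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first run of whitespace only, so that payload
        # paths containing spaces are kept intact.
        digest, rel_path = line.split(None, 1)
        entries[rel_path.strip()] = digest.lower()
    return entries


def verify_bag(bag_dir, algorithm="md5"):
    """Return (missing, corrupted, unlisted) lists of payload paths.

    Hypothetical helper: checks every manifest entry against the files
    on disk, then reports payload files the manifest does not mention.
    """
    bag = Path(bag_dir)
    expected = read_manifest(bag / f"manifest-{algorithm}.txt")
    missing, corrupted = [], []
    for rel_path, digest in expected.items():
        target = bag / rel_path
        if not target.is_file():
            missing.append(rel_path)
            continue
        hasher = hashlib.new(algorithm)
        with target.open("rb") as fh:
            # Read in 1 MiB chunks so large payload files fit in memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                hasher.update(chunk)
        if hasher.hexdigest() != digest:
            corrupted.append(rel_path)
    on_disk = {
        p.relative_to(bag).as_posix()
        for p in (bag / "data").rglob("*")
        if p.is_file()
    }
    unlisted = sorted(on_disk - set(expected))
    return missing, corrupted, unlisted
```

Calling `verify_bag("myfirstbag")` on an intact copy of the example bag would be expected to return three empty lists; any non-empty list names the files that need attention.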

The specification allows for several optional tag files (in addition to the manifest). Their character encoding must be identified in “bagit.txt”, which itself must always be encoded in UTF-8. The specification defines the following optional tag files:

Until version 15, the draft also described how to serialize a bag in an archive file, such as ZIP or TAR. From version 15 onward, serialization is no longer part of the specification. It was dropped not for technical reasons but because it fell outside the scope and focus of the specification.
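Returning to the "fetch.txt" mechanism mentioned earlier, a parallel transfer can be sketched along these lines. The sketch assumes each line of "fetch.txt" has the form "URL LENGTH PATH", with "-" standing for an unknown length, which is how RFC 8493 lays the file out. The names `parse_fetch` and `fetch_missing` are hypothetical, and a thread pool stands in for running several Wget instances side by side.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve


def parse_fetch(text):
    """Parse fetch.txt content into (url, length_or_None, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: URL, length in bytes (or "-"), payload path.
        url, length, path = line.split(None, 2)
        size = None if length == "-" else int(length)
        entries.append((url, size, path.strip()))
    return entries


def fetch_missing(bag_dir, download=urlretrieve, workers=10):
    """Download payload files listed in fetch.txt that are not yet present.

    `download` is any callable taking (url, filename); it defaults to
    urllib's urlretrieve. Returns the list of paths that were fetched.
    """
    bag = Path(bag_dir)
    entries = parse_fetch((bag / "fetch.txt").read_text(encoding="utf-8"))
    todo = [
        (url, bag / path)
        for url, _size, path in entries
        if not (bag / path).exists()
    ]
    for _url, target in todo:
        target.parent.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every job and re-raises any download error.
        list(pool.map(lambda job: download(job[0], str(job[1])), todo))
    return [target for _url, target in todo]
```

A production tool would also want to check that each listed path stays inside the bag directory and to verify the fetched files against the manifest afterwards; both are left out here for brevity.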

History

The BagIt specification emerged from a collaboration between the Library of Congress and the California Digital Library while transferring digital content created as part of the National Digital Information Infrastructure and Preservation Program. The origins of the idea date back to work done at the University of Tsukuba on the "enclose and deposit" model, for mutually depositing archived resources to enable long-term digital preservation. [3] Using manifests and checksums is a fairly common practice, as evidenced by their use in the ZIP and Deb file formats, as well as on public FTP sites.

In 2007 the California Digital Library needed to transfer several terabytes of content (largely Web archiving data) to the Library of Congress. The BagIt specification allowed the content to be packaged up in "bags" with package metadata and a manifest that detailed file checksums, which were later verified on receipt of the bags. The specification was written up as an IETF draft by John Kunze in December 2008 and went through several revisions before being issued as an RFC. [2] In 2009 the Library of Congress produced a video that describes the specification and the use cases around it. [4] [5] In 2018, version 1.0 was published as an RFC by the Internet Engineering Task Force.

References

  1. "A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method" (PDF). 2005. Archived from the original (PDF) on 2016-03-05. Retrieved 2015-05-07.
  2. "The BagIt File Packaging Format (V1.0)". Retrieved 29 October 2018.
  3. Tabata, Koichi. "A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method" (PDF). Archived from the original (PDF) on 26 July 2011. Retrieved 12 October 2010.
  4. BagIt: Transferring Digital Content for Preservation. Library of Congress. 2009. Archived from the original on 2021-12-21. Retrieved 12 October 2010.
  5. "BagIt: Transferring Digital Content for Preservation (Transcript)" (PDF). Library of Congress. 2009. Archived (PDF) from the original on 10 October 2010. Retrieved 12 October 2010.