Compound File Binary Format


Compound File Binary Format (CFBF), also called Compound File, Compound Document format, [1] or Composite Document File V2 [2] (CDF), is a compound document file format for storing numerous files and streams within a single file on a disk. CFBF was developed by Microsoft and is an implementation of Microsoft COM Structured Storage. [3] [4] [5]


Microsoft has opened the format for use by others, and it is now used in a variety of programs, from Microsoft Word and Microsoft Access to Business Objects. It also forms the basis of the Advanced Authoring Format. [6]

Overview

At its simplest, the Compound File Binary Format is a container, with little restriction on what can be stored within it.

A CFBF file's structure loosely resembles a FAT filesystem. The file is partitioned into sectors, which are chained together by a File Allocation Table (not to be confused with the file system of the same name) containing the chain of sectors that belongs to each file. A Directory holds information about the contained files, including a Sector ID (SID) giving the starting sector of each file's chain.

Structure

The CFBF file consists of a 512-byte header record followed by a number of sectors whose size is defined in the header. The literature defines sectors to be either 512 or 4096 bytes in length, although the format is potentially capable of supporting sector sizes from 128 bytes upwards in powers of 2 (128, 256, 512, 1024, etc.). The lower limit of 128 bytes is the minimum required to fit a single directory entry in a Directory Sector.
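The relationship between the sector shift stored in the header and the position of a sector in the file can be sketched as follows. This is a minimal illustration, not part of any reference implementation: the function names are hypothetical, and it assumes (as Microsoft's [MS-CFB] specification describes) that the header occupies one full sector-sized slot, so that sector 0 begins one sector size into the file. For 512-byte sectors this matches the "512-byte header followed by sectors" layout described above.

```c
#include <stdint.h>

/* Size of one directory entry in bytes (the 128-byte lower limit above). */
#define CFB_DIR_ENTRY_SIZE 128u

/* Sector size in bytes for a given sector shift (9 -> 512, 12 -> 4096). */
static uint64_t cfb_sector_size(unsigned sector_shift)
{
    return (uint64_t)1 << sector_shift;
}

/* Byte offset of sector number `sect` from the start of the file.
 * Assumption: the header fills the first sector-sized slot, so sector 0
 * starts at offset 1 << sector_shift. */
static uint64_t cfb_sector_offset(uint32_t sect, unsigned sector_shift)
{
    return ((uint64_t)sect + 1) << sector_shift;
}

/* Number of directory entries that fit in one Directory Sector. */
static uint64_t cfb_dir_entries_per_sector(unsigned sector_shift)
{
    return cfb_sector_size(sector_shift) / CFB_DIR_ENTRY_SIZE;
}
```

With a sector shift of 9, sector 3 starts at byte 2048 and each Directory Sector holds four entries; with a shift of 7 (128-byte sectors) exactly one entry fits.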

There are several types of sector that may be present in a CFBF:

More detail is given below for the header and each sector type.

CFBF Header format

The CFBF header occupies the first 512 bytes of the file and contains the information required to interpret the rest of the file. The C-style structure declaration below (extracted from the AAFA's Low-Level Container Specification) shows the members of the CFBF header and their purpose:

typedef unsigned long ULONG;    // 4 Bytes
typedef unsigned short USHORT;  // 2 Bytes
typedef short OFFSET;           // 2 Bytes
typedef ULONG SECT;             // 4 Bytes
typedef ULONG FSINDEX;          // 4 Bytes
typedef USHORT FSOFFSET;        // 2 Bytes
typedef USHORT WCHAR;           // 2 Bytes
typedef ULONG DFSIGNATURE;      // 4 Bytes
typedef unsigned char BYTE;     // 1 Byte
typedef unsigned short WORD;    // 2 Bytes
typedef unsigned long DWORD;    // 4 Bytes
typedef ULONG SID;              // 4 Bytes
typedef GUID CLSID;             // 16 Bytes

struct StructuredStorageHeader {
    // [offset from start (bytes), length (bytes)]
    BYTE _abSig[8];             // [00H,08] {0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1,
                                // 0x1a, 0xe1} for current version
    CLSID _clsid;               // [08H,16] reserved must be zero (WriteClassStg/
                                // GetClassFile uses root directory class id)
    USHORT _uMinorVersion;      // [18H,02] minor version of the format: 33 is
                                // written by reference implementation
    USHORT _uDllVersion;        // [1AH,02] major version of the dll/format: 3 for
                                // 512-byte sectors, 4 for 4 KB sectors
    USHORT _uByteOrder;         // [1CH,02] 0xFFFE: indicates Intel byte-ordering
    USHORT _uSectorShift;       // [1EH,02] size of sectors in power-of-two;
                                // typically 9 indicating 512-byte sectors
    USHORT _uMiniSectorShift;   // [20H,02] size of mini-sectors in power-of-two;
                                // typically 6 indicating 64-byte mini-sectors
    USHORT _usReserved;         // [22H,02] reserved, must be zero
    ULONG _ulReserved1;         // [24H,04] reserved, must be zero
    FSINDEX _csectDir;          // [28H,04] must be zero for 512-byte sectors,
                                // number of SECTs in directory chain for 4 KB
                                // sectors
    FSINDEX _csectFat;          // [2CH,04] number of SECTs in the FAT chain
    SECT _sectDirStart;         // [30H,04] first SECT in the directory chain
    DFSIGNATURE _signature;     // [34H,04] signature used for transactions; must
                                // be zero. The reference implementation
                                // does not support transactions
    ULONG _ulMiniSectorCutoff;  // [38H,04] maximum size for a mini stream;
                                // typically 4096 bytes
    SECT _sectMiniFatStart;     // [3CH,04] first SECT in the MiniFAT chain
    FSINDEX _csectMiniFat;      // [40H,04] number of SECTs in the MiniFAT chain
    SECT _sectDifStart;         // [44H,04] first SECT in the DIFAT chain
    FSINDEX _csectDif;          // [48H,04] number of SECTs in the DIFAT chain
    SECT _sectFat[109];         // [4CH,436] the SECTs of first 109 FAT sectors
};
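The declaration above assumes that unsigned long is 4 bytes wide, which does not hold on many 64-bit platforms, so reading the header by copying 512 bytes directly into the struct is not portable. The following sketch instead decodes a few fields from a raw byte buffer at the offsets listed in the declaration. The struct cfb_header_info, the helper names, and the choice of fields are hypothetical and only serve as illustration; the offsets, the signature bytes, and the 0xFFFE byte-order mark come from the declaration itself.

```c
#include <stdint.h>
#include <string.h>

/* Decode little-endian integers byte by byte, so the result does not
 * depend on host byte order, type widths, or struct padding. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical subset of the header fields (names are illustrative). */
struct cfb_header_info {
    uint16_t minor_version;     /* _uMinorVersion      at 0x18 */
    uint16_t major_version;     /* _uDllVersion        at 0x1A */
    uint16_t sector_shift;      /* _uSectorShift       at 0x1E */
    uint16_t mini_sector_shift; /* _uMiniSectorShift   at 0x20 */
    uint32_t fat_sector_count;  /* _csectFat           at 0x2C */
    uint32_t dir_start;         /* _sectDirStart       at 0x30 */
    uint32_t mini_cutoff;       /* _ulMiniSectorCutoff at 0x38 */
    uint32_t minifat_start;     /* _sectMiniFatStart   at 0x3C */
    uint32_t difat_start;       /* _sectDifStart       at 0x44 */
};

/* Returns 0 on success, -1 if the signature or byte-order mark is wrong. */
static int cfb_parse_header(const uint8_t buf[512], struct cfb_header_info *out)
{
    static const uint8_t sig[8] =
        {0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1};

    if (memcmp(buf, sig, sizeof sig) != 0)
        return -1;
    if (rd16(buf + 0x1C) != 0xFFFE)
        return -1;

    out->minor_version     = rd16(buf + 0x18);
    out->major_version     = rd16(buf + 0x1A);
    out->sector_shift      = rd16(buf + 0x1E);
    out->mini_sector_shift = rd16(buf + 0x20);
    out->fat_sector_count  = rd32(buf + 0x2C);
    out->dir_start         = rd32(buf + 0x30);
    out->mini_cutoff       = rd32(buf + 0x38);
    out->minifat_start     = rd32(buf + 0x3C);
    out->difat_start       = rd32(buf + 0x44);
    return 0;
}
```

A caller would read the first 512 bytes of the file into a buffer, pass it to cfb_parse_header, and treat a return value of -1 as "not a CFBF file". A fuller reader would also check the reserved fields and the sector shift against the major version.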

File Allocation Table (FAT) Sectors

When taken together as a single stream, the collection of FAT sectors defines the status and linkage of every sector in the file. Each entry in the FAT is 4 bytes long and contains either the sector number of the next sector in a FAT chain or one of the following special values:
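As an illustration of how a chain is followed, the sketch below walks the FAT from a starting sector until it reaches the end-of-chain marker. The constant values are those given in Microsoft's [MS-CFB] specification; the constant and function names are hypothetical, and the code assumes the FAT sectors have already been concatenated into one in-memory array of host-order 32-bit entries.

```c
#include <stddef.h>
#include <stdint.h>

/* Special FAT entry values, as listed in the [MS-CFB] specification. */
#define CFB_DIFSECT    0xFFFFFFFCu /* sector holds part of the DIFAT  */
#define CFB_FATSECT    0xFFFFFFFDu /* sector holds part of the FAT    */
#define CFB_ENDOFCHAIN 0xFFFFFFFEu /* last sector of a chain          */
#define CFB_FREESECT   0xFFFFFFFFu /* unallocated sector              */

/* Follow the chain that starts at sector `start`.
 * Stores up to `cap` sector numbers in `out` and returns the chain length.
 * Returns -1 if the chain is corrupt: an entry points outside the FAT
 * (which also catches FREESECT, FATSECT and DIFSECT inside a chain), or
 * the chain is longer than the FAT itself, which implies a cycle. */
static long cfb_walk_chain(const uint32_t *fat, size_t fat_len,
                           uint32_t start, uint32_t *out, size_t cap)
{
    size_t n = 0;
    uint32_t s = start;

    while (s != CFB_ENDOFCHAIN) {
        if (s >= fat_len || n >= fat_len)
            return -1;
        if (n < cap)
            out[n] = s;
        n++;
        s = fat[s]; /* the entry for sector s names the next sector */
    }
    return (long)n;
}
```

For a FAT of {FATSECT, 3, FREESECT, 4, ENDOFCHAIN, FREESECT}, the chain starting at sector 1 is 1, 3, 4. Bounding the walk by the FAT length is one simple way to guard against cyclic chains in damaged files.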

Range Lock Sector

The Range Lock Sector must exist in files larger than 2 GB, and must not exist in files smaller than 2 GB. The Range Lock Sector must contain the byte range 0x7FFFFF00 to 0x7FFFFFFF of the file. This area is reserved by Microsoft's COM implementation for storing byte-range locking information for concurrent access.
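Which sector number covers that byte range depends on the sector size. The following sketch derives it under the assumption that sector n starts at byte offset (n + 1) × sector size, i.e. that the header occupies the first sector-sized slot of the file; the names are hypothetical.

```c
#include <stdint.h>

/* First and last byte of the reserved range-lock area. */
#define CFB_RANGE_LOCK_FIRST 0x7FFFFF00u
#define CFB_RANGE_LOCK_LAST  0x7FFFFFFFu

/* Sector number whose data contains the range-lock bytes.
 * Assumption: sector n begins at offset (n + 1) << sector_shift.
 * Intended for the 512- and 4096-byte sector sizes (shift 9 or 12),
 * where the 256-byte range falls inside a single sector. */
static uint32_t cfb_range_lock_sector(unsigned sector_shift)
{
    return (CFB_RANGE_LOCK_FIRST >> sector_shift) - 1u;
}
```

For 4096-byte sectors this gives sector 0x7FFFE, which spans bytes 0x7FFFF000 to 0x7FFFFFFF; for 512-byte sectors it gives sector 0x3FFFFE.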

References

  1. "Apache POI – POIFS". POI Project. Archived from the original on 26 April 2011. Retrieved 10 May 2011.
  2. "How to convert documents between LibreOffice and Microsoft Office file formats on Linux". Archived from the original on 21 September 2019. Retrieved 25 November 2016.
  3. "Compound Files (Windows)". Microsoft Developers Network (MSDN) library – COM SDK. Microsoft Corporation. 20 November 2008. Retrieved 23 September 2009.
  4. "Containers: Compound Files". Microsoft Developers Network (MSDN) library – Visual Studio 2008 documentation. Microsoft Corporation. Retrieved 23 September 2009.
  5. "Understand Compound Files". Microsoft Developers Network (MSDN) library – ActiveDirectory Rights Management. 25 June 2009. Retrieved 23 September 2009.
  6. "AMW Association (formerly AAF Association)". Archived from the original on 15 August 2000.