Sparse file

Last updated August 21, 2024

In computer science, a sparse file is a type of computer file that attempts to use file system space more efficiently when the file itself is partially empty. This is achieved by writing brief information (metadata) representing the empty blocks to the data storage media instead of the actual "empty" space which makes up the block, thus consuming less storage space. The full block is written to the media as the actual size only when the block contains "real" (non-empty) data.

Most commonly, sparse files are created when blocks of the file are never written to. This is typical for random-access files like databases. Some operating systems or utilities go further by "sparsifying" files when writing or copying them: if a block contains only null bytes, it is not written to storage but rather marked as empty.

When reading sparse files, the file system transparently converts metadata representing empty blocks into "real" blocks filled with null bytes at runtime. The application is unaware of this conversion.
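The transparent conversion described above can be observed from the shell. The following is a minimal sketch, assuming GNU coreutils (`cmp -n` is a GNU extension) and a scratch file created by `mktemp`; the variable names are illustrative only. It writes four bytes at an offset of 1 MiB and then reads back the region before them, which was never written.

```shell
#!/bin/sh
# Scratch file (name chosen by mktemp; purely illustrative).
f=$(mktemp)

# Write the 4 bytes "data" at offset 1 MiB, leaving the first 1 MiB unwritten.
printf 'data' | dd of="$f" bs=1 seek=1048576 conv=notrunc 2>/dev/null

# Dump the first 16 bytes of the unwritten region as hex, without spaces.
head_hex=$(od -An -tx1 -N16 "$f" | tr -d ' \n')
echo "first 16 bytes: $head_hex"

# Compare the whole leading 1 MiB against zeros read from /dev/zero.
if cmp -s -n 1048576 "$f" /dev/zero; then
    hole_reads_as_zero=yes
else
    hole_reads_as_zero=no
fi
echo "hole reads as zeros: $hole_reads_as_zero"

# Apparent size: 1 MiB of hole plus the 4 bytes written.
size=$(wc -c < "$f")
echo "apparent size: $size bytes"

rm -f "$f"
```

Whether the unwritten region is actually stored as a hole depends on the file system; the data read back is the same either way.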

Most modern file systems support sparse files, including most Unix variants and NTFS.^[1] Apple's HFS+ does not support sparse files, but in OS X the virtual file system layer supports storing them in any supported file system, including HFS+.^[citation needed] Apple File System (APFS) also supports them.^[2] Sparse files are commonly used for disk images, database snapshots, log files, and in scientific applications.

Advantages

The advantage of sparse files is that storage space is only allocated when actually needed: storage capacity is conserved, and large files can sometimes be created even if the storage media does not have enough free space for the file's full size. This also reduces the time of the first write, as the system does not have to allocate blocks for the "skipped" space. If the initial allocation requires writing all zeros to the space, it also keeps the system from having to write over the "skipped" space twice.

For example, a virtual machine image with a maximum size of 100 GB that has 2 GB of files actually written would require the full 100 GB when backed by pre-allocated storage, yet only 2 GB as a sparse file. If the file system supports hole punching and the guest operating system issues TRIM commands, deleting files on the guest will accordingly reduce the space needed.
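The virtual machine example can be reproduced at a smaller scale. This sketch assumes GNU coreutils (`truncate`, and `du` with `--apparent-size`) and scales the figures down to a 100 MiB image with 2 MiB of data; the file and variable names are hypothetical.

```shell
#!/bin/sh
# Scaled-down stand-in for a VM image (100 MiB instead of 100 GB).
img=$(mktemp)

# Set the apparent size to 100 MiB without writing any data.
truncate -s 100M "$img"

# Write 2 MiB of non-zero data at the start, keeping the rest of the file.
dd if=/dev/urandom of="$img" bs=1M count=2 conv=notrunc 2>/dev/null

# Apparent size versus space actually occupied, both in KiB.
apparent_kib=$(du -k --apparent-size "$img" | cut -f1)
allocated_kib=$(du -k "$img" | cut -f1)
echo "apparent: ${apparent_kib} KiB, allocated: ${allocated_kib} KiB"

rm -f "$img"
```

On a file system that supports sparse files, the allocated figure should be close to the 2 MiB written rather than the 100 MiB apparent size; the exact number varies with the file system's block size and metadata accounting.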

Disadvantages

Sparse files have several disadvantages: they may become fragmented; file system free space reports may be misleading; filling up a file system that contains sparse files can have unexpected effects (such as disk-full or quota-exceeded errors when merely overwriting an existing portion of a file that happened to be sparse); and copying a sparse file with a program that does not explicitly support sparse files may copy the entire apparent size of the file, including the zero sections that are not allocated on the storage media, which loses the benefits of sparseness. Sparse files are also not fully supported by all backup software or applications. However, the VFS implementation sidesteps^[citation needed] the prior two disadvantages.

Loading sparse executables (exe or dll) on 32-bit Windows takes much longer, since the file cannot be memory mapped in the limited 4 GB address space and is not cached, as there is no code path for caching 32-bit sparse executables (Windows on 64-bit architectures can map sparse executables).^[citation needed] On NTFS, sparse files (or rather their non-zero areas) cannot be compressed: NTFS implements sparseness as a special kind of compression, so a file may be either sparse or compressed, but not both.

Sparse files in Unix

Sparse files are typically handled transparently to the user. But the differences between a normal file and sparse file become apparent in some situations.

Creation

The Unix command

dd of=sparse-file bs=5M seek=1 count=0

will create a file of five mebibytes in size, but with no data stored on the media (only metadata). (GNU dd has this behavior because it calls ftruncate to set the file size; other implementations may merely create an empty file.)

Similarly the truncate command may be used, if available:

truncate -s 5M <filename>

On Linux, an existing file can be converted to sparse by:

fallocate -d <filename>

There is no portable system call to punch holes; Linux provides fallocate(FALLOC_FL_PUNCH_HOLE), and Solaris provides fcntl(F_FREESP).
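On Linux, hole punching is also reachable from the shell through the util-linux `fallocate` utility. The sketch below assumes that utility is installed and that the underlying file system supports hole punching; if it does not, the command fails and the script only records that. File and variable names are illustrative, and `cmp -n` and `cmp -i` are GNU options.

```shell
#!/bin/sh
f=$(mktemp)

# 8 MiB of non-zero data, so the blocks are really allocated.
dd if=/dev/urandom of="$f" bs=1M count=8 2>/dev/null
before=$(du -k "$f" | cut -f1)

# Deallocate 4 MiB starting at offset 2 MiB. --punch-hole keeps the
# file size unchanged; it may fail on file systems without support.
if fallocate --punch-hole --offset 2MiB --length 4MiB "$f" 2>/dev/null; then
    punched=yes
else
    punched=no
fi
after=$(du -k "$f" | cut -f1)

# After a successful punch, the punched range reads back as null bytes.
zeroed=no
if [ "$punched" = yes ] && cmp -s -n 4194304 -i 2097152:0 "$f" /dev/zero; then
    zeroed=yes
fi

size=$(wc -c < "$f")
echo "punched=$punched size=$size before=${before}KiB after=${after}KiB zeroed=$zeroed"

rm -f "$f"
```

The apparent size stays at 8 MiB in either case; only the occupied space is expected to drop when the punch succeeds.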

Detection

The -s option of the ls command shows the occupied space in blocks.

ls -ls sparse-file

Alternatively, the du command prints the occupied space, while ls prints the apparent size. In some non-standard versions of du, such as GNU du, the option --block-size=1 prints the occupied space in bytes instead of blocks, so that it can be compared to the ls output:

du --block-size=1 sparse-file
ls -l sparse-file

The du invocation above can be abbreviated to "du -B 1 sparse-file". It is not equivalent to "du -b sparse-file": as stated in the du manual,^[3] -b, --bytes is equivalent to --apparent-size --block-size=1, so it prints the apparent size rather than the occupied space.

Also, the filefrag tool from the e2fsprogs package can be used to show block allocation details of the file.

filefrag -v sparse-file
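The comparison of occupied and apparent size described in this section can be scripted. The sketch below assumes GNU `stat` and its `%s`, `%b` and `%B` format sequences; the function name `is_sparse` is hypothetical, not a standard tool. It is only a heuristic: a file with small holes can still occupy at least its apparent size once rounded up to whole blocks, and file systems that compress data may report fewer blocks for files without holes.

```shell
#!/bin/sh
# is_sparse FILE: succeed if FILE occupies fewer bytes than its apparent size.
# Hypothetical helper; heuristic only (see the caveats above).
is_sparse() {
    # GNU stat: %s = apparent size in bytes, %b = blocks allocated,
    # %B = size in bytes of each block counted by %b.
    read -r size blocks bsize <<EOF
$(stat -c '%s %b %B' "$1")
EOF
    [ $((blocks * bsize)) -lt "$size" ]
}

# Usage: a 5 MiB file created with truncate has no data blocks written.
f=$(mktemp)
truncate -s 5M "$f"
if is_sparse "$f"; then
    result=sparse
else
    result=dense
fi
echo "$f is $result"
rm -f "$f"
```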

Copying

Normally the GNU version of cp is good at detecting whether a file is sparse, so

cp sparse-file new-file

creates new-file, which will be sparse. However, GNU cp does have a --sparse option.^[4] This is especially useful if a file containing long zero blocks is saved in a non-sparse way (i.e. the zero blocks have been written to the storage media in full). Storage space can be conserved by doing:

cp --sparse=always file1 file1_sparsed

Some cp implementations, like FreeBSD's cp, do not support the --sparse option and will always expand sparse files. A partially viable alternative on those systems is to use rsync with its own --sparse option^[5] instead of cp. Older versions of rsync cannot combine --sparse with --inplace;^[6]^[7] newer versions of rsync support the combination.^[8]

Via standard input, sparse file copying is achieved as follows:

cp --sparse=always /dev/fd/0 new-sparse-file < somefile
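A sparsifying copy should change only how the data is stored, not the data itself. The following sketch, assuming GNU cp and du and using hypothetical scratch files, checks that a copy made with --sparse=always is byte-for-byte identical to its source and prints the occupied space of both.

```shell
#!/bin/sh
src=$(mktemp)
dst=$(mktemp)

# 4 MiB of null bytes written out in full, so src is typically not sparse.
dd if=/dev/zero of="$src" bs=1M count=4 2>/dev/null

# Copy, turning runs of null bytes into holes where possible.
cp --sparse=always "$src" "$dst"

# The contents and the apparent size must be unchanged by the copy.
if cmp -s "$src" "$dst"; then
    same=yes
else
    same=no
fi
dst_size=$(wc -c < "$dst")

# Occupied space in KiB; the copy is expected to need less on most file systems.
src_kib=$(du -k "$src" | cut -f1)
dst_kib=$(du -k "$dst" | cut -f1)
echo "identical=$same size=$dst_size src=${src_kib}KiB dst=${dst_kib}KiB"

rm -f "$src" "$dst"
```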

References

[1] Giampaolo, Dominic (1999). Practical File System Design with the Be File System (PDF). Morgan Kaufmann Publishers. ISBN 9781558604971.
[2] "Apple File System Guide". Apple's Developer Site. Apple Inc. Retrieved 27 April 2017.
[3] "Du(1) – Linux manual page".
[4] Meyering, Jim (1995-12-21). "GNU coreutils/cp: Accept new option, --sparse={never,auto,always}, to control creation of sparse files". Retrieved 2016-06-17.
[5] Tridgell, Andrew (1996-06-29). "rsync: hard links, better sparse handling, FERROR and FINFO". Retrieved 2016-06-17.
[6] Tridgell, Andrew (2016-06-30). "rsync manpage". Retrieved 2017-01-19.
[7] Davison, Wayne (2005-08-30). "rsync: Reject attempts to combine --sparse with --inplace". Retrieved 2017-01-19.
[8] Davison, Wayne. "Support --sparse combined with --preallocate or --inplace".

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
