Journaling file system

Last updated

A journaling file system is a file system that keeps track of changes not yet committed to the file system's main part by recording the goal of such changes in a data structure known as a "journal", which is usually a circular log. In the event of a system crash or power failure, such file systems can be brought back online more quickly with a lower likelihood of becoming corrupted. [1] [2]

Contents

Depending on the actual implementation, a journaling file system may only keep track of stored metadata, resulting in improved performance at the expense of increased possibility for data corruption. Alternatively, a journaling file system may track both stored data and related metadata, while some implementations allow selectable behavior in this regard. [3]

History

In 1990 IBM introduced JFS in AIX 3.1 as one of the first UNIX commercial filesystems that implemented journaling. [4] The next year the idea was popularized in a widely cited paper on log-structured file systems. [5] This was subsequently implemented in Microsoft's Windows NT's NTFS filesystem in 1993 and in Linux's ext3 filesystem in 2001. [6]

Rationale

Updating file systems to reflect changes to files and directories usually requires many separate write operations. This makes it possible for an interruption (like a power failure or system crash) between writes to leave data structures in an invalid intermediate state. [1]

For example, deleting a file on a Unix file system involves three steps: [7]

  1. Removing its directory entry.
  2. Releasing the inode to the pool of free inodes.
  3. Returning all disk blocks to the pool of free disk blocks.

If a crash occurs after step 1 and before step 2, there will be an orphaned inode and hence a storage leak; if a crash occurs between steps 2 and 3, then the blocks previously used by the file cannot be used for new files, effectively decreasing the storage capacity of the file system. Re-arranging the steps does not help, either. If step 3 preceded step 1, a crash between them could allow the file's blocks to be reused for a new file, meaning the partially deleted file would contain part of the contents of another file, and modifications to either file would show up in both. On the other hand, if step 2 preceded step 1, a crash between them would cause the file to be inaccessible, despite appearing to exist.

Detecting and recovering from such inconsistencies normally requires a complete walk of its data structures, for example by a tool such as fsck (the file system checker). [2] This must typically be done before the file system is next mounted for read-write access. If the file system is large and if there is relatively little I/O bandwidth, this can take a long time and result in longer downtimes if it blocks the rest of the system from coming back online.

To prevent this, a journaled file system allocates a special area—the journal—in which it records the changes it will make ahead of time. After a crash, recovery simply involves reading the journal from the file system and replaying changes from this journal until the file system is consistent again. The changes are thus said to be atomic (not divisible) in that they either succeed (succeeded originally or are replayed completely during recovery), or are not replayed at all (are skipped because they had not yet been completely written to the journal before the crash occurred).

Techniques

Some file systems allow the journal to grow, shrink and be re-allocated just as a regular file, while others put the journal in a contiguous area or a hidden file that is guaranteed not to move or change size while the file system is mounted. Some file systems may also allow external journals on a separate device, such as a solid-state drive or battery-backed non-volatile RAM. Changes to the journal may themselves be journaled for additional redundancy, or the journal may be distributed across multiple physical volumes to protect against device failure.

The internal format of the journal must guard against crashes while the journal itself is being written to. Many journal implementations (such as the JBD2 layer in ext4) bracket every change logged with a checksum, on the understanding that a crash would leave a partially written change with a missing (or mismatched) checksum that can simply be ignored when replaying the journal at next remount.

Physical journals

A physical journal logs an advance copy of every block that will later be written to the main file system. If there is a crash when the main file system is being written to, the write can simply be replayed to completion when the file system is next mounted. If there is a crash when the write is being logged to the journal, the partial write will have a missing or mismatched checksum and can be ignored at next mount.

Physical journals impose a significant performance penalty because every changed block must be committed twice to storage, but may be acceptable when absolute fault protection is required. [8]

Logical journals

A logical journal stores only changes to file metadata in the journal, and trades fault tolerance for substantially better write performance. [9] A file system with a logical journal still recovers quickly after a crash, but may allow unjournaled file data and journaled metadata to fall out of sync with each other, causing data corruption.

For example, appending to a file may involve three separate writes to:

  1. The file's inode, to note in the file's metadata that its size has increased.
  2. The free space map, to mark out an allocation of space for the to-be-appended data.
  3. The newly allocated space, to actually write the appended data.

In a metadata-only journal, step 3 would not be logged. If step 3 was not done, but steps 1 and 2 are replayed during recovery, the file will be appended with garbage.

Write hazards

The write cache in most operating systems sorts its writes (using the elevator algorithm or some similar scheme) to maximize throughput. To avoid an out-of-order write hazard with a metadata-only journal, writes for file data must be sorted so that they are committed to storage before their associated metadata. This can be tricky to implement because it requires coordination within the operating system kernel between the file system driver and write cache. An out-of-order write hazard can also occur if a device cannot write blocks immediately to its underlying storage, that is, it cannot flush its write-cache to disk due to deferred write being enabled.

To complicate matters, many mass storage devices have their own write caches, in which they may aggressively reorder writes for better performance. (This is particularly common on magnetic hard drives, which have large seek latencies that can be minimized with elevator sorting.) Some journaling file systems conservatively assume such write-reordering always takes place, and sacrifice performance for correctness by forcing the device to flush its cache at certain points in the journal (called barriers in ext3 and ext4). [10]

Alternatives

Soft updates

Some UFS implementations avoid journaling and instead implement soft updates: they order their writes in such a way that the on-disk file system is never inconsistent, or that the only inconsistency that can be created in the event of a crash is a storage leak. To recover from these leaks, the free space map is reconciled against a full walk of the file system at next mount. This garbage collection is usually done in the background. [11]

Log-structured file systems

In log-structured file systems, the write-twice penalty does not apply because the journal itself is the file system: it occupies the entire storage device and is structured so that it can be traversed as would a normal file system.

Copy-on-write file systems

Full copy-on-write file systems (such as ZFS and Btrfs) avoid in-place changes to file data by writing out the data in newly allocated blocks, followed by updated metadata that would point to the new data and disown the old, followed by metadata pointing to that, and so on up to the superblock, or the root of the file system hierarchy. This has the same correctness-preserving properties as a journal, without the write-twice overhead.

See also

Related Research Articles

XFS is a high-performance 64-bit journaling file system created by Silicon Graphics, Inc (SGI) in 1993. It was the default file system in SGI's IRIX operating system starting with its version 5.3. XFS was ported to the Linux kernel in 2001; as of June 2014, XFS is supported by most Linux distributions; Red Hat Enterprise Linux uses it as default filesystem.

The ext2 or second extended file system is a file system for the Linux kernel. It was initially designed by French software developer Rémy Card as a replacement for the extended file system (ext). Having been designed according to the same principles as the Berkeley Fast File System from BSD, it was the first commercial-grade filesystem for Linux.

ext3, or third extended filesystem, is a journaled file system that is commonly used by the Linux kernel. It used to be the default file system for many popular Linux distributions. Stephen Tweedie first revealed that he was working on extending ext2 in Journaling the Linux ext2fs Filesystem in a 1998 paper, and later in a February 1999 kernel mailing list posting. The filesystem was merged with the mainline Linux kernel in November 2001 from 2.4.15 onward. Its main advantage over ext2 is journaling, which improves reliability and eliminates the need to check the file system after an unclean shutdown. Its successor is ext4.

The Unix file system (UFS) is a file system supported by many Unix and Unix-like operating systems. It is a distant descendant of the original filesystem used by Version 7 Unix.

A log-structured filesystem is a file system in which data and metadata are written sequentially to a circular buffer, called a log. The design was first proposed in 1988 by John K. Ousterhout and Fred Douglis and first implemented in 1992 by Ousterhout and Mendel Rosenblum for the Unix-like Sprite distributed operating system.

The Andrew File System (AFS) is a distributed file system which uses a set of trusted servers to present a homogeneous, location-transparent file name space to all the client workstations. It was developed by Carnegie Mellon University as part of the Andrew Project. Originally named "Vice", "Andrew" refers to Andrew Carnegie and Andrew Mellon. Its primary use is in distributed computing.

In computing, the Global File System 2 or GFS2 is a shared-disk file system for Linux computer clusters. GFS2 allows all members of a cluster to have direct concurrent access to the same shared block storage, in contrast to distributed file systems which distribute data throughout the cluster. GFS2 can also be used as a local file system on a single computer.

File system Format or program for storing files and directories

In computing, file system or filesystem is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one large body of data with no way to tell where one piece of data stopped and the next began, or where any piece of data was located when it was time to retrieve it. By separating the data into pieces and giving each piece a name, the data is easily isolated and identified. Taking its name from the way a paper-based data management system is named, each group of data is called a "file." The structure and logic rules used to manage the groups of data and their names is called a "file system."

The Write Anywhere File Layout (WAFL) is a proprietary file system that supports large, high-performance RAID arrays, quick restarts without lengthy consistency checks in the event of a crash or power failure, and growing the filesystems size quickly. It was designed by NetApp for use in its storage appliances like NetApp FAS, AFF, Cloud Volumes ONTAP and ONTAP Select.

Lustre is a type of parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster. Lustre file system software is available under the GNU General Public License and provides high performance file systems for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site systems. Since June 2005, Lustre has consistently been used by at least half of the top ten, and more than 60 of the top 100 fastest supercomputers in the world, including the world's No. 1 ranked TOP500 supercomputer in June 2020, Fugaku, as well as previous top supercomputers such as Titan and Sequoia.

NILFS or NILFS2 is a log-structured file system implementation for the Linux kernel. It was developed by Nippon Telegraph and Telephone Corporation (NTT) CyberSpace Laboratories and a community from all over the world. NILFS was released under the terms of the GNU General Public License (GPL).

sync is a standard system call in the Unix operating system, which commits all data in the kernel filesystem to non-volatile storage buffers, i.e., data which has been scheduled for writing via low-level I/O system calls. Higher-level I/O layers such as stdio may maintain separate buffers of their own.

The following tables compare general and technical information for a number of file systems.

The ext4 journaling file system or fourth extended filesystem is a journaling file system for Linux, developed as the successor to ext3.

Btrfs is a computer storage format that combines a file system based on the copy-on-write (COW) principle with a logical volume manager, developed together. It was initially designed at Oracle Corporation in 2007 for use in Linux, and since November 2013, the file system's on-disk format has been declared stable in the Linux kernel. According to Oracle, Btrfs "is not a true acronym".

Next3 is a journaling file system for Linux based on ext3 which adds snapshots support, yet retains compatibility to the ext3 on-disk format. Next3 is implemented as open-source software, licensed under the GPL license.

The NOVA file system is an open-source, log-structured file system for byte-addressable persistent memory for Linux.

ZFS File system

ZFS combines a file system with a volume manager. It began as part of the Sun Microsystems Solaris operating system in 2001. Large parts of Solaris – including ZFS – were published under an open source license as OpenSolaris for around 5 years from 2005, before being placed under a closed source license when Oracle Corporation acquired Sun in 2009/2010. During 2005 to 2010, the open source version of ZFS was ported to Linux, Mac OS X and FreeBSD. In 2010, the illumos project forked a recent version of OpenSolaris, to continue its development as an open source project, including ZFS. In 2013, OpenZFS was founded to coordinate the development of open source ZFS. OpenZFS maintains and manages the core ZFS code, while organizations using ZFS maintain the specific code and validation processes required for ZFS to integrate within their systems. OpenZFS is widely used in Unix-like systems.

References

  1. 1 2 Jones, M Tim (June 4, 2008), Anatomy of Linux journaling file systems, IBM DeveloperWorks, archived from the original on February 21, 2009, retrieved April 13, 2009
  2. 1 2 Arpaci-Dusseau, Remzi H.; Arpaci-Dusseau, Andrea C. (January 21, 2014), Crash Consistency: FSCK and Journaling (PDF), Arpaci-Dusseau Books, archived (PDF) from the original on January 24, 2014, retrieved January 22, 2014
  3. "tune2fs(8) – Linux man page". linux.die.net. Archived from the original on February 25, 2015. Retrieved February 20, 2015.
  4. Chang, A.; Mergen, M.F.; Rader, R.K.; Roberts, J.A.; Porter, S.L. (January 1990), "Evolution of storage facilities in AIX Version 3 for RISC System/6000 processors" (PDF), IBM Journal of Research and Development, 34:1: 105–109, doi:10.1147/rd.341.0105
  5. Rosembaum, Mendel; Ousterhout, John (February 1991). The Design and Implementation of a Log-Structure File System (PDF). ACM 13th Annual Symposium on Operating Systems Principles.
  6. "'2.4.15-final' - MARC". marc.info. Retrieved March 24, 2018.
  7. File Systems from Tanenbaum, A.S. (2008). Modern operating systems (3rd ed., pp. 287). Upper Saddle River, NJ: Prentice Hall.
  8. Tweedie, Stephen (2000), "Ext3, journaling filesystem", Proceedings of the Ottawa Linux Symposium: 24–29
  9. Prabhakaran, Vijayan; Arpaci-Dusseau, Andrea C; Arpaci-Dusseau, Remzi H, "Analysis and Evolution of Journaling File Systems" (PDF), 2005 USENIX Annual Technical Conference, USENIX Association, archived (PDF) from the original on September 26, 2007, retrieved July 27, 2007.
  10. Corbet, Jonathan (May 21, 2008), Barriers and journaling filesystems, archived from the original on March 14, 2010, retrieved March 6, 2010
  11. Seltzer, Margo I; Ganger, Gregory R; McKusick, M Kirk, "Journaling Versus Soft Updates: Asynchronous Meta-data Protection in File Systems", 2000 USENIX Annual Technical Conference, USENIX Association, archived from the original on October 26, 2007, retrieved July 27, 2007.