Sync (Unix)

sync is a standard system call in the Unix operating system that commits all data in the kernel filesystem buffers to non-volatile storage, i.e., data that has been scheduled for writing via low-level I/O system calls. Higher-level I/O layers such as stdio may maintain separate buffers of their own, which sync does not flush.

As a function in C, the sync() call is typically declared as void sync(void) in <unistd.h>. The system call is also exposed through a command-line utility of the same name, and through similarly named functions in other languages such as Perl and Node.js (in the fs module).

The related system call fsync() commits just the buffered data relating to a specified file descriptor. [1] fdatasync() is also available to write out just the changes made to the data in the file, and not necessarily the file's related metadata. [2]
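The layering described above (user-space buffers first, then the kernel's buffers for one file descriptor) can be sketched as follows. This is a minimal illustration using Python's os.fsync() and os.fdatasync() wrappers rather than the C calls themselves; the function name write_durably and the file name example.dat are hypothetical, and os.fdatasync() is not available on every platform, so the sketch falls back to os.fsync().

```python
import os
import tempfile


def write_durably(path, payload, data_only=False):
    """Write payload to path, then ask the kernel to commit it to storage."""
    with open(path, "wb") as f:
        f.write(payload)
        # Step 1: move the data from the user-space buffer into the kernel.
        f.flush()
        # Step 2: commit the kernel's buffered data for this descriptor only.
        if data_only and hasattr(os, "fdatasync"):
            os.fdatasync(f.fileno())  # file data, not necessarily all metadata
        else:
            os.fsync(f.fileno())      # file data and related metadata


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.dat")  # hypothetical file name
        write_durably(target, b"first version\n")
        write_durably(target, b"second version\n", data_only=True)
```

Skipping the flush step would leave the payload in the user-space buffer, where neither fsync() nor fdatasync() can see it.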

Some Unix systems run a kind of flush or update daemon, which calls the sync function on a regular basis. On some systems, the cron daemon does this. On Linux it was handled by the pdflush daemon, which was replaced by a new implementation and finally removed from the Linux kernel in 2012. [3] Buffers are also flushed when filesystems are unmounted or remounted read-only, [4] for example prior to system shutdown.

Database use

To provide proper durability, databases need to use some form of sync to make sure the information written has reached non-volatile storage rather than just being held in a memory-based write cache that would be lost if power failed. PostgreSQL, for example, may use a variety of different sync calls, including fsync() and fdatasync(), [5] in order for commits to be durable. [6] Unfortunately, for any single client writing a series of records, a rotating hard drive can only commit once per rotation, which makes for at best a few hundred such commits per second. [7] Turning off the fsync requirement can therefore greatly improve commit performance, but at the expense of potentially introducing database corruption after a crash.
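The once-per-rotation bound can be made concrete with a little arithmetic. The sketch below assumes exactly one commit per platter rotation, as stated above, and ignores seek time and any caching; the spindle speeds are common illustrative values, not figures taken from the cited sources.

```python
def max_commits_per_second(rpm):
    """Upper bound on serial commits if each one waits for a full rotation."""
    return rpm / 60.0  # rotations per minute -> rotations per second


# Illustrative spindle speeds (assumed, not from the cited sources).
for rpm in (5400, 7200, 15000):
    print(f"{rpm} rpm: at most {max_commits_per_second(rpm):.0f} commits/s")
# 5400 rpm: at most 90 commits/s
# 7200 rpm: at most 120 commits/s
# 15000 rpm: at most 250 commits/s
```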

Databases also employ transaction log files (typically much smaller than the main data files) that have information about recent changes, such that changes can be reliably redone in case of crash; then the main data files can be synced less often.
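The division of labour between a small, frequently synced log and a larger, rarely synced data file can be sketched as below. This is a toy illustration under stated assumptions, not how any particular database implements its log: the class TinyStore, the file names redo.log and data.json, and the JSON record format are all hypothetical, and details a real system needs (such as syncing the containing directory) are left out.

```python
import json
import os


class TinyStore:
    """Hypothetical key-value store with a redo log (illustration only)."""

    def __init__(self, directory):
        self.data_path = os.path.join(directory, "data.json")
        self.log_path = os.path.join(directory, "redo.log")
        self.state = {}
        self._recover()
        self.log = open(self.log_path, "ab")

    def _recover(self):
        # Load the last checkpoint, then redo every change found in the log.
        if os.path.exists(self.data_path):
            with open(self.data_path, "r", encoding="utf-8") as f:
                self.state = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path, "r", encoding="utf-8") as f:
                for line in f:
                    try:
                        key, value = json.loads(line)
                    except ValueError:
                        break  # incomplete last record, e.g. after a crash
                    self.state[key] = value

    def put(self, key, value):
        # The change counts as committed once its log record is synced.
        self.log.write((json.dumps([key, value]) + "\n").encode("utf-8"))
        self.log.flush()
        os.fsync(self.log.fileno())
        self.state[key] = value

    def checkpoint(self):
        # Sync the (larger) main data file only occasionally.
        tmp_path = self.data_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.data_path)
        # The logged changes are now in the data file, so empty the log.
        self.log.truncate(0)
        os.fsync(self.log.fileno())

    def close(self):
        self.log.close()


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        store = TinyStore(tmp)
        store.put("answer", "42")
        store.close()            # no checkpoint was taken
        reopened = TinyStore(tmp)
        print(reopened.state)    # {'answer': '42'}, rebuilt from the log
        reopened.close()
```

Each put() syncs only a short log record, while the full data file is synced only in checkpoint(); replaying the log on startup redoes any changes made since the last checkpoint.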

Error reporting and checking

To avoid data loss, the return value of fsync() should be checked. When I/O operations are buffered by the library or the kernel, errors may not be reported at the time of the write() system call or the fflush() call, because the data may have been written only to the in-memory page cache and not yet to non-volatile storage. Errors from writes are instead often reported during system calls to fsync(), msync() or close(). [8] Prior to 2018, Linux's fsync() failed to report error status under certain circumstances; [9] [10] a change in behavior was proposed on 23 April 2018. [11]
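A minimal sketch of this checking discipline follows, again using Python's os wrappers as a stand-in for the C calls. Python turns a failing return value into an OSError exception, so "checking the return value" here means not swallowing that exception. The function name write_and_sync is hypothetical.

```python
import os


def write_and_sync(path, payload):
    """Write payload; raise OSError if write, fsync, or close reports failure."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # write() may accept fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # A successful write() only means the kernel accepted the data;
        # deferred I/O errors may surface at fsync() instead.
        os.fsync(fd)
    except OSError:
        # Release the descriptor, but report the original error.
        try:
            os.close(fd)
        except OSError:
            pass
        raise
    # close() can also report a write error, so let it propagate.
    os.close(fd)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        try:
            write_and_sync(os.path.join(tmp, "record.bin"), b"payload")
        except OSError as exc:
            print("data may not have reached storage:", exc)
```

A caller that sees the exception should treat the data as not durably stored rather than assume an earlier successful write() settled the matter.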

Performance controversies

Hard disks may default to using their own volatile write cache to buffer writes, which greatly improves performance while introducing a potential for lost writes. [12] Tools such as hdparm -F will instruct the HDD controller to flush the on-drive write cache buffer. The performance impact of turning caching off is so large that even the normally conservative FreeBSD community rejected disabling write caching by default in FreeBSD 4.3. [13]

In SCSI and in SATA with Native Command Queuing (but not in plain ATA, even with TCQ) the host can specify whether it wants to be notified of completion when the data hits the disk's platters or when it hits the disk's buffer (on-board cache). Assuming a correct hardware implementation, this feature allows the disk's on-board cache to be used while guaranteeing correct semantics for system calls like fsync. [14] This hardware feature is called Force Unit Access (FUA) and it allows consistency with less overhead than flushing the entire cache as done for ATA (or SATA non-NCQ) disks. [15] Although Linux enabled NCQ around 2007, it did not enable SATA/NCQ FUA until 2012, citing lack of support in the early drives. [16] [17]

Firefox 3.0, released in 2008, introduced fsync system calls that were found to degrade its performance; the calls were introduced to guarantee the integrity of the embedded SQLite database. [18] Linux Foundation chief technical officer Theodore Ts'o claims there is no need to "fear fsync", and that the real cause of the Firefox 3 slowdown is the excessive use of fsync. [19] He also concedes, however (quoting Mike Shaver), that

On some rather common Linux configurations, especially using the ext3 filesystem in the "data=ordered" mode, calling fsync doesn't just flush out the data for the file it's called on, but rather on all the buffered data for that filesystem. [20]

References

  1. fsync specification
  2. fdatasync specification
  3. "R.I.P. Pdflush [LWN.net]".
  4. "mount - Does umount calls sync to complete any pending writes". Unix & Linux Stack Exchange. Retrieved 2021-05-02.
  5. Vondra, Tomas (2 February 2019). "PostgreSQL vs. fsync". Osuosl Org. Archived from the original (mp4) on 10 February 2019. Retrieved 10 February 2019.
  6. PostgreSQL Reliability and the Write-Ahead Log
  7. Tuning PostgreSQL WAL Synchronization Archived 2009-11-25 at the Wayback Machine
  8. "Ensuring data reaches disk [LWN.net]".
  9. "PostgreSQL's fsync() surprise [LWN.net]".
  10. "Improved block-layer error handling [LWN.net]".
  11. "Always report a writeback error once - Patchwork". Archived from the original on 2018-05-04. Retrieved 2018-05-03.
  12. Write-Cache Enabled?
  13. FreeBSD Handbook — Tuning Disks
  14. Marshall Kirk McKusick. "Disks from the Perspective of a File System - ACM Queue". Queue.acm.org. Retrieved 2014-01-11.
  15. Gregory Smith (2010). PostgreSQL 9.0: High Performance. Packt Publishing Ltd. p. 78. ISBN 978-1-84951-031-8.
  16. "Enabling FUA for SATA drives (Was Re: [RFC][PATCH] libata: Enable SATA disk fua detection on default) (Linux SCSI)".
  17. "Linux-Kernel Archive: [PATCH RFC] libata: FUA updates".
  18. "Shaver » fsyncers and curveballs". Archived from the original on 2012-12-09. Retrieved 2009-10-15.
  19. "Don't fear the fsync!".
  20. "Delayed allocation and the zero-length file problem".