ZFS

Developer(s): Sun Microsystems originally, Oracle Corporation since 2010, OpenZFS since 2013
Variants: Oracle ZFS, OpenZFS
Introduced: November 2005 (with OpenSolaris)
Structures
Directory contents: Extendible hashing
Limits
Max volume size: 256 trillion yobibytes (2^128 bytes) [1]
Max file size: 16 exbibytes (2^64 bytes)
Max no. of files
  • Per directory: 2^48
  • Per file system: unlimited [1]
Max filename length: 1023 ASCII characters (fewer for multibyte character standards such as Unicode)
Features
Forks: Yes (called "extended attributes", but they are full-fledged streams)
Attributes: POSIX, extended attributes
File system permissions: Unix permissions, NFSv4 ACLs
Transparent compression: Yes
Transparent encryption: Yes
Data deduplication: Yes
Copy-on-write: Yes

ZFS (previously Zettabyte File System) is a file system with volume management capabilities. It began as part of the Sun Microsystems Solaris operating system in 2001. Large parts of Solaris, including ZFS, were published under an open source license as OpenSolaris for around 5 years from 2005 before being placed under a closed source license when Oracle Corporation acquired Sun in 2009–2010. Between 2005 and 2010, the open source version of ZFS was ported to Linux, Mac OS X (continued as MacZFS) and FreeBSD. In 2010, the illumos project forked a recent version of OpenSolaris, including ZFS, to continue its development as an open source project. In 2013, OpenZFS was founded to coordinate the development of open source ZFS. [3] [4] [5] OpenZFS maintains and manages the core ZFS code, while organizations using ZFS maintain the specific code and validation processes required for ZFS to integrate within their systems. OpenZFS is widely used in Unix-like systems. [6] [7] [8]

Overview

The management of stored data generally involves two aspects: the physical volume management of one or more block storage devices (such as hard drives and SD cards), including their organization into logical block devices as VDEVs (ZFS Virtual Device) [9] as seen by the operating system (often involving a volume manager, RAID controller, array manager, or suitable device driver); and the management of data and files that are stored on these logical block devices (a file system or other data storage).

Example: A RAID array of 2 hard drives and an SSD caching disk is controlled by Intel's RST system, part of the chipset and firmware built into a desktop computer. The Windows user sees this as a single volume, containing an NTFS-formatted drive of their data, and NTFS is not necessarily aware of the manipulations that may be required (such as reading from/writing to the cache drive or rebuilding the RAID array if a disk fails). The management of the individual devices and their presentation as a single device is distinct from the management of the files held on that apparent device.

ZFS is unusual because, unlike most other storage systems, it unifies both of these roles and acts as both the volume manager and the file system. Therefore, it has complete knowledge of both the physical disks and volumes (including their status, condition, and logical arrangement into volumes) as well as of all the files stored on them. ZFS is designed to ensure (subject to sufficient data redundancy) that data stored on disks cannot be lost due to physical errors, misprocessing by the hardware or operating system, or bit rot events and data corruption that may happen over time. Its complete control of the storage system is used to ensure that every step, whether related to file management or disk management, is verified, confirmed, corrected if needed, and optimized, in a way that the storage controller cards and separate volume and file systems cannot achieve.

ZFS also includes a mechanism for dataset and pool-level snapshots and replication, including snapshot cloning, which is described by the FreeBSD documentation as one of its "most powerful features" with functionality that "even other file systems with snapshot functionality lack". [10] Very large numbers of snapshots can be taken without degrading performance, allowing snapshots to be used prior to risky system operations and software changes, or an entire production ("live") file system to be fully snapshotted several times an hour in order to mitigate data loss due to user error or malicious activity. Snapshots can be rolled back "live" or previous file system states can be viewed, even on very large file systems, leading to savings in comparison to formal backup and restore processes. [10] Snapshots can also be cloned to form new independent file systems. ZFS also has the ability to take a pool level snapshot (known as a "checkpoint"), which allows rollback of operations that may affect the entire pool's structure or that add or remove entire datasets.

History

2004-2010: Development at Sun Microsystems

In 1987, AT&T Corporation and Sun announced that they were collaborating on a project to merge the most popular Unix variants on the market at that time: Berkeley Software Distribution, UNIX System V, and Xenix. This became Unix System V Release 4 (SVR4). [11] The project was released under the name Solaris, which became the successor to SunOS 4 (although SunOS 4.1.x micro releases were retroactively named Solaris 1). [12]

ZFS was designed and implemented by a team at Sun led by Jeff Bonwick, Bill Moore, [13] and Matthew Ahrens. It was announced on September 14, 2004, [14] but development started in 2001. [15] Source code for ZFS was integrated into the main trunk of Solaris development on October 31, 2005 [16] and released for developers as part of build 27 of OpenSolaris on November 16, 2005. In June 2006, Sun announced that ZFS was included in the mainstream 6/06 update to Solaris 10. [17]

Solaris was originally developed as proprietary software, but Sun Microsystems was an early commercial proponent of open source software and in June 2005 released most of the Solaris codebase under the CDDL license and founded the OpenSolaris open-source project. [18] In Solaris 10 6/06 ("U2"), Sun added the ZFS file system and frequently updated ZFS with new features during the next 5 years. ZFS was ported to Linux, Mac OS X (continued as MacZFS), and FreeBSD, under this open source license.

The name at one point was said to stand for "Zettabyte File System", [19] but by 2006, the name was no longer considered to be an abbreviation. [20] A ZFS file system can store up to 256 quadrillion zettabytes (ZB).

In September 2007, NetApp sued Sun, claiming that ZFS infringed some of NetApp's patents on Write Anywhere File Layout. Sun counter-sued in October the same year claiming the opposite. The lawsuits were ended in 2010 with an undisclosed settlement. [21]

2010-current: Development at Oracle, OpenZFS

Ported versions of ZFS began to appear in 2005. After the Sun acquisition by Oracle in 2010, Oracle's version of ZFS became closed source, and the development of open-source versions proceeded independently, coordinated by OpenZFS from 2013.

Features

Summary

Examples of features specific to ZFS include:

  • Designed for long-term data storage and for datastores that scale indefinitely, with zero data loss and high configurability.
  • Hierarchical checksumming of all data and metadata, ensuring that the entire storage system can be verified on use, and confirmed to be correctly stored, or remedied if corrupt. Checksums are stored with a block's parent block, rather than with the block itself. This contrasts with many file systems where checksums (if held) are stored with the data so that if the data is lost or corrupt, the checksum is also likely to be lost or incorrect.
  • Can store a user-specified number of copies of data or metadata, or selected types of data, to improve the ability to recover from data corruption of important files and structures.
  • Automatic rollback of recent changes to the file system and data, in some circumstances, in the event of an error or inconsistency.
  • Automated and (usually) silent self-healing of data inconsistencies and write failure when detected, for all errors where the data is capable of reconstruction. Data can be reconstructed using all of the following: error detection and correction checksums stored in each block's parent block; multiple copies of data (including checksums) held on the disk; write intentions logged on the SLOG (ZIL) for writes that should have occurred but did not occur (after a power failure); parity data from RAID/RAID-Z disks and volumes; copies of data from mirrored disks and volumes.
  • Native handling of standard RAID levels and additional ZFS RAID layouts ("RAID-Z"). The RAID-Z levels stripe data across only the disks required, for efficiency (many RAID systems stripe indiscriminately across all devices), and checksumming allows rebuilding of inconsistent or corrupted data to be minimized to those blocks with defects.
  • Native handling of tiered storage and caching devices, which is usually a volume-related task. Because ZFS also understands the file system, it can use file-related knowledge to inform, integrate, and optimize its tiered storage handling, which a separate device cannot.
  • Native handling of snapshots and backup/replication which can be made efficient by integrating the volume and file handling. Relevant tools are provided at a low level and require external scripts and software for utilization.
  • Native data compression and deduplication, although the latter is largely handled in RAM and is memory hungry.
  • Efficient rebuilding of RAID arrays: a RAID controller often has to rebuild an entire disk, but ZFS can combine disk and file knowledge to limit any rebuilding to data which is actually missing or corrupt, greatly speeding up rebuilding.
  • Unaffected by RAID hardware changes which affect many other systems. On many systems, if self-contained RAID hardware such as a RAID card fails, or the data is moved to another RAID system, the file system will lack information that was on the original RAID hardware, which is needed to manage data on the RAID array. This can lead to a total loss of data unless near-identical hardware can be acquired and used as a "stepping stone". Since ZFS manages RAID itself, a ZFS pool can be migrated to other hardware, or the operating system can be reinstalled, and the RAID-Z structures and data will be recognized and immediately accessible by ZFS again.
  • Ability to identify data that would have been found in a cache but has been discarded recently instead; this allows ZFS to reassess its caching decisions in light of later use and facilitates very high cache-hit levels (ZFS cache hit rates are typically over 80%).
  • Alternative caching strategies can be used for data that would otherwise cause delays in data handling. For example, synchronous writes which are capable of slowing down the storage system can be converted to asynchronous writes by being written to a fast separate caching device, known as the SLOG (sometimes called the ZIL – ZFS Intent Log).
  • Highly tunable—many internal parameters can be configured for optimal functionality.
  • Can be used for high availability clusters and computing, although not fully designed for this use.

Data integrity

One major feature that distinguishes ZFS from other file systems is that it is designed with a focus on data integrity by protecting the user's data on disk against silent data corruption caused by data degradation, power surges (voltage spikes), bugs in disk firmware, phantom writes (the previous write did not make it to disk), misdirected reads/writes (the disk accesses the wrong block), DMA parity errors between the array and server memory or from the driver (since the checksum validates data inside the array), driver errors (data winds up in the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file system), etc.

A 1999 study showed that neither the then-major and widespread filesystems (such as UFS, Ext, [22] XFS, JFS, or NTFS) nor hardware RAID (which has some issues with data integrity) provided sufficient protection against data corruption problems. [23] [24] [25] [26] Initial research indicates that ZFS protects data better than earlier efforts. [27] [28] It is also faster than UFS [29] [30] and can be seen as its replacement.

Within ZFS, data integrity is achieved by using a Fletcher-based checksum or a SHA-256 hash throughout the file system tree. [31] Each block of data is checksummed and the checksum value is then saved in the pointer to that block—rather than at the actual block itself. Next, the block pointer is checksummed, with the value being saved at its pointer. This checksumming continues all the way up the file system's data hierarchy to the root node, which is also checksummed, thus creating a Merkle tree. [31] In-flight data corruption or phantom reads/writes (the data written/read checksums correctly but is actually wrong) are undetectable by most filesystems as they store the checksum with the data. ZFS stores the checksum of each block in its parent block pointer so that the entire pool self-validates. [31]
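The parent-stored checksum arrangement described above can be illustrated with a minimal Python sketch. It is not ZFS's on-disk format: the `BlockPointer` class, the 40-byte pointer encoding, and the dictionary standing in for a disk are all hypothetical, and SHA-256 is used simply because it is one of the checksum choices named in the text.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    """SHA-256 digest of a block."""
    return hashlib.sha256(data).digest()

class BlockPointer:
    """Hypothetical pointer: where a block lives and what it should hash to."""
    def __init__(self, address: int, expected: bytes):
        self.address = address
        self.expected = expected

def write_block(disk: dict, address: int, data: bytes) -> BlockPointer:
    """Store a block and return a pointer that carries its checksum."""
    disk[address] = data
    return BlockPointer(address, checksum(data))

def read_block(disk: dict, ptr: BlockPointer) -> bytes:
    """Read a block and verify it against the checksum held by its parent."""
    data = disk[ptr.address]
    if checksum(data) != ptr.expected:
        raise IOError(f"checksum mismatch at block {ptr.address}")
    return data

def pack(ptrs: list) -> bytes:
    """Serialise pointers (8-byte address + 32-byte checksum) into a parent block."""
    return b"".join(p.address.to_bytes(8, "big") + p.expected for p in ptrs)

def unpack(raw: bytes) -> list:
    """Inverse of pack()."""
    return [BlockPointer(int.from_bytes(raw[i:i + 8], "big"), raw[i + 8:i + 40])
            for i in range(0, len(raw), 40)]

def read_all(disk: dict, root: BlockPointer) -> list:
    """Walk from the root: the parent is verified first, then each child."""
    parent = read_block(disk, root)
    return [read_block(disk, p) for p in unpack(parent)]

# Two data blocks, one parent block holding their pointers, and a root pointer.
disk = {}
leaves = [write_block(disk, 1, b"hello"), write_block(disk, 2, b"world")]
root = write_block(disk, 0, pack(leaves))
```

Because each checksum lives one level above the block it covers, silently changing `disk[2]` makes `read_all(disk, root)` raise an error, even though the damaged block itself carries no indication that anything is wrong.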

When a block is accessed, regardless of whether it is data or metadata, its checksum is calculated and compared with the stored checksum value of what it "should" be. If the checksums match, the data is passed up the programming stack to the process that asked for it; if the values do not match, ZFS can heal the data if the storage pool provides data redundancy (such as with internal mirroring), assuming that the redundant copy is undamaged and its checksum matches. [32] It is optionally possible to provide additional in-pool redundancy by specifying copies=2 (or copies=3), which means that data will be stored twice (or three times) on the disk, effectively halving (or, for copies=3, reducing to one-third) the storage capacity of the disk. [33] Additionally, some kinds of data used by ZFS to manage the pool are stored multiple times by default for safety even with the default copies=1 setting.

If other copies of the damaged data exist or can be reconstructed from checksums and parity data, ZFS will use a copy of the data (or recreate it via a RAID recovery mechanism) and recalculate the checksum—ideally resulting in the reproduction of the originally expected value. If the data passes this integrity check, the system can then update all faulty copies with known-good data and redundancy will be restored.
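The repair step can be sketched as follows for the simplest case, a mirror. This is an illustrative model only: `self_healing_read` is a hypothetical name, each mirror side is a plain dictionary, and the expected checksum is assumed to come from the parent block pointer as described earlier.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def self_healing_read(mirrors: list, address: int, expected: bytes) -> bytes:
    """Return a verified copy of a block and repair any bad copies.

    mirrors  -- list of dicts, one per mirror side (address -> bytes)
    expected -- checksum taken from the parent block pointer
    """
    good = None
    for side in mirrors:
        candidate = side.get(address, b"")
        if checksum(candidate) == expected:
            good = candidate
            break
    if good is None:
        # No redundant copy survives, so nothing can be healed.
        raise IOError("no intact copy available")
    for side in mirrors:
        if side.get(address) != good:
            side[address] = good  # overwrite the faulty copy with known-good data
    return good
```

For example, with two mirror sides where the first holds a corrupted block, the call returns the good data from the second side and rewrites the first, after which both sides match again.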

If there are no copies of the damaged data, ZFS puts the pool in a faulted state, [34] preventing its future use and providing no documented ways to recover pool contents.

Consistency of data held in memory, such as cached data in the ARC, is not checked by default, as ZFS is expected to run on enterprise-quality hardware with error correcting RAM. However, the capability to check in-memory data exists and can be enabled using "debug flags". [35]

RAID

For ZFS to be able to guarantee data integrity, it needs multiple copies of the data or parity information, usually spread across multiple disks. This is typically achieved by using either a RAID controller or so-called "soft" RAID (built into a file system).

Avoidance of hardware RAID controllers

While ZFS can work with hardware RAID devices, it will usually work more efficiently and with greater data protection if it has raw access to all storage devices. ZFS relies on the disk for an honest view to determine the moment data is confirmed as safely written and has numerous algorithms designed to optimize its use of caching, cache flushing, and disk handling.

Disks connected to the system using a hardware, firmware, other "soft" RAID, or any other controller that modifies the ZFS-to-disk I/O path will affect ZFS performance and data integrity. If a third-party device performs caching or presents drives to ZFS as a single system without the low level view ZFS relies upon, there is a much greater chance that the system will perform less optimally and that ZFS will be less likely to prevent failures, recover from failures more slowly, or lose data due to a write failure. For example, if a hardware RAID card is used, ZFS may not be able to determine the condition of disks, determine if the RAID array is degraded or rebuilding, detect all data corruption, place data optimally across the disks, make selective repairs, control how repairs are balanced with ongoing use, or make repairs that ZFS could usually undertake. The hardware RAID card will interfere with ZFS' algorithms. RAID controllers also usually add controller-dependent data to the drives which prevents software RAID from accessing the user data. In the case of a hardware RAID controller failure, it may be possible to read the data with another compatible controller, but this isn't always possible and a replacement may not be available. Alternate hardware RAID controllers may not understand the original manufacturer's custom data required to manage and restore an array.

Unlike most other systems where RAID cards or similar hardware can offload resources and processing to enhance performance and reliability, with ZFS it is strongly recommended that these methods not be used as they typically reduce the system's performance and reliability.

If disks must be attached through a RAID or other controller, it is recommended to minimize the amount of processing done in the controller by using a plain HBA (host adapter) or a simple fanout card, or by configuring the card in JBOD mode (i.e. turning off RAID and caching functions), to allow devices to be attached with minimal changes in the ZFS-to-disk I/O pathway. A RAID card in JBOD mode may still interfere if it has a cache or, depending upon its design, may detach drives that do not respond in time (as has been seen with many energy-efficient consumer-grade hard drives), and as such, may require Time-Limited Error Recovery (TLER)/CCTL/ERC-enabled drives to prevent drive dropouts, so not all cards are suitable even with RAID functions disabled. [36]

ZFS's approach: RAID-Z and mirroring

Instead of hardware RAID, ZFS employs "soft" RAID, offering RAID-Z (parity based like RAID 5 and similar) and disk mirroring (similar to RAID 1). The schemes are highly flexible.

RAID-Z is a data/parity distribution scheme like RAID-5, but uses dynamic stripe width: every block is its own RAID stripe, regardless of blocksize, resulting in every RAID-Z write being a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, eliminates the write hole error. RAID-Z is also faster than traditional RAID 5 because it does not need to perform the usual read-modify-write sequence. [37]

As all stripes are of different sizes, RAID-Z reconstruction has to traverse the filesystem metadata to determine the actual RAID-Z geometry. This would be impossible if the filesystem and the RAID array were separate products, whereas it becomes feasible when there is an integrated view of the logical and physical structure of the data. Going through the metadata means that ZFS can validate every block against its 256-bit checksum as it goes, whereas traditional RAID products usually cannot do this. [37]

In addition to handling whole-disk failures, RAID-Z can also detect and correct silent data corruption, offering "self-healing data": when reading a RAID-Z block, ZFS compares it against its checksum, and if the data disks did not return the right answer, ZFS reads the parity and then figures out which disk returned bad data. Then, it repairs the damaged data and returns good data to the requestor. [37]
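The "figure out which disk returned bad data" step can be illustrated with single XOR parity. The sketch below is a simplification and not the RAID-Z implementation: it assumes equal-sized columns, one parity column, and a hypothetical `read_stripe` helper that tries each data disk in turn as the suspect and accepts the first reconstruction whose checksum matches.

```python
import hashlib
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length columns."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity_of(columns: list) -> bytes:
    """Single parity column: XOR of all data columns."""
    return reduce(xor, columns)

def block_checksum(columns: list) -> bytes:
    """Checksum of the logical block (all data columns concatenated)."""
    return hashlib.sha256(b"".join(columns)).digest()

def read_stripe(columns: list, parity: bytes, expected: bytes) -> list:
    """Return verified data columns, rebuilding one bad column if needed."""
    if block_checksum(columns) == expected:
        return list(columns)  # data disks returned the right answer
    for bad in range(len(columns)):
        # Assume column `bad` is wrong and rebuild it from the others plus parity.
        others = [c for i, c in enumerate(columns) if i != bad]
        rebuilt = reduce(xor, others, parity)
        candidate = columns[:bad] + [rebuilt] + columns[bad + 1:]
        if block_checksum(candidate) == expected:
            return candidate  # this identifies the disk that returned bad data
    raise IOError("block cannot be reconstructed with single parity")
```

The checksum is what makes this possible: parity alone can show that a stripe is inconsistent, but not which column is at fault.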

RAID-Z and mirroring do not require any special hardware: they do not need NVRAM for reliability, and they do not need write buffering for good performance or data protection. With RAID-Z, ZFS provides fast, reliable storage using cheap, commodity disks. [37]

There are five different RAID-Z modes: striping (similar to RAID 0, offers no redundancy), RAID-Z1 (similar to RAID 5, allows one disk to fail), RAID-Z2 (similar to RAID 6, allows two disks to fail), RAID-Z3 (a RAID 7 [a] configuration, allows three disks to fail), and mirroring (similar to RAID 1, allows all but one disk to fail). [39]
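The trade-off between the five layouts can be summarised with some simple arithmetic. The following sketch is a rough model that counts whole disks only; it ignores padding, metadata, and reserved space, and the layout names used as dictionary keys are labels chosen for this example.

```python
# Parity disks per layout; a mirror is handled separately below.
PARITY_DISKS = {"stripe": 0, "raidz1": 1, "raidz2": 2, "raidz3": 3}

def usable_disks(layout: str, n: int) -> int:
    """Approximate number of disks' worth of usable capacity in one vdev."""
    if layout == "mirror":
        return 1  # every disk holds the same data
    parity = PARITY_DISKS[layout]
    if n <= parity:
        raise ValueError(f"{layout} needs more than {parity} disks")
    return n - parity

def tolerated_failures(layout: str, n: int) -> int:
    """How many disks in the vdev can fail without losing data."""
    if layout == "mirror":
        return n - 1  # all but one disk may fail
    return PARITY_DISKS[layout]

for layout in ("stripe", "raidz1", "raidz2", "raidz3", "mirror"):
    print(layout, usable_disks(layout, 6), tolerated_failures(layout, 6))
```

With six disks, for instance, RAID-Z2 keeps roughly four disks of usable space and survives two failures, while a six-way mirror keeps one disk of space and survives five.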

The need for RAID-Z3 arose in the early 2000s as multi-terabyte capacity drives became more common. This increase in capacity—without a corresponding increase in throughput speeds—meant that rebuilding an array due to a failed drive could "easily take weeks or months" to complete. [38] During this time, the older disks in the array will be stressed by the additional workload, which could result in data corruption or drive failure. By increasing parity, RAID-Z3 reduces the chance of data loss by simply increasing redundancy. [40]

Resilvering and scrub (array syncing and integrity checking)

ZFS has no tool equivalent to fsck (the standard Unix and Linux data checking and repair tool for file systems). [41] Instead, ZFS has a built-in scrub function which regularly examines all data and repairs silent corruption and other problems. Some differences are:

  • fsck must be run on an offline filesystem, which means the filesystem must be unmounted and is not usable while being repaired, while scrub is designed to be used on a mounted, live filesystem, and does not need the ZFS filesystem to be taken offline.
  • fsck usually only checks metadata (such as the journal log) but never checks the data itself. This means, after an fsck, the data might still not match the original data as stored.
  • fsck cannot always validate and repair data when checksums are stored with data (often the case in many file systems), because the checksums may also be corrupted or unreadable. ZFS always stores checksums separately from the data they verify, improving reliability and the ability of scrub to repair the volume. ZFS also stores multiple copies of data—metadata, in particular, may have upwards of 4 or 6 copies (multiple copies per disk and multiple disk mirrors per volume), greatly improving the ability of scrub to detect and repair extensive damage to the volume, compared to fsck.
  • scrub checks everything, including metadata and the data. The effect can be observed by comparing fsck to scrub times—sometimes a fsck on a large RAID completes in a few minutes, which means only the metadata was checked. Traversing all metadata and data on a large RAID takes many hours, which is exactly what scrub does.
  • While fsck detects and tries to fix errors using available filesystem data, scrub relies on redundancy to recover from issues. While fsck offers to fix the file system with partial data loss, scrub puts it into a faulted state if there is no redundancy. [34]

The official recommendation from Sun/Oracle is to scrub enterprise-level disks once a month, and cheaper commodity disks once a week. [42] [43]

Capacity

ZFS is a 128-bit file system, [44] [16] so it can address 1.84 × 10^19 times more data than 64-bit systems such as Btrfs. The maximum limits of ZFS are designed to be so large that they should never be encountered in practice. For instance, fully populating a single zpool with 2^128 bits of data would require 3 × 10^24 TB hard disk drives. [45]
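The ratio between 128-bit and 64-bit addressing, and the binary-prefixed size limits quoted for ZFS, can be checked with exact integer arithmetic. In this short sketch "trillion" is read in the binary sense (2^40), which is an interpretation made here so that the figures agree with the stated powers of two.

```python
# 128-bit versus 64-bit address space: the ratio is itself 2**64.
ratio = 2**128 // 2**64
print(f"{ratio:.2e}")  # 1.84e+19

EXBIBYTE = 2**60
YOBIBYTE = 2**80

# 16 exbibytes is exactly 2**64 bytes (maximum file size).
max_file_size = 16 * EXBIBYTE

# 256 * 2**40 yobibytes is exactly 2**128 bytes (maximum volume size).
max_volume_size = 256 * 2**40 * YOBIBYTE
```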

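Some theoretical limits in ZFS are:

  • Maximum volume size: 2^128 bytes (256 trillion yobibytes)
  • Maximum size of a single file: 2^64 bytes (16 exbibytes)
  • Maximum number of entries in a single directory: 2^48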

Encryption

With Oracle Solaris, the encryption capability in ZFS [47] is embedded into the I/O pipeline. During writes, a block may be compressed, encrypted, checksummed and then deduplicated, in that order. The policy for encryption is set at the dataset level when datasets (file systems or ZVOLs) are created. The wrapping keys provided by the user/administrator can be changed at any time without taking the file system offline. The default behaviour is for the wrapping key to be inherited by any child data sets. The data encryption keys are randomly generated at dataset creation time. Only descendant datasets (snapshots and clones) share data encryption keys. [48] A command to switch to a new data encryption key for the clone or at any time is provided—this does not re-encrypt already existing data, instead utilising an encrypted master-key mechanism.

As of 2019, the encryption feature is also fully integrated into OpenZFS 0.8.0, which is available for Debian and Ubuntu Linux distributions. [49]

There have been anecdotal end-user reports of failures when using ZFS native encryption. An exact cause has not been established. [50] [51]

Read/write efficiency

ZFS will automatically allocate data storage across all vdevs in a pool (and all devices in each vdev) in a way that generally maximises the performance of the pool. ZFS will also update its write strategy to take account of new disks added to a pool, when they are added.

As a general rule, ZFS allocates writes across vdevs based on the free space in each vdev. This ensures that vdevs which already hold proportionately less data are given more writes when new data is to be stored. This helps to ensure that, as the pool fills, some vdevs do not become full and force writes onto a limited number of devices. It also means that when data is read (and reads are much more frequent than writes in most uses), different parts of the data can be read from as many disks as possible at the same time, giving much higher read performance. Therefore, as a general rule, pools and vdevs should be managed, and new storage added, so that a pool does not end up with some vdevs almost full and others almost empty, as this will make the pool less efficient.
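The general rule of weighting writes by free space can be modelled as a proportional split. This is a deliberately simplified sketch and not the actual ZFS allocator, which takes further factors into account; the `allocate` function and the vdev names are hypothetical.

```python
def allocate(free: dict, nbytes: int) -> dict:
    """Split a write of nbytes across vdevs in proportion to their free space.

    free -- mapping of vdev name to free bytes
    Returns a mapping of vdev name to bytes assigned.
    """
    total = sum(free.values())
    if nbytes > total:
        raise ValueError("not enough free space in the pool")
    if nbytes == 0:
        return {name: 0 for name in free}
    # Integer share proportional to each vdev's free space.
    shares = {name: nbytes * f // total for name, f in free.items()}
    remainder = nbytes - sum(shares.values())
    # Hand any rounding leftovers to the vdevs with the most space still free.
    for name in sorted(free, key=lambda n: free[n] - shares[n], reverse=True):
        if remainder == 0:
            break
        give = min(remainder, free[name] - shares[name])
        shares[name] += give
        remainder -= give
    return shares
```

With `free = {"vdev0": 100, "vdev1": 300}`, a 40-byte write is split 10 to 30, so the emptier vdev receives three times as much and the two converge towards the same fill level over time.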

Free space in ZFS tends to become fragmented with usage. ZFS does not have a mechanism for defragmenting free space. There are anecdotal end-user reports of diminished performance when high free-space fragmentation is coupled with disk space over-utilization. [52] [53]


Other features

Storage devices, spares, and quotas

Pools can have hot spares to compensate for failing disks. When mirroring, block devices can be grouped according to physical chassis, so that the filesystem can continue in the case of the failure of an entire chassis.

Storage pool composition is not limited to similar devices, but can consist of ad-hoc, heterogeneous collections of devices, which ZFS seamlessly pools together, subsequently doling out space to datasets (file system instances or ZVOLs) as needed. Arbitrary storage device types can be added to existing pools to expand their size. [54]

The storage capacity of all vdevs is available to all of the file system instances in the zpool. A quota can be set to limit the amount of space a file system instance can occupy, and a reservation can be set to guarantee that space will be available to a file system instance.

Caching mechanisms: ARC, L2ARC, Transaction groups, ZIL, SLOG, Special VDEV

ZFS uses different layers of disk cache to speed up read and write operations. Ideally, all data should be stored in RAM, but that is usually too expensive. Therefore, data is automatically cached in a hierarchy to optimize performance versus cost; [55] these are often called "hybrid storage pools". [56] Frequently accessed data will be stored in RAM, and less frequently accessed data can be stored on slower media, such as solid-state drives (SSDs). Data that is not often accessed is not cached and left on the slow hard drives. If old data is suddenly read a lot, ZFS will automatically move it to SSDs or to RAM.

ZFS caching mechanisms include one each for reads and writes, and in each case, two levels of caching can exist, one in computer memory (RAM) and one on fast storage (usually solid-state drives (SSDs)), for a total of four caches.

First level cache (stored in RAM)

  • Read cache: Known as ARC, due to its use of a variant of the adaptive replacement cache (ARC) algorithm. RAM will always be used for caching, thus this level is always present. The efficiency of the ARC algorithm means that disks will often not need to be accessed, provided the ARC size is sufficiently large. If RAM is too small there will hardly be any ARC at all; in this case, ZFS always needs to access the underlying disks, which impacts performance considerably.
  • Write cache: Handled by means of "transaction groups": writes are collated over a short period (typically 5 to 30 seconds) up to a given limit, with each group being written to disk ideally while the next group is being collated. This allows writes to be organized more efficiently for the underlying disks at the risk of minor data loss of the most recent transactions upon power interruption or hardware fault. In practice the power loss risk is avoided by ZFS write journaling and by the SLOG/ZIL second tier write cache pool (see below), so writes will only be lost if a write failure happens at the same time as a total loss of the second tier SLOG pool, and then only when settings related to synchronous writing and SLOG use are set in a way that would allow such a situation to arise. If data is received faster than it can be written, data receipt is paused until the disks can catch up.

Second level cache and intent log (stored on fast storage devices, which can be added or removed from a "live" system without disruption in current versions of ZFS, although not always in older versions)

  • Read cache: Known as L2ARC ("Level 2 ARC"), optional. ZFS will cache as much data in L2ARC as it can. L2ARC will also considerably speed up deduplication if the entire deduplication table can be cached in L2ARC. It can take several hours to fully populate the L2ARC from empty (before ZFS has decided which data are "hot" and should be cached). If the L2ARC device is lost, all reads will go out to the disks, which slows down performance, but nothing else will happen (no data will be lost).
  • Write cache: Known as SLOG or ZIL ("ZFS Intent Log"); the terms are often used incorrectly. A SLOG (secondary log device) is an optional dedicated cache on a separate device, for recording writes, in the event of a system issue. If an SLOG device exists, it will be used for the ZFS Intent Log as a second level log, and if no separate cache device is provided, the ZIL will be created on the main storage devices instead. The SLOG thus, technically, refers to the dedicated disk to which the ZIL is offloaded, in order to speed up the pool. Strictly speaking, ZFS does not use the SLOG device to cache its disk writes. Rather, it uses SLOG to ensure writes are captured to a permanent storage medium as quickly as possible, so that in the event of power loss or write failure, no data which was acknowledged as written will be lost. The SLOG device allows ZFS to speedily store writes and quickly report them as written, even for storage devices such as HDDs that are much slower. In the normal course of activity, the SLOG is never referred to or read, and it does not act as a cache; its purpose is to safeguard data in flight during the few seconds taken for collation and "writing out", in case the eventual write were to fail.

If all goes well, then the storage pool will be updated at some point within the next 5 to 60 seconds, when the current transaction group is written out to disk (see above), at which point the saved writes on the SLOG will simply be ignored and overwritten. If the write eventually fails, or the system suffers a crash or fault preventing its writing, then ZFS can identify all the writes that it has confirmed were written, by reading back the SLOG (the only time it is read from), and use this to completely repair the data loss.

This becomes crucial if a large number of synchronous writes take place (such as with ESXi, NFS and some databases), [57] where the client requires confirmation of successful writing before continuing its activity; the SLOG allows ZFS to confirm writing is successful much more quickly than if it had to write to the main store every time, without the risk involved in misleading the client as to the state of data storage. If there is no SLOG device then part of the main data pool will be used for the same purpose, although this is slower.

If the log device itself is lost, it is possible to lose the latest writes; the log device should therefore be mirrored. In earlier versions of ZFS, loss of the log device could result in loss of the entire zpool, although this is no longer the case. One should therefore upgrade ZFS if planning to use a separate log device.

A number of other caches, cache divisions, and queues also exist within ZFS. For example, each VDEV has its own data cache, and the ARC cache is divided between data stored by the user and metadata used by ZFS, with control over the balance between these.

Special VDEV Class

In OpenZFS 0.8 and later, it is possible to configure a Special VDEV class to preferentially store filesystem metadata, and optionally the Data Deduplication Table (DDT) and small filesystem blocks. [58] This makes it possible, for example, to create a Special VDEV on fast solid-state storage to hold the metadata, while the regular file data is stored on spinning disks. This speeds up metadata-intensive operations such as filesystem traversal, scrub, and resilver, without the expense of storing the entire filesystem on solid-state storage.

Copy-on-write transactional model

ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a 256-bit checksum or 256-bit hash (currently a choice between Fletcher-2, Fletcher-4, or SHA-256) [59] of the target block, which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and the ZIL (intent log) is used when synchronous write semantics are required. The blocks are arranged in a tree, as are their checksums, forming a hash tree (see Merkle signature scheme).
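The interplay of checksummed block pointers and copy-on-write can be sketched in a few lines. The following is an illustration under simplifying assumptions, not the ZFS on-disk format: `BlockStore`, `write_parent` and `update_child` are hypothetical names, the tree has only two levels, and SHA-256 is used throughout.

```python
import hashlib
import json


class BlockStore:
    """Append-only toy store: existing blocks are never rewritten by write()."""

    def __init__(self):
        self.blocks = []

    def write(self, payload: bytes):
        # Allocate a new block and return a pointer: (address, checksum of target).
        self.blocks.append(payload)
        return (len(self.blocks) - 1, hashlib.sha256(payload).hexdigest())

    def read(self, pointer):
        address, checksum = pointer
        payload = self.blocks[address]
        # Verify the block against the checksum held in the pointer.
        if hashlib.sha256(payload).hexdigest() != checksum:
            raise ValueError(f"checksum mismatch at block {address}")
        return payload


def write_parent(store, child_pointers):
    # A parent block is just the serialized list of its children's pointers.
    return store.write(json.dumps(child_pointers).encode())


def update_child(store, parent_pointer, index, payload):
    # Copy-on-write: write the new child, then a new parent that references it.
    children = json.loads(store.read(parent_pointer))
    children[index] = store.write(payload)
    return write_parent(store, children)


store = BlockStore()
root_v1 = write_parent(store, [store.write(b"alpha"), store.write(b"beta")])
root_v2 = update_child(store, root_v1, 1, b"gamma")

old_children = json.loads(store.read(root_v1))
new_children = json.loads(store.read(root_v2))
print(store.read(old_children[1]), store.read(new_children[1]))  # b'beta' b'gamma'
```

Because the parent's checksum covers the pointers (and therefore the checksums) of its children, verifying from the root downward validates the whole tree. The old root remains readable after the update, which is what makes the snapshots discussed below cheap.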

Snapshots and clones

An advantage of copy-on-write is that, when ZFS writes new data, the blocks containing the old data can be retained, allowing a snapshot version of the file system to be maintained. ZFS snapshots are consistent (they reflect the entire data as it existed at a single point in time) and can be created extremely quickly, since all the data composing the snapshot is already stored; the entire storage pool is often snapshotted several times per hour. They are also space efficient, since any unchanged data is shared among the file system and its snapshots. Snapshots are inherently read-only, ensuring they will not be modified after creation, although they should not be relied on as a sole means of backup. Entire snapshots can be restored, as can individual files and directories within them.

Writeable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist. This is an implementation of the copy-on-write principle.
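Block sharing between a file system, a snapshot and a clone can be illustrated with a toy model. This is a sketch only: the names `block_store`, `put`, `live`, `snapshot` and `clone` are hypothetical, and the snapshot here copies a small table of block references, whereas ZFS simply retains the old tree of blocks.

```python
from types import MappingProxyType

block_store = {}  # block id -> contents; blocks are never modified once written


def put(data: bytes) -> int:
    # Always allocate a fresh block id (copy-on-write: no overwrite in place).
    block_id = len(block_store)
    block_store[block_id] = data
    return block_id


# A "file system" is modelled as a mapping from file name to block id.
live = {"a.txt": put(b"one"), "b.txt": put(b"two")}

# Snapshot: a read-only view of the block references at this moment.
snapshot = MappingProxyType(dict(live))

# Changing the live file system allocates a new block; the old one is retained.
live["a.txt"] = put(b"ONE")

# Clone: a writable file system that starts from the snapshot's references.
clone = dict(snapshot)
clone["b.txt"] = put(b"TWO")

shared_with_live = set(snapshot.values()) & set(live.values())
shared_with_clone = set(snapshot.values()) & set(clone.values())
print(block_store[snapshot["a.txt"]], shared_with_live, shared_with_clone)
```

The snapshot still resolves `a.txt` to the original contents, and each of the live file system and the clone continues to share exactly the blocks it has not changed.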

Sending and receiving snapshots

ZFS file systems can be moved to other pools, including pools on remote hosts over the network, as the send command creates a stream representation of the file system's state. This stream can either describe the complete contents of the file system at a given snapshot, or it can be a delta between snapshots. Computing the delta stream is very efficient, and its size depends on the number of blocks changed between the snapshots. This provides an efficient strategy, e.g., for synchronizing offsite backups or high availability mirrors of a pool.
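The idea of a delta stream can be sketched as follows. This is an illustration, not the actual send stream format: `make_delta` and `receive` are hypothetical helpers, snapshots are modelled as dictionaries from block number to contents, and the sketch compares contents directly, which a real implementation need not do.

```python
def make_delta(old_snapshot, new_snapshot):
    # Blocks that are new or whose contents changed since the old snapshot.
    changed = {
        block: data
        for block, data in new_snapshot.items()
        if old_snapshot.get(block) != data
    }
    # Blocks that existed in the old snapshot but were freed afterwards.
    freed = [block for block in old_snapshot if block not in new_snapshot]
    return {"changed": changed, "freed": freed}


def receive(target_snapshot, delta):
    # Apply a delta to a copy of the receiver's state and return the result.
    result = dict(target_snapshot)
    result.update(delta["changed"])
    for block in delta["freed"]:
        result.pop(block, None)
    return result


snap1 = {0: b"boot", 1: b"data-v1", 2: b"temp"}
snap2 = {0: b"boot", 1: b"data-v2", 3: b"new"}

delta = make_delta(snap1, snap2)
backup = receive(snap1, delta)   # the remote side already holds snap1
print(len(delta["changed"]), delta["freed"], backup == snap2)
```

The delta carries only the two changed blocks and one freed block, not the unchanged block 0, which matches the statement that the stream size depends on the number of blocks changed between the snapshots.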

Dynamic striping

Dynamic striping across all devices to maximize throughput means that as additional devices are added to the zpool, the stripe width automatically expands to include them; thus, all disks in a pool are used, which balances the write load across them. [60]

Variable block sizes

ZFS uses variable-sized blocks, with 128 KB as the default size. Available features allow the administrator to tune the maximum block size which is used, as certain workloads do not perform well with large blocks. If data compression is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve IO throughput (though at the cost of increased CPU use for the compression and decompression operations). [61]
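The effect of compression on the space a block occupies can be sketched as below. This is a simplified model under stated assumptions: on-disk allocations are rounded up to a 512-byte unit (an assumed figure), `zlib` stands in for the gzip option, and `allocated_size` and `store_block` are hypothetical helpers rather than ZFS functions.

```python
import os
import zlib

SECTOR = 512               # assumed allocation unit, for illustration only
MAX_BLOCK = 128 * 1024     # the default block size mentioned above


def allocated_size(n_bytes: int, sector: int = SECTOR) -> int:
    # Round up to a whole number of allocation units.
    return -(-n_bytes // sector) * sector


def store_block(data: bytes):
    # Keep the compressed form only if it actually occupies less space on disk.
    if len(data) > MAX_BLOCK:
        raise ValueError("block exceeds the maximum block size")
    compressed = zlib.compress(data)
    raw_size = allocated_size(len(data))
    compressed_size = allocated_size(len(compressed))
    if compressed_size < raw_size:
        return "compressed", compressed_size
    return "raw", raw_size


print(store_block(bytes(MAX_BLOCK)))        # highly compressible: all zero bytes
print(store_block(os.urandom(MAX_BLOCK)))   # effectively incompressible data
```

Compressible data ends up in a much smaller allocation, while incompressible data is stored at its original size, at the cost of the CPU time spent attempting the compression.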

Lightweight filesystem creation

In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation within a traditional filesystem; the time and effort required to create or expand a ZFS filesystem is closer to that of making a new directory than it is to volume manipulation in some other systems.[ citation needed ]

Adaptive endianness

Pools and their associated ZFS file systems can be moved between different platform architectures, including systems implementing different byte orders. The ZFS block pointer format stores filesystem metadata in an endian-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness does not match the endianness of the system, the metadata is byte-swapped in memory.

This does not affect the stored data; as is usual in POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
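The endian-adaptive scheme can be illustrated with a small sketch. The one-byte order flag and the helpers `write_meta` and `read_meta` are hypothetical; they are not the ZFS block pointer layout, only a model of "write in native order, swap on read if the orders differ".

```python
import struct
import sys


def write_meta(value: int, writer_order: str) -> bytes:
    # Store a flag recording the writer's byte order, then the value in that order.
    flag = b"L" if writer_order == "little" else b"B"
    fmt = "<Q" if writer_order == "little" else ">Q"
    return flag + struct.pack(fmt, value)


def read_meta(raw: bytes, host_order: str = sys.byteorder) -> int:
    stored_order = "little" if raw[:1] == b"L" else "big"
    # Load the 8 bytes as the host would natively interpret them.
    native_fmt = "<Q" if host_order == "little" else ">Q"
    (value,) = struct.unpack(native_fmt, raw[1:])
    if stored_order != host_order:
        # Byte-swap in memory when the stored order differs from the host's.
        value = int.from_bytes(value.to_bytes(8, "little"), "big")
    return value


record = write_meta(0x0102030405060708, "big")
print(hex(read_meta(record, "little")), hex(read_meta(record, "big")))
```

A reader whose byte order matches the writer's pays no conversion cost; only mismatched readers perform the swap, and file contents themselves are never touched.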

Deduplication

Data deduplication capabilities were added to the ZFS source repository at the end of October 2009, [62] and relevant OpenSolaris ZFS development packages have been available since December 3, 2009 (build 128).

Effective use of deduplication may require large RAM capacity; recommendations range between 1 and 5 GB of RAM for every TB of storage. [63] [64] [65] An accurate assessment of the memory required for deduplication is made by referring to the number of unique blocks in the pool, and the number of bytes on disk and in RAM ("core") required to store each record—these figures are reported by inbuilt commands such as zpool and zdb. Insufficient physical memory or lack of ZFS cache can result in virtual memory thrashing when using deduplication, which can cause performance to plummet, or result in complete memory starvation.[ citation needed ] Because deduplication occurs at write-time, it is also very CPU-intensive and this can also significantly slow down a system.
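The assessment described above amounts to multiplying the number of unique blocks by the in-core size of each deduplication table entry. The sketch below shows the arithmetic only; the figure of 320 bytes per entry is an assumed placeholder, not a value from this article, and the real per-entry sizes should be taken from the output of the tools mentioned above. The function name `ddt_ram_bytes` is hypothetical.

```python
def ddt_ram_bytes(unique_blocks: int, bytes_per_entry_in_core: int) -> int:
    # RAM needed to hold the whole deduplication table in memory.
    return unique_blocks * bytes_per_entry_in_core


TIB = 2**40
GIB = 2**30

# Example: 1 TiB of unique data stored in 128 KiB blocks.
unique_blocks = TIB // (128 * 1024)

# ASSUMPTION: 320 bytes of RAM per table entry (placeholder value).
estimate = ddt_ram_bytes(unique_blocks, 320)
print(unique_blocks, estimate / GIB)  # 8388608 2.5
```

Under these assumptions the table needs 2.5 GiB of RAM per TiB of unique data, which falls within the 1 to 5 GB per TB range quoted above. Smaller block sizes increase the number of unique blocks and therefore the memory required.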

Other storage vendors use modified versions of ZFS to achieve very high data compression ratios. Two examples in 2012 were GreenBytes [66] and Tegile. [67] In May 2014, Oracle bought GreenBytes for its ZFS deduplication and replication technology. [68]

As described above, deduplication is usually not recommended due to its heavy resource requirements (especially RAM) and impact on performance (especially when writing), other than in specific circumstances where the system and data are well-suited to this space-saving technique.

Additional capabilities

  • Explicit I/O priority with deadline scheduling.[ citation needed ]
  • Claimed globally optimal I/O sorting and aggregation.[ citation needed ]
  • Multiple independent prefetch streams with automatic length and stride detection.[ citation needed ]
  • Parallel, constant-time directory operations.[ citation needed ]
  • End-to-end checksumming, using a kind of "Data Integrity Field", allowing data corruption detection (and recovery if you have redundancy in the pool). A choice of 3 hashes can be used, optimized for speed (fletcher), standardization and security (SHA256) and salted hashes (Skein). [69]
  • Transparent filesystem compression. Supports LZJB, gzip, [70] LZ4 and Zstd.
  • Intelligent scrubbing and resilvering (resyncing). [71]
  • Load and space usage sharing among disks in the pool. [72]
  • Ditto blocks: Configurable data replication per filesystem, with zero, one or two extra copies requested per write for user data, and with that same base number of copies plus one or two for metadata (according to metadata importance). [73] If the pool has several devices, ZFS tries to replicate over different devices. Ditto blocks are primarily an additional protection against corrupted sectors, not against total disk failure. [74]
  • ZFS design (copy-on-write + superblocks) is safe when using disks with write cache enabled, if they honor the write barriers.[ citation needed ] This feature provides safety and a performance boost compared with some other filesystems.[ according to whom? ]
  • On Solaris, when entire disks are added to a ZFS pool, ZFS automatically enables their write cache. This is not done when ZFS only manages discrete slices of the disk, since it does not know if other slices are managed by non-write-cache safe filesystems, like UFS.[ citation needed ] The FreeBSD implementation can handle disk flushes for partitions thanks to its GEOM framework, and therefore does not suffer from this limitation.[ citation needed ]
  • Per-user, per-group, per-project, and per-dataset quota limits. [75]
  • Filesystem encryption since Solaris 11 Express, [76] and OpenZFS (ZoL) 0.8. [58] (on some other systems ZFS can utilize encrypted disks for a similar effect; GELI on FreeBSD can be used this way to create fully encrypted ZFS storage).
  • Pools can be imported in read-only mode.
  • It is possible to recover data by rolling back entire transactions at the time of importing the zpool.[ citation needed ]
  • Snapshots can be taken manually or automatically. The older versions of the stored data that they contain can be exposed as full read-only file systems. They can also be exposed as historic versions of files and folders when used with CIFS (also known as SMB, Samba or file shares); this is known as "Previous versions", "VSS shadow copies", or "File history" on Windows, or AFP and "Apple Time Machine" on Apple devices. [77]
  • Disks can be marked as 'spare'. A data pool can be set to automatically and transparently handle disk faults by activating a spare disk and beginning to resilver the data that was on the suspect disk onto it, when needed.

Limitations

Data recovery

ZFS does not ship with tools such as fsck, because the file system itself was designed to self-repair. So long as a storage pool had been built with sufficient attention to the design of storage and redundancy of data, basic tools like fsck were never required. However, if the pool was compromised because of poor hardware, inadequate design or redundancy, or unfortunate mishap, to the point that ZFS was unable to mount the pool, traditionally, there were no other, more advanced, tools which allowed an end-user to attempt partial salvage of the stored data from a badly corrupted pool.

Modern ZFS has improved considerably on this situation over time, and continues to do so:

  • Removal or abrupt failure of caching devices no longer causes pool loss. (At worst, loss of the ZIL may lose very recent transactions, but the ZIL does not usually store more than a few seconds' worth of recent transactions. Loss of the L2ARC cache does not affect data.)
  • If the pool is unmountable, modern versions of ZFS will attempt to identify the most recent consistent point at which the pool can be recovered, at the cost of losing some of the most recent changes to the contents. Copy on write means that older versions of data, including top-level records and metadata, may still exist even though they are superseded, and if so, the pool can be wound back to a consistent state based on them. The older the data, the more likely it is that at least some blocks have been overwritten and that some data will be irrecoverable, so there is a limit at some point, on the ability of the pool to be wound back.
  • Informally, tools exist to probe the reason why ZFS is unable to mount a pool, and guide the user or a developer as to manual changes required to force the pool to mount. These include using zdb (ZFS debug) to find a valid importable point in the pool, using dtrace or similar to identify the issue causing mount failure, or manually bypassing health checks that cause the mount process to abort, to allow mounting of the damaged pool.
  • As of March 2018, a range of significantly enhanced methods are gradually being rolled out within OpenZFS. These include: [86]
      • Code refactoring, and more detailed diagnostic and debug information on mount failures, to simplify diagnosis and fixing of corrupt pool issues;
      • The ability to trust or distrust the stored pool configuration. This is particularly powerful, as it allows a pool to be mounted even when top-level vdevs are missing or faulty, when top level data is suspect, and also to rewind beyond a pool configuration change if that change was connected to the problem. Once the corrupt pool is mounted, readable files can be copied for safety, and it may turn out that data can be rebuilt even for missing vdevs, by using copies stored elsewhere in the pool.
      • The ability to fix the situation where a disk needed in one pool was accidentally removed and added to a different pool, causing it to lose metadata related to the first pool, which becomes unreadable.

OpenZFS and ZFS

Oracle Corporation ceased the public development of both ZFS and OpenSolaris after the acquisition of Sun in 2010. Some developers forked the last public release of OpenSolaris as the Illumos project. Because of the significant advantages present in ZFS, it has been ported to several different platforms with different features and commands. For coordinating the development efforts and to avoid fragmentation, OpenZFS was founded in 2013.

According to Matt Ahrens, one of the main architects of ZFS, over 50% of the original OpenSolaris ZFS code has been replaced in OpenZFS with community contributions as of 2019, making “Oracle ZFS” and “OpenZFS” politically and technologically incompatible. [87]

Commercial and open source products

Oracle Corporation, closed source, and forking (from 2010)

In January 2010, Oracle Corporation acquired Sun Microsystems, and quickly discontinued the OpenSolaris distribution and the open source development model. [95] [96] In August 2010, Oracle discontinued providing public updates to the source code of the Solaris OS/Networking repository, effectively turning Solaris 11 back into a closed source proprietary operating system. [97]

In response to the changing landscape of Solaris and OpenSolaris, the illumos project was launched via webinar [98] on August 3, 2010, as a community effort of some core Solaris engineers to continue developing the open source version of Solaris, and complete the open sourcing of those parts not already open sourced by Sun. [99] illumos was founded as a Foundation, the Illumos Foundation, incorporated in the State of California as a 501(c)(6) trade association. The original plan explicitly stated that illumos would not be a distribution or a fork. However, after Oracle announced discontinuing OpenSolaris, plans were made to fork the final version of the Solaris ON, allowing illumos to evolve into an operating system of its own. [100] As part of OpenSolaris, an open source version of ZFS was therefore integral within illumos.

ZFS was widely used within numerous platforms, as well as Solaris. Therefore, in 2013, the co-ordination of development work on the open source version of ZFS was passed to an umbrella project, OpenZFS. The OpenZFS framework allows any interested parties to collaboratively develop the core ZFS codebase in common, while individually maintaining any specific extra code which ZFS requires to function and integrate within their own systems.

Version history

ZFS Filesystem Version Number | Release date | Significant changes
1 | OpenSolaris Nevada [101] build 36 | First release
2 | OpenSolaris Nevada b69 | Enhanced directory entries. In particular, directory entries now store the object type. For example, file, directory, named pipe, and so on, in addition to the object number.
3 | OpenSolaris Nevada b77 | Support for sharing ZFS file systems over SMB. Case insensitivity support. System attribute support. Integrated anti-virus support.
4 | OpenSolaris Nevada b114 | Properties: userquota, groupquota, userused and groupused
5 | OpenSolaris Nevada b137 | System attributes; symlinks now their own object type
ZFS Pool Version Number | Release date | Significant changes
1 | OpenSolaris Nevada [101] b36 | First release
2 | OpenSolaris Nevada b38 | Ditto Blocks
3 | OpenSolaris Nevada b42 | Hot spares, double-parity RAID-Z (raidz2), improved RAID-Z accounting
4 | OpenSolaris Nevada b62 | zpool history
5 | OpenSolaris Nevada b62 | gzip compression for ZFS datasets
6 | OpenSolaris Nevada b62 | "bootfs" pool property
7 | OpenSolaris Nevada b68 | ZIL: adds the capability to specify a separate Intent Log device or devices
8 | OpenSolaris Nevada b69 | Ability to delegate zfs(1M) administrative tasks to ordinary users
9 | OpenSolaris Nevada b77 | CIFS server support, dataset quotas
10 | OpenSolaris Nevada b77 | Devices can be added to a storage pool as "cache devices"
11 | OpenSolaris Nevada b94 | Improved zpool scrub / resilver performance
12 | OpenSolaris Nevada b96 | Snapshot properties
13 | OpenSolaris Nevada b98 | Properties: usedbysnapshots, usedbychildren, usedbyrefreservation, and usedbydataset
14 | OpenSolaris Nevada b103 | passthrough-x aclinherit property support
15 | OpenSolaris Nevada b114 | Properties: userquota, groupquota, userused and groupused; also required FS v4
16 | OpenSolaris Nevada b116 | STMF property support
17 | OpenSolaris Nevada b120 | Triple-parity RAID-Z
18 | OpenSolaris Nevada b121 | ZFS snapshot holds
19 | OpenSolaris Nevada b125 | ZFS log device removal
20 | OpenSolaris Nevada b128 | zle compression algorithm that is needed to support the ZFS deduplication properties in ZFS pool version 21, which were released concurrently
21 | OpenSolaris Nevada b128 | Deduplication
22 | OpenSolaris Nevada b128 | zfs receive properties
23 | OpenSolaris Nevada b135 | Slim ZIL
24 | OpenSolaris Nevada b137 | System attributes. Symlinks now their own object type. Also requires FS v5.
25 | OpenSolaris Nevada b140 | Improved pool scrubbing and resilvering statistics
26 | OpenSolaris Nevada b141 | Improved snapshot deletion performance
27 | OpenSolaris Nevada b145 | Improved snapshot creation performance (particularly recursive snapshots)
28 | OpenSolaris Nevada b147 | Multiple virtual device replacements

Note: The Solaris version under development by Sun since the release of Solaris 10 in 2005 was codenamed 'Nevada', and was derived from what was the OpenSolaris codebase. 'Solaris Nevada' is the codename for the next-generation Solaris OS to eventually succeed Solaris 10 and this new code was then pulled successively into new OpenSolaris 'Nevada' snapshot builds. [101] OpenSolaris is now discontinued and OpenIndiana forked from it. [102] [103] A final build (b134) of OpenSolaris was published by Oracle (2010-Nov-12) as an upgrade path to Solaris 11 Express.

Operating system support

List of operating systems, distributions and add-ons that support ZFS, with the zpool version each supports and the Solaris build it is based on (if any):

OS | Zpool version | Sun/Oracle Build # | Comments
Oracle Solaris 11.4 | 49 | 11.4.51 (11.4 SRU 51) [104] |
Oracle Solaris 11.3 | 37 | 0.5.11-0.175.3.1.0.5.0 |
Oracle Solaris 10 1/13 (U11) | 32 | |
Oracle Solaris 11.2 | 35 | 0.5.11-0.175.2.0.0.42.0 |
Oracle Solaris 11 2011.11 | 34 | b175 |
Oracle Solaris Express 11 2010.11 | 31 | b151a | licensed for testing only
OpenSolaris 2009.06 | 14 | b111b |
OpenSolaris (last dev) | 22 | b134 |
OpenIndiana | 5000 | b147 | distribution based on illumos; creates a name clash naming their build code 'b151a'
Nexenta Core 3.0.1 | 26 | b134+ | GNU userland
NexentaStor Community 3.0.1 | 26 | b134+ | up to 18 TB, web admin
NexentaStor Community 3.1.0 | 28 | b134+ | GNU userland
NexentaStor Community 4.0 | 5000 | b134+ | up to 18 TB, web admin
NexentaStor Enterprise | 28 | b134+ | not free, web admin
GNU/kFreeBSD "Squeeze" (Unsupported) | 14 | | Requires package "zfsutils"
GNU/kFreeBSD "Wheezy-9" (Unsupported) | 28 | | Requires package "zfsutils"
FreeBSD | 5000 | |
zfs-fuse 0.7.2 | 23 | | suffered from performance issues; defunct
ZFS on Linux 0.6.5.8 | 5000 | | 0.6.0 release candidate has POSIX layer
KQ Infotech's ZFS on Linux | 28 | | defunct; code integrated into LLNL-supported ZFS on Linux
BeleniX 0.8b1 | 14 | b111 | small-size live-CD distribution; once based on OpenSolaris
Schillix 0.7.2 | 28 | b147 | small-size live-CD distribution; as SchilliX-ON 0.8.0 based on OpenSolaris
StormOS "hail" | | | distribution once based on Nexenta Core 2.0+, Debian Linux; superseded by Dyson OS
Jaris | | | Japanese Solaris distribution; once based on OpenSolaris
MilaX 0.5 | 20 | b128a | small-size live-CD distribution; once based on OpenSolaris
FreeNAS 8.0.2 / 8.2 | 15 | |
FreeNAS 8.3.0 | 28 | | based on FreeBSD 8.3
FreeNAS 9.1.0+ | 5000 | | based on FreeBSD 9.1+
XigmaNAS 11.4.0.4/12.2.0.4 | 5000 | | based on FreeBSD 11.4/12.2
Korona 4.5.0 | 22 | b134 | KDE
EON NAS (v0.6) | 22 | b130 | embedded NAS
EON NAS (v1.0beta) | 28 | b151a | embedded NAS
napp-it | 28/5000 | Illumos/Solaris | Storage appliance; OpenIndiana (Hipster), OmniOS, Solaris 11, Linux (ZFS management)
OmniOS CE | 28/5000 | illumos-OmniOS branch | minimal stable/LTS storage server distribution based on Illumos, community driven
SmartOS | 28/5000 | Illumos b151+ | minimal live distribution based on Illumos (USB/CD boot); cloud and hypervisor use (KVM)
macOS 10.5, 10.6, 10.7, 10.8, 10.9 | 5000 | | via MacZFS; superseded by OpenZFS on OS X
macOS 10.6, 10.7, 10.8 | 28 | | via ZEVO; superseded by OpenZFS on OS X
NetBSD | 22 | |
MidnightBSD | 6 | |
Proxmox VE | 5000 | | native support since 2014, pve.proxmox.com/wiki/ZFS_on_Linux
Ubuntu Linux 16.04 LTS+ | 5000 | | native support via installable binary module, wiki.ubuntu.com/ZFS
ZFSGuru 10.1.100 | 5000 | |

Notes

  1. While RAID 7 is not a standard RAID level, it has been proposed as a catch-all term for any >3 parity RAID configuration [38]

References

  1. "What Is ZFS?". Oracle Solaris ZFS Administration Guide. Oracle. Archived from the original on March 4, 2016. Retrieved December 29, 2015.
  2. "ZFS on Linux Licensing". GitHub. Retrieved May 17, 2020.
  3. "The OpenZFS project launches". LWN.net. September 17, 2013. Archived from the original on October 4, 2013. Retrieved October 1, 2013.
  4. "OpenZFS Announcement". OpenZFS. September 17, 2013. Archived from the original on April 2, 2018. Retrieved September 19, 2013.
  5. open-zfs.org/History. Archived December 24, 2013, at the Wayback Machine. "OpenZFS is the truly open source successor to the ZFS project [...] Effects of the fork (2010 to date)"
  6. Sean Michael Kerner (September 18, 2013). "LinuxCon: OpenZFS moves Open Source Storage Forward". infostor.com. Archived from the original on March 14, 2014. Retrieved October 9, 2013.
  7. "The OpenZFS project launches". LWN.net. September 17, 2013. Archived from the original on October 11, 2016. Retrieved October 1, 2013.
  8. "OpenZFS – Communities co-operating on ZFS code and features". freebsdnews.net. September 23, 2013. Archived from the original on October 14, 2013. Retrieved March 14, 2014.
  9. "The Starline ZFS FAQ". Starline. Retrieved July 20, 2024.
  10. "19.4. zfs Administration". www.freebsd.org. Archived from the original on February 23, 2017. Retrieved February 22, 2017.
  11. Salus, Peter (1994). A Quarter Century of Unix. Addison-Wesley. pp. 199–200. ISBN 0-201-54777-5.
  12. "What are SunOS and Solaris?". Knowledge Base. Indiana University Technology Services. May 20, 2013. Retrieved November 10, 2014.
  13. Brown, David. "A Conversation with Jeff Bonwick and Bill Moore". ACM Queue. Association for Computing Machinery. Archived from the original on July 16, 2011. Retrieved November 17, 2015.
  14. "ZFS: the last word in file systems". Sun Microsystems. September 14, 2004. Archived from the original on April 28, 2006. Retrieved April 30, 2006.
  15. Matthew Ahrens (November 1, 2011). "ZFS 10 year anniversary". Archived from the original on June 28, 2016. Retrieved July 24, 2012.
  16. Bonwick, Jeff (October 31, 2005). "ZFS: The Last Word in Filesystems". blogs.oracle.com. Archived from the original on June 19, 2013. Retrieved June 22, 2013.
  17. "Sun Celebrates Successful One-Year Anniversary of OpenSolaris". Sun Microsystems. June 20, 2006. Archived from the original on September 28, 2008. Retrieved April 30, 2018.
  18. Michael Singer (January 25, 2005). "Sun Cracks Open Solaris". InternetNews.com. Retrieved April 12, 2010.
  19. "ZFS FAQ at OpenSolaris.org". Sun Microsystems. Archived from the original on May 15, 2011. Retrieved May 18, 2011. The largest SI prefix we liked was 'zetta' ('yotta' was out of the question)
  20. Jeff Bonwick (May 3, 2006). "You say zeta, I say zetta". Jeff Bonwick's Blog. Archived from the original on February 23, 2017. Retrieved April 21, 2017. So we finally decided to unpimp the name back to ZFS, which doesn't stand for anything.
  21. "Oracle and NetApp dismiss ZFS lawsuits". theregister.co.uk. September 9, 2010. Archived from the original on September 9, 2017. Retrieved December 24, 2013.
  22. The Extended file system (Ext) has metadata structure copied from UFS. "Rémy Card (Interview, April 1998)". April Association. April 19, 1999. Archived from the original on February 4, 2012. Retrieved February 8, 2012. (In French)
  23. Vijayan Prabhakaran (2006). "IRON FILE SYSTEMS" (PDF). Doctor of Philosophy in Computer Sciences. University of Wisconsin-Madison. Archived (PDF) from the original on April 29, 2011. Retrieved June 9, 2012.
  24. "Parity Lost and Parity Regained". Archived from the original on June 15, 2010. Retrieved November 29, 2010.
  25. "An Analysis of Data Corruption in the Storage Stack" (PDF). Archived (PDF) from the original on June 15, 2010. Retrieved November 29, 2010.
  26. "Impact of Disk Corruption on Open-Source DBMS" (PDF). Archived (PDF) from the original on June 15, 2010. Retrieved November 29, 2010.
  27. Kadav, Asim; Rajimwale, Abhishek. "Reliability Analysis of ZFS" (PDF). Archived (PDF) from the original on September 21, 2013. Retrieved September 19, 2013.
  28. Yupu Zhang; Abhishek Rajimwale; Andrea Arpaci-Dusseau; Remzi H. Arpaci-Dusseau (2010). "End-to-end data integrity for file systems: a ZFS case study" (PDF). USENIX Conference on File and Storage Technologies. CiteSeerX 10.1.1.154.3979. S2CID 5722163. Wikidata Q111972797. Retrieved December 6, 2010.
  29. Larabel, Michael. "Benchmarking ZFS and UFS On FreeBSD vs. EXT4 & Btrfs On Linux". Phoronix Media 2012. Archived from the original on November 29, 2016. Retrieved November 21, 2012.
  30. Larabel, Michael. "Can DragonFlyBSD's HAMMER Compete With Btrfs, ZFS?". Phoronix Media 2012. Archived from the original on November 29, 2016. Retrieved November 21, 2012.
  31. Bonwick, Jeff (December 8, 2005). "ZFS End-to-End Data Integrity". blogs.oracle.com. Archived from the original on April 3, 2012. Retrieved September 19, 2013.
  32. Cook, Tim (November 16, 2009). "Demonstrating ZFS Self-Healing". blogs.oracle.com. Archived from the original on August 12, 2011. Retrieved February 1, 2015.
  33. Ranch, Richard (May 4, 2007). "ZFS, copies, and data protection". blogs.oracle.com. Archived from the original on August 18, 2016. Retrieved February 2, 2015.
  34. "zpoolconcepts.7 — OpenZFS documentation". openzfs.github.io. Retrieved April 5, 2023.
  35. "ZFS Without Tears: Using ZFS without ECC memory". www.csparks.com. December 2015. Archived from the original on January 13, 2021. Retrieved June 16, 2020.
  36. wdc.custhelp.com. "Difference between Desktop edition and RAID (Enterprise) edition drives". Archived from the original on January 5, 2015. Retrieved September 8, 2011.
  37. Bonwick, Jeff (November 17, 2005). "RAID-Z". Jeff Bonwick's Blog. Oracle Blogs. Archived from the original on December 16, 2014. Retrieved February 1, 2015.
  38. Leventhal, Adam (December 17, 2009). "Triple-Parity RAID and Beyond". Queue. 7 (11): 30. doi:10.1145/1661785.1670144.
  39. "ZFS Raidz Performance, Capacity and integrity". calomel.org. Archived from the original on November 27, 2017. Retrieved June 23, 2017.
  40. "Why RAID 6 stops working in 2019". ZDNet. February 22, 2010. Archived from the original on October 31, 2014. Retrieved October 26, 2014.
  41. "No fsck utility equivalent exists for ZFS. This utility has traditionally served two purposes, those of file system repair and file system validation." "Checking ZFS File System Integrity". Oracle. Archived from the original on January 31, 2013. Retrieved November 25, 2012.
  42. "ZFS Scrubs". freenas.org. Archived from the original on November 27, 2012. Retrieved November 25, 2012.
  43. "You should also run a scrub prior to replacing devices or temporarily reducing a pool's redundancy to ensure that all devices are currently operational." "ZFS Best Practices Guide". solarisinternals.com. Archived from the original on September 5, 2015. Retrieved November 25, 2012.
  44. Jeff Bonwick. "128-bit storage: are you high?". oracle.com. Archived from the original on May 29, 2015. Retrieved May 29, 2015.
  45. "ZFS: Boils the Ocean, Consumes the Moon (Dave Brillhart's Blog)". Archived from the original on December 8, 2015. Retrieved December 19, 2015.
  46. "Solaris ZFS Administration Guide". Oracle Corporation. Archived from the original on January 13, 2021. Retrieved February 11, 2011.
  47. "Encrypting ZFS File Systems". Archived from the original on June 23, 2011. Retrieved May 2, 2011.
  48. "Having my secured cake and Cloning it too (aka Encryption + Dedup with ZFS)". Archived from the original on May 29, 2013. Retrieved October 9, 2012.
  49. "ZFS – Debian Wiki". wiki.debian.org. Archived from the original on September 8, 2019. Retrieved December 10, 2019.
  50. "Proposal: Consider adding warnings against using zfs native encryption along with send/recv in production". GitHub. Retrieved August 15, 2024.
  51. "PSA: ZFS has a data corruption bug when using native encryption and send/recv". Reddit. Retrieved August 15, 2024.
  52. "ZFS Fragmentation: Long-term Solutions". GitHub. Retrieved August 15, 2024.
  53. "What are the best practices to keep ZFS from being too fragmented". Lawrence Systems. Retrieved August 15, 2024.
  54. "Solaris ZFS Enables Hybrid Storage Pools—Shatters Economic and Performance Barriers" (PDF). Sun.com. September 7, 2010. Archived (PDF) from the original on October 17, 2011. Retrieved November 4, 2011.
  55. Gregg, Brendan. "ZFS L2ARC". Brendan's blog. Dtrace.org. Archived from the original on November 6, 2011. Retrieved October 5, 2012.
  56. Gregg, Brendan (October 8, 2009). "Hybrid Storage Pool: Top Speeds". Brendan's blog. Dtrace.org. Archived from the original on April 5, 2016. Retrieved August 15, 2017.
  57. "Solaris ZFS Performance Tuning: Synchronous Writes and the ZIL". Constantin.glez.de. July 20, 2010. Archived from the original on June 23, 2012. Retrieved October 5, 2012.
  58. "Release zfs-0.8.0". GitHub. OpenZFS. May 23, 2019. Retrieved July 3, 2021.
  59. "ZFS On-Disk Specification" (PDF). Sun Microsystems, Inc. 2006. Archived from the original (PDF) on December 30, 2008. See section 2.4.
  60. "RAIDZ — OpenZFS documentation". openzfs.github.io. Retrieved February 9, 2023.
  61. Eric Sproul (May 21, 2009). "ZFS Nuts and Bolts". slideshare.net. pp. 30–31. Archived from the original on June 22, 2014. Retrieved June 8, 2014.
  62. "ZFS Deduplication". blogs.oracle.com. Archived from the original on December 24, 2019. Retrieved November 25, 2019.
  63. Gary Sims (January 4, 2012). "Building ZFS Based Network Attached Storage Using FreeNAS 8". TrainSignal Training. TrainSignal, Inc. Archived from the original (Blog) on May 7, 2012. Retrieved June 9, 2012.
  64. Ray Van Dolson (May 2011). "[zfs-discuss] Summary: Deduplication Memory Requirements". zfs-discuss mailing list. Archived from the original on April 25, 2012.
  65. "ZFSTuningGuide". Archived from the original on January 16, 2012. Retrieved January 3, 2012.
  66. Chris Mellor (October 12, 2012). "GreenBytes brandishes full-fat clone VDI pumper". The Register. Archived from the original on March 24, 2013. Retrieved August 29, 2013.
  67. Chris Mellor (June 1, 2012). "Newcomer gets out its box, plans to sell it cheaply to all comers". The Register. Archived from the original on August 12, 2013. Retrieved August 29, 2013.
  68. Chris Mellor (December 11, 2014). "Dedupe, dedupe... dedupe, dedupe, dedupe: Oracle polishes ZFS diamond". The Register. Archived from the original on July 7, 2017. Retrieved December 17, 2014.
  69. "Checksums and Their Use in ZFS". github.com. September 2, 2018. Archived from the original on July 19, 2019. Retrieved July 11, 2019.
  70. "Solaris ZFS Administration Guide". Chapter 6 Managing ZFS File Systems. Archived from the original on February 5, 2011. Retrieved March 17, 2009.
  71. "Smokin' Mirrors". blogs.oracle.com. May 2, 2006. Archived from the original on December 16, 2011. Retrieved February 13, 2012.
  72. "ZFS Block Allocation". Jeff Bonwick's Weblog. November 4, 2006. Archived from the original on November 2, 2012. Retrieved February 23, 2007.
  73. "Ditto Blocks — The Amazing Tape Repellent". Flippin' off bits Weblog. May 12, 2006. Archived from the original on May 26, 2013. Retrieved March 1, 2007.
  74. "Adding new disks and ditto block behaviour". Archived from the original on August 23, 2011. Retrieved October 19, 2009.
  75. "OpenSolaris.org". Sun Microsystems. Archived from the original on May 8, 2009. Retrieved May 22, 2009.
  76. "What's new in Solaris 11 Express 2010.11" (PDF). Oracle. Archived (PDF) from the original on November 16, 2010. Retrieved November 17, 2010.
  77. "10. Sharing — FreeNAS User Guide 9.3 Table of Contents". doc.freenas.org. Archived from the original on January 7, 2017. Retrieved February 23, 2017.
  78. "Bug ID 4852783: reduce pool capacity". OpenSolaris Project. Archived from the original on June 29, 2009. Retrieved March 28, 2009.
  79. Goebbels, Mario (April 19, 2007). "Permanently removing vdevs from a pool". zfs-discuss (Mailing list). [permanent dead link] Archived January 13, 2021, at the Wayback Machine.
  80. Chris Siebenmann, Information on future vdev removal, archived August 11, 2016, at the Wayback Machine, Univ Toronto, blog; quote: informal Twitter announcement by Alex Reece, archived August 11, 2016, at the Wayback Machine.
  81. "Data Management Features – What's New in Oracle® Solaris 11.4". Archived from the original on September 24, 2019. Retrieved October 9, 2019.
  82. "Expand-O-Matic RAID Z". Adam Leventhal. April 7, 2008. Archived from the original on December 28, 2011. Retrieved April 16, 2012.
  83. "ZFS Toy". SourceForge.net. Retrieved April 12, 2022.
  84. "zpoolconcepts(7)". OpenZFS documentation. OpenZFS. June 2, 2021. Retrieved April 12, 2021. Quote: "Virtual devices cannot be nested, so a mirror or raidz virtual device can only contain files or disks. Mirrors of mirrors (or other combinations) are not allowed."
  85. "zpool(1M)". Download.oracle.com. June 11, 2010. Archived from the original on January 13, 2021. Retrieved November 4, 2011.
  86. "Turbocharging ZFS Data Recovery". Archived from the original on November 29, 2018. Retrieved November 29, 2018.
  87. "ZFS and OpenZFS". iXSystems. Retrieved May 18, 2020.
  88. "Sun rolls out its own storage appliances". techworld.com.au. November 11, 2008. Archived from the original on November 13, 2013. Retrieved November 13, 2013.
  89. Chris Mellor (October 2, 2013). "Oracle muscles way into seat atop the benchmark with hefty ZFS filer". theregister.co.uk. Archived from the original on July 7, 2017. Retrieved July 7, 2014.
  90. "Unified ZFS Storage Appliance built in Silicon Valley by iXsystem". ixsystems.com. Archived from the original on July 3, 2014. Retrieved July 7, 2014.
  91. "TrueNAS 12 & TrueNAS SCALE are officially here!". ixsystems.com. Retrieved January 2, 2021.
  92. "ReadyDATA 516 – Unified Network Storage" (PDF). netgear.com. Archived (PDF) from the original on July 15, 2014. Retrieved July 7, 2014.
  93. Jim Salter (December 17, 2015). "rsync.net: ZFS Replication to the cloud is finally here—and it's fast". arstechnica.com. Archived from the original on August 22, 2017. Retrieved August 21, 2017.
  94. rsync.net, Inc. "Cloud Storage with ZFS send and receive over SSH". rsync.net. Archived from the original on July 21, 2017. Retrieved August 21, 2017.
  95. Steven Stallion / Oracle (August 13, 2010). "Update on SXCE". Iconoclastic Tendencies. Archived from the original on November 9, 2020. Retrieved April 30, 2018.
  96. Alasdair Lumsden. "OpenSolaris cancelled, to be replaced with Solaris 11 Express". osol-discuss (Mailing list). Archived from the original on August 16, 2010. Retrieved November 24, 2014.
  97. Ryan Paul (August 16, 2010). "Solaris still sorta open, but OpenSolaris distro is dead". Ars Technica. Archived September 5, 2017, at the Wayback Machine.
  98. Garrett D'Amore (August 3, 2010). "Illumos - Hope and Light Springs Anew - Presented by Garrett D'Amore" (PDF). illumos.org. Retrieved August 3, 2010.
  99. "Whither OpenSolaris? Illumos Takes Up the Mantle". Archived from the original on September 26, 2015.
  100. Garrett D'Amore (August 13, 2010). "The Hand May Be Forced". Retrieved November 14, 2013.
  101. "While under Sun Microsystems' control, there were bi-weekly snapshots of Solaris Nevada (the codename for the next-generation Solaris OS to eventually succeed Solaris 10) and this new code was then pulled into new OpenSolaris preview snapshots available at Genunix.org. The stable releases of OpenSolaris are based off of [sic] these Nevada builds." Larabel, Michael. "It Looks Like Oracle Will Stand Behind OpenSolaris". Phoronix Media. Archived from the original on November 29, 2016. Retrieved November 21, 2012.
  102. Ljubuncic, Igor (May 23, 2011). "OpenIndiana — there's still hope". DistroWatch. Archived from the original on October 27, 2012. Retrieved November 21, 2012.
  103. "Welcome to Project OpenIndiana!". Project OpenIndiana. September 10, 2010. Archived from the original on November 27, 2012. Retrieved September 14, 2010.
  104. "ZFS Pool Versions". Oracle Corporation. 2022. Archived from the original on December 21, 2022. Retrieved January 1, 2023.

Bibliography