GFS2

GFS2
Developer(s): Red Hat
Full name: Global File System 2
Introduced: 2005 with Linux 2.6.19
Structures
Directory contents: Hashed (small directories stuffed into inode)
File allocation: bitmap (resource groups)
Bad blocks: No
Limits
Max. number of files: Variable
Max. filename length: 255 bytes
Allowed characters in filenames: All except NUL
Features
Dates recorded: attribute modification (ctime), modification (mtime), access (atime)
Date resolution: Nanosecond
Attributes: No-atime, journaled data (regular files only), inherit journaled data (directories only), synchronous-write, append-only, immutable, exhash (dirs only, read only)
File system permissions: Unix permissions, ACLs and arbitrary security attributes
Transparent compression: No
Transparent encryption: No
Data deduplication: across nodes only
Other
Supported operating systems: Linux
GFS
Developer(s): Red Hat (formerly, Sistina Software)
Full name: Global File System
Introduced: 1996 with IRIX (1996), Linux (1997)
Structures
Directory contents: Hashed (small directories stuffed into inode)
File allocation: bitmap (resource groups)
Bad blocks: No
Limits
Max. number of files: Variable
Max. filename length: 255 bytes
Allowed characters in filenames: All except NUL
Features
Dates recorded: attribute modification (ctime), modification (mtime), access (atime)
Date resolution: 1s
Attributes: No-atime, journaled data (regular files only), inherit journaled data (directories only), synchronous-write, append-only, immutable, exhash (dirs only, read only)
File system permissions: Unix permissions, ACLs
Transparent compression: No
Transparent encryption: No
Data deduplication: across nodes only
Other
Supported operating systems: IRIX (now obsolete), FreeBSD (now obsolete), Linux

In computing, the Global File System 2 or GFS2 is a shared-disk file system for Linux computer clusters. GFS2 allows all members of a cluster to have direct concurrent access to the same shared block storage, in contrast to distributed file systems which distribute data throughout the cluster. GFS2 can also be used as a local file system on a single computer.


GFS2 has no disconnected operating mode and no client or server roles. All nodes in a GFS2 cluster function as peers. Using GFS2 in a cluster requires hardware that allows access to the shared storage, and a lock manager to control access to the storage. The lock manager is a separate module, so GFS2 can use the Distributed Lock Manager (DLM) for cluster configurations and the "nolock" lock manager for local filesystems. Older versions of GFS also supported GULM, a server-based lock manager which implemented redundancy via failover.
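Why a pluggable lock module makes both single-node and clustered use possible can be shown with a minimal Python sketch. The interface and class names here (`LockModule`, `NoLock`, `ToyClusterLock`) are hypothetical and are not the kernel's lock-module API; on real systems the lock protocol is typically selected by name (`lock_dlm` or `lock_nolock`). The point is only that a single-node module can grant every request at once, while a cluster module has to arbitrate between nodes.

```python
from abc import ABC, abstractmethod


class LockModule(ABC):
    """Hypothetical interface between the filesystem and a lock module."""

    @abstractmethod
    def acquire(self, name: str, exclusive: bool) -> bool:
        """Return True if the lock is granted."""

    @abstractmethod
    def release(self, name: str) -> None:
        """Drop this node's hold on the lock."""


class NoLock(LockModule):
    """Single-node case: nobody else can conflict, so grant everything."""

    def acquire(self, name: str, exclusive: bool) -> bool:
        return True

    def release(self, name: str) -> None:
        pass


class ToyClusterLock(LockModule):
    """Toy stand-in for a cluster lock manager such as the DLM.

    All nodes share one in-memory table (a real DLM distributes it), and
    conflicting requests are simply refused instead of being queued.
    """

    def __init__(self, table: dict, node: str):
        self.table = table  # lock name -> (exclusive flag, set of holders)
        self.node = node

    def acquire(self, name: str, exclusive: bool) -> bool:
        held = self.table.get(name)
        if held is None:
            self.table[name] = (exclusive, {self.node})
            return True
        held_exclusive, holders = held
        if not exclusive and not held_exclusive:
            holders.add(self.node)  # shared holds can coexist
            return True
        return False  # conflict: a real lock manager would make the caller wait

    def release(self, name: str) -> None:
        held = self.table.get(name)
        if held is None:
            return
        holders = held[1]
        holders.discard(self.node)
        if not holders:
            del self.table[name]
```

The filesystem code above the interface is identical in both cases; only the module decides whether another node's hold must be respected.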

GFS and GFS2 are free software, distributed under the terms of the GNU General Public License. [1] [2]

History

Development of GFS began in 1995 under University of Minnesota professor Matthew O'Keefe and a group of students. [3] It was originally written for SGI's IRIX operating system, but in 1998 it was ported to Linux, since the open-source code provided a more convenient development platform. In late 1999 or early 2000 it moved to Sistina Software, where it lived for a time as an open-source project. In 2001, Sistina made GFS a proprietary product.

Developers forked OpenGFS from the last public release of GFS and further enhanced it with updates allowing it to work with OpenDLM. OpenGFS and OpenDLM became defunct when Red Hat purchased Sistina in December 2003 and released GFS and many cluster-infrastructure pieces under the GPL in late June 2004.

Red Hat subsequently financed further development geared towards bug-fixing and stabilization. GFS2, [4] [5] a further development derived from GFS, was included in Linux 2.6.19 along with its distributed lock manager (shared with GFS). Red Hat Enterprise Linux 5.2 included GFS2 as a kernel module for evaluation purposes. With the 5.3 update, GFS2 became part of the kernel package.

GFS2 is part of the Fedora, Red Hat Enterprise Linux and associated CentOS Linux distributions. Commercial support for running GFS2 on Red Hat Enterprise Linux is available for purchase. As of Red Hat Enterprise Linux 8.3, GFS2 is supported in cloud computing environments in which shared storage devices are available. [6]

The following list summarizes some version numbers and major features introduced:

Hardware

The design of GFS and GFS2 targets SAN-like environments. Although they can be used as single-node filesystems, the full feature set requires a SAN. This can take the form of iSCSI, Fibre Channel, AoE, or any other device which Linux can present as a block device shared by a number of nodes, for example a DRBD device.

The DLM requires an IP-based network over which to communicate. This is normally just Ethernet, but again, there are many other possible solutions. Depending upon the choice of SAN, it may be possible to combine the DLM and storage networks, but normal practice[citation needed] involves separate networks for the two.

GFS requires a fencing mechanism of some kind. This is a requirement of the cluster infrastructure rather than of GFS/GFS2 itself, but it is required for all multi-node clusters. The usual options include power switches and remote access controllers (e.g. DRAC, IPMI, or iLO). Virtual and hypervisor-based fencing mechanisms can also be used. Fencing ensures that a node which the cluster believes to have failed cannot suddenly start working again while another node is recovering the failed node's journal. It can also optionally restart the failed node automatically once recovery is complete.
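The ordering rule that fencing enforces (cut off the failed node first, replay its journal second) can be sketched as follows. This is a toy model, not cluster software: `Node`, `recover_failed_node`, `power_switch_fence` and `FenceError` are hypothetical names, and clearing a flag stands in for actual journal replay.

```python
from dataclasses import dataclass
from typing import Callable


class FenceError(RuntimeError):
    """Raised when a failed node could not be fenced."""


@dataclass
class Node:
    name: str
    powered: bool = True
    journal_needs_replay: bool = False


def recover_failed_node(failed: Node, fence: Callable[[Node], bool]) -> str:
    """Replay a failed node's journal only after fencing has succeeded."""
    if not fence(failed):
        # Without confirmation the node might still be writing to shared
        # storage, so replaying its journal now could corrupt the filesystem.
        raise FenceError(f"could not fence {failed.name}; journal recovery blocked")
    failed.journal_needs_replay = False  # placeholder for real journal replay
    return f"recovered journal of {failed.name}"


def power_switch_fence(node: Node) -> bool:
    """Hypothetical power-switch agent: cutting power always succeeds here."""
    node.powered = False
    return True
```

If the fence agent cannot confirm that the node is cut off, recovery stops rather than guessing, which is the property the paragraph above describes.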

Differences from a local filesystem

Although the designers of GFS/GFS2 aimed to emulate a local filesystem closely, there are a number of differences to be aware of. Some of these are due to the existing filesystem interfaces not allowing the passing of information relating to the cluster. Some stem from the difficulty of implementing those features efficiently in a clustered manner. For example:

The other main difference, and one that is shared by all similar cluster filesystems, is that the cache control mechanism, known as glocks (pronounced Gee-locks) for GFS/GFS2, has an effect across the whole cluster. Each inode on the filesystem has two glocks associated with it. One (called the iopen glock) keeps track of which processes have the inode open. The other (the inode glock) controls the cache relating to that inode. A glock has four states, UN (unlocked), SH (shared – a read lock), DF (deferred – a read lock incompatible with SH) and EX (exclusive). Each of the four modes maps directly to a DLM lock mode.
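The four glock states and their compatibility can be modelled in a few lines of Python. The mapping onto DLM modes used here (UN to NL, SH to PR, DF to CW, EX to EX) reflects our reading of the Linux kernel's GFS2 DLM glue code and should be checked against the kernel source; the compatibility sets are the standard DLM rules restricted to those four modes. The names `Glock`, `DLM_MODE` and `compatible` are illustrative only.

```python
from enum import Enum


class Glock(Enum):
    UN = "unlocked"
    SH = "shared"
    DF = "deferred"
    EX = "exclusive"


# Assumed mapping onto classic DLM modes: NL = null, PR = protected read,
# CW = concurrent write, EX = exclusive.
DLM_MODE = {Glock.UN: "NL", Glock.SH: "PR", Glock.DF: "CW", Glock.EX: "EX"}

# Standard DLM compatibility, restricted to the four modes above.
_COMPATIBLE = {
    "NL": {"NL", "PR", "CW", "EX"},
    "PR": {"NL", "PR"},
    "CW": {"NL", "CW"},
    "EX": {"NL"},
}


def compatible(a: Glock, b: Glock) -> bool:
    """Can two nodes hold the same glock in states a and b simultaneously?"""
    return DLM_MODE[b] in _COMPATIBLE[DLM_MODE[a]]
```

This captures why DF is described as "a read lock incompatible with SH": several nodes may share DF, and several may share SH, but the two cannot be mixed.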

When in EX mode, an inode is allowed to cache data and metadata (which might be "dirty", i.e. waiting for write back to the filesystem). In SH mode, the inode can cache data and metadata, but it must not be dirty. In DF mode, the inode is allowed to cache metadata only, and again it must not be dirty. The DF mode is used only for direct I/O. In UN mode, the inode must not cache any metadata.
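The caching rules for each state translate directly into a table, from which the cache work a node must do when its glock is demoted follows mechanically. This is a simplified sketch, assuming that "UN" forbids caching data as well as metadata; `CachePolicy`, `POLICY` and `demotion_actions` are hypothetical names, and real GFS2 does considerably more bookkeeping.

```python
from enum import Enum
from typing import NamedTuple


class Glock(Enum):
    UN = "unlocked"
    SH = "shared"
    DF = "deferred"
    EX = "exclusive"


class CachePolicy(NamedTuple):
    data: bool      # may cache file data
    metadata: bool  # may cache inode metadata
    dirty: bool     # may hold changes not yet written back


POLICY = {
    Glock.EX: CachePolicy(data=True, metadata=True, dirty=True),
    Glock.SH: CachePolicy(data=True, metadata=True, dirty=False),
    Glock.DF: CachePolicy(data=False, metadata=True, dirty=False),
    Glock.UN: CachePolicy(data=False, metadata=False, dirty=False),
}


def demotion_actions(old: Glock, new: Glock) -> list:
    """Cache work a node must do when moving its glock from `old` to `new`."""
    before, after = POLICY[old], POLICY[new]
    actions = []
    if before.dirty and not after.dirty:
        actions.append("write back dirty data and metadata")  # must happen first
    if before.data and not after.data:
        actions.append("drop cached data")
    if before.metadata and not after.metadata:
        actions.append("drop cached metadata")
    return actions
```

For example, giving up EX for SH only requires a write-back, while dropping to UN also empties the cache.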

To prevent operations which change an inode's data or metadata from interfering with each other, an EX lock is used. This means that certain operations, such as creating or unlinking files in the same directory and writing to the same file, should in general be restricted to one node in the cluster. Doing these operations from multiple nodes will still work as expected, but because caches must be flushed frequently, it will not be very efficient.
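The cost of spreading writes to one file across nodes can be made concrete with a toy count of how often exclusive ownership of the inode's glock has to move. Each move implies a cache flush on the old holder and a cold cache on the new one. The function name `count_glock_transfers` is hypothetical and the model ignores everything except ownership changes.

```python
def count_glock_transfers(writes: list) -> int:
    """Count how often EX ownership of one inode's glock changes hands.

    `writes` is the sequence of node names issuing writes to the same file.
    """
    holder = None
    transfers = 0
    for node in writes:
        if node != holder:
            if holder is not None:
                transfers += 1  # previous holder must flush and give up EX
            holder = node
    return transfers


print(count_glock_transfers(["a", "a", "a", "b", "b", "b"]))  # batched: 1
print(count_glock_transfers(["a", "b", "a", "b", "a", "b"]))  # interleaved: 5
```

The same six writes cost one transfer when each node works in a batch, but five when they alternate, which is why keeping such operations on one node is recommended.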

The single most frequently asked question about GFS/GFS2 performance is why the performance can be poor with email servers. The solution is to break up the mail spool into separate directories and to try to keep (so far as is possible) each node reading and writing to a private set of directories.
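One way to split a mail spool so that each node works mostly in its own directories is to give every mailbox a stable owner node and place it under that node's subdirectory. The sketch below assumes a hypothetical mount point `/mnt/gfs2/spool` and node names; it also assumes that mail for a mailbox is routed to its owning node, which is outside the scope of the sketch. Any stable assignment would do; hashing is used here only because it needs no central table.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Sequence


def owner_node(mailbox: str, nodes: Sequence[str]) -> str:
    """Pick a stable owner node for a mailbox by hashing its name."""
    digest = hashlib.sha256(mailbox.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]


def spool_dir(mailbox: str, nodes: Sequence[str],
              root: str = "/mnt/gfs2/spool") -> PurePosixPath:
    """Directory for a mailbox inside its owner node's private subtree."""
    return PurePosixPath(root) / owner_node(mailbox, nodes) / mailbox


nodes = ["node1", "node2", "node3"]
print(spool_dir("alice", nodes))
```

Because a given mailbox always maps to the same node, the create/unlink traffic in each directory stays on one node and the directory glocks rarely need to move.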

Journaling

GFS and GFS2 are both journaled file systems, and GFS2 supports a set of journaling modes similar to those of ext3. In data=writeback mode, only metadata is journaled. This is the only mode supported by GFS; however, journaling can be turned on for individual data files, but only while they are of zero size. Journaled files in GFS have a number of restrictions placed upon them, such as no support for the mmap or sendfile system calls, and they also use a different on-disk format from regular files. There is also an "inherit-journal" attribute which, when set on a directory, causes all files (and sub-directories) created within that directory to have the journal (or inherit-journal, respectively) flag set. This can be used instead of the data=journal mount option which ext3 supports (and GFS/GFS2 does not).
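The inherit-journal rule can be expressed as a small in-memory model: a directory carrying the flag passes the journal flag to new regular files and the inherit-journal flag to new sub-directories. The `Inode` class and its `create` method are hypothetical illustrations of the rule, not GFS2 data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    name: str
    is_dir: bool
    journaled: bool = False        # file flag: data goes through the journal
    inherit_journal: bool = False  # directory flag: pass journaling to children
    children: dict = field(default_factory=dict)

    def create(self, name: str, is_dir: bool = False) -> "Inode":
        """Create a child entry, applying the inherit-journal rule."""
        if not self.is_dir:
            raise NotADirectoryError(self.name)
        child = Inode(name, is_dir)
        if self.inherit_journal:
            if is_dir:
                child.inherit_journal = True  # sub-directories keep inheriting
            else:
                child.journaled = True        # regular files get journaled data
        self.children[name] = child
        return child


logs = Inode("logs", is_dir=True, inherit_journal=True)
print(logs.create("app.log").journaled)                  # True
print(logs.create("archive", is_dir=True).inherit_journal)  # True
```

Setting the flag once on a top-level directory therefore gives the whole subtree created beneath it journaled data, which is how it can stand in for a filesystem-wide data=journal option.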

GFS2 also supports data=ordered mode, which is similar to data=writeback except that dirty data is synced before each journal flush is completed. This ensures that blocks which have been added to an inode have their contents synced back to disk before the metadata is updated to record the new size, and thus prevents uninitialised blocks from appearing in a file after a node failure. The default journaling mode is data=ordered, to match ext3's default.
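The difference between the two modes shows up in a crash that happens right after a journal flush for a one-block append. The toy model below shows the unlucky interleaving that data=writeback permits (the size is committed but the data never reached disk) next to the order data=ordered enforces. The function `visible_after_crash` and the byte strings are hypothetical; real write-back timing varies.

```python
def visible_after_crash(mode: str) -> list:
    """Blocks a reader sees after a one-block append, if the node crashes
    immediately after the journal flush completes."""
    disk_block = b"stale"      # old contents of the newly allocated block
    cached_data = b"new data"  # the appended data, still only in memory
    committed_size = 0         # file size recorded in on-disk metadata, in blocks

    if mode == "ordered":
        disk_block = cached_data  # 1. dirty data is synced first ...
        committed_size = 1        # 2. ... then the new size is committed
    elif mode == "writeback":
        committed_size = 1        # new size committed; data not yet written
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Crash here: anything that existed only in memory is lost.
    return [disk_block][:committed_size]


print(visible_after_crash("ordered"))    # [b'new data']
print(visible_after_crash("writeback"))  # [b'stale']
```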

As of 2010, GFS2 does not yet support data=journal mode, but unlike GFS it uses the same on-disk format for both regular and journaled files, and it supports the same journaled and inherit-journal attributes. GFS2 also relaxes the restriction on when a file's journaled attribute may be changed: it may be changed at any time the file is not open (again, the same as ext3).
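The two rules for changing a regular file's journaled attribute can be summarised as a single predicate. This is a sketch of the rules as stated above, not code from either filesystem; `may_set_journaled` is a hypothetical name.

```python
def may_set_journaled(fs: str, size: int, is_open: bool) -> bool:
    """Whether a regular file's journaled-data attribute may be changed.

    GFS:  only while the file is empty (size 0).
    GFS2: any time the file is not open (as in ext3).
    """
    if fs == "gfs":
        return size == 0
    if fs == "gfs2":
        return not is_open
    raise ValueError(f"unknown filesystem: {fs}")
```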

For performance reasons, each node in GFS and GFS2 has its own journal. In GFS the journals are disk extents, in GFS2 the journals are just regular files. The number of nodes which may mount the filesystem at any one time is limited by the number of available journals.
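Because each mounting node needs a journal of its own, the journal count acts as a cap on concurrent mounts. The sketch below models this as a pool of journal slots; `JournalSlots` is a hypothetical class. On real clusters the number of journals is typically chosen when the filesystem is created (the `-j` option of `mkfs.gfs2`), and more can be added later with `gfs2_jadd`.

```python
class JournalSlots:
    """Each mounting node claims one journal; mounts fail when none are free."""

    def __init__(self, journals: int):
        self.free = list(range(journals))  # journal ids not in use
        self.owner = {}                    # node name -> journal id

    def mount(self, node: str) -> int:
        if node in self.owner:
            return self.owner[node]  # already mounted on this node
        if not self.free:
            raise RuntimeError(f"{node}: no free journal; add one before mounting")
        jid = self.free.pop(0)
        self.owner[node] = jid
        return jid

    def unmount(self, node: str) -> None:
        self.free.append(self.owner.pop(node))
        self.free.sort()
```

With two journals, a third node can mount only after one of the first two unmounts.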

Features of GFS2 compared with GFS

GFS2 adds a number of new features which are not in GFS. Here is a summary of those features not already mentioned in the infoboxes at the top of this article:

Compatibility and the GFS2 meta filesystem

GFS2 was designed so that upgrading from GFS would be a simple procedure. To this end, most of the on-disk structure has remained the same as GFS, including the big-endian byte ordering. There are a few differences though:
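Big-endian on-disk structures have to be decoded explicitly on little-endian hosts. The Python sketch below reads a metadata header with the `struct` module's `>` (big-endian) prefix. The magic number `0x01161970` is GFS2's metadata magic as we understand it from the kernel's on-disk header; the 24-byte field layout (magic, type, 8 bytes of padding, format, one further 32-bit field) is modelled on `gfs2_meta_header` and should be treated as an assumption.

```python
import struct

GFS2_MAGIC = 0x01161970  # GFS2 metadata magic number

# Assumed header layout, all fields big-endian (">"), no alignment padding:
# magic (u32), type (u32), padding (u64), format (u32), extra field (u32).
HEADER = struct.Struct(">IIQII")


def parse_meta_header(block: bytes) -> dict:
    """Decode the start of a metadata block, checking the magic number."""
    magic, mh_type, _pad, fmt, _extra = HEADER.unpack_from(block)
    if magic != GFS2_MAGIC:
        raise ValueError(f"bad magic {magic:#010x}")
    return {"type": mh_type, "format": fmt}


# Example values only; they are not meant to identify a real block type.
block = HEADER.pack(GFS2_MAGIC, 1, 0, 100, 0)
print(parse_meta_header(block))
```

Reading the same first four bytes as little-endian would give `0x70191601` instead of the magic, which is why the byte order has to be fixed in the format rather than left to the host.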

The journaling systems of GFS and GFS2 are not compatible with each other. Upgrading is possible by means of a tool (gfs2_convert) which is run with the filesystem off-line to update the metadata. Some spare blocks in the GFS journals are used to create the (very small) per_node files required by GFS2 during the update process. Most of the data remains in place.

The GFS2 "meta filesystem" is not a filesystem in its own right, but an alternate root of the main filesystem. Although it behaves like a "normal" filesystem, its contents are the various system files used by GFS2, and users normally never need to look at it. The GFS2 utilities mount and unmount the meta filesystem as required, behind the scenes.


References

  1. Teigland, David (29 June 2004). "Symmetric Cluster Architecture and Component Technical Specifications" (PDF). Red Hat Inc. Retrieved 2007-08-03.
  2. Soltis, Steven R.; Erickson, Grant M.; Preslan, Kenneth W. (1997). "The Global File System: A File System for Shared Disk Storage" (PDF). IEEE Transactions on Parallel and Distributed Systems. Archived from the original (PDF) on 2004-04-15.
  3. "OpenGFS Data sharing with a GFS storage cluster". OpenGFS.
  4. Whitehouse, Steven (27–30 June 2007). "The GFS2 Filesystem" (PDF). Proceedings of the Linux Symposium 2007. Ottawa, Ontario, Canada. pp. 253–259.
  5. Whitehouse, Steven (13–17 July 2009). "Testing and verification of cluster filesystems" (PDF). Proceedings of the Linux Symposium 2009. Montreal, Quebec, Canada. pp. 311–317.
  6. "Bringing Red Hat Resilient Storage to the public cloud". www.redhat.com. Retrieved 19 February 2021.