Orlov block allocator

Last updated

The Orlov block allocator is an algorithm to define where a particular file will reside on a given file system (blockwise), so as to speed up disk operations.

Contents

Etymology

The scheme is named after its creator Grigoriy Orlov, who first posted, in 2000, a brief description and implementation for OpenBSD [1] of the technique, which was later used in the BSD Fast Filesystem kernel variants.

Background

The performance of a file system is dependent on many things; one of the crucial factors is just how that filesystem lays out files on the disk. In general, it is best to keep related items together. The Linux ext2 and ext3 filesystems, for instance, have tried to spread directories on the cylinders of the disk. Imagine setting up a system with users' home directories in /home: if all the first-level directories within /home (i.e. the home directories for numerous users) are placed next to each other, there may be no space left for the contents of those directories. User files thus end up being placed far from the directories that contain them, and performance suffers.

Spreading directories on the disc allows files in the same directory to remain more or less contiguous as their number and/or size grows, but there are some situations where this causes excessive spreading of the data on the disk's surface.

How it works

Essentially, the Orlov algorithm tries to distribute "top-level" directories on the assumption that each is unrelated to the others. Directories created in the root directory of a filesystem are considered top-level directories; Theodore Ts'o added a special inode flag that allows the system administrator to mark other directories as being top-level directories as well. If /home lives in the root filesystem, a simple chattr command will make the system treat it as a top-level directory.

When creating a directory that is not in a top-level directory, the Orlov algorithm tries to put it into the same cylinder group as its parent. A little more care is taken, however, to ensure that the directory's contents will also be able to fit into that cylinder group; if there are not many inodes or blocks available in the group, the directory will be placed in a different cylinder group that has more resources available. The result of all this, hopefully, is much better locality for files that are truly related to each other and likely to be accessed together.

Performance

The Orlov block allocator was shown to offer performance gains on workloads that traverse directory trees [2] on FreeBSD. As of October 2007, only one benchmark result [3] for ext3, using the allocator seems to have been posted. The results are promising: the time required to traverse through a Linux kernel tree was reduced by roughly 30%.

Evolution

The Orlov scheme needs more rigorous benchmarking; it also needs some serious stress testing to demonstrate that performance does not degrade as the filesystem is changed over time.

Related Research Articles

XFS is a high-performance 64-bit journaling file system created by Silicon Graphics, Inc (SGI) in 1993. It was the default file system in SGI's IRIX operating system starting with its version 5.3. XFS was ported to the Linux kernel in 2001; as of June 2014, XFS is supported by most Linux distributions; Red Hat Enterprise Linux uses it as its default file system.

ReiserFS is a general-purpose, journaling file system initially designed and implemented by a team at Namesys led by Hans Reiser and licensed under GPLv2. Introduced in version 2.4.1 of the Linux kernel, it was the first journaling file system to be included in the standard kernel. ReiserFS was the default file system in Novell's SUSE Linux Enterprise until Novell decided to move to ext3 on October 12, 2006, for future releases.

The ext2 or second extended file system is a file system for the Linux kernel. It was initially designed by French software developer Rémy Card as a replacement for the extended file system (ext). Having been designed according to the same principles as the Berkeley Fast File System from BSD, it was the first commercial-grade filesystem for Linux.

ext3, or third extended filesystem, is a journaled file system that is commonly used by the Linux kernel. It used to be the default file system for many popular Linux distributions. Stephen Tweedie first revealed that he was working on extending ext2 in Journaling the Linux ext2fs Filesystem in a 1998 paper, and later in a February 1999 kernel mailing list posting. The filesystem was merged with the mainline Linux kernel in November 2001 from 2.4.15 onward. Its main advantage over ext2 is journaling, which improves reliability and eliminates the need to check the file system after an unclean shutdown. Its successor is ext4.

Journaled File System (JFS) is a 64-bit journaling file system created by IBM. There are versions for AIX, OS/2, eComStation, ArcaOS and Linux operating systems. The latter is available as free software under the terms of the GNU General Public License (GPL). HP-UX has another, different filesystem named JFS that is actually an OEM version of Veritas Software's VxFS.

The Unix file system (UFS) is a family of file systems supported by many Unix and Unix-like operating systems. It is a distant descendant of the original filesystem used by Version 7 Unix.

Reiser4 is a computer file system, successor to the ReiserFS file system, developed from scratch by Namesys and sponsored by DARPA as well as Linspire. Reiser4 was named after its former lead developer Hans Reiser. As of 2021, the Reiser4 patch set is still being maintained, but according to Phoronix, it is unlikely to be merged into mainline Linux without corporate backing.

The inode is a data structure in a Unix-style file system that describes a file-system object such as a file or a directory. Each inode stores the attributes and disk block locations of the object's data. File-system object attributes may include metadata, as well as owner and permission data.

In computing, the Global File System 2 or GFS2 is a shared-disk file system for Linux computer clusters. GFS2 allows all members of a cluster to have direct concurrent access to the same shared block storage, in contrast to distributed file systems which distribute data throughout the cluster. GFS2 can also be used as a local file system on a single computer.

A disk quota is a limit set by a system administrator that restricts certain aspects of file system usage on modern operating systems. The function of using disk quotas is to allocate limited disk space in a reasonable way.

The Minix file system is the native file system of the Minix operating system. It was written from scratch by Andrew S. Tanenbaum in the 1980s and aimed to replicate the structure of the Unix File System while omitting complex features, and was intended to be a teaching aid. It largely fell out of favour among Linux users by 1994 due to the popularity of other filesystems - most notably ext2 - and its lack of features, including limited partition sizes and filename length limits.

<span class="mw-page-title-main">Theodore Ts'o</span> American computer scientist, free software developer

Theodore (Ted) Yue Tak Ts'o (曹子德) is an American software engineer mainly known for his contributions to the Linux kernel, in particular his contributions to file systems. He is the Secondary developer and maintainer of e2fsprogs, the userspace utilities for the ext2, ext3, and ext4 filesystems, and is a maintainer for the ext4 file system.

The following tables compare general and technical information for a number of file systems.

ext4 is a journaling file system for Linux, developed as the successor to ext3.

chattr is the command in Linux that allows a user to set certain attributes of a file. lsattr is the command that displays the attributes of a file.

Btrfs is a computer storage format that combines a file system based on the copy-on-write (COW) principle with a logical volume manager, developed together. It was initially designed at Oracle Corporation in 2007 for use in Linux, and since November 2013, the file system's on-disk format has been declared stable in the Linux kernel. According to Oracle, Btrfs "is not a true acronym".

An HTree is a specialized tree data structure for directory indexing, similar to a B-tree. They are constant depth of either one or two levels, have a high fanout factor, use a hash of the filename, and do not require balancing. The HTree algorithm is distinguished from standard B-tree methods by its treatment of hash collisions, which may overflow across multiple leaf and index blocks. HTree indexes are used in the ext3 and ext4 Linux filesystems, and were incorporated into the Linux kernel around 2.5.40. HTree indexing improved the scalability of Linux ext2 based filesystems from a practical limit of a few thousand files, into the range of tens of millions of files per directory.

A journaling file system is a file system that keeps track of changes not yet committed to the file system's main part by recording the goal of such changes in a data structure known as a "journal", which is usually a circular log. In the event of a system crash or power failure, such file systems can be brought back online more quickly with a lower likelihood of becoming corrupted.

Next3 is a journaling file system for Linux based on ext3 which adds snapshots support, yet retains compatibility to the ext3 on-disk format. Next3 is implemented as open-source software, licensed under the GPL license.

References

  1. Grigoriy Orlov. "Directory Allocation Algorithm For FFS". Archived from the original on 2008-01-31.
  2. Recent Filesystem Optimisations in FreeBSD
  3. Bert Hubert, Naive but spectacular ext3 HTREE+Orlov benchmark