Deadline scheduler

Last updated

The Deadline Scheduler is an I/O scheduler, or disk scheduler, for the Linux kernel. It was written in 2002 by Jens Axboe.

Contents

Overview

The main purpose of the Deadline Scheduler is to guarantee a start service time for a request. [1] The scheduler imposes a deadline on all I/O operations to prevent starvation of requests, and in addition to the sorted queues (both read and write), it maintains two deadline queues. Deadline queues are here sorted by their expiration time, while the sorted queues are sorted by the sector number.

Before serving the next request, the deadline scheduler decides which queue to use. Read queues are given a higher priority, because processes usually block on read operations. Next, the deadline scheduler checks if the first request in the deadline queue has expired. Otherwise, the scheduler serves a batch of requests from the sorted queue. In both cases, the scheduler also serves a batch of requests following the chosen request in the sorted queue.

By default, read requests have an expiration time of 500 ms, and write requests expire in 5 seconds.

A rough version of the scheduler was published on the Linux Kernel Mailing List by Axboe in January 2002. [2]

Measurements have shown that the deadline I/O scheduler outperforms the CFQ I/O scheduler for certain multi-threaded workloads. [3]

SysFS Tunables

fifo_batch (integer)

Deadline executes I/O Operations (IOPs) through the concept of "batches" which are sets of operations ordered in terms of increasing sector number. This tunable determines how big a batch will have to be before the requests are queued to the disk (barring expiration of a currently-being-built batch). Smaller batches can reduce latency by ensuring new requests are executed sooner (rather than possibly waiting for more requests to come in), but may degrade overall throughput by increasing the overall movement of drive heads (since sequencing happens within a batch and not between them). Additionally, if the number of IOPs is high enough the batches will be executed in a timely fashion anyway.

read_expire (integer)

The ‘read_expire’ time is the maximum time in milliseconds after which the read is considered ‘expired’. Think of this more like the expiration date on a milk carton. The milk is best used before the expiration date. The same with the deadline scheduler. It will not attempt to make sure all I/O is issued before its expiration date. However, if the IO is past expiration, then it gets a bump in priority.

The read expiration queue is only checked when the deadline scheduler re-evaluates read queue. For reads this means every time a sorted read is dispatched except for the case of streaming I/O. While the scheduler is streaming I/O from the read queue, the read expired is not evaluated. When re-evaluating the read queue, the logic is

check for expired reads (look at head of FIFO [time ordered] queue) check to see if cached read pointer valid (so even if not streaming, the cached pointer still takes precedence so the sorted queue is traversed tip to tail in a sweep) pick up the first read from the sorted queue (start at the tip again for another sweep) If there are expired reads, then the first one is pulled from the FIFO. Note that this expired read then is the new nexus for read sort ordering. The cached next pointer will be set to point to the next I/O from the sort queue after this expired one…. The thing to note is that the algorithm doesn’t just execute all expired I/O once they are past their expiration date. This allows some reasonable performance to be maintained by batching up ‘write_starved’ sorted reads together before checking the expired read queue again.

So, the maximum number of I/O that can be performed between read expired I/O is 2 * 'fifo_batch' * 'writes_starved'. One set of ‘fifo_batch’ streaming reads after the first expired read I/O and if this stream happened to cause the write starved condition, then possibly another ‘fifo_batch’ streaming writes. This is worse case, after which the read expired queue would be re-evaluated. At best, the expired read queue will be evaluated ‘write_starved’ times in a row before being skipped because the write queue would be used.

write_expire (integer)

Identical to read_expire but for write operations (grouped into separate batches from reads).

writes_starved (integer)

As stated previously, deadline prefers reads to writes. As a consequence, this can lead to situations where the operations are executed are almost entirely reads. This becomes more of an important tunable as write_expire is elongated or overall bandwidth approaches saturation. Decreasing this gives more bandwidth to writes (relatively speaking) at the expense of read operations. If application workload, however, is read-heavy (for example most HTTP or directory servers) with only an occasional write, decreased latency of average IOPs may be achieved by increasing this (so that more reads must be performed before a write batch is queued to disk).

front_merges (bool integer)

A "front merge" is an operation where the I/O Scheduler, seeking to condense (or "merge") smaller requests into fewer (larger) operations, will take a new operation then examine the active batch and attempt to locate operations where the beginning sector is the same or immediately after another operation's beginning sector. A "back merge" is the opposite, where ending sectors in the active batch are searched for sectors that are either the same or immediately after the current operation's beginning sectors. Merging diverts operations from the current batch to the active one, decreasing "fairness" in order to increase throughput.

Due to the way files are typically laid out, back merges are much more common than front merges. For some workloads, you may even know that it is a waste of time to spend any time attempting to front merge requests. Setting front_merges to 0 disables this functionality. Front merges may still occur due to the cached last_merge hint, but since that comes at basically zero cost, it is still performed. This boolean simply disables front sector lookup when the I/O scheduler merging function is called. Disk merge totals are recorded per-block device in /proc/diskstats. [1]

Other I/O schedulers

Related Research Articles

In computing, a context switch is the process of storing the state of a process or thread, so that it can be restored and resume execution at a later point, and then restoring a different, previously saved, state. This allows multiple processes to share a single central processing unit (CPU), and is an essential feature of a multiprogramming or multitasking operating system. In a traditional CPU, each process - a program in execution - utilizes the various CPU registers to store data and hold the current state of the running process. However, in a multitasking operating system, the operating system switches between processes or threads to allow the execution of multiple processes simultaneously. For every switch, the operating system must save the state of the currently running process, followed by loading the next process state, which will run on the CPU. This sequence of operations that stores the state of the running process and the loading of the following running process is called a context switch.

XFS is a high-performance 64-bit journaling file system created by Silicon Graphics, Inc (SGI) in 1993. It was the default file system in SGI's IRIX operating system starting with its version 5.3. XFS was ported to the Linux kernel in 2001; as of June 2014, XFS is supported by most Linux distributions; Red Hat Enterprise Linux uses it as its default file system.

In computing, scheduling is the action of assigning resources to perform tasks. The resources may be processors, network links or expansion cards. The tasks may be threads, processes or data flows.

Completely Fair Queuing (CFQ) is an I/O scheduler for the Linux kernel which was written in 2003 by Jens Axboe.

In a computer operating system that uses paging for virtual memory management, page replacement algorithms decide which memory pages to page out, sometimes called swap out, or write to disk, when a page of memory needs to be allocated. Page replacement happens when a requested page is not in memory and a free page cannot be used to satisfy the allocation, either because there are none, or because the number of free pages is lower than some threshold.

<span class="mw-page-title-main">Native Command Queuing</span>

In computing, Native Command Queuing (NCQ) is an extension of the Serial ATA protocol allowing hard disk drives to internally optimize the order in which received read and write commands are executed. This can reduce the amount of unnecessary drive head movement, resulting in increased performance for workloads where multiple simultaneous read/write requests are outstanding, most often occurring in server-type applications.

sync is a standard system call in the Unix operating system, which commits all data from the kernel filesystem buffers to non-volatile storage, i.e., data which has been scheduled for writing via low-level I/O system calls. Higher-level I/O layers such as stdio may maintain separate buffers of their own.

Anticipatory scheduling is an algorithm for scheduling hard disk input/output. It seeks to increase the efficiency of disk utilization by "anticipating" future synchronous read operations.

<span class="mw-page-title-main">Solid-state drive</span> Computer storage device with no moving parts

A solid-state drive (SSD) is a solid-state storage device. It provides persistent data storage using no moving parts. It is sometimes called semiconductor storage device or solid-state device; it is also called solid-state disk because it is frequently interfaced to a host system as a hard disk drive.

<span class="mw-page-title-main">Disk buffer</span>

In computer storage, a disk buffer is the embedded memory in a hard disk drive (HDD) or solid state drive (SSD) acting as a buffer between the rest of the computer and the physical hard disk platter or flash memory that is used for storage. Modern hard disk drives come with 8 to 256 MiB of such memory, and solid-state drives come with up to 4 GB of cache memory.

Jens Axboe is a Linux kernel hacker.

<span class="mw-page-title-main">I/O scheduling</span> Arbiter for mass storage access in an operating system

Input/output (I/O) scheduling is the method that computer operating systems use to decide in which order I/O operations will be submitted to storage volumes. I/O scheduling is sometimes called disk scheduling.

<span class="mw-page-title-main">Completely Fair Scheduler</span> Linux process scheduler

The Completely Fair Scheduler (CFS) was a process scheduler that was merged into the 2.6.23 release of the Linux kernel. It was the default scheduler of the tasks of the SCHED_NORMAL class and handled CPU resource allocation for executing processes, aiming to maximize overall CPU utilization while also maximizing interactive performance.

<span class="mw-page-title-main">Noop scheduler</span> Simple I/O scheduler for the Linux kernel

The NOOP scheduler is the simplest I/O scheduler for the Linux kernel. This scheduler was developed by Jens Axboe.

A trim command allows an operating system to inform a solid-state drive (SSD) which blocks of data are no longer considered to be "in use" and therefore can be erased internally.

OS 2200 is the operating system for the Unisys ClearPath Dorado family of mainframe systems. The operating system kernel of OS 2200 is a lineal descendant of Exec 8 for the UNIVAC 1108 and was previously known as OS 1100. Documentation and other information on current and past Unisys systems can be found on the Unisys public support website.

<span class="mw-page-title-main">Brain Fuck Scheduler</span> Process scheduler in Linux

The Brain Fuck Scheduler (BFS) is a process scheduler designed for the Linux kernel in August 2009 based on earliest eligible virtual deadline first scheduling (EEVDF), as an alternative to the Completely Fair Scheduler (CFS) and the O(1) scheduler. BFS was created by Con Kolivas.

bcache is a cache in the Linux kernel's block layer, which is used for accessing secondary storage devices. It allows one or more fast storage devices, such as flash-based solid-state drives (SSDs), to act as a cache for one or more slower storage devices, such as hard disk drives (HDDs); this effectively creates hybrid volumes and provides performance improvements.

dm-cache is a component of the Linux kernel's device mapper, which is a framework for mapping block devices onto higher-level virtual block devices. It allows one or more fast storage devices, such as flash-based solid-state drives (SSDs), to act as a cache for one or more slower storage devices such as hard disk drives (HDDs); this effectively creates hybrid volumes and provides secondary storage performance improvements.

io_uring is a Linux kernel system call interface for storage device asynchronous I/O operations addressing performance issues with similar interfaces provided by functions like read /write or aio_read /aio_write etc. for operations on data accessed by file descriptors.

References

  1. 1 2 Jens Axboe (11 November 2002). "Deadline I/O scheduler tunables". Linux kernel documentation. Retrieved 20 November 2011.
  2. Jens Axboe (4 January 2002). "[PATCH][RFT] simple deadline I/O scheduler". Linux Kernel Mailing List Archive. Retrieved 6 July 2014.
  3. IBM (12 September 2013). "Kernel Virtual Machine (KVM) Best practices for KVM" (PDF). IBM. Archived from the original (PDF) on May 13, 2016. Retrieved 6 July 2014.