Completely Fair Scheduler

Original author(s): Ingo Molnár
Developer(s): Linux kernel developers
Written in: C
Operating system: Linux kernel
Type: process scheduler
License: GPL-2.0
Website: kernel.org
Location of the Completely Fair Scheduler (a process scheduler) in a simplified structure of the Linux kernel.

The Completely Fair Scheduler (CFS) is a process scheduler that was merged into the 2.6.23 (October 2007) release of the Linux kernel and is the default scheduler for tasks of the SCHED_NORMAL class (i.e., tasks that have no real-time execution constraints). It handles CPU resource allocation for executing processes, and aims to maximize overall CPU utilization while also maximizing interactive performance.


In contrast to the previous O(1) scheduler used in older Linux 2.6 kernels, which maintained and switched run queues of active and expired tasks, the CFS implementation is based on per-CPU run queues whose nodes are schedulable entities kept sorted by execution time in red–black trees. CFS does away with the old notion of per-priority fixed time slices and instead aims to give each task (or, more precisely, each schedulable entity) a fair share of CPU time. [1] [2]

Algorithm

A task (in this context, a synonym for thread) is the minimal entity that Linux can schedule. However, the scheduler can also manage groups of threads, whole multi-threaded processes, and even all the processes of a given user. This design leads to the concept of schedulable entities, where tasks are grouped and managed by the scheduler as a whole. For this design to work, each task_struct task descriptor embeds a field of type sched_entity that represents the set of entities the task belongs to.
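
A simplified sketch of these structures, showing only the fields relevant here (the real kernel definitions contain many more):

    /* Simplified excerpt of the kernel's scheduling structures;
     * most fields are omitted. */
    struct sched_entity {
            struct rb_node  run_node;   /* node in the cfs_rq red-black tree */
            u64             vruntime;   /* execution time received so far, in ns */
            /* ... */
    };

    struct task_struct {
            /* ... */
            struct sched_entity se;     /* the entity this task is scheduled as */
            /* ... */
    };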

Each per-CPU run queue of type cfs_rq sorts sched_entity structures in a time-ordered fashion into a red–black tree (or 'rbtree' in Linux lingo), where the leftmost node is occupied by the entity that has received the smallest slice of execution time (saved in the vruntime field of the entity). The nodes are indexed by processor "execution time" in nanoseconds. [3]
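
Ordering within the tree is determined by comparing vruntime values. A sketch modeled on the kernel's entity_before() helper:

    /* Returns true if entity a should run before entity b.
     * The signed subtraction keeps the comparison correct even
     * if the unsigned vruntime counters wrap around. */
    static inline bool entity_before(struct sched_entity *a,
                                     struct sched_entity *b)
    {
            return (s64)(a->vruntime - b->vruntime) < 0;
    }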

A "maximum execution time" is also calculated for each process, representing the time the process would have been expected to run on an "ideal processor": the time the process has been waiting to run, divided by the total number of runnable processes.
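
As a hypothetical illustration (not the kernel's actual code), this expectation amounts to dividing the accumulated wait by the number of runnable processes; for example, a task that has waited 30 ms alongside two other runnable tasks would expect roughly 10 ms of CPU time:

    /* Hypothetical helper illustrating the "ideal processor" expectation:
     * wait_ns is how long the task has been waiting to run, and
     * nr_running is the number of runnable processes. */
    static inline u64 expected_runtime(u64 wait_ns, unsigned int nr_running)
    {
            return nr_running ? wait_ns / nr_running : wait_ns;
    }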

When the scheduler is invoked to run a new process, it proceeds as follows (a code sketch follows the list):

  1. The leftmost node of the scheduling tree is chosen (as it will have the lowest spent execution time), and sent for execution.
  2. If the process simply completes execution, it is removed from the system and scheduling tree.
  3. If the process reaches its maximum execution time or is otherwise stopped (voluntarily or via interrupt) it is reinserted into the scheduling tree based on its newly spent execution time.
  4. The new leftmost node will then be selected from the tree, repeating the iteration.
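
A heavily simplified sketch of this cycle (the function names here are illustrative; the real logic lives in kernel/sched/fair.c and handles many more cases):

    /* Hypothetical, heavily simplified pick/requeue cycle. */
    struct sched_entity *pick_leftmost(struct cfs_rq *cfs_rq)
    {
            /* Leftmost node = smallest vruntime = least CPU time received. */
            struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

            return left ? rb_entry(left, struct sched_entity, run_node) : NULL;
    }

    void requeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
                        u64 delta_exec)
    {
            se->vruntime += delta_exec;   /* charge the time it just ran */
            /* Reinsert into the tree at the position given by the new
             * vruntime; a task that has exited is removed instead. */
    }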

If a process spends much of its time sleeping, its spent execution time is low, so it automatically receives a priority boost when it finally needs the CPU. Hence such tasks do not receive less processor time than tasks that are constantly running.

The complexity of inserting a node into the cfs_rq run queue of the CFS scheduler is O(log N), where N is the total number of entities. Choosing the next entity to run is done in constant time because the leftmost node is always cached.
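
The caching relies on the kernel's rb_root_cached type, which keeps a pointer to the leftmost node alongside the tree root (simplified from the kernel's rbtree API):

    /* Simplified from include/linux/rbtree_types.h and rbtree.h:
     * the cached leftmost pointer turns "find the entity with the
     * smallest vruntime" into an O(1) lookup instead of a tree walk. */
    struct rb_root_cached {
            struct rb_root rb_root;
            struct rb_node *rb_leftmost;
    };

    #define rb_first_cached(root)   ((root)->rb_leftmost)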

History

Con Kolivas's work on scheduling, most significantly his implementation of "fair scheduling" named Rotating Staircase Deadline, inspired Ingo Molnár to develop CFS as a replacement for the earlier O(1) scheduler; Molnár credited Kolivas in his announcement. [4] CFS is an implementation of a well-studied, classic scheduling algorithm called weighted fair queuing. [5] Originally invented for packet networks, fair queuing had previously been applied to CPU scheduling under the name stride scheduling. CFS was the first implementation of a fair queuing process scheduler to be widely used in a general-purpose operating system. [5]

The Linux kernel received a patch for CFS in November 2010 for the 2.6.38 kernel that made the scheduler "fairer" for use on desktops and workstations. Developed by Mike Galbraith using ideas suggested by Linus Torvalds, the patch implements a feature called autogrouping that significantly boosts interactive desktop performance. [6] The algorithm puts parent processes in the same task group as child processes. [7] (Task groups are tied to sessions created via the setsid() system call. [8] ) This solved the problem of slow interactive response times on multi-core and multi-CPU (SMP) systems that were also running many CPU-intensive threads in other tasks. A simple explanation is that, with this patch applied, one can still watch a video, read email, and perform other typical desktop activities without glitches or choppiness while, say, compiling the Linux kernel or encoding video.
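
For illustration, a minimal userspace sketch of how a new session (and therefore, with autogrouping enabled via the kernel.sched_autogroup_enabled sysctl, a new task group) comes into being through the standard setsid() call:

    /* Minimal userspace illustration: the forked child starts a new
     * session with setsid(); with autogrouping enabled, that session
     * becomes its own task group, so its CPU-bound threads compete
     * with the rest of the desktop as a single unit. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            if (fork() == 0) {                 /* child */
                    if (setsid() == (pid_t)-1) /* new session */
                            perror("setsid");
                    /* ... run CPU-intensive work here ... */
                    _exit(0);
            }
            return 0;                          /* parent */
    }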

In 2016, the Linux scheduler was patched for better multicore performance, based on the suggestions outlined in the paper, "The Linux Scheduler: A Decade of Wasted Cores". [9]

In 2023, a new scheduler based on earliest eligible virtual deadline first scheduling (EEVDF) was being readied to replace CFS. [10] The motivation was to remove the need for CFS "latency nice" patches. [11]


References

  1. Love, Robert (2010). Linux Kernel Development (3rd ed.). Addison-Wesley. pp. 41–61. ISBN 9780672329463.
  2. "Linux: The Completely Fair Scheduler | KernelTrap". 2007-04-19. Archived from the original on 2007-04-19. Retrieved 2021-05-16.
  3. CFS description at ibm.com
  4. Molnár, Ingo (2007-04-13). "[patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]". linux-kernel (Mailing list).
  5. Li, T.; Baumberger, D.; Hahn, S. (2009). "Efficient and scalable multiprocessor fair scheduling using distributed weighted round-robin" (PDF). ACM SIGPLAN Notices. 44 (4): 65. CiteSeerX 10.1.1.567.2170. doi:10.1145/1594835.1504188.
  6. The ~200 Line Linux Kernel Patch That Does Wonders
  7. Galbraith, Mike (2010-11-15). "[RFC/RFT PATCH v3] sched: automated per tty task groups [CFS]". linux-kernel (Mailing list).
  8. Galbraith, Mike (2010-11-20). "[PATCH v4] sched: automated per session task groups". linux-kernel (Mailing list).
  9. Lozi, Jean-Pierre; Lepers, Baptiste; Funston, Justin; Gaud, Fabien; Quéma, Vivien; Fedorova, Alexandra. "The Linux Scheduler: A Decade of Wasted Cores" (PDF). EuroSys 2016. doi:10.1145/2901318.2901326. Retrieved 15 June 2019.
  10. "EEVDF Scheduler May Be Ready For Landing With Linux 6.6". www.phoronix.com. Retrieved 2023-07-27.
  11. "An EEVDF CPU scheduler for Linux [LWN.net]". lwn.net. Retrieved 2023-07-27.