Protection ring

Last updated • 14 min readFrom Wikipedia, The Free Encyclopedia

Privilege rings for the x86 available in protected mode Priv rings.svg
Privilege rings for the x86 available in protected mode

In computer science, hierarchical protection domains, [1] [2] often called protection rings, are mechanisms to protect data and functionality from faults (by improving fault tolerance) and malicious behavior (by providing computer security).

Contents

Computer operating systems provide different levels of access to resources. A protection ring is one of two or more hierarchical levels or layers of privilege within the architecture of a computer system. This is generally hardware-enforced by some CPU architectures that provide different CPU modes at the hardware or microcode level. Rings are arranged in a hierarchy from most privileged (most trusted, usually numbered zero) to least privileged (least trusted, usually with the highest ring number). On most operating systems, Ring 0 is the level with the most privileges and interacts most directly with the physical hardware such as certain CPU functionality (e.g. the control registers) and I/O controllers.

Special mechanisms are provided to allow an outer ring to access an inner ring's resources in a predefined manner, as opposed to allowing arbitrary usage. Correctly gating access between rings can improve security by preventing programs from one ring or privilege level from misusing resources intended for programs in another. For example, spyware running as a user program in Ring 3 should be prevented from turning on a web camera without informing the user, since hardware access should be a Ring 1 function reserved for device drivers. Programs such as web browsers running in higher numbered rings must request access to the network, a resource restricted to a lower numbered ring.

X86S, an Intel architecture published in 2024, has only ring 0 and ring 3. Ring 1 and 2 will be removed under X86S since modern OSes never utilize them. [3]

Implementations

Multiple rings of protection were among the most revolutionary concepts introduced by the Multics operating system, a highly secure predecessor of today's Unix family of operating systems. The GE 645 mainframe computer did have some hardware access control, including the same two modes that the other GE-600 series machines had, and segment-level permissions in its memory management unit ("Appending Unit"), but that was not sufficient to provide full support for rings in hardware, so Multics supported them by trapping ring transitions in software; [4] its successor, the Honeywell 6180, implemented them in hardware, with support for eight rings; [5] Protection rings in Multics were separate from CPU modes; code in all rings other than ring 0, and some ring 0 code, ran in slave mode. [6]

However, most general-purpose systems use only two rings, even if the hardware they run on provides more CPU modes than that. For example, Windows 7 and Windows Server 2008 (and their predecessors) use only two rings, with ring 0 corresponding to kernel mode and ring 3 to user mode, [7] because earlier versions of Windows NT ran on processors that supported only two protection levels. [8]

Many modern CPU architectures (including the popular Intel x86 architecture) include some form of ring protection, although the Windows NT operating system, like Unix, does not fully utilize this feature. OS/2 does, to some extent, use three rings: [9] ring 0 for kernel code and device drivers, ring 2 for privileged code (user programs with I/O access permissions), and ring 3 for unprivileged code (nearly all user programs). Under DOS, the kernel, drivers and applications typically run on ring 3 (however, this is exclusive to the case where protected-mode drivers or DOS extenders are used; as a real-mode OS, the system runs with effectively no protection), whereas 386 memory managers such as EMM386 run at ring 0. In addition to this, DR-DOS' EMM386 3.xx can optionally run some modules (such as DPMS) on ring 1 instead. OpenVMS uses four modes called (in order of decreasing privileges) Kernel, Executive, Supervisor and User.

While x86 has 4 protection rings, it is more common for architectures to only have two. Even on x86, most operating systems only use ring 0 and 3. Most common protection rings.svg
While x86 has 4 protection rings, it is more common for architectures to only have two. Even on x86, most operating systems only use ring 0 and 3.

A renewed interest in this design structure came with the proliferation of the Xen VMM software, ongoing discussion on monolithic vs. micro-kernels (particularly in Usenet newsgroups and Web forums), Microsoft's Ring-1 design structure as part of their NGSCB initiative, and hypervisors based on x86 virtualization such as Intel VT-x (formerly Vanderpool).

The original Multics system had eight rings, but many modern systems have fewer. The hardware remains aware of the current ring of the executing instruction thread at all times, with the help of a special machine register. In some systems, areas of virtual memory are instead assigned ring numbers in hardware. One example is the Data General Eclipse MV/8000, in which the top three bits of the program counter (PC) served as the ring register. Thus code executing with the virtual PC set to 0xE200000, for example, would automatically be in ring 7, and calling a subroutine in a different section of memory would automatically cause a ring transfer.

The hardware severely restricts the ways in which control can be passed from one ring to another, and also enforces restrictions on the types of memory access that can be performed across rings. Using x86 as an example, there is a special[ clarification needed ]gate structure which is referenced by the call instruction that transfers control in a secure way[ clarification needed ] towards predefined entry points in lower-level (more trusted) rings; this functions as a supervisor call in many operating systems that use the ring architecture. The hardware restrictions are designed to limit opportunities for accidental or malicious breaches of security. In addition, the most privileged ring may be given special capabilities (such as real memory addressing that bypasses the virtual memory hardware).

ARM version 7 architecture implements three privilege levels: application (PL0), operating system (PL1), and hypervisor (PL2). Unusually, level 0 (PL0) is the least-privileged level, while level 2 is the most-privileged level. [10] ARM version 8 implements four exception levels: application (EL0), operating system (EL1), hypervisor (EL2), and secure monitor / firmware (EL3), for AArch64 [11] :D1-2454 and AArch32. [11] :G1-6013

Ring protection can be combined with processor modes (master/kernel/privileged/supervisor mode versus slave/unprivileged/user mode) in some systems. Operating systems running on hardware supporting both may use both forms of protection or only one.

Effective use of ring architecture requires close cooperation between hardware and the operating system.[ why? ] Operating systems designed to work on multiple hardware platforms may make only limited use of rings if they are not present on every supported platform. Often the security model is simplified to "kernel" and "user" even if hardware provides finer granularity through rings.

Modes

Supervisor mode

In computer terms, supervisor mode is a hardware-mediated flag that can be changed by code running in system-level software. System-level tasks or threads may [a] have this flag set while they are running, whereas user-level applications will not. This flag determines whether it would be possible to execute machine code operations such as modifying registers for various descriptor tables, or performing operations such as disabling interrupts. The idea of having two different modes to operate in comes from "with more power comes more responsibility"  a program in supervisor mode is trusted never to fail, since a failure may cause the whole computer system to crash.

Supervisor mode is "an execution mode on some processors which enables execution of all instructions, including privileged instructions. It may also give access to a different address space, to memory management hardware and to other peripherals. This is the mode in which the operating system usually runs." [12]

In a monolithic kernel, the operating system runs in supervisor mode and the applications run in user mode. Other types of operating systems, like those with an exokernel or microkernel, do not necessarily share this behavior.

Some examples from the PC world:

Most processors have at least two different modes. The x86-processors have four different modes divided into four different rings. Programs that run in Ring 0 can do anything with the system, and code that runs in Ring 3 should be able to fail at any time without impact to the rest of the computer system. Ring 1 and Ring 2 are rarely used, but could be configured with different levels of access.

In most existing systems, switching from user mode to kernel mode has an associated high cost in performance. It has been measured, on the basic request getpid , to cost 1000–1500 cycles on most machines. Of these just around 100 are for the actual switch (70 from user to kernel space, and 40 back), the rest is "kernel overhead". [13] [14] In the L3 microkernel, the minimization of this overhead reduced the overall cost to around 150 cycles. [13]

Maurice Wilkes wrote: [15]

... it eventually became clear that the hierarchical protection that rings provided did not closely match the requirements of the system programmer and gave little or no improvement on the simple system of having two modes only. Rings of protection lent themselves to efficient implementation in hardware, but there was little else to be said for them. [...] The attractiveness of fine-grained protection remained, even after it was seen that rings of protection did not provide the answer... This again proved a blind alley...

To gain performance and determinism, some systems place functions that would likely be viewed as application logic, rather than as device drivers, in kernel mode; security applications (access control, firewalls, etc.) and operating system monitors are cited as examples. At least one embedded database management system, eXtremeDB Kernel Mode, has been developed specifically for kernel mode deployment, to provide a local database for kernel-based application functions, and to eliminate the context switches that would otherwise occur when kernel functions interact with a database system running in user mode. [16]

Functions are also sometimes moved across rings in the other direction. The Linux kernel, for instance, injects into processes a vDSO section which contains functions that would normally require a system call, i.e. a ring transition. Instead of doing a syscall these functions use static data provided by the kernel. This avoids the need for a ring transition and so is more lightweight than a syscall. The function gettimeofday can be provided this way.

Hypervisor mode

Recent CPUs from Intel and AMD offer x86 virtualization instructions for a hypervisor to control Ring 0 hardware access. Although they are mutually incompatible, both Intel VT-x (codenamed "Vanderpool") and AMD-V (codenamed "Pacifica") allow a guest operating system to run Ring 0 operations natively without affecting other guests or the host OS.

Before hardware-assisted virtualization, guest operating systems ran under ring 1. Any attempt that requires a higher privilege level to perform (ring 0) will produce an interrupt and then be handled using software; this is called "Trap and Emulate".

To assist virtualization and reduce overhead caused by the reason above, VT-x and AMD-V allow the guest to run under Ring 0. VT-x introduces VMX Root/Non-root Operation: The hypervisor runs in VMX Root Operation mode, possessing the highest privilege. Guest OS runs in VMX Non-Root Operation mode, which allows them to operate at ring 0 without having actual hardware privileges. VMX non-root operation and VMX transitions are controlled by a data structure called a virtual-machine control. [17] These hardware extensions allow classical "Trap and Emulate" virtualization to perform on x86 architecture but now with hardware support.

Privilege level

A privilege level in the x86 instruction set controls the access of the program currently running on the processor to resources such as memory regions, I/O ports, and special instructions. There are 4 privilege levels ranging from 0 which is the most privileged, to 3 which is least privileged. Most modern operating systems use level 0 for the kernel/executive, and use level 3 for application programs. Any resource available to level n is also available to levels 0 to n, so the privilege levels are rings. When a lesser privileged process tries to access a higher privileged process, a general protection fault exception is reported to the OS.

It is not necessary to use all four privilege levels. Current operating systems with wide market share including Microsoft Windows, macOS, Linux, iOS and Android mostly use a paging mechanism with only one bit to specify the privilege level as either Supervisor or User (U/S Bit). Windows NT uses the two-level system. [18] The real mode programs in 8086 are executed at level 0 (highest privilege level) whereas virtual mode in 8086 executes all programs at level 3. [19]

Potential future uses for the multiple privilege levels supported by the x86 ISA family include containerization and virtual machines. A host operating system kernel could use instructions with full privilege access (kernel mode), whereas applications running on the guest OS in a virtual machine or container could use the lowest level of privileges in user mode. The virtual machine and guest OS kernel could themselves use an intermediate level of instruction privilege to invoke and virtualize kernel-mode operations such as system calls from the point of view of the guest operating system. [20]

IOPL

The IOPL (I/O Privilege level) flag is a flag found on all IA-32 compatible x86 CPUs. It occupies bits 12 and 13 in the FLAGS register. In protected mode and long mode, it shows the I/O privilege level of the current program or task. The Current Privilege Level (CPL) (CPL0, CPL1, CPL2, CPL3) of the task or program must be less than or equal to the IOPL in order for the task or program to access I/O ports.

The IOPL can be changed using POPF(D) and IRET(D) only when the current privilege level is Ring 0.

Besides IOPL, the I/O Port Permissions in the TSS also take part in determining the ability of a task to access an I/O port.

Miscellaneous

In x86 systems, the x86 hardware virtualization (VT-x and SVM) is referred as "ring −1", the System Management Mode is referred as "ring −2", the Intel Management Engine and AMD Platform Security Processor are sometimes referred as "ring −3". [21]

Use of hardware features

Many CPU hardware architectures provide far more flexibility than is exploited by the operating systems that they normally run. Proper use of complex CPU modes requires very close cooperation between the operating system and the CPU, and thus tends to tie the OS to the CPU architecture. When the OS and the CPU are specifically designed for each other, this is not a problem (although some hardware features may still be left unexploited), but when the OS is designed to be compatible with multiple, different CPU architectures, a large part of the CPU mode features may be ignored by the OS. For example, the reason Windows uses only two levels (ring 0 and ring 3) is that some hardware architectures that were supported in the past (such as PowerPC or MIPS) implemented only two privilege levels. [7]

Multics was an operating system designed specifically for a special CPU architecture (which in turn was designed specifically for Multics), and it took full advantage of the CPU modes available to it. However, it was an exception to the rule. Today, this high degree of interoperation between the OS and the hardware is not often cost-effective, despite the potential advantages for security and stability.

Ultimately, the purpose of distinct operating modes for the CPU is to provide hardware protection against accidental or deliberate corruption of the system environment (and corresponding breaches of system security) by software. Only "trusted" portions of system software are allowed to execute in the unrestricted environment of kernel mode, and then, in paradigmatic designs, only when absolutely necessary. All other software executes in one or more user modes. If a processor generates a fault or exception condition in a user mode, in most cases system stability is unaffected; if a processor generates a fault or exception condition in kernel mode, most operating systems will halt the system with an unrecoverable error. When a hierarchy of modes exists (ring-based security), faults and exceptions at one privilege level may destabilize only the higher-numbered privilege levels. Thus, a fault in Ring 0 (the kernel mode with the highest privilege) will crash the entire system, but a fault in Ring 2 will only affect Rings 3 and beyond and Ring 2 itself, at most.

Transitions between modes are at the discretion of the executing thread when the transition is from a level of high privilege to one of low privilege (as from kernel to user modes), but transitions from lower to higher levels of privilege can take place only through secure, hardware-controlled "gates" that are traversed by executing special instructions or when external interrupts are received.

Microkernel operating systems attempt to minimize the amount of code running in privileged mode, for purposes of security and elegance, but ultimately sacrificing performance.

See also

Notes

  1. E.g., In IBM OS/360 through z/OS, some system tasks run in problem state key 0.

Related Research Articles

<span class="mw-page-title-main">Operating system</span> Software that manages computer hardware resources

An operating system (OS) is system software that manages computer hardware and software resources, and provides common services for computer programs.

Real mode, also called real address mode, is an operating mode of all x86-compatible CPUs. The mode gets its name from the fact that addresses in real mode always correspond to real locations in memory. Real mode is characterized by a 20-bit segmented memory address space and unlimited direct software access to all addressable memory, I/O addresses and peripheral hardware. Real mode provides no support for memory protection, multitasking, or code privilege levels.

A modern computer operating system usually uses virtual memory to provide separate address spaces or separate regions of a single address space, called user space and kernel space. Primarily, this separation serves to provide memory protection and hardware protection from malicious or errant software behaviour.

<span class="mw-page-title-main">System call</span> Way for programs to access kernel services

In computing, a system call is the programmatic way in which a computer program requests a service from the operating system on which it is executed. This may include hardware-related services, creation and execution of new processes, and communication with integral kernel services such as process scheduling. System calls provide an essential interface between a process and the operating system.

The Intel x86 computer instruction set architecture has supported memory segmentation since the original Intel 8086 in 1978. It allows programs to address more than 64 KB (65,536 bytes) of memory, the limit in earlier 80xx processors. In 1982, the Intel 80286 added support for virtual memory and memory protection; the original mode was renamed real mode, and the new version was named protected mode. The x86-64 architecture, introduced in 2003, has largely dropped support for segmentation in 64-bit mode.

In computing, protected mode, also called protected virtual address mode, is an operational mode of x86-compatible central processing units (CPUs). It allows system software to use features such as segmentation, virtual memory, paging and safe multi-tasking designed to increase an operating system's control over application software.

x86-64 64-bit version of x86 architecture

x86-64 is a 64-bit extension of the x86 instruction set architecture first announced in 1999. It introduces two new operating modes: 64-bit mode and compatibility mode, along with a new four-level paging mechanism.

Memory protection is a way to control memory access rights on a computer, and is a part of most modern instruction set architectures and operating systems. The main purpose of memory protection is to prevent a process from accessing memory that has not been allocated to it. This prevents a bug or malware within a process from affecting other processes, or the operating system itself. Protection may encompass all accesses to a specified area of memory, write accesses, or attempts to execute the contents of the area. An attempt to access unauthorized memory results in a hardware fault, e.g., a segmentation fault, storage violation exception, generally causing abnormal termination of the offending process. Memory protection for computer security includes additional techniques such as address space layout randomization and executable-space protection.

<span class="mw-page-title-main">General protection fault</span> Fault initiated by x86 processors due to an access violation

A general protection fault (GPF) in the x86 instruction set architectures (ISAs) is a fault initiated by ISA-defined protection mechanisms in response to an access violation caused by some running code, either in the kernel or a user program. The mechanism is first described in Intel manuals and datasheets for the Intel 80286 CPU, which was introduced in 1983; it is also described in section 9.8.13 in the Intel 80386 programmer's reference manual from 1986. A general protection fault is implemented as an interrupt. Some operating systems may also classify some exceptions not related to access violations, such as illegal opcode exceptions, as general protection faults, even though they have nothing to do with memory protection. If a CPU detects a protection violation, it stops executing the code and sends a GPF interrupt. In most cases, the operating system removes the failing process from the execution queue, signals the user, and continues executing other processes. If, however, the operating system fails to catch the general protection fault, i.e. another protection violation occurs before the operating system returns from the previous GPF interrupt, the CPU signals a double fault, stopping the operating system. If yet another failure occurs, the CPU is unable to recover; since 80286, the CPU enters a special halt state called "Shutdown", which can only be exited through a hardware reset. The IBM PC AT, the first PC-compatible system to contain an 80286, has hardware that detects the Shutdown state and automatically resets the CPU when it occurs. All descendants of the PC AT do the same, so in a PC, a triple fault causes an immediate system reset.

In the 80386 microprocessor and later, virtual 8086 mode allows the execution of real mode applications that are incapable of running directly in protected mode while the processor is running a protected mode operating system. It is a hardware virtualization technique that allowed multiple 8086 processors to be emulated by the 386 chip. It emerged from the painful experiences with the 80286 protected mode, which by itself was not suitable to run concurrent real-mode applications well. John Crawford developed the Virtual Mode bit at the register set, paving the way to this environment.

x86 virtualization is the use of hardware-assisted virtualization capabilities on an x86/x86-64 CPU.

<span class="mw-page-title-main">QEMU</span> Free virtualization and emulation software

The Quick Emulator (QEMU) is a free and open-source emulator that uses dynamic binary translation to emulate a computer's processor; that is, it translates the emulated binary codes to an equivalent binary format which is executed by the machine. It provides a variety of hardware and device models for the virtual machine, enabling it to run different guest operating systems. QEMU can be used with a Kernel-based Virtual Machine (KVM) to emulate hardware at near-native speeds. Additionally, it supports user-level processes, allowing applications compiled for one processor architecture to run on another.

In computer security, executable-space protection marks memory regions as non-executable, such that an attempt to execute machine code in these regions will cause an exception. It makes use of hardware features such as the NX bit, or in some cases software emulation of those features. However, technologies that emulate or supply an NX bit will usually impose a measurable overhead while using a hardware-supplied NX bit imposes no measurable overhead.

CPU modes are operating modes for the central processing unit of most computer architectures that place restrictions on the type and scope of operations that can be performed by instructions being executed by the CPU. For example, this design allows an operating system to run with more privileges than application software by running the operating systems and applications in different modes.

A call gate is a mechanism in Intel's x86 architecture for changing the privilege level of a process when it executes a predefined function call using a CALL FAR instruction.

The task state segment (TSS) is a structure on x86-based computers which holds information about a task. It is used by the operating system kernel for task management. Specifically, the following information is stored in the TSS:

<span class="mw-page-title-main">Kernel (operating system)</span> Core of a computer operating system

A kernel is a computer program at the core of a computer's operating system that always has complete control over everything in the system. The kernel is also responsible for preventing and mitigating conflicts between different processes. It is the portion of the operating system code that is always resident in memory and facilitates interactions between hardware and software components. A full kernel controls all hardware resources via device drivers, arbitrates conflicts between processes concerning such resources, and optimizes the utilization of common resources e.g. CPU & cache usage, file systems, and network sockets. On most systems, the kernel is one of the first programs loaded on startup. It handles the rest of startup as well as memory, peripherals, and input/output (I/O) requests from software, translating them into data-processing instructions for the central processing unit.

In information security, computer science, and other fields, the principle of least privilege (PoLP), also known as the principle of minimal privilege (PoMP) or the principle of least authority (PoLA), requires that in a particular abstraction layer of a computing environment, every module must be able to access only the information and resources that are necessary for its legitimate purpose.

Supervisor Mode Access Prevention (SMAP) is a feature of some CPU implementations such as the Intel Broadwell microarchitecture that allows supervisor mode programs to optionally set user-space memory mappings so that access to those mappings from supervisor mode will cause a trap. This makes it harder for malicious programs to "trick" the kernel into using instructions or data from a user-space program.

A system virtual machine is a virtual machine (VM) that provides a complete system platform and supports the execution of a complete operating system (OS). These usually emulate an existing architecture, and are built with the purpose of either providing a platform to run programs where the real hardware is not available for use, or of having multiple instances of virtual machines leading to more efficient use of computing resources, both in terms of energy consumption and cost effectiveness, or both. A VM was originally defined by Popek and Goldberg as "an efficient, isolated duplicate of a real machine".

References

  1. Karger, Paul A.; Herbert, Andrew J. (1984). An Augmented Capability Architecture to Support Lattice Security and Traceability of Access. 1984 IEEE Symposium on Security and Privacy. p. 2. doi:10.1109/SP.1984.10001. ISBN   0-8186-0532-4. S2CID   14788823.
  2. Binder, W. (2001). "Design and implementation of the J-SEAL2 mobile agent kernel". Proceedings 2001 Symposium on Applications and the Internet. pp. 35–42. doi:10.1109/SAINT.2001.905166. ISBN   0-7695-0942-8. S2CID   11066378.
  3. "Envisioning a Simplified Intel Architecture for the Future". Intel. Retrieved 28 May 2024.
  4. "A Hardware Architecture for Implementing Protection Rings". Communications of the ACM . 15 (3). March 1972. Retrieved 27 September 2012.
  5. "Multics Glossary - ring" . Retrieved 27 September 2012.
  6. The Multics Virtual Memory, part 2 (PDF). Honeywell Information Systems. June 1972. pp. 160–161.
  7. 1 2 Russinovich, Mark E.; David A. Solomon (2005). Microsoft Windows Internals (4 ed.). Microsoft Press. pp.  16. ISBN   978-0-7356-1917-3.
  8. Russinovich, Mark (2012). Windows Internals Part 1 (6th ed.). Redmond, Washington: Microsoft Press. p. 17. ISBN   978-0-7356-4873-9. The reason Windows uses only two levels is that some hardware architectures that were supported in the past (such as Compaq Alpha and Silicon Graphics MIPS) implemented only two privilege levels.
  9. "Presentation Device Driver Reference for OS/2 – 5. Introduction to OS/2 Presentation Drivers". Archived from the original on 15 June 2015. Retrieved 13 June 2015.
  10. ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition. Arm Ltd. p. B1-1136.
  11. 1 2 Arm Architecture Reference Manual Armv8, for A-profile architecture. Arm Ltd.
  12. "supervisor mode". FOLDOC . 15 February 1995.
  13. 1 2 Jochen Liedtke (December 1995). "On µ-Kernel Construction". Proc. 15th ACM Symposium on Operating System Principles (SOSP).
  14. Ousterhout, J. K. (1990). Why aren't operating systems getting faster as fast as hardware?. Usenix Summer Conference A. naheim, CA. pp. 247–256.
  15. Maurice Wilkes (April 1994). "Operating systems in a changing world". ACM SIGOPS Operating Systems Review. 28 (2): 9–21. doi: 10.1145/198153.198154 . ISSN   0163-5980. S2CID   254134.
  16. Gorine, Andrei; Krivolapov, Alexander (May 2008). "Kernel Mode Databases: A DBMS Technology For High-Performance Applications". Dr. Dobb's Journal.
  17. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3C (PDF). Intel Cooperation (published September 2016). 2016. pp. 1–3.
  18. Russinovich, Mark E.; Solomon, David A. (2005). Microsoft Windows Internals (4th ed.). Microsoft Press. p. 16. ISBN   978-0-7356-1917-3.
  19. Sunil Mathur. Microprocessor 8086: Architecture, Programming and Interfacing (Eastern Economy ed.). PHI Learning.
  20. Anderson, Thomas; Dahlin, Michael (21 August 2014). "2.2". Operating Systems: Principles and Practice (2nd ed.). Recursive Books. ISBN   978-0985673529.
  21. De Gelas, Johan. "Hardware Virtualization: the Nuts and Bolts". AnandTech. Retrieved 13 March 2021.

Further reading