System call

Last updated

A high-level overview of the Linux kernel's system call interface, which handles communication between its various components and the userspace Linux kernel interfaces.svg
A high-level overview of the Linux kernel's system call interface, which handles communication between its various components and the userspace

In computing, a system call (commonly abbreviated to syscall) is the programmatic way in which a computer program requests a service from the operating system [lower-alpha 1] on which it is executed. This may include hardware-related services (for example, accessing a hard disk drive or accessing the device's camera), creation and execution of new processes, and communication with integral kernel services such as process scheduling. System calls provide an essential interface between a process and the operating system.

Contents

In most systems, system calls can only be made from userspace processes, while in some systems, OS/360 and successors for example, privileged system code also issues system calls. [1]

For embedded systems, system calls typically do not change the privilege mode of the CPU.

Privileges

The architecture of most modern processors, with the exception of some embedded systems, involves a security model. For example, the rings model specifies multiple privilege levels under which software may be executed: a program is usually limited to its own address space so that it cannot access or modify other running programs or the operating system itself, and is usually prevented from directly manipulating hardware devices (e.g. the frame buffer or network devices).

However, many applications need access to these components, so system calls are made available by the operating system to provide well-defined, safe implementations for such operations. The operating system executes at the highest level of privilege, and allows applications to request services via system calls, which are often initiated via interrupts. An interrupt automatically puts the CPU into some elevated privilege level and then passes control to the kernel, which determines whether the calling program should be granted the requested service. If the service is granted, the kernel executes a specific set of instructions over which the calling program has no direct control, returns the privilege level to that of the calling program, and then returns control to the calling program.

The library as an intermediary

Generally, systems provide a library or API that sits between normal programs and the operating system. On Unix-like systems, that API is usually part of an implementation of the C library (libc), such as glibc, that provides wrapper functions for the system calls, often named the same as the system calls they invoke. On Windows NT, that API is part of the Native API, in the ntdll.dll library; this is an undocumented API used by implementations of the regular Windows API and directly used by some system programs on Windows. The library's wrapper functions expose an ordinary function calling convention (a subroutine call on the assembly level) for using the system call, as well as making the system call more modular. Here, the primary function of the wrapper is to place all the arguments to be passed to the system call in the appropriate processor registers (and maybe on the call stack as well), and also setting a unique system call number for the kernel to call. In this way the library, which exists between the OS and the application, increases portability.

The call to the library function itself does not cause a switch to kernel mode and is usually a normal subroutine call (using, for example, a "CALL" assembly instruction in some Instruction set architectures (ISAs)). The actual system call does transfer control to the kernel (and is more implementation-dependent and platform-dependent than the library call abstracting it). For example, in Unix-like systems, fork and execve are C library functions that in turn execute instructions that invoke the fork and exec system calls. Making the system call directly in the application code is more complicated and may require embedded assembly code to be used (in C and C++), as well as requiring knowledge of the low-level binary interface for the system call operation, which may be subject to change over time and thus not be part of the application binary interface; the library functions are meant to abstract this away.

On exokernel based systems, the library is especially important as an intermediary. On exokernels, libraries shield user applications from the very low level kernel API, and provide abstractions and resource management.

IBM's OS/360, DOS/360 and TSS/360 implement most system calls through a library of assembly language macros, [lower-alpha 2] although there are a few services with a call linkage. This reflects their origin at a time when programming in assembly language was more common than high-level language usage. IBM system calls were therefore not directly executable by high-level language programs, but required a callable assembly language wrapper subroutine. Since then, IBM has added many services that can be called from high level languages in, e.g., z/OS and z/VSE. In more recent release of MVS/SP and in all later MVS versions, some system call macros generate Program Call (PC).

Examples and tools

On Unix, Unix-like and other POSIX-compliant operating systems, popular system calls are open , read , write , close , wait , exec , fork , exit , and kill . Many modern operating systems have hundreds of system calls. For example, Linux and OpenBSD each have over 300 different calls, [2] [3] NetBSD has close to 500, [4] FreeBSD has over 500, [5] Windows has close to 2000, divided between win32k (graphical) and ntdll (core) system calls [6] while Plan 9 has 51. [7]

Tools such as strace, ftrace and truss allow a process to execute from start and report all system calls the process invokes, or can attach to an already running process and intercept any system call made by the said process if the operation does not violate the permissions of the user. This special ability of the program is usually also implemented with system calls such as ptrace or system calls on files in procfs.

Typical implementations

Implementing system calls requires a transfer of control from user space to kernel space, which involves some sort of architecture-specific feature. A typical way to implement this is to use a software interrupt or trap. Interrupts transfer control to the operating system kernel, so software simply needs to set up some register with the system call number needed, and execute the software interrupt.

This is the only technique provided for many RISC processors, but CISC architectures such as x86 support additional techniques. For example, the x86 instruction set contains the instructions SYSCALL/SYSRET and SYSENTER/SYSEXIT (these two mechanisms were independently created by AMD and Intel, respectively, but in essence they do the same thing). These are "fast" control transfer instructions that are designed to quickly transfer control to the kernel for a system call without the overhead of an interrupt. [8] Linux 2.5 began using this on the x86, where available; formerly it used the INT instruction, where the system call number was placed in the EAX register before interrupt 0x80 was executed. [9] [10]

An older mechanism is the call gate; originally used in Multics and later, for example, see call gate on the Intel x86. It allows a program to call a kernel function directly using a safe control transfer mechanism, which the operating system sets up in advance. This approach has been unpopular on x86, presumably due to the requirement of a far call (a call to a procedure located in a different segment than the current code segment [11] ) which uses x86 memory segmentation and the resulting lack of portability it causes, and the existence of the faster instructions mentioned above.

For IA-64 architecture, EPC (Enter Privileged Code) instruction is used. The first eight system call arguments are passed in registers, and the rest are passed on the stack.

In the IBM System/360 mainframe family, and its successors, a Supervisor Call instruction (SVC), with the number in the instruction rather than in a register, implements a system call for legacy facilities in most of [lower-alpha 3] IBM's own operating systems, and for all system calls in Linux. In later versions of MVS, IBM uses the Program Call (PC) instruction for many newer facilities. In particular, PC is used when the caller might be in Service Request Block (SRB) mode.

The PDP-11 minicomputer used the EMT, TRAP and IOT instructions, which, similar to the IBM System/360 SVC and x86 INT, put the code in the instruction; they generate interrupts to specific addresses, transferring control to the operating system. The VAX 32-bit successor to the PDP-11 series used the CHMK, CHME, and CHMS instructions to make system calls to privileged code at various levels; the code is an argument to the instruction.

Categories of system calls

System calls can be grouped roughly into six major categories: [12]

  1. Process control
  2. File management
    • create file, delete file
    • open, close
    • read, write, reposition
    • get/set file attributes
  3. Device management
    • request device, release device
    • read, write, reposition
    • get/set device attributes
    • logically attach or detach devices
  4. Information maintenance
    • get/set total system information (including time, date, computer name, enterprise etc.)
    • get/set process, file, or device metadata (including author, opener, creation time and date, etc.)
  5. Communication
    • create, delete communication connection
    • send, receive messages
    • transfer status information
    • attach or detach remote devices
  6. Protection
    • get/set file permissions

Processor mode and context switching

System calls in most Unix-like systems are processed in kernel mode, which is accomplished by changing the processor execution mode to a more privileged one, but no process context switch is necessary  although a privilege context switch does occur. The hardware sees the world in terms of the execution mode according to the processor status register, and processes are an abstraction provided by the operating system. A system call does not generally require a context switch to another process; instead, it is processed in the context of whichever process invoked it. [13] [14]

In a multithreaded process, system calls can be made from multiple threads. The handling of such calls is dependent on the design of the specific operating system kernel and the application runtime environment. The following list shows typical models followed by operating systems: [15] [16]

See also

Notes

  1. In UNIX-like operating systems, system calls are used only for the kernel
  2. In many but not all cases, IBM documented, e.g., the SVC number, the parameter registers.
  3. The CP components of CP-67 and VM use the Diagnose (DIAG) instruction as a Hypervisor CALL (HVC) from a virtual machine to CP.

Related Research Articles

<span class="mw-page-title-main">Operating system</span> Software that manages computer hardware resources

An operating system (OS) is system software that manages computer hardware and software resources, and provides common services for computer programs.

<span class="mw-page-title-main">Thread (computing)</span> Smallest sequence of programmed instructions that can be managed independently by a scheduler

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process.

x86-64 64-bit version of x86 architecture

x86-64 is a 64-bit version of the x86 instruction set, first announced in 1999. It introduced two new modes of operation, 64-bit mode and compatibility mode, along with a new 4-level paging mode.

<span class="mw-page-title-main">DragonFly BSD</span> Free and open-source Unix-like operating system

DragonFly BSD is a free and open-source Unix-like operating system forked from FreeBSD 4.8. Matthew Dillon, an Amiga developer in the late 1980s and early 1990s and FreeBSD developer between 1994 and 2003, began working on DragonFly BSD in June 2003 and announced it on the FreeBSD mailing lists on 16 July 2003.

RTLinux is a hard realtime real-time operating system (RTOS) microkernel that runs the entire Linux operating system as a fully preemptive process. The hard real-time property makes it possible to control robots, data acquisition systems, manufacturing plants, and other time-sensitive instruments and machines from RTLinux applications. The design was patented. Despite the similar name, it is not related to the Real-Time Linux project of the Linux Foundation.

<span class="mw-page-title-main">Pseudoterminal</span>

In some operating systems, including Unix-like systems, a pseudoterminal, pseudotty, or PTY is a pair of pseudo-device endpoints (files) which establish asynchronous, bidirectional communication (IPC) channel between two or more processes.

Signals are standardized messages sent to a running program to trigger specific behavior, such as quitting or error handling. They are a limited form of inter-process communication (IPC), typically used in Unix, Unix-like, and other POSIX-compliant operating systems.

W^X is a security feature in operating systems and virtual machines. It is a memory protection policy whereby every page in a process's or kernel's address space may be either writable or executable, but not both. Without such protection, a program can write CPU instructions in an area of memory intended for data and then run those instructions. This can be dangerous if the writer of the memory is malicious. W^X is the Unix-like terminology for a strict use of the general concept of executable space protection, controlled via the mprotect system call.

A hypervisor, also known as a virtual machine monitor (VMM) or virtualizer, is a type of computer software, firmware or hardware that creates and runs virtual machines. A computer on which a hypervisor runs one or more virtual machines is called a host machine, and each virtual machine is called a guest machine. The hypervisor presents the guest operating systems with a virtual operating platform and manages the execution of the guest operating systems. Unlike an emulator, the guest executes most instructions on the native hardware. Multiple instances of a variety of operating systems may share the virtualized hardware resources: for example, Linux, Windows, and macOS instances can all run on a single physical x86 machine. This contrasts with operating-system–level virtualization, where all instances must share a single kernel, though the guest operating systems can differ in user space, such as different Linux distributions with the same kernel.

<span class="mw-page-title-main">Linux kernel interfaces</span> An overview and comparison of the Linux kernal APIs and ABIs.

The Linux kernel provides multiple interfaces to user-space and kernel-mode code that are used for varying purposes and that have varying properties by design. There are two types of application programming interface (API) in the Linux kernel:

  1. the "kernel–user space" API; and
  2. the "kernel internal" API.
<span class="mw-page-title-main">Protection ring</span> Layer of protection in computer systems

In computer science, hierarchical protection domains, often called protection rings, are mechanisms to protect data and functionality from faults and malicious behavior.

In computing, a dynamic linker is the part of an operating system that loads and links the shared libraries needed by an executable when it is executed, by copying the content of libraries from persistent storage to RAM, filling jump tables and relocating pointers. The specific operating system and executable format determine how the dynamic linker functions and how it is implemented.

<span class="mw-page-title-main">Minix 3</span> Unix-like operating system

Minix 3 is a small, Unix-like operating system. It is published under a BSD-3-Clause license and is a successor project to the earlier versions, Minix 1 and 2.

A call gate is a mechanism in Intel's x86 architecture for changing the privilege level of a process when it executes a predefined function call using a CALL FAR instruction.

The following is a timeline of virtualization development. In computing, virtualization is the use of a computer to simulate another computer. Through virtualization, a host simulates a guest by exposing virtual hardware devices, which may be done through software or by allowing access to a physical device connected to the machine.

The Berkeley Packet Filter is a network tap and packet filter which permits computer network packets to be captured and filtered at the operating system level. It provides a raw interface to data link layers, permitting raw link-layer packets to be sent and received, and allows a userspace process to supply a filter program that specifies which packets it wants to receive. For example, a tcpdump process may want to receive only packets that initiate a TCP connection. BPF returns only packets that pass the filter that the process supplies. This avoids copying unwanted packets from the operating system kernel to the process, greatly improving performance. The filter program is in the form of instructions for a virtual machine, which are interpreted, or compiled into machine code by a just-in-time (JIT) mechanism and executed, in the kernel.

Binary-code compatibility is a property of a computer system, meaning that it can run the same executable code, typically machine code for a general-purpose computer Central processing unit (CPU), that another computer system can run. Source-code compatibility, on the other hand, means that recompilation or interpretation is necessary before the program can be run on the compatible system.

<span class="mw-page-title-main">Kernel (operating system)</span> Core of a computer operating system

The kernel is a computer program at the core of a computer's operating system and generally has complete control over everything in the system. The kernel is also responsible for preventing and mitigating conflicts between different processes. It is the portion of the operating system code that is always resident in memory and facilitates interactions between hardware and software components. A full kernel controls all hardware resources via device drivers, arbitrates conflicts between processes concerning such resources, and optimizes the utilization of common resources e.g. CPU & cache usage, file systems, and network sockets. On most systems, the kernel is one of the first programs loaded on startup. It handles the rest of startup as well as memory, peripherals, and input/output (I/O) requests from software, translating them into data-processing instructions for the central processing unit.

ptrace is a system call found in Unix and several Unix-like operating systems. By using ptrace one process can control another, enabling the controller to inspect and manipulate the internal state of its target. ptrace is used by debuggers and other code-analysis tools, mostly as aids to software development.

<span class="mw-page-title-main">Longene</span> Linux distribution

Longene is a Linux-based operating system kernel intended to be binary compatible with application software and device drivers made for Microsoft Windows and Linux. As of 1.0-rc2, it consists of a Linux kernel module implementing aspects of the Windows kernel and a modified Wine distribution designed to take advantage of the more native interface. Longene is written in the C programming language and is free and open source software. It is licensed under the terms of the GNU General Public License version 2 (GPLv2).

References

  1. IBM (March 1967). "Writing SVC Routines". IBM System/360 Operating System System Programmer's Guide (PDF). Third Edition. pp. 32–36. C28-6550-2.
  2. "syscalls(2) - Linux manual page".
  3. OpenBSD (14 September 2013). "System call names (kern/syscalls.c)". BSD Cross Reference.
  4. NetBSD (17 October 2013). "System call names (kern/syscalls.c)". BSD Cross Reference.
  5. "FreeBSD syscalls.c, the list of syscall names and IDs".
  6. Mateusz "j00ru" Jurczyk (5 November 2017). "Windows WIN32K.SYS System Call Table (NT/2000/XP/2003/Vista/2008/7/8/10)".{{cite web}}: CS1 maint: numeric names: authors list (link)
  7. "sys.h". Plan 9 from Bell Labs. Archived from the original on 8 September 2023, the list of syscall names and IDs.
  8. "SYSENTER". OSDev wiki. Archived from the original on 8 November 2023.
  9. Anonymous (19 December 2002). "Linux 2.5 gets vsyscalls, sysenter support". KernelTrap . Retrieved 1 January 2008.
  10. Manu Garg (2006). "Sysenter Based System Call Mechanism in Linux 2.6".
  11. "Liberation: x86 Instruction Set Reference". renejeschke.de. Retrieved 4 July 2015.
  12. Silberschatz, Abraham (2018). Operating System Concepts. Peter B Galvin; Greg Gagne (10th ed.). Hoboken, NJ: Wiley. p. 67. ISBN   9781119320913. OCLC   1004849022.
  13. Bach, Maurice J. (1986), The Design of the UNIX Operating System, Prentice Hall, pp. 15–16.
  14. Elliot, John (2011). "Discussion of system call implementation at ProgClub including quote from Bach 1986".
  15. "Threads".
  16. "Threading Models" (PDF).