Linux namespaces

Last updated
namespaces
Original author(s) Al Viro
Developer(s) Eric W. Biederman, Pavel Emelyanov, Al Viro, Cyrill Gorcunov et al.
Initial release2002;22 years ago (2002)
Written in C
Operating system Linux
Type System software
License GPL and LGPL

Namespaces are a feature of the Linux kernel that partition kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. The feature works by having the same namespace for a set of resources and processes, but those namespaces refer to distinct resources. Resources may exist in multiple spaces. Examples of such resources are process IDs, host-names, user IDs, file names, some names associated with network access, and Inter-process communication.

Contents

Namespaces are a fundamental aspect of containers in Linux.

The term "namespace" is often used for a type of namespace (e.g. process ID) as well as for a particular space of names.

A Linux system starts out with a single namespace of each type, used by all processes. Processes can create additional namespaces and also join different namespaces.

History

Linux namespaces were inspired by the wider namespace functionality used heavily throughout Plan 9 from Bell Labs. [1]

The Linux Namespaces originated in 2002 in the 2.4.19 kernel with work on the mount namespace kind. Additional namespaces were added beginning in 2006 [2] and continuing into the future.

Adequate containers support functionality was finished in kernel version 3.8 with the introduction of User namespaces. [3]

Namespace kinds

Since kernel version 5.6, there are 8 kinds of namespaces. Namespace functionality is the same across all kinds: each process is associated with a namespace and can only see or use the resources associated with that namespace, and descendant namespaces where applicable. This way each process (or process group thereof) can have a unique view on the resources. Which resource is isolated depends on the kind of namespace that has been created for a given process group.

Mount (mnt)

Mount namespaces control mount points. Upon creation the mounts from the current mount namespace are copied to the new namespace, but mount points created afterwards do not propagate between namespaces (using shared subtrees, it is possible to propagate mount points between namespaces [4] ).

The clone flag used to create a new namespace of this type is CLONE_NEWNS - short for "NEW NameSpace". This term is not descriptive (it does not tell which kind of namespace is to be created) because mount namespaces were the first kind of namespace and designers did not anticipate there being any others.

Process ID (pid)

The PID namespace provides processes with an independent set of process IDs (PIDs) from other namespaces. PID namespaces are nested, meaning when a new process is created it will have a PID for each namespace from its current namespace up to the initial PID namespace. Hence the initial PID namespace is able to see all processes, albeit with different PIDs than other namespaces will see processes with.

The first process created in a PID namespace is assigned the process ID number 1 and receives most of the same special treatment as the normal init process, most notably that orphaned processes within the namespace are attached to it. This also means that the termination of this PID 1 process will immediately terminate all processes in its PID namespace and any descendants. [5]

Network (net)

Network namespaces virtualize the network stack. On creation, a network namespace contains only a loopback interface.

Each network interface (physical or virtual) is present in exactly 1 namespace and can be moved between namespaces.

Each namespace will have a private set of IP addresses, its own routing table, socket listing, connection tracking table, firewall, and other network-related resources.

Destroying a network namespace destroys any virtual interfaces within it and moves any physical interfaces within it back to the initial network namespace.

Inter-process Communication (ipc)

IPC namespaces isolate processes from SysV style inter-process communication. This prevents processes in different IPC namespaces from using, for example, the SHM family of functions to establish a range of shared memory between the two processes. Instead, each process will be able to use the same identifiers for a shared memory region and produce two such distinct regions.

UTS

UTS (UNIX Time-Sharing) namespaces allow a single system to appear to have different host and domain names to different processes. "When a process creates a new UTS namespace ... the hostname and domain of the new UTS namespace are copied from the corresponding values in the caller's UTS namespace." [6]

User ID (user)

User namespaces are a feature to provide both privilege isolation and user identification segregation across multiple sets of processes available since kernel 3.8. [7] With administrative assistance it is possible to build a container with seeming administrative rights without actually giving elevated privileges to user processes. Like the PID namespace, user namespaces are nested and each new user namespace is considered to be a child of the user namespace that created it.

A user namespace contains a mapping table converting user IDs from the container's point of view to the system's point of view. This allows, for example, the root user to have user id 0 in the container but is actually treated as user id 1,400,000 by the system for ownership checks. A similar table is used for group id mappings and ownership checks.

To facilitate privilege isolation of administrative actions, each namespace type is considered owned by a user namespace based on the active user namespace at the moment of creation. A user with administrative privileges in the appropriate user namespace will be allowed to perform administrative actions within that other namespace type. For example, if a process has administrative permission to change the IP address of a network interface, it may do so as long as its own user namespace is the same as (or ancestor of) the user namespace that owns the network namespace. Hence the initial user namespace has administrative control over all namespace types in the system. [8]

Control group (cgroup) Namespace

The cgroup namespace type hides the identity of the control group of which process is a member. A process in such a namespace, checking which control group any process is part of, would see a path that is actually relative to the control group set at creation time, hiding its true control group position and identity. This namespace type has existed since March 2016 in Linux 4.6. [9] [10]

Time Namespace

The time namespace allows processes to see different system times in a way similar to the UTS namespace. It was proposed in 2018 and landed on Linux 5.6, which was released in March 2020. [11]

Proposed namespaces

syslog namespace

The syslog namespace was proposed by Rui Xiang, an engineer at Huawei, but wasn't merged into the linux kernel. [12] systemd implemented a similar feature called “journal namespace” in February 2020. [13]

Implementation details

The kernel assigns each process a symbolic link per namespace kind in /proc/<pid>/ns/. The inode number pointed to by this symlink is the same for each process in this namespace. This uniquely identifies each namespace by the inode number pointed to by one of its symlinks.

Reading the symlink via readlink returns a string containing the namespace kind name and the inode number of the namespace.

Syscalls

Three syscalls can directly manipulate namespaces:

Destruction

If a namespace is no longer referenced, it will be deleted, the handling of the contained resource depends on the namespace kind. Namespaces can be referenced in three ways:

  1. by a process belonging to the namespace
  2. by an open filedescriptor to the namespace's file (/proc/<pid>/ns/<ns-kind>)
  3. a bind mount of the namespace's file (/proc/<pid>/ns/<ns-kind>)

Adoption

Various container software use Linux namespaces in combination with cgroups to isolate their processes, including Docker [14] and LXC.

Other applications, such as Google Chrome make use of namespaces to isolate its own processes which are at risk from attack on the internet. [15]

There is also an unshare wrapper in util-linux. An example of its use is:

SHELL=/bin/shunshare--map-root-user--fork--pidchroot"${chrootdir}""$@"

Related Research Articles

<span class="mw-page-title-main">GNU Hurd</span> Operating system kernel designed as a replacement for Unix

GNU Hurd is a collection of microkernel servers written as part of GNU, for the GNU Mach microkernel. It has been under development since 1990 by the GNU Project of the Free Software Foundation, designed as a replacement for the Unix kernel, and released as free software under the GNU General Public License. When the Linux kernel proved to be a viable solution, development of GNU Hurd slowed, at times alternating between stasis and renewed activity and interest.

In Unix-like systems, multiple users can be put into groups. POSIX and conventional Unix file system permissions are organized into three classes, user, group, and others. The use of groups allows additional abilities to be delegated in an organized fashion, such as access to disks, printers, and other peripherals. This method, among others, also enables the superuser to delegate some administrative tasks to normal users, similar to the Administrators group on Microsoft Windows NT and its derivatives.

udev is a device manager for the Linux kernel. As the successor of devfsd and hotplug, udev primarily manages device nodes in the /dev directory. At the same time, udev also handles all user space events raised when hardware devices are added into the system or removed from it, including firmware loading as required by certain devices.

Unix-like operating systems identify a user by a value called a user identifier, often abbreviated to user ID or UID. The UID, along with the group identifier (GID) and other access control criteria, is used to determine which system resources a user can access. The password file maps textual user names to UIDs. UIDs are stored in the inodes of the Unix file system, running processes, tar archives, and the now-obsolete Network Information Service. In POSIX-compliant environments, the shell command id gives the current user's UID, as well as more information such as the user name, primary user group and group identifier (GID).

The proc filesystem (procfs) is a special filesystem in Unix-like operating systems that presents information about processes and other system information in a hierarchical file-like structure, providing a more convenient and standardized method for dynamically accessing process data held in the kernel than traditional tracing methods or direct access to kernel memory. Typically, it is mapped to a mount point named /proc at boot time. The proc file system acts as an interface to internal data structures about running processes in the kernel. In Linux, it can also be used to obtain information about the kernel and to change certain kernel parameters at runtime (sysctl).

D-Bus is a message-oriented middleware mechanism that allows communication between multiple processes running concurrently on the same machine. D-Bus was developed as part of the freedesktop.org project, initiated by GNOME developer Havoc Pennington to standardize services provided by Linux desktop environments such as GNOME and KDE.

<span class="mw-page-title-main">Linux kernel interfaces</span> An overview and comparison of the Linux kernal APIs and ABIs.

The Linux kernel provides multiple interfaces to user-space and kernel-mode code that are used for varying purposes and that have varying properties by design. There are two types of application programming interface (API) in the Linux kernel:

  1. the "kernel–user space" API; and
  2. the "kernel internal" API.

seccomp is a computer security facility in the Linux kernel. seccomp allows a process to make a one-way transition into a "secure" state where it cannot make any system calls except exit , sigreturn , read and write to already-open file descriptors. Should it attempt any other system calls, the kernel will either just log the event or terminate the process with SIGKILL or SIGSYS. In this sense, it does not virtualize the system's resources but isolates the process from them entirely.

OS-level virtualization is an operating system (OS) virtualization paradigm in which the kernel allows the existence of multiple isolated user space instances, called containers, zones, virtual private servers (OpenVZ), partitions, virtual environments (VEs), virtual kernels, or jails. Such instances may look like real computers from the point of view of programs running in them. A computer program running on an ordinary operating system can see all resources of that computer. However, programs running inside of a container can only see the container's contents and devices assigned to the container.

<span class="mw-page-title-main">OpenVZ</span> Operating-system level virtualization technology

OpenVZ is an operating-system-level virtualization technology for Linux. It allows a physical server to run multiple isolated operating system instances, called containers, virtual private servers (VPSs), or virtual environments (VEs). OpenVZ is similar to Solaris Containers and LXC.

binfmt_misc is a capability of the Linux kernel which allows arbitrary executable file formats to be recognized and passed to certain user space applications, such as emulators and virtual machines. It is one of a number of binary format handlers in the kernel that are involved in preparing a user-space program to run.

debugfs is a special file system available in the Linux kernel since version 2.6.10-rc3. It was written by Greg Kroah-Hartman.

The Linux booting process involves multiple stages and is in many ways similar to the BSD and other Unix-style boot processes, from which it derives. Although the Linux booting process depend very much on the computer architecture, those architectures share similar stages and software components, including firmware initialization, execution of a bootloader, loading and startup of a Linux kernel image, and execution of various startup scripts and daemons. When a Linux system is powered up or reset, its processor will execute a specific firmware/program for system initialization, such as Power-on self-test, invoking the reset vector to start a program at a known address in flash/ROM, then load the bootloader into RAM for later execution. In personal computer (PC), not only limited to Linux-distro PC, this firmware/program is called BIOS, which is stored in the mainboard. In embedded Linux system, this firmware/program is called boot ROM. After being loaded into RAM, bootloader will execute to load the second-stage bootloader. The second-stage bootloader will load the kernel image into memory, decompress and initialize it then pass control to this kernel image. Second-stage bootloader also performs several operation on the system such as system hardware check, mounting the root device, loading the necessary kernel modules,... Finally, the very first user-space process starts, and other high-level system initializations are performed.

Toybox is a free and open-source software implementation of over 200 Unix command line utilities such as ls, cp, and mv. The Toybox project was started in 2006, and became a 0BSD licensed BusyBox alternative. Toybox is used for most of Android's command-line tools in all currently supported Android versions, and is also used to build Android on Linux and macOS. All of the tools are tested on Linux, and many of them also work on BSD and macOS.

<span class="mw-page-title-main">LXC</span> Operating system-level virtualization for Linux

Linux Containers (LXC) is an operating-system-level virtualization method for running multiple isolated Linux systems (containers) on a control host using a single Linux kernel.

systemd Suite of system components for Linux

systemd is a software suite that provides an array of system components for Linux operating systems. The main aim is to unify service configuration and behavior across Linux distributions. Its primary component is a "system and service manager" – an init system used to bootstrap user space and manage user processes. It also provides replacements for various daemons and utilities, including device management, login management, network connection management, and event logging. The name systemd adheres to the Unix convention of naming daemons by appending the letter d. It also plays on the term "System D", which refers to a person's ability to adapt quickly and improvise to solve problems.

cgroups is a Linux kernel feature that limits, accounts for, and isolates the resource usage of a collection of processes.

Docker is a set of platform as a service (PaaS) products that use OS-level virtualization to deliver software in packages called containers. The service has both free and premium tiers. The software that hosts the containers is called Docker Engine. It was first released in 2013 and is developed by Docker, Inc.

devpts

devpts is a virtual filesystem directory available in the Linux kernel since version 2.1.93. It is normally mounted at /dev/pts and contains solely devices files which represent slaves to the multiplexing master located at /dev/ptmx which in turn is used to implement terminal emulators.

Container Linux is a discontinued open-source lightweight operating system based on the Linux kernel and designed for providing infrastructure for clustered deployments while focusing on automation, ease of application deployment, security, reliability, and scalability. As an operating system, Container Linux provided only the minimal functionality required for deploying applications inside software containers, together with built-in mechanisms for service discovery and configuration sharing.

References

  1. "The Use of Name Spaces in Plan 9". 1992. Archived from the original on 2014-09-06. Retrieved 2016-03-24.
  2. "Linux kernel source tree". kernel.org. 2016-10-02.
  3. "Namespaces in operation, part 5: User namespaces [LWN.net]".
  4. "Documentation/filesystems/sharedsubtree.txt". 2016-02-25. Retrieved 2017-03-06.
  5. "Namespaces in operation, part 3: PID namespaces". lwn.net. 2013-01-16.
  6. "uts_namespaces(7) - Linux manual page". www.man7.org. Retrieved 2021-02-16.
  7. "Namespaces in operation, part 5: User namespaces [LWN.net]".
  8. "Namespaces in operation, part 5: User namespaces". lwn.net. 2013-02-27.
  9. Heo, Tejun (2016-03-18). "[GIT PULL] cgroup namespace support for v4.6-rc1". lkml (Mailing list).
  10. Torvalds, Linus (2016-03-26). "Linux 4.6-rc1". lkml (Mailing list).
  11. "It's Finally Time: The Time Namespace Support Has Been Added To The Linux 5.6 Kernel - Phoronix". www.phoronix.com. Retrieved 2020-03-30.
  12. "Add namespace support for syslog [LWN.net]". lwn.net. Retrieved 2022-07-11.
  13. "journal: add concept of "journal namespaces" by poettering · Pull Request #14178 · systemd/systemd". GitHub. Retrieved 2022-07-11.
  14. "Docker security". docker.com. Retrieved 2016-03-24.
  15. "Chromium Linux Sandboxing" . Retrieved 2019-12-19.