diff --git a/man7/user_namespaces.7 b/man7/user_namespaces.7
new file mode 100644
index 000000000..a537b4415
--- /dev/null
+++ b/man7/user_namespaces.7
@@ -0,0 +1,352 @@
+.\" Copyright (c) 2013 by Michael Kerrisk
+.\" and Copyright (c) 2012 by Eric W. Biederman
+.\"
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\"
+.\"
+.TH USER_NAMESPACES 7 2013-01-14 "Linux" "Linux Programmer's Manual"
+.SH NAME
+user_namespaces \- overview of Linux user namespaces
+.SH DESCRIPTION
+For an overview of namespaces, see
+.BR namespaces (7).
+
+User namespaces isolate security-related identifiers, in particular,
+user IDs, group IDs, keys (see
+.BR keyctl (2)),
+and capabilities.
+A process's user and group IDs can be different
+inside and outside a user namespace.
+In particular,
+a process can have a normal unprivileged user ID outside a user namespace
+while at the same time having a user ID of 0 inside the namespace;
+in other words,
+the process has full privileges for operations inside the user namespace,
+but is unprivileged for operations outside the namespace.
+
+User namespaces can be nested;
+that is, each user namespace has a parent user namespace,
+and can have zero or more child user namespaces.
+The parent of a user namespace is the user namespace
+of the process that creates the user namespace via a call to
+.BR unshare (2)
+or
+.BR clone (2)
+with the
+.BR CLONE_NEWUSER
+flag.
+
+When a user namespace is created,
+it starts out without a mapping of user IDs (group IDs)
+to the parent user namespace.
+The desired mapping of user IDs (group IDs) to the parent user namespace
+may be set by writing into
+.IR /proc/[pid]/uid_map
+.RI ( /proc/[pid]/gid_map );
+see below.
+
+The first process in a user namespace starts out with a complete set
+of capabilities with respect to the new user namespace.
+
+System calls that return user IDs (group IDs) will return
+either the user ID (group ID) mapped into the current
+user namespace if there is a mapping, or the overflow user ID (group ID);
+the default value for the overflow user ID (group ID) is 65534.
+See the descriptions of
+.IR /proc/sys/kernel/overflowuid
+and
+.IR /proc/sys/kernel/overflowgid
+in
+.BR proc (5).
+
+Starting in Linux 3.8, unprivileged processes can create user namespaces,
+and mount, PID, IPC, network, and UTS namespaces can be created with just the
+.B CAP_SYS_ADMIN
+capability in the caller's user namespace.
+
+If
+.BR CLONE_NEWUSER
+is specified along with other
+.B CLONE_NEW*
+flags in a single
+.BR clone (2)
+or
+.BR unshare (2)
+call, the user namespace is guaranteed to be created first,
+giving the caller privileges over the remaining
+namespaces created by the call.
+Thus, it is possible for an unprivileged caller to specify this combination
+of flags.
+
+When a new IPC, mount, network, PID, or UTS namespace is created via
+.BR clone (2)
+or
+.BR unshare (2),
+the kernel records the user namespace of the creating process against
+the new namespace.
+When a process in the new namespace subsequently performs
+privileged operations that operate on global
+resources isolated by the namespace,
+the permission checks are performed according to the process's capabilities
+in the user namespace that the kernel associated with the new namespace.
+
+
+The following rules apply with respect to the capabilities granted
+to a process:
+.\" In the 3.8 sources, see security/commoncap.c::cap_capable():
+.IP 1. 3
+If a process has a capability in a parent user namespace,
+then it has that capability in all child (and further removed descendant)
+namespaces as well.
+.IP 2.
+.\" * The owner of the user namespace in the parent of the
+.\" * user namespace has all caps.
+When a user namespace is created, the kernel records the effective
+user ID of the creating process as being the "owner" of the namespace,
+and likewise associates the effective group ID of the creating process
+with the namespace.
+A process whose effective user ID matches that of the
+owner of a user namespace and which is a member of the parent namespace
+(or a further removed namespace that is a direct ancestor)
+has all capabilities in the user namespace.
+.\" As a rough approximation, this means that
+.\" the user who creates a user namespace
+.\" has all capabilities inside that namespace and its descendants.
+.PP
+Use of user namespaces requires a kernel that is configured with the
+.B CONFIG_USER_NS
+option.
+
+Over the years, many features have been added
+to the Linux kernel that are available only to privileged users
+because of their potential to confuse set-user-ID-root applications.
+In general, it becomes safe to allow the root user in a user namespace to
+use those features because it is impossible, while in a user namespace,
+to gain more privilege than the root user of a user namespace has.
+
+The
+.IR /proc/[pid]/uid_map
+and
+.IR /proc/[pid]/gid_map
+files (available since Linux 3.5)
+.\" commit 22d917d80e842829d0ca0a561967d728eb1d6303
+expose the mappings for user and group IDs
+inside the user namespace for the process
+.IR pid .
+The description here explains the details for
+.IR uid_map ;
+.IR gid_map
+is exactly the same,
+but each instance of "user ID" is replaced by "group ID".
+
+The
+.I uid_map
+file exposes the mapping of user IDs from the user namespace
+of the process
+.IR pid
+to the user namespace of the process that opened
+.IR uid_map
+(but see a qualification to this point below).
+In other words, processes that are in different user namespaces
+will potentially see different values when reading from a particular
+.I uid_map
+file, depending on the user ID mappings for the user namespaces
+of the reading processes.
+
+Each line in the
+.I uid_map
+file specifies a 1-to-1 mapping of a range of contiguous
+user IDs between two user namespaces.
+(When a user namespace is first created, this file is empty.)
+The specification in each line takes the form of
+three numbers delimited by white space.
+The first two numbers specify the starting user ID in
+each user namespace.
+The third number specifies the length of the mapped range.
+In detail, the fields are interpreted as follows:
+.IP (1) 4
+The start of the range of user IDs in
+the user namespace of the process
+.IR pid .
+.IP (2)
+The start of the range of user
+IDs to which the user IDs specified by field one map.
+How field two is interpreted depends on whether the process that opened
+.I uid_map
+and the process
+.IR pid
+are in the same user namespace, as follows:
+.RS
+.IP a) 3
+If the two processes are in different user namespaces:
+field two is the start of a range of
+user IDs in the user namespace of the process that opened
+.IR uid_map .
+.IP b)
+If the two processes are in the same user namespace:
+field two is the start of the range of
+user IDs in the parent user namespace of the process
+.IR pid .
+This case enables the opener of
+.I uid_map
+(the common case here is opening
+.IR /proc/self/uid_map )
+to see the mapping of user IDs into the user namespace of the process
+that created this user namespace.
+.RE
+.IP (3)
+The length of the range of user IDs that is mapped between the two
+user namespaces.
+.PP
+After the creation of a new user namespace, the
+.I uid_map
+file of
+.I one
+of the processes in the namespace may be written to
+.I once
+to define the mapping of user IDs in the new user namespace.
+(An attempt to write more than once to a
+.I uid_map
+file in a user namespace fails with the error
+.BR EPERM .)
+
+The lines written to
+.IR uid_map
+must conform to the following rules:
+.IP * 3
+The three fields must be valid numbers,
+and the last field must be greater than 0.
+.IP *
+Lines are terminated by newline characters.
+.IP *
+There is an (arbitrary) limit on the number of lines in the file.
+As of Linux 3.8, the limit is five lines.
+In addition, the number of bytes written to
+the file must be less than the system page size,
+.\" FIXME(Eric): the restriction "less than" rather than "less than or equal"
+.\" seems strangely arbitrary. Furthermore, the comment does not agree
+.\" with the code in kernel/user_namespace.c. Which is correct?
+and the write must be performed at the start of the file (i.e.,
+.BR lseek (2)
+and
+.BR pwrite (2)
+can't be used to write to nonzero offsets in the file).
+.IP *
+The range of user IDs specified in each line cannot overlap with the ranges
+in any other lines.
+In the current implementation (Linux 3.8), this requirement is
+satisfied by a simplistic implementation that imposes the further
+requirement that
+the values in both field 1 and field 2 of successive lines must be
+in ascending numerical order.
+.IP *
+At least one line must be written to the file.
+.PP
+Writes that violate the above rules fail with the error
+.BR EINVAL .
+
+In order for a process to write to the
+.I /proc/[pid]/uid_map
+.RI ( /proc/[pid]/gid_map )
+file, all of the following requirements must be met:
+.IP 1. 3
+The writing process must have the
+.BR CAP_SETUID
+.RB ( CAP_SETGID )
+capability in the user namespace of the process
+.IR pid .
+.\" FIXME(Eric):
+.\" Something isn't quite right in the description here.
+.\" Suppose UID 1000 creates a user namespace. At this point, UID 0 in
+.\" the parent namespace can write a map of (say) '0 1000 10' to uid_map.
+.\" That succeeds. But how is that case covered in the three rules here?
+.\" In other words, how does UID 0 in the parent namespace have any
+.\" capabilities in the new child namespace? Somewhere on the page,
+.\" I think there needs to be a statement about the privileges of
+.\" UID 0 when no mapping has yet been defined, right?
+.\" Or is it simply the case that UID 0 in the parent namespace
+.\" always has all capabilities in the child namespace?
+.\"
+.IP 2.
+The writing process must be in either the user namespace of the process
+.I pid
+or inside the parent user namespace of the process
+.IR pid .
+.IP 3.
+One of the following is true:
+.RS
+.IP * 3
+The data written to
+.I uid_map
+.RI ( gid_map )
+consists of a single line that maps the writing process's file system user ID
+(group ID) in the parent user namespace to a user ID (group ID)
+in the user namespace.
+The usual case here is that this single line provides a mapping for the user ID
+of the process that created the namespace.
+.IP * 3
+The process has the
+.BR CAP_SETUID
+.RB ( CAP_SETGID )
+capability in the parent user namespace.
+Thus, a privileged process can make mappings to arbitrary user IDs (group IDs)
+in the parent user namespace.
+.RE
+.PP
+Writes that violate the above rules fail with the error
+.BR EPERM .
+.PP
+In order to create a new user namespace,
+there must exist a mapping of the caller's effective
+user and group IDs into the parent namespace.
+If such a mapping does not exist, then
+.BR clone (2)
+and
+.BR unshare (2)
+fail with the error
+.BR EPERM .
+.PP
+When a process inside a user namespace executes
+a set-user-ID (set-group-ID) program,
+the process's effective user (group) ID inside the namespace is changed
+to whatever value is mapped for the user (group) ID of the file.
+However, if either the user
+.I or
+the group ID of the file has no mapping inside the namespace,
+the set-user-ID (set-group-ID) bit is silently ignored:
+the new program is executed,
+but the process's effective user (group) ID is left unchanged.
+(This mirrors the semantics of executing a set-user-ID or set-group-ID
+program that resides on a file system that was mounted with the
+.BR MS_NOSUID
+flag (see
+.BR mount (2)).)
+.SH CONFORMING TO
+Namespaces are a Linux-specific feature.
+.SH SEE ALSO
+.BR unshare (1),
+.BR clone (2),
+.BR setns (2),
+.BR unshare (2),
+.BR proc (5),
+.BR credentials (7),
+.BR capabilities (7),
+.BR namespaces (7)
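
Reviewer note: the three-field uid_map line format described in this page (start of the range inside the namespace, start of the range in the other namespace, length) can be illustrated with a small user-space sketch. This is not kernel code and is not part of the patch. The function names `parse_id_map`, `to_parent`, and `from_parent` are hypothetical, and the sketch checks only the field-count, positive-length, and at-least-one-line rules, not the line limit, ordering, or permission rules. It assumes the default overflow ID of 65534 mentioned in the page.

```python
# Illustrative model of uid_map/gid_map range translation (hypothetical helpers,
# not a kernel or libc API).

OVERFLOW_ID = 65534  # default overflow ID named in the page; an assumption here


def parse_id_map(text):
    """Parse uid_map-style text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError("each line needs exactly three fields")
        inside, outside, length = (int(f) for f in fields)
        if length <= 0:
            raise ValueError("the length field must be greater than 0")
        entries.append((inside, outside, length))
    if not entries:
        raise ValueError("at least one line is required")
    return entries


def to_parent(entries, inside_id):
    """Map an ID inside the namespace to the other namespace; None if unmapped."""
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None


def from_parent(entries, outside_id, overflow=OVERFLOW_ID):
    """Map an outside ID into the namespace; unmapped IDs show as the overflow ID."""
    for inside, outside, length in entries:
        if outside <= outside_id < outside + length:
            return inside + (outside_id - outside)
    return overflow
```

With the map text "0 1000 1" followed by "1 100000 65535", ID 0 inside the namespace corresponds to ID 1000 outside it, and an outside ID with no mapping is reported inside the namespace as the overflow ID.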