.\" Copyright (c) 2013 by Michael Kerrisk <mtk.manpages@gmail.com>
.\" and Copyright (c) 2012 by Eric W. Biederman <ebiederm@xmission.com>
.\"
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\"
.TH USER_NAMESPACES 7 2013-01-14 "Linux" "Linux Programmer's Manual"
.SH NAME
user_namespaces \- overview of Linux user namespaces
.SH DESCRIPTION
For an overview of namespaces, see
.BR namespaces (7).

User namespaces isolate security-related identifiers, in particular,
user IDs and group IDs (see
.BR credentials (7)),
keys (see
.BR keyctl (2)),
and capabilities (see
.BR capabilities (7)).
A process's user and group IDs can be different
inside and outside a user namespace.
In particular,
a process can have a normal unprivileged user ID outside a user namespace
while at the same time having a user ID of 0 inside the namespace;
in other words,
the process has full privileges for operations inside the user namespace,
but is unprivileged for operations outside the namespace.

User namespaces can be nested;
that is, each user namespace has a parent user namespace,
and can have zero or more child user namespaces.
The parent user namespace is the user namespace
of the process that creates the user namespace via a call to
.BR unshare (2)
or
.BR clone (2)
with the
.BR CLONE_NEWUSER
flag.

The first process in a user namespace starts out with a complete set
of capabilities with respect to the new user namespace.
On the other hand,
that process has no capabilities outside that user namespace,
even if the new namespace is created by the root user.
(However, a child process created by the root user
will be able to access resources such as
files that are owned by user ID 0,
and will be able to do things such as sending signals
to processes belonging to user ID 0.)

When a user namespace is created,
it starts out without a mapping of user IDs (group IDs)
to the parent user namespace.
The desired mapping of user IDs (group IDs) to the parent user namespace
may be set by writing into
.IR /proc/[pid]/uid_map
.RI ( /proc/[pid]/gid_map );
see below.
.PP
In order to create a new user namespace,
there must exist a mapping of the caller's effective
user and group IDs into the parent namespace.
If such a mapping does not exist, then
.BR clone (2)
and
.BR unshare (2)
fail with the error
.BR EPERM .

System calls that return user IDs (group IDs)\(emfor example,
.BR getuid (2),
.BR getgid (2),
and the credential fields in the structure returned by
.BR stat (2)\(emwill
return either the user ID (group ID) mapped into the current
user namespace if there is a mapping, or the overflow user ID (group ID);
the default value for the overflow user ID (group ID) is 65534.
See the descriptions of
.IR /proc/sys/kernel/overflowuid
and
.IR /proc/sys/kernel/overflowgid
in
.BR proc (5).
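.PP
The fallback to the overflow ID can be pictured with a short C sketch.
It is a simplified model rather than kernel code:
the names
.IR id_extent ,
.IR id_seen_inside ,
and
.I OVERFLOW_ID
are hypothetical, the overflow ID is assumed to have its default value,
and each extent mirrors the three-field mapping lines described
later in this page.
.PP
.nf
```c
#include <stddef.h>
#include <stdint.h>

/* Assumed default; see /proc/sys/kernel/overflowuid. */
#define OVERFLOW_ID 65534u

/* One mapping line: IDs [inside, inside + count) in the namespace
   correspond to IDs [outside, outside + count) in the parent. */
struct id_extent {
    uint32_t inside;
    uint32_t outside;
    uint32_t count;
};

/* Return the ID that a process inside the namespace would be shown
   for an ID expressed in the parent namespace, or the overflow ID
   if no extent covers it. */
static uint32_t
id_seen_inside(const struct id_extent *map, size_t n, uint32_t outside_id)
{
    for (size_t i = 0; i < n; i++) {
        if (outside_id >= map[i].outside &&
            outside_id - map[i].outside < map[i].count)
            return map[i].inside + (outside_id - map[i].outside);
    }
    return OVERFLOW_ID;
}
```
.fi
.PP
With a single extent that maps ID 0 in the namespace to ID 1000 in the parent,
the sketch reports 0 for parent ID 1000 and the overflow ID for any other value.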

When a process accesses a file, its user and group IDs
are mapped into the initial user namespace for the purpose of permission
checking and assigning IDs when creating a file.
When a process retrieves file user and group IDs via
.BR stat (2),
the IDs are mapped in the opposite direction,
to produce values relative to the process's user and group ID mappings.

When a process's user and group IDs are passed over a UNIX domain socket
to a process in a different user namespace (see the description of
.B SCM_CREDENTIALS
in
.BR unix (7)),
they are translated into the corresponding values as per the
receiving process's user and group ID mappings.

Use of user namespaces requires a kernel that is configured with the
.B CONFIG_USER_NS
option.
.\"
.\" ============================================================
.\"
.SS Interaction of user namespaces and other types of namespaces
Starting in Linux 3.8, unprivileged processes can create user namespaces,
and mount, PID, IPC, network, and UTS namespaces can be created with just the
.B CAP_SYS_ADMIN
capability in the caller's user namespace.

If
.BR CLONE_NEWUSER
is specified along with other
.B CLONE_NEW*
flags in a single
.BR clone (2)
or
.BR unshare (2)
call, the user namespace is guaranteed to be created first,
giving the caller privileges over the remaining
namespaces created by the call.
Thus, it is possible for an unprivileged caller to specify this combination
of flags.
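.PP
As a hedged illustration of combining these flags,
the following C sketch creates a user namespace and a UTS namespace in one
.BR unshare (2)
call and then sets the hostname inside the new UTS namespace.
The function name
.I unshare_user_and_uts
is hypothetical, and the call can fail
(for example, on kernels or sandboxes that do not permit
unprivileged user namespaces),
in which case the sketch simply returns the negated error number.
.PP
.nf
```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for unshare() and the CLONE_NEW* flags */
#endif
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

/* Create a user namespace and a UTS namespace together, then set the
   hostname in the new UTS namespace. Returns 0 on success or a
   negated errno value on failure. */
static int
unshare_user_and_uts(const char *hostname)
{
    if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) == -1)
        return -errno;

    /* The caller now holds capabilities in the new user namespace,
       which the kernel recorded against the new UTS namespace. */
    if (sethostname(hostname, strlen(hostname)) == -1)
        return -errno;

    return 0;
}
```
.fi
.PP
The hostname change is confined to the new UTS namespace;
the hostname seen by processes in the original namespace is unaffected.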

When a new IPC, mount, network, PID, or UTS namespace is created via
.BR clone (2)
or
.BR unshare (2),
the kernel records the user namespace of the creating process against
the new namespace.
When a process in the new namespace subsequently performs
privileged operations that operate on global
resources isolated by the namespace,
the permission checks are performed according to the process's capabilities
in the user namespace that the kernel associated with the new namespace.
.\"
.\" ============================================================
.\"
.SS Capabilities
A process may have a capability either
because that capability is present in its effective capability set,
or because it inherits the capability from a parent user namespace
according to the following rules:
.\" In the 3.8 sources, see security/commoncap.c::cap_capable():
.IP 1. 3
If a process has a capability in a user namespace,
then it has that capability in all child (and further removed descendant)
namespaces as well.
.IP 2.
.\" * The owner of the user namespace in the parent of the
.\" * user namespace has all caps.
When a user namespace is created, the kernel records the effective
user ID of the creating process as being the "owner" of the namespace
(and likewise associates the effective group ID of the creating process
with the namespace).
.IP
A process whose effective user ID matches that of the
owner of a user namespace and which is a member of the parent namespace
has all capabilities in the user namespace.
By virtue of the first rule,
this means that the process has all capabilities in all
further removed descendant user namespaces as well.
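.PP
The two rules can be expressed as a walk up the chain of parent namespaces.
The following C sketch is a simplified model written for this page,
not the kernel implementation:
the types
.I userns
and
.IR process ,
the function
.IR has_capability ,
and the use of a bit mask for the effective set are all assumptions
made for illustration, and user IDs are taken to be the values
that the kernel uses internally.
.PP
.nf
```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; not kernel or libc structures. */
struct userns {
    const struct userns *parent;  /* NULL for the initial namespace */
    uint32_t owner;               /* effective UID of the creator */
};

struct process {
    const struct userns *ns;      /* namespace the process belongs to */
    uint32_t euid;                /* effective user ID */
    uint64_t effective;           /* stand-in for the effective set */
};

/* Does process p have capability number cap (0..63 in this model)
   in the namespace target? */
static bool
has_capability(const struct process *p, const struct userns *target,
               int cap)
{
    for (const struct userns *ns = target; ns != NULL; ns = ns->parent) {
        /* Reached the process's own namespace: its effective set
           decides (this also covers rule 1 for descendants). */
        if (ns == p->ns)
            return (p->effective >> cap) & 1;

        /* Rule 2: the owner, as a member of the parent namespace,
           has all capabilities in this namespace. */
        if (ns->parent == p->ns && ns->owner == p->euid)
            return true;
    }
    return false;   /* target is neither p's namespace nor a descendant */
}
```
.fi
.PP
In this model, a process with an empty effective set in the initial namespace
still has every capability in a namespace that it owns and in that
namespace's descendants, but none in the initial namespace itself.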
.\" As a rough approximation, this means that
.\" the user who creates a user namespace
.\" has all capabilities inside that namespace and its descendants.
.\"
.\" ============================================================
.\"
.SS User and group ID mappings: uid_map and gid_map
The
.IR /proc/[pid]/uid_map
and
.IR /proc/[pid]/gid_map
files (available since Linux 3.5)
.\" commit 22d917d80e842829d0ca0a561967d728eb1d6303
expose the mappings for user and group IDs
inside the user namespace for the process
.IR pid .
These files can be read to view the mappings in a user namespace and
written to (once) to define the mappings.

The description in the following paragraphs explains the details for
.IR uid_map ;
.IR gid_map
is exactly the same,
but each instance of "user ID" is replaced by "group ID".

The
.I uid_map
file exposes the mapping of user IDs from the user namespace
of the process
.IR pid
to the user namespace of the process that opened
.IR uid_map
(but see a qualification to this point below).
In other words, processes that are in different user namespaces
will potentially see different values when reading from a particular
.I uid_map
file, depending on the user ID mappings for the user namespaces
of the reading processes.

Each line in the
.I uid_map
file specifies a 1-to-1 mapping of a range of contiguous
user IDs between two user namespaces.
(When a user namespace is first created, this file is empty.)
The specification in each line takes the form of
three numbers delimited by white space.
The first two numbers specify the starting user ID in
each user namespace.
The third number specifies the length of the mapped range.
In detail, the fields are interpreted as follows:
.IP (1) 4
The start of the range of user IDs in
the user namespace of the process
.IR pid .
.IP (2)
The start of the range of user
IDs to which the user IDs specified by field one map.
How field two is interpreted depends on whether the process that opened
.I uid_map
and the process
.IR pid
are in the same user namespace, as follows:
.RS
.IP a) 3
If the two processes are in different user namespaces:
field two is the start of a range of
user IDs in the user namespace of the process that opened
.IR uid_map .
.IP b)
If the two processes are in the same user namespace:
field two is the start of the range of
user IDs in the parent user namespace of the process
.IR pid .
This case enables the opener of
.I uid_map
(the common case here is opening
.IR /proc/self/uid_map )
to see the mapping of user IDs into the user namespace of the process
that created this user namespace.
.RE
.IP (3)
The length of the range of user IDs that is mapped between the two
user namespaces.
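.PP
As an illustration of the three-field format,
the following C sketch parses a single line as it might be read from a
.I uid_map
file.
The names
.I id_extent
and
.I parse_map_line
are hypothetical, and the sketch assumes that each field fits in 32 bits.
.PP
.nf
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* The three fields of one mapping line. */
struct id_extent {
    uint32_t inside;    /* field 1: start of range in the namespace of pid */
    uint32_t outside;   /* field 2: start of range in the other namespace */
    uint32_t count;     /* field 3: length of the range */
};

/* Parse one line of three white-space-delimited numbers.
   Returns 0 on success, or -1 if the line does not consist of
   exactly three numeric fields. */
static int
parse_map_line(const char *line, struct id_extent *ext)
{
    char extra;   /* receives any unexpected trailing character */
    int n = sscanf(line, "%" SCNu32 " %" SCNu32 " %" SCNu32 " %c",
                   &ext->inside, &ext->outside, &ext->count, &extra);

    return n == 3 ? 0 : -1;
}
```
.fi
.PP
A caller would typically read
.I /proc/self/uid_map
line by line (for example, with
.BR fgets (3))
and pass each line to the parser.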
.\"
.\" ============================================================
.\"
.SS Defining user and group ID mappings: writing to uid_map and gid_map
.PP
After the creation of a new user namespace, the
.I uid_map
file of
.I one
of the processes in the namespace may be written to
.I once
to define the mapping of user IDs in the new user namespace.
An attempt to write more than once to a
.I uid_map
file in a user namespace fails with the error
.BR EPERM .
Similar rules apply for
.I gid_map
files.

The lines written to
.IR uid_map
.RI ( gid_map )
must conform to the following rules:
.IP * 3
The three fields must be valid numbers,
and the last field must be greater than 0.
.IP *
Lines are terminated by newline characters.
.IP *
There is an (arbitrary) limit on the number of lines in the file.
As at Linux 3.8, the limit is five lines.
In addition, the number of bytes written to
the file must be less than the system page size,
.\" FIXME(Eric): the restriction "less than" rather than "less than or equal"
.\" seems strangely arbitrary. Furthermore, the comment does not agree
.\" with the code in kernel/user_namespace.c. Which is correct.
and the write must be performed at the start of the file (i.e.,
.BR lseek (2)
and
.BR pwrite (2)
can't be used to write to nonzero offsets in the file).
.IP *
The range of user IDs (group IDs)
specified in each line cannot overlap with the ranges
in any other lines.
In the initial implementation (Linux 3.8), this requirement was
satisfied by a simplistic implementation that imposed the further
requirement that
the values in both field 1 and field 2 of successive lines must be
in ascending numerical order,
which prevented some otherwise valid maps from being created.
Linux 3.9 and later
.\" commit 0bd14b4fd72afd5df41e9fd59f356740f22fceba
fix this limitation, allowing any valid set of nonoverlapping maps.
.IP *
The mapped user IDs (group IDs) must in turn have a mapping
in the parent user namespace.
.IP *
At least one line must be written to the file.
.PP
Writes that violate the above rules fail with the error
.BR EINVAL .
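.PP
Some of these rules can be checked before a map is written.
The following C sketch covers only the rules on the number of lines,
the length field, and overlapping ranges;
it does not model the byte limit, the write offset,
or the requirement that the mapped IDs have a mapping in the parent namespace.
The names
.IR id_extent ,
.IR ranges_overlap ,
.IR check_map ,
and
.I MAX_MAP_LINES
are hypothetical,
and treating an overlap in either field 1 or field 2 as an error
is an assumption of the sketch.
.PP
.nf
```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAP_LINES 5     /* limit stated above for Linux 3.8 */

struct id_extent {
    uint32_t inside;        /* field 1 */
    uint32_t outside;       /* field 2 */
    uint32_t count;         /* field 3 */
};

/* True if [a, a + len_a) and [b, b + len_b) share at least one ID.
   The ends are computed in 64 bits so that they cannot wrap. */
static bool
ranges_overlap(uint32_t a, uint32_t len_a, uint32_t b, uint32_t len_b)
{
    uint64_t a_end = (uint64_t) a + len_a;
    uint64_t b_end = (uint64_t) b + len_b;

    return a < b_end && b < a_end;
}

/* Return 0 if the proposed map passes the modeled checks,
   or EINVAL otherwise. */
static int
check_map(const struct id_extent *map, size_t n)
{
    if (n < 1 || n > MAX_MAP_LINES)
        return EINVAL;

    for (size_t i = 0; i < n; i++) {
        if (map[i].count == 0)
            return EINVAL;

        /* Compare against every earlier line, in either field. */
        for (size_t j = 0; j < i; j++) {
            if (ranges_overlap(map[i].inside, map[i].count,
                               map[j].inside, map[j].count) ||
                ranges_overlap(map[i].outside, map[i].count,
                               map[j].outside, map[j].count))
                return EINVAL;
        }
    }
    return 0;
}
```
.fi
.PP
Because every pair of lines is compared,
the sketch accepts lines in any order,
which corresponds to the behavior described above for Linux 3.9 and later.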

In order for a process to write to the
.I /proc/[pid]/uid_map
.RI ( /proc/[pid]/gid_map )
file, all of the following requirements must be met:
.IP 1. 3
The writing process must have the
.BR CAP_SETUID
.RB ( CAP_SETGID )
capability in the user namespace of the process
.IR pid .
.IP 2.
The writing process must be in either the user namespace of the process
.I pid
or inside the parent user namespace of the process
.IR pid .
.IP 3.
One of the following is true:
.RS
.IP * 3
The data written to
.I uid_map
.RI ( gid_map )
consists of a single line that maps the writing process's file system user ID
(group ID) in the parent user namespace to a user ID (group ID)
in the user namespace.
The usual case here is that this single line provides a mapping for the
user ID (group ID) of the process that created the namespace.
.IP * 3
The process has the
.BR CAP_SETUID
.RB ( CAP_SETGID )
capability in the parent user namespace.
Thus, a privileged process can make mappings to arbitrary user IDs (group IDs)
in the parent user namespace.
.RE
.PP
Writes that violate the above rules fail with the error
.BR EPERM .
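.PP
The single-line case can be sketched in C as follows.
The function name
.I become_ns_root
is hypothetical.
The sketch assumes that the caller's file system user ID equals its
effective user ID (true unless
.BR setfsuid (2)
has been used), deals only with
.IR uid_map ,
and returns a negated error number if the kernel or the environment
refuses any step.
.PP
.nf
```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for unshare() and CLONE_NEWUSER */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Move the caller into a new user namespace and map its previous
   effective user ID to user ID 0 there. Returns 0 on success or a
   negated errno value on failure. */
static int
become_ns_root(void)
{
    char line[64];
    uid_t outer = geteuid();    /* must be read before unshare() */
    ssize_t written;
    int len, fd, err = 0;

    if (unshare(CLONE_NEWUSER) == -1)
        return -errno;

    /* One line: ID 0 inside, the old effective ID outside, length 1. */
    len = snprintf(line, sizeof(line), "0 %lu 1\n", (unsigned long) outer);

    fd = open("/proc/self/uid_map", O_WRONLY);
    if (fd == -1)
        return -errno;

    /* The whole map must be written in a single write at offset 0. */
    written = write(fd, line, (size_t) len);
    if (written == -1)
        err = -errno;
    else if (written != len)
        err = -EIO;

    close(fd);
    return err;
}
```
.fi
.PP
On success,
.BR geteuid (2)
returns 0 in the calling process;
a second attempt to write the map fails with
.BR EPERM ,
as described above.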
.\"
.\" ============================================================
.\"
.SS Set-user-ID and set-group-ID programs
.PP
When a process inside a user namespace executes
a set-user-ID (set-group-ID) program,
the process's effective user (group) ID inside the namespace is changed
to whatever value is mapped for the user (group) ID of the file.
However, if either the user
.I or
the group ID of the file has no mapping inside the namespace,
the set-user-ID (set-group-ID) bit is silently ignored:
the new program is executed,
but the process's effective user (group) ID is left unchanged.
(This mirrors the semantics of executing a set-user-ID or set-group-ID
program that resides on a file system that was mounted with the
.BR MS_NOSUID
flag (see
.BR mount (2)).)
.SH CONFORMING TO
Namespaces are a Linux-specific feature.
.SH NOTES
Over the years, many features have been added to the Linux kernel
that are available only to privileged users
because of their potential to confuse set-user-ID-root applications.
In general, it is safe to allow the root user in a user namespace to
use those features, because a process in a user namespace
cannot gain more privilege than the root user of that namespace has.
.SH SEE ALSO
.BR unshare (1),
.BR clone (2),
.BR setns (2),
.BR unshare (2),
.BR proc (5),
.BR credentials (7),
.BR capabilities (7),
.BR namespaces (7)