clone.2, namespaces.7: Move some CLONE_NEWUSER text from clone.2 to namespaces.7

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk 2013-01-14 04:49:29 +01:00
parent 3dd2331ce7
commit 9d005472a8
2 changed files with 165 additions and 158 deletions


@@ -379,90 +379,6 @@ in the same
.BR clone ()
call.
.TP
.BR CLONE_NEWUSER
(This flag first became meaningful for
.BR clone ()
in Linux 2.6.23;
the current
.BR clone ()
semantics were merged in Linux 3.5,
and the final pieces to make the user namespaces completely usable were
merged in Linux 3.8.)
If
.B CLONE_NEWUSER
is set, then create the process in a new user namespace.
If this flag is not set, then (as with
.BR fork (2))
the process is created in the same user namespace as the calling process.
A user namespace provides an isolated environment for
security-related identifiers, in particular,
user IDs, group IDs, keys (see
.BR keyctl (2)),
and capabilities.
When a user namespace is created,
it starts out without a mapping of user IDs (group IDs)
to the parent user namespace.
The desired mapping of user IDs (group IDs) to the parent user namespace
may be set by writing into
.IR /proc/[pid]/uid_map
.RI ( /proc/[pid]/gid_map );
see
.BR proc (5).
The first process in a user namespace starts out with a complete set
of capabilities with respect to the new user namespace.
System calls that return user IDs (group IDs) will return
either the user ID (group ID) mapped into the current
user namespace if there is a mapping, or the overflow user ID (group ID);
the default value for the overflow user ID (group ID) is 65534.
See the descriptions of
.IR /proc/sys/kernel/overflowuid
and
.IR /proc/sys/kernel/overflowgid
in
.BR proc (5).
Use of this flag requires a kernel configured with the
.BR CONFIG_USER_NS
option.
Before Linux 3.8, use of
.BR CLONE_NEWUSER
required that the caller have three capabilities:
.BR CAP_SYS_ADMIN ,
.BR CAP_SETUID ,
and
.BR CAP_SETGID .
.\" Before Linux 2.6.29, it appears that only CAP_SYS_ADMIN was needed
Starting with Linux 3.8,
no privileges are needed to create a user namespace,
and mount, PID, IPC, network, and UTS namespaces can be created with just the
.B CAP_SYS_ADMIN
capability in the caller's user namespace.
If
.BR CLONE_NEWUSER
is specified along with other
.B CLONE_NEW*
flags in a single
.BR clone ()
call, the user namespace is guaranteed to be created first,
giving the caller privileges over the remaining
namespaces created by the call.
Thus, it is possible for an unprivileged caller to specify this combination
of flags.
Over the years, many features have been added
to the Linux kernel that are available only to privileged users
because of their potential to confuse set-user-ID-root applications.
In general, it becomes safe to allow the root user in a user namespace to
use those features because it is impossible, while in a user namespace,
to gain more privilege than the root user of a user namespace has.
.TP
.BR CLONE_NEWPID " (since Linux 2.6.24)"
.\" This explanation draws a lot of details from
@@ -481,68 +397,47 @@ the process is created in the same PID namespace as
the calling process.
This flag is intended for the implementation of containers.
A PID namespace provides an isolated environment for PIDs:
PIDs in a new namespace start at 1,
somewhat like a standalone system, and calls to
.BR fork (2),
.BR vfork (2),
or
.BR clone ()
will produce processes with PIDs that are unique within the namespace.
For further information on PID namespaces, see
.BR namespaces (7).
The first process created in a new namespace
(i.e., the process created using the
.BR CLONE_NEWPID
flag) has the PID 1, and is the "init" process for the namespace.
Children that are orphaned within the namespace will be reparented
to this process rather than
.BR init (8).
Unlike the traditional
.B init
process, the "init" process of a PID namespace can terminate,
and if it does, all of the processes in the namespace are terminated.
PID namespaces form a hierarchy.
When a new PID namespace is created,
the processes in that namespace are visible
in the PID namespace of the process that created the new namespace;
analogously, if the parent PID namespace is itself
the child of another PID namespace,
then processes in the child and parent PID namespaces will both be
visible in the grandparent PID namespace.
Conversely, the processes in the "child" PID namespace do not see
the processes in the parent namespace.
The existence of a namespace hierarchy means that each process
may now have multiple PIDs:
one for each namespace in which it is visible;
each of these PIDs is unique within the corresponding namespace.
(A call to
.BR getpid (2)
always returns the PID associated with the namespace in which
the process lives.)
After creating the new namespace,
it is useful for the child to change its root directory
and mount a new procfs instance at
.I /proc
so that tools such as
.BR ps (1)
work correctly.
.\" mount -t proc proc /proc
(If
.BR CLONE_NEWNS
is also included in
.IR flags ,
then it isn't necessary to change the root directory:
a new procfs instance can be mounted directly over
.IR /proc .)
Use of this flag requires: a kernel configured with the
.B CONFIG_PID_NS
option and that the process be privileged
Use of this flag requires
that the process be privileged
.RB ( CAP_SYS_ADMIN ).
This flag can't be specified in conjunction with
.BR CLONE_THREAD .
.TP
.BR CLONE_NEWUSER
(This flag first became meaningful for
.BR clone ()
in Linux 2.6.23;
the current
.BR clone ()
semantics were merged in Linux 3.5,
and the final pieces to make the user namespaces completely usable were
merged in Linux 3.8.)
If
.B CLONE_NEWUSER
is set, then create the process in a new user namespace.
If this flag is not set, then (as with
.BR fork (2))
the process is created in the same user namespace as the calling process.
For further information on user namespaces, see
.BR namespaces (7).
Before Linux 3.8, use of
.BR CLONE_NEWUSER
required that the caller have three capabilities:
.BR CAP_SYS_ADMIN ,
.BR CAP_SETUID ,
and
.BR CAP_SETGID .
.\" Before Linux 2.6.29, it appears that only CAP_SYS_ADMIN was needed
Starting with Linux 3.8,
no privileges are needed to create a user namespace.
.TP
.BR CLONE_NEWUTS " (since Linux 2.6.19)"
If


@@ -292,27 +292,88 @@ PID namespaces isolate the process ID number space,
meaning that processes in different PID namespaces can have the same PID.
PID namespaces allow containers to migrate to a new host
while the processes inside the container maintain the same PIDs.
Each PID namespace has its own init (PID 1, see
.BR init (1)),
the "ancestor of all processes" that
manages various system initialization tasks and
reaps orphaned child processes when they terminate.
From the point of view of a particular PID namespace instance,
a process has two PIDs: the PID inside the namespace,
and the PID outside the namespace on the host system.
PID namespaces can be nested:
a process will have one PID for each of the layers of the hierarchy
PIDs in a new PID namespace start at 1,
somewhat like a standalone system, and calls to
.BR fork (2),
.BR vfork (2),
or
.BR clone (2)
will produce processes with PIDs that are unique within the namespace.
The first process created in a new namespace
(i.e., the process created using
.BR clone (2)
with the
.BR CLONE_NEWPID
flag, or the first child created by a process after a call to
.BR unshare (2)
using the
.BR CLONE_NEWPID
flag) has the PID 1, and is the "init" process for the namespace (see
.BR init (1)).
Children that are orphaned within the namespace will be reparented
to this process rather than
.BR init (8).
Unlike the traditional
.B init
process, the "init" process of a PID namespace can terminate,
and if it does, all of the processes in the namespace are terminated.
PID namespaces can be nested.
When a new PID namespace is created,
the processes in that namespace are visible
in the PID namespace of the process that created the new namespace;
analogously, if the parent PID namespace is itself
the child of another PID namespace,
then processes in the child and parent PID namespaces will both be
visible in the grandparent PID namespace.
Conversely, the processes in the "child" PID namespace do not see
the processes in the parent namespace.
More succinctly: a process can see (e.g., send signals with
.BR kill (2))
only processes contained in its own PID namespace
and the namespaces nested below that PID namespace.
A process will have one PID for each of the layers of the hierarchy
starting from the PID namespace in which it resides
through to the root PID namespace.
A process can see (e.g., send signals with
.BR kill (2))
only processes contained in its own PID namespace
and the namespaces nested below that PID namespace.
A call to
.BR getpid (2)
always returns the PID associated with the namespace in which
the process resides.
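
The nesting rules above can be illustrated with a toy model. This is only a sketch of the bookkeeping, not a kernel interface: the names PidNamespace, Process, pid_in, and can_see are hypothetical, and PIDs are assumed to be handed out sequentially in every namespace.

```python
import itertools


class PidNamespace:
    """Toy model of a PID namespace (hypothetical, not a kernel API)."""

    def __init__(self, parent=None):
        self.parent = parent
        self._next_pid = itertools.count(1)  # PIDs in a new namespace start at 1
        self.pids = {}  # PID in this namespace -> Process

    def ancestry(self):
        """Yield this namespace, then each ancestor up to the root."""
        ns = self
        while ns is not None:
            yield ns
            ns = ns.parent


class Process:
    def __init__(self, namespace):
        self.namespace = namespace
        self.pid_in = {}  # namespace -> PID of this process in that namespace
        # One PID per layer, from the namespace of residence up to the root.
        for ns in namespace.ancestry():
            pid = next(ns._next_pid)
            ns.pids[pid] = self
            self.pid_in[ns] = pid

    def getpid(self):
        """PID in the namespace in which the process resides."""
        return self.pid_in[self.namespace]

    def can_see(self, other):
        """True if `other` has a PID in this process's own namespace."""
        return self.namespace in other.pid_in


root = PidNamespace()
host_shell = Process(root)          # first process in the root namespace
child_ns = PidNamespace(parent=root)
container_init = Process(child_ns)  # "init" of the nested namespace
```

In this model, container_init has one PID in child_ns and a second PID in root, and host_shell can see container_init while the reverse is not true.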
After creating a new PID namespace,
it is useful for the child to change its root directory
and mount a new procfs instance at
.I /proc
so that tools such as
.BR ps (1)
work correctly.
.\" mount -t proc proc /proc
(If
.BR CLONE_NEWNS
is also included in the
.IR flags
argument of
.BR clone (2)
or
.BR unshare (2)),
then it isn't necessary to change the root directory:
a new procfs instance can be mounted directly over
.IR /proc .)
Use of PID namespaces requires a kernel that is configured with the
.B CONFIG_PID_NS
option.
.SS User namespaces (CLONE_NEWUSER)
User namespaces isolate the user and group ID number spaces.
User namespaces isolate
security-related identifiers, in particular,
user IDs, group IDs, keys (see
.BR keyctl (2)),
and capabilities.
In other words, a process's user and group IDs can be different
inside and outside a user namespace.
A process can have a normal unprivileged user ID outside a user namespace
@@ -321,7 +382,58 @@ in other words,
the process has full privileges for operations inside the user namespace,
but is unprivileged for operations outside the namespace.
Starting in Linux 3.8, unprivileged processes can create user namespaces.
When a user namespace is created,
it starts out without a mapping of user IDs (group IDs)
to the parent user namespace.
The desired mapping of user IDs (group IDs) to the parent user namespace
may be set by writing into
.IR /proc/[pid]/uid_map
.RI ( /proc/[pid]/gid_map );
see below.
The first process in a user namespace starts out with a complete set
of capabilities with respect to the new user namespace.
System calls that return user IDs (group IDs) will return
either the user ID (group ID) mapped into the current
user namespace if there is a mapping, or the overflow user ID (group ID);
the default value for the overflow user ID (group ID) is 65534.
See the descriptions of
.IR /proc/sys/kernel/overflowuid
and
.IR /proc/sys/kernel/overflowgid
in
.BR proc (5).
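
The lookup with overflow fallback described above can be sketched as follows. The sketch assumes that each line of a map file holds three numbers (first ID inside the namespace, first ID outside the namespace, length of the range); that format is not spelled out in this excerpt. The helper names parse_id_map and id_seen_inside are hypothetical, and the code only models the translation; it does not read or write any /proc file.

```python
OVERFLOW_ID = 65534  # default value of overflowuid and overflowgid


def parse_id_map(text):
    """Parse map text into (inside_start, outside_start, length) tuples.

    Assumed line format: "<inside-start> <outside-start> <length>".
    """
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def id_seen_inside(outside_id, entries, overflow=OVERFLOW_ID):
    """Return the ID reported inside the namespace for `outside_id`.

    Falls back to the overflow ID when no map entry covers the ID.
    """
    for inside, outside, length in entries:
        if outside <= outside_id < outside + length:
            return inside + (outside_id - outside)
    return overflow


# Map user ID 1000 in the parent namespace to user ID 0 in the new namespace.
mapping = parse_id_map("0 1000 1\n")
```

With this mapping, user ID 1000 is seen as 0 inside the namespace, while any unmapped ID, or any ID at all before a mapping is written, is reported as 65534.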
Starting in Linux 3.8, unprivileged processes can create user namespaces,
and mount, PID, IPC, network, and UTS namespaces can be created with just the
.B CAP_SYS_ADMIN
capability in the caller's user namespace.
If
.BR CLONE_NEWUSER
is specified along with other
.B CLONE_NEW*
flags in a single
.BR clone (2)
or
.BR unshare (2)
call, the user namespace is guaranteed to be created first,
giving the caller privileges over the remaining
namespaces created by the call.
Thus, it is possible for an unprivileged caller to specify this combination
of flags.
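
As a rough illustration of the rule above, the following sketch decides whether a caller must already hold CAP_SYS_ADMIN for a given flag combination on Linux 3.8 or later. The flag values are those defined in <linux/sched.h>; the helper name needs_cap_sys_admin_beforehand is hypothetical, and the model deliberately ignores every other check the kernel may apply.

```python
# Flag values as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000
CLONE_NEWUTS = 0x04000000
CLONE_NEWIPC = 0x08000000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000

OTHER_NAMESPACE_FLAGS = (
    CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET
)


def needs_cap_sys_admin_beforehand(flags):
    """Hypothetical helper modelling only the Linux 3.8+ rule.

    If CLONE_NEWUSER is present, the user namespace is created first and
    grants the caller privileges over the remaining namespaces, so no
    capability is needed in advance. Otherwise, creating any other
    namespace type requires CAP_SYS_ADMIN.
    """
    if flags & CLONE_NEWUSER:
        return False
    return bool(flags & OTHER_NAMESPACE_FLAGS)
```

For example, CLONE_NEWPID on its own yields True, whereas CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS yields False.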
Use of user namespaces requires a kernel that is configured with the
.B CONFIG_USER_NS
option.
Over the years, many features have been added
to the Linux kernel that are available only to privileged users
because of their potential to confuse set-user-ID-root applications.
In general, it becomes safe to allow the root user in a user namespace to
use those features because it is impossible, while in a user namespace,
to gain more privilege than the root user of a user namespace has.
The
.IR /proc/[pid]/uid_map