From ed3f4f34fc3a80db97256f9f283e274078bbfc31 Mon Sep 17 00:00:00 2001
From: Michael Kerrisk
Date: Tue, 9 Jan 2018 00:19:02 +0100
Subject: [PATCH] cgroups.7: Document cgroup v2 delegation via the 'nsdelegate' mount option

Reviewed-by: Tejun Heo
Signed-off-by: Michael Kerrisk
---
 man7/cgroups.7 | 100 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 92 insertions(+), 8 deletions(-)

diff --git a/man7/cgroups.7 b/man7/cgroups.7
index 0ed62a2fe..ccf9251cd 100644
--- a/man7/cgroups.7
+++ b/man7/cgroups.7
@@ -493,14 +493,6 @@ the value in this file is inherited from the corresponding file in the
 parent cgroup.
 .\"
 .SH CGROUPS VERSION 2
-.\" FIXME
-.\" Document the 'nsdelegate' mount option added in Linux 4.13
-.\" To test this, it can be useful to boot the kernel with the options:
-.\"
-.\" cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
-.\"
-.\" The effect of th latter option is to prevent systemd from employing
-.\" its "hybrid" cgroup mode, where it tries to make use of cgroups v2.
 In cgroups v2,
 all mounted controllers reside in a single unified hierarchy.
 While (different) controllers may be simultaneously
@@ -919,6 +911,93 @@ or the ownership of that file was passed to the delegatee,
 the delegatee can also control the further redistribution
 of the corresponding resources into the delegated subtree.
 .\"
+.SS Cgroups v2 delegation: nsdelegate and cgroup namespaces
+.\"
+.\" To test this, it can be useful to boot the kernel with the options:
+.\"
+.\" cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
+.\"
+.\" The effect of the latter option is to prevent systemd from employing
+.\" its "hybrid" cgroup mode, where it tries to make use of cgroups v2.
+.\"
+Starting with Linux 4.13,
+.\" commit 5136f6365ce3eace5a926e10f16ed2a233db5ba9
+there is a second way to perform cgroup delegation.
+This is done by mounting the cgroup v2 filesystem with the
+.I nsdelegate
+mount option:
+.PP
+.in +4n
+.EX
+$ mount -t cgroup2 -o nsdelegate none /sys/fs/cgroup/unified
+.EE
+.in
+.PP
+The effect of this option is to cause cgroup namespaces
+to automatically become delegation boundaries.
+More specifically,
+the following restrictions apply for processes inside the cgroup namespace:
+.IP * 3
+Writes to controller interface files in the root directory
+will fail with the error
+.BR EPERM .
+Processes inside the cgroup namespace can still write to delegatable
+files such as
+.IR cgroup.procs
+and
+.IR cgroup.subtree_control ,
+and can create a subhierarchy underneath the root directory of
+the cgroup namespace.
+.IP *
+Attempts to migrate processes across the namespace boundary are denied
+(with the error
+.BR ENOENT ).
+Processes inside the cgroup namespace can still
+(subject to the containment rules described below)
+move processes between cgroups
+.I within
+the subhierarchy under the namespace root.
+.PP
+The ability to define cgroup namespaces as delegation boundaries
+makes cgroup namespaces more useful.
+To understand why, suppose that we already have one cgroup hierarchy
+that has been delegated to a nonprivileged user,
+.IR cecilia ,
+using the older delegation technique described above.
+Suppose further that
+.I cecilia
+wanted to further delegate a subhierarchy
+under the existing delegated hierarchy.
+(For example, the delegated hierarchy might be associated with
+an unprivileged container run by
+.IR cecilia .)
+Even if a cgroup namespace was employed,
+because both hierarchies are owned by the unprivileged user
+.IR cecilia ,
+the following illegitimate actions could be performed:
+.IP * 3
+A process in the inferior hierarchy could change the
+resource controller settings in the root directory of that hierarchy.
+(These resource controller settings are intended to allow control to
+be exercised from the
+.I parent
+cgroup;
+a process inside the child cgroup should not be allowed to modify them.)
+.IP *
+A process inside the inferior hierarchy could move processes
+into and out of the inferior hierarchy if the cgroups in the
+superior hierarchy were somehow visible.
+.PP
+Employing the
+.I nsdelegate
+mount option prevents both of these possibilities.
+.PP
+The
+.I nsdelegate
+mount option only has an effect when performed in
+the initial mount namespace;
+in other mount namespaces, the option is silently ignored.
+.\"
 .SS Cgroup v2 delegation containment rules
 Some delegation
 .IR "containment rules"
@@ -941,6 +1020,11 @@ file in the common ancestor of the source and destination cgroups.
 (In some cases,
 the common ancestor may be the source or destination cgroup itself.)
 .IP *
+If the cgroup v2 filesystem was mounted with the
+.I nsdelegate
+option, the writer must be able to see the source and destination cgroup
+from its cgroup namespace.
+.IP *
 Before Linux 4.11:
 .\" commit 576dd464505fc53d501bb94569db76f220104d28
 the effective UID of the writer (i.e., the delegatee) matches the