diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7 index 13dfb078c..7e57f1e6a 100644 --- a/man7/mount_namespaces.7 +++ b/man7/mount_namespaces.7 @@ -76,272 +76,6 @@ in either mount namespace will not (by default) affect the mount point list seen in the other namespace (but see the following discussion of shared subtrees). .\" -.\" ============================================================ -.\" -.SS Restrictions on mount namespaces -Note the following points with respect to mount namespaces: -.IP * 3 -Each mount namespace has an owner user namespace. -As explained above, when a new mount namespace is created, -its mount point list is initialized as a copy of the mount point list -of another mount namespace. -If the new namespace and the namespace from which the mount point list -was copied are owned by different user namespaces, -then the new mount namespace is considered -.IR "less privileged" . -.IP * -When creating a less privileged mount namespace, -shared mounts are reduced to slave mounts. -(Shared and slave mounts are discussed below.) -This ensures that mappings performed in less -privileged mount namespaces will not propagate to more privileged -mount namespaces. -.IP * -Mounts that come as a single unit from a more privileged mount namespace are -locked together and may not be separated in a less privileged mount -namespace. -(The -.BR unshare (2) -.B CLONE_NEWNS -operation brings across all of the mounts from the original -mount namespace as a single unit, -and recursive mounts that propagate between -mount namespaces propagate as a single unit.) -.IP -In this context, "may not be separated" means that the mounts -are locked so that they may not be individually unmounted. 
-Consider the following example: -.IP -.RS -.in +4n -.EX -$ \fBsudo mkdir /mnt/dir\fP -$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP -$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP -$ \fBls /mnt/dir\fP # Former contents of directory are invisible -.EE -.in -.RE -.IP -The above steps, performed in a more privileged user namespace, -have created a (read-only) bind mount that -obscures the contents of the directory -.IR /mnt/dir . -For security reasons, it should not be possible to unmount -that mount in a less privileged user namespace, -since that would reveal the contents of the directory -.IR /mnt/dir . -.IP -Suppose we now create a new mount namespace -owned by a (new) subordinate user namespace. -The new mount namespace will inherit copies of all of the mounts -from the previous mount namespace. -However, those mounts will be locked because the new mount namespace -is owned by a less privileged user namespace. -Consequently, an attempt to unmount the mount fails: -.IP -.RS -.in +4n -.EX -$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP - \fBstrace \-o /tmp/log \e\fP - \fBumount /mnt/dir\fP -umount: /mnt/dir: not mounted. -$ \fBgrep \(aq^umount\(aq /tmp/log\fP -umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument) -.EE -.in -.RE -.IP -The error message from -.BR mount (8) -is a little confusing, but the -.BR strace (1) -output reveals that the underlying -.BR umount2 (2) -system call failed with the error -.BR EINVAL , -which is the error that the kernel returns to indicate that -the mount is locked. -.IP * -Following on from the previous point, -note that it is possible to unmount an entire -tree of mounts that propagated as a unit into a mount namespace -that is owned by a less privileged user namespace, -as illustrated in the following example. -.IP -First, we create new user and mount namespaces using -.BR unshare (1). -In the new mount namespace, -the propagation type of all mounts is set to private. 
-We then create a shared bind mount at -.IR /mnt , -and a small hierarchy of mount points underneath that mount point. -.IP -.in +4n -.EX -$ \fBPS1=\(aqns1# \(aq sudo unshare \-\-user \-\-map\-root\-user \e\fP - \fB\-\-mount \-\-propagation private bash\fP -ns1# \fBecho $$\fP # We need the PID of this shell later -778501 -ns1# \fBmount \-\-make\-shared \-\-bind /mnt /mnt\fP -ns1# \fBmkdir /mnt/x\fP -ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x\fP -ns1# \fBmkdir /mnt/x/y\fP -ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x/y\fP -ns1# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP -986 83 8:5 /mnt /mnt rw,relatime shared:344 -989 986 0:56 / /mnt/x rw,relatime -990 989 0:57 / /mnt/x/y rw,relatime -.EE -.in -.IP -Continuing in the same shell session, -we then create a second shell in a new mount namespace and a new subordinate -(and thus less privileged) user namespace and -check the state of the propagated mount points rooted at -.IR /mnt . -.IP -.in +4n -.EX -ns1# \fBPS1=\(aqns2# unshare \-\-user \-\-map\-root\-user \e\fP - \fB\-\-mount \-\-propagation unchanged bash\fP -ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP -1239 1204 8:5 /mnt /mnt rw,relatime master:344 -1240 1239 0:56 / /mnt/x rw,relatime -1241 1240 0:57 / /mnt/x/y rw,relatime -.EE -.in -.IP -Of note in the above output is that the propagation type of the mount point -.I /mnt -has been reduced to slave, as explained near the start of this subsection. -This means that submount events will propagate from the master -.I /mnt -in "ns1", but propagation will not occur in the opposite direction. -.IP -From a separate terminal window, we then use -.BR nsenter (1) -to enter the mount and user namespaces corresponding to "ns1". -In that terminal window, we then recursively bind mount -.IR /mnt/x -at the location -.IR /mnt/ppp . 
-.IP -.in +4n -.EX -$ \fBPS1=\(aqns3# \(aq sudo nsenter \-t 778501 \-\-user \-\-mount\fP -ns3# \fBmount \-\-rbind \-\-make\-private /mnt/x /mnt/ppp\fP -ns3# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP -986 83 8:5 /mnt /mnt rw,relatime shared:344 -989 986 0:56 / /mnt/x rw,relatime -990 989 0:57 / /mnt/x/y rw,relatime -1242 986 0:56 / /mnt/ppp rw,relatime -1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518 -.EE -.in -.IP -Because the propagation type of the parent mount, -.IR /mnt , -was shared, the recursive bind mount propagated a small tree of -mounts under the slave mount -.I /mnt -into "ns2", -as can be verified by executing the following command in that shell session: -.IP -.in +4n -.EX -ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP -1239 1204 8:5 /mnt /mnt rw,relatime master:344 -1240 1239 0:56 / /mnt/x rw,relatime -1241 1240 0:57 / /mnt/x/y rw,relatime -1244 1239 0:56 / /mnt/ppp rw,relatime -1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518 -.EE -.in -.IP -While it is not possible to unmount a part of that propagated subtree -.RI ( /mnt/ppp/y ), -it is possible to unmount the entire tree, -as shown by the following commands: -.IP -.in +4n -.EX -ns2# \fBumount /mnt/ppp/y\fP -umount: /mnt/ppp/y: not mounted. -ns2# \fBumount \-l /mnt/ppp | sed \(aqs/ \- .*//\(aq\fP # Succeeds... -ns2# \fBgrep /mnt /proc/self/mountinfo\fP -1239 1204 8:5 /mnt /mnt rw,relatime master:344 -1240 1239 0:56 / /mnt/x rw,relatime -1241 1240 0:57 / /mnt/x/y rw,relatime -.EE -.in -.IP * -The -.BR mount (2) -flags -.BR MS_RDONLY , -.BR MS_NOSUID , -.BR MS_NOEXEC , -and the "atime" flags -.RB ( MS_NOATIME , -.BR MS_NODIRATIME , -.BR MS_RELATIME ) -settings become locked -.\" commit 9566d6742852c527bf5af38af5cbb878dad75705 -.\" Author: Eric W. 
Biederman -.\" Date: Mon Jul 28 17:26:07 2014 -0700 -.\" -.\" mnt: Correct permission checks in do_remount -.\" -when propagated from a more privileged to -a less privileged mount namespace, -and may not be changed in the less privileged mount namespace. -.IP -This point can be illustrated by a variation on an earlier example. -In that example, the bind mount was marked as read-only. -For security reasons, -it should not be possible to make the mount writable in -a less privileged namespace, and indeed the kernel prevents this, -as illustrated by the following: -.IP -.RS -.in +4n -.EX -$ \fBsudo mkdir /mnt/dir\fP -$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP -$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP -$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP - \fBmount \-o remount,rw /mnt/dir\fP -mount: /mnt/dir: permission denied. -.EE -.in -.RE -.IP * -.\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree)) -A file or directory that is a mount point in one namespace that is not -a mount point in another namespace, may be renamed, unlinked, or removed -.RB ( rmdir (2)) -in the mount namespace in which it is not a mount point -(subject to the usual permission checks). -Consequently, the mount point is removed in the mount namespace -where it was a mount point. -.IP -Previously (before Linux 3.18), -.\" mtk: The change was in Linux 3.18, I think, with this commit: -.\" commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe -.\" Author: Eric W. Biederman -.\" Date: Tue Oct 1 18:33:48 2013 -0700 -.\" -.\" vfs: Lazily remove mounts on unlinked files and directories. -attempting to unlink, rename, or remove a file or directory -that was a mount point in another mount namespace would result in the error -.BR EBUSY . 
-That behavior had technical problems of enforcement (e.g., for NFS) -and permitted denial-of-service attacks against more privileged users -(i.e., preventing individual files from being updated -by bind mounting on top of them). -.\" .SH SHARED SUBTREES After the implementation of mount namespaces was completed, experience showed that the isolation that they provided was, @@ -1306,6 +1040,271 @@ and creating bind mounts .RB ( MS_BIND ), see .IR Documentation/filesystems/sharedsubtree.txt . +.\" +.\" ============================================================ +.\" +.SS Restrictions on mount namespaces +Note the following points with respect to mount namespaces: +.IP * 3 +Each mount namespace has an owner user namespace. +As explained above, when a new mount namespace is created, +its mount point list is initialized as a copy of the mount point list +of another mount namespace. +If the new namespace and the namespace from which the mount point list +was copied are owned by different user namespaces, +then the new mount namespace is considered +.IR "less privileged" . +.IP * +When creating a less privileged mount namespace, +shared mounts are reduced to slave mounts. +This ensures that mappings performed in less +privileged mount namespaces will not propagate to more privileged +mount namespaces. +.IP * +Mounts that come as a single unit from a more privileged mount namespace are +locked together and may not be separated in a less privileged mount +namespace. +(The +.BR unshare (2) +.B CLONE_NEWNS +operation brings across all of the mounts from the original +mount namespace as a single unit, +and recursive mounts that propagate between +mount namespaces propagate as a single unit.) +.IP +In this context, "may not be separated" means that the mounts +are locked so that they may not be individually unmounted. 
+Consider the following example: +.IP +.RS +.in +4n +.EX +$ \fBsudo mkdir /mnt/dir\fP +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP +$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP +$ \fBls /mnt/dir\fP # Former contents of directory are invisible +.EE +.in +.RE +.IP +The above steps, performed in a more privileged user namespace, +have created a (read-only) bind mount that +obscures the contents of the directory +.IR /mnt/dir . +For security reasons, it should not be possible to unmount +that mount in a less privileged user namespace, +since that would reveal the contents of the directory +.IR /mnt/dir . +.IP +Suppose we now create a new mount namespace +owned by a (new) subordinate user namespace. +The new mount namespace will inherit copies of all of the mounts +from the previous mount namespace. +However, those mounts will be locked because the new mount namespace +is owned by a less privileged user namespace. +Consequently, an attempt to unmount the mount fails: +.IP +.RS +.in +4n +.EX +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP + \fBstrace \-o /tmp/log \e\fP + \fBumount /mnt/dir\fP +umount: /mnt/dir: not mounted. +$ \fBgrep \(aq^umount\(aq /tmp/log\fP +umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument) +.EE +.in +.RE +.IP +The error message from +.BR umount (8) +is a little confusing, but the +.BR strace (1) +output reveals that the underlying +.BR umount2 (2) +system call failed with the error +.BR EINVAL , +which is the error that the kernel returns to indicate that +the mount is locked. +.IP * +Following on from the previous point, +note that it is possible to unmount an entire +tree of mounts that propagated as a unit into a mount namespace +that is owned by a less privileged user namespace, +as illustrated in the following example. +.IP +First, we create new user and mount namespaces using +.BR unshare (1). +In the new mount namespace, +the propagation type of all mounts is set to private. 
+We then create a shared bind mount at +.IR /mnt , +and a small hierarchy of mount points underneath that mount point. +.IP +.in +4n +.EX +$ \fBPS1=\(aqns1# \(aq sudo unshare \-\-user \-\-map\-root\-user \e\fP + \fB\-\-mount \-\-propagation private bash\fP +ns1# \fBecho $$\fP # We need the PID of this shell later +778501 +ns1# \fBmount \-\-make\-shared \-\-bind /mnt /mnt\fP +ns1# \fBmkdir /mnt/x\fP +ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x\fP +ns1# \fBmkdir /mnt/x/y\fP +ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x/y\fP +ns1# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP +986 83 8:5 /mnt /mnt rw,relatime shared:344 +989 986 0:56 / /mnt/x rw,relatime +990 989 0:57 / /mnt/x/y rw,relatime +.EE +.in +.IP +Continuing in the same shell session, +we then create a second shell in a new mount namespace and a new subordinate +(and thus less privileged) user namespace and +check the state of the propagated mount points rooted at +.IR /mnt . +.IP +.in +4n +.EX +ns1# \fBPS1=\(aqns2# \(aq unshare \-\-user \-\-map\-root\-user \e\fP + \fB\-\-mount \-\-propagation unchanged bash\fP +ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP +1239 1204 8:5 /mnt /mnt rw,relatime master:344 +1240 1239 0:56 / /mnt/x rw,relatime +1241 1240 0:57 / /mnt/x/y rw,relatime +.EE +.in +.IP +Of note in the above output is that the propagation type of the mount point +.I /mnt +has been reduced to slave, as explained near the start of this subsection. +This means that submount events will propagate from the master +.I /mnt +in "ns1", but propagation will not occur in the opposite direction. +.IP +From a separate terminal window, we then use +.BR nsenter (1) +to enter the mount and user namespaces corresponding to "ns1". +In that terminal window, we then recursively bind mount +.IR /mnt/x +at the location +.IR /mnt/ppp . 
+.IP +.in +4n +.EX +$ \fBPS1=\(aqns3# \(aq sudo nsenter \-t 778501 \-\-user \-\-mount\fP +ns3# \fBmount \-\-rbind \-\-make\-private /mnt/x /mnt/ppp\fP +ns3# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP +986 83 8:5 /mnt /mnt rw,relatime shared:344 +989 986 0:56 / /mnt/x rw,relatime +990 989 0:57 / /mnt/x/y rw,relatime +1242 986 0:56 / /mnt/ppp rw,relatime +1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518 +.EE +.in +.IP +Because the propagation type of the parent mount, +.IR /mnt , +was shared, the recursive bind mount propagated a small tree of +mounts under the slave mount +.I /mnt +into "ns2", +as can be verified by executing the following command in that shell session: +.IP +.in +4n +.EX +ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP +1239 1204 8:5 /mnt /mnt rw,relatime master:344 +1240 1239 0:56 / /mnt/x rw,relatime +1241 1240 0:57 / /mnt/x/y rw,relatime +1244 1239 0:56 / /mnt/ppp rw,relatime +1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518 +.EE +.in +.IP +While it is not possible to unmount a part of that propagated subtree +.RI ( /mnt/ppp/y ), +it is possible to unmount the entire tree, +as shown by the following commands: +.IP +.in +4n +.EX +ns2# \fBumount /mnt/ppp/y\fP +umount: /mnt/ppp/y: not mounted. +ns2# \fBumount \-l /mnt/ppp\fP # Succeeds... +ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP +1239 1204 8:5 /mnt /mnt rw,relatime master:344 +1240 1239 0:56 / /mnt/x rw,relatime +1241 1240 0:57 / /mnt/x/y rw,relatime +.EE +.in +.IP * +The +.BR mount (2) +flags +.BR MS_RDONLY , +.BR MS_NOSUID , +.BR MS_NOEXEC , +and the "atime" flags +.RB ( MS_NOATIME , +.BR MS_NODIRATIME , +.BR MS_RELATIME ) +settings become locked +.\" commit 9566d6742852c527bf5af38af5cbb878dad75705 +.\" Author: Eric W. 
Biederman +.\" Date: Mon Jul 28 17:26:07 2014 -0700 +.\" +.\" mnt: Correct permission checks in do_remount +.\" +when propagated from a more privileged to +a less privileged mount namespace, +and may not be changed in the less privileged mount namespace. +.IP +This point can be illustrated by a variation on an earlier example. +In that example, the bind mount was marked as read-only. +For security reasons, +it should not be possible to make the mount writable in +a less privileged namespace, and indeed the kernel prevents this, +as illustrated by the following: +.IP +.RS +.in +4n +.EX +$ \fBsudo mkdir /mnt/dir\fP +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP +$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP + \fBmount \-o remount,rw /mnt/dir\fP +mount: /mnt/dir: permission denied. +.EE +.in +.RE +.IP * +.\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree)) +A file or directory that is a mount point in one namespace that is not +a mount point in another namespace, may be renamed, unlinked, or removed +.RB ( rmdir (2)) +in the mount namespace in which it is not a mount point +(subject to the usual permission checks). +Consequently, the mount point is removed in the mount namespace +where it was a mount point. +.IP +Previously (before Linux 3.18), +.\" mtk: The change was in Linux 3.18, I think, with this commit: +.\" commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe +.\" Author: Eric W. Biederman +.\" Date: Tue Oct 1 18:33:48 2013 -0700 +.\" +.\" vfs: Lazily remove mounts on unlinked files and directories. +attempting to unlink, rename, or remove a file or directory +that was a mount point in another mount namespace would result in the error +.BR EBUSY . 
+That behavior had technical problems of enforcement (e.g., for NFS) +and permitted denial-of-service attacks against more privileged users +(i.e., preventing individual files from being updated +by bind mounting on top of them). .SH EXAMPLES See .BR pivot_root (2).