From a66648bbd10f518b98ff6483b78a3dc6de5d0210 Mon Sep 17 00:00:00 2001
From: Michael Kerrisk <mtk.manpages@gmail.com>
Date: Tue, 17 Aug 2021 05:04:11 +0200
Subject: [PATCH] mount_namespaces.7: Relocate the "Restrictions on mount
 namespaces" subsection

The "Restrictions on mount namespaces" subsection belongs lower in
the page, after the discussion of the concepts that it relies on
(e.g., shared subtrees and propagation).

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
---
 man7/mount_namespaces.7 | 531 ++++++++++++++++++++--------------------
 1 file changed, 265 insertions(+), 266 deletions(-)

diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
index 13dfb078c..7e57f1e6a 100644
--- a/man7/mount_namespaces.7
+++ b/man7/mount_namespaces.7
@@ -76,272 +76,6 @@ in either mount namespace will not (by default) affect the
 mount point list seen in the other namespace
 (but see the following discussion of shared subtrees).
 .\"
-.\" ============================================================
-.\"
-.SS Restrictions on mount namespaces
-Note the following points with respect to mount namespaces:
-.IP * 3
-Each mount namespace has an owner user namespace.
-As explained above, when a new mount namespace is created,
-its mount point list is initialized as a copy of the mount point list
-of another mount namespace.
-If the new namespace and the namespace from which the mount point list
-was copied are owned by different user namespaces,
-then the new mount namespace is considered
-.IR "less privileged" .
-.IP *
-When creating a less privileged mount namespace,
-shared mounts are reduced to slave mounts.
-(Shared and slave mounts are discussed below.)
-This ensures that mappings performed in less
-privileged mount namespaces will not propagate to more privileged
-mount namespaces.
-.IP *
-Mounts that come as a single unit from a more privileged mount namespace are
-locked together and may not be separated in a less privileged mount
-namespace.
-(The
-.BR unshare (2)
-.B CLONE_NEWNS
-operation brings across all of the mounts from the original
-mount namespace as a single unit,
-and recursive mounts that propagate between
-mount namespaces propagate as a single unit.)
-.IP
-In this context, "may not be separated" means that the mounts
-are locked so that they may not be individually unmounted.
-Consider the following example:
-.IP
-.RS
-.in +4n
-.EX
-$ \fBsudo mkdir /mnt/dir\fP
-$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
-$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP
-$ \fBls /mnt/dir\fP   # Former contents of directory are invisible
-.EE
-.in
-.RE
-.IP
-The above steps, performed in a more privileged user namespace,
-have created a (read-only) bind mount that
-obscures the contents of the directory
-.IR /mnt/dir .
-For security reasons, it should not be possible to unmount
-that mount in a less privileged user namespace,
-since that would reveal the contents of the directory
-.IR /mnt/dir .
-.IP
-Suppose we now create a new mount namespace
-owned by a (new) subordinate user namespace.
-The new mount namespace will inherit copies of all of the mounts
-from the previous mount namespace.
-However, those mounts will be locked because the new mount namespace
-is owned by a less privileged user namespace.
-Consequently, an attempt to unmount the mount fails:
-.IP
-.RS
-.in +4n
-.EX
-$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
-               \fBstrace \-o /tmp/log \e\fP
-               \fBumount /mnt/dir\fP
-umount: /mnt/dir: not mounted.
-$ \fBgrep \(aq^umount\(aq /tmp/log\fP
-umount2("/mnt/dir", 0)     = \-1 EINVAL (Invalid argument)
-.EE
-.in
-.RE
-.IP
-The error message from
-.BR mount (8)
-is a little confusing, but the
-.BR strace (1)
-output reveals that the underlying
-.BR umount2 (2)
-system call failed with the error
-.BR EINVAL ,
-which is the error that the kernel returns to indicate that
-the mount is locked.
-.IP *
-Following on from the previous point,
-note that it is possible to unmount an entire
-tree of mounts that propagated as a unit into a mount namespace
-that is owned by a less privileged user namespace,
-as illustrated in the following example.
-.IP
-First, we create new user and mount namespaces using
-.BR unshare (1).
-In the new mount namespace,
-the propagation type of all mounts is set to private.
-We then create a shared bind mount at
-.IR /mnt ,
-and a small hierarchy of mount points underneath that mount point.
-.IP
-.in +4n
-.EX
-$ \fBPS1=\(aqns1# \(aq sudo unshare \-\-user \-\-map\-root\-user \e\fP
-                       \fB\-\-mount \-\-propagation private bash\fP
-ns1# \fBecho $$\fP        # We need the PID of this shell later
-778501
-ns1# \fBmount \-\-make\-shared \-\-bind /mnt /mnt\fP
-ns1# \fBmkdir /mnt/x\fP
-ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x\fP
-ns1# \fBmkdir /mnt/x/y\fP
-ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x/y\fP
-ns1# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
-986 83 8:5 /mnt /mnt rw,relatime shared:344
-989 986 0:56 / /mnt/x rw,relatime
-990 989 0:57 / /mnt/x/y rw,relatime
-.EE
-.in
-.IP
-Continuing in the same shell session,
-we then create a second shell in a new mount namespace and a new subordinate
-(and thus less privileged) user namespace and
-check the state of the propagated mount points rooted at
-.IR /mnt .
-.IP
-.in +4n
-.EX
-ns1# \fBPS1=\(aqns2# unshare \-\-user \-\-map\-root\-user \e\fP
-                       \fB\-\-mount \-\-propagation unchanged bash\fP
-ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
-1239 1204 8:5 /mnt /mnt rw,relatime master:344
-1240 1239 0:56 / /mnt/x rw,relatime
-1241 1240 0:57 / /mnt/x/y rw,relatime
-.EE
-.in
-.IP
-Of note in the above output is that the propagation type of the mount point
-.I /mnt
-has been reduced to slave, as explained near the start of this subsection.
-This means that submount events will propagate from the master
-.I /mnt
-in "ns1", but propagation will not occur in the opposite direction.
-.IP
-From a separate terminal window, we then use
-.BR nsenter (1)
-to enter the mount and user namespaces corresponding to "ns1".
-In that terminal window, we then recursively bind mount
-.IR /mnt/x
-at the location
-.IR /mnt/ppp .
-.IP
-.in +4n
-.EX
-$ \fBPS1=\(aqns3# \(aq sudo nsenter \-t 778501 \-\-user \-\-mount\fP
-ns3# \fBmount \-\-rbind \-\-make\-private /mnt/x /mnt/ppp\fP
-ns3# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
-986 83 8:5 /mnt /mnt rw,relatime shared:344
-989 986 0:56 / /mnt/x rw,relatime
-990 989 0:57 / /mnt/x/y rw,relatime
-1242 986 0:56 / /mnt/ppp rw,relatime
-1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
-.EE
-.in
-.IP
-Because the propagation type of the parent mount,
-.IR /mnt ,
-was shared, the recursive bind mount propagated a small tree of
-mounts under the slave mount
-.I /mnt
-into "ns2",
-as can be verified by executing the following command in that shell session:
-.IP
-.in +4n
-.EX
-ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
-1239 1204 8:5 /mnt /mnt rw,relatime master:344
-1240 1239 0:56 / /mnt/x rw,relatime
-1241 1240 0:57 / /mnt/x/y rw,relatime
-1244 1239 0:56 / /mnt/ppp rw,relatime
-1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
-.EE
-.in
-.IP
-While it is not possible to unmount a part of that propagated subtree
-.RI ( /mnt/ppp/y ),
-it is possible to unmount the entire tree,
-as shown by the following commands:
-.IP
-.in +4n
-.EX
-ns2# \fBumount /mnt/ppp/y\fP
-umount: /mnt/ppp/y: not mounted.
-ns2# \fBumount \-l /mnt/ppp | sed \(aqs/ \- .*//\(aq\fP      # Succeeds...
-ns2# \fBgrep /mnt /proc/self/mountinfo\fP
-1239 1204 8:5 /mnt /mnt rw,relatime master:344
-1240 1239 0:56 / /mnt/x rw,relatime
-1241 1240 0:57 / /mnt/x/y rw,relatime
-.EE
-.in
-.IP *
-The
-.BR mount (2)
-flags
-.BR MS_RDONLY ,
-.BR MS_NOSUID ,
-.BR MS_NOEXEC ,
-and the "atime" flags
-.RB ( MS_NOATIME ,
-.BR MS_NODIRATIME ,
-.BR MS_RELATIME )
-settings become locked
-.\" commit 9566d6742852c527bf5af38af5cbb878dad75705
-.\" Author: Eric W. Biederman <ebiederm@xmission.com>
-.\" Date:   Mon Jul 28 17:26:07 2014 -0700
-.\"
-.\"      mnt: Correct permission checks in do_remount
-.\"
-when propagated from a more privileged to
-a less privileged mount namespace,
-and may not be changed in the less privileged mount namespace.
-.IP
-This point can be illustrated by a variation on an earlier example.
-In that example, the bind mount was marked as read-only.
-For security reasons,
-it should not be possible to make the mount writable in
-a less privileged namespace, and indeed the kernel prevents this,
-as illustrated by the following:
-.IP
-.RS
-.in +4n
-.EX
-$ \fBsudo mkdir /mnt/dir\fP
-$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
-$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP
-$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
-               \fBmount \-o remount,rw /mnt/dir\fP
-mount: /mnt/dir: permission denied.
-.EE
-.in
-.RE
-.IP *
-.\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
-A file or directory that is a mount point in one namespace that is not
-a mount point in another namespace, may be renamed, unlinked, or removed
-.RB ( rmdir (2))
-in the mount namespace in which it is not a mount point
-(subject to the usual permission checks).
-Consequently, the mount point is removed in the mount namespace
-where it was a mount point.
-.IP
-Previously (before Linux 3.18),
-.\" mtk: The change was in Linux 3.18, I think, with this commit:
-.\"     commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
-.\"     Author: Eric W. Biederman <ebiederman@twitter.com>
-.\"     Date:   Tue Oct 1 18:33:48 2013 -0700
-.\"
-.\"         vfs: Lazily remove mounts on unlinked files and directories.
-attempting to unlink, rename, or remove a file or directory
-that was a mount point in another mount namespace would result in the error
-.BR EBUSY .
-That behavior had technical problems of enforcement (e.g., for NFS)
-and permitted denial-of-service attacks against more privileged users
-(i.e., preventing individual files from being updated
-by bind mounting on top of them).
-.\"
 .SH SHARED SUBTREES
 After the implementation of mount namespaces was completed,
 experience showed that the isolation that they provided was,
@@ -1306,6 +1040,271 @@ and creating bind mounts
 .RB ( MS_BIND ),
 see
 .IR Documentation/filesystems/sharedsubtree.txt .
+.\"
+.\" ============================================================
+.\"
+.SS Restrictions on mount namespaces
+Note the following points with respect to mount namespaces:
+.IP * 3
+Each mount namespace has an owner user namespace.
+As explained above, when a new mount namespace is created,
+its mount point list is initialized as a copy of the mount point list
+of another mount namespace.
+If the new namespace and the namespace from which the mount point list
+was copied are owned by different user namespaces,
+then the new mount namespace is considered
+.IR "less privileged" .
+.IP *
+When creating a less privileged mount namespace,
+shared mounts are reduced to slave mounts.
+This ensures that mounts performed in less
+privileged mount namespaces will not propagate to more privileged
+mount namespaces.
+.IP *
+Mounts that come as a single unit from a more privileged mount namespace are
+locked together and may not be separated in a less privileged mount
+namespace.
+(The
+.BR unshare (2)
+.B CLONE_NEWNS
+operation brings across all of the mounts from the original
+mount namespace as a single unit,
+and recursive mounts that propagate between
+mount namespaces propagate as a single unit.)
+.IP
+In this context, "may not be separated" means that the mounts
+are locked so that they may not be individually unmounted.
+Consider the following example:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo mkdir /mnt/dir\fP
+$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
+$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP
+$ \fBls /mnt/dir\fP   # Former contents of directory are invisible
+.EE
+.in
+.RE
+.IP
+The above steps, performed in a more privileged user namespace,
+have created a (read-only) bind mount that
+obscures the contents of the directory
+.IR /mnt/dir .
+For security reasons, it should not be possible to unmount
+that mount in a less privileged user namespace,
+since that would reveal the contents of the directory
+.IR /mnt/dir .
+.IP
+Suppose we now create a new mount namespace
+owned by a (new) subordinate user namespace.
+The new mount namespace will inherit copies of all of the mounts
+from the previous mount namespace.
+However, those mounts will be locked because the new mount namespace
+is owned by a less privileged user namespace.
+Consequently, an attempt to unmount the mount fails:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
+               \fBstrace \-o /tmp/log \e\fP
+               \fBumount /mnt/dir\fP
+umount: /mnt/dir: not mounted.
+$ \fBgrep \(aq^umount\(aq /tmp/log\fP
+umount2("/mnt/dir", 0)     = \-1 EINVAL (Invalid argument)
+.EE
+.in
+.RE
+.IP
+The error message from
+.BR umount (8)
+is a little confusing, but the
+.BR strace (1)
+output reveals that the underlying
+.BR umount2 (2)
+system call failed with the error
+.BR EINVAL ,
+which is the error that the kernel returns to indicate that
+the mount is locked.
+.IP *
+Following on from the previous point,
+note that it is possible to unmount an entire
+tree of mounts that propagated as a unit into a mount namespace
+that is owned by a less privileged user namespace,
+as illustrated in the following example.
+.IP
+First, we create new user and mount namespaces using
+.BR unshare (1).
+In the new mount namespace,
+the propagation type of all mounts is set to private.
+We then create a shared bind mount at
+.IR /mnt ,
+and a small hierarchy of mount points underneath that mount point.
+.IP
+.in +4n
+.EX
+$ \fBPS1=\(aqns1# \(aq sudo unshare \-\-user \-\-map\-root\-user \e\fP
+                       \fB\-\-mount \-\-propagation private bash\fP
+ns1# \fBecho $$\fP        # We need the PID of this shell later
+778501
+ns1# \fBmount \-\-make\-shared \-\-bind /mnt /mnt\fP
+ns1# \fBmkdir /mnt/x\fP
+ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x\fP
+ns1# \fBmkdir /mnt/x/y\fP
+ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x/y\fP
+ns1# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
+986 83 8:5 /mnt /mnt rw,relatime shared:344
+989 986 0:56 / /mnt/x rw,relatime
+990 989 0:57 / /mnt/x/y rw,relatime
+.EE
+.in
+.IP
+Continuing in the same shell session,
+we then create a second shell in a new mount namespace and a new subordinate
+(and thus less privileged) user namespace and
+check the state of the propagated mount points rooted at
+.IR /mnt .
+.IP
+.in +4n
+.EX
+ns1# \fBPS1=\(aqns2# \(aq unshare \-\-user \-\-map\-root\-user \e\fP
+                       \fB\-\-mount \-\-propagation unchanged bash\fP
+ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
+1239 1204 8:5 /mnt /mnt rw,relatime master:344
+1240 1239 0:56 / /mnt/x rw,relatime
+1241 1240 0:57 / /mnt/x/y rw,relatime
+.EE
+.in
+.IP
+Of note in the above output is that the propagation type of the mount point
+.I /mnt
+has been reduced to slave, as explained near the start of this subsection.
+This means that submount events will propagate from the master
+.I /mnt
+in "ns1", but propagation will not occur in the opposite direction.
+.IP
+From a separate terminal window, we then use
+.BR nsenter (1)
+to enter the mount and user namespaces corresponding to "ns1".
+In that terminal window, we then recursively bind mount
+.IR /mnt/x
+at the location
+.IR /mnt/ppp .
+.IP
+.in +4n
+.EX
+$ \fBPS1=\(aqns3# \(aq sudo nsenter \-t 778501 \-\-user \-\-mount\fP
+ns3# \fBmount \-\-rbind \-\-make\-private /mnt/x /mnt/ppp\fP
+ns3# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
+986 83 8:5 /mnt /mnt rw,relatime shared:344
+989 986 0:56 / /mnt/x rw,relatime
+990 989 0:57 / /mnt/x/y rw,relatime
+1242 986 0:56 / /mnt/ppp rw,relatime
+1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
+.EE
+.in
+.IP
+Because the propagation type of the parent mount,
+.IR /mnt ,
+was shared, the recursive bind mount propagated a small tree of
+mounts under the slave mount
+.I /mnt
+into "ns2",
+as can be verified by executing the following command in that shell session:
+.IP
+.in +4n
+.EX
+ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
+1239 1204 8:5 /mnt /mnt rw,relatime master:344
+1240 1239 0:56 / /mnt/x rw,relatime
+1241 1240 0:57 / /mnt/x/y rw,relatime
+1244 1239 0:56 / /mnt/ppp rw,relatime
+1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
+.EE
+.in
+.IP
+While it is not possible to unmount a part of that propagated subtree
+.RI ( /mnt/ppp/y ),
+it is possible to unmount the entire tree,
+as shown by the following commands:
+.IP
+.in +4n
+.EX
+ns2# \fBumount /mnt/ppp/y\fP
+umount: /mnt/ppp/y: not mounted.
+ns2# \fBumount \-l /mnt/ppp\fP      # Succeeds...
+ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
+1239 1204 8:5 /mnt /mnt rw,relatime master:344
+1240 1239 0:56 / /mnt/x rw,relatime
+1241 1240 0:57 / /mnt/x/y rw,relatime
+.EE
+.in
+.IP *
+The
+.BR mount (2)
+flags
+.BR MS_RDONLY ,
+.BR MS_NOSUID ,
+.BR MS_NOEXEC ,
+and the "atime" flags
+.RB ( MS_NOATIME ,
+.BR MS_NODIRATIME ,
+.BR MS_RELATIME )
+become locked
+.\" commit 9566d6742852c527bf5af38af5cbb878dad75705
+.\" Author: Eric W. Biederman <ebiederm@xmission.com>
+.\" Date:   Mon Jul 28 17:26:07 2014 -0700
+.\"
+.\"      mnt: Correct permission checks in do_remount
+.\"
+when propagated from a more privileged to
+a less privileged mount namespace,
+and may not be changed in the less privileged mount namespace.
+.IP
+This point can be illustrated by a variation on an earlier example.
+In that example, the bind mount was marked as read-only.
+For security reasons,
+it should not be possible to make the mount writable in
+a less privileged namespace, and indeed the kernel prevents this,
+as illustrated by the following:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo mkdir /mnt/dir\fP
+$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
+$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP
+$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
+               \fBmount \-o remount,rw /mnt/dir\fP
+mount: /mnt/dir: permission denied.
+.EE
+.in
+.RE
+.IP *
+.\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
+A file or directory that is a mount point in one namespace but is not
+a mount point in another namespace may be renamed, unlinked, or removed
+.RB ( rmdir (2))
+in the mount namespace in which it is not a mount point
+(subject to the usual permission checks).
+Consequently, the mount point is removed in the mount namespace
+where it was a mount point.
+.IP
+Previously (before Linux 3.18),
+.\" mtk: The change was in Linux 3.18, I think, with this commit:
+.\"     commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
+.\"     Author: Eric W. Biederman <ebiederman@twitter.com>
+.\"     Date:   Tue Oct 1 18:33:48 2013 -0700
+.\"
+.\"         vfs: Lazily remove mounts on unlinked files and directories.
+attempting to unlink, rename, or remove a file or directory
+that was a mount point in another mount namespace would result in the error
+.BR EBUSY .
+That behavior had technical problems of enforcement (e.g., for NFS)
+and permitted denial-of-service attacks against more privileged users
+(i.e., preventing individual files from being updated
+by bind mounting on top of them).
 .SH EXAMPLES
 See
 .BR pivot_root (2).