mount_namespaces.7: Relocate the "Restrictions on mount namespaces" subsection

The "Restrictions on mount namespaces" subsection belongs lower in
the page, after the concepts it relies on (e.g., shared
subtrees and propagation) have been discussed.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk 2021-08-17 05:04:11 +02:00
parent 44f2a6b8cd
commit a66648bbd1
1 changed files with 265 additions and 266 deletions


@@ -76,272 +76,6 @@ in either mount namespace will not (by default) affect the
mount point list seen in the other namespace
(but see the following discussion of shared subtrees).
.\"
.\" ============================================================
.\"
.SS Restrictions on mount namespaces
Note the following points with respect to mount namespaces:
.IP * 3
Each mount namespace has an owner user namespace.
As explained above, when a new mount namespace is created,
its mount point list is initialized as a copy of the mount point list
of another mount namespace.
If the new namespace and the namespace from which the mount point list
was copied are owned by different user namespaces,
then the new mount namespace is considered
.IR "less privileged" .
.IP *
When creating a less privileged mount namespace,
shared mounts are reduced to slave mounts.
(Shared and slave mounts are discussed below.)
This ensures that mappings performed in less
privileged mount namespaces will not propagate to more privileged
mount namespaces.
.IP *
Mounts that come as a single unit from a more privileged mount namespace are
locked together and may not be separated in a less privileged mount
namespace.
(The
.BR unshare (2)
.B CLONE_NEWNS
operation brings across all of the mounts from the original
mount namespace as a single unit,
and recursive mounts that propagate between
mount namespaces propagate as a single unit.)
.IP
In this context, "may not be separated" means that the mounts
are locked so that they may not be individually unmounted.
Consider the following example:
.IP
.RS
.in +4n
.EX
$ \fBsudo mkdir /mnt/dir\fP
$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP
$ \fBls /mnt/dir\fP # Former contents of directory are invisible
.EE
.in
.RE
.IP
The above steps, performed in a more privileged user namespace,
have created a (read-only) bind mount that
obscures the contents of the directory
.IR /mnt/dir .
For security reasons, it should not be possible to unmount
that mount in a less privileged user namespace,
since that would reveal the contents of the directory
.IR /mnt/dir .
.IP
Suppose we now create a new mount namespace
owned by a (new) subordinate user namespace.
The new mount namespace will inherit copies of all of the mounts
from the previous mount namespace.
However, those mounts will be locked because the new mount namespace
is owned by a less privileged user namespace.
Consequently, an attempt to unmount the mount fails:
.IP
.RS
.in +4n
.EX
$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
\fBstrace \-o /tmp/log \e\fP
\fBumount /mnt/dir\fP
umount: /mnt/dir: not mounted.
$ \fBgrep \(aq^umount\(aq /tmp/log\fP
umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument)
.EE
.in
.RE
.IP
The error message from
.BR umount (8)
is a little confusing, but the
.BR strace (1)
output reveals that the underlying
.BR umount2 (2)
system call failed with the error
.BR EINVAL ,
which is the error that the kernel returns to indicate that
the mount is locked.
.IP *
Following on from the previous point,
note that it is possible to unmount an entire
tree of mounts that propagated as a unit into a mount namespace
that is owned by a less privileged user namespace,
as illustrated in the following example.
.IP
First, we create new user and mount namespaces using
.BR unshare (1).
In the new mount namespace,
the propagation type of all mounts is set to private.
We then create a shared bind mount at
.IR /mnt ,
and a small hierarchy of mount points underneath that mount point.
.IP
.in +4n
.EX
$ \fBPS1=\(aqns1# \(aq sudo unshare \-\-user \-\-map\-root\-user \e\fP
\fB\-\-mount \-\-propagation private bash\fP
ns1# \fBecho $$\fP # We need the PID of this shell later
778501
ns1# \fBmount \-\-make\-shared \-\-bind /mnt /mnt\fP
ns1# \fBmkdir /mnt/x\fP
ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x\fP
ns1# \fBmkdir /mnt/x/y\fP
ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x/y\fP
ns1# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
986 83 8:5 /mnt /mnt rw,relatime shared:344
989 986 0:56 / /mnt/x rw,relatime
990 989 0:57 / /mnt/x/y rw,relatime
.EE
.in
.IP
Continuing in the same shell session,
we then create a second shell in a new mount namespace and a new subordinate
(and thus less privileged) user namespace and
check the state of the propagated mount points rooted at
.IR /mnt .
.IP
.in +4n
.EX
ns1# \fBPS1=\(aqns2# \(aq unshare \-\-user \-\-map\-root\-user \e\fP
\fB\-\-mount \-\-propagation unchanged bash\fP
ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1240 1239 0:56 / /mnt/x rw,relatime
1241 1240 0:57 / /mnt/x/y rw,relatime
.EE
.in
.IP
Of note in the above output is that the propagation type of the mount point
.I /mnt
has been reduced to slave, as explained near the start of this subsection.
This means that submount events will propagate from the master
.I /mnt
in "ns1", but propagation will not occur in the opposite direction.
.IP
From a separate terminal window, we then use
.BR nsenter (1)
to enter the mount and user namespaces corresponding to "ns1".
In that terminal window, we then recursively bind mount
.IR /mnt/x
at the location
.IR /mnt/ppp .
.IP
.in +4n
.EX
$ \fBPS1=\(aqns3# \(aq sudo nsenter \-t 778501 \-\-user \-\-mount\fP
ns3# \fBmount \-\-rbind \-\-make\-private /mnt/x /mnt/ppp\fP
ns3# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
986 83 8:5 /mnt /mnt rw,relatime shared:344
989 986 0:56 / /mnt/x rw,relatime
990 989 0:57 / /mnt/x/y rw,relatime
1242 986 0:56 / /mnt/ppp rw,relatime
1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
.EE
.in
.IP
Because the propagation type of the parent mount,
.IR /mnt ,
was shared, the recursive bind mount propagated a small tree of
mounts under the slave mount
.I /mnt
into "ns2",
as can be verified by executing the following command in that shell session:
.IP
.in +4n
.EX
ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1240 1239 0:56 / /mnt/x rw,relatime
1241 1240 0:57 / /mnt/x/y rw,relatime
1244 1239 0:56 / /mnt/ppp rw,relatime
1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
.EE
.in
.IP
While it is not possible to unmount a part of that propagated subtree
.RI ( /mnt/ppp/y ),
it is possible to unmount the entire tree,
as shown by the following commands:
.IP
.in +4n
.EX
ns2# \fBumount /mnt/ppp/y\fP
umount: /mnt/ppp/y: not mounted.
ns2# \fBumount \-l /mnt/ppp\fP # Succeeds...
ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1240 1239 0:56 / /mnt/x rw,relatime
1241 1240 0:57 / /mnt/x/y rw,relatime
.EE
.in
.IP *
The
.BR mount (2)
flags
.BR MS_RDONLY ,
.BR MS_NOSUID ,
.BR MS_NOEXEC ,
and the "atime" flags
.RB ( MS_NOATIME ,
.BR MS_NODIRATIME ,
.BR MS_RELATIME )
become locked
.\" commit 9566d6742852c527bf5af38af5cbb878dad75705
.\" Author: Eric W. Biederman <ebiederm@xmission.com>
.\" Date: Mon Jul 28 17:26:07 2014 -0700
.\"
.\" mnt: Correct permission checks in do_remount
.\"
when propagated from a more privileged to
a less privileged mount namespace,
and may not be changed in the less privileged mount namespace.
.IP
This point can be illustrated by a variation on an earlier example.
In that example, the bind mount was marked as read-only.
For security reasons,
it should not be possible to make the mount writable in
a less privileged namespace, and indeed the kernel prevents this,
as illustrated by the following:
.IP
.RS
.in +4n
.EX
$ \fBsudo mkdir /mnt/dir\fP
$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP
$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
\fBmount \-o remount,rw /mnt/dir\fP
mount: /mnt/dir: permission denied.
.EE
.in
.RE
.IP *
.\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
A file or directory that is a mount point in one namespace but not
a mount point in another namespace may be renamed, unlinked, or removed
.RB ( rmdir (2))
in the mount namespace in which it is not a mount point
(subject to the usual permission checks).
Consequently, the mount point is removed in the mount namespace
where it was a mount point.
.IP
Previously (before Linux 3.18),
.\" mtk: The change was in Linux 3.18, I think, with this commit:
.\" commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
.\" Author: Eric W. Biederman <ebiederman@twitter.com>
.\" Date: Tue Oct 1 18:33:48 2013 -0700
.\"
.\" vfs: Lazily remove mounts on unlinked files and directories.
attempting to unlink, rename, or remove a file or directory
that was a mount point in another mount namespace would result in the error
.BR EBUSY .
That behavior had technical problems of enforcement (e.g., for NFS)
and permitted denial-of-service attacks against more privileged users
(i.e., preventing individual files from being updated
by bind mounting on top of them).
.\"
.SH SHARED SUBTREES
After the implementation of mount namespaces was completed,
experience showed that the isolation that they provided was,
@@ -1306,6 +1040,271 @@ and creating bind mounts
.RB ( MS_BIND ),
see
.IR Documentation/filesystems/sharedsubtree.txt .
.\"
.\" ============================================================
.\"
.SS Restrictions on mount namespaces
Note the following points with respect to mount namespaces:
.IP * 3
Each mount namespace has an owner user namespace.
As explained above, when a new mount namespace is created,
its mount point list is initialized as a copy of the mount point list
of another mount namespace.
If the new namespace and the namespace from which the mount point list
was copied are owned by different user namespaces,
then the new mount namespace is considered
.IR "less privileged" .
.IP *
When creating a less privileged mount namespace,
shared mounts are reduced to slave mounts.
This ensures that mappings performed in less
privileged mount namespaces will not propagate to more privileged
mount namespaces.
.IP *
Mounts that come as a single unit from a more privileged mount namespace are
locked together and may not be separated in a less privileged mount
namespace.
(The
.BR unshare (2)
.B CLONE_NEWNS
operation brings across all of the mounts from the original
mount namespace as a single unit,
and recursive mounts that propagate between
mount namespaces propagate as a single unit.)
.IP
In this context, "may not be separated" means that the mounts
are locked so that they may not be individually unmounted.
Consider the following example:
.IP
.RS
.in +4n
.EX
$ \fBsudo mkdir /mnt/dir\fP
$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP
$ \fBls /mnt/dir\fP # Former contents of directory are invisible
.EE
.in
.RE
.IP
The above steps, performed in a more privileged user namespace,
have created a (read-only) bind mount that
obscures the contents of the directory
.IR /mnt/dir .
For security reasons, it should not be possible to unmount
that mount in a less privileged user namespace,
since that would reveal the contents of the directory
.IR /mnt/dir .
.IP
Suppose we now create a new mount namespace
owned by a (new) subordinate user namespace.
The new mount namespace will inherit copies of all of the mounts
from the previous mount namespace.
However, those mounts will be locked because the new mount namespace
is owned by a less privileged user namespace.
Consequently, an attempt to unmount the mount fails:
.IP
.RS
.in +4n
.EX
$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
\fBstrace \-o /tmp/log \e\fP
\fBumount /mnt/dir\fP
umount: /mnt/dir: not mounted.
$ \fBgrep \(aq^umount\(aq /tmp/log\fP
umount2("/mnt/dir", 0) = \-1 EINVAL (Invalid argument)
.EE
.in
.RE
.IP
The error message from
.BR umount (8)
is a little confusing, but the
.BR strace (1)
output reveals that the underlying
.BR umount2 (2)
system call failed with the error
.BR EINVAL ,
which is the error that the kernel returns to indicate that
the mount is locked.
.IP *
Following on from the previous point,
note that it is possible to unmount an entire
tree of mounts that propagated as a unit into a mount namespace
that is owned by a less privileged user namespace,
as illustrated in the following example.
.IP
First, we create new user and mount namespaces using
.BR unshare (1).
In the new mount namespace,
the propagation type of all mounts is set to private.
We then create a shared bind mount at
.IR /mnt ,
and a small hierarchy of mount points underneath that mount point.
.IP
.in +4n
.EX
$ \fBPS1=\(aqns1# \(aq sudo unshare \-\-user \-\-map\-root\-user \e\fP
\fB\-\-mount \-\-propagation private bash\fP
ns1# \fBecho $$\fP # We need the PID of this shell later
778501
ns1# \fBmount \-\-make\-shared \-\-bind /mnt /mnt\fP
ns1# \fBmkdir /mnt/x\fP
ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x\fP
ns1# \fBmkdir /mnt/x/y\fP
ns1# \fBmount \-\-make\-private \-t tmpfs none /mnt/x/y\fP
ns1# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
986 83 8:5 /mnt /mnt rw,relatime shared:344
989 986 0:56 / /mnt/x rw,relatime
990 989 0:57 / /mnt/x/y rw,relatime
.EE
.in
.IP
Continuing in the same shell session,
we then create a second shell in a new mount namespace and a new subordinate
(and thus less privileged) user namespace and
check the state of the propagated mount points rooted at
.IR /mnt .
.IP
.in +4n
.EX
ns1# \fBPS1=\(aqns2# \(aq unshare \-\-user \-\-map\-root\-user \e\fP
\fB\-\-mount \-\-propagation unchanged bash\fP
ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1240 1239 0:56 / /mnt/x rw,relatime
1241 1240 0:57 / /mnt/x/y rw,relatime
.EE
.in
.IP
Of note in the above output is that the propagation type of the mount point
.I /mnt
has been reduced to slave, as explained near the start of this subsection.
This means that submount events will propagate from the master
.I /mnt
in "ns1", but propagation will not occur in the opposite direction.
.IP
From a separate terminal window, we then use
.BR nsenter (1)
to enter the mount and user namespaces corresponding to "ns1".
In that terminal window, we then recursively bind mount
.IR /mnt/x
at the location
.IR /mnt/ppp .
.IP
.in +4n
.EX
$ \fBPS1=\(aqns3# \(aq sudo nsenter \-t 778501 \-\-user \-\-mount\fP
ns3# \fBmount \-\-rbind \-\-make\-private /mnt/x /mnt/ppp\fP
ns3# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
986 83 8:5 /mnt /mnt rw,relatime shared:344
989 986 0:56 / /mnt/x rw,relatime
990 989 0:57 / /mnt/x/y rw,relatime
1242 986 0:56 / /mnt/ppp rw,relatime
1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
.EE
.in
.IP
Because the propagation type of the parent mount,
.IR /mnt ,
was shared, the recursive bind mount propagated a small tree of
mounts under the slave mount
.I /mnt
into "ns2",
as can be verified by executing the following command in that shell session:
.IP
.in +4n
.EX
ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1240 1239 0:56 / /mnt/x rw,relatime
1241 1240 0:57 / /mnt/x/y rw,relatime
1244 1239 0:56 / /mnt/ppp rw,relatime
1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
.EE
.in
.IP
While it is not possible to unmount a part of that propagated subtree
.RI ( /mnt/ppp/y ),
it is possible to unmount the entire tree,
as shown by the following commands:
.IP
.in +4n
.EX
ns2# \fBumount /mnt/ppp/y\fP
umount: /mnt/ppp/y: not mounted.
ns2# \fBumount \-l /mnt/ppp\fP # Succeeds...
ns2# \fBgrep /mnt /proc/self/mountinfo | sed \(aqs/ \- .*//\(aq\fP
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1240 1239 0:56 / /mnt/x rw,relatime
1241 1240 0:57 / /mnt/x/y rw,relatime
.EE
.in
.IP *
The
.BR mount (2)
flags
.BR MS_RDONLY ,
.BR MS_NOSUID ,
.BR MS_NOEXEC ,
and the "atime" flags
.RB ( MS_NOATIME ,
.BR MS_NODIRATIME ,
.BR MS_RELATIME )
become locked
.\" commit 9566d6742852c527bf5af38af5cbb878dad75705
.\" Author: Eric W. Biederman <ebiederm@xmission.com>
.\" Date: Mon Jul 28 17:26:07 2014 -0700
.\"
.\" mnt: Correct permission checks in do_remount
.\"
when propagated from a more privileged to
a less privileged mount namespace,
and may not be changed in the less privileged mount namespace.
.IP
This point can be illustrated by a variation on an earlier example.
In that example, the bind mount was marked as read-only.
For security reasons,
it should not be possible to make the mount writable in
a less privileged namespace, and indeed the kernel prevents this,
as illustrated by the following:
.IP
.RS
.in +4n
.EX
$ \fBsudo mkdir /mnt/dir\fP
$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
$ \fBsudo mount \-\-bind \-o ro /some/path /mnt/dir\fP
$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
\fBmount \-o remount,rw /mnt/dir\fP
mount: /mnt/dir: permission denied.
.EE
.in
.RE
.IP *
.\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
A file or directory that is a mount point in one namespace but not
a mount point in another namespace may be renamed, unlinked, or removed
.RB ( rmdir (2))
in the mount namespace in which it is not a mount point
(subject to the usual permission checks).
Consequently, the mount point is removed in the mount namespace
where it was a mount point.
.IP
Previously (before Linux 3.18),
.\" mtk: The change was in Linux 3.18, I think, with this commit:
.\" commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
.\" Author: Eric W. Biederman <ebiederman@twitter.com>
.\" Date: Tue Oct 1 18:33:48 2013 -0700
.\"
.\" vfs: Lazily remove mounts on unlinked files and directories.
attempting to unlink, rename, or remove a file or directory
that was a mount point in another mount namespace would result in the error
.BR EBUSY .
That behavior had technical problems of enforcement (e.g., for NFS)
and permitted denial-of-service attacks against more privileged users
(i.e., preventing individual files from being updated
by bind mounting on top of them).
.SH EXAMPLES
See
.BR pivot_root (2).
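
The mountinfo listings in the "Restrictions on mount namespaces" subsection are read by looking at the optional tagged fields (shared:N, master:N) that follow the mount options. The following is a minimal sketch of that reading, not something taken from the page itself: the function name propagation_types is hypothetical, and it assumes the field layout of /proc/self/mountinfo, with optional fields starting at field 7 and ending at a lone "-" separator or at end of line when the tail has been stripped with sed as in the listings. The sample input is copied from the "ns2" output in that subsection.

```shell
#!/bin/sh
# Hypothetical helper (not part of the man page): for each mountinfo-style
# line on stdin, print the mount point followed by its propagation type.
propagation_types() {
    awk '
    {
        type = ""
        # Optional fields start at field 7 and end at a lone "-" (or at
        # end of line when the tail has been stripped, as in the listings).
        for (i = 7; i <= NF && $i != "-"; i++) {
            tag = ""
            if ($i ~ /^shared:/)         tag = "shared"
            else if ($i ~ /^master:/)    tag = "slave"
            else if ($i == "unbindable") tag = "unbindable"
            if (tag != "")
                type = (type == "" ? tag : type "+" tag)
        }
        # A mount with no propagation tags is private.
        print $5, (type == "" ? "private" : type)
    }'
}

# Sample lines taken from the "ns2" listing in the subsection.
propagation_types <<'EOF'
1239 1204 8:5 /mnt /mnt rw,relatime master:344
1244 1239 0:56 / /mnt/ppp rw,relatime
1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
EOF
```

On a live system the same function could be fed directly from /proc/self/mountinfo (for example, "propagation_types < /proc/self/mountinfo"), since it stops at the "-" separator when one is present. A mount that is both shared and a slave is reported as "shared+slave".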