Consecutive calls to clock_gettime(CLOCK_MONOTONIC) are guaranteed
to return MONOTONIC values, which means that they either return
the *SAME* time value like the last call, or a later (higher) time
value.
Due to high resolution counters, like TSC on x86, most people see
that the values returned increase, but on other less common
platforms it's less likely that consecutive calls return newer
values, and instead users may unexpectedly get back the SAME time
value.
I think it makes sense to document that people should not expect
to see "always-growing" time values. For example in Debian I've
seen in quite some source packages where return values of
consecutive calls are compared against each other and then the
package build fails if they are equal (e.g. ruby-hitimes, ...).
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
* man2/syscalls.2 (.SH DESCRIPTION) <\fBgetdtablesize\fP(2)>: Remove "since
Linux 2.0" part for the osf_getdtablesize note, as syscall is generally
available since Linux 2.0; add line break after the word "as".
(.SH DESCRIPTION) <\fBpwrite\fP(2)>: Add line breaks.
(.SH DESCRIPTION) <\fBvm86old\fP(2)>: Add a line break after "in".
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In many cases, these don't improve readability, and (when stacked)
they sometimes have the side effect of sometimes forcing text
to be justified within a narrow column range.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
And add self to copyright, since, by now, the majority of the
text in the page has now been (re)written by me.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Improve structure and readability, at the same time incorporating
text and details that were formerly in select_tut(2). Also
move a few details in other parts of the page into DESCRIPTION.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The POSIX situation has been the norm for a long time now,
and including ancient details overcomplicates the page.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The discussion about pre-POSIX types for 'timeval' and 'timespec'
is rather old, and these days serves mainly to complicate the
page. Remove it.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The current text layout is a little hard to parse, with details of
pselect() spread in the main description. Move some of that text
to a headed subsection, and add a one-sentence introduction
describing the purpose of pselect().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Admittedly, the POSIX specification for exit() also uses octal.
However, 0xFF immediately indicates the lowest 8 bits to me
whereas I had to think a bit about the octal mask.
Cowritten-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This was already shown in an earlier version of the page,
but Adam Borowski's patch replaced it with an alternative.
Probably, it is better to show both possibilities.
Reported-by: "Joseph C. Sible" <josephcsible@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In the example snippet, we already have the fd, thus there's no
need to refer to the file by name. And, /proc/ might be not
mounted or not accessible.
Noticed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
PTRACE_EVENT_STOP does not always report SIGTRAP, can be the
signal which stopped us
While at it, fix an obvious copy/paste error in
PTRACE_GET_SYSCALL_INFO description.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Somewhat surprisingly, perf_event_open() can fail with EINTR when
trying to enable perf reporting for a uprobe that's already been
configured for use with ftrace. Mention this error in the man
page.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The man page contains a trivial bug that's discussed here:
https://stackoverflow.com/q/59628958
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Linux kernel commit e78bbfa82624 ("mm: stop returning -ENOENT from
sys_move_pages() if nothing got migrated") had the effect of
*never* returning -ENOENT, in any situation. So we need to update
the man page to reflect that ENOENT is not a possible return
value.
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Historically (before Linux 2.6.23), base_addr was unsigned long
for 32-bit code and unsigned int for 64-bit code. In other words,
it was always a 32-bit value. When the ldt.h header files were
unified, the type became unsigned int on all systems. Update
modify_ldt.2 and set_thread_area.2 accordingly.
Indeed, on x86, the GDT and LDT specify 32-bit bases for code and
data segments, and this has nothing to do with the kernel.
Reported-by: "Metzger, Markus T" <markus.t.metzger@intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The programmer should not need to care about the numeric values,
and their inclusion is verbosity.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Since kernel commit 3dd4d40b4208("xfs: Sanity check flags
of Q_XQUOTARM call"), it has added flags check. If it is
not usr,grp,prj quota type, it will report EINVAL.
Signed-off-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Added two missing parentheses
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This surely meant to say clone3() and not clone(3).
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The CLONE_PARENT flag cannot but used by init processes. Let's mention
this in the manpages to prevent surprises.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The introductory paragraphs note that "the calling process" is
normally synonymous with the "the parent process", except in the
case of CLONE_PARENT. The same is also true of CLONE_THREAD.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Advertise to userspace that they should use proper pid_t types
for arguments returning a pid.
The kernel-internal struct kernel_clone_args currently uses int
as type and since POSIX mandates that pid_t is a signed integer
type and glibc and friends use int this is not an issue. After
the merge window for v5.5 closes we can switch struct
kernel_clone_args over to using pid_t as well without any danger
in regressing current userspace.
Also note, that the new set tid feature which will be merged for
v5.5 uses pid_t types as well.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
If mmap() fails it will return MAP_FAILED which according to the manpage
is (void *)-1 not NULL.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Fix two spelling mistakes in manpage describing the clone{2,3}()
syscalls/syscall wrappers.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reword a little to allow for the fact that there are now
*two* reasons to consider using this flag.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Christian Brauner suggested mmap(MAP_STACKED), rather than
malloc(), as the canonical way of allocating a stack for the
child of clone(), and Jann Horn noted some reasons why:
Not on Linux, but on OpenBSD, they do use MAP_STACK now
AFAIK; this was announced here:
<http://openbsd-archive.7691.n7.nabble.com/stack-register-checking-td338238.html>.
Basically they periodically check whether the userspace
stack pointer points into a MAP_STACK region, and if not,
they kill the process. So even if it's a no-op on Linux, it
might make sense to advise people to use the flag to improve
portability? I'm not sure if that's something that belongs
in Linux manpages.
Another reason against malloc() is that when setting up
thread stacks in proper, reliable software, you'll probably
want to place a guard page (in other words, a 4K PROT_NONE
VMA) at the bottom of the stack to reliably catch stack
overflows; and you probably don't want to do that with
malloc, in particular with non-page-aligned allocations.
And the OpenBSD 6.5 manual pages says:
MAP_STACK
Indicate that the mapping is used as a stack. This
flag must be used in combination with MAP_ANON and
MAP_PRIVATE.
And I then noticed that MAP_STACK seems already to be on
FreeBSD for a long time:
MAP_STACK
Map the area as a stack. MAP_ANON is implied.
Offset should be 0, fd must be -1, and prot should
include at least PROT_READ and PROT_WRITE. This
option creates a memory region that grows to at
most len bytes in size, starting from the stack
top and growing down. The stack top is the start‐
ing address returned by the call, plus len bytes.
The bottom of the stack at maximum growth is the
starting address returned by the call.
The entire area is reserved from the point of view
of other mmap() calls, even if not faulted in yet.
Reported-by: Jann Horn <jannh@google.com>
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The obsolete CLONE_DETACHED flag has never been properly
documented, but now the discussion CLONE_PIDFD also requires
mention of CLONE_DETACHED. So, properly document CLONE_DETACHED,
and mention its interactions with CLONE_PIDFD.
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Sometimes the descriptions of these flags mentioned the
corresponding section 7 namespace manual page and then the
required capabilities, and sometimes the order was the was
the reverse. Make it consistent.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Remove details of UTS, IPC, and network namespaces that are
already covered in the corresponding namespaces pages in
section 7. This change is for consistency, since corresponding
details were not provided for other namespace types in clone(2)
and these details do not appear in unshare(2).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
After feedback from Christian Brauner [1], I've adjusted a few pieces
of the clone3() text, and also adjusted some of the older text in
the page.
[1] https://lore.kernel.org/linux-man/20191107151941.dw4gtul5lrtax4se@wittgenstein/
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Change the text in the introductory paragraph (which was written
20 years ago) to reflect the fact that clone*() does more things
nowadays.
Cowritten-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Adjust references to namespaces(7) to be references to pages
describing specific namespace types.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
For Q_QUOTAON, on old kernel we can use quotacheck -ug to generate
quota files. But in current kernel, we can also hide them in
system inodes and indicate them by using "quota" or project
feature.
For user or group quota, we can do as below (etc ext4):
mkfs.ext4 -F -o quota /dev/sda5
mount /dev/sda5 /mnt
quotactl(QCMD(Q_QUOTAON, USRQUOTA), /dev/sda5, QFMT_VFS_V0, NULL);
For project quota, we can do as below (etc ext4):
mkfs.ext4 -F -o quota,project /dev/sda5
mount /dev/sda5 /mnt
quotactl(QCMD(Q_QUOTAON, PRJQUOTA), /dev/sda5, QFMT_VFS_V0, NULL);
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Use "flags mask" as a generic term to refer to the clone()
'flags' argument and the clone3() 'cl_args.flags' field.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Sometime soon, we'll have to add documentation of clone3() to this
page. As a preparatorys step, make the names of the clone()
arguments the same as the fields in the clone3() 'args' struct:
ctid ==> child_pid
ptid ==> parent_tid
newtls ==> tld
child_stack ==> stack
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As noted in kernel commit 821cc7b0b205c0df64cce59aacc330af251fa8f7,
threads create an ambiguity: what if the calling process's PGID
is changed by another thread while waitpid(0, ...) is blocked?
So, clarify that waitpid(0, ...) means wait for children whose
PGID matches the caller's PGID at the time of the call to
waitpid().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Since Linux 5.4, idtype == P_PGID && id == 0 can be used to wait
on children in same process group as caller.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Since kernel commit a9a08845e9acbd224e4ee466f5c1275ed50054e8, the
equivalence between select() and poll()/epoll is defined in terms
of the EPOLL* constants, rather than the POLL* constants.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Christian noted that SA_NOCLDWAIT also matters in this scenario.
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
After review comments from Christian and Daniel.
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Thus, pidfd_open() is the preferred way of obtaining a PID
file descriptor.
Notes from a conversation with Christian Brauner:
[[
> A further question... We now have three ways of getting a
> process file descriptor [*]:
>
> open() of /proc/PID
> pidfd_open()
> clone()/clone3() with CLONE_PIDFD
>
> I thought the FD was supposed to be equivalent in all three cases.
> However, if I try (on kernel 5.3) poll() an FD returned by opening
> /proc/PID, poll() tells me POLLNVAL for the FD. Is that difference
> intentional? (I am guessing it is not.)
It's intentional.
The short answer is that /proc/<pid> is a convenience for sending
signals.
The longer answer is that this stems from a heavy debate about what a
process file descriptor was supposed to be and some people pushing for
at least being able to use /proc/<pid> dirfds while ignoring security
problems as soon as you're talking about returning those fds from
clone(); not to mention the additional problems discovered when trying
to implementing this.
A "real" pidfd is one from CLONE_PIDFD or pidfd_open() and all features
such as exit notification, read, and other future extensions will only
be implemented on top of them.
As much as we'd have liked to get rid of two different file descriptor
types it doesn't hurt us much and is not that much different from what
we will e.g. see with fsinfo() in the new mount api which needs to work
on regular fds gotten via open()/openat() and mountfds gotten from
fsopen() and fspick(). The mountfds will also allow for advanced
operations that the other ones will not. There's even an argument to be
made that fds you will get from open()/openat() and openat2() are
different types since they have very different behavior; openat2()
returning fds that are non arbitrarily upgradable etc.
]]
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Notes from a conversation on linux-man@ with Christian Brauner:
[[
> [*} By the way, going forward, can we call these things
> "process FDs", rather than "PID FDs"? The API names are what
> they are, an that's okay, but these just as we have socket
> FDs that refer to sockets, directory FDs that refer to
> directories, and timer FDs that refer to timers, and so on,
> these are FDs that refer to *processes*, not "process IDs".
> It's a little thing, but I think the naming better, and
> it's what I propose to use in the manual pages.
The naming was another debate and we ended with this compromise.
I would just clarify that a pidfd is a process file descriptor. I
wouldn't make too much of a deal of hiding the shortcut "pidfd".
People are already using it out there in the wild and it's never
proven a good idea to go against accepted practice.
]]
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In the kernel source (kernel/fork.c::copy_process()), there is:
pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
O_RDWR | O_CLOEXEC);
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add an entry for CLONE_PIDFD. This flag is available starting
with kernel 5.2. If specified, a process file descriptor
("pidfd") referring to the child process will be returned in
the ptid argument.
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
After my rewriting, almost nothing of the original page remains,
so update the copyright. As the author, I'm relicensing to the
"verbatim" license most commonly used in man pages.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The text stating that "pivot_root() may or may not change the
current root and the current working directory of any processes
or threads which use the old root directory" was written 19 years
ago, before the system call itself was even finalized in the
kernel. The implementation has never changed, and it won't
change in the future, since that would cause user-space breakage.
The existence of that text in DESCRIPTION, followed by qualifying
text stating what the implementation actually does (and has always
done) makes for confusing reading. Therefore, relegate this text
to a historical note in NOTES (so that readers with long memories
can see why the manual page was changed) and rework the text in
DESCRIPTION accordingly.
Reported-by: Philipp Wendler <ml@philippwendler.de>
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Reported-by: Reid Priedhorsky <reidpr@lanl.gov>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Quoting Eric:
If we are going to be pedantic "filesystem" is really the
wrong concept here. The section about bind mount clarifies
it, but I wonder if there is a better term.
I think I would say: "new_root and put_old must not be on
the same mount as the current root."
I think using "mount" instead of "filesystem" keeps the
concepts less confusing.
As I am reading through this email and seeing text that is
trying to be precise and clear then hitting the term
"filesystem" is a bit jarring. pivot_root doesn't care a
thing for file systems. pivot_root only cares about mounts.
And by a "mount" I mean the thing that you get when you
create a bind mount or you call mount normally.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Philipp Wendler noted that the text on the restrictions for
'new_root' was slightly contradictory, and things could be
clarified and simplified by describing the restrictions on
'new_root' in one place.
Reported-by: Philipp Wendler <ml@philippwendler.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Remove the text that suggests that pivot_root() changes the root
directory and CWD of process that have directory and CWD on the
old root *filesystem*. Change "filesystem" to "directory".
Reported-by: Philipp Wendler <ml@philippwendler.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reid noted a confusion between 'old_root' (my attempt at a
shorthand for the old root point) and 'put_old. Eliminate the
confusion by replacing the shorthand with "old root mount point".
Reported-by: Reid Priedhorsky <reidpr@lanl.gov>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Eric Biederman notes that the change in commit f646ac88ef was
not strictly necessary for this example, since one of the already
documented requirements is that various mount points must not have
shared propagation, or else pivot_root() will fail. So, simplify
the example.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Update with all the missing errors the syscall can return, the
behaviour the syscall should have w.r.t. to copies within single
files, etc.
[Amir] updates for final released version.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In RETURN VALUE, point reader at subsection noting that the return
value of the raw sched_setaffinity() system call differs from the
wrapper function in glibc.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Using signalfd(2) with epoll(7) and fork(2) can lead to some head
scratching.
It seems that when a signalfd file descriptor is added to epoll
you will only get notifications for signals sent to the process
that added the file descriptor to epoll.
So if you have a signalfd fd registered with epoll and then call
fork(2), perhaps by way of daemon(3) for example. Then you will
find that you no longer get notifications for signals sent to the
newly forked process.
User kentonv on ycombinator[0] explained it thus
"One place where the inconsistency gets weird is when you
use signalfd with epoll. The epoll will flag events on the
signalfd based on the process where the signalfd was
registered with epoll, not the process where the epoll is
being used. One case where this can be surprising is if you
set up a signalfd and an epoll and then fork() for the
purpose of daemonizing -- now you will find that your epoll
mysteriously doesn't deliver any events for the signalfd
despite the signalfd otherwise appearing to function as
expected."
And another post from the same person[1].
And then there is this snippet from this kernel commit message[2]
"If you share epoll fd which contains our sigfd with another
process you should blame yourself. signalfd is "really
special"."
So add a note to the man page that points this out where people
will hopefully find it sooner rather than later!
[0]: https://news.ycombinator.com/item?id=9564975
[1]: https://stackoverflow.com/questions/26701159/sending-signalfd-to-another-process/29751604#29751604
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d80e731ecab420ddcb79ee9d0ac427acbc187b4b
Signed-off-by: Andrew Clayton <andrew@digital-domain.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Eric Biederman noted that my list of directories that could not
have shared propagation was incorrect. I had written that
new_root could not be shared; rather it should be: the parent of
the current root mount point.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Quoting Eric Biederman:
The concern from our conversation at the container
mini-summit was that there is a pathology if in your initial
mount namespace all of the mounts are marked MS_SHARED like
systemd does (and is almost necessary if you are going to
use mount propagation), that if new_root itself is MS_SHARED
then unmounting the old_root could propagate.
So I believe the desired sequence is:
>>> chdir(new_root);
+++ mount("", ".", MS_SLAVE | MS_REC, NULL);
>>> pivot_root(".", ".");
>>> umount2(".", MNT_DETACH);
The change to new new_root could be either MS_SLAVE or
MS_PRIVATE. So long as it is not MS_SHARED the mount won't
propagate back to the parent mount namespace.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
LXC uses this [1]. I tested, to double-check, and it works.
The fchdir() dance done by LXC is not needed though:
fchdir(old_root); umount(".", MNT_DETACH); fchdir(new_root);
As far as I can see, just the umount() is sufficient, since,
after pivot_root(), oldi_root is at the top of the stack
of mounts at "/" and thus (so long as CWD is at "/")
the umount will remove the mount at the top of the stack.
Eric Biederman confirmed my understanding by mail, and
Philipp Wendler verified my results by experiment.
[1] See the following commit in LXC:
commit 2d489f9e87fa0cccd8a1762680a43eeff2fe1b6e
Author: Serge Hallyn <serge.hallyn@ubuntu.com>
Date: Sat Sep 20 03:15:44 2014 +0000
pivot_root: switch to a new mechanism (v2)
Helped-by: Eric W. Biederman <ebiederm@xmission.com>
Helped-by: Philipp Wendler <ml@philippwendler.de>
Helped-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
After around 19 years, the behavior of pivot_root() has not been
changed, and will almost certainly not change in the future.
So, reword to remove the suggestion that the behavior may change.
Also, more clearly document the effect of pivot_root() on
the calling process's current working directory.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The reference of "Note that this also applies" was vague. So
combine this paragraph with an earlier one to make the linkage
clearer.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The idea that there might one day be a mechanism for kernel
threads to explicitly relinquish access to the filesystem never
came to pass (after 20 years), and the presence of text
describing this idea is, IMO, a distraction. So, remove it.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
One kernel printk() later, my suspicions seem confirmed: the text
describing the situation where the current root is not a mount
point (because of a chroot()) seems to be bogus. (Perhaps it was
true once upon a time.) In my testing, if the current root is not
a mount point, an EINVAL error results.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In this text:
If the current root is not a mount point (e.g., after an
earlier chroot(2) or pivot_root())...
mention of pivot_root() makes no sense, since (as noted in an
earlier commit message for this page) 'new_root' in a previous
pivot_root() must (since Linux 2.4.5) have been a mount point.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
One of these "bugs" is a philosophical point already covered
elsewhere in the page, while the other is a somewhat obscure joke.
Both pieces are a bit of a distraction, really.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The note that EBUSY is given if a filesystem is already mounted
on 'Iput_old' was never really true. That restriction was in
Linux 2.3.14, but removed in Linux 2.3.99-pre6 so it never made
it to mainline.
The relevant diff in pivot_root() was:
error = -EBUSY;
- if (d_new_root->d_sb == root->d_sb || d_put_old->d_sb == root->d_sb)
+ if (new_nd.mnt == root_mnt || old_nd.mnt == root_mnt)
goto out2; /* loop */
- if (d_put_old != d_put_old->d_covers)
- goto out2; /* mount point is busy */
error = -EINVAL;
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Some of the text was written long ago, and hinted that things
might change in the future. However, 20 years have passed
and these details have not changed, so rework the text to
hint at that fact.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As far as I can see from the source code, the statement that
"No other filesystem may be mounted on 'put_old'" is incorrect.
Even looking at the 2.4.0 source code, there I can't see such
a restriction. In addition, some testing on a 5.0 kernel
(mounting 'put_old' in the new mount namespace just before
pivot_root()) did not result in an error for this case when
calling pivot_root().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
pivot_root() only affects the current working directory and root
directory of other processes in the same mount namespace as the
caller.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The restriction on what values may be specified in 'si_code'
apply only when sending a signal to a process other than the
caller itself.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Threads are allowed to switch mount namespaces if the filesystem
details aren't being shared. That's the purpose of the check in
the kernel quoted by the comment:
if (fs->users != 1)
return -EINVAL;
It's been this way since the code was originally merged in v3.8.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Since introduction of MAP_SHARED_VALIDATE, in case flags contain
both MAP_PRIVATE and MAP_SHARED, mmap() doesn't fail with EINVAL,
it succeeds.
The reason for that is that MAP_SHARED_VALIDATE is in fact equal
to MAP_PRIVATE | MAP_SHARED.
This is intended behavior, see:
https://lwn.net/Articles/758594/https://lwn.net/Articles/758598/
Signed-off-by: Nikola Forró <nforro@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Even though the RFW_* flags were first introduced in Linux 4.6,
they could not be used with aio until 4.13 where the aio_rw_flags
field was added to struct iocb (9830f4be159b "fs: Use RWF_* flags
for AIO operations"). Correct the stated version for each flag.
Fixes: 2f72816f86 ("io_submit.2: Add kernel version numbers for various 'aio_rw_flags' flags")
Signed-off-by: Matti Möll <Matti.Moell@opensynergy.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
E2BIG was removed in 2.6.29, we should mark it as deprecated.
Signed-off-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add powerpc64 to the calling convention tables.
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
PTRACE_GET_SYSCALL_INFO request was introduced by Linux kernel
commit 201766a20e30f982ccfe36bebfad9602c3ff574a aka
v5.3-rc1~65^2~23.
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As reported by Florin:
In the first table, for the riscv Arch/ABI, the instruction
should be ecall instead of scall.
According the official manual, the instruction has been
renamed.
https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf
"The SCALL and SBREAK instructions have been renamed to
ECALL and EBREAK, respectively. Their encoding and
functionality are unchanged."
Reported-by: Florin Blanaru <florin.blanaru96@gmail.com>
Reviewed-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As reported by Simone:
I was looking at version from 2017-09-15 but it's the same
on: http://man7.org/linux/man-pages/man2/statx.2.html
(2019-03-06)
There is reported (about the mask argument) after the list
of constants:
> Note that the kernel does not reject values in mask other
> than the above. Instead, it simply informs the caller which
> values are sup‐ ported by this kernel and filesystem via the
> statx.stx_mask field.
But as reported in the error values, there can be EINVAL if
mask has a reserved valued, and I found a check against
STATX__RESERVED in fs/stat.c for this. So if you use a that
bit (0x80000000U) the kernel will reject the value.
Probably is better to say that the kernel do not enforce the
use of only the listed values, but there are anyway reserved
values so and so you cannot put whatever you want on mask
(that apply to more values than UINT_MAX).
Reported-by: Simone Piccardi <piccardi@truelite.it>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Hi,
Both the Ext2 filesystem handler and the Ext4 filesystem handler will
return the ERANGE error code. Ext2 will return it if the name or value is
too long to be able to be stored, Ext4 will return it if the name is too
long. For reference, the relevant files/lines (with excerpts) are:
fs/ext2/xattr.c: lines 394 to 396 in ext2_xattr_set
> 394 name_len = strlen(name);
> 395 if (name_len > 255 || value_len > sb->s_blocksize)
> 396 return -ERANGE;
fs/ext4/xattr.c: lines 2317 to 2318 in ext4_xattr_set_handle
> 2317 if (strlen(name) > 255)
> 2318 return -ERANGE;
Other filesystems also return this code:
xfs/libxfs/xfs_attr.h: lines 53 to 55
> * The maximum size (into the kernel or returned from the kernel) of an
> * attribute value or the buffer used for an attr_list() call. Larger
> * sizes will result in an ERANGE return code.
It's possible that more filesystem handlers do this, a cursory grep shows
that most of the filesystem xattr handler files mention ERANGE in some
form. A suggested patch is below (I'm not 100% sure on the wording through).
Thanks
--
- Finn
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In kernel/sys.c, arg2 is an unsigned long value and it will never
less than 0. Also, since kernel commit id da8b44d5a9f8 (Linux
4.6), timer_slack_ns and default timer_slack_ns have been
converted into u64, the return value of PR_GET_TIMERSLACK has been
limited under ULONG_MAX.
The timer slack value also can be inherited by a child created via
fork(2).
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As reported by Alan Stern:
Here are two extracts from the man page for ppoll(2):
Specifying a negative value in timeout means an infinite
timeout.
Other than the difference in the precision of the timeout
argument, the following ppoll() call:
ready = ppoll(&fds, nfds, tmo_p, &sigmask);
is equivalent to atomically executing the following calls:
sigset_t origmask;
int timeout;
timeout = (tmo_p == NULL) ? -1 :
(tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000);
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = poll(&fds, nfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
But if tmo_p->tv_sec is negative, the ppoll() call is not
equivalent to the corresponding poll() call. The kernel rejects
negative values of tv_sec with an EINVAL error; it does not
interpret the value as meaning an infinite timeout.
(Yes, the kernel interprets tmo_p == NULL as an infinite timeout,
but the man page is still wrong for the case tmo_p->tv_sec < 0.)
Suggested fix: Following the end of the second extract above, add:
except that negative time values in tmo_p are not
interpreted as an infinite timeout.
Also, in the ERRORS section, change the text for EINVAL to:
EINVAL The nfds value exceeds the RLIMIT_NOFILE value or
*tmo_p contains an invalid (negative) time value.
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
It appears that 'new_root' may not have needed to be a mount
point on ancient kernels, but already in Linux 2.4.5, there
was the diff shown below. Verified also by testing.
@@ -1631,8 +1605,9 @@
* - we don't move root/cwd if they are not at the root (reason: if something
* cared enough to change them, it's probably wrong to force them elsewhere)
* - it's okay to pick a root that isn't the root of a file system, e.g.
- * /nfs/my_root where /nfs is the mount point. Better avoid creating
- * unreachable mount points this way, though.
+ * /nfs/my_root where /nfs is the mount point. It must be a mountpoint,
+ * though, so you may need to say mount --bind /nfs/my_root /nfs/my_root
+ * first.
*/
asmlinkage long sys_pivot_root(const char *new_root, const char *put_old)
@@ -1640,7 +1615,7 @@
struct dentry *root;
struct vfsmount *root_mnt;
struct vfsmount *tmp;
- struct nameidata new_nd, old_nd;
+ struct nameidata new_nd, old_nd, parent_nd, root_parent;
char *name;
int error;
@@ -1688,6 +1663,10 @@
if (new_nd.mnt == root_mnt || old_nd.mnt == root_mnt)
goto out2; /* loop */
error = -EINVAL;
+ if (root_mnt->mnt_root != root)
+ goto out2;
+ if (new_nd.mnt->mnt_root != new_nd.dentry)
+ goto out2; /* not a mountpoint */
tmp = old_nd.mnt; /* make sure we can reach put_old from new_root */
spin_lock(&dcache_lock);
if (tmp != new_nd.mnt) {
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
To get the pkey_alloc, pkey_free and pkey_mprotect functions
_GNU_SOURCE needs to be defined before including sys/mman.h.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Mark Wielaard <mark@klomp.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The mprotect.2 NOTES say:
On systems that do not support protection keys in
hardware, pkey_mprotect() may still be used, but pkey must
be set to 0. When called this way, the operation of
pkey_mprotect() is equivalent to mprotect().
But this is not what the glibc manual says:
It is also possible to call pkey_mprotect with a key value
of -1, in which case it will behave in the same way as
mprotect.
Which is correct. Both the glibc implementation and the
kernel check whether pkey is -1. 0 is not a valid pkey when
memory protection keys are not supported in hardware.
Signed-off-by: Mark Wielaard <mark@klomp.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Details relating to the new initialization flag FAN_REPORT_FID has been
added. As part of the FAN_REPORT_FID feature, a new set of event masks are
available and have been documented accordingly.
A simple example program has been added to also support the understanding
and use of FAN_REPORT_FID and directory modification events.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>