As far as I can see from the source code, the statement that
"No other filesystem may be mounted on 'put_old'" is incorrect.
Even in the 2.4.0 source code, I can't see such
a restriction. In addition, some testing on a 5.0 kernel
(mounting 'put_old' in the new mount namespace just before
pivot_root()) did not result in an error for this case when
calling pivot_root().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
pivot_root() only affects the current working directory and root
directory of other processes in the same mount namespace as the
caller.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The restriction on what values may be specified in 'si_code'
applies only when sending a signal to a process other than the
caller itself.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Threads are allowed to switch mount namespaces if the filesystem
details aren't being shared. That's the purpose of the check in
the kernel quoted by the comment:
if (fs->users != 1)
return -EINVAL;
It's been this way since the code was originally merged in v3.8.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Since the introduction of MAP_SHARED_VALIDATE, when flags contains
both MAP_PRIVATE and MAP_SHARED, mmap() doesn't fail with EINVAL;
it succeeds.
The reason for that is that MAP_SHARED_VALIDATE is in fact equal
to MAP_PRIVATE | MAP_SHARED.
This is intended behavior, see:
https://lwn.net/Articles/758594/
https://lwn.net/Articles/758598/
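A minimal sketch (not part of the patch) of the relationship described
above. The helper names are hypothetical, and the fallback value 0x03 is
an assumption taken from the Linux UAPI headers for C libraries that do
not yet define MAP_SHARED_VALIDATE:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallback for older C library headers (assumed Linux UAPI value). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif

/* Returns 1 if the two sharing bits together equal MAP_SHARED_VALIDATE. */
int both_bits_mean_shared_validate(void)
{
    return (MAP_PRIVATE | MAP_SHARED) == MAP_SHARED_VALIDATE;
}

/*
 * Hypothetical helper: map 'len' anonymous bytes with both sharing bits
 * set. Returns 0 on success, or the errno value from mmap() on failure.
 */
int try_mmap_private_and_shared(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return errno;
    munmap(p, len);
    return 0;
}
```

On a kernel that knows MAP_SHARED_VALIDATE, try_mmap_private_and_shared()
would be expected to return 0; on an older kernel it would be expected
to return EINVAL.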
Signed-off-by: Nikola Forró <nforro@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Even though the RWF_* flags were first introduced in Linux 4.6,
they could not be used with aio until 4.13 where the aio_rw_flags
field was added to struct iocb (9830f4be159b "fs: Use RWF_* flags
for AIO operations"). Correct the stated version for each flag.
Fixes: 2f72816f86 ("io_submit.2: Add kernel version numbers for various 'aio_rw_flags' flags")
Signed-off-by: Matti Möll <Matti.Moell@opensynergy.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
E2BIG was removed in 2.6.29; we should mark it as deprecated.
Signed-off-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add powerpc64 to the calling convention tables.
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
PTRACE_GET_SYSCALL_INFO request was introduced by Linux kernel
commit 201766a20e30f982ccfe36bebfad9602c3ff574a aka
v5.3-rc1~65^2~23.
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As reported by Florin:
In the first table, for the riscv Arch/ABI, the instruction
should be ecall instead of scall.
According to the official manual, the instruction has been
renamed.
https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf
"The SCALL and SBREAK instructions have been renamed to
ECALL and EBREAK, respectively. Their encoding and
functionality are unchanged."
Reported-by: Florin Blanaru <florin.blanaru96@gmail.com>
Reviewed-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As reported by Simone:
I was looking at version from 2017-09-15 but it's the same
on: http://man7.org/linux/man-pages/man2/statx.2.html
(2019-03-06)
There is reported (about the mask argument) after the list
of constants:
> Note that the kernel does not reject values in mask other
> than the above. Instead, it simply informs the caller which
> values are supported by this kernel and filesystem via the
> statx.stx_mask field.
But as reported in the error values, there can be EINVAL if
mask has a reserved value, and I found a check against
STATX__RESERVED in fs/stat.c for this. So if you use that
bit (0x80000000U), the kernel will reject the value.
It is probably better to say that the kernel does not enforce the
use of only the listed values, but that there are nevertheless
reserved values, so you cannot put whatever you want in mask
(that applies to more values than UINT_MAX).
Reported-by: Simone Piccardi <piccardi@truelite.it>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Hi,
Both the Ext2 filesystem handler and the Ext4 filesystem handler will
return the ERANGE error code. Ext2 will return it if the name or value is
too long to be able to be stored, Ext4 will return it if the name is too
long. For reference, the relevant files/lines (with excerpts) are:
fs/ext2/xattr.c: lines 394 to 396 in ext2_xattr_set
> 394 name_len = strlen(name);
> 395 if (name_len > 255 || value_len > sb->s_blocksize)
> 396 return -ERANGE;
fs/ext4/xattr.c: lines 2317 to 2318 in ext4_xattr_set_handle
> 2317 if (strlen(name) > 255)
> 2318 return -ERANGE;
Other filesystems also return this code:
xfs/libxfs/xfs_attr.h: lines 53 to 55
> * The maximum size (into the kernel or returned from the kernel) of an
> * attribute value or the buffer used for an attr_list() call. Larger
> * sizes will result in an ERANGE return code.
It's possible that more filesystem handlers do this, a cursory grep shows
that most of the filesystem xattr handler files mention ERANGE in some
form. A suggested patch is below (I'm not 100% sure on the wording though).
Thanks
--
- Finn
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In kernel/sys.c, arg2 is an unsigned long value, so it can never
be less than 0. Also, since kernel commit da8b44d5a9f8 (Linux
4.6), timer_slack_ns and default timer_slack_ns have been
converted into u64, and the return value of PR_GET_TIMERSLACK is
capped at ULONG_MAX.
The timer slack value also can be inherited by a child created via
fork(2).
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As reported by Alan Stern:
Here are two extracts from the man page for ppoll(2):
Specifying a negative value in timeout means an infinite
timeout.
Other than the difference in the precision of the timeout
argument, the following ppoll() call:
ready = ppoll(&fds, nfds, tmo_p, &sigmask);
is equivalent to atomically executing the following calls:
sigset_t origmask;
int timeout;
timeout = (tmo_p == NULL) ? -1 :
(tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000);
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = poll(&fds, nfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
But if tmo_p->tv_sec is negative, the ppoll() call is not
equivalent to the corresponding poll() call. The kernel rejects
negative values of tv_sec with an EINVAL error; it does not
interpret the value as meaning an infinite timeout.
(Yes, the kernel interprets tmo_p == NULL as an infinite timeout,
but the man page is still wrong for the case tmo_p->tv_sec < 0.)
Suggested fix: Following the end of the second extract above, add:
except that negative time values in tmo_p are not
interpreted as an infinite timeout.
Also, in the ERRORS section, change the text for EINVAL to:
EINVAL The nfds value exceeds the RLIMIT_NOFILE value or
*tmo_p contains an invalid (negative) time value.
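A minimal sketch (not part of the report) of the behavior described
above. The helper name is hypothetical; it calls ppoll() with an empty
descriptor set so that only the timeout handling is exercised:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>

/*
 * Hypothetical helper: call ppoll() with no file descriptors and the
 * given timeout. Returns 0 on success, or the errno value on failure.
 */
int ppoll_errno_for_timeout(time_t sec, long nsec)
{
    struct pollfd unused = { .fd = -1, .events = 0, .revents = 0 };
    struct timespec ts = { .tv_sec = sec, .tv_nsec = nsec };

    assert(nsec >= 0);
    /* nfds is 0, so 'unused' is never examined by the kernel. */
    if (ppoll(&unused, 0, &ts, NULL) == -1)
        return errno;
    return 0;
}
```

ppoll_errno_for_timeout(-1, 0) would be expected to return EINVAL rather
than block forever, whereas poll() with a timeout of -1 blocks
indefinitely.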
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
It appears that 'new_root' may not have needed to be a mount
point on ancient kernels, but already in Linux 2.4.5, there
was the diff shown below. Verified also by testing.
@@ -1631,8 +1605,9 @@
* - we don't move root/cwd if they are not at the root (reason: if something
* cared enough to change them, it's probably wrong to force them elsewhere)
* - it's okay to pick a root that isn't the root of a file system, e.g.
- * /nfs/my_root where /nfs is the mount point. Better avoid creating
- * unreachable mount points this way, though.
+ * /nfs/my_root where /nfs is the mount point. It must be a mountpoint,
+ * though, so you may need to say mount --bind /nfs/my_root /nfs/my_root
+ * first.
*/
asmlinkage long sys_pivot_root(const char *new_root, const char *put_old)
@@ -1640,7 +1615,7 @@
struct dentry *root;
struct vfsmount *root_mnt;
struct vfsmount *tmp;
- struct nameidata new_nd, old_nd;
+ struct nameidata new_nd, old_nd, parent_nd, root_parent;
char *name;
int error;
@@ -1688,6 +1663,10 @@
if (new_nd.mnt == root_mnt || old_nd.mnt == root_mnt)
goto out2; /* loop */
error = -EINVAL;
+ if (root_mnt->mnt_root != root)
+ goto out2;
+ if (new_nd.mnt->mnt_root != new_nd.dentry)
+ goto out2; /* not a mountpoint */
tmp = old_nd.mnt; /* make sure we can reach put_old from new_root */
spin_lock(&dcache_lock);
if (tmp != new_nd.mnt) {
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
To get the pkey_alloc, pkey_free and pkey_mprotect functions
_GNU_SOURCE needs to be defined before including sys/mman.h.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Mark Wielaard <mark@klomp.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The mprotect.2 NOTES say:
On systems that do not support protection keys in
hardware, pkey_mprotect() may still be used, but pkey must
be set to 0. When called this way, the operation of
pkey_mprotect() is equivalent to mprotect().
But this is not what the glibc manual says:
It is also possible to call pkey_mprotect with a key value
of -1, in which case it will behave in the same way as
mprotect.
Which is correct. Both the glibc implementation and the
kernel check whether pkey is -1. 0 is not a valid pkey when
memory protection keys are not supported in hardware.
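A minimal sketch (not part of the patch) of calling pkey_mprotect() with
a key of -1. The helper name is hypothetical, and the code assumes a
glibc that provides the pkey_mprotect() wrapper (2.27 or later):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one page, then change its protection with
 * pkey_mprotect() using the given key. Returns 0 on success, or the
 * errno value on failure.
 */
int pkey_mprotect_errno(int pkey)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;
    int err = 0;

    assert(page > 0);
    p = mmap(NULL, (size_t) page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return errno;

    /* With pkey == -1 this should behave like plain mprotect(). */
    if (pkey_mprotect(p, (size_t) page, PROT_READ, pkey) == -1)
        err = errno;

    munmap(p, (size_t) page);
    return err;
}
```

pkey_mprotect_errno(-1) would be expected to return 0 whether or not the
hardware supports protection keys.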
Signed-off-by: Mark Wielaard <mark@klomp.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Details relating to the new initialization flag FAN_REPORT_FID have been
added. As part of the FAN_REPORT_FID feature, a new set of event masks is
available and has been documented accordingly.
A simple example program has been added to also support the understanding
and use of FAN_REPORT_FID and directory modification events.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The removed text long ago ceased to be accurate. Nowadays, the
dispatch table is autogenerated when building the kernel (via
the kernel makefile, arch/x86/entry/syscalls/Makefile).
Reported-by: Andreas Korb <andreas.d.korb@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Rewrite for improved clarity and defer to setfsuid(2) for the
rationale of the fsGID rather than repeating the same details
in this page.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The current text reads somewhat clumsily. Rewrite it to introduce
the eUID and fsUID in parallel, and more clearly hint at the
historical rationale for the fsUID, which is detailed lower in
the page.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This patch documents two additional flags recently introduced
for the attr.sched_flags field of sched_setattr().
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Quoting Branden:
*roff escape sequences may sometimes look like C escapes, but that
is misleading. *roff is in part a macro language and that means
recursive expansion to arbitrary depths.
You can get away with "\\" in a context where no macro expansion
is taking place, but try to spell a literal backslash this way in
the argument to a macro and you will likely be unhappy with
results.
Try viewing the attached file with "man -l".
"\e" is the preferred and portable way to get a portable "escape
literal" going back to CSTR #54, the original Bell Labs troff
paper.
groff(7) discusses the issue:
\\ reduces to a single backslash; useful to delay its
interpretation as escape character in copy mode. For a
printable backslash, use \e, or even better \[rs], to be
independent from the current escape character.
As of groff 1.22.4, groff_man(7) does as well:
\e Widely used in man pages to represent a backslash output
glyph. It works reliably as long as the .ec request is
not used, which should never happen in man pages, and it
is slightly more portable than the more exact ‘\(rs’
(“reverse solidus”) escape sequence.
People not concerned with portability to extremely old troffs should
probably just use \(rs (or \[rs]), as it means "the backslash
glyph", not "the glyph corresponding to whatever the current escape
character is".
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Quoting Branden:
*roff systems will interpret the period in the unpatched
page as sentence-ending punctuation and put inter-sentence
spacing after it. (This might not be visible on
nroff/terminal devices, but it is more likely to be on
typesetter/PostScript/PDF output).
groff_man(7) in groff 1.22.4 attempts to throw man page
writers a bone here:
\& Zero‐width space. Append to an input line to prevent
an end‐of‐sentence punctuation sequence from being
recognized as such, or insert at the beginning of an
input line to prevent a dot or apostrophe from being
interpreted as the beginning of a roff request.
Reported-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is>
Reported-by: G. Branden Robinson <g.branden.robinson@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
1) Use single-font macros for a single argument.
2) Use quotation marks for arguments containing a space.
3) Use roman font for punctuation marks.
The output has only changes of the font for a punctuation mark.
Signed-off-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
fanotify_init.2: add new flag FAN_REPORT_TID
fanotify.7: update description of member pid in
struct fanotify_event_metadata
Signed-off-by: nixiaoming <nixiaoming@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Monitor fanotify events on the entire filesystem.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
New event masks have been added to the fanotify API. Documentation to
support the use and behaviour of these new masks has been added
accordingly.
Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Note EEXIST error that occurs when requesting a watch on a path
which is already watched with IN_MASK_CREATE.
Note EINVAL error also occurs when requesting a watch specifying
both IN_MASK_CREATE and IN_MASK_ADD.
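A minimal sketch (not part of the patch) that exercises both error
cases. The helper name is hypothetical, the fallback value for
IN_MASK_CREATE is an assumption taken from the kernel UAPI header, and
the expected results assume a kernel that supports IN_MASK_CREATE:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Fallback for C library headers that predate IN_MASK_CREATE. */
#ifndef IN_MASK_CREATE
#define IN_MASK_CREATE 0x10000000
#endif

/*
 * Hypothetical helper: add a watch on 'path' with 'first_mask' (skipped
 * if first_mask is 0), then add another with 'second_mask'. Returns the
 * errno value of the second call (0 if it succeeded), or -1 if the
 * setup failed.
 */
int second_watch_errno(const char *path, uint32_t first_mask,
                       uint32_t second_mask)
{
    int fd, err = 0;

    assert(path != NULL);
    fd = inotify_init1(IN_CLOEXEC);
    if (fd == -1)
        return -1;

    if (first_mask != 0 &&
        inotify_add_watch(fd, path, first_mask) == -1) {
        close(fd);
        return -1;
    }

    if (inotify_add_watch(fd, path, second_mask) == -1)
        err = errno;

    close(fd);
    return err;
}
```

For example, second_watch_errno("/", IN_CREATE | IN_MASK_CREATE,
IN_CREATE | IN_MASK_CREATE) would be expected to return EEXIST, and
second_watch_errno("/", 0, IN_CREATE | IN_MASK_CREATE | IN_MASK_ADD)
would be expected to return EINVAL.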
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The glibc wrapper was added in glibc 2.29, released on 1 Feb 2019.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
mtk: checked also against examples in samples/bpf
in kernel source to confirm.
Signed-off-by: Oded Elisha <oded123456@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The current manpage reads to me as if the kernel will always pick
a free space close to the requested address, but that's not the
case:
mmap(0x600000000000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x600000000000
mmap(0x600000000000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x7f5042859000
You can also see this in the various implementations of
->get_unmapped_area() - if the specified address isn't available,
the kernel basically ignores the hint (apart from the 5level
paging hack).
Clarify how this works a bit.
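A minimal sketch (not part of the patch) of the same experiment in C.
The helper name is hypothetical. Instead of a hard-coded address, it
uses the address of an existing mapping as the hint, so the hint is
guaranteed to be occupied:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one page, then request a second page using
 * the first page's address as a hint (without MAP_FIXED). Returns 1 if
 * the second mapping landed elsewhere, 0 if it landed at the hint, and
 * -1 if either mmap() call failed.
 */
int hint_ignored_when_occupied(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *first, *second;
    int moved;

    assert(page > 0);
    first = mmap(NULL, (size_t) page, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (first == MAP_FAILED)
        return -1;

    /* 'first' is occupied now; the kernel must place this elsewhere. */
    second = mmap(first, (size_t) page, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (second == MAP_FAILED) {
        munmap(first, (size_t) page);
        return -1;
    }

    moved = (second != first);
    munmap(second, (size_t) page);
    munmap(first, (size_t) page);
    return moved;
}
```

How far the second mapping ends up from the hint is not specified; the
sketch only checks that it is not placed on top of the existing mapping.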
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
All address families are now documented in address_families.7,
which is already listed in the SEE ALSO section. Also, the AF_ALG
note contains a dead link to the kernel HTML documentation.
Signed-off-by: Nikola Forró <nforro@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
It has its own man page, so it probably makes sense to mention
it here.
* man2/socket.2 (.SH DESCRIPTION): Add mention of AF_VSOCK back.
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
* man2/socket.2 (.SH DESCRIPTION): Mention that the list of
address families is Linux-specific.
* man7/address_families.7 (.SH DESCRIPTION): Likewise.
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
* man2/sigaction.2 (.SS Undocumented): Provide information about
relation between the second argument of sa_handler and
uc_mcontext field of the struct ucontext structure.
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Some architectures do provide an 'l_sysid' declaration in
struct flock; however, it is not used anyway.
* man2/fcntl.2 (.SH NOTES): Note that l_sysid field is not used on
Linux even if present on some architectures.
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Remove crufty sentence suggesting use of deprecated capsetp(3) and
capgetp(3); the manual page for those functions has long (at least
as far back as 2007) noted that they are deprecated.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
O_RSYNC is defined in <asm/fcntl.h> on HP PA-RISC, but is not
used anyway.
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add a note regarding other implementations of whiteout inodes
and update filesystem support information.
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This information is already summarized in syscall(2), so there's
no need to repeat it in each page.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Some architectures (ab)use second return value register for additional
return value in some system calls. Let's describe this.
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As reported by Nadav Har'El in
https://bugzilla.kernel.org/show_bug.cgi?id=197961
The write(2) manual page has this paragraph:
"On success, the number of bytes written is returned
(zero indicates nothing was written). It is not an error
if this number is smaller than the number of bytes
requested; this may happen for example because the disk
device was filled. See also NOTES."
I find a few problems with this paragraph:
1. It's not clear what "See also NOTES." refers to (does it
refer to anything?). What in the NOTES is relevant here?
2. The paragraph seems to suggest that write(2) of a
non-empty buffer may sometimes return even 0 in case of an
error like the device being filled. I think this is wrong
- if there was an error after already writing some number
of bytes, this non-zero number is returned. But if there's
an error before writing any bytes, -1 will be returned
(and the error reason in errno) - 0 will not be returned
unless the given count is 0 (that case is explained in the
following paragraph).
3. The paragraph doesn't explain what a user should do
after a short write (i.e., write(2) returning less than
count). How would the user know why there was an error, or
if there even was one? I think users should be told what
to do next because this information is part of how to use
this API correctly. I think users should be told to retry
the rest of the write (i.e., write(fd, buf+ret, count-ret)),
and this will either succeed in writing some more data if
the error reason was solved, or the second write will
return -1 and the error reason in errno.
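A minimal sketch (not part of the report) of the retry loop suggested in
point 3. The name write_all() is hypothetical, and restarting on EINTR
is an added assumption that the report does not discuss:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep calling write() until all 'count' bytes
 * have been written or an error occurs. Returns 'count' on success, or
 * -1 with errno set by the failing write().
 */
ssize_t write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || count == 0);
    while (done < count) {
        ssize_t n = write(fd, p + done, count - done);

        if (n == -1) {
            if (errno == EINTR)
                continue;       /* Interrupted before writing: retry. */
            return -1;          /* Real error; errno says why. */
        }
        done += (size_t) n;     /* Short write: retry the remainder. */
    }
    return (ssize_t) done;
}
```

A caller that sees -1 after a short write learns the reason (for
example, ENOSPC) from errno, which the short write itself does not
report.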
Reported-by: Nadav Har'El <nyh@math.technion.ac.il>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
ENOATTR is not a standard error code, but rather one that is
defined in 'libattr' as a synonym for ENODATA. The manual pages
should use the error code actually returned by the kernel APIs.
See also https://bugzilla.kernel.org/show_bug.cgi?id=201995
Reported-by: Enrico Scholz <enrico.scholz@sigma-chemnitz.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Alpha uses v0 (i.e., $0) as the return value register in both the
syscall ABI and the C ABI.
See also
https://github.com/torvalds/linux/blob/master/arch/alpha/kernel/entry.S#L479
The normal Alpha C ABI uses a0~a5 to pass arguments and v0 as
the return value register. See
https://www2.cs.arizona.edu/projects/alto/Doc/local/alpha.register.html
The syscall ABI uses v0 as the trap number, a0~a5 to pass arguments,
and a3 as an indicator (boolean) of whether an error has occurred.
We can also see the libc syscall wrapper implementation at
https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/alpha/syscall.S.html
v0 is the register normally used for the return value, and we can see
that the return processing doesn't do anything with a0, which is the
wrong register given in the current syscall(2) description.
P.S. I found this wrong description because I'm porting Go gc to
a new CPU architecture that is similar to Alpha. I used the
wrong register at first, and then inspected the kernel code and
objdump output to determine the right syscall ABI.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Back in 2014 (37bee118ad) the text
describing when multiplexing happens was changed in a confusing way.
This is an attempt to clarify things a bit.
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Since 93e06c7a6453 ("mm: enable MADV_FREE for swapless system") we
handle MADV_FREE on a swapless system the same way as with the
swap available. Clarify that fact in the man page.
Reported-by: Niklas Hambüchen <mail@nh2.me>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Remove the old statement that PTRACE_O_TRACESYSGOOD may not work
on all architectures. As far as I can tell, all kernel code
properly tests PT_TRACESYSGOOD flag and sets the 7th bit in the
exit code passed to ptrace_notify().
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
To test the behavior documented by this patch, the following
demos employ the program shown at the foot of this commit message.
First, show that the pdeath signal is sent when the parent
terminates:
$ ./pdeath_signal 0 10 4
Parent (18595) about to sleep for 4 seconds
Child about to set PR_SET_PDEATHSIG
Child about to sleep
Parent (18595) terminating
*********** Child (18596) got signal; si_pid = 18595; si_uid = 1000
Parent PID is now 1403
$ Child about to exit
But the signal is not sent if the parent terminates before the
child uses PR_SET_PDEATHSIG:
$ ./pdeath_signal 2 10 0
Parent (18707) about to sleep for 0 seconds
Parent (18707) terminating
Child about to sleep 2 seconds before setting PR_SET_PDEATHSIG
$ Child about to set PR_SET_PDEATHSIG
Child about to sleep
Child about to exit
Demonstrate that the pdeath signal is sent on termination of each
ancestor subreaper process:
$ ./pdeath_signal 2 10 3 7 6 5
18786 marked itself as a subreaper
18786 subreaper about to sleep 7 seconds
18787 marked itself as a subreaper
18787 subreaper about to sleep 6 seconds
18788 marked itself as a subreaper
18788 subreaper about to sleep 5 seconds
Parent (18789) about to sleep for 3 seconds
Child about to sleep 2 seconds before setting PR_SET_PDEATHSIG
Child about to set PR_SET_PDEATHSIG
Child about to sleep
Parent (18789) terminating
*********** Child (18790) got signal; si_pid = 18789; si_uid = 1000
Parent PID is now 18788
18788 subreaper about to terminate
*********** Child (18790) got signal; si_pid = 18788; si_uid = 1000
Parent PID is now 18787
18787 subreaper about to terminate
*********** Child (18790) got signal; si_pid = 18787; si_uid = 1000
Parent PID is now 18786
18786 subreaper about to terminate
*********** Child (18790) got signal; si_pid = 18786; si_uid = 1000
Parent PID is now 1403
$ Child about to exit
But in the case where some subreapers terminate before they
have a chance to adopt the child, the terminations of those
subreapers do not result in a signal for the child:
$ ./pdeath_signal 2 10 3 5 6 7
18836 marked itself as a subreaper
18836 subreaper about to sleep 5 seconds
18837 marked itself as a subreaper
18837 subreaper about to sleep 6 seconds
18838 marked itself as a subreaper
18838 subreaper about to sleep 7 seconds
Parent (18839) about to sleep for 3 seconds
Child about to sleep 2 seconds before setting PR_SET_PDEATHSIG
Child about to set PR_SET_PDEATHSIG
Child about to sleep
Parent (18839) terminating
*********** Child (18840) got signal; si_pid = 18839; si_uid = 1000
Parent PID is now 18838
18836 subreaper about to terminate
$ 18837 subreaper about to terminate
18838 subreaper about to terminate
*********** Child (18840) got signal; si_pid = 18838; si_uid = 1000
Parent PID is now 1403
Child about to exit
============================
/* pdeath_signal.c */
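#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <unistd.h>
#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
} while (0)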
static void
handler(int sig, siginfo_t *si, void *ucontext)
{
printf("*********** Child (%ld) got signal; si_pid = %d; si_uid = %d\n",
(long) getpid(), si->si_pid, si->si_uid);
printf(" Parent PID is now %ld\n", (long) getppid());
}
int
main(int argc, char *argv[])
{
struct sigaction sa;
int childPreSleep, childPostSleep, parentSleep;
if (argc < 2) {
fprintf(stderr, "Usage: %s child-pre-sleep "
"[child-post-sleep [parent-sleep [subreaper-sleep...]]]\n",
argv[0]);
exit(EXIT_FAILURE);
}
childPreSleep = atoi(argv[1]);
if (argc > 2)
childPostSleep = atoi(argv[2]);
if (argc > 3)
parentSleep = atoi(argv[3]);
/* Optionally create a series of subreapers */
if (argc > 4) {
for (int sr = 4; sr < argc; sr++) {
if (prctl(PR_SET_CHILD_SUBREAPER, 1) == -1)
errExit("prctl");
printf("%ld marked itself as a subreaper\n", (long) getpid());
switch (fork()) {
case -1:
errExit("fork");
case 0:
break;
default:
printf("%ld subreaper about to sleep %s seconds\n",
(long) getpid(), argv[sr]);
sleep(atoi(argv[sr]));
printf("%ld subreaper about to terminate\n", (long) getpid());
exit(EXIT_SUCCESS);
}
}
}
switch (fork()) {
case -1:
errExit("fork");
case 0:
sa.sa_flags = SA_SIGINFO;
sigemptyset(&sa.sa_mask);
sa.sa_sigaction = handler;
if (sigaction(SIGUSR1, &sa, NULL) == -1)
errExit("sigaction");
if (childPreSleep > 0) {
printf("Child about to sleep %d seconds before setting "
"PR_SET_PDEATHSIG\n", childPreSleep);
sleep(childPreSleep);
}
printf("Child about to set PR_SET_PDEATHSIG\n");
if (prctl(PR_SET_PDEATHSIG, SIGUSR1) == -1)
errExit("prctl");
printf("Child about to sleep\n");
for (int j = 0; j < childPostSleep; j++)
sleep(1);
printf("Child about to exit\n");
exit(EXIT_SUCCESS);
default:
printf("Parent (%ld) about to sleep for %d seconds\n",
(long) getpid(), parentSleep);
sleep(parentSleep);
printf("Parent (%ld) terminating\n", (long) getpid());
exit(EXIT_SUCCESS);
}
}
Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The signal is process-directed, and the siginfo_t->si_pid
field contains the PID of the terminating parent.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
ptrace() with requests PTRACE_PEEKTEXT, PTRACE_PEEKDATA and
PTRACE_PEEKUSER can set errno to zero. AFAICS this is for a good
reason (so that you can tell the difference between a successful
PEEK with a result of -1 and a failed PEEK, even if you forget to
clear errno yourself), but it technically violates the rules
described in the errno.3 manpage.
glibc snippet from sysdeps/unix/sysv/linux/ptrace.c:
res = INLINE_SYSCALL (ptrace, 4, request, pid, addr, data);
if (res >= 0 && request > 0 && request < 4)
{
__set_errno (0);
return ret;
}
reproducer:
$ cat ptrace_test.c
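#include <err.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <unistd.h>
char foobar_data[4] = "ABCD";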
int main(void) {
pid_t child = fork();
if (child == -1) err(1, "fork");
if (child == 0) {
if (prctl(PR_SET_PDEATHSIG, SIGKILL)) err(1, "prctl");
while (1) sleep(1);
}
int status;
if (ptrace(PTRACE_ATTACH, child, NULL, NULL)) err(1, "attach");
if (waitpid(child, &status, 0) != child) err(1, "wait");
errno = EINVAL;
unsigned int res = ptrace(PTRACE_PEEKDATA, child, foobar_data, NULL);
printf("errno after PEEKDATA: %d\n", errno);
printf("PEEKDATA result: 0x%x\n", res);
}
$ gcc -o ptrace_test ptrace_test.c -Wall
$ ./ptrace_test
errno after PEEKDATA: 0
PEEKDATA result: 0x44434241
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
See copy_process() in kernel/fork.c:
if (clone_flags & CLONE_THREAD) {
if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
(task_active_pid_ns(current) !=
current->nsproxy->pid_ns_for_children))
return ERR_PTR(-EINVAL);
}
current->nsproxy->pid_ns_for_children is where unshare(CLONE_NEWPID)
stashes the pending namespace.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The extra detail of noting which 2.6.0-test kernel
added a particular feature has little value these days,
and is likely to confuse some readers who don't know
(and probably don't care) about the historical details.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Checking the FreeBSD source code, there's explicit support for
this to accommodate non-BSD systems (such as Linux).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Use "UFFDIO_ZEROPAGE" consistently rather than "UFFDIO_ZERO".
Signed-off-by: Anthony Iliopoulos <ailiopoulos@suse.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Based on text from Documentation/filesystems/ramfs-rootfs-initramfs.txt.
Signed-off-by: Elvira Khabirova <lineprinter@altlinux.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Initially it was planned that the parisc linux port would natively
support 32-bit HP-UX binaries, but this compatibility was never
reached and finally dropped with Linux kernel 3.14.
With that background, drop parisc from the list of platforms
that support their proprietary operating system.
Additional notes from mtk:
The most relevant commit from the Linux 3.14 change log was:
[[
commit f5a408d53edef3af07ac7697b8bc54a755628450
Author: Guy Martin <gmsoft@tuxicoman.be>
Date: Thu Jan 16 17:17:53 2014 +0100
parisc: Make EWOULDBLOCK be equal to EAGAIN on parisc
On Linux, only parisc uses a different value for EWOULDBLOCK which
causes a lot of troubles for applications not checking for both values.
Since the hpux compat is long dead, make EWOULDBLOCK behave the same as
all other architectures.
]]
Additional notes from Helge:
The patch above is the initial and most important one with which
we stopped the HP-UX compatibility.
Then, with this commit in kernel 3.18 there is no way back:
"parisc: Reduce SIGRTMIN from 37 to 32 to behave like
other Linux architectures"
commit 1f25df2eff5b25f52c139d3ff31bc883eee9a0ab
And in kernel 4.0 we finally dropped the HP-UX compat layer
from Linux kernel source code with the commit series
"parisc: hpux - Drop support for HP-UX binaries":
commit 04c1614977168fb8f002e2d81f704eeabe0c5ebd
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
On parisc one needs to take care of the 32-bit calling conventions
with 64-bit syscall parameters on a 32-bit kernel. So on parisc we
suffer from the same issues as ARM, PowerPC and Xtensa.
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
There was probably a little too much detail in
Lucas Werkmeister's patch. Simplify, by removing a few
file systems, and arrange the information as a bulleted
list for easier readability.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The RENAME_NOREPLACE flag was added with the initial release of the
renameat2 syscall in Linux 3.15, but support for most filesystems was
only added in later versions, and some may still not support it.
Signed-off-by: Lucas Werkmeister <mail@lucaswerkmeister.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The original implementation of PR_SET_MM_EXE_FILE only allowed it
to be used once in a process's lifetime. This restriction was
lifted in Linux commit 3fb4afd9a504c2386b8435028d43283216bf588e
("prctl: remove one-shot limitation for changing exe link").
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reported-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
[I got two patches for this; the other from Florian Weimer]
According to the following kernel code, preadv2(2)/pwritev2(2) with
an unknown flag actually returned EOPNOTSUPP instead of EINVAL:
----------------------------------------------------------------
static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
{
if (unlikely(flags & ~RWF_SUPPORTED)) {
return -EOPNOTSUPP;
}
...
}
static ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter,
loff_t *ppos, int type, rwf_t flags)
{
...
if (flags & ~RWF_HIPRI)
return -EOPNOTSUPP;
...
}
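A minimal sketch (not part of the patch) that probes the error returned
for an unknown flag. The helper name is hypothetical, the flag value
0x40000000 is assumed not to be a defined RWF_* flag, and /dev/zero is
assumed to be available and readable:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one byte from /dev/zero with preadv2()
 * using the given flags. Returns 0 on success, the errno value if
 * preadv2() failed, or -1 if /dev/zero could not be opened.
 */
int preadv2_errno_for_flags(int flags)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    int fd, err = 0;

    fd = open("/dev/zero", O_RDONLY);
    if (fd == -1)
        return -1;

    if (preadv2(fd, &iov, 1, 0, flags) == -1)
        err = errno;

    close(fd);
    return err;
}
```

preadv2_errno_for_flags(0x40000000) would be expected to return
EOPNOTSUPP rather than EINVAL, while a flags value of 0 would be
expected to succeed.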
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This fixes three typos of EACCES (one "S" is the correct errno
name).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>