A process doesn't have a capability in a mount namespace, but
rather in the user namespace that owns the mount namespace.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Attempts (settimeofday(), clock_settime(CLOCK_REALTIME)) to set
the real time clock to a value less than the current value of the
CLOCK_MONOTONIC clock result in EINVAL.
In the kernel source file kernel/time/timekeeping.c::do_settimeofday64(),
there is this check:
    if (timespec64_compare(&tk->wall_to_monotonic, &ts_delta) > 0) {
        ret = -EINVAL;
        goto out;
    }
It appears that the check was added in Linux 4.3:
commit e1d7ba8735551ed79c7a0463a042353574b96da3
Author: Wang YanQing <udknight@gmail.com>
Date: Tue Jun 23 18:38:54 2015 +0800
time: Always make sure wall_to_monotonic isn't positive
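The rule above can be approximated from user space before calling
clock_settime(). The sketch below is only an illustration: the helper
name realtime_would_be_rejected() is hypothetical, and it assumes the
caller sees the same CLOCK_MONOTONIC value as the kernel check does
(which may not hold inside a time namespace).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Return true if 'a' is strictly earlier than 'b'. */
static bool timespec_before(const struct timespec *a,
                            const struct timespec *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec;
    return a->tv_nsec < b->tv_nsec;
}

/*
 * Hypothetical helper (not part of any library).
 * Returns 1 if 'candidate' is below the current CLOCK_MONOTONIC value,
 * in which case setting CLOCK_REALTIME to it is expected to fail with
 * EINVAL on Linux 4.3 and later; 0 if it is not below; -1 if the
 * monotonic clock could not be read.
 */
int realtime_would_be_rejected(const struct timespec *candidate)
{
    struct timespec mono;

    assert(candidate != NULL);
    if (clock_gettime(CLOCK_MONOTONIC, &mono) == -1)
        return -1;
    return timespec_before(candidate, &mono) ? 1 : 0;
}
```

A caller with CAP_SYS_TIME could use this as a pre-check, but must
still handle EINVAL from clock_settime() itself, since the monotonic
clock advances between the check and the call.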
Reported-by: Jens Thoms Toerring <jt@toerring.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I noticed that it was undocumented how inotify_add_watch(2)
behaves if IN_ONLYDIR is specified and the target is not a
directory.
I've included a patch that adds ENOTDIR as an additional error in
the inotify_add_watch(2) man page.
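A minimal sketch of the behavior being documented, for illustration
only. The helper name watch_only_if_dir() is hypothetical; the calls
it makes (inotify_init1(), inotify_add_watch()) are the standard ones.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to watch 'path' with IN_ONLYDIR set.
 * Returns 0 if the watch was added, otherwise the errno value
 * reported by the failing call (ENOTDIR when 'path' is not a
 * directory).
 */
int watch_only_if_dir(const char *path)
{
    int fd, err = 0;

    assert(path != NULL);
    fd = inotify_init1(IN_CLOEXEC);
    if (fd == -1)
        return errno;

    if (inotify_add_watch(fd, path, IN_CREATE | IN_ONLYDIR) == -1)
        err = errno;

    close(fd);      /* closing the instance also drops the watch */
    return err;
}
```

For example, watch_only_if_dir("/") is expected to return 0, while
passing the path of a regular file or device node should yield ENOTDIR.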
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Remove a section that adds no benefit to the discussion of O_DIRECT.
Signed-off-by: Andrew Price <andy@andrewprice.me.uk>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
* man2/s390_sthyi.2
(.SH DESCRIPTION): Document the size of the resp_buffer when
function_code is 0.
(.SH NOTES): Document various aspects of the current
implementation (the lifted requirement for the response buffer
alignment, the presence of in-kernel cache), add description
for the documentation URL.
Coauthored-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
* man2/s390_runtime_instr.2 (.SH NOTES): Note the Linux kernel
version since which the asm/runtime_instr.h header is
available.
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
There are system calls of the same name present on the m68k and
MIPS architectures, but they simply allow setting some arbitrary
value which can be interpreted as a thread pointer by a threading
library.
* man2/set_thread_area.2 (.SH NAME): Rephrase in order to not
mention GDT.
(.SH SYNOPSIS): Add declarations for MIPS and m68k.
(.SH DESCRIPTION, .SH RETURN VALUE): Add description for MIPS
and m68k.
(.SH NOTES): Mention a way to get thread pointer on MIPS.
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This text has become rather long, making it somewhat
unwieldy in the discussion of the mmap() flags. Therefore,
move it to NOTES, with a pointer in DESCRIPTION referring
the reader to NOTES.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Change "extremely hazardous" to "hazardous". The former phrasing
is a little overwrought; on its own "hazardous" is enough to
convey the sense of danger.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Clarify that MAP_FIXED is appropriate if the specified address
range has been reserved using an existing mapping, but shouldn't
be used otherwise.
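The pattern described above can be sketched as follows. This is an
illustration under stated assumptions, not text from the page: the
helper name reserve_then_commit() is hypothetical, and the use of a
PROT_NONE anonymous mapping as the reservation is one common way of
doing it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: reserve 'npages' pages of address space with an
 * inaccessible mapping, then make the first page usable with MAP_FIXED.
 * Returns the base of the reservation, or NULL on failure.
 */
void *reserve_then_commit(size_t npages)
{
    size_t page, len;
    char *base;
    void *p;

    assert(npages > 0);
    page = (size_t) sysconf(_SC_PAGESIZE);
    len = npages * page;

    /* Step 1: let the kernel pick the range; PROT_NONE makes it a
       pure address-space reservation. */
    base = mmap(NULL, len, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Step 2: MAP_FIXED is appropriate here because the target range
       lies entirely inside a mapping that we created ourselves. */
    p = mmap(base, page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED) {
        munmap(base, len);
        return NULL;
    }
    return base;
}
```

The caller releases the whole range with a single munmap() of
npages * page-size bytes starting at the returned address.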
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Document the following membarrier commands introduced in
Linux 4.16:
MEMBARRIER_CMD_GLOBAL_EXPEDITED
(the old enum label MEMBARRIER_CMD_SHARED is now an
alias to preserve header backward compatibility)
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The sentence is out of place, and probably doesn't really add to
the understanding already provided by the rest of the text.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Two new types kprobe and uprobe are being added to
perf_event_open(), which allow creating kprobe or
uprobe with perf_event_open. This patch adds
information about these types.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In Linux 2.0, do_mmap() had the following check:
    if (flags & MAP_DENYWRITE) {
        if (file->f_inode->i_writecount > 0)
            return -ETXTBSY;
    }
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
4.17+ kernels offer a new MAP_FIXED_NOREPLACE flag which allows
the caller to atomically probe for a given address range.
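A hedged sketch of how a caller might use the flag. The helper name
map_exactly_at() is hypothetical, and the fallback value for
MAP_FIXED_NOREPLACE is an assumption that matches x86 and most other
architectures when older libc headers lack the definition.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000    /* assumption: asm-generic value */
#endif

/*
 * Hypothetical helper: map 'len' bytes of anonymous memory at exactly
 * 'addr' without replacing anything already mapped there.
 * Returns 'addr' on success, or NULL with errno set on failure.
 */
void *map_exactly_at(void *addr, size_t len)
{
    void *p;

    assert(addr != NULL && len > 0);
    p = mmap(addr, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;        /* EEXIST: part of the range is in use */

    if (p != addr) {
        /* Kernels before 4.17 ignore the flag and treat 'addr' as a
           hint, so the result must be checked and undone. */
        munmap(p, len);
        errno = EEXIST;
        return NULL;
    }
    return p;
}
```

Checking the returned address is what keeps the helper usable on
kernels that do not know the flag.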
[wording heavily updated by John Hubbard]
Cowritten-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
These functions are nonstandard, but there is no replacement.
See https://bugzilla.kernel.org/show_bug.cgi?id=199215
Reported-by: Martin Mares <mj@ucw.cz>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As far as I can tell, these EBUSY errors disappeared
with the addition of stackable mounts in Linux 2.4.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
On a multiarch/multi-ABI platform such as modern x86, each
architecture/ABI (x86-64, x32, i386) has its own syscall numbers,
which means a seccomp() filter may see different syscall numbers
over the life of the process if that process uses execve() to
execute programs that have a different architecture/ABI.
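One common consequence is that a filter checks the 'arch' field of
struct seccomp_data before it looks at the syscall number. The sketch
below only builds such a filter prologue; the helper name
build_arch_guard() and the choice to allow everything after the check
are illustrative assumptions. Note that x32 shares AUDIT_ARCH_X86_64
with x86-64, so a real filter would also need to test
__X32_SYSCALL_BIT in the syscall number.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_RET_KILL_PROCESS
#define SECCOMP_RET_KILL_PROCESS 0x80000000U  /* fallback for old headers */
#endif

#define ARCH_GUARD_LEN 4

/*
 * Hypothetical helper: fill 'prog' with a tiny filter that kills the
 * process unless the system call was made using 'expected_arch'.
 * The final "allow" is a placeholder for per-syscall checks.
 */
void build_arch_guard(struct sock_filter prog[ARCH_GUARD_LEN],
                      __u32 expected_arch)
{
    assert(prog != NULL);

    /* A = seccomp_data.arch */
    prog[0] = (struct sock_filter)
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch));
    /* if (A == expected_arch) skip the kill instruction */
    prog[1] = (struct sock_filter)
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, expected_arch, 1, 0);
    prog[2] = (struct sock_filter)
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS);
    prog[3] = (struct sock_filter)
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
}
```

The resulting array can be wrapped in a struct sock_fprog and
installed with seccomp(SECCOMP_SET_MODE_FILTER, ...).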
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
It is not possible to consecutively stack mounts of the
same source+target inside the same mount namespace.
For example, if procfs was already mounted against /proc in
this mount namespace:
$ sudo mount -t proc none /proc
mount: /proc: none already mounted or mount point busy.
See the following code in fs/namespace.c:
    /* Refuse the same filesystem on the same mount point */
    err = -EBUSY;
    if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
        path->mnt->mnt_root == path->dentry)
            goto unlock;
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The "RETURN VALUE" section made a claim that was incorrect for
PTRACE_SECCOMP_GET_FILTER. Explicitly describe the behavior of
PTRACE_SECCOMP_GET_FILTER in the "RETURN VALUE" section (as
usual), but leave the now duplicate description in the section
describing PTRACE_SECCOMP_GET_FILTER, since the
PTRACE_SECCOMP_GET_FILTER section would otherwise probably become
harder to understand.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
If an error occurs after at least one message has been received,
the recvmmsg() call succeeds, and returns the number of messages
received. The error code is expected to be returned on a
subsequent call. In the current implementation, however, the
error code can be overwritten in the meantime by an unrelated
network event on a socket, for example an incoming ICMP packet.
If an error occurs after at least one message has been sent,
the sendmmsg() call succeeds, and returns the number of messages sent.
The error code is lost. The caller can retry the transmission,
starting at the first failed message, but there is no guarantee that,
if an error is returned, it will be the same as the one that was
lost on the previous call.
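The retry strategy mentioned above might look like the following
sketch. The helper name send_all_messages() is hypothetical; it simply
calls sendmmsg() again, starting at the first message that was not
sent.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: send all 'n' messages in 'msgs' on 'fd',
 * resuming after a short count.  Returns the number of messages sent.
 * If that is less than 'n', errno holds the error from the retry,
 * which is not guaranteed to be the error that caused the original
 * short count.
 */
int send_all_messages(int fd, struct mmsghdr *msgs, unsigned int n)
{
    unsigned int done = 0;

    assert(msgs != NULL);
    while (done < n) {
        int sent = sendmmsg(fd, msgs + done, n - done, 0);

        if (sent <= 0)
            break;          /* the first message of this batch failed */
        done += (unsigned int) sent;
    }
    return (int) done;
}
```

On return, the msg_len field of each sent entry holds the number of
bytes transmitted for that message.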
Reference:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/net/socket.c
Signed-off-by: Nikola Forró <nforro@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
When the user creates an unprivileged mount namespace, the Linux
kernel sets the MNT_LOCKED flag [1] on any submounts to prevent
such mounts from being unmounted inside the mount namespace. Such
an unmount would reveal the filesystem tree behind the mount,
which is not otherwise possible from an unprivileged vantage
point.
Attempting to unmount such a mount will fail with EINVAL. However,
a less obvious implication is that attempting a bind mount without
MS_REC, where the tree being bound contains locked sub-mounts,
will also fail with EINVAL, because, without MS_REC, such
submounts are effectively being unmounted.
Cursory googling shows several instances of people running into
this problem, so I felt it advantageous to have it documented in
the man page.
[1] 4fbd8d194f/fs/namespace.c (L1110-L1113)
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
If an advisory lock is lost, then read/write requests on any
affected file descriptor can return EIO - for NFSv4 at least.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
One last thing: reading through this, I think it might need a
wording fix (this is my fault), in order to avoid implying that
brk() or malloc() use dlopen().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
-- Expand the documentation to discuss the hazards in
enough detail to allow avoiding them.
-- Mention the upcoming MAP_FIXED_SAFE flag.
-- Enhance the alignment requirement slightly.
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Jann Horn <jannh@google.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: Michal Hocko <mhocko@kernel.org>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Cyril Hrubis <chrubis@suse.cz>
CC: Michal Hocko <mhocko@suse.com>
CC: Pavel Machek <pavel@ucw.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
MAP_FIXED has been widely used for a very long time, yet the man
page still claims that "the use of this option is discouraged".
The documentation assumes that "less portable" == "must be discouraged".
Instead of discouraging something that is so useful and widely used,
change the documentation to explain its limitations better.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
It makes no sense to describe this flag in two different
manual pages, so consolidate the description to one page.
Furthermore, the following statement that was in the prctl(2)
page is not correct:
A thread's effective capability set is always cleared
when such a credential change is made, regardless of
the setting of the "keep capabilities" flag.
The effective set is not cleared if, for example, the
credential sets were [ruid != 0, euid != 0, suid == 0]
and suid is switched to zero while the "keep capabilities"
flag is set.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As hinted in the kernel source, MAX_HANDLE_SZ is a hint
rather than a promise:
/* limit the handle size to NFSv4 handle size now */
#define MAX_HANDLE_SZ 128
Note the "now" (probably should be "for now").
So change the description to make this clear.
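One way for a caller to avoid depending on MAX_HANDLE_SZ is to ask the
kernel for the required size first. The following is a sketch under
that assumption; the helper name get_handle() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/*
 * Hypothetical helper: obtain a file handle for 'path' without
 * assuming that MAX_HANDLE_SZ is large enough.  On success, returns a
 * malloc()ed handle (the caller frees it) and stores the mount ID in
 * '*mount_id'.  On failure, returns NULL with errno set.
 */
struct file_handle *get_handle(const char *path, int *mount_id)
{
    struct file_handle probe = { .handle_bytes = 0 };
    struct file_handle *fh;

    assert(path != NULL && mount_id != NULL);

    /* With handle_bytes == 0 the call is expected to fail with
       EOVERFLOW and to report the required size in handle_bytes. */
    if (name_to_handle_at(AT_FDCWD, path, &probe, mount_id, 0) != -1) {
        errno = ENOTSUP;    /* unexpected: a zero-length handle */
        return NULL;
    }
    if (errno != EOVERFLOW)
        return NULL;

    fh = malloc(sizeof(*fh) + probe.handle_bytes);
    if (fh == NULL)
        return NULL;
    fh->handle_bytes = probe.handle_bytes;

    if (name_to_handle_at(AT_FDCWD, path, fh, mount_id, 0) == -1) {
        int saved = errno;

        free(fh);
        errno = saved;
        return NULL;
    }
    return fh;
}
```

Filesystems that do not support file handles make the first call fail
with EOPNOTSUPP, which the helper passes through unchanged.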
Reported-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The recent addition of NFS re-export and the possibility of using
name_to_handle_at() on an NFS filesystem raises issues with
name_to_handle_at() which have not been properly documented.
Getting the file handle for an untriggered automount point is
arguably meaningless and is certainly not supported by NFS.
name_to_handle_at() will return -EOVERFLOW even though the
requested "handle_bytes" is large enough. This is an unfortunate
overloading of the error code, but is manageable.
So clarify this and also note that the mount_id is returned when
EOVERFLOW is reported.
Thought: it would be nice if mount_id were returned in the
EOPNOTSUPP case too. I guess it is too late to fix that (?).
Link: https://github.com/systemd/systemd/issues/7082
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add more information about the iocb structure. It explains the
fields of the I/O control block structure which is passed to the
io_submit call.
The work also includes the nowait feature flags, which are currently
posted at http://marc.info/?l=linux-fsdevel&m=149664103900715&w=2
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Currently, the pkey_alloc() syscall has two arguments, and the very
first argument is still not supported as of kernel 4.14-rc8 and
should be set to zero, as shown in the following syscall
implementation:
    SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, ...)
    {
        int pkey;
        int ret;
        /* No flags supported yet. */
        if (flags)
            return -EINVAL;
This behaviour is also documented correctly in the kernel
documentation in Documentation/x86/protection-keys.txt.
The second argument is the one that should specify the page
access rights.
This patch fixes the manpage to describe how the code behaves.
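For illustration, a call with the arguments in the order described
above could be sketched like this. The helper name alloc_pkey() is
hypothetical, and the sketch assumes a glibc that provides the
pkey_alloc() wrapper (2.27 or later).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: allocate a protection key with the given
 * access rights.  Returns the key (>= 0) on success or a negated
 * errno value on failure (for example, -ENOSPC when the CPU or
 * kernel has no protection-key support).
 */
int alloc_pkey(unsigned int access_rights)
{
    int pkey;

    assert((access_rights &
            ~(unsigned int) (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)) == 0);

    /* First argument: flags, which must currently be zero.
       Second argument: the initial access rights for the key. */
    pkey = pkey_alloc(0, access_rights);
    return pkey == -1 ? -errno : pkey;
}
```

Swapping the two arguments, so that a nonzero value lands in 'flags',
is expected to fail rather than allocate a key.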
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add documentation for those new membarrier() commands:
MEMBARRIER_CMD_PRIVATE_EXPEDITED
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED
Adapt the MEMBARRIER_CMD_SHARED return value documentation to reflect
that it now returns -EINVAL when issued on a system configured for
nohz_full.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andrew Hunter <ahh@google.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The kernel defaults to either SECCOMP_RET_KILL_PROCESS
or SECCOMP_RET_KILL_THREAD for unrecognized filter
return action values.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
From linux/v4.14-rc6/source/net/ipv4/tcp.c:
    if (tp->fastopen_req)
        return -EALREADY; /* Another Fast Open is in progress */
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In Linux 4.14, the action component of the return value
switched from being 15 bits to being 16 bits. A new macro,
SECCOMP_RET_ACTION_FULL, that masks the 16 bits was added,
to replace the older SECCOMP_RET_ACTION.
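The effect of the wider mask can be shown with a few lines of
arithmetic. The helper names below are hypothetical; the fallback
definitions are there for pre-4.14 headers and use the values from the
4.14 UAPI header.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Fallbacks for headers older than Linux 4.14. */
#ifndef SECCOMP_RET_ACTION_FULL
#define SECCOMP_RET_ACTION_FULL 0xffff0000U
#endif
#ifndef SECCOMP_RET_KILL_PROCESS
#define SECCOMP_RET_KILL_PROCESS 0x80000000U
#endif

/* With the old 15-bit mask, the new "kill process" action is
   indistinguishable from the old "kill" action. */
static_assert((SECCOMP_RET_KILL_PROCESS & SECCOMP_RET_ACTION) ==
              SECCOMP_RET_KILL, "top bit is lost by the 15-bit mask");

/* Action part of a filter return value, using the 16-bit mask. */
uint32_t seccomp_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION_FULL;
}

/* Action part as computed with the older 15-bit mask. */
uint32_t seccomp_action_legacy(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION;
}

/* Data part (for example, the errno value for SECCOMP_RET_ERRNO). */
uint32_t seccomp_data_of(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}
```

Code that inspects filter return values and needs to recognize
SECCOMP_RET_KILL_PROCESS would therefore use SECCOMP_RET_ACTION_FULL.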
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Linux 4.14 added SECCOMP_RET_KILL_THREAD as a synonym for
SECCOMP_RET_KILL. Remove also the discussion of multithreaded
processes, since that will be addressed in the documentation
of SECCOMP_RET_KILL_PROCESS.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
vfork(2), getpid(2) and others which return pid_t already do this.
mtk: Additional info from Ahmad: <unistd.h> defines 'pid_t',
but only dependent on certain FTMs being defined.
Cc: linux-man@vger.kernel.org
Signed-off-by: Ahmad Fatoum <ahmad@a3f.at>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Based on an email discussion with Florian Weimer and
Adhemerval Zanella on the libc-alpha mailing list.
("Seccomp implications for glibc wrapper function changes",
7 Nov 2017).
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Document the seccomp /proc interfaces in Linux 4.14:
/proc/sys/kernel/seccomp/actions_avail and
/proc/sys/kernel/seccomp/actions_logged.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Point the reader at strace(1) as a way of discovering system calls
that might need to be filtered.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
From a conversation with Walter Harms:
> i am confused, i understand that:
> ss.ss_sp = malloc(SIGSTKSZ);
>
> ss.ss_size = SIGSTKSZ;
> ss.ss_flags = 0;
> if (sigaltstack(&ss, NULL) == -1)
>
> is equivalent to:
> ss.ss_sp = malloc(SIGSTKSZ);
>
> ss.ss_size = SIGSTKSZ;
> ss.ss_flags = SS_ONSTACK ;
> if (sigaltstack(&ss, NULL) == -1)
>
> but also to
> ss.ss_sp = malloc(SIGSTKSZ);
>
> ss.ss_size = SIGSTKSZ;
> ss.ss_flags = SS_ONSTACK | SOMETHING_FLAG ;
> if (sigaltstack(&ss, NULL) == -1)
>
> so the use of SS_ONSTACK would result in ss.ss_flags = 0 no matter what.
> OR
> SS_ONSTACK is a no-op in Linux
I see what you mean. The point is that, back then, SS_ONSTACK was
the only flag that could (on Linux) be specified in ss.ss_flags,
so that "SS_ONSTACK | SOMETHING_FLAG" was a nonexistent case.
These days, it's possible to specify the new SS_AUTODISARM
flag in ss.ss_flags, which I think is why you are doubtful
about the new page text.
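For reference, a sketch of the portable way to establish an alternate
stack, with ss_flags set to zero rather than SS_ONSTACK. The helper
name install_alt_stack() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdlib.h>

/*
 * Hypothetical helper: install a heap-allocated alternate signal
 * stack of 'size' bytes.  Returns 0 on success, -1 on failure.
 */
int install_alt_stack(size_t size)
{
    stack_t ss;

    assert((long) size >= (long) MINSIGSTKSZ);

    ss.ss_sp = malloc(size);
    if (ss.ss_sp == NULL)
        return -1;
    ss.ss_size = size;
    ss.ss_flags = 0;    /* zero, or SS_AUTODISARM where supported;
                           SS_ONSTACK is not needed here */

    if (sigaltstack(&ss, NULL) == -1) {
        free(ss.ss_sp);
        return -1;
    }
    return 0;
}
```

A typical call is install_alt_stack(SIGSTKSZ), followed by
sigaction() with SA_ONSTACK for the handlers that should use the
stack.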
Reported-by: Walter Harms <wharms@bfs.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
[mtk: The raw system calls use "unsigned int", but the glibc
wrappers have "int" for the 'flags' argument.]
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
People seem to be using "cf." ("confer"), which means "compare",
to mean "see" instead, for which the Latin abbreviation would be
"q.v." ("quod vide" -> "which see").
In some cases "cf." might actually be the correct term but it's
still not clear what specific aspects of a function/system call
one is supposed to be comparing.
I left one use in place in hope of obtaining clarification,
because it looks like it might be useful there, if contextualized.
Migrate these uses to English and add them to the list of
abbreviations to be avoided.
If the patch to vfork(2) is not accepted, then the cf. still needs
an \& after it because it is at the end of the line but not the
end of a sentence.
Signed-off-by: G. Branden Robinson <g.branden.robinson@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
It's more logical to use lstat() in the example code,
since one can then experiment with symbolic links, and
the S_IFLNK case can then also occur.
Reported-by: Richard Knutsson <richard.knutsson@abelko.se>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Userfaultfd feature UFFD_FEATURE_SIGBUS was merged recently and
should be available in the Linux 4.14 release. This patch is for
the man page changes documenting this API.
Documents the following commit:
commit 2d6d6f5a09a96cc1fec7ed992b825e05f64cb50e
Author: Prakash Sangappa <prakash.sangappa@oracle.com>
Date: Wed Sep 6 16:23:39 2017 -0700
mm: userfaultfd: add feature to request for a signal delivery
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
When referring to the architecture, consistently use "x86-64",
not "x86_64". Hitherto, there was a mixture of usages, with
"x86-64" predominant.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Combine two redundant paragraphs (one of which I recently
added) describing child_stack==NULL for the raw system call.
Also, make sure this text is in a more obvious place than
its previous location.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In the current man page for shmat(2)[1], there is no mention of
whether the memory address returned by shmat(2) is page-size
aligned. As that is quite important to many applications (e.g.,
those that use locks heavily and would like to avoid some locks by
relying on atomicity guarantees provided by the CPU), it would be
great to specify that for Linux.
I walked down the current implementation of shmat(2) in the latest
kernel src and found that shmat(2) does return a page size aligned
memory address:
    SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
    -> do_shmat(...)
    -> do_mmap_pgoff(...)
    -> do_mmap(...)
    -> get_unmapped_area(...)
    -> get_area(...) -> offset_in_page(addr)
There is an `offset_in_page(addr)' check at the end, and if it is
true, -EINVAL is returned, by which we can be sure that
shmat(2) will return a page size aligned memory address on success[2].
[1]: http://man7.org/linux/man-pages/man2/shmat.2.html
[2]: there is also a `offset_in_page(2)' in get_unmapped_area(...),
but that doesn't lead to -EINVAL...I am not sure whether the logic of
that code is right.
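The claim can also be checked empirically. The sketch below is only a
spot check on one attach, not a proof; the helper name
shmat_returns_page_aligned() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a private segment of 'size' bytes, let
 * the kernel choose the attach address, and test its alignment.
 * Returns 1 if the address is page aligned, 0 if it is not, and -1 if
 * the segment could not be created or attached.
 */
int shmat_returns_page_aligned(size_t size)
{
    int id, aligned;
    void *addr;
    long page;

    assert(size > 0);
    id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    addr = shmat(id, NULL, 0);
    /* Mark the segment for removal right away; it is destroyed after
       the last detach. */
    shmctl(id, IPC_RMID, NULL);
    if (addr == (void *) -1)
        return -1;

    page = sysconf(_SC_PAGESIZE);
    aligned = ((uintptr_t) addr % (uintptr_t) page) == 0;
    shmdt(addr);
    return aligned;
}
```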
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Expand and rework the text a little, in particular adding
a reference to sigreturn(2) as a source of further
information about the ucontext argument.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Since 4.13, errors from writeback are more reliably reported
to all file descriptors that might be relevant.
Add notes to this effect, and also add detail about ENOSPC and
EDQUOT which can be delayed in a similar manner to EIO - for NFS
in particular.
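In practice this means an application that cares about its data has
to check the results of write(), fsync(), and close(), since a delayed
EIO, ENOSPC, or EDQUOT may surface at any of them. A minimal sketch,
with the hypothetical helper name save_durably():

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 'len' bytes from 'buf' to 'path' and
 * force them to storage.  Returns 0 on success, otherwise the errno
 * value of the first call that failed.
 */
int save_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd, err = 0;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return errno;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n == -1) {
            if (errno == EINTR)
                continue;
            err = errno;
            break;
        }
        p += n;
        len -= (size_t) n;
    }

    /* A successful write() may only mean the data reached the page
       cache; writeback errors can be reported here instead. */
    if (err == 0 && fsync(fd) == -1)
        err = errno;
    if (close(fd) == -1 && err == 0)
        err = errno;
    return err;
}
```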
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The current text was confused (mea culpa). No signal is sent to
the init() process. Rather, depending on the 'cmd' given to
reboot(), the 'group_exit_code' value will be set to either SIGHUP or
SIGINT, with the effect that one of those signals is reported to
wait() in the parent process.
See https://bugzilla.kernel.org/show_bug.cgi?id=195899
Reported-by: Michał Zegan <webczat_200@poczta.onet.pl>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>