Doing so decreases the degree to which text is indented, and
thus avoids short, poorly wrapped lines.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Quoting Jann Horn:
[[
As discussed at
<https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=Q@mail.gmail.com>,
we need to re-check checkNotificationIdIsValid() after reading remote
memory but before using the read value in any way. Otherwise, the
syscall could in the meantime get interrupted by a signal handler, the
signal handler could return, and then the function that performed the
syscall could free() allocations or return (thereby freeing buffers on
the stack).
In essence, this pread() is (unavoidably) a potential use-after-free
read; and to make that not have any security impact, we need to check
whether UAF read occurred before using the read value. This should
probably be called out elsewhere in the manpage, too...
Now, of course, **reading** is the easy case. The difficult case is if
we have to **write** to the remote process... because then we can't
play games like that. If we write data to a freed pointer, we're
screwed, that's it. (And for somewhat unrelated bonus fun, consider
that /proc/$pid/mem is originally intended for process debugging,
including installing breakpoints, and will therefore happily write
over "readonly" private mappings, such as typical mappings of
executable code.)
So, uuuuh... I guess if anyone wants to actually write memory back to
the target process, we'd better come up with some dedicated API for
that, using an ioctl on the seccomp fd that magically freezes the
target process inside the syscall while writing to its memory, or
something like that? And until then, the manpage should have a big fat
warning that writing to the target's memory is simply not possible
(safely).
]]
and
<https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=Q@mail.gmail.com>:
[[
The second bit of trouble is that if the supervisor is so oblivious
that it doesn't realize that syscalls can be interrupted, it'll run
into other problems. Let's say the target process does something like
this:
int func(void) {
char pathbuf[4096];
sprintf(pathbuf, "/tmp/blah.%d", some_number);
mount("foo", pathbuf, ...);
}
and mount() is handled with a notification. If the supervisor just
reads the path string and immediately passes it into the real mount()
syscall, something like this can happen:
target: starts mount()
target: receives signal, aborts mount()
target: runs signal handler, returns from signal handler
target: returns out of func()
supervisor: receives notification
supervisor: reads path from remote buffer
supervisor: calls mount()
but because the stack allocation has already been freed by the time
the supervisor reads it, the supervisor just reads random garbage, and
beautiful fireworks ensue.
So the supervisor *fundamentally* has to be written to expect that at
*any* time, the target can abandon a syscall. And every read of remote
memory has to be separated from uses of that remote memory by a
notification ID recheck.
And at that point, I think it's reasonable to expect the supervisor to
also be able to handle that a syscall can be aborted before the
notification is delivered.
]]
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
- Rename the function that does the SECCOMP_IOCTL_NOTIF_ID_VALID
check.
- Make that function return a 'bool' rather than terminating the
process.
- Use that return value in the calling function.
- Rework/improve various related comments.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
pidfd_open(2) and pidfd_getfd(2) presumably have use cases
with the user-space notification feature.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
According to Tycho Andersen, he had no particular use case
in mind when building this detail into the API.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
And, as noted by Jann Horn, note how the user-space notification
mechanism causes a small breakage in the user-space API with
respect to nonrestartable system calls.
====
From the email discussion with Jann Horn
> >> So, I partially demonstrated what you describe here, for two example
> >> system calls (epoll_wait() and pause()). But I could not exactly
> >> demonstrate things as I understand you to be describing them. (So,
> >> I'm not sure whether I have not understood you correctly, or
> >> if things are not exactly as you describe them.)
> >>
> >> Here's a scenario (A) that I tested:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >> (epoll_wait() or pause(), both of which should never restart,
> >> regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor is sleeping (i.e., is not blocked in
> >> SECCOMP_IOCTL_NOTIF_RECV operation).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. SIGINT gets delivered to target; handler gets called;
> >> ***and syscall gets restarted by the kernel***
> >>
> >> That last should never happen, of course, and is a result of the
> >> combination of both the user-notify filter and the SA_RESTART flag.
> >> If one or other is not present, then the system call is not
> >> restarted.
> >>
> >> So, as you note below, the UAPI gets broken a little.
> >>
> >> However, from your description above I had understood that
> >> something like the following scenario (B) could occur:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >> (epoll_wait() or pause(), both of which should never restart,
> >> regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
> >> blocks).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. Supervisor gets seccomp user-space notification (i.e.,
> >> SECCOMP_IOCTL_NOTIF_RECV ioctl() returns
> >> 6. SIGINT gets delivered to target; handler gets called;
> >> and syscall gets restarted by the kernel
> >> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
> >> which gets another notification for the restarted system call.
> >>
> >> However, I don't observe such behavior. In step 6, the syscall
> >> does not get restarted by the kernel, but instead returns -1/EINTR.
> >> Perhaps I have misconstructed my experiment in the second case, or
> >> perhaps I've misunderstood what you meant, or is it possibly the
> >> case that things are not quite as you said?
>
> Thanks for the code, Jann (including the demo of the CLONE_FILES
> technique to pass the notification FD to the supervisor).
>
> But I think your code just demonstrates what I described in
> scenario A. So, it seems that I both understood what you
> meant (because my code demonstrates the same thing) and
> also misunderstood what you said (because I thought you
> were meaning something more like scenario B).
Ahh, sorry, I should've read your mail more carefully. Indeed, that
testcase only shows scenario A. But the following shows scenario B...
[Below, two pieces of code from Jann, with a lot of
cosmetic changes by mtk.]
====
[And from a follow-up in the same email thread:]
> If userspace relies on non-restarting behavior, it should be using
> something like epoll_pwait(). And that stuff only unblocks signals
> after we've already past the seccomp checks on entry.
Thanks for elaborating that detail, since as soon as you talked
about "enlarging a preexisting race" above, I immediately wondered
sigsuspend(), pselect(), etc.
(Mind you, I still wonder about the effect on system calls that
are normally nonrestartable because they have timeouts. My
understanding is that the kernel doesn't restart those system
calls because it's impossible for the kernel to restart the call
with the right timeout value. I wonder what happens when those
system calls are restarted in the scenario we're discussing.)
Anyway, returning to your point... So, to be clear (and to
quickly remind myself in case I one day reread this thread),
there is not a problem with sigsuspend(), pselect(), ppoll(),
and epoll_pwait() since:
* Before the syscall, signals are blocked in the target.
* Inside the syscall, signals are still blocked at the time
the check is made for seccomp filters.
* If a seccomp user-space notification event kicks, the target
is put to sleep with the signals still blocked.
* The signal will only get delivered after the supervisor either
triggers a spoofed success/failure return in the target or the
supervisor sends a CONTINUE response to the kernel telling it
to execute the target's system call. Either way, there won't be
any restarting of the target's system call (and the supervisor
thus won't see multiple notifications).
====
Scenario A
$ ./seccomp_unotify_restart_scen_A
C: installed seccomp: fd 3
C: woke 1 waiters
P: child installed seccomp fd 3
C: About to call pause(): Success
P: going to send SIGUSR1...
C: sigusr1_handler handler invoked
P: about to terminate
C: got pdeath signal on parent termination
C: about to terminate
/* Modified version of code from Jann Horn */
#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <limits.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>
struct {
int seccomp_fd;
} *shared;
static void
sigusr1_handler(int sig, siginfo_t * info, void *uctx)
{
printf("C: sigusr1_handler handler invoked\n");
}
static void
sigusr2_handler(int sig, siginfo_t * info, void *uctx)
{
printf("C: got pdeath signal on parent termination\n");
printf("C: about to terminate\n");
exit(0);
}
int
main(void)
{
setbuf(stdout, NULL);
/* Allocate memory that will be shared by parent and child */
shared = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
if (shared == MAP_FAILED)
err(1, "mmap");
shared->seccomp_fd = -1;
/* glibc's clone() wrapper doesn't support fork()-style usage */
/* Child process and parent share file descriptor table */
pid_t child = syscall(__NR_clone, CLONE_FILES | SIGCHLD,
NULL, NULL, NULL, 0);
if (child == -1)
err(1, "clone");
/* CHILD */
if (child == 0) {
/* don't outlive the parent */
prctl(PR_SET_PDEATHSIG, SIGUSR2);
if (getppid() == 1)
exit(0);
/* Install seccomp filter */
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
struct sock_filter insns[] = {
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pause, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};
struct sock_fprog prog = {
.len = sizeof(insns) / sizeof(insns[0]),
.filter = insns
};
int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
if (seccomp_ret < 0)
err(1, "install");
printf("C: installed seccomp: fd %d\n", seccomp_ret);
/* Place the notifier FD number into the shared memory */
__atomic_store(&shared->seccomp_fd, &seccomp_ret,
__ATOMIC_RELEASE);
/* Wake the parent */
int futex_ret =
syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
INT_MAX, NULL, NULL, 0);
printf("C: woke %d waiters\n", futex_ret);
/* Establish SA_RESTART handler for SIGUSR1 */
struct sigaction act = {
.sa_sigaction = sigusr1_handler,
.sa_flags = SA_RESTART | SA_SIGINFO
};
if (sigaction(SIGUSR1, &act, NULL))
err(1, "sigaction");
struct sigaction act2 = {
.sa_sigaction = sigusr2_handler,
.sa_flags = 0
};
if (sigaction(SIGUSR2, &act2, NULL))
err(1, "sigaction");
/* Make a blocking system call */
perror("C: About to call pause()");
pause();
perror("C: pause returned");
exit(0);
}
/* PARENT */
/* Wait for futex wake-up from child */
int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
-1, NULL, NULL, 0);
if (futex_ret == -1 && errno != EAGAIN)
err(1, "futex wait");
/* Get notification FD from the child */
int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
printf("\tP: child installed seccomp fd %d\n", fd);
sleep(1);
printf("\tP: going to send SIGUSR1...\n");
kill(child, SIGUSR1);
sleep(1);
printf("\tP: about to terminate\n");
exit(0);
}
====
Scenario B
$ ./seccomp_unotify_restart_scen_B
C: installed seccomp: fd 3
C: woke 1 waiters
C: About to call pause()
P: child installed seccomp fd 3
P: about to SECCOMP_IOCTL_NOTIF_RECV
P: got notif: id=17773741941218455591 pid=25052 nr=34
P: about to send SIGUSR1 to child...
P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
P: got notif: id=17773741941218455592 pid=25052 nr=34
P: about to send SIGUSR1 to child...
P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
P: got notif: id=17773741941218455593 pid=25052 nr=34
P: about to send SIGUSR1 to child...
P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
P: got notif: id=17773741941218455594 pid=25052 nr=34
P: about to send SIGUSR1 to child...
C: sigusr1_handler handler invoked
C: got pdeath signal on parent termination
C: about to terminate
/* Modified version of code from Jann Horn */
#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <string.h>
#include <limits.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>
struct {
int seccomp_fd;
} *shared;
static void
sigusr1_handler(int sig, siginfo_t * info, void *uctx)
{
printf("C: sigusr1_handler handler invoked\n");
}
static void
sigusr2_handler(int sig, siginfo_t * info, void *uctx)
{
printf("C: got pdeath signal on parent termination\n");
printf("C: about to terminate\n");
exit(0);
}
static size_t
max_size(size_t a, size_t b)
{
return (a > b) ? a : b;
}
int
main(void)
{
setbuf(stdout, NULL);
/* Allocate memory that will be shared by parent and child */
shared = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
if (shared == MAP_FAILED)
err(1, "mmap");
shared->seccomp_fd = -1;
/* glibc's clone() wrapper doesn't support fork()-style usage */
/* Child process and parent share file descriptor table */
pid_t child = syscall(__NR_clone, CLONE_FILES | SIGCHLD,
NULL, NULL, NULL, 0);
if (child == -1)
err(1, "clone");
/* CHILD */
if (child == 0) {
/* don't outlive the parent */
prctl(PR_SET_PDEATHSIG, SIGUSR2);
if (getppid() == 1)
exit(0);
/* Install seccomp filter */
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
struct sock_filter insns[] = {
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pause, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};
struct sock_fprog prog = {
.len = sizeof(insns) / sizeof(insns[0]),
.filter = insns
};
int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
if (seccomp_ret < 0)
err(1, "install");
printf("C: installed seccomp: fd %d\n", seccomp_ret);
/* Place the notifier FD number into the shared memory */
__atomic_store(&shared->seccomp_fd, &seccomp_ret,
__ATOMIC_RELEASE);
/* Wake the parent */
int futex_ret =
syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
INT_MAX, NULL, NULL, 0);
printf("C: woke %d waiters\n", futex_ret);
/* Establish SA_RESTART handler for SIGUSR1 */
struct sigaction act = {
.sa_sigaction = sigusr1_handler,
.sa_flags = SA_RESTART | SA_SIGINFO
};
if (sigaction(SIGUSR1, &act, NULL))
err(1, "sigaction");
struct sigaction act2 = {
.sa_sigaction = sigusr2_handler,
.sa_flags = 0
};
if (sigaction(SIGUSR2, &act2, NULL))
err(1, "sigaction");
/* Make a blocking system call */
printf("C: About to call pause()\n");
pause();
perror("C: pause returned");
exit(0);
}
/* PARENT */
/* Wait for futex wake-up from child */
int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
-1, NULL, NULL, 0);
if (futex_ret == -1 && errno != EAGAIN)
err(1, "futex wait");
/* Get notification FD from the child */
int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
printf("\tP: child installed seccomp fd %d\n", fd);
/* Discover seccomp buffer sizes and allocate notification buffer */
struct seccomp_notif_sizes sizes;
if (syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes))
err(1, "notif_sizes");
struct seccomp_notif *notif =
malloc(max_size(sizeof(struct seccomp_notif),
sizes.seccomp_notif));
if (!notif)
err(1, "malloc");
for (int i = 0; i < 4; i++) {
printf("\tP: about to SECCOMP_IOCTL_NOTIF_RECV\n");
memset(notif, '\0', sizes.seccomp_notif);
if (ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, notif))
err(1, "notif_recv");
printf("\tP: got notif: id=%llu pid=%u nr=%d\n",
notif->id, notif->pid, notif->data.nr);
sleep(1);
printf("\tP: about to send SIGUSR1 to child...\n");
kill(child, SIGUSR1);
}
sleep(1);
exit(0);
}
====
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In the usual case, read(fd, buf, PATH_MAX) will return PATH_MAX
bytes that include trailing garbage after the pathname. So the
right check is to scan from the start of the buffer to see if
there's a NUL, and error if there is not.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
After some discussions with Jann Horn, perhaps a better way of
dealing with an invalid target pathname is to trigger an
error for the system call.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
From a conversation with Jann Horn:
[[
>>>> struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
>>>
>>> This should probably do something like max(sizes.seccomp_notif_resp,
>>> sizeof(struct seccomp_notif_resp)) in case the program was built
>>> against new UAPI headers that make struct seccomp_notif_resp big, but
>>> is running under an old kernel where that struct is still smaller?
>>
>> I'm confused. Why? I mean, if the running kernel says that it expects
>> a buffer of a certain size, and we allocate a buffer of that size,
>> what's the problem?
>
> Because in userspace, we cast the result of malloc() to a "struct
> seccomp_notif_resp *". If the kernel tells us that it expects a size
> smaller than sizeof(struct seccomp_notif_resp), then we end up with a
> pointer to a struct that consists partly of allocated memory, partly
> of out-of-bounds memory, which is generally a bad idea - I'm not sure
> whether the C standard permits that. And if userspace then e.g.
> decides to access some member of that struct that is beyond what the
> kernel thinks is the struct size, we get actual OOB memory accesses.
Got it. (But gosh, this seems like a fragile API mess.)
I added the following to the code:
/* When allocating the response buffer, we must allow for the fact
that the user-space binary may have been built with user-space
headers where 'struct seccomp_notif_resp' is bigger than the
response buffer expected by the (older) kernel. Therefore, we
allocate a buffer that is the maximum of the two sizes. This
ensures that if the supervisor places bytes into the response
structure that are past the response size that the kernel expects,
then the supervisor is not touching an invalid memory location. */
size_t resp_size = sizes.seccomp_notif_resp;
if (sizeof(struct seccomp_notif_resp) > resp_size)
resp_size = sizeof(struct seccomp_notif_resp);
struct seccomp_notif_resp *resp = malloc(resp_size);
]]
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
From a conversation with Jann Horn:
>> We should probably make sure here that the value we read is actually
>> NUL-terminated?
>
> So, I was curious about that point also. But, (why) are we not
> guaranteed that it will be NUL-terminated?
Because it's random memory filled by another process, which we don't
necessarily trust. While seccomp notifiers aren't usable for applying
*extra* security restrictions, the supervisor will still often be more
privileged than the supervised process.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Change "read(2) will return 0" to "read(2) may return 0".
Quoting Jann Horn:
Maybe make that "may return 0" instead of "will return 0" -
reading from /proc/$pid/mem can only return 0 in the
following cases AFAICS:
1. task->mm was already gone at open() time
2. mm->mm_users has dropped to zero (the mm only has lazytlb
users; page tables and VMAs are being blown away or have
been blown away)
3. the syscall was called with length 0
When a process has gone away, normally mm->mm_users will
drop to zero, but someone else could theoretically still be
holding a reference to the mm (e.g. someone else in the
middle of accessing /proc/$pid/mem). (Such references
should normally not be very long-lived though.)
Additionally, in the unlikely case that the OOM killer just
chomped through the page tables of the target process, I
think the read will return -EIO (same error as if the
address was simply unmapped) if the address is within a
non-shared mapping. (Maybe that's something procfs could do
better...)
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add some strongly worded text warning the reader about the correct
uses of seccomp user-space notification.
Reported-by: Jann Horn <jannh@google.com>
Cowritten-by: Christian Brauner <christian@brauner.io>
Cowritten-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The APIs used by this mechanism comprise not only seccomp(2), but
also a number of ioctl(2) operations. And any useful example
demonstrating these APIs is will necessarily be rather long.
Trying to cram all of this into the seccomp(2) page would make
that page unmanageably long. Therefore, let's document this
mechanism in a separate page.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The existing text says the structures (plural!) contain a 'struct
seccomp_data'. But this is only true for the received notification
structure (seccomp_notif). So, reword the sentence to be more
general, noting simply that the structures may evolve over time.
Add some comments to the structure definition.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Rework the description a little, and note that the close-on-exec
flag is set for the returned file descriptor.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I can't see a reason to include it. <fcntl.h> provides O_*
constants for 'flags', S_* constants for 'mode', and mode_t.
Probably a long time ago, some of those weren't defined in
<fcntl.h>, and both headers needed to be included, or maybe it's
a historical error.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Document also why each header is required
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This function doesn't use any flags or special types, so there's
no reason to include <asm/unistd.h>; remove it. Add the includes
needed for syscall(2) only.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Also document why each header is needed.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This error can occur if the caller is does not have CAP_IPC_LOCK
and is not a member of the sysctl_hugetlb_shm_group.
Reported-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Explain also why headers are needed.
And some ffix.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
It is only used for providing 'sigset_t'. We're only documenting
(with some exceptions) the includes needed for constants and the
prototype itself. And 'sigset_t' is better documented in
system_data_types(7). Remove that include.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
All of the constants used by mknod() are defined in <sys/stat.h>.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
AFAICS, there's no use for <unistd.h> here. The prototype is
declared in <sys/mman.h>, and there are no constants needed.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Remove the libkeyutils prototype from the synopsis, which isn't
documented in the rest of the page, and as NOTES says, it's
probably better to use the various library functions.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The constants needed for using this function are defined in
<linux/ipc.h>. Add the include, even when those constants are not
mentioned in this manual page.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Of course that is for the glibc wrapper. As all of the other
pages that don't explicitly say otherwise.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In this case there's a wrapper provided by libaio,
but this page documents the raw syscall.
Also remove <linux/time.h> from the includes: 'struct timespec'
is already documented in system_data_types(7), where the
information is more up to date.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In this case there's a wrapper provided by libaio,
but this page documents the raw syscall.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
At the same time, document only headers that are required
for calling the function, or those that are specific to the
function:
<unistd.h> is required for the syscall() prototype.
<sys/syscall.h> is required for the syscall name SYS_xxx.
<linux/futex.h> is specific to this syscall.
However, uint32_t is generic enough that it shouldn't be
documented here. The system_data_types(7) page already documents
it, and is more precise about it. The same goes for timespec.
As a general rule a man[23] page should document the header that
includes the prototype, and all of the headers that define macros
that should be used with the call. However, the information about
types should be restricted to system_data_types(7) (and that page
should probably be improved by adding types), except for types
that are very specific to the call. Otherwise, we're duplicating
info and it's then harder to maintain, and probably outdated in
the future.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This complements commit e3eba861bd.
Since we don't need syscall(2) anymore, we don't need SYS_* definitions.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
AT_EMPTY_PATH works with empty strings (""), but not with NULL
(or at least it's not obvious).
The relevant kernel code is the following:
linux$ sed -n 189,198p fs/namei.c
result->refcnt = 1;
/* The empty path is special. */
if (unlikely(!len)) {
if (empty)
*empty = 1;
if (!(flags & LOOKUP_EMPTY)) {
putname(result);
return ERR_PTR(-ENOENT);
}
}
Reported-by: Walter Harms <wharms@bfs.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
'C library/kernel differences' was added to BUGS incorrectly.
Fix it
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Despite my mention of this spawning a hilarious discussion
on IRC, this alignment restriction should be 128-bit, not
126-bit.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add a missing "to" in an "in order to" formulation.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
CIFS flock() locks behave differently than the standard. Give overview
of those differences.
Here is the rendered text:
CIFS details
In Linux kernels up to 5.4, flock() is not propagated over SMB. A file
with such locks will not appear locked for remote clients.
Since Linux 5.5, flock() locks are emulated with SMB byte-range locks
on the entire file. Similarly to NFS, this means that fcntl(2) and
flock() locks interact with one another. Another important side-effect
is that the locks are not advisory anymore: any IO on a locked file
will always fail with EACCES when done from a separate file descriptor.
This difference originates from the design of locks in the SMB proto-
col, which provides mandatory locking semantics.
Remote and mandatory locking semantics may vary with SMB protocol,
mount options and server type. See mount.cifs(8) for additional infor-
mation.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Discussion: linux-man <https://lore.kernel.org/linux-man/20210302154831.17000-1-aaptel@suse.com/>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In b0b19983d9 we removed
<sys/types.h>. For the same reasons there, remove now <sys/ipc.h>
from many pages.
If someone wonders why <sys/ipc.h> was needed, the reason was to
get all the definitions of IPC_* constants. However, that header
is now included by <sys/msg.h>, so it's not needed anymore to
explicitly include it. Quoting POSIX: "In addition, the
<sys/msg.h> header shall include the <sys/ipc.h> header."
There were some remaining cases where I forgot to remove
<sys/types.h>; remove them now too.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
AFAICS, all types and constants used by these functions are
defined in <fcntl.h>.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
A few architectures have a different call signature for pipe().
Since those architectures are the minority, place the prototype
at the end of the SYNOPSIS, rather than the start.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Userfaultfd write-protect mode is supported starting from Linux 5.7.
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
[alx: ffix + srcfix]
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
UFFD_FEATURE_THREAD_ID is supported in Linux 4.14.
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Write-protect mode is supported starting from Linux 5.7.
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
UFFD_FEATURE_THREAD_ID is supported since Linux 4.14.
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
[alx: srcfix]
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In this case there's a wrapper provided by libaio,
but this page documents the raw kernel syscall.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
<linux/unistd.h> is not needed. We need <unistd.h> for syscall(),
and <sys/syscall.h> for SYS_exit_group.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add <linux/fcntl.h>, which contains AT_* definitions used by
execveat().
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The CLONE_* constants seem to be available from either
<linux/sched.h> or <sched.h>, and since clone3() already
includes <linux/sched.h> for 'struct clone_args', <sched.h>
is not really needed, AFAICS; however, to avoid confusion,
I also included <sched.h> for clone3() for consistency:
clone() is getting CLONE_* from <sched.h>, and it would confuse
the reader if clone3() got the same CLONE_* constants from a
different header.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
AFAICS, there's no reason to include that.
All of the macros that this function uses
are already defined in the other headers.
Cc: glibc <libc-alpha@sourceware.org>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The page didn't specify includes, and the syscalls are extinct, so
instead of adding incomplete information about includes, just
leave it without any includes.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Only the include that provides the prototype doesn't need a comment.
Also sort the includes alphabetically.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Only the one that provides the prototype doesn't need a comment.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
<linux/fs.h> doesn't seem to be needed!
Only the include that provides the prototype doesn't need a comment.
Also sort the includes alphabetically.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Only the include that provides the prototype doesn't need a comment.
Also sort the includes alphabetically.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Only the one that provides the prototype doesn't need a comment.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
[mtk: Alex's change switches the comment to the more generally used
form "Definition of..."]
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
<sys/time.h> is not needed to get the function declaration nor any
constant used by the function. It was only needed (before
POSIX.1) to get 'struct timeval', but that information would be
more suited for system_data_types(7), and not for this page.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
<sys/time.h> is not required by any of the function declarations
or macro definitions used by these functions. It may be (or maybe
not) needed by some type inside the rlimit structure, but that
info belongs in system_data_types(7), not here.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
<sys/types.h> was only needed for size_t, AFAIK. That is already
(and more precisely) documented in system_data_types(7). Let's
remove it here, as it's not really needed for calling add_key().
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I couldn't find a reason for including <unistd.h>. All the macros
used by fcntl() are defined in <fcntl.h>. For comparison, FreeBSD
and OpenBSD don't specify <unistd.h> in their manual pages.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This function never returns to its caller.
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In Linux kernel 5.12, a new mode flag, MPOL_F_NUMA_BALANCING, is
added to set_mempolicy() to optimize the page placement among the
NUMA nodes with the NUMA balancing mechanism even if the memory of
the applications is bound with MPOL_BIND. This patch updates the
man page for the new mode flag.
Related kernel commits:
bda420b985054a3badafef23807c4b4fa38a3dff
[mtk: Minor fixes to commit message]
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Michael Kerrisk" <mtk.manpages@gmail.com>
[ alx: srcfix ]
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add a sentence explaining what dup2() does in terms of file
descriptors and open file descriptions.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Sometimes people are confused, thinking a file descriptor is just a
number. To help avoid such confusions, add text highlighting that
a file descriptor is an index to an entry in the process's FD table.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As can be seen by any number of StackOverflow questions, people
persistently misunderstand what dup() does, and the existing manual
page text, which talks of "copying" a file descriptor doesn't help.
Rewrite the text a little to try to prevent some of these
misunderstandings, in particular noting at the start that dup()
allocates a new file descriptor.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
close_range() CLOSE_RANGE_USHARE triggers a call to dup_fd()
which in turn calls alloc_fdtable(), which checks that
sysctl_nr_open has not been exceeded.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The current example program can't really be used to demonstrate the
effect of close_range(). Replace it by a program that does show the
effect of this system call.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This documents close_range(2) based on information in
278a5fbaed89dacd04e9d052f4594ffd0e0585de,
60997c3d45d9a67daf01c56d805ae4fec37e0bd8, and
582f1fb6b721facf04848d2ca57f34468da1813e.
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Stephen Kitt <steve@sk2.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The manual pages are already inconsistent in which headers need
to be included. Right now, not all of the types used by a
function have their required header included in the SYNOPSIS.
If we were to add the headers required by all of the types used by
functions, the SYNOPSIS would grow too much. Not only it would
grow too much, but the information there would be less precise.
Having system_data_types(7) document each type with all the
information about required includes is much more precise, and the
info is centralized so that it's much easier to maintain.
So let's document only the include required for the function
prototype, and also the ones required for the macros needed to
call the function.
<sys/types.h> only defines types, not functions or constants, so
it doesn't belong to man[23] (function) pages at all.
I ignore if some old systems had headers that required you to
include <sys/types.h> *before* them (incomplete headers), but if
so, those implementations would be broken, and those headers
should probably provide some kind of warning. I hope this is not
the case.
[mtk: Already in 2001, POSIX.1 removed the requirement to
include <sys/types.h> for many APIs, so this patch seems
well past due.]
Acked-by: Zack Weinberg <zackw@panix.com>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
RESOLVE_CACHED allows an application to attempt a cache-only open
of a file. If this isn't possible, the request will fail with
-1/EAGAIN and the caller should retry without RESOLVE_CACHED set.
This will generally happen from a different context, where a slower
open operation can be performed.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>