Compare commits

78 Commits

Author SHA1 Message Date
Michael Kerrisk 911789ee76 seccomp_unotify.2: Add caveats regarding emulation of blocking system calls
Reported-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 1b5592f534 seccomp_unotify.2: Reformat ioctls as subsections rather than hanging list
Doing so decreases the degree to which text is indented, and
thus avoids short, poorly wrapped lines.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk d1c8db825a seccomp_unotify.2: Document the SECCOMP_IOCTL_NOTIF_ADDFD ioctl()
Starting from some notes by Sargun Dhillon.

Reported-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk c13b1b2bdd seccomp_unotify.2: EXAMPLES: simplify logic in getTargetPathname()
And reword some comments there.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk f8899e1c88 seccomp_unotify.2: EXAMPLES: fix a file descriptor leak
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 8760bd15a1 seccomp_unotify.2: EXAMPLES: some code modularity improvements
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 8bae56c220 seccomp_unotify.2: Minor cleanup fix
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 40fdc84999 seccomp_unotify.2: Change name of SECCOMP_IOCTL_NOTIF_ID_VALID function
Give this function a shorter, slightly easier to read name.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk b4763b6e61 seccomp_unotify.2: Fixes after review comments from Christian Brauner
Reported-by: Christian Brauner <christian.brauner@canonical.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk a46a1879c5 seccomp_unotify.2: A cookie check is also required after reading target's memory
Quoting Jann Horn:

[[
As discussed at
<https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=Q@mail.gmail.com>,
we need to re-check checkNotificationIdIsValid() after reading remote
memory but before using the read value in any way. Otherwise, the
syscall could in the meantime get interrupted by a signal handler, the
signal handler could return, and then the function that performed the
syscall could free() allocations or return (thereby freeing buffers on
the stack).

In essence, this pread() is (unavoidably) a potential use-after-free
read; and to make that not have any security impact, we need to check
whether UAF read occurred before using the read value. This should
probably be called out elsewhere in the manpage, too...

Now, of course, **reading** is the easy case. The difficult case is if
we have to **write** to the remote process... because then we can't
play games like that. If we write data to a freed pointer, we're
screwed, that's it. (And for somewhat unrelated bonus fun, consider
that /proc/$pid/mem is originally intended for process debugging,
including installing breakpoints, and will therefore happily write
over "readonly" private mappings, such as typical mappings of
executable code.)

So, uuuuh... I guess if anyone wants to actually write memory back to
the target process, we'd better come up with some dedicated API for
that, using an ioctl on the seccomp fd that magically freezes the
target process inside the syscall while writing to its memory, or
something like that? And until then, the manpage should have a big fat
warning that writing to the target's memory is simply not possible
(safely).
]]

and
<https://lore.kernel.org/r/CAG48ez0m4Y24ZBZCh+Tf4ORMm9_q4n7VOzpGjwGF7_Fe8EQH=Q@mail.gmail.com>:

[[
The second bit of trouble is that if the supervisor is so oblivious
that it doesn't realize that syscalls can be interrupted, it'll run
into other problems. Let's say the target process does something like
this:

int func(void) {
  char pathbuf[4096];
  sprintf(pathbuf, "/tmp/blah.%d", some_number);
  mount("foo", pathbuf, ...);
}

and mount() is handled with a notification. If the supervisor just
reads the path string and immediately passes it into the real mount()
syscall, something like this can happen:

target: starts mount()
target: receives signal, aborts mount()
target: runs signal handler, returns from signal handler
target: returns out of func()
supervisor: receives notification
supervisor: reads path from remote buffer
supervisor: calls mount()

but because the stack allocation has already been freed by the time
the supervisor reads it, the supervisor just reads random garbage, and
beautiful fireworks ensue.

So the supervisor *fundamentally* has to be written to expect that at
*any* time, the target can abandon a syscall. And every read of remote
memory has to be separated from uses of that remote memory by a
notification ID recheck.

And at that point, I think it's reasonable to expect the supervisor to
also be able to handle that a syscall can be aborted before the
notification is delivered.
]]
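
The rule quoted above (every read of remote memory is separated from its use by a notification ID recheck) might be sketched as follows. This is a minimal illustration rather than the code from the manual page: the helper names notif_id_is_valid() and read_target_checked() are hypothetical, and mem_fd is assumed to be an open file descriptor for the target's /proc/PID/mem.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical helper: ask the kernel whether the notification 'id'
   still refers to a system call that is blocked in the target. */
static bool
notif_id_is_valid(int notify_fd, __u64 id)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}

/* Hypothetical helper: read 'len' bytes at 'addr' from the target via
   'mem_fd' (assumed to refer to /proc/PID/mem), then recheck the
   notification ID *before* the caller may use the data.
   Returns the number of bytes read, or -1 if the read failed, returned
   no data, or the notification is no longer valid (in which case the
   contents of 'buf' must be treated as garbage). */
static ssize_t
read_target_checked(int notify_fd, __u64 id, int mem_fd, off_t addr,
                    void *buf, size_t len)
{
    ssize_t n = pread(mem_fd, buf, len, addr);
    if (n <= 0)
        return -1;

    /* The target may have abandoned the system call while we were
       reading; if so, what we read may be freed or reused memory. */
    if (!notif_id_is_valid(notify_fd, id))
        return -1;

    return n;
}
```

A supervisor would call read_target_checked() with the 'id' field of the received notification and act on the buffer only when the result is positive. As the quoted text notes, no comparable recheck makes writing to the target's memory safe.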

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 8742c19c9f seccomp_unotify.2: wfix
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 589a15959f seccomp_unotify.2: EXAMPLES: make SECCOMP_IOCTL_NOTIF_ID_VALID function return bool
- Rename the function that does the SECCOMP_IOCTL_NOTIF_ID_VALID
  check.
- Make that function return a 'bool' rather than terminating the
  process.
- Use that return value in the calling function.
- Rework/improve various related comments.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 6f0ca7da71 seccomp_unotify.2: EXAMPLES: Improve comments describing checkNotificationIdIsValid()
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 8a7703864c seccomp_unotify.2: EXAMPLES: make getTargetPathname() a bit more generically useful
Allow the caller to specify which system call argument should
be looked up as a pathname.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk dbcc2ad691 seccomp_unotify.2: SEE ALSO: add pidfd_open(2) and pidfd_getfd(2)
pidfd_open(2) and pidfd_getfd(2) presumably have use cases
with the user-space notification feature.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 9688ea78cb seccomp_unotify.2: NOTES: describe an example use-case
The container manager use case was the original motivation
for this feature.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk b8360d3701 seccomp_unotify.2: Remove FIXME asking about usefulness of POLLOUT/EPOLLOUT
According to Tycho Andersen, he had no particular use case
in mind when building this detail into the API.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk b183b6503c seccomp_unotify.2: srcfix: Add a further FIXME relating to SA_RESTART behavior
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 191350e602 seccomp_unotify.2: Various fixes after review comments from Kees Cook
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 16ba7af469 seccomp_unotify.2: Update a FIXME
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 7a27538327 cmsg.3, unix.7: Refer to seccomp_unotify(2) for an example of SCM_RIGHTS usage
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 730a8d48d1 signal.7: Add reference to seccomp_unotify(2)
The seccomp user-space notification feature can cause changes in
the semantics of SA_RESTART with respect to system calls that
would never normally be restarted. Point the reader to the page
that provides further details.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk fd1295e8f1 seccomp_unotify.2: Describe the interaction with SA_RESTART signal handlers
And, as noted by Jann Horn, note how the user-space notification
mechanism causes a small breakage in the user-space API with
respect to nonrestartable system calls.

====

From the email discussion with Jann Horn

> >> So, I partially demonstrated what you describe here, for two example
> >> system calls (epoll_wait() and pause()). But I could not exactly
> >> demonstrate things as I understand you to be describing them. (So,
> >> I'm not sure whether I have not understood you correctly, or
> >> if things are not exactly as you describe them.)
> >>
> >> Here's a scenario (A) that I tested:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >>    (epoll_wait() or pause(), both of which should never restart,
> >>    regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor is sleeping (i.e., is not blocked in
> >>    SECCOMP_IOCTL_NOTIF_RECV operation).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. SIGINT gets delivered to target; handler gets called;
> >>    ***and syscall gets restarted by the kernel***
> >>
> >> That last should never happen, of course, and is a result of the
> >> combination of both the user-notify filter and the SA_RESTART flag.
> >> If one or other is not present, then the system call is not
> >> restarted.
> >>
> >> So, as you note below, the UAPI gets broken a little.
> >>
> >> However, from your description above I had understood that
> >> something like the following scenario (B) could occur:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >>    (epoll_wait() or pause(), both of which should never restart,
> >>    regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
> >>    blocks).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. Supervisor gets seccomp user-space notification (i.e.,
> >>    SECCOMP_IOCTL_NOTIF_RECV ioctl() returns)
> >> 6. SIGINT gets delivered to target; handler gets called;
> >>    and syscall gets restarted by the kernel
> >> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
> >>    which gets another notification for the restarted system call.
> >>
> >> However, I don't observe such behavior. In step 6, the syscall
> >> does not get restarted by the kernel, but instead returns -1/EINTR.
> >> Perhaps I have misconstructed my experiment in the second case, or
> >> perhaps I've misunderstood what you meant, or is it possibly the
> >> case that things are not quite as you said?
>
> Thanks for the code, Jann (including the demo of the CLONE_FILES
> technique to pass the notification FD to the supervisor).
>
> But I think your code just demonstrates what I described in
> scenario A. So, it seems that I both understood what you
> meant (because my code demonstrates the same thing) and
> also misunderstood what you said (because I thought you
> were meaning something more like scenario B).

Ahh, sorry, I should've read your mail more carefully. Indeed, that
testcase only shows scenario A. But the following shows scenario B...

[Below, two pieces of code from Jann, with a lot of
cosmetic changes by mtk.]

====

[And from a follow-up in the same email thread:]

> If userspace relies on non-restarting behavior, it should be using
> something like epoll_pwait(). And that stuff only unblocks signals
> after we've already passed the seccomp checks on entry.
Thanks for elaborating that detail, since as soon as you talked
about "enlarging a preexisting race" above, I immediately wondered
about sigsuspend(), pselect(), etc.

(Mind you, I still wonder about the effect on system calls that
are normally nonrestartable because they have timeouts. My
understanding is that the kernel doesn't restart those system
calls because it's impossible for the kernel to restart the call
with the right timeout value. I wonder what happens when those
system calls are restarted in the scenario we're discussing.)

Anyway, returning to your point... So, to be clear (and to
quickly remind myself in case I one day reread this thread),
there is not a problem with sigsuspend(), pselect(), ppoll(),
and epoll_pwait() since:

* Before the syscall, signals are blocked in the target.
* Inside the syscall, signals are still blocked at the time
  the check is made for seccomp filters.
* If a seccomp user-space notification event kicks in, the target
  is put to sleep with the signals still blocked.
* The signal will only get delivered after the supervisor either
  triggers a spoofed success/failure return in the target or the
  supervisor sends a CONTINUE response to the kernel telling it
  to execute the target's system call. Either way, there won't be
  any restarting of the target's system call (and the supervisor
  thus won't see multiple notifications).
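
The first point in the list above (signals stay blocked until the system call itself atomically unblocks them) can be illustrated without any seccomp filter. The following is a minimal sketch; the function name deliver_only_inside_sigsuspend() is hypothetical, and SIGUSR1 is chosen only to match the test programs below.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>

static volatile sig_atomic_t handler_calls;

static void
on_sigusr1(int sig)
{
    (void) sig;
    handler_calls++;
}

/* Block SIGUSR1, make it pending, and then let sigsuspend() unblock
   it atomically. Returns the number of handler invocations (expected
   to be 1), or -1 on an unexpected error. */
static int
deliver_only_inside_sigsuspend(void)
{
    struct sigaction sa = {
        .sa_handler = on_sigusr1,
        .sa_flags = SA_RESTART
    };
    sigset_t block, orig, wait_mask;

    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    /* Before the system call: the signal is blocked */

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &orig) == -1)
        return -1;

    handler_calls = 0;
    raise(SIGUSR1);             /* Stays pending while blocked */
    if (handler_calls != 0)
        return -1;

    /* The signal is delivered only once sigsuspend() installs the
       temporary mask; the call then fails with EINTR, even though
       the handler was established with SA_RESTART. */

    wait_mask = orig;
    sigdelset(&wait_mask, SIGUSR1);
    if (sigsuspend(&wait_mask) != -1 || errno != EINTR)
        return -1;

    if (sigprocmask(SIG_SETMASK, &orig, NULL) == -1)
        return -1;
    return handler_calls;
}
```

Because the handler cannot run before sigsuspend() swaps the mask, no window exists in which the signal could interrupt the call at an earlier point, which is the property the list above relies on.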

====

Scenario A

$ ./seccomp_unotify_restart_scen_A
C: installed seccomp: fd 3
C: woke 1 waiters
	P: child installed seccomp fd 3
C: About to call pause(): Success
	P: going to send SIGUSR1...
C: sigusr1_handler handler invoked
	P: about to terminate
C: got pdeath signal on parent termination
C: about to terminate

/* Modified version of code from Jann Horn */

#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <limits.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>

struct {
    int seccomp_fd;
} *shared;

static void
sigusr1_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: sigusr1_handler handler invoked\n");
}

static void
sigusr2_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: got pdeath signal on parent termination\n");
    printf("C: about to terminate\n");
    exit(0);
}

int
main(void)
{
    setbuf(stdout, NULL);

    /* Allocate memory that will be shared by parent and child */

    shared = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (shared == MAP_FAILED)
        err(1, "mmap");
    shared->seccomp_fd = -1;

    /* glibc's clone() wrapper doesn't support fork()-style usage */
    /* Child process and parent share file descriptor table */

    pid_t child = syscall(__NR_clone, CLONE_FILES | SIGCHLD,
                          NULL, NULL, NULL, 0);
    if (child == -1)
        err(1, "clone");

    /* CHILD */

    if (child == 0) {
        /* don't outlive the parent */
        prctl(PR_SET_PDEATHSIG, SIGUSR2);

        if (getppid() == 1)
            exit(0);

        /* Install seccomp filter */

        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        struct sock_filter insns[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pause, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
        };
        struct sock_fprog prog = {
            .len = sizeof(insns) / sizeof(insns[0]),
            .filter = insns
        };
        int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
        if (seccomp_ret < 0)
            err(1, "install");
        printf("C: installed seccomp: fd %d\n", seccomp_ret);

        /* Place the notifier FD number into the shared memory */

        __atomic_store(&shared->seccomp_fd, &seccomp_ret,
                       __ATOMIC_RELEASE);

        /* Wake the parent */

        int futex_ret =
            syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
                    INT_MAX, NULL, NULL, 0);
        printf("C: woke %d waiters\n", futex_ret);

        /* Establish SA_RESTART handler for SIGUSR1 */

        struct sigaction act = {
            .sa_sigaction = sigusr1_handler,
            .sa_flags = SA_RESTART | SA_SIGINFO
        };
        if (sigaction(SIGUSR1, &act, NULL))
            err(1, "sigaction");

        struct sigaction act2 = {
            .sa_sigaction = sigusr2_handler,
            .sa_flags = 0
        };
        if (sigaction(SIGUSR2, &act2, NULL))
            err(1, "sigaction");

        /* Make a blocking system call */

        perror("C: About to call pause()");
        pause();
        perror("C: pause returned");

        exit(0);
    }

    /* PARENT */

    /* Wait for futex wake-up from child */

    int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
                            -1, NULL, NULL, 0);
    if (futex_ret == -1 && errno != EAGAIN)
        err(1, "futex wait");

    /* Get notification FD from the child */

    int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
    printf("\tP: child installed seccomp fd %d\n", fd);

    sleep(1);

    printf("\tP: going to send SIGUSR1...\n");
    kill(child, SIGUSR1);

    sleep(1);
    printf("\tP: about to terminate\n");

    exit(0);
}

====

Scenario B

$ ./seccomp_unotify_restart_scen_B
C: installed seccomp: fd 3
C: woke 1 waiters
C: About to call pause()
	P: child installed seccomp fd 3
	P: about to SECCOMP_IOCTL_NOTIF_RECV
	P: got notif: id=17773741941218455591 pid=25052 nr=34
	P: about to send SIGUSR1 to child...
	P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
	P: got notif: id=17773741941218455592 pid=25052 nr=34
	P: about to send SIGUSR1 to child...
	P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
	P: got notif: id=17773741941218455593 pid=25052 nr=34
	P: about to send SIGUSR1 to child...
	P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
	P: got notif: id=17773741941218455594 pid=25052 nr=34
	P: about to send SIGUSR1 to child...
C: sigusr1_handler handler invoked
C: got pdeath signal on parent termination
C: about to terminate

/* Modified version of code from Jann Horn */

#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <string.h>
#include <limits.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>

struct {
    int seccomp_fd;
} *shared;

static void
sigusr1_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: sigusr1_handler handler invoked\n");
}

static void
sigusr2_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: got pdeath signal on parent termination\n");
    printf("C: about to terminate\n");
    exit(0);
}

static size_t
max_size(size_t a, size_t b)
{
    return (a > b) ? a : b;
}

int
main(void)
{
    setbuf(stdout, NULL);

    /* Allocate memory that will be shared by parent and child */

    shared = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (shared == MAP_FAILED)
        err(1, "mmap");
    shared->seccomp_fd = -1;

    /* glibc's clone() wrapper doesn't support fork()-style usage */
    /* Child process and parent share file descriptor table */
    pid_t child = syscall(__NR_clone, CLONE_FILES | SIGCHLD,
                          NULL, NULL, NULL, 0);
    if (child == -1)
        err(1, "clone");

    /* CHILD */

    if (child == 0) {
        /* don't outlive the parent */
        prctl(PR_SET_PDEATHSIG, SIGUSR2);
        if (getppid() == 1)
            exit(0);

        /* Install seccomp filter */

        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        struct sock_filter insns[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pause, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
        };
        struct sock_fprog prog = {
            .len = sizeof(insns) / sizeof(insns[0]),
            .filter = insns
        };
        int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
        if (seccomp_ret < 0)
            err(1, "install");
        printf("C: installed seccomp: fd %d\n", seccomp_ret);

        /* Place the notifier FD number into the shared memory */

        __atomic_store(&shared->seccomp_fd, &seccomp_ret,
                       __ATOMIC_RELEASE);

        /* Wake the parent */

        int futex_ret =
            syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
                    INT_MAX, NULL, NULL, 0);
        printf("C: woke %d waiters\n", futex_ret);

        /* Establish SA_RESTART handler for SIGUSR1 */

        struct sigaction act = {
            .sa_sigaction = sigusr1_handler,
            .sa_flags = SA_RESTART | SA_SIGINFO
        };
        if (sigaction(SIGUSR1, &act, NULL))
            err(1, "sigaction");

        struct sigaction act2 = {
            .sa_sigaction = sigusr2_handler,
            .sa_flags = 0
        };
        if (sigaction(SIGUSR2, &act2, NULL))
            err(1, "sigaction");

        /* Make a blocking system call */

        printf("C: About to call pause()\n");
        pause();
        perror("C: pause returned");

        exit(0);
    }

    /* PARENT */

    /* Wait for futex wake-up from child */

    int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
                            -1, NULL, NULL, 0);
    if (futex_ret == -1 && errno != EAGAIN)
        err(1, "futex wait");

    /* Get notification FD from the child */

    int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
    printf("\tP: child installed seccomp fd %d\n", fd);

    /* Discover seccomp buffer sizes and allocate notification buffer */

    struct seccomp_notif_sizes sizes;
    if (syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes))
        err(1, "notif_sizes");
    struct seccomp_notif *notif =
        malloc(max_size(sizeof(struct seccomp_notif),
                        sizes.seccomp_notif));
    if (!notif)
        err(1, "malloc");

    for (int i = 0; i < 4; i++) {
        printf("\tP: about to SECCOMP_IOCTL_NOTIF_RECV\n");
        memset(notif, '\0', sizes.seccomp_notif);
        if (ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, notif))
            err(1, "notif_recv");
        printf("\tP: got notif: id=%llu pid=%u nr=%d\n",
               notif->id, notif->pid, notif->data.nr);
        sleep(1);
        printf("\tP: about to send SIGUSR1 to child...\n");
        kill(child, SIGUSR1);
    }
    sleep(1);

    exit(0);
}

====

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 1661264841 seccomp_unotify.2: EXAMPLE: correct the check for NUL in buffer returned by read()
In the usual case, read(fd, buf, PATH_MAX) will return PATH_MAX
bytes that include trailing garbage after the pathname. So the
right check is to scan from the start of the buffer to see if
there's a NUL, and error if there is not.
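
A minimal sketch of the check described above, assuming 'nread' is the value returned by read(). The helper name buffer_has_nul() is hypothetical and is not the name used in the manual page example.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Return true if the first 'nread' bytes of 'buf' contain a NUL byte,
   that is, if 'buf' holds a complete, terminated string. Scanning
   starts at the beginning of the buffer, so trailing garbage after
   the terminator does not matter. */
static bool
buffer_has_nul(const char *buf, ssize_t nread)
{
    if (nread <= 0)
        return false;
    return memchr(buf, '\0', (size_t) nread) != NULL;
}
```

A caller would treat a false result as an invalid pathname and must not pass the buffer to any function that expects a terminated string.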

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk d1774d6af8 seccomp_unotify.2: Better handling of invalid target pathname
After some discussions with Jann Horn, perhaps a better way of
dealing with an invalid target pathname is to trigger an
error for the system call.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 47056412d7 seccomp_unotify.2: EXAMPLE: rename a variable
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 2f37aeb620 seccomp_unotify.2: EXAMPLE: Improve allocation of response buffer
From a conversation with Jann Horn:

[[
>>>>            struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
>>>
>>> This should probably do something like max(sizes.seccomp_notif_resp,
>>> sizeof(struct seccomp_notif_resp)) in case the program was built
>>> against new UAPI headers that make struct seccomp_notif_resp big, but
>>> is running under an old kernel where that struct is still smaller?
>>
>> I'm confused. Why? I mean, if the running kernel says that it expects
>> a buffer of a certain size, and we allocate a buffer of that size,
>> what's the problem?
>
> Because in userspace, we cast the result of malloc() to a "struct
> seccomp_notif_resp *". If the kernel tells us that it expects a size
> smaller than sizeof(struct seccomp_notif_resp), then we end up with a
> pointer to a struct that consists partly of allocated memory, partly
> of out-of-bounds memory, which is generally a bad idea - I'm not sure
> whether the C standard permits that. And if userspace then e.g.
> decides to access some member of that struct that is beyond what the
> kernel thinks is the struct size, we get actual OOB memory accesses.
Got it. (But gosh, this seems like a fragile API mess.)

I added the following to the code:

    /* When allocating the response buffer, we must allow for the fact
       that the user-space binary may have been built with user-space
       headers where 'struct seccomp_notif_resp' is bigger than the
       response buffer expected by the (older) kernel. Therefore, we
       allocate a buffer that is the maximum of the two sizes. This
       ensures that if the supervisor places bytes into the response
       structure that are past the response size that the kernel expects,
       then the supervisor is not touching an invalid memory location. */

    size_t resp_size = sizes.seccomp_notif_resp;
    if (sizeof(struct seccomp_notif_resp) > resp_size)
        resp_size = sizeof(struct seccomp_notif_resp);

    struct seccomp_notif_resp *resp = malloc(resp_size);
]]

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk bf892a6527 seccomp_unotify.2: EXAMPLE: ensure path read() by the supervisor is null-terminated
From a conversation with Jann Horn:

    >> We should probably make sure here that the value we read is actually
    >> NUL-terminated?
    >
    > So, I was curious about that point also. But, (why) are we not
    > guaranteed that it will be NUL-terminated?

    Because it's random memory filled by another process, which we don't
    necessarily trust. While seccomp notifiers aren't usable for applying
    *extra* security restrictions, the supervisor will still often be more
    privileged than the supervised process.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk e4db7ae69d seccomp_unotify.2: wfix in example program
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 5c12cebdf2 seccomp_unotify.2: Small wording fix
Change "read(2) will return 0" to "read(2) may return 0".

Quoting Jann Horn:

    Maybe make that "may return 0" instead of "will return 0" -
    reading from /proc/$pid/mem can only return 0 in the
    following cases AFAICS:

    1. task->mm was already gone at open() time
    2. mm->mm_users has dropped to zero (the mm only has lazytlb
       users; page tables and VMAs are being blown away or have
       been blown away)
    3. the syscall was called with length 0

    When a process has gone away, normally mm->mm_users will
    drop to zero, but someone else could theoretically still be
    holding a reference to the mm (e.g. someone else in the
    middle of accessing /proc/$pid/mem).  (Such references
    should normally not be very long-lived though.)

    Additionally, in the unlikely case that the OOM killer just
    chomped through the page tables of the target process, I
    think the read will return -EIO (same error as if the
    address was simply unmapped) if the address is within a
    non-shared mapping. (Maybe that's something procfs could do
    better...)

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk e06808b4b1 seccomp_unotify.2: Minor wording change + add a FIXME
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk bcfeed7d4e seccomp_unotify.2: User-space notification can't be used to implement security policy
Add some strongly worded text warning the reader about the correct
uses of seccomp user-space notification.

Reported-by: Jann Horn <jannh@google.com>
Cowritten-by: Christian Brauner <christian@brauner.io>
Cowritten-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 03e4237409 seccomp_unotify.2: Fixes after review comments from Christian Brauner
Reported-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk fd376c6b2a seccomp.2, seccomp_unotify.2: Clarify that there can be only one SECCOMP_FILTER_FLAG_NEW_LISTENER
Reported-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk cd3224b7df seccomp_unotify.2: Note when FD indicates EOF/(E)POLLHUP in (e)poll/select
Verified by experiment.
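
For readers who want to repeat this kind of experiment, the following minimal sketch shows how a hangup indication can be detected with poll(2). The helper name fd_reports_hangup() is hypothetical; in a quick test, the read end of a pipe whose write end has been closed can stand in for the notification file descriptor.

```c
#include <poll.h>

/* Poll 'fd' for up to 'timeout_ms' milliseconds.
   Returns 1 if POLLHUP is indicated, 0 if it is not,
   and -1 if poll() itself failed. */
static int
fd_reports_hangup(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    int n = poll(&pfd, 1, timeout_ms);
    if (n == -1)
        return -1;

    /* POLLHUP is reported in 'revents' even though it was not
       requested in 'events'. */
    return (n == 1 && (pfd.revents & POLLHUP)) ? 1 : 0;
}
```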

Reported-by: Christian Brauner <christian.brauner@canonical.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 6048506c77 seccomp_unotify.2: Note when notification FD indicates as writable by select/poll/epoll
Reported-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk ea4d03e6b0 seccomp_unotify.2: Minor fixes
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk a08715b41e seccomp_unotify.2: Fixes after review comments by Jann Horn
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk d85217eff7 seccomp_unotify.2: Add BUGS section describing SECCOMP_IOCTL_NOTIF_RECV bug
Tycho Andersen confirmed that this issue is present.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 72a8602617 seccomp_unotify.2: srcfix: remove bogus FIXME
Pathname arguments are limited to PATH_MAX bytes.

Reported-by: Tycho Andersen <tycho@tycho.pizza>
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 391194cd52 seccomp_unotify.2: Changes after feedback from Tycho Andersen
Reported-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk a9a8e35644 seccomp_unotify.2: Document the seccomp user-space notification mechanism
The APIs used by this mechanism comprise not only seccomp(2), but
also a number of ioctl(2) operations. And any useful example
demonstrating these APIs will necessarily be rather long.
Trying to cram all of this into the seccomp(2) page would make
that page unmanageably long. Therefore, let's document this
mechanism in a separate page.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 0a86ac9c9b seccomp.2: Note that SECCOMP_RET_USER_NOTIF can be overridden
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 8459e46597 seccomp.2: wfix: mention term "supervisor" in description of SECCOMP_RET_USER_NOTIF
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 1741f7fc2e seccomp.2: SEE ALSO: add seccomp_unotify(2)
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk 2bbe9bd9ae seccomp.2: Rework SECCOMP_GET_NOTIF_SIZES somewhat
The existing text says the structures (plural!) contain a 'struct
seccomp_data'. But this is only true for the received notification
structure (seccomp_notif). So, reword the sentence to be more
general, noting simply that the structures may evolve over time.

Add some comments to the structure definition.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk b723c6d8dd seccomp.2: Add some details for SECCOMP_FILTER_FLAG_NEW_LISTENER
Rework the description a little, and note that the close-on-exec
flag is set for the returned file descriptor.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:17 +12:00
Michael Kerrisk d7a3918456 seccomp.2: Minor edits to Tycho's SECCOMP_FILTER_FLAG_NEW_LISTENER patch
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:16 +12:00
Tycho Andersen b9395f4a3e seccomp.2: Document SECCOMP_FILTER_FLAG_NEW_LISTENER
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:16 +12:00
Michael Kerrisk 8fa47f3ae4 seccomp.2: Reorder list of SECCOMP_SET_MODE_FILTER flags alphabetically
(No content changes.)

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:16 +12:00
Michael Kerrisk 3bed246e7e seccomp.2: Some reworking of Tycho's SECCOMP_RET_USER_NOTIF patch
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:16 +12:00
Tycho Andersen c734bbd265 seccomp.2: Document SECCOMP_RET_USER_NOTIF
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:16 +12:00
Michael Kerrisk 6fc8b8a0a1 seccomp.2: Minor edits to Tycho Andersen's patch
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:16 +12:00
Tycho Andersen 9bc48145a6 seccomp.2: Document SECCOMP_GET_NOTIF_SIZES
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:40:16 +12:00
Michael Kerrisk 408483bd31 socketcall.2: srcfix
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar b6687e3971 socketcall.2: Use syscall(SYS_...); for system calls without a wrapper
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar 1b4d275a0e sigprocmask.2: Use syscall(SYS_...); for raw system calls
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar aa03a4e732 shmop.2: Remove unused include
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar 1cd36d9dea sgetmask.2: Use syscall(SYS_...); for system calls without a wrapper
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar 18e21e1e4c set_tid_address.2: Use syscall(SYS_...); for system calls without a wrapper
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar ba4d34a16d set_thread_area.2: Use syscall(SYS_...); for system calls without a wrapper
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar 9202a1eb8e rt_sigqueueinfo.2: Use syscall(SYS_...); for system calls without a wrapper
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar 5e9623f3b9 open.2: Remove unused <sys/stat.h>
I can't see a reason to include it.  <fcntl.h> provides O_*
constants for 'flags', S_* constants for 'mode', and mode_t.

Probably a long time ago, some of those weren't defined in
<fcntl.h>, and both headers needed to be included, or maybe it's
a historical error.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Michael Kerrisk 0ba6b2966c system_data_types.7: Minor enhancement of description of mode_t
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar 0ace616cf8 mode_t.3: New link to system_data_types(7)
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar e0b6220511 system_data_types.7: Add 'mode_t'
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar 6c2508dc6f blksize_t.3: New link to system_data_types(7)
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar 111ad1edd5 system_data_types.7: Add 'blksize_t'
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar acb5994605 cc_t.3: New link to system_data_types(7)
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar f71cb14dcb system_data_types.7: Add 'cc_t'
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar d9e9879139 blkcnt_t.3: New link to system_data_types(7)
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
Alejandro Colomar 8d1df7f260 system_data_types.7: Add 'blkcnt_t'
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:46 +12:00
dann frazier 9d39058523 kernel_lockdown.7: Remove additional text alluding to lifting via SysRq
My previous patch intended to drop the docs for the lockdown lift
SysRq, but it missed this other section that refers to lifting it
via a keyboard - an allusion to that same SysRq.

Signed-off-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:37:17 +12:00
dann frazier a989677777 kernel_lockdown.7: Remove description of lifting via SysRq (not upstream)
The patch that implemented lockdown lifting via SysRq ended up
getting dropped[*] before the feature was merged upstream. Having
the feature documented but unsupported has caused some confusion
for our users.

[*] http://archive.lwn.net:8080/linux-kernel/CACdnJuuxAM06TcnczOA6NwxhnmQUeqqm3Ma8btukZpuCS+dOqg@mail.gmail.com/

Signed-off-by: dann frazier <dann.frazier@canonical.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Pedro Principeza <pedro.principeza@canonical.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Kyle McMartin <kyle@redhat.com>
Cc: Matthew Garrett <mjg59@google.com>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:33:07 +12:00
Alejandro Colomar 336ae0d258 Makefile, README: Break installation into a target for each mandir
Instead of having a monolithic 'make install', break it into
multiple targets such as 'make install-man3'.  This simplifies
packaging, for example in Debian, where they break this project
into several packages: 'manpages' and 'manpages-dev', each
containing different mandirs.

The above allows for parallel installation: 'make -j'

Also, don't overwrite files that don't need to be overwritten, by
having a target for files, which makes use of make's timestamp
comparison.

This allows for much faster installation times.

For comparison, on my laptop (i7-8850H; 6C/12T):

Old Makefile:
	~/src/linux/man-pages$ time sudo make >/dev/null

	real	0m7.509s
	user	0m5.269s
	sys	0m2.614s

	The times with the old Makefile varied a lot, between
	5 and 10 seconds.  The times after applying this patch
	are much more consistent.  BTW, I compared these times to
	the very old Makefile of man-pages-5-09, and those were
	around 3.5 s, so it was a bit of my fault to have such a
	slow Makefile, when I changed the Makefile some weeks ago.

New Makefile (full clean install):
	~/src/linux/man-pages$ time sudo make >/dev/null

	real	0m5.160s
	user	0m4.326s
	sys	0m1.137s
	~/src/linux/man-pages$ time sudo make -j2 >/dev/null

	real	0m1.602s
	user	0m2.529s
	sys	0m0.289s
	~/src/linux/man-pages$ time sudo make -j >/dev/null

	real	0m1.398s
	user	0m2.502s
	sys	0m0.281s

	Here we can see that 'make -j' drops times drastically,
	compared to the old monolithic Makefile.  Not only that,
	but since few pages are involved when we are working on
	the man pages, times will be even better.

	Here are some times with a single page changed (touched):

New Makefile (one page touched):
	~/src/linux/man-pages$ touch man2/membarrier.2
	~/src/linux/man-pages$ time sudo make install
	-	INSTALL	/usr/local/share/man/man2/membarrier.2

	real	0m0.988s
	user	0m0.966s
	sys	0m0.025s
	~/src/linux/man-pages$ touch man2/membarrier.2
	~/src/linux/man-pages$ time sudo make install -j
	-	INSTALL	/usr/local/share/man/man2/membarrier.2

	real	0m0.989s
	user	0m0.943s
	sys	0m0.049s

Also, modify the output of the make install and uninstall commands
so that a line is output for each file or directory that is
installed, similarly to the kernel's Makefile.  This doesn't apply
to html targets, which haven't been changed in this commit.

Also, make sure that for each invocation of $(INSTALL_DIR), no
parents are created (i.e., avoid `mkdir -p` behavior).  The GNU
make manual states that it can create race conditions.  Instead,
declare as a prerequisite for each directory its parent directory,
and let make resolve the order of creation.

Also, use ':=' instead of '=' to improve performance, by
evaluating each assignment only once.

Ensure that the shell is not called when not needed, by removing
all ";" and quotes in the commands.

See also: <https://stackoverflow.com/q/67862417/6872717>

Specify conventions and rationales used in the Makefile in a comment.

Add copyright.

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-06-10 10:32:59 +12:00
Michael Kerrisk 14987c153f setresuid.2: tfix (Oxford comma)
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-05-21 20:19:28 +12:00
Michael Kerrisk e4a403876d select.2: Strengthen the warning regarding the low value of FD_SETSIZE
All modern code should avoid select(2) in favor of poll(2)
or epoll(7).

For a long history of this problem, see:

https://marc.info/?l=bugtraq&m=110660879328901
    List:       bugtraq
    Subject:    SECURITY.NNOV: Multiple applications fd_set structure bitmap array index overflow
    From:       3APA3A <3APA3A () security ! nnov ! ru>
    Date:       2005-01-24 20:30:08

https://sourceware.org/legacy-ml/libc-alpha/2003-05/msg00171.html
    User-settable FD_SETSIZE and select()
    From: mtk-lists at gmx dot net
    To: libc-alpha at sources dot redhat dot com
    Date: Mon, 19 May 2003 14:49:03 +0200 (MEST)
    Subject: User-settable FD_SETSIZE and select()

https://sourceware.org/bugzilla/show_bug.cgi?id=10352

http://0pointer.net/blog/file-descriptor-limits.html
https://twitter.com/pid_eins/status/1394962183033868292

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-05-20 11:00:11 +12:00
Michael Kerrisk 2a1ba6ae7f select.2: Relocate sentence about the fd_set value-result arguments to BUGS
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2021-05-20 09:49:09 +12:00
23 changed files with 2457 additions and 124 deletions

251
Makefile

@ -1,26 +1,210 @@
# Do not print "Entering directory ..."
########################################################################
# Copyright (C) 2021 Alejandro Colomar <alx.manpages@gmail.com>
# SPDX-License-Identifier: GPL-2.0 OR LGPL-2.0
########################################################################
# Conventions:
#
# - Follow "Makefile Conventions" from the "GNU Coding Standards" closely.
# However, when something could be improved, don't follow those.
# - Uppercase variables, when referring to files, refer to files in this repo.
# - Lowercase variables, when referring to files, refer to system files.
# - Variables starting with '_' refer to absolute paths, including $(DESTDIR).
# - Variables ending with '_' refer to a subdir of their parent dir, which
# is in a variable of the same name but without the '_'. The subdir is
# named after this project: <*/man>.
# - Variables ending in '_rm' refer to files that can be removed (exist).
# - Variables ending in '_rmdir' refer to dirs that can be removed (exist).
# - Targets of the form '%-rm' remove their corresponding file '%'.
# - Targets of the form '%/.-rmdir' remove their corresponding dir '%/'.
# - Targets of the form '%/.' create their corresponding directory '%/'.
# - Every file or directory to be created depends on its parent directory.
# This avoids race conditions caused by `mkdir -p`. Only the root
# directories are created with parents.
# - The 'FORCE' target is used to make phony some variables that can't be
# .PHONY to avoid some optimizations.
#
########################################################################
MAKEFLAGS += --no-print-directory
MAKEFLAGS += --silent
MAKEFLAGS += --warn-undefined-variables
htmlbuilddir = $(CURDIR)/.html
HTOPTS =
DESTDIR =
prefix = /usr/local
datarootdir = $(prefix)/share
docdir = $(datarootdir)/doc
mandir = $(datarootdir)/man
htmldir = $(docdir)
htmldir_ = $(htmldir)/man
htmlext = .html
htmlbuilddir := $(CURDIR)/.html
HTOPTS :=
DESTDIR :=
prefix := /usr/local
datarootdir := $(prefix)/share
docdir := $(datarootdir)/doc
MANDIR := $(CURDIR)
mandir := $(datarootdir)/man
MAN1DIR := $(MANDIR)/man1
MAN2DIR := $(MANDIR)/man2
MAN3DIR := $(MANDIR)/man3
MAN4DIR := $(MANDIR)/man4
MAN5DIR := $(MANDIR)/man5
MAN6DIR := $(MANDIR)/man6
MAN7DIR := $(MANDIR)/man7
MAN8DIR := $(MANDIR)/man8
man1dir := $(mandir)/man1
man2dir := $(mandir)/man2
man3dir := $(mandir)/man3
man4dir := $(mandir)/man4
man5dir := $(mandir)/man5
man6dir := $(mandir)/man6
man7dir := $(mandir)/man7
man8dir := $(mandir)/man8
manext := \.[0-9]
man1ext := .1
man2ext := .2
man3ext := .3
man4ext := .4
man5ext := .5
man6ext := .6
man7ext := .7
man8ext := .8
htmldir := $(docdir)
htmldir_ := $(htmldir)/man
htmlext := .html
INSTALL := install
INSTALL_DATA := $(INSTALL) -m 644
INSTALL_DIR := $(INSTALL) -m 755 -d
RM := rm
RMDIR := rmdir --ignore-fail-on-non-empty
MAN_SECTIONS := 1 2 3 4 5 6 7 8
INSTALL = install
INSTALL_DATA = $(INSTALL) -m 644
INSTALL_DIR = $(INSTALL) -m 755 -d
.PHONY: all
all:
$(MAKE) uninstall;
$(MAKE) install;
$(MAKE) uninstall
$(MAKE) install
%/.:
$(info - INSTALL $(@D))
$(INSTALL_DIR) $(@D)
%-rm:
$(info - RM $*)
$(RM) $*
%-rmdir:
$(info - RMDIR $(@D))
$(RMDIR) $(@D)
.PHONY: install
install: install-man | installdirs
@:
.PHONY: installdirs
installdirs: | installdirs-man
@:
.PHONY: uninstall remove
uninstall remove: uninstall-man
@:
.PHONY: clean
clean:
find man?/ -type f \
|while read f; do \
rm -f "$(htmlbuilddir)/$$f".*; \
done;
########################################################################
# man
MANPAGES := $(sort $(shell find $(MANDIR)/man?/ -type f | grep '$(manext)$$'))
_manpages := $(patsubst $(MANDIR)/%,$(DESTDIR)$(mandir)/%,$(MANPAGES))
_man1pages := $(filter %$(man1ext),$(_manpages))
_man2pages := $(filter %$(man2ext),$(_manpages))
_man3pages := $(filter %$(man3ext),$(_manpages))
_man4pages := $(filter %$(man4ext),$(_manpages))
_man5pages := $(filter %$(man5ext),$(_manpages))
_man6pages := $(filter %$(man6ext),$(_manpages))
_man7pages := $(filter %$(man7ext),$(_manpages))
_man8pages := $(filter %$(man8ext),$(_manpages))
MANDIRS := $(sort $(shell find $(MANDIR)/man? -type d))
_mandirs := $(patsubst $(MANDIR)/%,$(DESTDIR)$(mandir)/%/.,$(MANDIRS))
_man1dir := $(filter %man1/.,$(_mandirs))
_man2dir := $(filter %man2/.,$(_mandirs))
_man3dir := $(filter %man3/.,$(_mandirs))
_man4dir := $(filter %man4/.,$(_mandirs))
_man5dir := $(filter %man5/.,$(_mandirs))
_man6dir := $(filter %man6/.,$(_mandirs))
_man7dir := $(filter %man7/.,$(_mandirs))
_man8dir := $(filter %man8/.,$(_mandirs))
_mandir := $(DESTDIR)$(mandir)/.
_manpages_rm := $(addsuffix -rm,$(wildcard $(_manpages)))
_man1pages_rm := $(filter %$(man1ext)-rm,$(_manpages_rm))
_man2pages_rm := $(filter %$(man2ext)-rm,$(_manpages_rm))
_man3pages_rm := $(filter %$(man3ext)-rm,$(_manpages_rm))
_man4pages_rm := $(filter %$(man4ext)-rm,$(_manpages_rm))
_man5pages_rm := $(filter %$(man5ext)-rm,$(_manpages_rm))
_man6pages_rm := $(filter %$(man6ext)-rm,$(_manpages_rm))
_man7pages_rm := $(filter %$(man7ext)-rm,$(_manpages_rm))
_man8pages_rm := $(filter %$(man8ext)-rm,$(_manpages_rm))
_mandirs_rmdir := $(addsuffix -rmdir,$(wildcard $(_mandirs)))
_man1dir_rmdir := $(addsuffix -rmdir,$(wildcard $(_man1dir)))
_man2dir_rmdir := $(addsuffix -rmdir,$(wildcard $(_man2dir)))
_man3dir_rmdir := $(addsuffix -rmdir,$(wildcard $(_man3dir)))
_man4dir_rmdir := $(addsuffix -rmdir,$(wildcard $(_man4dir)))
_man5dir_rmdir := $(addsuffix -rmdir,$(wildcard $(_man5dir)))
_man6dir_rmdir := $(addsuffix -rmdir,$(wildcard $(_man6dir)))
_man7dir_rmdir := $(addsuffix -rmdir,$(wildcard $(_man7dir)))
_man8dir_rmdir := $(addsuffix -rmdir,$(wildcard $(_man8dir)))
_mandir_rmdir := $(addsuffix -rmdir,$(wildcard $(_mandir)))
install_manX := $(foreach x,$(MAN_SECTIONS),install-man$(x))
installdirs_manX := $(foreach x,$(MAN_SECTIONS),installdirs-man$(x))
uninstall_manX := $(foreach x,$(MAN_SECTIONS),uninstall-man$(x))
.SECONDEXPANSION:
$(_manpages): $(DESTDIR)$(mandir)/man%: $(MANDIR)/man% | $$(@D)/.
$(info - INSTALL $@)
$(INSTALL_DATA) -T $< $@
$(_mandirs): %/.: | $$(dir %). $(_mandir)
$(_mandirs_rmdir): $(DESTDIR)$(mandir)/man%/.-rmdir: $$(_man%pages_rm) FORCE
$(_mandir_rmdir): $(uninstall_manX) FORCE
.PHONY: $(install_manX)
$(install_manX): install-man%: $$(_man%pages) | installdirs-man%
@:
.PHONY: install-man
install-man: $(install_manX)
@:
.PHONY: $(installdirs_manX)
$(installdirs_manX): installdirs-man%: $$(_man%dir) $(_mandir)
@:
.PHONY: installdirs-man
installdirs-man: $(installdirs_manX)
@:
.PHONY: $(uninstall_manX)
$(uninstall_manX): uninstall-man%: $$(_man%pages_rm) $$(_man%dir_rmdir)
@:
.PHONY: uninstall-man
uninstall-man: $(_mandir_rmdir) $(uninstall_manX)
@:
########################################################################
# html
# Use with
# make HTOPTS=whatever html
@ -57,28 +241,6 @@ installdirs-html:
$(INSTALL_DIR) "$(DESTDIR)$(htmldir_)/$$d" || exit $$?; \
done;
.PHONY: install
install: | installdirs
find man?/ -type f \
|while read f; do \
$(INSTALL_DATA) -T "$$f" "$(DESTDIR)$(mandir)/$$f" || exit $$?; \
done;
.PHONY: installdirs
installdirs:
find man?/ -type d \
|while read d; do \
$(INSTALL_DIR) "$(DESTDIR)$(mandir)/$$d" || exit $$?; \
done;
.PHONY: uninstall remove
uninstall remove:
find man?/ -type f \
|while read f; do \
rm -f "$(DESTDIR)$(mandir)/$$f" || exit $$?; \
rm -f "$(DESTDIR)$(mandir)/$$f".* || exit $$?; \
done;
.PHONY: uninstall-html
uninstall-html:
find man?/ -type f \
@ -86,12 +248,9 @@ uninstall-html:
rm -f "$(DESTDIR)$(htmldir_)/$$f".* || exit $$?; \
done;
.PHONY: clean
clean:
find man?/ -type f \
|while read f; do \
rm -f "$(htmlbuilddir)/$$f".* || exit $$?; \
done;
########################################################################
# tests
# Check if groff reports warnings (may be words of sentences not displayed)
# from https://lintian.debian.org/tags/groff-message.html
@ -109,3 +268,7 @@ check-groff-warnings:
# someone might also want to look at /var/catman/cat2 or so ...
# a problem is that the location of cat pages varies a lot
########################################################################
FORCE:

8
README

@ -26,9 +26,17 @@ To install to a path different from /usr/local, use
distribution from its destination. Use with caution, and remember to
use "prefix" if desired, as with the "install" target.
To install only a specific man section (mandir) such as man3, use
"make install-man3". Similar syntax can be used to uninstall a
specific man section, such as man7: "make uninstall-man7".
"make" or "make all" will perform "make uninstall" followed by "make
install".
Consider using multiple threads (at least 2) when installing
these man pages, as the Makefile is optimized for multiple threads:
"make -j install".
Copyrights
==========
See the 'man-pages-x.y.Announce' file.

View File

@ -53,7 +53,6 @@
open, openat, creat \- open and possibly create a file
.SH SYNOPSIS
.nf
.B #include <sys/stat.h>
.B #include <fcntl.h>
.PP
.BI "int open(const char *" pathname ", int " flags );

View File

@ -27,9 +27,14 @@
rt_sigqueueinfo, rt_tgsigqueueinfo \- queue a signal and data
.SH SYNOPSIS
.nf
.BI "int rt_sigqueueinfo(pid_t " tgid ", int " sig ", siginfo_t *" info );
.BI "int rt_tgsigqueueinfo(pid_t " tgid ", pid_t " tid ", int " sig \
", siginfo_t *" info );
.BR "#include <linux/signal.h>" " /* Definition of " SI_* " constants */"
.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */"
.B #include <unistd.h>
.PP
.BI "int syscall(SYS_rt_sigqueueinfo, pid_t " tgid ,
.BI " int " sig ", siginfo_t *" info );
.BI "int syscall(SYS_rt_tgsigqueueinfo, pid_t " tgid ", pid_t " tid ,
.BI " int " sig ", siginfo_t *" info );
.fi
.PP
.IR Note :

View File

@ -2,6 +2,7 @@
.\" and Copyright (C) 2012 Will Drewry <wad@chromium.org>
.\" and Copyright (C) 2008, 2014,2017 Michael Kerrisk <mtk.manpages@gmail.com>
.\" and Copyright (C) 2017 Tyler Hicks <tyhicks@canonical.com>
.\" and Copyright (C) 2020 Tycho Andersen <tycho@tycho.ws>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
@ -206,6 +207,37 @@ The recognized
are:
.RS
.TP
.BR SECCOMP_FILTER_FLAG_LOG " (since Linux 4.14)"
.\" commit e66a39977985b1e69e17c4042cb290768eca9b02
All filter return actions except
.BR SECCOMP_RET_ALLOW
should be logged.
An administrator may override this filter flag by preventing specific
actions from being logged via the
.IR /proc/sys/kernel/seccomp/actions_logged
file.
.TP
.BR SECCOMP_FILTER_FLAG_NEW_LISTENER " (since Linux 5.0)"
.\" commit 6a21cc50f0c7f87dae5259f6cfefe024412313f6
After successfully installing the filter program,
return a new user-space notification file descriptor.
(The close-on-exec flag is set for the file descriptor.)
When the filter returns
.BR SECCOMP_RET_USER_NOTIF
a notification will be sent to this file descriptor.
.IP
At most one seccomp filter using the
.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
flag can be installed for a thread.
.IP
See
.BR seccomp_unotify (2)
for further details.
.TP
.BR SECCOMP_FILTER_FLAG_SPEC_ALLOW " (since Linux 4.17)"
.\" commit 00a02d0c502a06d15e07b857f8ff921e3e402675
Disable Speculative Store Bypass mitigation.
.TP
.BR SECCOMP_FILTER_FLAG_TSYNC
When adding a new filter, synchronize all other threads of the calling
process to the same seccomp filter tree.
@ -221,20 +253,6 @@ Synchronization will fail if another thread in the same process is in
.BR SECCOMP_MODE_STRICT
or if it has attached new seccomp filters to itself,
diverging from the calling thread's filter tree.
.TP
.BR SECCOMP_FILTER_FLAG_LOG " (since Linux 4.14)"
.\" commit e66a39977985b1e69e17c4042cb290768eca9b02
All filter return actions except
.BR SECCOMP_RET_ALLOW
should be logged.
An administrator may override this filter flag by preventing specific
actions from being logged via the
.IR /proc/sys/kernel/seccomp/actions_logged
file.
.TP
.BR SECCOMP_FILTER_FLAG_SPEC_ALLOW " (since Linux 4.17)"
.\" commit 00a02d0c502a06d15e07b857f8ff921e3e402675
Disable Speculative Store Bypass mitigation.
.RE
.TP
.BR SECCOMP_GET_ACTION_AVAIL " (since Linux 4.14)"
@ -250,6 +268,34 @@ The value of
must be 0, and
.IR args
must be a pointer to an unsigned 32-bit filter return action.
.TP
.BR SECCOMP_GET_NOTIF_SIZES " (since Linux 5.0)"
.\" commit 6a21cc50f0c7f87dae5259f6cfefe024412313f6
Get the sizes of the seccomp user-space notification structures.
Since these structures may evolve and grow over time,
this command can be used to determine how
much memory to allocate for sending and receiving notifications.
.IP
The value of
.IR flags
must be 0, and
.IR args
must be a pointer to a
.IR "struct seccomp_notif_sizes" ,
which has the following form:
.IP
.EX
struct seccomp_notif_sizes {
__u16 seccomp_notif; /* Size of notification structure */
__u16 seccomp_notif_resp; /* Size of response structure */
__u16 seccomp_data; /* Size of \(aqstruct seccomp_data\(aq */
};
.EE
.IP
See
.BR seccomp_unotify (2)
for further details.
.\"
.SS Filters
When adding filters via
.BR SECCOMP_SET_MODE_FILTER ,
@ -568,6 +614,26 @@ portion of the filter's return value being passed to user space as the
.IR errno
value without executing the system call.
.TP
.BR SECCOMP_RET_USER_NOTIF " (since Linux 5.0)"
.\" commit 6a21cc50f0c7f87dae5259f6cfefe024412313f6
Forward the system call to an attached user-space supervisor
process to allow that process to decide what to do with the system call.
If there is no attached supervisor (either
because the filter was not installed with the
.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
flag or because the file descriptor was closed), the filter returns
.BR ENOSYS
(similar to what happens when a filter returns
.BR SECCOMP_RET_TRACE
and there is no tracer).
See
.BR seccomp_unotify (2)
for further details.
.IP
Note that the supervisor process will not be notified
if another filter returns an action value with a precedence greater than
.BR SECCOMP_RET_USER_NOTIF .
.TP
.BR SECCOMP_RET_TRACE
When returned, this value will cause the kernel to attempt to notify a
.BR ptrace (2)-based
@ -740,6 +806,12 @@ capability in its user namespace, or had not set
before using
.BR SECCOMP_SET_MODE_FILTER .
.TP
.BR EBUSY
While installing a new filter, the
.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
flag was specified,
but a previous filter had already been installed with that flag.
.TP
.BR EFAULT
.IR args
was not a valid address.
@ -1157,6 +1229,7 @@ main(int argc, char **argv)
.BR bpf (2),
.BR prctl (2),
.BR ptrace (2),
.BR seccomp_unotify (2),
.BR sigaction (2),
.BR proc (5),
.BR signal (7),
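As an illustration of the SECCOMP_GET_NOTIF_SIZES text added above, the following C sketch queries the structure sizes through syscall(2). It is a minimal sketch under stated assumptions: the local struct name notif_sizes and the helper get_notif_sizes() are hypothetical, the struct mirrors the three __u16 fields shown in the page, and the operation number 3 is defined locally so that the code also builds against kernel headers older than Linux 5.0. On older kernels or in restricted sandboxes the call is expected to fail.

```c
#include <assert.h>
#include <stddef.h>        /* NULL */
#include <stdint.h>        /* uint16_t */
#include <sys/syscall.h>   /* SYS_seccomp */
#include <unistd.h>        /* syscall() */

/* Operation number as defined in <linux/seccomp.h>; repeated here in
   case the installed headers predate Linux 5.0. */
#ifndef SECCOMP_GET_NOTIF_SIZES
#define SECCOMP_GET_NOTIF_SIZES 3
#endif

/* Hypothetical local mirror of 'struct seccomp_notif_sizes'. */
struct notif_sizes {
    uint16_t seccomp_notif;        /* Size of notification structure */
    uint16_t seccomp_notif_resp;   /* Size of response structure */
    uint16_t seccomp_data;         /* Size of 'struct seccomp_data' */
};

/*
 * Ask the kernel for the notification structure sizes.
 * 'flags' must be 0. Returns 0 on success, -1 on error with errno set.
 */
int get_notif_sizes(struct notif_sizes *sizes)
{
    assert(sizes != NULL);
    return (int) syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes);
}
```

A supervisor would typically use the returned sizes to allocate its receive and response buffers rather than relying on sizeof() at compile time, since the structures may grow.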

1980
man2/seccomp_unotify.2 (new file)

File diff suppressed because it is too large.

View File

@ -70,6 +70,18 @@ Feature Test Macro Requirements for glibc (see
_POSIX_C_SOURCE >= 200112L
.fi
.SH DESCRIPTION
.BR "WARNING" :
.BR select ()
can monitor only file descriptor numbers that are less than
.BR FD_SETSIZE
(1024)\(eman unreasonably low limit for many modern applications\(emand
this limitation will not change.
All modern applications should instead use
.BR poll (2)
or
.BR epoll (7),
which do not suffer this limitation.
.PP
.BR select ()
allows a program to monitor multiple file descriptors,
waiting until one or more of the file descriptors become "ready"
@ -80,15 +92,6 @@ perform a corresponding I/O operation (e.g.,
or a sufficiently small
.BR write (2))
without blocking.
.PP
.BR select ()
can monitor only file descriptors numbers that are less than
.BR FD_SETSIZE ;
.BR poll (2)
and
.BR epoll (7)
do not have this limitation.
See BUGS.
.\"
.SS File descriptor sets
The principal arguments of
@ -108,12 +111,6 @@ to indicate which file descriptors are currently "ready".
Thus, if using
.BR select ()
within a loop, the sets \fImust be reinitialized\fP before each call.
The implementation of the
.I fd_set
arguments as value-result arguments is a design error that is avoided in
.BR poll (2)
and
.BR epoll (7).
.PP
The contents of a file descriptor set can be manipulated
using the following macros:
@ -655,6 +652,13 @@ or
.BR epoll (7)
instead.
.PP
The implementation of the
.I fd_set
arguments as value-result arguments is a design error that is avoided in
.BR poll (2)
and
.BR epoll (7).
.PP
According to POSIX,
.BR select ()
should check all specified file descriptors in the three file descriptor sets,
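The recommendation above to prefer poll(2) over select() can be sketched in C as follows. This is a minimal illustration, not text from the patch: the helper names wait_readable() and demo_pipe() are hypothetical. Because poll() takes an array of descriptors rather than a fixed-size bitmap, it is not bound by FD_SETSIZE, and the array does not need to be reinitialized before each call.

```c
#include <assert.h>
#include <poll.h>     /* poll(), struct pollfd, POLLIN, POLLHUP */
#include <unistd.h>   /* pipe(), write(), close() */

/*
 * Wait up to timeout_ms milliseconds for fd to become readable.
 * Returns 1 if data or EOF is available, 0 on timeout, and -1 on
 * error (including POLLERR or POLLNVAL being reported for fd).
 */
int wait_readable(int fd, int timeout_ms)
{
    assert(fd >= 0);

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    if (n <= 0)
        return n;   /* 0: timeout; -1: poll() failed, errno is set */

    return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : -1;
}

/*
 * Exercise the helper on a pipe: the read end must not be readable
 * before a byte is written and must be readable afterwards.
 * Returns 1 if both observations hold, -1 otherwise.
 */
int demo_pipe(void)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;

    int before = wait_readable(p[0], 0);    /* nothing written yet */
    ssize_t w = write(p[1], "x", 1);
    int after = (w == 1) ? wait_readable(p[0], 1000) : -1;

    close(p[0]);
    close(p[1]);
    return (before == 0) ? after : -1;
}
```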

View File

@ -11,28 +11,31 @@
get_thread_area, set_thread_area \- manipulate thread-local storage information
.SH SYNOPSIS
.nf
.B #include <linux/unistd.h>
.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */"
.B #include <unistd.h>
.PP
.B #if defined __i386__ || defined __x86_64__
.B # include <asm/ldt.h>
.BR "# include <asm/ldt.h>" " /* Definition of " "struct user_desc" " */"
.PP
.BI "int get_thread_area(struct user_desc *" u_info );
.BI "int set_thread_area(struct user_desc *" u_info );
.BI "int syscall(SYS_get_thread_area, struct user_desc *" u_info );
.BI "int syscall(SYS_set_thread_area, struct user_desc *" u_info );
.PP
.B #elif defined __m68k__
.PP
.B "int get_thread_area(void);"
.BI "int set_thread_area(unsigned long " tp );
.B "int syscall(SYS_get_thread_area);"
.BI "int syscall(SYS_set_thread_area, unsigned long " tp );
.PP
.B #elif defined __mips__
.PP
.BI "int set_thread_area(unsigned long " addr );
.BI "int syscall(SYS_set_thread_area, unsigned long " addr );
.PP
.B #endif
.fi
.PP
.IR Note :
There are no glibc wrappers for these system calls; see NOTES.
glibc provides no wrappers for these system calls,
necessitating the use of
.BR syscall (2).
.SH DESCRIPTION
These calls provide architecture-specific support for a thread-local storage
implementation.
@ -172,10 +175,7 @@ and
are Linux-specific and should not be used in programs that are intended
to be portable.
.SH NOTES
Glibc does not provide wrappers for these system calls,
since they are generally intended for use only by threading libraries.
In the unlikely event that you want to call them directly, use
.BR syscall (2).
These system calls are generally intended for use only by threading libraries.
.PP
.BR arch_prctl (2)
can interfere with

View File

@ -27,13 +27,17 @@
set_tid_address \- set pointer to thread ID
.SH SYNOPSIS
.nf
.B #include <linux/unistd.h>
.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */"
.B #include <unistd.h>
.PP
.BI "pid_t set_tid_address(int *" tidptr );
.BI "pid_t syscall(SYS_set_tid_address, int *" tidptr );
.fi
.PP
.IR Note :
There is no glibc wrapper for this system call; see NOTES.
glibc provides no wrapper for
.BR set_tid_address (),
necessitating the use of
.BR syscall (2).
.SH DESCRIPTION
For each thread, the kernel maintains two attributes (addresses) called
.I set_child_tid
@ -99,9 +103,6 @@ This call is present since Linux 2.5.48.
Details as given here are valid since Linux 2.5.49.
.SH CONFORMING TO
This system call is Linux-specific.
.SH NOTES
Glibc does not provide a wrapper for this system call; call it using
.BR syscall (2).
.SH SEE ALSO
.BR clone (2),
.BR futex (2),
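The new SYNOPSIS form can be exercised with a short C sketch. The helper name register_clear_tid() and the static variable are hypothetical. One caveat worth stating: threading libraries such as glibc's NPTL rely on the clear_child_tid address to implement thread joining, so overriding it is only reasonable in a single-threaded experiment like this one.

```c
#include <assert.h>
#include <sys/syscall.h>   /* SYS_set_tid_address */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* syscall() */

/* The kernel writes 0 to this location when the calling thread
   terminates, so it must stay valid for the thread's lifetime. */
static int clear_tid_slot;

/*
 * Point the caller's clear_child_tid attribute at clear_tid_slot.
 * Returns the caller's thread ID.
 */
pid_t register_clear_tid(void)
{
    pid_t tid = (pid_t) syscall(SYS_set_tid_address, &clear_tid_slot);
    assert(tid > 0);   /* set_tid_address() always succeeds */
    return tid;
}
```

The return value should equal the caller's thread ID as reported by gettid(2).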

View File

@ -42,7 +42,7 @@ saved set-user-ID of the calling process.
.PP
An unprivileged process may change its real UID,
effective UID, and saved set-user-ID, each to one of:
the current real UID, the current effective UID or the
the current real UID, the current effective UID, or the
current saved set-user-ID.
.PP
A privileged process (on Linux, one having the \fBCAP_SETUID\fP capability)

View File

@ -27,12 +27,17 @@
sgetmask, ssetmask \- manipulation of signal mask (obsolete)
.SH SYNOPSIS
.nf
.B "long sgetmask(void);"
.BI "long ssetmask(long " newmask );
.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */"
.B #include <unistd.h>
.PP
.B "long syscall(SYS_sgetmask, void);"
.BI "long syscall(SYS_ssetmask, long " newmask );
.fi
.PP
.IR Note :
There are no glibc wrappers for these system calls; see NOTES.
glibc provides no wrappers for these functions,
necessitating the use of
.BR syscall (2).
.SH DESCRIPTION
These system calls are obsolete.
.IR "Do not use them" ;
@@ -73,10 +78,6 @@ option.
.SH CONFORMING TO
These system calls are Linux-specific.
.SH NOTES
Glibc does not provide wrappers for these obsolete system calls;
in the unlikely event that you want to call them, use
.BR syscall (2).
.PP
These system calls are unaware of signal numbers greater than 31
(i.e., real-time signals).
.PP

@@ -42,7 +42,6 @@
shmat, shmdt \- System V shared memory operations
.SH SYNOPSIS
.nf
.B #include <sys/types.h>
.B #include <sys/shm.h>
.PP
.BI "void *shmat(int " shmid ", const void *" shmaddr ", int " shmflg );

@@ -37,12 +37,16 @@ sigprocmask, rt_sigprocmask \- examine and change blocked signals
.BI "int sigprocmask(int " how ", const sigset_t *restrict " set ,
.BI " sigset_t *restrict " oldset );
.PP
.BR "#include <signal.h>" " /* Definition of " SIG_* " constants */"
.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */"
.B #include <unistd.h>
.PP
/* Prototype for the underlying system call */
.BI "int rt_sigprocmask(int " how ", const kernel_sigset_t *" set ,
.BI "int syscall(SYS_rt_sigprocmask, int " how ", const kernel_sigset_t *" set ,
.BI " kernel_sigset_t *" oldset ", size_t " sigsetsize );
.PP
/* Prototype for the legacy system call (deprecated) */
.BI "int sigprocmask(int " how ", const old_kernel_sigset_t *" set ,
.BI "int syscall(SYS_sigprocmask, int " how ", const old_kernel_sigset_t *" set ,
.BI " old_kernel_sigset_t *" oldset );
.fi
.PP
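The hunk above rewrites the raw prototype as a syscall(2) invocation that takes an explicit sigsetsize argument. A minimal sketch of that form follows, assuming a little-endian architecture such as x86-64 where the kernel's signal set is 64 bits wide (this does not hold everywhere, for example on MIPS). The type raw_sigset and the helper raw_block_signal() are hypothetical stand-ins, not names from the page.

```c
#define _GNU_SOURCE
#include <signal.h>        /* Definition of SIG_* constants */
#include <stdint.h>
#include <sys/syscall.h>   /* Definition of SYS_* constants */
#include <unistd.h>

/* Assumption: the kernel signal set occupies 8 bytes on this
   architecture. Hypothetical stand-in for kernel_sigset_t. */
typedef uint64_t raw_sigset;

/* Hypothetical helper: add one signal to the calling thread's blocked
   set using the raw system call. Signal N corresponds to bit N-1.
   Returns 0 on success, or -1 with errno set. */
static int
raw_block_signal(int signo, raw_sigset *oldset)
{
    raw_sigset set = (raw_sigset) 1 << (signo - 1);

    return (int) syscall(SYS_rt_sigprocmask, SIG_BLOCK, &set, oldset,
                         sizeof(raw_sigset));
}
```

Ordinary programs should prefer the sigprocmask() wrapper, which hides the size difference between the user-space sigset_t and the kernel's signal set.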

@@ -27,13 +27,18 @@
socketcall \- socket system calls
.SH SYNOPSIS
.nf
.B #include <linux/net.h>
.BR "#include <linux/net.h>" " /* Definition of " SYS_* " constants */"
.BR "#include <sys/syscall.h>" " /* Definition of " SYS_socketcall " */"
.B #include <unistd.h>
.PP
.BI "int socketcall(int " call ", unsigned long *" args );
.BI "int syscall(SYS_socketcall, int " call ", unsigned long *" args );
.fi
.PP
.IR Note :
There is no glibc wrapper for this system call; see NOTES.
glibc provides no wrapper for
.BR socketcall (),
necessitating the use of
.BR syscall (2).
.SH DESCRIPTION
.BR socketcall ()
is a common kernel entry point for the socket system calls.
@@ -156,10 +161,6 @@ T}
This call is specific to Linux, and should not be used in programs
intended to be portable.
.SH NOTES
Glibc does not provide a wrapper for this system call;
in the unlikely event that you want to call it directly, do so using
.BR syscall (2).
.PP
On some architectures\(emfor example, x86-64 and ARM\(emthere is no
.BR socketcall ()
system call; instead

man3/blkcnt_t.3 Normal file

@@ -0,0 +1 @@
.so man7/system_data_types.7

man3/blksize_t.3 Normal file

@@ -0,0 +1 @@
.so man7/system_data_types.7

man3/cc_t.3 Normal file

@@ -0,0 +1 @@
.so man7/system_data_types.7

@@ -246,6 +246,10 @@ cmsg\->cmsg_len = CMSG_LEN(sizeof(myfds));
memcpy(CMSG_DATA(cmsg), myfds, sizeof(myfds));
.EE
.in
.PP
For a complete code example that shows passing of file descriptors
over a UNIX domain socket, see
.BR seccomp_unotify (2).
.SH SEE ALSO
.BR recvmsg (2),
.BR sendmsg (2)
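The added paragraph points to seccomp_unotify(2) for a complete program. For orientation, a smaller sketch of passing a single file descriptor with SCM_RIGHTS follows. It is not the code from either page. The helper names send_fd() and recv_fd() are hypothetical, and the sketch assumes a connected UNIX domain socket and sends one byte of ordinary data alongside the ancillary message.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Ancillary buffer sized for one descriptor; the union ensures
   alignment suitable for struct cmsghdr. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical helper: send 'fd' over 'sock'. Returns 0 or -1. */
static int
send_fd(int sock, int fd)
{
    char data = 'x';                 /* one byte of real data */
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor from 'sock'.
   Returns the new descriptor, or -1 on error. */
static int
recv_fd(int sock)
{
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS ||
            cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The two helpers can be exercised over a socketpair(2): the descriptor returned by recv_fd() refers to the same open file description as the one passed to send_fd(), although its number usually differs.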

man3/mode_t.3 Normal file

@@ -0,0 +1 @@
.so man7/system_data_types.7

@@ -19,9 +19,6 @@ modification of the kernel image and to prevent access to security and
cryptographic data located in kernel memory, whilst still permitting driver
modules to be loaded.
.PP
Lockdown is typically enabled during boot and may be terminated, if configured,
by typing a special key combination on a directly attached physical keyboard.
.PP
If a prohibited or restricted feature is accessed or used, the kernel will emit
a message that looks like:
.PP
@@ -33,11 +30,6 @@ where X indicates the process name and Y indicates what is restricted.
.PP
On an EFI-enabled x86 or arm64 machine, lockdown will be automatically enabled
if the system boots in EFI Secure Boot mode.
.PP
If the kernel is appropriately configured, lockdown may be lifted by typing
the appropriate sequence on a directly attached physical keyboard.
For x86 machines, this is
.IR SysRq+x .
.\"
.SS Coverage
When lockdown is in effect, a number of features are disabled or have their

@@ -857,6 +857,15 @@ The
.BR sleep (3)
function is also never restarted if interrupted by a handler,
but gives a success return: the number of seconds remaining to sleep.
.PP
In certain circumstances, the
.BR seccomp (2)
user-space notification feature can lead to restarting of system calls
that would otherwise never be restarted by
.BR SA_RESTART ;
for details, see
.BR seccomp_unotify (2).
.\"
.SS Interruption of system calls and library functions by stop signals
On Linux, even in the absence of signal handlers,
certain blocking interfaces can fail with the error

@@ -85,6 +85,61 @@ POSIX.1-2001 and later.
.BR aio_write (3),
.BR lio_listio (3)
.RE
.\"------------------------------------- blkcnt_t ---------------------/
.TP
.I blkcnt_t
.RS
.IR Include :
.IR <sys/types.h> .
Alternatively,
.IR <sys/stat.h> .
.PP
Used for file block counts.
According to POSIX,
it shall be a signed integer type.
.PP
.IR "Conforming to" :
POSIX.1-2001 and later.
.PP
.IR "See also" :
.BR stat (2)
.RE
.\"------------------------------------- blksize_t --------------------/
.TP
.I blksize_t
.RS
.IR Include :
.IR <sys/types.h> .
Alternatively,
.IR <sys/stat.h> .
.PP
Used for file block sizes.
According to POSIX,
it shall be a signed integer type.
.PP
.IR "Conforming to" :
POSIX.1-2001 and later.
.PP
.IR "See also" :
.BR stat (2)
.RE
.\"------------------------------------- cc_t -------------------------/
.TP
.I cc_t
.RS
.IR Include :
.IR <termios.h> .
.PP
Used for terminal special characters.
According to POSIX,
it shall be an unsigned integer type.
.PP
.IR "Conforming to" :
POSIX.1-2001 and later.
.PP
.IR "See also" :
.BR termios (3)
.RE
.\"------------------------------------- clock_t ----------------------/
.TP
.I clock_t
@@ -726,6 +781,35 @@ C99 and later; POSIX.1-2001 and later.
.IR "See also" :
.BR lldiv (3)
.RE
.\"------------------------------------- mode_t -----------------------/
.TP
.I mode_t
.RS
.IR Include :
.IR <sys/types.h> .
Alternatively,
.IR <fcntl.h> ,
.IR <ndbm.h> ,
.IR <spawn.h> ,
.IR <sys/ipc.h> ,
.IR <sys/mman.h> ,
or
.IR <sys/stat.h> .
.PP
Used for some file attributes (e.g., file mode).
According to POSIX,
it shall be an integer type.
.PP
.IR "Conforming to" :
POSIX.1-2001 and later.
.PP
.IR "See also" :
.BR chmod (2),
.BR mkdir (2),
.BR open (2),
.BR stat (2),
.BR umask (2)
.RE
.\"------------------------------------- off64_t ----------------------/
.TP
.I off64_t

@@ -1180,10 +1180,12 @@ main(int argc, char *argv[])
}
.EE
.PP
For an example of the use of
.BR SCM_RIGHTS
For examples of the use of
.BR SCM_RIGHTS ,
see
.BR cmsg (3).
.BR cmsg (3)
and
.BR seccomp_unotify (2).
.SH SEE ALSO
.BR recvmsg (2),
.BR sendmsg (2),