seccomp_unotify.2: Describe the interaction with SA_RESTART signal handlers

And, as noted by Jann Horn, note how the user-space notification
mechanism causes a small breakage in the user-space API with
respect to nonrestartable system calls.

====

From the email discussion with Jann Horn

> >> So, I partially demonstrated what you describe here, for two example
> >> system calls (epoll_wait() and pause()). But I could not exactly
> >> demonstrate things as I understand you to be describing them. (So,
> >> I'm not sure whether I have not understood you correctly, or
> >> if things are not exactly as you describe them.)
> >>
> >> Here's a scenario (A) that I tested:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >>    (epoll_wait() or pause(), both of which should never restart,
> >>    regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor is sleeping (i.e., is not blocked in
> >>    SECCOMP_IOCTL_NOTIF_RECV operation).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. SIGINT gets delivered to target; handler gets called;
> >>    ***and syscall gets restarted by the kernel***
> >>
> >> That last should never happen, of course, and is a result of the
> >> combination of both the user-notify filter and the SA_RESTART flag.
> >> If one or other is not present, then the system call is not
> >> restarted.
> >>
> >> So, as you note below, the UAPI gets broken a little.
> >>
> >> However, from your description above I had understood that
> >> something like the following scenario (B) could occur:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >>    (epoll_wait() or pause(), both of which should never restart,
> >>    regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
> >>    blocks).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. Supervisor gets seccomp user-space notification (i.e.,
> >>    SECCOMP_IOCTL_NOTIF_RECV ioctl() returns
> >> 6. SIGINT gets delivered to target; handler gets called;
> >>    and syscall gets restarted by the kernel
> >> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
> >>    which gets another notification for the restarted system call.
> >>
> >> However, I don't observe such behavior. In step 6, the syscall
> >> does not get restarted by the kernel, but instead returns -1/EINTR.
> >> Perhaps I have misconstructed my experiment in the second case, or
> >> perhaps I've misunderstood what you meant, or is it possibly the
> >> case that things are not quite as you said?
>
> Thanks for the code, Jann (including the demo of the CLONE_FILES
> technique to pass the notification FD to the supervisor).
>
> But I think your code just demonstrates what I described in
> scenario A. So, it seems that I both understood what you
> meant (because my code demonstrates the same thing) and
> also misunderstood what you said (because I thought you
> were meaning something more like scenario B).

Ahh, sorry, I should've read your mail more carefully. Indeed, that
testcase only shows scenario A. But the following shows scenario B...

[Below, two pieces of code from Jann, with a lot of
cosmetic changes by mtk.]

====

[And from a follow-up in the same email thread:]

> If userspace relies on non-restarting behavior, it should be using
> something like epoll_pwait(). And that stuff only unblocks signals
> after we've already past the seccomp checks on entry.
Thanks for elaborating that detail, since as soon as you talked
about "enlarging a preexisting race" above, I immediately wondered
sigsuspend(), pselect(), etc.

(Mind you, I still wonder about the effect on system calls that
are normally nonrestartable because they have timeouts. My
understanding is that the kernel doesn't restart those system
calls because it's impossible for the kernel to restart the call
with the right timeout value. I wonder what happens when those
system calls are restarted in the scenario we're discussing.)

Anyway, returning to your point... So, to be clear (and to
quickly remind myself in case I one day reread this thread),
there is not a problem with sigsuspend(), pselect(), ppoll(),
and epoll_pwait() since:

* Before the syscall, signals are blocked in the target.
* Inside the syscall, signals are still blocked at the time
  the check is made for seccomp filters.
* If a seccomp user-space notification  event kicks, the target
  is put to sleep with the signals still blocked.
* The signal will only get delivered after the supervisor either
  triggers a spoofed success/failure return in the target or the
  supervisor sends a CONTINUE response to the kernel telling it
  to execute the target's system call. Either way, there won't be
  any restarting of the target's system call (and the supervisor
  thus won't see multiple notifications).

====

Scenario A

$ ./seccomp_unotify_restart_scen_A
C: installed seccomp: fd 3
C: woke 1 waiters
	P: child installed seccomp fd 3
C: About to call pause(): Success
	P: going to send SIGUSR1...
C: sigusr1_handler handler invoked
	P: about to terminate
C: got pdeath signal on parent termination
C: about to terminate

/* Modified version of code from Jann Horn */

#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <limits.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>

struct {
    int seccomp_fd;
} *shared;

static void
sigusr1_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: sigusr1_handler handler invoked\n");
}

static void
sigusr2_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: got pdeath signal on parent termination\n");
    printf("C: about to terminate\n");
    exit(0);
}

int
main(void)
{
    setbuf(stdout, NULL);

    /* Allocate memory that will be shared by parent and child */

    shared = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (shared == MAP_FAILED)
        err(1, "mmap");
    shared->seccomp_fd = -1;

    /* glibc's clone() wrapper doesn't support fork()-style usage */
    /* Child process and parent share file descriptor table */

    pid_t child = syscall(__NR_clone, CLONE_FILES | SIGCHLD,
                          NULL, NULL, NULL, 0);
    if (child == -1)
        err(1, "clone");

    /* CHILD */

    if (child == 0) {
        /* don't outlive the parent */
        prctl(PR_SET_PDEATHSIG, SIGUSR2);

        if (getppid() == 1)
            exit(0);

        /* Install seccomp filter */

        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        struct sock_filter insns[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pause, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
        };
        struct sock_fprog prog = {
            .len = sizeof(insns) / sizeof(insns[0]),
            .filter = insns
        };
        int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
        if (seccomp_ret < 0)
            err(1, "install");
        printf("C: installed seccomp: fd %d\n", seccomp_ret);

        /* Place the notifier FD number into the shared memory */

        __atomic_store(&shared->seccomp_fd, &seccomp_ret,
                       __ATOMIC_RELEASE);

        /* Wake the parent */

        int futex_ret =
            syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
                    INT_MAX, NULL, NULL, 0);
        printf("C: woke %d waiters\n", futex_ret);

        /* Establish SA_RESTART handler for SIGUSR1 */

        struct sigaction act = {
            .sa_sigaction = sigusr1_handler,
            .sa_flags = SA_RESTART | SA_SIGINFO
        };
        if (sigaction(SIGUSR1, &act, NULL))
            err(1, "sigaction");

        struct sigaction act2 = {
            .sa_sigaction = sigusr2_handler,
            .sa_flags = 0
        };
        if (sigaction(SIGUSR2, &act2, NULL))
            err(1, "sigaction");

        /* Make a blocking system call */

        perror("C: About to call pause()");
        pause();
        perror("C: pause returned");

        exit(0);
    }

    /* PARENT */

    /* Wait for futex wake-up from child */

    int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
                            -1, NULL, NULL, 0);
    if (futex_ret == -1 && errno != EAGAIN)
        err(1, "futex wait");

    /* Get notification FD from the child */

    int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
    printf("\tP: child installed seccomp fd %d\n", fd);

    sleep(1);

    printf("\tP: going to send SIGUSR1...\n");
    kill(child, SIGUSR1);

    sleep(1);
    printf("\tP: about to terminate\n");

    exit(0);
}

====

Scenario B

$ ./seccomp_unotify_restart_scen_B
C: installed seccomp: fd 3
C: woke 1 waiters
C: About to call pause()
	P: child installed seccomp fd 3
	P: about to SECCOMP_IOCTL_NOTIF_RECV
	P: got notif: id=17773741941218455591 pid=25052 nr=34
	P: about to send SIGUSR1 to child...
	P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
	P: got notif: id=17773741941218455592 pid=25052 nr=34
	P: about to send SIGUSR1 to child...
	P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
	P: got notif: id=17773741941218455593 pid=25052 nr=34
	P: about to send SIGUSR1 to child...
	P: about to SECCOMP_IOCTL_NOTIF_RECV
C: sigusr1_handler handler invoked
	P: got notif: id=17773741941218455594 pid=25052 nr=34
	P: about to send SIGUSR1 to child...
C: sigusr1_handler handler invoked
C: got pdeath signal on parent termination
C: about to terminate

/* Modified version of code from Jann Horn */

#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <string.h>
#include <limits.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>

struct {
    int seccomp_fd;
} *shared;

static void
sigusr1_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: sigusr1_handler handler invoked\n");
}

static void
sigusr2_handler(int sig, siginfo_t * info, void *uctx)
{
    printf("C: got pdeath signal on parent termination\n");
    printf("C: about to terminate\n");
    exit(0);
}

static size_t
max_size(size_t a, size_t b)
{
    return (a > b) ? a : b;
}

int
main(void)
{
    setbuf(stdout, NULL);

    /* Allocate memory that will be shared by parent and child */

    shared = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (shared == MAP_FAILED)
        err(1, "mmap");
    shared->seccomp_fd = -1;

    /* glibc's clone() wrapper doesn't support fork()-style usage */
    /* Child process and parent share file descriptor table */
    pid_t child = syscall(__NR_clone, CLONE_FILES | SIGCHLD,
                          NULL, NULL, NULL, 0);
    if (child == -1)
        err(1, "clone");

    /* CHILD */

    if (child == 0) {
        /* don't outlive the parent */
        prctl(PR_SET_PDEATHSIG, SIGUSR2);
        if (getppid() == 1)
            exit(0);

        /* Install seccomp filter */

        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        struct sock_filter insns[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pause, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
        };
        struct sock_fprog prog = {
            .len = sizeof(insns) / sizeof(insns[0]),
            .filter = insns
        };
        int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
        if (seccomp_ret < 0)
            err(1, "install");
        printf("C: installed seccomp: fd %d\n", seccomp_ret);

        /* Place the notifier FD number into the shared memory */

        __atomic_store(&shared->seccomp_fd, &seccomp_ret,
                       __ATOMIC_RELEASE);

        /* Wake the parent */

        int futex_ret =
            syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
                    INT_MAX, NULL, NULL, 0);
        printf("C: woke %d waiters\n", futex_ret);

        /* Establish SA_RESTART handler for SIGUSR1 */

        struct sigaction act = {
            .sa_sigaction = sigusr1_handler,
            .sa_flags = SA_RESTART | SA_SIGINFO
        };
        if (sigaction(SIGUSR1, &act, NULL))
            err(1, "sigaction");

        struct sigaction act2 = {
            .sa_sigaction = sigusr2_handler,
            .sa_flags = 0
        };
        if (sigaction(SIGUSR2, &act2, NULL))
            err(1, "sigaction");

        /* Make a blocking system call */

        printf("C: About to call pause()\n");
        pause();
        perror("C: pause returned");

        exit(0);
    }

    /* PARENT */

    /* Wait for futex wake-up from child */

    int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
                            -1, NULL, NULL, 0);
    if (futex_ret == -1 && errno != EAGAIN)
        err(1, "futex wait");

    /* Get notification FD from the child */

    int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
    printf("\tP: child installed seccomp fd %d\n", fd);

    /* Discover seccomp buffer sizes and allocate notification buffer */

    struct seccomp_notif_sizes sizes;
    if (syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes))
        err(1, "notif_sizes");
    struct seccomp_notif *notif =
        malloc(max_size(sizeof(struct seccomp_notif),
                        sizes.seccomp_notif));
    if (!notif)
        err(1, "malloc");

    for (int i = 0; i < 4; i++) {
        printf("\tP: about to SECCOMP_IOCTL_NOTIF_RECV\n");
        memset(notif, '\0', sizes.seccomp_notif);
        if (ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, notif))
            err(1, "notif_recv");
        printf("\tP: got notif: id=%llu pid=%u nr=%d\n",
               notif->id, notif->pid, notif->data.nr);
        sleep(1);
        printf("\tP: about to send SIGUSR1 to child...\n");
        kill(child, SIGUSR1);
    }
    sleep(1);

    exit(0);
}

====

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This commit is contained in:
Michael Kerrisk 2020-10-24 14:29:11 +02:00
parent 1661264841
commit fd1295e8f1
1 changed files with 38 additions and 0 deletions

View File

@ -682,6 +682,44 @@ In other words,
in order to continue a system call, the supervisor should be sure that
another security mechanism or the kernel itself will sufficiently block
the system call if its arguments are rewritten to something unsafe.
.\"
.SS Interaction with SA_RESTART signal handlers
Consider the following scenario:
.IP \(bu 2
The target process has used
.BR sigaction (2)
to install a signal handler with the
.B SA_RESTART
flag.
.IP \(bu
The target has made a system call that triggered a seccomp
user-space notification and the target is currently blocked
until the supervisor sends a notification response.
.IP \(bu
A signal is delivered to the target and the signal handler is executed.
.IP \(bu
When (if) the supervisor attempts to send a notification response, the
.B SECCOMP_IOCTL_NOTIF_SEND
.BR ioctl (2))
operation will fail with the
.BR ENOENT
error.
.PP
In this scenario, the kernel will restart the target's system call.
Consequently, the supervisor will receive another user-space notification.
Thus, depending on how many times the blocked system call
is interrupted by a signal handler,
the supervisor may receive multiple notifications for
the same system call in the target.
.PP
One oddity is that system call restarting as described in this scenario
will occur even for the blocking system calls listed in
.BR signal (7)
that would
.B never
normally be restarted by the
.BR SA_RESTART
flag.
.SH BUGS
If a
.BR SECCOMP_IOCTL_NOTIF_RECV