From fd1295e8f14b8b3a13beeb716e5a91ca46d8e893 Mon Sep 17 00:00:00 2001 From: Michael Kerrisk Date: Sat, 24 Oct 2020 14:29:11 +0200 Subject: [PATCH] seccomp_unotify.2: Describe the interaction with SA_RESTART signal handlers And, as noted by Jann Horn, note how the user-space notification mechanism causes a small breakage in the user-space API with respect to nonrestartable system calls. ==== From the email discussion with Jann Horn > >> So, I partially demonstrated what you describe here, for two example > >> system calls (epoll_wait() and pause()). But I could not exactly > >> demonstrate things as I understand you to be describing them. (So, > >> I'm not sure whether I have not understood you correctly, or > >> if things are not exactly as you describe them.) > >> > >> Here's a scenario (A) that I tested: > >> > >> 1. Target installs seccomp filters for a blocking syscall > >> (epoll_wait() or pause(), both of which should never restart, > >> regardless of SA_RESTART) > >> 2. Target installs SIGINT handler with SA_RESTART > >> 3. Supervisor is sleeping (i.e., is not blocked in > >> SECCOMP_IOCTL_NOTIF_RECV operation). > >> 4. Target makes a blocking system call (epoll_wait() or pause()). > >> 5. SIGINT gets delivered to target; handler gets called; > >> ***and syscall gets restarted by the kernel*** > >> > >> That last should never happen, of course, and is a result of the > >> combination of both the user-notify filter and the SA_RESTART flag. > >> If one or other is not present, then the system call is not > >> restarted. > >> > >> So, as you note below, the UAPI gets broken a little. > >> > >> However, from your description above I had understood that > >> something like the following scenario (B) could occur: > >> > >> 1. Target installs seccomp filters for a blocking syscall > >> (epoll_wait() or pause(), both of which should never restart, > >> regardless of SA_RESTART) > >> 2. Target installs SIGINT handler with SA_RESTART > >> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which > >> blocks). > >> 4. Target makes a blocking system call (epoll_wait() or pause()). > >> 5. Supervisor gets seccomp user-space notification (i.e., > >> SECCOMP_IOCTL_NOTIF_RECV ioctl() returns > >> 6. SIGINT gets delivered to target; handler gets called; > >> and syscall gets restarted by the kernel > >> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation > >> which gets another notification for the restarted system call. > >> > >> However, I don't observe such behavior. In step 6, the syscall > >> does not get restarted by the kernel, but instead returns -1/EINTR. > >> Perhaps I have misconstructed my experiment in the second case, or > >> perhaps I've misunderstood what you meant, or is it possibly the > >> case that things are not quite as you said? > > Thanks for the code, Jann (including the demo of the CLONE_FILES > technique to pass the notification FD to the supervisor). > > But I think your code just demonstrates what I described in > scenario A. So, it seems that I both understood what you > meant (because my code demonstrates the same thing) and > also misunderstood what you said (because I thought you > were meaning something more like scenario B). Ahh, sorry, I should've read your mail more carefully. Indeed, that testcase only shows scenario A. But the following shows scenario B... [Below, two pieces of code from Jann, with a lot of cosmetic changes by mtk.] ==== [And from a follow-up in the same email thread:] > If userspace relies on non-restarting behavior, it should be using > something like epoll_pwait(). And that stuff only unblocks signals > after we've already past the seccomp checks on entry. Thanks for elaborating that detail, since as soon as you talked about "enlarging a preexisting race" above, I immediately wondered sigsuspend(), pselect(), etc. (Mind you, I still wonder about the effect on system calls that are normally nonrestartable because they have timeouts. My understanding is that the kernel doesn't restart those system calls because it's impossible for the kernel to restart the call with the right timeout value. I wonder what happens when those system calls are restarted in the scenario we're discussing.) Anyway, returning to your point... So, to be clear (and to quickly remind myself in case I one day reread this thread), there is not a problem with sigsuspend(), pselect(), ppoll(), and epoll_pwait() since: * Before the syscall, signals are blocked in the target. * Inside the syscall, signals are still blocked at the time the check is made for seccomp filters. * If a seccomp user-space notification event kicks, the target is put to sleep with the signals still blocked. * The signal will only get delivered after the supervisor either triggers a spoofed success/failure return in the target or the supervisor sends a CONTINUE response to the kernel telling it to execute the target's system call. Either way, there won't be any restarting of the target's system call (and the supervisor thus won't see multiple notifications). ==== Scenario A $ ./seccomp_unotify_restart_scen_A C: installed seccomp: fd 3 C: woke 1 waiters P: child installed seccomp fd 3 C: About to call pause(): Success P: going to send SIGUSR1... C: sigusr1_handler handler invoked P: about to terminate C: got pdeath signal on parent termination C: about to terminate /* Modified version of code from Jann Horn */ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct { int seccomp_fd; } *shared; static void sigusr1_handler(int sig, siginfo_t * info, void *uctx) { printf("C: sigusr1_handler handler invoked\n"); } static void sigusr2_handler(int sig, siginfo_t * info, void *uctx) { printf("C: got pdeath signal on parent termination\n"); printf("C: about to terminate\n"); exit(0); } int main(void) { setbuf(stdout, NULL); /* Allocate memory that will be shared by parent and child */ shared = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0); if (shared == MAP_FAILED) err(1, "mmap"); shared->seccomp_fd = -1; /* glibc's clone() wrapper doesn't support fork()-style usage */ /* Child process and parent share file descriptor table */ pid_t child = syscall(__NR_clone, CLONE_FILES | SIGCHLD, NULL, NULL, NULL, 0); if (child == -1) err(1, "clone"); /* CHILD */ if (child == 0) { /* don't outlive the parent */ prctl(PR_SET_PDEATHSIG, SIGUSR2); if (getppid() == 1) exit(0); /* Install seccomp filter */ prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); struct sock_filter insns[] = { BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)), BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pause, 0, 1), BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF), BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW) }; struct sock_fprog prog = { .len = sizeof(insns) / sizeof(insns[0]), .filter = insns }; int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog); if (seccomp_ret < 0) err(1, "install"); printf("C: installed seccomp: fd %d\n", seccomp_ret); /* Place the notifier FD number into the shared memory */ __atomic_store(&shared->seccomp_fd, &seccomp_ret, __ATOMIC_RELEASE); /* Wake the parent */ int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE, INT_MAX, NULL, NULL, 0); printf("C: woke %d waiters\n", futex_ret); /* Establish SA_RESTART handler for SIGUSR1 */ struct sigaction act = { .sa_sigaction = sigusr1_handler, .sa_flags = SA_RESTART | SA_SIGINFO }; if (sigaction(SIGUSR1, &act, NULL)) err(1, "sigaction"); struct sigaction act2 = { .sa_sigaction = sigusr2_handler, .sa_flags = 0 }; if (sigaction(SIGUSR2, &act2, NULL)) err(1, "sigaction"); /* Make a blocking system call */ perror("C: About to call pause()"); pause(); perror("C: pause returned"); exit(0); } /* PARENT */ /* Wait for futex wake-up from child */ int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT, -1, NULL, NULL, 0); if (futex_ret == -1 && errno != EAGAIN) err(1, "futex wait"); /* Get notification FD from the child */ int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE); printf("\tP: child installed seccomp fd %d\n", fd); sleep(1); printf("\tP: going to send SIGUSR1...\n"); kill(child, SIGUSR1); sleep(1); printf("\tP: about to terminate\n"); exit(0); } ==== Scenario B $ ./seccomp_unotify_restart_scen_B C: installed seccomp: fd 3 C: woke 1 waiters C: About to call pause() P: child installed seccomp fd 3 P: about to SECCOMP_IOCTL_NOTIF_RECV P: got notif: id=17773741941218455591 pid=25052 nr=34 P: about to send SIGUSR1 to child... P: about to SECCOMP_IOCTL_NOTIF_RECV C: sigusr1_handler handler invoked P: got notif: id=17773741941218455592 pid=25052 nr=34 P: about to send SIGUSR1 to child... P: about to SECCOMP_IOCTL_NOTIF_RECV C: sigusr1_handler handler invoked P: got notif: id=17773741941218455593 pid=25052 nr=34 P: about to send SIGUSR1 to child... P: about to SECCOMP_IOCTL_NOTIF_RECV C: sigusr1_handler handler invoked P: got notif: id=17773741941218455594 pid=25052 nr=34 P: about to send SIGUSR1 to child... C: sigusr1_handler handler invoked C: got pdeath signal on parent termination C: about to terminate /* Modified version of code from Jann Horn */ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct { int seccomp_fd; } *shared; static void sigusr1_handler(int sig, siginfo_t * info, void *uctx) { printf("C: sigusr1_handler handler invoked\n"); } static void sigusr2_handler(int sig, siginfo_t * info, void *uctx) { printf("C: got pdeath signal on parent termination\n"); printf("C: about to terminate\n"); exit(0); } static size_t max_size(size_t a, size_t b) { return (a > b) ? a : b; } int main(void) { setbuf(stdout, NULL); /* Allocate memory that will be shared by parent and child */ shared = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0); if (shared == MAP_FAILED) err(1, "mmap"); shared->seccomp_fd = -1; /* glibc's clone() wrapper doesn't support fork()-style usage */ /* Child process and parent share file descriptor table */ pid_t child = syscall(__NR_clone, CLONE_FILES | SIGCHLD, NULL, NULL, NULL, 0); if (child == -1) err(1, "clone"); /* CHILD */ if (child == 0) { /* don't outlive the parent */ prctl(PR_SET_PDEATHSIG, SIGUSR2); if (getppid() == 1) exit(0); /* Install seccomp filter */ prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); struct sock_filter insns[] = { BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)), BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pause, 0, 1), BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF), BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW) }; struct sock_fprog prog = { .len = sizeof(insns) / sizeof(insns[0]), .filter = insns }; int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog); if (seccomp_ret < 0) err(1, "install"); printf("C: installed seccomp: fd %d\n", seccomp_ret); /* Place the notifier FD number into the shared memory */ __atomic_store(&shared->seccomp_fd, &seccomp_ret, __ATOMIC_RELEASE); /* Wake the parent */ int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE, INT_MAX, NULL, NULL, 0); printf("C: woke %d waiters\n", futex_ret); /* Establish SA_RESTART handler for SIGUSR1 */ struct sigaction act = { .sa_sigaction = sigusr1_handler, .sa_flags = SA_RESTART | SA_SIGINFO }; if (sigaction(SIGUSR1, &act, NULL)) err(1, "sigaction"); struct sigaction act2 = { .sa_sigaction = sigusr2_handler, .sa_flags = 0 }; if (sigaction(SIGUSR2, &act2, NULL)) err(1, "sigaction"); /* Make a blocking system call */ printf("C: About to call pause()\n"); pause(); perror("C: pause returned"); exit(0); } /* PARENT */ /* Wait for futex wake-up from child */ int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT, -1, NULL, NULL, 0); if (futex_ret == -1 && errno != EAGAIN) err(1, "futex wait"); /* Get notification FD from the child */ int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE); printf("\tP: child installed seccomp fd %d\n", fd); /* Discover seccomp buffer sizes and allocate notification buffer */ struct seccomp_notif_sizes sizes; if (syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes)) err(1, "notif_sizes"); struct seccomp_notif *notif = malloc(max_size(sizeof(struct seccomp_notif), sizes.seccomp_notif)); if (!notif) err(1, "malloc"); for (int i = 0; i < 4; i++) { printf("\tP: about to SECCOMP_IOCTL_NOTIF_RECV\n"); memset(notif, '\0', sizes.seccomp_notif); if (ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, notif)) err(1, "notif_recv"); printf("\tP: got notif: id=%llu pid=%u nr=%d\n", notif->id, notif->pid, notif->data.nr); sleep(1); printf("\tP: about to send SIGUSR1 to child...\n"); kill(child, SIGUSR1); } sleep(1); exit(0); } ==== Reported-by: Jann Horn Signed-off-by: Michael Kerrisk --- man2/seccomp_unotify.2 | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/man2/seccomp_unotify.2 b/man2/seccomp_unotify.2 index 034224d77..b9be0e518 100644 --- a/man2/seccomp_unotify.2 +++ b/man2/seccomp_unotify.2 @@ -682,6 +682,44 @@ In other words, in order to continue a system call, the supervisor should be sure that another security mechanism or the kernel itself will sufficiently block the system call if its arguments are rewritten to something unsafe. +.\" +.SS Interaction with SA_RESTART signal handlers +Consider the following scenario: +.IP \(bu 2 +The target process has used +.BR sigaction (2) +to install a signal handler with the +.B SA_RESTART +flag. +.IP \(bu +The target has made a system call that triggered a seccomp +user-space notification and the target is currently blocked +until the supervisor sends a notification response. +.IP \(bu +A signal is delivered to the target and the signal handler is executed. +.IP \(bu +When (if) the supervisor attempts to send a notification response, the +.B SECCOMP_IOCTL_NOTIF_SEND +.BR ioctl (2)) +operation will fail with the +.BR ENOENT +error. +.PP +In this scenario, the kernel will restart the target's system call. +Consequently, the supervisor will receive another user-space notification. +Thus, depending on how many times the blocked system call +is interrupted by a signal handler, +the supervisor may receive multiple notifications for +the same system call in the target. +.PP +One oddity is that system call restarting as described in this scenario +will occur even for the blocking system calls listed in +.BR signal (7) +that would +.B never +normally be restarted by the +.BR SA_RESTART +flag. .SH BUGS If a .BR SECCOMP_IOCTL_NOTIF_RECV