x86 and ARM are the most common architectures, but currently
are in the second subfield in the signal number lists.
Instead, swap that info with subfield 1, so the most
common architectures are first in the list.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This patch adds the signal numbers for parisc to the signal(7) man page.
Those parisc-specific values for the various signals are valid since the
Linux kernel upstream commit ("parisc: Reduce SIGRTMIN from 37 to 32 to
behave like other Linux architectures") during development of kernel 3.18:
http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f25df2eff5b25f52c139d3ff31bc883eee9a0ab
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Mention that the named constants (SECBIT_KEEP_CAPS and others)
are available only if the linux/securebits.h user-space header
is included.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The eBPF subsystem on Linux can use "helper functions", functions
implemented in the kernel that can be called from within an eBPF program
injected by a user on Linux. The kernel already supports a long list of
such helpers (sixty-seven at this time, new ones are under review).
Therefore, it is proposed to create a new manual page, separate from
bpf(2), to document those helpers for people willing to develop new eBPF
programs.
Additionally, in an effort to keep this documentation in synchronisation
with what is implemented in the kernel, it is further proposed to keep
the documentation itself in the kernel sources, as comments in file
"include/uapi/linux/bpf.h", and to generate the man page from there.
This patch adds the new man page, generated from kernel sources, to the
man-pages repository. For each eBPF helper function, a description of
the helper, of its arguments and of the return value is provided. The
idea is that all future changes for this page should be redirected to
the kernel file "include/uapi/linux/bpf.h", and the modified page
generated from there.
Generating the page itself is a two-step process. First, the
documentation is extracted from include/uapi/linux/bpf.h, and converted
to an RST (reStructuredText-formatted) page, with the relevant script
from Linux sources:
$ ./scripts/bpf_helpers_doc.py > /tmp/bpf-helpers.rst
The second step consists in turning the RST document into the final man
page, with rst2man:
$ rst2man /tmp/bpf-helpers.rst > bpf-helpers.7
The bpf.h file was taken as of kernel 4.19.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This just adds to the point made by Marcus Gelderie's patch. Note
also that SECBIT_KEEP_CAPS provides the same functionality as the
prctl() PR_SET_KEEPCAPS flag, and the prctl(2) manual page has the
correct description of the semantics (i.e., that the flag affects
the treatment of only the permitted capability set).
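The equivalence of the two interfaces can be observed from user space.
The following is a minimal sketch, not part of the patch, assuming a
Linux system where SECBIT_KEEP_CAPS_LOCKED is not set; the helper name
keepcaps_visible_in_securebits() is hypothetical. Note that the
SECBIT_* constants come from linux/securebits.h.

```c
#include <sys/prctl.h>
#include <linux/securebits.h>   /* SECBIT_KEEP_CAPS */

/* Set the "keep capabilities" flag with PR_SET_KEEPCAPS, then look for
   the SECBIT_KEEP_CAPS bit in the value returned by PR_GET_SECUREBITS.
   Returns 1 if the bit is visible, 0 if it is not, -1 on error.
   Side effect: the flag stays set for the calling thread. */
int keepcaps_visible_in_securebits(void)
{
    int bits;

    if (prctl(PR_SET_KEEPCAPS, 1L, 0L, 0L, 0L) == -1)
        return -1;

    bits = prctl(PR_GET_SECUREBITS, 0L, 0L, 0L, 0L);
    if (bits == -1)
        return -1;

    return (bits & SECBIT_KEEP_CAPS) != 0;
}
```

Setting the flag through PR_SET_KEEPCAPS needs no privilege, which is
why the sketch uses it rather than PR_SET_SECUREBITS.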
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The description of SECBIT_KEEP_CAPS is misleading about the
effects on the effective capabilities of a process during a
switch to nonzero UIDs. The effective set is cleared based on
the effective UID switching to a nonzero value, even if
SECBIT_KEEP_CAPS is set. However, with this bit set, the
effective and permitted sets are not cleared if the real and
saved set-user-ID are set to nonzero values.
This was tested using the following C code and reading the kernel
source at security/commoncap.c: cap_emulate_setxuid.
#define _GNU_SOURCE             /* for getresuid() */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/capability.h>     /* libcap; link with -lcap */
#include <linux/securebits.h>

/* Print the capability sets of the calling process in text form */
void print_caps(void) {
    cap_t current = cap_get_proc();
    if (!current) {
        perror("Current caps");
        return;
    }
    char *text = cap_to_text(current, NULL);
    if (!text) {
        perror("Converting caps to text");
        goto free_caps;
    }
    printf("Capabilities: %s\n", text);
    cap_free(text);
free_caps:
    cap_free(current);
}
/* Print the real, effective, and saved set-user-ID of the process */
void print_creds(void) {
    uid_t ruid, suid, euid;
    if (getresuid(&ruid, &euid, &suid)) {
        perror("Error getting UIDs");
        return;
    }
    printf("real = %d, effective = %d, saved set-user-ID = %d\n", ruid, euid, suid);
}
/* Replace all three capability sets with exactly the given capabilities */
void set_caps(int size, const cap_value_t *caps) {
    cap_t current = cap_init();
    if (!current) {
        perror("Error getting current caps");
        return;
    }
    if (cap_clear(current)) {
        perror("Error clearing caps");
    }
    if (cap_set_flag(current, CAP_INHERITABLE, size, caps, CAP_SET)) {
        perror("setting caps");
        goto free_caps;
    }
    if (cap_set_flag(current, CAP_EFFECTIVE, size, caps, CAP_SET)) {
        perror("setting caps");
        goto free_caps;
    }
    if (cap_set_flag(current, CAP_PERMITTED, size, caps, CAP_SET)) {
        perror("setting caps");
        goto free_caps;
    }
    if (cap_set_proc(current)) {
        perror("Committing caps");
        goto free_caps;
    }
free_caps:
    cap_free(current);
}
const cap_value_t caps[] = {CAP_SETUID, CAP_SETPCAP};
const size_t num_caps = sizeof(caps) / sizeof(cap_value_t);
int main(int argc, char **argv) {
    puts("[+] Dropping most capabilities to reduce amount of console output...");
    set_caps(num_caps, caps);
    puts("[+] Dropped capabilities. Starting with these credentials and capabilities:");
    print_caps();
    print_creds();
    if (argc >= 2 && 0 == strncmp(argv[1], "keep", 4)) {
        puts("[+] Setting SECBIT_KEEP_CAPS bit");
        if (prctl(PR_SET_SECUREBITS, SECBIT_KEEP_CAPS, 0, 0, 0)) {
            perror("Setting secure bits");
            return 1;
        }
    }
    puts("[+] Setting effective UID to 1000");
    if (seteuid(1000)) {
        perror("Error setting effective UID");
        return 2;
    }
    print_caps();
    print_creds();
    puts("[+] Raising caps again");
    set_caps(num_caps, caps);
    print_caps();
    print_creds();
    puts("[+] Setting all remaining UIDs to nonzero values");
    if (setreuid(1000, 1000)) {
        perror("Error setting all UIDs to 1000");
        return 3;
    }
    print_caps();
    print_creds();
    return 0;
}
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Prefer the word "owns" rather than "associated with" when
describing the relationship between user namespaces and non-user
namespaces. The existing text used a mix of the two terms, with
"associated with" being predominant, but to my ear, describing the
relationship as "ownership" is more comprehensible.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
There is too much detail in socket(2). Move most of it into
a new page instead.
Cowritten-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Clarify the example by making an implied detail more explicit.
Quoting Troy Engel on the problem with the original text:
The problem is "and a process in a sibling cgroup (sub2)"
(shown as PID 20124 here) - how did this get here? How do I
recreate this? Following this example, there's no mention of
how, it's out of place when following the instructions.
There is nothing in any of the cgroup files which contain
this (# grep freezer /proc/*/cgroup) while at this stage.
The intent is understood, however the man page seems to skip
a step to create this in the teaching example. We should add
whatever simple steps are needed to create the "process in a
sibling cgroup" as outlined so it makes sense - as written,
I have no clue where "sibling cgroup (sub2)" came from, it
just appeared out of the blue in that step. Thanks!
See https://bugzilla.kernel.org/show_bug.cgi?id=201047
Reported-by: Troy Engel <troyengel@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The intended text was hidden elsewhere in the source of the
page as a comment.
https://bugzilla.kernel.org/show_bug.cgi?id=201029
Reported-by: Mike Weilgart <mike.weilgart@verticalsysadmin.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In particular, it is possible to write "threaded" to a
cgroup.type file if the current type is "domain threaded".
Previously, the text had implied that this was not possible.
Verified by experiment on Linux 4.15 and 4.19-rc.
Reported-by: Leah Hanson <lhanson@pivotal.io>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
After clone(CLONE_NEWPID), /proc/PID/ns/pid_for_children is empty
until the first child is created. Verified by experiment.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Some other socket options that are applicable for TCP and UDP sockets
are documented in socket(7), so help the reader by pointing them at
that page.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Thanks to a tip from Keith Packard:
https://keithp.com/blogs/fd-passing/
(Also verified by experiment.)
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
When sending ancillary data, at least one byte of real data should
also be sent. This is strictly necessary for stream sockets
(verified by experiment). It is not required for datagram sockets
on Linux (verified by experiment), but portable applications
should do so.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
If the ancillary data buffer for receiving SCM_RIGHTS file
descriptors is too small, then the excess file descriptors are
automatically closed in the receiving process. Verified by
experiment.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Verified by experiment and reading the source code (although
the SCM_RIGHTS case is not so clear to me in the source code).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
If the buffer supplied to recvmsg() to receive ancillary data is
too small, then the data is truncated and the MSG_CTRUNC flag is
set.
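The truncation can be reproduced with a small experiment. This is a
minimal sketch, assuming Linux and a UNIX domain stream socket pair;
the helper name ctrunc_on_small_buffer() is hypothetical. It sends two
descriptors (with one byte of real data) and offers the receiver
control space for only one.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Returns 1 if MSG_CTRUNC was reported by recvmsg(), 0 if it was not,
   and -1 on any setup or I/O error. */
int ctrunc_on_small_buffer(void)
{
    int sv[2], pfd[2], result = -1;
    char out = 'x', in;
    struct iovec siov = { .iov_base = &out, .iov_len = 1 };
    struct iovec riov = { .iov_base = &in, .iov_len = 1 };
    struct msghdr smsg = { 0 }, rmsg = { 0 };
    struct cmsghdr *cmsg;
    union {                     /* unions keep the buffers aligned */
        char buf[CMSG_SPACE(2 * sizeof(int))];
        struct cmsghdr align;
    } snd;
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } rcv;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    if (pipe(pfd) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Sender: both ends of the pipe as SCM_RIGHTS data */
    memset(&snd, 0, sizeof(snd));
    smsg.msg_iov = &siov;
    smsg.msg_iovlen = 1;
    smsg.msg_control = snd.buf;
    smsg.msg_controllen = sizeof(snd.buf);
    cmsg = CMSG_FIRSTHDR(&smsg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(2 * sizeof(int));
    memcpy(CMSG_DATA(cmsg), pfd, 2 * sizeof(int));

    /* Receiver: CMSG_LEN rather than CMSG_SPACE, because alignment
       padding could otherwise leave room for a second descriptor */
    rmsg.msg_iov = &riov;
    rmsg.msg_iovlen = 1;
    rmsg.msg_control = rcv.buf;
    rmsg.msg_controllen = CMSG_LEN(sizeof(int));

    if (sendmsg(sv[0], &smsg, 0) == 1 && recvmsg(sv[1], &rmsg, 0) == 1) {
        result = (rmsg.msg_flags & MSG_CTRUNC) != 0;
        cmsg = CMSG_FIRSTHDR(&rmsg);
        if (cmsg != NULL && cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS) {
            int fd;
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
            close(fd);          /* the one descriptor that did fit */
        }
    }

    close(pfd[0]);
    close(pfd[1]);
    close(sv[0]);
    close(sv[1]);
    return result;
}
```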
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The file UID does not come into play when creating a v3
security.capability extended attribute.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In particular, note that it may be difficult for an application
to know about the existence of duplicate file descriptors.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Note a useful performance benefit of EPOLLET: ensuring that
only one of multiple waiters (in epoll_wait()) is woken
up when a file descriptor becomes ready.
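The underlying edge-triggered behavior can be seen even with a single
waiter. This is a minimal single-threaded sketch, not a demonstration
of the multi-waiter wakeup itself; the helper name
second_wait_after_edge() is hypothetical. A pipe becomes readable once,
the event is collected, and a second epoll_wait() is issued without
draining the pipe.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Returns the result of the second epoll_wait() call (expected to be 0
   with EPOLLET, since no new readiness edge occurred), or -1 on error. */
int second_wait_after_edge(void)
{
    int pfd[2], epfd, n = -1;
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
    struct epoll_event got;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto close_pipe;

    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto close_epoll;
    if (write(pfd[1], "x", 1) != 1)
        goto close_epoll;

    /* First wait consumes the edge */
    if (epoll_wait(epfd, &got, 1, 0) != 1)
        goto close_epoll;

    /* Data is still unread, but there has been no new edge */
    n = epoll_wait(epfd, &got, 1, 0);

close_epoll:
    close(epfd);
close_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```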
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The parisc gateway page currently only exports 3 functions:
The lws_entry for CAS operations (at 0xb0), the set_thread_pointer
function for usage in glibc (at 0xe0) and the Linux syscall entry
(at 0x100).
All other symbols in the manpage are internal labels and
shouldn't be used directly by userspace or glibc, so drop them
from the man page documentation.
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Note ENOTDIR error that occurs when requesting a watch on a
nondirectory with IN_ONLYDIR.
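The error can be reproduced as follows. This is a minimal sketch,
assuming a writable /tmp; the helper name
onlydir_errno_for_regular_file() and the temporary file template are
hypothetical.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Request an IN_ONLYDIR watch on a regular file. Returns the errno
   value from inotify_add_watch(), 0 if the watch unexpectedly
   succeeded, or -1 if the setup failed. */
int onlydir_errno_for_regular_file(void)
{
    char path[] = "/tmp/onlydir-XXXXXX";
    int fd, ifd, result;

    fd = mkstemp(path);         /* creates a regular file */
    if (fd == -1)
        return -1;

    ifd = inotify_init1(0);
    if (ifd == -1) {
        close(fd);
        unlink(path);
        return -1;
    }

    if (inotify_add_watch(ifd, path, IN_CREATE | IN_ONLYDIR) == -1)
        result = errno;
    else
        result = 0;

    close(ifd);
    close(fd);
    unlink(path);
    return result;
}
```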
Reported-by: Paul Millar <paul.millar@desy.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add background details on ambient and bounding set when
discussing capability transformations during execve(2).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
capset(2) and capget(2) operate only on the permitted,
effective, and inheritable process capability sets.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
When comparing two namespace symlinks to see if they refer to
the same namespace, both the inode number and the device ID
should be compared. This point was already made clear in
ioctl_ns(2), but was missing from this page.
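The comparison can be sketched as below, assuming the paths are
/proc/[pid]/ns/* symbolic links that the caller is permitted to
stat(2); the helper name same_namespace() is hypothetical.

```c
#include <sys/stat.h>

/* Returns 1 if both paths refer to the same namespace, 0 if they do
   not, and -1 if either stat() call fails. */
int same_namespace(const char *path1, const char *path2)
{
    struct stat a, b;

    /* stat() follows the symlink to the namespace object itself */
    if (stat(path1, &a) == -1 || stat(path2, &b) == -1)
        return -1;

    /* Both the device ID and the inode number must match */
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

For example, same_namespace("/proc/self/ns/net", "/proc/1/ns/net")
would report whether the caller shares a network namespace with PID 1,
given sufficient permission to inspect that process.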
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
There was some confused mixing of concepts between the
two subsections, and some other details that needed fixing up.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Confirmed with Serge Hallyn that "nsroot" means UID 0
in the namespace as it would be mapped into the initial userns.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Use more consistent layout for lists of functions, and
remove punctuation from the lists to make them less cluttered.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
We define in detail the X/Open System Interfaces (i.e., _XOPEN_UNIX)
and all of the X/Open System Interfaces (XSI) Options Groups.
The XSI options groups include encryption, realtime, advanced
realtime, realtime threads, advanced realtime threads, tracing,
streams, and legacy interfaces.
Signed-off-by: Carlos O'Donell <carlos@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As noted by Rusty Russell:
I was really surprised that sendmsg() returned EBADF on a valid fd;
turns out I was using sendmsg with SCM_RIGHTS to send a closed fd,
which gives EBADF (see test program below).
But this is only obliquely referenced in unix(7):
SCM_RIGHTS
Send or receive a set of open file descriptors
from another process. The data portion contains
an integer array of the file descriptors. The
passed file descriptors behave as though they have
been created with dup(2).
EBADF is not mentioned in the unix(7) ERRORS (it's mentioned in
dup(2)).
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

int fdpass_send(int sockout, int fd)
{
    /* From the cmsg(3) manpage: */
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    struct iovec iov;
    char c = 0;
    union {         /* Ancillary data buffer, wrapped in a union
                       in order to ensure it is suitably aligned */
        char buf[CMSG_SPACE(sizeof(fd))];
        struct cmsghdr align;
    } u;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    memset(&u, 0, sizeof(u));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));
    msg.msg_name = NULL;
    msg.msg_namelen = 0;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_flags = 0;
    /* Keith Packard reports that 0-length sends don't work, so we
     * always send 1 byte. */
    iov.iov_base = &c;
    iov.iov_len = 1;
    return sendmsg(sockout, &msg, 0);
}
int fdpass_recv(int sockin)
{
    /* From the cmsg(3) manpage: */
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    struct iovec iov;
    int fd;
    char c;
    union {         /* Ancillary data buffer, wrapped in a union
                       in order to ensure it is suitably aligned */
        char buf[CMSG_SPACE(sizeof(fd))];
        struct cmsghdr align;
    } u;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    msg.msg_name = NULL;
    msg.msg_namelen = 0;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_flags = 0;
    iov.iov_base = &c;
    iov.iov_len = 1;
    if (recvmsg(sockin, &msg, 0) < 0)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg
        || cmsg->cmsg_len != CMSG_LEN(sizeof(fd))
        || cmsg->cmsg_level != SOL_SOCKET
        || cmsg->cmsg_type != SCM_RIGHTS) {
        errno = EINVAL;     /* errno values are positive */
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}
static void child(int sockfd)
{
    int newfd = fdpass_recv(sockfd);
    assert(newfd < 0);
    exit(0);
}
int main(void)
{
    int sv[2];
    int pid, ret;
    assert(socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == 0);
    pid = fork();
    if (pid == 0) {
        close(sv[1]);
        child(sv[0]);
    }
    close(sv[0]);
    /* sv[0] is now closed in the parent, so this send should fail */
    ret = fdpass_send(sv[1], sv[0]);
    printf("fdpass of bad fd return %i (%s)\n", ret, strerror(errno));
    return 0;
}
Reported-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The last argument is passed by value, not reference.
Reported-by: Tomi Salminen <tsalminen@forcepoint.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
gettimeofday() is declared obsolete by POSIX. Mention instead
the modern APIs for working with the realtime clock.
See https://bugzilla.kernel.org/show_bug.cgi?id=199049
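As an illustration of one such API, the realtime clock can be read with
clock_gettime(2) and CLOCK_REALTIME. This is a minimal sketch; the
helper name realtime_ns() is hypothetical, and the commit itself does
not prescribe a particular replacement call.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Returns the realtime clock as nanoseconds since the Epoch,
   or -1 if clock_gettime() fails. */
long long realtime_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_REALTIME, &ts) == -1)
        return -1;

    return (long long) ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```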
Reported-by: Enrique Garcia <cquike@arcor.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
/proc/[pid]/ns/pid_for_children has a value only after the first
child is created in the PID namespace. Verified by experiment.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Some notes from a conversation with Tejun Heo:
Subject: Re: cgroups(7): documenting cgroups v2 delegation
Date: Wed, 10 Jan 2018 14:27:26 -0800
From: Tejun Heo <tj@kernel.org>
> > 1. When delegating, cgroup.threads should be delegated. Doing that
> > selectively doesn't achieve anything meaningful.
>
> Understood. But surely delegating cgroup.threads is effectively
> meaningless when delegating a "domain" cgroup tree? (Obviously it's
> not harmful to delegate the cgroup.threads file in this case;
> it's just not useful to do so.)
Yeap, unless we can somehow support non-root mixed domains.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As discussed with Tejun Heo and Roman Gushchin, the
omission of this file from the list is a bug, and
is about to be fixed by a kernel patch from Roman.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
We are about to add description of a different kind
of delegation (nsdelegate) with its own subheading.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Existing cgroups under threaded root *must*, by definition,
be either domain or part of threaded subtrees, so this is not
a constraint on the creation of threaded subtrees.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The placement of a thread in the run queue for its new
priority depends on the direction of movement in priority.
(This appears to contradict POSIX, except in the case of
pthread_setschedprio().)
As reported by Andrea, and followed up by me:
> I point out that the semantics of sched_setscheduler(2) for RT threads
> indicated in sched(7) and, in particular, in
>
> "A call to sched_setscheduler(2), sched_setparam(2), or
> sched_setattr(2) will put the SCHED_FIFO (or SCHED_RR) thread
> identified by pid at the start of the list if it was runnable."
>
> does not "reflect" the current implementation of this syscall(s) that, in
> turn; based on the source, I think a more appropriate description of this
> semantics would be:
>
> "... the effect on its position in the thread list depends on the
> direction of the modification, as follows:
>
> a. if the priority is raised, the thread becomes the tail of the
> thread list.
> b. if the priority is unchanged, the thread does not change position
> in the thread list.
> c. if the priority is lowered, the thread becomes the head of the
> thread list."
>
> (copied from
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_08_04_01
> ).
So, I did some testing, and can confirm that the above is the behavior
on Linux for changes to scheduling priorities for RT processes.
(My tests consisted of creating a multithreaded process where all
threads are confined to the same CPU with taskset(), and each thread
is in a CPU-bound loop. I then manipulated their priorities with
chrt(1) and watched the CPU time being consumed with ps(1).)
Back in SUSv2 there was this text:
[[
6. If a thread whose policy or priority has been modified is a running
thread or is runnable, it then becomes the tail of the thread list for
its new priority.
]]
And certainly Linux used to behave this way. I remember testing it,
and when one looks at the Linux 2.2 source code for example, one can
see that there is a call to move_first_runqueue() in this case. At some
point, things changed, and I have not investigated exactly where that
change occurred (but I imagine it was quite a long time ago).
Looking at SUSv4, let's expand the range of your quote, since
point 7 is interesting. Here's text from Section 2.8.4
"Process Scheduling" in POSIX.1-2008/SUSv4 TC2:
[[
7. If a thread whose policy or priority has been modified other
than by pthread_setschedprio() is a running thread or is runnable,
it then becomes the tail of the thread list for its new priority.
8. If a thread whose priority has been modified by pthread_setschedprio()
is a running thread or is runnable, the effect on its position in the
thread list depends on the direction of the modification, as follows:
a. If the priority is raised, the thread becomes the tail of the
thread list.
b. If the priority is unchanged, the thread does not change position
in the thread list.
c. If the priority is lowered, the thread becomes the head of the
thread list.
]]
(Note that the preceding points mention variously sched_setscheduler(),
sched_setparam(), and pthread_setschedprio(), so that the mention of
just pthread_setschedprio() in points 7 and 8 is significant.)
Now, since chrt(1) uses sched_setscheduler(), rather than
pthread_setschedprio(), then arguably the Linux behavior is a
violation of POSIX. Indeed, buried in the man-pages source, I find
that many years ago I wrote the comment:
In 2.2.x and 2.4.x, the thread is placed at the front of the queue
In 2.0.x, the Right Thing happened: the thread went to the back -- MTK
But the Linux behavior seems reasonable to me and I'm inclined
to just document it (see the patch below).
Reported-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Logically, this section should follow the section that
describes cgroup.subtree_control.
No content changes in this patch.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
According to The Open Group Base Specifications Issue 7, RATIONALE
section of
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/netinet_in.h.html
some INADDR_* values must be converted using htonl().
INADDR_ANY and INADDR_BROADCAST are byte-order-neutral so they do
not require htonl(); however, I note this fact only in NOTES.
In the main text, I recommend using htonl(), "even if for some subset
it's not necessary".
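The recommended usage can be sketched as below, using INADDR_LOOPBACK
as the example constant; the helper names fill_loopback() and
loopback_as_text() are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Fill in an IPv4 socket address for the loopback interface.
   INADDR_LOOPBACK is in host byte order, so it is converted with
   htonl(); the port is converted with htons(). */
void fill_loopback(struct sockaddr_in *addr, unsigned short port)
{
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_port = htons(port);
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
}

/* Returns the dotted-decimal form of the address set up by
   fill_loopback(), in a static buffer, or NULL on error. */
const char *loopback_as_text(void)
{
    static char buf[INET_ADDRSTRLEN];
    struct sockaddr_in addr;

    fill_loopback(&addr, 8080);
    return inet_ntop(AF_INET, &addr.sin_addr, buf, sizeof(buf));
}
```

Without the htonl() call, a little-endian host would store the bytes in
the wrong order for the wire.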
Signed-off-by: Ricardo Biehl Pasquali <pasqualirb@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Sockets support both read(2)/write(2) and send(2)/recv(2) system
calls. Each of these is actually a family of multiple system
calls such as send(2), sendfile(2), sendmsg(2), sendmmsg(2), and
sendto(2).
This patch clarifies which families of system calls can be used.
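That the two families can be mixed on one socket can be seen in a
small experiment. This is a minimal sketch, assuming a UNIX domain
stream socket pair; the helper name write_then_recv() is hypothetical.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Send one byte with write(2) and receive it with recv(2).
   Returns the byte received, or -1 on error. */
int write_then_recv(void)
{
    int sv[2], result = -1;
    char in = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    if (write(sv[0], "A", 1) == 1 && recv(sv[1], &in, 1, 0) == 1)
        result = in;

    close(sv[0]);
    close(sv[1]);
    return result;
}
```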
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The AF_VSOCK address family has been available since Linux 3.9.
This patch adds vsock.7 and describes its use along the same lines as
existing ip.7, unix.7, and netlink.7 man pages.
CC: Jorgen Hansen <jhansen@vmware.com>
CC: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Note explicitly that SECBIT_NO_SETUID_FIXUP is relevant for
the permitted, effective, and ambient capability sets.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
See cap_emulate_setxuid():
kuid_t root_uid = make_kuid(old->user_ns, 0);
if ((uid_eq(old->uid, root_uid) ||
     uid_eq(old->euid, root_uid) ||
     uid_eq(old->suid, root_uid)) &&
    (!uid_eq(new->uid, root_uid) &&
     !uid_eq(new->euid, root_uid) &&
     !uid_eq(new->suid, root_uid))) {
    if (!issecure(SECURE_KEEP_CAPS)) {
        cap_clear(new->cap_permitted);
        cap_clear(new->cap_effective);
    }
    /*
     * Pre-ambient programs expect setresuid to nonroot followed
     * by exec to drop capabilities. We should make sure that
     * this remains the case.
     */
    cap_clear(new->cap_ambient);
}
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Two reports that the description of SO_RXQ_OVFL was wrong.
======
Commentary from Tobias:
This bug pertains to the manpage as visible on man7.org right
now.
The socket(7) man page has this paragraph:
SO_RXQ_OVFL (since Linux 2.6.33)
Indicates that an unsigned 32-bit value ancillary
message (cmsg) should be attached to received skbs
indicating the number of packets dropped by the
socket between the last received packet and this
received packet.
The second half is wrong: the counter (internally,
SOCK_SKB_CB(skb)->dropcount) is *not* reset after every packet.
That is, it is a proper counter, not a gauge, in monitoring
parlance.
A better version of that paragraph:
SO_RXQ_OVFL (since Linux 2.6.33)
Indicates that an unsigned 32-bit value ancillary
message (cmsg) should be attached to received skbs
indicating the number of packets dropped by the
socket since its creation.
======
Commentary from Petr
Generic SO_RXQ_OVFL helpers sock_skb_set_dropcount() and
sock_recv_drops() implement returning of sk->sk_drops (the total
number of dropped packets), although the documentation says the
number of dropped packets since the last received one should be
returned (quoting the current socket.7):
SO_RXQ_OVFL (since Linux 2.6.33)
Indicates that an unsigned 32-bit value ancillary message (cmsg)
should be attached to received skbs indicating the number of packets
dropped by the socket between the last received packet and this
received packet.
I assume the documentation needs to be updated, as fixing this in
the code could break programs depending on the current behavior,
although the formerly planned functionality seems to be more
useful.
The problem can be revealed with the following program:
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Return the drop count carried in the SO_RXQ_OVFL ancillary message,
   or -1 if no such message is attached */
int extract_drop(struct msghdr *msg)
{
    struct cmsghdr *cmsg;
    int rtn;
    for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg,cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SO_RXQ_OVFL) {
            memcpy(&rtn, CMSG_DATA(cmsg), sizeof rtn);
            return rtn;
        }
    }
    return -1;
}
int main(int argc, char *argv[])
{
    struct sockaddr_in addr = { .sin_family = AF_INET };
    char msg[48*1024], cmsgbuf[256];
    struct iovec iov = { .iov_base = msg, .iov_len = sizeof msg };
    int sk1, sk2, i, one = 1;
    sk1 = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    sk2 = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    addr.sin_port = htons(53333);
    bind(sk1, (struct sockaddr*)&addr, sizeof addr);
    connect(sk2, (struct sockaddr*)&addr, sizeof addr);
    // The kernel doubles this limit, but it also accounts for the SKB
    // overhead; it keeps receiving as long as at least 1 byte is free.
    i = sizeof msg;
    setsockopt(sk1, SOL_SOCKET, SO_RCVBUF, &i, sizeof i);
    setsockopt(sk1, SOL_SOCKET, SO_RXQ_OVFL, &one, sizeof one);
    for (i = 0; i < 4; i++) {
        int rtn;
        // Send three datagrams; the receive buffer cannot hold them all
        send(sk2, msg, sizeof msg, 0);
        send(sk2, msg, sizeof msg, 0);
        send(sk2, msg, sizeof msg, 0);
        do {
            struct msghdr msghdr = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = &cmsgbuf,
                .msg_controllen = sizeof cmsgbuf };
            rtn = recvmsg(sk1, &msghdr, MSG_DONTWAIT);
            if (rtn > 0) {
                printf("rtn: %d drop %d\n", rtn,
                       extract_drop(&msghdr));
            } else {
                printf("rtn: %d\n", rtn);
            }
        } while (rtn > 0);
    }
    return 0;
}
which prints
rtn: 49152 drop -1
rtn: 49152 drop -1
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 1
rtn: -1
rtn: 49152 drop 2
rtn: 49152 drop 2
rtn: -1
rtn: 49152 drop 3
rtn: 49152 drop 3
rtn: -1
although it should print (according to the documentation):
rtn: 49152 drop 0
rtn: 49152 drop 0
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 0
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 0
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 0
rtn: -1
Reported-by: Petr Malat <oss@malat.biz>
Reported-by: Tobias Klausmann <klausman@schwarzvogel.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>