Sockets support both read(2)/write(2) and send(2)/recv(2) system
calls. Each of these is actually a family of multiple system
calls such as send(2), sendfile(2), sendmsg(2), sendmmsg(2), and
sendto(2).
This patch clarifies which families of system calls can be used.
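As a minimal sketch of the point above, the two families can be mixed on the same socket: one end sends with write(2) and the other end receives with recv(2). The helper name write_then_recv() and the choice of an AF_UNIX stream socket pair are assumptions made only for this illustration.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send `len` bytes with write(2) on one end of a socket pair and
 * receive them with recv(2) on the other end.
 * Returns the number of bytes received, or -1 on error.
 * (Hypothetical helper, for illustration only.)
 */
ssize_t write_then_recv(const void *data, size_t len, void *out, size_t outlen)
{
    int sv[2];
    ssize_t n = -1;

    /* AF_UNIX stream pair: an arbitrary choice for the illustration. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* write(2) family on the sending side ... */
    if (write(sv[0], data, len) == (ssize_t) len) {
        /* ... recv(2) family on the receiving side. */
        n = recv(sv[1], out, outlen, 0);
    }

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

Swapping write(2) for send(2), or recv(2) for read(2), should behave the same here as long as no flags are needed.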
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The AF_VSOCK address family has been available since Linux 3.9.
This patch adds vsock.7 and describes its use along the same lines as
existing ip.7, unix.7, and netlink.7 man pages.
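For orientation, a client-side use of the address family might look like the sketch below. It assumes the struct sockaddr_vm definition and VMADDR_* constants from <linux/vm_sockets.h>; the helper names vsock_addr_init() and vsock_connect() are hypothetical and not part of any API.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

/*
 * Fill in a vsock address. Both cid and port are 32-bit values;
 * VMADDR_CID_HOST, for example, addresses the host.
 * (Hypothetical helper, for illustration only.)
 */
void vsock_addr_init(struct sockaddr_vm *addr, unsigned int cid,
                     unsigned int port)
{
    memset(addr, 0, sizeof(*addr));
    addr->svm_family = AF_VSOCK;
    addr->svm_cid = cid;
    addr->svm_port = port;
}

/*
 * Create a stream socket and connect it to (cid, port).
 * Returns the file descriptor, or -1 with errno set, for instance
 * when the kernel has no vsock transport available.
 */
int vsock_connect(unsigned int cid, unsigned int port)
{
    struct sockaddr_vm addr;
    int fd, saved;

    fd = socket(AF_VSOCK, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    vsock_addr_init(&addr, cid, port);
    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
        saved = errno;      /* keep connect()'s error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

A guest would typically call something like vsock_connect(VMADDR_CID_HOST, port) with a port number agreed on with the host-side listener.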
CC: Jorgen Hansen <jhansen@vmware.com>
CC: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Note explicitly that SECBIT_NO_SETUID_FIXUP is relevant for
the permitted, effective, and ambient capability sets.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
See cap_emulate_setxuid():
    kuid_t root_uid = make_kuid(old->user_ns, 0);
    if ((uid_eq(old->uid, root_uid) ||
         uid_eq(old->euid, root_uid) ||
         uid_eq(old->suid, root_uid)) &&
        (!uid_eq(new->uid, root_uid) &&
         !uid_eq(new->euid, root_uid) &&
         !uid_eq(new->suid, root_uid))) {
        if (!issecure(SECURE_KEEP_CAPS)) {
            cap_clear(new->cap_permitted);
            cap_clear(new->cap_effective);
        }
        /*
         * Pre-ambient programs expect setresuid to nonroot followed
         * by exec to drop capabilities. We should make sure that
         * this remains the case.
         */
        cap_clear(new->cap_ambient);
    }
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Two reports that the description of SO_RXQ_OVFL was wrong.
======
Commentary from Tobias:
This bug pertains to the manpage as visible on man7.org right
now.
The socket(7) man page has this paragraph:
SO_RXQ_OVFL (since Linux 2.6.33)
Indicates that an unsigned 32-bit value ancillary
message (cmsg) should be attached to received skbs
indicating the number of packets dropped by the
socket between the last received packet and this
received packet.
The second half is wrong: the counter (internally,
SOCK_SKB_CB(skb)->dropcount) is *not* reset after every packet.
That is, it is a proper counter, not a gauge, in monitoring
parlance.
A better version of that paragraph:
SO_RXQ_OVFL (since Linux 2.6.33)
Indicates that an unsigned 32-bit value ancillary
message (cmsg) should be attached to received skbs
indicating the number of packets dropped by the
socket since its creation.
======
Commentary from Petr:
The generic SO_RXQ_OVFL helpers sock_skb_set_dropcount() and
sock_recv_drops() return sk->sk_drops (the total number of
dropped packets), although the documentation says that the
number of packets dropped since the last received one should be
returned (quoting the current socket.7):
SO_RXQ_OVFL (since Linux 2.6.33)
Indicates that an unsigned 32-bit value ancillary message (cmsg)
should be attached to received skbs indicating the number of packets
dropped by the socket between the last received packet and this
received packet.
I assume the documentation needs to be updated, as fixing this in
the code could break programs depending on the current behavior,
although the formerly planned functionality seems to be more
useful.
The problem can be revealed with the following program:
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Return the SO_RXQ_OVFL value attached to msg, or -1 if none. */
int extract_drop(struct msghdr *msg)
{
    struct cmsghdr *cmsg;
    int rtn;
    for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SO_RXQ_OVFL) {
            memcpy(&rtn, CMSG_DATA(cmsg), sizeof rtn);
            return rtn;
        }
    }
    return -1;
}
int main(int argc, char *argv[])
{
    struct sockaddr_in addr = { .sin_family = AF_INET };
    char msg[48*1024], cmsgbuf[256];
    struct iovec iov = { .iov_base = msg, .iov_len = sizeof msg };
    int sk1, sk2, i, one = 1;
    sk1 = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    sk2 = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    addr.sin_port = htons(53333);
    bind(sk1, (struct sockaddr*)&addr, sizeof addr);
    connect(sk2, (struct sockaddr*)&addr, sizeof addr);
    // The kernel doubles this limit, but it also charges the SKB
    // overhead against it; a packet is accepted as long as at least
    // 1 byte is free.
    i = sizeof msg;
    setsockopt(sk1, SOL_SOCKET, SO_RCVBUF, &i, sizeof i);
    setsockopt(sk1, SOL_SOCKET, SO_RXQ_OVFL, &one, sizeof one);
    for (i = 0; i < 4; i++) {
        int rtn;
        // Three sends per round: the third one overflows the queue.
        send(sk2, msg, sizeof msg, 0);
        send(sk2, msg, sizeof msg, 0);
        send(sk2, msg, sizeof msg, 0);
        do {
            struct msghdr msghdr = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = cmsgbuf,
                .msg_controllen = sizeof cmsgbuf };
            rtn = recvmsg(sk1, &msghdr, MSG_DONTWAIT);
            if (rtn > 0) {
                printf("rtn: %d drop %d\n", rtn,
                       extract_drop(&msghdr));
            } else {
                printf("rtn: %d\n", rtn);
            }
        } while (rtn > 0);
    }
    return 0;
}
which prints
rtn: 49152 drop -1
rtn: 49152 drop -1
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 1
rtn: -1
rtn: 49152 drop 2
rtn: 49152 drop 2
rtn: -1
rtn: 49152 drop 3
rtn: 49152 drop 3
rtn: -1
although it should print (according to the documentation):
rtn: 49152 drop 0
rtn: 49152 drop 0
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 0
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 0
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 0
rtn: -1
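Because the value is cumulative, a program that wants the per-packet figure described by the old text would have to compute it itself. A minimal sketch, assuming the caller feeds in each value extracted from the ancillary message; struct drop_tracker and drops_since_last() are hypothetical names introduced only for this illustration:

```c
#include <stdint.h>

/* Tracks the last SO_RXQ_OVFL value seen on one socket. */
struct drop_tracker {
    uint32_t last;   /* previous cumulative counter value */
};

/*
 * Given the cumulative counter from the newest ancillary message,
 * return the number of drops since the previous call and remember
 * the new value. Unsigned 32-bit subtraction stays correct across
 * counter wraparound.
 */
uint32_t drops_since_last(struct drop_tracker *t, uint32_t cumulative)
{
    uint32_t delta = cumulative - t->last;

    t->last = cumulative;
    return delta;
}
```

Note that in the observed output above no ancillary message is attached while the counter is still zero, so a caller would treat a missing message as "no change" rather than passing a value in.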
Reported-by: Petr Malat <oss@malat.biz>
Reported-by: Tobias Klausmann <klausman@schwarzvogel.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Christian Brauner's patch added the Linux 4.15 details,
but we need to retain the historical details.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This patch documents the following kernel commit:
commit 6397fac4915ab3002dc15aae751455da1a852f25
Author: Christian Brauner <christian.brauner@ubuntu.com>
Date: Wed Oct 25 00:04:41 2017 +0200
userns: bump idmap limits to 340
Since Linux 4.15, the limit on the number of idmap lines has been
raised to 340.
The patch also removes the "(arbitrary)" in "There is an
(arbitrary) limit on the number of lines in the file." since the
340 line limit is well-explained by the current implementation.
The struct recording the idmaps is 12 bytes, and quite a few proc
files only allow writes up to the size of a single page, which is
4096 bytes. This leaves room for 340 idmappings (340 * 12 = 4080
bytes). The struct layout itself has been chosen very carefully
to allow for an implementation that limits the time-complexity for
the idmap codepaths to O(log n). However, I think it's unnecessary
to expose this much implementation detail to users in the man
page. So only mention this in the commit message. Furthermore,
the comment about the page size restriction is misleading. The
kernel sources show that >= page size is considered an error.
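The arithmetic above can be checked with a small sketch. The struct below is a hypothetical stand-in for the kernel's 12-byte record (three 32-bit fields; the field names are assumptions), and the 4096-byte page size is the value used in this message rather than something queried from the system:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's 12-byte idmap record. */
struct idmap_extent {
    uint32_t first;        /* assumed: first ID inside the namespace */
    uint32_t lower_first;  /* assumed: first ID outside the namespace */
    uint32_t count;        /* assumed: length of the mapped range */
};

enum { IDMAP_MAX_LINES = 340, PAGE_BYTES = 4096 };

/* Compile-time checks of the figures quoted in the text. */
_Static_assert(sizeof(struct idmap_extent) == 12,
               "record is expected to be 12 bytes");
_Static_assert(IDMAP_MAX_LINES * sizeof(struct idmap_extent) < PAGE_BYTES,
               "340 records must fit within one page");

/* Bytes needed to hold `lines` idmap records. */
size_t idmap_table_bytes(size_t lines)
{
    return lines * sizeof(struct idmap_extent);
}
```

With these assumptions, idmap_table_bytes(340) yields 4080, matching the figure in the paragraph above.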
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
People seem to be using "cf." ("confer"), which means "compare",
to mean "see" instead, for which the Latin abbreviation would be
"q.v." ("quod vide" -> "which see").
In some cases "cf." might actually be the correct term but it's
still not clear what specific aspects of a function/system call
one is supposed to be comparing.
I left one use in place in the hope of obtaining clarification,
because it looks like it might be useful there if contextualized.
Migrate these uses to English and add them to the list of
abbreviations to be avoided.
If the patch to vfork(2) is not accepted, then the cf. still needs
an \& after it because it is at the end of the line but not the
end of a sentence.
Signed-off-by: G. Branden Robinson <g.branden.robinson@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>