We are about to add a description of a different kind
of delegation (nsdelegate) with its own subheading.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Existing cgroups under threaded root *must*, by definition,
be either domain or part of threaded subtrees, so this is not
a constraint on the creation of threaded subtrees.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The placement of a thread in the run queue for its new
priority depends on the direction of movement in priority.
(This appears to contradict POSIX, except in the case of
pthread_setschedprio().)
As reported by Andrea, and followed up by me:
> I point out that the semantics of sched_setscheduler(2) for RT threads
> indicated in sched(7) and, in particular, in
>
> "A call to sched_setscheduler(2), sched_setparam(2), or
> sched_setattr(2) will put the SCHED_FIFO (or SCHED_RR) thread
> identified by pid at the start of the list if it was runnable."
>
> does not "reflect" the current implementation of these syscalls;
> based on the source, I think a more appropriate description of the
> semantics would be:
>
> "... the effect on its position in the thread list depends on the
> direction of the modification, as follows:
>
> a. if the priority is raised, the thread becomes the tail of the
> thread list.
> b. if the priority is unchanged, the thread does not change position
> in the thread list.
> c. if the priority is lowered, the thread becomes the head of the
> thread list."
>
> (copied from
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_08_04_01
> ).
So, I did some testing, and can confirm that the above is the behavior
on Linux for changes to scheduling priorities for RT processes.
(My tests consisted of creating a multithreaded process where all
threads are confined to the same CPU with taskset(), and each thread
is in a CPU-bound loop. I then manipulated their priorities with
chrt(1) and watched the CPU time being consumed with ps(1).)
Back in SUSv2 there was this text:
[[
6. If a thread whose policy or priority has been modified is a running
thread or is runnable, it then becomes the tail of the thread list for
its new priority.
]]
And certainly Linux used to behave this way. I remember testing it,
and when one looks at the Linux 2.2 source code for example, one can
see that there is a call to move_first_runqueue() in this case. At some
point, things changed, and I have not investigated exactly where that
change occurred (but I imagine it was quite a long time ago).
Looking at SUSv4, let's expand the range of your quote, since
point 7 is interesting. Here's text from Section 2.8.4
"Process Scheduling" in POSIX.1-2008/SUSv4 TC2:
[[
7. If a thread whose policy or priority has been modified other
than by pthread_setschedprio() is a running thread or is runnable,
it then becomes the tail of the thread list for its new priority.
8. If a thread whose priority has been modified by pthread_setschedprio()
is a running thread or is runnable, the effect on its position in the
thread list depends on the direction of the modification, as follows:
a. If the priority is raised, the thread becomes the tail of the
thread list.
b. If the priority is unchanged, the thread does not change position
in the thread list.
c. If the priority is lowered, the thread becomes the head of the
thread list.
]]
(Note that the preceding points mention variously sched_setscheduler(),
sched_setparam(), and pthread_setschedprio(), so that the mention of
just pthread_setschedprio() in points 7 and 8 is significant.)
Now, since chrt(1) uses sched_setscheduler(), rather than
pthread_setschedprio(), then arguably the Linux behavior is a
violation of POSIX. (Indeed, buried in the man-pages source, I find
that many years ago I wrote the comment:
    In 2.2.x and 2.4.x, the thread is placed at the front of the queue
    In 2.0.x, the Right Thing happened: the thread went to the back -- MTK
)
But the Linux behavior seems reasonable to me and I'm inclined
to just document it (see the patch below).
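The rule confirmed by the tests above can be summarized as a small
decision function. This is only an illustrative model, not kernel code;
the names rq_move and rq_placement() are hypothetical.

```c
/* Hypothetical names, used only for this illustration. */
enum rq_move { RQ_TAIL, RQ_UNCHANGED, RQ_HEAD };

/*
 * Model of the observed Linux behavior when the priority of a runnable
 * SCHED_FIFO or SCHED_RR thread is changed from old_prio to new_prio.
 */
enum rq_move rq_placement(int old_prio, int new_prio)
{
    if (new_prio > old_prio)
        return RQ_TAIL;         /* raised: tail of the list for new priority */
    if (new_prio < old_prio)
        return RQ_HEAD;         /* lowered: head of the list for new priority */
    return RQ_UNCHANGED;        /* unchanged: position in the list is kept */
}
```

For example, rq_placement(10, 20) yields RQ_TAIL, which matches case (a)
in the quoted POSIX text.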
Reported-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Logically, this section should follow the section that
describes cgroup.subtree_control.
No content changes in this patch.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
According to The Open Group Base Specifications Issue 7, RATIONALE
section of
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/netinet_in.h.html
some INADDR_* values must be converted using htonl().
INADDR_ANY and INADDR_BROADCAST are byte-order-neutral, so they do
not require htonl(); however, I only mention this fact in NOTES.
In the text I recommend using htonl(), "even if for some subset
it's not necessary".
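A minimal sketch of the recommended usage follows. The helper name
fill_loopback() is hypothetical; INADDR_LOOPBACK is one of the values
that is not byte-order-neutral and so does need the conversion.

```c
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Fill *addr with the IPv4 loopback address and the given port.
 * The port is supplied in host byte order.
 */
void fill_loopback(struct sockaddr_in *addr, unsigned short port)
{
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_port = htons(port);
    /* INADDR_* constants are in host byte order; convert before storing. */
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
}
```

Applying htonl() to INADDR_ANY or INADDR_BROADCAST in the same way is
harmless, which is why the uniform recommendation is safe.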
Signed-off-by: Ricardo Biehl Pasquali <pasqualirb@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Sockets support both read(2)/write(2) and send(2)/recv(2) system
calls. Each of these is actually a family of multiple system
calls such as send(2), sendfile(2), sendmsg(2), sendmmsg(2), and
sendto(2).
This patch clarifies which families of system calls can be used.
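As a small illustration of mixing the two families on one connection,
the sketch below sends with write(2) and receives with recv(2) over a
socket pair. The helper name write_then_recv() is hypothetical.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send msg with write(2) on one end of a UNIX domain socket pair and
 * read it back with recv(2) on the other end.
 * Returns the number of bytes received, or -1 on error.
 */
ssize_t write_then_recv(const char *msg, char *buf, size_t buflen)
{
    int sv[2];
    ssize_t n = -1;
    size_t len = strlen(msg);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* The generic file I/O call works on a socket file descriptor... */
    if (write(sv[0], msg, len) == (ssize_t) len)
        /* ...and the peer may use the socket-specific call to receive. */
        n = recv(sv[1], buf, buflen, 0);

    close(sv[0]);
    close(sv[1]);
    return n;
}
```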
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The AF_VSOCK address family has been available since Linux 3.9.
This patch adds vsock.7 and describes its use along the same lines as
existing ip.7, unix.7, and netlink.7 man pages.
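For orientation, here is a sketch of how an AF_VSOCK address might be
filled in, assuming the struct sockaddr_vm definition from
<linux/vm_sockets.h>. The helper name fill_vsock_addr() is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

/*
 * Fill *addr with a vsock address made up of a context ID and a port.
 * Both values are in host byte order.
 */
void fill_vsock_addr(struct sockaddr_vm *addr, unsigned int cid,
                     unsigned int port)
{
    memset(addr, 0, sizeof(*addr));
    addr->svm_family = AF_VSOCK;
    addr->svm_cid = cid;
    addr->svm_port = port;
}
```

The resulting structure would then be passed to connect(2) or bind(2)
on a socket created with socket(AF_VSOCK, SOCK_STREAM, 0), in the same
way as a struct sockaddr_in is used with AF_INET.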
CC: Jorgen Hansen <jhansen@vmware.com>
CC: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Note explicitly that SECBIT_NO_SETUID_FIXUP is relevant for
the permitted, effective, and ambient capability sets.
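A caller can check whether the flag is in effect by inspecting the
securebits value. The sketch below assumes the definitions from
<linux/securebits.h>; both helper names are hypothetical.

```c
#include <sys/prctl.h>
#include <linux/securebits.h>

/* Return 1 if SECBIT_NO_SETUID_FIXUP is set in securebits, else 0. */
int no_setuid_fixup_set(unsigned long securebits)
{
    return (securebits & SECBIT_NO_SETUID_FIXUP) != 0;
}

/*
 * Query the securebits of the calling thread.
 * Returns 1 or 0 as above, or -1 if prctl(2) fails.
 */
int current_no_setuid_fixup(void)
{
    int bits = prctl(PR_GET_SECUREBITS);

    if (bits == -1)
        return -1;
    return no_setuid_fixup_set((unsigned long) bits);
}
```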
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
See cap_emulate_setxuid():
    kuid_t root_uid = make_kuid(old->user_ns, 0);

    if ((uid_eq(old->uid, root_uid) ||
         uid_eq(old->euid, root_uid) ||
         uid_eq(old->suid, root_uid)) &&
        (!uid_eq(new->uid, root_uid) &&
         !uid_eq(new->euid, root_uid) &&
         !uid_eq(new->suid, root_uid))) {
            if (!issecure(SECURE_KEEP_CAPS)) {
                    cap_clear(new->cap_permitted);
                    cap_clear(new->cap_effective);
            }

            /*
             * Pre-ambient programs expect setresuid to nonroot followed
             * by exec to drop capabilities. We should make sure that
             * this remains the case.
             */
            cap_clear(new->cap_ambient);
    }
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Two reports that the description of SO_RXQ_OVFL was wrong.
======
Commentary from Tobias:
This bug pertains to the manpage as visible on man7.org right
now.
The socket(7) man page has this paragraph:
SO_RXQ_OVFL (since Linux 2.6.33)
Indicates that an unsigned 32-bit value ancillary
message (cmsg) should be attached to received skbs
indicating the number of packets dropped by the
socket between the last received packet and this
received packet.
The second half is wrong: the counter (internally,
SOCK_SKB_CB(skb)->dropcount) is *not* reset after every packet.
That is, it is a proper counter, not a gauge, in monitoring
parlance.
A better version of that paragraph:
SO_RXQ_OVFL (since Linux 2.6.33)
Indicates that an unsigned 32-bit value ancillary
message (cmsg) should be attached to received skbs
indicating the number of packets dropped by the
socket since its creation.
======
Commentary from Petr
Generic SO_RXQ_OVFL helpers sock_skb_set_dropcount() and
sock_recv_drops() implement returning of sk->sk_drops (the total
number of dropped packets), although the documentation says the
number of dropped packets since the last received one should be
returned (quoting the current socket.7):
SO_RXQ_OVFL (since Linux 2.6.33)
Indicates that an unsigned 32-bit value ancillary message (cmsg)
should be attached to received skbs indicating the number of packets
dropped by the socket between the last received packet and this
received packet.
I assume the documentation needs to be updated, as fixing this in
the code could break programs depending on the current behavior,
although the formerly planned functionality seems to be more
useful.
The problem can be revealed with the following program:
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Return the SO_RXQ_OVFL value attached to msg, or -1 if absent. */
    int extract_drop(struct msghdr *msg)
    {
        struct cmsghdr *cmsg;
        int rtn;

        for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
            if (cmsg->cmsg_level == SOL_SOCKET &&
                cmsg->cmsg_type == SO_RXQ_OVFL) {
                memcpy(&rtn, CMSG_DATA(cmsg), sizeof rtn);
                return rtn;
            }
        }
        return -1;
    }

    int main(int argc, char *argv[])
    {
        struct sockaddr_in addr = { .sin_family = AF_INET };
        char msg[48*1024], cmsgbuf[256];
        struct iovec iov = { .iov_base = msg, .iov_len = sizeof msg };
        int sk1, sk2, i, one = 1;

        sk1 = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
        sk2 = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);

        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
        addr.sin_port = htons(53333);

        bind(sk1, (struct sockaddr*)&addr, sizeof addr);
        connect(sk2, (struct sockaddr*)&addr, sizeof addr);

        // Kernel doubles this limit, but it accounts also the SKB overhead,
        // but it receives as long as there is at least 1 byte free.
        i = sizeof msg;
        setsockopt(sk1, SOL_SOCKET, SO_RCVBUF, &i, sizeof i);
        setsockopt(sk1, SOL_SOCKET, SO_RXQ_OVFL, &one, sizeof one);

        for (i = 0; i < 4; i++) {
            int rtn;

            // Send three datagrams; only two fit in the receive buffer.
            send(sk2, msg, sizeof msg, 0);
            send(sk2, msg, sizeof msg, 0);
            send(sk2, msg, sizeof msg, 0);

            // Drain the receive queue, printing the drop count each time.
            do {
                struct msghdr msghdr = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = &cmsgbuf,
                    .msg_controllen = sizeof cmsgbuf };

                rtn = recvmsg(sk1, &msghdr, MSG_DONTWAIT);
                if (rtn > 0) {
                    printf("rtn: %d drop %d\n", rtn,
                           extract_drop(&msghdr));
                } else {
                    printf("rtn: %d\n", rtn);
                }
            } while (rtn > 0);
        }
        return 0;
    }
which prints
rtn: 49152 drop -1
rtn: 49152 drop -1
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 1
rtn: -1
rtn: 49152 drop 2
rtn: 49152 drop 2
rtn: -1
rtn: 49152 drop 3
rtn: 49152 drop 3
rtn: -1
although it should print (according to the documentation):
rtn: 49152 drop 0
rtn: 49152 drop 0
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 0
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 0
rtn: -1
rtn: 49152 drop 1
rtn: 49152 drop 0
rtn: -1
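Since the value is cumulative, a receiver that wants the number of
drops between two received packets has to compute the difference
itself. A minimal sketch, assuming the counter is the unsigned 32-bit
value described in the man page text (the helper name
drops_since_last() is hypothetical):

```c
#include <stdint.h>

/*
 * Given the SO_RXQ_OVFL totals seen with the previous and the current
 * packet, return the number of packets dropped in between.
 * Unsigned arithmetic gives the right answer across a 32-bit wraparound.
 */
uint32_t drops_since_last(uint32_t prev_total, uint32_t cur_total)
{
    return cur_total - prev_total;
}
```

Note that, as the first two output lines above show, no ancillary
message may be present while the total is still zero, so callers would
need to treat a missing message as a total of 0.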
Reported-by: Petr Malat <oss@malat.biz>
Reported-by: Tobias Klausmann <klausman@schwarzvogel.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Christian Brauner's patch added the Linux 4.15 details,
but we need to retain the historical details.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This patch documents the following kernel commit:
commit 6397fac4915ab3002dc15aae751455da1a852f25
Author: Christian Brauner <christian.brauner@ubuntu.com>
Date: Wed Oct 25 00:04:41 2017 +0200
userns: bump idmap limits to 340
Since Linux 4.15 the number of idmap lines has been bumped to 340.
The patch also removes the "(arbitrary)" in "There is an
(arbitrary) limit on the number of lines in the file." since the
340 line limit is well-explained by the current implementation.
The struct recording the idmaps is 12 bytes, and quite a few proc
files only allow writes of up to a single page in size, which is
4096 bytes. This leaves room for 340 idmappings (340 * 12 = 4080
bytes). The struct layout itself has been chosen very carefully
to allow for an implementation that limits the time-complexity for
the idmap codepaths to O(log n). However, I think it's unnecessary
to expose this much implementation detail to users in the man
page. So only mention this in the commit message. Furthermore,
the comment about the page size restriction is misleading. The
kernel sources show that >= page size is considered an error.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
People seem to be using "cf." ("confer"), which means "compare",
to mean "see" instead, for which the Latin abbreviation would be
"q.v." ("quod vide" -> "which see").
In some cases "cf." might actually be the correct term but it's
still not clear what specific aspects of a function/system call
one is supposed to be comparing.
I left one use in place in hope of obtaining clarification,
because it looks like it might be useful there, if contextualized.
Migrate these uses to English and add them to the list of
abbreviations to be avoided.
If the patch to vfork(2) is not accepted, then the cf. still needs
an \& after it because it is at the end of the line but not the
end of a sentence.
Signed-off-by: G. Branden Robinson <g.branden.robinson@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
When referring to the architecture, consistently use "x86-64",
not "x86_64". Hitherto, there was a mixture of usages, with
"x86-64" predominant.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The location has been changed in Linux commit
v4.10-rc1~40^2~86^2~4.
* man7/unicode.7 (.SS Private Use Areas (PUA)): Amend pointer to
Documentation/unicode.txt with change introduced in Linux 4.10
(move to Documentation/admin-guide/unicode.rst).
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The file has been moved in Linux commit v2.6.29-rc2~47.
* man7/cpuset.7 (.SH SEE ALSO): Add information about the location
of cpusets.txt since Linux 2.6.29.
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Linux commit v4.10-rc1~40^2~86^2~4 moves initrd documentation from
Documentation/initrd.txt to Documentation/admin-guide/initrd.rst.
* man4/initrd.4 (.SS Changing the normal root filesystem,
.SH SEE ALSO): Amend pointer to in-kernel initrd documentation
with change introduced in Linux 4.10 (move to
Documentation/admin-guide/initrd.rst).
* man5/proc.5 (.SS Files and directories)
<.TP .I /proc/sys/kernel/real-root-dev>: Likewise.
* man7/bootparam.7 (.SS Boot arguments for ramdisk use)
<.TP .B 'noinitrd'>: Likewise.
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In the 4.13 release cycle, key management documentation has been
reformatted to reStructuredText and moved to a separate
"keys" directory.
Relevant kernel commits: v4.13-rc1~34^2~27, v4.13-rc1~34^2~25
* man2/add_key.2 (.SH SEE ALSO): Amend pointers to
Documentation/security/keys.txt and Documentation/keys-request-key.txt
with changes introduced in Linux 4.13 (Documentation/keys/core.rst and
Documentation/keys/request-key.rst).
* man2/request_key.2 (.SH SEE ALSO): Likewise.
* man7/keyrings.7 (.SH SEE ALSO): Likewise.
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>