The cgroup.sane_behavior file returns the hard-coded value "0" and
is kept for legacy purposes. Mention this in the man-page.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The display of the /proc/PID/ns renders very wide. Make it
narrower by eliminating some nonessential info via some
awk(1) filtering.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Andrei Vagin implemented a change I suggested:
clock-IDs are now be expressed in symbolic form (e.g.,
"monotonic") instead of numeric form (e.g., 1) when reading
/proc/PID/timerns_offsets, and can be expressed either
symbolically or numerically when writing to that file.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In particular, note the ERANGE restrictions reported by
Thomas Gleixner.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
signal.7: Which signal is delivered in response to a CPU exception
is under-documented and does not always make sense. See
<https://bugzilla.kernel.org/show_bug.cgi?id=205831> for an
example where it doesn’t make sense; per the discussion there,
this cannot be changed because of backward compatibility concerns,
so let’s instead document the problem.
sigaction.2: For related reasons, the kernel doesn’t always fill
in all of the fields of the siginfo_t when delivering signals from
CPU exceptions. Document this as well. I imagine this one
_could_ be fixed, but the problem would still be relevant to
anyone using an older kernel.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The example is misleading. It is not a good idea to unlink an
existing socket because we might try to start the server multiple
times. In this case it is preferable to receive an error.
We could add code that removes the socket when the server process
is killed but that would stretch the example too far.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Note the kernel version that added SO_TIMESTAMPNS,
and (from the kernel commit) note tha SO_TIMESTAMPNS and
SO_TIMESTAMP are mutually exclusive.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
===========
DESCRIPTION
===========
I added a paragraph for ``SO_TIMESTAMP``, and modified the
paragraph for ``SIOCGSTAMP`` in relation to ``SO_TIMESTAMPNS``.
I based the documentation on the existing ``SO_TIMESTAMP``
documentation, and
on my experience using ``SO_TIMESTAMPNS``.
I asked a question on stackoverflow, which helped me understand
``SO_TIMESTAMPNS``:
https://stackoverflow.com/q/60971556/6872717
Testing of the feature being documented
=======================================
I wrote a simple server and client test.
In the client side, I connected a socket specifying
``SOCK_STREAM`` and ``"tcp"``.
Then I enabled timestamp in ns:
.. code-block:: c
int enable = 1;
if (setsockopt(sd, SOL_SOCKET, SO_TIMESTAMPNS, &enable,
sizeof(enable)))
goto err;
Then I prepared the msg header:
.. code-block:: c
char buf[BUFSIZ];
char cbuf[BUFSIZ];
struct msghdr msg;
struct iovec iov;
memset(buf, 0, ARRAY_BYTES(buf));
iov.iov_len = ARRAY_BYTES(buf) - 1;
iov.iov_base = buf;
msg.msg_name = NULL;
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = cbuf;
msg.msg_controllen = ARRAY_BYTES(cbuf);
And got some times before and after receiving the msg:
.. code-block:: c
struct timespec tm_before, tm_recvmsg, tm_after, tm_msg;
clock_gettime(CLOCK_REALTIME, &tm_before);
usleep(500000);
clock_gettime(CLOCK_REALTIME, &tm_recvmsg);
n = recvmsg(sd, &msg, MSG_WAITALL);
if (n < 0)
goto err;
usleep(1000000);
clock_gettime(CLOCK_REALTIME, &tm_after);
After that I read the timestamp of the msg:
.. code-block:: c
struct cmsghdr *cmsg;
for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
cmsg = CMSG_NXTHDR(&msg, cmsg)) {
if (cmsg->cmsg_level == SOL_SOCKET &&
cmsg->cmsg_type == SO_TIMESTAMPNS) {
memcpy(&tm_msg, CMSG_DATA(cmsg), sizeof(tm_msg));
break;
}
}
if (!cmsg)
goto err;
And finally printed the results:
.. code-block:: c
double tdiff;
printf("%s\n", buf);
tdiff = timespec_diff_ms(&tm_before, &tm_recvmsg);
printf("tm_r - tm_b = %lf ms\n", tdiff);
tdiff = timespec_diff_ms(&tm_before, &tm_after);
printf("tm_a - tm_b = %lf ms\n", tdiff);
tdiff = timespec_diff_ms(&tm_before, &tm_msg);
printf("tm_m - tm_b = %lf ms\n", tdiff);
Which printed:
::
asdasdfasdfasdfadfgdfghfthgujty 6, 0;
tm_r - tm_b = 500.000000 ms
tm_a - tm_b = 1500.000000 ms
tm_m - tm_b = 18.000000 ms
System:
::
Linux debian 5.4.0-4-amd64 #1 SMP Debian 5.4.19-1 (2020-02-13) x86_64
GNU/Linux
gcc (Debian 9.3.0-8) 9.3.0
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Linux 5.6 added the new well-known VMADDR_CID_LOCAL for
local communication.
This patch explains how to use it and removes the legacy
VMADDR_CID_RESERVED no longer available.
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add a '.RE' macro to terminate the last .RS block.
There is no change in the output.
Signed-off-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In many cases, these don't improve readability, and (when stacked)
they sometimes have the side effect of sometimes forcing text
to be justified within a narrow column range.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
PVS-Studio reports that in
char buf[8192];
/* ... */
nh = (struct nlmsghdr *) buf,
the pointer 'buf' is cast to a more strictly aligned pointer type.
This is undefined behaviour. One possible solution to make sure
that buf is correctly aligned is to declare buf as an array of
struct nlmsghdr. Other solutions include allocating the array on
the heap, use an union, or stdalign features. With this patch,
the buffer still contains 8192 bytes.
This was raised on Stack Overflow:
https://stackoverflow.com/questions/57745580/netlink-receive-buffer-alignment
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The definition of the tpacket_auxdata struct in the manpage is not
the same as the definition found in
/include/uapi/linux/if_packet.h.
In particular, instead of a tp_padding field, there is a
tp_vlan_tpid field. An example of a project using this field is
libpcap[1].
[1]: https://github.com/the-tcpdump-group/libpcap/blob/master/pcap-linux.c#L349
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The structure 'struct sockaddr_vm' has additional element
'unsigned char svm_zero[]' since version v3.9-rc1
(include/uapi/linux/vm_sockets.h). Linux kernel checks that this
element is zeroed (net/vmw_vsock/vsock_addr.c). Reflect this on
the vsock man page.
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=205583
Signed-off-by: Mikhail Golubev <Mikhail.Golubev@opensynergy.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In the given example, the second recvmsg(2) call should receive four bytes,
as the third sendmsg(2) call only sends four.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Make the page more compact by removing the stub subsections that
list the manual pages for the namespace types. And while we're
here, add an explanation of the table columns.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Eric Biederman:
I hate to nitpick, but I am going to say that when I read
the text above the phrase "mount namespace of the process
that created the new mount namespace" feels wrong.
Either you use unshare(2) and the mount namespace of the
process that created the mount namespace changes.
Or you use clone(2) and you could argue it is the new child
that created the mount namespace.
Having a different mount namespace at the end of the
creation operation feels like it makes your phrase confusing
about what the starting mount namespace is. I hate to use
references that are ambiguous when things are changing.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Provide a more detailed explanation of the initialization of
the mount point list in a new mount namespace.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The current text talks about "parent mount namespaces", but there
is no such concept. As confirmed by Eric Biederman, what is mean
here is "the mount namespace this mount namespace started as a
copy of". So, this change writes up Eric's description in a more
detailed way.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
After creating a new mount namespace, it may be desirable to
disable mount propagation. Give the reader a more explicit
hint about this.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In a recent conversation with Mathieu Desnoyers I was reminded
that we haven't written up anything about how deferred
cancellation and asynchronous signal handlers interact. Mathieu
ran into some of this behaviour and I promised to improve the
documentation in this area to point out the potential pitfall.
Thoughts?
8< --- 8< --- 8<
In pthread_setcancelstate.3, pthreads.7, and signal-safety.7 we
describe that if you have an asynchronous signal nesting over a
deferred cancellation region that any cancellation point in the
signal handler may trigger a cancellation that will behave
as-if it was an asynchronous cancellation. This asynchronous
cancellation may have unexpected effects on the consistency of
the application. Therefore care should be taken with asynchronous
signals and deferred cancellation.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
If a mount point is deleted or renamed or removed in one mount
namespace, this will cause an object that is mounted at that
location in another mount namespace to be unmounted (as verified
by experiment). This was implied by the existing text, but it is
better to make this detail explicit.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
See fs/xattr.c::xattr_permission()"
/*
* In the user.* namespace, only regular files and directories can have
* extended attributes. For sticky directories, only the owner and
* privileged users can write attributes.
*/
if (!strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN)) {
if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
return (mask & MAY_WRITE) ? -EPERM : -ENODATA;
if (S_ISDIR(inode->i_mode) && (inode->i_mode & S_ISVTX) &&
(mask & MAY_WRITE) && !inode_owner_or_capable(inode))
return -EPERM;
}
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
If the file descriptors received in SCM_RIGHTS would cause
the process to its exceed RLIMIT_NOFILE limit, the excess
FDs are discarded.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>