And while we are at it, remove a sentence that makes an obvious
point (that mremap() uses the Linux page table scheme).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Details of glibc 2.4, which is by now fairly old, would be
better at the end of NOTES than at the start.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
From a mailing list conversation:
On 5/24/18 9:03 PM, Heinrich Schuchardt wrote:
> Hello Michael,
>
> in the mmap(2) man page MAP_ANON is described as deprecated.
>
> When I look at the NetBSD manpage
> http://netbsd.gw.com/cgi-bin/man-cgi?mmap+2+NetBSD-current
> I found that MAP_ANONYMOUS is not defined.
>
> https://www.dragonflybsd.org/cgi/web-man?command=mmap§ion=2
> indicates MAP_ANONYMOUS is an alias for MAP_ANON and is provided for
> compatibility.
>
> https://man.openbsd.org/mmap.2 also knows MAP_ANONYMOUS as a synonym.
>
> https://www.unix.com/man-page/osx/2/mmap/ does not know MAP_ANONYMOUS.
>
> So shouldn't the man page indicate that MAP_ANON is to be favored to
> write portable code? And correspondingly mark MAP_ANONYMOUS as synonym
> only kept for compatibility.
>
> The Open Group Base Specifications Issue 7, 2018 Edition does not
> reference either of both. So both values are not POSIX but it is not
> correct to describe them as Linux only.
The text saying that MAP_ANON is deprecated is ancient (at least
20 years old). I don't know why that text was added.
Things are not simple though: it looks like there's at least
one historical implementation (HP-US) that defines MAP_ANONYMOUS
but not MAP_ANON.
I've applied the patch below.
Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As noted by Heinrich Schuchardt:
he list contains hex values of different constants. I just wonder for
which architecture (alpha, i386, mips, or sparc at that time). No
information is supplied.
Current values depend on the architecture, e.g.
On amd64
0x82307201 VFAT_IOCTL_READDIR_BOTH
0x82307202 VFAT_IOCTL_READDIR_SHORT
0x80047210 FAT_IOCTL_GET_ATTRIBUTES
0x40047211 FAT_IOCTL_SET_ATTRIBUTES
0x80047213 FAT_IOCTL_GET_VOLUME_ID
On mips
0x42187201 VFAT_IOCTL_READDIR_BOTH
0x42187202 VFAT_IOCTL_READDIR_SHORT
0x40047210 FAT_IOCTL_GET_ATTRIBUTES
0x80047211 FAT_IOCTL_SET_ATTRIBUTES
0x40047213 FAT_IOCTL_GET_VOLUME_ID
Hence hex values should be removed.
Reported-by:
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Linux kernel commit 337684a1746f "fs: return EPERM on immutable
inode" changed (nd unified the return value of the utimensat(2)
from -EACCES to -EPERM in case of an immutable flag. Modify the
man page to reflect the same.
The entire discussion of returning the correct return value is at:
http://lists.linux.it/pipermail/ltp/2017-January/003424.html
[mtk: The change was in Linux 4.8]
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The sysctls fs.protected_fifos and fs.protected_regular can cause
open(2) to fail with EACCES (see Documentation/sysctl/fs.rst for
details.)
Signed-off-by: Joseph C. Sible <josephcsible@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As far as I can see, /proc/sys/fs/aio-max-nr is a
system-wide limit, not a per-user limit. This seems to be
confirmed by comments in fs/aio.c (Linux 5.6) sources).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Current code ignores the MPOL_MF_STRICT when handling hugetlb
mapping, now patch([1]) handles MPOL_MF_STRICT in same semantic as
other mapping. So, we can remove the note about 'MPOL_MF_STRICT
is ignored on huge page mappings', and no changes to other part of
man-page.
[1] https://lore.kernel.org/linux-mm/1581559627-6206-1-git-send-email-lixinhai.lxh@gmail.com/
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
A new feature of one-shoot polling through io_submit was
introduced in bfe4037e722ec commit. Keep things up-to-date due
to changes in linux/aio_abi.h.
Signed-off-by: Julia Suvorova <jusual@mail.ru>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
signal.7: Which signal is delivered in response to a CPU exception
is under-documented and does not always make sense. See
<https://bugzilla.kernel.org/show_bug.cgi?id=205831> for an
example where it doesn’t make sense; per the discussion there,
this cannot be changed because of backward compatibility concerns,
so let’s instead document the problem.
sigaction.2: For related reasons, the kernel doesn’t always fill
in all of the fields of the siginfo_t when delivering signals from
CPU exceptions. Document this as well. I imagine this one
_could_ be fixed, but the problem would still be relevant to
anyone using an older kernel.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
After a comment from Matthew Bobrowski:
Although, I would just have to point out that it doesn't
necessarily have to be a "script" file, but rather a file of
any type that can have its contents interpreted, which then
results in a form of program execution i.e.
$ /usr/lib64/ld-linux-x86-64.so.2 ./foo
In this case, foo is not a "script" file.
Reported-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Back in 2011, a mail from Andrea Arcangeli noted some details
that I never got round to incorporating into the manual page.
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This subcommand was added a few years ago to support cpuid emulation
on x86 targets, but no changes to the man page appear to have been
made at the time. This commit adds a description for it and the
corresponding getter.
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Update the details on AF_UNSPEC and circumstances in which
socket can be reconnected.
From a mail conversation with Eric Dumazet:
> connect() man page seems obsolete or confusing :
>
> Generally, connection-based protocol sockets may successfully
> connect() only once; connectionless protocol sockets may use
> connect() multiple times to change their association.
> Connectionless sockets may dissolve the association by connecting to
> an address with the sa_family member of sockaddr set to AF_UNSPEC
> (supported on Linux since kernel 2.2).>
>
> 1) At least TCP has supported AF_UNSPEC thing forever.
> 2) By definition connectionless sockets do not have an association,
> why would they call connect(AF_UNSPEC) to remove a connection
> which does not exist ...
Calling connect() on a connectionless socket serves two purposes:
a) Assigns a default outgoing address for datagrams (sent using write(2)).
b) Causes datagrams sent from sources other than the peer address to be
discarded.
Both of these things are true in AF_UNIX and the Internet domains.
Using connect(AF_UNSPEC) allows the local datagram socket to clear
this association (without having to connect() to a *different*
peer), so that now it can send datagrams to any peer and receive
datagrams for any peer, (I've just retested all of this.)
>
> Maybe we should rewrite this paragraph to match reality, since
> this causes confusion.
>
>
> Some protocol sockets may successfully connect() only once.
> Some protocol sockets may use connect() multiple times to change
> their association.
> Some protocol sockets may dissolve the association by connecting to
> an address with the sa_family member of sockaddr set to AF_UNSPEC
> (supported on Linux since kernel 2.2).
When I first saw your note, I was afraid that I had written
the offending text. But, I see it has been there since the
manual page was first added in 1992 (other than the piece
"(supported since on Linux since kernel 2.2)", which I added in
2007). Perhaps it was true in 1992.
Anyway, I confirm your statement about TCP sockets. The
connect(AF_UNSPEC) thing works; thereafter, the socket may be
connected to another socket.
Interestingly, connect(AF_UNSPEC) does not seem to work for
UNIX domain stream sockets. (My light testing gives an EINVAL
error on connect(AF_UNSPEC) of an already connected UNIX stream
socket. I could not easily spot where this error was being
generated in the kernel though.)
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Quoting Matthew Wilcox:
The current text of the lseek manpage is ambiguous about
the behaviour of lseek(SEEK_DATA) for a file which is
entirely a hole (or the end of the file is a hole and the
pos lies within the hole). The draft POSIX language is
specific (ENXIO is returned when whence is SEEK_DATA and
offset lies within the final hole of the file). Could I
trouble you to wordsmith that in?
If you want to look at the draft POSIX text, it's here:
https://www.austingroupbugs.net/view.php?id=415
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
See Linux source as of v5.4:
kernel/time/posix-clock.c
Signed-off-by: Eric Rannaud <e@nanocritical.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
From email discussions with Thomas Gleixner:
======
Hello Thomas, et al,
Following on from our discussion of read() on a timerfd [1], I
happened to remember a Debian bug report [2] that points out that
timer_settime() can fail with the error ECANCELED, which is both
surprising and odd (because despite the error, the timer does get
updated).
The relevant kernel code (I think, from your commit [3]) seems to be
the following in timerfd_setup():
if (texp != 0) {
if (flags & TFD_TIMER_ABSTIME)
texp = timens_ktime_to_host(clockid, texp);
if (isalarm(ctx)) {
if (flags & TFD_TIMER_ABSTIME)
alarm_start(&ctx->t.alarm, texp);
else
alarm_start_relative(&ctx->t.alarm, texp);
} else {
hrtimer_start(&ctx->t.tmr, texp, htmode);
}
if (timerfd_canceled(ctx))
return -ECANCELED;
}
Using a small test program [4] shows the behavior. The program loops,
repeatedly calling timerfd_settime() (with a delay of a few seconds
before each call). In another terminal window, enter the following
command a few times:
$ sudo date -s "5 seconds" # Add 5 secs to wall-clock time
I see behavior as follows (the /sudo date -s "5 seconds"/ command was
executed before loop iterations 0, 2, and 4):
[[
$ ./timerfd_settime_ECANCELED
0
Current time is 1585729978 secs, 868510078 nsecs
Timer value is now 0 secs, 0 nsecs
timerfd_settime() succeeded
Timer value is now 9 secs, 999991977 nsecs
1
Current time is 1585729982 secs, 716339545 nsecs
Timer value is now 6 secs, 152167990 nsecs
timerfd_settime() succeeded
Timer value is now 9 secs, 999992940 nsecs
2
Current time is 1585729991 secs, 567377831 nsecs
Timer value is now 1 secs, 148959376 nsecs
timerfd_settime: Operation canceled
Timer value is now 9 secs, 999976294 nsecs
3
Current time is 1585729995 secs, 405385503 nsecs
Timer value is now 6 secs, 161989917 nsecs
timerfd_settime() succeeded
Timer value is now 9 secs, 999993317 nsecs
4
Current time is 1585730004 secs, 225036165 nsecs
Timer value is now 1 secs, 180346909 nsecs
timerfd_settime: Operation canceled
Timer value is now 9 secs, 999984345 nsecs
]]
I note from the above.
(1) If the wall-clock is changed before the first timerfd_settime()
call, the call succeeds. This is of course expected.
(2) If the wall-clock is changed after a timerfd_settime() call, then
the next timerfd_settime() call fails with ECANCELED.
(3) Even if the timerfd_settime() call fails, the timer is still updated(!).
Some questions:
(a) What is the rationale for timerfd_settime() failing with ECANCELED
in this case? (Currently, the manual page says nothing about this.)
(b) It seems at the least surprising, but more likely a bug, that
timerfd_settime() fails with ECANCELED while at the same time
successfully updating the timer value.
Your thoughts?
Thanks,
Michael
[1] https://lore.kernel.org/lkml/3cbd0919-c82a-cb21-c10f-0498433ba5d1@gmail.com/
[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=947091
[3]
commit 99ee5315dac6211e972fa3f23bcc9a0343ff58c4
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Wed Apr 27 14:16:42 2011 +0200
timerfd: Allow timers to be cancelled when clock was set
[4]
/* timerfd_settime_ECANCELED.c */
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <inttypes.h>
#include <sys/timerfd.h>
#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
int
main(int argc, char *argv[])
{
struct itimerspec ts, gts;
struct timespec start;
int tfd = timerfd_create(CLOCK_REALTIME, 0);
if (tfd == -1)
errExit("timerfd_create");
ts.it_interval.tv_sec = 0;
ts.it_interval.tv_nsec = 10;
int flags = TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET;
for (long j ; ; j++) {
/* Inject a delay into each loop, by calling getppid()
many times */
for (int k = 0; k < 10000000; k++)
getppid();
if (j % 1 == 0)
printf("%ld\n", j);
/* Display the current wall-clock time */
if (clock_gettime(CLOCK_REALTIME, &start) == -1)
errExit("clock_gettime");
printf("Current time is %ld secs, %ld nsecs\n",
start.tv_sec, start.tv_nsec);
/* Before resetting the timer, retrieve its current value
so that after the timerfd_settime() call, we can see
whether the the value has changed */
if (timerfd_gettime(tfd, >s) == -1)
perror("timerfd_gettime");
printf("Timer value is now %ld secs, %ld nsecs\n",
gts.it_value.tv_sec, gts.it_value.tv_nsec);
/* Reset the timer to now + 10 secs */
ts.it_value.tv_sec = start.tv_sec + 10;
ts.it_value.tv_nsec = start.tv_nsec;
if (timerfd_settime(tfd, flags, &ts, NULL) == -1)
perror("timerfd_settime");
else
printf("timerfd_settime() succeeded\n");
/* Display the timer value once again */
if (timerfd_gettime(tfd, >s) == -1)
perror("timerfd_gettime");
printf("Timer value is now %ld secs, %ld nsecs\n",
gts.it_value.tv_sec, gts.it_value.tv_nsec);
printf("\n");
}
}
=======
Subject: Re: timer_settime() and ECANCELED
Date: Wed, 01 Apr 2020 19:42:42 +0200
From: Thomas Gleixner <tglx@linutronix.de>
Michael,
"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> Following on from our discussion of read() on a timerfd [1], I
> happened to remember a Debian bug report [2] that points out that
> timer_settime() can fail with the error ECANCELED, which is both
> surprising and odd (because despite the error, the timer does get
> updated).
...
> (1) If the wall-clock is changed before the first timerfd_settime()
> call, the call succeeds. This is of course expected.
> (2) If the wall-clock is changed after a timerfd_settime() call, then
> the next timerfd_settime() call fails with ECANCELED.
> (3) Even if the timerfd_settime() call fails, the timer is still updated(!).
>
> Some questions:
> (a) What is the rationale for timerfd_settime() failing with ECANCELED
> in this case? (Currently, the manual page says nothing about this.)
> (b) It seems at the least surprising, but more likely a bug, that
> timerfd_settime() fails with ECANCELED while at the same time
> successfully updating the timer value.
Really good question and TBH I can't remember why this is implemented in
the way it is, but I have a faint memory that at least (a) is
intentional.
After staring at the code for a while I came up with the following
answers:
(a): If the clock was set event ("date -s ...") which triggered the
cancel was not yet consumed by user space via read(), then that
information would get lost because arming the timer to the new
value has to reset the state.
(b): Arming the timer in that case is indeed very questionable, but it
could be argued that because the clock was set event happened with
the old expiry value that the new expiry value is not affected.
I'd be happy to change that and not arm the timer in the case of a
pending cancel, but I fear that some user space already depends on
that behaviour.
Thanks,
tglx
======
Subject: Re: timer_settime() and ECANCELED
Date: Thu, 02 Apr 2020 10:49:18 +0200
From: Thomas Gleixner <tglx@linutronix.de>
To: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> On 4/1/20 7:42 PM, Thomas Gleixner wrote:
>> (b): Arming the timer in that case is indeed very questionable, but it
>> could be argued that because the clock was set event happened with
>> the old expiry value that the new expiry value is not affected.
>>
>> I'd be happy to change that and not arm the timer in the case of a
>> pending cancel, but I fear that some user space already depends on
>> that behaviour.
>
> Yes, that's the risk, of course. So, shall we just document all
> this in the manual page?
I think so.
Thanks,
tglx
======
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This patch documents the PR_SET_IO_FLUSHER and PR_GET_IO_FLUSHER
prctl commands added to the linux kernel for 5.6 in commit:
commit 8d19f1c8e1937baf74e1962aae9f90fa3aeab463
Author: Mike Christie <mchristi@redhat.com>
Date: Mon Nov 11 18:19:00 2019 -0600
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
Reviewed-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Mostly verified by testing and reading the code.
There is unfortunately quite a bit of inconsistency across API~s:
clock_gettime clock_settime clock_nanosleep timer_create timerfd_create
CLOCK_BOOTTIME y n (EINVAL) y y y
CLOCK_BOOTTIME_ALARM y n (EINVAL) y [1] y [1] y [1]
CLOCK_MONOTONIC y n (EINVAL) y y y
CLOCK_MONOTONIC_COARSE y n (EINVAL) n (ENOTSUP) n (ENOTSUP) n (EINVAL)
CLOCK_MONOTONIC_RAW y n (EINVAL) n (ENOTSUP) n (ENOTSUP) n (EINVAL)
CLOCK_REALTIME y y y y y
CLOCK_REALTIME_ALARM y n (EINVAL) y [1] y [1] y [1]
CLOCK_REALTIME_COARSE y n (EINVAL) n (ENOTSUP) n (ENOTSUP) n (EINVAL)
CLOCK_TAI y n (EINVAL) y y n (EINVAL)
CLOCK_PROCESS_CPUTIME_ID y n (EINVAL) y y n (EINVAL)
CLOCK_THREAD_CPUTIME_ID y n (EINVAL) n (EINVAL [2]) y n (EINVAL)
pthread_getcpuclockid() y n (EINVAL) y y n (EINVAL)
[1] The caller must have CAP_WAKE_ALARM, or the error EPERM results.
[2] This error is generated in the glibc wrapper.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Presumably (and from a quick glance at the source code)
since Linux 3.10. when CLOCK_TAI was introduced.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Presumably (and from a quick glance at the source code)
since Linux 2.6.39, when CLOCK_BOOTTIME was introduced.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
From email discussions with Aleksa Sarai:
> .\" FIXME I find the "previously-functional systems" in the previous
> .\" sentence a little odd (since openat2() ia new sysycall), so I would
> .\" like to clarify a little...
> .\" Are you referring to the scenario where someone might take an
> .\" existing application that uses openat() and replaces the uses
> .\" of openat() with openat2()? In which case, is it correct to
> .\" understand that you mean that one should not just indiscriminately
> .\" add the RESOLVE_NO_XDEV flag to all of the openat2() calls?
> .\" If I'm not on the right track, could you point me in the right
> .\" direction please.
This is mostly meant as a warning to hopefully avoid applications
because the developer didn't realise that system paths may contain
symlinks or bind-mounts. For an application which has switched to
openat2() and then uses RESOLVE_NO_SYMLINKS for a non-security reason,
it's possible that on some distributions (or future versions of a
distribution) that their application will stop working because a system
path suddenly contains a symlink or is a bind-mount.
This was a concern which was brought up on LWN some time ago. If you can
think of a phrasing that makes this more clear, I'd appreciate it.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Devi R K reported this issue, and went on to note:
> We have written a program using real time clock and it has been raised to
> the community.
>
> https://lore.kernel.org/lkml/alpine.DEB.2.21.1908191943280.1796@nanos.tec.linutronix.de/T/
[...]
Thanks for pointing me at that thread. In particular, the test
program at
https://lore.kernel.org/lkml/alpine.DEB.2.21.1908191943280.1796@nanos.tec.linutronix.de/T/#m489d81abdfbb2699743e18c37657311f8d52a4cd
[...]
I think this patch does not really capture the details
properly. The immediately preceding paragraph says:
If the associated clock is either CLOCK_REALTIME or
CLOCK_REALTIME_ALARM, the timer is absolute
(TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET
was specified when calling timerfd_settime(), then read(2)
fails with the error ECANCELED if the real-time clock
undergoes a discontinuous change. (This allows the reading
application to discover such discontinuous changes to the
clock.)
Following on from that, I think we should have a paragraph that says
something like:
If the associated clock is either CLOCK_REALTIME or
CLOCK_REALTIME_ALARM, the timer is absolute
(TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET
was not specified when calling timerfd_settime(), then a
discontinuous negative change to the clock
(e.g., clock_settime(2)) may cause read(2) to unblock, but
return a value of 0 (i.e., no bytes read), if the clock
change occurs after the time expired, but before the
read(2) on the timerfd file descriptor.
This seems consistent with Thomas's observations in
https://lore.kernel.org/lkml/alpine.DEB.2.21.1908191943280.1796@nanos.tec.linutronix.de/T/#m49b78122b573a2749a05b720dc9fa036546db490
==
Thomas Gleixner replied:
Yes, that's correct. Accurate as always!
This is pretty much in line with clock_nanosleep(CLOCK_REALTIME,
TIMER_ABSTIME) which has a similar problem vs. observability in user
space.
clock_nanosleep(2) mutters:
"POSIX.1 specifies that after changing the value of the CLOCK_REALTIME
clock via clock_settime(2), the new clock value shall be used to
determine the time at which a thread blocked on an absolute
clock_nanosleep() will wake up; if the new clock value falls past the
end of the sleep interval, then the clock_nanosleep() call will return
immediately."
which can be interpreted as guarantee that clock_nanosleep() never
returns prematurely, i.e. the assert() in the below code would indicate
a kernel failure:
ret = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &expiry, NULL);
if (!ret) {
clock_gettime(CLOCK_REALTIME, &now);
assert(now >= expiry);
}
But that assert can trigger when CLOCK_REALTIME was modified after the
timer fired and the kernel decided to wake up the task and let it return
to user space.
clock_nanosleep(..., &expiry)
arm_timer(expires);
schedule();
-> timer interrupt
now = ktime_get_real();
if (expires <= now)
-------------------------------- After this point
wakeup(); clock_settime(2) or
adjtimex(2) which
makes CLOCK_REALTIME
jump back far enough will
cause the above assert
to trigger.
...
return from syscall (retval == 0)
There is no guarantee against clock_settime() coming after the
wakeup. Even if we put another check into the return to user path then
we won't catch a clock_settime() which comes right after that and before
user space invokes clock_gettime().
POSIX spec Issue 7 (2018 edition) says:
The suspension for the absolute clock_nanosleep() function (that is,
with the TIMER_ABSTIME flag set) shall be in effect at least until the
value of the corresponding clock reaches the absolute time specified by
rqtp.
And that's what the kernel implements for clock_nanosleep() and timerfd
behaves exactly the same way.
The wakeup of the waiter, i.e. task blocked in clock_nanosleep(2),
read(2), poll(2), is not happening _before_ the absolute time specified
is reached.
If clock_settime() happens right before the expiry check, then it does
the right thing, but any modification to the clock after the wakeup
cannot be mitigated. At least not in a way which would make the assert()
in the example code above a reliable indicator for a kernel fail.
That's the reason why I rejected the attempt to mitigate that particular
0 tick issue in timerfd as it would just scratch a particular itch but
still not provide any guarantee. So having the '0' return documented is
the right way to go.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: devi R.K <devi.feb27@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
For me, source lines such as:
.BR perf_setattr "(2), " perf_event_open "(2), and " clone3 (2).
is harder to read than:
.BR perf_setattr (2),
.BR perf_event_open (2),
and
.BR clone3 (2).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
There are currently three of these forward references (two in
DESCRIPTION, one in ERRORS). This is a little redundant.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Rather than trying to merge the new syscall documentation into
open.2 (which would probably result in the man-page being
incomprehensible), instead the new syscall gets its own dedicated
page with links between open(2) and openat2(2) to avoid
duplicating information such as the list of O_* flags or common
errors.
In addition to describing all of the key flags, information about
the extensibility design is provided so that users can better
understand why they need to pass sizeof(struct open_how) and how
their programs will work across kernels. After some discussions
with David Laight, I also included explicit instructions to zero
the structure to avoid issues when recompiling with new headers.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The fact that CLOCK_PROCESS_CPUTIME_ID and
CLOCK_PROCESS_CPUTIME_ID are not settable isn't a bug,
since POSIX does allow the possibility that these clocks
are not settable.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The sixth argument of futex is uaddr2, instead of uaddr.
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add example programs demonstrating usage of shmget(2), shmat(2),
semget(2), semctl(2), and semop(2).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Document the verity attribute for statx(), which was added in
Linux 5.5.
For more context, see the fs-verity documentation:
https://www.kernel.org/doc/html/latest/filesystems/fsverity.html
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Fix clone3() syscall description for CLONE_PARENT_SETTID: kernel uses
cl_args.parent_tid instead of the specified cl_args.child_tid.
Signed-off-by: Krzysztof Małysa <varqox@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Consecutive calls to clock_gettime(CLOCK_MONOTONIC) are guaranteed
to return MONOTONIC values, which means that they either return
the *SAME* time value like the last call, or a later (higher) time
value.
Due to high resolution counters, like TSC on x86, most people see
that the values returned increase, but on other less common
platforms it's less likely that consecutive calls return newer
values, and instead users may unexpectedly get back the SAME time
value.
I think it makes sense to document that people should not expect
to see "always-growing" time values. For example in Debian I've
seen in quite some source packages where return values of
consecutive calls are compared against each other and then the
package build fails if they are equal (e.g. ruby-hitimes, ...).
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
* man2/syscalls.2 (.SH DESCRIPTION) <\fBgetdtablesize\fP(2)>: Remove "since
Linux 2.0" part for the osf_getdtablesize note, as syscall is generally
available since Linux 2.0; add line break after the word "as".
(.SH DESCRIPTION) <\fBpwrite\fP(2)>: Add line breaks.
(.SH DESCRIPTION) <\fBvm86old\fP(2)>: Add a line break after "in".
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In many cases, these don't improve readability, and (when stacked)
they sometimes have the side effect of sometimes forcing text
to be justified within a narrow column range.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
And add self to copyright, since, by now, the majority of the
text in the page has now been (re)written by me.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>