futex.2: Terminology fixes

Here is the result of a first pass over futex.2.  I tried to
do nothing that is too controversial.  I tried to apply the
terminology that at least Darren and I had in mind
consistently; but please check again.

The major changes are in how futexes are described in the
introductory parts of the page.  I hope it's easier to understand
now.  I've also tried to add some more precision to the
description of the synchronization semantics (e.g., it makes a
difference whether we claim something is atomic (without further
qualification), or just atomic with respect to other futex operations).
In some cases, that adds some verbosity to the text, but I
believe that the clarity and consistency in using terms is worth it.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Torvald Riegel 2015-03-02 17:33:18 +01:00 committed by Michael Kerrisk
parent 63de469c6a
commit 4b35dc5dab
1 changed files with 188 additions and 130 deletions

@ -40,50 +40,65 @@ There is no glibc wrapper for this system call; see NOTES.
.PP
The
.BR futex ()
system call provides a method for
a program to wait for a value at a given address to change, and a
method to wake up anyone waiting on a particular address.
system call provides a method for waiting until a certain condition becomes
true. It is typically used as a blocking construct in the context of
shared-memory synchronization: The program implements the majority of the
synchronization in user space, and uses one of the operations of the system call
when it is likely that it has to block for a longer time until the condition
becomes true. The program uses another operation of the system call to wake
anyone waiting for a particular condition.
The condition is represented by the futex word, which is an address in memory
supplied to the
.BR futex ()
system call, and the value at this memory location.
(While the virtual addresses for the same memory in separate
processes may not be equal,
the kernel maps them internally so that the same memory mapped
in different locations will correspond for
.BR futex ()
calls.)
This system call is typically used to
implement the contended case of a lock in shared memory, as
described in
.BR futex (7).
In the uncontended case,
all operations on the futex memory location are performed
in user space using atomic machine-language instructions,
and the kernel maintains no information about the futex.
The kernel allocates state information for the futex only
in the contended case, when operations such as
When executing a futex operation that requests to block a thread, the kernel
will only block if the futex word has the value that the calling thread
supplied as the expected value. The load from the futex word, the comparison with
the expected value, and the actual blocking will happen atomically and totally
ordered with respect to concurrently executing futex operations on the same
futex word, such as operations that wake threads blocked on this futex word.
Thus, the futex word is used to connect the synchronization in user space with
the implementation of blocking by the kernel; similar to an atomic
compare-and-exchange operation that potentially changes shared memory,
blocking via a futex is an atomic compare-and-block operation. See NOTES for
a detailed specification of the synchronization semantics.
One example use of futexes is implementing locks. The state of the lock (i.e.,
acquired or not acquired) can be represented as an atomically accessed flag
in shared memory. In the uncontended case, a thread can access or modify the
lock state with atomic instructions, for example atomically changing it from
not acquired to acquired using an atomic compare-and-exchange instruction. If
a thread cannot acquire a lock because it is already acquired by another
thread, it can request to block if and only if the lock is still acquired by
using the lock's flag as the futex word and expecting a value that represents the
acquired state. When releasing the lock, a thread has to first reset the
lock state to not acquired and then execute the futex operation that wakes
one thread blocked on the futex word that is the lock's flag (this can
be further optimized to avoid unnecessary wake-ups). See
.BR futex (7)
for more detail on how to use futexes.
Besides the basic wait and wake-up futex functionality, there are further
futex operations aimed at supporting more complex use cases. Also note that
no explicit initialization or destruction is necessary to use futexes; the
kernel maintains a futex (i.e., the kernel-internal implementation artifact)
only while operations such as
.BR FUTEX_WAIT ,
described below, are performed.
When a futex operation did not finish uncontended in user space, a
.BR futex ()
call needs to be made to the kernel to arbitrate.
Arbitration can either mean putting the caller
to sleep or, conversely, waking a waiting process or thread.
.PP
Callers of
.BR futex ()
are expected to adhere to the semantics described in
.BR futex (7).
As these semantics involve writing nonportable assembly instructions
(see the example library referred to in SEE ALSO),
this in turn probably means that most users will in fact be
library authors and not general application developers.
described below, are being performed on a particular futex word.
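The lock example in the introductory text above can be sketched in C roughly as follows. This is an illustrative sketch under stated assumptions, not text from the page: the helper names `futex`, `lock_acquire`, and `lock_release` are hypothetical, the encoding 0 = not acquired and 1 = acquired is chosen only for illustration, and it assumes that `_Atomic uint32_t` has the same size and representation as a plain four-byte integer.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: the atomic type is a plain four-byte futex word. */
static_assert(sizeof(_Atomic uint32_t) == sizeof(uint32_t),
              "futex word must be four bytes");

/* Hypothetical helper: glibc provides no futex() wrapper. */
static long futex(_Atomic uint32_t *uaddr, int futex_op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, futex_op, val, NULL, NULL, 0);
}

/* Illustrative encoding: 0 = not acquired, 1 = acquired. */
static void lock_acquire(_Atomic uint32_t *word)
{
    for (;;) {
        uint32_t expected = 0;

        /* Uncontended case: one compare-and-exchange in user space. */
        if (atomic_compare_exchange_strong(word, &expected, 1))
            return;

        /* Contended case: block only if the word still reads 1.
           Any return (wake-up, EAGAIN, EINTR) leads to a retry. */
        futex(word, FUTEX_WAIT, 1);
    }
}

static void lock_release(_Atomic uint32_t *word)
{
    /* First reset the lock state, then wake one blocked thread. */
    atomic_store(word, 0);
    futex(word, FUTEX_WAKE, 1);
}
```

This sketch issues a `FUTEX_WAKE` on every release, even when nobody is waiting; tracking waiters in the futex word is the kind of optimization the text alludes to.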
.\"
.SS Arguments
The
.I uaddr
argument points to an integer which stores the counter (futex).
On all platforms, futexes are four-byte integers that
must be aligned on a four-byte boundary.
argument points to the futex word. On all platforms, futexes are four-byte
integers that must be aligned on a four-byte boundary.
The operation to perform on the futex is specified in the
.I futex_op
argument;
@ -117,7 +132,7 @@ when interpreted in this fashion.
Where it is required, the
.IR uaddr2
argument is a pointer to a second futex that is employed by the operation.
argument is a pointer to a second futex word that is employed by the operation.
The interpretation of the final integer argument,
.IR val3 ,
depends on the operation.
@ -139,8 +154,8 @@ are as follows:
.\" commit 34f01cc1f512fa783302982776895c73714ebbc2
This option bit can be employed with all futex operations.
It tells the kernel that the futex is process-private and not shared
with another process
(i.e., it is being used for synchronization between threads).
with another process (i.e., it is only being used for synchronization between
threads of the same process).
This allows the kernel to choose the fast path for validating
the user-space address and avoids expensive VMA lookups,
taking reference counts on file backing store, and so on.
@ -191,24 +206,32 @@ is one of the following:
.BR FUTEX_WAIT " (since Linux 2.6.0)"
.\" Strictly speaking, since some time in 2.5.x
This operation tests that the value at the
location pointed to by the futex address
futex word pointed to by the address
.I uaddr
still contains the value
still contains the expected value
.IR val ,
and then sleeps awaiting
and if so, then sleeps awaiting
.B FUTEX_WAKE
on the futex address.
The test and sleep steps are performed atomically.
on the futex word. The load of the value of the futex word is an atomic memory
access (i.e., using atomic machine instructions of the respective
architecture). This load, the comparison with the expected value, and
starting to sleep are performed atomically and totally ordered with respect
to other futex operations on the same futex word. If the thread starts to
sleep, it is considered a waiter on this futex word.
If the futex value does not match
.IR val ,
then the call fails immediately with the error
.BR EAGAIN .
.\" FIXME I added the following sentence. Please confirm that it is correct.
The purpose of the test step is to detect races where
another process or thread changes the value of the futex between
the time it was last checked and the time of the
The purpose of the comparison with the expected value is to prevent lost
wake-ups: If another thread changed the value of the futex word after the
calling thread decided to block based on the prior value, and if the other
thread executed a
.BR FUTEX_WAKE
operation (or similar wake-up) after the value change and before this
.BR FUTEX_WAIT
operation.
operation, then the latter will observe the value change and will not start
to sleep.
If the
.I timeout
@ -230,14 +253,15 @@ and
.I val3
are ignored.
For
.BR futex (7),
this call is executed if decrementing the count gave a negative value
(indicating contention),
and will sleep until another process or thread releases
the futex and executes the
.B FUTEX_WAKE
operation.
.\" XXX I think we should remove this. Or maybe adapt to a different example.
.\" For
.\" .BR futex (7),
.\" this call is executed if decrementing the count gave a negative value
.\" (indicating contention),
.\" and will sleep until another process or thread releases
.\" the futex and executes the
.\" .B FUTEX_WAKE
.\" operation.
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
@ -246,9 +270,11 @@ operation.
.\" Strictly speaking, since Linux 2.5.x
This operation wakes at most
.I val
of the waiters that are waiting (i.e., inside
.\" XXX I believe FUTEX_WAIT_BITSET waiters, for example, could also be woken
.\" (therefore, make it e.g. instead of i.e.)?
of the waiters that are waiting (e.g., inside
.BR FUTEX_WAIT )
on the futex at the address
on the futex word at the address
.IR uaddr .
Most commonly,
.I val
@ -267,11 +293,12 @@ and
.I val3
are ignored.
For
.BR futex (7),
this is executed if incrementing the count showed that there were waiters,
.\" XXX I think we should remove this. Or maybe adapt to a different example.
.\" For
.\" .BR futex (7),
.\" this is executed if incrementing the count showed that there were waiters,
.\" FIXME How does "incrementing the count showed that there were waiters"?
once the futex value has been set to 1 (indicating that it is available).
.\" once the futex value has been set to 1 (indicating that it is available).
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
@ -283,7 +310,7 @@ This operation creates a file descriptor that is associated with the futex at
The caller must close the returned file descriptor after use.
When another process or thread performs a
.BR FUTEX_WAKE
on the futex, the file descriptor indicates as being readable with
on the futex word, the file descriptor indicates as being readable with
.BR select (2),
.BR poll (2),
and
@ -303,6 +330,7 @@ and
.I val3
are ignored.
.\" FIXME We never define "upped". Maybe just remove that sentence?
To prevent race conditions, the caller should test if the futex has
been upped after
.B FUTEX_FD
@ -319,8 +347,11 @@ from Linux 2.6.26 onward.
.TP
.BR FUTEX_REQUEUE " (since Linux 2.6.0)"
.\" Strictly speaking: from Linux 2.5.70
.\" FIXME Is there some indication that it is broken in general, or is this
.\" comment implicitly speaking about the condvar (?) use case? If the latter
.\" we might want to weaken the advice a little.
.IR "Avoid using this operation" .
It is broken (unavoidably racy) for its intended purpose.
It is broken for its intended purpose.
Use
.BR FUTEX_CMP_REQUEUE
instead.
@ -337,35 +368,13 @@ is ignored.)
.\"
.TP
.BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)"
This operation was added as a replacement for the earlier
.BR FUTEX_REQUEUE ,
because that operation was racy for its intended use.
As with
.BR FUTEX_REQUEUE ,
the
.BR FUTEX_CMP_REQUEUE
operation is used to avoid a "thundering herd" effect when
.B FUTEX_WAKE
is used and all of the waiters that are woken up
need to acquire another futex.
It differs from
.BR FUTEX_REQUEUE
in that it first checks whether the location
This operation first checks whether the location
.I uaddr
still contains the value
.IR val3 .
If not, the operation fails with the error
.BR EAGAIN .
.\" FIXME I added the following sentence on the rationale for
.\" FUTEX_CMP_REQUEUE. Is it correct? Should it be expanded?
This additional feature of
.BR FUTEX_CMP_REQUEUE
can be used by the caller to (atomically) detect changes
in the value of the target futex at
.IR uaddr2 .
The operation wakes up a maximum of
Otherwise, the operation wakes up a maximum of
.I val
waiters that are waiting on the futex at
.IR uaddr .
@ -376,13 +385,31 @@ from the wait queue of the source futex at
.I uaddr
and added to the wait queue of the target futex at
.IR uaddr2 .
The
.I val2
argument specifies an upper limit on the number of waiters
that are requeued to the futex at
.IR uaddr2 .
.\" FIXME Is this correct? Or is just the decision which threads to wake or
.\" requeue part of the atomic operation?
The load from
.I uaddr
is an atomic memory access (i.e., using atomic machine instructions of the
respective architecture). This load, the comparison with
.IR val3 ,
and the requeueing of any waiters are performed atomically and totally ordered
with respect to other operations on the same futex word.
This operation was added as a replacement for the earlier
.BR FUTEX_REQUEUE .
The difference is that the check of the value at
.I uaddr
can be used to ensure that requeueing only happens under certain conditions.
Both operations can be used to avoid a "thundering herd" effect when
.B FUTEX_WAKE
is used and all of the waiters that are woken need to acquire another futex.
.\" FIXME Please review the following new paragraph to see if it is
.\" accurate.
Typical values to specify for
@ -416,6 +443,9 @@ operation equivalent to
.\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721
.\" Author: Jakub Jelinek <jakub@redhat.com>
.\" Date: Tue Sep 6 15:16:25 2005 -0700
.\" FIXME The glibc condvar implementation is currently being revised (e.g.,
.\" to not use an internal lock anymore).
.\" It is probably more future-proof to remove this paragraph.
This operation was added to support some user-space use cases
where more than one futex must be handled at the same time.
The most notable example is the implementation of
@ -429,7 +459,9 @@ high rates of contention and context switching.
The
.BR FUTEX_WAKE_OP
operation is equivalent to atomically executing the following code:
operation is equivalent to executing the following code atomically and totally
ordered with respect to other futex operations on either of the two supplied
futex words:
.in +4n
.nf
@ -446,23 +478,24 @@ In other words,
does the following:
.RS
.IP * 3
saves the original value of the futex at
.IR uaddr2 ;
.IP *
performs an operation to modify the value of the futex at
saves the original value of the futex word at
.IR uaddr2
and performs an operation to modify the value of the futex at
.IR uaddr2 ;
this is an atomic read-modify-write memory access (i.e., using atomic machine
instructions of the respective architecture);
.IP *
wakes up a maximum of
.I val
waiters on the futex
waiters on the futex for the futex word at
.IR uaddr ;
and
.IP *
dependent on the results of a test of the original value of the futex at
dependent on the results of a test of the original value of the futex word at
.IR uaddr2 ,
wakes up a maximum of
.I val2
waiters on the futex
waiters on the futex for the futex word at
.IR uaddr2 .
.RE
.IP
@ -676,7 +709,7 @@ have their priorities raised to be the same as the high-priority task.
.\" based on mail discussions with Darren Hart. Does it seem okay?
From a user-space perspective,
what makes a futex PI-aware is a policy agreement between user space
and the kernel about the value of the futex (described in a moment),
and the kernel about the value of the futex word (described in a moment),
coupled with the use of the PI futex operations described below
(in particular,
.BR FUTEX_LOCK_PI ,
@ -697,11 +730,13 @@ and
.\" significantly. Please check it.
.\"
The PI futex operations described below differ from the other
futex operations in that they impose policy on the use of the futex value:
futex operations in that they impose policy on the use of the value of the
futex word:
.IP * 3
If the lock is unowned, the futex value shall be 0.
If the lock is not acquired, the futex word's value shall be 0.
.IP *
If the lock is owned, the futex value shall be the thread ID (TID; see
If the lock is acquired, the futex word's value shall be the thread ID (TID;
see
.BR gettid (2))
of the owning thread.
.IP *
@ -709,32 +744,34 @@ of the owning thread.
If the lock is owned and there are threads contending for the lock,
then the
.B FUTEX_WAITERS
bit shall be set in the futex value; in other words, the futex value is:
bit shall be set in the futex word's value; in other words, this value is:
FUTEX_WAITERS | TID
.PP
Note that a PI futex never just has the value
Note that a PI futex word never just has the value
.BR FUTEX_WAITERS ,
which is a permissible state for non-PI futexes.
With this policy in place,
a user-space application can acquire an unowned
lock or release an uncontended lock using atomic
instructions executed in user-space (e.g.,
a user-space application can acquire a not-acquired
lock or release a lock that no other threads try to acquire using atomic
instructions executed in user space (e.g., a compare-and-swap operation such
as
.I cmpxchg
on the x86 architecture).
Locking an unowned lock simply consists of setting
the futex value to the caller's TID.
Releasing an uncontended lock simply requires setting the futex value to 0.
Acquiring a lock simply consists of using compare-and-swap to atomically set
the futex word's value to the caller's TID if its previous value was 0.
Releasing a lock requires using compare-and-swap to set the futex word's
value to 0 if the previous value was the expected TID.
If a futex is currently owned (i.e., has a nonzero value),
If a futex is already acquired (i.e., has a nonzero value),
waiters must employ the
.B FUTEX_LOCK_PI
operation to acquire the lock.
If a lock is contended (i.e., the
If other threads are waiting for the lock, then the
.B FUTEX_WAITERS
bit is set in the futex value), the lock owner must employ the
bit is set in the futex value; in this case, the lock owner must employ the
.B FUTEX_UNLOCK_PI
operation to release the lock.
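The PI policy described in this hunk can be sketched in C as follows. This is a hedged illustration, not text from the page: the function names `pi_lock` and `pi_unlock` are hypothetical, the thread ID is obtained with a raw `SYS_gettid` system call, and the sketch assumes `_Atomic uint32_t` is a plain four-byte integer. Error handling is left to the caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: the atomic type is a plain four-byte futex word. */
static_assert(sizeof(_Atomic uint32_t) == sizeof(uint32_t),
              "futex word must be four bytes");

/* Hypothetical helper: acquire a PI lock; returns 0 on success. */
static int pi_lock(_Atomic uint32_t *word)
{
    uint32_t tid = (uint32_t)syscall(SYS_gettid);
    uint32_t expected = 0;

    /* Fast path: the word is 0 (not acquired), so store our TID. */
    if (atomic_compare_exchange_strong(word, &expected, tid))
        return 0;

    /* Slow path: the word is nonzero, so let the kernel arbitrate. */
    return (int)syscall(SYS_futex, word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

/* Hypothetical helper: release a PI lock; returns 0 on success. */
static int pi_unlock(_Atomic uint32_t *word)
{
    uint32_t expected = (uint32_t)syscall(SYS_gettid);

    /* Fast path: the word holds exactly our TID (no FUTEX_WAITERS bit). */
    if (atomic_compare_exchange_strong(word, &expected, 0))
        return 0;

    /* Slow path: waiters exist, so the kernel must hand over the lock. */
    return (int)syscall(SYS_futex, word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```

The compare-and-swap on release fails exactly when the kernel has set `FUTEX_WAITERS`, which is what routes the owner to `FUTEX_UNLOCK_PI`.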
@ -752,7 +789,7 @@ before the calling thread returns to user space.
It is important to note
.\" FIXME We need some explanation here of *why* it is important to
.\" note this. Can someone explain?
that the kernel will update the futex value prior
that the kernel will update the futex word's value prior
to returning to user space.
Unlike the other futex operations described above,
the PI futex operations are designed
@ -782,8 +819,8 @@ PI futexes are operated on by specifying one of the following values in
.\" Please check, in case I injected errors.
.\"
This operation is used after an attempt to acquire
the futex lock via an atomic user-space instruction failed
because the futex has a nonzero value\(emspecifically,
the lock via an atomic user-space instruction failed
because the futex word has a nonzero value\(emspecifically,
because it contained the namespace-specific TID of the lock owner.
.\" FIXME In the preceding line, what does "namespace-specific" mean?
.\" (I kept those words from tglx.)
@ -791,13 +828,13 @@ because it contained the namespace-specific TID of the lock owner.
.\" (I suppose we are talking PID namespaces here, but I want to
.\" be sure.)
The operation checks the value of the futex at the address
The operation checks the value of the futex word at the address
.IR uaddr .
If the value is 0, then the kernel tries to atomically set
the futex value to the caller's TID.
If that fails,
.\" FIXME What would be the cause of failure?
or the futex value is nonzero,
or the futex word's value is nonzero,
the kernel atomically sets the
.B FUTEX_WAITERS
bit, which signals the futex owner that it cannot unlock the futex in
@ -855,6 +892,8 @@ This operation tries to acquire the futex at
.\" the difference(s) between FUTEX_LOCK_PI and FUTEX_TRYLOCK_PI.
.\" Can someone propose something?
.\"
.\" FIXME Additionally, we claim above that just FUTEX_WAITERS is never an
.\" allowed state.
It deals with the situation where the TID value at
.I uaddr
is 0, but the
@ -1049,7 +1088,13 @@ The return value on success depends on the operation,
as described in the following list:
.TP
.B FUTEX_WAIT
Returns 0 if the caller was woken up.
Returns 0 if the caller was woken up. Note that a wake-up can also be
caused by common futex usage patterns in unrelated code that happened to have
previously used the futex word's memory location (e.g., typical futex-based
implementations of Pthreads mutexes can cause this under some conditions).
Therefore, callers should always conservatively assume that a return value of
0 can mean a spurious wake-up, and use the futex word's value (i.e., the
user-space synchronization scheme) to decide whether to continue to block or not.
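The advice about spurious wake-ups amounts to re-checking the futex word in a loop. The following C sketch shows one way to do that; the function name `wait_while_equal` is hypothetical, and the sketch assumes `_Atomic uint32_t` is a plain four-byte integer. Treating `EAGAIN` and `EINTR` as "re-check and retry" is a choice made for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: the atomic type is a plain four-byte futex word. */
static_assert(sizeof(_Atomic uint32_t) == sizeof(uint32_t),
              "futex word must be four bytes");

/* Hypothetical helper: block until *word no longer equals 'undesired'.
   Returns 0 once the value differs, or -1 on an unexpected error. */
static int wait_while_equal(_Atomic uint32_t *word, uint32_t undesired)
{
    while (atomic_load(word) == undesired) {
        long r = syscall(SYS_futex, word, FUTEX_WAIT, undesired,
                         NULL, NULL, 0);

        /* r == 0 may be a spurious wake-up, EAGAIN means the value had
           already changed, and EINTR means a signal arrived.  In all
           three cases the loop condition decides what happens next. */
        if (r == -1 && errno != EAGAIN && errno != EINTR)
            return -1;
    }
    return 0;
}
```

The return value of `FUTEX_WAIT` is never used to conclude that the condition holds; only the value of the futex word is.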
.TP
.B FUTEX_WAKE
Returns the number of waiters that were woken up.
@ -1062,22 +1107,25 @@ Returns the number of waiters that were woken up.
.TP
.B FUTEX_CMP_REQUEUE
Returns the total number of waiters that were woken up or
requeued to the futex at
requeued to the futex for the futex word at
.IR uaddr2 .
If this value is greater than
.IR val ,
then the difference is the number of waiters requeued to the futex at
then the difference is the number of waiters requeued to the futex for the futex
word at
.IR uaddr2 .
.TP
.B FUTEX_WAKE_OP
Returns the total number of waiters that were woken up.
This is the sum of the woken waiters on the two futexes at
This is the sum of the woken waiters on the two futexes for the futex words at
.I uaddr
and
.IR uaddr2 .
.TP
.B FUTEX_WAIT_BITSET
Returns 0 if the caller was woken up.
Returns 0 if the caller was woken up. See
.B FUTEX_WAIT
for how to interpret this correctly in practice.
.TP
.B FUTEX_WAKE_BITSET
Returns the number of waiters that were woken up.
@ -1093,15 +1141,17 @@ Returns 0 if the futex was successfully unlocked.
.TP
.B FUTEX_CMP_REQUEUE_PI
Returns the total number of waiters that were woken up or
requeued to the futex at
requeued to the futex for the futex word at
.IR uaddr2 .
If this value is greater than
.IR val ,
then the difference is the number of waiters requeued to the futex at
then the difference is the number of waiters requeued to the futex for the futex
word at
.IR uaddr2 .
.TP
.B FUTEX_WAIT_REQUEUE_PI
Returns 0 if the caller was successfully requeued to the futex at
Returns 0 if the caller was successfully requeued to the futex for the futex
word at
.IR uaddr2 .
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
@ -1109,10 +1159,11 @@ Returns 0 if the caller was successfully requeued to the futex at
.SH ERRORS
.TP
.B EACCES
No read access to futex memory.
No read access to the memory of a futex word.
.TP
.B EAGAIN
.RB ( FUTEX_WAIT ,
.BR FUTEX_WAIT_BITSET ,
.BR FUTEX_WAIT_REQUEUE_PI )
The value pointed to by
.I uaddr
@ -1136,6 +1187,7 @@ The value pointed to by
is not equal to the expected value
.IR val3 .
.\" FIXME: Is the following sentence correct?
.\" I would prefer to remove this sentence. --triegel@redhat.com
(This probably indicates a race;
use the safe
.B FUTEX_WAKE
@ -1164,7 +1216,7 @@ Try again.
.RB ( FUTEX_LOCK_PI ,
.BR FUTEX_TRYLOCK_PI ,
.BR FUTEX_CMP_REQUEUE_PI )
The futex at
The futex word at
.I uaddr
is already locked by the caller.
.TP
@ -1175,7 +1227,7 @@ is already locked by the caller.
.\" constants are synonymous. Is there a reason that both names
.\" are used?
.RB ( FUTEX_CMP_REQUEUE_PI )
While requeueing a waiter to the PI futex at
While requeueing a waiter to the PI futex for the futex word at
.IR uaddr2 ,
the kernel detected a deadlock.
.TP
@ -1196,7 +1248,6 @@ operation was interrupted by a signal (see
.BR signal (7)).
In kernels before Linux 2.6.22, this error could also be returned for
a spurious wakeup; since Linux 2.6.22, this no longer happens.
or a spurious wakeup.
.TP
.B EINVAL
The operation in
@ -1353,7 +1404,7 @@ nor
.BR FUTEX_UNLOCK_PI ,
.BR FUTEX_CMP_REQUEUE_PI ,
.BR FUTEX_WAIT_REQUEUE_PI )
A run-time check determined that the operation not available.
A run-time check determined that the operation is not available.
The PI futex operations are not implemented on all architectures and
are not supported on some CPU variants.
.TP
@ -1371,7 +1422,7 @@ the futex at
.TP
.BR EPERM
.RB ( FUTEX_UNLOCK_PI )
The caller does not own the futex.
The caller does not own the lock represented by the futex word.
.TP
.BR ESRCH
.RB ( FUTEX_LOCK_PI ,
@ -1379,7 +1430,7 @@ The caller does not own the futex.
.BR FUTEX_CMP_REQUEUE_PI )
.\" FIXME I reworded the following sentence a bit differently from
.\" tglx's formulation. Is it okay?
The thread ID in the futex at
The thread ID in the futex word at
.I uaddr
does not exist.
.TP
@ -1387,7 +1438,7 @@ does not exist.
.RB ( FUTEX_CMP_REQUEUE_PI )
.\" FIXME I reworded the following sentence a bit differently from
.\" tglx's formulation. Is it okay?
The thread ID in the futex at
The thread ID in the futex word at
.I uaddr2
does not exist.
.TP
@ -1418,6 +1469,9 @@ This system call is Linux-specific.
.SH NOTES
Glibc does not provide a wrapper for this system call; call it using
.BR syscall (2).
.\" TODO FIXME Above, we cite this section and claim it contains details on
.\" the synchronization semantics; add the C11 equivalents here (or whatever
.\" we find consensus for).
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
@ -1651,3 +1705,7 @@ Futex example library, futex-*.tar.bz2 at
.\"
.\" FIXME Are there any other resources that should be listed
.\" in the SEE ALSO section?
.\" FIXME We should probably refer to the glibc code here, in particular the
.\" glibc-internal futex wrapper functions that are WIP, and the
.\" generic pthread_mutex_t and perhaps condvar implementations.