I recently realized that I had been reasoning improperly about
what umount(MNT_DETACH) did based on an insufficient description
in the umount.2 man page, that matched my intuition but not the
implementation.
When there are no submounts, MNT_DETACH is essentially harmless to
applications. Where there are submounts, MNT_DETACH changes what
is visible to applications using the detached directories.
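For reference, a minimal sketch of the call under discussion. The
helper name lazy_unmount() is made up for illustration; only
umount2() and MNT_DETACH come from the page. It needs the usual
mount privileges to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Hypothetical helper: lazily detach the mount at 'target'.
   Returns 0 on success, -1 on failure with errno from umount2(). */
int lazy_unmount(const char *target)
{
    assert(target != NULL);

    if (umount2(target, MNT_DETACH) == -1) {
        int saved = errno;          /* fprintf() may change errno */

        fprintf(stderr, "umount2(%s, MNT_DETACH): %s\n",
                target, strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}
```

A caller would run something like lazy_unmount("/mnt/tree") and,
if submounts exist below that point, should expect them to vanish
from view as well.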
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The include file linux/ext2_fs.h no longer contains any ioctl
definitions.
Request codes EXT2_IOC* have been replaced by FS_IOC* in
linux/fs.h.
Some definitions of FS_IOC_* use long* but the actual code expects
int* (see fs/ext2/ioctl.c).
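A small sketch of what this means for callers, assuming the
FS_IOC_GETFLAGS request from linux/fs.h. The helper name
get_inode_flags() is invented for this example. The point is that
the variable handed to ioctl() is an int, whatever type the macro
definition mentions.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: read the inode flags of 'path' into *flags.
   Returns 0 on success, -1 on failure with errno set. */
int get_inode_flags(const char *path, int *flags)
{
    int fd, ret, saved;
    int attr = 0;               /* int, not long: see fs/ext2/ioctl.c */

    assert(path != NULL && flags != NULL);

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    ret = ioctl(fd, FS_IOC_GETFLAGS, &attr);
    saved = errno;              /* close() may overwrite errno */
    if (ret == 0)
        *flags = attr;
    close(fd);
    errno = saved;
    return ret;
}
```

Not every filesystem implements this request, so a failure with
ENOTTY or a similar error is to be expected on some of them.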
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Clone has so many effects that it's an oversimplification to say
that the *main* use of clone is to create a thread. (In fact,
the use of clone() to create new processes may well be more
common, since glibc's fork() is a wrapper that calls clone().)
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Normally, system calls return EINVAL for flags they don't support.
Explicitly document that clone does *not* produce an error for these two
obsolete flags.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
From email conversation with Konstantin:
> * Are you saying there are case where successful
> setsockopt() via nf_register_sockopt() might return a
> value other than zero?
Yes - it happens when the option is served by a custom
netfilter hook (this is how I bumped into this). Example:
Userspace code:
=================== cut here ================================
#include <netinet/in.h>
#include <sys/socket.h>

#define TEST_SETSOCKOPT_RETURN 400

int main(void) {
    int sock;
    if ((sock = socket(AF_INET, SOCK_RAW, IPPROTO_RAW)) < 0)
        return -1;
    return setsockopt(sock, IPPROTO_IP, TEST_SETSOCKOPT_RETURN, NULL, 0);
}
=================== cut here ================================
Kernel module, handling the option 400 "TEST_SETSOCKOPT_RETURN":
=================== cut here ================================
/* Random value - just should not be already used by the running
   system: */
#define TEST_SETSOCKOPT_RETURN 400

static int test_sock_set_so(struct sock *sk, int cmd, void *param,
                            unsigned len) {
    return 42;
}

static struct nf_sockopt_ops test_sock_ops = {
    list: {NULL, NULL},
    pf: PF_INET,
    set_optmin: TEST_SETSOCKOPT_RETURN,
    set_optmax: (TEST_SETSOCKOPT_RETURN + 1),
    set: test_sock_set_so,
    get_optmin: 0,
    get_optmax: 0,
    get: NULL
};

static int test_sock_init(void) {
    return nf_register_sockopt(&test_sock_ops); /* sanity check skipped */
}

static void test_sock_exit(void) {
    nf_unregister_sockopt(&test_sock_ops);
}

module_init(test_sock_init);
module_exit(test_sock_exit);
=================== cut here ================================
After successful loading of the module, the executable returns 42,
and as I understand, that is the intention of netfilter authors.
Netfilter code calls the registered handler and just returns to
user space whatever it receives from it.
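Given that, a cautious caller might test for -1 rather than for
"not zero". The following is only a sketch of that idea; the
helper name set_option() and its 'raw' out-parameter are
invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: call setsockopt() and report failure only
   when it returns -1. The raw return value is stored in *raw, so a
   non-zero success value (such as the 42 above) stays visible. */
int set_option(int sock, int level, int optname,
               const void *val, socklen_t len, int *raw)
{
    int ret;

    assert(raw != NULL);

    ret = setsockopt(sock, level, optname, val, len);
    *raw = ret;
    return (ret == -1) ? -1 : 0;
}
```

With ordinary options the raw value is simply 0 on success, so
the helper behaves like a plain setsockopt() check in that case.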
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
add FAT_IOCTL_GET_VOLUME_ID
SEE ALSO ioctl_fat.2
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Change "behaviour" to American spelling "behavior".
Signed-off-by: Bill Pemberton <wfp5p@worldbroken.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The functions mmap() and munmap() are thread-safe.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The number of padding bytes has changed over time, as some
bytes are used, so describe this aspect of the structure
less explicitly.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The point made in this fairly ancient text is more or less evident
from the DESCRIPTION, and it's not clear what "standard" is being
referred to.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As from upstream commit:
commit 3f31d07571eeea18a7d34db9af21d2285b807a17
Author: Hugh Dickins <hughd@google.com>
Date: Tue May 29 15:06:40 2012 -0700
mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE
Now tmpfs supports hole-punching via fallocate(), switch madvise_remove()
to use do_fallocate() instead of vmtruncate_range(): which extends
madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs.
madvise(,,MADV_REMOVE) support was extended to ext4, ocfs2 and xfs.
bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1120294
Justification from Rafael Aquini:
Well, that code is committed in kernel since v3.5 (2012) and it
surely is the expected behaviour since. It seems to me that
madvise(2) man page text for MADV_REMOVE just got out-of-date in
that regard.
This patch mentions this support in madvise.2 man page.
Reworded and corrected by Michael Kerrisk and Hugh Dickins. Thank you.
Signed-off-by: Jan Chaloupka <jchaloup@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
madvise() is one of those system calls that has congealed over
time, as has the man page. It's helpful to split the discussion
of the 'advice' flags into two groups:
* Those flags that are (1) widespread across implementations;
(2) have counterparts in posix_madvise(3); and (3) were present
in the initial Linux madvise implementation.
* The rest, which are values that (1) may not have counterparts
in other implementations; (2) have no counterparts in
posix_madvise(3); and (3) were added to Linux in more recent
times.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I can find no evidence that madvise() was in POSIX.1b.
Certainly, it's not mentioned in Bill Gallmeister's
POSIX.4 book (O'Reilly).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Over time, bit rot has afflicted this page. Since the original
text was written many new Linux-specific flags have been added.
So, now it's better to explicitly list the flags that
correspond to the POSIX analog of madvise().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The fields d_ino and d_off of structure __fat_dirent are explained.
The different return values of VFAT_IOCTL_READDIR_BOTH and
VFAT_IOCTL_READDIR_SHORT are explained.
The usage of the return value in the example is corrected.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The ioctl(2) system call may be used to retrieve information about
the FAT file system and to set file attributes.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The lm bit should never have existed in the first place. Sigh.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The documentation for set_thread_area was very vague. This
improves it, accounts for recent kernel changes, and merges
it with get_thread_area.2.
get_thread_area.2 now becomes a link.
While I'm at it, clarify the related arch_prctl.2 man page.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This clarifies the behavior and documents all four functions.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Currently the PERF_EVENT_IOC_REFRESH ioctl, when applied to a group
leader, will refresh all children. Also if a refresh value of 0
is chosen then the refresh becomes infinite (never runs out).
Back in 2011 PAPI was relying on these behaviors but I was told
that both were unsupported and subject to being removed at any time.
(See https://lkml.org/lkml/2011/5/24/337 )
However the behavior has not been changed.
This patch updates the manpage to still list the behavior as
unsupported, but removes the inaccurate description of it
only being a problem with 2.6 kernels.
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
fork.2 should clearly point out that the child and parent
processes run in separate memory spaces.
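A short sketch that shows the point. The function name is made
up for this example. The child overwrites its copy of a variable
and the parent's copy is unaffected.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that writes to 'value', wait for it, and return the
   value as the parent sees it afterwards.
   Returns -1 if fork() or waitpid() fails. */
int parent_value_after_child_write(void)
{
    int value = 1;
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        value = 2;                  /* touches the child's copy only */
        _exit(value == 2 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;

    /* The child saw its own write ... */
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    /* ... but the parent's copy is untouched. */
    return value;
}
```

The function returns 1 in the parent, even though the child
stored 2 in its own copy before exiting.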
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Extend the description of PTRACE_SEIZE with a short summary of its
differences from PTRACE_ATTACH.
The following paragraph:
PTRACE_EVENT_STOP
Stop induced by PTRACE_INTERRUPT command, or group-stop, or ini-
tial ptrace-stop when a new child is attached (only if attached
using PTRACE_SEIZE), or PTRACE_EVENT_STOP if PTRACE_SEIZE was used.
has an editing error (the part after last comma makes no sense).
Removing it.
Mention that legacy post-execve SIGTRAP is disabled by PTRACE_SEIZE.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This behaviour was verified by reading the kernel source and
confirming the behaviour using a test program.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The following program illustrates the difference between TCP
and Unix stream sockets doing sendfile. Since TCP implements
zero-copy, modifications made to the file are seen by the
reader even though they happen after sendfile was last
called.
Unix stream sockets do not implement zero-copy (as of
Linux 3.15), so readers continue to see the contents of the
file at the time it was sent, not as they are at the time of
reading.
----------------- sendfile-mod.c ---------------
#define _GNU_SOURCE
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/sendfile.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>
#include <fcntl.h>

static void tcp_socketpair(int sv[2])
{
    struct sockaddr_in addr;
    socklen_t addrlen = sizeof(addr);
    int l = socket(PF_INET, SOCK_STREAM, 0);
    int c = socket(PF_INET, SOCK_STREAM, 0);
    int a;
    int val = 1;

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = 0;
    assert(0 == bind(l, (struct sockaddr*)&addr, addrlen));
    assert(0 == listen(l, 1024));
    assert(0 == getsockname(l, (struct sockaddr *)&addr, &addrlen));
    assert(0 == connect(c, (struct sockaddr *)&addr, addrlen));
    a = accept4(l, NULL, NULL, SOCK_NONBLOCK);
    assert(a >= 0);
    close(l);
    assert(0 == ioctl(c, FIONBIO, &val));
    sv[0] = a;
    sv[1] = c;
}

int main(int argc, char *argv[])
{
    int pair[2];
    FILE *tmp = tmpfile();
    int tfd;
    char buf[16384];
    ssize_t w, r;
    size_t i;
    const size_t n = 2048;
    off_t off = 0;
    char expect[4096];
    int flags = SOCK_STREAM|SOCK_NONBLOCK;

    tfd = fileno(tmp);
    assert(tfd >= 0);

    /* prepare the tempfile */
    memset(buf, 'a', sizeof(buf));
    for (i = 0; i < n; i++)
        assert(sizeof(buf) == write(tfd, buf, sizeof(buf)));

    if (argc == 2 && strcmp(argv[1], "unix") == 0)
        assert(0 == socketpair(AF_UNIX, flags, 0, pair));
    else if (argc == 2 && strcmp(argv[1], "pipe") == 0)
        assert(0 == pipe2(pair, O_NONBLOCK));
    else
        tcp_socketpair(pair);

    /* fill up the socket buffer */
    for (;;) {
        w = sendfile(pair[1], tfd, &off, n);
        if (w > 0)
            continue;
        if (w < 0 && errno == EAGAIN)
            break;
        assert(0 && "unhandled error" && w && errno);
    }
    printf("wrote off=%lld\n", (long long)off);

    /* rewrite the tempfile */
    memset(buf, 'A', sizeof(buf));
    assert(0 == lseek(tfd, 0, SEEK_SET));
    for (i = 0; i < n; i++)
        assert(sizeof(buf) == write(tfd, buf, sizeof(buf)));

    /* we should be reading 'a's, not 'A's */
    memset(expect, 'a', sizeof(expect));
    do {
        r = read(pair[0], buf, sizeof(expect));
        /* TCP fails here since it is zero copy (on Linux 3.15.5) */
        if (r > 0)
            assert(memcmp(buf, expect, r) == 0);
    } while (r > 0);

    return 0;
}
Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
CLONE_PARENT_SETTID stores the child thread ID only in the parent's memory.
Signed-off-by: Peng Haitao <penght@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This patch documents the fact that a successful execve(2) in a process that
is sharing a file descriptor table results in unsharing the table.
I discovered this through testing and verified it by source
inspection - there is a call to unshare_files() early in
do_execve_common().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I encountered these errors while writing a test case for the
migrate_pages syscall for LTP (Linux Test Project).
I checked stable kernel tree 3.5 to see which paths return these.
Both can be returned from get_nodes(), which is called from:
SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
const unsigned long __user *, old_nodes,
const unsigned long __user *, new_nodes)
The test case does the following:
EFAULT
    a) old_nodes/new_nodes is an area mmapped with PROT_NONE
    b) old_nodes/new_nodes is an area not mmapped in the process
       address space, -1, or an area that has just been munmapped
EINVAL
    a) maxnode overflows the kernel limit
    b) new_nodes contains a node that has no memory, does not exist,
       or is not returned by get_mempolicy(MPOL_F_MEMS_ALLOWED).
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I puzzled over mprotect()'s effect on /proc/*/maps for a while
yesterday -- it was setting "x" without PROT_EXEC being specified.
Here is a patch to add some explanation.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
We have users who are terribly confused why their binaries
with CAP_DAC_OVERRIDE capability see EACCES from access() calls,
but are able to read the file.
The reason is access() isn't the "can I read/write/execute this
file?" question, it is the "(assuming that I'm a setuid binary,)
can *the user who invoked me* read/write/execute this file?"
question.
That's why it uses real UIDs as documented, and why it ignores
capabilities when capability-endorsed binaries are run by non-root
(this patch adds this information).
To make users more likely to notice this less-known detail,
the patch expands the explanation with rationale for this logic
into a separate paragraph.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: linux-man@vger.kernel.org
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I am not sure why we have:
"EAGAIN fork() cannot allocate sufficient memory to copy
the parent's page tables and allocate a task structure
for the child."
The text seems to be there from the time when man-pages
were moved to git so there is no history for it.
And it doesn't reflect reality: the kernel reports both
dup_task_struct and dup_mm failures as ENOMEM to the
userspace. This seems to be the case from early 2.x times
so let's simply remove this part.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Verified by experiment on Linux 3.15 and 3.19rc4.
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Let's assume Michael's email address did not change.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add a reference to the AF_ALG protocol accessible via socket(2).
Signed-off-by: Stephan Mueller <stephan.mueller@atsec.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
With his last patches for getrandom.2 Michael Kerrisk posed a few
questions and left some comments in the man-page. This patch
seeks to clarify the open issues.
72 For example, if the call is interrupted by a signal handler,
73 it may return a partially filled buffer, or fail with the error
74 .BR EINTR .
75 .\" Tested with buffer sizes > 256 bytes: both partial reads
76 .\" and EINTR can occur, with the former being more frequent.
77 .\"
Michael's observation agrees with the code.
For buffer size > 256: If the buffer is still empty EINTR occurs.
If any number of bytes has been read to the buffer, that number
is returned. The comment can be removed.
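For callers this means a retry loop is needed when a full buffer
is required. A minimal sketch follows, assuming the glibc wrapper
declared in <sys/random.h> (glibc 2.25 and later). The helper name
fill_random() is invented.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/* Hypothetical helper: fill 'buf' with exactly 'len' random bytes,
   retrying after EINTR and after partial reads.
   Returns 0 on success, -1 on any other error (errno is set). */
int fill_random(void *buf, size_t len)
{
    unsigned char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = getrandom(p, len, 0);

        if (n == -1) {
            if (errno == EINTR)
                continue;       /* nothing was read yet; try again */
            return -1;
        }
        p += n;                 /* partial read: ask for the rest */
        len -= (size_t)n;
    }
    return 0;
}
```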
78 .\" mtk: In the absence of signals, in my testing, even very large reads
79 .\" return full buffers. I found that reads of up to 33554431 always
80 .\" returned a filled buffer. Specifying 'buflen' > 33554431 always
81 .\" returned just 33554431 bytes. (I'm not sure where that number comes
from.
The maximum number of bytes transferred is limited for
/dev/urandom to:
nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
// <= 0x1ffffff
and for /dev/random to
nbytes = min_t(size_t, nbytes, SEC_XFER_SIZE); // <= 0x200
Let's put this into the NOTES section.
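The arithmetic can be checked with a few lines of C. This is a
sketch only: ENTROPY_SHIFT is assumed to be 3 here (the value
that matches the 33554431 observed above), SEC_XFER_SIZE is taken
as 0x200 per the comment, and both function names are invented.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define ENTROPY_SHIFT 3         /* assumption, see lead-in */
#define SEC_XFER_SIZE 0x200     /* 512 bytes, per the text above */

static_assert(INT_MAX == 2147483647, "sketch assumes a 32-bit int");

/* Bytes actually transferred for a /dev/urandom-style request. */
size_t urandom_cap(size_t nbytes)
{
    size_t limit = INT_MAX >> (ENTROPY_SHIFT + 3);  /* 2^25 - 1 */

    return nbytes < limit ? nbytes : limit;
}

/* Bytes actually transferred for a /dev/random-style request. */
size_t random_cap(size_t nbytes)
{
    return nbytes < SEC_XFER_SIZE ? nbytes : SEC_XFER_SIZE;
}
```

With these assumptions urandom_cap() tops out at 33554431
(0x1ffffff) and random_cap() at 512.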
224 When reading from
225 .IR /dev/random ,
226 blocking requests of any size can be interrupted by a signal
227 (the call fails with the error
228 .BR EINTR ).
That's OK.
82 If the pool has not yet been initialized, then the call blocks, unless
83 .B GRND_RANDOM
84 is specified in
85 .IR flags .
86 .\" FIXME We need a bit more information here.
87 .\" The reader will ask: when is /dev/urandom initialized?
88 .\" There should be some text here to explain that.
Entropy is collected from different sources, e.g.
- time of reaping a thread
- MAC addresses of network interfaces
- Allwinner security ID
- ROM content of a firewire device
- ...
When more than 128 bits have been collected, the pool is set
to initialized.
I suggest that detailed information about the initialization
should be provided on the random.4 page.
I added a paragraph in the NOTES section.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The patch clarifies when blocking may occur while calling
getrandom().
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Theodore Ts'o confirmed the bug described in
https://lkml.org/lkml/2014/11/29/16
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Kernel 3.17 introduces a new system call getrandom(2).
The man page in this patch is based on the commit message by
Theodore Ts'o and suggestions by Michael Kerrisk.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
With the kernel "uapi" changes of a few releases ago, these
constants are now automatically provided to glibc.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I noticed you were adding git commit references to the various
Linux version markers.
This adds git commit references for all Linux kernel version
notes in perf_event_open.2
mtk: I backed out two pieces of Vince's patch that were not
source comments. They can be dealt with as separate commits.
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Update the perf_event_open manpage to be more consistent when
discussing overflow events. It merges the discussion of
poll-type notifications with those generated by SIGIO
signal handlers.
This addresses the remaining FIXMEs in the document.
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Remove an inaccurate paragraph about values in the attr.config
field. This information was never true in any released kernel;
it somehow snuck into the manpage because it is still described
this way in tools/perf/design.txt in the kernel source tree.
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
New manual page for the new PCI MMIO memory access system
calls, s390_pci_mmio_write() and s390_pci_mmio_read(),
added for the s390 platform.
Signed-off-by: Alexey Ishchuk <aishchuk@linux.vnet.ibm.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>