Three versions of "stat" appeared on 32-bit systems,
dealing with structures of different (increasing) sizes.
Explain some of the details, and also note that the
situation is simpler on modern 64-bit architectures.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The multiple-system-call-version phenomenon is particular a
feature of older 32-bit platforms. Hint at that fact in the text.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In the discussion of the new *64() system calls added in
Linux 2.4, use truncate64() father than ftruncate64(),
since the text goes on to say "and their analogs that work with
file descriptors or symbolic links".
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
madvise(2) actually returns with error EINVAL for MADV_REMOVE
when used for hugetlb VMAs, not EOPNOTSUPP, and this has been
the case since MADV_REMOVE was introduced in commit f6b3ec238d12
("madvise(MADV_REMOVE): remove pages from tmpfs shm backing
store").
Specify the exact behavior.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The glibc wrapper gives an EINVAL error on attempts to change the
disposition of either of the two real-time signals used by NPTL.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
At the kernel level, credentials (UIDs and GIDs) are a per-thread
attribute. NPTL uses a signal-based mechanism to ensure that
when one thread changes its credentials, all other threads change
credentials to the same values. By this means, the NPTL
implementation conforms to the POSIX requirement that the threads
in a process share credentials.
Reported-by: Shawn Landden <shawn@churchofgit.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
At the kernel level, credentials (UIDs and GIDs) are a per-thread
attribute. NPTL uses a signal-based mechanism to ensure that
when one thread changes its credentials, all other threads change
credentials to the same values. By this means, the NPTL
implementation conforms to the POSIX requirement that the threads
in a process share credentials.
Reported-by: Shawn Landden <shawn@churchofgit.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
At the kernel level, credentials (UIDs and GIDs) are a per-thread
attribute. NPTL uses a signal-based mechanism to ensure that
when one thread changes its credentials, all other threads change
credentials to the same values. By this means, the NPTL
implementation conforms to the POSIX requirement that the threads
in a process share credentials.
Reported-by: Shawn Landden <shawn@churchofgit.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
At the kernel level, credentials (UIDs and GIDs) are a per-thread
attribute. NPTL uses a signal-based mechanism to ensure that
when one thread changes its credentials, all other threads change
credentials to the same values. By this means, the NPTL
implementation conforms to the POSIX requirement that the threads
in a process share credentials.
Reported-by: Shawn Landden <shawn@churchofgit.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
At the kernel level, credentials (UIDs and GIDs) are a per-thread
attribute. NPTL uses a signal-based mechanism to ensure that
when one thread changes its credentials, all other threads change
credentials to the same values. By this means, the NPTL
implementation conforms to the POSIX requirement that the threads
in a process share credentials.
Reported-by: Shawn Landden <shawn@churchofgit.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
On Wed, Mar 11, 2015 at 10:43:50PM +0100, Mikael Pettersson wrote:
> Jann Horn writes:
> > Or should I throw this patch away and write a patch
> > for the prctl() manpage instead that documents that
> > being able to call sigreturn() implies being able to
> > effectively call sigprocmask(), at least on some
> > architectures like X86?
>
> Well, that is the semantics of sigreturn(). It is essentially
> setcontext() [which includes the actions of sigprocmask()], but
> with restrictions on parameter placement (at least on x86).
>
> You could introduce some setting to restrict that aspect for
> seccomp processes, but you can't change this for normal processes
> without breaking things.
Then I think it's probably better and easier to just document the
existing behavior? If a new setting would have to be introduced
and developers would need to be aware of that, it's probably
easier to just tell everyone to use SIGKILL.
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Mikael Pettersson <mikpelinux@gmail.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Fix a warning of groff: line 527: warning [p 6, 2.3i]: cannot adjust line
Signed-off-by: Stéphane Aulery <saulery@free.fr>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Fix a warning of groff: line 192: warning [p 2, 4.7i]: cannot adjust line
Signed-off-by: Stéphane Aulery <saulery@free.fr>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The description mentions close(2). Hence it should also be referenced
in the SEE ALSO section.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The list of errnos for msgrcv() lists both EAGAIN and ENOMSG as
the errno for no message available with the IPC_NOWAIT flag.
ENOMSG is the errno that will be set.
Signed-off-by: Bill Pemberton <wfp5p@worldbroken.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I recently realized that I had been reasoning improperly about
what umount(MNT_DETACH) did based on an insufficient description
in the umount.2 man page, that matched my intuition but not the
implementation.
When there are no submounts, MNT_DETACH is essentially harmless to
applications. Where there are submounts, MNT_DETACH changes what
is visible to applications using the detach directories.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Include linux/ext2_fs.h does not contain any ioctl definitions
anymore.
Request codes EXT2_IOC* have been replaced by FS_IOC* in
linux/fs.h.
Some definitions of FS_IOC_* use long* but the actual code expects
int* (see fs/ext2/ioctl.c).
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Clone has so many effects that it's an oversimplification to say
that the *main* use of clone is to create a thread. (In fact,
the use of clone() to create new processes may well be more
common, since glibc's fork() is a wrapper that calls clone().)
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Normally, system calls return EINVAL for flags they don't support.
Explicitly document that clone does *not* produce an error for these two
obsolete flags.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
From email conversation with Konstantin:
> * Are you saying there are case where successful
> setsockopt() via nf_register_sockopt() might return a
> value other zero?
Yes - it happens when the option is served by a custom
netfilter hook (this is how I bumped into this). Example:
Userspace code:
=================== cut here ================================
int main(void) {
int sock;
if ((sock = socket(AF_INET, SOCK_RAW, IPPROTO_RAW)) < 0)
return -1;
return setsockopt(sock, IPPROTO_IP, TEST_SETSOCKOPT_RETURN, NULL, 0);
}
=================== cut here ================================
Kernel module, handling the option 400 "TEST_SETSOCKOPT_RETURN":
=================== cut here ================================
/* Random value - just should not be already used by the running
system: */
static int test_sock_set_so(struct sock *sk, int cmd, void *param,
unsigned len) {
return 42;
}
static struct nf_sockopt_ops test_sock_ops = {
list: {NULL, NULL},
pf: PF_INET,
set_optmin: TEST_SETSOCKOPT_RETURN,
set_optmax: (TEST_SETSOCKOPT_RETURN + 1),
set: test_sock_set_so,
get_optmin: 0,
get_optmax: 0,
get: NULL
};
static int test_sock_init(void) {
return nf_register_sockopt(&test_sock_ops); /* sanity check
skipped */
}
static void test_sock_exit(void) {
nf_unregister_sockopt(&test_sock_ops);
}
module_init(test_sock_init);
module_exit(test_sock_exit);
=================== cut here ================================
After successful loading of the module, the executable returns 42,
and as I understand, that is the intention of netfilter authors.
Netfilter code calls the registered handle and just returns back to
user what it receives from it.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
add FAT_IOCTL_GET_VOLUME_ID
SEE ALSO ioctl_fat.2
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Change "behaviour" to American spelling "behavior".
Signed-off-by: Bill Pemberton <wfp5p@worldbroken.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The function mmap() and munmap() are thread safe.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The number of padding bytes has changed over tyme, as some
bytes are used, so describe this aspect of the structure
less explicitly.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The point made in this fairly ancient text is more or less evident
from the DESCRIPTION, and it's not clear what "standard" is being
referred to.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
As from upstream commit:
commit 3f31d07571eeea18a7d34db9af21d2285b807a17
Author: Hugh Dickins <hughd@google.com>
Date: Tue May 29 15:06:40 2012 -0700
mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE
Now tmpfs supports hole-punching via fallocate(), switch madvise_remove()
to use do_fallocate() instead of vmtruncate_range(): which extends
madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs.
madvise(,,MADV_REMOVE) support was extended by ext4, ocfs2 and xfs.
bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1120294
Justification from Rafael Aquini:
Well, that code is committed in kernel since v3.5 (2012) and it
surely is the expected behaviour since. It seems to me that
madvise(2) man page text for MADV_REMOVE just got out-of-date in
that regard.
This patch mentions this support in madvise.2 man page.
Reworded and corrected by Michael Kerrisk and Hugh Dickins. Thank you.
Signed-off-by: Jan Chaloupka <jchaloup@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
madvise() is one of those system calls that has congealed over
time, as has the man page. It's helpful to split the discussion
of 'advice' into those flags into two groups:
* Those flags that are (1) widespread across implementations;
(2) have counterparts in posix_madvise(3); and (3) were present
in the initial Linux madvise implementation.
* The rest, which are values that (1) may not have counterparts
in other implementations; (2) have no counterparts in
posix_madvise(3); and (3) were added to Linux in more recent
times.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I can find no evidence that madvise() was in POSIX.1b.
Certainly, it's not mentioned in Bill Gallmeister's
POSIX.4 book (O'Reilly).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Over time, bit rot has afflicted this page. Since the original
text was written many new Linux-specific flags have been added.
So, now it's better to explicitly list the flags that
correspond to the POSIX analog of madvise().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The fields d_ino and d_off of structure __fat_dirent are explained.
The different return values of VFAT_IOCTL_READDIR_BOTH and
VFAT_IOCTL_READDIR_SHORT are explained.
The usage of the return value in the example is corrected.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The ioctl(2) system call may be used to retrieve information about
the FAT file system and to set file attributes.
Signed-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The lm bit should never have existed in the first place. Sigh.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The documentation for set_thread_area was very vague. This
improves it, accounts for recent kernel changes, and merges
it with get_thread_area.2.
get_thread_area.2 now becomes a link.
While I'm at it, clarify the related arch_prctl.2 man page.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This clarifies the behavior and documents all four functions.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Currently the PERF_EVENT_IOC_REFRESH ioctl, when applied to a group
leader, will refresh all children. Also if a refresh value of 0
is chosen then the refresh becomes infinite (never runs out).
Back in 2011 PAPI was relying on these behaviors but I was told
that both were unsupported and subject to being removed at any time.
(See https://lkml.org/lkml/2011/5/24/337 )
However the behavior has not been changed.
This patch updates the manpage to still list the behavior as
unsupported, but removes the inaccurate description of it
only being a problem with 2.6 kernels.
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
fork.2 should clearly point out that child and parent
process run in separate memory spaces.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Extend description of PTRACE_SEIZE with the short summary of its
differences from PTRACE_ATTACH.
The following paragraph:
PTRACE_EVENT_STOP
Stop induced by PTRACE_INTERRUPT command, or group-stop, or ini-
tial ptrace-stop when a new child is attached (only if attached
using PTRACE_SEIZE), or PTRACE_EVENT_STOP if PTRACE_SEIZE was used.
has an editing error (the part after last comma makes no sense).
Removing it.
Mention that legacy post-execve SIGTRAP is disabled by PTRACE_SEIZE.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This behaviour was verified by reading the kernel source and
confirming the behaviour using a test program.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The following program illustrates the difference between TCP
and Unix stream sockets doing sendfile. Since TCP implements
zero-copy, the new modifications to the file transferred is
seen upon reading despite the modifications happening after
sendfile was last called.
Unix stream sockets do not implement zero-copy (as of
Linux 3.15), so readers continue to see the contents of the
file at the time it was sent, not as they are at the time of
reading.
----------------- sendfile-mod.c ---------------
#define _GNU_SOURCE
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/sendfile.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>
#include <fcntl.h>
static void tcp_socketpair(int sv[2])
{
struct sockaddr_in addr;
socklen_t addrlen = sizeof(addr);
int l = socket(PF_INET, SOCK_STREAM, 0);
int c = socket(PF_INET, SOCK_STREAM, 0);
int a;
int val = 1;
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = INADDR_ANY;
addr.sin_port = 0;
assert(0 == bind(l, (struct sockaddr*)&addr, addrlen));
assert(0 == listen(l, 1024));
assert(0 == getsockname(l, (struct sockaddr *)&addr, &addrlen));
assert(0 == connect(c, (struct sockaddr *)&addr, addrlen));
a = accept4(l, NULL, NULL, SOCK_NONBLOCK);
assert(a >= 0);
close(l);
assert(0 == ioctl(c, FIONBIO, &val));
sv[0] = a;
sv[1] = c;
}
int main(int argc, char *argv[])
{
int pair[2];
FILE *tmp = tmpfile();
int tfd;
char buf[16384];
ssize_t w, r;
size_t i;
const size_t n = 2048;
off_t off = 0;
char expect[4096];
int flags = SOCK_STREAM|SOCK_NONBLOCK;
tfd = fileno(tmp);
assert(tfd >= 0);
/* prepare the tempfile */
memset(buf, 'a', sizeof(buf));
for (i = 0; i < n; i++)
assert(sizeof(buf) == write(tfd, buf, sizeof(buf)));
if (argc == 2 && strcmp(argv[1], "unix") == 0)
assert(0 == socketpair(AF_UNIX, flags, 0, pair));
else if (argc == 2 && strcmp(argv[1], "pipe") == 0)
assert(0 == pipe2(pair, O_NONBLOCK));
else
tcp_socketpair(pair);
/* fill up the socket buffer */
for (;;) {
w = sendfile(pair[1], tfd, &off, n);
if (w > 0)
continue;
if (w < 0 && errno == EAGAIN)
break;
assert(0 && "unhandled error" && w && errno);
}
printf("wrote off=%lld\n", (long long)off);
/* rewrite the tempfile */
memset(buf, 'A', sizeof(buf));
assert(0 == lseek(tfd, 0, SEEK_SET));
for (i = 0; i < n; i++)
assert(sizeof(buf) == write(tfd, buf, sizeof(buf)));
/* we should be reading 'a's, not 'A's */
memset(expect, 'a', sizeof(expect));
do {
r = read(pair[0], buf, sizeof(expect));
/* TCP fails here since it is zero copy (on Linux 3.15.5) */
if (r > 0)
assert(memcmp(buf, expect, r) == 0);
} while (r > 0);
return 0;
}
Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
CLONE_PARENT_SETTID only stores child thread ID in parent memory.
Signed-off-by: Peng Haitao <penght@cn.fujitsu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
This patch the fact that a successful execve(2) in a process that
is sharing a file descriptor table results in unsharing the table.
I discovered this through testing and verified it by source
inspection - there is a call to unshare_files() early in
do_execve_common().
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I encountered these errors while writing testcase for migrate_pages
syscall for LTP (Linux test project).
I checked stable kernel tree 3.5 to see which paths return these.
Both can be returned from get_nodes(), which is called from:
SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
const unsigned long __user *, old_nodes,
const unsigned long __user *, new_nodes)
The testcase does following:
EFAULT
a) old_nodes/new_nodes is area mmaped with PROT_NONE
b) old_nodes/new_nodes is area not mmapped in process address
space, -1 or area that has been just munmmaped
EINVAL
a) maxnodes overflows kernel limit
b) new_nodes contain node, which has no memory or does not exist
or is not returned for get_mempolicy(MPOL_F_MEMS_ALLOWED).
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I puzzled over mprotect()'s effect on /proc/*/maps for a while
yesterday -- it was setting "x" without PROT_EXEC being specified.
Here is a patch to add some explanation.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
We have users who are terribly confused why their binaries
with CAP_DAC_OVERRIDE capability see EACCESS from access() calls,
but are able to read the file.
The reason is access() isn't the "can I read/write/execute this
file?" question, it is the "(assuming that I'm a setuid binary,)
can *the user who invoked me* read/write/execute this file?"
question.
That's why it uses real UIDs as documented, and why it ignores
capabilities when capability-endorsed binaries are run by non-root
(this patch adds this information).
To make users more likely to notice this less-known detail,
the patch expands the explanation with rationale for this logic
into a separate paragraph.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: linux-man@vger.kernel.org
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
I am not sure why we have:
"EAGAIN fork() cannot allocate sufficient memory to copy
the parent's page tables and allocate a task structure
or the child."
The text seems to be there from the time when man-pages
were moved to git so there is no history for it.
And it doesn't reflect reality: the kernel reports both
dup_task_struct and dup_mm failures as ENOMEM to the
userspace. This seems to be the case from early 2.x times
so let's simply remove this part.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>