Advertise to userspace that they should use proper pid_t types
for arguments returning a pid.
The kernel-internal struct kernel_clone_args currently uses int
as type and since POSIX mandates that pid_t is a signed integer
type and glibc and friends use int this is not an issue. After
the merge window for v5.5 closes we can switch struct
kernel_clone_args over to using pid_t as well without any danger
in regressing current userspace.
Also note, that the new set tid feature which will be merged for
v5.5 uses pid_t types as well.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
If mmap() fails it will return MAP_FAILED which according to the manpage
is (void *)-1 not NULL.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Fix two spelling mistakes in manpage describing the clone{2,3}()
syscalls/syscall wrappers.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Christian Brauner suggested mmap(MAP_STACKED), rather than
malloc(), as the canonical way of allocating a stack for the
child of clone(), and Jann Horn noted some reasons why:
Not on Linux, but on OpenBSD, they do use MAP_STACK now
AFAIK; this was announced here:
<http://openbsd-archive.7691.n7.nabble.com/stack-register-checking-td338238.html>.
Basically they periodically check whether the userspace
stack pointer points into a MAP_STACK region, and if not,
they kill the process. So even if it's a no-op on Linux, it
might make sense to advise people to use the flag to improve
portability? I'm not sure if that's something that belongs
in Linux manpages.
Another reason against malloc() is that when setting up
thread stacks in proper, reliable software, you'll probably
want to place a guard page (in other words, a 4K PROT_NONE
VMA) at the bottom of the stack to reliably catch stack
overflows; and you probably don't want to do that with
malloc, in particular with non-page-aligned allocations.
And the OpenBSD 6.5 manual pages says:
MAP_STACK
Indicate that the mapping is used as a stack. This
flag must be used in combination with MAP_ANON and
MAP_PRIVATE.
And I then noticed that MAP_STACK seems already to be on
FreeBSD for a long time:
MAP_STACK
Map the area as a stack. MAP_ANON is implied.
Offset should be 0, fd must be -1, and prot should
include at least PROT_READ and PROT_WRITE. This
option creates a memory region that grows to at
most len bytes in size, starting from the stack
top and growing down. The stack top is the start‐
ing address returned by the call, plus len bytes.
The bottom of the stack at maximum growth is the
starting address returned by the call.
The entire area is reserved from the point of view
of other mmap() calls, even if not faulted in yet.
Reported-by: Jann Horn <jannh@google.com>
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The obsolete CLONE_DETACHED flag has never been properly
documented, but now the discussion CLONE_PIDFD also requires
mention of CLONE_DETACHED. So, properly document CLONE_DETACHED,
and mention its interactions with CLONE_PIDFD.
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Sometimes the descriptions of these flags mentioned the
corresponding section 7 namespace manual page and then the
required capabilities, and sometimes the order was the was
the reverse. Make it consistent.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Remove details of UTS, IPC, and network namespaces that are
already covered in the corresponding namespaces pages in
section 7. This change is for consistency, since corresponding
details were not provided for other namespace types in clone(2)
and these details do not appear in unshare(2).
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
After feedback from Christian Brauner [1], I've adjusted a few pieces
of the clone3() text, and also adjusted some of the older text in
the page.
[1] https://lore.kernel.org/linux-man/20191107151941.dw4gtul5lrtax4se@wittgenstein/
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Change the text in the introductory paragraph (which was written
20 years ago) to reflect the fact that clone*() does more things
nowadays.
Cowritten-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Adjust references to namespaces(7) to be references to pages
describing specific namespace types.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Use "flags mask" as a generic term to refer to the clone()
'flags' argument and the clone3() 'cl_args.flags' field.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Sometime soon, we'll have to add documentation of clone3() to this
page. As a preparatorys step, make the names of the clone()
arguments the same as the fields in the clone3() 'args' struct:
ctid ==> child_pid
ptid ==> parent_tid
newtls ==> tld
child_stack ==> stack
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
In the kernel source (kernel/fork.c::copy_process()), there is:
pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
O_RDWR | O_CLOEXEC);
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Add an entry for CLONE_PIDFD. This flag is available starting
with kernel 5.2. If specified, a process file descriptor
("pidfd") referring to the child process will be returned in
the ptid argument.
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Quoting Branden:
*roff escape sequences may sometimes look like C escapes, but that
is misleading. *roff is in part a macro language and that means
recursive expansion to arbitrary depths.
You can get away with "\\" in a context where no macro expansion
is taking place, but try to spell a literal backslash this way in
the argument to a macro and you will likely be unhappy with
results.
Try viewing the attached file with "man -l".
"\e" is the preferred and portable way to get a portable "escape
literal" going back to CSTR #54, the original Bell Labs troff
paper.
groff(7) discusses the issue:
\\ reduces to a single backslash; useful to delay its
interpretation as escape character in copy mode. For a
printable backslash, use \e, or even better \[rs], to be
independent from the current escape character.
As of groff 1.22.4, groff_man(7) does as well:
\e Widely used in man pages to represent a backslash output
glyph. It works reliably as long as the .ec request is
not used, which should never happen in man pages, and it
is slightly more portable than the more exact ‘\(rs’
(“reverse solidus”) escape sequence.
People not concerned with portability to extremely old troffs should
probably just use \(rs (or \[rs]), as it means "the backslash
glyph", not "the glyph corresponding to whatever the current escape
character is".
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
See copy_process() in kernel/fork.c:
if (clone_flags & CLONE_THREAD) {
if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
(task_active_pid_ns(current) !=
current->nsproxy->pid_ns_for_children))
return ERR_PTR(-EINVAL);
}
current->nsproxy->pid_ns_for_children is where unshare(CLONE_NEWPID)
stashes the pending namespace.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The extra detail has little of noting with -test 2.6.0
added a particular feature has little value these days,
and is likely to confuse some readers who don't know
(and probably don't care) about the historical details.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Note that clone() definition on IA-64 is the same as on
SH/Tile/Alpha, align __clone2 declarations in line with the
previous ones, add clone2 syscall prototype.
Signed-off-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
The discussion is phrased in terms of signals send using kill(2),
but applies equally to a signal sent by the kernel.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Note that EINVAL can occur with CLONE_NEWUSER if the kernel was
not configured with CONFIG_USER_NS.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>