mirror of https://github.com/mkerrisk/man-pages
692 lines
16 KiB
Groff
692 lines
16 KiB
Groff
.\" Copyright (c) 2016, IBM Corporation.
|
|
.\" Written by Mike Rapoport <rppt@linux.vnet.ibm.com>
|
|
.\" and Copyright (C) 2016 Michael Kerrisk <mtk.manpages@gmail.com>
|
|
.\"
|
|
.\" %%%LICENSE_START(VERBATIM)
|
|
.\" Permission is granted to make and distribute verbatim copies of this
|
|
.\" manual provided the copyright notice and this permission notice are
|
|
.\" preserved on all copies.
|
|
.\"
|
|
.\" Permission is granted to copy and distribute modified versions of this
|
|
.\" manual under the conditions for verbatim copying, provided that the
|
|
.\" entire resulting derived work is distributed under the terms of a
|
|
.\" permission notice identical to this one.
|
|
.\"
|
|
.\" Since the Linux kernel and libraries are constantly changing, this
|
|
.\" manual page may be incorrect or out-of-date. The author(s) assume no
|
|
.\" responsibility for errors or omissions, or for damages resulting from
|
|
.\" the use of the information contained herein. The author(s) may not
|
|
.\" have taken the same level of care in the production of this manual,
|
|
.\" which is licensed free of charge, as they might when working
|
|
.\" professionally.
|
|
.\"
|
|
.\" Formatted or processed versions of this manual, if unaccompanied by
|
|
.\" the source, must acknowledge the copyright and authors of this work.
|
|
.\" %%%LICENSE_END
|
|
.\"
|
|
.\"
|
|
.TH IOCTL_USERFAULTFD 2 2021-03-22 "Linux" "Linux Programmer's Manual"
|
|
.SH NAME
|
|
ioctl_userfaultfd \- create a file descriptor for handling page faults in user
|
|
space
|
|
.SH SYNOPSIS
|
|
.nf
|
|
.B #include <sys/ioctl.h>
|
|
.PP
|
|
.BI "int ioctl(int " fd ", int " cmd ", ...);"
|
|
.fi
|
|
.SH DESCRIPTION
|
|
Various
|
|
.BR ioctl (2)
|
|
operations can be performed on a userfaultfd object (created by a call to
|
|
.BR userfaultfd (2))
|
|
using calls of the form:
|
|
.PP
|
|
.in +4n
|
|
.EX
|
|
ioctl(fd, cmd, argp);
|
|
.EE
|
|
.in
|
|
In the above,
|
|
.I fd
|
|
is a file descriptor referring to a userfaultfd object,
|
|
.I cmd
|
|
is one of the commands listed below, and
|
|
.I argp
|
|
is a pointer to a data structure that is specific to
|
|
.IR cmd .
|
|
.PP
|
|
The various
|
|
.BR ioctl (2)
|
|
operations are described below.
|
|
The
|
|
.BR UFFDIO_API ,
|
|
.BR UFFDIO_REGISTER ,
|
|
and
|
|
.BR UFFDIO_UNREGISTER
|
|
operations are used to
|
|
.I configure
|
|
userfaultfd behavior.
|
|
These operations allow the caller to choose what features will be enabled and
|
|
what kinds of events will be delivered to the application.
|
|
The remaining operations are
|
|
.IR range
|
|
operations.
|
|
These operations enable the calling application to resolve page-fault
|
|
events.
|
|
.\"
|
|
.SS UFFDIO_API
|
|
(Since Linux 4.3.)
|
|
Enable operation of the userfaultfd and perform API handshake.
|
|
.PP
|
|
The
|
|
.I argp
|
|
argument is a pointer to a
|
|
.IR uffdio_api
|
|
structure, defined as:
|
|
.PP
|
|
.in +4n
|
|
.EX
|
|
struct uffdio_api {
|
|
__u64 api; /* Requested API version (input) */
|
|
__u64 features; /* Requested features (input/output) */
|
|
__u64 ioctls; /* Available ioctl() operations (output) */
|
|
};
|
|
.EE
|
|
.in
|
|
.PP
|
|
The
|
|
.I api
|
|
field denotes the API version requested by the application.
|
|
.PP
|
|
The kernel verifies that it can support the requested API version,
|
|
and sets the
|
|
.I features
|
|
and
|
|
.I ioctls
|
|
fields to bit masks representing all the available features and the generic
|
|
.BR ioctl (2)
|
|
operations available.
|
|
.PP
|
|
For Linux kernel versions before 4.11, the
|
|
.I features
|
|
field must be initialized to zero before the call to
|
|
.BR UFFDIO_API ,
|
|
and zero (i.e., no feature bits) is placed in the
|
|
.I features
|
|
field by the kernel upon return from
|
|
.BR ioctl (2).
|
|
.PP
|
|
Starting from Linux 4.11, the
|
|
.I features
|
|
field can be used to ask whether particular features are supported
|
|
and explicitly enable userfaultfd features that are disabled by default.
|
|
The kernel always reports all the available features in the
|
|
.I features
|
|
field.
|
|
.PP
|
|
To enable userfaultfd features the application should set
|
|
a bit corresponding to each feature it wants to enable in the
|
|
.I features
|
|
field.
|
|
If the kernel supports all the requested features it will enable them.
|
|
Otherwise it will zero out the returned
|
|
.I uffdio_api
|
|
structure and return
|
|
.BR EINVAL .
|
|
.\" FIXME add more details about feature negotiation and enablement
|
|
.PP
|
|
The following feature bits may be set:
|
|
.TP
|
|
.BR UFFD_FEATURE_EVENT_FORK " (since Linux 4.11)"
|
|
When this feature is enabled,
|
|
the userfaultfd objects associated with a parent process are duplicated
|
|
into the child process during
|
|
.BR fork (2)
|
|
and a
|
|
.B UFFD_EVENT_FORK
|
|
event is delivered to the userfaultfd monitor
|
|
.TP
|
|
.BR UFFD_FEATURE_EVENT_REMAP " (since Linux 4.11)"
|
|
If this feature is enabled,
|
|
when the faulting process invokes
|
|
.BR mremap (2),
|
|
the userfaultfd monitor will receive an event of type
|
|
.BR UFFD_EVENT_REMAP .
|
|
.TP
|
|
.BR UFFD_FEATURE_EVENT_REMOVE " (since Linux 4.11)"
|
|
If this feature is enabled,
|
|
when the faulting process calls
|
|
.BR madvise (2)
|
|
with the
|
|
.B MADV_DONTNEED
|
|
or
|
|
.B MADV_REMOVE
|
|
advice value to free a virtual memory area
|
|
the userfaultfd monitor will receive an event of type
|
|
.BR UFFD_EVENT_REMOVE .
|
|
.TP
|
|
.BR UFFD_FEATURE_EVENT_UNMAP " (since Linux 4.11)"
|
|
If this feature is enabled,
|
|
when the faulting process unmaps virtual memory either explicitly with
|
|
.BR munmap (2),
|
|
or implicitly during either
|
|
.BR mmap (2)
|
|
or
|
|
.BR mremap (2),
|
|
the userfaultfd monitor will receive an event of type
|
|
.BR UFFD_EVENT_UNMAP .
|
|
.TP
|
|
.BR UFFD_FEATURE_MISSING_HUGETLBFS " (since Linux 4.11)"
|
|
If this feature bit is set,
|
|
the kernel supports registering userfaultfd ranges on hugetlbfs
|
|
virtual memory areas
|
|
.TP
|
|
.BR UFFD_FEATURE_MISSING_SHMEM " (since Linux 4.11)"
|
|
If this feature bit is set,
|
|
the kernel supports registering userfaultfd ranges on shared memory areas.
|
|
This includes all kernel shared memory APIs:
|
|
System V shared memory,
|
|
.BR tmpfs (5),
|
|
shared mappings of
|
|
.IR /dev/zero ,
|
|
.BR mmap (2)
|
|
with the
|
|
.B MAP_SHARED
|
|
flag set,
|
|
.BR memfd_create (2),
|
|
and so on.
|
|
.TP
|
|
.BR UFFD_FEATURE_SIGBUS " (since Linux 4.14)"
|
|
.\" commit 2d6d6f5a09a96cc1fec7ed992b825e05f64cb50e
|
|
If this feature bit is set, no page-fault events
|
|
.RB ( UFFD_EVENT_PAGEFAULT )
|
|
will be delivered.
|
|
Instead, a
|
|
.B SIGBUS
|
|
signal will be sent to the faulting process.
|
|
Applications using this
|
|
feature will not require the use of a userfaultfd monitor for processing
|
|
memory accesses to the regions registered with userfaultfd.
|
|
.PP
|
|
The returned
|
|
.I ioctls
|
|
field can contain the following bits:
|
|
.\" FIXME This user-space API seems not fully polished. Why are there
|
|
.\" not constants defined for each of the bit-mask values listed below?
|
|
.TP
|
|
.B 1 << _UFFDIO_API
|
|
The
|
|
.B UFFDIO_API
|
|
operation is supported.
|
|
.TP
|
|
.B 1 << _UFFDIO_REGISTER
|
|
The
|
|
.B UFFDIO_REGISTER
|
|
operation is supported.
|
|
.TP
|
|
.B 1 << _UFFDIO_UNREGISTER
|
|
The
|
|
.B UFFDIO_UNREGISTER
|
|
operation is supported.
|
|
.PP
|
|
This
|
|
.BR ioctl (2)
|
|
operation returns 0 on success.
|
|
On error, \-1 is returned and
|
|
.I errno
|
|
is set to indicate the error.
|
|
Possible errors include:
|
|
.TP
|
|
.B EFAULT
|
|
.I argp
|
|
refers to an address that is outside the calling process's
|
|
accessible address space.
|
|
.TP
|
|
.B EINVAL
|
|
The userfaultfd has already been enabled by a previous
|
|
.BR UFFDIO_API
|
|
operation.
|
|
.TP
|
|
.B EINVAL
|
|
The API version requested in the
|
|
.I api
|
|
field is not supported by this kernel, or the
|
|
.I features
|
|
field passed to the kernel includes feature bits that are not supported
|
|
by the current kernel version.
|
|
.\" FIXME In the above error case, the returned 'uffdio_api' structure is
|
|
.\" zeroed out. Why is this done? This should be explained in the manual page.
|
|
.\"
|
|
.\" Mike Rapoport:
|
|
.\" In my understanding the uffdio_api
|
|
.\" structure is zeroed to allow the caller
|
|
.\" to distinguish the reasons for -EINVAL.
|
|
.\"
|
|
.SS UFFDIO_REGISTER
|
|
(Since Linux 4.3.)
|
|
Register a memory address range with the userfaultfd object.
|
|
The pages in the range must be "compatible".
|
|
.PP
|
|
Up to Linux kernel 4.11,
|
|
only private anonymous ranges are compatible for registering with
|
|
.BR UFFDIO_REGISTER .
|
|
.PP
|
|
Since Linux 4.11,
|
|
hugetlbfs and shared memory ranges are also compatible with
|
|
.BR UFFDIO_REGISTER .
|
|
.PP
|
|
The
|
|
.I argp
|
|
argument is a pointer to a
|
|
.I uffdio_register
|
|
structure, defined as:
|
|
.PP
|
|
.in +4n
|
|
.EX
|
|
struct uffdio_range {
|
|
__u64 start; /* Start of range */
|
|
__u64 len; /* Length of range (bytes) */
|
|
};
|
|
|
|
struct uffdio_register {
|
|
struct uffdio_range range;
|
|
__u64 mode; /* Desired mode of operation (input) */
|
|
__u64 ioctls; /* Available ioctl() operations (output) */
|
|
};
|
|
.EE
|
|
.in
|
|
.PP
|
|
The
|
|
.I range
|
|
field defines a memory range starting at
|
|
.I start
|
|
and continuing for
|
|
.I len
|
|
bytes that should be handled by the userfaultfd.
|
|
.PP
|
|
The
|
|
.I mode
|
|
field defines the mode of operation desired for this memory region.
|
|
The following values may be bitwise ORed to set the userfaultfd mode for
|
|
the specified range:
|
|
.TP
|
|
.B UFFDIO_REGISTER_MODE_MISSING
|
|
Track page faults on missing pages.
|
|
.TP
|
|
.B UFFDIO_REGISTER_MODE_WP
|
|
Track page faults on write-protected pages.
|
|
.PP
|
|
Currently, the only supported mode is
|
|
.BR UFFDIO_REGISTER_MODE_MISSING .
|
|
.PP
|
|
If the operation is successful, the kernel modifies the
|
|
.I ioctls
|
|
bit-mask field to indicate which
|
|
.BR ioctl (2)
|
|
operations are available for the specified range.
|
|
This returned bit mask is as for
|
|
.BR UFFDIO_API .
|
|
.PP
|
|
This
|
|
.BR ioctl (2)
|
|
operation returns 0 on success.
|
|
On error, \-1 is returned and
|
|
.I errno
|
|
is set to indicate the error.
|
|
Possible errors include:
|
|
.\" FIXME Is the following error list correct?
|
|
.\"
|
|
.TP
|
|
.B EBUSY
|
|
A mapping in the specified range is registered with another
|
|
userfaultfd object.
|
|
.TP
|
|
.B EFAULT
|
|
.I argp
|
|
refers to an address that is outside the calling process's
|
|
accessible address space.
|
|
.TP
|
|
.B EINVAL
|
|
An invalid or unsupported bit was specified in the
|
|
.I mode
|
|
field; or the
|
|
.I mode
|
|
field was zero.
|
|
.TP
|
|
.B EINVAL
|
|
There is no mapping in the specified address range.
|
|
.TP
|
|
.B EINVAL
|
|
.I range.start
|
|
or
|
|
.I range.len
|
|
is not a multiple of the system page size; or,
|
|
.I range.len
|
|
is zero; or these fields are otherwise invalid.
|
|
.TP
|
|
.B EINVAL
|
|
There as an incompatible mapping in the specified address range.
|
|
.\" Mike Rapoport:
|
|
.\" ENOMEM if the process is exiting and the
|
|
.\" mm_struct has gone by the time userfault grabs it.
|
|
.SS UFFDIO_UNREGISTER
|
|
(Since Linux 4.3.)
|
|
Unregister a memory address range from userfaultfd.
|
|
The pages in the range must be "compatible" (see the description of
|
|
.BR UFFDIO_REGISTER .)
|
|
.PP
|
|
The address range to unregister is specified in the
|
|
.IR uffdio_range
|
|
structure pointed to by
|
|
.IR argp .
|
|
.PP
|
|
This
|
|
.BR ioctl (2)
|
|
operation returns 0 on success.
|
|
On error, \-1 is returned and
|
|
.I errno
|
|
is set to indicate the error.
|
|
Possible errors include:
|
|
.TP
|
|
.B EINVAL
|
|
Either the
|
|
.I start
|
|
or the
|
|
.I len
|
|
field of the
|
|
.I ufdio_range
|
|
structure was not a multiple of the system page size; or the
|
|
.I len
|
|
field was zero; or these fields were otherwise invalid.
|
|
.TP
|
|
.B EINVAL
|
|
There as an incompatible mapping in the specified address range.
|
|
.TP
|
|
.B EINVAL
|
|
There was no mapping in the specified address range.
|
|
.\"
|
|
.SS UFFDIO_COPY
|
|
(Since Linux 4.3.)
|
|
Atomically copy a continuous memory chunk into the userfault registered
|
|
range and optionally wake up the blocked thread.
|
|
The source and destination addresses and the number of bytes to copy are
|
|
specified by the
|
|
.IR src ", " dst ", and " len
|
|
fields of the
|
|
.I uffdio_copy
|
|
structure pointed to by
|
|
.IR argp :
|
|
.PP
|
|
.in +4n
|
|
.EX
|
|
struct uffdio_copy {
|
|
__u64 dst; /* Destination of copy */
|
|
__u64 src; /* Source of copy */
|
|
__u64 len; /* Number of bytes to copy */
|
|
__u64 mode; /* Flags controlling behavior of copy */
|
|
__s64 copy; /* Number of bytes copied, or negated error */
|
|
};
|
|
.EE
|
|
.in
|
|
.PP
|
|
The following value may be bitwise ORed in
|
|
.IR mode
|
|
to change the behavior of the
|
|
.B UFFDIO_COPY
|
|
operation:
|
|
.TP
|
|
.B UFFDIO_COPY_MODE_DONTWAKE
|
|
Do not wake up the thread that waits for page-fault resolution
|
|
.PP
|
|
The
|
|
.I copy
|
|
field is used by the kernel to return the number of bytes
|
|
that was actually copied, or an error (a negated
|
|
.IR errno -style
|
|
value).
|
|
.\" FIXME Above: Why is the 'copy' field used to return error values?
|
|
.\" This should be explained in the manual page.
|
|
If the value returned in
|
|
.I copy
|
|
doesn't match the value that was specified in
|
|
.IR len ,
|
|
the operation fails with the error
|
|
.BR EAGAIN .
|
|
The
|
|
.I copy
|
|
field is output-only;
|
|
it is not read by the
|
|
.B UFFDIO_COPY
|
|
operation.
|
|
.PP
|
|
This
|
|
.BR ioctl (2)
|
|
operation returns 0 on success.
|
|
In this case, the entire area was copied.
|
|
On error, \-1 is returned and
|
|
.I errno
|
|
is set to indicate the error.
|
|
Possible errors include:
|
|
.TP
|
|
.B EAGAIN
|
|
The number of bytes copied (i.e., the value returned in the
|
|
.I copy
|
|
field)
|
|
does not equal the value that was specified in the
|
|
.I len
|
|
field.
|
|
.TP
|
|
.B EINVAL
|
|
Either
|
|
.I dst
|
|
or
|
|
.I len
|
|
was not a multiple of the system page size, or the range specified by
|
|
.IR src
|
|
and
|
|
.IR len
|
|
or
|
|
.IR dst
|
|
and
|
|
.IR len
|
|
was invalid.
|
|
.TP
|
|
.B EINVAL
|
|
An invalid bit was specified in the
|
|
.IR mode
|
|
field.
|
|
.TP
|
|
.BR ENOENT " (since Linux 4.11)"
|
|
The faulting process has changed
|
|
its virtual memory layout simultaneously with an outstanding
|
|
.B UFFDIO_COPY
|
|
operation.
|
|
.TP
|
|
.BR ENOSPC " (from Linux 4.11 until Linux 4.13)"
|
|
The faulting process has exited at the time of a
|
|
.B UFFDIO_COPY
|
|
operation.
|
|
.TP
|
|
.BR ESRCH " (since Linux 4.13)"
|
|
The faulting process has exited at the time of a
|
|
.B UFFDIO_COPY
|
|
operation.
|
|
.\"
|
|
.SS UFFDIO_ZEROPAGE
|
|
(Since Linux 4.3.)
|
|
Zero out a memory range registered with userfaultfd.
|
|
.PP
|
|
The requested range is specified by the
|
|
.I range
|
|
field of the
|
|
.I uffdio_zeropage
|
|
structure pointed to by
|
|
.IR argp :
|
|
.PP
|
|
.in +4n
|
|
.EX
|
|
struct uffdio_zeropage {
|
|
struct uffdio_range range;
|
|
__u64 mode; /* Flags controlling behavior of copy */
|
|
__s64 zeropage; /* Number of bytes zeroed, or negated error */
|
|
};
|
|
.EE
|
|
.in
|
|
.PP
|
|
The following value may be bitwise ORed in
|
|
.IR mode
|
|
to change the behavior of the
|
|
.B UFFDIO_ZEROPAGE
|
|
operation:
|
|
.TP
|
|
.B UFFDIO_ZEROPAGE_MODE_DONTWAKE
|
|
Do not wake up the thread that waits for page-fault resolution.
|
|
.PP
|
|
The
|
|
.I zeropage
|
|
field is used by the kernel to return the number of bytes
|
|
that was actually zeroed,
|
|
or an error in the same manner as
|
|
.BR UFFDIO_COPY .
|
|
.\" FIXME Why is the 'zeropage' field used to return error values?
|
|
.\" This should be explained in the manual page.
|
|
If the value returned in the
|
|
.I zeropage
|
|
field doesn't match the value that was specified in
|
|
.IR range.len ,
|
|
the operation fails with the error
|
|
.BR EAGAIN .
|
|
The
|
|
.I zeropage
|
|
field is output-only;
|
|
it is not read by the
|
|
.B UFFDIO_ZEROPAGE
|
|
operation.
|
|
.PP
|
|
This
|
|
.BR ioctl (2)
|
|
operation returns 0 on success.
|
|
In this case, the entire area was zeroed.
|
|
On error, \-1 is returned and
|
|
.I errno
|
|
is set to indicate the error.
|
|
Possible errors include:
|
|
.TP
|
|
.B EAGAIN
|
|
The number of bytes zeroed (i.e., the value returned in the
|
|
.I zeropage
|
|
field)
|
|
does not equal the value that was specified in the
|
|
.I range.len
|
|
field.
|
|
.TP
|
|
.B EINVAL
|
|
Either
|
|
.I range.start
|
|
or
|
|
.I range.len
|
|
was not a multiple of the system page size; or
|
|
.I range.len
|
|
was zero; or the range specified was invalid.
|
|
.TP
|
|
.B EINVAL
|
|
An invalid bit was specified in the
|
|
.IR mode
|
|
field.
|
|
.TP
|
|
.BR ESRCH " (since Linux 4.13)"
|
|
The faulting process has exited at the time of a
|
|
.B UFFDIO_ZEROPAGE
|
|
operation.
|
|
.\"
|
|
.SS UFFDIO_WAKE
|
|
(Since Linux 4.3.)
|
|
Wake up the thread waiting for page-fault resolution on
|
|
a specified memory address range.
|
|
.PP
|
|
The
|
|
.B UFFDIO_WAKE
|
|
operation is used in conjunction with
|
|
.BR UFFDIO_COPY
|
|
and
|
|
.BR UFFDIO_ZEROPAGE
|
|
operations that have the
|
|
.BR UFFDIO_COPY_MODE_DONTWAKE
|
|
or
|
|
.BR UFFDIO_ZEROPAGE_MODE_DONTWAKE
|
|
bit set in the
|
|
.I mode
|
|
field.
|
|
The userfault monitor can perform several
|
|
.BR UFFDIO_COPY
|
|
and
|
|
.BR UFFDIO_ZEROPAGE
|
|
operations in a batch and then explicitly wake up the faulting thread using
|
|
.BR UFFDIO_WAKE .
|
|
.PP
|
|
The
|
|
.I argp
|
|
argument is a pointer to a
|
|
.I uffdio_range
|
|
structure (shown above) that specifies the address range.
|
|
.PP
|
|
This
|
|
.BR ioctl (2)
|
|
operation returns 0 on success.
|
|
On error, \-1 is returned and
|
|
.I errno
|
|
is set to indicate the error.
|
|
Possible errors include:
|
|
.TP
|
|
.B EINVAL
|
|
The
|
|
.I start
|
|
or the
|
|
.I len
|
|
field of the
|
|
.I ufdio_range
|
|
structure was not a multiple of the system page size; or
|
|
.I len
|
|
was zero; or the specified range was otherwise invalid.
|
|
.SH RETURN VALUE
|
|
See descriptions of the individual operations, above.
|
|
.SH ERRORS
|
|
See descriptions of the individual operations, above.
|
|
In addition, the following general errors can occur for all of the
|
|
operations described above:
|
|
.TP
|
|
.B EFAULT
|
|
.I argp
|
|
does not point to a valid memory address.
|
|
.TP
|
|
.B EINVAL
|
|
(For all operations except
|
|
.BR UFFDIO_API .)
|
|
The userfaultfd object has not yet been enabled (via the
|
|
.BR UFFDIO_API
|
|
operation).
|
|
.SH CONFORMING TO
|
|
These
|
|
.BR ioctl (2)
|
|
operations are Linux-specific.
|
|
.SH BUGS
|
|
In order to detect available userfault features and
|
|
enable some subset of those features
|
|
the userfaultfd file descriptor must be closed after the first
|
|
.BR UFFDIO_API
|
|
operation that queries features availability and reopened before
|
|
the second
|
|
.BR UFFDIO_API
|
|
operation that actually enables the desired features.
|
|
.SH EXAMPLES
|
|
See
|
|
.BR userfaultfd (2).
|
|
.SH SEE ALSO
|
|
.BR ioctl (2),
|
|
.BR mmap (2),
|
|
.BR userfaultfd (2)
|
|
.PP
|
|
.IR Documentation/admin\-guide/mm/userfaultfd.rst
|
|
in the Linux kernel source tree
|