mirror of https://github.com/mkerrisk/man-pages
bpf.2: Improvements after comments from Alexei Starovoitov
Plus various other improvements of my own.
Reported-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
parent
61baa4ff4e
commit
f774ddf14d
198
man2/bpf.2
|
@ -43,16 +43,12 @@ the kernel statically analyzes the programs before loading them,
|
|||
in order to ensure that they cannot harm the running system.
|
||||
.P
|
||||
eBPF extends cBPF in multiple ways, including the ability to call
|
||||
in-kernel helper functions (via the
|
||||
a fixed set of in-kernel helper functions
|
||||
.\" See 'enum bpf_func_id' in include/uapi/linux/bpf.h
|
||||
(via the
|
||||
.B BPF_CALL
|
||||
opcode extension provided by eBPF)
|
||||
and access shared data structures such as BPF maps.
|
||||
The programs can be written in a restricted C that is compiled into
|
||||
eBPF bytecode and executed on the in-kernel virtual machine or
|
||||
just-in-time compiled into native code.
|
||||
(Various features are omitted from this restricted C, such as loops,
|
||||
global variables, variadic functions, floating-point numbers,
|
||||
and passing structures as function arguments.)
|
||||
.SS Extended BPF Design/Architecture
|
||||
.P
|
||||
.\" FIXME In the following line, what does "different data types" mean?
|
||||
|
@ -60,6 +56,14 @@ and passing structures as function arguments.)
|
|||
BPF maps are a generic data structure for storage of different data types.
|
||||
A user process can create multiple maps (with key/value-pairs being
|
||||
opaque bytes of data) and access them via file descriptors.
|
||||
.\" FIXME What does the next sentence mean?
|
||||
.\" Isn't "from inside the kernel" redundant? (I mean: all eBPF programs
|
||||
.\" are running inside the kernel, right?)
|
||||
.\" And what does "in parallel" mean?
|
||||
.\" Would a simpler version of this sentence be correct? As in:
|
||||
.\" "Different eBPF programs can access the same maps in parallel."
|
||||
.\" ?
|
||||
.\" (Actually, the page already says something like that lower down.)
|
||||
eBPF programs can access maps from inside the kernel in parallel.
|
||||
It's up to the user process and eBPF program to decide what they store
|
||||
inside maps.
|
||||
|
@ -81,13 +85,11 @@ events, classification event by qdisc (for eBPF programs attached to a
|
|||
.BR tc (8)
|
||||
classifier), and other types that may be added in the future.
|
||||
A new event triggers execution of the eBPF program, which
|
||||
may store information about the event in the maps.
|
||||
may store information about the event in eBPF maps.
|
||||
Beyond storing data, eBPF programs may call a fixed set of
|
||||
in-kernel helper functions.
|
||||
The same program can be attached to multiple events and different
|
||||
The same eBPF program can be attached to multiple events and different
|
||||
eBPF programs can access the same map:
|
||||
.\" FIXME Can maps be shared between processes? (E.g., what happens
|
||||
.\" when fork() is called?)
|
||||
|
||||
.in +4n
|
||||
.nf
|
||||
|
@ -107,11 +109,24 @@ The operation to be performed by the
|
|||
.BR bpf ()
|
||||
system call is determined by the
|
||||
.IR cmd
|
||||
argument, which can be one of the following:
|
||||
argument.
|
||||
Each operation takes an accompanying argument,
|
||||
provided via
|
||||
.IR attr ,
|
||||
which is a pointer to a union of type
|
||||
.IR bpf_attr
|
||||
(see below).
|
||||
The
|
||||
.I size
|
||||
argument is the size of the union pointed to by
|
||||
.IR attr .
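
glibc provides no wrapper for
.BR bpf (),
so callers typically invoke it through
.BR syscall (2).
The following is a minimal sketch of the calling convention described above, assuming the <linux/bpf.h> UAPI header is installed. The helper names sys_bpf() and create_map() are illustrative assumptions and are not part of any library.

```c
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative wrapper (hypothetical name): invoke bpf(2) via syscall(2).
   Returns the raw result: -1 with errno set on failure. */
int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Illustrative helper (hypothetical name): fill in the BPF_MAP_CREATE
   member of the union and pass the size of the whole union.
   Returns a map file descriptor, or -1 with errno set. */
int create_map(enum bpf_map_type map_type, unsigned int key_size,
               unsigned int value_size, unsigned int max_entries)
{
    union bpf_attr attr;

    /* Zero the whole union so that fields the command does not use
       are not left holding stack garbage. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type = map_type;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;

    return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

Without the required privileges, create_map() is expected to fail and return \-1, so callers should check the result before using it as a file descriptor.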
|
||||
|
||||
The value provided in
|
||||
.IR cmd
|
||||
is one of the following:
|
||||
.TP
|
||||
.B BPF_MAP_CREATE
|
||||
Create a map with the specified type and attributes and return
|
||||
a file descriptor that refers to the map.
|
||||
Create a map and return a file descriptor that refers to the map.
|
||||
.TP
|
||||
.B BPF_MAP_LOOKUP_ELEM
|
||||
Look up an element by key in a specified map and return its value.
|
||||
|
@ -129,15 +144,6 @@ of the next element.
|
|||
.B BPF_PROG_LOAD
|
||||
Verify and load an eBPF program,
|
||||
returning a new file descriptor associated with the program.
|
||||
.PP
|
||||
The
|
||||
.I attr
|
||||
argument is a pointer to a union of type
|
||||
.IR bpf_attr
|
||||
(see below);
|
||||
.I size
|
||||
is the size of the union pointed to by
|
||||
.IR attr .
|
||||
.P
|
||||
The
|
||||
.I bpf_attr
|
||||
|
@ -156,7 +162,8 @@ union bpf_attr {
|
|||
in a map */
|
||||
};
|
||||
|
||||
struct { /* Used by BPF_MAP_*_ELEM commands */
|
||||
struct { /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
|
||||
commands */
|
||||
__u32 map_fd;
|
||||
__aligned_u64 key;
|
||||
union {
|
||||
|
@ -175,7 +182,7 @@ union bpf_attr {
|
|||
__u32 log_size; /* size of user buffer */
|
||||
__aligned_u64 log_buf; /* user supplied 'char *'
|
||||
buffer */
|
||||
__u32 kern_version;
|
||||
__u32 kern_version;
|
||||
/* checked when prog_type=kprobe
|
||||
(since Linux 4.1) */
|
||||
.\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
|
||||
|
@ -257,13 +264,13 @@ is calling
|
|||
.BR bpf_map_*_elem ()
|
||||
helper functions with a correctly initialized
|
||||
.I key
|
||||
and that the program doesn't access the map element
|
||||
and to check that the program doesn't access the map element
|
||||
.I value
|
||||
beyond the specified
|
||||
.IR value_size .
|
||||
For example, when a map is created with a
|
||||
.IR key_size
|
||||
of 8 and the program calls
|
||||
of 8 and the eBPF program calls
|
||||
|
||||
.in +4n
|
||||
.nf
|
||||
|
@ -285,7 +292,7 @@ starting address will cause out-of-bounds stack access.
|
|||
|
||||
Similarly, when a map is created with a
|
||||
.I value_size
|
||||
of 1 and the program calls
|
||||
of 1 and the eBPF program contains
|
||||
|
||||
.in +4n
|
||||
.nf
|
||||
|
@ -300,14 +307,13 @@ pointer beyond the specified 1 byte
|
|||
.I value_size
|
||||
limit.
|
||||
|
||||
Currently, two
|
||||
.I map_type
|
||||
are supported:
|
||||
Currently, the following values are supported for
|
||||
.IR map_type :
|
||||
|
||||
.in +4n
|
||||
.nf
|
||||
enum bpf_map_type {
|
||||
BPF_MAP_TYPE_UNSPEC, /* Reserve 0 as invalid map type */
|
||||
BPF_MAP_TYPE_UNSPEC, /* Reserve 0 as invalid map type */
|
||||
BPF_MAP_TYPE_HASH,
|
||||
BPF_MAP_TYPE_ARRAY,
|
||||
BPF_MAP_TYPE_PROG_ARRAY,
|
||||
|
@ -317,14 +323,95 @@ enum bpf_map_type {
|
|||
|
||||
.I map_type
|
||||
selects one of the available map implementations in the kernel.
|
||||
.\" FIXME We need an explanation of BPF_MAP_TYPE_HASH here
|
||||
.\" FIXME We need an explanation of BPF_MAP_TYPE_ARRAY here
|
||||
.\" FIXME We need an explanation of why one might choose HASH versus ARRAY
|
||||
.\" FIXME We need an explanation of why one might choose each of
|
||||
.\" these map implementations
|
||||
For all map types,
|
||||
programs access maps with the same
|
||||
.BR bpf_map_lookup_elem ()/
|
||||
eBPF programs access maps with the same
|
||||
.BR bpf_map_lookup_elem ()
|
||||
and
|
||||
.BR bpf_map_update_elem ()
|
||||
helper functions.
|
||||
|
||||
The map types are as follows:
|
||||
.RS
|
||||
.TP
|
||||
.B BPF_MAP_TYPE_HASH
|
||||
.\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475
|
||||
.\" FIXME Please review the following list of points, which draws
|
||||
.\" heavily from the commit message, but reworks the text significantly
|
||||
.\" and so may have introduced errors.
|
||||
Hash-table BPF maps have the following characteristics:
|
||||
.RS
|
||||
.IP * 3
|
||||
Maps are created and destroyed by user-space programs.
|
||||
Both user-space and eBPF programs
|
||||
can perform lookup, update, and delete operations.
|
||||
.IP *
|
||||
The kernel takes care of allocating and freeing key/value pairs.
|
||||
.IP *
|
||||
The
|
||||
.BR map_update_elem ()
|
||||
helper will fail to insert a new element when the
|
||||
.I max_entries
|
||||
limit is reached.
|
||||
(This ensures that eBPF programs cannot exhaust memory.)
|
||||
.IP *
|
||||
.BR map_update_elem ()
|
||||
replaces existing elements atomically.
|
||||
.RE
|
||||
.IP
|
||||
Hash-table maps are
|
||||
optimized for speed of lookup.
|
||||
.TP
|
||||
.B BPF_MAP_TYPE_ARRAY
|
||||
.\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3
|
||||
.\" FIXME Please review the following list of points, which draws
|
||||
.\" heavily from the commit message, but reworks the text significantly
|
||||
.\" and so may have introduced errors.
|
||||
Array BPF maps have the following characteristics:
|
||||
.RS
|
||||
.IP * 3
|
||||
Optimized for fastest possible lookup.
|
||||
In the future the verifier/JIT compiler
|
||||
may recognize lookup() operations that employ a constant key
|
||||
and optimize them into a constant pointer.
|
||||
It is possible to optimize a non-constant
|
||||
key into direct pointer arithmetic as well, since pointers and
|
||||
.I value_size
|
||||
are constant for the life of the eBPF program.
|
||||
In other words,
|
||||
.BR array_map_lookup_elem ()
|
||||
may be 'inlined' by the verifier/JIT compiler
|
||||
while preserving concurrent access to this map from user space.
|
||||
.IP *
|
||||
All array elements are preallocated and zero initialized at init time.
|
||||
.IP *
|
||||
The key is an array index, and must be exactly four bytes.
|
||||
.IP *
|
||||
.BR map_delete_elem ()
|
||||
fails with the error
|
||||
.BR EINVAL ,
|
||||
since elements cannot be deleted.
|
||||
.IP *
|
||||
.BR map_update_elem ()
|
||||
replaces elements in a non-atomic fashion;
|
||||
for atomic updates, a hash-table map should be used instead.
|
||||
.RE
|
||||
.IP
|
||||
Among the uses for array maps are the following:
|
||||
.RS
|
||||
.IP * 3
|
||||
As "global" eBPF variables: an array of 1 element whose key is (index) 0
|
||||
and where the value is a collection of 'global' variables which
|
||||
eBPF programs can use to keep state between events.
|
||||
.IP *
|
||||
Aggregation of tracing events into a fixed set of buckets.
|
||||
.RE
|
||||
.TP
|
||||
.BR BPF_MAP_TYPE_PROG_ARRAY " (since Linux 4.2)"
|
||||
.\" FIXME: we need documentation of BPF_MAP_TYPE_PROG_ARRAY
|
||||
[To be completed]
|
||||
.RE
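
The "global variable" use of an array map can be sketched from the user-space side as follows. This is a minimal illustration, not a definitive implementation: it assumes the <linux/bpf.h> UAPI header is installed, and the helper names (sys_bpf(), create_counter_map(), read_counter(), write_counter()) are hypothetical. The map has a single element at index 0, the key is a four-byte index, and the value is expected to read as zero until it is first updated.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative wrapper (hypothetical name) around the raw system call. */
int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Create a one-element array map holding a single 64-bit "global".
   Returns a map file descriptor, or -1 with errno set. */
int create_counter_map(void)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_ARRAY;
    attr.key_size = sizeof(uint32_t);    /* array keys are 4-byte indexes */
    attr.value_size = sizeof(uint64_t);
    attr.max_entries = 1;

    return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

/* Read element 0 into *value. Returns 0 on success, -1 on error. */
int read_counter(int map_fd, uint64_t *value)
{
    union bpf_attr attr;
    uint32_t key = 0;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key = (uint64_t) (unsigned long) &key;
    attr.value = (uint64_t) (unsigned long) value;

    return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}

/* Overwrite element 0 with value. Returns 0 on success, -1 on error. */
int write_counter(int map_fd, uint64_t value)
{
    union bpf_attr attr;
    uint32_t key = 0;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key = (uint64_t) (unsigned long) &key;
    attr.value = (uint64_t) (unsigned long) &value;
    attr.flags = BPF_ANY;                /* create or replace */

    return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
```

An eBPF program attached to an event would update the same element through its own helper functions, while a user-space process polls it with read_counter().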
|
||||
.TP
|
||||
.B BPF_MAP_LOOKUP_ELEM
|
||||
The
|
||||
|
@ -521,9 +608,6 @@ Delete the map referred to by the file descriptor
|
|||
.IR map_fd .
|
||||
When the user-space program that created a map exits, all maps will
|
||||
be deleted automatically.
|
||||
.\" FIXME What are the semantics when a file descriptor is duplicated
|
||||
.\" (dup() etc.)? (I.e., when is a map deallocated automatically?)
|
||||
.\"
|
||||
.SS BPF programs
|
||||
.TP 4
|
||||
.B BPF_PROG_LOAD
|
||||
|
@ -532,8 +616,6 @@ The
|
|||
command is used to load an eBPF program into the kernel.
|
||||
The return value for this command is a new file descriptor associated
|
||||
with this program.
|
||||
.\" FIXME What is the effect of closing this FD?
|
||||
.\" FIXME What is the effect of duplicating this FD?
|
||||
|
||||
.in +4n
|
||||
.nf
|
||||
|
@ -565,11 +647,12 @@ is one of the available program types:
|
|||
.in +4n
|
||||
.nf
|
||||
enum bpf_prog_type {
|
||||
BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid
|
||||
program type */
|
||||
BPF_PROG_TYPE_SOCKET_FILTER,
|
||||
BPF_PROG_TYPE_SCHED_CLS,
|
||||
.\" FIXME BPF_PROG_TYPE_SCHED_CLS appears not to exist?
|
||||
BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid
|
||||
program type */
|
||||
BPF_PROG_TYPE_SOCKET_FILTER, /* Since Linux 3.19 */
|
||||
BPF_PROG_TYPE_KPROBE, /* Since Linux 4.1 */
|
||||
BPF_PROG_TYPE_SCHED_CLS, /* Since Linux 4.1 */
|
||||
BPF_PROG_TYPE_SCHED_ACT, /* Since Linux 4.1 */
|
||||
};
|
||||
.fi
|
||||
.in
|
||||
|
@ -901,6 +984,27 @@ In the current implementation, all
|
|||
commands require the caller to have the
|
||||
.B CAP_SYS_ADMIN
|
||||
capability.
|
||||
|
||||
.\" FIXME Alexei, is the following correct?
|
||||
eBPF objects (maps and programs) can be shared between processes.
|
||||
For example, after
|
||||
.BR fork (2),
|
||||
the child inherits file descriptors referring to the same eBPF objects.
|
||||
In addition, file descriptors referring to eBPF objects can be
|
||||
transferred over UNIX domain sockets.
|
||||
File descriptors referring to eBPF objects can be duplicated
|
||||
in the usual way, using
|
||||
.BR dup (2)
|
||||
and similar calls.
|
||||
An eBPF object is deallocated only after all file descriptors
|
||||
referring to the object have been closed.
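
The lifetime rule above is the ordinary file descriptor rule, so it can be illustrated without any eBPF-specific calls. The sketch below is an assumption-laden stand-in: it works on any file descriptor (for example, one end of a pipe), because creating a real map requires privileges. The helper names fd_is_open() and move_fd() are hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Return 1 if fd is currently an open file descriptor, 0 otherwise. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/* Duplicate fd, then close the original.
   Returns the duplicate, or -1 if dup(2) fails (fd is then left open).
   The underlying object stays alive across the close() because the
   duplicate still refers to it. */
int move_fd(int fd)
{
    int copy = dup(fd);

    if (copy == -1)
        return -1;
    close(fd);
    return copy;
}
```

Applied to a file descriptor returned by BPF_MAP_CREATE or BPF_PROG_LOAD, the same pattern would keep the eBPF object allocated until the duplicate is closed as well.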
|
||||
|
||||
eBPF programs can be written in a restricted C that is compiled into
|
||||
eBPF bytecode and executed on the in-kernel virtual machine or
|
||||
just-in-time compiled into native code.
|
||||
(Various features are omitted from this restricted C, such as loops,
|
||||
global variables, variadic functions, floating-point numbers,
|
||||
and passing structures as function arguments.)
|
||||
.SH SEE ALSO
|
||||
.BR seccomp (2),
|
||||
.BR socket (7),
|
||||