bpf.2: Improvements after comments from Alexei Starovoitov

Plus various other improvements of my own.

Reported-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Michael Kerrisk 2015-07-22 14:53:08 +02:00
parent 61baa4ff4e
commit f774ddf14d
1 changed files with 151 additions and 47 deletions

@ -43,16 +43,12 @@ the kernel statically analyzes the programs before loading them,
in order to ensure that they cannot harm the running system.
.P
eBPF extends cBPF in multiple ways, including the ability to call
in-kernel helper functions (via the
a fixed set of in-kernel helper functions
.\" See 'enum bpf_func_id' in include/uapi/linux/bpf.h
(via the
.B BPF_CALL
opcode extension provided by eBPF)
and access shared data structures such as BPF maps.
The programs can be written in a restricted C that is compiled into
eBPF bytecode and executed on the in-kernel virtual machine or
just-in-time compiled into native code.
(Various features are omitted from this restricted C, such as loops,
global variables, variadic functions, floating-point numbers,
and passing structures as function arguments.)
.SS Extended BPF Design/Architecture
.P
.\" FIXME In the following line, what does "different data types" mean?
@ -60,6 +56,14 @@ and passing structures as function arguments.)
BPF maps are a generic data structure for storage of different data types.
A user process can create multiple maps (with key/value-pairs being
opaque bytes of data) and access them via file descriptors.
.\" FIXME What does the next sentence mean?
.\" Isn't "from inside the kernel" redundant? (I mean: all eBPF programs
.\" are running inside the kernel, right?)
.\" And what does "in parallel" mean?
.\" Would a simpler version of this sentence be correct? As in:
.\" "Different eBPF programs can access the same maps in parallel."
.\" ?
.\" (Actually, the page already says something like that lower down.)
eBPF programs can access maps from inside the kernel in parallel.
It's up to the user process and eBPF program to decide what they store
inside maps.
@ -81,13 +85,11 @@ events, classification event by qdisc (for eBPF programs attached to a
.BR tc (8)
classifier), and other types that may be added in the future.
A new event triggers execution of the eBPF program, which
may store information about the event in the maps.
may store information about the event in eBPF maps.
Beyond storing data, eBPF programs may call a fixed set of
in-kernel helper functions.
The same program can be attached to multiple events and different
The same eBPF program can be attached to multiple events and different
eBPF programs can access the same map:
.\" FIXME Can maps be shared between processes? (E.g., what happens
.\" when fork() is called?)
.in +4n
.nf
@ -107,11 +109,24 @@ The operation to be performed by the
.BR bpf ()
system call is determined by the
.IR cmd
argument, which can be one of the following:
argument.
Each operation takes an accompanying argument,
provided via
.IR attr ,
which is a pointer to a union of type
.IR bpf_attr
(see below).
The
.I size
argument is the size of the union pointed to by
.IR attr .
The value provided in
.IR cmd
is one of the following:
.TP
.B BPF_MAP_CREATE
Create a map with the specified type and attributes and return
a file descriptor that refers to the map.
Create a map and return a file descriptor that refers to the map.
.TP
.B BPF_MAP_LOOKUP_ELEM
Look up an element by key in a specified map and return its value.
@ -129,15 +144,6 @@ of the next element.
.B BPF_PROG_LOAD
Verify and load an eBPF program,
returning a new file descriptor associated with the program.
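As a rough illustration of how the cmd, attr, and size arguments fit together, the following sketch issues BPF_MAP_CREATE through syscall(2).
The names sketch_bpf() and sketch_map_create() are hypothetical and are not part of any library; the sketch also assumes that the installed kernel headers provide <linux/bpf.h> and __NR_bpf.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper (not a library function): invoke the bpf()
   system call directly via syscall(2). */
static int
sketch_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    assert(attr != NULL);   /* this sketch always passes a union */
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: request a new map of the given type.
   Returns a file descriptor on success, or -1 with errno set. */
static int
sketch_map_create(enum bpf_map_type map_type, unsigned int key_size,
                  unsigned int value_size, unsigned int max_entries)
{
    union bpf_attr attr;

    /* Zero the whole union so that the fields this command
       does not use are cleared. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type = map_type;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;

    /* 'size' is the size of the union that 'attr' points to. */
    return sketch_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

A caller without the required capability should expect the call to fail, so the returned value and errno need to be checked before the descriptor is used.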
.PP
The
.I attr
argument is a pointer to a union of type
.IR bpf_attr
(see below);
.I size
is the size of the union pointed to by
.IR attr .
.P
The
.I bpf_attr
@ -156,7 +162,8 @@ union bpf_attr {
in a map */
};
struct { /* Used by BPF_MAP_*_ELEM commands */
struct { /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
commands */
__u32 map_fd;
__aligned_u64 key;
union {
@ -175,7 +182,7 @@ union bpf_attr {
__u32 log_size; /* size of user buffer */
__aligned_u64 log_buf; /* user supplied 'char *'
buffer */
__u32 kern_version;
__u32 kern_version;
/* checked when prog_type=kprobe
(since Linux 4.1) */
.\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5
@ -257,13 +264,13 @@ is calling
.BR bpf_map_*_elem ()
helper functions with a correctly initialized
.I key
and that the program doesn't access the map element
and to check that the program doesn't access the map element
.I value
beyond the specified
.IR value_size .
For example, when a map is created with a
.IR key_size
of 8 and the program calls
of 8 and the eBPF program calls
.in +4n
.nf
@ -285,7 +292,7 @@ starting address will cause out-of-bounds stack access.
Similarly, when a map is created with a
.I value_size
of 1 and the program calls
of 1 and the eBPF program contains
.in +4n
.nf
@ -300,14 +307,13 @@ pointer beyond the specified 1 byte
.I value_size
limit.
Currently, two
.I map_type
are supported:
Currently, the following values are supported for
.IR map_type :
.in +4n
.nf
enum bpf_map_type {
BPF_MAP_TYPE_UNSPEC, /* Reserve 0 as invalid map type */
BPF_MAP_TYPE_UNSPEC, /* Reserve 0 as invalid map type */
BPF_MAP_TYPE_HASH,
BPF_MAP_TYPE_ARRAY,
BPF_MAP_TYPE_PROG_ARRAY,
@ -317,14 +323,95 @@ enum bpf_map_type {
.I map_type
selects one of the available map implementations in the kernel.
.\" FIXME We need an explanation of BPF_MAP_TYPE_HASH here
.\" FIXME We need an explanation of BPF_MAP_TYPE_ARRAY here
.\" FIXME We need an explanation of why one might choose HASH versus ARRAY
.\" FIXME We need an explanation of why one might choose each of
.\" these map implementations
For all map types,
programs access maps with the same
.BR bpf_map_lookup_elem ()/
eBPF programs access maps with the same
.BR bpf_map_lookup_elem ()
and
.BR bpf_map_update_elem ()
helper functions.
The map types are as follows:
.RS
.TP
.B BPF_MAP_TYPE_HASH
.\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475
.\" FIXME Please review the following list of points, which draws
.\" heavily from the commit message, but reworks the text significantly
.\" and so may have introduced errors.
Hash-table BPF maps have the following characteristics:
.RS
.IP * 3
Maps are created and destroyed by user-space programs.
Both user-space and eBPF programs
can perform lookup, update, and delete operations.
.IP *
The kernel takes care of allocating and freeing key/value pairs.
.IP *
The
.BR map_update_elem ()
helper will fail to insert a new element when the
.I max_entries
limit is reached.
(This ensures that eBPF programs cannot exhaust memory.)
.IP *
.BR map_update_elem ()
replaces existing elements atomically.
.RE
.IP
Hash-table maps are
optimized for speed of lookup.
.TP
.B BPF_MAP_TYPE_ARRAY
.\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3
.\" FIXME Please review the following list of points, which draws
.\" heavily from the commit message, but reworks the text significantly
.\" and so may have introduced errors.
Array BPF maps have the following characteristics:
.RS
.IP * 3
Optimized for fastest possible lookup.
In the future, the verifier/JIT compiler
may recognize lookup() operations that employ a constant key
and optimize them into a constant pointer.
It is possible to optimize a non-constant
key into direct pointer arithmetic as well, since pointers and
.I value_size
are constant for the life of the eBPF program.
In other words,
.BR array_map_lookup_elem ()
may be 'inlined' by the verifier/JIT compiler
while preserving concurrent access to this map from user space.
.IP *
All array elements are preallocated and zero-initialized at init time.
.IP *
The key is an array index, and must be exactly four bytes.
.IP *
.BR map_delete_elem ()
fails with the error
.BR EINVAL ,
since elements cannot be deleted.
.IP *
.BR map_update_elem ()
replaces elements in a non-atomic fashion;
for atomic updates, a hash-table map should be used instead.
.RE
.IP
Among the uses for array maps are the following:
.RS
.IP * 3
As "global" eBPF variables: an array of 1 element whose key is (index) 0
and where the value is a collection of 'global' variables which
eBPF programs can use to keep state between events.
.IP *
Aggregation of tracing events into a fixed set of buckets.
.RE
.TP
.BR BPF_MAP_TYPE_PROG_ARRAY " (since Linux 4.2)"
.\" FIXME: we need documentation of BPF_MAP_TYPE_PROG_ARRAY
[To be completed]
.RE
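The array-map rules listed above can be mimicked with a small user-space model.
This is only a sketch and not kernel code: every toy_array_*() name is hypothetical, and the E2BIG error for an out-of-range update is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space model of an array map. */
struct toy_array_map {
    uint32_t value_size;
    uint32_t max_entries;
    unsigned char *values;    /* max_entries * value_size bytes */
};

/* The key is an array index and must be exactly four bytes;
   all elements are allocated and zeroed at creation time. */
static struct toy_array_map *
toy_array_create(uint32_t key_size, uint32_t value_size,
                 uint32_t max_entries)
{
    struct toy_array_map *map;

    if (key_size != sizeof(uint32_t) || value_size == 0 ||
        max_entries == 0) {
        errno = EINVAL;
        return NULL;
    }
    map = malloc(sizeof(*map));
    if (map == NULL)
        return NULL;
    map->values = calloc(max_entries, value_size);
    if (map->values == NULL) {
        free(map);
        return NULL;
    }
    map->value_size = value_size;
    map->max_entries = max_entries;
    return map;
}

/* Lookup is plain pointer arithmetic on the index. */
static void *
toy_array_lookup(struct toy_array_map *map, const uint32_t *key)
{
    assert(map != NULL && key != NULL);
    if (*key >= map->max_entries)
        return NULL;
    return map->values + (size_t) *key * map->value_size;
}

/* Update overwrites the slot in place (not atomically). */
static int
toy_array_update(struct toy_array_map *map, const uint32_t *key,
                 const void *value)
{
    void *slot = toy_array_lookup(map, key);

    if (slot == NULL) {
        errno = E2BIG;    /* assumption made by this model */
        return -1;
    }
    memcpy(slot, value, map->value_size);
    return 0;
}

/* Elements cannot be deleted, so delete always fails with EINVAL. */
static int
toy_array_delete(struct toy_array_map *map, const uint32_t *key)
{
    (void) map;
    (void) key;
    errno = EINVAL;
    return -1;
}

static void
toy_array_free(struct toy_array_map *map)
{
    free(map->values);
    free(map);
}
```

A one-element map created with toy_array_create(4, sizeof(uint64_t), 1) and key 0 corresponds to the "global variables" use case: the single value starts out as zero and keeps its contents between updates.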
.TP
.B BPF_MAP_LOOKUP_ELEM
The
@ -521,9 +608,6 @@ Delete the map referred to by the file descriptor
.IR map_fd .
When the user-space program that created a map exits, all maps will
be deleted automatically.
.\" FIXME What are the semantics when a file descriptor is duplicated
.\" (dup() etc.)? (I.e., when is a map deallocated automatically?)
.\"
.SS BPF programs
.TP 4
.B BPF_PROG_LOAD
@ -532,8 +616,6 @@ The
command is used to load an eBPF program into the kernel.
The return value for this command is a new file descriptor associated
with this program.
.\" FIXME What is the effect of closing this FD?
.\" FIXME What is the effect of duplicating this FD?
.in +4n
.nf
@ -565,11 +647,12 @@ is one of the available program types:
.in +4n
.nf
enum bpf_prog_type {
BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid
program type */
BPF_PROG_TYPE_SOCKET_FILTER,
BPF_PROG_TYPE_SCHED_CLS,
.\" FIXME BPF_PROG_TYPE_SCHED_CLS appears not to exist?
BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid
program type */
BPF_PROG_TYPE_SOCKET_FILTER, /* Since Linux 3.19 */
BPF_PROG_TYPE_KPROBE, /* Since Linux 4.1 */
BPF_PROG_TYPE_SCHED_CLS, /* Since Linux 4.1 */
BPF_PROG_TYPE_SCHED_ACT, /* Since Linux 4.1 */
};
.fi
.in
@ -901,6 +984,27 @@ In the current implementation, all
commands require the caller to have the
.B CAP_SYS_ADMIN
capability.
.\" FIXME Alexei, is the following correct?
eBPF objects (maps and programs) can be shared between processes.
For example, after
.BR fork (2),
the child inherits file descriptors referring to the same eBPF objects.
In addition, file descriptors referring to eBPF objects can be
transferred over UNIX domain sockets.
File descriptors referring to eBPF objects can be duplicated
in the usual way, using
.BR dup (2)
and similar calls.
An eBPF object is deallocated only after all file descriptors
referring to the object have been closed.
eBPF programs can be written in a restricted C that is compiled into
eBPF bytecode and executed on the in-kernel virtual machine or
just-in-time compiled into native code.
(Various features are omitted from this restricted C, such as loops,
global variables, variadic functions, floating-point numbers,
and passing structures as function arguments.)
.SH SEE ALSO
.BR seccomp (2),
.BR socket (7),