bpf.2: Improvements after comments from Alexei Starovoitov

Plus various other improvements of my own.

Reported-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2015-07-22 14:53:08 +02:00 · 2015-07-22 14:53:08 +02:00 · f774ddf14d
parent 61baa4ff4e
commit f774ddf14d
1 changed file with 151 additions and 47 deletions
--- a/man2/bpf.2
+++ b/man2/bpf.2
@@ -43,16 +43,12 @@ the kernel statically analyzes the programs before loading them,
 in order to ensure that they cannot harm the running system.
 .P
 eBPF extends cBPF in multiple ways, including the ability to call
-in-kernel helper functions (via the
+a fixed set of in-kernel helper functions
+.\" See 'enum bpf_func_id' in include/uapi/linux/bpf.h
+(via the
 .B BPF_CALL
 opcode extension provided by eBPF)
 and access shared data structures such as BPF maps.
-The programs can be written in a restricted C that is compiled into
-eBPF bytecode and executed on the in-kernel virtual machine or
-just-in-time compiled into native code.
-(Various features are omitted from this restricted C, such as loops, 
-global variables, variadic functions, floating-point numbers,
-and passing structures as function arguments.)
 .SS Extended BPF Design/Architecture
 .P
 .\" FIXME In the following line, what does "different data types" mean?
@@ -60,6 +56,14 @@ and passing structures as function arguments.)
 BPF maps are a generic data structure for storage of different data types.
 A user process can create multiple maps (with key/value-pairs being
 opaque bytes of data) and access them via file descriptors.
+.\" FIXME What does the next sentence mean?
+.\" Isn't "from inside the kernel" redundant? (I mean: all eBPF programs
+.\" are running inside the kernel, right?)
+.\" And what does "in parallel" mean?
+.\" Would a simpler version of this sentence be correct? As in:
+.\"     "Different eBPF programs can access the same maps in parallel."
+.\" ?
+.\" (Actually, the page already says something like that lower down.)
 eBPF programs can access maps from inside the kernel in parallel.
 It's up to the user process and eBPF program to decide what they store
 inside maps.
@@ -81,13 +85,11 @@ events, classification event by qdisc (for eBPF programs attached to a
 .BR tc (8)
 classifier), and other types that may be added in the future.
 A new event triggers execution of the eBPF program, which
-may store information about the event in the maps.
+may store information about the event in eBPF maps.
 Beyond storing data, eBPF programs may call a fixed set of
 in-kernel helper functions.
-The same program can be attached to multiple events and different
+The same eBPF program can be attached to multiple events and different
 eBPF programs can access the same map:
-.\" FIXME Can maps be shared between processes? (E.g., what happens
-.\"       when fork() is called?)

 .in +4n
 .nf
@@ -107,11 +109,24 @@ The operation to be performed by the
 .BR bpf ()
 system call is determined by the
 .IR cmd
-argument, which can be one of the following:
+argument.
+Each operation takes an accompanying argument,
+provided via
+.IR attr ,
+which is a pointer to a union of type
+.IR bpf_attr
+(see below).
+The
+.I size
+argument is the size of the union pointed to by
+.IR attr .
+
+The value provided in
+.IR cmd
+is one of the following:
 .TP
 .B BPF_MAP_CREATE
-Create a map with the specified type and attributes and return 
-a file descriptor that refers to the map.
+Create a map and return a file descriptor that refers to the map.
 .TP
 .B BPF_MAP_LOOKUP_ELEM
 Look up an element by key in a specified map and return its value.
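
Reviewer sketch (not part of the patch): the hunk above describes how each bpf() operation takes cmd, a pointer to a union bpf_attr, and the size of that union. The C fragment below is a minimal illustration of that calling convention for BPF_MAP_CREATE. The names sys_bpf(), fill_map_create_attr() and create_hash_map() are hypothetical helpers introduced here, and the key size, value size and entry count are arbitrary. It assumes the kernel UAPI header <linux/bpf.h> is installed and that the C library defines __NR_bpf.

```c
#include <linux/bpf.h>      /* union bpf_attr, BPF_MAP_CREATE, BPF_MAP_TYPE_* */
#include <string.h>
#include <sys/syscall.h>    /* __NR_bpf */
#include <unistd.h>         /* syscall(), close() */

/* Hypothetical wrapper: invoke the bpf() system call through syscall(2). */
static long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: fill the union for BPF_MAP_CREATE.
   The whole union is zeroed first so that unused fields are 0. */
static void fill_map_create_attr(union bpf_attr *attr, unsigned int map_type,
                                 unsigned int key_size,
                                 unsigned int value_size,
                                 unsigned int max_entries)
{
    memset(attr, 0, sizeof(*attr));
    attr->map_type = map_type;
    attr->key_size = key_size;
    attr->value_size = value_size;
    attr->max_entries = max_entries;
}

/* Hypothetical example: try to create a small hash map.
   Returns a file descriptor on success, or -1 with errno set
   (for example when the caller lacks the required capability). */
static int create_hash_map(void)
{
    union bpf_attr attr;

    fill_map_create_attr(&attr, BPF_MAP_TYPE_HASH,
                         sizeof(int), sizeof(long long), 64);
    /* cmd selects the operation; size is the size of the union. */
    return (int) sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

A caller would check the result of create_hash_map() against -1 and close() the descriptor when the map is no longer needed.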
@@ -129,15 +144,6 @@ of the next element.
 .B BPF_PROG_LOAD
 Verify and load an eBPF program,
 returning a new file descriptor associated with the program.
-.PP
-The
-.I attr
-argument is a pointer to a union of type
-.IR bpf_attr
-(see below);
-.I size
-is the size of the union pointed to by
-.IR attr .
 .P
 The
 .I bpf_attr
@@ -156,7 +162,8 @@ union bpf_attr {
                                      in a map */
    };

-    struct {    /* Used by BPF_MAP_*_ELEM commands */
+    struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
+                   commands */
        __u32         map_fd;
        __aligned_u64 key;
        union {
@@ -175,7 +182,7 @@ union bpf_attr {
        __u32         log_size;   /* size of user buffer */
        __aligned_u64 log_buf;    /* user supplied 'char *'
                                     buffer */
-	__u32         kern_version;
+        __u32         kern_version;
                                  /* checked when prog_type=kprobe
                                     (since Linux 4.1) */
 .\"                 commit 2541517c32be2531e0da59dfd7efc1ce844644f5
@@ -257,13 +264,13 @@ is calling
 .BR bpf_map_*_elem ()
 helper functions with a correctly initialized
 .I key
-and that the program doesn't access the map element
+and to check that the program doesn't access the map element
 .I value
 beyond the specified
 .IR value_size .
 For example, when a map is created with a
 .IR key_size
-of 8 and the program calls
+of 8 and the eBPF program calls

 .in +4n
 .nf
@@ -285,7 +292,7 @@ starting address will cause out-of-bounds stack access.

 Similarly, when a map is created with a
 .I value_size
-of 1 and the program calls
+of 1 and the eBPF program contains

 .in +4n
 .nf
@@ -300,14 +307,13 @@ pointer beyond the specified 1 byte
 .I value_size
 limit.

-Currently, two
-.I map_type
-are supported:
+Currently, the following values are supported for
+.IR map_type :

 .in +4n
 .nf
 enum bpf_map_type {
-    BPF_MAP_TYPE_UNSPEC,    /* Reserve 0 as invalid map type */
+    BPF_MAP_TYPE_UNSPEC,      /* Reserve 0 as invalid map type */
    BPF_MAP_TYPE_HASH,
    BPF_MAP_TYPE_ARRAY,
    BPF_MAP_TYPE_PROG_ARRAY,
@@ -317,14 +323,95 @@ enum bpf_map_type {

 .I map_type
 selects one of the available map implementations in the kernel.
-.\" FIXME We need an explanation of BPF_MAP_TYPE_HASH here
-.\" FIXME We need an explanation of BPF_MAP_TYPE_ARRAY here
-.\" FIXME We need an explanation of why one might choose HASH versus ARRAY
+.\" FIXME We need an explanation of why one might choose each of
+.\"       these map implementations
 For all map types,
-programs access maps with the same
-.BR bpf_map_lookup_elem ()/
+eBPF programs access maps with the same
+.BR bpf_map_lookup_elem ()
+and
 .BR bpf_map_update_elem ()
 helper functions.
+
+The map types are as follows:
+.RS
+.TP
+.B BPF_MAP_TYPE_HASH
+.\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475
+.\" FIXME Please review the following list of points, which draws
+.\" heavily from the commit message, but reworks the text significantly
+.\" and so may have introduced errors.
+Hash-table BPF maps have the following characteristics:
+.RS
+.IP * 3
+Maps are created and destroyed by user-space programs.
+Both user-space and eBPF programs
+can perform lookup, update, and delete operations.
+.IP *
+The kernel takes care of allocating and freeing key/value pairs.
+.IP *
+The
+.BR map_update_elem ()
+helper will fail to insert a new element when the
+.I max_entries
+limit is reached.
+(This ensures that eBPF programs cannot exhaust memory.)
+.IP *
+.BR map_update_elem ()
+replaces existing elements atomically.
+.RE
+.IP
+Hash-table maps are 
+optimized for speed of lookup.
+.TP
+.B BPF_MAP_TYPE_ARRAY
+.\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3
+.\" FIXME Please review the following list of points, which draws
+.\" heavily from the commit message, but reworks the text significantly
+.\" and so may have introduced errors.
+Array BPF maps have the following characteristics:
+.RS
+.IP * 3
+Optimized for fastest possible lookup.
+In the future the verifier/JIT compiler
+may recognize lookup() operations that employ a constant key
+and optimize them into a constant pointer.
+It is possible to optimize a non-constant
+key into direct pointer arithmetic as well, since pointers and
+.I value_size
+are constant for the life of the eBPF program.
+In other words,
+.BR array_map_lookup_elem ()
+may be 'inlined' by the verifier/JIT compiler
+while preserving concurrent access to this map from user space.
+.IP *
+All array elements are preallocated and zero-initialized at init time.
+.IP *
+The key is an array index, and must be exactly four bytes.
+.IP *
+.BR map_delete_elem ()
+fails with the error
+.BR EINVAL ,
+since elements cannot be deleted.
+.IP *
+.BR map_update_elem ()
+replaces elements in a nonatomic fashion;
+for atomic updates, a hash-table map should be used instead.
+.RE
+.IP
+Among the uses for array maps are the following:
+.RS
+.IP * 3
+As "global" eBPF variables: an array of 1 element whose key is (index) 0
+and where the value is a collection of 'global' variables which
+eBPF programs can use to keep state between events.
+.IP *
+Aggregation of tracing events into a fixed set of buckets.
+.RE
+.TP
+.BR BPF_MAP_TYPE_PROG_ARRAY " (since Linux 4.2)"
+.\" FIXME: we need documentation of BPF_MAP_TYPE_PROG_ARRAY
+[To be completed]
+.RE
 .TP
 .B BPF_MAP_LOOKUP_ELEM
 The
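
Reviewer sketch (not part of the patch): the array-map characteristics added in the hunk above (elements preallocated and zeroed, a four-byte index as key, deletion failing with EINVAL, nonatomic updates) can be modeled in plain user-space C. This is a toy model for illustration only, not the kernel implementation. Every toy_array_* name is hypothetical, and the error value used for an out-of-range index is a choice made for this sketch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy user-space model of an array map; NOT kernel code. */
struct toy_array_map {
    uint32_t value_size;
    uint32_t max_entries;
    unsigned char *values;  /* max_entries * value_size bytes */
};

/* All elements are allocated and zero-initialized up front. */
static int toy_array_init(struct toy_array_map *m, uint32_t value_size,
                          uint32_t max_entries)
{
    m->values = calloc(max_entries, value_size);
    if (m->values == NULL)
        return -ENOMEM;
    m->value_size = value_size;
    m->max_entries = max_entries;
    return 0;
}

/* The key is a four-byte array index, so lookup is pointer arithmetic. */
static void *toy_array_lookup(const struct toy_array_map *m, uint32_t key)
{
    if (key >= m->max_entries)
        return NULL;
    return m->values + (size_t) key * m->value_size;
}

/* Update overwrites the slot with a plain copy (not atomic). */
static int toy_array_update(struct toy_array_map *m, uint32_t key,
                            const void *value)
{
    void *slot = toy_array_lookup(m, key);

    if (slot == NULL)
        return -E2BIG;      /* error value chosen for this sketch */
    memcpy(slot, value, m->value_size);
    return 0;
}

/* Elements cannot be deleted, so deletion always fails with EINVAL. */
static int toy_array_delete(struct toy_array_map *m, uint32_t key)
{
    (void) m;
    (void) key;
    return -EINVAL;
}
```

The "global variables" use mentioned in the hunk corresponds to a map with max_entries of 1 that is always looked up with key 0.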
@@ -521,9 +608,6 @@ Delete the map referred to by the file descriptor
 .IR map_fd .
 When the user-space program that created a map exits, all maps will
 be deleted automatically.
-.\" FIXME What are the semantics when a file descriptor is duplicated
-.\"       (dup() etc.)? (I.e., when is a map deallocated automatically?)
-.\"
 .SS BPF programs
 .TP 4
 .B BPF_PROG_LOAD
@@ -532,8 +616,6 @@ The
 command is used to load an eBPF program into the kernel.
 The return value for this command is a new file descriptor associated
 with this program.
-.\" FIXME What is the effect of closing this FD?
-.\" FIXME What is the effect of duplicating this FD?

 .in +4n
 .nf
@@ -565,11 +647,12 @@ is one of the available program types:
 .in +4n
 .nf
 enum bpf_prog_type {
-    BPF_PROG_TYPE_UNSPEC,   /* Reserve 0 as invalid
-                               program type */
-    BPF_PROG_TYPE_SOCKET_FILTER,
-    BPF_PROG_TYPE_SCHED_CLS,
-.\" FIXME BPF_PROG_TYPE_SCHED_CLS appears not to exist?
+    BPF_PROG_TYPE_UNSPEC,        /* Reserve 0 as invalid
+                                    program type */
+    BPF_PROG_TYPE_SOCKET_FILTER, /* Since Linux 3.19 */
+    BPF_PROG_TYPE_KPROBE,        /* Since Linux 4.1 */
+    BPF_PROG_TYPE_SCHED_CLS,     /* Since Linux 4.1 */
+    BPF_PROG_TYPE_SCHED_ACT,     /* Since Linux 4.1 */
 };
 .fi
 .in
@@ -901,6 +984,27 @@ In the current implementation, all
 commands require the caller to have the
 .B CAP_SYS_ADMIN
 capability.
+
+.\" FIXME Alexei, is the following correct?
+eBPF objects (maps and programs) can be shared between processes.
+For example, after
+.BR fork (2),
+the child inherits file descriptors referring to the same eBPF objects.
+In addition, file descriptors referring to eBPF objects can be
+transferred over UNIX domain sockets.
+File descriptors referring to eBPF objects can be duplicated
+in the usual way, using
+.BR dup (2)
+and similar calls.
+An eBPF object is deallocated only after all file descriptors
+referring to the object have been closed.
+
+eBPF programs can be written in a restricted C that is compiled into
+eBPF bytecode and executed on the in-kernel virtual machine or
+just-in-time compiled into native code.
+(Various features are omitted from this restricted C, such as loops, 
+global variables, variadic functions, floating-point numbers,
+and passing structures as function arguments.)
 .SH SEE ALSO
 .BR seccomp (2),
 .BR socket (7),
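
Reviewer sketch (not part of the patch): the new NOTES text in the last hunk says that file descriptors referring to eBPF objects can be duplicated with dup(2), and that an object is deallocated only after all descriptors referring to it are closed. The fragment below is one possible way to observe this for a map. The function name map_survives_close_of_original() is hypothetical, the map parameters are arbitrary, and the code assumes <linux/bpf.h> and __NR_bpf are available. When map creation is not permitted (for example, without CAP_SYS_ADMIN), it reports "skipped" rather than failing.

```c
#include <linux/bpf.h>      /* union bpf_attr, BPF_MAP_*, BPF_ANY */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>    /* __NR_bpf */
#include <unistd.h>         /* syscall(), dup(), close() */

/* Hypothetical check. Returns:
     0  the map was still usable through the duplicate descriptor
     1  map creation was refused or unsupported (check skipped)
    -1  unexpected failure */
static int map_survives_close_of_original(void)
{
    union bpf_attr attr;
    uint32_t key = 0;
    uint64_t value = 7;
    int fd, dup_fd;
    long rc;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_ARRAY;
    attr.key_size = sizeof(key);
    attr.value_size = sizeof(value);
    attr.max_entries = 1;
    fd = (int) syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd < 0)
        return 1;

    /* Duplicate the descriptor, then close the original one. */
    dup_fd = dup(fd);
    close(fd);
    if (dup_fd < 0)
        return -1;

    /* The map should still exist: update it via the duplicate. */
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = dup_fd;
    attr.key = (uint64_t) (unsigned long) &key;
    attr.value = (uint64_t) (unsigned long) &value;
    attr.flags = BPF_ANY;
    rc = syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

    close(dup_fd);          /* last reference: the map can now be freed */
    return rc == 0 ? 0 : -1;
}
```

The same pattern would apply to a descriptor inherited across fork(2); that case is not exercised here.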