mirror of https://github.com/mkerrisk/man-pages
bpf.2: Fixes after comments by Daniel Borkmann
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
parent b87d8ba6f2
commit 953d26734e
181
man2/bpf.2
|
@ -25,7 +25,7 @@
|
||||||
.\"
|
.\"
|
||||||
.TH BPF 2 2015-03-10 "Linux" "Linux Programmer's Manual"
|
.TH BPF 2 2015-03-10 "Linux" "Linux Programmer's Manual"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
bpf - perform a command on an extended eBPF map or program
|
bpf - perform a command on an extended BPF map or program
|
||||||
.SH SYNOPSIS
|
.SH SYNOPSIS
|
||||||
.nf
|
.nf
|
||||||
.B #include <linux/bpf.h>
|
.B #include <linux/bpf.h>
|
||||||
|
@ -53,7 +53,7 @@ and access shared data structures such as eBPF maps.
|
||||||
.\"
|
.\"
|
||||||
.\" FIXME In the following line, what does "different data types" mean?
|
.\" FIXME In the following line, what does "different data types" mean?
|
||||||
.\" Are the values in a map not just blobs?
|
.\" Are the values in a map not just blobs?
|
||||||
.\" Daniel Borkman commented:
|
.\" Daniel Borkmann commented:
|
||||||
.\" Sort of, currently, these blobs can have different sizes of keys
|
.\" Sort of, currently, these blobs can have different sizes of keys
|
||||||
.\" and values (you can even have structs as keys). For the map itself
|
.\" and values (you can even have structs as keys). For the map itself
|
||||||
.\" they are treated as blob internally. However, recently, bpf tail call
|
.\" they are treated as blob internally. However, recently, bpf tail call
|
||||||
|
@ -63,15 +63,14 @@ and access shared data structures such as eBPF maps.
|
||||||
.\" the tail call could be done as follow-up after we have an initial man
|
.\" the tail call could be done as follow-up after we have an initial man
|
||||||
.\" page in the tree included.
|
.\" page in the tree included.
|
||||||
.\"
|
.\"
|
||||||
BPF maps are a generic data structure for storage of different data types.
|
eBPF maps are a generic data structure for storage of different data types.
|
||||||
A user process can create multiple maps (with key/value-pairs being
|
A user process can create multiple maps (with key/value-pairs being
|
||||||
opaque bytes of data) and access them via file descriptors.
|
opaque bytes of data) and access them via file descriptors.
|
||||||
Different eBPF programs can access the same maps in parallel.
|
Different eBPF programs can access the same maps in parallel.
|
||||||
It's up to the user process and eBPF program to decide what they store
|
It's up to the user process and eBPF program to decide what they store
|
||||||
inside maps.
|
inside maps.
|
||||||
.P
|
.P
|
||||||
eBPF programs are similar to kernel modules.
|
eBPF programs are loaded by the user
|
||||||
They are loaded by the user
|
|
||||||
process and automatically unloaded when the process exits.
|
process and automatically unloaded when the process exits.
|
||||||
.\"
|
.\"
|
||||||
.\" FIXME Daniel Borkmann commented about the preceding sentence:
|
.\" FIXME Daniel Borkmann commented about the preceding sentence:
|
||||||
|
@ -80,7 +79,7 @@ process and automatically unloaded when the process exits.
|
||||||
.\" eBPF classifier and actions, and here it's slightly different: in tc,
|
.\" eBPF classifier and actions, and here it's slightly different: in tc,
|
||||||
.\" we load the programs, maps etc, and push down the eBPF program fd in
|
.\" we load the programs, maps etc, and push down the eBPF program fd in
|
||||||
.\" order to let the kernel hold reference on the program itself.
|
.\" order to let the kernel hold reference on the program itself.
|
||||||
.\"
|
.\"
|
||||||
.\" Thus, there, the program fd that the application owns is gone when the
|
.\" Thus, there, the program fd that the application owns is gone when the
|
||||||
.\" application terminates, but the eBPF program itself still lives on
|
.\" application terminates, but the eBPF program itself still lives on
|
||||||
.\" inside the kernel.
|
.\" inside the kernel.
|
||||||
|
@ -93,11 +92,12 @@ An in-kernel verifier statically determines that the eBPF program
|
||||||
terminates and is safe to execute.
|
terminates and is safe to execute.
|
||||||
During verification, the kernel increments reference counts for each of
|
During verification, the kernel increments reference counts for each of
|
||||||
the maps that the eBPF program uses,
|
the maps that the eBPF program uses,
|
||||||
so that the selected maps cannot be removed until the program is unloaded.
|
so that the attached maps can't be removed until the program is unloaded.
|
||||||
|
|
||||||
eBPF programs can be attached to different events.
|
eBPF programs can be attached to different events.
|
||||||
These events can be the arrival of network packets, tracing
|
These events can be the arrival of network packets, tracing
|
||||||
events, classification event by qdisc (for eBPF programs attached to a
|
events, classification events by network queueing disciplines
|
||||||
|
(for eBPF programs attached to a
|
||||||
.BR tc (8)
|
.BR tc (8)
|
||||||
classifier), and other types that may be added in the future.
|
classifier), and other types that may be added in the future.
|
||||||
A new event triggers execution of the eBPF program, which
|
A new event triggers execution of the eBPF program, which
|
||||||
|
@ -109,13 +109,13 @@ eBPF programs can access the same map:
|
||||||
|
|
||||||
.in +4n
|
.in +4n
|
||||||
.nf
|
.nf
|
||||||
    tracing     tracing     tracing     packet     packet
|
    tracing     tracing     tracing     packet     packet
|
||||||
    event A     event B     event C     on eth0    on eth1
|
    event A     event B     event C     on eth0    on eth1
|
||||||
     |             |          |           |          |
|
     |             |          |           |          |
|
||||||
     |             |          |           |          |
|
     |             |          |           |          |
|
||||||
     --> tracing <--      tracing       socket     socket
|
     --> tracing <--      tracing       socket   tc ingress
|
||||||
          prog_1           prog_2       prog_3     prog_4
|
          prog_1           prog_2       prog_3   classifier
|
||||||
          |  |               |            |
|
          |  |               |            |        prog_4
|
||||||
       |---  -----|  |-------|           map_3
|
       |---  -----|  |-------|           map_3
|
||||||
     map_1       map_2
|
     map_1       map_2
|
||||||
.fi
|
.fi
|
||||||
|
@ -142,7 +142,7 @@ The value provided in
|
||||||
is one of the following:
|
is one of the following:
|
||||||
.TP
|
.TP
|
||||||
.B BPF_MAP_CREATE
|
.B BPF_MAP_CREATE
|
||||||
Create a map with and return a file descriptor that refers to the map.
|
Create a map and return a file descriptor that refers to the map.
|
||||||
.TP
|
.TP
|
||||||
.B BPF_MAP_LOOKUP_ELEM
|
.B BPF_MAP_LOOKUP_ELEM
|
||||||
Look up an element by key in a specified map and return its value.
|
Look up an element by key in a specified map and return its value.
|
||||||
|
@ -240,13 +240,15 @@ returning a new file descriptor that refers to the map.
|
||||||
.in +4n
|
.in +4n
|
||||||
.nf
|
.nf
|
||||||
int
|
int
|
||||||
bpf_create_map(enum bpf_map_type map_type, int key_size,
|
bpf_create_map(enum bpf_map_type map_type,
|
||||||
int value_size, int max_entries)
|
unsigned int key_size,
|
||||||
|
unsigned int value_size,
|
||||||
|
unsigned int max_entries)
|
||||||
{
|
{
|
||||||
union bpf_attr attr = {
|
union bpf_attr attr = {
|
||||||
.map_type = map_type,
|
.map_type = map_type,
|
||||||
.key_size = key_size,
|
.key_size = key_size,
|
||||||
.value_size = value_size,
|
.value_size = value_size,
|
||||||
.max_entries = max_entries
|
.max_entries = max_entries
|
||||||
};
|
};
|
||||||
|
|
||||||
|
@ -271,12 +273,12 @@ is set to
|
||||||
or
|
or
|
||||||
.BR ENOMEM .
|
.BR ENOMEM .
|
||||||
|
|
||||||
The attributes
|
The
|
||||||
.I key_size
|
.I key_size
|
||||||
and
|
and
|
||||||
.I value_size
|
.I value_size
|
||||||
will be used by the verifier during program loading to check that the program
|
attributes will be used by the verifier during program loading
|
||||||
is calling
|
to check that the program is calling
|
||||||
.BR bpf_map_*_elem ()
|
.BR bpf_map_*_elem ()
|
||||||
helper functions with a correctly initialized
|
helper functions with a correctly initialized
|
||||||
.I key
|
.I key
|
||||||
|
@ -362,12 +364,12 @@ in the map referred to by the file descriptor
|
||||||
.in +4n
|
.in +4n
|
||||||
.nf
|
.nf
|
||||||
int
|
int
|
||||||
bpf_lookup_elem(int fd, void *key, void *value)
|
bpf_lookup_elem(int fd, const void *key, void *value)
|
||||||
{
|
{
|
||||||
union bpf_attr attr = {
|
union bpf_attr attr = {
|
||||||
.map_fd = fd,
|
.map_fd = fd,
|
||||||
.key = ptr_to_u64(key),
|
.key = ptr_to_u64(key),
|
||||||
.value = ptr_to_u64(value),
|
.value = ptr_to_u64(value),
|
||||||
};
|
};
|
||||||
|
|
||||||
return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
|
return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
|
||||||
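The wrappers in these hunks call ptr_to_u64(), which is not defined anywhere in this diff. The sketch below shows one plausible definition and why the new const qualifiers fit it. The struct demo_elem_attr and the function demo_fill_lookup() are hypothetical stand-ins for illustration only; they are not the kernel's union bpf_attr and they do not invoke bpf(2).

```c
#include <stdint.h>
#include <string.h>

/*
 * Assumed helper (not part of this diff): widen a user-space pointer
 * to the 64-bit integer form in which the attribute fields carry it.
 * Taking "const void *" lets callers pass read-only keys unchanged.
 */
static inline uint64_t ptr_to_u64(const void *ptr)
{
    return (uint64_t)(uintptr_t)ptr;
}

/* Hypothetical stand-in for the map_fd/key/value members only. */
struct demo_elem_attr {
    uint32_t map_fd;
    uint64_t key;
    uint64_t value;
};

/*
 * Mirrors the argument handling of the bpf_lookup_elem() wrapper:
 * the key is only read, so it is const; the value buffer is written
 * by the lookup, so it stays non-const.
 */
static struct demo_elem_attr
demo_fill_lookup(int fd, const void *key, void *value)
{
    struct demo_elem_attr attr;

    memset(&attr, 0, sizeof(attr));   /* unused fields stay zero */
    attr.map_fd = (uint32_t)fd;
    attr.key = ptr_to_u64(key);
    attr.value = ptr_to_u64(value);
    return attr;
}
```

With the old `void *key` prototypes, passing the address of a `const` key would draw a discarded-qualifier warning; with the signatures introduced here it compiles cleanly.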
|
@ -399,13 +401,14 @@ in the map referred to by the file descriptor
|
||||||
.in +4n
|
.in +4n
|
||||||
.nf
|
.nf
|
||||||
int
|
int
|
||||||
bpf_update_elem(int fd, void *key, void *value, __u64 flags)
|
bpf_update_elem(int fd, const void *key, const void *value,
|
||||||
|
uint64_t flags)
|
||||||
{
|
{
|
||||||
union bpf_attr attr = {
|
union bpf_attr attr = {
|
||||||
.map_fd = fd,
|
.map_fd = fd,
|
||||||
.key = ptr_to_u64(key),
|
.key = ptr_to_u64(key),
|
||||||
.value = ptr_to_u64(value),
|
.value = ptr_to_u64(value),
|
||||||
.flags = flags,
|
.flags = flags,
|
||||||
};
|
};
|
||||||
|
|
||||||
return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
|
return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
|
||||||
|
@ -450,7 +453,7 @@ and the element with
|
||||||
.I key
|
.I key
|
||||||
already exists in the map.
|
already exists in the map.
|
||||||
.B ENOENT
|
.B ENOENT
|
||||||
will be returned if
|
will be returned if
|
||||||
.I flags
|
.I flags
|
||||||
specifies
|
specifies
|
||||||
.B BPF_EXIST
|
.B BPF_EXIST
|
||||||
|
@ -470,11 +473,11 @@ from the map referred to by the file descriptor
|
||||||
.in +4n
|
.in +4n
|
||||||
.nf
|
.nf
|
||||||
int
|
int
|
||||||
bpf_delete_elem(int fd, void *key)
|
bpf_delete_elem(int fd, const void *key)
|
||||||
{
|
{
|
||||||
union bpf_attr attr = {
|
union bpf_attr attr = {
|
||||||
.map_fd = fd,
|
.map_fd = fd,
|
||||||
.key = ptr_to_u64(key),
|
.key = ptr_to_u64(key),
|
||||||
};
|
};
|
||||||
|
|
||||||
return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
|
return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
|
||||||
|
@ -494,7 +497,7 @@ The
|
||||||
command looks up an element by
|
command looks up an element by
|
||||||
.I key
|
.I key
|
||||||
in the map referred to by the file descriptor
|
in the map referred to by the file descriptor
|
||||||
.IR fd
|
.IR fd
|
||||||
and sets the
|
and sets the
|
||||||
.I next_key
|
.I next_key
|
||||||
pointer to the key of the next element.
|
pointer to the key of the next element.
|
||||||
|
@ -502,11 +505,11 @@ pointer to the key of the next element.
|
||||||
.nf
|
.nf
|
||||||
.in +4n
|
.in +4n
|
||||||
int
|
int
|
||||||
bpf_get_next_key(int fd, void *key, void *next_key)
|
bpf_get_next_key(int fd, const void *key, void *next_key)
|
||||||
{
|
{
|
||||||
union bpf_attr attr = {
|
union bpf_attr attr = {
|
||||||
.map_fd = fd,
|
.map_fd = fd,
|
||||||
.key = ptr_to_u64(key),
|
.key = ptr_to_u64(key),
|
||||||
.next_key = ptr_to_u64(next_key),
|
.next_key = ptr_to_u64(next_key),
|
||||||
};
|
};
|
||||||
|
|
||||||
|
@ -572,7 +575,7 @@ limit is reached.
|
||||||
replaces existing elements atomically.
|
replaces existing elements atomically.
|
||||||
.RE
|
.RE
|
||||||
.IP
|
.IP
|
||||||
Hash-table maps are
|
Hash-table maps are
|
||||||
optimized for speed of lookup.
|
optimized for speed of lookup.
|
||||||
.TP
|
.TP
|
||||||
.B BPF_MAP_TYPE_ARRAY
|
.B BPF_MAP_TYPE_ARRAY
|
||||||
|
@ -603,7 +606,11 @@ fails with the error
|
||||||
since elements cannot be deleted.
|
since elements cannot be deleted.
|
||||||
.IP *
|
.IP *
|
||||||
.BR map_update_elem ()
|
.BR map_update_elem ()
|
||||||
replaces elements in an non-atomic fashion;
|
replaces elements in a
|
||||||
|
.B nonatomic
|
||||||
|
fashion;
|
||||||
|
.\" Daniel Borkmann: when you have a value_size of sizeof(long), you can
|
||||||
|
.\" however use __sync_fetch_and_add() atomic builtin from the LLVM backend
|
||||||
for atomic updates, a hash-table map should be used instead.
|
for atomic updates, a hash-table map should be used instead.
|
||||||
.RE
|
.RE
|
||||||
.IP
|
.IP
|
||||||
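The review comment added in this hunk mentions the __sync_fetch_and_add() builtin as a way to update a long-sized array value atomically. The following user-space sketch only contrasts the builtin with a plain read-modify-write; it assumes a GCC- or clang-compatible compiler, and the function names plain_add() and atomic_add() are made up for illustration. Inside an eBPF program the pointer would come from a map lookup rather than from a local variable.

```c
/*
 * Plain read-modify-write: two concurrent callers can both read the
 * same old value, so one increment may be lost.
 */
static long plain_add(long *slot, long delta)
{
    long old = *slot;

    *slot = old + delta;
    return old;
}

/*
 * Atomic variant using the compiler builtin named in the comment:
 * adds delta to *slot in one indivisible step and returns the value
 * that was stored before the addition.
 */
static long atomic_add(long *slot, long delta)
{
    return __sync_fetch_and_add(slot, delta);
}
```

Both functions return the previous value, so they are interchangeable in single-threaded code; they differ only when several writers touch the same slot at once.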
|
@ -633,17 +640,17 @@ with this eBPF program.
|
||||||
char bpf_log_buf[LOG_BUF_SIZE];
|
char bpf_log_buf[LOG_BUF_SIZE];
|
||||||
|
|
||||||
int
|
int
|
||||||
bpf_prog_load(enum bpf_prog_type prog_type,
|
bpf_prog_load(enum bpf_prog_type type,
|
||||||
const struct bpf_insn *insns, int insn_cnt,
|
const struct bpf_insn *insns, int insn_cnt,
|
||||||
const char *license)
|
const char *license)
|
||||||
{
|
{
|
||||||
union bpf_attr attr = {
|
union bpf_attr attr = {
|
||||||
.prog_type = prog_type,
|
.prog_type = type,
|
||||||
.insns = ptr_to_u64(insns),
|
.insns = ptr_to_u64(insns),
|
||||||
.insn_cnt = insn_cnt,
|
.insn_cnt = insn_cnt,
|
||||||
.license = ptr_to_u64(license),
|
.license = ptr_to_u64(license),
|
||||||
.log_buf = ptr_to_u64(bpf_log_buf),
|
.log_buf = ptr_to_u64(bpf_log_buf),
|
||||||
.log_size = LOG_BUF_SIZE,
|
.log_size = LOG_BUF_SIZE,
|
||||||
.log_level = 1,
|
.log_level = 1,
|
||||||
};
|
};
|
||||||
|
|
||||||
|
@ -687,13 +694,26 @@ is the number of instructions in the program referred to by
|
||||||
is a license string, which must be GPL compatible to call helper functions
|
is a license string, which must be GPL compatible to call helper functions
|
||||||
marked
|
marked
|
||||||
.IR gpl_only .
|
.IR gpl_only .
|
||||||
|
.\" Daniel Borkmann commented:
|
||||||
|
.\" Not strictly. So here, the same rules apply as with kernel modules.
|
||||||
|
.\" I.e. what the kernel checks for are the following license strings:
|
||||||
|
.\"
|
||||||
|
.\" static inline int license_is_gpl_compatible(const char *license)
|
||||||
|
.\" {
|
||||||
|
.\" return (strcmp(license, "GPL") == 0
|
||||||
|
.\" || strcmp(license, "GPL v2") == 0
|
||||||
|
.\" || strcmp(license, "GPL and additional rights") == 0
|
||||||
|
.\" || strcmp(license, "Dual BSD/GPL") == 0
|
||||||
|
.\" || strcmp(license, "Dual MIT/GPL") == 0
|
||||||
|
.\" || strcmp(license, "Dual MPL/GPL") == 0);
|
||||||
|
.\" }
|
||||||
.IP *
|
.IP *
|
||||||
.I log_buf
|
.I log_buf
|
||||||
is a pointer to a caller-allocated buffer in which the in-kernel
|
is a pointer to a caller-allocated buffer in which the in-kernel
|
||||||
verifier can store the verification log.
|
verifier can store the verification log.
|
||||||
This log is a multi-line string that can be checked by
|
This log is a multi-line string that can be checked by
|
||||||
the program author in order to understand how the verifier came to
|
the program author in order to understand how the verifier came to
|
||||||
the conclusion that the BPF program is unsafe.
|
the conclusion that the eBPF program is unsafe.
|
||||||
The format of the output can change at any time as the verifier evolves.
|
The format of the output can change at any time as the verifier evolves.
|
||||||
.IP *
|
.IP *
|
||||||
.I log_size
|
.I log_size
|
||||||
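The comment quoted in this hunk lists the license strings that the kernel treats as GPL compatible. A table-driven restatement of that check, as a minimal sketch: the list is copied from the quoted comment and may not match every kernel version, and the comparison is exact and case sensitive, as in the quoted code.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if the license string is one of those listed in the
 * review comment, 0 otherwise. Matching is exact: "gpl" or "GPLv2"
 * are not accepted.
 */
static int license_is_gpl_compatible(const char *license)
{
    static const char *const accepted[] = {
        "GPL",
        "GPL v2",
        "GPL and additional rights",
        "Dual BSD/GPL",
        "Dual MIT/GPL",
        "Dual MPL/GPL",
    };
    size_t i;

    for (i = 0; i < sizeof(accepted) / sizeof(accepted[0]); i++) {
        if (strcmp(license, accepted[i]) == 0)
            return 1;
    }
    return 0;
}
```

A program loaded with a license string outside this list would, per the surrounding text, be unable to call helpers marked gpl_only.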
|
@ -725,20 +745,25 @@ and user-space programs can then fetch data from the map.
|
||||||
Conversely, user-space programs can use a map as a configuration mechanism,
|
Conversely, user-space programs can use a map as a configuration mechanism,
|
||||||
populating the map with values checked by the eBPF program,
|
populating the map with values checked by the eBPF program,
|
||||||
which then modifies its behavior on the fly according to those values.
|
which then modifies its behavior on the fly according to those values.
|
||||||
|
.\"
|
||||||
|
.\"
|
||||||
.SS eBPF program types
|
.SS eBPF program types
|
||||||
By picking
|
The eBPF program type
|
||||||
.IR prog_type ,
|
.RI ( prog_type )
|
||||||
the program author selects a set of helper functions that can be called from
|
determines the subset of kernel helper functions that the program
|
||||||
the eBPF program and the corresponding format of
|
may call.
|
||||||
.I struct bpf_context
|
The program type also determines the program input (context)\(emthe
|
||||||
|
format of
|
||||||
|
.I "struct bpf_context"
|
||||||
(which is the data blob passed into the eBPF program as the first argument).
|
(which is the data blob passed into the eBPF program as the first argument).
|
||||||
For example, programs loaded with a
|
|
||||||
.I prog_type
|
For example, a tracing program does not have the exact same
|
||||||
of
|
subset of helper functions as a socket filter program
|
||||||
.B BPF_PROG_TYPE_KPROBE
|
(though they may have some helpers in common).
|
||||||
may call the
|
Similarly,
|
||||||
.BR bpf_probe_read ()
|
the input (context) for a tracing program is a set of register values,
|
||||||
helper, whereas other program types can't employ this helper.
|
while for a socket filter it is a network packet.
|
||||||
|
|
||||||
The set of functions available to eBPF programs of a given type may increase
|
The set of functions available to eBPF programs of a given type may increase
|
||||||
in the future.
|
in the future.
|
||||||
|
|
||||||
|
@ -764,7 +789,7 @@ The
|
||||||
.I bpf_context
|
.I bpf_context
|
||||||
argument is a pointer to a
|
argument is a pointer to a
|
||||||
.IR "struct __sk_buff" .
|
.IR "struct __sk_buff" .
|
||||||
.\" FIXME: We need some text here to explain how the program
|
.\" FIXME: We need some text here to explain how the program
|
||||||
.\" accesses __sk_buff
|
.\" accesses __sk_buff
|
||||||
.\" See 'struct __sk_buff' and commit 9bac3d6d548e5
|
.\" See 'struct __sk_buff' and commit 9bac3d6d548e5
|
||||||
.\" Alexei commented:
|
.\" Alexei commented:
|
||||||
|
@ -967,8 +992,8 @@ are not set to zero.
|
||||||
For
|
For
|
||||||
.BR BPF_PROG_LOAD,
|
.BR BPF_PROG_LOAD,
|
||||||
indicates an attempt to load an invalid program.
|
indicates an attempt to load an invalid program.
|
||||||
BPF programs can be deemed
|
eBPF programs can be deemed
|
||||||
einvalid due to unrecognized instructions, the use of reserved fields, jumps
|
invalid due to unrecognized instructions, the use of reserved fields, jumps
|
||||||
out of range, infinite loops or calls of unknown functions.
|
out of range, infinite loops or calls of unknown functions.
|
||||||
.TP
|
.TP
|
||||||
.BR EACCES
|
.BR EACCES
|
||||||
|
@ -998,7 +1023,7 @@ indicates that the element with the given
|
||||||
was not found.
|
was not found.
|
||||||
.TP
|
.TP
|
||||||
.BR E2BIG
|
.BR E2BIG
|
||||||
The BPF program is too large or a map reached the
|
The eBPF program is too large or a map reached the
|
||||||
.I max_entries
|
.I max_entries
|
||||||
limit (maximum number of elements).
|
limit (maximum number of elements).
|
||||||
.SH VERSIONS
|
.SH VERSIONS
|
||||||
|
@ -1031,16 +1056,36 @@ referring to the object have been closed.
|
||||||
|
|
||||||
eBPF programs can be written in a restricted C that is compiled (using the
|
eBPF programs can be written in a restricted C that is compiled (using the
|
||||||
.B clang
|
.B clang
|
||||||
compiler) into eBPF bytecode and executed on the in-kernel virtual machine or
|
compiler) into eBPF bytecode.
|
||||||
just-in-time compiled into native code.
|
Various features are omitted from this restricted C, such as loops,
|
||||||
(Various features are omitted from this restricted C, such as loops,
|
|
||||||
global variables, variadic functions, floating-point numbers,
|
global variables, variadic functions, floating-point numbers,
|
||||||
and passing structures as function arguments.)
|
and passing structures as function arguments.
|
||||||
Some examples can be found in the
|
Some examples can be found in the
|
||||||
.I samples/bpf/*_kern.c
|
.I samples/bpf/*_kern.c
|
||||||
files in the kernel source tree.
|
files in the kernel source tree.
|
||||||
.\" There are also examples for the tc classifier, in the iproute2
|
.\" There are also examples for the tc classifier, in the iproute2
|
||||||
.\" project, in examples/bpf
|
.\" project, in examples/bpf
|
||||||
|
|
||||||
|
The kernel contains a just-in-time (JIT) compiler that translates
|
||||||
|
eBPF bytecode into native machine code for better performance.
|
||||||
|
The JIT compiler is disabled by default,
|
||||||
|
but its operation can be controlled by writing one of the
|
||||||
|
following integer strings to the file
|
||||||
|
.IR /proc/sys/net/core/bpf_jit_enable :
|
||||||
|
.IP 0 3
|
||||||
|
Disable JIT compilation (default).
|
||||||
|
.IP 1
|
||||||
|
Normal compilation.
|
||||||
|
.IP 2
|
||||||
|
Debugging mode.
|
||||||
|
The generated opcodes are dumped in hexadecimal into the kernel log.
|
||||||
|
These opcodes can then be disassembled using the program
|
||||||
|
.IR tools/net/bpf_jit_disasm.c
|
||||||
|
provided in the kernel source tree.
|
||||||
|
.\" .PP
|
||||||
|
.\" The JIT compiler is currently available for the x86-64, arm64,
|
||||||
|
.\" and s390 architectures.
|
||||||
|
.\" FIXME: and others?
|
||||||
.SH SEE ALSO
|
.SH SEE ALSO
|
||||||
.BR seccomp (2),
|
.BR seccomp (2),
|
||||||
.BR socket (7),
|
.BR socket (7),
|
||||||
|
|
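The JIT text added in the last hunk describes three values for /proc/sys/net/core/bpf_jit_enable. A small sketch of reading and interpreting that setting follows. The function names read_jit_mode() and jit_mode_name() are hypothetical, and the path is passed as a parameter so the reader can be pointed at any file; on systems without the proc entry it simply reports failure.

```c
#include <stdio.h>

/* Path named in the man page text. */
#define BPF_JIT_ENABLE_PATH "/proc/sys/net/core/bpf_jit_enable"

/* Map the documented integer values to short descriptions. */
static const char *jit_mode_name(int mode)
{
    switch (mode) {
    case 0:
        return "disabled";
    case 1:
        return "normal compilation";
    case 2:
        return "debugging mode";
    default:
        return "unknown";
    }
}

/*
 * Read the integer stored in the given file.
 * Returns -1 if the file cannot be opened or does not start with
 * an integer.
 */
static int read_jit_mode(const char *path)
{
    FILE *f = fopen(path, "r");
    int mode = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

A caller might print `jit_mode_name(read_jit_mode(BPF_JIT_ENABLE_PATH))` to report the current setting; changing it requires writing to the same file with sufficient privilege.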