bpf.2: New page documenting bpf(2)

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
2015-03-13 12:16:32 -07:00 · 2015-03-13 12:16:32 -07:00 · cc7ac21d25
parent b1e6b7c776
commit cc7ac21d25
1 changed files with 630 additions and 0 deletions
--- a/man2/bpf.2
+++ b/man2/bpf.2
@ -0,0 +1,630 @@
+.\" Copyright (C) 2015 Alexei Starovoitov <ast@kernel.org>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH BPF 2 2015-03-10 "Linux" "Linux Programmer's Manual"
+.SH NAME
+bpf - perform a command on an extended BPF map or program
+.SH SYNOPSIS
+.nf
+.B #include <linux/bpf.h>
+.sp
+.BI "int bpf(int cmd, union bpf_attr *attr, unsigned int size);
+
+.SH DESCRIPTION
+.BR bpf()
+syscall is a multiplexor for a range of different operations on extended
+Berkeley Packet Filter which can be characterized as
+"universal in-kernel virtual machine". The extended BPF (or eBPF) is similar to
+the original BPF (or classic BPF) used to filter network packets. Both
+statically analyze the programs before loading them into the kernel to
+ensure that they cannot harm the running system.
+.P
+eBPF extends classic BPF in multiple ways including the ability to call
+in-kernel helper functions and access shared data structures like BPF maps.
+The programs can be written in a restricted C that is compiled into
+eBPF bytecode and executed on the in-kernel virtual machine or JITed into
+native code.
+.SS Extended BPF Design/Architecture
+.P
+BPF maps are a generic data structure for storage of different data types.
+A user process can create multiple maps (with key/value-pairs being
+opaque bytes of data) and access them via file descriptor.
+BPF programs can access maps from inside the kernel in parallel.
+It's up to the user process and BPF program to decide what they store
+inside maps.
+.P
+BPF programs are similar to kernel modules. They are loaded by the user
+process and automatically unloaded when the process exits.
+Each BPF program is a set of instructions that is safe to run until
+its completion. The BPF verifier statically determines that the program
+terminates and is safe to execute. During
+verification the program takes hold of maps that it intends to use,
+so selected maps cannot be removed until the program is unloaded. The program
+can be attached to different events. These events can be packets, tracing
+events and other types that may be added in the future. A new event triggers
+execution of the program which may store information about the event in the maps.
+Beyond storing data the programs may call into in-kernel helper functions.
+The same program can be attached to multiple events and different programs can
+access the same map:
+.nf
+  tracing     tracing     tracing     packet     packet
+  event A     event B     event C     on eth0    on eth1
+   |             |          |           |          |
+   |             |          |           |          |
+   --> tracing <--      tracing       socket     socket
+        prog_1           prog_2       prog_3     prog_4
+        |  |               |            |
+     |---  -----|  |-------|           map_3
+   map_1       map_2
+.fi
+.SS Syscall Arguments
+.B bpf()
+syscall operation is determined by
+.IR cmd
+which can be one of the following:
+.TP
+.B BPF_MAP_CREATE
+Create a map with the given type and attributes and return map FD
+.TP
+.B BPF_MAP_LOOKUP_ELEM
+Lookup element by key in a given map and return its value
+.TP
+.B BPF_MAP_UPDATE_ELEM
+Create or update element (key/value pair) in a given map
+.TP
+.B BPF_MAP_DELETE_ELEM
+Lookup and delete element by key in a given map
+.TP
+.B BPF_MAP_GET_NEXT_KEY
+Lookup element by key in a given map and return key of next element
+.TP
+.B BPF_PROG_LOAD
+Verify and load BPF program
+.TP
+.B attr
+is a pointer to a union of type bpf_attr as defined below.
+.TP
+.B size
+is the size of the union.
+.P
+.nf
+union bpf_attr {
+    struct { /* anonymous struct used by BPF_MAP_CREATE command */
+        __u32             map_type;
+        __u32             key_size;    /* size of key in bytes */
+        __u32             value_size;  /* size of value in bytes */
+        __u32             max_entries; /* max number of entries in a map */
+    };
+
+    struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
+        __u32             map_fd;
+        __aligned_u64     key;
+        union {
+            __aligned_u64 value;
+            __aligned_u64 next_key;
+        };
+	__u64             flags;
+    };
+
+    struct { /* anonymous struct used by BPF_PROG_LOAD command */
+        __u32         prog_type;
+        __u32         insn_cnt;
+        __aligned_u64 insns;     /* 'const struct bpf_insn *' */
+        __aligned_u64 license;   /* 'const char *' */
+        __u32         log_level; /* verbosity level of verifier */
+        __u32         log_size;  /* size of user buffer */
+        __aligned_u64 log_buf;   /* user supplied 'char *' buffer */
+    };
+} __attribute__((aligned(8)));
+.fi
+.SS BPF maps
+maps are a generic data structure for storage of different types
+and sharing data between kernel and userspace.
+
+Any map type has the following attributes:
+  . type
+  . max number of elements
+  . key size in bytes
+  . value size in bytes
+
+The following wrapper functions demonstrate how this syscall can be used to
+access the maps. The functions use the
+.IR cmd
+argument to invoke different operations.
+.TP
+.B BPF_MAP_CREATE
+.nf
+int bpf_create_map(enum bpf_map_type map_type, int key_size,
+                   int value_size, int max_entries)
+{
+    union bpf_attr attr = {
+        .map_type = map_type,
+        .key_size = key_size,
+        .value_size = value_size,
+        .max_entries = max_entries
+    };
+
+    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
+}
+.fi
+bpf() syscall creates a map of
+.I map_type
+type and given attributes
+.I key_size, value_size, max_entries.
+On success it returns a process-local file descriptor. On error, \-1 is returned and
+.I errno
+is set to EINVAL or EPERM or ENOMEM.
+
+The attributes
+.I key_size
+and
+.I value_size
+will be used by the verifier during program loading to check that the program
+is calling bpf_map_*_elem() helper functions with a correctly initialized
+.I key
+and that the program doesn't access map element
+.I value
+beyond the specified
+.I value_size.
+For example, when a map is created with key_size = 8 and the program calls
+.nf
+bpf_map_lookup_elem(map_fd, fp - 4)
+.fi
+the program will be rejected,
+since the in-kernel helper function bpf_map_lookup_elem(map_fd, void *key) expects
+to read 8 bytes from 'key' pointer, but 'fp - 4' starting address will cause
+out of bounds stack access.
+
+Similarly, when a map is created with value_size = 1 and the program calls
+.nf
+value = bpf_map_lookup_elem(...);
+*(u32 *)value = 1;
+.fi
+the program will be rejected, since it accesses the
+.I value
+pointer beyond the specified 1 byte value_size limit.
+
+Currently two
+.I map_type
+are supported:
+.nf
+enum bpf_map_type {
+   BPF_MAP_TYPE_UNSPEC,
+   BPF_MAP_TYPE_HASH,
+   BPF_MAP_TYPE_ARRAY,
+};
+.fi
+.I map_type
+selects one of the available map implementations in the kernel. For all map_types
+programs access maps with the same bpf_map_lookup_elem()/bpf_map_update_elem()
+helper functions.
+.TP
+.B BPF_MAP_LOOKUP_ELEM
+.nf
+int bpf_lookup_elem(int fd, void *key, void *value)
+{
+    union bpf_attr attr = {
+        .map_fd = fd,
+        .key = ptr_to_u64(key),
+        .value = ptr_to_u64(value),
+    };
+
+    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
+}
+.fi
+bpf() syscall looks up an element with a given
+.I key
+in a map
+.I fd.
+If an element is found it returns zero and stores element's value into
+.I value.
+If no element is found it returns \-1 and sets
+.I errno
+to ENOENT.
+.TP
+.B BPF_MAP_UPDATE_ELEM
+.nf
+int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
+{
+    union bpf_attr attr = {
+        .map_fd = fd,
+        .key = ptr_to_u64(key),
+        .value = ptr_to_u64(value),
+        .flags = flags,
+    };
+
+    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
+}
+.fi
+The call creates or updates an element with a given
+.I key/value
+in a map
+.I fd
+according to
+.I flags
+which can have one of 3 possible values:
+.nf
+#define BPF_ANY         0 /* create new element or update existing */
+#define BPF_NOEXIST     1 /* create new element if it didn't exist */
+#define BPF_EXIST       2 /* update existing element */
+.fi
+On success it returns zero.
+On error, \-1 is returned and
+.I errno
+is set to EINVAL, EPERM, ENOMEM or E2BIG.
+.B E2BIG
+indicates that the number of elements in the map reached
+.I max_entries
+limit specified at map creation time.
+.B EEXIST
+will be returned from a call to bpf_update_elem(fd, key, value, BPF_NOEXIST) if
+the element with 'key' already exists in the map.
+.B ENOENT
+will be returned from a call to bpf_update_elem(fd, key, value, BPF_EXIST) if
+the element with 'key' doesn't exist in the map.
+.TP
+.B BPF_MAP_DELETE_ELEM
+.nf
+int bpf_delete_elem(int fd, void *key)
+{
+    union bpf_attr attr = {
+        .map_fd = fd,
+        .key = ptr_to_u64(key),
+    };
+
+    return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
+}
+.fi
+The call deletes an element in a map
+.I fd
+with a given
+.I key.
+Returns zero on success. If the element is not found it returns \-1 and sets
+.I errno
+to ENOENT.
+.TP
+.B BPF_MAP_GET_NEXT_KEY
+.nf
+int bpf_get_next_key(int fd, void *key, void *next_key)
+{
+    union bpf_attr attr = {
+        .map_fd = fd,
+        .key = ptr_to_u64(key),
+        .next_key = ptr_to_u64(next_key),
+    };
+
+    return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
+}
+.fi
+The call looks up an element by
+.I key
+in a given map
+.I fd
+and sets the
+.I next_key
+pointer to the key of the next element.
+If
+.I key
+is not found, it returns zero and sets the
+.I next_key
+pointer to the key of the first element.
+If
+.I key
+is the last element, it returns \-1 and sets
+.I errno
+to ENOENT. Other possible
+.I errno
+values are ENOMEM, EFAULT, EPERM and EINVAL.
+This method can be used to iterate over all elements in the map.
+.TP
+.B close(map_fd)
+will delete the map
+.I map_fd.
+When the user space program that created maps exits all maps will
+be deleted automatically.
+
+.P
+.SS BPF programs
+
+.TP
+.B BPF_PROG_LOAD
+This
+.IR cmd
+is used to load extended BPF program into the kernel.
+
+.nf
+char bpf_log_buf[LOG_BUF_SIZE];
+
+int bpf_prog_load(enum bpf_prog_type prog_type,
+                  const struct bpf_insn *insns, int insn_cnt,
+                  const char *license)
+{
+    union bpf_attr attr = {
+        .prog_type = prog_type,
+        .insns = ptr_to_u64(insns),
+        .insn_cnt = insn_cnt,
+        .license = ptr_to_u64(license),
+        .log_buf = ptr_to_u64(bpf_log_buf),
+        .log_size = LOG_BUF_SIZE,
+        .log_level = 1,
+    };
+
+    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+.fi
+.B prog_type
+is one of the available program types:
+.nf
+enum bpf_prog_type {
+        BPF_PROG_TYPE_UNSPEC,
+        BPF_PROG_TYPE_SOCKET_FILTER,
+        BPF_PROG_TYPE_SCHED_CLS,
+};
+.fi
+By picking
+.I prog_type
+the program author selects a set of helper functions callable from
+the program and the corresponding format of
+.I struct bpf_context
+(which is the data blob passed into the program as the first argument).
+For example, the programs loaded with
+.I prog_type
+= BPF_PROG_TYPE_SOCKET_FILTER may call bpf_map_lookup_elem() helper,
+whereas some future types may not.
+The set of functions available to the programs under a given type may increase
+in the future.
+
+Currently the set of functions for
+.B BPF_PROG_TYPE_SOCKET_FILTER
+is:
+.nf
+bpf_map_lookup_elem(map_fd, void *key)              // lookup key in a map_fd
+bpf_map_update_elem(map_fd, void *key, void *value) // update key/value
+bpf_map_delete_elem(map_fd, void *key)              // delete key in a map_fd
+.fi
+
+and bpf_context is a pointer to 'struct sk_buff'. Programs cannot
+access fields of 'sk_buff' directly.
+
+More program types may be added in the future. Like
+.B BPF_PROG_TYPE_KPROBE
+and bpf_context for it may be defined as a pointer to 'struct pt_regs'.
+
+.B insns
+array of "struct bpf_insn" instructions.
+
+.B insn_cnt
+number of instructions in the program.
+
+.B license
+license string, which must be GPL compatible to call helper functions
+marked gpl_only.
+
+.B log_buf
+user supplied buffer that the in-kernel verifier is using to store the
+verification log. This log is a multi-line string that can be checked by
+the program author in order to understand how the verifier came to
+the conclusion that the BPF program is unsafe.
+The format of the output can change at any time as the verifier evolves.
+
+.B log_size
+size of user buffer. If the size of the buffer is not large enough to store all
+verifier messages, \-1 is returned and
+.I errno
+is set to ENOSPC.
+
+.B log_level
+verbosity level of the verifier. A value of zero means that the verifier will
+not provide a log.
+
+.TP
+.B close(prog_fd)
+will unload the BPF program.
+.P
+The maps are accessible from programs and used to exchange data between
+programs and between them and user space.
+Programs process various events (like kprobe, packets) and
+store their data into maps. User space fetches data from the maps.
+Either the same or a different map may be used by user space as a configuration
+space to alter program behavior on the fly.
+.SS Events
+.P
+Once a program is loaded, it can be attached to an event. Various kernel
+subsystems have different ways to do so. For example:
+
+.nf
+setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
+.fi
+will attach the program
+.I prog_fd
+to socket
+.I sock
+which was received from a prior call to socket().
+
+In the future
+.nf
+ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+.fi
+may attach the program
+.I prog_fd
+to perf event
+.I event_fd
+which was received by prior call to perf_event_open().
+
+.SH EXAMPLES
+.nf
+/* bpf+sockets example:
+ * 1. create array map of 256 elements
+ * 2. load program that counts number of packets received
+ *    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
+ *    map[r0]++
+ * 3. attach prog_fd to raw socket via setsockopt()
+ * 4. print number of received TCP/UDP packets every second
+ */
+int main(int ac, char **av)
+{
+    int sock, map_fd, prog_fd, key;
+    long long value = 0, tcp_cnt, udp_cnt;
+
+    map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value), 256);
+    if (map_fd < 0) {
+        printf("failed to create map '%s'\\n", strerror(errno));
+        /* likely not run as root */
+        return 1;
+    }
+
+    struct bpf_insn prog[] = {
+        BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),           /* r6 = r1 */
+        BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)), /* r0 = ip->proto */
+        BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),          /* r2 = fp */
+        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),         /* r2 = r2 - 4 */
+        BPF_LD_MAP_FD(BPF_REG_1, map_fd),              /* r1 = map_fd */
+        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),       /* r0 = map_lookup(r1, r2) */
+        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),         /* if (r0 == 0) goto pc+2 */
+        BPF_MOV64_IMM(BPF_REG_1, 1),                   /* r1 = 1 */
+        BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),  /* lock *(u64 *)r0 += r1 */
+        BPF_MOV64_IMM(BPF_REG_0, 0),                   /* r0 = 0 */
+        BPF_EXIT_INSN(),                               /* return r0 */
+    };
+
+    prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog), "GPL");
+
+    sock = open_raw_sock("lo");
+
+    assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) == 0);
+
+    for (;;) {
+        key = IPPROTO_TCP;
+        assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
+        key = IPPROTO_UDP
+        assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
+        printf("TCP %lld UDP %lld packets\n", tcp_cnt, udp_cnt);
+        sleep(1);
+    }
+
+    return 0;
+}
+.fi
+.SH RETURN VALUE
+For a successful call, the return value depends on the operation:
+.TP
+.B BPF_MAP_CREATE
+The new file descriptor associated with the BPF map.
+.TP
+.B BPF_PROG_LOAD
+The new file descriptor associated with the BPF program.
+.TP
+All other commands
+Zero.
+.PP
+On error, \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EPERM
+bpf() syscall was made without sufficient privilege
+(without the
+.B CAP_SYS_ADMIN
+capability).
+.TP
+.B ENOMEM
+Cannot allocate sufficient memory.
+.TP
+.B EBADF
+.I fd
+is not an open file descriptor
+.TP
+.B EFAULT
+One of the pointers (
+.I key
+or
+.I value
+or
+.I log_buf
+or
+.I insns
+) is outside the accessible address space.
+.TP
+.B EINVAL
+The value specified in
+.I cmd
+is not recognized by this kernel.
+.TP
+.B EINVAL
+For
+.BR BPF_MAP_CREATE ,
+either
+.I map_type
+or attributes are invalid.
+.TP
+.B EINVAL
+For
+.BR BPF_MAP_*_ELEM
+commands,
+some of the fields of "union bpf_attr" that are not used by this command
+are not set to zero.
+.TP
+.B EINVAL
+For
+.BR BPF_PROG_LOAD,
+indicates an attempt to load an invalid program. BPF programs can be deemed
+invalid due to unrecognized instructions, the use of reserved fields, jumps
+out of range, infinite loops or calls of unknown functions.
+.TP
+.BR EACCES
+For
+.BR BPF_PROG_LOAD,
+even though all program instructions are valid, the program has been
+rejected because it was deemed unsafe. This may be because it may have
+accessed a disallowed memory region or an uninitialized stack/register or
+because the function contraints don't match the actual types or because
+there was a misaligned memory access.
+In such case it is recommended to call bpf() again with
+.I log_level = 1
+and examine
+.I log_buf
+for the specific reason provided by the verifier.
+.TP
+.BR ENOENT
+For
+.B BPF_MAP_LOOKUP_ELEM
+or
+.B BPF_MAP_DELETE_ELEM,
+indicates that the element with the given
+.I key
+was not found.
+.TP
+.BR E2BIG
+program is too large or
+a map reached
+.I max_entries
+limit (max number of elements).
+.SH NOTES
+These commands may be used only by a privileged process (one having the
+.B CAP_SYS_ADMIN
+capability).
+.SH SEE ALSO
+Both classic and extended BPF are explained in Documentation/networking/filter.txt