packet.7: Document fanout, ring, and auxiliary options

The packet socket manual page does not list all socket options.

This patch adds descriptions of the common packet socket options
  PACKET_AUXDATA, PACKET_FANOUT, PACKET_RX_RING, PACKET_STATISTICS,
  PACKET_TX_RING

and the ring-specific options
  PACKET_LOSS, PACKET_RESERVE, PACKET_TIMESTAMP, PACKET_VERSION

It does not yet add descriptions for
  PACKET_COPY_THRESH, PACKET_HDRLEN, PACKET_ORIGDEV,
  PACKET_TX_HAS_OFF, PACKET_TX_TIMESTAMP, PACKET_VNET_HDR

It tries to balance being informative with exposing kernel detail
that is unlikely to be used by most readers or that may change
frequently. For implementation details, the manpage points to the
documentation in kernel Documentation/networking. Let me know if
options should be added or removed.

Source: PACKET_FANOUT, PACKET_RX_RING and PACKET_VERSION are in
tools/testing/selftests/net/psock_fanout.c in the latest Linux kernel source
tree. PACKET_STATISTICS was in the first version of that test.
PACKET_TX_RING I have used elsewhere. The other options are based
on reading kernel code.

 [Very minor fixups. -dborkman]

Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Willem de Bruijn 2013-12-06 12:18:49 -05:00 committed by Michael Kerrisk
parent 65b1e0ed19
commit dbb4f7516b
1 changed files with 210 additions and 9 deletions

@ -177,17 +177,22 @@ and
.I sll_ifindex
are used.
.SS Socket options
Packet socket options are configured by calling
.BR setsockopt (2)
with level
.BR SOL_PACKET .
.TP
.BR PACKET_ADD_MEMBERSHIP
.PD 0
.TP
.BR PACKET_DROP_MEMBERSHIP
.PD
Packet sockets can be used to configure physical layer multicasting
and promiscuous mode.
.B PACKET_ADD_MEMBERSHIP
adds a binding and
.B PACKET_DROP_MEMBERSHIP
drops it.
They both expect a
.B packet_mreq
structure as argument:
@ -227,11 +232,207 @@ In addition the traditional ioctls
.BR SIOCADDMULTI ,
.B SIOCDELMULTI
can be used for the same purpose.
.TP
.BR PACKET_AUXDATA " (since Linux 2.6.21)"
.\" commit 8dc4194474159660d7f37c495e3fc3f10d0db8cc
If this binary option is enabled, the packet socket passes a metadata
structure along with each packet in the
.BR recvmsg (2)
control field.
The structure can be read with
.BR cmsg (3).
It is defined as
.in +4n
.nf
struct tpacket_auxdata {
    __u32 tp_status;
    __u32 tp_len;      /* packet length */
    __u32 tp_snaplen;  /* captured length */
    __u16 tp_mac;
    __u16 tp_net;
    __u16 tp_vlan_tci;
    __u16 tp_padding;
};
.fi
.in
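As an illustration of reading this structure with the
.BR cmsg (3)
macros, the following is a minimal sketch, not a definitive implementation.
The helper name
.I find_auxdata
is hypothetical, and the sketch assumes that
.I msg
was filled in by
.BR recvmsg (2)
on a packet socket with
.B PACKET_AUXDATA
enabled.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263  /* value used by Linux; normally from <sys/socket.h> */
#endif

/*
 * Hypothetical helper: walk the control messages attached to msg and
 * copy the first tpacket_auxdata structure found into *out.
 * Returns 0 on success, or -1 if no such control message is present.
 */
int find_auxdata(struct msghdr *msg, struct tpacket_auxdata *out)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_PACKET &&
            cmsg->cmsg_type == PACKET_AUXDATA &&
            cmsg->cmsg_len >= CMSG_LEN(sizeof(*out))) {
            /* Copy out rather than cast: CMSG_DATA may be unaligned. */
            memcpy(out, CMSG_DATA(cmsg), sizeof(*out));
            return 0;
        }
    }
    return -1;
}
```

The caller is assumed to have supplied a control buffer of at least
.I CMSG_SPACE(sizeof(struct tpacket_auxdata))
bytes in
.IR msg_control .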
.TP
.BR PACKET_FANOUT " (since Linux 3.1)"
.\" commit dc99f600698dcac69b8f56dda9a8a00d645c5ffc
To scale processing across threads, packet sockets can form a fanout
group.
In this mode, each matching packet is enqueued onto only one
socket in the group.
A socket joins a fanout group by calling
.BR setsockopt (2)
with level
.B SOL_PACKET
and option
.BR PACKET_FANOUT .
Each network namespace can have up to 65536 independent groups.
A socket selects a group by encoding the ID in the low 16 bits of
the integer option value.
The first packet socket to join a group implicitly creates it.
To successfully join an existing group, subsequent packet sockets
must have the same protocol, device settings and fanout mode and
flags (see below).
Packet sockets can leave a fanout group only by closing the socket.
The group is deleted when the last socket is closed.
Fanout supports multiple algorithms to spread traffic between sockets.
The default mode,
.BR PACKET_FANOUT_HASH ,
sends packets from the same flow to the same socket to maintain
per-flow ordering.
For each packet, it chooses a socket by taking the packet flow hash
modulo the number of sockets in the group, where a flow hash is a hash
over network layer address and optional transport layer port fields.
The load balance mode
.BR PACKET_FANOUT_LB
implements a round-robin algorithm.
.BR PACKET_FANOUT_CPU
selects the socket based on the CPU that the packet arrived on.
.BR PACKET_FANOUT_ROLLOVER
processes all data on a single socket, moving to the next socket when
that one becomes backlogged.
.BR PACKET_FANOUT_RND
selects the socket using a pseudo random number generator.
Fanout modes can take additional options.
IP fragmentation causes packets from the same flow to have different
flow hashes.
The flag
.BR PACKET_FANOUT_FLAG_DEFRAG ,
if set, causes packets to be defragmented before fanout is applied, to
preserve order even in this case.
Fanout mode and options are communicated in the upper 16 bits of the
integer option value.
The flag
.BR PACKET_FANOUT_FLAG_ROLLOVER
enables the rollover mechanism as a backup strategy: if the
original fanout algorithm selects a backlogged socket, the packet
rolls over to the next available one.
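The encoding of the option value and the socket selection in hash mode
can be sketched as follows.
This is an illustration under stated assumptions, not the kernel
implementation: the helper names
.I fanout_arg
and
.I fanout_hash_index
are hypothetical, the group ID is assumed to occupy the least
significant 16 bits, and the modulo step only mirrors the description
given above.

```c
#include <stdint.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: build the PACKET_FANOUT option value.
 * The group ID goes in the low 16 bits; the fanout mode, OR-ed with
 * any PACKET_FANOUT_FLAG_* flags, goes in the high 16 bits.
 */
uint32_t fanout_arg(uint16_t group_id, uint16_t mode_and_flags)
{
    return (uint32_t)group_id | ((uint32_t)mode_and_flags << 16);
}

/*
 * Hypothetical illustration of PACKET_FANOUT_HASH selection as
 * described in the text: flow hash modulo the number of sockets.
 * num_sockets must be nonzero.
 */
unsigned int fanout_hash_index(uint32_t flow_hash, unsigned int num_sockets)
{
    return flow_hash % num_sockets;
}
```

A socket would then join the group by passing a pointer to the returned
value, with its size, to
.BR setsockopt (2)
with level
.B SOL_PACKET
and option
.BR PACKET_FANOUT .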
.TP
.BR PACKET_LOSS " (with PACKET_TX_RING)"
If set, do not silently drop a packet on transmission error, but
return it with status set to
.BR TP_STATUS_WRONG_FORMAT .
.TP
.BR PACKET_RESERVE " (with PACKET_RX_RING)"
By default, a packet receive ring writes packets immediately following the
metadata structure and alignment padding.
This integer option reserves additional headroom.
.TP
.BR PACKET_RX_RING
Create a memory mapped ring buffer for asynchronous packet reception.
The packet socket reserves a contiguous region of application address
space, lays it out into an array of packet slots and copies packets
(up to
.IR tp_snaplen )
into successive slots.
Each packet is preceded by a metadata structure similar to
.IR tpacket_auxdata .
The protocol fields encode the offset to the data
from the start of the metadata header.
.I tp_net
stores the offset to the network layer.
If the packet socket is of type
.BR SOCK_DGRAM ,
then
.I tp_mac
is the same.
If it is of type
.BR SOCK_RAW ,
then that field stores the offset to the link layer frame.
The packet socket and the application communicate the head and tail of the ring
through the
.I tp_status
field.
The packet socket owns all slots with status
.BR TP_STATUS_KERNEL .
After filling a slot, it changes the status of the slot to transfer
ownership to the application.
During normal operation, the new status is
.BR TP_STATUS_USER ,
to signal that a correctly received packet has been stored.
When the application has finished processing a packet, it transfers
ownership of the slot back to the socket by setting the status to
.BR TP_STATUS_KERNEL .
Packet sockets implement multiple variants of the packet ring.
The implementation details are described in
.IR Documentation/networking/packet_mmap.txt
in the Linux kernel source tree.
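The ownership handoff through
.I tp_status
can be sketched without a real socket.
The following is a simplified model, not a definitive implementation:
.I struct slot
and
.I drain_ring
are hypothetical names, only the status word of the per-slot metadata
header is modelled, and the memory barriers that a real ring consumer
needs are omitted.

```c
#include <stddef.h>
#include <linux/if_packet.h>

/* Hypothetical stand-in for one ring slot: just the status word. */
struct slot {
    volatile unsigned long tp_status;
};

/*
 * Hypothetical consumer loop: starting at *head, process every slot
 * that the socket has handed to the application (TP_STATUS_USER set),
 * then return it to the socket by writing TP_STATUS_KERNEL.
 * Stops at the first slot still owned by the socket.
 * Returns the number of slots processed and advances *head.
 */
size_t drain_ring(struct slot *ring, size_t nslots, size_t *head)
{
    size_t done = 0;

    while (done < nslots && (ring[*head].tp_status & TP_STATUS_USER)) {
        /* A real application would read the packet data here. */
        ring[*head].tp_status = TP_STATUS_KERNEL;
        *head = (*head + 1) % nslots;
        done++;
    }
    return done;
}
```

In a real program the slots live in the region mapped with
.BR mmap (2),
and their layout depends on the ring variant in use.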
.TP
.BR PACKET_STATISTICS
Retrieve packet socket statistics in the form of a structure
.in +4n
.nf
struct tpacket_stats {
    unsigned int tp_packets;  /* total packet count */
    unsigned int tp_drops;    /* dropped packet count */
};
.fi
.in
Receiving statistics resets the internal counters.
The statistics structure differs when using a ring of variant
.BR TPACKET_V3 .
.TP
.BR PACKET_TIMESTAMP " (with PACKET_RX_RING)"
.\" commit 614f60fa9d73a9e8fdff3df83381907fea7c5649
The packet receive ring always stores a timestamp in the metadata header.
By default, this is a software-generated timestamp taken when the
packet is copied into the ring.
This integer option selects the type of timestamp.
Besides the default, it supports the two hardware formats described in
.IR Documentation/networking/timestamping.txt
in the Linux kernel source tree.
.TP
.BR PACKET_TX_RING " (since Linux 2.6.31)"
.\" commit 69e3c75f4d541a6eb151b3ef91f34033cb3ad6e1
Create a memory mapped ring buffer for packet transmission.
This option is similar to
.BR PACKET_RX_RING
and takes the same arguments.
The application writes packets into slots with status
.BR TP_STATUS_AVAILABLE
and schedules them for transmission by changing the status to
.BR TP_STATUS_SEND_REQUEST .
When packets are ready to be transmitted, the application calls
.BR send (2)
or a variant thereof.
The
.I buf
and
.I len
arguments of this call are ignored.
If an address is passed using
.BR sendto (2)
or
.BR sendmsg (2),
then that overrides the socket default.
On successful transmission, the socket resets the slot to
.BR TP_STATUS_AVAILABLE .
It discards packets silently on error unless
.BR PACKET_LOSS
is set.
.TP
.BR PACKET_VERSION " (with PACKET_RX_RING)"
.\" commit bbd6ef87c544d88c30e4b762b1b61ef267a7d279
By default,
.BR PACKET_RX_RING
creates a packet receive ring of variant
.BR TPACKET_V1 .
To create another variant, configure the desired variant by setting this
integer option before creating the ring.
.SS Ioctls
.B SIOCGSTAMP
can be used to receive the timestamp of the last received packet.
Argument is a
.I struct timeval
variable.
.\" FIXME Document SIOCGSTAMPNS
In addition all standard ioctls defined in
@ -318,7 +519,7 @@ header to get a fully conforming packet.
Incoming 802.3 packets are not multiplexed on the DSAP/SSAP protocol
fields; instead they are supplied to the user as protocol
.B ETH_P_802_2
with the LLC header prefixed.
It is thus not possible to bind to
.BR ETH_P_802_3 ;
bind to