
<HTML>
<META NAME="TtH" CONTENT="1.03">

<p>
                  <H1><A NAME="tth_chAp10">Chapter 10<br>Networks</A></H1>
<A NAME="networks-chapter"></A>
<p>
<A NAME="network-chapter"></A><img src="../logos/sit3-bw-tran.1.gif"><br> Networking and Linux are terms that are almost synonymous.
In a very real sense Linux is a product of the Internet and the World Wide Web (WWW).
Its developers and users use the web to exchange information, ideas and code, and Linux itself
is often used to support the networking needs of organizations.
This chapter describes how Linux supports the network protocols known collectively as
TCP/IP.

<p>
The TCP/IP protocols were designed to support communications between computers connected to the
ARPANET, an American research network funded by the US government.
The ARPANET pioneered networking concepts such as packet switching and protocol layering
where one protocol uses the services of another.
ARPANET was retired in 1990 but its successors (NSFNET<a href="#tthFtNtAAB" name=tthFrefAAB><sup>1</sup></a> and
the Internet) have grown even larger.
The World Wide Web runs on the Internet, which grew from the ARPANET, and is itself supported by the
TCP/IP protocols.
Unix<sup>TM</sup>&nbsp;was extensively used on the ARPANET and the first released networking version of
Unix<sup>TM</sup>&nbsp;was 4.3 BSD.
Linux's networking implementation is modeled on 4.3 BSD in that it supports BSD sockets (with
some extensions) and the full range of TCP/IP networking.
This programming interface was chosen because of its popularity and to help applications
be portable between Linux and other Unix<sup>TM</sup>&nbsp;platforms.

<p>

<H2><A NAME="tth_sEc10.1">10.1&nbsp;</A> An Overview of TCP/IP Networking</H2>
This section gives an overview of the main principles of TCP/IP networking.
It is not meant to be an exhaustive description; for that, consult a dedicated TCP/IP reference text.

In an IP network every machine is assigned an IP address;
this is a 32 bit number that uniquely identifies the machine.
The Internet is a very large, and growing, IP network and every machine that is connected to it has to have
a unique IP address assigned to it.
IP addresses are represented by four numbers separated by dots, for example, <tt>16.42.0.9</tt>.
This IP address is actually in two parts, the <em>network</em> address and the <em>host</em> address.
The sizes of these parts may vary (there are several classes of IP addresses)
but using <tt>16.42.0.9</tt> as an example, the network address would
be <tt>16.42</tt> and the host address <tt>0.9</tt>.
The host address is further subdivided into a <em>subnetwork</em> and a <em>host</em> address.
Again, using <tt>16.42.0.9</tt> as an example, the subnetwork address would be <tt>16.42.0</tt> and the host
address <tt>16.42.0.9</tt>.
This subdivision of the IP address allows organizations to subdivide their networks.
For example, <tt>16.42</tt> could be the network address of the ACME Computer Company; <tt>16.42.0</tt> would
be subnet <tt>0</tt> and <tt>16.42.1</tt> would be subnet <tt>1</tt>.
These subnets might be in separate buildings, perhaps connected by leased telephone lines or even
microwave links.
IP addresses are assigned by the network administrator and having IP subnetworks is a good way
of distributing the administration of the network.
IP subnet administrators are free to allocate IP addresses within their IP subnetworks.
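<p>
As a rough illustration of the split described above, the sketch below separates an address into its network, subnet and host parts.
The 16 bit network part and 8 bit subnet part are assumptions taken from the <tt>16.42.0.9</tt> example (other networks may use different sizes), and <tt>split_address</tt> is a hypothetical helper name, not part of Linux.

```python
import ipaddress


def split_address(addr, network_bits=16, subnet_bits=8):
    """Split a dotted IPv4 address into (network, subnet, host number).

    The bit widths are assumptions matching the 16.42.0.9 example.
    """
    value = int(ipaddress.IPv4Address(addr))
    network_mask = (0xFFFFFFFF << (32 - network_bits)) & 0xFFFFFFFF
    subnet_mask = (0xFFFFFFFF << (32 - network_bits - subnet_bits)) & 0xFFFFFFFF
    network = ipaddress.IPv4Address(value & network_mask)
    subnet = ipaddress.IPv4Address(value & subnet_mask)
    host = value & ~subnet_mask & 0xFFFFFFFF  # bits left over for the host
    return str(network), str(subnet), host


print(split_address("16.42.0.9"))
print(split_address("16.42.1.7"))
```

<p>
The two calls show hosts on subnet <tt>0</tt> and subnet <tt>1</tt> of the same <tt>16.42</tt> network.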

<p>
Generally though, IP addresses are somewhat hard to remember.
Names are much easier.
<tt>linux.acme.com</tt> is much easier to remember than <tt>16.42.0.9</tt> but there must be some mechanism
to convert network names into IP addresses.
These names can be statically specified in the <tt>/etc/hosts</tt> file or Linux can ask a Domain Name System (DNS)
server to resolve the name for it.
In this case the local host must know the IP address of one or more DNS servers and these are specified
in <tt>/etc/resolv.conf</tt>.
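<p>
The static case can be sketched as a simple table lookup.
The example below parses text in the usual <tt>/etc/hosts</tt> layout (an address followed by one or more names); the sample entries and the <tt>parse_hosts</tt> helper are made up for illustration, and a real resolver would fall back to a DNS query when the name is not found.

```python
def parse_hosts(text):
    """Build a name -> address table from /etc/hosts style text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name, address)   # first entry for a name wins
    return table


# Hypothetical file contents, using the chapter's example host.
SAMPLE = """
# address     names
127.0.0.1     localhost
16.42.0.9     linux.acme.com linux
"""

hosts = parse_hosts(SAMPLE)
print(hosts.get("linux.acme.com"))
```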

<p>
Whenever you connect to another machine, say when reading a web page, its IP address is used to
exchange data with that machine.
This data is contained in IP packets, each of which has an IP header containing the
IP addresses of the source and destination machines, a checksum and other useful information.
The checksum is computed over the IP header and allows the receiver of IP packets to tell if the
header was corrupted during transmission, perhaps by a noisy telephone line.
The data transmitted by an application may have been broken down into smaller packets which are easier
to handle.
The size of the IP data packets varies depending on the connection media; ethernet packets are generally
bigger than PPP packets.
The destination host must reassemble the data packets before giving the data to the receiving application.
You can see this fragmentation and reassembly of data graphically if you access a web page containing a lot of
graphical images via a moderately slow serial link.
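<p>
To make the checksum idea concrete, here is a minimal sketch of the 16 bit one's complement sum used for the IPv4 header checksum.
The function name and the sample header bytes are illustrative choices, not taken from the Linux sources.

```python
def internet_checksum(data: bytes) -> int:
    """16 bit one's complement checksum over the given bytes."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole 16 bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# A sample 20 byte IPv4 header with the checksum field set to zero.
header = bytes.fromhex("4500 0073 0000 4000 4011 0000 c0a8 0001 c0a8 00c7")
checksum = internet_checksum(header)

# The sender stores the checksum in bytes 10 and 11 of the header.
filled = header[:10] + checksum.to_bytes(2, "big") + header[12:]
print(hex(checksum), internet_checksum(filled))
```

<p>
A receiver runs the same sum over the header as received; a result of zero means the header arrived intact.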

<p>
Hosts connected to the same IP subnet can send IP packets directly to each other; all other IP packets
are sent to a special host, a gateway.
Gateways (or routers) are connected to more than one IP subnet and they forward IP packets
received on one subnet, but destined for another, onwards.
For example, if subnets <tt>16.42.1.0</tt> and <tt>16.42.0.0</tt> are connected together by a gateway then
any packets sent from subnet <tt>0</tt> to subnet <tt>1</tt> would have to be directed to the gateway so that it
could route them.
The local host builds up routing tables which allow it to route IP packets to the correct machine.
For every IP destination there is an entry in the routing tables which tells Linux which host to send
IP packets to in order that they reach their destination.
These routing tables are dynamic and change over time as applications use the network and as the network
topology changes.
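<p>
The routing decision can be sketched as a lookup that prefers the most specific matching entry.
This is only an illustration: the table layout, the gateway address <tt>16.42.0.1</tt> and the <tt>next_hop</tt> helper are assumptions, and the kernel's real routing tables are considerably more involved.

```python
import ipaddress

# (destination network, gateway); a gateway of None means "directly attached".
# The gateway address below is a hypothetical example.
ROUTES = [
    (ipaddress.ip_network("16.42.0.0/24"), None),
    (ipaddress.ip_network("0.0.0.0/0"), "16.42.0.1"),
]


def next_hop(destination):
    """Return the address the packet should be handed to next."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in ROUTES if dest in net]
    _net, gateway = max(matches, key=lambda route: route[0].prefixlen)
    return destination if gateway is None else gateway


print(next_hop("16.42.0.20"))  # same subnet: deliver directly
print(next_hop("16.42.1.5"))   # other subnet: send to the gateway
```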

<p>

<p><A NAME="tth_fIg10.1"></A>
<center><center> <img src="protocols.gif"><br>
<p>
</center></center><center>      Figure 10.1: TCP/IP Protocol Layers</center>
<A NAME="protocols-figure"></A>
<p>
<p>The IP protocol layer is used by other protocols to carry their data.
The Transmission Control Protocol (TCP) is a reliable end to end protocol that uses IP to transmit
and receive its own packets.
Just as IP packets have their own header, TCP has its own header.
TCP is a connection based protocol where two networking applications are connected by a single,
virtual connection even  though there may be many subnetworks, gateways and routers between
them.
TCP reliably transmits and receives data between the two applications and guarantees that there will
be no lost or duplicated data.
When TCP transmits its packet using IP, the data contained within the IP packet is the TCP packet itself.
The IP layer on each communicating host is responsible for transmitting and receiving IP packets.
User Datagram Protocol (UDP) also uses the IP layer to transport its packets; unlike TCP, UDP is not
a reliable protocol but offers a datagram service.
This use of IP by other protocols means that when IP packets are received the receiving IP layer must
know which upper protocol layer to give the data contained in this IP packet to.
To facilitate this, every IP packet header has a byte containing a protocol identifier.
When TCP asks the IP layer to transmit an IP packet, that IP packet's header states that it contains
a TCP packet.
The receiving IP layer uses that protocol identifier to decide which layer to pass the received data
up to, in this case the TCP layer.
When applications communicate via TCP/IP they must specify not only the target's IP address but also the
<em>port</em> address of the application.
A port address uniquely identifies an application and standard network applications use standard port
addresses; for example, web servers use port 80.
These registered port addresses can be seen in <tt>/etc/services</tt>.
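<p>
The role of the protocol identifier can be sketched by unpacking a minimal 20 byte IPv4 header.
The field layout and the protocol numbers (1 for ICMP, 6 for TCP, 17 for UDP) are the standard ones; the <tt>parse_ip_header</tt> helper and the sample packet are illustrative only.

```python
import struct

PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}


def parse_ip_header(packet: bytes):
    """Decode the fixed part of an IPv4 header and return its payload."""
    (version_ihl, _tos, _total_length, _ident, _flags_frag,
     _ttl, protocol, _checksum, src, dst) = struct.unpack(
        "!BBHHHBBH4s4s", packet[:20])
    header_len = (version_ihl & 0x0F) * 4    # header length in bytes
    return {
        "protocol": PROTOCOLS.get(protocol, "unknown"),
        "src": ".".join(map(str, src)),
        "dst": ".".join(map(str, dst)),
        "payload": packet[header_len:],
    }


# A made-up packet: protocol byte 6 says the payload belongs to TCP.
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 24, 0, 0, 64, 6, 0,
                     bytes([16, 42, 0, 9]), bytes([16, 42, 1, 1])) + b"data"
info = parse_ip_header(sample)
print(info["protocol"], info["src"], info["dst"])
```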

<p>
This layering of protocols does not stop with TCP, UDP and IP.
The IP protocol layer itself uses many different physical media to transport IP packets to other
IP hosts.
These media may themselves add their own protocol headers.
One such example is the ethernet layer, but PPP and SLIP are others.
An ethernet network allows many hosts to be simultaneously connected to a single physical
cable.
Every transmitted ethernet frame can be seen by all connected hosts and so every ethernet
device has a unique address.
Any ethernet frame transmitted to that address will be received by the addressed host but
ignored by all the other hosts connected to the network.
A unique address is built into each ethernet device when it is manufactured and it
is usually kept in an SROM<a href="#tthFtNtAAC" name=tthFrefAAC><sup>2</sup></a> on the ethernet card.
Ethernet addresses are 6 bytes long, an example would be <tt>08-00-2b-00-49-A4</tt>.
Some ethernet addresses are reserved for multicast purposes and ethernet frames sent with
these destination addresses will be received by all hosts on the network.
As ethernet frames can carry many different protocols (as data) they, like IP packets,
contain a protocol identifier in their headers.
This allows the ethernet layer to correctly receive IP packets and to pass them onto the
IP layer.
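<p>
A short sketch of how a receiver might pick apart an ethernet header follows.
It assumes the common layout of a 6 byte destination address, a 6 byte source address and a 2 byte protocol identifier (<tt>0x0800</tt> for IP, <tt>0x0806</tt> for ARP); the helper names and the sample frame are invented for the example.

```python
import struct

ETHERTYPES = {0x0800: "IP", 0x0806: "ARP"}


def format_mac(raw: bytes) -> str:
    """Render a 6 byte hardware address as dash separated hex."""
    return "-".join(f"{byte:02x}" for byte in raw)


def parse_ethernet(frame: bytes):
    """Return (destination, source, protocol name, payload)."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    protocol = ETHERTYPES.get(ethertype, hex(ethertype))
    return format_mac(dst), format_mac(src), protocol, frame[14:]


# Made-up frame addressed to the chapter's example address.
frame = (bytes.fromhex("08002b0049a4") + bytes.fromhex("08002b000001")
         + b"\x08\x00" + b"payload")
parsed = parse_ethernet(frame)
print(parsed)
```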

<p>
In order to send an IP packet via a multi-connection protocol such as ethernet, the IP layer
must find the ethernet address of the IP host.
This is because IP addresses are simply an addressing concept; the ethernet devices themselves
have their own physical addresses.
IP addresses on the other hand can be assigned and reassigned by network administrators at will
but the network hardware responds only to ethernet frames with its own physical address or to special
multicast addresses which all machines must receive.
Linux uses the Address Resolution Protocol (or ARP) to allow machines to translate
IP addresses into real hardware addresses such as ethernet addresses.
A host wishing to know the hardware address associated with an IP address sends an ARP request packet
containing the IP address that it wants translated to all nodes on the network by sending it to
a multicast address.
The target host that owns the IP address responds with an ARP reply that contains its physical
hardware address.
ARP is not restricted to ethernet devices; it can resolve IP addresses for other physical
media, for example FDDI.
Those network devices that cannot ARP are marked so that Linux does not attempt to ARP.
There is also the reverse function, Reverse ARP or RARP, which translates physical network
addresses into IP addresses.
This is used by gateways, which respond to ARP requests on behalf of IP addresses that are in the
remote network.
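<p>
The request and reply exchange can be modelled in a few lines.
This is a toy simulation rather than the kernel's ARP code: the <tt>Host</tt> class, its method names and the addresses are all hypothetical, and the "network" is simply a list that every request is shown to.

```python
class Host:
    """A toy host with one IP address and one hardware address."""

    def __init__(self, ip, mac):
        self.ip = ip
        self.mac = mac
        self.arp_cache = {}   # IP address -> hardware address

    def handle_arp_request(self, target_ip):
        # Only the owner of the IP address answers.
        return self.mac if target_ip == self.ip else None

    def resolve(self, target_ip, network):
        if target_ip in self.arp_cache:
            return self.arp_cache[target_ip]
        for host in network:              # the request reaches every node
            reply = host.handle_arp_request(target_ip)
            if reply is not None:
                self.arp_cache[target_ip] = reply
                return reply
        return None                       # nobody owns that address


a = Host("16.42.0.9", "08-00-2b-00-49-a4")
b = Host("16.42.0.10", "08-00-2b-00-49-a5")
network = [a, b]
print(a.resolve("16.42.0.10", network))
```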

<p>

<H2><A NAME="tth_sEc10.2">10.2&nbsp;</A> The Linux TCP/IP Networking Layers</H2>

<p><A NAME="tth_fIg10.2"></A>
<p>

<center><center> <img src="layers.gif"><br>
<p>
</center></center><center>      Figure 10.2: Linux Networking Layers</center>
<A NAME="layers-figure"></A>
<p>
<p>Just like the network protocols themselves,
Figure&nbsp;<A href="#layers-figure"
> 10.2</A> shows that Linux implements the internet protocol address family
as a series of connected layers of software.
BSD sockets are supported by generic socket management software concerned only with BSD sockets.
Supporting this is the INET socket layer, which
manages the communication end points for the IP based protocols TCP and UDP.
UDP (User Datagram Protocol) is a connectionless protocol whereas TCP (Transmission Control Protocol) is a
reliable end to end protocol.
When UDP packets are transmitted, Linux neither knows nor cares if they arrive safely at their destination.
TCP packets are numbered and both ends of the TCP connection make sure that transmitted data is received
correctly.
The IP layer contains code implementing the Internet Protocol.
This code prepends IP headers to transmitted data and understands how to route incoming IP packets to either
the TCP or UDP layers.
Underneath the IP layer, supporting all of Linux's networking are the network devices, for example PPP
and ethernet.
Network devices do not always represent physical devices; some like the loopback device are purely
software devices.
Unlike standard Linux devices that are created via the <font face="helvetica">mknod</font>
command, network devices appear only if the underlying software has found and initialized them.
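Network devices have no device special files in <tt>/dev</tt>; you will only see a device such as <tt>eth0</tt>
(for example, in the output of <font face="helvetica">ifconfig</font>) when you have built a kernel with the
appropriate ethernet device driver in it.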
The ARP protocol sits between the IP layer and the protocols that support ARPing for addresses.

<p>

<H2><A NAME="tth_sEc10.3">10.3&nbsp;</A> The BSD Socket Interface</H2>

<p>
This is a general interface which not only supports various forms of networking but is also an
inter-process communications mechanism.
A socket describes one end of a communications link; two communicating processes would each have
a socket describing their end of the communication link between them.
Sockets could be thought of as a special case of pipes but, unlike pipes, sockets have no limit
on the amount of data that they can contain.
Linux supports several classes of socket and these are known as <em>address families</em>.
This is because each class has its own method of addressing its communications.
Linux supports the following socket address families or domains:

<p>

<table>
<tr><td>UNIX</td><td>Unix domain sockets</td></tr>
<tr><td>INET</td><td>The Internet address family, which supports communications via TCP/IP protocols</td></tr>
<tr><td>AX25</td><td>Amateur radio X25</td></tr>
<tr><td>IPX</td><td>Novell IPX</td></tr>
<tr><td>APPLETALK</td><td>Appletalk DDP</td></tr>
<tr><td>X25</td><td>X25</td></tr>
</table>


<p>
There are several socket types and these represent the type of service that supports the connection.
Not all address families support all types of service.
Linux BSD sockets support a number of socket types:

<DL compact>
<p>
	<dt><b>Stream</b></dt><dd> These sockets provide reliable two way sequenced data streams with a guarantee
		that data cannot be lost, corrupted or duplicated in transit.  Stream sockets are
		supported by the TCP protocol of the Internet (INET) address family.
	<dt><b>Datagram</b></dt><dd> These sockets also provide two way data transfer but, unlike stream sockets,
		there is no guarantee that the messages will arrive.  Even if they do arrive there is
		no guarantee that they will arrive in order or even not be duplicated or corrupted.
		This type of socket is supported by the UDP protocol of the Internet address family.
	<dt><b>Raw</b></dt><dd> This allows processes direct (hence "raw") access to the underlying protocols.
		It is, for example, possible to open a raw socket to an ethernet device and see
		raw IP data traffic.
	<dt><b>Reliable Delivered Messages</b></dt><dd> These are very like datagram sockets but the data is
		guaranteed to arrive.
	<dt><b>Sequenced Packets</b></dt><dd> These are like stream sockets except that the data packet sizes
		are fixed.
	<dt><b>Packet</b></dt><dd> This is not a standard BSD socket type; it is a Linux specific extension that
		 allows processes to access packets directly at the device level.
</DL>
<p>
Processes that communicate using sockets use a client server model.
A server provides a service and clients make use of that service.
One example would be a Web Server, which provides web pages and a web client, or browser, which reads
those pages.
A server using sockets first creates a socket and then binds a name to it.
The format of this name is dependent on the socket's address family and it is, in effect, the local
address of the server.
The socket's name or address is specified using the <tt>sockaddr</tt> data structure.
An INET socket would have an IP port address bound to it.
The registered port numbers can be seen in <tt>/etc/services</tt>; for example, the port number for
a web server is 80.
Having bound an address to the socket, the server then listens for incoming connection requests
specifying the bound address.
The originator of the request, the client, creates a socket and makes a connection request on it,
specifying the target address of the server.
For an INET socket the address of the server is its IP address and its port number.
These incoming requests must find their way up through the various protocol layers and then
wait on the server's listening socket.
Once the server has received the incoming request it either accepts or rejects it.
If the incoming request is to be accepted, the server must create a new socket to accept it
on.
Once a socket has been used for listening for incoming connection requests it cannot be used
to support a connection.
With the connection established both ends are free to send and receive data.
Finally, when the connection is no longer needed it can be shut down.
Care is taken to ensure that data packets in transit are correctly dealt with.
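<p>
The sequence of create, bind, listen, connect, accept, transfer and close can be walked through from user space.
The sketch below uses Python's <tt>socket</tt> module, which wraps the BSD socket calls, and keeps both ends in one process on the loopback address; binding to port 0 (so that a free port is chosen) and the <tt>echo_once</tt> helper are choices made for this example only.

```python
import socket


def echo_once(message: bytes) -> bytes:
    # Server: create a socket, bind an address to it and listen.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))        # port 0: let the system pick one
    server.listen(1)
    server.settimeout(5.0)
    address = server.getsockname()

    # Client: create a socket and request a connection to the server.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5.0)
    client.connect(address)

    # Server: accepting yields a new socket for this one connection.
    connection, _peer = server.accept()
    connection.settimeout(5.0)

    # Both ends are now free to send and receive data.
    client.sendall(message)
    received = connection.recv(1024)
    connection.sendall(received.upper())
    reply = client.recv(1024)

    # Finally the connection is shut down.
    for sock in (connection, client, server):
        sock.close()
    return reply


print(echo_once(b"hello"))
```

<p>
Note that the listening socket is never used to carry data; the socket returned by the accept call does that.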

<p>
The exact meaning of operations on a BSD socket depends on its underlying address family.
Setting up TCP/IP connections is very different from setting up an amateur radio X.25 connection.
Like the virtual filesystem, Linux abstracts the socket interface with the BSD socket layer being
concerned with the BSD socket interface to the application programs which is in turn supported by
independent address family specific software.
At kernel initialization time, the address families built into the kernel register themselves with the
BSD socket interface.
Later on, as applications create and use BSD sockets, an association is made between the BSD
socket and its supporting address family.
This association is made via cross-linking data structures and tables of address family specific
support routines.
For example there is an address family specific socket creation routine which the BSD socket
interface uses when an application creates a new socket.

<p>
When the kernel is configured, a number of address families and protocols are built into the <tt>protocols</tt>
 vector.
Each is represented by its name, for example "INET", and the address of its initialization routine.
When the socket interface is initialized at boot time each protocol's initialization routine is called.
For the socket address families this results in them registering a set of protocol operations.
This is a set of routines, each of which performs a particular operation specific to that address
family.
The registered protocol operations are kept in the <tt>pops</tt> vector, a vector
of pointers to <tt>proto_ops</tt> data structures.

<p>
The <tt>proto_ops</tt> data structure consists of the address family type and a set of pointers to socket
operation routines specific to a particular address family.
The <tt>pops</tt> vector is indexed by the address family identifier, for example the Internet address
family identifier (AF_INET is 2).
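<p>
The registration scheme amounts to a table of operations indexed by address family.
The following is a loose model in Python rather than the kernel's C: the <tt>ProtoOps</tt> class, the <tt>sock_register</tt> and <tt>sock_create</tt> helpers, the table size and the value used for the Unix family are assumptions made for illustration.

```python
AF_UNIX, AF_INET = 1, 2   # AF_INET is 2, as noted above; AF_UNIX assumed
NPROTO = 16               # assumed size of the table


class ProtoOps:
    """A family identifier plus the routines that implement it."""

    def __init__(self, family, create):
        self.family = family
        self.create = create


pops = [None] * NPROTO    # one slot per address family


def sock_register(ops):
    pops[ops.family] = ops


def sock_create(family, sock_type):
    ops = pops[family] if 0 <= family < NPROTO else None
    if ops is None:
        raise OSError("address family not supported")
    return ops.create(sock_type)   # family specific creation routine


# Each family registers itself at initialization time.
sock_register(ProtoOps(AF_INET, lambda t: f"INET socket, type {t}"))
sock_register(ProtoOps(AF_UNIX, lambda t: f"UNIX socket, type {t}"))

print(sock_create(AF_INET, "stream"))
```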

<p>

<p><A NAME="tth_fIg10.3"></A>
<center><center> <img src="sockets.gif"><br>
<p>
</center></center><center>      Figure 10.3: Linux BSD Socket Data Structures</center>
<A NAME="sockets-figure"></A>
<p>
<p>
<H2><A NAME="tth_sEc10.4">10.4&nbsp;</A> The INET Socket Layer</H2>

<p>
The INET socket layer supports the internet address family which contains the TCP/IP protocols.
As discussed above, these protocols are layered, one protocol using the services of another.
Linux's TCP/IP code and data structures reflect this layering.
Its interface with the BSD socket layer is through the set of Internet address family socket operations
which it registers with the BSD socket layer during network initialization.
These are kept in the <tt>pops</tt> vector along with the other registered address
families.
The BSD socket layer calls the INET layer socket support routines from the registered INET <tt>proto_ops</tt>
data structure to perform work for it.
For example a BSD socket create request that gives the address family as INET will use the underlying
INET socket create function.
The BSD socket layer passes the <tt>socket</tt> data structure representing the BSD socket to the INET layer
in each of these operations.
Rather than clutter the BSD <tt>socket</tt> with TCP/IP specific information, the INET socket layer uses
its own data structure, the <tt>sock</tt>,
 which it links to the BSD <tt>socket</tt>
data structure.
This linkage can be seen in Figure&nbsp;<A href="#sockets-figure"
> 10.3</A>.
It links the <tt>sock</tt> data structure to the BSD <tt>socket</tt> data structure using the <tt>data</tt> pointer
in the BSD <tt>socket</tt>.
This means that subsequent INET socket calls can easily retrieve the <tt>sock</tt> data structure.
The <tt>sock</tt> data structure's protocol operations pointer is also set up at creation time and it depends
on the  protocol requested.
If TCP is requested, then the <tt>sock</tt> data structure's protocol operations pointer will point to
the set of TCP protocol operations needed for a TCP connection.

<p>

<H3><A NAME="tth_sEc10.4.1">10.4.1&nbsp;</A> Creating a BSD Socket</H3>

<p>
The system call to create a new socket passes identifiers for its address family,
socket type and protocol.

<p>
Firstly the requested address family  is used to search the <tt>pops</tt> vector for a matching
address family.
It may be that a particular address family is implemented as a kernel module and, in this case,
the <tt>kerneld</tt> daemon must load the module before we can continue.
A new <tt>socket</tt> data structure is allocated to represent the BSD socket.
Actually the <tt>socket</tt> data structure is physically part of the VFS <tt>inode</tt> data structure
and allocating a socket really means allocating a VFS <tt>inode</tt>.
This may seem strange unless you consider that sockets can be operated on in just the same way
that ordinary files can.
As all files are represented by a VFS <tt>inode</tt> data structure, then in order to support file
operations, BSD sockets must also be represented by a VFS <tt>inode</tt> data structure.

<p>
The newly created BSD <tt>socket</tt> data structure contains a pointer to the address family
specific socket routines and this is set to the <tt>proto_ops</tt> data structure retrieved from
the <tt>pops</tt> vector.
Its type is set to the socket type requested; one of SOCK_STREAM, SOCK_DGRAM and so on.
The address family specific creation routine is called using the address kept in the
<tt>proto_ops</tt> data structure.

<p>
A free file descriptor is allocated from the current process's <tt>fd</tt> vector
and the <tt>file</tt> data structure that it points at is initialized.
This includes setting the file operations pointer to point to the set of BSD socket file
operations supported by the BSD socket interface.
Any future operations will be directed to the socket interface
and it will in turn pass them to the supporting address family by calling its address family operation
routines.

<p>

<H3><A NAME="tth_sEc10.4.2">10.4.2&nbsp;</A> Binding an Address to an INET BSD Socket</H3>

<p>
In order to be able to listen for incoming internet connection requests, each server must create an
INET BSD socket and bind its address to it.
The bind operation is mostly handled within the INET socket layer with some support from the
underlying TCP and UDP protocol layers.
A socket that is having an address bound to it cannot be in use for any other communication.
This means that the <tt>socket</tt>'s state must be <tt>TCP_CLOSE</tt>.
The <tt>sockaddr</tt> passed to the bind operation contains the IP address to be bound to and, optionally,
a port number.
Normally the IP address bound to would be one that has been assigned to a network device
that supports the INET address family and whose interface is up and able to be used.
You can see which network interfaces are currently active in the system by using the
<font face="helvetica">ifconfig</font> command.
The IP address may also be the IP broadcast address of either all 1's or all 0's.
These are special addresses that mean "send to everybody"<a href="#tthFtNtAAD" name=tthFrefAAD><sup>3</sup></a>.
The IP address could also be specified as any IP address if the machine is acting as a transparent
proxy or firewall, but only processes with superuser privileges can bind to any IP address.
The IP address bound to is saved in the <tt>sock</tt> data structure in the <tt>recv_addr</tt> and
<tt>saddr</tt> fields.
These are used in hash lookups and as the sending IP address respectively.
The port number is optional and if it is not specified the supporting network
is asked for a free one.
By convention, port numbers less than 1024 cannot be used by processes without superuser privileges.
If the underlying network does allocate a port number it always allocates ones greater than
1024.

<p>
As packets are being received by the underlying network devices they must be routed to the correct
INET and BSD sockets so that they can be processed.
For this reason UDP and TCP maintain hash tables which are used to look up the addresses
within incoming IP messages and direct them to the correct <tt>socket</tt>/<tt>sock</tt> pair.
TCP is a connection oriented protocol and so there is more information involved in processing
TCP packets than there is in processing UDP packets.

<p>
UDP maintains a hash table of allocated UDP ports, the <tt>udp_hash</tt>
table.
This consists of pointers to <tt>sock</tt> data structures indexed by a hash function based
on the port number.
As the UDP hash table is much smaller than the number of permissible port numbers
(<tt>udp_hash</tt> is only 128 or <tt>UDP_HTABLE_SIZE</tt> entries long)
some entries in the table point to a chain of <tt>sock</tt> data structures linked together using
each <tt>sock</tt>'s <tt>next</tt> pointer.
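<p>
A chained hash table of this kind can be modelled as follows.
The table size of 128 and the <tt>next</tt> pointer come from the description above; the hash function (the low bits of the port number) and the helper names are assumptions for the sake of the sketch.

```python
UDP_HTABLE_SIZE = 128


class Sock:
    """Stand-in for the INET sock data structure."""

    def __init__(self, port):
        self.port = port
        self.next = None      # next sock in the same hash chain


udp_hash = [None] * UDP_HTABLE_SIZE


def udp_hashfn(port):
    # Assumed hash: keep the low bits of the port number.
    return port & (UDP_HTABLE_SIZE - 1)


def udp_insert(sk):
    slot = udp_hashfn(sk.port)
    sk.next = udp_hash[slot]  # link in at the head of the chain
    udp_hash[slot] = sk


def udp_lookup(port):
    sk = udp_hash[udp_hashfn(port)]
    while sk is not None and sk.port != port:
        sk = sk.next          # walk the chain until the port matches
    return sk


# Ports 53 and 181 differ by 128, so they share one slot and are chained.
udp_insert(Sock(53))
udp_insert(Sock(181))
print(udp_hashfn(53), udp_hashfn(181), udp_lookup(181).port)
```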

<p>
TCP is much more complex as it maintains several hash tables.
However, TCP does not actually add the binding <tt>sock</tt> data structure into its hash tables
during the bind operation; it merely checks that the port number requested is not currently
being used.
The <tt>sock</tt> data structure is added to TCP's hash tables during the <em>listen</em> operation.


<p>

<H3><A NAME="tth_sEc10.4.3">10.4.3&nbsp;</A> Making a Connection on an INET BSD Socket</H3>
Once a socket has been created, and provided it has not been used to listen for inbound connection
requests, it can be used to make outbound connection requests.
For connectionless protocols like UDP this socket operation does not do a whole lot but for connection
oriented protocols like TCP it involves building a virtual circuit between two applications.

<p>
An outbound connection can only be made on an INET BSD socket that is in the right state; that is to say
one that  does not already have a connection established and one that is not being used for listening
for inbound connections.
This means that the BSD <tt>socket</tt> data structure must be in state <tt>SS_UNCONNECTED</tt>.
The UDP protocol does not establish virtual connections between applications; any messages sent
are datagrams, one-off messages that may or may not reach their destinations.
It does, however, support the <em>connect</em> BSD socket operation.
A connection operation on a UDP INET BSD socket simply sets up the addresses of the remote application;
its IP address and its IP port number.
Additionally it sets up a cache of the routing table entry so that UDP packets sent on this BSD socket
do not need to check the routing database again (unless this route becomes invalid).
The cached routing information is pointed at from the <tt>ip_route_cache</tt> pointer in the
INET <tt>sock</tt> data structure.
If no addressing information is given, this cached routing and IP addressing information will
automatically be used for messages sent using this BSD socket.
UDP moves the <tt>sock</tt>'s state to <tt>TCP_ESTABLISHED</tt>.
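<p>
From an application's point of view, connecting a UDP socket just fixes the default destination.
The sketch below shows this on the loopback address using Python's <tt>socket</tt> module; the helper name and the use of port 0 for the receiver are choices made for the example.

```python
import socket


def udp_connect_demo():
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))      # port 0: any free port
    receiver.settimeout(5.0)

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # No packet is exchanged here; the remote address is simply recorded.
    sender.connect(receiver.getsockname())
    sender.send(b"ping")                 # no destination needed any more

    data, source = receiver.recvfrom(1024)
    same_source = source == sender.getsockname()
    sender.close()
    receiver.close()
    return data, same_source


print(udp_connect_demo())
```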

<p>
For a connect operation on a TCP BSD socket, TCP must build a TCP message containing the connection
information and send it to the IP destination given.
The TCP message contains information about the connection, a unique starting message sequence number,
the maximum sized message that can be managed by the initiating host, the transmit and receive window
size and so on.
Within TCP all messages are numbered and the initial sequence number is used as the first message
number.
Linux chooses a reasonably random value to avoid malicious protocol attacks.
Every message transmitted by one end of the TCP connection and successfully received by the other is
acknowledged to say that it arrived successfully and uncorrupted.
Unacknowledged messages will be retransmitted.
The transmit and receive window size is the number of outstanding messages that there can be without an
acknowledgement being sent.
The maximum message size is based on the network device that is being used at the initiating end of the
request.
If the receiving end's network device supports smaller maximum message sizes then the connection will
use the minimum of the two.
The application making the outbound TCP connection request must now wait for a response from the
target application to accept or reject the connection request.
As the TCP <tt>sock</tt> is now expecting incoming messages, it is added to the <tt>tcp_listening_hash</tt>
so that incoming TCP messages can be directed to this <tt>sock</tt> data structure.
TCP also starts timers so that the outbound connection request can be timed out if the target application
does not respond to the request.
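<p>
Two of the ideas above, agreeing on the smaller maximum message size and retransmitting whatever has not been acknowledged, can be sketched in a simplified model.
It follows this section in numbering whole messages from the initial sequence number; the function names and the sizes used are illustrative assumptions, not values from the Linux sources.

```python
def negotiate_mss(local_mss, remote_mss):
    """The connection uses the smaller of the two maximum sizes."""
    return min(local_mss, remote_mss)


def make_segments(data, mss, initial_seq):
    """Cut data into numbered messages of at most mss bytes each."""
    chunks = [data[i:i + mss] for i in range(0, len(data), mss)]
    return {initial_seq + n: chunk for n, chunk in enumerate(chunks)}


def to_retransmit(segments, acknowledged):
    """Sequence numbers that were sent but never acknowledged."""
    return sorted(seq for seq in segments if seq not in acknowledged)


mss = negotiate_mss(1460, 536)
segments = make_segments(b"x" * 1200, mss, initial_seq=1000)
pending = to_retransmit(segments, acknowledged={1000, 1002})
print(mss, sorted(segments), pending)
```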

<p>

<H3><A NAME="tth_sEc10.4.4">10.4.4&nbsp;</A> Listening on an INET BSD Socket</H3>
Once a socket has had an address bound to it, it may listen for incoming connection
requests specifying the bound addresses.
A network application can listen on a socket without first binding an address to it; in this
case the INET socket layer finds an unused port number (for this protocol) and automatically
binds it to the socket.
The listen socket function moves the socket into state <tt>TCP_LISTEN</tt> and does any
network specific work needed to allow incoming connections.

<p>
For UDP sockets, changing the socket's state is enough but TCP now adds the socket's
<tt>sock</tt> data structure into two hash tables as it is now active.
These are the <tt>tcp_bound_hash</tt> table and the
<tt>tcp_listening_hash</tt>.
Both are indexed via a hash function based on the IP port number.

<p>
Whenever an incoming TCP connection request is received for an active listening socket,
TCP builds a new <tt>sock</tt> data structure to represent it.
This <tt>sock</tt> data structure will become the bottom half of the TCP connection when it is
eventually accepted.
It also clones the incoming <tt>sk_buff</tt> containing the connection request and queues it onto
the <tt>receive_queue</tt> for the listening <tt>sock</tt> data structure.
The clone <tt>sk_buff</tt> contains a pointer to the newly created <tt>sock</tt> data structure.

<p>

<H3><A NAME="tth_sEc10.4.5">10.4.5&nbsp;</A> Accepting Connection Requests</H3>
UDP does not support the concept of connections, so
accepting INET socket connection requests only applies to the TCP protocol.
An accept operation on a listening socket causes a new <tt>socket</tt> data structure to
be cloned from the original listening <tt>socket</tt>.
The accept operation is then passed to the supporting protocol layer, in this case
INET, to accept any incoming connection requests.
The INET protocol layer will fail the accept operation if the underlying protocol, say
UDP, does not support connections.
Otherwise the accept operation is passed through to the real protocol, in this case TCP.
The accept operation can be either blocking or non-blocking.
In the non-blocking case if there are no incoming connections to accept, the accept operation
will fail and the newly created <tt>socket</tt> data structure will be thrown away.
In the blocking case the network application performing the accept operation will be added to
a wait queue and then suspended until a TCP connection request is received.
Once a connection request has been received the <tt>sk_buff</tt> containing the request is
discarded and the <tt>sock</tt> data structure is returned to the INET socket layer where
it is linked to the new <tt>socket</tt> data structure created earlier.
The file descriptor (<tt>fd</tt>) number of the new <tt>socket</tt> is returned to the network application,
and the application can then use that file descriptor in socket operations on the newly created
INET BSD socket.
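
<p>
The accept path described above can be modelled in a few lines of Python.
This is only a sketch under stated assumptions, not kernel code: the class names
<tt>Sock</tt> and <tt>Socket</tt> and the functions <tt>connection_request</tt> and <tt>accept</tt>
are hypothetical stand-ins for the <tt>sock</tt> and <tt>socket</tt> data structures and the
TCP and INET code, and the blocking wait queue is deliberately left out.

```python
from collections import deque


class Sock:
    """Stands in for the protocol-level sock data structure."""

    def __init__(self, peer=None):
        self.peer = peer
        self.receive_queue = deque()   # pending connection requests


class Socket:
    """Stands in for the BSD-level socket data structure."""

    def __init__(self, sk=None):
        self.sk = sk


def connection_request(listener, peer):
    """TCP side: build a new sock and queue the request on the listener."""
    new_sk = Sock(peer)
    # The queued request carries a pointer to the newly created sock.
    listener.sk.receive_queue.append({"request_from": peer, "sk": new_sk})


def accept(listener, nonblocking=True):
    """INET side: link a queued sock to a freshly created socket."""
    new_socket = Socket()              # cloned from the listening socket
    queue = listener.sk.receive_queue
    if not queue:
        if nonblocking:
            return None                # new_socket is thrown away
        raise NotImplementedError("blocking wait is not modelled")
    request = queue.popleft()          # the request buffer is discarded
    new_socket.sk = request["sk"]      # link the sock to the new socket
    return new_socket
```

Calling <tt>accept</tt> on a listener with an empty queue returns <tt>None</tt>, which mirrors the
non-blocking failure case; once <tt>connection_request</tt> has queued a request, the next
<tt>accept</tt> returns a socket whose <tt>sk</tt> field is the <tt>Sock</tt> built for that request.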

<p>

<H2><A NAME="tth_sEc10.5">10.5&nbsp;</A> The IP Layer</H2>

<H3><A NAME="tth_sEc10.5.1">10.5.1&nbsp;</A> Socket Buffers</H3>

<p>
One of the problems of having many layers of network protocols, each one using the services of
another, is that each protocol needs to
add protocol headers and tails to data as it is transmitted and to remove them as
it processes received data.
This makes passing data buffers between the protocols difficult as each layer needs to find where its
particular protocol headers and tails are.
One solution is to copy buffers at each layer but that would be inefficient.
Instead, Linux uses socket buffers or <tt>sk_buffs</tt> to pass data between the protocol layers and
the network device drivers.
<tt>sk_buffs</tt> contain pointer and length fields that allow each protocol layer to manipulate
the application data via standard functions or ``methods''.

<p>

<p><A NAME="tth_fIg10.4"></A>
<center><center> <img src="sk_buff.gif"><br>
<p>
</center></center><center>      Figure 10.4: The Socket Buffer (sk_buff)</center>
<A NAME="skbuff-figure"></A>
<p>
<p>Figure&nbsp;<A href="#skbuff-figure"
> 10.4</A> shows the <tt>sk_buff</tt>
 data structure;
each <tt>sk_buff</tt> has a block of data associated with it.
The <tt>sk_buff</tt> has four data pointers, which are used to manipulate and manage the socket buffer's
data:

<DL compact>
<p>
	<dt><b>head</b></dt><dd> points to the start of the data area in memory.  This is fixed when the
		<tt>sk_buff</tt> and its associated data block are allocated,
	<dt><b>data</b></dt><dd> points at the current start of the protocol data.  This pointer varies depending
		on the protocol layer that currently owns the <tt>sk_buff</tt>,
	<dt><b>tail</b></dt><dd> points at the current end of the protocol data.  Again, this pointer
		varies depending on the owning protocol layer,
	<dt><b>end</b></dt><dd> points at the end of the data area in memory.  This is fixed when the <tt>sk_buff</tt>
		is allocated.
</DL>
<p>
There are two length fields <tt>len</tt> and <tt>truesize</tt>, which describe the length of the
current protocol packet and the total size of the data buffer respectively.
The <tt>sk_buff</tt> handling code provides standard mechanisms for adding and removing protocol headers
and tails to the application data.
These safely manipulate the <tt>data</tt>, <tt>tail</tt> and <tt>len</tt> fields in the <tt>sk_buff</tt>:

<DL compact>
<p>
	<dt><b>push</b></dt><dd> This moves the <tt>data</tt> pointer towards the start of the data area and
		increments the <tt>len</tt> field.  This is used when adding data or protocol headers
		to the start of the data to be transmitted,


<p>
	<dt><b>pull</b></dt><dd> This moves the <tt>data</tt> pointer away from the start, towards the end of
		the data area and decrements the <tt>len</tt> field.
		This is used when removing data or protocol headers from the start of the data
		that has been received,


<p>
	<dt><b>put</b></dt><dd> This moves the <tt>tail</tt> pointer towards the end of the data area and increments
		the <tt>len</tt> field.  This is used when adding data or protocol information to the
		end of the data to be transmitted,


<p>
	<dt><b>trim</b></dt><dd> This moves the <tt>tail</tt> pointer towards the start of the data area and
		decrements the <tt>len</tt> field.
		This is used when removing data or protocol tails from the received packet.


<p>
</DL>The <tt>sk_buff</tt> data structure also contains pointers that are used as it is stored in doubly linked
circular lists of <tt>sk_buff</tt>'s during processing.
There are generic <tt>sk_buff</tt> routines for adding <tt>sk_buffs</tt> to the front and back of these lists
and for removing them.
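
<p>
The four pointer operations described above can be illustrated with a small Python model.
It is a sketch, not the kernel's implementation: the class <tt>SkBuff</tt>, its <tt>headroom</tt>
argument and the <tt>packet</tt> helper are assumptions made for this example, and <tt>trim</tt>
here takes the number of bytes to remove, which is a simplification chosen for the sketch.

```python
class SkBuff:
    """Toy model of an sk_buff: four offsets into one fixed data area."""

    def __init__(self, size, headroom=0):
        self.buf = bytearray(size)   # the data area, fixed at allocation
        self.head = 0                # start of the data area (fixed)
        self.end = size              # end of the data area (fixed)
        self.data = headroom         # current start of protocol data
        self.tail = headroom         # current end of protocol data
        self.len = 0                 # length of the current protocol packet

    def put(self, payload):
        """Append bytes at the tail (adding data or a protocol tail)."""
        if self.tail + len(payload) > self.end:
            raise ValueError("put past end of data area")
        self.buf[self.tail:self.tail + len(payload)] = payload
        self.tail += len(payload)
        self.len += len(payload)

    def push(self, header):
        """Prepend bytes before data (adding a protocol header)."""
        if self.data - len(header) < self.head:
            raise ValueError("push past head of data area")
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header
        self.len += len(header)

    def pull(self, n):
        """Strip n bytes from the front (removing a received header)."""
        if n > self.len:
            raise ValueError("pull beyond packet length")
        removed = bytes(self.buf[self.data:self.data + n])
        self.data += n
        self.len -= n
        return removed

    def trim(self, n):
        """Strip n bytes from the back (removing a received tail)."""
        if n > self.len:
            raise ValueError("trim beyond packet length")
        self.tail -= n
        self.len -= n

    def packet(self):
        """Return the bytes currently between data and tail."""
        return bytes(self.buf[self.data:self.tail])


skb = SkBuff(size=64, headroom=16)
skb.put(b"hello")        # application data
skb.push(b"UDP:")        # stand-in for a transport header
skb.push(b"IP:")         # stand-in for a network header
```

No bytes are copied between layers: each layer only moves <tt>data</tt> or <tt>tail</tt> within the
same data area, which is the point of the design. The headroom reserved at allocation is what
leaves space for headers to be pushed in front of the application data.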

<p>

<H3><A NAME="tth_sEc10.5.2">10.5.2&nbsp;</A> Receiving IP Packets</H3>
The <A href="../dd/drivers.html">device drivers chapter</A> described how Linux's network drivers are built into the kernel
and initialized.
This results in a series of <tt>device</tt> data structures linked together in the <tt>dev_base</tt>
list.
Each <tt>device</tt> data structure describes its device and provides a set of callback routines
that the network protocol layers call when they need the network driver to perform work.
These functions are mostly concerned with transmitting data and with the network device's
addresses.
When a network device receives packets from its network it must convert the received
data into <tt>sk_buff</tt> data structures.
These received <tt>sk_buff</tt>'s are added onto the <tt>backlog</tt> queue
by the network drivers as they are received.

<p>
If the <tt>backlog</tt> queue grows too large, then the
received <tt>sk_buff</tt>'s are discarded.
The network bottom half is flagged as ready to run as there is work to do.

<p>
When the network bottom half handler is run by the scheduler, it processes any network packets
waiting to be transmitted
before processing the <tt>backlog</tt> queue of <tt>sk_buff</tt>'s, determining
which protocol layer to pass the received packets to.

<p>
As the Linux networking layers were initialized, each protocol registered itself by adding a
<tt>packet_type</tt> data structure onto either the <tt>ptype_all</tt> list or
into the <tt>ptype_base</tt> hash table.
The <tt>packet_type</tt> data structure contains the protocol type, a pointer to a network device, a pointer
to the protocol's receive data processing routine and, finally, a pointer to the next
<tt>packet_type</tt> data structure in the list or hash chain.
The <tt>ptype_all</tt> chain is used to snoop all packets being received from any network device
and is not normally used.
The <tt>ptype_base</tt> hash table is hashed by protocol identifier and is used to decide
which protocol should receive the incoming network packet.
The network bottom half matches the protocol types of incoming <tt>sk_buff</tt>'s against
one or more of the <tt>packet_type</tt> entries in either table.
The protocol may match more than one entry, for example when snooping all network traffic, and
in this case the <tt>sk_buff</tt> will be cloned.
The <tt>sk_buff</tt> is passed to the matching protocol's handling routine.
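
<p>
The registration and dispatch scheme described above might be sketched as follows.
The function names <tt>register_packet_type</tt> and <tt>run_bottom_half</tt> are hypothetical,
a Python dictionary stands in for the <tt>ptype_base</tt> hash table, and a byte copy stands in
for cloning an <tt>sk_buff</tt>. The two protocol identifiers are the standard EtherType values
for IPv4 and ARP.

```python
from collections import defaultdict

ETH_P_IP = 0x0800    # EtherType for IPv4
ETH_P_ARP = 0x0806   # EtherType for ARP

ptype_all = []                    # handlers that snoop every packet
ptype_base = defaultdict(list)    # protocol identifier -> handlers


def register_packet_type(handler, proto=None):
    """Register a handler; proto=None means 'snoop all packets'."""
    if proto is None:
        ptype_all.append(handler)
    else:
        ptype_base[proto].append(handler)


def run_bottom_half(backlog):
    """Drain the backlog, handing each packet to every matching handler."""
    while backlog:
        proto, payload = backlog.pop(0)
        handlers = ptype_all + ptype_base.get(proto, [])
        for handler in handlers:
            # Each handler gets its own copy, standing in for a cloned sk_buff.
            handler(proto, bytes(payload))


seen = []
register_packet_type(lambda p, d: seen.append(("ip", d)), ETH_P_IP)
register_packet_type(lambda p, d: seen.append(("arp", d)), ETH_P_ARP)
register_packet_type(lambda p, d: seen.append(("snoop", d)))
run_bottom_half([(ETH_P_IP, b"datagram"), (ETH_P_ARP, b"who-has")])
```

After the run, <tt>seen</tt> holds two records for each packet: one from the snooping handler
and one from the handler registered for that packet's protocol.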

<p>

<H3><A NAME="tth_sEc10.5.3">10.5.3&nbsp;</A> Sending IP Packets</H3>
Packets are transmitted by applications exchanging data or else they are generated by
the network protocols as they support established connections or connections being
established.
Whichever way the data is generated, an <tt>sk_buff</tt> is built to contain the data
and various headers are added by  the protocol layers as it passes through them.

<p>
The <tt>sk_buff</tt> needs to be passed to a network device to be transmitted.
First though the protocol, for example IP,  needs to decide which network device to
use.
This depends on the best route for the packet.
For computers connected by modem to a single network, say via the PPP protocol, the
routing choice is easy.
The packet should either be sent to the local host via the loopback device or to the
gateway at the end of the PPP modem connection.
For computers connected to an ethernet the choices are harder as there are many computers
connected to the network.

<p>
For every IP packet transmitted, IP uses the routing tables to resolve the route for the
destination IP address.
Each IP destination successfully looked up in the routing tables returns a <tt>rtable</tt>
data structure describing the route to use.
This includes the source IP address to use, the address of the network <tt>device</tt> data
structure and, sometimes, a prebuilt hardware header.
This hardware header is network device specific and contains the source and destination
physical addresses and other media specific information.
If the network device is an ethernet device, the hardware header would be as shown in
Figure&nbsp;<A href="#protocols-figure"
> 10.1</A> and the source and destination addresses would be physical
ethernet addresses.
The hardware header is cached with the route because it must be prepended to each IP
packet transmitted on this route and constructing it takes time.
The hardware header may contain physical addresses that have to be resolved using the
ARP protocol.
In this case the outgoing packet is stalled until the address has been resolved.
Once it has been resolved and the hardware header built, the hardware header is cached
so that future IP packets sent using this interface do not have to ARP.

<p>

<H3><A NAME="tth_sEc10.5.4">10.5.4&nbsp;</A> Data Fragmentation</H3>

<p>
Every network device has a maximum packet size and it cannot transmit or receive a
data packet bigger than this.
The IP protocol allows for this and will fragment data into smaller units to fit into
the packet size that the network device can handle.
The IP protocol header includes a fragment field which contains a flag and the fragment
offset.

<p>
When an IP packet is ready to be transmitted,
IP finds the network device to send the IP packet out on.
This device is found from the IP routing tables.
Each <tt>device</tt> has a field describing its maximum transfer unit (in bytes); this is
the <tt>mtu</tt> field.
If the device's mtu is smaller than the packet size of the IP packet that is waiting to be
transmitted, then the IP packet must be broken down into smaller (mtu sized) fragments.
Each fragment is represented by an <tt>sk_buff</tt>; its IP header is marked to show that it
is a fragment and what offset into the data this fragment contains.
The last packet is marked as being the last IP fragment.
If, during the fragmentation, IP cannot allocate an <tt>sk_buff</tt>, the transmit will
fail.
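
<p>
As a rough illustration of the splitting step, the sketch below cuts a payload into
mtu sized pieces. The function name <tt>fragment</tt> and the fixed 20 byte header length are
assumptions for this example. The one detail not stated above is that IPv4 carries the
fragment offset in units of 8 bytes, so every fragment except the last must hold a multiple
of 8 data bytes.

```python
def fragment(payload, mtu, header_len=20):
    """Split payload into (offset_units, more_fragments, data) tuples."""
    # Offsets are carried in units of 8 bytes, so every fragment except
    # the last must hold a multiple of 8 payload bytes.
    per_fragment = (mtu - header_len) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("mtu too small to carry any data")
    fragments = []
    offset = 0
    while offset < len(payload):
        chunk = payload[offset:offset + per_fragment]
        more = offset + len(chunk) < len(payload)   # False only on the last one
        fragments.append((offset // 8, more, chunk))
        offset += len(chunk)
    return fragments


frags = fragment(bytes(range(100)), mtu=60)
```

With an mtu of 60 and a 20 byte header, each fragment carries at most 40 data bytes, so the
100 byte payload becomes three fragments of 40, 40 and 20 bytes at offsets 0, 5 and 10.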

<p>
Receiving IP fragments is a little more difficult than sending them because the IP fragments
can be received in any order and they must all be received before they can
be reassembled.
Each time an IP packet is received
it is checked to see if it is an IP fragment.
The first time that a fragment of a message is received, IP creates a new <tt>ipq</tt> data
structure, and this is linked into the <tt>ipqueue</tt> list of IP fragments
awaiting recombination.
As more IP fragments are received, the correct <tt>ipq</tt> data structure is found and a new <tt>ipfrag</tt>
data structure is created to describe this fragment.
Each <tt>ipq</tt> data structure uniquely describes a fragmented IP receive frame with its source and
destination IP addresses, the upper layer protocol identifier and the identifier for this IP
frame.
When all of the fragments have been received, they are combined into a single <tt>sk_buff</tt> and
passed up to the next protocol level to be processed.
Each <tt>ipq</tt> contains a timer that is restarted each time a valid fragment is received.
If this timer expires, the <tt>ipq</tt> data structure and its <tt>ipfrag</tt>'s are dismantled and the message
is presumed to have been lost in transit.
It is then up to the higher level protocols to retransmit the message.
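
<p>
The reassembly bookkeeping can be sketched in the same spirit.
The class <tt>IpQueue</tt> and the function <tt>ip_receive_fragment</tt> are hypothetical names;
a dictionary keyed by source address, destination address, protocol and identifier stands in
for the <tt>ipqueue</tt> list. Offsets are taken to be in units of 8 bytes, as in IPv4.
The expiry timer and overlapping fragments are not modelled.

```python
class IpQueue:
    """Toy stand-in for an ipq: collects the fragments of one datagram."""

    def __init__(self):
        self.frags = {}          # byte offset -> data (one entry per ipfrag)
        self.total_len = None    # known once the last fragment arrives

    def add(self, offset_units, more_fragments, data):
        offset = offset_units * 8
        self.frags[offset] = data
        if not more_fragments:
            self.total_len = offset + len(data)

    def reassemble(self):
        """Return the whole payload, or None while pieces are missing."""
        if self.total_len is None:
            return None
        out = bytearray()
        for offset in sorted(self.frags):
            if offset != len(out):     # a hole precedes this fragment
                return None
            out += self.frags[offset]
        return bytes(out) if len(out) == self.total_len else None


ipqueue = {}   # (source, destination, protocol, identifier) -> IpQueue


def ip_receive_fragment(key, offset_units, more_fragments, data):
    """Record one fragment; return the payload once it is complete."""
    queue = ipqueue.setdefault(key, IpQueue())
    queue.add(offset_units, more_fragments, data)
    payload = queue.reassemble()
    if payload is not None:
        del ipqueue[key]               # finished with this datagram
    return payload
```

Fragments may be fed in any order: the function returns <tt>None</tt> until the final piece
fills the last hole, at which point the combined payload is returned and the queue entry is
removed.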

<p>

<H2><A NAME="tth_sEc10.6">10.6&nbsp;</A> The Address Resolution Protocol (ARP)</H2>
The Address Resolution Protocol's role is to provide translations of IP addresses into
physical hardware addresses such as ethernet addresses.
IP needs this translation just before it passes the data (in the form
of an <tt>sk_buff</tt>) to the device driver for transmission.

<p>
IP performs various checks to see if this device needs a hardware header and, if it
does, if the hardware header for the packet needs to be rebuilt.
Linux caches hardware headers to avoid frequent rebuilding of them.
If the hardware header needs rebuilding, it calls the device specific hardware header rebuilding
routine.
All ethernet devices use the same generic header rebuilding routine,
which in turn uses the ARP services to translate the destination IP address into a physical
address.

<p>
The ARP protocol itself is very simple and consists of two message types, an ARP request and
an ARP reply.
The ARP request contains the IP address that needs translating and the reply (hopefully)
contains the translation of that IP address, the hardware address.
The ARP request is broadcast to all hosts connected to the network, so, for an ethernet
network, all of the machines connected to the ethernet will see the ARP request.
The machine that owns the IP address in the request will respond to the ARP request with an
ARP reply containing its own physical address.

<p>
The ARP protocol layer in Linux is built around a table of <tt>arp_table</tt> data structures which
each describe an IP to physical address translation.
These entries are created as IP addresses need to be translated and removed as they become stale
over time.
Each <tt>arp_table</tt> data structure has the following fields:

<p>

<table>
<tr><td>last used</td><td>the time that this ARP entry was last used,</td></tr>
<tr><td>last updated</td><td>the time that this ARP entry was last updated,</td></tr>
<tr><td>flags</td><td>these describe this entry's state, if it is complete and so on,</td></tr>
<tr><td>IP address</td><td>the IP address that this entry describes,</td></tr>
<tr><td>hardware address</td><td>the translated hardware address,</td></tr>
<tr><td>hardware header</td><td>a pointer to a cached hardware header,</td></tr>
<tr><td>timer</td><td>a <tt>timer_list</tt> entry used to time out ARP requests that do not get a response,</td></tr>
<tr><td>retries</td><td>the number of times that this ARP request has been retried,</td></tr>
<tr><td><tt>sk_buff</tt> queue</td><td>list of <tt>sk_buff</tt> entries waiting for this IP address to be resolved.</td></tr>
</table>


<p>
The ARP table consists of a table of pointers (the <tt>arp_tables</tt> vector)
to chains of <tt>arp_table</tt> entries.
The entries are cached to speed up access to them; each entry is found by taking the last two
bytes of its IP address to generate an index into the table and then following the chain of
entries until the correct one is found.
Linux also caches prebuilt hardware headers off the <tt>arp_table</tt> entries in the form
of <tt>hh_cache</tt> data structures.

<p>
When an IP address translation is requested and there is no corresponding <tt>arp_table</tt> entry,
ARP must send an ARP request message.
It creates a new <tt>arp_table</tt> entry in the table and queues the <tt>sk_buff</tt> containing the
network packet that needs the address translation on the <tt>sk_buff</tt> queue of the new entry.
It sends out an ARP request and sets the ARP expiry timer running.
If there is no response then ARP will retry the request a number of times and if there is
still no response ARP will remove the <tt>arp_table</tt> entry.
Any <tt>sk_buff</tt> data structures queued waiting for the IP address to be translated will be
notified and it is up to the protocol layer that is transmitting them to cope with this failure.
UDP does not care about lost packets but TCP will attempt to retransmit on an established TCP
link.
If the owner of the IP address responds with its hardware address, the <tt>arp_table</tt> entry
is marked as complete and any queued <tt>sk_buff</tt>'s will be removed from the queue and
will go on to be transmitted.
The hardware address is written into the hardware header of each <tt>sk_buff</tt>.
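
<p>
The resolution sequence just described (queue the packet, send a request, retry, then either
flush the queue or give up) might look like this in outline.
All names here are hypothetical (<tt>ArpEntry</tt>, <tt>resolve_and_send</tt>, <tt>on_arp_reply</tt>,
<tt>on_arp_timeout</tt>), the retry limit of 3 is an assumed value, and plain lists stand in for
the network so that the behaviour can be observed.

```python
class ArpEntry:
    """Toy stand-in for an arp_table entry."""

    def __init__(self, ip):
        self.ip = ip
        self.hw_addr = None     # filled in when a reply arrives
        self.complete = False   # the flags field reduced to one bit
        self.retries = 0
        self.queue = []         # packets waiting for this address


MAX_RETRIES = 3     # assumed limit for this sketch
arp_table = {}      # ip -> ArpEntry
sent_requests = []  # record of ARP requests broadcast
transmitted = []    # record of (hardware address, packet) sent
failed = []         # packets abandoned after resolution failed


def resolve_and_send(ip, packet):
    """Transmit packet if ip is resolved, otherwise queue it and ask."""
    entry = arp_table.get(ip)
    if entry is not None and entry.complete:
        transmitted.append((entry.hw_addr, packet))
        return True
    if entry is None:
        entry = arp_table[ip] = ArpEntry(ip)
        sent_requests.append(ip)        # broadcast the first ARP request
    entry.queue.append(packet)
    return False


def on_arp_reply(ip, hw_addr):
    """A reply arrived: complete the entry and flush its queue."""
    entry = arp_table.get(ip)
    if entry is None:
        return
    entry.hw_addr, entry.complete = hw_addr, True
    while entry.queue:
        transmitted.append((hw_addr, entry.queue.pop(0)))


def on_arp_timeout(ip):
    """The expiry timer fired: retry, or give up and drop the entry."""
    entry = arp_table.get(ip)
    if entry is None or entry.complete:
        return
    if entry.retries < MAX_RETRIES:
        entry.retries += 1
        sent_requests.append(ip)
    else:
        failed.extend(entry.queue)      # the senders must cope with this
        del arp_table[ip]
```

Only the first packet for an unresolved address triggers a request; later packets simply join
the queue, and all of them go out together once the reply arrives.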

<p>
The ARP protocol layer must also respond to ARP requests that specify its IP address.
It registers its protocol type (<tt>ETH_P_ARP</tt>), generating a <tt>packet_type</tt> data structure.
This means that it will be passed all ARP packets that are received by the network devices.
As well as ARP replies, this includes ARP requests.
It generates an ARP reply using the hardware address kept in the receiving device's <tt>device</tt>
data structure.

<p>
Network topologies can change over time and IP addresses can be reassigned to different hardware
addresses.
For example, some dial up services assign an IP address as each connection is established.
In order that the ARP table contains up to date entries,
ARP runs a periodic timer which looks through all of the <tt>arp_table</tt> entries to see which have
timed out.
It is very careful not to remove entries that contain one or more cached hardware headers.
Removing these entries is dangerous as other data structures rely on them.
Some <tt>arp_table</tt> entries are permanent and these are marked so that they will not be deallocated.
The ARP table cannot be allowed to grow too large; each <tt>arp_table</tt> entry consumes some kernel memory.
Whenever a new entry needs to be allocated and the ARP table has reached its maximum size,
the table is pruned by searching out the oldest entries and removing them.

<p>

<H2><A NAME="tth_sEc10.7">10.7&nbsp;</A> IP Routing</H2>

<p>
The IP routing function determines where to send IP packets destined for a particular IP
address.
There are many choices to be made when transmitting IP packets.
Can the destination be reached at all?
If it can be reached, which network device should be used to transmit it?
If there is more than one network device that could be used to reach the destination, which is
the better one?
The IP routing database maintains information that gives answers to these questions.
There are two databases, the most important being the Forwarding Information Database.
This is an exhaustive list of known IP destinations and their best routes.
A smaller and much faster database, the <em>route cache</em> is used for quick lookups of routes
for IP destinations.
Like all caches, it must contain only the frequently accessed routes; its contents are derived
from the Forwarding Information Database.

<p>
Routes are added and deleted via IOCTL requests to the BSD socket interface.
These are passed onto the protocol to process.
The INET protocol layer only allows processes with superuser privileges to add and delete IP
routes.
These routes can be fixed or they can be dynamic and change over time.
Most systems use fixed routes unless they themselves are routers.
Routers run routing protocols which constantly check on the availability of routes to all known
IP destinations.
Systems that are not routers are known as end systems.
The routing protocols are implemented as daemons, for example GATED, and they
also add and delete routes via the IOCTL BSD socket interface.

<p>

<H3><A NAME="tth_sEc10.7.1">10.7.1&nbsp;</A> The Route Cache</H3>
Whenever an IP route is looked up, the route cache is first checked for a matching route.
If there is no matching route in the route cache the Forwarding Information Database is
searched for a route.
If no route can be found there, the IP packet will fail to be sent and the application notified.
If a route is in the Forwarding Information Database and not in the route cache, then a new
entry is generated and added into the route cache for this route.
The route cache is a table (<tt>ip_rt_hash_table</tt>)
that contains pointers to chains of <tt>rtable</tt> data structures.
The index into the route table is a hash function based on the least significant two bytes
of the IP address.
These are the two bytes most likely to be different between destinations and provide the best
spread of hash values.
Each <tt>rtable</tt> entry contains information about the route; the destination IP address, the
network <tt>device</tt> to use to reach that IP address, the maximum size of message that can
be used and so on.
It also has a reference count, a usage count and a timestamp of the last time that it was
used (in <tt>jiffies</tt>).
The reference count is incremented each time the route is used to show the
number of network connections using this route.
It is decremented as applications stop using the route.
The usage count is incremented each time the route is looked up and
is used to order the <tt>rtable</tt> entry in its chain of hash entries.
The last used timestamp for all of the entries in the route cache is periodically checked to see
if the <tt>rtable</tt> is too old.
If the route has not been recently used, it is discarded from the route cache.
If routes are kept in the route cache they are ordered so that the most used entries are at
the front of the hash chains.
This means that finding them will be quicker when routes are looked up.
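
<p>
A minimal model of the route cache, assuming a 256 entry table, a simple XOR of the two least
significant address bytes as the hash, and a 60 second age limit (none of these values come
from the kernel), could look like this. The names <tt>RouteEntry</tt>, <tt>RouteCache</tt> and
<tt>route_hash</tt> are hypothetical, and an exact-match dictionary stands in for the Forwarding
Information Database.

```python
import time

ROUTE_MAX_AGE = 60.0   # seconds; an assumed value for this sketch


class RouteEntry:
    """Toy stand-in for an rtable entry."""

    def __init__(self, dest, device):
        self.dest = dest
        self.device = device
        self.use_count = 0
        self.last_used = time.monotonic()


def route_hash(dest, size=256):
    """Hash on the least significant two bytes of a dotted-quad address."""
    octets = [int(part) for part in dest.split(".")]
    return (octets[2] ^ octets[3]) % size


class RouteCache:
    def __init__(self, fib):
        self.fib = fib                        # dest -> device, stands in for the FIB
        self.table = [[] for _ in range(256)]

    def lookup(self, dest):
        chain = self.table[route_hash(dest)]
        for entry in chain:
            if entry.dest == dest:
                break                         # cache hit
        else:
            device = self.fib.get(dest)       # cache miss: consult the FIB
            if device is None:
                return None                   # no route: the send fails
            entry = RouteEntry(dest, device)
            chain.append(entry)
        entry.use_count += 1
        entry.last_used = time.monotonic()
        chain.sort(key=lambda e: -e.use_count)    # most used entries first
        return entry.device

    def expire(self, now=None):
        """Discard entries that have not been used recently."""
        now = time.monotonic() if now is None else now
        for chain in self.table:
            chain[:] = [e for e in chain if now - e.last_used <= ROUTE_MAX_AGE]
```

Re-sorting the chain on every lookup is the simplest way to keep the most used entries at the
front; it is adequate for a sketch because the chains are short.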

<p>

<H3><A NAME="tth_sEc10.7.2">10.7.2&nbsp;</A> The Forwarding Information Database</H3>

<p><A NAME="tth_fIg10.5"></A>
<p>

<center><center> <img src="fib.gif"><br>
<p>
</center></center><center>      Figure 10.5: The Forwarding Information Database</center>
<A NAME="fib-figure"></A>
<p>
<p>The forwarding information database (shown in Figure&nbsp;<A href="#fib-figure"
> 10.5</A>) contains IP's view of the routes available
to this system at this time.
It is quite a complicated data structure and, although it is reasonably efficiently arranged, it is not a
quick database to consult.
In particular it would be very slow to look up destinations in this database for every IP packet
transmitted.
This is the reason that the route cache exists: to speed up IP packet transmission using known good
routes.
The route cache is derived from the forwarding database and represents its commonly used entries.

<p>
Each IP subnet is represented by a <tt>fib_zone</tt> data structure.
All of these are pointed at from the <tt>fib_zones</tt> hash table.
The hash index is derived from the IP subnet mask.
All routes to the same subnet are described by pairs of <tt>fib_node</tt> and <tt>fib_info</tt> data structures
queued onto the <tt>fz_list</tt>  of each <tt>fib_zone</tt> data structure.
If the number of routes in this subnet grows large, a hash table is generated to make finding the <tt>fib_node</tt>
data structures easier.

<p>
Several routes may exist to the same IP subnet and these routes can go through one of several gateways.
The IP routing layer does not allow more than one route to a subnet using the same gateway.
In other words, if there are several routes to a subnet, then each route is guaranteed to use a different
gateway.
Associated with each route is its <em>metric</em>.
This is a measure of how advantageous this route is.
A route's metric is, essentially, the number of IP subnets that it must hop across before it reaches
the destination subnet.
The higher the metric, the worse the route.
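
<p>
The zone and metric rules above can be sketched as follows.
This is an assumption-laden model, not the kernel's layout: nested dictionaries stand in for the
<tt>fib_zones</tt> table and the <tt>fib_node</tt> and <tt>fib_info</tt> pairs, the function names
<tt>add_route</tt> and <tt>fib_lookup</tt> are hypothetical, and the lookup is assumed to try the
most specific subnet mask first.

```python
import ipaddress

# prefix length -> {network address: [(metric, gateway, device), ...]}
fib_zones = {}


def add_route(network, gateway, device, metric):
    """Add a route, refusing a second route via the same gateway."""
    net = ipaddress.ip_network(network)
    zone = fib_zones.setdefault(net.prefixlen, {})
    routes = zone.setdefault(int(net.network_address), [])
    if any(gw == gateway for _, gw, _ in routes):
        raise ValueError("a route to this subnet via this gateway exists")
    routes.append((metric, gateway, device))
    routes.sort(key=lambda route: route[0])   # lowest metric (best) first


def fib_lookup(dest):
    """Return the best (metric, gateway, device) for dest, or None."""
    addr = int(ipaddress.ip_address(dest))
    for prefixlen in sorted(fib_zones, reverse=True):   # most specific zone first
        mask = (0xFFFFFFFF << (32 - prefixlen)) & 0xFFFFFFFF
        routes = fib_zones[prefixlen].get(addr & mask)
        if routes:
            return routes[0]
    return None


add_route("10.1.0.0/16", "10.0.0.1", "eth0", metric=3)
add_route("10.1.0.0/16", "10.0.0.2", "eth1", metric=1)
add_route("0.0.0.0/0", "10.0.0.254", "ppp0", metric=5)
```

A lookup for an address inside 10.1.0.0/16 returns the route via 10.0.0.2, because its metric
is lower; any other address falls through to the zero length prefix and uses the default route.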

<p>
<hr><H3>Footnotes:</H3>

<p><a name=tthFtNtAAB></a><a href="#tthFrefAAB"><sup>1</sup></a> National Science Foundation
<p><a name=tthFtNtAAC></a><a href="#tthFrefAAC"><sup>2</sup></a> Synchronous Read Only Memory
<p><a name=tthFtNtAAD></a><a href="#tthFrefAAD"><sup>3</sup></a> duh?  What used for?
<hr>
<center>
&copy; 1996-1999 David A Rusling <A HREF="../misc/copyright.html">copyright notice</a>.
</center>
</HTML>