
<!-- $Id$ -->

<chapter id="ch-routing">
  <title>IP Routing</title>
  <para>
    Routing is fundamental to the design of the Internet Protocol.  IP
    routing has been cleverly designed to minimize the complexity for leaf
    nodes and networks.  Linux can be used as a leaf node such as a
    workstation where setting the IP address, netmask and
    default gateway suffices for all routing needs.  Alternatively, the same
    routing subsystem can be used in the core of a network connecting
    multiple public and private networks.
  </para>
  <para>
    This chapter will begin with the
    <link linkend="routing-intro">basics of IP routing with linux</link>,
    <link linkend="routing-local">routing to locally connected
    destinations</link> and
    <link linkend="routing-default">routing to destinations through the
    default gateway</link>.  Subsequent topics will include 
    <link linkend="routing-selection">the kernel's route selection
    algorithm</link>, the
    <link linkend="routing-cache">routing cache</link>,
    <link linkend="routing-tables">routing tables</link>, the
    <link linkend="routing-rpdb">routing policy database</link>, and
    <link linkend="routing-icmp">issues with ICMP and routing</link>.
  </para>
  <para>
    The scope of this documentation is primarily static routing.  Though
    dynamic routing is fundamental to large networks, Internet service
    providers, and backbone providers, this documentation is targeted at
    smaller networks, particularly networks which use static routing.
    Nonetheless, the concepts introduced also apply to dynamic routing
    environments.
  </para>
  <para>
    The linux routing subsystem has been designed with large
    scale networks in mind, without forgetting the need for easy
    configurability for leaf nodes, such as workstations and servers.
  </para>
  <section id="routing-intro">
    <title>Introduction to Linux Routing</title>
    <para>
      The design of IP routing allows for very simple route
      definitions for small networks, while not hindering the flexibility of
      routing in complex environments.  A tenet of IP routing is
      its ability to distinguish addresses which are locally reachable from
      destinations which are not directly known.  Every IP capable host
      knows about at least three classes of destination: itself, locally
      connected computers and everywhere else.
    </para>
    <para>
      Most fully-featured IP-aware networked operating systems
      (all unix-like operating systems with IP stacks,
      modern Macintoshes, and modern Windows) include support for the loopback
      device and loopback IP address.  This is an IP address and range
      configured on the host machine itself which allows the machine to talk
      to itself.
    </para>
    <para>
      The second group of IP addresses are the IPs in the locally
      reachable network segment.  Each machine with a connection to an IP
      network can reach a subset of the entire IP address space on its
      directly connected network interface.
    </para>
    <para>
      All other hosts or destination IPs fall into a third range.  Any host
      which is not on the machine itself or locally reachable (i.e. connected
      to the same media segment) is only reachable through an IP routing
      device.  This routing device must have an IP address in the locally
      reachable IP address range.
    </para>
    <para>
      All IP networking is a permutation of these three fundamental concepts
      of reachability.   This list summarizes the three possible
      classifications for reachability of destination IP addresses from any
      single source machine.
    </para>
    <anchor id="list-routing-intro"/>
    <orderedlist>
      <listitem>
        <para>
          The IP address is reachable on the machine itself.  Under linux
          this is considered
          <link linkend="tb-tools-ip-addr-scope">scope host</link> and is used
          for IPs bound to any network device including loopback, and the
          network range for the loopback device.
        </para>
      </listitem>
      <listitem>
        <para>
          The IP address is reachable on the directly connected link layer
          medium.
        </para>
      </listitem>
      <listitem>
        <para>
          The IP address is ultimately reachable through a router which
          is reachable on the directly connected link layer medium.
        </para>
      </listitem>
    </orderedlist>
    <para>
      Keep these three classifications in mind for the remainder of
      the chapter.
    </para>
    <para>
      The network address is an IP address which is unusable for an individual
      machine, but represents the first address in a range of addresses
      defining an entire IP subnet (formerly subnetwork).  Combined
      with the netmask, it describes the size of the range of
      acceptable addresses.  For more on addressing, see the tutorials and
      documentation in the
      <link linkend="links-general-ip">links section</link>.
    </para>
    <para>
      FIXME previous paragraph was ripped from former location and needs to be
      integrated into the flow of the introduction.
    </para>
  </section>
  <section id="routing-local">
    <title>Routing to Locally Connected Networks</title>
    <para>
      Any IP network is defined by two sets of numbers: network address and
      netmask.  By convention, there are two ways to represent these two
      numbers.  Netmask notation is the convention and tradition in IP
      networking
      although the more succinct CIDR notation is gaining popularity.
    </para>
    <para>
      In the
      <link linkend="ax-example-network">example network</link>, &isolde; has
      IP address 192.168.100.17.
      In CIDR notation, &isolde;'s address is 192.168.100.17/24, and in
      traditional netmask notation, 192.168.100.17/255.255.255.0.
      Any of the
      <link linkend="tools-ipcalc">IP calculators</link> confirms that the
      first usable IP address is 192.168.100.1 and the last usable IP address
      is 192.168.100.254.
      Importantly, the IP network address, 192.168.100.0/24, is reachable
      through the directly connected ethernet interface (refer to
      <link linkend="list-routing-intro">classification 2</link>).
      Therefore, &isolde; should be able to reach any IP address in
      this range directly on the locally connected ethernet segment.
    </para>
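    <para>
      The arithmetic performed by an IP calculator can be sketched in a few
      lines of Python.  This is only an illustration, assuming the standard
      library ipaddress module is available; the variable names are
      chosen for this example and do not come from any tool discussed in
      this guide.
    </para>
    <programlisting><![CDATA[
```python
import ipaddress

# isolde's address and prefix length, taken from the example network.
iface = ipaddress.ip_interface("192.168.100.17/24")
network = iface.network

# hosts() skips the network address and the broadcast address.
hosts = list(network.hosts())

print("CIDR notation:   ", network)
print("netmask notation:", network.with_netmask)
print("first usable:    ", hosts[0])
print("last usable:     ", hosts[-1])

# A destination inside the range is reachable directly on the link.
print("192.168.100.32 on link:",
      ipaddress.ip_address("192.168.100.32") in network)
```
]]></programlisting>
    <para>
      The same membership test, applied to an address outside
      192.168.100.0/24, tells the host that the destination must be reached
      through a router.
    </para>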
    <para>
      Below is the routing table for &isolde;, first shown with the
      conventional <command>route -n</command> output
      <footnote>
        <para>
          The <command>route -n</command> output can also be produced with
          <command>netstat -rn</command> and is commonly used by
          administrators who rely on platform independent behaviour across
          heterogeneous UN*X systems.  This traditional routing table output
          uses conventional netmask notation to denote network size.
        </para>
      </footnote>
      and then with the
      <command>ip route show</command>
      <footnote>
        <para>
          Refer to the
          <link linkend="tools-ip-route"><command>ip route</command></link>
          section for a fuller discussion of this linux specific tool.
          The routing table output from <command>ip route</command> uses
          exclusively CIDR notation.
        </para>
      </footnote>
      command.  Both tools read and display the same kernel routing table.
      For more on the routing table displayed in
      <xref linkend="ex-routing-local"/>, consult
      <xref linkend="routing-table-main"/>.
    </para>
    <example id="ex-routing-local">
      <title>Identifying the locally connected networks with
        <command>route</command></title>
      <programlisting>
<prompt>[root@isolde]# </prompt><userinput>route -n</userinput>
<computeroutput>Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.100.0   0.0.0.0         255.255.255.0   U     0      0        0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U     0      0        0 lo
0.0.0.0         192.168.100.254 0.0.0.0         UG    0      0        0 eth0</computeroutput>
<prompt>[root@isolde]# </prompt><userinput>ip route show</userinput>
<computeroutput>192.168.100.0/24 dev eth0  scope link 
127.0.0.0/8 dev lo  scope link 
default via 192.168.100.254 dev eth0</computeroutput>
      </programlisting>
    </example>
    <para>
      In the above example, the locally reachable destination is
      192.168.100.0/255.255.255.0 which can also be written 192.168.100.0/24
      as in <command>ip route show</command>.  In classful networking
      terms, the network to which &isolde; is directly connected is called a
      class C sized network.
    </para>
    <para>
      When a process (program) on &isolde; wants to communicate with another
      machine on the locally connected network, it will attempt to initiate a
      connection from 192.168.100.17 (&isolde;'s IP).  The kernel will consult
      the routing table to determine how to send the outbound packet.
      Assume the destination is 192.168.100.32.  The kernel will find that
      192.168.100.32 falls inside the IP address range 192.168.100.0/24 and
      will select this route for the outbound packet.
    </para>
    <para>
      The packet will be sent to the locally connected network segment
      directly, because &isolde; infers from the routing table
      that 192.168.100.32 is directly reachable through the physical network
      connection on eth0.
    </para>
    <para>
      Occasionally, a machine will be directly connected to two different
      IP networks on the same device.
      The routing table will show that both networks are reachable through
      the same physical device.  For more on this topic, see
      <xref linkend="adv-media-share"/>.  Similarly, multi-homed hosts will
      have a route for each locally connected network through the
      corresponding network interface.  For more on this sort of
      configuration, see
      <xref linkend="adv-multi-homed"/>.
    </para>
    <para>
      This covers the classification of IP destinations which are available on
      a locally connected network.  This highlights the importance of an
      accurate netmask and network address.  IP ranges which are not hosted on
      the machine itself and which do not fall in the range of the locally
      connected networks must be reached through a router.  The next section
      will cover IP destinations of this class.
    </para>
  </section>
  <section id="routing-default">
    <title>Routing through a Router, Commonly the Default Gateway</title>
    <para>
      Generalization of routing to a locally connected router.  FIXME
    </para>
    <para>
      The default gateway is the catch-all route.  If no more specific
      route exists in a routing table, the default route will be used.
      Many servers and workstations are connected to leaf networks
      with only one router, hence
      <xref linkend="ex-routing-local"/>
      shows a very common sort of routing table.  There's a route for
      localhost, for the locally connected IP network, and a default
      route.
    </para>
    <para>
      For Internet-connected hosts, the default route is customarily set to
      the IP of the locally reachable router which has a path to the Internet.
      Each router in turn has a default gateway pointing to another
      Internet-connected router until the packet is handed off to an Internet
      Service Provider's network.
    </para>
    <para>
      FIXME.
    </para>
  </section>
  <section id="routing-selection">
    <title>Route Selection</title>
    <para>
      Correct selection of a route to the destination is crucial to the
      ability of hosts to exchange IP packets.  Route selection
      decisions are traditionally made on a hop-by-hop basis
      <footnote>
        <para>
          This document could stand to allude to MPLS implementations under
          linux, for those who want to look at traffic engineering and packet
          tagging on backbones.  This is certainly not in the scope of this
          chapter, and should be in a separate chapter, which covers
          developing technologies.
        </para>
      </footnote>
      based upon the destination address of the packet.  Linux
      behaves as a conventional routing device in this way, but can also 
      provide a more flexible capability.  Routes can be chosen and
      prioritized based on other packet characteristics.
    </para>
    <para>
      The route selection algorithm under linux has been generalized to
      enable the powerful latter scenario without complicating the
      overwhelmingly common case of the former scenario.
    </para>
    <section id="routing-selection-common">
      <title>The Common Case</title>
      <para>
        The above sections on routing to a
        <link linkend="routing-local">local network</link> and
        <link linkend="routing-default">the default gateway</link>
        expose the importance of destination address for route selection.
        In this simplified model, the kernel need only know the destination
        address of the packet, which it compares against the routing tables to
        determine the route by which to send the packet.
      </para>
      <para>
        The kernel searches for a matching entry for the destination first in
        the routing cache and then the main routing table.
        In the case that the machine has recently transmitted a
        packet to the destination address, the
        <link linkend="routing-cache">routing cache</link> will contain an
        entry for the destination.  The kernel will select the same route, and
        transmit the packet accordingly.
      </para>
      <para>
        If the linux machine has not recently transmitted a packet to this
        destination address, it will look up the destination in its routing
        table using a technique known as longest prefix match
        <footnote>
          <para>
            Refer to
            <ulink url="http://www.isi.edu/in-notes/rfc3222.txt">RFC
            3222</ulink> for further details.
          </para>
        </footnote>.
        In practical terms, the concept of longest prefix match means that the
        most specific network route to the destination will be chosen.
      </para>
      <para>
        The kernel will search through the routing table to find the
        most specific destination IP range.  This search is known as longest
        prefix match.  Every route entry
        comprises two parts, the base address and the number of significant
        bits in that address.  Together, these two parts are referred to as a
        prefix, and the number of significant bits is the prefix length.
        The prefix length can be represented either as a netmask or in CIDR
        notation.  The kernel will match route entries with the
        highest number of significant bits before other routes.
      </para>
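      <para>
        Longest prefix match can be illustrated with a short Python sketch.
        The table below is a hypothetical model of the three routes shown for
        the example host, and the function name is invented for this
        illustration.  The linear scan is chosen for readability only and
        makes no claim about how the kernel stores or searches its routes.
      </para>
      <programlisting><![CDATA[
```python
import ipaddress

# Hypothetical table modelled on isolde's routes; each entry maps a
# prefix to a short description of the route.
ROUTES = {
    "192.168.100.0/24": "dev eth0 scope link",
    "127.0.0.0/8": "dev lo scope link",
    "0.0.0.0/0": "via 192.168.100.254 dev eth0",
}

def longest_prefix_match(destination, routes=ROUTES):
    """Return (prefix, route) for the most specific matching entry."""
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, route in routes.items():
        network = ipaddress.ip_network(prefix)
        if address in network:
            # Prefer the entry with more significant bits.
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, route)
    if best is None:
        return None
    return str(best[0]), best[1]

print(longest_prefix_match("192.168.100.32"))  # matches the /24 and the /0
print(longest_prefix_match("10.1.2.3"))        # matches only the /0
```
]]></programlisting>
      <para>
        The default route, 0.0.0.0/0, has no significant bits and therefore
        matches every destination, but loses to any other matching entry.
      </para>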
      <para>
        The use of the
        longest prefix match allows network routes for large networks to be
        overridden by more specific host routes, as required in
        <xref linkend="ex-basic-del-static"/>, for example.  Conversely, it is
        this same property of longest prefix match which allows routes to
        individual destinations to be aggregated into larger network
        addresses.  Instead of entering individual routes for each host, large
        numbers of contiguous network addresses can be aggregated.  This is
        the premise of CIDR networking.  See
        <xref linkend="links-general-ip"/> for further details.
      </para>
      <para>
        Since IP routing defines a default route for all destinations
        without a specifically known path, while
        more specific routes take precedence, it should make
        sense that the most specific possible route for any given destination
        is the preferred route.
      </para>
      <para>
        In the common case, the route selection is based completely on the
        destination address.  Conventional (as opposed to policy-based) IP
        networking relies on only the destination address to select a route
        for a packet.
      </para>
      <para>
        Because the majority of linux systems have no need of policy
        based routing
        features, they use the conventional routing technique of longest
        prefix match.  While this satisfies a large subset of
        linux networking needs, there is latent potential in the linux IP
        stack.
      </para>
    </section>
    <section id="routing-selection-adv">
      <title>The Whole Story</title>
      <para>
        With the prevalence of low cost bandwidth, easily configured VPN
        tunnels, and increasing reliance on networks, the technique of
        selecting a route based solely on the destination IP address range no
        longer suffices for all situations.
        The discussion of the common case
        of route selection under linux neglects one
        of the most powerful features in the linux IP stack.
        Since kernel 2.2, linux has
        supported policy based routing through the use of
        <link linkend="routing-tables">multiple routing tables</link> and the
        <link linkend="routing-rpdb">routing policy database (RPDB)</link>.
        Together, they allow a network
        administrator to configure a machine to select different routing
        tables and routes based on a number of criteria.
      </para>
      <para>
        Selectors used in policy-based routing are simply attributes of a packet
        passing through the linux routing code.  The source address of a
        packet, the ToS flags, an fwmark (a mark carried through the kernel in
        the data structure representing the packet), and the interface name on
        which the packet was received are attributes which can be used as
        selectors.  By selecting a routing table based
        on packet attributes, an administrator can have
        granular control over the network path of any packet.
      </para>
      <para>
        With this knowledge of the RPDB and multiple
        routing tables, let's revisit in detail the method by which the
        kernel selects the proper route for a packet.  Understanding
        the series of steps the kernel takes for route selection should
        demystify advanced routing.  In fact, advanced routing could more
        accurately be called policy-based networking.
      </para>
      <para>
        When determining the route by which to send a packet, the kernel always
        <link linkend="routing-cache">consults the routing cache first</link>.
        The routing cache is a hash table used for quick access to recently
        used routes.  If the kernel finds an entry in the routing cache, the
        corresponding entry will be used.  If there is no entry in the
        routing cache, the kernel begins the process of route selection.  For
        details on the method of matching a route in the routing cache, see
        <xref linkend="routing-cache"/>.
      </para>
      <para>
        If there is no entry in the routing cache,
        the kernel iterates by priority through the routing policy database.
        For each matching entry in the RPDB, the kernel will try to
        find a matching route to the destination IP
        address in the specified routing table.  If a matching route is found,
        the kernel will select this route, and forward the
        packet.  If no matching entry is found in the specified routing table,
        the kernel will pass to the next rule in the RPDB, until it finds a
        match or falls through the end of the RPDB and all consulted routing
        tables.
      </para>
      <para>
        Here is a snippet of python-esque pseudocode to illustrate the
        kernel's route selection process again.  Each of the lookups below
        occurs in kernel hash tables which are accessible to the user through
        the use of various <command>iproute2</command> tools.
        <programlisting>
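route = None
if packet.routeCacheLookupKey in routeCache :
    # cache hit: reuse the recently selected route
    route = routeCache[ packet.routeCacheLookupKey ]
else :
    # cache miss: walk the RPDB in priority order
    for rule in rpdb :
        if packet.rpdbLookupKey in rule :
            routeTable = rule[ lookupTable ]
            if packet.routeLookupKey in routeTable :
                route = routeTable[ packet.routeLookupKey ]
                break   # the first matching route is selected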
        </programlisting>
<!--  

  I don't know if this is correct!  Need to learn about how the routing
  cache is populated with information. 2003-02-05

                route_cache[ packet.routeCacheLookupKey ] = route

  -->


        This pseudocode provides some explanation of the decisions
        required to find a route.  The final piece of information
        required to understand the decision making process is the lookup
        process for each of the three hash table lookups.  In
        <xref linkend="tb-routing-selection-adv"/>, each key is listed in order
        of importance.  Optional keys are listed in italics and represent keys
        that will be matched if they are present.
      </para>
      <table id="tb-routing-selection-adv">
        <title>Keys used for hash table lookups during route selection</title>
        <tgroup cols="3" align="center" colsep="1" rowsep="1">
          <thead>
            <row>
              <entry>route cache</entry>
              <entry>RPDB</entry>
              <entry>route table</entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry>destination</entry>
              <entry>source</entry>
              <entry>destination</entry>
            </row>
            <row>
              <entry>source</entry>
              <entry><emphasis>destination</emphasis></entry>
              <entry><emphasis>ToS</emphasis></entry>
            </row>
            <row>
              <entry><emphasis>ToS</emphasis></entry>
              <entry><emphasis>ToS</emphasis></entry>
              <entry><emphasis><link linkend="tb-tools-ip-addr-scope">scope</link></emphasis></entry>
            </row>
            <row>
              <entry><emphasis>fwmark</emphasis></entry>
              <entry><emphasis>fwmark</emphasis></entry>
              <entry><emphasis>oif</emphasis></entry>
            </row>
            <row>
              <entry><emphasis>iif</emphasis></entry>
              <entry><emphasis>iif</emphasis></entry>
              <entry></entry>
            </row>
          </tbody>
        </tgroup>
      </table>
      <para>
        Observation of the output of <command>ip rule show</command>
        (cf. <xref linkend="ex-tools-ip-rule-show"/>)
        on a box whose RPDB has not been changed should reveal a
        high priority rule, rule 0.  This rule, created at RPDB
        initialization, instructs the kernel to try to find a match for the
        destination in the
        <link linkend="routing-table-local">local routing table</link>.  If
        there is no match for the packet in the local routing table, the next
        present rule (32766) causes the kernel to perform a route
        lookup in the
        main routing table.  Normally, the main routing table will contain a
        default route if not a more specific route.
        Failing a route lookup in the main routing table, the final rule
        (32767) instructs the kernel to perform a route lookup in table 253.
      </para>
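      <para>
        The traversal of the three default rules can be modelled with a
        small Python sketch.  The table contents here are hypothetical and
        much simpler than a real local or main routing table, the function
        names are invented for this illustration, and every rule is assumed
        to match every packet, as the default rules do.  Table 253 is
        modelled as empty.
      </para>
      <programlisting><![CDATA[
```python
import ipaddress

# Hypothetical contents for the three tables consulted by default.
TABLES = {
    "local": {"127.0.0.0/8": "local dev lo",
              "192.168.100.17/32": "local dev eth0"},
    "main": {"192.168.100.0/24": "dev eth0",
             "0.0.0.0/0": "via 192.168.100.254"},
    "default": {},  # table 253, modelled as empty
}

# (priority, table) pairs for rules 0, 32766 and 32767.
RPDB = [(0, "local"), (32766, "main"), (32767, "default")]

def lookup(table, destination):
    """Longest prefix match within one table; None if nothing matches."""
    address = ipaddress.ip_address(destination)
    matches = [ipaddress.ip_network(p) for p in table
               if address in ipaddress.ip_network(p)]
    if not matches:
        return None
    best = max(matches, key=lambda n: n.prefixlen)
    return table[str(best)]

def select_route(destination):
    """Walk the rules by priority; the first table with a match wins."""
    for priority, name in sorted(RPDB):
        route = lookup(TABLES[name], destination)
        if route is not None:
            return priority, name, route
    return None

print(select_route("192.168.100.17"))  # found by rule 0 in table local
print(select_route("10.1.2.3"))        # falls through to table main
```
]]></programlisting>
      <para>
        Adding a rule with a priority between 0 and 32766 to this model would
        cause its table to be consulted before the main routing table, which
        is the essence of policy routing.
      </para>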

      <!--
        
        FIXME; include an XREF here to the State vs Stateless discussion

        -->

      <para>
        A common mistake when working with multiple routing tables involves
        forgetting about the statelessness of IP routing.  Consider a solution
        which selects a more reliable connection for SMTP data by routing
        based on fwmark.  Each packet is routed independently, so if packets
        in either direction of the SMTP exchange are marked with an fwmark
        and handled by a separate routing table, this
        routing table must contain routes by which to reach both the SMTP
        client and SMTP server.
      <para>
        FIXME; maybe point to a practical example elsewhere?
      </para>
    </section>
    <section id="routing-selection-summary">
      <title>Summary</title>
      <para>
        Route selection, which was once a simple matter of matching the
        destination address against a single routing table, becomes immensely
        more complex and flexible when routes can be selected based on a
        packet's attributes.  FIXME....horrible conclusion.  Needs work.
      </para>
      <para>
        For more ideas on how to use policy routing, how to work with
        multiple routing tables, and how to troubleshoot, see
        <xref linkend="adv-rpdb"/>.
      </para>
    </section>
  </section>
  <section id="routing-source-address-selection">
    <title>Source Address Selection</title>
    <para>
      For now, refer to the section in the <command>iproute2</command> command
      reference.  The excerpt is available
      <ulink url="http://defiant.coinet.com/iproute2/ip-cref/node155.html">here</ulink>.
    </para>
  </section>
  <section id="routing-cache">
    <title>Routing Cache</title>
    <para>
      The routing cache is also known as the forwarding information base.
      This term may be familiar to users of other routing systems.
    </para>
    <para>
      The routing cache stores recently used routing entries in a fast and
      convenient hash lookup table, and is consulted before the routing
      tables.  If the kernel finds a matching entry during route cache lookup,
      it will forward the packet immediately and stop traversing the routing
      tables.
    </para>
    <para>
      Because the routing cache is maintained by the kernel separately from
      the routing tables, manipulating the routing tables may not have an
      immediate effect on the kernel's choice of path for a given packet.
      To prevent the non-deterministic lag between the time that a new route
      is entered into the kernel routing tables and the time that a new lookup
      in those route tables is performed, use
      <link linkend="tools-ip-route-flush-cache"><command>ip route flush
      cache</command></link>.  Once the route cache has been emptied, new
      route lookups (if not by a packet, then manually with
      <link linkend="tools-ip-route-get"><command>ip route
      get</command></link>) will result in a new lookup to the kernel routing
      tables.
    </para>
    <para>
      The following is a listing of the hash lookup keys 
      in the routing cache and a description of each key.  Compare this list
      with the elements identified in
      <xref linkend="tb-routing-selection-adv"/>.
    </para>
    <variablelist>
      <varlistentry>
        <term>dst</term>
        <term>Destination Address</term>
        <listitem>
          <para>
            The destination IP address of the packet.  This is the destination
            address on the packet at the time of the route lookup. The address
            is a host address.  All 32 bits are significant during this lookup.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>src</term>
        <term>Source Address</term>
        <listitem>
          <para>
            The source IP address of the packet.  This is the source address
            on the packet at the time of the route lookup.  The address is a
            host address.  All 32 bits are significant during this lookup.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>tos</term>
        <term>Type of Service</term>
        <listitem>
          <para>
            The ToS marking on the packet.  If there is no ToS marking on the
            packet (tos == 0), this lookup key is unused.  If there is a ToS
            marking, the kernel will search for a match with this ToS value.
            If no matching (dst, src, tos) is found, the kernel will continue
            the search for a route by traversing the RPDB.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>fwmark</term>
        <listitem>
          <para>
            The mark on a packet added administratively by the packet filter.
            This mark is not part of the physical IP packet, and only exists
            as part of the structure held in memory to represent the IP
            packet.  If there is no fwmark on the packet, this lookup key is
            unused.  When present, the kernel will search for a matching 
            (dst, src, tos?, fwmark) entry.  If no matching entry is found,
            the kernel will continue the search for a route by traversing the
            RPDB.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>iif</term>
        <term>inbound interface</term>
        <listitem>
          <para>
            The name of the interface on which the packet arrived.
          </para>
        </listitem>
      </varlistentry>
    </variablelist>
    <para>
      The following attributes may be stored for each entry in the routing
      cache.
    </para>
    <variablelist>
      <varlistentry>
        <term>cwnd</term>
        <term>Congestion Window</term>
        <listitem>
          <para>
            The clamp on the TCP congestion window for this route.  FIXME;
            needs a fuller description.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>advmss</term>
        <term>Advertised Maximum Segment Size</term>
        <listitem>
          <para>
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>src</term>
        <term>(Preferred Local) Source Address</term>
        <listitem>
          <para>
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>mtu</term>
        <term>Maximum Transmission Unit</term>
        <listitem>
          <para>
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>rtt</term>
        <term>Round Trip Time</term>
        <listitem>
          <para>
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>rttvar</term>
        <term>Round Trip Time Variation</term>
        <listitem>
          <para>
            FIXME.  Gotta find some references to this, too.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>age</term>
        <listitem>
          <para>
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>users</term>
        <listitem>
          <para>
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>used</term>
        <listitem>
          <para>
          </para>
        </listitem>
      </varlistentry>
    </variablelist>
    <para>
      Collectively these hash keys uniquely identify routes in the forwarding
      information base (routing cache) and each entry provides attributes of
      the route.
    </para>
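    <para>
      The relationship between lookup keys and stored attributes can be
      pictured as a dictionary keyed on a tuple.  The following Python
      sketch is a simplified model under that assumption.  The class and
      function names are invented for this illustration, the attribute
      values are made up, and the kernel's actual data structures differ.
    </para>
    <programlisting><![CDATA[
```python
from typing import NamedTuple, Optional

class CacheKey(NamedTuple):
    """Lookup keys; optional keys keep a neutral default when unset."""
    dst: str                    # destination host address
    src: str                    # source host address
    tos: int = 0                # 0 means no ToS marking
    fwmark: int = 0             # 0 means the packet carries no mark
    iif: Optional[str] = None   # inbound interface name, if any

# Maps a key to the attributes remembered for that route.
route_cache = {}

def remember(key, **attributes):
    """Store the attributes of a route under its lookup key."""
    route_cache[key] = attributes

def find(key):
    """Return cached attributes, or None to signal an RPDB traversal."""
    return route_cache.get(key)

# Hypothetical entry: the dev and mtu values are illustrative only.
remember(CacheKey("192.168.100.32", "192.168.100.17"), dev="eth0", mtu=1500)

hit = find(CacheKey("192.168.100.32", "192.168.100.17"))
miss = find(CacheKey("192.168.100.32", "192.168.100.17", tos=0x10))
print(hit)
print(miss)
```
]]></programlisting>
    <para>
      In this model a packet to the same destination with a different ToS
      marking produces a different key, so the lookup misses and route
      selection would continue with the RPDB.
    </para>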
  </section>
  <section id="routing-tables">
    <title>Routing Tables</title>
    <para>
      Linux supports multiple routing tables.  In addition to the two commonly
      used routing tables
      (<link linkend="routing-table-local">the local</link> and
      <link linkend="routing-table-main">main</link> routing tables), the
      kernel supports 253 more.  When an IP address is configured for an
      interface, the kernel adds any required routes to these two
      routing tables.
    </para>
    <para>
      The multiple routing table system provides a flexible infrastructure on
      top of which to implement policy routing.  By allowing multiple
      traditional routing tables (keyed primarily to destination address)
      to be combined with the RPDB (keyed primarily to source address), the
      kernel supports a well-known and well-understood interface while
      simultaneously expanding and extending its routing capabilities.
      Each routing table still operates in the traditional fashion, similar
      to other networking stacks.
      Linux simply allows you to choose from a
      number of routing tables, and to traverse routing tables in a
      user-definable sequence until a matching route is found.
    </para>
    <para>
      Any given routing table can contain an arbitrary number of entries,
      each of which is keyed on the following characteristics (cf.
      <xref linkend="tb-routing-selection-adv"/>):
      <itemizedlist>
        <listitem>
          <para>
            destination address; a network or host address (primary key)
          </para>
        </listitem>
        <listitem>
          <para>
            tos; Type of Service
          </para>
        </listitem>
        <listitem>
          <para>
            <link linkend="tb-tools-ip-addr-scope">scope</link>
          </para>
        </listitem>
        <listitem>
          <para>
            output interface
          </para>
        </listitem>
      </itemizedlist>
    </para>
    <para>
      For practical purposes, this means that even a single routing table can
      contain multiple routes to the same destination, provided the ToS
      differs on each route
      <footnote>
        <para>
          If somebody has used scope or oif as additional keys in a routing
          table, and has an example, I'd love to see it, for possible
          inclusion in this documentation.
        </para>
      </footnote>.
    </para>
    <para>
      Each table is identified by a positive integer between 1 and 255
      <footnote>
        <para>
          Can anybody describe to me what is in table 0?  It looks almost like
          an aggregation of the routing entries in routing tables 254 and 255.
        </para>
      </footnote>.
      The two routing tables employed by the kernel are
      <link linkend="routing-table-local">table 255, the local routing
      table</link>, and
      <link linkend="routing-table-main">table 254, the main routing
      table</link>.  For examples of using multiple routing tables, see
      <xref linkend="ch-advanced"/>, in particular,
      <xref linkend="ex-adv-multi-internet-outbound-ip-routing"/>,
      <xref linkend="ex-adv-multi-internet-outbound-ip-rule"/> and 
      <xref linkend="ex-adv-multi-internet-inbound"/>.  Also be sure
      to read
      <xref linkend="adv-rpdb"/>.
    </para>
    <para>
      The routing table manipulated by the conventional
      <command>route</command> command is the main routing table.
      Additionally, configuring addresses with either
      <command>ip address</command> or <command>ifconfig</command> causes
      the kernel to update the local routing table.  For
      further documentation on how to manipulate the other routing
      tables, see the command description of
      <link linkend="tools-ip-route"><command>ip route</command></link>.
    </para>
    <section id="routing-table-local">
      <title>The Local Routing Table</title>
      <para>
        The local routing table is maintained by the kernel.  Normally, the
        local routing table should not be manipulated,
        but it is available for viewing.  In
        <xref linkend="ex-tools-ip-route-show-local"/>, you'll see two of the
        common uses of the local routing table.  The first common use is the
        specification of broadcast addresses, necessary only for link layers
        which support broadcast addressing.  The second common
        type of entry in the
        local routing table is a route to a locally hosted address.
      </para>
      <para>
        If the machine has several IP addresses on one Ethernet interface,
        there will be a route to each locally hosted IP in the local routing
        table.  This is a normal
        <link linkend="list-basic-ifconfig-side-effects-up">side effect</link>
        of bringing up an IP address on an interface under linux.
        Maintenance of the broadcast and local routes in the local routing
        table is best left to the kernel.
        Hence, read-only fingers are
        recommended when working with the local routing table.
      </para>
      <para>
        There is one other type of route which commonly ends up in the local
        routing table.  When using <command>iproute2</command> NAT, there will
        be entries in the local routing table for each network address
        translation.  Refer to
        <xref linkend="ex-tools-ip-route-nat-simple"/> and
        <xref linkend="ex-tools-ip-route-nat-network"/> for example output.
      </para>
    </section>
    <section id="routing-table-main">
      <title>The Main Routing Table</title>
      <para>
        The main routing table is the routing table most people think of when
        considering the routing table on a linux box.  The main routing table
        is operated on by default with the <command>ip route</command> command
        and also with the <command>route</command> command.
      </para>
      <para>
        The main routing table is also populated automatically by the kernel
        when new interfaces are brought up with network addresses.  Visit 
        <link linkend="list-basic-ifconfig-side-effects-up">this summary of
        side effects</link> of interface definition and activation with
        <command>ifconfig</command> or <command>ip address</command>.
      </para>
      <para>
        The overwhelming majority of linux machines use the main routing table
        and only the main routing table.  This is usually the only source of
        routing information on a machine.
      </para>
    </section>
  </section>
  <section id="routing-rpdb">
    <title>Routing Policy Database</title>
    <para>
      The routing policy database controls the order in which the kernel
      searches through the routing tables.
    </para>
  </section>
  <section id="routing-icmp">
    <title>ICMP and Routing</title>
    <para>
      ICMP is a very important part of the communication between hosts on
      IP networks.  Used by routers and endpoints (clients and servers),
      ICMP communicates error conditions in networks and
      provides a means for endpoints to receive information
      about a network path or requested connection.
    </para>
    <para>
      One of the most common uses of ICMP by the administrator of a network
      is the use of
      <link linkend="tools-ping"><command>ping</command></link> to detect the
      state of a machine in the network.  There are other types of ICMP which
      are used for other inter-computer communication.  One other common type
      of ICMP is the ICMP returned by a router or host which is not accepting
      connections.  Essentially, the host returns the ICMP as a polite method
      of saying <quote>Go away.</quote>
    </para>
    <section id="routing-icmp-mtu">
      <title>MTU, MSS, and ICMP</title>
      <para>
        One important use of ICMP, which is completely transparent
        to most users (and indeed many admins), is the use of ICMP to discover
        the Path Maximum Transmission Unit (PMTU).  By discovering the Path MTU
        and setting the MTU for the destination to this value, a host can
        minimize the delay of traffic due to fragmentation, and
        (theoretically) attain a more even rate of data transmission.
      </para>
      <!-- FIXME; make sure to make a full discussion of PMTU -->
      <!--

Example from Giovanni Quadriglio.  Needs to be incorporated into the
document.

As usual I've forgotten the PMTU example

- - Example PMTU - playing with Path MTU Discovering

eth =       0       1      0       0
     - - - -        - - - -        - - - - 
     |server| - - - |router| - - - |client|
     - - - -        - - - -         - - - -
MTU =      1500   1000    1500   1500


[root@server]# nc -l -p 9999
[root@router]# ifconfig eth1 mtu 1000

Now if on router we issue:

[root@client]# tcpdump -i eth0

and later on client we issue:

[root@client]# cat data | nc server 9999

(data is a file of 2000 byte in size for example)

we can see router sends the client the ICMP error:

server unreachable - need to frag but DF bit set (mtu=1000) !

now if PMTU discovery is enabled on client the new packet len. will be
recalculated with this new MTU in mind so that DF is always set
and the packet will reach server without being fragmented

if on client we had issued:
[root@client]# sysctl -w net.ipv4.ip_no_pmtu_disc=1

PMTU discovery on client would have been disabled. New packets starting from
client
will not have DF bit set and fragmentation will occur during the
path from client to server (i.e. router fragments the packet).

It could happen to touch this parameter because of bad ICMP filtering
on some router.


        -->
      <para>
        Path MTU discovery can be quite easily broken if any single hop along the way
        blocks all ICMP.  Be sure to allow ICMP unreachable/fragmentation
        needed packets into and out of your network.  This will prevent you
        from being one of the unclueful network admins who cause PMTU
        problems.
      </para>
      <!-- FIXME; XREF link to minimum firewall for ICMP -->
    </section>
    <section id="routing-icmp-redirect">
      <title>ICMP Redirects and Routing</title>
      <para>
        An ICMP redirect is a router's way of communicating
        that there is a better path out of this network or into another one
        than the one the host had chosen.  In
        <link linkend="example-network-netmap">the example network</link>, 
        &tristan; has a route to the world through &masq-gw; and a route to
        192.168.98.0/24 through &isdn-router;.  If &tristan; sends a packet
        for 192.168.98.0/24 to &masq-gw;, the optimal response is for
        &masq-gw; to suggest with an ICMP redirect that &tristan; send such
        packets via &isdn-router; instead.
      </para>
      <para>
        It is by this method that hosts can learn what networks are reachable
        through which routers on the local network segment.  ICMP redirect
        messages, however, are easy to forge, and were (at one time) used to
        subvert poorly configured machines.  While this is infrequently a
        problem, it's still good practice to ignore ICMP redirect
        messages in general, and create static routes where necessary to
        prevent ICMP redirect messages from being generated on your network.
      </para>
      <para>
        To examine an example of ICMP redirect in action, we simply
        need to send a packet directly from &tristan; to
        &morgan;.  We assume that &masq-gw; has a route to 192.168.98.0/24
        via 192.168.99.1 (&isdn-router;), and that &tristan; has no
        such route.
      </para>
      <example id="ex-routing-icmp-redirect">
        <title>ICMP Redirect on the Wire
          <footnote>
            <para>
              Consult <xref linkend="tb-example-network-hosts"/> for details on
              the IP and MAC addresses of the hosts referred to in this
              example.
            </para>
          </footnote>
        </title>
        <programlisting>
<prompt>[root@tristan]# </prompt><userinput>echo test | nc 192.168.98.82 22</userinput>
<prompt>[root@tristan]# </prompt><userinput>tcpdump -nneqti eth0</userinput>
<computeroutput>0:80:c8:f8:4a:51 0:80:c8:f8:5c:71 74: 192.168.99.35.54510 > 192.168.98.82.22: tcp 0 (DF)
0:80:c8:f8:5c:71 0:80:c8:f8:4a:51 102: 192.168.99.254 > 192.168.99.35: icmp: redirect 192.168.98.82 to host 192.168.99.1 [tos 0xc0] 
0:80:c8:f8:5c:71 0:c0:7b:45:6a:39 74: 192.168.99.35.54510 > 192.168.98.82.22: tcp 0 (DF)</computeroutput>
        </programlisting>
      </example>
      <para>
        There's a great deal of information above, so let's examine the
        important parts.  We have the first three packets which passed by our
        NIC as a result of this attempt to establish a session.  First, we see
        a packet from &tristan; bound for &morgan; with &tristan;'s MAC as
        the source and &masq-gw;'s MAC as the destination.  Because
        &masq-gw; is &tristan;'s default gateway, and &tristan; has no more
        specific route to 192.168.98.0/24, &tristan; sends the packet there.
      </para>
      <para>
        The next packet is the ICMP redirect, informing &tristan; of a 
        better route.  It includes several pieces of information.
        Implicitly, the source IP indicates what router is suggesting the
        alternate route, and the contents specify what the intended
        destination was, and what the better route is.  Note that &masq-gw;
        suggests using 192.168.99.1 (&isdn-router;) as the gateway for this
        destination.
      </para>
      <para>
        The final packet is part of the intended session, but carries the
        source MAC address of &masq-gw;.  &masq-gw; has (courteously)
        informed us that we should not use it as the gateway for the
        intended destination, but
        has also (courteously) forwarded the packet as we had requested.  In
        this small network, it is acceptable to allow ICMP redirect messages,
        although these should always be dropped at network borders, both
        inbound and outbound.
      </para>
      <para>
        So, in summary, ICMP redirect messages are not intrinsically dangerous
        or problematic, but they shouldn't exist in well-maintained networks.
        If you happen to see them growing in the shadows of your network, some
        careful observation should show you what hosts are affected and which
        routing tables could use some attention.
      </para>
    </section>
  </section>
</chapter>