From 888e31004c849bf4c3580c978d74bc88efd4bf41 Mon Sep 17 00:00:00 2001
From: gferg <>
Date: Fri, 26 May 2000 20:03:38 +0000
Subject: [PATCH] updated

---
 LDP/howto/linuxdoc/Adv-Routing-HOWTO.sgml | 791 ++++++++++++++++++++--
 LDP/howto/linuxdoc/HOWTO-INDEX.sgml       |   6 +-
 2 files changed, 755 insertions(+), 42 deletions(-)

diff --git a/LDP/howto/linuxdoc/Adv-Routing-HOWTO.sgml b/LDP/howto/linuxdoc/Adv-Routing-HOWTO.sgml
index 73e5e197..3a454eb2 100644
--- a/LDP/howto/linuxdoc/Adv-Routing-HOWTO.sgml
+++ b/LDP/howto/linuxdoc/Adv-Routing-HOWTO.sgml
@@ -1,3 +1,4 @@
+
+Most Linux distributions, and most UNIXes, currently use the
+venerable 'arp', 'ifconfig' and 'route' commands. While these tools work,
+they show some unexpected behaviour under Linux 2.2 and up. For example, GRE
+tunnels are an integral part of routing these days, but require completely
+different tools.
+
+With iproute2, tunnels are an integral part of the tool set.
+
+The 2.2 and above Linux kernels include a completely redesigned network
+subsystem. This new networking code gives Linux performance and a feature
+set with little competition in the general OS arena. In fact, the new
+routing, filtering, and classifying code is more featureful than that
+provided by many dedicated routers, firewalls, and traffic-shaping
+products.
+
+As new networking concepts have been invented, people have found ways to
+plaster them on top of the existing framework in existing OSes. This
+constant layering of cruft has led to networking code that is filled with
+strange behaviour, much like most human languages. In the past, Linux
+emulated SunOS's handling of many of these things, which was not ideal.
+
+This new framework makes it possible to clearly express features
+that were previously impossible to implement.
+
+
+Linux has a sophisticated system for bandwidth provisioning called Traffic
+Control. This system supports various methods for classifying, prioritising,
+sharing, and limiting both inbound and outbound traffic.
+
+
+We'll start off with a tiny tour of iproute2 possibilities.
+
+You should make sure that you have the userland tools installed. This
+package is called 'iproute' on both RedHat and Debian, and may otherwise be
+found at ftp://ftp.inr.ac.ru/ip-routing/iproute2-2.2.4-now-ss??????.tar.gz.
+Some parts of iproute require you to have certain kernel options enabled.
+
+FIXME: We should mention
+This may come as a surprise, but iproute2 is already configured! The current
+commands ifconfig and route already use the advanced
+syscalls, but mostly with very default (i.e., boring) settings.
+
+The ip tool is central, and we'll ask it to display our interfaces
+for us.
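The command to do this (assuming the iproute package is installed) is shown below; the actual output depends entirely on your hardware, so none is reproduced here.

```shell
# Show the link state of all interfaces; use 'ip address list'
# instead to include IP addresses in the listing.
ip link list
```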
+
+ Your mileage may vary, but this is what it shows on my NAT router at
+home. I'll only explain part of the output as not everything is directly
+relevant.
+
+We first see the loopback interface. While your computer may function
+somewhat without one, I'd advise against it. The mtu (maximum transfer
+unit) is 3924 octets, and it is not supposed to queue, which makes sense
+because the loopback interface is a figment of your kernel's imagination.
+
+I'll skip the dummy interface for now; it may not be present on your
+computer. Then there are my two network interfaces, one facing my
+cable modem, the other serving my home ethernet segment. Furthermore, we see
+a ppp0 interface.
+
+Note the absence of IP addresses. Iproute decouples the concept of 'links'
+from that of 'IP addresses'. With IP aliasing, the concept of 'the' IP address had
+become quite irrelevant anyhow.
+
+It does show us the MAC addresses though, the hardware identifiers of our
+ethernet interfaces.
+
+
+This contains more information. It shows all our addresses, and to which
+cards they belong. 'inet' stands for Internet. There are lots of other
+address families, but these don't concern us right now.
+
+Let's examine eth0 somewhat closer. It says that it is related to the inet
+address '10.0.0.1/8'. What does this mean? The /8 stands for the number of
+bits that are in the Network Address. There are 32 bits, so we have 24 bits
+left to address hosts within our network. The first 8 bits of 10.0.0.1 correspond
+to 10.0.0.0, our Network Address, and our netmask is 255.0.0.0.
+
+The remaining bits address hosts directly connected to this interface, so
+10.250.3.13 is directly available on eth0, as is 10.0.0.1, for example.
+
+With ppp0, the same concept applies, though the numbers are different. Its
+address is 212.64.94.251, without a subnet mask. This means that we have a
+point-to-point connection and that every address, with the exception of
+212.64.94.251, is remote. There is more information, however: it tells us
+that on the other side of the link there is again only one address,
+212.64.94.1. The /32 tells us that there are no 'network bits'.
+
+It is absolutely vital that you grasp these concepts. Refer to the
+documentation mentioned at the beginning of this HOWTO if you have trouble.
+
+You may also note 'qdisc', which stands for Queueing Discipline. This will
+become vital later on.
+
+
+Well, we now know how to find 10.x.y.z addresses, and we are able to reach
+212.64.94.1. This is not enough, however, so we need instructions on how to
+reach the world. The internet is available via our ppp connection, and it
+appears that 212.64.94.1 is willing to spread our packets around the
+world, and deliver results back to us.
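As a sketch (reusing the peer address from above; treat it as an example, not a recipe), the routing table would be inspected and the default route set like this:

```shell
# Inspect the main routing table.
ip route show
# Hypothetical: send everything non-local via the ppp peer.
ip route add default via 212.64.94.1
```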
+
+
+ARP is the Address Resolution Protocol as described in
+
@@ -200,6 +449,9 @@ If you have a large router, you may well cater for the needs of different
people, who should be served differently. The routing policy database allows
you to do this by having multiple sets of routing tables.
+If you want to use this feature, make sure that your kernel is compiled with
+the "IP: policy routing" feature.
+
When the kernel needs to make a routing decision, it finds out which table
needs to be consulted. By default, there are three tables. The old 'route'
tool modifies the main and local tables, as does the ip tool (by default).
@@ -212,7 +464,7 @@ The default rules:
32767: from all lookup default
-This lists the priority of a rules. We see that all rules apply to all
+This lists the priority of all rules. We see that all rules apply to all
packets ('from all'). We've seen the 'main' table before, it's output by
ip route ls, but the 'local' and 'default' table are new.
@@ -285,7 +537,185 @@ And we are done. It is left as an exercise for the reader to implement this
in ip-up.
-FIXME: waiting for our feature tunnel editor to finish his stuff
+There are 3 kinds of tunnels in Linux: IP-in-IP tunneling, GRE tunneling, and tunnels that live outside the kernel (for example, PPTP).
+
+Tunnels can be used to do some very unusual and very cool things. They can
+also make things go horribly wrong when you don't configure them right.
+Don't point your default route to a tunnel device unless you know
+_exactly_ what you are doing :-). Furthermore, tunneling increases
+overhead, because it needs an extra set of IP headers. Typically this is 20
+bytes per packet, so if the normal packet size (MTU) on a network is 1500
+bytes, a packet that is sent through a tunnel can only be 1480 bytes big.
+This is not necessarily a problem, but be sure to read up on IP packet
+fragmentation/reassembly when you plan to connect large networks with
+tunnels. Oh, and of course, the fastest way to dig a tunnel is to dig at
+both sides.
+
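The overhead arithmetic can be made concrete; the GRE figure below assumes the basic 4-byte GRE header on top of the outer IP header:

```shell
# Usable payload MTU after tunnel encapsulation (illustrative numbers).
LINK_MTU=1500
IPIP_OVERHEAD=20   # one extra IPv4 header
GRE_OVERHEAD=24    # extra IPv4 header plus the 4-byte basic GRE header

echo "IPIP payload MTU: $(( LINK_MTU - IPIP_OVERHEAD ))"   # 1480
echo "GRE payload MTU:  $(( LINK_MTU - GRE_OVERHEAD ))"    # 1476
```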
+
+This kind of tunneling has been available in Linux for a long time. It requires 2 kernel modules,
+ipip.o and new_tunnel.o.
+
+Let's say you have 3 networks: Internal networks A and B, and intermediate network C (or let's say, Internet).
+So we have network A:
+
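The example configuration is elided from this hunk; the following sketch uses entirely hypothetical addresses (network A = 10.0.1.0/24 behind public address 172.16.17.18, network B = 10.0.2.0/24 behind 172.19.20.21) and the classic ifconfig/route style of this era:

```shell
# On the router of network A (hypothetical addresses throughout):
insmod ipip.o
ifconfig tunl0 10.0.1.1 pointopoint 172.19.20.21
route add -net 10.0.2.0 netmask 255.255.255.0 dev tunl0
```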
+
+GRE is a tunneling protocol that was originally developed by Cisco, and it
+can do a few more things than IP-in-IP tunneling. For example, you can also
+transport multicast traffic and IPv6 through a GRE tunnel.
+
+In Linux, you'll need the ip_gre module.
+
+
+Let's do IPv4 tunneling first:
+
+Let's say you have 3 networks: Internal networks A and B, and intermediate network C (or let's say, Internet).
+
+So we have network A:
+
+On the router of network A, you do the following:
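The example block itself is elided here; a sketch with hypothetical addresses (network A = 10.0.1.0/24 behind 172.16.17.18, network B = 10.0.2.0/24 behind 172.19.20.21) could look like this, with the route for network B set in the third line:

```shell
# On the router of network A (hypothetical addresses throughout):
ip tunnel add netb mode gre remote 172.19.20.21 local 172.16.17.18 ttl 255
ip addr add 10.0.1.1 dev netb
ip route add 10.0.2.0/24 dev netb
ip link set netb up
```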
+ In the third line we set the route for network B. Note the different notation for the netmask. If you're not familiar with this notation, here's how it works: you write out the netmask in binary form, and you count all the ones. If you don't know how to do that, just remember that 255.0.0.0 is /8, 255.255.0.0 is /16 and 255.255.255.0 is /24. Oh, and 255.255.254.0 is /23, in case you were wondering.
+
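The counting rule above can be checked mechanically; this small helper (a sketch, plain shell arithmetic only) converts a prefix length to its dotted-quad netmask:

```shell
# Convert a CIDR prefix length (0-32) to a dotted-quad netmask.
prefix_to_netmask() {
    mask=$(( (0xFFFFFFFF << (32 - $1)) & 0xFFFFFFFF ))
    printf '%d.%d.%d.%d\n' \
        $(( (mask >> 24) & 0xFF )) $(( (mask >> 16) & 0xFF )) \
        $(( (mask >> 8)  & 0xFF )) $((  mask        & 0xFF ))
}

prefix_to_netmask 8    # 255.0.0.0
prefix_to_netmask 23   # 255.255.254.0
prefix_to_netmask 24   # 255.255.255.0
```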
+But enough about this, let's go on with the router of network B.
+
+
+BIG FAT WARNING !!
+
+The following is untested and might therefore be
+complete and utter BOLLOCKS. Proceed at your own risk. Don't say I didn't
+warn you.
+
+FIXME: check & try all this
+
+
+A short bit about IPv6 addresses:
+IPv6 addresses are, compared to IPv4 addresses, monstrously big. An example:
+
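The example address is elided from this hunk; an illustrative one from the old experimental 6bone range (3ffe::/16), written out in full and then with the usual zero-compression, would be:

```
3ffe:0202:4002:0001:0000:0000:0000:0001
(which abbreviates to)
3ffe:202:4002:1::1
```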
+On with the tunnels.
+
+Let's assume that you have the following IPv6 network, and you want to connect it to 6bone, or a friend.
+
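A sketch of an IPv6-over-IPv4 ('sit') tunnel, with every address hypothetical: your IPv4 address is 145.100.1.5, the tunnel broker's is 145.100.24.181, and the broker assigned you the prefix used below.

```shell
# Untested sketch, all addresses hypothetical:
ip tunnel add sixbone mode sit remote 145.100.24.181 local 145.100.1.5 ttl 255
ip link set sixbone up
ip addr add 3ffe:604:6:7::2/126 dev sixbone
ip route add 3ffe::/16 dev sixbone
```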
+
+
+GRE tunnels are currently the preferred type of tunneling. It is a standard that is also widely adopted outside the Linux community, and therefore a Good Thing.
+
+
+There are literally dozens of implementations of tunneling outside the kernel. Best known are of course PPP and PPTP, but there are lots more (some proprietary, some secure, some that don't even use IP), and they are really beyond the scope of this HOWTO.
+
+
+FIXME: Waiting for our feature editor Stefan to finish his stuff
+
FIXME: Editor Vacancy!
@@ -566,6 +996,10 @@ We now need to create two new classes, within our Office class:
FIXME: Finish this example!
+
+FIXME: document TEQL
+
The Linux kernel offers us lots of queueing disciplines. By far the most
@@ -588,26 +1022,66 @@ SFQ, as said earlier, is not quite deterministic, but works (on average).
Its main benefits are that it requires little CPU and memory. 'Real' fair
queueing requires that the kernel keep track of all running sessions.
-This is far too much work so SFQ keeps track of only a number of sessions by
-tracking things based on a hash. Two different sessions might end up in the
-same hash, which isn't very bad but should not be a permanent situation.
-Therefore the kernel perturbs the hash with a certain frequency, which can
-be specified on the tc command line.
+Stochastic Fairness Queueing (SFQ) is a simple implementation
+of the fair queueing family of algorithms. It's less accurate than
+the others, but it requires fewer calculations while being
+almost perfectly fair.
+
+The key word in SFQ is conversation (or flow), a sequence
+of data packets having enough common parameters to distinguish
+it from other conversations. The parameters used in the case of
+IP packets are the source and destination addresses, and the protocol
+number.
+
+SFQ consists of a dynamically allocated number of FIFO queues,
+one queue per conversation. The discipline runs in round-robin fashion,
+sending one packet from each FIFO per turn, which is why
+it's called fair. The main advantage of SFQ is that it allows
+fair sharing of the link between several applications and prevents
+bandwidth take-over by a single client. SFQ however cannot distinguish
+interactive flows from bulk ones -- one usually needs to do
+the selection with CBQ first, and then direct the bulk traffic
+into SFQ.
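As a sketch (the device name is an assumption), attaching SFQ to an interface looks like this; 'perturb' makes the kernel rehash the flows periodically so that two conversations colliding in the same hash bucket don't stay unlucky forever:

```shell
# Attach SFQ as the root qdisc of eth0, rehashing every 10 seconds.
tc qdisc add dev eth0 root sfq perturb 10
# Verify:
tc qdisc ls dev eth0
```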
+
+
-This queue is very straightforward. Imagine a bucket, which holds a number
-of tokens. Tokens are added with a certain frequency, until the bucket fills
-up. By then, the bucket contains 'b' tokens.
+The Token Bucket Filter (TBF) is a simple queue that only passes packets
+arriving at a rate within the bounds of some administratively set limit, with
+the possibility of buffering short bursts.
-Whenever packets arrive, they are stored. If there are more tokens than
-packets, these packets are sent out ('dequeued') immediately in a burst
-transfer.
+The TBF implementation consists of a buffer (bucket), constantly filled by
+some virtual pieces of information called tokens, at a specific rate (the token
+rate). The most important parameter of the bucket is its size, that is,
+the number of tokens it can store.
-If there are more packets then tokens, all packets for which there is a
-token are sent off, the rest have to wait for new tokens to arrive. So, if
-the size of a token is, say, 1000 octets, and we add 8 tokens per second,
-our eventual data rate is 64kilobit per second, excluding a
-certain 'burstiness' that we allow.
+Each arriving token lets one incoming data packet out of the queue and is
+then deleted from the bucket. Associating this algorithm with the two flows
+-- token and data -- gives us three possible scenarios:
+
+ The last scenario is very important, because it allows us to
+administratively shape the bandwidth available to data passing the filter.
+The accumulation of tokens allows short bursts of overlimit data to still be
+passed without loss, but any lasting overload will cause packets to be
+constantly dropped.
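A minimal sketch (device, rate and buffer values are arbitrary examples): limit an interface to 64kbit/s, allow a 5-kilobyte burst of saved-up tokens, and let packets wait at most 50ms for tokens before being dropped:

```shell
tc qdisc add dev eth0 root tbf rate 64kbit burst 5kb latency 50ms
```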
The Linux kernel seems to go beyond this specification, and also allows us
to limit the speed of the burst transmission. However, Alexey warns us:
@@ -684,7 +1158,9 @@ how this works.
So far we've seen how iproute works, and netfilter was mentioned a few
times. This would be a good time to browse through The "fw" classifier relies on the firewall tagging the packets to be shaped. So,
first we will setup the firewall to tag them:
-FIXME: Equivalent iptables command?
-
- The "u32" classifier is a filter that filters directly based on the
- contents of the packet. Thus it can filter based on source or destination
- addresses or ports. It can filter based on the TOS and other truly bizarre
- fields. It does this by taking a specification of the form
- [offset/mask/value] and applying that to all the packets. Fortunately you
- can use symbolic names much as with tcpdump.
+The U32 filter is the most advanced filter available in the current
+implementation. It is entirely based on hashing tables, which makes it
+robust when there are many filter rules.
+
+In its simplest form the U32 filter is a list of records, each
+consisting of two fields: a selector and an action. The selectors,
+described below, are compared with the currently processed IP packet
+until the first match, and then the associated action is performed. The
+simplest type of action would be directing the packet into a defined
+CBQ class.
+
+The command line of the tc filter program, used to configure the filter,
+consists of three parts: the filter specification, a selector and an action.
+The filter specification can be defined as:
+The U32 selector contains a definition of the pattern that will be matched
+against the currently processed packet. Precisely, it defines which bits are
+to be matched in the packet header, and nothing more, but this simple
+method is very powerful. Let's take a look at the following examples,
+taken directly from a pretty complex, real-world filter:
+
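The example listing itself is missing from this hunk; reconstructed from the discussion that follows, it was a 'tc filter show' style listing along these lines (the hash-table parameters in the first line are illustrative):

```
filter parent 1: protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:3
  match 00100000/00ff0000 at 0
```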
+
+For now, leave the first line alone -- all these parameters describe
+the filter's hash tables. Focus on the selector line, containing the
+match keyword. This selector will match IP headers whose
+second byte is 0x10. As you can guess, the 00ff number is
+the match mask, telling the filter exactly which bits to match. Here
+it's 0xff, so the byte will match if it's exactly 0x10. The at
+keyword means that the match is to be started at the specified offset (in
+bytes) -- in this case at the beginning of the packet. Translating all
+that into human language, the packet will match if its Type of Service
+field has the 'low delay' bits set. Let's analyze another rule:
+
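The rule itself is again elided; a reconstruction consistent with the discussion that follows would be:

```
  match 00000016/0000ffff at nexthdr+0
```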
+
+The nexthdr option means the next header encapsulated in the IP packet,
+i.e. the header of the upper-layer protocol. The match will then start
+at the beginning of the next header. The match should occur in the
+second 16-bit word of the header, which in the TCP and UDP protocols
+contains the packet's destination port. The number is given in big-endian
+format, i.e. most significant bits first, so we simply read 0x0016 as 22 decimal,
+which stands for the SSH service if this were TCP. As you can guess, this match
+is ambiguous without a context, and we will discuss this later.
+
+
+Having understood all of the above, we will find the following selector
+quite easy to read: match c0a80100/ffffff00 at 16. What we
+have here is a three-byte match starting at the 17th byte, counting from the IP
+header start. This will match packets with a destination address
+anywhere in the 192.168.1.0/24 network. After analyzing the examples, we
+can summarize what we have learned.
+
+
+General selectors define the pattern, the mask, and the offset at which the
+pattern will be matched against the packet contents. Using the general selectors
+you can match virtually any single bit in the IP (or upper-layer)
+header. They are more difficult to write and read, though, than
+the specific selectors described below. The general selector syntax
+is:
+
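The syntax line is elided in this hunk; from the parameter names discussed just below, it has this shape:

```
match [ u32 | u16 | u8 ] PATTERN MASK [ at OFFSET | nexthdr+OFFSET ]
```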
+
+One of the keywords u32, u16 or u8 specifies the
+length of the pattern in bits. PATTERN and MASK should follow, with the length
+defined by the previous keyword. The OFFSET parameter is the offset,
+in bytes, at which to start matching. If the nexthdr+ keyword is given,
+the offset is relative to the start of the upper-layer header.
+
+
+Some examples:
+
+
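The example commands are elided here; a sketch matching the first explanation (device name and flowid are assumptions):

```shell
tc filter add dev ppp14p0 parent 1:0 prio 10 u32 \
     match u8 64 0xff at 8 \
     flowid 1:4
```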
+A packet will match this rule if its time to live (TTL) is 64.
+TTL is the field starting just after the 8th byte of the IP header.
+
+
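The second example command is likewise elided; a sketch consistent with the explanation that follows (device name and flowid assumed):

```shell
tc filter add dev ppp14p0 parent 1:0 prio 10 u32 \
     match ip protocol 6 0xff \
     match u8 0x10 0xff at nexthdr+13 \
     flowid 1:3
```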
+This rule will only match TCP packets with the ACK bit set. Here we can see
+an example of using two selectors; the final result will be the logical AND
+of their results. If we take a look at a TCP header diagram, we can see
+that the ACK bit is bit 0x10 in the 14th byte of the TCP
+header (at nexthdr+13). As for the second selector, if we'd like
+to make our lives harder, we could write match u8 0x06 0xff at 9
+instead of using the specific selector protocol tcp, because
+6 is the protocol number of TCP, present in the 10th byte of the IP header.
+On the other hand, in this example we couldn't use any specific selector
+for the first match -- simply because there's no specific selector to match
+TCP ACK bits.
+
+
+The following table contains a list of all the specific selectors
+the author of this section has found in the tc program
+source code. They simply make your life easier and increase the readability
+of your filter's configuration.
+
+FIXME: table placeholder - the table is in the separate file 'selector.html'
+
+FIXME: it's also still in Polish :-(
+
+FIXME: must be sgml'ized
+
+Some examples:
+
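The examples are elided in this hunk; hedged equivalents of the earlier general selectors, rewritten with specific selectors (device names and flowids assumed):

```shell
# Interactive (minimize-delay TOS) traffic:
tc filter add dev ppp0 parent 1:0 prio 10 u32 \
     match ip tos 0x10 0xff \
     flowid 1:4
# Traffic to the SSH port:
tc filter add dev ppp0 parent 1:0 prio 10 u32 \
     match ip dport 22 0xffff \
     flowid 1:3
```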
+
- FIXME: Doesn't work
-
This classifier filters based on the results of the routing tables. When a
packet that is traversing through the classes reaches one that is marked
with the "route" filter, it splits the packets up based on information in
@@ -862,16 +1513,45 @@ FIXME: What are the other possibilities?
send it to the given class and give it a priority of 100. Then, to finally
kick it into action, you add the appropriate routing entry:
+ The trick here is to define 'realm' based on either destination or source.
+ The way to do it is like this:
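A sketch with entirely hypothetical addresses: tag a route with a realm so the 'route' classifier can select on it.

```shell
# All traffic routed via this entry gets realm 10 (addresses hypothetical).
ip route add 192.168.10.0/24 via 192.168.10.1 dev eth1 realm 10
```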
+
FIXME: Fill me in
@@ -1294,6 +1974,8 @@ We then create classes for our customers:
Then we add filters for our two classes:
This is the low-down on how the packet queueing system really works.
+
+Lists the steps the kernel takes to classify a packet, etc...
+
+FIXME: Write this.
+
+ Go through Alexey's extremely tricky example involving the unused bits
+in the TOS field.
+
+FIXME: Write this.
+
+ I'd like to include a brief description of other packet shaping systems
+in other operating systems and how they compare to the Linux one. Since Linux
+is one of the few OSes that has a completely original (non-BSD derived) TCP/IP
+stack, I think it would be useful to see how other people do it.
+
+Unfortunately I have no experience with other systems so I cannot write this.
+
+FIXME: Anyone? - Martijn
@@ -1574,7 +2278,7 @@ less. We may include a section on this at a later date.
url="http://www.cisco.com/univercd/cc/td/doc/product/software/ios111/cc111/car.htm"
name="IOS Committed Access Rate">