From 888e31004c849bf4c3580c978d74bc88efd4bf41 Mon Sep 17 00:00:00 2001 From: gferg <> Date: Fri, 26 May 2000 20:03:38 +0000 Subject: [PATCH] updated --- LDP/howto/linuxdoc/Adv-Routing-HOWTO.sgml | 791 ++++++++++++++++++++-- LDP/howto/linuxdoc/HOWTO-INDEX.sgml | 6 +- 2 files changed, 755 insertions(+), 42 deletions(-) diff --git a/LDP/howto/linuxdoc/Adv-Routing-HOWTO.sgml b/LDP/howto/linuxdoc/Adv-Routing-HOWTO.sgml index 73e5e197..3a454eb2 100644 --- a/LDP/howto/linuxdoc/Adv-Routing-HOWTO.sgml +++ b/LDP/howto/linuxdoc/Adv-Routing-HOWTO.sgml @@ -1,3 +1,4 @@ + Linux 2.4 Advanced Routing HOWTO -<author>bert hubert <ahu@ds9a.nl> &nl; +<author>Netherlabs BV (bert hubert <bert.hubert@netherlabs.nl>)&nl; Gregory Maxwell <greg@linuxpower.cx> &nl; +Remco van Mook <remco@virtu.nl> &nl; Martijn van Oosterhout <kleptog@cupid.suninternet.com> &nl; +Paul B Schroeder <paulsch@us.ibm.com> &nl; howto@ds9a.nl -<date>v0.0.3 $Date$ +<date>v0.1.0 $Date$ <abstract> A very hands-on approach to iproute2, traffic shaping and a bit of netfilter </abstract> @@ -193,6 +196,252 @@ name="Rusty's Remarkably Unreliable Guides"> We will be focusing mostly on what is possible by combining netfilter and iproute2. +<sect>Introduction to iproute2 +<sect1>Why iproute2? +<p> +Most Linux distributions, and most UNIX's, currently use the +venerable 'arp', 'ifconfig' and 'route' commands. While these tools work, +they show some unexpected behaviour under Linux 2.2 and up. For example, GRE +tunnels are an integral part of routing these days, but require completely +different tools. + +With iproute2, tunnels are an integral part of the tool set + +The 2.2 and above Linux kernels include a completely redesigned network +subsystem. This new networking code brings Linux performance and a feature +set with little competition in the general OS arena. 
In fact, the new +routing, filtering, and classifying code is more featureful than that +provided by many dedicated routers and firewalls and traffic shaping +products. + +As new networking concepts have been invented, people have found ways to +plaster them on top of the existing framework in existing OSes. This +constant layering of cruft has led to networking code that is filled with +strange behaviour, much like most human languages. In the past, Linux +emulated SunOS's handling of many of these things, which was not ideal. + +This new framework has made it possible to clearly express features +previously not possible. + +<sect1>Iproute2 tour +<p> +Linux has a sophisticated system for bandwidth provisioning called Traffic +Control. This system supports various methods for classifying, prioritising, +sharing, and limiting both inbound and outbound traffic. + + +We'll start off with a tiny tour of iproute2 possibilities. +<sect1>Prerequisites +<p> +You should make sure that you have the userland tools installed. This +package is called 'iproute' on both RedHat and Debian, and may otherwise be +found at <tt>ftp://ftp.inr.ac.ru/ip-routing/iproute2-2.2.4-now-ss??????.tar.gz</tt>. +Some parts of iproute require you to have certain kernel options enabled. + +FIXME: We should mention <url url="ftp://ftp.inr.ac.ru/ip-routing/iproute2-current.tar.gz"> +is always the latest + +<sect1>Exploring your current configuration +<p> +This may come as a surprise, but iproute2 is already configured! The current +commands <tt>ifconfig</tt> and <tt>route</tt> are already using the advanced +syscalls, but mostly with very default (i.e., boring) settings. + +The <tt>ip</tt> tool is central, and we'll ask it to display our interfaces +for us. 
+<sect2><tt>ip</tt> shows us our links +<p> +<tscreen><verb> +[ahu@home ahu]$ ip link list +1: lo: <LOOPBACK,UP> mtu 3924 qdisc noqueue + link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 +2: dummy: <BROADCAST,NOARP> mtu 1500 qdisc noop + link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff +3: eth0: <BROADCAST,MULTICAST,PROMISC,UP> mtu 1400 qdisc pfifo_fast qlen 100 + link/ether 48:54:e8:2a:47:16 brd ff:ff:ff:ff:ff:ff +4: eth1: <BROADCAST,MULTICAST,PROMISC,UP> mtu 1500 qdisc pfifo_fast qlen 100 + link/ether 00:e0:4c:39:24:78 brd ff:ff:ff:ff:ff:ff +3764: ppp0: <POINTOPOINT,MULTICAST,NOARP,UP> mtu 1492 qdisc pfifo_fast qlen 10 + link/ppp + +</verb></tscreen> +<p>Your mileage may vary, but this is what it shows on my NAT router at +home. I'll only explain part of the output as not everything is directly +relevant. + +We first see the loopback interface. While your computer may function +somewhat without one, I'd advise against it. The MTU (maximum transmission +unit) is 3924 octets, and it is not supposed to queue, which makes sense +because the loopback interface is a figment of your kernel's imagination. + +I'll skip the dummy interface for now, and it may not be present on your +computer. Then there are my two network interfaces, one at the side of my +cable modem, the other serves my home ethernet segment. Furthermore, we see +a ppp0 interface. + +Note the absence of IP addresses. Iproute disconnects the concept of 'links' +and 'IP addresses'. With IP aliasing, the concept of 'the' IP address had +become quite irrelevant anyhow. + +It does show us the MAC addresses though, the hardware identifier of our +ethernet interfaces. 
+<sect2><tt>ip</tt> shows us our IP addresses +<p> +<tscreen><verb> +[ahu@home ahu]$ ip address show +1: lo: <LOOPBACK,UP> mtu 3924 qdisc noqueue + link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 + inet 127.0.0.1/8 brd 127.255.255.255 scope host lo +2: dummy: <BROADCAST,NOARP> mtu 1500 qdisc noop + link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff +3: eth0: <BROADCAST,MULTICAST,PROMISC,UP> mtu 1400 qdisc pfifo_fast qlen 100 + link/ether 48:54:e8:2a:47:16 brd ff:ff:ff:ff:ff:ff + inet 10.0.0.1/8 brd 10.255.255.255 scope global eth0 +4: eth1: <BROADCAST,MULTICAST,PROMISC,UP> mtu 1500 qdisc pfifo_fast qlen 100 + link/ether 00:e0:4c:39:24:78 brd ff:ff:ff:ff:ff:ff +3764: ppp0: <POINTOPOINT,MULTICAST,NOARP,UP> mtu 1492 qdisc pfifo_fast qlen 10 + link/ppp + inet 212.64.94.251 peer 212.64.94.1/32 scope global ppp0 +</verb></tscreen> +<p> +This contains more information. It shows all our addresses, and to which +cards they belong. 'inet' stands for Internet. There are lots of other +address families, but these don't concern us right now. + +Let's examine eth0 somewhat closer. It says that it is related to the inet +address '10.0.0.1/8'. What does this mean? The /8 stands for the number of +bits that are in the Network Address. There are 32 bits, so we have 24 bits +left for addresses within our network. The first 8 bits of 10.0.0.1 correspond +to 10.0.0.0, our Network Address, and our netmask is 255.0.0.0. + +The other bits are connected to this interface, so 10.250.3.13 is directly +available on eth0, as is 10.0.0.1 for example. + +The same concept applies to ppp0, though the numbers are different. Its +address is 212.64.94.251, without a subnet mask. This means that we have a +point-to-point connection and that every address, with the exception of +212.64.94.251, is remote. There is more information, however: it tells us +that on the other side of the link is yet again only one address, +212.64.94.1. The /32 tells us that all 32 bits are network bits, so this +'network' consists of that single address. 
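The prefix arithmetic described above can be checked with a few lines of shell. This is only a sketch: the function names (<tt>to_quad</tt>, <tt>mask_bits</tt>, <tt>prefix_to_mask</tt>, <tt>network_of</tt>) are made up for illustration and are not part of iproute2, and it assumes a shell whose arithmetic is wider than 32 bits (bash or dash on a current system).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of iproute2.
# Assumes shell arithmetic wider than 32 bits.

# to_quad: print a 32-bit number as a dotted quad.
to_quad() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# mask_bits: print the netmask for a prefix length (0-32) as a number.
mask_bits() {
    if [ "$1" -eq 0 ]; then
        echo 0
    else
        echo $(( (0xffffffff << (32 - $1)) & 0xffffffff ))
    fi
}

# prefix_to_mask: prefix length -> dotted-quad netmask.
prefix_to_mask() {
    to_quad "$(mask_bits "$1")"
}

# network_of: address and prefix length -> network address.
network_of() {
    prefix=$2
    old_ifs=$IFS; IFS=.
    set -- $1                      # split a.b.c.d into $1..$4
    IFS=$old_ifs
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    to_quad $(( addr & $(mask_bits "$prefix") ))
}

prefix_to_mask 8             # netmask of eth0 (10.0.0.1/8)
network_of 10.250.3.13 8     # network that 10.250.3.13 falls in
prefix_to_mask 32            # a /32 covers exactly one address
```

The first two calls reproduce the netmask (255.0.0.0) and Network Address (10.0.0.0) quoted for eth0. The last shows why the ppp0 peer, with its /32, stands alone.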
+ +It is absolutely vital that you grasp these concepts. Refer to the +documentation mentioned at the beginning of this HOWTO if you have trouble. + +You may also note 'qdisc', which stands for Queueing Discipline. This will +become vital later on. + +<sect2><tt>ip</tt> shows us our routes +<p> +Well, we now know how to find 10.x.y.z addresses, and we are able to reach +212.64.94.1. This is not enough, however, so we need instructions on how to +reach the world. The internet is available via our ppp connection, and it +appears that 212.64.94.1 is willing to spread our packets around the +world, and deliver results back to us. + +<tscreen><verb> +[ahu@home ahu]$ ip route show +212.64.94.1 dev ppp0 proto kernel scope link src 212.64.94.251 +10.0.0.0/8 dev eth0 proto kernel scope link src 10.0.0.1 +127.0.0.0/8 dev lo scope link +default via 212.64.94.1 dev ppp0 +</verb></tscreen> + +This is pretty much self-explanatory. The first 3 lines of output explicitly +state what was already implied by <tt>ip address show</tt>; the last line +tells us that the rest of the world can be found via 212.64.94.1, our +default gateway. We can see that it is a gateway because of the word +via, which tells us that we need to send packets to 212.64.94.1, and that it +will take care of things. + +For reference, this is what the old 'route' utility shows us: +<tscreen><verb> +[ahu@home ahu]$ route -n +Kernel IP routing table +Destination Gateway Genmask Flags Metric Ref Use Iface +212.64.94.1 0.0.0.0 255.255.255.255 UH 0 0 0 ppp0 +10.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 eth0 +127.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 lo +0.0.0.0 212.64.94.1 0.0.0.0 UG 0 0 0 ppp0 +</verb></tscreen> + +<sect1>ARP +<p> +ARP is the Address Resolution Protocol as described in +<url url="http://www.faqs.org/rfcs/rfc826.html" name="RFC 826">. +ARP is used by a networked machine to resolve the hardware location/address of +another machine on the same +local network. 
Machines on the Internet are generally known by their names +which resolve to IP +addresses. This is how a machine on the foo.com network is able to communicate +with another machine which is on the bar.net network. An IP address, though, +cannot tell you the physical location of a machine. This is where ARP comes +into the picture. + +Let's take a very simple example. Suppose I have a network composed of several +machines. Two of the machines which are currently on my network are foo +with an IP address of 10.0.0.1 and bar with an IP address of 10.0.0.2. +Now foo wants to ping bar to see that he is alive, but alas, foo has no idea +where bar is. So when foo decides to ping bar he will need to send +out an ARP request. +This ARP request is akin to foo shouting out on the network "Bar (10.0.0.2)! +Where are you?" As a result of this, every machine on the network will hear +foo shouting, but only bar (10.0.0.2) will respond. Bar will then send an +ARP reply directly back to foo which is akin to +bar saying, +"Foo (10.0.0.1) I am here at 00:60:94:E9:08:12." After this simple transaction +used to locate his friend on the network, foo is able to communicate with bar +until he (his arp cache) forgets where bar is. + +Now let's see how this works. +You can view your machine's current arp/neighbor cache/table like so: +<tscreen><verb> +[root@espa041 /home/src/iputils]# ip neigh show +9.3.76.42 dev eth0 lladdr 00:60:08:3f:e9:f9 nud reachable +9.3.76.1 dev eth0 lladdr 00:06:29:21:73:c8 nud reachable +</verb></tscreen> + +As you can see my machine espa041 (9.3.76.41) knows where to find espa042 +(9.3.76.42) and +espagate (9.3.76.1). Now let's add another machine to the arp cache. + +<tscreen><verb> +[root@espa041 /home/paulsch/.gnome-desktop]# ping -c 1 espa043 +PING espa043.austin.ibm.com (9.3.76.43) from 9.3.76.41 : 56(84) bytes of data. 
+64 bytes from 9.3.76.43: icmp_seq=0 ttl=255 time=0.9 ms + +--- espa043.austin.ibm.com ping statistics --- +1 packets transmitted, 1 packets received, 0% packet loss +round-trip min/avg/max = 0.9/0.9/0.9 ms + +[root@espa041 /home/src/iputils]# ip neigh show +9.3.76.43 dev eth0 lladdr 00:06:29:21:80:20 nud reachable +9.3.76.42 dev eth0 lladdr 00:60:08:3f:e9:f9 nud reachable +9.3.76.1 dev eth0 lladdr 00:06:29:21:73:c8 nud reachable +</verb></tscreen> + +As a result of espa041 trying to contact espa043, espa043's hardware +address/location has now been added to the arp/neighbor cache. +So until the entry for +espa043 times out (as a result of no communication between the two) espa041 +knows where to find espa043 and has no need to send an ARP request. + +Now let's delete espa043 from our arp cache: + +<tscreen><verb> +[root@espa041 /home/src/iputils]# ip neigh delete 9.3.76.43 dev eth0 +[root@espa041 /home/src/iputils]# ip neigh show +9.3.76.43 dev eth0 nud failed +9.3.76.42 dev eth0 lladdr 00:60:08:3f:e9:f9 nud reachable +9.3.76.1 dev eth0 lladdr 00:06:29:21:73:c8 nud stale +</verb></tscreen> + +Now espa041 has again forgotten where to find espa043 and will need to send +another ARP request the next time he needs to communicate with espa043. +You can also see from the above output that espagate (9.3.76.1) has been +changed to the "stale" state. This means that the location shown is still +valid, but it will have to be confirmed at the first transaction to that +machine. <sect>Rules - routing policy database <p> @@ -200,6 +449,9 @@ If you have a large router, you may well cater for the needs of different people, who should be served differently. The routing policy database allows you to do this by having multiple sets of routing tables. +If you want to use this feature, make sure that your kernel is compiled with +the "IP: policy routing" feature. + When the kernel needs to make a routing decision, it finds out which table needs to be consulted. 
By default, there are three tables. The old 'route' tool modifies the main and local tables, as does the ip tool (by default). @@ -212,7 +464,7 @@ The default rules: 32767: from all lookup default </verb></tscreen> -This lists the priority of a rules. We see that all rules apply to all +This lists the priority of all rules. We see that all rules apply to all packets ('from all'). We've seen the 'main' table before, it's output by <tt>ip route ls</tt>, but the 'local' and 'default' table are new. @@ -285,7 +537,185 @@ And we are done. It is left as an exercise for the reader to implement this in ip-up. <sect>GRE and other tunnels <p> -FIXME: waiting for our feature tunnel editor to finish his stuff +There are 3 kinds of tunnels in Linux. There's IP in IP tunneling, GRE tunneling and tunnels that live outside the kernel (like, for example PPTP). +<sect1>A few general remarks about tunnels: +<p> +Tunnels can be used to do some very unusual and very cool stuff. They can also make things go horribly wrong when you don't configure them right. Don't point your default route to a tunnel device unless you know _exactly_ what you are doing :-). Furthermore, tunneling increases overhead, because it needs an extra set of IP headers. Typically this is 20 bytes per packet, so if the normal packet size (MTU) on a network is 1500 bytes, a packet that is sent through a tunnel can only be 1480 bytes big. This is not necessarily a problem, but be sure to read up on IP packet fragmentation/reassembly when you plan to connect large networks with tunnels. Oh, and of course, the fastest way to dig a tunnel is to dig at both sides. +<p> +<sect1>IP in IP tunneling +<p> +This kind of tunneling has been available in Linux for a long time. It requires 2 kernel modules, +ipip.o and new_tunnel.o. + +Let's say you have 3 networks: Internal networks A and B, and intermediate network C (or let's say, Internet). 
+So we have network A: + +<tscreen><verb> +network 10.0.1.0 +netmask 255.255.255.0 +router 10.0.1.1 +</verb></tscreen> +The router has address 172.16.17.18 on network C. + +and network B: +<tscreen><verb> +network 10.0.2.0 +netmask 255.255.255.0 +router 10.0.2.1 +</verb></tscreen> +The router has address 172.19.20.21 on network C. + +As far as network C is concerned, we assume that it will pass any packet sent +from A to B and vice versa. You might even use the Internet for this. + +Here's what you do: + +First, make sure the modules are installed: + +<tscreen><verb> +insmod ipip.o +insmod new_tunnel.o +</verb></tscreen> +Then, on the router of network A, you do the following: +<tscreen><verb> +ifconfig tunl0 10.0.1.1 pointopoint 172.19.20.21 +route add -net 10.0.2.0 netmask 255.255.255.0 dev tunl0 +</verb></tscreen> +And on the router of network B: +<tscreen><verb> +ifconfig tunl0 10.0.2.1 pointopoint 172.16.17.18 +route add -net 10.0.1.0 netmask 255.255.255.0 dev tunl0 +</verb></tscreen> +And if you're finished with your tunnel: +<tscreen><verb> +ifconfig tunl0 down +</verb></tscreen> +Presto, you're done. You can't forward broadcast or IPv6 traffic through +an IP-in-IP tunnel, though. You just connect 2 IPv4 networks that normally wouldn't be able to talk to each other, that's all. As far as compatibility goes, this code has been around a long time, so it's compatible all the way back to 1.3 kernels. Linux IP-in-IP tunneling doesn't work with other Operating Systems or routers, as far as I know. It's simple, it works. Use it if you have to, otherwise use GRE. + +<sect1>GRE tunneling +<p> +GRE is a tunneling protocol that was originally developed by Cisco, and it +can do a few more things than IP-in-IP tunneling. For example, you can also +transport multicast traffic and IPv6 through a GRE tunnel. + +In Linux, you'll need the ip_gre module. 
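Whether the module is already present can be checked before going on. The following is a minimal sketch, assuming GRE support was built as a module rather than compiled into the kernel. The helper name <tt>gre_module_listed</tt> is made up for illustration.

```shell
#!/bin/sh
# Sketch only: gre_module_listed is a hypothetical helper.
# It succeeds if ip_gre appears in a module list.
# $1 is a file in /proc/modules format; it defaults to /proc/modules.
gre_module_listed() {
    grep -q '^ip_gre ' "${1:-/proc/modules}" 2>/dev/null
}

if gre_module_listed; then
    echo "ip_gre is loaded"
else
    # Not listed: either it is not loaded yet, or GRE is built into the kernel.
    echo "ip_gre not listed; as root, try: modprobe ip_gre"
fi
```

If GRE is compiled into the kernel, the module never shows up in the list, so a "not listed" result does not by itself mean that GRE is unavailable.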
+ +<sect2>IPv4 Tunneling +<p> +Let's do IPv4 tunneling first: + +Let's say you have 3 networks: Internal networks A and B, and intermediate network C (or let's say, Internet). + +So we have network A: +<tscreen><verb> +network 10.0.1.0 +netmask 255.255.255.0 +router 10.0.1.1 +</verb></tscreen> +The router has address 172.16.17.18 on network C. +Let's call this network neta (ok, hardly original) + +and network B: +<tscreen><verb> +network 10.0.2.0 +netmask 255.255.255.0 +router 10.0.2.1 +</verb></tscreen> +The router has address 172.19.20.21 on network C. +Let's call this network netb (still not original) + +As far as network C is concerned, we assume that it will pass any packet sent +from A to B and vice versa. How and why, we do not care. +<p> +On the router of network A, you do the following: +<tscreen><verb> +ip tunnel add netb mode gre remote 172.19.20.21 local 172.16.17.18 ttl 255 +ip addr add 10.0.1.1 dev netb +ip route add 10.0.2.0/24 dev netb +</verb></tscreen> + +Let's discuss this for a bit. In line 1, we added a tunnel device, and +called it netb (which is kind of obvious because that's where we want it to +go). Furthermore we told it to use the GRE protocol (mode gre), that the +remote address is 172.19.20.21 (the router at the other end), that our +tunneling packets should originate from 172.16.17.18 (which allows your +router to have several IP addresses on network C and let you decide which +one to use for tunneling) and that the TTL field of the packet should be set +to 255 (ttl 255). + +In the second line we gave the newly born interface netb the address +10.0.1.1. This is OK for smaller networks, but when you're starting up a +mining expedition (LOTS of tunnels), you might want to consider using +another IP range for tunneling interfaces (in this example, you could use +10.0.3.0). + +<p>In the third line we set the route for network B. Note the different notation for the netmask. 
If you're not familiar with this notation, here's how it works: you write out the netmask in binary form, and you count all the ones. If you don't know how to do that, just remember that 255.0.0.0 is /8, 255.255.0.0 is /16 and 255.255.255.0 is /24. Oh, and 255.255.254.0 is /23, in case you were wondering. +<p> +But enough about this, let's go on with the router of network B. +<tscreen><verb> +ip tunnel add neta mode gre remote 172.16.17.18 local 172.19.20.21 ttl 255 +ip addr add 10.0.2.1 dev neta +ip route add 10.0.1.0/24 dev neta +</verb></tscreen> +And when you want to remove the tunnel on router A: +<tscreen><verb> +ip link set netb down +ip tunnel del netb +</verb></tscreen> +Of course, you can replace netb with neta for router B. + +<sect2>IPv6 Tunneling +<p> + +BIG FAT WARNING !! + +The following is untested and might therefore be +complete and utter BOLLOCKS. Proceed at your own risk. Don't say I didn't +warn you. + +FIXME: check & try all this + +<p> +A short bit about IPv6 addresses:<p> +IPv6 addresses are, compared to IPv4 addresses, monstrously big. An example: +<verb>3ffe:2502:200:40:281:48fe:dcfe:d9bc</verb> +So, to make writing them down easier, there are a few rules: +<itemize> +<item>Don't use leading zeroes. Same as in IPv4. +<item>Use colons to separate every 16 bits or two bytes. +<item>When you have lots of consecutive zeroes, you can write this down as ::. You can only do this once in an address and only for quantities of 16 bits, though. +</itemize> +Using these rules, the address 3ffe:0000:0000:0000:0000:0020:34A1:F32C can be written down as 3ffe::20:34A1:F32C, which is a lot shorter. +<p> +On with the tunnels. + +Let's assume that you have the following IPv6 network, and you want to connect it to 6bone, or a friend. + +<tscreen><verb> +Network 3ffe:406:5:1:5:a:2:1/96 +</verb></tscreen> +Your IPv4 address is 172.16.17.18, and the 6bone router has IPv4 address 172.22.23.24. 
+<p> +<tscreen><verb> +ip tunnel add sixbone mode sit remote 172.22.23.24 local 172.16.17.18 ttl 255 +ip link set sixbone up +ip addr add 3ffe:406:5:1:5:a:2:1/96 dev sixbone +ip route add 3ffe::/15 dev sixbone +</verb></tscreen> + +Let's discuss this. In the first line, we created a tunnel device called sixbone. We gave it mode sit (which is IPv6 in IPv4 tunneling) and told it where to go to (remote) and where to come from (local). TTL is set to maximum, 255. Next, we made the device active (up). After that, we added our own network address, and set a route for 3ffe::/15 (which is currently all of 6bone) through the tunnel. +<p> +GRE tunnels are currently the preferred type of tunneling. It's a standard that's also widely adopted outside the Linux community and therefore a Good Thing. +<p> +<sect1>Userland tunnels +<p> +There are literally dozens of implementations of tunneling outside the kernel. Best known are of course PPP and PPTP, but there are lots more (some proprietary, some secure, some that don't even use IP) and that is really beyond the scope of this HOWTO. + +<sect>IPsec: secure IP over the internet +<p> +FIXME: Waiting for our feature editor Stefan to finish his stuf + <sect>Multicast routing <p> FIXME: Editor Vacancy! @@ -566,6 +996,10 @@ We now need to create two new classes, within our Office class: FIXME: Finish this example! +<sect1>Loadsharing over multiple interfaces +<p> +FIXME: document TEQL + <sect>More queueing disciplines <p> The Linux kernel offers us lots of queueing disciplines. By far the most @@ -588,26 +1022,66 @@ SFQ, as said earlier, is not quite deterministic, but works (on average). Its main benefits are that it requires little CPU and memory. 'Real' fair queueing requires that the kernel keep track of all running sessions. -This is far too much work so SFQ keeps track of only a number of sessions by -tracking things based on a hash. 
Two different sessions might end up in the -same hash, which isn't very bad but should not be a permanent situation. -Therefore the kernel perturbs the hash with a certain frequency, which can -be specified on the <tt>tc</tt> command line. +Stochastic Fairness Queueing (SFQ) is a simple implementation +of the fair queueing family of algorithms. It's less accurate than +others, but it also requires fewer calculations while being +almost perfectly fair. + +The key word in SFQ is conversation (or flow), being a sequence +of data packets having enough common parameters to distinguish +it from other conversations. The parameters used in the case of +IP packets are source and destination address, and the protocol +number. + +SFQ consists of a dynamically allocated number of FIFO queues, +one queue for one conversation. The discipline runs in round-robin, +sending one packet from each FIFO in one turn, and this is why +it's called fair. The main advantage of SFQ is that it allows +the link to be shared fairly between several applications and prevents +bandwidth take-over by one client. SFQ however cannot distinguish +interactive flows from bulk ones -- one usually needs to do +the selection with CBQ before, and then direct the bulk traffic +into SFQ. + + <sect1>Token Bucket Filter <p> -This queue is very straightforward. Imagine a bucket, which holds a number -of tokens. Tokens are added with a certain frequency, until the bucket fills -up. By then, the bucket contains 'b' tokens. +The Token Bucket Filter (TBF) is a simple queue that only passes packets +arriving at a rate within some administratively set limit, with the +possibility of buffering short bursts. -Whenever packets arrive, they are stored. If there are more tokens than -packets, these packets are sent out ('dequeued') immediately in a burst -transfer. +The TBF implementation consists of a buffer (bucket), constantly filled by +some virtual pieces of information called tokens, at a specific rate (token +rate). 
The most important parameter of the bucket is its size, that is, the +number of tokens it can store. -If there are more packets then tokens, all packets for which there is a -token are sent off, the rest have to wait for new tokens to arrive. So, if -the size of a token is, say, 1000 octets, and we add 8 tokens per second, -our eventual data rate is 64kilobit per second, excluding a -certain 'burstiness' that we allow. +Each arriving token lets one incoming data packet out of the queue and is +then deleted from the bucket. Associating this algorithm with the two flows +-- token and data -- gives us three possible scenarios: + +<itemize> + +<item> The data arrives into TBF at a rate <em>equal</em> to the rate of incoming +tokens. In this case each incoming packet has its matching token and passes +the queue without delay. + +<item> The data arrives into TBF at a rate <em>smaller</em> than the token rate. +Only some of the tokens are deleted as data packets are sent out of the +queue, so the tokens accumulate, up to the bucket size. The saved tokens can +then be used to send data over the token rate, if a short data burst occurs. + +<item> The data arrives into TBF at a rate <em>bigger</em> than the token rate. In +this case filter overrun occurs -- incoming data can only be sent out +without loss until all accumulated tokens are used. After that, overlimit +packets are dropped. + +</itemize> + +<p> The last scenario is very important, because it allows us to +administratively shape the bandwidth available to data passing the filter. +The accumulation of tokens allows short bursts of overlimit data to still be +passed without loss, but any lasting overload will cause packets to be +constantly dropped. The Linux kernel seems to go beyond this specification, and also allows us to limit the speed of the burst transmission. However, Alexey warns us: @@ -684,7 +1158,9 @@ how this works. So far we've seen how iproute works, and netfilter was mentioned a few times. 
This would be a good time to browse through <url name="Rusty's Remarkably Unreliable guides" -url="http://netfilter.kernelnotes.org/unreliable-guides/">. +url="http://netfilter.kernelnotes.org/unreliable-guides/">. Netfilter itself +can be found <url name="here" +url="http://antarctica.penguincomputing.com/~netfilter/">. Netfilter allows us to filter packets, or mangle their headers. One special feature is that we can mark a packet with a number. This is done with the @@ -799,8 +1275,6 @@ is 1:1. <p> The "fw" classifier relies on the firewall tagging the packets to be shaped. So, first we will setup the firewall to tag them: -FIXME: Equivalent iptables command? - <tscreen><verb> # iptables -I PREROUTING -t mangle -p tcp -d HostA \ -j MARK --set-mark 1 @@ -827,26 +1301,203 @@ FIXME: Equivalent iptables command? <sect1>The "u32" classifier <p> - The "u32" classifier is a filter that filters directly based on the - contents of the packet. Thus it can filter based on source or destination - addresses or ports. It can filter based on the TOS and other truly bizarre - fields. It does this by taking a specification of the form - [offset/mask/value] and applying that to all the packets. Fortunately you - can use symbolic names much as with tcpdump. +The U32 filter is the most advanced filter available in the current +implementation. It is entirely based on hash tables, which make it +robust when there are many filter rules. + +In its simplest form the U32 filter is a list of records, each +consisting of two fields: a selector and an action. The selectors, +described below, are compared with the currently processed IP packet +until the first match occurs, and then the associated action is performed. The +simplest type of action would be directing the packet into a defined +CBQ class. + +The command line of the <tt>tc filter</tt> program, used to configure the filter, +consists of three parts: filter specification, a selector and an action. 
+The filter specification can be defined as: <tscreen><verb> -# tc filter add dev eth1 parent 1:0 protocol ip prio 1 u32 match ip dst HostA flowid 1:1 +tc filter add dev IF [ protocol PROTO ] + [ (preference|priority) PRIO ] + [ parent CBQ ] </verb></tscreen> -FIXME: What are the other possibilities? +The <tt>protocol</tt> field describes the protocol that the filter will be +applied to. We will only discuss the case of the <tt>ip</tt> protocol. The +<tt>preference</tt> field (<tt>priority</tt> can be used alternatively) +sets the priority of the currently defined filter. This is important, since +you can have several filters (lists of rules) with different priorities. +Each list is traversed in the order the rules were added; then the list with +the next lower priority (higher preference number) is processed. The <tt>parent</tt> +field defines the top of the CBQ tree (e.g. 1:0) to which the filter should be +attached. - That all there is to it. +The options described apply to all filters, not only U32. + +<sect2>U32 selector + +<p> +The U32 selector contains the definition of the pattern that will be matched +against the currently processed packet. Precisely, it defines which bits are +to be matched in the packet header and nothing more, but this simple +method is very powerful. Let's take a look at the following examples, +taken directly from a pretty complex, real-world filter: + +<tscreen><verb> +# filter parent 1: protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:3 \ + match 00100000/00ff0000 at 0 +</verb></tscreen> + +<p> +For now, leave the first line alone - all these parameters describe +the filter's hash tables. Focus on the selector line, containing the +<tt>match</tt> keyword. This selector will match IP headers whose +second byte is 0x10 (0010). As you can guess, the 00ff number is +the match mask, telling the filter exactly which bits to match. Here +it's 0xff, so the byte will match if it's exactly 0x10. 
The <tt>at</tt> +keyword means that the match is to be started at the specified offset (in +bytes) -- in this case it's the beginning of the packet. Translating all +that to human language, the packet will match if its Type of Service +field has the 'low delay' bits set. Let's analyze another rule: + +<tscreen><verb> +# filter parent 1: protocol ip pref 10 u32 fh 800::803 order 2051 key ht 800 bkt 0 flowid 1:3 \ + match 00000016/0000ffff at nexthdr+0 +</verb></tscreen> + +<p> +The <tt>nexthdr</tt> option means the next header encapsulated in the IP packet, +i.e. the header of the upper-layer protocol. The match will also start here +at the beginning of the next header. The mask selects the second 16 bits +of the first 32-bit word of that header. In TCP and UDP protocols this field +contains the packet's destination port. The number is given in big-endian +format, i.e. most significant bits first, so we simply read 0x0016 as 22 decimal, +which stands for the SSH service if this was TCP. As you can guess, this match +is ambiguous without a context, and we will discuss this later. + +<p> +Having understood all the above, we will find the following selector +quite easy to read: <tt>match c0a80100/ffffff00 at 16</tt>. What we +have here is a three-byte match starting at the 17th byte, counting from the IP +header start. This will match packets with a destination address +anywhere in the 192.168.1/24 network. After analyzing the examples, we +can summarize what we have learnt. + +<sect2>General selectors + +<p> +General selectors define the pattern, mask and offset at which the pattern +will be matched against the packet contents. Using the general selectors +you can match virtually any single bit in the IP (or upper layer) +header. They are more difficult to write and read, though, than the +specific selectors that are described below. 
The general selector syntax +is: + +<tscreen><verb> +match [ u32 | u16 | u8 ] PATTERN MASK [ at OFFSET | nexthdr+OFFSET] +</verb></tscreen> + +<p> +One of the keywords <tt>u32</tt>, <tt>u16</tt> or <tt>u8</tt> specifies +the length of the pattern in bits. PATTERN and MASK should follow, of the length +defined by the previous keyword. The OFFSET parameter is the offset, +in bytes, at which to start matching. If the <tt>nexthdr+</tt> keyword is given, +the offset is relative to the start of the upper-layer header. + +<p> +Some examples: + +<tscreen><verb> +# tc filter add dev ppp14 parent 1:0 prio 10 u32 \ + match u8 64 0xff at 8 \ + flowid 1:4 +</verb></tscreen> + +<p> +A packet will match this rule if its time to live (TTL) is 64. +TTL is the field starting just after the 8th byte of the IP header. + +<tscreen><verb> +# tc filter add dev ppp14 parent 1:0 prio 10 u32 \ + match u8 0x10 0xff at nexthdr+13 \ + match ip protocol 6 0xff \ + flowid 1:3 +</verb></tscreen> + +<p> +This rule will only match TCP packets in which ACK is the only flag set, +because the 0xff mask compares the whole flags byte. Here we can see +an example of using two selectors; the final result is the logical AND +of their results. If we take a look at a TCP header diagram, we can see +that the ACK bit has the value 0x10 in the 14th byte of the TCP +header (<tt>at nexthdr+13</tt>). As for the second selector, if we'd like +to make our life harder, we could write <tt>match u8 0x06 0xff at 9</tt> +instead of using the specific selector <tt>match ip protocol 6 0xff</tt>, because +6 is the number of the TCP protocol, present in the 10th byte of the IP header. +On the other hand, in this example we couldn't use any specific selector +for the first match - simply because there's no specific selector to match +TCP ACK bits. + +<sect2>Specific selectors +<p> +The following table contains a list of all specific selectors +the author of this section has found in the <tt>tc</tt> program +source code. They simply make your life easier and increase the readability +of your filter's configuration.
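As a rough illustration of the general selector semantics described above (pattern, mask, offset), the matching can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the names <tt>u32_match</tt> and <tt>nexthdr</tt>, and the sample packet with its addresses and ports, are invented for illustration. It is not the kernel's implementation, which stores every selector as a 32-bit key.

```python
import struct

def u32_match(packet: bytes, pattern: int, mask: int, offset: int, width: int = 4) -> bool:
    """Simplified model of 'match uN PATTERN MASK at OFFSET' (width = N/8 bytes)."""
    if offset < 0 or offset + width > len(packet):
        return False  # the field lies outside the packet: no match
    # Header fields are big-endian (network byte order).
    field = int.from_bytes(packet[offset:offset + width], "big")
    return (field & mask) == (pattern & mask)

def nexthdr(packet: bytes) -> int:
    """Offset of the upper-layer header: IPv4 IHL (low nibble of byte 0) times 4."""
    return (packet[0] & 0x0F) * 4

# Hypothetical sample packet: IPv4 header (TOS 0x10, TTL 64, protocol 6 = TCP,
# 10.0.0.1 -> 192.168.1.7) followed by a TCP header (port 40000 -> 22, ACK only).
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0x10, 40, 0, 0, 64, 6, 0,
                        bytes([10, 0, 0, 1]), bytes([192, 168, 1, 7]))
tcp_header = struct.pack("!HHIIBBHHH", 40000, 22, 0, 0, 0x50, 0x10, 1024, 0, 0)
packet = ip_header + tcp_header

results = {
    "tos 0x10": u32_match(packet, 0x00100000, 0x00FF0000, 0),
    "dst 192.168.1/24": u32_match(packet, 0xC0A80100, 0xFFFFFF00, 16),
    "ttl 64": u32_match(packet, 64, 0xFF, 8, width=1),
    "dport 22": u32_match(packet, 0x00000016, 0x0000FFFF, nexthdr(packet) + 0),
    "ack only": u32_match(packet, 0x10, 0xFF, nexthdr(packet) + 13, width=1),
    "ttl 63": u32_match(packet, 63, 0xFF, 8, width=1),
}
for name, matched in results.items():
    print(f"{name}: {matched}")
```

Each entry in <tt>results</tt> mirrors one selector from the text; the last one uses a TTL the sample packet does not carry, to show a non-matching case.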
+ +FIXME: table placeholder - the table is in a separate file, ``selector.html'' + +FIXME: it's also still in Polish :-( + +FIXME: must be sgml'ized + +Some examples: + +<tscreen><verb> +# tc filter add dev ppp0 parent 1:0 prio 10 u32 \ + match ip tos 0x10 0xff \ + flowid 1:4 +</verb></tscreen> + +The above rule will match packets which have the TOS field set to 0x10. +The TOS field starts at the second byte of the packet and is one byte long, +so we could write an equivalent general selector: <tt>match u8 0x10 0xff +at 1</tt>. This gives us a hint about the internals of the U32 filter -- the +specific rules are always translated to general ones, and in this +form they are stored in the kernel memory. This leads to another conclusion +-- the <tt>tcp</tt> and <tt>udp</tt> selectors are exactly the same, +and this is why you can't use a single <tt>match tcp dst 53 0xffff</tt> +selector to match TCP packets sent to a given port -- it will also +match UDP packets sent to this port. You must remember to also specify +the protocol and end up with the following rule: + +<tscreen><verb> +# tc filter add dev ppp0 parent 1:0 prio 10 u32 \ + match tcp dst 53 0xffff \ + match ip protocol 0x6 0xff \ + flowid 1:2 +</verb></tscreen> + +<!-- +TODO: + +describe more options + +match +offset +hashkey +classid | flowid +divisor +order +link +ht +sample +police + +--> <sect1>The "route" classifier <p> - FIXME: Doesn't work - This classifier filters based on the results of the routing tables. When a packet that is traversing through the classes reaches one that is marked with the "route" filter, it splits the packets up based on information in @@ -862,16 +1513,45 @@ FIXME: What are the other possibilities? send it to the given class and give it a priority of 100. Then, to finally kick it into action, you add the appropriate routing entry: + The trick here is to define 'realm' based on either destination or source.
+ The way to do it is like this: + <tscreen><verb> -# ip route add HostA via Gateway flow 1:1 +# ip route add Host/Network via Gateway dev Device realm RealmNumber +</verb></tscreen> + + For instance, we can define our destination network 192.168.10.0 with a realm + number 10: + +<tscreen><verb> +# ip route add 192.168.10.0/24 via 192.168.10.1 dev eth1 realm 10 +</verb></tscreen> + + When adding route filters, we can use realm numbers to represent the + networks or hosts and specify how the routes match the filters. + +<tscreen><verb> +# tc filter add dev eth1 parent 1:0 protocol ip prio 100 \ + route to 10 classid 1:10 +</verb></tscreen> + + The above rule says packets going to the network 192.168.10.0 match class id + 1:10. + + Route filter can also be used to match source routes. For example, there is + a subnetwork attached to the Linux router on eth2. + +<tscreen><verb> +# ip route add 192.168.2.0/24 dev eth2 realm 2 +# tc filter add dev eth1 parent 1:0 protocol ip prio 100 \ + route from 2 classid 1:2 </verb></tscreen> - [Strangely, though I think I've done everything in the example, this doesn't - seem to work for me. I get an error that goes: + Here the filter specifies that packets from the subnetwork 192.168.2.0 + (realm 2) will match class id 1:2. - Error: either "to" is duplicate, or "flow" is a garbage. - - Someone who knows will have to comment on this.] + + <sect1>The "rsvp" classifier <p>FIXME: Fill me in @@ -1294,6 +1974,8 @@ We then create classes for our customers: Then we add filters for our two classes: <tscreen><verb> ##FIXME: Why this line, what does it do?, what is a divisor?: +##FIXME: A divisor has something to do with a hash table, and the number of +## buckets - ahu # tc filter add dev eth0 parent 1:0 protocol ip prio 5 handle 1: u32 divisor 1 # tc filter add dev eth0 parent 1:0 prio 5 u32 match ip src 188.177.166.1 flowid 1:1 @@ -1488,6 +2170,28 @@ will be quite complex and really not intended for normal users. You have been warned. 
FIXME: Decide what really need to go in here. +<sect1>How does packet queueing really work? +<p>This is the low-down on how the packet queueing system really works. + +Lists the steps the kernel takes to classify a packet, etc... + +FIXME: Write this. + +<sect1>Advanced uses of the packet queueing system +<p>Go through Alexey's extremely tricky example involving the unused bits +in the TOS field. + +FIXME: Write this. + +<sect1>Other packet shaping systems +<p>I'd like to include a brief description of other packet shaping systems +in other operating systems and how they compare to the Linux one. Since Linux +is one of the few OSes that has a completely original (non-BSD derived) TCP/IP +stack, I think it would be useful to see how other people do it. + +Unfortunately I have no experience with other systems, so I cannot write this. + +FIXME: Anyone? - Martijn <sect>Dynamic routing - OSPF and BGP <p> @@ -1574,7 +2278,7 @@ less. We may include a section on this at a later date. url="http://www.cisco.com/univercd/cc/td/doc/product/software/ios111/cc111/car.htm" name="IOS Committed Access Rate"></tag> <label id="CAR"> -From the helpful folks of Cisco who have the laudable habit of putting +>From the helpful folks of Cisco who have the laudable habit of putting their documentation online. Cisco syntax is different but the concepts are the same, except that we can do more and do it without routers the price of cars :-) @@ -1594,8 +2298,13 @@ helping.
<itemize> <item>Jamal Hadi <hadi%cyberus.ca> <item>Nadeem Hasan <nhasan@usa.net> +<item>Jason Lunz <j@cc.gatech.edu> <item>Alexey Mahotkin <alexm@formulabez.ru> +<item>Pawel Krawczyk <kravietz%alfa.ceti.pl> +<item>Wim van der Most <item>Glen Turner <glen.turner%aarnet.edu.au> +<item>Song Wang <wsong@ece.uci.edu> </itemize> </article> + diff --git a/LDP/howto/linuxdoc/HOWTO-INDEX.sgml b/LDP/howto/linuxdoc/HOWTO-INDEX.sgml index eb10b657..a5a7fbd3 100644 --- a/LDP/howto/linuxdoc/HOWTO-INDEX.sgml +++ b/LDP/howto/linuxdoc/HOWTO-INDEX.sgml @@ -121,7 +121,7 @@ Covers using adaptive technology to make Linux accessible to those who could not use it otherwise. <item><htmlurl url="Adv-Routing-HOWTO.html" name="Adv-Routing-HOWTO">, -<bf/Linux 2.4 Advanced Routing HOWTO/ <p><em/Updated: April 2000/. +<bf/Linux 2.4 Advanced Routing HOWTO/ <p><em/Updated: May 2000/. A very hands-on approach to iproute2, traffic shaping and a bit of netfilter. @@ -528,6 +528,10 @@ including Emacs and Ispell. Explains how to setup and then use a filesystem that, when mounted by a user, dynamically and transparently encrypts its contents. +<item><htmlurl url="LVM-HOWTO.html" name="LVM-HOWTO">, +<BF/Logical Volume Manager HOWTO/ <p><em/Updated: May 2000/. +A very hands-on HOWTO for Linux LVM. + <item><htmlurl url="Mail-Administrator-HOWTO.html" name="Mail-Administrator-HOWTO">, <BF/The Linux Electronic Mail Administrator HOWTO/ <p><em/Updated: January 2000/. Describes the setup, care and feeding of Electronic Mail (e-mail)