gferg 2000-10-09 16:01:53 +00:00
parent d2d7ea38f8
commit 6a9f0c5160
1 changed files with 384 additions and 94 deletions


@ -1,4 +1,3 @@
<!doctype linuxdoc system>
<!-- $Id$
@ -31,7 +30,7 @@ This document is dedicated to lots of people, and is my attempt to do
something back. To list but a few:
<p>
<itemize>
<item>Rusty Russel
<item>Rusty Russell
<item>Alexey N. Kuznetsov
<item>The good folks from Google
<item>The staff of Casema Internet
@ -46,7 +45,7 @@ routing. Unbeknownst to most users, you already run tools which allow you to
do spectacular things. Commands like 'route' and 'ifconfig' are actually
very thin wrappers for the very powerful iproute2 infrastructure
<p>
I hope that this HOWTO will become as readable as the ones by Rusty Russel
I hope that this HOWTO will become as readable as the ones by Rusty Russell
of (amongst other things) netfilter fame.
You can always reach us by writing the <url name="HOWTO team"
@ -60,7 +59,8 @@ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
In short, if your STM-64 backbone breaks down and distributes pornography to
your most esteemed customers - it's never our fault. Sorry.
Copyright (c) 2000 by bert hubert, Gregory Maxwell and Martijn van Oosterhout
Copyright (c) 2000 by bert hubert, Gregory Maxwell, Martijn van
Oosterhout, Remco van Mook, Paul B. Schroeder and others.
Please freely copy and distribute (sell or give away) this document in any
format. It's requested that corrections and/or comments be forwarded to the
@ -97,7 +97,7 @@ Here are some orther references which might help learn you more:
<descrip>
<tag><url
url="http://netfilter.kernelnotes.org/unreliable-guides/networking-concepts-HOWTO.html"
name="Rusty Russels networking-concepts-HOWTO"></tag>
name="Rusty Russell's networking-concepts-HOWTO"></tag>
Very nice introduction, explaining what a network is, and how it is
connected to other networks
<tag>Linux Networking-HOWTO (Previously the Net-3 HOWTO)</tag>
@ -180,6 +180,19 @@ A Makefile is supplied which should help you create postscript, dvi, pdf,
html and plain text. You may need to install sgml-tools, ghostscript and
tetex to get all formats.
<sect1>Mailing list
<p>
The authors receive an increasing amount of mail about this HOWTO. Because
of the clear interest of the community, it has been decided to start a
mailinglist where people can talk to each other about Advanced Routing and
Traffic Control. You can subscribe to the list
<url url="http://mailman.ds9a.nl/mailman/listinfo/lartc" name="here">.
<p>
It should be pointed out that the authors are very hesitant to answer
questions not asked on the list. We would like the archive of the list to
become some kind of knowledge base. If you have a question, please search
the archive, and then post to the mailinglist.
<sect1>Layout of this document
<p>
We will be doing interesting stuff almost immediately, which also means that
@ -205,12 +218,12 @@ they show some unexpected behaviour under Linux 2.2 and up. For example, GRE
tunnels are an integral part of routing these days, but require completely
different tools.
With iproute2, tunnels are an integral part of the tool set
With iproute2, tunnels are an integral part of the tool set.
The 2.2 and above Linux kernels include a completely redesigned network
subsystem. This new networking code brings Linux performance and a feature
set with little competition in the general OS arena. In fact, the new
routing filtering, and classifying code is more featureful then that
routing, filtering, and classifying code is more featureful than that
provided by many dedicated routers and firewalls and traffic shaping
products.
@ -220,13 +233,13 @@ constant layering of cruft has lead to networking code that is filled with
strange behaviour, much like most human languages. In the past, Linux
emulated SunOS's handling of many of these things, which was not ideal.
This new framework has made it possible to clearly express features
previously not possible.
This new framework makes it possible to clearly express features
previously beyond Linux's reach.
<sect1>Iproute2 tour
<sect1>iproute2 tour
<p>
Linux has a sophisticated system for bandwidth provisioning called Traffic
Control. This system supports various method for classifying, prioritising,
Control. This system supports various methods for classifying, prioritizing,
sharing, and limiting both inbound and outbound traffic.
@ -238,14 +251,14 @@ package is called 'iproute' on both RedHat and Debian, and may otherwise be
found at <tt>ftp://ftp.inr.ac.ru/ip-routing/iproute2-2.2.4-now-ss??????.tar.gz</tt>.
Some parts of iproute require you to have certain kernel options enabled.
FIXME: We should mention <url url="ftp://ftp.inr.ac.ru/ip-routing/iproute2-current.tar.gz">
is always the latest
You can also try <url name="here" url="ftp://ftp.inr.ac.ru/ip-routing/iproute2-current.tar.gz">
for the latest version.
<sect1>Exploring your current configuration
<p>
This may come as a surprise, but iproute2 is already configured! The current
commands <tt>ifconfig</tt> and <tt>route</tt> are already using the advanced
syscalls, but mostly with very default (ie, boring) settings.
syscalls, but mostly with very default (ie. boring) settings.
The <tt>ip</tt> tool is central, and we'll ask it to display our interfaces
for us.
@ -270,16 +283,16 @@ home. I'll only explain part of the output as not everything is directly
relevant.
We first see the loopback interface. While your computer may function
somewhat without one, I'd advise against it. The mtu size (maximum transfer
unit) is 3924 octects, and it is not supposed to queue. Which makes sense
because the loopback interface is a figment of your kernels imagination.
somewhat without one, I'd advise against it. The MTU size (Maximum Transfer
Unit) is 3924 octets, and it is not supposed to queue. Which makes sense
because the loopback interface is a figment of your kernel's imagination.
I'll skip the dummy interface for now, and it may not be present on your
computer. Then there are my two network interfaces, one at the side of my
computer. Then there are my two physical network interfaces, one at the side of my
cable modem, the other serves my home ethernet segment. Furthermore, we see
a ppp0 interface.
Note the absence of IP addresses. Iproute disconnects the concept of 'links'
Note the absence of IP addresses. iproute disconnects the concept of 'links'
and 'IP addresses'. With IP aliasing, the concept of 'the' IP address had
become quite irrelevant anyhow.
@ -305,10 +318,10 @@ ethernet interfaces.
</verb></tscreen>
<p>
This contains more information. It shows all our addresses, and to which
cards they belong. 'inet' stands for Internet. There are lots of other
cards they belong. 'inet' stands for Internet (IPv4). There are lots of other
address families, but these don't concern us right now.
Lets examine eth0 somewhat closer. It says that it is related to the inet
Let's examine eth0 somewhat closer. It says that it is related to the inet
address '10.0.0.1/8'. What does this mean? The /8 stands for the number of
bits that are in the Network Address. There are 32 bits, so we have 24 bits
left that are part of our network. The first 8 bits of 10.0.0.1 correspond
@ -317,7 +330,7 @@ to 10.0.0.0, our Network Address, and our netmask is 255.0.0.0.
The other bits are connected to this interface, so 10.250.3.13 is directly
available on eth0, as is 10.0.0.1 for example.
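The netmask arithmetic above can be checked with a few lines of Python. This is only a sketch using the standard <tt>ipaddress</tt> module; the helper names <tt>describe</tt> and <tt>is_directly_reachable</tt> are made up for this illustration and are not part of iproute2.

```python
import ipaddress


def describe(interface_cidr):
    """Return (network address, netmask, host bits) for e.g. '10.0.0.1/8'."""
    net = ipaddress.ip_interface(interface_cidr).network
    host_bits = net.max_prefixlen - net.prefixlen  # 32 - 8 = 24 for a /8
    return str(net.network_address), str(net.netmask), host_bits


def is_directly_reachable(interface_cidr, address):
    """True if 'address' lies inside the network attached to the interface."""
    net = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(address) in net


print(describe("10.0.0.1/8"))
print(is_directly_reachable("10.0.0.1/8", "10.250.3.13"))
print(is_directly_reachable("10.0.0.1/8", "212.64.94.251"))
```

The first call gives the Network Address, the netmask and the number of bits left for hosts; the other two show that 10.250.3.13 is on the eth0 network while the ppp0 address is not.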
With ppp0, the same concept goes, though the numbers are different. It's
With ppp0, the same concept goes, though the numbers are different. Its
address is 212.64.94.251, without a subnet mask. This means that we have a
point-to-point connection and that every address, with the exception of
212.64.94.251, is remote. There is more information however, it tells us
@ -450,7 +463,7 @@ people, who should be served differently. The routing policy database allows
you to do this by having multiple sets of routing tables.
If you want to use this feature, make sure that your kernel is compiled with
the "IP: policy routing" feature.
the "IP: advanced router" and "IP: policy routing" features.
When the kernel needs to make a routing decision, it finds out which table
needs to be consulted. By default, there are three tables. The old 'route'
@ -472,7 +485,7 @@ If we want to do fancy things, we generate rules which point to different
tables which allow us to override system wide routing rules.
For the exact semantics on what the kernel does when there are more matching
rules, see Alexey's ip-cfref documentation.
rules, see Alexey's ip-cref documentation.
<sect1>Simple source routing
<p>
@ -540,7 +553,7 @@ in ip-up.
There are 3 kinds of tunnels in Linux. There's IP in IP tunneling, GRE tunneling and tunnels that live outside the kernel (like, for example PPTP).
<sect1>A few general remarks about tunnels:
<p>
Tunnels can be used to do some very unusual and very cool stuff. They can also make things go horribly wrong when you don't configure them right. Don't point your default route to a tunnel device unless you know _exactly_ what you are doing :-). Furthermore, tunneling increases overhead, because it needs an extra set of IP headers. Typically this is 20 bytes per packet, so if the normal packet size (MTU) on a network is 1500 bytes, a packet that is sent through a tunnel can only be 1480 bytes big. This is not necessarily a problem, but be sure to read up on IP packet fragmentation/reassembly when you plan to connect large networks with tunnels. Oh, and of course, the fastest way to dig a tunnel is to dig at both sides.
Tunnels can be used to do some very unusual and very cool stuff. They can also make things go horribly wrong when you don't configure them right. Don't point your default route to a tunnel device unless you know <bf>exactly</bf> what you are doing :-). Furthermore, tunneling increases overhead, because it needs an extra set of IP headers. Typically this is 20 bytes per packet, so if the normal packet size (MTU) on a network is 1500 bytes, a packet that is sent through a tunnel can only be 1480 bytes big. This is not necessarily a problem, but be sure to read up on IP packet fragmentation/reassembly when you plan to connect large networks with tunnels. Oh, and of course, the fastest way to dig a tunnel is to dig at both sides.
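The overhead arithmetic is simple enough to sketch in Python. The 20 byte figure is the one from the text (one extra IPv4 header without options); the function names are invented for this example, and other tunnel types will have a different overhead.

```python
IPV4_HEADER_BYTES = 20  # one extra IPv4 header without options, as in the text


def tunnel_payload_mtu(link_mtu, overhead=IPV4_HEADER_BYTES):
    """Largest packet that fits through the tunnel without fragmentation."""
    return link_mtu - overhead


def needs_fragmentation(packet_size, link_mtu, overhead=IPV4_HEADER_BYTES):
    """True if a packet of this size no longer fits once it is encapsulated."""
    return packet_size > tunnel_payload_mtu(link_mtu, overhead)


print(tunnel_payload_mtu(1500))
print(needs_fragmentation(1500, 1500))
print(needs_fragmentation(1480, 1500))
```

A full-sized 1500 byte packet no longer fits once the extra header is added, which is where fragmentation comes in.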
<p>
<sect1>IP in IP tunneling
<p>
@ -599,7 +612,7 @@ GRE is a tunneling protocol that was originally developed by Cisco, and it
can do a few more things than IP-in-IP tunneling. For example, you can also
transport multicast traffic and IPv6 through a GRE tunnel.
In Linux, you'll need the ip_gre module.
In Linux, you'll need the ip_gre.o module.
<sect2>IPv4 Tunneling
<p>
@ -668,15 +681,6 @@ Of course, you can replace netb with neta for router B.
<sect2>IPv6 Tunneling
<p>
BIG FAT WARNING !!
The following is untested and might therefore be
completely and utter BOLLOCKS. Proceed at your own risk. Don't say I didn't
warn you.
FIXME: check &amp; try all this
<p>
A short bit about IPv6 addresses:<p>
IPv6 addresses are, compared to IPv4 addresses, monstrously big. An example:
<verb>3ffe:2502:200:40:281:48fe:dcfe:d9bc</verb>
@ -718,10 +722,11 @@ FIXME: Waiting for our feature editor Stefan to finish his stuf
<sect>Multicast routing
<p>
FIXME: Editor Vacancy!
FIXME: Editor Vacancy! (somebody is working on it, though)
<sect>Using Class Based Queueing for bandwidth management
<p>
Now, when I discovered this, it *really* blew me away. Linux 2.2 comes with
Now, when I discovered this, it <em>really</em> blew me away. Linux 2.2 comes with
everything to manage bandwidth in ways comparable to high-end dedicated
bandwidth management systems.
@ -751,14 +756,14 @@ We will explore how our ISP could have used Linux to manage their bandwidth.
<sect1>What is queueing?
<p>
With queueing we determine the order in which data is *sent*. It it important
to realise this, we can only shape data that we transmit. How this changing
With queueing we determine the order in which data is <em>sent</em>. It is important
to realise that we can only shape data that we transmit. How does changing
the order determine the speed of transmission? Imagine a cash register which
is able to process 3 customers per minute.
People wishing to pay go stand in line at the 'tail end' of the queue. This
is 'fifo queueing'. Let's suppose however that we let certain people always
join in the middle of the queue, in stead of at the end. These people spend
is 'FIFO queueing' (First In, First Out). Let's suppose however that we let certain people always
join in the middle of the queue, instead of at the end. These people spend
a lot less time in the queue and are therefore able to shop faster.
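The cash register analogy can be played out in a few lines of Python. This is a toy model only, not how the kernel queues packets: the customer names are made up, and the only number taken from the text is the register speed of 3 customers per minute.

```python
SECONDS_PER_CUSTOMER = 20  # 3 customers per minute, as in the text


def waiting_times(queue):
    """Seconds each customer waits before the register starts serving them."""
    return {name: position * SECONDS_PER_CUSTOMER
            for position, name in enumerate(queue)}


def build_queue():
    """Six ordinary customers, one late arrival at the tail, one in the middle."""
    queue = ["a", "b", "c", "d", "e", "f"]
    queue.append("tail")                     # plain FIFO: join at the end
    queue.insert(len(queue) // 2, "middle")  # privileged: join halfway
    return queue


times = waiting_times(build_queue())
print(times["middle"], times["tail"])
```

The customer who joins in the middle is served well before the one at the tail, even though both arrived last, and everybody behind the insertion point waits one turn longer.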
With the way the internet works, we have no direct control of what people
@ -776,7 +781,7 @@ This is the equivalent of not reading half of your mail, and hoping that
people will stop sending it to you. With the difference that it works for
the Internet :-)
FIXME: explain that normally, ACKs are used to determine speed
FIXME: explain congestion windows
<tscreen><verb>
[The Internet] ---<E3, T3, whatever>--- [Linux router] --- [Office+ISP]
@ -784,15 +789,15 @@ FIXME: explain that normally, ACKs are used to determine speed
</verb></tscreen>
Now, our Linux router has two interfaces which I shall dub eth0 and eth1.
Eth1 is connected to our router which moves packets from to and from our
eth1 is connected to our router which moves packets to and from our
fibre link.
Eth0 is connected to a subnet which contains both the corporate firewall and
eth0 is connected to a subnet which contains both the corporate firewall and
our network head ends, through which we can connect to our customers.
Because we can only limit what we send, we need two separate but possibly
very similar sets of rules. By modifying queueing on eth0, we determine how
fast data gets sent to our customers, and therefor how much downstream
fast data gets sent to our customers, and therefore how much downstream
bandwidth is available for them. Their 'download speed' in short.
On eth1, we determine how fast we send data to The Internet, how fast our
@ -804,7 +809,7 @@ CBQ enables us to generate several classes, and even classes within classes.
The larger divisions might be called 'agencies'. Within these classes may be
things like 'bulk' or 'interactive'.
For example, we may have a 10 megabit internet connection to 'the internet'
For example, we may have a 10 megabit connection to 'the internet'
which is to be shared by our customers, and our corporate needs. We should
not allow a few people at the office to steal away large amounts of
bandwidth which we should sell to our customers.
@ -817,7 +822,7 @@ create virtual circuits. This works, but frame is not very fine grained, ATM
is terribly inefficient at carrying IP traffic, and neither has standardised
ways to segregate different types of traffic into different VCs.
Hover, if you do use ATM, Linux can also happily perform deft acts of fancy
However, if you do use ATM, Linux can also happily perform deft acts of fancy
traffic classification for you too. Another way is to order separate
connections, but this is not very practical and also not very elegant, and
still does not solve all your problems.
@ -842,12 +847,6 @@ mention that on the command line as well. We tell the kernel that it can
allocate 10Mbit and that the average packet size is somewhere around 1000
octets.
FIXME: Double check with Alexey the the built in cell calculation is sufficient.
FIXME: With a 1500 mtu, the default cell is calculated same as the old example.
FIXME: I checked the sources (userspace and kernel), so we should be safe omitting it.
Now we need to generate our root class, from which all others descend:
<tscreen><verb>
# tc class add dev eth0 parent 10:0 classid 10:1 cbq bandwidth 10Mbit rate \
@ -885,17 +884,17 @@ To top it off, we generate the root Office class:
To make this a bit clearer, a diagram which shows our classes:
<tscreen><verb>
+-------------[10: 10Mbit]----------------------+
|+-------------[10:1 root 10Mbit]--------------+|
|| ||
|| +-[10:100 8Mbit]-+ +--[10:200 2Mbit]-----+ ||
|| | | | | ||
|| | ISP | | Office | ||
|| | | | | ||
|| +----------------+ +---------------------+ ||
|| ||
|+---------------------------------------------+|
+-----------------------------------------------+
+-------------[10: 10Mbit]-------------------------+
|+-------------[10:1 root 10Mbit]-----------------+|
|| ||
|| +-----[10:100 8Mbit]---------+ [10:200 2Mbit] ||
|| | | | | ||
|| | ISP | | Office | ||
|| | | | | ||
|| +----------------------------+ +------------+ ||
|| ||
|+------------------------------------------------+|
+--------------------------------------------------+
</verb></tscreen>
Ok, now we have told the kernel what our classes are, but not yet how to
@ -1153,14 +1152,38 @@ thus leaving more of the available bandwidth for others.
See the section on protecting your host from SYN floods for an example on
how this works.
<sect1>WRR
<p>
This qdisc is not included in the standard kernels but can be downloaded from
<url url="http://wipl-wrr.dkik.dk/wrr/">.
Currently the qdisc is only tested with Linux 2.2 kernels but it will
probably work with 2.4 kernels too.
The WRR qdisc distributes bandwidth between its classes using the weighted
round robin scheme. That is, like the CBQ qdisc it contains classes
into which arbitrary qdiscs can be plugged. All classes which have sufficient
demand will get bandwidth proportional to the weights associated with the classes.
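To get a feel for 'bandwidth proportional to the weights', here is a small Python sketch. It is not the WRR qdisc's own code: it assumes every class has sufficient demand, and the class names, weights and link rate are invented for the example.

```python
def wrr_shares(link_rate, weights):
    """Split link_rate over classes in proportion to their weights.

    Assumes every class has enough demand to use its whole share.
    """
    total = sum(weights.values())
    return {name: link_rate * weight / total
            for name, weight in weights.items()}


# Hypothetical example: a 1000 kbit link shared by three machines.
shares = wrr_shares(1000, {"alice": 1, "bob": 1, "carol": 2})
print(shares)
```

With these weights carol gets half of the link and the other two a quarter each.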
The weights can be set manually using the <tt>tc</tt> program. But they
can also be made to decrease automatically for classes transferring a lot of data.
The qdisc has a built-in classifier which assigns packets coming from or
sent to different machines to different classes. Either the MAC or IP and
either source or destination addresses can be used. The MAC address can only
be used when the Linux box is acting as an ethernet bridge, however. The
classes are automatically assigned to machines based on the packets seen.
The qdisc can be very useful at sites such as dorms where a lot of unrelated
individuals share an Internet connection. A set of scripts setting up a
relevant behavior for such a site is a central part of the WRR distribution.
<sect>Netfilter &amp; iproute - marking packets
<p>
So far we've seen how iproute works, and netfilter was mentioned a few
times. This would be a good time to browse through <url name="Rusty's Remarkably
Unreliable guides"
Unreliable Guides"
url="http://netfilter.kernelnotes.org/unreliable-guides/">. Netfilter itself
can be found <url name="here"
url="http://antarctica.penguincomputing.com/~netfilter/">.
url="http://netfilter.filewatcher.org/">.
Netfilter allows us to filter packets, or mangle their headers. One special
feature is that we can mark a packet with a number. This is done with the
@ -1218,6 +1241,8 @@ IP: advanced router (CONFIG_IP_ADVANCED_ROUTER) [Y/n/?]
IP: use netfilter MARK value as routing key (CONFIG_IP_ROUTE_FWMARK) [Y/n/?]
</verb></tscreen>
See also <ref id="SQUID" name="Transparent web-caching using netfilter, iproute2, ipchains and squid">
in the Cookbook.
<sect>More classifiers
<p>
@ -1340,7 +1365,7 @@ The options decribed apply to all filters, not only U32.
The U32 selector contains definition of the pattern, that will be matched
to the currently processed packet. Precisely, it defines which bits are
to be matched in the packet header and nothing more, but this simple
method is very powerful. Let's take a look at the following examplesm
method is very powerful. Let's take a look at the following examples,
taken directly from a pretty complex, real-world filter:
<tscreen><verb>
@ -1358,7 +1383,7 @@ it's 0xff, so the byte will match if it's exactly 0x10. The <tt>at</tt>
keyword means that the match is to be started at specified offset (in
bytes) -- in this case it's beginning of the packet. Translating all
that to human language, the packet will match if its Type of Service
field will have ,,low delay'' bits set. Let's analyze another rule:
field will have `low delay' bits set. Let's analyze another rule:
<tscreen><verb>
# filter parent 1: protocol ip pref 10 u32 fh 800::803 order 2051 key ht 800 bkt 0 flowid 1:3 \
@ -1627,18 +1652,28 @@ kernel.
<sect2>Generic ipv4
<p>
As a generic note, most rate limiting features don't work on loopback, so
don't test them locally.
don't test them locally. The limits are supplied in 'jiffies', and are
enforced using the earlier mentioned token bucket filter.
The kernel has an internal clock which runs at 'HZ' ticks (or 'jiffies') per
second. On intel, 'HZ' is mostly 100. So setting a *_rate file to, say 50,
would allow for 2 packets per second. The token bucket filter is also
configured to allow for a burst of at most 6 packets, if enough tokens have
been earned.
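The jiffies arithmetic, sketched in Python. The function name is made up for this illustration, HZ=100 is the value the text gives for Intel, and the burst allowance of 6 packets is left out for simplicity.

```python
def packets_per_second(rate_jiffies, hz=100):
    """Sustained packet rate when one packet is allowed per 'rate_jiffies' ticks."""
    return hz / rate_jiffies


print(packets_per_second(50))    # the example from the text
print(packets_per_second(100))   # one packet per second at HZ=100
```

So a larger value in a *_rate file means fewer packets per second, not more.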
<descrip>
<tag>/proc/sys/net/ipv4/icmp_destunreach_rate</tag>
FIXME: fill this in
If the kernel decides that it can't deliver a packet, it will drop it, and
send the source of the packet an ICMP notice to this effect.
<tag>/proc/sys/net/ipv4/icmp_echo_ignore_all</tag>
FIXME: fill this in
Don't act on echo packets at all. Please don't set this by default, but if
you are used as a relay in a DoS attack, it may be useful.
<tag>/proc/sys/net/ipv4/icmp_echo_ignore_broadcasts [Useful]</tag>
If you ping the broadcast address of a network, all hosts are supposed to
respond. This makes for a dandy denial-of-service tool. Set this to 1 to
ignore these broadcast messages.
<tag>/proc/sys/net/ipv4/icmp_echoreply_rate</tag>
FIXME: fill this in
The rate at which echo replies are sent to any one destination.
<tag>/proc/sys/net/ipv4/icmp_ignore_bogus_error_responses</tag>
FIXME: fill this in
<tag>/proc/sys/net/ipv4/icmp_paramprob_rate</tag>
@ -1646,7 +1681,6 @@ FIXME: fill this in
<tag>/proc/sys/net/ipv4/icmp_timeexceed_rate</tag>
This is the famous cause of the 'Solaris middle star' in traceroutes. Limits
the number of ICMP Time Exceeded messages sent.
FIXME: Units of these rates - either I'm stupid, or this just doesn't work
<tag>/proc/sys/net/ipv4/igmp_max_memberships</tag>
FIXME: fill this in
<tag>/proc/sys/net/ipv4/inet_peer_gc_maxtime</tag>
@ -1667,19 +1701,17 @@ network. Don't do so for fun - routing loops cause much more damage that
way. You might even consider lowering it in some circumstances.
<tag>/proc/sys/net/ipv4/ip_dynaddr</tag>
You need to set this if you use dial-on-demand with a dynamic interface
address. Once your demand interface comes up, any queued packets will be
rebranded to have the right address. This solves the problem that the
address. Once your demand interface comes up, any local TCP sockets which haven't seen replies will be rebound to have the right address. This solves the problem that the
connection that brings up your interface itself does not work, but the
second try does.
<tag>/proc/sys/net/ipv4/ip_forward</tag>
If the kernel should attempt to forward packets. Off by default for hosts,
on by default when configured as a router.
Whether the kernel should attempt to forward packets. Off by default.
<tag>/proc/sys/net/ipv4/ip_local_port_range</tag>
Range of local ports for outgoing connections. Actually quite small by
default, 1024 to 4999.
<tag>/proc/sys/net/ipv4/ip_no_pmtu_disc</tag>
Set this if you want to disable Path MTU discovery - a technique to
determince the largest Maximum Transfer Unit possible on you path.
determine the largest Maximum Transfer Unit possible on your path.
<tag>/proc/sys/net/ipv4/ipfrag_high_thresh</tag>
FIXME: fill this in
<tag>/proc/sys/net/ipv4/ipfrag_low_thresh</tag>
@ -1713,16 +1745,23 @@ FIXME: fill this in
<tag>/proc/sys/net/ipv4/tcp_rfc1337</tag>
FIXME: fill this in
<tag>/proc/sys/net/ipv4/tcp_sack</tag>
Use Selective ACK which can be used to signify that only a single packet is
Use Selective ACK which can be used to signify that specific packets are
missing - therefore helping fast recovery.
<tag>/proc/sys/net/ipv4/tcp_stdurg</tag>
FIXME: fill this in
<tag>/proc/sys/net/ipv4/tcp_syn_retries</tag>
FIXME: fill this in
Number of SYN packets the kernel will send before giving up on the new
connection.
<tag>/proc/sys/net/ipv4/tcp_synack_retries</tag>
FIXME: fill this in
To open the other side of the connection, the kernel sends a SYN with a
piggybacked ACK on it, to acknowledge the earlier received SYN. This is part
2 of the three-way handshake. This setting determines the number of SYN+ACK
packets sent before the kernel gives up on the connection.
<tag>/proc/sys/net/ipv4/tcp_timestamps</tag>
FIXME: fill this in
Timestamps are used, amongst other things, to protect against wrapping
sequence numbers. A 1 gigabit link might conceivably re-encounter a previous
sequence number with an out-of-line value, because it was of a previous
generation. The timestamp will let it recognise this 'ancient packet'.
<tag>/proc/sys/net/ipv4/tcp_tw_recycle</tag>
FIXME: fill this in
<tag>/proc/sys/net/ipv4/tcp_window_scaling</tag>
@ -1754,7 +1793,10 @@ See the section on reverse path filters.
<tag>/proc/sys/net/ipv4/conf/DEV/mc_forwarding</tag>
If we do multicast forwarding on this interface
<tag>/proc/sys/net/ipv4/conf/DEV/proxy_arp</tag>
FIXME: fill this in
If you set this to 1, all other interfaces will respond to arp queries
destined for addresses on this interface. Can be very useful when building 'ip
pseudo bridges'. Do take care that your netmasks are very correct before
enabling this!
<tag>/proc/sys/net/ipv4/conf/DEV/rp_filter</tag>
See the section on reverse path filters.
<tag>/proc/sys/net/ipv4/conf/DEV/secure_redirects</tag>
@ -1897,7 +1939,7 @@ links with small min's it might be wise to make max perhaps four or
more times larger than min.
Burst controls how the RED algorithm responds to bursts. Burst must be set
large then min/avpkt. Experimentally, I've found (min+min+max)/(3*avpkt) to
larger than min/avpkt. Experimentally, I've found (min+min+max)/(3*avpkt) to
work okay.
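The rule of thumb for burst can be worked through in Python. The formula and the min/avpkt lower bound come from the text; the function names and the min, max and avpkt values below are hypothetical, so adjust them to your own link.

```python
def red_burst(min_bytes, max_bytes, avpkt):
    """Rule-of-thumb burst value: (min + min + max) / (3 * avpkt)."""
    return (min_bytes + min_bytes + max_bytes) / (3 * avpkt)


def burst_is_large_enough(burst, min_bytes, avpkt):
    """Burst must be set larger than min/avpkt."""
    return burst > min_bytes / avpkt


# Hypothetical numbers: min 12000 bytes, max four times min, avpkt 1000 bytes.
burst = red_burst(12000, 48000, 1000)
print(burst, burst_is_large_enough(burst, 12000, 1000))
```

With these numbers the rule of thumb gives a burst of 24 packets, comfortably above the lower bound of 12.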
Additionally, you need to set limit and avpkt. Limit is a safety value, after
@ -1912,7 +1954,7 @@ information.
FIXME: more needed. This means *you* greg :-) - ahu
<sect>Shaping Cookbook
<sect>Cookbook
<p>
This section contains 'cookbook' entries which may help you solve problems.
A cookbook is no replacement for understanding however, so try and comprehend
@ -1958,7 +2000,7 @@ popular daemons have support for this.
We first attach a CBQ qdisc to eth0:
<tscreen><verb>
# tc qdisc add dev eth0 root handle 1: bandwidth 10Mbit cell 8 avpkt 1000 \
# tc qdisc add dev eth0 root handle 1: cbq bandwidth 10Mbit cell 8 avpkt 1000 \
mpu 64
</verb></tscreen>
@ -1989,7 +2031,7 @@ FIXME: why no token bucket filter? is there a default pfifo_fast fallback
somewhere?
<sect1>Protecting your host from SYN floods
<p>From Alexeys iproute documentation, adapted to netfilter and with more
<p>From Alexey's iproute documentation, adapted to netfilter and with more
plausible paths. If you use this, take care to adjust the numbers to
reasonable values for your system.
@ -2028,7 +2070,7 @@ $TC qdisc add dev $INDEV handle ffff: ingress
#
# SYN packets are 40 bytes (320 bits) so three SYNs equals
# 960 bits (approximately 1kbit); so we rate limit below
# the incoming SYNs to 3/sec (not very sueful really; but
# the incoming SYNs to 3/sec (not very useful really; but
#serves to show the point - JHS
############################################################
$TC filter add dev $INDEV parent ffff: protocol ip prio 50 handle 1 fw \
@ -2098,7 +2140,7 @@ class:
</verb></tscreen>
<sect1>Prioritising interactive traffic
<sect1>Prioritizing interactive traffic
<p>
If lots of data is coming down your link, or going up for that matter, and
you are trying to do some maintenance via telnet or ssh, this may not go too
@ -2125,7 +2167,7 @@ author of the ipchains TOS-mangling code, puts it as follows:
<tscreen>
Especially the "Minimum Delay" is important for me. I switch it on for
"interactive" packets in my upstream (Linux) router. I'm
behind a 33k6 modem link. Linux prioritises packets in 3 queues. This
behind a 33k6 modem link. Linux prioritizes packets in 3 queues. This
way I get acceptable interactive performance while doing bulk
downloads at the same time.
</tscreen>
@ -2159,6 +2201,250 @@ netfilter. On your local box:
-j TOS --set-tos Maximize-Throughput
</verb></tscreen>
<sect1>Transparent web-caching using netfilter, iproute2, ipchains and squid
<p>
<label id="SQUID">
This section was sent in by reader Ram Narula from Internet for Education
(Thailand).
The usual way to accomplish this in Linux
is probably to use ipchains AFTER making sure
that the "outgoing" port 80 (web) traffic gets routed through
the server running squid.
There are 3 common methods to make sure "outgoing"
port 80 traffic gets routed to the server running squid,
and a 4th one is introduced here.
<descrip>
<tag>Making the gateway router do it.</tag>
You can tell your gateway router to
match packets that have a destination port
of 80 and send them to the IP address of the squid server.
<p>
BUT
<p>
This would put additional load on the router and
some commercial routers might not even support this.
<tag>Using a Layer 4 switch.</tag>
Layer 4 switches can handle this without any problem.
<p>
BUT
<p>
The cost for this equipment is usually very high. A typical
layer 4 switch would normally cost more than
a typical router plus a good Linux server.
<tag>Using cache server as network's gateway.</tag>
You can force ALL traffic through cache server.
<p>
BUT
<p>
This is quite risky because Squid does
utilize lots of cpu power which might
result in slower over-all network performance
or the server itself might crash and no one on the
network will be able to access the internet if
that occurs.
<tag>Linux+NetFilter router.</tag>
By using NetFilter another technique can be implemented
which is using NetFilter for "mark"ing the packets
with destination port 80 and using iproute2 to
route the "mark"ed packets to the Squid server.
</descrip>
<tscreen><verb>
|----------------|
| Implementation |
|----------------|
Addresses used
10.0.0.1 naret (NetFilter server)
10.0.0.2 silom (Squid server)
10.0.0.3 donmuang (Router connected to the internet)
10.0.0.4 kaosarn (other server on network)
10.0.0.5 RAS
10.0.0.0/24 main network
10.0.0.0/19 total network
|---------------|
|Network diagram|
|---------------|
Internet
|
donmuang
|
------------hub/switch----------
| | | |
naret silom kaosarn RAS etc.
</verb></tscreen>
First, make all traffic pass through naret by making
sure it is the default gateway except for silom.
Silom's default gateway has to be donmuang (10.0.0.3) or
this would create a web traffic loop.
<p>
(all servers on my network had 10.0.0.1 as the default gateway
which was the former IP address of donmuang router so what I did
was changed the IP address of donmuang to 10.0.0.3 and gave
naret ip address of 10.0.0.1)
<tscreen><verb>
Silom
-----
-setup squid and ipchains
</verb></tscreen>
<p>
Setup Squid server on silom, make sure it does support
transparent caching/proxying, the default port is usually
3128, so all traffic for port 80 has to be redirected to port
3128 locally. This can be done by using ipchains with the following:
<tscreen><verb>
silom# ipchains -N allow1
silom# ipchains -A allow1 -p TCP -s 10.0.0.0/19 -d 0/0 80 -j REDIRECT 3128
silom# ipchains -I input -j allow1
</verb></tscreen>
<p>
Or, in netfilter lingo:
<tscreen><verb>
silom# iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3128
</verb></tscreen>
(note: you might have other entries as well)
<p>
For more information on setting up a Squid server, please refer
to the Squid FAQ page at <url
url="http://squid.nlanr.net" name="http://squid.nlanr.net">.
<p>
Make sure ip forwarding is enabled on this server and the default
gateway for this server is donmuang router (NOT naret).
<tscreen><verb>
Naret
-----
-setup iptables and iproute2
-disable icmp REDIRECT messages (if needed)
</verb></tscreen>
<enum>
<item>"Mark" packets of destination port 80 with value 2
<tscreen><verb>
naret# iptables -A PREROUTING -i eth0 -t mangle -p tcp --dport 80 \
-j MARK --set-mark 2
</verb></tscreen>
</item>
<item>Set up iproute2 so that it routes packets with "mark" 2 to silom
<tscreen><verb>
naret# echo 202 www.out >> /etc/iproute2/rt_tables
naret# ip rule add fwmark 2 table www.out
naret# ip route add default via 10.0.0.2 dev eth0 table www.out
naret# ip route flush cache
</verb></tscreen>
<p>
If donmuang and naret are on the same subnet, then
naret should not send out ICMP REDIRECT messages.
In this case they are, so ICMP REDIRECTs have to be
disabled with:
<tscreen><verb>
naret# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
naret# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
naret# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
</verb></tscreen>
</item>
</enum>
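<p>
One thing to watch when repeating the steps above: the echo into
/etc/iproute2/rt_tables appends a new line every time it is run. A minimal
sketch of a guard against duplicate entries is shown here. The function name
add_rt_table and the RT_TABLES variable are made up for this example, and
RT_TABLES can be pointed at a scratch file to try it out safely.

```shell
#!/bin/sh
# Path of the table-name file; overridable so the function can be
# tried on a scratch copy (RT_TABLES is a hypothetical variable).
RT_TABLES="${RT_TABLES:-/etc/iproute2/rt_tables}"

# add_rt_table ID NAME
# Append "ID NAME" unless some line already uses NAME as its table name.
add_rt_table() {
    id="$1"
    name="$2"
    if awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' \
        "$RT_TABLES" 2>/dev/null; then
        return 0    # already registered, nothing to do
    fi
    echo "$id $name" >> "$RT_TABLES"
}

# Example (as root on naret): add_rt_table 202 www.out
```

Calling add_rt_table 202 www.out twice leaves a single line in the file,
whereas the plain echo would leave two.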
The setup is complete. Check the configuration:
<tscreen><verb>
On naret:
naret# iptables -t mangle -L
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
MARK tcp -- anywhere anywhere tcp dpt:www MARK set 0x2
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
naret# ip rule ls
0: from all lookup local
32765: from all fwmark 2 lookup www.out
32766: from all lookup main
32767: from all lookup default
naret# ip route list table www.out
default via 10.0.0.2 dev eth0
naret# ip route
10.0.0.1 dev eth0 scope link
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.1
127.0.0.0/8 dev lo scope link
default via 10.0.0.3 dev eth0
(make sure silom belongs to one of the above lines, in this case
it's the line with 10.0.0.0/24)
|------|
|-DONE-|
|------|
</verb></tscreen>
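<p>
If you would rather not check the rule list by eye, the test for the fwmark
rule can be scripted. The following is only a sketch: has_fwmark_rule is a
made-up helper that reads the output of 'ip rule ls' on standard input. It
accepts the mark written either as 2 or as 0x2, on the assumption that some
iproute2 versions print it in hexadecimal.

```shell
#!/bin/sh
# has_fwmark_rule MARK TABLE
# Reads `ip rule ls` output on stdin and succeeds if at least one rule
# matches fwmark MARK and looks up TABLE. Hypothetical helper.
has_fwmark_rule() {
    awk -v mark="$1" -v table="$2" '
        {
            mark_ok = 0; table_ok = 0
            for (i = 1; i < NF; i++) {
                if ($i == "fwmark" && ($(i + 1) == mark || $(i + 1) == sprintf("0x%x", mark)))
                    mark_ok = 1
                if ($i == "lookup" && $(i + 1) == table)
                    table_ok = 1
            }
            if (mark_ok && table_ok) found = 1
        }
        END { exit !found }
    '
}

# Example: ip rule ls | has_fwmark_rule 2 www.out
```

On naret, the pipeline in the last comment should exit with status 0 once
the rule from step 2 is in place.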
<sect2>Traffic flow diagram after implementation
<p>
<tscreen><verb>
|-----------------------------------------|
|Traffic flow diagram after implementation|
|-----------------------------------------|
INTERNET
/\
||
\/
-----------------donmuang router---------------------
/\ /\ ||
|| || ||
|| \/ ||
naret silom ||
*destination port 80 traffic=========>(cache) ||
/\ || ||
|| \/ \/
\\===================================kaosarn, RAS, etc.
</verb></tscreen>
Note that the routing is asymmetric, as there is one extra hop on the
general outgoing path.
<tscreen><verb>
Here is a rundown for packets traversing the network from kaosarn
to and from the internet.
For web/http traffic:
kaosarn http request->naret->silom->donmuang->internet
http replies from internet->donmuang->silom->kaosarn
For non-web/http requests (e.g. telnet):
kaosarn outgoing data->naret->donmuang->internet
incoming data from internet->donmuang->kaosarn
</verb></tscreen>
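<p>
The outgoing half of the rundown above can be restated as a tiny shell
function. It is only a model of the decision naret makes (TCP packets with
destination port 80 are marked and detoured via silom). The name
outgoing_path is made up, and nothing here touches the network.

```shell
#!/bin/sh
# outgoing_path PROTO DPORT
# Print the hops a packet from kaosarn takes towards the internet in the
# example network. Purely illustrative; host names are from this section.
outgoing_path() {
    proto="$1"
    dport="$2"
    if [ "$proto" = "tcp" ] && [ "$dport" = "80" ]; then
        # Marked with 2 on naret, so routed via table www.out to silom.
        echo "kaosarn->naret->silom->donmuang->internet"
    else
        # Unmarked: naret uses its main table and forwards to donmuang.
        echo "kaosarn->naret->donmuang->internet"
    fi
}

outgoing_path tcp 80    # web request
outgoing_path tcp 23    # telnet
```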
<sect>Advanced Linux Routing
<p>
This section is for all you people who either want to understand why the
@ -2178,7 +2464,7 @@ Lists the steps the kernel takes to classify a packet, etc...
FIXME: Write this.
<sect1>Advanced uses of the packet queueing system
<p>Go through Alexeys extremely tricky example involving the unused bits
<p>Go through Alexey's extremely tricky example involving the unused bits
in the TOS field.
FIXME: Write this.
@ -2262,7 +2548,8 @@ it is Linux specific, but it does a fair job discussing the theory and uses
of CBQ.
Very technical stuff, but good reading for those so inclined.
<tag><url url="http://ceti.pl/%7ekravietz/cbq/NET4_tc.html" name="http://ceti.pl/%7ekravietz/cbq/NET4_tc.html"></tag>
<tag><url url="http://ceti.pl/~kravietz/cbq/NET4_tc.html"
name="http://ceti.pl/~kravietz/cbq/NET4_tc.html"></tag>
Yet another HOWTO, this time in Polish! You can copy/paste the command
lines, however; they work just the same in every language. The author is
cooperating with us and may soon author sections of this HOWTO.
@ -2291,20 +2578,23 @@ well.
<sect>Acknowledgements
<p>
It is our goal to list everybody who has contributed to this HOWTO, or
helped us demistify how things work. While there are currently no plans
helped us demystify how things work. While there are currently no plans
for a Netfilter type scoreboard, we do like to recognise the people who are
helping.
<itemize>
<item>Jamal Hadi &lt;hadi%cyberus.ca&gt;
<item>Nadeem Hasan &lt;nhasan@usa.net&gt;
<item>Philippe Latu &lt;philippe.latu%linux-france.org&gt;
<item>Jason Lunz &lt;j@cc.gatech.edu&gt;
<item>Alexey Mahotkin &lt;alexm@formulabez.ru&gt;
<item>Pawel Krawczyk &lt;kravietz%alfa.ceti.pl&gt;
<item>Wim van der Most
<item>Ram Narula &lt;ram@princess1.net&gt;
<item>Rusty Russell (with apologies for always misspelling your name)
<item>Charles Tassell &lt;ctassell%isn.net&gt;
<item>Glen Turner &lt;glen.turner%aarnet.edu.au&gt;
<item>Song Wang &lt;wsong@ece.uci.edu&gt;
</itemize>
</article>