Optimizing NFS Performance
Getting network settings right can improve NFS performance many times
over -- a tenfold increase in transfer speeds is not unheard of.
The most important things to get right are the rsize
and wsize mount options. Other factors listed below
may affect people with particular hardware setups.
Setting Block Size to Optimize Transfer Speeds
The rsize and wsize mount options specify the size of the chunks of data
that the client and server pass back and forth to each other. If no
rsize and wsize options
are specified, the default varies by which version of NFS we are using.
4096 bytes is the most common default, although for TCP-based mounts
in 2.2 kernels, and for all mounts beginning with 2.4 kernels, the
server specifies the default block size.
The defaults may be too big or too small. On the one hand, some
combinations of Linux kernels and network cards (largely on older
machines) cannot handle blocks that large. On the other hand, if they
can handle larger blocks, a bigger size might be faster.
So we'll want to experiment and find an rsize and wsize that works
and is as fast as possible. You can test the speed of your options
with some simple commands.
The first of these commands transfers 16384 blocks of 16k each from
the special file /dev/zero (which if you read it
just spits out zeros _really_ fast) to the mounted partition. We will
time it to see how long it takes. So, from the client machine, type:
# time dd if=/dev/zero of=/mnt/home/testfile bs=16k count=16384
This creates a 256 MB file of zeroed bytes. In general, you should
create a file that's at least twice as large as the system RAM
on the server, but make sure you have enough disk space! Then read
back the file into the great black hole on the client machine
(/dev/null) by typing the following:
# time dd if=/mnt/home/testfile of=/dev/null bs=16k
Repeat this a few times and average how long it takes. Be sure to
unmount and remount the filesystem each time (both on the client and,
if you are zealous, locally on the server as well), which should clear
out any caches.
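The repeated timings can be summarized with a couple of small helpers. This is only a sketch: the function names average_seconds and throughput_mb_s are made up for illustration, and the elapsed times are assumed to be copied by hand from the "real" figure that time prints.

```shell
# Average a list of elapsed times (in seconds) given as arguments.
average_seconds() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR) printf "%.2f\n", sum / NR }'
}

# Convert a byte count and an elapsed time in seconds into MB/s
# (taking 1 MB as 1048576 bytes).
throughput_mb_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.2f\n", b / 1048576 / s }'
}
```

For example, if three reads of the 256 MB test file take 10, 12 and 14 seconds, average_seconds 10 12 14 gives 12.00, and throughput_mb_s 268435456 12 converts that to roughly 21.33 MB/s.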
Then unmount, mount again with a larger or smaller block size, and repeat the test.
They should probably be multiples of 1024, and not larger than
8192 bytes since that's the maximum size in NFS version 2. (Though
if you are using Version 3 you might want to try up to 32768.)
Wisdom has it that the block size should be a power of two since most
of the parameters that would constrain it (such as file system block
sizes and network packet size) are also powers of two. However, some
users have reported better success with block sizes that are not
powers of two but are still multiples of the file system block size
and the network packet size.
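One way to work through the candidate sizes is to generate them and print the matching mount commands for review. The sketch below makes some assumptions: candidate_sizes and mount_command are hypothetical helper names, server:/home is a placeholder for your own export, and only power-of-two sizes are generated. The commands are printed rather than executed, so nothing is mounted until you run them yourself as root.

```shell
# Print power-of-two block sizes from 1024 bytes up to the given maximum
# (8192 for NFS version 2, 32768 for version 3).
candidate_sizes() {
    size=1024
    while [ "$size" -le "$1" ]; do
        echo "$size"
        size=$((size * 2))
    done
}

# Print (not run) the mount command for one block size.
# Arguments: size, remote export (placeholder, e.g. server:/home), mount point.
mount_command() {
    echo "mount -o rsize=$1,wsize=$1 $2 $3"
}

# Example: list the remount commands to try for NFS version 2.
# for s in $(candidate_sizes 8192); do
#     echo "umount /mnt/home"
#     mount_command "$s" server:/home /mnt/home
# done
```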
Directly after mounting with a larger size, cd into the mounted
file system and explore it a bit (with ls, for example) to make
sure everything is as it should be. If the rsize/wsize is too large,
the symptoms are very odd and not 100% obvious. Typical symptoms
are incomplete file lists from ls and reads of files that fail
mysteriously, in both cases with no error messages. After
establishing that the given rsize/wsize works, you can do the speed
tests again. Different server platforms are likely to have different
optimal sizes. SunOS and Solaris are reputedly a lot faster with 4096
byte blocks than with anything else.
Remember to edit /etc/fstab to reflect the rsize/wsize you found.
Packet Size and Network Drivers
There are many shoddy network drivers available for Linux,
including for some fairly standard cards.
Try pinging back and forth between the two machines with large
packets using the -f and -s
options with ping (see man ping
for more details) and see if a lot of packets get dropped or if they
take a long time for a reply. If so, you may have a problem
with the performance of your network card.
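To compare runs, it can help to pull the loss figure out of the summary that ping prints. The helper below is a sketch that assumes the summary format used by the common Linux ping ("N packets transmitted, M received, X% packet loss, ..."); the function name packet_loss_percent and the host name server are placeholders.

```shell
# Read ping output on standard input and print the packet loss percentage
# found in the summary line (prints nothing if no summary line is present).
packet_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example (20 packets with an 8192-byte payload each):
# ping -c 20 -s 8192 server | packet_loss_percent
```

A result well above zero on an otherwise quiet network suggests that the card, driver, or something along the path is struggling with large packets.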
To correct such a problem, you may wish to reconfigure the packet
size that your network card uses. Very often there is a constraint
somewhere else in the network (such as a router) that causes a
smaller maximum packet size between two machines than what the
network cards on the machines are actually capable of. TCP should
autodiscover the appropriate packet size for a network, but UDP
will simply stay at a default value. So determining the appropriate
packet size is especially important if you are using NFS over UDP.
You can test for the network packet size using the tracepath command:
From the client machine, just type tracepath [server] 2049
and the path MTU should be reported at the bottom. You can then set the
MTU on your network card equal to the path MTU, by using the MTU option
to ifconfig, and see if fewer packets get dropped.
See the ifconfig man pages for details on how to reset the MTU.
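As a rough sketch, the path MTU can be picked out of the tracepath output and handed to ifconfig. This assumes that the last line of the output has the form "Resume: pmtu 1500 hops 2 back 2", which may differ between tracepath versions; path_mtu is a made-up helper name, and eth0 and server are placeholders.

```shell
# Read tracepath output on standard input and print the path MTU
# reported on the "Resume:" line.
path_mtu() {
    awk '/Resume:/ { for (i = 1; i < NF; i++) if ($i == "pmtu") print $(i + 1) }'
}

# Example (run as root; interface and host names are placeholders):
# mtu=$(tracepath server | path_mtu)
# ifconfig eth0 mtu "$mtu"
```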
Number of Instances of NFSD
Most startup scripts, Linux and otherwise, start 8 instances of nfsd.
In the early days of NFS, Sun decided on this number as a rule of thumb,
and everyone else copied. There are no good measures of how many
instances are optimal, but a more heavily-trafficked server may require
more. If you are using a 2.4 or higher kernel and you want to see how
heavily each nfsd thread is being used, you can look at the file
/proc/net/rpc/nfsd. The last ten numbers on the line
beginning with th in that file indicate the number of seconds
that the thread usage was at that percentage of the maximum allowable.
If you have a large number in the top three deciles, you may wish to
increase the number of nfsd instances. This is done
upon starting nfsd using the number of instances as
the command line option. See the nfsd man page for
more information.
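The check on the top three deciles can be sketched as follows. It assumes that the th line ends with the ten histogram values described above; busy_seconds is a hypothetical helper name, and the thread count of 16 in the example is only an illustration, not a recommendation.

```shell
# Read the contents of /proc/net/rpc/nfsd on standard input and print the
# sum of the last three values on the "th" line, i.e. the seconds spent
# with thread usage in the top three deciles.
busy_seconds() {
    awk '$1 == "th" { print $(NF - 2) + $(NF - 1) + $NF }'
}

# Example:
# busy_seconds < /proc/net/rpc/nfsd
# If the figure is large, start more threads, for instance:
# rpc.nfsd 16
```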
Memory Limits on the Input Queue
On 2.2 and 2.4 kernels, the socket input queue, where requests
sit while they are currently being processed, has a small default
size limit of 64k. This means that if you are running 8 instances of
nfsd, each will only have 8k to store requests while it processes
them.
You should consider increasing this number to at least 256k for nfsd.
This limit is set in the proc file system using the files
/proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max.
It can be increased in three steps; the following method is a bit of
a hack but should work and should not cause any problems:
1. Increase the size listed in the files:
   # echo 262144 > /proc/sys/net/core/rmem_default
   # echo 262144 > /proc/sys/net/core/rmem_max
2. Restart nfsd, e.g., type /etc/rc.d/init.d/nfsd restart on Red Hat.
3. Return the size limits to their normal size in case other kernel systems depend on them:
   # echo 65536 > /proc/sys/net/core/rmem_default
   # echo 65536 > /proc/sys/net/core/rmem_max
Be sure to perform this last step because machines have been reported
to crash if these values are left changed for long periods of time.
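The whole procedure can be wrapped in a small script so that the final reset is not forgotten. This is a sketch only: set_rmem is a hypothetical helper, its optional second argument exists purely so the function can be tried against a scratch directory, and the restart command is the Red Hat one quoted above.

```shell
# Write the given size (in bytes) to rmem_default and rmem_max.
# The directory defaults to /proc/sys/net/core; pass another one to
# experiment without touching the running kernel.
set_rmem() {
    dir=${2:-/proc/sys/net/core}
    echo "$1" > "$dir/rmem_default"
    echo "$1" > "$dir/rmem_max"
}

# Example (run as root):
# set_rmem 262144
# /etc/rc.d/init.d/nfsd restart
# set_rmem 65536
```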
Overflow of Fragmented Packets
The NFS protocol uses fragmented UDP packets. The kernel has
a limit of how many fragments of incomplete packets it can
buffer before it starts throwing away packets. With 2.2 kernels
that support the /proc filesystem, you can
specify how many by editing the files
/proc/sys/net/ipv4/ipfrag_high_thresh and
/proc/sys/net/ipv4/ipfrag_low_thresh.
Once the memory used by unprocessed, fragmented packets reaches the
value specified by ipfrag_high_thresh (in bytes), the kernel
will simply start throwing away fragmented packets until the memory
used by incomplete packets falls to the value specified
by ipfrag_low_thresh. (With 2.2 kernels, the default is usually 256K.)
This will look like packet loss, and if the high threshold is
reached your server performance drops a lot.
One way to monitor this is to look at the field IP: ReasmFails in the
file /proc/net/snmp; if it goes up too quickly during heavy file
activity, you may have a problem. Good alternative values for
ipfrag_high_thresh and ipfrag_low_thresh
have not been reported; if you have a good experience with a
particular value, please let the maintainers and development team know.
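A small helper can read the counter so that it can be sampled before and after a period of heavy file activity. The sketch assumes the usual layout of /proc/net/snmp, in which a header line starting with "Ip:" names the fields and the following "Ip:" line holds the values; reasm_fails is a made-up function name.

```shell
# Read the contents of /proc/net/snmp on standard input and print the
# value of the ReasmFails field from the Ip: section.
reasm_fails() {
    awk '
        $1 == "Ip:" && !seen {
            # Header line: remember which column is ReasmFails.
            for (i = 1; i <= NF; i++) if ($i == "ReasmFails") col = i
            seen = 1
            next
        }
        $1 == "Ip:" && col { print $col; exit }
    '
}

# Example: sample the counter twice, one minute apart.
# before=$(reasm_fails < /proc/net/snmp)
# sleep 60
# after=$(reasm_fails < /proc/net/snmp)
# echo "ReasmFails grew by $((after - before))"
```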
Turning Off Autonegotiation of NICs and Hubs
Sometimes network cards will auto-negotiate badly with
hubs and switches and this can have strange effects.
Moreover, hubs may lose packets if they have different
ports running at different speeds. Try playing around
with the network speed and duplex settings.
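On systems where the ethtool utility is installed (an assumption; older setups may need mii-tool or driver module options instead), a fixed speed and duplex setting can be tried as sketched below. The helper name force_link_command and the interface eth0 are placeholders, and the command is printed for review rather than run, since forcing a setting the other end does not match can take the link down.

```shell
# Print (not run) an ethtool command that disables autonegotiation and
# forces the given speed (in Mb/s) and duplex mode (half or full).
# Arguments: interface, speed, duplex.
force_link_command() {
    echo "ethtool -s $1 speed $2 duplex $3 autoneg off"
}

# Example:
# force_link_command eth0 100 full
```

Remember that the port on the hub or switch should be set to the same speed and duplex mode.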
Non-NFS-Related Means of Enhancing Server Performance
Offering general guidelines for setting up a well-functioning
file server is outside the scope of this document, but a few
hints may be worth mentioning: First, RAID 5 gives you good
read speeds but lousy write speeds; consider RAID 1/0 if both
write speed and redundancy are important. Second, using a
journalling filesystem will drastically reduce your reboot
time in the event of a system crash; as of this writing, ext3
(ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/) was the only
journalling filesystem that worked correctly with
NFS version 3, but no doubt that will change soon.
In particular, it looks like Reiserfs
should work with NFS version 3 on 2.4 kernels, though not yet
on 2.2 kernels. Finally, using an automounter (such as autofs
or amd) may prevent hangs if you cross-mount files
on your machines (whether on purpose or by oversight) and one of those
machines goes down. See the
Automount Mini-HOWTO
for details.