
<HTML
><HEAD
><TITLE
>Optimizing NFS Performance</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.76b+
"><LINK
REL="HOME"
TITLE="Linux NFS-HOWTO"
HREF="index.html"><LINK
REL="PREVIOUS"
TITLE="Setting up an NFS Client"
HREF="client.html"><LINK
REL="NEXT"
TITLE="Security and NFS"
HREF="security.html"></HEAD
><BODY
CLASS="SECT1"
BGCOLOR="#FFFFFF"
TEXT="#000000"
LINK="#0000FF"
VLINK="#840084"
ALINK="#0000FF"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
>Linux NFS-HOWTO</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="client.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
></TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="security.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="SECT1"
><H1
CLASS="SECT1"
><A
NAME="PERFORMANCE">5. Optimizing NFS Performance</H1
><P
>Careful analysis of your environment, both from the client and from the server
point of view, is the first step necessary for optimal NFS performance. The
first sections will address issues that are generally important to the client.
Later (<A
HREF="performance.html#FRAG-OVERFLOW"
>Section 5.3</A
> and beyond), server side issues
will be discussed. In both
cases, these issues will not be limited exclusively to one side or the other,
but it is useful to separate the two in order to get a clearer picture of
cause and effect.</P
><P
>Aside from the general network configuration (appropriate network capacity,
faster NICs, full duplex settings to reduce collisions, agreement in
network speed among the switches and hubs, etc.), among the most important
client optimization settings are the NFS data transfer buffer sizes, specified
by the <B
CLASS="COMMAND"
>mount</B
> command options <TT
CLASS="USERINPUT"
><B
>rsize</B
></TT
>
and <TT
CLASS="USERINPUT"
><B
>wsize</B
></TT
>.</P
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="BLOCKSIZES">5.1. Setting Block Size to Optimize Transfer Speeds</H2
><P
>The <B
CLASS="COMMAND"
>mount</B
> command options <TT
CLASS="USERINPUT"
><B
>rsize</B
></TT
>
and <TT
CLASS="USERINPUT"
><B
>wsize</B
></TT
> specify the size of the chunks of
data that the client and server pass back and forth
to each other. If no <TT
CLASS="USERINPUT"
><B
>rsize</B
></TT
>
and <TT
CLASS="USERINPUT"
><B
>wsize</B
></TT
> options are specified,
the default depends on which version of NFS is
in use. The most common default is 4K (4096 bytes), although for TCP-based
mounts in 2.2 kernels, and for all mounts beginning with 2.4 kernels, the
server specifies the default block size.</P
><P
>The theoretical limit for the NFS V2 protocol is 8K. For the V3 protocol, the
limit is specific to the server. On the Linux server, the maximum block size
is defined by the value of the kernel constant
<TT
CLASS="USERINPUT"
><B
>NFSSVC_MAXBLKSIZE</B
></TT
>, found in the
Linux kernel source file <TT
CLASS="FILENAME"
>./include/linux/nfsd/const.h</TT
>.
The current maximum block size for the kernel, as of 2.4.17, is 8K (8192 bytes),
but the patch set implementing NFS over TCP/IP transport in the 2.4
series, as of this writing, uses a value of 32K (defined in the
patch as 32*1024) for the maximum block size.</P
><P
>All 2.4 clients currently support up to 32K block transfer sizes, allowing the
standard 32K block transfers across NFS mounts from other servers, such as
Solaris, without client modification.</P
><P
>The defaults may be too big or too small, depending on the specific
combination of hardware and kernels. On the one hand, some combinations of
Linux kernels and network cards (largely on older machines) cannot handle
blocks that large. On the other hand, if they can handle larger blocks, a
bigger size might be faster.</P
><P
>You will want to experiment and find an <TT
CLASS="USERINPUT"
><B
>rsize</B
></TT
>
and <TT
CLASS="USERINPUT"
><B
>wsize</B
></TT
> that works and is as
fast as possible. You can test the speed of your options with some simple
commands, if your network environment is not heavily used. Note that your
results may vary widely unless you resort to using more complex benchmarks,
such as <SPAN
CLASS="APPLICATION"
>Bonnie</SPAN
>, <SPAN
CLASS="APPLICATION"
>Bonnie++</SPAN
>,
or <SPAN
CLASS="APPLICATION"
>IOzone</SPAN
>.</P
><P
>The first of these commands transfers 16384 blocks of 16k each from the
special file <TT
CLASS="FILENAME"
>/dev/zero</TT
> (which if
you read it just spits out zeros <EM
>really</EM
>
fast) to the mounted partition. We will time it to see how long it takes. So,
from the client machine, type:</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
>    # time dd if=/dev/zero of=/mnt/home/testfile bs=16k count=16384</PRE
></FONT
></TD
></TR
></TABLE
><P
>This creates a 256MB file of zeroed bytes. In general, you should create a
file that's at least twice as large as the system RAM on the server, but make
sure you have enough disk space! Then read back the file into the great black
hole on the client machine (<TT
CLASS="FILENAME"
>/dev/null</TT
>) by
typing the following:</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
>    # time dd if=/mnt/home/testfile of=/dev/null bs=16k</PRE
></FONT
></TD
></TR
></TABLE
><P
>Repeat this a few times and average how long it takes. Be sure to unmount and
remount the filesystem each time (both on the client and, if you are zealous,
locally on the server as well), which should clear out any caches.</P
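><P
>Averaging the timings by hand is error-prone, so a small shell helper can do
the arithmetic. This is only a sketch: the function name
<TT CLASS="USERINPUT"><B>avg_throughput</B></TT> is hypothetical, and the three
timings passed to it on the last line are made-up sample values, not
measurements. It takes the file size in megabytes followed by the elapsed
times in seconds, and prints the mean time and the resulting transfer rate:</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
># avg_throughput SIZE_MB SECONDS...
# Print the mean elapsed time and the transfer rate in MB/s.
avg_throughput() {
    size_mb=$1
    shift
    # One timing per line; awk accumulates the sum and the count.
    printf '%s\n' "$@" | awk -v size="$size_mb" '
        { sum += $1; n++ }
        END {
            avg = sum / n
            printf "avg=%.2fs rate=%.1fMB/s\n", avg, size / avg
        }'
}

# Sample values only: three hypothetical runs of the 256MB test.
avg_throughput 256 20.0 22.0 24.0</PRE
></FONT
></TD
></TR
></TABLE
><P
>Replace the sample timings with the "real" times that
<B CLASS="COMMAND">time</B> reports for your own runs, and compare the
resulting rates across block sizes.</P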
><P
>Then unmount, and mount again with larger and smaller block sizes. They
should be multiples of 1024, and not larger than the maximum block size
allowed by your system. Note that NFS Version 2 is limited to a maximum of 8K,
regardless of the maximum block size defined by
<TT
CLASS="USERINPUT"
><B
>NFSSVC_MAXBLKSIZE</B
></TT
>; Version 3
will support up to 64K, if permitted. The block size should be a power of two
since most of the parameters that would constrain it (such as file system
block sizes and network packet size) are also powers of two. However, some
users have reported better results with block sizes that are not powers of
two but are still multiples of the file system block size and the network
packet size.</P
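><P
>To step through the candidate sizes systematically, a loop along the
following lines may help. It is a sketch, not a tested procedure: the function
name <TT CLASS="USERINPUT"><B>blocksizes</B></TT> is hypothetical, and
<TT CLASS="FILENAME">foo:/home</TT> and <TT CLASS="FILENAME">/mnt/home</TT>
are placeholder names for your server export and mount point. The loop only
prints the <B CLASS="COMMAND">mount</B> commands; it does not run them.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
># blocksizes MAX
# Print candidate rsize/wsize values: powers of two from 1024 up to MAX.
blocksizes() {
    max=$1
    size=1024
    while [ "$size" -le "$max" ]; do
        echo "$size"
        size=$((size * 2))
    done
}

# Show (without running) the mount command for each candidate size.
# "foo:/home" and "/mnt/home" are placeholders.
for bs in $(blocksizes 32768); do
    echo "mount -o rsize=$bs,wsize=$bs foo:/home /mnt/home"
done</PRE
></FONT
></TD
></TR
></TABLE
><P
>Pass the maximum block size your server allows as the argument, for example
8192 for an NFS Version 2 mount.</P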
><P
>Directly after mounting with a larger size, cd into the mounted
file system and do things like <B
CLASS="COMMAND"
>ls</B
>, explore
the filesystem a bit to make sure everything is as it
should be. If the <TT
CLASS="USERINPUT"
><B
>rsize</B
></TT
>/<TT
CLASS="USERINPUT"
><B
>wsize</B
></TT
>
 is too large, the symptoms are very odd and not 100%
obvious. A typical symptom is incomplete file lists when doing
<B
CLASS="COMMAND"
>ls</B
>, and no
error messages, or reading files failing mysteriously with no error messages.
After establishing that the given <TT
CLASS="USERINPUT"
><B
>rsize</B
></TT
>/
<TT
CLASS="USERINPUT"
><B
>wsize</B
></TT
> works you can do the speed tests
again. Different server platforms are likely to have different optimal sizes.</P
><P
>Remember to edit <TT
CLASS="FILENAME"
>/etc/fstab</TT
> to reflect the
<TT
CLASS="USERINPUT"
><B
>rsize</B
></TT
>/<TT
CLASS="USERINPUT"
><B
>wsize</B
></TT
> you found
to be the most desirable.</P
><P
>If your results seem inconsistent, or doubtful, you may need to analyze your
network more extensively while varying the <TT
CLASS="USERINPUT"
><B
>rsize</B
></TT
>
and <TT
CLASS="USERINPUT"
><B
>wsize</B
></TT
> values. In that
case, here are several pointers to benchmarks that may prove useful:</P
><P
></P
><UL
><LI
><P
>     Bonnie                            <A
HREF="http://www.textuality.com/bonnie/"
TARGET="_top"
>http://www.textuality.com/bonnie/</A
></P
></LI
><LI
><P
>     Bonnie++                          <A
HREF="http://www.coker.com.au/bonnie++/"
TARGET="_top"
>http://www.coker.com.au/bonnie++/</A
></P
></LI
><LI
><P
>     IOzone file system benchmark      <A
HREF="http://www.iozone.org/"
TARGET="_top"
>http://www.iozone.org/</A
></P
></LI
><LI
><P
>     The official NFS benchmark,
     SPECsfs97                         <A
HREF="http://www.spec.org/osg/sfs97/"
TARGET="_top"
>http://www.spec.org/osg/sfs97/</A
></P
></LI
></UL
><P
>The easiest benchmark with the widest coverage, including an extensive spread
of file sizes and of I/O types (reads and writes, rereads and rewrites, random
access, etc.), seems to be IOzone. A recommended invocation of IOzone (for
which you must have root privileges) includes unmounting and remounting the
directory under test, in order to clear out the caches between tests, and
including the file close time in the measurements. Assuming you've already
exported <TT
CLASS="FILENAME"
>/tmp</TT
> to everyone from the server
<TT
CLASS="COMPUTEROUTPUT"
>foo</TT
>,
and that you've installed IOzone in the local directory, this should work:</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
>    # echo "foo:/tmp /mnt/foo nfs rw,hard,intr,rsize=8192,wsize=8192 0 0" \
        &#62;&#62; /etc/fstab
    # mkdir /mnt/foo
    # mount /mnt/foo
    # ./iozone -a -R -c -U /mnt/foo -f /mnt/foo/testfile &#62; logfile</PRE
></FONT
></TD
></TR
></TABLE
><P
>The benchmark should take 2-3 hours at most, but of course you will need to
run it for each value of rsize and wsize that is of interest. The web site
gives full documentation of the parameters, but the specific options used
above are:</P
><P
></P
><UL
><LI
><P
>     <TT
CLASS="USERINPUT"
><B
>-a</B
></TT
> Full automatic mode, which tests file sizes of 64K to 512M, using
     record sizes of 4K to 16M</P
></LI
><LI
><P
>     <TT
CLASS="USERINPUT"
><B
>-R</B
></TT
> Generate report in Excel spreadsheet form (the "surface plot"
     option for graphs is best)</P
></LI
><LI
><P
>     <TT
CLASS="USERINPUT"
><B
>-c</B
></TT
> Include the file close time in the tests, which will pick up the
     NFS version 3 commit time</P
></LI
><LI
><P
>     <TT
CLASS="USERINPUT"
><B
>-U</B
></TT
> Use the given mount point to unmount and remount between tests;
     it clears out caches</P
></LI
><LI
><P
>     <TT
CLASS="USERINPUT"
><B
>-f</B
></TT
> When using unmount, you have to locate the test file in the
     mounted file system</P
></LI
></UL
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="PACKET-AND-NETWORK">5.2. Packet Size and Network Drivers</H2
><P
>  While many Linux network card drivers are excellent, some are quite shoddy,
  including a few drivers for some fairly standard cards.  It is worth
  experimenting with your network card directly to find out how it can
  best handle traffic.</P
><P
>   Try <B
CLASS="COMMAND"
>ping</B
>ing back and forth
   between the two machines with large packets using
   the <TT
CLASS="USERINPUT"
><B
>-f</B
></TT
> and <TT
CLASS="USERINPUT"
><B
>-s</B
></TT
>
   options with <B
CLASS="COMMAND"
>ping</B
> (see <EM
>ping(8)</EM
>
   for more details) and see if a
   lot of packets get dropped, or if they take a long time for a reply. If so,
   you may have a problem with the performance of your network card.</P
><P
>For a more extensive analysis of NFS behavior in particular, use the <B
CLASS="COMMAND"
>nfsstat</B
> command to look at NFS transactions, client and server statistics, network
statistics, and so forth. The <TT
CLASS="USERINPUT"
><B
>"-o net"</B
></TT
> option will show you the number of
dropped packets in relation to the total number of transactions. In UDP
transactions, the most important statistic is the number of retransmissions,
due to dropped packets, socket buffer overflows, general server congestion,
timeouts, etc. This will have a tremendously important effect on NFS
performance, and should be carefully monitored.
Note that <B
CLASS="COMMAND"
>nfsstat</B
> does not yet
implement the <TT
CLASS="USERINPUT"
><B
>-z</B
></TT
> option, which would zero out all counters, so you must look
at the current <B
CLASS="COMMAND"
>nfsstat</B
> counter values prior to running the benchmarks.</P
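><P
>Because the counters cannot be zeroed, one way to isolate a benchmark run is
to note the client RPC <TT CLASS="COMPUTEROUTPUT">calls</TT> and
<TT CLASS="COMPUTEROUTPUT">retrans</TT> counters before and after the run and
work with the differences. The helper below is a sketch of that arithmetic;
the name <TT CLASS="USERINPUT"><B>retrans_percent</B></TT> is hypothetical and
the numbers on the last line are made-up sample values.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
># retrans_percent CALLS_BEFORE RETRANS_BEFORE CALLS_AFTER RETRANS_AFTER
# Print the percentage of RPC calls that were retransmitted between
# two snapshots of the nfsstat client counters.
retrans_percent() {
    awk -v c0="$1" -v r0="$2" -v c1="$3" -v r1="$4" 'BEGIN {
        calls = c1 - c0
        # No calls in the interval: report zero rather than divide by zero.
        if (calls == 0) { print "0.00"; exit }
        printf "%.2f\n", 100 * (r1 - r0) / calls
    }'
}

# Sample values only: 20000 calls and 100 retransmissions during the run.
retrans_percent 1000 10 21000 110</PRE
></FONT
></TD
></TR
></TABLE
><P
>With the sample values, 100 of 20000 calls were retransmitted, so the
helper prints 0.50.</P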
><P
>To correct network problems, you may wish to reconfigure the packet size that
your network card uses. Very often there is a constraint somewhere else in the
network (such as a router) that causes a smaller maximum packet size between
two machines than what the network cards on the machines are actually capable
of. TCP should autodiscover the appropriate packet size for a network, but UDP
will simply stay at a default value. So determining the appropriate packet
size is especially important if you are using NFS over UDP.</P
><P
>You can test for the network packet size using the <B
CLASS="COMMAND"
>tracepath</B
> command: From
the client machine, just type <TT
CLASS="USERINPUT"
><B
>tracepath</B
></TT
>
<EM
>server</EM
> <TT
CLASS="USERINPUT"
><B
>2049</B
></TT
> and the path MTU should
be reported at the bottom. You can then set the MTU on your network card equal
to the path MTU, by using the <TT
CLASS="USERINPUT"
><B
>mtu</B
></TT
>
option to <B
CLASS="COMMAND"
>ifconfig</B
>, and see if fewer packets
get dropped. See the <B
CLASS="COMMAND"
>ifconfig</B
> man pages for details on how to reset the MTU.</P
><P
>In addition, <B
CLASS="COMMAND"
>netstat -s</B
> will give the statistics collected for traffic across
all supported protocols. You may also look at
<TT
CLASS="FILENAME"
>/proc/net/snmp</TT
> for information
about current network behavior; see the next section for more details.</P
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="FRAG-OVERFLOW">5.3. Overflow of Fragmented Packets</H2
><P
>Using an <TT
CLASS="USERINPUT"
><B
>rsize</B
></TT
> or <TT
CLASS="USERINPUT"
><B
>wsize</B
></TT
>
larger than your network's MTU (set to 1500 in
many networks) will cause IP packet fragmentation when using NFS over UDP. IP
packet fragmentation and reassembly require a significant amount of CPU
resource at both ends of a network connection. In addition, packet
fragmentation also exposes your network traffic to greater unreliability,
since a complete RPC request must be retransmitted if a UDP packet fragment is
dropped for any reason. Any increase in RPC retransmissions, along with the
possibility of increased timeouts, is the single worst impediment to
performance for NFS over UDP.</P
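><P
>To get a feel for how much fragmentation a given block size causes, the
fragment count can be estimated with a little arithmetic. The sketch below
makes simplifying assumptions that are not taken from the text above: a
20-byte IP header with no options, an 8-byte UDP header, fragment payloads
rounded down to a multiple of 8 bytes, and no allowance for the RPC and NFS
headers. The function name <TT CLASS="USERINPUT"><B>fragments</B></TT> is
hypothetical.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
># fragments BLOCK_SIZE MTU
# Estimate how many IP fragments one UDP datagram carrying BLOCK_SIZE
# bytes of NFS data needs. Assumes a 20-byte IP header and an 8-byte
# UDP header, and ignores the RPC and NFS headers.
fragments() {
    per_frag=$(( ($2 - 20) / 8 * 8 ))   # payload per fragment, multiple of 8
    total=$(( $1 + 8 ))                 # data plus UDP header
    # Integer ceiling of total / per_frag.
    echo $(( (total + per_frag - 1) / per_frag ))
}

fragments 8192 1500
fragments 1024 1500</PRE
></FONT
></TD
></TR
></TABLE
><P
>Under these assumptions an 8K block on a 1500-byte MTU needs 6 fragments,
and losing any one of them forces the whole request to be retransmitted; a 1K
block fits in a single packet.</P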
><P
>Packets may be dropped for many reasons. If your network topology is
complex, fragments may take different routes, and may not all arrive at the server for
reassembly. NFS server capacity may also be an issue, since the kernel has a
limit of how many fragments it can buffer before it starts throwing away
packets. With kernels that support the <TT
CLASS="FILENAME"
>/proc</TT
>
filesystem, you can monitor the
files <TT
CLASS="FILENAME"
>/proc/sys/net/ipv4/ipfrag_high_thresh</TT
> and
<TT
CLASS="FILENAME"
>/proc/sys/net/ipv4/ipfrag_low_thresh</TT
>. Once the amount of memory used by unprocessed
fragments reaches the value specified by <TT
CLASS="FILENAME"
>ipfrag_high_thresh</TT
> (in
bytes), the kernel will simply start throwing away fragmented packets until
the amount of memory used by incomplete packets falls to the value specified by
<TT
CLASS="FILENAME"
>ipfrag_low_thresh</TT
>.</P
><P
>Another counter to monitor is <TT
CLASS="USERINPUT"
><B
>IP: ReasmFails</B
></TT
>
in the file <TT
CLASS="FILENAME"
>/proc/net/snmp</TT
>; this
is the number of fragment reassembly failures. If it goes up too quickly
during heavy file activity, you may have a problem.</P
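><P
>The counter can be picked out of <TT CLASS="FILENAME">/proc/net/snmp</TT>
with a short filter, so that it is easy to sample before and after a period
of heavy activity. This is a sketch: the function name
<TT CLASS="USERINPUT"><B>reasm_fails</B></TT> is hypothetical, and it assumes
the usual layout of that file, in which a header line beginning with
<TT CLASS="COMPUTEROUTPUT">Ip:</TT> names the columns and the next
<TT CLASS="COMPUTEROUTPUT">Ip:</TT> line holds the values.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
># reasm_fails: read /proc/net/snmp-style text on standard input and
# print the value of the ReasmFails column from the "Ip:" lines.
reasm_fails() {
    awk '$1 == "Ip:" {
        # Second "Ip:" line: the column is known, so print its value.
        if (col) { print $col; exit }
        # First "Ip:" line: find which column is named ReasmFails.
        for (i = 2; i != NF + 1; i++) if ($i == "ReasmFails") col = i
    }'
}

# Print the current value if the file is available on this system.
if [ -r /proc/net/snmp ]; then
    cat /proc/net/snmp | reasm_fails
fi</PRE
></FONT
></TD
></TR
></TABLE
><P
>Run it before and after a benchmark and compare the two values; it is the
rate of increase that matters, not the absolute number.</P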
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="NFS-TCP">5.4. NFS over TCP</H2
><P
>A new feature, available for both 2.4 and 2.5 kernels but not yet
integrated into the mainstream kernel at the time of
this writing, is NFS over TCP. Using TCP
has a distinct advantage and a distinct disadvantage over UDP.  The advantage
is that it works far better than UDP on lossy networks:
with TCP, a single dropped packet can be retransmitted without
retransmitting the entire RPC request. In addition, TCP will handle network speed differences
better than UDP, due to the underlying flow control at the transport level.</P
><P
>The disadvantage of using TCP is that it is not a stateless protocol like
UDP.  If your server crashes in the middle of a packet transmission,
the client will hang and any shares will need to be unmounted and remounted.</P
><P
>The overhead incurred by the TCP protocol will result in
somewhat slower performance than UDP under ideal network
conditions, but the cost is not severe, and is often not
noticeable without careful measurement. If you are using
gigabit ethernet from end to end, you might also investigate the
usage of jumbo frames, since the high speed network may
allow the larger frame sizes without encountering increased
collision rates, particularly if you have set the network
to full duplex.</P
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="TIMEOUT">5.5. Timeout and Retransmission Values</H2
><P
>Two mount command options, <TT
CLASS="USERINPUT"
><B
>timeo</B
></TT
>
and <TT
CLASS="USERINPUT"
><B
>retrans</B
></TT
>, control the behavior of UDP
requests when encountering client timeouts due to dropped packets, network
congestion, and so forth. The <TT
CLASS="USERINPUT"
><B
>-o timeo</B
></TT
>
option allows designation of the length
of time, in tenths of seconds, that the client will wait until it decides it
will not get a reply from the server, and must try to send the request again.
The default value is 7 tenths of a second. The
<TT
CLASS="USERINPUT"
><B
>-o retrans</B
></TT
> option allows
designation of the number of timeouts allowed before the client gives up, and
displays the <TT
CLASS="COMPUTEROUTPUT"
>Server not responding</TT
>
message. The default value is 3 attempts.
Once the client displays this message, it will continue to try to send
the request, but only once before displaying the error message if
another timeout occurs.  When the client reestablishes contact, it
will fall back to using the correct <TT
CLASS="USERINPUT"
><B
>retrans</B
></TT
>
value, and will display the <TT
CLASS="COMPUTEROUTPUT"
>Server OK</TT
> message.</P
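><P
>The two options combine to determine how long the client waits before
reporting the server as unresponsive. The sketch below estimates that wait
under an assumption that is not stated above: that the timeout doubles after
each retransmission, up to a ceiling of 60 seconds, as the
<EM>nfs(5)</EM> manual page of this period describes. Check the manual page
for your own system before relying on the result. The function name
<TT CLASS="USERINPUT"><B>major_timeout</B></TT> is hypothetical.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
># major_timeout TIMEO RETRANS
# Print the total wait, in tenths of a second, before the
# "Server not responding" message. Assumes the timeout doubles after
# each retransmission and is capped at 60 seconds (600 tenths).
major_timeout() {
    t=$1
    total=0
    n=0
    while [ "$n" -lt "$2" ]; do
        total=$((total + t))
        t=$((t * 2))
        if [ "$t" -gt 600 ]; then t=600; fi
        n=$((n + 1))
    done
    echo "$total"
}

# The defaults described above: timeo=7, retrans=3.
major_timeout 7 3</PRE
></FONT
></TD
></TR
></TABLE
><P
>With the defaults this gives 7 + 14 + 28 = 49 tenths of a second, a little
under five seconds, which shows how quickly the message can appear on a
congested network.</P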
><P
>If you are already encountering excessive retransmissions (see the output of
the <B
CLASS="COMMAND"
>nfsstat</B
> command), or want to increase the block transfer size without
encountering timeouts and retransmissions, you may want to adjust these
values. The specific adjustment will depend upon your environment, and in most
cases, the current defaults are appropriate.</P
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="NFSD-INSTANCE">5.6. Number of Instances of the NFSD Server Daemon</H2
><P
>Most startup scripts, Linux and otherwise, start 8 instances of
<B
CLASS="COMMAND"
>nfsd</B
>. In the
early days of NFS, Sun decided on this number as a rule of thumb, and everyone
else copied. There are no good measures of how many instances are optimal, but
a more heavily-trafficked server may require more.
You should use at the very least one daemon per processor, but
four to eight per processor may be a better rule of thumb.
If you are using a 2.4 or
higher kernel and you want to see how heavily each
<B
CLASS="COMMAND"
>nfsd</B
> thread is being used,
you can look at the file <TT
CLASS="FILENAME"
>/proc/net/rpc/nfsd</TT
>.
The last ten numbers on the <TT
CLASS="USERINPUT"
><B
>th</B
></TT
>
line in that file indicate the number of seconds that the thread usage was at
that percentage of the maximum allowable. If you have a large number in the
top three deciles, you may wish to increase the number
of <B
CLASS="COMMAND"
>nfsd</B
> instances. This
is done upon starting <B
CLASS="COMMAND"
>nfsd</B
> using the
number of instances as the command line
option, and is specified in the NFS startup script
(<TT
CLASS="FILENAME"
>/etc/rc.d/init.d/nfs</TT
> on
Red Hat) as <TT
CLASS="USERINPUT"
><B
>RPCNFSDCOUNT</B
></TT
>.
See the <EM
>nfsd(8)</EM
> man page for more information.</P
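><P
>The check on the top three deciles can be scripted. The following sketch
sums the last three numbers on the
<TT CLASS="USERINPUT"><B>th</B></TT> line; the function name
<TT CLASS="USERINPUT"><B>busy_seconds</B></TT> is hypothetical, and it assumes
the line format described above, with the ten usage figures at the end of the
line.</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="100%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
># busy_seconds: read /proc/net/rpc/nfsd-style text on standard input
# and print the sum of the last three numbers on the "th" line, that
# is, the seconds spent in the top three deciles of thread usage.
busy_seconds() {
    awk '$1 == "th" { printf "%.3f\n", $(NF-2) + $(NF-1) + $NF }'
}

# Print the current figure if this machine is running an NFS server.
if [ -r /proc/net/rpc/nfsd ]; then
    cat /proc/net/rpc/nfsd | busy_seconds
fi</PRE
></FONT
></TD
></TR
></TABLE
><P
>If the figure keeps growing while the server is under load, that is a hint
to raise the number of <B CLASS="COMMAND">nfsd</B> instances and measure
again.</P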
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="MEMLIMITS">5.7. Memory Limits on the Input Queue</H2
><P
>On 2.2 and 2.4 kernels, the socket input queue, where requests sit while they
are currently being processed, has a small default size limit (<TT
CLASS="FILENAME"
>rmem_default</TT
>)
of 64k. This queue is important for clients with heavy read loads, and servers
with heavy write loads. As an example, if you are running 8 instances of nfsd
on the server, each will only have 8k to store write requests while it
processes them. In addition, the socket output queue - important for clients
with heavy write loads and servers with heavy read loads - also has a small
default size (<TT
CLASS="FILENAME"
>wmem_default</TT
>).</P
><P
>Several published runs of the NFS benchmark
<A
HREF="http://www.spec.org/osg/sfs97/"
TARGET="_top"
>SPECsfs</A
>
specify usage of a much higher value for both
the read and write value sets, <TT
CLASS="FILENAME"
>[rw]mem_default</TT
> and
<TT
CLASS="FILENAME"
>[rw]mem_max</TT
>. You might
consider increasing these values to at least 256k. The read and write limits
are set in the proc file system using (for example) the files
<TT
CLASS="FILENAME"
>/proc/sys/net/core/rmem_default</TT
> and
<TT
CLASS="FILENAME"
>/proc/sys/net/core/rmem_max</TT
>. The
<TT
CLASS="FILENAME"
>rmem_default</TT
> value can be increased in three steps; the following method is a
bit of a hack but should work and should not cause any problems:</P
><P
></P
><UL
><LI
><P
>     Increase the size listed in the file:</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="90%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
>     # echo 262144 &#62; /proc/sys/net/core/rmem_default
     # echo 262144 &#62; /proc/sys/net/core/rmem_max</PRE
></FONT
></TD
></TR
></TABLE
></LI
><LI
><P
>     Restart NFS.  For example, on Red Hat systems,</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="90%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
>     # /etc/rc.d/init.d/nfs restart</PRE
></FONT
></TD
></TR
></TABLE
></LI
><LI
><P
>     You might return the size limits to their normal size in case other
     kernel systems depend on it:</P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="90%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
>     # echo 65536 &#62; /proc/sys/net/core/rmem_default
     # echo 65536 &#62; /proc/sys/net/core/rmem_max</PRE
></FONT
></TD
></TR
></TABLE
></LI
></UL
><P
>     This last step may be necessary because machines have been reported to
     crash if these values are left changed for long periods of time.</P
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="AUTONEGOTIATION">5.8. Turning Off Autonegotiation of NICs and Hubs</H2
><P
>If network cards auto-negotiate badly with hubs and switches, and ports run at
different speeds, or with different duplex configurations, performance will be
severely impacted due to excessive collisions, dropped packets, etc. If you
see excessive numbers of dropped packets in the
<B
CLASS="COMMAND"
>nfsstat</B
> output, or poor
network performance in general, try playing around with the network speed and
duplex settings. If possible, concentrate on establishing a 100BaseT full
duplex subnet; the virtual elimination of collisions in full duplex will
remove the most severe performance inhibitor for NFS over UDP.  Be careful
when turning off autonegotiation on a card: The hub or switch that the card
is attached to will then resort to other mechanisms (such as parallel detection)
to determine the duplex settings, and some cards default to half duplex
because it is more likely to be supported by an old hub.  The best solution,
if the driver supports it, is to force the card to negotiate 100BaseT
full duplex.</P
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="SYNC-ASYNC">5.9. Synchronous vs. Asynchronous Behavior in NFS</H2
><P
>The default export behavior for both NFS Version 2 and Version 3 protocols,
used by <B
CLASS="COMMAND"
>exportfs</B
> in <SPAN
CLASS="APPLICATION"
>nfs-utils</SPAN
>
versions prior to Version 1.11 (the latter is in the CVS tree,
but not yet released in a package, as of January, 2002) is
"asynchronous". This default permits the server to reply to client requests as
soon as it has processed the request and handed it off to the local file
system, without waiting for the data to be written to stable storage. This is
indicated by the <TT
CLASS="USERINPUT"
><B
>async</B
></TT
> option denoted in the server's export list. It yields
better performance at the cost of possible data corruption if the server
reboots while still holding unwritten data and/or metadata in its caches. This
possible data corruption is not detectable at the time of occurrence, since
the <TT
CLASS="USERINPUT"
><B
>async</B
></TT
> option instructs the server to lie to the client, telling the
client that all data has indeed been written to the stable storage, regardless
of the protocol used.</P
><P
>In order to conform with "synchronous" behavior, used as the default for most
proprietary systems supporting NFS (Solaris, HP-UX, RS/6000, etc.), and now
used as the default in the latest version of <B
CLASS="COMMAND"
>exportfs</B
>, the Linux Server's
file system must be exported with the <TT
CLASS="USERINPUT"
><B
>sync</B
></TT
> option. Note that specifying
synchronous exports will result in no option being seen in the server's export
list:</P
><P
></P
><UL
><LI
><P
>     Export a couple of file systems to everyone, using slightly different
     options:</P
><P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="90%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
>    # /usr/sbin/exportfs -o rw,sync *:/usr/local
    # /usr/sbin/exportfs -o rw *:/tmp</PRE
></FONT
></TD
></TR
></TABLE
></P
></LI
><LI
><P
>     Now we can see what the exported file system parameters look like:</P
><P
><TABLE
BORDER="0"
BGCOLOR="#E0E0E0"
WIDTH="90%"
><TR
><TD
><FONT
COLOR="#000000"
><PRE
CLASS="PROGRAMLISTING"
>    # /usr/sbin/exportfs -v
    /usr/local *(rw)
    /tmp *(rw,async)</PRE
></FONT
></TD
></TR
></TABLE
></P
></LI
></UL
><P
>If your kernel is compiled with the <TT
CLASS="FILENAME"
>/proc</TT
> filesystem,
then the file <TT
CLASS="FILENAME"
>/proc/fs/nfs/exports</TT
> will also show the
full list of export options.</P
><P
>When synchronous behavior is specified, the server will not complete (that is,
reply to the client) an NFS version 2 protocol request until the local file
system has written all data/metadata to the disk. The server
<EM
>will</EM
> complete a
synchronous NFS version 3 request without this delay, and will return the
status of the data in order to inform the client as to what data should be
maintained in its caches, and what data is safe to discard. There are 3
possible status values, defined in an enumerated type, <TT
CLASS="USERINPUT"
><B
>nfs3_stable_how</B
></TT
>, in
<TT
CLASS="FILENAME"
>include/linux/nfs.h</TT
>. The values, along with the subsequent actions taken due
to these results, are as follows:</P
><P
></P
><UL
><LI
><P
>NFS_UNSTABLE - Data/Metadata was not committed to stable storage on the
server, and must be cached on the client until a subsequent client commit
request assures that the server does send data to stable storage.</P
></LI
><LI
><P
>NFS_DATA_SYNC - Metadata was not sent to stable storage, and must be cached
on the client. A subsequent commit is necessary, as is required above.</P
></LI
><LI
><P
>NFS_FILE_SYNC - No data/metadata need be cached, and a subsequent commit
need not be sent for the range covered by this request.</P
></LI
></UL
><P
>In addition to the above definition of synchronous behavior, the client may
explicitly insist on total synchronous behavior, regardless of the protocol,
by opening all files with the <TT
CLASS="USERINPUT"
><B
>O_SYNC</B
></TT
> option. In this case, all replies to
client requests will wait until the data has hit the server's disk, regardless
of the protocol used (meaning that, in NFS version 3, all requests will be
<TT
CLASS="USERINPUT"
><B
>NFS_FILE_SYNC</B
></TT
> requests, and will require that the Server returns this status).
In that case, the performance of NFS Version 2 and NFS Version 3 will be
virtually identical.</P
><P
>If, however, the old default <TT
CLASS="USERINPUT"
><B
>async</B
></TT
>
behavior is used, the <TT
CLASS="USERINPUT"
><B
>O_SYNC</B
></TT
> option has
no effect at all in either version of NFS, since the server will reply to the
client without waiting for the write to complete. In that case the performance
differences between versions will also disappear.</P
><P
>Finally, note that, for NFS version 3 protocol requests, a subsequent commit
request from the NFS client at file close time, or at <B
CLASS="COMMAND"
>fsync()</B
> time, will force
the server to write any previously unwritten data/metadata to the disk, and
the server will not reply to the client until this has been completed, as long
as <TT
CLASS="USERINPUT"
><B
>sync</B
></TT
> behavior is followed. If <TT
CLASS="USERINPUT"
><B
>async</B
></TT
> is used, the commit is essentially
a no-op, since the server once again lies to the client, telling the client that
the data has been sent to stable storage. This again exposes the client and
server to data corruption, since cached data may be discarded on the client
due to its belief that the server now has the data maintained in stable
storage.</P
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="NON-NFS-PERFORMANCE">5.10. Non-NFS-Related Means of Enhancing Server Performance</H2
><P
>In general, server performance and server disk access speed will have an
important effect on NFS performance.
Offering general guidelines for setting up a well-functioning file server is
outside the scope of this document, but a few hints may be worth mentioning:</P
><P
></P
><UL
><LI
><P
>If you have access to RAID arrays, use RAID 1/0 for both write speed and
redundancy; RAID 5 gives you good read speeds but lousy write speeds.</P
></LI
><LI
><P
>A journalling filesystem will drastically reduce your reboot time in the
event of a system crash. Currently,
<A
HREF="ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/"
TARGET="_top"
>ext3</A
> will work correctly with NFS
version 3. In addition, Reiserfs version 3.6 will work with NFS version 3 on
2.4.7 or later kernels (patches are available for previous kernels). Earlier versions
of Reiserfs did not include room for generation numbers in the inode, exposing
the possibility of undetected data corruption during a server reboot.</P
></LI
><LI
><P
>Additionally, journalled file systems can be configured to maximize
performance by taking advantage of the fact that journal updates are all that
is necessary for data protection. One example is using ext3 with <TT
CLASS="USERINPUT"
><B
>data=journal</B
></TT
>
so that all updates go first to the journal, and later to the main file
system. Once the journal has been updated, the NFS server can safely issue the
reply to the clients, and the main file system update can occur at the
server's leisure.</P
><P
>The journal in a journalling file system may also reside on a separate device
such as a flash memory card so that journal updates normally require no seeking. With only rotational
delay imposing a cost, this gives reasonably good synchronous IO performance.
Note that ext3 currently supports journal relocation, and ReiserFS will
(officially) support it soon. The Reiserfs tool package found at <A
HREF="ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.x.0k.tar.gz"
TARGET="_top"
>ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.x.0k.tar.gz </A
>
contains
the <B
CLASS="COMMAND"
>reiserfstune</B
> tool, which will allow journal relocation. It does, however,
require a kernel patch which has not yet been officially released as of
January, 2002.</P
></LI
><LI
><P
>Using an automounter (such as <SPAN
CLASS="APPLICATION"
>autofs</SPAN
> or <SPAN
CLASS="APPLICATION"
>amd</SPAN
>) may prevent hangs if you
cross-mount files on your machines (whether on purpose or by oversight) and
one of those machines goes down. See the <A
HREF="http://www.linuxdoc.org/HOWTO/mini/Automount.html"
TARGET="_top"
>Automount Mini-HOWTO</A
> for details.</P
></LI
><LI
><P
>Some manufacturers (Network Appliance, Hewlett Packard, and others) provide NFS
accelerators in the form of Non-Volatile RAM. NVRAM will boost access speed to
stable storage up to the equivalent of <TT
CLASS="USERINPUT"
><B
>async</B
></TT
> access.</P
></LI
></UL
></DIV
></DIV
></BODY
></HTML
>