1333 lines
32 KiB
HTML
1333 lines
32 KiB
HTML
<HTML
|
|
><HEAD
|
|
><TITLE
|
|
>Optimizing NFS Performance</TITLE
|
|
><META
|
|
NAME="GENERATOR"
|
|
CONTENT="Modular DocBook HTML Stylesheet Version 1.76b+
|
|
"><LINK
|
|
REL="HOME"
|
|
TITLE="Linux NFS-HOWTO"
|
|
HREF="index.html"><LINK
|
|
REL="PREVIOUS"
|
|
TITLE="Setting up an NFS Client"
|
|
HREF="client.html"><LINK
|
|
REL="NEXT"
|
|
TITLE="Security and NFS"
|
|
HREF="security.html"></HEAD
|
|
><BODY
|
|
CLASS="SECT1"
|
|
BGCOLOR="#FFFFFF"
|
|
TEXT="#000000"
|
|
LINK="#0000FF"
|
|
VLINK="#840084"
|
|
ALINK="#0000FF"
|
|
><DIV
|
|
CLASS="NAVHEADER"
|
|
><TABLE
|
|
SUMMARY="Header navigation table"
|
|
WIDTH="100%"
|
|
BORDER="0"
|
|
CELLPADDING="0"
|
|
CELLSPACING="0"
|
|
><TR
|
|
><TH
|
|
COLSPAN="3"
|
|
ALIGN="center"
|
|
>Linux NFS-HOWTO</TH
|
|
></TR
|
|
><TR
|
|
><TD
|
|
WIDTH="10%"
|
|
ALIGN="left"
|
|
VALIGN="bottom"
|
|
><A
|
|
HREF="client.html"
|
|
ACCESSKEY="P"
|
|
>Prev</A
|
|
></TD
|
|
><TD
|
|
WIDTH="80%"
|
|
ALIGN="center"
|
|
VALIGN="bottom"
|
|
></TD
|
|
><TD
|
|
WIDTH="10%"
|
|
ALIGN="right"
|
|
VALIGN="bottom"
|
|
><A
|
|
HREF="security.html"
|
|
ACCESSKEY="N"
|
|
>Next</A
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
><HR
|
|
ALIGN="LEFT"
|
|
WIDTH="100%"></DIV
|
|
><DIV
|
|
CLASS="SECT1"
|
|
><H1
|
|
CLASS="SECT1"
|
|
><A
|
|
NAME="PERFORMANCE">5. Optimizing NFS Performance</H1
|
|
><P
|
|
>Careful analysis of your environment, both from the client and from the server
|
|
point of view, is the first step necessary for optimal NFS performance. The
|
|
first sections will address issues that are generally important to the client.
|
|
Later (<A
|
|
HREF="performance.html#FRAG-OVERFLOW"
|
|
>Section 5.3</A
|
|
> and beyond), server side issues
|
|
will be discussed. In both
|
|
cases, these issues will not be limited exclusively to one side or the other,
|
|
but it is useful to separate the two in order to get a clearer picture of
|
|
cause and effect.</P
|
|
><P
|
|
>Aside from the general network configuration - appropriate network capacity,
|
|
faster NICs, full duplex settings in order to reduce collisions, agreement in
|
|
network speed among the switches and hubs, etc. - one of the most important
|
|
client optimization settings are the NFS data transfer buffer sizes, specified
|
|
by the <B
|
|
CLASS="COMMAND"
|
|
>mount</B
|
|
> command options <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>rsize</B
|
|
></TT
|
|
>
|
|
and <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>wsize</B
|
|
></TT
|
|
>.</P
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="BLOCKSIZES">5.1. Setting Block Size to Optimize Transfer Speeds</H2
|
|
><P
|
|
>The <B
|
|
CLASS="COMMAND"
|
|
>mount</B
|
|
> command options <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>rsize</B
|
|
></TT
|
|
>
|
|
and <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>wsize</B
|
|
></TT
|
|
> specify the size of the chunks of
|
|
data that the client and server pass back and forth
|
|
to each other. If no <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>rsize</B
|
|
></TT
|
|
>
|
|
and <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>wsize</B
|
|
></TT
|
|
> options are specified,
|
|
the default varies by which version of NFS we
|
|
are using. The most common default is 4K (4096 bytes), although for TCP-based
|
|
mounts in 2.2 kernels, and for all mounts beginning with 2.4 kernels, the
|
|
server specifies the default block size.</P
|
|
><P
|
|
>The theoretical limit for the NFS V2 protocol is 8K. For the V3 protocol, the
|
|
limit is specific to the server. On the Linux server, the maximum block size
|
|
is defined by the value of the kernel constant
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>NFSSVC_MAXBLKSIZE</B
|
|
></TT
|
|
>, found in the
|
|
Linux kernel source file <TT
|
|
CLASS="FILENAME"
|
|
>./include/linux/nfsd/const.h</TT
|
|
>.
|
|
The current maximum block size for the kernel, as of 2.4.17, is 8K (8192 bytes),
|
|
but the patch set implementing NFS over TCP/IP transport in the 2.4
|
|
series, as of this writing, uses a value of 32K (defined in the
|
|
patch as 32*1024) for the maximum block size.</P
|
|
><P
|
|
>All 2.4 clients currently support up to 32K block transfer sizes, allowing the
|
|
standard 32K block transfers across NFS mounts from other servers, such as
|
|
Solaris, without client modification.</P
|
|
><P
|
|
>The defaults may be too big or too small, depending on the specific
|
|
combination of hardware and kernels. On the one hand, some combinations of
|
|
Linux kernels and network cards (largely on older machines) cannot handle
|
|
blocks that large. On the other hand, if they can handle larger blocks, a
|
|
bigger size might be faster.</P
|
|
><P
|
|
>You will want to experiment and find an <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>rsize</B
|
|
></TT
|
|
>
|
|
and <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>wsize</B
|
|
></TT
|
|
> that works and is as
|
|
fast as possible. You can test the speed of your options with some simple
|
|
commands, if your network environment is not heavily used. Note that your
|
|
results may vary widely unless you resort to using more complex benchmarks,
|
|
such as <SPAN
|
|
CLASS="APPLICATION"
|
|
>Bonnie</SPAN
|
|
>, <SPAN
|
|
CLASS="APPLICATION"
|
|
>Bonnie++</SPAN
|
|
>,
|
|
or <SPAN
|
|
CLASS="APPLICATION"
|
|
>IOzone</SPAN
|
|
>.</P
|
|
><P
|
|
>The first of these commands transfers 16384 blocks of 16k each from the
|
|
special file <TT
|
|
CLASS="FILENAME"
|
|
>/dev/zero</TT
|
|
> (which if
|
|
you read it just spits out zeros <EM
|
|
>really</EM
|
|
>
|
|
fast) to the mounted partition. We will time it to see how long it takes. So,
|
|
from the client machine, type:</P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> # time dd if=/dev/zero of=/mnt/home/testfile bs=16k count=16384</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
><P
|
|
>This creates a 256Mb file of zeroed bytes. In general, you should create a
|
|
file that's at least twice as large as the system RAM on the server, but make
|
|
sure you have enough disk space! Then read back the file into the great black
|
|
hole on the client machine (<TT
|
|
CLASS="FILENAME"
|
|
>/dev/null</TT
|
|
>) by
|
|
typing the following:</P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> # time dd if=/mnt/home/testfile of=/dev/null bs=16k</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
><P
|
|
>Repeat this a few times and average how long it takes. Be sure to unmount and
|
|
remount the filesystem each time (both on the client and, if you are zealous,
|
|
locally on the server as well), which should clear out any caches.</P
|
|
><P
|
|
>Then unmount, and mount again with a larger and smaller block size. They
|
|
should be multiples of 1024, and not larger than the maximum block size
|
|
allowed by your system. Note that NFS Version 2 is limited to a maximum of 8K,
|
|
regardless of the maximum block size defined by
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>NFSSVC_MAXBLKSIZE</B
|
|
></TT
|
|
>; Version 3
|
|
will support up to 64K, if permitted. The block size should be a power of two
|
|
since most of the parameters that would constrain it (such as file system
|
|
block sizes and network packet size) are also powers of two. However, some
|
|
users have reported better successes with block sizes that are not powers of
|
|
two but are still multiples of the file system block size and the network
|
|
packet size.</P
|
|
><P
|
|
>Directly after mounting with a larger size, cd into the mounted
|
|
file system and do things like <B
|
|
CLASS="COMMAND"
|
|
>ls</B
|
|
>, explore
|
|
the filesystem a bit to make sure everything is as it
|
|
should. If the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>rsize</B
|
|
></TT
|
|
>/<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>wsize</B
|
|
></TT
|
|
>
|
|
is too large the symptoms are very odd and not 100%
|
|
obvious. A typical symptom is incomplete file lists when doing
|
|
<B
|
|
CLASS="COMMAND"
|
|
>ls</B
|
|
>, and no
|
|
error messages, or reading files failing mysteriously with no error messages.
|
|
After establishing that the given <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>rsize</B
|
|
></TT
|
|
>/
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>wsize</B
|
|
></TT
|
|
> works you can do the speed tests
|
|
again. Different server platforms are likely to have different optimal sizes.</P
|
|
><P
|
|
>Remember to edit <TT
|
|
CLASS="FILENAME"
|
|
>/etc/fstab</TT
|
|
> to reflect the
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>rsize</B
|
|
></TT
|
|
>/<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>wsize</B
|
|
></TT
|
|
> you found
|
|
to be the most desirable.</P
|
|
><P
|
|
>If your results seem inconsistent, or doubtful, you may need to analyze your
|
|
network more extensively while varying the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>rsize</B
|
|
></TT
|
|
>
|
|
and <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>wsize</B
|
|
></TT
|
|
> values. In that
|
|
case, here are several pointers to benchmarks that may prove useful:</P
|
|
><P
|
|
></P
|
|
><UL
|
|
><LI
|
|
><P
|
|
> Bonnie <A
|
|
HREF="http://www.textuality.com/bonnie/"
|
|
TARGET="_top"
|
|
>http://www.textuality.com/bonnie/</A
|
|
></P
|
|
></LI
|
|
><LI
|
|
><P
|
|
> Bonnie++ <A
|
|
HREF="http://www.coker.com.au/bonnie++/"
|
|
TARGET="_top"
|
|
>http://www.coker.com.au/bonnie++/</A
|
|
></P
|
|
></LI
|
|
><LI
|
|
><P
|
|
> IOzone file system benchmark <A
|
|
HREF="http://www.iozone.org/"
|
|
TARGET="_top"
|
|
>http://www.iozone.org/</A
|
|
></P
|
|
></LI
|
|
><LI
|
|
><P
|
|
> The official NFS benchmark,
|
|
SPECsfs97 <A
|
|
HREF="http://www.spec.org/osg/sfs97/"
|
|
TARGET="_top"
|
|
>http://www.spec.org/osg/sfs97/</A
|
|
></P
|
|
></LI
|
|
></UL
|
|
><P
|
|
>The easiest benchmark with the widest coverage, including an extensive spread
|
|
of file sizes, and of IO types - reads, & writes, rereads & rewrites, random
|
|
access, etc. - seems to be IOzone. A recommended invocation of IOzone (for
|
|
which you must have root privileges) includes unmounting and remounting the
|
|
directory under test, in order to clear out the caches between tests, and
|
|
including the file close time in the measurements. Assuming you've already
|
|
exported <TT
|
|
CLASS="FILENAME"
|
|
>/tmp</TT
|
|
> to everyone from the server
|
|
<TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>foo</TT
|
|
>,
|
|
and that you've installed IOzone in the local directory, this should work:</P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> # echo "foo:/tmp /mnt/foo nfs rw,hard,intr,rsize=8192,wsize=8192 0 0"
|
|
>> /etc/fstab
|
|
# mkdir /mnt/foo
|
|
# mount /mnt/foo
|
|
# ./iozone -a -R -c -U /mnt/foo -f /mnt/foo/testfile > logfile</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
><P
|
|
>The benchmark should take 2-3 hours at most, but of course you will need to
|
|
run it for each value of rsize and wsize that is of interest. The web site
|
|
gives full documentation of the parameters, but the specific options used
|
|
above are:</P
|
|
><P
|
|
></P
|
|
><UL
|
|
><LI
|
|
><P
|
|
> <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>-a</B
|
|
></TT
|
|
> Full automatic mode, which tests file sizes of 64K to 512M, using
|
|
record sizes of 4K to 16M</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
> <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>-R</B
|
|
></TT
|
|
> Generate report in excel spreadsheet form (The "surface plot"
|
|
option for graphs is best)</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
> <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>-c</B
|
|
></TT
|
|
> Include the file close time in the tests, which will pick up the
|
|
NFS version 3 commit time</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
> <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>-U</B
|
|
></TT
|
|
> Use the given mount point to unmount and remount between tests;
|
|
it clears out caches</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
> <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>-f</B
|
|
></TT
|
|
> When using unmount, you have to locate the test file in the
|
|
mounted file system</P
|
|
></LI
|
|
></UL
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="PACKET-AND-NETWORK">5.2. Packet Size and Network Drivers</H2
|
|
><P
|
|
> While many Linux network card drivers are excellent, some are quite shoddy,
|
|
including a few drivers for some fairly standard cards. It is worth
|
|
experimenting with your network card directly to find out how it can
|
|
best handle traffic.</P
|
|
><P
|
|
> Try <B
|
|
CLASS="COMMAND"
|
|
>ping</B
|
|
>ing back and forth
|
|
between the two machines with large packets using
|
|
the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>-f</B
|
|
></TT
|
|
> and <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>-s</B
|
|
></TT
|
|
>
|
|
options with <B
|
|
CLASS="COMMAND"
|
|
>ping</B
|
|
> (see <EM
|
|
>ping(8)</EM
|
|
>
|
|
for more details) and see if a
|
|
lot of packets get dropped, or if they take a long time for a reply. If so,
|
|
you may have a problem with the performance of your network card.</P
|
|
><P
|
|
>For a more extensive analysis of NFS behavior in particular, use the <B
|
|
CLASS="COMMAND"
|
|
>nfsstat</B
|
|
> command to look at nfs transactions, client and server statistics, network
|
|
statistics, and so forth. The <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>"-o net"</B
|
|
></TT
|
|
> option will show you the number of
|
|
dropped packets in relation to the total number of transactions. In UDP
|
|
transactions, the most important statistic is the number of retransmissions,
|
|
due to dropped packets, socket buffer overflows, general server congestion,
|
|
timeouts, etc. This will have a tremendously important effect on NFS
|
|
performance, and should be carefully monitored.
|
|
Note that <B
|
|
CLASS="COMMAND"
|
|
>nfsstat</B
|
|
> does not yet
|
|
implement the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>-z</B
|
|
></TT
|
|
> option, which would zero out all counters, so you must look
|
|
at the current <B
|
|
CLASS="COMMAND"
|
|
>nfsstat</B
|
|
> counter values prior to running the benchmarks.</P
|
|
><P
|
|
>To correct network problems, you may wish to reconfigure the packet size that
|
|
your network card uses. Very often there is a constraint somewhere else in the
|
|
network (such as a router) that causes a smaller maximum packet size between
|
|
two machines than what the network cards on the machines are actually capable
|
|
of. TCP should autodiscover the appropriate packet size for a network, but UDP
|
|
will simply stay at a default value. So determining the appropriate packet
|
|
size is especially important if you are using NFS over UDP.</P
|
|
><P
|
|
>You can test for the network packet size using the <B
|
|
CLASS="COMMAND"
|
|
>tracepath</B
|
|
> command: From
|
|
the client machine, just type <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>tracepath</B
|
|
></TT
|
|
>
|
|
<EM
|
|
>server</EM
|
|
> <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>2049</B
|
|
></TT
|
|
> and the path MTU should
|
|
be reported at the bottom. You can then set the MTU on your network card equal
|
|
to the path MTU, by using the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>MTU</B
|
|
></TT
|
|
>
|
|
option to <B
|
|
CLASS="COMMAND"
|
|
>ifconfig</B
|
|
>, and see if fewer packets
|
|
get dropped. See the <B
|
|
CLASS="COMMAND"
|
|
>ifconfig</B
|
|
> man pages for details on how to reset the MTU.</P
|
|
><P
|
|
>In addition, <B
|
|
CLASS="COMMAND"
|
|
>netstat -s</B
|
|
> will give the statistics collected for traffic across
|
|
all supported protocols. You may also look at
|
|
<TT
|
|
CLASS="FILENAME"
|
|
>/proc/net/snmp</TT
|
|
> for information
|
|
about current network behavior; see the next section for more details.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="FRAG-OVERFLOW">5.3. Overflow of Fragmented Packets</H2
|
|
><P
|
|
>Using an <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>rsize</B
|
|
></TT
|
|
> or <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>wsize</B
|
|
></TT
|
|
>
|
|
larger than your network's MTU (often set to 1500, in
|
|
many networks) will cause IP packet fragmentation when using NFS over UDP. IP
|
|
packet fragmentation and reassembly require a significant amount of CPU
|
|
resource at both ends of a network connection. In addition, packet
|
|
fragmentation also exposes your network traffic to greater unreliability,
|
|
since a complete RPC request must be retransmitted if a UDP packet fragment is
|
|
dropped for any reason. Any increase of RPC retransmissions, along with the
|
|
possibility of increased timeouts, are the single worst impediment to
|
|
performance for NFS over UDP.</P
|
|
><P
|
|
>Packets may be dropped for many reasons. If your network topography is
|
|
complex, fragment routes may differ, and may not all arrive at the Server for
|
|
reassembly. NFS Server capacity may also be an issue, since the kernel has a
|
|
limit of how many fragments it can buffer before it starts throwing away
|
|
packets. With kernels that support the <TT
|
|
CLASS="FILENAME"
|
|
>/proc</TT
|
|
>
|
|
filesystem, you can monitor the
|
|
files <TT
|
|
CLASS="FILENAME"
|
|
>/proc/sys/net/ipv4/ipfrag_high_thresh</TT
|
|
> and
|
|
<TT
|
|
CLASS="FILENAME"
|
|
>/proc/sys/net/ipv4/ipfrag_low_thresh</TT
|
|
>. Once the number of unprocessed,
|
|
fragmented packets reaches the number specified by <TT
|
|
CLASS="FILENAME"
|
|
>ipfrag_high_thresh</TT
|
|
> (in
|
|
bytes), the kernel will simply start throwing away fragmented packets until
|
|
the number of incomplete packets reaches the number specified by
|
|
<TT
|
|
CLASS="FILENAME"
|
|
>ipfrag_low_thresh.</TT
|
|
></P
|
|
><P
|
|
>Another counter to monitor is <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>IP: ReasmFails</B
|
|
></TT
|
|
>
|
|
in the file <TT
|
|
CLASS="FILENAME"
|
|
>/proc/net/snmp</TT
|
|
>; this
|
|
is the number of fragment reassembly failures. if it goes up too quickly
|
|
during heavy file activity, you may have problem.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="NFS-TCP">5.4. NFS over TCP</H2
|
|
><P
|
|
>A new feature, available for both 2.4 and 2.5 kernels but not yet
|
|
integrated into the mainstream kernel at the time of
|
|
this writing, is NFS over TCP. Using TCP
|
|
has a distinct advantage and a distinct disadvantage over UDP. The advantage
|
|
is that it works far better than UDP on lossy networks.
|
|
When using TCP, a single dropped packet can be retransmitted, without
|
|
the retransmission of the entire RPC request, resulting in better performance
|
|
on lossy networks. In addition, TCP will handle network speed differences
|
|
better than UDP, due to the underlying flow control at the network level. </P
|
|
><P
|
|
>The disadvantage of using TCP is that it is not a stateless protocol like
|
|
UDP. If your server crashes in the middle of a packet transmission,
|
|
the client will hang and any shares will need to be unmounted and remounted.</P
|
|
><P
|
|
>The overhead incurred by the TCP protocol will result in
|
|
somewhat slower performance than UDP under ideal network
|
|
conditions, but the cost is not severe, and is often not
|
|
noticable without careful measurement. If you are using
|
|
gigabit ethernet from end to end, you might also investigate the
|
|
usage of jumbo frames, since the high speed network may
|
|
allow the larger frame sizes without encountering increased
|
|
collision rates, particularly if you have set the network
|
|
to full duplex.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="TIMEOUT">5.5. Timeout and Retransmission Values</H2
|
|
><P
|
|
>Two mount command options, <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>timeo</B
|
|
></TT
|
|
>
|
|
and <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>retrans</B
|
|
></TT
|
|
>, control the behavior of UDP
|
|
requests when encountering client timeouts due to dropped packets, network
|
|
congestion, and so forth. The <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>-o timeo</B
|
|
></TT
|
|
>
|
|
option allows designation of the length
|
|
of time, in tenths of seconds, that the client will wait until it decides it
|
|
will not get a reply from the server, and must try to send the request again.
|
|
The default value is 7 tenths of a second. The
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>-o retrans</B
|
|
></TT
|
|
> option allows
|
|
designation of the number of timeouts allowed before the client gives up, and
|
|
displays the <TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>Server not responding</TT
|
|
>
|
|
message. The default value is 3 attempts.
|
|
Once the client displays this message, it will continue to try to send
|
|
the request, but only once before displaying the error message if
|
|
another timeout occurs. When the client reestablishes contact, it
|
|
will fall back to using the correct <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>retrans</B
|
|
></TT
|
|
>
|
|
value, and will display the <TT
|
|
CLASS="COMPUTEROUTPUT"
|
|
>Server OK</TT
|
|
> message.</P
|
|
><P
|
|
>If you are already encountering excessive retransmissions (see the output of
|
|
the <B
|
|
CLASS="COMMAND"
|
|
>nfsstat</B
|
|
> command), or want to increase the block transfer size without
|
|
encountering timeouts and retransmissions, you may want to adjust these
|
|
values. The specific adjustment will depend upon your environment, and in most
|
|
cases, the current defaults are appropriate.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="NFSD-INSTANCE">5.6. Number of Instances of the NFSD Server Daemon</H2
|
|
><P
|
|
>Most startup scripts, Linux and otherwise, start 8 instances of
|
|
<B
|
|
CLASS="COMMAND"
|
|
>nfsd</B
|
|
>. In the
|
|
early days of NFS, Sun decided on this number as a rule of thumb, and everyone
|
|
else copied. There are no good measures of how many instances are optimal, but
|
|
a more heavily-trafficked server may require more.
|
|
You should use at the very least one daemon per processor, but
|
|
four to eight per processor may be a better rule of thumb.
|
|
If you are using a 2.4 or
|
|
higher kernel and you want to see how heavily each
|
|
<B
|
|
CLASS="COMMAND"
|
|
>nfsd</B
|
|
> thread is being used,
|
|
you can look at the file <TT
|
|
CLASS="FILENAME"
|
|
>/proc/net/rpc/nfsd</TT
|
|
>.
|
|
The last ten numbers on the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>th</B
|
|
></TT
|
|
>
|
|
line in that file indicate the number of seconds that the thread usage was at
|
|
that percentage of the maximum allowable. If you have a large number in the
|
|
top three deciles, you may wish to increase the number
|
|
of <B
|
|
CLASS="COMMAND"
|
|
>nfsd</B
|
|
> instances. This
|
|
is done upon starting <B
|
|
CLASS="COMMAND"
|
|
>nfsd</B
|
|
> using the
|
|
number of instances as the command line
|
|
option, and is specified in the NFS startup script
|
|
(<TT
|
|
CLASS="FILENAME"
|
|
>/etc/rc.d/init.d/nfs</TT
|
|
> on
|
|
Red Hat) as <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>RPCNFSDCOUNT</B
|
|
></TT
|
|
>.
|
|
See the <EM
|
|
>nfsd(8)</EM
|
|
> man page for more information.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="MEMLIMITS">5.7. Memory Limits on the Input Queue</H2
|
|
><P
|
|
>On 2.2 and 2.4 kernels, the socket input queue, where requests sit while they
|
|
are currently being processed, has a small default size limit (<TT
|
|
CLASS="FILENAME"
|
|
>rmem_default</TT
|
|
>)
|
|
of 64k. This queue is important for clients with heavy read loads, and servers
|
|
with heavy write loads. As an example, if you are running 8 instances of nfsd
|
|
on the server, each will only have 8k to store write requests while it
|
|
processes them. In addition, the socket output queue - important for clients
|
|
with heavy write loads and servers with heavy read loads - also has a small
|
|
default size (<TT
|
|
CLASS="FILENAME"
|
|
>wmem_default</TT
|
|
>).</P
|
|
><P
|
|
>Several published runs of the NFS benchmark
|
|
<A
|
|
HREF="http://www.spec.org/osg/sfs97/"
|
|
TARGET="_top"
|
|
>SPECsfs</A
|
|
>
|
|
specify usage of a much higher value for both
|
|
the read and write value sets, <TT
|
|
CLASS="FILENAME"
|
|
>[rw]mem_default</TT
|
|
> and
|
|
<TT
|
|
CLASS="FILENAME"
|
|
>[rw]mem_max</TT
|
|
>. You might
|
|
consider increasing these values to at least 256k. The read and write limits
|
|
are set in the proc file system using (for example) the files
|
|
<TT
|
|
CLASS="FILENAME"
|
|
>/proc/sys/net/core/rmem_default</TT
|
|
> and
|
|
<TT
|
|
CLASS="FILENAME"
|
|
>/proc/sys/net/core/rmem_max</TT
|
|
>. The
|
|
<TT
|
|
CLASS="FILENAME"
|
|
>rmem_default</TT
|
|
> value can be increased in three steps; the following method is a
|
|
bit of a hack but should work and should not cause any problems:</P
|
|
><P
|
|
></P
|
|
><UL
|
|
><LI
|
|
><P
|
|
> Increase the size listed in the file:</P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> # echo 262144 > /proc/sys/net/core/rmem_default
|
|
# echo 262144 > /proc/sys/net/core/rmem_max</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></LI
|
|
><LI
|
|
><P
|
|
> Restart NFS. For example, on Red Hat systems,</P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> # /etc/rc.d/init.d/nfs restart</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></LI
|
|
><LI
|
|
><P
|
|
> You might return the size limits to their normal size in case other
|
|
kernel systems depend on it:</P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> # echo 65536 > /proc/sys/net/core/rmem_default
|
|
# echo 65536 > /proc/sys/net/core/rmem_max</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></LI
|
|
></UL
|
|
><P
|
|
> This last step may be necessary because machines have been reported to
|
|
crash if these values are left changed for long periods of time.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="AUTONEGOTIATION">5.8. Turning Off Autonegotiation of NICs and Hubs</H2
|
|
><P
|
|
>If network cards auto-negotiate badly with hubs and switches, and ports run at
|
|
different speeds, or with different duplex configurations, performance will be
|
|
severely impacted due to excessive collisions, dropped packets, etc. If you
|
|
see excessive numbers of dropped packets in the
|
|
<B
|
|
CLASS="COMMAND"
|
|
>nfsstat</B
|
|
> output, or poor
|
|
network performance in general, try playing around with the network speed and
|
|
duplex settings. If possible, concentrate on establishing a 100BaseT full
|
|
duplex subnet; the virtual elimination of collisions in full duplex will
|
|
remove the most severe performance inhibitor for NFS over UDP. Be careful
|
|
when turning off autonegotiation on a card: The hub or switch that the card
|
|
is attached to will then resort to other mechanisms (such as parallel detection)
|
|
to determine the duplex settings, and some cards default to half duplex
|
|
because it is more likely to be supported by an old hub. The best solution,
|
|
if the driver supports it, is to force the card to negotiate 100BaseT
|
|
full duplex.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="SYNC-ASYNC">5.9. Synchronous vs. Asynchronous Behavior in NFS</H2
|
|
><P
|
|
>The default export behavior for both NFS Version 2 and Version 3 protocols,
|
|
used by <B
|
|
CLASS="COMMAND"
|
|
>exportfs</B
|
|
> in <SPAN
|
|
CLASS="APPLICATION"
|
|
>nfs-utils</SPAN
|
|
>
|
|
versions prior to Version 1.11 (the latter is in the CVS tree,
|
|
but not yet released in a package, as of January, 2002) is
|
|
"asynchronous". This default permits the server to reply to client requests as
|
|
soon as it has processed the request and handed it off to the local file
|
|
system, without waiting for the data to be written to stable storage. This is
|
|
indicated by the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>async</B
|
|
></TT
|
|
> option denoted in the server's export list. It yields
|
|
better performance at the cost of possible data corruption if the server
|
|
reboots while still holding unwritten data and/or metadata in its caches. This
|
|
possible data corruption is not detectable at the time of occurrence, since
|
|
the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>async</B
|
|
></TT
|
|
> option instructs the server to lie to the client, telling the
|
|
client that all data has indeed been written to the stable storage, regardless
|
|
of the protocol used.</P
|
|
><P
|
|
>In order to conform with "synchronous" behavior, used as the default for most
|
|
proprietary systems supporting NFS (Solaris, HP-UX, RS/6000, etc.), and now
|
|
used as the default in the latest version of <B
|
|
CLASS="COMMAND"
|
|
>exportfs</B
|
|
>, the Linux Server's
|
|
file system must be exported with the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>sync</B
|
|
></TT
|
|
> option. Note that specifying
|
|
synchronous exports will result in no option being seen in the server's export
|
|
list:</P
|
|
><P
|
|
></P
|
|
><UL
|
|
><LI
|
|
><P
|
|
> Export a couple file systems to everyone, using slightly different
|
|
options:</P
|
|
><P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> # /usr/sbin/exportfs -o rw,sync *:/usr/local
|
|
# /usr/sbin/exportfs -o rw *:/tmp</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></P
|
|
></LI
|
|
><LI
|
|
><P
|
|
> Now we can see what the exported file system parameters look like:</P
|
|
><P
|
|
><TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="90%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> # /usr/sbin/exportfs -v
|
|
/usr/local *(rw)
|
|
/tmp *(rw,async)</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></P
|
|
></LI
|
|
></UL
|
|
><P
|
|
>If your kernel is compiled with the <TT
|
|
CLASS="FILENAME"
|
|
>/proc</TT
|
|
> filesystem,
|
|
then the file <TT
|
|
CLASS="FILENAME"
|
|
>/proc/fs/nfs/exports</TT
|
|
> will also show the
|
|
full list of export options.</P
|
|
><P
|
|
>When synchronous behavior is specified, the server will not complete (that is,
|
|
reply to the client) an NFS version 2 protocol request until the local file
|
|
system has written all data/metadata to the disk. The server
|
|
<EM
|
|
>will</EM
|
|
> complete a
|
|
synchronous NFS version 3 request without this delay, and will return the
|
|
status of the data in order to inform the client as to what data should be
|
|
maintained in its caches, and what data is safe to discard. There are 3
|
|
possible status values, defined an enumerated type, <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>nfs3_stable_how</B
|
|
></TT
|
|
>, in
|
|
<TT
|
|
CLASS="FILENAME"
|
|
>include/linux/nfs.h</TT
|
|
>. The values, along with the subsequent actions taken due
|
|
to these results, are as follows:</P
|
|
><P
|
|
></P
|
|
><UL
|
|
><LI
|
|
><P
|
|
>NFS_UNSTABLE - Data/Metadata was not committed to stable storage on the
|
|
server, and must be cached on the client until a subsequent client commit
|
|
request assures that the server does send data to stable storage.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>NFS_DATA_SYNC - Metadata was not sent to stable storage, and must be cached
|
|
on the client. A subsequent commit is necessary, as is required above.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>NFS_FILE_SYNC - No data/metadata need be cached, and a subsequent commit
|
|
need not be sent for the range covered by this request.</P
|
|
></LI
|
|
></UL
|
|
><P
|
|
>In addition to the above definition of synchronous behavior, the client may
|
|
explicitly insist on total synchronous behavior, regardless of the protocol,
|
|
by opening all files with the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>O_SYNC</B
|
|
></TT
|
|
> option. In this case, all replies to
|
|
client requests will wait until the data has hit the server's disk, regardless
|
|
of the protocol used (meaning that, in NFS version 3, all requests will be
|
|
<TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>NFS_FILE_SYNC</B
|
|
></TT
|
|
> requests, and will require that the Server returns this status).
|
|
In that case, the performance of NFS Version 2 and NFS Version 3 will be
|
|
virtually identical.</P
|
|
><P
|
|
>If, however, the old default <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>async</B
|
|
></TT
|
|
>
|
|
behavior is used, the <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>O_SYNC</B
|
|
></TT
|
|
> option has
|
|
no effect at all in either version of NFS, since the server will reply to the
|
|
client without waiting for the write to complete. In that case the performance
|
|
differences between versions will also disappear.</P
|
|
><P
|
|
>Finally, note that, for NFS version 3 protocol requests, a subsequent commit
|
|
request from the NFS client at file close time, or at <B
|
|
CLASS="COMMAND"
|
|
>fsync()</B
|
|
> time, will force
|
|
the server to write any previously unwritten data/metadata to the disk, and
|
|
the server will not reply to the client until this has been completed, as long
|
|
as <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>sync</B
|
|
></TT
|
|
> behavior is followed. If <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>async</B
|
|
></TT
|
|
> is used, the commit is essentially
|
|
a no-op, since the server once again lies to the client, telling the client that
|
|
the data has been sent to stable storage. This again exposes the client and
|
|
server to data corruption, since cached data may be discarded on the client
|
|
due to its belief that the server now has the data maintained in stable
|
|
storage.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="NON-NFS-PERFORMANCE">5.10. Non-NFS-Related Means of Enhancing Server Performance</H2
|
|
><P
|
|
>In general, server performance and server disk access speed will have an
|
|
important effect on NFS performance.
|
|
Offering general guidelines for setting up a well-functioning file server is
|
|
outside the scope of this document, but a few hints may be worth mentioning:</P
|
|
><P
|
|
></P
|
|
><UL
|
|
><LI
|
|
><P
|
|
>If you have access to RAID arrays, use RAID 1/0 for both write speed and
|
|
redundancy; RAID 5 gives you good read speeds but lousy write speeds.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>A journalling filesystem will drastically reduce your reboot time in the
|
|
event of a system crash. Currently,
|
|
<A
|
|
HREF="ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/"
|
|
TARGET="_top"
|
|
>ext3</A
|
|
> will work correctly with NFS
|
|
version 3. In addition, Reiserfs version 3.6 will work with NFS version 3 on
|
|
2.4.7 or later kernels (patches are available for previous kernels). Earlier versions
|
|
of Reiserfs did not include room for generation numbers in the inode, exposing
|
|
the possibility of undetected data corruption during a server reboot.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>Additionally, journalled file systems can be configured to maximize
|
|
performance by taking advantage of the fact that journal updates are all that
|
|
is necessary for data protection. One example is using ext3 with <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>data=journal</B
|
|
></TT
|
|
>
|
|
so that all updates go first to the journal, and later to the main file
|
|
system. Once the journal has been updated, the NFS server can safely issue the
|
|
reply to the clients, and the main file system update can occur at the
|
|
server's leisure.</P
|
|
><P
|
|
>The journal in a journalling file system may also reside on a separate device
|
|
such as a flash memory card so that journal updates normally require no seeking. With only rotational
|
|
delay imposing a cost, this gives reasonably good synchronous IO performance.
|
|
Note that ext3 currently supports journal relocation, and ReiserFS will
|
|
(officially) support it soon. The Reiserfs tool package found at <A
|
|
HREF="ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.x.0k.tar.gz"
|
|
TARGET="_top"
|
|
>ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.x.0k.tar.gz </A
|
|
>
|
|
contains
|
|
the <B
|
|
CLASS="COMMAND"
|
|
>reiserfstune</B
|
|
> tool, which will allow journal relocation. It does, however,
|
|
require a kernel patch which has not yet been officially released as of
|
|
January, 2002.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>Using an automounter (such as <SPAN
|
|
CLASS="APPLICATION"
|
|
>autofs</SPAN
|
|
> or <SPAN
|
|
CLASS="APPLICATION"
|
|
>amd</SPAN
|
|
>) may prevent hangs if you
|
|
cross-mount files on your machines (whether on purpose or by oversight) and
|
|
one of those machines goes down. See the <A
|
|
HREF="http://www.linuxdoc.org/HOWTO/mini/Automount.html"
|
|
TARGET="_top"
|
|
>Automount Mini-HOWTO</A
|
|
> for details.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>Some manufacturers (Network Appliance, Hewlett Packard, and others) provide NFS
|
|
accelerators in the form of Non-Volatile RAM. NVRAM will boost access speed to
|
|
stable storage up to the equivalent of <TT
|
|
CLASS="USERINPUT"
|
|
><B
|
|
>async</B
|
|
></TT
|
|
> access.</P
|
|
></LI
|
|
></UL
|
|
></DIV
|
|
></DIV
|
|
><DIV
|
|
CLASS="NAVFOOTER"
|
|
><HR
|
|
ALIGN="LEFT"
|
|
WIDTH="100%"><TABLE
|
|
SUMMARY="Footer navigation table"
|
|
WIDTH="100%"
|
|
BORDER="0"
|
|
CELLPADDING="0"
|
|
CELLSPACING="0"
|
|
><TR
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="left"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="client.html"
|
|
ACCESSKEY="P"
|
|
>Prev</A
|
|
></TD
|
|
><TD
|
|
WIDTH="34%"
|
|
ALIGN="center"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="index.html"
|
|
ACCESSKEY="H"
|
|
>Home</A
|
|
></TD
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="right"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="security.html"
|
|
ACCESSKEY="N"
|
|
>Next</A
|
|
></TD
|
|
></TR
|
|
><TR
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="left"
|
|
VALIGN="top"
|
|
>Setting up an NFS Client</TD
|
|
><TD
|
|
WIDTH="34%"
|
|
ALIGN="center"
|
|
VALIGN="top"
|
|
> </TD
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="right"
|
|
VALIGN="top"
|
|
>Security and NFS</TD
|
|
></TR
|
|
></TABLE
|
|
></DIV
|
|
></BODY
|
|
></HTML
|
|
> |