<sect1 id="performance">
<title>Optimizing NFS Performance</title>
<para>
Careful analysis of your environment, both from the client and from the server
point of view, is the first step necessary for optimal NFS performance. The
first sections will address issues that are generally important to the client.
Later (<xref linkend="frag-overflow"> and beyond), server-side issues
will be discussed. In both cases, these issues will not be limited exclusively
to one side or the other, but it is useful to separate the two in order to get
a clearer picture of cause and effect.
</para>
<para>
Aside from the general network configuration - appropriate network capacity,
faster NICs, full duplex settings in order to reduce collisions, agreement in
network speed among the switches and hubs, etc. - one of the most important
client optimization settings is the NFS data transfer buffer size, specified
by the <command>mount</command> command options <userinput>rsize</userinput>
and <userinput>wsize</userinput>.
</para>

<sect2 id="blocksizes">
<title>Setting Block Size to Optimize Transfer Speeds</title>
<para>
The <command>mount</command> command options <userinput>rsize</userinput>
and <userinput>wsize</userinput> specify the size of the chunks of
data that the client and server pass back and forth to each other.
If no <userinput>rsize</userinput> and <userinput>wsize</userinput> options
are specified, the default varies by which version of NFS we are using.
The most common default is 4K (4096 bytes), although for TCP-based
mounts in 2.2 kernels, and for all mounts beginning with 2.4 kernels, the
server specifies the default block size.
</para>
<para>
The theoretical limit for the NFS V2 protocol is 8K. For the V3 protocol, the
limit is specific to the server. On the Linux server, the maximum block size
is defined by the value of the kernel constant
<userinput>NFSSVC_MAXBLKSIZE</userinput>, found in the
Linux kernel source file <filename>./include/linux/nfsd/const.h</filename>.
The current maximum block size for the kernel, as of 2.4.17, is 8K (8192 bytes),
but the patch set implementing NFS over TCP/IP transport in the 2.4
series, as of this writing, uses a value of 32K (defined in the
patch as 32*1024) for the maximum block size.
</para>
<para>
All 2.4 clients currently support up to 32K block transfer sizes, allowing the
standard 32K block transfers across NFS mounts from other servers, such as
Solaris, without client modification.
</para>
<para>
The defaults may be too big or too small, depending on the specific
combination of hardware and kernels. On the one hand, some combinations of
Linux kernels and network cards (largely on older machines) cannot handle
blocks that large. On the other hand, if they can handle larger blocks, a
bigger size might be faster.
</para>
<para>
You will want to experiment and find an <userinput>rsize</userinput>
and <userinput>wsize</userinput> that work and are as
fast as possible. You can test the speed of your options with some simple
commands, if your network environment is not heavily used. Note that your
results may vary widely unless you resort to using more complex benchmarks,
such as <application>Bonnie</application>, <application>Bonnie++</application>,
or <application>IOzone</application>.
</para>
<para>
The first of these commands transfers 16384 blocks of 16k each from the
special file <filename>/dev/zero</filename> (which if
you read it just spits out zeros <emphasis>really</emphasis>
fast) to the mounted partition. We will time it to see how long it takes. So,
from the client machine, type:
</para>
<programlisting>
# time dd if=/dev/zero of=/mnt/home/testfile bs=16k count=16384
</programlisting>
<para>
This creates a 256MB file of zeroed bytes. In general, you should create a
file that's at least twice as large as the system RAM on the server, but make
sure you have enough disk space! Then read back the file into the great black
hole on the client machine (<filename>/dev/null</filename>) by
typing the following:
</para>
<programlisting>
# time dd if=/mnt/home/testfile of=/dev/null bs=16k
</programlisting>
<para>
Repeat this a few times and average how long it takes. Be sure to unmount and
remount the filesystem each time (both on the client and, if you are zealous,
locally on the server as well), which should clear out any caches.
</para>
<para>
Then unmount, and mount again with a larger and smaller block size. They
should be multiples of 1024, and not larger than the maximum block size
allowed by your system. Note that NFS Version 2 is limited to a maximum of 8K,
regardless of the maximum block size defined by
<userinput>NFSSVC_MAXBLKSIZE</userinput>; Version 3
will support up to 64K, if permitted. The block size should be a power of two
since most of the parameters that would constrain it (such as file system
block sizes and network packet size) are also powers of two. However, some
users have reported better success with block sizes that are not powers of
two but are still multiples of the file system block size and the network
packet size.
</para>
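<para>
For example, to retry the timing tests with a 32K block size, you might
unmount and remount with explicit <userinput>rsize</userinput> and
<userinput>wsize</userinput> options. This is only a sketch: the server name
<computeroutput>foo</computeroutput>, the export path, and the mount point are
placeholders, and your client and server must both support the requested size:
</para>
<programlisting>
# umount /mnt/home
# mount -o rw,hard,intr,rsize=32768,wsize=32768 foo:/home /mnt/home
</programlisting>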
<para>
Directly after mounting with a larger size, cd into the mounted
file system and do things like <command>ls</command>, explore
the filesystem a bit to make sure everything is as it
should be. If the <userinput>rsize</userinput>/<userinput>wsize</userinput>
is too large, the symptoms are very odd and not 100%
obvious. Typical symptoms are incomplete file lists when doing
<command>ls</command>, with no
error messages, or reading files failing mysteriously with no error messages.
After establishing that the given <userinput>rsize</userinput>/<userinput>wsize</userinput>
works, you can do the speed tests
again. Different server platforms are likely to have different optimal sizes.
</para>
<para>
Remember to edit <filename>/etc/fstab</filename> to reflect the
<userinput>rsize</userinput>/<userinput>wsize</userinput> you found
to be the most desirable.
</para>
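<para>
A minimal <filename>/etc/fstab</filename> entry carrying the chosen values
might look like the following (the server name, export path, mount point, and
block size are only examples):
</para>
<programlisting>
foo:/home  /mnt/home  nfs  rw,hard,intr,rsize=8192,wsize=8192  0 0
</programlisting>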
<para>
If your results seem inconsistent, or doubtful, you may need to analyze your
network more extensively while varying the <userinput>rsize</userinput>
and <userinput>wsize</userinput> values. In that
case, here are several pointers to benchmarks that may prove useful:
</para>
<itemizedlist>
<listitem>
<para>
Bonnie <ulink url="http://www.textuality.com/bonnie/">http://www.textuality.com/bonnie/</ulink>
</para>
</listitem>
<listitem>
<para>
Bonnie++ <ulink url="http://www.coker.com.au/bonnie++/">http://www.coker.com.au/bonnie++/</ulink>
</para>
</listitem>
<listitem>
<para>
IOzone file system benchmark <ulink url="http://www.iozone.org/">http://www.iozone.org/</ulink>
</para>
</listitem>
<listitem>
<para>
The official NFS benchmark,
SPECsfs97 <ulink url="http://www.spec.org/osg/sfs97/">http://www.spec.org/osg/sfs97/</ulink>
</para>
</listitem>
</itemizedlist>
<para>
The easiest benchmark with the widest coverage, including an extensive spread
of file sizes and of IO types - reads and writes, rereads and rewrites, random
access, etc. - seems to be IOzone. A recommended invocation of IOzone (for
which you must have root privileges) includes unmounting and remounting the
directory under test, in order to clear out the caches between tests, and
including the file close time in the measurements. Assuming you've already
exported <filename>/tmp</filename> to everyone from the server
<computeroutput>foo</computeroutput>,
and that you've installed IOzone in the local directory, this should work:
</para>
<programlisting>
# echo "foo:/tmp /mnt/foo nfs rw,hard,intr,rsize=8192,wsize=8192 0 0" \
  >> /etc/fstab
# mkdir /mnt/foo
# mount /mnt/foo
# ./iozone -a -R -c -U /mnt/foo -f /mnt/foo/testfile > logfile
</programlisting>
<para>
The benchmark should take 2-3 hours at most, but of course you will need to
run it for each value of rsize and wsize that is of interest. The web site
gives full documentation of the parameters, but the specific options used
above are:
</para>
<itemizedlist>
<listitem>
<para>
<userinput>-a</userinput> Full automatic mode, which tests file sizes of 64K to 512M, using
record sizes of 4K to 16M
</para>
</listitem>
<listitem>
<para>
<userinput>-R</userinput> Generate report in Excel spreadsheet form (the "surface plot"
option for graphs is best)
</para>
</listitem>
<listitem>
<para>
<userinput>-c</userinput> Include the file close time in the tests, which will pick up the
NFS version 3 commit time
</para>
</listitem>
<listitem>
<para>
<userinput>-U</userinput> Use the given mount point to unmount and remount between tests;
this clears out caches
</para>
</listitem>
<listitem>
<para>
<userinput>-f</userinput> Specifies the test file; when using <userinput>-U</userinput> to
unmount and remount, the test file must be located in the mounted file system
</para>
</listitem>
</itemizedlist>
</sect2>

<sect2 id="packet-and-network">
<title>Packet Size and Network Drivers</title>
<para>
While many Linux network card drivers are excellent, some are quite shoddy,
including a few drivers for some fairly standard cards. It is worth
experimenting with your network card directly to find out how it can
best handle traffic.
</para>
<para>
Try <command>ping</command>ing back and forth
between the two machines with large packets using
the <userinput>-f</userinput> and <userinput>-s</userinput>
options with <command>ping</command> (see <emphasis>ping(8)</emphasis>
for more details) and see if a
lot of packets get dropped, or if they take a long time for a reply. If so,
you may have a problem with the performance of your network card.
</para>
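<para>
For example, a flood ping with a large packet size (run as root; the host name,
packet size, and count here are only illustrative) can quickly expose a lossy
link:
</para>
<programlisting>
# ping -c 1000 -f -s 4096 server
</programlisting>
<para>
A large number of lost packets, or round-trip times that vary wildly, points
to a driver, duplex, or cabling problem rather than to NFS itself.
</para>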
<para>
For a more extensive analysis of NFS behavior in particular, use the
<command>nfsstat</command> command to look at NFS transactions, client and server statistics, network
statistics, and so forth. The <userinput>-o net</userinput> option will show you the number of
dropped packets in relation to the total number of transactions. In UDP
transactions, the most important statistic is the number of retransmissions,
due to dropped packets, socket buffer overflows, general server congestion,
timeouts, etc. This will have a tremendously important effect on NFS
performance, and should be carefully monitored.
Note that <command>nfsstat</command> does not yet
implement the <userinput>-z</userinput> option, which would zero out all counters, so you must look
at the current <command>nfsstat</command> counter values prior to running the benchmarks.
</para>
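<para>
For example, you can record the network-level counters before and after a test
run and compare the two snapshots (the exact output fields vary between
versions of <application>nfs-utils</application>, so treat this only as a
sketch):
</para>
<programlisting>
# nfsstat -o net > nfsstat.before
  (run the benchmark)
# nfsstat -o net > nfsstat.after
</programlisting>
<para>
The difference between the two snapshots gives the retransmissions and dropped
packets attributable to the test itself, since the counters cannot be zeroed.
</para>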
<para>
To correct network problems, you may wish to reconfigure the packet size that
your network card uses. Very often there is a constraint somewhere else in the
network (such as a router) that causes a smaller maximum packet size between
two machines than what the network cards on the machines are actually capable
of. TCP should autodiscover the appropriate packet size for a network, but UDP
will simply stay at a default value. So determining the appropriate packet
size is especially important if you are using NFS over UDP.
</para>
<para>
You can test for the network packet size using the <command>tracepath</command> command: from
the client machine, just type <userinput>tracepath</userinput>
<emphasis>server</emphasis> <userinput>2049</userinput> and the path MTU should
be reported at the bottom. You can then set the MTU on your network card equal
to the path MTU, by using the <userinput>MTU</userinput>
option to <command>ifconfig</command>, and see if fewer packets
get dropped. See the <command>ifconfig</command> man page for details on how to reset the MTU.
</para>
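<para>
A short sketch of the procedure (the interface name
<computeroutput>eth0</computeroutput> and the discovered MTU of 1492 are only
examples; substitute the values reported on your own network):
</para>
<programlisting>
# tracepath server 2049
# ifconfig eth0 mtu 1492
</programlisting>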
<para>
In addition, <command>netstat -s</command> will give the statistics collected for traffic across
all supported protocols. You may also look at
<filename>/proc/net/snmp</filename> for information
about current network behavior; see the next section for more details.
</para>
</sect2>

<sect2 id="frag-overflow">
<title>Overflow of Fragmented Packets</title>
<para>
Using an <userinput>rsize</userinput> or <userinput>wsize</userinput>
larger than your network's MTU (often set to 1500, in
many networks) will cause IP packet fragmentation when using NFS over UDP. IP
packet fragmentation and reassembly require a significant amount of CPU
resource at both ends of a network connection. In addition, packet
fragmentation also exposes your network traffic to greater unreliability,
since a complete RPC request must be retransmitted if a UDP packet fragment is
dropped for any reason. Any increase in RPC retransmissions, along with the
possibility of increased timeouts, is the single worst impediment to
performance for NFS over UDP.
</para>
<para>
Packets may be dropped for many reasons. If your network topology is
complex, fragment routes may differ, and may not all arrive at the server for
reassembly. NFS server capacity may also be an issue, since the kernel has a
limit on how many fragments it can buffer before it starts throwing away
packets. With kernels that support the <filename>/proc</filename>
filesystem, you can monitor the
files <filename>/proc/sys/net/ipv4/ipfrag_high_thresh</filename> and
<filename>/proc/sys/net/ipv4/ipfrag_low_thresh</filename>. Once the number of unprocessed,
fragmented packets reaches the number specified by <filename>ipfrag_high_thresh</filename> (in
bytes), the kernel will simply start throwing away fragmented packets until
the number of incomplete packets reaches the number specified by
<filename>ipfrag_low_thresh</filename>.
</para>
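<para>
On a server that is dropping fragments under load, you can inspect and, if
necessary, raise these thresholds through <filename>/proc</filename>. The
values below (256K and 192K) are merely an illustration, not a recommendation:
</para>
<programlisting>
# cat /proc/sys/net/ipv4/ipfrag_high_thresh
# echo 262144 > /proc/sys/net/ipv4/ipfrag_high_thresh
# echo 196608 > /proc/sys/net/ipv4/ipfrag_low_thresh
</programlisting>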
<para>
Another counter to monitor is <userinput>IP: ReasmFails</userinput>
in the file <filename>/proc/net/snmp</filename>; this
is the number of fragment reassembly failures. If it goes up too quickly
during heavy file activity, you may have a problem.
</para>
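<para>
A quick way to watch this counter (the position of the field within the
<computeroutput>Ip:</computeroutput> lines depends on your kernel version):
</para>
<programlisting>
# grep '^Ip:' /proc/net/snmp
</programlisting>
<para>
The first <computeroutput>Ip:</computeroutput> line names the fields,
including <computeroutput>ReasmFails</computeroutput>, and the second line
gives their current values.
</para>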
</sect2>

<sect2 id="nfs-tcp">
<title>NFS over TCP</title>
<para>
A new feature, available for both 2.4 and 2.5 kernels but not yet
integrated into the mainstream kernel at the time of
this writing, is NFS over TCP. Using TCP
has a distinct advantage and a distinct disadvantage over UDP. The advantage
is that it works far better than UDP on lossy networks.
When using TCP, a single dropped packet can be retransmitted, without
the retransmission of the entire RPC request, resulting in better performance
on lossy networks. In addition, TCP will handle network speed differences
better than UDP, due to the underlying flow control at the network level.
</para>
<para>
The disadvantage of using TCP is that it is not a stateless protocol like
UDP. If your server crashes in the middle of a packet transmission,
the client will hang and any shares will need to be unmounted and remounted.
</para>
<para>
The overhead incurred by the TCP protocol will result in
somewhat slower performance than UDP under ideal network
conditions, but the cost is not severe, and is often not
noticeable without careful measurement. If you are using
gigabit ethernet from end to end, you might also investigate the
usage of jumbo frames, since the high speed network may
allow the larger frame sizes without encountering increased
collision rates, particularly if you have set the network
to full duplex.
</para>
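<para>
As a sketch, a TCP mount and a jumbo-frame MTU might be requested as follows.
This assumes that your patched kernel and <command>mount</command> accept the
<userinput>tcp</userinput> mount option, that
<computeroutput>eth0</computeroutput> is your gigabit interface, and that the
NIC, driver, and switch all support an MTU of 9000; none of these are givens:
</para>
<programlisting>
# mount -o tcp,rsize=32768,wsize=32768 server:/export /mnt/export
# ifconfig eth0 mtu 9000
</programlisting>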
</sect2>

<sect2 id="timeout">
<title>Timeout and Retransmission Values</title>
<para>
Two mount command options, <userinput>timeo</userinput>
and <userinput>retrans</userinput>, control the behavior of UDP
requests when encountering client timeouts due to dropped packets, network
congestion, and so forth. The <userinput>-o timeo</userinput>
option allows designation of the length
of time, in tenths of seconds, that the client will wait until it decides it
will not get a reply from the server, and must try to send the request again.
The default value is 7 tenths of a second. The
<userinput>-o retrans</userinput> option allows
designation of the number of timeouts allowed before the client gives up, and
displays the <computeroutput>Server not responding</computeroutput>
message. The default value is 3 attempts.
Once the client displays this message, it will continue to try to send
the request, but only once before displaying the error message if
another timeout occurs. When the client reestablishes contact, it
will fall back to using the correct <userinput>retrans</userinput>
value, and will display the <computeroutput>Server OK</computeroutput> message.
</para>
<para>
If you are already encountering excessive retransmissions (see the output of
the <command>nfsstat</command> command), or want to increase the block transfer size without
encountering timeouts and retransmissions, you may want to adjust these
values. The specific adjustment will depend upon your environment, and in most
cases, the current defaults are appropriate.
</para>
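<para>
For instance, to give a congested network more headroom, you might double the
timeout and allow a couple of extra retries. The values, server name, and
paths below are only examples, not recommendations:
</para>
<programlisting>
# mount -o rsize=8192,wsize=8192,timeo=14,retrans=5 server:/export /mnt/export
</programlisting>
<para>
The same options can of course be placed in the options field of the
corresponding <filename>/etc/fstab</filename> entry.
</para>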
</sect2>

<sect2 id="nfsd-instance">
<title>Number of Instances of the NFSD Server Daemon</title>
<para>
Most startup scripts, Linux and otherwise, start 8 instances of
<command>nfsd</command>. In the
early days of NFS, Sun decided on this number as a rule of thumb, and everyone
else copied. There are no good measures of how many instances are optimal, but
a more heavily-trafficked server may require more.
You should use at the very least one daemon per processor, but
four to eight per processor may be a better rule of thumb.
If you are using a 2.4 or
higher kernel and you want to see how heavily each
<command>nfsd</command> thread is being used,
you can look at the file <filename>/proc/net/rpc/nfsd</filename>.
The last ten numbers on the <userinput>th</userinput>
line in that file indicate the number of seconds that the thread usage was at
that percentage of the maximum allowable. If you have a large number in the
top three deciles, you may wish to increase the number
of <command>nfsd</command> instances. This
is done by passing the desired number of instances to <command>nfsd</command>
on the command line when it is started, and is specified in the NFS startup script
(<filename>/etc/rc.d/init.d/nfs</filename> on
Red Hat) as <userinput>RPCNFSDCOUNT</userinput>.
See the <emphasis>nfsd(8)</emphasis> man page for more information.
</para>
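<para>
A quick look at the thread utilization, and a sketch of raising the count on a
Red Hat style system (the value 16 is only an example; pick a number suited to
your own load):
</para>
<programlisting>
# grep th /proc/net/rpc/nfsd
  (edit /etc/rc.d/init.d/nfs and set RPCNFSDCOUNT=16)
# /etc/rc.d/init.d/nfs restart
</programlisting>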
</sect2>

<sect2 id="memlimits">
<title>Memory Limits on the Input Queue</title>
<para>
On 2.2 and 2.4 kernels, the socket input queue, where requests sit while they
are being processed, has a small default size limit (<filename>rmem_default</filename>)
of 64k. This queue is important for clients with heavy read loads, and servers
with heavy write loads. As an example, if you are running 8 instances of nfsd
on the server, each will only have 8k to store write requests while it
processes them. In addition, the socket output queue - important for clients
with heavy write loads and servers with heavy read loads - also has a small
default size (<filename>wmem_default</filename>).
</para>
<para>
Several published runs of the NFS benchmark
<ulink url="http://www.spec.org/osg/sfs97/">SPECsfs</ulink>
specify usage of a much higher value for both
the read and write value sets, <filename>[rw]mem_default</filename> and
<filename>[rw]mem_max</filename>. You might
consider increasing these values to at least 256k. The read and write limits
are set in the proc file system using (for example) the files
<filename>/proc/sys/net/core/rmem_default</filename> and
<filename>/proc/sys/net/core/rmem_max</filename>. The
<filename>rmem_default</filename> value can be increased in three steps; the following method is a
bit of a hack but should work and should not cause any problems:
</para>
<itemizedlist>
<listitem>
<para>
Increase the size listed in the file:
</para>
<programlisting>
# echo 262144 > /proc/sys/net/core/rmem_default
# echo 262144 > /proc/sys/net/core/rmem_max
</programlisting>
</listitem>
<listitem>
<para>
Restart NFS. For example, on Red Hat systems,
</para>
<programlisting>
# /etc/rc.d/init.d/nfs restart
</programlisting>
</listitem>
<listitem>
<para>
You might return the size limits to their normal size in case other
kernel systems depend on it:
</para>
<programlisting>
# echo 65536 > /proc/sys/net/core/rmem_default
# echo 65536 > /proc/sys/net/core/rmem_max
</programlisting>
</listitem>
</itemizedlist>
<para>
This last step may be necessary because machines have been reported to
crash if these values are left changed for long periods of time.
</para>
</sect2>

<sect2 id="autonegotiation">
<title>Turning Off Autonegotiation of NICs and Hubs</title>
<para>
If network cards auto-negotiate badly with hubs and switches, leaving ports
running at different speeds or with mismatched duplex configurations,
performance will be
severely impacted due to excessive collisions, dropped packets, etc. If you
see excessive numbers of dropped packets in the
<command>nfsstat</command> output, or poor
network performance in general, try playing around with the network speed and
duplex settings. If possible, concentrate on establishing a 100BaseT full
duplex subnet; the virtual elimination of collisions in full duplex will
remove the most severe performance inhibitor for NFS over UDP. Be careful
when turning off autonegotiation on a card: the hub or switch that the card
is attached to will then resort to other mechanisms (such as parallel detection)
to determine the duplex settings, and some cards default to half duplex
because it is more likely to be supported by an old hub. The best solution,
if the driver supports it, is to force the card to negotiate 100BaseT
full duplex.
</para>
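<para>
How you force the setting depends on the driver. As a sketch, on a typical
system with <command>mii-tool</command> installed, and assuming the card in
question is <computeroutput>eth0</computeroutput>:
</para>
<programlisting>
# mii-tool -F 100baseTx-FD eth0
</programlisting>
<para>
Many drivers also accept module options, or can be configured with
<command>ethtool</command>; check the documentation for your particular card.
</para>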
</sect2>

<sect2 id="sync-async">
<title>Synchronous vs. Asynchronous Behavior in NFS</title>
<para>
The default export behavior for both NFS Version 2 and Version 3 protocols,
used by <command>exportfs</command> in <application>nfs-utils</application>
versions prior to Version 1.11 (the latter is in the CVS tree,
but not yet released in a package, as of January, 2002) is
"asynchronous". This default permits the server to reply to client requests as
soon as it has processed the request and handed it off to the local file
system, without waiting for the data to be written to stable storage. This is
indicated by the <userinput>async</userinput> option denoted in the server's export list. It yields
better performance at the cost of possible data corruption if the server
reboots while still holding unwritten data and/or metadata in its caches. This
possible data corruption is not detectable at the time of occurrence, since
the <userinput>async</userinput> option instructs the server to lie to the client, telling the
client that all data has indeed been written to the stable storage, regardless
of the protocol used.
</para>
<para>
In order to conform with "synchronous" behavior, used as the default for most
proprietary systems supporting NFS (Solaris, HP-UX, RS/6000, etc.), and now
used as the default in the latest version of <command>exportfs</command>, the Linux server's
file system must be exported with the <userinput>sync</userinput> option. Note that specifying
synchronous exports will result in no option being seen in the server's export
list:
</para>
<itemizedlist>
<listitem>
<para>
Export a couple of file systems to everyone, using slightly different
options:
</para>
<para>
<programlisting>
# /usr/sbin/exportfs -o rw,sync *:/usr/local
# /usr/sbin/exportfs -o rw *:/tmp
</programlisting>
</para>
</listitem>
<listitem>
<para>
Now we can see what the exported file system parameters look like:
</para>
<para>
<programlisting>
# /usr/sbin/exportfs -v
/usr/local *(rw)
/tmp *(rw,async)
</programlisting>
</para>
</listitem>
</itemizedlist>
<para>
If your kernel is compiled with the <filename>/proc</filename> filesystem,
then the file <filename>/proc/fs/nfs/exports</filename> will also show the
full list of export options.
</para>
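<para>
If you want the synchronous behavior to survive reboots, put the option in
<filename>/etc/exports</filename> rather than relying on a one-off
<command>exportfs</command> invocation. A minimal example entry (the export
path and the open wildcard are purely illustrative):
</para>
<programlisting>
/usr/local  *(rw,sync)
</programlisting>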
<para>
When synchronous behavior is specified, the server will not complete (that is,
reply to the client) an NFS version 2 protocol request until the local file
system has written all data/metadata to the disk. The server
<emphasis>will</emphasis> complete a
synchronous NFS version 3 request without this delay, and will return the
status of the data in order to inform the client as to what data should be
maintained in its caches, and what data is safe to discard. There are three
possible status values, defined in an enumerated type, <userinput>nfs3_stable_how</userinput>, in
<filename>include/linux/nfs.h</filename>. The values, along with the subsequent actions taken due
to these results, are as follows:
</para>
<itemizedlist>
<listitem>
<para>
NFS_UNSTABLE - Data/Metadata was not committed to stable storage on the
server, and must be cached on the client until a subsequent client commit
request ensures that the server has written the data to stable storage.
</para>
</listitem>
<listitem>
<para>
NFS_DATA_SYNC - Metadata was not sent to stable storage, and must be cached
on the client. A subsequent commit is necessary, as described above.
</para>
</listitem>
<listitem>
<para>
NFS_FILE_SYNC - No data/metadata need be cached, and a subsequent commit
need not be sent for the range covered by this request.
</para>
</listitem>
</itemizedlist>
<para>
In addition to the above definition of synchronous behavior, the client may
explicitly insist on total synchronous behavior, regardless of the protocol,
by opening all files with the <userinput>O_SYNC</userinput> option. In this case, all replies to
client requests will wait until the data has hit the server's disk, regardless
of the protocol used (meaning that, in NFS version 3, all requests will be
<userinput>NFS_FILE_SYNC</userinput> requests, and will require that the server return this status).
In that case, the performance of NFS Version 2 and NFS Version 3 will be
virtually identical.
</para>
<para>
If, however, the old default <userinput>async</userinput>
behavior is used, the <userinput>O_SYNC</userinput> option has
no effect at all in either version of NFS, since the server will reply to the
client without waiting for the write to complete. In that case the performance
differences between versions will also disappear.
</para>
<para>
Finally, note that, for NFS version 3 protocol requests, a subsequent commit
request from the NFS client at file close time, or at <command>fsync()</command> time, will force
the server to write any previously unwritten data/metadata to the disk, and
the server will not reply to the client until this has been completed, as long
as <userinput>sync</userinput> behavior is followed. If <userinput>async</userinput> is used, the commit is essentially
a no-op, since the server once again lies to the client, telling the client that
the data has been sent to stable storage. This again exposes the client and
server to data corruption, since cached data may be discarded on the client
due to its belief that the server now has the data maintained in stable
storage.
</para>
</sect2>

<sect2 id="non-nfs-performance">
<title>Non-NFS-Related Means of Enhancing Server Performance</title>
<para>
In general, server performance and server disk access speed will have an
important effect on NFS performance.
Offering general guidelines for setting up a well-functioning file server is
outside the scope of this document, but a few hints may be worth mentioning:
</para>
<itemizedlist>
<listitem>
<para>
If you have access to RAID arrays, use RAID 1/0 for both write speed and
redundancy; RAID 5 gives you good read speeds but lousy write speeds.
</para>
</listitem>
<listitem>
<para>
A journalling filesystem will drastically reduce your reboot time in the
event of a system crash. Currently,
<ulink url="ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/">ext3</ulink>
will work correctly with NFS
version 3. In addition, Reiserfs version 3.6 will work with NFS version 3 on
2.4.7 or later kernels (patches are available for previous kernels). Earlier versions
of Reiserfs did not include room for generation numbers in the inode, exposing
the possibility of undetected data corruption during a server reboot.
</para>
</listitem>
<listitem>
<para>
Additionally, journalled file systems can be configured to maximize
performance by taking advantage of the fact that journal updates are all that
is necessary for data protection. One example is using ext3 with <userinput>data=journal</userinput>
so that all updates go first to the journal, and later to the main file
system (see the sketch after this list). Once the journal has been updated, the NFS server can safely issue the
reply to the clients, and the main file system update can occur at the
server's leisure.
</para>
<para>
The journal in a journalling file system may also reside on a separate device
such as a flash memory card so that journal updates normally require no seeking. With only rotational
delay imposing a cost, this gives reasonably good synchronous IO performance.
Note that ext3 currently supports journal relocation, and ReiserFS will
(officially) support it soon. The Reiserfs tool package found at <ulink
url="ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.x.0k.tar.gz">
ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.x.0k.tar.gz</ulink>
contains
the <command>reiserfstune</command> tool, which will allow journal relocation. It does, however,
require a kernel patch which has not yet been officially released as of
January, 2002.</para>
</listitem>
<listitem>
<para>
Using an automounter (such as <application>autofs</application> or <application>amd</application>) may prevent hangs if you
cross-mount files on your machines (whether on purpose or by oversight) and
one of those machines goes down. See the <ulink url="http://www.linuxdoc.org/HOWTO/mini/Automount.html">Automount Mini-HOWTO</ulink> for details.
</para>
</listitem>
<listitem>
<para>
Some manufacturers (Network Appliance, Hewlett Packard, and others) provide NFS
accelerators in the form of Non-Volatile RAM. NVRAM will boost access speed to
stable storage up to the equivalent of <userinput>async</userinput> access.
</para>
</listitem>
</itemizedlist>
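<para>
As a sketch of the ext3 <userinput>data=journal</userinput> configuration
mentioned in the list above (the device name and mount point are placeholders;
the option can also be made permanent in <filename>/etc/fstab</filename>):
</para>
<programlisting>
# mount -o data=journal /dev/sda6 /export/home
</programlisting>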
</sect2>

</sect1>