<chapter id="disk">
<title>Disk Tuning</title>
<section id="diskintro">
<title>Introduction to disk tuning</title>
<para>
Your drive system is one of the places where bottlenecks can
occur. All of your database information, boot code, swap
space, and user programs live on the hard drives. Hard drives
are speeding up, now topping 15,000 RPM and bus rates
of over 160MB per second. Even at these speeds, drives are still
much slower than RAM or your CPU, with requests to drives waiting
for the drive to spin to the right location, read and/or write
the data that is required, and send the answer back to the CPU.
</para>
<para>
To top this off, hard drives are one of the moving parts in your
machine, and moving parts are more likely to fail over time than
a solid state mechanism, like your memory or CPU. One of the
reasons why power supplies have such low MTBFs is that the fan
that keeps the power supply cool has a low MTBF. Once the
fan dies, it's only a matter of time before the power supply
itself overheats and dies. Some systems overcome this
by installing multiple power supplies, so if one dies, the
others can take over. Fortunately for hard drives, this
failover is available in the form of RAID (Redundant Array
of Inexpensive Disks).
</para>
<para>
Then there are questions about the cost of differing systems
versus their performance versus other options. We will take
a look at these options throughout this chapter, starting
with an overview of existing hard drive technology for Linux.
</para>
</section>
<section id="diskoverview">
<title>Overview of Disk Technologies</title>
<para>
<indexterm>
<primary>IDE</primary>
</indexterm>
<indexterm>
<primary>SCSI</primary>
</indexterm>
There are, as you may know, two major hard disk technologies
that work under Linux: Integrated Drive Electronics (IDE) and Small
Computer System Interface (SCSI). At the lowest level, IDE and
SCSI drives share enough technology that the two are
essentially the same save for the controller PCB and physical interface
on the disk itself. That PCB and physical interface can command
a premium in price on a drive.
</para>
<section id="diskoverviewscsi">
<title>SCSI drives</title>
<para>
The history of SCSI drives starts not with large servers
running multibillion dollar companies, but as its name
implies, with small systems. SCSI had one of its first
implementations on the Apple Macintosh Plus systems.
SCSI was made to be a general purpose interface, supporting
not only hard drives, but scanners, CD-ROMs, serial ports,
and other devices off of one common bus. Cables were
inexpensive, the protocol was relatively open, and it was
fast for its time, reaching a maximum of 4MB per second.
</para>
<para>
SCSI was picked up by server vendors, such as
Sun and IBM, since it created a common way for servers
to share devices. Its advantages in design made it very
well suited to have multiple devices on the same bus
with little contention between devices. Unlike IDE, when
a SCSI controller makes a request, the drive drops its
hold on the line, fetches the information, and then
picks up the line and sends the response. This allows the
CPU to continue working on other tasks, so it is not I/O
bound waiting for the drive. Other devices on the SCSI bus
can communicate during this time. Another advantage to SCSI
is that you can have multiple SCSI controllers on the
same bus. This allows two or more machines to access a single
SCSI bus and drives that are on that bus.
</para>
<para>
The increased performance of SCSI, along with the lower volumes
in which SCSI drives are manufactured compared to IDE drives, has
driven up the price of SCSI drives. In many cases, a SCSI drive
can cost almost double its IDE equivalent. But if you need the performance,
the cost is worth it.
</para>
</section>
<section id="diskoverviewide">
<title>IDE drives</title>
<para>
SCSI drives are actually stupid. They rely on a controller
chip to tell them what to do. In many cases, a SCSI drive
used by one SCSI controller cannot be moved to another
controller with any expectation of keeping the data intact, since
each controller has its own way of formatting data on
the drive. IDE drives contain most of the brains on the
drive itself. This makes it easier for system vendors
to add support into their systems, reducing cost. These
days, most systems will integrate serial ports, IDE, and
USB all in one chip. SCSI still requires a (large) dedicated
chip on the system to work.
</para>
<para>
The lower cost of parts and the higher volume of IDE drives have made
them a very low-cost choice for most single user systems while retaining
some of the higher performance of SCSI. But many of the
features of SCSI are not available in IDE: external boxes,
hot swapping,
and devices like scanners. In addition, when an IDE drive
is asked to do something, that IDE bus is locked while
the drive processes the request. IDE devices are usually limited
to two devices per bus, but can have multiple busses per system.
Most motherboards come with two IDE busses standard.
</para>
<para>
Burst performance of IDE drives gets close to that of SCSI,
but will really only be seen on single user systems such
as personal workstations.
</para>
<para>
In the past few years, IDE has had trouble with the existing
x86 BIOS and larger capacity IDE drives. Since the BIOS
has only a limited amount of space to store drive information,
supporting very large drives was almost impossible
without modifying the BIOS itself, which would cause headaches
for do-it-yourself system builders. Until around 2000, Linux boot
loaders also could not load a kernel stored beyond the first
1024 cylinders of a drive. Fortunately,
more recent BIOSes are smart enough to work around these
size limitations, and Linux has found a way around its
booting issues.
</para>
<para>
The future of IDE includes more of the existing ATA/66 and ATA/100
standards, which provide 66MB and 100MB per second burst speeds,
respectively. Also upcoming is Serial ATA, which uses far fewer
conductors in its cable and boosts transfer rates to over 1Gbps.
</para>
</section>
<section id="diskwhattochoose">
<title>What Interface Should You Use?</title>
<para>
Based on the way that IDE and SCSI differ, you should keep
performance in mind when choosing a hard drive bus
for your systems. A machine that is part of a cluster
that relies on network storage can use the less expensive
IDE drives, since the drive will not be a bottleneck. A
machine that will be used as the network storage device
should have the faster SCSI interface.
</para>
<para>
There are plenty of grey areas between these two extremes.
But you should ask yourself a few questions while deciding:
</para>
<itemizedlist>
<listitem>
<para>
Will there be more than one drive in the system?
</para>
</listitem>
<listitem>
<para>
Will the hard drive interface be a bottleneck
on the system?
</para>
</listitem>
<listitem>
<para>
Can the system the drives will be installed in
support the bus I intend to use?
</para>
</listitem>
</itemizedlist>
<para>
Considering the answers to these questions will go
a long way toward deciding which bus you want to use.
</para>
</section>
</section>
<section id="disktuning">
<title>Tuning your drives</title>
<para>
Now that we understand the major types of drives available
for your system, let's take a look at the ways you can
tune your particular setup.
</para>
<section id="disktuningide">
<title>Tuning the IDE bus</title>
<para>
There are a few methods of tuning the IDE bus, both through
the BIOS setup, and within Linux. Some of these options
are not supported by all controllers and all drives, so you
will want to compare specifications of each before testing.
As always, be aware that some tuning commands can cause
data loss, so be sure to back up your data before experimenting.
</para>
<section id="disktuningidebios">
<title>Tuning IDE from within the BIOS</title>
<para>
Typically, most users will just leave the BIOS to automatically
configure itself based on a negotiation between the hard drive
and the controller in the system. There are, however, a few
options within the BIOS that users can examine to get either
improved performance, or debug tricky issues.
</para>
<para>
There are three methods for IDE to address a drive: Normal,
CHS, and LBA. Normal mode works only with drives smaller
than 500MB. Since no drives since 1996 have been made this
small, we can forget this mode. CHS stands for Cylinder,
Head, and Sector, while LBA stands for Logical Block Address.
CHS and LBA are fairly similar in performance, but be aware
that taking a drive that is configured in CHS and moving it
to a system that uses LBA will destroy all the data on that
drive.
</para>
<para>
<indexterm>
<primary>DMA</primary>
</indexterm>
Now that the drive can be accessed, there are two methods
for getting data from the drive to the CPU:
PIO and DMA. PIO requires the CPU to shuttle data from
the drive to memory so the CPU can use it later, but
is the most compatible way for Linux to talk to a drive
due to the large number of IDE controller chips.
DMA mode seems to be much faster, since the CPU is not
involved in moving the data between the drive and memory,
freeing the CPU to do other things. Until the later
Linux 2.2 kernels, DMA mode was not selected by default
for access to drives due to compatibility issues with interface
chips. You can now select a specific IDE controller and
select DMA access to make sure your combination is
correct.
</para>
</section>
<section id="disktuningideos">
<title>Tuning IDE from within Linux</title>
<para>
As mentioned in <xref linkend="disktuningidebios">,
Linux has drivers for some specific IDE controllers,
plus a generic IDE interface. Once the OS boots, Linux
will ignore the state of the BIOS when it accesses hardware,
choosing instead to use its own drivers and options. By default,
the Linux kernel will have DMA access available, but not
activated. This can be changed if you choose to compile your
own kernel (see <xref linkend="kernelcompile"> for more
information).
</para>
<para>
<indexterm>
<primary>hdparm</primary>
</indexterm>
Once Linux is started, all IDE tuning occurs using
<command>hdparm</command>. Using <command>hdparm</command>
followed by a drive (<command>hdparm /dev/hda</command>) will
return the current settings for that drive.
</para>
<cmdsynopsis>
<command>hdparm</command>
<arg>-c <replaceable>iovalue</replaceable></arg>
<arg>-d</arg>
<arg>-m <replaceable>multimode</replaceable></arg>
<arg choice="req"><replaceable>device</replaceable></arg>
</cmdsynopsis>
<screen>
# hdparm /dev/hda
/dev/hda:
multcount = 0 (off)
I/O support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 1559/240/63, sectors = 23572080, start = 0
#
</screen>
<para>
As you can see, there are a few options for tuning that
are available for this drive. The biggest boost for IDE drive
performance is to use DMA access, and this is already turned
on. If DMA access is not turned on by default in the kernel,
you can use -d1 to turn DMA access on, and -d0 to turn it off.
Note that DMA access will not necessarily increase performance
of getting data to and from the drive, but will free up the CPU
during data transfers.
</para>
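<para>
For example, to turn DMA access on for the drive shown above (the
device name <filename>/dev/hda</filename> is simply the first IDE
drive; adjust it for your own system), the output will look something
like:
</para>
<screen>
# hdparm -d1 /dev/hda
/dev/hda:
 setting using_dma to 1 (on)
 using_dma    =  1 (on)
#
</screen>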
<para>
Another item to take a look at is the I/O support, which
on our sample drive is set to 16-bit access instead of 32.
The -c option to <command>hdparm</command> will check the
I/O support for the specified device; there are three
values that can be given to -c to set the I/O support. The
two most stable for systems are 0, used to set 16-bit access,
and 3, used to set 32-bit synchronous access. Using an
<replaceable>iovalue</replaceable> of
1 sets 32-bit access but may not work on all chipsets, causing
a system hang if the chipset does not support regular 32-bit
access.
</para>
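<para>
As an example, and again assuming the same
<filename>/dev/hda</filename> device as above, 32-bit synchronous
access is enabled with -c3. The output should look similar to the
following:
</para>
<screen>
# hdparm -c3 /dev/hda
/dev/hda:
 setting 32-bit I/O support flag to 3
 I/O support  =  3 (32-bit w/sync)
#
</screen>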
<para>
Also test and check out multiple sector I/O with the -m option.
Many newer IDE drives support transferring multiple sectors
worth of data per I/O cycle, reducing the number of I/O cycles
required to transfer data. The -i option to hdparm will tell
you what the maximum <replaceable>multimode</replaceable>
is in the MaxMultSect
section. By setting <replaceable>multimode</replaceable>
to its highest value,
large transfers of data happen even more quickly.
</para>
<warning>
<para>
The <command>hdparm</command> manual page indicates that
with some drives, using multiple sector I/O can result
in massive filesystem corruption. As always, be sure
to test your systems thoroughly before using them
in a production environment.
</para>
</warning>
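<para>
If your drive and controller do handle multiple sector I/O safely, a
typical sequence is to read the MaxMultSect value with -i and then set
<replaceable>multimode</replaceable> with -m. The value 16 below is
only an example; use the maximum your own drive reports:
</para>
<screen>
# hdparm -i /dev/hda | grep MaxMultSect
 BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=off
# hdparm -m16 /dev/hda
/dev/hda:
 setting multcount to 16
 multcount    = 16 (on)
#
</screen>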
<para>
Unfortunately, the correct setting for DMA, multiple sector
I/O, and I/O support varies by drive and controller. You will
have to do your own testing to find the correct combination
for your system. The right combination should be able to:
</para>
<itemizedlist>
<listitem>
<para>
Have a high throughput of data from the drive to the system.
</para>
</listitem>
<listitem>
<para>
Maintain that high throughput under heavy CPU load.
</para>
</listitem>
<listitem>
<para>
Prevent the drive or the IDE driver from corrupting data.
</para>
</listitem>
</itemizedlist>
<para>
Fortunately, testing throughput is easy with <command>
hdparm</command>. The -t option tests the throughput
of data directly from the drive, while -T tests throughput
of data directly from the Linux drive buffer. To ensure the
tests run properly, they should be run when the
system is quiet and few other processes are running. Usually,
runlevel 1, also known as single user mode, provides
this environment. Tests should be run two or three times.
</para>
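<para>
A typical test run looks like the following (the numbers here are
purely illustrative and will vary widely with your drive, controller,
and settings):
</para>
<screen>
# hdparm -tT /dev/hda
/dev/hda:
 Timing buffer-cache reads:   128 MB in  1.01 seconds = 126.73 MB/sec
 Timing buffered disk reads:   64 MB in  2.20 seconds =  29.09 MB/sec
#
</screen>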
<para>
Once an optimum setting is found, you can have hdparm make
the setting at boot time by including the relevant commands
in <filename>/etc/rc.local</filename>.
</para>
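<para>
For example, if DMA access, 32-bit I/O, and a multcount of 16 turned
out to be the right combination for your drive, a line such as the
following (adjust the options and device name for your own system)
could be added to <filename>/etc/rc.local</filename>:
</para>
<screen>
/sbin/hdparm -d1 -c3 -m16 /dev/hda
</screen>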
</section>
</section>
<section id="disktuningscsi">
<title>Tuning SCSI disks</title>
<para>
</para>
<section id="disktuningscsioptions">
<title>SCSI Options</title>
<para>
In many cases, SCSI support is an add-in card to the
system, giving you many options in vendors and card
types. Some server motherboards include a SCSI chipset
and onboard support as well. In addition, there are
dedicated RAID cards that implement various RAID levels
in hardware, offloading much of the logic from the CPU.
</para>
<para>
SCSI has three major protocol varieties: SCSI-1, SCSI-2,
and SCSI-3. Fortunately, the protocols are backwards and
forwards compatible, meaning that a SCSI-3 controller card
can talk to a SCSI-1 drive, albeit at a slower speed. For
best performance, make sure that the protocols of the controller
card and devices match.
</para>
<para>
On the physical side, cable types range from a DB-25 pin connector
on older Apple Macintoshes to 80-pin SCA connectors that not
only include the SCSI pins, but power and ID information.
</para>
<para>
Here's a quick rundown on the various connector types available.
</para>
<para>
INSERT TABLE OF CONNECTORS
</para>
<section id="disktuningscsiraid">
<title>SCSI RAID</title>
<para>
<indexterm>
<primary>SCSI</primary>
<secondary>RAID</secondary>
</indexterm>
Depending on the RAID level you select, you can optimize
a drive array ranging from high performance to high availability.
In addition, many hardware RAID cards support standby drives,
so if one drive in the RAID fails, a previously-unused drive
immediately fills in and becomes part of the RAID.
</para>
<para>
RAID 0 is also known as striping, where each drive in the array
is presented together as one large drive. It offers very high
I/O performance, since data is striped across
all of the drives. In RAID 0, no data space is lost, but if one
of the drives dies, the entire array is lost. Drives
in the array can be of varying sizes.
</para>
<para>
RAID 1 is known as mirroring. Each drive mirrors its contents
to another drive in the array. Performance is no slower than
any single drive in the array, and if one drive in the array fails,
its mirror immediately takes over with no loss. RAID 1 has
the highest overhead of drive space, requiring two drives
for the storage of one. Each drive pair must be the
same size.
</para>
<para>
RAID 5 is one of the most popular RAID formats. In this
setup, data is spread amongst at least three drives along
with parity data. In the event of any one drive failing, the
remaining drives still have all the data from all drives and
can continue without loss of data. If a second drive were
to fail, the entire array would fail. Usable drive space comes
to N-1, where N is the number of drives in the array. So
in a 5 drive array, you would get the storage of 4 drives.
All drives in this array must be the same size.
Since parity data has to be created and written across multiple
drives, the write speed of RAID 5 is slower, but its reliability is
quite high. For those implementing high availability on
their own, this is the best balance of speed and reliability.
</para>
<para>
There are other RAID levels available, and in some cases
these are vendor-specific (such as RAID 51, which is
a mirrored RAID 5 array). For best file performance,
RAID 0 will be the fastest. Configuration of each of
these levels usually happens outside of Linux, in the card's
BIOS, and is specific to each hardware RAID card.
</para>
<para>
For performance reasons, it is not recommended that you
use software RAID. Instead of offloading the RAID functionality
to a hardware card and freeing the CPU of that work, software
RAID uses the main CPU to create and maintain the array. This
eats up valuable CPU cycles.
</para>
</section> <!-- disktuningscsiraid -->
<section id="disktuningscsihw">
<title>Tuning SCSI Drives</title>
<para>
Since SCSI drivers are written to negotiate the best connection
between the SCSI controller and its drives, there is very little
the OS needs to do to improve your performance. Your
best bet for performance is to run the latest kernel and drivers
and to use the correct driver for your card. In some cases, there
are both Linux and BSD versions of drivers for your card. Some
report that the BSD drivers are faster, so if you have both
drivers available, test each out to see which gives better
performance.
</para>
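<para>
One quick way to see which devices the kernel has found on each SCSI
host adapter is to look at <filename>/proc/scsi/scsi</filename>. The
vendor and model fields below are placeholders; your own hardware will
be listed:
</para>
<screen>
# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: (vendor)  Model: (drive model)   Rev: (rev)
  Type:   Direct-Access                    ANSI SCSI revision: 03
#
</screen>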
<para>
On the hardware side, make sure that you are using the best
match of SCSI controller and drives. Your best performance
will probably be seen with Fibre Channel controllers and
drives, since they have the highest throughput. If
you have a large number of drives, you can span them across
multiple controllers. Linux supports up to 8 or 16 cards
each with the ability to hold up to 16 devices per card,
not counting the card itself.
</para>
<para>
One way of tuning the SCSI bus is to make sure it is properly
terminated. Without proper termination, the SCSI bus may ratchet
its speed down, or fail altogether. Termination should occur at
both ends of a physical SCSI chain, but most SCSI chipsets include
internal termination for their end. Purchase the correct
termination for your cable and put it at the far end of the cable.
This will make sure there are no signal reflections anywhere in the
cable that could cause interference.
</para>
</section> <!-- disktuningscsihw -->
</section> <!-- disktuningscsioptions -->
</section> <!-- disktuningscsi -->
<section id="diskfilesystems">
<title>Tuning Filesystems</title>
<para>
Fast hard drive access is only half of getting fast read
and write access. The other half is getting the data from
the hard drive, through the kernel, and to the application.
This is the function of the filesystem that the data lives on.
</para>
<para>
The first Linux filesystems (xiafs, minixfs, extfs) were
all designed to give some level of performance to Linux
while giving features close to what you would expect
from a regular UNIX operating system. Over time, the
second extended filesystem (ext2fs) arrived and was the
standard filesystem for Linux for many years. Now, the native
filesystems for Linux include ext3, reiserfs,
GFS, Coda, XFS, JFS, and InterMezzo.
</para>
<para>
Features common to these filesystems include inodes,
directories, files, and tools
to create and modify the filesystem. Inodes are the basic
building blocks of the filesystem, creating pointers to
file data, entries in directories, and directories themselves.
The directory of inodes is kept in the superblock, and these
superblocks are duplicated many times through the filesystem
so if one superblock gets corrupted, another can be used to
recover the missing data.
</para>
<para>
In the event the OS shuts down without unmounting its filesystems,
a filesystem check must be run in order to make sure the
inodes point to the data they should be pointing at. A journalling
filesystem (ext3, reiserfs, JFS, XFS) ensures that all writes
to a filesystem are
finished before reporting success to the OS.
</para>
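<para>
If you are currently running ext2, one way to gain journalling without
reformatting is to add an ext3 journal to the existing filesystem with
<command>tune2fs</command>. This assumes a kernel with ext3 support
and a recent e2fsprogs; the partition and mount point below are only
examples:
</para>
<screen>
# umount /dev/hda3
# tune2fs -j /dev/hda3
# mount -t ext3 /dev/hda3 /home
</screen>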
<para>
Each type of filesystem has its advantages and disadvantages. A
filesystem like ext2 or ext3 is better tuned to large files, so access
for reads and writes happens very quickly. But if there is a large
number of small files in a directory, its performance starts to suffer.
A filesystem like reiserfs is better tuned to smaller files, but
increases overhead for larger files.
</para>
<para>
For applications that are writing to or reading from the hard drive, block sizes
allow Linux to move larger blocks of data to the hard drive in one
operation. For example, a block size of 64KB will try to write to the
hard drive in 64KB chunks. Depending on the hard drive and interface,
larger blocks result in better performance. If the block size is not
set properly, it can result in poorer performance. If the optimal block
size is 64KB but it is set to 32KB, it will take two operations to write
the block to the hard drive. If it is set to 96KB, the
OS will wait for a timeout period or for the rest of the block to fill
up before it writes the data to disk, increasing the latency of writing
data to the disk.
</para>
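<para>
The block size of the filesystem itself is a related, though separate,
setting that is chosen when the filesystem is created. As an
illustrative example (the device name is hypothetical), an ext2
filesystem can be built with 4KB blocks rather than 1KB blocks:
</para>
<screen>
# mke2fs -b 4096 /dev/hda3
</screen>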
<para>
Block sizes are usually reserved for operations where raw data is being
written to or read from the hard drive. But applications like
<application>dd</application> can use varying block sizes when
writing data to the drive, allowing you to test which block size works best.
</para>
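<para>
A rough way to experiment is to time <application>dd</application>
with different block sizes. The commands below each write 128MB of
zeroes to a scratch file; never point a test like this at a device or
file holding data you care about:
</para>
<screen>
# time dd if=/dev/zero of=/tmp/ddtest bs=32k count=4096
# time dd if=/dev/zero of=/tmp/ddtest bs=64k count=2048
</screen>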
</section> <!-- diskfilesystems -->
</section> <!-- disktuning -->
</chapter> <!-- disk -->