Disk Tuning
Introduction to disk tuning

Your drive system is one of the places where bottlenecks can occur. All of your database information, boot code, swap space, and user programs live on the hard drives. Hard drives are speeding up, now topping 15,000 RPM with bus rates of over 160MB per second. Even at these speeds, drives are still much slower than RAM or your CPU: a request must wait for the drive to spin to the right location, read and/or write the required data, and send the answer back to the CPU. On top of this, hard drives are among the few moving parts in your machine, and moving parts are more likely to fail over time than a solid state component like your memory or CPU. One of the reasons power supplies have such low MTBFs is that the fan keeping the power supply cool has a low MTBF; once the fan dies, it is only a matter of time before the power supply itself overheats and dies. Some systems overcome this by installing multiple power supplies, so if one dies, the others can take over. Fortunately for hard drives, this kind of failover is available in the form of RAID (Redundant Array of Inexpensive Disks). Then there are questions about the cost of differing systems versus their performance and other options. We will take a look at these options throughout this chapter, starting with an overview of existing hard drive technology for Linux.
Overview of Disk Technologies

There are, as you may know, two major hard disk technologies that work under Linux: Integrated Drive Electronics (IDE) and Small Computer System Interface (SCSI). At the lowest level, IDE and SCSI drives share enough technology that the two are essentially the same mechanism, save for the controller PCB and the physical interface on the disk itself. That PCB and physical interface can command a premium price on a drive.
SCSI drives

The history of SCSI drives starts not with large servers running multibillion dollar companies but, as its name implies, with small systems. SCSI had one of its first implementations on the Apple Macintosh Plus. SCSI was designed as a general purpose interface, supporting not only hard drives but scanners, CD-ROMs, serial ports, and other devices on one common bus. Cables were inexpensive, the protocol was relatively open, and it was fast for its time, reaching a maximum of 4MB per second. SCSI was picked up by server vendors such as Sun and IBM, since it created a common way for servers to share devices. Its design made it very well suited to having multiple devices on the same bus with little contention between them. Unlike IDE, when a SCSI controller makes a request, the drive releases the bus, fetches the information, and then reacquires the bus and sends the response. This lets the CPU continue working on other tasks rather than sitting I/O bound waiting for the drive, and other devices on the SCSI bus can communicate during this time. Another advantage of SCSI is that you can have multiple SCSI controllers on the same bus, which allows two or more machines to access a single SCSI bus and the drives on it. The increased performance of SCSI, and the low volume of SCSI drives manufactured compared to IDE drives, has driven up the price of SCSI drives. In many cases this can be almost double the price, but if you need the performance, the cost is worth it.
IDE drives

SCSI drives are actually stupid. They rely on a controller chip to tell them what to do. In many cases, a SCSI drive used by one SCSI controller cannot be moved to another controller and be expected to keep its data intact, since each controller has its own way of formatting data on the drive. IDE drives contain most of the brains on the drive itself. This makes it easier for system vendors to add support into their systems, reducing cost. These days, most systems integrate serial ports, IDE, and USB all in one chip, while SCSI still requires a (large) dedicated chip on the system. The lower cost of parts and higher volume of IDE have made it very inexpensive for most single user systems while retaining some of the higher performance of SCSI. But many of the features of SCSI are not available in IDE: external boxes, hot swapping, and devices like scanners. In addition, when an IDE drive is asked to do something, that IDE bus is locked while the drive processes the request. IDE is usually limited to two devices per bus, but a system can have multiple busses; most motherboards come with two IDE busses standard. Burst performance of IDE drives gets close to that of SCSI, but this will really only be seen on single user systems such as personal workstations. In the past few years, IDE has had trouble with the existing x86 BIOS and larger capacity IDE drives. Since the BIOS has only a limited amount of space to store drive information, supporting very large drives was almost impossible without modifying the BIOS itself, which would cause headaches for do-it-yourself system builders. Until 2000, Linux would not boot from partitions larger than 1GB. Fortunately, more recent BIOSes are smart enough to work around these size limitations, and Linux has found a way around its booting issues. The future of IDE includes more of the existing ATA/66 and ATA/100 standards, which provide 66MB and 100MB per second burst speeds, respectively. Also upcoming is Serial ATA, which uses far fewer conductors in the cable and boosts speed rates to over 1Gbps.
What Interface Should You Use?

Given the ways IDE and SCSI differ, you should keep performance in mind when choosing a hard drive bus for your systems. A machine that is part of a cluster and relies on network storage can use the less expensive IDE drives, since the drive will not be a bottleneck. A machine that will be used as the network storage device should have the faster SCSI interface. There are plenty of grey areas between these two extremes, but you should ask yourself a few questions while deciding: Will there be more than one drive in the system? Will the hard drive interface be a bottleneck on the system? Can the system the drives will be installed in support the bus I intend to use? Considering the answers to these questions will get you a long way toward deciding which bus you want to use.
Tuning your drives

Now that we understand the major types of drives available for your system, let's take a look at the ways you can tune your particular setup.
Tuning the IDE bus

There are a few methods of tuning the IDE bus, both through the BIOS setup and within Linux. Some of these options are not supported by all controllers and all drives, so you will want to compare specifications of each before testing. As always, be aware that some tuning commands can cause data loss, so be sure to back up your data before experimenting.
Tuning IDE from within the BIOS

Typically, most users will just leave the BIOS to configure itself automatically based on a negotiation between the hard drive and the controller in the system. There are, however, a few options within the BIOS that users can examine to get improved performance or to debug tricky issues. There are three methods for IDE to address a drive: Normal, CHS, and LBA. Normal mode works only with drives smaller than 500MB. Since no drives this small have been made since 1996, we can forget this mode. CHS stands for Cylinder, Head, and Sector, while LBA stands for Logical Block Address. CHS and LBA are fairly similar in performance, but be aware that taking a drive that is configured in CHS and moving it to a system that uses LBA will destroy all the data on that drive. Once the drive can be addressed, there are two methods for getting data from the drive to the CPU: PIO and DMA. PIO requires the CPU to shuttle data from the drive to memory so the CPU can use it later, but it is the most compatible way for Linux to talk to a drive, given the large number of IDE controller chips. DMA mode is much faster, since the CPU is not involved in moving the data between the drive and memory, freeing the CPU to do other things. Until the later Linux 2.2 kernels, DMA mode was not selected by default for drive access due to compatibility problems with some interface chips. You can now select a specific IDE controller driver and enable DMA access to make sure your combination is correct.
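Once Linux is running, you can also cross-check what the drive itself claims to support using hdparm, which is covered in the next section. This is only a sketch; /dev/hda is a placeholder for whichever drive you want to inspect:

# Identification data straight from the drive: geometry, LBA support,
# MaxMultSect, and the PIO and DMA modes it advertises (the active mode
# is marked with a *).
hdparm -i /dev/hda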
Tuning IDE from within Linux

As mentioned earlier, Linux has drivers for some specific IDE controllers, plus a generic IDE interface. Once the OS boots, Linux ignores the state of the BIOS when it accesses hardware, choosing instead to use its own drivers and options. By default, the Linux kernel has DMA access available, but not activated; this can be changed if you choose to compile your own kernel.

Once Linux is started, all IDE tuning occurs using hdparm. Running hdparm followed by a drive (hdparm /dev/hda) returns the current settings for that drive, while the tuning options discussed below follow the general form hdparm -c iovalue -d -m multimode device.

# hdparm /dev/hda

/dev/hda:
 multcount    =  0 (off)
 I/O support  =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 1559/240/63, sectors = 23572080, start = 0
#

As you can see, there are a few tuning options available for this drive. The biggest boost for IDE drive performance is to use DMA access, and this is already turned on. If DMA access is not turned on by default in the kernel, you can use -d1 to turn DMA access on and -d0 to turn it off. Note that DMA access will not necessarily increase the throughput of getting data to and from the drive, but it will free up the CPU during data transfers.

Another item to look at is the I/O support, which on our sample drive is set to 16-bit access instead of 32-bit. The -c option to hdparm queries or sets the I/O support for the specified device; there are three values that can be given to -c. The two most stable for systems are 0, used to set 16-bit access, and 3, used to set 32-bit synchronous access. An iovalue of 1 sets 32-bit access but may not work on all chipsets, causing a system hang if the chipset does not support regular 32-bit access.

Also test and check out multiple sector I/O with the -m option. Many newer IDE drives support transferring multiple sectors worth of data per I/O cycle, reducing the number of I/O cycles required to transfer data. The -i option to hdparm will tell you what the maximum multimode is in the MaxMultSect section. By setting multimode to its highest value, large transfers of data happen even more quickly. The hdparm manual page indicates that with some drives, using multiple sector I/O can result in massive filesystem corruption. As always, be sure to test your systems thoroughly before using them in a production environment.

Unfortunately, the correct setting for DMA, multiple sector I/O, and I/O support varies by drive and controller. You will have to do your own testing to find the correct combination for your system. The right combination should be able to: have a high throughput of data from the drive to the system; maintain that high throughput under heavy CPU load; and prevent the drive or the IDE driver from corrupting data.

Fortunately, testing throughput is easy with hdparm. The -t option tests the throughput of data directly from the drive, while -T tests throughput of data directly from the Linux drive buffer. To ensure the tests run properly, they should be run when the system is quiet and few other processes are running. Usually runlevel 1, also known as single user mode, provides this environment. Tests should be run two or three times. Once an optimum setting is found, you can have hdparm apply the setting at boot time by including the relevant commands in /etc/rc.local.
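A testing session along these lines can help find a stable combination. This is only a sketch: /dev/hda is a placeholder, and -m16 is just an example value; the right multimode setting is the MaxMultSect figure that hdparm -i reports for your drive.

# Baseline throughput: buffer-cache reads (-T) and raw device reads (-t).
hdparm -T /dev/hda
hdparm -t /dev/hda

# Try DMA, 32-bit synchronous I/O, and multiple sector I/O, then re-test.
hdparm -d1 -c3 -m16 /dev/hda
hdparm -tT /dev/hda

# If the combination proves stable, repeat it at boot time from /etc/rc.local:
# /sbin/hdparm -d1 -c3 -m16 /dev/hda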
Tuning SCSI disks
SCSI Options

In many cases, SCSI support is an add-in card to the system, giving you many options in vendors and card types. Some server motherboards include a SCSI chipset onboard as well. In addition, there are dedicated RAID cards that implement various RAID levels in hardware, offloading much of the logic from the CPU. SCSI has three major protocol varieties: SCSI-1, SCSI-2, and SCSI-3. Fortunately, the protocols are backward and forward compatible, meaning that a SCSI-3 controller card can talk to a SCSI-1 drive, albeit at a slower speed. For best performance, make sure that the protocols of the controller card and devices match. On the physical side, cable types range from a DB-25 connector on older Apple Macintoshes to 80-pin SCA connectors that carry not only the SCSI signals, but power and ID information as well.
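To see which devices the kernel has found on the SCSI bus, and which driver is managing them, the /proc interface is a quick check. The exact entries under /proc/scsi/ vary with the host adapter driver, so treat this as a sketch:

# Every SCSI device the kernel knows about, with vendor, model, and SCSI revision.
cat /proc/scsi/scsi

# Each host adapter driver keeps its own directory of per-controller details here.
ls /proc/scsi/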
SCSI RAID

Depending on the RAID level you select, you can optimize a drive array for anything from high performance to high availability. In addition, many hardware RAID cards support standby drives, so if one drive in the RAID fails, a previously unused drive immediately fills in and becomes part of the RAID.

RAID 0 is also known as striping, where all the drives in the array are presented together as one large drive. It has very high I/O performance, since data is written sequentially across each drive. In RAID 0, no drive space is lost, but if one of the drives dies, the entire array is lost. Drives in the array can be of varying sizes.

RAID 1 is known as mirroring. Each drive mirrors its contents to another drive in the array. Performance is no slower than any single drive in the array, and if one drive in the array fails, its mirror immediately takes over with no loss. RAID 1 has the highest overhead of drive space, requiring two drives for the storage of one. Each drive pair must be the same size.

RAID 5 is one of the most popular RAID formats. In this setup, data is spread among at least three drives along with parity data. If any one drive fails, the remaining drives still hold all of the array's data and can continue without loss; if a second drive were to fail, the entire array would fail. Usable space comes to N-1 drives, where N is the number of drives in the array, so a 5 drive array gives you the storage of 4 drives. All drives in this array must be the same size. Since parity data has to be created and written across multiple drives, RAID 5 writes are slower, but its reliability is quite high. For those implementing high availability on their own, this is the best balance of speed and reliability.

There are other RAID levels available, and in some cases these are vendor-specific (such as RAID 51, which is a mirrored RAID 5 array). For pure file performance, RAID 0 will be the fastest. Configuration of each of these levels usually happens outside of Linux, in the card's BIOS, and is specific to each hardware RAID card.

For performance reasons, it is not recommended that you use software RAID. Instead of offloading the RAID functionality to a hardware card and freeing the main CPU, software RAID uses the main CPU to create and maintain the RAID, which eats up valuable CPU cycles.
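If you do experiment with Linux software RAID despite the performance caveat above, a minimal sketch with the mdadm tool might look like the following. The device names are placeholders (three spare partitions of the same size), and some older distributions ship raidtools rather than mdadm:

# Build a three-drive RAID 5 array as /dev/md0; parity is spread across all members.
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1

# Watch the initial parity build and the ongoing state of the array.
cat /proc/mdstat

# Put a filesystem on the array and mount it like any other block device.
mke2fs /dev/md0
mount /dev/md0 /mnt/raid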
Tuning SCSI Drives

Since SCSI drivers are written to negotiate the best connection between the SCSI controller and its drives, there is very little the OS needs to do to improve performance. Your best bet for performance is to have the latest kernel and drivers and to use the correct driver for your card. In some cases, there are both Linux and BSD versions of drivers for your card. Some report that the BSD drivers are faster, so if you have both drivers available, test each to see which gives better performance.

On the hardware side, make sure that you are using the best match of SCSI controller and drives. Your best performance will probably be seen with Fibre Channel controllers and drives, since they have the highest throughput. If you have a large number of drives, you can span them across multiple controllers. Linux supports up to 8 or 16 cards, each with the ability to hold up to 16 devices per card, not counting the card itself.

One way of tuning the SCSI bus is to make sure it is properly terminated. Without proper termination, the SCSI bus may ratchet its speed down, or fail altogether. Termination should occur at both ends of a physical SCSI chain, but most SCSI chipsets include internal termination for their end. Purchase the correct terminator for your cable and put it at the far end of the cable. This will make sure there are no signal reflections anywhere in the cable that can cause interference.
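One quick way to confirm that the controller and drives negotiated the speed you expect (and that a termination problem is not forcing a fallback) is to look at what the driver reports. The path below assumes an Adaptec aic7xxx controller; other drivers expose their own entries under /proc/scsi/:

# Negotiated transfer rate and bus width for each target on the first aic7xxx controller.
cat /proc/scsi/aic7xxx/0

# The kernel log also records the rate each device negotiated at boot.
dmesg | grep -i synchronous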
Tuning Filesystems

Fast hard drive access is only half of getting fast read and write access. The other half is getting the data from the hard drive, through the kernel, and to the application. This is the function of the filesystem that the data lives on.

The first Linux filesystems (xiafs, minixfs, extfs) were all designed to give some level of performance to Linux while providing features close to what you would expect from a regular UNIX operating system. Over time, the second extended filesystem (ext2fs) arrived and was the standard filesystem for Linux for many years. Now the native filesystems for Linux include ext3, reiserfs, GFS, Coda, XFS, JFS, and InterMezzo.

The features available in Linux filesystems include inodes, directories, files, and tools to modify and create the filesystem. Inodes are the basic building blocks of the filesystem, creating pointers to file data, entries in directories, and directories themselves. Bookkeeping information about the filesystem and its inodes is kept in the superblock, and superblocks are duplicated many times throughout the filesystem, so if one copy gets corrupted, another can be used to recover. In the event the OS shuts down without unmounting its filesystems, a filesystem check must be run to make sure the inodes point to the data they should be pointing at. A journalling filesystem (ext3, reiserfs, JFS, XFS) records pending writes in a journal before committing them, so after a crash the journal can be replayed instead of requiring a full filesystem check.

Each type of filesystem has its advantages and disadvantages. A filesystem like ext2 or ext3 is better tuned to large files, so reads and writes happen very quickly; but if there are a large number of small files in a directory, its performance starts to suffer. A filesystem like reiserfs is better tuned to smaller files, but increases overhead for larger files.

For applications that are writing to or reading from the hard drive, block sizes allow Linux to move larger blocks of data to the hard drive in one operation. For example, a block size of 64KB will try to write to the hard drive in 64KB chunks. Depending on the hard drive and interface, a larger block size can result in better performance, while a poorly chosen block size can hurt it. If the optimal block size is 64KB but it is set to 32KB, it takes two operations to write the block to the hard drive. If it is set to 96KB, the OS will wait for a timeout period or for the rest of the block to fill before it writes the data to disk, increasing the latency of writing data to the disk. Block sizes mostly come into play where raw data is being written to or read from the hard drive, but applications like dd let you specify the block size used when reading or writing, allowing you to experiment with various block sizes.
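As a rough illustration, you can time the same amount of raw I/O with dd at different block sizes. The file name and sizes here are arbitrary examples; make sure the target filesystem has about a gigabyte free, and remember that caching will skew repeated runs:

# Write 1GB in 64KB blocks and time it.
time dd if=/dev/zero of=/tmp/blocktest bs=64k count=16384

# The same 1GB in 4KB blocks takes many more I/O operations.
time dd if=/dev/zero of=/tmp/blocktest bs=4k count=262144

# Clean up the test file.
rm /tmp/blocktest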