<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
<TITLE>HOWTO: Multi Disk System Tuning: Technologies </TITLE>
<LINK HREF="Multi-Disk-HOWTO-7.html" REL=next>
<LINK HREF="Multi-Disk-HOWTO-5.html" REL=previous>
<LINK HREF="Multi-Disk-HOWTO.html#toc6" REL=contents>
</HEAD>
<BODY>
<A HREF="Multi-Disk-HOWTO-7.html">Next</A>
<A HREF="Multi-Disk-HOWTO-5.html">Previous</A>
<A HREF="Multi-Disk-HOWTO.html#toc6">Contents</A>
<HR>
<H2><A NAME="technologies"></A> <A NAME="s6">6. Technologies </A></H2>
<P>
<!--
disk!technologies
-->
In order to decide how to get the most out of your devices you need to
know what technologies are available and what their implications are. As always
there are tradeoffs with respect to speed, reliability, power,
flexibility, ease of use and complexity.
<P>Many of the techniques described below can be stacked in a number
of ways to maximise performance and reliability, though at the cost
of added complexity.
<P>
<P>
<H2><A NAME="RAID"></A> <A NAME="ss6.1">6.1 RAID</A>
</H2>
<P>
<!--
disk!technologies!RAID
-->
RAID is a method of increasing reliability, speed or both by using multiple
disks in parallel, thereby decreasing access time and increasing transfer
speed. A checksum or mirroring system can be used to increase reliability.
Large servers can take advantage of such a setup but it might be overkill
for a single user system unless you already have a large number of disks
available. See other documents and FAQs for more information.
<P>For Linux one can set up a RAID system using either software
(the <CODE>md</CODE> module in the kernel), a Linux compatible
controller card (PCI-to-SCSI) or a SCSI-to-SCSI controller. Check the
documentation for which controllers can be used. A hardware solution is
usually faster, and perhaps also safer, but comes at a significant cost.
<P>A summary of available hardware RAID solutions for Linux is available
at
<A HREF="http://www.Linux-Consulting.com/Raid/Docs/raid_hw.txt">Linux Consulting</A>.
<P>
<P>
<P>
<H3><A NAME="SCSI-to-SCSI"></A> SCSI-to-SCSI</H3>
<P>
<!--
disk!technologies!RAID!SCSI-to-SCSI
-->
SCSI-to-SCSI controllers are usually implemented as complete cabinets
with drives and a controller that connects to the computer with a
second SCSI bus. This makes the entire cabinet of drives look like a
single large, fast SCSI drive and requires no special RAID driver. The
disadvantage is that the SCSI bus connecting the cabinet to the
computer becomes a bottleneck.
<P>Of particular interest to people with large disk farms is that there
is a limit to how many SCSI device entries there can be in the
<A HREF="file:///dev/">/dev</A>
directory. In these cases using SCSI-to-SCSI controllers will conserve entries.
<P>Usually they are configured via the front panel or with a terminal
connected to their on-board serial interface.
<P>
<P>Some manufacturers of such systems are
<A HREF="http://www.cmd.com">CMD</A>
and
<A HREF="http://www.syred.com">Syred</A>
whose web pages describe several systems.
<P>
<P>
<H3><A NAME="PCI-to-SCSI"></A> PCI-to-SCSI</H3>
<P>
<!--
disk!technologies!RAID!PCI-to-SCSI
-->
PCI-to-SCSI controllers are, as the name suggests,
connected to the high speed PCI
bus and therefore do not suffer from the same bottleneck as the
SCSI-to-SCSI controllers. These controllers require special drivers,
but you also get the means of controlling the RAID configuration over
the network, which simplifies management.
<P>Currently only a few families of PCI-to-SCSI host adapters
are supported under Linux.
<P>
<DL>
<P>
<DT><B>DPT</B><DD><P>The oldest and most mature is a range of controllers from
<A HREF="http://www.dpt.com">DPT</A>
including SmartCache I/III/IV and SmartRAID I/III/IV controller families.
These controllers are supported by the EATA-DMA driver in
the standard kernel. This company also has an informative
<A HREF="http://www.dpt.com">home page</A>
which also describes various general aspects
of RAID and SCSI in addition to the product related information.
<P>More information from the author of the DPT controller drivers
(EATA* drivers) can be found at his pages on
<A HREF="http://www.uni-mainz.de/~neuffer/scsi/">SCSI</A>
and
<A HREF="http://www.uni-mainz.de/~neuffer/scsi/dpt/">DPT</A>.
<P>These are not the fastest but have a good track record of
proven reliability.
<P>Note that the maintenance tools for DPT controllers currently
run under DOS/Win only so you will need a small DOS/Win partition
for some of the software. This also means you have to boot the
system into Windows in order to maintain your RAID system.
<P>
<P>
<DT><B>ICP-Vortex</B><DD><P>A very recent addition is a range of controllers from
<A HREF="http://www.icp-vortex.com">ICP-Vortex</A>
featuring up to 5 independent channels and very fast hardware
based on the i960 chip. The Linux driver was written by the
company itself which shows they support Linux.
<P>As ICP-Vortex supplies the maintenance software for Linux it is
not necessary to reboot into another operating system for the
setup and maintenance of your RAID system. This also saves you
extra downtime.
<P>
<P>
<DT><B>Mylex DAC-960</B><DD><P>This is one of the latest entries which is out in early beta.
More information as well as drivers are available at
<A HREF="http://www.dandelion.com/Linux/DAC960.html">Dandelion Digital's Linux DAC960 Page</A>.
<P>
<P>
<DT><B>Compaq Smart-2 PCI Disk Array Controllers</B><DD><P>Another very recent entry and currently in beta release is the
<A HREF="http://www.insync.net/~frantzc/cpqarray.html">Smart-2</A>
driver.
<P>
<DT><B>IBM ServeRAID</B><DD><P>IBM has released their
<A HREF="http://www.developer.ibm.com/welcome/netfinity/serveraid_beta.html">driver</A>
as GPL.
<P>
<P>
</DL>
<P>
<P>
<P>
<P>
<H3><A NAME="soft-raid"></A> Software RAID</H3>
<P>
<!--
disk!technologies!RAID!Software RAID
-->
A number of operating systems offer software RAID using
ordinary disks and controllers. Cost is low and performance
for raw disk IO can be very high.
As this can be very CPU intensive it increases the load noticeably,
so if the machine is CPU bound rather than IO bound
you might be better off with a hardware PCI-to-SCSI RAID controller.
<P>Real cost, performance and especially reliability of software
vs. hardware RAID is a very controversial topic. Reliability
on Linux systems has been very good so far.
<P>The current software RAID project on Linux is the <CODE>md</CODE> system
(multiple devices) which offers much more than RAID, so it is
described in more detail later.
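<P>A quick way to see which RAID personalities your running kernel
supports, and the state of any active <CODE>md</CODE> devices, is to look
at the <CODE>/proc</CODE> status file. This is only a minimal sketch; the
exact output varies with kernel version and configuration:
<BLOCKQUOTE><CODE>
<PRE>
# List RAID personalities known to the kernel and any active md devices
cat /proc/mdstat

# A typical first line of output looks something like:
#   Personalities : [linear] [raid0] [raid1] [raid5]
</PRE>
</CODE></BLOCKQUOTE>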
<P>
<P>
<P>
<H3><A NAME="raid-levels"></A> RAID Levels</H3>
<P>
<!--
disk!technologies!RAID!RAID levels
-->
RAID comes in many levels and flavours, of which I will give a brief
overview here. Much has been written about it and the
interested reader is recommended to read more about this in the
<A HREF="http://ostenfeld.dk/~jakob/Software-RAID.HOWTO/">Software RAID HOWTO</A>.
<P>
<UL>
<LI>RAID <EM>0</EM> is not redundant at all but offers the best
throughput of all levels here. Data is striped across a number of
drives so read and write operations take place in parallel across
all drives. On the other hand, if a single drive fails then
everything is lost. Did I mention backups?
</LI>
<LI>RAID <EM>1</EM> is the most primitive method of obtaining redundancy,
by duplicating data across all drives. Naturally this is
massively wasteful but you get one substantial advantage, which is
fast access:
the drive that accesses the data first wins. Transfers
are not any faster than for a single drive, even though you might
get somewhat faster read transfers by using one track reading per
drive.
Also, if you have only 2 drives this is the only method of achieving
redundancy.
</LI>
<LI>RAID <EM>2</EM> and <EM>4</EM> are not so common and are not covered
here.
</LI>
<LI>RAID <EM>3</EM> uses a number of disks (at least 2) to store data
in a striped RAID 0 fashion. It also uses an additional redundancy
disk to store the XOR sum of the data from the data disks. Should
the redundancy disk fail, the system can continue to operate as if
nothing happened. Should any single data disk fail the system can
compute the data on this disk from the information on the redundancy
disk and all remaining disks. Any double fault will bring the whole
RAID set off-line.
RAID 3 makes sense only with at least 2 data disks (3 disks
including the redundancy disk). Theoretically there is no limit for
the number of disks in the set, but the probability of a fault
increases with the number of disks in the RAID set. Usually the
upper limit is 5 to 7 disks in a single RAID set.
Since RAID 3 stores all redundancy information on a dedicated disk
and since this information has to be updated whenever a write to any
data disk occurs, the overall write speed of a RAID 3 set is limited
by the write speed of the redundancy disk. This, too, is a limit for
the number of disks in a RAID set. The overall read speed of a RAID
3 set with all data disks up and running is that of a RAID 0 set
with that number of data disks. If the set has to reconstruct data
stored on a failed disk from redundant information, the performance
will be severely limited: all disks in the set have to be read and
XOR-ed to compute the missing information. A small worked example of
this XOR arithmetic is given after this list.
</LI>
<LI>RAID <EM>5</EM> is just like RAID 3, but the redundancy
information is spread on all disks of the RAID set. This improves
write performance, because load is distributed more evenly between
all available disks.
</LI>
</UL>
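<P>To make the parity scheme used by RAID 3 and RAID 5 more concrete,
here is a small worked example using single bytes. This illustrates only
the XOR arithmetic itself, not any particular implementation:
<BLOCKQUOTE><CODE>
<PRE>
Data disk 1:   d1 = 0110 1100
Data disk 2:   d2 = 1010 0001
Data disk 3:   d3 = 0001 0111
Parity disk:   p  = d1 XOR d2 XOR d3 = 1101 1010

If data disk 2 fails, its contents are recomputed from the survivors:

               d2 = p XOR d1 XOR d3
                  = 1101 1010 XOR 0110 1100 XOR 0001 0111
                  = 1010 0001
</PRE>
</CODE></BLOCKQUOTE>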
<P>There are also hybrids available based on RAID 0 or 1 and one other
level. Many combinations are possible but I have only seen a few
referred to. These are more complex than the above mentioned
RAID levels.
<P>RAID <EM>0/1</EM> combines striping with duplication which
gives very high transfers combined with fast seeks as well as
redundancy. The disadvantage is high disk consumption as well as
the above mentioned complexity.
<P>RAID <EM>1/5</EM> combines the speed and redundancy benefits of
RAID5 with the fast seek of RAID1. Redundancy is improved compared
to RAID 0/1 but disk consumption is still substantial. Implementing
such a system would typically involve more than 6 drives, perhaps
even several controllers or SCSI channels.
<P>
<P>
<H2><A NAME="vol-mgmnt"></A> <A NAME="ss6.2">6.2 Volume Management</A>
</H2>
<P>
<!--
disk!technologies!volume management
-->
Volume management is a way of overcoming the constraints of fixed
sized partitions and disks while still having control over where the
various parts of the file space reside. With such a system you can
add new disks to your system and assign space from a new drive to the parts
of the file space where it is needed, as well as migrate data off
a disk that is developing faults to other drives before catastrophic failure
occurs.
<P>The system developed by
<A HREF="http://www.veritas.com">Veritas</A>
has become the de facto standard for logical volume management.
<P>Volume management is for the time being an area where Linux is lacking.
<P>One project is the virtual partition system
<A HREF="http://www-wsg.cso.uiuc.edu/~roth/">VPS</A>
which aims to reimplement many of the volume management functions found in
IBM's AIX system. Unfortunately this project is currently on hold.
<P>Another project is the
<A HREF="http://www.sistina.com/lvm/">Logical Volume Manager</A>
project, which is similar to the volume manager found in HP-UX.
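<P>To give an idea of what day-to-day use of such a system looks like,
here is a minimal sketch using the command names of the Linux Logical
Volume Manager project mentioned above. The volume group and device names
are hypothetical, and the exact commands and options depend on the
implementation and version, so treat this only as an outline:
<BLOCKQUOTE><CODE>
<PRE>
pvcreate /dev/sdc1                 # prepare a new partition as a physical volume
vgextend vg0 /dev/sdc1             # add it to an existing volume group "vg0"
lvextend -L +500M /dev/vg0/home    # grow the logical volume holding /home

# The file system on top of the logical volume must then be enlarged
# separately, for instance with a tool such as ext2resize.
</PRE>
</CODE></BLOCKQUOTE>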
<P>
<P>
<H2><A NAME="ss6.3">6.3 Linux <CODE>md</CODE> Kernel Patch</A>
</H2>
<P>
<!--
disk!technologies!md
-->
<!--
disk!technologies!raid
-->
<!--
disk!technologies!striping
-->
<!--
disk!technologies!translucence
-->
The Linux multiple device (<CODE>md</CODE>) driver provides a number of block level features
in various stages of development.
<P>RAID 0 (striping) and concatenation are very solid and of production quality,
and RAID 4 and 5 are also quite mature.
<P>It is also possible to stack some
levels, for instance mirroring (RAID 1) two pairs of drives,
each pair set up as striped disks (RAID 0),
which offers the speed of RAID 0 combined with the reliability of RAID 1.
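<P>Below is a minimal sketch of such a stacked setup, using the
<CODE>/etc/raidtab</CODE> syntax of the 0.90 series raidtools described in
the Software RAID HOWTO referenced below. The device names are examples
only, and whether <CODE>md</CODE> devices can be nested this way depends on
your kernel and tool versions, so check the documentation before relying
on it:
<BLOCKQUOTE><CODE>
<PRE>
# /etc/raidtab - mirror (RAID 1) over two striped (RAID 0) pairs
raiddev /dev/md0                  # first striped pair
    raid-level            0
    nr-raid-disks         2
    persistent-superblock 1
    chunk-size            32
    device                /dev/sda1
    raid-disk             0
    device                /dev/sdb1
    raid-disk             1

raiddev /dev/md1                  # second striped pair
    raid-level            0
    nr-raid-disks         2
    persistent-superblock 1
    chunk-size            32
    device                /dev/sdc1
    raid-disk             0
    device                /dev/sdd1
    raid-disk             1

raiddev /dev/md2                  # mirror of the two stripes
    raid-level            1
    nr-raid-disks         2
    persistent-superblock 1
    device                /dev/md0
    raid-disk             0
    device                /dev/md1
    raid-disk             1
</PRE>
</CODE></BLOCKQUOTE>
<P>The arrays would then be created with <CODE>mkraid</CODE> and the file
system made on <CODE>/dev/md2</CODE>.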
<P>In addition to RAID this system offers (in alpha stage) block level
volume management and soon also translucent file space.
Since this is done on the block level it can be used in combination
with any file system, even for <CODE>fat</CODE> using Wine.
<P>Think carefully about which drives you combine so that you can operate all drives
in parallel, which gives you better performance and less wear. Read more
about this in the documentation that comes with <CODE>md</CODE>.
<P>
<P>Unfortunately, Linux software RAID has split into two trees:
the old stable versions 0.35 and 0.42, which are documented in the
official
<A HREF="http://linas.org/linux/Software-RAID/Software-RAID.html">Software-RAID HOWTO</A>,
and the newer, less stable 0.90 series, which is documented in the
unofficial
<A HREF="http://ostenfeld.dk/~jakob/Software-RAID.HOWTO/">Software RAID HOWTO</A>,
which is a work in progress.
<P>A
<A HREF="http://www-mddsp.enel.ucalgary.ca/People/adilger/online-ext2/">patch for online growth of ext2fs</A>
is available in early stages
and related work is taking place at
<A HREF="http://ext2resize.sourceforge.net/">the ext2fs resize project</A>
at Sourceforge.
<P>
<P>
<P>Hint: if you cannot get it to work properly you may have forgotten to set
the persistent superblock flag (the <CODE>persistent-superblock</CODE> lines
in the sketch above). Your best documentation is currently
the source code.
<P>
<P>
<P>
<H2><A NAME="ss6.4">6.4 Compression</A>
</H2>
<P>
<!--
disk!technologies!compression
-->
<!--
disk!compression!DouBle
-->
<!--
disk!compression!Zlibc
-->
<!--
disk!compression!dmsdos
-->
<!--
disk!compression!e2compr
-->
Disk compression versus file compression
is a hotly debated topic, especially regarding
the added danger of file corruption. Nevertheless there are several options
available for the adventurous administrator. These take on many forms,
from kernel modules and patches to extra libraries, but note that most
suffer from various limitations such as being read-only. As development
takes place at breakneck speed the specs have undoubtedly changed by the
time you read this. As always: check the latest updates yourself. Here only
a few references are given.
<P>
<UL>
<LI>DouBle features file compression with some limitations.</LI>
<LI>Zlibc adds transparent on-the-fly decompression of files as they load.</LI>
<LI>There are many modules available for reading compressed files or
partitions that are native to various other operating systems, though
currently most of these are read-only.</LI>
<LI>
<A HREF="http://bf9nt.uni-duisburg.de/mitarbeiter/gockel/software/dmsdos/">dmsdos</A>
(currently in version 0.9.2.0) offer many of the compression
options available for DOS and Windows. It is not yet complete but work is
ongoing and new features added regularly.</LI>
<LI><CODE>e2compr</CODE> is a package that extends <CODE>ext2fs</CODE> with compression
capabilities. It is still under testing and will therefore mainly be of
interest for kernel hackers but should soon gain stability for wider use.
Check the
<A HREF="http://e2compr.memalpha.cx/e2compr/">http://e2compr.memalpha.cx/e2compr/</A>
name="e2compr homepage">
for more information. I have reports of speed and good stability
which is why it is mentioned here.</LI>
</UL>
<P>
<P>
<H2><A NAME="ss6.5">6.5 ACL</A>
</H2>
<P>
<!--
disk!technologies!ACL
-->
Access Control Lists (ACLs) offer finer control over file access
on a user by user basis, rather than the traditional owner, group
and others scheme seen in directory listings (<CODE>drwxr-xr-x</CODE>). This
is currently not available in Linux but is expected in kernel 2.3
as hooks are already in place in <CODE>ext2fs</CODE>.
<P>
<P>
<H2><A NAME="ss6.6">6.6 <CODE>cachefs</CODE></A>
</H2>
<P>
<!--
disk!technologies!cachefs
-->
This uses part of a hard disk to cache slower media such as CD-ROM.
It is available under SunOS but not yet for Linux.
<P>
<P>
<H2><A NAME="ss6.7">6.7 Translucent or Inheriting File Systems</A>
</H2>
<P>
<!--
disk!technologies!translucent
-->
<!--
disk!technologies!inheriting
-->
This is a copy-on-write system where writes go to a different system
than the original source while making it look like an ordinary file
space. Thus the file space inherits the original data and the
translucent write back buffer can be private to each user.
<P>There are a number of applications:
<UL>
<LI>updating a live file system on CD-ROM, making it flexible and fast
while also conserving space,</LI>
<LI>original skeleton files for each new user, saving space since the
original data is kept in a single space and shared out,</LI>
<LI>parallel project development prototyping where every user can
seemingly modify the system globally while not affecting other users.</LI>
</UL>
<P>SunOS offers this feature and this is under development for Linux.
There was an old project called the Inheriting File Systems (<CODE>ifs</CODE>)
but this project has stopped.
One current project is part of the <CODE>md</CODE> system and offers
block level translucence so it can be applied to any file system.
<P>Sun has an informative
<A HREF="http://www.sun.ca/white-papers/tfs.html">page</A>
on its translucent file system.
<P>It should be noted that
<A HREF="http://www.rational.com">Clearcase (now owned by Rational)</A>
pioneered and popularized translucent filesystems for software
configuration management by writing their own UNIX filesystem.
<P>
<P>
<P>
<P>
<H2><A NAME="physical-track-positioning"></A> <A NAME="ss6.8">6.8 Physical Track Positioning</A>
</H2>
<P>
<!--
disk!technologies!physical track positioning
-->
<!--
disk!technologies!track positioning
-->
This trick used to be very important when drives were slow and small,
and some file systems used to take the varying characteristics into
account when placing files. Higher overall speeds, together with on-board
drive and controller caches and intelligence,
have since reduced the effect of this.
<P>Nevertheless there is still a little to be gained even today.
As we know, "<I>world dominance</I>" is soon within reach
but to achieve this "<I>fast</I>" we need to employ all the tricks
we can use
<A HREF="http://www.mit.edu:8001/finger?linus@linux.cs.helsinki.fi"> </A>.
<P>To understand the strategy we need to recall this near ancient piece
of knowledge and the properties of the various track locations.
This is based on the fact that transfer speeds generally increase for tracks
further away from the spindle, as well as the fact that it is faster
to seek to or from the central tracks than to or from the inner or outer
tracks.
<P>Most drives use disks running at constant angular velocity but use
(fairly) constant data density across all tracks. This means that
you will get much higher transfer rates on the outer tracks than
on the inner tracks; a characteristic which fits the requirements
of large libraries well.
<P>Newer disks use a logical geometry mapping which differs from the actual
physical mapping which is transparently mapped by the drive itself.
This makes the estimation of the "middle" tracks a little harder.
<P>In most cases track 0 is at the outermost track and this is the general
assumption most people use. Still, it should be kept in mind that there
are no guarantees this is so.
<P>
<P>
<DL>
<DT><B>Inner</B><DD><P>tracks are usually slow in transfer and, lying at one
end of the seek range, are also slow to seek to.
<P>This makes them more suitable for less demanding areas such as a DOS
partition, the root file system and print spools.
<P>
<DT><B>Middle</B><DD><P>tracks are on average faster with respect to transfers
than inner tracks and, being in the middle, also on average faster to
seek to.
<P>These characteristics are ideal for the most demanding parts such as
<CODE>swap</CODE>, <CODE>/tmp</CODE> and <CODE>/var/tmp</CODE>.
<P>
<DT><B>Outer</B><DD><P>tracks have on average even faster transfer characteristics,
but like the inner tracks they lie at one end of the seek range, so statistically
they are as slow to seek to as the inner tracks.
<P>Large files such as libraries would benefit from a place here.
<P>
</DL>
<P>Hence seek time reduction can be achieved by positioning frequently
accessed tracks in the middle so that the average seek distance and
therefore the seek time is short. This can be done either by using
<CODE>fdisk</CODE> or <CODE>cfdisk</CODE> to make a partition
on the middle tracks or by first
making a file (using <CODE>dd</CODE>) equal to half the size of the entire disk
before creating the files that are frequently accessed, after which
the dummy file can be deleted. Both cases assume starting from an
empty disk.
<P>The latter trick is suitable for news spools where the empty directory
structure can be placed in the middle before putting in the data files.
This also helps reduce fragmentation a little.
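<P>A minimal sketch of this dummy file trick for a news spool follows. It
assumes a freshly made, empty file system mounted on <CODE>/mnt/news</CODE>
and a disk of roughly 4 GB; both the mount point and the size are examples
only and must be adjusted to your setup:
<BLOCKQUOTE><CODE>
<PRE>
# Reserve approximately the first half of the (empty) 4 GB file system
dd if=/dev/zero of=/mnt/news/dummy bs=1024k count=2048

# ... now create the directory structure and files that will be
#     accessed frequently; they end up beyond the dummy file ...

# Release the reserved space again
rm /mnt/news/dummy
</PRE>
</CODE></BLOCKQUOTE>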
<P>This little trick can be used on ordinary drives as well as on RAID
systems. In the latter case the calculation for centring the tracks
will be different, if it is possible at all. Consult the latest RAID manual.
<P>The speed difference this makes depends on the drives, but a 50 percent
improvement is a typical value.
<P>
<H3><A NAME="disk-speed-values"></A> Disk Speed Values</H3>
<P>
<!--
disk!technologies!disk speed values
-->
The same mechanical head disk assembly (HDA) is often available
with a number of interfaces (IDE, SCSI etc) and the mechanical
parameters are therefore often comparable. The mechanics are
today often the limiting factor, but development is improving
things steadily. There are two main parameters, usually quoted
in milliseconds (ms):
<P>
<UL>
<LI>Head movement - the speed at which the read-write head
is able to move from one track to another, called access time.
If you do the mathematics and doubly integrate the seek
first across all possible starting tracks and
then across all possible target tracks you will find that
this is equivalent to a stroke across a third of all tracks
(a short derivation is given after this list).
</LI>
<LI>Rotational speed - which determines the time taken to
get to the right sector, called latency.</LI>
</UL>
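<P>The "third of all tracks" figure quoted above follows from a short
calculation. If the starting and target tracks are treated as independent
and uniformly distributed over the disk, normalised to the interval from 0
to 1, the average seek distance is:
<BLOCKQUOTE><CODE>
<PRE>
E(|x - y|) = Int(0..1) Int(0..1) |x - y| dy dx
           = 2 * Int(0..1) Int(0..x) (x - y) dy dx
           = 2 * Int(0..1) (x^2 / 2) dx
           = 1/3
</PRE>
</CODE></BLOCKQUOTE>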
<P>After voice coils replaced stepper motors for the head movement
the improvements seem to have levelled off and more energy is
now spent (literally) at improving rotational speed. This has
the secondary benefit of also improving transfer rates.
<P>Some typical values:
<BLOCKQUOTE><CODE>
<PRE>
Drive type
Access time (ms) | Fast Typical Old
---------------------------------------------
Track-to-track &lt;1 2 8
Average seek 10 15 30
End-to-end 10 30 70
</PRE>
</CODE></BLOCKQUOTE>
<P>This shows that the very high end drives offer only marginally
better access times than the average drives, but that the old
drives based on stepper motors are significantly worse.
<P>
<BLOCKQUOTE><CODE>
<PRE>
Rotational speed (RPM) | 3600 | 4500 | 4800 | 5400 | 7200 | 10000
-------------------------------------------------------------------
Latency (ms) | 17 | 13 | 12.5 | 11.1 | 8.3 | 6.0
</PRE>
</CODE></BLOCKQUOTE>
<P>The latency quoted here is the time for one full revolution of the
disk; on average you will have to wait half of that to reach a given
sector. The formula for a full revolution is quite simply
<BLOCKQUOTE><CODE>
<PRE>
latency (ms) = 60000 / speed (RPM)
</PRE>
</CODE></BLOCKQUOTE>
<P>Clearly this too is an example of diminishing returns for the
efforts put into development. However, what really takes off
here is the power consumption, heat and noise.
<P>
<P>
<H2><A NAME="ss6.9">6.9 Yoke</A>
</H2>
<P>
<!--
disk!technologies!yoke
-->
There is also a
<A HREF="http://www.it.uc3m.es/cgi-bin/ptb/cvs-yoke.cgi">Linux Yoke Driver</A>
available in beta which
is intended to do hot-swappable transparent binding of
one Linux block device to another. This means that if you
bind two block devices together,
say <CODE>/dev/hda</CODE> and <CODE>/dev/loop0</CODE>,
writing to one device will mean also writing to the other
and reading from either will yield the same result.
<P>
<P>
<H2><A NAME="ss6.10">6.10 Stacking</A>
</H2>
<P>
<!--
disk!technologies!stacking
-->
One of the advantages of a layered design of an operating system
is that you have the flexibility to put the pieces together
in a number of ways.
For instance you can cache a CD-ROM with <CODE>cachefs</CODE> on
a volume striped over 2 drives. This in turn can be set up
translucently with a volume that is NFS mounted from another machine.
RAID can be stacked in several layers to offer very fast seek and
transfer in such a way that it keeps working even if 3 drives fail.
The choices are many, limited only by imagination and, probably
more importantly, money.
<P>
<P>
<H2><A NAME="ss6.11">6.11 Recommendations</A>
</H2>
<P>
<!--
disk!technologies!recommendations
-->
There is a near infinite number of combinations available but my
recommendation is to start off with a simple setup without any
fancy add-ons. Get a feel for what is needed, where the maximum
performance is required, whether it is access time or transfer speed
that is the bottleneck, and so on. Then phase in each component
in turn. As you can stack quite freely you should be able to
retrofit most components in as time goes by with relatively few
difficulties.
<P>RAID is usually a good idea but make sure you have a thorough
grasp of the technology and a solid backup system.
<P>
<P>
<P>
<HR>
<A HREF="Multi-Disk-HOWTO-7.html">Next</A>
<A HREF="Multi-Disk-HOWTO-5.html">Previous</A>
<A HREF="Multi-Disk-HOWTO.html#toc6">Contents</A>
</BODY>
</HTML>