<!doctype linuxdoc system>
<!-- This part is just my list of upcoming keywords. Do you really read this??
Need ext2fs aux progs: resize etc.
Partition: reasons: security, overflow protection; examples; flags.
nuni
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Changelog:
140197: Added Copyright, disclaimer
190197: cabling, ultra-2, OS types
22 : more OS, clustering
27 : more clustering, implementation
30 : more clustering, more implementation
0202 : correct typos
03 : added 'bits and pieces'
05 : new: maintenance
08 : new: Sun SPARC Solaris 2.5.1 setup
09 : new: partitioning suggestion table
10 : updates, tidying up ===> 0.12
1603 : upd. for rel., CLV/CAV, mnt.mountpoint, dpt, prjs, numbering
Debian 1.2.6 sizes
23 : minor corrections, updates etc.
0205 : more minor typos corrected and links added.
1905 : more minor typos corrected and links added.
(TheRef, WWW-FAQ, SCSI, Storage), unreleased.
2505 : cleaning links and adding section on heat and links for webbing and home pages.
2605 : Adding more info on CD-ROM file formats.
Released as 0.13
0506 : More on maintenance and misc formatting
0806 : Released as 0.13a
2206 : updated info on Dejanews, more on multi channel systems, released as 0.13b
1207 : updated info on Dejanews, released as 0.13c
1108 : adding many references, one FAQ, credit name update, unreleased
1208 : adding advanced chapter and notes on geometry
1308 : Released as 0.14 in time for Yggdrasil print deadline
1209 : Patch from kris, other inputs from edick and pot, tidying up ==>0.15
2409 : Minor editing to clean up the index page ==>0.15a
0510 : Fixing typos, cleaning up details on mailing lists ==>0.15b
0610 : Changed more info section to use sections rather than itemizing,
" added section on online resources,
" added new chapter on how to get help efficiently ==>0.16
1510 : Cleaned up tilde characters, added transfer speed table,
new sect2 on maintenance deletions, info from /proc ==>0.16a
2110 : Updated some links, more info on swap from Nakano-san,
performance tuning link for INN ==>0.16b
0511 : Updated some links, more info on HW RAID and benchmarks
preparing for LSL release ==>0.16c
0911 : Spam protection for all e-mail addresses but the author's ==>0.16d
2811 : Minor corrections, cleaned up KB, MB, GB and added info on e2compr
Finally removed the 'mini_' from the title! ==>0.16e
1012 : Added link to the new DPT RAID Howto. ==>0.16f
030298: More on tmpfs, booting, hdparm, ext2fs docs, Win (sysedit, regedit)
1202 : Merged in indexing from Redhat
1105 : Major overhaul after major system update, now using SGMLtools-1.0.5 on Debian 1.3. This is going to be messy!
- latency, fips32, reading plan, single drive partition, credits, new coordinator, SCSI arb pri, scsidev maj min numbering, devfs, more on swap, drive cache, FHS2.1
12(b):latency
21(c):credits, codename
1907 : Minor corrections.
0908 : Added example tables for systems with 1,2 and 4 ( opt RAID) drives
3008 : General cleanup
1009 : Gen. cleanup, capitalising headings and more acknowledgements
0111 : New translations, more fs notes and some minor editing
0311 : More fs related information
0811 : Major rearrangement of document, restructuring chapter on "Considerations", adding more "Recommendations" and more on file systems ->0.20
0922 : Major rearrangement continues ->0.20a
1217 : ...and continues ->0.20b
020199: ...and continues ->0.20c A new year starts...
1001 : More on fs ->0.20d and ->0.20e
1601 : More on read-only fs ->0.20f
1701 : More on networking fs ->0.20g (brief)
1801 : ... and continues ->0.20h
2001 : More on special fs ->0.20i
2301 : ... and continues plus some cleaning up ->0.20j and ->0.20k
2401 : Cleaning up ->0.20l and add back 'considerations' ->0.20m
2501 : Cleaning up the Implementation chapter ->0.21
2601 : ... and continues, adding more credits too ->0.21a ->0.21b
2601 : Fold in patches from Nakano-san. Manually. ->0.22
0102 : update release name, add minor details and fix typo ->0.22a
0102 : Add more links ->0.22b
0802 : Corrected bad typo, cleaned up header ->0.22c
1602 : Links to benchmarking, Sun info, fixed mount data ->0.22d
0703 : Fixed one link, better finger link, added BFS info ->0.22e
1804 : Renames to Metalab, added efs, more on FIPS and term ->0.22f
2804 : Added GFS ->0.22g
2405 : Added Userfs, Arla, FSresearch, new mirror
2705 : Corrected typo in mirror, added Software RAID HOWTO link
1807 : Many corrections from N.T. Added update on Chinese translation.
2507 : Added more on RAID, SCSI 160/m, smugfs, silicon disks and benchmarking ->022k
1808 : Added info on xfs, ext3fs, DVD and ShowFAT ->022l
2508 : Added info on Partition Resizer, fixed typos ->022m
1909 : Added numerous updates on file systems and disk tech ->0.23
2009 : Minor typos fixed ->0.23a
2009 : Added catch on mount-linking. Numerous minor typos fixed ->0.23b
3110 : Updates on SCSI/160, extfs growth and Italian translation ->0.23c
0711 : Fixed small typo ->0.23d
1211 : More on partition utilities ->0.23e
230100: More on partition utilities and fix a typo ->0.23f
0502 : Major upgrade based on inputs from schuulegaa (at) gatekeeper.txl.com ->0.24
1403 : Continuing the above ->0.24a
0203 : Continuing the above ->0.24b
3004 : Continuing the above ->0.24c
0105 : Various user inputs ->0.25
0105 : Various minor changes ->0.25a
0105 : Update with results from linkchecking ->0.25b
0205 : Update with results from linkchecking ->0.25c
0305 : Update with results from linkchecking ->0.25d
0305 : Update with results from linkchecking ->0.25e
2105 : Doc submitted to FHS list. Got one input there and some from translators ->0.30
2206 : Added link to JFFS, updated and corrected links ->0.30a
1907 : Added subsection on advanced mount options ->0.30b
1907 : Replace file tags with hyperlinks ->0.30c
2407 : Fixed a typo, sent in to ldp-submit ->0.31
2009 : User inputs to file systems and FEM correction ->0.32
1610 : Minor updates and release ->0.32a
0511 : Fixed one typo and added link to scsidev development page ->0.32b
1711 : Fixed typos and some links ->0.32c
0312 : Another round of link checking, will this never end? ->0.32d
1012 : Evidently not, more links updated -> 0.32e
1012 : And again, more links updated -> 0.32f
090101: Applied patch from Nakano-san -> 0.32g
0901 : Added new link to INN optimising, fixed one link ->0.32h
3006 : Added recovering disk failure, Win2000 RAID, iSCSI, corrections to mount point list ->0.32i
200502: A long overdue upgrade. Licence change, sep boot/root, GNU cp -av, memleak and formatting missing root ->0.33
2005 : ATA (big, fast, serial, cable select, no lone slave) tmpfs, limited outer tracks ->0.33a
-->
<article>
<!-- Title information -->
<!-- Old: <title>Mini_HOWTO: Multi Disk System Tuning -->
<title>HOWTO: Multi Disk System Tuning
<author>Stein Gjoen, <tt/sgjoen@nyx.net/
<date>v0.33a, 20 May 2002
<abstract>
<nidx>disk</nidx>
<nidx>partitions, disk (see disk)</nidx>
This document describes how best to use multiple disks and partitions
for a Linux system. Although some of this text is Linux specific, the
general approach outlined here can be applied to many other multi-tasking
operating systems.
</abstract>
<!-- Table of contents -->
<toc>
<!-- Begin the document -->
<!-- Old header follows
Mini_HOWTO: Multi Disk System Tuning
Version 0.7b (Yes, that right: this is a BETA)
Date 960823
By Stein Gjoen <sgjoen@nyx.net>
-->
<sect>Introduction
<p>
<nidx>disk!introduction</nidx>
<!-- In commemoration of the "<it/Linux Hacker V2.0 - The New Generation/" this
brand new release is code named the <bf/Patricia Miranda/ release. -->
<!-- After all, socks comes in pairs...
After all, this is a growing project... -->
<!-- In commemoration of recent legal development this
brand new release is code named the <bf/Trademark Resolution/ release. -->
<!-- For strange and artistic reasons this
brand new release is code named the <bf/Daybreak/ release. -->
<!-- In commemoration of recent news this brand new release is codenamed
the <bf/The Newer Generation/ release. -->
<!-- In commemoration of Linux kernel 2.2 release
this brand new release is codenamed the <bf/Daniella/ release. -->
For unclear reasons this brand new release is codenamed
<!-- the <bf/Sauchiehall/ release. -->
the <bf/Taylor3/ release.
New code names will appear as per industry standard guidelines
to emphasize the state-of-the-art-ness of this document.
<p>
This document was written for two reasons, mainly because I got hold
of 3 old SCSI disks to set up my Linux system on and I was pondering
how best to utilise the inherent possibilities of parallelizing in a
SCSI system. Secondly I hear there is a prize for people who write
documents...
This is intended to be read in conjunction with the Linux Filesystem
Structure Standard (FSSTND). It does not in any way replace it but tries to
suggest where physically to place directories detailed in the FSSTND,
in terms of drives, partitions, types, RAID, file system (fs),
physical sizes and other parameters that should be considered and
tuned in a Linux system, ranging from single home systems to large
servers on the Internet.
<!--
Even though it is now more than a year since last release of the FSSTND
work is still continuing, under a new name, and will encompass more than
Linux, fill in a few blanks hinted at in FSSTND version 1.2 as well as
other general improvements. The development mailing list is currently
private but a general release is hopefully in the near future.
-->
The followup to FSSTND is called the Filesystem Hierarchy Standard (FHS)
and covers more than Linux alone. FHS versions 2.0, 2.1 and 2.2 have been
released but there are still a few issues to be dealt with. Many recent
distributions are now aiming for FHS compliance.
<!-- removed 010630
and even
longer before this new standard will have an impact on actual
distributions. FHS is not yet used in any distributions but Debian
has announced they will use it in Debian 2.1 which is the current
distribution. Also SuSE is aiming for FHS compliance and no doubt more
will come. -->
It is also a good idea to read the Linux Installation guides thoroughly
and if you are using a PC system, which I guess the majority still does,
you can find much relevant and useful information in the FAQs for the
newsgroup comp.sys.ibm.pc.hardware, especially regarding storage media.
This is also a learning experience for myself and I hope I can start
the ball rolling with this HOWTO and that it perhaps can evolve
into a larger more detailed and hopefully even more correct HOWTO.
<!-- Removed 2303
Note that this is a guide on how to design and map logical partitions
onto multiple disks and tune for performance and reliability, NOT how
to actually partition the disks or format them - yet.
-->
First of all we need a bit of legalese. Recent development shows it is
quite important.
<sect1>Copyright
<p>
<!-- 020520 Remove old Copyright
This HOWTO is copyrighted 1996 Stein Gjoen.
Unless otherwise stated, Linux HOWTO documents are copyrighted by their
respective authors. Linux HOWTO documents may be reproduced and distributed
in whole or in part, in any medium physical or electronic, as long as
this copyright notice is retained on all copies. Commercial redistribution
is allowed and encouraged; however, the author would like to be notified of
any such distributions.
All translations, derivative works, or aggregate works incorporating
any Linux HOWTO documents must be covered under this copyright notice.
That is, you may not produce a derivative work from a HOWTO and impose
additional restrictions on its distribution. Exceptions to these rules
may be granted under certain conditions; please contact the Linux HOWTO
coordinator at the address given below.
In short, we wish to promote dissemination of this information through as
many channels as possible. However, we do wish to retain copyright on the
HOWTO documents, and would like to be notified of any plans to redistribute
the HOWTOs.
If you have questions, please contact
( Greg Hankins, ) the Linux HOWTO coordinator,
at linux-howto@metalab.unc.edu via email.
-->
This document is Copyright 1996 Stein Gjoen. Permission is granted to
copy, distribute and/or modify this document under the terms of the
GNU Free Documentation License, Version 1.1 or any later version
published by the Free Software Foundation with no Invariant Sections,
no Front-Cover Texts, and no Back-Cover Texts.
If you have any questions, please contact <tt/linux-howto@metalab.unc.edu/.
<sect1>Disclaimer
<p>
Use the information in this document at your own risk. I disavow any
potential liability for the contents of this document. Use of the
concepts, examples, and/or other content of this document is entirely
at your own risk.
All copyrights are owned by their owners, unless specifically noted
otherwise. Use of a term in this document should not be regarded as
affecting the validity of any trademark or service mark.
Naming of particular products or brands should not be seen as endorsements.
You are strongly recommended to make a backup of your system before any
major installation and to take backups at regular intervals.
<sect1>News
<p>
<nidx>disk!news on</nidx>
This is a major upgrade featuring a new copyright statement that is
intended to be Debian compliant and allow for inclusion in their
distribution. A number of mistakes are corrected and new features
added such as descriptions of recent ATA features and more.
<!-- This is a maintenance release featuring minor but numerous updates
and additions to file systems and also tools for mount tables. -->
<!-- This release features a major restructuring and more additions
than I can list here especially on
backup systems, hints and tips and even more on file system support.
Also there is now a new appendix with a shell script that helps
you characterise your system which is useful for debugging,
especially when asking others for help.
Also a section on troubleshooting has been added
as well as a subsection on mount options.
This HOWTO now uses indexing and is based on SGMLtools version 1.0.5
and the old version will therefore not format this document properly.
Also quite new is a number of new translations available.
Now a Chinese and also an Italian translation are under way.
-->
On the development front people are concentrating their energy towards
completing Linux 2.4 and until that is released there is not going to
be much news on disk technology for Linux.
<!-- Debian 2.1 is readying for release and as I use Debian for my test
systems I will make more updates when I upgrade. -->
The document is now also available in PostScript,
both in US letter and in European A4 format.
The latest version number of this document can be gleaned from my
plan entry if you <!-- do "finger sgjoen@nox.nyx.net" -->
<!-- <url url="http://www.cs.indiana.edu/finger/nox.nyx.net/sgjoen" -->
<url url="http://www.mit.edu:8001/finger?sgjoen@nox.nyx.net"
name="finger"> my Nyx account.
Also, the latest version will be available on my web space on Nyx
in a number of formats:
<itemize>
<item>
<url url="http://www.nyx.net/&tilde;sgjoen/disk.html"
name="HTML">.
<item>
<url url="http://www.nyx.net/&tilde;sgjoen/disk.txt"
name="plain ASCII text"> (ca. 6200 lines).
<item>
<url url="http://www.nyx.net/&tilde;sgjoen/disk-US.ps.gz"
name="compressed postscript US letter format"> (ca. 90 pages).
<item>
<url url="http://www.nyx.net/&tilde;sgjoen/disk-A4.ps.gz"
name="compressed postscript European A4 format"> (ca. 85 pages).
<item>
<url url="http://www.nyx.net/&tilde;sgjoen/disk.sgml"
name="SGML source"> (ca. 260 KB).
</itemize>
A European mirror of the
<!-- <url url="http://home.sol.no/&tilde;gjoen/stein/disk.html" -->
<url url="http://home.online.no/&tilde;ggjoeen/stein/disk.html"
name="Multi Disk HOWTO">
just went on line.
<sect1>Credits
<p>
In this version I have the pleasure of acknowledging even more people
who have contributed in one way or another:
<!-- sjmudd (at) phoenix.ea4els.ampr.org changes to sjmudd (at) redestb.es -->
<tscreen><verb>
ronnej (at ) ucs.orst.edu
cm (at) kukuruz.ping.at
armbru (at) pond.sub.org
R.P.Blake (at) open.ac.uk
neuffer (at) goofy.zdv.Uni-Mainz.de
sjmudd (at) redestb.es
nat (at) nataa.fr.eu.org
sundbyk (at) oslo.geco-prakla.slb.com
ggjoeen (at) online.no
mike (at) i-Connect.Net
roth (at) uiuc.edu
phall (at) ilap.com
szaka (at) mirror.cc.u-szeged.hu
CMckeon (at) swcp.com
kris (at) koentopp.de
edick (at) idcomm.com
pot (at) fly.cnuce.cnr.it
earl (at) sbox.tu-graz.ac.at
ebacon (at) oanet.com
vax (at) linkdead.paranoia.com
tschenk (at) theoffice.net
pjfarley (at) dorsai.org
jean (at) stat.ubc.ca
johnf (at) whitsunday.net.au
clasen (at) unidui.uni-duisburg.de
eeslgw (at) ee.surrey.asc.uk
adam (at) onshore.com
anikolae (at) wega-fddi2.rz.uni-ulm.de
cjaeger (at) dwave.net
eperezte (at) c2i.net
yesteven (at) ms2.hinet.net
cj (at) samurajdata.se
tbotond (at) netx.hu
russel (at) coker.com.au
lars (at) iar.se
GALLAGS3 (at) labs.wyeth.com
morimoto (at) xantia.citroen.org
shulegaa (at) gatekeeper.txl.com
roman.legat (at) stud.uni-hannover.de
ahamish (at) hicks.alien.usr.com
hduff2 (at) worldnet.att.net
mbaehr (at) email.archlab.tuwien.ac.at
adc (at) postoffice.utas.edu.au
pjm (at) bofh.asn.au
jochen.berg (at) ac.com
jpotts (at) us.ibm.com
jarry (at) gmx.net
LeBlanc (at) mcc.ac.uk
masy (at) webmasters.gr.jp
karlheg (at) hegbloom.net
goeran (at) uddeborg.pp.se
wgm (at) telus.net
</verb></tscreen>
<sect1>Translations
<p>
Special thanks go to <tt/nakano (at) apm.seikei.ac.jp/ for doing the
<url url="http://www.linux.or.jp/JF/JFdocs/Multi-Disk-HOWTO.html"
name="Japanese translation">,
general contributions as well as contributing an example of
a computer in an academic setting, which is included at the end of this
document.
There are now many new translations available and special thanks go
to the translators for the job and the input they have given:
<itemize>
<item><url url="http://www.linuxdoc.org/"
name="German Translation"> by <tt/chewie (at) nuernberg.netsurf.de/
<item><url url="http://www.swe-doc.linux.nu"
name="Swedish Translation "> by <tt/jonah (at) swipnet.se/
<item><url url="http://www.lri.fr/&tilde;loisel/howto/"
name="French Translation"> by <tt/Patrick.Loiseleur (at) lri.fr/
<item><url url="http://www.linuxdoc.org/"
name="Chinese Translation"> by <tt/yesteven (at ) ms2.hinet.net/
<item><url url="http://www.pluto.linux.it/ildp/HOWTO/Multi-Disk-HOWTO.html"
name="Italian Translation"> by <tt/bigpaul (at) flashnet.it/
</itemize>
ICP Vortex is gratefully acknowledged for sending in-depth information
on their range of RAID controllers.
Also DPT is acknowledged for sending me documentation on their controllers
as well as permission to quote from the material. These quotes have been
approved before appearing here and will be clearly labelled. No quotes as
of yet but that is coming.
Not many still, so please read through this document, make a contribution
and join the elite. If I have forgotten anyone, please let me know.
New in this version is an appendix with a few tables you can fill in
for your system in order to simplify the design process.
Any comments or suggestions can be mailed to my mail address on Nyx:
<htmlurl url="mailto:sgjoen@nyx.net"
name="sgjoen@nyx.net">.
So let's cut to the chase where <tt/swap/ and <tt>/tmp</tt> are
racing along the hard drive...
<p>
<!-- <hrule> -->
<!--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-->
<sect>Structure
<p>
As this type of document is supposed to be as much for learning
as a technical reference document I have rearranged the structure
to this end. For the designer of a system it is more useful to
have the information presented in terms of the goals of this exercise
than from the point of view of the logical layer structure of the
devices themselves. Nevertheless this document would not be complete
without the kind of layer structure the computer field is so full of, so
I will include one here as an introduction to how it works.
It is a long time since the <em/mini/ in mini-HOWTO could be defended
as proper but I am convinced that this document is as long as it needs
to be in order to make the right design decisions, and not longer.
<sect1>Logical structure
<p>
<nidx>disk!structure, I/O subsystem</nidx>
This is based on how each layer accesses the others, traditionally
with the application on top and the physical layer on the bottom.
It is quite useful to show the interrelationship between each of
the layers used in controlling drives.
<tscreen><verb>
___________________________________________________________
|__ File structure ( /usr /tmp etc) __|
|__ File system (ext2fs, vfat etc) __|
|__ Volume management (AFS) __|
|__ RAID, concatenation (md) __|
|__ Device driver (SCSI, IDE etc) __|
|__ Controller (chip, card) __|
|__ Connection (cable, network) __|
|__ Drive (magnetic, optical etc) __|
-----------------------------------------------------------
</verb></tscreen>
In the above diagram both the volume management layer and the RAID and
concatenation layer are optional. The 3 lower layers are implemented in hardware.
All parts are discussed at length later on in this document.
<sect1>Document structure
<p>
Most users start out with a given set of hardware and some plans on
what they wish to achieve and how big the system should be. This is
the point of view I will adopt in this document in presenting the
material, starting out with hardware, continuing with design constraints
before detailing the design strategy that I have found to work well.
I have used this both for my own personal computer at home and for a multi
purpose server at work, and found it worked quite well. In addition my
Japanese co-worker in this project has applied the same strategy to
a server in an academic setting with similar success.
Finally at the end I have detailed some configuration tables for use
in your own design. If you have any comments regarding this or notes
from your own design work I would like to hear from you so this
document can be upgraded.
<sect1>Reading plan
<p>
Although not the biggest HOWTO it is nevertheless rather big already
and I have been requested to make a reading plan that makes it possible
to cut down on the volume.
<descrip>
<tag/Expert/ (aka the elite). If you are familiar with Linux as well
as disk drive technologies you will find most of what you need in the
appendices. Additionally you are recommended to read the FAQ and the
<ref id="bits-n-pieces" name="Bits'n'pieces">
chapter.
<tag/Experienced/ (aka Competent). If you are familiar with computers
in general you can go straight to the chapters on
<ref id="technologies" name="technologies">
and continue from there on.
<tag/Newbie/ (mostly harmless). You just have to read the whole thing.
Sorry. In addition you are also recommended to read all the other disk
related HOWTOs.
</descrip>
<sect>Drive Technologies
<p>
<nidx>disk!technologies</nidx>
A far more complete discussion on drive technologies for IBM PCs
can be found at the home page of
<url url="http://thef-nym.sci.kun.nl/&tilde;pieterh/storage.html"
name="The Enhanced IDE/Fast-ATA FAQ">
which is also regularly posted on Usenet News.
There is also a site dedicated to
<url url="http://ata-atapi.com"
name="ATA and ATAPI Information and Software">.
Here I will just present what is needed to get an understanding
of the technology and get you started on your setup.
<sect1>Drives
<p>
<nidx>disk!drives</nidx>
This is the physical device where your data lives and although the
operating system makes the various types seem rather similar they
can in actual fact be very different. An understanding of how it
works can be very useful in your design work. Floppy drives fall
outside the scope of this document, though should there be a big
demand I could perhaps be persuaded to add a little here.
<sect1>Geometry
<p>
<nidx>disk!geometry</nidx>
Physically disk drives consist of one or more platters containing
data that is read in and out using sensors mounted on movable heads
that are fixed with respect to each other. Data transfers therefore
happen across all surfaces simultaneously, which defines a cylinder
of tracks. The drive is also divided into sectors containing a
number of data fields.
Drives are therefore often specified in terms of their geometry: the
number of Cylinders, Heads and Sectors (CHS).
For various reasons there are now a number of translations between
<itemize>
<item>the physical CHS of the drive itself
<item>the logical CHS the drive reports to the BIOS or OS
<item>the logical CHS used by the OS
</itemize>
Basically it is a mess and a source of much confusion. For more
information you are strongly recommended to read the
<em>Large Disk mini-HOWTO</em>.
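If you just want a quick look at which logical geometry the kernel is
using for a drive, a rough sketch is to query the drive and list the
partition table; the device name <tt>/dev/hda</tt> below is only an example:
<tscreen><verb>
hdparm -g /dev/hda     # print the logical geometry (cylinders/heads/sectors)
fdisk -l /dev/hda      # list the partition table using the geometry the OS sees
</verb></tscreen>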
<sect1>Media
<p>
<nidx>disk!media</nidx>
The media technology determines important parameters such as
read/write rates, seek times, storage size as well as if it is
read/write or read only.
<sect2>Magnetic Drives <label id="magnetic-drives">
<p>
<nidx>disk!media!magnetic</nidx>
This is the typical read-write mass storage medium, and like
everything else in the computer world, it comes in many flavours
with different properties. Usually this is the fastest technology
and offers read/write capability. The platter rotates with a
constant angular velocity (CAV) with a variable physical sector
density for more efficient magnetic media area utilisation.
In other words, the number of bits per unit length is kept
roughly constant by increasing the number of logical sectors
for the outer tracks.
Typical values for rotational speeds are 4500 and 5400 RPM, though
7200 is also used. Very recently also 10000 RPM has entered
the mass market.
Seek times are around 10 ms, transfer rates quite variable from
one type to another but typically 4-40 MB/s.
With the extreme high performance drives you should remember that
performance costs more electric power which is dissipated as heat,
see the point on
<ref id="power-heating" name="Power and Heating">.
Note that there are several kinds of transfers going on here, and
that these are quoted in different units. First of all there is
the platter-to-drive cache transfer, usually quoted in
Mbits/s. Typical values here are about 50-250 Mbits/s. The second
stage is from the built-in drive cache to the adapter, and this
is typically quoted in MB/s; typical quoted values here are
3-40 MB/s. Note, however, that this assumes the data is already in
the cache, so for sustained readout from the platters the
effective transfer rate will be dramatically lower.
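As a rough worked example of the difference between these two figures
(the numbers are purely illustrative):
<tscreen><verb>
platter-to-cache : 200 Mbits/s  =  200/8  =  25 MB/s sustained
cache-to-adapter :  40 MB/s burst, only while the data is in the cache
</verb></tscreen>
so a drive quoted at 40 MB/s may deliver only about 25 MB/s in long
sustained reads.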
<!-- removed due to redundancy with the above lines
<p>
Drives are usually described by the geometry or drive parameters which
is the number of heads, sectors and cylinders, which is confused by
translation schemes between physical and various logical geometries.
This is a mine field which is described in painful details in many
storage related FAQs. Read and weep.
-->
<sect2>Optical Drives
<p>
<nidx>disk!media!optical</nidx>
Optical read/write drives exist but are slow and not so common. They
were used in the NeXT machine but the low speed was a source for much
of the complaints. The low speed is mainly due to the thermal nature
of the phase change that represents the data storage. Even when using
relatively powerful lasers to induce the phase changes the effects are
still slower than the magnetic effect used in magnetic drives.
Today many people use CD-ROM drives which, as the
name suggests, are read-only. Storage is about 650 MB, transfer speeds
are variable, depending on the drive but can exceed 1.5 MB/s. Data is
stored on a spiraling single track so it is not useful to talk about
geometry for this. Data density is constant so the drive uses constant
linear velocity (CLV). Seek is also slower, about 100 ms, partially due
to the spiraling track. Recent high speed drives use a mix of
CLV and CAV in order to maximize performance. This also reduces access
time caused by the need to reach correct rotational speed for readout.
A new type (DVD) is on the horizon, offering up to about 18 GB on a
single disk.
<sect2>Solid State Drives
<p>
<nidx>disk!media!solid state</nidx>
This is a relatively recent addition to the available technology and
has been made popular especially in portable computers as well as in
embedded systems. Containing no movable parts they are very fast
both in terms of access and transfer rates. The most popular type is
flash RAM, but other types of RAM are also used. A few years ago many
had great hopes for magnetic bubble memories but they turned out to be
relatively expensive and are not that common.
In general the use of RAM disks is regarded as a bad idea as it is
normally more sensible to add more RAM to the motherboard and let the
operating system divide the memory pool into buffers, cache, program
and data areas. Only in very special cases, such as real time systems
with short time margins, can RAM disks be a sensible solution.
Flash RAM is today available in sizes of several tens of megabytes
and one might be tempted to use it for fast, temporary
storage in a computer. There is however a huge snag with this: flash
RAM has a finite life time in terms of the number of times you can
rewrite data, so putting
<tt>swap</tt>, <tt>/tmp</tt> or <tt>/var/tmp</tt> on such
a device will certainly shorten its lifetime dramatically.
Instead, using flash RAM for directories that are read often but
rarely written to, will be a big performance win.
In order to get the optimum life time out of flash RAM you will
need to use special drivers that will use the RAM evenly and
minimize the number of block erases.
This example illustrates the advantages of splitting up your directory
structure over several devices.
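As a minimal sketch of how such a read-mostly flash device could be
mounted to keep writes to a minimum, an <tt>/etc/fstab</tt> entry might
look like the line below; the device name and mount point are only examples,
and you can use <tt/noatime/ alone if the file system must remain writable:
<tscreen><verb>
/dev/hdc1    /usr    ext2    ro,noatime    1  2
</verb></tscreen>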
Solid state drives have no real cylinder/head/sector addressing but for
compatibility reasons this is simulated by the driver to give a uniform
interface to the operating system.
<sect1>Interfaces
<p>
<nidx>disk!interfaces</nidx>
There is a plethora of interfaces to choose from, widely ranging in
price and performance. Most motherboards today include an IDE interface,
which is part of modern chipsets.
Many motherboards also include a SCSI interface chip made by Symbios
(formerly NCR) which is connected directly to the PCI bus. Check
what you have and what BIOS support you have with it.
<sect2>MFM and RLL
<p>
<nidx>disk!interfaces!MFM</nidx>
<nidx>disk!interfaces!RLL</nidx>
Once upon a time this was the established technology, a time when
20 MB was awesome, which compared to today's sizes makes you think
that dinosaurs roamed the Earth with these drives. Like the dinosaurs
these are outdated and are slow and unreliable compared to what we
have today. Linux does support this but you are well advised to
think twice about what you would put on this. One might argue that
an emergency partition with a suitable vintage of DOS might be
fitting.
<sect2>ESDI
<p>
<nidx>disk!interfaces!ESDI</nidx>
<!--
This technology became outdated almost before it got popular, so you
are unlikely to come across it these days. Basically it was an attempt
of increasing the upper limit of the old interfaces. You might get
such a drive to work under Linux if it is compatible with the <tt/ST506/
standard. -->
<!-- Update from edick 970912 -->
Actually, ESDI was an adaptation of the very widely used SMD interface used on
"big" computers to the cable set used with the ST506 interface, which was more
convenient to package than the 60-pin + 26-pin connector pair used with SMD.
The ST506 was a "dumb" interface which relied entirely on the controller and
host computer to do everything from computing head/cylinder/sector locations
and keeping track of the head location, etc. ST506 required the controller to
extract clock from the recovered data, and control the physical location of
detailed track features on the medium, bit by bit. It had about a 10-year life
if you include the use of MFM, RLL, and ERLL/ARLL modulation schemes. ESDI,
on the other hand, had intelligence, often using three or four separate
microprocessors on a single drive, and high-level commands to format a track,
transfer data, perform seeks, and so on. Clock recovery from the data stream
was accomplished at the drive, which drove the clock line and presented its
data in NRZ, though error correction was still the task of the controller.
ESDI allowed the use of variable bit density recording, or, for that matter,
any other modulation technique, since it was locally generated and resolved at
the drive. Though many of the techniques used in ESDI were later incorporated
in IDE, it was the increased popularity of SCSI which led to the demise of ESDI
in computers. ESDI had a life of about 10 years, though mostly in servers and
otherwise "big" systems rather than PC's.
<sect2>IDE and ATA
<p>
<nidx>disk!interfaces!IDE</nidx>
<nidx>disk!interfaces!ATA</nidx>
Progress made the drive electronics migrate from the ISA slot
card over to the drive itself and Integrated Drive Electronics
was borne. It was simple, cheap and reasonably fast so the BIOS
designers provided the kind of snag that the computer industry is
so full of. A combination of an IDE limitation of 16 heads
together with the BIOS limitation of 1024 cylinders gave us the
infamous 504 MB limit. Following the computer industry traditions
again, the snag was patched with a kludge and we got all sorts of
translation schemes and BIOS bodges. This means that you need to
read the installation documentation very carefully and check up
on what BIOS you have and what date it has as the BIOS has to
tell Linux what size drive you have. Fortunately with Linux you
can also tell the kernel directly what size drive you have with
the drive parameters, check the documentation for LILO and Loadlin,
thoroughly. Note also that IDE is equivalent to ATA, AT Attachment.
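As a hedged example, the geometry can be passed to the kernel on the boot
command line, for instance in <tt>/etc/lilo.conf</tt>; the numbers below
are purely illustrative and must of course match your own drive:
<tscreen><verb>
# pass cylinders,heads,sectors for the first IDE drive to the kernel
image=/vmlinuz
    label=linux
    root=/dev/hda1
    append="hda=2100,16,63"
</verb></tscreen>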
IDE uses CPU-intensive Programmed Input/Output (PIO) to transfer
data to and from the drives and has no capability for the more
efficient Direct Memory Access (DMA) technology. Highest transfer
rate is 8.3 MB/s.
<sect2>EIDE, Fast-ATA and ATA-2
<p>
<nidx>disk!interfaces!EIDE</nidx>
<nidx>disk!interfaces!Fast-ATA</nidx>
<nidx>disk!interfaces!ATA-2</nidx>
These 3 terms are roughly equivalent: Fast-ATA is ATA-2, but EIDE
additionally includes ATAPI. ATA-2 is what most people use these days;
it is faster and supports DMA. The highest transfer rate is increased
to 16.6 MB/s.
<!-- from c't 9/97 -->
<sect2>Ultra-ATA
<p>
<nidx>disk!interfaces!Ultra-ATA</nidx>
A new, faster DMA mode that is approximately twice the speed of EIDE PIO-Mode 4
(33 MB/s). Disks with and without Ultra-ATA can be mixed on the same cable
without speed penalty for the faster adapters. The Ultra-ATA interface is
electrically identical with the normal Fast-ATA interface, including the
maximum cable length.
<!-- The newest development is the 66 MB/s version, DMA/66. -->
The ATA/66 was superseded by ATA/100 and very recently we have
now gotten ATA/133. While the interface speed has improved dramatically
the disks are often limited by the platter-to-cache transfer rate which today
stands at about 40 MB/s.
For more information read up on these overviews and whitepapers from Maxtor:
<url url="http://www.maxtor.com/products/FastDrive/default.htm"
name="Fast Drives Technology"> on the ATA/133 interface
and
<url url="http://www.maxtor.com/products/BigDrive/default.htm"
name="Big Drives Technology"> on breaking the 137 GB limit.
<sect2>Serial-ATA
<p>
<nidx>disk!interfaces!Serial-ATA</nidx>
A new standard has been agreed upon, the <tt>Serial-ATA</tt>
interface, backed by the
<url url="http://www.serial-ata.org/"
name="The Serial ATA">
group who made the announcement in August 2001.
Advantages are numerous: simple, thin connectors rather than the old
cumbersome cable mats that also obstructed air flow, higher speeds
(about 150 MB/s) and backward compatibility.
<sect2>ATAPI
<p>
<nidx>disk!interfaces!ATAPI</nidx>
The ATA Packet Interface was designed to support CD-ROM drives
using the IDE port and like IDE it is cheap and simple.
<sect2>SCSI
<p>
<nidx>disk!interfaces!SCSI</nidx>
The Small Computer System Interface is a multi purpose interface
that can be used to connect to everything from drives, disk arrays,
printers, scanners and more. The name is a bit of a misnomer as it
has traditionally been used by the higher end of the market as well
as in work stations since it is well suited for multi tasking
environments.
The standard interface is 8 bits wide and can address 8 devices.
There is a wide version with a 16 bit bus that is twice as fast on the
same clock and can address 16 devices. The host adapter always
counts as a device and is usually number 7.
It is also possible to have 32 bit wide busses but this usually
requires a double set of cables to carry all the lines.
The old standard was 5 MB/s and the newer fast-SCSI increased this
to 10 MB/s. Recently ultra-SCSI, also known as Fast-20, arrived
with 20 MB/s transfer rates for an 8 bit wide bus.
New low voltage differential (LVD) signalling allows
these high speeds as well as much longer cabling than before.
Even more recently an even faster standard has been introduced:
SCSI 160 (originally named SCSI 160/m) which is capable of a monstrous 160 MB/s
over a 16 bit wide bus. Support is still scarce, except for a few
10000 RPM drives that can transfer 40 MB/s sustained.
Putting 6 such drives on a RAID will keep such a bus saturated
and also saturate most PCI busses. Obviously this is only for
the very highest end servers as of today. More information on
this standard is available at
<url url="http://www.ultra160-scsi.com/"
name="The Ultra 160 SCSI home page">
Adaptec just announced a Linux driver for their SCSI 160 host adapter.
More details will be added here as the information becomes available.
Now SCSI/320 is also available.
The higher performance comes at a cost that is usually higher than for
(E)IDE. The importance of correct termination and good quality cables
cannot be overemphasized. SCSI drives also often tend to be of a higher
quality than IDE drives. Also adding SCSI devices tends to be easier
than adding more IDE drives: Often it is only a matter of plugging
or unplugging the device; some people do this without powering down
the system. This feature is most convenient when you have multiple
systems and you can just take the devices from one system to the
other should one of them fail for some reason.
There are a number of useful documents you should read if you use
SCSI, the SCSI HOWTO as well as the SCSI FAQ posted on Usenet News.
SCSI also has the advantage you can connect it easily to tape drives
for backing up your data, as well as some printers and scanners. It
is even possible to use it as a very fast network between computers
while simultaneously sharing SCSI devices on the same bus. Work is under
way but due to problems with ensuring cache coherency between the
different computers connected, this is a non-trivial task.
SCSI numbers are also used for arbitration. If several drives request
service, the drive with the lowest number is given priority.
Note that newer SCSI cards will simultaneously support an array
of different types of SCSI devices all at individually optimized
speeds.
<sect1>Cabling
<p>
<nidx>disk!cabling</nidx>
I do not intend to make too many comments on hardware but I feel I
should make a little note on cabling. This might seem like a
remarkably low technological piece of equipment, yet sadly it is the
source of many frustrating problems. At todays high speeds one should
think of the cable more of a an RF device with its inherent demands on
impedance matching. If you do not take your precautions you will get a
much reduced reliability or total failure. Some SCSI host adapters are
more sensitive to this than others.
Shielded cables are of course better than unshielded but the price is
much higher. With a little care you can get good performance from a
cheap unshielded cable.
<itemize>
<!-- from c't 9/97 -->
<item>For Fast-ATA and Ultra-ATA, the maximum cable length is specified
as 45cm (18"). The data lines of both IDE channels are connected on many
boards, though, so they count as <bf/one/ cable. In any case EIDE cables
should be as short as possible. If there are mysterious crashes or
spontaneous changes of data, it is well worth investigating your cabling.
Try a lower PIO mode or disconnect the second channel and see if the problem
still occurs.
<item>For <tt>Cable Select</tt> (ATA drives) you set the drive jumpers
to cable select and use the cable to determine master and slave. This
is not much used.
<item>Do not have a slave on an ATA controller (primary or secondary)
without a master on the same controller, behaviour in these cases is
undetermined.
<item> Use as short a cable as possible, but do not forget the
30 cm minimum separation for ultra SCSI
and 60 cm separation for differential SCSI.
<item> Avoid long stubs between the cable and the drive, connect
the plug on the cable directly to the drive without an extension.
<item> SCSI Cabling limitations:
<tscreen><verb>
Bus Speed (MHz) | Max Length (m)
--------------------------------------------------
5 | 6
10 (fast) | 3
20 (fast-20 / ultra) | 3 (max 4 devices), 1.5 (max 8 devices)
 xx (differential)     | 25 (max 16 devices)
--------------------------------------------------
</verb></tscreen>
<item> Use correct termination for SCSI devices and at the correct
positions: both ends of the SCSI chain. Remember the host adapter
itself may have on board termination.
<item> Do not mix shielded and unshielded cabling, do not wrap
cables around metal, try to avoid proximity to metal parts along
parts of the cabling. Any such discontinuities can cause impedance
mismatching which in turn can cause reflection of signals which
increases noise on the cable.
This problem gets even more severe in the case of multi channel
controllers.
Recently someone suggested wrapping bubble plastic around the cables
in order to avoid too close proximity to metal, a real problem inside
crowded cabinets.
</itemize>
More information on SCSI cabling and termination can be found at
<!-- <url url="http://resource.simplenet.com/files/68_50_n.htm"
name="other"> --> various
web pages around the net.
<sect1>Host Adapters
<p>
<nidx>disk!adapters</nidx>
<nidx>disk!host adapters</nidx>
This is the other end of the interface from the drive, the part
that is connected to a computer bus. The speed of the computer
bus and that of the drives should be roughly similar, otherwise
you have a bottleneck in your system. Connecting a RAID 0
disk-farm to an ISA card is pointless. These days most computers
come with 32 bit PCI bus capable of 132 MB/s transfers which
should not represent a bottleneck for most people in the near
future.
As the drive electronics migrated to the drives the remaining part
that became the (E)IDE interface is so small it can easily fit into
the PCI chip set. The SCSI host adapter is more complex and often
includes a small CPU of its own and is therefore more expensive and
not integrated into the PCI chip sets available today. Technological
evolution might change this.
Some host adapters come with separate caching and intelligence but as
this is basically second guessing the operating system the gains are
heavily dependent on which operating system is used. Some of the more
primitive ones, that shall remain nameless, experience great gains.
Linux, on the other hand, has so much smarts of its own that the
gains are much smaller.
Mike Neuffer, who did the drivers for the DPT controllers, states that
the DPT controllers are intelligent enough that, given enough cache
memory, they will give you a big push in performance. He suggests that
people who have experienced little gain with smart controllers just
have not used a sufficiently intelligent caching controller.
<sect1>Multi Channel Systems
<p>
<nidx>disk!multi-channel</nidx>
In order to increase throughput it is necessary to identify the most
significant bottlenecks and then eliminate them. In some systems, in
particular where there are a great number of drives connected, it is
advantageous to use several controllers working in parallel, both for
SCSI host adapters as well as IDE controllers which usually have 2
channels built in. Linux supports this.
Some RAID controllers feature 2 or 3 channels and it pays to spread
the disk load across all channels. In other words, if you have two
SCSI drives you want to RAID and a two channel controller, you should
put each drive on separate channels.
<sect1>Multi Board Systems
<p>
<nidx>disk!multi-board</nidx>
In addition to having both a SCSI and an IDE in the same machine
it is also possible to have more than one SCSI controller. Check
the SCSI-HOWTO on what controllers you can combine. Also you will
most likely have to tell the kernel it should probe for more than
just a single SCSI or a single IDE controller. This is done using
kernel parameters when booting, for instance using LILO.
Check the HOWTOs for SCSI and LILO for how to do this.
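As an illustration only, extra IDE interfaces can be announced on the
kernel command line, for instance via the <tt/append/ line in
<tt>/etc/lilo.conf</tt>; the addresses below are the conventional tertiary
and quaternary ports and must match your actual hardware, and SCSI adapters
have similar boot options described in the SCSI HOWTO:
<tscreen><verb>
# announce a third and a fourth IDE interface (I/O base, control port)
append="ide2=0x1e8,0x3ee ide3=0x168,0x36e"
</verb></tscreen>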
Multi board systems can offer significant speed gains if you
configure your disks right, especially for RAID0. Make sure you
interleave the controllers as well as the drives, so that you
add drives to the md RAID device in the right order.
If controller 1 is connected to drives <tt/sda/ and <tt/sdc/
while controller 2 is connected to drives <tt/sdb/ and <tt/sdd/
you will gain more parallelism by adding them in the order of
<tt/sda - sdc - sdb - sdd/ rather than <tt/sda - sdb - sdc - sdd/
because a read or write over more than one cluster will be more
likely to span two controllers.
<label id="drive-names">
The same methods can also be applied to IDE. Most motherboards
come with typically 4 IDE ports:
<itemize>
<item> <tt/hda/ primary master
<item> <tt/hdb/ primary slave
<item> <tt/hdc/ secondary master
<item> <tt/hdd/ secondary slave
</itemize>
where the two primaries share one flat cable and the secondaries
share another cable. Modern chipsets keep these independent.
Therefore it is best to RAID in the order <tt/hda - hdc - hdb - hdd/
as this will most likely parallelise both channels.
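With the old raidtools this ordering is simply the order in which the
devices are listed in <tt>/etc/raidtab</tt>; a minimal sketch for a four
disk RAID 0 following the interleaving above could look like this
(the partition names are only examples):
<tscreen><verb>
raiddev /dev/md0
    raid-level            0
    nr-raid-disks         4
    persistent-superblock 1
    chunk-size            32
    device                /dev/hda2
    raid-disk             0
    device                /dev/hdc2
    raid-disk             1
    device                /dev/hdb2
    raid-disk             2
    device                /dev/hdd2
    raid-disk             3
</verb></tscreen>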
<sect1>Speed Comparison
<p>
<nidx>disk!speed comparison</nidx>
The following tables are given just to indicate what speeds are
possible but remember that these are the theoretical maximum
speeds. All transfer rates are in MB per second
and bus widths are measured in bits.
<sect2>Controllers
<p>
<nidx>disk!speed comparison!controllers</nidx>
<tscreen><verb>
IDE : 8.3 - 16.7
Ultra-ATA : 33 - 66
SCSI :
Bus width (bits)
Bus Speed (MHz) | 8 16 32
--------------------------------------------------
5 | 5 10 20
10 (fast) | 10 20 40
20 (fast-20 / ultra) | 20 40 80
40 (fast-40 / ultra-2) | 40 80 --
--------------------------------------------------
</verb></tscreen>
<sect2>Bus Types
<p>
<nidx>disk!speed comparison!bus types</nidx>
<tscreen><verb>
ISA : 8-12
EISA : 33
VESA : 40 (Sometimes tuned to 50)
PCI
Bus width (bits)
Bus Speed (MHz) | 32 64
--------------------------------------------------
33 | 132 264
66 | 264 528
--------------------------------------------------
</verb></tscreen>
<sect1>Benchmarking
<p>
<nidx>disk!benchmarking</nidx>
<nidx>disk!benchmarking!bonnie</nidx>
<nidx>disk!benchmarking!iozone</nidx>
<nidx>disk!Bonnie Raitt</nidx>
This is a very, very difficult topic and I will only make a few
cautious comments about this minefield. First of all, it is far more
difficult than one might think to make comparable benchmarks that have
any actual meaning. This, however, does not stop people from trying...
Instead one can use benchmarking to diagnose your own system, to
check it is going as fast as it should, that is, not slowing down.
Also you would expect a significant increase when switching from
a simple file system to RAID, so a lack of performance gain will
tell you something is wrong.
When you try to benchmark you should not hack up your own, instead
look up <tt/iozone/ and <tt/bonnie/ and read the documentation very
carefully. In particular make sure your buffer size is bigger than
your RAM size, otherwise you test your RAM rather than your disks
which will give you unrealistically high performance.
A very simple benchmark can be obtained using <tt/hdparm -tT/ which
can be used both on IDE and SCSI drives.
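For example (the device name is just an example; run it a few times on an
otherwise idle system and average the results):
<tscreen><verb>
hdparm -T /dev/hda    # timings of cache reads, mostly tests RAM and buffer cache
hdparm -t /dev/hda    # timings of buffered disk reads, tests sustained read speed
hdparm -tT /dev/hda   # both in one run
</verb></tscreen>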
<!-- More information about this is coming soon. -->
For more information on benchmarking and software for a number of
platforms, check out
<url url="http://www.acnc.com/benchmarks.html"
name="ACNC">
benchmark page
as well as
<!-- <url url="http://www.spin.ch/&tilde;tpo/bench.html" 000502 -->
<url url="http://spin.ch/&tilde;tpo/bench/"
name="this one">
and also
<!-- <url url="http://metalab.unc.edu/LDP/HOWTO/Benchmarking-HOWTO.html" -->
<url url="http://www.linuxdoc.org/HOWTO/Benchmarking-HOWTO.html"
name="The Benchmarking-HOWTO">.
There are also official home pages for
<url url="http://www.textuality.com/bonnie/"
name="bonnie">,
<url url="http://www.coker.com.au/bonnie++/"
name="bonnie++">
and
<url url="http://www.iozone.org"
name="iozone">.
Trivia: Bonnie is intended to locate bottlenecks, the name is a tribute
to Bonnie Raitt, "who knows how to use one" as the author puts it.
<sect1>Comparisons
<p>
<nidx>disk!comparisons</nidx>
SCSI offers more performance than EIDE but at a price. Termination
is more complex but expansion not too difficult. Having more than
4 (or in some cases 2) IDE drives can be complicated, with wide SCSI
you can have up to 15 per adapter. Some SCSI host adapters have
several channels thereby multiplying the number of possible drives
even further.
For SCSI you have to dedicate one IRQ per host adapter which can
control up to 15 drives. With EIDE you need one IRQ for each
channel (which can connect up to 2 disks, master and slave)
which can cause conflict.
RLL and MFM are in general too old, slow and unreliable to be of much
use.
<sect1>Future Development
<p>
<nidx>disk!future development</nidx>
<!-- c't 9/97: This is no longer future...
The general trend is for faster and faster devices for every update
in the specifications. ATA-3 is just out but does not define faster
transfers, that could happen in ATA-4 which is under way. Quantum
has already released DMA/33 and recent motherboard chip sets now
supports this standard.
-->
SCSI-3 is under way and will hopefully be released soon. Faster
devices are already being announced, recently an 80 MB/s
and then a 160 MB/s monster specification has been proposed and
also very recently became commercially available.
These are based around the Ultra-2 standard (which used a 40 MHz clock)
combined with a 16 bit cable.
Some manufacturers already announce SCSI-3
devices but this is currently rather premature as the standard is not
yet firm. As the transfer speeds increase the saturation point of the
PCI bus is getting closer. Currently the 64 bit version has a limit of
264 MB/s. The PCI transfer rate will in the future be increased from the
current 33 MHz to 66 MHz, thereby increasing the limit to 528 MB/s.
The ATA development is continuing and is increasing the performance
with the new ATA/100 standard. Since most ATA drives are slower in
sustained transfer from platter than this the performance increase
will for most people be small.
More interesting is the Serial ATA development, where the flat cable
will be replaced with a high speed serial link. This makes cabling
far simpler than today and also it solves the problem of cabling
obstructing airflow over the drives.
Another trend is for larger and larger drives. I hear it is possible
to get 75 GB on a single drive though this is rather expensive.
Currently the optimum storage for your money is about 30 GB but also
this is continuously increasing. The introduction of DVD will in the
near future have a big impact, with nearly 20 GB on a single disk you
can have a complete copy of even major FTP sites from around the
world. The only thing we can be reasonably sure about the future
is that even if it won't get any better, it will definitely be bigger.
Addendum: soon after I first wrote this I read that the maximum useful
speed for a CD-ROM was 20x as mechanical stability would be too great
a problem at these speeds. About one month after that again the first
commercial 24x CD-ROMs were available... Currently you can get 40x and
no doubt higher speeds are in the pipeline.
A project to encapsulate SCSI over TCP/IP, called
<url url="http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-06.txt"
name="iSCSI">
has started, and one
<url url="http://www.cs.uml.edu/~mbrown/iSCSI"
name="Linux iSCSI implementation">
has appeared.
<sect1>Recommendations <label id="recommendations">
<p>
<nidx>disk!recommendations</nidx>
My personal view is that EIDE
or Ultra ATA is the best way to start out on your
system, especially if you intend to use DOS as well on your machine.
If you plan to expand your system over many years or use it as a
server I would strongly recommend you get SCSI drives. Currently
wide SCSI is a little more expensive. You are generally more likely
to get more for your money with standard width SCSI. There are also
differential versions of the SCSI bus which increase the maximum length
of the cable. The price increase is even more substantial and cannot
therefore be recommended for normal users.
In addition to disk drives you can also connect some types of
scanners and printers and even networks to a SCSI bus.
Also keep in mind that as you expand your system you will draw ever
more power, so make sure your power supply is rated for the job and
that you have sufficient cooling. Many SCSI drives offer the option
of sequential spin-up which is a good idea for large systems.
See also
<ref id="power-heating" name="Power and Heating">.
<!--
I do not want to say too much about low level hardware here but I have
to make an exception for SCSI. Some people have a bit of trouble with
this and in the majority of cases the cause is sub standard cabling.
Certain SCSI adapters are known to be very sensitive to the quality
of the cables, see the SCSI HOWTO.
The importance of correct cabling and termination cannot be
overemphasized, read the manuals carefully. Also with the 20MHz Ultra
standard you now also have to keep in mind that there is now also a
minimum distance of 30cm between devices.
-->
<!--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-->
<!--
<sect>Considerations
<p>
<nidx>disk!considerations</nidx>
The starting point in this will be to consider where you are and what
you want to do. The typical home system starts out with existing
hardware and the newly converted Linux user will want to get the most
out of existing hardware. Someone setting up a new system for a
specific purpose (such as an Internet provider) will instead have to
consider what the goal is and buy accordingly. Being ambitious I will
try to cover the entire range.
Various purposes will also have different requirements regarding file
system placement on the drives, a large multiuser machine would
probably be best off with the <tt>/home</tt> directory on a
separate disk, just to give an example.
In general, for performance it is advantageous to split most things
over as many disks as possible but there is a limited number of
devices that can live on a SCSI bus and cost is naturally also a
factor. Equally important, file system maintenance becomes more
complicated as the number of partitions and physical drives increases.
-->
<sect>File System Structure
<p>
<nidx>disk!filesystem structure</nidx>
Linux has been multi tasking from the very beginning where a number
of programs interact and run continuously. It is therefore important
to keep a file structure that everyone can agree on so that the system
finds data where it expects to. Historically there have been so many
different standards that it was confusing, and compatibility was
maintained using symbolic links which confused the issue even further
until the structure ended up looking like a maze.
<nidx>disk!FSSTND</nidx>
In the case of Linux a standard was fortunately agreed on early on
called the <em/File Systems Standard/ (FSSTND) which today is used
by all main Linux distributions.
<nidx>disk!FHS</nidx>
Later it was decided to make a successor that should also support
operating systems other than just Linux, called
the <em/Filesystem Hierarchy Standard/ (FHS) at version 2.2 currently.
This standard is under continuous development and will
soon be adopted by Linux distributions.
I recommend not trying to roll your own structure as a lot of
thought has gone into the standards and many software packages
comply with the standards. Instead you can read more about this
at the
<url url="http://www.pathname.com/fhs/"
name="FHS home page">.
This HOWTO endeavours to comply with FSSTND
and will follow FHS when distributions become available.
<sect1>File System Features
<p>
<nidx>disk!filesystem features</nidx>
The various parts of FSSTND have different requirements regarding
speed, reliability and size; for instance losing root is a pain
but can easily be recovered from. Losing <tt>/var/spool/mail</tt> is a
rather different issue. Here is a quick summary of some essential
parts and their properties and requirements. Note that this is
just a guide; there can be binaries in <tt>etc</tt> and
<tt>lib</tt> directories, libraries in <tt>bin</tt> directories
and so on.
<sect2>Swap
<p>
<nidx>disk!swap</nidx>
<descrip>
<tag/Speed/ Maximum! Though if you rely too much on swap you
should consider buying some more RAM. Note, however, that on
many old Pentium PC motherboards the cache will not work on RAM above 128 MB.
<tag/Size/ Similar to RAM. Quick and dirty algorithm,
just as for tea: 16 MB for the machine and 2 MB for each user. The smallest
kernels run in 1 MB but that is tight; use 4 MB for general work and light
applications, 8 MB for X11 or GCC, or 16 MB to be comfortable.
(The author is known to brew a rather powerful cuppa tea...)
Some suggest that swap space should be 1-2 times the size of the
RAM, pointing out that the locality of the programs determines how
effective your added swap space is. Note that using the same
algorithm as for 4BSD is slightly incorrect as Linux does not
allocate space for pages in core.
A more thorough approach is to consider swap space plus RAM as
your total working set, so if you know how much space you will
need at most, you subtract the physical RAM you have and that
is the swap space you will need.
There is also another reason to be generous when dimensioning
your swap space: memory leaks. Ill-behaved programs that do not free
the memory they allocate for themselves are said to have a memory leak.
Leaked memory stays allocated for as long as the offending program runs,
even when it sits idle, so it is a steady source of memory consumption.
Only after the program dies is the memory returned.
Once all physical RAM and
swap space are exhausted the only solution is to
kill the offending processes if possible, or failing that,
reboot and start over.
Thankfully such programs are not too common but should you come across
one you will find that extra swap space will buy you extra time between
reboots.
Also remember to take into account the type of programs you use.
Some programs that have large working sets, such as
<!-- finite element method (FEM) -->
image processing software,
have huge data structures loaded in RAM rather than
working explicitly on disk files. Data and computing intensive
programs like this will cause excessive swapping if you have less
RAM than they require.
Other types of programs can lock their pages into RAM. This can be
for security reasons, preventing copies of data reaching a swap device
or for performance reasons such as in a real time module. Either way,
locking pages reduces the remaining amount of swappable memory and
can cause the system to swap earlier than otherwise expected.
In <tt/man 8 mkswap/ it is explained that each swap partition can
be a maximum of just under 128 MB in size for 32-bit machines
and just under 256 MB for 64-bit machines.
This however changed with kernel 2.2.0 after which the limit is 2 GB.
The man page has been updated to reflect this change.
<tag/Reliability/ Medium. When it fails you know it pretty quickly and
failure will cost you some lost work. You save often, don't you?
<tag/Note 1/ Linux offers the possibility of interleaved swapping
across multiple devices, a feature that can gain you a lot. Check out
"<tt>man 8 swapon</tt>" for more details, and see the short example
below. However, putting <tt>swap</tt> on a software RAID across
multiple devices adds more overhead than it gains you.
Thus the <tt>/etc/fstab</tt> file might look like this:
<tscreen><verb>
/dev/sda1 swap swap pri=1 0 0
/dev/sdc1 swap swap pri=1 0 0
</verb></tscreen>
Remember that the <tt/fstab/ file is <em/very/ sensitive to the formatting
used, read the man page carefully and do <em/not/ just cut and paste
the lines above.
<tag/Note 2/ Some people use a RAM disk for swapping or for some other
file systems. However, unless you have some very unusual requirements
or setups you are unlikely to gain much from this, as it cuts into
the memory available for caching and buffering.
<tag/Note 2b/ There is one exception: on a number of badly designed
motherboards the on-board cache memory is not able to cache all the
RAM that can be addressed. Many older motherboards could accept 128 MB
RAM but only cache the lower 64 MB. In such cases it would improve
performance if you used the upper (uncached) 64 MB RAM for RAM disk
based swap or other temporary storage.
</descrip>
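To illustrate the mechanics, here is a minimal sketch of preparing and
activating a swap partition from the command line; the device name is
only an example and must be replaced with your own partition.
<tscreen><verb>
mkswap /dev/sda1        # write a swap signature to the (example) partition
swapon -p 1 /dev/sda1   # activate it with priority 1
swapon -s               # or "cat /proc/swaps" to verify it is in use
</verb></tscreen>
Entries in <tt>/etc/fstab</tt> as shown above make this happen
automatically at boot time.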
<sect2>Temporary Storage (<tt>/tmp</tt> and <tt>/var/tmp</tt>)
<p>
<nidx>disk!temporary storage</nidx>
<descrip>
<tag/Speed/ Very high. On a separate disk/partition this will
reduce fragmentation generally, though <tt/ext2fs/ handles fragmentation
rather well.
<tag/Size/ Hard to tell, small systems are easy to run with just
a few MB but these are notorious hiding places for stashing files
away from prying eyes and quota enforcement and can grow without
control on larger machines. Suggested: small home machine: 8 MB,
large home machine: 32 MB, small server: 128 MB, and large
machines up to 500 MB (The machine used by the author at work has 1100
users and a 300 MB <tt>/tmp</tt> directory). Keep an eye on these directories,
not only for hidden files but also for old files (a small example of
finding old files follows below). Also be prepared that these
partitions might well be the first reason you have to resize
your partitions.
<tag/Reliability/ Low. Often programs will warn or fail gracefully when
these areas fail or are filled up. Random file errors will of course
be more serious, no matter what file area this is.
<tag/Files/ Mostly short files but there can be a huge number of
them. Normally programs delete their old <tt>tmp</tt> files but if somehow an
interruption occurs they could survive. Many distributions have a policy
regarding cleaning out <tt>tmp</tt> files at boot time, you might want to
check out what your setup is.
<tag/Note1/ In FSSTND there is a note about putting <tt>/tmp</tt> on
RAM disk. This, however, is not recommended for the same reasons
as stated for swap. Also, as noted earlier, do not use flash RAM
drives for these directories. One should also keep in mind that some
systems are set to automatically clean <tt>tmp</tt> areas on rebooting.
<tag/Note2/ Older systems had a <tt>/usr/tmp</tt> but this is no longer
recommended and for historical reasons a symbolic link now makes it
point to one of the other <tt>tmp</tt> areas.
</descrip>
(* That was 50 lines, I am home and dry! *)
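As an illustration of keeping an eye on old files, a simple sketch
using <tt/find/ could look like the following; the directory and the
seven day age limit are examples and should be adapted to your own policy.
<tscreen><verb>
# list files in /tmp not accessed for more than 7 days (example policy)
find /tmp -xdev -type f -atime +7 -print
# the same, but removing the files; use with care
find /tmp -xdev -type f -atime +7 -exec rm -f {} \;
</verb></tscreen>
Many setups run something similar from <tt/cron/ already, so check
what your distribution does before adding your own.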
<sect2>Spool Areas (<tt>/var/spool/news</tt> and <tt>/var/spool/mail</tt>)
<p>
<nidx>disk!spool areas</nidx>
<descrip>
<tag/Speed/ High, especially on large news servers. News transfer
and expiring are disk intensive and will benefit from fast drives.
Print spools: low. Consider RAID0 for news.
<tag/Size/ For news/mail servers: whatever you can afford. For
single user systems a few MB will be sufficient if you read
continuously. Joining a list server and taking a holiday is, on the
other hand, not a good idea. (Again the machine I use at work
has 100 MB reserved for the entire <tt>/var/spool</tt>)
<tag/Reliability/ Mail: very high, news: medium, print spool: low. If
your mail is very important (isn't it always?) consider RAID for
reliability.
<tag/Files/ Usually a huge number of files that are around a few
KB in size. Files in the print spool can on the other hand be
few but quite sizable.
<tag/Note/ Some of the news documentation suggests putting all
the <tt>.overview</tt> files on a drive separate from the news
files, check out all news FAQs for more information.
Typical size is about 3-10 percent of total news spool size.
</descrip>
<sect2>Home Directories (<tt>/home</tt>) <label id="home-dirs">
<p>
<nidx>disk!home directories</nidx>
<descrip>
<tag/Speed/ Medium. Although many programs use <tt>/tmp</tt> for temporary
storage, others such as some news readers frequently update files in the
home directory which can be noticeable on large multiuser systems. For
small systems this is not a critical issue.
<tag/Size/ Tricky! On some systems people pay for storage so this
is usually then a question of finance. Large systems such as
<url url="http://www.nyx.net/"
name="Nyx.net">
(which is a free Internet service with mail, news and WWW services)
run successfully with a suggested limit of 100 KB per user and 300 KB as
enforced maximum. Commercial ISPs offer typically about 5 MB in their
standard subscription packages.
If however you are writing books or are doing design work the
requirements balloon quickly.
<tag/Reliability/ Variable. Losing <tt>/home</tt> on a single user machine is
annoying but when 2000 users call you to tell you their home
directories are gone it is more than just annoying. For some their
livelihood relies on what is here. You do regular backups of course?
<tag/Files/ Equally tricky. The minimum setup for a single user
tends to be a dozen files, 0.5 - 5 KB in size. Project related files
can be huge though.
<tag/Note1/ You might consider RAID for either speed or
reliability. If you want extremely high speed and reliability you
might be looking at other operating systems and hardware platforms anyway.
(Fault tolerance etc.)
<tag/Note2/ Web browsers often use a local cache to speed up browsing and
this cache can take up a substantial amount of space and cause much disk
activity. There are many ways of avoiding this kind of performance hit;
for more information see the sections on
<ref id="server-home-dirs" name="Home Directories">
and
<ref id="www" name="WWW">.
<tag/Note3/ Users often tend to use up all available space on the
<tt>/home</tt> partition. The Linux Quota subsystem is capable of
limiting the number of blocks and the number of inodes a single user
ID can allocate on a per-filesystem basis. See the <url
url="http://www.linuxdoc.org/HOWTO/mini/Quota.html" name="Linux Quota mini-HOWTO"> by
Albert M.C. Tam <tt/bertie (at) scn.org/
for details on setup; a brief sketch of the setup follows below.
</descrip>
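As a rough sketch of how quotas are typically enabled (see the Quota
mini-HOWTO above for the authoritative procedure), the file system is
mounted with the <tt/usrquota/ option and limits are then edited per
user; the device, mount point and user name below are examples only.
<tscreen><verb>
# /etc/fstab entry with user quota enabled (example device)
/dev/sdb1   /home   ext2   defaults,usrquota   1 2

# after (re)mounting, build the quota files and switch quotas on
quotacheck -avug
quotaon -avug
# set block and inode limits for a single (example) user
edquota -u someuser
</verb></tscreen>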
<sect2>Main Binaries ( <tt>/usr/bin</tt> and <tt>/usr/local/bin</tt>)<label id="main-binaries">
<p>
<nidx>disk!main binaries</nidx>
<descrip>
<tag/Speed/ Low. Often data is bigger than the programs which are
demand loaded anyway so this is not speed critical. Witness the
successes of live file systems on CD ROM.
<tag/Size/ The sky is the limit but 200 MB should give you most of
what you want for a comprehensive system. A big system, for software
development or a multi purpose server should perhaps reserve 500 MB
both for installation and for growth.
<tag/Reliability/ Low. This is usually mounted under root where all
the essentials are collected. Nevertheless losing all the binaries is
a pain...
<tag/Files/ Variable but usually of the order of 10 - 100 KB.
</descrip>
<sect2>Libraries ( <tt>/usr/lib</tt> and <tt>/usr/local/lib</tt>)
<p>
<nidx>disk!libraries</nidx>
<descrip>
<tag/Speed/ Medium. These are large chunks of data loaded often,
ranging from object files to fonts, all susceptible to bloating. Often
these are also loaded in their entirety and speed is of some use here.
<tag/Size/ Variable. This is for instance where word processors
store their immense font files. The few that have given me feedback on
this report about 70 MB in their various <tt>lib</tt> directories.
A rather complete Debian 1.2 installation can take as much as
250 MB which can be taken as a realistic upper limit.
The following are some of the largest disk space consumers:
GCC, Emacs, TeX/LaTeX, X11 and perl.
<tag/Reliability/ Low. See point <ref id="main-binaries" name="Main binaries">.
<tag/Files/ Usually large with many of the order of 1 MB in size.
<tag/Note/ For historical reasons some programs keep executables in
the lib areas. One example is GCC which has some huge binaries in the
<tt>/usr/lib/gcc/lib</tt> hierarchy.
</descrip>
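If you are unsure how much your own lib areas consume, a quick check
with <tt/du/ gives the numbers for your system; the directories listed
are just the usual suspects and may differ on your installation.
<tscreen><verb>
# summarise disk usage (in kilobytes) of the main library areas
du -s /lib /usr/lib /usr/local/lib /usr/X11R6/lib
</verb></tscreen>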
<sect2>Boot
<p>
<nidx>disk!boot</nidx>
<nidx>disk!1023</nidx>
<nidx>disk!nuni</nidx>
<descrip>
<tag/Speed/ Quite low: after all booting doesn't happen that often
and loading the kernel is just a tiny fraction of the time it takes
to get the system up and running.
<tag/Size/ Quite small, a complete image with some extras
fits on a single floppy so 5 MB should be plenty.
<tag/Reliability/ High. See section below on Root.
<tag/Note 1/ The most important part about the Boot partition is that
on many systems it <em/must/ reside below cylinder 1023. This is a
BIOS limitation that Linux cannot get around.
<tag/Note 1a/ The above is not necessarily true for recent IDE systems
and not for any SCSI disks. For more information check the latest
Large Disk HOWTO.
<tag/Note 2/ Recently a new boot loader has been written that overcomes
the 1023 cylinder limit. For more information check out this
<url url="http://www.linuxforum.com/plug/articles/nuni.html"
name="article">
on nuni.
</descrip>
<sect2>Root
<p>
<nidx>disk!root</nidx>
<descrip>
<tag/Speed/ Quite low: only the bare minimum is here, much of
which is only run at startup time.
<tag/Size/ Relatively small. However it is a good idea to keep
some essential rescue files and utilities on the root partition, and
some people keep several kernel versions there. Feedback suggests about 20 MB would
be sufficient.
<tag/Reliability/ High. A failure here will possibly cause a fair bit
of grief and you might end up spending some time rescuing your boot
partition. With some practice you can of course do this in an hour or
so, but I would think if you have some practice doing this you are
also doing something wrong.
Naturally you do have a rescue disk? Of course it has been updated since
you did your initial installation? There are many ready made rescue
disks as well as rescue disk creation tools you might find valuable.
Presumably investing some time in this saves you from becoming a
root rescue expert.
<tag/Note 1/ If you have plenty of drives you might consider putting
a spare emergency boot partition on a separate physical drive. It will
cost you a little bit of space but if your setup is huge the time saved,
should something fail, will be well worth the extra space.
<tag/Note 2/ For simplicity and also in case of emergencies
it is not advisable to put the root partition on a RAID level 0 system.
Also if you use RAID for your boot partition you have to remember to
have the <tt/md/ option turned on for your emergency kernel.
<tag/Note 3/ For simplicity it is quite common to keep Boot and Root
on the same partition. If you do that, then
in order to boot from LILO it is important that the
essential boot files reside wholly within cylinder 1023. This includes
the kernel as well as files found in <tt>/boot</tt>.
</descrip>
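To make Note 3 above a little more concrete, here is a minimal
<tt/lilo.conf/ sketch where the boot sector and the kernel both live
on a partition placed below cylinder 1023; all device names are
examples only and must match your own setup.
<tscreen><verb>
# /etc/lilo.conf (minimal, illustrative example)
boot=/dev/hda           # where the boot sector is written
image=/boot/vmlinuz     # kernel image, must lie below cylinder 1023
    root=/dev/hda2      # root file system to mount
    label=linux
    read-only
</verb></tscreen>
Remember to rerun <tt/lilo/ after any change to the kernel or this file.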
<sect2>DOS etc.
<p>
<nidx>disk!DOS-related issues</nidx>
At the danger of sounding heretical I have included this little section
about something many reading this document have strong feelings about.
Unfortunately many hardware items come with setup and maintenance tools
based around those systems, so here goes.
<descrip>
<tag/Speed/ Very low. The systems in question are not famed for speed
so there is little point in using prime quality drives. Multitasking or
multi-threading are not available so the command queueing facility found
in SCSI drives will not be taken advantage of. If you have an old IDE
drive it should be good enough. The exception is to some degree Win95
and more notably NT, which have multi-threading support and should
theoretically be able to take advantage of the more advanced features
offered by SCSI devices.
<tag/Size/ The company behind these operating systems
is not famed for writing tight
code so you have to be prepared to spend a few tens of MB depending on
which version of DOS or Windows you install. With an old version of
DOS or Windows you might fit it all in on 50 MB.
<tag/Reliability/ Ha-ha. As the chain is no stronger than the weakest link
you can use any old drive. Since the OS is more likely to scramble itself
than the drive is likely to self destruct you will soon learn the
importance of keeping backups here.
Put another way: "<it/Your mission, should you choose to accept it,
is to keep this partition working. The warranty will self destruct
in 10 seconds.../"
Recently I was asked to justify my claims here. First of all I am not
calling DOS and Windows sorry excuses for operating systems. Secondly
there are various legal issues to be taken into account. Saying there
is a connection between the last two sentences is merely the ravings of the
paranoid. Surely. Instead I shall offer the esteemed reader a few
key words: DOS 4.0, DOS 6.x and various drive compression tools that
shall remain nameless.
</descrip>
<sect1>Explanation of Terms
<p>
<nidx>disk!terms explained</nidx>
Naturally the faster the better but often the happy installer of Linux
has several disks of varying speed and reliability so even though this
document describes performance as 'fast' and 'slow' it is just a rough
guide since no finer granularity is feasible. Even so there are a few
details that should be kept in mind:
<sect2>Speed <label id="speed">
<p>
<nidx>disk!terms explained!speed</nidx>
This is really a rather woolly mix of several terms: CPU load,
transfer setup overhead, disk seek time and transfer rate. It is in
the very nature of tuning that there is no fixed optimum, and in most
cases price is the dictating factor. CPU load is only significant for
IDE systems where the CPU does the transfer itself
but is generally low for SCSI, see SCSI documentation
for actual numbers. Disk seek time is also small, usually in the
millisecond range. This however is not a problem if you use command
queueing on SCSI where you then overlap commands keeping the bus busy
all the time. News spools are a special case consisting of a huge
number of normally small files so in this case seek time can become
more significant.
There are two main parameters that are of interest here:
<descrip>
<tag/Seek/ is usually specified as the average time taken for the
read/write head to seek from one track to another. This parameter
is important when dealing with a large number of small files such
as those found in spool areas.
There is also the extra rotational delay before the desired sector rotates
into position under the head. This delay is dependent on the angular
velocity of the drive, which is why the rotational speed quite often is
quoted for a drive. Common values are 4500, 5400 and 7200 RPM (rotations
per minute). Higher RPM reduces this delay but at a substantial cost.
Also drives working at 7200 RPM have been known to be noisy and to
generate a lot of heat, a factor that should be kept in mind if you
are building a large array or "disk farm". Very recently drives working
at 10000 RPM have entered the market and here the cooling requirements
are even stricter and minimum figures for air flow are given.
<tag/Transfer/ is usually specified in megabytes per second.
This parameter is important when handling large files that
have to be transferred. Library files, dictionaries and image files
are examples of this. Drives featuring a high rotation speed also
normally have fast transfers as transfer speed is proportional to
angular velocity for the same sector density.
</descrip>
It is therefore important to read the specifications for the drives
very carefully, and note that the maximum transfer speed quite often
is quoted for transfers out of the on board cache (burst speed)
and <em>not</em>
directly from the platter (sustained speed).
See also section on
<ref id="power-heating" name="Power and Heating">.
<sect2>Reliability
<p>
<nidx>disk!terms explained!reliability</nidx>
Naturally no-one would want low reliability disks but one might be
better off regarding old disks as unreliable. Also for RAID purposes
(See the relevant information) it is suggested to use a mixed set of disks
so that simultaneous disk crashes become less likely.
So far I have had only one report of total file system failure but
here unstable hardware seemed to be the cause of the problems.
Disks are cheap these days yet people still underestimate the
value of the contents of the drives. If you need higher reliability
make sure you replace old drives and keep spares. It is not unusual
for drives to work more or less continuously for years and years but
what often kills a drive in the end is power cycling.
<sect2>Files
<p>
<nidx>disk!terms explained!files</nidx>
The average file size is important in order to decide the most
suitable drive parameters. A large number of small files makes the
average seek time important whereas for big files the transfer speed
is more important. The command queueing in SCSI devices is very
handy for handling large numbers of small files, but for transfer EIDE
is not too far behind SCSI and normally much cheaper than SCSI.
<sect>File Systems
<p>
<nidx>disk!file systems</nidx>
Over time the requirements for file systems have increased and the
demands for large structures, large files, long file names and more
has prompted ever more advanced file systems, the system that
accesses and organises the data on mass storage.
Today there is a large number of file systems to choose from and this
section will describe these in detail.
The emphasis is on Linux but with more input I will be happy to add
information for a wider audience.
<sect1>General Purpose File Systems
<p>
Most operating systems usually have a general purpose file system for
every day use for most kinds of files, reflecting available features
in the OS such as permission flags, protection and recovery.
<sect2><tt/minix/
<p>
<nidx>disk!file system!minix</nidx>
This was the original fs for Linux, back in the days when Linux was hosted
on Minix machines. It is simple but limited in features and hardly ever
used these days other than on some rescue disks as it is rather compact.
<sect2><tt/xiafs/ and <tt/extfs/
<p>
<nidx>disk!file system!xiafs</nidx>
<nidx>disk!file system!extfs</nidx>
These are also old, have fallen into disuse and are no longer recommended.
<sect2><tt/ext2fs/
<p>
<nidx>disk!file system!ext2fs</nidx>
This is the established standard for general purpose use in the Linux world.
It is fast, efficient and mature and is under continuous development, and
features such as ACL and transparent compression are on the horizon.
For more information check the
<url url="http://web.mit.edu/tytso/www/linux/ext2.html"
name="ext2fs">
home page.
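Creating an <tt/ext2fs/ file system is straightforward; a minimal
sketch follows, where the device name and the reserved block
percentage are examples you should adapt to your own system.
<tscreen><verb>
# make an ext2 file system, reserving 1 percent for root (example device)
mke2fs -m 1 /dev/sdb1
# check it and print the superblock summary
e2fsck -f /dev/sdb1
dumpe2fs -h /dev/sdb1
</verb></tscreen>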
<sect2><tt/ext3fs/
<p>
<nidx>disk!file system!ext3fs</nidx>
This is the name for the upcoming successor to <tt/ext2fs/ due to enter
the stable kernel in the near future. Many features are added to
<tt/ext2fs/ but to avoid confusion over the name after such a radical
upgrade the name will be changed too. You may have heard of it already;
the source code is now in beta release. <!--not yet available. -->
Patches are available at
<url url="ftp://ftp.linux.org.uk/pub/linux/sct/fs/jfs"
name="Linux.org">.
<sect2><tt/ufs/
<p>
<nidx>disk!file system!ufs</nidx>
This is the fs used by BSD and variants thereof. It is mature but also
developed for older types of disk drives where geometries were known. The
fs uses a number of tricks to optimise performance but as disk geometries
are translated in a number of ways the net effect is no longer so optimal.
<sect2><tt/efs/
<p>
<nidx>disk!file system!efs</nidx>
The Extent File System (efs) is Silicon Graphics' early file system
widely used on IRIX before version 6.0 after which xfs has taken over.
While migration to xfs is encouraged efs is still supported
and much used on CDs.
There is a Linux driver available in early beta stage, available at
<url url="http://aeschi.ch.eu.org/efs/"
name="Linux extent file system">
home page.
<sect2><tt/XFS/
<p>
<nidx>disk!file system!XFS</nidx>
<url url="http://www.sgi.com/"
name="Silicon Graphics Inc (sgi)">
has started porting its mainframe grade file system to Linux.
Source is not yet available as they are busily cleaning out
legal encumbrance but once that is done they will provide the
source code under GPL.
More information is already available on the
<!-- <url url="http://www.sgi.com/projects/xfs/" 000502 -->
<url url="http://oss.sgi.com/projects/xfs/"
name="XFS project page">
at SGI.
<sect2><tt/reiserfs/
<p>
<nidx>disk!file system!reiserfs</nidx>
<nidx>disk!file system!tree based</nidx>
As of July 23rd, 1997
Hans Reiser <tt/reiser (at) RICOCHET.NET/
has put up the source to his tree based
<!-- <url url="http://idiom.com/&tilde;beverly/reiserfs.html" 990919 -->
<!-- <url url="http://devlinux.com/namesys/" 000501 -->
<!-- <url url="http://devlinux.com/projects/reiserfs/" 001203 -->
<url url="http://www.namesys.com"
name="reiserfs">
on the web. His filesystem has some very interesting features,
is much faster than <tt/ext2fs/ and is in use by a number of people.
Hopefully it will be ready for kernel 2.4.0 which might be ready at
the end of the year.
<!-- it is still very experimental and
difficult to integrate with the standard kernel. Expect some
interesting developments in the future - this is different from your
"average log based file system for Linux" project, because Hans
already has working code. -->
<sect2><tt/enh-fs/
<p>
<nidx>disk!file system!enhanced fs</nidx>
<!-- removed 990919
Currently in alpha stage the
<url url="http://www.coker.com.au/&tilde;russell/enh-fs.html"
name="Enhanced File System">
project aims to combine
file system and volume management into a single layer.
-->
The Enhanced File System project is now dead.
<sect2><tt/Tux2 fs/
<p>
<nidx>disk!file system!Tux2 fs</nidx>
This is a variation on the <tt/ext2fs/ that adds robustness
in case of unexpected interruptions such as power failure.
After such an event <tt/Tux2 fs/ will restart with the file system
in a consistent, recently recorded state without fsck or
other recovery operations. To achieve this <tt/Tux2 fs/ uses
a newly designed algorithm called Phase Tree.
More information can be found at the
<url url="http://tux2.sourceforge.net"
name="project home page">.
<sect1>Microsoft File Systems
<p>
<nidx>disk!file system!Microsoft</nidx>
<nidx>disk!file system!confusion</nidx>
This company is responsible for a lot, including a number of filesystems
that have at the very least caused confusion.
<sect2><tt/fat/
<p>
<nidx>disk!file system!fat</nidx>
Actually there are 2 <tt/fat/s out there, <tt/fat12/ and <tt/fat16/,
depending on the partition size used, but fortunately the difference
is so minor that the whole issue is transparent.
On the plus side these are fast and simple and most OSes understand
them and can both read and write this fs. And that is about it.
The minus side is limited safety, severely limited permission flags
and atrocious scalability. For instance with <tt/fat/ you cannot
have partitions larger than 2 GB.
<sect2><tt/fat32/
<p>
<nidx>disk!file system!fat32</nidx>
After about 10 years Microsoft realised <tt/fat/ was about, well, 10 years
behind the times and created this fs which scales reasonably well.
Permission flags are still limited.
NT 4.0 cannot read this file system but Linux can.
<sect2><tt/vfat/
<p>
<nidx>disk!file system!vfat</nidx>
At the same time as Microsoft launched <tt/fat32/ they also added
support for long file names, known as <tt/vfat/.
Linux reads <tt/vfat/ and <tt/fat32/ partitions by mounting with
type <tt/vfat/.
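For instance, a <tt/fat32/ or <tt/vfat/ partition can be mounted by
hand or from <tt>/etc/fstab</tt>; the device name and mount point
below are examples only.
<tscreen><verb>
# mount a Windows partition with long file name support (example device)
mount -t vfat /dev/hda1 /mnt/dos

# or the equivalent /etc/fstab entry
/dev/hda1   /mnt/dos   vfat   defaults   0 0
</verb></tscreen>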
<sect2><tt/ntfs/
<p>
<nidx>disk!file system!ntfs</nidx>
This is the native fs of Win-NT but as complete information is not available
there is limited support for other OSes.
<sect1>Logging and Journaling File Systems
<p>
<nidx>disk!file system!logging file systems</nidx>
<nidx>disk!file system!journaling file systems</nidx>
These take a radically different approach to file updates by
recording modifications to files in a log and checkpointing
the log at some later time.
Reading is roughly as fast as traditional file systems that
always update the files directly.
Writing is much faster as only updates are appended to a log.
All this is transparent to the user. It is in reliability and
particularly in checking file system integrity that these
file systems really shine.
Since the data before last checkpointing is known to be good
only the log has to be checked, and this is much faster than
for traditional file systems.
Note that while
<em/logging/ filesystems keep track of changes made to both data and inodes,
<em/journaling/ filesystems keep track only of inode changes.
Linux has quite a choice of such file systems but none are
yet of production quality. Some are also on hold.
<itemize>
<item>Adam Richter from Yggdrasil posted some time ago that they have been
working on a compressed log file based system but that this project is
currently on hold. Nevertheless a non-working version is available on
their FTP server. Check out
<url url="ftp://ftp.yggdrasil.com/private/adam"
name="the Yggdrasil ftp server">
where special patched versions of the kernel can be found.
<item>Another project is the
<!-- <url url="http://collective.cpoint.net/lfs/" 000503 -->
<url url="http://outflux.net/projects/lfs/"
name="Linux log-structured Filesystem Project">
which sadly also is on hold. Nevertheless this page contains
much information on the topic.
<item>Then there is the
<url url="http://www.complang.tuwien.ac.at/czezatke/lfs.html"
name="LinLogFS -- A Log-Structured Filesystem For Linux">
(formerly known as dtfs)
which seems to be going strong. It is still in alpha but sufficiently
complete to make programs run off this file system.
<item>Finally there is the
<url url="http://developer.axis.com/software/jffs/"
name="Journaling Flash File System">
designed for their embedded diskless systems such as
their Linux based web camera.
</itemize>
Note that <tt/ext3fs/, <tt/XFS/ and <tt/reiserfs/ also have
features for logging or journaling.
<sect1>Read-only File Systems
<p>
<nidx>disk!file system!read-only file systems</nidx>
Read-only media has not escaped the ever increasing complexity
seen in more general file systems, so again there is a large selection
to choose from with corresponding opportunities for exciting mistakes.
Note that <tt/ext2fs/ works quite well on a CD-ROM
and seems to save space while offering the normal file system features
such as long file names and permissions that can be retained when
copying files across to read-write media. Also having <!-- <file>/dev</file> -->
<htmlurl url="file:///dev/"
name="/dev">
on a CD-ROM is possible.
<nidx>disk!file system!CD-ROM</nidx>
<nidx>disk!file system!DVD</nidx>
<nidx>disk!file system!loopback</nidx>
Most of these are used with CD-ROM media but the new
DVD can also be used, and you can even use them through the loopback device
on a hard disk file for verifying an image before burning it.
<nidx>disk!file system!rom file systems</nidx>
<nidx>disk!file system!romfs</nidx>
There is a read-only <tt/romfs/ for Linux but as that is not disk
related nothing more will be said about it here.
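As mentioned above, an image can be inspected through the loopback
device before it is burned; a minimal sketch follows, where the image
name, source directory and mount point are examples.
<tscreen><verb>
# create an iso9660 image with Rock Ridge and Joliet extensions
mkisofs -r -J -o image.iso /path/to/files
# mount it read-only through the loopback device and inspect it
mount -o loop,ro -t iso9660 image.iso /mnt/cdrom
</verb></tscreen>
Note that loopback support must be enabled in the kernel for this to work.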
<sect2><tt/High Sierra/
<p>
<nidx>disk!file system!High Sierra</nidx>
This was one of the earliest standards for CD-ROM formats,
supposedly named after the hotel where the final agreement took place.
<tt/High Sierra/ was so limited in features that new extensions simply
had to appear and while there has been no end to new formats the original
<tt/High Sierra/ remains the common precursor and is therefore still
widely supported.
<sect2><tt/iso9660/
<p>
<nidx>disk!file system!iso9660</nidx>
The International Standards Organisation made their extensions and
formalised the standard into what we know as the <tt/iso9660/ standard.
The Linux iso9660 file system supports both High Sierra as well as
<tt/Rock Ridge/ extensions.
<sect2><tt/Rock Ridge/
<p>
<nidx>disk!file system!Rock Ridge</nidx>
Not everyone accepts limits like short filenames and lack of permissions
so very soon the <tt/Rock Ridge/ extensions appeared to rectify these
shortcomings.
<sect2><tt/Joliet/
<p>
<nidx>disk!file system!Joliet</nidx>
Microsoft, not to be outdone in the standards extension game, decided
it should extend CD-ROM formats with some internationalisation features
and called it <tt/Joliet/.
Linux supports this standard in kernels 2.0.34 or newer.
You need to enable NLS in order to use it.
<sect2>Trivia
<p>
<nidx>disk!file system!Trivia</nidx>
Joliet is a city outside Chicago; best known for being the site of
the prison where Jake was locked up in the movie "Blues Brothers."
Rock Ridge (the UNIX extensions to ISO 9660) is named
after the (fictional) town in the movie "Blazing Saddles."
<sect2><tt/UDF/
<p>
<nidx>disk!file system!UDF</nidx>
With the arrival of DVD with up to about 17 GB of storage capacity
the world seemingly needed another format, this time ambitiously named
Universal Disk Format (UDF).
This is intended to replace <tt/iso9660/ and will be required for DVD.
Currently this is not in the standard Linux kernel but a project
is underway to make a
<url url="http://trylinux.com/projects/udf/index.html" <!-- 000502 -->
name="UDF driver">
for Linux. Patches and documentation are available.
More information is also available at the
<url url="http://atv.ne.mediaone.net/linux-dvd/"
name="Linux and DVDs">
page.
<!-- <url url="http://www.rpi.edu/&tilde;veliaa/linux-dvd" -->
<sect1>Networking File Systems
<p>
<nidx>disk!file system!networking file systems</nidx>
There is a large number of networking technologies available that
let you distribute disks throughout local or even global networks.
This is somewhat peripheral to the topic of this HOWTO but as it can
be used with local disks I will cover this briefly. It would be best
if someone (else) turned this into a separate HOWTO...
<sect2><tt/NFS/
<p>
<nidx>disk!file system!NFS</nidx>
This is one of the earliest systems that allows mounting a file space
on one machine onto another. There are a number of problems with <tt/NFS/
ranging from performance to security but it has nevertheless become
established.
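A typical <tt/NFS/ setup is a one-line export on the server and a
matching <tt/fstab/ entry on the client; the host names and paths
below are examples only.
<tscreen><verb>
# on the server, /etc/exports (example host and options)
/export/home   client.example.com(rw)

# on the client, /etc/fstab
server.example.com:/export/home   /home   nfs   defaults   0 0
</verb></tscreen>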
<sect2><tt/AFS/
<p>
<nidx>disk!file system!AFS</nidx>
This is a system that allows efficient sharing of files
across large networks. Starting out as an academic project
it is now sold by
<url url="http://www.transarc.com"
name="Transarc">
whose home page gives you more details.
Derek Atkins, of MIT, ported AFS to Linux and has also set up the
Linux AFS mailing list (
<htmlurl url="mailto:linux-afs@mit.edu"
name="linux-afs@mit.edu">),
which is open to the public.
Requests to join the list should go to
<htmlurl url="mailto:linux-afs-request@mit.edu"
name="linux-afs-request@mit.edu">
and finally bug reports should be directed to
<htmlurl url="mailto:linux-afs-bugs@mit.edu"
name="linux-afs-bugs@mit.edu">.
Important: as AFS uses encryption it is
restricted software and cannot easily be exported from the US.
IBM, who owns Transarc, has announced the availability of the latest
version of the client as well as the server for Linux.
Arla is a free AFS implementation, check the
<url url="http://www.stacken.kth.se/projekt/arla/"
name="Arla homepage">
for more information as well as documentation.
<sect2>Coda
<p>
<nidx>disk!file system!Coda</nidx>
<!-- Major input from Dr. A V LeBlanc -->
<!-- Work has started on a free replacement of <tt/AFS/ and is called -->
A networking filesystem similar to <tt/AFS/ is underway and is called
<url url="http://coda.cs.cmu.edu/"
name="Coda">.
This is designed to be more robust and fault tolerant than <tt/AFS/,
and supports mobile, disconnected operations.
Currently it does not scale very well, and does not really have
proper administrative tools, as <tt/AFS/ does and <tt/ARLA/ is
beginning to.
<sect2><tt/nbd/
<p>
<nidx>disk!file system!nbd</nidx>
<nidx>disk!device!network block device</nidx>
The
<url url="http://atrey.karlin.mff.cuni.cz/&tilde;pavel/"
name="Network Block Device">
(<tt/nbd/) is available in Linux kernel 2.2
and later and offers reportedly excellent performance. The interesting
thing here is that it can be combined with RAID (see later).
<sect2><tt/enbd/
<p>
<nidx>disk!file system!enbd</nidx>
<nidx>disk!device!enhanced network block device</nidx>
The
<url url="http://www.it.uc3m.es/&tilde;ptb/nbd" <!-- 001213 -->
name="Enhanced Network Block Device">
(<tt/enbd/) is a project to enhance the <tt/nbd/ with
features such as block journaled multi channel communications,
internal failover and automatic balancing between channels
and more.
The intended use is for RAID over the net.
<sect2>GFS
<p>
<nidx>disk!file system!GFS</nidx>
<nidx>disk!device!Global File System</nidx>
The
<url url="http://gfs.lcse.umn.edu/"
name="Global File System">
is a new file system designed for storage across a wide area network.
It is currently in the early stages and more information will come
later.
<sect1>Special File Systems
<p>
In addition to the general file systems there are also a number of
more specific ones, usually providing higher performance or other
features, though often with a tradeoff in other respects.
<sect2><tt/tmpfs/ and <tt/swapfs/ <label id="tmpfs">
<p>
<nidx>disk!file system!tmpfs</nidx>
<nidx>disk!file system!swapfs</nidx>
For short term fast file storage SunOS offers <tt/tmpfs/ which is
about the same as the <tt/swapfs/ on NeXT.
This overcomes the inherent slowness in <tt/ufs/ by caching file data
and keeping control information in memory. This means that data on such
a file system will be lost when rebooting and it is therefore mainly
suitable for the <tt>/tmp</tt> area but not for <tt>/var/tmp</tt>, which is
where temporary data that must survive a reboot is placed.
SunOS offers very limited tuning for <tt/tmpfs/ and the number of
files is even limited by total physical memory of the machine.
<!-- Linux does not have an equivalent to such file system and it is felt
by many that <tt/ext2fs/ is fast enough to eliminate the need. -->
Linux now features <tt/tmpfs/ since kernel version 2.4; it is
enabled by turning on virtual memory file system support (former shm fs).
Under certain circumstances <tt/tmpfs/ can lock up the system in
early kernel versions, so make sure you use version 2.4.6 or later.
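A <tt/tmpfs/ can be mounted by hand or from <tt>/etc/fstab</tt>; the
size limit shown here is an example and should be matched to your
RAM and swap.
<tscreen><verb>
# mount a 64 MB tmpfs on /tmp (size is an example)
mount -t tmpfs -o size=64m tmpfs /tmp

# or the equivalent /etc/fstab entry
tmpfs   /tmp   tmpfs   size=64m   0 0
</verb></tscreen>
Remember that whatever lives here is lost on every reboot.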
<sect2><tt/userfs/
<p>
<nidx>disk!file system!userfs</nidx>
<nidx>disk!file system!arcfs</nidx>
<nidx>disk!file system!docfs</nidx>
The user file system (<tt/userfs/) allows a number of extensions to
traditional file system use such as
FTP based file system, compression (<tt/arcfs/) and fast prototyping
and many other features. The <tt/docfs/ is based on this filesystem.
Check the
<url url="http://www.goop.org/&tilde;jeremy/userfs/"
name="userfs homepage">
for more information.
<sect2><tt/devfs/
<p>
<nidx>disk!file system!devfs</nidx>
When disks are added, removed or just fail it is likely that
disk device names of the remaining disks will change.
For instance if <tt/sdb/ fails then the old <tt/sdc/ becomes <tt/sdb/,
the old <tt/sdd/ becomes <tt/sdc/ and so on.
Note that in this case <tt/hda/, <tt/hdb/ etc will remain unchanged.
Likewise if a new drive is added the reverse may happen.
There is no guarantee that SCSI ID 0 becomes <tt/sda/ and that adding
disks in increasing ID order will just add a new device name without
renaming previous entries, as some SCSI drivers assign from ID 0 and up
while others reverse the scanning order.
Likewise adding a SCSI host adapter can also cause renaming.
Generally device names are assigned in the order they are found.
The source of the problem lies in the limited number of bits available
for major and minor numbering in the device files used to describe the
device itself. You can see these in the <!-- <file>/dev</file> -->
<htmlurl url="file:///dev/"
name="/dev">
directory; info
on the numbering and allocation can be found in <tt/man MAKEDEV/.
Currently there are 2 solutions to this problem in various stages of
development:
<descrip>
<tag/scsidev/ works by creating a database of drives and where they
belong; check <em/man scsidev/ and the
<htmlurl url="http://www.garloff.de/kurt/linux/scsidev/"
name="scsidev home page">
for more information.
<tag/devfs/ is a more long term project aimed at getting around the
whole business of device numbering by making the <!-- <file>/dev</file> -->
<htmlurl url="file:///dev/"
name="/dev">
directory a kernel file system in the same way as <!-- <file>/procfs</file> -->
<htmlurl url="file:///proc/"
name="/proc">
is.
More information will appear as it becomes available.
</descrip>
<sect2><tt/smugfs/
<p>
<nidx>disk!file system!smugfs</nidx>
<nidx>disk!file system!huge files</nidx>
For a number of reasons it is currently difficult to have files
bigger than 2 GB. One file system that tries to overcome this
limit is <tt/smugfs/ which is very fast but also simple. For instance
there are no directories and the block allocation is simple.
It is available as
<!-- http://atrey.karlin.mff.cuni.cz/pub/local/mj/linux/smugfs-0.0.tar.gz -->
<url url="ftp://atrey.karlin.mff.cuni.cz/pub/local/mj/linux/"
name="compressed tarred source code">
and while it worked with kernel version 2.1.85 it is quite possible some
work is required to make it fit into newer kernels. Also the low version
number (0.0) suggests extra care is required.
<sect1>File System Recommendations
<p>
There is a jungle of choices but generally it is recommended to
use the general file system that comes with your distribution.
If you use <tt/ufs/ and have some kind of <tt/tmpfs/ available
you should first start off with the general file system to get
an idea of the space requirements and if necessary buy more
RAM to support the size of <tt/tmpfs/ you need. Otherwise you
will end up with mysterious crashes and lost time.
If you use dual boot and need to transfer data between the two
OSes one of the simplest ways is to use an appropriately sized
partition formatted with <tt/fat/ as most systems can reliably
read and write this.
Remember the limit of 2 GB for <tt/fat/ partitions.
For more information of file system interconnectivity you can
check out the
<!-- <url url="http://www.ceid.upatras.gr/&tilde;gef/fs/" 000502 -->
<url url="http://students.ceid.upatras.gr/&tilde;gef/fs/oldindex.html"
name="file system">
page
which has been superseded by
<url url="http://www.penguin.cz/&tilde;mhi/fs/"
name="file system">
and the article
<url url="http://linuxtoday.com/stories/5556.html"
name="Kragen's Amazing List of Filesystems">.
That guide is being superseded by a HOWTO which is underway and
a link will be added when it is ready.
To avoid total havoc with device renaming if a drive fails
check out the scanning order of your system and try to keep
your root system on <tt/hda/ or <tt/sda/ and removable media
such as ZIP drives at the end of the scanning order.
<sect>Technologies <label id="technologies">
<p>
<nidx>disk!technologies</nidx>
In order to decide how to get the most of your devices you need to
know what technologies are available and their implications. As always
there can be some tradeoffs with respect to speed, reliability, power,
flexibility, ease of use and complexity.
Many of the techniques described below can be stacked in a number
of ways to maximise performance and reliability, though at the cost
of added complexity.
<sect1>RAID<label id="RAID">
<p>
<nidx>disk!technologies!RAID</nidx>
This is a method of increasing reliability, speed or both by using multiple
disks in parallel thereby decreasing access time and increasing transfer
speed. A checksum or mirroring system can be used to increase reliability.
Large servers can take advantage of such a setup but it might be overkill
for a single user system unless you already have a large number of disks
available. See other documents and FAQs for more information.
For Linux one can set up a RAID system using either software
(the <tt>md</tt> module in the kernel), a Linux compatible
controller card (PCI-to-SCSI) or a SCSI-to-SCSI controller. Check the
documentation for what controllers can be used. A hardware solution is
usually faster, and perhaps also safer, but comes at a significant cost.
A summary of available hardware RAID solutions for Linux is available
at
<url url="http://www.Linux-Consulting.com/Raid/Docs/raid_hw.txt"
name="Linux Consulting">.
<sect2>SCSI-to-SCSI<label id="SCSI-to-SCSI">
<p>
<nidx>disk!technologies!RAID!SCSI-to-SCSI</nidx>
SCSI-to-SCSI controllers are usually implemented as complete cabinets
with drives and a controller that connects to the computer with a
second SCSI bus. This makes the entire cabinet of drives look like a
single large, fast SCSI drive and requires no special RAID driver. The
disadvantage is that the SCSI bus connecting the cabinet to the
computer becomes a bottleneck.
A significant disadvantage for people with large disk farms is that there
is a limit to how many SCSI entries there can be in the <!-- <tt>/dev</tt> -->
<htmlurl url="file:///dev/"
name="/dev">
directory. In these cases using SCSI-to-SCSI will conserve entries.
Usually they are configured via the front panel or with a terminal
connected to their on-board serial interface.
Some manufacturers of such systems are
<url url="http://www.cmd.com"
name="CMD">
and
<url url="http://www.syred.com"
name="Syred">
whose web pages describe several systems.
<sect2>PCI-to-SCSI<label id="PCI-to-SCSI">
<p>
<nidx>disk!technologies!RAID!PCI-to-SCSI</nidx>
PCI-to-SCSI controllers are, as the name suggests,
connected to the high speed PCI
bus and therefore do not suffer from the same bottleneck as the
SCSI-to-SCSI controllers. These controllers require special drivers
but you also get the means of controlling the RAID configuration over
the network which simplifies management.
Currently only a few families of PCI-to-SCSI host adapters
are supported under Linux.
<descrip>
<tag/DPT/
The oldest and most mature is a range of controllers from
<url url="http://www.dpt.com"
name="DPT">
including SmartCache I/III/IV and SmartRAID I/III/IV controller families.
These controllers are supported by the EATA-DMA driver in
the standard kernel. This company also has an informative
<url url="http://www.dpt.com"
name="home page">
which also describes various general aspects
of RAID and SCSI in addition to the product related information.
More information from the author of the DPT controller drivers
(EATA* drivers) can be found at his pages on
<!-- Old links updated 971021
<url url="http://www.i-connect.net/&tilde;mike/scsi/"
name="SCSI">
and
<url url="http://www.i-connect.net/&tilde;mike/scsi/dpt/"
name="DPT">.
-->
<url url="http://www.uni-mainz.de/&tilde;neuffer/scsi/"
name="SCSI">
and
<url url="http://www.uni-mainz.de/&tilde;neuffer/scsi/dpt/"
name="DPT">.
These are not the fastest but have a good track record of
proven reliability.
Note that the maintenance tools for DPT controllers currently
run under DOS/Win only so you will need a small DOS/Win partition
for some of the software. This also means you have to boot the
system into Windows in order to maintain your RAID system.
<tag/ICP-Vortex/
A very recent addition is a range of controllers from
<url url="http://www.icp-vortex.com"
name="ICP-Vortex">
featuring up to 5 independent channels and very fast hardware
based on the i960 chip. The Linux driver was written by the
company itself which shows they support Linux.
As ICP-Vortex supplies the maintenance software for Linux it is
not necessary to reboot into other operating systems for the
setup and maintenance of your RAID system. This also saves you
extra downtime.
<tag/Mylex DAC-960/
This is one of the latest entries which is out in early beta.
More information as well as drivers are available at
<url url="http://www.dandelion.com/Linux/DAC960.html"
name="Dandelion Digital's Linux DAC960 Page">.
<tag/Compaq Smart-2 PCI Disk Array Controllers/
Another very recent entry and currently in beta release is the
<url url="http://www.insync.net/&tilde;frantzc/cpqarray.html"
name="Smart-2">
driver.
<tag/IBM ServeRAID/
IBM has released their
<url url="http://www.developer.ibm.com/welcome/netfinity/serveraid_beta.html"
name="driver">
as GPL.
</descrip>
<!--
SCSI-to-SCSI-controllers are small computers themselves, often with
a substantial amount of cache RAM. To the host system they mask
themselves as a gigantic, fast and reliable SCSI disk whereas to
their disks they look like the computer's SCSI host adapter. Some of
these controllers have the option to talk to multiple hosts
simultaneously. Since these controllers look to the host as a
normal, albeit large SCSI drive they need no special support from
the host system. Usually they are configured via the front panel or
with a vt100 terminal emulator connected to their on-board serial
interface.
Very recently I have heard that Syred also makes SCSI-to-SCSI
controllers that are supported under Linux. I have no more information
about this yet but will come back with more information soon. In the
mean time check out their
<url url="http://www.syred.com"
name="home">
pages for more information.
-->
<sect2>Software RAID<label id="soft-raid">
<p>
<nidx>disk!technologies!RAID!Software RAID</nidx>
A number of operating systems offer software RAID using
ordinary disks and controllers. Cost is low and performance
for raw disk IO can be very high.
As this can be very CPU intensive it increases the load noticeably
so if the machine is CPU bound in performance rather than IO bound
you might be better off with a hardware PCI-to-SCSI RAID controller.
Real cost, performance and especially reliability of software
vs. hardware RAID is a very controversial topic. Reliability
on Linux systems has been very good so far.
The current software RAID project on Linux is the <tt/md/ system
(multiple devices) which offers much more than RAID so it is
described in more details later.
<sect2>RAID Levels<label id="raid-levels">
<p>
<nidx>disk!technologies!RAID!RAID levels</nidx>
RAID comes in many levels and flavours, of which I will give a brief
overview here. Much has been written about it and the
interested reader is recommended to read more about this in the
<url url="http://ostenfeld.dk/&tilde;jakob/Software-RAID.HOWTO/"
name="Software RAID HOWTO">.
<itemize>
<item>RAID <em/0/ is not redundant at all but offers the best
throughput of all levels here. Data is striped across a number of
drives so read and write operations take place in parallel across
all drives. On the other hand if a single drive fails then
everything is lost. Did I mention backups?
<item>RAID <em/1/ is the most primitive method of obtaining redundancy
by duplicating data across all drives. Naturally this is
massively wasteful but you get one substantial advantage which is
fast access.
The drive that accesses the data first wins. Transfers
are not any faster than for a single drive, even though you might
get some faster read transfers by using one track reading per
drive.
Also if you have only 2 drives this is the only method of achieving
redundancy.
<item>RAID <em/2/ and <em/4/ are not so common and are not covered
here.
<item>RAID <em/3/ uses a number of disks (at least 2) to store data
in a striped RAID 0 fashion. It also uses an additional redundancy
disk to store the XOR sum of the data from the data disks. Should
the redundancy disk fail, the system can continue to operate as if
nothing happened. Should any single data disk fail the system can
compute the data on this disk from the information on the redundancy
disk and all remaining disks. Any double fault will bring the whole
RAID set off-line.
RAID 3 makes sense only with at least 2 data disks (3 disks
including the redundancy disk). Theoretically there is no limit for
the number of disks in the set, but the probability of a fault
increases with the number of disks in the RAID set. Usually the
upper limit is 5 to 7 disks in a single RAID set.
Since RAID 3 stores all redundancy information on a dedicated disk
and since this information has to be updated whenever a write to any
data disk occurs, the overall write speed of a RAID 3 set is limited
by the write speed of the redundancy disk. This, too, is a limit for
the number of disks in a RAID set. The overall read speed of a RAID
3 set with all data disks up and running is that of a RAID 0 set
with that number of data disks. If the set has to reconstruct data
stored on a failed disk from redundant information, the performance
will be severely limited: All disks in the set have to be read and
XOR-ed to compute the missing information.
<item>RAID <em/5/ is just like RAID 3, but the redundancy
information is spread on all disks of the RAID set. This improves
write performance, because load is distributed more evenly between
all available disks.
</itemize>
There are also hybrids available based on RAID 0 or 1 and one other
level. Many combinations are possible but I have only seen a few
referred to. These are more complex than the above mentioned
RAID levels.
RAID <em>0/1</em> combines striping with duplication which
gives very high transfers combined with fast seeks as well as
redundancy. The disadvantage is high disk consumption as well as
the above mentioned complexity.
RAID <em>1/5</em> combines the speed and redundancy benefits of
RAID5 with the fast seek of RAID1. Redundancy is improved compared
to RAID 0/1 but disk consumption is still substantial. Implementing
such a system would involve typically more than 6 drives, perhaps
even several controllers or SCSI channels.
<sect1>Volume Management<label id="vol-mgmnt">
<p>
<nidx>disk!technologies!volume management</nidx>
Volume management is a way of overcoming the constraints of fixed
sized partitions and disks while still having a control of where
various parts of file space resides. With such a system you can
add new disks to your system and add space from this drive to parts
of the file space where needed, as well as migrating data out from
a disk developing faults to other drives before catastrophic failure
occurs.
The system developed by
<url url="http://www.veritas.com"
name="Veritas">
has become the defacto standard for logical volume management.
Volume management is for the time being an area where Linux is lacking.
One project is the virtual partition system
<!-- <url url="http://www.uiuc.edu/ph/www/roth" 000503 -->
<url url="http://www-wsg.cso.uiuc.edu/&tilde;roth/"
name="VPS">
that will reimplement many of the volume management functions found in
IBM's AIX system. Unfortunately this project is currently on hold.
Another project is the
<!-- <url url="http://linux.msede.com/lvm/" 001210 -->
<url url="http://www.sistina.com/lvm/"
name="Logical Volume Manager">
project that is similar to a project by HP.
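To give an idea of how such a Logical Volume Manager is used, here is
a minimal sketch using the Linux LVM tools; the device name, volume
group name and sizes are all examples.
<tscreen><verb>
# initialise a disk partition for LVM use (example device)
pvcreate /dev/sdb1
# create a volume group and a logical volume in it
vgcreate vg00 /dev/sdb1
lvcreate -L 500M -n home vg00
# put a file system on the logical volume and mount it
mke2fs /dev/vg00/home
mount /dev/vg00/home /home
# later the volume can be grown when new disks are added
# (the file system itself must then be resized separately)
lvextend -L +500M /dev/vg00/home
</verb></tscreen>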
<sect1>Linux <tt/md/ Kernel Patch
<p>
<nidx>disk!technologies!md</nidx>
<nidx>disk!technologies!raid</nidx>
<nidx>disk!technologies!striping</nidx>
<nidx>disk!technologies!translucence</nidx>
The Linux Multi Disk (md) provides a number of block level features
in various stages of development.
RAID 0 (striping) and concatenation are very solid and in production quality
and also RAID 4 and 5 are quite mature.
It is also possible to stack some
levels, for instance mirroring (RAID 1) two pairs of drives,
each pair set up as striped disks (RAID 0),
which offers the speed of RAID 0 combined with the reliability of RAID 1.
In addition to RAID this system offers (in alpha stage) block level
volume management and soon also translucent file space.
Since this is done on the block level it can be used in combination
with any file system, even for <tt/fat/ using Wine.
Think very carefully what drives you combine so you can operate all drives
in parallel, which gives you better performance and less wear. Read more
about this in the documentation that comes with <tt/md/.
<!-- radical rework 000123 0.23f
Unfortunately the documentation is rather old and in parts misleading and
only refers to <tt/md/ version 0.35 which uses old style setup.
The new system is very different and will soon be released as version 1.0
but is currently undocumented. If you wish to try it out you should follow
the <tt/linux-raid/ mailing list.
Documentation is improving and a
<url url="http://ostenfeld.dk/&tilde;jakob/Software-RAID.HOWTO/"
name="Software RAID HOWTO">
is in progress.
-->
Unfortunately the Linux software RAID has split into two trees,
the old stable versions 0.35 and 0.42 which are documented in the
official
<url url="http://linas.org/linux/Software-RAID/Software-RAID.html"
name="Software-RAID HOWTO">
and the newer less stable 0.90 series which is documented in the
unofficial
<url url="http://ostenfeld.dk/&tilde;jakob/Software-RAID.HOWTO/"
name="Software RAID HOWTO">
which is a work in progress.
A
<url url="http://www-mddsp.enel.ucalgary.ca/People/adilger/online-ext2/"
name="patch for online growth of ext2fs">
is available in early stages
and related work is taking place at
<url url="http://ext2resize.sourceforge.net/"
name="the ext2fs resize project">
at Sourceforge.
<!-- &&& check positioning on the above... -->
Hint: if you cannot get it to work properly you have probably forgotten to set
the <tt/persistent-superblock/ flag. Your best documentation is currently
the source code.
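For the newer 0.90 series, a minimal sketch of an <tt>/etc/raidtab</tt>
describing a two-disk RAID 0 set might look like the following; the
device names and chunk size are examples, and note the
<tt/persistent-superblock/ flag mentioned above.
<tscreen><verb>
# /etc/raidtab (illustrative example for RAID 0 across two disks)
raiddev /dev/md0
    raid-level              0
    nr-raid-disks           2
    persistent-superblock   1
    chunk-size              32
    device                  /dev/sda2
    raid-disk               0
    device                  /dev/sdb2
    raid-disk               1

# then create the array and put a file system on it:
#   mkraid /dev/md0
#   mke2fs /dev/md0
</verb></tscreen>
Consult the Software RAID HOWTO referenced above before trying this
on disks holding data you care about.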
<sect1>Compression
<p>
<nidx>disk!technologies!compression</nidx>
<nidx>disk!compression!DouBle</nidx>
<nidx>disk!compression!Zlibc</nidx>
<nidx>disk!compression!dmsdos</nidx>
<nidx>disk!compression!e2compr</nidx>
Disk compression versus file compression
is a hotly debated topic especially regarding
the added danger of file corruption. Nevertheless there are several options
available for the adventurous administrator. These take on many forms,
from kernel modules and patches to extra libraries, but note that most
suffer various forms of limitations such as being read-only. As development
takes place at breakneck speed the specs have undoubtedly changed by the
time you read this. As always: check the latest updates yourself. Here only
a few references are given.
<itemize>
<item>DouBle features file compression with some limitations.
<item>Zlibc adds transparent on-the-fly decompression of files as they load.
<item>There are many modules available for reading compressed files or
partitions that are native to various other operating systems, though
currently most of these are read-only.
<item>
<url url="http://bf9nt.uni-duisburg.de/mitarbeiter/gockel/software/dmsdos/"
name="dmsdos">
(currently in version 0.9.2.0) offers many of the compression
options available for DOS and Windows. It is not yet complete but work is
ongoing and new features are added regularly.
<item><tt/e2compr/ is a package that extends <tt>ext2fs</tt> with compression
capabilities. It is still under testing and will therefore mainly be of
interest for kernel hackers but should soon gain stability for wider use.
Check the
<!-- <url url="http://www.netspace.net.au/&tilde;reiter/e2compr.html" -->
<url url="http://e2compr.memalpha.cx/e2compr/" <!-- updated 000622 -->
name="e2compr homepage">
for more information. I have reports of speed and good stability
which is why it is mentioned here.
</itemize>
<sect1>ACL
<p>
<nidx>disk!technologies!ACL</nidx>
Access Control List (ACL) offers finer control over file access
on a user by user basis, rather than the traditional owner, group
and others, as seen in directory listings (<tt/drwxr-xr-x/). This
is currently not available in Linux but is expected in kernel 2.3
as hooks are already in place in <tt/ext2fs/.
<sect1><tt/cachefs/
<p>
<nidx>disk!technologies!cachefs</nidx>
This uses part of a hard disk to cache slower media such as CD-ROM.
It is available under SunOS but not yet for Linux.
<sect1>Translucent or Inheriting File Systems
<p>
<nidx>disk!technologies!translucent</nidx>
<nidx>disk!technologies!inheriting</nidx>
This is a copy-on-write system where writes go to a different system
than the original source while making it look like an ordinary file
space. Thus the file space inherits the original data and the
translucent write back buffer can be private to each user.
There are a number of applications:
<itemize>
<item>updating a live file system on CD-ROM, making it flexible, fast
while also conserving space,
<item>original skeleton files for each new user, saving space since the
original data is kept in a single space and shared out,
<item>parallel project development prototyping where every user can
seemingly modify the system globally while not affecting other users.
</itemize>
SunOS offers this feature and this is under development for Linux.
There was an old project called the Inheriting File Systems (<tt/ifs/)
but this project has stopped.
One current project is part of the <tt/md/ system and offers
block level translucence so it can be applied to any file system.
Sun has an informative
<url url="http://www.sun.ca/white-papers/tfs.html"
name="page">
on translucent file systems.
It should be noted that
<url url="http://www.rational.com"
name="Clearcase (now owned by Rational)">
pioneered and popularized translucent filesystems for software
configuration management by writing their own UNIX filesystem.
<!--
This is the old section, from which I have moved
various parts to other sections.
<sect2>General File System Consideration
<p>
<nidx>disk!technologies!filesystem considerations</nidx>
In the Linux world <tt>ext2fs</tt> is well established as a
general purpose system.
Still for some purposes others can be a better choice. News spools lend
themselves to a log file based system whereas high reliability data might
need other formats. This is a hotly debated topic and there are currently
few choices available but work is underway. Log file systems also have the
advantage of very fast file checking. Mail servers in the 100 GB class can
suffer file checks taking several days before becoming operational after
rebooting.
The <tt/Minix/ file system is the oldest one, used in some rescue disk
systems but otherwise very little used these days. At one time the
<tt/Xiafs/ was a strong contender to the standard for Linux but seems
to have fallen behind these days.
Adam Richter from Yggdrasil posted recently that they have been
working on a compressed log file based system but that this project is
currently on hold. Nevertheless a non-working version is available on
their FTP server. Check out
<url url="ftp://ftp.yggdrasil.com/private/adam"
name="the Yggdrasil ftp server">
where special patched versions of the kernel can be found.
Hopefully this will be rolled into the mainstream kernel in the near future.
An alternative project is the
<url url="http://lucien.blight.com/&tilde;c-cook/prof/lfs/"
name="Logical Volume Manager">
project.
As of July, 23th 1997 <url url="mailto:reiser (at) RICOCHET.NET" name="Hans
Reiser"> has put up the source to his tree based <url
url="http://idiom.com/&tilde;beverly/reiserfs.html" name="reiserfs"> on
the web. While his filesystem has some very interesting features and
is much faster than <tt/ext2fs/, it is still very experimental and
difficult to integrate with the standard kernel. Expect some
interesting developments in the future - this is different from your
"average log based file system for Linux" project, because Hans
already has working code.
There is room for access control lists (ACL) and other unimplemented
features in the existing <tt>ext2fs</tt>, stay tuned for future
updates.
There is also an encrypted file system available but again as this is under
export control from the US, make sure you get it from a legal place.
Also under development is the
<url url="http://www.virtual.net.au/&tilde;rjc/enh-fs.html"
name="Enhanced File System">
project.
File systems is an active field of academic and industrial
research and development, the results of which are quite often
freely available. Linux has in many cases been a development tool
in such activities so you can expect a lot of continuous work
in this field, stay tuned for the latest development.
One example of a file system research is
<url url="http://www.cs.columbia.edu/&tilde;ezk/research"
name="Erez Zadok Research">
page.
<sect2>CD-ROM File Systems
<p>
<nidx>disk!technologies!CD-ROM filesystems</nidx>
There has been a number of file systems available for use on CD-ROM systems
and one of the earliest one was the <em/High Sierra/ format, supposedly named
after the hotel where the final agreement took place. This was the precursor
to the <em/ISO 9660/ format which is supported by Linux.
Later there were the <em/Rock Ridge/ extensions which added file system
features such as long filenames, permissions and more.
The Linux iso9660 file system supports both High Sierra as well as
Rock Ridge extensions.
However, once again Microsoft decided it should create another
standard and their latest effort here is called <em/Joliet/ and offers
some internationalisation features.
This is now available in linux kernel 2.0.34 or newer. You need to
enable NLS in order to use it.
In a recent Usenet News posting hpa (at) transmeta.com (H. Peter Anvin)
writes the following the following interesting piece of trivia:
<tscreen><verb>
Trivia:
Joliet is a city outside Chicago; best known for being the site of
the prison where Jake was locked up in the movie "Blues Brothers."
Rock Ridge (the UNIX extensions to ISO 9660) is named
after the (fictional) town in the movie "Blazing Saddles."
</verb></tscreen>
Very important note: it was actually Jake who was locked up. Oops.
<sect2>Compression
<p>
<nidx>disk!technologies!compression</nidx>
Disk compression versus file compression
is a hotly debated topic especially regarding
the added danger of file corruption. Nevertheless there are several options
available for the adventurous administrators. These take on many forms,
from kernel modules and patches to extra libraries but note that most
suffer various forms of limitations such as being read-only. As development
takes place at neck breaking speed the specs have undoubtedly changed by the
time you read this. As always: check the latest updates yourself. Here only
a few references are given.
<itemize>
<item>DouBle features file compression with some limitations.
<item>Zlibc adds transparent on-the-fly decompression of files as they load.
<item>dmsdos (currently in version 0.9.1.2) offer many of the compression
options available for DOS and Windows. It is not yet complete but work is
ongoing and new features added regularly.
<item><tt/e2compr/ is a package that extends <tt>ext2fs</tt> with compression
capabilities. It is still under testing and will therefore mainly be of
interest for kernel hackers but should soon gain stability for wider use.
Check the
<url url="http://netspace.net.au/&tilde;reiter/e2compr.html"
name="e2compr homepage">
for more information. I have reports of speed and good stability
which is why it is mentioned here.
</itemize>
<sect2>Other Filesystems
<p>
<nidx>disk!technologies!filesystems, other</nidx>
Also there is the user file system (<tt/userfs/) that allows FTP based file
system and some compression (<tt/arcfs/) plus fast prototyping and many
other features. The <tt/docfs/ is based on this filesystem.
Recent kernels feature the loop or loopback device which can be
used to put a complete file system within a file. There are some
possibilities for using this for making new file systems with
compression, tarring, encryption etc.
Note that this device is unrelated to the network loopback device.
Very recently a compression package that extends <tt>ext2fs</tt> was
announced. It
is still under testing and will therefore mainly be of interest for kernel
hackers but should soon gain stability for wider use.
There is a number of other ongoing file system projects, but these are
in the experimental stage and fall outside the scope of this HOWTO.
-->
<sect1>Physical Track Positioning<label id="physical-track-positioning">
<p>
<nidx>disk!technologies!physical track positioning</nidx>
<nidx>disk!technologies!track positioning</nidx>
This trick used to be very important when drives were slow and small,
and some file systems used to take the varying characteristics into
account when placing files. Higher overall speeds as well as on-board
drive and controller caches and intelligence
have since reduced the effect of this.
Nevertheless there is still a little to be gained even today.
As we know, "<it/world dominance/" is soon within reach
but to achieve this "<it/fast/" we need to employ all the tricks
we can use
<!-- <htmlurl url="http://www.cs.indiana.edu/finger/linux.cs.helsinki.fi/linus" -->
<htmlurl url="http://www.mit.edu:8001/finger?linus@linux.cs.helsinki.fi"
name=" ">.
To understand the strategy we need to recall this near ancient piece
of knowledge and the properties of the various track locations.
This is based on the fact that transfer speeds generally increase for tracks
further away from the spindle, as well as the fact that it is faster
to seek to or from the central tracks than to or from the inner or outer
tracks.
Most drives use disks running at constant angular velocity but use
(fairly) constant data density across all tracks. This means that
you will get much higher transfer rates on the outer tracks than
on the inner tracks; a characteristic which fits the requirements
for large libraries well.
Newer disks use a logical geometry mapping which differs from the actual
physical mapping which is transparently mapped by the drive itself.
This makes the estimation of the "middle" tracks a little harder.
In most cases track 0 is at the outermost track and this is the general
assumption most people use. Still, it should be kept in mind that there
are no guarantees this is so.
<p>
<descrip>
<tag/Inner/ tracks are usually slow in transfer and, lying at one
end of the seek range, they are also slow to seek to.
They are more suitable for low demand directories such as DOS, root
and print spools.
<tag/Middle/ tracks are on average faster with respect to transfers
than inner tracks and, being in the middle, also on average faster to
seek to.
These characteristics are ideal for the most demanding parts such as
<tt>swap</tt>, <tt>/tmp</tt> and <tt>/var/tmp</tt>.
<tag/Outer/ tracks have on average even faster transfer characteristics
but, like the inner tracks, lie at one end of the seek range, so statistically
they are just as slow to seek to as the inner tracks.
Large files such as libraries would benefit from a place here.
</descrip>
Hence seek time reduction can be achieved by positioning frequently
accessed tracks in the middle so that the average seek distance and
therefore the seek time is short. This can be done either by using
<tt>fdisk</tt> or <tt>cfdisk</tt> to make a partition
on the middle tracks or by first
making a file (using <tt/dd/) equal to half the size of the entire disk
before creating the files that are frequently accessed, after which
the dummy file can be deleted. Both cases assume starting from an
empty disk.
The latter trick is suitable for news spools where the empty directory
structure can be placed in the middle before putting in the data files.
This also helps reducing fragmentation a little.
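A minimal sketch of the dummy file trick, assuming a freshly made 1 GB
partition mounted on <tt>/var/spool/news</tt> (path and sizes are examples only):
<tscreen><verb>
# occupy roughly the first half of the empty partition with a placeholder
dd if=/dev/zero of=/var/spool/news/dummy bs=1024k count=500
# now create the directories and files that will be accessed frequently
mkdir /var/spool/news/alt
# finally remove the placeholder so the space becomes available again
rm /var/spool/news/dummy
</verb></tscreen>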
This little trick can be used both on ordinary drives as well as RAID
systems. In the latter case the calculation for centring the tracks
will be different, if possible. Consult the latest RAID manual.
The speed difference this makes depends on the drives, but a 50 percent
improvement is a typical value.
<sect2>Disk Speed Values<label id="disk-speed-values">
<p>
<nidx>disk!technologies!disk speed values</nidx>
The same mechanical head disk assembly (HDA) is often available
with a number of interfaces (IDE, SCSI etc) and the mechanical
parameters are therefore often comparable. The mechanics are
today often the limiting factor but development is improving
things steadily. There are two main parameters, usually quoted
in milliseconds (ms):
<itemize>
<item>Head movement - the speed at which the read-write head
is able to move from one track to the next, called access time.
If you do the mathematics and doubly integrate the seek
first across all possible starting tracks and
then across all possible target tracks you will find that
this is equivalent to a stroke across a third of all tracks.
<item>Rotational speed - which determines the time taken to
get to the right sector, called latency.
</itemize>
After voice coils replaced stepper motors for the head movement
the improvements seem to have levelled off and more energy is
now spent (literally) at improving rotational speed. This has
the secondary benefit of also improving transfer rates.
Some typical values:
<tscreen><verb>
Drive type
Access time (ms) | Fast Typical Old
---------------------------------------------
Track-to-track <1 2 8
Average seek 10 15 30
End-to-end 10 30 70
</verb></tscreen>
This shows that the very high end drives offer only marginally
better access times than the average drives but that the old
drives based on stepper motors are significantly worse.
<tscreen><verb>
Rotational speed (RPM) | 3600 | 4500 | 4800 | 5400 | 7200 | 10000
-------------------------------------------------------------------
Latency (ms) | 17 | 13 | 12.5 | 11.1 | 8.3 | 6.0
</verb></tscreen>
As latency is the average time taken to reach a given sector, the
formula is quite simply
<tscreen><verb>
latency (ms) = 60000 / speed (RPM)
</verb></tscreen>
Clearly this too is an example of diminishing returns for the
efforts put into development. However, what really takes off
here is the power consumption, heat and noise.
<sect1>Yoke
<p>
<nidx>disk!technologies!yoke</nidx>
There is also a
<!-- <url url="http://www.it.uc3m.es/&tilde;ptb/cgi-bin/cvs-yoke.cgi" -->
<url url="http://www.it.uc3m.es/cgi-bin/ptb/cvs-yoke.cgi"
name="Linux Yoke Driver">
available in beta which
is intended to do hot-swappable transparent binding of
one Linux block device to another. This means that if you
bind two block devices together,
say <tt>/dev/hda</tt> and <tt>/dev/loop0</tt>,
writing to one device will mean also writing to the other
and reading from either will yield the same result.
<sect1>Stacking
<p>
<nidx>disk!technologies!stacking</nidx>
One of the advantages of a layered design of an operating system
is that you have the flexibility to put the pieces together
in a number of ways.
For instance you can cache a CD-ROM with <tt/cachefs/ that is
a volume striped over 2 drives. This in turn can be set up
translucently with a volume that is NFS mounted from another machine.
RAID can be stacked in several layers to offer very fast seek and
transfer in such a way that it will work if even 3 drives fail.
The choices are many, limited only by imagination and, probably
more importantly, money.
<sect1>Recommendations
<p>
<nidx>disk!technologies!recommendations</nidx>
There is a near infinite number of combinations available but my
recommendation is to start off with a simple setup without any
fancy add-ons. Get a feel for what is needed, where the maximum
performance is required, if it is access time or transfer speed
that is the bottle neck, and so on. Then phase in each component
in turn. As you can stack quite freely you should be able to
retrofit most components in as time goes by with relatively few
difficulties.
RAID is usually a good idea but make sure you have a thorough
grasp of the technology and a solid back up system.
<sect>Other Operating Systems
<p>
<nidx>disk!operating systems, other</nidx>
Many Linux users have several operating systems installed, often
necessitated by hardware setup systems that run under other operating
systems, typically DOS or some flavour of Windows. A small section on
how best to deal with this is therefore included here.
<sect1>DOS
<p>
<nidx>disk!operating systems, other!DOS</nidx>
Leaving aside the debate on whether or not DOS qualifies as an operating
system one can in general say that it has little sophistication with
respect to disk operations. The more important result of this is that there
can be severe difficulties in running various versions of DOS on large
drives, and you are therefore strongly recommended to read the
<em>Large Drives mini-HOWTO</em>. One effect is that you are often better
off placing DOS on low track numbers.
Having been designed for small drives it has a rather unsophisticated
file system (<tt/fat/) which when used on large drives will allocate
enormous block sizes. It is also prone to block fragmentation which
will after a while cause excessive seeks and slow effective transfers.
One solution to this is to use a defragmentation program regularly but
it is strongly recommended to back up data and verify the disk before
defragmenting. All versions of DOS have <tt/chkdsk/ that can do some
disk checking, newer versions also have <tt/scandisk/ which is somewhat
better. There are many defragmentation programs available, some versions
have one called <tt/defrag/. Norton Utilities have a large suite of
disk tools and there are many others available too.
As always there are snags, and this particular snake in our drive
paradise is called <em/hidden files/. Some vendors started to use
these for copy protection schemes and would not take kindly to being
moved to a different place on the drive, even if it remained in the
same place in the directory structure. The result of this was that
newer defragmentation programs will not touch any hidden file, which
in turn reduces the effect of defragmentation.
Being a single tasking, single threading and single most other things
operating system there is very little gain in using multiple drives
unless you use a drive controller with built in RAID support of some
kind.
There are a few utilities called <tt/join/ and <tt/subst/ which
can do some multiple drive configuration but there is very little
gain for a lot of work. Some of these commands have been removed in
newer versions.
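For the curious, the two commands work roughly along these lines; the drive
letters and directories are examples only and behaviour varies between DOS
versions:
<tscreen><verb>
REM make drive D: appear as the directory C:\DRIVE_D
JOIN D: C:\DRIVE_D
REM make the directory C:\TMP appear as the drive letter E:
SUBST E: C:\TMP
</verb></tscreen>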
In the end there is very little you can do, but not all hope is lost.
Many programs need fast, temporary storage, and the better behaved
ones will look for environment variables called <tt/TMPDIR/ or
<tt/TEMPDIR/ which you can set to point to another drive. This is
often best done in <tt/autoexec.bat/.
<code>
SET TMPDIR=E:/TMP
SET TEMPDIR=E:/TEMP
</code>
Not only will this possibly gain you some speed but also it can
reduce fragmentation.
There have been reports about difficulties in removing multiple primary
partitions using the <tt/fdisk/ program that comes with DOS. Should this
happen you can instead use a Linux rescue disk with Linux <tt/fdisk/ to
repair the system.
Don't forget there are other alternatives to DOS, the most well known
being
<url url="http://www.caldera.com/dos/"
name="DR-DOS">
from
<url url="http://www.caldera.com/"
name="Caldera">.
This is a direct descendant from DR-DOS from Digital Research.
It offers many features not found in the more common DOS, such
as multi tasking and long filenames.
Another alternative which also is free is
<url url="http://www.freedos.org/"
name="Free DOS">
which is a project under development. A number of free utilities
are also available.
<sect1>Windows
<p>
<nidx>disk!operating systems, other!Windows</nidx>
Most of the above points are valid for Windows too, with the exception
of Windows 95 which apparently has better disk handling and will get
better performance out of SCSI drives.
A useful thing is the introduction of long filenames, to read these from
Linux you will need the <tt/vfat/ file system for mounting these partitions.
Disk fragmentation is still a problem. Some of this can be avoided by
doing a defragmentation immediately before and immediately after installing
large programs or systems. I use this scheme at work and have found it
to work quite well. Purging unused files and emptying the waste basket first
can improve defragmentation further.
Windows also uses swap files, and redirecting these to another drive can give
you some performance gains. There are several mini-HOWTOs telling you how
best to share swap space between various operating systems.
<!-- added 030298, need more info here, low priority for now -->
The trick of setting <tt/TEMPDIR/ can still be used but not all
programs will honour this setting. Some do, though. To get a good
overview of the settings in the control files you can run <tt/sysedit/
which will open a number of files for editing, one of which is the
<tt/autoexec/ file where you can add the <tt/TEMPDIR/ settings.
Many of the temporary files are located in the <tt>/windows/temp</tt>
directory and changing this is more tricky. To achieve this you can
use <tt/regedit/ which is rather powerful and quite capable of
rendering your system in a state you will not enjoy, or more
precisely, in a state much less enjoyable than windows in general.
A registry database error is a message that means seriously bad news.
Also you will see that many programs have their own private temporary
directories scattered around the system.
Setting the swap file to a separate partition is a better idea and much
less risky. Keep in mind that this partition cannot be used for anything
else, even if there should appear to be space left there.
It is now possible to read <tt/ext2fs/ partitions from Windows,
either by mounting the partition using
<url url="http://www.yipton.demon.co.uk/"
name="FSDEXT2">
or by using a file explorer like tool called
<!-- <url url="http://uranus.it.swin.edu.au/&tilde;jn/linux/Explore2fs.html" 000502 -->
<url url="http://uranus.it.swin.edu.au/&tilde;jn/linux/explore2fs.htm"
name="Explore2fs">.
<sect1>OS/2
<p>
<nidx>disk!operating systems, other!OS/2</nidx>
The only special note here is that you can get a file system driver for
OS/2 that can read an <tt/ext2fs/ partition.
Matthieu Willm's ext2fs Installable File System for OS/2 can be found at
<url url="ftp://ftp-os2.nmsu.edu/pub/os2/system/drivers/filesys/ext2_240.zip"
name="ftp-os2.nmsu.edu">,
<url url="ftp://sunsite.unc.edu/pub/Linux/system/filesystems/ext2/ext2_240.zip"
name="Sunsite">,
<url url="ftp://ftp.leo.org/pub/comp/os/os2/drivers/ifs/ext2_240.zip"
name="ftp.leo.org"> and
<url url="ftp://ftp-os2.cdrom.com/pub/os2/diskutil/ext2_240.zip"
name="ftp-os2.cdrom.com">.
The IFS has read and write capabilities.
<sect1>NT
<p>
<nidx>disk!operating systems, other!Windows/NT</nidx>
<nidx>disk!operating systems, other!NT</nidx>
<nidx>disk!Microsoft!bug</nidx>
This is a more serious system featuring most buzzwords known to marketing.
It is well worth noting that it features software striping and other more
sophisticated setups. Check out the drive manager in the control panel.
I do not have easy access to NT, more details on this can take a bit of time.
One important snag was recently reported by acahalan at cs.uml.edu :
(reformatted from a Usenet News posting)
NT DiskManager has a serious bug that can corrupt your disk when you have
several (more than one?) extended partitions. Microsoft provides an
emergency fix program at their web site. See the
<url url="http://www.microsoft.com/kb/"
name="knowledge base">
for more. (This affects Linux users, because Linux users have extra partitions)
You can now read <tt/ext2fs/ partitions from NT using
<!-- <url url="http://uranus.it.swin.edu.au/&tilde;jn/linux/Explore2fs.html" 000502 -->
<url url="http://uranus.it.swin.edu.au/&tilde;jn/linux/explore2fs.htm"
name="Explore2fs">.
<sect1>Windows 2000
<p>
<nidx>disk!operating systems, other!Windows 2000</nidx>
Most points regarding Windows NT also apply to its descendant Windows 2000
though at the time of writing this I do not know if the aforementioned bugs
have been fixed or not.
While Windows 2000, like its predecessor, features RAID, at least one
company,
<url url="http://www.raidtoolbox.com/"
name="RAID Toolbox">,
has found the bundled RAID somewhat lacking and made their own commercial
alternative.
<sect1>Sun OS
<p>
<nidx>disk!operating systems, other!SunOS</nidx>
There is a little bit of confusion in this area between Sun OS vs. Solaris.
Strictly speaking Solaris is just Sun OS 5.x packaged with Openwindows and
a few other things. If you run Solaris, just type <tt/uname -a/ to see your
version. Part of the reason for this confusion is that Sun Microsystems
used to use an OS from the BSD family, albeit with a few bits and pieces
from elsewhere as well as things made by themselves. This was the situation
up to Sun OS 4.x.y when they made a "strategic roadmap decision" and decided
to switch over to the official Unix, System V, Release 4 (aka SVR4),
and Sun OS 5 was created.
This made a lot of people unhappy. Also this was bundled with other things
and marketed under the name Solaris, which currently stands at release
7, which just recently replaced version 2.6 as the latest and greatest.
In spite of the large jump in version number this is actually a minor
technical upgrade but a giant leap for marketing.
<sect2>Sun OS 4
<p>
<nidx>disk!operating systems, other!SunOS 4</nidx>
This is quite familiar to most Linux users.
The last release is 4.1.4 plus various patches.
Note however that the file system
structure is quite different and does not conform to FSSTND so any planning
must be based on the traditional structure. You can get some information by
the man page on this: <tt/man hier/. This is, like most man pages, rather brief
but should give you a good start. If you are still confused by the structure
it will at least be at a higher level.
<sect2>Sun OS 5 (aka Solaris)
<p>
<nidx>disk!operating systems, other!SunOS 5</nidx>
<nidx>disk!operating systems, other!Solaris</nidx>
This comes with a snazzy installation system that runs under Openwindows, it
will help you in partitioning and formatting the drives before installing the
system from CD-ROM. It will also fail if your drive setup is too far out, and
as it takes a complete installation run from a full CD-ROM in a 1x only drive
this failure will dawn on you only after a long time. That is the experience we
had where I used to work. Instead we installed everything onto one drive and then
moved directories across.
The default settings are sensible for most things, yet there remains a little
oddity: swap drives. Even though the official manual recommends multiple swap
drives (which are used in a similar fashion as on Linux) the default is to use
only a single drive. It is recommended to change this as soon as possible.
Sun OS 5 offers also a file system especially designed for temporary files,
<tt/tmpfs/. It offers significant speed improvements over <tt/ufs/ but does
not survive rebooting.
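For illustration, the <tt>/etc/vfstab</tt> line mounting <tt/tmpfs/ on
<tt>/tmp</tt> typically looks something like the sketch below, though you
should check the Solaris documentation for the exact fields on your release:
<tscreen><verb>
#device to mount  device to fsck  mount point  FS type  fsck pass  mount at boot  options
swap              -               /tmp         tmpfs    -          yes            -
</verb></tscreen>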
<!--
This is a kind of souped up RAM disk, and like ordinary RAM disks
the contents is lost when the power goes. If space is scarce parts of the
pseudo drive is swapped out, so in effect you store temporary files on the
swap partition. Linux does not have such a file system; it has been discussed
in the past but opinions were mixed. I would be interested in hearing comments
on this. -->
The only comment so far is: beware! Under Solaris 2.0 it seems that
creating very big files in <tt>/tmp</tt> can cause an out of swap space
kernel panic trap. As the evidence of what has happened is as lost as
any data on a RAMdisk after powering down it can be hard to find out
what has happened. What is worse, it seems that user space processes
can cause this kernel panic and unless this problem is taken care of
it is best not to use <tt/tmpfs/ in potentially hostile environments.
Also see the notes on
<ref id="tmpfs"
name="tmpfs">.
<!--
and
<ref id="comb-swap-n-tmp"
name="Combining swap and /tmp">.
-->
Trivia: There is a movie also called Solaris, a science fiction movie that is
very, very long, slow and incomprehensible. This was often pointed out at the
time Solaris (the OS) appeared...
<sect1>BeOS
<p>
<nidx>disk!operating systems, other!BeOS</nidx>
This operating system is one of the more recent ones to arrive
and it features a file system that has some database like features.
There is a BFS file system driver being developed for Linux
which is available in alpha stage. For more information check the
<url url="http://hp.vector.co.jp/authors/VA008030/bfs/"
name="Linux BFS page">
where patches also are available.
<sect>Clusters
<p>
<nidx>disk!technologies!clusters</nidx>
In this section I will briefly touch on the ways machines can be connected
together but this is so big a topic it could be a separate HOWTO in its
own right, hint, hint. Also, strictly speaking, this section lies outside
the scope of this HOWTO, so if you feel like getting fame etc. <em/you/ could
contact me and take over this part and turn it into a new document.
These days computers get outdated at an incredible rate. There is however
no reason why old hardware could not be put to good use with Linux. Using
an old and otherwise outdated computer as a network server can be both
useful in its own right as well as a valuable educational exercise. Such
a local networked cluster of computers can take on many forms but to remain
within the charter of this HOWTO I will limit myself to the disk strategies.
Nevertheless I would hope someone else could take on this topic and turn it
into a document on its own.
This is an exciting area of activity, and many forms of clustering
are available today, ranging from automatic workload balancing over a local
network to more exotic hardware such as Scalable Coherent Interface (SCI)
which gives a tight integration of machines, effectively turning them into
a single machine. Various kinds of clustering have been available for larger
machines for some time and the VAXcluster is perhaps a well known example
of this. Clustering is usually done in order to share resources such as
disk drives, printers and terminals, but also to share processing resources
equally transparently between the computational nodes.
There is no universal definition of clustering, in here it is taken to mean
a network of machines that combine their resources to serve users. Admittedly
this is a rather loose definition but this will change later.
These days Linux also offers some clustering features but for a start
I will just describe a simple local network. It is a good way of putting old
and otherwise unusable hardware to good use, as long as they can run Linux or
something similar.
One of the best ways of using an old machine is as a network server in which
case the effective speed is more likely to be limited by network bandwidth
rather than pure computational performance. For home use you can move the
following functionality off to an older machine used as a server:
<itemize>
<item>news
<item>mail
<item>web proxy
<item>printer server
<item>modem server (PPP, SLIP, FAX, Voice mail)
</itemize>
You can also <tt/NFS mount/ drives from the server onto your workstation
thereby reducing drive space requirements. Still read the FSSTND to see
what directories should <em/not/ be exported. The best candidates for
exporting to all machines are <tt>/usr</tt> and <tt>/var/spool</tt>
and possibly <tt>/usr/local</tt> but probably not <tt>/var/spool/lpd</tt>.
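A minimal sketch of how such sharing can be set up, assuming a server named
<tt/server/ and a workstation named <tt/ws1/ (names, paths and options are
examples only):
<tscreen><verb>
# /etc/exports on the server
/usr             ws1(ro)
/var/spool/mail  ws1(rw)

# corresponding /etc/fstab lines on the workstation
server:/usr             /usr             nfs  ro,hard,intr  0 0
server:/var/spool/mail  /var/spool/mail  nfs  rw,hard,intr  0 0
</verb></tscreen>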
Most of the time even slow disks will deliver sufficient performance. On
the other hand, if you do processing directly on the disks on the server or
have very fast networking, you might want to rethink your strategy and use
faster drives. Searching features on a web server or news database searches
are two examples of this.
Such a network can be an excellent way of learning system administration
and building up your own toaster network, as it often is called. You can
get more information on this in other HOWTOs but there are two important
things you should keep in mind:
<itemize>
<item>Do not pull IP numbers out of thin air. Configure your inside net
using IP numbers reserved for private use, and use your network server
as a router that handles this IP masquerading.
<item>Remember that if you additionally configure the router as a firewall
you might not be able to get to your own data from the outside, depending
on the firewall configuration.
</itemize>
The <em/Nyx/ network provides an example of a cluster in the sense defined here.
It consists of the following machines:
<descrip>
<tag/nyx/ is one of the two user login machines and also provides some
of the networking services.
<tag/nox/ (aka nyx10) is the main user login machine and is also the
mail server.
<tag/noc/ is a dedicated news server. The news spool is made accessible
through NFS mounting to nyx and nox.
<tag/arachne/ (aka www) is the web server. Web pages are written by
NFS mounting onto nox.
</descrip>
There are also some more advanced clustering projects going, notably
<itemize>
<item>
<!-- <url url="http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html" -->
<url url="http://www.beowulf.org/"
name="The Beowulf Project">
<item>
<url url="http://www.disi.unige.it/project/gamma/"
name="The Genoa Active Message Machine (GAMMA)">
</itemize>
<p>
High-tech clustering requires high-tech interconnect, and SCI is one of them.
To find out more you can either look up the home page of
<url url="http://www.dolphinics.no/"
name="Dolphin Interconnect Solutions">
which is one of the main actors in this field,
or you can have a look at
<url url="http://www.scizzl.com/"
name="scizzl">.
<p>
Centralised mail servers using IMAP are becoming more and more popular
as disks become large enough to keep all mail stored indefinitely
and also cheap enough to make it a feasible option.
Unfortunately it has become clear that <tt/NFS/ mounting the mail
archives from another machine can cause corruption of the IMAP
database as the server software does not handle NFS timeouts too well,
and NFS timeouts are a rather common occurrence.
Keep therefore the mail archive local to the IMAP server.
<sect>Mount Points
<p>
<nidx>disk!mount points</nidx>
In designing the disk layout it is important not to split off the
directory tree structure at the wrong points, hence this section.
As it is highly dependent on the FSSTND it has been put aside in
a separate section, and will most likely have to be totally rewritten
when FHS is adopted in a Linux distribution.
In the meanwhile this will do.
Remember that this is a list of where a separation <em/can/ take place,
not where it <em/has/ to be. As always, good judgement is required.
Again only a rough indication can be given here. The values indicate
<tscreen><verb>
0=don't separate here
1=not recommended
...
4=useful
5=recommended
</verb></tscreen>
In order to keep the list short, the uninteresting parts are removed.
<tscreen><verb>
Directory Suitability
/
|
+-bin 0
+-boot 5
+-dev 0
+-etc 0
+-home 5
+-lib 0
+-mnt 0
+-proc 0
+-root 0
+-sbin 0
+-tmp 5
+-usr 5
| \
| +-X11R6 3
| +-bin 3
| +-lib 4
| +-local 4
| | \
| | +bin 2
| | +lib 4
| +-src 3
|
+-var 5
\
+-adm 0
+-lib 2
+-lock 1
+-log 0
+-preserve 1
+-run 1
+-spool 4
| \
| +-mail 3
| +-mqueue 3
| +-news 5
| +-smail 3
| +-uucp 3
+-tmp 5
</verb></tscreen>
There are of course plenty of adjustments possible, for instance a home user
would not bother with splitting off the <tt>/var/spool</tt> hierarchy but
a serious ISP should. The key here is <em/usage/.
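To make this concrete, here is a hedged example of how the higher rated
separations above could end up in <tt>/etc/fstab</tt> on a two disk system;
the device names and the distribution over the disks are assumptions only:
<tscreen><verb>
# device      mount point  type  options   dump  pass
/dev/sda1     /boot        ext2  defaults  1     2
/dev/sda2     none         swap  sw        0     0
/dev/sda3     /            ext2  defaults  1     1
/dev/sda4     /usr         ext2  defaults  1     2
/dev/sdb1     /home        ext2  defaults  1     2
/dev/sdb2     /tmp         ext2  defaults  1     2
/dev/sdb3     /var         ext2  defaults  1     2
</verb></tscreen>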
<em/QUIZ!/ Why should <tt>/etc</tt> never be on a separate partition?
Answer: The mounting instructions used during boot are found in the file
<tt>/etc/fstab</tt>, so if this is on a separate and unmounted partition
it is as if the key to a locked drawer were inside that drawer, a hopeless
situation. (Yes, I'll do nearly anything to liven up this HOWTO.)
<sect>Considerations and Dimensioning
<p>
<nidx>disk!considerations and dimensioning</nidx>
The starting point in this will be to consider where you are and what
you want to do. The typical home system starts out with existing
hardware and the newly converted Linux user will want to get the most
out of existing hardware. Someone setting up a new system for a
specific purpose (such as an Internet provider) will instead have to
consider what the goal is and buy accordingly. Being ambitious I will
try to cover the entire range.
Various purposes will also have different requirements regarding file
system placement on the drives, a large multiuser machine would
probably be best off with the <tt>/home</tt> directory on a
separate disk, just to give an example.
In general, for performance it is advantageous to split most things
over as many disks as possible but there is a limited number of
devices that can live on a SCSI bus and cost is naturally also a
factor. Equally important, file system maintenance becomes more
complicated as the number of partitions and physical drives increases.
<sect1>Home Systems
<p>
<nidx>disk!considerations and dimensioning!home systems</nidx>
With the cheap hardware available today it is possible to have
quite a big system at home that is still cheap, systems that
rival major servers of yesteryear. While many started out with
old, discarded disks to build a Linux server (which is how this
HOWTO came into existence), many can now afford to buy 40 GB
disks up front.
Size remains important for some, and here are a few guidelines:
<descrip>
<tag/Testing/ Linux is simple and you don't even need a hard disk to
try it out, if you can get the boot floppies to work you are likely to
get it to work on your hardware. If the standard kernel does not work
for you, do not forget that often there can be special boot disk versions
available for unusual hardware combinations that can solve your initial
problems until you can compile your own kernel.
<tag/Learning/ about operating system is something Linux excels in,
there is plenty of documentation and the source is available. A single
drive with 50 MB is enough to get you started with a shell, a few of
the most frequently used commands and utilities.
<tag/Hobby/ use or more serious learning requires more commands and
utilities but a single drive is still all it takes, 500 MB should give
you plenty of room, also for sources and documentation.
<tag/Serious/ software development or just serious hobby work requires
even more space. At this stage you have probably a mail and news feed
that requires spool files and plenty of space. Separate drives for
various tasks will begin to show a benefit. At this stage you have
probably already gotten hold of a few drives too. Drive requirements
get harder to estimate but I would expect 2-4 GB to be plenty, even
for a small server.
<tag/Servers/ come in many flavours, ranging from mail servers to full
sized ISP servers. A base of 2 GB for the main system should be
sufficient, then add space and perhaps also drives for separate
features you will offer. Cost is the main limiting factor here
but be prepared to spend a bit if you wish to justify the "S"
in ISP. Admittedly, not all do it.
Basically a server is dimensioned like any machine for serious use
with added space for the services offered, and tends to be IO bound
rather than CPU bound.
With cheap networking technology both for land lines as well as
through radio nets, it is quite likely that very soon home users
will have their own servers more or less permanently hooked onto
the net.
</descrip>
<sect1>Servers
<p>
<nidx>disk!servers</nidx>
Big tasks require big drives and a separate section here. If possible
keep as much as possible on separate drives. Some of the appendices
detail the setup of a small departmental server for 10-100
users. Here I will present a few consideration for the higher end
servers. In general you should not be afraid of using RAID, not only
because it is fast and safe but also because it can make growth a
little less painful. All the notes below come as additions to the
points mentioned earlier.
Popular servers rarely just happen, rather they grow over time and this
demands both generous amounts of disk space as well as a good net
connection. In many of these cases it might be a good idea to reserve
entire SCSI drives, in singles or as arrays, for each task. This way you
can move the data should the computer fail. Note that transferring drives
across computers is not simple and might not always work, especially in the
case of IDE drives. Drive arrays require careful setup in order to
reconstruct the data correctly, so you might want to keep a paper copy of
your <tt/fstab/ file as well as a note of SCSI IDs.
<sect2>Home Directories <label id="server-home-dirs">
<p>
<nidx>disk!servers!home directories</nidx>
Estimate how many drives you will need, if this is more than 2 I would
recommend RAID, strongly. If not you should separate users across your
drives dedicated to users based on some kind of simple hashing algorithm.
For instance you could use the first 2 letters in the user name, so
<tt>jbloggs</tt> is put on <tt>/u/j/b/jbloggs</tt> where <tt>/u/j</tt>
is a symbolic link to a
physical drive so you can get a balanced load on your drives.
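A minimal sketch of this scheme, assuming the physical drive is mounted on
<tt>/disk2</tt> (the mount point and user name are examples only):
<tscreen><verb>
# the physical home directory lives on the second drive
mkdir -p /disk2/u/j/b/jbloggs
chown jbloggs /disk2/u/j/b/jbloggs
# /u/j is just a symbolic link pointing at that drive
mkdir -p /u
ln -s /disk2/u/j /u/j
</verb></tscreen>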
<sect2>Anonymous FTP
<p>
<nidx>disk!servers!FTP, anonymous</nidx>
<nidx>disk!servers!anonymous FTP</nidx>
This is an essential service if you are serious about service. Good
servers are well maintained, documented, kept up to date, and
immensely popular no matter where in the world they are located. The
big server
<url url="ftp://ftp.funet.fi"
name="ftp.funet.fi">
is an excellent example of this.
In general this is not a question of CPU but of network bandwidth. Size
is hard to estimate, mainly it is a question of ambition and service
attitudes. I believe the big archive at
<url url="ftp://ftp.cdrom.com"
name="ftp.cdrom.com">
is a *BSD machine
with 50 GB disk. Also memory is important for a dedicated FTP server,
about 256 MB RAM would be sufficient for a very big server, whereas
smaller servers can get the job done well with 64 MB RAM.
Network connections would still be the most important factor.
<sect2>WWW <label id="www">
<p>
<nidx>disk!servers!WWW</nidx>
<nidx>disk!servers!World Wide Web</nidx>
For many this is the main reason to get onto the Internet, in fact many
now seem to equate the two. In addition to being network intensive
there is also a fair bit of drive activity related to this, mainly
regarding the caches. Keeping the cache on a separate, fast drive
would be beneficial. Even better would be installing a caching proxy
server. This way you can reduce the cache size for each user and speed
up the service while at the same time cut down on the bandwidth
requirements.
With a caching proxy server you need a fast set of drives, RAID0 would
be ideal as reliability is not important here. Higher capacity is
better but about 2 GB should be sufficient for most. Remember to match
the cache period to the capacity and demand. Too long periods would on
the other hand be a disadvantage, if possible try to adjust based on
the URL. For more information check up on the most used servers such as
<tt/Harvest/,
<!-- http://squid.nlanr.net/Squid -->
<!-- <url url="http://www.squid-cache.org/Squid" 001203 -->
<url url="http://www.squid-cache.org/"
name="Squid">
and the one from
<url url="http://www.netscape.com"
name="Netscape">.
<sect2>Mail
<p>
<nidx>disk!servers!mail</nidx>
Handling mail is something most machines do to some extent. The big mail
servers, however, come into a class of their own. This is a demanding task
and a big server can be slow even when connected to fast drives and a good
net feed. In the Linux world the big server at <tt/vger.rutgers.edu/ is a
well known example. Unlike a news service which is distributed and which
can partially reconstruct the spool using other machines as a feed, the
mail servers are centralised. This makes safety much more important, so for
a major server you should consider a RAID solution with emphasis on
reliability. Size is hard to estimate, it all depends on how many lists you
run as well as how many subscribers you have.
Big mail servers can be IO limited in performance and for this
reason some use huge silicon disks connected to the SCSI bus to hold
all mail related files including temporary files.
For extra safety these are battery backed and filesystems
like <tt/udf/ are preferred since they always flush metadata to disk.
This added cost to performance is offset by the very fast disk.
Note that these days more and more switch over from using <tt/POP/ to
pull mail to local machine from mail server and instead use <tt/IMAP/ to
serve mail while keeping the mail archive centralised.
This means that mail is no longer spooled in its original sense but
often builds up, requiring huge disk space. Also more and more (ab)use
mail attachments to send all sorts of things across, even a small
word processor document can easily end up over 1 MB. Size your disks
generously and keep an eye on how much space is left.
<sect2>News
<p>
<nidx>disk!servers!news</nidx>
This is definitely a high volume task, and very dependent on what
news groups you subscribe to. On Nyx there is a fairly complete feed
and the spool files consume about 17 GB. The biggest groups are no doubt
in the <tt>alt.binary.*</tt> hierarchy, so if you for some reason decide not to
get these you can get a good service with perhaps 12 GB. Still others,
that shall remain nameless, feel 2 GB is sufficient to claim ISP status.
In this case news expires so fast I feel the spelling IsP is barely
justified. A full newsfeed means a traffic of a few GB every day and this
is an ever growing number.
<sect2>Others
<p>
<nidx>disk!servers!other</nidx>
There are many services available on the net, many of which have been
put somewhat in the shadow of the web. Nevertheless, services like
<em/archie/, <em/gopher/ and <em/wais/ just to name a few, still exist and
remain valuable tools on the net. If you are serious about starting a major
server you should also consider these services. Determining the required
volumes is hard, it all depends on popularity and demand. Providing good
service inevitably has its costs, disk space is just one of them.
<sect2>Server Recommendations
<p>
<nidx>disk!servers!recommendations</nidx>
Servers today require large numbers of large disks to function
satisfactorily in commercial settings. As mean time between failure
(MTBF) decreases rapidly as the number of components increases it is
advisable to look into using RAID for protection and use a number of
medium sized drives rather than one single huge disk. Also look into
the High Availability (HA) project for more information.
More information is available at
<!-- <url url="http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html" 001203 -->
<url url="http://www.ibiblio.org/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html"
name="High Availability HOWTO">
and also at related
<url url="http://www.henge.com/&tilde;alanr/ha/index.html"
name="web pages">.
There is also an article in Byte called
<url url="http://www.byte.com/columns/servinglinux/1999/06/0607servinglinux.html"
name="How Big Does Your Unix Server Have To Be?">
with many points that are relevant to Linux.
<sect1>Pitfalls<label id="pitfalls">
<p>
<nidx>disk!pitfalls</nidx>
The dangers of splitting up everything into separate partitions are
briefly mentioned in the section about volume management. Still, several
people have asked me to emphasize this point more strongly: when one
partition fills up it cannot grow any further, no matter if there is
plenty of space in other partitions.
In particular look out for explosive growth in the news spool
(<tt>/var/spool/news</tt>). For multi user machines with quotas keep
an eye on <tt>/tmp</tt> and <tt>/var/tmp</tt> as some people try to hide their
files there, just look out for filenames ending in gif or jpeg...
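A couple of commands along these lines, run from <tt/cron/, can help keep an
eye on things (the paths are examples only):
<tscreen><verb>
# look for image files hiding in the temporary directories
find /tmp /var/tmp \( -name '*.gif' -o -name '*.jpg' -o -name '*.jpeg' \)
# watch the news spool and temporary areas for explosive growth
df /var/spool/news /tmp /var/tmp
</verb></tscreen>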
In fact, for single physical drives this scheme offers very little gain
at all, other than making file growth monitoring easier
(using '<tt>df</tt>') and physical track positioning. Most importantly
there is no scope for parallel disk access. A freely available volume
management system would solve this but this is still some time in the
future. However, when more specialised file systems become available
even a single disk could benefit from being divided into several
partitions.
For more information see section <ref id="troubleshooting" name="Troubleshooting">.
<!-- <**** expand here (2) &&&> -->
<sect>Disk Layout <label id="disk-layout">
<p>
<nidx>disk!disk layout</nidx>
<nidx>disk!layout, disk</nidx>
With all this in mind we are now ready to embark on the layout. I
have based this on my own method developed when I got hold of 3 old
SCSI disks and boggled over the possibilities.
The tables in the appendices are designed to simplify the mapping process. They
have been designed to help you go through the process of optimization as well
as providing a useful log in case of system repair. A few examples are also given.
<!--
At the end of this document there is an appendix with a few blank forms
that you can fill in to help you decide and design your system. The
following few paragraphs will refer to them.
-->
<sect1>Selection for Partitioning
<p>
<nidx>disk!layout, disk!partitioning</nidx>
Determine your needs and set up a list of all the parts of the file system
you want to be on separate partitions and sort them in descending order of
speed requirement and how much space you want to give each partition.
The table in <ref id="app-a" name="Appendix A"> section
is a useful tool to select what directories you
should put on different partitions. It is sorted in a logical order
with space for your own additions and notes about mounting points and
additional systems. It is therefore NOT sorted in order of speed, instead
the speed requirements are indicated by bullets ('o').
If you plan to RAID make a note of the disks you want to use and what
partitions you want to RAID. Remember various RAID solutions offers
different speeds and degrees of reliability.
(Just to make it simple I'll assume we have a set of identical SCSI disks
and no RAID)
<sect1>Mapping Partitions to Drives
<p>
<nidx>disk!layout, disk!mapping partitions</nidx>
<nidx>disk!layout, disk!partitions, mapping</nidx>
Then we want to place the partitions onto physical disks. The point of the
following algorithm is to maximise parallelizing and bus capacity. In this
example the drives are A, B and C and the partitions are 987654321 where 9
is the partition with the highest speed requirement. Starting at one drive
we 'meander' the partition line over and over the drives in this way:
<tscreen><verb>
A : 9 4 3
B : 8 5 2
C : 7 6 1
</verb></tscreen>
This makes the 'sum of speed requirements' the most equal across each
drive.
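Just to spell the pattern out, here is a small throwaway script that prints
the same meander assignment; the drive and partition labels are of course
only placeholders:
<tscreen><verb>
#!/bin/bash
# snake the partitions, highest speed requirement first, across the drives
drives=(A B C)
parts=(9 8 7 6 5 4 3 2 1)
n=${#drives[@]}
for i in "${!parts[@]}"; do
    round=$(( i / n ))
    pos=$(( i % n ))
    # reverse direction on every other pass to even out the load
    if (( round % 2 == 1 )); then
        pos=$(( n - 1 - pos ))
    fi
    echo "partition ${parts[$i]} -> drive ${drives[$pos]}"
done
</verb></tscreen>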
Use the table in <ref id="app-b" name="Appendix B"> section
to select what drives to use for each partition in order to optimize for parallelism.
Note the speed characteristics of your drives and note each directory under
the appropriate column. Be prepared to shuffle directories, partitions
and drives around a few times before you are satisfied.
<sect1>Sorting Partitions on Drives
<p>
<nidx>disk!layout, disk!sorting partitions</nidx>
<nidx>disk!layout, disk!partitions, sorting</nidx>
After that it is recommended to select partition numbering for each drive.
Use the table in <ref id="app-c" name="Appendix C"> section
to select partition numbers in order to optimize for track characteristics.
At the end of this you should have a table sorted in ascending partition
number. Fill these numbers back into the tables in appendix A and B.
You will find these tables useful
when running the partitioning program (<tt>fdisk</tt> or
<tt>cfdisk</tt>) and when doing the installation.
<sect1>Optimizing
<p>
<nidx>disk!layout, disk!optimizing partitions</nidx>
<nidx>disk!layout, disk!partitions, optimizing</nidx>
After this there are usually a few partitions that have to be 'shuffled' over
the drives either to make them fit or if there are special considerations
regarding speed, reliability, special file systems etc. Nevertheless this
gives what this author believes is a good starting point for the complete
setup of the drives and the partitions. In the end it is actual use that will
determine the real needs after we have made so many assumptions. After
commencing operations one should assume a time comes when a repartitioning
will be beneficial.
For instance if one of the 3 drives in the above mentioned example is
very slow compared to the two others a better plan would be as
follows:
<tscreen><verb>
A : 9 6 5
B : 8 7 4
C : 3 2 1
</verb></tscreen>
<sect2>Optimizing by Characteristics
<p>
<nidx>disk!layout, disk!optimizing by characteristics</nidx>
<nidx>disk!layout, disk!characteristics, optimizing by</nidx>
Often drives can be similar in apparent overall speed but some
advantage can be gained by matching drives to the file size
distribution and frequency of access. Thus binaries are suited
to drives with fast access that offer command queueing, and
libraries are better suited to drives with larger transfer speeds
where IDE offers good performance for the money.
<sect2>Optimizing by Drive Parallelising <label id="opt-drive-parall">
<p>
<nidx>disk!layout, disk!optimizing by parallelising</nidx>
<nidx>disk!layout, disk!parallelising, optimizing by</nidx>
Avoid drive contention by looking at tasks: for instance if you are
accessing <tt>/usr/local/bin</tt> chances are you will soon also need files
from <tt>/usr/local/lib</tt> so placing these on separate drives allows less
seeking and possible parallel operation and drive caching. It is
quite possible that choosing what may appear less than ideal drive
characteristics will still be advantageous if you can gain parallel
operations. Identify common tasks, what partitions they use and try
to keep these on separate physical drives.
Just to illustrate my point I will give a few examples of
task analysis here.
<descrip>
<tag/Office software/ such as editing, word processing and
spreadsheets are typical examples of low intensity software both in
terms of CPU and disk intensity. However, should you have a single
server for a huge number of users you should not forget that most such
software have auto save facilities which cause extra traffic, usually
on the home directories. Splitting users over several drives would
reduce contention.
<tag/News/ readers also have auto save features acting on home directories,
so ISPs should consider separating home directories from news spools.
News spools are notorious for their deeply nested directories and
their large number of very small files. Loss of a news spool
partition is not a big problem for most people either, so they are good
candidates for a RAID 0 setup with many small disks to distribute
the many seeks among multiple spindles. It is recommended in the
manuals and FAQs for the INN news server to put news spool
and <tt>.overview</tt> files on separate drives for larger installations.
<!-- There is also a web page dedicated to 001210 gone.
<url url="http://www.spinne.com/usenet/inn-perf.html"
name="INN optimising">
well worth reading. -->
Some notes on
<url url="http://www.tru64unix.compaq.com/internet/inn-wp.html"
name="INN optimising under Tru64 UNIX">
also apply to a wider audience, including Linux users.
<tag/Database/ applications can be demanding both in terms of drive
usage and speed requirements. The details are naturally application
specific, read the documentation carefully with disk requirements in
mind. Also consider RAID both for performance and reliability.
<tag/E-mail/ reading and sending involves home directories as well as
in- and outgoing spool files. If possible keep home directories and
spool files on separate drives. If you are a mail server or a mail hub
consider putting in- and outgoing spool directories on separate
drives.
Losing mail is an extremely bad thing, if you are managing an ISP or major
hub. Think about RAIDing your mail spool and consider frequent
backups.
<tag/Software development/ can require a large number of directories
for binaries, libraries, include files as well as source and project
files. If possible split as much as possible across separate
drives. On small systems you can place <tt>/usr/src</tt> and project files on
the same drive as the home directories.
<tag/Web browsing/ is becoming more and more popular. Many browsers
have a local cache which can expand to rather large volumes. As this
is used when reloading pages or returning to the previous page, speed
is quite important here. If however you are connected via a well configured
proxy server you do not need more than typically a few megabytes per
user for a session.
See also the sections on
<ref id="server-home-dirs" name="Home Directories">
and
<ref id="www" name="WWW">.
</descrip>
<!-- 990124 moved over to recommendation section
<sect1>Usage Requirements
<p>
<nidx>disk!usage requirements</nidx>
When you get a box of 10 or so CD-ROMs with a Linux distribution and
the entire contents of the big FTP sites it can be tempting to install
as much as your drives can take. Soon, however, one would find that
this leaves little room to grow and that it is easy to bite over more
than can be chewed, at least in polite company. Therefore I will make
a few comments on a few points to keep in mind when you plan out your
system. Comments here are actively sought.
<descrip>
<tag/Testing/ Linux is simple and you don't even need a hard disk to
try it out, if you can get the boot floppies to work you are likely to
get it to work on your hardware. If the standard kernel does not work
for you, do not forget that often there can be special boot disk versions
available for unusual hardware combinations that can solve your initial
problems until you can compile your own kernel.
<tag/Learning/ about operating system is something Linux excels in,
there is plenty of documentation and the source is available. A single
drive with 50 MB is enough to get you started with a shell, a few of
the most frequently used commands and utilities.
<tag/Hobby/ use or more serious learning requires more commands and
utilities but a single drive is still all it takes, 500 MB should give
you plenty of room, also for sources and documentation.
<tag/Serious/ software development or just serious hobby work requires
even more space. At this stage you have probably a mail and news feed
that requires spool files and plenty of space. Separate drives for
various tasks will begin to show a benefit. At this stage you have
probably already gotten hold of a few drives too. Drive requirements
gets harder to estimate but I would expect 2-4 GB to be plenty, even
for a small server.
<tag/Servers/ come in many flavours, ranging from mail servers to full
sized ISP servers. A base of 2 GB for the main system should be
sufficient, then add space and perhaps also drives for separate
features you will offer. Cost is the main limiting factor here
but be prepared to spend a bit if you wish to justify the "S"
in ISP. Admittedly, not all do it.
</descrip>
<sect1>Servers
<p>
<nidx>disk!servers</nidx>
Big tasks require big drives and a separate section here. If possible
keep as much as possible on separate drives. Some of the appendices
detail the setup of a small departmental server for 10-100
users. Here I will present a few consideration for the higher end
servers. In general you should not be afraid of using RAID, not only
because it is fast and safe but also because it can make growth a
little less painful. All the notes below come as additions to the
points mentioned earlier.
Popular servers rarely just happens, rather they grow over time and this
demands both generous amounts of disk space as well as a good net
connection. In many of these cases it might be a good idea to reserve
entire SCSI drives, in singles or as arrays, for each task. This way you
can move the data should the computer fail. Note that transferring drives
across computers is not simple and might not always work, especially in the
case of IDE drives. Drive arrays require careful setup in order to
reconstruct the data correctly, so you might want to keep a paper copy of
your <tt/fstab/ file as well as a note of SCSI IDs.
<sect2>Home Directories <label id="server-home-dirs">
<p>
<nidx>disk!servers!home directories</nidx>
Estimate how many drives you will need, if this is more than 2 I would
recommend RAID, strongly. If not you should separate users across your
drives dedicated to users based on some kind of simple hashing algorithm.
For instance you could use the first 2 letters in the user name, so
<tt>jbloggs</tt> is put on <tt>/u/j/b/jbloggs</tt> where <tt>/u/j</tt>
is a symbolic link to a
physical drive so you can get a balanced load on your drives.
<sect2>Anonymous FTP
<p>
<nidx>disk!servers!FTP, anonymous</nidx>
<nidx>disk!servers!anonymous FTP</nidx>
This is an essential service if you are serious about service. Good
servers are well maintained, documented, kept up to date, and
immensely popular no matter where in the world they are located. The
big server
<url url="ftp://ftp.funet.fi"
name="ftp.funet.fi">
is an excellent example of this.
In general this is not a question of CPU but of network bandwidth. Size
is hard to estimate, mainly it is a question of ambition and service
attitudes. I believe the big archive at
<url url="ftp://ftp.cdrom.com"
name="ftp.cdrom.com">
is a *BSD machine
with 50 GB disk. Also memory is important for a dedicated FTP server,
about 256 MB RAM would be sufficient for a very big server, whereas
smaller servers can get the job done well with 64 MB RAM.
Network connections would still be the most important factor.
<sect2>WWW <label id="www">
<p>
<nidx>disk!servers!WWW</nidx>
<nidx>disk!servers!World Wide Web</nidx>
For many this is the main reason to get onto the Internet, in fact many
now seem to equate the two. In addition to being network intensive
there is also a fair bit of drive activity related to this, mainly
regarding the caches. Keeping the cache on a separate, fast drive
would be beneficial. Even better would be installing a caching proxy
server. This way you can reduce the cache size for each user and speed
up the service while at the same time cut down on the bandwidth
requirements.
With a caching proxy server you need a fast set of drives, RAID0 would
be ideal as reliability is not important here. Higher capacity is
better but about 2 GB should be sufficient for most. Remember to match
the cache period to the capacity and demand. Too long periods would on
the other hand be a disadvantage, if possible try to adjust based on
the URL. For more information check up on the most used servers such as
<tt/Harvest/,
<url url="http://www.nlanr.net/Squid"
name="Squid">
and the one from
<url url="http://www.netscape.com"
name="Netscape">.
<sect2>Mail
<p>
<nidx>disk!servers!mail</nidx>
Handling mail is something most machines do to some extent. The big mail
servers, however, come into a class of their own. This is a demanding task
and a big server can be slow even when connected to fast drives and a good
net feed. In the Linux world the big server at <tt/vger.rutgers.edu/ is a
well known example. Unlike a news service which is distributed and which
can partially reconstruct the spool using other machines as a feed, the
mail servers are centralised. This makes safety much more important, so for
a major server you should consider a RAID solution with emphasize on
reliability. Size is hard to estimate, it all depends on how many lists you
run as well as how many subscribers you have.
<sect2>News
<p>
<nidx>disk!servers!news</nidx>
This is definitely a high volume task, and very dependent on what
news groups you subscribe to. On Nyx there is a fairly complete feed
and the spool files consume about 17 GB. The biggest groups are no doubt
in the <tt>alt.binary.*</tt> hierarchy, so if you for some reason decide not to
get these you can get a good service with perhaps 12 GB. Still others,
that shall remain nameless, feel 2 GB is sufficient to claim ISP status.
In this case news expires so fast I feel the spelling IsP is barely
justified. A full newsfeed means a traffic of a few GB every day and this
is an ever growing number.
<sect2>Others
<p>
<nidx>disk!servers!other</nidx>
There are many services available on the net and even though many have been
put somewhat in the shadows by the web. Nevertheless, services like
<em/archie/, <em/gopher/ and <em/wais/ just to name a few, still exist and
remain valuable tools on the net. If you are serious about starting a major
server you should also consider these services. Determining the required
volumes is hard, it all depends on popularity and demand. Providing good
service inevitably has its costs, disk space is just one of them.
<sect1>Pitfalls
<p>
<nidx>disk!pitfalls</nidx>
The dangers of splitting up everything into separate partitions are
briefly mentioned in the section about volume management. Still, several
people have asked me to emphasize this point more strongly: when one
partition fills up it cannot grow any further, no matter if there is
plenty of space in other partitions.
In particular look out for explosive growth in the news spool
(<tt>/var/spool/news</tt>). For multi user machines with quotas keep
an eye on <tt>/tmp</tt> and <tt>/var/tmp</tt> as some people try to hide their
files there, just look out for filenames ending in gif or jpeg...
In fact, for single physical drives this scheme offers very little gains
at all, other than making file growth monitoring easier
(using '<tt>df</tt>') and physical track positioning. Most importantly
there is no scope for parallel disk access. A freely available volume
management system would solve this but this is still some time in the
future. However, when more specialised file systems become available
even a single disk could benefit from being divided into several
partitions.
-->
<sect1>Compromises
<p>
<nidx>disk!compromises</nidx>
One way to avoid the aforementioned
<ref id="pitfalls" name="pitfalls">
is to only set off fixed
partitions to directories with a fairly well known size such as swap,
<tt>/tmp</tt> and <tt>/var/tmp</tt> and group together the remainders
into the remaining partitions using symbolic links.
Example: a slow disk (<tt>slowdisk</tt>),
a fast disk (<tt>fastdisk</tt>) and an
assortment of files. Having set up <tt/swap/ and <tt/tmp/ on <tt/fastdisk/,
and <tt>/home</tt> and root on <tt/slowdisk/, we have (the fictitious) directories
<tt>/a/slow</tt>, <tt>/a/fast</tt>, <tt>/b/slow</tt> and <tt>/b/fast</tt>
left to allocate on the partitions
<tt>/mnt.slowdisk</tt> and <tt>/mnt.fastdisk</tt> which represent the
remaining partitions of the two drives.
Putting <tt>/a</tt> or <tt>/b</tt> directly on either drive gives the same
properties to the subdirectories. We could make all 4 directories
separate partitions but would lose some flexibility in managing
the size of each directory. A better solution is to make
these 4 directories symbolic links to appropriate directories on
the respective drives.
Thus we make
<tscreen><verb>
/a/fast point to /mnt.fastdisk/a/fast or /mnt.fastdisk/a.fast
/a/slow point to /mnt.slowdisk/a/slow or /mnt.slowdisk/a.slow
/b/fast point to /mnt.fastdisk/b/fast or /mnt.fastdisk/b.fast
/b/slow point to /mnt.slowdisk/b/slow or /mnt.slowdisk/b.slow
</verb></tscreen>
and we get all fast directories on the fast drive without having to
set up a partition for all 4 directories. The second (right hand)
alternative gives us a flatter file system which in this case can
make it simpler to keep an overview of the structure.
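As a minimal sketch, assuming the leftover partitions are already mounted on
<tt>/mnt.fastdisk</tt> and <tt>/mnt.slowdisk</tt>, the flatter (right hand)
variant could be set up like this:
<tscreen><verb>
# create the real directories on the two partitions
mkdir -p /a /b /mnt.fastdisk/a.fast /mnt.fastdisk/b.fast
mkdir -p /mnt.slowdisk/a.slow /mnt.slowdisk/b.slow
# then point the visible directories at them using symbolic links
ln -s /mnt.fastdisk/a.fast /a/fast
ln -s /mnt.fastdisk/b.fast /b/fast
ln -s /mnt.slowdisk/a.slow /a/slow
ln -s /mnt.slowdisk/b.slow /b/slow
</verb></tscreen>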
The disadvantage is that it is a complicated scheme to set up and plan
in the first place and that all mount points and partitions have to be
defined before the system installation.
<em/Important:/ note that the <tt>/usr</tt> partition must be mounted
directly onto root and not via an indirect link as described above.
The reason for this is the long backward links used extensively in
X11 that go from deep within <tt>/usr</tt> all the way to root and then
down into <tt>/etc</tt> directories.
<sect>Implementation
<p>
<nidx>disk!implementation</nidx>
Having done the layout you should now have a detailed description of
what goes where. Most likely this will be on paper but hopefully
someone will make a more automated system that can deal with
everything from the design, through partitioning to formatting and
installation. This is the route one will have to take to realise the
design.
Modern distributions come with installation tools that will guide you
through partitioning and formatting and also set up <tt>/etc/fstab</tt>
for you automatically. For later modifications, however, you will need
to understand the underlying mechanisms.
<sect1>Checklist
<p>
<nidx>disk!implementation!checklist</nidx>
Before starting make sure you have the following:
<itemize>
<item>Written notes of what goes where, your design
<item>A functioning, tested rescue disk
<item>A fresh backup of your precious data
<item>At least two formatted, tested and empty floppies
<item>Read and understood the man page for fdisk or equivalent
<item>Patience, concentration and elbow grease
</itemize>
<sect1>Drives and Partitions
<p>
<nidx>disk!implementation!drives</nidx>
<nidx>disk!implementation!partitions</nidx>
When you start DOS or the like you will find all partitions labeled
<tt/C:/ and onwards, with no differentiation on IDE, SCSI, network or
whatever type of media you have. In the world of Linux this is rather
different. During booting you will see partitions described like this:
<code>
Dec 6 23:45:18 demos kernel: Partition check:
Dec 6 23:45:18 demos kernel: sda: sda1
Dec 6 23:45:18 demos kernel: hda: hda1 hda2
</code>
SCSI drives are labelled <tt/sda/, <tt/sdb/, <tt/sdc/ etc, and
(E)IDE drives are labelled <tt/hda/, <tt/hdb/, <tt/hdc/ etc.
There are also standard names for all devices, full information can be
found in
<tt>/dev/MAKEDEV</tt> and <tt>/usr/src/linux/Documentation/devices.txt</tt>.
Partitions are labelled numerically for each drive <tt/hda1/, <tt/hda2/
and so on. On SCSI drives there can be 15 partitions per
drive, on EIDE drives there can be 63 partitions per drive. Both
limits exceed what is currently useful for most disks.
These are then mounted according to the file <tt>/etc/fstab</tt> before
they appear as a part of the file system.
<sect1>Partitioning
<p>
<nidx>disk!implementation!partitioning</nidx>
<nidx>disk!fdisk</nidx>
<nidx>disk!cfdisk</nidx>
<nidx>disk!sfdisk</nidx>
<nidx>disk!Disk Druid</nidx>
<it>It feels so good / It's a marginal risk / when I clear off / windows with fdisk! </it>
(the Dustbunny in an
<url url="http://www.userfriendly.org/cartoons/archives/99feb/19990221.html"
name="issue">
of
<url url="http://www.userfriendly.org/"
name="User Friendly">
in the song "Refund this")
First you have to partition each drive into a number of separate partitions.
Under Linux there are two main methods, <tt/fdisk/ and the more screen
oriented <tt/cfdisk/. These are complex programs, read the manual <em/very/
carefully. For the experts there is now also <tt/sfdisk/.
Partitions come in 3 flavours, <tt/primary/, <tt/extended/ and <tt/logical/.
Many operating systems have to boot from a <tt/primary/ partition, and there is
a maximum of 4 primary partitions per drive. If you want more you have to define an <tt/extended/
partition within which you define your <tt/logical/ partitions.
Each partition has an identifier number which tells the operating system
what it is, for Linux the types <tt/swap(82)/ and <tt/ext2fs(83)/ are
the ones you
will need to know.
If you want to use RAID with autostart you have to check the documentation
for the appropriate type number for the RAID partition.
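To inspect the current layout and type numbers without changing anything you
can do a quick listing first; the device name here is just an example:
<tscreen><verb>
# list the partition table, sizes and type numbers without changing anything
fdisk -l /dev/hda
</verb></tscreen>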
There is a readme file that comes with <tt/fdisk/ that gives more in-depth
information on partitioning.
Someone has just made a <em/Partitioning HOWTO/ which contains excellent,
in depth information on the nitty-gritty of partitioning. Rather than
repeating it here and bloating this document further, I will refer
you to it instead.
Redhat has written a screen oriented utility called <em/Disk Druid/ which
is supposed to be a user friendly alternative
to <tt/fdisk/ and <tt/cfdisk/ and also
automates a few other things. Unfortunately this product is not quite
mature so if you use it and cannot get it to work you are well advised
to try <tt/fdisk/ or <tt/cfdisk/.
Not to be outdone, Mandrakesoft has made an even more graphic alternative
called
<url url="http://www.linux-mandrake.com/diskdrake/"
name="Diskdrake">
that also offers numerous features.
Also the GNU project offers a partitioning tool called
<url url="http://www.gnu.org/software/parted/"
name="GNU Parted">
The
<url url="http://www.users.intercom.com/&tilde;ranish/part/"
name="Ranish Partition Manager">
is another free alternative,
while
<url url="http://www.powerquest.com"
name="Partition Magic">
is a popular commercial alternative which also offers some
support for resizing <tt/ext2fs/ partitions.
Note that Windows will complain if it finds
more than one primary partition on a drive.
Also it appears to assign drive letters to the primary partitions on each
disk as it finds them, before starting over from the first disk to
assign subsequent drive letters to the logical partitions.
If you want DOS/Windows on your system you should make that partition
first, a primary one to boot to, made with the DOS <tt/fdisk/ program.
Then if you want NT you put that one in.
Finally, for Linux, you create those partitions with the Linux <tt/fdisk/
program or equivalents. Linux is flexible enough to boot from both
primary as well as logical partitions.
In depth information on DOS <tt/fdisk/ can be found at
<url url="http://www.fdisk.com/fdisk/"
name="Fdisk.com">
and
<url url="http://members.aol.com/axcel216/secrets.htm"
name="MS-DOS 5.00 - 7.10 Undocumented, Secret + Hidden Features">
which details even more bugs and pitfalls.
<sect1>Repartitioning <label id="repartitioning">
<p>
<nidx>disk!implementation!repartitioning</nidx>
<nidx>disk!ShowFat</nidx>
<nidx>disk!fips</nidx>
<nidx>disk!Partition Magic</nidx>
<nidx>disk!Partition Resizer</nidx>
Sometimes it is necessary to change the sizes of existing partitions
while keeping the contents intact. One way is of course to back up
everything, recreate new partitions and then restore the old contents,
and while this gives your back up system a good test it is also
rather time consuming.
Partition resizing is a simpler alternative where a file system is
first shrunk to the desired size and then the partition table is
updated to reflect the new end of partition position. This process
is therefore very file system sensitive.
Repartitioning requires there to be free space at the end of
the file space so to ensure you are able to shrink the size
you should first defragment your drive and empty any wastebaskets.
Using
<url url="http://www.igd.fhg.de/&tilde;aschaefe/fips/"
name="fips">
you can resize a <tt/fat/ partition,
and the latest version 1.6 of <tt/fips/ or <tt/fips 2.0/
are also able to resize <tt/fat32/ partitions.
Note that these programs actually run under DOS.
Resizing other file systems is much more complicated but one
popular commercial system
<url url="http://www.powerquest.com"
name="Partition Magic">
is able to resize more file system types, including <tt/ext2fs/
using the <tt/resize2fs/ program. Make sure you get the latest
updates to this program as recent versions had problems with
large disks.
In order to get the most out of <tt/fips/ you should
first delete unnecessary files, empty wastebaskets etc.
before defragmenting your drive.
This way you can allocate more space to other partitions.
If the program complains there are still files at the end
of your drive it is probably hidden files generated by
Microsoft Mirror or Norton Image.
These are probably called <tt/image.idx/ and <tt/image.dat/ and
contain backups of some system files.
There are reports that in some Windows defragmentation programs
you should make sure the box "allow Windows to move files around"
is <em/not/ checked, otherwise you will end up with some files
in the last cylinder of the partition which will prevent FIPS
from reclaiming space.
If you still have unmovable files at the end of your DOS partition
you should get the DOS program
<url url="http://www8.pair.com/dmurdoch/programs/showfat.htm"
name="showfat">
version 3.0 or higher.
This shows you what files are where so you can deal with them
directly.
A freeware alternative is
<!-- <url url="http://members.xoom.com/Zeleps" 001203 -->
<url url="http://members.nbci.com/Zeleps/"
name="Partition Resizer">
which can shrink, grow and move partitions.
Some versions of DOS / Windows have a hidden flag for <tt/defrag/, "<tt>/P</tt>", that
causes <tt/defrag/ to move even hidden files. Use at your own risk.
Repartitioning is as dangerous a process as any other partitioning
so you are advised to have a fresh backup handy.
<sect1>Microsoft Partition Bug <label id="microsoft-partition-bug">
<p>
<nidx>disk!implementation!Microsoft Partition Bug</nidx>
<nidx>disk!Microsoft!nasty bug</nidx>
In Microsoft products all the way up to Win 98 there is a tricky bug
that can cause you a bit of trouble:
if you have several primary <tt/fat/ partitions
and the last extended partition is not a <tt/fat/ partition
the Microsoft system will try to mount the last partition as if
it were a FAT partition in place of the last primary FAT partition.
There is more
<!-- <url url="http://www.v-com.com/support/osinstalls/notes/95notes.html" -->
<url url="http://www.v-com.com/"
name="information">
available on the net on this.
To avoid this you can place a small logical <tt/fat/ partition
at the very end of your disk.
More information on multi OS installations is available at
<url url="http://www.v-com.com/"
name="V Communications"> but they keep rearranging the
links continuously so no direct links can be offered here.
Since some hardware comes with setup software that is available
under DOS only this could come in handy anyway. Notable examples
are RAID controllers from DPT and a number of networking cards.
<sect1>Multiple Devices (<tt>md</tt>)
<p>
<nidx>disk!implementation!multiple devices</nidx>
<nidx>disk!implementation!devices, multiple</nidx>
This kernel feature is in a state of flux so you should make sure to read
the latest documentation on it. It is not yet stable, beware.
Briefly explained it works by adding partitions together into new
devices <tt/md0/, <tt/md1/ etc. using <tt/mdadd/ before you activate
them using <tt/mdrun/. This process can be automated using the file
<tt>/etc/mdtab</tt>.
The latest <tt/md/ system uses a <!-- <file>/etc/raidtab</file> -->
<htmlurl url="file:///etc/raidtab"
name="/etc/raidtab">
and
a different syntax. Make sure your RAID-tools package matches
the <tt/md/ version as the internal protocol has changed.
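Purely as an illustration of the newer syntax, a minimal <tt>/etc/raidtab</tt>
describing a two disk RAID 0 array could look roughly like the sketch below;
the device names and chunk size are assumptions and you must check the
documentation matching your <tt/md/ and RAID-tools versions:
<tscreen><verb>
# /etc/raidtab -- striping two SCSI partitions into /dev/md0
raiddev /dev/md0
    raid-level              0
    nr-raid-disks           2
    persistent-superblock   1
    chunk-size              32
    device                  /dev/sda1
    raid-disk               0
    device                  /dev/sdb1
    raid-disk               1
</verb></tscreen>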
You then treat these like any other partition on a drive. Proceed
with formatting etc. as described below using these new devices.
There is now also a HOWTO in development for RAID using <tt/md/ which you
should read.
<sect1>Formatting
<p>
<nidx>disk!implementation!formatting</nidx>
Next comes partition formatting, putting down the data structures that will
describe the files and where they are located. If this is the first time it
is recommended you use formatting with verify. Strictly speaking it should
not be necessary but this exercises the I/O hard enough that it can uncover
potential problems, such as incorrect termination, before you store your
precious data. Look up the command <tt/mkfs/ for more details.
Linux can support a great number of file systems, rather than repeating
the details you can read the man page for <tt/fs/ which describes them in
some details. Note that your kernel has to have the drivers compiled in
or made as modules in order to be able to use these features. When the time
comes for kernel compiling you should read carefully through the file system
feature list. If you use <tt/make menuconfig/ you can get online help for
each file system type.
Note that some rescue disk systems require <tt/minix/, <tt/msdos/ and <tt/ext2fs/
to be compiled into the kernel.
Also swap partitions have to be prepared, and for this you use <tt/mkswap/.
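As a small sketch, assuming a new data partition on <tt>/dev/sdb1</tt> and a
swap partition on <tt>/dev/sdb2</tt>, this could look like:
<tscreen><verb>
# make an ext2 file system, checking for bad blocks in the process
mke2fs -c /dev/sdb1
# prepare the swap partition and put it into use
mkswap /dev/sdb2
swapon /dev/sdb2
</verb></tscreen>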
Some important notes on formatting with DOS and Windows can be found in
<url url="http://members.aol.com/axcel216/secrets.htm"
name="MS-DOS 5.00 - 7.10 Undocumented, Secret + Hidden Features">.
Note that this formatting is high level formatting, which writes the file
system to the disk, as opposed to low level formatting that lays down
tracks and sectors. The latter is hardly ever needed these days.
<sect1>Mounting
<p>
<nidx>disk!implementation!mounting</nidx>
Data on a partition is not available to the file system until it is mounted
on a mount point. This can be done manually using <tt/mount/ or automatically
during booting by adding appropriate lines to <tt>/etc/fstab</tt>. Read the
manual for <tt/mount/ and pay close attention to the tabulation.
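For instance, mounting a partition by hand could look like this; the device
name and mount point are only examples:
<tscreen><verb>
mkdir -p /mnt/data
mount -t ext2 /dev/sdb1 /mnt/data
df /mnt/data            # check that it is mounted
umount /mnt/data        # unmount again when you are done
</verb></tscreen>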
<sect1><tt/fstab/
<p>
<nidx>disk!implementation!fstab</nidx>
<nidx>disk!fstab</nidx>
During the booting process the system mounts all partitions
as described in the <tt/fstab/ file which can look something
like this:
<tscreen><verb>
# <file system>    <mount point>   <type>   <options>   <dump>   <pass>
/dev/hda2          /               ext2     defaults    0        1
None               none            swap     sw          0        0
proc               /proc           proc     defaults    0        0
/dev/hda1          /dosc           vfat     defaults    0        1
</verb></tscreen>
This file is somewhat sensitive to the formatting used so it
is best and also most convenient to edit it using one of the
editing tools made for this purpose,
such as
<url url="http://www.bit.net.au/&tilde;bhepple/fstool/"
name="on the netfstool">, a Tcl/Tk-based file system mounter,
and
<url url="http://kfstab.purespace.de/kfstab/"
name="kfstab">, an editing tool for KDE.
Briefly, the fields are partition name, where to mount the partition,
type of file system, mount options, when to dump for backup
and when to do <tt/fsck/.
Linux offers the possibility of parallel file checking (<tt/fsck/)
but to be efficient it is important not to <tt/fsck/ more than one
partition on a drive at a time.
<sect1>Mount options
<p>
<nidx>disk!mount</nidx>
Mounting, either by hand or using the <tt>fstab</tt>, allows for
a number of options that offers extra protection. Below are some
of the more useful options.
<descrip>
<tag/nodev/ Do not interpret character or block special
devices on the file system.
<tag/noexec/ This disallows execution of any binaries on
the mounted file system. Useful in spool areas.
<tag/nosuid/ This disallows set-user-identifier or
set-group-identifier on the mounted file system.
Useful in home directories.
</descrip>
For more information and cautions refer to the man page
for <tt/mount/ and <tt/fstab/.
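As an example, a home directory partition (device name assumed) could be given
some extra protection with a line like this in <tt>/etc/fstab</tt>:
<tscreen><verb>
/dev/sdb1    /home    ext2    defaults,nosuid,nodev    0    2
</verb></tscreen>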
<sect1>Recommendations
<p>
Having constructed and implemented your clever scheme
you are well advised to make a complete record of it all, on paper.
After all having all the necessary information on disk is no use
if the machine is down.
Partition tables can be damaged or lost, in which case it is
excruciatingly important that you enter the exact same numbers
into <tt/fdisk/ so you can rescue your system.
You can use the program <tt/printpar/ to make a clear record
of the tables. Also write down the SCSI numbers or IDE names
for each disk so you can put the system together again in the
right order.
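A simple way of making such a record, assuming <tt/sfdisk/ is installed, the
drive is <tt>/dev/hda</tt> and a printer is set up, is:
<tscreen><verb>
# dump the partition table in a form that can later be fed back to sfdisk
sfdisk -d /dev/hda > hda.partition-table
# print it for the paper archive
lpr hda.partition-table
</verb></tscreen>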
There is also a small script in appendix
<ref id="disk-documenter" name="Appendix M: Disk System Documenter">
which will generate a summary of your disk configurations.
For checking your hard disks you can use the Disk Advisor boot disk
available
<url url="http://www.ontrack.com/"
name="on the net">.
The disk builder requires Windows to run. This system is useful to
diagnose failed disks.
You are strongly recommended to make a rescue disk and <em>test</em> it.
Most distributions make one available and it is often part of the
installation disks. For some, such as the one for Redhat 6.1, the way
to invoke the disk as a rescue disk is to type <em>linux rescue</em>
at the boot prompt.
There are also specialised rescue disk distributions available
on the net.
When the need arises you will have to know where your root and boot
partitions reside, so write this down and keep it safe.
Note: the difference between a boot disk and a rescue disk is that
a boot disk will fail if it cannot mount the file system, typically
on your hard disk. A rescue disk is self contained and will work
even if there are no hard disks.
<sect>Maintenance
<p>
<nidx>disk!maintenance</nidx>
It is the duty of the system manager to keep an eye on the drives
and partitions. Should any of the partitions overflow, the system
is likely to stop working properly, no matter how much space is
available on other partitions, until space is reclaimed.
Partitions and disks are easily monitored using <tt>df</tt> and
should be done
frequently, perhaps using a cron job or some other general system
management tool.
Do not forget the swap partitions, these are best monitored using
one of the memory statistics programs such as
<tt>free</tt>, <tt>procinfo</tt> or <tt>top</tt>.
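A very simple way of automating this, assuming cron and local mail delivery
are working, is a crontab entry along these lines:
<tscreen><verb>
# mail a daily disk and swap usage report to root at 07:00
0 7 * * *   (df -k ; free) | mail -s "disk usage report" root
</verb></tscreen>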
Drive usage monitoring is more difficult but it is important for
the sake of performance to avoid contention - placing too much
demand on a single drive if others are available and idle.
It is important when installing software packages to have a clear
idea where the various files go. As previously mentioned GCC keeps
binaries in a library directory and there are also other programs
that for historical reasons are hard to figure out; X11 for instance
has an unusually complex structure.
When your system is about to fill up it is about time to check and
prune old logging messages as well as hunt down core files. Proper
use of <tt/ulimit/ in global shell settings can save you
from having core files littered around the system.
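For instance, assuming <tt/bash/ users and a system wide <tt>/etc/profile</tt>,
something like this could do:
<tscreen><verb>
# in /etc/profile: stop shells from dumping core files at all
ulimit -c 0
# hunt down stray core files, run by hand or from cron
find / -name core -type f -exec ls -l {} \;
</verb></tscreen>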
<sect1>Backup
<p>
<nidx>disk!maintenance!backup</nidx>
The observant reader might have noticed a few hints about the usefulness
of making backups. Horror stories are legion about accidents and what
happened to the person responsible when the backup turned out to be
non-functional or even non-existent. You might find it simpler to invest
in proper backups than a second, secret identity.
There are many options and also a mini-HOWTO ( <tt/Backup-With-MSDOS/ )
detailing what you need to know. In addition to the DOS specifics it
also contains general information and further leads.
In addition to making these backups you should also make sure you can
restore the data. Not all systems verify that the data written is
correct and many administrators have started restoring the system after
an accident happy in the belief that everything is working, only to
discover to their horror that the backups were useless. Be careful.
<!-- 0.24 -->
There are both free and commercial backup systems available for Linux.
One commercial example is the disk image level backup system from
<url url="http://www.estinc.com/"
name="QuickStart">
offering a full function 30 day Linux demo available online.
<sect1>Defragmentation
<p>
<nidx>disk!maintenance!defragmentation</nidx>
This is very dependent on the file system design, some suffer fast and
nearly debilitating fragmentation. Fortunately for us, <tt/ext2fs/ does
not belong to this group and therefore there has been very little talk
about defragmentation tools. It does in fact exist but is hardly ever
needed.
If for some reason you feel this is necessary, the quick and easy solution
is to do a backup and a restore. If only a small area is affected, for instance
the home directories, you could <tt/tar/ it over to a temporary area on
another partition, <em/verify/ the archive, delete the original
and then untar it back again.
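A rough sketch of this, assuming the area is <tt>/home</tt> and there is room
in <tt>/mnt/scratch</tt> on another partition, could be:
<tscreen><verb>
cd /home
tar cf /mnt/scratch/home.tar .    # archive onto another partition
tar df /mnt/scratch/home.tar      # verify the archive against the originals
# only when the verify comes back completely clean:
rm -rf /home/*
tar xpf /mnt/scratch/home.tar -C /home
</verb></tscreen>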
<sect1>Deletions
<p>
<nidx>disk!maintenance!deletions</nidx>
Quite often disk space shortages can be remedied simply by deleting unnecessary
files that accumulate around the system. Programs that terminate
abnormally often cause all kinds of mess lying around the oddest places. Normally a
core dump results after such an incident and unless you are going to debug it
you can simply delete it. These can be found everywhere so you are advised to do
a global search for them now and then.
The <tt>locate</tt> command is useful for this.
Unexpected termination can also cause all sorts of temporary files remaining in
places like <tt>/tmp</tt> or <tt>/var/tmp</tt>, files that are automatically
removed when the program ends normally. Rebooting cleans up some of these areas
but not necessarily all of them, and if you have a long uptime you could end up with a lot
of old junk. If space is short you have to delete with care, make sure the file
is not in active use first. Utilities like <tt/file/ can often tell you what kind
of file you are looking at.
Many things are logged when the system is running, mostly to files in the
<tt>/var/log</tt> area. In particular the file <tt>/var/log/messages</tt> tends
to grow until deleted. It is a good idea to keep a small archive of old log
files around for comparison should the system start to behave oddly.
If the mail or news system is not working properly you could have
excessive growth in their spool areas, <tt>/var/spool/mail</tt>
and <tt>/var/spool/news</tt> respectively. Beware of the overview files
as these have a leading dot which makes them invisible to <tt/ls -l/; it
is always better to use <tt/ls -Al/ which will reveal them.
User space overflow is a particularly tricky topic. Wars have been waged between
system administrators and users. Tact, diplomacy and a generous budget for new
drives is what is needed. Make use of the message-of-the-day feature, information
displayed during login from the <tt>/etc/motd</tt> file to tell users when space
is short.
Setting the default shell settings to prevent core files being dumped can save you
a lot of work too.
Certain kinds of people try to hide files around the system,
usually trying to take advantage of the fact that
files with a leading dot in the name are invisible to the <tt/ls/ command.
One common example is files that look like <tt/.../ that
normally either are not seen,
or, when using <tt/ls -al/ disappear in the noise of normal files
like <tt/./ or <tt/../ that are in every directory.
There is however a countermeasure to this:
use <tt/ls -Al/ which suppresses <tt/./ and <tt/../ but shows all other dot-files.
<sect1>Upgrades
<p>
<nidx>disk!maintenance!upgrades</nidx>
No matter how large your drives, time will come when you will find you
need more. As technology progresses you can get ever more for your
money. At the time of writing this, it appears that 6.4 GB drives give
you the most bang for your buck.
Note that with IDE drives you might have to remove an old drive, as the
maximum number supported on your motherboard is normally only 2 or
sometimes 4. With SCSI you can have up to 7 for narrow (8-bit) SCSI or up to
15 for wide (16-bit) SCSI, per channel. Some host adapters can support
more than a single channel and in any case you can have more than
one host adapter per system. My personal recommendation is that you will
most likely be better off with SCSI in the long run.
The question comes, where should you put this new drive? In many cases
the reason for expansion is that you want a larger spool area, and in that
case the fast, simple solution is to mount the drive somewhere under
<tt>/var/spool</tt>. On the other hand newer drives are likely to be
faster than older ones so in the long run you might find it worth your
time to do a full reorganizing, possibly using your old design sheets.
If the upgrade is forced by running out of space in partitions used for
things like <tt>/usr</tt> or <tt>/var</tt> the upgrade is a little more
involved. You might consider the option of a full re-installation from
your favourite (and hopefully upgraded) distribution. In this case you
will have to be careful not to overwrite your essential setups. Usually
these things are in the <tt>/etc</tt> directory. Proceed with care,
fresh backups and working rescue disks. The other possibility is to
simply copy the old directory over to the new directory which is
mounted on a temporary mount point, edit your <tt>/etc/fstab</tt> file,
reboot with your new partition in place and check that it works.
Should it fail you can reboot with your rescue disk, re-edit
<tt>/etc/fstab</tt> and try again.
Until volume management becomes available to Linux this is both
complicated and dangerous. Do not get too surprised if you discover
you need to restore your system from a backup.
The Tips-HOWTO gives the following example on how to move an entire
directory structure across:
<code>
(cd /source/directory; tar cf - . ) | (cd /dest/directory; tar xvfp -)
</code>
While this approach to moving directory trees is portable among many
Unix systems, it is inconvenient to remember. Also, it fails for
deeply nested directory trees when pathnames become too long to
handle for tar (GNU tar has special provisions to deal with long
pathnames).
If you have access to GNU cp (which is always the case on Linux
systems), you could as well use
<code>
cp -av /source/directory /dest/directory
</code>
GNU cp knows specifically about symbolic links, hard links,
FIFOs and device files and will copy them correctly.
Remember that it might not be a good idea to try to transfer
<tt>/dev</tt> or <tt>/proc</tt>.
There is also a
<url url="http://www.storm.ca/&tilde;yan/Hard-Disk-Upgrade.html"
name="Hard Disk Upgrade mini-HOWTO">
that gives you a step by step guide on migrating an entire
Linux system, including LILO, from one hard disk to another.
<sect1>Recovery
<p>
<nidx>disk!maintenance!recovery</nidx>
<nidx>disk!gpart</nidx>
<nidx>disk!dos tool!findpart</nidx>
<nidx>disk!dos tool!editpart</nidx>
<nidx>disk!dos tool!findfat</nidx>
<nidx>disk!dos tool!getsect</nidx>
<nidx>disk!dos tool!putsect</nidx>
<nidx>disk!dos tool!cyldir</nidx>
<nidx>disk!dos tool!cdir</nidx>
System crashes come in many and entertaining flavours, and
partition table corruption always guarantees plenty of excitement.
A recent and undoubtedly useful tool for those of us who
are happy with the normal level of excitement, is
<url url="http://www.stud.uni-hannover.de/user/76201/gpart/"
name="gpart">
which means "Guess PC-Type hard disk partitions". Useful.
In addition there are some
<url url="http://inet.uni2.dk/&tilde;svolaf/utilities.htm"
name="partition utilities">
available under DOS.
<sect1>Rescue Disk
<p>
<nidx>disk!maintenance!rescue disk</nidx>
Upgrades of kernel and hardware are not uncommon in the Linux world
and it is therefore important that you prepare an updated rescue
disk especially when you use special drivers to access your hardware.
Rescue disks can be gotten off the net, from your distribution or
you can put one together yourself. Do make sure the boot and root
parameters are set so the kernel will know where to find your system.
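A minimal home made boot floppy, assuming the kernel image is in
<tt>/boot/vmlinuz</tt> and the root partition is <tt>/dev/hda2</tt>, could be
made like this:
<tscreen><verb>
# copy the kernel image onto a floppy
dd if=/boot/vmlinuz of=/dev/fd0
# tell the kernel on the floppy where the root file system is
rdev /dev/fd0 /dev/hda2
</verb></tscreen>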
If you don't have a recovery floppy you can use the
<url url="http://www.gnu.org/software/grub/"
name="GRUB"> boot loader
to load from a Linux kernel somewhere on disk, with arguments.
<sect>Advanced Issues
<p>
<nidx>disk!advanced topics</nidx>
Linux and related systems offer plenty of possibilities for fast, efficient
and devastating destruction. This document is no exception. With power comes
danger and the following sections describe a few more esoteric issues that
should not be attempted before reading and understanding the documentation,
the issues and the dangers. You should also make a backup. Also remember
to try to restore the system from scratch from your backup at least once.
Otherwise you might not be the first to be found with a perfect backup of
your system and no tools available to reinstall it (or, even more
embarrassing, some critical files missing on tape).
The techniques described here are rarely necessary but can be used for very
specific setups. Think very clearly through what you wish to accomplish
before playing around with this.
<sect1>Hard Disk Tuning
<p>
<nidx>disk!advanced topics!tuning, hard disk</nidx>
The hard drive parameters can be tuned using the <tt/hdparm/ utility. Here
the most interesting parameter is probably the read-ahead parameter which
determines how much prefetch should be done in sequential reading.
If you want to try this out it makes most sense to tune for the
characteristic file size on your drive but remember that this tuning is for
the <em/entire/ drive which makes it a bit more difficult. Probably this is
only of use on large servers using dedicated news drives etc.
For safety the default hdparm settings are rather conservative. The
disadvantage is that this means you can get lost interrupts if you have
a high frequency of IRQs as you would when using the serial port and
an IDE disk as IRQs from the latter would mask other IRQs. This would
be noticeable as less than ideal performance when downloading data from
the net to disk. Setting <tt/hdparm -u1 device/ would prevent this
masking and either improve your performance or, depending on hardware,
corrupt the data on your disk. Experiment with caution and fresh
backups.
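A cautious way of experimenting, assuming an IDE drive on <tt>/dev/hda</tt>,
could look like this:
<tscreen><verb>
hdparm /dev/hda          # show the current settings
hdparm -tT /dev/hda      # measure baseline throughput
hdparm -u1 /dev/hda      # unmask interrupts - test carefully!
hdparm -a64 /dev/hda     # set read-ahead to 64 sectors
hdparm -tT /dev/hda      # measure again and compare
</verb></tscreen>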
For more information read the article
<url url="http://www.linuxforum.com/plug/articles/needforspeed.html"
name="The Need For Speed">
on tuning with <tt>hdparm</tt>.
<sect1>File System Tuning
<p>
<nidx>disk!advanced topics!tuning, filesystem</nidx>
Most file systems come with a tuning utility and for <tt/ext2fs/ there is
the <tt/tune2fs/ utility. Several parameters can be modified but perhaps
the most useful parameters here are how much space should be reserved and who
should be able to take advantage of it, which could help you get more useful
space out of your drives, possibly at the cost of less room for repairing
a system should it crash.
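A small sketch, assuming an <tt/ext2fs/ data partition on <tt>/dev/sdb1</tt>:
<tscreen><verb>
tune2fs -l /dev/sdb1     # list the current file system parameters
tune2fs -m 1 /dev/sdb1   # reduce the reserved blocks from 5 to 1 percent
</verb></tscreen>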
<sect1>Spindle Synchronizing
<p>
<nidx>disk!advanced topics!spindle synchronization</nidx>
This should not in itself be dangerous, other than the peculiar fact that
the exact details of the connections remain unclear for many drives. The
theory is simple: keeping a fixed phase difference between the different
drives in a RAID setup makes for less waiting for the right track to come
into position for the read/write head. In practice it now seems that with
large read-ahead buffers in the drives the effect is negligible.
Spindle synchronisation should not be used on RAID0 or RAID 0/1 as you
would then lose the benefit of having the read heads over different
areas of the mirrored sectors.
<sect>Troubleshooting <label id="troubleshooting">
<p>
<nidx>disk!troubleshooting</nidx>
Much can go wrong and this is the start of a growing list of symptoms,
problems and solutions:
<sect1>During Installation
<sect2>Locating Disks
<p>
<descrip>
<tag/Symptoms/Cannot find disk
<tag/Problem/How to find what drive letter corresponds to what disk/partition
<tag/Solution/Remember Linux does not use drive letters but device names. More
information can be found in <ref id="drive-names" name="Drive names">.
</descrip>
<p>
<descrip>
<tag/Symptoms/Cannot partition disk
<tag/Problem/Most likely wrong input to the command line for <tt/fdisk/ or similar tool.
<tag/Solution/Remember to use <tt>/dev/hda</tt> rather than just <tt>hda</tt>. Also
do not use numbers behind <tt>hda</tt>, those indicate partitions.
</descrip>
<sect2>Formatting
<p>
<descrip>
<tag/Symptoms/Cannot format disk.
<tag/Problem/Strictly speaking you format partitions not disks.
<tag/Solution/Make sure you add the partition number after the device name
of the disk, for instance <tt>/dev/hda1</tt> to the command line.
</descrip>
<sect1>During Booting
<sect2>Booting fails
<p>
<descrip>
<tag/Symptoms/ Numbers keep scrolling up the screen.
<tag/Problem/ Possibly corrupt disk.
<tag/Solution/ Try another disk, you might have to reinstall. Check for
loose cables and possible data corruption.
</descrip>
<p>
<descrip>
<tag/Symptoms/ Get <tt/LI/ and then it hangs.
<tag/Problem/You use LILO to load Linux but LILO cannot find your root.
<tag/Solution/ Read the LILO HOWTO.
</descrip>
<p>
<descrip>
<tag/Symptoms/Kernel panics, something about missing root file system.
<tag/Problem/The kernel does not know where the root partition is.
<tag/Solution/Use <tt/rdev/ or (if applicable) LILO to add information
to the kernel image where your root is.
</descrip>
<sect2>Getting into Single User Mode
<p>
<descrip>
<tag/Symptoms/System boots but get into a root shell in single user mode.
<tag/Problem/Something went wrong in the later stages of booting and the
system has come far enough to let you open a shell to repair the system.
<tag/Solution/Locate the problems from the boot log. Note that the file system
can be in read-only mode. Remount it read-write if you have to. Often the
reason is that the <tt>/etc/fstab</tt> contained an entry that was mismapped
such as trying to mount a swap partition as your normal file space.
</descrip>
<sect1>During Running
<sect2>Swap
<p>
<descrip>
<tag/Symptoms/Short on memory
<tag/Problem/Swap space is not available
<tag/Solution/Type <tt/free/ and check the output. If you get
<tscreen><verb>
             total       used       free     shared    buffers     cached
Mem:         46920      30136      16784       7480      11788       5764
-/+ buffers/cache:      12584      34336
Swap:       128484       9176     119308
</verb></tscreen>
then the system is running normally. If the line with <tt/Swap:/ contains zeros
you have either not enabled the swap space (partition or swap file)
(see <tt>swapon(8)</tt>)
or not formatted the swap space (see <tt>mkswap(8)</tt>).
</descrip>
<sect2>Partitions
<p>
<descrip>
<tag/Symptoms/No room amidst plenty 1
<tag/Problem/Partitionitis: underdimensioned partitions
have caused overflow in some areas.
<tag/Solution/Examine your partition usage using <tt>df(1)</tt> and locate
problem areas. Normally the problem can be solved by removing old junk but
you might have to repartition your system,
see section <ref id="repartitioning" name="Repartitioning">.
</descrip>
<p>
<descrip>
<tag/Symptoms/No room amidst plenty 2
<tag/Problem/Running out of i-nodes has caused overflow in some areas,
often in areas with many small files such as news spool.
<tag/Solution/Examine your partition usage using <tt>df -i</tt> and locate
problem areas. Normally the problem is solved by reformatting using
a higher number of i-nodes, see <tt>mkfs(8)</tt> and related man pages.
</descrip>
<!--
<sect2>
<p>
<descrip>
<tag/Symptoms/
<tag/Problem/
<tag/Solution/
</descrip>
-->
<sect>Further Information
<p>
<nidx>disk!information resources</nidx>
There is a wealth of information one should go through when setting up a
major system, for instance for a news or general Internet service provider.
The FAQs in the following groups are useful:
<sect1>News groups
<p>
<nidx>disk!information resources!news groups</nidx>
Some of the most interesting news groups are:
<itemize>
<item><url url="news:comp.arch.storage" name="Storage">.
<item><url url="news:comp.sys.ibm.pc.hardware.storage" name="PC storage">.
<item><url url="news:alt.filesystems.afs" name="AFS">.
<item><url url="news:comp.periphs.scsi" name="SCSI">.
<item><url url="news:comp.os.linux.setup" name="Linux setup">.
</itemize>
Most newsgroups have their own FAQ that is designed to answer most of your
questions, as the name Frequently Asked Questions indicates. Fresh versions
should be posted regularly to the relevant newsgroups. If you cannot find it
in your news spool you could go directly to the
<url url="ftp://rtfm.mit.edu"
name="FAQ main archive FTP site">. The WWW versions can be browsed at
<!-- <url url="http://www.cis.ohio-state.edu/hypertext/faq/usenet/FAQ-List.html" 000501 -->
<url url="http://www.faqs.org/faqs/FAQ-List.html"
name="FAQ main archive WWW site">.
Some FAQs have their own home site, of particular interest here are
<itemize>
<item><url url="http://www.scsifaq.org/"
name="SCSI FAQ"> and
<item><url url="http://alumni.caltech.edu/&tilde;rdv/comp-arch-storage/FAQ-1.html"
name="comp.arch.storage FAQ">.
</itemize>
<!-- http://alumni.caltech.edu/&tilde;rdv/comp&lowbar;arch&lowbar;storage/FAQ-1.html" -->
<sect1>Mailing Lists
<p>
<nidx>disk!information resources!mailing lists</nidx>
These are low noise channels mainly for developers. Think
twice before asking questions there as noise delays the development.
Some relevant lists are <tt/linux-raid/, <tt/linux-scsi/ and <tt/linux-ext2fs/.
Many of the most useful mailing lists run on the <tt>vger.rutgers.edu</tt> server
but this is notoriously overloaded, so try to find a mirror. There are some lists mirrored at
<url url="http://www.redhat.com"
name="The Redhat Home Page">.
Many lists are also accessible at
<url url="http://www.linuxhq.com/lnxlists/"
name="linuxhq">,
and the rest of the web site is a gold mine of useful information.
If you want to find out more about the lists available you can send a message
with the line <tt/lists/ to the list server at vger.rutgers.edu (
<htmlurl url="mailto:majordomo@vger.rutgers.edu"
name="majordomo@vger.rutgers.edu">).
If you need help on how to use the mail server just send the line <tt/help/
to the same address.
Due to the popularity of this server it is likely it takes a bit of time before
you get a reply or even get messages after you send a <tt/subscribe/ command.
There are also a number of other majordomo list servers that can be of interest
such as the EATA driver list (
<htmlurl url="mailto:linux-eata@mail.uni-mainz.de"
name="linux-eata@mail.uni-mainz.de">)
and the Intelligent IO list
<htmlurl url="mailto:linux-i2o@dpt.com"
name="linux-i2o@dpt.com">.
Mailing lists are in a state of flux but you can find links to a number of
interesting lists from the
<url url="http://www.linuxdoc.org/"
name="Linux Documentation Homepage">.
<sect1>HOWTO
<p>
<nidx>disk!information resources!HOWTOs</nidx>
These are intended as the primary starting points to
get the background information as well as show you how to solve
a specific problem.
Some relevant HOWTOs are <tt/Bootdisk/, <tt/Installation/, <tt/SCSI/ and <tt/UMSDOS/.
The main site for these is the
<url url="http://www.linuxdoc.org/"
name="LDP archive">.
<!-- at Metalab (formerly known as Sunsite). -->
There is a new HOWTO out that deals with setting up a
DPT RAID system, check out the
<url url="http://www.ram.org/computing/linux/dpt&lowbar;raid.html"
name="DPT RAID HOWTO homepage">.
<sect1>Mini-HOWTO
<p>
<nidx>disk!information resources!mini-HOWTOs</nidx>
These are the smaller free text relatives to the HOWTOs.
Some relevant mini-HOWTOs are
<tt/Backup-With-MSDOS/, <tt/Diskless/, <tt/LILO/, <tt/Large Disk/,
<tt/Linux+DOS+Win95+OS2/, <tt/Linux+OS2+DOS/, <tt/Linux+Win95/,
<tt/NFS-Root/, <tt/Win95+Win+Linux/, <tt/ZIP Drive/ .
You can find these at the same place as the HOWTOs, usually in a sub directory
called <tt/mini/. Note that these are scheduled to be converted into SGML and
become proper HOWTOs in the near future.
The old <tt/Linux Large IDE mini-HOWTO/ is no longer valid, instead read
<tt>/usr/src/linux/drivers/block/README.ide</tt> or
<tt>/usr/src/linux/Documentation/ide.txt</tt>.
<sect1>Local Resources
<p>
<nidx>disk!information resources!local</nidx>
In most distributions of Linux there is a document directory installed,
have a look in the
<htmlurl url="file:///usr/doc"
name="/usr/doc"> directory.
where most packages store their main documentation and README files etc.
Also you will here find the HOWTO archive (
<htmlurl url="file:///usr/doc/HOWTO"
name="/usr/doc/HOWTO">)
of ready formatted HOWTOs
and also the mini-HOWTO archive (
<url url="file:///usr/doc/HOWTO/mini"
name="/usr/doc/HOWTO/mini">)
of plain text documents.
Many of the configuration files mentioned earlier can be found in the
<htmlurl url="file:///etc"
name="/etc">
directory. In particular you will want to work with the
<htmlurl url="file:///etc/fstab"
name="/etc/fstab">
file that sets up the mounting of partitions
and possibly also
<htmlurl url="file:///etc/mdtab"
name="/etc/mdtab">
file that is used for the <tt/md/ system to set up RAID.
The kernel source in
<url url="file:///usr/src/linux"
name="/usr/src/linux">
is, of course, the ultimate documentation. In other
words, <em>use the source, Luke</em>.
It should also be pointed out that the kernel comes not only with
source code which is even commented (well, partially at least)
but also an informative
<url url="file:///usr/src/linux/Documentation"
name="documentation directory">.
If you are about to ask any questions about the kernel you should
read this first, it will save you and many others a lot of time
and possibly embarrassment.
Also have a look in your system log file (
<htmlurl url="file:///var/log/messages"
name="/var/log/messages">)
to see what is going on and in particular how the booting went if
too much scrolled off your screen. Using <tt>tail -f /var/log/messages</tt>
in a separate window or screen will give you a continuous update of what is
going on in your system.
You can also take advantage of the
<htmlurl url="file:///proc"
name="/proc">
file system that is a window into the inner workings of your system.
Use <tt/cat/ rather than <tt/more/ to view the files as they are
reported as being zero length. Reports are that <tt/less/ works well here.
<!-- removed 221198
Much of the work here is based on the Filesystem Structure Standard (FSSTND).
It has changed name to File Hierarchy Standard (FHS) and is less Linux
specific.
The maintainer has set up a
<url url="http://www.pathname.com/fhs"
name="home page">
which tells you how to join the currently private mailing list,
where the development takes place.
-->
<sect1>Web Pages
<p>
<nidx>disk!information resources!WWW</nidx>
<nidx>disk!information resources!web pages</nidx>
There is a huge number of informative web pages out there and by their very
nature they change quickly so don't be too surprised if these links quickly
become outdated.
A good starting point is of course the
<url url="http://www.linuxdoc.org/"
name="Linux Documentation Homepage">.
that is a information central for documentation, project pages and much, much more.
<itemize>
<item>Mike Neuffer, the author of the DPT caching RAID controller drivers, has some
interesting pages on
<!-- Old links updated 971021
<url url="http://www.i-connect.net/&tilde;mike/scsi"
name="SCSI">
and
<url url="http://www.i-connect.net/&tilde;mike/scsi/dpt"
name="DPT">.
-->
<url url="http://www.uni-mainz.de/&tilde;neuffer/scsi/"
name="SCSI">
and
<url url="http://www.uni-mainz.de/&tilde;neuffer/scsi/dpt/"
name="DPT">.
<item>Software RAID development information can be found at
<url url="http://www.kernel.org/"
name="Linux Kernel site">
along with patches and utilities.
<item>Disk related information on benchmarking, RAID, reliability and
much, much more can be found at
<url url="http://linas.org"
name="Linas Vepstas">
project page.
<item>There is also information available on how to
<url url="ftp://ftp.bizsystems.com/pub/raid/Root-RAID-HOWTO.html"
name="RAID the root partition">
and what software packages are needed to achieve this.
<item>In depth documentation on
<url url="http://step.polymtl.ca/&tilde;ldd/ext2fs/ext2fs&lowbar;toc.html"
name="ext2fs">
is also available.
<!-- moved 990126
<item>Mark D. Roth has information on
<url url="http://www.uiuc.edu/ph/www/roth"
name="VPS">
-->
<!-- moved
<item>A similar kind of project on an
<url url="http://www.virtual.net.au/&tilde;rjh/enh-fs.html"
name="Enhanced File System"> -->
<item>People who are looking for information on VFAT, FAT32 and Joliet
could have a look at the
<url url="http://bmrc.berkeley.edu/people/chaffee/index.html"
name="development page">.
<!-- Only minor details are missing before it comes into the kernel. -->
These drivers are in the 2.1.x kernel development series as well as
in 2.0.34 and later.
<!-- seems to be gone 001117
<item>For more information on booting and also some BSD information
have a look at
<url url="http://www.paranoia.com/&tilde;vax/boot.html"
name="booting information">
page. -->
</itemize>
For diagrams and information on all sorts of disk drives, controllers etc. both
for current and discontinued lines
<url url="http://theref.aquascape.com/theref.html"
name="The Ref">
is the site you need. There is a lot of useful information here, a real treasure trove.
<!--
You can also download the database using
<url url="ftp://theref.c3d.rl.af.mil/public"
name="FTP">.
-->
Please let me know if you have any other leads that can be of interest.
<sect1>Search Engines
<p>
<nidx>disk!information resources!search engines</nidx>
<nidx>disk!information resources!Troubleshooting mini-HOWTO</nidx>
<nidx>disk!information resources!Updated mini-HOWTO</nidx>
<!--
Remember you can also use the web search engines and that some, like
<itemize>
<item><url url="http://www.altavista.digital.com"
name="Altavista">
<item><url url="http://www.excite.com"
name="Excite">
<item><url url="http://www.hotbot.com"
name="Hotbot">
</itemize>
can also search Usenet News.
Also remember that
<url url="http://www.deja.com"
name="Deja">, formerly known as Dejanews,
is a dedicated news searcher that keeps a news spool
from early 1995 and onwards.
-->
When all else fails, try the Internet search engines. There is a huge number
of them, all a little different from each other. It falls outside the
scope of this HOWTO to describe how best to use them. Instead you
could turn to the Troubleshooting on the Internet mini-HOWTO and the
Updated mini-HOWTO.
If you have to ask for help you are most likely to get help in the
<url url="news:comp.os.linux.setup"
name="Linux Setup">
news group.
Due to a large workload and a slow network connection I am not able to
follow that newsgroup so if you want to contact me you have to do so
by e-mail.
<sect>Getting Help
<p>
<!-- New 971006 -->
<nidx>disk!assistance, obtaining</nidx>
In the end you might find yourself unable to solve your problems and need
help from someone else. The most efficient way is to ask someone
local or your nearest Linux user group; search the web to find the nearest
one.
Another possibility is to ask on Usenet News in one of the many, many
newsgroups available. The problem is that these have such a high
volume and so much noise (a low signal-to-noise ratio) that your question
can easily fall through unanswered.
No matter where you ask, it is important to ask well or you will not be
taken seriously. Saying just <it/my disk does not work/ is not going
to help you; it only increases the noise level further, and if
you are lucky someone will ask you to clarify.
Instead, describe your problems in enough detail to
enable people to help you. The problem could lie somewhere you did
not expect. Therefore you are advised to list the following information
about your system:
<descrip>
<tag/Hardware/
<itemize>
<item>Processor
<item>DMA
<item>IRQ
<item>Chip set (LX, BX etc)
<item>Bus (ISA, VESA, PCI etc)
<item>Expansion cards used (Disk controllers, video, IO etc)
</itemize>
<tag/Software/
<itemize>
<item>BIOS (On motherboard and possibly SCSI host adapters)
<item>LILO, if used
<item>Linux kernel version as well as possible modifications and patches
<item>Kernel parameters, if any
<item>Software that shows the error (with version number or date)
</itemize>
<tag/Peripherals/
<itemize>
<item>Type of disk drives with manufacturer name, version and type
<item>Other relevant peripherals connected to the same busses
</itemize>
</descrip>
As an example of how interrelated these problems are: an old chip set caused
problems with a certain combination of video controller and SCSI host adapter.
Remember that booting text is logged to <tt>/var/log/messages</tt> which can
answer most of the questions above. Obviously if the drives fail you might not
be able to get the log saved to disk but you can at least scroll back up the
screen using the <tt/SHIFT/ and <tt/PAGE UP/ keys. It may also be useful to
include part of this in your request for help but do not go overboard, keep
it <em/brief/ as a complete log file dumped to Usenet News is more than a
little annoying.
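As a small, hedged example of how to pull out just the disk related lines
rather than dumping the whole log, something along these lines should work
on most systems (adjust the log file name to match your distribution):
<tscreen><verb>
dmesg | egrep 'hd[a-z]|sd[a-z]|SCSI|scsi'

egrep 'hd[a-z]|sd[a-z]|SCSI|scsi' /var/log/messages | tail -50
</verb></tscreen>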
<sect>Concluding Remarks
<p>
<nidx>disk!conclusion</nidx>
Disk tuning and partition decisions are difficult to make, and there are no
hard rules here. Nevertheless it is a good idea to work more on this as the
payoffs can be considerable. Maximizing usage on one drive only while the
others are idle is unlikely to be optimal, so watch the drive lights, they are
not there just for decoration. For a properly set up system the lights should
look like Christmas in a disco. Linux offers software RAID as well as support
for some hardware based SCSI RAID controllers. Check what is available. As
your system and experiences evolve you are likely to repartition and you
might look on this document again. Additions are always welcome.
Finally I'd like to sum up my recommendations:
<itemize>
<item>Disks are cheap but the data they contain could be much more
valuable, use and test your backup system.
<item>Work is also expensive, make sure you get large enough disks
as refitting new or repartitioning old disks takes time.
<item>Think reliability, replace old disks before they fail.
<item>Keep a paper copy of your setup, having it all on disk when
the machine is down will not help you much.
<item>Start out with a simple design with a minimum of fancy technology
and rather fit it in later. In general adding is easier than replacing,
be it disks, technology or other features.
</itemize>
<sect1>Coming Soon
<p>
<nidx>disk!coming soon</nidx>
There are a few more important things that are about to appear here. In
particular I will add more example tables as I am about to set
up two fairly large and general systems, one at work and one at home. These
should give some general feeling on how a system can be set up for either
of these two purposes. Examples of smooth running existing systems are
also welcome.
There is also a fair bit of work left to do on the various kinds of file
systems and utilities.
There will be a big addition on drive technologies coming soon
as well as a more in depth description on using
<tt>fdisk</tt>, <tt>cfdisk</tt> and <tt>sfdisk</tt>.
The file systems will be beefed up as more features become available
as well as more on RAID and what directories can benefit from what
RAID level.
<!--
Also I hope to get some information from DPT who make the only RAID
controller supported by Linux so far. I have contacted them but have
yet to hear from them.
Recently I received an information pack from DPT, who made the first
hardware RAID supported by Linux. Their leaflets now carry the familiar
penguin logo to show they support Linux. More in-depth information will
come soon. -->
There is some minor overlapping with
the Linux Filesystem Structure Standard and FHS
that I hope to integrate better soon, which will
probably mean a big reworking
of all the tables at the end of this document.
As more people start reading this I should get some more
comments and feedback. I am also thinking of making a program
that can automate a fair bit of this decision making process
and although it is unlikely to be optimum it should provide
a simpler, more complete starting point.
<sect1>Request for Information
<p>
<nidx>disk!request for information</nidx>
It has taken a fair bit of time to write this document and although
most pieces are beginning to come together, there is still some
information needed before we are out of the beta stage.
<itemize>
<item> More information on swap sizing policies is needed as well as
information on the largest swap size possible under the various kernel
versions.
<item> How common is drive or file system corruption? So far I have
only heard of problems caused by flaky hardware.
<item> References to speed and drives are needed.
<item> Are any other Linux compatible RAID controllers available?
<!-- <item> Leads to file system, volume management and other related
software is welcome. -->
<item> What relevant monitoring, management and maintenance
tools are available?
<item> General references to information sources are needed, perhaps
this should be a separate document?
<item> Usage of <tt>/tmp</tt> and <tt>/var/tmp</tt> has been hard to
determine, in fact what programs use which directory is not well defined
and more information here is required. Still, it seems at least clear
that these should reside on different physical drives in order to increase
parallelism.
</itemize>
<sect1>Suggested Project Work
<p>
<nidx>disk!projects, suggested</nidx>
Now and then people post on comp.os.linux.*, looking for good project
ideas. Here I will list a few that come to mind that are relevant to
this document. Plans about big projects such as new file systems should
still be posted in order to either find co-workers or see if someone is
already working on it.
<descrip>
<tag/Planning tools/ that can automate the design process outlined
earlier would probably make a medium sized project, perhaps as an
exercise in constraint based programming.
<tag/Partitioning tools/ that take the output of the previously
mentioned program and format drives in parallel and apply the
appropriate symbolic links to the directory structure. It would
probably be best if this were integrated in existing system
installation software. The drive partitioning setup used in
Solaris is an example of what it can look like.
<tag/Surveillance tools/ that keep an eye on the partition sizes
and warn before a partition overflows. A minimal sketch of such a
watcher is given after this list.
<tag/Migration tools/ that safely let you move old structures to
new (for instance RAID) systems. This could probably be done as a
shell script controlling a back up program and would be rather
simple. Still, be sure it is safe and that the changes can be undone.
</descrip>
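As a minimal sketch of the surveillance idea above, a script along the
following lines could be run from <tt/cron/; the threshold, the mail
recipient and the exact <tt/df/ output format are assumptions you will
want to adjust for your own system.
<tscreen><verb>
#!/bin/sh
# Warn by mail when any file system is more than LIMIT percent full.
LIMIT=90
df -P | awk -v limit=$LIMIT \
    'NR > 1 { sub("%", "", $5); if ($5+0 > limit) print $6, $5 }' |
while read MOUNTPOINT USAGE; do
    echo "Warning: $MOUNTPOINT is $USAGE percent full" |
        mail -s "Disk space warning" root
done
</verb></tscreen>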
<sect>Questions and Answers
<p>
<nidx>disk!FAQ</nidx>
<nidx>disk!frequently asked questions</nidx>
This is just a collection of what I believe are the most common
questions people might have. Give me more feedback and I will
turn this section into a proper FAQ.
<itemize>
<item>Q:How many physical disk drives (spindles) does a Linux system need?
<p>
A: Linux can run just fine on one drive (spindle). Having enough
RAM (around 32 MB, and up to 64 MB) to support swapping is a
better price/performance choice than getting a second disk.
(E)IDE disks are usually cheaper (but a little slower) than SCSI.
<item>Q: I have a single drive, will this HOWTO help me?
<p>
A: Yes, although only to a minor degree. Still, section
<ref id="physical-track-positioning" name="Physical Track Positioning">
will offer you some gains.
<item>Q: Are there any disadvantages in this scheme?
<p>
A: There is only a minor snag: if even a single partition overflows
the system might stop working properly. The severity depends of course
on what partition is affected. Still this is not hard to monitor, the
command <tt/df/ gives you a good overview of the situation. Also check
the swap partition(s) using <tt/free/ to make sure you are not about
to run out of virtual memory.
<item>Q: OK, so should I split the system into as many partitions
as possible for a single drive?
<p>
A: No, there are several disadvantages to that. First of all maintenance
becomes needlessly complex and you gain very little in this. In fact if your
partitions are too big you will seek across larger areas than needed.
This is a balance and dependent on the number of physical drives you have.
<item>Q: Does that mean more drives allows more partitions?
<p>
A: To some degree, yes. Still, some directories should not be split
off from root, check out the file system standards for more details.
<item>Q: What if I have many drives I want to use?
<p>
A: If you have more than 3-4 drives you should consider using RAID of
some form. Still, it is a good idea to keep your root partition on a
simple partition without RAID, see section
<ref id="RAID" name="RAID"> for more details.
<item>Q: I have installed the latest Windows95 but cannot access this
partition from within the Linux system, what is wrong?
<p>
A: Most likely you are using <tt/FAT32/ in your windows partition. It
seems that Microsoft decided we needed yet another format, and this
was introduced in their latest version of Windows95, called OSR2.
The advantage is that this format is better suited to large drives.
You might also be interested to hear that Microsoft NT 4.0 does not
support it yet either.
<item>Q: I cannot get the disk size and partition sizes to match,
something is missing. What has happened?
<p>
A: It is possible you have mounted a partition onto a mount point that
was not an empty directory. Mount points are directories, and if the
directory is not empty the mounting will mask its contents. If you do the sums
you will see the amount of disk space used in this directory is
missing from the observed total.
To solve this you can boot from a rescue disk and see what is hiding
behind your mount points and remove or transfer the contents by
mounting the offending partition on a temporary mounting point. You
might find it useful to have "spare" emergency mounting points ready
made.
<item>Q: It doesn't look like my swap partition is in use, how come?
<p>
A: It is possible that it has not been necessary to swap out,
especially if you have plenty of RAM. Check your log files to see
if you ran out of memory at one point or another, in that case
your swap space should have been put to use. If not, it is
possible that the swap partition was not given the
right partition type number, that you did not prepare it with <tt/mkswap/, or
that you have not run <tt/swapon/ or added it to your
<htmlurl url="file:///etc/fstab"
name="/etc/fstab">
file. A minimal sketch of these steps is given after this list.
<item>Q: What is this Nyx that is mentioned several times here?
<p>
A: It is a large free Unix system with currently about 10000 users.
I use it for my web pages for this HOWTO as well as a source
of ideas for a setup of large Unix systems. It has been running for
many years and has a quite stable setup. For more information you can
view the
<url url="http://www.nyx.net"
name="Nyx homepage">
which also gives you information on how to get your own free account.
</itemize>
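As a minimal sketch of the swap steps mentioned above, using
<tt>/dev/sdb2</tt> purely as an example device name:
<tscreen><verb>
mkswap /dev/sdb2      # prepare the partition as swap space
swapon /dev/sdb2      # start using it right away
swapon -s             # list the swap areas currently in use

# To activate it at every boot, add a line like this to /etc/fstab:
/dev/sdb2   none   swap   sw   0   0
</verb></tscreen>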
<sect>Bits and Pieces <label id="bits-n-pieces">
<p>
<nidx>disk!miscellaneous</nidx>
This is basically a section where I stuff all the bits I have not yet
decided where to put, yet that I feel are worth knowing about. It is
a kind of transient area.
<!-- removed 990124
<sect1>Combining <tt/swap/ and <tt>/tmp</tt> <label id="comb-swap-n-tmp">
<p>
<nidx>disk!miscellaneous!swap and tmp combined</nidx>
Recently there have been discussions in the various Linux related
news groups about specialized file systems for temporary storage.
This is partly inspired by the <tt/tmpfs/ on *BSD* and Solaris, as
well as <tt/swapfs/ on the NeXT machines.
The rationale is that these are temporary storage that normally
does not require much space, yet in normal systems you need to
reserve a certain amount of space for these. Elementary statistical
knowledge tells you (very simplified) that when you sum a number of
variables the relative statistical uncertainty decreases. So combining
<tt/swap/ and <tt>/tmp</tt> you do not need to reserve as much space
as you otherwise would need.
This specialized file system is nothing more than a swappable RAM disk
that are swapped out to disk when and only when space is limited, thus
effectively putting temporary files on the swap partition.
There is, however, a snag. This scheme prevents you from getting
parallel activity on <tt/swap/ and <tt>/tmp</tt> drives so under
heavy activity the system takes a bigger performance hit. Put
another way, you trade speed to get space. Interleaving across
multiple drives reduces this somewhat.
Also there is the security problem with users bringing down the
machine by overflowing the <tt>/tmp</tt> directory.
-->
<!-- redundant
<sect1>Interleaved <tt/swap/ drives.
<p>
<nidx>disk!miscellaneous!interleaved swap drives</nidx>
This is not striping across several drives, instead drives are accessed
in a round robin fashion in order to spread the load in a crude fashion.
In Linux you additionally have a priority parameter you can adjust for
tuning your system, especially useful if your disks differs significantly
in speed. Check <tt/man 8 swapon/ as well as <tt/man 2 swapon/ for more
information. -->
<sect1>Swap Partition: to Use or Not to Use
<p>
<nidx>disk!miscellaneous!swap or no swap</nidx>
In many cases you do not need a swap partition, for instance if you
have plenty of RAM, say, more than 64 MB, and you are the sole user
of the machine. In this case you can experiment running without a
swap partition and check the system logs to see if you ran out of
virtual memory at any point.
Removing swap partitions has two advantages:
<itemize>
<item>you save disk space (rather obvious really)
<item>you save seek time as swap partitions otherwise would
lie in the middle of your disk space.
</itemize>
In the end, having a swap partition is like having a heated toilet:
you do not use it very often, but you sure appreciate it when
you require it.
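A quick way of checking how things stand is sketched below; log file
names and kernel messages differ between distributions and kernel
versions, so the <tt/grep/ pattern is only an assumption.
<tscreen><verb>
free                  # total, used and free RAM and swap
swapon -s             # swap areas currently in use

grep -i 'out of memory' /var/log/messages
</verb></tscreen>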
<sect1>Mount Point and <tt>/mnt</tt>
<p>
<nidx>disk!miscellaneous!mount point issues</nidx>
In an earlier version of this document I proposed to put all
permanently mounted partitions under <tt>/mnt</tt>. That, however, is
not such a good idea as this itself can be used as a mount point, which
leads to all mounted partitions becoming unavailable. Instead I will
propose mounting straight from root using a meaningful name like
<tt>/mnt.descriptive-name</tt>.
Lately I have become aware that some Linux distributions use mount points
at subdirectories <em/under/ <tt>/mnt</tt>, such as <tt>/mnt/floppy</tt>
and <tt>/mnt/cdrom</tt>, which just shows how confused the whole issue is.
Hopefully FHS should clarify this.
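A small example of this naming scheme, assuming a partition
<tt>/dev/sdc4</tt> holding a library disk (both the device and the name
are of course just illustrations):
<tscreen><verb>
mkdir /mnt.libdisk
mount /dev/sdc4 /mnt.libdisk

# and the corresponding line in /etc/fstab:
/dev/sdc4   /mnt.libdisk   ext2   defaults   1   2
</verb></tscreen>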
<!--
<sect1>SCSI Id Numbers and Names
<p>
<nidx>disk!miscellaneous!SCSI id numbers vs. names</nidx>
Partitions are labeled in the order they are found, <em/not/ depending
on the SCSI id number. This means that if you add a drive with an id
number inserted in the previous order of numbers, or change id number
in any other way, the partition names will be messed up. This is
important if you use removable media. In order to save yourself from some
unpleasant experiences, you are recommended to use low numbers for fixed
media and reserve the last number(s) for removable media drives.
Many have been bitten by this misfeature and there is a strong call for
something to be done about it. Nobody knows how soon this will be fixed
so in the meantime it is wise to take this into consideration when you
design your system. For instance it may be a good idea to use the lowest
SCSI id number for your root disk so that it has the least probability of
being renumbered should one drive fail.
The source of the problem lies in the limited number of bits available
for major and minor numbering in the device files used to describe the
device itself. You can see these in the <tt>/dev</tt> directory, info
on the numbering and allocation can be found in <tt/man MAKEDEV/.
Currently there are 2 solutions to this problem in various stages of
development:
<descrip>
<tag/scsidev/ works by creating a database of drives and where they
belong, check <tt/man scsifs/ for more information
<tag/devfs/ is a more long term project aimed at getting around the
whole business of device numbering by making the <tt>/dev</tt>
directory a kernel file system in the same way as <tt>/procfs</tt>
is.
More information will appear as it becomes available.
</descrip>
SCSI numbers are also used for arbitration. If several drives request
service, the drive with the lowest number is given priority.
-->
<sect1>Power and Heating <label id="power-heating">
<p>
<nidx>disk!miscellaneous!power-related issues</nidx>
<nidx>disk!miscellaneous!heat-related issues</nidx>
Not many years ago a machine with the equivalent power of a modern PC
required 3-phase power and cooling, usually by air conditioning the machine
room, sometimes also by water cooling. Technology has progressed very
quickly giving not only high speed but also low power components. Still,
there is a definite limit to the technology, something one should keep in
mind as the system is expanded with yet another disk drive or PCI
card. When the power supply is running at full rated power, keep in mind
that all this energy is going somewhere, mostly into heat. Unless this is
dissipated using fans you will get serious heating inside the cabinet,
reducing both the reliability and the lifetime of the electronics.
Manufacturers state minimum cooling requirements for their drives, usually
in terms of cubic feet per minute (CFM). You are well advised to take this
seriously.
Keep air flow passages open, clean out dust and check the temperature of your
system running. If it is too hot to touch it is probably running too hot.
If possible use sequential spin up for the drives. It is during
spin up, when the drive platters accelerate up to normal speed,
that a drive consumes maximum power and if all drives start up
simultaneously you could go beyond the rated power maximum of
your power supply.
<sect1>Deja
<p>
<nidx>disk!miscellaneous!Dejanews</nidx>
<nidx>disk!miscellaneous!Deja</nidx>
<nidx>disk!reliability</nidx>
This is an Internet system that no doubt most of you are familiar with.
It searches and serves <em/Usenet News/ articles from 1995 up to the
latest postings and also offers a web based reading and posting interface.
There is a lot more, check out
<url url="http://www.deja.com"
name="Deja">
for more information. It changed name from Dejanews.
What is perhaps less known is that they use about 120 Linux SMP
computers, many of which use the <tt/md/ module to manage between 4
and 24 Gig of disk space (over 1200 Gig altogether) for this service.
The system is continuously growing but at the time of writing they
use mostly dual Pentium Pro 200MHz and Pentium II 300 MHz
systems with 256 MB RAM or more.
A production database machine normally has 1 disk for the operating system and
between 4 and 6 disks managed by the <tt/md/ module where the articles are
archived.
The drives are connected to BusLogic Model BT-946C and BT-958
PCI SCSI adapters, usually one to a machine.
<!-- Added 980809, to be checked -->
For the production
systems (which are up 365 days a year) the downtime attributable to
disk errors is less than 0.25 % (that is a quarter of 1%, not 25%).
<!-- end of addition -->
Just in case: this is not an advertisement, it is stated as an
example of how much is required for what is a major Internet
service.
<!-- removed 221198
<sect1>File System Structure
<p>
<nidx>disk!miscellaneous!filesystem structure</nidx>
There are many file system structures in existence, differing with
FSSTND (and soon FHS) to varying degree both in terms of philosophy,
strategy and implementation. It is not possible to detail all here,
instead the interested reader should read the relevant manual page,
<tt/man hier/ which is available on many platforms and implementations.
-->
<!-- removed 221198
<sect1>Track Numbering and Optimizing Schemes
<p>
<nidx>disk!miscellaneous!track numbering</nidx>
<nidx>disk!miscellaneous!optimization</nidx>
In the old days the file system used to take advantage of knowing the
physical drive parameters in order to optimize transfers, for instance
by endeavouring to keep a file within a single track if possible which
saves track-to-track seek time. These days with logical drive parameters,
drive cache and schemes to map out bad sectors, such optimizations
become meaningless and might even cost more than it would gain. As most
Linux installations use modern file systems these schemes are not used,
however, some other operating systems have retained such schemes.
-->
<sect1>Crash Recovery
<p>
<nidx>disk!miscellaneous!recovery</nidx>
<nidx>disk!miscellaneous!crash recovery</nidx>
Occasionally hard disks crash. A crash causing data scrambling can
often be at least partially recovered from and there are already
HOWTOs describing this.
In case of hardware failure things are far more serious, and you
have two options: either send the drive to a professional data
recovery company, or try recovering yourself. The latter is of
course <em>high risk</em> and can cause more damage.
If a disk stops rotating or fails to spin up, the number one
advice is first to turn off the system as fast as safely possible.
Next you could try disconnecting the drives and powering up the
machine, just to check with a multimeter that power is
present. Quite often connectors can get unseated and cause all
sorts of problems.
If you decide to risk trying it yourself you could check all
connectors and then reapply power and see if the drive spins up
and responds. If it still is dead turn off power quickly,
preferably before the operating system boots. Make sure that
delayed spinup is not deceiving you here.
If you decide to progress even further (and take higher risks)
you could remove the drive, give it a firm tap on the side so
that the disk moves a little with respect to the casing. This
can help in unsticking the head from the surface, allowing the
platter to move freely as the motor power is not sufficient to
unstick a stuck head on its own.
Also if a drive has been turned off for a while after running
for long periods of time, or if it has overheated, the lubricant
can harden or drain out of the bearings. In this case warming the
drive slowly and gently up to normal operating temperature may
possibly cure the lubrication problems.
If after this the drive still does not respond the last possible
and the highest risk suggestion is to replace the circuit board
of the drive with a board from an identical model drive.
Often the contents of a drive are worth far more than the media
itself, so do consider professional help. These companies have
advanced equipment and know-how obtained from the manufacturers
on how to recover a damaged drive, far beyond that of a hobbyist.
<!--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-->
<sect>Appendix A: Partitioning Layout Table: Mounting and Linking <label id="app-a">
<p>
<nidx>disk!partitioning layout table!mounting and linking</nidx>
The following table is designed to make layout a simpler paper
and pencil exercise. It is probably best to print it out (using
NON PROPORTIONAL fonts) and adjust the numbers until you are
happy with them.
Mount point is what directory you wish to mount a partition on or
the actual device. This is also a good place to note how you plan
to use symbolic links.
The size given corresponds to a fairly big Debian 1.2.6 installation.
Other examples are coming later.
Mainly you use this table to select what structure and drives you will use,
the partition numbers and letters will come from the next two tables.
<tscreen><verb>
Directory Mount point speed seek transfer size SIZE
swap __________ ooooo ooooo ooooo 32 ____
/ __________ o o o 20 ____
/tmp __________ oooo oooo oooo ____
/var __________ oo oo oo 25 ____
/var/tmp __________ oooo oooo oooo ____
/var/spool __________ ____
/var/spool/mail __________ o o o ____
/var/spool/news __________ ooo ooo oo ____
/var/spool/____ __________ ____ ____ ____ ____
/home __________ oo oo oo ____
/usr __________ 500 ____
/usr/bin __________ o oo o 250 ____
/usr/lib __________ oo oo ooo 200 ____
/usr/local __________ ____
/usr/local/bin __________ o oo o ____
/usr/local/lib __________ oo oo ooo ____
/usr/local/____ __________ ____
/usr/src __________ o oo o 50 ____
DOS __________ o o o ____
Win __________ oo oo oo ____
NT __________ ooo ooo ooo ____
/mnt._________ __________ ____ ____ ____ ____
/mnt._________ __________ ____ ____ ____ ____
/mnt._________ __________ ____ ____ ____ ____
/_____________ __________ ____ ____ ____ ____
/_____________ __________ ____ ____ ____ ____
/_____________ __________ ____ ____ ____ ____
Total capacity:
</verb></tscreen>
<sect>Appendix B: Partitioning Layout Table: Numbering and Sizing <label id="app-b">
<p>
<nidx>disk!partitioning layout table!numbering and sizing</nidx>
This table follows the same logical structure as the table above
where you decided what disk to use. Here you select the physical
tracking, keeping in mind the effect of track positioning mentioned
earlier in
<ref id="physical-track-positioning" name="Physical Track Positioning">.
The final partition number will come out of the table after this.
<tscreen><verb>
Drive sda sdb sdc hda hdb hdc ___
SCSI ID | __ | __ | __ |
Directory
swap | | | | | | |
/ | | | | | | |
/tmp | | | | | | |
/var : : : : : : :
/var/tmp | | | | | | |
/var/spool : : : : : : :
/var/spool/mail | | | | | | |
/var/spool/news : : : : : : :
/var/spool/____ | | | | | | |
/home | | | | | | |
/usr | | | | | | |
/usr/bin : : : : : : :
/usr/lib | | | | | | |
/usr/local : : : : : : :
/usr/local/bin | | | | | | |
/usr/local/lib : : : : : : :
/usr/local/____ | | | | | | |
/usr/src : : : :
DOS | | | | | | |
Win : : : : : : :
NT | | | | | | |
/mnt.___/_____ | | | | | | |
/mnt.___/_____ : : : : : : :
/mnt.___/_____ | | | | | | |
/_____________ : : : : : : :
/_____________ | | | | | | |
/_____________ : : : : : : :
Total capacity:
</verb></tscreen>
<sect>Appendix C: Partitioning Layout Table: Partition Placement <label id="app-c">
<p>
<nidx>disk!partitioning layout table!partition placement</nidx>
This is just to sort the partition numbers in ascending order ready
to input to fdisk or cfdisk. Here you take physical track positioning
into account when finalizing your design. Unless you get specific
information otherwise, you can assume track 0 is the outermost track.
These numbers and letters
are then used to update the previous tables, all of which you will find
very useful in later maintenance.
In case of disk crash you might find it handy to know what SCSI id
belongs to which drive, consider keeping a paper copy of this.
<tscreen><verb>
Drive : sda sdb sdc hda hdb hdc ___
Total capacity: | ___ | ___ | ___ | ___ | ___ | ___ | ___
SCSI ID | __ | __ | __ |
Partition
1 | | | | | | |
2 : : : : : : :
3 | | | | | | |
4 : : : : : : :
5 | | | | | | |
6 : : : : : : :
7 | | | | | | |
8 : : : : : : :
9 | | | | | | |
10 : : : : : : :
11 | | | | | | |
12 : : : : : : :
13 | | | | | | |
14 : : : : : : :
15 | | | | | | |
16 : : : : : : :
</verb></tscreen>
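Once the tables are filled in, the layout can be handed to the
partitioning tool of your choice. As a hedged sketch, <tt/sfdisk/ can
read a prepared list of partition descriptors from standard input; the
sizes below are simply the figures for drive sda in the example
appendices, and option letters differ between <tt/sfdisk/ versions, so
check <tt/man sfdisk/ before trying anything like this.
<tscreen><verb>
# WARNING: this rewrites the partition table, triple check the device name!
# Each line is: start,size,type  -- an empty start means "next free area".
# Type 6 is DOS FAT16, 83 is Linux native; sizes are in MB (-uM).
printf ',100,6\n,100,83\n,400,83\n' | sfdisk -uM /dev/sda
</verb></tscreen>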
<sect>Appendix D: Example: Multipurpose Server
<p>
<nidx>disk!example!server, multi-purpose</nidx>
The following table is from the setup of a medium sized multipurpose
server where I once worked. Aside from being a general Linux machine it will
also be a network related server (DNS, mail, FTP, news, printers etc.),
X server for various CAD programs, CD ROM burner and many other things.
The files reside on 3 SCSI drives with a capacity of 600, 1000 and 1300
MB.
Some further speed could possibly be gained by splitting <tt>/usr/local</tt>
from the rest of the <tt>/usr</tt> system but we deemed the further added
complexity would not be worth it. With another couple of drives
this could be more worthwhile. In this setup drive sda is old and slow
and could just as well be replaced by an IDE drive. The other two drives
are both rather fast. Basically we split most of the load between these
two. To reduce dangers of imbalance in partition sizing we have decided
to keep <tt>/usr/bin</tt> and <tt>/usr/local/bin</tt> in one drive
and <tt>/usr/lib</tt> and <tt>/usr/local/lib</tt> on another separate drive
which also affords us some drive parallelizing.
Even more could be gained by using RAID but we felt that as a server we
needed more reliability than was then afforded by the <tt/md/ patch and
a dedicated RAID controller was out of our reach.
<sect>Appendix E: Example: Mounting and Linking
<p>
<nidx>disk!example!mounting and linking</nidx>
<tscreen><verb>
Directory Mount point speed seek transfer size SIZE
swap sdb2, sdc2 ooooo ooooo ooooo 32 2x64
/ sda2 o o o 20 100
/tmp sdb3 oooo oooo oooo 300
/var __________ oo oo oo ____
/var/tmp sdc3 oooo oooo oooo 300
/var/spool sdb1 436
/var/spool/mail __________ o o o ____
/var/spool/news __________ ooo ooo oo ____
/var/spool/____ __________ ____ ____ ____ ____
/home sda3 oo oo oo 400
/usr sdb4 230 200
/usr/bin __________ o oo o 30 ____
/usr/lib -> libdisk oo oo ooo 70 ____
/usr/local __________ ____
/usr/local/bin __________ o oo o ____
/usr/local/lib -> libdisk oo oo ooo ____
/usr/local/____ __________ ____
/usr/src ->/home/usr.src o oo o 10 ____
DOS sda1 o o o 100
Win __________ oo oo oo ____
NT __________ ooo ooo ooo ____
/mnt.libdisk sdc4 oo oo ooo 226
/mnt.cd sdc1 o o oo 710
Total capacity: 2900 MB
</verb></tscreen>
<sect>Appendix F: Example: Numbering and Sizing
<p>
<nidx>disk!example!numbering and sizing</nidx>
Here we do the adjustment of sizes and positioning.
<tscreen><verb>
Directory sda sdb sdc
swap | | 64 | 64 |
/ | 100 | | |
/tmp | | 300 | |
/var : : : :
/var/tmp | | | 300 |
/var/spool : : 436 : :
/var/spool/mail | | | |
/var/spool/news : : : :
/var/spool/____ | | | |
/home | 400 | | |
/usr | | 200 | |
/usr/bin : : : :
/usr/lib | | | |
/usr/local : : : :
/usr/local/bin | | | |
/usr/local/lib : : : :
/usr/local/____ | | | |
/usr/src : : : :
DOS | 100 | | |
Win : : : :
NT | | | |
/mnt.libdisk | | | 226 |
/mnt.cd : : : 710 :
/mnt.___/_____ | | | |
Total capacity: | 600 | 1000 | 1300 |
</verb></tscreen>
<sect>Appendix G: Example: Partition Placement
<p>
<nidx>disk!example!partition placement</nidx>
This is just to sort the partition numbers in ascending order ready
to input to fdisk or cfdisk. Remember to optimize for physical track
positioning (not done here).
<tscreen><verb>
Drive : sda sdb sdc
Total capacity: | 600 | 1000 | 1300 |
Partition
1 | 100 | 436 | 710 |
2 : 100 : 64 : 64 :
3 | 400 | 300 | 300 |
4 : : 200 : 226 :
</verb></tscreen>
<sect>Appendix H: Example II
<p>
<nidx>disk!example!server, academic</nidx>
The following is an example of a
server setup in an academic setting, and is contributed by
<tt/nakano (at) apm.seikei.ac.jp/. I have only done minor editing to
this section.
<tt>/var/spool/delegate</tt> is a directory for storing logs and cache files
of a WWW proxy server program, "delegated". Since I have not announced it
widely, there are only 1000--1500 requests/day currently, and average
disk usage is 15--30% with expiration of caches each day.
<tt>/mnt.archive</tt> is used for data files which are big and not frequently
referenced, such as experimental data (especially graphic files),
various source archives, and Win95 backups (growing very fast...).
<tt>/mnt.root</tt> is a backup root file system containing rescue utilities. A
boot floppy is also prepared to boot with this partition.
<tscreen><verb>
=================================================
Directory sda sdb hda
swap | 64 | 64 | |
/ | | | 20 |
/tmp | | | 180 |
/var : 300 : : :
/var/tmp | | 300 | |
/var/spool/delegate | 300 | | |
/home | | | 850 |
/usr | 360 | | |
/usr/lib -> /mnt.lib/usr.lib
/usr/local/lib -> /mnt.lib/usr.local.lib
/mnt.lib | | 350 | |
/mnt.archive : : 1300 : :
/mnt.root | | 20 | |
Total capacity: 1024 2034 1050
=================================================
Drive : sda sdb hda
Total capacity: | 1024 | 2034 | 1050 |
Partition
1 | 300 | 20 | 20 |
2 : 64 : 1300 : 180 :
3 | 300 | 64 | 850 |
4 : 360 : ext : :
5 | | 300 | |
6 : : 350 : :
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/hda1 19485 10534 7945 57% /
/dev/hda2 178598 13 169362 0% /tmp
/dev/hda3 826640 440814 343138 56% /home
/dev/sda1 306088 33580 256700 12% /var
/dev/sda3 297925 47730 234807 17% /var/spool/delegate
/dev/sda4 363272 170872 173640 50% /usr
/dev/sdb5 297598 2 282228 0% /var/tmp
/dev/sdb2 1339248 302564 967520 24% /mnt.archive
/dev/sdb6 323716 78792 228208 26% /mnt.lib
</verb></tscreen>
Apparently <tt>/tmp</tt> and <tt>/var/tmp</tt> are too big. These
directories will be
packed together into one partition when disk space becomes short.
<tt>/mnt.lib</tt> also seems to be oversized, but I plan to install newer TeX and
ghostscript archives, so <tt>/usr/local/lib</tt> may grow by about 100 MB or so
(since we must use Japanese fonts!).
The whole system is backed up with a Seagate Tapestore 8000 (Travan TR-4,
4G/8G).
<!--
// 140197 text removed
It works fine when accessed through <tt>/dev/st0</tt>, but
when done through <tt>/dev/nst0</tt> or with `<tt>mt</tt>' command,
SCSI system get up a panic occasionally. It's not critical, but the
biggest problem rest in our system...
-->
<sect>Appendix I: Example III: SPARC Solaris
<p>
<nidx>disk!example!server, industrial</nidx>
The following section is the basic design used at work for a number of
Sun SPARC servers running Solaris 2.5.1 in an industrial development
environment. It serves a number of database and CAD applications in
addition to the normal services such as mail.
Simplicity is emphasized here so <tt>/usr/lib</tt> has not been split
off from <tt>/usr</tt>.
This is the basic layout, planned for about 100 users.
<tscreen><verb>
Drive: SCSI 0 SCSI 1
Partition Size (MB) Mount point Size (MB) Mount point
0 160 swap 160 swap
1 100 /tmp 100 /var/tmp
2 400 /usr
3 100 /
4 50 /var
5
6 remainder /local0 remainder /local1
</verb></tscreen>
Due to specific requirements at this place it is at times necessary to
have large partitions available on short notice. Therefore drive 0 is
given as many tasks as feasible, leaving a large <tt>/local1</tt>
partition.
This setup has been in use for some time now and found satisfactory.
For a more general and balanced system it would be better to swap <tt>/tmp</tt>
and <tt>/var/tmp</tt> and then move <tt>/var</tt> to drive 1.
<sect>Appendix J: Example IV: Server with 4 Drives
<p>
<nidx>disk!example!server, 4 drives</nidx>
This gives an example of using all techniques described earlier, short of
RAID. It is admittedly rather complicated but offers in return high
performance from modest hardware. Dimensioning is skipped but reasonable
figures can be found in previous examples.
<tscreen><verb>
Partition sda sdb sdc sdd
---- ---- ---- ----
1 root overview lib news
2 swap swap swap swap
3 home /usr /var/tmp /tmp
4 spare root mail /var
</verb></tscreen>
The setup is optimised with respect to track positioning but also for
minimising drive seeks.
If you want DOS or Windows too you will have to use <tt/sda1/ for this
and move the other partitions after that. It will be advantageous to
use the swap partitions on <tt/sdb2/, <tt/sdc2/ and <tt/sdd2/ for
Windows swap, <tt/TEMPDIR/ and Windows temporary directory under these
sessions. A number of other HOWTOs describe how you can make several
operating systems coexist on your machine.
For completeness a 4 drive example using several types of RAID is
also given which is even more complex than the example above.
<tscreen><verb>
Partition sda sdb sdc sdd
---- ---- ---- ----
1 boot overview news news
2 overview swap swap swap
3 swap lib lib lib
4 lib overview /tmp /tmp
5 /var/tmp /var/tmp mail /usr
6 /home /usr /usr mail
7 /usr /home /var
8 / (root) spare root
</verb></tscreen>
Here all duplicates are parts of a RAID 0 set, with two exceptions:
swap, which is interleaved, and home and mail, which are implemented
as RAID 1 for safety.
Note that boot and root are separated: only the boot file with the
kernel has to reside within the 1023 cylinder limit. The rest of the
root files can be anywhere and here they are placed on the slower,
innermost partition. For simplicity and safety the root partition
is not on a RAID system.
With such a complicated setup comes an equally complicated <tt/fstab/ file.
The large number of partitions makes it important to do the <tt/fsck/
passes in the right order, otherwise the process can take perhaps
ten times as long to complete as with the optimal ordering.
<tscreen><verb>
/dev/sda8 / ? ? 1 1 (a)
/dev/sdb8 / ? noauto 1 2 (b)
/dev/sda1 boot ? ? 1 2 (a)
/dev/sdc7 /var ? ? 1 2 (c)
/dev/md1 news ? ? 1 3 (c+d)
/dev/md2 /var/tmp ? ? 1 3 (a+b)
/dev/md3 mail ? ? 1 4 (c+d)
/dev/md4 /home ? ? 1 4 (a+b)
/dev/md5 /tmp ? ? 1 5 (c+d)
/dev/md6 /usr ? ? 1 6 (a+b+c+d)
/dev/md7 /lib ? ? 1 7 (a+b+c+d)
</verb></tscreen>
The letters in the brackets indicate what drives will be active
for each <tt/fsck/ entry and pass. These letters are <em/not/ present
in a real <tt/fstab/ file.
All in all there are 7 passes.
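For readers unfamiliar with the fields, a single concrete line from such
an <tt/fstab/ could look like the sketch below; the file system type and
mount options are assumptions, while the last two fields are the dump
flag and the <tt/fsck/ pass number that controls the ordering discussed
above.
<tscreen><verb>
# device    mount point   type   options    dump  fsck pass
/dev/md6    /usr          ext2   defaults   1     6
</verb></tscreen>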
<sect>Appendix K: Example V: Dual Drive System
<p>
<nidx>disk!example!system, 2 drives</nidx>
A dual drive system offers less opportunity for clever schemes but
the following should provide a simple starting point.
<tscreen><verb>
Partition sda sdb
---- ----
1 boot lib
2 swap news
3 /tmp swap
4 /usr /var/tmp
5 /var /home
6 / (root)
</verb></tscreen>
If you use a dual OS system you have to keep in mind that many other
systems must boot from the first partition on the first drive. A simple
DOS / Linux system could look like this:
<tscreen><verb>
Partition sda sdb
---- ----
1 DOS lib
2 boot news
3 swap swap
4 /tmp /var/tmp
5 /usr /home
6 /var DOSTEMP
7 / (root)
</verb></tscreen>
Also remember that DOS and Windows prefer to have just a single
primary partition, which has to be the first one on the drive and is the
one they boot from. As Linux can happily live in logical partitions
this is not a big problem.
<sect>Appendix L: Example VI: Single Drive System
<p>
<nidx>disk!example!system, 1 drive</nidx>
Although this falls somewhat outside the scope of this HOWTO
it cannot be denied that recently some rather large drives have
become very affordable. Drives with 10 - 20 GB are becoming
common and the question often is how best to partition such
monsters. Interestingly enough very few seem to have any problems
in filling up such drives and the future looks generally quite
rosy for manufacturers planning on even bigger drives.
Opportunities for optimisations are of course even smaller
than for 2 drive systems but some tricks can still be used
to optimise track positions while minimising head movements.
<tscreen><verb>
Partition hda Size estimate (MB)
---- ------------------
1 DOS 500
2 boot 20
3 Winswap 200
4 data The bulk of the drive
5 lib 50 - 500
6 news 300+
7 swap 128 (Maximum size for 32-bit CPU)
8 tmp 300+ (/tmp and /var/tmp)
9 /usr 50 - 500
10 /home 300+
11 /var 50 - 300
12 mail 300+
13 / (root) 30
14 dosdata 10 ( Windows bug workaround!)
</verb></tscreen>
Remember that the <tt/dosdata/ partition is a DOS filesystem that
must be the very last partition on the drive, otherwise Windows
gets confused.
<sect>Appendix M: Disk System Documenter <label id="disk-documenter">
<p>
<nidx>disk!disk documenter</nidx>
This shell script was very kindly provided by Steffen Hulegaard. Run it
as root (superuser) and it will generate a summary of your disk setup.
Run it after you have implemented your design and compare it with what
you designed to check for mistakes. Should your system develop defects
this document will also be a useful starting point for recovery.
<code>
#!/bin/bash
#$Header$
#
# makediskdoc Collects storage/disk info via df, mount,
# /etc/fstab and fdisk. Creates a single
# reference file -- /root/sysop/doc/README.diskdoc
# Especially good for documenting storage
# config/partitioning
#
# 11/11/1999 SC Hulegaard Created just before RedHat 5.2 to
# RedHat 6.1 upgrade
# 12/31/1999 SC Hulegaard Added sfdisk -glx usage just prior to
# collapse of my Quantum Grand Prix (4.3 Gb)
#
# SEE ALSO Other /root/bin/make*doc commands to produce other /root/sysop/doc/README.*
# files. For example, /root/bin/makenetdoc.
#
FILE=/root/sysop/doc/README.diskdoc
echo Creating $FILE ...
echo ' ' > $FILE
echo $FILE >> $FILE
echo Produced By $0 >> $FILE
echo `date` >> $FILE
echo ' ' >> $FILE
echo $Header$ >> $FILE
echo ' ' >> $FILE
echo DESCRIPTION: df -a >> $FILE
df -a >> $FILE 2>&1
echo ' ' >> $FILE
echo DESCRIPTION: df -ia >> $FILE
df -ia >> $FILE 2>&1
echo ' ' >> $FILE
echo DESCRIPTION: mount >> $FILE
mount >> $FILE 2>&1
echo ' ' >> $FILE
echo DESCRIPTION: /etc/fstab >> $FILE
cat /etc/fstab >> $FILE
echo ' ' >> $FILE
echo DESCRIPTION: sfdisk -s disk device size summary >> $FILE
sfdisk -s >> $FILE
echo ' ' >> $FILE
echo DESCRIPTION: sfdisk -glx info for all disks listed in /etc/fstab >> $FILE
for x in `egrep '/dev/[sh]d' /etc/fstab | cut -c 1-8 | sort | uniq`; do
echo ' ' >> $FILE
echo $x ============================= >> $FILE
sfdisk -glx $x >> $FILE
done
echo ' ' >> $FILE
echo DESCRIPTION: fdisk -l info for all disks listed in /etc/fstab >> $FILE
for x in `egrep '/dev/[sh]d' /etc/fstab | cut -c 1-8 | sort | uniq`; do
echo ' ' >> $FILE
echo $x ============================= >> $FILE
fdisk -l $x >> $FILE
done
echo ' ' >> $FILE
echo DESCRIPTION: dmesg info on both sd and hd drives >> $FILE
dmesg | egrep '[hs]d[a-z]' >> $FILE
echo '' >> $FILE
echo Done >> $FILE
echo Done
exit
</code>
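To use it, save it for instance as <tt>/root/bin/makediskdoc</tt> (the
path used in the comments above), make sure the output directory exists,
make it executable and run it as root:
<tscreen><verb>
mkdir -p /root/sysop/doc
chmod 700 /root/bin/makediskdoc
/root/bin/makediskdoc
less /root/sysop/doc/README.diskdoc
</verb></tscreen>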
</article>