HOWTO: Multi Disk System Tuning
Stein Gjoen, sgjoen@nyx.net
v0.33a, 20 May 2002

This document describes how best to use multiple disks and partitions
for a Linux system. Although some of this text is Linux specific the
general approach outlined here can be applied to many other multi
tasking operating systems.
______________________________________________________________________

Table of Contents


1. Introduction
   1.1 Copyright
   1.2 Disclaimer
   1.3 News
   1.4 Credits
   1.5 Translations

2. Structure
   2.1 Logical structure
   2.2 Document structure
   2.3 Reading plan

3. Drive Technologies
   3.1 Drives
   3.2 Geometry
   3.3 Media
      3.3.1 Magnetic Drives
      3.3.2 Optical Drives
      3.3.3 Solid State Drives
   3.4 Interfaces
      3.4.1 MFM and RLL
      3.4.2 ESDI
      3.4.3 IDE and ATA
      3.4.4 EIDE, Fast-ATA and ATA-2
      3.4.5 Ultra-ATA
      3.4.6 Serial-ATA
      3.4.7 ATAPI
      3.4.8 SCSI
   3.5 Cabling
   3.6 Host Adapters
   3.7 Multi Channel Systems
   3.8 Multi Board Systems
   3.9 Speed Comparison
      3.9.1 Controllers
      3.9.2 Bus Types
   3.10 Benchmarking
   3.11 Comparisons
   3.12 Future Development
   3.13 Recommendations

4. File System Structure
   4.1 File System Features
      4.1.1 Swap
      4.1.2 Temporary Storage (/tmp and /var/tmp)
      4.1.3 Spool Areas (/var/spool/news and /var/spool/mail)
      4.1.4 Home Directories (/home)
      4.1.5 Main Binaries (/usr/bin and /usr/local/bin)
      4.1.6 Libraries (/usr/lib and /usr/local/lib)
      4.1.7 Boot
      4.1.8 Root
      4.1.9 DOS etc.
   4.2 Explanation of Terms
      4.2.1 Speed
      4.2.2 Reliability
      4.2.3 Files

5. File Systems
   5.1 General Purpose File Systems
      5.1.1 minix
      5.1.2 xiafs and extfs
      5.1.3 ext2fs
      5.1.4 ext3fs
      5.1.5 ufs
      5.1.6 efs
      5.1.7 XFS
      5.1.8 reiserfs
      5.1.9 enh-fs
      5.1.10 Tux2 fs
   5.2 Microsoft File Systems
      5.2.1 fat
      5.2.2 fat32
      5.2.3 vfat
      5.2.4 ntfs
   5.3 Logging and Journaling File Systems
   5.4 Read-only File Systems
      5.4.1 High Sierra
      5.4.2 iso9660
      5.4.3 Rock Ridge
      5.4.4 Joliet
      5.4.5 Trivia
      5.4.6 UDF
   5.5 Networking File Systems
      5.5.1 NFS
      5.5.2 AFS
      5.5.3 Coda
      5.5.4 nbd
      5.5.5 enbd
      5.5.6 GFS
   5.6 Special File Systems
      5.6.1 tmpfs and swapfs
      5.6.2 userfs
      5.6.3 devfs
      5.6.4 smugfs
   5.7 File System Recommendations

6. Technologies
   6.1 RAID
      6.1.1 SCSI-to-SCSI
      6.1.2 PCI-to-SCSI
      6.1.3 Software RAID
      6.1.4 RAID Levels
   6.2 Volume Management
   6.3 Linux md Kernel Patch
   6.4 Compression
   6.5 ACL
   6.6 cachefs
   6.7 Translucent or Inheriting File Systems
   6.8 Physical Track Positioning
      6.8.1 Disk Speed Values
   6.9 Yoke
   6.10 Stacking
   6.11 Recommendations

7. Other Operating Systems
   7.1 DOS
   7.2 Windows
   7.3 OS/2
   7.4 NT
   7.5 Windows 2000
   7.6 Sun OS
      7.6.1 Sun OS 4
      7.6.2 Sun OS 5 (aka Solaris)
   7.7 BeOS

8. Clusters
9. Mount Points

10. Considerations and Dimensioning
   10.1 Home Systems
   10.2 Servers
      10.2.1 Home Directories
      10.2.2 Anonymous FTP
      10.2.3 WWW
      10.2.4 Mail
      10.2.5 News
      10.2.6 Others
      10.2.7 Server Recommendations
   10.3 Pitfalls

11. Disk Layout
   11.1 Selection for Partitioning
   11.2 Mapping Partitions to Drives
   11.3 Sorting Partitions on Drives
   11.4 Optimizing
      11.4.1 Optimizing by Characteristics
      11.4.2 Optimizing by Drive Parallelising
   11.5 Compromises

12. Implementation
   12.1 Checklist
   12.2 Drives and Partitions
   12.3 Partitioning
   12.4 Repartitioning
   12.5 Microsoft Partition Bug
   12.6 Multiple Devices (md)
   12.7 Formatting
   12.8 Mounting
   12.9 fstab
   12.10 Mount options
   12.11 Recommendations

13. Maintenance
   13.1 Backup
   13.2 Defragmentation
   13.3 Deletions
   13.4 Upgrades
   13.5 Recovery
   13.6 Rescue Disk

14. Advanced Issues
   14.1 Hard Disk Tuning
   14.2 File System Tuning
   14.3 Spindle Synchronizing

15. Troubleshooting
   15.1 During Installation
      15.1.1 Locating Disks
      15.1.2 Formatting
   15.2 During Booting
      15.2.1 Booting fails
      15.2.2 Getting into Single User Mode
   15.3 During Running
      15.3.1 Swap
      15.3.2 Partitions

16. Further Information
   16.1 News groups
   16.2 Mailing Lists
   16.3 HOWTO
   16.4 Mini-HOWTO
   16.5 Local Resources
   16.6 Web Pages
   16.7 Search Engines

17. Getting Help

18. Concluding Remarks
   18.1 Coming Soon
   18.2 Request for Information
   18.3 Suggested Project Work

19. Questions and Answers

20. Bits and Pieces
   20.1 Swap Partition: to Use or Not to Use
   20.2 Mount Point and /mnt
   20.3 Power and Heating
   20.4 Deja
   20.5 Crash Recovery

21. Appendix A: Partitioning Layout Table: Mounting and Linking

22. Appendix B: Partitioning Layout Table: Numbering and Sizing

23. Appendix C: Partitioning Layout Table: Partition Placement

24. Appendix D: Example: Multipurpose Server

25. Appendix E: Example: Mounting and Linking

26. Appendix F: Example: Numbering and Sizing

27. Appendix G: Example: Partition Placement

28. Appendix H: Example II

29. Appendix I: Example III: SPARC Solaris

30. Appendix J: Example IV: Server with 4 Drives

31. Appendix K: Example V: Dual Drive System

32. Appendix L: Example VI: Single Drive System

33. Appendix M: Disk System Documenter


______________________________________________________________________


1. Introduction

For unclear reasons this brand new release is codenamed the Taylor3
release.

New code names will appear as per industry standard guidelines to
emphasize the state-of-the-art-ness of this document.

This document was written for two reasons. Mainly, I got hold of 3 old
SCSI disks to set up my Linux system on and was pondering how best to
utilise the inherent possibilities of parallelizing in a SCSI system.
Secondly, I hear there is a prize for people who write documents...

This is intended to be read in conjunction with the Linux Filesystem
Structure Standard (FSSTND). It does not in any way replace it but
tries to suggest where physically to place the directories detailed in
the FSSTND, in terms of drives, partitions, types, RAID, file system
(fs), physical sizes and other parameters that should be considered
and tuned in a Linux system, ranging from single home systems to large
servers on the Internet.

The followup to FSSTND is called the Filesystem Hierarchy Standard
(FHS) and covers more than Linux alone. FHS versions 2.0, 2.1 and 2.2
have been released but there are still a few issues to be dealt with.
Many recent distributions are now aiming for FHS compliance.

It is also a good idea to read the Linux Installation guides
thoroughly. If you are using a PC system, which I guess the majority
still do, you can find much relevant and useful information in the
FAQs for the newsgroup comp.sys.ibm.pc.hardware, especially those on
storage media.

This is also a learning experience for myself and I hope I can start
the ball rolling with this HOWTO and that it perhaps can evolve into a
larger, more detailed and hopefully even more correct HOWTO.

First of all we need a bit of legalese. Recent developments show it is
quite important.


1.1. Copyright

This document is Copyright 1996 Stein Gjoen. Permission is granted to
copy, distribute and/or modify this document under the terms of the
GNU Free Documentation License, Version 1.1 or any later version
published by the Free Software Foundation with no Invariant Sections,
no Front-Cover Texts, and no Back-Cover Texts.

If you have any questions, please contact
<linux-howto@metalab.unc.edu>


1.2. Disclaimer

Use the information in this document at your own risk. I disavow any
potential liability for the contents of this document. Use of the
concepts, examples, and/or other content of this document is entirely
at your own risk.

All copyrights are owned by their owners, unless specifically noted
otherwise. Use of a term in this document should not be regarded as
affecting the validity of any trademark or service mark.

Naming of particular products or brands should not be seen as
endorsements.

You are strongly recommended to take a backup of your system before
any major installation, and to take backups at regular intervals.


1.3. News

This is a major upgrade featuring a new copyright statement that is
intended to be Debian compliant and allow for inclusion in their
distribution. A number of mistakes have been corrected and new
material added, such as descriptions of recent ATA features.

On the development front people are concentrating their energy on
completing Linux 2.4 and until that is released there is not going to
be much news on disk technology for Linux.

The document is now also available in PostScript, in both US letter
and European A4 formats.

The latest version number of this document can be gleaned from my plan
entry if you finger
<http://www.mit.edu:8001/finger?sgjoen@nox.nyx.net> my Nyx account.

Also, the latest version will be available on my web space on Nyx in a
number of formats:

o  HTML <http://www.nyx.net/~sgjoen/disk.html>.

o  plain ASCII text <http://www.nyx.net/~sgjoen/disk.txt> (ca. 6200
   lines).

o  compressed postscript US letter format
   <http://www.nyx.net/~sgjoen/disk-US.ps.gz> (ca. 90 pages).

o  compressed postscript European A4 format
   <http://www.nyx.net/~sgjoen/disk-A4.ps.gz> (ca. 85 pages).

o  SGML source <http://www.nyx.net/~sgjoen/disk.sgml> (ca. 260 KB).

A European mirror of the Multi Disk HOWTO
<http://home.online.no/~ggjoeen/stein/disk.html> just went online.


1.4. Credits

In this version I have the pleasure of acknowledging even more people
who have contributed in one way or another:

     ronnej (at) ucs.orst.edu
     cm (at) kukuruz.ping.at
     armbru (at) pond.sub.org
     R.P.Blake (at) open.ac.uk
     neuffer (at) goofy.zdv.Uni-Mainz.de
     sjmudd (at) redestb.es
     nat (at) nataa.fr.eu.org
     sundbyk (at) oslo.geco-prakla.slb.com
     ggjoeen (at) online.no
     mike (at) i-Connect.Net
     roth (at) uiuc.edu
     phall (at) ilap.com
     szaka (at) mirror.cc.u-szeged.hu
     CMckeon (at) swcp.com
     kris (at) koentopp.de
     edick (at) idcomm.com
     pot (at) fly.cnuce.cnr.it
     earl (at) sbox.tu-graz.ac.at
     ebacon (at) oanet.com
     vax (at) linkdead.paranoia.com
     tschenk (at) theoffice.net
     pjfarley (at) dorsai.org
     jean (at) stat.ubc.ca
     johnf (at) whitsunday.net.au
     clasen (at) unidui.uni-duisburg.de
     eeslgw (at) ee.surrey.asc.uk
     adam (at) onshore.com
     anikolae (at) wega-fddi2.rz.uni-ulm.de
     cjaeger (at) dwave.net
     eperezte (at) c2i.net
     yesteven (at) ms2.hinet.net
     cj (at) samurajdata.se
     tbotond (at) netx.hu
     russel (at) coker.com.au
     lars (at) iar.se
     GALLAGS3 (at) labs.wyeth.com
     morimoto (at) xantia.citroen.org
     shulegaa (at) gatekeeper.txl.com
     roman.legat (at) stud.uni-hannover.de
     ahamish (at) hicks.alien.usr.com
     hduff2 (at) worldnet.att.net
     mbaehr (at) email.archlab.tuwien.ac.at
     adc (at) postoffice.utas.edu.au
     pjm (at) bofh.asn.au
     jochen.berg (at) ac.com
     jpotts (at) us.ibm.com
     jarry (at) gmx.net
     LeBlanc (at) mcc.ac.uk
     masy (at) webmasters.gr.jp
     karlheg (at) hegbloom.net
     goeran (at) uddeborg.pp.se
     wgm (at) telus.net


1.5. Translations

Special thanks go to nakano (at) apm.seikei.ac.jp for doing the
Japanese translation
<http://www.linux.or.jp/JF/JFdocs/Multi-Disk-HOWTO.html>, for general
contributions and for contributing an example of a computer in an
academic setting, which is included at the end of this document.

There are now many new translations available and special thanks go to
the translators for the job and the input they have given:

o  German Translation <http://www.linuxdoc.org/> by chewie (at)
   nuernberg.netsurf.de

o  Swedish Translation <http://www.swe-doc.linux.nu> by jonah (at)
   swipnet.se

o  French Translation <http://www.lri.fr/~loisel/howto/> by
   Patrick.Loiseleur (at) lri.fr

o  Chinese Translation <http://www.linuxdoc.org/> by yesteven (at)
   ms2.hinet.net

o  Italian Translation
   <http://www.pluto.linux.it/ildp/HOWTO/Multi-Disk-HOWTO.html> by
   bigpaul (at) flashnet.it

ICP Vortex is gratefully acknowledged for sending in-depth information
on their range of RAID controllers.

Also DPT is acknowledged for sending me documentation on their
controllers as well as permission to quote from the material. Any
such quotes will have been approved before appearing here and will be
clearly labelled. No quotes as of yet but that is coming.

Not many still, so please read through this document, make a
contribution and join the elite. If I have forgotten anyone, please
let me know.

New in this version is an appendix with a few tables you can fill in
for your system in order to simplify the design process.

Any comments or suggestions can be mailed to my mail address on Nyx:
sgjoen@nyx.net.

So let's cut to the chase where swap and /tmp are racing along the
hard drive...


2. Structure

As this type of document is supposed to be as much for learning as a
technical reference document I have rearranged the structure to this
end. For the designer of a system it is more useful to have the
information presented in terms of the goals of this exercise than from
the point of view of the logical layer structure of the devices
themselves. Nevertheless this document would not be complete without
the kind of layer structure the computer field is so full of, so I
will include it here as an introduction to how it works.

It is a long time since the mini in mini-HOWTO could be defended as
proper but I am convinced that this document is as long as it needs to
be in order to make the right design decisions, and not longer.


2.1. Logical structure

This is based on how the layers access each other, traditionally with
the application on top and the physical layer on the bottom. It is
quite useful to show the interrelationship between each of the layers
used in controlling drives.

     ___________________________________________________________
     |__     File structure          (/usr /tmp etc)         __|
     |__     File system             (ext2fs, vfat etc)      __|
     |__     Volume management       (AFS)                   __|
     |__     RAID, concatenation     (md)                    __|
     |__     Device driver           (SCSI, IDE etc)         __|
     |__     Controller              (chip, card)            __|
     |__     Connection              (cable, network)        __|
     |__     Drive                   (magnetic, optical etc) __|
     -----------------------------------------------------------

In the above diagram both volume management and RAID and concatenation
are optional layers. The 3 lower layers are in hardware. All parts
are discussed at length later on in this document.


2.2. Document structure

Most users start out with a given set of hardware and some plans on
what they wish to achieve and how big the system should be. This is
the point of view I will adopt in this document in presenting the
material, starting out with hardware, continuing with design
constraints before detailing the design strategy that I have found to
work well. I have used this both for my own personal computer at
home and for a multi purpose server at work, and found it worked quite
well. In addition my Japanese co-worker in this project has applied
the same strategy on a server in an academic setting with similar
success.

Finally at the end I have detailed some configuration tables for use
in your own design. If you have any comments regarding this or notes
from your own design work I would like to hear from you so this
document can be upgraded.


2.3. Reading plan

Although not the biggest HOWTO it is nevertheless rather big already
and I have been requested to make a reading plan to make it possible
to cut down on the volume:

Expert
   (aka the elite). If you are familiar with Linux as well as disk
   drive technologies you will find most of what you need in the
   appendices. Additionally you are recommended to read the FAQ and
   the ``Bits'n'pieces'' chapter.

Experienced
   (aka Competent). If you are familiar with computers in general
   you can go straight to the chapters on ``technologies'' and
   continue from there on.

Newbie
   (mostly harmless). You just have to read the whole thing.
   Sorry. In addition you are also recommended to read all the
   other disk related HOWTOs.


3. Drive Technologies

A far more complete discussion on drive technologies for IBM PCs can
be found at the home page of The Enhanced IDE/Fast-ATA FAQ
<http://thef-nym.sci.kun.nl/~pieterh/storage.html> which is also
regularly posted on Usenet News. There is also a site dedicated to
ATA and ATAPI Information and Software <http://ata-atapi.com>.

Here I will just present what is needed to get an understanding of the
technology and get you started on your setup.


3.1. Drives

This is the physical device where your data lives and although the
operating system makes the various types seem rather similar they can
in actual fact be very different. An understanding of how they work
can be very useful in your design work. Floppy drives fall outside the
scope of this document, though should there be a big demand I could
perhaps be persuaded to add a little here.


3.2. Geometry

Physically, disk drives consist of one or more platters containing
data that is read and written using sensors mounted on movable heads
that are fixed with respect to each other. Data transfers therefore
happen across all surfaces simultaneously, which defines a cylinder of
tracks. The drive is also divided into sectors containing a number of
data fields.

Drives are therefore often specified in terms of their geometry: the
number of Cylinders, Heads and Sectors (CHS).

For various reasons there are now a number of translations between

o  the physical CHS of the drive itself

o  the logical CHS the drive reports to the BIOS or OS

o  the logical CHS used by the OS

Basically it is a mess and a source of much confusion. For more
information you are strongly recommended to read the Large Disk
mini-HOWTO.

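To make the CHS idea a little more concrete, here is a minimal sketch
of how a logical geometry maps to a capacity and to a linear sector
number. The function names, the 512 byte sector size and the example
geometry are assumptions made for illustration only; they are not
taken from any particular BIOS or driver.

```python
# Sketch only: names, the 512-byte sector size and the example geometry
# are illustrative assumptions, not taken from any real driver.

def chs_capacity_bytes(cylinders, heads, sectors, sector_bytes=512):
    """Capacity implied by a logical geometry, in bytes."""
    return cylinders * heads * sectors * sector_bytes


def chs_to_lba(c, h, s, heads, sectors):
    """Map a CHS address to a linear sector number.

    Cylinders and heads count from 0, sectors count from 1.
    """
    if c < 0 or not 0 <= h < heads or not 1 <= s <= sectors:
        raise ValueError("address outside the given geometry")
    return (c * heads + h) * sectors + (s - 1)


def lba_to_chs(lba, heads, sectors):
    """Inverse of chs_to_lba for the same logical geometry."""
    c, rest = divmod(lba, heads * sectors)
    h, s0 = divmod(rest, sectors)
    return c, h, s0 + 1


if __name__ == "__main__":
    geometry = dict(heads=16, sectors=63)   # hypothetical example drive
    lba = chs_to_lba(2, 5, 10, **geometry)
    print(lba, lba_to_chs(lba, **geometry))
```

Note that the same linear sector number corresponds to different CHS
triples under different logical geometries, which is one way to see
why mismatched translations cause so much confusion.
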

3.3. Media

The media technology determines important parameters such as
read/write rates, seek times and storage size, as well as whether the
medium is read/write or read-only.


3.3.1. Magnetic Drives

This is the typical read-write mass storage medium, and as everything
else in the computer world, comes in many flavours with different
properties. Usually this is the fastest technology and offers
read/write capability. The platter rotates with a constant angular
velocity (CAV) with a variable physical sector density for more
efficient magnetic media area utilisation. In other words, the number
of bits per unit length is kept roughly constant by increasing the
number of logical sectors for the outer tracks.

Typical values for rotational speeds are 4500 and 5400 RPM, though
7200 is also used. Very recently 10000 RPM has also entered the mass
market. Seek times are around 10 ms; transfer rates vary quite a bit
from one type to another but are typically 4-40 MB/s. With the extreme
high performance drives you should remember that performance costs
more electric power which is dissipated as heat, see the point on
``Power and Heating''.

Note that there are several kinds of transfers going on here, and that
these are quoted in different units. First of all there is the
platter-to-drive cache transfer, usually quoted in Mbits/s. Typical
values here are about 50-250 Mbits/s. The second stage is from the
built in drive cache to the adapter, and this is typically quoted in
MB/s; typical quoted values here are 3-40 MB/s. Note, however, that
this assumes the data is already in the cache. For sustained readout
from the platters the effective transfer rate will be dramatically
lower.

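As a rough illustration of the figures above, the following sketch
converts a rotational speed into the average rotational latency (half
a revolution) and converts a rate quoted in Mbits/s into MB/s. The
function names are made up for this example, and the conversion
ignores any encoding or formatting overhead on the platter, so the
MB/s figure is an upper bound.

```python
# Sketch only: function names are illustrative assumptions.

def avg_rotational_latency_ms(rpm):
    """Average wait for the wanted sector: half a revolution, in ms."""
    return 0.5 * 60_000.0 / rpm


def mbit_to_mbyte(mbit_per_s):
    """Convert Mbit/s to MB/s (8 bits per byte, overhead ignored)."""
    return mbit_per_s / 8.0


if __name__ == "__main__":
    for rpm in (4500, 5400, 7200, 10000):
        latency = avg_rotational_latency_ms(rpm)
        print(f"{rpm:>5} RPM: {latency:.2f} ms average rotational latency")
    for rate in (50, 250):
        print(f"{rate:>3} Mbit/s is at most {mbit_to_mbyte(rate):.2f} MB/s")
```

The rotational latency comes on top of the seek time, which is why a
faster spindle helps random access even when the seek time itself is
unchanged.
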

3.3.2. Optical Drives

Optical read/write drives exist but are slow and not so common. They
were used in the NeXT machine but the low speed was the source of many
complaints. The low speed is mainly due to the thermal nature
of the phase change that represents the data storage. Even when using
relatively powerful lasers to induce the phase changes the effects are
still slower than the magnetic effect used in magnetic drives.

Today many people use CD-ROM drives which, as the name suggests, are
read-only. Storage is about 650 MB; transfer speeds vary depending on
the drive but can exceed 1.5 MB/s. Data is stored on a single
spiraling track so it is not useful to talk about geometry for
this. Data density is constant so the drive uses constant linear
velocity (CLV). Seek is also slower, about 100 ms, partially due to
the spiraling track. Recent high speed drives use a mix of CLV and
CAV in order to maximize performance. This also reduces the access
time caused by the need to reach the correct rotational speed for
readout.

A new type (DVD) is on the horizon, offering up to about 18 GB on a
single disk.

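The difference between CLV and CAV can be illustrated with a little
arithmetic. In the sketch below the single speed data rate of 75
sectors of 2048 bytes per second is the usual CD-ROM figure; the
linear velocity of 1.3 m/s and the inner and outer radii of 25 mm and
58 mm are nominal values assumed here for illustration and are not
taken from this document.

```python
import math

# 1x CD-ROM data rate: 75 sectors per second of 2048 data bytes each.
CD_1X_BYTES_PER_S = 75 * 2048


def speed_factor(bytes_per_s):
    """How many times single speed a given transfer rate corresponds to."""
    return bytes_per_s / CD_1X_BYTES_PER_S


def clv_rpm(linear_velocity_m_s, radius_m):
    """Rotational speed needed to hold a linear velocity at a radius."""
    return 60.0 * linear_velocity_m_s / (2.0 * math.pi * radius_m)


if __name__ == "__main__":
    # Assumed nominal values, for illustration only.
    velocity = 1.3                  # m/s at single speed
    inner, outer = 0.025, 0.058     # metres
    print(f"1.5 MB/s is about {speed_factor(1_500_000):.1f}x")
    print(f"inner track: {clv_rpm(velocity, inner):.0f} RPM")
    print(f"outer track: {clv_rpm(velocity, outer):.0f} RPM")
```

Under these assumptions a CLV drive has to more than halve its
rotational speed when moving from the innermost to the outermost
track, which is the spin-up and spin-down delay referred to above.
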

3.3.3. Solid State Drives

This is a relatively recent addition to the available technology and
has been made popular especially in portable computers as well as in
embedded systems. Containing no moving parts they are very fast both
in terms of access and transfer rates. The most popular type is flash
RAM, but other types of RAM are also used. A few years ago many had
great hopes for magnetic bubble memories but they turned out to be
relatively expensive and are not that common.

In general the use of RAM disks is regarded as a bad idea as it is
normally more sensible to add more RAM to the motherboard and let the
operating system divide the memory pool into buffers, cache, program
and data areas. Only in very special cases, such as real time systems
with short time margins, can RAM disks be a sensible solution.

Flash RAM is today available in several tens of megabytes of storage
and one might be tempted to use it for fast, temporary storage in a
computer. There is however a huge snag with this: flash RAM has a
finite life time in terms of the number of times you can rewrite data,
so putting swap, /tmp or /var/tmp on such a device will certainly
shorten its lifetime dramatically. Instead, using flash RAM for
directories that are read often but rarely written to will be a big
performance win.

In order to get the optimum life time out of flash RAM you will need
to use special drivers that will use the RAM evenly and minimize the
number of block erases.

This example illustrates the advantages of splitting up your directory
structure over several devices.

Solid state drives have no real cylinder/head/sector addressing but
for compatibility reasons this is simulated by the driver to give a
uniform interface to the operating system.

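To see why even use of the flash matters, consider the following back
of the envelope sketch. All figures in it (a 32 MB device, 100000
erase cycles per block, the daily write volumes) are hypothetical
numbers chosen for illustration; check the data sheet of your own
device for real values.

```python
# Sketch only: all numbers below are hypothetical, for illustration.

def days_with_even_wear(capacity_bytes, erase_cycles, bytes_written_per_day):
    """Lifetime if a driver spreads writes perfectly over all blocks."""
    return capacity_bytes * erase_cycles / bytes_written_per_day


def days_for_hot_block(erase_cycles, rewrites_per_day):
    """Lifetime if the same block is rewritten again and again."""
    return erase_cycles / rewrites_per_day


if __name__ == "__main__":
    capacity = 32 * 2**20      # hypothetical 32 MB flash device
    cycles = 100_000           # hypothetical erase cycles per block

    # Busy swap or /tmp: assume 1 GB written per day.
    print(days_with_even_wear(capacity, cycles, 2**30), "days, even wear")
    # Same device, but one block rewritten 1000 times per day.
    print(days_for_hot_block(cycles, 1000), "days, single hot block")
```

With these made up numbers the device survives for years when wear is
spread evenly, but only a few months when a single block takes all
the rewrites, which is what a heavily used swap or /tmp area can do
without a wear levelling driver.
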

3.4. Interfaces

There is a plethora of interfaces to choose from, ranging widely in
price and performance. Most motherboards today include an IDE
interface which is part of modern chipsets.

Many motherboards also include a SCSI interface chip made by Symbios
(formerly NCR) that is connected directly to the PCI bus. Check
what you have and what BIOS support you have with it.


3.4.1. MFM and RLL

Once upon a time this was the established technology, a time when 20
MB was awesome, which compared to today's sizes makes you think that
dinosaurs roamed the Earth with these drives. Like the dinosaurs these
are outdated and are slow and unreliable compared to what we have
today. Linux does support this but you are well advised to think twice
about what you would put on this. One might argue that an emergency
partition with a suitable vintage of DOS might be fitting.


3.4.2. ESDI

Actually, ESDI was an adaptation of the very widely used SMD interface
used on "big" computers to the cable set used with the ST506
interface, which was more convenient to package than the 60-pin +
26-pin connector pair used with SMD. The ST506 was a "dumb" interface
which relied entirely on the controller and host computer to do
everything from computing head/cylinder/sector locations to keeping
track of the head location. ST506 required the controller to
extract clock from the recovered data, and control the physical
location of detailed track features on the medium, bit by bit. It had
about a 10-year life if you include the use of MFM, RLL, and ERLL/ARLL
modulation schemes.

ESDI, on the other hand, had intelligence, often
using three or four separate microprocessors on a single drive, and
high-level commands to format a track, transfer data, perform seeks,
and so on. Clock recovery from the data stream was accomplished at the
drive, which drove the clock line and presented its data in NRZ,
though error correction was still the task of the controller. ESDI
allowed the use of variable bit density recording, or, for that
matter, any other modulation technique, since it was locally generated
and resolved at the drive. Though many of the techniques used in ESDI
were later incorporated in IDE, it was the increased popularity of
SCSI which led to the demise of ESDI in computers. ESDI had a life of
about 10 years, though mostly in servers and otherwise "big" systems
rather than PCs.


3.4.3. IDE and ATA

Progress made the drive electronics migrate from the ISA slot card
over to the drive itself and Integrated Drive Electronics was born.
It was simple, cheap and reasonably fast so the BIOS designers
provided the kind of snag that the computer industry is so full of. A
combination of an IDE limitation of 16 heads together with the BIOS
limitation of 1024 cylinders gave us the infamous 504 MB limit.
Following the computer industry traditions again, the snag was patched
with a kludge and we got all sorts of translation schemes and BIOS
bodges. This means that you need to read the installation
documentation very carefully and check up on what BIOS you have and
what date it has as the BIOS has to tell Linux what size drive you
have. Fortunately with Linux you can also tell the kernel directly
what size drive you have with the drive parameters; check the
documentation for LILO and Loadlin thoroughly.

Note also that IDE is equivalent to ATA, AT Attachment. IDE uses
CPU-intensive Programmed Input/Output (PIO) to transfer data to and
from the drives and has no capability for the more efficient Direct
Memory Access (DMA) technology. The highest transfer rate is 8.3 MB/s.

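The 504 MB figure can be reproduced with a few lines of arithmetic.
The 16 head and 1024 cylinder limits are those mentioned above; the
remaining field limits and the 512 byte sector size in the sketch are
the commonly cited values for the IDE registers and the BIOS disk
interface, added here as assumptions to complete the calculation.

```python
# Sketch only. The 16-head (IDE) and 1024-cylinder (BIOS) limits come
# from the text; the other field limits and the 512-byte sector size
# are commonly cited values, assumed here to complete the arithmetic.

SECTOR_BYTES = 512
IDE_LIMITS = {"cylinders": 65536, "heads": 16, "sectors": 255}
BIOS_LIMITS = {"cylinders": 1024, "heads": 256, "sectors": 63}


def combined_limit_bytes(a, b, sector_bytes=SECTOR_BYTES):
    """Largest disk addressable when both sets of limits apply at once.

    Each CHS field is restricted to the smaller of the two limits.
    """
    size = sector_bytes
    for field in ("cylinders", "heads", "sectors"):
        size *= min(a[field], b[field])
    return size


if __name__ == "__main__":
    limit = combined_limit_bytes(IDE_LIMITS, BIOS_LIMITS)
    print(limit, "bytes =", limit // 2**20, "MB")
```

Neither interface is particularly small on its own; it is taking the
smaller limit in every field that leaves so little address space.
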

3.4.4. EIDE, Fast-ATA and ATA-2

These 3 terms are roughly equivalent: Fast-ATA is ATA-2 but EIDE
additionally includes ATAPI. ATA-2 is what most use these days; it is
faster and supports DMA. The highest transfer rate is increased to
16.6 MB/s.


3.4.5. Ultra-ATA

A new, faster DMA mode that at 33 MB/s is approximately twice the
speed of EIDE PIO-Mode 4. Disks with and without Ultra-ATA can be
mixed on the same cable without speed penalty for the faster adapters.
The Ultra-ATA interface is electrically identical with the normal
Fast-ATA interface, including the maximum cable length.

The ATA/66 was superseded by ATA/100 and very recently we have now
gotten ATA/133. While the interface speed has improved dramatically
the disks are often limited by platter-to-cache limits which today
stand at about 40 MB/s.

For more information read up on these overviews and whitepapers from
Maxtor: Fast Drives Technology
<http://www.maxtor.com/products/FastDrive/default.htm> on the ATA/133
interface and Big Drives Technology
<http://www.maxtor.com/products/BigDrive/default.htm> on breaking the
137 GB limit.

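Both points above come down to simple arithmetic. The sketch below
assumes that the 137 GB limit stems from a 28 bit sector address with
512 byte sectors, which is the usual explanation but is not spelled
out in this document, and that a long sequential transfer is bounded
by the slower of the interface and the platter-to-cache stage. The
function names are made up for this example.

```python
# Sketch only: the 28-bit address width and 512-byte sectors are the
# usual explanation of the 137 GB limit, assumed here for illustration.

SECTOR_BYTES = 512


def lba_limit_bytes(address_bits, sector_bytes=SECTOR_BYTES):
    """Largest capacity addressable with a linear sector address."""
    return (2 ** address_bits) * sector_bytes


def sustained_rate_mb_s(interface_mb_s, platter_mb_s):
    """A long sequential transfer is bounded by the slower stage."""
    return min(interface_mb_s, platter_mb_s)


if __name__ == "__main__":
    print(lba_limit_bytes(28) / 1e9, "GB with 28 address bits")
    for interface in (33, 66, 100, 133):
        rate = sustained_rate_mb_s(interface, 40)
        print(f"ATA/{interface}: about {rate} MB/s sustained")
```

With a platter rate of about 40 MB/s, everything from ATA/66 upwards
gives the same sustained figure for a single drive; the faster
interfaces mainly help transfers that are served from the drive cache.
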

3.4.6. Serial-ATA

A new standard has been agreed upon, the Serial-ATA interface, backed
by The Serial ATA <http://www.serial-ata.org/> group who made the
announcement in August 2001.

Advantages are numerous: simple, thin connectors rather than the old
cumbersome cable mats that also obstructed air flow, higher speeds
(about 150 MB/s) and backward compatibility.


3.4.7. ATAPI

The ATA Packet Interface was designed to support CD-ROM drives using
the IDE port and like IDE it is cheap and simple.

3.4.8. SCSI
|
||
|
||
The Small Computer System Interface is a multi purpose interface that
|
||
can be used to connect to everything from drives, disk arrays,
|
||
printers, scanners and more. The name is a bit of a misnomer as it has
|
||
traditionally been used by the higher end of the market as well as in
|
||
work stations since it is well suited for multi tasking environments.
|
||
|
||
The standard interface is 8 bits wide and can address 8 devices.
|
||
There is a wide version with 16 bit that is twice as fast on the same
|
||
clock and can address 16 devices. The host adapter always counts as a
|
||
device and is usually number 7. It is also possible to have 32 bit
|
||
wide busses but this usually requires a double set of cables to carry
|
||
all the lines.
|
||
|
||
The old standard was 5 MB/s and the newer fast-SCSI increased this to
|
||
10 MB/s. Recently ultra-SCSI, also known as Fast-20, arrived with 20
|
||
MB/s transfer rates for an 8 bit wide bus. New low voltage
|
||
differential (LVD) signalling allows these high speeds as well as much
|
||
longer cabling than before.
|
||
|
||
Even more recently an even faster standard has been introduced: SCSI
160 (originally named SCSI 160/m), which is capable of a monstrous 160
MB/s over a 16 bit wide bus. Support is still scarce, apart from a few
10000 RPM drives that can transfer 40 MB/s sustained. Putting 6 such
drives on a RAID will keep such a bus saturated and also saturate most
PCI busses. Obviously this is only for the very highest end servers
today. More information on this standard is available at The Ultra 160
|
||
SCSI home page <http://www.ultra160-scsi.com/>
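
The saturation claim above is simple arithmetic, and a rough sketch in
Python makes it easy to repeat for other setups. The function name is
made up for illustration, and the figures are the theoretical ones
quoted in this document (40 MB/s sustained per drive, 160 MB/s for the
SCSI bus, 132 MB/s for 32 bit PCI); real throughput will be lower.

```python
def drives_to_saturate(bus_mb_s, drive_mb_s):
    """Return the smallest number of drives whose combined sustained
    transfer rate reaches the given bus bandwidth (both in MB/s)."""
    if bus_mb_s <= 0 or drive_mb_s <= 0:
        raise ValueError("rates must be positive")
    # Ceiling division without floating point.
    return -(-bus_mb_s // drive_mb_s)


if __name__ == "__main__":
    drive = 40  # MB/s sustained, the figure quoted in the text
    for name, bus in (("SCSI 160", 160), ("32 bit PCI", 132)):
        n = drives_to_saturate(bus, drive)
        print(f"{name}: {n} drives at {drive} MB/s fill {bus} MB/s")
    # Six such drives together deliver more than either bus can carry.
    print("6 drives deliver", 6 * drive, "MB/s")
```

Under these assumptions the bus, not the drives, becomes the limit
well before six drives are attached.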
|
||
|
||
Adaptec just announced a Linux driver for their SCSI 160 host adapter.
More details will be added here as they become available.

SCSI/320 is now also available.
|
||
|
||
The higher performance comes at a cost that is usually higher than for
|
||
(E)IDE. The importance of correct termination and good quality cables
|
||
cannot be overemphasized. SCSI drives also often tend to be of a
|
||
higher quality than IDE drives. Also, adding SCSI devices tends to be
easier than adding more IDE drives: often it is only a matter of
|
||
plugging or unplugging the device; some people do this without
|
||
powering down the system. This feature is most convenient when you
|
||
have multiple systems and you can just take the devices from one
|
||
system to the other should one of them fail for some reason.
|
||
|
||
There are a number of useful documents you should read if you use
SCSI, such as the SCSI HOWTO and the SCSI FAQ posted on Usenet News.
|
||
|
||
SCSI also has the advantage that you can connect it easily to tape
drives for backing up your data, as well as some printers and
scanners. It is even possible to use it as a very fast network between
computers while simultaneously sharing SCSI devices on the same bus.
Work is under way, but due to problems with ensuring cache coherency
between the different computers connected, this is a non trivial task.

SCSI numbers are also used for arbitration. If several devices request
service on an 8 bit bus, the device with the highest number is given
priority, which is why the host adapter is usually number 7.
|
||
|
||
Note that newer SCSI cards will simultaneously support an array of
|
||
different types of SCSI devices all at individually optimized speeds.
|
||
|
||
|
||
|
||
3.5. Cabling
|
||
|
||
|
||
I do not intend to make too many comments on hardware but I feel I
should make a little note on cabling. This might seem like a
remarkably low-tech piece of equipment, yet sadly it is the source of
many frustrating problems. At today's high speeds one should think of
the cable more as an RF device with its inherent demands on impedance
matching. If you do not take precautions you will get much reduced
reliability or total failure. Some SCSI host adapters are more
sensitive to this than others.
|
||
|
||
Shielded cables are of course better than unshielded but the price is
|
||
much higher. With a little care you can get good performance from a
|
||
cheap unshielded cable.
|
||
|
||
|
||
o For Fast-ATA and Ultra-ATA, the maximum cable length is specified
  as 45cm (18"). The data lines of both IDE channels are connected on
  many boards, though, so they count as one cable. In any case EIDE
  cables should be as short as possible. If there are mysterious
  crashes or spontaneous changes of data, it is well worth
  investigating your cabling. Try a lower PIO mode or disconnect the
  second channel and see if the problem still occurs.

o For Cable Select (ATA drives) you set the drive jumpers to cable
  select and use the cable to determine master and slave. This is not
  much used.

o Do not have a slave on an ATA controller (primary or secondary)
  without a master on the same controller; behaviour in these cases
  is undefined.

o Use as short a cable as possible, but do not forget the 30 cm
  minimum separation for ultra SCSI and 60 cm separation for
  differential SCSI.

o Avoid long stubs between the cable and the drive; connect the plug
  on the cable directly to the drive without an extension.

o SCSI cabling limitations:

  Bus Speed (MHz)        | Max Length (m)
  ------------------------------------------------------------------
  5                      | 6
  10 (fast)              | 3
  20 (fast-20 / ultra)   | 3 (max 4 devices), 1.5 (max 8 devices)
  xx (differential)      | 25 (max 16 devices)
  ------------------------------------------------------------------

o Use correct termination for SCSI devices and at the correct
  positions: both ends of the SCSI chain. Remember the host adapter
  itself may have on board termination.

o Do not mix shielded and unshielded cabling, do not wrap cables
  around metal, and try to avoid proximity to metal parts along any
  part of the cabling. Any such discontinuities can cause impedance
  mismatching which in turn can cause reflection of signals which
  increases noise on the cable. This problem gets even more severe
  in the case of multi channel controllers. Recently someone
  suggested wrapping bubble plastic around the cables in order to
  avoid too close proximity to metal, a real problem inside crowded
  cabinets.
|
||
|
||
More information on SCSI cabling and termination can be found at
|
||
various web pages around the net.
|
||
|
||
|
||
|
||
3.6. Host Adapters
|
||
|
||
|
||
This is the other end of the interface from the drive, the part that
|
||
is connected to a computer bus. The speed of the computer bus and that
|
||
of the drives should be roughly similar, otherwise you have a
|
||
bottleneck in your system. Connecting a RAID 0 disk-farm to an ISA card
is pointless. These days most computers come with a 32 bit PCI bus
|
||
capable of 132 MB/s transfers which should not represent a bottleneck
|
||
for most people in the near future.
|
||
|
||
As the drive electronics migrated to the drives, the remaining part that
|
||
became the (E)IDE interface is so small it can easily fit into the PCI
|
||
chip set. The SCSI host adapter is more complex and often includes a
|
||
small CPU of its own and is therefore more expensive and not
|
||
integrated into the PCI chip sets available today. Technological
|
||
evolution might change this.
|
||
|
||
Some host adapters come with separate caching and intelligence but as
|
||
this is basically second guessing the operating system the gains are
|
||
heavily dependent on which operating system is used. Some of the more
|
||
primitive ones, that shall remain nameless, experience great gains.
|
||
Linux, on the other hand, has so many smarts of its own that the
|
||
gains are much smaller.
|
||
|
||
Mike Neuffer, who did the drivers for the DPT controllers, states that
the DPT controllers are intelligent enough that, given enough cache
memory, they will give you a big push in performance, and suggests that
people who have experienced little gain with smart controllers just
have not used a sufficiently intelligent caching controller.
|
||
|
||
|
||
3.7. Multi Channel Systems
|
||
|
||
In order to increase throughput it is necessary to identify the most
|
||
significant bottlenecks and then eliminate them. In some systems, in
|
||
particular where there are a great number of drives connected, it is
|
||
advantageous to use several controllers working in parallel. This
applies both to SCSI host adapters and to IDE controllers, which
usually have 2
|
||
channels built in. Linux supports this.
|
||
|
||
Some RAID controllers feature 2 or 3 channels and it pays to spread
|
||
the disk load across all channels. In other words, if you have two
|
||
SCSI drives you want to RAID and a two channel controller, you should
|
||
put each drive on a separate channel.
|
||
|
||
|
||
3.8. Multi Board Systems
|
||
|
||
In addition to having both SCSI and IDE controllers in the same machine it is
|
||
also possible to have more than one SCSI controller. Check the SCSI-
|
||
HOWTO on what controllers you can combine. Also you will most likely
|
||
have to tell the kernel it should probe for more than just a single
|
||
SCSI or a single IDE controller. This is done using kernel parameters
|
||
when booting, for instance using LILO. Check the HOWTOs for SCSI and
|
||
LILO for how to do this.
|
||
|
||
Multi board systems can offer significant speed gains if you configure
|
||
your disks right, especially for RAID0. Make sure you interleave the
|
||
controllers as well as the drives, so that you add drives to the md
|
||
RAID device in the right order. If controller 1 is connected to
drives sda and sdb while controller 2 is connected to drives sdc and
sdd you will gain more parallelism by adding in the order of sda -
sdc - sdb - sdd rather than sda - sdb - sdc - sdd because a read or
write over more than one cluster will be more likely to span two
controllers.
|
||
|
||
|
||
The same methods can also be applied to IDE. Most motherboards
typically support 4 IDE devices:

o hda primary master

o hdb primary slave

o hdc secondary master

o hdd secondary slave

where the two primary devices share one flat cable and the two
secondary devices share another. Modern chipsets keep the two
channels independent.
|
||
Therefore it is best to RAID in the order hda - hdc - hdb - hdd as
|
||
this will most likely parallelise both channels.
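
The ordering rule can be sketched in a few lines of Python. This is
only an illustration: the function name is hypothetical and the device
names are the IDE ones listed above. The idea is to take one drive
from each channel in turn, so that neighbouring members of the array
sit on different channels.

```python
from itertools import zip_longest


def interleave_by_channel(channels):
    """Return a drive order that visits each channel in turn.

    `channels` is a list of lists, one inner list of device names per
    controller or channel. Consecutive entries in the result sit on
    different channels whenever that is possible.
    """
    order = []
    for group in zip_longest(*channels):
        # zip_longest pads short channels with None; drop the padding.
        order.extend(dev for dev in group if dev is not None)
    return order


if __name__ == "__main__":
    primary = ["hda", "hdb"]
    secondary = ["hdc", "hdd"]
    print(" - ".join(interleave_by_channel([primary, secondary])))
```

The resulting list is the order in which you would name the drives
when building the RAID device.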
|
||
|
||
|
||
3.9. Speed Comparison
|
||
|
||
The following tables are given just to indicate what speeds are
|
||
possible but remember that these are the theoretical maximum speeds.
|
||
All transfer rates are in MB per second and bus widths are measured in
|
||
bits.
|
||
|
||
|
||
|
||
3.9.1. Controllers
|
||
|
||
|
||
|
||
IDE : 8.3 - 16.7
|
||
Ultra-ATA : 33 - 66
|
||
|
||
SCSI:
                         |  Bus width (bits)
Bus Speed (MHz)          |    8     16     32
--------------------------------------------------
5                        |    5     10     20
10 (fast)                |   10     20     40
20 (fast-20 / ultra)     |   20     40     80
40 (fast-40 / ultra-2)   |   40     80     --
|
||
--------------------------------------------------
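
The SCSI figures in this table follow from one rule: each clock cycle
moves one bus width of data, so the peak rate in MB/s is the clock in
MHz times the width in bytes. A minimal sketch, with a function name
invented for the purpose:

```python
def peak_rate_mb_s(clock_mhz, width_bits):
    """Theoretical peak transfer rate in MB/s for a parallel bus that
    moves one full bus width per clock cycle."""
    return clock_mhz * width_bits // 8


if __name__ == "__main__":
    widths = (8, 16, 32)
    for clock in (5, 10, 20, 40):
        row = [peak_rate_mb_s(clock, width) for width in widths]
        print(clock, row)
```

The function returns a number for every combination, including the 40
MHz, 32 bit case that the table leaves blank. The same rule gives the
PCI figures in the next section, for instance 33 MHz at 32 bits.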
|
||
|
||
|
||
|
||
3.9.2. Bus Types
|
||
|
||
|
||
|
||
ISA : 8-12
|
||
EISA : 33
|
||
VESA : 40 (Sometimes tuned to 50)
|
||
|
||
PCI:
                         |  Bus width (bits)
Bus Speed (MHz)          |   32     64
--------------------------------------------------
33                       |  132    264
66                       |  264    528
|
||
--------------------------------------------------
|
||
|
||
|
||
|
||
3.10. Benchmarking
|
||
|
||
This is a very, very difficult topic and I will only make a few
|
||
cautious comments about this minefield. First of all, it is very
difficult to make comparable benchmarks that have any actual meaning.
|
||
This, however, does not stop people from trying...
|
||
|
||
Instead you can use benchmarking to diagnose your own system, to check
that it is running as fast as it should and has not slowed down. Also you
|
||
would expect a significant increase when switching from a simple file
|
||
system to RAID, so a lack of performance gain will tell you something
|
||
is wrong.
|
||
|
||
When you try to benchmark you should not hack up your own, instead
|
||
look up iozone and bonnie and read the documentation very carefully.
|
||
In particular make sure your buffer size is bigger than your RAM size,
|
||
otherwise you test your RAM rather than your disks which will give you
|
||
unrealistically high performance.
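
To pick a test size that is safely larger than RAM you first need to
know how much RAM the kernel sees. The sketch below parses the
MemTotal line in the format used by /proc/meminfo. The function names
and the safety factor of 2 are my own assumptions, not something taken
from the iozone or bonnie documentation.

```python
def mem_total_mb(meminfo_text):
    """Extract MemTotal from the text of /proc/meminfo, in MB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The line looks like "MemTotal:   262144 kB".
            return int(line.split()[1]) // 1024
    raise ValueError("no MemTotal line found")


def suggested_test_size_mb(ram_mb, factor=2):
    """Test data size that is comfortably larger than RAM.
    The default factor of 2 is an assumption, not a fixed rule."""
    return ram_mb * factor


if __name__ == "__main__":
    # Sample text; on a live system pass open("/proc/meminfo").read().
    sample = "MemTotal:   262144 kB\nMemFree:     10240 kB\n"
    ram = mem_total_mb(sample)
    print(f"RAM {ram} MB: use at least {suggested_test_size_mb(ram)} MB")
```

Pass the resulting size to whatever option your benchmark uses for the
size of its test data.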
|
||
|
||
A very simple benchmark can be obtained using hdparm -tT which can be
|
||
used both on IDE and SCSI drives.
|
||
|
||
For more information on benchmarking and software for a number of
|
||
platforms, check out ACNC <http://www.acnc.com/benchmarks.html>
|
||
benchmark page as well as this one <http://spin.ch/~tpo/bench/> and
|
||
also The Benchmarking-HOWTO
|
||
<http://www.linuxdoc.org/HOWTO/Benchmarking-HOWTO.html>.
|
||
|
||
There are also official home pages for bonnie
|
||
<http://www.textuality.com/bonnie/>, bonnie++
|
||
<http://www.coker.com.au/bonnie++/> and iozone
|
||
<http://www.iozone.org>.
|
||
|
||
Trivia: Bonnie is intended to locate bottlenecks, the name is a
|
||
tribute to Bonnie Raitt, "who knows how to use one" as the author puts
|
||
it.
|
||
|
||
|
||
3.11. Comparisons
|
||
|
||
SCSI offers more performance than EIDE but at a price. Termination is
|
||
more complex but expansion is not too difficult. Having more than 4 (or
|
||
in some cases 2) IDE drives can be complicated, with wide SCSI you can
|
||
have up to 15 per adapter. Some SCSI host adapters have several
|
||
channels thereby multiplying the number of possible drives even
|
||
further.
|
||
|
||
For SCSI you have to dedicate one IRQ per host adapter which can
|
||
control up to 15 drives. With EIDE you need one IRQ for each channel
|
||
(which can connect up to 2 disks, master and slave) which can cause
|
||
conflict.
|
||
|
||
RLL and MFM are in general too old, slow and unreliable to be of much
|
||
use.
|
||
|
||
|
||
|
||
3.12. Future Development
|
||
|
||
|
||
SCSI-3 is under way and will hopefully be released soon. Faster
devices are already being announced: recently first an 80 MB/s and
then a 160 MB/s monster specification were proposed, and devices
following them very recently became commercially available. These are
based around the Ultra-2 standard (which used a 40 MHz clock) combined
with a 16 bit cable.
|
||
|
||
Some manufacturers already announce SCSI-3 devices but this is
|
||
currently rather premature as the standard is not yet firm. As the
|
||
transfer speeds increase the saturation point of the PCI bus is
|
||
getting closer. Currently the 64 bit version has a limit of 264 MB/s.
|
||
The PCI transfer rate will in the future be increased from the current
|
||
33 MHz to 66 MHz, thereby increasing the limit to 528 MB/s.
|
||
|
||
The ATA development is continuing and is increasing the performance
|
||
with the new ATA/100 standard. Since most ATA drives cannot sustain
transfers from the platter at anything like this rate, the performance
increase will for most people be small.
|
||
|
||
More interesting is the Serial ATA development, where the flat cable
|
||
will be replaced with a high speed serial link. This makes cabling far
|
||
simpler than today and also it solves the problem of cabling
|
||
obstructing airflow over the drives.
|
||
|
||
Another trend is for larger and larger drives. I hear it is possible
|
||
to get 75 GB on a single drive though this is rather expensive.
|
||
Currently the optimum storage for your money is about 30 GB but also
|
||
this is continuously increasing. The introduction of DVD will in the
|
||
near future have a big impact, with nearly 20 GB on a single disk you
|
||
can have a complete copy of even major FTP sites from around the
|
||
world. The only thing we can be reasonably sure about the future is
|
||
that even if it won't get any better, it will definitely be bigger.
|
||
|
||
Addendum: soon after I first wrote this I read that the maximum useful
|
||
speed for a CD-ROM was 20x as mechanical stability would be too great
|
||
a problem at these speeds. About one month after that again the first
|
||
commercial 24x CD-ROMs were available... Currently you can get 40x and
|
||
no doubt higher speeds are in the pipeline.
|
||
|
||
A project to encapsulate SCSI over TCP/IP, called iSCSI
|
||
<http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-06.txt> has
|
||
started, and one Linux iSCSI implementation
|
||
<http://www.cs.uml.edu/~mbrown/iSCSI> has appeared.
|
||
|
||
|
||
|
||
3.13. Recommendations
|
||
|
||
My personal view is that EIDE or Ultra ATA is the best way to start
|
||
out on your system, especially if you intend to use DOS as well on
|
||
your machine. If you plan to expand your system over many years or
|
||
use it as a server I would strongly recommend you get SCSI drives.
|
||
Currently wide SCSI is a little more expensive. You are generally more
|
||
likely to get more for your money with standard width SCSI. There are
also differential versions of the SCSI bus which increase the maximum
length of the cable. The price increase is even more substantial and
they cannot therefore be recommended for normal users.
|
||
|
||
In addition to disk drives you can also connect some types of scanners
|
||
and printers and even networks to a SCSI bus.
|
||
|
||
Also keep in mind that as you expand your system you will draw ever
|
||
more power, so make sure your power supply is rated for the job and
|
||
that you have sufficient cooling. Many SCSI drives offer the option of
|
||
sequential spin-up which is a good idea for large systems. See also
|
||
``Power and Heating''.
|
||
|
||
|
||
|
||
4. File System Structure
|
||
|
||
Linux has been multi tasking from the very beginning, with a number of
programs interacting and running continuously. It is therefore
important to keep a file structure that everyone can agree on so that
the system finds data where it expects to. Historically there have
been so many different standards that it was confusing, and
compatibility was maintained using symbolic links, which confused the
issue even further until the structure ended up looking like a maze.
|
||
|
||
In the case of Linux a standard called the File Systems Standard
(FSSTND) was fortunately agreed on early, and today it is used by all
main Linux distributions.
|
||
|
||
Later it was decided to make a successor that should also support
operating systems other than just Linux, called the Filesystem
Hierarchy Standard (FHS), currently at version 2.2. This standard is
|
||
under continuous development and will soon be adopted by Linux
|
||
distributions.
|
||
|
||
I recommend not trying to roll your own structure as a lot of thought
|
||
has gone into the standards and many software packages comply with the
|
||
standards. Instead you can read more about this at the FHS home page
|
||
<http://www.pathname.com/fhs/>.
|
||
|
||
This HOWTO endeavours to comply with FSSTND and will follow FHS when
|
||
distributions become available.
|
||
|
||
|
||
|
||
4.1. File System Features
|
||
|
||
The various parts of FSSTND have different requirements regarding
speed, reliability and size. For instance, losing root is a pain but
can easily be recovered from, while losing /var/spool/mail is a rather
different issue. Here is a quick summary of some essential parts and
their properties and requirements. Note that this is just a guide:
there can be binaries in etc and lib directories, libraries in bin
directories and so on.
|
||
|
||
|
||
|
||
4.1.1. Swap
|
||
|
||
|
||
Speed
|
||
Maximum! Though if you rely too much on swap you should consider
|
||
buying some more RAM. Note, however, that on many old Pentium PC
|
||
motherboards the cache will not work on RAM above 128 MB.
|
||
|
||
|
||
Size
|
||
Similar to RAM. Quick and dirty algorithm, just as for tea:
16 MB for the machine and 2 MB for each user. The smallest kernel
runs in 1 MB but that is tight; use 4 MB for general work and light
applications, 8 MB for X11 or GCC or 16 MB to be comfortable.
(The author is known to brew a rather powerful cuppa tea...)
|
||
|
||
Some suggest that swap space should be 1-2 times the size of the
|
||
RAM, pointing out that the locality of the programs determines
|
||
how effective your added swap space is. Note that using the same
|
||
algorithm as for 4BSD is slightly incorrect as Linux does not
|
||
allocate space for pages in core.
|
||
|
||
A more thorough approach is to consider swap space plus RAM as
|
||
your total working set, so if you know how much space you will
|
||
need at most, you subtract the physical RAM you have and that is
|
||
the swap space you will need.
|
||
|
||
There is also another reason to be generous when dimensioning
|
||
your swap space: memory leaks. Ill-behaved programs that do not
free the memory they allocate for themselves are said to have a
memory leak. The allocation remains even after the program has
stopped using the memory, so this is a source of memory consumption.
Only after the program exits is the memory returned. Once all
|
||
physical RAM and swap space are exhausted the only solution is
|
||
to kill the offending processes if possible, or failing that,
|
||
reboot and start over. Thankfully such programs are not too
|
||
common but should you come across one you will find that extra
|
||
swap space will buy you extra time between reboots.
|
||
|
||
Also remember to take into account the type of programs you use.
|
||
Some programs that have large working sets, such as image
|
||
processing software have huge data structures loaded in RAM
|
||
rather than working explicitly on disk files. Data and computing
|
||
intensive programs like this will cause excessive swapping if
|
||
you have less RAM than the requirements.
|
||
|
||
Other types of programs can lock their pages into RAM. This can
|
||
be for security reasons, preventing copies of data reaching a
|
||
swap device or for performance reasons such as in a real time
|
||
module. Either way, locking pages reduces the remaining amount
|
||
of swappable memory and can cause the system to swap earlier
than otherwise expected.
|
||
|
||
In man 8 mkswap it is explained that each swap partition can be
|
||
a maximum of just under 128 MB in size for 32-bit machines and
|
||
just under 256 MB for 64-bit machines.
|
||
|
||
This however changed with kernel 2.2.0 after which the limit is
|
||
2 GB. The man page has been updated to reflect this change.
|
||
|
||
|
||
|
||
Reliability
|
||
Medium. When it fails you know it pretty quickly and failure
|
||
will cost you some lost work. You save often, don't you?
|
||
|
||
Note 1
|
||
Linux offers the possibility of interleaved swapping across
|
||
multiple devices, a feature that can gain you much. Check out
|
||
"man 8 swapon" for more details. However, software raiding swap
|
||
across multiple devices adds more overheads than you gain.
|
||
|
||
Thus the /etc/fstab file might look like this:
|
||
|
||
|
||
/dev/sda1 swap swap pri=1 0 0
|
||
/dev/sdc1 swap swap pri=1 0 0
|
||
|
||
|
||
|
||
Remember that the fstab file is very sensitive to the formatting
|
||
used, read the man page carefully and do not just cut and paste the
|
||
lines above.
|
||
|
||
|
||
Note 2
|
||
Some people use a RAM disk for swapping or some other file
|
||
systems. However, unless you have some very unusual requirements
|
||
or setups you are unlikely to gain much from this as this cuts
|
||
into the memory available for caching and buffering.
|
||
|
||
|
||
Note 2b
|
||
There is one exception: on a number of badly designed
|
||
motherboards the on board cache memory is not able to cache all
|
||
the RAM that can be addressed. Many older motherboards could
|
||
accept 128 MB RAM but only cache the lower 64 MB. In such cases
|
||
it would improve the performance if you used the upper
|
||
(uncached) 64 MB RAM for RAMdisk based swap or other temporary
|
||
storage.
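
The two sizing rules given under Size above can be written down as a
small Python sketch. The function names and the example figures (64 MB
of RAM, a 200 MB peak working set) are illustrative assumptions only;
you have to supply your own estimate of the working set.

```python
def swap_from_working_set(peak_working_set_mb, ram_mb):
    """Swap needed so that RAM plus swap covers the largest expected
    working set. Never negative."""
    return max(0, peak_working_set_mb - ram_mb)


def swap_rule_of_thumb(ram_mb):
    """The '1-2 times RAM' suggestion as a (low, high) range in MB."""
    return ram_mb, 2 * ram_mb


if __name__ == "__main__":
    ram = 64             # MB of physical RAM, example value
    working_set = 200    # MB needed at most, example value
    print("working set rule:", swap_from_working_set(working_set, ram), "MB")
    low, high = swap_rule_of_thumb(ram)
    print(f"rule of thumb: {low} to {high} MB")
```

If you suspect memory leaks, add headroom on top of either figure, as
discussed above.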
|
||
|
||
|
||
|
||
4.1.2. Temporary Storage ( /tmp and /var/tmp )
|
||
|
||
|
||
Speed
|
||
Very high. On a separate disk/partition this will reduce
|
||
fragmentation generally, though ext2fs handles fragmentation
|
||
rather well.
|
||
|
||
|
||
Size
|
||
Hard to tell, small systems are easy to run with just a few MB
|
||
but these are notorious hiding places for stashing files away
|
||
from prying eyes and quota enforcement and can grow without
|
||
control on larger machines. Suggested: small home machine: 8 MB,
|
||
large home machine: 32 MB, small server: 128 MB, and large
|
||
machines up to 500 MB (The machine used by the author at work
|
||
has 1100 users and a 300 MB /tmp directory). Keep an eye on
|
||
these directories, not only for hidden files but also for old
|
||
files. Also be prepared that these partitions might be the first
|
||
reason you might have to resize your partitions.
|
||
|
||
|
||
Reliability
|
||
Low. Often programs will warn or fail gracefully when these
|
||
areas fail or are filled up. Random file errors will of course
|
||
be more serious, no matter what file area this is.
|
||
|
||
Files
|
||
Mostly short files but there can be a huge number of them.
|
||
Normally programs delete their old tmp files but if somehow an
|
||
interruption occurs they could survive. Many distributions have
|
||
a policy regarding cleaning out tmp files at boot time, you
|
||
might want to check out what your setup is.
|
||
|
||
|
||
Note1
|
||
In FSSTND there is a note about putting /tmp on RAM disk. This,
|
||
however, is not recommended for the same reasons as stated for
|
||
swap. Also, as noted earlier, do not use flash RAM drives for
|
||
these directories. One should also keep in mind that some
|
||
systems are set to automatically clean tmp areas on rebooting.
|
||
|
||
|
||
Note2
|
||
Older systems had a /usr/tmp but this is no longer recommended
|
||
and for historical reasons a symbolic link now makes it point to
|
||
one of the other tmp areas.
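
Keeping an eye on old and hidden files in the tmp areas is easy to
automate. The following is a minimal sketch in Python, not a
replacement for whatever cleaning policy your distribution ships. The
function name and the 7 day threshold are assumptions chosen for the
example, and the script only reports files, it deletes nothing.

```python
import os
import stat
import time


def old_files(top, max_age_days, now=None):
    """Yield (path, age_in_days) for regular files under `top` whose
    modification time is older than `max_age_days`. Hidden files
    (names starting with a dot) are included."""
    now = time.time() if now is None else now
    limit = max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets and the like
            age = now - st.st_mtime
            if age > limit:
                yield path, age / 86400


if __name__ == "__main__":
    for path, age in old_files("/tmp", 7):
        print(f"{age:6.1f} days  {path}")
```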
|
||
|
||
|
||
|
||
(* That was 50 lines, I am home and dry! *)
|
||
|
||
|
||
4.1.3. Spool Areas ( /var/spool/news and /var/spool/mail )
|
||
|
||
|
||
Speed
|
||
High, especially on large news servers. News transfer and
|
||
expiring are disk intensive and will benefit from fast drives.
|
||
Print spools: low. Consider RAID0 for news.
|
||
|
||
|
||
Size
|
||
For news/mail servers: whatever you can afford. For single user
|
||
systems a few MB will be sufficient if you read continuously.
|
||
Joining a list server and taking a holiday is, on the other
|
||
hand, not a good idea. (Again the machine I use at work has 100
|
||
MB reserved for the entire /var/spool)
|
||
|
||
|
||
Reliability
|
||
Mail: very high, news: medium, print spool: low. If your mail is
|
||
very important (isn't it always?) consider RAID for reliability.
|
||
|
||
|
||
Files
|
||
Usually a huge number of files that are around a few KB in size.
|
||
Files in the print spool can on the other hand be few but quite
|
||
sizable.
|
||
|
||
|
||
Note
|
||
Some of the news documentation suggests putting all the
|
||
.overview files on a drive separate from the news files, check
|
||
out all news FAQs for more information. Typical size is about
|
||
3-10 percent of total news spool size.
|
||
|
||
|
||
|
||
4.1.4. Home Directories ( /home )
|
||
|
||
|
||
|
||
Speed
|
||
Medium. Although many programs use /tmp for temporary storage,
|
||
others such as some news readers frequently update files in the
|
||
home directory which can be noticeable on large multiuser
|
||
systems. For small systems this is not a critical issue.
|
||
|
||
|
||
Size
|
||
Tricky! On some systems people pay for storage so this is
|
||
usually then a question of finance. Large systems such as
|
||
Nyx.net <http://www.nyx.net/> (which is a free Internet service
|
||
with mail, news and WWW services) run successfully with a
|
||
suggested limit of 100 KB per user and 300 KB as enforced
|
||
maximum. Commercial ISPs offer typically about 5 MB in their
|
||
standard subscription packages.
|
||
|
||
If however you are writing books or are doing design work the
|
||
requirements balloon quickly.
|
||
|
||
|
||
Reliability
|
||
Variable. Losing /home on a single user machine is annoying but
|
||
when 2000 users call you to tell you their home directories are
|
||
gone it is more than just annoying. For some their livelihood
|
||
relies on what is here. You do regular backups of course?
|
||
|
||
|
||
Files
|
||
Equally tricky. The minimum setup for a single user tends to be
|
||
a dozen files, 0.5 - 5 KB in size. Project related files can be
|
||
huge though.
|
||
|
||
|
||
Note1
|
||
You might consider RAID for either speed or reliability. If you
|
||
want extremely high speed and reliability you might be looking
|
||
at other operating system and hardware platforms anyway. (Fault
|
||
tolerance etc.)
|
||
|
||
|
||
Note2
|
||
Web browsers often use a local cache to speed up browsing and
|
||
this cache can take up a substantial amount of space and cause
|
||
much disk activity. There are many ways of avoiding this kind of
performance hit; for more information see the sections on
|
||
``Home Directories'' and ``WWW''.
|
||
|
||
|
||
Note3
|
||
Users often tend to use up all available space on the /home
|
||
partition. The Linux Quota subsystem is capable of limiting the
|
||
number of blocks and the number of inodes a single user ID can
|
||
allocate on a per-filesystem basis. See the Linux Quota mini-
|
||
HOWTO <http://www.linuxdoc.org/HOWTO/mini/Quota.html> by Albert
|
||
M.C. Tam bertie (at) scn.org for details on setup.
|
||
|
||
|
||
|
||
4.1.5. Main Binaries ( /usr/bin and /usr/local/bin )
|
||
|
||
|
||
Speed
|
||
Low. Often data is bigger than the programs which are demand
|
||
loaded anyway so this is not speed critical. Witness the
|
||
successes of live file systems on CD ROM.
|
||
Size
|
||
The sky is the limit but 200 MB should give you most of what you
|
||
want for a comprehensive system. A big system, for software
|
||
development or a multi purpose server should perhaps reserve 500
|
||
MB both for installation and for growth.
|
||
|
||
|
||
Reliability
|
||
Low. This is usually mounted under root where all the essentials
|
||
are collected. Nevertheless losing all the binaries is a pain...
|
||
|
||
|
||
Files
|
||
Variable but usually of the order of 10 - 100 KB.
|
||
|
||
|
||
|
||
4.1.6. Libraries ( /usr/lib and /usr/local/lib )
|
||
|
||
|
||
Speed
|
||
Medium. These are large chunks of data loaded often, ranging
|
||
from object files to fonts, all susceptible to bloating. Often
|
||
these are also loaded in their entirety and speed is of some use
|
||
here.
|
||
|
||
|
||
Size
|
||
Variable. This is for instance where word processors store their
|
||
immense font files. The few that have given me feedback on this
|
||
report about 70 MB in their various lib directories. A rather
|
||
complete Debian 1.2 installation can take as much as 250 MB
|
||
which can be taken as a realistic upper limit. The following
|
||
ones are some of the largest disk space consumers: GCC, Emacs,
|
||
TeX/LaTeX, X11 and perl.
|
||
|
||
|
||
Reliability
|
||
Low. See point ``Main binaries''.
|
||
|
||
|
||
Files
|
||
Usually large with many of the order of 1 MB in size.
|
||
|
||
|
||
Note
|
||
For historical reasons some programs keep executables in the lib
|
||
areas. One example is GCC which has some huge binaries in the
|
||
/usr/lib/gcc/lib hierarchy.
|
||
|
||
|
||
4.1.7. Boot
|
||
|
||
|
||
Speed
|
||
Quite low: after all booting doesn't happen that often and
|
||
loading the kernel is just a tiny fraction of the time it takes
|
||
to get the system up and running.
|
||
|
||
|
||
Size
|
||
Quite small, a complete image with some extras fits on a single
|
||
floppy so 5 MB should be plenty.
|
||
|
||
|
||
|
||
Reliability
|
||
High. See section below on Root.
|
||
|
||
|
||
Note 1
|
||
The most important part about the Boot partition is that on many
|
||
systems it must reside below cylinder 1023. This is a BIOS
|
||
limitation that Linux cannot get around.
|
||
|
||
|
||
Note 1a
|
||
The above is not necessarily true for recent IDE systems and not
|
||
for any SCSI disks. For more information check the latest Large
|
||
Disk HOWTO.
|
||
|
||
|
||
Note 2
|
||
Recently a new boot loader has been written that overcomes the
|
||
1023 cylinder limit. For more information check out this article
|
||
<http://www.linuxforum.com/plug/articles/nuni.html> on nuni.
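
How far into the disk the cylinder limit reaches depends on the
geometry the BIOS reports. The old BIOS interface addresses cylinders
0 to 1023, so the reachable area is 1024 cylinders times heads times
sectors per track times the sector size. The sketch below works this
out; the function name is invented, and the two geometries (16 or 255
heads, 63 sectors of 512 bytes) are common examples rather than values
taken from this document.

```python
def bios_limit_bytes(heads, sectors_per_track, cylinders=1024,
                     sector_size=512):
    """Bytes reachable through the old BIOS interface, which can only
    address `cylinders` cylinders (numbered 0 to 1023 by default)."""
    return cylinders * heads * sectors_per_track * sector_size


if __name__ == "__main__":
    for heads in (16, 255):
        size = bios_limit_bytes(heads, 63)
        print(f"{heads} heads, 63 sectors: {size} bytes")
```

A boot partition that ends before this offset is safe on systems that
still have the limitation.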
|
||
|
||
|
||
|
||
4.1.8. Root
|
||
|
||
|
||
Speed
|
||
Quite low: only the bare minimum is here, much of which is only
|
||
run at startup time.
|
||
|
||
|
||
Size
|
||
Relatively small. However it is a good idea to keep some
|
||
essential rescue files and utilities on the root partition and
|
||
some keep several kernel versions. Feedback suggests about 20 MB
|
||
would be sufficient.
|
||
|
||
|
||
Reliability
|
||
High. A failure here will possibly cause a fair bit of grief and
|
||
you might end up spending some time rescuing your boot
|
||
partition. With some practice you can of course do this in an
|
||
hour or so, but I would think if you have some practice doing
|
||
this you are also doing something wrong.
|
||
|
||
Naturally you do have a rescue disk? Of course this is updated
|
||
since you did your initial installation? There are many ready
|
||
made rescue disks as well as rescue disk creation tools you
|
||
might find valuable. Presumably investing some time in this
|
||
saves you from becoming a root rescue expert.
|
||
|
||
|
||
Note 1
|
||
If you have plenty of drives you might consider putting a spare
|
||
emergency boot partition on a separate physical drive. It will
|
||
cost you a little bit of space but if your setup is huge the
|
||
time saved, should something fail, will be well worth the extra
|
||
space.
|
||
|
||
|
||
Note 2
|
||
For simplicity and also in case of emergencies it is not
|
||
advisable to put the root partition on a RAID level 0 system.
|
||
Also if you use RAID for your boot partition you have to
|
||
remember to have the md option turned on for your emergency
|
||
kernel.
|
||
|
||
|
||
Note 3
|
||
For simplicity it is quite common to keep Boot and Root on the
|
||
same partition. If you do that, then in order to boot from LILO
|
||
it is important that the essential boot files reside wholly
|
||
below cylinder 1023. This includes the kernel as well as files
|
||
found in /boot.
|
||
|
||
|
||
|
||
4.1.9. DOS etc.
|
||
|
||
At the danger of sounding heretical I have included this little
|
||
section about something many reading this document have strong
|
||
feelings about. Unfortunately many hardware items come with setup and
|
||
maintenance tools based around those systems, so here goes.
|
||
|
||
|
||
Speed
|
||
Very low. The systems in question are not famed for speed so
|
||
there is little point in using prime quality drives.
|
||
Multitasking or multi-threading are not available so the command
|
||
queueing facility found in SCSI drives will not be taken
|
||
advantage of. If you have an old IDE drive it should be good
|
||
enough. The exception is to some degree Win95 and more notably
|
||
NT which have multi-threading support which should theoretically
|
||
be able to take advantage of the more advanced features offered
|
||
by SCSI devices.
|
||
|
||
|
||
Size
|
||
The company behind these operating systems is not famed for
|
||
writing tight code so you have to be prepared to spend a few
|
||
tens of MB depending on what version you install of the OS or
|
||
Windows. With an old version of DOS or Windows you might fit it
|
||
all in on 50 MB.
|
||
|
||
|
||
Reliability
|
||
Ha-ha. As the chain is no stronger than the weakest link you can
|
||
use any old drive. Since the OS is more likely to scramble
|
||
itself than the drive is likely to self destruct you will soon
|
||
learn the importance of keeping backups here.
|
||
|
||
Put another way: "Your mission, should you choose to accept it,
|
||
is to keep this partition working. The warranty will self
|
||
destruct in 10 seconds..."
|
||
|
||
Recently I was asked to justify my claims here. First of all I
|
||
am not calling DOS and Windows sorry excuses for operating
|
||
systems. Secondly there are various legal issues to be taken
|
||
into account. Saying there is a connection between the last two
|
||
sentences is merely the ravings of the paranoid. Surely.
|
||
Instead I shall offer the esteemed reader a few key words: DOS
|
||
4.0, DOS 6.x and various drive compression tools that shall
|
||
remain nameless.
|
||
|
||
|
||
|
||
4.2. Explanation of Terms
|
||
|
||
Naturally the faster the better but often the happy installer of Linux
|
||
has several disks of varying speed and reliability so even though this
|
||
document describes performance as 'fast' and 'slow' it is just a rough
|
||
guide since no finer granularity is feasible. Even so there are a few
|
||
details that should be kept in mind:
|
||
|
||
|
||
|
||
4.2.1. Speed
|
||
|
||
This is really a rather woolly mix of several terms: CPU load,
|
||
transfer setup overhead, disk seek time and transfer rate. It is in
|
||
the very nature of tuning that there is no fixed optimum, and in most
|
||
cases price is the dictating factor. CPU load is only significant for
|
||
IDE systems where the CPU does the transfer itself but is generally
|
||
low for SCSI, see SCSI documentation for actual numbers. Disk seek
|
||
time is also small, usually in the millisecond range. This however is
|
||
not a problem if you use command queueing on SCSI where you then
|
||
overlap commands keeping the bus busy all the time. News spools are a
|
||
special case consisting of a huge number of normally small files so in
|
||
this case seek time can become more significant.
|
||
|
||
There are two main parameters that are of interest here:
|
||
|
||
|
||
Seek
|
||
is usually specified as the average time taken for the read/write
|
||
head to seek from one track to another. This parameter is
|
||
important when dealing with a large number of small files such
|
||
as found in spool files. There is also the extra rotational delay
before the desired sector rotates into position under the head.
This delay is dependent on the angular velocity of the drive
which is why this parameter quite often is quoted for a drive.
Common values are 4500, 5400 and 7200 RPM (rotations per
minute). Higher RPM reduces this delay but at a substantial
|
||
cost. Also drives working at 7200 RPM have been known to be
|
||
noisy and to generate a lot of heat, a factor that should be
|
||
kept in mind if you are building a large array or "disk farm".
|
||
Very recently drives working at 10000 RPM have entered the market
|
||
and here the cooling requirements are even stricter and minimum
|
||
figures for air flow are given.
|
||
|
||
|
||
Transfer
|
||
is usually specified in megabytes per second. This parameter is
|
||
important when handling large files that have to be transferred.
|
||
Library files, dictionaries and image files are examples of
|
||
this. Drives featuring a high rotation speed also normally have
|
||
fast transfers as transfer speed is proportional to angular
|
||
velocity for the same sector density.
|
||
|
||
It is therefore important to read the specifications for the drives
|
||
very carefully, and note that the maximum transfer speed quite often
|
||
is quoted for transfers out of the on board cache (burst speed) and
|
||
not directly from the platter (sustained speed). See also section on
|
||
``Power and Heating''.
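
The dependence of the rotational delay on angular velocity can be put
into numbers. On average the desired sector is half a revolution away,
so the average delay is half the time of one rotation. A minimal
sketch in Python follows; the function names are my own and the seek
time passed in is whatever the drive specification quotes.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for a sector to arrive: half a revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def avg_access_ms(seek_ms, rpm):
    """Average seek time plus average rotational delay."""
    return seek_ms + avg_rotational_latency_ms(rpm)


# The rotation speeds mentioned in the text above.
for rpm in (4500, 5400, 7200, 10000):
    print(f"{rpm:>5} RPM: {avg_rotational_latency_ms(rpm):.2f} ms average rotational delay")
```

Note that this says nothing about sustained transfer speed, which has
to be read from the drive specifications or measured.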
|
||
|
||
|
||
|
||
4.2.2. Reliability
|
||
|
||
Naturally no-one would want low reliability disks but one might be
|
||
better off regarding old disks as unreliable. Also for RAID purposes
|
||
(See the relevant information) it is suggested to use a mixed set of
|
||
disks so that simultaneous disk crashes become less likely.
|
||
|
||
So far I have had only one report of total file system failure but
|
||
here unstable hardware seemed to be the cause of the problems.
|
||
|
||
Disks are cheap these days yet people still underestimate the value of
|
||
the contents of the drives. If you need higher reliability make sure
|
||
you replace old drives and keep spares. It is not unusual that drives
|
||
can work more or less continuously for years and years but what often
|
||
kills a drive in the end is power cycling.
|
||
|
||
|
||
4.2.3. Files
|
||
|
||
The average file size is important in order to decide the most
|
||
suitable drive parameters. A large number of small files makes the
|
||
average seek time important whereas for big files the transfer speed
|
||
is more important. The command queueing in SCSI devices is very handy
|
||
for handling large numbers of small files, but for transfer EIDE is
|
||
not too far behind SCSI and normally much cheaper than SCSI.
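
To see why file size decides which parameter matters, one can split
the time to read a set of files into positioning time and transfer
time. The sketch below is a deliberately crude model in Python: it
assumes one head positioning per file and ignores caching,
fragmentation and command queueing. The access time and transfer rate
used in the example are hypothetical figures, not measurements.

```python
def read_time_s(n_files, avg_file_bytes, access_ms, transfer_mb_s):
    """Return (positioning seconds, transfer seconds) for reading n_files files."""
    positioning = n_files * access_ms / 1000.0
    transfer = n_files * avg_file_bytes / (transfer_mb_s * 1_000_000)
    return positioning, transfer


if __name__ == "__main__":
    # Hypothetical drive: 12 ms average access, 10 MB/s sustained transfer.
    for label, count, size in (("news spool", 10_000, 2_000),
                               ("libraries", 20, 1_000_000)):
        positioning, transfer = read_time_s(count, size, 12.0, 10.0)
        print(f"{label}: {positioning:.2f} s positioning, {transfer:.2f} s transfer")
```

With many small files the positioning term dominates, while with a few
large files the transfer term does, which matches the rule of thumb
given above.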
|
||
|
||
|
||
|
||
5. File Systems
|
||
|
||
Over time the requirements for file systems have increased, and the
demands for large structures, large files, long file names and more
have prompted ever more advanced file systems, the systems that access
and organise the data on mass storage. Today there is a large number
of file systems to choose from and this section will describe these in
detail.
|
||
|
||
The emphasis is on Linux but with more input I will be happy to add
|
||
information for a wider audience.
|
||
|
||
|
||
|
||
5.1. General Purpose File Systems
|
||
|
||
Most operating systems usually have a general purpose file system for
|
||
every day use for most kinds of files, reflecting available features
|
||
in the OS such as permission flags, protection and recovery.
|
||
|
||
|
||
5.1.1. minix
|
||
|
||
This was the original fs for Linux, back in the days when Linux was hosted
|
||
on minix machines. It is simple but limited in features and hardly
|
||
ever used these days other than in some rescue disks as it is rather
|
||
compact.
|
||
|
||
|
||
5.1.2. xiafs and extfs
|
||
|
||
These are also old and have fallen into disuse and are no longer
|
||
recommended.
|
||
|
||
|
||
5.1.3. ext2fs
|
||
|
||
This is the established standard for general purpose in the Linux
|
||
world. It is fast, efficient and mature and is under continuous
|
||
development and features such as ACL and transparent compression are
|
||
on the horizon.
|
||
|
||
For more information check the ext2fs
|
||
<http://web.mit.edu/tytso/www/linux/ext2.html> home page.
|
||
|
||
|
||
|
||
5.1.4. ext3fs
|
||
|
||
This is the name for the upcoming successor to ext2fs due to enter
|
||
stable kernel in the near future. Many features are added to ext2fs
|
||
but to avoid confusion over the name after such a radical upgrade the
|
||
name will be changed too. You may have heard of it already; source
code is now in beta release.
|
||
|
||
Patches are available at Linux.org
|
||
<ftp://ftp.linux.org.uk/pub/linux/sct/fs/jfs>.
|
||
|
||
|
||
|
||
5.1.5. ufs
|
||
|
||
This is the fs used by BSD and variants thereof. It is mature but also
|
||
developed for older types of disk drives where geometries were known.
|
||
The fs uses a number of tricks to optimise performance but as disk
|
||
geometries are translated in a number of ways the net effect is no
|
||
longer so optimal.
|
||
|
||
|
||
|
||
5.1.6. efs
|
||
|
||
The Extent File System (efs) is Silicon Graphics' early file system
|
||
widely used on IRIX before version 6.0 after which xfs has taken over.
|
||
While migration to xfs is encouraged efs is still supported and much
|
||
used on CDs.
|
||
|
||
There is a Linux driver available in early beta stage, available at
|
||
Linux extent file system <http://aeschi.ch.eu.org/efs/> home page.
|
||
|
||
|
||
|
||
5.1.7. XFS
|
||
|
||
Silicon Graphics Inc (sgi) <http://www.sgi.com/> has started porting
|
||
its mainframe grade file system to Linux. Source is not yet available
|
||
as they are busily cleaning out legal encumbrance but once that is
|
||
done they will provide the source code under GPL.
|
||
|
||
More information is already available on the XFS project page
|
||
<http://oss.sgi.com/projects/xfs/> at SGI.
|
||
|
||
|
||
|
||
5.1.8. reiserfs
|
||
|
||
As of July 23rd, 1997 Hans Reiser reiser (at) RICOCHET.NET has put up
the source to his tree based reiserfs <http://www.namesys.com> on the
web. His filesystem has some very interesting features, is much
faster than ext2fs and is in use by a number of people.
Hopefully it will be ready for kernel 2.4.0 which might be ready at
the end of the year.
|
||
|
||
|
||
|
||
5.1.9. enh-fs
|
||
|
||
The Enhanced File System project is now dead.
|
||
|
||
|
||
|
||
5.1.10. Tux2 fs
|
||
|
||
This is a variation on the ext2fs that adds robustness in case of
|
||
unexpected interruptions such as power failure. After such an event
|
||
Tux2 fs will restart with the file system in a consistent, recently
|
||
recorded state without fsck or other recovery operations. To achieve
|
||
this Tux2 fs uses a newly designed algorithm called Phase Tree.
|
||
|
||
More information can be found at the project home page
|
||
<http://tux2.sourceforge.net>.
|
||
|
||
|
||
|
||
5.2. Microsoft File Systems
|
||
|
||
This company is responsible for a lot, including a number of
|
||
filesystems that have at the very least caused confusion.
|
||
|
||
|
||
|
||
5.2.1. fat
|
||
|
||
Actually there are 2 fats out there, fat12 and fat16 depending on the
|
||
partition size used but fortunately the difference is so minor that
|
||
the whole issue is transparent.
|
||
|
||
On the plus side these are fast and simple and most OSes understand
them and can both read and write them. And that is about it.
|
||
|
||
The minus side is limited safety, severely limited permission flags
|
||
and atrocious scalability. For instance with fat you cannot have
|
||
partitions larger than 2 GB.
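
The 2 GB figure follows from the size of the table entries: with 16
bit entries there can be at most about 65536 clusters, and the limit
is reached when each cluster is as large as it can be. The sketch
below, in Python, assumes a largest cluster size of 32 KB as used by
DOS and Win95, and it ignores the handful of reserved table entries,
so it gives an upper bound rather than an exact figure. The function
names are my own.

```python
def fat_max_partition_bytes(fat_bits, cluster_bytes):
    """Upper bound: one cluster per possible table entry (reserved entries ignored)."""
    return (2 ** fat_bits) * cluster_bytes


def min_cluster_bytes(partition_bytes, fat_bits=16):
    """Smallest power-of-two cluster size, from 512 bytes up, covering the partition."""
    size = 512
    while fat_max_partition_bytes(fat_bits, size) < partition_bytes:
        size *= 2
    return size


if __name__ == "__main__":
    print(fat_max_partition_bytes(16, 32 * 1024), "bytes at most with 32 KB clusters")
    print(min_cluster_bytes(500 * 2**20), "byte clusters needed for a 500 MB partition")
```

The second function hints at the scalability problem: the larger the
partition, the larger the clusters, and every small file still
occupies at least one whole cluster.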
|
||
|
||
|
||
|
||
5.2.2. fat32
|
||
|
||
After about 10 years Microsoft realised fat was about, well, 10 years
|
||
behind the times and created this fs which scales reasonably well.
|
||
|
||
Permission flags are still limited. NT 4.0 cannot read this file
|
||
system but Linux can.
|
||
|
||
|
||
|
||
5.2.3. vfat
|
||
|
||
At the same time as Microsoft launched fat32 they also added support
|
||
for long file names, known as vfat.
|
||
|
||
Linux reads vfat and fat32 partitions by mounting with type vfat.
|
||
|
||
|
||
|
||
5.2.4. ntfs
|
||
|
||
This is the native fs of Win-NT but as complete information is not
|
||
available there is limited support for other OSes.
|
||
|
||
|
||
5.3. Logging and Journaling File Systems
|
||
|
||
These take a radically different approach to file updates by logging
|
||
modifications for files in a log and later at some time checkpointing
|
||
the logs.
|
||
|
||
Reading is roughly as fast as traditional file systems that always
|
||
update the files directly. Writing is much faster as only updates are
|
||
appended to a log. All this is transparent to the user. It is in
|
||
reliability and particularly in checking file system integrity that
|
||
these file systems really shine. Since the data before last
|
||
checkpointing is known to be good only the log has to be checked, and
|
||
this is much faster than for traditional file systems.
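
The idea of appending updates to a log and checkpointing them later
can be shown with a toy model. The Python sketch below keeps
everything in memory and is in no way how any of the file systems
mentioned here are implemented; the class and method names are
invented for the illustration. It only shows why writes are cheap
appends and why recovery needs to look at the log alone.

```python
class LoggedStore:
    """Toy 'file system': updates go to a log, checkpoints fold them into the store."""

    def __init__(self):
        self.checkpointed = {}  # state known to be good
        self.log = []           # updates appended since the last checkpoint

    def write(self, name, data):
        # A write is just an append; nothing is updated in place.
        self.log.append((name, data))

    def read(self, name):
        # The newest log entry wins, otherwise fall back to the checkpointed state.
        for logged_name, data in reversed(self.log):
            if logged_name == name:
                return data
        return self.checkpointed.get(name)

    def checkpoint(self):
        # Apply the logged updates in order, then start a fresh log.
        for name, data in self.log:
            self.checkpointed[name] = data
        self.log.clear()

    def recover(self):
        """Replay the log after a crash; returns how many entries had to be examined."""
        examined = len(self.log)
        self.checkpoint()
        return examined
```

After a crash only the entries since the last checkpoint are examined,
however large the checkpointed store has grown.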
|
||
|
||
Note that while logging filesystems keep track of changes made to both
|
||
data and inodes, journaling filesystems keep track only of inode
|
||
changes.
|
||
|
||
Linux has quite a choice in such file systems but none are yet of
|
||
production quality. Some are also on hold.
|
||
|
||
|
||
* Adam Richter from Yggdrasil posted some time ago that they have
|
||
been working on a compressed log file based system but that this
|
||
project is currently on hold. Nevertheless a non-working version is
|
||
available on their FTP server. Check out the Yggdrasil ftp server
|
||
<ftp://ftp.yggdrasil.com/private/adam> where special patched
|
||
versions of the kernel can be found.
|
||
|
||
* Another project is the Linux log-structured Filesystem Project
|
||
<http://outflux.net/projects/lfs/> which sadly also is on hold.
|
||
Nevertheless this page contains much information on the topic.
|
||
|
||
* Then there is the LinLogFS -- A Log-Structured Filesystem For Linux
<http://www.complang.tuwien.ac.at/czezatke/lfs.html> (formerly
known as dtfs) which seems to be going strong. Still in alpha but
sufficiently complete to make programs run off this file system.
|
||
|
||
* Finally there is the Journaling Flash File System
|
||
<http://developer.axis.com/software/jffs/> designed for their
|
||
embedded diskless systems such as their Linux based web camera.
|
||
|
||
Note that ext3fs, XFS and reiserfs also have features for logging or
|
||
journaling.
|
||
|
||
|
||
5.4. Read-only File Systems
|
||
|
||
Read-only media have not escaped the ever increasing complexities seen
in more general file systems so again there is a wide selection to
choose from with corresponding opportunities for exciting mistakes.
|
||
|
||
Note that ext2fs works quite well on a CD-ROM and seems to save space
|
||
while offering the normal file system features such as long file names
|
||
and permissions that can be retained when copying files across to
|
||
read-write media. Also having /dev on a CD-ROM is possible.
|
||
|
||
Most of these are used with the CD-ROM media but also the new DVD can
|
||
be used and you can even use it through the loopback device on a hard
|
||
disk file for verifying an image before burning a ROM.
|
||
|
||
There is a read-only romfs for Linux but as that is not disk related
|
||
nothing more will be said about it here.
|
||
|
||
|
||
|
||
5.4.1. High Sierra
|
||
|
||
This was one of the earliest standards for CD-ROM formats, supposedly
|
||
named after the hotel where the final agreement took place.
|
||
|
||
High Sierra was so limited in features that new extensions simply had
|
||
to appear and while there has been no end to new formats the original
|
||
High Sierra remains the common precursor and is therefore still widely
|
||
supported.
|
||
|
||
|
||
|
||
5.4.2. iso9660
|
||
|
||
The International Organization for Standardization (ISO) made their extensions and
|
||
formalised the standard into what we know as the iso9660 standard.
|
||
|
||
The Linux iso9660 file system supports High Sierra as well as
|
||
Rock Ridge extensions.
|
||
|
||
|
||
|
||
5.4.3. Rock Ridge
|
||
|
||
Not everyone accepts limits like short filenames and lack of
|
||
permissions so very soon the Rock Ridge extensions appeared to rectify
|
||
these shortcomings.
|
||
|
||
|
||
|
||
5.4.4. Joliet
|
||
|
||
Microsoft, not to be outdone in the standards extension game, decided
|
||
it should extend CD-ROM formats with some internationalisation
|
||
features and called it Joliet.
|
||
|
||
Linux supports this standard in kernels 2.0.34 or newer. You need to
|
||
enable NLS in order to use it.
|
||
|
||
|
||
|
||
5.4.5. Trivia
|
||
|
||
Joliet is a city outside Chicago, best known for being the site of the
|
||
prison where Jake was locked up in the movie "Blues Brothers." Rock
|
||
Ridge (the UNIX extensions to ISO 9660) is named after the (fictional)
|
||
town in the movie "Blazing Saddles."
|
||
|
||
|
||
|
||
5.4.6. UDF
|
||
|
||
With the arrival of DVD with up to about 17 GB of storage capacity the
|
||
world seemingly needed another format, this time ambitiously named
|
||
Universal Disk Format (UDF). This is intended to replace iso9660 and
|
||
will be required for DVD.
|
||
|
||
Currently this is not in the standard Linux kernel but a project is
underway to make a UDF driver
<http://trylinux.com/projects/udf/index.html> for Linux. Patches and
documentation are available.
|
||
|
||
More information is also available at the Linux and DVDs
|
||
<http://atv.ne.mediaone.net/linux-dvd/> page.
|
||
|
||
|
||
|
||
5.5. Networking File Systems
|
||
|
||
There is a large number of networking technologies available that let
you distribute disks throughout a local or even global network. This
|
||
is somewhat peripheral to the topic of this HOWTO but as it can be
|
||
used with local disks I will cover this briefly. It would be best if
|
||
someone (else) took this into a separate HOWTO...
|
||
|
||
|
||
5.5.1. NFS
|
||
|
||
This is one of the earliest systems that allows mounting a file space
|
||
on one machine onto another. There are a number of problems with NFS
|
||
ranging from performance to security but it has nevertheless become
|
||
established.
|
||
|
||
|
||
5.5.2. AFS
|
||
|
||
This is a system that allows efficient sharing of files across large
|
||
networks. Starting out as an academic project it is now sold by
|
||
Transarc <http://www.transarc.com> whose home page gives you more
|
||
details.
|
||
|
||
Derek Atkins, of MIT, ported AFS to Linux and has also set up the
Linux AFS mailing list (linux-afs@mit.edu) for this, which is open to
the public. Requests to join the list should go to
linux-afs-request@mit.edu and finally bug reports should be directed
to linux-afs-bugs@mit.edu.
|
||
|
||
Important: as AFS uses encryption it is restricted software and cannot
|
||
easily be exported from the US.
|
||
|
||
IBM, which owns Transarc, has announced the availability of the latest
|
||
version of client as well as server for Linux.
|
||
|
||
Arla is a free AFS implementation, check the Arla homepage
|
||
<http://www.stacken.kth.se/projekt/arla/> for more information as well
|
||
as documentation.
|
||
|
||
|
||
|
||
5.5.3. Coda
|
||
|
||
A networking filesystem similar to AFS is underway and is called Coda
|
||
<http://coda.cs.cmu.edu/>. This is designed to be more robust and
|
||
fault tolerant than AFS, and supports mobile, disconnected operations.
|
||
Currently it does not scale very well, and does not really have proper
|
||
administrative tools, as AFS does and Arla is beginning to.
|
||
|
||
|
||
|
||
5.5.4. nbd
|
||
|
||
The Network Block Device <http://atrey.karlin.mff.cuni.cz/~pavel/>
|
||
(nbd) is available in Linux kernel 2.2 and later and reportedly offers
|
||
excellent performance. The interesting thing here is that it can be
|
||
combined with RAID (see later).
|
||
|
||
|
||
|
||
5.5.5. enbd
|
||
|
||
The Enhanced Network Block Device <http://www.it.uc3m.es/~ptb/nbd>
(enbd) is a project to enhance the nbd with features such as
block journaled multi channel communications, internal failover,
automatic balancing between channels and more.
|
||
|
||
The intended use is for RAID over the net.
|
||
|
||
|
||
5.5.6. GFS
|
||
|
||
The Global File System <http://gfs.lcse.umn.edu/> is a new file system
|
||
designed for storage across a wide area network. It is currently in
|
||
the early stages and more information will come later.
|
||
|
||
|
||
|
||
5.6. Special File Systems
|
||
|
||
In addition to the general file systems there are also a number of more
specific ones, usually providing higher performance or other
features at the cost of a tradeoff in other respects.
|
||
|
||
|
||
|
||
5.6.1. tmpfs and swapfs
|
||
|
||
For short term fast file storage SunOS offers tmpfs which is about the
|
||
same as the swapfs on NeXT. This overcomes the inherent slowness in
|
||
ufs by caching file data and keeping control information in memory.
|
||
This means that data on such a file system will be lost when rebooting
|
||
and is therefore mainly suitable for /tmp area but not /var/tmp which
|
||
is where temporary data that must survive a reboot is placed.
|
||
|
||
SunOS offers very limited tuning for tmpfs and the number of files is
|
||
even limited by total physical memory of the machine.
|
||
|
||
|
||
Linux has featured tmpfs since kernel version 2.4; it is enabled by
turning on virtual memory file system support (formerly shm fs). Under
certain circumstances tmpfs can lock up the system in early kernel
versions, so make sure you use version 2.4.6 or later.
|
||
|
||
|
||
|
||
5.6.2. userfs
|
||
|
||
The user file system (userfs) allows a number of extensions to
|
||
traditional file system use such as FTP based file system, compression
|
||
(arcfs) and fast prototyping and many other features. The docfs is
|
||
based on this filesystem. Check the userfs homepage
|
||
<http://www.goop.org/~jeremy/userfs/> for more information.
|
||
|
||
|
||
|
||
5.6.3. devfs
|
||
|
||
When disks are added, removed or just fail it is likely that disk
|
||
device names of the remaining disks will change. For instance if sdb
|
||
fails then the old sdc becomes sdb, the old sdd becomes sdc and so on.
|
||
Note that in this case hda, hdb etc will remain unchanged. Likewise
|
||
if a new drive is added the reverse may happen.
|
||
|
||
There is no guarantee that SCSI ID 0 becomes sda and that adding disks
|
||
in increasing ID order will just add a new device name without
|
||
renaming previous entries, as some SCSI drivers assign from ID 0 and
|
||
up while others reverse the scanning order. Likewise adding a SCSI
|
||
host adapter can also cause renaming.
|
||
|
||
Generally device names are assigned in the order they are found.
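
The renaming effect is easy to reproduce on paper. The Python sketch
below models nothing more than the rule just stated, that names are
handed out in the order the disks are found. The disk labels and
function names are made up for the illustration, and the model stops
at 26 disks since it only uses a single letter.

```python
import string


def assign_names(disks_in_scan_order, prefix="sd"):
    """Give each detected disk a name (sda, sdb, ...) in the order it was found."""
    return {disk: prefix + string.ascii_lowercase[i]
            for i, disk in enumerate(disks_in_scan_order)}


def renamed_after_failure(disks, failed):
    """Return {disk: (old name, new name)} for disks whose name changes."""
    before = assign_names(disks)
    after = assign_names([d for d in disks if d != failed])
    return {d: (before[d], after[d]) for d in after if before[d] != after[d]}


if __name__ == "__main__":
    disks = ["id0", "id1", "id2", "id3"]  # hypothetical disks in scan order
    print(renamed_after_failure(disks, "id1"))
```

Every disk found after the failed one moves up a letter, so any
configuration file that refers to the old names now points at the
wrong drive.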
|
||
|
||
The source of the problem lies in the limited number of bits available
|
||
for major and minor numbering in the device files used to describe the
|
||
device itself. You can see these in the /dev directory; info on the
|
||
numbering and allocation can be found in man MAKEDEV. Currently there
|
||
are 2 solutions to this problem in various stages of development:
|
||
|
||
scsidev
|
||
works by creating a database of drives and where they belong,
|
||
check man scsifs and the scsidev home page for more information
|
||
|
||
devfs
|
||
is a more long term project aimed at getting around the whole
|
||
business of device numbering by making the /dev directory a
|
||
kernel file system in the same way as /proc is. More
|
||
information will appear as it becomes available.
|
||
|
||
|
||
|
||
5.6.4. smugfs
|
||
|
||
For a number of reasons it is currently difficult to have files bigger
|
||
than 2 GB. One file system that tries to overcome this limit is smugfs
|
||
which is very fast but also simple. For instance there are no
|
||
directories and the block allocation is simple.
|
||
|
||
It is available as compressed tarred source code
|
||
<ftp://atrey.karlin.mff.cuni.cz/pub/local/mj/linux/> and while it
|
||
worked with kernel version 2.1.85 it is quite possible some work is
|
||
required to make it fit into newer kernels. Also the low version
|
||
number (0.0) suggests extra care is required.
|
||
|
||
|
||
|
||
5.7. File System Recommendations
|
||
|
||
There is a jungle of choices but generally it is recommended to use
|
||
the general file system that comes with your distribution. If you use
|
||
ufs and have some kind of tmpfs available you should first start off
|
||
with the general file system to get an idea of the space requirements
|
||
and if necessary buy more RAM to support the size of tmpfs you need.
|
||
Otherwise you will end up with mysterious crashes and lost time.
|
||
|
||
If you use dual boot and need to transfer data between the two OSes
|
||
one of the simplest ways is to use an appropriately sized partition
|
||
formatted with fat as most systems can reliably read and write this.
|
||
Remember the limit of 2 GB for fat partitions.
|
||
|
||
For more information on file system interconnectivity you can check
|
||
out the file system
|
||
<http://students.ceid.upatras.gr/~gef/fs/oldindex.html> page which has
|
||
been superseded by file system <http://www.penguin.cz/~mhi/fs/> and
|
||
the article Kragen's Amazing List of Filesystems
|
||
<http://linuxtoday.com/stories/5556.html>.
|
||
|
||
|
||
That guide is being superseded by a HOWTO which is underway and a link
|
||
will be added when it is ready.
|
||
|
||
To avoid total havoc with device renaming if a drive fails check out
|
||
the scanning order of your system and try to keep your root system on
|
||
hda or sda and removable media such as ZIP drives at the end of the
|
||
scanning order.
|
||
|
||
|
||
6. Technologies
|
||
|
||
In order to decide how to get the most out of your devices you need to
|
||
know what technologies are available and their implications. As always
|
||
there can be some tradeoffs with respect to speed, reliability, power,
|
||
flexibility, ease of use and complexity.
|
||
|
||
Many of the techniques described below can be stacked in a number of
|
||
ways to maximise performance and reliability, though at the cost of
|
||
added complexity.
|
||
|
||
|
||
|
||
6.1. RAID
|
||
|
||
This is a method of increasing reliability, speed or both by using
|
||
multiple disks in parallel thereby decreasing access time and
|
||
increasing transfer speed. A checksum or mirroring system can be used
|
||
to increase reliability. Large servers can take advantage of such a
|
||
setup but it might be overkill for a single user system unless you
|
||
already have a large number of disks available. See other documents
|
||
and FAQs for more information.
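
The two ingredients mentioned above, spreading data over disks in
parallel and adding a checksum for reliability, can be sketched in a
few lines of Python. This is a generic illustration using XOR parity
as the checksum; it is not the code of the md driver or of any
controller, and the function names are my own.

```python
from functools import reduce


def stripe_location(logical_block, n_disks):
    """Striping: consecutive blocks go to consecutive disks. Returns (disk, offset)."""
    return logical_block % n_disks, logical_block // n_disks


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks, usable as a simple checksum block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing block from the survivors and the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]  # one block per data disk
    parity = xor_parity(data)
    print(rebuild_missing([data[0], data[2]], parity))  # the block from the lost disk
```

Striping alone gives the speed, the parity block gives the ability to
survive the loss of any single disk, and mirroring simply keeps a full
second copy instead.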
|
||
|
||
For Linux one can set up a RAID system using either software (the md
|
||
module in the kernel), a Linux compatible controller card
(PCI-to-SCSI) or a SCSI-to-SCSI controller. Check the documentation for what
|
||
controllers can be used. A hardware solution is usually faster, and
|
||
perhaps also safer, but comes at a significant cost.
|
||
|
||
A summary of available hardware RAID solutions for Linux is available
at Linux Consulting
<http://www.Linux-Consulting.com/Raid/Docs/raid_hw.txt>.
|
||
|
||
|
||
|
||
6.1.1. SCSI-to-SCSI
|
||
|
||
SCSI-to-SCSI controllers are usually implemented as complete cabinets
|
||
with drives and a controller that connects to the computer with a
|
||
second SCSI bus. This makes the entire cabinet of drives look like a
|
||
single large, fast SCSI drive and requires no special RAID driver. The
|
||
disadvantage is that the SCSI bus connecting the cabinet to the
|
||
computer becomes a bottleneck.
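
A back-of-the-envelope model shows when the connecting bus becomes the
limit. The Python sketch below assumes that drive throughput adds up
perfectly, which real arrays do not quite manage, and the figures in
the example are hypothetical rather than taken from any product.

```python
import math


def cabinet_throughput_mb_s(n_drives, drive_mb_s, bus_mb_s):
    """Throughput seen by the host: the drives combined, capped by the connecting bus."""
    return min(n_drives * drive_mb_s, bus_mb_s)


def drives_to_saturate(drive_mb_s, bus_mb_s):
    """Number of drives at which the connecting bus becomes the limit."""
    return math.ceil(bus_mb_s / drive_mb_s)


if __name__ == "__main__":
    # Hypothetical figures: 10 MB/s per drive behind a 40 MB/s bus.
    for n in (2, 4, 8):
        print(n, "drives:", cabinet_throughput_mb_s(n, 10, 40), "MB/s")
```

Beyond the saturation point extra drives add capacity and redundancy
but no further speed as seen from the host.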
|
||
|
||
People with large disk farms face a further problem: there is a limit
to how many SCSI entries there can be in the /dev directory. In these
cases using SCSI-to-SCSI will conserve entries, as each cabinet
appears as a single drive.
|
||
|
||
Usually they are configured via the front panel or with a terminal
|
||
connected to their on-board serial interface.
|
||
|
||
|
||
Some manufacturers of such systems are CMD <http://www.cmd.com> and
|
||
Syred <http://www.syred.com> whose web pages describe several systems.
|
||
|
||
|
||
|
||
6.1.2. PCI-to-SCSI
|
||
|
||
PCI-to-SCSI controllers are, as the name suggests, connected to the
|
||
high speed PCI bus and therefore do not suffer from the same
|
||
bottleneck as the SCSI-to-SCSI controllers. These controllers require
|
||
special drivers but you also get the means of controlling the RAID
|
||
configuration over the network which simplifies management.
|
||
|
||
Currently only a few families of PCI-to-SCSI host adapters are
|
||
supported under Linux.
|
||
|
||
|
||
|
||
DPT
|
||
The oldest and most mature is a range of controllers from DPT
|
||
<http://www.dpt.com> including SmartCache I/III/IV and SmartRAID
|
||
I/III/IV controller families. These controllers are supported
|
||
by the EATA-DMA driver in the standard kernel. This company also
|
||
has an informative home page <http://www.dpt.com> which also
|
||
describes various general aspects of RAID and SCSI in addition
|
||
to the product related information.
|
||
|
||
More information from the author of the DPT controller drivers
|
||
(EATA* drivers) can be found at his pages on SCSI
|
||
<http://www.uni-mainz.de/~neuffer/scsi/> and DPT
|
||
<http://www.uni-mainz.de/~neuffer/scsi/dpt/>.
|
||
|
||
These are not the fastest but have a good track record of proven
|
||
reliability.
|
||
|
||
Note that the maintenance tools for DPT controllers currently
|
||
run under DOS/Win only so you will need a small DOS/Win
|
||
partition for some of the software. This also means you have to
|
||
boot the system into Windows in order to maintain your RAID
|
||
system.
|
||
|
||
|
||
|
||
ICP-Vortex
|
||
A very recent addition is a range of controllers from ICP-Vortex
|
||
<http://www.icp-vortex.com> featuring up to 5 independent
|
||
channels and very fast hardware based on the i960 chip. The
|
||
Linux driver was written by the company itself which shows they
|
||
support Linux.
|
||
|
||
As ICP-Vortex supplies the maintenance software for Linux it is
not necessary to reboot into another operating system for the
setup and maintenance of your RAID system. This also saves you
extra downtime.
|
||
|
||
|
||
|
||
Mylex DAC-960
|
||
This is one of the latest entries which is out in early beta.
|
||
More information as well as drivers are available at Dandelion
|
||
Digital's Linux DAC960 Page
|
||
<http://www.dandelion.com/Linux/DAC960.html>.
|
||
|
||
|
||
|
||
Compaq Smart-2 PCI Disk Array Controllers
|
||
Another very recent entry and currently in beta release is the
|
||
Smart-2 <http://www.insync.net/~frantzc/cpqarray.html> driver.
|
||
|
||
|
||
IBM ServeRAID
|
||
IBM has released their driver
|
||
<http://www.developer.ibm.com/welcome/netfinity/serveraid_beta.html>
|
||
as GPL.
|
||
|
||
|
||
|
||
6.1.3. Software RAID
|
||
|
||
A number of operating systems offer software RAID using ordinary disks
and controllers. Cost is low and performance for raw disk IO can be
very high. As this can be very CPU intensive it increases the load
noticeably, so if the machine is CPU bound rather than IO bound you
might be better off with a hardware PCI-to-RAID controller.
|
||
|
||
The real cost, performance and especially reliability of software vs.
hardware RAID is a very controversial topic. Reliability on Linux
systems has been very good so far.
|
||
|
||
The current software RAID project on Linux is the md system (multiple
devices) which offers much more than RAID so it is described in more
detail later.
|
||
|
||
|
||
|
||
6.1.4. RAID Levels
|
||
|
||
RAID comes in many levels and flavours of which I will give only a
brief overview here. Much has been written about it and the
interested reader is recommended to read more about this in the
Software RAID HOWTO <http://ostenfeld.dk/~jakob/Software-RAID.HOWTO/>.
|
||
|
||
o RAID 0 is not redundant at all but offers the best throughput of
  all levels here. Data is striped across a number of drives so read
  and write operations take place in parallel across all drives. On
  the other hand if a single drive fails then everything is lost. Did
  I mention backups?

o RAID 1 is the most primitive method of obtaining redundancy, by
  duplicating data across all drives. Naturally this is massively
  wasteful but you get one substantial advantage which is fast
  access: the drive that reaches the data first wins. Transfers are
  not any faster than for a single drive, even though you might get
  somewhat faster read transfers by reading a different track from
  each drive.

  Also if you have only 2 drives this is the only method of achieving
  redundancy.

o RAID 2 and 4 are not so common and are not covered here.

o RAID 3 uses a number of disks (at least 2) to store data in a
  striped RAID 0 fashion. It also uses an additional redundancy disk
  to store the XOR sum of the data from the data disks. Should the
  redundancy disk fail, the system can continue to operate as if
  nothing happened. Should any single data disk fail the system can
  compute the data on this disk from the information on the
  redundancy disk and all remaining disks. Any double fault will
  bring the whole RAID set off-line.

  RAID 3 makes sense only with at least 2 data disks (3 disks
  including the redundancy disk). Theoretically there is no limit for
  the number of disks in the set, but the probability of a fault
  increases with the number of disks in the RAID set. Usually the
  upper limit is 5 to 7 disks in a single RAID set.

  Since RAID 3 stores all redundancy information on a dedicated disk
  and since this information has to be updated whenever a write to
  any data disk occurs, the overall write speed of a RAID 3 set is
  limited by the write speed of the redundancy disk. This, too, is a
  limit for the number of disks in a RAID set. The overall read speed
  of a RAID 3 set with all data disks up and running is that of a
  RAID 0 set with that number of data disks. If the set has to
  reconstruct data stored on a failed disk from redundant
  information, the performance will be severely limited: all
  remaining disks in the set have to be read and XOR-ed to compute
  the missing information.

o RAID 5 is just like RAID 3, but the redundancy information is
  spread over all disks of the RAID set. This improves write
  performance, because load is distributed more evenly between all
  available disks.
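
The XOR redundancy used by RAID 3 and RAID 5 can be illustrated with
a few lines of Python. This is only a sketch of the principle and not
how any controller or the md driver is implemented; the block contents
and the helper name xor_blocks are made up for the illustration.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Three hypothetical data disks holding one block each (made-up contents).
data = [b"disk0 blk", b"disk1 blk", b"disk2 blk"]
parity = xor_blocks(data)  # what the redundancy disk would store

# Suppose data disk 1 fails: XOR the surviving blocks with the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

Note that the rebuild has to read every surviving disk, which is why
a degraded set is so much slower than a healthy one.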
|
||
|
||
There are also hybrids available based on RAID 0 or 1 and one other
|
||
level. Many combinations are possible but I have only seen a few
|
||
referred to. These are more complex than the above mentioned RAID
|
||
levels.
|
||
|
||
RAID 0/1 combines striping with duplication which gives very high
|
||
transfers combined with fast seeks as well as redundancy. The
|
||
disadvantage is high disk consumption as well as the above mentioned
|
||
complexity.
|
||
|
||
RAID 1/5 combines the speed and redundancy benefits of RAID 5 with the
fast seek of RAID 1. Redundancy is improved compared to RAID 0/1 but
disk consumption is still substantial. Implementing such a system
would typically involve more than 6 drives, perhaps even several
controllers or SCSI channels.
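
To get a feel for the disk consumption of the various levels, the
following Python sketch counts how many disks' worth of space remain
usable. It assumes that all disks are of equal size and that RAID 0/1
is built as two mirrored halves; the function name usable_disks is
made up for this example and RAID 1/5 is left out since its layout
varies between implementations.

```python
def usable_disks(level, n):
    """Return how many disks' worth of space an n-disk set can hold."""
    if level == "0":
        return n                    # pure striping, nothing is spent
    if level == "1":
        return 1                    # every drive holds the same data
    if level in ("3", "5"):
        if n < 3:
            raise ValueError("RAID 3 and 5 need at least 3 disks")
        return n - 1                # one disk's worth of XOR information
    if level == "0/1":
        if n < 4 or n % 2:
            raise ValueError("RAID 0/1 needs an even number of disks, 4 or more")
        return n // 2               # half the disks mirror the other half
    raise ValueError("level not covered by this sketch")


for level, n in (("0", 4), ("1", 2), ("3", 4), ("5", 4), ("0/1", 4)):
    print("RAID", level, "with", n, "disks:", usable_disks(level, n), "usable")
```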
|
||
|
||
|
||
|
||
6.2. Volume Management
|
||
|
||
Volume management is a way of overcoming the constraints of fixed
sized partitions and disks while still having control of where the
various parts of the file space reside. With such a system you can
add new disks to your system and add space from these drives to the
parts of the file space where it is needed, as well as migrate data
away from a disk developing faults to other drives before
catastrophic failure occurs.
|
||
|
||
The system developed by Veritas <http://www.veritas.com> has become
the de facto standard for logical volume management.
|
||
|
||
Volume management is for the time being an area where Linux is
lacking, though projects exist to address this.

One is the virtual partition system project VPS
<http://www-wsg.cso.uiuc.edu/~roth/> that will reimplement many of the
volume management functions found in IBM's AIX system. Unfortunately
this project is currently on hold.

Another is the Logical Volume Manager <http://www.sistina.com/lvm/>
project that is similar to a project by HP.
|
||
|
||
|
||
|
||
6.3. Linux md Kernel Patch
|
||
|
||
The Linux Multi Disk (md) provides a number of block level features in
|
||
various stages of development.
|
||
|
||
RAID 0 (striping) and concatenation are very solid and of production
quality, and RAID 4 and 5 are also quite mature.
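
The difference between striping and concatenation lies in how a
logical block number is mapped onto the member drives. The Python
sketch below shows the idea only; it assumes equally sized drives for
striping and is not the mapping code used by md. The function names
and the chunk parameter are my own for the illustration.

```python
def striped(block, n_disks, chunk=1):
    """Map a logical block to (disk, block on that disk) for striping."""
    stripe, within = divmod(block, chunk)
    return stripe % n_disks, (stripe // n_disks) * chunk + within


def concatenated(block, disk_sizes):
    """Map a logical block to (disk, block on that disk) for concatenation."""
    for disk, size in enumerate(disk_sizes):
        if block < size:
            return disk, block
        block -= size
    raise ValueError("block lies beyond the end of the volume")


# Consecutive blocks alternate between drives when striped...
print([striped(b, 2) for b in range(4)])
# ...but fill up the first drive before touching the next when concatenated.
print([concatenated(b, [2, 2]) for b in range(4)])
```

With striping a long sequential transfer keeps all drives busy at
once, while concatenation only gives you more space.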
|
||
|
||
It is also possible to stack some levels, for instance mirroring (RAID
|
||
1) two pairs of drives, each pair set up as striped disks (RAID 0),
|
||
which offers the speed of RAID 0 combined with the reliability of RAID
|
||
1.
|
||
|
||
In addition to RAID this system offers (in alpha stage) block level
|
||
volume management and soon also translucent file space. Since this is
|
||
done on the block level it can be used in combination with any file
|
||
system, even for fat using Wine.
|
||
|
||
Think very carefully about which drives you combine so you can operate
all drives in parallel, which gives you better performance and less
wear. Read more about this in the documentation that comes with md.
|
||
|
||
|
||
Unfortunately the Linux software RAID has split into two trees: the
old stable versions 0.35 and 0.42, which are documented in the
official Software-RAID HOWTO
<http://linas.org/linux/Software-RAID/Software-RAID.html>, and the
newer less stable 0.90 series, which is documented in the unofficial
Software RAID HOWTO <http://ostenfeld.dk/~jakob/Software-RAID.HOWTO/>
which is a work in progress.
|
||
|
||
A patch for online growth of ext2fs
<http://www-mddsp.enel.ucalgary.ca/People/adilger/online-ext2/> is
available in early stages and related work is taking place at the
ext2fs resize project <http://ext2resize.sourceforge.net/> at
Sourceforge.
|
||
|
||
|
||
|
||
Hint: if you cannot get it to work properly you have forgotten to set
|
||
the persistent-block flag. Your best documentation is currently the
|
||
source code.
|
||
|
||
|
||
|
||
6.4. Compression
|
||
|
||
Disk compression versus file compression is a hotly debated topic,
especially regarding the added danger of file corruption. Nevertheless
there are several options available for adventurous administrators.
These take on many forms, from kernel modules and patches to extra
libraries, but note that most suffer various forms of limitations such
as being read-only. As development takes place at breakneck speed the
specs have undoubtedly changed by the time you read this. As always:
check the latest updates yourself. Here only a few references are
given.
|
||
|
||
o DouBle features file compression with some limitations.

o Zlibc adds transparent on-the-fly decompression of files as they
  load.

o There are many modules available for reading compressed files or
  partitions that are native to various other operating systems
  though currently most of these are read-only.

o dmsdos
  <http://bf9nt.uni-duisburg.de/mitarbeiter/gockel/software/dmsdos/>
  (currently in version 0.9.2.0) offers many of the compression
  options available for DOS and Windows. It is not yet complete but
  work is ongoing and new features are added regularly.

o e2compr is a package that extends ext2fs with compression
  capabilities. It is still under testing and will therefore mainly
  be of interest for kernel hackers but should soon gain stability
  for wider use. Check the e2compr homepage
  <http://e2compr.memalpha.cx/e2compr/> for more information. I have
  reports of speed and good stability which is why it is mentioned
  here.
|
||
|
||
|
||
|
||
6.5. ACL
|
||
|
||
Access Control List (ACL) offers finer control over file access on a
|
||
user by user basis, rather than the traditional owner, group and
|
||
others, as seen in directory listings (drwxr-xr-x). This is currently
|
||
not available in Linux but is expected in kernel 2.3 as hooks are
|
||
already in place in ext2fs.
|
||
|
||
|
||
|
||
6.6. cachefs
|
||
|
||
This uses part of a hard disk to cache slower media such as CD-ROM.
|
||
It is available under SunOS but not yet for Linux.
|
||
|
||
|
||
|
||
6.7. Translucent or Inheriting File Systems
|
||
|
||
This is a copy-on-write system where writes go to a different system
|
||
than the original source while making it look like an ordinary file
|
||
space. Thus the file space inherits the original data and the
|
||
translucent write back buffer can be private to each user.
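
The principle can be sketched in a few lines of Python, with plain
dictionaries standing in for the file spaces. This is a toy model
under my own assumptions (the class name TranslucentSpace and the
paths are invented) and says nothing about how a real translucent
file system is implemented.

```python
class TranslucentSpace:
    """A private write layer on top of a shared, read-only original."""

    def __init__(self, original):
        self._original = original   # shared by all users, never modified
        self._private = {}          # this user's write-back buffer

    def read(self, path):
        # The private layer wins; otherwise the original shines through.
        if path in self._private:
            return self._private[path]
        return self._original[path]

    def write(self, path, data):
        # Writes never reach the original source.
        self._private[path] = data


original = {"/etc/motd": "welcome"}
alice = TranslucentSpace(original)
bob = TranslucentSpace(original)

alice.write("/etc/motd", "alice was here")
print(alice.read("/etc/motd"))   # Alice sees her own change
print(bob.read("/etc/motd"))     # Bob still sees the original
```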
|
||
|
||
There are a number of applications:

o updating a live file system on CD-ROM, making it flexible and fast
  while also conserving space,

o original skeleton files for each new user, saving space since the
  original data is kept in a single space and shared out,

o parallel project development prototyping where every user can
  seemingly modify the system globally while not affecting other
  users.
|
||
|
||
SunOS offers this feature and this is under development for Linux.
|
||
There was an old project called the Inheriting File Systems (ifs) but
|
||
this project has stopped. One current project is part of the md
|
||
system and offers block level translucence so it can be applied to any
|
||
file system.
|
||
|
||
Sun has an informative page <http://www.sun.ca/white-papers/tfs.html>
on translucent file systems.
|
||
|
||
It should be noted that Clearcase (now owned by Rational)
|
||
<http://www.rational.com> pioneered and popularized translucent
|
||
filesystems for software configuration management by writing their own
|
||
UNIX filesystem.
|
||
|
||
|
||
|
||
6.8. Physical Track Positioning
|
||
|
||
This trick used to be very important when drives were slow and small,
and some file systems used to take the varying characteristics into
account when placing files. Higher overall speed, on board drive and
controller caches and intelligence have since reduced the effect of
this.

Nevertheless there is still a little to be gained even today. As we
know, "world dominance" is soon within reach but to achieve this
"fast" we need to employ all the tricks we can use.
|
||
|
||
To understand the strategy we need to recall this near ancient piece
|
||
of knowledge and the properties of the various track locations. This
|
||
is based on the fact that transfer speeds generally increase for
|
||
tracks further away from the spindle, as well as the fact that it is
|
||
faster to seek to or from the central tracks than to or from the inner
|
||
or outer tracks.
|
||
|
||
Most drives use disks running at constant angular velocity but use
(fairly) constant data density across all tracks. This means that you
will get much higher transfer rates on the outer tracks than on the
inner tracks; a characteristic which fits the requirements for large
libraries well.
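
With constant angular velocity and constant data density the raw
transfer rate is simply proportional to the track radius, as the
following Python sketch shows. The radii, the speed and the density
are numbers picked for the illustration and do not describe any
particular drive.

```python
import math


def raw_transfer_rate(radius_mm, rpm, bits_per_mm):
    """Bits per second passing under the head on a track of given radius."""
    track_length_mm = 2 * math.pi * radius_mm
    revolutions_per_second = rpm / 60
    return track_length_mm * bits_per_mm * revolutions_per_second


# Hypothetical platter: innermost track at 20 mm, outermost at 45 mm.
inner = raw_transfer_rate(20, 5400, 1000)
outer = raw_transfer_rate(45, 5400, 1000)
print("outer / inner transfer ratio:", round(outer / inner, 2))
```

The ratio depends only on the two radii, so a faster spindle or a
higher density raises both figures but not the difference between
inner and outer tracks.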
|
||
|
||
Newer disks present a logical geometry which differs from the actual
physical layout; the drive itself maps between the two transparently.
This makes the estimation of the "middle" tracks a little harder.
|
||
|
||
In most cases track 0 is at the outermost track and this is the
|
||
general assumption most people use. Still, it should be kept in mind
|
||
that there are no guarantees this is so.
|
||
|
||
|
||
|
||
Inner
   tracks are usually slow in transfer, and lying at one end of the
   seek range they are also slow to seek to.

   This is more suitable for the low end directories such as DOS,
   root and print spools.

Middle
   tracks are on average faster with respect to transfers than
   inner tracks and, being in the middle, also on average faster to
   seek to.

   These characteristics are ideal for the most demanding parts such
   as swap, /tmp and /var/tmp.

Outer
   tracks have on average even faster transfer characteristics but
   like the inner tracks are at one end of the seek range, so
   statistically they are as slow to seek to as the inner tracks.

   Large files such as libraries would benefit from a place here.
|
||
|
||
|
||
Hence seek time reduction can be achieved by positioning frequently
|
||
accessed tracks in the middle so that the average seek distance and
|
||
therefore the seek time is short. This can be done either by using
|
||
fdisk or cfdisk to make a partition on the middle tracks or by first
|
||
making a file (using dd) equal to half the size of the entire disk
|
||
before creating the files that are frequently accessed, after which
|
||
the dummy file can be deleted. Both cases assume starting from an
|
||
empty disk.
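
The gain from using the middle tracks can be estimated with a small
Python sketch. It assumes that the head starts from a uniformly random
track and that seek time grows with distance, both of which are
simplifications; the track count and the function name are made up.

```python
from fractions import Fraction


def mean_seek_to(target, tracks):
    """Average distance in tracks from a random start track to target."""
    total = sum(abs(start - target) for start in range(tracks))
    return Fraction(total, tracks)


tracks = 1001                       # hypothetical drive, tracks 0 to 1000
edge = mean_seek_to(0, tracks)
middle = mean_seek_to(tracks // 2, tracks)
print("to an edge track:  ", float(edge))
print("to the middle track:", float(middle))
```

Under these assumptions data on the middle track is on average about
half as far away as data on an edge track.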
|
||
|
||
The latter trick is suitable for news spools where the empty directory
structure can be placed in the middle before putting in the data
files. This also helps reduce fragmentation a little.
|
||
|
||
This little trick can be used on ordinary drives as well as on RAID
systems. In the latter case the calculation for centring the tracks
will be different, if it is possible at all. Consult the latest RAID
manual.
|
||
|
||
The speed difference this makes depends on the drives, but a 50
|
||
percent improvement is a typical value.
|
||
|
||
|
||
6.8.1. Disk Speed Values
|
||
|
||
The same mechanical head disk assembly (HDA) is often available with a
number of interfaces (IDE, SCSI etc) and the mechanical parameters are
therefore often comparable. The mechanics are today often the limiting
factor but development is improving things steadily. There are two
main parameters, usually quoted in milliseconds (ms):
|
||
|
||
o Head movement - the speed at which the read-write head is able to
  move from one track to another, called access time. If you do the
  mathematics and doubly integrate the seek first across all possible
  starting tracks and then across all possible target tracks you will
  find that this is equivalent to a stroke across a third of all
  tracks.

o Rotational speed - which determines the time taken to get to the
  right sector, called latency.
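
The "third of all tracks" figure can be checked by brute force. The
Python sketch below sums the seek distance over every pair of start
and target tracks, assuming both are picked uniformly at random; the
function name and the track counts are chosen for the illustration.

```python
from fractions import Fraction


def mean_seek_fraction(tracks):
    """Mean seek distance as a fraction of the full stroke."""
    total = sum(abs(start - target)
                for start in range(tracks)
                for target in range(tracks))
    pairs = tracks * tracks
    full_stroke = tracks - 1
    return Fraction(total, pairs * full_stroke)


for tracks in (10, 100, 500):
    print(tracks, "tracks:", float(mean_seek_fraction(tracks)))
```

The exact value works out to (N + 1) / (3 N) for N tracks, which
approaches one third as the number of tracks grows.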
|
||
|
||
After voice coils replaced stepper motors for the head movement the
|
||
improvements seem to have levelled off and more energy is now spent
|
||
(literally) at improving rotational speed. This has the secondary
|
||
benefit of also improving transfer rates.
|
||
|
||
Some typical values:
|
||
|
||
|
||
|
||
                   |         Drive type
Access time (ms)   |  Fast    Typical    Old
---------------------------------------------
Track-to-track     |   <1        2         8
Average seek       |   10       15        30
End-to-end         |   10       30        70
|
||
|
||
|
||
|
||
This shows that the very high end drives offer only marginally better
access times than the average drives but that the old drives based on
stepper motors are significantly worse.
|
||
|
||
|
||
|
||
Rotational speed (RPM) | 3600 | 4500 | 4800 | 5400 | 7200 | 10000
-------------------------------------------------------------------
Latency (ms)           |   17 |   13 | 12.5 | 11.1 |  8.3 |   6.0

The latency quoted here is the time for one full revolution, that is
the longest wait for a given sector, so the formula is quite simply

latency (ms) = 60000 / speed (RPM)

On average the wanted sector is half a revolution away, so the average
latency is half of these values.
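
The figures are easy to reproduce in Python. Note that 60000 / RPM is
the time for one complete revolution; the average wait for a sector
is half a revolution. The function names are my own.

```python
def rotation_time_ms(rpm):
    """Time for one full revolution, the worst case wait for a sector."""
    return 60000 / rpm


def average_latency_ms(rpm):
    """On average the wanted sector is half a revolution away."""
    return rotation_time_ms(rpm) / 2


for rpm in (3600, 4500, 4800, 5400, 7200, 10000):
    print(rpm, "RPM:",
          round(rotation_time_ms(rpm), 1), "ms per revolution,",
          round(average_latency_ms(rpm), 1), "ms average latency")
```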
|
||
|
||
|
||
|
||
Clearly this too is an example of diminishing returns for the efforts
|
||
put into development. However, what really takes off here is the power
|
||
consumption, heat and noise.
|
||
|
||
|
||
|
||
6.9. Yoke
|
||
|
||
There is also a Linux Yoke Driver
<http://www.it.uc3m.es/cgi-bin/ptb/cvs-yoke.cgi> available in beta
which is intended to do hot-swappable transparent binding of one Linux
block device to another. This means that if you bind two block
devices together, say /dev/hda and /dev/loop0, writing to one device
will mean also writing to the other and reading from either will yield
the same result.
|
||
|
||
|
||
|
||
6.10. Stacking
|
||
|
||
One of the advantages of a layered design of an operating system is
|
||
that you have the flexibility to put the pieces together in a number
|
||
of ways. For instance you can cache a CD-ROM with cachefs that is a
|
||
volume striped over 2 drives. This in turn can be set up translucently
|
||
with a volume that is NFS mounted from another machine. RAID can be
|
||
stacked in several layers to offer very fast seek and transfer in such
|
||
a way that it will work if even 3 drives fail. The choices are many,
|
||
limited only by imagination and, probably more importantly, money.
|
||
|
||
|
||
|
||
6.11. Recommendations
|
||
|
||
There is a near infinite number of combinations available but my
recommendation is to start off with a simple setup without any fancy
add-ons. Get a feel for what is needed, where the maximum performance
is required, whether it is access time or transfer speed that is the
bottleneck, and so on. Then phase in each component in turn. As you
can stack quite freely you should be able to retrofit most components
as time goes by with relatively few difficulties.
|
||
|
||
RAID is usually a good idea but make sure you have a thorough grasp of
the technology and a solid backup system.
|
||
|
||
|
||
|
||
7. Other Operating Systems
|
||
|
||
Many Linux users have several operating systems installed, often
|
||
necessitated by hardware setup systems that run under other operating
|
||
systems, typically DOS or some flavour of Windows. A small section on
|
||
how best to deal with this is therefore included here.
|
||
|
||
|
||
7.1. DOS
|
||
|
||
Leaving aside the debate on whether or not DOS qualifies as an
operating system one can in general say that it has little
sophistication with respect to disk operations. The more important
result of this is that there can be severe difficulties in running
various versions of DOS on large drives, and you are therefore
strongly recommended to read the Large Drives mini-HOWTO. One
effect is that you are often better off placing DOS on low track
numbers.
|
||
|
||
Having been designed for small drives it has a rather unsophisticated
|
||
file system (fat) which when used on large drives will allocate
|
||
enormous block sizes. It is also prone to block fragmentation which
|
||
will after a while cause excessive seeks and slow effective transfers.
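
To see why the block sizes grow so large, here is a Python sketch for
a 16 bit FAT. It assumes a limit of about 65524 clusters per
partition, 512 byte sectors and cluster sizes that are powers of two;
actual format programs may choose differently, especially on small
partitions, and the function names are made up for this example.

```python
MAX_FAT16_CLUSTERS = 65524   # assumed limit for a 16 bit FAT
SECTOR_BYTES = 512


def fat16_cluster_size(partition_bytes):
    """Smallest power of two cluster size that covers the partition."""
    size = SECTOR_BYTES
    while partition_bytes / size > MAX_FAT16_CLUSTERS:
        size *= 2
    return size


def wasted_bytes(file_bytes, cluster_bytes):
    """Slack left in the last cluster allocated to a file."""
    return -file_bytes % cluster_bytes


MB = 2 ** 20
for megabytes in (100, 500, 1500):
    cluster = fat16_cluster_size(megabytes * MB)
    print(megabytes, "MB partition:", cluster, "byte clusters,",
          wasted_bytes(100, cluster), "bytes wasted on a 100 byte file")
```

Since every file occupies a whole number of clusters, a partition
full of small files can lose a large part of its space this way.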
|
||
|
||
One solution to this is to use a defragmentation program regularly but
|
||
it is strongly recommended to back up data and verify the disk before
|
||
defragmenting. All versions of DOS have chkdsk that can do some disk
|
||
checking, newer versions also have scandisk which is somewhat better.
|
||
There are many defragmentation programs available, some versions have
|
||
one called defrag. Norton Utilities have a large suite of disk tools
|
||
and there are many others available too.
|
||
|
||
As always there are snags, and this particular snake in our drive
paradise is called hidden files. Some vendors started to use these for
copy protection schemes, and such files would not take kindly to being
moved to a different place on the drive, even if they remained in the
same place in the directory structure. The result of this was that
newer defragmentation programs will not touch any hidden file, which
in turn reduces the effect of defragmentation.
|
||
|
||
As DOS is a single tasking, single threading and single most other
things operating system there is very little to be gained from using
multiple drives unless you use a drive controller with built in RAID
support of some kind.

There are a few utilities called join and subst which can do some
multiple drive configuration but there is very little gain for a lot
of work. Some of these commands have been removed in newer versions.
|
||
|
||
In the end there is very little you can do, but not all hope is lost.
|
||
Many programs need fast, temporary storage, and the better behaved
|
||
ones will look for environment variables called TMPDIR or TEMPDIR
|
||
which you can set to point to another drive. This is often best done
|
||
in autoexec.bat.
|
||
|
||
|
||
______________________________________________________________________
|
||
SET TMPDIR=E:/TMP
|
||
SET TEMPDIR=E:/TEMP
|
||
______________________________________________________________________
|
||
|
||
|
||
|
||
Not only will this possibly gain you some speed but also it can reduce
|
||
fragmentation.
|
||
|
||
There have been reports about difficulties in removing multiple
|
||
primary partitions using the fdisk program that comes with DOS. Should
|
||
this happen you can instead use a Linux rescue disk with Linux fdisk
|
||
to repair the system.
|
||
|
||
Don't forget there are other alternatives to DOS, the most well known
|
||
being DR-DOS <http://www.caldera.com/dos/> from Caldera
|
||
<http://www.caldera.com/>. This is a direct descendant from DR-DOS
|
||
from Digital Research. It offers many features not found in the more
|
||
common DOS, such as multi tasking and long filenames.
|
||
|
||
Another alternative which also is free is Free DOS
|
||
<http://www.freedos.org/> which is a project under development. A
|
||
number of free utilities are also available.
|
||
|
||
7.2. Windows
|
||
|
||
Most of the above points are valid for Windows too, with the exception
of Windows 95 which apparently has better disk handling and will get
better performance out of SCSI drives.

A useful thing is the introduction of long filenames. To read these
from Linux you will need the vfat file system for mounting these
partitions.
|
||
|
||
|
||
Disk fragmentation is still a problem. Some of this can be avoided by
|
||
doing a defragmentation immediately before and immediately after
|
||
installing large programs or systems. I use this scheme at work and
|
||
have found it to work quite well. Purging unused files and emptying
|
||
the waste basket first can improve defragmentation further.
|
||
|
||
Windows also uses swap space; redirecting this to another drive can
give you some performance gains. There are several mini-HOWTOs telling
you how best to share swap space between various operating systems.
|
||
|
||
|
||
|
||
The trick of setting TEMPDIR can still be used but not all programs
|
||
will honour this setting. Some do, though. To get a good overview of
|
||
the settings in the control files you can run sysedit which will open
|
||
a number of files for editing, one of which is the autoexec file where
|
||
you can add the TEMPDIR settings.
|
||
|
||
Many of the temporary files are located in the /windows/temp directory
and changing this is more tricky. To achieve this you can use regedit
which is rather powerful and quite capable of rendering your system in
a state you will not enjoy, or more precisely, in a state much less
enjoyable than Windows in general. "Registry database error" is a
message that means seriously bad news. Also you will see that many
programs have their own private temporary directories scattered around
the system.
|
||
|
||
Setting the swap file to a separate partition is a better idea and
|
||
much less risky. Keep in mind that this partition cannot be used for
|
||
anything else, even if there should appear to be space left there.
|
||
|
||
It is now possible to read ext2fs partitions from Windows, either by
|
||
mounting the partition using FSDEXT2 <http://www.yipton.demon.co.uk/>
|
||
or by using a file explorer like tool called Explore2fs
|
||
<http://uranus.it.swin.edu.au/~jn/linux/explore2fs.htm>.
|
||
|
||
|
||
|
||
7.3. OS/2
|
||
|
||
The only special note here is that you can get a file system driver
for OS/2 that can read an ext2fs partition. Matthieu Willm's ext2fs
Installable File System for OS/2 can be found at ftp-os2.nmsu.edu
<ftp://ftp-os2.nmsu.edu/pub/os2/system/drivers/filesys/ext2_240.zip>,
Sunsite
<ftp://sunsite.unc.edu/pub/Linux/system/filesystems/ext2/ext2_240.zip>,
ftp.leo.org
<ftp://ftp.leo.org/pub/comp/os/os2/drivers/ifs/ext2_240.zip> and
ftp-os2.cdrom.com
<ftp://ftp-os2.cdrom.com/pub/os2/diskutil/ext2_240.zip>.
|
||
|
||
The IFS has read and write capabilities.
|
||
|
||
|
||
|
||
7.4. NT
|
||
|
||
This is a more serious system featuring most buzzwords known to
marketing. It is well worth noting that it features software striping
and other more sophisticated setups. Check out the drive manager in
the control panel. I do not have easy access to NT so more details on
this can take a bit of time.
|
||
|
||
One important snag was recently reported by acahalan at cs.uml.edu :
|
||
(reformatted from a Usenet News posting)
|
||
|
||
NT DiskManager has a serious bug that can corrupt your disk when you
|
||
have several (more than one?) extended partitions. Microsoft provides
|
||
an emergency fix program at their web site. See the knowledge base
|
||
<http://www.microsoft.com/kb/> for more. (This affects Linux users,
|
||
because Linux users have extra partitions)
|
||
|
||
You can now read ext2fs partitions from NT using Explore2fs
|
||
<http://uranus.it.swin.edu.au/~jn/linux/explore2fs.htm>.
|
||
|
||
|
||
|
||
7.5. Windows 2000
|
||
|
||
Most points regarding Windows NT also apply to its descendant
Windows 2000, though at the time of writing I do not know if the
aforementioned bugs have been fixed or not.
|
||
|
||
While Windows 2000, like its predecessor, features RAID, at least one
|
||
company, RAID Toolbox <http://www.raidtoolbox.com/>, has found the
|
||
bundled RAID somewhat lacking and made their own commercial
|
||
alternative.
|
||
|
||
|
||
|
||
7.6. Sun OS
|
||
|
||
There is a little bit of confusion in this area between Sun OS and
Solaris. Strictly speaking Solaris is just Sun OS 5.x packaged with
Openwindows and a few other things. If you run Solaris, just type
uname -a to see your version. Part of the reason for this confusion
is that Sun Microsystems used to use an OS from the BSD family,
albeit with a few bits and pieces from elsewhere as well as things
made by themselves. This was the situation up to Sun OS 4.x.y when
they made a "strategic roadmap decision" and decided to switch over to
the official Unix, System V, Release 4 (aka SVR4), and Sun OS 5 was
created. This made a lot of people unhappy. Also this was bundled
with other things and marketed under the name Solaris, which currently
stands at release 7 which just recently replaced version 2.6 as the
latest and greatest. In spite of the large jump in version number
this is actually a minor technical upgrade but a giant leap for
marketing.
|
||
|
||
|
||
|
||
7.6.1. Sun OS 4
|
||
|
||
This is quite familiar to most Linux users. The last release is 4.1.4
plus various patches. Note however that the file system structure is
quite different and does not conform to FSSTND so any planning must be
based on the traditional structure. You can get some information from
the man page on this: man hier. This is, like most man pages, rather
brief but should give you a good start. If you are still confused by
the structure it will at least be at a higher level.
|
||
|
||
7.6.2. Sun OS 5 (aka Solaris)
|
||
|
||
This comes with a snazzy installation system that runs under
Openwindows. It will help you in partitioning and formatting the
drives before installing the system from CD-ROM. It will also fail if
your drive setup is too far out, and as it takes a complete
installation run from a full CD-ROM in a 1x only drive this failure
will dawn on you only after far too long a time. That is the
experience we had where I used to work. Instead we installed
everything onto one drive and then moved directories across.
|
||
|
||
The default settings are sensible for most things, yet there remains a
|
||
little oddity: swap drives. Even though the official manual recommends
|
||
multiple swap drives (which are used in a similar fashion as on Linux)
|
||
the default is to use only a single drive. It is recommended to change
|
||
this as soon as possible.
|
||
|
||
Sun OS 5 also offers a file system especially designed for temporary
files, tmpfs. It offers significant speed improvements over ufs but
its contents do not survive rebooting.
|
||
|
||
|
||
The only comment so far is: beware! Under Solaris 2.0 it seems that
creating too big files in /tmp can cause an out of swap space kernel
panic trap. As the evidence of what has happened is as lost as any
data on a RAMdisk after powering down it can be hard to find out what
has happened. What is worse, it seems that user space processes can
cause this kernel panic and unless this problem is taken care of it is
best not to use tmpfs in potentially hostile environments.
|
||
|
||
Also see the notes on ``tmpfs''.
|
||
|
||
Trivia: There is a movie also called Solaris, a science fiction movie
|
||
that is very, very long, slow and incomprehensible. This was often
|
||
pointed out at the time Solaris (the OS) appeared...
|
||
|
||
|
||
|
||
7.7. BeOS
|
||
|
||
This operating system is one of the more recent ones to arrive and it
features a file system that has some database like features.

There is a BFS file system driver being developed for Linux which is
available in alpha stage. For more information check the Linux BFS
page <http://hp.vector.co.jp/authors/VA008030/bfs/> where patches are
also available.
|
||
|
||
|
||
|
||
8. Clusters
|
||
|
||
In this section I will briefly touch on the ways machines can be
|
||
connected together but this is so big a topic it could be a separate
|
||
HOWTO in its own right, hint, hint. Also, strictly speaking, this
|
||
section lies outside the scope of this HOWTO, so if you feel like
|
||
getting fame etc. you could contact me and take over this part and
|
||
turn it into a new document.
|
||
|
||
These days computers get outdated at an incredible rate. There is
|
||
however no reason why old hardware could not be put to good use with
|
||
Linux. Using an old and otherwise outdated computer as a network
|
||
server can be both useful in its own right as well as a valuable
|
||
educational exercise. Such a local networked cluster of computers can
|
||
take on many forms but to remain within the charter of this HOWTO I
|
||
will limit myself to the disk strategies. Nevertheless I would hope
|
||
someone else could take on this topic and turn it into a document on
|
||
its own.
|
||
|
||
This is an exciting area of activity today, and many forms of
clustering are available, ranging from automatic workload balancing
over a local network to more exotic hardware such as Scalable Coherent
Interface (SCI) which gives a tight integration of machines,
effectively turning them into a single machine. Various kinds of
clustering have been available for larger machines for some time and
the VAXcluster is perhaps a well known example of this. Clustering is
usually done in order to share resources such as disk drives, printers
and terminals, but also to share processing resources equally
transparently between the computational nodes.
|
||
|
||
There is no universal definition of clustering; here it is taken to
mean a network of machines that combine their resources to serve
users. Admittedly this is a rather loose definition but this will
change later.
|
||
|
||
These days Linux also offers some clustering features but for a start
I will just describe a simple local network. It is a good way of
putting old and otherwise unusable hardware to good use, as long as
it can run Linux or something similar.
|
||
|
||
One of the best ways of using an old machine is as a network server in
|
||
which case the effective speed is more likely to be limited by network
|
||
bandwidth rather than pure computational performance. For home use you
|
||
can move the following functionality off to an older machine used as a
|
||
server:
|
||
|
||
o  news

o  mail

o  web proxy

o  printer server

o  modem server (PPP, SLIP, FAX, Voice mail)
|
||
|
||
You can also NFS mount drives from the server onto your workstation
|
||
thereby reducing drive space requirements. Still, read the FSSTND to
|
||
see what directories should not be exported. The best candidates for
|
||
exporting to all machines are /usr and /var/spool and possibly
|
||
/usr/local but probably not /var/spool/lpd.
|
||
|
||
Most of the time even slow disks will deliver sufficient performance.
|
||
On the other hand, if you do processing directly on the disks on the
|
||
server or have very fast networking, you might want to rethink your
|
||
strategy and use faster drives. Searching features on a web server or
|
||
news database searches are two examples of this.
|
||
|
||
Such a network can be an excellent way of learning system
|
||
administration and building up your own toaster network, as it often
|
||
is called. You can get more information on this in other HOWTOs but
|
||
there are two important things you should keep in mind:
|
||
|
||
o  Do not pull IP numbers out of thin air. Configure your inside net
   using IP numbers reserved for private use, and use your network
   server as a router that handles IP masquerading.

o  Remember that if you additionally configure the router as a
   firewall you might not be able to get to your own data from the
   outside, depending on the firewall configuration.
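
The first point can be checked mechanically. The following is a
minimal Python sketch, not part of any existing tool, that tests
whether an address falls inside one of the three IPv4 ranges reserved
for private use (RFC 1918). The function name and the sample addresses
are made up for illustration.

```python
import ipaddress

# The three IPv4 ranges reserved for private networks (RFC 1918).
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private_address(address):
    """Return True if the IPv4 address lies in a range reserved for private use."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in PRIVATE_NETWORKS)


if __name__ == "__main__":
    # Hypothetical addresses for machines on the inside net.
    for candidate in ["192.168.1.10", "10.0.0.5", "172.20.3.4", "193.71.1.1"]:
        print(candidate, is_private_address(candidate))
```

Any address for which this returns False should not be assigned to a
machine on the inside net unless it has actually been allocated to you.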
|
||
|
||
The Nyx network provides an example of a cluster in the sense defined
|
||
here. It consists of the following machines:
|
||
|
||
nyx
|
||
is one of the two user login machines and also provides some of
|
||
the networking services.
|
||
|
||
nox
|
||
(aka nyx10) is the main user login machine and is also the mail
|
||
server.
|
||
|
||
noc
|
||
is a dedicated news server. The news spool is made accessible
|
||
through NFS mounting to nyx and nox.
|
||
|
||
arachne
|
||
(aka www) is the web server. Web pages are written by NFS
|
||
mounting onto nox.
|
||
|
||
There are also some more advanced clustering projects going, notably
|
||
|
||
o  The Beowulf Project <http://www.beowulf.org/>

o  The Genoa Active Message Machine (GAMMA)
   <http://www.disi.unige.it/project/gamma/>
|
||
|
||
|
||
High-tech clustering requires a high-tech interconnect, and SCI is one
such technology. To find out more you can either look up the home page
of Dolphin Interconnect Solutions <http://www.dolphinics.no/> which is
one of the main actors in this field, or you can have a look at scizzl
<http://www.scizzl.com/>.
|
||
|
||
|
||
Centralised mail servers using IMAP are becoming more and more popular
|
||
as disks become large enough to keep all mail stored indefinitely and
|
||
also cheap enough to make it a feasible option. Unfortunately it has
|
||
become clear that NFS mounting the mail archives from another machine
|
||
can cause corruption of the IMAP database as the server software does
|
||
not handle NFS timeouts too well, and NFS timeouts are a rather common
|
||
occurrence. Therefore keep the mail archive local to the IMAP server.
|
||
|
||
|
||
|
||
9. Mount Points
|
||
|
||
In designing the disk layout it is important not to split off the
|
||
directory tree structure at the wrong points, hence this section. As
|
||
it is highly dependent on the FSSTND it has been put aside in a
|
||
separate section, and will most likely have to be totally rewritten
|
||
when FHS is adopted in a Linux distribution. In the meantime this
|
||
will do.
|
||
|
||
Remember that this is a list of where a separation can take place, not
|
||
where it has to be. As always, good judgement is required.
|
||
|
||
Again only a rough indication can be given here. The values indicate
|
||
|
||
|
||
|
||
0=don't separate here
|
||
1=not recommended
|
||
...
|
||
4=useful
|
||
5=recommended
|
||
|
||
|
||
|
||
In order to keep the list short, the uninteresting parts are removed.
|
||
|
||
|
||
|
||
Directory       Suitability
/
|
+-bin               0
+-boot              5
+-dev               0
+-etc               0
+-home              5
+-lib               0
+-mnt               0
+-proc              0
+-root              0
+-sbin              0
+-tmp               5
+-usr               5
| \
| +-X11R6           3
| +-bin             3
| +-lib             4
| +-local           4
| | \
| | +-bin           2
| | +-lib           4
| +-src             3
|
+-var               5
  \
  +-adm             0
  +-lib             2
  +-lock            1
  +-log             0
  +-preserve        1
  +-run             1
  +-spool           4
  | \
  | +-mail          3
  | +-mqueue        3
  | +-news          5
  | +-smail         3
  | +-uucp          3
  +-tmp             5
|
||
|
||
|
||
|
||
There are of course plenty of adjustments possible, for instance a
home user would not bother with splitting off the /var/spool hierarchy
but a serious ISP should. The key here is usage.
|
||
|
||
QUIZ! Why should /etc never be on a separate partition? Answer:
Mounting instructions during boot are found in the file /etc/fstab so
if this is on a separate and unmounted partition it is like keeping
the key to a locked drawer inside that drawer, a hopeless situation.
(Yes, I'll do nearly anything to liven up this HOWTO.)
|
||
|
||
|
||
|
||
10. Considerations and Dimensioning
|
||
|
||
The starting point in this will be to consider where you are and what
|
||
you want to do. The typical home system starts out with existing
|
||
hardware and the newly converted Linux user will want to get the most
|
||
out of existing hardware. Someone setting up a new system for a
|
||
specific purpose (such as an Internet provider) will instead have to
|
||
consider what the goal is and buy accordingly. Being ambitious I will
|
||
try to cover the entire range.
|
||
|
||
Various purposes will also have different requirements regarding file
system placement on the drives. A large multiuser machine, for
example, would probably be best off with the /home directory on a
separate disk.
|
||
|
||
In general, for performance it is advantageous to split most things
|
||
over as many disks as possible but there is a limited number of
|
||
devices that can live on a SCSI bus and cost is naturally also a
|
||
factor. Equally important, file system maintenance becomes more
|
||
complicated as the number of partitions and physical drives increases.
|
||
|
||
|
||
10.1. Home Systems
|
||
|
||
With the cheap hardware available today it is possible to have quite a
|
||
big system at home that is still cheap, systems that rival major
|
||
servers of yesteryear. While many started out with old, discarded
|
||
disks to build a Linux server (which is how this HOWTO came into
|
||
existence), many can now afford to buy 40 GB disks up front.
|
||
|
||
Size remains important for some, and here are a few guidelines:
|
||
|
||
|
||
Testing
|
||
Linux is simple and you don't even need a hard disk to try it
|
||
out, if you can get the boot floppies to work you are likely to
|
||
get it to work on your hardware. If the standard kernel does not
|
||
work for you, do not forget that often there can be special boot
|
||
disk versions available for unusual hardware combinations that
|
||
can solve your initial problems until you can compile your own
|
||
kernel.
|
||
|
||
|
||
Learning
|
||
about operating systems is something Linux excels in, there is
|
||
plenty of documentation and the source is available. A single
|
||
drive with 50 MB is enough to get you started with a shell, a
|
||
few of the most frequently used commands and utilities.
|
||
|
||
|
||
Hobby
|
||
use or more serious learning requires more commands and
|
||
utilities but a single drive is still all it takes, 500 MB
|
||
should give you plenty of room, also for sources and
|
||
documentation.
|
||
|
||
|
||
|
||
Serious
|
||
software development or just serious hobby work requires even
more space. At this stage you probably have a mail and news feed
that requires spool files and plenty of space. Separate drives
for various tasks will begin to show a benefit. At this stage
you have probably already gotten hold of a few drives too. Drive
requirements get harder to estimate but I would expect 2-4 GB
to be plenty, even for a small server.
|
||
|
||
|
||
Servers
|
||
come in many flavours, ranging from mail servers to full sized
|
||
ISP servers. A base of 2 GB for the main system should be
|
||
sufficient, then add space and perhaps also drives for separate
|
||
features you will offer. Cost is the main limiting factor here
|
||
but be prepared to spend a bit if you wish to justify the "S" in
|
||
ISP. Admittedly, not all do it.
|
||
|
||
Basically a server is dimensioned like any machine for serious
|
||
use with added space for the services offered, and tends to be
|
||
IO bound rather than CPU bound.
|
||
|
||
With cheap networking technology both for land lines as well as
|
||
through radio nets, it is quite likely that very soon home users
|
||
will have their own servers more or less permanently hooked onto
|
||
the net.
|
||
|
||
|
||
|
||
10.2. Servers
|
||
|
||
Big tasks require big drives and a separate section here. If possible
|
||
keep as much as possible on separate drives. Some of the appendices
|
||
detail the setup of a small departmental server for 10-100 users. Here
|
||
I will present a few considerations for the higher end servers. In
|
||
general you should not be afraid of using RAID, not only because it is
|
||
fast and safe but also because it can make growth a little less
|
||
painful. All the notes below come as additions to the points mentioned
|
||
earlier.
|
||
|
||
Popular servers rarely just happen, rather they grow over time and
|
||
this demands both generous amounts of disk space as well as a good net
|
||
connection. In many of these cases it might be a good idea to reserve
|
||
entire SCSI drives, in singles or as arrays, for each task. This way
|
||
you can move the data should the computer fail. Note that transferring
|
||
drives across computers is not simple and might not always work,
|
||
especially in the case of IDE drives. Drive arrays require careful
|
||
setup in order to reconstruct the data correctly, so you might want to
|
||
keep a paper copy of your fstab file as well as a note of SCSI IDs.
|
||
|
||
|
||
10.2.1. Home Directories
|
||
|
||
Estimate how many drives you will need, if this is more than 2 I would
|
||
recommend RAID, strongly. If not you should separate users across your
|
||
drives dedicated to users based on some kind of simple hashing
|
||
algorithm. For instance you could use the first 2 letters in the user
|
||
name, so jbloggs is put on /u/j/b/jbloggs where /u/j is a symbolic
|
||
link to a physical drive so you can get a balanced load on your
|
||
drives.
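
As a rough illustration of the hashing idea, here is a small Python
sketch. The function names, the /u base directory default and the
drive mount points in the test are assumptions made for the example;
only the /u/j/b/jbloggs layout comes from the text above.

```python
def home_directory(username, base="/u"):
    """Build a home path from the first two letters of the user name."""
    if len(username) < 2:
        raise ValueError("user name needs at least two characters")
    first, second = username[0], username[1]
    return "/".join([base, first, second, username])


def assign_letters(letters, drives):
    """Spread the first-letter directories round-robin over the drives.

    The result says which drive each symbolic link /u/<letter> should
    point to.
    """
    return {letter: drives[i % len(drives)] for i, letter in enumerate(letters)}


if __name__ == "__main__":
    print(home_directory("jbloggs"))
```

Round-robin assignment is only a starting point: user names are not
evenly distributed over the alphabet, so you would still check the
actual load per drive and move individual links as needed.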
|
||
|
||
|
||
10.2.2. Anonymous FTP
|
||
|
||
This is an essential service if you are serious about service. Good
|
||
servers are well maintained, documented, kept up to date, and
|
||
immensely popular no matter where in the world they are located. The
|
||
big server ftp.funet.fi <ftp://ftp.funet.fi> is an excellent example
|
||
of this.
|
||
|
||
In general this is not a question of CPU but of network bandwidth.
|
||
Size is hard to estimate, mainly it is a question of ambition and
|
||
service attitudes. I believe the big archive at ftp.cdrom.com
|
||
<ftp://ftp.cdrom.com> is a *BSD machine with 50 GB disk. Also memory
|
||
is important for a dedicated FTP server, about 256 MB RAM would be
|
||
sufficient for a very big server, whereas smaller servers can get the
|
||
job done well with 64 MB RAM. Network connections would still be the
|
||
most important factor.
|
||
|
||
|
||
|
||
10.2.3. WWW
|
||
|
||
For many this is the main reason to get onto the Internet, in fact
|
||
many now seem to equate the two. In addition to being network
|
||
intensive there is also a fair bit of drive activity related to this,
|
||
mainly regarding the caches. Keeping the cache on a separate, fast
|
||
drive would be beneficial. Even better would be installing a caching
|
||
proxy server. This way you can reduce the cache size for each user and
|
||
speed up the service while at the same time cut down on the bandwidth
|
||
requirements.
|
||
|
||
With a caching proxy server you need a fast set of drives, RAID0 would
|
||
be ideal as reliability is not important here. Higher capacity is
|
||
better but about 2 GB should be sufficient for most. Remember to match
|
||
the cache period to the capacity and demand. Too long periods would on
|
||
the other hand be a disadvantage, if possible try to adjust based on
|
||
the URL. For more information check up on the most used servers such
|
||
as Harvest, Squid <http://www.squid-cache.org/> and the one from
|
||
Netscape <http://www.netscape.com>.
|
||
|
||
|
||
|
||
10.2.4. Mail
|
||
|
||
Handling mail is something most machines do to some extent. The big
|
||
mail servers, however, come into a class of their own. This is a
|
||
demanding task and a big server can be slow even when connected to
|
||
fast drives and a good net feed. In the Linux world the big server at
|
||
vger.rutgers.edu is a well known example. Unlike a news service which
|
||
is distributed and which can partially reconstruct the spool using
|
||
other machines as a feed, the mail servers are centralised. This makes
|
||
safety much more important, so for a major server you should consider
|
||
a RAID solution with emphasis on reliability. Size is hard to
|
||
estimate, it all depends on how many lists you run as well as how many
|
||
subscribers you have.
|
||
|
||
Big mail servers can be IO limited in performance and for this reason
|
||
some use huge silicon disks connected to the SCSI bus to hold all mail
|
||
related files including temporary files. For extra safety these are
|
||
battery backed and filesystems like udf are preferred since they
|
||
always flush metadata to disk. This added cost to performance is
|
||
offset by the very fast disk.
|
||
|
||
Note that these days more and more sites switch from using POP, which
pulls mail from the mail server to the local machine, to using IMAP,
which serves mail while keeping the mail archive centralised. This
means that mail is no longer spooled in its original sense but often
builds up, requiring huge disk space. Also more and more users (ab)use
mail attachments to send all sorts of things across, even a small word
processor document can easily end up over 1 MB. Size your disks
generously and keep an eye on how much space is left.
|
||
10.2.5. News
|
||
|
||
This is definitely a high volume task, and very dependent on what news
|
||
groups you subscribe to. On Nyx there is a fairly complete feed and
|
||
the spool files consume about 17 GB. The biggest groups are no doubt
|
||
in the alt.binaries.* hierarchy, so if you for some reason decide not to
|
||
get these you can get a good service with perhaps 12 GB. Still others,
|
||
that shall remain nameless, feel 2 GB is sufficient to claim ISP
|
||
status. In this case news expires so fast I feel the spelling IsP is
|
||
barely justified. A full newsfeed means a traffic of a few GB every
|
||
day and this is an ever growing number.
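
The relation between spool size and how fast news expires is simple
arithmetic, sketched below in Python. The function name is invented
for the example, and since the text only says "a few GB every day"
you have to supply your own measured daily volume.

```python
def retention_days(spool_gb, daily_feed_gb):
    """Estimate how many days of news fit in the spool before expiry."""
    if daily_feed_gb <= 0:
        raise ValueError("daily feed volume must be positive")
    return spool_gb / daily_feed_gb


if __name__ == "__main__":
    # 3 GB per day is an assumed figure, not a measurement.
    for spool in (17, 12, 2):
        print(spool, "GB spool:", round(retention_days(spool, 3.0), 1), "days")
```

With any realistic daily volume a 2 GB spool holds less than a day of
a full feed, which is the point made above.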
|
||
|
||
|
||
|
||
10.2.6. Others
|
||
|
||
There are many services available on the net even though many have
been put somewhat in the shadows by the web. Nevertheless, services
like archie, gopher and wais, just to name a few, still exist and
remain valuable tools on the net. If you are serious about starting a
major server you should also consider these services. Determining the
required volumes is hard, it all depends on popularity and demand.
Providing good service inevitably has its costs, disk space is just
one of them.
|
||
|
||
|
||
|
||
10.2.7. Server Recommendations
|
||
|
||
Servers today require large numbers of large disks to function
satisfactorily in commercial settings. As mean time between failures
(MTBF) decreases rapidly as the number of components increases it is
advisable to look into using RAID for protection and use a number of
medium sized drives rather than one single huge disk. Also look into
the High Availability (HA) project. More information is available in
the High Availability HOWTO
<http://www.ibiblio.org/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html>
and also at related web pages
<http://www.henge.com/~alanr/ha/index.html>.
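
To put a number on how MTBF drops with the number of drives, here is
a minimal Python sketch. It assumes identical drives that fail
independently at a constant rate, which is a textbook simplification
and not a property of any particular drive; the function name and the
500000 hour figure are made up for the example.

```python
def combined_mtbf(drive_mtbf_hours, drive_count):
    """MTBF until the first failure among identical, independent drives.

    With constant failure rates the rates add up, so the combined MTBF
    is the single drive MTBF divided by the number of drives.
    """
    if drive_count < 1:
        raise ValueError("need at least one drive")
    return drive_mtbf_hours / drive_count


if __name__ == "__main__":
    # Hypothetical drive rated at 500000 hours.
    for count in (1, 5, 10):
        print(count, "drives:", combined_mtbf(500000, count), "hours")
```

This is the time until some drive in the set fails, which is why a set
of many drives without redundancy is less reliable than a single one
and why RAID levels with redundancy are worth the extra disks.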
|
||
|
||
There is also an article in Byte called How Big Does Your Unix Server
|
||
Have To Be?
|
||
<http://www.byte.com/columns/servinglinux/1999/06/0607servinglinux.html>
|
||
with many points that are relevant to Linux.
|
||
|
||
|
||
|
||
10.3. Pitfalls
|
||
|
||
The dangers of splitting up everything into separate partitions are
|
||
briefly mentioned in the section about volume management. Still,
|
||
several people have asked me to emphasize this point more strongly:
|
||
when one partition fills up it cannot grow any further, no matter if
|
||
there is plenty of space in other partitions.
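
Since a full partition cannot borrow space from its neighbours it pays
to watch them. The following Python sketch does roughly what a look at
the output of 'df' does. The function names and the 90 percent
threshold are arbitrary choices for the example, and the figures can
differ slightly from 'df' because of blocks reserved for root.

```python
import shutil


def percent_used(total, used):
    """Return the used share of a file system as a percentage."""
    if total <= 0:
        raise ValueError("total size must be positive")
    return 100.0 * used / total


def nearly_full(mount_points, threshold=90.0):
    """Return the mount points whose usage is at or above the threshold."""
    full = []
    for mount_point in mount_points:
        usage = shutil.disk_usage(mount_point)  # named tuple: total, used, free
        if usage.total == 0:
            continue  # pseudo file systems report no size at all
        if percent_used(usage.total, usage.used) >= threshold:
            full.append(mount_point)
    return full


if __name__ == "__main__":
    # "/" exists on any Unix system; add your own mount points here.
    print(nearly_full(["/"]))
```

Run from cron with your own list of mount points, something like this
gives an early warning before a spool or /tmp partition fills up.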
|
||
|
||
In particular look out for explosive growth in the news spool
|
||
(/var/spool/news). For multi user machines with quotas keep an eye on
|
||
/tmp and /var/tmp as some people try to hide their files there, just
|
||
look out for filenames ending in gif or jpeg...
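
A quick way of looking for such files is sketched below in Python.
The function name is invented and the two suffixes are simply the ones
mentioned above; extend the list to taste.

```python
import os

# Suffixes taken from the text above; matching is case insensitive.
IMAGE_SUFFIXES = ("gif", "jpeg")


def find_image_files(top):
    """Return paths under top whose names end in one of IMAGE_SUFFIXES."""
    matches = []
    for directory, _subdirs, filenames in os.walk(top):
        for name in filenames:
            if name.lower().endswith(IMAGE_SUFFIXES):
                matches.append(os.path.join(directory, name))
    return sorted(matches)


if __name__ == "__main__":
    for path in find_image_files("/tmp"):
        print(path)
```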
|
||
|
||
In fact, for single physical drives this scheme offers very little
|
||
gain at all, other than making file growth monitoring easier (using
|
||
'df') and physical track positioning. Most importantly there is no
|
||
scope for parallel disk access. A freely available volume management
|
||
system would solve this but this is still some time in the future.
|
||
However, when more specialised file systems become available even a
|
||
single disk could benefit from being divided into several partitions.
|
||
|
||
For more information see section ``Troubleshooting''.
|
||
|
||
|
||
|
||
11. Disk Layout
|
||
|
||
With all this in mind we are now ready to embark on the layout. I have
|
||
based this on my own method developed when I got hold of 3 old SCSI
|
||
disks and boggled over the possibilities.
|
||
|
||
The tables in the appendices are designed to simplify the mapping
|
||
process. They have been designed to help you go through the process of
|
||
optimizations as well as making a useful log in case of system
|
||
repair. A few examples are also given.
|
||
|
||
|
||
|
||
11.1. Selection for Partitioning
|
||
|
||
Determine your needs and set up a list of all the parts of the file
|
||
system you want to be on separate partitions and sort them in
|
||
descending order of speed requirement and how much space you want to
|
||
give each partition.
|
||
|
||
The table in ``Appendix A'' section is a useful tool to select what
|
||
directories you should put on different partitions. It is sorted in a
|
||
logical order with space for your own additions and notes about
|
||
mounting points and additional systems. It is therefore NOT sorted in
|
||
order of speed, instead the speed requirements are indicated by
|
||
bullets ('o').
|
||
|
||
If you plan to RAID make a note of the disks you want to use and what
|
||
partitions you want to RAID. Remember that various RAID solutions offer
|
||
different speeds and degrees of reliability.
|
||
|
||
(Just to make it simple I'll assume we have a set of identical SCSI
|
||
disks and no RAID)
|
||
|
||
|
||
|
||
11.2. Mapping Partitions to Drives
|
||
|
||
Then we want to place the partitions onto physical disks. The point of
|
||
the following algorithm is to maximise parallelizing and bus capacity.
|
||
In this example the drives are A, B and C and the partitions are
|
||
987654321 where 9 is the partition with the highest speed requirement.
|
||
Starting at one drive we 'meander' the partition line over and over
|
||
the drives in this way:
|
||
|
||
|
||
|
||
A : 9 4 3
|
||
B : 8 5 2
|
||
C : 7 6 1
|
||
|
||
|
||
|
||
This makes the 'sum of speed requirements' the most equal across each
|
||
drive.
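
The 'meander' can be written down in a few lines of Python. This is
only a sketch of the method described above; the function name is made
up, and the partitions are assumed to be already sorted in descending
order of speed requirement.

```python
def meander(partitions, drives):
    """Distribute partitions over drives in a back-and-forth (snake) order.

    partitions must be sorted by descending speed requirement.
    """
    layout = {drive: [] for drive in drives}
    for index, partition in enumerate(partitions):
        sweep, position = divmod(index, len(drives))
        # Even sweeps run from the first drive to the last, odd sweeps run back.
        if sweep % 2 == 1:
            position = len(drives) - 1 - position
        layout[drives[position]].append(partition)
    return layout


if __name__ == "__main__":
    layout = meander([9, 8, 7, 6, 5, 4, 3, 2, 1], ["A", "B", "C"])
    for drive, parts in layout.items():
        print(drive, ":", *parts)
```

For the example above this reproduces the table, with speed sums of
16, 15 and 14 for drives A, B and C. Simply dealing the partitions out
in one direction every time would give 18, 15 and 12 instead.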
|
||
|
||
|
||
Use the table in ``Appendix B'' section to select what drives to use
|
||
for each partition in order to optimize for parallelism.
|
||
|
||
Note the speed characteristics of your drives and note each directory
|
||
under the appropriate column. Be prepared to shuffle directories,
|
||
partitions and drives around a few times before you are satisfied.
|
||
|
||
|
||
11.3. Sorting Partitions on Drives
|
||
|
||
After that it is recommended to select partition numbering for each
|
||
drive.
|
||
|
||
Use the table in ``Appendix C'' section to select partition numbers in
|
||
order to optimize for track characteristics. At the end of this you
|
||
should have a table sorted in ascending partition number. Fill these
|
||
numbers back into the tables in appendix A and B.
|
||
|
||
You will find these tables useful when running the partitioning
|
||
program (fdisk or cfdisk) and when doing the installation.
|
||
|
||
|
||
|
||
11.4. Optimizing
|
||
|
||
After this there are usually a few partitions that have to be
|
||
'shuffled' over the drives either to make them fit or if there are
|
||
special considerations regarding speed, reliability, special file
|
||
systems etc. Nevertheless this gives what this author believes is a
|
||
good starting point for the complete setup of the drives and the
|
||
partitions. In the end it is actual use that will determine the real
|
||
needs after we have made so many assumptions. After commencing
|
||
operations one should assume a time comes when a repartitioning will
|
||
be beneficial.
|
||
|
||
For instance if one of the 3 drives in the above mentioned example is
|
||
very slow compared to the two others a better plan would be as
|
||
follows:
|
||
|
||
|
||
|
||
A : 9 6 5
|
||
B : 8 7 4
|
||
C : 3 2 1
|
||
|
||
|
||
|
||
11.4.1. Optimizing by Characteristics
|
||
|
||
Often drives can be similar in apparent overall speed but some
|
||
advantage can be gained by matching drives to the file size
|
||
distribution and frequency of access. Thus binaries are suited to
|
||
drives with fast access that offer command queueing, and libraries are
|
||
better suited to drives with larger transfer speeds where IDE offers
|
||
good performance for the money.
|
||
|
||
|
||
|
||
11.4.2. Optimizing by Drive Parallelising
|
||
|
||
Avoid drive contention by looking at tasks: for instance if you are
|
||
accessing /usr/local/bin chances are you will soon also need files
|
||
from /usr/local/lib so placing these on separate drives allows less
|
||
seeking and possible parallel operation and drive caching. It is quite
|
||
possible that choosing what may appear less than ideal drive
|
||
characteristics will still be advantageous if you can gain parallel
|
||
operations. Identify common tasks, what partitions they use and try to
|
||
keep these on separate physical drives.
|
||
|
||
Just to illustrate my point I will give a few examples of task
|
||
analysis here.
|
||
|
||
|
||
|
||
Office software
|
||
such as editing, word processing and spreadsheets are typical
|
||
examples of low intensity software both in terms of CPU and disk
|
||
intensity. However, should you have a single server for a huge
|
||
number of users you should not forget that most such software
|
||
has auto save facilities which cause extra traffic, usually on
|
||
the home directories. Splitting users over several drives would
|
||
reduce contention.
|
||
|
||
|
||
News
|
||
readers also feature auto save features on home directories so
|
||
ISPs should consider separating home directories.
|
||
|
||
News spools are notorious for their deeply nested directories
|
||
and their large number of very small files. Loss of a news spool
|
||
partition is not a big problem for most people, too, so they are
|
||
good candidates for a RAID 0 setup with many small disks to
|
||
distribute the many seeks among multiple spindles. It is
|
||
recommended in the manuals and FAQs for the INN news server to
|
||
put news spool and .overview files on separate drives for larger
|
||
installations.
|
||
|
||
|
||
Some notes on INN optimising under Tru64 UNIX
|
||
<http://www.tru64unix.compaq.com/internet/inn-wp.html> also
|
||
apply to a wider audience, including Linux users.
|
||
|
||
|
||
Database
|
||
applications can be demanding both in terms of drive usage and
|
||
speed requirements. The details are naturally application
|
||
specific, read the documentation carefully with disk
|
||
requirements in mind. Also consider RAID both for performance
|
||
and reliability.
|
||
|
||
|
||
E-mail
|
||
reading and sending involves home directories as well as in- and
|
||
outgoing spool files. If possible keep home directories and
|
||
spool files on separate drives. If you are a mail server or a
|
||
mail hub consider putting in- and outgoing spool directories on
|
||
separate drives.
|
||
|
||
Losing mail is an extremely bad thing, if you are managing an
|
||
ISP or major hub. Think about RAIDing your mail spool and
|
||
consider frequent backups.
|
||
|
||
|
||
Software development
|
||
can require a large number of directories for binaries,
|
||
libraries, include files as well as source and project files. If
|
||
possible split as much as possible across separate drives. On
|
||
small systems you can place /usr/src and project files on the
|
||
same drive as the home directories.
|
||
Web browsing
|
||
is becoming more and more popular. Many browsers have a local
|
||
cache which can expand to rather large volumes. As this is used
|
||
when reloading pages or returning to the previous page, speed is
|
||
quite important here. If however you are connected via a well
|
||
configured proxy server you do not need more than typically a
|
||
few megabytes per user for a session. See also the sections on
|
||
``Home Directories'' and ``WWW''.
|
||
|
||
|
||
|
||
11.5. Compromises
|
||
|
||
One way to avoid the aforementioned ``pitfalls'' is to only set off
|
||
fixed partitions to directories with a fairly well known size such as
|
||
swap, /tmp and /var/tmp and group together the remainders into the
|
||
remaining partitions using symbolic links.
|
||
|
||
Example: a slow disk (slowdisk), a fast disk (fastdisk) and an
|
||
assortment of files. Having set up swap and tmp on fastdisk; and /home
|
||
and root on slowdisk we have (the fictitious) directories /a/slow,
|
||
/a/fast, /b/slow and /b/fast left to allocate on the partitions
|
||
/mnt.slowdisk and /mnt.fastdisk which represent the remaining
|
||
partitions of the two drives.
|
||
|
||
Putting /a or /b directly on either drive gives the same properties to
|
||
the subdirectories. We could make all 4 directories separate
|
||
partitions but would lose some flexibility in managing the size of
|
||
each directory. A better solution is to make these 4 directories
|
||
symbolic links to appropriate directories on the respective drives.
|
||
|
||
Thus we make
|
||
|
||
|
||
|
||
/a/fast point to /mnt.fastdisk/a/fast or /mnt.fastdisk/a.fast
|
||
/a/slow point to /mnt.slowdisk/a/slow or /mnt.slowdisk/a.slow
|
||
/b/fast point to /mnt.fastdisk/b/fast or /mnt.fastdisk/b.fast
|
||
/b/slow point to /mnt.slowdisk/b/slow or /mnt.slowdisk/b.slow
|
||
|
||
|
||
|
||
and we get all fast directories on the fast drive without having to
|
||
set up a partition for all 4 directories. The second (right hand)
|
||
alternative gives us a flatter file system which in this case can
|
||
make it simpler to keep an overview of the structure.
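
The link scheme can be tried out safely before touching a real system.
The Python sketch below builds the flatter (right hand) alternative
inside a scratch directory; the function name and the use of a scratch
root are choices made for this example only.

```python
import os


def link_into_place(root, links):
    """Create each link path under root as a symlink to its target under root."""
    for link, target in links.items():
        link_path = os.path.join(root, link.lstrip("/"))
        target_path = os.path.join(root, target.lstrip("/"))
        # The real directory that lives on the chosen drive.
        os.makedirs(target_path, exist_ok=True)
        # The parent of the link, such as /a or /b.
        os.makedirs(os.path.dirname(link_path), exist_ok=True)
        os.symlink(target_path, link_path)


# The fictitious directories from the text, using the flatter naming.
LINKS = {
    "/a/fast": "/mnt.fastdisk/a.fast",
    "/a/slow": "/mnt.slowdisk/a.slow",
    "/b/fast": "/mnt.fastdisk/b.fast",
    "/b/slow": "/mnt.slowdisk/b.slow",
}

if __name__ == "__main__":
    import tempfile

    # Work inside a scratch directory so nothing on the real system is touched.
    with tempfile.TemporaryDirectory() as scratch:
        link_into_place(scratch, LINKS)
        print(os.path.realpath(os.path.join(scratch, "a/fast")))
```

On a real system the root would be / and the /mnt.fastdisk and
/mnt.slowdisk partitions would have to be mounted before the links are
of any use.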
|
||
|
||
The disadvantage is that it is a complicated scheme to set up and plan
|
||
in the first place and that all mount points and partitions have to be
|
||
defined before the system installation.
|
||
|
||
Important: note that the /usr partition must be mounted directly onto
|
||
root and not via an indirect link as described above. The reason for
|
||
this is the long backward links used extensively in X11 that go from
|
||
deep within /usr all the way to root and then down into /etc
|
||
directories.
|
||
|
||
|
||
|
||
12. Implementation
|
||
|
||
Having done the layout you should now have a detailed description of
|
||
what goes where. Most likely this will be on paper but hopefully
|
||
someone will make a more automated system that can deal with
|
||
everything from the design, through partitioning to formatting and
|
||
installation. This is the route one will have to take to realise the
|
||
design.
|
||
|
||
Modern distributions come with installation tools that will guide you
|
||
through partitioning and formatting and also set up /etc/fstab for you
|
||
automatically. For later modifications, however, you will need to
|
||
understand the underlying mechanisms.
|
||
|
||
|
||
|
||
12.1. Checklist
|
||
|
||
Before starting make sure you have the following:
|
||
|
||
o  Written notes of what goes where, your design

o  A functioning, tested rescue disk

o  A fresh backup of your precious data

o  At least two formatted, tested and empty floppies

o  Read and understood the man page for fdisk or equivalent

o  Patience, concentration and elbow grease
|
||
|
||
|
||
|
||
12.2. Drives and Partitions
|
||
|
||
When you start DOS or the like you will find all partitions labeled C:
|
||
and onwards, with no differentiation on IDE, SCSI, network or whatever
|
||
type of media you have. In the world of Linux this is rather
|
||
different. During booting you will see partitions described like this:
|
||
|
||
______________________________________________________________________
|
||
Dec 6 23:45:18 demos kernel: Partition check:
|
||
Dec 6 23:45:18 demos kernel: sda: sda1
|
||
Dec 6 23:45:18 demos kernel: hda: hda1 hda2
|
||
______________________________________________________________________
|
||
|
||
|
||
|
||
SCSI drives are labelled sda, sdb, sdc etc., and (E)IDE drives are
labelled hda, hdb, hdc etc. There are also standard names for all
devices; full information can be found in /dev/MAKEDEV and
/usr/src/linux/Documentation/devices.txt.

Partitions are numbered within each drive: hda1, hda2 and so on. On
SCSI drives there can be 15 partitions per drive, on EIDE drives
there can be 63 partitions per drive. Both limits exceed what is
currently useful for most disks.
|
||
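The naming scheme can be taken apart mechanically. The following is a
minimal sketch, not part of any standard tool; the function names
drive_of and partition_of are made up for this illustration, and it
only handles the plain hdX / sdX style names described above.

```shell
# Split a partition name such as hda2 or /dev/sdb11 into its drive
# and partition number, following the naming scheme described above.
# Both function names are hypothetical helpers for this example.

drive_of() {
    # Strip an optional /dev/ prefix, then any trailing digits.
    printf '%s\n' "${1#/dev/}" | sed 's/[0-9]*$//'
}

partition_of() {
    # Keep only the trailing digits; empty output means a whole drive.
    printf '%s\n' "$1" | sed 's/^.*[^0-9]//'
}
```

For instance, drive_of /dev/sdb11 names the drive and partition_of
hda2 the partition number on it.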
|
||
These are then mounted according to the file /etc/fstab before they
|
||
appear as a part of the file system.
|
||
|
||
|
||
|
||
12.3. Partitioning
|
||
|
||
|
||
It feels so good / It's a marginal risk / when I clear off / windows
|
||
with fdisk! (the Dustbunny in an issue
|
||
<http://www.userfriendly.org/cartoons/archives/99feb/19990221.html> of
|
||
User Friendly <http://www.userfriendly.org/> in the song "Refund
|
||
this")
|
||
|
||
First you have to partition each drive into a number of separate
partitions. Under Linux there are two main tools, fdisk and the more
screen oriented cfdisk. These are complex programs, so read the
manual very carefully. For the experts there is now also sfdisk.

Partitions come in three flavours: primary, extended and logical.
You have to use primary partitions for booting, but there is a
maximum of 4 primary partitions. If you want more you have to define
an extended partition within which you define your logical
partitions.
|
||
|
||
Each partition has an identifier number which tells the operating
system what it is. For Linux the types swap (82) and Linux native
(83, used for ext2fs) are the ones you will need to know. If you want
to use RAID with autostart you have to check the documentation for
the appropriate type number for the RAID partition.
|
||
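As a small memory aid, the identifiers can be put into a lookup
function. This is only a sketch: the function name type_name is made
up, the list is deliberately short, and the RAID entry (fd) is what
Linux fdisk reports as "Linux raid autodetect"; check the RAID
documentation for your kernel before relying on it.

```shell
# Translate a partition type id (as shown in the Id column of
# "fdisk -l") into a short description. Only a handful of ids are
# covered; anything else is reported as unknown.
# type_name is a hypothetical helper for this example.
type_name() {
    case "$1" in
        5)     echo "Extended" ;;
        82)    echo "Linux swap" ;;
        83)    echo "Linux native" ;;
        fd|FD) echo "Linux raid autodetect" ;;
        *)     echo "unknown" ;;
    esac
}
```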
|
||
There is a readme file that comes with fdisk that gives more in-depth
|
||
information on partitioning.
|
||
|
||
Someone has just made a Partitioning HOWTO which contains excellent,
in-depth information on the nitty-gritty of partitioning. Rather than
repeating it here and bloating this document further, I will refer
you to it instead.
|
||
|
||
Redhat has written a screen oriented utility called Disk Druid which
|
||
is supposed to be a user friendly alternative to fdisk and cfdisk and
|
||
also automates a few other things. Unfortunately this product is not
|
||
quite mature so if you use it and cannot get it to work you are well
|
||
advised to try fdisk or cfdisk.
|
||
|
||
Not to be outdone, Mandrakesoft has made an even more graphic
alternative called Diskdrake
<http://www.linux-mandrake.com/diskdrake/> that also offers numerous
features.
|
||
|
||
Also the GNU project offers a partitioning tool called GNU Parted
|
||
<http://www.gnu.org/software/parted/>
|
||
|
||
|
||
The Ranish Partition Manager
|
||
<http://www.users.intercom.com/~ranish/part/> is another free
|
||
alternative, while Partition Magic <http://www.powerquest.com> is a
|
||
popular commercial alternative which also offers some support for
|
||
resizing ext2fs partitions.
|
||
|
||
Note that Windows will complain if it finds more than one primary
|
||
partition on a drive. Also it appears to assign drive letters to
|
||
primary partitions as it finds disks before starting over from the
|
||
first disk to assign subsequent drive names to logical partitions.
|
||
|
||
If you want DOS/Windows on your system you should make that partition
first: a primary one to boot from, made with the DOS fdisk program.
Then, if you want NT, you put that one in. Finally, for Linux, you
create those partitions with the Linux fdisk program or equivalents.
Linux is flexible enough to boot from primary as well as logical
partitions.
|
||
|
||
In depth information on DOS fdisk can be found at Fdisk.com
|
||
<http://www.fdisk.com/fdisk/> and MS-DOS 5.00 - 7.10 Undocumented,
|
||
Secret + Hidden Features <http://members.aol.com/axcel216/secrets.htm>
|
||
which details even more bugs and pitfalls.
|
||
|
||
|
||
12.4. Repartitioning
|
||
|
||
Sometimes it is necessary to change the sizes of existing partitions
while keeping the contents intact. One way is of course to back up
everything, create new partitions and then restore the old contents,
and while this gives your backup system a good test it is also rather
time consuming.
|
||
|
||
Partition resizing is a simpler alternative where a file system is
|
||
first shrunk to desired volume and then the partition table is updated
|
||
to reflect the new end of partition position. This process is
|
||
therefore very file system sensitive.
|
||
|
||
Repartitioning requires free space at the end of the file system, so
to make sure you are able to shrink it you should first defragment
your drive and empty any wastebaskets.
|
||
|
||
Using fips <http://www.igd.fhg.de/~aschaefe/fips/> you can resize a
FAT partition, and fips 1.6 and fips 2.0 are also able to resize
FAT32 partitions. Note that these programs actually run under DOS.

Resizing other file systems is much more complicated, but one popular
commercial system, Partition Magic <http://www.powerquest.com>, is
able to resize more file system types, including ext2fs using the
resize2fs program. Make sure you get the latest updates to this
program as recent versions had problems with large disks.
|
||
|
||
|
||
In order to get the most out of fips you should first delete
unnecessary files, empty wastebaskets etc. before defragmenting your
drive. This way you can allocate more space to other partitions. If
the program complains that there are still files at the end of your
drive, these are probably hidden files generated by Microsoft Mirror
or Norton Image. They are probably called image.idx and image.dat and
contain backups of some system files.
|
||
|
||
There are reports that in some Windows defragmentation programs you
|
||
should make sure the box "allow Windows to move files around" is not
|
||
checked, otherwise you will end up with some files in the last
|
||
cylinder of the partition which will prevent FIPS from reclaiming
|
||
space.
|
||
|
||
If you still have unmovable files at the end of your DOS partition you
|
||
should get the DOS program showfat
|
||
<http://www8.pair.com/dmurdoch/programs/showfat.htm> version 3.0 or
|
||
higher. This shows you what files are where so you can deal with them
|
||
directly.
|
||
|
||
A freeware alternative is Partition Resizer
|
||
<http://members.nbci.com/Zeleps/> which can shrink, grow and move
|
||
partitions.
|
||
|
||
Some versions of DOS / Windows have a hidden flag for defrag, "/P",
that causes defrag to move even hidden files. Use at your own risk.
|
||
|
||
|
||
Repartitioning is as dangerous a process as any other partitioning,
so you are advised to have a fresh backup handy.
|
||
|
||
|
||
|
||
12.5. Microsoft Partition Bug
|
||
|
||
In Microsoft products all the way up to Win 98 there is a tricky bug
|
||
that can cause you a bit of trouble: if you have several primary fat
|
||
partitions and the last extended partition is not a fat partition the
|
||
Microsoft system will try to mount the last partition as if it were a
|
||
FAT partition in place of the last primary FAT partition.
|
||
|
||
There is more information <http://www.v-com.com/> available on the net
|
||
on this.
|
||
|
||
To avoid this you can place a small logical fat partition at the very
|
||
end of your disk.
|
||
|
||
More information on multi OS installations is available at V
Communications <http://www.v-com.com/> but they keep rearranging the
links continuously so no direct links can be offered here.
|
||
|
||
|
||
Since some hardware comes with setup software that is available under
|
||
DOS only this could come in handy anyway. Notable examples are RAID
|
||
controllers from DPT and a number of networking cards.
|
||
|
||
|
||
12.6. Multiple Devices ( md )
|
||
|
||
This kernel feature is in a state of flux, so make sure to read the
latest documentation on it. It is not yet stable, beware.
|
||
|
||
Briefly explained it works by adding partitions together into new
|
||
devices md0, md1 etc. using mdadd before you activate them using
|
||
mdrun. This process can be automated using the file /etc/mdtab.
|
||
|
||
The latest md system uses a /etc/raidtab and a different syntax. Make
|
||
sure your RAID-tools package matches the md version as the internal
|
||
protocol has changed.
|
||
|
||
You then treat these like any other partition on a drive. Proceed
with formatting etc. as described below using these new devices.

There is now also a HOWTO in development for RAID using md which you
should read.
|
||
|
||
|
||
|
||
12.7. Formatting
|
||
|
||
Next comes partition formatting, putting down the data structures that
|
||
will describe the files and where they are located. If this is the
|
||
first time it is recommended you use formatting with verify. Strictly
|
||
speaking it should not be necessary but this exercises the I/O hard
|
||
enough that it can uncover potential problems, such as incorrect
|
||
termination, before you store your precious data. Look up the command
|
||
mkfs for more details.
|
||
|
||
Linux can support a great number of file systems. Rather than
repeating the details here, you can read the man page for fs which
describes them in some detail. Note that your kernel has to have the
drivers compiled in or made as modules in order to be able to use
these features. When the time comes for kernel compiling you should
read carefully through the file system feature list. If you use make
menuconfig you can get online help for each file system type.
|
||
|
||
Note that some rescue disk systems require minix, msdos and ext2fs to
|
||
be compiled into the kernel.
|
||
|
||
Also swap partitions have to be prepared, and for this you use mkswap.
|
||
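The formatting steps above can be sketched as follows. This is an
illustration only: the helper names run, format_ext2 and format_swap
are made up, and by default the commands are merely printed. The -c
flag asks mke2fs and mkswap to check for bad blocks first, which is
the "formatting with verify" recommended for new disks. Set DRY_RUN=0
and run as root only when you are certain of the device name, as
formatting destroys whatever is on the partition.

```shell
# Print the commands by default; set DRY_RUN=0 to really run them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Format an ext2 partition with a bad-block check (the -c flag).
format_ext2() { run mke2fs -c "$1"; }

# Prepare a swap partition, also with a bad-block check.
format_swap() { run mkswap -c "$1"; }
```

Calling format_ext2 /dev/hda2 in dry-run mode simply shows the mke2fs
command line that would be used.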
|
||
Some important notes on formatting with DOS and Windows can be found
|
||
in MS-DOS 5.00 - 7.10 Undocumented, Secret + Hidden Features
|
||
<http://members.aol.com/axcel216/secrets.htm>.
|
||
|
||
Note that this formatting is high level formatting, that writes the
|
||
file system to the disk, as opposed to low level formatting that lays
|
||
down tracks and sectors. The latter is hardly ever needed these days.
|
||
|
||
|
||
12.8. Mounting
|
||
|
||
Data on a partition is not available to the file system until it is
|
||
mounted on a mount point. This can be done manually using mount or
|
||
automatically during booting by adding appropriate lines to
|
||
/etc/fstab. Read the manual for mount and pay close attention to the
|
||
tabulation.
|
||
|
||
|
||
|
||
12.9. fstab
|
||
|
||
During the booting process the system mounts all partitions as
|
||
described in the fstab file which can look something like this:
|
||
|
||
|
||
|
||
# <file system>  <mount point>  <type>  <options>  <dump>  <pass>
/dev/hda2        /              ext2    defaults   0       1
None             none           swap    sw         0       0
proc             /proc          proc    defaults   0       0
/dev/hda1        /dosc          vfat    defaults   0       1
|
||
|
||
|
||
|
||
This file is somewhat sensitive to the formatting used so it is best
and also most convenient to edit it using one of the editing tools
made for this purpose, such as fstool
<http://www.bit.net.au/~bhepple/fstool/>, a Tcl/Tk-based file system
mounter, and kfstab <http://kfstab.purespace.de/kfstab/>, an editing
tool for KDE.
|
||
|
||
Briefly, the fields are partition name, where to mount the partition,
|
||
type of file system, mount options, when to dump for backup and when
|
||
to do fsck.
|
||
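If you edit the file by hand, a quick sanity check of the six fields
described above can catch the most common slips. The following is a
sketch under the assumption that every entry should carry all six
fields (in practice the last two may be left out); the function name
check_fstab is made up for this example.

```shell
# Report fstab lines that do not have exactly six fields.
# Comment lines and blank lines are skipped. The exit status is
# non-zero if any offending line was found.
# check_fstab is a hypothetical helper for this example.
check_fstab() {
    awk '
        /^[ \t]*#/ || NF == 0 { next }
        NF != 6 {
            printf "line %d: expected 6 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad }
    ' "${1:-/etc/fstab}"
}
```

Run it on a copy of the file before rebooting, for instance
check_fstab /etc/fstab.new, where the file name is only an example.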
|
||
Linux offers the possibility of parallel file checking (fsck) but to
|
||
be efficient it is important not to fsck more than one partition on a
|
||
drive at a time.
|
||
|
||
|
||
|
||
12.10. Mount options
|
||
|
||
Mounting, either by hand or using the fstab, allows for a number of
options that offer extra protection. Below are some of the more
useful options.
|
||
|
||
|
||
nodev
|
||
Do not interpret character or block special devices on the file
|
||
system.
|
||
|
||
|
||
noexec
|
||
This disallows execution of any binaries on the mounted file
|
||
system. Useful in spool areas.
|
||
nosuid
|
||
This disallows set-user-identifier or set-group-identifier on
|
||
the mounted file system. Useful in home directories.
|
||
|
||
|
||
For more information and cautions refer to the man page for mount and
|
||
fstab.
|
||
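As an illustration, the options above could appear in fstab like
this. The device names and the choice of mount points are assumptions
made up for the example, not a recommendation for any particular
system:

```
# <file system>  <mount point>  <type>  <options>                     <dump>  <pass>
/dev/hdb1        /var/spool     ext2    defaults,nodev,noexec,nosuid  0       2
/dev/hdb2        /home          ext2    defaults,nodev,nosuid         0       2
```

Options are separated by commas without spaces, and later options
override what defaults would otherwise have set.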
|
||
|
||
|
||
12.11. Recommendations
|
||
|
||
Having constructed and implemented your clever scheme you are well
|
||
advised to make a complete record of it all, on paper. After all,
|
||
having all the necessary information on disk is no use if the machine
|
||
is down.
|
||
|
||
Partition tables can be damaged or lost, in which case it is
|
||
excruciatingly important that you enter the exact same numbers into
|
||
fdisk so you can rescue your system. You can use the program printpar
|
||
to make a clear record of the tables. Also write down the SCSI numbers
|
||
or IDE names for each disk so you can put the system together again in
|
||
the right order.
|
||
|
||
There is also a small script in appendix ``Appendix M: Disk System
|
||
Documenter'' which will generate a summary of your disk
|
||
configurations.
|
||
|
||
For checking your hard disks you can use the Disk Advisor boot disk
|
||
available on the net <http://www.ontrack.com/>. The disk builder
|
||
required Windows to run. This system is useful to diagnose failed
|
||
disks.
|
||
|
||
You are strongly recommended to make a rescue disk and test it. Most
distributions make one available, often as part of the installation
disks. For some, such as the one for Redhat 6.1, the way to invoke
the disk as a rescue disk is to type linux rescue at the boot prompt.
|
||
|
||
There are also specialised rescue disk distributions available on the
|
||
net.
|
||
|
||
When the need arises you will have to know where your root and boot
partitions reside, so write this down and keep it safe.
|
||
|
||
Note: the difference between a boot disk and a rescue disk is that a
|
||
boot disk will fail if it cannot mount the file system, typically on
|
||
your hard disk. A rescue disk is self contained and will work even if
|
||
there are no hard disks.
|
||
|
||
|
||
|
||
13. Maintenance
|
||
|
||
It is the duty of the system manager to keep an eye on the drives and
|
||
partitions. Should any of the partitions overflow, the system is
|
||
likely to stop working properly, no matter how much space is available
|
||
on other partitions, until space is reclaimed.
|
||
|
||
Partitions and disks are easily monitored using df, and this should
be done frequently, perhaps using a cron job or some other general
system management tool.
|
||
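A monitoring job along these lines could be as simple as the sketch
below. The function name report_full and the 90 percent default are
assumptions made for the example. It reads the portable output format
of df -P and does not cope with mount points that contain spaces.

```shell
# Read "df -P" output on stdin and print every file system whose
# usage is at or above the given percentage (default 90).
# report_full is a hypothetical helper for this example.
report_full() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6 " is " $5 " full"
        }
    '
}

# Typical use, for instance from a cron job:
#   df -P | report_full 90
```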
|
||
Do not forget the swap partitions, these are best monitored using one
|
||
of the memory statistics programs such as free, procinfo or top.
|
||
|
||
|
||
Drive usage monitoring is more difficult but it is important for the
|
||
sake of performance to avoid contention - placing too much demand on a
|
||
single drive if others are available and idle.
|
||
|
||
It is important when installing software packages to have a clear idea
|
||
where the various files go. As previously mentioned GCC keeps binaries
|
||
in a library directory and there are also other programs that for
|
||
historical reasons are hard to figure out, X11 for instance has an
|
||
unusually complex structure.
|
||
|
||
When your system is about to fill up it is time to check and prune
old log messages as well as hunt down core files. Proper use of
ulimit in global shell settings can save you from having core files
littered around the system.
|
||
|
||
|
||
|
||
13.1. Backup
|
||
|
||
The observant reader might have noticed a few hints about the
usefulness of making backups. Horror stories are legion about
accidents and what happened to the person responsible when the backup
turned out to be non-functional or even non-existent. You might find
it simpler to invest in proper backups than a second, secret identity.

There are many options and also a mini-HOWTO ( Backup-With-MSDOS )
detailing what you need to know. In addition to the DOS specifics it
also contains general information and further leads.
|
||
|
||
In addition to making these backups you should also make sure you can
|
||
restore the data. Not all systems verify that the data written is
|
||
correct and many administrators have started restoring the system
|
||
after an accident happy in the belief that everything is working, only
|
||
to discover to their horror that the backups were useless. Be careful.
|
||
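One way to check that a backup can actually be restored is to unpack
it somewhere else and compare the result with the original. The
sketch below does this for a single directory using tar and diff. The
function name backup_and_verify is made up, and the archive should be
written outside the directory being backed up. It is meant for
ordinary files and directories, not for areas like /dev.

```shell
# Archive a directory, then prove the archive can be restored by
# unpacking it into a scratch directory and comparing the result
# with the original. Returns non-zero if any step fails.
# backup_and_verify is a hypothetical helper for this example.
backup_and_verify() {
    src=$1
    archive=$2
    scratch=$(mktemp -d) || return 1
    tar cf "$archive" -C "$src" . &&
        tar xf "$archive" -C "$scratch" &&
        diff -r "$src" "$scratch" > /dev/null
    status=$?
    rm -rf "$scratch"
    return $status
}
```

A full system backup to tape needs more than this, but the principle
is the same: a backup is not finished until a restore has been tried.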
|
||
There are both free and commercial backup systems available for Linux.
|
||
One commercial example is the disk image level backup system from
|
||
QuickStart <http://www.estinc.com/> offering a full function 30 day
|
||
Linux demo available online.
|
||
|
||
|
||
|
||
13.2. Defragmentation
|
||
|
||
This is very dependent on the file system design; some suffer fast
and nearly debilitating fragmentation. Fortunately for us, ext2fs does
not belong to this group and therefore there has been very little talk
about defragmentation tools. Such a tool does in fact exist but is
hardly ever needed.
|
||
|
||
If for some reason you feel this is necessary, the quick and easy
|
||
solution is to do a backup and a restore. If only a small area is
|
||
affected, for instance the home directories, you could tar it over to
|
||
a temporary area on another partition, verify the archive, delete the
|
||
original and then untar it back again.
|
||
|
||
|
||
|
||
13.3. Deletions
|
||
|
||
Quite often disk space shortages can be remedied simply by deleting
unnecessary files that accumulate around the system. Programs that
terminate abnormally often leave all kinds of mess lying around in
the oddest places. Normally a core dump results after such an
incident and unless you are going to debug it you can simply delete
it. These can be found everywhere so you are advised to do a global
search for them now and then. The locate command is useful for this.
|
||
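Since locate works from a database that may be out of date, a live
search with find is a useful complement. The following is a minimal
sketch; the function name find_cores is made up, and it assumes the
dumps carry the plain name core.

```shell
# List files named "core" below a directory without crossing into
# other mounted file systems (-xdev). Defaults to the root directory.
# find_cores is a hypothetical helper for this example.
find_cores() {
    find "${1:-/}" -xdev -type f -name core -print
}
```

Look at the list before deleting anything, as not every file called
core is a core dump; the file utility can tell you which ones are.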
|
||
Unexpected termination can also leave all sorts of temporary files
in places like /tmp or /var/tmp, files that are automatically removed
when the program ends normally. Rebooting cleans up some of these
areas but not necessarily all, and if you have a long uptime you could
end up with a lot of old junk. If space is short you have to delete
with care; make sure the file is not in active use first. Utilities
like file can often tell you what kind of file you are looking at.
|
||
|
||
Many things are logged when the system is running, mostly to files in
|
||
the /var/log area. In particular the file /var/log/messages tends to
|
||
grow until deleted. It is a good idea to keep a small archive of old
|
||
log files around for comparison should the system start to behave
|
||
oddly.
|
||
|
||
If the mail or news system is not working properly you could have
excessive growth in their spool areas, /var/spool/mail and
/var/spool/news respectively. Beware of the overview files as these
have a leading dot which makes them invisible to ls -l. It is always
better to use ls -Al, which will reveal them.
|
||
|
||
User space overflow is a particularly tricky topic. Wars have been
|
||
waged between system administrators and users. Tact, diplomacy and a
|
||
generous budget for new drives is what is needed. Make use of the
|
||
message-of-the-day feature, information displayed during login from
|
||
the /etc/motd file to tell users when space is short. Setting the
|
||
default shell settings to prevent core files being dumped can save you
|
||
a lot of work too.
|
||
|
||
Certain kinds of people try to hide files around the system, usually
trying to take advantage of the fact that files with a leading dot in
the name are invisible to the ls command. One common example is
files named ... that normally either are not seen or, when using
ls -al, disappear in the noise of normal entries like . or .. that
are in every directory. There is however a countermeasure to this:
use ls -Al, which suppresses . and .. but shows all other dot-files.
|
||
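To list only the hidden entries in a directory, the output of ls -A
can be filtered for names with a leading dot. This is a small sketch;
the function name hidden_entries is made up for the example.

```shell
# Show only the entries that a plain "ls" hides, that is, names with
# a leading dot, leaving out . and .. themselves.
# hidden_entries is a hypothetical helper for this example.
hidden_entries() {
    ls -A "${1:-.}" | grep '^\.'
}
```

A file called ... stands out clearly in this listing, since the two
standard entries are no longer there to hide behind.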
|
||
|
||
|
||
13.4. Upgrades
|
||
|
||
No matter how large your drives, the time will come when you find
you need more. As technology progresses you can get ever more for
your money. At the time of writing this, it appears that 6.4 GB
drives give you the most bang for your buck.

Note that with IDE drives you might have to remove an old drive, as
the maximum number supported on your motherboard is normally only 2
or sometimes 4. With SCSI you can have up to 7 for narrow (8-bit)
SCSI or up to 15 for wide (16-bit) SCSI, per channel. Some host
adapters can support more than a single channel and in any case you
can have more than one host adapter per system. My personal
recommendation is that you will most likely be better off with SCSI
in the long run.
|
||
|
||
The question comes, where should you put this new drive? In many cases
|
||
the reason for expansion is that you want a larger spool area, and in
|
||
that case the fast, simple solution is to mount the drive somewhere
|
||
under /var/spool. On the other hand newer drives are likely to be
|
||
faster than older ones so in the long run you might find it worth your
|
||
time to do a full reorganizing, possibly using your old design sheets.
|
||
|
||
If the upgrade is forced by running out of space in partitions used
|
||
for things like /usr or /var the upgrade is a little more involved.
|
||
You might consider the option of a full re-installation from your
|
||
favourite (and hopefully upgraded) distribution. In this case you will
|
||
have to be careful not to overwrite your essential setups. Usually
|
||
these things are in the /etc directory. Proceed with care, fresh
|
||
backups and working rescue disks. The other possibility is to simply
|
||
copy the old directory over to the new directory which is mounted on a
|
||
temporary mount point, edit your /etc/fstab file, reboot with your new
|
||
partition in place and check that it works. Should it fail you can
|
||
reboot with your rescue disk, re-edit /etc/fstab and try again.
|
||
|
||
Until volume management becomes available to Linux this is both
|
||
complicated and dangerous. Do not get too surprised if you discover
|
||
you need to restore your system from a backup.
|
||
|
||
The Tips-HOWTO gives the following example on how to move an entire
|
||
directory structure across:
|
||
|
||
______________________________________________________________________
|
||
(cd /source/directory; tar cf - . ) | (cd /dest/directory; tar xvfp -)
|
||
______________________________________________________________________
|
||
|
||
|
||
|
||
While this approach to moving directory trees is portable among many
Unix systems, it is inconvenient to remember. Also, it fails for
deeply nested directory trees when pathnames become too long for tar
to handle (GNU tar has special provisions to deal with long pathnames).
|
||
|
||
If you have access to GNU cp (which is always the case on Linux
|
||
systems), you could as well use
|
||
|
||
|
||
______________________________________________________________________
|
||
cp -av /source/directory /dest/directory
|
||
______________________________________________________________________
|
||
|
||
|
||
|
||
GNU cp knows specifically about symbolic links, hard links, FIFOs and
|
||
device files and will copy them correctly.
|
||
|
||
Remember that it might not be a good idea to try to transfer /dev or
|
||
/proc.
|
||
|
||
There is also a Hard Disk Upgrade mini-HOWTO
<http://www.storm.ca/~yan/Hard-Disk-Upgrade.html> that gives you a
step by step guide on migrating an entire Linux system, including
LILO, from one hard disk to another.
|
||
|
||
|
||
13.5. Recovery
|
||
|
||
System crashes come in many and entertaining flavours, and partition
table corruption always guarantees plenty of excitement. A recent and
undoubtedly useful tool for those of us who are happy with the normal
level of excitement is gpart
<http://www.stud.uni-hannover.de/user/76201/gpart/>, which means
"Guess PC-Type hard disk partitions".
|
||
|
||
In addition there are some partition utilities
|
||
<http://inet.uni2.dk/~svolaf/utilities.htm> available under DOS.
|
||
|
||
|
||
|
||
13.6. Rescue Disk
|
||
|
||
Upgrades of kernel and hardware are not uncommon in the Linux world
and it is therefore important that you prepare an updated rescue
disk, especially when you use special drivers to access your
hardware. Rescue disks can be obtained from the net, from your
distribution, or you can put one together yourself. Do make sure the
boot and root parameters are set so the kernel will know where to
find your system.
|
||
|
||
If you don't have a recovery floppy you can use the GRUB
|
||
<http://www.gnu.org/software/grub/> boot loader to load from a Linux
|
||
kernel somewhere on disk, with arguments.
|
||
|
||
|
||
|
||
14. Advanced Issues
|
||
|
||
Linux and related systems offer plenty of possibilities for fast,
|
||
efficient and devastating destruction. This document is no exception.
|
||
With power comes danger, and the following sections describe a few
|
||
more esoteric issues that should not be attempted before reading and
|
||
understanding the documentation, the issues and the dangers. You
|
||
should also make a backup. Also remember to try to restore the system
|
||
from scratch from your backup at least once. Otherwise you might not
|
||
be the first to be found with a perfect backup of your system and no
|
||
tools available to reinstall it (or, even more embarrassing, some
|
||
critical files missing on tape).
|
||
|
||
The techniques described here are rarely necessary but can be used for
|
||
very specific setups. Think very clearly through what you wish to
|
||
accomplish before playing around with this.
|
||
|
||
|
||
14.1. Hard Disk Tuning
|
||
|
||
The hard drive parameters can be tuned using the hdparm utility. Here
the most interesting parameter is probably the read-ahead parameter
which determines how much prefetch should be done in sequential
reading.
|
||
|
||
If you want to try this out it makes most sense to tune for the
|
||
characteristic file size on your drive but remember that this tuning
|
||
is for the entire drive which makes it a bit more difficult. Probably
|
||
this is only of use on large servers using dedicated news drives etc.
|
||
|
||
For safety the default hdparm settings are rather conservative. The
disadvantage is that this means you can get lost interrupts if you
have a high frequency of IRQs, as you would when using the serial port
and an IDE disk, since IRQs from the latter would mask other IRQs.
This would be noticeable as less than ideal performance when
downloading data from the net to disk. Setting hdparm -u1 device
would prevent this masking and either improve your performance or,
depending on hardware, corrupt the data on your disk. Experiment with
caution and fresh backups.
|
||
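The settings discussed above can be sketched as follows. The helper
names run, show_readahead, set_readahead and unmask_irq are made up
for this illustration, and by default the commands are only printed.
The flags themselves (-a for read-ahead, -u for interrupt unmasking)
are described in the hdparm man page, which you should read before
setting DRY_RUN=0 and running anything as root.

```shell
# Print the commands by default; set DRY_RUN=0 to really run them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Show the current read-ahead setting of a drive.
show_readahead() { run hdparm -a "$1"; }

# Set the read-ahead of a drive to the given number of sectors.
set_readahead() { run hdparm "-a$2" "$1"; }

# Allow other interrupts to be serviced during disk interrupts.
# As noted above, this can corrupt data on some hardware.
unmask_irq() { run hdparm -u1 "$1"; }
```

For example, set_readahead /dev/hda 16 shows the command that would
set a 16 sector read-ahead; the value 16 is only an example.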
|
||
For more information read the article The Need For Speed
|
||
<http://www.linuxforum.com/plug/articles/needforspeed.html> on tuning
|
||
with hdparm.
|
||
|
||
|
||
|
||
14.2. File System Tuning
|
||
|
||
Most file systems come with a tuning utility, and for ext2fs there is
the tune2fs utility. Several parameters can be modified, but perhaps
the most useful ones are how much space is reserved and who is
allowed to use it. Adjusting these can get you more useful space out
of your drives, possibly at the cost of less room for repairing a
system should it crash.
|
||
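To get a feeling for how much space is involved, the reserved area
can be worked out by hand. The sketch below assumes the usual mke2fs
default of 5 percent reserved for the superuser; the function name
reserved_kb is made up, and the percentage is what the -m option of
tune2fs changes.

```shell
# Compute how many kilobytes are set aside on a partition of
# size_kb kilobytes when the given percentage is reserved.
# reserved_kb is a hypothetical helper for this example.
reserved_kb() {
    size_kb=$1
    percent=$2
    echo $(( size_kb * percent / 100 ))
}

# A partition of 6400000 KB at the default 5 percent, and at 1 percent:
#   reserved_kb 6400000 5
#   reserved_kb 6400000 1
```

On a partition that only holds user data, lowering the percentage
frees the difference for ordinary use. On the root partition it is
safer to leave the reserve alone.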
|
||
|
||
|
||
14.3. Spindle Synchronizing
|
||
|
||
This should not in itself be dangerous, other than the peculiar fact
|
||
that the exact details of the connections remain unclear for many
|
||
drives. The theory is simple: keeping a fixed phase difference between
|
||
the different drives in a RAID setup makes for less waiting for the
|
||
right track to come into position for the read/write head. In practice
|
||
it now seems that with large read-ahead buffers in the drives the
|
||
effect is negligible.
|
||
|
||
Spindle synchronisation should not be used on RAID0 or RAID 0/1 as you
|
||
would then lose the benefit of having the read heads over different
|
||
areas of the mirrored sectors.
|
||
|
||
|
||
|
||
15. Troubleshooting
|
||
|
||
Much can go wrong and this is the start of a growing list of symptoms,
|
||
problems and solutions:
|
||
|
||
|
||
|
||
15.1. During Installation
|
||
|
||
15.1.1. Locating Disks
|
||
|
||
|
||
Symptoms
|
||
Cannot find disk
|
||
|
||
Problem
|
||
How to find what drive letter corresponds to what disk/partition
|
||
|
||
Solution
|
||
Remember Linux does not use drive letters but device names. More
|
||
information can be found in ``Drive names''.
|
||
|
||
|
||
Symptoms
|
||
Cannot partition disk
|
||
|
||
Problem
|
||
Most likely wrong input to the command line for fdisk or similar
|
||
tool.
|
||
|
||
Solution
|
||
Remember to use /dev/hda rather than just hda. Also do not use
|
||
numbers behind hda, those indicate partitions.
|
||
|
||
|
||
|
||
15.1.2. Formatting
|
||
|
||
|
||
Symptoms
|
||
Cannot format disk.
|
||
|
||
|
||
Problem
|
||
Strictly speaking you format partitions not disks.
|
||
|
||
Solution
|
||
Make sure you add the partition number after the device name of
|
||
the disk, for instance /dev/hda1 to the command line.
|
||
|
||
|
||
|
||
15.2. During Booting
|
||
|
||
15.2.1. Booting fails
|
||
|
||
|
||
Symptoms
|
||
Numbers keep scrolling up the screen.
|
||
|
||
Problem
|
||
Possibly corrupt disk.
|
||
|
||
Solution
|
||
Try another disk, you might have to reinstall. Check for loose
|
||
cables and possible data corruption.
|
||
|
||
|
||
Symptoms
|
||
Get LI and then it hangs.
|
||
|
||
Problem
|
||
You use LILO to load Linux but LILO cannot find your root.
|
||
|
||
Solution
|
||
Read the LILO HOWTO.
|
||
|
||
|
||
Symptoms
|
||
Kernel panics, something about missing root file system.
|
||
|
||
Problem
|
||
The kernel does not know where the root partition is.
|
||
|
||
Solution
|
||
Use rdev or (if applicable) LILO to add information to the
|
||
kernel image where your root is.
|
||
|
||
|
||
|
||
15.2.2. Getting into Single User Mode
|
||
|
||
|
||
Symptoms
|
||
System boots but drops you into a root shell in single user mode.
|
||
|
||
Problem
|
||
Something went wrong in the later stages of booting and the
|
||
system has come far enough to let you open a shell to repair the
|
||
system.
|
||
|
||
Solution
|
||
Locate the problems from the boot log. Note that the file system
can be in read-only mode. Remount read-write if you have to. Often
the reason is that /etc/fstab contains an entry that is
mismapped, such as trying to mount a swap partition as your
normal file space.
|
||
|
||
15.3. During Running
|
||
|
||
15.3.1. Swap
|
||
|
||
|
||
Symptoms
|
||
Short on memory
|
||
|
||
Problem
|
||
Swap space is not available
|
||
|
||
Solution
|
||
Type free and check the output. If you get
|
||
|
||
|
||
             total       used       free     shared    buffers     cached
Mem:         46920      30136      16784       7480      11788       5764
-/+ buffers/cache:      12584      34336
Swap:       128484       9176     119308
|
||
|
||
|
||
|
||
then system is running normal. If the line with Swap: contains
|
||
zeros you have either not mounted the swap space (partition or swap
|
||
file) (see swapon(8)) or not formatted the swap space (see
|
||
mkswap(8)).
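
The check described above can also be scripted. This is a minimal
sketch that pulls the total from the Swap: line of free output; the
function name swap_total_kb is an invention for this example, and it
assumes the output has a line starting with "Swap:" followed by the
total, as in the listing shown here.

```shell
# swap_total_kb: hypothetical helper, reads the output of free on
# standard input and prints the "total" column of the Swap: line.
# Returns non-zero if no Swap: line is found.
swap_total_kb() {
    while read -r label total rest; do
        if [ "$label" = "Swap:" ]; then
            printf '%s\n' "$total"
            return 0
        fi
    done
    return 1
}
```

For instance, [ "$(free | swap_total_kb)" = "0" ] && echo "no swap
active" would flag the situation described in the solution.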

15.3.2. Partitions

Symptoms
    No room amidst plenty 1.

Problem
    Partitionitis: undersized partitions have caused overflow in
    some areas.

Solution
    Examine your partition usage using df(1) and locate problem
    areas. Normally the problem can be solved by removing old junk
    but you might have to repartition your system, see section
    ``Repartitioning''.
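
To locate problem areas without reading the whole listing, you could
filter the df output. The following sketch assumes the portable output
format of df -P (capacity in the fifth column, mount point in the
sixth); the function name full_partitions and the 90 percent limit in
the usage note are arbitrary choices for this example.

```shell
# full_partitions LIMIT: hypothetical helper, reads "df -P" output on
# standard input and prints "mountpoint percent%" for every file system
# whose usage is at or above LIMIT percent.
full_partitions() {
    limit=$1
    while read -r fs blocks used avail capacity mountpoint; do
        pct=${capacity%\%}                 # strip the trailing percent sign
        case "$pct" in
            ''|*[!0-9]*) continue ;;       # skip the header and odd lines
        esac
        if [ "$pct" -ge "$limit" ]; then
            printf '%s %s%%\n' "$mountpoint" "$pct"
        fi
    done
}
```

Running df -P | full_partitions 90 then lists only the partitions that
are at least 90 percent full.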

Symptoms
    No room amidst plenty 2.

Problem
    Running out of i-nodes has caused overflow in some areas, often
    in areas with many small files such as the news spool.

Solution
    Examine your i-node usage using df -i and locate problem
    areas. Normally the problem is solved by reformatting with a
    higher number of i-nodes, see mkfs(8) and related man pages.
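
As a rough worked example of what "a higher number of i-nodes" means:
for ext2 the number is normally chosen through a bytes-per-inode ratio
(the -i option of mke2fs), so the count is approximately the partition
size divided by that ratio. The sketch below does only this division;
the function name inode_estimate is made up, and a real mke2fs rounds
to its own internal layout, so treat the figures as estimates.

```shell
# inode_estimate SIZE_MB BYTES_PER_INODE
# Hypothetical helper: prints roughly how many i-nodes a partition of
# SIZE_MB megabytes gets with one i-node per BYTES_PER_INODE bytes.
inode_estimate() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}
```

With these assumptions a 500 MB news spool gets about 128000 i-nodes
at one i-node per 4096 bytes (inode_estimate 500 4096), and about
512000 at one per 1024 bytes (inode_estimate 500 1024).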

16. Further Information

There is a wealth of information one should go through when setting up
a major system, for instance for a news or general Internet service
provider. The FAQs in the following groups are useful:

16.1. News groups

Some of the most interesting news groups are:

* Storage <news:comp.arch.storage>.

* PC storage <news:comp.sys.ibm.pc.hardware.storage>.

* AFS <news:alt.filesystems.afs>.

* SCSI <news:comp.periphs.scsi>.

* Linux setup <news:comp.os.linux.setup>.

Most newsgroups have their own FAQ, designed to answer most of your
questions, as the name Frequently Asked Questions indicates. Fresh
versions should be posted regularly to the relevant newsgroups. If you
cannot find it in your news spool you could go directly to the FAQ
main archive FTP site <ftp://rtfm.mit.edu>. The WWW versions can be
browsed at the FAQ main archive WWW site
<http://www.faqs.org/faqs/FAQ-List.html>.

Some FAQs have their own home site; of particular interest here are

* SCSI FAQ <http://www.scsifaq.org/> and

* comp.arch.storage FAQ
  <http://alumni.caltech.edu/~rdv/comp-arch-storage/FAQ-1.html>.

16.2. Mailing Lists

These are low noise channels mainly for developers. Think twice before
asking questions there as noise delays the development. Some relevant
lists are linux-raid, linux-scsi and linux-ext2fs. Many of the most
useful mailing lists run on the vger.rutgers.edu server but this is
notoriously overloaded, so try to find a mirror. There are some lists
mirrored at The Redhat Home Page <http://www.redhat.com>. Many lists
are also accessible at linuxhq <http://www.linuxhq.com/lnxlists/>, and
the rest of the web site is a gold mine of useful information.

If you want to find out more about the lists available you can send a
message with the line lists to the list server at vger.rutgers.edu
(majordomo@vger.rutgers.edu). If you need help on how to use the mail
server just send the line help to the same address. Due to the
popularity of this server it is likely to take a bit of time before
you get a reply, or even before you get messages after you send a
subscribe command.

There are also a number of other majordomo list servers that can be of
interest, such as the EATA driver list (linux-eata@mail.uni-mainz.de)
and the Intelligent IO list (linux-i2o@dpt.com).

Mailing lists are in a state of flux but you can find links to a
number of interesting lists from the Linux Documentation Homepage
<http://www.linuxdoc.org/>.

16.3. HOWTO

These are intended as the primary starting points to get the
background information as well as to show you how to solve a specific
problem. Some relevant HOWTOs are Bootdisk, Installation, SCSI and
UMSDOS. The main site for these is the LDP archive
<http://www.linuxdoc.org/>.

There is a new HOWTO out that deals with setting up a DPT RAID
system, check out the DPT RAID HOWTO homepage
<http://www.ram.org/computing/linux/dpt_raid.html>.

16.4. Mini-HOWTO

These are the smaller free text relatives to the HOWTOs. Some
relevant mini-HOWTOs are Backup-With-MSDOS, Diskless, LILO, Large
Disk, Linux+DOS+Win95+OS2, Linux+OS2+DOS, Linux+Win95, NFS-Root,
Win95+Win+Linux and ZIP Drive. You can find these at the same place as
the HOWTOs, usually in a sub directory called mini. Note that these
are scheduled to be converted into SGML and become proper HOWTOs in
the near future.

The old Linux Large IDE mini-HOWTO is no longer valid, instead read
/usr/src/linux/drivers/block/README.ide or
/usr/src/linux/Documentation/ide.txt.

16.5. Local Resources

Most distributions of Linux install a documentation directory; have a
look in /usr/doc, where most packages store their main documentation,
README files etc. Here you will also find the HOWTO archive
(/usr/doc/HOWTO) of ready formatted HOWTOs and the mini-HOWTO archive
(/usr/doc/HOWTO/mini <file:///usr/doc/HOWTO/mini>) of plain text
documents.

Many of the configuration files mentioned earlier can be found in the
/etc directory. In particular you will want to work with the
/etc/fstab file that sets up the mounting of partitions and possibly
also the /etc/mdtab file that is used by the md system to set up RAID.

The kernel source in /usr/src/linux <file:///usr/src/linux> is, of
course, the ultimate documentation. In other words, use the source,
Luke. It should also be pointed out that the kernel comes not only
with source code which is even commented (well, partially at least)
but also an informative documentation directory
<file:///usr/src/linux/Documentation>. If you are about to ask any
questions about the kernel you should read this first; it will save
you and many others a lot of time and possibly embarrassment.

Also have a look in your system log file (/var/log/messages) to see
what is going on and in particular how the booting went if too much
scrolled off your screen. Using tail -f /var/log/messages in a
separate window or screen will give you a continuous update of what is
going on in your system.

You can also take advantage of the /proc file system that is a window
into the inner workings of your system. Use cat rather than more to
view the files as they are reported as being zero length. Reports are
that less works well here.

16.6. Web Pages

There is a huge number of informative web pages out there and by their
very nature they change quickly, so don't be too surprised if these
links quickly become outdated.

A good starting point is of course the Linux Documentation Homepage
<http://www.linuxdoc.org/>, an information central for documentation,
project pages and much, much more.

* Mike Neuffer, the author of the DPT caching RAID controller
  drivers, has some interesting pages on SCSI
  <http://www.uni-mainz.de/~neuffer/scsi/> and DPT
  <http://www.uni-mainz.de/~neuffer/scsi/dpt/>.

* Software RAID development information can be found at the Linux
  Kernel site <http://www.kernel.org/> along with patches and
  utilities.

* Disk related information on benchmarking, RAID, reliability and
  much, much more can be found at the Linas Vepstas
  <http://linas.org> project page.

* There is also information available on how to RAID the root
  partition <ftp://ftp.bizsystems.com/pub/raid/Root-RAID-HOWTO.html>
  and what software packages are needed to achieve this.

* In depth documentation on ext2fs
  <http://step.polymtl.ca/~ldd/ext2fs/ext2fs_toc.html> is also
  available.

* People who are looking for information on VFAT, FAT32 and Joliet
  could have a look at the development page
  <http://bmrc.berkeley.edu/people/chaffee/index.html>. These
  drivers are in the 2.1.x kernel development series as well as in
  2.0.34 and later.

For diagrams and information on all sorts of disk drives, controllers
etc., both for current and discontinued lines, The Ref
<http://theref.aquascape.com/theref.html> is the site you need. There
is a lot of useful information here, a real treasure trove.

Please let me know if you have any other leads that can be of
interest.

16.7. Search Engines

When all else fails try the Internet search engines. There is a huge
number of them, all a little different from each other. It falls
outside the scope of this HOWTO to describe how best to use them.
Instead you could turn to the Troubleshooting on the Internet
mini-HOWTO, and the Updated mini-HOWTO.

If you have to ask for help you are most likely to get help in the
Linux Setup <news:comp.os.linux.setup> news group. Due to large
workload and a slow network connection I am not able to follow that
newsgroup so if you want to contact me you have to do so by e-mail.

17. Getting Help

In the end you might find yourself unable to solve your problems and
need help from someone else. The most efficient way is to ask someone
local or in your nearest Linux user group; search the web for the
nearest one.

Another possibility is to ask on Usenet News in one of the many, many
newsgroups available. The problem is that these have such a high
volume and noise (called low signal-to-noise ratio) that your question
can easily fall through unanswered.

No matter where you ask it is important to ask well or you will not be
taken seriously. Saying just "my disk does not work" is not going to
help you; instead it raises the noise level even further, and if you
are lucky someone will ask you to clarify.

Instead describe your problems in enough detail to enable people to
help you. The problem could lie somewhere you did not expect.
Therefore you are advised to list the following information on your
system:

Hardware

    * Processor

    * DMA

    * IRQ

    * Chip set (LX, BX etc)

    * Bus (ISA, VESA, PCI etc)

    * Expansion cards used (Disk controllers, video, IO etc)

Software

    * BIOS (On motherboard and possibly SCSI host adapters)

    * LILO, if used

    * Linux kernel version as well as possible modifications and
      patches

    * Kernel parameters, if any

    * Software that shows the error (with version number or date)

Peripherals

    * Type of disk drives with manufacturer name, version and type

    * Other relevant peripherals connected to the same busses

As an example of how interrelated these problems are: an old chip set
caused problems with a certain combination of video controller and
SCSI host adapter.

Remember that booting text is logged to /var/log/messages which can
answer most of the questions above. Obviously if the drives fail you
might not be able to get the log saved to disk but you can at least
scroll back up the screen using the SHIFT and PAGE UP keys. It may
also be useful to include part of this in your request for help but do
not go overboard; keep it brief, as a complete log file dumped to
Usenet News is more than a little annoying.
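
Much of the checklist above can be collected in one go from the /proc
file system. The following is only a sketch: the function names
show_file and system_report are invented here, and which /proc files
exist depends on your kernel version and hardware, so files that are
missing are silently skipped.

```shell
# show_file PATH: hypothetical helper, prints PATH under a small
# header if the file is readable and prints nothing otherwise.
show_file() {
    if [ -r "$1" ]; then
        printf '== %s ==\n' "$1"
        cat "$1"
    fi
}

# system_report: hypothetical helper, gathers kernel version,
# processor, IRQ, DMA, I/O port, SCSI and kernel parameter information.
# The list of files is an assumption; adjust it to your system.
system_report() {
    uname -a
    for f in /proc/cpuinfo /proc/interrupts /proc/dma /proc/ioports \
             /proc/scsi/scsi /proc/cmdline; do
        show_file "$f"
    done
}
```

Running system_report > report.txt gives you a file you can trim
down before posting, in keeping with the advice to stay brief.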

18. Concluding Remarks

Disk tuning and partition decisions are difficult to make, and there
are no hard rules here. Nevertheless it is a good idea to work more on
this as the payoffs can be considerable. Maximizing usage on one drive
only while the others are idle is unlikely to be optimal; watch the
drive lights, they are not there just for decoration. For a properly
set up system the lights should look like Christmas in a disco. Linux
offers software RAID but also support for some hardware based SCSI
RAID controllers. Check what is available. As your system and
experience evolve you are likely to repartition and you might look at
this document again. Additions are always welcome.

Finally I'd like to sum up my recommendations:

* Disks are cheap but the data they contain could be much more
  valuable; use and test your backup system.

* Work is also expensive; make sure you get large enough disks as
  refitting new or repartitioning old disks takes time.

* Think reliability; replace old disks before they fail.

* Keep a paper copy of your setup; having it all on disk when the
  machine is down will not help you much.

* Start out with a simple design with a minimum of fancy technology
  and rather fit it in later. In general adding is easier than
  replacing, be it disks, technology or other features.

18.1. Coming Soon

There are a few more important things that are about to appear here.
In particular I will add more example tables as I am about to set up
two fairly large and general systems, one at work and one at home.
These should give some general feeling on how a system can be set up
for either of these two purposes. Examples of smooth running existing
systems are also welcome.

There is also a fair bit of work left to do on the various kinds of
file systems and utilities.

There will be a big addition on drive technologies coming soon as well
as a more in depth description on using fdisk, cfdisk and sfdisk. The
file systems will be beefed up as more features become available as
well as more on RAID and what directories can benefit from what RAID
level.

There is some minor overlapping with the Linux Filesystem Structure
Standard and FHS that I hope to integrate better soon, which will
probably mean a big reworking of all the tables at the end of this
document.

As more people start reading this I should get some more comments and
feedback. I am also thinking of making a program that can automate a
fair bit of this decision making process and although it is unlikely
to be optimum it should provide a simpler, more complete starting
point.

18.2. Request for Information

It has taken a fair bit of time to write this document and although
most pieces are beginning to come together there is still some
information needed before we are out of the beta stage.

* More information on swap sizing policies is needed as well as
  information on the largest swap size possible under the various
  kernel versions.

* How common is drive or file system corruption? So far I have only
  heard of problems caused by flaky hardware.

* References to speed and drives are needed.

* Are any other Linux compatible RAID controllers available?

* What relevant monitoring, management and maintenance tools are
  available?

* General references to information sources are needed, perhaps this
  should be a separate document?

* Usage of /tmp and /var/tmp has been hard to determine, in fact what
  programs use which directory is not well defined and more
  information here is required. Still, it seems at least clear that
  these should reside on different physical drives in order to
  increase parallelism.

18.3. Suggested Project Work

Now and then people post on comp.os.linux.*, looking for good project
ideas. Here I will list a few that come to mind that are relevant to
this document. Plans about big projects such as new file systems
should still be posted in order to either find co-workers or see if
someone is already working on it.

Planning tools
    that can automate the design process outlined earlier would
    probably make a medium sized project, perhaps as an exercise in
    constraint based programming.

Partitioning tools
    that take the output of the previously mentioned program and
    format drives in parallel and apply the appropriate symbolic
    links to the directory structure. It would probably be best if
    this were integrated in existing system installation software.
    The drive partitioning setup used in Solaris is an example of
    what it can look like.

Surveillance tools
    that keep an eye on the partition sizes and warn before a
    partition overflows.

Migration tools
    that safely let you move old structures to new (for instance
    RAID) systems. This could probably be done as a shell script
    controlling a backup program and would be rather simple. Still,
    be sure it is safe and that the changes can be undone.

19. Questions and Answers

This is just a collection of what I believe are the most common
questions people might have. Give me more feedback and I will turn
this section into a proper FAQ.

* Q: How many physical disk drives (spindles) does a Linux system
  need?

  A: Linux can run just fine on one drive (spindle). Having enough
  RAM (around 32 MB, and up to 64 MB) to support swapping is a better
  price/performance choice than getting a second disk. (E)IDE disks
  are usually cheaper (but a little slower) than SCSI.

* Q: I have a single drive, will this HOWTO help me?

  A: Yes, although only to a minor degree. Still, section ``Physical
  Track Positioning'' will offer you some gains.

* Q: Are there any disadvantages in this scheme?

  A: There is only a minor snag: if even a single partition overflows
  the system might stop working properly. The severity depends of
  course on what partition is affected. Still this is not hard to
  monitor, the command df gives you a good overview of the situation.
  Also check the swap partition(s) using free to make sure you are
  not about to run out of virtual memory.

* Q: OK, so should I split the system into as many partitions as
  possible for a single drive?

  A: No, there are several disadvantages to that. First of all
  maintenance becomes needlessly complex and you gain very little in
  this. In fact if your partitions are too big you will seek across
  larger areas than needed. This is a balance and dependent on the
  number of physical drives you have.

* Q: Does that mean more drives allow more partitions?

  A: To some degree, yes. Still, some directories should not be split
  off from root, check out the file system standards for more
  details.

* Q: What if I have many drives I want to use?

  A: If you have more than 3-4 drives you should consider using RAID
  of some form. Still, it is a good idea to keep your root partition
  on a simple partition without RAID, see section ``RAID'' for more
  details.

* Q: I have installed the latest Windows95 but cannot access this
  partition from within the Linux system, what is wrong?

  A: Most likely you are using FAT32 in your Windows partition. It
  seems that Microsoft decided we needed yet another format, and this
  was introduced in their latest version of Windows95, called OSR2.
  The advantage is that this format is better suited to large drives.

  You might also be interested to hear that Microsoft NT 4.0 does not
  support it yet either.

* Q: I cannot get the disk size and partition sizes to match,
  something is missing. What has happened?

  A: It is possible you have mounted a partition onto a mount point
  that was not an empty directory. Mount points are directories, and
  if one is not empty the mounting will mask its contents. If you do
  the sums you will see the amount of disk space used in this
  directory is missing from the observed total.

  To solve this you can boot from a rescue disk and see what is
  hiding behind your mount points and remove or transfer the contents
  by mounting the offending partition on a temporary mounting point.
  You might find it useful to have "spare" emergency mounting points
  ready made.
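
The masking problem in the previous answer is easy to prevent by
checking that a mount point is empty before mounting onto it. Here is
a minimal sketch; the function name is_empty_dir is made up for this
example, and it relies on ls -A to list hidden files as well.

```shell
# is_empty_dir DIR: hypothetical helper, succeeds if DIR is a directory
# with no entries (hidden files count as entries). Returns 2 if DIR is
# not a directory at all and 1 if it contains something.
is_empty_dir() {
    [ -d "$1" ] || return 2
    [ -z "$(ls -A "$1")" ]
}
```

Before a mount you could run, for instance,
is_empty_dir /home || echo "warning: /home is not empty", and
look into the directory before its contents get hidden.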

* Q: It doesn't look like my swap partition is in use, how come?

  A: It is possible that it has not been necessary to swap out,
  especially if you have plenty of RAM. Check your log files to see
  if you ran out of memory at one point or another, in that case your
  swap space should have been put to use. If not it is possible that
  either the swap partition was not assigned the right number, that
  you did not prepare it with mkswap or that you have not done swapon
  or added it to your /etc/fstab file.

* Q: What is this Nyx that is mentioned several times here?

  A: It is a large free Unix system with currently about 10000 users.
  I use it for my web pages for this HOWTO as well as a source of
  ideas for a setup of large Unix systems. It has been running for
  many years and has a quite stable setup. For more information you
  can view the Nyx homepage <http://www.nyx.net> which also gives you
  information on how to get your own free account.

20. Bits and Pieces

This is basically a section where I stuff all the bits that I feel
are worth knowing about but have not yet decided where to put. It is
a kind of transient area.

20.1. Swap Partition: to Use or Not to Use

In many cases you do not need a swap partition, for instance if you
have plenty of RAM, say, more than 64 MB, and you are the sole user of
the machine. In this case you can experiment running without a swap
partition and check the system logs to see if you ran out of virtual
memory at any point.

Removing swap partitions has two advantages:

* you save disk space (rather obvious really)

* you save seek time as swap partitions otherwise would lie in the
  middle of your disk space.

In the end, having a swap partition is like having a heated toilet:
you do not use it very often, but you sure appreciate it when you
require it.

20.2. Mount Point and /mnt

In an earlier version of this document I proposed to put all
permanently mounted partitions under /mnt. That, however, is not such
a good idea as this itself can be used as a mount point, which leads
to all mounted partitions becoming unavailable. Instead I will propose
mounting straight from root using a meaningful name like
/mnt.descriptive-name.

Lately I have become aware that some Linux distributions use mount
points at subdirectories under /mnt, such as /mnt/floppy and
/mnt/cdrom, which just shows how confused the whole issue is.
Hopefully FHS should clarify this.

20.3. Power and Heating

Not many years ago a machine with the equivalent power of a modern PC
required 3-phase power and cooling, usually by air conditioning the
machine room, sometimes also by water cooling. Technology has
progressed very quickly giving not only high speed but also low power
components. Still, there is a definite limit to the technology,
something one should keep in mind as the system is expanded with yet
another disk drive or PCI card. When the power supply is running at
full rated power, keep in mind that all this energy is going
somewhere, mostly into heat. Unless this is dissipated using fans you
will get serious heat buildup inside the cabinet, followed by reduced
reliability and shorter life time of the electronics. Manufacturers
state minimum cooling requirements for their drives, usually in terms
of cubic feet per minute (CFM). You are well advised to take this
seriously.

Keep air flow passages open, clean out dust and check the temperature
of your system while it is running. If it is too hot to touch it is
probably running too hot.

If possible use sequential spin up for the drives. It is during spin
up, when the drive platters accelerate up to normal speed, that a
drive consumes maximum power and if all drives start up simultaneously
you could go beyond the rated power maximum of your power supply.
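
The difference between simultaneous and sequential spin up is easy to
put in numbers. The sketch below is a simplified model, not a
measurement: it assumes every drive draws the same spin up and running
power, and the function name peak_watts and the wattages in the usage
note are hypothetical. Use the figures from your drive data sheets.

```shell
# peak_watts DRIVES SPINUP_W RUNNING_W MODE
# Hypothetical helper: prints the peak power drawn by the drives alone.
#   simultaneous: all drives spin up at the same time
#   sequential:   one drive spins up while the others already run
peak_watts() {
    case "$4" in
        simultaneous) echo $(( $1 * $2 )) ;;
        sequential)   echo $(( ($1 - 1) * $3 + $2 )) ;;
        *)            return 1 ;;
    esac
}
```

With six drives at an assumed 30 W during spin up and 10 W when
running, peak_watts 6 30 10 simultaneous gives 180 W, while
peak_watts 6 30 10 sequential gives 80 W.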

20.4. Deja

This is an Internet system that no doubt most of you are familiar
with. It searches and serves Usenet News articles from 1995 up to
the latest postings and also offers a web based reading and posting
interface. There is a lot more, check out Deja <http://www.deja.com>
for more information. It changed name from Dejanews.

What perhaps is less known is that they use about 120 Linux SMP
computers, many of which use the md module to manage between 4 and 24
Gig of disk space (over 1200 Gig altogether) for this service. The
system is continuously growing but at the time of writing they use
mostly dual Pentium Pro 200 MHz and Pentium II 300 MHz systems with
256 MB RAM or more.

A production database machine normally has 1 disk for the operating
system and between 4 and 6 disks managed by the md module where the
articles are archived. The drives are connected to BusLogic Model
BT-946C and BT-958 PCI SCSI adapters, usually one to a machine.

For the production systems (which are up 365 days a year) the downtime
attributable to disk errors is less than 0.25 % (that is a quarter of
1%, not 25%).
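
To get a feeling for what a quarter of one percent means over a year,
the percentage can be converted into minutes. This is plain
arithmetic; the function name downtime_minutes is made up, and the
percentage is given in hundredths of a percent so that the shell can
stay with whole numbers.

```shell
# downtime_minutes HUNDREDTHS_OF_PERCENT
# Hypothetical helper: prints the minutes per 365 day year that
# correspond to the given downtime (25 means 0.25 %).
downtime_minutes() {
    echo $(( 365 * 24 * 60 * $1 / 10000 ))
}
```

Here downtime_minutes 25 gives 1314, so 0.25 % corresponds to at most
1314 minutes, a little under 22 hours, per year.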

Just in case: this is not an advertisement, it is stated as an example
of how much is required for what is a major Internet service.

20.5. Crash Recovery

Occasionally hard disks crash. A crash causing data scrambling can
often be at least partially recovered from and there are already
HOWTOs describing this.

In case of hardware failure things are far more serious, and you have
two options: either send the drive to a professional data recovery
company, or try recovering yourself. The latter is of course high risk
and can cause more damage.

If a disk stops rotating or fails to spin up, the number one advice is
first to turn off the system as fast as safely possible.

Next you could try disconnecting the drives and powering up the
machine, just to check with a multimeter that power is present. Quite
often connectors can get unseated and cause all sorts of problems.

If you decide to risk trying it yourself you could check all
connectors and then reapply power and see if the drive spins up and
responds. If it still is dead turn off power quickly, preferably
before the operating system boots. Make sure that delayed spinup is
not deceiving you here.

If you decide to progress even further (and take higher risks) you
could remove the drive and give it a firm tap on the side so that the
disk moves a little with respect to the casing. This can help in
unsticking the head from the surface, allowing the platter to move
freely, as the motor power is not sufficient to unstick a stuck head
on its own.

Also if a drive has been turned off for a while after running for long
periods of time, or if it has overheated, the lubricant can harden or
drain out of the bearings. In this case warming the drive slowly and
gently up to normal operating temperature will possibly recover the
lubrication problems.

If after this the drive still does not respond the last possible and
the highest risk suggestion is to replace the circuit board of the
drive with a board from an identical model drive.

Often the contents of a drive are worth far more than the media
itself, so do consider professional help. These companies have
advanced equipment and know-how obtained from the manufacturers on how
to recover a damaged drive, far beyond that of a hobbyist.

21. Appendix A: Partitioning Layout Table: Mounting and Linking

The following table is designed to make layout a simpler paper and
pencil exercise. It is probably best to print it out (using NON
PROPORTIONAL fonts) and adjust the numbers until you are happy with
them.

Mount point is what directory you wish to mount a partition on or the
actual device. This is also a good place to note how you plan to use
symbolic links.

The size given corresponds to a fairly big Debian 1.2.6 installation.
Other examples are coming later.

Mainly you use this table to select what structure and drives you will
use, the partition numbers and letters will come from the next two
tables.

  Directory       Mount point    speed   seek    transfer  size    SIZE


  swap            __________     ooooo   ooooo   ooooo     32      ____

  /               __________     o       o       o         20      ____

  /tmp            __________     oooo    oooo    oooo              ____

  /var            __________     oo      oo      oo        25      ____
  /var/tmp        __________     oooo    oooo    oooo              ____
  /var/spool      __________                                       ____
  /var/spool/mail __________     o       o       o                 ____
  /var/spool/news __________     ooo     ooo     oo                ____
  /var/spool/____ __________     ____    ____    ____              ____

  /home           __________     oo      oo      oo                ____

  /usr            __________                               500     ____
  /usr/bin        __________     o       oo      o         250     ____
  /usr/lib        __________     oo      oo      ooo       200     ____
  /usr/local      __________                                       ____
  /usr/local/bin  __________     o       oo      o                 ____
  /usr/local/lib  __________     oo      oo      ooo               ____
  /usr/local/____ __________                                       ____
  /usr/src        __________     o       oo      o         50      ____

  DOS             __________     o       o       o                 ____
  Win             __________     oo      oo      oo                ____
  NT              __________     ooo     ooo     ooo               ____

  /mnt._________  __________     ____    ____    ____              ____
  /mnt._________  __________     ____    ____    ____              ____
  /mnt._________  __________     ____    ____    ____              ____
  /_____________  __________     ____    ____    ____              ____
  /_____________  __________     ____    ____    ____              ____
  /_____________  __________     ____    ____    ____              ____


  Total capacity:

22. Appendix B: Partitioning Layout Table: Numbering and Sizing

This table follows the same logical structure as the table above where
you decided what disk to use. Here you select the physical tracking,
keeping in mind the effect of track positioning mentioned earlier in
``Physical Track Positioning''.

The final partition number will come out of the table after this.

  Drive             sda    sdb    sdc    hda    hdb    hdc    ___

  SCSI ID         |  __  |  __  |  __  |

  Directory
  swap            |      |      |      |      |      |      |

  /               |      |      |      |      |      |      |

  /tmp            |      |      |      |      |      |      |

  /var            :      :      :      :      :      :      :
  /var/tmp        |      |      |      |      |      |      |
  /var/spool      :      :      :      :      :      :      :
  /var/spool/mail |      |      |      |      |      |      |
  /var/spool/news :      :      :      :      :      :      :
  /var/spool/____ |      |      |      |      |      |      |

  /home           |      |      |      |      |      |      |

  /usr            |      |      |      |      |      |      |
  /usr/bin        :      :      :      :      :      :      :
  /usr/lib        |      |      |      |      |      |      |
  /usr/local      :      :      :      :      :      :      :
  /usr/local/bin  |      |      |      |      |      |      |
  /usr/local/lib  :      :      :      :      :      :      :
  /usr/local/____ |      |      |      |      |      |      |
  /usr/src        :      :      :      :      :      :      :

  DOS             |      |      |      |      |      |      |
  Win             :      :      :      :      :      :      :
  NT              |      |      |      |      |      |      |

  /mnt.___/_____  |      |      |      |      |      |      |
  /mnt.___/_____  :      :      :      :      :      :      :
  /mnt.___/_____  |      |      |      |      |      |      |
  /_____________  :      :      :      :      :      :      :
  /_____________  |      |      |      |      |      |      |
  /_____________  :      :      :      :      :      :      :


  Total capacity:

23. Appendix C: Partitioning Layout Table: Partition Placement

This is just to sort the partition numbers in ascending order, ready to
input to fdisk or cfdisk. Here you take physical track positioning
into account when finalizing your design. Unless you have specific
information to the contrary, you can assume track 0 is the outermost
track.

These numbers and letters are then used to update the previous tables,
all of which you will find very useful in later maintenance.

In case of a disk crash you might find it handy to know which SCSI ID
belongs to which drive, so consider keeping a paper copy of this.


Drive :            sda     sdb     sdc     hda     hdb     hdc     ___

Total capacity: |  ___  |  ___  |  ___  |  ___  |  ___  |  ___  |  ___
SCSI ID         |  __   |  __   |  __   |

Partition

1               |       |       |       |       |       |       |
2               :       :       :       :       :       :       :
3               |       |       |       |       |       |       |
4               :       :       :       :       :       :       :
5               |       |       |       |       |       |       |
6               :       :       :       :       :       :       :
7               |       |       |       |       |       |       |
8               :       :       :       :       :       :       :
9               |       |       |       |       |       |       |
10              :       :       :       :       :       :       :
11              |       |       |       |       |       |       |
12              :       :       :       :       :       :       :
13              |       |       |       |       |       |       |
14              :       :       :       :       :       :       :
15              |       |       |       |       |       |       |
16              :       :       :       :       :       :       :


24. Appendix D: Example: Multipurpose Server

The following table is from the setup of a medium sized multipurpose
server where I once worked. Aside from being a general Linux machine
it also serves as a network server (DNS, mail, FTP, news, printers
etc.), an X server for various CAD programs, a CD-ROM burner and many
other things. The files reside on 3 SCSI drives with capacities of
600, 1000 and 1300 MB.

Some further speed could possibly be gained by splitting /usr/local
from the rest of the /usr system, but we deemed the added complexity
would not be worth it. With another couple of drives this could be
more worthwhile. In this setup drive sda is old and slow and could
just as well be replaced by an IDE drive. The other two drives are
both rather fast, so basically we split most of the load between these
two. To reduce the danger of imbalance in partition sizing we decided
to keep /usr/bin and /usr/local/bin on one drive and /usr/lib and
/usr/local/lib on a separate drive, which also affords us some drive
parallelizing.

Even more could be gained by using RAID, but we felt that as a server
we needed more reliability than was then afforded by the md patch, and
a dedicated RAID controller was out of our reach.


25. Appendix E: Example: Mounting and Linking


Directory       Mount point     speed   seek    transfer    size    SIZE


swap            sdb2, sdc2      ooooo   ooooo   ooooo       32      2x64

/               sda2            o       o       o           20      100

/tmp            sdb3            oooo    oooo    oooo                300

/var            __________      oo      oo      oo                  ____
/var/tmp        sdc3            oooo    oooo    oooo                300
/var/spool      sdb1                                                436
/var/spool/mail __________      o       o       o                   ____
/var/spool/news __________      ooo     ooo     oo                  ____
/var/spool/____ __________      ____    ____    ____                ____

/home           sda3            oo      oo      oo                  400

/usr            sdb4                                        230     200
/usr/bin        __________      o       oo      o           30      ____
/usr/lib        -> libdisk      oo      oo      ooo         70      ____
/usr/local      __________                                          ____
/usr/local/bin  __________      o       oo      o                   ____
/usr/local/lib  -> libdisk      oo      oo      ooo                 ____
/usr/local/____ __________                                          ____
/usr/src        ->/home/usr.src o       oo      o           10      ____

DOS             sda1            o       o       o                   100
Win             __________      oo      oo      oo                  ____
NT              __________      ooo     ooo     ooo                 ____

/mnt.libdisk    sdc4            oo      oo      ooo                 226
/mnt.cd         sdc1            o       o       oo                  710


Total capacity: 2900 MB


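The "-> libdisk" entries in the table above mean that the directory
lives on the partition mounted at /mnt.libdisk and is reached through a
symbolic link. A minimal sketch of that step is shown here. The
function name relocate_dir and the target name usr.lib are assumptions
made for this illustration and are not taken from the original setup.
It is probably wise to do this while the system is otherwise idle.

```shell
#!/bin/sh
# relocate_dir SRC DST  (hypothetical helper, not part of the HOWTO)
# Copy the directory SRC to DST on another file system, then replace
# SRC with a symbolic link pointing at DST.
relocate_dir() {
    src=$1
    dst=$2
    # Only move real directories, and never overwrite an existing target.
    [ -d "$src" ] && [ ! -L "$src" ] || { echo "$src: not a real directory" >&2; return 1; }
    [ ! -e "$dst" ] || { echo "$dst: already exists" >&2; return 1; }
    # cp -a preserves owners, modes, timestamps and symbolic links.
    cp -a "$src" "$dst" || return 1
    # Keep the original under a new name until the link is in place.
    mv "$src" "$src.old" || return 1
    ln -s "$dst" "$src" || return 1
    rm -rf "$src.old"
}

# Example (as root): relocate_dir /usr/lib /mnt.libdisk/usr.lib
```

The old copy is only removed after the link has been created, so a
failure half way through leaves the data in place under SRC.old.
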
26. Appendix F: Example: Numbering and Sizing

Here we adjust the sizes and positioning. All sizes are in MB.


Directory          sda     sdb     sdc


swap            |       |  64   |  64   |

/               |  100  |       |       |

/tmp            |       |  300  |       |

/var            :       :       :       :
/var/tmp        |       |       |  300  |
/var/spool      :       :  436  :       :
/var/spool/mail |       |       |       |
/var/spool/news :       :       :       :
/var/spool/____ |       |       |       |

/home           |  400  |       |       |

/usr            |       |  200  |       |
/usr/bin        :       :       :       :
/usr/lib        |       |       |       |
/usr/local      :       :       :       :
/usr/local/bin  |       |       |       |
/usr/local/lib  :       :       :       :
/usr/local/____ |       |       |       |
/usr/src        :       :       :       :

DOS             |  100  |       |       |
Win             :       :       :       :
NT              |       |       |       |

/mnt.libdisk    |       |       |  226  |
/mnt.cd         :       :       :  710  :
/mnt.___/_____  |       |       |       |


Total capacity: |  600  | 1000  | 1300  |


27. Appendix G: Example: Partition Placement

This is just to sort the partition numbers in ascending order ready to
input to fdisk or cfdisk. Remember to optimize for physical track
positioning (not done here).


Drive :            sda     sdb     sdc

Total capacity: |  600  | 1000  | 1300  |

Partition

1               |  100  |  436  |  710  |
2               :  100  :  64   :  64   :
3               |  400  |  300  |  300  |
4               :       :  200  :  226  :


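Before typing such figures into fdisk it is worth checking that the
partitions on each drive add up to no more than the capacity of the
drive. The sketch below does this for the three drives in the table
above. The function name check_layout and the input format (drive
name, capacity, then the partition sizes, all in MB) are assumptions
made for this illustration.

```shell
#!/bin/sh
# check_layout: read lines of "drive capacity size1 size2 ..." (MB)
# from standard input and report used and free space per drive.
# The input format is an assumption made for this sketch.
check_layout() {
    awk '{
        used = 0
        for (i = 3; i <= NF; i++) used += $i
        verdict = (used > $2) ? "TOO BIG" : "ok"
        printf "%s: %d of %d MB used, %d MB free, %s\n", $1, used, $2, $2 - used, verdict
    }'
}

# The example server from the table above.
check_layout <<EOF
sda  600 100 100 400
sdb 1000 436  64 300 200
sdc 1300 710  64 300 226
EOF
```

For the example server every drive comes out exactly full, which is
what the tables intend.
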
28. Appendix H: Example II

The following is an example of a server setup in an academic setting,
and is contributed by nakano (at) apm.seikei.ac.jp. I have only done
minor editing to this section.

/var/spool/delegate is a directory for storing logs and cache files of
a WWW proxy server program, "delegated". Since I have not announced it
widely, there are currently 1000--1500 requests/day, and average disk
usage is 15--30% with the caches expired each day.

/mnt.archive is used for data files which are big and not frequently
referenced, such as experimental data (especially graphic ones),
various source archives, and Win95 backups (growing very fast...).

/mnt.root is a backup root file system containing rescue utilities. A
boot floppy is also prepared to boot with this partition.


=================================================
Directory                  sda     sdb     hda

swap                    |  64   |  64   |       |
/                       |       |       |  20   |
/tmp                    |       |       |  180  |

/var                    :  300  :       :       :
/var/tmp                |       |  300  |       |
/var/spool/delegate     |  300  |       |       |

/home                   |       |       |  850  |
/usr                    |  360  |       |       |
/usr/lib                -> /mnt.lib/usr.lib
/usr/local/lib          -> /mnt.lib/usr.local.lib

/mnt.lib                |       |  350  |       |
/mnt.archive            :       : 1300  :       :
/mnt.root               |       |  20   |       |

Total capacity:           1024    2034    1050


=================================================
Drive :                    sda     sdb     hda
Total capacity:         | 1024  | 2034  | 1050  |

Partition
1                       |  300  |  20   |  20   |
2                       :  64   : 1300  :  180  :
3                       |  300  |  64   |  850  |
4                       :  360  :  ext  :       :
5                       |       |  300  |       |
6                       :       :  350  :       :


Filesystem   1024-blocks    Used Available Capacity  Mounted on
/dev/hda1          19485   10534      7945      57%  /
/dev/hda2         178598      13    169362       0%  /tmp
/dev/hda3         826640  440814    343138      56%  /home
/dev/sda1         306088   33580    256700      12%  /var
/dev/sda3         297925   47730    234807      17%  /var/spool/delegate
/dev/sda4         363272  170872    173640      50%  /usr
/dev/sdb5         297598       2    282228       0%  /var/tmp
/dev/sdb2        1339248  302564    967520      24%  /mnt.archive
/dev/sdb6         323716   78792    228208      26%  /mnt.lib


Apparently /tmp and /var/tmp are too big. These directories will be
merged into one partition when disk space runs short.

/mnt.lib also seems to be too big, but I plan to install newer TeX and
ghostscript archives, so /usr/local/lib may grow by about 100 MB or so
(since we must use Japanese fonts!).

The whole system is backed up by a Seagate Tapestore 8000 (Travan TR-4,
4G/8G).


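Oversized partitions such as /tmp and /var/tmp in this example can be
picked out of a df listing mechanically. The following sketch prints
every file system whose usage is below a given percentage. The
function name underused and the threshold of 5 percent are assumptions
chosen for this illustration.

```shell
#!/bin/sh
# underused LIMIT: read df output on standard input and print the
# mount point and usage of every file system that is filled to less
# than LIMIT percent.  Expects the six column layout produced by
# "df -P", with one line per file system.
underused() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)              # "57%" becomes "57"
        if (pct + 0 < limit + 0) print $6, $5
    }'
}

# Usage: df -P | underused 5
```

Fed the df listing of this example, "underused 5" prints the lines for
/tmp and /var/tmp, both at 0%.
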
29. Appendix I: Example III: SPARC Solaris

The following section is the basic design used at work for a number of
Sun SPARC servers running Solaris 2.5.1 in an industrial development
environment. It serves a number of database and CAD applications in
addition to the normal services such as mail.

Simplicity is emphasized here, so /usr/lib has not been split off from
/usr.

This is the basic layout, planned for about 100 users.


Drive:      SCSI 0                      SCSI 1

Partition   Size (MB)   Mount point     Size (MB)   Mount point

0           160         swap            160         swap
1           100         /tmp            100         /var/tmp
2           400         /usr
3           100         /
4           50          /var
5
6           remainder   /local0         remainder   /local1


Due to specific requirements at this site it is at times necessary to
have large partitions available at short notice. Therefore drive 0 is
given as many tasks as feasible, leaving a large /local1 partition.

This setup has been in use for some time now and has been found
satisfactory.

For a more general and balanced system it would be better to swap /tmp
and /var/tmp and then move /var to drive 1.


30. Appendix J: Example IV: Server with 4 Drives

This gives an example of using all the techniques described earlier,
short of RAID. It is admittedly rather complicated but offers in return
high performance from modest hardware. Sizes are omitted here, but
reasonable figures can be found in the previous examples.


Partition       sda             sdb             sdc             sdd
                ----            ----            ----            ----
1               root            overview        lib             news
2               swap            swap            swap            swap
3               home            /usr            /var/tmp        /tmp
4                               spare root      mail            /var


The setup is optimised with respect to track positioning but also for
minimising drive seeks.

If you want DOS or Windows too, you will have to use sda1 for this and
move the other partitions after it. It will be advantageous to use
the swap partitions on sdb2, sdc2 and sdd2 for Windows swap, TEMPDIR
and the Windows temporary directory during those sessions. A number of
other HOWTOs describe how you can make several operating systems
coexist on your machine.


For completeness a 4 drive example using several types of RAID is also
given, which is even more complex than the example above.


Partition       sda             sdb             sdc             sdd
                ----            ----            ----            ----
1               boot            overview        news            news
2               overview        swap            swap            swap
3               swap            lib             lib             lib
4               lib             overview        /tmp            /tmp
5               /var/tmp        /var/tmp        mail            /usr
6               /home           /usr            /usr            mail
7               /usr            /home           /var
8               / (root)        spare root


Here all duplicates are parts of a RAID 0 set, with two exceptions:
swap, which is interleaved, and home and mail, which are implemented as
RAID 1 for safety.

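Swap is interleaved by the kernel itself rather than through a RAID
set: when several swap areas have the same priority, pages are
allocated from them in round-robin fashion. The sketch below generates
matching /etc/fstab lines for the four swap partitions in the table
above. The function name swap_fstab_lines and the priority value 1 are
assumptions made for this illustration.

```shell
#!/bin/sh
# swap_fstab_lines PRIORITY DEVICE...
# Print one /etc/fstab line per swap device, all with the same
# priority, so that the kernel interleaves between them.
swap_fstab_lines() {
    pri=$1
    shift
    for dev in "$@"; do
        printf '%-12s none  swap  sw,pri=%s  0  0\n' "$dev" "$pri"
    done
}

# Swap partitions of the RAID example above.
swap_fstab_lines 1 /dev/sda3 /dev/sdb2 /dev/sdc2 /dev/sdd2
```

The output is meant to be reviewed and then merged into /etc/fstab by
hand.
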
Note that boot and root are separated: only the boot file with the
kernel has to reside within the 1023 cylinder limit. The rest of the
root files can be anywhere, and here they are placed on the slowest,
innermost partition. For simplicity and safety the root partition is
not on a RAID system.

With such a complicated setup comes an equally complicated fstab file.
The large number of partitions makes it important to do the fsck
passes in the right order; otherwise the process can take perhaps ten
times as long to complete as the optimal solution.


/dev/sda8       /               ?       ?       1       1       (a)
/dev/sdb8       /               ?       noauto  1       2       (b)
/dev/sda1       boot            ?       ?       1       2       (a)
/dev/sdc7       /var            ?       ?       1       2       (c)
/dev/md1        news            ?       ?       1       3       (c+d)
/dev/md2        /var/tmp        ?       ?       1       3       (a+b)
/dev/md3        mail            ?       ?       1       4       (c+d)
/dev/md4        /home           ?       ?       1       4       (a+b)
/dev/md5        /tmp            ?       ?       1       5       (c+d)
/dev/md6        /usr            ?       ?       1       6       (a+b+c+d)
/dev/md7        /lib            ?       ?       1       7       (a+b+c+d)


The letters in the brackets indicate what drives will be active for
each fsck entry and pass. These letters are not present in a real
fstab file. All in all there are 7 passes.


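The sixth field of each fstab line is the fsck pass number. The passes
run in ascending order, and file systems that share a pass number may
be checked in parallel when they are on different drives. To review a
long fstab it helps to list the devices grouped by pass. The sketch
below does this. The function name fsck_passes is an assumption made
for this illustration, and the input is expected to be ordinary fstab
lines without the bracketed drive letters used in the listing above.

```shell
#!/bin/sh
# fsck_passes: read fstab lines on standard input and print, for each
# fsck pass number (sixth field), the devices checked in that pass.
# Comment lines, lines with fewer than six fields and entries with
# pass 0 (never checked) are skipped.
fsck_passes() {
    awk '$1 !~ /^#/ && NF >= 6 && $6 > 0 { print $6, $1 }' |
    sort -n |
    awk '{
        if ($1 != pass) {
            if (NR > 1) printf "\n"
            printf "pass %s:", $1
            pass = $1
        }
        printf " %s", $2
    }
    END { if (NR > 0) printf "\n" }'
}

# Usage: fsck_passes < /etc/fstab
```

Each output line starts with "pass N:" followed by the devices of that
pass, which makes it easy to see whether two partitions on the same
drive ended up in the same pass.
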
31. Appendix K: Example V: Dual Drive System

A dual drive system offers less opportunity for clever schemes but the
following should provide a simple starting point.

Partition       sda             sdb
                ----            ----
1               boot            lib
2               swap            news
3               /tmp            swap
4               /usr            /var/tmp
5               /var            /home
6               / (root)


If you use a dual OS system you have to keep in mind that many other
systems must boot from the first partition on the first drive. A
simple DOS / Linux system could look like this:


Partition       sda             sdb
                ----            ----
1               DOS             lib
2               boot            news
3               swap            swap
4               /tmp            /var/tmp
5               /usr            /home
6               /var            DOSTEMP
7               / (root)


Also remember that DOS and Windows prefer there to be just a single
primary partition, which has to be the first one and the one they boot
from. As Linux can happily exist in logical partitions this is not a
big problem.


32. Appendix L: Example VI: Single Drive System

Although this falls somewhat outside the scope of this HOWTO it cannot
be denied that recently some rather large drives have become very
affordable. Drives with 10 - 20 GB are becoming common and the
question often is how best to partition such monsters. Interestingly
enough very few seem to have any problems in filling up such drives
and the future looks generally quite rosy for manufacturers planning
on even bigger drives.

Opportunities for optimisations are of course even smaller than for 2
drive systems but some tricks can still be used to optimise track
positions while minimising head movements.


Partition       hda             Size estimate (MB)
                ----            ------------------
1               DOS             500
2               boot            20
3               Winswap         200
4               data            The bulk of the drive
5               lib             50 - 500
6               news            300+
7               swap            128  (Maximum size for 32-bit CPU)
8               tmp             300+ (/tmp and /var/tmp)
9               /usr            50 - 500
10              /home           300+
11              /var            50 - 300
12              mail            300+
13              / (root)        30
14              dosdata         10   (Windows bug workaround!)


Remember that the dosdata partition is a DOS filesystem that must be
the very last partition on the drive, otherwise Windows gets confused.


33. Appendix M: Disk System Documenter

This shell script was very kindly provided by Steffen Hulegaard. Run
it as root (superuser) and it will generate a summary of your disk
setup. Run it after you have implemented your design and compare it
with what you designed to check for mistakes. Should your system
develop defects this document will also be a useful starting point for
recovery.


______________________________________________________________________

#!/bin/bash
#$Header: /cvsroot/LDP/howto/linuxdoc/Multi-Disk-HOWTO.sgml,v 1.5 2002/05/20 21:12:29 gferg Exp $
#
# makediskdoc   Collects storage/disk info via df, mount,
#               /etc/fstab and fdisk.  Creates a single
#               reference file -- /root/sysop/doc/README.diskdoc
#               Especially good for documenting storage
#               config/partitioning
#
# 11/11/1999 SC Hulegaard Created just before RedHat 5.2 to
#                         RedHat 6.1 upgrade
# 12/31/1999 SC Hulegaard Added sfdisk -glx usage just prior to
#                         collapse of my Quantum Grand Prix (4.3 Gb)
#
# SEE ALSO      Other /root/bin/make*doc commands to produce other
#               /root/sysop/doc/README.* files.  For example,
#               /root/bin/makenetdoc.
#
FILE=/root/sysop/doc/README.diskdoc
# Make sure the directory for the report exists.
mkdir -p "$(dirname "$FILE")"
# Whole-disk devices named in /etc/fstab, e.g. /dev/sda or /dev/hdb:
# the first 8 characters of each /dev/sdXN or /dev/hdXN entry.
# sort -u also removes duplicates that are not on adjacent lines.
DISKS=$(grep -E '^/dev/[sh]d' /etc/fstab | cut -c 1-8 | sort -u)
echo "Creating $FILE ..."
echo ' ' > "$FILE"
echo "$FILE" >> "$FILE"
echo "Produced By $0" >> "$FILE"
date >> "$FILE"
echo ' ' >> "$FILE"
# Single quotes stop the shell from expanding $Header as a variable.
echo '$Header: /cvsroot/LDP/howto/linuxdoc/Multi-Disk-HOWTO.sgml,v 1.5 2002/05/20 21:12:29 gferg Exp $' >> "$FILE"
echo ' ' >> "$FILE"
echo 'DESCRIPTION:  df -a' >> "$FILE"
df -a >> "$FILE" 2>&1
echo ' ' >> "$FILE"
echo 'DESCRIPTION:  df -ia' >> "$FILE"
df -ia >> "$FILE" 2>&1
echo ' ' >> "$FILE"
echo 'DESCRIPTION:  mount' >> "$FILE"
mount >> "$FILE" 2>&1
echo ' ' >> "$FILE"
echo 'DESCRIPTION:  /etc/fstab' >> "$FILE"
cat /etc/fstab >> "$FILE"
echo ' ' >> "$FILE"
echo 'DESCRIPTION:  sfdisk -s disk device size summary' >> "$FILE"
sfdisk -s >> "$FILE" 2>&1
echo ' ' >> "$FILE"
echo 'DESCRIPTION:  sfdisk -glx info for all disks listed in /etc/fstab' >> "$FILE"
for x in $DISKS; do
    echo ' ' >> "$FILE"
    echo "$x =============================" >> "$FILE"
    sfdisk -glx "$x" >> "$FILE" 2>&1
done
echo ' ' >> "$FILE"
echo 'DESCRIPTION:  fdisk -l info for all disks listed in /etc/fstab' >> "$FILE"
for x in $DISKS; do
    echo ' ' >> "$FILE"
    echo "$x =============================" >> "$FILE"
    fdisk -l "$x" >> "$FILE" 2>&1
done
echo ' ' >> "$FILE"
echo 'DESCRIPTION:  dmesg info on both sd and hd drives' >> "$FILE"
# Quote the pattern so the shell cannot expand it as a file name glob.
dmesg | grep -E '[hs]d[a-z]' >> "$FILE"
echo '' >> "$FILE"
echo Done >> "$FILE"
echo Done
exit

______________________________________________________________________

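Since the summary is most useful when compared with an earlier state,
one possible follow-up is to keep dated copies of it and show the
differences against the previous copy. This is a sketch only. The
function name archive_diskdoc and the archive directory are assumptions
made for this illustration and are not part of Steffen Hulegaard's
script.

```shell
#!/bin/sh
# archive_diskdoc FILE DIR  (hypothetical helper)
# Copy FILE into DIR under a time-stamped name and, if an older copy
# exists there, show what has changed since then.
archive_diskdoc() {
    file=$1
    dir=$2
    mkdir -p "$dir" || return 1
    # Newest existing copy, if any; the time-stamped names sort by date.
    prev=$(ls "$dir"/diskdoc.* 2>/dev/null | sort | tail -n 1)
    new="$dir/diskdoc.$(date +%Y%m%d-%H%M%S)"
    cp "$file" "$new" || return 1
    if [ -n "$prev" ] && [ "$prev" != "$new" ]; then
        diff -u "$prev" "$new"
    fi
    return 0
}

# Example: archive_diskdoc /root/sysop/doc/README.diskdoc /root/sysop/doc/archive
```

The date line written by the script differs on every run, so the diff
always shows at least that line; anything beyond it reflects a change
in the disk setup.
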