<HTML>
<center>
<A HREF="../tlk-toc.html"> Table of Contents</A>,
<A href="../tlk.html" target="_top"> Show Frames</A>,
<A href="../fs/filesystem.html" target="_top"> No Frames</A>
</center>
<hr>
<META NAME="TtH" CONTENT="1.03">

<p>
<H1><A NAME="tth_chAp9">Chapter 9 </A><br>The File system</H1>
<A NAME="filesystem-chapter"></A>
<p>
<img src="../logos/sit3-bw-tran.1.gif"><br>
This chapter describes how the Linux kernel maintains the files in the file systems
that it supports.
It describes the Virtual File System (VFS) and explains how the Linux kernel's
real file systems are supported.

<p>
One of the most important features of Linux is its support for many different file systems.
This makes it very flexible and well able to coexist with many other operating systems.
At the time of writing, Linux supports 15 file systems: <tt>ext</tt>, <tt>ext2</tt>, <tt>xia</tt>,
<tt>minix</tt>, <tt>umsdos</tt>, <tt>msdos</tt>, <tt>vfat</tt>, <tt>proc</tt>, <tt>smb</tt>, <tt>ncp</tt>, <tt>iso9660</tt>,
<tt>sysv</tt>, <tt>hpfs</tt>, <tt>affs</tt> and <tt>ufs</tt>, and, no doubt, more will be added over time.

<p>
In Linux, as in Unix<sup><font size=-4><tt>T</tt>M</font></sup>, the separate file systems that the system may use are
not accessed by device identifiers (such as a drive number or a drive name); instead
they are combined into a single hierarchical tree structure that represents the
file system as one single entity.
Linux adds each new file system into this single file system tree as it is mounted.
All file systems, of whatever type, are mounted onto a directory and
the files of the mounted file system cover up the existing contents of that directory.
This directory is known as the mount directory or mount point.
When the file system is unmounted, the mount directory's own files are once again revealed.

<p>
When disks are initialized (using <font face="helvetica">fdisk</font>, say) they have a partition structure
imposed on them that divides the physical disk into a number of logical partitions.
Each partition may hold a single file system, for example an <tt>EXT2</tt> file system.
File systems organize files into logical hierarchical structures with directories, soft links and
so on held in blocks on physical devices.
Devices that can contain file systems are known as block devices.
The IDE disk partition <tt>/dev/hda1</tt>, the first partition of the first IDE disk drive in the
system, is a block device.
The Linux file systems regard these block devices as simply linear collections of blocks;
they do not know or care about the underlying physical disk's geometry.
It is the task of each block device driver to map a request to read a particular
block of its device into terms meaningful to its device: the particular track, sector
and cylinder of its hard disk where the block is kept.
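The mapping a driver performs can be sketched with a little arithmetic. This is illustrative only: it assumes one sector per block and a fixed, hypothetical geometry (63 sectors per track, 16 heads); real drivers obtain the geometry from the drive itself.

```python
def block_to_chs(block, sectors_per_track, heads):
    """Map a linear block number to (cylinder, head, sector).

    A sketch only: assumes one sector per block and a fixed,
    invented geometry; real drivers query the drive.
    """
    sector = block % sectors_per_track    # sector within the track (0-based)
    track = block // sectors_per_track    # absolute track number
    head = track % heads                  # surface within the cylinder
    cylinder = track // heads             # cylinder (stack of tracks)
    return cylinder, head, sector

# A disk with 63 sectors per track and 16 heads:
print(block_to_chs(0, 63, 16))     # (0, 0, 0)
print(block_to_chs(1008, 63, 16))  # (1, 0, 0): track 16 is cylinder 1, head 0
```

The file system above this layer only ever sees the linear block number.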
A file system has to look, feel and operate in the same way no matter what device is holding it.
Moreover, using Linux's file systems, it does not matter (at least to the system user) that
these different file systems are on different physical media controlled by
different hardware controllers.
The file system might not even be on the local system; it could just as
well be a disk remotely mounted over a network link.
Consider the following example where a Linux system has its root file system on a SCSI disk:

<pre>
A         E         boot      etc       lib       opt       tmp       usr
C         F         cdrom     fd        proc      root      var       sbin
D         bin       dev       home      mnt       lost+found
</pre>
<p>
Neither the users nor the programs that operate on the files themselves need know that
<tt>/C</tt> is in fact a mounted VFAT file system that is on the first IDE disk in the system.
In the example (which is actually my home Linux system), <tt>/E</tt> is the master IDE disk
on the second IDE controller.
It does not matter either that the first IDE controller is a PCI controller and that the second
is an ISA controller which also controls the IDE CDROM.
I can dial into the network where I work using a modem and the PPP network protocol
and in this case I can remotely mount my Alpha AXP Linux system's
file systems on <tt>/mnt/remote</tt>.

<p>
The files in a file system are collections of data;
the file holding the sources to this chapter is an ASCII file called <tt>filesystems.tex</tt>.
A file system not only holds the data that is contained within the files of the
file system but also the structure of the file system.
It holds all of the information that Linux users and processes see as files, directories,
soft links, file protection information and so on.
Moreover, it must hold that information safely and securely; the basic integrity of the
operating system depends on its file systems.
Nobody would use an operating system that randomly lost data and files<a href="#tthFtNtAAB" name=tthFrefAAB><sup>1</sup></a>.

<p>
<tt>Minix</tt>, the first file system that Linux had, is rather restrictive and lacking
in performance.

<p>
Its filenames cannot be longer than 14 characters (which is still better than 8.3 filenames)
and the maximum file size is 64 MBytes.
64 MBytes might at first glance seem large enough, but large file sizes are necessary
to hold even modest databases.
The first file system designed specifically for Linux, the Extended File system, or <tt>EXT</tt>, was
introduced in April 1992 and cured a lot of the problems but it was still felt
to lack performance.

<p>
So, in 1993, the Second Extended File system, or <tt>EXT2</tt>, was added.

<p>
It is this file system that is described in detail later on in this chapter.

<p>
An important development took place when the EXT file system was added into Linux.
The real file systems were separated from the operating system and system services
by an interface layer known as the Virtual File system, or VFS.

<p>
VFS allows Linux to support many, often very different, file systems,
each presenting a common software interface to the VFS.
All of the details of the Linux file systems are translated by
software so that all file systems appear identical to the rest of the Linux kernel
and to programs running in the system.
Linux's Virtual File system layer allows you to transparently
mount many different file systems at the same time.

<p>
The Linux Virtual File system is implemented so that access to its files is as fast and
efficient as possible.
It must also make sure that the files and their data are kept correctly.
These two requirements can be at odds with each other.
The Linux VFS caches information in memory from each file system as it is mounted and
used.
A lot of care must be taken to update the file system correctly as data within these
caches is modified as files and directories are created, written to and deleted.
If you could see the file system's data structures within the running kernel,
you would be able to see data blocks being read and written by the file system.
Data structures describing the files and directories being accessed would be
created and destroyed and all the time the device drivers would be working
away, fetching and saving data.
The most important of these caches is the Buffer Cache, which is integrated into the
way that the individual file systems access their underlying block devices.
As blocks are accessed they are put into the Buffer Cache and kept on various queues
depending on their states.
The Buffer Cache not only caches data buffers, it also helps manage the asynchronous
interface with the block device drivers.

<p>
<H2><A NAME="tth_sEc9.1">9.1 </A> The Second Extended File system (EXT2)</H2>

<p>
<p><A NAME="tth_fIg9.1"></A>
<center> <img src="ext2.gif"><br>
<p>
</center><center> Figure 9.1: Physical Layout of the EXT2 File system</center>
<A NAME="ext2fs-figure"></A>
<p>
<p>The Second Extended File system was devised (by Rémy Card) as an extensible
and powerful file system for Linux.
It is also the most successful file system so far in the Linux community and is the
basis for all of the currently shipping Linux distributions.

<p>
The EXT2 file system, like many file systems, is built on the premise that the
data held in files is kept
in data blocks.
These data blocks are all of the same length and, although that length can
vary between different EXT2 file systems, the block size of a particular
EXT2 file system is set when it is created (using <font face="helvetica">mke2fs</font>).
Every file's size is rounded up to an integral number of blocks.
If the block size is 1024 bytes, then a file of 1025 bytes will occupy two 1024 byte
blocks.
Unfortunately this means that on average you waste half a block per file.
Usually in computing you trade off CPU usage for memory and disk space utilisation.
In this case Linux, along with most operating systems, trades off relatively
inefficient disk usage in order to reduce the workload on the CPU.
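The rounding arithmetic above can be sketched directly; the 1024 byte block size here is only an example, since the real value is chosen at <font face="helvetica">mke2fs</font> time.

```python
BLOCK_SIZE = 1024  # example only; set when the file system is created

def blocks_needed(file_size):
    # Every file's size is rounded up to a whole number of blocks.
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def wasted_bytes(file_size):
    # Slack space in the final, partially filled block.
    return blocks_needed(file_size) * BLOCK_SIZE - file_size

print(blocks_needed(1025))  # 2 blocks for a 1025 byte file
print(wasted_bytes(1025))   # 1023 bytes of slack
```

On average the slack is half a block per file, which is the trade-off the text describes.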
Not all of the blocks in the file system hold data; some must be used to contain the
information that describes the structure of the file system.
EXT2 defines the file system topology by describing each file
in the system with an inode data structure.
An inode describes which blocks the data within a file occupies as well as the
access rights of the file, the file's modification times and the type of the file.
Every file in the EXT2 file system is described by a single inode and each inode
has a single unique number identifying it.
The inodes for the file system are all kept together in inode tables.
EXT2 directories are simply special files
(themselves described by inodes)
which contain pointers to the inodes of their directory entries.

<p>
Figure <A href="#ext2fs-figure">9.1</A> shows the layout of the EXT2 file system as occupying
a series of blocks in a block structured device.
So far as each file system is concerned, block devices are just a series of blocks
that can be read and written.
A file system does not need to concern itself with where on the physical media
a block should be put; that is the job of the device's driver.
Whenever a file system needs to read information or data from the block device
containing it, it requests that its supporting device driver reads an integral
number of blocks.
The EXT2 file system divides the logical partition that it occupies
into Block Groups.

<p>
Each group duplicates information critical to the integrity of the file system
as well as holding real files and directories as blocks of information and data.
This duplication is necessary should a disaster occur and the file system need
recovering.
The following subsections describe in more detail the contents of each Block Group.

<p>
<H3><A NAME="tth_sEc9.1.1">9.1.1 </A> The EXT2 Inode</H3>

<p><A NAME="tth_fIg9.2"></A>
<p>
<center> <img src="ext2_inode.gif"><br>
<p>
</center><center> Figure 9.2: EXT2 Inode</center>
<A NAME="ext2fs-inode-figure"></A>
<p>
<p>In the <tt>EXT2</tt> file system, the inode is the basic building block; every
file and directory in the file system is described by one and only one inode.
The EXT2 inodes for each Block Group are kept in the inode table together
with a bitmap that allows the system to keep track of allocated and unallocated
inodes.
Figure <A href="#ext2fs-inode-figure">9.2</A> shows the format of an EXT2 inode;
amongst other information, it contains the following fields:

<p>
<DL compact> <dt><b>mode</b></dt><dd> This holds two pieces of information: what
this inode describes and the permissions that users have
to it.
For EXT2, an inode can describe one
of file, directory, symbolic link, block device, character device or
FIFO.
<dt><b>Owner Information</b></dt><dd> The user and group identifiers of the owners of this
file or directory. This allows the file system to correctly allow
the right sort of accesses,
<dt><b>Size</b></dt><dd> The size of the file in bytes,
<dt><b>Timestamps</b></dt><dd> The time that the inode was created and the last time that
it was modified,
<dt><b>Datablocks</b></dt><dd> Pointers to the blocks that contain the data that this
inode is describing. The first twelve are pointers to the
physical blocks containing the data described by this inode and the
last three pointers contain more and more levels of indirection.
For example, the double indirect blocks pointer points at a block of
pointers to blocks of pointers to data blocks.
This means that files less than or equal to twelve data blocks in length
are more quickly accessed than larger files.
</DL>
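The indirection scheme above can be sketched as a small calculation. This sketch assumes 1024 byte blocks and 4 byte block pointers, giving 256 pointers per indirect block; the real numbers depend on the block size chosen when the file system was created.

```python
DIRECT = 12        # the first twelve pointers in the inode are direct
PTRS = 1024 // 4   # pointers held by one 1024 byte indirect block (assumed sizes)

def indirection_level(logical_block):
    """How many indirect lookups are needed to reach a block of the file."""
    if logical_block < DIRECT:
        return 0                         # direct pointer in the inode itself
    logical_block -= DIRECT
    if logical_block < PTRS:
        return 1                         # single indirect block
    logical_block -= PTRS
    if logical_block < PTRS * PTRS:
        return 2                         # double indirect: pointers to pointers
    return 3                             # triple indirect

print(indirection_level(11))   # 0: still a direct pointer
print(indirection_level(12))   # 1: first single-indirect block
print(indirection_level(268))  # 2: 12 + 256 = 268 is the first double-indirect block
```

This is why files of at most twelve data blocks are accessed more quickly: no indirect block has to be read at all.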
<p>
You should note that EXT2 inodes can describe special device files. These are
not real files but handles that programs can use to access devices. All of the
device files in <tt>/dev</tt> are there to allow programs to access Linux's devices.
For example, the <font face="helvetica">mount</font> program takes as an argument the device file that it wishes to
mount.

<p>
<H3><A NAME="tth_sEc9.1.2">9.1.2 </A> The EXT2 Superblock</H3>

<p>
The Superblock contains a description of the basic size and shape of this file system.
The information within it allows the file system manager to use and maintain the file system.
Usually only the Superblock in Block Group 0 is read when the file system is mounted
but each Block Group contains a duplicate copy in case of file system corruption.
Amongst other information it holds the:

<p>
<DL compact> <dt><b>Magic Number</b></dt><dd> This allows the mounting software to check that this is
indeed the Superblock for an EXT2 file system. For the current version of
<tt>EXT2</tt> this is <em>0xEF53</em>,
<dt><b>Revision Level</b></dt><dd> The major and minor revision levels allow the mounting
code to determine whether or not this file system supports features
that are only available in particular revisions of the file system.
There are also feature compatibility fields which help the mounting code
to determine which new features can safely be used on this file system,
<dt><b>Mount Count and Maximum Mount Count</b></dt><dd> Together these allow the system to
determine if the file system should be fully checked. The mount count
is incremented each time the file system is mounted and when it equals
the maximum mount count the warning message ``maximal mount count reached,
running e2fsck is recommended'' is displayed,
<dt><b>Block Group Number</b></dt><dd> The Block Group number that holds this copy of
the Superblock,
<dt><b>Block Size</b></dt><dd> The size of the block for this file system in bytes, for
example 1024 bytes,
<dt><b>Blocks per Group</b></dt><dd> The number of blocks in a group. Like the block size,
this is fixed when the file system is created,
<dt><b>Free Blocks</b></dt><dd> The number of free blocks in the file system,
<dt><b>Free Inodes</b></dt><dd> The number of free inodes in the file system,
<dt><b>First Inode</b></dt><dd> This is the inode number of the first inode in the file system.
The first inode in an EXT2 root file system would be the directory entry
for the '/' directory.
</DL>
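The magic number check that mounting software performs can be sketched as below. The superblock is stored 1024 bytes into the device and, in the on-disk layout, the magic number is a little-endian 16 bit value at offset 56 within it; treat those offsets as illustrative of the idea rather than a substitute for the real headers.

```python
EXT2_SUPER_MAGIC = 0xEF53
SUPERBLOCK_OFFSET = 1024   # the superblock starts 1024 bytes into the device
MAGIC_OFFSET = 56          # offset of the magic number within the superblock

def looks_like_ext2(device_bytes):
    """Check the superblock magic the way mounting software would."""
    off = SUPERBLOCK_OFFSET + MAGIC_OFFSET
    magic = int.from_bytes(device_bytes[off:off + 2], "little")
    return magic == EXT2_SUPER_MAGIC

# A fake 4 KiB "device" with the magic planted in the right place:
image = bytearray(4096)
image[1080:1082] = (0xEF53).to_bytes(2, "little")
print(looks_like_ext2(image))            # True
print(looks_like_ext2(bytearray(4096)))  # False
```

If the magic does not match, the mount attempt is rejected before any other superblock field is trusted.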
<p>

<H3><A NAME="tth_sEc9.1.3">9.1.3 </A> The EXT2 Group Descriptor</H3>

<p>
Each Block Group has a data structure describing it.
Like the Superblock, all of the group descriptors
for all of the Block Groups are duplicated in each Block Group in case of file system
corruption.

<p>
Each Group Descriptor contains the following information:

<DL compact>
<dt><b>Blocks Bitmap</b></dt><dd> The block number of the block allocation bitmap for
this Block Group. This is used during block allocation and deallocation,
<dt><b>Inode Bitmap</b></dt><dd> The block number of the inode allocation bitmap for
this Block Group. This is used during inode allocation and deallocation,
<dt><b>Inode Table</b></dt><dd> The block number of the starting block for the inode table for
this Block Group. Each inode is represented by the EXT2 inode data structure
described above,
<dt><b>Free blocks count, Free Inodes count, Used directory count</b></dt><dd> Counts of the
free blocks, free inodes and used directories within this Block Group.
</DL>
<p>
The group descriptors are placed one after another and together they make the group
descriptor table.
Each Block Group contains the entire table of group descriptors after its copy of
the Superblock.
Only the first copy (in Block Group 0) is actually used by the EXT2 file system.
The other copies are there, like the copies of the Superblock, in case the main
copy is corrupted.

<p>
<H3><A NAME="tth_sEc9.1.4">9.1.4 </A> EXT2 Directories</H3>

<p><A NAME="tth_fIg9.3"></A>
<center> <img src="ext2_dir.gif"><br>
<p>
</center><center> Figure 9.3: EXT2 Directory</center>
<A NAME="ext2fs-dir-figure"></A>
<p>
<p>In the EXT2 file system, directories are special files that are used to create and hold
access paths to the files in the file system.
Figure <A href="#ext2fs-dir-figure">9.3</A> shows the layout of a directory entry in memory.

<p>
A directory file is a list of directory entries, each one containing the following
information:

<DL compact>
<dt><b>inode</b></dt><dd> The inode for this directory entry. This is an index into the
array of inodes held in the Inode Table of the Block Group. In
figure <A href="#ext2fs-dir-figure">9.3</A>, the directory entry for the file called
<tt>file</tt> has a reference to inode number <tt>i1</tt>,
<dt><b>name length</b></dt><dd> The length of the name of this directory entry in bytes,
<dt><b>name</b></dt><dd> The name of this directory entry.
</DL>
<p>
The first two entries for every directory are always
the standard ``.'' and ``..'' entries, meaning ``this directory'' and ``the parent
directory'' respectively.
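Reading a directory block amounts to stepping from one variable-length entry to the next. The sketch below uses a simplified layout in the spirit of the on-disk format (a 4 byte inode number, a 2 byte entry length, a 2 byte name length, then the name); the real format also carries padding and alignment that are ignored here.

```python
import struct

def walk_directory(block):
    """Walk a simplified ext2-style directory block (assumed layout)."""
    entries, offset = [], 0
    while offset < len(block):
        inode, rec_len, name_len = struct.unpack_from("<IHH", block, offset)
        if rec_len == 0:
            break  # malformed entry; stop rather than loop forever
        name = block[offset + 8:offset + 8 + name_len].decode()
        entries.append((name, inode))
        offset += rec_len  # the entry length carries us to the next entry
    return entries

def make_entry(inode, name):
    name = name.encode()
    return struct.pack("<IHH", inode, 8 + len(name), len(name)) + name

block = make_entry(42, ".") + make_entry(2, "..") + make_entry(77, "file")
print(walk_directory(block))  # [('.', 42), ('..', 2), ('file', 77)]
```

Note that the entry length, not the name length, is what moves the walk forward; that is what lets entries be padded or deleted in place.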
<p>

<H3><A NAME="tth_sEc9.1.5">9.1.5 </A> Finding a File in an EXT2 File System</H3>
A Linux filename has the same format as all Unix<sup><font size=-4><tt>T</tt>M</font></sup> filenames have.
It is a series of directory names separated by forward slashes (``<tt>/</tt>'') and
ending in the file's name.
One example filename would be <tt>/home/rusling/.cshrc</tt> where <tt>/home</tt> and <tt>rusling</tt>
are directory names and the file's name is <tt>.cshrc</tt>.
Like all other Unix<sup><font size=-4><tt>T</tt>M</font></sup> systems, Linux does not care about the format of the filename itself; it
can be any length and consist of any of the printable characters.
To find the inode representing this file within an <tt>EXT2</tt> file system the
system must parse the filename a directory at a time until it gets to the file itself.

<p>
<p>
The first inode we need is the inode for the root of the file system and we
find its number in the file system's superblock.
To read an EXT2 inode we must look for it in the inode table of the appropriate
Block Group.
If, for example, the root inode number is 42, then we need the 42nd inode from
the inode table of Block Group 0.
The root inode is for an EXT2 directory; in other words, the mode of the root
inode describes it as a directory and its data blocks contain EXT2 directory entries.

<p>
<tt>home</tt> is just one of the many directory entries and this directory entry gives us
the number of the inode describing the <tt>/home</tt> directory.
We have to read this directory (by first reading its inode and then
reading the directory entries from the data blocks described by its inode)
to find the <tt>rusling</tt> entry which gives us the number of the inode describing the
<tt>/home/rusling</tt> directory.
Finally we read the directory entries pointed at by the inode describing the
<tt>/home/rusling</tt> directory to find the inode number of the <tt>.cshrc</tt> file and
from this we get the data blocks containing the information in the file.
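The lookup just described can be sketched as a loop over the path components. The inode numbers and table contents below are entirely invented for illustration; only the shape of the walk (directory entry to inode, inode to directory entries, and so on) mirrors the text.

```python
# A toy inode table: directories map names to inode numbers,
# a regular file holds its data.  All numbers here are invented.
inodes = {
    2:  {"home": 10},              # root directory
    10: {"rusling": 11},           # /home
    11: {".cshrc": 12},            # /home/rusling
    12: "setenv TERM vt100\n",     # the file's data
}
ROOT_INODE = 2   # found via the superblock in a real file system

def namei(path):
    """Resolve a path one directory at a time, as the text describes."""
    inode = ROOT_INODE
    for component in path.strip("/").split("/"):
        directory = inodes[inode]        # read this directory's entries
        inode = directory[component]     # follow the entry to the next inode
    return inode

print(namei("/home/rusling/.cshrc"))  # 12
```

Each step of the loop costs at least one directory read, which is one reason frequently used lookups are worth caching, as the VFS section later explains.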
<p>

<H3><A NAME="tth_sEc9.1.6">9.1.6 </A> Changing the Size of a File in an EXT2 File System</H3>
One common problem with a file system is its tendency to fragment.
The blocks that hold the file's data get spread all over the file system
and this makes sequentially accessing the data blocks of a file more and more
inefficient the further apart the data blocks are.
The EXT2 file system tries to overcome this by allocating the new blocks for a
file physically close to its current data blocks or at least in the same Block Group
as its current data blocks.
Only when this fails does it allocate data blocks in another Block Group.

<p>
Whenever a process attempts to write data into a file, the Linux file system checks to
see if the data has gone off the end of the file's last allocated block.
If it has, then it must allocate a new data block for this file.
Until the allocation is complete, the process cannot run; it must wait for
the file system to allocate a new data block and write the rest of the data to it before
it can continue.
The first thing that the EXT2 block allocation routines do is to lock the
EXT2 Superblock for this file system. Allocating and deallocating changes fields
within the superblock, and the Linux file system cannot allow more than one process
to do this at the same time.
If another process needs to allocate more data blocks, it will have to wait until
this process has finished.
Processes waiting for the superblock are suspended, unable to run, until
control of the superblock is relinquished by its current user.
Access to the superblock is granted on a first come, first
served basis and once a process has control of the superblock, it keeps control
until it has finished.
Having locked the superblock, the process checks that there are enough free blocks
left in this file system. If there are not enough free blocks, then this attempt to
allocate more will fail and the process will relinquish control of this file system's
superblock.

<p>
<p>
If there are enough free blocks in the file system, the process tries to
allocate one.

<p>
If the EXT2 file system has been built to preallocate data blocks then we may be able
to take one of those.
The preallocated blocks do not actually exist; they are just reserved within the
allocated block bitmap.
The VFS inode representing the file that we are trying to allocate a new data block
for has two EXT2 specific fields, <tt>prealloc_block</tt> and <tt>prealloc_count</tt>, which
are the block number of the first preallocated data block and how many of them there
are, respectively.
If there were no preallocated blocks or block preallocation is not enabled, the EXT2
file system must allocate a new block.
The EXT2 file system first looks to see if the data block after the last data block
in the file is free. Logically, this is the most efficient block to allocate as
it makes sequential accesses much quicker.
If this block is not free, then the search widens and it looks for a data block within
64 blocks of the ideal block.
This block, although not ideal, is at least fairly close and within the same Block Group
as the other data blocks belonging to this file.
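The widening search can be sketched as a goal-based scan over the block bitmap. This is a simplification: it models the whole file system as one flat bitmap and ignores Block Group boundaries, cluster preferences and locking, but it shows the goal, then near-the-goal, then anywhere ordering the text describes.

```python
def allocate_block(bitmap, goal, window=64):
    """Pick a free block: the goal first, then one within `window`
    blocks of it, then anywhere.  A sketch of the policy only.
    In the bitmap, False means free and True means allocated."""
    if not bitmap[goal]:
        return goal                                   # ideal: right after the file
    for b in range(goal, min(goal + window, len(bitmap))):
        if not bitmap[b]:
            return b                                  # close by, same Block Group
    for b in range(len(bitmap)):
        if not bitmap[b]:
            return b                                  # fall back to any free block
    return None                                       # the file system is full

bitmap = [False] * 256        # a 256-block toy file system, all free
bitmap[100] = True            # the ideal block is already taken...
print(allocate_block(bitmap, 100))  # 101: the next free block in the window
```

A real allocator would mark the chosen bit before releasing the superblock, so that no other process can pick the same block.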
<p>
If even that block is not free, the process starts looking in all of the
other Block Groups in turn until it finds some free blocks.
The block allocation code looks for a
cluster of eight free data blocks somewhere in one of the Block Groups.
If it cannot find eight together, it will settle for less.
If block preallocation is wanted and enabled, it will update <tt>prealloc_block</tt> and
<tt>prealloc_count</tt> accordingly.

<p>
Wherever it finds the free block, the block allocation code updates the Block Group's
block bitmap and allocates a data buffer in the buffer cache.
That data buffer is uniquely identified by the file system's supporting device identifier
and the block number of the allocated block.
The data in the buffer is zeroed and the buffer is marked as ``dirty'' to show that its
contents have not been written to the physical disk.
Finally, the superblock itself is marked as ``dirty'' to show that it has been changed,
and it is unlocked.
If there were any processes waiting for the superblock, the first one in the queue
is allowed to run again and will gain exclusive control of the superblock for its
file operations.
The process's data is written to the new data block and, if that data block is filled,
the entire process is repeated and another data block allocated.

<p>
<H2><A NAME="tth_sEc9.2">9.2 </A> The Virtual File System (VFS)</H2>

<p>
<p><A NAME="tth_fIg9.4"></A>
<center> <img src="vfs.gif"><br>
<p>
</center><center> Figure 9.4: A Logical Diagram of the Virtual File System</center>
<A NAME="vfs-figure"></A>
<p>
<p>Figure <A href="#vfs-figure">9.4</A> shows the relationship between the Linux kernel's Virtual File System
and its real file systems.
The virtual file system must manage all of the different file systems that are mounted at any
given time.
To do this it maintains data structures that describe the whole (virtual) file system and the
real, mounted, file systems.
<p>
Rather confusingly, the VFS describes the system's files in terms of superblocks and inodes
in much the same way as the EXT2 file system uses superblocks and inodes.
Like the EXT2 inodes, the VFS inodes describe files and directories within the system;
the contents and topology of the Virtual File System.
From now on, to avoid confusion, I will write about VFS inodes and VFS superblocks to
distinguish them from EXT2 inodes and superblocks.

<p>
As each file system is initialised, it registers itself with the VFS.
This happens as the operating system initialises itself at system boot time.
The real file systems are either built into the kernel itself or are built as loadable modules.
File system modules are loaded as the system needs them, so, for example, if the <tt>VFAT</tt> file system
is implemented as a kernel module, then it is only loaded when a <tt>VFAT</tt> file system is mounted.
When a block device based file system is mounted, and this includes the root file system, the
VFS must read its superblock.
Each file system type's superblock read routine must work out the file system's topology
and map that information onto a VFS superblock data structure.
The VFS keeps a list of the mounted file systems in the system together with
their VFS superblocks.
Each VFS superblock contains information and pointers to routines that perform particular
functions.
So, for example, the superblock representing a mounted EXT2 file system contains
a pointer to the EXT2 specific inode reading routine.
This EXT2 inode read routine, like all of the file system specific inode read routines,
fills out the fields in a VFS inode.
Each VFS superblock contains a pointer to the first VFS inode on the file system.
For the root file system, this is the inode that represents the <tt>``/''</tt> directory.
This mapping of information is very efficient for the EXT2 file system but
moderately less so for other file systems.
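The registration and mounting described above can be sketched with a toy registry. Every name and signature here is invented for illustration; the point is only the shape of the mechanism: a file system type registers a superblock-reading routine, and mounting looks the type up and calls it.

```python
# A toy registry in the spirit of VFS file system registration.
file_systems = {}

def register_filesystem(name, read_super):
    # Called by each file system as it is initialised (or module-loaded).
    file_systems[name] = read_super

def mount(fstype, device):
    # The VFS knows nothing about ext2 or VFAT specifics; it just
    # dispatches to whichever read_super routine was registered.
    read_super = file_systems[fstype]
    return read_super(device)          # yields a "VFS superblock"

register_filesystem("ext2", lambda dev: {"type": "ext2", "device": dev})
sb = mount("ext2", "/dev/hda1")
print(sb)  # {'type': 'ext2', 'device': '/dev/hda1'}
```

The indirection is the whole trick: once a type is registered, every mounted instance looks the same to the rest of the kernel.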
<p>
As the system's processes access directories and files, system routines are called
that traverse the VFS inodes in the system.

<p>
For example, typing <font face="helvetica">ls</font> for a directory or <font face="helvetica">cat</font> for a file causes the
Virtual File System to search through the VFS inodes that represent the file system.
As every file and directory on the system is represented by a VFS inode, a number
of inodes will be repeatedly accessed.
These inodes are kept in the inode cache which makes access to them quicker.
If an inode is not in the inode cache, then a file system specific routine must be
called in order to read the appropriate inode.
The action of reading the inode causes it to be put into the inode cache and
further accesses to the inode keep it in the cache.
The less used VFS inodes get removed from the cache.

<p>
<p>
All of the Linux file systems use a common buffer cache to cache data buffers
from the underlying devices to help speed up access by all of the file systems to the
physical devices holding the file systems.

<p>
This buffer cache is independent of the file systems and is integrated into the
mechanisms that the Linux kernel uses to allocate and read and write data buffers.
It has the distinct advantage of making
the Linux file systems independent of the underlying media and of the
device drivers that support them.
All block structured devices register themselves with the Linux kernel and
present a uniform, block based, usually asynchronous interface.
Even relatively complex block devices such as SCSI devices do this.
As the real file systems read data from the underlying physical disks, this
results in requests to the block device drivers to read physical blocks from the
device that they control.
Integrated into this block device interface is the buffer cache.
As blocks are read by the file systems they are saved in the global buffer cache
shared by all of the file systems and the Linux kernel.
Buffers within it are identified by their block number and a unique identifier for
the device that read them.
So, if the same data is needed often, it will be retrieved from the buffer cache
rather than read from the disk, which would take somewhat longer.
Some devices support read ahead, where data blocks are speculatively read just in
case they are needed.
|
|||
|
|
|||
|
<p>
|
|||
|
The VFS also keeps a cache of directory lookups so that the inodes
for frequently used directories can be quickly found.

<p>
As an experiment, try listing a directory that you have not listed recently.
The first time you list it, you may notice a slight pause, but the second time
you list its contents the result is immediate.
The directory cache does not store the inodes for the directories themselves; these
should be in the inode cache.
The directory cache simply stores the mapping between
the full directory names and their inode numbers.

<p>

<H3><A NAME="tth_sEc9.2.1">9.2.1 </A> The VFS Superblock</H3>
<p>
Every mounted file system is represented by a VFS superblock;
amongst other information, the VFS superblock contains the:

<p>
<DL compact> <dt><b>Device</b></dt><dd> This is the device identifier for the block device that
this file system is contained in. For example, <tt>/dev/hda1</tt>, the
first partition of the first IDE hard disk in the system, has a device identifier of <em>0x301</em>,
<dt><b>Inode pointers</b></dt><dd> The <tt>mounted</tt> inode pointer points at the first inode
in this file system. The <tt>covered</tt> inode pointer points at the
inode representing the directory that this file system is mounted on.
The root file system's VFS superblock does not have a <tt>covered</tt> pointer,
<dt><b>Blocksize</b></dt><dd> The block size in bytes of this file system, for example 1024 bytes,
<dt><b>Superblock operations</b></dt><dd> A pointer to a set of superblock routines for
this file system. Amongst other things, these routines are used by the
VFS to read and write inodes and superblocks,
<dt><b>File System type</b></dt><dd> A pointer to the mounted file system's <tt>file_system_type</tt>
data structure,
<dt><b>File System specific</b></dt><dd> A pointer to information needed by this file system.
</DL>

<p>
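<p>
The fields above can be pictured as a simplified C structure. This is only a sketch: the
names and types here are illustrative assumptions, not the kernel's real
<tt>struct super_block</tt> from <tt>include/linux/fs.h</tt>.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the VFS superblock fields listed above -- illustrative only. */
struct vfs_inode;           /* defined elsewhere in the VFS          */
struct file_system_type;    /* one per registered file system type   */

struct super_operations {
    void (*read_inode)(struct vfs_inode *);  /* fill a VFS inode     */
    void (*write_super)(void *sb);           /* flush the superblock */
};

struct vfs_superblock {
    unsigned short s_dev;         /* device identifier, e.g. 0x301       */
    struct vfs_inode *s_mounted;  /* first (root) inode of this fs       */
    struct vfs_inode *s_covered;  /* mount point inode, NULL for root fs */
    unsigned long s_blocksize;    /* block size in bytes, e.g. 1024      */
    struct super_operations *s_op;     /* fs specific routines           */
    struct file_system_type *s_type;   /* what sort of fs this is        */
    void *s_fs_info;              /* fs specific private data            */
};
```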
<H3><A NAME="tth_sEc9.2.2">9.2.2 </A> The VFS Inode</H3>
<p>
Like the EXT2 file system, every file, directory and so on in the VFS is represented
by one and only one VFS inode.

<p>
The information in each VFS inode is built from information in the underlying file system
by file system specific routines.
VFS inodes exist only in the kernel's memory and are kept in the VFS inode cache
as long as they are useful to the system.
Amongst other information, VFS inodes contain the following fields:
<DL compact>
<p>
<dt><b>device</b></dt><dd> This is the device identifier of the device holding the file (or
whatever else this VFS inode represents),
<dt><b>inode number</b></dt><dd> This is the number of the inode and is unique within this file system.
The combination of <tt>device</tt> and <tt>inode number</tt> is unique within the Virtual
File System,
<dt><b>mode</b></dt><dd> Like EXT2, this field describes what this VFS inode represents as well
as access rights to it,
<dt><b>user ids</b></dt><dd> The owner identifiers,
<dt><b>times</b></dt><dd> The creation, modification and write times,
<dt><b>block size</b></dt><dd> The size of a block for this file in bytes, for example 1024 bytes,
<dt><b>inode operations</b></dt><dd> A pointer to a block of routine addresses. These routines
are specific to the file system and they perform operations for this inode, for
example, truncating the file that is represented by this inode,
<dt><b>count</b></dt><dd> The number of system components currently using this VFS inode. A count
of zero means that the inode is free to be discarded or reused,
<dt><b>lock</b></dt><dd> This field is used to lock the VFS inode, for example, when it is being
read from the file system,
<dt><b>dirty</b></dt><dd> Indicates whether this VFS inode has been written to; if so, the
underlying file system will need modifying,
<dt><b>file system specific information</b></dt><dd> A pointer to any extra information needed
by the underlying file system.
</DL>

<p>
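<p>
As a sketch, the fields above might look like the following C structure. The names are
assumptions for illustration, not the kernel's real <tt>struct inode</tt>; the point is that
the pair (<tt>device</tt>, <tt>inode number</tt>) is the unique key across the whole VFS.

```c
#include <assert.h>

/* Illustrative sketch of a VFS inode -- not the real kernel layout. */
struct vfs_inode {
    unsigned short i_dev;    /* device holding the object            */
    unsigned long  i_ino;    /* inode number, unique within that fs  */
    unsigned short i_mode;   /* object type and access rights        */
    unsigned short i_uid, i_gid;
    long  i_atime, i_mtime, i_ctime;
    unsigned long  i_blksize;
    unsigned short i_count;  /* users of this in-core inode          */
    unsigned char  i_lock;   /* set while the inode is being read    */
    unsigned char  i_dirt;   /* set once the inode has been written  */
};

/* (i_dev, i_ino) is the unique key within the whole Virtual File System. */
int same_vfs_inode(const struct vfs_inode *a, const struct vfs_inode *b)
{
    return a->i_dev == b->i_dev && a->i_ino == b->i_ino;
}
```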
<H3><A NAME="tth_sEc9.2.3">9.2.3 </A> Registering the File Systems</H3>
<p><A NAME="tth_fIg9.5"></A>
<p>
<center><center> <img src="file-systems.gif"><br>
<p>
</center></center><center> Figure 9.5: Registered File Systems</center>
<A NAME="file-systems-figure"></A>
<p>
<p>When you build the Linux kernel you are asked if you want each of the supported
file systems.
When the kernel is built, the file system startup code contains calls to
the initialisation routines of all of the built-in file systems.

<p>
Linux file systems may also be built as modules and, in this case, they may
be demand loaded as they are needed or loaded by hand using <font face="helvetica">insmod</font>.
Whenever a file system module is loaded it registers itself with the kernel and unregisters
itself when it is unloaded.
Each file system's initialisation routine registers itself with the Virtual File System
and is represented by a <tt>file_system_type</tt> data structure,
which contains the name
of the file system and a pointer to its VFS superblock read routine.
Figure <A href="#file-systems-figure">9.5</A> shows that the <tt>file_system_type</tt> data structures
are put into a list pointed at by the <tt>file_systems</tt> pointer.
Each <tt>file_system_type</tt> data structure contains the following information:

<p>
<DL compact> <dt><b>Superblock read routine</b></dt><dd> This routine is called by the VFS when an
instance of the file system is mounted,
<dt><b>File System name</b></dt><dd> The name of this file system, for example <tt>ext2</tt>,
<dt><b>Device needed</b></dt><dd> Does this file system need a device to support it? Not all
file systems need a device to hold them. The <tt>/proc</tt> file system,
for example, does not require a block device.
</DL>

<p>
You can see which file systems are registered by looking at <tt>/proc/filesystems</tt>.
For example:
<pre>
      ext2
nodev proc
      iso9660
</pre>

<p>
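<p>
The registration list just described can be sketched in C. This is a simplified model, not
the kernel's exact definitions: a singly linked list of <tt>file_system_type</tt> structures
headed by <tt>file_systems</tt>, pushed onto at registration time and searched by name at
mount time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of file system registration -- names and shapes are illustrative. */
struct super_block;                  /* opaque here */

struct file_system_type {
    struct super_block *(*read_super)(struct super_block *, void *, int);
    const char *name;                /* e.g. "ext2"               */
    int requires_dev;                /* 0 for /proc, 1 for ext2   */
    struct file_system_type *next;
};

static struct file_system_type *file_systems = NULL;

void register_filesystem(struct file_system_type *fs)
{
    fs->next = file_systems;         /* push onto the head of the list */
    file_systems = fs;
}

struct file_system_type *get_fs_type(const char *name)
{
    struct file_system_type *fs;
    for (fs = file_systems; fs != NULL; fs = fs->next)
        if (strcmp(fs->name, name) == 0)
            return fs;
    return NULL;                     /* unknown: try demand loading a module */
}
```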
<H3><A NAME="tth_sEc9.2.4">9.2.4 </A> Mounting a File System</H3>
<p>
When the superuser attempts to mount a file system, the Linux kernel must
first validate the arguments passed in the system call.
Although <font face="helvetica">mount</font> does some basic checking, it does not know which file systems
this kernel has been built to support or whether the proposed mount point actually
exists.
Consider the following <font face="helvetica">mount</font> command:
<pre>
$ mount -t iso9660 -o ro /dev/cdrom /mnt/cdrom
</pre>

<p>
This <font face="helvetica">mount</font> command will pass the kernel three pieces of information: the name of
the file system, the physical block device that contains the file system and
where in the existing file system topology the new file system is to be mounted.

<p>
The first thing that the Virtual File System must do is to find the file system.

<p>
To do this it searches through the list of known file systems by looking at each
<tt>file_system_type</tt> data structure in the list pointed at by <tt>file_systems</tt>.

<p>
If it finds a matching name it knows that this file system type is supported
by this kernel and it has the address of the file system specific routine for
reading this file system's superblock.
If it cannot find a matching file system name then all is not lost if the kernel
is built to demand load kernel modules (see Chapter <A href="../modules/modules.html">Modules</A>).
In this case the kernel will request that the kernel daemon loads the appropriate
file system module before continuing as before.

<p>
Next, if the physical device passed by <font face="helvetica">mount</font> is not already mounted,
it must find the VFS inode of the directory that is to be the new file system's
mount point.
This VFS inode may be in the inode cache or it might have to be read from the
block device supporting the file system of the mount point.
Once the inode has been found it is checked to see that it is a directory
and that there is not already some other file system mounted there.
The same directory cannot be used as a mount point for more than one file system.

<p>
At this point the VFS mount code must allocate a VFS superblock and
pass it, along with the mount information, to the superblock read routine for this
file system.
All of the system's VFS superblocks are kept in the <tt>super_blocks</tt> vector
of <tt>super_block</tt> data structures and one must be allocated for this mount.
The superblock read routine must fill out the VFS superblock fields based on
information that it reads from the physical device.
For the EXT2 file system this mapping or translation of information is quite
easy: it simply reads the EXT2 superblock and fills out the VFS superblock from there.
For other file systems, such as the MS-DOS file system, it is not quite such
an easy task.
Whatever the file system, filling out the VFS superblock means that the file system
must read whatever describes it from the block device that supports it.
If the block device cannot be read from or if it does not contain this type of
file system then the <font face="helvetica">mount</font> command will fail.

<p>
<p><A NAME="tth_fIg9.6"></A>
<center><center> <img src="mounted.gif"><br>
<p>
</center></center><center> Figure 9.6: A Mounted File System</center>
<A NAME="mounted-figure"></A>
<p>
<p>Each mounted file system is described by a <tt>vfsmount</tt> data structure;
see figure <A href="#mounted-figure">9.6</A>.
These are queued on a list pointed at by <tt>vfsmntlist</tt>.

<p>
Another pointer, <tt>vfsmnttail</tt>, points at the last entry in the list and
the <tt>mru_vfsmnt</tt> pointer points at the most recently used file system.
Each <tt>vfsmount</tt> structure contains the device number of the block device
holding the file system, the directory where this file system is mounted and a
pointer to the VFS superblock allocated when this file system was mounted.
In turn the VFS superblock points at the <tt>file_system_type</tt> data structure
for this sort of file system and at the root inode for this file system.
This inode is kept resident in the VFS inode cache all of the time that
this file system is mounted.

<p>
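<p>
The queueing of <tt>vfsmount</tt> structures onto <tt>vfsmntlist</tt> can be sketched as
follows. The field names follow the text but are illustrative assumptions rather than the
exact kernel declarations.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the mounted file system list: vfsmntlist is the head,
 * vfsmnttail tracks the end so new mounts are appended in order. */
struct vfs_superblock;               /* opaque here */

struct vfsmount {
    unsigned short mnt_dev;          /* block device of the mounted fs */
    const char *mnt_dirname;         /* where it is mounted            */
    struct vfs_superblock *mnt_sb;   /* superblock set up at mount     */
    struct vfsmount *mnt_next;
};

static struct vfsmount *vfsmntlist = NULL;
static struct vfsmount *vfsmnttail = NULL;

void add_vfsmnt(struct vfsmount *mnt)
{
    mnt->mnt_next = NULL;
    if (vfsmnttail == NULL)          /* first mount: list was empty */
        vfsmntlist = mnt;
    else
        vfsmnttail->mnt_next = mnt;  /* append at the tail          */
    vfsmnttail = mnt;
}

struct vfsmount *lookup_vfsmnt(unsigned short dev)
{
    struct vfsmount *m;
    for (m = vfsmntlist; m != NULL; m = m->mnt_next)
        if (m->mnt_dev == dev)
            return m;
    return NULL;                     /* that device is not mounted  */
}
```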
<H3><A NAME="tth_sEc9.2.5">9.2.5 </A> Finding a File in the Virtual File System</H3>
<p>
To find the VFS inode of a file in the Virtual File System, VFS must resolve the
name a directory at a time, looking up the VFS inode representing each of the
intermediate directories in the name.
Each directory lookup involves calling the file system specific lookup routine whose address
is held in the VFS inode representing the parent directory.
This works because we always have the VFS inode of the root of each file system
available and pointed at by the VFS superblock for that file system.
Each time an inode is looked up by the real file system it checks the directory
cache for the directory.
If there is no entry in the directory cache, the real file system gets the VFS
inode either from the underlying file system or from the inode cache.

<p>
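<p>
The component-at-a-time resolution described above can be sketched as below. Here
<tt>lookup()</tt> is a stand-in for the file system specific lookup routine; everything in
this sketch is illustrative, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NAME 256

/* Split out the next path component into out[]; return a pointer
 * just past it so the caller can continue walking the name. */
static const char *next_component(const char *path, char *out)
{
    size_t n = 0;
    while (*path == '/')
        path++;                       /* skip the separators     */
    while (*path && *path != '/')
        out[n++] = *path++;
    out[n] = '\0';
    return path;
}

/* Count of components visited, so a test lookup can observe the walk. */
static int visited;
static int count_lookup(const char *component)
{
    (void)component;
    visited++;                        /* pretend the lookup succeeded */
    return 1;
}

/* Walk a name such as "/usr/X11R6/bin", calling the per-component
 * lookup routine once per intermediate directory. */
int resolve(const char *path, int (*lookup)(const char *component))
{
    char comp[MAX_NAME];
    while (*path) {
        path = next_component(path, comp);
        if (comp[0] == '\0')
            break;                    /* trailing '/'            */
        if (!lookup(comp))
            return 0;                 /* component did not exist */
    }
    return 1;
}
```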
<H3><A NAME="tth_sEc9.2.6">9.2.6 </A> Creating a File in the Virtual File System</H3>
<p>
<H3><A NAME="tth_sEc9.2.7">9.2.7 </A> Unmounting a File System</H3>
<p>
The workshop manual for my MG usually describes assembly as the reverse of
disassembly and the reverse is more or less true for unmounting a file system.

<p>
A file system cannot be unmounted if something in the system is using one of its
files.
So, for example, you cannot <font face="helvetica">umount</font> <tt>/mnt/cdrom</tt> if a process is using that
directory or any of its children.
If anything is using the file system to be unmounted there may be VFS inodes
from it in the VFS inode cache, and the code checks for this by looking through
the list of inodes for inodes owned by the device that this file system
occupies.
If the VFS superblock for the mounted file system is dirty, that is it has
been modified, then it must be written back to the file system on disk.
Once it has been written to disk, the memory occupied by the VFS superblock
is returned to the kernel's free pool of memory.
Finally the <tt>vfsmount</tt> data structure for this mount is unlinked from
<tt>vfsmntlist</tt> and freed.

<p>
<H3><A NAME="tth_sEc9.2.8">9.2.8 </A> The VFS Inode Cache</H3>
<p>
As the mounted file systems are navigated, their VFS inodes are continually being
read and, in some cases, written.
The Virtual File System maintains an inode cache to speed up accesses to all of
the mounted file systems.
Every time a VFS inode is read from the inode cache the system saves an access to
a physical device.

<p>
The VFS inode cache is implemented as a hash table whose entries are pointers to
lists of VFS inodes that have the same hash value.
The hash value of an inode is calculated from its inode number and from the device
identifier for the underlying physical device containing the file system.
Whenever the Virtual File System needs to access an inode, it first looks in the
VFS inode cache.
To find an inode in the cache, the system first calculates its hash value and then uses
it as an index into the inode hash table.
This gives it a pointer to a list of inodes with the same hash value.
It then reads each inode in turn until it finds one with both the same inode
number and the same device identifier as the one that it is searching for.

<p>
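<p>
The lookup just described can be sketched as a small hash table of chains keyed on
(device, inode number). The table size and hash function here are illustrative choices,
not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IHASH 32

struct cached_inode {
    unsigned short i_dev;
    unsigned long  i_ino;
    unsigned short i_count;           /* incremented on each cache hit */
    struct cached_inode *i_hash_next;
};

static struct cached_inode *inode_hashtable[NR_IHASH];

/* Hash on both the inode number and the device identifier. */
static unsigned int ihash(unsigned short dev, unsigned long ino)
{
    return (unsigned int)(dev ^ ino) % NR_IHASH;
}

void insert_inode_hash(struct cached_inode *inode)
{
    unsigned int h = ihash(inode->i_dev, inode->i_ino);
    inode->i_hash_next = inode_hashtable[h];
    inode_hashtable[h] = inode;
}

/* Walk the chain for this hash value until both keys match. */
struct cached_inode *iget(unsigned short dev, unsigned long ino)
{
    struct cached_inode *i = inode_hashtable[ihash(dev, ino)];
    for (; i != NULL; i = i->i_hash_next)
        if (i->i_dev == dev && i->i_ino == ino) {
            i->i_count++;             /* one more user of this inode */
            return i;
        }
    return NULL;                      /* miss: read from the real fs */
}
```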
If it can find the inode in the cache, its count is incremented to show that it
has another user and the file system access continues.
Otherwise a free VFS inode must be found so that the file system can read the
inode into memory.
VFS has a number of choices about how to get a free inode.
If the system may allocate more VFS inodes then this is what it does; it
allocates kernel pages and breaks them up into new, free inodes and puts them
into the inode list.
All of the system's VFS inodes are in a list pointed at by <tt>first_inode</tt> as well
as in the inode hash table.
If the system already has all of the inodes that it is allowed to have, it must
find an inode that is a good candidate to be reused.
Good candidates are inodes with a usage count of zero; this indicates that the
system is not currently using them.
Really important VFS inodes, for example the root inodes of file systems, always have
a usage count greater than zero and so are never candidates for reuse.
Once a candidate for reuse has been located it must be cleaned up before it can
be reused.
The VFS inode might be dirty, in which case it needs to be written back to the file system,
or it might be locked, in which case the system must wait for it to be unlocked before
continuing.

<p>
However the new VFS inode is found, a file system specific routine must be called
to fill it out from information read from the underlying real file system.
Whilst it is being filled out, the new VFS inode has a usage count of one and is
locked so that nothing else accesses it until it contains valid information.

<p>
To get the VFS inode that is actually needed, the file system may need to access
several other inodes.
This happens when you read a directory; only the inode for the final directory is
needed but the inodes for the intermediate directories must also be read.
As the VFS inode cache is used and filled up, the less used inodes will be discarded
and the more used inodes will remain in the cache.

<p>
<H3><A NAME="tth_sEc9.2.9">9.2.9 </A> The Directory Cache</H3>
<p>
To speed up accesses to commonly used directories, the VFS maintains a cache of
directory entries.

<p>
As directories are looked up by the real file systems their details are added into
the directory cache.
The next time the same directory is looked up, for example to list it or open a
file within it, it will be found in the directory cache.
Only short directory entries (up to 15 characters long) are cached, but this is
reasonable as the shorter directory names are the most commonly used ones.
For example, <tt>/usr/X11R6/bin</tt> is very commonly accessed when the X server is
running.

<p>
The directory cache consists of a hash table, each entry of which points at
a list of directory cache entries that have the same hash value.
The hash function uses the device number of the device holding the file system and
the directory's name to calculate the offset, or index, into the hash table.
It allows cached directory entries to be quickly found.
There is no point in having a cache if lookups within it take too long to find entries,
or fail to find them at all.

<p>
In an effort to keep the caches valid and up to date the VFS keeps lists of
Least Recently Used (LRU) directory cache entries.
When a directory entry is first put into the cache, which is when it is first
looked up, it is added onto the end of the first level LRU list.
In a full cache this will displace an existing entry from the front of the
LRU list.
As the directory entry is accessed again it is promoted to the back of the
second LRU cache list.
Again, this may displace a cached level two directory entry at the front of
the level two LRU cache list.
This displacing of entries at the front of the level one and level two LRU lists
is fine.
The only reason that entries are at the front of the lists is that they have not
been recently accessed.
If they had been, they would be nearer the back of the lists.
The entries in the second level LRU cache list are safer than entries in the level one
LRU cache list.
This is the intention, as these entries have not only been looked up but have
been repeatedly referenced.

<p>
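<p>
The promotion rule between the two LRU levels can be modelled very simply. This sketch
captures only that rule, not the hash table or the list handling; all names are
illustrative.

```c
#include <assert.h>

/* A new entry starts on the level-one LRU list; a repeat access
 * promotes it to level two, where it is safer from displacement. */
enum { LRU_LEVEL1 = 1, LRU_LEVEL2 = 2 };

struct dir_cache_entry {
    const char *name;                /* full directory name (<= 15 chars) */
    unsigned long ino;
    int lru_level;
};

void dcache_add(struct dir_cache_entry *de)
{
    de->lru_level = LRU_LEVEL1;      /* first lookup: back of level one   */
}

void dcache_access(struct dir_cache_entry *de)
{
    de->lru_level = LRU_LEVEL2;      /* used again: promote to level two  */
}
```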
<H2><A NAME="tth_sEc9.3">9.3 </A> The Buffer Cache</H2>
<p>

<p><A NAME="tth_fIg9.7"></A>
<center><center> <img src="buffer-cache.gif"><br>
<p>
</center></center><center> Figure 9.7: The Buffer Cache</center>
<A NAME="buffer-cache-figure"></A>
<p>
<p>As the mounted file systems are used they generate a lot of requests to the block
devices to read and write data blocks.
All block data read and write requests are given to the device drivers in the form
of <tt>buffer_head</tt> data structures via standard kernel routine calls.
These give all of the information that the block device drivers need; the device
identifier uniquely identifies the device and the block number tells the driver
which block to read.
All block devices are viewed as linear collections of blocks of the same size.
To speed up access to the physical block devices, Linux maintains a cache of block
buffers.
All of the block buffers in the system are kept somewhere in this buffer cache, even
the new, unused buffers.
This cache is shared between all of the physical block devices; at any one time
there are many block buffers in the cache, belonging to any one of the system's
block devices and often in many different states.
If valid data is available from the buffer cache this saves the system
an access to a physical device.
Any block buffer that has been used to read data from a block device or to write data
to it goes into the buffer cache.
Over time it may be removed from the cache to make way for a more deserving
buffer or it may remain in the cache as it is frequently accessed.

<p>
Block buffers within the cache are uniquely identified by the owning device
identifier and the block number of the buffer.
The buffer cache is composed of two functional parts.
The first part is the lists of free block buffers.
There is one list per supported buffer size and the system's free block buffers
are queued onto these lists when they are first created or when they have been
discarded.
The currently supported buffer sizes are 512, 1024, 2048, 4096 and 8192 bytes.
The second functional part is the cache itself.
This is a hash table which is a vector of pointers to chains of buffers that have
the same hash index.
The hash index is generated from the owning device identifier and the block number
of the data block.
Figure <A href="#buffer-cache-figure">9.7</A> shows the hash table together with a few entries.
Block buffers are either in one of the free lists or they are in the buffer cache.
When they are in the buffer cache they are also queued onto Least Recently Used
(LRU) lists.
There is an LRU list for each buffer type and
these are used by the system to perform work on buffers of a given type, for example, writing
buffers with new data in them out to disk.
The buffer's type reflects its state and Linux currently supports the following types:
<DL compact>
<p>
<dt><b>clean</b></dt><dd> Unused, new buffers,
<dt><b>locked</b></dt><dd> Buffers that are locked, waiting to be written,
<dt><b>dirty</b></dt><dd> Dirty buffers. These contain new, valid data and will be written
out, but have not yet been scheduled for writing,
<dt><b>shared</b></dt><dd> Shared buffers,
<dt><b>unshared</b></dt><dd> Buffers that were once shared but which are now not shared.
</DL>

<p>
Whenever a file system needs to read a buffer from its underlying physical
device, it tries to get a block from the buffer cache.
If it cannot get a buffer from the buffer cache, then it will get a clean one from
the appropriately sized free list and this new buffer will go into the buffer cache.
If the buffer that it needed is in the buffer cache, then it may or may not be up to
date.
If it is not up to date or if it is a new block buffer, the file system must request
that the device driver read the appropriate block of data from the disk.

<p>
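<p>
The read path just described can be sketched as follows. The names are illustrative
stand-ins for the real <tt>getblk()</tt>/<tt>bread()</tt> machinery: look in the cache first,
fall back to a free buffer, and only ask the device driver to read when the buffer does not
already hold valid data.

```c
#include <assert.h>
#include <stddef.h>

struct buffer_head {
    unsigned short b_dev;
    unsigned long  b_blocknr;
    int b_uptodate;                  /* does it hold valid data?       */
    struct buffer_head *b_next;
};

static struct buffer_head *buffer_cache = NULL;  /* one chain, for brevity */
static int driver_reads = 0;                     /* counts real disk reads */

static struct buffer_head *find_buffer(unsigned short dev, unsigned long blk)
{
    struct buffer_head *bh;
    for (bh = buffer_cache; bh != NULL; bh = bh->b_next)
        if (bh->b_dev == dev && bh->b_blocknr == blk)
            return bh;
    return NULL;
}

/* Return a buffer full of valid data, touching the disk only on a miss. */
struct buffer_head *bread(unsigned short dev, unsigned long blk,
                          struct buffer_head *free_bh)
{
    struct buffer_head *bh = find_buffer(dev, blk);
    if (bh == NULL) {                /* miss: take a free buffer ...     */
        bh = free_bh;
        bh->b_dev = dev;
        bh->b_blocknr = blk;
        bh->b_uptodate = 0;
        bh->b_next = buffer_cache;   /* ... and insert it into the cache */
        buffer_cache = bh;
    }
    if (!bh->b_uptodate) {           /* ask the device driver to fill it */
        driver_reads++;
        bh->b_uptodate = 1;
    }
    return bh;
}
```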
Like all caches, the buffer cache must be maintained so that it runs efficiently
and fairly allocates cache entries between the block devices using the buffer
cache.
Linux uses the <tt>bdflush</tt> kernel daemon to perform a lot of housekeeping duties
on the cache, but some happen automatically as a result of the cache being used.

<p>
<H3><A NAME="tth_sEc9.3.1">9.3.1 </A> The <tt>bdflush</tt> Kernel Daemon</H3>
<p>
The <tt>bdflush</tt> kernel daemon is a simple kernel daemon that provides a dynamic
response to the system having too many dirty buffers; buffers that contain data
that must be written out to disk at some time.
It is started as a kernel thread at system startup time and, rather confusingly,
it calls itself ``kflushd'' and that is the name that you will see if you use
the <font face="helvetica">ps</font> command to show the processes in the system.
Mostly this daemon sleeps, waiting for the number of dirty buffers in the
system to grow too large.
As buffers are allocated and discarded the number of dirty buffers in the system
is checked.
If too many of the system's buffers are dirty, as a percentage of the total, then
<tt>bdflush</tt> is woken up.
The default threshold is 60% but, if the system is desperate for buffers, <tt>bdflush</tt>
will be woken up anyway.
This value can be seen and changed using the <font face="helvetica">update</font> command:
<pre>
# update -d

bdflush version 1.4
0:    60 Max fraction of LRU list to examine for dirty blocks
1:   500 Max number of dirty blocks to write each time bdflush activated
2:    64 Num of clean buffers to be loaded onto free list by refill_freelist
3:   256 Dirty block threshold for activating bdflush in refill_freelist
4:    15 Percentage of cache to scan for free clusters
5:  3000 Time for data buffers to age before flushing
6:   500 Time for non-data (dir, bitmap, etc) buffers to age before flushing
7:  1884 Time buffer cache load average constant
8:     2 LAV ratio (used to determine threshold for buffer fratricide).
</pre>

<p>
All of the dirty buffers are linked into the <tt>BUF_DIRTY</tt> LRU list whenever they
are made dirty by having data written to them, and <tt>bdflush</tt> tries to write a
reasonable number of them out to their owning disks.
Again this number can be seen and controlled with the <font face="helvetica">update</font> command; the
default is 500 (see above).

<p>
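<p>
The wake-up test described above amounts to a percentage comparison, which can be sketched
as below. The function name is illustrative; the 60% default matches the text.

```c
#include <assert.h>

/* Sketch of the bdflush wake-up check: wake the daemon once dirty
 * buffers exceed a percentage of all buffers (60% by default). */
static int nfract = 60;              /* dirty-buffer threshold, per cent */

int bdflush_needed(int dirty_buffers, int total_buffers)
{
    /* Compare dirty/total against nfract/100 without division. */
    return dirty_buffers * 100 > total_buffers * nfract;
}
```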
<H3><A NAME="tth_sEc9.3.2">9.3.2 </A> The <font face="helvetica">update</font> Process</H3>
<p>
The <font face="helvetica">update</font> command is more than just a command; it is also a daemon.
When run as superuser (during system initialisation) it will periodically flush
all of the older dirty buffers out to disk.
It does this by calling a system service routine that does more or less the same thing
as <tt>bdflush</tt>.
Whenever a dirty buffer is finished with, it is tagged with the system time at which
it should be written out to its owning disk.
Every time that <font face="helvetica">update</font> runs it looks at all of the dirty buffers in the system,
searching for ones with an expired flush time.
Every expired buffer is written out to disk.

<p>
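<p>
The periodic sweep just described can be sketched as follows. Each dirty buffer carries
the time at which it falls due; every pass writes out the expired ones. The names and the
time representation are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

struct dirty_buffer {
    long b_flushtime;                /* when this buffer falls due */
    int  b_dirty;
};

/* Returns how many buffers were written out on this pass. */
int sync_old_buffers(struct dirty_buffer *bufs, int n, long now)
{
    int i, written = 0;
    for (i = 0; i < n; i++)
        if (bufs[i].b_dirty && bufs[i].b_flushtime <= now) {
            bufs[i].b_dirty = 0;     /* "write" it back to disk */
            written++;
        }
    return written;
}
```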
<H2><A NAME="tth_sEc9.4">9.4 </A> The /proc File System</H2>
<p>
The <tt>/proc</tt> file system really shows the power of the Linux Virtual File System.
It does not really exist (yet another of Linux's conjuring tricks); neither
the <tt>/proc</tt> directory nor its subdirectories nor its files actually exist.
So how can you <font face="helvetica">cat</font> <tt>/proc/devices</tt>?
The <tt>/proc</tt> file system, like a real file system, registers itself with the
Virtual File System.
However, when the VFS makes calls to it requesting inodes as its files and
directories are opened, the <tt>/proc</tt> file system creates those files and
directories from information within the kernel.
For example, the kernel's <tt>/proc/devices</tt> file is generated from the kernel's
data structures describing its devices.

<p>
The <tt>/proc</tt> file system presents a user readable window into the kernel's
inner workings.
Several Linux subsystems, such as the Linux kernel modules described in
chapter <A href="../modules/modules.html">Modules</A>, create entries in the <tt>/proc</tt> file system.

<p>
<H2><A NAME="tth_sEc9.5">9.5 </A> Device Special Files</H2>
|
|||
|
|
|||
|
<p>
|
|||
|
Linux, like all versions of Unix <sup><font size=-4><tt>T</tt>M</font></sup>, presents its hardware devices as special files.
So, for example, <tt>/dev/null</tt> is the null device.
A device file does not use any data space in the file system; it is only an access
point to the device driver.
The EXT2 file system and the Linux VFS both implement device files as special types of
inode.
There are two types of device file: character and block special files.
Within the kernel itself, the device drivers implement file semantics: you can open them,
close them and so on.
Character devices allow I/O operations in character mode whereas block devices require
that all I/O is via the buffer cache.
When an I/O request is made to a device file, it is forwarded to the appropriate
device driver within the system.
Often this is not a real device driver but a pseudo-device driver for some subsystem such
as the SCSI device driver layer.
Device files are referenced by a major number, which identifies the device type, and a
minor number, which identifies the unit, or instance, of that major type.
For example, the IDE disks on the first IDE controller in the system
have a major number of 3, and the first partition of an
IDE disk would have a minor number of 1. So, <tt>ls -l</tt> of <tt>/dev/hda1</tt> gives:
<p>
<pre>
brw-rw----   1 root   disk   3,   1  Nov 24 15:09 /dev/hda1
</pre>Within the kernel, every device is uniquely described by a <tt>kdev_t</tt> data type. This
is two bytes long, the first byte containing the minor device number and the second
byte holding the major device number.
<p>
The IDE device above is held within the kernel as <em>0x0301</em>.
An EXT2 inode that represents a block or character device keeps the device's major
and minor numbers in its first direct block pointer.
When it is read by the VFS, the VFS inode data structure representing it has its <tt>i_rdev</tt>
field set to the correct device identifier.
<p>
<hr><H3>Footnotes:</H3>
<p><a name=tthFtNtAAB></a><a href="#tthFrefAAB"><sup>1</sup></a> Well, not
knowingly, although I have been bitten by operating systems with more lawyers than Linux
has developers.
<p><hr><small>File translated from T<sub><font size=-1>E</font></sub>X by <a href="http://hutchinson.belmont.ma.us/tth/tth.html">T<sub><font size=-1>T</font></sub>H</a>, version 1.0.</small>
<hr>
<center>
<A HREF="../fs/filesystem.html"> Top of Chapter</A>,
<A HREF="../tlk-toc.html"> Table of Contents</A>,
<A href="../tlk.html" target="_top"> Show Frames</A>,
<A href="../fs/filesystem.html" target="_top"> No Frames</A><br>
&copy; 1996-1999 David A Rusling <A HREF="../misc/copyright.html">copyright notice</a>.
</center>
</HTML>