<!doctype linuxdoc system>
|
|
|
|
<article>
|
|
<title>Linux Kernel 2.4 Internals
|
|
<author>Tigran Aivazian <tt>tigran@veritas.com</tt>
|
|
<date>7 August 2002 (29 Av 6001)
|
|
<abstract>
|
|
Introduction to the Linux 2.4 kernel. The latest copy of this document
can always be downloaded from:
|
|
|
|
<url url="http://www.moses.uklinux.net/patches/lki.sgml">
|
|
|
|
This guide is now part of the Linux Documentation Project and can also be
|
|
downloaded in various formats from:
|
|
|
|
<url url="http://www.linuxdoc.org/guides.html">
|
|
|
|
or can be read online (latest version) at:
|
|
|
|
<url url="http://www.moses.uklinux.net/patches/lki.html">
|
|
|
|
This documentation is free software; you can redistribute
|
|
it and/or modify it under the terms of the GNU General Public
|
|
License as published by the Free Software Foundation; either
|
|
version 2 of the License, or (at your option) any later version.
|
|
|
|
The author works as a senior Linux kernel engineer at VERITAS Software
Ltd and wrote this book to support the short training
course/lectures he gave on this subject internally at VERITAS.
|
|
Thanks to
|
|
Juan J. Quintela <tt>(quintela@fi.udc.es)</tt>,
|
|
Francis Galiegue <tt>(fg@mandrakesoft.com)</tt>,
|
|
Hakjun Mun <tt>(juniorm@orgio.net)</tt>,
|
|
Matt Kraai <tt>(kraai@alumni.carnegiemellon.edu)</tt>,
|
|
Nicholas Dronen <tt>(ndronen@frii.com)</tt>,
|
|
Samuel S Chessman <tt>(chessman@tux.org)</tt>,
|
|
Nadeem Hasan <tt>(nhasan@nadmm.com)</tt>,
|
|
Michael Svetlik <tt>(m.svetlik@ssi-schaefer-peem.com)</tt>
|
|
for various corrections and suggestions.
|
|
|
|
The Linux Page Cache chapter was written by:
|
|
Christoph Hellwig <tt>(hch@caldera.de)</tt>.
|
|
|
|
The IPC Mechanisms chapter was written by:
|
|
Russell Weight <tt>(weightr@us.ibm.com)</tt> and Mingming Cao <tt>(mcao@us.ibm.com)</tt>
|
|
|
|
</abstract>
|
|
|
|
<toc>
|
|
|
|
<sect>Booting<p>
|
|
<sect1>Building the Linux Kernel Image<p>
|
|
This section explains the steps taken during compilation of the Linux kernel
|
|
and the output produced at each stage.
|
|
The build process depends on the architecture so I would like to emphasize
|
|
that we only consider building a Linux/x86 kernel.
|
|
|
|
When the user types 'make zImage' or 'make bzImage' the resulting bootable
|
|
kernel image is stored as
|
|
<tt>arch/i386/boot/zImage</tt> or
|
|
<tt>arch/i386/boot/bzImage</tt> respectively.
|
|
Here is how the image is built:
|
|
<enum>
|
|
<item> C and assembly source files are compiled into ELF relocatable object format (.o) and
|
|
some of them are grouped logically into archives (.a) using
|
|
<bf>ar(1)</bf>.
|
|
|
|
<item> Using <bf>ld(1)</bf>, the above .o and .a are linked into <tt>vmlinux</tt> which is a
|
|
statically linked, non-stripped ELF 32-bit LSB 80386 executable file.
|
|
|
|
<item> <tt>System.map</tt> is produced by <bf>nm vmlinux</bf>; irrelevant or uninteresting
symbols are grepped out.
|
|
|
|
<item> Enter directory <tt>arch/i386/boot</tt>.
|
|
|
|
<item> Bootsector asm code <tt>bootsect.S</tt> is preprocessed either with or without
|
|
<bf>-D__BIG_KERNEL__</bf>, depending on whether the target is
|
|
bzImage or zImage, into <tt>bbootsect.s</tt> or <tt>bootsect.s</tt> respectively.
|
|
|
|
<item> <tt>bbootsect.s</tt> is assembled and then converted into 'raw binary' form
|
|
called <tt>bbootsect</tt> (or <tt>bootsect.s</tt> assembled and raw-converted into
|
|
<tt>bootsect</tt> for zImage).
|
|
|
|
<item> Setup code <tt>setup.S</tt> (<tt>setup.S</tt> includes <tt>video.S</tt>) is preprocessed into
<tt>bsetup.s</tt> for bzImage or <tt>setup.s</tt> for zImage. As with the
bootsector code, the difference is that <bf>-D__BIG_KERNEL__</bf> is present
for bzImage. The result is then assembled and converted into 'raw binary' form
called <tt>bsetup</tt> (or <tt>setup</tt> for zImage).
|
|
|
|
<item> Enter directory <tt>arch/i386/boot/compressed</tt> and convert
|
|
<tt>/usr/src/linux/vmlinux</tt> to $tmppiggy (tmp filename) in raw binary
|
|
format, removing <tt>.note</tt> and <tt>.comment</tt> ELF sections.
|
|
|
|
<item> <bf>gzip -9 < $tmppiggy > $tmppiggy.gz</bf>
|
|
|
|
<item> Link $tmppiggy.gz into ELF relocatable (<bf>ld -r</bf>) <tt>piggy.o</tt>.
|
|
|
|
<item> Compile decompression routines <tt>head.S</tt> and <tt>misc.c</tt> (still in
<tt>arch/i386/boot/compressed</tt> directory) into ELF objects <tt>head.o</tt> and
<tt>misc.o</tt>.
|
|
|
|
<item> Link together <tt>head.o</tt>, <tt>misc.o</tt> and <tt>piggy.o</tt> into <tt>bvmlinux</tt> (or <tt>vmlinux</tt> for
zImage; do not mistake this for <tt>/usr/src/linux/vmlinux</tt>!). Note the
difference between <bf>-Ttext 0x1000</bf> used for <tt>vmlinux</tt> and <bf>-Ttext 0x100000</bf>
for <tt>bvmlinux</tt>, i.e. for bzImage the decompression loader is loaded high.
|
|
|
|
<item> Convert <tt>bvmlinux</tt> to 'raw binary' <tt>bvmlinux.out</tt> removing <tt>.note</tt> and
|
|
<tt>.comment</tt> ELF sections.
|
|
|
|
<item> Go back to <tt>arch/i386/boot</tt> directory and, using the program <bf>tools/build</bf>,
|
|
cat together <tt>bbootsect</tt>, <tt>bsetup</tt> and <tt>compressed/bvmlinux.out</tt> into <tt>bzImage</tt>
|
|
(delete extra 'b' above for <tt>zImage</tt>). This writes important variables
|
|
like <tt>setup_sects</tt> and <tt>root_dev</tt> at the end of the bootsector.
|
|
</enum>
|
|
The size of the bootsector is always 512 bytes. The size of the setup must
be at least 4 sectors but is limited above by about 12K. The rule
is:
|
|
|
|
0x4000 bytes >= 512 + setup_sects * 512 + room for stack while running
|
|
bootsector/setup
|
|
|
|
We will see later where this limitation comes from.
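
The size rule above can be turned into a quick sanity check. The following is a minimal user-space sketch, not code from the kernel tree: the function names and the <tt>stack_room</tt> parameter are hypothetical, and rounding the setup size up to whole sectors with a four-sector minimum is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    SECTOR_SIZE     = 512,    /* one disk sector, also the bootsector size */
    MIN_SETUP_SECTS = 4,      /* assumed lower bound on the setup size */
    LOW_LIMIT       = 0x4000  /* bound taken from the rule quoted above */
};

/* Sectors needed for setup_bytes of setup code (assumed rounding rule). */
unsigned setup_sects_for(unsigned long setup_bytes)
{
    unsigned sects = (unsigned)((setup_bytes + SECTOR_SIZE - 1) / SECTOR_SIZE);

    return sects < MIN_SETUP_SECTS ? MIN_SETUP_SECTS : sects;
}

/* True if bootsector + setup + stack_room bytes fit below LOW_LIMIT. */
bool setup_fits(unsigned setup_sects, unsigned long stack_room)
{
    unsigned long used = SECTOR_SIZE + (unsigned long)setup_sects * SECTOR_SIZE;

    assert(stack_room <= LOW_LIMIT);
    return used + stack_room <= LOW_LIMIT;
}
```

For instance, if 4K were reserved for the stack, <tt>setup_fits(23, 4096)</tt> would hold while <tt>setup_fits(24, 4096)</tt> would not, which is in line with the "about 12K" figure.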
|
|
|
|
The upper limit on the bzImage size produced at this step is about 2.5M for
|
|
booting with LILO and 0xFFFF paragraphs (0xFFFF0 = 1048560 bytes) for
|
|
booting raw image, e.g. from floppy disk or CD-ROM (El-Torito emulation mode).
|
|
|
|
Note that while <bf>tools/build</bf> does validate the size of boot sector, kernel image
|
|
and lower bound of setup size, it does not check the *upper* bound of said
|
|
setup size. Therefore it is easy to build a broken kernel by just adding some
|
|
large ".space" at the end of <tt>setup.S</tt>.
|
|
|
|
<sect1>Booting: Overview<p>
|
|
|
|
The boot process details are architecture-specific, so we shall
|
|
focus our attention on the IBM PC/IA32 architecture.
|
|
Due to old design and backward compatibility, the PC firmware boots the
|
|
operating system in an old-fashioned manner.
|
|
This process can be separated into the following six logical stages:
|
|
|
|
<enum>
|
|
<item> BIOS selects the boot device.
|
|
<item> BIOS loads the bootsector from the boot device.
|
|
<item> Bootsector loads setup, decompression routines and compressed kernel
|
|
image.
|
|
<item> The kernel is uncompressed in protected mode.
|
|
<item> Low-level initialisation is performed by asm code.
|
|
<item> High-level C initialisation.
|
|
</enum>
|
|
|
|
<sect1>Booting: BIOS POST<p>
|
|
|
|
<enum>
|
|
<item> The power supply starts the clock generator and asserts #POWERGOOD
|
|
signal on the bus.
|
|
<item> CPU #RESET line is asserted (CPU now in real 8086 mode).
|
|
<item> %ds=%es=%fs=%gs=%ss=0, %cs=0xF000 (hidden base 0xFFFF0000), %eip=0x0000FFF0 (ROM BIOS POST code).
|
|
<item> All POST checks are performed with interrupts disabled.
|
|
<item> IVT (Interrupt Vector Table) initialised at address 0.
|
|
<item> The BIOS Bootstrap Loader function is invoked via <bf>int 0x19</bf>,
|
|
with %dl containing the boot device 'drive number'. This loads
|
|
track 0, sector 1 at physical address 0x7C00 (0x07C0:0000).
|
|
</enum>
|
|
|
|
<sect1>Booting: bootsector and setup<p>
|
|
|
|
The bootsector used to boot the Linux kernel can be one of:
|
|
|
|
<itemize>
|
|
<item> Linux bootsector (<tt>arch/i386/boot/bootsect.S</tt>),
|
|
<item> LILO (or other bootloader's) bootsector, or
|
|
<item> no bootsector (loadlin etc)
|
|
</itemize>
|
|
|
|
We consider here the Linux bootsector in detail.
|
|
The first few lines initialise the convenience macros to be used for segment
|
|
values:
|
|
|
|
<tscreen><code>
|
|
29 SETUPSECS = 4 /* default nr of setup-sectors */
|
|
30 BOOTSEG = 0x07C0 /* original address of boot-sector */
|
|
31 INITSEG = DEF_INITSEG /* we move boot here - out of the way */
|
|
32 SETUPSEG = DEF_SETUPSEG /* setup starts here */
|
|
33 SYSSEG = DEF_SYSSEG /* system loaded at 0x10000 (65536) */
|
|
34 SYSSIZE = DEF_SYSSIZE /* system size: # of 16-byte clicks */
|
|
</code></tscreen>
|
|
|
|
(The numbers on the left are the line numbers in <tt>bootsect.S</tt>.)
|
|
The values of <tt>DEF_INITSEG</tt>, <tt>DEF_SETUPSEG</tt>, <tt>DEF_SYSSEG</tt> and <tt>DEF_SYSSIZE</tt> are taken
|
|
from <tt>include/asm/boot.h</tt>:
|
|
|
|
<tscreen><code>
|
|
/* Don't touch these, unless you really know what you're doing. */
|
|
#define DEF_INITSEG 0x9000
|
|
#define DEF_SYSSEG 0x1000
|
|
#define DEF_SETUPSEG 0x9020
|
|
#define DEF_SYSSIZE 0x7F00
|
|
</code></tscreen>
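
To see what these defaults imply, the arithmetic can be spelled out. This is an illustrative sketch only: the helper names are hypothetical, and it assumes, as the comment on <tt>SYSSIZE</tt> says, that a click is 16 bytes and that a real-mode segment value is likewise a physical address divided by 16.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from include/asm/boot.h as quoted above. */
#define DEF_INITSEG  0x9000
#define DEF_SYSSEG   0x1000
#define DEF_SETUPSEG 0x9020
#define DEF_SYSSIZE  0x7F00

/* Convert a segment value or a click count to bytes (16 bytes each). */
uint32_t paragraphs_to_bytes(uint32_t paragraphs)
{
    assert(paragraphs <= 0x0FFFFFFFu);  /* keep the shift from overflowing */
    return paragraphs << 4;
}

/* First physical address past a default-sized image loaded at DEF_SYSSEG. */
uint32_t default_image_end(void)
{
    return paragraphs_to_bytes(DEF_SYSSEG) + paragraphs_to_bytes(DEF_SYSSIZE);
}
```

With these numbers a default-sized image occupies 0x10000 up to 0x8F000, which stays below the relocated bootsector at 0x90000 and the setup code at 0x90200.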
|
|
|
|
Now, let us consider the actual code of <tt>bootsect.S</tt>:
|
|
|
|
<tscreen><code>
|
|
54 movw $BOOTSEG, %ax
|
|
55 movw %ax, %ds
|
|
56 movw $INITSEG, %ax
|
|
57 movw %ax, %es
|
|
58 movw $256, %cx
|
|
59 subw %si, %si
|
|
60 subw %di, %di
|
|
61 cld
|
|
62 rep
|
|
63 movsw
|
|
64 ljmp $INITSEG, $go
|
|
|
|
65 # bde - changed 0xff00 to 0x4000 to use debugger at 0x6400 up (bde). We
|
|
66 # wouldn't have to worry about this if we checked the top of memory. Also
|
|
67 # my BIOS can be configured to put the wini drive tables in high memory
|
|
68 # instead of in the vector table. The old stack might have clobbered the
|
|
69 # drive table.
|
|
|
|
70 go: movw $0x4000-12, %di # 0x4000 is an arbitrary value >=
|
|
71 # length of bootsect + length of
|
|
72 # setup + room for stack;
|
|
73 # 12 is disk parm size.
|
|
74 movw %ax, %ds # ax and es already contain INITSEG
|
|
75 movw %ax, %ss
|
|
76 movw %di, %sp # put stack at INITSEG:0x4000-12.
|
|
</code></tscreen>
|
|
|
|
Lines 54-63 move the bootsector code from address 0x7C00 to 0x90000.
|
|
This is achieved by:
|
|
|
|
<enum>
|
|
<item> set %ds:%si to $BOOTSEG:0 (0x7C0:0 = 0x7C00)
|
|
|
|
<item> set %es:%di to $INITSEG:0 (0x9000:0 = 0x90000)
|
|
|
|
<item> set the number of 16bit words in %cx (256 words = 512 bytes = 1 sector)
|
|
|
|
<item> clear DF (direction) flag in EFLAGS to auto-increment addresses (cld)
|
|
|
|
<item> go ahead and copy 512 bytes (rep movsw)
|
|
</enum>
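
The copy loop can be mimicked in C to make the register roles explicit. This is a sketch under stated assumptions, not kernel code: <tt>mem</tt> stands in for physical memory, the function names are hypothetical, and real-mode address wrap-around is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Real-mode physical address of seg:off (20-bit wrap-around ignored). */
uint32_t real_mode_addr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/*
 * Emulate "cld; rep movsw": copy cx 16-bit words from ds:si to es:di,
 * advancing si and di by 2 after each word because DF is clear.
 */
void rep_movsw(uint8_t *mem, size_t mem_size,
               uint16_t ds, uint16_t si,
               uint16_t es, uint16_t di, uint16_t cx)
{
    assert(mem != NULL);
    while (cx-- > 0) {
        uint32_t src = real_mode_addr(ds, si);
        uint32_t dst = real_mode_addr(es, di);

        assert(src + 1 < mem_size && dst + 1 < mem_size);
        mem[dst]     = mem[src];
        mem[dst + 1] = mem[src + 1];
        si += 2;
        di += 2;
    }
}
```

Calling <tt>rep_movsw(mem, size, 0x07C0, 0, 0x9000, 0, 256)</tt> on a large enough buffer copies the 512 bytes at 0x7C00 to 0x90000, as lines 54-63 do.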
|
|
|
|
This code deliberately does not use <tt>rep movsd</tt> (hint: .code16).
|
|
|
|
Line 64 jumps to label <tt>go:</tt> in the newly made copy of the
bootsector, i.e. in segment 0x9000. The instructions that follow
(lines 70-76) prepare the stack at $INITSEG:0x4000-0xC, i.e.
%ss = $INITSEG (0x9000) and %sp = 0x3FF4 (0x4000-0xC). This is the origin of the
limit on setup size mentioned earlier (see Building the
Linux Kernel Image).
|
|
|
|
Lines 77-103 patch the disk parameter table for the first disk to
|
|
allow multi-sector reads:
|
|
|
|
<tscreen><code>
|
|
77 # Many BIOS's default disk parameter tables will not recognise
|
|
78 # multi-sector reads beyond the maximum sector number specified
|
|
79 # in the default diskette parameter tables - this may mean 7
|
|
80 # sectors in some cases.
|
|
81 #
|
|
82 # Since single sector reads are slow and out of the question,
|
|
83 # we must take care of this by creating new parameter tables
|
|
84 # (for the first disk) in RAM. We will set the maximum sector
|
|
85 # count to 36 - the most we will encounter on an ED 2.88.
|
|
86 #
|
|
87 # High doesn't hurt. Low does.
|
|
88 #
|
|
89 # Segments are as follows: ds = es = ss = cs - INITSEG, fs = 0,
|
|
90 # and gs is unused.
|
|
|
|
91 movw %cx, %fs # set fs to 0
|
|
92 movw $0x78, %bx # fs:bx is parameter table address
|
|
93 pushw %ds
|
|
94 ldsw %fs:(%bx), %si # ds:si is source
|
|
95 movb $6, %cl # copy 12 bytes
|
|
96 pushw %di # di = 0x4000-12.
|
|
97 rep # don't need cld -> done on line 66
|
|
98 movsw
|
|
99 popw %di
|
|
100 popw %ds
|
|
101 movb $36, 0x4(%di) # patch sector count
|
|
102 movw %di, %fs:(%bx)
|
|
103 movw %es, %fs:2(%bx)
|
|
</code></tscreen>
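
In C terms, lines 91-103 amount to copying a 12-byte table and changing one byte before redirecting the vector. The following is a minimal sketch assuming a flat view of memory; the function name and constants are hypothetical labels for the values used in the assembly (12-byte table, sector count at offset 4, new value 36), and the final step of storing the new table address into the vector at 0:0x78 is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    DISK_PARM_SIZE   = 12, /* bytes copied by "movb $6, %cl; rep movsw" */
    SECTOR_COUNT_OFF = 4,  /* offset patched by "movb $36, 0x4(%di)" */
    MAX_SECTOR_COUNT = 36  /* value quoted for an ED 2.88 diskette */
};

/* Copy the BIOS table into RAM and raise its maximum sector count. */
void patch_disk_params(const uint8_t *bios_table, uint8_t *ram_copy)
{
    assert(bios_table != NULL && ram_copy != NULL);
    memcpy(ram_copy, bios_table, DISK_PARM_SIZE);
    ram_copy[SECTOR_COUNT_OFF] = MAX_SECTOR_COUNT;
}
```

The original table is left untouched; only the RAM copy, which the vector will point to, carries the larger sector count.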
|
|
|
|
The floppy disk controller is reset using BIOS service int 0x13 function 0
|
|
(reset FDC) and setup sectors are loaded immediately after the
|
|
bootsector, i.e. at physical address 0x90200 ($INITSEG:0x200), again using
|
|
BIOS service int 0x13, function 2 (read sector(s)).
|
|
This happens during lines 107-124:
|
|
<tscreen><code>
|
|
107 load_setup:
|
|
108 xorb %ah, %ah # reset FDC
|
|
109 xorb %dl, %dl
|
|
110 int $0x13
|
|
111 xorw %dx, %dx # drive 0, head 0
|
|
112 movb $0x02, %cl # sector 2, track 0
|
|
113 movw $0x0200, %bx # address = 512, in INITSEG
|
|
114 movb $0x02, %ah # service 2, "read sector(s)"
|
|
115 movb setup_sects, %al # (assume all on head 0, track 0)
|
|
116 int $0x13 # read it
|
|
117 jnc ok_load_setup # ok - continue
|
|
|
|
118 pushw %ax # dump error code
|
|
119 call print_nl
|
|
120 movw %sp, %bp
|
|
121 call print_hex
|
|
122 popw %ax
|
|
123 jmp load_setup
|
|
|
|
124 ok_load_setup:
|
|
</code></tscreen>
|
|
If loading fails for some reason (a bad floppy, or someone pulled the diskette
out during the operation), we dump the error code and retry in an endless
loop.
The only way out of the loop is a successful retry or a reboot of the machine,
and a retry usually does not succeed (if something is wrong it will only get worse).
|
|
|
|
If loading setup_sects sectors of setup code succeeded we jump to label
|
|
<tt>ok_load_setup:</tt>.
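
The control flow of <tt>load_setup</tt> is a plain retry loop. The sketch below models it in C with a fake drive that fails a fixed number of times; <tt>struct fake_drive</tt>, <tt>fake_reset()</tt>, <tt>fake_read()</tt> and <tt>load_setup_retry()</tt> are hypothetical stand-ins for the BIOS calls, and unlike the real code the loop returns the attempt count so that it can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical drive model: the first failures_left reads fail. */
struct fake_drive {
    int failures_left;
    int resets;        /* how many times the controller was reset */
};

/* Stand-in for int 0x13, function 0: reset the controller. */
void fake_reset(struct fake_drive *d)
{
    d->resets++;
}

/* Stand-in for int 0x13, function 2: 0 on success, nonzero on error. */
int fake_read(struct fake_drive *d)
{
    if (d->failures_left > 0) {
        d->failures_left--;
        return 1;
    }
    return 0;
}

/* Reset and read until the read succeeds; returns the number of attempts. */
int load_setup_retry(struct fake_drive *d)
{
    int attempts = 0;

    assert(d != NULL);
    for (;;) {
        fake_reset(d);          /* the FDC is reset before every attempt */
        attempts++;
        if (fake_read(d) == 0)  /* corresponds to "jnc ok_load_setup" */
            return attempts;
        /* the real code prints the error code here, then jumps back */
    }
}
```

Note that, as in the assembly, nothing bounds the number of attempts: a drive that never recovers keeps the loop spinning.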
|
|
|
|
Then we proceed to load the compressed kernel image at physical
|
|
address 0x10000. This
|
|
is done to preserve the firmware data areas in low memory (0-64K).
|
|
After the kernel is loaded, we jump to $SETUPSEG:0 (<tt>arch/i386/boot/setup.S</tt>).
|
|
Once the firmware data is no longer needed (e.g. no more calls to BIOS) it is
overwritten by moving the entire (compressed) kernel image from 0x10000 to
0x1000 (physical addresses, of course).
This is done by <tt>setup.S</tt>, which sets things up for protected mode and jumps
to 0x1000, which is the head of the compressed kernel, i.e.
<tt>arch/i386/boot/compressed/{head.S,misc.c}</tt>.
|
|
This sets up stack and calls <tt>decompress_kernel()</tt> which uncompresses the
|
|
kernel to address 0x100000 and jumps to it.
|
|
|
|
Note that old bootloaders (old versions of LILO) could only load the
|
|
first 4 sectors of setup, which is why there is code in setup to load the rest of
|
|
itself if needed. Also, the code in setup has to take care of various
|
|
combinations of loader type/version vs zImage/bzImage and is therefore
|
|
highly complex.
|
|
|
|
Let us examine the kludge in the bootsector code that allows loading a big
kernel, also known as "bzImage".
|
|
The setup sectors are loaded as usual at 0x90200, but the kernel is loaded
|
|
64K chunk at a time using a special helper routine that calls BIOS to move
|
|
data from low to high memory. This helper routine is referred to by
|
|
<tt>bootsect_kludge</tt> in <tt>bootsect.S</tt> and is defined as <tt>bootsect_helper</tt> in <tt>setup.S</tt>.
|
|
The <tt>bootsect_kludge</tt> label in <tt>setup.S</tt> contains the value of setup segment
|
|
and the offset of <tt>bootsect_helper</tt> code in it so that bootsector can use the <tt>lcall</tt>
|
|
instruction to jump to it (inter-segment jump).
|
|
The reason why it is in <tt>setup.S</tt> is simply that there is no more space left
in bootsect.S (which is not strictly true - there are approximately 4 spare bytes
and at least 1 spare byte in <tt>bootsect.S</tt> but that is not enough, obviously).
|
|
This routine uses BIOS service int 0x15 (ax=0x8700) to move to high memory
|
|
and resets %es to always point to 0x10000. This ensures that the code in <tt>bootsect.S</tt>
|
|
doesn't run out of low memory when copying data from disk.
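
The staging scheme can be pictured as follows. This is a user-space sketch, not the bootsector algorithm itself: the 64K low buffer plays the role of the area at 0x10000, <tt>memcpy()</tt> stands in for both the disk read and the int 0x15 block move, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { CHUNK = 0x10000 };  /* 64K staging area in low memory */

/*
 * Copy size bytes from disk_image to high_mem through a CHUNK-sized
 * low buffer; returns how many chunks were moved.
 */
size_t load_via_low_buffer(const uint8_t *disk_image, size_t size,
                           uint8_t *low_buf, uint8_t *high_mem)
{
    size_t done = 0, chunks = 0;

    assert(disk_image != NULL && low_buf != NULL && high_mem != NULL);
    while (done < size) {
        size_t n = size - done < CHUNK ? size - done : CHUNK;

        memcpy(low_buf, disk_image + done, n);  /* "read from disk" */
        memcpy(high_mem + done, low_buf, n);    /* "move to high memory" */
        done += n;
        chunks++;
    }
    return chunks;
}
```

Because the low buffer is reused for every chunk, the amount of low memory needed does not grow with the size of the image, which is the point of the kludge.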
|
|
|
|
<sect1> Using LILO as a bootloader <p>
|
|
|
|
There are several advantages in using a specialised bootloader (LILO) over
|
|
a bare bones Linux bootsector:
|
|
<enum>
|
|
<item> Ability to choose between multiple Linux kernels or even multiple OSes.
|
|
<item> Ability to pass kernel command line parameters (there is a patch
|
|
called BCP that adds this ability to bare-bones bootsector+setup).
|
|
<item> Ability to load much larger bzImage kernels - up to 2.5M vs 1M.
|
|
</enum>
|
|
Old versions of LILO (v17 and earlier) could not load bzImage kernels. The
|
|
newer versions (as of a couple of years ago or earlier) use the same
|
|
technique as bootsect+setup of moving data from low into high memory by
|
|
means of BIOS services. Some people (Peter Anvin notably) argue that zImage
|
|
support should be removed. The main reason (according to Alan Cox) it stays
|
|
is that there are apparently some broken BIOSes that make it impossible to
|
|
boot bzImage kernels while loading zImage ones fine.
|
|
|
|
The last thing LILO does is to jump to <tt>setup.S</tt> and things proceed as normal.
|
|
|
|
<sect1> High level initialisation <p>
|
|
|
|
By "high-level initialisation" we mean anything which is not directly
|
|
related to bootstrap, even though parts of the code to perform this are
|
|
written in asm, namely <tt>arch/i386/kernel/head.S</tt> which is the head of the
|
|
uncompressed kernel. The following steps are performed:
|
|
|
|
<enum>
|
|
<item> Initialise segment values (%ds = %es = %fs = %gs = __KERNEL_DS = 0x18).
|
|
<item> Initialise page tables.
|
|
<item> Enable paging by setting PG bit in %cr0.
|
|
<item> Zero-clean BSS (on SMP, only first CPU does this).
|
|
<item> Copy the first 2k of bootup parameters (kernel commandline).
|
|
<item> Check the CPU type using EFLAGS and, if possible, cpuid; this can detect
386 and higher.
|
|
<item> The first CPU calls <tt>start_kernel()</tt>, all others call
|
|
<tt>arch/i386/kernel/smpboot.c:initialize_secondary()</tt> if ready=1,
|
|
which just reloads esp/eip and doesn't return.
|
|
</enum>
|
|
|
|
The <tt>init/main.c:start_kernel()</tt> is written in C and does the following:
|
|
|
|
<enum>
|
|
<item> Take a global kernel lock (it is needed so that only one CPU
|
|
goes through initialisation).
|
|
<item> Perform arch-specific setup (memory layout analysis, copying
|
|
boot command line again, etc.).
|
|
<item> Print Linux kernel "banner" containing the version, compiler used to
|
|
build it etc. to the kernel ring buffer for messages. This is taken
|
|
from the variable linux_banner defined in init/version.c and is the
|
|
same string as displayed by <bf>cat /proc/version</bf>.
|
|
<item> Initialise traps.
|
|
<item> Initialise irqs.
|
|
<item> Initialise data required for scheduler.
|
|
<item> Initialise time keeping data.
|
|
<item> Initialise softirq subsystem.
|
|
<item> Parse boot commandline options.
|
|
<item> Initialise console.
|
|
<item> If module support was compiled into the kernel, initialise the dynamic
module loading facility.
|
|
<item> If "profile=" command line was supplied, initialise profiling buffers.
|
|
<item> <tt>kmem_cache_init()</tt>, initialise most of slab allocator.
|
|
<item> Enable interrupts.
|
|
<item> Calculate BogoMips value for this CPU.
|
|
<item> Call <tt>mem_init()</tt> which calculates <tt>max_mapnr</tt>, <tt>totalram_pages</tt> and
|
|
<tt>high_memory</tt> and prints out the "Memory: ..." line.
|
|
<item> <tt>kmem_cache_sizes_init()</tt>, finish slab allocator initialisation.
|
|
<item> Initialise data structures used by procfs.
|
|
<item> <tt>fork_init()</tt>, create <tt>uid_cache</tt>, initialise <tt>max_threads</tt> based on
|
|
the amount of memory available and configure <tt>RLIMIT_NPROC</tt> for
|
|
<tt>init_task</tt> to be <tt>max_threads/2</tt>.
|
|
<item> Create various slab caches needed for VFS, VM, buffer cache, etc.
|
|
<item> If System V IPC support is compiled in, initialise the IPC subsystem.
|
|
Note that for System V shm, this includes mounting an internal
|
|
(in-kernel) instance of shmfs filesystem.
|
|
<item> If quota support is compiled into the kernel, create and initialise
|
|
a special slab cache for it.
|
|
<item> Perform arch-specific "check for bugs" and, whenever possible,
activate workarounds for processor/bus/etc bugs. Comparing various
architectures reveals that "ia64 has no bugs" and "ia32 has quite a
few bugs". A good example is the "f00f bug", which is only checked for (and
worked around) if the kernel is compiled for less than 686.
|
|
<item> Set a flag to indicate that a schedule should be invoked at "next
|
|
opportunity" and create a kernel thread <tt>init()</tt> which execs
|
|
execute_command if supplied via "init=" boot parameter, or tries to
|
|
exec <bf>/sbin/init</bf>, <bf>/etc/init</bf>, <bf>/bin/init</bf>, <bf>/bin/sh</bf> in this order; if
|
|
all these fail, panic with "suggestion" to use "init=" parameter.
|
|
<item> Go into the idle loop; this is the idle thread with pid=0.
|
|
</enum>
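
The choice of the first user-space program, described in the list above, can be sketched in C. This is a simplified model rather than the code in <tt>init/main.c</tt>: <tt>pick_init()</tt> and its <tt>present</tt> array (the set of paths that can be executed) are hypothetical, a <tt>NULL</tt> return stands for the panic, and falling back to the default list when the "init=" program cannot be executed is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Default candidates, in the order listed above. */
static const char *const default_inits[] = {
    "/sbin/init", "/etc/init", "/bin/init", "/bin/sh"
};

/* True if path appears in the present[] list of executable files. */
static int is_present(const char *path, const char *const *present, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(path, present[i]) == 0)
            return 1;
    return 0;
}

/*
 * Return the program that would be executed, or NULL if none can be
 * (the point where the kernel panics). execute_command may be NULL.
 */
const char *pick_init(const char *execute_command,
                      const char *const *present, size_t n)
{
    assert(present != NULL || n == 0);
    if (execute_command != NULL && is_present(execute_command, present, n))
        return execute_command;
    for (size_t i = 0; i < sizeof default_inits / sizeof default_inits[0]; i++)
        if (is_present(default_inits[i], present, n))
            return default_inits[i];
    return NULL;
}
```

On a system where only <tt>/bin/sh</tt> survives, <tt>pick_init(NULL, ...)</tt> returns <tt>/bin/sh</tt>, which is what makes the shell a last-resort init.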
|
|
|
|
An important thing to note here is that the <tt>init()</tt> kernel thread calls
|
|
<tt>do_basic_setup()</tt> which in turn calls <tt>do_initcalls()</tt> which goes through the
|
|
list of functions registered by means of <tt>__initcall</tt> or <tt>module_init()</tt> macros
|
|
and invokes them. These functions either do not depend on each other
|
|
or their dependencies have been manually fixed by the link order in the
|
|
Makefiles. This means that, depending on
|
|
the position of directories in the trees and the structure of the Makefiles,
|
|
the order in which initialisation functions are invoked can change. Sometimes, this
|
|
is important because you can imagine two subsystems A and B with B depending
|
|
on some initialisation done by A. If A is compiled statically and B is a
|
|
module then B's entry point is guaranteed to be invoked after A prepared
|
|
all the necessary environment. If A is a module, then B is also necessarily
|
|
a module so there are no problems. But what if both A and B are statically
|
|
linked into the kernel? The order in which they are invoked depends on the relative
|
|
entry point offsets in the <tt>.initcall.init</tt> ELF section of the kernel image.
|
|
Rogier Wolff proposed to introduce a hierarchical "priority" infrastructure
|
|
whereby modules could let the linker know in what (relative) order they
|
|
should be linked, but so far there are no patches available that implement
|
|
this in a sufficiently elegant manner to be acceptable into the kernel.
|
|
Therefore, make sure your link order is correct. If, in the example above,
|
|
A and B work fine when compiled statically once, they will always work,
|
|
provided they are listed sequentially in the same Makefile. If they don't
|
|
work, change the order in which their object files are listed.
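
The ordering hazard is easy to reproduce outside the kernel. In the sketch below an ordinary array stands in for the <tt>.initcall.init</tt> section, and the order of its elements plays the role of link order; <tt>init_a()</tt>, <tt>init_b()</tt> and <tt>run_initcalls()</tt> are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*initcall_t)(void);

int a_ready;  /* set once subsystem A has initialised */

/* Subsystem A: prepares the environment that B relies on. */
int init_a(void)
{
    a_ready = 1;
    return 0;
}

/* Subsystem B: fails (returns -1) unless A has run first. */
int init_b(void)
{
    return a_ready ? 0 : -1;
}

/* Invoke each entry in table order; returns the number of failures. */
int run_initcalls(const initcall_t *calls, size_t n)
{
    int failures = 0;

    assert(calls != NULL);
    for (size_t i = 0; i < n; i++)
        if (calls[i]() != 0)
            failures++;
    return failures;
}
```

Running the table <tt>{ init_a, init_b }</tt> reports no failures, while <tt>{ init_b, init_a }</tt> reports one, even though the two tables contain exactly the same functions.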
|
|
|
|
Another thing worth noting is Linux's ability to execute an "alternative
|
|
init program" by means of passing "init=" boot commandline. This is useful
|
|
for recovering from accidentally overwritten <bf>/sbin/init</bf> or debugging the
|
|
initialisation (rc) scripts and <tt>/etc/inittab</tt> by hand, executing them
|
|
one at a time.
|
|
|
|
<sect1>SMP Bootup on x86<p>
|
|
|
|
On SMP, the BP goes through the normal sequence of bootsector, setup etc
|
|
until it reaches the <tt>start_kernel()</tt>, and then on to <tt>smp_init()</tt> and
|
|
especially <tt>arch/i386/kernel/smpboot.c:smp_boot_cpus()</tt>. The <tt>smp_boot_cpus()</tt>
|
|
goes in a loop for each apicid (until <tt>NR_CPUS</tt>) and calls <tt>do_boot_cpu()</tt> on
|
|
it. What <tt>do_boot_cpu()</tt> does is create (i.e. <tt>fork_by_hand</tt>) an idle task for
|
|
the target cpu and write in well-known locations defined by the Intel MP
|
|
spec (0x467/0x469) the EIP of trampoline code found in <tt>trampoline.S</tt>. Then
|
|
it generates STARTUP IPI to the target cpu which makes this AP execute the
|
|
code in <tt>trampoline.S</tt>.
|
|
|
|
The boot CPU creates a copy of trampoline code for each CPU in
|
|
low memory. The AP code writes a magic number in its own code which is
|
|
verified by the BP to make sure that AP is executing the trampoline code.
|
|
The requirement that trampoline code must be in low memory is enforced by
|
|
the Intel MP specification.
|
|
|
|
The trampoline code simply sets %bx register to 1, enters protected mode
|
|
and jumps to startup_32 which is the main entry to <tt>arch/i386/kernel/head.S</tt>.
|
|
|
|
Now the AP starts executing <tt>head.S</tt> and, discovering that it is not a BP,
|
|
it skips the code that clears BSS and then enters <tt>initialize_secondary()</tt>
|
|
which just enters the idle task for this CPU - recall that <tt>init_tasks[cpu]</tt>
|
|
was already initialised by BP executing <tt>do_boot_cpu(cpu)</tt>.
|
|
|
|
Note that init_task can be shared but each idle thread must have its own
|
|
TSS. This is why <tt>init_tss[NR_CPUS]</tt> is an array.
|
|
|
|
<sect1>Freeing initialisation data and code<p>
|
|
|
|
When the operating system initialises itself, most of the code and data
|
|
structures are never needed again.
|
|
Most operating systems (BSD, FreeBSD etc.) cannot dispose of this unneeded
|
|
information, thus wasting precious physical kernel memory.
|
|
The excuse they use (see McKusick's 4.4BSD book) is that "the relevant code
|
|
is spread around various subsystems and so it is not feasible to free it".
|
|
Linux, of course, cannot use such excuses because under Linux "if something
|
|
is possible in principle, then it is already implemented or somebody is
|
|
working on it".
|
|
|
|
So, as I said earlier, the Linux kernel can only be compiled as an ELF binary, and
|
|
now we find out the reason (or one of the reasons) for that. The reason
|
|
related to throwing away initialisation code/data is that Linux provides two
|
|
macros to be used:
|
|
|
|
<itemize>
|
|
<item> <tt>__init</tt> - for initialisation code
|
|
<item> <tt>__initdata</tt> - for data
|
|
</itemize>
|
|
|
|
These evaluate to gcc attribute specifiers (also known as "gcc magic")
|
|
as defined in <tt>include/linux/init.h</tt>:
|
|
|
|
<tscreen><code>
|
|
#ifndef MODULE
|
|
#define __init __attribute__ ((__section__ (".text.init")))
|
|
#define __initdata __attribute__ ((__section__ (".data.init")))
|
|
#else
|
|
#define __init
|
|
#define __initdata
|
|
#endif
|
|
</code></tscreen>
|
|
|
|
What this means is that if the code is compiled statically into the kernel
|
|
(i.e. MODULE is not defined) then it is placed in the special ELF section
|
|
<tt>.text.init</tt>, which is declared in the linker map in <tt>arch/i386/vmlinux.lds</tt>.
|
|
Otherwise (i.e. if it is a module) the macros evaluate to nothing.
|
|
|
|
What happens during boot is that the "init" kernel thread (function
|
|
<tt>init/main.c:init()</tt>) calls the arch-specific function <tt>free_initmem()</tt> which
|
|
frees all the pages between addresses <tt>__init_begin</tt> and <tt>__init_end</tt>.
|
|
|
|
On a typical system (my workstation), this results in freeing about 260K of
|
|
memory.
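
The amount freed is simply the distance between the two linker symbols. Here is a small sketch of that arithmetic, assuming 4K pages as on IA32; the function names and the addresses used in the note below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE_BYTES = 4096 };  /* IA32 page size (assumption of this sketch) */

/* Number of whole pages in [init_begin, init_end). */
uint32_t init_pages(uint32_t init_begin, uint32_t init_end)
{
    assert(init_begin <= init_end);
    assert(init_begin % PAGE_BYTES == 0 && init_end % PAGE_BYTES == 0);
    return (init_end - init_begin) / PAGE_BYTES;
}

/* The same span expressed in kilobytes. */
uint32_t init_kbytes(uint32_t init_begin, uint32_t init_end)
{
    assert(init_begin <= init_end);
    return (init_end - init_begin) >> 10;
}
```

For example, with made-up addresses, <tt>init_kbytes(0xC0300000, 0xC0341000)</tt> gives 260 and <tt>init_pages()</tt> gives 65 for the same span, which is the order of magnitude quoted above.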
|
|
|
|
The functions registered via <tt>module_init()</tt> are placed in <tt>.initcall.init</tt>
|
|
which is also freed in the static case. The current trend in Linux, when
|
|
designing a subsystem (not necessarily a module), is to provide
|
|
init/exit entry points from the early stages of design so that in the
|
|
future, the subsystem in question can be modularised if needed. An example of
this is pipefs, see <tt>fs/pipe.c</tt>. Even if a given subsystem will never become a
module, e.g. bdflush (see <tt>fs/buffer.c</tt>), it is still nice and tidy to use
the <tt>module_init()</tt> macro against its initialisation function, provided it does
not matter exactly when the function is called.
|
|
|
|
There are two more macros which work in a similar manner, called <tt>__exit</tt> and
|
|
<tt>__exitdata</tt>, but they are more directly connected to the module support and
|
|
therefore will be explained in a later section.
|
|
|
|
<sect1>Processing kernel command line<p>
|
|
|
|
Let us recall what happens to the commandline passed to kernel during boot:
|
|
|
|
<enum>
|
|
<item> LILO (or BCP) accepts the commandline using BIOS keyboard services
|
|
and stores it at a well-known location in physical memory, as well
|
|
as a signature saying that there is a valid commandline there.
|
|
|
|
<item> <tt>arch/i386/kernel/head.S</tt> copies the first 2k of it out to the zeropage.
|
|
|
|
<item> <tt>arch/i386/kernel/setup.c:parse_mem_cmdline()</tt> (called by
|
|
<tt>setup_arch()</tt>, itself called by <tt>start_kernel()</tt>) copies 256 bytes from zeropage
|
|
into <tt>saved_command_line</tt> which is displayed by <tt>/proc/cmdline</tt>. This
|
|
same routine processes the "mem=" option if present and makes appropriate
|
|
adjustments to VM parameters.
|
|
|
|
<item> We return to commandline in <tt>parse_options()</tt> (called by <tt>start_kernel()</tt>)
|
|
which processes some "in-kernel" parameters (currently "init=" and
|
|
environment/arguments for init) and passes each word to <tt>checksetup()</tt>.
|
|
|
|
<item> <tt>checksetup()</tt> goes through the code in ELF section <tt>.setup.init</tt> and
|
|
invokes each function, passing it the word if it matches. Note that
|
|
using the return value of 0 from the function registered via <tt>__setup()</tt>,
|
|
it is possible to pass the same "variable=value" to more than one
|
|
function with "value" invalid to one and valid to another.
|
|
Jeff Garzik commented: "hackers who do that get spanked :)"
|
|
Why? Because this is clearly ld-order specific, i.e. kernel linked
|
|
in one order will have functionA invoked before functionB and another
|
|
will have it in reversed order, with the result depending on the order.
|
|
|
|
</enum>
|
|
|
|
So, how do we write code that processes boot commandline? We use the <tt>__setup()</tt>
|
|
macro defined in <tt>include/linux/init.h</tt>:
|
|
|
|
<tscreen><code>
|
|
|
|
/*
|
|
* Used for kernel command line parameter setup
|
|
*/
|
|
struct kernel_param {
|
|
const char *str;
|
|
int (*setup_func)(char *);
|
|
};
|
|
|
|
extern struct kernel_param __setup_start, __setup_end;
|
|
|
|
#ifndef MODULE
|
|
#define __setup(str, fn) \
|
|
static char __setup_str_##fn[] __initdata = str; \
|
|
static struct kernel_param __setup_##fn __initsetup = \
|
|
{ __setup_str_##fn, fn }
|
|
|
|
#else
|
|
#define __setup(str,func) /* nothing */
|
|
#endif
|
|
</code></tscreen>
|
|
|
|
So, you would typically use it in your code like this
(taken from the code of a real driver, BusLogic HBA <tt>drivers/scsi/BusLogic.c</tt>):
|
|
|
|
<tscreen><code>
|
|
static int __init
|
|
BusLogic_Setup(char *str)
|
|
{
|
|
int ints[3];
|
|
|
|
(void)get_options(str, ARRAY_SIZE(ints), ints);
|
|
|
|
if (ints[0] != 0) {
|
|
BusLogic_Error("BusLogic: Obsolete Command Line Entry "
|
|
"Format Ignored\n", NULL);
|
|
return 0;
|
|
}
|
|
if (str == NULL || *str == '\0')
|
|
return 0;
|
|
return BusLogic_ParseDriverOptions(str);
|
|
}
|
|
|
|
__setup("BusLogic=", BusLogic_Setup);
|
|
</code></tscreen>
|
|
|
|
Note that <tt>__setup()</tt> does nothing for modules, so the code that wishes to
|
|
process boot commandline and can be either a module or statically linked
|
|
must invoke its parsing function manually in the module initialisation
|
|
routine. This also means that it is possible to write code that
|
|
processes parameters when compiled as a module but not when it is static or
|
|
vice versa.
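
To make the matching rule of <tt>checksetup()</tt> concrete, here is a user-space sketch. It is modelled on the description in this section rather than copied from the kernel: an ordinary array replaces the <tt>.setup.init</tt> section, <tt>checksetup_sketch()</tt> and the two "foo=" handlers are hypothetical, and stopping at the first handler that returns nonzero is an assumption consistent with the remark about returning 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same shape as the structure quoted from include/linux/init.h. */
struct kernel_param {
    const char *str;
    int (*setup_func)(char *);
};

/*
 * Offer one command-line word to every entry whose str is a prefix of it.
 * Stops at the first handler that returns nonzero; returns 1 if some
 * handler accepted the word and 0 otherwise.
 */
int checksetup_sketch(const struct kernel_param *table, size_t n, char *word)
{
    assert(table != NULL && word != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(table[i].str);

        if (strncmp(word, table[i].str, len) == 0 &&
            table[i].setup_func(word + len) != 0)
            return 1;
    }
    return 0;
}

int foo_calls;  /* counts handler invocations, for observation only */

/* Hypothetical handler: accepts only values that start with a digit. */
int foo_numeric(char *value)
{
    foo_calls++;
    return value[0] >= '0' && value[0] <= '9';
}

/* Hypothetical handler: accepts any value. */
int foo_any(char *value)
{
    (void)value;
    foo_calls++;
    return 1;
}
```

With a table listing <tt>foo_numeric</tt> before <tt>foo_any</tt> for the same "foo=" prefix, the word "foo=42" is consumed by the first handler, while "foo=bar" is rejected by it and passed on to the second. Swapping the two entries changes which handler sees the word first, which is the link-order dependence warned about above.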
|
|
|
|
<sect>Process and Interrupt Management<p>
|
|
|
|
<sect1>Task Structure and Process Table<p>
|
|
|
|
Every process under Linux is dynamically allocated a <tt>struct task_struct</tt>
|
|
structure. The maximum number of processes which can be created on Linux
|
|
is limited only by the amount of physical memory present, and is
|
|
equal to (see <tt>kernel/fork.c:fork_init()</tt>):
|
|
|
|
<tscreen><code>
|
|
/*
|
|
* The default maximum number of threads is set to a safe
|
|
* value: the thread structures can take up at most half
|
|
* of memory.
|
|
*/
|
|
max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;
|
|
</code></tscreen>
|
|
|
|
which, on IA32 architecture, basically means <tt>num_physpages/4</tt>. As an example,
|
|
on a 512M machine, you can create 32k threads. This is a considerable
|
|
improvement over the 4k-epsilon limit for older (2.2 and earlier) kernels.
|
|
Moreover, this can be changed at runtime using the KERN_MAX_THREADS <bf>sysctl(2)</bf>,
|
|
or simply using procfs interface to kernel tunables:
|
|
|
|
<tscreen><code>
|
|
# cat /proc/sys/kernel/threads-max
|
|
32764
|
|
# echo 100000 > /proc/sys/kernel/threads-max
|
|
# cat /proc/sys/kernel/threads-max
|
|
100000
|
|
# gdb -q vmlinux /proc/kcore
|
|
Core was generated by `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118'.
|
|
#0 0x0 in ?? ()
|
|
(gdb) p max_threads
|
|
$1 = 100000
|
|
</code></tscreen>
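
The arithmetic behind the default limit can be checked with a small sketch. It assumes the IA32 values of 4096 bytes for <tt>PAGE_SIZE</tt> and 8192 bytes for <tt>THREAD_SIZE</tt>; the function name <tt>sketch_max_threads</tt> is hypothetical.

```c
/* Default thread limit, following the fork_init() formula quoted above. */
static unsigned long sketch_max_threads(unsigned long mempages,
					unsigned long thread_size,
					unsigned long page_size)
{
	return mempages / (thread_size / page_size) / 2;
}
```

For exactly 512M of memory (131072 pages) this gives 32768. The figure printed by a running system can be slightly lower, as in the session above, presumably because not every page of RAM is counted in <tt>mempages</tt>.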
|
|
|
|
The set of processes on the Linux system is represented as a collection of
|
|
<tt>struct task_struct</tt> structures which are linked in two ways:
|
|
|
|
<enum>
|
|
<item> as a hashtable, hashed by pid, and
|
|
<item> as a circular, doubly-linked list using <tt>p->next_task</tt> and <tt>p->prev_task</tt>
|
|
pointers.
|
|
</enum>
|
|
|
|
The hashtable is called <tt>pidhash[]</tt> and is defined in
|
|
<tt>include/linux/sched.h</tt>:
|
|
|
|
<tscreen><code>
|
|
/* PID hashing. (shouldnt this be dynamic?) */
|
|
#define PIDHASH_SZ (4096 >> 2)
|
|
extern struct task_struct *pidhash[PIDHASH_SZ];
|
|
|
|
#define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
|
|
</code></tscreen>
|
|
|
|
The tasks are hashed by their pid value and the above hashing function is
|
|
supposed to distribute the elements uniformly in their domain
|
|
(<tt>0</tt> to <tt>PID_MAX-1</tt>). The hashtable is used to quickly find a task by given pid,
|
|
using the <tt>find_task_by_pid()</tt> inline from <tt>include/linux/sched.h</tt>:
|
|
|
|
<tscreen><code>
|
|
static inline struct task_struct *find_task_by_pid(int pid)
|
|
{
|
|
struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];
|
|
|
|
for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
|
|
;
|
|
|
|
return p;
|
|
}
|
|
</code></tscreen>
|
|
|
|
The tasks on each hashlist (i.e. hashed to the same value) are linked
|
|
by <tt>p->pidhash_next/pidhash_pprev</tt> which are used by <tt>hash_pid()</tt> and
|
|
<tt>unhash_pid()</tt> to insert and remove a given process into the hashtable.
|
|
These are done under protection of the read-write spinlock called <tt>tasklist_lock</tt>
|
|
taken for WRITE.
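
The role of the <tt>pidhash_pprev</tt> back-pointer is easier to see in a stand-alone model. The sketch below is a user-space approximation under stated assumptions: <tt>struct sk_task</tt> is a hypothetical cut-down stand-in for <tt>task_struct</tt>, the <tt>sk_</tt> functions only mirror what <tt>hash_pid()</tt> and <tt>unhash_pid()</tt> are described as doing, and there is no locking (the kernel holds <tt>tasklist_lock</tt> for WRITE here).

```c
#include <stddef.h>

#define PIDHASH_SZ (4096 >> 2)
#define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))

/* Cut-down stand-in for task_struct: only the hash linkage. */
struct sk_task {
	int pid;
	struct sk_task *pidhash_next;
	struct sk_task **pidhash_pprev;	/* address of the pointer pointing at us */
};

static struct sk_task *sk_pidhash[PIDHASH_SZ];

/* Insert at the head of the chain selected by the pid. */
static void sk_hash_pid(struct sk_task *p)
{
	struct sk_task **htable = &sk_pidhash[pid_hashfn(p->pid)];

	if ((p->pidhash_next = *htable) != NULL)
		(*htable)->pidhash_pprev = &p->pidhash_next;
	*htable = p;
	p->pidhash_pprev = htable;
}

/* Unlink without having to know which chain we are on. */
static void sk_unhash_pid(struct sk_task *p)
{
	if (p->pidhash_next)
		p->pidhash_next->pidhash_pprev = p->pidhash_pprev;
	*p->pidhash_pprev = p->pidhash_next;
}

/* Walk the chain for this pid; NULL if no such task is hashed. */
static struct sk_task *sk_find_task_by_pid(int pid)
{
	struct sk_task *p;

	for (p = sk_pidhash[pid_hashfn(pid)]; p && p->pid != pid;
	     p = p->pidhash_next)
		;
	return p;
}
```

Because <tt>pidhash_pprev</tt> points at whichever pointer currently refers to the task (the bucket itself or the previous task's <tt>pidhash_next</tt>), removal needs no special case for the first element of a chain.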
|
|
|
|
The circular doubly-linked list that uses <tt>p->next_task/prev_task</tt> is
|
|
maintained so that one could go through all tasks on the system easily.
|
|
This is achieved by the <tt>for_each_task()</tt> macro from <tt>include/linux/sched.h</tt>:
|
|
|
|
<tscreen><code>
|
|
#define for_each_task(p) \
|
|
for (p = &init_task ; (p = p->next_task) != &init_task ; )
|
|
</code></tscreen>
|
|
|
|
Users of <tt>for_each_task()</tt> should take tasklist_lock for READ.
|
|
Note that <tt>for_each_task()</tt> is using <tt>init_task</tt> to mark the beginning (and end)
|
|
of the list - this is safe because the idle task (pid 0) never exits.
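
As an illustration, the same traversal can be modelled outside the kernel. This is a sketch only: <tt>struct sk_task</tt>, <tt>sk_init_task</tt>, <tt>sk_link_task()</tt> and <tt>sk_count_tasks()</tt> are hypothetical names, and the locking discussed above is left out.

```c
struct sk_task {
	int pid;
	struct sk_task *next_task, *prev_task;
};

/* The list head; it links to itself while no other task exists. */
static struct sk_task sk_init_task = { 0, &sk_init_task, &sk_init_task };

#define sk_for_each_task(p) \
	for (p = &sk_init_task; (p = p->next_task) != &sk_init_task; )

/* Link p in just before the head, i.e. at the tail of the circle. */
static void sk_link_task(struct sk_task *p)
{
	p->next_task = &sk_init_task;
	p->prev_task = sk_init_task.prev_task;
	sk_init_task.prev_task->next_task = p;
	sk_init_task.prev_task = p;
}

/* Count the tasks the macro visits. */
static int sk_count_tasks(void)
{
	struct sk_task *p;
	int n = 0;

	sk_for_each_task(p)
		n++;
	return n;
}
```

Note that a loop written this way never visits the head itself: the macro advances before the first iteration and stops when it comes back round to the head.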
|
|
|
|
The modifiers of the process hashtable and/or the process table links,
|
|
notably <tt>fork()</tt>, <tt>exit()</tt> and <tt>ptrace()</tt>, must take <tt>tasklist_lock</tt> for WRITE. What is
|
|
more interesting is that the writers must also disable interrupts on the
|
|
local CPU. The reason for this is not trivial: the <tt>send_sigio()</tt> function walks the
|
|
task list and thus takes <tt>tasklist_lock</tt> for READ, and it is called from
|
|
<tt>kill_fasync()</tt> in interrupt context. This is why writers must disable
|
|
interrupts while readers don't need to.
|
|
|
|
Now that we understand how the <tt>task_struct</tt> structures are linked together,
|
|
let us examine the members of <tt>task_struct</tt>. They loosely correspond to the
|
|
members of UNIX 'struct proc' and 'struct user' combined together.
|
|
|
|
The other versions of UNIX separated the task state information into
|
|
one part which should be kept memory-resident at all times (called 'proc
|
|
structure' which includes process state, scheduling information etc.) and
|
|
another part which is only needed when the process is running (called 'u area' which
|
|
includes file descriptor table, disk quota information etc.). The only reason
|
|
for such ugly design was that memory was a very scarce resource. Modern
|
|
operating systems (well, only Linux at the moment but others, e.g. FreeBSD
|
|
seem to improve in this direction towards Linux) do not need such separation
|
|
and therefore maintain process state in a kernel memory-resident data
|
|
structure at all times.
|
|
|
|
The <tt>task_struct</tt> structure is declared in <tt>include/linux/sched.h</tt> and is
|
|
currently 1680 bytes in size.
|
|
|
|
The state field is declared as:
|
|
|
|
<tscreen><code>
|
|
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
|
|
|
|
#define TASK_RUNNING 0
|
|
#define TASK_INTERRUPTIBLE 1
|
|
#define TASK_UNINTERRUPTIBLE 2
|
|
#define TASK_ZOMBIE 4
|
|
#define TASK_STOPPED 8
|
|
#define TASK_EXCLUSIVE 32
|
|
</code></tscreen>
|
|
|
|
Why is <tt>TASK_EXCLUSIVE</tt> defined as 32 and not 16? Because 16 was used up by
|
|
<tt>TASK_SWAPPING</tt> and I forgot to shift <tt>TASK_EXCLUSIVE</tt> up when I removed
|
|
all references to <tt>TASK_SWAPPING</tt> (sometime in 2.3.x).
|
|
|
|
The <tt>volatile</tt> in <tt>p->state</tt> declaration means it can be modified
|
|
asynchronously (from an interrupt handler). The individual states have the following meaning:
|
|
|
|
<enum>
|
|
|
|
<item><bf>TASK_RUNNING</bf>: means the task is "supposed to be" on the run
|
|
queue. The reason it may not yet be on the runqueue is that marking a task as
|
|
<tt>TASK_RUNNING</tt> and placing it on the runqueue is not atomic. You need to hold
|
|
the <tt>runqueue_lock</tt> spinlock in order to look at the
|
|
runqueue. If you do so, you will then see that every task on the runqueue is in
|
|
<tt>TASK_RUNNING</tt> state. However, the converse is not true for the reason explained
|
|
above. Similarly, drivers can mark themselves (or rather the process context they
|
|
run in) as <tt>TASK_INTERRUPTIBLE</tt> (or <tt>TASK_UNINTERRUPTIBLE</tt>) and then call <tt>schedule()</tt>,
|
|
which will then remove it from the runqueue (unless there is a pending signal, in which
|
|
case it is left on the runqueue). </item>
|
|
|
|
<item><bf>TASK_INTERRUPTIBLE</bf>: means the task is sleeping but can be woken up
|
|
by a signal or by expiry of a timer.</item>
|
|
|
|
<item><bf>TASK_UNINTERRUPTIBLE</bf>: same as <tt>TASK_INTERRUPTIBLE</tt>, except it cannot
|
|
be woken up by a signal.</item>
|
|
|
|
<item><bf>TASK_ZOMBIE</bf>: task has terminated but has not had its status collected
|
|
(<tt>wait()</tt>-ed for) by the parent (natural or by adoption).</item>
|
|
|
|
<item><bf>TASK_STOPPED</bf>: task was stopped, either due to job control signals or
|
|
due to <bf>ptrace(2)</bf>.</item>
|
|
|
|
<item><bf>TASK_EXCLUSIVE</bf>: this is not a separate state but can be OR-ed to
|
|
either one of <tt>TASK_INTERRUPTIBLE</tt> or <tt>TASK_UNINTERRUPTIBLE</tt>.
|
|
This means that when
|
|
this task is sleeping on a wait queue with many other tasks, it will be
|
|
woken up alone instead of causing "thundering herd" problem by waking up all
|
|
the waiters.</item>
|
|
</enum>
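
Since <tt>TASK_EXCLUSIVE</tt> is a modifier bit rather than a state, code that wants the underlying sleep state has to mask it off. A minimal sketch follows, using the values from the declaration above; the helpers <tt>sk_base_state()</tt> and <tt>sk_is_exclusive()</tt> are hypothetical and not kernel functions.

```c
#define TASK_RUNNING		0
#define TASK_INTERRUPTIBLE	1
#define TASK_UNINTERRUPTIBLE	2
#define TASK_ZOMBIE		4
#define TASK_STOPPED		8
#define TASK_EXCLUSIVE		32

/* Strip the modifier bit to recover the underlying sleep state. */
static long sk_base_state(long state)
{
	return state & ~TASK_EXCLUSIVE;
}

/* True if the sleeper asked to be woken up alone. */
static int sk_is_exclusive(long state)
{
	return (state & TASK_EXCLUSIVE) != 0;
}
```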
|
|
|
|
Task flags contain information about the process states which are not
|
|
mutually exclusive:
|
|
<tscreen><code>
|
|
unsigned long flags; /* per process flags, defined below */
|
|
/*
|
|
* Per process flags
|
|
*/
|
|
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
|
|
/* Not implemented yet, only for 486*/
|
|
#define PF_STARTING 0x00000002 /* being created */
|
|
#define PF_EXITING 0x00000004 /* getting shut down */
|
|
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
|
|
#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */
|
|
#define PF_DUMPCORE 0x00000200 /* dumped core */
|
|
#define PF_SIGNALED 0x00000400 /* killed by a signal */
|
|
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
|
|
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
|
|
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */
|
|
</code></tscreen>
|
|
|
|
The fields <tt>p->has_cpu</tt>, <tt>p->processor</tt>, <tt>p->counter</tt>, <tt>p->priority</tt>, <tt>p->policy</tt> and
|
|
<tt>p->rt_priority</tt> are related to the scheduler and will be looked at later.
|
|
|
|
The fields <tt>p->mm</tt> and <tt>p->active_mm</tt> point respectively to the process' address space
|
|
described by <tt>mm_struct</tt> structure and to the active address space if the
|
|
process doesn't have a real one (e.g. kernel threads). This helps minimise
|
|
TLB flushes on switching address spaces when the task is scheduled out.
|
|
So, if we are scheduling-in the kernel thread (which has no <tt>p->mm</tt>) then its
|
|
<tt>next->active_mm</tt> will be set to the <tt>prev->active_mm</tt> of the task that was
|
|
scheduled-out, which will be the same as <tt>prev->mm</tt> if <tt>prev->mm != NULL</tt>.
|
|
The address space can be shared between threads if <tt>CLONE_VM</tt> flag is passed
|
|
to the <bf>clone(2)</bf> system call or by means of <bf>vfork(2)</bf> system call.
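
The borrowing of an address space by a kernel thread can be sketched as follows. This is a simplified model under stated assumptions: <tt>struct sk_mm</tt>, <tt>struct sk_task</tt> and <tt>sk_switch_mm()</tt> are hypothetical, and reference counting and the actual page directory switch are omitted.

```c
#include <stddef.h>

struct sk_mm { int id; };

struct sk_task {
	struct sk_mm *mm;		/* NULL for a kernel thread */
	struct sk_mm *active_mm;	/* address space actually loaded */
};

/* Address-space part of a context switch from prev to next. */
static void sk_switch_mm(struct sk_task *prev, struct sk_task *next)
{
	struct sk_mm *oldmm = prev->active_mm;

	if (next->mm == NULL)
		next->active_mm = oldmm;	/* borrow: nothing to reload */
	else
		next->active_mm = next->mm;	/* real switch to its own space */

	if (prev->mm == NULL)
		prev->active_mm = NULL;		/* kernel thread hands it back */
}
```

In this model, switching from a user task to a kernel thread leaves the loaded address space untouched, which is where the saving in TLB flushes comes from.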
|
|
|
|
The fields <tt>p->exec_domain</tt> and <tt>p->personality</tt> relate to the personality of
|
|
the task, i.e. to the way certain system calls behave in order to emulate the
|
|
"personality" of foreign flavours of UNIX.
|
|
|
|
The field <tt>p->fs</tt> contains filesystem information, which under Linux means
|
|
three pieces of information:
|
|
|
|
<enum>
|
|
<item>root directory's dentry and mountpoint,
|
|
<item>alternate root directory's dentry and mountpoint,
|
|
<item>current working directory's dentry and mountpoint.
|
|
</enum>
|
|
|
|
This structure also includes a reference count because it can be shared
|
|
between cloned tasks when <tt>CLONE_FS</tt> flag is passed to the <bf>clone(2)</bf> system
|
|
call.
|
|
|
|
The field <tt>p->files</tt> contains the file descriptor table. This too can be
|
|
shared between tasks, provided <tt>CLONE_FILES</tt> is specified with <bf>clone(2)</bf> system
|
|
call.
|
|
|
|
The field <tt>p->sig</tt> contains signal handlers and can be shared between cloned
|
|
tasks by means of <tt>CLONE_SIGHAND</tt>.
|
|
|
|
<sect1>Creation and termination of tasks and kernel threads<p>
|
|
|
|
Different books on operating systems define a "process" in different ways,
|
|
starting from "instance of a program in execution" and ending with "that
|
|
which is produced by clone(2) or fork(2) system calls".
|
|
Under Linux, there are three kinds of processes:
|
|
|
|
<itemize>
|
|
<item> the idle thread(s),
|
|
<item> kernel threads,
|
|
<item> user tasks.
|
|
</itemize>
|
|
|
|
The idle thread is created at compile time for the first CPU; it is then
|
|
"manually" created for each CPU by means of arch-specific
|
|
<tt>fork_by_hand()</tt> in <tt>arch/i386/kernel/smpboot.c</tt>, which unrolls the <bf>fork(2)</bf> system
|
|
call by hand (on some archs). Idle tasks share one init_task structure but
|
|
have a private TSS structure, in the per-CPU array <tt>init_tss</tt>. Idle tasks all have
|
|
pid = 0, and no other task may share a pid, i.e. pass the <tt>CLONE_PID</tt> flag to <bf>clone(2)</bf>.
|
|
|
|
Kernel threads are created using <tt>kernel_thread()</tt> function which invokes
|
|
the <bf>clone(2)</bf> system call in kernel mode. Kernel threads usually have no user
|
|
address space, i.e. <tt>p->mm = NULL</tt>, because they explicitly do <tt>exit_mm()</tt>, e.g.
|
|
via <tt>daemonize()</tt> function. Kernel threads can always access kernel address
|
|
space directly. They are allocated pid numbers in the low range. Running at
|
|
processor's ring 0 (on x86, that is) implies that the kernel threads enjoy all I/O privileges
|
|
and cannot be pre-empted by the scheduler.
|
|
|
|
User tasks are created by means of <bf>clone(2)</bf> or <bf>fork(2)</bf> system calls, both of
|
|
which internally invoke <tt>kernel/fork.c:do_fork()</tt>.
|
|
|
|
Let us understand what happens when a user process makes a <bf>fork(2)</bf> system
|
|
call. Although <bf>fork(2)</bf> is architecture-dependent due to the
|
|
different ways of passing user stack and registers, the actual underlying
|
|
function <tt>do_fork()</tt> that does the job is portable and is located at
|
|
<tt>kernel/fork.c</tt>.
|
|
|
|
The following steps are done:
|
|
|
|
<enum>
|
|
<item> Local variable <tt>retval</tt> is set to <tt>-ENOMEM</tt>, as this is the value which <tt>errno</tt>
|
|
should be set to if <bf>fork(2)</bf> fails to allocate a new task structure.
|
|
|
|
<item> If <tt>CLONE_PID</tt> is set in <tt>clone_flags</tt> then return an error (<tt>-EPERM</tt>), unless
|
|
the caller is the idle thread (during boot only). So, normal user
|
|
threads cannot pass <tt>CLONE_PID</tt> to <bf>clone(2)</bf> and expect it to succeed.
|
|
For <bf>fork(2)</bf>, this is irrelevant as <tt>clone_flags</tt> is set to <tt>SIGCHLD</tt> - this
|
|
is only relevant when <tt>do_fork()</tt> is invoked from <tt>sys_clone()</tt> which
|
|
passes the <tt>clone_flags</tt> from the value requested from userspace.
|
|
|
|
<item> <tt>current->vfork_sem</tt> is initialised (it is later cleared in the child).
|
|
This is used by <tt>sys_vfork()</tt> (<bf>vfork(2)</bf> system call, corresponds to
|
|
<tt>clone_flags = CLONE_VFORK|CLONE_VM|SIGCHLD</tt>) to make the parent sleep
|
|
until the child does <tt>mm_release()</tt>, for example as a result of <tt>exec()</tt>ing
|
|
another program or <bf>exit(2)</bf>-ing.
|
|
|
|
<item> A new task structure is allocated using arch-dependent
|
|
<tt>alloc_task_struct()</tt> macro. On x86 it is just a get-free-pages (gfp) allocation at <tt>GFP_KERNEL</tt>
|
|
priority. This is the first reason why <bf>fork(2)</bf> system call may sleep.
|
|
If this allocation fails, we return <tt>-ENOMEM</tt>.
|
|
|
|
<item> All the values from current process' task structure are copied into
|
|
the new one, using structure assignment <tt>*p = *current</tt>. Perhaps this
|
|
should be replaced by a memcpy? Later on, the fields that should not
|
|
be inherited by the child are set to the correct values.
|
|
|
|
<item> Big kernel lock is taken as the rest of the code would otherwise be
|
|
non-reentrant.
|
|
|
|
<item> If the parent has user resources (a concept of UID, Linux is flexible
|
|
enough to make it a question rather than a fact), then verify if the
|
|
user exceeded <tt>RLIMIT_NPROC</tt> soft limit - if so, fail with <tt>-EAGAIN</tt>, if
|
|
not, increment the count of processes by given uid <tt>p->user->count</tt>.
|
|
|
|
<item> If the system-wide number of tasks exceeds the value of the tunable
|
|
max_threads, fail with <tt>-EAGAIN</tt>.
|
|
|
|
<item> If the binary being executed belongs to a modularised execution
|
|
domain, increment the corresponding module's reference count.
|
|
|
|
<item> If the binary being executed belongs to a modularised binary format,
|
|
increment the corresponding module's reference count.
|
|
|
|
<item> The child is marked as 'has not execed' (<tt>p->did_exec = 0</tt>)
|
|
|
|
<item> The child is marked as 'not-swappable' (<tt>p->swappable = 0</tt>)
|
|
|
|
<item> The child is put into 'uninterruptible sleep' state, i.e.
|
|
<tt>p->state = TASK_UNINTERRUPTIBLE</tt> (TODO: why is this done?
|
|
I think it's not needed - get rid of it, Linus confirms it is not
|
|
needed)
|
|
|
|
<item> The child's <tt>p->flags</tt> are set according to the value of clone_flags;
|
|
for plain <bf>fork(2)</bf>, this will be <tt>p->flags = PF_FORKNOEXEC</tt>.
|
|
|
|
<item> The child's pid <tt>p->pid</tt> is set using the fast algorithm in
|
|
<tt>kernel/fork.c:get_pid()</tt> (TODO: <tt>lastpid_lock</tt> spinlock can be made
|
|
redundant since <tt>get_pid()</tt> is always called under big kernel lock
|
|
from <tt>do_fork()</tt>, also remove flags argument of <tt>get_pid()</tt>, patch sent
|
|
to Alan on 20/06/2000 - followup later).
|
|
|
|
<item> The rest of the code in <tt>do_fork()</tt> initialises the rest of child's
|
|
task structure. At the very end, the child's task structure is
|
|
hashed into the <tt>pidhash</tt> hashtable and the child is woken up (TODO:
|
|
<tt>wake_up_process(p)</tt> sets <tt>p->state = TASK_RUNNING</tt> and adds the process
|
|
to the runq, therefore we probably didn't need to set <tt>p->state</tt> to
|
|
<tt>TASK_RUNNING</tt> earlier on in <tt>do_fork()</tt>). The interesting part is
|
|
setting <tt>p->exit_signal</tt> to <tt>clone_flags & CSIGNAL</tt>, which for <bf>fork(2)</bf>
|
|
means just <tt>SIGCHLD</tt> and setting <tt>p->pdeath_signal</tt> to 0. The
|
|
<tt>pdeath_signal</tt> is used when a process 'forgets' the original parent
|
|
(by dying) and can be set/get by means of <tt>PR_GET/SET_PDEATHSIG</tt>
|
|
commands of <bf>prctl(2)</bf> system call (You might argue that the way the
|
|
value of <tt>pdeath_signal</tt> is returned via userspace pointer argument
|
|
in <bf>prctl(2)</bf> is a bit silly - mea culpa, after Andries Brouwer
|
|
updated the manpage it was too late to fix ;)
|
|
</enum>
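
Two of the steps above, the <tt>CLONE_PID</tt> check and the derivation of <tt>p->exit_signal</tt>, can be sketched in isolation. The flag values below are quoted from memory of the 2.4 <tt>include/linux/sched.h</tt> and should be treated as assumptions; the <tt>SK_</tt> and <tt>sk_</tt> names are hypothetical and exist only for this illustration.

```c
#include <errno.h>
#include <signal.h>

/* Assumed 2.4 values; prefixed to avoid clashing with system headers. */
#define SK_CSIGNAL	0x000000ff	/* signal sent to the parent at exit */
#define SK_CLONE_VM	0x00000100
#define SK_CLONE_PID	0x00001000
#define SK_CLONE_VFORK	0x00004000

/* Early check: only the idle thread (pid 0) may pass CLONE_PID. */
static int sk_check_clone_flags(unsigned long clone_flags, int caller_pid)
{
	if ((clone_flags & SK_CLONE_PID) && caller_pid != 0)
		return -EPERM;
	return 0;
}

/* Signal the parent receives when the child exits. */
static int sk_exit_signal(unsigned long clone_flags)
{
	return (int)(clone_flags & SK_CSIGNAL);
}

/* Flags handed to do_fork() for fork(2) and vfork(2), per the text. */
static unsigned long sk_fork_flags(void)
{
	return SIGCHLD;
}

static unsigned long sk_vfork_flags(void)
{
	return SK_CLONE_VFORK | SK_CLONE_VM | SIGCHLD;
}
```

The low byte of <tt>clone_flags</tt> carries the exit signal, which is why both <bf>fork(2)</bf> and <bf>vfork(2)</bf> end up with <tt>SIGCHLD</tt> in <tt>p->exit_signal</tt>.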
|
|
|
|
Thus tasks are created. There are several ways for tasks to terminate:
|
|
|
|
<enum>
|
|
|
|
<item> by making <bf>exit(2)</bf> system call;
|
|
|
|
<item> by being delivered a signal with default disposition to die;
|
|
|
|
<item> by being forced to die under certain exceptions;
|
|
|
|
<item> by calling <bf>bdflush(2)</bf> with <tt>func == 1</tt> (this is Linux-specific, for
|
|
compatibility with old distributions that still had the 'update'
|
|
line in <tt>/etc/inittab</tt> - nowadays the work of update is done by
|
|
kernel thread <tt>kupdate</tt>).
|
|
</enum>
|
|
|
|
Functions implementing system calls under Linux are prefixed with <tt>sys_</tt>,
|
|
but they are usually concerned only with argument checking or arch-specific
|
|
ways to pass some information and the actual work is done by <tt>do_</tt> functions.
|
|
So it is with <tt>sys_exit()</tt> which calls <tt>do_exit()</tt> to do the work. Although,
|
|
other parts of the kernel sometimes invoke <tt>sys_exit()</tt> while they should really
|
|
call <tt>do_exit()</tt>.
|
|
|
|
The function <tt>do_exit()</tt> is found in <tt>kernel/exit.c</tt>. The points to note about
|
|
<tt>do_exit()</tt>:
|
|
|
|
<itemize>
|
|
<item> Uses global kernel lock (locks but doesn't unlock).
|
|
|
|
<item> Calls <tt>schedule()</tt> at the end, which never returns.
|
|
|
|
<item> Sets the task state to <tt>TASK_ZOMBIE</tt>.
|
|
|
|
<item> Notifies each child with that child's <tt>p->pdeath_signal</tt>, if not 0.
|
|
|
|
<item> Notifies the parent with a <tt>current->exit_signal</tt>, which is usually
|
|
equal to <tt>SIGCHLD</tt>.
|
|
|
|
<item> Releases resources allocated by fork, closes open files etc.
|
|
|
|
<item> On architectures that use lazy FPU switching (ia64, mips, mips64)
|
|
(TODO: remove 'flags' argument of
|
|
sparc, sparc64), do whatever the hardware requires to pass the FPU
|
|
ownership (if owned by current) to "none".
|
|
</itemize>
|
|
|
|
<sect1>Linux Scheduler<p>
|
|
|
|
The job of a scheduler is to arbitrate access to the current CPU between
|
|
multiple processes. The scheduler is implemented in the 'main kernel file'
|
|
<tt>kernel/sched.c</tt>. The corresponding header file <tt>include/linux/sched.h</tt> is
|
|
included (either explicitly or indirectly) by virtually every kernel source
|
|
file.
|
|
|
|
The fields of task structure relevant to scheduler include:
|
|
|
|
<itemize>
|
|
<item> <tt>p->need_resched</tt>: this field is set if <tt>schedule()</tt> should be invoked at
|
|
the 'next opportunity'.
|
|
|
|
<item> <tt>p->counter</tt>: number of clock ticks left to run in this scheduling
|
|
slice, decremented by a timer. When this field becomes lower than or equal to zero, it is reset
|
|
to 0 and <tt>p->need_resched</tt> is set. This is also sometimes called 'dynamic
|
|
priority' of a process because it can change by itself.
|
|
|
|
<item> <tt>p->priority</tt>: the process' static priority, only changed through well-known
|
|
system calls like <bf>nice(2)</bf>, POSIX.1b <bf>sched_setparam(2)</bf> or 4.4BSD/SVR4
|
|
<bf>setpriority(2)</bf>.
|
|
|
|
<item> <tt>p->rt_priority</tt>: realtime priority
|
|
|
|
<item> <tt>p->policy</tt>: the scheduling policy, specifies which scheduling class the
|
|
task belongs to. Tasks can change their scheduling class using the
|
|
<bf>sched_setscheduler(2)</bf> system call. The valid values are <tt>SCHED_OTHER</tt>
|
|
(traditional UNIX process), <tt>SCHED_FIFO</tt> (POSIX.1b FIFO realtime
|
|
process) and <tt>SCHED_RR</tt> (POSIX round-robin realtime process). One can
|
|
also OR <tt>SCHED_YIELD</tt> to any of these values to signify that the process
|
|
decided to yield the CPU, for example by calling <bf>sched_yield(2)</bf> system
|
|
call. A FIFO realtime process will run until either a) it blocks on I/O,
|
|
b) it explicitly yields the CPU or c) it is preempted by another realtime
|
|
process with a higher <tt>p->rt_priority</tt> value. <tt>SCHED_RR</tt> is the same as
|
|
<tt>SCHED_FIFO</tt>, except that when its timeslice expires it goes back to
|
|
the end of the runqueue.
|
|
</itemize>
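
The interplay between <tt>p->counter</tt> and <tt>p->need_resched</tt> described above can be sketched as follows. This is a simplified model of what the timer tick does to the running task, not the kernel's code; <tt>struct sk_task</tt> and <tt>sk_timer_tick()</tt> are hypothetical names.

```c
struct sk_task {
	long counter;		/* clock ticks left in this slice */
	long need_resched;	/* set when schedule() should run soon */
};

/* Account one clock tick to the currently running task. */
static void sk_timer_tick(struct sk_task *p)
{
	if (--p->counter <= 0) {
		p->counter = 0;		/* never left negative */
		p->need_resched = 1;	/* ask for schedule() */
	}
}
```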
|
|
|
|
The scheduler's algorithm is simple, despite the great apparent complexity
|
|
of the <tt>schedule()</tt> function. The function is complex because it implements
|
|
three scheduling algorithms in one and also because of the subtle
|
|
SMP-specifics.
|
|
|
|
The apparently 'useless' gotos in <tt>schedule()</tt> are there for a purpose - to
|
|
generate the best optimised (for i386) code. Also, note that scheduler
|
|
(like most of the kernel) was completely rewritten for 2.4, therefore the
|
|
discussion below does not apply to 2.2 or earlier kernels.
|
|
|
|
Let us look at the function in detail:
|
|
|
|
<enum>
|
|
<item> If <tt>current->active_mm == NULL</tt> then something is wrong. Current
|
|
process, even a kernel thread (<tt>current->mm == NULL</tt>) must have a valid
|
|
<tt>p->active_mm</tt> at all times.
|
|
|
|
<item> If there is something to do on the <tt>tq_scheduler</tt> task queue, process it
|
|
now. Task queues provide a kernel mechanism to schedule execution of
|
|
functions at a later time. We shall look at it in detail elsewhere.
|
|
|
|
<item> Initialise local variables <tt>prev</tt> and <tt>this_cpu</tt> to current task and
|
|
current CPU respectively.
|
|
|
|
<item> Check if <tt>schedule()</tt> was invoked from interrupt handler (due to a bug)
|
|
and panic if so.
|
|
|
|
<item> Release the global kernel lock.
|
|
|
|
<item> If there is some work to do via softirq mechanism, do it now.
|
|
|
|
<item> Initialise local pointer <tt>struct schedule_data *sched_data</tt> to point
|
|
to per-CPU (cacheline-aligned to prevent cacheline ping-pong)
|
|
scheduling data area, which contains the TSC value of <tt>last_schedule</tt> and the
|
|
pointer to last scheduled task structure (TODO: <tt>sched_data</tt> is used on
|
|
SMP only but why does <tt>init_idle()</tt> initialise it on UP as well?).
|
|
|
|
<item> <tt>runqueue_lock</tt> spinlock is taken. Note that we use <tt>spin_lock_irq()</tt>
|
|
because in <tt>schedule()</tt> we guarantee that interrupts are enabled. Therefore,
|
|
when we unlock <tt>runqueue_lock</tt>, we can just re-enable them instead of
|
|
saving/restoring eflags (<tt>spin_lock_irqsave/restore</tt> variant).
|
|
|
|
<item> task state machine: if the task is in <tt>TASK_RUNNING</tt> state, it is left
|
|
alone; if it is in <tt>TASK_INTERRUPTIBLE</tt> state and a signal is pending,
|
|
it is moved into <tt>TASK_RUNNING</tt> state. In all other cases, it is deleted
|
|
from the runqueue.
|
|
|
|
<item> <tt>next</tt> (best candidate to be scheduled) is set to the idle task of
|
|
this cpu. However, the goodness of this candidate is set to a very
|
|
low value (-1000), in hope that there is someone better than that.
|
|
|
|
<item> If the <tt>prev</tt> (current) task is in <tt>TASK_RUNNING</tt> state, then the
|
|
current goodness is set to its goodness and it is marked as a better
|
|
candidate to be scheduled than the idle task.
|
|
|
|
<item> Now the runqueue is examined and a goodness of each process that can
|
|
be scheduled on this cpu is compared with current value; the
|
|
process with highest goodness wins. Now the concept of "can be
|
|
scheduled on this cpu" must be clarified: on UP, every process on
|
|
the runqueue is eligible to be scheduled; on SMP, only a process not
|
|
already running on another cpu is eligible to be scheduled on this
|
|
cpu. The goodness is calculated by a function called <tt>goodness()</tt>, which
|
|
treats realtime processes by making their goodness very high
|
|
(<tt>1000 + p->rt_priority</tt>), this being greater than 1000 guarantees that
|
|
no <tt>SCHED_OTHER</tt> process can win; so they only contend with other
|
|
realtime processes that may have a greater <tt>p->rt_priority</tt>. The
|
|
goodness function returns 0 if the process' time slice (<tt>p->counter</tt>)
|
|
is over. For non-realtime processes, the initial value of goodness is
|
|
set to <tt>p->counter</tt> - this way, the process is less likely to get CPU if
|
|
it already had it for a while, i.e. interactive processes are favoured
|
|
more than CPU bound number crunchers. The arch-specific constant
|
|
<tt>PROC_CHANGE_PENALTY</tt> attempts to implement "cpu affinity" (i.e. give
|
|
advantage to a process on the same CPU). It also gives a slight
|
|
advantage to processes with mm pointing to current <tt>active_mm</tt> or to
|
|
processes with no (user) address space, i.e. kernel threads.
|
|
|
|
<item> if the current value of goodness is 0 then the entire list of
|
|
processes (not just the ones on the runqueue!) is examined and their dynamic
|
|
priorities are recalculated using a simple algorithm:
|
|
|
|
<tscreen><code>
|
|
|
|
recalculate:
|
|
{
|
|
struct task_struct *p;
|
|
spin_unlock_irq(&runqueue_lock);
|
|
read_lock(&tasklist_lock);
|
|
for_each_task(p)
|
|
p->counter = (p->counter >> 1) + p->priority;
|
|
read_unlock(&tasklist_lock);
|
|
spin_lock_irq(&runqueue_lock);
|
|
}
|
|
</code></tscreen>
|
|
|
|
Note that we drop the <tt>runqueue_lock</tt> before we recalculate. The
|
|
reason is that we go through entire set of processes; this can take
|
|
a long time, during which the <tt>schedule()</tt> could be called on another CPU and
|
|
select a process with goodness good enough for that CPU, whilst we on
|
|
this CPU were forced to recalculate. Ok, admittedly this is somewhat
|
|
inconsistent because while we (on this CPU) are selecting a process with
|
|
the best goodness, <tt>schedule()</tt> running on another CPU could be
|
|
recalculating dynamic priorities.
|
|
|
|
<item> From this point on it is certain that <tt>next</tt> points to the task to
|
|
be scheduled, so we initialise <tt>next->has_cpu</tt> to 1 and <tt>next->processor</tt>
|
|
to <tt>this_cpu</tt>. The <tt>runqueue_lock</tt> can now be unlocked.
|
|
|
|
<item> If we are switching back to the same task (<tt>next == prev</tt>) then we can
|
|
simply reacquire the global kernel lock and return, i.e. skip all the
|
|
hardware-level (registers, stack etc.) and VM-related (switch page
|
|
directory, recalculate <tt>active_mm</tt> etc.) stuff.
|
|
|
|
<item> The macro <tt>switch_to()</tt> is architecture specific. On i386, it is
|
|
concerned with a) FPU handling, b) LDT handling, c) reloading segment
|
|
registers, d) TSS handling and e) reloading debug registers.
|
|
</enum>
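
The goodness calculation and the recalculation of dynamic priorities described in the steps above can be approximated by the sketch below. It follows only what the text says and deliberately omits details such as <tt>SCHED_YIELD</tt> handling and any further terms the real <tt>goodness()</tt> adds. The value 15 used for the affinity bonus is an assumed stand-in for <tt>PROC_CHANGE_PENALTY</tt>, and all <tt>SK_</tt> and <tt>sk_</tt> names are hypothetical.

```c
#include <stddef.h>

#define SK_SCHED_OTHER	0
#define SK_SCHED_FIFO	1
#define SK_SCHED_RR	2
#define SK_CPU_BONUS	15	/* assumed stand-in for PROC_CHANGE_PENALTY */

struct sk_mm { int id; };

struct sk_task {
	int policy;
	long counter, priority, rt_priority;
	int processor;		/* CPU the task last ran on */
	struct sk_mm *mm;	/* NULL for a kernel thread */
};

/* Simplified goodness: higher means a better candidate for this CPU. */
static int sk_goodness(const struct sk_task *p, int this_cpu,
		       const struct sk_mm *this_mm)
{
	int weight;

	if (p->policy != SK_SCHED_OTHER)
		return 1000 + (int)p->rt_priority;	/* realtime beats the rest */

	weight = (int)p->counter;
	if (weight == 0)
		return 0;			/* time slice is used up */
	if (p->processor == this_cpu)
		weight += SK_CPU_BONUS;		/* "cpu affinity" */
	if (p->mm == this_mm || p->mm == NULL)
		weight += 1;			/* same or no address space */
	return weight;
}

/* One pass of the recalculation loop, applied to a single task. */
static void sk_recalculate(struct sk_task *p)
{
	p->counter = (p->counter >> 1) + p->priority;
}
```

Repeated recalculation lets a task that keeps sleeping accumulate ticks, but only up to a bound: with <tt>priority</tt> equal to 20 the counter goes 20, 30, 35, 37, 38, 39 and then stays at 39, i.e. just under twice the static priority.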
|
|
|
|
<sect1>Linux linked list implementation<p>
|
|
|
|
Before we go on to examine implementation of wait queues, we must
|
|
acquaint ourselves with the Linux standard doubly-linked list implementation.
|
|
Wait queues (as well as everything else in Linux) make heavy use
|
|
of them and they are called in jargon "list.h implementation" because the
|
|
most relevant file is <tt>include/linux/list.h</tt>.
|
|
|
|
The fundamental data structure here is <tt>struct list_head</tt>:
|
|
|
|
<tscreen><code>
|
|
struct list_head {
|
|
struct list_head *next, *prev;
|
|
};
|
|
|
|
#define LIST_HEAD_INIT(name) { &(name), &(name) }
|
|
|
|
#define LIST_HEAD(name) \
|
|
struct list_head name = LIST_HEAD_INIT(name)
|
|
|
|
#define INIT_LIST_HEAD(ptr) do { \
|
|
(ptr)->next = (ptr); (ptr)->prev = (ptr); \
|
|
} while (0)
|
|
|
|
#define list_entry(ptr, type, member) \
|
|
((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))
|
|
|
|
#define list_for_each(pos, head) \
|
|
for (pos = (head)->next; pos != (head); pos = pos->next)
|
|
</code></tscreen>
|
|
|
|
The first three macros are for initialising an empty list by pointing both
|
|
<tt>next</tt> and <tt>prev</tt> pointers to itself. It is obvious from C syntactical
|
|
restrictions which ones should be used where - for example, <tt>LIST_HEAD_INIT()</tt>
|
|
can be used for structure's element initialisation in declaration, the second
|
|
can be used for static variable initialising declarations and the third can
|
|
be used inside a function.
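
The three situations can be put side by side in a short sketch. The macros are copied from above; <tt>struct sk_queue</tt>, <tt>sk_static_queue</tt>, <tt>sk_global_list</tt>, <tt>sk_queue_init()</tt> and <tt>sk_list_empty()</tt> are hypothetical names used only for illustration.

```c
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)
#define INIT_LIST_HEAD(ptr) do { \
	(ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)

/* Hypothetical container with an embedded list head. */
struct sk_queue {
	int id;
	struct list_head items;
};

/* 1. Member initialiser inside a larger static initialiser. */
static struct sk_queue sk_static_queue = {
	1, LIST_HEAD_INIT(sk_static_queue.items)
};

/* 2. Stand-alone list head, defined and initialised in one go. */
static LIST_HEAD(sk_global_list);

/* 3. Run-time initialisation, e.g. for a freshly allocated object. */
static void sk_queue_init(struct sk_queue *q, int id)
{
	q->id = id;
	INIT_LIST_HEAD(&q->items);
}

/* An empty list is one whose head points back at itself. */
static int sk_list_empty(const struct list_head *head)
{
	return head->next == head;
}
```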
|
|
|
|
The macro <tt>list_entry()</tt> gives access to individual list element, for example
|
|
(from <tt>fs/file_table.c:fs_may_remount_ro()</tt>):
|
|
|
|
<tscreen><code>
|
|
struct super_block {
|
|
...
|
|
struct list_head s_files;
|
|
...
|
|
} *sb = &some_super_block;
|
|
|
|
struct file {
|
|
...
|
|
struct list_head f_list;
|
|
...
|
|
} *file;
|
|
|
|
struct list_head *p;
|
|
|
|
for (p = sb->s_files.next; p != &sb->s_files; p = p->next) {
|
|
struct file *file = list_entry(p, struct file, f_list);
|
|
/* do something to 'file' */
|
|
}
|
|
</code></tscreen>
|
|
|
|
A good example of the use of <tt>list_for_each()</tt> macro is in the scheduler where
|
|
we walk the runqueue looking for the process with highest goodness:
|
|
|
|
<tscreen><code>
|
|
static LIST_HEAD(runqueue_head);
|
|
struct list_head *tmp;
|
|
struct task_struct *p;
|
|
|
|
list_for_each(tmp, &runqueue_head) {
|
|
p = list_entry(tmp, struct task_struct, run_list);
|
|
if (can_schedule(p)) {
|
|
int weight = goodness(p, this_cpu, prev->active_mm);
|
|
if (weight > c)
|
|
c = weight, next = p;
|
|
}
|
|
}
|
|
</code></tscreen>
|
|
|
|
Here, <tt>p->run_list</tt> is declared as <tt>struct list_head run_list</tt> inside
|
|
<tt>task_struct</tt> structure and serves as anchor to the list. Removing an element
|
|
from the list and adding (to head or tail of the list) is done by
|
|
<tt>list_del()/list_add()/list_add_tail()</tt> macros. The examples below are adding
|
|
and removing a task from runqueue:
|
|
|
|
<tscreen><code>
|
|
static inline void del_from_runqueue(struct task_struct * p)
|
|
{
|
|
nr_running--;
|
|
list_del(&p->run_list);
|
|
p->run_list.next = NULL;
|
|
}
|
|
|
|
static inline void add_to_runqueue(struct task_struct * p)
|
|
{
|
|
list_add(&p->run_list, &runqueue_head);
|
|
nr_running++;
|
|
}
|
|
|
|
static inline void move_last_runqueue(struct task_struct * p)
|
|
{
|
|
list_del(&p->run_list);
|
|
list_add_tail(&p->run_list, &runqueue_head);
|
|
}
|
|
|
|
static inline void move_first_runqueue(struct task_struct * p)
|
|
{
|
|
list_del(&p->run_list);
|
|
list_add(&p->run_list, &runqueue_head);
|
|
}
|
|
</code></tscreen>
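
To see how these primitives fit together, here is a self-contained user-space sketch. The function bodies are a plausible rendering of <tt>list_add()</tt>, <tt>list_add_tail()</tt> and <tt>list_del()</tt> rather than a copy of <tt>include/linux/list.h</tt>, <tt>list_entry()</tt> is written with <tt>offsetof()</tt> instead of the open-coded pointer arithmetic, and <tt>struct sk_item</tt>, <tt>sk_items</tt>, <tt>sk_list_insert()</tt> and <tt>sk_order()</tt> are hypothetical.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)

/* Insert 'node' between two neighbours that are already linked. */
static void sk_list_insert(struct list_head *node,
			   struct list_head *prev, struct list_head *next)
{
	next->prev = node;
	node->next = next;
	node->prev = prev;
	prev->next = node;
}

static void list_add(struct list_head *node, struct list_head *head)
{
	sk_list_insert(node, head, head->next);	/* right after the head */
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	sk_list_insert(node, head->prev, head);	/* right before the head */
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Hypothetical element type with an embedded anchor. */
struct sk_item {
	int value;
	struct list_head link;
};

static struct list_head sk_items = LIST_HEAD_INIT(sk_items);

/* Encode the list order as a decimal number, e.g. 3,1,2 gives 312. */
static int sk_order(void)
{
	struct list_head *pos;
	int n = 0;

	list_for_each(pos, &sk_items)
		n = n * 10 + list_entry(pos, struct sk_item, link)->value;
	return n;
}
```

Adding 1 and 2 at the tail and then 3 at the head yields the order 3, 1, 2, which is the same distinction as between <tt>move_last_runqueue()</tt> and <tt>move_first_runqueue()</tt> above.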
|
|
|
|
<sect1>Wait Queues<p>
|
|
|
|
When a process requests the kernel to do something which is currently
|
|
impossible but that may become possible later, the process is put to sleep
|
|
and is woken up when the request is more likely to be satisfied. One of the
|
|
kernel mechanisms used for this is called a 'wait queue'.
|
|
|
|
The Linux implementation allows wake-one semantics using the <tt>TASK_EXCLUSIVE</tt> flag.
|
|
With waitqueues, you can either use a well-known queue and then simply
|
|
<tt>sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout</tt>,
|
|
or you can define your own waitqueue and use <tt>add/remove_wait_queue</tt> to add and
|
|
remove yourself from it and <tt>wake_up/wake_up_interruptible</tt> to wake up
|
|
when needed.
|
|
|
|
An example of the first usage of waitqueues is interaction between the page
|
|
allocator (in <tt>mm/page_alloc.c:__alloc_pages()</tt>) and the <tt>kswapd</tt> kernel daemon (in
|
|
<tt>mm/vmscan.c:kswapd()</tt>), by means of the wait queue <tt>kswapd_wait</tt>, declared in
|
|
<tt>mm/vmscan.c</tt>; the <tt>kswapd</tt> daemon sleeps on this queue, and it is woken up
|
|
whenever the page allocator needs to free up some pages.
|
|
|
|
An example of autonomous waitqueue usage is interaction between
|
|
user process requesting data via <bf>read(2)</bf> system call and kernel running in
|
|
the interrupt context to supply the data. An interrupt handler might look
|
|
like this (simplified <tt>drivers/char/rtc.c:rtc_interrupt()</tt>):
|
|
|
|
<tscreen><code>
|
|
static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);
|
|
|
|
void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
|
|
{
|
|
spin_lock(&rtc_lock);
|
|
rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS);
|
|
spin_unlock(&rtc_lock);
|
|
wake_up_interruptible(&rtc_wait);
|
|
}
|
|
</code></tscreen>
|
|
|
|
So, the interrupt handler obtains the data by reading from some
|
|
device-specific I/O port (the <tt>CMOS_READ()</tt> macro turns into a couple of <tt>outb/inb</tt> operations) and
|
|
then wakes up whoever is sleeping on the <tt>rtc_wait</tt> wait queue.
|
|
|
|
Now, the <bf>read(2)</bf> system call could be implemented as:
|
|
|
|
<tscreen><code>
|
|
ssize_t rtc_read(struct file *file, char *buf, size_t count, loff_t *ppos)
{
	DECLARE_WAITQUEUE(wait, current);
	unsigned long data;
	ssize_t retval;

	add_wait_queue(&rtc_wait, &wait);
	current->state = TASK_INTERRUPTIBLE;
	do {
		spin_lock_irq(&rtc_lock);
		data = rtc_irq_data;
		rtc_irq_data = 0;
		spin_unlock_irq(&rtc_lock);

		if (data != 0)
			break;

		if (file->f_flags & O_NONBLOCK) {
			retval = -EAGAIN;
			goto out;
		}
		if (signal_pending(current)) {
			retval = -ERESTARTSYS;
			goto out;
		}
		schedule();
	} while(1);
	retval = put_user(data, (unsigned long *)buf);
	if (!retval)
		retval = sizeof(unsigned long);

out:
	current->state = TASK_RUNNING;
	remove_wait_queue(&rtc_wait, &wait);
	return retval;
}
|
|
</code></tscreen>
|
|
|
|
What happens in <tt>rtc_read()</tt> is this:
|
|
|
|
<enum>
|
|
<item> We declare a wait queue element pointing to current process context.
|
|
|
|
<item> We add this element to the <tt>rtc_wait</tt> waitqueue.
|
|
|
|
<item> We mark the current context as <tt>TASK_INTERRUPTIBLE</tt>, which means that once it
       goes to sleep it will not be scheduled again until it is woken up (or a signal arrives).
|
|
|
|
<item> We check whether data is available; if it is, we break out of the loop,
       copy the data to the user buffer, mark ourselves as <tt>TASK_RUNNING</tt>, remove
       ourselves from the wait queue and return.
|
|
|
|
<item> If there is no data yet, we check whether the user specified non-blocking I/O
|
|
and if so we fail with <tt>EAGAIN</tt> (which is the same as <tt>EWOULDBLOCK</tt>)
|
|
|
|
<item> We also check if a signal is pending and if so inform the "higher
|
|
layers" to restart the system call if necessary. By "if necessary"
|
|
I meant the details of signal disposition as specified in <bf>sigaction(2)</bf>
|
|
system call.
|
|
|
|
<item> Then we "switch out", i.e. fall asleep, until woken up by the
|
|
interrupt handler. If we didn't mark ourselves as <tt>TASK_INTERRUPTIBLE</tt>
|
|
then the scheduler could schedule us sooner than when the data is
|
|
available, thus causing unneeded processing.
|
|
</enum>
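
The bookkeeping behind these steps can be modelled in user space without any real sleeping. The sketch below is a single-threaded approximation under stated assumptions: <tt>struct task</tt>, <tt>struct wait_entry</tt>, <tt>struct wait_queue</tt> and the helper <tt>wake_up_common()</tt> are hypothetical simplifications, and the wake-up functions return a count purely so the effect can be observed. What it is meant to show is that waking a queue only changes the state of the queued tasks; each task still removes itself from the queue afterwards, as <tt>rtc_read()</tt> does.

```c
#include <stddef.h>

enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE };

/* Stand-in for task_struct: only the scheduling state. */
struct task {
	enum task_state state;
};

/* One sleeper on a queue (the role played by DECLARE_WAITQUEUE). */
struct wait_entry {
	struct task *task;
	struct wait_entry *next;
};

/* The queue itself (the role played by DECLARE_WAIT_QUEUE_HEAD). */
struct wait_queue {
	struct wait_entry *first;
};

static void add_wait_queue(struct wait_queue *q, struct wait_entry *w)
{
	w->next = q->first;
	q->first = w;
}

static void remove_wait_queue(struct wait_queue *q, struct wait_entry *w)
{
	struct wait_entry **pp;

	for (pp = &q->first; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == w) {
			*pp = w->next;
			w->next = NULL;
			return;
		}
	}
}

/* Make runnable every queued task whose state is in 'mask'; return how many. */
static int wake_up_common(struct wait_queue *q, unsigned int mask)
{
	struct wait_entry *w;
	int woken = 0;

	for (w = q->first; w != NULL; w = w->next) {
		if ((1u << w->task->state) & mask) {
			w->task->state = TASK_RUNNING;
			woken++;
		}
	}
	return woken;
}

/* Wakes only tasks sleeping in TASK_INTERRUPTIBLE. */
static int wake_up_interruptible(struct wait_queue *q)
{
	return wake_up_common(q, 1u << TASK_INTERRUPTIBLE);
}

/* Wakes tasks sleeping in either sleeping state. */
static int wake_up(struct wait_queue *q)
{
	return wake_up_common(q, (1u << TASK_INTERRUPTIBLE) |
				 (1u << TASK_UNINTERRUPTIBLE));
}
```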
|
|
|
|
It is also worth pointing out that, using wait queues, it is rather easy to
|
|
implement the <bf>poll(2)</bf> system call:
|
|
|
|
<tscreen><code>
|
|
static unsigned int rtc_poll(struct file *file, poll_table *wait)
|
|
{
|
|
unsigned long l;
|
|
|
|
poll_wait(file, &rtc_wait, wait);
|
|
|
|
spin_lock_irq(&rtc_lock);
|
|
l = rtc_irq_data;
|
|
spin_unlock_irq(&rtc_lock);
|
|
|
|
if (l != 0)
|
|
return POLLIN | POLLRDNORM;
|
|
return 0;
|
|
}
|
|
</code></tscreen>
|
|
|
|
All the work is done by the device-independent function <tt>poll_wait()</tt> which does
|
|
the necessary waitqueue manipulations; all we need to do is point it to the
|
|
waitqueue which is woken up by our device-specific interrupt handler.
|
|
|
|
<sect1>Kernel Timers<p>
|
|
|
|
Now let us turn our attention to kernel timers. Kernel timers are used to
|
|
dispatch execution of a particular function (called 'timer handler') at a
|
|
specified time in the future. The main data structure is <tt>struct timer_list</tt>
|
|
declared in <tt>include/linux/timer.h</tt>:
|
|
|
|
<tscreen><code>
|
|
struct timer_list {
|
|
struct list_head list;
|
|
unsigned long expires;
|
|
unsigned long data;
|
|
void (*function)(unsigned long);
|
|
volatile int running;
|
|
};
|
|
</code></tscreen>
|
|
|
|
The <tt>list</tt> field is for linking into the internal list, protected by the
|
|
<tt>timerlist_lock</tt> spinlock. The <tt>expires</tt> field is the value of <tt>jiffies</tt> when
|
|
the <tt>function</tt> handler should be invoked with <tt>data</tt> passed as a parameter.
|
|
The <tt>running</tt> field is used on SMP to test if the timer handler is currently
|
|
running on another CPU.
|
|
|
|
The functions <tt>add_timer()</tt> and <tt>del_timer()</tt> add a given timer to the list and
remove it from the list, respectively. When a timer expires, it is removed automatically.
Before a timer is used, it MUST be initialised by means of the <tt>init_timer()</tt> function,
and before it is added, the fields <tt>function</tt> and <tt>expires</tt> must be set.
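
The life cycle of a timer (initialise, fill in the fields, add, expire or delete) can be illustrated with a small user-space model. This is a sketch only: the list here is a plain unsorted singly linked list, <tt>jiffies</tt> is an ordinary variable advanced by hand, and <tt>run_timers()</tt>, <tt>record_data()</tt> and <tt>fired_sum</tt> are hypothetical names invented for the example; the real kernel data structures are considerably more elaborate.

```c
#include <stddef.h>

/* Simplified timer: a singly linked, unsorted list instead of list_head. */
struct timer_list {
	struct timer_list *next;
	unsigned long expires;		/* value of jiffies at which to fire */
	unsigned long data;		/* argument passed to the handler */
	void (*function)(unsigned long);
};

static struct timer_list *timer_head;
static unsigned long jiffies;		/* advanced by hand in this model */

static void init_timer(struct timer_list *t)
{
	t->next = NULL;
}

static void add_timer(struct timer_list *t)
{
	t->next = timer_head;
	timer_head = t;
}

/* Remove a timer; return 1 if it was still pending, 0 otherwise. */
static int del_timer(struct timer_list *t)
{
	struct timer_list **pp;

	for (pp = &timer_head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == t) {
			*pp = t->next;
			t->next = NULL;
			return 1;
		}
	}
	return 0;
}

/* Fire every timer whose expiry time has been reached. */
static void run_timers(void)
{
	struct timer_list **pp = &timer_head;

	while (*pp != NULL) {
		struct timer_list *t = *pp;

		/* Signed difference, so the test survives jiffies wrapping. */
		if ((long)(jiffies - t->expires) >= 0) {
			*pp = t->next;	/* expired timers are unlinked first */
			t->next = NULL;
			t->function(t->data);
		} else {
			pp = &t->next;
		}
	}
}

/* Example handler: accumulate the data values of the timers that fired. */
static unsigned long fired_sum;

static void record_data(unsigned long data)
{
	fired_sum += data;
}
```

Because an expired timer is unlinked before its handler runs, a later <tt>del_timer()</tt> on it finds nothing to remove and returns 0 in this model.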
|
|
|
|
<sect1>Bottom Halves<p>
|
|
|
|
Sometimes it is reasonable to split the amount of work to be performed inside
|
|
an interrupt handler into immediate work (e.g. acknowledging the interrupt,
|
|
updating the stats etc.) and work which can be postponed until later, when
|
|
interrupts are enabled (e.g. to do some postprocessing on data, wake up
|
|
processes waiting for this data, etc).
|
|
|
|
Bottom halves are the oldest mechanism for deferred execution of kernel
|
|
tasks and have been available since Linux 1.x. In Linux 2.0, a new mechanism
|
|
was added, called 'task queues', which will be the subject of the next section.
|
|
|
|
Bottom halves are serialised by the <tt>global_bh_lock</tt> spinlock, i.e.
|
|
there can only be one bottom half running on any CPU at a time. However,
|
|
when attempting to execute the handler, if <tt>global_bh_lock</tt> is not available,
|
|
the bottom half is marked (i.e. scheduled) for execution - so processing can
|
|
continue, as opposed to a busy loop on <tt>global_bh_lock</tt>.
|
|
|
|
There can only be 32 bottom halves registered in total.
|
|
The functions required to manipulate bottom halves are as follows (all
|
|
exported to modules):
|
|
|
|
<itemize>
|
|
<item> <tt>void init_bh(int nr, void (*routine)(void))</tt>: installs a bottom half
|
|
handler pointed to by <tt>routine</tt> argument into slot <tt>nr</tt>. The slot
|
|
ought to be enumerated in <tt>include/linux/interrupt.h</tt> in the form
|
|
<tt>XXXX_BH</tt>, e.g. <tt>TIMER_BH</tt> or <tt>TQUEUE_BH</tt>. Typically, a subsystem's
|
|
initialisation routine (<tt>init_module()</tt> for modules) installs the
|
|
required bottom half using this function.
|
|
|
|
<item> <tt>void remove_bh(int nr)</tt>: does the opposite of <tt>init_bh()</tt>, i.e.
|
|
de-installs bottom half installed at slot <tt>nr</tt>. There is no error
|
|
checking performed there, so, for example <tt>remove_bh(32)</tt> will
|
|
panic/oops the system. Typically, a subsystem's cleanup routine
|
|
(<tt>cleanup_module()</tt> for modules) uses this function to free up the slot
|
|
that can later be reused by some other subsystem. (TODO: wouldn't it
|
|
be nice to have <tt>/proc/bottom_halves</tt> list all registered bottom
|
|
halves on the system? That means <tt>global_bh_lock</tt> must be made
|
|
read/write, obviously)
|
|
|
|
<item> <tt>void mark_bh(int nr)</tt>: marks bottom half in slot <tt>nr</tt> for execution. Typically,
|
|
an interrupt handler will mark its bottom half (hence the name!) for
|
|
execution at a "safer time".
|
|
|
|
</itemize>
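
The register/mark/run cycle can be sketched in user space as a table of 32 slots plus a pending bitmask. This is a conceptual model only, loosely following the bitmask scheme of older kernels rather than the 2.4 implementation; <tt>bh_base</tt>, <tt>bh_active</tt>, <tt>run_bottom_halves()</tt>, <tt>timer_bh()</tt> and the slot value chosen for <tt>TIMER_BH</tt> are assumptions made for the example. Unlike the kernel functions described above, the model checks the slot number with <tt>assert()</tt>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_BH 32

enum { TIMER_BH = 0 };			/* illustrative slot number */

static void (*bh_base[NR_BH])(void);	/* one handler per slot */
static uint32_t bh_active;		/* bit nr set = slot nr is marked */

static void init_bh(int nr, void (*routine)(void))
{
	assert(nr >= 0 && nr < NR_BH);
	bh_base[nr] = routine;
}

static void remove_bh(int nr)
{
	assert(nr >= 0 && nr < NR_BH);
	bh_base[nr] = NULL;
	bh_active &= ~(UINT32_C(1) << nr);
}

/* Marking is idempotent: marking twice still runs the handler once. */
static void mark_bh(int nr)
{
	assert(nr >= 0 && nr < NR_BH);
	bh_active |= UINT32_C(1) << nr;
}

/* Run every marked handler once; return how many were run. */
static int run_bottom_halves(void)
{
	uint32_t pending = bh_active;
	int nr, ran = 0;

	bh_active = 0;
	for (nr = 0; nr < NR_BH; nr++) {
		if ((pending & (UINT32_C(1) << nr)) && bh_base[nr] != NULL) {
			bh_base[nr]();
			ran++;
		}
	}
	return ran;
}

/* Example handler: count how often the "timer" bottom half has run. */
static int timer_ticks;

static void timer_bh(void)
{
	timer_ticks++;
}
```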
|
|
|
|
Bottom halves are globally locked tasklets, so the question "when are bottom
|
|
half handlers executed?" is really "when are tasklets executed?". And the
|
|
answer is, in two places: a) on each <tt>schedule()</tt> and b) on each
|
|
interrupt/syscall return path in <tt>entry.S</tt> (TODO: therefore, the <tt>schedule()</tt>
|
|
case is really boring - it is like adding yet another very very slow interrupt,
|
|
why not get rid of <tt>handle_softirq</tt> label from <tt>schedule()</tt> altogether?).
|
|
|
|
|
|
<sect1>Task Queues<p>
|
|
|
|
Task queues can be thought of as a dynamic extension to old bottom halves. In
|
|
fact, in the source code they are sometimes referred to as "new" bottom
|
|
halves. More specifically, the old bottom halves discussed in previous
|
|
section have these limitations:
|
|
|
|
<enum>
|
|
<item> There are only a fixed number (32) of them.
|
|
|
|
<item> Each bottom half can only be associated with one handler function.
|
|
|
|
<item> Bottom halves are consumed with a spinlock held so they cannot block.
|
|
</enum>
|
|
|
|
So, with task queues, an arbitrary number of functions can be chained and
|
|
processed one after another at a later time. One creates a new task queue
|
|
using the <tt>DECLARE_TASK_QUEUE()</tt> macro and queues a task onto it using
|
|
the <tt>queue_task()</tt> function. The task queue then can be processed using
|
|
<tt>run_task_queue()</tt>. Instead of creating your own task queue (and
|
|
having to consume it manually) you can use one of Linux' predefined
|
|
task queues which are consumed at well-known points:
|
|
|
|
<enum>
|
|
<item> <bf>tq_timer</bf>: the timer task queue, run on each timer interrupt
|
|
and when releasing a tty device (closing or releasing a half-opened
|
|
terminal device). Since the timer handler runs in interrupt context,
|
|
the <tt>tq_timer</tt> tasks also run in interrupt context and thus cannot block.
|
|
|
|
<item> <bf>tq_scheduler</bf>: the scheduler task queue, consumed by the scheduler (and also
|
|
when closing tty devices, like <tt>tq_timer</tt>). Since the scheduler executes
|
|
in the context of the process being re-scheduled, the <tt>tq_scheduler</tt>
|
|
tasks can do anything they like, i.e. block, use process context data
|
|
(but why would they want to), etc.
|
|
|
|
<item> <bf>tq_immediate</bf>: this is really a bottom half <tt>IMMEDIATE_BH</tt>, so
|
|
drivers can <tt>queue_task(task, &tq_immediate)</tt> and then
|
|
<tt>mark_bh(IMMEDIATE_BH)</tt> to be consumed in interrupt context.
|
|
|
|
<item> <bf>tq_disk</bf>: used by low level block device access (and RAID) to start
|
|
the actual requests. This task queue is exported to modules but shouldn't
|
|
be used except for the special purposes which it was designed for.
|
|
</enum>
|
|
|
|
Unless a driver uses its own task queues, it does not need to call
|
|
<tt>run_task_queue()</tt> to process the queue, except under circumstances explained
|
|
below.
|
|
|
|
The reason <tt>tq_timer/tq_scheduler</tt> task queues are consumed not only in the
|
|
usual places but elsewhere (closing tty device is but one example) becomes
|
|
clear if one remembers that the driver can schedule tasks on the queue, and these tasks
|
|
only make sense while a particular instance of the device is still valid
|
|
- which usually means until the application closes it. So, the driver may
|
|
need to call <tt>run_task_queue()</tt> to flush the tasks it (and anyone else) has
|
|
put on the queue, because allowing them to run at a later time may make no
|
|
sense - i.e. the relevant data structures may have been freed/reused by a
|
|
different instance. This is the reason you see <tt>run_task_queue()</tt> on <tt>tq_timer</tt>
|
|
and <tt>tq_scheduler</tt> in places other than timer interrupt and <tt>schedule()</tt>
|
|
respectively.
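
The queue-then-consume behaviour, including the flush a driver performs when a device is closed, can be sketched in user space as follows. This is an illustration under simplifying assumptions: the queue is a plain singly linked list, <tt>run_task_queue()</tt> returns the number of tasks it ran so that the effect is visible, and <tt>record()</tt>, <tt>order</tt> and <tt>order_len</tt> are hypothetical names used only to observe the order of execution.

```c
#include <stddef.h>

/* Simplified task: linked through 'next', with 'sync' set while queued. */
struct tq_struct {
	struct tq_struct *next;
	int sync;
	void (*routine)(void *);
	void *data;
};

/* Append a task to the queue; return 1 if queued, 0 if already pending. */
static int queue_task(struct tq_struct *t, struct tq_struct **q)
{
	if (t->sync)
		return 0;
	t->sync = 1;
	t->next = NULL;
	while (*q != NULL)
		q = &(*q)->next;
	*q = t;
	return 1;
}

/* Run all queued tasks in FIFO order; return how many were run. */
static int run_task_queue(struct tq_struct **q)
{
	struct tq_struct *t = *q;
	int ran = 0;

	*q = NULL;	/* detach first, so a routine may queue itself again */
	while (t != NULL) {
		struct tq_struct *next = t->next;
		void (*routine)(void *) = t->routine;
		void *data = t->data;

		t->sync = 0;
		routine(data);
		t = next;
		ran++;
	}
	return ran;
}

/* Example routine: log the first character of its argument. */
static char order[8];
static size_t order_len;

static void record(void *data)
{
	if (order_len < sizeof(order) - 1)
		order[order_len++] = *(const char *)data;
}
```

A second <tt>run_task_queue()</tt> on the same queue finds it empty, which is why flushing the queue at close time guarantees that no stale task runs later.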
|
|
|
|
<sect1>Tasklets<p>
|
|
|
|
Not yet, will be in future revision.
|
|
|
|
<sect1>Softirqs<p>
|
|
|
|
Not yet, will be in future revision.
|
|
|
|
<sect1>How System Calls Are Implemented on i386 Architecture?<p>
|
|
|
|
There are two mechanisms under Linux for implementing system calls:
|
|
|
|
<itemize>
|
|
<item> lcall7/lcall27 call gates;
|
|
<item> int 0x80 software interrupt.
|
|
</itemize>
|
|
|
|
Native Linux programs use int 0x80 whilst binaries from foreign flavours
|
|
of UNIX (Solaris, UnixWare 7 etc.) use the lcall7 mechanism. The name 'lcall7' is
|
|
historically misleading because it also covers lcall27 (e.g. Solaris/x86), but
|
|
the handler function is called lcall7_func.
|
|
|
|
When the system boots, the function <tt>arch/i386/kernel/traps.c:trap_init()</tt> is
|
|
called which sets up the IDT so that vector 0x80 (of type 15, dpl 3) points to
|
|
the address of system_call entry from <tt>arch/i386/kernel/entry.S</tt>.
|
|
|
|
When a userspace application makes a system call, the arguments are passed via registers
|
|
and the application executes 'int 0x80' instruction. This causes a trap into
|
|
kernel mode and processor jumps to system_call entry point in <tt>entry.S</tt>.
|
|
What this does is:
|
|
|
|
<enum>
|
|
<item> Save registers.
|
|
|
|
<item> Set %ds and %es to KERNEL_DS, so that all data (and extra segment)
|
|
references are made in kernel address space.
|
|
|
|
<item> If the value of %eax is greater than or equal to <tt>NR_syscalls</tt> (currently 256),
|
|
fail with <tt>ENOSYS</tt> error.
|
|
|
|
<item> If the task is being ptraced (<tt>tsk->ptrace & PT_TRACESYS</tt>), do special
|
|
processing. This is to support programs like strace (analogue of
|
|
SVR4 <bf>truss(1)</bf>) or debuggers.
|
|
|
|
<item> Call the handler whose address is stored at <tt>sys_call_table+4*(syscall_number from %eax)</tt>. This table is
|
|
initialised in the same file (<tt>arch/i386/kernel/entry.S</tt>) to point to
|
|
individual system call handlers which under Linux are (usually)
|
|
prefixed with <tt>sys_</tt>, e.g. <tt>sys_open</tt>, <tt>sys_exit</tt>, etc. These C system
|
|
call handlers will find their arguments on the stack where <tt>SAVE_ALL</tt>
|
|
stored them.
|
|
|
|
<item> Enter 'system call return path'. This is a separate label because it
|
|
is used not only by int 0x80 but also by lcall7, lcall27. This is
|
|
concerned with handling tasklets (including bottom halves), checking
|
|
if a <tt>schedule()</tt> is needed (<tt>tsk->need_resched != 0</tt>), checking if there
|
|
are signals pending and if so handling them.
|
|
</enum>
|
|
|
|
Linux supports up to 6 arguments for system calls. They are passed in
|
|
%ebx, %ecx, %edx, %esi, %edi (and %ebp used temporarily, see <tt>_syscall6()</tt> in
|
|
<tt>asm-i386/unistd.h</tt>). The system call number is passed via %eax.
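
The table-driven dispatch performed by the entry code can be sketched in portable C. This is a user-space model only: the real code is assembly in <tt>entry.S</tt>, and here the register values are ordinary function arguments, only three argument registers are modelled, and <tt>sys_add_demo()</tt> together with its slot number is a made-up handler for the example. Empty table slots are treated as <tt>sys_ni_syscall()</tt>.

```c
#include <errno.h>
#include <stddef.h>

#define NR_syscalls 256

typedef long (*syscall_fn)(long, long, long);

/* Handler for unimplemented system calls. */
static long sys_ni_syscall(long a, long b, long c)
{
	(void)a; (void)b; (void)c;
	return -ENOSYS;
}

/* Hypothetical handler used only for this example. */
static long sys_add_demo(long a, long b, long c)
{
	return a + b + c;
}

/* Slots not listed here are NULL and are handled as unimplemented. */
static syscall_fn sys_call_table[NR_syscalls] = {
	[0] = sys_ni_syscall,
	[1] = sys_add_demo,
};

/* Model of the dispatch step: 'eax' selects the handler. */
static long system_call(unsigned long eax, long ebx, long ecx, long edx)
{
	syscall_fn fn;

	if (eax >= NR_syscalls)		/* number outside the table */
		return -ENOSYS;
	fn = sys_call_table[eax];
	if (fn == NULL)
		fn = sys_ni_syscall;
	return fn(ebx, ecx, edx);
}
```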
|
|
|
|
<sect1>Atomic Operations<p>
|
|
|
|
There are two types of atomic operations: bitmaps and <tt>atomic_t</tt>. Bitmaps are
|
|
very convenient for maintaining a concept of "allocated" or "free" units
|
|
from some large collection where each unit is identified by some number, for
|
|
example free inodes or free blocks. They are also widely used for simple
|
|
locking, for example to provide exclusive access to open a device. An example
|
|
of this can be found in <tt>arch/i386/kernel/microcode.c</tt>:
|
|
|
|
|
|
<tscreen><code>
|
|
/*
|
|
* Bits in microcode_status. (31 bits of room for future expansion)
|
|
*/
|
|
#define MICROCODE_IS_OPEN 0 /* set if device is in use */
|
|
|
|
static unsigned long microcode_status;
|
|
</code></tscreen>
|
|
|
|
There is no need to initialise <tt>microcode_status</tt> to 0 as BSS is zero-cleared
|
|
under Linux explicitly.
|
|
|
|
<tscreen><code>
|
|
/*
|
|
* We enforce only one user at a time here with open/close.
|
|
*/
|
|
static int microcode_open(struct inode *inode, struct file *file)
|
|
{
|
|
if (!capable(CAP_SYS_RAWIO))
|
|
return -EPERM;
|
|
|
|
/* one at a time, please */
|
|
if (test_and_set_bit(MICROCODE_IS_OPEN, &microcode_status))
|
|
return -EBUSY;
|
|
|
|
MOD_INC_USE_COUNT;
|
|
return 0;
|
|
}
|
|
</code></tscreen>
|
|
|
|
The operations on bitmaps are:
|
|
|
|
<itemize>
|
|
<item> <bf>void set_bit(int nr, volatile void *addr)</bf>: set bit <tt>nr</tt>
|
|
in the bitmap pointed to by <tt>addr</tt>.
|
|
|
|
<item> <bf>void clear_bit(int nr, volatile void *addr)</bf>: clear bit
|
|
<tt>nr</tt> in the bitmap pointed to by <tt>addr</tt>.
|
|
|
|
<item> <bf>void change_bit(int nr, volatile void *addr)</bf>: toggle bit
|
|
<tt>nr</tt> (if set clear, if clear set) in the bitmap pointed to by <tt>addr</tt>.
|
|
|
|
<item> <bf>int test_and_set_bit(int nr, volatile void *addr)</bf>:
|
|
atomically set bit <tt>nr</tt> and return the old bit value.
|
|
|
|
<item> <bf>int test_and_clear_bit(int nr, volatile void *addr)</bf>:
|
|
atomically clear bit <tt>nr</tt> and return the old bit value.
|
|
|
|
<item> <bf>int test_and_change_bit(int nr, volatile void *addr)</bf>:
|
|
atomically toggle bit <tt>nr</tt> and return the old bit value.
|
|
</itemize>
|
|
|
|
These operations use the <tt>LOCK_PREFIX</tt> macro, which on SMP kernels evaluates to
|
|
bus lock instruction prefix and to nothing on UP. This guarantees atomicity
|
|
of access in SMP environment.
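
A user-space approximation of the test-and-modify operations can be written with C11 atomics, which on x86 are typically compiled to lock-prefixed instructions. The sketch below is not the kernel implementation: it takes an <tt>atomic_ulong</tt> array rather than a <tt>volatile void *</tt>, and <tt>device_status</tt>, <tt>device_open()</tt> and <tt>device_release()</tt> are hypothetical names that mimic the one-opener-at-a-time pattern of <tt>microcode_open()</tt>.

```c
#include <errno.h>
#include <limits.h>
#include <stdatomic.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Atomically set bit nr and return its previous value (0 or 1). */
static int test_and_set_bit(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	unsigned long old = atomic_fetch_or(&addr[nr / BITS_PER_LONG], mask);

	return (old & mask) != 0;
}

/* Atomically clear bit nr and return its previous value. */
static int test_and_clear_bit(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	unsigned long old = atomic_fetch_and(&addr[nr / BITS_PER_LONG], ~mask);

	return (old & mask) != 0;
}

/* Atomically toggle bit nr and return its previous value. */
static int test_and_change_bit(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	unsigned long old = atomic_fetch_xor(&addr[nr / BITS_PER_LONG], mask);

	return (old & mask) != 0;
}

#define DEVICE_IS_OPEN 0		/* set if the device is in use */

static atomic_ulong device_status;	/* static storage, so starts at 0 */

/* Only the caller that flips the bit from 0 to 1 gets the device. */
static int device_open(void)
{
	if (test_and_set_bit(DEVICE_IS_OPEN, &device_status))
		return -EBUSY;
	return 0;
}

static void device_release(void)
{
	test_and_clear_bit(DEVICE_IS_OPEN, &device_status);
}
```

The point of the single atomic step is that testing and setting cannot be separated by another CPU: two concurrent openers cannot both see the bit clear.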
|
|
|
|
Sometimes bit manipulations are not convenient, but instead we need to perform
|
|
arithmetic operations - add, subtract, increment, decrement. The typical cases
|
|
are reference counts (e.g. for inodes). This facility is provided by the
|
|
<tt>atomic_t</tt> data type and the following operations:
|
|
|
|
<itemize>
|
|
<item> <bf>atomic_read(&v)</bf>: read the value of <tt>atomic_t</tt> variable <tt>v</tt>.
|
|
|
|
<item> <bf>atomic_set(&v, i)</bf>: set the value of <tt>atomic_t</tt> variable
|
|
<tt>v</tt> to integer <tt>i</tt>.
|
|
|
|
<item> <bf>void atomic_add(int i, volatile atomic_t *v)</bf>: add integer
|
|
<tt>i</tt> to the value of atomic variable pointed to by <tt>v</tt>.
|
|
|
|
<item> <bf>void atomic_sub(int i, volatile atomic_t *v)</bf>: subtract
|
|
integer <tt>i</tt> from the value of atomic variable pointed to by <tt>v</tt>.
|
|
|
|
<item> <bf>int atomic_sub_and_test(int i, volatile atomic_t *v)</bf>:
|
|
subtract integer <tt>i</tt> from the value of atomic variable pointed to by
|
|
<tt>v</tt>; return 1 if the new value is 0, return 0 otherwise.
|
|
|
|
<item> <bf>void atomic_inc(volatile atomic_t *v)</bf>: increment the value
|
|
by 1.
|
|
|
|
<item> <bf>void atomic_dec(volatile atomic_t *v)</bf>: decrement the value
|
|
by 1.
|
|
|
|
<item> <bf>int atomic_dec_and_test(volatile atomic_t *v)</bf>: decrement
|
|
the value; return 1 if the new value is 0, return 0 otherwise.
|
|
|
|
<item> <bf>int atomic_inc_and_test(volatile atomic_t *v)</bf>: increment
|
|
the value; return 1 if the new value is 0, return 0 otherwise.
|
|
|
|
<item> <bf>int atomic_add_negative(int i, volatile atomic_t *v)</bf>: add
|
|
the value of <tt>i</tt> to <tt>v</tt> and return 1 if the result is negative. Return
|
|
0 if the result is greater than or equal to 0. This operation is used
|
|
for implementing semaphores.
|
|
</itemize>
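
The reference-counting use of these operations can be sketched in user space with C11 atomics. The following is an approximation, not the kernel's implementation: <tt>atomic_t</tt> is modelled as a wrapper around <tt>atomic_int</tt>, only a subset of the operations is shown, and <tt>struct object</tt>, <tt>get_object()</tt> and <tt>put_object()</tt> are hypothetical names standing in for something like an inode and its get/put helpers.

```c
#include <stdatomic.h>

/* User-space stand-in for the kernel's atomic_t. */
typedef struct {
	atomic_int counter;
} atomic_t;

static int atomic_read(atomic_t *v)
{
	return atomic_load(&v->counter);
}

static void atomic_set(atomic_t *v, int i)
{
	atomic_store(&v->counter, i);
}

static void atomic_inc(atomic_t *v)
{
	atomic_fetch_add(&v->counter, 1);
}

/* Decrement; return 1 if the new value is 0, 0 otherwise. */
static int atomic_dec_and_test(atomic_t *v)
{
	return atomic_fetch_sub(&v->counter, 1) == 1;
}

/* Add i; return 1 if the new value is negative, 0 otherwise. */
static int atomic_add_negative(int i, atomic_t *v)
{
	return atomic_fetch_add(&v->counter, i) + i < 0;
}

/* A reference-counted object; 'freed' records that the last user left. */
struct object {
	atomic_t count;
	int freed;
};

static void get_object(struct object *o)
{
	atomic_inc(&o->count);
}

/* Exactly one caller sees the count reach zero and releases the object. */
static void put_object(struct object *o)
{
	if (atomic_dec_and_test(&o->count))
		o->freed = 1;
}
```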
|
|
|
|
<sect1>Spinlocks, Read-write Spinlocks and Big-Reader Spinlocks<p>
|
|
|
|
Since the early days of Linux (the early 1990s),
|
|
developers were faced with the classical problem of accessing shared data
|
|
between different types of context (user process vs
|
|
interrupt) and different instances of the same context from multiple cpus.
|
|
|
|
SMP support was added to Linux 1.3.42 on 15 Nov 1995 (the original patch
|
|
was made to 1.3.37 in October the same year).
|
|
|
|
If the critical region of code may be executed by either process context
or interrupt context, then the way to protect it using <tt>cli/sti</tt> instructions
on UP is:
|
|
|
|
<tscreen><code>
|
|
unsigned long flags;
|
|
|
|
save_flags(flags);
|
|
cli();
|
|
/* critical code */
|
|
restore_flags(flags);
|
|
</code></tscreen>
|
|
|
|
While this is ok on UP, it obviously is of no use on SMP because the same
|
|
code sequence may be executed simultaneously on another cpu, and while <tt>cli()</tt>
|
|
provides protection against races with interrupt context on each CPU individually, it
|
|
provides no protection at all against races between contexts running on different
|
|
CPUs. This is what spinlocks are for.
|
|
|
|
There are three types of spinlocks: vanilla (basic), read-write and
|
|
big-reader spinlocks. Read-write spinlocks should be used when there is a
|
|
natural tendency of 'many readers and few writers'. An example of this is
|
|
access to the list of registered filesystems (see <tt>fs/super.c</tt>). The list is
|
|
guarded by the <tt>file_systems_lock</tt> read-write spinlock because one needs exclusive
|
|
access only when registering/unregistering a filesystem, but any process can
|
|
read the file <tt>/proc/filesystems</tt> or use the <bf>sysfs(2)</bf> system call to force a
|
|
read-only scan of the file_systems list. This makes it sensible to use
|
|
read-write spinlocks. With read-write spinlocks, one can have multiple
readers at a time but only one writer and there can be no readers while
there is a writer. Btw, it would be nice if new readers would not get a lock
while there is a writer trying to get a lock, i.e. if Linux could correctly
deal with the issue of potential writer starvation by multiple readers.
This would mean that readers must be blocked while there is a writer
attempting to get the lock. This is not currently the case and it is not
obvious whether this should be fixed - the argument to the contrary is -
readers usually take the lock for a very short time so should they really
be starved while the writer takes the lock for potentially longer periods?
|
|
|
|
Big-reader spinlocks are a form of read-write spinlocks
|
|
heavily optimised for very light read access, with a penalty for writes.
|
|
There is a limited number of big-reader spinlocks - currently only two exist,
|
|
of which one is used only on sparc64 (global irq) and the other is used for
|
|
networking. In all other cases where the access pattern does not fit into
|
|
either of these two scenarios, one should use basic spinlocks. You cannot block
|
|
while holding any kind of spinlock.
|
|
|
|
Spinlocks come in three flavours: plain, <tt>_irq()</tt> and <tt>_bh()</tt>.
|
|
|
|
<enum>
|
|
<item> Plain <tt>spin_lock()/spin_unlock()</tt>: if you know the interrupts are always
|
|
disabled or if you do not race with interrupt context (e.g. from
|
|
within interrupt handler), then you can use this one. It does not
|
|
touch interrupt state on the current CPU.
|
|
|
|
<item> <tt>spin_lock_irq()/spin_unlock_irq()</tt>: if you know that interrupts are
|
|
always enabled then you can use this version, which simply disables
|
|
(on lock) and re-enables (on unlock) interrupts on the current CPU.
|
|
For example, <tt>rtc_read()</tt> uses
|
|
<tt>spin_lock_irq(&rtc_lock)</tt> (interrupts are always enabled inside
|
|
<tt>read()</tt>) whilst <tt>rtc_interrupt()</tt> uses
|
|
<tt>spin_lock(&rtc_lock)</tt> (interrupts are always disabled inside
|
|
interrupt handler). Note that <tt>rtc_read()</tt> uses <tt>spin_lock_irq()</tt> and not
|
|
the more generic <tt>spin_lock_irqsave()</tt> because on entry to any system
|
|
call interrupts are always enabled.
|
|
|
|
<item> <tt>spin_lock_irqsave()/spin_unlock_irqrestore()</tt>: the strongest form,
|
|
to be used when the interrupt state is not known, but only if
|
|
interrupts matter at all, i.e. there is no point in using it if
|
|
our interrupt handlers don't execute any critical code.
|
|
</enum>
|
|
|
|
The reason you cannot use plain <tt>spin_lock()</tt> if you race against interrupt handlers
is that if you take the lock and then an interrupt comes in on the same CPU, the interrupt
handler will busy wait for the lock forever: the lock holder, having been interrupted,
will not continue until the interrupt handler returns.
|
|
|
|
The most common usage of a spinlock is to access a data structure shared
|
|
between user process context and interrupt handlers:
|
|
|
|
<tscreen><code>
|
|
spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

my_ioctl()
{
	spin_lock_irq(&my_lock);
	/* critical section */
	spin_unlock_irq(&my_lock);
}

my_irq_handler()
{
	spin_lock(&my_lock);
	/* critical section */
	spin_unlock(&my_lock);
}
|
|
</code></tscreen>
|
|
|
|
There are a couple of things to note about this example:
|
|
|
|
<enum>
|
|
<item> The process context, represented here as a typical driver method -
|
|
<tt>ioctl()</tt> (arguments and return values omitted for clarity), must
|
|
use <tt>spin_lock_irq()</tt> because it knows that interrupts are always
|
|
enabled while executing the device <tt>ioctl()</tt> method.
|
|
|
|
<item> Interrupt context, represented here by <tt>my_irq_handler()</tt> (again
|
|
arguments omitted for clarity) can use plain <tt>spin_lock()</tt> form because
|
|
interrupts are disabled inside an interrupt handler.
|
|
</enum>
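
The lock itself can be modelled in user space with a C11 <tt>atomic_flag</tt>. This is a sketch under clear limitations: user space cannot disable interrupts, so only the inter-CPU half of the story is modelled, and <tt>shared_counter</tt> and <tt>update_shared()</tt> are hypothetical names for the protected data and its critical section. The names <tt>spinlock_t</tt>, <tt>spin_lock()</tt>, <tt>spin_trylock()</tt> and <tt>spin_unlock()</tt> follow the kernel, but the bodies are not the kernel's.

```c
#include <stdatomic.h>

/* User-space stand-in for the kernel's spinlock_t. */
typedef struct {
	atomic_flag locked;
} spinlock_t;

#define SPIN_LOCK_UNLOCKED { ATOMIC_FLAG_INIT }

/* Busy-wait until the flag was clear at the moment we set it. */
static void spin_lock(spinlock_t *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;	/* spin */
}

/* Try once; return 1 if the lock was taken, 0 if it was already held. */
static int spin_trylock(spinlock_t *l)
{
	return !atomic_flag_test_and_set_explicit(&l->locked,
						  memory_order_acquire);
}

static void spin_unlock(spinlock_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
static long shared_counter;

/* Critical section: update the shared data with the lock held. */
static long update_shared(long delta)
{
	long value;

	spin_lock(&my_lock);
	shared_counter += delta;
	value = shared_counter;
	spin_unlock(&my_lock);
	return value;
}
```

A <tt>spin_trylock()</tt> on a lock that is already held fails immediately; a plain <tt>spin_lock()</tt> in the same position would spin forever, which is the situation an interrupt handler ends up in when it interrupts the lock holder on the same CPU.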
|
|
|
|
<sect1>Semaphores and read/write Semaphores<p>
|
|
|
|
Sometimes, while accessing a shared data structure, one must perform operations
|
|
that can block, for example copy data to userspace. The locking primitive
|
|
available for such scenarios under Linux is called a semaphore. There are two
|
|
types of semaphores: basic and read-write semaphores. Depending on the
|
|
initial value of the semaphore, they can be used for either mutual exclusion
|
|
(initial value of 1) or to provide more sophisticated type of access.
|
|
|
|
Read-write semaphores differ from basic semaphores in the same way as
|
|
read-write spinlocks differ from basic spinlocks: one can have multiple
|
|
readers at a time but only one writer and there can be no readers while there are
|
|
writers - i.e. the writer blocks all readers and new readers block while a
|
|
writer is waiting.
|
|
|
|
Also, basic semaphores can be interruptible - just use
<tt>down_interruptible()</tt> instead of the plain <tt>down()</tt> (the semaphore is still released with <tt>up()</tt>) and check the
value returned from <tt>down_interruptible()</tt>: it will be non-zero if the operation was
interrupted.
|
|
|
|
Using semaphores for mutual exclusion is ideal in situations where a critical
|
|
code section may call by reference unknown functions registered by other
|
|
subsystems/modules, i.e. the caller cannot know a priori whether the function
|
|
blocks or not.
|
|
|
|
A simple example of semaphore usage is in <tt>kernel/sys.c</tt>, implementation of
|
|
<bf>gethostname(2)/sethostname(2)</bf> system calls.
|
|
|
|
<tscreen><code>
|
|
asmlinkage long sys_sethostname(char *name, int len)
|
|
{
|
|
int errno;
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
|
return -EPERM;
|
|
if (len < 0 || len > __NEW_UTS_LEN)
|
|
return -EINVAL;
|
|
down_write(&uts_sem);
|
|
errno = -EFAULT;
|
|
if (!copy_from_user(system_utsname.nodename, name, len)) {
|
|
system_utsname.nodename[len] = 0;
|
|
errno = 0;
|
|
}
|
|
up_write(&uts_sem);
|
|
return errno;
|
|
}
|
|
|
|
asmlinkage long sys_gethostname(char *name, int len)
|
|
{
|
|
int i, errno;
|
|
|
|
if (len < 0)
|
|
return -EINVAL;
|
|
down_read(&uts_sem);
|
|
i = 1 + strlen(system_utsname.nodename);
|
|
if (i > len)
|
|
i = len;
|
|
errno = 0;
|
|
if (copy_to_user(name, system_utsname.nodename, i))
|
|
errno = -EFAULT;
|
|
up_read(&uts_sem);
|
|
return errno;
|
|
}
|
|
</code></tscreen>
|
|
|
|
The points to note about this example are:
|
|
|
|
<enum>
|
|
<item> The functions may block while copying data from/to userspace in
|
|
<tt>copy_from_user()/copy_to_user()</tt>. Therefore they could not use any form
|
|
of spinlock here.
|
|
|
|
<item> The semaphore type chosen is read-write as opposed to basic because
|
|
there may be lots of concurrent <bf>gethostname(2)</bf> requests which need not
|
|
be mutually exclusive.
|
|
</enum>
|
|
|
|
Although Linux implementation of semaphores and read-write semaphores is
|
|
very sophisticated, there are possible scenarios one can think of which are
|
|
not yet implemented, for example there is no concept of interruptible
|
|
read-write semaphores. This is obviously because there are no real-world
|
|
situations which require these exotic flavours of the primitives.
|
|
|
|
<sect1>Kernel Support for Loading Modules<p>
|
|
|
|
Linux is a monolithic operating system and despite all the modern hype about
|
|
some "advantages" offered by operating systems based on micro-kernel design,
|
|
the truth remains (quoting Linus Torvalds himself):
|
|
|
|
<tscreen>
|
|
... message passing as the fundamental operation of the OS is just an
|
|
exercise in computer science masturbation. It may feel good, but you
|
|
don't actually get anything DONE.
|
|
</tscreen>
|
|
|
|
Therefore, Linux is and will always be based on a monolithic design, which
|
|
means that all subsystems run in the same privileged mode and share the same
|
|
address space; communication between them is achieved by the usual C function
|
|
call means.
|
|
|
|
However, although separating kernel functionality into separate "processes"
|
|
as is done in micro-kernels is definitely a bad idea, separating it into
|
|
dynamically loadable on demand kernel modules is desirable in some
|
|
circumstances (e.g. on machines with low memory or for installation kernels
|
|
which could otherwise contain ISA auto-probing device drivers that are
|
|
mutually exclusive). The decision whether to include support for loadable
|
|
modules is made at compile time and is determined by the <tt>CONFIG_MODULES</tt>
|
|
option. Support for module autoloading via <tt>request_module()</tt> mechanism is
|
|
a separate compilation option (<tt>CONFIG_KMOD</tt>).
|
|
|
|
The following functionality can be implemented as loadable modules under
|
|
Linux:
|
|
|
|
<enum>
|
|
<item> Character and block device drivers, including misc device drivers.
|
|
|
|
<item> Terminal line disciplines.
|
|
|
|
<item> Virtual (regular) files in <tt>/proc</tt> and in devfs (e.g. <tt>/dev/cpu/microcode</tt>
|
|
vs <tt>/dev/misc/microcode</tt>).
|
|
|
|
<item> Binary file formats (e.g. ELF, aout, etc).
|
|
|
|
<item> Execution domains (e.g. Linux, UnixWare7, Solaris, etc).
|
|
|
|
<item> Filesystems.
|
|
|
|
<item> System V IPC.
|
|
</enum>
|
|
|
|
There are a few things that cannot be implemented as modules under Linux
|
|
(probably because it makes no sense for them to be modularised):
|
|
|
|
<enum>
|
|
<item> Scheduling algorithms.
|
|
|
|
<item> VM policies.
|
|
|
|
<item> Buffer cache, page cache and other caches.
|
|
</enum>
|
|
|
|
Linux provides several system calls to assist in loading modules:
|
|
|
|
<enum>
|
|
<item><tt>caddr_t create_module(const char *name, size_t size)</tt>: allocates
|
|
<tt>size</tt> bytes using <tt>vmalloc()</tt> and maps a module structure at the
|
|
beginning thereof. This new module is then linked into the list headed
|
|
by module_list. Only a process with <tt>CAP_SYS_MODULE</tt> can invoke this
|
|
system call, others will get <tt>EPERM</tt> returned.
|
|
|
|
<item><tt>long init_module(const char *name, struct module *image)</tt>: loads the
|
|
relocated module image and causes the module's initialisation routine
|
|
to be invoked. Only a process with <tt>CAP_SYS_MODULE</tt> can invoke this
|
|
system call, others will get <tt>EPERM</tt> returned.
|
|
|
|
<item><tt>long delete_module(const char *name)</tt>: attempts to unload the module.
|
|
If <tt>name == NULL</tt>, an attempt is made to unload all unused modules.
|
|
|
|
<item><tt>long query_module(const char *name, int which, void *buf,
|
|
size_t bufsize, size_t *ret)</tt>: returns information about a module
|
|
(or about all modules).
|
|
</enum>
|
|
|
|
The command interface available to users consists of:
|
|
|
|
<itemize>
|
|
<item><bf>insmod</bf>: insert a single module.
|
|
|
|
<item><bf>modprobe</bf>: insert a module including all other modules it depends
|
|
on.
|
|
|
|
<item><bf>rmmod</bf>: remove a module.
|
|
|
|
<item><bf>modinfo</bf>: print some information about a module, e.g. author,
|
|
description, parameters the module accepts, etc.
|
|
</itemize>
|
|
|
|
Apart from being able to load a module manually using either <bf>insmod</bf> or <bf>modprobe</bf>,
|
|
it is also possible to have the module inserted automatically by the kernel
|
|
when a particular functionality is required. The kernel interface for this
|
|
is the function called <tt>request_module(name)</tt> which is exported to modules,
|
|
so that modules can load other modules as well. The <tt>request_module(name)</tt>
|
|
internally creates a kernel thread which execs the userspace command
|
|
<bf>modprobe -s -k module_name</bf>, using the standard <tt>exec_usermodehelper()</tt> kernel
|
|
interface (which is also exported to modules). The function returns 0 on
success; however, it is usually not worth checking the return code from
|
|
<tt>request_module()</tt>. Instead, the programming idiom is:
|
|
|
|
<tscreen><code>
|
|
if (check_some_feature() == NULL)
|
|
request_module(module);
|
|
if (check_some_feature() == NULL)
|
|
return -ENODEV;
|
|
</code></tscreen>
|
|
|
|
For example, this is done by <tt>fs/block_dev.c:get_blkfops()</tt> to load a module
|
|
<tt>block-major-N</tt> when an attempt is made to open a block device with major <tt>N</tt>.
|
|
Obviously, there is no such module called <tt>block-major-N</tt> (Linux developers
|
|
only chose sensible names for their modules) but it is mapped to a proper
|
|
module name using the file <tt>/etc/modules.conf</tt>. However, for most well-known
|
|
major numbers (and other kinds of modules) the <bf>modprobe/insmod</bf> commands
|
|
know which real module to load without needing an explicit alias statement
|
|
in <tt>/etc/modules.conf</tt>.
|
|
|
|
A good example of loading a module is inside the <bf>mount(2)</bf> system call. The
|
|
<bf>mount(2)</bf> system call accepts the filesystem type as a string which
|
|
<tt>fs/super.c:do_mount()</tt> then passes on to <tt>fs/super.c:get_fs_type()</tt>:
|
|
|
|
<tscreen><code>
|
|
static struct file_system_type *get_fs_type(const char *name)
|
|
{
|
|
struct file_system_type *fs;
|
|
|
|
read_lock(&file_systems_lock);
|
|
fs = *(find_filesystem(name));
|
|
if (fs && !try_inc_mod_count(fs->owner))
|
|
fs = NULL;
|
|
read_unlock(&file_systems_lock);
|
|
if (!fs && (request_module(name) == 0)) {
|
|
read_lock(&file_systems_lock);
|
|
fs = *(find_filesystem(name));
|
|
if (fs && !try_inc_mod_count(fs->owner))
|
|
fs = NULL;
|
|
read_unlock(&file_systems_lock);
|
|
}
|
|
return fs;
|
|
}
|
|
</code></tscreen>
|
|
|
|
A few things to note in this function:
|
|
|
|
<enum>
|
|
<item> First we attempt to find the filesystem with the given name amongst
|
|
those already registered. This is done under protection of
|
|
<tt>file_systems_lock</tt> taken for read (as we are not modifying the list
|
|
of registered filesystems).
|
|
|
|
<item> If such a filesystem is found then we attempt to get a new reference
|
|
to it by trying to increment its module's hold count. This always
|
|
returns 1 for statically linked filesystems or for modules not
|
|
presently being deleted. If <tt>try_inc_mod_count()</tt> returned 0 then
|
|
we consider it a failure - i.e. if the module is there but is being
|
|
deleted, it is as good as if it were not there at all.
|
|
|
|
<item> We drop the <tt>file_systems_lock</tt> because what we are about to do next
|
|
(<tt>request_module()</tt>) is a blocking operation, and therefore we can't
|
|
hold a spinlock over it. Actually, in this specific case, we would
|
|
have to drop <tt>file_systems_lock</tt> anyway, even if <tt>request_module()</tt> were
|
|
guaranteed to be non-blocking and the module loading were executed
|
|
in the same context atomically. The reason for this is that the module's
|
|
initialisation function will try to call <tt>register_filesystem()</tt>, which will
|
|
take the same <tt>file_systems_lock</tt> read-write spinlock for write.
|
|
|
|
<item> If the attempt to load was successful, then we take the
|
|
<tt>file_systems_lock</tt> spinlock and try to locate the newly registered
|
|
filesystem in the list. Note that this is slightly wrong because
|
|
it is in principle possible for a bug in the <bf>modprobe</bf> command to cause
|
|
it to coredump after it successfully loaded the requested module, in
|
|
which case <tt>request_module()</tt> will fail even though the new filesystem will be
|
|
registered, and yet <tt>get_fs_type()</tt> won't find it.
|
|
|
|
<item> If the filesystem is found and we are able to get a reference to it,
|
|
we return it. Otherwise we return NULL.
|
|
</enum>
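
The lookup, load and retry sequence described above can be sketched in
userspace. This is only an illustration under stated assumptions: the
<tt>struct fstype</tt> type, <tt>find_fs()</tt>, <tt>try_get()</tt> and
<tt>fake_request_module()</tt> are hypothetical stand-ins for the kernel
structures and for <tt>request_module()</tt>, and all locking is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much-reduced stand-in for struct file_system_type. */
struct fstype {
    const char *name;
    int alive;               /* 0 models "owning module is being deleted" */
    int refs;                /* models the module hold count */
    struct fstype *next;
};

static struct fstype *fs_list;                         /* registered types */
static struct fstype ondemand = { "demofs", 1, 0, NULL };

/* Walk the list; return the entry with the given name or NULL. */
static struct fstype *find_fs(const char *name)
{
    struct fstype *p;

    assert(name != NULL);
    for (p = fs_list; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Models try_inc_mod_count(): fails if the owner is going away. */
static int try_get(struct fstype *fs)
{
    if (!fs->alive)
        return 0;
    fs->refs++;
    return 1;
}

/* Stand-in for request_module(): "loads" demofs, fails for anything else. */
static int fake_request_module(const char *name)
{
    if (strcmp(name, ondemand.name) != 0)
        return -1;
    if (find_fs(name) == NULL) {
        ondemand.next = fs_list;
        fs_list = &ondemand;
    }
    return 0;
}

/* Same shape as get_fs_type(): look up, try to load, look up again. */
static struct fstype *get_fs(const char *name)
{
    struct fstype *fs = find_fs(name);

    if (fs && !try_get(fs))
        fs = NULL;
    if (!fs && fake_request_module(name) == 0) {
        fs = find_fs(name);
        if (fs && !try_get(fs))
            fs = NULL;
    }
    return fs;
}
```

Calling <tt>get_fs("demofs")</tt> twice returns the same entry with its
reference count raised each time, while an unknown name yields NULL after the
failed load attempt.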
|
|
|
|
When a module is loaded into the kernel, it can refer to any symbols that
|
|
are exported as public by the kernel using the <tt>EXPORT_SYMBOL()</tt> macro or by
|
|
other currently loaded modules. If the module uses symbols from another
|
|
module, it is marked as depending on that module during dependency
|
|
recalculation, achieved by running the <bf>depmod -a</bf> command on boot (e.g. after
|
|
installing a new kernel).
|
|
|
|
Usually, one must match the set of modules with the version of the kernel
|
|
interfaces they use, which under Linux simply means the "kernel version" as
|
|
there is no special kernel interface versioning mechanism in general.
|
|
However, there is a limited functionality called "module versioning" or
|
|
<tt>CONFIG_MODVERSIONS</tt> which allows one to avoid recompiling modules when switching
|
|
to a new kernel. What happens here is that the kernel symbol table is treated
|
|
differently for internal access and for access from modules. The elements of
the public (i.e. exported) part of the symbol table are built by computing a
32-bit checksum of the C declaration. So, in order to resolve a symbol used by a
|
|
module during loading, the loader must match the full representation of the
|
|
symbol that includes the checksum; it will refuse to load the module if these
|
|
symbols differ. This
|
|
only happens when both the kernel and the module are compiled with module
|
|
versioning enabled. If either one of them uses the original symbol names,
|
|
the loader simply tries to match the kernel version declared by the module
|
|
and the one exported by the kernel and refuses to load if they differ.
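
To make the idea concrete, here is a small userspace sketch of matching
checksummed symbol names. It rests on assumptions: the checksum function is a
toy one (not the algorithm the real tools use), and the
<tt>name_R&lt;checksum&gt;</tt> spelling of the versioned name is assumed here for
illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy 32-bit checksum of a declaration string (NOT the real algorithm). */
static unsigned int decl_checksum(const char *decl)
{
    unsigned int sum = 0;

    while (*decl)
        sum = sum * 31u + (unsigned char)*decl++;
    return sum & 0xffffffffu;
}

/* Build "name_R<8 hex digits>" in out; 0 on success, -1 if out is too small. */
static int versioned_name(char *out, size_t len,
                          const char *name, const char *decl)
{
    int n;

    assert(out != NULL && name != NULL && decl != NULL);
    n = snprintf(out, len, "%s_R%08x", name, decl_checksum(decl));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* With versioning on, resolving a symbol is an exact string comparison. */
static int symbol_matches(const char *exported, const char *wanted)
{
    return strcmp(exported, wanted) == 0;
}
```

If the kernel exports <tt>do_thing</tt> built from one declaration and a module
was compiled against a different declaration of the same function, the two
versioned names differ and the comparison fails, which is the point at which
the loader would refuse the module.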
|
|
|
|
<sect>Virtual Filesystem (VFS)<p>
|
|
|
|
<sect1>Inode Caches and Interaction with Dcache<p>
|
|
|
|
In order to support multiple filesystems, Linux contains a special kernel
|
|
interface level called VFS (Virtual Filesystem Switch). This is similar
|
|
to the vnode/vfs interface found in SVR4 derivatives (originally it came from
|
|
BSD and Sun original implementations).
|
|
|
|
The Linux inode cache is implemented in a single file, <tt>fs/inode.c</tt>, which consists
|
|
of 977 lines of code. It is interesting to note that not many changes have been
|
|
made to it for the last 5-7 years: one can still recognise some of the code
|
|
comparing the latest version with, say, 1.3.42.
|
|
|
|
The structure of the Linux inode cache is as follows:
|
|
|
|
<enum>
|
|
<item> A global hashtable, <tt>inode_hashtable</tt>, where each inode is hashed by the
|
|
value of the superblock pointer and 32bit inode number. Inodes without a
|
|
superblock (<tt>inode->i_sb == NULL</tt>) are added to a doubly linked list
|
|
headed by <tt>anon_hash_chain</tt> instead. Examples of anonymous inodes
|
|
are sockets created by <tt>net/socket.c:sock_alloc()</tt>, by calling
|
|
<tt>fs/inode.c:get_empty_inode()</tt>.
|
|
|
|
<item> A global type in_use list (<tt>inode_in_use</tt>), which contains valid inodes
|
|
with <tt>i_count>0</tt> and <tt>i_nlink>0</tt>. Inodes newly allocated by
|
|
<tt>get_empty_inode()</tt> and <tt>get_new_inode()</tt> are added to the <tt>inode_in_use</tt> list.
|
|
|
|
<item> A global type unused list (<tt>inode_unused</tt>), which contains valid inodes
|
|
with <tt>i_count = 0</tt>.
|
|
|
|
<item> A per-superblock type dirty list (<tt>sb->s_dirty</tt>) which contains valid
|
|
inodes with <tt>i_count>0</tt>, <tt>i_nlink>0</tt> and <tt>i_state & I_DIRTY</tt>.
|
|
When an inode is marked
dirty, it is added to the <tt>sb->s_dirty</tt> list if it is also hashed.
Maintaining a per-superblock dirty list of inodes allows inodes to be
synced quickly.
|
|
|
|
<item> Inode cache proper - a SLAB cache called <tt>inode_cachep</tt>. As inode
|
|
objects are allocated and freed, they are taken from and returned to
|
|
this SLAB cache.
|
|
</enum>
|
|
|
|
The type lists are anchored from <tt>inode->i_list</tt>, the hashtable from
|
|
<tt>inode->i_hash</tt>. Each inode can be on a hashtable and one and only one type
|
|
(in_use, unused or dirty) list.
|
|
|
|
All these lists are protected by a single spinlock: <tt>inode_lock</tt>.
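
As an illustration of hashing by superblock pointer and inode number, the
following sketch mixes the two into a chain index. The table size, the shift
amount and the name <tt>inode_hash()</tt> are assumptions chosen for the
example; they are not taken from <tt>fs/inode.c</tt>.

```c
#include <assert.h>
#include <stdint.h>

#define HASH_BITS 8                          /* illustrative: 256 chains */
#define HASH_MASK ((1ul << HASH_BITS) - 1)

/* Mix a superblock address and an inode number into a chain index. */
static unsigned long inode_hash(const void *sb, unsigned long ino)
{
    /* Drop the low address bits, which carry little information
     * because structures are aligned in memory. */
    unsigned long tmp = ino + (unsigned long)((uintptr_t)sb >> 5);

    /* Fold higher bits down so that they influence the index too. */
    tmp += tmp >> HASH_BITS;
    return tmp & HASH_MASK;
}
```

Two inodes with the same number on different superblocks normally land on
different chains, which is why the superblock pointer takes part in the hash.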
|
|
|
|
The inode cache subsystem is initialised when the <tt>inode_init()</tt> function is called from
|
|
<tt>init/main.c:start_kernel()</tt>. The function is marked as <tt>__init</tt>, which means
|
|
its code is thrown away later on. It is passed a single argument - the
|
|
number of physical pages on the system. This is so that the inode cache can
|
|
configure itself depending on how much memory is available, i.e. create
|
|
a larger hashtable if there is enough memory.
|
|
|
|
The only statistic kept about the inode cache is the number of unused inodes,
|
|
stored in <tt>inodes_stat.nr_unused</tt> and accessible to user programs via files
|
|
<tt>/proc/sys/fs/inode-nr</tt> and <tt>/proc/sys/fs/inode-state</tt>.
|
|
|
|
We can examine one of the lists from <bf>gdb</bf> running on a live kernel thus:
|
|
|
|
<tscreen><code>
|
|
(gdb) printf "%d\n", (unsigned long)(&((struct inode *)0)->i_list)
|
|
8
|
|
(gdb) p inode_unused
|
|
$34 = 0xdfa992a8
|
|
(gdb) p (struct list_head)inode_unused
|
|
$35 = {next = 0xdfa992a8, prev = 0xdfcdd5a8}
|
|
(gdb) p ((struct list_head)inode_unused).prev
|
|
$36 = (struct list_head *) 0xdfcdd5a8
|
|
(gdb) p (((struct list_head)inode_unused).prev)->prev
|
|
$37 = (struct list_head *) 0xdfb5a2e8
|
|
(gdb) set $i = (struct inode *)0xdfb5a2e0
|
|
(gdb) p $i->i_ino
|
|
$38 = 0x3bec7
|
|
(gdb) p $i->i_count
|
|
$39 = {counter = 0x0}
|
|
</code></tscreen>
|
|
|
|
Note that we subtracted 8 from the address 0xdfb5a2e8 to obtain the address of
|
|
the <tt>struct inode</tt> (0xdfb5a2e0) according to the definition of <tt>list_entry()</tt>
|
|
macro from <tt>include/linux/list.h</tt>.
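
The pointer arithmetic behind this can be reproduced in userspace. The sketch
below defines its own <tt>list_entry()</tt> in terms of the standard
<tt>offsetof()</tt>; <tt>struct mini_inode</tt> is a hypothetical miniature
that only mimics the position of <tt>i_list</tt> as the second list head in
the structure.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded list_head. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature inode: only the layout matters here. */
struct mini_inode {
    struct list_head i_hash;
    struct list_head i_list;
    unsigned long i_ino;
};

/* Given a node taken from a type list, return the inode that contains it. */
static struct mini_inode *inode_of(struct list_head *node)
{
    assert(node != NULL);
    return list_entry(node, struct mini_inode, i_list);
}
```

On a 32-bit machine one <tt>struct list_head</tt> occupies 8 bytes, so
<tt>i_list</tt> sits at offset 8, which matches the value printed by
<bf>gdb</bf> above.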
|
|
|
|
To understand how the inode cache works, let us trace the lifetime of an inode
of a regular file on an ext2 filesystem as it is opened and closed:
|
|
|
|
<tscreen><code>
|
|
fd = open("file", O_RDONLY);
|
|
close(fd);
|
|
</code></tscreen>
|
|
|
|
The <bf>open(2)</bf> system call is implemented in the <tt>fs/open.c:sys_open()</tt> function and
|
|
the real work is done by <tt>fs/open.c:filp_open()</tt> function, which is split into
|
|
two parts:
|
|
|
|
<enum>
|
|
<item> <tt>open_namei()</tt>: fills in the nameidata structure containing the dentry
|
|
and vfsmount structures.
|
|
|
|
<item> <tt>dentry_open()</tt>: given a dentry and vfsmount, this function allocates a new
|
|
<tt>struct file</tt> and links them together; it also invokes the filesystem
|
|
specific <tt>f_op->open()</tt> method which was set in <tt>inode->i_fop</tt> when inode
|
|
was read in <tt>open_namei()</tt> (which provided inode via <tt>dentry->d_inode</tt>).
|
|
</enum>
|
|
|
|
The <tt>open_namei()</tt> function interacts with dentry cache via <tt>path_walk()</tt>, which
|
|
in turn calls <tt>real_lookup()</tt>, which invokes the filesystem specific <tt>inode_operations->lookup()</tt> method.
|
|
The role of this method is to find the entry in the parent
|
|
directory with the matching name and then do <tt>iget(sb, ino)</tt> to get the
|
|
corresponding inode - which brings us to the inode cache. When the inode is
|
|
read in, the dentry is instantiated by means of <tt>d_add(dentry, inode)</tt>. While
|
|
we are at it, note that for UNIX-style filesystems which have the concept of
|
|
on-disk inode number, it is the lookup method's job to map its endianness
|
|
to current CPU format, e.g. if the inode number in raw (fs-specific) dir
|
|
entry is in little-endian 32 bit format one could do:
|
|
|
|
<tscreen><code>
|
|
unsigned long ino = le32_to_cpu(de->inode);
|
|
inode = iget(sb, ino);
|
|
d_add(dentry, inode);
|
|
</code></tscreen>
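
For readers who want to see what such a conversion amounts to, here is a
portable userspace equivalent. The function name <tt>le32_decode()</tt> is
ours; it assembles the value byte by byte, so it gives the same result on
little-endian and big-endian hosts.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Decode a 32-bit little-endian value from raw bytes,
 * independently of the byte order of the host CPU. */
static uint32_t le32_decode(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}
```

For example, the on-disk bytes <tt>c7 be 03 00</tt> decode to the inode number
0x3bec7.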
|
|
|
|
So, when we open a file we hit <tt>iget(sb, ino)</tt> which is really
|
|
<tt>iget4(sb, ino, NULL, NULL)</tt>, which does:
|
|
|
|
<enum>
|
|
<item> Attempt to find an inode with matching superblock and inode number
|
|
in the hashtable under protection of <tt>inode_lock</tt>. If inode is found,
|
|
its reference count (<tt>i_count</tt>) is incremented; if it
|
|
was 0 prior to incrementation and the inode is not dirty, it is removed from whatever
|
|
type list (<tt>inode->i_list</tt>) it is currently on (it has to be
|
|
<tt>inode_unused</tt> list, of course) and inserted into
|
|
<tt>inode_in_use</tt> type list; finally, <tt>inodes_stat.nr_unused</tt> is decremented.
|
|
|
|
<item> If inode is currently locked, we wait until it is unlocked so that
|
|
<tt>iget4()</tt> is guaranteed to return an unlocked inode.
|
|
|
|
<item> If inode was not found in the hashtable then it is the first time we
|
|
encounter this inode, so we call <tt>get_new_inode()</tt>, passing it the pointer
|
|
to the place in the hashtable where it should be inserted.
|
|
|
|
<item> <tt>get_new_inode()</tt> allocates a new inode from the <tt>inode_cachep</tt> SLAB
|
|
cache but this operation can block (<tt>GFP_KERNEL</tt> allocation), so it
|
|
must drop the <tt>inode_lock</tt> spinlock which guards the hashtable. Since it
|
|
has dropped the spinlock, it must retry searching the inode in the
|
|
hashtable afterwards; if it is found this time, it returns (after incrementing
|
|
the reference by <tt>__iget</tt>) the one found in the hashtable and destroys
|
|
the newly allocated one. If it is still not found in the hashtable,
|
|
then the new inode we have just allocated is the one to be used;
|
|
therefore it is initialised to the required values and the fs-specific
|
|
<tt>sb->s_op->read_inode()</tt> method is invoked to populate the rest of the
|
|
inode. This brings us from inode cache back to the filesystem code -
|
|
remember that we came to the inode cache when filesystem-specific
|
|
<tt>lookup()</tt> method invoked <tt>iget()</tt>. While the <tt>s_op->read_inode()</tt> method
|
|
is reading the inode from disk, the inode is locked (<tt>i_state = I_LOCK</tt>);
|
|
it is unlocked after the <tt>read_inode()</tt> method returns and all the waiters for it are
|
|
woken up.
|
|
</enum>
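
The steps above can be sketched as a single-threaded userspace model. All
names (<tt>mini_iget()</tt>, <tt>grab()</tt>, <tt>struct mini_inode</tt>) are
hypothetical, the type lists are reduced to a flag and a counter, and the
places where the kernel would drop and retake <tt>inode_lock</tt> are marked
only by comments.

```c
#include <assert.h>
#include <stdlib.h>

#define NBUCKETS 16

struct mini_inode {
    unsigned long ino;
    int count;                  /* models i_count */
    int on_unused;              /* 1 if on the "unused" list */
    struct mini_inode *hnext;   /* hash chain linkage */
};

static struct mini_inode *buckets[NBUCKETS];
static int nr_unused;           /* models inodes_stat.nr_unused */

static struct mini_inode *find_inode(unsigned long ino)
{
    struct mini_inode *p;

    for (p = buckets[ino % NBUCKETS]; p != NULL; p = p->hnext)
        if (p->ino == ino)
            return p;
    return NULL;
}

/* Take a reference; a 0 -> 1 transition moves the inode off "unused". */
static void grab(struct mini_inode *inode)
{
    if (inode->count++ == 0 && inode->on_unused) {
        inode->on_unused = 0;
        nr_unused--;
    }
}

static struct mini_inode *mini_iget(unsigned long ino)
{
    struct mini_inode *inode = find_inode(ino), *fresh;

    if (inode) {
        grab(inode);
        return inode;
    }
    /* The kernel drops inode_lock here because allocation may sleep. */
    fresh = calloc(1, sizeof(*fresh));
    if (fresh == NULL)
        return NULL;
    /* Lock retaken: another task may have inserted the inode meanwhile. */
    inode = find_inode(ino);
    if (inode) {
        free(fresh);
        grab(inode);
        return inode;
    }
    fresh->ino = ino;
    fresh->count = 1;
    fresh->hnext = buckets[ino % NBUCKETS];
    buckets[ino % NBUCKETS] = fresh;
    /* A real filesystem would now fill in the rest via read_inode(). */
    return fresh;
}
```

In this single-threaded model the second lookup can never succeed; it is kept
to show where the retry belongs once the lock has been dropped.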
|
|
|
|
Now, let's see what happens when we close this file descriptor. The <bf>close(2)</bf>
|
|
system call is implemented in <tt>fs/open.c:sys_close()</tt> function, which calls
|
|
<tt>do_close(fd, 1)</tt> which rips (replaces with NULL) the descriptor out of the
|
|
process' file descriptor table and invokes the <tt>filp_close()</tt> function which does
|
|
most of the work. The interesting things happen in <tt>fput()</tt>, which checks if
|
|
this was the last reference to the file, and if so calls
|
|
<tt>fs/file_table.c:_fput()</tt> which calls <tt>__fput()</tt> which is where interaction with
|
|
dcache (and therefore with inode cache - remember dcache is a Master of inode
|
|
cache!) happens. The <tt>fs/dcache.c:dput()</tt> does <tt>dentry_iput()</tt> which brings us
|
|
back to inode cache via <tt>iput(inode)</tt> so let us understand
|
|
<tt>fs/inode.c:iput(inode)</tt>:
|
|
|
|
<enum>
|
|
<item> If parameter passed to us is NULL, we do absolutely nothing and return.
|
|
|
|
<item> If there is an fs-specific <tt>sb->s_op->put_inode()</tt> method, it is invoked
|
|
immediately with no spinlocks held (so it can block).
|
|
|
|
<item> <tt>inode_lock</tt> spinlock is taken and <tt>i_count</tt> is decremented. If this was
|
|
NOT the last reference to this inode then we simply check whether
there are so many references to it that <tt>i_count</tt> could wrap around
the 32 bits allocated to it; if so, we print a warning and return.
|
|
Note that we call <tt>printk()</tt> while holding the <tt>inode_lock</tt> spinlock -
|
|
this is fine because <tt>printk()</tt> can never block, therefore it may be called in
|
|
absolutely any context (even from interrupt handlers!).
|
|
|
|
<item> If this was the last active reference then some work needs to be done.
|
|
</enum>
|
|
|
|
The work performed by <tt>iput()</tt> on the last inode reference is rather complex
|
|
so we separate it into a list of its own:
|
|
|
|
<enum>
|
|
<item> If <tt>i_nlink == 0</tt> (e.g. the file was unlinked while we held it open)
|
|
then the inode is removed from hashtable and from its type list; if
|
|
there are any data pages held in page cache for this inode, they are
|
|
removed by means of <tt>truncate_all_inode_pages(&inode->i_data)</tt>. Then
|
|
the filesystem-specific <tt>s_op->delete_inode()</tt> method is invoked,
|
|
which typically deletes the on-disk copy of the inode. If there is no
|
|
<tt>s_op->delete_inode()</tt> method registered by the filesystem (e.g. ramfs)
|
|
then we call <tt>clear_inode(inode)</tt>, which invokes <tt>s_op->clear_inode()</tt> if
|
|
registered and if inode corresponds to a block device, this device's
|
|
reference count is dropped by <tt>bdput(inode->i_bdev)</tt>.
|
|
|
|
<item> If <tt>i_nlink != 0</tt> then we check if there are other inodes in the same
|
|
hash bucket and if there is none, then if inode is not dirty we delete
|
|
it from its type list and add it to <tt>inode_unused</tt> list, incrementing
|
|
<tt>inodes_stat.nr_unused</tt>. If there are inodes in the same hashbucket
|
|
then we delete it from the type list and add to <tt>inode_unused</tt> list.
|
|
If this was an anonymous inode (NetApp .snapshot) then we delete it
|
|
from the type list and clear/destroy it completely.
|
|
</enum>
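
A deliberately simplified decision table for the last-reference path might
look as follows. This is a sketch under assumptions: dirty state, the hash
bucket check and the actual list manipulation are left out, and the names
(<tt>mini_iput()</tt>, <tt>enum iput_action</tt>) are hypothetical. It only
shows which branch is taken, not the work done in it.

```c
#include <assert.h>
#include <stddef.h>

enum iput_action {
    IPUT_NOTHING,       /* NULL inode, or other references remain */
    IPUT_DELETE,        /* last reference and no links left */
    IPUT_DESTROY,       /* last reference to an unhashed (anonymous) inode */
    IPUT_KEEP_UNUSED    /* last reference, still linked: keep it cached */
};

struct mini_inode {
    int count;          /* models i_count */
    int nlink;          /* models i_nlink */
    int hashed;         /* non-zero if present in the hashtable */
};

/* Drop one reference and report what the last-reference path would do. */
static enum iput_action mini_iput(struct mini_inode *inode)
{
    if (inode == NULL)
        return IPUT_NOTHING;
    assert(inode->count > 0);
    if (--inode->count > 0)
        return IPUT_NOTHING;
    if (inode->nlink == 0)
        return IPUT_DELETE;
    if (!inode->hashed)
        return IPUT_DESTROY;
    return IPUT_KEEP_UNUSED;
}
```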
|
|
|
|
|
|
<sect1>Filesystem Registration/Unregistration<p>
|
|
|
|
The Linux kernel provides a mechanism for new filesystems to be written with
|
|
minimum effort. The historical reasons for this are:
|
|
|
|
<enum>
|
|
<item> In the world where people still use non-Linux operating systems
|
|
to protect their investment in legacy software, Linux had to provide
|
|
interoperability by supporting a great multitude of different
|
|
filesystems - most of which would not deserve to exist on their own
|
|
but only for compatibility with existing non-Linux operating systems.
|
|
|
|
<item> The interface for filesystem writers had to be very simple so that
|
|
people could try to reverse engineer existing proprietary filesystems
|
|
by writing read-only versions of them. Therefore Linux VFS makes it
|
|
very easy to implement read-only filesystems; 95% of the work is
|
|
to finish them by adding full write-support. As a concrete example,
|
|
I wrote a read-only BFS filesystem for Linux in about 10 hours, but it
|
|
took several weeks to complete it to have full write support (and
|
|
even today some purists claim that it is not complete because "it
|
|
doesn't have compactification support").
|
|
|
|
<item> The VFS interface is exported, and therefore all Linux filesystems can
|
|
be implemented as modules.
|
|
|
|
</enum>
|
|
|
|
Let us consider the steps required to implement a filesystem under Linux.
|
|
The code to implement a filesystem can be either a dynamically loadable
|
|
module or statically linked into the kernel, and the way it is done under
|
|
Linux is very transparent. All that is needed is to fill in a
|
|
<tt>struct file_system_type</tt> structure and register it with the VFS using
|
|
the <tt>register_filesystem()</tt> function as in the following example from
|
|
<tt>fs/bfs/inode.c</tt>:
|
|
|
|
<tscreen><code>
|
|
#include <linux/module.h>
|
|
#include <linux/init.h>
|
|
|
|
static struct super_block *bfs_read_super(struct super_block *, void *, int);
|
|
|
|
static DECLARE_FSTYPE_DEV(bfs_fs_type, "bfs", bfs_read_super);
|
|
|
|
static int __init init_bfs_fs(void)
|
|
{
|
|
return register_filesystem(&bfs_fs_type);
|
|
}
|
|
|
|
static void __exit exit_bfs_fs(void)
|
|
{
|
|
unregister_filesystem(&bfs_fs_type);
|
|
}
|
|
|
|
module_init(init_bfs_fs)
|
|
module_exit(exit_bfs_fs)
|
|
</code></tscreen>
|
|
|
|
The <tt>module_init()/module_exit()</tt> macros ensure that, when BFS is compiled as a
|
|
module, the functions <tt>init_bfs_fs()</tt> and <tt>exit_bfs_fs()</tt> turn into <tt>init_module()</tt>
|
|
and <tt>cleanup_module()</tt> respectively; if BFS is statically linked into the kernel,
|
|
the <tt>exit_bfs_fs()</tt> code vanishes as it is unnecessary.
|
|
|
|
The <tt>struct file_system_type</tt> is declared in <tt>include/linux/fs.h</tt>:
|
|
|
|
<tscreen><code>
|
|
struct file_system_type {
|
|
const char *name;
|
|
int fs_flags;
|
|
struct super_block *(*read_super) (struct super_block *, void *, int);
|
|
struct module *owner;
|
|
struct vfsmount *kern_mnt; /* For kernel mount, if it's FS_SINGLE fs */
|
|
struct file_system_type * next;
|
|
};
|
|
</code></tscreen>
|
|
|
|
The fields thereof are explained thus:
|
|
|
|
<itemize>
|
|
|
|
<item><bf>name</bf>: human readable name, appears in <tt>/proc/filesystems</tt> file
|
|
and is used as a key to find a filesystem by its name; this same name is
|
|
used for the filesystem type in <bf>mount(2)</bf>, and it should be unique: there
|
|
can (obviously) be only one filesystem with a given name. For modules,
|
|
name points into the module's address space and is not copied: this means <bf>cat
/proc/filesystems</bf> can oops if the module was unloaded but the filesystem is
still registered.
|
|
|
|
<item><bf>fs_flags</bf>: one or more (ORed) of the flags: <tt>FS_REQUIRES_DEV</tt>
|
|
for filesystems that can only be mounted on a block device, <tt>FS_SINGLE</tt>
|
|
for filesystems that can have only one superblock, <tt>FS_NOMOUNT</tt> for
|
|
filesystems that cannot be mounted from userspace by means of <bf>mount(2)</bf>
|
|
system call: they can however be mounted internally using <tt>kern_mount()</tt>
|
|
interface, e.g. pipefs.
|
|
|
|
<item><bf>read_super</bf>: a pointer to the function that reads the super
|
|
block during mount operation. This function is required: if it is not
|
|
provided, mount operation (whether from userspace or in-kernel) will
|
|
always fail except in <tt>FS_SINGLE</tt> case where it will Oops in
|
|
<tt>get_sb_single()</tt>, trying to dereference a NULL pointer in
|
|
<tt>fs_type->kern_mnt->mnt_sb</tt> with (<tt>fs_type->kern_mnt = NULL</tt>).
|
|
|
|
<item><bf>owner</bf>: pointer to the module that implements this filesystem.
|
|
If the filesystem is statically linked into the kernel then this is
|
|
NULL. You don't need to set this manually as the macro <tt>THIS_MODULE</tt>
|
|
does the right thing automatically.
|
|
|
|
<item><bf>kern_mnt</bf>: for <tt>FS_SINGLE</tt> filesystems only. This is set by
|
|
<tt>kern_mount()</tt> (TODO: <tt>kern_mount()</tt> should refuse to mount filesystems
|
|
if <tt>FS_SINGLE</tt> is not set).
|
|
|
|
<item><bf>next</bf>: linkage into singly-linked list headed by <tt>file_systems</tt>
|
|
(see <tt>fs/super.c</tt>). The list is protected by the <tt>file_systems_lock</tt>
|
|
read-write spinlock and functions <tt>register/unregister_filesystem()</tt>
|
|
modify it by linking and unlinking the entry from the list.
|
|
</itemize>
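
The role of <bf>name</bf> as a unique key and of <bf>next</bf> as list linkage
can be illustrated with a small userspace model of registration. The names
<tt>register_fs()</tt>, <tt>unregister_fs()</tt> and <tt>find_link()</tt> are
hypothetical, the error values are plain -1 rather than kernel error codes,
and the read-write spinlock is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct fstype {
    const char *name;
    struct fstype *next;
};

static struct fstype *fs_head;      /* models the head of the list */

/* Return the link that points at the named entry,
 * or the terminating NULL link if the name is absent. */
static struct fstype **find_link(const char *name)
{
    struct fstype **p;

    for (p = &fs_head; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append fs unless the name is already taken; 0 on success, -1 otherwise. */
static int register_fs(struct fstype *fs)
{
    struct fstype **p;

    assert(fs != NULL && fs->name != NULL);
    p = find_link(fs->name);
    if (*p != NULL)
        return -1;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Unlink fs if it is on the list; 0 on success, -1 if not registered. */
static int unregister_fs(struct fstype *fs)
{
    struct fstype **p;

    for (p = &fs_head; *p != NULL; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -1;
}
```

Working with a pointer to the link rather than a pointer to the entry lets the
same walk serve lookup, insertion at the tail and removal without special
cases for the list head.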
|
|
|
|
The job of the <tt>read_super()</tt> function is to fill in the fields of the superblock,
|
|
allocate root inode and initialise any fs-private information associated with
|
|
this mounted instance of the filesystem. So, typically the <tt>read_super()</tt> would
|
|
do:
|
|
|
|
<enum>
|
|
<item> Read the superblock from the device specified via <tt>sb->s_dev</tt> argument,
|
|
using the buffer cache <tt>bread()</tt> function. If it expects to read a few
more subsequent metadata blocks immediately then it makes sense to
use <tt>breada()</tt> to schedule reading extra blocks asynchronously.
|
|
|
|
<item> Verify that the superblock contains a valid magic number and overall
|
|
"looks" sane.
|
|
|
|
<item> Initialise <tt>sb->s_op</tt> to point to a <tt>struct super_operations</tt>
|
|
structure. This structure contains filesystem-specific functions
|
|
implementing operations like "read inode", "delete inode", etc.
|
|
|
|
<item> Allocate root inode and root dentry using <tt>d_alloc_root()</tt>.
|
|
|
|
<item> If the filesystem is not mounted read-only then set <tt>sb->s_dirt</tt> to 1
|
|
and mark the buffer containing superblock dirty (TODO: why do we
|
|
do this? I did it in BFS because MINIX did it...)
|
|
</enum>
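
The validation part of these steps can be sketched in userspace on a raw
buffer. Everything specific here is invented for the example: the on-disk
layout (a little-endian magic number followed by a block count), the value of
<tt>DEMO_MAGIC</tt> and the names <tt>demo_read_super()</tt> and
<tt>struct demo_sb_info</tt>. Reading the block, setting up operations and
allocating the root inode are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAGIC 0x44454d4fu      /* hypothetical magic number */

/* In-memory information kept for one mounted instance. */
struct demo_sb_info {
    uint32_t nblocks;
    int dirty;                      /* models sb->s_dirt */
};

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return 0 and fill *info if buf holds a sane superblock, -1 otherwise. */
static int demo_read_super(const unsigned char *buf, size_t len,
                           int read_only, struct demo_sb_info *info)
{
    assert(info != NULL);
    if (buf == NULL || len < 8)
        return -1;                  /* too short to hold our fields */
    if (le32(buf) != DEMO_MAGIC)
        return -1;                  /* wrong magic: not our filesystem */
    info->nblocks = le32(buf + 4);
    if (info->nblocks == 0)
        return -1;                  /* magic is right but it looks insane */
    info->dirty = !read_only;       /* mark dirty only for read-write mounts */
    return 0;
}
```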
|
|
|
|
<sect1>File Descriptor Management<p>
|
|
|
|
Under Linux there are several levels of indirection between user file
|
|
descriptor and the kernel inode structure. When a process makes an <bf>open(2)</bf>
|
|
system call, the kernel returns a small non-negative integer which can be
|
|
used for subsequent I/O operations on this file. This integer is an index
|
|
into an array of pointers to <tt>struct file</tt>. Each file structure points to
|
|
a dentry via <tt>file->f_dentry</tt>. And each dentry points to an inode via
|
|
<tt>dentry->d_inode</tt>.
|
|
|
|
Each task contains a field <tt>tsk->files</tt> which is a pointer to
|
|
<tt>struct files_struct</tt> defined in <tt>include/linux/sched.h</tt>:
|
|
|
|
<tscreen><code>
|
|
/*
|
|
* Open file table structure
|
|
*/
|
|
struct files_struct {
|
|
atomic_t count;
|
|
rwlock_t file_lock;
|
|
int max_fds;
|
|
int max_fdset;
|
|
int next_fd;
|
|
struct file ** fd; /* current fd array */
|
|
fd_set *close_on_exec;
|
|
fd_set *open_fds;
|
|
fd_set close_on_exec_init;
|
|
fd_set open_fds_init;
|
|
struct file * fd_array[NR_OPEN_DEFAULT];
|
|
};
|
|
</code></tscreen>
|
|
|
|
The <tt>file->f_count</tt> is a reference count, incremented by <tt>get_file()</tt> (usually
|
|
called by <tt>fget()</tt>) and decremented by <tt>fput()</tt> and by <tt>put_filp()</tt>. The difference
|
|
between <tt>fput()</tt> and <tt>put_filp()</tt> is that <tt>fput()</tt> does more work usually needed
|
|
for regular files, such as releasing flock locks, releasing dentry, etc, while
|
|
<tt>put_filp()</tt> is only manipulating file table structures, i.e. decrements the
|
|
count, removes the file from the <tt>anon_list</tt> and adds it to the <tt>free_list</tt>,
|
|
under protection of <tt>files_lock</tt> spinlock.
|
|
|
|
The <tt>tsk->files</tt> can be shared between parent and child if the child thread
|
|
was created using <tt>clone()</tt> system call with <tt>CLONE_FILES</tt> set in the clone flags
|
|
argument. This can be seen in <tt>kernel/fork.c:copy_files()</tt> (called by
|
|
<tt>do_fork()</tt>) which only increments the <tt>files->count</tt> if <tt>CLONE_FILES</tt> is set
instead of the usual copying of the file descriptor table in time-honoured
|
|
tradition of classical UNIX <bf>fork(2)</bf>.
|
|
|
|
When a file is opened, the file structure allocated for it is installed into
|
|
<tt>current->files->fd[fd]</tt> slot and the <tt>fd</tt> bit is set in the bitmap
<tt>current->files->open_fds</tt>. All this is done under the write protection of
<tt>current->files->file_lock</tt> read-write spinlock. When the descriptor is
closed the <tt>fd</tt> bit is cleared in <tt>current->files->open_fds</tt> and
|
|
<tt>current->files->next_fd</tt> is set equal to <tt>fd</tt> as a hint for finding the
|
|
first unused descriptor next time this process wants to open a file.
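
The interplay of the slot array, the bitmap and the hint can be modelled in a
few lines of userspace C. This is a sketch under assumptions: the table is
fixed at 64 entries so that one 64-bit word can serve as the bitmap, the names
<tt>alloc_fd()</tt> and <tt>free_fd()</tt> are hypothetical, locking is
omitted, and the hint is only ever lowered on close so that it never skips
past a free slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FDS 64

struct mini_files {
    uint64_t open_fds;      /* bit n set => descriptor n is in use */
    int next_fd;            /* hint: no free descriptor below this index */
    void *fd[MAX_FDS];      /* models the array of struct file pointers */
};

/* Install file in the lowest free slot at or above the hint;
 * return the descriptor, or -1 if the table is full. */
static int alloc_fd(struct mini_files *f, void *file)
{
    int fd;

    assert(f != NULL);
    for (fd = f->next_fd; fd < MAX_FDS; fd++) {
        if (!(f->open_fds & (UINT64_C(1) << fd))) {
            f->open_fds |= UINT64_C(1) << fd;
            f->fd[fd] = file;
            f->next_fd = fd + 1;
            return fd;
        }
    }
    return -1;
}

/* Release a descriptor and lower the hint if this slot is below it. */
static int free_fd(struct mini_files *f, int fd)
{
    assert(f != NULL);
    if (fd < 0 || fd >= MAX_FDS || !(f->open_fds & (UINT64_C(1) << fd)))
        return -1;
    f->open_fds &= ~(UINT64_C(1) << fd);
    f->fd[fd] = NULL;
    if (fd < f->next_fd)
        f->next_fd = fd;
    return 0;
}
```

After opening three files and closing descriptor 1, the next allocation
returns 1 again, which is the familiar "lowest free descriptor" behaviour of
<bf>open(2)</bf>.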
|
|
|
|
<sect1>File Structure Management<p>
|
|
|
|
The file structure is declared in <tt>include/linux/fs.h</tt>:
|
|
|
|
<tscreen><code>
|
|
struct fown_struct {
|
|
int pid; /* pid or -pgrp where SIGIO should be sent */
|
|
uid_t uid, euid; /* uid/euid of process setting the owner */
|
|
int signum; /* posix.1b rt signal to be delivered on IO */
|
|
};
|
|
|
|
struct file {
|
|
struct list_head f_list;
|
|
struct dentry *f_dentry;
|
|
struct vfsmount *f_vfsmnt;
|
|
struct file_operations *f_op;
|
|
atomic_t f_count;
|
|
unsigned int f_flags;
|
|
mode_t f_mode;
|
|
loff_t f_pos;
|
|
unsigned long f_reada, f_ramax, f_raend, f_ralen, f_rawin;
|
|
struct fown_struct f_owner;
|
|
unsigned int f_uid, f_gid;
|
|
int f_error;
|
|
|
|
unsigned long f_version;
|
|
|
|
/* needed for tty driver, and maybe others */
|
|
void *private_data;
|
|
};
|
|
</code></tscreen>
|
|
|
|
Let us look at the various fields of <tt>struct file</tt>:
|
|
|
|
<enum>
|
|
<item><bf>f_list</bf>: this field links file structure on one (and only one)
|
|
of the lists: a) <tt>sb->s_files</tt> list of all open files on this filesystem,
|
|
if the corresponding inode is not anonymous, then <tt>dentry_open()</tt> (called
|
|
by <tt>filp_open()</tt>) links the file into this list;
|
|
b) <tt>fs/file_table.c:free_list</tt>, containing unused file structures;
|
|
c) <tt>fs/file_table.c:anon_list</tt>, when a new file structure is created by
|
|
<tt>get_empty_filp()</tt> it is placed on this list. All these lists are
|
|
protected by the <tt>files_lock</tt> spinlock.
|
|
|
|
<item><bf>f_dentry</bf>: the dentry corresponding to this file. The dentry
|
|
is created at nameidata lookup time by <tt>open_namei()</tt> (or
|
|
rather <tt>path_walk()</tt>
|
|
which it calls) but the actual <tt>file->f_dentry</tt> field is set by
|
|
<tt>dentry_open()</tt> to contain the dentry thus found.
|
|
|
|
<item><bf>f_vfsmnt</bf>: the pointer to <tt>vfsmount</tt> structure of the filesystem
|
|
containing the file. This is set by <tt>dentry_open()</tt> but is found as part
|
|
of nameidata lookup by <tt>open_namei()</tt> (or rather <tt>path_init()</tt> which it
|
|
calls).
|
|
|
|
<item><bf>f_op</bf>: the pointer to <tt>file_operations</tt> which contains various
|
|
methods that can be invoked on the file. This is copied from
|
|
<tt>inode->i_fop</tt> which is placed there by filesystem-specific
|
|
<tt>s_op->read_inode()</tt> method during nameidata lookup. We will look at
|
|
<tt>file_operations</tt> methods in detail later on in this section.
|
|
|
|
<item><bf>f_count</bf>: reference count manipulated by
|
|
<tt>get_file/put_filp/fput</tt>.
|
|
|
|
<item><bf>f_flags</bf>: <tt>O_XXX</tt> flags from <bf>open(2)</bf> system call copied there
|
|
(with slight modifications by <tt>filp_open()</tt>) by <tt>dentry_open()</tt> and after
|
|
clearing <tt>O_CREAT</tt>, <tt>O_EXCL</tt>, <tt>O_NOCTTY</tt>, <tt>O_TRUNC</tt> - there is no point in
|
|
storing these flags permanently since they cannot be modified by
|
|
<tt>F_SETFL</tt> (or queried by <tt>F_GETFL</tt>) <bf>fcntl(2)</bf> calls.
|
|
|
|
<item><bf>f_mode</bf>: a combination of userspace flags and mode, set
|
|
by <tt>dentry_open()</tt>. The point of the conversion is to store read and
|
|
write access in separate bits so one could do easy checks like
|
|
<tt>(f_mode & FMODE_WRITE)</tt> and <tt>(f_mode & FMODE_READ)</tt>.
|
|
|
|
<item><bf>f_pos</bf>: the current file position for next read or write to
|
|
the file. Under i386 it is of type <tt>long long</tt>, i.e. a 64bit value.
|
|
|
|
<item><bf>f_reada, f_ramax, f_raend, f_ralen, f_rawin</bf>: to support
|
|
readahead - too complex to be discussed by mortals ;)
|
|
|
|
<item><bf>f_owner</bf>: owner of file I/O to receive asynchronous I/O
|
|
notifications via <tt>SIGIO</tt> mechanism (see <tt>fs/fcntl.c:kill_fasync()</tt>).
|
|
|
|
<item><bf>f_uid, f_gid</bf> - set to user id and group id of the process that
|
|
opened the file, when the file structure is created in
|
|
<tt>get_empty_filp()</tt>. If the file is a socket, used by ipv4 netfilter.
|
|
|
|
<item><bf>f_error</bf>: used by NFS client to return write errors. It is
|
|
set in <tt>fs/nfs/file.c</tt> and checked in <tt>mm/filemap.c:generic_file_write()</tt>.
|
|
|
|
<item><bf>f_version</bf> - versioning mechanism for invalidating caches,
|
|
incremented (using global <tt>event</tt>) whenever <tt>f_pos</tt> changes.
|
|
|
|
<item><bf>private_data</bf>: private per-file data which can be used by
|
|
filesystems (e.g. coda stores credentials here) or by device drivers.
|
|
Device drivers (in the presence of devfs) could use this field to
|
|
differentiate between multiple instances instead of the classical
|
|
minor number encoded in <tt>file->f_dentry->d_inode->i_rdev</tt>.
|
|
|
|
</enum>
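The treatment of <tt>f_flags</tt> and <tt>f_mode</tt> described above can be sketched in userspace. This is only a model, not the kernel code: the <tt>model_*</tt> names and the <tt>MODEL_FMODE_*</tt> values are assumptions made for illustration, and the arithmetic relies on the Linux values of <tt>O_RDONLY</tt>, <tt>O_WRONLY</tt> and <tt>O_RDWR</tt> (0, 1 and 2).

```c
#include <assert.h>
#include <fcntl.h>

/* Local stand-ins for the kernel's FMODE_* bits (assumed values). */
#define MODEL_FMODE_READ  1
#define MODEL_FMODE_WRITE 2

/* Drop the flags that only matter while the file is being opened. */
static int model_f_flags(int open_flags)
{
	return open_flags & ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
}

/* Turn the open(2) access mode into separate read and write bits. */
static int model_f_mode(int open_flags)
{
	/* access mode 3 is not modelled by this sketch */
	assert((open_flags & O_ACCMODE) != O_ACCMODE);
	return ((open_flags & O_ACCMODE) + 1) & O_ACCMODE;
}
```

With this conversion a check such as <tt>model_f_mode(flags) & MODEL_FMODE_WRITE</tt> is true for both <tt>O_WRONLY</tt> and <tt>O_RDWR</tt>, which the raw access mode does not allow with a single bit test.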
|
|
|
|
Now let us look at the <tt>file_operations</tt> structure, which contains the methods that
can be invoked on files. Recall that it is copied from <tt>inode->i_fop</tt>,
where it is set by the <tt>s_op->read_inode()</tt> method. It is declared in
<tt>include/linux/fs.h</tt>:
|
|
|
|
<tscreen><code>
struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
	int (*readdir) (struct file *, void *, filldir_t);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, struct dentry *, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
	ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
};
</code></tscreen>
|
|
|
|
<enum>
|
|
<item><bf>owner</bf>: a pointer to the module that owns the subsystem in
|
|
question. Only drivers need to set it to <tt>THIS_MODULE</tt>, filesystems can
|
|
happily ignore it because their module counts are controlled at
|
|
mount/umount time whilst the drivers need to control it at open/release
|
|
time.
|
|
|
|
<item><bf>llseek</bf>: implements the <bf>lseek(2)</bf> system call. Usually it is
|
|
omitted and <tt>fs/read_write.c:default_llseek()</tt> is used, which does the
|
|
right thing (TODO: force all those who set it to NULL currently to use
|
|
default_llseek - that way we save an <tt>if()</tt> in <tt>llseek()</tt>)
|
|
|
|
<item><bf>read</bf>: implements the <bf>read(2)</bf> system call. Filesystems can use
|
|
<tt>mm/filemap.c:generic_file_read()</tt> for regular files and
|
|
<tt>fs/read_write.c:generic_read_dir()</tt> (which simply returns <tt>-EISDIR</tt>)
|
|
for directories here.
|
|
|
|
<item><bf>write</bf>: implements <bf>write(2)</bf> system call. Filesystems can use
|
|
<tt>mm/filemap.c:generic_file_write()</tt> for regular files and ignore it for
|
|
directories here.
|
|
|
|
<item><bf>readdir</bf>: used by filesystems. Ignored for regular files
|
|
and implements <bf>readdir(2)</bf> and <bf>getdents(2)</bf> system calls for directories.
|
|
|
|
<item><bf>poll</bf>: implements <bf>poll(2)</bf> and <bf>select(2)</bf> system calls.
|
|
|
|
<item><bf>ioctl</bf>: implements driver or filesystem-specific
|
|
ioctls. Note that generic file ioctls like <tt>FIBMAP</tt>, <tt>FIGETBSZ</tt>, <tt>FIONREAD</tt>
|
|
are implemented by higher levels so they never reach the <tt>f_op->ioctl()</tt>
|
|
method.
|
|
|
|
<item><bf>mmap</bf>: implements the <bf>mmap(2)</bf> system call. Filesystems can use
|
|
<bf>generic_file_mmap</bf> here for regular files and ignore it on directories.
|
|
|
|
<item><bf>open</bf>: called at <bf>open(2)</bf> time by <tt>dentry_open()</tt>. Filesystems
|
|
rarely use this, e.g. coda tries to cache the file locally at open
|
|
time.
|
|
|
|
<item><bf>flush</bf>: called at each <bf>close(2)</bf> of this file, not necessarily
|
|
the last one (see <tt>release()</tt> method below). The only filesystem that
|
|
uses this is NFS client to flush all dirty pages. Note that this can
|
|
return an error which will be passed back to userspace which made the
|
|
<bf>close(2)</bf> system call.
|
|
|
|
<item><bf>release</bf>: called at the last <bf>close(2)</bf> of this file, i.e. when
|
|
<tt>file->f_count</tt> reaches 0. Although defined as returning int, the return
|
|
value is ignored by VFS (see <tt>fs/file_table.c:__fput()</tt>).
|
|
|
|
<item><bf>fsync</bf>: maps directly to <bf>fsync(2)/fdatasync(2)</bf> system calls,
|
|
with the last argument specifying whether it is fsync or fdatasync.
|
|
Almost no work is done by VFS around this, except to map file
|
|
descriptor to a file structure (<tt>file = fget(fd)</tt>) and down/up
|
|
<tt>inode->i_sem</tt> semaphore. Ext2 filesystem currently ignores the last
|
|
argument and does exactly the same for <bf>fsync(2)</bf> and <bf>fdatasync(2)</bf>.
|
|
|
|
<item><bf>fasync</bf>: this method is called when <tt>file->f_flags & FASYNC</tt>
|
|
changes.
|
|
|
|
<item><bf>lock</bf>: the filesystem-specific portion of the POSIX <bf>fcntl(2)</bf>
|
|
file region locking mechanism. The only bug here is that because it is
|
|
called before fs-independent portion (<tt>posix_lock_file()</tt>), if it
|
|
succeeds but the standard POSIX lock code fails then it will never be
|
|
unlocked on fs-dependent level..
|
|
|
|
<item><bf>readv</bf>: implements <bf>readv(2)</bf> system call.
|
|
|
|
<item><bf>writev</bf>: implements <bf>writev(2)</bf> system call.
|
|
</enum>
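The difference between <tt>flush</tt> and <tt>release</tt> described above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: <tt>model_file</tt>, <tt>model_file_operations</tt> and <tt>model_close()</tt> are hypothetical names, the methods take a simplified argument list, and the call counters exist only so the behaviour can be observed.

```c
#include <assert.h>
#include <stddef.h>

struct model_file;

/* Hypothetical two-method subset of file_operations. */
struct model_file_operations {
	int (*flush)(struct model_file *);
	int (*release)(struct model_file *);
};

struct model_file {
	int f_count;                               /* references to this open file */
	const struct model_file_operations *f_op;
	int flush_calls;                           /* bookkeeping for the sketch only */
	int release_calls;
};

static int count_flush(struct model_file *f)   { f->flush_calls++;   return 0; }
static int count_release(struct model_file *f) { f->release_calls++; return 0; }

static const struct model_file_operations model_fops = {
	.flush   = count_flush,
	.release = count_release,
};

/* Model of close(2): flush on every call, release only on the last one. */
static int model_close(struct model_file *f)
{
	int err = 0;

	assert(f->f_count > 0);
	if (f->f_op && f->f_op->flush)
		err = f->f_op->flush(f);       /* error is passed back to the caller */
	if (--f->f_count == 0 && f->f_op && f->f_op->release)
		(void)f->f_op->release(f);     /* return value deliberately ignored */
	return err;
}
```

Closing a file that has two references therefore calls <tt>flush</tt> twice but <tt>release</tt> only once.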
|
|
|
|
<sect1>Superblock and Mountpoint Management<p>
|
|
|
|
Under Linux, information about mounted filesystems is kept in two separate
structures - <tt>super_block</tt> and <tt>vfsmount</tt>. The reason for this is that Linux
allows the same filesystem (block device) to be mounted under multiple mount
points, which means that the same <tt>super_block</tt> can correspond to multiple
<tt>vfsmount</tt> structures.
|
|
|
|
Let us look at <tt>struct super_block</tt> first, declared in <tt>include/linux/fs.h</tt>:
|
|
|
|
<tscreen><code>
struct super_block {
	struct list_head	s_list;		/* Keep this first */
	kdev_t			s_dev;
	unsigned long		s_blocksize;
	unsigned char		s_blocksize_bits;
	unsigned char		s_lock;
	unsigned char		s_dirt;
	struct file_system_type	*s_type;
	struct super_operations	*s_op;
	struct dquot_operations	*dq_op;
	unsigned long		s_flags;
	unsigned long		s_magic;
	struct dentry		*s_root;
	wait_queue_head_t	s_wait;

	struct list_head	s_dirty;	/* dirty inodes */
	struct list_head	s_files;

	struct block_device	*s_bdev;
	struct list_head	s_mounts;	/* vfsmount(s) of this one */
	struct quota_mount_options s_dquot;	/* Diskquota specific options */

	union {
		struct minix_sb_info	minix_sb;
		struct ext2_sb_info	ext2_sb;
		..... all filesystems that need sb-private info ...
		void			*generic_sbp;
	} u;
	/*
	 * The next field is for VFS *only*. No filesystems have any business
	 * even looking at it. You had been warned.
	 */
	struct semaphore s_vfs_rename_sem;	/* Kludge */

	/* The next field is used by knfsd when converting a (inode number based)
	 * file handle into a dentry. As it builds a path in the dcache tree from
	 * the bottom up, there may for a time be a subpath of dentrys which is not
	 * connected to the main tree. This semaphore ensure that there is only ever
	 * one such free path per filesystem. Note that unconnected files (or other
	 * non-directories) are allowed, but not unconnected diretories.
	 */
	struct semaphore s_nfsd_free_path_sem;
};
</code></tscreen>
|
|
|
|
The various fields in the <tt>super_block</tt> structure are:
|
|
|
|
<enum>
|
|
<item><bf>s_list</bf>: a doubly-linked list of all active superblocks; note
|
|
I don't say "of all mounted filesystems" because under Linux one can
|
|
have multiple instances of a mounted filesystem corresponding to a
|
|
single superblock.
|
|
|
|
<item><bf>s_dev</bf>: for filesystems which require a block device to be mounted
|
|
on, i.e. for <tt>FS_REQUIRES_DEV</tt> filesystems, this is the <tt>i_dev</tt> of the
|
|
block device. For others (called anonymous filesystems) this is an
|
|
integer <tt>MKDEV(UNNAMED_MAJOR, i)</tt> where <tt>i</tt> is the first unset bit in
|
|
<tt>unnamed_dev_in_use</tt> array, between 1 and 255 inclusive. See
|
|
<tt>fs/super.c:get_unnamed_dev()/put_unnamed_dev()</tt>. It has been suggested
|
|
many times that anonymous filesystems should not use <tt>s_dev</tt> field.
|
|
|
|
<item><bf>s_blocksize, s_blocksize_bits</bf>: blocksize and log2(blocksize).
|
|
|
|
<item><bf>s_lock</bf>: indicates whether superblock is currently locked by
|
|
<tt>lock_super()/unlock_super()</tt>.
|
|
|
|
<item><bf>s_dirt</bf>: set when superblock is changed, and cleared whenever
|
|
it is written back to disk.
|
|
|
|
<item><bf>s_type</bf>: pointer to <tt>struct file_system_type</tt> of the
|
|
corresponding filesystem. Filesystem's <tt>read_super()</tt> method doesn't need
|
|
to set it as VFS <tt>fs/super.c:read_super()</tt> sets it for you if
|
|
fs-specific <tt>read_super()</tt> succeeds and resets to NULL if it fails.
|
|
|
|
<item><bf>s_op</bf>: pointer to <tt>super_operations</tt> structure which contains
|
|
fs-specific methods to read/write inodes etc. It is the job of
|
|
filesystem's <tt>read_super()</tt> method to initialise <tt>s_op</tt> correctly.
|
|
|
|
<item><bf>dq_op</bf>: disk quota operations.
|
|
|
|
<item><bf>s_flags</bf>: superblock flags.
|
|
|
|
<item><bf>s_magic</bf>: filesystem's magic number. Used by minix filesystem
|
|
to differentiate between multiple flavours of itself.
|
|
|
|
<item><bf>s_root</bf>: dentry of the filesystem's root. It is the job of
|
|
<tt>read_super()</tt> to read the root inode from the disk and pass it to
|
|
<tt>d_alloc_root()</tt> to allocate the dentry and instantiate it. Some
|
|
filesystems spell "root" other than "/" and so use more generic
|
|
<tt>d_alloc()</tt> function to bind the dentry to a name, e.g. pipefs mounts
|
|
itself on "pipe:" as its own root instead of "/".
|
|
|
|
<item><bf>s_wait</bf>: waitqueue of processes waiting for superblock to be
|
|
unlocked.
|
|
|
|
<item><bf>s_dirty</bf>: a list of all dirty inodes. Recall that if inode
|
|
is dirty (<tt>inode->i_state & I_DIRTY</tt>) then it is on superblock-specific
|
|
dirty list linked via <tt>inode->i_list</tt>.
|
|
|
|
<item><bf>s_files</bf>: a list of all open files on this superblock. Useful
|
|
for deciding whether filesystem can be remounted read-only, see
|
|
<tt>fs/file_table.c:fs_may_remount_ro()</tt> which goes through <tt>sb->s_files</tt> list
|
|
and denies remounting if there are files opened for write
|
|
(<tt>file->f_mode & FMODE_WRITE</tt>) or files with pending
|
|
unlink (<tt>inode->i_nlink == 0</tt>).
|
|
|
|
<item><bf>s_bdev</bf>: for <tt>FS_REQUIRES_DEV</tt>, this points to the block_device
|
|
structure describing the device the filesystem is mounted on.
|
|
|
|
<item><bf>s_mounts</bf>: a list of all <tt>vfsmount</tt> structures, one for each
|
|
mounted instance of this superblock.
|
|
|
|
<item><bf>s_dquot</bf>: more diskquota stuff.
|
|
</enum>
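The use of <tt>s_files</tt> for deciding whether a read-only remount is allowed can be sketched as a simple list walk. The code below is a userspace model, not <tt>fs_may_remount_ro()</tt> itself: the structure is a hypothetical flattened view of one open file (the kernel reaches <tt>i_nlink</tt> through the dentry and inode), and <tt>MODEL_FMODE_WRITE</tt> is an assumed value.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_FMODE_WRITE 2   /* assumed value of the write-access bit */

/* Hypothetical, flattened view of one entry on sb->s_files. */
struct model_open_file {
	unsigned int f_mode;            /* access bits of the open file */
	unsigned int i_nlink;           /* link count of the file's inode */
	struct model_open_file *next;   /* next open file on this superblock */
};

/* Return 1 if a read-only remount may proceed, 0 if it must be denied. */
static int model_may_remount_ro(const struct model_open_file *s_files)
{
	const struct model_open_file *f;

	for (f = s_files; f != NULL; f = f->next) {
		if (f->i_nlink == 0)                /* file with pending unlink */
			return 0;
		if (f->f_mode & MODEL_FMODE_WRITE)  /* file opened for write */
			return 0;
	}
	return 1;
}
```

An empty list, or a list containing only files opened for reading with a non-zero link count, permits the remount.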
|
|
|
|
The superblock operations are described in the <tt>super_operations</tt> structure
|
|
declared in <tt>include/linux/fs.h</tt>:
|
|
|
|
<tscreen><code>
struct super_operations {
	void (*read_inode) (struct inode *);
	void (*write_inode) (struct inode *, int);
	void (*put_inode) (struct inode *);
	void (*delete_inode) (struct inode *);
	void (*put_super) (struct super_block *);
	void (*write_super) (struct super_block *);
	int (*statfs) (struct super_block *, struct statfs *);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*clear_inode) (struct inode *);
	void (*umount_begin) (struct super_block *);
};
</code></tscreen>
|
|
|
|
<enum>
|
|
<item><bf>read_inode</bf>: reads the inode from the filesystem. It is only
|
|
called from <tt>fs/inode.c:get_new_inode()</tt> from <tt>iget4()</tt> (and therefore
|
|
<tt>iget()</tt>). If a filesystem wants to use <tt>iget()</tt> then <tt>read_inode()</tt> must be
|
|
implemented - otherwise <tt>get_new_inode()</tt> will panic.
|
|
While inode is being read it is locked (<tt>inode->i_state = I_LOCK</tt>). When
|
|
the function returns, all waiters on <tt>inode->i_wait</tt> are woken up. The job
|
|
of the filesystem's <tt>read_inode()</tt> method is to locate the disk block which
|
|
contains the inode to be read and use buffer cache <tt>bread()</tt> function to
|
|
read it in and initialise the various fields of inode structure, for
|
|
example the <tt>inode->i_op</tt> and <tt>inode->i_fop</tt> so that VFS level knows what
|
|
operations can be performed on the inode or corresponding file.
|
|
Filesystems that don't implement <tt>read_inode()</tt> are ramfs and
|
|
pipefs. For example, ramfs has its own inode-generating function
|
|
<tt>ramfs_get_inode()</tt> with all the inode operations calling it as needed.
|
|
|
|
<item><bf>write_inode</bf>: write inode back to disk. Similar to
|
|
<tt>read_inode()</tt> in that it needs to locate the relevant block on
|
|
disk and interact with buffer cache by calling
|
|
<tt>mark_buffer_dirty(bh)</tt>. This method is called on dirty inodes
|
|
(those marked dirty with <tt>mark_inode_dirty()</tt>) when the inode needs
|
|
to be sync'd either individually or as part of syncing the
|
|
entire filesystem.
|
|
|
|
<item><bf>put_inode</bf>: called whenever the reference count is decreased.
|
|
|
|
<item><bf>delete_inode</bf>: called whenever both <tt>inode->i_count</tt> and
|
|
<tt>inode->i_nlink</tt> reach 0. Filesystem deletes the on-disk copy of the
|
|
inode and calls <tt>clear_inode()</tt> on VFS inode to "terminate it with
|
|
extreme prejudice".
|
|
|
|
<item><bf>put_super</bf>: called at the last stages of <bf>umount(2)</bf> system
|
|
call to notify the filesystem that any private information held by
|
|
the filesystem about this instance should be freed. Typically this
|
|
would <tt>brelse()</tt> the block containing the superblock and <tt>kfree()</tt> any
|
|
bitmaps allocated for free blocks, inodes, etc.
|
|
|
|
<item><bf>write_super</bf>: called when superblock needs to be
|
|
written back to disk. It should find the block containing the
|
|
superblock (usually kept in <tt>sb-private</tt> area) and
|
|
<tt>mark_buffer_dirty(bh)</tt> . It should also clear <tt>sb->s_dirt</tt> flag.
|
|
|
|
<item><bf>statfs</bf>: implements <bf>fstatfs(2)/statfs(2)</bf> system calls. Note
|
|
that the pointer to <tt>struct statfs</tt> passed as argument is a kernel
|
|
pointer, not a user pointer so we don't need to do any I/O to
|
|
userspace. If not implemented then <tt>statfs(2)</tt> will fail with <tt>ENOSYS</tt>.
|
|
|
|
<item><bf>remount_fs</bf>: called whenever filesystem is being remounted.
|
|
|
|
<item><bf>clear_inode</bf>: called from VFS level <tt>clear_inode()</tt>. Filesystems
|
|
that attach private data to inode structure (via <tt>generic_ip</tt> field) must
|
|
free it here.
|
|
|
|
<item><bf>umount_begin</bf>: called during forced umount to notify the
|
|
filesystem beforehand, so that it can do its best to make sure that
|
|
nothing keeps the filesystem busy. Currently used only by NFS. This
|
|
has nothing to do with the idea of generic VFS level forced umount
|
|
support.
|
|
</enum>
|
|
|
|
So, let us look at what happens when we mount an on-disk (<tt>FS_REQUIRES_DEV</tt>)
filesystem. The implementation of the <bf>mount(2)</bf> system call is in
<tt>fs/super.c:sys_mount()</tt>, which is just a wrapper that copies the options,
filesystem type and device name for the <tt>do_mount()</tt> function, which does the
real work:
|
|
|
|
<enum>
|
|
<item>Filesystem driver is loaded if needed and its module's reference count
|
|
is incremented. Note that during mount operation, the filesystem
|
|
module's reference count is incremented twice - once by <tt>do_mount()</tt>
|
|
calling <tt>get_fs_type()</tt> and once by <tt>get_sb_bdev()</tt> calling <tt>get_filesystem()</tt>
|
|
if <tt>read_super()</tt> was successful. The first increment is to prevent
|
|
module unloading while we are inside <tt>read_super()</tt> method and the second
|
|
increment is to indicate that the module is in use by this mounted
|
|
instance. Obviously, <tt>do_mount()</tt> decrements the count before returning, so
|
|
overall the count only grows by 1 after each mount.
|
|
|
|
<item>Since, in our case, <tt>fs_type->fs_flags & FS_REQUIRES_DEV</tt> is true, the
|
|
superblock is initialised by a call to <tt>get_sb_bdev()</tt> which obtains
|
|
the reference to the block device and interacts with the filesystem's
|
|
<tt>read_super()</tt> method to fill in the superblock. If all goes well, the
|
|
<tt>super_block</tt> structure is initialised and we have an extra reference
|
|
to the filesystem's module and a reference to the underlying block
|
|
device.
|
|
|
|
<item>A new <tt>vfsmount</tt> structure is allocated and linked to the <tt>sb->s_mounts</tt> list
and to the global <tt>vfsmntlist</tt> list. The <tt>vfsmount</tt> field <tt>mnt_instances</tt>
makes it possible to find all instances mounted on the same superblock as this
one. The <tt>mnt_list</tt> field makes it possible to find all instances for all
superblocks system-wide. The <tt>mnt_sb</tt> field
points to this superblock and <tt>mnt_root</tt> holds a new reference to the
<tt>sb->s_root</tt> dentry.
|
|
</enum>
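The module reference counting in the first step is easy to get wrong, so here is a small userspace model of it. The <tt>model_*</tt> names are hypothetical and the <tt>read_super_ok</tt> parameter simply stands in for the outcome of the filesystem's <tt>read_super()</tt> method; the sketch only tracks the counter, nothing else of the mount path.

```c
#include <assert.h>

/* Hypothetical stand-in for a filesystem module and its use count. */
struct model_module {
	int use_count;
};

static void model_get(struct model_module *m)
{
	m->use_count++;
}

static void model_put(struct model_module *m)
{
	assert(m->use_count > 0);
	m->use_count--;
}

/*
 * Model of the mount path: returns 0 on success, -1 if read_super fails.
 * read_super_ok stands in for the result of the filesystem's read_super().
 */
static int model_do_mount(struct model_module *m, int read_super_ok)
{
	int err = -1;

	model_get(m);          /* first increment: pin module during read_super() */
	if (read_super_ok) {
		model_get(m);  /* second increment: held by the mounted instance */
		err = 0;
	}
	model_put(m);          /* temporary reference dropped before returning */
	return err;
}
```

A successful mount leaves the count one higher than before, and a failed mount leaves it unchanged.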
|
|
|
|
<sect1>Example Virtual Filesystem: pipefs<p>
|
|
|
|
As a simple example of Linux filesystem that does not require a block device
|
|
for mounting, let us consider pipefs from <tt>fs/pipe.c</tt>. The filesystem's preamble
|
|
is rather straightforward and requires little explanation:
|
|
|
|
<tscreen><code>
static DECLARE_FSTYPE(pipe_fs_type, "pipefs", pipefs_read_super,
	FS_NOMOUNT|FS_SINGLE);

static int __init init_pipe_fs(void)
{
	int err = register_filesystem(&pipe_fs_type);
	if (!err) {
		pipe_mnt = kern_mount(&pipe_fs_type);
		err = PTR_ERR(pipe_mnt);
		if (!IS_ERR(pipe_mnt))
			err = 0;
	}
	return err;
}

static void __exit exit_pipe_fs(void)
{
	unregister_filesystem(&pipe_fs_type);
	kern_umount(pipe_mnt);
}

module_init(init_pipe_fs)
module_exit(exit_pipe_fs)
</code></tscreen>
|
|
|
|
The filesystem is of type <tt>FS_NOMOUNT|FS_SINGLE</tt>, which means it cannot be
|
|
mounted from userspace and can only have one superblock system-wide. The
|
|
<tt>FS_SINGLE</tt> flag also means that it must be mounted via <tt>kern_mount()</tt> after
|
|
it is successfully registered via <tt>register_filesystem()</tt>, which is exactly
|
|
what happens in <tt>init_pipe_fs()</tt>. The only bug in this function is that if
|
|
<tt>kern_mount()</tt> fails (e.g. because <tt>kmalloc()</tt> failed in <tt>add_vfsmnt()</tt>) then the
|
|
filesystem is left as registered but module initialisation fails. This will
|
|
cause <bf>cat /proc/filesystems</bf> to Oops. (I have just sent a patch to Linus,
mentioning that although this is not a real bug today, as pipefs can't be
compiled as a module, it should be written with the view that in the future
it may become modularised.)
|
|
|
|
The result of <tt>register_filesystem()</tt> is that <tt>pipe_fs_type</tt> is linked into
|
|
the <tt>file_systems</tt> list so one can read <tt>/proc/filesystems</tt> and find "pipefs"
|
|
entry in there with "nodev" flag indicating that <tt>FS_REQUIRES_DEV</tt> was not set.
|
|
The <tt>/proc/filesystems</tt> file should really be enhanced to support all the new
|
|
<tt>FS_</tt> flags (and I made a patch to do so) but it cannot be done because it will
|
|
break all the user applications that use it. Despite Linux kernel interfaces
|
|
changing every minute (only for the better) when it comes to the userspace
|
|
compatibility, Linux is a very conservative operating system which allows
|
|
many applications to be used for a long time without being recompiled.
|
|
|
|
The result of <tt>kern_mount()</tt> is that:
|
|
|
|
<enum>
|
|
<item>A new unnamed (anonymous) device number is allocated by setting a bit in
|
|
<tt>unnamed_dev_in_use</tt> bitmap; if there are no more bits then <tt>kern_mount()</tt>
|
|
fails with <tt>EMFILE</tt>.
|
|
|
|
<item>A new superblock structure is allocated by means of <tt>get_empty_super()</tt>.
|
|
The <tt>get_empty_super()</tt> function walks the list of superblocks headed
|
|
by <tt>super_block</tt> and looks for empty entry, i.e. <tt>s->s_dev == 0</tt>. If no
|
|
such empty superblock is found then a new one is allocated using
|
|
<tt>kmalloc()</tt> at <tt>GFP_USER</tt> priority. The maximum system-wide number of
|
|
superblocks is checked in <tt>get_empty_super()</tt> so if it starts failing,
|
|
one can adjust the tunable <tt>/proc/sys/fs/super-max</tt>.
|
|
|
|
<item>A filesystem-specific <tt>pipe_fs_type->read_super()</tt> method, i.e.
|
|
<tt>pipefs_read_super()</tt>, is invoked which allocates root inode and root
|
|
dentry <tt>sb->s_root</tt>, and sets <tt>sb->s_op</tt> to be <tt>&pipefs_ops</tt>.
|
|
|
|
<item>Then <tt>kern_mount()</tt> calls <tt>add_vfsmnt(NULL, sb->s_root, "none")</tt> which
|
|
allocates a new <tt>vfsmount</tt> structure and links it into <tt>vfsmntlist</tt> and
|
|
<tt>sb->s_mounts</tt>.
|
|
|
|
<item>The <tt>pipe_fs_type->kern_mnt</tt> is set to this new <tt>vfsmount</tt> structure and
|
|
it is returned. The reason why the return value of <tt>kern_mount()</tt> is a
|
|
<tt>vfsmount</tt> structure is because even <tt>FS_SINGLE</tt> filesystems can be mounted
|
|
multiple times and so their <tt>mnt->mnt_sb</tt> will point to the same thing
|
|
which would be silly to return from multiple calls to <tt>kern_mount()</tt>.
|
|
</enum>
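The allocation of an anonymous device number in the first step amounts to a search for the first clear bit in a bitmap. The following userspace sketch models that search; it is not the code of <tt>get_unnamed_dev()</tt>. The <tt>model_*</tt> names are hypothetical, the range 1 to 255 is taken from the description of <tt>s_dev</tt> earlier, and returning <tt>-EMFILE</tt> directly is a simplification of how the failure reaches the caller of <tt>kern_mount()</tt>.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

#define MODEL_MAX_UNNAMED 256   /* minors 0..255; minor 0 is never handed out */

static unsigned char model_in_use[MODEL_MAX_UNNAMED / CHAR_BIT];

/* Allocate the lowest free minor in 1..255, or return -EMFILE if none is left. */
static int model_get_unnamed_dev(void)
{
	int i;

	for (i = 1; i < MODEL_MAX_UNNAMED; i++) {
		unsigned char mask = (unsigned char)(1u << (i % CHAR_BIT));

		if (!(model_in_use[i / CHAR_BIT] & mask)) {
			model_in_use[i / CHAR_BIT] |= mask;
			return i;
		}
	}
	return -EMFILE;
}

/* Release a minor previously returned by model_get_unnamed_dev(). */
static void model_put_unnamed_dev(int i)
{
	assert(i >= 1 && i < MODEL_MAX_UNNAMED);
	model_in_use[i / CHAR_BIT] &= (unsigned char)~(1u << (i % CHAR_BIT));
}
```

Because the search always starts from the lowest bit, a released number is reused before any higher unused one.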
|
|
|
|
Now that the filesystem is registered and inkernel-mounted we can use it.
|
|
The entry point into the pipefs filesystem is the <bf>pipe(2)</bf> system call,
|
|
implemented in arch-dependent function <tt>sys_pipe()</tt> but the real work is done
|
|
by a portable <tt>fs/pipe.c:do_pipe()</tt> function. Let us look at <tt>do_pipe()</tt> then.
|
|
The interaction with pipefs happens when <tt>do_pipe()</tt> calls <tt>get_pipe_inode()</tt>
|
|
to allocate a new pipefs inode. For this inode, <tt>inode->i_sb</tt> is set to
|
|
pipefs' superblock <tt>pipe_mnt->mnt_sb</tt>, the file operations <tt>i_fop</tt> is set to
|
|
<tt>rdwr_pipe_fops</tt> and the number of readers and writers (held in <tt>inode->i_pipe</tt>)
|
|
is set to 1. The reason why there is a separate inode field <tt>i_pipe</tt> instead
|
|
of keeping it in the <tt>fs-private</tt> union is that pipes and FIFOs share the same
|
|
code and FIFOs can exist on other filesystems which use the other access
|
|
paths within the same union which is very bad C and can work only by pure
|
|
luck. So, yes, 2.2.x kernels work only by pure luck and will stop working
|
|
as soon as you slightly rearrange the fields in the inode.
|
|
|
|
Each <bf>pipe(2)</bf> system call increments a reference count on the <tt>pipe_mnt</tt>
|
|
mount instance.
|
|
|
|
Under Linux, pipes are not symmetric (bidirectional or STREAM pipes), i.e.
the two sides of the pipe have different <tt>file->f_op</tt> operations - the
|
|
<tt>read_pipe_fops</tt> and <tt>write_pipe_fops</tt> respectively. The write on read side
|
|
returns <tt>EBADF</tt> and so does read on write side.
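The asymmetry is visible from userspace. The small helper below is a sketch, with <tt>model_wrong_way_errno()</tt> being a hypothetical name: it creates a pipe with <bf>pipe(2)</bf>, attempts I/O in the wrong direction on one end and reports the resulting <tt>errno</tt>. It uses only standard calls and makes no claim about which kernel code path produces the error.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Create a pipe, attempt I/O in the wrong direction on the chosen end and
 * return the resulting errno (0 if the call unexpectedly succeeded,
 * -1 if the pipe could not be created).
 * end == 0: write to the read end; end == 1: read from the write end.
 */
static int model_wrong_way_errno(int end)
{
	int fds[2];
	char c = 'x';
	ssize_t n;
	int err;

	assert(end == 0 || end == 1);
	if (pipe(fds) < 0)
		return -1;
	errno = 0;
	if (end == 0)
		n = write(fds[0], &c, 1);   /* fds[0] is the read side */
	else
		n = read(fds[1], &c, 1);    /* fds[1] is the write side */
	err = n < 0 ? errno : 0;
	close(fds[0]);
	close(fds[1]);
	return err;
}
```

On Linux both directions are expected to yield <tt>EBADF</tt>, in agreement with the description above.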
|
|
|
|
|
|
<sect1>Example Disk Filesystem: BFS<p>
|
|
|
|
As a simple example of ondisk Linux filesystem, let us consider BFS. The
|
|
preamble of the BFS module is in <tt>fs/bfs/inode.c</tt>:
|
|
|
|
<tscreen><code>
static DECLARE_FSTYPE_DEV(bfs_fs_type, "bfs", bfs_read_super);

static int __init init_bfs_fs(void)
{
	return register_filesystem(&bfs_fs_type);
}

static void __exit exit_bfs_fs(void)
{
	unregister_filesystem(&bfs_fs_type);
}

module_init(init_bfs_fs)
module_exit(exit_bfs_fs)
</code></tscreen>
|
|
|
|
A special fstype declaration macro <tt>DECLARE_FSTYPE_DEV()</tt> is used which
|
|
sets the <tt>fs_type->fs_flags</tt> to <tt>FS_REQUIRES_DEV</tt> to signify that BFS requires a
|
|
real block device to be mounted on.
|
|
|
|
The module's initialisation function registers the filesystem with VFS and
|
|
the cleanup function (only present when BFS is configured to be a module)
|
|
unregisters it.
|
|
|
|
With the filesystem registered, we can proceed to mount it, which would
|
|
invoke our <tt>fs_type->read_super()</tt> method which is implemented in
|
|
<tt>fs/bfs/inode.c:bfs_read_super().</tt> It does the following:
|
|
|
|
<enum>
|
|
<item><tt>set_blocksize(s->s_dev, BFS_BSIZE)</tt>: since we are about to interact
|
|
with the block device layer via the buffer cache, we must initialise a few
|
|
things, namely set the block size and also inform VFS via fields
|
|
<tt>s->s_blocksize</tt> and <tt>s->s_blocksize_bits</tt>.
|
|
|
|
<item><tt>bh = bread(dev, 0, BFS_BSIZE)</tt>: we read block 0 of the device
|
|
passed via <tt>s->s_dev</tt>. This block is the filesystem's superblock.
|
|
|
|
<item>Superblock is validated against <tt>BFS_MAGIC</tt> number and, if valid, stored
|
|
in the sb-private field <tt>s->su_sbh</tt> (which is really <tt>s->u.bfs_sb.si_sbh</tt>).
|
|
|
|
<item>Then we allocate inode bitmap using <tt>kmalloc(GFP_KERNEL)</tt> and clear all
|
|
bits to 0 except the first two which we set to 1 to indicate that we
|
|
should never allocate inodes 0 and 1. Inode 2 is root and the
|
|
corresponding bit will be set to 1 a few lines later anyway - the
|
|
filesystem should have a valid root inode at mounting time!
|
|
|
|
<item>Then we initialise <tt>s->s_op</tt>, which means that we can from this point
|
|
invoke inode cache via <tt>iget()</tt> which results in <tt>s_op->read_inode()</tt> to
|
|
be invoked. This finds the block that contains the specified (by
|
|
<tt>inode->i_ino</tt> and <tt>inode->i_dev</tt>) inode and reads it in. If we fail to
|
|
get root inode then we free the inode bitmap and release superblock
|
|
buffer back to buffer cache and return NULL. If root inode was read OK,
|
|
then we allocate a dentry with name <tt>/</tt> (as becometh root) and
|
|
instantiate it with this inode.
|
|
|
|
<item>Now we go through all inodes on the filesystem and read them all in
|
|
order to set the corresponding bits in our internal inode bitmap and
|
|
also to calculate some other internal parameters like the offset of
|
|
last inode and the start/end blocks of last file. Each inode we read
|
|
is returned back to inode cache via <tt>iput()</tt> - we don't hold a reference
|
|
to it longer than needed.
|
|
|
|
<item>If the filesystem was not mounted read-only, we mark the superblock
|
|
buffer dirty and set <tt>s->s_dirt</tt> flag (TODO: why do I do this?
|
|
Originally, I did it because <tt>minix_read_super()</tt> did but neither minix
|
|
nor BFS seem to modify superblock in the <tt>read_super()</tt>).
|
|
|
|
<item>All is well so we return this initialised superblock back to the caller
|
|
at VFS level, i.e. <tt>fs/super.c:read_super()</tt>.
|
|
</enum>
|
|
|
|
After the <tt>read_super()</tt> function returns successfully, VFS obtains the
|
|
reference to the filesystem module via call to <tt>get_filesystem(fs_type)</tt> in
|
|
<tt>fs/super.c:get_sb_bdev()</tt> and a reference to the block device.
|
|
|
|
Now, let us examine what happens when we do I/O on the filesystem. We already
|
|
examined how inodes are read when <tt>iget()</tt> is called and how they are released
|
|
on <tt>iput().</tt> Reading inodes sets up, among other things, <tt>inode->i_op</tt> and
|
|
<tt>inode->i_fop</tt>; opening a file will propagate <tt>inode->i_fop</tt> into <tt>file->f_op</tt>.
|
|
|
|
Let us examine the code path of the <bf>link(2)</bf> system call. The implementation
|
|
of the system call is in <tt>fs/namei.c:sys_link()</tt>:
|
|
|
|
<enum>
|
|
<item>The userspace names are copied into kernel space by means of <tt>getname()</tt>
|
|
function which does the error checking.
|
|
|
|
<item>These names are converted to <tt>nameidata</tt> using <tt>path_init()/path_walk()</tt>
interaction with dcache. The result is stored in the <tt>old_nd</tt> and <tt>nd</tt>
structures.
|
|
|
|
<item>If <tt>old_nd.mnt != nd.mnt</tt> then "cross-device link" <tt>EXDEV</tt> is returned -
one cannot link between filesystems. In Linux this translates into:
one cannot link between mounted instances of a filesystem (and so, in
particular, between different filesystems).
|
|
|
|
<item>A new dentry is created corresponding to <tt>nd</tt> by <tt>lookup_create()</tt> .
|
|
|
|
<item>A generic <tt>vfs_link()</tt> function is called which checks if we can
|
|
create a new entry in the directory and invokes the <tt>dir->i_op->link()</tt>
|
|
method which brings us back to filesystem-specific
|
|
<tt>fs/bfs/dir.c:bfs_link()</tt> function.
|
|
|
|
<item>Inside <tt>bfs_link()</tt>, we check if we are trying to link a directory and
|
|
if so, refuse with <tt>EPERM</tt> error. This is the same behaviour as standard (ext2).
|
|
|
|
<item>We attempt to add a new directory entry to the specified directory
|
|
by calling the helper function <tt>bfs_add_entry()</tt> which goes through all
|
|
entries looking for unused slot (<tt>de->ino == 0</tt>) and, when found, writes
|
|
out the name/inode pair into the corresponding block and marks it
|
|
dirty (at non-superblock priority).
|
|
|
|
<item>If we successfully added the directory entry then there is no way
|
|
to fail the operation so we increment <tt>inode->i_nlink</tt>, update
|
|
<tt>inode->i_ctime</tt> and mark this inode dirty as well as instantiating the
|
|
new dentry with the inode.
|
|
</enum>
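The slot search performed by the helper in the directory-entry step can be sketched on an in-memory array. This is a userspace model and not <tt>bfs_add_entry()</tt>: the structure layout, the <tt>model_*</tt> names, the name length limit of 14 and the choice of error codes are assumptions of the sketch, and no buffer cache interaction is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MODEL_NAMELEN 14   /* assumed maximum name length for this sketch */

/* Hypothetical in-memory directory entry; ino == 0 marks an unused slot. */
struct model_dirent {
	unsigned short ino;
	char name[MODEL_NAMELEN];   /* not NUL-terminated when fully used */
};

/* Store (name, ino) in the first unused slot; 0 on success, negative errno on failure. */
static int model_add_entry(struct model_dirent *dir, size_t nslots,
			   const char *name, unsigned short ino)
{
	size_t i;

	assert(ino != 0);           /* 0 is reserved to mean "unused slot" */
	if (strlen(name) > MODEL_NAMELEN)
		return -ENAMETOOLONG;
	for (i = 0; i < nslots; i++) {
		if (dir[i].ino == 0) {
			memset(dir[i].name, 0, MODEL_NAMELEN);
			memcpy(dir[i].name, name, strlen(name));
			dir[i].ino = ino;
			return 0;
		}
	}
	return -ENOSPC;             /* every slot is occupied */
}
```

Two names may carry the same inode number, which is exactly what a hard link is; in this model the caller would increment the link count after a successful return.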
|
|
|
|
Other related inode operations like <tt>unlink()/rename()</tt> etc work in a similar
|
|
way, so not much is gained by examining them all in detail.
|
|
|
|
<sect1>Execution Domains and Binary Formats<p>
|
|
|
|
Linux supports loading user application binaries from disk. More
|
|
interestingly, the binaries can be stored in different formats and the
|
|
operating system's response to programs via system calls can deviate from
|
|
norm (norm being the Linux behaviour) as required, in order to emulate
|
|
formats found in other flavours of UNIX (COFF, etc) and also to emulate
|
|
system calls behaviour of other flavours (Solaris, UnixWare, etc). This is
|
|
what execution domains and binary formats are for.
|
|
|
|
Each Linux task has a personality stored in its <tt>task_struct</tt> (<tt>p->personality</tt>).
|
|
The currently existing (either in the official kernel or as addon patch)
|
|
personalities include support for FreeBSD, Solaris, UnixWare, OpenServer and
|
|
many other popular operating systems.
|
|
The value of <tt>current->personality</tt> is split into two parts:
|
|
|
|
<enum>
|
|
<item>high three bytes - bug emulation: <tt>STICKY_TIMEOUTS</tt>, <tt>WHOLE_SECONDS</tt>, etc.
|
|
<item>low byte - personality proper, a unique number.
|
|
</enum>
|
|
|
|
By changing the personality, we can change
|
|
the way the operating system treats certain system calls, for example
|
|
adding <tt>STICKY_TIMEOUTS</tt> to <tt>current->personality</tt> makes the <bf>select(2)</bf> system call
|
|
preserve the value of last argument (timeout) instead of storing the
|
|
unslept time. Some buggy programs rely on buggy operating systems (non-Linux)
|
|
and so Linux provides a way to emulate bugs in cases where the source code
|
|
is not available and so bugs cannot be fixed.
|
|
|
|
Execution domain is a contiguous range of personalities implemented by a
|
|
single module. Usually a single execution domain implements a single
|
|
personality but sometimes it is possible to implement "close" personalities
|
|
in a single module without too many conditionals.
|
|
|
|
Execution domains are implemented in <tt>kernel/exec_domain.c</tt> and were completely
|
|
rewritten for 2.4 kernel, compared with 2.2.x. The list of execution domains
|
|
currently supported by the kernel, along with the range of personalities
|
|
they support, is available by reading the <tt>/proc/execdomains</tt> file. Execution
|
|
domains, except the <tt>PER_LINUX</tt> one, can be implemented as dynamically
|
|
loadable modules.
|
|
|
|
The user interface is via the <bf>personality(2)</bf> system call, which sets the current
process' personality or returns the value of <tt>current->personality</tt> if the
argument is set to the impossible personality value 0xffffffff. Obviously, the
behaviour of this system call itself does not depend on personality.
|
|
|
|
The kernel interface to execution domains registration consists of two
|
|
functions:
|
|
|
|
<itemize>
|
|
<item><tt>int register_exec_domain(struct exec_domain *)</tt>: registers the
|
|
execution domain by linking it into the singly-linked list <tt>exec_domains</tt>
|
|
under the write protection of the read-write spinlock <tt>exec_domains_lock</tt>.
|
|
Returns 0 on success, non-zero on failure.
|
|
|
|
<item><tt>int unregister_exec_domain(struct exec_domain *)</tt>: unregisters the
|
|
execution domain by unlinking it from the <tt>exec_domains</tt> list, again using
|
|
<tt>exec_domains_lock</tt> spinlock in write mode. Returns 0 on success.
|
|
|
|
</itemize>
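
The list handling behind these two functions can be modelled in user space
roughly as follows. This is a minimal sketch under stated assumptions:
<tt>struct exec_domain_sketch</tt> and the function names are hypothetical stand-ins,
the error codes are an assumption, and the locking is only indicated in comments.

<tscreen><code>
#include <errno.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for struct exec_domain. */
struct exec_domain_sketch {
    const char *name;
    unsigned char pers_low, pers_high;   /* range of personalities */
    struct exec_domain_sketch *next;     /* link in the list */
};

static struct exec_domain_sketch *domains;  /* list head */

/* Link ep at the head of the list. The kernel would hold
 * exec_domains_lock in write mode around this update. */
static int register_domain(struct exec_domain_sketch *ep)
{
    struct exec_domain_sketch *tmp;

    if (ep == NULL)
        return -EINVAL;
    for (tmp = domains; tmp != NULL; tmp = tmp->next)
        if (tmp == ep)
            return -EBUSY;               /* already registered */
    ep->next = domains;
    domains = ep;
    return 0;
}

/* Unlink ep from the list; again a write-side operation. */
static int unregister_domain(struct exec_domain_sketch *ep)
{
    struct exec_domain_sketch *prev = NULL, *cur = domains;

    while (cur != NULL && cur != ep) {
        prev = cur;
        cur = cur->next;
    }
    if (cur == NULL)
        return -EINVAL;                  /* not on the list */
    if (prev == NULL)
        domains = ep->next;
    else
        prev->next = ep->next;
    ep->next = NULL;
    return 0;
}

/* Read-only walk, of the kind a /proc/execdomains reader needs. */
static int count_domains(void)
{
    const struct exec_domain_sketch *tmp;
    int n = 0;

    for (tmp = domains; tmp != NULL; tmp = tmp->next)
        n++;
    return n;
}
</code></tscreen>

Only the first two functions modify the list, which is why a read-write lock
suits this data structure.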
|
|
|
|
The reason why <tt>exec_domains_lock</tt> is a read-write lock is that only registration
and unregistration requests modify the list, whilst doing
<bf>cat /proc/execdomains</bf> calls <tt>kernel/exec_domain.c:get_exec_domain_list()</tt>, which
needs only read access to the list. Registering a new execution domain
defines a "lcall7 handler" and a signal number conversion map. Actually, the
ABI patch extends this concept of exec domain to include extra information
(like socket options, socket types, address family and errno maps).
|
|
|
|
The binary formats are implemented in a similar manner, i.e. a singly-linked
list <tt>formats</tt> is defined in <tt>fs/exec.c</tt> and is protected by a read-write lock
<tt>binfmt_lock</tt>. As with <tt>exec_domains_lock</tt>, the <tt>binfmt_lock</tt> is taken for reading on
most occasions except for registration/unregistration of binary formats.
Registering a new binary format enhances the <bf>execve(2)</bf> system call with new
<tt>load_binary()/load_shlib()</tt> functions as well as the ability to <tt>core_dump()</tt>. The
<tt>load_shlib()</tt> method is used only by the old <bf>uselib(2)</bf> system call while
the <tt>load_binary()</tt> method is called by <tt>search_binary_handler()</tt> from
<tt>do_execve()</tt>, which implements the <bf>execve(2)</bf> system call.
|
|
|
|
The personality of the process is determined at binary format loading by
the corresponding format's <tt>load_binary()</tt> method using some heuristics.
For example, to determine UnixWare7 binaries one first marks the binary
using the <bf>elfmark(1)</bf> utility, which sets the ELF header's <tt>e_flags</tt> to the magic
value 0x314B4455, which is detected at ELF loading time, and
<tt>current->personality</tt> is set to <tt>PER_UW7</tt>. If this heuristic fails, then a more
generic one, such as treating ELF interpreter paths like <tt>/usr/lib/ld.so.1</tt> or
<tt>/usr/lib/libc.so.1</tt> as
indicating an SVR4 binary, is used and the personality is set to <tt>PER_SVR4</tt>. One
could write a little utility program that uses Linux's <bf>ptrace(2)</bf> capabilities
to single-step the code and force a running program into any personality.
|
|
|
|
Once the personality (and therefore <tt>current->exec_domain</tt>) is known, system
calls are handled as follows. Let us assume that a process makes a system
call by means of the lcall7 gate instruction. This transfers control to
<tt>ENTRY(lcall7)</tt> of <tt>arch/i386/kernel/entry.S</tt> because it was prepared in
<tt>arch/i386/kernel/traps.c:trap_init()</tt>. After appropriate stack layout
conversion, <tt>entry.S:lcall7</tt> obtains the pointer to <tt>exec_domain</tt> from <tt>current</tt>,
then loads the lcall7 handler stored at a fixed offset within the <tt>exec_domain</tt> (the offset is
hardcoded as 4 in the asm code, so you can't shift the <tt>handler</tt> field around in
the C declaration of <tt>struct exec_domain</tt>) and jumps to it. So, in C, such a handler would
look like this:
|
|
|
|
<tscreen><code>
|
|
static void UW7_lcall7(int segment, struct pt_regs * regs)
|
|
{
|
|
abi_dispatch(regs, &uw7_funcs[regs->eax & 0xff], 1);
|
|
}
|
|
</code></tscreen>
|
|
|
|
where <tt>abi_dispatch()</tt> is a wrapper around the table of function pointers that
|
|
implement this personality's system calls <tt>uw7_funcs</tt>.
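
The idea of dispatching through a per-personality table can be illustrated
with a small user-space model. Everything here is hypothetical: <tt>struct regs_sketch</tt>,
the table contents and the slot number 20 are assumptions made for the
example, and the real <tt>abi_dispatch()</tt> takes more arguments than this sketch models.

<tscreen><code>
#include <errno.h>

/* Hypothetical register snapshot; only the field used here. */
struct regs_sketch {
    unsigned long eax;
};

typedef long (*abi_fn)(struct regs_sketch *);

static long sketch_getpid(struct regs_sketch *regs)
{
    (void)regs;
    return 42;                  /* pretend process id */
}

static long sketch_nosys(struct regs_sketch *regs)
{
    (void)regs;
    return -ENOSYS;             /* call not implemented */
}

/* One table per personality, indexed by the low byte of eax. */
static abi_fn sketch_funcs[256];

static void sketch_init_table(void)
{
    int i;

    for (i = 0; i < 256; i++)
        sketch_funcs[i] = sketch_nosys;
    sketch_funcs[20] = sketch_getpid;   /* slot number chosen arbitrarily */
}

/* Pick the handler from the table and call it. */
static long sketch_lcall7(struct regs_sketch *regs)
{
    return sketch_funcs[regs->eax & 0xff](regs);
}
</code></tscreen>

With <tt>eax</tt> set to 0x0114 the low byte selects slot 20, so the sketch returns 42;
any other slot falls through to the default handler.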
|
|
|
|
<sect>Linux Page Cache<p>
|
|
|
|
In this chapter we describe the Linux 2.4 pagecache. The pagecache
|
|
is - as the name suggests - a cache of physical pages. In the UNIX world the
|
|
concept of a pagecache became popular with the introduction of SVR4 UNIX,
|
|
where it replaced the buffercache for data IO operations.
|
|
|
|
While the SVR4 pagecache is only used as a filesystem data cache and thus uses
the struct vnode and an offset into the file as hash parameters, the Linux page
cache is designed to be more generic, and therefore uses a struct address_space
(explained below) as the first parameter. Because the Linux pagecache is tightly
coupled to the notion of address spaces, you will need at least a basic
understanding of address_spaces to understand the way the pagecache works.
An address_space is a kind of software MMU that maps all pages of one object
(e.g. an inode) onto another medium (typically physical disk blocks).
The struct address_space is defined in <tt>include/linux/fs.h</tt> as:
|
|
|
|
<tscreen><code>
|
|
struct address_space {
|
|
struct list_head clean_pages;
|
|
struct list_head dirty_pages;
|
|
struct list_head locked_pages;
|
|
unsigned long nrpages;
|
|
struct address_space_operations *a_ops;
|
|
struct inode *host;
|
|
struct vm_area_struct *i_mmap;
|
|
struct vm_area_struct *i_mmap_shared;
|
|
spinlock_t i_shared_lock;
|
|
|
|
};
|
|
</code></tscreen>
|
|
|
|
To understand the way address_spaces work, we only need to look at a few of these fields:
<tt>clean_pages</tt>, <tt>dirty_pages</tt> and <tt>locked_pages</tt> are doubly linked lists
of all clean, dirty and locked pages that belong to this address_space, and <tt>nrpages</tt>
is the total number of pages in this address_space. <tt>a_ops</tt> defines the methods of
this object and <tt>host</tt> is a pointer to the inode this address_space belongs to;
it may also be NULL, e.g. in the case of the swapper address_space
(<tt>mm/swap_state.c</tt>).
|
|
|
|
The usage of <tt>clean_pages</tt>, <tt>dirty_pages</tt>, <tt>locked_pages</tt> and
|
|
<tt>nrpages</tt> is obvious, so we will take a closer look at the
|
|
<tt>address_space_operations</tt> structure, defined in the same header:
|
|
|
|
<tscreen><code>
|
|
struct address_space_operations {
|
|
int (*writepage)(struct page *);
|
|
int (*readpage)(struct file *, struct page *);
|
|
int (*sync_page)(struct page *);
|
|
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
|
|
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
|
|
int (*bmap)(struct address_space *, long);
|
|
};
|
|
</code></tscreen>
|
|
|
|
For a basic understanding of the principle of address_spaces (and the pagecache) we need
|
|
to take a look at -><tt>writepage</tt> and -><tt>readpage</tt>, but in practice we need
|
|
to take a look at -><tt>prepare_write</tt> and -><tt>commit_write</tt>, too.
|
|
|
|
You can probably guess what the address_space_operations methods do
|
|
by virtue of their names alone; nevertheless, they do require some
|
|
explanation. Their use in the course of filesystem data I/O, by
|
|
far the most common path through the pagecache, provides a good
|
|
way of understanding them.
|
|
Unlike most other UNIX-like operating systems, Linux has generic file
|
|
operations (a subset of the SYSVish vnode operations) for data IO through the
|
|
pagecache. This means that the data will not directly interact with the
filesystem on read/write/mmap, but will be read/written from/to the pagecache
|
|
whenever possible. The pagecache has to get data from the actual low-level
|
|
filesystem in case the user wants to read from a page not yet in memory,
|
|
or write data to disk in case memory gets low.
|
|
|
|
In the read path the generic methods will first try to find a page that
|
|
matches the wanted inode/index tuple.
|
|
|
|
<tscreen>
|
|
hash = page_hash(inode->i_mapping, index);
|
|
</tscreen>
|
|
|
|
Then we test whether the page actually exists.
|
|
|
|
<tscreen>
|
|
hash = page_hash(inode->i_mapping, index);
|
|
page = __find_page_nolock(inode->i_mapping, index, *hash);
|
|
</tscreen>
|
|
|
|
When it does not exist, we allocate a new free page, and add it to the
pagecache hash.
|
|
|
|
<tscreen>
|
|
page = page_cache_alloc();
|
|
__add_to_page_cache(page, mapping, index, hash);
|
|
</tscreen>
|
|
|
|
After the page is hashed we use the -><tt>readpage</tt> address_space operation to
|
|
actually fill the page with data. (file is an open instance of inode).
|
|
|
|
<tscreen>
|
|
error = mapping->a_ops->readpage(file, page);
|
|
</tscreen>
|
|
|
|
Finally we can copy the data to userspace.
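
The read path described above can be put together in a small user-space model.
This is a sketch only: all <tt>_sketch</tt> types and helper names are hypothetical,
the hash function is arbitrary, and locking, reference counting and
read-ahead are left out.

<tscreen><code>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16   /* tiny "page" to keep the example small */
#define SKETCH_HASH_SIZE 8

struct mapping_sketch;

struct page_sketch {
    struct mapping_sketch *mapping;
    unsigned long index;
    struct page_sketch *next_hash;      /* hash chain */
    char data[SKETCH_PAGE_SIZE];
};

struct mapping_sketch {
    int (*readpage)(struct page_sketch *);  /* stand-in for a_ops->readpage */
    unsigned long nrpages;
};

static struct page_sketch *page_hash_table[SKETCH_HASH_SIZE];
static int readpage_calls;              /* counts calls, for demonstration */

/* Map a (mapping, index) tuple to a hash bucket. */
static struct page_sketch **hash_bucket(struct mapping_sketch *m,
                                        unsigned long index)
{
    uintptr_t key = ((uintptr_t)m >> 4) + index;

    return page_hash_table + (key % SKETCH_HASH_SIZE);
}

/* Walk one hash chain looking for the wanted tuple. */
static struct page_sketch *find_page(struct mapping_sketch *m,
                                     unsigned long index,
                                     struct page_sketch *p)
{
    for (; p != NULL; p = p->next_hash)
        if (p->mapping == m && p->index == index)
            return p;
    return NULL;
}

/* Look the page up; on a miss allocate, hash and fill it. */
static struct page_sketch *read_cached_page(struct mapping_sketch *m,
                                            unsigned long index)
{
    struct page_sketch **hash = hash_bucket(m, index);
    struct page_sketch *page = find_page(m, index, *hash);

    if (page != NULL)
        return page;                    /* cache hit, no IO needed */
    page = calloc(1, sizeof(*page));
    if (page == NULL)
        return NULL;
    page->mapping = m;
    page->index = index;
    page->next_hash = *hash;            /* add to the hash chain */
    *hash = page;
    m->nrpages++;
    if (m->readpage(page) != 0) {       /* fill the page with data */
        *hash = page->next_hash;        /* undo the hashing on error */
        m->nrpages--;
        free(page);
        return NULL;
    }
    return page;
}

/* Example readpage method: fills the page with a letter per index. */
static int fill_with_letter(struct page_sketch *page)
{
    readpage_calls++;
    memset(page->data, 'A' + (int)(page->index % 26), SKETCH_PAGE_SIZE);
    return 0;
}
</code></tscreen>

Reading the same index twice through <tt>read_cached_page()</tt> calls the readpage
method only once; the second lookup is served from the hash.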
|
|
|
|
For writing to the filesystem two paths exist: one for writable mappings
|
|
(mmap) and one for the write(2) family of syscalls. The mmap case is very
|
|
simple, so it will be discussed first.
|
|
When a user modifies mappings, the VM subsystem marks the page dirty.
|
|
|
|
<tscreen>
|
|
SetPageDirty(page);
|
|
</tscreen>
|
|
|
|
The bdflush kernel thread that is trying to free pages, either as background
activity or because memory gets low, will try to call -><tt>writepage</tt> on the pages
that are explicitly marked dirty. The -><tt>writepage</tt> method now has to
write the page's content back to disk and free the page.
|
|
|
|
The second write path is _much_ more complicated. For each page the user
|
|
writes to, we are basically doing the following:
|
|
(for the full code see <tt>mm/filemap.c:generic_file_write()</tt>).
|
|
|
|
<tscreen>
|
|
page = __grab_cache_page(mapping, index, &cached_page);
|
|
mapping->a_ops->prepare_write(file, page, offset, offset+bytes);
|
|
copy_from_user(kaddr+offset, buf, bytes);
|
|
mapping->a_ops->commit_write(file, page, offset, offset+bytes);
|
|
</tscreen>
|
|
|
|
So first we try to find the hashed page or allocate a new one, then we call the
-><tt>prepare_write</tt> address_space method, copy the user buffer to kernel memory and
finally call the -><tt>commit_write</tt> method. As you have probably seen,
-><tt>prepare_write</tt> and -><tt>commit_write</tt> are fundamentally different from -><tt>readpage</tt>
and -><tt>writepage</tt>, because they are called not only when physical IO is actually
wanted but every time the user modifies the file.
There are two (or more?) ways to handle this. The first one uses the Linux
buffercache to delay the physical IO, by filling a <tt>page->buffers</tt> pointer with
buffer_heads that will be used in try_to_free_buffers (<tt>fs/buffer.c</tt>) to
request IO once memory gets low; it is very widely used in the current
kernel. The other way just sets the page dirty and relies on -><tt>writepage</tt> to do
all the work. Due to the lack of a validity bitmap in struct page this does
not work with filesystems that have a smaller granularity than <tt>PAGE_SIZE</tt>.
|
|
|
|
<sect>IPC mechanisms<p>
|
|
This chapter describes the semaphore, shared memory, and
|
|
message queue IPC mechanisms as implemented in the Linux 2.4
|
|
kernel. It is organized into four sections. The
|
|
first three sections cover the interfaces and support functions
|
|
for <ref id="semaphores" name="semaphores">,
|
|
<ref id="message" name="message queues">,
|
|
and <ref id="sharedmem" name="shared memory"> respectively.
|
|
The <ref id="ipc_primitives" name="last"> section describes
|
|
a set of common functions and data structures that are shared by
|
|
all three mechanisms.
|
|
|
|
<sect1>Semaphores<label id="semaphores"><p>
|
|
The functions described in this section implement the user level
|
|
semaphore mechanisms. Note that this implementation relies on the
|
|
use of kernel spinlocks and kernel semaphores. To avoid confusion,
the term "kernel semaphore" will be used in reference to kernel
semaphores. All other uses of the word "semaphore" will be in
|
|
reference to the user level semaphores.
|
|
|
|
<sect2>Semaphore System Call Interfaces<label id="sem_apis"><p>
|
|
|
|
<sect3>sys_semget()<label id="sys_semget"><p>
|
|
The entire call to sys_semget() is protected by the
|
|
global <ref id="struct_ipc_ids" name="sem_ids.sem">
|
|
kernel semaphore.
|
|
|
|
In the case where a new set of semaphores must be
|
|
created, the <ref id="newary" name="newary()"> function is
|
|
called to create and initialize a new semaphore set. The ID of
|
|
the new set is returned to the caller.
|
|
|
|
In the case where a key value is provided for an existing
|
|
semaphore set, <ref id="ipc_findkey" name="ipc_findkey()">
|
|
is invoked to look up the corresponding semaphore descriptor
|
|
array index. The parameters and permissions of the caller are
|
|
verified before returning the semaphore set ID.
|
|
</sect3>
|
|
|
|
<sect3>sys_semctl()<label id="sys_semctl"><p>
|
|
For the <ref id="IPC_INFO_and_SEM_INFO" name="IPC_INFO">,
|
|
<ref id="IPC_INFO_and_SEM_INFO" name="SEM_INFO">, and
|
|
<ref id="SEM_STAT" name="SEM_STAT"> commands,
|
|
<ref id="semctl_nolock" name="semctl_nolock()">
|
|
is called to perform the necessary functions.
|
|
|
|
For the <ref id="GETALL" name="GETALL">, <ref id="GETVAL" name="GETVAL">,
|
|
<ref id="GETPID" name="GETPID">, <ref id="GETNCNT" name="GETNCNT">,
|
|
<ref id="GETZCNT" name="GETZCNT">, <ref id="IPC_STAT" name="IPC_STAT">,
|
|
<ref id="SETVAL" name="SETVAL">, and <ref id="SETALL" name="SETALL"> commands,
|
|
<ref id="semctl_main" name="semctl_main()"> is called to perform the
|
|
necessary functions.
|
|
|
|
For the <ref id="semctl_ipc_rmid" name="IPC_RMID">
|
|
and <ref id="semctl_ipc_set" name="IPC_SET"> commands,
|
|
<ref id="semctl_down" name="semctl_down()"> is called
|
|
to perform the necessary functions. Throughout both of these
|
|
operations, the global <ref id="struct_ipc_ids" name="sem_ids.sem">
|
|
kernel semaphore is held.
|
|
</sect3>
|
|
|
|
<sect3>sys_semop()<label id="sys_semop"><p>
|
|
After validating the call parameters, the semaphore
|
|
operations data is copied from user space to a temporary buffer.
|
|
If a small temporary buffer is sufficient, then a stack buffer is
|
|
used. Otherwise, a larger buffer is allocated. After copying in the
|
|
semaphore operations data, the global semaphores spinlock is
|
|
locked, and the user-specified semaphore set ID is validated.
|
|
Access permissions for the semaphore set are also validated.
|
|
|
|
All of the user-specified semaphore operations are parsed.
|
|
During this process, a count is maintained of all the operations that
|
|
have the SEM_UNDO flag set. A <tt>decrease</tt> flag is set if any of the
|
|
operations subtract from a semaphore value, and an <tt>alter</tt> flag is set
|
|
if any of the semaphore values are modified (i.e. increased or
|
|
decreased). The number of each
|
|
semaphore to be modified is validated.
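
The parsing step can be sketched as a single pass over the operations array.
This is an illustration only: <tt>struct sembuf_sketch</tt> mirrors the layout of
<ref id="struct_sembuf" name="struct sembuf">, while the result structure,
the function name and the value of <tt>SKETCH_SEM_UNDO</tt> are assumptions.

<tscreen><code>
#define SKETCH_SEM_UNDO 0x1000   /* assumed flag value, for illustration */

struct sembuf_sketch {
    unsigned short sem_num;      /* semaphore index in array */
    short sem_op;                /* semaphore operation */
    short sem_flg;               /* operation flags */
};

struct semop_scan {
    int undos;      /* number of operations with the undo flag set */
    int decrease;   /* some operation subtracts from a semaphore */
    int alter;      /* some operation changes a semaphore value */
    int bad_index;  /* some sem_num is outside the set */
};

/* Scan the operations once and collect the flags described above. */
static struct semop_scan scan_sops(const struct sembuf_sketch *sops,
                                   unsigned int nsops, unsigned long nsems)
{
    struct semop_scan r = { 0, 0, 0, 0 };
    unsigned int i;

    for (i = 0; i < nsops; i++) {
        if (sops[i].sem_num >= nsems) {
            r.bad_index = 1;     /* the system call would fail here */
            break;
        }
        if (sops[i].sem_flg & SKETCH_SEM_UNDO)
            r.undos++;
        if (sops[i].sem_op < 0)
            r.decrease = 1;
        if (sops[i].sem_op != 0)
            r.alter = 1;
    }
    return r;
}
</code></tscreen>

Note that a pure wait-for-zero request (all <tt>sem_op</tt> values equal to 0) leaves
both <tt>decrease</tt> and <tt>alter</tt> clear.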
|
|
|
|
If SEM_UNDO was asserted for any of the semaphore operations,
|
|
then the undo list for the current task is searched for an undo
|
|
structure associated with this semaphore set. During this search,
|
|
if the semaphore set ID of any of the undo structures is found
|
|
to be -1, then <ref id="freeundos" name="freeundos()">
|
|
is called to free the undo structure
|
|
and remove it from the list. If no undo structure is found for
|
|
this semaphore set then <ref id="alloc_undo" name="alloc_undo()">
|
|
is called to allocate and initialize one.
|
|
|
|
The <ref id="try_atomic_semop" name="try_atomic_semop()">
|
|
function is called with the <tt>do_undo</tt>
|
|
parameter equal to 0 in order to execute the sequence of
|
|
operations. The return value indicates that either the
|
|
operations passed, failed, or were not executed because
|
|
they need to block. Each of these cases is further described below:
|
|
|
|
<sect4>Non-blocking Semaphore Operations<label id="Non-blocking_Semaphore_Operations"><p>
|
|
The <ref id="try_atomic_semop" name="try_atomic_semop()">
|
|
function returns zero to indicate that all operations in the
|
|
sequence succeeded. In this case,
|
|
<ref id="update_queue" name="update_queue()">
|
|
is called to traverse the queue of pending semaphore
|
|
operations for the semaphore set and awaken any
|
|
sleeping tasks that no longer need to block. This completes the
|
|
execution of the sys_semop() system call for this case.
|
|
</sect4>
|
|
|
|
<sect4>Failing Semaphore Operations<label id="Failing_Semaphore_Operations"><p>
|
|
If <ref id="try_atomic_semop" name="try_atomic_semop()">
|
|
returns a negative value, then a failure condition was encountered.
|
|
In this case, none of the operations have been executed.
|
|
This occurs when either a semaphore operation would cause an
|
|
invalid semaphore value, or an operation marked IPC_NOWAIT is
|
|
unable to complete. The error condition is then returned to the
|
|
caller of sys_semop().
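
The all-or-nothing behaviour and the three kinds of return value can be
modelled as follows. This is a simplified sketch, not the kernel function: it
works on a plain array of values, ignores undo adjustments and pids, and the
constants <tt>SKETCH_IPC_NOWAIT</tt> and <tt>SKETCH_SEMVMX</tt> as well as the error codes are
assumptions made for the example.

<tscreen><code>
#include <errno.h>

#define SKETCH_IPC_NOWAIT 04000   /* assumed flag value */
#define SKETCH_SEMVMX     32767   /* assumed maximum semaphore value */

struct sembuf_sketch {
    unsigned short sem_num;
    short sem_op;
    short sem_flg;
};

/* Returns 0 if all operations were applied, 1 if the caller would
 * have to block, or a negative error code. Unless 0 is returned,
 * semval[] is left exactly as it was found. */
static int try_semops_sketch(int *semval, const struct sembuf_sketch *sops,
                             int nsops)
{
    int i, result;

    for (i = 0; i < nsops; i++) {
        int *cur = semval + sops[i].sem_num;
        int op = sops[i].sem_op;

        if (op == 0 && *cur != 0)
            goto would_block;         /* wait-for-zero not satisfied */
        *cur += op;
        if (*cur < 0)
            goto would_block;         /* value would drop below zero */
        if (*cur > SKETCH_SEMVMX) {
            result = -ERANGE;         /* invalid semaphore value */
            goto undo;
        }
    }
    return 0;

would_block:
    result = (sops[i].sem_flg & SKETCH_IPC_NOWAIT) ? -EAGAIN : 1;
undo:
    /* Roll back operations i..0; a wait-for-zero entry subtracts 0. */
    for (; i >= 0; i--)
        semval[sops[i].sem_num] -= sops[i].sem_op;
    return result;
}
</code></tscreen>

For example, with values {1, 0} a request to decrement both semaphores
returns 1 and leaves the values untouched, even though the first decrement
alone could have succeeded.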
|
|
|
|
Before sys_semop() returns, a call is made to
|
|
<ref id="update_queue" name="update_queue()"> to traverse
|
|
the queue of pending semaphore operations for the semaphore set
|
|
and awaken any sleeping tasks that no longer need to block.
|
|
</sect4>
|
|
|
|
<sect4>Blocking Semaphore Operations<label id="Blocking_Semaphore_Operations"><p>
|
|
The <ref id="try_atomic_semop" name="try_atomic_semop()">
|
|
function returns 1 to indicate that the
|
|
sequence of semaphore operations was not executed because
|
|
one of the semaphores would block. For this case, a new
|
|
<ref id="struct_sem_queue" name="sem_queue"> element is
|
|
initialized containing these semaphore operations. If any of
|
|
these operations would alter the state of the semaphore, then
|
|
the new queue element is added at the tail of the queue.
|
|
Otherwise, the new queue element is added at the head of the queue.
|
|
|
|
The <tt>semsleeping</tt> element of the current
|
|
task is set to indicate that the task is sleeping on this
|
|
<ref id="struct_sem_queue" name="sem_queue"> element.
|
|
The current task is marked as TASK_INTERRUPTIBLE, and the
|
|
<tt>sleeper</tt> element of the
|
|
<ref id="struct_sem_queue" name="sem_queue">
|
|
is set to identify this task as the sleeper. The
|
|
global semaphore spinlock is then unlocked, and schedule() is called
|
|
to put the current task to sleep.
|
|
|
|
When awakened, the task re-locks the global semaphore spinlock,
|
|
determines why it was awakened, and how it should
|
|
respond. The following cases are handled:
|
|
|
|
<itemize>
|
|
<item> If the semaphore set has been removed, then
|
|
the system call fails with EIDRM.
|
|
|
|
<item> If the <tt>status</tt> element of the
|
|
<ref id="struct_sem_queue" name="sem_queue"> structure
|
|
is set to 1, then the task was awakened in order to retry the
|
|
semaphore operations. Another call to
|
|
<ref id="try_atomic_semop" name="try_atomic_semop()"> is
|
|
made to execute the sequence of semaphore operations. If
|
|
try_atomic_semop() returns 1, then the task must block again
|
|
as described above. Otherwise, 0 is returned for success,
|
|
or an appropriate error code is returned in case of failure.
|
|
|
|
Before sys_semop() returns, current->semsleeping is cleared,
|
|
and the <ref id="struct_sem_queue" name="sem_queue">
|
|
is removed from the queue. If any of the specified semaphore
|
|
operations were altering operations (increase or decrease),
|
|
then <ref id="update_queue" name="update_queue()"> is
|
|
called to traverse the queue of pending semaphore operations
|
|
for the semaphore set and awaken any sleeping tasks that no
|
|
longer need to block.
|
|
|
|
<item> If the <tt>status</tt> element of the
|
|
<ref id="struct_sem_queue" name="sem_queue"> structure is
|
|
NOT set to 1, and the
|
|
<ref id="struct_sem_queue" name="sem_queue"> element has
|
|
not been dequeued, then the task was awakened by an interrupt.
|
|
In this case, the system call fails with EINTR. Before
|
|
returning, current->semsleeping is cleared, and the
|
|
<ref id="struct_sem_queue" name="sem_queue"> is removed
|
|
from the queue. Also,
|
|
<ref id="update_queue" name="update_queue()"> is called
|
|
if any of the operations were altering operations.
|
|
|
|
<item> If the <tt>status</tt> element of the
|
|
<ref id="struct_sem_queue" name="sem_queue"> structure is
|
|
NOT set to 1, and the
|
|
<ref id="struct_sem_queue" name="sem_queue"> element
|
|
has been dequeued,
|
|
then the semaphore operations have already been executed by
|
|
<ref id="update_queue" name="update_queue()">. The
|
|
queue <tt>status</tt>, which could be 0 for success
|
|
or a negated error code for failure, becomes the return value of
|
|
the system call.
|
|
|
|
</itemize>
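
The four cases above amount to a small decision table, sketched here in C.
The enumeration and the function are hypothetical names introduced for
illustration; the three inputs correspond to "the set has been removed",
the <tt>status</tt> element, and whether the element is still on the queue.

<tscreen><code>
enum wake_outcome {
    WAKE_SET_REMOVED,   /* fail with EIDRM */
    WAKE_RETRY,         /* status is 1: try the operations again */
    WAKE_INTERRUPTED,   /* still queued: fail with EINTR */
    WAKE_COMPLETED      /* dequeued: status is the return value */
};

/* Decide how an awakened task should respond. */
static enum wake_outcome classify_wakeup(int set_removed, int status,
                                         int still_queued)
{
    if (set_removed)
        return WAKE_SET_REMOVED;
    if (status == 1)
        return WAKE_RETRY;
    if (still_queued)
        return WAKE_INTERRUPTED;
    return WAKE_COMPLETED;
}
</code></tscreen>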
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2>Semaphore Specific Support Structures<label id="sem_structures"><p>
|
|
The following structures are used specifically for semaphore support:
|
|
|
|
<sect3>struct sem_array<label id="struct_sem_array"><p>
|
|
<tscreen><code>
|
|
/* One sem_array data structure for each set of semaphores in the system. */
|
|
struct sem_array {
|
|
struct kern_ipc_perm sem_perm; /* permissions .. see ipc.h */
|
|
time_t sem_otime; /* last semop time */
|
|
time_t sem_ctime; /* last change time */
|
|
struct sem *sem_base; /* ptr to first semaphore in array */
|
|
struct sem_queue *sem_pending; /* pending operations to be processed */
|
|
struct sem_queue **sem_pending_last; /* last pending operation */
|
|
struct sem_undo *undo; /* undo requests on this array */
|
|
unsigned long sem_nsems; /* no. of semaphores in array */
|
|
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct sem<label id="struct_sem"><p>
|
|
<tscreen><code>
|
|
/* One semaphore structure for each semaphore in the system. */
|
|
struct sem {
|
|
int semval; /* current value */
|
|
int sempid; /* pid of last operation */
|
|
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct seminfo<label id="struct_seminfo"><p>
|
|
<tscreen><code>
|
|
struct seminfo {
|
|
int semmap;
|
|
int semmni;
|
|
int semmns;
|
|
int semmnu;
|
|
int semmsl;
|
|
int semopm;
|
|
int semume;
|
|
int semusz;
|
|
int semvmx;
|
|
int semaem;
|
|
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct semid64_ds<label id="struct_semid64_ds"><p>
|
|
<tscreen><code>
|
|
struct semid64_ds {
|
|
struct ipc64_perm sem_perm; /* permissions .. see
|
|
ipc.h */
|
|
__kernel_time_t sem_otime; /* last semop time */
|
|
unsigned long __unused1;
|
|
__kernel_time_t sem_ctime; /* last change time */
|
|
unsigned long __unused2;
|
|
unsigned long sem_nsems; /* no. of semaphores in
|
|
array */
|
|
unsigned long __unused3;
|
|
unsigned long __unused4;
|
|
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct sem_queue<label id="struct_sem_queue"><p>
|
|
<tscreen><code>
|
|
/* One queue for each sleeping process in the system. */
|
|
struct sem_queue {
|
|
struct sem_queue * next; /* next entry in the queue */
|
|
struct sem_queue ** prev; /* previous entry in the queue, *(q->prev) == q */
|
|
struct task_struct* sleeper; /* this process */
|
|
struct sem_undo * undo; /* undo structure */
|
|
int pid; /* process id of requesting process */
|
|
int status; /* completion status of operation */
|
|
struct sem_array * sma; /* semaphore array for operations */
|
|
int id; /* internal sem id */
|
|
struct sembuf * sops; /* array of pending operations */
|
|
int nsops; /* number of operations */
|
|
int alter; /* operation will alter semaphore */
|
|
};
|
|
</code></tscreen>
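
The <tt>prev</tt> element is a pointer to the pointer that refers to this element,
which is what makes the invariant <tt>*(q->prev) == q</tt> hold for the first entry as
well. The following user-space sketch shows how such a queue can be appended
to, prepended to and unlinked from. The <tt>_sketch</tt> types and function names are
hypothetical, and only the elements needed for the list handling are modelled.

<tscreen><code>
#include <stddef.h>

struct queue_sketch {
    struct queue_sketch *next;
    struct queue_sketch **prev;   /* address of the pointer pointing at us */
    int alter;                    /* operation will alter a semaphore */
};

struct set_sketch {
    struct queue_sketch *pending;        /* first pending element */
    struct queue_sketch **pending_last;  /* address of the last next pointer */
};

static void queue_init(struct set_sketch *s)
{
    s->pending = NULL;
    s->pending_last = &s->pending;
}

/* Add q at the tail of the queue. */
static void queue_append(struct set_sketch *s, struct queue_sketch *q)
{
    q->next = NULL;
    q->prev = s->pending_last;
    *s->pending_last = q;
    s->pending_last = &q->next;
}

/* Add q at the head of the queue. */
static void queue_prepend(struct set_sketch *s, struct queue_sketch *q)
{
    q->next = s->pending;
    q->prev = &s->pending;
    if (q->next != NULL)
        q->next->prev = &q->next;
    else
        s->pending_last = &q->next;
    s->pending = q;
}

/* Unlink q; no list walk is needed thanks to prev. */
static void queue_remove(struct set_sketch *s, struct queue_sketch *q)
{
    *q->prev = q->next;
    if (q->next != NULL)
        q->next->prev = q->prev;
    else
        s->pending_last = q->prev;
    q->prev = NULL;               /* marks the element as dequeued */
}

/* Altering requests go to the tail, all others to the head. */
static void queue_add(struct set_sketch *s, struct queue_sketch *q)
{
    if (q->alter)
        queue_append(s, q);
    else
        queue_prepend(s, q);
}
</code></tscreen>

A cleared <tt>prev</tt> pointer is a cheap way for the sleeper to tell, after waking
up, whether somebody else has already taken its element off the queue.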
|
|
</sect3>
|
|
|
|
<sect3>struct sembuf<label id="struct_sembuf"><p>
|
|
<tscreen><code>
|
|
/* semop system calls takes an array of these. */
|
|
struct sembuf {
|
|
unsigned short sem_num; /* semaphore index in array */
|
|
short sem_op; /* semaphore operation */
|
|
short sem_flg; /* operation flags */
|
|
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct sem_undo<label id="struct_sem_undo"><p>
|
|
<tscreen><code>
|
|
/* Each task has a list of undo requests. They are executed automatically
|
|
* when the process exits.
|
|
*/
|
|
struct sem_undo {
|
|
struct sem_undo * proc_next; /* next entry on this process */
|
|
struct sem_undo * id_next; /* next entry on this semaphore set */
|
|
int semid; /* semaphore set identifier */
|
|
short * semadj; /* array of adjustments, one per
|
|
semaphore */
|
|
};
|
|
</code></tscreen>
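
The role of the <tt>semadj</tt> array can be illustrated with a small user-space model.
It is a sketch under assumptions: the fixed set size, the <tt>_sketch</tt> names and
the value of <tt>SKETCH_SEM_UNDO</tt> are made up for the example, and the caller must
keep <tt>sem_num</tt> below <tt>SKETCH_NSEMS</tt>.

<tscreen><code>
#define SKETCH_SEM_UNDO 0x1000   /* assumed flag value */
#define SKETCH_NSEMS    4        /* size of the example set */

struct sembuf_sketch {
    unsigned short sem_num;
    short sem_op;
    short sem_flg;
};

struct undo_sketch {
    short semadj[SKETCH_NSEMS];  /* one adjustment per semaphore */
};

/* Apply one operation and, if requested, record its inverse. */
static void apply_op(int *semval, struct undo_sketch *un,
                     const struct sembuf_sketch *sop)
{
    semval[sop->sem_num] += sop->sem_op;
    if (sop->sem_flg & SKETCH_SEM_UNDO)
        un->semadj[sop->sem_num] -= sop->sem_op;
}

/* What happens at process exit: add the adjustments back. */
static void run_undo_at_exit(int *semval, struct undo_sketch *un)
{
    int i;

    for (i = 0; i < SKETCH_NSEMS; i++) {
        semval[i] += un->semadj[i];
        un->semadj[i] = 0;
    }
}
</code></tscreen>

An increment of 2 made with the undo flag is thus taken back when the process
exits, while an operation made without the flag is left in place.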
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2>Semaphore Support Functions<label id="sem_primitives"><p>
|
|
The following functions are used specifically in support of
|
|
semaphores:
|
|
|
|
<sect3>newary()<label id="newary"><p>
|
|
newary() relies on the <ref id="ipc_alloc" name="ipc_alloc()">
|
|
function to allocate the memory
|
|
required for the new semaphore set. It allocates enough memory
|
|
for the semaphore set descriptor and for each of the semaphores
|
|
in the set. The allocated memory is cleared, and the address of the
|
|
first element of the semaphore set descriptor is passed to
|
|
<ref id="ipc_addid" name="ipc_addid()">.
|
|
<ref id="ipc_addid" name="ipc_addid()"> reserves an array entry
|
|
for the new semaphore set descriptor and initializes the
|
|
(<ref id="struct_kern_ipc_perm" name="struct kern_ipc_perm">) data for the set.
|
|
The global <tt>used_sems</tt> variable is updated by the number of
|
|
semaphores in the new set and the initialization of the
|
|
(<ref id="struct_kern_ipc_perm" name="struct kern_ipc_perm">)
|
|
data for the new set is completed. The other
initialization steps performed for this set are listed below:
|
|
|
|
<itemize>
|
|
<item> The <tt>sem_base</tt> element for the set is initialized
|
|
to the address immediately following the
|
|
(<ref id="struct_sem_array" name="struct sem_array">)
|
|
portion of the newly allocated data. This corresponds to
|
|
the location of the first semaphore in the set.
|
|
|
|
<item> The <tt>sem_pending</tt> queue is initialized as empty.
|
|
</itemize>
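
The single allocation holding both the descriptor and the semaphores, and the
way <tt>sem_base</tt> is derived from it, can be sketched in user space like this. The
<tt>_sketch</tt> structures are cut-down stand-ins, and <tt>calloc()</tt> takes the place of
<ref id="ipc_alloc" name="ipc_alloc()"> followed by clearing the memory.

<tscreen><code>
#include <stdlib.h>

struct sem_sketch {
    int semval;
    int sempid;
};

struct sem_array_sketch {
    struct sem_sketch *sem_base;   /* first semaphore in the set */
    unsigned long sem_nsems;       /* number of semaphores */
};

/* Allocate the descriptor and nsems semaphores in one zeroed block. */
static struct sem_array_sketch *alloc_sem_set(unsigned long nsems)
{
    size_t size = sizeof(struct sem_array_sketch)
                  + nsems * sizeof(struct sem_sketch);
    struct sem_array_sketch *sma = calloc(1, size);

    if (sma == NULL)
        return NULL;
    /* The semaphores start immediately after the descriptor. */
    sma->sem_base = (struct sem_sketch *)&sma[1];
    sma->sem_nsems = nsems;
    return sma;
}
</code></tscreen>

Because everything lives in one block, a single <tt>free()</tt> of the descriptor
releases the semaphores as well.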
|
|
|
|
All of the operations following the call to <ref id="ipc_addid" name="ipc_addid()">
|
|
are performed while holding the global semaphores spinlock. After
|
|
unlocking the global semaphores spinlock, newary() calls
|
|
<ref id="ipc_buildid" name="ipc_buildid()">
|
|
(via sem_buildid()). This function uses the index
|
|
of the semaphore set descriptor to create a unique ID, that is then
|
|
returned to the caller of newary().
|
|
|
|
</sect3>
|
|
|
|
<sect3>freeary()<label id="freeary"><p>
|
|
freeary() is called by <ref id="semctl_down" name="semctl_down()"> to perform the
|
|
functions listed below. It is called with the global semaphores
|
|
spinlock locked and it returns with the spinlock unlocked.
|
|
|
|
<itemize>
|
|
<item> The <ref id="func_ipc_rmid" name="ipc_rmid()"> function
|
|
is called (via the
|
|
sem_rmid() wrapper) to delete the ID for the semaphore
|
|
set and to retrieve a pointer to the semaphore set.
|
|
|
|
<item> The undo list for the semaphore set is invalidated.
|
|
<item> All pending processes are awakened and caused to fail
|
|
with EIDRM.
|
|
|
|
<item> The number of used semaphores is reduced by the number
|
|
of semaphores in the removed set.
|
|
|
|
<item> The memory associated with the semaphore set is freed.
|
|
</itemize>
|
|
</sect3>
|
|
|
|
<sect3>semctl_down()<label id="semctl_down"><p>
|
|
semctl_down() provides the <ref id="semctl_ipc_rmid" name="IPC_RMID"> and
|
|
<ref id="semctl_ipc_set" name="IPC_SET"> operations of the
|
|
semctl() system call. The semaphore set ID and the access permissions
|
|
are verified prior to either of these operations, and in either
|
|
case, the global semaphore spinlock is held throughout the
|
|
operation.
|
|
|
|
<sect4>IPC_RMID<label id="semctl_ipc_rmid"><p>
|
|
The IPC_RMID operation calls <ref id="freeary" name="freeary()"> to remove the semaphore set.
|
|
</sect4>
|
|
|
|
<sect4>IPC_SET<label id="semctl_ipc_set"><p>
|
|
The IPC_SET operation updates the <tt>uid</tt>, <tt>gid</tt>,
|
|
<tt>mode</tt>, and <tt>ctime</tt> elements of the semaphore set.
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3>semctl_nolock()<label id="semctl_nolock"><p>
|
|
semctl_nolock() is called by <ref id="sys_semctl" name="sys_semctl()">
|
|
to perform the IPC_INFO, SEM_INFO and SEM_STAT functions.
|
|
|
|
<sect4>IPC_INFO and SEM_INFO<label id="IPC_INFO_and_SEM_INFO"><p>
|
|
IPC_INFO and SEM_INFO cause a temporary <ref id="struct_seminfo" name="seminfo">
|
|
buffer to be initialized and loaded with unchanging semaphore
|
|
statistical data. Then, while holding the global <tt>sem_ids.sem</tt>
|
|
kernel semaphore, the <tt>semusz</tt> and <tt>semaem</tt> elements of
|
|
the <ref id="struct_seminfo" name="seminfo"> structure are
|
|
updated according to the given command (IPC_INFO or SEM_INFO).
|
|
The return value of the system call is set to the maximum
|
|
semaphore set ID.
|
|
</sect4>
|
|
|
|
<sect4>SEM_STAT<label id="SEM_STAT"><p>
|
|
SEM_STAT causes a temporary <ref id="struct_semid64_ds" name="semid64_ds">
|
|
buffer to be initialized. The global
|
|
semaphore spinlock is then held while copying the <tt>sem_otime</tt>,
|
|
<tt>sem_ctime</tt>, and <tt>sem_nsems</tt> values into the buffer. This data is
|
|
then copied to user space.
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3>semctl_main()<label id="semctl_main"><p>
|
|
semctl_main() is called by <ref id="sys_semctl" name="sys_semctl()"> to perform many
|
|
of the supported functions, as described in the subsections below.
|
|
Prior to performing any of the following operations, semctl_main()
|
|
locks the global semaphore spinlock and validates the
|
|
semaphore set ID and the permissions. The spinlock is released
|
|
before returning.
|
|
|
|
<sect4>GETALL<label id="GETALL"><p>
|
|
The GETALL operation loads the current semaphore values into
|
|
a temporary kernel buffer and copies
|
|
them out to user space. The small stack buffer is used if the
|
|
semaphore set is small. Otherwise, the spinlock is temporarily
|
|
dropped in order to allocate a larger buffer. The spinlock is
|
|
held while copying the semaphore values in to the temporary buffer.
|
|
</sect4>
|
|
|
|
<sect4>SETALL<label id="SETALL"><p>
|
|
The SETALL operation copies semaphore values from user space into a temporary buffer,
|
|
and then into the semaphore set. The spinlock is dropped while
|
|
copying the values from user space into the temporary buffer,
|
|
and while verifying reasonable values. If the semaphore set
|
|
is small, then a stack buffer is used, otherwise a larger buffer
|
|
is allocated. The spinlock is regained and held while the
|
|
following operations are performed on the semaphore set:
|
|
|
|
<itemize>
|
|
<item> The semaphore values are copied into the semaphore set.
|
|
<item> The semaphore adjustments of the undo queue for
|
|
the semaphore set are cleared.
|
|
|
|
<item> The <tt>sem_ctime</tt> value for the semaphore set is set.
|
|
|
|
<item> The <ref id="update_queue" name="update_queue()">
|
|
function is called to traverse
|
|
the queue of pending semops and look for any tasks that
|
|
can be completed as a result of the SETALL operation. Any
|
|
pending tasks that are no longer blocked are awakened.
|
|
</itemize>
|
|
</sect4>
|
|
|
|
<sect4>IPC_STAT<label id="IPC_STAT"><p>
|
|
In the IPC_STAT operation, the <tt>sem_otime</tt>,
|
|
<tt>sem_ctime</tt>, and <tt>sem_nsems</tt> values are copied into
|
|
a stack buffer. The data is then copied to user space after
|
|
dropping the spinlock.
|
|
</sect4>
|
|
|
|
<sect4>GETVAL<label id="GETVAL"><p>
|
|
For GETVAL in the non-error case, the return value for the system call is
|
|
set to the value of the specified semaphore.
|
|
</sect4>
|
|
|
|
<sect4>GETPID<label id="GETPID"><p>
|
|
For GETPID in the non-error case, the return value for the system call is
|
|
set to the <tt>pid</tt> associated with the last operation on the
|
|
semaphore.
|
|
</sect4>
|
|
|
|
<sect4>GETNCNT<label id="GETNCNT"><p>
|
|
For GETNCNT in the non-error case, the return value for the system call
|
|
is set to the number of processes waiting for the value of the semaphore
to increase. This number is calculated by the
|
|
<ref id="count_semncnt" name="count_semncnt()"> function.
|
|
</sect4>
|
|
|
|
<sect4>GETZCNT<label id="GETZCNT"><p>
|
|
For GETZCNT in the non-error case, the return value for the system call
|
|
is set to the number of processes waiting on the semaphore
|
|
being set to zero. This number is calculated by the
|
|
<ref id="count_semzcnt" name="count_semzcnt()"> function.
|
|
</sect4>
|
|
|
|
<sect4>SETVAL<label id="SETVAL"><p>
|
|
After validating the new semaphore value, the following
|
|
operations are performed:
|
|
|
|
<itemize>
|
|
<item> The undo queue is searched for any adjustments to
|
|
this semaphore. Any adjustments that are found are reset to
|
|
zero.
|
|
|
|
<item> The semaphore value is set to the value provided.
|
|
<item> The <tt>sem_ctime</tt> value for the semaphore set is updated.
|
|
<item> The <ref id="update_queue" name="update_queue()">
|
|
function is called to traverse
|
|
the queue of pending semops and look for any tasks that
|
|
can be completed as a result of the
|
|
<ref id="SETVAL" name="SETVAL"> operation. Any
|
|
pending tasks that are no longer blocked are awakened.
|
|
</itemize>
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3>count_semncnt()<label id="count_semncnt"><p>
|
|
count_semncnt() counts the number of tasks waiting for the value of a semaphore
to increase.
|
|
</sect3>
|
|
|
|
<sect3>count_semzcnt()<label id="count_semzcnt"><p>
|
|
count_semzcnt() counts the number of tasks waiting on the value of a semaphore
|
|
to be zero.
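
The two counting functions can be pictured with the following minimal sketch.
It is not the kernel code: the <tt>sketch_</tt> types are hypothetical
stand-ins for the pending queue, and it assumes that a task "waiting for an
increase" is one with a pending decrement (a negative operation value), while
a task "waiting for zero" has a pending operation value of zero.

```c
#include <stddef.h>

/* Hypothetical, simplified single semaphore operation. */
struct sketch_op {
    int num;   /* index of the semaphore within the set */
    int op;    /* <0 decrement, 0 wait-for-zero, >0 increment */
};

/* Hypothetical pending-queue element: one blocked sequence of operations. */
struct sketch_pending {
    const struct sketch_op *ops;
    size_t nops;
};

/* Count pending operations on semaphore semnum that try to decrement it. */
size_t sketch_count_semncnt(const struct sketch_pending *q, size_t nq,
                            int semnum)
{
    size_t count = 0, i, j;

    for (i = 0; i < nq; i++)
        for (j = 0; j < q[i].nops; j++)
            if (q[i].ops[j].num == semnum && q[i].ops[j].op < 0)
                count++;
    return count;
}

/* Count pending operations on semaphore semnum that wait for zero. */
size_t sketch_count_semzcnt(const struct sketch_pending *q, size_t nq,
                            int semnum)
{
    size_t count = 0, i, j;

    for (i = 0; i < nq; i++)
        for (j = 0; j < q[i].nops; j++)
            if (q[i].ops[j].num == semnum && q[i].ops[j].op == 0)
                count++;
    return count;
}
```

Both functions walk every pending sequence because a single blocked task may
be waiting on several semaphores of the set at once.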
|
|
</sect3>
|
|
|
|
<sect3>update_queue()<label id="update_queue"><p>
|
|
update_queue() traverses the queue of pending semops for
|
|
a semaphore set and calls
|
|
<ref id="try_atomic_semop" name="try_atomic_semop()">
|
|
to determine which sequences of semaphore operations
|
|
would succeed. If the status of the queue element
|
|
indicates that blocked tasks have already
|
|
been awakened, then the queue element is skipped over. For other
|
|
elements of the queue, the <tt>q->alter</tt> flag
|
|
is passed as the undo parameter to
|
|
<ref id="try_atomic_semop" name="try_atomic_semop()">,
|
|
indicating that any
|
|
altering operations should be undone before returning.
|
|
|
|
If the sequence of operations would block, then
|
|
update_queue() returns without making any changes.
|
|
|
|
A sequence of operations can fail if one of the semaphore
|
|
operations would cause an invalid semaphore value, or an
|
|
operation marked IPC_NOWAIT is unable to complete. In such a
|
|
case, the task that is blocked on the sequence of semaphore
|
|
operations is awakened, and the queue status is set with an
|
|
appropriate error code. The queue element is also dequeued.
|
|
|
|
If the sequence of operations is non-altering, then
|
|
they would have passed a zero value as the undo parameter to
|
|
<ref id="try_atomic_semop" name="try_atomic_semop()">.
|
|
If these operations succeeded, then they
|
|
are considered complete and are removed from the queue.
|
|
The blocked task is awakened, and the queue element
|
|
<tt>status</tt> is set to indicate success.
|
|
|
|
If the sequence of operations would alter the semaphore
|
|
values, but can succeed, then sleeping tasks that no longer
|
|
need to be blocked are awakened. The queue status is set to
|
|
1 to indicate that the blocked task has been awakened. The
|
|
operations have not been performed, so the queue element is not
|
|
removed from the queue. The semaphore operations would be
|
|
executed by the awakened task.
|
|
</sect3>
|
|
|
|
<sect3>try_atomic_semop()<label id="try_atomic_semop"><p>
|
|
try_atomic_semop() is called by <ref id="sys_semop" name="sys_semop()">
|
|
and <ref id="update_queue" name="update_queue()">
|
|
to determine if a sequence of semaphore operations will all
|
|
succeed. It determines this by attempting to perform each of the
|
|
operations.
|
|
|
|
If a blocking operation is encountered, then the process
|
|
is aborted and all operations are reversed. -EAGAIN is returned
|
|
if IPC_NOWAIT is set. Otherwise 1 is returned to indicate that
|
|
the sequence of semaphore operations is blocked.
|
|
|
|
If a semaphore value is adjusted beyond system limits, then
|
|
all operations are reversed, and -ERANGE is returned.
|
|
|
|
If all operations in the sequence succeed, and the <tt>do_undo</tt>
|
|
parameter is non-zero, then all operations are reversed, and 0
|
|
is returned. If the <tt>do_undo</tt> parameter is zero, then all operations
|
|
succeeded and remain in force, and the <tt>sem_otime</tt> field of the
|
|
semaphore set is updated.
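
The all-or-nothing behaviour described above can be sketched in user space as
follows. This is a simplified illustration rather than the kernel function:
the <tt>sketch_</tt> names, the flat array of values, the limit
<tt>SKETCH_SEMVMX</tt> and the flag <tt>SKETCH_NOWAIT</tt> are all assumptions
made for the example, and the <tt>sem_otime</tt> update is omitted.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_SEMVMX 32767   /* assumed upper limit for a semaphore value */
#define SKETCH_NOWAIT 1       /* stand-in for IPC_NOWAIT */

/* Hypothetical, simplified form of one semaphore operation. */
struct sketch_sembuf {
    int num;    /* index of the semaphore in the set */
    int op;     /* >0 add, <0 subtract, 0 wait-for-zero */
    int flags;  /* SKETCH_NOWAIT or 0 */
};

/*
 * Try to apply all nops operations to vals[].
 * Returns 0 on success, 1 if the sequence would block,
 * -EAGAIN if it would block and SKETCH_NOWAIT is set,
 * -ERANGE if a value would exceed SKETCH_SEMVMX.
 * On any non-zero result, or when do_undo is non-zero,
 * vals[] is restored to its original contents.
 */
int sketch_try_atomic_semop(int *vals, const struct sketch_sembuf *ops,
                            size_t nops, int do_undo)
{
    size_t i;
    int result = 0;

    for (i = 0; i < nops; i++) {
        int cur = vals[ops[i].num];

        if (ops[i].op == 0 && cur != 0) {        /* wait-for-zero blocks */
            result = (ops[i].flags & SKETCH_NOWAIT) ? -EAGAIN : 1;
            break;
        }
        if (cur + ops[i].op < 0) {               /* decrement would block */
            result = (ops[i].flags & SKETCH_NOWAIT) ? -EAGAIN : 1;
            break;
        }
        if (cur + ops[i].op > SKETCH_SEMVMX) {   /* beyond the limit */
            result = -ERANGE;
            break;
        }
        vals[ops[i].num] = cur + ops[i].op;
    }

    if (result != 0 || do_undo) {
        /* Reverse every operation that was applied (indices 0..i-1). */
        while (i > 0) {
            i--;
            vals[ops[i].num] -= ops[i].op;
        }
    }
    return result;
}
```

Calling the sketch with a non-zero <tt>do_undo</tt> answers the question
"would this sequence succeed right now?" without leaving any change behind.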
|
|
</sect3>
|
|
|
|
<sect3>sem_revalidate()<label id="sem_revalidate"><p>
|
|
sem_revalidate() is called when the global semaphores spinlock
|
|
has been temporarily dropped and needs to be locked again. It is
|
|
called by <ref id="semctl_main" name="semctl_main()">
|
|
and <ref id="alloc_undo" name="alloc_undo()">. It validates the
|
|
semaphore ID and permissions and on success, returns with the
|
|
global semaphores spinlock locked.
|
|
</sect3>
|
|
|
|
<sect3>freeundos()<label id="freeundos"><p>
|
|
freeundos() traverses the process undo list in search of
|
|
the desired undo structure. If found, the undo structure is removed from the
|
|
list and freed. A pointer to the next undo structure on the
|
|
process list is returned.
|
|
</sect3>
|
|
|
|
<sect3>alloc_undo()<label id="alloc_undo"><p>
|
|
alloc_undo() expects to be called with the global semaphores
|
|
spinlock locked. In the case of an error, it returns with it
|
|
unlocked.
|
|
|
|
The global semaphores spinlock is unlocked, and kmalloc() is
|
|
called to allocate sufficient memory for both the
|
|
<ref id="struct_sem_undo" name="sem_undo">
|
|
structure, and also an array of one adjustment value for each
|
|
semaphore in the set. On success, the global spinlock is regained
|
|
with a call to <ref id="sem_revalidate" name="sem_revalidate()">.
|
|
|
|
The new sem_undo structure is then initialized, and the address
of this structure is placed at the address provided by the
caller. The new undo structure is then placed at the head of the undo
|
|
list for the current task.
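
The single allocation that holds both the structure and its adjustment array
can be sketched as below. This is a user-space illustration only:
<tt>malloc()</tt> stands in for kmalloc(), the <tt>sketch_undo</tt> layout is a
hypothetical reduction of the real structure, and all locking and
revalidation is left out.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced undo structure. */
struct sketch_undo {
    struct sketch_undo *proc_next;  /* next entry on the task's undo list */
    int semid;                      /* semaphore set this entry belongs to */
    short *semadj;                  /* one adjustment per semaphore */
};

/*
 * Allocate one block holding the structure followed by nsems adjustment
 * values, zero it, and push it on the head of the list at *head.
 * Returns the new entry, or NULL if the allocation fails.
 */
struct sketch_undo *sketch_alloc_undo(struct sketch_undo **head,
                                      int semid, size_t nsems)
{
    size_t size = sizeof(struct sketch_undo) + nsems * sizeof(short);
    struct sketch_undo *un = malloc(size);

    if (un == NULL)
        return NULL;
    memset(un, 0, size);
    un->semadj = (short *)(un + 1);  /* array starts right after the struct */
    un->semid = semid;
    un->proc_next = *head;
    *head = un;
    return un;
}
```

Because the array lives in the same block, freeing the structure also frees
the adjustments, so one <tt>free()</tt> per entry is enough in this sketch.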
|
|
</sect3>
|
|
|
|
<sect3>sem_exit()<label id="sem_exit"><p>
|
|
sem_exit() is called by do_exit(), and is responsible for
|
|
executing all of the undo adjustments for the exiting task.
|
|
|
|
If the current process was blocked on a semaphore, then it is
|
|
removed from the <ref id="struct_sem_queue" name="sem_queue">
|
|
list while holding the global semaphores spinlock.
|
|
|
|
The undo list for the current task is then traversed. The
global semaphores spinlock is acquired and released around the
processing of each element of the list. The following operations
are performed for each of the undo elements:
|
|
|
|
<itemize>
|
|
<item> The undo structure and the semaphore set ID are validated.
|
|
<item> The undo list of the corresponding semaphore set is
|
|
searched to find a reference to the same undo structure and to
|
|
remove it from that list.
|
|
<item> The adjustments indicated in the undo structure are
|
|
applied to the semaphore set.
|
|
<item> The <tt>sem_otime</tt> parameter of the semaphore set is updated.
|
|
<item> <ref id="update_queue" name="update_queue()"> is called
|
|
to traverse the queue of
|
|
pending semops and awaken any sleeping tasks that no longer
|
|
need to be blocked as a result of executing the undo
|
|
operations.
|
|
<item> The undo structure is freed.
|
|
</itemize>
|
|
|
|
When the processing of the list is complete, the
|
|
current->semundo value is cleared.
|
|
</sect3>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1>Message queues<label id= "message"><p>
|
|
<sect2>Message System Call Interfaces<label id="Message_System_Call_Interfaces"><p>
|
|
<sect3>sys_msgget()<label id="sys_msgget"><p>
|
|
The entire call to sys_msgget() is protected by
|
|
the global message queue semaphore
|
|
(<ref id="struct_ipc_ids" name="msg_ids.sem">).
|
|
|
|
In the case where a new message queue must be created,
|
|
the <ref id="newque" name="newque()"> function is
|
|
called to create and initialize
|
|
a new message queue, and the new queue ID is returned to
|
|
the caller.
|
|
|
|
If a key value is provided for an existing message queue,
|
|
then <ref id="ipc_findkey" name="ipc_findkey()"> is called
|
|
to look up the corresponding index in the global message queue
|
|
descriptor array (msg_ids.entries). The
|
|
parameters and permissions of the caller are verified before
|
|
returning the message queue ID. The lookup operation and
verification are performed while the global message queue
spinlock (msg_ids.ary) is held.
|
|
</sect3>
|
|
|
|
<sect3>sys_msgctl()<label id="sys_msgctl"><p>
|
|
The parameters passed to sys_msgctl() are: a message
|
|
queue ID (<tt>msqid</tt>), the operation
|
|
(<tt>cmd</tt>), and a pointer to a user space buffer of type
|
|
<ref id="struct_msqid_ds" name="msqid_ds">
|
|
(<tt>buf</tt>). Six operations are
|
|
provided in this function: IPC_INFO, MSG_INFO, IPC_STAT,
MSG_STAT, IPC_SET and IPC_RMID. The message queue
ID and the operation parameters are validated; then, the operation (<tt>cmd</tt>)
|
|
is performed as follows:
|
|
|
|
<sect4>IPC_INFO (or MSG_INFO)<label id="msgctl_IPCINFO"><p>
|
|
The global message queue information is copied to user space.
|
|
</sect4>
|
|
|
|
<sect4>IPC_STAT (or MSG_STAT)<label id="msgctl_IPCSTAT"><p>
|
|
A temporary buffer of type <ref id="struct_msqid64_ds" name="struct msqid64_ds">
|
|
is initialized and the global message queue spinlock is locked.
|
|
After verifying the access permissions of the calling process,
|
|
the message queue information associated with the message
|
|
queue ID is loaded into the temporary buffer, the global
|
|
message queue spinlock is unlocked, and the contents of
|
|
the temporary buffer are copied out to user space by
|
|
<ref id="copy_msqid_to_user" name="copy_msqid_to_user()">.
|
|
</sect4>
|
|
|
|
<sect4>IPC_SET<label id="msgctl_IPCSET"><p>
|
|
The user data is copied in via
<ref id="copy_msqid_from_user" name="copy_msqid_from_user()">. The global
message queue semaphore and spinlock are then obtained; both are released
at the end of the operation. After the message queue ID and the current
|
|
process access permissions are validated, the message queue
|
|
information is updated with the user provided data. Later,
|
|
<ref id="expunge_all" name="expunge_all()"> and
|
|
<ref id="ss_wakeup" name="ss_wakeup()">
|
|
are called to wake up all
|
|
processes sleeping on the receiver and sender waiting queues
|
|
of the message queue. This is because some receivers may now
|
|
be excluded by stricter access permissions and some senders
|
|
may now be able to send the message due to an increased
|
|
queue size.
|
|
</sect4>
|
|
|
|
<sect4>IPC_RMID<label id="msgctl_IPCRMID"><p>
|
|
The global message queue semaphore
|
|
is obtained and the global message queue spinlock is locked.
|
|
After validating the message queue ID and the current task
|
|
access permissions, <ref id="freeque" name="freeque()">
|
|
is called to free the resources related to the message queue ID.
|
|
The global message queue semaphore and spinlock are released.
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3>sys_msgsnd()<label id="sys_msgsnd"><p>
|
|
sys_msgsnd() receives as parameters a message queue ID
|
|
(<tt>msqid</tt>), a pointer to a buffer of type
|
|
<tt>struct msgbuf</tt>
|
|
(<tt>msgp</tt>), the size of the message to be sent
|
|
(<tt>msgsz</tt>), and a flag indicating wait vs.
|
|
not wait (<tt>msgflg</tt>). There are two task waiting
|
|
queues and one message waiting queue associated with the message
|
|
queue ID. If there is a task in the receiver waiting queue
|
|
that is waiting for this message, then the message is
|
|
delivered directly to the receiver, and the receiver is
|
|
awakened. Otherwise, if there is enough space available in
|
|
the message waiting queue, the message is saved in this
|
|
queue. As a last resort, the sending task enqueues itself
|
|
on the sender waiting queue. A more in-depth discussion of the
|
|
operations performed by sys_msgsnd() follows:
|
|
|
|
<enum>
|
|
<item> Validates the user buffer address and the message
|
|
type, then invokes
|
|
<ref id="load_msg" name="load_msg()"> to load the
|
|
contents of the user message into a temporary object
|
|
<tt><label id="msg">msg</tt> of type
|
|
<ref id="struct_msg_msg" name="struct msg_msg">.
|
|
The message type and message size fields
|
|
of <tt>msg</tt> are also initialized.
|
|
<item> Locks the global message queue spinlock and gets
|
|
the message queue descriptor associated with the
|
|
message queue ID. If no such message queue exists,
|
|
returns EINVAL.
|
|
<item><label id="sndretry">
|
|
Invokes <ref id="ipc_checkid" name="ipc_checkid()">
|
|
(via msg_checkid()) to verify that the message
|
|
queue ID is valid and calls
|
|
<ref id="ipcperms" name="ipcperms()"> to check the
|
|
calling process' access permissions.
|
|
<item> Checks the message size and the space left in
|
|
the message waiting queue to see if there is enough
|
|
room to store the message. If not, the following
|
|
substeps are performed:
|
|
|
|
<enum>
|
|
<item> If IPC_NOWAIT is specified in
|
|
<tt>msgflg</tt> the global message
|
|
queue spinlock is unlocked, the memory
|
|
resources for the message are freed, and EAGAIN
|
|
is returned.
|
|
<item> Invokes
|
|
<ref id="ss_add" name="ss_add()"> to
|
|
enqueue the current
|
|
task in the sender waiting queue. It also unlocks
|
|
the global message queue spinlock and invokes
|
|
schedule() to put the current task to sleep.
|
|
<item> When awakened, obtains the global spinlock
|
|
again and verifies that the message queue ID
|
|
is still valid. If the message queue ID is not valid,
|
|
EIDRM is returned.
|
|
<item> Invokes <ref id="ss_del" name="ss_del()">
|
|
to remove the sending task from the sender
|
|
waiting queue. If there is any signal pending
|
|
for the task, sys_msgsnd() unlocks the
|
|
global spinlock,
|
|
invokes <ref id="free_msg" name="free_msg()">
|
|
to free the message buffer,
|
|
and returns EINTR. Otherwise, the function goes
|
|
<ref id="sndretry" name="back">
|
|
to check again whether there is enough space
|
|
in the message waiting queue.
|
|
</enum>
|
|
<item> Invokes
|
|
<ref id="pipelined_send" name="pipelined_send()">
|
|
to try to send the message to the waiting receiver directly.
|
|
<item> If there is no receiver waiting for this message,
|
|
enqueues <tt>msg</tt> into the message waiting
|
|
queue (msq->q_messages). Updates the
|
|
<tt>q_cbytes</tt> and
|
|
the <tt>q_qnum</tt> fields of the message
|
|
queue descriptor, as well as the global variables
|
|
<tt>msg_bytes</tt> and
|
|
<tt>msg_hdrs</tt>, which indicate the total
|
|
number of bytes used for messages and the total number
|
|
of messages system wide.
|
|
<item> If the message has been successfully sent or
|
|
enqueued, updates the <tt>q_lspid</tt>
|
|
and the <tt>q_stime</tt> fields
|
|
of the message queue descriptor and releases the global
|
|
message queue spinlock.
|
|
</enum>
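
The order of the decisions in the steps above can be condensed into a small
sketch. It only models the choice between the four outcomes; the
<tt>sketch_</tt> names, the reduced queue descriptor and the flag
<tt>SKETCH_NOWAIT</tt> are assumptions of the example, and the room check
shown (bytes on the queue plus the new message against the queue limit) is a
simplification.

```c
#include <stddef.h>

#define SKETCH_NOWAIT 1   /* stand-in for IPC_NOWAIT */

enum sketch_send_action {
    SKETCH_SEND_EAGAIN,    /* no room and the caller asked not to wait */
    SKETCH_SEND_SLEEP,     /* no room: wait on the sender waiting queue */
    SKETCH_SEND_DIRECT,    /* hand the message to a waiting receiver */
    SKETCH_SEND_ENQUEUE    /* store it on the message waiting queue */
};

/* Hypothetical, reduced message queue descriptor. */
struct sketch_queue {
    size_t cbytes;  /* bytes currently on the queue (cf. q_cbytes) */
    size_t qnum;    /* messages currently on the queue (cf. q_qnum) */
    size_t qbytes;  /* maximum bytes allowed (cf. q_qbytes) */
};

/* Decide what a sender does with a message of msgsz bytes. */
enum sketch_send_action sketch_msgsnd_decide(const struct sketch_queue *q,
                                             size_t msgsz, int msgflg,
                                             int receiver_waiting)
{
    if (q->cbytes + msgsz > q->qbytes)
        return (msgflg & SKETCH_NOWAIT) ? SKETCH_SEND_EAGAIN
                                        : SKETCH_SEND_SLEEP;
    if (receiver_waiting)
        return SKETCH_SEND_DIRECT;
    return SKETCH_SEND_ENQUEUE;
}

/* Bookkeeping performed when a message is stored on the queue. */
void sketch_account_enqueue(struct sketch_queue *q, size_t msgsz)
{
    q->cbytes += msgsz;
    q->qnum++;
}
```

A sender that gets <tt>SKETCH_SEND_SLEEP</tt> would repeat the decision after
being awakened, which corresponds to going back to the validity check.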
|
|
</sect3>
|
|
|
|
<sect3>sys_msgrcv()<label id="sys_msgrcv"><p>
|
|
The sys_msgrcv() function receives as parameters
|
|
a message queue ID
|
|
(<tt>msqid</tt>), a pointer to a buffer of type
|
|
<tt>struct msgbuf</tt>
|
|
(<tt>msgp</tt>), the desired
|
|
message size (<tt>msgsz</tt>), the message type
|
|
(<tt>msgtyp</tt>), and the flags
|
|
(<tt>msgflg</tt>). It searches the message waiting queue
|
|
associated with the message queue ID, finds the first
|
|
message in the queue which matches the request type, and
|
|
copies it into the given user buffer. If no such message
|
|
is found in the message waiting queue, the requesting task
|
|
is enqueued into the receiver waiting queue until the
|
|
desired message is available. A more in-depth discussion of the
|
|
operations performed by sys_msgrcv() follows:
|
|
|
|
<enum>
|
|
<item> First, invokes
|
|
<ref id="convert_mode" name="convert_mode()">
|
|
to derive the search mode from
|
|
<tt>msgtyp</tt>. sys_msgrcv() then locks
|
|
the global message
|
|
queue spinlock and obtains the message queue descriptor
|
|
associated with the message queue ID. If no such
|
|
message queue exists, it returns EINVAL.
|
|
<item> Checks whether the current task has the correct
|
|
permissions to access the message queue.
|
|
<item><label id="rcvretry">
|
|
Starting from the first message in the message
|
|
waiting queue, invokes
|
|
<ref id="testmsg" name="testmsg()"> to check whether
|
|
the message type matches the required type. sys_msgrcv()
|
|
continues searching until a matched message is found or the whole
|
|
waiting queue is exhausted. If the search mode is
|
|
SEARCH_LESSEQUAL, then the first message on the queue
|
|
with the lowest type less than or equal to
|
|
<tt>msgtyp</tt> is selected.
|
|
<item> If a message is found, sys_msgrcv() performs
|
|
the following substeps:
|
|
<enum>
|
|
<item> If the message size is larger than
the desired size and MSG_NOERROR is not set in
<tt>msgflg</tt>, unlocks the global
message queue spinlock and returns E2BIG.
|
|
<item> Removes the message from the message
|
|
waiting queue and updates the message queue
|
|
statistics.
|
|
<item> Wakes up all tasks sleeping on the senders
|
|
waiting queue. The removal of a message from
|
|
the queue in the previous step makes it possible
|
|
for one of the senders to progress. Goes to
|
|
the <ref id="laststep" name="last step">.
|
|
</enum>
|
|
<item> If no message matching the receiver's criteria is found
|
|
in the message waiting queue, then <tt>msgflg</tt>
|
|
is checked. If IPC_NOWAIT is set, then the global message
|
|
queue spinlock is unlocked and ENOMSG is returned. Otherwise,
|
|
the receiver is enqueued on the receiver waiting queue as
|
|
follows:
|
|
<enum>
|
|
<item> A <ref id="struct_msg_receiver" name="msg_receiver"> data structure
|
|
<tt>msr</tt> is allocated and is
|
|
added to the head of the receiver waiting queue.
<item> The <tt>r_tsk</tt> field of <tt>msr</tt>
is set to the current task.
|
|
<item> The <tt>r_msgtype</tt> and
|
|
<tt>r_mode</tt> fields are
|
|
initialized with the desired message type and
|
|
mode respectively.
|
|
<item> If <tt>msgflg</tt> indicates
MSG_NOERROR, then the <tt>r_maxsize</tt> field of
<tt>msr</tt> is set to INT_MAX, since an
oversized message may be truncated; otherwise
it is set to the value of <tt>msgsz</tt>.
|
|
<item> The <tt>r_msg</tt> field
|
|
is initialized to indicate that
|
|
no message has been received yet.
|
|
<item> After the initialization is complete,
|
|
the status of the receiving task is set to
|
|
TASK_INTERRUPTIBLE, the global message queue
|
|
spinlock is unlocked, and schedule() is invoked.
|
|
</enum>
|
|
<item> After the receiver is awakened,
|
|
the <tt>r_msg</tt> field of
|
|
<tt>msr</tt> is checked. This field is used to
|
|
store the pipelined message or in the case of an error,
|
|
to store the error status.
|
|
If the <tt>r_msg</tt> field is filled
|
|
with the desired message, then go to the
|
|
<ref id="laststep" name="last step">. Otherwise,
|
|
the global message queue spinlock is locked again.
|
|
<item> After obtaining the spinlock,
|
|
the <tt>r_msg</tt> field is
|
|
re-checked to see if the message was received while
|
|
waiting for the spinlock. If the message has been
|
|
received, the <ref id="laststep" name="last step">
|
|
occurs.
|
|
<item> If the <tt>r_msg</tt> field remains
|
|
unchanged, then the task was
|
|
awakened in order to retry. In this case,
|
|
<tt>msr</tt> is dequeued. If there is a
|
|
signal pending for the task, then the global message
|
|
queue spinlock is unlocked and EINTR is returned.
|
|
Otherwise, the function needs to go
|
|
<ref id="rcvretry" name="back"> and retry.
|
|
<item> If the <tt>r_msg</tt> field shows
|
|
that an error occurred
|
|
while sleeping, the global message queue spinlock
|
|
is unlocked and the error is returned.
|
|
<item><label id="laststep">
|
|
After verifying that the address of the user buffer
<tt>msgp</tt> is valid, the message type is loaded
into the <tt>mtype</tt> field of
<tt>msgp</tt>, and
<ref id="store_msg" name="store_msg()">
is invoked to copy the message contents to
the <tt>mtext</tt> field of
<tt>msgp</tt>. Finally, the memory for the message is
freed by <ref id="free_msg" name="free_msg()">.
|
|
</enum>
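
The search over the message waiting queue can be sketched as follows. The
example is an illustration under simplifying assumptions: the queue is
reduced to an array of message types, the sign of the hypothetical parameter
<tt>msgtyp</tt> selects the search rule directly, and the MSG_EXCEPT case is
left out.

```c
#include <stddef.h>

/*
 * Return the index of the message a receiver would get, or -1 if none
 * matches.
 *   msgtyp == 0: the first message on the queue
 *   msgtyp  > 0: the first message whose type equals msgtyp
 *   msgtyp  < 0: the first message with the lowest type that is less
 *                than or equal to the absolute value of msgtyp
 */
long sketch_find_msg(const long *types, size_t n, long msgtyp)
{
    long limit = msgtyp < 0 ? -msgtyp : msgtyp;
    long best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (msgtyp == 0)
            return (long)i;
        if (msgtyp > 0) {
            if (types[i] == msgtyp)
                return (long)i;
            continue;
        }
        /* Less-or-equal search: remember the lowest type seen so far. */
        if (types[i] <= limit && (best < 0 || types[i] < types[best]))
            best = (long)i;
    }
    return best;
}
```

Only the less-or-equal search has to look at the whole queue; the other two
rules stop at the first match, which preserves first-in, first-out order.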
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2>Message Specific Structures<label id="datastructs"><p>
|
|
Data structures for message queues are defined in msg.c.
|
|
<sect3>struct msg_queue<label id="struct_msg_queue"><p>
|
|
<tscreen><code>
/* one msq_queue structure for each present queue on the system */
struct msg_queue {
        struct kern_ipc_perm q_perm;
        time_t q_stime;                 /* last msgsnd time */
        time_t q_rtime;                 /* last msgrcv time */
        time_t q_ctime;                 /* last change time */
        unsigned long q_cbytes;         /* current number of bytes on queue */
        unsigned long q_qnum;           /* number of messages in queue */
        unsigned long q_qbytes;         /* max number of bytes on queue */
        pid_t q_lspid;                  /* pid of last msgsnd */
        pid_t q_lrpid;                  /* last receive pid */

        struct list_head q_messages;
        struct list_head q_receivers;
        struct list_head q_senders;
};
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct msg_msg<label id="struct_msg_msg"><p>
|
|
<tscreen><code>
/* one msg_msg structure for each message */
struct msg_msg {
        struct list_head m_list;
        long m_type;
        int m_ts;                       /* message text size */
        struct msg_msgseg* next;
        /* the actual message follows immediately */
};
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct msg_msgseg<label id="struct_msg_msgseg"><p>
|
|
<tscreen><code>
/* message segment for each message */
struct msg_msgseg {
        struct msg_msgseg* next;
        /* the next part of the message follows immediately */
};
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct msg_sender<label id="struct_msg_sender"><p>
|
|
<tscreen><code>
/* one msg_sender for each sleeping sender */
struct msg_sender {
        struct list_head list;
        struct task_struct* tsk;
};
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct msg_receiver<label id="struct_msg_receiver"><p>
|
|
<tscreen><code>
/* one msg_receiver structure for each sleeping receiver */
struct msg_receiver {
        struct list_head r_list;
        struct task_struct* r_tsk;

        int r_mode;
        long r_msgtype;
        long r_maxsize;

        struct msg_msg* volatile r_msg;
};
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct msqid64_ds<label id="struct_msqid64_ds"><p>
|
|
<tscreen><code>
struct msqid64_ds {
        struct ipc64_perm msg_perm;
        __kernel_time_t msg_stime;      /* last msgsnd time */
        unsigned long __unused1;
        __kernel_time_t msg_rtime;      /* last msgrcv time */
        unsigned long __unused2;
        __kernel_time_t msg_ctime;      /* last change time */
        unsigned long __unused3;
        unsigned long msg_cbytes;       /* current number of bytes on queue */
        unsigned long msg_qnum;         /* number of messages in queue */
        unsigned long msg_qbytes;       /* max number of bytes on queue */
        __kernel_pid_t msg_lspid;       /* pid of last msgsnd */
        __kernel_pid_t msg_lrpid;       /* last receive pid */
        unsigned long __unused4;
        unsigned long __unused5;
};
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct msqid_ds<label id="struct_msqid_ds"><p>
|
|
<tscreen><code>
struct msqid_ds {
        struct ipc_perm msg_perm;
        struct msg *msg_first;          /* first message on queue,unused */
        struct msg *msg_last;           /* last message in queue,unused */
        __kernel_time_t msg_stime;      /* last msgsnd time */
        __kernel_time_t msg_rtime;      /* last msgrcv time */
        __kernel_time_t msg_ctime;      /* last change time */
        unsigned long msg_lcbytes;      /* Reuse junk fields for 32 bit */
        unsigned long msg_lqbytes;      /* ditto */
        unsigned short msg_cbytes;      /* current number of bytes on queue */
        unsigned short msg_qnum;        /* number of messages in queue */
        unsigned short msg_qbytes;      /* max number of bytes on queue */
        __kernel_ipc_pid_t msg_lspid;   /* pid of last msgsnd */
        __kernel_ipc_pid_t msg_lrpid;   /* last receive pid */
};
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct msq_setbuf<label id="msg_setbuf"><p>
|
|
<tscreen><code>
struct msq_setbuf {
        unsigned long qbytes;
        uid_t uid;
        gid_t gid;
        mode_t mode;
};
</code></tscreen>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2>Message Support Functions<label id="msgfuncs"><p>
|
|
<sect3>newque()<label id="newque"><p>
|
|
newque() allocates the memory for a new message queue
|
|
descriptor (<ref id="struct_msg_queue" name="struct msg_queue">)
|
|
and then calls <ref id="ipc_addid" name="ipc_addid()">, which
|
|
reserves a message queue array entry for the new message queue
|
|
descriptor. The message queue descriptor is initialized as
|
|
follows:
|
|
|
|
<itemize>
|
|
<item> The <ref id="struct_kern_ipc_perm" name="kern_ipc_perm">
|
|
structure is initialized.
|
|
<item> The <tt>q_stime</tt> and <tt>q_rtime</tt> fields of the message
|
|
queue descriptor are initialized as 0. The <tt>q_ctime</tt>
|
|
field is set to be CURRENT_TIME.
|
|
<item> The maximum number of bytes allowed in this
message queue (<tt>q_qbytes</tt>) is set to MSGMNB,
|
|
and the number of bytes currently used by the queue
|
|
(<tt>q_cbytes</tt>) is initialized as 0.
|
|
<item> The message waiting queue (<tt>q_messages</tt>),
|
|
the receiver waiting queue (<tt>q_receivers</tt>),
|
|
and the sender waiting queue (<tt>q_senders</tt>)
|
|
are each initialized as empty.
|
|
</itemize>
|
|
|
|
All the operations following the call to
|
|
<ref id="ipc_addid" name="ipc_addid()"> are
|
|
performed while holding the global message queue spinlock.
|
|
After unlocking the spinlock, newque() calls msg_buildid(),
|
|
which maps directly to <ref id="ipc_buildid" name="ipc_buildid()">.
|
|
<ref id="ipc_buildid" name="ipc_buildid()">
|
|
uses the index of the message queue descriptor to create a unique
|
|
message queue ID that is then returned to the caller of newque().
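
One way to picture how an index can be turned into a unique ID is sketched
below. The formula and the constant <tt>SKETCH_SEQ_MULTIPLIER</tt> are
assumptions made for illustration, as is the idea of a per-slot sequence
number that changes each time an array entry is reused.

```c
/* Assumed number of slots in the descriptor array. */
#define SKETCH_SEQ_MULTIPLIER 32768

/* Combine an array index with a sequence number into an ID. */
int sketch_buildid(int index, int seq)
{
    return SKETCH_SEQ_MULTIPLIER * seq + index;
}

/* Recover the array index from an ID. */
int sketch_id_to_index(int id)
{
    return id % SKETCH_SEQ_MULTIPLIER;
}

/* Recover the sequence number from an ID. */
int sketch_id_to_seq(int id)
{
    return id / SKETCH_SEQ_MULTIPLIER;
}
```

Under these assumptions a stale ID that refers to a reused slot can be
detected, because its sequence number no longer matches the current one.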
|
|
</sect3>
|
|
|
|
<sect3>freeque()<label id="freeque"><p>
|
|
When a message queue is going to be removed, the freeque() function is
|
|
called. This function assumes that the global message queue spinlock
|
|
is already locked by the calling function. It frees all kernel
|
|
resources associated with that message queue. First, it calls
|
|
<ref id="func_ipc_rmid" name="ipc_rmid()"> (via msg_rmid())
|
|
to remove the message queue descriptor from the array of global
|
|
message queue descriptors. Then it calls
|
|
<ref id="expunge_all" name="expunge_all()"> to wake up
|
|
all receivers and <ref id="ss_wakeup" name="ss_wakeup()">
|
|
to wake up all senders sleeping on this message queue. The
global message queue spinlock is then released.
Finally, all messages stored in this message queue are freed, and the
memory for the message queue descriptor is freed.
|
|
</sect3>
|
|
|
|
<sect3>ss_wakeup()<label id="ss_wakeup"><p>
|
|
ss_wakeup() wakes up all the tasks waiting in the given
|
|
message sender waiting queue. If this function is called by
|
|
<ref id="freeque" name="freeque()">, then all senders
|
|
in the queue are dequeued.
|
|
</sect3>
|
|
|
|
<sect3>ss_add()<label id="ss_add"><p>
|
|
ss_add() receives as parameters a message queue descriptor
|
|
and a message sender data structure. It fills the
|
|
<tt>tsk</tt> field of the message sender data
|
|
structure with the current process, changes the status of
|
|
current process to TASK_INTERRUPTIBLE,
|
|
then inserts the message sender data structure at the head of
|
|
the sender waiting queue of the given message queue.
|
|
</sect3>
|
|
|
|
<sect3>ss_del()<label id="ss_del"><p>
|
|
If the given message sender data structure
|
|
(<tt>mss</tt>) is still in the associated sender
|
|
waiting queue, then ss_del() removes
|
|
<tt>mss</tt> from the queue.
|
|
</sect3>
|
|
|
|
<sect3>expunge_all()<label id="expunge_all"><p>
|
|
expunge_all() receives as parameters a message queue
|
|
descriptor (<tt>msq</tt>) and an integer value
|
|
(<tt>res</tt>) indicating the reason for waking up the
|
|
receivers. For each sleeping receiver associated with
|
|
<tt>msq</tt>, the <tt>r_msg</tt>
|
|
field is set to the indicated
|
|
wakeup reason (<tt>res</tt>), and the associated receiving
|
|
task is awakened. This function is called when a message queue is
|
|
removed or a message control operation has been performed.
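
Since <tt>r_msg</tt> is a pointer field, storing a wakeup reason in it
requires some way of telling an error code from a real message pointer. The
sketch below shows one such encoding, modelled on the common convention of
mapping small negative error numbers onto the top of the address range. The
<tt>sketch_</tt> helpers, the reduced receiver structure and the limit
<tt>SKETCH_MAX_ERRNO</tt> are assumptions of the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095   /* assumed largest error number */

/* Encode a negative error number as a pointer value. */
void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Decode an error number from a pointer produced by sketch_err_ptr(). */
long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Non-zero if the pointer carries an error number, not a message. */
int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical, reduced receiver record. */
struct sketch_receiver {
    int awake;     /* set when the receiving task has been awakened */
    void *r_msg;   /* message pointer, or an encoded wakeup reason */
};

/* Store the reason in every receiver and mark each one awakened. */
void sketch_expunge_all(struct sketch_receiver *r, size_t n, int res)
{
    size_t i;

    for (i = 0; i < n; i++) {
        r[i].r_msg = sketch_err_ptr(res);
        r[i].awake = 1;
    }
}
```

An awakened receiver in this sketch would first test its <tt>r_msg</tt> with
<tt>sketch_is_err()</tt> and only treat the value as a message when the test
fails.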
|
|
</sect3>
|
|
|
|
<sect3>load_msg()<label id="load_msg"><p>
|
|
When a process sends a message, the
|
|
<ref id="sys_msgsnd" name="sys_msgsnd()"> function
|
|
first invokes the load_msg() function to load the message
|
|
from user space to kernel space. The message is represented in
|
|
kernel memory as a linked list of data blocks. Associated with
|
|
the first data block is a <ref id="struct_msg_msg" name="msg_msg">
|
|
structure that describes the overall message. The data block
|
|
associated with the msg_msg structure is limited to a size of
|
|
DATA_MSG_LEN. The data block and the structure are allocated in one
|
|
contiguous memory block that can be as large as one page in memory.
|
|
If the full message will not fit into this first data block, then
|
|
additional data blocks are allocated and are organized into a
|
|
linked list. These additional data blocks are limited to a size
|
|
of DATA_SEG_LEN, and each includes an associated
<ref id="struct_msg_msgseg" name="msg_msgseg"> structure. The
|
|
msg_msgseg structure and the associated data block are allocated in
|
|
one contiguous memory block that can be as large as one page in
|
|
memory. This function returns the address of the new
|
|
<ref id="struct_msg_msg" name="msg_msg"> structure on success.
|
|
</sect3>
|
|
|
|
<sect3>store_msg()<label id="store_msg"><p>
|
|
The store_msg() function is called by
|
|
<ref id="sys_msgrcv" name="sys_msgrcv()"> to reassemble a received
|
|
message into the user space buffer provided by the caller. The data
|
|
described by the <ref id="struct_msg_msg" name="msg_msg">
|
|
structure and any <ref id="struct_msg_msgseg" name="msg_msgseg">
|
|
structures are sequentially copied to the user space buffer.
|
|
</sect3>
|
|
|
|
<sect3>free_msg()<label id="free_msg"><p>
|
|
The free_msg() function releases the memory for a message
|
|
data structure <ref id="struct_msg_msg" name="msg_msg">,
|
|
and the message segments.
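
The chained layout handled by load_msg(), store_msg() and free_msg() can be
sketched together in user space. This is an illustration, not the kernel
code: <tt>malloc()</tt> and <tt>memcpy()</tt> stand in for the kernel
allocator and the user-space copy routines, the <tt>sketch_</tt> structures
are reduced, and <tt>SKETCH_BLOCK</tt> is an arbitrary small stand-in for one
page so that segmentation is easy to observe.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_BLOCK 64   /* stand-in for the size of one page */

struct sketch_seg {
    struct sketch_seg *next;
    /* the next part of the message follows immediately */
};

struct sketch_msg {
    long type;
    size_t size;               /* total message text size */
    struct sketch_seg *next;
    /* the first part of the message follows immediately */
};

#define SKETCH_MSG_LEN (SKETCH_BLOCK - sizeof(struct sketch_msg))
#define SKETCH_SEG_LEN (SKETCH_BLOCK - sizeof(struct sketch_seg))

/* Release the header block and every segment chained to it. */
void sketch_free_msg(struct sketch_msg *msg)
{
    struct sketch_seg *seg = msg->next;

    free(msg);
    while (seg != NULL) {
        struct sketch_seg *tmp = seg->next;
        free(seg);
        seg = tmp;
    }
}

/* Copy len bytes from src into a header block plus chained segments. */
struct sketch_msg *sketch_load_msg(const char *src, size_t len, long type)
{
    size_t chunk = len < SKETCH_MSG_LEN ? len : SKETCH_MSG_LEN;
    struct sketch_msg *msg = malloc(sizeof(*msg) + chunk);
    struct sketch_seg **link;

    if (msg == NULL)
        return NULL;
    msg->type = type;
    msg->size = len;
    msg->next = NULL;
    memcpy(msg + 1, src, chunk);
    src += chunk;
    len -= chunk;

    link = &msg->next;
    while (len > 0) {
        struct sketch_seg *seg;

        chunk = len < SKETCH_SEG_LEN ? len : SKETCH_SEG_LEN;
        seg = malloc(sizeof(*seg) + chunk);
        if (seg == NULL) {
            sketch_free_msg(msg);   /* undo the partial allocation */
            return NULL;
        }
        seg->next = NULL;
        memcpy(seg + 1, src, chunk);
        *link = seg;
        link = &seg->next;
        src += chunk;
        len -= chunk;
    }
    return msg;
}

/* Reassemble the message into dst; returns the number of bytes copied. */
size_t sketch_store_msg(char *dst, const struct sketch_msg *msg)
{
    size_t left = msg->size;
    size_t chunk = left < SKETCH_MSG_LEN ? left : SKETCH_MSG_LEN;
    const struct sketch_seg *seg;

    memcpy(dst, msg + 1, chunk);
    dst += chunk;
    left -= chunk;
    for (seg = msg->next; seg != NULL; seg = seg->next) {
        chunk = left < SKETCH_SEG_LEN ? left : SKETCH_SEG_LEN;
        memcpy(dst, seg + 1, chunk);
        dst += chunk;
        left -= chunk;
    }
    return msg->size - left;
}
```

Keeping every block at or below one page in size is the point of the design:
a long message never needs a large contiguous allocation.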
|
|
</sect3>
|
|
|
|
<sect3>convert_mode()<label id="convert_mode"><p>
|
|
convert_mode() is called by <ref id="sys_msgrcv" name="sys_msgrcv()">.
|
|
It receives as parameters the address of the specified message
|
|
type (<tt>msgtyp</tt>) and a flag (<tt>msgflg</tt>).
|
|
It returns the search mode to the caller based on the value of
|
|
<tt>msgtyp</tt> and <tt>msgflg</tt>. If
|
|
<tt>msgtyp</tt> is zero, then SEARCH_ANY is returned.
|
|
If <tt>msgtyp</tt> is less than 0, then <tt>msgtyp</tt> is
|
|
set to its absolute value and SEARCH_LESSEQUAL is returned.
|
|
If MSG_EXCEPT is specified in <tt>msgflg</tt>, then SEARCH_NOTEQUAL is returned.
|
|
Otherwise SEARCH_EQUAL is returned.
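
The decision order described above can be sketched as a short function. The
<tt>sketch_</tt> names and the numeric value chosen for
<tt>SKETCH_MSG_EXCEPT</tt> are stand-ins introduced for this example.

```c
/* Stand-ins for the four search modes. */
enum sketch_mode {
    SKETCH_ANY = 1,
    SKETCH_LESSEQUAL,
    SKETCH_EQUAL,
    SKETCH_NOTEQUAL
};

#define SKETCH_MSG_EXCEPT 0x2000   /* stand-in for the MSG_EXCEPT flag */

/*
 * Derive the search mode from the requested type and the flags.
 * A negative type is replaced by its absolute value.
 */
enum sketch_mode sketch_convert_mode(long *msgtyp, int msgflg)
{
    if (*msgtyp == 0)
        return SKETCH_ANY;
    if (*msgtyp < 0) {
        *msgtyp = -(*msgtyp);
        return SKETCH_LESSEQUAL;
    }
    if (msgflg & SKETCH_MSG_EXCEPT)
        return SKETCH_NOTEQUAL;
    return SKETCH_EQUAL;
}
```

The type is passed by address because the less-or-equal case rewrites it, so
that later comparisons can work with a positive bound.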
|
|
</sect3>
|
|
|
|
<sect3>testmsg()<label id="testmsg"><p>
|
|
The testmsg() function checks whether a message meets the
|
|
criteria specified by the receiver. It returns 1 if one of the
|
|
following conditions is true:
|
|
|
|
<itemize>
|
|
<item> The search mode indicates searching any message (SEARCH_ANY).
|
|
<item> The search mode is SEARCH_LESSEQUAL and the message type
is less than or equal to the desired type.
<item> The search mode is SEARCH_EQUAL and the message type is
the same as the desired type.
<item> The search mode is SEARCH_NOTEQUAL and the message type is
not equal to the specified type.
|
|
</itemize>
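
The four conditions in the list above map naturally onto a switch statement.
The following is a minimal sketch, assuming a hypothetical function
<tt>demo_testmsg()</tt> that takes the message type directly instead of a
message structure; the enum values are illustrative.

```c
/* Illustrative values only; the kernel defines its own constants. */
enum demo_search_mode {
    SEARCH_ANY = 1,
    SEARCH_EQUAL,
    SEARCH_NOTEQUAL,
    SEARCH_LESSEQUAL
};

/* Return 1 if a message of type m_type satisfies (type, mode), else 0. */
static int demo_testmsg(long m_type, long type, int mode)
{
    switch (mode) {
    case SEARCH_ANY:
        return 1;
    case SEARCH_LESSEQUAL:
        return m_type <= type;
    case SEARCH_EQUAL:
        return m_type == type;
    case SEARCH_NOTEQUAL:
        return m_type != type;
    }
    return 0;                   /* unknown mode: no match */
}
```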
|
|
</sect3>
|
|
|
|
<sect3>pipelined_send()<label id="pipelined_send"><p>
|
|
pipelined_send() allows a process to directly send a message
|
|
to a waiting receiver rather than deposit the message in the
|
|
associated message waiting queue. The
|
|
<ref id="testmsg" name="testmsg()"> function is
|
|
invoked to find the first receiver which is waiting for the
|
|
given message. If found, the waiting receiver is removed from
|
|
the receiver waiting queue, and the associated receiving task is
|
|
awakened. The message is stored in the <tt>r_msg</tt>
|
|
field of the receiver, and 1 is returned. In the case where no
|
|
receiver is waiting for the message, 0 is returned.
|
|
|
|
In the process of searching for a receiver, potential
|
|
receivers may be found which have requested a size that is too small
|
|
for the given message. Such receivers are removed from the queue,
|
|
and are awakened with an error status of E2BIG, which is stored in the
|
|
<tt>r_msg</tt> field. The search then continues until
|
|
either a valid receiver is found, or the queue is exhausted.
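
The scan over waiting receivers can be illustrated with a small user-space
model. Everything here is hypothetical: the structures, the array standing in
for the wait queue, the simplified type match (0 meaning any type), and the
separate <tt>woke_with</tt> field, which stands in for the error status that
the text says is stored in <tt>r_msg</tt>.

```c
#include <errno.h>
#include <stddef.h>

struct demo_msg {
    long type;
    size_t size;
};

struct demo_receiver {
    long want_type;               /* 0 means "any type" in this sketch */
    size_t maxsize;               /* largest message the receiver accepts */
    int waiting;                  /* 1 while still on the wait queue */
    int woke_with;                /* 0 = delivered, -E2BIG = rejected */
    const struct demo_msg *r_msg; /* delivered message, if any */
};

/* Return 1 if the message was handed to a receiver, 0 otherwise. */
static int demo_pipelined_send(struct demo_receiver *q, size_t n,
                               const struct demo_msg *msg)
{
    for (size_t i = 0; i < n; i++) {
        struct demo_receiver *r = q + i;

        if (!r->waiting)
            continue;
        if (r->want_type != 0 && r->want_type != msg->type)
            continue;             /* receiver wants a different message */
        r->waiting = 0;           /* dequeue and wake in both cases below */
        if (r->maxsize < msg->size) {
            r->woke_with = -E2BIG;
            continue;             /* buffer too small: keep searching */
        }
        r->woke_with = 0;
        r->r_msg = msg;
        return 1;
    }
    return 0;
}
```

A receiver whose buffer is too small is consumed from the queue even though
it does not get the message, which is why the loop continues rather than
stopping at the first type match.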
|
|
</sect3>
|
|
|
|
<sect3>copy_msqid_to_user()<label id="copy_msqid_to_user"><p>
|
|
copy_msqid_to_user() copies the contents of a kernel buffer to
|
|
the user buffer. It receives as parameters a user buffer, a
|
|
kernel buffer of type
|
|
<ref id="struct_msqid64_ds" name="msqid64_ds">, and a
|
|
version flag indicating
|
|
the new IPC version vs. the old IPC version. If the version
|
|
flag equals IPC_64, then copy_to_user() is invoked to copy from
|
|
the kernel buffer to the user buffer directly. Otherwise a
|
|
temporary buffer of type struct msqid_ds is initialized, and the
|
|
kernel data is translated to this temporary buffer. copy_to_user()
is then called to copy the contents of the temporary
buffer to the user buffer.
|
|
</sect3>
|
|
|
|
<sect3>copy_msqid_from_user()<label id="copy_msqid_from_user"><p>
|
|
The function copy_msqid_from_user() receives as parameters
|
|
a kernel message buffer of type struct msq_setbuf, a user buffer
|
|
and a version flag indicating the new IPC version vs. the old IPC
|
|
version. In the case of the new IPC version, copy_from_user()
|
|
is called to copy the contents of the user buffer
|
|
to a temporary buffer of type <ref id="struct_msqid64_ds" name="msqid64_ds">.
|
|
Then, the <tt>qbytes</tt>, <tt>uid</tt>, <tt>gid</tt>,
|
|
and <tt>mode</tt> fields of the kernel buffer are
|
|
filled with the values of the
|
|
corresponding fields from the temporary buffer. In the case of the
|
|
old IPC version, a temporary buffer of type struct
|
|
<ref id="struct_msqid_ds" name="msqid_ds"> is used instead.
|
|
</sect3>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1>Shared Memory<label id="sharedmem"><p>
|
|
<sect2>Shared Memory System Call Interfaces<label id="Shared_Memory_System_Call_Interfaces">
|
|
<p>
|
|
<sect3>sys_shmget()<label id="sys_shmget"><p>
|
|
The entire call to sys_shmget() is protected by the
|
|
global shared memory semaphore.
|
|
|
|
In the case where a new shared memory segment must
|
|
be created, the <ref id="newseg" name="newseg()"> function is called to create
|
|
and initialize a new shared memory segment. The ID of
|
|
the new segment is returned to the caller.
|
|
|
|
In the case where a key value is provided for an
|
|
existing shared memory segment, the corresponding index
|
|
in the shared memory descriptors array is looked up, and
|
|
the parameters and permissions of the caller are verified
|
|
before returning the shared memory segment ID. The lookup
|
|
operation and verification are performed while the global
|
|
shared memory spinlock is held.
|
|
</sect3>
|
|
|
|
<sect3>sys_shmctl()<label id="sys_shmctl"><p>
|
|
<sect4>IPC_INFO<label id="IPC_INFO"><p>
|
|
A temporary <ref id="struct_shminfo64" name="shminfo64">
|
|
buffer is loaded with system-wide
|
|
shared memory parameters and is copied out to user space for
|
|
access by the calling application.
|
|
</sect4>
|
|
|
|
<sect4>SHM_INFO<label id="SHM_INFO"><p>
|
|
The global shared memory semaphore and the global shared
|
|
memory spinlock are held while gathering system-wide statistical
|
|
information for shared memory. The
|
|
<ref id="shm_get_stat" name="shm_get_stat()"> function is called
|
|
to calculate both the number of shared memory pages that are
|
|
resident in memory and the number of shared memory pages that are
|
|
swapped out. Other statistics include the total number of shared
|
|
memory pages and the number of shared memory segments in use.
|
|
The counts of <tt>swap_attempts</tt> and <tt>swap_successes</tt>
|
|
are hard-coded to zero. These statistics are stored in a temporary
|
|
<ref id="struct_shm_info" name="shm_info"> buffer and copied out
|
|
to user space for the calling application.
|
|
</sect4>
|
|
|
|
<sect4>SHM_STAT, IPC_STAT<label id="SHM_STAT_IPC_STAT"><p>
|
|
For SHM_STAT and IPC_STAT, a temporary buffer of type
|
|
<ref id="struct_shmid64_ds" name="struct shmid64_ds"> is
|
|
initialized, and the global shared memory spinlock is locked.
|
|
|
|
For the SHM_STAT case, the shared memory segment ID parameter is
|
|
expected to be a straight index (i.e. 0 to n where n is the
|
|
number of shared memory IDs in the system). After validating
|
|
the index, <ref id="ipc_buildid" name="ipc_buildid()">
|
|
is called (via shm_buildid()) to
|
|
convert the index into a shared memory ID. When SHM_STAT
succeeds, the shared memory ID is the return value.
|
|
Note that this is an undocumented feature, but is maintained
|
|
for the ipcs(8) program.
|
|
|
|
For the IPC_STAT case, the shared memory segment ID parameter is
|
|
expected to be an ID that was generated by a call to
|
|
<ref id="sys_shmget" name="shmget()">.
|
|
The ID is validated before proceeding. When IPC_STAT
succeeds, 0 is the return value.
|
|
|
|
For both SHM_STAT and IPC_STAT, the access permissions of
|
|
the caller are verified. The desired statistics are loaded into
|
|
the temporary buffer and then copied out to the calling
|
|
application.
|
|
</sect4>
|
|
|
|
<sect4>SHM_LOCK, SHM_UNLOCK<label id="SHM_LOCK_SHM_UNLOCK"><p>
|
|
After validating access permissions, the global shared
|
|
memory spinlock is locked, and the shared memory segment ID
|
|
is validated. For both SHM_LOCK and SHM_UNLOCK,
|
|
<ref id="shmem_lock" name="shmem_lock()">
|
|
is called to perform the function. The parameters for
|
|
<ref id="shmem_lock" name="shmem_lock()">
|
|
identify the function to be performed.
|
|
</sect4>
|
|
|
|
<sect4>IPC_RMID<label id="IPC_RMID"><p>
|
|
The global shared memory semaphore and the global shared memory
spinlock are held throughout IPC_RMID. The shared memory ID is
validated, and then if
there are no current attachments, <ref id="shm_destroy" name="shm_destroy()">
is called to destroy the shared memory segment.
Otherwise, the SHM_DEST flag is set to mark it for destruction,
and the key of the segment is set to IPC_PRIVATE so that other
processes can no longer find the segment by its key.
|
|
</sect4>
|
|
|
|
<sect4>IPC_SET<label id="IPC_SET"><p>
|
|
After validating the shared memory segment ID and the user
access permissions, the <tt>uid</tt>, <tt>gid</tt>, and <tt>mode</tt> fields of the
shared memory segment are updated with the user data. The
<tt>shm_ctim</tt> field is also updated. These changes are made
while holding the global shared memory semaphore and the
global shared memory spinlock.
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3>sys_shmat()<label id="sys_shmat"><p>
|
|
sys_shmat() takes as parameters a shared memory segment ID,
an address at which the shared memory segment should be
attached (<tt>shmaddr</tt>), and flags, which are described below.
|
|
|
|
If <tt>shmaddr</tt> is non-zero, and the SHM_RND flag is
|
|
specified, then <tt>shmaddr</tt> is rounded down to a multiple of
|
|
SHMLBA. If <tt>shmaddr</tt> is not a multiple of SHMLBA and SHM_RND
|
|
is not specified, then EINVAL is returned.
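
The address check can be sketched as follows. This is a user-space
illustration, not the kernel code: <tt>demo_shmat_addr()</tt> is a
hypothetical helper, and the values chosen for <tt>DEMO_SHMLBA</tt> (one
4 KiB page) and <tt>DEMO_SHM_RND</tt> are assumptions. The mask trick
requires the alignment to be a power of two.

```c
#include <errno.h>

#define DEMO_SHMLBA  4096UL     /* assumption: SHMLBA is one 4 KiB page */
#define DEMO_SHM_RND 020000     /* hypothetical stand-in for SHM_RND */

/* Store the attach address in *out and return 0, or return -EINVAL. */
static int demo_shmat_addr(unsigned long shmaddr, int shmflg,
                           unsigned long *out)
{
    if (shmaddr & (DEMO_SHMLBA - 1)) {      /* not a multiple of SHMLBA */
        if (!(shmflg & DEMO_SHM_RND))
            return -EINVAL;
        shmaddr &= ~(DEMO_SHMLBA - 1);      /* round down */
    }
    *out = shmaddr;
    return 0;
}
```

With these values, an address of 0x12345 with the rounding flag set becomes
0x12000, while the same address without the flag is rejected.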
|
|
|
|
The access permissions of the caller are validated and
|
|
the <tt>shm_nattch</tt> field for the shared memory segment is
|
|
incremented. Note that this increment guarantees that the
|
|
attachment count is non-zero and prevents the shared memory
|
|
segment from being destroyed during the process of attaching
|
|
to the segment. These operations are performed while holding the
|
|
global shared memory spinlock.
|
|
|
|
The do_mmap() function is called to create a virtual memory
|
|
mapping to the shared memory segment pages. This is done while
|
|
holding the <tt>mmap_sem</tt> semaphore of the current task. The
|
|
MAP_SHARED flag is passed to do_mmap(). If an address was
|
|
provided by the caller, then the MAP_FIXED flag is also passed
|
|
to do_mmap(). Otherwise, do_mmap() will select the virtual
|
|
address at which to map the shared memory segment.
|
|
|
|
Note that <ref id="shm_inc" name="shm_inc()"> is invoked within the do_mmap()
function call via the <tt>shm_file_operations</tt> structure. This
function is called to set the PID, to set the current time, and
to increment the number of attachments to this shared memory
segment.
|
|
|
|
After the call to do_mmap(), the global shared memory
|
|
semaphore and the global shared memory spinlock are both
|
|
obtained. The attachment count is then decremented. The net
change to the attachment count is 1 for a call
to shmat() because of the call to <ref id="shm_inc" name="shm_inc()">. If, after
|
|
decrementing the attachment count, the resulting count is found
|
|
to be zero, and if the segment is marked for destruction
|
|
(SHM_DEST), then <ref id="shm_destroy" name="shm_destroy()"> is
|
|
called to release the shared memory segment resources.
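
The attachment-count protocol described in the preceding paragraphs can be
modelled in a few lines. This is a sketch under stated assumptions: the
structure and function names are hypothetical, the <tt>mmap_ok</tt> argument
stands in for the outcome of do_mmap(), and setting <tt>destroyed</tt> stands
in for the call to shm_destroy().

```c
struct demo_seg {
    unsigned long nattch;   /* attachment count */
    int marked_dest;        /* non-zero when SHM_DEST is set */
    int destroyed;          /* set where shm_destroy() would be called */
};

/* Bookkeeping performed by the mapping callback on a successful map. */
static void demo_shm_inc(struct demo_seg *s)
{
    s->nattch++;
}

/* Return 0 if the attach succeeded, -1 if the mapping step failed. */
static int demo_attach(struct demo_seg *s, int mmap_ok)
{
    s->nattch++;            /* temporary pin: keeps the segment alive */
    if (mmap_ok)
        demo_shm_inc(s);    /* the lasting reference for this attach */
    s->nattch--;            /* drop the temporary pin */
    if (s->nattch == 0 && s->marked_dest)
        s->destroyed = 1;
    return mmap_ok ? 0 : -1;
}
```

On success the count ends one higher than it started. If the mapping fails
on a segment that is marked for destruction and has no other attachments,
dropping the temporary pin is what triggers the destruction.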
|
|
|
|
Finally, the virtual address at which the shared memory is
|
|
mapped is returned to the caller at the user-specified address.
|
|
If an error code had been returned by do_mmap(), then this
|
|
failure code is passed on as the return value for the system call.
|
|
</sect3>
|
|
|
|
<sect3>sys_shmdt()<label id="sys_shmdt"><p>
|
|
The global shared memory semaphore is held while performing
|
|
sys_shmdt(). The <tt>mm_struct</tt> of the current
|
|
process is searched for the <tt>vm_area_struct</tt> associated with
|
|
the shared memory address. When it is found, do_munmap() is
|
|
called to undo the virtual address mapping for the shared memory segment.
|
|
|
|
Note also that do_munmap() performs a call-back to
<ref id="shm_close" name="shm_close()">,
which performs the shared memory bookkeeping functions, and
releases the shared memory segment resources if there are no other
attachments and the segment is marked for destruction (SHM_DEST).
|
|
|
|
sys_shmdt() unconditionally returns 0.
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2>Shared Memory Support Structures<label id="shm_structures"><p>
|
|
<sect3>struct shminfo64<label id="struct_shminfo64"><p>
|
|
<tscreen><code>
|
|
struct shminfo64 {
        unsigned long shmmax;           /* max segment size (bytes) */
        unsigned long shmmin;           /* min segment size (bytes) */
        unsigned long shmmni;           /* max number of segments */
        unsigned long shmseg;           /* max segments per process */
        unsigned long shmall;           /* max total shared memory (pages) */
        unsigned long __unused1;
        unsigned long __unused2;
        unsigned long __unused3;
        unsigned long __unused4;
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
<sect3>struct shm_info<label id="struct_shm_info"><p>
|
|
<tscreen><code>
|
|
struct shm_info {
        int used_ids;
        unsigned long shm_tot;          /* total allocated shm */
        unsigned long shm_rss;          /* total resident shm */
        unsigned long shm_swp;          /* total swapped shm */
        unsigned long swap_attempts;
        unsigned long swap_successes;
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct shmid_kernel<label id="struct_shmid_kernel"><p>
|
|
<tscreen><code>
|
|
struct shmid_kernel /* private to the kernel */
{
        struct kern_ipc_perm    shm_perm;
        struct file *           shm_file;
        int                     id;
        unsigned long           shm_nattch;
        unsigned long           shm_segsz;
        time_t                  shm_atim;
        time_t                  shm_dtim;
        time_t                  shm_ctim;
        pid_t                   shm_cprid;
        pid_t                   shm_lprid;
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct shmid64_ds<label id="struct_shmid64_ds"><p>
|
|
<tscreen><code>
|
|
struct shmid64_ds {
        struct ipc64_perm       shm_perm;       /* operation perms */
        size_t                  shm_segsz;      /* size of segment (bytes) */
        __kernel_time_t         shm_atime;      /* last attach time */
        unsigned long           __unused1;
        __kernel_time_t         shm_dtime;      /* last detach time */
        unsigned long           __unused2;
        __kernel_time_t         shm_ctime;      /* last change time */
        unsigned long           __unused3;
        __kernel_pid_t          shm_cpid;       /* pid of creator */
        __kernel_pid_t          shm_lpid;       /* pid of last operator */
        unsigned long           shm_nattch;     /* no. of current attaches */
        unsigned long           __unused4;
        unsigned long           __unused5;
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct shmem_inode_info<label id="struct_shmem_inode_info"><p>
|
|
<tscreen><code>
|
|
struct shmem_inode_info {
        spinlock_t      lock;
        unsigned long   max_index;
        swp_entry_t     i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
        swp_entry_t   **i_indirect; /* doubly indirect blocks */
        unsigned long   swapped;
        int             locked;     /* into memory */
        struct list_head        list;
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2>Shared Memory Support Functions<label id="shm_primitives"><p>
|
|
<sect3>newseg()<label id="newseg"><p>
|
|
The newseg() function is called when a new shared memory
segment needs to be created. It acts on three parameters for
the new segment: the key, the flag, and the size. After
validating that the size of the shared memory segment to be
created is between SHMMIN and SHMMAX and that the total number
of shared memory pages does not exceed SHMALL, it allocates
a new shared memory segment descriptor.
The <ref id="shmem_file_setup" name="shmem_file_setup()">
function is invoked later to create an unlinked file of type
tmpfs. The returned file pointer is saved in the <tt>shm_file</tt> field
of the associated shared memory segment descriptor. The file's
size is set to be the same as the size of the segment. The
new shared memory segment descriptor is initialized and inserted
into the global IPC shared memory descriptors array. The shared
memory segment ID is created by shm_buildid()
(via <ref id="ipc_buildid" name="ipc_buildid()">).
This segment ID is saved in the <tt>id</tt> field of the shared memory
segment descriptor, as well as in the <tt>i_ino</tt> field of the associated
inode. In addition, the address of the shared memory operations
defined in structure <tt>shm_file_operations</tt> is stored in the associated
file. The value of the global variable <tt>shm_tot</tt>, which indicates
the total number of shared memory pages system wide, is also
increased to reflect this change. On success, the segment ID is
returned to the calling application.
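
The limit checks at the start of newseg() can be sketched as below. The
helper <tt>demo_check_newseg()</tt>, the <tt>demo_limits</tt> structure and
the 4 KiB page size are assumptions of this example. The sketch counts the
SHMALL limit in pages and uses EINVAL and ENOSPC as documented for shmget(2);
the exact comparison used by the kernel may differ.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

struct demo_limits {
    size_t shmmin;              /* smallest allowed segment (bytes) */
    size_t shmmax;              /* largest allowed segment (bytes) */
    unsigned long shmall;       /* system-wide limit (pages) */
};

/* Return the number of pages the segment needs, or a negative errno. */
static long demo_check_newseg(size_t size, unsigned long shm_tot,
                              const struct demo_limits *lim)
{
    unsigned long numpages = (size + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;

    if (size < lim->shmmin || size > lim->shmmax)
        return -EINVAL;
    if (shm_tot + numpages > lim->shmall)
        return -ENOSPC;
    return (long)numpages;
}
```

On success the caller would add the returned page count to its running
total, mirroring the update of <tt>shm_tot</tt> described above.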
|
|
</sect3>
|
|
|
|
<sect3>shm_get_stat()<label id="shm_get_stat"><p>
|
|
shm_get_stat() cycles through all of the shared memory
|
|
structures, and calculates the total number of memory pages in
|
|
use by shared memory and the total number of shared memory pages
|
|
that are swapped out. There is a file structure and an inode
|
|
structure for each shared memory segment. Since the required
|
|
data is obtained via the inode, the spinlock for each inode
|
|
structure that is accessed is locked and unlocked in sequence.
|
|
</sect3>
|
|
|
|
<sect3>shmem_lock()<label id="shmem_lock"><p>
|
|
shmem_lock() receives as parameters a pointer to the
shared memory segment descriptor and a flag indicating
lock vs. unlock. The locking state of the shared memory
segment is stored in an associated inode. This state is compared
with the desired locking state; shmem_lock() simply returns if they match.

While holding the semaphore of the associated inode, the
locking state of the inode is set. The following steps
are performed for each page in the shared memory segment:
|
|
<itemize>
|
|
<item> find_lock_page() is called to lock the page (setting
|
|
PG_locked) and to increment the reference count of the page.
|
|
Incrementing the reference count assures that the shared
|
|
memory segment remains locked in memory throughout this
|
|
operation.
|
|
<item> If the desired state is locked, then PG_locked is cleared,
|
|
but the reference count remains incremented.
|
|
<item> If the desired state is unlocked, then the reference count
is decremented twice: once for the current reference, and once
for the existing reference which caused the page to remain
locked in memory. Then PG_locked is cleared.
|
|
</itemize>
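
The reference counting in the steps above can be modelled with plain
integers. This is a hypothetical user-space sketch: <tt>demo_page</tt>,
<tt>demo_seg</tt> and <tt>demo_shmem_lock()</tt> are invented for the
example, and no real page locking takes place.

```c
struct demo_page {
    int count;              /* reference count */
    int pg_locked;          /* stands in for the PG_locked bit */
};

struct demo_seg {
    int locked;             /* locking state kept in the inode */
    int npages;
    struct demo_page *pages;
};

static void demo_shmem_lock(struct demo_seg *s, int lock)
{
    if (s->locked == lock)
        return;             /* already in the desired state */
    s->locked = lock;
    for (int i = 0; i < s->npages; i++) {
        struct demo_page *p = s->pages + i;

        p->pg_locked = 1;   /* find_lock_page(): lock the page ... */
        p->count++;         /* ... and take a reference */
        if (!lock)
            p->count -= 2;  /* drop this reference and the pinning one */
        p->pg_locked = 0;   /* PG_locked is cleared in both cases */
    }
}
```

Locking leaves each page with one extra reference, and a later unlock
removes exactly that reference, so a lock followed by an unlock restores the
original counts.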
|
|
</sect3>
|
|
|
|
<sect3>shm_destroy()<label id="shm_destroy"><p>
|
|
During shm_destroy() the total number of shared memory pages
|
|
is adjusted to account for the removal of the shared memory segment.
|
|
<ref id="func_ipc_rmid" name="ipc_rmid()"> is called
|
|
(via shm_rmid()) to remove the Shared Memory ID. <ref id="shmem_lock" name="shmem_lock"> is
|
|
called to unlock the shared memory pages, effectively decrementing
|
|
the reference counts to zero for each page. fput() is called to
|
|
decrement the usage counter <tt>f_count</tt> for the associated file object,
|
|
and if necessary, to release the file object resources. kfree() is
|
|
called to free the shared memory segment descriptor.
|
|
</sect3>
|
|
|
|
<sect3>shm_inc()<label id="shm_inc"><p>
|
|
shm_inc() sets the PID, sets the current time, and increments
|
|
the number of attachments for the given shared memory segment.
|
|
These operations are performed while holding the global shared
|
|
memory spinlock.
|
|
</sect3>
|
|
|
|
<sect3>shm_close()<label id="shm_close"><p>
|
|
shm_close() updates the <tt>shm_lprid</tt> and the <tt>shm_dtim</tt> fields
and decrements the attachment count of the shared memory segment. If
there are no other attachments to the shared memory segment
and the segment is marked for destruction (SHM_DEST),
then <ref id="shm_destroy" name="shm_destroy()"> is called to
release the shared memory segment resources. These operations are
all performed while holding both the global shared memory semaphore
and the global shared memory spinlock.
|
|
</sect3>
|
|
|
|
<sect3>shmem_file_setup()<label id="shmem_file_setup"><p>
|
|
The function shmem_file_setup() sets up an unlinked file living
|
|
in the tmpfs file system with the given name and size. If there
are enough system memory resources for this file, it creates a new
dentry under the mount root of tmpfs, and allocates a new file
|
|
descriptor and a new inode object of tmpfs type. Then it associates
|
|
the new dentry object with the new inode object by calling
|
|
d_instantiate() and saves the address of the dentry object in the
|
|
file descriptor. The <tt>i_size</tt> field of the inode object is set to
|
|
be the file size and the <tt>i_nlink</tt> field is set to be 0 in order to
|
|
mark the inode unlinked. Also, shmem_file_setup() stores the
|
|
address of the <tt>shmem_file_operations</tt> structure in the <tt>f_op</tt> field,
|
|
and initializes <tt>f_mode</tt> and <tt>f_vfsmnt</tt> fields of the file descriptor
|
|
properly. The function shmem_truncate() is called to complete the
|
|
initialization of the inode object. On success, shmem_file_setup()
|
|
returns the new file descriptor.
|
|
</sect3>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1>Linux IPC Primitives<label id="ipc_primitives"><p>
|
|
<sect2>Generic Linux IPC Primitives used with Semaphores, Messages, and Shared Memory
|
|
<label id="Generic_Linux_IPC_Primitives_used_with_Semaphores_Messages_and_Shared_Memory">
|
|
<p>
|
|
The semaphores, messages, and shared memory mechanisms of Linux
|
|
are built on a set of common primitives. These primitives are described in the sections below.
|
|
|
|
<sect3>ipc_alloc()<label id="ipc_alloc"><p>
|
|
If the memory allocation is greater than PAGE_SIZE, then
|
|
vmalloc() is used to allocate memory. Otherwise, kmalloc() is
|
|
called with GFP_KERNEL to allocate the memory.
|
|
</sect3>
|
|
|
|
<sect3>ipc_addid()<label id="ipc_addid"><p>
|
|
When a new semaphore set, message queue, or shared memory
|
|
segment is added, ipc_addid() first calls <ref id="grow_ary" name="grow_ary()"> to
ensure that the size of the corresponding descriptor array is
|
|
sufficiently large for the system maximum. The array of descriptors
|
|
is searched for the first unused element. If an unused element
|
|
is found, the count of descriptors which are in use is incremented.
|
|
The <ref id="struct_kern_ipc_perm" name="kern_ipc_perm"> structure for the new resource descriptor
|
|
is then initialized, and the array index for the new descriptor
|
|
is returned. When ipc_addid() succeeds, it returns with the global
|
|
spinlock for the given IPC type locked.
|
|
</sect3>
|
|
|
|
<sect3>ipc_rmid()<label id="func_ipc_rmid"><p>
|
|
ipc_rmid() removes the IPC descriptor from the global
|
|
descriptor array of the IPC type, updates the count of IDs which
|
|
are in use, and adjusts the maximum ID in the corresponding
|
|
descriptor array if necessary. A pointer to the IPC
|
|
descriptor associated with the given IPC ID is returned.
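
The slot management performed by ipc_addid() and ipc_rmid() can be sketched
with a simple pointer table. The names below are hypothetical, locking and
the initialization of the permission structure are omitted, and returning -1
when no slot is free is an assumption of the sketch.

```c
#include <stddef.h>

struct demo_ids {
    int size;               /* capacity of entries[] */
    int in_use;             /* number of occupied slots */
    int max_id;             /* highest occupied index, -1 if none */
    void **entries;         /* NULL marks an unused slot */
};

/* Place obj in the first unused slot and return its index, or -1. */
static int demo_addid(struct demo_ids *ids, void *obj)
{
    int id;

    for (id = 0; id < ids->size; id++)
        if (ids->entries[id] == NULL)
            break;
    if (id == ids->size)
        return -1;          /* assumption: a full table yields -1 */
    ids->in_use++;
    if (id > ids->max_id)
        ids->max_id = id;
    ids->entries[id] = obj;
    return id;
}

/* Empty slot id and return what it held, or NULL if it was unused. */
static void *demo_rmid(struct demo_ids *ids, int id)
{
    void *obj;

    if (id < 0 || id >= ids->size || ids->entries[id] == NULL)
        return NULL;
    obj = ids->entries[id];
    ids->entries[id] = NULL;
    ids->in_use--;
    if (id == ids->max_id) {
        /* Walk down to the next occupied slot, or to -1. */
        while (id >= 0 && ids->entries[id] == NULL)
            id--;
        ids->max_id = id;
    }
    return obj;
}
```

Because the first unused slot is always chosen, indices are reused quickly,
which is one reason the IDs handed to user space also carry a sequence
number.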
|
|
</sect3>
|
|
|
|
<sect3>ipc_buildid()<label id="ipc_buildid"><p>
|
|
ipc_buildid() creates a unique ID to be associated with
|
|
each descriptor within a given IPC type. This ID is created at
|
|
the time a new IPC element is added (e.g. a new shared memory
|
|
segment or a new semaphore set). The IPC ID converts
|
|
easily into the corresponding descriptor array index. Each
|
|
IPC type maintains a sequence number which is incremented
|
|
each time a descriptor is added. An ID is created by
|
|
multiplying the sequence number with SEQ_MULTIPLIER and adding
|
|
the product to the descriptor array index. The sequence number
|
|
used in creating a particular IPC ID is then stored in the
|
|
corresponding descriptor. The existence of the sequence number
|
|
makes it possible to detect the use of a stale IPC ID.
|
|
</sect3>
|
|
|
|
<sect3>ipc_checkid()<label id="ipc_checkid"><p>
|
|
ipc_checkid() divides the given IPC ID by the SEQ_MULTIPLIER
and compares the quotient with the seq value saved in the corresponding
descriptor. If they are equal, then the IPC ID is considered to
be valid and 0 is returned. Otherwise, 1 is returned.
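
The arithmetic behind IPC IDs can be shown with a small worked example. The
helper names are hypothetical, and the value of
<tt>DEMO_SEQ_MULTIPLIER</tt> is an assumption chosen for illustration. The
check is written as a "stale" predicate so that the sketch does not depend
on the return convention of ipc_checkid().

```c
#define DEMO_SEQ_MULTIPLIER 32768   /* assumption: illustrative value */

/* Combine a descriptor array index and a sequence number into an ID. */
static int demo_buildid(int index, int seq)
{
    return DEMO_SEQ_MULTIPLIER * seq + index;
}

/* Recover the descriptor array index from an ID. */
static int demo_id_to_index(int id)
{
    return id % DEMO_SEQ_MULTIPLIER;
}

/* Return 1 if the ID was built with a different sequence number. */
static int demo_id_is_stale(int id, int stored_seq)
{
    return id / DEMO_SEQ_MULTIPLIER != stored_seq;
}
```

For example, index 3 with sequence number 5 gives the ID 163843. If slot 3
is later reused with sequence number 6, the old ID still maps to index 3 but
is detected as stale.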
|
|
</sect3>
|
|
|
|
<sect3>grow_ary()<label id="grow_ary"><p>
|
|
grow_ary() handles the possibility that the maximum
|
|
(tunable) number of IDs for a given IPC type can be dynamically
|
|
changed. It enforces the current maximum limit so that it is no
|
|
greater than the permanent system limit (IPCMNI) and adjusts it down
|
|
if necessary. It also ensures that the existing descriptor array
|
|
is large enough. If the existing array size is sufficiently large,
|
|
then the current maximum limit is returned. Otherwise, a new larger
|
|
array is allocated, the old array is copied into the new array,
|
|
and the old array is freed. The corresponding global
|
|
spinlock is held when updating the descriptor array for the
|
|
given IPC type.
|
|
</sect3>
|
|
|
|
<sect3>ipc_findkey()<label id="ipc_findkey"><p>
|
|
ipc_findkey() searches through the descriptor array of
the specified <ref id="struct_ipc_ids" name="ipc_ids"> object
for the specified key. Once found, the index of
|
|
the corresponding descriptor is returned. If the key is not found,
|
|
then -1 is returned.
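
A linear search of this kind might look as follows. The sketch is
hypothetical: <tt>demo_perm</tt> carries only the key, and the table is a
plain array of pointers in which NULL marks an unused slot.

```c
#include <stddef.h>

struct demo_perm {
    int key;
};

/* Return the index of the descriptor with the given key, or -1. */
static int demo_findkey(struct demo_perm **entries, int max_id, int key)
{
    for (int id = 0; id <= max_id; id++) {
        if (entries[id] == NULL)
            continue;               /* unused slot */
        if (entries[id]->key == key)
            return id;
    }
    return -1;
}
```

Only indices up to the highest one in use need to be examined, which is why
the sketch takes <tt>max_id</tt> rather than the full array size.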
|
|
</sect3>
|
|
|
|
<sect3>ipcperms()<label id="ipcperms"><p>
|
|
ipcperms() checks the user, group, and other permissions
|
|
for access to the IPC resources. It returns 0 if permission
|
|
is granted and -1 otherwise.
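
The user, group, and other check can be sketched with the usual
permission-bit arithmetic. This is a simplified illustration: the structures
are hypothetical, supplementary groups and privileged overrides are ignored,
and <tt>flag</tt> is assumed to hold the requested access in the same bit
layout as the mode (for example 0400 for read and 0200 for write).

```c
struct demo_perm {
    unsigned int uid, gid;      /* current owner */
    unsigned int cuid, cgid;    /* creator */
    unsigned short mode;        /* rwx bits for user, group, other */
};

struct demo_cred {
    unsigned int euid, egid;    /* effective IDs of the caller */
};

/* Return 0 if the requested access is granted, -1 otherwise. */
static int demo_ipcperms(const struct demo_perm *p,
                         const struct demo_cred *c, int flag)
{
    /* Fold the request into the low three bits. */
    int requested = (flag >> 6) | (flag >> 3) | flag;
    int granted = p->mode;

    if (c->euid == p->cuid || c->euid == p->uid)
        granted >>= 6;          /* use the user bits */
    else if (c->egid == p->cgid || c->egid == p->gid)
        granted >>= 3;          /* use the group bits */
    /* Otherwise the low "other" bits already apply. */

    return (requested & ~granted & 07) ? -1 : 0;
}
```

With a mode of 0640, the owner may read and write, a member of the owning
group may only read, and everyone else is refused.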
|
|
</sect3>
|
|
|
|
<sect3>ipc_lock()<label id="ipc_lock"><p>
|
|
ipc_lock() takes an IPC ID as one of its parameters.
|
|
It locks the global spinlock for the given IPC type, and
|
|
returns a pointer to the descriptor corresponding to the
|
|
specified IPC ID.
|
|
</sect3>
|
|
|
|
<sect3>ipc_unlock()<label id="ipc_unlock"><p>
|
|
ipc_unlock() releases the global spinlock for the indicated IPC
|
|
type.
|
|
</sect3>
|
|
|
|
<sect3>ipc_lockall()<label id="ipc_lockall"><p>
|
|
ipc_lockall() locks the global spinlock for the given
|
|
IPC mechanism (i.e. shared memory, semaphores, and messaging).
|
|
</sect3>
|
|
|
|
<sect3>ipc_unlockall()<label id="ipc_unlockall"><p>
|
|
ipc_unlockall() unlocks the global spinlock for the given
|
|
IPC mechanism (i.e. shared memory, semaphores, and messaging).
|
|
</sect3>
|
|
|
|
<sect3>ipc_get()<label id="ipc_get"><p>
|
|
ipc_get() takes a pointer to a particular IPC type
|
|
(i.e. shared memory, semaphores, or message queues) and a
|
|
descriptor ID, and returns a pointer to the corresponding
|
|
IPC descriptor. Note that although the descriptors for each
|
|
IPC type are of different data types, the common
|
|
<ref id="struct_kern_ipc_perm" name="kern_ipc_perm">
|
|
structure type is embedded as the first entity in every case.
|
|
The ipc_get() function returns this common data type. The expected
|
|
model is that ipc_get() is called through a wrapper function
|
|
(e.g. shm_get()) which casts the data type to the correct
|
|
descriptor data type.
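
The wrapper model relies on the common structure being the first member of
each descriptor, so that a pointer to it is also a pointer to the enclosing
descriptor. A minimal sketch, with hypothetical type and function names and
a plain index in place of a full IPC ID:

```c
#include <stddef.h>

struct demo_perm {              /* common part, always the first member */
    int key;
    unsigned long seq;
};

struct demo_shm {               /* one specific descriptor type */
    struct demo_perm shm_perm;  /* must stay first for the cast to work */
    unsigned long shm_segsz;
};

/* Generic lookup: knows only about the common part. */
static struct demo_perm *demo_ipc_get(struct demo_perm **entries,
                                      int size, int index)
{
    if (index < 0 || index >= size)
        return NULL;
    return entries[index];
}

/* Type-specific wrapper: casts back to the full descriptor. */
static struct demo_shm *demo_shm_get(struct demo_perm **entries,
                                     int size, int index)
{
    return (struct demo_shm *)demo_ipc_get(entries, size, index);
}
```

The cast is valid in C because a pointer to a structure, suitably converted,
points to its first member and vice versa.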
|
|
</sect3>
|
|
|
|
<sect3>ipc_parse_version()<label id="ipc_parse_version"><p>
|
|
ipc_parse_version() removes the IPC_64 flag from the command
|
|
if it is present and returns either IPC_64 or IPC_OLD.
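
Stripping the version flag amounts to a single bit test. In this sketch the
function name and the numeric values of <tt>DEMO_IPC_OLD</tt> and
<tt>DEMO_IPC_64</tt> are assumptions made for illustration.

```c
#define DEMO_IPC_OLD 0          /* assumption: illustrative value */
#define DEMO_IPC_64  0x0100     /* assumption: illustrative value */

/* Clear the version flag in *cmd and report which version was requested. */
static int demo_parse_version(int *cmd)
{
    if (*cmd & DEMO_IPC_64) {
        *cmd ^= DEMO_IPC_64;    /* remove the flag, keep the command */
        return DEMO_IPC_64;
    }
    return DEMO_IPC_OLD;
}
```

After the call, the command can be compared directly against values such as
IPC_STAT or IPC_SET regardless of which version the caller requested.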
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2>Generic IPC Structures used with Semaphores,
|
|
Messages, and Shared Memory<label id="ipc_structures"><p>
|
|
The semaphores, messages, and shared memory mechanisms all make
|
|
use of the following common structures:
|
|
|
|
<sect3>struct kern_ipc_perm<label id="struct_kern_ipc_perm"><p>
|
|
Each of the IPC descriptors has a data object of this type
|
|
as the first element. This makes it possible to access any
|
|
descriptor from any of the generic IPC functions using a pointer
|
|
of this data type.
|
|
|
|
<tscreen><code>
|
|
/* used by in-kernel data structures */
|
|
struct kern_ipc_perm {
        key_t           key;
        uid_t           uid;
        gid_t           gid;
        uid_t           cuid;
        gid_t           cgid;
        mode_t          mode;
        unsigned long   seq;
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct ipc_ids<label id="struct_ipc_ids"><p>
|
|
The ipc_ids structure describes the common data for semaphores,
|
|
message queues, and shared memory. There are three global instances of
this data structure (<tt>sem_ids</tt>,
<tt>msg_ids</tt> and <tt>shm_ids</tt>) for
semaphores, messages and shared memory respectively. In each
|
|
instance, the <tt>sem</tt> semaphore is used to
|
|
protect access to the structure.
|
|
The <tt>entries</tt> field points to an IPC
|
|
descriptor array, and the
|
|
<tt>ary</tt> spinlock protects access to this array. The
|
|
<tt>seq</tt> field is a global sequence number which will
|
|
be incremented when a new IPC resource is created.
|
|
|
|
<tscreen><code>
|
|
struct ipc_ids {
        int size;
        int in_use;
        int max_id;
        unsigned short seq;
        unsigned short seq_max;
        struct semaphore sem;
        spinlock_t ary;
        struct ipc_id* entries;
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
|
|
<sect3>struct ipc_id<label id="struct_ipc_id"><p>
|
|
An array of struct ipc_id exists in each instance of
|
|
the <ref id="struct_ipc_ids" name="ipc_ids"> structure.
|
|
The array is dynamically allocated and may be replaced with
a larger array by <ref id="grow_ary" name="grow_ary()">
|
|
as required. The array is
|
|
sometimes referred to as the descriptor array, since the
|
|
<ref id="struct_kern_ipc_perm" name="kern_ipc_perm"> data
|
|
type is used as the common descriptor data type by the IPC generic
|
|
functions.
|
|
|
|
<tscreen><code>
|
|
struct ipc_id {
        struct kern_ipc_perm* p;
};
|
|
</code></tscreen>
|
|
</sect3>
|
|
</sect2>
|
|
</sect1>
|
|
</sect>
|
|
|
|
</article>
|