<!doctype linuxdoc system>
<article>
<title>Linux Kernel 2.4 Internals
<author>Tigran Aivazian <tt>tigran@veritas.com</tt>
<date>7 August 2002 (29 Av 6001)
<abstract>
Introduction to the Linux 2.4 kernel. The latest copy of this document
can always be downloaded from:
<url url="http://www.moses.uklinux.net/patches/lki.sgml">
This guide is now part of the Linux Documentation Project and can also be
downloaded in various formats from:
<url url="http://www.linuxdoc.org/guides.html">
or can be read online (latest version) at:
<url url="http://www.moses.uklinux.net/patches/lki.html">
This documentation is free software; you can redistribute
it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation; either
version 2 of the License, or (at your option) any later version.
The author is working as a senior Linux kernel engineer at VERITAS Software
Ltd and wrote this book for the purpose of supporting the short training
course/lectures he gave on this subject, internally at VERITAS.
Thanks to
Juan J. Quintela <tt>(quintela@fi.udc.es)</tt>,
Francis Galiegue <tt>(fg@mandrakesoft.com)</tt>,
Hakjun Mun <tt>(juniorm@orgio.net)</tt>,
Matt Kraai <tt>(kraai@alumni.carnegiemellon.edu)</tt>,
Nicholas Dronen <tt>(ndronen@frii.com)</tt>,
Samuel S Chessman <tt>(chessman@tux.org)</tt>,
Nadeem Hasan <tt>(nhasan@nadmm.com)</tt>,
Michael Svetlik <tt>(m.svetlik@ssi-schaefer-peem.com)</tt>
for various corrections and suggestions.
The Linux Page Cache chapter was written by:
Christoph Hellwig <tt>(hch@caldera.de)</tt>.
The IPC Mechanisms chapter was written by:
Russell Weight <tt>(weightr@us.ibm.com)</tt> and Mingming Cao <tt>(mcao@us.ibm.com)</tt>
</abstract>
<toc>
<sect>Booting<p>
<sect1>Building the Linux Kernel Image<p>
This section explains the steps taken during compilation of the Linux kernel
and the output produced at each stage.
The build process depends on the architecture so I would like to emphasize
that we only consider building a Linux/x86 kernel.
When the user types 'make zImage' or 'make bzImage' the resulting bootable
kernel image is stored as
<tt>arch/i386/boot/zImage</tt> or
<tt>arch/i386/boot/bzImage</tt> respectively.
Here is how the image is built:
<enum>
<item> C and assembly source files are compiled into ELF relocatable object format (.o) and
some of them are grouped logically into archives (.a) using
<bf>ar(1)</bf>.
<item> Using <bf>ld(1)</bf>, the above .o and .a are linked into <tt>vmlinux</tt> which is a
statically linked, non-stripped ELF 32-bit LSB 80386 executable file.
<item> <tt>System.map</tt> is produced by <bf>nm vmlinux</bf>; irrelevant or uninteresting
symbols are grepped out.
<item> Enter directory <tt>arch/i386/boot</tt>.
<item> Bootsector asm code <tt>bootsect.S</tt> is preprocessed either with or without
<bf>-D__BIG_KERNEL__</bf>, depending on whether the target is
bzImage or zImage, into <tt>bbootsect.s</tt> or <tt>bootsect.s</tt> respectively.
<item> <tt>bbootsect.s</tt> is assembled and then converted into 'raw binary' form
called <tt>bbootsect</tt> (or <tt>bootsect.s</tt> assembled and raw-converted into
<tt>bootsect</tt> for zImage).
<item> Setup code <tt>setup.S</tt> (<tt>setup.S</tt> includes <tt>video.S</tt>) is preprocessed into
<tt>bsetup.s</tt> for bzImage or <tt>setup.s</tt> for zImage. In the same way as the
bootsector code, the difference is marked by -<bf>D__BIG_KERNEL__</bf> present
for bzImage. The result is then converted into 'raw binary' form
called <tt>bsetup</tt>.
<item> Enter directory <tt>arch/i386/boot/compressed</tt> and convert
<tt>/usr/src/linux/vmlinux</tt> to $tmppiggy (tmp filename) in raw binary
format, removing <tt>.note</tt> and <tt>.comment</tt> ELF sections.
<item> <bf>gzip -9 < $tmppiggy > $tmppiggy.gz</bf>
<item> Link $tmppiggy.gz into ELF relocatable (<bf>ld -r</bf>) <tt>piggy.o</tt>.
<item> Compile compression routines <tt>head.S</tt> and <tt>misc.c</tt> (still in
<tt>arch/i386/boot/compressed</tt> directory) into ELF objects <tt>head.o</tt> and
<tt>misc.o</tt>.
<item> Link together <tt>head.o</tt>, <tt>misc.o</tt> and <tt>piggy.o</tt> into <tt>bvmlinux</tt> (or <tt>vmlinux</tt> for
zImage, don't mistake this for <tt>/usr/src/linux/vmlinux</tt>!). Note the
difference between <bf>-Ttext 0x1000</bf> used for <tt>vmlinux</tt> and <bf>-Ttext 0x100000</bf>
for <tt>bvmlinux</tt>, i.e. for bzImage compression loader is high-loaded.
<item> Convert <tt>bvmlinux</tt> to 'raw binary' <tt>bvmlinux.out</tt> removing <tt>.note</tt> and
<tt>.comment</tt> ELF sections.
<item> Go back to <tt>arch/i386/boot</tt> directory and, using the program <bf>tools/build</bf>,
cat together <tt>bbootsect</tt>, <tt>bsetup</tt> and <tt>compressed/bvmlinux.out</tt> into <tt>bzImage</tt>
(delete extra 'b' above for <tt>zImage</tt>). This writes important variables
like <tt>setup_sects</tt> and <tt>root_dev</tt> at the end of the bootsector.
</enum>
The size of the bootsector is always 512 bytes. The size of the setup must
be greater than 4 sectors but is limited from above to about 12K - the rule
is:

0x4000 bytes >= 512 + setup_sects * 512 + room for stack while running
bootsector/setup

We will see later where this limitation comes from.
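To illustrate the arithmetic behind this rule, here is a back-of-the-envelope
check (a user-space sketch, not kernel code; the 3.5K reserved for the stack
and the 12-byte disk parameter table is an assumption chosen to match the
"about 12K" figure):
<tscreen><code>
#include <stdio.h>

int main(void)
{
        int limit = 0x4000;     /* the stack is set up at INITSEG:0x4000-12 */
        int bootsect = 512;     /* the bootsector is always one sector */
        int stack_room = 3584;  /* assumed room for stack + disk parm table */
        int setup_max = limit - bootsect - stack_room;

        printf("setup may occupy at most %d bytes (%d sectors)\n",
               setup_max, setup_max / 512);     /* 12288 bytes = 24 sectors */
        return 0;
}
</code></tscreen>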
The upper limit on the bzImage size produced at this step is about 2.5M for
booting with LILO and 0xFFFF paragraphs (0xFFFF0 = 1048560 bytes) for
booting raw image, e.g. from floppy disk or CD-ROM (El-Torito emulation mode).
Note that while <bf>tools/build</bf> does validate the size of boot sector, kernel image
and lower bound of setup size, it does not check the *upper* bound of said
setup size. Therefore it is easy to build a broken kernel by just adding some
large ".space" at the end of <tt>setup.S</tt>.
<sect1>Booting: Overview<p>
The boot process details are architecture-specific, so we shall
focus our attention on the IBM PC/IA32 architecture.
Due to old design and backward compatibility, the PC firmware boots the
operating system in an old-fashioned manner.
This process can be separated into the following six logical stages:
<enum>
<item> BIOS selects the boot device.
<item> BIOS loads the bootsector from the boot device.
<item> Bootsector loads setup, decompression routines and compressed kernel
image.
<item> The kernel is uncompressed in protected mode.
<item> Low-level initialisation is performed by asm code.
<item> High-level C initialisation.
</enum>
<sect1>Booting: BIOS POST<p>
<enum>
<item> The power supply starts the clock generator and asserts #POWERGOOD
signal on the bus.
<item> CPU #RESET line is asserted (CPU now in real 8086 mode).
<item> %ds=%es=%fs=%gs=%ss=0, %cs=0xFFFF0000, %eip=0x0000FFF0 (ROM BIOS POST code).
<item> All POST checks are performed with interrupts disabled.
<item> IVT (Interrupt Vector Table) initialised at address 0.
<item> The BIOS Bootstrap Loader function is invoked via <bf>int 0x19</bf>,
with %dl containing the boot device 'drive number'. This loads
track 0, sector 1 at physical address 0x7C00 (0x07C0:0000).
</enum>
<sect1>Booting: bootsector and setup<p>
The bootsector used to boot the Linux kernel can be one of:
<itemize>
<item> Linux bootsector (<tt>arch/i386/boot/bootsect.S</tt>),
<item> LILO (or other bootloader's) bootsector, or
<item> no bootsector (loadlin etc)
</itemize>
We consider here the Linux bootsector in detail.
The first few lines initialise the convenience macros to be used for segment
values:
<tscreen><code>
29 SETUPSECS = 4 /* default nr of setup-sectors */
30 BOOTSEG = 0x07C0 /* original address of boot-sector */
31 INITSEG = DEF_INITSEG /* we move boot here - out of the way */
32 SETUPSEG = DEF_SETUPSEG /* setup starts here */
33 SYSSEG = DEF_SYSSEG /* system loaded at 0x10000 (65536) */
34 SYSSIZE = DEF_SYSSIZE /* system size: # of 16-byte clicks */
</code></tscreen>
(the numbers on the left are the line numbers of bootsect.S file)
The values of <tt>DEF_INITSEG</tt>, <tt>DEF_SETUPSEG</tt>, <tt>DEF_SYSSEG</tt> and <tt>DEF_SYSSIZE</tt> are taken
from <tt>include/asm/boot.h</tt>:
<tscreen><code>
/* Don't touch these, unless you really know what you're doing. */
#define DEF_INITSEG 0x9000
#define DEF_SYSSEG 0x1000
#define DEF_SETUPSEG 0x9020
#define DEF_SYSSIZE 0x7F00
</code></tscreen>
Now, let us consider the actual code of <tt>bootsect.S</tt>:
<tscreen><code>
54 movw $BOOTSEG, %ax
55 movw %ax, %ds
56 movw $INITSEG, %ax
57 movw %ax, %es
58 movw $256, %cx
59 subw %si, %si
60 subw %di, %di
61 cld
62 rep
63 movsw
64 ljmp $INITSEG, $go
65 # bde - changed 0xff00 to 0x4000 to use debugger at 0x6400 up (bde). We
66 # wouldn't have to worry about this if we checked the top of memory. Also
67 # my BIOS can be configured to put the wini drive tables in high memory
68 # instead of in the vector table. The old stack might have clobbered the
69 # drive table.
70 go: movw $0x4000-12, %di # 0x4000 is an arbitrary value >=
71 # length of bootsect + length of
72 # setup + room for stack;
73 # 12 is disk parm size.
74 movw %ax, %ds # ax and es already contain INITSEG
75 movw %ax, %ss
76 movw %di, %sp # put stack at INITSEG:0x4000-12.
</code></tscreen>
Lines 54-63 move the bootsector code from address 0x7C00 to 0x90000.
This is achieved by:
<enum>
<item> set %ds:%si to $BOOTSEG:0 (0x7C0:0 = 0x7C00)
<item> set %es:%di to $INITSEG:0 (0x9000:0 = 0x90000)
<item> set the number of 16bit words in %cx (256 words = 512 bytes = 1 sector)
<item> clear DF (direction) flag in EFLAGS to auto-increment addresses (cld)
<item> go ahead and copy 512 bytes (rep movsw)
</enum>
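The segment:offset arithmetic used in steps 1 and 2 above recurs throughout
this chapter: a real-mode physical address is the segment shifted left by
four bits, plus the offset. A minimal user-space sketch of the rule:
<tscreen><code>
#include <stdio.h>

/* real mode: physical = (segment << 4) + offset */
static unsigned long phys(unsigned int seg, unsigned int off)
{
        return ((unsigned long)seg << 4) + off;
}

int main(void)
{
        printf("0x07C0:0x0000 = 0x%05lX\n", phys(0x07C0, 0)); /* 0x07C00 */
        printf("0x9000:0x0000 = 0x%05lX\n", phys(0x9000, 0)); /* 0x90000 */
        printf("0x9020:0x0000 = 0x%05lX\n", phys(0x9020, 0)); /* 0x90200 */
        return 0;
}
</code></tscreen>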
This code intentionally does not use <tt>rep movsd</tt> (hint - .code16).
Line 64 jumps to label <tt>go:</tt> in the newly made copy of the
bootsector, i.e. in segment 0x9000. This and the following instructions
(lines 70-76) prepare the stack at $INITSEG:0x4000-0xC, i.e.
%ss = $INITSEG (0x9000) and %sp = 0x3FF4 (0x4000-0xC). This is where the
limit on setup size that we mentioned earlier comes from (see Building the
Linux Kernel Image).
Lines 77-103 patch the disk parameter table for the first disk to
allow multi-sector reads:
<tscreen><code>
77 # Many BIOS's default disk parameter tables will not recognise
78 # multi-sector reads beyond the maximum sector number specified
79 # in the default diskette parameter tables - this may mean 7
80 # sectors in some cases.
81 #
82 # Since single sector reads are slow and out of the question,
83 # we must take care of this by creating new parameter tables
84 # (for the first disk) in RAM. We will set the maximum sector
85 # count to 36 - the most we will encounter on an ED 2.88.
86 #
87 # High doesn't hurt. Low does.
88 #
89 # Segments are as follows: ds = es = ss = cs - INITSEG, fs = 0,
90 # and gs is unused.
91 movw %cx, %fs # set fs to 0
92 movw $0x78, %bx # fs:bx is parameter table address
93 pushw %ds
94 ldsw %fs:(%bx), %si # ds:si is source
95 movb $6, %cl # copy 12 bytes
96 pushw %di # di = 0x4000-12.
97 rep # don't need cld -> done on line 66
98 movsw
99 popw %di
100 popw %ds
101 movb $36, 0x4(%di) # patch sector count
102 movw %di, %fs:(%bx)
103 movw %es, %fs:2(%bx)
</code></tscreen>
The floppy disk controller is reset using BIOS service int 0x13 function 0
(reset FDC) and setup sectors are loaded immediately after the
bootsector, i.e. at physical address 0x90200 ($INITSEG:0x200), again using
BIOS service int 0x13, function 2 (read sector(s)).
This happens during lines 107-124:
<tscreen><code>
107 load_setup:
108 xorb %ah, %ah # reset FDC
109 xorb %dl, %dl
110 int $0x13
111 xorw %dx, %dx # drive 0, head 0
112 movb $0x02, %cl # sector 2, track 0
113 movw $0x0200, %bx # address = 512, in INITSEG
114 movb $0x02, %ah # service 2, "read sector(s)"
115 movb setup_sects, %al # (assume all on head 0, track 0)
116 int $0x13 # read it
117 jnc ok_load_setup # ok - continue
118 pushw %ax # dump error code
119 call print_nl
120 movw %sp, %bp
121 call print_hex
122 popw %ax
123 jmp load_setup
124 ok_load_setup:
</code></tscreen>
If loading failed for some reason (bad floppy or someone pulled the diskette
out during the operation), we dump the error code and retry in an endless
loop.
The only way out of it is to reboot the machine, unless a retry succeeds,
but usually it doesn't (if something is wrong it will only get worse).
If loading setup_sects sectors of setup code succeeded we jump to label
<tt>ok_load_setup:</tt>.
Then we proceed to load the compressed kernel image at physical
address 0x10000. This
is done to preserve the firmware data areas in low memory (0-64K).
After the kernel is loaded, we jump to $SETUPSEG:0 (<tt>arch/i386/boot/setup.S</tt>).
Once the data is no longer needed (e.g. no more calls to BIOS) it is
overwritten by moving the entire (compressed) kernel image from 0x10000 to
0x1000 (physical addresses, of course).
This is done by <tt>setup.S</tt> which sets things up for protected mode and jumps
to 0x1000 which is the head of the compressed kernel, i.e.
<tt>arch/i386/boot/compressed/{head.S,misc.c}</tt>.
This sets up stack and calls <tt>decompress_kernel()</tt> which uncompresses the
kernel to address 0x100000 and jumps to it.
Note that old bootloaders (old versions of LILO) could only load the
first 4 sectors of setup, which is why there is code in setup to load the rest of
itself if needed. Also, the code in setup has to take care of various
combinations of loader type/version vs zImage/bzImage and is therefore
highly complex.
Let us examine the kludge in the bootsector code that allows loading a big
kernel, also known as "bzImage".
The setup sectors are loaded as usual at 0x90200, but the kernel is loaded
64K chunk at a time using a special helper routine that calls BIOS to move
data from low to high memory. This helper routine is referred to by
<tt>bootsect_kludge</tt> in <tt>bootsect.S</tt> and is defined as <tt>bootsect_helper</tt> in <tt>setup.S</tt>.
The <tt>bootsect_kludge</tt> label in <tt>setup.S</tt> contains the value of setup segment
and the offset of <tt>bootsect_helper</tt> code in it so that bootsector can use the <tt>lcall</tt>
instruction to jump to it (inter-segment jump).
The reason why it is in <tt>setup.S</tt> is simply that there is no more space left
in <tt>bootsect.S</tt> (which is not strictly true - there are approximately 4 spare bytes
and at least 1 spare byte in <tt>bootsect.S</tt>, but that is obviously not enough).
This routine uses BIOS service int 0x15 (ax=0x8700) to move to high memory
and resets %es to always point to 0x10000. This ensures that the code in <tt>bootsect.S</tt>
doesn't run out of low memory when copying data from disk.
<sect1> Using LILO as a bootloader <p>
There are several advantages in using a specialised bootloader (LILO) over
a bare bones Linux bootsector:
<enum>
<item> Ability to choose between multiple Linux kernels or even multiple OSes.
<item> Ability to pass kernel command line parameters (there is a patch
called BCP that adds this ability to bare-bones bootsector+setup).
<item> Ability to load much larger bzImage kernels - up to 2.5M vs 1M.
</enum>
Old versions of LILO (v17 and earlier) could not load bzImage kernels. The
newer versions (as of a couple of years ago or earlier) use the same
technique as bootsect+setup of moving data from low into high memory by
means of BIOS services. Some people (Peter Anvin notably) argue that zImage
support should be removed. The main reason (according to Alan Cox) it stays
is that there are apparently some broken BIOSes that make it impossible to
boot bzImage kernels while loading zImage ones fine.
The last thing LILO does is to jump to <tt>setup.S</tt> and things proceed as normal.
<sect1> High level initialisation <p>
By "high-level initialisation" we consider anything which is not directly
related to bootstrap, even though parts of the code to perform this are
written in asm, namely <tt>arch/i386/kernel/head.S</tt> which is the head of the
uncompressed kernel. The following steps are performed:
<enum>
<item> Initialise segment values (%ds = %es = %fs = %gs = __KERNEL_DS = 0x18).
<item> Initialise page tables.
<item> Enable paging by setting PG bit in %cr0.
<item> Zero out BSS (on SMP, only the first CPU does this).
<item> Copy the first 2k of bootup parameters (kernel commandline).
<item> Check the CPU type using EFLAGS and, if possible, <tt>cpuid</tt>; this is able
to detect 386 and higher.
<item> The first CPU calls <tt>start_kernel()</tt>, all others call
<tt>arch/i386/kernel/smpboot.c:initialize_secondary()</tt> if ready=1,
which just reloads esp/eip and doesn't return.
</enum>
<tt>init/main.c:start_kernel()</tt> is written in C and does the following:
<enum>
<item> Take a global kernel lock (it is needed so that only one CPU
goes through initialisation).
<item> Perform arch-specific setup (memory layout analysis, copying
boot command line again, etc.).
<item> Print Linux kernel "banner" containing the version, compiler used to
build it etc. to the kernel ring buffer for messages. This is taken
from the variable linux_banner defined in init/version.c and is the
same string as displayed by <bf>cat /proc/version</bf>.
<item> Initialise traps.
<item> Initialise irqs.
<item> Initialise data required for scheduler.
<item> Initialise time keeping data.
<item> Initialise softirq subsystem.
<item> Parse boot commandline options.
<item> Initialise console.
<item> If module support was compiled into the kernel, initialise the dynamic
module loading facility.
<item> If "profile=" command line was supplied, initialise profiling buffers.
<item> <tt>kmem_cache_init()</tt>, initialise most of slab allocator.
<item> Enable interrupts.
<item> Calculate BogoMips value for this CPU.
<item> Call <tt>mem_init()</tt> which calculates <tt>max_mapnr</tt>, <tt>totalram_pages</tt> and
<tt>high_memory</tt> and prints out the "Memory: ..." line.
<item> <tt>kmem_cache_sizes_init()</tt>, finish slab allocator initialisation.
<item> Initialise data structures used by procfs.
<item> <tt>fork_init()</tt>, create <tt>uid_cache</tt>, initialise <tt>max_threads</tt> based on
the amount of memory available and configure <tt>RLIMIT_NPROC</tt> for
<tt>init_task</tt> to be <tt>max_threads/2</tt>.
<item> Create various slab caches needed for VFS, VM, buffer cache, etc.
<item> If System V IPC support is compiled in, initialise the IPC subsystem.
Note that for System V shm, this includes mounting an internal
(in-kernel) instance of shmfs filesystem.
<item> If quota support is compiled into the kernel, create and initialise
a special slab cache for it.
<item> Perform arch-specific "check for bugs" and, whenever possible,
activate workaround for processor/bus/etc bugs. Comparing various
architectures reveals that "ia64 has no bugs" and "ia32 has quite a
few bugs", good example is "f00f bug" which is only checked if kernel
is compiled for less than 686 and worked around accordingly.
<item> Set a flag to indicate that a schedule should be invoked at "next
opportunity" and create a kernel thread <tt>init()</tt> which execs
execute_command if supplied via "init=" boot parameter, or tries to
exec <bf>/sbin/init</bf>, <bf>/etc/init</bf>, <bf>/bin/init</bf>, <bf>/bin/sh</bf> in this order; if
all these fail, panic with "suggestion" to use "init=" parameter.
<item> Go into the idle loop, this is an idle thread with pid=0.
</enum>
An important thing to note here is that the <tt>init()</tt> kernel thread calls
<tt>do_basic_setup()</tt> which in turn calls <tt>do_initcalls()</tt> which goes through the
list of functions registered by means of <tt>__initcall</tt> or <tt>module_init()</tt> macros
and invokes them. These functions either do not depend on each other
or their dependencies have been manually fixed by the link order in the
Makefiles. This means that, depending on
the position of directories in the trees and the structure of the Makefiles,
the order in which initialisation functions are invoked can change. Sometimes, this
is important because you can imagine two subsystems A and B with B depending
on some initialisation done by A. If A is compiled statically and B is a
module then B's entry point is guaranteed to be invoked after A prepared
all the necessary environment. If A is a module, then B is also necessarily
a module so there are no problems. But what if both A and B are statically
linked into the kernel? The order in which they are invoked depends on the relative
entry point offsets in the <tt>.initcall.init</tt> ELF section of the kernel image.
Rogier Wolff proposed to introduce a hierarchical "priority" infrastructure
whereby modules could let the linker know in what (relative) order they
should be linked, but so far there are no patches available that implement
this in a sufficiently elegant manner to be acceptable into the kernel.
Therefore, make sure your link order is correct. If, in the example above,
A and B work fine when compiled statically once, they will always work,
provided they are listed sequentially in the same Makefile. If they don't
work, change the order in which their object files are listed.
Another thing worth noting is Linux's ability to execute an "alternative
init program" by means of the "init=" boot commandline parameter. This is useful
for recovering from accidentally overwritten <bf>/sbin/init</bf> or debugging the
initialisation (rc) scripts and <tt>/etc/inittab</tt> by hand, executing them
one at a time.
<sect1>SMP Bootup on x86<p>
On SMP, the BP (Boot Processor) goes through the normal sequence of bootsector, setup etc.
until it reaches <tt>start_kernel()</tt>, and then on to <tt>smp_init()</tt> and
especially <tt>arch/i386/kernel/smpboot.c:smp_boot_cpus()</tt>. The <tt>smp_boot_cpus()</tt>
goes in a loop for each apicid (until <tt>NR_CPUS</tt>) and calls <tt>do_boot_cpu()</tt> on
it. What <tt>do_boot_cpu()</tt> does is create (i.e. <tt>fork_by_hand</tt>) an idle task for
the target cpu and write, in well-known locations defined by the Intel MP
spec (0x467/0x469), the EIP of the trampoline code found in <tt>trampoline.S</tt>. Then
it generates a STARTUP IPI to the target cpu which makes this AP (Application
Processor) execute the code in <tt>trampoline.S</tt>.
The boot CPU creates a copy of trampoline code for each CPU in
low memory. The AP code writes a magic number in its own code which is
verified by the BP to make sure that AP is executing the trampoline code.
The requirement that trampoline code must be in low memory is enforced by
the Intel MP specification.
The trampoline code simply sets %bx register to 1, enters protected mode
and jumps to startup_32 which is the main entry to <tt>arch/i386/kernel/head.S</tt>.
Now the AP starts executing <tt>head.S</tt> and, discovering that it is not a BP,
skips the code that clears BSS and then enters <tt>initialize_secondary()</tt>,
which just enters the idle task for this CPU - recall that <tt>init_tasks[cpu]</tt>
was already initialised by the BP executing <tt>do_boot_cpu(cpu)</tt>.
Note that init_task can be shared but each idle thread must have its own
TSS. This is why <tt>init_tss[NR_CPUS]</tt> is an array.
<sect1>Freeing initialisation data and code<p>
When the operating system initialises itself, most of the code and data
structures are never needed again.
Most operating systems (BSD, FreeBSD etc.) cannot dispose of this unneeded
information, thus wasting precious physical kernel memory.
The excuse they use (see McKusick's 4.4BSD book) is that "the relevant code
is spread around various subsystems and so it is not feasible to free it".
Linux, of course, cannot use such excuses because under Linux "if something
is possible in principle, then it is already implemented or somebody is
working on it".
So, as I said earlier, the Linux kernel can only be compiled as an ELF binary, and
now we find out the reason (or one of the reasons) for that. The reason
related to throwing away initialisation code/data is that Linux provides two
macros to be used:
<itemize>
<item> <tt>__init</tt> - for initialisation code
<item> <tt>__initdata</tt> - for data
</itemize>
These evaluate to gcc attribute specifiers (also known as "gcc magic")
as defined in <tt>include/linux/init.h</tt>:
<tscreen><code>
#ifndef MODULE
#define __init __attribute__ ((__section__ (".text.init")))
#define __initdata __attribute__ ((__section__ (".data.init")))
#else
#define __init
#define __initdata
#endif
</code></tscreen>
What this means is that if the code is compiled statically into the kernel
(i.e. MODULE is not defined) then it is placed in the special ELF section
<tt>.text.init</tt>, which is declared in the linker map in <tt>arch/i386/vmlinux.lds</tt>.
Otherwise (i.e. if it is a module) the macros evaluate to nothing.
What happens during boot is that the "init" kernel thread (function
<tt>init/main.c:init()</tt>) calls the arch-specific function <tt>free_initmem()</tt> which
frees all the pages between addresses <tt>__init_begin</tt> and <tt>__init_end</tt>.
On a typical system (my workstation), this results in freeing about 260K of
memory.
The functions registered via <tt>module_init()</tt> are placed in <tt>.initcall.init</tt>
which is also freed in the static case. The current trend in Linux, when
designing a subsystem (not necessarily a module), is to provide
init/exit entry points from the early stages of design so that in the
future, the subsystem in question can be modularised if needed. Example of
this is pipefs, see <tt>fs/pipe.c</tt>. Even if a given subsystem will never become a
module, e.g. bdflush (see <tt>fs/buffer.c</tt>), it is still nice and tidy to use
the <tt>module_init()</tt> macro against its initialisation function, provided it does
not matter when exactly is the function called.
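As an illustration of this convention, a minimal subsystem providing an
init/exit entry point pair might look like the following sketch (hypothetical
example, not taken from the kernel source):
<tscreen><code>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

static int __init example_init(void)
{
        printk(KERN_INFO "example: initialised\n");
        return 0;               /* 0 means success */
}

static void __exit example_exit(void)
{
        printk(KERN_INFO "example: unloaded\n");
}

/* if compiled statically, example_init() ends up in .initcall.init and is
 * invoked by do_initcalls(); if modular, it becomes the module entry point */
module_init(example_init);
module_exit(example_exit);
</code></tscreen>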
There are two more macros which work in a similar manner, called <tt>__exit</tt> and
<tt>__exitdata</tt>, but they are more directly connected to the module support and
therefore will be explained in a later section.
<sect1>Processing kernel command line<p>
Let us recall what happens to the commandline passed to the kernel during boot:
<enum>
<item> LILO (or BCP) accepts the commandline using BIOS keyboard services
and stores it at a well-known location in physical memory, as well
as a signature saying that there is a valid commandline there.
<item> <tt>arch/i386/kernel/head.S</tt> copies the first 2k of it out to the zeropage.
<item> <tt>arch/i386/kernel/setup.c:parse_mem_cmdline()</tt> (called by
<tt>setup_arch()</tt>, itself called by <tt>start_kernel()</tt>) copies 256 bytes from zeropage
into <tt>saved_command_line</tt> which is displayed by <tt>/proc/cmdline</tt>. This
same routine processes the "mem=" option if present and makes appropriate
adjustments to VM parameters.
<item> We return to commandline in <tt>parse_options()</tt> (called by <tt>start_kernel()</tt>)
which processes some "in-kernel" parameters (currently "init=" and
environment/arguments for init) and passes each word to <tt>checksetup()</tt>.
<item> <tt>checksetup()</tt> goes through the code in ELF section <tt>.setup.init</tt> and
invokes each function, passing it the word if it matches. Note that
using the return value of 0 from the function registered via <tt>__setup()</tt>,
it is possible to pass the same "variable=value" to more than one
function with "value" invalid to one and valid to another.
Jeff Garzik commented: "hackers who do that get spanked :)"
Why? Because this is clearly ld-order specific, i.e. kernel linked
in one order will have functionA invoked before functionB and another
will have it in reversed order, with the result depending on the order.
</enum>
So, how do we write code that processes boot commandline? We use the <tt>__setup()</tt>
macro defined in <tt>include/linux/init.h</tt>:
<tscreen><code>
/*
* Used for kernel command line parameter setup
*/
struct kernel_param {
const char *str;
int (*setup_func)(char *);
};
extern struct kernel_param __setup_start, __setup_end;
#ifndef MODULE
#define __setup(str, fn) \
static char __setup_str_##fn[] __initdata = str; \
static struct kernel_param __setup_##fn __initsetup = \
{ __setup_str_##fn, fn }
#else
#define __setup(str,func) /* nothing */
#endif
</code></tscreen>
So, you would typically use it in your code like this
(taken from code of real driver, BusLogic HBA <tt>drivers/scsi/BusLogic.c</tt>):
<tscreen><code>
static int __init
BusLogic_Setup(char *str)
{
int ints[3];
(void)get_options(str, ARRAY_SIZE(ints), ints);
if (ints[0] != 0) {
BusLogic_Error("BusLogic: Obsolete Command Line Entry "
"Format Ignored\n", NULL);
return 0;
}
if (str == NULL || *str == '\0')
return 0;
return BusLogic_ParseDriverOptions(str);
}
__setup("BusLogic=", BusLogic_Setup);
</code></tscreen>
Note that <tt>__setup()</tt> does nothing for modules, so the code that wishes to
process boot commandline and can be either a module or statically linked
must invoke its parsing function manually in the module initialisation
routine. This also means that it is possible to write code that
processes parameters when compiled as a module but not when it is static or
vice versa.
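To make this concrete, here is a hedged sketch of the pattern such dual-mode
code typically follows (all names here are hypothetical):
<tscreen><code>
#include <linux/init.h>
#include <linux/module.h>

static char *foo_options;       /* hypothetical driver options string */

static int __init foo_setup(char *str)
{
        foo_options = str;      /* remember "foo=..." from the boot commandline */
        return 1;               /* non-zero: we consumed the option */
}
__setup("foo=", foo_setup);     /* evaluates to nothing when MODULE is defined */

static int __init foo_init(void)
{
        /* when built as a module, __setup() above did nothing, so any
         * option parsing must be invoked manually here */
        if (foo_options) {
                /* ... parse foo_options ... */
        }
        return 0;
}
module_init(foo_init);
</code></tscreen>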
<sect>Process and Interrupt Management<p>
<sect1>Task Structure and Process Table<p>
Every process under Linux is dynamically allocated a <tt>struct task_struct</tt>
structure. The maximum number of processes which can be created on Linux
is limited only by the amount of physical memory present, and is
equal to (see <tt>kernel/fork.c:fork_init()</tt>):
<tscreen><code>
/*
* The default maximum number of threads is set to a safe
* value: the thread structures can take up at most half
* of memory.
*/
max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;
</code></tscreen>
which, on IA32 architecture, basically means <tt>num_physpages/4</tt>. As an example,
on a 512M machine, you can create 32k threads. This is a considerable
improvement over the 4k-epsilon limit for older (2.2 and earlier) kernels.
Moreover, this can be changed at runtime using the KERN_MAX_THREADS <bf>sysctl(2)</bf>,
or simply using procfs interface to kernel tunables:
<tscreen><code>
# cat /proc/sys/kernel/threads-max
32764
# echo 100000 > /proc/sys/kernel/threads-max
# cat /proc/sys/kernel/threads-max
100000
# gdb -q vmlinux /proc/kcore
Core was generated by `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118'.
#0 0x0 in ?? ()
(gdb) p max_threads
$1 = 100000
</code></tscreen>
The set of processes on the Linux system is represented as a collection of
<tt>struct task_struct</tt> structures which are linked in two ways:
<enum>
<item> as a hashtable, hashed by pid, and
<item> as a circular, doubly-linked list using <tt>p->next_task</tt> and <tt>p->prev_task</tt>
pointers.
</enum>
The hashtable is called <tt>pidhash[]</tt> and is defined in
<tt>include/linux/sched.h</tt>:
<tscreen><code>
/* PID hashing. (shouldnt this be dynamic?) */
#define PIDHASH_SZ (4096 >> 2)
extern struct task_struct *pidhash[PIDHASH_SZ];
#define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
</code></tscreen>
The tasks are hashed by their pid value and the above hashing function is
supposed to distribute the elements uniformly in their domain
(<tt>0</tt> to <tt>PID_MAX-1</tt>). The hashtable is used to quickly find a task by given pid,
using the <tt>find_task_by_pid()</tt> inline from <tt>include/linux/sched.h</tt>:
<tscreen><code>
static inline struct task_struct *find_task_by_pid(int pid)
{
struct task_struct *p, **htable = &amp;pidhash[pid_hashfn(pid)];
for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
;
return p;
}
</code></tscreen>
The tasks on each hashlist (i.e. hashed to the same value) are linked
by <tt>p->pidhash_next/pidhash_pprev</tt> which are used by <tt>hash_pid()</tt> and
<tt>unhash_pid()</tt> to insert and remove a given process into the hashtable.
These are done under protection of the read-write spinlock called <tt>tasklist_lock</tt>
taken for WRITE.
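A reader of the hashtable, by contrast, takes the same lock for READ; a
typical lookup might be sketched as:
<tscreen><code>
struct task_struct *p;

read_lock(&amp;tasklist_lock);
p = find_task_by_pid(pid);      /* pid: some pid_t we are interested in */
if (p) {
        /* ... examine p while the lock is held ... */
}
read_unlock(&amp;tasklist_lock);
</code></tscreen>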
The circular doubly-linked list that uses <tt>p->next_task/prev_task</tt> is
maintained so that one could go through all tasks on the system easily.
This is achieved by the <tt>for_each_task()</tt> macro from <tt>include/linux/sched.h</tt>:
<tscreen><code>
#define for_each_task(p) \
for (p = &amp;init_task ; (p = p->next_task) != &amp;init_task ; )
</code></tscreen>
Users of <tt>for_each_task()</tt> should take tasklist_lock for READ.
Note that <tt>for_each_task()</tt> is using <tt>init_task</tt> to mark the beginning (and end)
of the list - this is safe because the idle task (pid 0) never exits.
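For example, counting all tasks on the system might (as a sketch) look like:
<tscreen><code>
struct task_struct *p;
int nr_tasks = 0;

read_lock(&amp;tasklist_lock);      /* readers take tasklist_lock for READ */
for_each_task(p)
        nr_tasks++;
read_unlock(&amp;tasklist_lock);
</code></tscreen>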
The modifiers of the process hashtable or/and the process table links,
notably <tt>fork()</tt>, <tt>exit()</tt> and <tt>ptrace()</tt>, must take <tt>tasklist_lock</tt> for WRITE. What is
more interesting is that the writers must also disable interrupts on the
local CPU. The reason for this is not trivial: the <tt>send_sigio()</tt> function walks the
task list and thus takes <tt>tasklist_lock</tt> for READ, and it is called from
<tt>kill_fasync()</tt> in interrupt context. This is why writers must disable
interrupts while readers don't need to.
Now that we understand how the <tt>task_struct</tt> structures are linked together,
let us examine the members of <tt>task_struct</tt>. They loosely correspond to the
members of UNIX 'struct proc' and 'struct user' combined together.
The other versions of UNIX separated the task state information into
one part which should be kept memory-resident at all times (called 'proc
structure' which includes process state, scheduling information etc.) and
another part which is only needed when the process is running (called 'u area' which
includes file descriptor table, disk quota information etc.). The only reason
for such ugly design was that memory was a very scarce resource. Modern
operating systems (well, only Linux at the moment but others, e.g. FreeBSD
seem to improve in this direction towards Linux) do not need such separation
and therefore maintain process state in a kernel memory-resident data
structure at all times.
The task_struct structure is declared in <tt>include/linux/sched.h</tt> and is
currently 1680 bytes in size.
The state field is declared as:
<tscreen><code>
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define TASK_ZOMBIE 4
#define TASK_STOPPED 8
#define TASK_EXCLUSIVE 32
</code></tscreen>
Why is <tt>TASK_EXCLUSIVE</tt> defined as 32 and not 16? Because 16 was used up by
<tt>TASK_SWAPPING</tt> and I forgot to shift <tt>TASK_EXCLUSIVE</tt> up when I removed
all references to <tt>TASK_SWAPPING</tt> (sometime in 2.3.x).
The <tt>volatile</tt> in <tt>p->state</tt> declaration means it can be modified
asynchronously (from interrupt handler):
<enum>
<item><bf>TASK_RUNNING</bf>: means the task is "supposed to be" on the run
queue. The reason it may not yet be on the runqueue is that marking a task as
<tt>TASK_RUNNING</tt> and placing it on the runqueue is not atomic. You need to hold
the <tt>runqueue_lock</tt> read-write spinlock for read in order to look at the
runqueue. If you do so, you will then see that every task on the runqueue is in
<tt>TASK_RUNNING</tt> state. However, the converse is not true for the reason explained
above. Similarly, drivers can mark themselves (or rather the process context they
run in) as <tt>TASK_INTERRUPTIBLE</tt> (or <tt>TASK_UNINTERRUPTIBLE</tt>) and then call <tt>schedule()</tt>,
which will then remove it from the runqueue (unless there is a pending signal, in which
case it is left on the runqueue). </item>
<item><bf>TASK_INTERRUPTIBLE</bf>: means the task is sleeping but can be woken up
by a signal or by expiry of a timer.</item>
<item><bf>TASK_UNINTERRUPTIBLE</bf>: same as <tt>TASK_INTERRUPTIBLE</tt>, except it cannot
be woken up.</item>
<item><bf>TASK_ZOMBIE</bf>: task has terminated but has not had its status collected
(<tt>wait()</tt>-ed for) by the parent (natural or by adoption).</item>
<item><bf>TASK_STOPPED</bf>: task was stopped, either due to job control signals or
due to <bf>ptrace(2)</bf>.</item>
<item><bf>TASK_EXCLUSIVE</bf>: this is not a separate state but can be OR-ed to
either one of <tt>TASK_INTERRUPTIBLE</tt> or <tt>TASK_UNINTERRUPTIBLE</tt>.
This means that when
this task is sleeping on a wait queue with many other tasks, it will be
woken up alone instead of causing "thundering herd" problem by waking up all
the waiters.</item>
</enum>
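To illustrate, the driver idiom mentioned in the <tt>TASK_RUNNING</tt> item above
boils down to a fragment like this (a sketch; it assumes the task has already
put itself on a wait queue, as shown in the Wait Queues section):
<tscreen><code>
/* mark ourselves sleeping; TASK_EXCLUSIVE may be OR-ed in so that a
 * later wake-up on the wait queue wakes this task alone */
current->state = TASK_INTERRUPTIBLE | TASK_EXCLUSIVE;
schedule();     /* removes us from the runqueue unless a signal is pending */
</code></tscreen>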
Task flags contain information about the process states which are not
mutually exclusive:
<tscreen><code>
unsigned long flags; /* per process flags, defined below */
/*
* Per process flags
*/
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
/* Not implemented yet, only for 486*/
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */
#define PF_DUMPCORE 0x00000200 /* dumped core */
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */
</code></tscreen>
The fields <tt>p->has_cpu</tt>, <tt>p->processor</tt>, <tt>p->counter</tt>, <tt>p->priority</tt>, <tt>p->policy</tt> and
<tt>p->rt_priority</tt> are related to the scheduler and will be looked at later.
The fields <tt>p->mm</tt> and <tt>p->active_mm</tt> point respectively to the process' address space
described by <tt>mm_struct</tt> structure and to the active address space if the
process doesn't have a real one (e.g. kernel threads). This helps minimise
TLB flushes on switching address spaces when the task is scheduled out.
So, if we are scheduling-in the kernel thread (which has no <tt>p->mm</tt>) then its
<tt>next->active_mm</tt> will be set to the <tt>prev->active_mm</tt> of the task that was
scheduled-out, which will be the same as <tt>prev->mm</tt> if <tt>prev->mm != NULL</tt>.
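Paraphrased as a sketch (the real code lives in <tt>kernel/sched.c:schedule()</tt>;
<tt>prev</tt>, <tt>next</tt> and <tt>this_cpu</tt> are the scheduler's local variables),
the rule reads:
<tscreen><code>
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;

if (!mm) {                              /* kernel thread: borrow the */
        next->active_mm = oldmm;        /* address space going out */
        atomic_inc(&amp;oldmm->mm_count);
        enter_lazy_tlb(oldmm, next, this_cpu);
} else {                                /* real user address space */
        switch_mm(oldmm, mm, next, this_cpu);
}
</code></tscreen>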
The address space can be shared between threads if <tt>CLONE_VM</tt> flag is passed
to the <bf>clone(2)</bf> system call or by means of <bf>vfork(2)</bf> system call.
The fields <tt>p->exec_domain</tt> and <tt>p->personality</tt> relate to the personality of
the task, i.e. to the way certain system calls behave in order to emulate the
"personality" of foreign flavours of UNIX.
The field <tt>p->fs</tt> contains filesystem information, which under Linux means
three pieces of information:
<enum>
<item>root directory's dentry and mountpoint,
<item>alternate root directory's dentry and mountpoint,
<item>current working directory's dentry and mountpoint.
</enum>
This structure also includes a reference count because it can be shared
between cloned tasks when <tt>CLONE_FS</tt> flag is passed to the <bf>clone(2)</bf> system
call.
The field <tt>p->files</tt> contains the file descriptor table. This too can be
shared between tasks, provided <tt>CLONE_FILES</tt> is specified with <bf>clone(2)</bf> system
call.
The field <tt>p->sig</tt> contains signal handlers and can be shared between cloned
tasks by means of <tt>CLONE_SIGHAND</tt>.
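From userspace, these sharing flags are exactly what the glibc <bf>clone(2)</bf>
wrapper exposes. A sketch of creating a thread-like task that shares the
address space, filesystem information, files and signal handlers with its
parent (stack handling simplified; on x86 the stack grows down):
<tscreen><code>
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

static int worker(void *arg)
{
        /* runs sharing VM, fs info, files and signal handlers with parent */
        return 0;
}

int spawn_thread(void)
{
        char *stack = malloc(16384);

        if (!stack)
                return -1;
        return clone(worker, stack + 16384,    /* top of the child's stack */
                     CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND,
                     NULL);
}
</code></tscreen>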
<sect1>Creation and termination of tasks and kernel threads<p>
Different books on operating systems define a "process" in different ways,
starting from "instance of a program in execution" and ending with "that
which is produced by clone(2) or fork(2) system calls".
Under Linux, there are three kinds of processes:
<itemize>
<item> the idle thread(s),
<item> kernel threads,
<item> user tasks.
</itemize>
The idle thread is created at compile time for the first CPU; it is then
"manually" created for each CPU by means of arch-specific
<tt>fork_by_hand()</tt> in <tt>arch/i386/kernel/smpboot.c</tt>, which unrolls the <bf>fork(2)</bf> system
call by hand (on some archs). Idle tasks share one init_task structure but
have a private TSS structure, in the per-CPU array <tt>init_tss</tt>. Idle tasks all have
pid = 0 and no other task can share pid, i.e. use <tt>CLONE_PID</tt> flag to <bf>clone(2)</bf>.
Kernel threads are created using <tt>kernel_thread()</tt> function which invokes
the <bf>clone(2)</bf> system call in kernel mode. Kernel threads usually have no user
address space, i.e. <tt>p->mm = NULL</tt>, because they explicitly do <tt>exit_mm()</tt>, e.g.
via <tt>daemonize()</tt> function. Kernel threads can always access kernel address
space directly. They are allocated pid numbers in the low range. Running at
processor's ring 0 (on x86, that is) implies that the kernel threads enjoy all I/O privileges
and cannot be pre-empted by the scheduler.
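A hedged sketch of the kernel thread idiom just described (the daemon's name
and its work are hypothetical):
<tscreen><code>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/string.h>

static int my_daemon(void *unused)
{
        daemonize();                    /* drop user resources; p->mm = NULL */
        strcpy(current->comm, "mydaemon");
        for (;;) {
                /* ... do some work, then sleep on a wait queue ... */
        }
        return 0;
}

static int __init my_daemon_start(void)
{
        kernel_thread(my_daemon, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
        return 0;
}
module_init(my_daemon_start);
</code></tscreen>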
User tasks are created by means of <bf>clone(2)</bf> or <bf>fork(2)</bf> system calls, both of
which internally invoke <bf>kernel/fork.c:do_fork()</bf>.
Let us understand what happens when a user process makes a <bf>fork(2)</bf> system
call. Although <bf>fork(2)</bf> is architecture-dependent due to the
different ways of passing user stack and registers, the actual underlying
function <tt>do_fork()</tt> that does the job is portable and is located at
<tt>kernel/fork.c</tt>.
The following steps are done:
<enum>
<item> Local variable <tt>retval</tt> is set to <tt>-ENOMEM</tt>, as this is the value which <tt>errno</tt>
should be set to if <bf>fork(2)</bf> fails to allocate a new task structure.
<item> If <tt>CLONE_PID</tt> is set in <tt>clone_flags</tt> then return an error (<tt>-EPERM</tt>), unless
the caller is the idle thread (during boot only). So, normal user
threads cannot pass <tt>CLONE_PID</tt> to <bf>clone(2)</bf> and expect it to succeed.
For <bf>fork(2)</bf>, this is irrelevant as <tt>clone_flags</tt> is set to <tt>SIGCHLD</tt> - this
is only relevant when <tt>do_fork()</tt> is invoked from <tt>sys_clone()</tt> which
passes the <tt>clone_flags</tt> from the value requested from userspace.
<item> <tt>current->vfork_sem</tt> is initialised (it is later cleared in the child).
This is used by <tt>sys_vfork()</tt> (<bf>vfork(2)</bf> system call, corresponds to
<tt>clone_flags = CLONE_VFORK|CLONE_VM|SIGCHLD</tt>) to make the parent sleep
until the child does <tt>mm_release()</tt>, for example as a result of <tt>exec()</tt>ing
another program or <bf>exit(2)</bf>-ing.
<item> A new task structure is allocated using arch-dependent
<tt>alloc_task_struct()</tt> macro. On x86 it is just a gfp at <tt>GFP_KERNEL</tt>
priority. This is the first reason why <bf>fork(2)</bf> system call may sleep.
If this allocation fails, we return <tt>-ENOMEM</tt>.
<item> All the values from current process' task structure are copied into
the new one, using structure assignment <tt>*p = *current</tt>. Perhaps this
should be replaced by a memcpy? Later on, the fields that should not
be inherited by the child are set to the correct values.
<item> Big kernel lock is taken as the rest of the code would otherwise be
non-reentrant.
<item> If the parent has user resources (a concept of UID, Linux is flexible
enough to make it a question rather than a fact), then verify if the
user exceeded <tt>RLIMIT_NPROC</tt> soft limit - if so, fail with <tt>-EAGAIN</tt>, if
not, increment the count of processes by given uid <tt>p->user->count</tt>.
<item> If the system-wide number of tasks exceeds the value of the tunable
max_threads, fail with <tt>-EAGAIN</tt>.
<item> If the binary being executed belongs to a modularised execution
domain, increment the corresponding module's reference count.
<item> If the binary being executed belongs to a modularised binary format,
increment the corresponding module's reference count.
<item> The child is marked as 'has not execed' (<tt>p->did_exec = 0</tt>)
<item> The child is marked as 'not-swappable' (<tt>p->swappable = 0</tt>)
<item> The child is put into 'uninterruptible sleep' state, i.e.
<tt>p->state = TASK_UNINTERRUPTIBLE</tt> (TODO: why is this done?
I think it's not needed - get rid of it, Linus confirms it is not
needed)
<item> The child's <tt>p->flags</tt> are set according to the value of clone_flags;
for plain <bf>fork(2)</bf>, this will be <tt>p->flags = PF_FORKNOEXEC</tt>.
<item> The child's pid <tt>p->pid</tt> is set using the fast algorithm in
<tt>kernel/fork.c:get_pid()</tt> (TODO: <tt>lastpid_lock</tt> spinlock can be made
redundant since <tt>get_pid()</tt> is always called under big kernel lock
from <tt>do_fork()</tt>, also remove flags argument of <tt>get_pid()</tt>, patch sent
to Alan on 20/06/2000 - followup later).
<item> The rest of the code in <tt>do_fork()</tt> initialises the rest of child's
task structure. At the very end, the child's task structure is
hashed into the <tt>pidhash</tt> hashtable and the child is woken up (TODO:
<tt>wake_up_process(p)</tt> sets <tt>p->state = TASK_RUNNING</tt> and adds the process
to the runq, therefore we probably didn't need to set <tt>p->state</tt> to
<tt>TASK_RUNNING</tt> earlier on in <tt>do_fork()</tt>). The interesting part is
setting <tt>p->exit_signal</tt> to <tt>clone_flags & CSIGNAL</tt>, which for <bf>fork(2)</bf>
means just <tt>SIGCHLD</tt> and setting <tt>p->pdeath_signal</tt> to 0. The
<tt>pdeath_signal</tt> is used when a process 'forgets' the original parent
(by dying) and can be set/get by means of <tt>PR_GET/SET_PDEATHSIG</tt>
commands of <bf>prctl(2)</bf> system call (You might argue that the way the
value of <tt>pdeath_signal</tt> is returned via userspace pointer argument
in <bf>prctl(2)</bf> is a bit silly - mea culpa, after Andries Brouwer
updated the manpage it was too late to fix ;)
</enum>
Thus tasks are created. There are several ways for tasks to terminate:
<enum>
<item> by making <bf>exit(2)</bf> system call;
<item> by being delivered a signal with default disposition to die;
<item> by being forced to die under certain exceptions;
<item> by calling <bf>bdflush(2)</bf> with <tt>func == 1</tt> (this is Linux-specific, for
compatibility with old distributions that still had the 'update'
line in <tt>/etc/inittab</tt> - nowadays the work of update is done by
kernel thread <tt>kupdate</tt>).
</enum>
Functions implementing system calls under Linux are prefixed with <tt>sys_</tt>,
but they are usually concerned only with argument checking or arch-specific
ways to pass some information and the actual work is done by <tt>do_</tt> functions.
So it is with <tt>sys_exit()</tt> which calls <tt>do_exit()</tt> to do the work. However,
other parts of the kernel sometimes invoke <tt>sys_exit()</tt> when they should really
call <tt>do_exit()</tt>.
The function <tt>do_exit()</tt> is found in <tt>kernel/exit.c</tt>. The points to note about
<tt>do_exit()</tt>:
<itemize>
<item> Uses global kernel lock (locks but doesn't unlock).
<item> Calls <tt>schedule()</tt> at the end, which never returns.
<item> Sets the task state to <tt>TASK_ZOMBIE</tt>.
<item> Notifies any child with <tt>current->pdeath_signal</tt>, if not 0.
<item> Notifies the parent with a <tt>current->exit_signal</tt>, which is usually
equal to <tt>SIGCHLD</tt>.
<item> Releases resources allocated by fork, closes open files etc.
<item> On architectures that use lazy FPU switching (ia64, mips, mips64)
(TODO: sparc, sparc64), do whatever the hardware requires to pass the FPU
ownership (if owned by current) to "none".
</itemize>
<sect1>Linux Scheduler<p>
The job of a scheduler is to arbitrate access to the current CPU between
multiple processes. The scheduler is implemented in the 'main kernel file'
<tt>kernel/sched.c</tt>. The corresponding header file <tt>include/linux/sched.h</tt> is
included (either explicitly or indirectly) by virtually every kernel source
file.
The fields of task structure relevant to scheduler include:
<itemize>
<item> <tt>p->need_resched</tt>: this field is set if <tt>schedule()</tt> should be invoked at
the 'next opportunity'.
<item> <tt>p->counter</tt>: number of clock ticks left to run in this scheduling
slice, decremented by a timer. When this field becomes lower than or equal to zero, it is reset
to 0 and <tt>p->need_resched</tt> is set. This is also sometimes called 'dynamic
priority' of a process because it can change by itself.
<item> <tt>p->priority</tt>: the process' static priority, only changed through well-known
system calls like <bf>nice(2)</bf>, POSIX.1b <bf>sched_setparam(2)</bf> or 4.4BSD/SVR4
<bf>setpriority(2)</bf>.
<item> <tt>p->rt_priority</tt>: realtime priority
<item> <tt>p->policy</tt>: the scheduling policy, specifies which scheduling class the
task belongs to. Tasks can change their scheduling class using the
<bf>sched_setscheduler(2)</bf> system call. The valid values are <tt>SCHED_OTHER</tt>
(traditional UNIX process), <tt>SCHED_FIFO</tt> (POSIX.1b FIFO realtime
process) and <tt>SCHED_RR</tt> (POSIX round-robin realtime process). One can
also OR <tt>SCHED_YIELD</tt> to any of these values to signify that the process
decided to yield the CPU, for example by calling <bf>sched_yield(2)</bf> system
call. A FIFO realtime process will run until either a) it blocks on I/O,
b) it explicitly yields the CPU or c) it is preempted by another realtime
process with a higher <tt>p->rt_priority</tt> value. <tt>SCHED_RR</tt> is the same as
<tt>SCHED_FIFO</tt>, except that when its timeslice expires it goes back to
the end of the runqueue.
</itemize>
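From userspace, a task changes its scheduling class as in this minimal
sketch (the priority value 50 is an arbitrary choice, and root privileges
are required):
<tscreen><code>
#include <sched.h>

int make_fifo_realtime(void)
{
        struct sched_param sp;

        sp.sched_priority = 50;                 /* maps to p->rt_priority */
        return sched_setscheduler(0, SCHED_FIFO, &amp;sp); /* 0 = this process */
}
</code></tscreen>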
The scheduler's algorithm is simple, despite the great apparent complexity
of the <tt>schedule()</tt> function. The function is complex because it implements
three scheduling algorithms in one and also because of the subtle
SMP-specifics.
The apparently 'useless' gotos in <tt>schedule()</tt> are there for a purpose - to
generate the best optimised (for i386) code. Also, note that scheduler
(like most of the kernel) was completely rewritten for 2.4, therefore the
discussion below does not apply to 2.2 or earlier kernels.
Let us look at the function in detail:
<enum>
<item> If <tt>current->active_mm == NULL</tt> then something is wrong. The current
process, even a kernel thread (<tt>current->mm == NULL</tt>), must have a valid
<tt>p->active_mm</tt> at all times.
<item> If there is something to do on the <tt>tq_scheduler</tt> task queue, process it
now. Task queues provide a kernel mechanism to schedule execution of
functions at a later time. We shall look at it in detail elsewhere.
<item> Initialise local variables <tt>prev</tt> and <tt>this_cpu</tt> to current task and
current CPU respectively.
<item> Check if <tt>schedule()</tt> was invoked from interrupt handler (due to a bug)
and panic if so.
<item> Release the global kernel lock.
<item> If there is some work to do via softirq mechanism, do it now.
<item> Initialise local pointer <tt>struct schedule_data *sched_data</tt> to point
to per-CPU (cacheline-aligned to prevent cacheline ping-pong)
scheduling data area, which contains the TSC value of <tt>last_schedule</tt> and the
pointer to the last scheduled task structure (TODO: <tt>sched_data</tt> is used on
SMP only but why does <tt>init_idle()</tt> initialise it on UP as well?).
<item> <tt>runqueue_lock</tt> spinlock is taken. Note that we use <tt>spin_lock_irq()</tt>
because in <tt>schedule()</tt> we guarantee that interrupts are enabled. Therefore,
when we unlock <tt>runqueue_lock</tt>, we can just re-enable them instead of
saving/restoring eflags (<tt>spin_lock_irqsave/restore</tt> variant).
<item> task state machine: if the task is in <tt>TASK_RUNNING</tt> state, it is left
alone; if it is in <tt>TASK_INTERRUPTIBLE</tt> state and a signal is pending,
it is moved into <tt>TASK_RUNNING</tt> state. In all other cases, it is deleted
from the runqueue.
<item> <tt>next</tt> (best candidate to be scheduled) is set to the idle task of
this cpu. However, the goodness of this candidate is set to a very
low value (-1000), in hope that there is someone better than that.
<item> If the <tt>prev</tt> (current) task is in <tt>TASK_RUNNING</tt> state, then the
current goodness is set to its goodness and it is marked as a better
candidate to be scheduled than the idle task.
<item> Now the runqueue is examined and a goodness of each process that can
be scheduled on this cpu is compared with current value; the
process with highest goodness wins. Now the concept of "can be
scheduled on this cpu" must be clarified: on UP, every process on
the runqueue is eligible to be scheduled; on SMP, only process not
already running on another cpu is eligible to be scheduled on this
cpu. The goodness is calculated by a function called <tt>goodness()</tt>, which
treats realtime processes by making their goodness very high
(<tt>1000 + p->rt_priority</tt>), this being greater than 1000 guarantees that
no <tt>SCHED_OTHER</tt> process can win; so they only contend with other
realtime processes that may have a greater <tt>p->rt_priority</tt>. The
goodness function returns 0 if the process' time slice (<tt>p->counter</tt>)
is over. For non-realtime processes, the initial value of goodness is
set to <tt>p->counter</tt> - this way, the process is less likely to get CPU if
it already had it for a while, i.e. interactive processes are favoured
more than CPU bound number crunchers. The arch-specific constant
<tt>PROC_CHANGE_PENALTY</tt> attempts to implement "cpu affinity" (i.e. give
advantage to a process on the same CPU). It also gives a slight
advantage to processes with mm pointing to current <tt>active_mm</tt> or to
processes with no (user) address space, i.e. kernel threads.
<item> if the current value of goodness is 0 then the entire list of
processes (not just the ones on the runqueue!) is examined and their dynamic
priorities are recalculated using a simple algorithm:
<tscreen><code>
recalculate:
{
struct task_struct *p;
spin_unlock_irq(&amp;runqueue_lock);
read_lock(&amp;tasklist_lock);
for_each_task(p)
p->counter = (p->counter >> 1) + p->priority;
read_unlock(&amp;tasklist_lock);
spin_lock_irq(&amp;runqueue_lock);
}
</code></tscreen>
Note that we drop the <tt>runqueue_lock</tt> before we recalculate. The
reason is that we go through entire set of processes; this can take
a long time, during which the <tt>schedule()</tt> could be called on another CPU and
select a process with goodness good enough for that CPU, whilst we on
this CPU were forced to recalculate. Ok, admittedly this is somewhat
inconsistent because while we (on this CPU) are selecting a process with
the best goodness, <tt>schedule()</tt> running on another CPU could be
recalculating dynamic priorities.
<item> From this point on it is certain that <tt>next</tt> points to the task to
be scheduled, so we initialise <tt>next->has_cpu</tt> to 1 and <tt>next->processor</tt>
to <tt>this_cpu</tt>. The <tt>runqueue_lock</tt> can now be unlocked.
<item> If we are switching back to the same task (<tt>next == prev</tt>) then we can
simply reacquire the global kernel lock and return, i.e. skip all the
hardware-level (registers, stack etc.) and VM-related (switch page
directory, recalculate <tt>active_mm</tt> etc.) stuff.
<item> The macro <tt>switch_to()</tt> is architecture specific. On i386, it is
concerned with a) FPU handling, b) LDT handling, c) reloading segment
registers, d) TSS handling and e) reloading debug registers.
</enum>
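The goodness calculation described above can be condensed into the following
sketch (the real function is <tt>kernel/sched.c:goodness()</tt>; details such as
<tt>SCHED_YIELD</tt> handling are omitted, and exact weights vary between kernel
versions):
<tscreen><code>
static inline int goodness(struct task_struct *p, int this_cpu,
                           struct mm_struct *this_mm)
{
        int weight;

        if (p->policy != SCHED_OTHER)           /* realtime: always wins */
                return 1000 + p->rt_priority;

        weight = p->counter;                    /* dynamic priority */
        if (!weight)
                return 0;                       /* timeslice used up */
#ifdef CONFIG_SMP
        if (p->processor == this_cpu)
                weight += PROC_CHANGE_PENALTY;  /* cpu affinity bonus */
#endif
        if (p->mm == this_mm || !p->mm)
                weight += 1;                    /* cheap mm switch */
        return weight + p->priority;            /* add static priority */
}
</code></tscreen>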
<sect1>Linux linked list implementation<p>
Before we go on to examine implementation of wait queues, we must
acquaint ourselves with the Linux standard doubly-linked list implementation.
Wait queues (as well as everything else in Linux) make heavy use
of them, and in kernel jargon they are called the "list.h implementation" because the
most relevant file is <tt>include/linux/list.h</tt>.
The fundamental data structure here is <tt>struct list_head</tt>:
<tscreen><code>
struct list_head {
struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &amp;(name), &amp;(name) }
#define LIST_HEAD(name) \
struct list_head name = LIST_HEAD_INIT(name)
#define INIT_LIST_HEAD(ptr) do { \
(ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)
#define list_entry(ptr, type, member) \
((type *)((char *)(ptr)-(unsigned long)(&amp;((type *)0)->member)))
#define list_for_each(pos, head) \
for (pos = (head)->next; pos != (head); pos = pos->next)
</code></tscreen>
The first three macros are for initialising an empty list by pointing both
<tt>next</tt> and <tt>prev</tt> pointers to itself. It is obvious from C syntactic
restrictions which one should be used where - for example, <tt>LIST_HEAD_INIT()</tt>
can be used to initialise a structure element in a declaration, the second
can be used for initialising static variable declarations and the third can
be used inside a function.
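Side by side, the three forms look like this (a sketch):
<tscreen><code>
struct foo {
        int data;
        struct list_head list;
};

/* 1: element initialisation inside a declaration */
struct foo bar = { 0, LIST_HEAD_INIT(bar.list) };

/* 2: a static list head, defined and initialised in one go */
static LIST_HEAD(foo_list);

/* 3: run-time initialisation inside a function */
void init_foo(struct foo *f)
{
        INIT_LIST_HEAD(&amp;f->list);
}
</code></tscreen>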
The macro <tt>list_entry()</tt> gives access to individual list element, for example
(from <tt>fs/file_table.c:fs_may_remount_ro()</tt>):
<tscreen><code>
struct super_block {
...
struct list_head s_files;
...
} *sb = &amp;some_super_block;
struct file {
...
struct list_head f_list;
...
} *file;
struct list_head *p;
for (p = sb->s_files.next; p != &amp;sb->s_files; p = p->next) {
struct file *file = list_entry(p, struct file, f_list);
/* ... do something to 'file' ... */
}
</code></tscreen>
A good example of the use of <tt>list_for_each()</tt> macro is in the scheduler where
we walk the runqueue looking for the process with highest goodness:
<tscreen><code>
static LIST_HEAD(runqueue_head);
struct list_head *tmp;
struct task_struct *p;

list_for_each(tmp, &amp;runqueue_head) {
	p = list_entry(tmp, struct task_struct, run_list);
	if (can_schedule(p)) {
		int weight = goodness(p, this_cpu, prev->active_mm);
		if (weight > c)
			c = weight, next = p;
	}
}
</code></tscreen>
Here, <tt>p->run_list</tt> is declared as <tt>struct list_head run_list</tt> inside the
<tt>task_struct</tt> structure and links the task into the list (whose anchor is
<tt>runqueue_head</tt>). Removing an element from the list and adding it (to the
head or tail of the list) is done by the
<tt>list_del()/list_add()/list_add_tail()</tt> macros. The examples below add
and remove a task from the runqueue:
<tscreen><code>
static inline void del_from_runqueue(struct task_struct * p)
{
	nr_running--;
	list_del(&amp;p->run_list);
	p->run_list.next = NULL;
}

static inline void add_to_runqueue(struct task_struct * p)
{
	list_add(&amp;p->run_list, &amp;runqueue_head);
	nr_running++;
}

static inline void move_last_runqueue(struct task_struct * p)
{
	list_del(&amp;p->run_list);
	list_add_tail(&amp;p->run_list, &amp;runqueue_head);
}

static inline void move_first_runqueue(struct task_struct * p)
{
	list_del(&amp;p->run_list);
	list_add(&amp;p->run_list, &amp;runqueue_head);
}
</code></tscreen>
<sect1>Wait Queues<p>
When a process requests the kernel to do something which is currently
impossible but that may become possible later, the process is put to sleep
and is woken up when the request is more likely to be satisfied. One of the
kernel mechanisms used for this is called a 'wait queue'.
The Linux implementation allows wake-one semantics using the <tt>TASK_EXCLUSIVE</tt> flag.
With waitqueues, you can either use a well-known queue and then simply
<tt>sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout</tt>,
or you can define your own waitqueue and use <tt>add/remove_wait_queue</tt> to add and
remove yourself from it and <tt>wake_up/wake_up_interruptible</tt> to wake up
when needed.
An example of the first usage of waitqueues is interaction between the page
allocator (in <tt>mm/page_alloc.c:__alloc_pages()</tt>) and the <tt>kswapd</tt> kernel daemon (in
<tt>mm/vmscan.c:kswapd()</tt>), by means of the wait queue <tt>kswapd_wait</tt>, declared in
<tt>mm/vmscan.c</tt>; the <tt>kswapd</tt> daemon sleeps on this queue, and it is woken up
whenever the page allocator needs to free up some pages.
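As a minimal sketch of the first style, assume a hypothetical driver with a
global condition flag <tt>foo_ready</tt> and a well-known queue <tt>foo_wait</tt>
(these names are invented, not from a real driver):
<tscreen><code>
#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(foo_wait);
static volatile int foo_ready;

void wait_for_foo(void)
{
	/* sleep on the well-known queue until the condition holds */
	while (!foo_ready)
		interruptible_sleep_on(&amp;foo_wait);
}

void foo_arrived(void)		/* called when the event happens */
{
	foo_ready = 1;
	wake_up_interruptible(&amp;foo_wait);
}
</code></tscreen>
Note that the test-then-sleep sequence above is inherently racy, which is one
reason drivers often prefer the explicit <tt>add/remove_wait_queue</tt> style
shown next.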
An example of autonomous waitqueue usage is the interaction between a
user process requesting data via the <bf>read(2)</bf> system call and the kernel running in
interrupt context to supply the data. An interrupt handler might look
like this (simplified from <tt>drivers/char/rtc.c:rtc_interrupt()</tt>):
<tscreen><code>
static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);

void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	spin_lock(&amp;rtc_lock);
	rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS);
	spin_unlock(&amp;rtc_lock);
	wake_up_interruptible(&amp;rtc_wait);
}
</code></tscreen>
So, the interrupt handler obtains the data by reading from some
device-specific I/O port (the <tt>CMOS_READ()</tt> macro turns into a couple of
<tt>outb/inb</tt> instructions) and then wakes up whoever is sleeping on the
<tt>rtc_wait</tt> wait queue.
Now, the <bf>read(2)</bf> system call could be implemented as:
<tscreen><code>
ssize_t rtc_read(struct file *file, char *buf, size_t count, loff_t *ppos)
{
	DECLARE_WAITQUEUE(wait, current);
	unsigned long data;
	ssize_t retval;

	add_wait_queue(&amp;rtc_wait, &amp;wait);
	current->state = TASK_INTERRUPTIBLE;
	do {
		spin_lock_irq(&amp;rtc_lock);
		data = rtc_irq_data;
		rtc_irq_data = 0;
		spin_unlock_irq(&amp;rtc_lock);

		if (data != 0)
			break;

		if (file->f_flags & O_NONBLOCK) {
			retval = -EAGAIN;
			goto out;
		}
		if (signal_pending(current)) {
			retval = -ERESTARTSYS;
			goto out;
		}
		schedule();
	} while (1);

	retval = put_user(data, (unsigned long *)buf);
	if (!retval)
		retval = sizeof(unsigned long);

out:
	current->state = TASK_RUNNING;
	remove_wait_queue(&amp;rtc_wait, &amp;wait);
	return retval;
}
</code></tscreen>
What happens in <tt>rtc_read()</tt> is this:
<enum>
<item> We declare a wait queue element pointing to current process context.
<item> We add this element to the <tt>rtc_wait</tt> waitqueue.
<item> We mark the current context as <tt>TASK_INTERRUPTIBLE</tt>, which means it will
not be rescheduled after it next sleeps, until it is woken up (explicitly or
by a signal).
<item> We check whether data is available; if it is, we break out of the loop,
copy the data to the user buffer, mark ourselves as <tt>TASK_RUNNING</tt>, remove
ourselves from the wait queue and return.
<item> If there is no data yet, we check whether the user specified non-blocking I/O
and if so we fail with <tt>EAGAIN</tt> (which is the same as <tt>EWOULDBLOCK</tt>).
<item> We also check if a signal is pending and if so inform the "higher
layers" to restart the system call if necessary. By "if necessary"
I mean the details of signal disposition as specified in the <bf>sigaction(2)</bf>
system call.
<item> Then we "switch out", i.e. fall asleep, until woken up by the
interrupt handler. If we didn't mark ourselves as <tt>TASK_INTERRUPTIBLE</tt>
then the scheduler could schedule us sooner than when the data is
available, thus causing unneeded processing.
</enum>
It is also worth pointing out that, using wait queues, it is rather easy to
implement the <bf>poll(2)</bf> system call:
<tscreen><code>
static unsigned int rtc_poll(struct file *file, poll_table *wait)
{
	unsigned long l;

	poll_wait(file, &amp;rtc_wait, wait);

	spin_lock_irq(&amp;rtc_lock);
	l = rtc_irq_data;
	spin_unlock_irq(&amp;rtc_lock);

	if (l != 0)
		return POLLIN | POLLRDNORM;
	return 0;
}
</code></tscreen>
All the work is done by the device-independent function <tt>poll_wait()</tt> which does
the necessary waitqueue manipulations; all we need to do is point it to the
waitqueue which is woken up by our device-specific interrupt handler.
<sect1>Kernel Timers<p>
Now let us turn our attention to kernel timers. Kernel timers are used to
dispatch execution of a particular function (called 'timer handler') at a
specified time in the future. The main data structure is <tt>struct timer_list</tt>
declared in <tt>include/linux/timer.h</tt>:
<tscreen><code>
struct timer_list {
	struct list_head list;
	unsigned long expires;
	unsigned long data;
	void (*function)(unsigned long);
	volatile int running;
};
</code></tscreen>
The <tt>list</tt> field is for linking into the internal list, protected by the
<tt>timerlist_lock</tt> spinlock. The <tt>expires</tt> field is the value of <tt>jiffies</tt> when
the <tt>function</tt> handler should be invoked with <tt>data</tt> passed as a parameter.
The <tt>running</tt> field is used on SMP to test if the timer handler is currently
running on another CPU.
The functions <tt>add_timer()</tt> and <tt>del_timer()</tt> add and remove a given timer
to/from the list. When a timer expires, it is removed from the list
automatically. Before a timer is used, it MUST be initialised by means of the
<tt>init_timer()</tt> function, and before it is added, the <tt>function</tt> and
<tt>expires</tt> fields must be set.
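For illustration, here is a minimal sketch of arming a one-second timer (the
names are invented, not from a real driver):
<tscreen><code>
#include <linux/kernel.h>
#include <linux/timer.h>
#include <linux/sched.h>	/* jiffies, HZ */

static struct timer_list my_timer;

static void my_timer_handler(unsigned long data)
{
	/* runs (in interrupt context) when 'expires' is reached */
	printk("my_timer fired, data=%lu\n", data);
}

static void arm_my_timer(void)
{
	init_timer(&amp;my_timer);			/* MUST be done first */
	my_timer.function = my_timer_handler;
	my_timer.data = 123;			/* passed to the handler */
	my_timer.expires = jiffies + HZ;	/* one second from now */
	add_timer(&amp;my_timer);
}
</code></tscreen>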
<sect1>Bottom Halves<p>
Sometimes it is reasonable to split the amount of work to be performed inside
an interrupt handler into immediate work (e.g. acknowledging the interrupt,
updating the stats etc.) and work which can be postponed until later, when
interrupts are enabled (e.g. to do some postprocessing on data, wake up
processes waiting for this data, etc).
Bottom halves are the oldest mechanism for deferred execution of kernel
tasks and have been available since Linux 1.x. In Linux 2.0, a new mechanism
was added, called 'task queues', which will be the subject of the next section.
Bottom halves are serialised by the <tt>global_bh_lock</tt> spinlock, i.e.
there can only be one bottom half running on any CPU at a time. However,
when attempting to execute the handler, if <tt>global_bh_lock</tt> is not available,
the bottom half is marked (i.e. scheduled) for later execution and processing
continues, instead of busy-waiting on <tt>global_bh_lock</tt>.
There can only be 32 bottom halves registered in total.
The functions required to manipulate bottom halves are as follows (all
exported to modules); a usage sketch follows the list:
<itemize>
<item> <tt>void init_bh(int nr, void (*routine)(void))</tt>: installs a bottom half
handler pointed to by <tt>routine</tt> argument into slot <tt>nr</tt>. The slot
ought to be enumerated in <tt>include/linux/interrupt.h</tt> in the form
<tt>XXXX_BH</tt>, e.g. <tt>TIMER_BH</tt> or <tt>TQUEUE_BH</tt>. Typically, a subsystem's
initialisation routine (<tt>init_module()</tt> for modules) installs the
required bottom half using this function.
<item> <tt>void remove_bh(int nr)</tt>: does the opposite of <tt>init_bh()</tt>, i.e.
de-installs bottom half installed at slot <tt>nr</tt>. There is no error
checking performed there, so, for example <tt>remove_bh(32)</tt> will
panic/oops the system. Typically, a subsystem's cleanup routine
(<tt>cleanup_module()</tt> for modules) uses this function to free up the slot
that can later be reused by some other subsystem. (TODO: wouldn't it
be nice to have <tt>/proc/bottom_halves</tt> list all registered bottom
halves on the system? That means <tt>global_bh_lock</tt> must be made
read/write, obviously)
<item> <tt>void mark_bh(int nr)</tt>: marks bottom half in slot <tt>nr</tt> for execution. Typically,
an interrupt handler will mark its bottom half (hence the name!) for
execution at a "safer time".
</itemize>
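A sketch of the typical pattern, assuming a hypothetical slot <tt>FOO_BH</tt>
enumerated in <tt>include/linux/interrupt.h</tt> (the driver names are invented):
<tscreen><code>
#include <linux/init.h>
#include <linux/interrupt.h>

static void foo_bh(void)
{
	/* deferred part of interrupt processing: runs with interrupts
	 * enabled, serialised against all other bottom halves */
}

static int __init foo_init(void)
{
	init_bh(FOO_BH, foo_bh);	/* install handler into slot FOO_BH */
	return 0;
}

static void __exit foo_exit(void)
{
	remove_bh(FOO_BH);		/* free up the slot again */
}

static void foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	/* ... acknowledge the hardware, save the data ... */
	mark_bh(FOO_BH);		/* run foo_bh() at a "safer time" */
}
</code></tscreen>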
Bottom halves are globally locked tasklets, so the question "when are bottom
half handlers executed?" is really "when are tasklets executed?". And the
answer is, in two places: a) on each <tt>schedule()</tt> and b) on each
interrupt/syscall return path in <tt>entry.S</tt> (TODO: therefore, the <tt>schedule()</tt>
case is really boring - it is like adding yet another very very slow interrupt,
why not get rid of the <tt>handle_softirq</tt> label from <tt>schedule()</tt> altogether?).
<sect1>Task Queues<p>
Task queues can be thought of as a dynamic extension to the old bottom halves. In
fact, in the source code they are sometimes referred to as "new" bottom
halves. More specifically, the old bottom halves discussed in the previous
section have these limitations:
<enum>
<item> There are only a fixed number (32) of them.
<item> Each bottom half can only be associated with one handler function.
<item> Bottom halves are consumed with a spinlock held so they cannot block.
</enum>
So, with task queues, an arbitrary number of functions can be chained and
processed one after another at a later time. One creates a new task queue
using the <tt>DECLARE_TASK_QUEUE()</tt> macro and queues a task onto it using
the <tt>queue_task()</tt> function. The task queue can then be processed using
<tt>run_task_queue()</tt>. Instead of creating your own task queue (and
having to consume it manually) you can use one of Linux's predefined
task queues, which are consumed at well-known points (see the sketch after
the list below):
<enum>
<item> <bf>tq_timer</bf>: the timer task queue, run on each timer interrupt
and when releasing a tty device (closing or releasing a half-opened
terminal device). Since the timer handler runs in interrupt context,
the <tt>tq_timer</tt> tasks also run in interrupt context and thus cannot block.
<item> <bf>tq_scheduler</bf>: the scheduler task queue, consumed by the scheduler (and also
when closing tty devices, like <tt>tq_timer</tt>). Since the scheduler executes
in the context of the process being re-scheduled, the <tt>tq_scheduler</tt>
tasks can do anything they like, i.e. block, use process context data
(but why would they want to?), etc.
<item> <bf>tq_immediate</bf>: this is really a bottom half <tt>IMMEDIATE_BH</tt>, so
drivers can <tt>queue_task(task, &amp;tq_immediate)</tt> and then
<tt>mark_bh(IMMEDIATE_BH)</tt> to be consumed in interrupt context.
<item> <bf>tq_disk</bf>: used by low level block device access (and RAID) to start
the actual requests. This task queue is exported to modules but shouldn't
be used except for the special purposes which it was designed for.
</enum>
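The <tt>tq_immediate</tt> pattern mentioned above, as a minimal sketch
(hypothetical driver names):
<tscreen><code>
#include <linux/tqueue.h>
#include <linux/interrupt.h>

static void foo_deferred(void *data)
{
	/* runs later, when the IMMEDIATE_BH bottom half is consumed */
}

static struct tq_struct foo_task;

static void foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	foo_task.routine = foo_deferred;
	foo_task.data = NULL;
	queue_task(&amp;foo_task, &amp;tq_immediate);
	mark_bh(IMMEDIATE_BH);		/* have the queue consumed soon */
}
</code></tscreen>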
Unless a driver uses its own task queues, it does not need to call
<tt>run_task_queue()</tt> to process the queue, except under the circumstances
explained below.
The reason <tt>tq_timer/tq_scheduler</tt> task queues are consumed not only in the
usual places but elsewhere (closing tty device is but one example) becomes
clear if one remembers that the driver can schedule tasks on the queue, and these tasks
only make sense while a particular instance of the device is still valid
- which usually means until the application closes it. So, the driver may
need to call <tt>run_task_queue()</tt> to flush the tasks it (and anyone else) has
put on the queue, because allowing them to run at a later time may make no
sense - i.e. the relevant data structures may have been freed/reused by a
different instance. This is the reason you see <tt>run_task_queue()</tt> on <tt>tq_timer</tt>
and <tt>tq_scheduler</tt> in places other than timer interrupt and <tt>schedule()</tt>
respectively.
<sect1>Tasklets<p>
Not yet, will be in future revision.
<sect1>Softirqs<p>
Not yet, will be in future revision.
<sect1>How System Calls Are Implemented on the i386 Architecture<p>
There are two mechanisms under Linux for implementing system calls:
<itemize>
<item> lcall7/lcall27 call gates;
<item> int 0x80 software interrupt.
</itemize>
Native Linux programs use int 0x80 whilst binaries from foreign flavours
of UNIX (Solaris, UnixWare 7 etc.) use the lcall7 mechanism. The name 'lcall7' is
historically misleading because it also covers lcall27 (e.g. Solaris/x86), but
the handler function is called lcall7_func.
When the system boots, the function <tt>arch/i386/kernel/traps.c:trap_init()</tt> is
called which sets up the IDT so that vector 0x80 (of type 15, dpl 3) points to
the address of system_call entry from <tt>arch/i386/kernel/entry.S</tt>.
When a userspace application makes a system call, the arguments are passed via registers
and the application executes the 'int 0x80' instruction. This causes a trap into
kernel mode and the processor jumps to the system_call entry point in <tt>entry.S</tt>.
What this does is:
<enum>
<item> Save registers.
<item> Set %ds and %es to KERNEL_DS, so that all data (and extra segment)
references are made in kernel address space.
<item> If the value of %eax is greater than or equal to <tt>NR_syscalls</tt>
(currently 256), fail with <tt>ENOSYS</tt> error.
<item> If the task is being ptraced (<tt>tsk->ptrace & PT_TRACESYS</tt>), do special
processing. This is to support programs like strace (analogue of
SVR4 <bf>truss(1)</bf>) or debuggers.
<item> Call <tt>sys_call_table+4*(syscall_number from %eax)</tt>. This table is
initialised in the same file (<tt>arch/i386/kernel/entry.S</tt>) to point to
individual system call handlers which under Linux are (usually)
prefixed with <tt>sys_</tt>, e.g. <tt>sys_open</tt>, <tt>sys_exit</tt>, etc. These C system
call handlers will find their arguments on the stack where <tt>SAVE_ALL</tt>
stored them.
<item> Enter 'system call return path'. This is a separate label because it
is used not only by int 0x80 but also by lcall7, lcall27. This is
concerned with handling tasklets (including bottom halves), checking
if a <tt>schedule()</tt> is needed (<tt>tsk->need_resched != 0</tt>), checking if there
are signals pending and if so handling them.
</enum>
Linux supports up to 6 arguments for system calls. They are passed in
%ebx, %ecx, %edx, %esi, %edi (and %ebp used temporarily, see <tt>_syscall6()</tt> in
<tt>asm-i386/unistd.h</tt>). The system call number is passed via %eax.
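To see the same convention from the other side, here is a userspace sketch
that invokes <bf>getpid(2)</bf> directly via 'int 0x80' (gcc inline assembly,
i386 only; normally the C library or the <tt>_syscallN()</tt> macros generate
this for you):
<tscreen><code>
#include <asm/unistd.h>		/* __NR_getpid */

static inline long my_getpid(void)
{
	long res;

	/* system call number goes in %eax, result comes back in %eax */
	__asm__ volatile ("int $0x80"
			  : "=a" (res)
			  : "0" (__NR_getpid));
	return res;
}
</code></tscreen>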
<sect1>Atomic Operations<p>
There are two types of atomic operations: bitmaps and <tt>atomic_t</tt>. Bitmaps are
very convenient for maintaining a concept of "allocated" or "free" units
from some large collection where each unit is identified by some number, for
example free inodes or free blocks. They are also widely used for simple
locking, for example to provide exclusive access to open a device. An example
of this can be found in <tt>arch/i386/kernel/microcode.c</tt>:
<tscreen><code>
/*
 * Bits in microcode_status. (31 bits of room for future expansion)
 */
#define MICROCODE_IS_OPEN	0	/* set if device is in use */

static unsigned long microcode_status;
</code></tscreen>
There is no need to initialise <tt>microcode_status</tt> to 0 as BSS is zero-cleared
under Linux explicitly.
<tscreen><code>
/*
 * We enforce only one user at a time here with open/close.
 */
static int microcode_open(struct inode *inode, struct file *file)
{
	if (!capable(CAP_SYS_RAWIO))
		return -EPERM;

	/* one at a time, please */
	if (test_and_set_bit(MICROCODE_IS_OPEN, &amp;microcode_status))
		return -EBUSY;

	MOD_INC_USE_COUNT;
	return 0;
}
</code></tscreen>
The operations on bitmaps are:
<itemize>
<item> <bf>void set_bit(int nr, volatile void *addr)</bf>: set bit <tt>nr</tt>
in the bitmap pointed to by <tt>addr</tt>.
<item> <bf>void clear_bit(int nr, volatile void *addr)</bf>: clear bit
<tt>nr</tt> in the bitmap pointed to by <tt>addr</tt>.
<item> <bf>void change_bit(int nr, volatile void *addr)</bf>: toggle bit
<tt>nr</tt> (if set clear, if clear set) in the bitmap pointed to by <tt>addr</tt>.
<item> <bf>int test_and_set_bit(int nr, volatile void *addr)</bf>:
atomically set bit <tt>nr</tt> and return the old bit value.
<item> <bf>int test_and_clear_bit(int nr, volatile void *addr)</bf>:
atomically clear bit <tt>nr</tt> and return the old bit value.
<item> <bf>int test_and_change_bit(int nr, volatile void *addr)</bf>:
atomically toggle bit <tt>nr</tt> and return the old bit value.
</itemize>
These operations use the <tt>LOCK_PREFIX</tt> macro, which on SMP kernels evaluates to
the bus lock instruction prefix and to nothing on UP. This guarantees atomicity
of access in an SMP environment.
Sometimes bit manipulations are not convenient and instead we need to perform
arithmetic operations - add, subtract, increment, decrement. The typical cases
are reference counts (e.g. for inodes). This facility is provided by the
<tt>atomic_t</tt> data type and the following operations (a reference-counting
sketch follows the list):
<itemize>
<item> <bf>atomic_read(&amp;v)</bf>: read the value of <tt>atomic_t</tt> variable <tt>v</tt>.
<item> <bf>atomic_set(&amp;v, i)</bf>: set the value of <tt>atomic_t</tt> variable
<tt>v</tt> to integer <tt>i</tt>.
<item> <bf>void atomic_add(int i, volatile atomic_t *v)</bf>: add integer
<tt>i</tt> to the value of atomic variable pointed to by <tt>v</tt>.
<item> <bf>void atomic_sub(int i, volatile atomic_t *v)</bf>: subtract
integer <tt>i</tt> from the value of atomic variable pointed to by <tt>v</tt>.
<item> <bf>int atomic_sub_and_test(int i, volatile atomic_t *v)</bf>:
subtract integer <tt>i</tt> from the value of atomic variable pointed to by
<tt>v</tt>; return 1 if the new value is 0, return 0 otherwise.
<item> <bf>void atomic_inc(volatile atomic_t *v)</bf>: increment the value
by 1.
<item> <bf>void atomic_dec(volatile atomic_t *v)</bf>: decrement the value
by 1.
<item> <bf>int atomic_dec_and_test(volatile atomic_t *v)</bf>: decrement
the value; return 1 if the new value is 0, return 0 otherwise.
<item> <bf>int atomic_inc_and_test(volatile atomic_t *v)</bf>: increment
the value; return 1 if the new value is 0, return 0 otherwise.
<item> <bf>int atomic_add_negative(int i, volatile atomic_t *v)</bf>: add
the value of <tt>i</tt> to <tt>v</tt> and return 1 if the result is negative. Return
0 if the result is greater than or equal to 0. This operation is used
for implementing semaphores.
</itemize>
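A typical reference-counting pattern built on these operations might look
like this (a sketch; <tt>struct foo</tt> and the function names are invented):
<tscreen><code>
#include <asm/atomic.h>
#include <linux/slab.h>

struct foo {
	atomic_t refcnt;
	/* ... payload ... */
};

static struct foo *foo_alloc(void)
{
	struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

	if (f)
		atomic_set(&amp;f->refcnt, 1);	/* creator holds one reference */
	return f;
}

static void foo_get(struct foo *f)
{
	atomic_inc(&amp;f->refcnt);		/* take another reference */
}

static void foo_put(struct foo *f)
{
	/* atomic_dec_and_test() makes the decrement and the zero test one
	 * indivisible step, so two CPUs cannot both decide to free it */
	if (atomic_dec_and_test(&amp;f->refcnt))
		kfree(f);
}
</code></tscreen>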
<sect1>Spinlocks, Read-write Spinlocks and Big-Reader Spinlocks<p>
Since the early days of Linux development (early 90s, this century),
developers have been faced with the classical problem of accessing shared data
between different types of context (user process vs
interrupt) and between different instances of the same context running on
multiple CPUs.
SMP support was added to Linux 1.3.42 on 15 Nov 1995 (the original patch
was made to 1.3.37 in October the same year).
If a critical region of code may be executed by either process context
or interrupt context, then the way to protect it using <tt>cli/sti</tt> instructions
on UP is:
<tscreen><code>
unsigned long flags;

save_flags(flags);
cli();
/* critical code */
restore_flags(flags);
</code></tscreen>
While this is ok on UP, it obviously is of no use on SMP because the same
code sequence may be executed simultaneously on another cpu, and while <tt>cli()</tt>
provides protection against races with interrupt context on each CPU individually, it
provides no protection at all against races between contexts running on different
CPUs. This is where spinlocks come in.
There are three types of spinlocks: vanilla (basic), read-write and
big-reader spinlocks. Read-write spinlocks should be used when there is a
natural tendency of 'many readers and few writers'. An example of this is
access to the list of registered filesystems (see <tt>fs/super.c</tt>). The list is
guarded by the <tt>file_systems_lock</tt> read-write spinlock because one needs exclusive
access only when registering/unregistering a filesystem, but any process can
read the file <tt>/proc/filesystems</tt> or use the <bf>sysfs(2)</bf> system call to force a
read-only scan of the file_systems list. This makes it sensible to use
read-write spinlocks. With read-write spinlocks, one can have multiple
readers at a time but only one writer and there can be no readers while
there is
a writer. Btw, it would be nice if new readers were not granted the lock while
a writer is trying to get it, i.e. if Linux correctly dealt with the issue of
potential starvation of a writer by multiple readers; this would mean that
readers must be blocked while there is a writer attempting to get the lock.
This is not currently the case and it is not obvious whether it should be
fixed - the argument to the contrary is: readers usually take the lock for a
very short time, so should they really be starved while a writer takes the
lock for potentially longer periods?
Big-reader spinlocks are a form of read-write spinlocks
heavily optimised for very light read access, with a penalty for writes.
There is a limited number of big-reader spinlocks - currently only two exist,
of which one is used only on sparc64 (global irq) and the other is used for
networking. In all other cases, where the access pattern does not fit
either of these two scenarios, one should use basic spinlocks. You cannot block
while holding any kind of spinlock.
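A read-write spinlock is used as in the following sketch (the lock name and
the protected data are invented):
<tscreen><code>
#include <linux/spinlock.h>

static rwlock_t my_rwlock = RW_LOCK_UNLOCKED;

void my_reader(void)
{
	read_lock(&amp;my_rwlock);	/* many readers may hold this at once */
	/* ... read-only traversal of the shared data ... */
	read_unlock(&amp;my_rwlock);
}

void my_writer(void)
{
	write_lock(&amp;my_rwlock);	/* excludes both readers and writers */
	/* ... modify the shared data ... */
	write_unlock(&amp;my_rwlock);
}
</code></tscreen>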
Spinlocks come in three flavours: plain, <tt>_irq()</tt> and <tt>_bh()</tt>.
<enum>
<item> Plain <tt>spin_lock()/spin_unlock()</tt>: if you know the interrupts are always
disabled or if you do not race with interrupt context (e.g. from
within interrupt handler), then you can use this one. It does not
touch interrupt state on the current CPU.
<item> <tt>spin_lock_irq()/spin_unlock_irq()</tt>: if you know that interrupts are
always enabled then you can use this version, which simply disables
(on lock) and re-enables (on unlock) interrupts on the current CPU.
For example, <tt>rtc_read()</tt> uses
<tt>spin_lock_irq(&amp;rtc_lock)</tt> (interrupts are always enabled inside
<tt>read()</tt>) whilst <tt>rtc_interrupt()</tt> uses
<tt>spin_lock(&amp;rtc_lock)</tt> (interrupts are always disabled inside
interrupt handler). Note that <tt>rtc_read()</tt> uses <tt>spin_lock_irq()</tt> and not
the more generic <tt>spin_lock_irqsave()</tt> because on entry to any system
call interrupts are always enabled.
<item> <tt>spin_lock_irqsave()/spin_unlock_irqrestore()</tt>: the strongest form,
to be used when the interrupt state is not known, but only if
interrupts matter at all, i.e. there is no point in using it if
our interrupt handlers don't execute any critical code.
</enum>
The reason you cannot use plain <tt>spin_lock()</tt> if you race against interrupt
handlers is that if you take it and an interrupt then comes in on the same CPU,
it will busy-wait for the lock forever:
the lock holder, having been interrupted, will not continue until the
interrupt handler returns.
The most common usage of a spinlock is to access a data structure shared
between user process context and interrupt handlers:
<tscreen><code>
spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

my_ioctl()
{
	spin_lock_irq(&amp;my_lock);
	/* critical section */
	spin_unlock_irq(&amp;my_lock);
}

my_irq_handler()
{
	spin_lock(&amp;my_lock);
	/* critical section */
	spin_unlock(&amp;my_lock);
}
</code></tscreen>
There are a couple of things to note about this example:
<enum>
<item> The process context, represented here as a typical driver method -
<tt>ioctl()</tt> (arguments and return values omitted for clarity), must
use <tt>spin_lock_irq()</tt> because it knows that interrupts are always
enabled while executing the device <tt>ioctl()</tt> method.
<item> Interrupt context, represented here by <tt>my_irq_handler()</tt> (again
arguments omitted for clarity) can use plain <tt>spin_lock()</tt> form because
interrupts are disabled inside an interrupt handler.
</enum>
<sect1>Semaphores and read/write Semaphores<p>
Sometimes, while accessing a shared data structure, one must perform operations
that can block, for example copy data to userspace. The locking primitive
available for such scenarios under Linux is called a semaphore. There are two
types of semaphores: basic and read-write semaphores. Depending on the
initial value of the semaphore, they can be used either for mutual exclusion
(initial value of 1) or to provide a more sophisticated type of access.
Read-write semaphores differ from basic semaphores in the same way as
read-write spinlocks differ from basic spinlocks: one can have multiple
readers at a time but only one writer and there can be no readers while there are
writers - i.e. the writer blocks all readers and new readers block while a
writer is waiting.
Also, basic semaphores can be interruptible - just use <tt>down_interruptible()</tt>
instead of the plain <tt>down()</tt> and check the
value returned from <tt>down_interruptible()</tt>: it will be non-zero if the
operation was interrupted.
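A sketch of this (the semaphore and function names are invented):
<tscreen><code>
#include <asm/semaphore.h>
#include <linux/errno.h>

static DECLARE_MUTEX(my_sem);	/* a basic semaphore, initial value 1 */

int my_op(void)
{
	/* sleep until the semaphore is ours, but let signals interrupt
	 * the wait */
	if (down_interruptible(&amp;my_sem))
		return -ERESTARTSYS;	/* interrupted by a signal */

	/* ... critical section, allowed to block ... */

	up(&amp;my_sem);
	return 0;
}
</code></tscreen>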
Using semaphores for mutual exclusion is ideal in situations where a critical
code section may call by reference unknown functions registered by other
subsystems/modules, i.e. the caller cannot know a priori whether the function
blocks or not.
A simple example of semaphore usage is in <tt>kernel/sys.c</tt>, implementation of
<bf>gethostname(2)/sethostname(2)</bf> system calls.
<tscreen><code>
asmlinkage long sys_sethostname(char *name, int len)
{
	int errno;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;
	if (len < 0 || len > __NEW_UTS_LEN)
		return -EINVAL;
	down_write(&amp;uts_sem);
	errno = -EFAULT;
	if (!copy_from_user(system_utsname.nodename, name, len)) {
		system_utsname.nodename[len] = 0;
		errno = 0;
	}
	up_write(&amp;uts_sem);
	return errno;
}

asmlinkage long sys_gethostname(char *name, int len)
{
	int i, errno;

	if (len < 0)
		return -EINVAL;
	down_read(&amp;uts_sem);
	i = 1 + strlen(system_utsname.nodename);
	if (i > len)
		i = len;
	errno = 0;
	if (copy_to_user(name, system_utsname.nodename, i))
		errno = -EFAULT;
	up_read(&amp;uts_sem);
	return errno;
}
</code></tscreen>
The points to note about this example are:
<enum>
<item> The functions may block while copying data from/to userspace in
<tt>copy_from_user()/copy_to_user()</tt>. Therefore they could not use any form
of spinlock here.
<item> The semaphore type chosen is read-write as opposed to basic because
there may be lots of concurrent <bf>gethostname(2)</bf> requests which need not
be mutually exclusive.
</enum>
Although the Linux implementation of semaphores and read-write semaphores is
very sophisticated, there are conceivable scenarios which are
not yet implemented; for example, there is no concept of an interruptible
read-write semaphore. This is obviously because there are no real-world
situations which require these exotic flavours of the primitives.
<sect1>Kernel Support for Loading Modules<p>
Linux is a monolithic operating system and despite all the modern hype about
some "advantages" offered by operating systems based on micro-kernel design,
the truth remains (quoting Linus Torvalds himself):
<tscreen>
... message passing as the fundamental operation of the OS is just an
exercise in computer science masturbation. It may feel good, but you
don't actually get anything DONE.
</tscreen>
Therefore, Linux is and will always be based on a monolithic design, which
means that all subsystems run in the same privileged mode and share the same
address space; communication between them is achieved by the usual C function
call means.
However, although separating kernel functionality into separate "processes"
as is done in micro-kernels is definitely a bad idea, separating it into
dynamically loadable on demand kernel modules is desirable in some
circumstances (e.g. on machines with low memory or for installation kernels
which could otherwise contain ISA auto-probing device drivers that are
mutually exclusive). The decision whether to include support for loadable
modules is made at compile time and is determined by the <tt>CONFIG_MODULES</tt>
option. Support for module autoloading via <tt>request_module()</tt> mechanism is
a separate compilation option (<tt>CONFIG_KMOD</tt>).
The following functionality can be implemented as loadable modules under
Linux:
<enum>
<item> Character and block device drivers, including misc device drivers.
<item> Terminal line disciplines.
<item> Virtual (regular) files in <tt>/proc</tt> and in devfs (e.g. <tt>/dev/cpu/microcode</tt>
vs <tt>/dev/misc/microcode</tt>).
<item> Binary file formats (e.g. ELF, aout, etc).
<item> Execution domains (e.g. Linux, UnixWare7, Solaris, etc).
<item> Filesystems.
<item> System V IPC.
</enum>
There are a few things that cannot be implemented as modules under Linux
(probably because it makes no sense for them to be modularised):
<enum>
<item> Scheduling algorithms.
<item> VM policies.
<item> Buffer cache, page cache and other caches.
</enum>
Linux provides several system calls to assist in loading modules:
<enum>
<item><tt>caddr_t create_module(const char *name, size_t size)</tt>: allocates
<tt>size</tt> bytes using <tt>vmalloc()</tt> and maps a module structure at the
beginning thereof. This new module is then linked into the list headed
by module_list. Only a process with <tt>CAP_SYS_MODULE</tt> can invoke this
system call, others will get <tt>EPERM</tt> returned.
<item><tt>long init_module(const char *name, struct module *image)</tt>: loads the
relocated module image and causes the module's initialisation routine
to be invoked. Only a process with <tt>CAP_SYS_MODULE</tt> can invoke this
system call, others will get <tt>EPERM</tt> returned.
<item><tt>long delete_module(const char *name)</tt>: attempts to unload the module.
If <tt>name == NULL</tt>, an attempt is made to unload all unused modules.
<item><tt>long query_module(const char *name, int which, void *buf,
size_t bufsize, size_t *ret)</tt>: returns information about a module
(or about all modules).
</enum>
The command interface available to users consists of:
<itemize>
<item><bf>insmod</bf>: insert a single module.
<item><bf>modprobe</bf>: insert a module including all other modules it depends
on.
<item><bf>rmmod</bf>: remove a module.
<item><bf>modinfo</bf>: print some information about a module, e.g. author,
description, parameters the module accepts, etc.
</itemize>
Apart from being able to load a module manually using either <bf>insmod</bf> or <bf>modprobe</bf>,
it is also possible to have the module inserted automatically by the kernel
when a particular functionality is required. The kernel interface for this
is the function called <tt>request_module(name)</tt> which is exported to modules,
so that modules can load other modules as well. <tt>request_module(name)</tt>
internally creates a kernel thread which execs the userspace command
<bf>modprobe -s -k module_name</bf>, using the standard <tt>exec_usermodehelper()</tt> kernel
interface (which is also exported to modules). The function returns 0 on
success; however, it is usually not worth checking the return code from
<tt>request_module()</tt>. Instead, the programming idiom is:
<tscreen><code>
if (check_some_feature() == NULL)
	request_module(module);
if (check_some_feature() == NULL)
	return -ENODEV;
</code></tscreen>
For example, this is done by <tt>fs/block_dev.c:get_blkfops()</tt> to load a module
<tt>block-major-N</tt> when an attempt is made to open a block device with major <tt>N</tt>.
Obviously, there is no such module called <tt>block-major-N</tt> (Linux developers
only chose sensible names for their modules) but it is mapped to a proper
module name using the file <tt>/etc/modules.conf</tt>. However, for most well-known
major numbers (and other kinds of modules) the <bf>modprobe/insmod</bf> commands
know which real module to load without needing an explicit alias statement
in <tt>/etc/modules.conf</tt>.
A good example of loading a module is inside the <bf>mount(2)</bf> system call. The
<bf>mount(2)</bf> system call accepts the filesystem type as a string which
<tt>fs/super.c:do_mount()</tt> then passes on to <tt>fs/super.c:get_fs_type()</tt>:
<tscreen><code>
static struct file_system_type *get_fs_type(const char *name)
{
	struct file_system_type *fs;

	read_lock(&amp;file_systems_lock);
	fs = *(find_filesystem(name));
	if (fs && !try_inc_mod_count(fs->owner))
		fs = NULL;
	read_unlock(&amp;file_systems_lock);
	if (!fs && (request_module(name) == 0)) {
		read_lock(&amp;file_systems_lock);
		fs = *(find_filesystem(name));
		if (fs && !try_inc_mod_count(fs->owner))
			fs = NULL;
		read_unlock(&amp;file_systems_lock);
	}
	return fs;
}
</code></tscreen>
A few things to note in this function:
<enum>
<item> First we attempt to find the filesystem with the given name amongst
those already registered. This is done under protection of
<tt>file_systems_lock</tt> taken for read (as we are not modifying the list
of registered filesystems).
<item> If such a filesystem is found then we attempt to get a new reference
to it by trying to increment its module's hold count. This always
returns 1 for statically linked filesystems or for modules not
presently being deleted. If <tt>try_inc_mod_count()</tt> returned 0 then
we consider it a failure - i.e. if the module is there but is being
deleted, it is as good as if it were not there at all.
<item> We drop the <tt>file_systems_lock</tt> because what we are about to do next
(<tt>request_module()</tt>) is a blocking operation, and therefore we can't
hold a spinlock over it. Actually, in this specific case, we would
have to drop <tt>file_systems_lock</tt> anyway, even if <tt>request_module()</tt> were
guaranteed to be non-blocking and the module loading were executed
in the same context atomically. The reason for this is that the module's
initialisation function will try to call <tt>register_filesystem()</tt>, which will
take the same <tt>file_systems_lock</tt> read-write spinlock for write.
<item> If the attempt to load was successful, then we take the
<tt>file_systems_lock</tt> spinlock and try to locate the newly registered
filesystem in the list. Note that this is slightly wrong because
it is in principle possible for a bug in the modprobe command to cause
it to coredump after it has successfully loaded the requested module, in
which case <tt>request_module()</tt> will fail even though the new filesystem is
registered, and yet <tt>get_fs_type()</tt> won't find it.
<item> If the filesystem is found and we are able to get a reference to it,
we return it. Otherwise we return NULL.
</enum>
When a module is loaded into the kernel, it can refer to any symbols that
are exported as public by the kernel using the <tt>EXPORT_SYMBOL()</tt> macro, or by
other currently loaded modules. If the module uses symbols from another
module, it is marked as depending on that module during dependency
recalculation, achieved by running the <bf>depmod -a</bf> command on boot (e.g. after
installing a new kernel).
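Exporting a symbol is a one-liner, as in this sketch (the function is
invented):
<tscreen><code>
#include <linux/module.h>

int foo_do_something(int arg)
{
	return arg * 2;		/* some service offered to other modules */
}

/* make the symbol visible to (and resolvable by) loadable modules */
EXPORT_SYMBOL(foo_do_something);
</code></tscreen>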
Usually, one must match the set of modules with the version of the kernel
interfaces they use, which under Linux simply means the "kernel version" as
there is no special kernel interface versioning mechanism in general.
However, there is a limited facility called "module versioning" or
<tt>CONFIG_MODVERSIONS</tt>, which makes it possible to avoid recompiling modules
when switching to a new kernel. What happens here is that the kernel symbol
table is treated
differently for internal access and for access from modules. The elements of
the public (i.e. exported) part of the symbol table are built by 32bit
checksumming of the C declaration. So, in order to resolve a symbol used by a
module during loading, the loader must match the full representation of the
symbol that includes the checksum; it will refuse to load the module if these
symbols differ. This
only happens when both the kernel and the module are compiled with module
versioning enabled. If either one of them uses the original symbol names,
the loader simply tries to match the kernel version declared by the module
and the one exported by the kernel and refuses to load if they differ.
<sect>Virtual Filesystem (VFS)<p>
<sect1>Inode Caches and Interaction with Dcache<p>
In order to support multiple filesystems, Linux contains a special kernel
interface level called VFS (Virtual Filesystem Switch). This is similar
to the vnode/vfs interface found in SVR4 derivatives (originally it came from
the original BSD and Sun implementations).
The Linux inode cache is implemented in a single file, <tt>fs/inode.c</tt>, which consists
of 977 lines of code. It is interesting to note that not many changes have been
made to it in the last 5-7 years: one can still recognise some of the code
when comparing the latest version with, say, 1.3.42.
The structure of Linux inode cache is as follows:
<enum>
<item> A global hashtable, <tt>inode_hashtable</tt>, where each inode is hashed by the
value of the superblock pointer and 32bit inode number. Inodes without a
superblock (<tt>inode->i_sb == NULL</tt>) are added to a doubly linked list
headed by <tt>anon_hash_chain</tt> instead. Examples of anonymous inodes
are sockets created by <tt>net/socket.c:sock_alloc()</tt>, by calling
<tt>fs/inode.c:get_empty_inode()</tt>.
<item> A global type in_use list (<tt>inode_in_use</tt>), which contains valid inodes
with <tt>i_count>0</tt> and <tt>i_nlink>0</tt>. Inodes newly allocated by
<tt>get_empty_inode()</tt> and <tt>get_new_inode()</tt> are added to the <tt>inode_in_use</tt> list.
<item> A global type unused list (<tt>inode_unused</tt>), which contains valid inodes
with <tt>i_count = 0</tt>.
<item> A per-superblock type dirty list (<tt>sb->s_dirty</tt>) which contains valid
inodes with <tt>i_count>0</tt>, <tt>i_nlink>0</tt> and <tt>i_state & I_DIRTY</tt>.
When inode is marked
dirty, it is added to the <tt>sb->s_dirty</tt> list if it is also hashed.
Maintaining a per-superblock dirty list of inodes makes it possible to quickly
sync inodes.
<item> Inode cache proper - a SLAB cache called <tt>inode_cachep</tt>. As inode
objects are allocated and freed, they are taken from and returned to
this SLAB cache.
</enum>
The type lists are anchored from <tt>inode->i_list</tt>, the hashtable from
<tt>inode->i_hash</tt>. Each inode can be on the hashtable and on one and only one
type (in_use, unused or dirty) list.
All these lists are protected by a single spinlock: <tt>inode_lock</tt>.
The inode cache subsystem is initialised when the <tt>inode_init()</tt> function is
called from <tt>init/main.c:start_kernel()</tt>. The function is marked as
<tt>__init</tt>, which means
its code is thrown away later on. It is passed a single argument - the
number of physical pages on the system. This is so that the inode cache can
configure itself depending on how much memory is available, i.e. create
a larger hashtable if there is enough memory.
The only stats information about inode cache is the number of unused inodes,
stored in <tt>inodes_stat.nr_unused</tt> and accessible to user programs via files
<tt>/proc/sys/fs/inode-nr</tt> and <tt>/proc/sys/fs/inode-state</tt>.
We can examine one of the lists from <bf>gdb</bf> running on a live kernel thus:
<tscreen><code>
(gdb) printf "%d\n", (unsigned long)(&amp;((struct inode *)0)->i_list)
8
(gdb) p inode_unused
$34 = 0xdfa992a8
(gdb) p (struct list_head)inode_unused
$35 = {next = 0xdfa992a8, prev = 0xdfcdd5a8}
(gdb) p ((struct list_head)inode_unused).prev
$36 = (struct list_head *) 0xdfcdd5a8
(gdb) p (((struct list_head)inode_unused).prev)->prev
$37 = (struct list_head *) 0xdfb5a2e8
(gdb) set $i = (struct inode *)0xdfb5a2e0
(gdb) p $i->i_ino
$38 = 0x3bec7
(gdb) p $i->i_count
$39 = {counter = 0x0}
</code></tscreen>
Note that we subtracted 8 from the address 0xdfb5a2e8 to obtain the address of
the <tt>struct inode</tt> (0xdfb5a2e0) according to the definition of <tt>list_entry()</tt>
macro from <tt>include/linux/list.h</tt>.
To understand how inode cache works, let us trace a lifetime of an inode
of a regular file on ext2 filesystem as it is opened and closed:
<tscreen><code>
fd = open("file", O_RDONLY);
close(fd);
</code></tscreen>
The <bf>open(2)</bf> system call is implemented in the <tt>fs/open.c:sys_open()</tt> function and
the real work is done by <tt>fs/open.c:filp_open()</tt> function, which is split into
two parts:
<enum>
<item> <tt>open_namei()</tt>: fills in the nameidata structure containing the dentry
and vfsmount structures.
<item> <tt>dentry_open()</tt>: given a dentry and vfsmount, this function allocates a new
<tt>struct file</tt> and links them together; it also invokes the filesystem
specific <tt>f_op->open()</tt> method which was set in <tt>inode->i_fop</tt> when inode
was read in <tt>open_namei()</tt> (which provided inode via <tt>dentry->d_inode</tt>).
</enum>
The <tt>open_namei()</tt> function interacts with dentry cache via <tt>path_walk()</tt>, which
in turn calls <tt>real_lookup()</tt>, which invokes the filesystem specific <tt>inode_operations->lookup()</tt> method.
The role of this method is to find the entry in the parent
directory with the matching name and then do <tt>iget(sb, ino)</tt> to get the
corresponding inode - which brings us to the inode cache. When the inode is
read in, the dentry is instantiated by means of <tt>d_add(dentry, inode)</tt>. While
we are at it, note that for UNIX-style filesystems which have the concept of
on-disk inode number, it is the lookup method's job to map its endianness
to current CPU format, e.g. if the inode number in raw (fs-specific) dir
entry is in little-endian 32 bit format one could do:
<tscreen><code>
unsigned long ino = le32_to_cpu(de->inode);
inode = iget(sb, ino);
d_add(dentry, inode);
</code></tscreen>
So, when we open a file we hit <tt>iget(sb, ino)</tt> which is really
<tt>iget4(sb, ino, NULL, NULL)</tt>, which does:
<enum>
<item> Attempt to find an inode with matching superblock and inode number
in the hashtable under protection of <tt>inode_lock</tt>. If inode is found,
its reference count (<tt>i_count</tt>) is incremented; if it
was 0 prior to incrementation and the inode is not dirty, it is removed from whatever
type list (<tt>inode->i_list</tt>) it is currently on (it has to be
<tt>inode_unused</tt> list, of course) and inserted into
<tt>inode_in_use</tt> type list; finally, <tt>inodes_stat.nr_unused</tt> is decremented.
<item> If inode is currently locked, we wait until it is unlocked so that
<tt>iget4()</tt> is guaranteed to return an unlocked inode.
<item> If inode was not found in the hashtable then it is the first time we
encounter this inode, so we call <tt>get_new_inode()</tt>, passing it the pointer
to the place in the hashtable where it should be inserted to.
<item> <tt>get_new_inode()</tt> allocates a new inode from the <tt>inode_cachep</tt> SLAB
cache but this operation can block (<tt>GFP_KERNEL</tt> allocation), so it
must drop the <tt>inode_lock</tt> spinlock which guards the hashtable. Since it
has dropped the spinlock, it must retry searching the inode in the
hashtable afterwards; if it is found this time, it returns (after incrementing
the reference by <tt>__iget</tt>) the one found in the hashtable and destroys
the newly allocated one. If it is still not found in the hashtable,
then the new inode we have just allocated is the one to be used;
therefore it is initialised to the required values and the fs-specific
<tt>sb->s_op->read_inode()</tt> method is invoked to populate the rest of the
inode. This brings us from inode cache back to the filesystem code -
remember that we came to the inode cache when filesystem-specific
<tt>lookup()</tt> method invoked <tt>iget()</tt>. While the <tt>s_op->read_inode()</tt> method
is reading the inode from disk, the inode is locked (<tt>i_state = I_LOCK</tt>);
it is unlocked after the <tt>read_inode()</tt> method returns and all the waiters for it are
woken up.
</enum>
Now, let's see what happens when we close this file descriptor. The <bf>close(2)</bf>
system call is implemented in <tt>fs/open.c:sys_close()</tt> function, which calls
<tt>do_close(fd, 1)</tt> which rips the descriptor out of the
process' file descriptor table (replacing it with NULL) and invokes the
<tt>filp_close()</tt> function which does
most of the work. The interesting things happen in <tt>fput()</tt>, which checks if
this was the last reference to the file, and if so calls
<tt>fs/file_table.c:_fput()</tt> which calls <tt>__fput()</tt> which is where interaction with
dcache (and therefore with inode cache - remember dcache is a Master of inode
cache!) happens. The <tt>fs/dcache.c:dput()</tt> does <tt>dentry_iput()</tt> which brings us
back to inode cache via <tt>iput(inode)</tt> so let us understand
<tt>fs/inode.c:iput(inode)</tt>:
<enum>
<item> If parameter passed to us is NULL, we do absolutely nothing and return.
<item> if there is a fs-specific <tt>sb->s_op->put_inode()</tt> method, it is invoked
immediately with no spinlocks held (so it can block).
<item> <tt>inode_lock</tt> spinlock is taken and <tt>i_count</tt> is decremented. If this was
NOT the last reference to this inode, then we simply check whether there are
so many references to it that <tt>i_count</tt> could wrap around the 32 bits
allocated to it, and if so we print a warning and return.
Note that we call <tt>printk()</tt> while holding the <tt>inode_lock</tt> spinlock -
this is fine because <tt>printk()</tt> can never block, therefore it may be called in
absolutely any context (even from interrupt handlers!).
<item> If this was the last active reference then some work needs to be done.
</enum>
The work performed by <tt>iput()</tt> on the last inode reference is rather complex
so we separate it into a list of its own:
<enum>
<item> If <tt>i_nlink == 0</tt> (e.g. the file was unlinked while we held it open)
then the inode is removed from hashtable and from its type list; if
there are any data pages held in page cache for this inode, they are
removed by means of <tt>truncate_all_inode_pages(&amp;inode->i_data)</tt>. Then
the filesystem-specific <tt>s_op->delete_inode()</tt> method is invoked,
which typically deletes the on-disk copy of the inode. If there is no
<tt>s_op->delete_inode()</tt> method registered by the filesystem (e.g. ramfs)
then we call <tt>clear_inode(inode)</tt>, which invokes <tt>s_op->clear_inode()</tt> if
registered and if inode corresponds to a block device, this device's
reference count is dropped by <tt>bdput(inode->i_bdev)</tt>.
<item> if <tt>i_nlink != 0</tt> then we check if there are other inodes in the same
hash bucket and if there is none, then if inode is not dirty we delete
it from its type list and add it to <tt>inode_unused</tt> list, incrementing
<tt>inodes_stat.nr_unused</tt>. If there are inodes in the same hashbucket
then we delete it from the type list and add to <tt>inode_unused</tt> list.
If this was an anonymous inode (NetApp .snapshot) then we delete it
from the type list and clear/destroy it completely.
</enum>
<sect1>Filesystem Registration/Unregistration<p>
The Linux kernel provides a mechanism for new filesystems to be written with
minimum effort. The historical reasons for this are:
<enum>
<item> In the world where people still use non-Linux operating systems
to protect their investment in legacy software, Linux had to provide
interoperability by supporting a great multitude of different
filesystems - most of which would not deserve to exist on their own
but only for compatibility with existing non-Linux operating systems.
<item> The interface for filesystem writers had to be very simple so that
people could try to reverse engineer existing proprietary filesystems
by writing read-only versions of them. Therefore Linux VFS makes it
very easy to implement read-only filesystems; 95% of the work is
to finish them by adding full write-support. As a concrete example,
I wrote read-only BFS filesystem for Linux in about 10 hours, but it
took several weeks to complete it to have full write support (and
even today some purists claim that it is not complete because "it
doesn't have compactification support").
<item> The VFS interface is exported, and therefore all Linux filesystems can
be implemented as modules.
</enum>
Let us consider the steps required to implement a filesystem under Linux.
The code to implement a filesystem can be either a dynamically loadable
module or statically linked into the kernel, and the way it is done under
Linux is very transparent. All that is needed is to fill in a
<tt>struct file_system_type</tt> structure and register it with the VFS using
the <tt>register_filesystem()</tt> function as in the following example from
<tt>fs/bfs/inode.c</tt>:
<tscreen><code>
#include <linux/module.h>
#include <linux/init.h>

static struct super_block *bfs_read_super(struct super_block *, void *, int);

static DECLARE_FSTYPE_DEV(bfs_fs_type, "bfs", bfs_read_super);

static int __init init_bfs_fs(void)
{
	return register_filesystem(&amp;bfs_fs_type);
}

static void __exit exit_bfs_fs(void)
{
	unregister_filesystem(&amp;bfs_fs_type);
}

module_init(init_bfs_fs)
module_exit(exit_bfs_fs)
</code></tscreen>
The <tt>module_init()/module_exit()</tt> macros ensure that, when BFS is compiled as a
module, the functions <tt>init_bfs_fs()</tt> and <tt>exit_bfs_fs()</tt> turn into <tt>init_module()</tt>
and <tt>cleanup_module()</tt> respectively; if BFS is statically linked into the kernel,
the <tt>exit_bfs_fs()</tt> code vanishes as it is unnecessary.
The <tt>struct file_system_type</tt> is declared in <tt>include/linux/fs.h</tt>:
<tscreen><code>
struct file_system_type {
	const char *name;
	int fs_flags;
	struct super_block *(*read_super) (struct super_block *, void *, int);
	struct module *owner;
	struct vfsmount *kern_mnt; /* For kernel mount, if it's FS_SINGLE fs */
	struct file_system_type * next;
};
</code></tscreen>
The fields thereof are explained thus:
<itemize>
<item><bf>name</bf>: human readable name, appears in <tt>/proc/filesystems</tt> file
and is used as a key to find a filesystem by its name; this same name is
used for the filesystem type in <bf>mount(2)</bf>, and it should be unique: there
can (obviously) be only one filesystem with a given name. For modules,
<tt>name</tt> points into the module's address space and is not copied: this means <bf>cat
/proc/filesystems</bf> can oops if the module was unloaded but the filesystem is
still registered.
<item><bf>fs_flags</bf>: one or more (ORed) of the flags: <tt>FS_REQUIRES_DEV</tt>
for filesystems that can only be mounted on a block device, <tt>FS_SINGLE</tt>
for filesystems that can have only one superblock, <tt>FS_NOMOUNT</tt> for
filesystems that cannot be mounted from userspace by means of <bf>mount(2)</bf>
system call: they can however be mounted internally using <tt>kern_mount()</tt>
interface, e.g. pipefs.
<item><bf>read_super</bf>: a pointer to the function that reads the super
block during the mount operation. This function is required: if it is not
provided, the mount operation (whether from userspace or in-kernel) will
always fail except in the <tt>FS_SINGLE</tt> case, where it will Oops in
<tt>get_sb_single()</tt>, trying to dereference a NULL pointer in
<tt>fs_type->kern_mnt->mnt_sb</tt> (since <tt>fs_type->kern_mnt = NULL</tt>).
<item><bf>owner</bf>: pointer to the module that implements this filesystem.
If the filesystem is statically linked into the kernel then this is
NULL. You don't need to set this manually as the macro <tt>THIS_MODULE</tt>
does the right thing automatically.
<item><bf>kern_mnt</bf>: for <tt>FS_SINGLE</tt> filesystems only. This is set by
<tt>kern_mount()</tt> (TODO: <tt>kern_mount()</tt> should refuse to mount filesystems
if <tt>FS_SINGLE</tt> is not set).
<item><bf>next</bf>: linkage into singly-linked list headed by <tt>file_systems</tt>
(see <tt>fs/super.c</tt>). The list is protected by the <tt>file_systems_lock</tt>
read-write spinlock and functions <tt>register/unregister_filesystem()</tt>
modify it by linking and unlinking the entry from the list.
</itemize>
The job of the <tt>read_super()</tt> function is to fill in the fields of the superblock,
allocate the root inode and initialise any fs-private information associated with
this mounted instance of the filesystem. So, typically the <tt>read_super()</tt> would
do the following (a condensed sketch follows the list):
<enum>
<item> Read the superblock from the device specified via the <tt>sb->s_dev</tt> argument,
using the buffer cache <tt>bread()</tt> function. If it anticipates reading a few
more subsequent metadata blocks immediately, then it makes sense to
use <tt>breada()</tt> to schedule reading the extra blocks asynchronously.
<item> Verify that the superblock contains a valid magic number and overall
"looks" sane.
<item> Initialise <tt>sb->s_op</tt> to point to <tt>struct super_block_operations</tt>
structure. This structure contains filesystem-specific functions
implementing operations like "read inode", "delete inode", etc.
<item> Allocate root inode and root dentry using <tt>d_alloc_root()</tt>.
<item> If the filesystem is not mounted read-only then set <tt>sb->s_dirt</tt> to 1
and mark the buffer containing superblock dirty (TODO: why do we
do this? I did it in BFS because MINIX did it...)
</enum>
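Putting these steps together, here is a condensed sketch of a <tt>read_super()</tt>
for a hypothetical "foofs" (the on-disk layout, magic number and root inode
number are all invented; error handling is abbreviated):
<tscreen><code>
#include <linux/fs.h>

#define FOOFS_MAGIC	0xf00f		/* invented */
#define FOOFS_ROOT_INO	2		/* invented */

static struct super_operations foofs_sops;	/* read_inode() etc. */

static struct super_block *foofs_read_super(struct super_block *s,
					    void *data, int silent)
{
	struct buffer_head *bh;
	struct inode *root;

	/* 1) read the on-disk superblock through the buffer cache */
	bh = bread(s->s_dev, 0, BLOCK_SIZE);
	if (!bh)
		return NULL;

	/* 2) sanity-check the magic number */
	if (*(unsigned int *)bh->b_data != FOOFS_MAGIC) {
		brelse(bh);
		return NULL;
	}

	/* 3) fill in the generic fields and the operations vector */
	s->s_blocksize = BLOCK_SIZE;
	s->s_magic = FOOFS_MAGIC;
	s->s_op = &amp;foofs_sops;

	/* 4) allocate the root inode and root dentry */
	root = iget(s, FOOFS_ROOT_INO);
	s->s_root = d_alloc_root(root);
	if (!s->s_root) {
		iput(root);
		brelse(bh);
		return NULL;
	}
	brelse(bh);
	return s;
}
</code></tscreen>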
<sect1>File Descriptor Management<p>
Under Linux there are several levels of indirection between user file
descriptor and the kernel inode structure. When a process makes <bf>open(2)</bf>
system call, the kernel returns a small non-negative integer which can be
used for subsequent I/O operations on this file. This integer is an index
into an array of pointers to <tt>struct file</tt>. Each file structure points to
a dentry via <tt>file->f_dentry</tt>. And each dentry points to an inode via
<tt>dentry->d_inode</tt>.
Each task contains a field <tt>tsk->files</tt> which is a pointer to
<tt>struct files_struct</tt> defined in <tt>include/linux/sched.h</tt>:
<tscreen><code>
/*
 * Open file table structure
 */
struct files_struct {
	atomic_t count;
	rwlock_t file_lock;
	int max_fds;
	int max_fdset;
	int next_fd;
	struct file ** fd;	/* current fd array */
	fd_set *close_on_exec;
	fd_set *open_fds;
	fd_set close_on_exec_init;
	fd_set open_fds_init;
	struct file * fd_array[NR_OPEN_DEFAULT];
};
</code></tscreen>
The <tt>file->f_count</tt> is a reference count, incremented by <tt>get_file()</tt> (usually
called by <tt>fget()</tt>) and decremented by <tt>fput()</tt> and by <tt>put_filp()</tt>. The difference
between <tt>fput()</tt> and <tt>put_filp()</tt> is that <tt>fput()</tt> does more of the work usually needed
for regular files, such as releasing flock locks, releasing the dentry, etc, while
<tt>put_filp()</tt> only manipulates the file table structures, i.e. it decrements the
count, removes the file from the <tt>anon_list</tt> and adds it to the <tt>free_list</tt>,
under protection of the <tt>files_lock</tt> spinlock.
The <tt>tsk->files</tt> can be shared between parent and child if the child thread
was created using the <tt>clone()</tt> system call with <tt>CLONE_FILES</tt> set in the clone flags
argument. This can be seen in <tt>kernel/fork.c:copy_files()</tt> (called by
<tt>do_fork()</tt>), which only increments the <tt>files->count</tt> if <tt>CLONE_FILES</tt> is set,
instead of copying the file descriptor table in the time-honoured
tradition of classical UNIX <bf>fork(2)</bf>.
When a file is opened, the file structure allocated for it is installed into
the <tt>current->files->fd[fd]</tt> slot and bit <tt>fd</tt> is set in the bitmap
<tt>current->files->open_fds</tt>. All this is done under the write protection of
the <tt>current->files->file_lock</tt> read-write spinlock. When the descriptor is
closed, bit <tt>fd</tt> is cleared in <tt>current->files->open_fds</tt> and
<tt>current->files->next_fd</tt> is set equal to <tt>fd</tt> as a hint for finding the
first unused descriptor next time this process wants to open a file.
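The chain of indirections can be seen in a short sketch that, given a user
file descriptor, walks from the descriptor to the inode (this is essentially
what many system calls do via <tt>fget()</tt>; the function name is invented):
<tscreen><code>
#include <linux/kernel.h>
#include <linux/file.h>
#include <linux/fs.h>

void show_fd_chain(unsigned int fd)
{
	struct file *file = fget(fd);	/* fd -> struct file, takes a reference */

	if (file) {
		struct dentry *dentry = file->f_dentry;	/* file -> dentry */
		struct inode *inode = dentry->d_inode;	/* dentry -> inode */

		printk("fd %u -> inode %lu\n", fd, inode->i_ino);
		fput(file);	/* drop the reference taken by fget() */
	}
}
</code></tscreen>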
<sect1>File Structure Management<p>
The file structure is declared in <tt>include/linux/fs.h</tt>:
<tscreen><code>
struct fown_struct {
	int pid;		/* pid or -pgrp where SIGIO should be sent */
	uid_t uid, euid;	/* uid/euid of process setting the owner */
	int signum;		/* posix.1b rt signal to be delivered on IO */
};

struct file {
	struct list_head f_list;
	struct dentry *f_dentry;
	struct vfsmount *f_vfsmnt;
	struct file_operations *f_op;
	atomic_t f_count;
	unsigned int f_flags;
	mode_t f_mode;
	loff_t f_pos;
	unsigned long f_reada, f_ramax, f_raend, f_ralen, f_rawin;
	struct fown_struct f_owner;
	unsigned int f_uid, f_gid;
	int f_error;
	unsigned long f_version;

	/* needed for tty driver, and maybe others */
	void *private_data;
};
</code></tscreen>
Let us look at the various fields of <tt>struct file</tt>:
<enum>
<item><bf>f_list</bf>: this field links the file structure onto one (and only one)
of the following lists: a) <tt>sb->s_files</tt>, the list of all open files on this filesystem;
if the corresponding inode is not anonymous, then <tt>dentry_open()</tt> (called
by <tt>filp_open()</tt>) links the file into this list;
b) <tt>fs/file_table.c:free_list</tt>, containing unused file structures;
c) <tt>fs/file_table.c:anon_list</tt>; when a new file structure is created by
<tt>get_empty_filp()</tt> it is placed on this list. All these lists are
protected by the <tt>files_lock</tt> spinlock.
<item><bf>f_dentry</bf>: the dentry corresponding to this file. The dentry
is created at nameidata lookup time by <tt>open_namei()</tt> (or
rather <tt>path_walk()</tt>
which it calls) but the actual <tt>file->f_dentry</tt> field is set by
<tt>dentry_open()</tt> to contain the dentry thus found.
<item><bf>f_vfsmnt</bf>: the pointer to <tt>vfsmount</tt> structure of the filesystem
containing the file. This is set by <tt>dentry_open()</tt> but is found as part
of nameidata lookup by <tt>open_namei()</tt> (or rather <tt>path_init()</tt> which it
calls).
<item><bf>f_op</bf>: the pointer to <tt>file_operations</tt> which contains various
methods that can be invoked on the file. This is copied from
<tt>inode->i_fop</tt> which is placed there by filesystem-specific
<tt>s_op->read_inode()</tt> method during nameidata lookup. We will look at
<tt>file_operations</tt> methods in detail later on in this section.
<item><bf>f_count</bf>: reference count manipulated by
<tt>get_file/put_filp/fput</tt>.
<item><bf>f_flags</bf>: <tt>O_XXX</tt> flags from the <bf>open(2)</bf> system call, copied there
by <tt>dentry_open()</tt> (with slight modifications by <tt>filp_open()</tt>) after
clearing <tt>O_CREAT</tt>, <tt>O_EXCL</tt>, <tt>O_NOCTTY</tt> and <tt>O_TRUNC</tt> - there is no point in
storing these flags permanently since they cannot be modified by
<tt>F_SETFL</tt> (or queried by <tt>F_GETFL</tt>) <bf>fcntl(2)</bf> calls.
<item><bf>f_mode</bf>: a combination of userspace flags and mode, set
by <tt>dentry_open()</tt>. The point of the conversion is to store read and
write access in separate bits so one could do easy checks like
<tt>(f_mode & FMODE_WRITE)</tt> and <tt>(f_mode & FMODE_READ)</tt>.
<item><bf>f_pos</bf>: a current file position for next read or write to
the file. Under i386 it is of type <tt>long long</tt>, i.e. a 64bit value.
<item><bf>f_reada, f_ramax, f_raend, f_ralen, f_rawin</bf>: to support
readahead - too complex to be discussed by mortals ;)
<item><bf>f_owner</bf>: owner of file I/O to receive asynchronous I/O
notifications via <tt>SIGIO</tt> mechanism (see <tt>fs/fcntl.c:kill_fasync()</tt>).
<item><bf>f_uid, f_gid</bf>: set to the user id and group id of the process that
opened the file, when the file structure is created in
<tt>get_empty_filp()</tt>. If the file is a socket, they are used by ipv4 netfilter.
<item><bf>f_error</bf>: used by NFS client to return write errors. It is
set in <tt>fs/nfs/file.c</tt> and checked in <tt>mm/filemap.c:generic_file_write()</tt>.
<item><bf>f_version</bf>: versioning mechanism for invalidating caches,
incremented (using global <tt>event</tt>) whenever <tt>f_pos</tt> changes.
<item><bf>private_data</bf>: private per-file data which can be used by
filesystems (e.g. coda stores credentials here) or by device drivers.
Device drivers (in the presence of devfs) could use this field to
differentiate between multiple instances instead of the classical
minor number encoded in <tt>file->f_dentry->d_inode->i_rdev</tt>.
</enum>
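As a hypothetical illustration of that last point, a character device driver
might attach its per-instance state at <bf>open(2)</bf> time (all <tt>mydev_*</tt>
names are invented):
<tscreen><code>
/* Hypothetical driver open() method: attach per-instance state to
 * file->private_data so later methods need not decode minor numbers.
 */
struct mydev_state {
        int           minor;
        unsigned long flags;
};

static int mydev_open(struct inode *inode, struct file *file)
{
        struct mydev_state *st = kmalloc(sizeof(*st), GFP_KERNEL);

        if (!st)
                return -ENOMEM;
        st->minor = MINOR(inode->i_rdev);
        st->flags = 0;
        file->private_data = st;        /* freed in mydev_release() */
        return 0;
}
</code></tscreen>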
Now let us look at <tt>file_operations</tt> structure which contains the methods that
can be invoked on files. Let us recall that it is copied from <tt>inode->i_fop</tt>
where it is set by <tt>s_op->read_inode()</tt> method. It is declared in
<tt>include/linux/fs.h</tt>:
<tscreen><code>
struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, struct dentry *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
};
</code></tscreen>
<enum>
<item><bf>owner</bf>: a pointer to the module that owns the subsystem in
question. Only drivers need to set it to <tt>THIS_MODULE</tt>, filesystems can
happily ignore it because their module counts are controlled at
mount/umount time whilst the drivers need to control it at open/release
time.
<item><bf>llseek</bf>: implements the <bf>lseek(2)</bf> system call. Usually it is
omitted and <tt>fs/read_write.c:default_llseek()</tt> is used, which does the
right thing (TODO: force all those who set it to NULL currently to use
default_llseek - that way we save an <tt>if()</tt> in <tt>llseek()</tt>)
<item><bf>read</bf>: implements the <bf>read(2)</bf> system call. Filesystems can use
<tt>mm/filemap.c:generic_file_read()</tt> for regular files and
<tt>fs/read_write.c:generic_read_dir()</tt> (which simply returns <tt>-EISDIR</tt>)
for directories here.
<item><bf>write</bf>: implements <bf>write(2)</bf> system call. Filesystems can use
<tt>mm/filemap.c:generic_file_write()</tt> for regular files and ignore it for
directories here.
<item><bf>readdir</bf>: used by filesystems; it is ignored for regular files
and implements the <bf>readdir(2)</bf> and <bf>getdents(2)</bf> system calls for directories.
<item><bf>poll</bf>: implements <bf>poll(2)</bf> and <bf>select(2)</bf> system calls.
<item><bf>ioctl</bf>: implements driver or filesystem-specific
ioctls. Note that generic file ioctls like <tt>FIBMAP</tt>, <tt>FIGETBSZ</tt>, <tt>FIONREAD</tt>
are implemented by higher levels so they never reach the <tt>f_op->ioctl()</tt>
method.
<item><bf>mmap</bf>: implements the <bf>mmap(2)</bf> system call. Filesystems can use
<bf>generic_file_mmap</bf> here for regular files and ignore it on directories.
<item><bf>open</bf>: called at <bf>open(2)</bf> time by <tt>dentry_open()</tt>. Filesystems
rarely use this, e.g. coda tries to cache the file locally at open
time.
<item><bf>flush</bf>: called at each <bf>close(2)</bf> of this file, not necessarily
the last one (see <tt>release()</tt> method below). The only filesystem that
uses this is NFS client to flush all dirty pages. Note that this can
return an error which will be passed back to userspace which made the
<bf>close(2)</bf> system call.
<item><bf>release</bf>: called at the last <bf>close(2)</bf> of this file, i.e. when
<tt>file->f_count</tt> reaches 0. Although defined as returning int, the return
value is ignored by VFS (see <tt>fs/file_table.c:__fput()</tt>).
<item><bf>fsync</bf>: maps directly to <bf>fsync(2)/fdatasync(2)</bf> system calls,
with the last argument specifying whether it is fsync or fdatasync.
Almost no work is done by VFS around this, except to map file
descriptor to a file structure (<tt>file = fget(fd)</tt>) and down/up
<tt>inode->i_sem</tt> semaphore. Ext2 filesystem currently ignores the last
argument and does exactly the same for <bf>fsync(2)</bf> and <bf>fdatasync(2)</bf>.
<item><bf>fasync</bf>: this method is called when <tt>file->f_flags & FASYNC</tt>
changes.
<item><bf>lock</bf>: the filesystem-specific portion of the POSIX <bf>fcntl(2)</bf>
file region locking mechanism. The only bug here is that, because it is
called before the fs-independent portion (<tt>posix_lock_file()</tt>), if it
succeeds but the standard POSIX lock code fails then the lock will never be
released at the fs-dependent level.
<item><bf>readv</bf>: implements <bf>readv(2)</bf> system call.
<item><bf>writev</bf>: implements <bf>writev(2)</bf> system call.
</enum>
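Tying these methods together, a simple disk filesystem can declare its
regular-file operations mostly in terms of the generic helpers mentioned
above. A sketch in the 2.4 initializer style, with <tt>myfs_sync_file</tt>
standing in for a filesystem-specific method:
<tscreen><code>
/* What a simple disk filesystem's file_operations might look like;
 * the generic VFS helpers do most of the work for regular files.
 */
static struct file_operations myfs_file_operations = {
        llseek:         default_llseek,
        read:           generic_file_read,
        write:          generic_file_write,
        mmap:           generic_file_mmap,
        fsync:          myfs_sync_file,         /* fs-specific */
};
</code></tscreen>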
<sect1>Superblock and Mountpoint Management<p>
Under Linux, information about mounted filesystems is kept in two separate
structures - <tt>super_block</tt> and <tt>vfsmount</tt>. The reason for this is that Linux
allows mounting the same filesystem (block device) under multiple mount
points, which means that the same <tt>super_block</tt> can correspond to multiple
<tt>vfsmount</tt> structures.
Let us look at <tt>struct super_block</tt> first, declared in <tt>include/linux/fs.h</tt>:
<tscreen><code>
struct super_block {
struct list_head s_list; /* Keep this first */
kdev_t s_dev;
unsigned long s_blocksize;
unsigned char s_blocksize_bits;
unsigned char s_lock;
unsigned char s_dirt;
struct file_system_type *s_type;
struct super_operations *s_op;
struct dquot_operations *dq_op;
unsigned long s_flags;
unsigned long s_magic;
struct dentry *s_root;
wait_queue_head_t s_wait;
struct list_head s_dirty; /* dirty inodes */
struct list_head s_files;
struct block_device *s_bdev;
struct list_head s_mounts; /* vfsmount(s) of this one */
struct quota_mount_options s_dquot; /* Diskquota specific options */
union {
struct minix_sb_info minix_sb;
struct ext2_sb_info ext2_sb;
..... all filesystems that need sb-private info ...
void *generic_sbp;
} u;
/*
* The next field is for VFS *only*. No filesystems have any business
* even looking at it. You had been warned.
*/
struct semaphore s_vfs_rename_sem; /* Kludge */
/* The next field is used by knfsd when converting a (inode number based)
* file handle into a dentry. As it builds a path in the dcache tree from
* the bottom up, there may for a time be a subpath of dentrys which is not
 * connected to the main tree. This semaphore ensures that there is only ever
 * one such free path per filesystem. Note that unconnected files (or other
 * non-directories) are allowed, but not unconnected directories.
*/
struct semaphore s_nfsd_free_path_sem;
};
</code></tscreen>
The various fields in the <tt>super_block</tt> structure are:
<enum>
<item><bf>s_list</bf>: a doubly-linked list of all active superblocks; note
I don't say "of all mounted filesystems" because under Linux one can
have multiple instances of a mounted filesystem corresponding to a
single superblock.
<item><bf>s_dev</bf>: for filesystems which require a block device to be mounted
on, i.e. for <tt>FS_REQUIRES_DEV</tt> filesystems, this is the <tt>i_dev</tt> of the
block device. For others (called anonymous filesystems) this is an
integer <tt>MKDEV(UNNAMED_MAJOR, i)</tt> where <tt>i</tt> is the first unset bit in
the <tt>unnamed_dev_in_use</tt> array, between 1 and 255 inclusive. See
<tt>fs/super.c:get_unnamed_dev()/put_unnamed_dev()</tt>. It has been suggested
many times that anonymous filesystems should not use <tt>s_dev</tt> field.
<item><bf>s_blocksize, s_blocksize_bits</bf>: blocksize and log2(blocksize).
<item><bf>s_lock</bf>: indicates whether superblock is currently locked by
<tt>lock_super()/unlock_super()</tt>.
<item><bf>s_dirt</bf>: set when superblock is changed, and cleared whenever
it is written back to disk.
<item><bf>s_type</bf>: pointer to <tt>struct file_system_type</tt> of the
corresponding filesystem. The filesystem's <tt>read_super()</tt> method doesn't need
to set it as VFS <tt>fs/super.c:read_super()</tt> sets it for you if the
fs-specific <tt>read_super()</tt> succeeds and resets it to NULL if it fails.
<item><bf>s_op</bf>: pointer to <tt>super_operations</tt> structure which contains
fs-specific methods to read/write inodes etc. It is the job of
filesystem's <tt>read_super()</tt> method to initialise <tt>s_op</tt> correctly.
<item><bf>dq_op</bf>: disk quota operations.
<item><bf>s_flags</bf>: superblock flags.
<item><bf>s_magic</bf>: filesystem's magic number. Used by minix filesystem
to differentiate between multiple flavours of itself.
<item><bf>s_root</bf>: dentry of the filesystem's root. It is the job of
<tt>read_super()</tt> to read the root inode from the disk and pass it to
<tt>d_alloc_root()</tt> to allocate the dentry and instantiate it. Some
filesystems spell "root" other than "/" and so use more generic
<tt>d_alloc()</tt> function to bind the dentry to a name, e.g. pipefs mounts
itself on "pipe:" as its own root instead of "/".
<item><bf>s_wait</bf>: waitqueue of processes waiting for superblock to be
unlocked.
<item><bf>s_dirty</bf>: a list of all dirty inodes. Recall that if an inode
is dirty (<tt>inode->i_state & I_DIRTY</tt>) then it is on the superblock-specific
dirty list linked via <tt>inode->i_list</tt>.
<item><bf>s_files</bf>: a list of all open files on this superblock. Useful
for deciding whether filesystem can be remounted read-only, see
<tt>fs/file_table.c:fs_may_remount_ro()</tt> which goes through <tt>sb->s_files</tt> list
and denies remounting if there are files opened for write
(<tt>file->f_mode & FMODE_WRITE</tt>) or files with pending
unlink (<tt>inode->i_nlink == 0</tt>).
<item><bf>s_bdev</bf>: for <tt>FS_REQUIRES_DEV</tt>, this points to the block_device
structure describing the device the filesystem is mounted on.
<item><bf>s_mounts</bf>: a list of all <tt>vfsmount</tt> structures, one for each
mounted instance of this superblock.
<item><bf>s_dquot</bf>: more diskquota stuff.
</enum>
The superblock operations are described in the <tt>super_operations</tt> structure
declared in <tt>include/linux/fs.h</tt>:
<tscreen><code>
struct super_operations {
void (*read_inode) (struct inode *);
void (*write_inode) (struct inode *, int);
void (*put_inode) (struct inode *);
void (*delete_inode) (struct inode *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
int (*statfs) (struct super_block *, struct statfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
};
</code></tscreen>
<enum>
<item><bf>read_inode</bf>: reads the inode from the filesystem. It is only
called from <tt>fs/inode.c:get_new_inode()</tt> from <tt>iget4()</tt> (and therefore
<tt>iget()</tt>). If a filesystem wants to use <tt>iget()</tt> then <tt>read_inode()</tt> must be
implemented - otherwise <tt>get_new_inode()</tt> will panic.
While inode is being read it is locked (<tt>inode->i_state = I_LOCK</tt>). When
the function returns, all waiters on <tt>inode->i_wait</tt> are woken up. The job
of the filesystem's <tt>read_inode()</tt> method is to locate the disk block which
contains the inode to be read and use buffer cache <tt>bread()</tt> function to
read it in and initialise the various fields of inode structure, for
example the <tt>inode->i_op</tt> and <tt>inode->i_fop</tt> so that VFS level knows what
operations can be performed on the inode or corresponding file.
Examples of filesystems that don't implement <tt>read_inode()</tt> are ramfs and
pipefs. For example, ramfs has its own inode-generating function
<tt>ramfs_get_inode()</tt> with all the inode operations calling it as needed. (A
sketch of a typical disk filesystem's <tt>read_inode()</tt> follows this list.)
<item><bf>write_inode</bf>: write inode back to disk. Similar to
<tt>read_inode()</tt> in that it needs to locate the relevant block on
disk and interact with buffer cache by calling
<tt>mark_buffer_dirty(bh)</tt>. This method is called on dirty inodes
(those marked dirty with <tt>mark_inode_dirty()</tt>) when the inode needs
to be sync'd either individually or as part of syncing the
entire filesystem.
<item><bf>put_inode</bf>: called whenever the reference count is decreased.
<item><bf>delete_inode</bf>: called whenever both <tt>inode->i_count</tt> and
<tt>inode->i_nlink</tt> reach 0. Filesystem deletes the on-disk copy of the
inode and calls <tt>clear_inode()</tt> on VFS inode to "terminate it with
extreme prejudice".
<item><bf>put_super</bf>: called at the last stages of <bf>umount(2)</bf> system
call to notify the filesystem that any private information held by
the filesystem about this instance should be freed. Typically this
would <tt>brelse()</tt> the block containing the superblock and <tt>kfree()</tt> any
bitmaps allocated for free blocks, inodes, etc.
<item><bf>write_super</bf>: called when superblock needs to be
written back to disk. It should find the block containing the
superblock (usually kept in <tt>sb-private</tt> area) and
<tt>mark_buffer_dirty(bh)</tt> . It should also clear <tt>sb->s_dirt</tt> flag.
<item><bf>statfs</bf>: implements <bf>fstatfs(2)/statfs(2)</bf> system calls. Note
that the pointer to <tt>struct statfs</tt> passed as argument is a kernel
pointer, not a user pointer so we don't need to do any I/O to
userspace. If not implemented then <tt>statfs(2)</tt> will fail with <tt>ENOSYS</tt>.
<item><bf>remount_fs</bf>: called whenever filesystem is being remounted.
<item><bf>clear_inode</bf>: called from VFS level <tt>clear_inode()</tt>. Filesystems
that attach private data to inode structure (via <tt>generic_ip</tt> field) must
free it here.
<item><bf>umount_begin</bf>: called during forced umount to notify the
filesystem beforehand, so that it can do its best to make sure that
nothing keeps the filesystem busy. Currently used only by NFS. This
has nothing to do with the idea of generic VFS level forced umount
support.
</enum>
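As promised, here is a sketch of what a disk filesystem's <tt>read_inode()</tt>
method typically looks like. Again, all <tt>myfs_*</tt> and <tt>MYFS_*</tt> names
are invented and error handling is minimal:
<tscreen><code>
/* Hypothetical s_op->read_inode(): find the on-disk inode via the
 * buffer cache and initialise the in-core inode from it.
 */
static void myfs_read_inode(struct inode *inode)
{
        struct buffer_head *bh;
        struct myfs_dinode *di;

        bh = bread(inode->i_dev, MYFS_INO_BLOCK(inode->i_ino), MYFS_BSIZE);
        if (!bh) {
                make_bad_inode(inode);
                return;
        }
        di = (struct myfs_dinode *)bh->b_data + MYFS_INO_OFFSET(inode->i_ino);

        inode->i_mode  = di->di_mode;
        inode->i_size  = di->di_size;
        inode->i_nlink = di->di_nlink;
        /* ... other fields ... */
        inode->i_op  = &amp;myfs_inode_operations;
        inode->i_fop = &amp;myfs_file_operations;
        brelse(bh);
}
</code></tscreen>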
So, let us look at what happens when we mount an on-disk (<tt>FS_REQUIRES_DEV</tt>)
filesystem. The implementation of the <bf>mount(2)</bf> system call is in
<tt>fs/super.c:sys_mount()</tt> which is just a wrapper that copies the options,
filesystem type and device name for the <tt>do_mount()</tt> function which does the
real work:
<enum>
<item>The filesystem driver is loaded if needed and its module's reference count
is incremented. Note that during the mount operation, the filesystem
module's reference count is incremented twice - once by <tt>do_mount()</tt>
calling <tt>get_fs_type()</tt> and once by <tt>get_sb_dev()</tt> calling <tt>get_filesystem()</tt>
if <tt>read_super()</tt> was successful. The first increment is to prevent
module unloading while we are inside <tt>read_super()</tt> method and the second
increment is to indicate that the module is in use by this mounted
instance. Obviously, <tt>do_mount()</tt> decrements the count before returning, so
overall the count only grows by 1 after each mount.
<item>Since, in our case, <tt>fs_type->fs_flags & FS_REQUIRES_DEV</tt> is true, the
superblock is initialised by a call to <tt>get_sb_bdev()</tt> which obtains
the reference to the block device and interacts with the filesystem's
<tt>read_super()</tt> method to fill in the superblock. If all goes well, the
<tt>super_block</tt> structure is initialised and we have an extra reference
to the filesystem's module and a reference to the underlying block
device.
<item>A new <tt>vfsmount</tt> structure is allocated and linked to the <tt>sb->s_mounts</tt> list
and to the global <tt>vfsmntlist</tt> list. The <tt>vfsmount</tt> field <tt>mnt_instances</tt>
makes it possible to find all instances mounted on the same superblock as this
one. The <tt>mnt_list</tt> field makes it possible to find all instances for all
superblocks system-wide. The <tt>mnt_sb</tt> field
points to this superblock and <tt>mnt_root</tt> has a new reference to the
<tt>sb->s_root</tt> dentry.
</enum>
<sect1>Example Virtual Filesystem: pipefs<p>
As a simple example of a Linux filesystem that does not require a block device
for mounting, let us consider pipefs from <tt>fs/pipe.c</tt>. The filesystem's preamble
is rather straightforward and requires little explanation:
<tscreen><code>
static DECLARE_FSTYPE(pipe_fs_type, "pipefs", pipefs_read_super,
FS_NOMOUNT|FS_SINGLE);
static int __init init_pipe_fs(void)
{
int err = register_filesystem(&amp;pipe_fs_type);
if (!err) {
pipe_mnt = kern_mount(&amp;pipe_fs_type);
err = PTR_ERR(pipe_mnt);
if (!IS_ERR(pipe_mnt))
err = 0;
}
return err;
}
static void __exit exit_pipe_fs(void)
{
unregister_filesystem(&amp;pipe_fs_type);
kern_umount(pipe_mnt);
}
module_init(init_pipe_fs)
module_exit(exit_pipe_fs)
</code></tscreen>
The filesystem is of type <tt>FS_NOMOUNT|FS_SINGLE</tt>, which means it cannot be
mounted from userspace and can only have one superblock system-wide. The
<tt>FS_SINGLE</tt> flag also means that it must be mounted via <tt>kern_mount()</tt> after
it is successfully registered via <tt>register_filesystem()</tt>, which is exactly
what happens in <tt>init_pipe_fs()</tt>. The only bug in this function is that if
<tt>kern_mount()</tt> fails (e.g. because <tt>kmalloc()</tt> failed in <tt>add_vfsmnt()</tt>) then the
filesystem is left as registered but module initialisation fails. This will
cause <bf>cat /proc/filesystems</bf> to Oops. (I have just sent a patch to Linus
mentioning that although this is not a real bug today, as pipefs can't be
compiled as a module, it should be written with the view that in the future
it may become modularised).
The result of <tt>register_filesystem()</tt> is that <tt>pipe_fs_type</tt> is linked into
the <tt>file_systems</tt> list so one can read <tt>/proc/filesystems</tt> and find "pipefs"
entry in there with "nodev" flag indicating that <tt>FS_REQUIRES_DEV</tt> was not set.
The <tt>/proc/filesystems</tt> file should really be enhanced to support all the new
<tt>FS_</tt> flags (and I made a patch to do so) but it cannot be done because it would
break all the user applications that use it. Despite Linux kernel interfaces
changing every minute (only for the better), when it comes to userspace
compatibility Linux is a very conservative operating system which allows
many applications to be used for a long time without being recompiled.
The result of <tt>kern_mount()</tt> is that:
<enum>
<item>A new unnamed (anonymous) device number is allocated by setting a bit in
<tt>unnamed_dev_in_use</tt> bitmap; if there are no more bits then <tt>kern_mount()</tt>
fails with <tt>EMFILE</tt>.
<item>A new superblock structure is allocated by means of <tt>get_empty_super()</tt>.
The <tt>get_empty_super()</tt> function walks the list of superblocks headed
by <tt>super_blocks</tt> and looks for an empty entry, i.e. <tt>s->s_dev == 0</tt>. If no
such empty superblock is found then a new one is allocated using
<tt>kmalloc()</tt> at <tt>GFP_USER</tt> priority. The maximum system-wide number of
superblocks is checked in <tt>get_empty_super()</tt> so if it starts failing,
one can adjust the tunable <tt>/proc/sys/fs/super-max</tt>.
<item>A filesystem-specific <tt>pipe_fs_type->read_super()</tt> method, i.e.
<tt>pipefs_read_super()</tt>, is invoked which allocates root inode and root
dentry <tt>sb->s_root</tt>, and sets <tt>sb->s_op</tt> to be <tt>&amp;pipefs_ops</tt>.
<item>Then <tt>kern_mount()</tt> calls <tt>add_vfsmnt(NULL, sb->s_root, "none")</tt> which
allocates a new <tt>vfsmount</tt> structure and links it into <tt>vfsmntlist</tt> and
<tt>sb->s_mounts</tt>.
<item>The <tt>pipe_fs_type->kern_mnt</tt> is set to this new <tt>vfsmount</tt> structure and
it is returned. The reason why the return value of <tt>kern_mount()</tt> is a
<tt>vfsmount</tt> structure is because even <tt>FS_SINGLE</tt> filesystems can be mounted
multiple times and so their <tt>mnt->mnt_sb</tt> will point to the same thing
which would be silly to return from multiple calls to <tt>kern_mount()</tt>.
</enum>
Now that the filesystem is registered and mounted in-kernel, we can use it.
The entry point into the pipefs filesystem is the <bf>pipe(2)</bf> system call,
implemented in the arch-dependent function <tt>sys_pipe()</tt>, but the real work is done
by the portable <tt>fs/pipe.c:do_pipe()</tt> function. Let us look at <tt>do_pipe()</tt> then.
The interaction with pipefs happens when <tt>do_pipe()</tt> calls <tt>get_pipe_inode()</tt>
to allocate a new pipefs inode. For this inode, <tt>inode->i_sb</tt> is set to
pipefs' superblock <tt>pipe_mnt->mnt_sb</tt>, the file operations <tt>i_fop</tt> is set to
<tt>rdwr_pipe_fops</tt> and the number of readers and writers (held in <tt>inode->i_pipe</tt>)
is set to 1. The reason why there is a separate inode field <tt>i_pipe</tt>, instead
of keeping it in the <tt>fs-private</tt> union, is that pipes and FIFOs share the same
code, and FIFOs can exist on other filesystems which use the other access
paths within the same union; this is very bad C and can work only by pure
luck. So, yes, 2.2.x kernels work only by pure luck and will stop working
as soon as you slightly rearrange the fields in the inode.
Each <bf>pipe(2)</bf> system call increments a reference count on the <tt>pipe_mnt</tt>
mount instance.
Under Linux, pipes are not symmetric (bidirectional or STREAM pipes), i.e.
the two sides of the file have different <tt>file->f_op</tt> operations - the
<tt>read_pipe_fops</tt> and <tt>write_pipe_fops</tt> respectively. A write on the read side
returns <tt>EBADF</tt> and so does a read on the write side.
<sect1>Example Disk Filesystem: BFS<p>
As a simple example of an on-disk Linux filesystem, let us consider BFS. The
preamble of the BFS module is in <tt>fs/bfs/inode.c</tt>:
<tscreen><code>
static DECLARE_FSTYPE_DEV(bfs_fs_type, "bfs", bfs_read_super);
static int __init init_bfs_fs(void)
{
return register_filesystem(&amp;bfs_fs_type);
}
static void __exit exit_bfs_fs(void)
{
unregister_filesystem(&amp;bfs_fs_type);
}
module_init(init_bfs_fs)
module_exit(exit_bfs_fs)
</code></tscreen>
A special fstype declaration macro <tt>DECLARE_FSTYPE_DEV()</tt> is used which
sets the <tt>fs_type->flags</tt> to <tt>FS_REQUIRES_DEV</tt> to signify that BFS requires a
real block device to be mounted on.
The module's initialisation function registers the filesystem with VFS and
the cleanup function (only present when BFS is configured to be a module)
unregisters it.
With the filesystem registered, we can proceed to mount it, which would
invoke our <tt>fs_type->read_super()</tt> method, which is implemented in
<tt>fs/bfs/inode.c:bfs_read_super()</tt>. It does the following:
<enum>
<item><tt>set_blocksize(s->s_dev, BFS_BSIZE)</tt>: since we are about to interact
with the block device layer via the buffer cache, we must initialise a few
things, namely set the block size and also inform VFS via fields
<tt>s->s_blocksize</tt> and <tt>s->s_blocksize_bits</tt>.
<item><tt>bh = bread(dev, 0, BFS_BSIZE)</tt>: we read block 0 of the device
passed via <tt>s->s_dev</tt>. This block is the filesystem's superblock.
<item>The superblock is validated against the <tt>BFS_MAGIC</tt> number and, if valid, stored
in the sb-private field <tt>s->su_sbh</tt> (which is really <tt>s->u.bfs_sb.si_sbh</tt>).
<item>Then we allocate inode bitmap using <tt>kmalloc(GFP_KERNEL)</tt> and clear all
bits to 0 except the first two which we set to 1 to indicate that we
should never allocate inodes 0 and 1. Inode 2 is root and the
corresponding bit will be set to 1 a few lines later anyway - the
filesystem should have a valid root inode at mounting time!
<item>Then we initialise <tt>s->s_op</tt>, which means that from this point on we can
invoke the inode cache via <tt>iget()</tt>, which results in <tt>s_op->read_inode()</tt>
being invoked. This finds the block that contains the specified (by
<tt>inode->i_ino</tt> and <tt>inode->i_dev</tt>) inode and reads it in. If we fail to
get the root inode then we free the inode bitmap, release the superblock
buffer back to the buffer cache and return NULL. If the root inode was read OK,
then we allocate a dentry with name <tt>/</tt> (as becometh root) and
instantiate it with this inode.
<item>Now we go through all inodes on the filesystem and read them all in
order to set the corresponding bits in our internal inode bitmap and
also to calculate some other internal parameters like the offset of
last inode and the start/end blocks of last file. Each inode we read
is returned back to inode cache via <tt>iput()</tt> - we don't hold a reference
to it longer than needed.
<item>If the filesystem was not mounted read-only, we mark the superblock
buffer dirty and set the <tt>s->s_dirt</tt> flag (TODO: why do I do this?
Originally, I did it because <tt>minix_read_super()</tt> did, but neither minix
nor BFS seem to modify the superblock within <tt>read_super()</tt>).
<item>All is well so we return this initialised superblock back to the caller
at VFS level, i.e. <tt>fs/super.c:read_super()</tt>.
</enum>
After the <tt>read_super()</tt> function returns successfully, VFS obtains the
reference to the filesystem module via call to <tt>get_filesystem(fs_type)</tt> in
<tt>fs/super.c:get_sb_bdev()</tt> and a reference to the block device.
Now, let us examine what happens when we do I/O on the filesystem. We already
examined how inodes are read when <tt>iget()</tt> is called and how they are released
on <tt>iput().</tt> Reading inodes sets up, among other things, <tt>inode->i_op</tt> and
<tt>inode->i_fop</tt>; opening a file will propagate <tt>inode->i_fop</tt> into <tt>file->f_op</tt>.
Let us examine the code path of the <bf>link(2)</bf> system call. The implementation
of the system call is in <tt>fs/namei.c:sys_link()</tt>:
<enum>
<item>The userspace names are copied into kernel space by means of the <tt>getname()</tt>
function, which does the error checking.
<item>These names are converted to nameidata using the <tt>path_init()/path_walk()</tt>
interaction with the dcache. The result is stored in the <tt>old_nd</tt> and <tt>nd</tt>
structures.
<item>If <tt>old_nd.mnt != nd.mnt</tt> then "cross-device link" <tt>EXDEV</tt> is returned -
one cannot link between filesystems; in Linux this translates into: one
cannot link between mounted instances of a filesystem (or, in
particular, between filesystems).
<item>A new dentry is created corresponding to <tt>nd</tt> by <tt>lookup_create()</tt> .
<item>A generic <tt>vfs_link()</tt> function is called which checks if we can
create a new entry in the directory and invokes the <tt>dir->i_op->link()</tt>
method which brings us back to filesystem-specific
<tt>fs/bfs/dir.c:bfs_link()</tt> function.
<item>Inside <tt>bfs_link()</tt>, we check if we are trying to link a directory and,
if so, refuse with the <tt>EPERM</tt> error. This is the same behaviour as the standard (ext2) filesystem.
<item>We attempt to add a new directory entry to the specified directory
by calling the helper function <tt>bfs_add_entry()</tt> which goes through all
entries looking for an unused slot (<tt>de->ino == 0</tt>) and, when found, writes
out the name/inode pair into the corresponding block and marks it
dirty (at non-superblock priority). A sketch of this scan follows the list.
<item>If we successfully added the directory entry then there is no way
to fail the operation so we increment <tt>inode->i_nlink</tt>, update
<tt>inode->i_ctime</tt> and mark this inode dirty as well as instantiating the
new dentry with the inode.
</enum>
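Here is the promised sketch of that unused-slot scan. The helper names
<tt>dir_first_block()/dir_last_block()</tt> are invented, and the real
<tt>bfs_add_entry()</tt> differs in its block/offset arithmetic and error handling:
<tscreen><code>
/* Hedged sketch of the "find the first unused slot" scan done by
 * fs/bfs/dir.c:bfs_add_entry(); locking and bounds checks omitted.
 */
static int add_entry_sketch(struct inode *dir, const char *name,
                            int namelen, int ino)
{
        struct buffer_head *bh;
        struct bfs_dirent *de;
        unsigned long block;
        int off;

        for (block = dir_first_block(dir); block &lt;= dir_last_block(dir); block++) {
                bh = bread(dir->i_dev, block, BFS_BSIZE);
                if (!bh)
                        return -EIO;
                for (off = 0; off &lt; BFS_BSIZE; off += BFS_DIRENT_SIZE) {
                        de = (struct bfs_dirent *)(bh->b_data + off);
                        if (!de->ino) {                 /* unused slot */
                                de->ino = ino;
                                memcpy(de->name, name, namelen);
                                mark_buffer_dirty(bh);
                                brelse(bh);
                                return 0;
                        }
                }
                brelse(bh);
        }
        return -ENOSPC;
}
</code></tscreen>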
Other related inode operations like <tt>unlink()/rename()</tt> etc work in a similar
way, so not much is gained by examining them all in detail.
<sect1>Execution Domains and Binary Formats<p>
Linux supports loading user application binaries from disk. More
interestingly, the binaries can be stored in different formats and the
operating system's response to programs via system calls can deviate from
the norm (the norm being the native Linux behaviour) as required, in order to emulate
formats found in other flavours of UNIX (COFF, etc) and also to emulate
the system call behaviour of other flavours (Solaris, UnixWare, etc). This is
what execution domains and binary formats are for.
Each Linux task has a personality stored in its <tt>task_struct</tt> (<tt>p->personality</tt>).
The currently existing (either in the official kernel or as addon patch)
personalities include support for FreeBSD, Solaris, UnixWare, OpenServer and
many other popular operating systems.
The value of <tt>current->personality</tt> is split into two parts:
<enum>
<item>high three bytes - bug emulation: <tt>STICKY_TIMEOUTS</tt>, <tt>WHOLE_SECONDS</tt>, etc.
<item>low byte - personality proper, a unique number.
</enum>
By changing the personality, we can change
the way the operating system treats certain system calls, for example
adding <tt>STICKY_TIMEOUTS</tt> to <tt>current->personality</tt> makes the <bf>select(2)</bf> system call
preserve the value of the last argument (timeout) instead of storing the
unslept time. Some buggy programs rely on buggy operating systems (non-Linux)
and so Linux provides a way to emulate bugs in cases where the source code
is not available and so the bugs cannot be fixed.
An execution domain is a contiguous range of personalities implemented by a
single module. Usually a single execution domain implements a single
personality but sometimes it is possible to implement "close" personalities
in a single module without too many conditionals.
Execution domains are implemented in <tt>kernel/exec_domain.c</tt> and were completely
rewritten for 2.4 kernel, compared with 2.2.x. The list of execution domains
currently supported by the kernel, along with the range of personalities
they support, is available by reading the <tt>/proc/execdomains</tt> file. Execution
domains, except the <tt>PER_LINUX</tt> one, can be implemented as dynamically
loadable modules.
The user interface is via the <bf>personality(2)</bf> system call, which sets the current
process' personality or returns the value of <tt>current->personality</tt> if the
argument is set to the impossible personality 0xffffffff. Obviously, the
behaviour of this system call itself does not depend on the personality.
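For example, a userspace program can query and change its own personality
like this (a minimal sketch; it assumes that glibc exposes <bf>personality(2)</bf>
and the <tt>PER_SVR4</tt> constant via <tt>&lt;sys/personality.h&gt;</tt>):
<tscreen><code>
/* Query the current personality, then switch to SVR4 emulation. */
#include &lt;stdio.h&gt;
#include &lt;sys/personality.h&gt;

int main(void)
{
        int old = personality(0xffffffff);      /* query only */

        printf("personality was %#x\n", old);
        personality(PER_SVR4);                  /* SVR4 syscall semantics */
        return 0;
}
</code></tscreen>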
The kernel interface to execution domains registration consists of two
functions:
<itemize>
<item><tt>int register_exec_domain(struct exec_domain *)</tt>: registers the
execution domain by linking it into the singly-linked list <tt>exec_domains</tt>
under the write protection of the read-write spinlock <tt>exec_domains_lock</tt>.
Returns 0 on success, non-zero on failure.
<item><tt>int unregister_exec_domain(struct exec_domain *)</tt>: unregisters the
execution domain by unlinking it from the <tt>exec_domains</tt> list, again using the
<tt>exec_domains_lock</tt> spinlock in write mode. Returns 0 on success.
</itemize>
The reason why <tt>exec_domains_lock</tt> is a read-write spinlock is that only registration
and unregistration requests modify the list, whilst doing
<bf>cat /proc/execdomains</bf> calls <tt>kernel/exec_domain.c:get_exec_domain_list()</tt>, which
needs only read access to the list. Registering a new execution domain
defines a "lcall7 handler" and a signal number conversion map. Actually, the
ABI patch extends this concept of exec domain to include extra information
(like socket options, socket types, address family and errno maps).
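A module implementing a new personality would therefore do something like the
following. This is a hedged sketch: the field names assume the 2.4
<tt>struct exec_domain</tt> layout from <tt>include/linux/personality.h</tt> (name,
lcall7 handler, personality range, signal maps, module), and all <tt>my*</tt>
names and <tt>PER_MYOS</tt> are invented:
<tscreen><code>
/* Hedged sketch: registering an execution domain from a module.
 * my_lcall7, the signal maps and PER_MYOS are all placeholders.
 */
static struct exec_domain my_exec_domain = {
        name:           "MyOS",
        handler:        my_lcall7,              /* lcall7 syscall dispatcher */
        pers_low:       PER_MYOS &amp; PER_MASK,
        pers_high:      PER_MYOS &amp; PER_MASK,
        signal_map:     myos_to_linux_signals,
        signal_invmap:  linux_to_myos_signals,
        module:         THIS_MODULE,
};

static int __init myos_init(void)
{
        return register_exec_domain(&amp;my_exec_domain);
}

static void __exit myos_exit(void)
{
        unregister_exec_domain(&amp;my_exec_domain);
}
</code></tscreen>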
The binary formats are implemented in a similar manner, i.e. a singly-linked
list <tt>formats</tt> is defined in <tt>fs/exec.c</tt> and is protected by a read-write lock
<tt>binfmt_lock</tt>. As with <tt>exec_domains_lock</tt>, the <tt>binfmt_lock</tt> is taken read on
most occasions except for registration/unregistration of binary formats.
Registering a new binary format enhances the <bf>execve(2)</bf> system call with new
<tt>load_binary()/load_shlib()</tt> functions as well as the ability to <tt>core_dump()</tt>. The
<tt>load_shlib()</tt> method is used only by the old <bf>uselib(2)</bf> system call while
the <tt>load_binary()</tt> method is called by <tt>search_binary_handler()</tt> from
<tt>do_execve()</tt> which implements the <bf>execve(2)</bf> system call.
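A registration sketch, mirroring the execution domain one above; the structure
layout follows <tt>include/linux/binfmts.h</tt> in 2.4 and the <tt>my_*</tt>
functions are placeholders:
<tscreen><code>
/* Sketch: how a binary-format handler plugs into execve(2). */
static struct linux_binfmt my_format = {
        module:         THIS_MODULE,
        load_binary:    my_load_binary, /* called by search_binary_handler() */
        load_shlib:     my_load_shlib,  /* for uselib(2) */
        core_dump:      my_core_dump,
};

static int __init my_binfmt_init(void)
{
        return register_binfmt(&amp;my_format);
}
</code></tscreen>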
The personality of the process is determined at binary format loading by
the corresponding format's <tt>load_binary()</tt> method using some heuristics.
For example, to determine UnixWare7 binaries one first marks the binary
using the <bf>elfmark(1)</bf> utility, which sets the ELF header's <tt>e_flags</tt> to the magic
value 0x314B4455; this is detected at ELF loading time and
<tt>current->personality</tt> is set to PER_UW7. If this heuristic fails, then a more
generic one, such as treating ELF interpreter paths like <tt>/usr/lib/ld.so.1</tt> or
<tt>/usr/lib/libc.so.1</tt> as
indicating a SVR4 binary, is used and the personality is set to PER_SVR4. One
could write a little utility program that uses Linux's <bf>ptrace(2)</bf> capabilities
to single-step the code and force a running program into any personality.
Once the personality (and therefore <tt>current->exec_domain</tt>) is known, the system
calls are handled as follows. Let us assume that a process makes a system
call by means of the lcall7 gate instruction. This transfers control to
<tt>ENTRY(lcall7)</tt> of <tt>arch/i386/kernel/entry.S</tt> because it was prepared in
<tt>arch/i386/kernel/traps.c:trap_init()</tt>. After appropriate stack layout
conversion, <tt>entry.S:lcall7</tt> obtains the pointer to <tt>exec_domain</tt> from <tt>current</tt>
and then the offset of the lcall7 handler within the <tt>exec_domain</tt> (which is
hardcoded as 4 in the asm code, so you can't shift the <tt>handler</tt> field around in
the C declaration of <tt>struct exec_domain</tt>) and jumps to it. So, in C, it would
look like this:
<tscreen><code>
static void UW7_lcall7(int segment, struct pt_regs * regs)
{
abi_dispatch(regs, &amp;uw7_funcs[regs->eax &amp; 0xff], 1);
}
</code></tscreen>
where <tt>abi_dispatch()</tt> is a wrapper around the table of function pointers that
implement this personality's system calls <tt>uw7_funcs</tt>.
<sect>Linux Page Cache<p>
In this chapter we describe the Linux 2.4 pagecache. The pagecache
is - as the name suggests - a cache of physical pages. In the UNIX world the
concept of a pagecache became popular with the introduction of SVR4 UNIX,
where it replaced the buffercache for data IO operations.
While the SVR4 pagecache is only used for filesystem data cache and thus uses
the struct vnode and an offset into the file as hash parameters, the Linux page
cache is designed to be more generic, and therefore uses a struct address_space
(explained below) as the first parameter. Because the Linux pagecache is tightly
coupled to the notion of address spaces, you will need at least a basic
understanding of address_spaces to understand the way the pagecache works.
An address_space is some kind of software MMU that maps all pages of one object
(e.g. inode) to another representation (typically physical disk blocks).
The struct address_space is defined in <tt>include/linux/fs.h</tt> as:
<tscreen><code>
struct address_space {
struct list_head clean_pages;
struct list_head dirty_pages;
struct list_head locked_pages;
unsigned long nrpages;
struct address_space_operations *a_ops;
struct inode *host;
struct vm_area_struct *i_mmap;
struct vm_area_struct *i_mmap_shared;
spinlock_t i_shared_lock;
};
</code></tscreen>
To understand the way address_spaces work, we only need to look at a few of these fields:
<tt>clean_pages</tt>, <tt>dirty_pages</tt> and <tt>locked_pages</tt> are doubly linked lists
of all clean, dirty and locked pages that belong to this address_space, and <tt>nrpages</tt>
is the total number of pages in this address_space. <tt>a_ops</tt> defines the methods of
this object and <tt>host</tt> is a pointer to the inode this address_space belongs to -
it may also be NULL, e.g. in the case of the swapper address_space
(<tt>mm/swap_state.c</tt>).
The usage of <tt>clean_pages</tt>, <tt>dirty_pages</tt>, <tt>locked_pages</tt> and
<tt>nrpages</tt> is obvious, so we will take a closer look at the
<tt>address_space_operations</tt> structure, defined in the same header:
<tscreen><code>
struct address_space_operations {
int (*writepage)(struct page *);
int (*readpage)(struct file *, struct page *);
int (*sync_page)(struct page *);
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
int (*bmap)(struct address_space *, long);
};
</code></tscreen>
For a basic view of the principles of address_spaces (and the pagecache) we need
to take a look at -><tt>writepage</tt> and -><tt>readpage</tt>, but in practice we need
to take a look at -><tt>prepare_write</tt> and -><tt>commit_write</tt>, too.
You can probably guess what the address_space_operations methods do
by virtue of their names alone; nevertheless, they do require some
explanation. Their use in the course of filesystem data I/O, by
far the most common path through the pagecache, provides a good
way of understanding them.
Unlike most other UNIX-like operating systems, Linux has generic file
operations (a subset of the SYSVish vnode operations) for data IO through the
pagecache. This means that the data will not directly interact with the
filesystem on read/write/mmap, but will be read/written from/to the pagecache
whenever possible. The pagecache has to get data from the actual low-level
filesystem in case the user wants to read from a page not yet in memory,
or write data to disk in case memory gets low.
In the read path the generic methods will first try to find a page that
matches the wanted inode/index tuple.
<tscreen>
hash = page_hash(inode->i_mapping, index);
</tscreen>
Then we test whether the page actually exists.
<tscreen>
page = __find_page_nolock(inode->i_mapping, index, *hash);
</tscreen>
When it does not exist, we allocate a new free page, and add it to the
pagecache hash.
<tscreen>
page = page_cache_alloc();
__add_to_page_cache(page, mapping, index, hash);
</tscreen>
After the page is hashed, we use the -><tt>readpage</tt> address_space operation to
actually fill the page with data (file is an open instance of the inode).
<tscreen>
error = mapping->a_ops->readpage(file, page);
</tscreen>
Finally we can copy the data to userspace.
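Putting these fragments together, the lookup-or-read logic is roughly the
following (a condensed sketch of what <tt>mm/filemap.c:do_generic_file_read()</tt>
does; locking, readahead and error handling are omitted):
<tscreen><code>
/* Condensed sketch of the pagecache read path described above. */
static struct page *read_cache_page_sketch(struct file *file,
                                           struct address_space *mapping,
                                           unsigned long index)
{
        struct page **hash = page_hash(mapping, index);
        struct page *page = __find_page_nolock(mapping, index, *hash);

        if (!page) {
                page = page_cache_alloc();
                if (!page)
                        return NULL;
                __add_to_page_cache(page, mapping, index, hash);
                if (mapping->a_ops->readpage(file, page))
                        return NULL;    /* error handling omitted */
        }
        return page;    /* caller copies the data to userspace */
}
</code></tscreen>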
For writing to the filesystem two paths exist: one for writable mappings
(mmap) and one for the <bf>write(2)</bf> family of syscalls. The mmap case is very
simple, so it will be discussed first.
When a user modifies mappings, the VM subsystem marks the page dirty.
<tscreen>
SetPageDirty(page);
</tscreen>
The bdflush kernel thread that is trying to free pages, either as background
activity or because memory gets low, will try to call -><tt>writepage</tt> on the pages
that are explicitly marked dirty. The -><tt>writepage</tt> method now has to
write the page's contents back to disk and free the page.
The second write path is _much_ more complicated. For each page the user
writes to, we are basically doing the following:
(for the full code see <tt>mm/filemap.c:generic_file_write()</tt>).
<tscreen>
page = __grab_cache_page(mapping, index, &amp;cached_page);
mapping->a_ops->prepare_write(file, page, offset, offset+bytes);
copy_from_user(kaddr+offset, buf, bytes);
mapping->a_ops->commit_write(file, page, offset, offset+bytes);
</tscreen>
So first we try to find the hashed page or allocate a new one, then we call the
-><tt>prepare_write</tt> address_space method, copy the user buffer to kernel memory and
finally call the -><tt>commit_write</tt> method. As you have probably seen,
-><tt>prepare_write</tt> and -><tt>commit_write</tt> are fundamentally different from -><tt>readpage</tt>
and -><tt>writepage</tt>, because they are not only called when physical IO is actually
wanted but every time the user modifies the file.
There are two (or more?) ways to handle this. The first one uses the Linux
buffercache to delay the physical IO, by filling a <tt>page->buffers</tt> pointer with
buffer_heads that will be used in try_to_free_buffers (<tt>fs/buffer.c</tt>) to
request IO once memory gets low; this is used very widely in the current
kernel. The other way just sets the page dirty and relies on -><tt>writepage</tt> to do
all the work. Due to the lack of a validity bitmap in struct page, this does
not work with filesystems that have a smaller granularity than <tt>PAGE_SIZE</tt>.
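For the first (buffercache-based) variant, a filesystem's methods typically
just delegate to the generic block-layer helpers. A hedged sketch, with
<tt>myfs_get_block</tt> standing in for the filesystem's own block-mapping routine:
<tscreen><code>
/* Sketch of the buffer_head-based variant: the generic helpers fill
 * page->buffers and defer the actual IO until later.
 */
static int myfs_prepare_write(struct file *file, struct page *page,
                              unsigned from, unsigned to)
{
        return block_prepare_write(page, from, to, myfs_get_block);
}

static struct address_space_operations myfs_aops = {
        readpage:       myfs_readpage,
        writepage:      myfs_writepage,
        sync_page:      block_sync_page,
        prepare_write:  myfs_prepare_write,
        commit_write:   generic_commit_write,
};
</code></tscreen>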
<sect>IPC mechanisms<p>
This chapter describes the semaphore, shared memory, and
message queue IPC mechanisms as implemented in the Linux 2.4
kernel. It is organized into four sections. The
first three sections cover the interfaces and support functions
for <ref id="semaphores" name="semaphores">,
<ref id="message" name="message queues">,
and <ref id="sharedmem" name="shared memory"> respectively.
The <ref id="ipc_primitives" name="last"> section describes
a set of common functions and data structures that are shared by
all three mechanisms.
<sect1>Semaphores<label id="semaphores"><p>
The functions described in this section implement the user level
semaphore mechanisms. Note that this implementation relies on the
use of kernel spinlocks and kernel semaphores. To avoid confusion,
the term "kernel semaphore" will be used in reference to kernel
semaphores. All other uses of the word "semaphore" will be in
reference to the user level semaphores.
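For orientation, here is a minimal userspace sequence exercising the three
system calls analysed below (error checking omitted; note that <tt>union semun</tt>
must be defined by the caller):
<tscreen><code>
/* Minimal use of the SysV semaphore calls discussed in this section:
 * create a one-semaphore set, do a P/V pair, then remove the set.
 */
#include &lt;sys/types.h&gt;
#include &lt;sys/ipc.h&gt;
#include &lt;sys/sem.h&gt;

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void)
{
        int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        struct sembuf op = { 0, -1, SEM_UNDO };     /* P operation */
        union semun arg;

        arg.val = 1;
        semctl(id, 0, SETVAL, arg);     /* enters sys_semctl() */
        semop(id, &amp;op, 1);              /* enters sys_semop() */
        op.sem_op = 1;                  /* V operation */
        semop(id, &amp;op, 1);
        semctl(id, 0, IPC_RMID);        /* freeary() eventually */
        return 0;
}
</code></tscreen>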
<sect2>Semaphore System Call Interfaces<label id="sem_apis"><p>
<sect3>sys_semget()<label id="sys_semget"><p>
The entire call to sys_semget() is protected by the
global <ref id="struct_ipc_ids" name="sem_ids.sem">
kernel semaphore.
In the case where a new set of semaphores must be
created, the <ref id="newary" name="newary()"> function is
called to create and initialize a new semaphore set. The ID of
the new set is returned to the caller.
In the case where a key value is provided for an existing
semaphore set, <ref id="ipc_findkey" name="ipc_findkey()">
is invoked to look up the corresponding semaphore descriptor
array index. The parameters and permissions of the caller are
verified before returning the semaphore set ID.
</sect3>
<sect3>sys_semctl()<label id="sys_semctl"><p>
For the <ref id="IPC_INFO_and_SEM_INFO" name="IPC_INFO">,
<ref id="IPC_INFO_and_SEM_INFO" name="SEM_INFO">, and
<ref id="SEM_STAT" name="SEM_STAT"> commands,
<ref id="semctl_nolock" name="semctl_nolock()">
is called to perform the necessary functions.
For the <ref id="GETALL" name="GETALL">, <ref id="GETVAL" name="GETVAL">,
<ref id="GETPID" name="GETPID">, <ref id="GETNCNT" name="GETNCNT">,
<ref id="GETZCNT" name="GETZCNT">, <ref id="IPC_STAT" name="IPC_STAT">,
<ref id="SETVAL" name="SETVAL">,and <ref id="SETALL" name="SETALL"> commands,
<ref id="semctl_main" name="semctl_main()"> is called to perform the
necessary functions.
For the <ref id="semctl_ipc_rmid" name="IPC_RMID">
and <ref id="semctl_ipc_set" name="IPC_SET"> command,
<ref id="semctl_down" name="semctl_down()"> is called
to perform the necessary functions. Throughout both of these
operations, the global <ref id="struct_ipc_ids" name="sem_ids.sem">
kernel semaphore is held.
</sect3>
<sect3>sys_semop()<label id="sys_semop"><p>
After validating the call parameters, the semaphore
operations data is copied from user space to a temporary buffer.
If a small temporary buffer is sufficient, then a stack buffer is
used. Otherwise, a larger buffer is allocated. After copying in the
semaphore operations data, the global semaphores spinlock is
locked, and the user-specified semaphore set ID is validated.
Access permissions for the semaphore set are also validated.
All of the user-specified semaphore operations are parsed.
During this process, a count is maintained of all the operations that
have the SEM_UNDO flag set. A <tt>decrease</tt> flag is set if any of the
operations subtract from a semaphore value, and an <tt>alter</tt> flag is set
if any of the semaphore values are modified (i.e. increased or
decreased). The number of each
semaphore to be modified is validated.
If SEM_UNDO was asserted for any of the semaphore operations,
then the undo list for the current task is searched for an undo
structure associated with this semaphore set. During this search,
if the semaphore set ID of any of the undo structures is found
to be -1, then <ref id="freeundos" name="freeundos()">
is called to free the undo structure
and remove it from the list. If no undo structure is found for
this semaphore set then <ref id="alloc_undo" name="alloc_undo()">
is called to allocate and initialize one.
The <ref id="try_atomic_semop" name="try_atomic_semop()">
function is called with the <tt>do_undo</tt>
parameter equal to 0 in order to execute the sequence of
operations. The return value indicates that either the
operations passed, failed, or were not executed because
they need to block. Each of these cases is further described below:
<sect4>Non-blocking Semaphore Operations<label id="Non-blocking_Semaphore_Operations"><p>
The <ref id="try_atomic_semop" name="try_atomic_semop()">
function returns zero to indicate that all operations in the
sequence succeeded. In this case,
<ref id="update_queue" name="update_queue()">
is called to traverse the queue of pending semaphore
operations for the semaphore set and awaken any
sleeping tasks that no longer need to block. This completes the
execution of the sys_semop() system call for this case.
</sect4>
<sect4>Failing Semaphore Operations<label id="Failing_Semaphore_Operations"><p>
If <ref id="try_atomic_semop" name="try_atomic_semop()">
returns a negative value, then a failure condition was encountered.
In this case, none of the operations have been executed.
This occurs when either a semaphore operation would cause an
invalid semaphore value, or an operation marked IPC_NOWAIT is
unable to complete. The error condition is then returned to the
caller of sys_semop().
Before sys_semop() returns, a call is made to
<ref id="update_queue" name="update_queue()"> to traverse
the queue of pending semaphore operations for the semaphore set
and awaken any sleeping tasks that no longer need to block.
</sect4>
<sect4>Blocking Semaphore Operations<label id="Blocking_Semaphore_Operations"><p>
The <ref id="try_atomic_semop" name="try_atomic_semop()">
function returns 1 to indicate that the
sequence of semaphore operations was not executed because
one of the semaphores would block. For this case, a new
<ref id="struct_sem_queue" name="sem_queue"> element is
initialized containing these semaphore operations. If any of
these operations would alter the state of the semaphore, then
the new queue element is added at the tail of the queue.
Otherwise, the new queue element is added at the head of the queue.
The <tt>semsleeping</tt> element of the current
task is set to indicate that the task is sleeping on this
<ref id="struct_sem_queue" name="sem_queue"> element.
The current task is marked as TASK_INTERRUPTIBLE, and the
<tt>sleeper</tt> element of the
<ref id="struct_sem_queue" name="sem_queue">
is set to identify this task as the sleeper. The
global semaphore spinlock is then unlocked, and schedule() is called
to put the current task to sleep.
When awakened, the task re-locks the global semaphore spinlock,
determines why it was awakened, and how it should
respond. The following cases are handled:
<itemize>
<item> If the semaphore set has been removed, then
the system call fails with EIDRM.
<item> If the <tt>status</tt> element of the
<ref id="struct_sem_queue" name="sem_queue"> structure
is set to 1, then the task was awakened in order to retry the
semaphore operations. Another call to
<ref id="try_atomic_semop" name="try_atomic_semop()"> is
made to execute the sequence of semaphore operations. If
try_atomic_semop() returns 1, then the task must block again
as described above. Otherwise, 0 is returned for success,
or an appropriate error code is returned in case of failure.
Before sys_semop() returns, current->semsleeping is cleared,
and the <ref id="struct_sem_queue" name="sem_queue">
is removed from the queue. If any of the specified semaphore
operations were altering operations (increase or decrease),
then <ref id="update_queue" name="update_queue()"> is
called to traverse the queue of pending semaphore operations
for the semaphore set and awaken any sleeping tasks that no
longer need to block.
<item> If the <tt>status</tt> element of the
<ref id="struct_sem_queue" name="sem_queue"> structure is
NOT set to 1, and the
<ref id="struct_sem_queue" name="sem_queue"> element has
not been dequeued, then the task was awakened by an interrupt.
In this case, the system call fails with EINTR. Before
returning, current->semsleeping is cleared, and the
<ref id="struct_sem_queue" name="sem_queue"> is removed
from the queue. Also,
<ref id="update_queue" name="update_queue()"> is called
if any of the operations were altering operations.
<item> If the <tt>status</tt> element of the
<ref id="struct_sem_queue" name="sem_queue"> structure is
NOT set to 1, and the
<ref id="struct_sem_queue" name="sem_queue"> element
has been dequeued,
then the semaphore operations have already been executed by
<ref id="update_queue" name="update_queue()">. The
queue <tt>status</tt>, which could be 0 for success
or a negated error code for failure, becomes the return value of
the system call.
</itemize>
</sect4>
</sect3>
</sect2>
<sect2>Semaphore Specific Support Structures<label id="sem_structures"><p>
The following structures are used specifically for semaphore support:
<sect3>struct sem_array<label id="struct_sem_array"><p>
<tscreen><code>
/* One sem_array data structure for each set of semaphores in the system. */
struct sem_array {
struct kern_ipc_perm sem_perm; /* permissions .. see ipc.h */
time_t sem_otime; /* last semop time */
time_t sem_ctime; /* last change time */
struct sem *sem_base; /* ptr to first semaphore in array */
struct sem_queue *sem_pending; /* pending operations to be processed */
struct sem_queue **sem_pending_last; /* last pending operation */
        struct sem_undo *undo;                /* undo requests on this array */
unsigned long sem_nsems; /* no. of semaphores in array */
};
</code></tscreen>
</sect3>
<sect3>struct sem<label id="struct_sem"><p>
<tscreen><code>
/* One semaphore structure for each semaphore in the system. */
struct sem {
int semval; /* current value */
int sempid; /* pid of last operation */
};
</code></tscreen>
</sect3>
<sect3>struct seminfo<label id="struct_seminfo"><p>
<tscreen><code>
struct seminfo {
int semmap;
int semmni;
int semmns;
int semmnu;
int semmsl;
int semopm;
int semume;
int semusz;
int semvmx;
int semaem;
};
</code></tscreen>
</sect3>
<sect3>struct semid64_ds<label id="struct_semid64_ds"><p>
<tscreen><code>
struct semid64_ds {
        struct ipc64_perm sem_perm;           /* permissions .. see ipc.h */
__kernel_time_t sem_otime; /* last semop time */
unsigned long __unused1;
__kernel_time_t sem_ctime; /* last change time */
unsigned long __unused2;
        unsigned long sem_nsems;              /* no. of semaphores in array */
unsigned long __unused3;
unsigned long __unused4;
};
</code></tscreen>
</sect3>
<sect3>struct sem_queue<label id="struct_sem_queue"><p>
<tscreen><code>
/* One queue for each sleeping process in the system. */
struct sem_queue {
struct sem_queue * next; /* next entry in the queue */
        struct sem_queue ** prev;     /* previous entry in the queue, *(q->prev) == q */
struct task_struct* sleeper; /* this process */
struct sem_undo * undo; /* undo structure */
int pid; /* process id of requesting process */
int status; /* completion status of operation */
struct sem_array * sma; /* semaphore array for operations */
int id; /* internal sem id */
struct sembuf * sops; /* array of pending operations */
int nsops; /* number of operations */
int alter; /* operation will alter semaphore */
};
</code></tscreen>
</sect3>
<sect3>struct sembuf<label id="struct_sembuf"><p>
<tscreen><code>
/* semop system calls takes an array of these. */
struct sembuf {
unsigned short sem_num; /* semaphore index in array */
short sem_op; /* semaphore operation */
short sem_flg; /* operation flags */
};
</code></tscreen>
</sect3>
<sect3>struct sem_undo<label id="struct_sem_undo"><p>
<tscreen><code>
/* Each task has a list of undo requests. They are executed automatically
* when the process exits.
*/
struct sem_undo {
struct sem_undo * proc_next; /* next entry on this process */
struct sem_undo * id_next; /* next entry on this semaphore set */
int semid; /* semaphore set identifier */
        short * semadj;               /* array of adjustments, one per semaphore */
};
</code></tscreen>
</sect3>
</sect2>
<sect2>Semaphore Support Functions<label id="sem_primitives"><p>
The following functions are used specifically in support of
semaphores:
<sect3>newary()<label id="newary"><p>
newary() relies on the <ref id="ipc_alloc" name="ipc_alloc()">
function to allocate the memory
required for the new semaphore set. It allocates enough memory
for the semaphore set descriptor and for each of the semaphores
in the set. The allocated memory is cleared, and the address of the
first element of the semaphore set descriptor is passed to
<ref id="ipc_addid" name="ipc_addid()">.
<ref id="ipc_addid" name="ipc_addid()"> reserves an array entry
for the new semaphore set descriptor and initializes the
(<ref id="struct_kern_ipc_perm" name="struct kern_ipc_perm">) data for the set.
The global <tt>used_sems</tt> variable is updated by the number of
semaphores in the new set and the initialization of the
(<ref id="struct_kern_ipc_perm" name="struct kern_ipc_perm">)
data for the new set is completed. Other
initializations performed for this set are listed below:
<itemize>
<item> The <tt>sem_base</tt> element for the set is initialized
to the address immediately following the
(<ref id="struct_sem_array" name="struct sem_array">)
portion of the newly allocated data. This corresponds to
the location of the first semaphore in the set.
<item> The <tt>sem_pending</tt> queue is initialized as empty.
</itemize>
All of the operations following the call to <ref id="ipc_addid" name="ipc_addid()">
are performed while holding the global semaphores spinlock. After
unlocking the global semaphores spinlock, newary() calls
<ref id="ipc_buildid" name="ipc_buildid()">
(via sem_buildid()). This function uses the index
of the semaphore set descriptor to create a unique ID that is then
returned to the caller of newary().
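Condensed, the sequence described above looks roughly like this
(after 2.4 <tt>ipc/sem.c</tt>, with error handling omitted):
<tscreen><code>
size = sizeof(*sma) + nsems * sizeof(struct sem);
sma = (struct sem_array *) ipc_alloc(size);
memset(sma, 0, size);

id = ipc_addid(&sem_ids, &sma->sem_perm, sc_semmni); /* returns locked */
used_sems += nsems;

sma->sem_base = (struct sem *) &sma[1];    /* semaphores follow the descriptor */
sma->sem_pending_last = &sma->sem_pending; /* pending queue starts empty */
sma->sem_nsems = nsems;
sma->sem_ctime = CURRENT_TIME;
sem_unlock(id);

return sem_buildid(id, sma->sem_perm.seq); /* unique ID for the caller */
</code></tscreen>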
</sect3>
<sect3>freeary()<label id="freeary"><p>
freeary() is called by <ref id="semctl_down" name="semctl_down()"> to perform the
functions listed below. It is called with the global semaphores
spinlock locked, and it returns with the spinlock unlocked:
<itemize>
<item> The <ref id="func_ipc_rmid" name="ipc_rmid()"> function
is called (via the
sem_rmid() wrapper) to delete the ID for the semaphore
set and to retrieve a pointer to the semaphore set.
<item> The undo list for the semaphore set is invalidated.
<item> All pending processes are awakened and caused to fail
with EIDRM.
<item> The number of used semaphores is reduced by the number
of semaphores in the removed set.
<item> The memory associated with the semaphore set is freed.
</itemize>
</sect3>
<sect3>semctl_down()<label id="semctl_down"><p>
semctl_down() provides the <ref id="semctl_ipc_rmid" name="IPC_RMID"> and
<ref id="semctl_ipc_set" name="IPC_SET"> operations of the
semctl() system call. The semaphore set ID and the access permissions
are verified prior to either of these operations, and in either
case, the global semaphore spinlock is held throughout the
operation.
<sect4>IPC_RMID<label id="semctl_ipc_rmid"><p>
The IPC_RMID operation calls <ref id="freeary" name="freeary()"> to remove the semaphore set.
</sect4>
<sect4>IPC_SET<label id="semctl_ipc_set"><p>
The IPC_SET operation updates the <tt>uid</tt>, <tt>gid</tt>,
<tt>mode</tt>, and <tt>ctime</tt> elements of the semaphore set.
</sect4>
</sect3>
<sect3>semctl_nolock()<label id="semctl_nolock"><p>
semctl_nolock() is called by <ref id="sys_semctl" name="sys_semctl()">
to perform the IPC_INFO, SEM_INFO and SEM_STAT functions.
<sect4>IPC_INFO and SEM_INFO<label id="IPC_INFO_and_SEM_INFO"><p>
IPC_INFO and SEM_INFO cause a temporary <ref id="struct_seminfo" name="seminfo">
buffer to be initialized and loaded with unchanging semaphore
statistical data. Then, while holding the global <tt>sem_ids.sem</tt>
kernel semaphore, the <tt>semusz</tt> and <tt>semaem</tt> elements of
the <ref id="struct_seminfo" name="seminfo"> structure are
updated according to the given command (IPC_INFO or SEM_INFO).
The return value of the system call is set to the maximum
semaphore set ID.
</sect4>
<sect4>SEM_STAT<label id="SEM_STAT"><p>
SEM_STAT causes a temporary <ref id="struct_semid64_ds" name="semid64_ds">
buffer to be initialized. The global
semaphore spinlock is then held while copying the <tt>sem_otime</tt>,
<tt>sem_ctime</tt>, and <tt>sem_nsems</tt> values into the buffer. This data is
then copied to user space.
</sect4>
</sect3>
<sect3>semctl_main()<label id="semctl_main"><p>
semctl_main() is called by <ref id="sys_semctl" name="sys_semctl()"> to perform many
of the supported functions, as described in the subsections below.
Prior to performing any of the following operations, semctl_main()
locks the global semaphore spinlock and validates the
semaphore set ID and the permissions. The spinlock is released
before returning.
<sect4>GETALL<label id="GETALL"><p>
The GETALL operation loads the current semaphore values into
a temporary kernel buffer and copies
them out to user space. A small stack buffer is used if the
semaphore set is small. Otherwise, the spinlock is temporarily
dropped in order to allocate a larger buffer. The spinlock is
held while copying the semaphore values into the temporary buffer.
</sect4>
<sect4>SETALL<label id="SETALL"><p>
The SETALL operation copies semaphore values from user space into a temporary buffer,
and then into the semaphore set. The spinlock is dropped while
copying the values from user space into the temporary buffer,
and while verifying reasonable values. If the semaphore set
is small, then a stack buffer is used, otherwise a larger buffer
is allocated. The spinlock is regained and held while the
following operations are performed on the semaphore set:
<itemize>
<item> The semaphore values are copied into the semaphore set.
<item> The semaphore adjustments of the undo queue for
the semaphore set are cleared.
<item> The <tt>sem_ctime</tt> value for the semaphore set is set.
<item> The <ref id="update_queue" name="update_queue()">
function is called to traverse
the queue of pending semops and look for any tasks that
can be completed as a result of the SETALL operation. Any
pending tasks that are no longer blocked are awakened.
</itemize>
</sect4>
<sect4>IPC_STAT<label id="IPC_STAT"><p>
In the IPC_STAT operation, the <tt>sem_otime</tt>,
<tt>sem_ctime</tt>, and <tt>sem_nsems</tt> values are copied into
a stack buffer. The data is then copied to user space after
dropping the spinlock.
</sect4>
<sect4>GETVAL<label id="GETVAL"><p>
For GETVAL in the non-error case, the return value for the system call is
set to the value of the specified semaphore.
</sect4>
<sect4>GETPID<label id="GETPID"><p>
For GETPID in the non-error case, the return value for the system call is
set to the <tt>pid</tt> associated with the last operation on the
semaphore.
</sect4>
<sect4>GETNCNT<label id="GETNCNT"><p>
For GETNCNT in the non-error case, the return value for the system call
is set to the number of processes waiting for the value of the
semaphore to increase. This number is calculated by the
<ref id="count_semncnt" name="count_semncnt()"> function.
</sect4>
<sect4>GETZCNT<label id="GETZCNT"><p>
For GETZCNT in the non-error case, the return value for the system call
is set to the number of processes waiting for the value of the
semaphore to become zero. This number is calculated by the
<ref id="count_semzcnt" name="count_semzcnt()"> function.
</sect4>
<sect4>SETVAL<label id="SETVAL"><p>
After validating the new semaphore value, the following
operations are performed:
<itemize>
<item> The undo queue is searched for any adjustments to
this semaphore. Any adjustments that are found are reset to
zero.
<item> The semaphore value is set to the value provided.
<item> The <tt>sem_ctime</tt> value for the semaphore set is updated.
<item> The <ref id="update_queue" name="update_queue()">
function is called to traverse
the queue of pending semops and look for any tasks that
can be completed as a result of the SETVAL operation. Any
pending tasks that are no longer blocked are awakened.
</itemize>
</sect4>
</sect3>
<sect3>count_semncnt()<label id="count_semncnt"><p>
count_semncnt() counts the number of tasks waiting for the value of a
semaphore to increase, i.e. tasks blocked in a decrement operation.
</sect3>
<sect3>count_semzcnt()<label id="count_semzcnt"><p>
count_semzcnt() counts the number of tasks waiting for the value of a
semaphore to become zero.
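Both counters simply walk the pending queue of the semaphore set and
inspect each queued operation; roughly (after 2.4 <tt>ipc/sem.c</tt>;
count_semzcnt() is identical except that it tests for sem_op == 0):
<tscreen><code>
static int count_semncnt(struct sem_array *sma, ushort semnum)
{
        int semncnt = 0;
        struct sem_queue *q;

        for (q = sma->sem_pending; q; q = q->next) {
                struct sembuf *sops = q->sops;
                int i;

                for (i = 0; i < q->nsops; i++)
                        if (sops[i].sem_num == semnum
                            && (sops[i].sem_op < 0)
                            && !(sops[i].sem_flg & IPC_NOWAIT))
                                semncnt++;
        }
        return semncnt;
}
</code></tscreen>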
</sect3>
<sect3>update_queue()<label id="update_queue"><p>
update_queue() traverses the queue of pending semops for
a semaphore set and calls
<ref id="try_atomic_semop" name="try_atomic_semop()">
to determine which sequences of semaphore operations
would succeed. If the status of the queue element
indicates that blocked tasks have already
been awakened, then the queue element is skipped over. For other
elements of the queue, the <tt>q->alter</tt> flag
is passed as the undo parameter to
<ref id="try_atomic_semop" name="try_atomic_semop()">,
indicating that any
altering operations should be undone before returning.
If the sequence of operations would block, then
update_queue() returns without making any changes.
A sequence of operations can fail if one of the semaphore
operations would cause an invalid semaphore value, or an
operation marked IPC_NOWAIT is unable to complete. In such a
case, the task that is blocked on the sequence of semaphore
operations is awakened, and the queue status is set with an
appropriate error code. The queue element is also dequeued.
If the sequence of operations is non-altering, then a
zero value is passed as the undo parameter to
<ref id="try_atomic_semop" name="try_atomic_semop()">.
If these operations succeeded, then they
are considered complete and are removed from the queue.
The blocked task is awakened, and the queue element
<tt>status</tt> is set to indicate success.
If the sequence of operations would alter the semaphore
values, but can succeed, then sleeping tasks that no longer
need to be blocked are awakened. The queue status is set to
1 to indicate that the blocked task has been awakened. The
operations have not been performed, so the queue element is not
removed from the queue. The semaphore operations would be
executed by the awakened task.
</sect3>
<sect3>try_atomic_semop()<label id="try_atomic_semop"><p>
try_atomic_semop() is called by <ref id="sys_semop" name="sys_semop()">
and <ref id="update_queue" name="update_queue()">
to determine if a sequence of semaphore operations will all
succeed. It determines this by attempting to perform each of the
operations.
If a blocking operation is encountered, then the process
is aborted and all operations are reversed. -EAGAIN is returned
if IPC_NOWAIT is set. Otherwise 1 is returned to indicate that
the sequence of semaphore operations is blocked.
If a semaphore value is adjusted beyond system limits, then
all operations are reversed, and -ERANGE is returned.
If all operations in the sequence succeed, and the <tt>do_undo</tt>
parameter is non-zero, then all operations are reversed, and 0
is returned. If the <tt>do_undo</tt> parameter is zero, then all operations
succeeded and remain in force, and the <tt>sem_otime</tt> field of the
semaphore set is updated.
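A simplified sketch of the central loop follows (after 2.4
<tt>ipc/sem.c</tt>; the SEM_UNDO bookkeeping is trimmed, and the
would_block, out_of_range and undo paths, which reverse the partial
updates and produce the return values described above, are not shown):
<tscreen><code>
for (sop = sops; sop < sops + nsops; sop++) {
        curr = sma->sem_base + sop->sem_num;

        if (!sop->sem_op && curr->semval)  /* wait-for-zero not satisfied */
                goto would_block;

        curr->semval += sop->sem_op;       /* tentatively apply */
        if (curr->semval < 0)              /* would go negative: block */
                goto would_block;
        if (curr->semval > SEMVMX)         /* beyond system limit */
                goto out_of_range;
}
if (do_undo)
        goto undo;                         /* dry run: roll back, return 0 */
sma->sem_otime = CURRENT_TIME;
return 0;
</code></tscreen>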
</sect3>
<sect3>sem_revalidate()<label id="sem_revalidate"><p>
sem_revalidate() is called when the global semaphores spinlock
has been temporarily dropped and needs to be locked again. It is
called by <ref id="semctl_main" name="semctl_main()">
and <ref id="alloc_undo" name="alloc_undo()">. It validates the
semaphore ID and permissions and on success, returns with the
global semaphores spinlock locked.
</sect3>
<sect3>freeundos()<label id="freeundos"><p>
freeundos() traverses the process undo list in search of
the desired undo structure. If found, the undo structure is removed from the
list and freed. A pointer to the next undo structure on the
process list is returned.
</sect3>
<sect3>alloc_undo()<label id="alloc_undo"><p>
alloc_undo() expects to be called with the global semaphores
spinlock locked. In the case of an error, it returns with it
unlocked.
The global semaphores spinlock is unlocked, and kmalloc() is
called to allocate sufficient memory for both the
<ref id="struct_sem_undo" name="sem_undo">
structure, and also an array of one adjustment value for each
semaphore in the set. On success, the global spinlock is regained
with a call to <ref id="sem_revalidate" name="sem_revalidate()">.
The new sem_undo structure is then initialized, and the address
of this structure is placed at the address provided by the
caller. The new undo structure is then placed at the head of undo
list for the current task.
</sect3>
<sect3>sem_exit()<label id="sem_exit"><p>
sem_exit() is called by do_exit(), and is responsible for
executing all of the undo adjustments for the exiting task.
If the current process was blocked on a semaphore, then it is
removed from the <ref id="struct_sem_queue" name="sem_queue">
list while holding the global semaphores spinlock.
The undo list for the current task is then traversed, and the
global semaphores spinlock is acquired and released around the
processing of each element of the list. The following operations
are performed for each of the undo elements:
<itemize>
<item> The undo structure and the semaphore set ID are validated.
<item> The undo list of the corresponding semaphore set is
searched to find a reference to the same undo structure and to
remove it from that list.
<item> The adjustments indicated in the undo structure are
applied to the semaphore set.
<item> The <tt>sem_otime</tt> parameter of the semaphore set is updated.
<item> <ref id="update_queue" name="update_queue()"> is called
to traverse the queue of
pending semops and awaken any sleeping tasks that no longer
need to be blocked as a result of executing the undo
operations.
<item> The undo structure is freed.
</itemize>
When the processing of the list is complete, the
current->semundo value is cleared.
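The adjustment step for one undo element amounts to the following
(after 2.4 <tt>ipc/sem.c</tt>; validation and locking omitted):
<tscreen><code>
for (i = 0; i < sma->sem_nsems; i++) {
        struct sem *semaphore = &sma->sem_base[i];

        semaphore->semval += u->semadj[i]; /* apply recorded adjustment */
        if (semaphore->semval < 0)
                semaphore->semval = 0;     /* never leave a negative value */
        semaphore->sempid = current->pid;
}
sma->sem_otime = CURRENT_TIME;
update_queue(sma); /* wake any waiters these adjustments unblocked */
</code></tscreen>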
</sect3>
</sect2>
</sect1>
<sect1>Message queues<label id="message"><p>
<sect2>Message System Call Interfaces<label id="Message_System_Call_Interfaces"><p>
<sect3>sys_msgget()<label id="sys_msgget"><p>
The entire call to sys_msgget() is protected by
the global message queue semaphore
(<ref id="struct_ipc_ids" name="msg_ids.sem">).
In the case where a new message queue must be created,
the <ref id="newque" name="newque()"> function is
called to create and initialize
a new message queue, and the new queue ID is returned to
the caller.
If a key value is provided for an existing message queue,
then <ref id="ipc_findkey" name="ipc_findkey()"> is called
to look up the corresponding index in the global message queue
descriptor array (msg_ids.entries). The
parameters and permissions of the caller are verified before
returning the message queue ID. The look up operation and
verification are performed while the global message queue
spinlock (msg_ids.ary) is held.
</sect3>
<sect3>sys_msgctl()<label id="sys_msgctl"><p>
The parameters passed to sys_msgctl() are: a message
queue ID (<tt>msqid</tt>), the operation
(<tt>cmd</tt>), and a pointer to a user space buffer of type
<ref id="struct_msqid_ds" name="msgid_ds">
(<tt>buf</tt>). Six operations are
provided in this function: IPC_INFO, MSG_INFO, IPC_STAT,
MSG_STAT, IPC_SET and IPC_RMID. The message queue
ID and the operation parameters are validated; then, the
operation (<tt>cmd</tt>) is performed as follows:
<sect4>IPC_INFO (or MSG_INFO)<label id="msgctl_IPCINFO"><p>
The global message queue information is copied to user space.
</sect4>
<sect4>IPC_STAT (or MSG_STAT)<label id="msgctl_IPCSTAT"><p>
A temporary buffer of type <ref id="struct_msqid64_ds" name="struct msqid64_ds">
is initialized and the global message queue spinlock is locked.
After verifying the access permissions of the calling process,
the message queue information associated with the message
queue ID is loaded into the temporary buffer, the global
message queue spinlock is unlocked, and the contents of
the temporary buffer are copied out to user space by
<ref id="copy_msqid_to_user" name="copy_msqid_to_user()">.
</sect4>
<sect4>IPC_SET<label id="msgctl_IPCSET"><p>
The user data is copied in via
<ref id="copy_msqid_from_user" name="copy_msqid_from_user()">. The global
message queue semaphore and spinlock are obtained and released
at the end. After the message queue ID and the current
process access permissions are validated, the message queue
information is updated with the user provided data. Later,
<ref id="expunge_all" name="expunge_all()"> and
<ref id="ss_wakeup" name="ss_wakeup()">
are called to wake up all
processes sleeping on the receiver and sender waiting queues
of the message queue. This is because some receivers may now
be excluded by stricter access permissions and some senders
may now be able to send the message due to an increased
queue size.
</sect4>
<sect4>IPC_RMID<label id="msgctl_IPCRMID"><p>
The global message queue semaphore
is obtained and the global message queue spinlock is locked.
After validating the message queue ID and the current task
access permissions, <ref id="freeque" name="freeque()">
is called to free the resources related to the message queue ID.
The global message queue semaphore and spinlock are released.
</sect4>
</sect3>
<sect3>sys_msgsnd()<label id="sys_msgsnd"><p>
sys_msgsnd() receives as parameters a message queue ID
(<tt>msqid</tt>), a pointer to a buffer of type
<ref id="struct_msg_msg" name="struct msg_msg">
(<tt>msgp</tt>), the size of the message to be sent
(<tt>msgsz</tt>), and a flag indicating wait vs.
not wait (<tt>msgflg</tt>). There are two task waiting
queues and one message waiting queue associated with the message
queue ID. If there is a task in the receiver waiting queue
that is waiting for this message, then the message is
delivered directly to the receiver, and the receiver is
awakened. Otherwise, if there is enough space available in
the message waiting queue, the message is saved in this
queue. As a last resort, the sending task enqueues itself
on the sender waiting queue. A more in-depth discussion of the
operations performed by sys_msgsnd() follows:
<enum>
<item> Validates the user buffer address and the message
type, then invokes
<ref id="load_msg" name="load_msg()"> to load the
contents of the user message into a temporary object
<label id="msg"><tt>msg</tt> of type
<ref id="struct_msg_msg" name="struct msg_msg">.
The message type and message size fields
of <tt>msg</tt> are also initialized.
<item> Locks the global message queue spinlock and gets
the message queue descriptor associated with the
message queue ID. If no such message queue exists,
returns EINVAL.
<item><label id="sndretry">
Invokes <ref id="ipc_checkid" name="ipc_checkid()">
(via msg_checkid()) to verify that the message
queue ID is valid and calls
<ref id="ipcperms" name="ipcperms()"> to check the
calling process' access permissions.
<item> Checks the message size and the space left in
the message waiting queue to see if there is enough
room to store the message (see the sketch after this
list). If not, the following substeps are performed:
<enum>
<item> If IPC_NOWAIT is specified in
<tt>msgflg</tt> the global message
queue spinlock is unlocked, the memory
resources for the message are freed, and EAGAIN
is returned.
<item> Invokes
<ref id="ss_add" name="ss_add()"> to
enqueue the current
task in the sender waiting queue. It also unlocks
the global message queue spinlock and invokes
schedule() to put the current task to sleep.
<item> When awakened, obtains the global spinlock
again and verifies that the message queue ID
is still valid. If the message queue ID is not valid,
EIDRM is returned.
<item> Invokes <ref id="ss_del" name="ss_del()">
to remove the sending task from the sender
waiting queue. If there is any signal pending
for the task, sys_msgsnd() unlocks the
global spinlock,
invokes <ref id="free_msg" name="free_msg()">
to free the message buffer,
and returns EINTR. Otherwise, the function goes
<ref id="sndretry" name="back">
to check again whether there is enough space
in the message waiting queue.
</enum>
<item> Invokes
<ref id="pipelined_send" name="pipelined_send()">
to try to send the message to the waiting receiver directly.
<item> If there is no receiver waiting for this message,
enqueues <tt>msg</tt> into the message waiting
queue (msq->q_messages). Updates the
<tt>q_cbytes</tt> and
the <tt>q_qnum</tt> fields of the message
queue descriptor, as well as the global variables
<tt>msg_bytes</tt> and
<tt>msg_hdrs</tt>, which indicate the total
number of bytes used for messages and the total number
of messages system wide.
<item> If the message has been successfully sent or
enqueued, updates the <tt>q_lspid</tt>
and the <tt>q_stime</tt> fields
of the message queue descriptor and releases the global
message queue spinlock.
</enum>
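The space check of step 4 and its failure path condense to roughly the
following (after 2.4 <tt>ipc/msg.c</tt>, with the retry logic abbreviated):
<tscreen><code>
if (msgsz + msq->q_cbytes > msq->q_qbytes ||
    1 + msq->q_qnum > msq->q_qbytes) {
        /* queue full */
        if (msgflg & IPC_NOWAIT) {
                err = -EAGAIN;     /* caller asked not to wait */
                goto out_unlock_free;
        }
        ss_add(msq, &s);           /* join the sender waiting queue */
        msg_unlock(msqid);
        schedule();                /* sleep until a receiver makes room */
        /* ... relock, ss_del(), check for signals, then retry ... */
}
</code></tscreen>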
</sect3>
<sect3>sys_msgrcv()<label id="sys_msgrcv"><p>
The sys_msgrcv() function receives as parameters
a message queue ID
(<tt>msqid</tt>), a pointer to a buffer of type
<ref id="struct_msg_msg" name="msg_msg">
(<tt>msgp</tt>), the desired
message size (<tt>msgsz</tt>), the message type
(<tt>msgtyp</tt>), and the flags
(<tt>msgflg</tt>). It searches the message waiting queue
associated with the message queue ID, finds the first
message in the queue which matches the request type, and
copies it into the given user buffer. If no such message
is found in the message waiting queue, the requesting task
is enqueued into the receiver waiting queue until the
desired message is available. A more in-depth discussion of the
operations performed by sys_msgrcv() follows:
<enum>
<item> First, invokes
<ref id="convert_mode" name="convert_mode()">
to derive the search mode from
<tt>msgtyp</tt>. sys_msgrcv() then locks
the global message
queue spinlock and obtains the message queue descriptor
associated with the message queue ID. If no such
message queue exists, it returns EINVAL.
<item> Checks whether the current task has the correct
permissions to access the message queue.
<item><label id="rcvretry">
Starting from the first message in the message
waiting queue, invokes
<ref id="testmsg" name="testmsg()"> to check whether
the message type matches the required type. sys_msgrcv()
continues searching until a matched message is found or the whole
waiting queue is exhausted. If the search mode is
SEARCH_LESSEQUAL, then the first message on the queue
with the lowest type less than or equal to
<tt>msgtyp</tt> is selected.
<item> If a message is found, sys_msgrcv() performs
the following substeps:
<enum>
<item> If the message size is larger than
the desired size and <tt>msgflg</tt>
indicates that truncation is not allowed, unlocks the global
message queue spinlock and returns E2BIG.
<item> Removes the message from the message
waiting queue and updates the message queue
statistics.
<item> Wakes up all tasks sleeping on the senders
waiting queue. The removal of a message from
the queue in the previous step makes it possible
for one of the senders to progress. Goes to
the <ref id="laststep" name="last step">.
</enum>
<item> If no message matching the receiver's criteria is found
in the message waiting queue, then <tt>msgflg</tt>
is checked. If IPC_NOWAIT is set, then the global message
queue spinlock is unlocked and ENOMSG is returned. Otherwise,
the receiver is enqueued on the receiver waiting queue as
follows:
<enum>
<item> A <ref id="struct_msg_receiver" name="msg_receiver"> data structure
<tt>msr</tt> is allocated and is
added to the head of the waiting queue.
<item> The <tt>r_tsk</tt> field of <tt>msr</tt>
is set to the current task.
<item> The <tt>r_msgtype</tt> and
<tt>r_mode</tt> fields are
initialized with the desired message type and
mode respectively.
<item> If <tt>msgflg</tt> indicates
MSG_NOERROR, then the <tt>r_maxsize</tt> field of
<tt>msr</tt> is set to the
value of <tt>msgsz</tt>; otherwise,
it is set to INT_MAX.
<item> The <tt>r_msg</tt> field
is initialized to indicate that
no message has been received yet.
<item> After the initialization is complete,
the status of the receiving task is set to
TASK_INTERRUPTIBLE, the global message queue
spinlock is unlocked, and schedule() is invoked.
</enum>
<item> After the receiver is awakened,
the <tt>r_msg</tt> field of
<tt>msr</tt> is checked. This field is used to
store the pipelined message or in the case of an error,
to store the error status.
If the <tt>r_msg</tt> field is filled
with the desired message, then go to the
<ref id="laststep" name="last step"> Otherwise,
the global message queue spinlock is locked again.
<item> After obtaining the spinlock,
the <tt>r_msg</tt> field is
re-checked to see if the message was received while
waiting for the spinlock. If the message has been
received, the <ref id="laststep" name="last step">
occurs.
<item> If the <tt>r_msg</tt> field remains
unchanged, then the task was
awakened in order to retry. In this case,
<tt>msr</tt> is dequeued. If there is a
signal pending for the task, then the global message
queue spinlock is unlocked and EINTR is returned.
Otherwise, the function needs to go
<ref id="rcvretry" name="back"> and retry.
<item> If the <tt>r_msg</tt> field shows
that an error occurred
while sleeping, the global message queue spinlock
is unlocked and the error is returned.
<item><label id="laststep">
After validating that the address of the user buffer
<tt>msgp</tt> is valid, the message type is loaded
into the <tt>mtype</tt> field of
<tt>msgp</tt>, and
<ref id="store_msg" name="store_msg()">
is invoked to copy the message contents to
the <tt>mtext</tt> field of
<tt>msgp</tt>. Finally, the memory for the message is
freed by the <ref id="free_msg" name="free_msg()"> function.
</enum>
</sect3>
</sect2>
<sect2>Message Specific Structures<label id="datastructs"><p>
Data structures for message queues are defined in msg.c.
<sect3>struct msg_queue<label id="struct_msg_queue"><p>
<tscreen><code>
/* one msq_queue structure for each present queue on the system */
struct msg_queue {
struct kern_ipc_perm q_perm;
time_t q_stime; /* last msgsnd time */
time_t q_rtime; /* last msgrcv time */
time_t q_ctime; /* last change time */
unsigned long q_cbytes; /* current number of bytes on queue */
unsigned long q_qnum; /* number of messages in queue */
unsigned long q_qbytes; /* max number of bytes on queue */
pid_t q_lspid; /* pid of last msgsnd */
pid_t q_lrpid; /* last receive pid */
struct list_head q_messages;
struct list_head q_receivers;
struct list_head q_senders;
};
</code></tscreen>
</sect3>
<sect3>struct msg_msg<label id="struct_msg_msg"><p>
<tscreen><code>
/* one msg_msg structure for each message */
struct msg_msg {
struct list_head m_list;
long m_type;
int m_ts; /* message text size */
struct msg_msgseg* next;
/* the actual message follows immediately */
};
</code></tscreen>
</sect3>
<sect3>struct msg_msgseg<label id="struct_msg_msgseg"><p>
<tscreen><code>
/* message segment for each message */
struct msg_msgseg {
struct msg_msgseg* next;
/* the next part of the message follows immediately */
};
</code></tscreen>
</sect3>
<sect3>struct msg_sender<label id="struct_msg_sender"><p>
<tscreen><code>
/* one msg_sender for each sleeping sender */
struct msg_sender {
struct list_head list;
struct task_struct* tsk;
};
</code></tscreen>
</sect3>
<sect3>struct msg_receiver<label id="struct_msg_receiver"><p>
<tscreen><code>
/* one msg_receiver structure for each sleeping receiver */
struct msg_receiver {
struct list_head r_list;
struct task_struct* r_tsk;
int r_mode;
long r_msgtype;
long r_maxsize;
struct msg_msg* volatile r_msg;
};
</code></tscreen>
</sect3>
<sect3>struct msqid64_ds<label id="struct_msqid64_ds"><p>
<tscreen><code>
struct msqid64_ds {
struct ipc64_perm msg_perm;
__kernel_time_t msg_stime; /* last msgsnd time */
unsigned long __unused1;
__kernel_time_t msg_rtime; /* last msgrcv time */
unsigned long __unused2;
__kernel_time_t msg_ctime; /* last change time */
unsigned long __unused3;
unsigned long msg_cbytes; /* current number of bytes on queue */
unsigned long msg_qnum; /* number of messages in queue */
unsigned long msg_qbytes; /* max number of bytes on queue */
__kernel_pid_t msg_lspid; /* pid of last msgsnd */
__kernel_pid_t msg_lrpid; /* last receive pid */
unsigned long __unused4;
unsigned long __unused5;
};
</code></tscreen>
</sect3>
<sect3>struct msqid_ds<label id="struct_msqid_ds"><p>
<tscreen><code>
struct msqid_ds {
struct ipc_perm msg_perm;
struct msg *msg_first; /* first message on queue,unused */
struct msg *msg_last; /* last message in queue,unused */
__kernel_time_t msg_stime; /* last msgsnd time */
__kernel_time_t msg_rtime; /* last msgrcv time */
__kernel_time_t msg_ctime; /* last change time */
unsigned long msg_lcbytes; /* Reuse junk fields for 32 bit */
unsigned long msg_lqbytes; /* ditto */
unsigned short msg_cbytes; /* current number of bytes on queue */
unsigned short msg_qnum; /* number of messages in queue */
unsigned short msg_qbytes; /* max number of bytes on queue */
__kernel_ipc_pid_t msg_lspid; /* pid of last msgsnd */
__kernel_ipc_pid_t msg_lrpid; /* last receive pid */
};
</code></tscreen>
</sect3>
<sect3>struct msq_setbuf<label id="msg_setbuf"><p>
<tscreen><code>
struct msq_setbuf {
unsigned long qbytes;
uid_t uid;
gid_t gid;
mode_t mode;
};
</code></tscreen>
</sect3>
</sect2>
<sect2>Message Support Functions<label id="msgfuncs"><p>
<sect3>newque()<label id="newque"><p>
newque() allocates the memory for a new message queue
descriptor (<ref id="struct_msg_queue" name="struct msg_queue">)
and then calls <ref id="ipc_addid" name="ipc_addid()">, which
reserves a message queue array entry for the new message queue
descriptor. The message queue descriptor is initialized as
follows:
<itemize>
<item> The <ref id="struct_kern_ipc_perm" name="kern_ipc_perm">
structure is initialized.
<item> The <tt>q_stime</tt> and <tt>q_rtime</tt> fields of the message
queue descriptor are initialized as 0. The <tt>q_ctime</tt>
field is set to be CURRENT_TIME.
<item> The maximum number of bytes allowed in this
message queue (<tt>q_qbytes</tt>) is set to MSGMNB,
and the number of bytes currently used by the queue
(<tt>q_cbytes</tt>) is initialized as 0.
<item> The message waiting queue (<tt>q_messages</tt>),
the receiver waiting queue (<tt>q_receivers</tt>),
and the sender waiting queue (<tt>q_senders</tt>)
are each initialized as empty.
</itemize>
All the operations following the call to
<ref id="ipc_addid" name="ipc_addid()"> are
performed while holding the global message queue spinlock.
After unlocking the spinlock, newque() calls msg_buildid(),
which maps directly to <ref id="ipc_buildid" name="ipc_buildid()">.
<ref id="ipc_buildid" name="ipc_buildid()">
uses the index of the message queue descriptor to create a unique
message queue ID that is then returned to the caller of newque().
</sect3>
<sect3>freeque()<label id="freeque"><p>
When a message queue is going to be removed, the freeque() function is
called. This function assumes that the global message queue spinlock
is already locked by the calling function. It frees all kernel
resources associated with that message queue. First, it calls
<ref id="func_ipc_rmid" name="ipc_rmid()"> (via msg_rmid())
to remove the message queue descriptor from the array of global
message queue descriptors. Then it calls
<ref id="expunge_all" name="expunge_all"> to wake up
all receivers and <ref id="ss_wakeup" name="ss_wakeup()">
to wake up all senders sleeping on this message queue. Later
the global message queue spinlock is released.
All messages stored in this message queue are freed and the
memory for the message queue descriptor is freed.
</sect3>
<sect3>ss_wakeup()<label id="ss_wakeup"><p>
ss_wakeup() wakes up all the tasks waiting in the given
message sender waiting queue. If this function is called by
<ref id="freeque" name="freeque()">, then all senders
in the queue are dequeued.
</sect3>
<sect3>ss_add()<label id="ss_add"><p>
ss_add() receives as parameters a message queue descriptor
and a message sender data structure. It fills the
<tt>tsk</tt> field of the message sender data
structure with the current process, changes the status of
current process to TASK_INTERRUPTIBLE,
then inserts the message sender data structure at the head of
the sender waiting queue of the given message queue.
</sect3>
<sect3>ss_del()<label id="ss_del"><p>
If the given message sender data structure
(<tt>mss</tt>) is still in the associated sender
waiting queue, then ss_del() removes
<tt>mss</tt> from the queue.
</sect3>
<sect3>expunge_all()<label id="expunge_all"><p>
expunge_all() receives as parameters a message queue
descriptor(<tt>msq</tt>) and an integer value
(<tt>res</tt>) indicating the reason for waking up the
receivers. For each sleeping receiver associated with
<tt>msq</tt>, the <tt>r_msg</tt>
field is set to the indicated
wakeup reason (<tt>res</tt>), and the associated receiving
task is awakened. This function is called when a message queue is
removed or a message control operation has been performed.
</sect3>
<sect3>load_msg()<label id="load_msg"><p>
When a process sends a message, the
<ref id="sys_msgsnd" name="sys_msgsnd()"> function
first invokes the load_msg() function to load the message
from user space to kernel space. The message is represented in
kernel memory as a linked list of data blocks. Associated with
the first data block is a <ref id="struct_msg_msg" name="msg_msg">
structure that describes the overall message. The data block
associated with the msg_msg structure is limited to a size of
DATA_MSG_LEN. The data block and the structure are allocated in one
contiguous memory block that can be as large as one page in memory.
If the full message will not fit into this first data block, then
additional data blocks are allocated and are organized into a
linked list. These additional data blocks are limited to a size
of DATA_SEG_LEN, and each includes an associated
<ref id="struct_msg_msgseg" name="msg_msgseg"> structure. The
msg_msgseg structure and the associated data block are allocated in
one contiguous memory block that can be as large as one page in
memory. This function returns the address of the new
<ref id="struct_msg_msg" name="msg_msg"> structure on success.
</sect3>
<sect3>store_msg()<label id="store_msg"><p>
The store_msg() function is called by
<ref id="sys_msgrcv" name="sys_msgrcv()"> to reassemble a received
message into the user space buffer provided by the caller. The data
described by the <ref id="struct_msg_msg" name="msg_msg">
structure and any <ref id="struct_msg_msgseg" name="msg_msgseg">
structures are sequentially copied to the user space buffer.
</sect3>
<sect3>free_msg()<label id="free_msg"><p>
The free_msg() function releases the memory for a message
data structure <ref id="struct_msg_msg" name="msg_msg">,
and the message segments.
</sect3>
<sect3>convert_mode()<label id="convert_mode"><p>
convert_mode() is called by <ref id="sys_msgrcv" name="sys_msgrcv()">.
It receives as parameters the address of the specified message
type (<tt>msgtyp</tt>) and a flag (<tt>msgflg</tt>).
It returns the search mode to the caller based on the value of
<tt>msgtyp</tt> and <tt>msgflg</tt>. If
<tt>msgtyp</tt> is zero, then SEARCH_ANY is returned.
If <tt>msgtyp</tt> is less than 0, then <tt>msgtyp</tt> is
set to its absolute value and SEARCH_LESSEQUAL is returned.
If MSG_EXCEPT is specified in <tt>msgflg</tt>, then SEARCH_NOTEQUAL is returned.
Otherwise SEARCH_EQUAL is returned.
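The function is short enough to reproduce almost verbatim
(after 2.4 <tt>ipc/msg.c</tt>):
<tscreen><code>
static int convert_mode(long *msgtyp, int msgflg)
{
        if (*msgtyp == 0)
                return SEARCH_ANY;
        if (*msgtyp < 0) {
                *msgtyp = -(*msgtyp);      /* search key is abs(msgtyp) */
                return SEARCH_LESSEQUAL;
        }
        if (msgflg & MSG_EXCEPT)
                return SEARCH_NOTEQUAL;
        return SEARCH_EQUAL;
}
</code></tscreen>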
</sect3>
<sect3>testmsg()<label id="testmsg"><p>
The testmsg() function checks whether a message meets the
criteria specified by the receiver. It returns 1 if one of the
following conditions is true:
<itemize>
<item> The search mode indicates searching any message (SEARCH_ANY).
<item> The search mode is SEARCH_LESSEQUAL and the message type
is less than or equal to the desired type.
<item> The search mode is SEARCH_EQUAL and the message type is
the same as the desired type.
<item> The search mode is SEARCH_NOTEQUAL and the message type is
not equal to the specified type.
</itemize>
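Roughly (after 2.4 <tt>ipc/msg.c</tt>):
<tscreen><code>
static int testmsg(struct msg_msg *msg, long type, int mode)
{
        switch (mode) {
        case SEARCH_ANY:
                return 1;
        case SEARCH_LESSEQUAL:
                if (msg->m_type <= type)
                        return 1;
                break;
        case SEARCH_EQUAL:
                if (msg->m_type == type)
                        return 1;
                break;
        case SEARCH_NOTEQUAL:
                if (msg->m_type != type)
                        return 1;
                break;
        }
        return 0;
}
</code></tscreen>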
</sect3>
<sect3>pipelined_send()<label id="pipelined_send"><p>
pipelined_send() allows a process to directly send a message
to a waiting receiver rather than deposit the message in the
associated message waiting queue. The
<ref id="testmsg" name="testmsg()"> function is
invoked to find the first receiver which is waiting for the
given message. If found, the waiting receiver is removed from
the receiver waiting queue, and the associated receiving task is
awakened. The message is stored in the <tt>r_msg</tt>
field of the receiver, and 1 is returned. In the case where no
receiver is waiting for the message, 0 is returned.
In the process of searching for a receiver, potential
receivers may be found which have requested a size that is too small
for the given message. Such receivers are removed from the queue,
and are awakened with an error status of E2BIG, which is stored in the
<tt>r_msg</tt> field. The search then continues until
either a valid receiver is found, or the queue is exhausted.
</sect3>
<sect3>copy_msqid_to_user()<label id="copy_msqid_to_user"><p>
copy_msqid_to_user() copies the contents of a kernel buffer to
the user buffer. It receives as parameters a user buffer, a
kernel buffer of type
<ref id="struct_msqid64_ds" name="msqid64_ds">, and a
version flag indicating
the new IPC version vs. the old IPC version. If the version
flag equals IPC_64, then copy_to_user() is invoked to copy from
the kernel buffer to the user buffer directly. Otherwise a
temporary buffer of type struct msqid_ds is initialized, and the
kernel data is translated to this temporary buffer. Later
copy_to_user() is called to copy the contents of the temporary
buffer to the user buffer.
</sect3>
<sect3>copy_msqid_from_user()<label id="copy_msqid_from_user"><p>
The function copy_msqid_from_user() receives as parameters
a kernel message buffer of type struct msq_setbuf, a user buffer
and a version flag indicating the new IPC version vs. the old IPC
version. In the case of the new IPC version, copy_from_user()
is called to copy the contents of the user buffer
to a temporary buffer of type <ref id="struct_msqid64_ds" name="msqid64_ds">.
Then, the <tt>qbytes</tt>,<tt>uid</tt>, <tt>gid</tt>,
and <tt>mode</tt> fields of the kernel buffer are
filled with the values of the
corresponding fields from the temporary buffer. In the case of the
old IPC version, a temporary buffer of type struct
<ref id="struct_msqid_ds" name="msqid_ds"> is used instead.
</sect3>
</sect2>
</sect1>
<sect1>Shared Memory<label id="sharedmem"><p>
<sect2>Shared Memory System Call Interfaces<label id="Shared_Memory_System_Call_Interfaces">
<p>
<sect3>sys_shmget()<label id="sys_shmget"><p>
The entire call to sys_shmget() is protected by the
global shared memory semaphore.
In the case where a new shared memory segment must
be created, the <ref id="newseg" name="newseg()"> function is called to create
and initialize a new shared memory segment. The ID of
the new segment is returned to the caller.
In the case where a key value is provided for an
existing shared memory segment, the corresponding index
in the shared memory descriptors array is looked up, and
the parameters and permissions of the caller are verified
before returning the shared memory segment ID. The look up
operation and verification are performed while the global
shared memory spinlock is held.
</sect3>
<sect3>sys_shmctl()<label id="sys_shmctl"><p>
<sect4>IPC_INFO<label id="IPC_INFO"><p>
A temporary <ref id="struct_shminfo64" name="shminfo64">
buffer is loaded with system-wide
shared memory parameters and is copied out to user space for
access by the calling application.
</sect4>
<sect4>SHM_INFO<label id="SHM_INFO"><p>
The global shared memory semaphore and the global shared
memory spinlock are held while gathering system-wide statistical
information for shared memory. The
<ref id="shm_get_stat" name="shm_get_stat()"> function is called
to calculate both the number of shared memory pages that are
resident in memory and the number of shared memory pages that are
swapped out. Other statistics include the total number of shared
memory pages and the number of shared memory segments in use.
The counts of <tt>swap_attempts</tt> and <tt>swap_successes</tt>
are hard-coded to zero. These statistics are stored in a temporary
<ref id="struct_shm_info" name="shm_info"> buffer and copied out
to user space for the calling application.
</sect4>
<sect4>SHM_STAT, IPC_STAT<label id="SHM_STAT_IPC_STAT"><p>
For SHM_STAT and IPC_STAT, a temporary buffer of type
<ref id="struct_shmid64_ds" name="struct shmid64_ds"> is
initialized, and the global shared memory spinlock is locked.
For the SHM_STAT case, the shared memory segment ID parameter is
expected to be a straight index (i.e. 0 to n where n is the
number of shared memory IDs in the system). After validating
the index, <ref id="ipc_buildid" name="ipc_buildid()">
is called (via shm_buildid()) to
convert the index into a shared memory ID. On success for
SHM_STAT, the shared memory ID is the return value.
Note that this is an undocumented feature, but is maintained
for the ipcs(8) program.
For the IPC_STAT case, the shared memory segment ID parameter is
expected to be an ID that was generated by a call to
<ref id="sys_shmget" name="shmget()">.
The ID is validated before proceeding. On success for
IPC_STAT, 0 is the return value.
For both SHM_STAT and IPC_STAT, the access permissions of
the caller are verified. The desired statistics are loaded into
the temporary buffer and then copied out to the calling
application.
</sect4>
<sect4>SHM_LOCK, SHM_UNLOCK<label id="SHM_LOCK_SHM_UNLOCK"><p>
After validating access permissions, the global shared
memory spinlock is locked, and the shared memory segment ID
is validated. For both SHM_LOCK and SHM_UNLOCK,
<ref id="shmem_lock" name="shmem_lock()">
is called to perform the function. The parameters for
<ref id="shmem_lock" name="shmem_lock()">
identify the function to be performed.
</sect4>
<sect4>IPC_RMID<label id="IPC_RMID"><p>
During IPC_RMID the global shared memory semaphore and
the global shared memory spinlock are held throughout this
function. The Shared Memory ID is validated, and then if
there are no current attachments, <ref id="shm_destroy" name="shm_destroy()">
is called to destroy the shared memory segment.
Otherwise, the SHM_DEST flag is set to mark it for destruction,
and the key is set to IPC_PRIVATE to prevent other processes
from being able to reference the shared memory ID.
</sect4>
<sect4>IPC_SET<label id="IPC_SET"><p>
After validating the shared memory segment ID and the user
access permissions, the <tt>uid</tt>, <tt>gid</tt>, and <tt>mode</tt> flags of the
shared memory segment are updated with the user data. The
<tt>shm_ctime</tt> field is also updated. These changes are made
while holding the global shared memory semaphore and the
global shared memory spinlock.
</sect4>
</sect3>
<sect3>sys_shmat()<label id="sys_shmat"><p>
sys_shmat() takes as parameters a shared memory segment ID,
an address at which the shared memory segment should be
attached (<tt>shmaddr</tt>), and flags which will be described below.
If <tt>shmaddr</tt> is non-zero, and the SHM_RND flag is
specified, then <tt>shmaddr</tt> is rounded down to a multiple of
SHMLBA. If <tt>shmaddr</tt> is not a multiple of SHMLBA and SHM_RND
is not specified, then EINVAL is returned.
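Condensed, the address handling looks like this (after 2.4
<tt>ipc/shm.c</tt>):
<tscreen><code>
if (shmaddr) {
        if (shmaddr & (SHMLBA - 1)) {
                if (shmflg & SHM_RND)
                        shmaddr &= ~(SHMLBA - 1); /* round down to SHMLBA */
                else
                        return -EINVAL;
        }
        flags = MAP_SHARED | MAP_FIXED; /* caller chose the address */
} else
        flags = MAP_SHARED;             /* let do_mmap() pick one */
</code></tscreen>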
The access permissions of the caller are validated and
the <tt>shm_nattch</tt> field for the shared memory segment is
incremented. Note that this increment guarantees that the
attachment count is non-zero and prevents the shared memory
segment from being destroyed during the process of attaching
to the segment. These operations are performed while holding the
global shared memory spinlock.
The do_mmap() function is called to create a virtual memory
mapping to the shared memory segment pages. This is done while
holding the <tt>mmap_sem</tt> semaphore of the current task. The
MAP_SHARED flag is passed to do_mmap(). If an address was
provided by the caller, then the MAP_FIXED flag is also passed
to do_mmap(). Otherwise, do_mmap() will select the virtual
address at which to map the shared memory segment.
Note that <ref id="shm_inc" name="shm_inc()"> will be invoked within the do_mmap()
function call via the <tt>shm_file_operations</tt> structure. This
function is called to set the PID, to set the current time, and
to increment the number of attachments to this shared memory
segment.
After the call to do_mmap(), the global shared memory
semaphore and the global shared memory spinlock are both
obtained. The attachment count is then decremented. The net
change to the attachment count is 1 for a call
to shmat() because of the call to <ref id="shm_inc" name="shm_inc()">. If, after
decrementing the attachment count, the resulting count is found
to be zero, and if the segment is marked for destruction
(SHM_DEST), then <ref id="shm_destroy" name="shm_destroy()"> is
called to release the shared memory segment resources.
Finally, the virtual address at which the shared memory is
mapped is returned to the caller at the user specified address.
If an error code had been returned by do_mmap(), then this
failure code is passed on as the return value for the system call.
</sect3>
<sect3>sys_shmdt()<label id="sys_shmdt"><p>
The global shared memory semaphore is held while performing
sys_shmdt(). The <tt>mm_struct</tt> of the current
process is searched for the <tt>vm_area_struct</tt> associated with
the shared memory address. When it is found, do_munmap() is
called to undo the virtual address mapping for the shared memory segment.
Note also that do_munmap() performs a call-back to
<ref id="shm_close" name="shm_close()">,
which performs the shared-memory book keeping functions, and
releases the shared memory segment resources if there are no other
attachments.
sys_shmdt() unconditionally returns 0.
</sect3>
</sect2>
<sect2>Shared Memory Support Structures<label id="shm_structures"><p>
<sect3>struct shminfo64<label id="struct_shminfo64"><p>
<tscreen><code>
struct shminfo64 {
unsigned long shmmax;
unsigned long shmmin;
unsigned long shmmni;
unsigned long shmseg;
unsigned long shmall;
unsigned long __unused1;
unsigned long __unused2;
unsigned long __unused3;
unsigned long __unused4;
};
</code></tscreen>
</sect3>
<sect3>struct shm_info<label id="struct_shm_info"><p>
<tscreen><code>
struct shm_info {
int used_ids;
unsigned long shm_tot; /* total allocated shm */
unsigned long shm_rss; /* total resident shm */
unsigned long shm_swp; /* total swapped shm */
unsigned long swap_attempts;
unsigned long swap_successes;
};
</code></tscreen>
</sect3>
<sect3>struct shmid_kernel<label id="struct_shmid_kernel"><p>
<tscreen><code>
struct shmid_kernel /* private to the kernel */
{
struct kern_ipc_perm shm_perm;
struct file * shm_file;
int id;
unsigned long shm_nattch;
unsigned long shm_segsz;
time_t shm_atim;
time_t shm_dtim;
time_t shm_ctim;
pid_t shm_cprid;
pid_t shm_lprid;
};
</code></tscreen>
</sect3>
<sect3>struct shmid64_ds<label id="struct_shmid64_ds"><p>
<tscreen><code>
struct shmid64_ds {
struct ipc64_perm shm_perm; /* operation perms */
size_t shm_segsz; /* size of segment (bytes) */
__kernel_time_t shm_atime; /* last attach time */
unsigned long __unused1;
__kernel_time_t shm_dtime; /* last detach time */
unsigned long __unused2;
__kernel_time_t shm_ctime; /* last change time */
unsigned long __unused3;
__kernel_pid_t shm_cpid; /* pid of creator */
__kernel_pid_t shm_lpid; /* pid of last operator */
unsigned long shm_nattch; /* no. of current attaches */
unsigned long __unused4;
unsigned long __unused5;
};
</code></tscreen>
</sect3>
<sect3>struct shmem_inode_info<label id="struct_shmem_inode_info"><p>
<tscreen><code>
struct shmem_inode_info {
spinlock_t lock;
unsigned long max_index;
swp_entry_t i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
swp_entry_t **i_indirect; /* doubly indirect blocks */
unsigned long swapped;
int locked; /* into memory */
struct list_head list;
};
</code></tscreen>
</sect3>
</sect2>
<sect2>Shared Memory Support Functions<label id="shm_primitives"><p>
<sect3>newseg()<label id="newseg"><p>
The newseg() function is called when a new shared memory
segment needs to be created. It acts on three parameters for
the new segment: the key, the flag, and the size. After
validating that the size of the shared memory segment to be
created is between SHMMIN and SHMMAX and that the total number
of shared memory segments does not exceed SHMALL, it allocates
a new shared memory segment descriptor.
The <ref id="shmem_file_setup" name="shmem_file_setup()">
function is invoked later to create an unlinked file of type
tmpfs. The returned file pointer is saved in the <tt>shm_file</tt> field
of the associated shared memory segment descriptor. The file
size is set to the size of the segment. The
new shared memory segment descriptor is initialized and inserted
into the global IPC shared memory descriptors array. The shared
memory segment ID is created by shm_buildid()
(via <ref id="ipc_buildid" name="ipc_buildid()">).
This segment ID is saved in the <tt>id</tt> field of the shared memory
segment descriptor, as well as in the <tt>i_ino</tt> field of the associated
inode. In addition, the address of the shared memory operations
defined in structure <tt>shm_file_operations</tt> is stored in the associated
file. The value of the global variable <tt>shm_tot</tt>, which indicates
the total number of shared memory segments system wide, is also
increased to reflect this change. On success, the segment ID is
returned to the caller application.
</sect3>
<sect3>shm_get_stat()<label id="shm_get_stat"><p>
shm_get_stat() cycles through all of the shared memory
structures, and calculates the total number of memory pages in
use by shared memory and the total number of shared memory pages
that are swapped out. There is a file structure and an inode
structure for each shared memory segment. Since the required
data is obtained via the inode, the spinlock for each inode
structure that is accessed is locked and unlocked in sequence.
</sect3>
<sect3>shmem_lock()<label id="shmem_lock"><p>
shmem_lock() receives as parameters a pointer to the
shared memory segment descriptor and a flag indicating
lock vs. unlock. The locking state of the shared memory
segment is stored in an associated inode. This state is compared
with the desired locking state; shmem_lock() simply returns if they match.
While holding the semaphore of the associated inode, the
locking state of the inode is set. The following operations
occur for each page in the shared memory segment:
<itemize>
<item> find_lock_page() is called to lock the page (setting
PG_locked) and to increment the reference count of the page.
Incrementing the reference count assures that the shared
memory segment remains locked in memory throughout this
operation.
<item> If the desired state is locked, then PG_locked is cleared,
but the reference count remains incremented.
<item> If the desired state is unlocked, then the reference count
is decremented twice: once for the current reference, and once
for the existing reference which caused the page to remain
locked in memory. Then PG_locked is cleared.
</itemize>
</sect3>
<sect3>shm_destroy()<label id="shm_destroy"><p>
During shm_destroy() the total number of shared memory pages
is adjusted to account for the removal of the shared memory segment.
<ref id="func_ipc_rmid" name="ipc_rmid()"> is called
(via shm_rmid()) to remove the Shared Memory ID. <ref id="shmem_lock" name="shmem_lock()"> is
called to unlock the shared memory pages, effectively decrementing
the reference counts to zero for each page. fput() is called to
decrement the usage counter <tt>f_count</tt> for the associated file object,
and if necessary, to release the file object resources. kfree() is
called to free the shared memory segment descriptor.
</sect3>
<sect3>shm_inc()<label id="shm_inc"><p>
shm_inc() sets the PID, sets the current time, and increments
the number of attachments for the given shared memory segment.
These operations are performed while holding the global shared
memory spinlock.
</sect3>
<sect3>shm_close()<label id="shm_close"><p>
shm_close() updates the <tt>shm_lprid</tt> and the <tt>shm_dtim</tt> fields
and decrements the number of attached shared memory segments. If
there are no other attachments to the shared memory segment,
then <ref id="shm_destroy" name="shm_destroy()"> is called to
release the shared memory segment resources. These operations are
all performed while holding both the global shared memory semaphore
and the global shared memory spinlock.
</sect3>
<sect3>shmem_file_setup()<label id="shmem_file_setup"><p>
The function shmem_file_setup() sets up an unlinked file living
in the tmpfs file system with the given name and size. If there
are enough system memory resources for this file, it creates a new
dentry under the mount root of tmpfs, and allocates a new file
descriptor and a new inode object of tmpfs type. Then it associates
the new dentry object with the new inode object by calling
d_instantiate() and saves the address of the dentry object in the
file descriptor. The <tt>i_size</tt> field of the inode object is set to
be the file size and the <tt>i_nlink</tt> field is set to be 0 in order to
mark the inode unlinked. Also, shmem_file_setup() stores the
address of the <tt>shmem_file_operations</tt> structure in the <tt>f_op</tt> field,
and initializes <tt>f_mode</tt> and <tt>f_vfsmnt</tt> fields of the file descriptor
properly. The function shmem_truncate() is called to complete the
initialization of the inode object. On success, shmem_file_setup()
returns the new file descriptor.
</sect3>
</sect2>
</sect1>
<sect1>Linux IPC Primitives<label id="ipc_primitives"><p>
<sect2>Generic Linux IPC Primitives used with Semaphores, Messages,and Shared Memory
<label id="Generic_Linux_IPC_Primitives_used_with_Semaphores_Messages_and_Shared_Memory">
<p>
The semaphores, messages, and shared memory mechanisms of Linux
are built on a set of common primitives. These primitives are described in the sections below.
<sect3>ipc_alloc()<label id="ipc_alloc"><p>
If the memory allocation is greater than PAGE_SIZE, then
vmalloc() is used to allocate memory. Otherwise, kmalloc() is
called with GFP_KERNEL to allocate the memory.
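The function is essentially (after 2.4 <tt>ipc/util.c</tt>):
<tscreen><code>
void *ipc_alloc(int size)
{
        void *out;

        if (size > PAGE_SIZE)
                out = vmalloc(size);  /* large: page-granular allocator */
        else
                out = kmalloc(size, GFP_KERNEL);
        return out;
}
</code></tscreen>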
</sect3>
<sect3>ipc_addid()<label id="ipc_addid"><p>
When a new semaphore set, message queue, or shared memory
segment is added, ipc_addid() first calls <ref id="grow_ary" name="grow_ary()"> to
insure that the size of the corresponding descriptor array is
sufficiently large for the system maximum. The array of descriptors
is searched for the first unused element. If an unused element
is found, the count of descriptors which are in use is incremented.
The <ref id="struct_kern_ipc_perm" name="kern_ipc_perm"> structure for the new resource descriptor
is then initialized, and the array index for the new descriptor
is returned. When ipc_addid() succeeds, it returns with the global
spinlock for the given IPC type locked.
</sect3>
<sect3>ipc_rmid()<label id="func_ipc_rmid"><p>
ipc_rmid() removes the IPC descriptor from the global
descriptor array of the IPC type, updates the count of IDs which
are in use, and adjusts the maximum ID in the corresponding
descriptor array if necessary. A pointer to the IPC
descriptor associated with given IPC ID is returned.
</sect3>
<sect3>ipc_buildid()<label id="ipc_buildid"><p>
ipc_buildid() creates a unique ID to be associated with
each descriptor within a given IPC type. This ID is created at
the time a new IPC element is added (e.g. a new shared memory
segment or a new semaphore set). The IPC ID converts
easily into the corresponding descriptor array index. Each
IPC type maintains a sequence number which is incremented
each time a descriptor is added. An ID is created by
multiplying the sequence number with SEQ_MULTIPLIER and adding
the product to the descriptor array index. The sequence number
used in creating a particular IPC ID is then stored in the
corresponding descriptor. The existence of the sequence number
makes it possible to detect the use of a stale IPC ID.
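The construction itself is a one-liner (after 2.4 <tt>ipc/util.h</tt>):
<tscreen><code>
extern inline int ipc_buildid(struct ipc_ids *ids, int id, int seq)
{
        /* the array index is recoverable as (ID % SEQ_MULTIPLIER) */
        return SEQ_MULTIPLIER * seq + id;
}
</code></tscreen>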
</sect3>
<sect3>ipc_checkid()<label id="ipc_checkid"><p>
ipc_checkid() divides the given IPC ID by SEQ_MULTIPLIER
and compares the quotient with the seq value saved in the
corresponding descriptor. If they are equal, then the IPC ID is
considered to be valid and 0 is returned; otherwise, 1 is returned.
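Roughly (after 2.4 <tt>ipc/util.h</tt>):
<tscreen><code>
extern inline int ipc_checkid(struct ipc_ids *ids,
                              struct kern_ipc_perm *ipcp, int uid)
{
        if (uid / SEQ_MULTIPLIER != ipcp->seq)
                return 1;  /* stale ID: sequence numbers disagree */
        return 0;          /* ID is valid */
}
</code></tscreen>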
</sect3>
<sect3>grow_ary()<label id="grow_ary"><p>
grow_ary() handles the possibility that the maximum
(tunable) number of IDs for a given IPC type can be dynamically
changed. It enforces the current maximum limit so that it is no
greater than the permanent system limit (IPCMNI) and adjusts it down
if necessary. It also ensures that the existing descriptor array
is large enough. If the existing array size is sufficiently large,
then the current maximum limit is returned. Otherwise, a new larger
array is allocated, the old array is copied into the new array,
and the old array is freed. The corresponding global
spinlock is held when updating the descriptor array for the
given IPC type.
</sect3>
<sect3>ipc_findkey()<label id="ipc_findkey"><p>
ipc_findkey() searches through the descriptor array of
the specified <ref id="struct_ipc_ids" name="ipc_ids"> object,
and searches for the specified key. Once found, the index of
the corresponding descriptor is returned. If the key is not found,
then -1 is returned.
</sect3>
<sect3>ipcperms()<label id="ipcperms"><p>
ipcperms() checks the user, group, and other permissions
for access to the IPC resources. It returns 0 if permission
is granted and -1 otherwise.
</sect3>
<sect3>ipc_lock()<label id="ipc_lock"><p>
ipc_lock() takes an IPC ID as one of its parameters.
It locks the global spinlock for the given IPC type, and
returns a pointer to the descriptor corresponding to the
specified IPC ID.
</sect3>
<sect3>ipc_unlock()<label id="ipc_unlock"><p>
ipc_unlock() releases the global spinlock for the indicated IPC
type.
</sect3>
<sect3>ipc_lockall()<label id="ipc_lockall"><p>
ipc_lockall() locks the global spinlock for the given
IPC mechanism (i.e. shared memory, semaphores, and messaging).
</sect3>
<sect3>ipc_unlockall()<label id="ipc_unlockall"><p>
ipc_unlockall() unlocks the global spinlock for the given
IPC mechanism (i.e. shared memory, semaphores, and messaging).
</sect3>
<sect3>ipc_get()<label id="ipc_get"><p>
ipc_get() takes a pointer to a particular IPC type
(i.e. shared memory, semaphores, or message queues) and a
descriptor ID, and returns a pointer to the corresponding
IPC descriptor. Note that although the descriptors for each
IPC type are of different data types, the common
<ref id="struct_kern_ipc_perm" name="kern_ipc_perm">
structure type is embedded as the first entity in every case.
The ipc_get() function returns this common data type. The expected
model is that ipc_get() is called through a wrapper function
(e.g. shm_get()) which casts the data type to the correct
descriptor data type.
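A sketch of the lookup (after 2.4 <tt>ipc/util.h</tt>):
<tscreen><code>
extern inline struct kern_ipc_perm *ipc_get(struct ipc_ids *ids, int id)
{
        int lid = id % SEQ_MULTIPLIER; /* recover the array index */

        if (lid >= ids->size)
                return NULL;
        return ids->entries[lid].p;    /* common descriptor header */
}
</code></tscreen>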
</sect3>
<sect3>ipc_parse_version()<label id="ipc_parse_version"><p>
ipc_parse_version() removes the IPC_64 flag from the command
if it is present and returns either IPC_64 or IPC_OLD.
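Roughly (after 2.4 <tt>ipc/util.c</tt>):
<tscreen><code>
int ipc_parse_version(int *cmd)
{
        if (*cmd & IPC_64) {
                *cmd ^= IPC_64; /* strip the flag from the command */
                return IPC_64;
        }
        return IPC_OLD;
}
</code></tscreen>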
</sect3>
</sect2>
<sect2>Generic IPC Structures used with Semaphores,
Messages, and Shared Memory<label id="ipc_structures"><p>
The semaphores, messages, and shared memory mechanisms all make
use of the following common structures:
<sect3>struct kern_ipc_perm<label id="struct_kern_ipc_perm"><p>
Each of the IPC descriptors has a data object of this type
as the first element. This makes it possible to access any
descriptor from any of the generic IPC functions using a pointer
of this data type.
<tscreen><code>
/* used by in-kernel data structures */
struct kern_ipc_perm {
key_t key;
uid_t uid;
gid_t gid;
uid_t cuid;
gid_t cgid;
mode_t mode;
unsigned long seq;
};
</code></tscreen>
</sect3>
<sect3>struct ipc_ids<label id="struct_ipc_ids"><p>
The ipc_ids structure describes the common data for semaphores,
message queues, and shared memory. There are three global instances of
this data structure -- <tt>sem_ids</tt>,
<tt>msg_ids</tt> and <tt>shm_ids</tt> -- for
semaphores, messages and shared memory respectively. In each
instance, the <tt>sem</tt> semaphore is used to
protect access to the structure.
The <tt>entries</tt> field points to an IPC
descriptor array, and the
<tt>ary</tt> spinlock protects access to this array. The
<tt>seq</tt> field is a global sequence number which will
be incremented when a new IPC resource is created.
<tscreen><code>
struct ipc_ids {
int size;
int in_use;
int max_id;
unsigned short seq;
unsigned short seq_max;
struct semaphore sem;
spinlock_t ary;
struct ipc_id* entries;
};
</code></tscreen>
</sect3>
<sect3>struct ipc_id<label id="struct_ipc_id"><p>
An array of struct ipc_id exists in each instance of
the <ref id="struct_ipc_ids" name="ipc_ids"> structure.
The array is dynamically allocated and may be replaced with a
larger array by <ref id="grow_ary" name="grow_ary()">
as required. The array is
sometimes referred to as the descriptor array, since the
<ref id="struct_kern_ipc_perm" name="kern_ipc_perm"> data
type is used as the common descriptor data type by the IPC generic
functions.
<tscreen><code>
struct ipc_id {
struct kern_ipc_perm* p;
};
</code></tscreen>
</sect3>
</sect2>
</sect1>
</sect>
</article>