<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
<TITLE>Linux Kernel 2.4 Internals: Process and Interrupt Management</TITLE>
<LINK HREF="lki-3.html" REL=next>
<LINK HREF="lki-1.html" REL=previous>
<LINK HREF="lki.html#toc2" REL=contents>
</HEAD>
<BODY>
<A HREF="lki-3.html">Next</A>
<A HREF="lki-1.html">Previous</A>
<A HREF="lki.html#toc2">Contents</A>
<HR>
<H2><A NAME="s2">2. Process and Interrupt Management</A></H2>
<P>
<P>
<H2><A NAME="ss2.1">2.1 Task Structure and Process Table</A>
</H2>
<P>
<P>Every process under Linux is dynamically allocated a <CODE>struct task_struct</CODE>
structure. The maximum number of processes which can be created on Linux
is limited only by the amount of physical memory present, and is
equal to (see <CODE>kernel/fork.c:fork_init()</CODE>):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
/*
* The default maximum number of threads is set to a safe
* value: the thread structures can take up at most half
* of memory.
*/
max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>which, on the IA32 architecture, basically means <CODE>num_physpages/4</CODE>. As an example,
on a 512M machine, you can create 32k threads. This is a considerable
improvement over the 4k-epsilon limit for older (2.2 and earlier) kernels.
Moreover, this can be changed at runtime using the KERN_MAX_THREADS <B>sysctl(2)</B>,
or simply using the procfs interface to kernel tunables:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
# cat /proc/sys/kernel/threads-max
32764
# echo 100000 > /proc/sys/kernel/threads-max
# cat /proc/sys/kernel/threads-max
100000
# gdb -q vmlinux /proc/kcore
Core was generated by `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118'.
#0 0x0 in ?? ()
(gdb) p max_threads
$1 = 100000
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The set of processes on the Linux system is represented as a collection of
<CODE>struct task_struct</CODE> structures which are linked in two ways:
<P>
<OL>
<LI> as a hashtable, hashed by pid, and</LI>
<LI> as a circular, doubly-linked list using <CODE>p->next_task</CODE> and <CODE>p->prev_task</CODE>
pointers.</LI>
</OL>
<P>The hashtable is called <CODE>pidhash[]</CODE> and is defined in
<CODE>include/linux/sched.h</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
/* PID hashing. (shouldnt this be dynamic?) */
#define PIDHASH_SZ (4096 >> 2)
extern struct task_struct *pidhash[PIDHASH_SZ];
#define pid_hashfn(x) ((((x) >> 8) ^ (x)) &amp; (PIDHASH_SZ - 1))
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The tasks are hashed by their pid value and the above hashing function is
supposed to distribute the elements uniformly in their domain
(<CODE>0</CODE> to <CODE>PID_MAX-1</CODE>). The hashtable is used to quickly find a task by given pid,
using the <CODE>find_task_by_pid()</CODE> inline function from <CODE>include/linux/sched.h</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static inline struct task_struct *find_task_by_pid(int pid)
{
struct task_struct *p, **htable = &amp;pidhash[pid_hashfn(pid)];
for(p = *htable; p &amp;&amp; p->pid != pid; p = p->pidhash_next)
;
return p;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The tasks on each hashlist (i.e. hashed to the same value) are linked
by <CODE>p->pidhash_next/pidhash_pprev</CODE> which are used by <CODE>hash_pid()</CODE> and
<CODE>unhash_pid()</CODE> to insert a given process into, and remove it from, the hashtable.
These are done under protection of the read-write spinlock called <CODE>tasklist_lock</CODE>
taken for WRITE.
<P>The circular doubly-linked list that uses <CODE>p->next_task/prev_task</CODE> is
maintained so that one could go through all tasks on the system easily.
This is achieved by the <CODE>for_each_task()</CODE> macro from <CODE>include/linux/sched.h</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#define for_each_task(p) \
for (p = &amp;init_task ; (p = p->next_task) != &amp;init_task ; )
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>Users of <CODE>for_each_task()</CODE> should take tasklist_lock for READ.
Note that <CODE>for_each_task()</CODE> is using <CODE>init_task</CODE> to mark the beginning (and end)
of the list - this is safe because the idle task (pid 0) never exits.
<P>The modifiers of the process hashtable or/and the process table links,
notably <CODE>fork()</CODE>, <CODE>exit()</CODE> and <CODE>ptrace()</CODE>, must take <CODE>tasklist_lock</CODE> for WRITE. What is
more interesting is that the writers must also disable interrupts on the
local CPU. The reason for this is not trivial: the <CODE>send_sigio()</CODE> function walks the
task list and thus takes <CODE>tasklist_lock</CODE> for READ, and it is called from
<CODE>kill_fasync()</CODE> in interrupt context. This is why writers must disable
interrupts while readers don't need to.
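<P>To make these locking rules concrete, here is a minimal sketch of a reader
walking the task list; <CODE>count_tasks()</CODE> is a hypothetical helper, not a function
from the kernel source:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;linux/sched.h>        /* struct task_struct, tasklist_lock, for_each_task() */
#include &lt;linux/spinlock.h>

/* Readers walk the task list under tasklist_lock taken for READ. */
static int count_tasks(void)
{
        struct task_struct *p;
        int n = 0;

        read_lock(&amp;tasklist_lock);
        for_each_task(p)
                n++;
        read_unlock(&amp;tasklist_lock);
        return n;
}

/* A writer (e.g. do_fork() hashing the new task) must instead do:
 *
 *      write_lock_irq(&amp;tasklist_lock);
 *      hash_pid(p);
 *      write_unlock_irq(&amp;tasklist_lock);
 */
</PRE>
<HR>
</CODE></BLOCKQUOTE>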
<P>Now that we understand how the <CODE>task_struct</CODE> structures are linked together,
let us examine the members of <CODE>task_struct</CODE>. They loosely correspond to the
members of UNIX 'struct proc' and 'struct user' combined together.
<P>The other versions of UNIX separated the task state information into
one part which should be kept memory-resident at all times (called 'proc
structure' which includes process state, scheduling information etc.) and
another part which is only needed when the process is running (called 'u area' which
includes file descriptor table, disk quota information etc.). The only reason
for such ugly design was that memory was a very scarce resource. Modern
operating systems (well, only Linux at the moment but others, e.g. FreeBSD
seem to improve in this direction towards Linux) do not need such separation
and therefore maintain process state in a kernel memory-resident data
structure at all times.
<P>The task_struct structure is declared in <CODE>include/linux/sched.h</CODE> and is
currently 1680 bytes in size.
<P>The state field is declared as:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define TASK_ZOMBIE 4
#define TASK_STOPPED 8
#define TASK_EXCLUSIVE 32
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>Why is <CODE>TASK_EXCLUSIVE</CODE> defined as 32 and not 16? Because 16 was used up by
<CODE>TASK_SWAPPING</CODE> and I forgot to shift <CODE>TASK_EXCLUSIVE</CODE> up when I removed
all references to <CODE>TASK_SWAPPING</CODE> (sometime in 2.3.x).
<P>The <CODE>volatile</CODE> in <CODE>p->state</CODE> declaration means it can be modified
asynchronously (from interrupt handler):
<P>
<OL>
<LI><B>TASK_RUNNING</B>: means the task is "supposed to be" on the run
queue. The reason it may not yet be on the runqueue is that marking a task as
<CODE>TASK_RUNNING</CODE> and placing it on the runqueue is not atomic. You need to hold
the <CODE>runqueue_lock</CODE> spinlock in order to look at the
runqueue. If you do so, you will then see that every task on the runqueue is in
<CODE>TASK_RUNNING</CODE> state. However, the converse is not true for the reason explained
above. Similarly, drivers can mark themselves (or rather the process context they
run in) as <CODE>TASK_INTERRUPTIBLE</CODE> (or <CODE>TASK_UNINTERRUPTIBLE</CODE>) and then call <CODE>schedule()</CODE>,
which will then remove it from the runqueue (unless there is a pending signal, in which
case it is left on the runqueue). </LI>
<LI><B>TASK_INTERRUPTIBLE</B>: means the task is sleeping but can be woken up
by a signal or by expiry of a timer.</LI>
<LI><B>TASK_UNINTERRUPTIBLE</B>: same as <CODE>TASK_INTERRUPTIBLE</CODE>, except it cannot
be woken up by a signal.</LI>
<LI><B>TASK_ZOMBIE</B>: task has terminated but has not had its status collected
(<CODE>wait()</CODE>-ed for) by the parent (natural or by adoption).</LI>
<LI><B>TASK_STOPPED</B>: task was stopped, either due to job control signals or
due to <B>ptrace(2)</B>.</LI>
<LI><B>TASK_EXCLUSIVE</B>: this is not a separate state but can be OR-ed to
either one of <CODE>TASK_INTERRUPTIBLE</CODE> or <CODE>TASK_UNINTERRUPTIBLE</CODE>.
This means that when
this task is sleeping on a wait queue with many other tasks, it will be
woken up alone instead of causing "thundering herd" problem by waking up all
the waiters.</LI>
</OL>
<P>Task flags contain information about the process states which are not
mutually exclusive:
<BLOCKQUOTE><CODE>
<HR>
<PRE>
unsigned long flags; /* per process flags, defined below */
/*
* Per process flags
*/
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
/* Not implemented yet, only for 486*/
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */
#define PF_DUMPCORE 0x00000200 /* dumped core */
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The fields <CODE>p->has_cpu</CODE>, <CODE>p->processor</CODE>, <CODE>p->counter</CODE>, <CODE>p->priority</CODE>, <CODE>p->policy</CODE> and
<CODE>p->rt_priority</CODE> are related to the scheduler and will be looked at later.
<P>The fields <CODE>p->mm</CODE> and <CODE>p->active_mm</CODE> point respectively to the process' address space
described by <CODE>mm_struct</CODE> structure and to the active address space if the
process doesn't have a real one (e.g. kernel threads). This helps minimise
TLB flushes on switching address spaces when the task is scheduled out.
So, if we are scheduling-in the kernel thread (which has no <CODE>p->mm</CODE>) then its
<CODE>next->active_mm</CODE> will be set to the <CODE>prev->active_mm</CODE> of the task that was
scheduled-out, which will be the same as <CODE>prev->mm</CODE> if <CODE>prev->mm != NULL</CODE>.
The address space can be shared between threads if <CODE>CLONE_VM</CODE> flag is passed
to the <B>clone(2)</B> system call or by means of <B>vfork(2)</B> system call.
<P>The fields <CODE>p->exec_domain</CODE> and <CODE>p->personality</CODE> relate to the personality of
the task, i.e. to the way certain system calls behave in order to emulate the
"personality" of foreign flavours of UNIX.
<P>The field <CODE>p->fs</CODE> contains filesystem information, which under Linux means
three pieces of information:
<P>
<OL>
<LI>root directory's dentry and mountpoint,</LI>
<LI>alternate root directory's dentry and mountpoint,</LI>
<LI>current working directory's dentry and mountpoint.</LI>
</OL>
<P>This structure also includes a reference count because it can be shared
between cloned tasks when <CODE>CLONE_FS</CODE> flag is passed to the <B>clone(2)</B> system
call.
<P>The field <CODE>p->files</CODE> contains the file descriptor table. This too can be
shared between tasks, provided <CODE>CLONE_FILES</CODE> is specified with <B>clone(2)</B> system
call.
<P>The field <CODE>p->sig</CODE> contains signal handlers and can be shared between cloned
tasks by means of <CODE>CLONE_SIGHAND</CODE>.
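<P>To illustrate how these <CODE>CLONE_*</CODE> flags are used from userspace, here is a
minimal sketch based on the glibc <CODE>clone()</CODE> wrapper; the stack size and
<CODE>thread_fn()</CODE> are arbitrary choices for the example:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#define _GNU_SOURCE
#include &lt;sched.h>      /* clone(), CLONE_* flags */
#include &lt;signal.h>     /* SIGCHLD */
#include &lt;stdio.h>
#include &lt;stdlib.h>
#include &lt;sys/wait.h>
#include &lt;unistd.h>

static int thread_fn(void *arg)
{
        printf("child %d shares VM/fs/files/sighand, arg=%s\n",
               getpid(), (const char *)arg);
        return 0;
}

int main(void)
{
        char *stack = malloc(16384);    /* clone(2) needs a stack for the child */
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND;

        /* the low byte of the flags is the child's exit_signal (CSIGNAL) */
        if (clone(thread_fn, stack + 16384, flags | SIGCHLD, "hello") == -1) {
                perror("clone");
                return 1;
        }
        wait(NULL);     /* exit_signal is SIGCHLD, so wait(2) reaps the child */
        free(stack);
        return 0;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>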
<P>
<H2><A NAME="ss2.2">2.2 Creation and termination of tasks and kernel threads</A>
</H2>
<P>
<P>Different books on operating systems define a "process" in different ways,
starting from "instance of a program in execution" and ending with "that
which is produced by clone(2) or fork(2) system calls".
Under Linux, there are three kinds of processes:
<P>
<UL>
<LI> the idle thread(s),</LI>
<LI> kernel threads,</LI>
<LI> user tasks.</LI>
</UL>
<P>The idle thread is created at compile time for the first CPU; it is then
"manually" created for each CPU by means of arch-specific
<CODE>fork_by_hand()</CODE> in <CODE>arch/i386/kernel/smpboot.c</CODE>, which unrolls the <B>fork(2)</B> system
call by hand (on some archs). Idle tasks share one init_task structure but
have a private TSS structure, in the per-CPU array <CODE>init_tss</CODE>. Idle tasks all have
pid = 0 and no other task can share pid, i.e. use <CODE>CLONE_PID</CODE> flag to <B>clone(2)</B>.
<P>Kernel threads are created using <CODE>kernel_thread()</CODE> function which invokes
the <B>clone(2)</B> system call in kernel mode. Kernel threads usually have no user
address space, i.e. <CODE>p->mm = NULL</CODE>, because they explicitly do <CODE>exit_mm()</CODE>, e.g.
via <CODE>daemonize()</CODE> function. Kernel threads can always access kernel address
space directly. They are allocated pid numbers in the low range. Running at
processor's ring 0 (on x86, that is) implies that the kernel threads enjoy all I/O privileges
and cannot be pre-empted by the scheduler.
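<P>A typical (sketched) way for a 2.4 driver or subsystem to start its own kernel
thread is shown below; <CODE>my_kthread()</CODE>, the thread name and the one second period
are invented for the example:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;linux/kernel.h>
#include &lt;linux/sched.h>        /* daemonize(), schedule_timeout(), kernel_thread() */

static int my_kthread(void *data)
{
        daemonize();                            /* drop user resources, p->mm becomes NULL */
        sprintf(current->comm, "my_kthread");   /* name shown by ps(1) */

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                schedule_timeout(HZ);           /* sleep for about one second */
                if (signal_pending(current))
                        break;                  /* someone asked us to die */
                /* ... do the periodic work here ... */
        }
        return 0;
}

/* From the subsystem's initialisation code (init_module() for a module):
 *
 *      kernel_thread(my_kthread, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
 */
</PRE>
<HR>
</CODE></BLOCKQUOTE>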
<P>User tasks are created by means of <B>clone(2)</B> or <B>fork(2)</B> system calls, both of
which internally invoke <B>kernel/fork.c:do_fork()</B>.
<P>Let us understand what happens when a user process makes a <B>fork(2)</B> system
call. Although <B>fork(2)</B> is architecture-dependent due to the
different ways of passing user stack and registers, the actual underlying
function <CODE>do_fork()</CODE> that does the job is portable and is located at
<CODE>kernel/fork.c</CODE>.
<P>The following steps are done:
<P>
<OL>
<LI> Local variable <CODE>retval</CODE> is set to <CODE>-ENOMEM</CODE>, as this is the value which <CODE>errno</CODE>
should be set to if <B>fork(2)</B> fails to allocate a new task structure.
</LI>
<LI> If <CODE>CLONE_PID</CODE> is set in <CODE>clone_flags</CODE> then return an error (<CODE>-EPERM</CODE>), unless
the caller is the idle thread (during boot only). So, normal user
threads cannot pass <CODE>CLONE_PID</CODE> to <B>clone(2)</B> and expect it to succeed.
For <B>fork(2)</B>, this is irrelevant as <CODE>clone_flags</CODE> is set to <CODE>SIGCHLD</CODE> - this
is only relevant when <CODE>do_fork()</CODE> is invoked from <CODE>sys_clone()</CODE> which
passes the <CODE>clone_flags</CODE> from the value requested from userspace.
</LI>
<LI> <CODE>current->vfork_sem</CODE> is initialised (it is later cleared in the child).
This is used by <CODE>sys_vfork()</CODE> (<B>vfork(2)</B> system call, corresponds to
<CODE>clone_flags = CLONE_VFORK|CLONE_VM|SIGCHLD</CODE>) to make the parent sleep
until the child does <CODE>mm_release()</CODE>, for example as a result of <CODE>exec()</CODE>ing
another program or <B>exit(2)</B>-ing.
</LI>
<LI> A new task structure is allocated using arch-dependent
<CODE>alloc_task_struct()</CODE> macro. On x86 it is just a gfp at <CODE>GFP_KERNEL</CODE>
priority. This is the first reason why <B>fork(2)</B> system call may sleep.
If this allocation fails, we return <CODE>-ENOMEM</CODE>.
</LI>
<LI> All the values from current process' task structure are copied into
the new one, using structure assignment <CODE>*p = *current</CODE>. Perhaps this
should be replaced by a memcpy? Later on, the fields that should not
be inherited by the child are set to the correct values.
</LI>
<LI> Big kernel lock is taken as the rest of the code would otherwise be
non-reentrant.
</LI>
<LI> If the parent has user resources (a concept of UID, Linux is flexible
enough to make it a question rather than a fact), then verify if the
user exceeded <CODE>RLIMIT_NPROC</CODE> soft limit - if so, fail with <CODE>-EAGAIN</CODE>, if
not, increment the count of processes by given uid <CODE>p->user->count</CODE>.
</LI>
<LI> If the system-wide number of tasks exceeds the value of the tunable
max_threads, fail with <CODE>-EAGAIN</CODE>.
</LI>
<LI> If the binary being executed belongs to a modularised execution
domain, increment the corresponding module's reference count.
</LI>
<LI> If the binary being executed belongs to a modularised binary format,
increment the corresponding module's reference count.
</LI>
<LI> The child is marked as 'has not execed' (<CODE>p->did_exec = 0</CODE>)
</LI>
<LI> The child is marked as 'not-swappable' (<CODE>p->swappable = 0</CODE>)
</LI>
<LI> The child is put into 'uninterruptible sleep' state, i.e.
<CODE>p->state = TASK_UNINTERRUPTIBLE</CODE> (TODO: why is this done?
I think it's not needed - get rid of it, Linus confirms it is not
needed)
</LI>
<LI> The child's <CODE>p->flags</CODE> are set according to the value of clone_flags;
for plain <B>fork(2)</B>, this will be <CODE>p->flags = PF_FORKNOEXEC</CODE>.
</LI>
<LI> The child's pid <CODE>p->pid</CODE> is set using the fast algorithm in
<CODE>kernel/fork.c:get_pid()</CODE> (TODO: <CODE>lastpid_lock</CODE> spinlock can be made
redundant since <CODE>get_pid()</CODE> is always called under big kernel lock
from <CODE>do_fork()</CODE>, also remove flags argument of <CODE>get_pid()</CODE>, patch sent
to Alan on 20/06/2000 - followup later).
</LI>
<LI> The rest of the code in <CODE>do_fork()</CODE> initialises the rest of child's
task structure. At the very end, the child's task structure is
hashed into the <CODE>pidhash</CODE> hashtable and the child is woken up (TODO:
<CODE>wake_up_process(p)</CODE> sets <CODE>p->state = TASK_RUNNING</CODE> and adds the process
to the runq, therefore we probably didn't need to set <CODE>p->state</CODE> to
<CODE>TASK_RUNNING</CODE> earlier on in <CODE>do_fork()</CODE>). The interesting part is
setting <CODE>p->exit_signal</CODE> to <CODE>clone_flags &amp; CSIGNAL</CODE>, which for <B>fork(2)</B>
means just <CODE>SIGCHLD</CODE> and setting <CODE>p->pdeath_signal</CODE> to 0. The
<CODE>pdeath_signal</CODE> is used when a process 'forgets' the original parent
(by dying) and can be set/get by means of <CODE>PR_GET/SET_PDEATHSIG</CODE>
commands of <B>prctl(2)</B> system call (You might argue that the way the
value of <CODE>pdeath_signal</CODE> is returned via userspace pointer argument
in <B>prctl(2)</B> is a bit silly - mea culpa, after Andries Brouwer
updated the manpage it was too late to fix ;)</LI>
</OL>
<P>Thus tasks are created. There are several ways for tasks to terminate:
<P>
<OL>
<LI> by making <B>exit(2)</B> system call;
</LI>
<LI> by being delivered a signal with default disposition to die;
</LI>
<LI> by being forced to die under certain exceptions;
</LI>
<LI> by calling <B>bdflush(2)</B> with <CODE>func == 1</CODE> (this is Linux-specific, for
compatibility with old distributions that still had the 'update'
line in <CODE>/etc/inittab</CODE> - nowadays the work of update is done by
kernel thread <CODE>kupdate</CODE>).</LI>
</OL>
<P>Functions implementing system calls under Linux are prefixed with <CODE>sys_</CODE>,
but they are usually concerned only with argument checking or arch-specific
ways to pass some information and the actual work is done by <CODE>do_</CODE> functions.
So it is with <CODE>sys_exit()</CODE>, which calls <CODE>do_exit()</CODE> to do the work. However,
other parts of the kernel sometimes invoke <CODE>sys_exit()</CODE> when they should really
call <CODE>do_exit()</CODE>.
<P>The function <CODE>do_exit()</CODE> is found in <CODE>kernel/exit.c</CODE>. The points to note about
<CODE>do_exit()</CODE>:
<P>
<UL>
<LI> Uses global kernel lock (locks but doesn't unlock).
</LI>
<LI> Calls <CODE>schedule()</CODE> at the end, which never returns.
</LI>
<LI> Sets the task state to <CODE>TASK_ZOMBIE</CODE>.
</LI>
<LI> Notifies any child with <CODE>current->pdeath_signal</CODE>, if not 0.
</LI>
<LI> Notifies the parent with a <CODE>current->exit_signal</CODE>, which is usually
equal to <CODE>SIGCHLD</CODE>.
</LI>
<LI> Releases resources allocated by fork, closes open files etc.
</LI>
<LI> On architectures that use lazy FPU switching (ia64, mips, mips64,
sparc, sparc64), do whatever the hardware requires to pass the FPU
ownership (if owned by current) to "none".</LI>
</UL>
<P>
<H2><A NAME="ss2.3">2.3 Linux Scheduler</A>
</H2>
<P>
<P>The job of a scheduler is to arbitrate access to the current CPU between
multiple processes. The scheduler is implemented in the 'main kernel file'
<CODE>kernel/sched.c</CODE>. The corresponding header file <CODE>include/linux/sched.h</CODE> is
included (either explicitly or indirectly) by virtually every kernel source
file.
<P>The fields of task structure relevant to scheduler include:
<P>
<UL>
<LI> <CODE>p->need_resched</CODE>: this field is set if <CODE>schedule()</CODE> should be invoked at
the 'next opportunity'.
</LI>
<LI> <CODE>p->counter</CODE>: number of clock ticks left to run in this scheduling
slice, decremented by a timer. When this field becomes lower than or equal to zero, it is reset
to 0 and <CODE>p->need_resched</CODE> is set. This is also sometimes called 'dynamic
priority' of a process because it can change by itself.
</LI>
<LI> <CODE>p->priority</CODE>: the process' static priority, only changed through well-known
system calls like <B>nice(2)</B>, POSIX.1b <B>sched_setparam(2)</B> or 4.4BSD/SVR4
<B>setpriority(2)</B>.
</LI>
<LI> <CODE>p->rt_priority</CODE>: realtime priority
</LI>
<LI> <CODE>p->policy</CODE>: the scheduling policy, specifies which scheduling class the
task belongs to. Tasks can change their scheduling class using the
<B>sched_setscheduler(2)</B> system call. The valid values are <CODE>SCHED_OTHER</CODE>
(traditional UNIX process), <CODE>SCHED_FIFO</CODE> (POSIX.1b FIFO realtime
process) and <CODE>SCHED_RR</CODE> (POSIX round-robin realtime process). One can
also OR <CODE>SCHED_YIELD</CODE> to any of these values to signify that the process
decided to yield the CPU, for example by calling <B>sched_yield(2)</B> system
call. A FIFO realtime process will run until either a) it blocks on I/O,
b) it explicitly yields the CPU or c) it is preempted by another realtime
process with a higher <CODE>p->rt_priority</CODE> value. <CODE>SCHED_RR</CODE> is the same as
<CODE>SCHED_FIFO</CODE>, except that when its timeslice expires it goes back to
the end of the runqueue.</LI>
</UL>
<P>The scheduler's algorithm is simple, despite the great apparent complexity
of the <CODE>schedule()</CODE> function. The function is complex because it implements
three scheduling algorithms in one and also because of the subtle
SMP-specifics.
<P>The apparently 'useless' gotos in <CODE>schedule()</CODE> are there for a purpose - to
generate the best optimised (for i386) code. Also, note that scheduler
(like most of the kernel) was completely rewritten for 2.4, therefore the
discussion below does not apply to 2.2 or earlier kernels.
<P>Let us look at the function in detail:
<P>
<OL>
<LI> If <CODE>current->active_mm == NULL</CODE> then something is wrong. The current
process, even a kernel thread (<CODE>current->mm == NULL</CODE>), must have a valid
<CODE>p->active_mm</CODE> at all times.
</LI>
<LI> If there is something to do on the <CODE>tq_scheduler</CODE> task queue, process it
now. Task queues provide a kernel mechanism to schedule execution of
functions at a later time. We shall look at them in detail elsewhere.
</LI>
<LI> Initialise local variables <CODE>prev</CODE> and <CODE>this_cpu</CODE> to current task and
current CPU respectively.
</LI>
<LI> Check if <CODE>schedule()</CODE> was invoked from interrupt handler (due to a bug)
and panic if so.
</LI>
<LI> Release the global kernel lock.
</LI>
<LI> If there is some work to do via softirq mechanism, do it now.
</LI>
<LI> Initialise local pointer <CODE>struct schedule_data *sched_data</CODE> to point
to per-CPU (cacheline-aligned to prevent cacheline ping-pong)
scheduling data area, which contains the TSC value of <CODE>last_schedule</CODE> and the
pointer to last scheduled task structure (TODO: <CODE>sched_data</CODE> is used on
SMP only but why does <CODE>init_idle()</CODE> initialise it on UP as well?).
</LI>
<LI> <CODE>runqueue_lock</CODE> spinlock is taken. Note that we use <CODE>spin_lock_irq()</CODE>
because in <CODE>schedule()</CODE> we guarantee that interrupts are enabled. Therefore,
when we unlock <CODE>runqueue_lock</CODE>, we can just re-enable them instead of
saving/restoring eflags (<CODE>spin_lock_irqsave/restore</CODE> variant).
</LI>
<LI> task state machine: if the task is in <CODE>TASK_RUNNING</CODE> state, it is left
alone; if it is in <CODE>TASK_INTERRUPTIBLE</CODE> state and a signal is pending,
it is moved into <CODE>TASK_RUNNING</CODE> state. In all other cases, it is deleted
from the runqueue.
</LI>
<LI> <CODE>next</CODE> (best candidate to be scheduled) is set to the idle task of
this cpu. However, the goodness of this candidate is set to a very
low value (-1000), in the hope that there is someone better than that.
</LI>
<LI> If the <CODE>prev</CODE> (current) task is in <CODE>TASK_RUNNING</CODE> state, then the
current goodness is set to its goodness and it is marked as a better
candidate to be scheduled than the idle task.
</LI>
<LI> Now the runqueue is examined and the goodness of each process that can
be scheduled on this cpu is compared with the current value; the
process with the highest goodness wins. Now the concept of "can be
scheduled on this cpu" must be clarified: on UP, every process on
the runqueue is eligible to be scheduled; on SMP, only a process not
already running on another cpu is eligible to be scheduled on this
cpu. The goodness is calculated by a function called <CODE>goodness()</CODE>, which
treats realtime processes by making their goodness very high
(<CODE>1000 + p->rt_priority</CODE>); being greater than 1000, this guarantees that
no <CODE>SCHED_OTHER</CODE> process can win, so they only contend with other
realtime processes that may have a greater <CODE>p->rt_priority</CODE>. The
goodness function returns 0 if the process' time slice (<CODE>p->counter</CODE>)
is over. For non-realtime processes, the initial value of goodness is
set to <CODE>p->counter</CODE> - this way, the process is less likely to get CPU if
it already had it for a while, i.e. interactive processes are favoured
more than CPU bound number crunchers. The arch-specific constant
<CODE>PROC_CHANGE_PENALTY</CODE> attempts to implement "cpu affinity" (i.e. give
advantage to a process on the same CPU). It also gives a slight
advantage to processes with mm pointing to current <CODE>active_mm</CODE> or to
processes with no (user) address space, i.e. kernel threads.
</LI>
<LI> if the current value of goodness is 0 then the entire list of
processes (not just the ones on the runqueue!) is examined and their dynamic
priorities are recalculated using a simple algorithm:
<BLOCKQUOTE><CODE>
<HR>
<PRE>
recalculate:
{
struct task_struct *p;
spin_unlock_irq(&amp;runqueue_lock);
read_lock(&amp;tasklist_lock);
for_each_task(p)
p->counter = (p->counter >> 1) + p->priority;
read_unlock(&amp;tasklist_lock);
spin_lock_irq(&amp;runqueue_lock);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
Note that we drop the <CODE>runqueue_lock</CODE> before we recalculate. The
reason is that we go through the entire set of processes; this can take
a long time, during which <CODE>schedule()</CODE> could be called on another CPU and
select a process with goodness good enough for that CPU, whilst we on
this CPU were forced to recalculate. Ok, admittedly this is somewhat
inconsistent because while we (on this CPU) are selecting a process with
the best goodness, <CODE>schedule()</CODE> running on another CPU could be
recalculating dynamic priorities.
</LI>
<LI> From this point on it is certain that <CODE>next</CODE> points to the task to
be scheduled, so we initialise <CODE>next->has_cpu</CODE> to 1 and <CODE>next->processor</CODE>
to <CODE>this_cpu</CODE>. The <CODE>runqueue_lock</CODE> can now be unlocked.
</LI>
<LI> If we are switching back to the same task (<CODE>next == prev</CODE>) then we can
simply reacquire the global kernel lock and return, i.e. skip all the
hardware-level (registers, stack etc.) and VM-related (switch page
directory, recalculate <CODE>active_mm</CODE> etc.) stuff.
</LI>
<LI> The macro <CODE>switch_to()</CODE> is architecture specific. On i386, it is
concerned with a) FPU handling, b) LDT handling, c) reloading segment
registers, d) TSS handling and e) reloading debug registers.</LI>
</OL>
<P>
<H2><A NAME="ss2.4">2.4 Linux linked list implementation</A>
</H2>
<P>
<P>Before we go on to examine implementation of wait queues, we must
acquaint ourselves with the Linux standard doubly-linked list implementation.
Wait queues (as well as everything else in Linux) make heavy use
of them and they are called in jargon "list.h implementation" because the
most relevant file is <CODE>include/linux/list.h</CODE>.
<P>The fundamental data structure here is <CODE>struct list_head</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
struct list_head {
struct list_head *next, *prev;
};
#define LIST_HEAD_INIT(name) { &amp;(name), &amp;(name) }
#define LIST_HEAD(name) \
struct list_head name = LIST_HEAD_INIT(name)
#define INIT_LIST_HEAD(ptr) do { \
(ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)
#define list_entry(ptr, type, member) \
((type *)((char *)(ptr)-(unsigned long)(&amp;((type *)0)->member)))
#define list_for_each(pos, head) \
for (pos = (head)->next; pos != (head); pos = pos->next)
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The first three macros are for initialising an empty list by pointing both
<CODE>next</CODE> and <CODE>prev</CODE> pointers to itself. It is obvious from C syntactical
restrictions which ones should be used where - for example, <CODE>LIST_HEAD_INIT()</CODE>
can be used to initialise a structure element in its declaration, <CODE>LIST_HEAD()</CODE>
can be used for static variable initialising declarations and <CODE>INIT_LIST_HEAD()</CODE>
can be used inside a function.
<P>The macro <CODE>list_entry()</CODE> gives access to an individual list element, for example
(from <CODE>fs/file_table.c:fs_may_remount_ro()</CODE>):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
struct super_block {
...
struct list_head s_files;
...
} *sb = &amp;some_super_block;
struct file {
...
struct list_head f_list;
...
} *file;
struct list_head *p;
for (p = sb->s_files.next; p != &amp;sb->s_files; p = p->next) {
struct file *file = list_entry(p, struct file, f_list);
do something to 'file'
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>A good example of the use of <CODE>list_for_each()</CODE> macro is in the scheduler where
we walk the runqueue looking for the process with highest goodness:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static LIST_HEAD(runqueue_head);
struct list_head *tmp;
struct task_struct *p;
list_for_each(tmp, &amp;runqueue_head) {
p = list_entry(tmp, struct task_struct, run_list);
if (can_schedule(p)) {
int weight = goodness(p, this_cpu, prev->active_mm);
if (weight > c)
c = weight, next = p;
}
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>Here, <CODE>p->run_list</CODE> is declared as <CODE>struct list_head run_list</CODE> inside
<CODE>task_struct</CODE> structure and serves as anchor to the list. Removing an element
from the list and adding (to head or tail of the list) is done by
<CODE>list_del()/list_add()/list_add_tail()</CODE> macros. The examples below are adding
and removing a task from runqueue:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static inline void del_from_runqueue(struct task_struct * p)
{
nr_running--;
list_del(&amp;p->run_list);
p->run_list.next = NULL;
}
static inline void add_to_runqueue(struct task_struct * p)
{
list_add(&amp;p->run_list, &amp;runqueue_head);
nr_running++;
}
static inline void move_last_runqueue(struct task_struct * p)
{
list_del(&amp;p->run_list);
list_add_tail(&amp;p->run_list, &amp;runqueue_head);
}
static inline void move_first_runqueue(struct task_struct * p)
{
list_del(&amp;p->run_list);
list_add(&amp;p->run_list, &amp;runqueue_head);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>
<H2><A NAME="ss2.5">2.5 Wait Queues</A>
</H2>
<P>
<P>When a process requests the kernel to do something which is currently
impossible but that may become possible later, the process is put to sleep
and is woken up when the request is more likely to be satisfied. One of the
kernel mechanisms used for this is called a 'wait queue'.
<P>The Linux implementation allows wake-one semantics using the <CODE>TASK_EXCLUSIVE</CODE> flag.
With waitqueues, you can either use a well-known queue and then simply
<CODE>sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout</CODE>,
or you can define your own waitqueue and use <CODE>add/remove_wait_queue</CODE> to add and
remove yourself from it and <CODE>wake_up/wake_up_interruptible</CODE> to wake up
when needed.
<P>An example of the first usage of waitqueues is interaction between the page
allocator (in <CODE>mm/page_alloc.c:__alloc_pages()</CODE>) and the <CODE>kswapd</CODE> kernel daemon (in
<CODE>mm/vmscan.c:kswapd()</CODE>), by means of the wait queue <CODE>kswapd_wait</CODE>, declared in
<CODE>mm/vmscan.c</CODE>; the <CODE>kswapd</CODE> daemon sleeps on this queue, and it is woken up
whenever the page allocator needs to free up some pages.
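<P>As a sketch of the first style, consider a driver-private event flag; the names
<CODE>my_event_wait</CODE>, <CODE>wait_for_event()</CODE> and <CODE>post_event()</CODE> are invented, and note that
the <CODE>sleep_on</CODE> family is racy if the wakeup can arrive between testing the
condition and actually going to sleep (which is why the rtc example below uses
the explicit <CODE>add_wait_queue()</CODE> style):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;linux/sched.h>        /* interruptible_sleep_on() */
#include &lt;linux/wait.h>         /* DECLARE_WAIT_QUEUE_HEAD() */

static DECLARE_WAIT_QUEUE_HEAD(my_event_wait);
static int my_event_ready;

/* Consumer (process context): sleep until the event is posted. */
static void wait_for_event(void)
{
        while (!my_event_ready)
                interruptible_sleep_on(&amp;my_event_wait);
}

/* Producer (e.g. interrupt handler): publish the event, wake the sleepers. */
static void post_event(void)
{
        my_event_ready = 1;
        wake_up_interruptible(&amp;my_event_wait);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>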
<P>An example of autonomous waitqueue usage is the interaction between a
user process requesting data via the <B>read(2)</B> system call and the kernel running in
interrupt context to supply the data. An interrupt handler might look
like this (simplified from <CODE>drivers/char/rtc.c:rtc_interrupt()</CODE>):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);
void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
spin_lock(&amp;rtc_lock);
rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS);
spin_unlock(&amp;rtc_lock);
wake_up_interruptible(&amp;rtc_wait);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>So, the interrupt handler obtains the data by reading from some
device-specific I/O port (the <CODE>CMOS_READ()</CODE> macro turns into a couple of <CODE>outb/inb</CODE>) and
then wakes up whoever is sleeping on the <CODE>rtc_wait</CODE> wait queue.
<P>Now, the <B>read(2)</B> system call could be implemented as:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
ssize_t rtc_read(struct file *file, char *buf, size_t count, loff_t *ppos)
{
DECLARE_WAITQUEUE(wait, current);
unsigned long data;
ssize_t retval;
add_wait_queue(&amp;rtc_wait, &amp;wait);
current->state = TASK_INTERRUPTIBLE;
do {
spin_lock_irq(&amp;rtc_lock);
data = rtc_irq_data;
rtc_irq_data = 0;
spin_unlock_irq(&amp;rtc_lock);
if (data != 0)
break;
if (file->f_flags &amp; O_NONBLOCK) {
retval = -EAGAIN;
goto out;
}
if (signal_pending(current)) {
retval = -ERESTARTSYS;
goto out;
}
schedule();
} while(1);
retval = put_user(data, (unsigned long *)buf);
if (!retval)
retval = sizeof(unsigned long);
out:
current->state = TASK_RUNNING;
remove_wait_queue(&amp;rtc_wait, &amp;wait);
return retval;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>What happens in <CODE>rtc_read()</CODE> is this:
<P>
<OL>
<LI> We declare a wait queue element pointing to current process context.
</LI>
<LI> We add this element to the <CODE>rtc_wait</CODE> waitqueue.
</LI>
<LI> We mark the current context as <CODE>TASK_INTERRUPTIBLE</CODE>, which means that the
next time <CODE>schedule()</CODE> runs it will remove us from the runqueue (unless we have
already been woken up or a signal is pending).
</LI>
<LI> We check whether data is available; if it is, we break out,
copy the data to the user buffer, mark ourselves as <CODE>TASK_RUNNING</CODE>, remove
ourselves from the wait queue and return
</LI>
<LI> If there is no data yet, we check whether the user specified non-blocking I/O
and if so we fail with <CODE>EAGAIN</CODE> (which is the same as <CODE>EWOULDBLOCK</CODE>)
</LI>
<LI> We also check if a signal is pending and if so inform the "higher
layers" to restart the system call if necessary. By "if necessary"
I meant the details of signal disposition as specified in <B>sigaction(2)</B>
system call.
</LI>
<LI> Then we "switch out", i.e. fall asleep, until woken up by the
interrupt handler. If we didn't mark ourselves as <CODE>TASK_INTERRUPTIBLE</CODE>
then the scheduler could run us again before the data is
available, thus causing unneeded processing.</LI>
</OL>
<P>It is also worth pointing out that, using wait queues, it is rather easy to
implement the <B>poll(2)</B> system call:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static unsigned int rtc_poll(struct file *file, poll_table *wait)
{
unsigned long l;
poll_wait(file, &amp;rtc_wait, wait);
spin_lock_irq(&amp;rtc_lock);
l = rtc_irq_data;
spin_unlock_irq(&amp;rtc_lock);
if (l != 0)
return POLLIN | POLLRDNORM;
return 0;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>All the work is done by the device-independent function <CODE>poll_wait()</CODE> which does
the necessary waitqueue manipulations; all we need to do is point it to the
waitqueue which is woken up by our device-specific interrupt handler.
<P>
<H2><A NAME="ss2.6">2.6 Kernel Timers</A>
</H2>
<P>
<P>Now let us turn our attention to kernel timers. Kernel timers are used to
dispatch execution of a particular function (called 'timer handler') at a
specified time in the future. The main data structure is <CODE>struct timer_list</CODE>
declared in <CODE>include/linux/timer.h</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
struct timer_list {
struct list_head list;
unsigned long expires;
unsigned long data;
void (*function)(unsigned long);
volatile int running;
};
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The <CODE>list</CODE> field is for linking into the internal list, protected by the
<CODE>timerlist_lock</CODE> spinlock. The <CODE>expires</CODE> field is the value of <CODE>jiffies</CODE> when
the <CODE>function</CODE> handler should be invoked with <CODE>data</CODE> passed as a parameter.
The <CODE>running</CODE> field is used on SMP to test if the timer handler is currently
running on another CPU.
<P>The functions <CODE>add_timer()</CODE> and <CODE>del_timer()</CODE> add a given timer to, and remove
it from, the list. When a timer expires, it is removed automatically. Before a timer is
used, it MUST be initialised by means of <CODE>init_timer()</CODE> function. And before it
is added, the fields <CODE>function</CODE> and <CODE>expires</CODE> must be set.
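<P>A minimal sketch of the typical usage pattern; the timer, its handler and the
one second delay are invented for the example:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;linux/kernel.h>
#include &lt;linux/sched.h>        /* jiffies, HZ */
#include &lt;linux/timer.h>

static struct timer_list my_timer;

/* Timer handler: runs in interrupt (bottom half) context, must not sleep. */
static void my_timer_fn(unsigned long data)
{
        printk(KERN_INFO "my_timer fired, data=%lu\n", data);
}

/* Arming the timer, e.g. from a driver's open() method. */
static void arm_my_timer(void)
{
        init_timer(&amp;my_timer);                 /* MUST be done before first use */
        my_timer.function = my_timer_fn;
        my_timer.data     = 42;
        my_timer.expires  = jiffies + HZ;      /* roughly one second from now */
        add_timer(&amp;my_timer);
}

/* On cleanup, make sure the timer is no longer pending (or running, on SMP):
 *      del_timer_sync(&amp;my_timer);
 */
</PRE>
<HR>
</CODE></BLOCKQUOTE>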
<P>
<H2><A NAME="ss2.7">2.7 Bottom Halves</A>
</H2>
<P>
<P>Sometimes it is reasonable to split the amount of work to be performed inside
an interrupt handler into immediate work (e.g. acknowledging the interrupt,
updating the stats etc.) and work which can be postponed until later, when
interrupts are enabled (e.g. to do some postprocessing on data, wake up
processes waiting for this data, etc).
<P>Bottom halves are the oldest mechanism for deferred execution of kernel
tasks and have been available since Linux 1.x. In Linux 2.0, a new mechanism
was added, called 'task queues', which will be the subject of the next section.
<P>Bottom halves are serialised by the <CODE>global_bh_lock</CODE> spinlock, i.e.
there can only be one bottom half running on any CPU at a time. However,
when attempting to execute the handler, if <CODE>global_bh_lock</CODE> is not available,
the bottom half is marked (i.e. scheduled) for execution - so processing can
continue, as opposed to a busy loop on <CODE>global_bh_lock</CODE>.
<P>There can only be 32 bottom halves registered in total.
The functions required to manipulate bottom halves are as follows (all
exported to modules):
<P>
<UL>
<LI> <CODE>void init_bh(int nr, void (*routine)(void))</CODE>: installs a bottom half
handler pointed to by <CODE>routine</CODE> argument into slot <CODE>nr</CODE>. The slot
ought to be enumerated in <CODE>include/linux/interrupt.h</CODE> in the form
<CODE>XXXX_BH</CODE>, e.g. <CODE>TIMER_BH</CODE> or <CODE>TQUEUE_BH</CODE>. Typically, a subsystem's
initialisation routine (<CODE>init_module()</CODE> for modules) installs the
required bottom half using this function.
</LI>
<LI> <CODE>void remove_bh(int nr)</CODE>: does the opposite of <CODE>init_bh()</CODE>, i.e.
de-installs bottom half installed at slot <CODE>nr</CODE>. There is no error
checking performed there, so, for example <CODE>remove_bh(32)</CODE> will
panic/oops the system. Typically, a subsystem's cleanup routine
(<CODE>cleanup_module()</CODE> for modules) uses this function to free up the slot
that can later be reused by some other subsystem. (TODO: wouldn't it
be nice to have <CODE>/proc/bottom_halves</CODE> list all registered bottom
halves on the system? That means <CODE>global_bh_lock</CODE> must be made
read/write, obviously)
</LI>
<LI> <CODE>void mark_bh(int nr)</CODE>: marks bottom half in slot <CODE>nr</CODE> for execution. Typically,
an interrupt handler will mark its bottom half (hence the name!) for
execution at a "safer time".
</LI>
</UL>
<P>Bottom halves are globally locked tasklets, so the question "when are bottom
half handlers executed?" is really "when are tasklets executed?". And the
answer is, in two places: a) on each <CODE>schedule()</CODE> and b) on each
interrupt/syscall return path in <CODE>entry.S</CODE> (TODO: therefore, the <CODE>schedule()</CODE>
case is really boring - it is like adding yet another very very slow interrupt,
why not get rid of <CODE>handle_softirq</CODE> label from <CODE>schedule()</CODE> altogether?).
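<P>A sketch of how a subsystem would use this interface, assuming a hypothetical
<CODE>MYDEV_BH</CODE> slot had been added to the enumeration in <CODE>include/linux/interrupt.h</CODE>
(the <CODE>mydev_*</CODE> names are invented):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;linux/init.h>
#include &lt;linux/interrupt.h>    /* init_bh(), mark_bh(), remove_bh() */

#define MYDEV_BH 15     /* hypothetical slot; real code adds it to the enum in interrupt.h */

/* Runs later with interrupts enabled, globally serialised with all other BHs. */
static void mydev_bh(void)
{
        /* postprocess the data collected by the interrupt handler, wake readers ... */
}

static int __init mydev_init(void)
{
        init_bh(MYDEV_BH, mydev_bh);    /* install the handler into the slot */
        return 0;
}

static void mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        /* ... urgent part: acknowledge the hardware, grab the data ... */
        mark_bh(MYDEV_BH);              /* defer the slow part to the bottom half */
}

static void __exit mydev_cleanup(void)
{
        remove_bh(MYDEV_BH);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>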
<P>
<P>
<H2><A NAME="ss2.8">2.8 Task Queues</A>
</H2>
<P>
<P>Task queues can be thought of as a dynamic extension to old bottom halves. In
fact, in the source code they are sometimes referred to as "new" bottom
halves. More specifically, the old bottom halves discussed in previous
section have these limitations:
<P>
<OL>
<LI> There are only a fixed number (32) of them.
</LI>
<LI> Each bottom half can only be associated with one handler function.
</LI>
<LI> Bottom halves are consumed with a spinlock held so they cannot block.</LI>
</OL>
<P>So, with task queues, an arbitrary number of functions can be chained and
processed one after another at a later time. One creates a new task queue
using the <CODE>DECLARE_TASK_QUEUE()</CODE> macro and queues a task onto it using
the <CODE>queue_task()</CODE> function. The task queue then can be processed using
<CODE>run_task_queue()</CODE>. Instead of creating your own task queue (and
having to consume it manually) you can use one of Linux' predefined
task queues which are consumed at well-known points:
<P>
<OL>
<LI> <B>tq_timer</B>: the timer task queue, run on each timer interrupt
and when releasing a tty device (closing or releasing a half-opened
terminal device). Since the timer handler runs in interrupt context,
the <CODE>tq_timer</CODE> tasks also run in interrupt context and thus cannot block.
</LI>
<LI> <B>tq_scheduler</B>: the scheduler task queue, consumed by the scheduler (and also
when closing tty devices, like <CODE>tq_timer</CODE>). Since the scheduler executes
in the context of the process being re-scheduled, the <CODE>tq_scheduler</CODE>
tasks can do anything they like, i.e. block, use process context data
(but why would they want to), etc.
</LI>
<LI> <B>tq_immediate</B>: this is really a bottom half <CODE>IMMEDIATE_BH</CODE>, so
drivers can <CODE>queue_task(task, &amp;tq_immediate)</CODE> and then
<CODE>mark_bh(IMMEDIATE_BH)</CODE> to be consumed in interrupt context.
</LI>
<LI> <B>tq_disk</B>: used by low level block device access (and RAID) to start
the actual requests. This task queue is exported to modules but shouldn't
be used except for the special purposes which it was designed for.</LI>
</OL>
<P>Unless a driver uses its own task queues, it does not need to call
<CODE>run_task_queue()</CODE> to process the queue, except under circumstances explained
below.
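<P>A sketch of the common <CODE>tq_immediate</CODE> pattern from a driver's interrupt handler
(the <CODE>mydev_*</CODE> names are invented):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;linux/tqueue.h>       /* struct tq_struct, queue_task(), tq_immediate */
#include &lt;linux/interrupt.h>    /* mark_bh(), IMMEDIATE_BH */

/* Deferred routine: runs later in bottom half context, so it must not block. */
static void mydev_deferred(void *data)
{
        /* process whatever the interrupt handler captured ... */
}

static struct tq_struct mydev_task = {
        routine: mydev_deferred,        /* gcc named-field initialiser, 2.4 style */
        data:    NULL,
};

static void mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        /* ... urgent part: acknowledge the device ... */
        queue_task(&amp;mydev_task, &amp;tq_immediate); /* chain the deferred work (once) */
        mark_bh(IMMEDIATE_BH);                  /* make sure IMMEDIATE_BH runs soon */
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>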
<P>The reason <CODE>tq_timer/tq_scheduler</CODE> task queues are consumed not only in the
usual places but elsewhere (closing tty device is but one example) becomes
clear if one remembers that the driver can schedule tasks on the queue, and these tasks
only make sense while a particular instance of the device is still valid
- which usually means until the application closes it. So, the driver may
need to call <CODE>run_task_queue()</CODE> to flush the tasks it (and anyone else) has
put on the queue, because allowing them to run at a later time may make no
sense - i.e. the relevant data structures may have been freed/reused by a
different instance. This is the reason you see <CODE>run_task_queue()</CODE> on <CODE>tq_timer</CODE>
and <CODE>tq_scheduler</CODE> in places other than timer interrupt and <CODE>schedule()</CODE>
respectively.
<P>
<H2><A NAME="ss2.9">2.9 Tasklets</A>
</H2>
<P>
<P>Not yet, will be in future revision.
<P>
<H2><A NAME="ss2.10">2.10 Softirqs</A>
</H2>
<P>
<P>Not yet, will be in future revision.
<P>
<H2><A NAME="ss2.11">2.11 How System Calls Are Implemented on i386 Architecture?</A>
</H2>
<P>
<P>There are two mechanisms under Linux for implementing system calls:
<P>
<UL>
<LI> lcall7/lcall27 call gates;</LI>
<LI> int 0x80 software interrupt.</LI>
</UL>
<P>Native Linux programs use int 0x80 whilst binaries from foreign flavours
of UNIX (Solaris, UnixWare 7 etc.) use the lcall7 mechanism. The name 'lcall7' is
historically misleading because it also covers lcall27 (e.g. Solaris/x86), but
the handler function is called lcall7_func.
<P>When the system boots, the function <CODE>arch/i386/kernel/traps.c:trap_init()</CODE> is
called which sets up the IDT so that vector 0x80 (of type 15, dpl 3) points to
the address of system_call entry from <CODE>arch/i386/kernel/entry.S</CODE>.
<P>When a userspace application makes a system call, the arguments are passed via registers
and the application executes 'int 0x80' instruction. This causes a trap into
kernel mode and processor jumps to system_call entry point in <CODE>entry.S</CODE>.
What this does is:
<P>
<OL>
<LI> Save registers.
</LI>
<LI> Set %ds and %es to KERNEL_DS, so that all data (and extra segment)
references are made in kernel address space.
</LI>
<LI> If the value of %eax is greater than or equal to <CODE>NR_syscalls</CODE> (currently 256),
fail with <CODE>ENOSYS</CODE> error.
</LI>
<LI> If the task is being ptraced (<CODE>tsk->ptrace &amp; PF_TRACESYS</CODE>), do special
processing. This is to support programs like strace (analogue of
SVR4 <B>truss(1)</B>) or debuggers.
</LI>
<LI> Call <CODE>sys_call_table+4*(syscall_number from %eax)</CODE>. This table is
initialised in the same file (<CODE>arch/i386/kernel/entry.S</CODE>) to point to
individual system call handlers which under Linux are (usually)
prefixed with <CODE>sys_</CODE>, e.g. <CODE>sys_open</CODE>, <CODE>sys_exit</CODE>, etc. These C system
call handlers will find their arguments on the stack where <CODE>SAVE_ALL</CODE>
stored them.
</LI>
<LI> Enter 'system call return path'. This is a separate label because it
is used not only by int 0x80 but also by lcall7, lcall27. This is
concerned with handling tasklets (including bottom halves), checking
if a <CODE>schedule()</CODE> is needed (<CODE>tsk->need_resched != 0</CODE>), checking if there
are signals pending and if so handling them.</LI>
</OL>
<P>Linux supports up to 6 arguments for system calls. They are passed in
%ebx, %ecx, %edx, %esi, %edi (and %ebp used temporarily, see <CODE>_syscall6()</CODE> in
<CODE>asm-i386/unistd.h</CODE>). The system call number is passed via %eax.
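<P>For illustration, a userspace sketch that invokes <B>getpid(2)</B> directly through
int 0x80 (i386 only; <CODE>my_getpid()</CODE> is just an example name):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;asm/unistd.h>         /* __NR_getpid */

static inline long my_getpid(void)
{
        long res;

        /* syscall number in %eax, result comes back in %eax;
         * arguments (none here) would go in %ebx, %ecx, %edx, %esi, %edi */
        __asm__ volatile ("int $0x80"
                          : "=a" (res)
                          : "0" (__NR_getpid));
        return res;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>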
<P>
<H2><A NAME="ss2.12">2.12 Atomic Operations</A>
</H2>
<P>
<P>There are two types of atomic operations: bitmaps and <CODE>atomic_t</CODE>. Bitmaps are
very convenient for maintaining a concept of "allocated" or "free" units
from some large collection where each unit is identified by some number, for
example free inodes or free blocks. They are also widely used for simple
locking, for example to provide exclusive access to open a device. An example
of this can be found in <CODE>arch/i386/kernel/microcode.c</CODE>:
<P>
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
/*
* Bits in microcode_status. (31 bits of room for future expansion)
*/
#define MICROCODE_IS_OPEN 0 /* set if device is in use */
static unsigned long microcode_status;
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>There is no need to initialise <CODE>microcode_status</CODE> to 0 as BSS is zero-cleared
under Linux explicitly.
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
/*
* We enforce only one user at a time here with open/close.
*/
static int microcode_open(struct inode *inode, struct file *file)
{
if (!capable(CAP_SYS_RAWIO))
return -EPERM;
/* one at a time, please */
if (test_and_set_bit(MICROCODE_IS_OPEN, &amp;microcode_status))
return -EBUSY;
MOD_INC_USE_COUNT;
return 0;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
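<P>The matching release method simply clears the bit again so that the next
<B>open(2)</B> can succeed; a sketch of what it looks like:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static int microcode_release(struct inode *inode, struct file *file)
{
        clear_bit(MICROCODE_IS_OPEN, &amp;microcode_status);  /* let the next open(2) in */
        MOD_DEC_USE_COUNT;
        return 0;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>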
<P>The operations on bitmaps are:
<P>
<UL>
<LI> <B>void set_bit(int nr, volatile void *addr)</B>: set bit <CODE>nr</CODE>
in the bitmap pointed to by <CODE>addr</CODE>.
</LI>
<LI> <B>void clear_bit(int nr, volatile void *addr)</B>: clear bit
<CODE>nr</CODE> in the bitmap pointed to by <CODE>addr</CODE>.
</LI>
<LI> <B>void change_bit(int nr, volatile void *addr)</B>: toggle bit
<CODE>nr</CODE> (if set clear, if clear set) in the bitmap pointed to by <CODE>addr</CODE>.
</LI>
<LI> <B>int test_and_set_bit(int nr, volatile void *addr)</B>:
atomically set bit <CODE>nr</CODE> and return the old bit value.
</LI>
<LI> <B>int test_and_clear_bit(int nr, volatile void *addr)</B>:
atomically clear bit <CODE>nr</CODE> and return the old bit value.
</LI>
<LI> <B>int test_and_change_bit(int nr, volatile void *addr)</B>:
atomically toggle bit <CODE>nr</CODE> and return the old bit value.</LI>
</UL>
<P>These operations use the <CODE>LOCK_PREFIX</CODE> macro, which on SMP kernels evaluates to
the bus lock instruction prefix and to nothing on UP. This guarantees atomicity
of access in an SMP environment.
<P>Sometimes bit manipulations are not convenient, but instead we need to perform
arithmetic operations - add, subtract, increment, decrement. The typical cases
are reference counts (e.g. for inodes). This facility is provided by the
<CODE>atomic_t</CODE> data type and the following operations:
<P>
<UL>
<LI> <B>atomic_read(&amp;v)</B>: read the value of <CODE>atomic_t</CODE> variable <CODE>v</CODE>.
</LI>
<LI> <B>atomic_set(&amp;v, i)</B>: set the value of <CODE>atomic_t</CODE> variable
<CODE>v</CODE> to integer <CODE>i</CODE>.
</LI>
<LI> <B>void atomic_add(int i, volatile atomic_t *v)</B>: add integer
<CODE>i</CODE> to the value of atomic variable pointed to by <CODE>v</CODE>.
</LI>
<LI> <B>void atomic_sub(int i, volatile atomic_t *v)</B>: subtract
integer <CODE>i</CODE> from the value of atomic variable pointed to by <CODE>v</CODE>.
</LI>
<LI> <B>int atomic_sub_and_test(int i, volatile atomic_t *v)</B>:
subtract integer <CODE>i</CODE> from the value of atomic variable pointed to by
<CODE>v</CODE>; return 1 if the new value is 0, return 0 otherwise.
</LI>
<LI> <B>void atomic_inc(volatile atomic_t *v)</B>: increment the value
by 1.
</LI>
<LI> <B>void atomic_dec(volatile atomic_t *v)</B>: decrement the value
by 1.
</LI>
<LI> <B>int atomic_dec_and_test(volatile atomic_t *v)</B>: decrement
the value; return 1 if the new value is 0, return 0 otherwise.
</LI>
<LI> <B>int atomic_inc_and_test(volatile atomic_t *v)</B>: increment
the value; return 1 if the new value is 0, return 0 otherwise.
</LI>
<LI> <B>int atomic_add_negative(int i, volatile atomic_t *v)</B>: add
the value of <CODE>i</CODE> to <CODE>v</CODE> and return 1 if the result is negative. Return
0 if the result is greater than or equal to 0. This operation is used
for implementing semaphores.</LI>
</UL>
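<P>A typical (sketched) use of <CODE>atomic_t</CODE> is a reference count on a shared object;
<CODE>struct my_object</CODE> and its helpers are invented for the example:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;asm/atomic.h>
#include &lt;linux/slab.h>         /* kmalloc()/kfree() */

struct my_object {
        atomic_t refcount;
        /* ... payload ... */
};

/* At creation time: atomic_set(&amp;obj->refcount, 1); the creator holds one reference. */

static void my_object_get(struct my_object *obj)
{
        atomic_inc(&amp;obj->refcount);
}

static void my_object_put(struct my_object *obj)
{
        /* atomic_dec_and_test() returns 1 only for the caller that drops the
         * last reference, so exactly one of the users frees the object. */
        if (atomic_dec_and_test(&amp;obj->refcount))
                kfree(obj);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>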
<P>
<H2><A NAME="ss2.13">2.13 Spinlocks, Read-write Spinlocks and Big-Reader Spinlocks</A>
</H2>
<P>
<P>Since the early days of Linux support (early 90s, this century),
developers were faced with the classical problem of accessing shared data
between different types of context (user process vs
interrupt) and different instances of the same context from multiple cpus.
<P>SMP support was added to Linux 1.3.42 on 15 Nov 1995 (the original patch
was made to 1.3.37 in October the same year).
<P>If the critical region of code may be executed by either process context
or interrupt context, then the way to protect it using <CODE>cli/sti</CODE> instructions
on UP is:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
unsigned long flags;
save_flags(flags);
cli();
/* critical code */
restore_flags(flags);
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>While this is ok on UP, it obviously is of no use on SMP because the same
code sequence may be executed simultaneously on another cpu, and while <CODE>cli()</CODE>
provides protection against races with interrupt context on each CPU individually, it
provides no protection at all against races between contexts running on different
CPUs. This is what spinlocks are for.
<P>There are three types of spinlocks: vanilla (basic), read-write and
big-reader spinlocks. Read-write spinlocks should be used when there is a
natural tendency towards 'many readers and few writers'. An example of this is
access to the list of registered filesystems (see <CODE>fs/super.c</CODE>). The list is
guarded by the <CODE>file_systems_lock</CODE> read-write spinlock because one needs exclusive
access only when registering/unregistering a filesystem, but any process can
read the file <CODE>/proc/filesystems</CODE> or use the <B>sysfs(2)</B> system call to force a
read-only scan of the file_systems list. This makes it sensible to use
read-write spinlocks. With read-write spinlocks, one can have multiple
readers at a time but only one writer and there can be no readers while
there is
a writer. Btw, it would be nice if new readers would not get a lock while
there
is a writer trying to get a lock, i.e. if Linux could correctly deal with
the issue of potential writer starvation by multiple readers.
This would mean that readers must be blocked while there is a writer
attempting to get the lock. This is not
currently the case and it is not obvious whether this should be fixed - the
argument to the contrary is that readers usually take the lock for a very short
time, so should they really be starved while the writer takes the lock for
potentially longer periods?
<P>Big-reader spinlocks are a form of read-write spinlocks
heavily optimised for very light read access, with a penalty for writes.
There is a limited number of big-reader spinlocks - currently only two exist,
of which one is used only on sparc64 (global irq) and the other is used for
networking. In all other cases where the access pattern does not fit into
any of these two scenarios, one should use basic spinlocks. You cannot block
while holding any kind of spinlock.
<P>Spinlocks come in three flavours: plain, <CODE>_irq()</CODE> and <CODE>_bh()</CODE>.
<P>
<OL>
<LI> Plain <CODE>spin_lock()/spin_unlock()</CODE>: if you know the interrupts are always
disabled or if you do not race with interrupt context (e.g. from
within interrupt handler), then you can use this one. It does not
touch interrupt state on the current CPU.
</LI>
<LI> <CODE>spin_lock_irq()/spin_unlock_irq()</CODE>: if you know that interrupts are
always enabled then you can use this version, which simply disables
(on lock) and re-enables (on unlock) interrupts on the current CPU.
For example, <CODE>rtc_read()</CODE> uses
<CODE>spin_lock_irq(&amp;rtc_lock)</CODE> (interrupts are always enabled inside
<CODE>read()</CODE>) whilst <CODE>rtc_interrupt()</CODE> uses
<CODE>spin_lock(&amp;rtc_lock)</CODE> (interrupts are always disabled inside
interrupt handler). Note that <CODE>rtc_read()</CODE> uses <CODE>spin_lock_irq()</CODE> and not
the more generic <CODE>spin_lock_irqsave()</CODE> because on entry to any system
call interrupts are always enabled.
</LI>
<LI> <CODE>spin_lock_irqsave()/spin_unlock_irqrestore()</CODE>: the strongest form,
to be used when the interrupt state is not known, but only if
interrupts matter at all, i.e. there is no point in using it if
our interrupt handlers don't execute any critical code.</LI>
</OL>
<P>The reason you cannot use plain <CODE>spin_lock()</CODE> if you race against interrupt
handlers is that if you take it and then an interrupt comes in on the same CPU,
it will busy wait for the lock forever:
the lock holder, having been interrupted, will not continue until the
interrupt handler returns.
<P>The most common usage of a spinlock is to access a data structure shared
between user process context and interrupt handlers:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
my_ioctl()
{
spin_lock_irq(&amp;my_lock);
/* critical section */
spin_unlock_irq(&amp;my_lock);
}
my_irq_handler()
{
spin_lock(&amp;my_lock);
/* critical section */
spin_unlock(&amp;my_lock);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>There are a couple of things to note about this example:
<P>
<OL>
<LI> The process context, represented here as a typical driver method -
<CODE>ioctl()</CODE> (arguments and return values omitted for clarity), must
use <CODE>spin_lock_irq()</CODE> because it knows that interrupts are always
enabled while executing the device <CODE>ioctl()</CODE> method.
</LI>
<LI> Interrupt context, represented here by <CODE>my_irq_handler()</CODE> (again
arguments omitted for clarity) can use plain <CODE>spin_lock()</CODE> form because
interrupts are disabled inside an interrupt handler.</LI>
</OL>
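<P>For comparison, a sketch of the <CODE>spin_lock_irqsave()</CODE> variant, for a hypothetical
helper that may be called with interrupts either enabled or disabled:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;linux/spinlock.h>

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

static void my_helper(void)
{
        unsigned long flags;

        spin_lock_irqsave(&amp;my_lock, flags);      /* disable interrupts, remember old state */
        /* critical section */
        spin_unlock_irqrestore(&amp;my_lock, flags); /* restore whatever the state was */
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>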
<P>
<H2><A NAME="ss2.14">2.14 Semaphores and read/write Semaphores</A>
</H2>
<P>
<P>Sometimes, while accessing a shared data structure, one must perform operations
that can block, for example copy data to userspace. The locking primitive
available for such scenarios under Linux is called a semaphore. There are two
types of semaphores: basic and read-write semaphores. Depending on the
initial value of the semaphore, they can be used either for mutual exclusion
(initial value of 1) or to provide a more sophisticated type of access.
<P>Read-write semaphores differ from basic semaphores in the same way as
read-write spinlocks differ from basic spinlocks: one can have multiple
readers at a time but only one writer and there can be no readers while there are
writers - i.e. the writer blocks all readers and new readers block while a
writer is waiting.
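<P>A read-write semaphore protecting one's own data might be declared and used as
follows (a minimal sketch; <CODE>my_rwsem</CODE> and the surrounding functions are
hypothetical):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static DECLARE_RWSEM(my_rwsem);

my_read_side()
{
        down_read(&amp;my_rwsem);
        /* read the shared data; may block, e.g. in copy_to_user() */
        up_read(&amp;my_rwsem);
}

my_write_side()
{
        down_write(&amp;my_rwsem);
        /* modify the shared data; may block */
        up_write(&amp;my_rwsem);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>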
<P>Also, basic semaphores can be interruptible - just use <CODE>down_interruptible()</CODE>
instead of the plain <CODE>down()</CODE> (the release operation is still <CODE>up()</CODE>) and check the
value returned from <CODE>down_interruptible()</CODE>: it will be non-zero if the operation was
interrupted by a signal.
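<P>For example, a basic semaphore used as an interruptible mutex might look like this
(a minimal sketch; the names are hypothetical):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static DECLARE_MUTEX(my_sem);   /* basic semaphore with initial value 1 */

int my_op(void)
{
        if (down_interruptible(&amp;my_sem))
                return -ERESTARTSYS;   /* interrupted by a signal */
        /* access the shared data; may block */
        up(&amp;my_sem);
        return 0;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>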
<P>Using semaphores for mutual exclusion is ideal in situations where a critical
code section may call, by reference, unknown functions registered by other
subsystems/modules, i.e. the caller cannot know a priori whether the function
blocks or not.
<P>A simple example of semaphore usage is in <CODE>kernel/sys.c</CODE>, implementation of
<B>gethostname(2)/sethostname(2)</B> system calls.
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
asmlinkage long sys_sethostname(char *name, int len)
{
        int errno;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;
        if (len &lt; 0 || len > __NEW_UTS_LEN)
                return -EINVAL;
        down_write(&amp;uts_sem);
        errno = -EFAULT;
        if (!copy_from_user(system_utsname.nodename, name, len)) {
                system_utsname.nodename[len] = 0;
                errno = 0;
        }
        up_write(&amp;uts_sem);
        return errno;
}

asmlinkage long sys_gethostname(char *name, int len)
{
        int i, errno;

        if (len &lt; 0)
                return -EINVAL;
        down_read(&amp;uts_sem);
        i = 1 + strlen(system_utsname.nodename);
        if (i > len)
                i = len;
        errno = 0;
        if (copy_to_user(name, system_utsname.nodename, i))
                errno = -EFAULT;
        up_read(&amp;uts_sem);
        return errno;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The points to note about this example are:
<P>
<OL>
<LI> The functions may block while copying data from/to userspace in
<CODE>copy_from_user()/copy_to_user()</CODE>. Therefore they could not use any form
of spinlock here.
</LI>
<LI> The semaphore type chosen is read-write as opposed to basic because
there may be lots of concurrent <B>gethostname(2)</B> requests which need not
be mutually exclusive.</LI>
</OL>
<P>Although the Linux implementation of semaphores and read-write semaphores is
very sophisticated, there are conceivable scenarios which are not yet
implemented - for example, there is no such thing as an interruptible
read-write semaphore. This is obviously because there are no real-world
situations which require these exotic flavours of the primitives.
<P>
<H2><A NAME="ss2.15">2.15 Kernel Support for Loading Modules</A>
</H2>
<P>
<P>Linux is a monolithic operating system and despite all the modern hype about
some "advantages" offered by operating systems based on micro-kernel design,
the truth remains (quoting Linus Torvalds himself):
<P>
<BLOCKQUOTE><CODE>
... message passing as the fundamental operation of the OS is just an
exercise in computer science masturbation. It may feel good, but you
don't actually get anything DONE.
</CODE></BLOCKQUOTE>
<P>Therefore, Linux is and will always be based on a monolithic design, which
means that all subsystems run in the same privileged mode and share the same
address space; communication between them is achieved by the usual C function
call means.
<P>However, although separating kernel functionality into separate "processes"
as is done in micro-kernels is definitely a bad idea, separating it into
kernel modules that can be dynamically loaded on demand is desirable in some
circumstances (e.g. on machines with low memory, or for installation kernels
which could otherwise contain mutually exclusive ISA auto-probing device
drivers). The decision whether to include support for loadable
modules is made at compile time and is determined by the <CODE>CONFIG_MODULES</CODE>
option. Support for module autoloading via the <CODE>request_module()</CODE> mechanism is
a separate compilation option (<CODE>CONFIG_KMOD</CODE>).
<P>The following functionality can be implemented as loadable modules under
Linux:
<P>
<OL>
<LI> Character and block device drivers, including misc device drivers.
</LI>
<LI> Terminal line disciplines.
</LI>
<LI> Virtual (regular) files in <CODE>/proc</CODE> and in devfs (e.g. <CODE>/dev/cpu/microcode</CODE>
vs <CODE>/dev/misc/microcode</CODE>).
</LI>
<LI> Binary file formats (e.g. ELF, aout, etc).
</LI>
<LI> Execution domains (e.g. Linux, UnixWare7, Solaris, etc).
</LI>
<LI> Filesystems.
</LI>
<LI> System V IPC.</LI>
</OL>
<P>There are a few things that cannot be implemented as modules under Linux
(probably because it makes no sense for them to be modularised):
<P>
<OL>
<LI> Scheduling algorithms.
</LI>
<LI> VM policies.
</LI>
<LI> Buffer cache, page cache and other caches.</LI>
</OL>
<P>Linux provides several system calls to assist in loading modules:
<P>
<OL>
<LI><CODE>caddr_t create_module(const char *name, size_t size)</CODE>: allocates
<CODE>size</CODE> bytes using <CODE>vmalloc()</CODE> and maps a module structure at the
beginning thereof. This new module is then linked into the list headed
by module_list. Only a process with <CODE>CAP_SYS_MODULE</CODE> can invoke this
system call, others will get <CODE>EPERM</CODE> returned.
</LI>
<LI><CODE>long init_module(const char *name, struct module *image)</CODE>: loads the
relocated module image and causes the module's initialisation routine
to be invoked. Only a process with <CODE>CAP_SYS_MODULE</CODE> can invoke this
system call, others will get <CODE>EPERM</CODE> returned.
</LI>
<LI><CODE>long delete_module(const char *name)</CODE>: attempts to unload the module.
If <CODE>name == NULL</CODE>, an attempt is made to unload all unused modules.
</LI>
<LI><CODE>long query_module(const char *name, int which, void *buf, size_t bufsize, size_t *ret)</CODE>: returns information about a module
(or about all modules), as shown in the sketch after this list.</LI>
</OL>
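<P>For example, a userspace program could list the names of all currently loaded
modules with <CODE>QM_MODULES</CODE>. A minimal sketch, assuming the glibc wrapper for
<B>query_module(2)</B> declared in <CODE>&lt;linux/module.h></CODE> as described in its manual
page (buffer size and error handling are kept deliberately simple):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;stdio.h>
#include &lt;string.h>
#include &lt;linux/module.h>   /* QM_MODULES, query_module() */

int main(void)
{
        char buf[4096];
        size_t ret, i = 0;

        /* QM_MODULES: buf is filled with consecutive NUL-terminated
           module names, ret receives the number of loaded modules */
        if (query_module(NULL, QM_MODULES, buf, sizeof(buf), &amp;ret) &lt; 0) {
                perror("query_module");
                return 1;
        }
        for (; ret > 0; ret--) {
                printf("%s\n", buf + i);
                i += strlen(buf + i) + 1;
        }
        return 0;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>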
<P>The command interface available to users consists of:
<P>
<UL>
<LI><B>insmod</B>: insert a single module.
</LI>
<LI><B>modprobe</B>: insert a module including all other modules it depends
on.
</LI>
<LI><B>rmmod</B>: remove a module.
</LI>
<LI><B>modinfo</B>: print some information about a module, e.g. author,
description, parameters the module accepts, etc.</LI>
</UL>
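<P>For reference, a minimal loadable module that these commands could operate on
might look like this (a sketch only; the name and messages are hypothetical):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include &lt;linux/module.h>
#include &lt;linux/kernel.h>
#include &lt;linux/init.h>

static int __init my_module_init(void)
{
        printk(KERN_INFO "my_module: loaded\n");
        return 0;               /* non-zero would make the load fail */
}

static void __exit my_module_exit(void)
{
        printk(KERN_INFO "my_module: unloaded\n");
}

module_init(my_module_init);
module_exit(my_module_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("do-nothing example module");
</PRE>
<HR>
</CODE></BLOCKQUOTE>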
<P>Apart from being able to load a module manually using either <B>insmod</B> or <B>modprobe</B>,
it is also possible to have the module inserted automatically by the kernel
when a particular functionality is required. The kernel interface for this
is the function called <CODE>request_module(name)</CODE>, which is exported to modules
so that modules can load other modules as well. Internally, <CODE>request_module(name)</CODE>
creates a kernel thread which execs the userspace command
<B>modprobe -s -k module_name</B>, using the standard <CODE>exec_usermodehelper()</CODE> kernel
interface (which is also exported to modules). The function returns 0 on
success; however, it is usually not worth checking the return code from
<CODE>request_module()</CODE>. Instead, the programming idiom is:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
if (check_some_feature() == NULL)
        request_module(module);
if (check_some_feature() == NULL)
        return -ENODEV;
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>For example, this is done by <CODE>fs/block_dev.c:get_blkfops()</CODE> to load a module
<CODE>block-major-N</CODE> when an attempt is made to open a block device with major <CODE>N</CODE>.
Obviously, there is no such module called <CODE>block-major-N</CODE> (Linux developers
only choose sensible names for their modules) but the name is mapped to a proper
module name using the file <CODE>/etc/modules.conf</CODE>. However, for most well-known
major numbers (and other kinds of modules) the <B>modprobe/insmod</B> commands
know which real module to load without needing an explicit alias statement
in <CODE>/etc/modules.conf</CODE>.
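<P>Such an alias line in <CODE>/etc/modules.conf</CODE> might look like this (the major
number and module name shown are purely illustrative):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
# map open() of a block device with major 58 onto the LVM driver
alias block-major-58 lvm-mod
</PRE>
<HR>
</CODE></BLOCKQUOTE>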
<P>A good example of loading a module is inside the <B>mount(2)</B> system call. The
<B>mount(2)</B> system call accepts the filesystem type as a string which
<CODE>fs/super.c:do_mount()</CODE> then passes on to <CODE>fs/super.c:get_fs_type()</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static struct file_system_type *get_fs_type(const char *name)
{
        struct file_system_type *fs;

        read_lock(&amp;file_systems_lock);
        fs = *(find_filesystem(name));
        if (fs &amp;&amp; !try_inc_mod_count(fs->owner))
                fs = NULL;
        read_unlock(&amp;file_systems_lock);
        if (!fs &amp;&amp; (request_module(name) == 0)) {
                read_lock(&amp;file_systems_lock);
                fs = *(find_filesystem(name));
                if (fs &amp;&amp; !try_inc_mod_count(fs->owner))
                        fs = NULL;
                read_unlock(&amp;file_systems_lock);
        }
        return fs;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>A few things to note in this function:
<P>
<OL>
<LI> First we attempt to find the filesystem with the given name amongst
those already registered. This is done under protection of
<CODE>file_systems_lock</CODE> taken for read (as we are not modifying the list
of registered filesystems).
</LI>
<LI> If such a filesystem is found then we attempt to get a new reference
to it by trying to increment its module's hold count. This always
returns 1 for statically linked filesystems or for modules not
presently being deleted. If <CODE>try_inc_mod_count()</CODE> returned 0 then
we consider it a failure - i.e. if the module is there but is being
deleted, it is as good as if it were not there at all.
</LI>
<LI> We drop the <CODE>file_systems_lock</CODE> because what we are about to do next
(<CODE>request_module()</CODE>) is a blocking operation, and therefore we can't
hold a spinlock over it. Actually, in this specific case, we would
have to drop <CODE>file_systems_lock</CODE> anyway, even if <CODE>request_module()</CODE> were
guaranteed to be non-blocking and the module loading were executed
in the same context atomically. The reason for this is that the module's
initialisation function will try to call <CODE>register_filesystem()</CODE>, which will
take the same <CODE>file_systems_lock</CODE> read-write spinlock for write.
</LI>
<LI> If the attempt to load was successful, then we take the
<CODE>file_systems_lock</CODE> spinlock and try to locate the newly registered
filesystem in the list. Note that this is slightly wrong because
it is in principle possible for a bug in the modprobe command to cause
it to coredump after it has successfully loaded the requested module, in
which case <CODE>request_module()</CODE> will fail even though the new filesystem is
registered, and yet <CODE>get_fs_type()</CODE> won't find it.
</LI>
<LI> If the filesystem is found and we are able to get a reference to it,
we return it. Otherwise we return NULL.</LI>
</OL>
<P>When a module is loaded into the kernel, it can refer to any symbols that
are exported as public by the kernel using the <CODE>EXPORT_SYMBOL()</CODE> macro or by
other currently loaded modules. If the module uses symbols from another
module, it is marked as depending on that module during dependency
recalculation, achieved by running the <B>depmod -a</B> command on boot (e.g. after
installing a new kernel).
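<P>Exporting a symbol is a one-line affair. A minimal sketch (the helper and its
users are hypothetical):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
/* in the kernel proper, or in an already-loaded module */
int my_shared_helper(int arg)
{
        return arg * 2;
}
EXPORT_SYMBOL(my_shared_helper);

/* any module loaded afterwards may simply declare and call it */
extern int my_shared_helper(int arg);
</PRE>
<HR>
</CODE></BLOCKQUOTE>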
<P>Usually, one must match the set of modules with the version of the kernel
interfaces they use, which under Linux simply means the "kernel version" as
there is no special kernel interface versioning mechanism in general.
However, there is a limited facility called "module versioning" or
<CODE>CONFIG_MODVERSIONS</CODE> which makes it possible to avoid recompiling modules when
switching to a new kernel. What happens here is that the kernel symbol table is treated
differently for internal access and for access from modules. The elements of
the public (i.e. exported) part of the symbol table are built by 32-bit
checksumming of the C declaration. So, in order to resolve a symbol used by a
module during loading, the loader must match the full representation of the
symbol, which includes the checksum; it will refuse to load the module if these
symbols differ. This
only happens when both the kernel and the module are compiled with module
versioning enabled. If either one of them uses the original symbol names,
the loader simply tries to match the kernel version declared by the module
and the one exported by the kernel and refuses to load if they differ.
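<P>Conceptually, a versioned symbol is just the original name with the checksum
appended, so with <CODE>CONFIG_MODVERSIONS</CODE> the loader matches something like the
following (the exact mangling and the checksum value shown are illustrative only):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
printk   ->   printk_R1b7d4074
</PRE>
<HR>
</CODE></BLOCKQUOTE>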
<P>
<HR>
<A HREF="lki-3.html">Next</A>
<A HREF="lki-1.html">Previous</A>
<A HREF="lki.html#toc2">Contents</A>
</BODY>
</HTML>