<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
<TITLE>Linux Kernel 2.4 Internals: Process and Interrupt Management</TITLE>
<LINK HREF="lki-3.html" REL=next>
<LINK HREF="lki-1.html" REL=previous>
<LINK HREF="lki.html#toc2" REL=contents>
</HEAD>
<BODY>
<A HREF="lki-3.html">Next</A>
<A HREF="lki-1.html">Previous</A>
<A HREF="lki.html#toc2">Contents</A>
<HR>
<H2><A NAME="s2">2. Process and Interrupt Management</A></H2>
<P>
<P>
<H2><A NAME="ss2.1">2.1 Task Structure and Process Table</A>
</H2>

<P>
<P>Every process under Linux is dynamically allocated a <CODE>struct task_struct</CODE>
structure. The maximum number of processes that can be created on Linux
is limited only by the amount of physical memory present and is
equal to (see <CODE>kernel/fork.c:fork_init()</CODE>):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
/*
 * The default maximum number of threads is set to a safe
 * value: the thread structures can take up at most half
 * of memory.
 */
max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>which, on the IA32 architecture, basically means <CODE>num_physpages/4</CODE>. As an example,
on a 512M machine you can create 32k threads. This is a considerable
improvement over the 4k-epsilon limit of older (2.2 and earlier) kernels.
Moreover, this can be changed at runtime using the KERN_MAX_THREADS <B>sysctl(2)</B>,
or simply using the procfs interface to kernel tunables:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
# cat /proc/sys/kernel/threads-max
32764
# echo 100000 > /proc/sys/kernel/threads-max
# cat /proc/sys/kernel/threads-max
100000
# gdb -q vmlinux /proc/kcore
Core was generated by `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118'.
#0  0x0 in ?? ()
(gdb) p max_threads
$1 = 100000
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The set of processes on a Linux system is represented as a collection of
<CODE>struct task_struct</CODE> structures which are linked in two ways:
<P>
<OL>
<LI> as a hashtable, hashed by pid, and</LI>
<LI> as a circular, doubly-linked list using <CODE>p->next_task</CODE> and <CODE>p->prev_task</CODE>
pointers.</LI>
</OL>
<P>The hashtable is called <CODE>pidhash[]</CODE> and is defined in
<CODE>include/linux/sched.h</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
/* PID hashing. (shouldnt this be dynamic?) */
#define PIDHASH_SZ (4096 >> 2)
extern struct task_struct *pidhash[PIDHASH_SZ];

#define pid_hashfn(x)   ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The tasks are hashed by their pid value and the above hashing function is
supposed to distribute the elements uniformly over their domain
(<CODE>0</CODE> to <CODE>PID_MAX-1</CODE>). The hashtable is used to quickly find a task by its pid,
using the <CODE>find_task_by_pid()</CODE> inline from <CODE>include/linux/sched.h</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static inline struct task_struct *find_task_by_pid(int pid)
{
        struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];

        for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
                ;

        return p;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The tasks on each hashlist (i.e. hashed to the same value) are linked
by <CODE>p->pidhash_next/pidhash_pprev</CODE>, which are used by <CODE>hash_pid()</CODE> and
<CODE>unhash_pid()</CODE> to insert and remove a given process from the hashtable.
These operations are done under the protection of the read-write spinlock <CODE>tasklist_lock</CODE>,
taken for WRITE.
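<P>For illustration, here is a minimal sketch (not from the kernel source) of a
safe pid lookup; the hashtable must not change under us, so <CODE>tasklist_lock</CODE>
is taken for READ:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
struct task_struct *p;

read_lock(&tasklist_lock);
p = find_task_by_pid(pid);
if (p)
        printk("pid %d is running %s\n", p->pid, p->comm);
read_unlock(&tasklist_lock);
</PRE>
<HR>
</CODE></BLOCKQUOTE>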
<P>The circular doubly-linked list that uses <CODE>p->next_task/prev_task</CODE> is
maintained so that one can go through all tasks on the system easily.
This is achieved by the <CODE>for_each_task()</CODE> macro from <CODE>include/linux/sched.h</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#define for_each_task(p) \
        for (p = &init_task ; (p = p->next_task) != &init_task ; )
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>Users of <CODE>for_each_task()</CODE> should take <CODE>tasklist_lock</CODE> for READ.
Note that <CODE>for_each_task()</CODE> uses <CODE>init_task</CODE> to mark the beginning (and end)
of the list - this is safe because the idle task (pid 0) never exits.
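<P>A short sketch (again, not verbatim kernel code) of walking the whole task
list with this macro:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
struct task_struct *p;

read_lock(&tasklist_lock);
for_each_task(p) {
        if (p->state == TASK_RUNNING)
                printk("runnable: %d (%s)\n", p->pid, p->comm);
}
read_unlock(&tasklist_lock);
</PRE>
<HR>
</CODE></BLOCKQUOTE>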
<P>The modifiers of the process hashtable and/or the process table links,
notably <CODE>fork()</CODE>, <CODE>exit()</CODE> and <CODE>ptrace()</CODE>, must take <CODE>tasklist_lock</CODE> for WRITE. What is
more interesting is that the writers must also disable interrupts on the
local CPU. The reason for this is not trivial: the <CODE>send_sigio()</CODE> function walks the
task list and thus takes <CODE>tasklist_lock</CODE> for READ, and it is called from
<CODE>kill_fasync()</CODE> in interrupt context. This is why writers must disable
interrupts, while readers don't need to.
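<P>So a writer's critical section, sketched under the assumptions above, looks
like this (the plain read side was shown earlier):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
write_lock_irq(&tasklist_lock); /* also excludes interrupt-context
                                 * readers such as send_sigio()
                                 * on this CPU */
hash_pid(p);                    /* or unhash_pid(p), relink
                                 * p->next_task/p->prev_task, etc. */
write_unlock_irq(&tasklist_lock);
</PRE>
<HR>
</CODE></BLOCKQUOTE>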
<P>Now that we understand how the <CODE>task_struct</CODE> structures are linked together,
let us examine the members of <CODE>task_struct</CODE>. They loosely correspond to the
members of the UNIX 'struct proc' and 'struct user' combined together.
<P>Other versions of UNIX separated the task state information into
one part which should be kept memory-resident at all times (called the 'proc
structure', which includes process state, scheduling information etc.) and
another part which is only needed when the process is running (called the 'u area',
which includes the file descriptor table, disk quota information etc.). The only reason
for such an ugly design was that memory was a very scarce resource. Modern
operating systems (well, only Linux at the moment, but others, e.g. FreeBSD,
seem to improve in this direction towards Linux) do not need such separation
and therefore maintain process state in a kernel memory-resident data
structure at all times.
<P>The <CODE>task_struct</CODE> structure is declared in <CODE>include/linux/sched.h</CODE> and is
currently 1680 bytes in size.
<P>The state field is declared as:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2
#define TASK_ZOMBIE             4
#define TASK_STOPPED            8
#define TASK_EXCLUSIVE          32
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>Why is <CODE>TASK_EXCLUSIVE</CODE> defined as 32 and not 16? Because 16 was used up by
<CODE>TASK_SWAPPING</CODE> and I forgot to shift <CODE>TASK_EXCLUSIVE</CODE> up when I removed
all references to <CODE>TASK_SWAPPING</CODE> (sometime in 2.3.x).
<P>The <CODE>volatile</CODE> in the <CODE>p->state</CODE> declaration means it can be modified
asynchronously (from an interrupt handler). The possible states are:
<P>
<OL>
<LI><B>TASK_RUNNING</B>: means the task is "supposed to be" on the run
queue. The reason it may not yet be on the runqueue is that marking a task as
<CODE>TASK_RUNNING</CODE> and placing it on the runqueue is not atomic. You need to hold
the <CODE>runqueue_lock</CODE> read-write spinlock for read in order to look at the
runqueue. If you do so, you will then see that every task on the runqueue is in
<CODE>TASK_RUNNING</CODE> state. However, the converse is not true for the reason explained
above. Similarly, drivers can mark themselves (or rather the process context they
run in) as <CODE>TASK_INTERRUPTIBLE</CODE> (or <CODE>TASK_UNINTERRUPTIBLE</CODE>) and then call <CODE>schedule()</CODE>,
which will then remove it from the runqueue (unless there is a pending signal, in which
case it is left on the runqueue). </LI>
<LI><B>TASK_INTERRUPTIBLE</B>: means the task is sleeping but can be woken up
by a signal or by expiry of a timer.</LI>
<LI><B>TASK_UNINTERRUPTIBLE</B>: same as <CODE>TASK_INTERRUPTIBLE</CODE>, except it cannot
be woken up by a signal - only by an explicit wakeup.</LI>
<LI><B>TASK_ZOMBIE</B>: task has terminated but has not had its status collected
(<CODE>wait()</CODE>-ed for) by the parent (natural or by adoption).</LI>
<LI><B>TASK_STOPPED</B>: task was stopped, either due to job control signals or
due to <B>ptrace(2)</B>.</LI>
<LI><B>TASK_EXCLUSIVE</B>: this is not a separate state but can be OR-ed to
either one of <CODE>TASK_INTERRUPTIBLE</CODE> or <CODE>TASK_UNINTERRUPTIBLE</CODE>. This means that when
this task is sleeping on a wait queue with many other tasks, it will be
woken up alone instead of causing a "thundering herd" problem by waking up all
the waiters.</LI>
</OL>
<P>Task flags contain information about the process states which are not
mutually exclusive:
<BLOCKQUOTE><CODE>
<HR>
<PRE>
unsigned long flags;    /* per process flags, defined below */
/*
 * Per process flags
 */
#define PF_ALIGNWARN    0x00000001      /* Print alignment warning msgs */
                                        /* Not implemented yet, only for 486*/
#define PF_STARTING     0x00000002      /* being created */
#define PF_EXITING      0x00000004      /* getting shut down */
#define PF_FORKNOEXEC   0x00000040      /* forked but didn't exec */
#define PF_SUPERPRIV    0x00000100      /* used super-user privileges */
#define PF_DUMPCORE     0x00000200      /* dumped core */
#define PF_SIGNALED     0x00000400      /* killed by a signal */
#define PF_MEMALLOC     0x00000800      /* Allocating memory */
#define PF_VFORK        0x00001000      /* Wake up parent in mm_release */
#define PF_USEDFPU      0x00100000      /* task used FPU this quantum (SMP) */
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The fields <CODE>p->has_cpu</CODE>, <CODE>p->processor</CODE>, <CODE>p->counter</CODE>, <CODE>p->priority</CODE>, <CODE>p->policy</CODE> and
<CODE>p->rt_priority</CODE> are related to the scheduler and will be looked at later.
<P>The fields <CODE>p->mm</CODE> and <CODE>p->active_mm</CODE> point respectively to the process' address space
described by the <CODE>mm_struct</CODE> structure and to the active address space if the
process doesn't have a real one (e.g. kernel threads). This helps minimise
TLB flushes on switching address spaces when the task is scheduled out.
So, if we are scheduling-in a kernel thread (which has no <CODE>p->mm</CODE>) then its
<CODE>next->active_mm</CODE> will be set to the <CODE>prev->active_mm</CODE> of the task that was
scheduled-out, which will be the same as <CODE>prev->mm</CODE> if <CODE>prev->mm != NULL</CODE>.
The address space can be shared between threads if the <CODE>CLONE_VM</CODE> flag is passed
to the <B>clone(2)</B> system call or by means of the <B>vfork(2)</B> system call.
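<P>In code, the lazy-mm trick described above amounts to something like the
following sketch (simplified from what <CODE>schedule()</CODE> does, not a verbatim
quote):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
if (!next->mm) {
        /* kernel thread: borrow the outgoing task's address space,
         * so no page directory switch and no TLB flush is needed */
        next->active_mm = prev->active_mm;
        atomic_inc(&prev->active_mm->mm_count);
} else {
        next->active_mm = next->mm;
        if (next->mm != prev->active_mm)
                switch_mm(prev->active_mm, next->mm, next, this_cpu);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>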
<P>The fields <CODE>p->exec_domain</CODE> and <CODE>p->personality</CODE> relate to the personality of
the task, i.e. to the way certain system calls behave in order to emulate the
"personality" of foreign flavours of UNIX.
<P>The field <CODE>p->fs</CODE> contains filesystem information, which under Linux means
three pieces of information:
<P>
<OL>
<LI>root directory's dentry and mountpoint,</LI>
<LI>alternate root directory's dentry and mountpoint,</LI>
<LI>current working directory's dentry and mountpoint.</LI>
</OL>
<P>This structure also includes a reference count because it can be shared
between cloned tasks when the <CODE>CLONE_FS</CODE> flag is passed to the <B>clone(2)</B> system
call.
<P>The field <CODE>p->files</CODE> contains the file descriptor table. This too can be
shared between tasks, provided <CODE>CLONE_FILES</CODE> is specified with the <B>clone(2)</B> system
call.
<P>The field <CODE>p->sig</CODE> contains signal handlers and can be shared between cloned
tasks by means of <CODE>CLONE_SIGHAND</CODE>.
<P>
<H2><A NAME="ss2.2">2.2 Creation and termination of tasks and kernel threads</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Different books on operating systems define a "process" in different ways,
|
|
starting from "instance of a program in execution" and ending with "that
|
|
which is produced by clone(2) or fork(2) system calls".
|
|
Under Linux, there are three kinds of processes:
|
|
<P>
|
|
<UL>
|
|
<LI> the idle thread(s),</LI>
|
|
<LI> kernel threads,</LI>
|
|
<LI> user tasks.</LI>
|
|
</UL>
|
|
<P>The idle thread is created at compile time for the first CPU; it is then
|
|
"manually" created for each CPU by means of arch-specific
|
|
<CODE>fork_by_hand()</CODE> in <CODE>arch/i386/kernel/smpboot.c</CODE>, which unrolls the <B>fork(2)</B> system
|
|
call by hand (on some archs). Idle tasks share one init_task structure but
|
|
have a private TSS structure, in the per-CPU array <CODE>init_tss</CODE>. Idle tasks all have
|
|
pid = 0 and no other task can share pid, i.e. use <CODE>CLONE_PID</CODE> flag to <B>clone(2)</B>.
|
|
<P>Kernel threads are created using the <CODE>kernel_thread()</CODE> function, which invokes
the <B>clone(2)</B> system call in kernel mode. Kernel threads usually have no user
address space, i.e. <CODE>p->mm = NULL</CODE>, because they explicitly do <CODE>exit_mm()</CODE>, e.g.
via the <CODE>daemonize()</CODE> function. Kernel threads can always access the kernel address
space directly. They are allocated pid numbers in the low range. Running at
the processor's ring 0 (on x86, that is) implies that kernel threads enjoy all I/O privileges
and cannot be pre-empted by the scheduler.
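<P>A hypothetical kernel thread might therefore look like the sketch below (the
names are made up; <CODE>kernel_thread()</CODE>, <CODE>daemonize()</CODE> and
<CODE>schedule_timeout()</CODE> are real 2.4 interfaces):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static int my_kthread(void *unused)
{
        daemonize();    /* drop the user mm: p->mm becomes NULL */
        strcpy(current->comm, "mykthread");

        for (;;) {
                /* ... do some periodic housekeeping ... */
                set_current_state(TASK_INTERRUPTIBLE);
                schedule_timeout(HZ);   /* sleep for about a second */
        }
        return 0;
}

/* from the creator's (e.g. a driver's) initialisation code: */
kernel_thread(my_kthread, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
</PRE>
<HR>
</CODE></BLOCKQUOTE>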
<P>User tasks are created by means of the <B>clone(2)</B> or <B>fork(2)</B> system calls, both of
which internally invoke <CODE>kernel/fork.c:do_fork()</CODE>.
<P>Let us understand what happens when a user process makes a <B>fork(2)</B> system
call. Although <B>fork(2)</B> is architecture-dependent due to the
different ways of passing user stack and registers, the actual underlying
function <CODE>do_fork()</CODE> that does the job is portable and is located at
<CODE>kernel/fork.c</CODE>.
<P>The following steps are done:
<P>
<OL>
<LI> Local variable <CODE>retval</CODE> is set to <CODE>-ENOMEM</CODE>, as this is the value which <CODE>errno</CODE>
should be set to if <B>fork(2)</B> fails to allocate a new task structure.
</LI>
<LI> If <CODE>CLONE_PID</CODE> is set in <CODE>clone_flags</CODE> then return an error (<CODE>-EPERM</CODE>), unless
the caller is the idle thread (during boot only). So, normal user
threads cannot pass <CODE>CLONE_PID</CODE> to <B>clone(2)</B> and expect it to succeed.
For <B>fork(2)</B>, this is irrelevant as <CODE>clone_flags</CODE> is set to <CODE>SIGCHLD</CODE> - it
is only relevant when <CODE>do_fork()</CODE> is invoked from <CODE>sys_clone()</CODE>, which
passes <CODE>clone_flags</CODE> from the value requested from userspace.
</LI>
<LI> <CODE>current->vfork_sem</CODE> is initialised (it is later cleared in the child).
This is used by <CODE>sys_vfork()</CODE> (the <B>vfork(2)</B> system call, corresponding to
<CODE>clone_flags = CLONE_VFORK|CLONE_VM|SIGCHLD</CODE>) to make the parent sleep
until the child does <CODE>mm_release()</CODE>, for example as a result of <CODE>exec()</CODE>ing
another program or <B>exit(2)</B>-ing.
</LI>
<LI> A new task structure is allocated using the arch-dependent
<CODE>alloc_task_struct()</CODE> macro. On x86 it is just a gfp at <CODE>GFP_KERNEL</CODE>
priority. This is the first reason why the <B>fork(2)</B> system call may sleep.
If this allocation fails, we return <CODE>-ENOMEM</CODE>.
</LI>
<LI> All the values from the current process' task structure are copied into
the new one, using structure assignment <CODE>*p = *current</CODE>. Perhaps this
should be replaced by a memcpy? Later on, the fields that should not
be inherited by the child are set to the correct values.
</LI>
<LI> The big kernel lock is taken, as the rest of the code would otherwise be
non-reentrant.
</LI>
<LI> If the parent has user resources (a concept of UID, Linux is flexible
enough to make it a question rather than a fact), then verify if the
user exceeded the <CODE>RLIMIT_NPROC</CODE> soft limit - if so, fail with <CODE>-EAGAIN</CODE>; if
not, increment the count of processes by the given uid, <CODE>p->user->count</CODE>.
</LI>
<LI> If the system-wide number of tasks exceeds the value of the tunable
max_threads, fail with <CODE>-EAGAIN</CODE>.
</LI>
<LI> If the binary being executed belongs to a modularised execution
domain, increment the corresponding module's reference count.
</LI>
<LI> If the binary being executed belongs to a modularised binary format,
increment the corresponding module's reference count.
</LI>
<LI> The child is marked as 'has not execed' (<CODE>p->did_exec = 0</CODE>).
</LI>
<LI> The child is marked as 'not-swappable' (<CODE>p->swappable = 0</CODE>).
</LI>
<LI> The child is put into 'uninterruptible sleep' state, i.e.
<CODE>p->state = TASK_UNINTERRUPTIBLE</CODE> (TODO: why is this done?
I think it's not needed - get rid of it, Linus confirms it is not
needed).
</LI>
<LI> The child's <CODE>p->flags</CODE> are set according to the value of clone_flags;
for plain <B>fork(2)</B>, this will be <CODE>p->flags = PF_FORKNOEXEC</CODE>.
</LI>
<LI> The child's pid <CODE>p->pid</CODE> is set using the fast algorithm in
<CODE>kernel/fork.c:get_pid()</CODE> (TODO: the <CODE>lastpid_lock</CODE> spinlock can be made
redundant since <CODE>get_pid()</CODE> is always called under the big kernel lock
from <CODE>do_fork()</CODE>; also remove the flags argument of <CODE>get_pid()</CODE>, patch sent
to Alan on 20/06/2000 - followup later).
</LI>
<LI> The rest of the code in <CODE>do_fork()</CODE> initialises the rest of the child's
task structure. At the very end, the child's task structure is
hashed into the <CODE>pidhash</CODE> hashtable and the child is woken up (TODO:
<CODE>wake_up_process(p)</CODE> sets <CODE>p->state = TASK_RUNNING</CODE> and adds the process
to the runq, therefore we probably didn't need to set <CODE>p->state</CODE> to
<CODE>TASK_RUNNING</CODE> earlier on in <CODE>do_fork()</CODE>). The interesting part is
setting <CODE>p->exit_signal</CODE> to <CODE>clone_flags & CSIGNAL</CODE>, which for <B>fork(2)</B>
means just <CODE>SIGCHLD</CODE>, and setting <CODE>p->pdeath_signal</CODE> to 0. The
<CODE>pdeath_signal</CODE> is used when a process 'forgets' the original parent
(by dying) and can be set/get by means of the <CODE>PR_GET/SET_PDEATHSIG</CODE>
commands of the <B>prctl(2)</B> system call (you might argue that the way the
value of <CODE>pdeath_signal</CODE> is returned via a userspace pointer argument
in <B>prctl(2)</B> is a bit silly - mea culpa, after Andries Brouwer
updated the manpage it was too late to fix ;)</LI>
</OL>
<P>Thus tasks are created. There are several ways for tasks to terminate:
<P>
<OL>
<LI> by making the <B>exit(2)</B> system call;
</LI>
<LI> by being delivered a signal whose default disposition is to die;
</LI>
<LI> by being forced to die under certain exceptions;
</LI>
<LI> by calling <B>bdflush(2)</B> with <CODE>func == 1</CODE> (this is Linux-specific, for
compatibility with old distributions that still had the 'update'
line in <CODE>/etc/inittab</CODE> - nowadays the work of update is done by the
kernel thread <CODE>kupdate</CODE>).</LI>
</OL>
<P>Functions implementing system calls under Linux are prefixed with <CODE>sys_</CODE>,
but they are usually concerned only with argument checking or arch-specific
ways to pass some information, while the actual work is done by <CODE>do_</CODE> functions.
So it is with <CODE>sys_exit()</CODE>, which calls <CODE>do_exit()</CODE> to do the work. (Although
other parts of the kernel sometimes invoke <CODE>sys_exit()</CODE> when they should really
call <CODE>do_exit()</CODE>.)
<P>The function <CODE>do_exit()</CODE> is found in <CODE>kernel/exit.c</CODE>. The points to note about
<CODE>do_exit()</CODE>:
<P>
<UL>
<LI> Uses the global kernel lock (locks but doesn't unlock).
</LI>
<LI> Calls <CODE>schedule()</CODE> at the end, which never returns.
</LI>
<LI> Sets the task state to <CODE>TASK_ZOMBIE</CODE>.
</LI>
<LI> Notifies any child with <CODE>current->pdeath_signal</CODE>, if not 0.
</LI>
<LI> Notifies the parent with a <CODE>current->exit_signal</CODE>, which is usually
equal to <CODE>SIGCHLD</CODE>.
</LI>
<LI> Releases resources allocated by fork, closes open files etc.
</LI>
<LI> On architectures that use lazy FPU switching (ia64, mips, mips64)
(TODO: sparc, sparc64), do whatever the hardware requires to pass the FPU
ownership (if owned by current) to "none".</LI>
</UL>
<P>
<H2><A NAME="ss2.3">2.3 Linux Scheduler</A>
</H2>

<P>
<P>The job of a scheduler is to arbitrate access to the current CPU between
multiple processes. The scheduler is implemented in the 'main kernel file'
<CODE>kernel/sched.c</CODE>. The corresponding header file <CODE>include/linux/sched.h</CODE> is
included (either explicitly or indirectly) by virtually every kernel source
file.
<P>The fields of the task structure relevant to the scheduler include:
<P>
<UL>
<LI> <CODE>p->need_resched</CODE>: this field is set if <CODE>schedule()</CODE> should be invoked at
the 'next opportunity'.
</LI>
<LI> <CODE>p->counter</CODE>: number of clock ticks left to run in this scheduling
slice, decremented by a timer. When this field becomes lower than or equal to zero, it is reset
to 0 and <CODE>p->need_resched</CODE> is set. This is also sometimes called the 'dynamic
priority' of a process because it can change by itself.
</LI>
<LI> <CODE>p->priority</CODE>: the process' static priority, only changed through well-known
system calls like <B>nice(2)</B>, POSIX.1b <B>sched_setparam(2)</B> or 4.4BSD/SVR4
<B>setpriority(2)</B>.
</LI>
<LI> <CODE>p->rt_priority</CODE>: realtime priority.
</LI>
<LI> <CODE>p->policy</CODE>: the scheduling policy, specifies which scheduling class the
task belongs to. Tasks can change their scheduling class using the
<B>sched_setscheduler(2)</B> system call. The valid values are <CODE>SCHED_OTHER</CODE>
(traditional UNIX process), <CODE>SCHED_FIFO</CODE> (POSIX.1b FIFO realtime
process) and <CODE>SCHED_RR</CODE> (POSIX round-robin realtime process). One can
also OR <CODE>SCHED_YIELD</CODE> to any of these values to signify that the process
decided to yield the CPU, for example by calling the <B>sched_yield(2)</B> system
call. A FIFO realtime process will run until either a) it blocks on I/O,
b) it explicitly yields the CPU or c) it is preempted by another realtime
process with a higher <CODE>p->rt_priority</CODE> value. <CODE>SCHED_RR</CODE> is the same as
<CODE>SCHED_FIFO</CODE>, except that when its timeslice expires it goes back to
the end of the runqueue.</LI>
</UL>
<P>The scheduler's algorithm is simple, despite the great apparent complexity
of the <CODE>schedule()</CODE> function. The function is complex because it implements
three scheduling algorithms in one and also because of the subtle
SMP-specifics.
<P>The apparently 'useless' gotos in <CODE>schedule()</CODE> are there for a purpose - to
generate the best optimised (for i386) code. Also, note that the scheduler
(like most of the kernel) was completely rewritten for 2.4, therefore the
discussion below does not apply to 2.2 or earlier kernels.
<P>Let us look at the function in detail:
<P>
<OL>
<LI> If <CODE>current->active_mm == NULL</CODE> then something is wrong. The current
process, even a kernel thread (<CODE>current->mm == NULL</CODE>), must have a valid
<CODE>p->active_mm</CODE> at all times.
</LI>
<LI> If there is something to do on the <CODE>tq_scheduler</CODE> task queue, process it
now. Task queues provide a kernel mechanism to schedule execution of
functions at a later time. We shall look at it in detail elsewhere.
</LI>
<LI> Initialise local variables <CODE>prev</CODE> and <CODE>this_cpu</CODE> to the current task and
current CPU respectively.
</LI>
<LI> Check if <CODE>schedule()</CODE> was invoked from an interrupt handler (due to a bug)
and panic if so.
</LI>
<LI> Release the global kernel lock.
</LI>
<LI> If there is some work to do via the softirq mechanism, do it now.
</LI>
<LI> Initialise the local pointer <CODE>struct schedule_data *sched_data</CODE> to point
to the per-CPU (cacheline-aligned to prevent cacheline ping-pong)
scheduling data area, which contains the TSC value of <CODE>last_schedule</CODE> and the
pointer to the last scheduled task structure (TODO: <CODE>sched_data</CODE> is used on
SMP only, but why does <CODE>init_idle()</CODE> initialise it on UP as well?).
</LI>
<LI> The <CODE>runqueue_lock</CODE> spinlock is taken. Note that we use <CODE>spin_lock_irq()</CODE>
because in <CODE>schedule()</CODE> we guarantee that interrupts are enabled. Therefore,
when we unlock <CODE>runqueue_lock</CODE>, we can just re-enable them instead of
saving/restoring eflags (the <CODE>spin_lock_irqsave/restore</CODE> variant).
</LI>
<LI> Task state machine: if the task is in <CODE>TASK_RUNNING</CODE> state, it is left
alone; if it is in <CODE>TASK_INTERRUPTIBLE</CODE> state and a signal is pending,
it is moved into <CODE>TASK_RUNNING</CODE> state. In all other cases, it is deleted
from the runqueue.
</LI>
<LI> <CODE>next</CODE> (the best candidate to be scheduled) is set to the idle task of
this cpu. However, the goodness of this candidate is set to a very
low value (-1000), in the hope that there is someone better than that.
</LI>
<LI> If the <CODE>prev</CODE> (current) task is in <CODE>TASK_RUNNING</CODE> state, then the
current goodness is set to its goodness and it is marked as a better
candidate to be scheduled than the idle task.
</LI>
<LI> Now the runqueue is examined and the goodness of each process that can
be scheduled on this cpu is compared with the current value; the
process with the highest goodness wins. Now the concept of "can be
scheduled on this cpu" must be clarified: on UP, every process on
the runqueue is eligible to be scheduled; on SMP, only a process not
already running on another cpu is eligible to be scheduled on this
cpu. The goodness is calculated by a function called <CODE>goodness()</CODE> (a
simplified sketch of which follows this list), which
treats realtime processes by making their goodness very high
(<CODE>1000 + p->rt_priority</CODE>); this being greater than 1000 guarantees that
no <CODE>SCHED_OTHER</CODE> process can win, so they only contend with other
realtime processes that may have a greater <CODE>p->rt_priority</CODE>. The
goodness function returns 0 if the process' time slice (<CODE>p->counter</CODE>)
is over. For non-realtime processes, the initial value of goodness is
set to <CODE>p->counter</CODE> - this way, the process is less likely to get the CPU if
it already had it for a while, i.e. interactive processes are favoured
more than CPU-bound number crunchers. The arch-specific constant
<CODE>PROC_CHANGE_PENALTY</CODE> attempts to implement "cpu affinity" (i.e. give
an advantage to a process on the same CPU). It also gives a slight
advantage to processes with mm pointing to the current <CODE>active_mm</CODE> or to
processes with no (user) address space, i.e. kernel threads.
</LI>
<LI> If the current value of goodness is 0 then the entire list of
processes (not just the ones on the runqueue!) is examined and their dynamic
priorities are recalculated using the simple algorithm:

<BLOCKQUOTE><CODE>
<HR>
<PRE>
recalculate:
        {
                struct task_struct *p;
                spin_unlock_irq(&runqueue_lock);
                read_lock(&tasklist_lock);
                for_each_task(p)
                        p->counter = (p->counter >> 1) + p->priority;
                read_unlock(&tasklist_lock);
                spin_lock_irq(&runqueue_lock);
        }
</PRE>
<HR>
</CODE></BLOCKQUOTE>

Note that we drop the <CODE>runqueue_lock</CODE> before we recalculate. The
reason is that we go through the entire set of processes; this can take
a long time, during which <CODE>schedule()</CODE> could be called on another CPU and
select a process with goodness good enough for that CPU, whilst we on
this CPU were forced to recalculate. Ok, admittedly this is somewhat
inconsistent, because while we (on this CPU) are selecting a process with
the best goodness, <CODE>schedule()</CODE> running on another CPU could be
recalculating dynamic priorities.
</LI>
<LI> From this point on it is certain that <CODE>next</CODE> points to the task to
be scheduled, so we initialise <CODE>next->has_cpu</CODE> to 1 and <CODE>next->processor</CODE>
to <CODE>this_cpu</CODE>. The <CODE>runqueue_lock</CODE> can now be unlocked.
</LI>
<LI> If we are switching back to the same task (<CODE>next == prev</CODE>) then we can
simply reacquire the global kernel lock and return, i.e. skip all the
hardware-level (registers, stack etc.) and VM-related (switch page
directory, recalculate <CODE>active_mm</CODE> etc.) stuff.
</LI>
<LI> The macro <CODE>switch_to()</CODE> is architecture specific. On i386, it is
concerned with a) FPU handling, b) LDT handling, c) reloading segment
registers, d) TSS handling and e) reloading debug registers.</LI>
</OL>
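<P>For reference, here is a simplified sketch of the <CODE>goodness()</CODE> logic
described above (condensed for illustration, not the verbatim kernel source):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static int goodness_sketch(struct task_struct *p, int this_cpu,
                           struct mm_struct *this_mm)
{
        int weight;

        /* realtime processes always beat SCHED_OTHER ones */
        if (p->policy != SCHED_OTHER)
                return 1000 + p->rt_priority;

        /* time slice used up - goodness is 0 */
        weight = p->counter;
        if (!weight)
                return 0;
#ifdef CONFIG_SMP
        /* large advantage if the process last ran on this CPU */
        if (p->processor == this_cpu)
                weight += PROC_CHANGE_PENALTY;
#endif
        /* slight advantage for kernel threads and same address space */
        if (p->mm == this_mm || !p->mm)
                weight += 1;

        return weight + p->priority;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>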
<P>
<H2><A NAME="ss2.4">2.4 Linux linked list implementation</A>
</H2>

<P>
<P>Before we go on to examine the implementation of wait queues, we must
acquaint ourselves with the Linux standard doubly-linked list implementation.
Wait queues (as well as everything else in Linux) make heavy use
of them and they are called in jargon the "list.h implementation" because the
most relevant file is <CODE>include/linux/list.h</CODE>.
<P>The fundamental data structure here is <CODE>struct list_head</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
struct list_head {
        struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
        struct list_head name = LIST_HEAD_INIT(name)

#define INIT_LIST_HEAD(ptr) do { \
        (ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)

#define list_entry(ptr, type, member) \
        ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))

#define list_for_each(pos, head) \
        for (pos = (head)->next; pos != (head); pos = pos->next)
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The first three macros are for initialising an empty list by pointing both
the <CODE>next</CODE> and <CODE>prev</CODE> pointers to itself. It is obvious from C syntactical
restrictions which ones should be used where - for example, <CODE>LIST_HEAD_INIT()</CODE>
can be used to initialise a structure element in a declaration, the second
can be used for static variable initialising declarations and the third can
be used inside a function.
<P>The macro <CODE>list_entry()</CODE> gives access to an individual list element, for example
(from <CODE>fs/file_table.c:fs_may_remount_ro()</CODE>):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
struct super_block {
        ...
        struct list_head s_files;
        ...
} *sb = &some_super_block;

struct file {
        ...
        struct list_head f_list;
        ...
} *file;

struct list_head *p;

for (p = sb->s_files.next; p != &sb->s_files; p = p->next) {
        struct file *file = list_entry(p, struct file, f_list);
        do something to 'file'
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>A good example of the use of the <CODE>list_for_each()</CODE> macro is in the scheduler, where
we walk the runqueue looking for the process with the highest goodness:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static LIST_HEAD(runqueue_head);
struct list_head *tmp;
struct task_struct *p;

list_for_each(tmp, &runqueue_head) {
        p = list_entry(tmp, struct task_struct, run_list);
        if (can_schedule(p)) {
                int weight = goodness(p, this_cpu, prev->active_mm);
                if (weight > c)
                        c = weight, next = p;
        }
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>Here, <CODE>p->run_list</CODE> is declared as <CODE>struct list_head run_list</CODE> inside the
<CODE>task_struct</CODE> structure and serves as the anchor into the list. Removing an element
from the list and adding one (to the head or tail of the list) is done by the
<CODE>list_del()/list_add()/list_add_tail()</CODE> macros. The examples below add
and remove a task from the runqueue:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static inline void del_from_runqueue(struct task_struct * p)
{
        nr_running--;
        list_del(&p->run_list);
        p->run_list.next = NULL;
}

static inline void add_to_runqueue(struct task_struct * p)
{
        list_add(&p->run_list, &runqueue_head);
        nr_running++;
}

static inline void move_last_runqueue(struct task_struct * p)
{
        list_del(&p->run_list);
        list_add_tail(&p->run_list, &runqueue_head);
}

static inline void move_first_runqueue(struct task_struct * p)
{
        list_del(&p->run_list);
        list_add(&p->run_list, &runqueue_head);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>
<H2><A NAME="ss2.5">2.5 Wait Queues</A>
</H2>

<P>
<P>When a process requests the kernel to do something which is currently
impossible but that may become possible later, the process is put to sleep
and is woken up when the request is more likely to be satisfied. One of the
kernel mechanisms used for this is called a 'wait queue'.
<P>The Linux implementation allows wake-one semantics using the <CODE>TASK_EXCLUSIVE</CODE> flag.
With waitqueues, you can either use a well-known queue and then simply
<CODE>sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout</CODE>,
or you can define your own waitqueue and use <CODE>add/remove_wait_queue</CODE> to add and
remove yourself from it and <CODE>wake_up/wake_up_interruptible</CODE> to wake up
when needed.
<P>An example of the first usage of waitqueues is the interaction between the page
allocator (in <CODE>mm/page_alloc.c:__alloc_pages()</CODE>) and the <CODE>kswapd</CODE> kernel daemon (in
<CODE>mm/vmscan.c:kswapd()</CODE>), by means of the wait queue <CODE>kswapd_wait</CODE>, declared in
<CODE>mm/vmscan.c</CODE>; the <CODE>kswapd</CODE> daemon sleeps on this queue, and it is woken up
whenever the page allocator needs to free up some pages.
<P>An example of autonomous waitqueue usage is the interaction between a
user process requesting data via the <B>read(2)</B> system call and the kernel running in
interrupt context to supply the data. An interrupt handler might look
like (simplified from <CODE>drivers/char/rtc.c:rtc_interrupt()</CODE>):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);

void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        spin_lock(&rtc_lock);
        rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS);
        spin_unlock(&rtc_lock);
        wake_up_interruptible(&rtc_wait);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>So, the interrupt handler obtains the data by reading from some
device-specific I/O port (the <CODE>CMOS_READ()</CODE> macro turns into a couple of <CODE>outb/inb</CODE> instructions) and
then wakes up whoever is sleeping on the <CODE>rtc_wait</CODE> wait queue.
<P>Now, the <B>read(2)</B> system call could be implemented as:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
ssize_t rtc_read(struct file *file, char *buf, size_t count, loff_t *ppos)
{
        DECLARE_WAITQUEUE(wait, current);
        unsigned long data;
        ssize_t retval;

        add_wait_queue(&rtc_wait, &wait);
        current->state = TASK_INTERRUPTIBLE;
        do {
                spin_lock_irq(&rtc_lock);
                data = rtc_irq_data;
                rtc_irq_data = 0;
                spin_unlock_irq(&rtc_lock);

                if (data != 0)
                        break;

                if (file->f_flags & O_NONBLOCK) {
                        retval = -EAGAIN;
                        goto out;
                }
                if (signal_pending(current)) {
                        retval = -ERESTARTSYS;
                        goto out;
                }
                schedule();
        } while (1);
        retval = put_user(data, (unsigned long *)buf);
        if (!retval)
                retval = sizeof(unsigned long);

out:
        current->state = TASK_RUNNING;
        remove_wait_queue(&rtc_wait, &wait);
        return retval;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>What happens in <CODE>rtc_read()</CODE> is this:
<P>
<OL>
<LI> We declare a wait queue element pointing to the current process context.
</LI>
<LI> We add this element to the <CODE>rtc_wait</CODE> waitqueue.
</LI>
<LI> We mark the current context as <CODE>TASK_INTERRUPTIBLE</CODE>, which means it will not
be rescheduled after the next time it sleeps.
</LI>
<LI> We check if there is data available; if there is, we break out,
copy the data to the user buffer, mark ourselves as <CODE>TASK_RUNNING</CODE>, remove
ourselves from the wait queue and return.
</LI>
<LI> If there is no data yet, we check whether the user specified non-blocking I/O
and if so we fail with <CODE>EAGAIN</CODE> (which is the same as <CODE>EWOULDBLOCK</CODE>).
</LI>
<LI> We also check if a signal is pending and if so inform the "higher
layers" to restart the system call if necessary. By "if necessary"
I mean the details of signal disposition as specified in the <B>sigaction(2)</B>
system call.
</LI>
<LI> Then we "switch out", i.e. fall asleep, until woken up by the
interrupt handler. If we didn't mark ourselves as <CODE>TASK_INTERRUPTIBLE</CODE>
then the scheduler could schedule us sooner than when the data is
available, thus causing unneeded processing.</LI>
</OL>
<P>It is also worth pointing out that, using wait queues, it is rather easy to
implement the <B>poll(2)</B> system call:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static unsigned int rtc_poll(struct file *file, poll_table *wait)
{
        unsigned long l;

        poll_wait(file, &rtc_wait, wait);

        spin_lock_irq(&rtc_lock);
        l = rtc_irq_data;
        spin_unlock_irq(&rtc_lock);

        if (l != 0)
                return POLLIN | POLLRDNORM;
        return 0;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>All the work is done by the device-independent function <CODE>poll_wait()</CODE>, which does
the necessary waitqueue manipulations; all we need to do is point it to the
waitqueue which is woken up by our device-specific interrupt handler.
<P>
<H2><A NAME="ss2.6">2.6 Kernel Timers</A>
</H2>

<P>
<P>Now let us turn our attention to kernel timers. Kernel timers are used to
dispatch execution of a particular function (called a 'timer handler') at a
specified time in the future. The main data structure is <CODE>struct timer_list</CODE>,
declared in <CODE>include/linux/timer.h</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
struct timer_list {
        struct list_head list;
        unsigned long expires;
        unsigned long data;
        void (*function)(unsigned long);
        volatile int running;
};
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The <CODE>list</CODE> field is for linking into the internal list, protected by the
<CODE>timerlist_lock</CODE> spinlock. The <CODE>expires</CODE> field is the value of <CODE>jiffies</CODE> when
the <CODE>function</CODE> handler should be invoked with <CODE>data</CODE> passed as a parameter.
The <CODE>running</CODE> field is used on SMP to test if the timer handler is currently
running on another CPU.
<P>The functions <CODE>add_timer()</CODE> and <CODE>del_timer()</CODE> add and remove a given timer from the
list. When a timer expires, it is removed automatically. Before a timer is
used, it MUST be initialised by means of the <CODE>init_timer()</CODE> function, and before it
is added, the fields <CODE>function</CODE> and <CODE>expires</CODE> must be set.
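<P>Putting these rules together, typical timer usage looks like the following
(an illustrative sketch with made-up names):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static struct timer_list my_timer;

static void my_timer_handler(unsigned long data)
{
        /* runs when 'expires' is reached; may rearm itself with
         * add_timer() if periodic behaviour is wanted */
        printk("timer fired, data=%lu\n", data);
}

init_timer(&my_timer);                  /* mandatory before first use */
my_timer.function = my_timer_handler;
my_timer.data     = 123;
my_timer.expires  = jiffies + HZ;       /* about one second from now */
add_timer(&my_timer);

/* del_timer(&my_timer) cancels it if it has not yet expired */
</PRE>
<HR>
</CODE></BLOCKQUOTE>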
<P>
<H2><A NAME="ss2.7">2.7 Bottom Halves</A>
</H2>

<P>
<P>Sometimes it is reasonable to split the amount of work to be performed inside
an interrupt handler into immediate work (e.g. acknowledging the interrupt,
updating the stats etc.) and work which can be postponed until later, when
interrupts are enabled (e.g. to do some postprocessing on data, wake up
processes waiting for this data, etc).
<P>Bottom halves are the oldest mechanism for deferred execution of kernel
tasks and have been available since Linux 1.x. In Linux 2.0, a new mechanism
was added, called 'task queues', which will be the subject of the next section.
<P>Bottom halves are serialised by the <CODE>global_bh_lock</CODE> spinlock, i.e.
there can only be one bottom half running on any CPU at a time. However,
when attempting to execute the handler, if <CODE>global_bh_lock</CODE> is not available,
the bottom half is marked (i.e. scheduled) for execution - so processing can
continue, as opposed to a busy loop on <CODE>global_bh_lock</CODE>.
<P>There can only be 32 bottom halves registered in total.
The functions required to manipulate bottom halves are as follows (all
exported to modules):
<P>
<UL>
<LI> <CODE>void init_bh(int nr, void (*routine)(void))</CODE>: installs a bottom half
handler pointed to by the <CODE>routine</CODE> argument into slot <CODE>nr</CODE>. The slot
ought to be enumerated in <CODE>include/linux/interrupt.h</CODE> in the form
<CODE>XXXX_BH</CODE>, e.g. <CODE>TIMER_BH</CODE> or <CODE>TQUEUE_BH</CODE>. Typically, a subsystem's
initialisation routine (<CODE>init_module()</CODE> for modules) installs the
required bottom half using this function.
</LI>
<LI> <CODE>void remove_bh(int nr)</CODE>: does the opposite of <CODE>init_bh()</CODE>, i.e.
de-installs the bottom half installed at slot <CODE>nr</CODE>. There is no error
checking performed there, so, for example, <CODE>remove_bh(32)</CODE> will
panic/oops the system. Typically, a subsystem's cleanup routine
(<CODE>cleanup_module()</CODE> for modules) uses this function to free up the slot
so that it can later be reused by some other subsystem. (TODO: wouldn't it
be nice to have <CODE>/proc/bottom_halves</CODE> list all registered bottom
halves on the system? That means <CODE>global_bh_lock</CODE> must be made
read/write, obviously)
</LI>
<LI> <CODE>void mark_bh(int nr)</CODE>: marks the bottom half in slot <CODE>nr</CODE> for execution. Typically,
an interrupt handler will mark its bottom half (hence the name!) for
execution at a "safer time". A sketch of the whole lifecycle follows this
list.
</LI>
</UL>
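<P>Here is how a hypothetical driver would use this API (it assumes a slot
<CODE>FOO_BH</CODE> has been added to the fixed enumeration in
<CODE>include/linux/interrupt.h</CODE> - no such slot exists in reality):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static void foo_bh_handler(void)
{
        /* the deferred part of the interrupt work */
}

/* in init_module(): */
init_bh(FOO_BH, foo_bh_handler);

/* in the interrupt handler: */
mark_bh(FOO_BH);        /* schedule foo_bh_handler() for a safer time */

/* in cleanup_module(): */
remove_bh(FOO_BH);
</PRE>
<HR>
</CODE></BLOCKQUOTE>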
<P>Bottom halves are globally locked tasklets, so the question "when are bottom
half handlers executed?" is really "when are tasklets executed?". And the
answer is, in two places: a) on each <CODE>schedule()</CODE> and b) on each
interrupt/syscall return path in <CODE>entry.S</CODE> (TODO: therefore, the <CODE>schedule()</CODE>
case is really boring - it is like adding yet another very very slow interrupt,
why not get rid of the <CODE>handle_softirq</CODE> label from <CODE>schedule()</CODE> altogether?).
<P>
<P>
<H2><A NAME="ss2.8">2.8 Task Queues</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Task queues can be though of as a dynamic extension to old bottom halves. In
|
|
fact, in the source code they are sometimes referred to as "new" bottom
|
|
halves. More specifically, the old bottom halves discussed in previous
|
|
section have these limitations:
|
|
<P>
|
|
<OL>
|
|
<LI> There are only a fixed number (32) of them.
|
|
</LI>
|
|
<LI> Each bottom half can only be associated with one handler function.
|
|
</LI>
|
|
<LI> Bottom halves are consumed with a spinlock held so they cannot block.</LI>
|
|
</OL>
|
|
<P>So, with task queues, an arbitrary number of functions can be chained and
processed one after another at a later time. One creates a new task queue
using the <CODE>DECLARE_TASK_QUEUE()</CODE> macro and queues a task onto it using
the <CODE>queue_task()</CODE> function. The task queue can then be processed using
<CODE>run_task_queue()</CODE>. Instead of creating your own task queue (and
having to consume it manually) you can use one of Linux' predefined
task queues, which are consumed at well-known points:
<P>
<OL>
<LI> <B>tq_timer</B>: the timer task queue, run on each timer interrupt
and when releasing a tty device (closing or releasing a half-opened
terminal device). Since the timer handler runs in interrupt context,
the <CODE>tq_timer</CODE> tasks also run in interrupt context and thus cannot block.
</LI>
<LI> <B>tq_scheduler</B>: the scheduler task queue, consumed by the scheduler (and also
when closing tty devices, like <CODE>tq_timer</CODE>). Since the scheduler executes
in the context of the process being re-scheduled, the <CODE>tq_scheduler</CODE>
tasks can do anything they like, i.e. block, use process context data
(but why would they want to), etc.
</LI>
<LI> <B>tq_immediate</B>: this is really the bottom half <CODE>IMMEDIATE_BH</CODE>, so
drivers can <CODE>queue_task(task, &tq_immediate)</CODE> and then
<CODE>mark_bh(IMMEDIATE_BH)</CODE> to be consumed in interrupt context (a sketch
follows this list).
</LI>
<LI> <B>tq_disk</B>: used by low level block device access (and RAID) to start
the actual requests. This task queue is exported to modules but shouldn't
be used except for the special purposes which it was designed for.</LI>
</OL>
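<P>For instance, a driver deferring work to interrupt-safe context via
<CODE>tq_immediate</CODE> might do the following (an illustrative sketch with made-up
names):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static void my_deferred_work(void *data)
{
        /* postponed part of the interrupt processing */
}

static struct tq_struct my_task = {
        routine: my_deferred_work,
        data:    NULL,
};

/* in the interrupt handler: */
queue_task(&my_task, &tq_immediate);
mark_bh(IMMEDIATE_BH);          /* consume the queue at a safer time */
</PRE>
<HR>
</CODE></BLOCKQUOTE>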
<P>Unless a driver uses its own task queues, it does not need to call
<CODE>run_task_queue()</CODE> to process the queue, except under the circumstances explained
below.
<P>The reason <CODE>tq_timer/tq_scheduler</CODE> task queues are consumed not only in the
usual places but elsewhere (closing a tty device is but one example) becomes
clear if one remembers that the driver can schedule tasks on the queue, and these tasks
only make sense while a particular instance of the device is still valid
- which usually means until the application closes it. So, the driver may
need to call <CODE>run_task_queue()</CODE> to flush the tasks it (and anyone else) has
put on the queue, because allowing them to run at a later time may make no
sense - i.e. the relevant data structures may have been freed/reused by a
different instance. This is the reason you see <CODE>run_task_queue()</CODE> on <CODE>tq_timer</CODE>
and <CODE>tq_scheduler</CODE> in places other than the timer interrupt and <CODE>schedule()</CODE>
respectively.
<P>
<H2><A NAME="ss2.9">2.9 Tasklets</A>
</H2>

<P>
<P>Not yet, will be in future revision.
<P>
<H2><A NAME="ss2.10">2.10 Softirqs</A>
</H2>

<P>
<P>Not yet, will be in future revision.
<P>
<H2><A NAME="ss2.11">2.11 How System Calls Are Implemented on i386 Architecture?</A>
</H2>

<P>
<P>There are two mechanisms under Linux for implementing system calls:
<P>
<UL>
<LI> lcall7/lcall27 call gates;</LI>
<LI> int 0x80 software interrupt.</LI>
</UL>
<P>Native Linux programs use int 0x80, whilst binaries from foreign flavours
of UNIX (Solaris, UnixWare 7 etc.) use the lcall7 mechanism. The name 'lcall7' is
historically misleading because it also covers lcall27 (e.g. Solaris/x86), but
the handler function is called lcall7_func.
<P>When the system boots, the function <CODE>arch/i386/kernel/traps.c:trap_init()</CODE> is
called, which sets up the IDT so that vector 0x80 (of type 15, dpl 3) points to
the address of the system_call entry from <CODE>arch/i386/kernel/entry.S</CODE>.
<P>When a userspace application makes a system call, the arguments are passed via registers
and the application executes the 'int 0x80' instruction. This causes a trap into
kernel mode and the processor jumps to the system_call entry point in <CODE>entry.S</CODE>.
What this does is:
<P>
<OL>
<LI> Save registers.
</LI>
<LI> Set %ds and %es to KERNEL_DS, so that all data (and extra segment)
references are made in kernel address space.
</LI>
<LI> If the value of %eax is greater than or equal to <CODE>NR_syscalls</CODE> (currently 256),
fail with <CODE>ENOSYS</CODE> error.
</LI>
<LI> If the task is being ptraced (<CODE>tsk->ptrace & PF_TRACESYS</CODE>), do special
processing. This is to support programs like strace (an analogue of
SVR4 <B>truss(1)</B>) or debuggers.
</LI>
<LI> Call <CODE>sys_call_table+4*(syscall_number from %eax)</CODE>. This table is
initialised in the same file (<CODE>arch/i386/kernel/entry.S</CODE>) to point to
individual system call handlers, which under Linux are (usually)
prefixed with <CODE>sys_</CODE>, e.g. <CODE>sys_open</CODE>, <CODE>sys_exit</CODE>, etc. These C system
call handlers will find their arguments on the stack where <CODE>SAVE_ALL</CODE>
stored them.
</LI>
<LI> Enter the 'system call return path'. This is a separate label because it
is used not only by int 0x80 but also by lcall7, lcall27. It is
concerned with handling tasklets (including bottom halves), checking
if a <CODE>schedule()</CODE> is needed (<CODE>tsk->need_resched != 0</CODE>), checking if there
are signals pending and if so handling them.</LI>
</OL>
<P>Linux supports up to 6 arguments for system calls. They are passed in
%ebx, %ecx, %edx, %esi, %edi (and %ebp used temporarily, see <CODE>_syscall6()</CODE> in
<CODE>asm-i386/unistd.h</CODE>). The system call number is passed via %eax.
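<P>To make the register convention concrete, here is an illustrative userspace
sketch of invoking <B>getpid(2)</B> (system call number 20 on i386) directly via
int 0x80, in the spirit of the <CODE>_syscallN()</CODE> macros (error/errno handling
omitted):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
int my_getpid(void)
{
        int res;

        __asm__ volatile ("int $0x80"
                : "=a" (res)    /* return value comes back in %eax */
                : "0" (20));    /* __NR_getpid goes in via %eax */
        return res;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>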
<P>
<H2><A NAME="ss2.12">2.12 Atomic Operations</A>
</H2>

<P>
<P>There are two types of atomic operations: bitmaps and <CODE>atomic_t</CODE>. Bitmaps are
very convenient for maintaining a concept of "allocated" or "free" units
from some large collection where each unit is identified by some number, for
example free inodes or free blocks. They are also widely used for simple
locking, for example to provide exclusive access to open a device. An example
of this can be found in <CODE>arch/i386/kernel/microcode.c</CODE>:
<P>
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
/*
 *  Bits in microcode_status. (31 bits of room for future expansion)
 */
#define MICROCODE_IS_OPEN       0       /* set if device is in use */

static unsigned long microcode_status;
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>There is no need to initialise <CODE>microcode_status</CODE> to 0, as BSS is explicitly
zero-cleared under Linux.
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
/*
 * We enforce only one user at a time here with open/close.
 */
static int microcode_open(struct inode *inode, struct file *file)
{
        if (!capable(CAP_SYS_RAWIO))
                return -EPERM;

        /* one at a time, please */
        if (test_and_set_bit(MICROCODE_IS_OPEN, &microcode_status))
                return -EBUSY;

        MOD_INC_USE_COUNT;
        return 0;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The operations on bitmaps are:
<P>
<UL>
<LI> <B>void set_bit(int nr, volatile void *addr)</B>: set bit <CODE>nr</CODE>
in the bitmap pointed to by <CODE>addr</CODE>.
</LI>
<LI> <B>void clear_bit(int nr, volatile void *addr)</B>: clear bit
<CODE>nr</CODE> in the bitmap pointed to by <CODE>addr</CODE>.
</LI>
<LI> <B>void change_bit(int nr, volatile void *addr)</B>: toggle bit
<CODE>nr</CODE> (if set, clear; if clear, set) in the bitmap pointed to by <CODE>addr</CODE>.
</LI>
<LI> <B>int test_and_set_bit(int nr, volatile void *addr)</B>:
atomically set bit <CODE>nr</CODE> and return the old bit value.
</LI>
<LI> <B>int test_and_clear_bit(int nr, volatile void *addr)</B>:
atomically clear bit <CODE>nr</CODE> and return the old bit value.
</LI>
<LI> <B>int test_and_change_bit(int nr, volatile void *addr)</B>:
atomically toggle bit <CODE>nr</CODE> and return the old bit value.</LI>
</UL>
<P>These operations use the <CODE>LOCK_PREFIX</CODE> macro, which on SMP kernels evaluates to
the bus lock instruction prefix and to nothing on UP. This guarantees atomicity
of access in an SMP environment.
<P>Sometimes bit manipulations are not convenient and instead we need to perform
arithmetic operations - add, subtract, increment and decrement. The typical cases
are reference counts (e.g. for inodes; a sketch of this pattern follows the
list below). This facility is provided by the
<CODE>atomic_t</CODE> data type and the following operations:
<P>
<UL>
<LI> <B>atomic_read(&v)</B>: read the value of the <CODE>atomic_t</CODE> variable <CODE>v</CODE>.
</LI>
<LI> <B>atomic_set(&v, i)</B>: set the value of the <CODE>atomic_t</CODE> variable
<CODE>v</CODE> to integer <CODE>i</CODE>.
</LI>
<LI> <B>void atomic_add(int i, volatile atomic_t *v)</B>: add integer
<CODE>i</CODE> to the value of the atomic variable pointed to by <CODE>v</CODE>.
</LI>
<LI> <B>void atomic_sub(int i, volatile atomic_t *v)</B>: subtract
integer <CODE>i</CODE> from the value of the atomic variable pointed to by <CODE>v</CODE>.
</LI>
<LI> <B>int atomic_sub_and_test(int i, volatile atomic_t *v)</B>:
subtract integer <CODE>i</CODE> from the value of the atomic variable pointed to by
<CODE>v</CODE>; return 1 if the new value is 0, return 0 otherwise.
</LI>
<LI> <B>void atomic_inc(volatile atomic_t *v)</B>: increment the value
by 1.
</LI>
<LI> <B>void atomic_dec(volatile atomic_t *v)</B>: decrement the value
by 1.
</LI>
<LI> <B>int atomic_dec_and_test(volatile atomic_t *v)</B>: decrement
the value; return 1 if the new value is 0, return 0 otherwise.
</LI>
<LI> <B>int atomic_inc_and_test(volatile atomic_t *v)</B>: increment
the value; return 1 if the new value is 0, return 0 otherwise.
</LI>
<LI> <B>int atomic_add_negative(int i, volatile atomic_t *v)</B>: add
the value of <CODE>i</CODE> to <CODE>v</CODE> and return 1 if the result is negative. Return
0 if the result is greater than or equal to 0. This operation is used
for implementing semaphores.</LI>
</UL>
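<P>The canonical reference-count pattern mentioned above looks like this
(an illustrative sketch; <CODE>free_the_object()</CODE> is a made-up function):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static atomic_t my_refcount = ATOMIC_INIT(1);

void my_get(void)
{
        atomic_inc(&my_refcount);
}

void my_put(void)
{
        /* exactly one caller sees the count reach 0,
         * so exactly one caller frees the object */
        if (atomic_dec_and_test(&my_refcount))
                free_the_object();
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>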
<P>
<H2><A NAME="ss2.13">2.13 Spinlocks, Read-write Spinlocks and Big-Reader Spinlocks</A>
</H2>

<P>
<P>Since the early days of Linux support (early 90s, this century),
developers have been faced with the classical problem of accessing shared data
between different types of context (user process vs
interrupt) and between different instances of the same context on multiple cpus.
<P>SMP support was added to Linux in 1.3.42 on 15 Nov 1995 (the original patch
was made to 1.3.37 in October the same year).
<P>If the critical region of code may be executed by either process context
or interrupt context, then the way to protect it using <CODE>cli/sti</CODE> instructions
on UP is:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
unsigned long flags;

save_flags(flags);
cli();
/* critical code */
restore_flags(flags);
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>While this is ok on UP, it obviously is of no use on SMP because the same
code sequence may be executed simultaneously on another cpu, and while <CODE>cli()</CODE>
provides protection against races with interrupt context on each CPU individually, it
provides no protection at all against races between contexts running on different
CPUs. This is where spinlocks come in.
<P>There are three types of spinlocks: vanilla (basic), read-write and
|
|
big-reader spinlocks. Read-write spinlocks should be used when there is a
|
|
natural tendency of 'many readers and few writers'. Example of this is
|
|
access to the list of registered filesystems (see <CODE>fs/super.c</CODE>). The list is
|
|
guarded by the <CODE>file_systems_lock</CODE> read-write spinlock because one needs exclusive
|
|
access only when registering/unregistering a filesystem, but any process can
|
|
read the file <CODE>/proc/filesystems</CODE> or use the <B>sysfs(2)</B> system call to force a
|
|
read-only scan of the file_systems list. This makes it sensible to use
|
|
read-write spinlocks. With read-write spinlocks, one can have multiple
|
|
readers at a time but only one writer and there can be no readers while
|
|
there is
|
|
a writer. Btw, it would be nice if new readers would not get a lock while
|
|
there
|
|
is a writer trying to get a lock, i.e. if Linux could correctly deal with
|
|
the issue of potential writer starvation by multiple readers.
|
|
This would mean that readers must be blocked while there is a writer
|
|
attempting to get the lock. This is not
|
|
currently the case and it is not obvious whether this should be fixed - the
|
|
argument to the contrary is - readers usually take the lock for a very short
|
|
time so should they really be starved while the writer takes the lock for
|
|
potentially longer periods?
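<P>A minimal sketch of read-write spinlock usage, modelled on the
<CODE>file_systems_lock</CODE> case above (the lock and the data it guards are
made up for illustration):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
rwlock_t my_list_lock = RW_LOCK_UNLOCKED;       /* hypothetical lock */

/* readers: many may hold the lock at once */
read_lock(&my_list_lock);
/* walk the list without modifying it */
read_unlock(&my_list_lock);

/* writers: exclude both readers and other writers */
write_lock(&my_list_lock);
/* insert or remove list elements */
write_unlock(&my_list_lock);
</PRE>
<HR>
</CODE></BLOCKQUOTE>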
<P>Big-reader spinlocks are a form of read-write spinlocks
heavily optimised for very light read access, with a penalty for writes.
There is a limited number of big-reader spinlocks - currently only two exist,
of which one is used only on sparc64 (global irq) and the other is used for
networking. In all other cases, where the access pattern does not fit either
of these two scenarios, one should use basic spinlocks. You cannot block
while holding any kind of spinlock.
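<P>As a hedged sketch of the big-reader interface (see
<CODE>include/linux/brlock.h</CODE>; the lock is named by an enum constant rather
than by a lock variable, and the networking lock is used here purely as an
illustration):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
/* read side: cheap, touches only the current CPU's lock */
br_read_lock(BR_NETPROTO_LOCK);
/* read the read-mostly data */
br_read_unlock(BR_NETPROTO_LOCK);

/* write side: expensive, must acquire every CPU's lock */
br_write_lock(BR_NETPROTO_LOCK);
/* modify the data */
br_write_unlock(BR_NETPROTO_LOCK);
</PRE>
<HR>
</CODE></BLOCKQUOTE>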
<P>Spinlocks come in three flavours: plain, <CODE>_irq()</CODE> and <CODE>_bh()</CODE>.
<P>
<OL>
<LI> Plain <CODE>spin_lock()/spin_unlock()</CODE>: if you know the interrupts are always
disabled, or if you do not race with interrupt context (e.g. because you
are already within an interrupt handler), then you can use this one. It
does not touch interrupt state on the current CPU.
</LI>
<LI> <CODE>spin_lock_irq()/spin_unlock_irq()</CODE>: if you know that interrupts are
always enabled then you can use this version, which simply disables
(on lock) and re-enables (on unlock) interrupts on the current CPU.
For example, <CODE>rtc_read()</CODE> uses
<CODE>spin_lock_irq(&rtc_lock)</CODE> (interrupts are always enabled inside
<CODE>read()</CODE>) whilst <CODE>rtc_interrupt()</CODE> uses
<CODE>spin_lock(&rtc_lock)</CODE> (interrupts are always disabled inside an
interrupt handler). Note that <CODE>rtc_read()</CODE> uses <CODE>spin_lock_irq()</CODE> and not
the more generic <CODE>spin_lock_irqsave()</CODE> because on entry to any system
call interrupts are always enabled.
</LI>
<LI> <CODE>spin_lock_irqsave()/spin_unlock_irqrestore()</CODE>: the strongest form,
to be used when the interrupt state is not known - but only if
interrupts matter at all, i.e. there is no point in using it if
our interrupt handlers don't execute any critical code.</LI>
</OL>
<P>The reason you cannot use plain <CODE>spin_lock()</CODE> if you race against
interrupt handlers is that if you take it and then an interrupt comes in on
the same CPU, the interrupt handler will busy-wait for the lock forever:
the lock holder, having been interrupted, will not continue until the
interrupt handler returns.
<P>The most common usage of a spinlock is to access a data structure shared
between user process context and interrupt handlers:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

my_ioctl()
{
        spin_lock_irq(&my_lock);
        /* critical section */
        spin_unlock_irq(&my_lock);
}

my_irq_handler()
{
        spin_lock(&my_lock);
        /* critical section */
        spin_unlock(&my_lock);
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>There are a couple of things to note about this example:
<P>
<OL>
<LI> The process context, represented here by a typical driver method -
<CODE>ioctl()</CODE> (arguments and return values omitted for clarity), must
use <CODE>spin_lock_irq()</CODE> because it knows that interrupts are always
enabled while executing the device <CODE>ioctl()</CODE> method.
</LI>
<LI> Interrupt context, represented here by <CODE>my_irq_handler()</CODE> (again
arguments omitted for clarity), can use the plain <CODE>spin_lock()</CODE> form because
interrupts are disabled inside an interrupt handler.</LI>
</OL>
<P>
<H2><A NAME="ss2.14">2.14 Semaphores and read/write Semaphores</A>
</H2>

<P>
<P>Sometimes, while accessing a shared data structure, one must perform
operations that can block, for example copying data to userspace. The locking
primitive available for such scenarios under Linux is called a semaphore.
There are two types of semaphores: basic and read-write semaphores. Depending
on the initial value of the semaphore, they can be used for either mutual
exclusion (initial value of 1) or to provide a more sophisticated type of
access.
<P>Read-write semaphores differ from basic semaphores in the same way as
read-write spinlocks differ from basic spinlocks: one can have multiple
readers at a time but only one writer, and there can be no readers while there
are writers - i.e. the writer blocks all readers, and new readers block while
a writer is waiting.
<P>Also, basic semaphores can be interruptible - just use the operations
<CODE>down/up_interruptible()</CODE> instead of the plain <CODE>down()/up()</CODE> and check the
value returned from <CODE>down_interruptible()</CODE>: it will be nonzero if the
operation was interrupted.
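<P>A minimal sketch of the interruptible form (the semaphore name is made up
for illustration):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static DECLARE_MUTEX(my_sem);   /* hypothetical semaphore, initial value 1 */

if (down_interruptible(&my_sem))
        return -ERESTARTSYS;    /* sleep was interrupted by a signal */
/* critical section, may block */
up(&my_sem);
</PRE>
<HR>
</CODE></BLOCKQUOTE>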
<P>Using semaphores for mutual exclusion is ideal in situations where a
critical code section may call by reference unknown functions registered by
other subsystems/modules, i.e. the caller cannot know a priori whether the
function blocks or not.
<P>A simple example of semaphore usage is in <CODE>kernel/sys.c</CODE>, the
implementation of the <B>gethostname(2)/sethostname(2)</B> system calls.
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
asmlinkage long sys_sethostname(char *name, int len)
{
        int errno;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;
        if (len < 0 || len > __NEW_UTS_LEN)
                return -EINVAL;
        down_write(&uts_sem);
        errno = -EFAULT;
        if (!copy_from_user(system_utsname.nodename, name, len)) {
                system_utsname.nodename[len] = 0;
                errno = 0;
        }
        up_write(&uts_sem);
        return errno;
}

asmlinkage long sys_gethostname(char *name, int len)
{
        int i, errno;

        if (len < 0)
                return -EINVAL;
        down_read(&uts_sem);
        i = 1 + strlen(system_utsname.nodename);
        if (i > len)
                i = len;
        errno = 0;
        if (copy_to_user(name, system_utsname.nodename, i))
                errno = -EFAULT;
        up_read(&uts_sem);
        return errno;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>The points to note about this example are:
<P>
<OL>
<LI> The functions may block while copying data from/to userspace in
<CODE>copy_from_user()/copy_to_user()</CODE>. Therefore they cannot use any form
of spinlock here.
</LI>
<LI> The semaphore type chosen is read-write as opposed to basic, because
there may be lots of concurrent <B>gethostname(2)</B> requests which need not
be mutually exclusive.</LI>
</OL>
<P>Although the Linux implementation of semaphores and read-write semaphores
is very sophisticated, one can think of possible scenarios which are not yet
implemented; for example, there is no concept of interruptible read-write
semaphores. This is obviously because there are no real-world situations
which require these exotic flavours of the primitives.
<P>
<H2><A NAME="ss2.15">2.15 Kernel Support for Loading Modules</A>
</H2>

<P>
<P>Linux is a monolithic operating system and, despite all the modern hype
about some "advantages" offered by operating systems based on micro-kernel
design, the truth remains (quoting Linus Torvalds himself):
<P>
<BLOCKQUOTE><CODE>
... message passing as the fundamental operation of the OS is just an
exercise in computer science masturbation. It may feel good, but you
don't actually get anything DONE.
</CODE></BLOCKQUOTE>
<P>Therefore, Linux is and will always be based on a monolithic design, which
means that all subsystems run in the same privileged mode and share the same
address space; communication between them is achieved by the usual C function
call means.
<P>However, although separating kernel functionality into separate "processes",
as is done in micro-kernels, is definitely a bad idea, separating it into
dynamically loadable kernel modules, loaded on demand, is desirable in some
circumstances (e.g. on machines with low memory, or for installation kernels
which could otherwise contain ISA auto-probing device drivers that are
mutually exclusive). The decision whether to include support for loadable
modules is made at compile time and is determined by the <CODE>CONFIG_MODULES</CODE>
option. Support for module autoloading via the <CODE>request_module()</CODE> mechanism is
a separate compilation option (<CODE>CONFIG_KMOD</CODE>).
<P>The following functionality can be implemented as loadable modules under
Linux:
<P>
<OL>
<LI> Character and block device drivers, including misc device drivers.
</LI>
<LI> Terminal line disciplines.
</LI>
<LI> Virtual (regular) files in <CODE>/proc</CODE> and in devfs (e.g. <CODE>/dev/cpu/microcode</CODE>
vs <CODE>/dev/misc/microcode</CODE>).
</LI>
<LI> Binary file formats (e.g. ELF, aout, etc).
</LI>
<LI> Execution domains (e.g. Linux, UnixWare7, Solaris, etc).
</LI>
<LI> Filesystems.
</LI>
<LI> System V IPC.</LI>
</OL>
<P>There are a few things that cannot be implemented as modules under Linux
(probably because it makes no sense for them to be modularised):
<P>
<OL>
<LI> Scheduling algorithms.
</LI>
<LI> VM policies.
</LI>
<LI> Buffer cache, page cache and other caches.</LI>
</OL>
<P>Linux provides several system calls to assist in loading modules:
<P>
<OL>
<LI><CODE>caddr_t create_module(const char *name, size_t size)</CODE>: allocates
<CODE>size</CODE> bytes using <CODE>vmalloc()</CODE> and maps a module structure at the
beginning thereof. This new module is then linked into the list headed
by <CODE>module_list</CODE>. Only a process with <CODE>CAP_SYS_MODULE</CODE> can invoke this
system call; others will get <CODE>EPERM</CODE> returned.
</LI>
<LI><CODE>long init_module(const char *name, struct module *image)</CODE>: loads the
relocated module image and causes the module's initialisation routine
to be invoked. Only a process with <CODE>CAP_SYS_MODULE</CODE> can invoke this
system call; others will get <CODE>EPERM</CODE> returned.
</LI>
<LI><CODE>long delete_module(const char *name)</CODE>: attempts to unload the module.
If <CODE>name == NULL</CODE>, an attempt is made to unload all unused modules.
</LI>
<LI><CODE>long query_module(const char *name, int which, void *buf, size_t bufsize, size_t *ret)</CODE>: returns information about a module
(or about all modules).</LI>
</OL>
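<P>As a hedged illustration (not taken from any real program), a userspace
program of that era could use <B>query_module(2)</B> to list the currently
loaded modules roughly like this - the buffer size here is arbitrary, and a
too-small buffer makes the call fail with <CODE>ENOSPC</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include <stdio.h>
#include <string.h>
#include <linux/module.h>

int main(void)
{
        char buf[4096];         /* arbitrary size for this sketch */
        size_t ret, i;
        char *p = buf;

        if (query_module(NULL, QM_MODULES, buf, sizeof(buf), &ret))
                return 1;       /* error (e.g. ENOSPC: buffer too small) */
        for (i = 0; i < ret; i++, p += strlen(p) + 1)
                printf("%s\n", p);      /* names are NUL-separated */
        return 0;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>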
<P>The command interface available to users consists of:
<P>
<UL>
<LI><B>insmod</B>: insert a single module.
</LI>
<LI><B>modprobe</B>: insert a module including all other modules it depends
on.
</LI>
<LI><B>rmmod</B>: remove a module.
</LI>
<LI><B>modinfo</B>: print some information about a module, e.g. author,
description, parameters the module accepts, etc.</LI>
</UL>
<P>Apart from being able to load a module manually using either <B>insmod</B> or <B>modprobe</B>,
it is also possible to have the module inserted automatically by the kernel
when a particular functionality is required. The kernel interface for this
is the function called <CODE>request_module(name)</CODE>, which is exported to modules,
so that modules can load other modules as well. Internally, <CODE>request_module(name)</CODE>
creates a kernel thread which execs the userspace command
<B>modprobe -s -k module_name</B>, using the standard <CODE>exec_usermodehelper()</CODE> kernel
interface (which is also exported to modules). The function returns 0 on
success; however, it is usually not worth checking the return code from
<CODE>request_module()</CODE>. Instead, the programming idiom is:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
if (check_some_feature() == NULL)
        request_module(module);
if (check_some_feature() == NULL)
        return -ENODEV;
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>For example, this is done by <CODE>fs/block_dev.c:get_blkfops()</CODE> to load the module
<CODE>block-major-N</CODE> when an attempt is made to open a block device with major <CODE>N</CODE>.
Obviously, there is no module actually called <CODE>block-major-N</CODE> (Linux developers
only choose sensible names for their modules), but the name is mapped to a proper
module name using the file <CODE>/etc/modules.conf</CODE>. However, for most well-known
major numbers (and other kinds of modules) the <B>modprobe/insmod</B> commands
know which real module to load without needing an explicit alias statement
in <CODE>/etc/modules.conf</CODE>.
<P>A good example of loading a module is inside the <B>mount(2)</B> system call. The
<B>mount(2)</B> system call accepts the filesystem type as a string, which
<CODE>fs/super.c:do_mount()</CODE> then passes on to <CODE>fs/super.c:get_fs_type()</CODE>:
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
static struct file_system_type *get_fs_type(const char *name)
{
        struct file_system_type *fs;

        read_lock(&file_systems_lock);
        fs = *(find_filesystem(name));
        if (fs && !try_inc_mod_count(fs->owner))
                fs = NULL;
        read_unlock(&file_systems_lock);
        if (!fs && (request_module(name) == 0)) {
                read_lock(&file_systems_lock);
                fs = *(find_filesystem(name));
                if (fs && !try_inc_mod_count(fs->owner))
                        fs = NULL;
                read_unlock(&file_systems_lock);
        }
        return fs;
}
</PRE>
<HR>
</CODE></BLOCKQUOTE>
<P>A few things to note in this function:
<P>
<OL>
<LI> First, we attempt to find the filesystem with the given name amongst
those already registered. This is done under the protection of
<CODE>file_systems_lock</CODE> taken for read (as we are not modifying the list
of registered filesystems).
</LI>
<LI> If such a filesystem is found, then we attempt to get a new reference
to it by trying to increment its module's hold count. This always
returns 1 for statically linked filesystems or for modules not
presently being deleted. If <CODE>try_inc_mod_count()</CODE> returned 0, then
we consider it a failure - i.e. if the module is there but is being
deleted, it is as good as if it were not there at all.
</LI>
<LI> We drop the <CODE>file_systems_lock</CODE> because what we are about to do next
(<CODE>request_module()</CODE>) is a blocking operation, and therefore we can't
hold a spinlock over it. Actually, in this specific case, we would
have to drop <CODE>file_systems_lock</CODE> anyway, even if <CODE>request_module()</CODE> were
guaranteed to be non-blocking and the module loading were executed
in the same context atomically. The reason for this is that the module's
initialisation function will try to call <CODE>register_filesystem()</CODE>, which will
take the same <CODE>file_systems_lock</CODE> read-write spinlock for write.
</LI>
<LI> If the attempt to load was successful, then we take the
<CODE>file_systems_lock</CODE> spinlock and try to locate the newly registered
filesystem in the list. Note that this is slightly wrong, because
it is in principle possible for a bug in the modprobe command to cause
it to coredump after it has successfully loaded the requested module, in
which case <CODE>request_module()</CODE> will fail even though the new filesystem is
registered, and yet <CODE>get_fs_type()</CODE> won't find it.
</LI>
<LI> If the filesystem is found and we are able to get a reference to it,
we return it. Otherwise we return NULL.</LI>
</OL>
<P>When a module is loaded into the kernel, it can refer to any symbols that
are exported as public by the kernel using the <CODE>EXPORT_SYMBOL()</CODE> macro, or by
other currently loaded modules. If the module uses symbols from another
module, it is marked as depending on that module during dependency
recalculation, achieved by running the <B>depmod -a</B> command on boot (e.g. after
installing a new kernel).
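<P>A minimal sketch of a 2.4-style module which exports a symbol for other
modules to use (the names are made up for illustration):
<P>
<BLOCKQUOTE><CODE>
<HR>
<PRE>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>

int my_shared_value;                    /* hypothetical exported symbol */
EXPORT_SYMBOL(my_shared_value);

static int __init my_module_init(void)
{
        printk(KERN_INFO "my_module loaded\n");
        return 0;                       /* nonzero would abort the load */
}

static void __exit my_module_exit(void)
{
        printk(KERN_INFO "my_module unloaded\n");
}

module_init(my_module_init);
module_exit(my_module_exit);
</PRE>
<HR>
</CODE></BLOCKQUOTE>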
<P>Usually, one must match the set of modules with the version of the kernel
interfaces they use, which under Linux simply means the "kernel version", as
there is no special kernel interface versioning mechanism in general.
However, there is a limited facility called "module versioning" or
<CODE>CONFIG_MODVERSIONS</CODE>, which makes it possible to avoid recompiling modules
when switching to a new kernel. What happens here is that the kernel symbol
table is treated differently for internal access and for access from modules.
The elements of the public (i.e. exported) part of the symbol table are built
by 32bit checksumming of the C declaration. So, in order to resolve a symbol
used by a module during loading, the loader must match the full representation
of the symbol, which includes the checksum; it will refuse to load the module
if these symbols differ. This only happens when both the kernel and the module
are compiled with module versioning enabled. If either one of them uses the
original symbol names, the loader simply tries to match the kernel version
declared by the module against the one exported by the kernel, and refuses to
load if they differ.
<P>
<HR>
<A HREF="lki-3.html">Next</A>
<A HREF="lki-1.html">Previous</A>
<A HREF="lki.html#toc2">Contents</A>
</BODY>
</HTML>