LDP/LDP/guide/docbook/lkmpg-2.6/08-SystemCalls.sgml

<sect1><title>System Calls</title>

	<indexterm><primary>system calls</primary></indexterm>

	<para>So far, the only thing we've done was to use well defined kernel mechanisms to register <filename
	role="directory">/proc</filename> files and device handlers. This is fine if you want to do something the kernel programmers
	thought you'd want, such as write a device driver. But what if you want to do something unusual, to change the behavior of the
	system in some way? Then, you're mostly on your own.</para>

	<para>This is where kernel programming gets dangerous. While writing the example below, I killed the
	<function>open()</function> system call. This meant I couldn't open any files, I couldn't run any programs, and I couldn't
	<command>shutdown</command> the computer.  I had to pull the power switch.  Luckily, no files died. To ensure you won't lose
	any files either, please run <command>sync</command> right before you do the <command>insmod</command> and the
	<command>rmmod</command>.</para>

	<indexterm><primary>sync</primary></indexterm>
	<indexterm><primary>insmod</primary></indexterm>
	<indexterm><primary>rmmod</primary></indexterm>
	<indexterm><primary>shutdown</primary></indexterm>

	<para>Forget about <filename role="directory">/proc</filename> files, forget about device files. They're just minor details.
	The <emphasis>real</emphasis> process to kernel communication mechanism, the one used by all processes, is system calls. When
	a process requests a service from the kernel (such as opening a file, forking to a new process, or requesting more memory),
	this is the mechanism used. If you want to change the behaviour of the kernel in interesting ways, this is the place to do it.
	By the way, if you want to see which system calls a program uses, run <command>strace &lt;arguments&gt;</command>.</para>

	<indexterm><primary>strace</primary></indexterm>

	<para>In general, a process is not supposed to be able to access the kernel. It can't access kernel memory and it can't call
	kernel functions. The hardware of the CPU enforces this (that's the reason why it's called `protected mode').</para>

	<indexterm><primary>interrupt 0x80</primary></indexterm>

	<para>System calls are an exception to this general rule. What happens is that the process fills the registers with the
	appropriate values and then calls a special instruction which jumps to a previously defined location in the kernel (of course,
	that location is readable by user processes, it is not writable by them). Under Intel CPUs, this is done by means of interrupt
	0x80.  The hardware knows that once you jump to this location, you are no longer running in restricted user mode, but as the
	operating system kernel --- and therefore you're allowed to do whatever you want.</para>

	<para>The location in the kernel a process can jump to is called <emphasis>system_call</emphasis>. The procedure at that
	location checks the system call number, which tells the kernel what service the process requested. Then, it looks at the table
	of system calls (<varname>sys_call_table</varname>) to see the address of the kernel function to call. Then it calls the
	function, and after it returns, does a few system checks and then return back to the process (or to a different process, if
	the process time ran out). If you want to read this code, it's at the source file
	<filename>arch/$<$architecture$>$/kernel/entry.S</filename>, after the line <function>ENTRY(system_call)</function>.</para>

	<indexterm><primary>system call</primary></indexterm>
	<indexterm><primary>ENTRY(system call)</primary></indexterm>
	<indexterm><primary>sys_call_table</primary></indexterm>
	<indexterm><primary>entry.S</primary></indexterm>

	<para>So, if we want to change the way a certain system call works, what we need to do is to write our own function to
	implement it (usually by adding a bit of our own code, and then calling the original function) and then change the pointer at
	<varname>sys_call_table</varname> to point to our function.  Because we might be removed later and we don't want to leave the
	system in an unstable state, it's important for <function>cleanup_module</function> to restore the table to its original
	state.</para>

	<para>The source code here is an example of such a kernel module. We want to `spy' on a certain user, and to
	<function>printk()</function> a message whenever that user opens a file. Towards this end, we replace the system call to open
	a file with our own function, called <function>our_sys_open</function>. This function checks the uid (user's id) of the
	current process, and if it's equal to the uid we spy on, it calls <function>printk()</function> to display the name of the
	file to be opened.  Then, either way, it calls the original <function>open()</function> function with the same parameters, to
	actually open the file.</para>

	<indexterm><primary>system call</primary><secondary>open</secondary></indexterm>

	<para>The <function>init_module</function> function replaces the appropriate location in <varname>sys_call_table</varname> and
	keeps the original pointer in a variable. The <function>cleanup_module</function> function uses that variable to restore
	everything back to normal.  This approach is dangerous, because of the possibility of two kernel modules changing the same
	system call. Imagine we have two kernel modules, A and B.  A's open system call will be A_open and B's will be B_open. Now,
	when A is inserted into the kernel, the system call is replaced with A_open, which will call the original sys_open when it's
	done. Next, B is inserted into the kernel, which replaces the system call with B_open, which will call what it thinks is the
	original system call, A_open, when it's done.</para>

	<para>Now, if B is removed first, everything will be well---it will simply restore the system call to A_open, which calls the
	original. However, if A is removed and then B is removed, the system will crash. A's removal will restore the system call to
	the original, sys_open, cutting B out of the loop.  Then, when B is removed, it will restore the system call to what
	<emphasis>it</emphasis> thinks is the original, A_open, which is no longer in memory. At first glance, it appears we could
	solve this particular problem by checking if the system call is equal to our open function and if so not changing it at all
	(so that B won't change the system call when it's removed), but that will cause an even worse problem. When A is removed, it
	sees that the system call was changed to B_open so that it is no longer pointing to A_open, so it won't restore it to sys_open
	before it is removed from memory.  Unfortunately, B_open will still try to call A_open which is no longer there, so that even
	without removing B the system would crash.</para>


	<para>Note that all the related problems make syscall stealing unfeasiable for production use. In order to keep people from
	doing potential harmful things sys_call_table is no longer exported. This means, if you want to do something more than a
	mere dry run of this example, you will have to patch your current kernel in order to have sys_call_table exported.
	In the example directory you will find a README and the patch. As you can imagine, such modifications are not to be
	taken lightly. Do not try this on valueable systems (ie systems that you do not own - or cannot restore easily).
	You'll need to get the complete sourcecode of this guide as a tarball in order to get the patch and the README.
	Depending on your kernel version, you might even need to hand apply the patch. Still here? Well, so is this chapter.
	If Wyle E. Coyote was a kernel hacker, this would be the first thing he'd try. ;)

	</para>


	<indexterm><primary>try_module_get</primary></indexterm>
	<indexterm><primary>sys_open</primary></indexterm>
	<indexterm><primary>source file</primary><secondary>syscall.c</secondary></indexterm>

<example><title>syscall.c</title><programlisting><inlinegraphic fileref="lkmpg-examples/08-SystemCalls/syscall.c" format="linespecific"/></inlinegraphic></programlisting></example>

</sect1>


<!--
vim: tw=128
-->