mirror of https://github.com/tLDP/LDP
<sect1><title>System Calls</title>
<indexterm><primary>system calls</primary></indexterm>
<para>So far, the only thing we've done is use well-defined kernel mechanisms to register <filename
role="directory">/proc</filename> files and device handlers. This is fine if you want to do something the kernel programmers
thought you'd want, such as write a device driver. But what if you want to do something unusual, to change the behavior of the
system in some way? Then, you're mostly on your own.</para>
<para>This is where kernel programming gets dangerous. While writing the example below, I killed the
<function>open()</function> system call. This meant I couldn't open any files, I couldn't run any programs, and I couldn't
<command>shutdown</command> the computer. I had to pull the power switch. Luckily, no files died. To ensure you won't lose
any files either, please run <command>sync</command> right before you do the <command>insmod</command> and the
<command>rmmod</command>.</para>
<indexterm><primary>sync</primary></indexterm>
<indexterm><primary>insmod</primary></indexterm>
<indexterm><primary>rmmod</primary></indexterm>
<indexterm><primary>shutdown</primary></indexterm>
<para>Forget about <filename role="directory">/proc</filename> files, forget about device files. They're just minor details.
The <emphasis>real</emphasis> process-to-kernel communication mechanism, the one used by all processes, is system calls. When
a process requests a service from the kernel (such as opening a file, forking to a new process, or requesting more memory),
this is the mechanism used. If you want to change the behavior of the kernel in interesting ways, this is the place to do it.
By the way, if you want to see which system calls a program uses, run <command>strace &lt;arguments&gt;</command>.</para>
<indexterm><primary>strace</primary></indexterm>
<para>In general, a process is not supposed to be able to access the kernel. It can't access kernel memory and it can't call
kernel functions. The hardware of the CPU enforces this (that's the reason why it's called `protected mode').</para>
<indexterm><primary>interrupt 0x80</primary></indexterm>
<para>System calls are an exception to this general rule. What happens is that the process fills the registers with the
appropriate values and then executes a special instruction which jumps to a previously defined location in the kernel (of
course, that location is readable by user processes, but it is not writable by them). On Intel CPUs, this is done by means of
interrupt 0x80. The hardware knows that once you jump to this location, you are no longer running in restricted user mode, but
as the operating system kernel, and therefore you're allowed to do whatever you want.</para>
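<para>As a rough user-space illustration of the mechanism described above, the sketch below asks the kernel for the process
id by system call number, using the generic <function>syscall()</function> library wrapper, which loads the number and
arguments into the registers and executes the trap instruction for us. It assumes Linux with glibc or a compatible C library.
The helper name <function>raw_getpid</function> is made up for this example.</para>

<programlisting><![CDATA[
```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid: the system call number */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: request the process id by system call number,
 * bypassing the usual getpid() library wrapper. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```
]]></programlisting>

<para>Calling <function>raw_getpid()</function> should give the same value as the ordinary <function>getpid()</function>.
Note that on 64-bit Intel systems the C library uses a different trap instruction than interrupt 0x80, but the idea is the
same.</para>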
<para>The location in the kernel a process can jump to is called <emphasis>system_call</emphasis>. The procedure at that
location checks the system call number, which tells the kernel what service the process requested. Then, it looks at the table
of system calls (<varname>sys_call_table</varname>) to find the address of the kernel function to call. Then it calls the
function, and after it returns, does a few system checks and then returns to the process (or to a different process, if
the process's time ran out). If you want to read this code, it's in the source file
<filename>arch/&lt;architecture&gt;/kernel/entry.S</filename>, after the line <function>ENTRY(system_call)</function>.</para>
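<para>The lookup described above can be sketched in plain user-space C. This is only a toy model under our own assumptions,
not the kernel's code: the table, the two services and every <function>demo_</function> name are invented for illustration.
What it shows is the shape of the dispatch: check the number, index the table, call through the pointer.</para>

<programlisting><![CDATA[
```c
#include <errno.h>    /* ENOSYS */

/* Every simulated service has the same shape: one argument in, result out. */
typedef long (*demo_call_t)(long arg);

long demo_double(long arg) { return arg * 2; }
long demo_negate(long arg) { return -arg; }

/* Stand-in for sys_call_table: call number -> handler address. */
demo_call_t demo_call_table[] = {
    demo_double,   /* call number 0 */
    demo_negate,   /* call number 1 */
};

#define DEMO_NR_CALLS (sizeof(demo_call_table) / sizeof(demo_call_table[0]))

/* Stand-in for the system_call entry point: validate the number,
 * look up the handler in the table, call it and pass the result back. */
long demo_dispatch(unsigned long nr, long arg)
{
    if (nr >= DEMO_NR_CALLS)
        return -ENOSYS;            /* unknown call number */
    return demo_call_table[nr](arg);
}
```
]]></programlisting>

<para>Because the handler is found through a pointer stored in a table, whoever can write to that table can change what a
given call number does.</para>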
<indexterm><primary>system call</primary></indexterm>
<indexterm><primary>ENTRY(system_call)</primary></indexterm>
<indexterm><primary>sys_call_table</primary></indexterm>
<indexterm><primary>entry.S</primary></indexterm>
<para>So, if we want to change the way a certain system call works, what we need to do is write our own function to
implement it (usually by adding a bit of our own code, and then calling the original function) and then change the pointer in
<varname>sys_call_table</varname> to point to our function. Because our module might be removed later and we don't want to
leave the system in an unstable state, it's important for <function>cleanup_module</function> to restore the table to its
original state.</para>
<para>The source code here is an example of such a kernel module. We want to `spy' on a certain user, and to
<function>printk()</function> a message whenever that user opens a file. Towards this end, we replace the system call to open
a file with our own function, called <function>our_sys_open</function>. This function checks the uid (user id) of the
current process, and if it's equal to the uid we spy on, it calls <function>printk()</function> to display the name of the
file to be opened. Then, either way, it calls the original <function>open()</function> function with the same parameters, to
actually open the file.</para>
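<para>Before reading the real module, it may help to see the same idea as a user-space toy. The sketch below is not the
module's code: it models the table entry as a single function pointer, passes the uid as an explicit argument (user space
has no kernel `current' process to ask), and uses <function>printf()</function> where the module would use
<function>printk()</function>. All <function>demo_</function> names and the uid value 1000 are assumptions made for
illustration.</para>

<programlisting><![CDATA[
```c
#include <stdio.h>

/* Simulated "open" service: same shape for original and replacement. */
typedef int (*demo_open_t)(const char *filename, unsigned int uid);

int demo_sys_open(const char *filename, unsigned int uid)
{
    (void)filename;
    (void)uid;
    return 3;                       /* pretend file descriptor */
}

/* One-slot stand-in for the sys_call_table entry for open. */
demo_open_t demo_open_slot = demo_sys_open;

static demo_open_t saved_open;      /* original pointer, kept for cleanup */
unsigned int spied_uid = 1000;      /* hypothetical uid to watch */
int spy_hits;                       /* how many opens were reported */

/* Replacement: report opens by the watched uid, then always call on. */
int demo_our_open(const char *filename, unsigned int uid)
{
    if (uid == spied_uid) {
        printf("Opened file by %u: %s\n", uid, filename);
        spy_hits++;
    }
    return saved_open(filename, uid);
}

/* What init_module would do: remember the old pointer, install ours. */
void demo_install(void)
{
    saved_open = demo_open_slot;
    demo_open_slot = demo_our_open;
}

/* What cleanup_module would do: put the original pointer back. */
void demo_remove(void)
{
    demo_open_slot = saved_open;
}
```
]]></programlisting>

<para>After <function>demo_install()</function>, every call through <varname>demo_open_slot</varname> passes through our
function first. After <function>demo_remove()</function>, the slot points at the original again.</para>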
<indexterm><primary>system call</primary><secondary>open</secondary></indexterm>
<para>The <function>init_module</function> function replaces the appropriate location in <varname>sys_call_table</varname> and
keeps the original pointer in a variable. The <function>cleanup_module</function> function uses that variable to restore
everything back to normal. This approach is dangerous, because of the possibility of two kernel modules changing the same
system call. Imagine we have two kernel modules, A and B. A's open system call will be A_open and B's will be B_open. Now,
when A is inserted into the kernel, the system call is replaced with A_open, which will call the original sys_open when it's
done. Next, B is inserted into the kernel, which replaces the system call with B_open, which will call what it thinks is the
original system call, A_open, when it's done.</para>
<para>Now, if B is removed first, everything will be well: it will simply restore the system call to A_open, which calls the
original. However, if A is removed and then B is removed, the system will crash. A's removal will restore the system call to
the original, sys_open, cutting B out of the loop. Then, when B is removed, it will restore the system call to what
<emphasis>it</emphasis> thinks is the original, A_open, which is no longer in memory. At first glance, it appears we could
solve this particular problem by checking if the system call is equal to our open function and if so not changing it at all
(so that B won't change the system call when it's removed), but that will cause an even worse problem. When A is removed, it
sees that the system call was changed to B_open so that it is no longer pointing to A_open, so it won't restore it to sys_open
before it is removed from memory. Unfortunately, B_open will still try to call A_open which is no longer there, so that even
without removing B the system would crash.</para>
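<para>The unload-order problem with the naive restore can be replayed safely in user space. The toy below is our own model,
not kernel code: a module is just a struct with a `loaded' flag, and instead of actually jumping into freed memory,
<function>ab_chain_is_safe()</function> walks the chain of saved pointers and reports whether any link leads into a module
that has been removed. All <function>ab_</function> names are invented for this sketch.</para>

<programlisting><![CDATA[
```c
#include <stdbool.h>
#include <stddef.h>   /* NULL */

typedef int (*ab_open_t)(void);

int ab_sys_open(void) { return 0; }          /* the original handler */

ab_open_t ab_slot = ab_sys_open;             /* table entry for open */

/* Per-module state: its handler, the pointer it saved at load time,
 * and whether its code is still "in memory". */
struct ab_module {
    ab_open_t handler;
    ab_open_t saved;
    bool loaded;
};

int ab_a_open(void);
int ab_b_open(void);

struct ab_module mod_a = { ab_a_open, NULL, false };
struct ab_module mod_b = { ab_b_open, NULL, false };

/* Each replacement does its own work, then calls what it saved. */
int ab_a_open(void) { return mod_a.saved(); }
int ab_b_open(void) { return mod_b.saved(); }

void ab_load(struct ab_module *m)
{
    m->saved = ab_slot;       /* remember whatever is there now */
    ab_slot = m->handler;
    m->loaded = true;
}

void ab_unload(struct ab_module *m)
{
    ab_slot = m->saved;       /* naive restore, as in the text */
    m->loaded = false;
}

/* Follow the handlers starting at the table entry. Returns false if a
 * link points into an unloaded module, which in a real kernel would be
 * a jump into memory that is no longer there. */
bool ab_chain_is_safe(void)
{
    ab_open_t p = ab_slot;
    for (int hops = 0; hops < 4; hops++) {
        if (p == ab_sys_open)
            return true;
        struct ab_module *m = (p == ab_a_open) ? &mod_a : &mod_b;
        if (!m->loaded)
            return false;
        p = m->saved;
    }
    return false;
}
```
]]></programlisting>

<para>Loading A then B and unloading B then A leaves the slot at the original handler. Loading A then B and unloading A then
B leaves the slot pointing at A's handler after A is gone, which is the crash described above.</para>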
<para>Note that all the related problems make syscall stealing infeasible for production use. In order to keep people from
doing potentially harmful things, sys_call_table is no longer exported. This means that if you want to do something more than
a mere dry run of this example, you will have to patch your current kernel in order to have sys_call_table exported.
In the example directory you will find a README and the patch. As you can imagine, such modifications are not to be
taken lightly. Do not try this on valuable systems (i.e. systems that you do not own or cannot restore easily).
You'll need to get the complete source code of this guide as a tarball in order to get the patch and the README.
Depending on your kernel version, you might even need to hand apply the patch. Still here? Well, so is this chapter.
If Wile E. Coyote were a kernel hacker, this would be the first thing he'd try. ;)
</para>
<indexterm><primary>try_module_get</primary></indexterm>
<indexterm><primary>sys_open</primary></indexterm>
<indexterm><primary>source file</primary><secondary>syscall.c</secondary></indexterm>
<example><title>syscall.c</title><programlisting><inlinegraphic fileref="lkmpg-examples/08-SystemCalls/syscall.c" format="linespecific"/></programlisting></example>
</sect1>
<!--
vim: tw=128
-->