mirror of https://github.com/tLDP/LDP
new
This commit is contained in:
parent
ac79b6464b
commit
44316ea86b
|
@ -0,0 +1,49 @@
|
|||
<sect1><title>Acknowledgements</title>
|
||||
|
||||
<para>Ori Pomerantz would like to thank Yoav Weiss for many helpful ideas
|
||||
and discussions, as well as finding mistakes within this document before
|
||||
its publication. Ori would also like to thank Frodo Looijaard from the
|
||||
Netherlands, Stephen Judd from New Zealand, Magnus Ahltorp from Sweden and
|
||||
Emmanuel Papirakis from Quebec, Canada.</para>
|
||||
|
||||
<para>I'd like to thank Ori Pomerantz for authoring this guide in the first
|
||||
place and then letting me maintain it. It was a tremendous effort on his
|
||||
part. I hope he likes what I've done with this document.</para>
|
||||
|
||||
<para> I would also like to thank Jeff Newmiller and Rhonda Bailey for
|
||||
teaching me. They've been patient with me and lent me their experience,
|
||||
regardless of how busy they were. David Porter had the unenviable job of
|
||||
helping convert the original LaTeX source into docbook. It was a long,
|
||||
boring and dirty job. But someone had to do it. Thanks, David.</para>
|
||||
|
||||
<para>Thanks also go to the fine people at <ulink
url="www.kernelnewbies.org">www.kernelnewbies.org</ulink>, in particular
Mark McLoughlin and John Levon, who I'm sure have much better things to do
than to hang out on kernelnewbies.org and teach the newbies. If this guide
teaches you anything, they are partially to blame.</para>
|
||||
|
||||
<para>Both Ori and I would like to thank Richard M. Stallman and Linus
|
||||
Torvalds for giving us the opportunity to not only run a high-quality
|
||||
operating system, but to take a close peek at how it works. I've never met
|
||||
Linus, and probably never will, but he has made a profound difference in my
|
||||
life.</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
<sect1><title>Nota Bene</title>
|
||||
|
||||
<para>Ori's original document was good about supporting earlier versions of
|
||||
Linux, going all the way back to the 2.0 days. I had originally intended
|
||||
to keep with the program, but after thinking about it, opted out. My main
|
||||
reason to keep with the compatibility was for Linux distributions like
|
||||
LEAF, which tended to use older kernels. However, even LEAF uses 2.2 and
|
||||
2.4 kernels these days.</para>
|
||||
|
||||
<para>Both Ori and I use the x86 platform. For the most part, the source
|
||||
code and discussions should apply to other architectures, but I can't
|
||||
promise anything. One exception is chapter (fixme), on interrupt handlers,
|
||||
which should not work on any architecture except for x86.</para>
|
||||
|
||||
</sect1>
|
|
@ -0,0 +1,206 @@
|
|||
<sect1><title>What Is A Kernel Module?</title>
|
||||
|
||||
<para>So, you want to write a kernel module. You know C, you've written a few
|
||||
normal programs to run as processes, and now you want to get to where the real
|
||||
action is, to where a single wild pointer can wipe out your file system and a core
|
||||
dump means a reboot.</para>
|
||||
|
||||
<para>What exactly is a kernel module? Modules are pieces of code that can be
|
||||
loaded into and unloaded from the kernel on demand. They extend the functionality of
|
||||
the kernel without the need to reboot the system. For example, one type of module
|
||||
is the device driver, which allows the kernel to access hardware connected to the
|
||||
system. Without modules, we would have to build monolithic kernels and add new
|
||||
functionality directly into the kernel image. Besides having larger kernels, this
|
||||
has the disadvantage of requiring us to rebuild and reboot the kernel every time we
|
||||
want new functionality.</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
<sect1><title>How Do Modules Get Into The Kernel?</title>
|
||||
|
||||
<indexterm><primary>/proc/modules</primary></indexterm>
|
||||
<indexterm><primary>kmod</primary></indexterm>
|
||||
<indexterm><primary>kerneld</primary></indexterm>
|
||||
<indexterm><primary><filename>/etc/modules.conf</filename></primary></indexterm>
|
||||
<indexterm><primary><filename>/etc/conf.modules</filename></primary></indexterm>
|
||||
|
||||
<para>You can see what modules are already loaded into the kernel by running
|
||||
<command>lsmod</command>, which gets its information by reading the file
|
||||
<filename>/proc/modules</filename>.</para>
|
||||
|
||||
<para>How do these modules find their way into the kernel? When the kernel needs a
|
||||
feature that is not resident in the kernel, the kernel module daemon
|
||||
kmod<footnote><para>In earlier versions of Linux, this was known as
|
||||
kerneld.</para></footnote> execs modprobe to load the module in. modprobe is
|
||||
passed a string in one of two forms:</para>
|
||||
|
||||
<itemizedlist>
|
||||
<listitem><para>A module name like <filename>softdog</filename> or
|
||||
<filename>ppp</filename>.</para></listitem>
|
||||
<listitem><para>A more generic identifier like
|
||||
<varname>char-major-10-30</varname>.</para></listitem>
|
||||
</itemizedlist>
|
||||
|
||||
<para>If modprobe is handed a generic identifier, it first looks for that string in
|
||||
the file <filename>/etc/modules.conf</filename><footnote><para>This file used to be
|
||||
called <filename>conf.modules</filename> before Linux 2.0, but this name is now
|
||||
deprecated.</para></footnote>. If it finds an alias line like:</para>
|
||||
|
||||
|
||||
<screen>
|
||||
alias char-major-10-30 softdog
|
||||
</screen>
|
||||
|
||||
|
||||
<para>it knows that the generic identifier refers to the module
|
||||
<filename>softdog.o</filename>.</para>
|
||||
|
||||
<para>Next, modprobe looks through the file
|
||||
<filename>/lib/modules/version/modules.dep</filename>, to see if other modules must
|
||||
be loaded before the requested module may be loaded. This file is created by
|
||||
<command>depmod -a</command> and contains module dependencies. For example,
|
||||
<filename>msdos.o</filename> requires the <filename>fat.o</filename> module to be
|
||||
already loaded into the kernel. The requested module has a dependency on another
|
||||
module if the other module defines symbols (variables or functions) that the
|
||||
requested module uses.</para>
|
||||
|
||||
<para>Lastly, modprobe uses insmod to first load any prerequisite modules into the
|
||||
kernel, and then the requested module. modprobe directs insmod to <filename
|
||||
role="directory">/lib/modules/version/</filename><footnote><para>If you are
|
||||
modifying the kernel, to avoid overwriting your existing modules you may want to
|
||||
use the <varname>EXTRAVERSION</varname> variable in the kernel Makefile to create a
|
||||
separate directory.</para></footnote>, the standard directory for modules. insmod
|
||||
is intended to be fairly dumb about the location of modules, whereas modprobe is
|
||||
aware of the default location of modules. So for example, if you wanted to load
|
||||
the msdos module, you'd have to either run:</para>
|
||||
|
||||
|
||||
<screen>
|
||||
insmod /lib/modules/2.5.1/kernel/fs/fat/fat.o
|
||||
insmod /lib/modules/2.5.1/kernel/fs/msdos/msdos.o
|
||||
</screen>
|
||||
|
||||
|
||||
<para>or just run "<command>modprobe -a msdos</command>".</para>
|
||||
|
||||
<indexterm><primary>modules.conf</primary><secondary>keep</secondary></indexterm>
|
||||
<indexterm><primary>modules.conf</primary><secondary>comment</secondary></indexterm>
|
||||
<indexterm><primary>modules.conf</primary><secondary>alias</secondary></indexterm>
|
||||
<indexterm><primary>modules.conf</primary><secondary>options</secondary></indexterm>
|
||||
<indexterm><primary>modules.conf</primary><secondary>path</secondary></indexterm>
|
||||
|
||||
|
||||
<para>Linux distros provide modprobe, insmod and depmod as a package called
|
||||
modutils or mod-utils.</para>
|
||||
|
||||
<para>Before finishing this chapter, let's take a quick look at a piece of
|
||||
<filename>/etc/modules.conf</filename>:</para>
|
||||
|
||||
<screen>
|
||||
#This file is automatically generated by update-modules
|
||||
path[misc]=/lib/modules/2.4.?/local
|
||||
keep
|
||||
path[net]=~p/mymodules
|
||||
options mydriver irq=10
|
||||
alias eth0 eepro
|
||||
</screen>
|
||||
|
||||
<para>Lines beginning with a '#' are comments. Blank lines are ignored.</para>
|
||||
|
||||
<para>The <literal>path[misc]</literal> line tells modprobe to replace the search
|
||||
path for misc modules with the directory <filename
|
||||
role="directory">/lib/modules/2.4.?/local</filename>. As you can see, shell meta
|
||||
characters are honored.</para>
|
||||
|
||||
<para>The <literal>path[net]</literal> line tells modprobe to look for net modules
|
||||
in the directory <filename role="directory">~p/mymodules</filename>, however, the
|
||||
"keep" directive preceding the <literal>path[net]</literal> directive tells modprobe
|
||||
to add this directory to the standard search path of net modules as opposed to
|
||||
replacing the standard search path, as we did for the misc modules.</para>
|
||||
|
||||
<para>The alias line says to load in <filename>eepro.o</filename> whenever kmod
|
||||
requests that the generic identifier `eth0' be loaded.</para>
|
||||
|
||||
<para>You won't see lines like "alias block-major-2 floppy" in
|
||||
<filename>/etc/modules.conf</filename> because modprobe already knows about the
|
||||
standard drivers which will be used on most systems.</para>
|
||||
|
||||
<para>Now you know how modules get into the kernel. There's a bit more to the story
|
||||
if you want to write your own modules which depend on other modules (we call this
|
||||
`stacking modules'). But this will have to wait for a future chapter. We have a
|
||||
lot to cover before addressing this relatively high-level issue.</para>
|
||||
|
||||
|
||||
|
||||
<sect2><title>Before We Begin</title>
|
||||
|
||||
<para>Before we delve into code, there are a few issues we need to cover.
|
||||
Everyone's system is different and everyone has their own groove. Getting your
|
||||
first "hello world" program to compile and load correctly can sometimes be a
|
||||
trick. Rest assured, after you get over the initial hurdle of doing it for the
|
||||
first time, it will be smooth sailing thereafter.</para>
|
||||
|
||||
<sect3><title>Modversioning</title>
|
||||
|
||||
<para>A module compiled for one kernel won't load if you boot a different
|
||||
kernel unless you enable <literal>CONFIG_MODVERSIONS</literal> in the kernel.
|
||||
We won't go into module versioning until later in this guide. Until we
|
||||
cover modversions, the examples in the guide may not work if you're running
|
||||
a kernel with modversioning turned on. However, most stock Linux distro
|
||||
kernels come with it turned on. If you're having trouble loading the
|
||||
modules because of versioning errors, compile a kernel with modversioning
|
||||
turned off.</para>
|
||||
|
||||
</sect3>
|
||||
|
||||
|
||||
|
||||
<sect3><title>Using X</title>
|
||||
|
||||
<para>It is highly recommended that you type in, compile and load all the
|
||||
examples this guide discusses. It's also highly recommended you do this
|
||||
from a console. You should not be working on this stuff in X.</para>
|
||||
|
||||
<para>Modules can't print to the screen like <function>printf()</function>
|
||||
can, but they can log information and warnings, which ends up being printed
|
||||
on your screen, but only on a console. If you insmod a module from an
|
||||
xterm, the information and warnings will be logged, but only to your log
|
||||
files. You won't see it unless you look through your log files. To have
|
||||
immediate access to this information, do all your work from the console.</para>
|
||||
|
||||
</sect3>
|
||||
|
||||
|
||||
|
||||
<sect3><title>Compiling Issues and Kernel Version</title>
|
||||
|
||||
<para>Very often, Linux distros will distribute kernel source that has been
|
||||
patched in various non-standard ways, which may cause trouble.</para>
|
||||
|
||||
<para>A more common problem is that some Linux distros distribute incomplete
|
||||
kernel headers. You'll need to compile your code using various header files
|
||||
from the Linux kernel. Murphy's Law states that the headers that are
|
||||
missing are exactly the ones that you'll need for your module work.</para>
|
||||
|
||||
<para>To avoid these two problems, I highly recommend that you download,
|
||||
compile and boot into a fresh, stock Linux kernel which can be downloaded
|
||||
from any of the Linux kernel mirror sites. See the Linux Kernel HOWTO for
|
||||
more details.</para>
|
||||
|
||||
<para>Ironically, this can also cause a problem. By default, gcc on your
|
||||
system may look for the kernel headers in their default location rather than
|
||||
where you installed the new copy of the kernel (usually in <filename
|
||||
role="directory">/usr/src/</filename>). This can be fixed by using gcc's
|
||||
<literal>-I</literal> switch.</para>
|
||||
|
||||
</sect3>
|
||||
|
||||
</sect2>
|
||||
|
||||
</sect1>
|
||||
|
||||
<!--
|
||||
vim: tw=86
|
||||
-->
|
|
@ -0,0 +1,564 @@
|
|||
<sect1><title>Hello, World (part 1): The Simplest Module</title>
|
||||
|
||||
<para>When the first caveman programmer chiseled the first program on the walls of
|
||||
the first cave computer, it was a program to paint the string `Hello, world' in
|
||||
Antelope pictures. Roman programming textbooks began with the `Salut, Mundi'
|
||||
program. I don't know what happens to people who break with this tradition, but I
|
||||
think it's safer not to find out. We'll start with a series of hello world programs
|
||||
that demonstrate the different aspects of the basics of writing a kernel
|
||||
module.</para>
|
||||
|
||||
<para>Here's the simplest module possible. Don't compile it yet; we'll cover module
|
||||
compilation in the next section.</para>
|
||||
|
||||
<example><title>Hello World (part 1)</title>
|
||||
<programlisting><![CDATA[
|
||||
/* hello-1.c - The simplest kernel module.
|
||||
*/
|
||||
|
||||
#include <linux/module.h> /* Needed by all modules */
|
||||
#include <linux/kernel.h> /* Needed for KERN_ALERT */
|
||||
|
||||
|
||||
int init_module(void)
|
||||
{
|
||||
printk("<1>Hello world 1.\n");
|
||||
|
||||
/* A non 0 return means init_module failed; module can't be loaded. */
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
void cleanup_module(void)
|
||||
{
|
||||
printk(KERN_ALERT "Goodbye world 1.\n");
|
||||
}
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
<indexterm><primary><function>init_module()</function></primary></indexterm>
|
||||
<indexterm><primary><function>cleanup_module()</function></primary></indexterm>
|
||||
<indexterm><primary><function>printk()</function></primary></indexterm>
|
||||
|
||||
|
||||
<para>A kernel module must have at least two functions: a "start" (initialization)
|
||||
function called <function>init_module()</function> which is called when the module is
|
||||
insmoded into the kernel, and an "end" (cleanup) function called
|
||||
<function>cleanup_module()</function> which is called just before it is
|
||||
rmmoded.</para>
|
||||
|
||||
<para>Typically, <function>init_module()</function> either registers a handler for
|
||||
something with the kernel, or it replaces one of the kernel functions with its own
|
||||
code (usually code to do something and then call the original function). The
|
||||
<function>cleanup_module()</function> function is supposed to undo whatever
|
||||
<function>init_module()</function> did, so the module can be unloaded safely.</para>
|
||||
|
||||
|
||||
<para>Despite what you might think, <function>printk()</function> was not meant to
|
||||
communicate information to the user, even though we use it for exactly this purpose
|
||||
within this document! It happens to be a logging mechanism for the kernel, and is
|
||||
used to log information or give warnings. Therefore, each
|
||||
<function>printk()</function> statement comes with a priority, which is the
|
||||
<varname>&lt;1&gt;</varname> you see. There are 8 priority levels, and the kernel has
macros for them, so you don't have to use cryptic numbers. We could've used a macro
instead of the explicit priority level: <function>printk(KERN_ALERT "Hello,
world.");</function> You can view the levels (and what
they mean) in the file <filename role="headerfile">linux/kernel.h</filename>. If you
don't specify a priority level, the default priority,
<varname>DEFAULT_MESSAGE_LOGLEVEL</varname>, will be used.</para>
|
||||
|
||||
<para>Take time to read through the priority macros. The header file also describes
|
||||
what each priority means. In practice, don't use numbers, like
<literal>&lt;4&gt;</literal>. Always use the macro, like
|
||||
<literal>KERN_WARNING</literal>.</para>
|
||||
|
||||
<para>If the priority is less than <varname>int console_loglevel</varname>, the
|
||||
message is printed on your current terminal. If both <command>syslogd</command> and
|
||||
<application>klogd</application> are running, then the message will also get appended
|
||||
to <filename>/var/log/messages</filename>, whether it got printed to the console or
|
||||
not. We use a high priority, like <literal>KERN_ALERT</literal>, to make sure the
|
||||
<function>printk()</function> messages get printed to your console rather than just
|
||||
logged to your logfile. When you write real modules, you'll want to use priorities
|
||||
that are meaningful for the situation at hand.</para>
|
||||
|
||||
<para>There's more I want to show you using "hello world" type programs, but before
|
||||
we move on, you need to learn how to compile them.</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1><title>Compiling Kernel Modules</title>
|
||||
|
||||
<indexterm><primary>insmod</primary></indexterm>
|
||||
|
||||
<para>A kernel module is not an independent executable, but an object file which will
|
||||
be linked into the kernel during runtime using insmod. As a result, modules should
|
||||
be compiled with the <option>-c</option> flag. Also, because the kernel makes
|
||||
extensive use of inline functions, modules must be compiled with the optimization
|
||||
flag, <option>-O</option>, although heavy optimization like <option>-O2</option> is
|
||||
not recommended. Without optimization, some of the assembler macro calls will be
|
||||
mistaken by the compiler for function calls. This will cause loading the module to
|
||||
fail, since insmod won't find those functions in the kernel.</para>
|
||||
|
||||
<para>Kernel modules also need to be compiled with certain symbols defined. This is
|
||||
because the kernel header files need to behave differently, depending on whether
|
||||
we're compiling a kernel module or an executable. You define symbols using gcc's
|
||||
<option>-D</option> option, and here is a list of symbols that should be defined for
|
||||
every module you compile:</para>
|
||||
|
||||
<indexterm><primary>MODULE</primary></indexterm>
|
||||
<indexterm><primary>__KERNEL__</primary></indexterm>
|
||||
<indexterm><primary>__SMP__</primary></indexterm>
|
||||
|
||||
<itemizedlist>
|
||||
|
||||
<listitem><para><varname>__KERNEL__</varname>: Tells the header files that the code
|
||||
will be run in kernel mode, not as a user process.</para></listitem>
|
||||
|
||||
<listitem><para><varname>MODULE</varname>: Tells the header files to give the
|
||||
appropriate definitions for a kernel module.</para></listitem>
|
||||
|
||||
<listitem><para><varname>__SMP__</varname>: This must be defined if the kernel was
|
||||
compiled to support symmetrical multiprocessing, even if it's running just on one
|
||||
CPU. Note how I did this in <filename>hello-1.c</filename>.</para></listitem>
|
||||
|
||||
</itemizedlist>
|
||||
|
||||
<para> So, let's look at a simple Makefile for compiling a module:</para>
|
||||
|
||||
<example><title>Makefile for a basic kernel module</title>
|
||||
<screen><![CDATA[
|
||||
# Makefile for a basic kernel module

CC=gcc
CFLAGS := -c -O -W -Wall -Wstrict-prototypes -Wmissing-prototypes
MODCFLAGS := -DMODULE -D__KERNEL__

hello-1.o: hello-1.c
	${CC} ${CFLAGS} ${MODCFLAGS} hello-1.c
|
||||
]]></screen>
|
||||
</example>
|
||||
|
||||
<para>Type <filename>hello-1.c</filename> in and compile it. Insert it into the
|
||||
kernel with <command>insmod ./hello-1.o</command>. Neat, eh? All modules loaded
|
||||
into the kernel are listed in <filename>/proc/modules</filename>. Go ahead and cat
|
||||
that file to see that your module is really a part of the kernel. Congratulations,
|
||||
you are now the author of Linux kernel code! When the novelty wears off, you can
|
||||
remove your module from the kernel by using <command>rmmod hello-1</command>. Take
|
||||
a look at <filename>/var/log/messages</filename> just to see that it got logged to
|
||||
your system logfile.</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1><title>Hello World (part 2): The <function>module_init()</function> and
|
||||
<function>module_exit()</function> Macros</title>
|
||||
|
||||
<indexterm><primary>module_init</primary></indexterm>
|
||||
<indexterm><primary>module_exit</primary></indexterm>
|
||||
|
||||
<para>As of Linux 2.4, you can rename the init and cleanup functions of your
|
||||
modules; they no longer have to be called <function>init_module()</function> and
|
||||
<function>cleanup_module()</function> respectively. This is done with the
|
||||
<function>module_init()</function> and <function>module_exit()</function> macros.
|
||||
These macros are defined in <filename role="header">linux/init.h</filename>. The
|
||||
only caveat is that your init and cleanup functions must be defined before calling
|
||||
the macros, otherwise you'll get compilation errors. Here's an example of this
|
||||
technique:</para>
|
||||
|
||||
|
||||
<example><title>Hello World (part 2)</title>
|
||||
<programlisting><![CDATA[
|
||||
/* hello-2.c - Demonstrating the module_init() and module_exit() macros.
|
||||
*/
|
||||
|
||||
#include <linux/module.h> /* Needed by all modules */
|
||||
#include <linux/kernel.h> /* Needed for KERN_ALERT */
|
||||
#include <linux/init.h> /* Needed for the macros */
|
||||
|
||||
|
||||
int my_wonderful_init(void)
|
||||
{
|
||||
printk(KERN_ALERT "Hello, world 2\n");
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
void my_wonderful_cleanup(void)
|
||||
{
|
||||
printk(KERN_ALERT "Goodbye, world 2\n");
|
||||
}
|
||||
|
||||
|
||||
module_init(my_wonderful_init);
|
||||
module_exit(my_wonderful_cleanup);
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1><title>Hello World (part 3): The <literal>__init</literal> and
|
||||
<literal>__exit</literal> Macros</title>
|
||||
|
||||
<indexterm><primary><function>__init</function></primary></indexterm>
|
||||
<indexterm><primary><function>__initdata</function></primary></indexterm>
|
||||
<indexterm><primary><function>__exit</function></primary></indexterm>
|
||||
<indexterm><primary><function>__initfunction()</function></primary></indexterm>
|
||||
|
||||
<para>This demonstrates a feature of kernel 2.2 and later. Notice the change in the
|
||||
definitions of the init and cleanup functions. The <function>__init</function> macro
|
||||
will cause the init function to be discarded and its memory reclaimed for the kernel
|
||||
once the init function finishes, but only for built-in drivers. It has no effect for
|
||||
loadable modules.</para>
|
||||
|
||||
<para>There is also an <function>__initdata</function> which works similarly to
|
||||
<function>__init</function> but for init variables rather than functions.</para>
|
||||
|
||||
<para>The <function>__exit</function> macro causes the omission of the function when
|
||||
the module is built into the kernel. This only has an effect for built in modules
|
||||
since they never exit (and hence don't need a cleanup function). It has no effect on
|
||||
loadable modules since they need their cleanup function.</para>
|
||||
|
||||
<para>These macros are defined in <filename role="headerfile">linux/init.h</filename>
|
||||
and serve to free up kernel memory. When you boot your kernel and see something like
|
||||
<literal>Freeing unused kernel memory: 236k freed</literal>, this is precisely what
|
||||
the kernel is freeing.</para>
|
||||
|
||||
|
||||
<example><title>Hello World (part 3)</title>
|
||||
<programlisting><![CDATA[
|
||||
/* hello-3.c - Illustrating the __init, __initdata and __exit macros.
|
||||
*/
|
||||
|
||||
#include <linux/module.h> /* Needed by all modules */
|
||||
#include <linux/kernel.h> /* Needed for KERN_ALERT */
|
||||
#include <linux/init.h> /* Needed for the macros */
|
||||
|
||||
|
||||
static int hello3_data __initdata = 3;
|
||||
|
||||
|
||||
static int __init hello3_init_function(void)
|
||||
{
|
||||
printk(KERN_ALERT "Hello, world %d\n", hello3_data);
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
static void __exit hello3_cleanup_function(void)
|
||||
{
|
||||
printk(KERN_ALERT "Goodbye, world 3\n");
|
||||
}
|
||||
|
||||
|
||||
module_init(hello3_init_function);
|
||||
module_exit(hello3_cleanup_function);
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
|
||||
<para>You may see a directive named "<function>__initfunction()</function>" in
|
||||
drivers written for Linux 2.2 kernels:</para>
|
||||
|
||||
|
||||
<screen><![CDATA[
|
||||
__initfunction(int init_module(void))
|
||||
{
|
||||
printk(KERN_ALERT "Hi there.\n");
|
||||
|
||||
return 0;
|
||||
}
|
||||
]]></screen>
|
||||
|
||||
|
||||
<para>This macro served the same purpose as <function>__init</function>, but is now
|
||||
deprecated in favor of <function>__init</function>. Don't use
|
||||
<function>__initfunction()</function> in your own code.</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1><title>Hello World (part 4): Licensing and Module Documentation</title>
|
||||
|
||||
<indexterm><primary><literal>MODULE_LICENSE()</literal></primary></indexterm>
|
||||
<indexterm><primary><literal>MODULE_DESCRIPTION()</literal></primary></indexterm>
|
||||
<indexterm><primary><literal>MODULE_AUTHOR()</literal></primary></indexterm>
|
||||
<indexterm><primary><literal>MODULE_SUPPORTED_DEVICE()</literal></primary></indexterm>
|
||||
|
||||
<para>If you're running kernel 2.4 or later, you might have noticed a message like:
|
||||
"<literal>Warning: loading hello-1.o will taint the kernel: no license</literal>"
|
||||
when you loaded the previous example modules. In 2.4 and later, a mechanism was
|
||||
devised to identify code licensed under the GPL (and friends) so people can be warned
|
||||
that the code is non open-source. This is accomplished by the
|
||||
<literal>MODULE_LICENSE()</literal> macro which is demonstrated in the next piece of
|
||||
code. By setting the license to GPL, you can keep the warning from being printed.
|
||||
This mechanism is documented in <filename
|
||||
role="headerfile">linux/module.h</filename>, and I recommend you read the comments
|
||||
about this macro in the header file.</para>
|
||||
|
||||
<para>A similar mechanism is used to identify the module description
|
||||
"<literal>MODULE_DESCRIPTION()</literal>", author
|
||||
"<literal>MODULE_AUTHOR()</literal>" and what device the module supports
|
||||
"<literal>MODULE_SUPPORTED_DEVICE()</literal>" and is defined in <filename
|
||||
role="headerfile">linux/module.h</filename>. This info isn't really used by the
|
||||
kernel itself; it's used as documentation and can be viewed by a tool like
|
||||
objdump.</para>
|
||||
|
||||
<para>We haven't covered devices yet, so the last macro may be a mystery to you, but
|
||||
keep it in mind. We'll cover char devices shortly.</para>
|
||||
|
||||
|
||||
<example><title>Hello World (part 4)</title>
|
||||
<programlisting><![CDATA[
|
||||
/* hello-4.c - Demonstrates tainting messages and documentation.
|
||||
*/
|
||||
#include <linux/module.h>
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/init.h>
|
||||
#define DRIVER_AUTHOR "Peter Jay Salzman <p@dirac.org>"
|
||||
#define DRIVER_DESC "A sample driver"
|
||||
|
||||
|
||||
static int __init hello4_init_function(void)
|
||||
{
|
||||
printk(KERN_ALERT "Hello, world 4\n");
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
static void __exit hello4_cleanup_function(void)
|
||||
{
|
||||
printk(KERN_ALERT "Goodbye, world 4\n");
|
||||
}
|
||||
|
||||
|
||||
module_init(hello4_init_function);
|
||||
module_exit(hello4_cleanup_function);
|
||||
|
||||
/* You can use strings here or a define, as shown. It doesn't matter what you
|
||||
* actually name the #define's, so "AUTHOR" is as good as "DRIVER_AUTHOR". */
|
||||
MODULE_AUTHOR(DRIVER_AUTHOR);
|
||||
MODULE_DESCRIPTION(DRIVER_DESC);
|
||||
|
||||
/* This gets rid of the "taint message" by declaring this code as GPL. */
|
||||
MODULE_LICENSE("GPL");
|
||||
|
||||
/* This says that the module uses /dev/testdevice. It might be used in the
|
||||
* future to help automatic configuration of modules, but is currently unused
|
||||
* other than documentation purposes. */
|
||||
MODULE_SUPPORTED_DEVICE("testdevice");
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1><title>Passing Command Line Arguments to a Module</title>
|
||||
|
||||
<para>Modules can take command line arguments, but not with the argc/argv you might
|
||||
be used to.</para>
|
||||
|
||||
<para>To allow arguments to be passed to your driver, declare the variables that will
|
||||
take the values of the command line arguments as global and then use the MODULE_PARM
|
||||
macro (defined in <filename role="headerfile">linux/module.h</filename>) to set the
|
||||
mechanism up. At runtime, insmod will fill the variables with any command line
|
||||
arguments that are given. The variable declarations and macros should be placed at
|
||||
the beginning of the module for clarity. The example code should clear up my
|
||||
admittedly lousy explanation.</para>
|
||||
|
||||
<para>The <literal>MODULE_PARM</literal> macro takes 2 arguments: the name of the
|
||||
variable and its type. The supported variable types are "<literal>b</literal>":
|
||||
single byte, "<literal>h</literal>": short int, "<literal>i</literal>": integer,
|
||||
"<literal>l</literal>": long int and "<literal>s</literal>": string. Strings should
|
||||
be declared as "<type>char *</type>" and insmod will allocate memory for them. You
|
||||
should always try to give the variables an initial default value. This is kernel
|
||||
code, and you should program defensively. For example:</para>
|
||||
|
||||
<screen>
|
||||
int myint = 3;
|
||||
char *mystr;
|
||||
|
||||
MODULE_PARM (myint, "i");
|
||||
MODULE_PARM (mystr, "s");
|
||||
</screen>
|
||||
|
||||
<para>Arrays are supported too. An integer value preceding the type in MODULE_PARM
|
||||
will indicate an array of some maximum length. Two numbers separated by a '-' will
|
||||
give the minimum and maximum number of values. For example, an array of shorts with
|
||||
at least 2 and no more than 4 values could be declared as:</para>
|
||||
|
||||
<screen>
|
||||
short myshortArray[4];
MODULE_PARM (myshortArray, "2-4h");
|
||||
</screen>
|
||||
|
||||
|
||||
<para>A good use for this is to have the module variables' default values set, like
|
||||
which IO port or IO memory to use. If the variables contain the default values, then
|
||||
perform autodetection (explained elsewhere). Otherwise, keep the current value.
|
||||
This will be made clear later on. For now, I just want to demonstrate passing
|
||||
arguments to a module.</para>
|
||||
|
||||
|
||||
<example><title>Hello World (part 5)</title>
|
||||
<programlisting><![CDATA[
|
||||
/* hello-5.c - Demonstrates command line argument passing to a module.
|
||||
*/
|
||||
#include <linux/module.h>
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/init.h>
|
||||
|
||||
static int myint = 0;
|
||||
static char *mystring = "blah";
|
||||
|
||||
MODULE_PARM (myint, "i");
|
||||
MODULE_PARM (mystring, "s");
|
||||
|
||||
|
||||
static int __init hello5_init_function(void)
|
||||
{
|
||||
printk(KERN_ALERT "Hello, world 5\n");
|
||||
printk(KERN_ALERT "integer: %i\n", myint);
|
||||
printk(KERN_ALERT "string: %s\n", mystring);
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
static void __exit hello5_cleanup_function(void)
|
||||
{
|
||||
printk(KERN_ALERT "Goodbye, world 5\n");
|
||||
}
|
||||
|
||||
|
||||
module_init(hello5_init_function);
|
||||
module_exit(hello5_cleanup_function);
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
</sect1>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<sect1><title>Modules Spanning Multiple Files</title>
|
||||
|
||||
<indexterm><primary>source files</primary><secondary>multiple</secondary></indexterm>
|
||||
<indexterm><primary>__NO_VERSION__</primary></indexterm>
|
||||
<indexterm><primary>module.h</primary></indexterm>
|
||||
<indexterm><primary>version.h</primary></indexterm>
|
||||
<indexterm><primary>kernel_version</primary></indexterm>
|
||||
<indexterm><primary>ld</primary></indexterm>
|
||||
<indexterm><primary>elf_i386</primary></indexterm>
|
||||
|
||||
<para>Sometimes it makes sense to divide a kernel module between several
|
||||
source files. In this case, you need to:</para>
|
||||
|
||||
<orderedlist>
|
||||
|
||||
<listitem><para>In all the source files but one, add the line
|
||||
<command>#define __NO_VERSION__</command>. This is important because
|
||||
<filename role="headerfile">module.h</filename> normally includes the
|
||||
definition of <varname>kernel_version</varname>, a global variable with
|
||||
the kernel version the module is compiled for. If you need <filename
|
||||
role="headerfile">version.h</filename>, you need to include it yourself,
|
||||
because <filename role="headerfile">module.h</filename> won't do it for
|
||||
you with <varname>__NO_VERSION__</varname>.</para></listitem>
|
||||
|
||||
<listitem><para>Compile all the source files as usual.</para></listitem>
|
||||
|
||||
<listitem><para>Combine all the object files into a single one. Under x86,
|
||||
use <command>ld -m elf_i386 -r -o &lt;module name.o&gt; &lt;1st src
file.o&gt; &lt;2nd src file.o&gt;</command>.</para></listitem>
|
||||
|
||||
</orderedlist>
|
||||
|
||||
<para>Here's an example of such a kernel module.</para>
|
||||
|
||||
<example><title>start.c</title>
|
||||
<programlisting><![CDATA[
|
||||
/* start.c - Illustration of multi filed modules
|
||||
*/
|
||||
|
||||
#include <linux/kernel.h> /* We're doing kernel work */
|
||||
#include <linux/module.h> /* Specifically, a module */
|
||||
|
||||
int init_module(void)
|
||||
{
|
||||
printk("Hello, world - this is the kernel speaking\n");
|
||||
return 0;
|
||||
}
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
|
||||
<para>The next file:</para>
|
||||
|
||||
|
||||
<example><title>stop.c</title>
|
||||
<programlisting><![CDATA[
|
||||
/* stop.c - Illustration of multi filed modules
|
||||
*/
|
||||
|
||||
#if defined(CONFIG_MODVERSIONS) && ! defined(MODVERSIONS)
|
||||
#include <linux/modversions.h> /* Will be explained later */
|
||||
#define MODVERSIONS
|
||||
#endif
|
||||
#define __NO_VERSION__          /* It's not THE file of the kernel module;
                                   must come before module.h to take effect */
#include <linux/kernel.h>       /* We're doing kernel work */
#include <linux/module.h>       /* Specifically, a module */
#include <linux/version.h>      /* Not included by module.h because of
                                   __NO_VERSION__ */
|
||||
|
||||
void cleanup_module(void)
|
||||
{
|
||||
printk("<1>Short is the life of a kernel module\n");
|
||||
}
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
|
||||
<para>And finally, the makefile:</para>
|
||||
|
||||
<example><title>Makefile for a multi-filed module</title>
|
||||
<screen><![CDATA[
|
||||
CC=gcc
|
||||
MODCFLAGS := -O -Wall -DMODULE -D__KERNEL__
|
||||
|
||||
hello2.o: start.o stop.o
	ld -m elf_i386 -r -o hello2.o start.o stop.o

start.o: start.c
	${CC} ${MODCFLAGS} -c start.c

stop.o: stop.c
	${CC} ${MODCFLAGS} -c stop.c
|
||||
]]></screen>
|
||||
</example>
|
||||
|
||||
</sect1>
|
||||
|
||||
<!--
|
||||
vim: tw=87
|
||||
-->
|
|
@ -0,0 +1,317 @@
|
|||
<sect1><title>Modules vs Programs</title>
|
||||
|
||||
<sect2><title>How modules begin and end</title>
|
||||
|
||||
<para>A program usually begins with a <function>main()</function> function, executes a
|
||||
bunch of instructions and terminates upon completion of those instructions. Kernel
|
||||
modules work a bit differently. A module always begins with either the
|
||||
<function>init_module</function> function or the function you specify with the
|
||||
<function>module_init</function> call. This is the entry function for modules; it tells
|
||||
the kernel what functionality the module provides and sets up the kernel to run the
|
||||
module's functions when they're needed. Once it does this, the entry function returns and the
|
||||
module does nothing until the kernel wants to do something with the code that the module
|
||||
provides.</para>
|
||||
|
||||
<para>All modules end by calling either <function>cleanup_module</function> or the
|
||||
function you specify with the <function>module_exit</function> call. This is the exit
|
||||
function for modules; it undoes whatever the entry function did. It unregisters the
|
||||
functionality that the entry function registered.</para>
|
||||
|
||||
<para>Every module must have an entry function and an exit function. Since there's more
|
||||
than one way to specify entry and exit functions, I'll try my best to use the terms `entry
|
||||
function' and `exit function', but if I slip and simply refer to them as
|
||||
<function>init_module</function> and <function>cleanup_module</function>, I think you'll
|
||||
know what I mean.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
|
||||
|
||||
<sect2><title>Functions available to modules</title>
|
||||
|
||||
<indexterm><primary>library function</primary></indexterm>
|
||||
<indexterm><primary>system call</primary></indexterm>
|
||||
<indexterm><primary><filename>/proc/ksyms</filename></primary></indexterm>
|
||||
|
||||
<para>Programmers use functions they don't define all the time. A prime example of this
|
||||
is <function>printf()</function>. You use these library functions which are provided by
|
||||
the standard C library, libc. The definitions for these functions don't actually enter
|
||||
your program until the linking stage, which ensures that the code (for
|
||||
<function>printf()</function> for example) is available, and fixes the call instruction
|
||||
to point to that code.</para>
|
||||
|
||||
<para>Kernel modules are different here, too. In the hello world example, you might
|
||||
have noticed that we used a function, <function>printk()</function> but didn't include a
|
||||
standard I/O library. That's because modules are object files whose symbols get
|
||||
resolved upon insmod'ing. The definition for the symbols comes from the kernel itself;
|
||||
the only external functions you can use are the ones provided by the kernel. If you're
|
||||
curious about what symbols have been exported by your kernel, take a look at
|
||||
<filename>/proc/ksyms</filename>.</para>
|
||||
|
||||
<para>One point to keep in mind is the difference between library functions and system
|
||||
calls. Library functions are higher level, run completely in user space and provide a
|
||||
more convenient interface for the programmer to the functions that do the real
|
||||
work---system calls. System calls run in kernel mode on the user's behalf and are
|
||||
provided by the kernel itself. The library function <function>printf()</function> may
|
||||
look like a very general printing function, but all it really does is format the data
|
||||
into strings and write the string data using the low-level system call
|
||||
<function>write()</function>, which then sends the data to standard output.</para>
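<para>Here is a minimal user-space sketch of what is left once the formatting is stripped away: a helper that hands a string straight to <function>write()</function> without going through stdio. The name <function>say</function> is made up for this example.</para>

```c
#include <string.h>
#include <unistd.h>

/* Writes msg to the file descriptor fd with the write() system call,
 * bypassing stdio entirely.  Returns the number of bytes written, or -1
 * on error (see man 2 write). */
ssize_t say(int fd, const char *msg)
{
    return write(fd, msg, strlen(msg));
}
```

<para>Calling <literal>say(1, "hello")</literal> sends the five bytes to file descriptor 1, which is standard output.</para>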
|
||||
|
||||
<para> Would you like to see what system calls are made by
|
||||
<function>printf()</function>? It's easy! Compile the following program: </para>
|
||||
|
||||
<screen>
|
||||
#include &lt;stdio.h&gt;
|
||||
int main(void)
|
||||
{ printf("hello"); return 0; }
|
||||
</screen>
|
||||
|
||||
<indexterm><primary>strace</primary></indexterm>
|
||||
|
||||
<para>with <command>gcc -Wall -o hello hello.c</command>. Run the executable with
<command>strace ./hello</command>. Are you impressed? Every line you see corresponds to
a system call. strace<footnote><para>It's an invaluable tool for figuring out things
like what files a program is trying to access. Ever have a program bail silently
because it couldn't find a file? It's a PITA!</para></footnote> is a handy program that
gives you details about what system calls a program is making, including which call is
made, what its arguments are and what it returns. Towards the end, you'll see a
line which looks like <function>write(1, "hello", 5hello)</function>. There it is. The
face behind the <function>printf()</function> mask. You may not be familiar with write,
since most people use library functions for file I/O (like fopen, fputs, fclose). If
that's the case, try looking at <command>man 2 write</command>. The 2nd man section is
devoted to system calls (like <function>kill()</function> and
<function>read()</function>). The 3rd man section is devoted to library calls, which you
would probably be more familiar with (like <function>cosh()</function> and
<function>random()</function>).</para>
|
||||
|
||||
<para>You can even write modules to replace the kernel's system calls, which we'll do
|
||||
shortly. Crackers often make use of this sort of thing for backdoors or trojans, but
|
||||
you can write your own modules to do more benign things, like have the kernel write
|
||||
<emphasis>Tee hee, that tickles!</emphasis> every time someone tries to delete a file on
|
||||
your system.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
|
||||
|
||||
<sect2><title>User Space vs Kernel Space</title>
|
||||
|
||||
<para>A kernel is all about access to resources, whether the resource in question happens
|
||||
to be a video card, a hard drive or even memory. Programs often compete for the same
|
||||
resource. As I just saved this document, updatedb started updating the locate database.
|
||||
My vim session and updatedb are both using the hard drive concurrently. The kernel needs
|
||||
to keep things orderly, and not give users access to resources whenever they feel like it.
|
||||
To this end, a <acronym>CPU</acronym> can run in different modes. Each mode gives a
|
||||
different level of freedom to do what you want on the system. The Intel 80386
|
||||
architecture has 4 of these modes, which are called rings. Unix uses only two rings; the
|
||||
highest ring (ring 0, also known as `supervisor mode' where everything is allowed to
|
||||
happen) and the lowest ring, which is called `user mode'.</para>
|
||||
|
||||
<para>Recall the discussion about library functions vs system calls. Typically, you use a
|
||||
library function in user mode. The library function calls one or more system calls, and
|
||||
these system calls execute on the library function's behalf, but do so in supervisor mode
|
||||
since they are part of the kernel itself. Once the system call completes its task, it
|
||||
returns and execution gets transferred back to user mode.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
|
||||
|
||||
<sect2><title>Name Space</title>
|
||||
|
||||
<indexterm><primary>symbol table</primary></indexterm>
|
||||
<indexterm><primary>namespace pollution</primary></indexterm>
|
||||
<indexterm><primary><filename>/proc/ksyms</filename></primary></indexterm>
|
||||
|
||||
<para>When you write a small C program, you use variables which are convenient and make
|
||||
sense to the reader. If, on the other hand, you're writing routines which will be part
|
||||
of a bigger problem, any global variables you have are part of a community of other
|
||||
people's global variables; some of the variable names can clash. When a program has
|
||||
lots of global variables which aren't meaningful enough to be distinguished, you get
|
||||
<emphasis>namespace pollution</emphasis>. In large projects, effort must be made to
|
||||
remember reserved names, and to find ways to develop a scheme for naming unique variable
|
||||
names and symbols.</para>
|
||||
|
||||
<para>When writing kernel code, even the smallest module will be linked against the
|
||||
entire kernel, so this is definitely an issue. The best way to deal with this is to
|
||||
declare all your variables as <type>static</type> and to use a well-defined prefix for
|
||||
your symbols. By convention, all kernel prefixes are lowercase. If you don't want to
|
||||
declare everything as <type>static</type>, another option is to declare a
|
||||
<varname>symbol table</varname> and register it with the kernel. We'll get to this
|
||||
later.</para>
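<para>As a small illustration of the <type>static</type>-plus-prefix habit, here is a sketch of what one source file might look like. The <literal>mydrv_</literal> prefix and all of the names are invented for the example; the code is plain C and makes no kernel calls.</para>

```c
/* Everything private to this file is static, so it adds no symbols to
 * the global namespace.  The one function meant for other files carries
 * a made-up "mydrv_" prefix to keep it from clashing with anyone else's. */
static int open_count;          /* invisible outside this file */

static void bump(void)          /* invisible outside this file */
{
    open_count++;
}

int mydrv_note_open(void)       /* the only global symbol defined here */
{
    bump();
    return open_count;
}
```

<para>Another file could define its own <varname>open_count</varname> or <function>bump</function> without any conflict, because neither name ever leaves this file.</para>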
|
||||
|
||||
<para>The file <filename>/proc/ksyms</filename> holds all the symbols that the kernel
|
||||
knows about and which are therefore accessible to your modules since they share the
|
||||
kernel's codespace.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
|
||||
|
||||
<sect2><title>Code space</title>
|
||||
|
||||
<indexterm><primary>code space</primary></indexterm>
|
||||
<indexterm><primary>monolithic kernel</primary></indexterm>
|
||||
<indexterm><primary>Hurd</primary></indexterm>
|
||||
<indexterm><primary>Neutrino</primary></indexterm>
|
||||
<indexterm><primary>microkernel</primary></indexterm>
|
||||
|
||||
<para>Memory management is a very complicated subject---the majority of O'Reilly's
|
||||
`Understanding The Linux Kernel' is just on memory management! We're not setting out to
|
||||
be experts in memory management, but we do need to know a couple of facts to even begin
|
||||
worrying about writing real modules.</para>
|
||||
|
||||
<para>If you haven't thought about what a segfault really means, you may be surprised to
|
||||
hear that pointers don't actually point to memory locations. Not real ones, anyway.
|
||||
When a process is created, the kernel sets aside a portion of real physical memory and
|
||||
hands it to the process to use for its executing code, variables, stack, heap and other
|
||||
things which a computer scientist would know about<footnote><para>I'm a physicist, not a
|
||||
computer scientist, Jim!</para></footnote>. This memory begins with 0 and extends up
|
||||
to whatever it needs to be. Since the memory spaces of any two processes don't overlap,
|
||||
every process that can access a memory address, say <literal>0xbffff978</literal>, would
|
||||
be accessing a different location in real physical memory! The processes would be
|
||||
accessing an index named <literal>0xbffff978</literal> which points to some kind of
|
||||
offset into the region of memory set aside for that particular process. For the most
|
||||
part, a process like our Hello, World program can't access the space of another process,
|
||||
although there are ways which we'll talk about later.</para>
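<para>You can watch this happen from user space. The following sketch forks a child, lets the child overwrite a variable, and then compares what each process sees at the same address. The names <function>compare_views</function>, <type>struct view</type> and <varname>marker</varname> are invented for the example.</para>

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int marker = 1;

struct view {
    uintptr_t addr;   /* address of marker, as this process sees it */
    int value;        /* what this process finds stored there */
};

/* Forks a child that overwrites its own copy of marker, then collects
 * both processes' views of the variable.  Returns 0 on success, -1 on
 * failure. */
int compare_views(struct view *parent, struct view *child)
{
    int fds[2];
    pid_t pid;
    ssize_t got;

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                       /* child */
        struct view mine;

        marker = 2;                       /* touches only the child's copy */
        mine.addr = (uintptr_t)&marker;
        mine.value = marker;
        close(fds[0]);
        if (write(fds[1], &mine, sizeof mine) != (ssize_t)sizeof mine)
            _exit(1);
        _exit(0);
    }

    close(fds[1]);                        /* parent */
    got = read(fds[0], child, sizeof *child);
    close(fds[0]);
    waitpid(pid, NULL, 0);

    parent->addr = (uintptr_t)&marker;
    parent->value = marker;
    return got == (ssize_t)sizeof *child ? 0 : -1;
}
```

<para>Both processes report the same address for <varname>marker</varname>, yet the parent still reads 1 while the child reads 2. The same pointer value leads to two different places in physical memory.</para>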
|
||||
|
||||
<para>The kernel has its own space of memory as well. Since a module is code which can
|
||||
be dynamically inserted and removed in the kernel (as opposed to a semi-autonomous
|
||||
object), it shares the kernel's codespace rather than having its own. Therefore, if
|
||||
your module segfaults, the kernel segfaults. And if you start writing over data because
|
||||
of an off-by-one error, then you're trampling on kernel code. This is even worse than
|
||||
it sounds, so try your best to be careful.</para>
|
||||
|
||||
<para>By the way, I would like to point out that the above discussion is true for any
|
||||
operating system which uses a monolithic kernel<footnote><para>This isn't quite the same
|
||||
thing as `building all your modules into the kernel', although the idea is the
|
||||
same.</para></footnote>. There are things called microkernels which have modules which
|
||||
get their own codespace. The GNU Hurd and QNX Neutrino are two examples of a
|
||||
microkernel.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
|
||||
|
||||
<sect2><title>Device Drivers</title>
|
||||
|
||||
<para>One class of module is the device driver, which provides functionality for
|
||||
hardware like a TV card or a serial port. On Unix, each piece of hardware is
|
||||
represented by a file located in <filename role="directory">/dev</filename> named a
|
||||
<filename>device file</filename> which provides the means to communicate with the
|
||||
hardware. The device driver provides the communication on behalf of a user program. So
|
||||
the <filename>es1370.o</filename> sound card device driver might connect the <filename
|
||||
role="devicefile">/dev/sound</filename> device file to the Ensoniq ES1370 sound card. A
|
||||
userspace program like mp3blaster can use <filename
|
||||
role="devicefile">/dev/sound</filename> without ever knowing what kind of sound card is
|
||||
installed.</para>
|
||||
|
||||
|
||||
<sect3><title>Major and Minor Numbers</title>
|
||||
|
||||
<indexterm><primary>major number</primary></indexterm>
|
||||
<indexterm><primary>minor number</primary></indexterm>
|
||||
|
||||
<para>Let's look at some device files. Here are device files which represent the
|
||||
first three partitions on the primary master IDE hard drive:</para>
|
||||
|
||||
<screen>
|
||||
# ls -l /dev/hda[1-3]
|
||||
brw-rw---- 1 root disk 3, 1 Jul 5 2000 /dev/hda1
|
||||
brw-rw---- 1 root disk 3, 2 Jul 5 2000 /dev/hda2
|
||||
brw-rw---- 1 root disk 3, 3 Jul 5 2000 /dev/hda3
|
||||
</screen>
|
||||
|
||||
<para>Notice the column of numbers separated by a comma? The first number is
|
||||
called the device's major number. The second number is the minor number. The
|
||||
major number tells you which driver is used to access the hardware. Each driver
|
||||
is assigned a unique major number; all device files with the same major number are
|
||||
controlled by the same driver. All the above major numbers are 3, because they're
|
||||
all controlled by the same driver.</para>
|
||||
|
||||
<para>The minor number is used by the driver to distinguish between the various
|
||||
hardware it controls. Returning to the example above, although all three devices
|
||||
are handled by the same driver they have unique minor numbers because the driver
|
||||
sees them as being different pieces of hardware.</para>
|
||||
|
||||
<para> Devices are divided into two types: character devices and block devices.
|
||||
The difference is that block devices have a buffer for requests, so they can
|
||||
choose the best order in which to respond to the requests. This is important in
|
||||
the case of storage devices, where it's faster to read or write sectors which are
|
||||
close to each other, rather than those which are further apart. Another
|
||||
difference is that block devices can only accept input and return output in blocks
|
||||
(whose size can vary according to the device), whereas character devices are
|
||||
allowed to use as many or as few bytes as they like. Most devices in the world
|
||||
are character, because they don't need this type of buffering, and they don't
|
||||
operate with a fixed block size. You can tell whether a device file is for a
|
||||
block device or a character device by looking at the first character in the output
|
||||
of <command>ls -l</command>. If it's `b' then it's a block device, and if it's `c'
|
||||
then it's a character device. The devices you see above are block devices. Here
|
||||
are some character devices (the serial ports):</para>
|
||||
|
||||
<screen>
|
||||
crw-rw---- 1 root dial 4, 64 Feb 18 23:34 /dev/ttyS0
|
||||
crw-r----- 1 root dial 4, 65 Nov 17 10:26 /dev/ttyS1
|
||||
crw-rw---- 1 root dial 4, 66 Jul 5 2000 /dev/ttyS2
|
||||
crw-rw---- 1 root dial 4, 67 Jul 5 2000 /dev/ttyS3
|
||||
</screen>
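<para>The same information that <command>ls -l</command> prints is available to a program through <function>stat()</function>. Here is a minimal user-space sketch; <function>device_numbers</function> is a made-up name, and the <function>major()</function> and <function>minor()</function> macros are assumed to come from glibc's <filename role="headerfile">sys/sysmacros.h</filename>.</para>

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major() and minor() on glibc */
#include <sys/types.h>

/* Looks up the device file at path.  Returns 'b' for a block device,
 * 'c' for a character device, 0 for anything else, and -1 if stat()
 * fails.  For devices, *maj and *min receive the two numbers. */
int device_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (stat(path, &st) < 0)
        return -1;
    if (!S_ISBLK(st.st_mode) && !S_ISCHR(st.st_mode))
        return 0;

    *maj = major(st.st_rdev);
    *min = minor(st.st_rdev);
    return S_ISBLK(st.st_mode) ? 'b' : 'c';
}
```

<para>On a Linux system, calling it on <filename role="devicefile">/dev/null</filename> should report a character device with major 1 and minor 3.</para>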
|
||||
|
||||
<para> If you want to see which major numbers have been assigned, you can look at
|
||||
<filename>/usr/src/linux/Documentation/devices.txt</filename>. </para>
|
||||
|
||||
<indexterm><primary>mknod</primary></indexterm>
|
||||
<indexterm><primary>coffee</primary></indexterm>
|
||||
|
||||
<para>When the system was installed, all of those device files were created by the
|
||||
<command>mknod</command> command. To create a new char device named `coffee' with
|
||||
major/minor number <literal>12</literal> and <literal>2</literal>, simply do
|
||||
<command>mknod /dev/coffee c 12 2</command>. You don't <emphasis>have</emphasis>
|
||||
to put your device files into <filename role="directory">/dev</filename>, but it's
|
||||
done by convention. Linus put his device files in <filename>/dev</filename>, and
|
||||
so should you. However, when creating a device file for testing purposes, it's
|
||||
probably OK to place it in your working directory where you compile the kernel
|
||||
module. Just be sure to put it in the right place when you're done writing the
|
||||
device driver.</para>
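<para>The <command>mknod</command> command is a thin wrapper around the <function>mknod()</function> system call. As a rough sketch, the command above could be written in C like this; <function>make_char_node</function> is a made-up name, the permission bits 0660 are an arbitrary choice, and <function>makedev()</function> is assumed to come from glibc's <filename role="headerfile">sys/sysmacros.h</filename>.</para>

```c
#define _DEFAULT_SOURCE      /* ask glibc for mknod() and S_IFCHR */
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* makedev() on glibc */
#include <sys/types.h>

/* Creates a character device file, as "mknod path c maj min" would.
 * Returns 0 on success or -1 on failure; creating device files
 * normally requires root. */
int make_char_node(const char *path, unsigned int maj, unsigned int min)
{
    return mknod(path, S_IFCHR | 0660, makedev(maj, min));
}
```

<para>Run as root, <literal>make_char_node("/dev/coffee", 12, 2)</literal> would create the `coffee' device file. <function>makedev()</function> packs the two numbers into a single <type>dev_t</type>, and <function>major()</function> and <function>minor()</function> take them apart again.</para>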
|
||||
|
||||
<para>I would like to make a few last points which are implicit from the above
|
||||
discussion, but I'd like to make them explicit just in case. When a device file
|
||||
is accessed, the kernel uses the major number of the file to determine which
|
||||
driver should be used to handle the access. This means that the kernel doesn't
|
||||
really need to use or even know about the minor number. The driver itself is the
|
||||
only thing that cares about the minor number. It uses the minor number to
|
||||
distinguish between different pieces of hardware.</para>
|
||||
|
||||
<para>By the way, when I say `hardware', I mean something a bit more abstract than
|
||||
a PCI card that you can hold in your hand. Look at these two device
|
||||
files:</para>
|
||||
|
||||
<screen>
|
||||
% ls -l /dev/fd0 /dev/fd0u1680
|
||||
brwxrwxrwx 1 root floppy 2, 0 Jul 5 2000 /dev/fd0
|
||||
brw-rw---- 1 root floppy 2, 44 Jul 5 2000 /dev/fd0u1680
|
||||
</screen>
|
||||
|
||||
<para>By now you can look at these two device files and know instantly that they
|
||||
are block devices and are handled by the same driver (block major
|
||||
<literal>2</literal>). You might even be aware that these both represent your
|
||||
floppy drive, even if you only have one floppy drive. Why two files? One
|
||||
represents the floppy drive with <literal>1.44</literal> <acronym>MB</acronym> of
|
||||
storage. The other is the <emphasis>same</emphasis> floppy drive with
|
||||
<literal>1.68</literal> <acronym>MB</acronym> of storage, and corresponds to what
|
||||
some people call a `superformatted' disk. One that holds more data than a
|
||||
standard formatted floppy. So here's a case where two device files with different
|
||||
minor numbers actually represent the same piece of physical hardware. So just be
|
||||
aware that the word `hardware' in our discussion can mean something very
|
||||
abstract.</para>
|
||||
|
||||
</sect3>
|
||||
|
||||
</sect2>
|
||||
|
||||
</sect1>
|
||||
|
||||
|
||||
<!--
|
||||
vim:textwidth=96
|
||||
-->
|
|
@ -0,0 +1,456 @@
|
|||
<sect1><title>Character Device Drivers</title>
|
||||
|
||||
<indexterm><primary>device file</primary><secondary>character</secondary>
|
||||
</indexterm>
|
||||
|
||||
<sect2><title>The <type>file_operations</type> Structure</title>
|
||||
|
||||
<indexterm><primary>file_operations</primary></indexterm>
|
||||
|
||||
<para>The <type>file_operations</type> structure is defined in <filename
|
||||
role="headerfile">linux/fs.h</filename>, and holds pointers to functions
|
||||
defined by the driver that perform various operations on the device. Each
|
||||
field of the structure corresponds to the address of some function defined
|
||||
by the driver to handle a requested operation.</para>
|
||||
|
||||
<para> For example, every character driver needs to define a function that
|
||||
reads from the device. The <type>file_operations</type> structure holds
|
||||
the address of the module's function that performs that operation. Here is
|
||||
what the definition looks like for kernel <literal>2.4.2</literal>:</para>
|
||||
|
||||
<screen>
|
||||
struct file_operations {
|
||||
struct module *owner;
|
||||
loff_t (*llseek) (struct file *, loff_t, int);
|
||||
ssize_t (*read) (struct file *, char *, size_t, loff_t *);
|
||||
ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
|
||||
int (*readdir) (struct file *, void *, filldir_t);
|
||||
unsigned int (*poll) (struct file *, struct poll_table_struct *);
|
||||
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
|
||||
int (*mmap) (struct file *, struct vm_area_struct *);
|
||||
int (*open) (struct inode *, struct file *);
|
||||
int (*flush) (struct file *);
|
||||
int (*release) (struct inode *, struct file *);
|
||||
int (*fsync) (struct file *, struct dentry *, int datasync);
|
||||
int (*fasync) (int, struct file *, int);
|
||||
int (*lock) (struct file *, int, struct file_lock *);
|
||||
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
|
||||
loff_t *);
|
||||
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
|
||||
loff_t *);
|
||||
};
|
||||
</screen>
|
||||
|
||||
<para>Some operations are not implemented by a driver. For example, a
|
||||
driver that handles a video card won't need to read from a directory
|
||||
structure. The corresponding entries in the <type>file_operations</type>
|
||||
structure should be set to <varname>NULL</varname>.</para>
|
||||
|
||||
<para>There is a gcc extension that makes assigning to this structure more
|
||||
convenient. You'll see it in modern drivers, and it may catch you by
|
||||
surprise. This is what the new way of assigning to the structure looks
|
||||
like:</para>
|
||||
|
||||
|
||||
<screen>
|
||||
struct file_operations fops = {
|
||||
read: device_read,
|
||||
write: device_write,
|
||||
open: device_open,
|
||||
release: device_release
|
||||
};
|
||||
</screen>
|
||||
|
||||
<para>However, there's also a C99 way of assigning to elements of a
|
||||
structure. The version of gcc I'm currently using,
|
||||
<literal>2.95</literal>, supports the new C99 syntax. You should use this
|
||||
syntax in case someone wants to port your driver. It will help with
|
||||
compatibility:</para>
|
||||
|
||||
|
||||
<screen>
|
||||
struct file_operations fops = {
|
||||
.read = device_read,
|
||||
.write = device_write,
|
||||
.open = device_open,
|
||||
.release = device_release
|
||||
};
|
||||
</screen>
|
||||
|
||||
<para>The meaning is clear, and you should be aware that any member of the
|
||||
structure which you don't explicitly assign will be initialized to
|
||||
<varname>NULL</varname> by gcc.</para>
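<para>To see the effect without touching the kernel, here is a cut-down user-space stand-in for the structure. Everything in it (<type>struct toy_operations</type>, <varname>toy_fops</varname> and the <literal>toy_</literal> functions) is invented for the illustration, and returning -1 for a missing operation is an arbitrary choice, not a claim about what the kernel returns.</para>

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* A cut-down, user-space stand-in for struct file_operations. */
struct toy_operations {
    ssize_t (*read)(char *buf, size_t len);
    ssize_t (*write)(const char *buf, size_t len);
    int (*open)(void);
    int (*release)(void);
};

static ssize_t toy_read(char *buf, size_t len)
{
    static const char msg[] = "hi";
    size_t n = sizeof msg - 1;

    if (n > len)
        n = len;
    memcpy(buf, msg, n);
    return (ssize_t)n;
}

static int toy_open(void)
{
    return 0;
}

/* Only read and open are named; write and release are left out and are
 * therefore null pointers. */
static const struct toy_operations toy_fops = {
    .read = toy_read,
    .open = toy_open
};

/* Calls ops->write if the "driver" supplied one, and reports failure
 * with -1 otherwise. */
ssize_t toy_call_write(const struct toy_operations *ops,
                       const char *buf, size_t len)
{
    if (ops->write == NULL)
        return -1;
    return ops->write(buf, len);
}
```

<para>The caller has to test a pointer against <varname>NULL</varname> before using it, which is why leaving an operation unassigned is a safe way of saying `not supported'.</para>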
|
||||
|
||||
<para>A pointer to a <type>struct file_operations</type> is commonly named
|
||||
<varname>fops</varname>.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
|
||||
|
||||
<sect2><title>The <type>file</type> structure</title>
|
||||
|
||||
<indexterm><primary>file</primary></indexterm>
|
||||
<indexterm><primary>inode</primary></indexterm>
|
||||
|
||||
<para>Each device is represented in the kernel by a <type>file</type>
|
||||
structure, which is defined in <filename
|
||||
role="header">linux/fs.h</filename>. Be aware that a <type>file</type> is
|
||||
a kernel level structure and never appears in a user space program. It's
|
||||
not the same thing as a <type>FILE</type>, which is defined by glibc and
|
||||
would never appear in a kernel space function. Also, its name is a bit
|
||||
misleading; it represents an abstract open `file', not a file on a disk,
|
||||
which is represented by a structure named <type>inode</type>.</para>
|
||||
|
||||
<para>A pointer to a <varname>struct file</varname> is commonly named
|
||||
<function>filp</function>. You'll also see it referred to as
|
||||
<varname>struct file file</varname>. Resist the temptation.</para>
|
||||
|
||||
<para>Go ahead and look at the definition of <function>file</function>.
|
||||
Most of the entries you see, like <function>struct dentry</function> aren't
|
||||
used by device drivers, and you can ignore them. This is because drivers
|
||||
don't fill <varname>file</varname> directly; they only use structures
|
||||
contained in <varname>file</varname> which are created elsewhere.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
|
||||
|
||||
<sect2><title>Registering A Device</title>
|
||||
|
||||
<indexterm><primary>register_chrdev</primary></indexterm>
|
||||
<indexterm><primary>major number</primary>
|
||||
<secondary>dynamic allocation</secondary></indexterm>
|
||||
|
||||
<para>As discussed earlier, char devices are accessed through device files,
|
||||
usually located in <filename
|
||||
role="directory">/dev</filename><footnote><para>This is by convention.
|
||||
When writing a driver, it's OK to put the device file in your current
|
||||
directory. Just make sure you place it in <filename
|
||||
role="directory">/dev</filename> for a production driver</para></footnote>.
|
||||
The major number tells you which driver handles which device file. The
|
||||
minor number is used only by the driver itself to differentiate which
|
||||
device it's operating on, just in case the driver handles more than one
|
||||
device.</para>
|
||||
|
||||
<para> Adding a driver to your system means registering it with the kernel.
|
||||
This is synonymous with assigning it a major number during the module's
|
||||
initialization. You do this by using the
|
||||
<function>register_chrdev</function> function, defined by <filename
|
||||
role="headerfile">linux/fs.h</filename>.</para>
|
||||
|
||||
<screen>
|
||||
int register_chrdev(unsigned int major, const char *name,
|
||||
struct file_operations *fops);
|
||||
</screen>
|
||||
|
||||
<para>where <varname>unsigned int major</varname> is the major number you
|
||||
want to request, <varname>const char *name</varname> is the name of the
|
||||
device as it'll appear in <filename>/proc/devices</filename> and
|
||||
<varname>struct file_operations *fops</varname> is a pointer to the
|
||||
<varname>file_operations</varname> table for your driver. A negative
|
||||
return value means the registration failed. Note that we didn't pass
|
||||
the minor number to <function>register_chrdev</function>. That's because
|
||||
the kernel doesn't care about the minor number; only our driver uses it.
|
||||
</para>
|
||||
|
||||
<para>Now the question is, how do you get a major number without hijacking
|
||||
one that's already in use? The easiest way would be to look through
|
||||
<filename>Documentation/devices.txt</filename> and pick an unused one.
|
||||
That's a bad way of doing things because you'll never be sure if the number
|
||||
you picked will be assigned later. The answer is that you can ask the
|
||||
kernel to assign you a dynamic major number.</para>
|
||||
|
||||
<para> If you pass a major number of 0 to
|
||||
<function>register_chrdev</function>, the return value will be the
|
||||
dynamically allocated major number. The downside is that you can't make a
|
||||
device file in advance, since you don't know what the major number will be.
|
||||
There are three ways to deal with this. First, the driver itself can print
|
||||
the newly assigned number and we can make the device file by hand. Second,
|
||||
the newly registered device will have an entry in
|
||||
<filename>/proc/devices</filename>, and we can either make the device file
|
||||
by hand or write a shell script to read the file in and make the device
|
||||
file. Third, we can have our driver make the device file
|
||||
using the <function>mknod</function> system call after a successful
|
||||
registration and remove it during the call to
|
||||
<function>cleanup_module</function>.</para>
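
<para>The second approach can be sketched in user space. The following is a
minimal sketch, not part of the original driver: the helper name
<function>find_char_major</function> is hypothetical, and it assumes the usual
<filename>/proc/devices</filename> layout, a "Character devices:" heading
followed by lines of the form "major name". It takes the file's text as a
string so it can be tried without a loaded module.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up a driver's major number in text formatted like /proc/devices.
 * Returns the major number, or -1 if the name is not listed in the
 * "Character devices:" section.  (Hypothetical helper, for illustration.) */
int find_char_major(const char *devices_text, const char *name)
{
    int in_char_section = 0;
    const char *line = devices_text;

    assert(devices_text != NULL && name != NULL);

    while (*line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        char buf[128];
        char dev[64];
        int major;

        /* Copy one line into a private, NUL-terminated buffer */
        if (len >= sizeof(buf))
            len = sizeof(buf) - 1;
        memcpy(buf, line, len);
        buf[len] = '\0';

        if (strncmp(buf, "Character devices:", 18) == 0)
            in_char_section = 1;
        else if (strncmp(buf, "Block devices:", 14) == 0)
            in_char_section = 0;
        else if (in_char_section &&
                 sscanf(buf, "%d %63s", &major, dev) == 2 &&
                 strcmp(dev, name) == 0)
            return major;

        if (!end)
            break;
        line = end + 1;
    }
    return -1;
}
```
]]></programlisting>

<para>A program would read <filename>/proc/devices</filename> into a buffer,
call the helper with the name the driver registered, and use the result as the
major number when making the device file.</para>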
|
||||
|
||||
</sect2>
|
||||
|
||||
|
||||
|
||||
<sect2><title>Unregistering A Device</title>
|
||||
|
||||
<indexterm><primary>rmmod</primary><secondary>preventing</secondary>
|
||||
</indexterm>
|
||||
|
||||
<para>We can't allow the kernel module to be
|
||||
<application>rmmod</application>'ed whenever root feels like it. If the
|
||||
device file is opened by a process and then we remove the kernel module,
|
||||
using the file would cause a call to the memory location where the
|
||||
appropriate function (read/write) used to be. If we're lucky, no other
|
||||
code was loaded there, and we'll get an ugly error message. If we're
|
||||
unlucky, another kernel module was loaded into the same location, which
|
||||
means a jump into the middle of another function within the kernel. The
|
||||
results of this would be impossible to predict, but they can't be very
|
||||
positive.</para>
|
||||
|
||||
<para> Normally, when you don't want to allow something, you return an
|
||||
error code (a negative number) from the function which is supposed to do
|
||||
it. With <function>cleanup_module</function> that's impossible because
|
||||
it's a void function. However, there's a counter which keeps track of how
|
||||
many processes are using your module. You can see what its value is by
|
||||
looking at the 3rd field of <filename>/proc/modules</filename>. If this
|
||||
number isn't zero, <function>rmmod</function> will fail. Note that you
|
||||
don't have to check the counter from within
|
||||
<function>cleanup_module</function> because the check will be performed for
|
||||
you by the system call <function>sys_delete_module</function>, defined in
|
||||
<filename>kernel/module.c</filename>. You shouldn't use this counter
|
||||
directly, but there are macros defined in <filename
|
||||
role="headerfile">linux/module.h</filename> which let you increase,
|
||||
decrease and display this counter:</para>
|
||||
|
||||
<itemizedlist>
|
||||
|
||||
<listitem><para><varname>MOD_INC_USE_COUNT</varname>: Increment the use
|
||||
count.</para></listitem>
|
||||
|
||||
<listitem><para><varname>MOD_DEC_USE_COUNT</varname>: Decrement the use count.
|
||||
</para></listitem>
|
||||
|
||||
<listitem><para><varname>MOD_IN_USE</varname>: Evaluates to nonzero while the module is in use.
|
||||
</para></listitem>
|
||||
|
||||
</itemizedlist>
|
||||
|
||||
<para>It's important to keep the counter accurate; if you ever do lose
|
||||
track of the correct usage count, you'll never be able to unload the
|
||||
module; it's now reboot time, boys and girls. This is bound to happen to
|
||||
you sooner or later during a module's development.</para>
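
<para>The bookkeeping rule can be illustrated with a small user-space model.
This is only a sketch under stated assumptions: the variable
<varname>use_count</varname> and the three <function>model_</function>
functions are hypothetical stand-ins, and the real macros operate on the
kernel's own module structure. The point is that every increment needs exactly
one matching decrement, otherwise the module can never be unloaded.</para>

<programlisting><![CDATA[
```c
#include <assert.h>

/* Stands in for the kernel's per-module use counter (hypothetical) */
static int use_count = 0;

/* What a driver's open method does with MOD_INC_USE_COUNT */
void model_open(void)
{
    use_count++;
}

/* What a driver's release method does with MOD_DEC_USE_COUNT */
void model_release(void)
{
    assert(use_count > 0);   /* a decrement without an open is a bug */
    use_count--;
}

/* The check rmmod relies on: unloading is allowed only at zero */
int model_can_unload(void)
{
    return use_count == 0;
}
```
]]></programlisting>

<para>If an error path returns after the increment but skips the decrement,
<function>model_can_unload</function> stays false forever, which is the
user-space equivalent of having to reboot.</para>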
|
||||
|
||||
<indexterm><primary>MOD_INC_USE_COUNT</primary></indexterm>
|
||||
<indexterm><primary>MOD_DEC_USE_COUNT</primary></indexterm>
|
||||
<indexterm><primary>MOD_IN_USE</primary></indexterm>
|
||||
|
||||
</sect2>
|
||||
|
||||
|
||||
|
||||
<sect2><title>chardev.c</title>
|
||||
|
||||
<para>The next code sample creates a char driver named
|
||||
<filename>chardev</filename>. You can <filename>cat</filename> its device
|
||||
file (or <filename>open</filename> the file with a program) and the driver
|
||||
will put the number of times the device file has been read from into the
|
||||
file. We don't support writing to the file (like <command>echo "hi" >
|
||||
/dev/hello</command>), but catch these attempts and tell the user that the
|
||||
operation isn't supported. Don't worry if you don't see what we do with
|
||||
the data we read into the buffer; we don't do much with it. We simply read
|
||||
in the data and print a message acknowledging that we received it.</para>
|
||||
|
||||
<example><title>chardev.c</title>
|
||||
<programlisting><![CDATA[
|
||||
/*
|
||||
* chardev.c: Creates a read-only char device that says how many times
|
||||
* you've read from the dev file
|
||||
*/
|
||||
|
||||
#if defined(CONFIG_MODVERSIONS) && ! defined(MODVERSIONS)
|
||||
#include <linux/modversions.h>
|
||||
#define MODVERSIONS
|
||||
#endif
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/fs.h>
|
||||
#include <asm/uaccess.h> /* for put_user */
|
||||
|
||||
/* Prototypes - this would normally go in a .h file
|
||||
*/
|
||||
int init_module(void);
|
||||
void cleanup_module(void);
|
||||
static int device_open(struct inode *, struct file *);
|
||||
static int device_release(struct inode *, struct file *);
|
||||
static ssize_t device_read(struct file *, char *, size_t, loff_t *);
|
||||
static ssize_t device_write(struct file *, const char *, size_t, loff_t *);
|
||||
|
||||
#define SUCCESS 0
|
||||
#define DEVICE_NAME "chardev" /* Dev name as it appears in /proc/devices */
|
||||
#define BUF_LEN 80 /* Max length of the message from the device */
|
||||
|
||||
|
||||
/* Global variables are declared as static, so are global within the file. */
|
||||
|
||||
static int Major; /* Major number assigned to our device driver */
|
||||
static int Device_Open = 0;  /* Is device open?  Used to prevent multiple
                              * access to the device */
|
||||
static char msg[BUF_LEN]; /* The msg the device will give when asked */
|
||||
static char *msg_Ptr;
|
||||
|
||||
static struct file_operations fops = {
|
||||
.read = device_read,
|
||||
.write = device_write,
|
||||
.open = device_open,
|
||||
.release = device_release
|
||||
};
|
||||
|
||||
|
||||
/* Functions
|
||||
*/
|
||||
|
||||
int init_module(void)
|
||||
{
|
||||
Major = register_chrdev(0, DEVICE_NAME, &fops);
|
||||
|
||||
   if (Major < 0) {
|
||||
printk ("Registering the character device failed with %d\n", Major);
|
||||
return Major;
|
||||
}
|
||||
|
||||
printk("<1>I was assigned major number %d. To talk to\n", Major);
|
||||
printk("<1>the driver, create a dev file with\n");
|
||||
printk("'mknod /dev/hello c %d 0'.\n", Major);
|
||||
printk("<1>Try various minor numbers. Try to cat and echo to\n");
|
||||
printk("the device file.\n");
|
||||
printk("<1>Remove the device file and module when done.\n");
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
void cleanup_module(void)
|
||||
{
|
||||
/* Unregister the device */
|
||||
int ret = unregister_chrdev(Major, DEVICE_NAME);
|
||||
if (ret < 0) printk("Error in unregister_chrdev: %d\n", ret);
|
||||
}
|
||||
|
||||
|
||||
/* Methods
|
||||
*/
|
||||
|
||||
/* Called when a process tries to open the device file, like
|
||||
* "cat /dev/mycharfile"
|
||||
*/
|
||||
static int device_open(struct inode *inode, struct file *file)
|
||||
{
|
||||
static int counter = 0;
|
||||
if (Device_Open) return -EBUSY;
|
||||
Device_Open++;
|
||||
   sprintf(msg,"I already told you %d times Hello world!\n", counter++);
|
||||
msg_Ptr = msg;
|
||||
MOD_INC_USE_COUNT;
|
||||
|
||||
return SUCCESS;
|
||||
}
|
||||
|
||||
|
||||
/* Called when a process closes the device file.
|
||||
*/
|
||||
static int device_release(struct inode *inode, struct file *file)
|
||||
{
|
||||
Device_Open --; /* We're now ready for our next caller */
|
||||
|
||||
/* Decrement the usage count, or else once you opened the file, you'll
|
||||
      never get rid of the module. */
|
||||
MOD_DEC_USE_COUNT;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
/* Called when a process, which already opened the dev file, attempts to
|
||||
read from it.
|
||||
*/
|
||||
static ssize_t device_read(struct file *filp,
|
||||
char *buffer, /* The buffer to fill with data */
|
||||
size_t length, /* The length of the buffer */
|
||||
loff_t *offset) /* Our offset in the file */
|
||||
{
|
||||
/* Number of bytes actually written to the buffer */
|
||||
int bytes_read = 0;
|
||||
|
||||
/* If we're at the end of the message, return 0 signifying end of file */
|
||||
if (*msg_Ptr == 0) return 0;
|
||||
|
||||
/* Actually put the data into the buffer */
|
||||
while (length && *msg_Ptr) {
|
||||
|
||||
/* The buffer is in the user data segment, not the kernel segment;
|
||||
* assignment won't work. We have to use put_user which copies data from
|
||||
* the kernel data segment to the user data segment. */
|
||||
put_user(*(msg_Ptr++), buffer++);
|
||||
|
||||
length--;
|
||||
bytes_read++;
|
||||
}
|
||||
|
||||
/* Most read functions return the number of bytes put into the buffer */
|
||||
return bytes_read;
|
||||
}
|
||||
|
||||
|
||||
/* Called when a process writes to dev file: echo "hi" > /dev/hello */
|
||||
static ssize_t device_write(struct file *filp,
|
||||
const char *buff,
|
||||
size_t len,
|
||||
loff_t *off)
|
||||
{
|
||||
printk ("<1>Sorry, this operation isn't supported.\n");
|
||||
return -EINVAL;
|
||||
}
|
||||
]]></programlisting>
|
||||
</example>
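
<para>To see why <function>device_read</function> returns 0 once the message is
exhausted, here is a user-space model of its copy loop. It is a sketch, not
kernel code: plain assignment replaces <function>put_user</function>, and
<function>model_set_message</function> and <function>model_read</function> are
hypothetical names. A program such as <command>cat</command> keeps calling
read until it gets 0, so a short buffer simply means more calls.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Plays the role of msg_Ptr in chardev.c (hypothetical model) */
static const char *model_msg_ptr = "";

/* Plays the role of device_open: point at the start of the message */
void model_set_message(const char *msg)
{
    assert(msg != NULL);
    model_msg_ptr = msg;
}

/* Plays the role of device_read: copy at most length bytes and
 * return how many were copied; 0 means end of file. */
size_t model_read(char *buffer, size_t length)
{
    size_t bytes_read = 0;

    assert(buffer != NULL);

    while (length && *model_msg_ptr) {
        *buffer++ = *model_msg_ptr++;   /* put_user() in the real driver */
        length--;
        bytes_read++;
    }
    return bytes_read;
}
```
]]></programlisting>

<para>Because the position is kept in the pointer rather than in the offset
argument, each call continues where the previous one stopped.</para>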
|
||||
|
||||
</sect2>
|
||||
|
||||
|
||||
|
||||
<sect2><title>Writing Modules for Multiple Kernel Versions</title>
|
||||
|
||||
<indexterm><primary>kernel versions</primary></indexterm>
|
||||
<indexterm><primary>LINUX_VERSION_CODE</primary></indexterm>
|
||||
<indexterm><primary>KERNEL_VERSION</primary></indexterm>
|
||||
|
||||
<para>The system calls, which are the major interface the kernel shows to
|
||||
the processes, generally stay the same across versions. A new system call
|
||||
may be added, but usually the old ones will behave exactly like they used
|
||||
to. This is necessary for backward compatibility -- a new kernel version is
|
||||
not supposed to break regular processes. In most cases, the device files
|
||||
will also remain the same. On the other hand, the internal interfaces
|
||||
within the kernel can and do change between versions.</para>
|
||||
|
||||
<para>The Linux kernel versions are divided between the stable versions
|
||||
(n.&lt;even number&gt;.m) and the development versions (n.&lt;odd
number&gt;.m). The development versions include all the cool new ideas,
|
||||
including those which will be considered a mistake, or reimplemented, in
|
||||
the next version. As a result, you can't trust the interface to remain the
|
||||
same in those versions (which is why I don't bother to support them in this
|
||||
book; it's too much work and it would become dated too quickly). In the
|
||||
stable versions, on the other hand, we can expect the interface to remain
|
||||
the same regardless of the bug fix version (the m number).</para>
|
||||
|
||||
<para>There are differences between different kernel versions, and if you
|
||||
want to support multiple kernel versions, you'll find yourself having to
|
||||
code conditional compilation directives. The way to do this is to compare the
|
||||
macro <varname>LINUX_VERSION_CODE</varname> to the macro
|
||||
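<varname>KERNEL_VERSION</varname>. In version <varname>a.b.c</varname> of
the kernel, the value of <varname>LINUX_VERSION_CODE</varname> is
65536*a + 256*b + c (that is, 2^16*a + 2^8*b + c), which is exactly what
<varname>KERNEL_VERSION(a,b,c)</varname> computes. Be aware that
<varname>KERNEL_VERSION</varname> is not defined for kernel 2.0.35 and earlier, so if you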
|
||||
want to write modules that support really old kernels, you'll have to
|
||||
define it yourself, like:</para>
|
||||
|
||||
<example><title>Defining KERNEL_VERSION for old kernels</title>
|
||||
<programlisting>
|
||||
#if LINUX_VERSION_CODE &lt;= 131107   /* 131107 = 2*65536 + 0*256 + 35, i.e. 2.0.35 */
|
||||
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
|
||||
#endif
|
||||
</programlisting>
|
||||
</example>
|
||||
|
||||
<para>Of course since these are macros, you can also use <command>#ifndef
|
||||
KERNEL_VERSION</command> to test the existence of the macro, rather than
|
||||
testing the version of the kernel.</para>
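
<para>As a worked example of the arithmetic, the following user-space sketch
packs and unpacks version codes. The macro is the same one shown in this
section; <varname>struct kver</varname> and
<function>decode_version</function> are hypothetical names introduced only for
illustration.</para>

<programlisting><![CDATA[
```c
#include <assert.h>

#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

/* The three parts of a kernel version (hypothetical helper type) */
struct kver {
    int major;
    int minor;
    int patch;
};

/* Split a version code back into a.b.c by undoing the packing above */
struct kver decode_version(int code)
{
    struct kver v;

    assert(code >= 0);
    v.major = code >> 16;           /* undo a * 65536 */
    v.minor = (code >> 8) & 0xff;   /* undo b * 256   */
    v.patch = code & 0xff;          /* c              */
    return v;
}
```
]]></programlisting>

<para>Since each part occupies its own byte, comparing two codes with the
ordinary integer operators orders them the same way as comparing the versions
themselves, which is what makes the conditional compilation tests
work.</para>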
|
||||
|
||||
</sect2>
|
||||
|
||||
</sect1>
|
|
@ -0,0 +1,228 @@
|
|||
<sect1><title>The /proc File System</title>
|
||||
<!-- \index{proc file system} \index{/proc file system}
|
||||
\index{file system\\/proc} -->
|
||||
|
||||
<para>In Linux there is an additional mechanism for the kernel and kernel
|
||||
modules to send information to processes --- the <filename
|
||||
role="directory">/proc</filename> file system. Originally designed to allow
|
||||
easy access to information about processes (hence the name), it is now used
|
||||
by every bit of the kernel which has something interesting to report, such as
|
||||
<filename>/proc/modules</filename> which has the list of modules and
|
||||
<filename>/proc/meminfo</filename> which has memory usage statistics.</para>
|
||||
<!-- \index{/proc/modules} \index{/proc/meminfo} -->
|
||||
|
||||
<para>The method to use the proc file system is very similar to the one used
|
||||
with device drivers --- you create a structure with all the information
|
||||
needed for the <filename role="directory">/proc</filename> file, including
|
||||
pointers to any handler functions (in our case there is only one, the one
|
||||
called when somebody attempts to read from the <filename
|
||||
role="directory">/proc</filename> file). Then,
|
||||
<function>init_module</function> registers the structure with the kernel and
|
||||
<function>cleanup_module</function> unregisters it.</para>
|
||||
|
||||
<para>We use
<function>proc_register_dynamic</function><footnote><para>This applies to
version 2.0; in version 2.2 this is done for us automatically if we set the
inode to zero.</para></footnote> because we don't want to choose the inode
number for our file in advance. Instead, we let the kernel pick it, which
prevents clashes. Normal file systems are located on a disk, rather than
|
||||
just in memory (which is where <filename role="directory">/proc</filename>
|
||||
is), and in that case the inode number is a pointer to a
|
||||
disk location where the file's index-node (inode for short) is located. The
|
||||
inode contains information about the file, for example the file's
|
||||
permissions, together with a pointer to the disk location or locations where
|
||||
the file's data can be found.</para>
|
||||
|
||||
<!-- \index{proc\_register\_dynamic} \index{proc\_register} \index{inode} -->
|
||||
|
||||
<para>Because we don't get called when the file is opened or closed, there's
|
||||
nowhere for us to put <varname>MOD_INC_USE_COUNT</varname> and
|
||||
<varname>MOD_DEC_USE_COUNT</varname> in this module, and if the file is
|
||||
opened and then the module is removed, there's no way to avoid the
|
||||
consequences. In the next chapter we'll see a harder to implement, but more
|
||||
flexible, way of dealing with <filename role="directory">/proc</filename>
|
||||
files which will allow us to protect against this problem as well.</para>
|
||||
|
||||
|
||||
<example><title>procfs.c</title>
|
||||
<programlisting><![CDATA[
|
||||
/* procfs.c - create a "file" in /proc
|
||||
* Copyright (C) 2001 by Peter Jay Salzman
|
||||
*/
|
||||
|
||||
|
||||
/* The necessary header files */
|
||||
|
||||
/* Standard in kernel modules */
|
||||
#include <linux/kernel.h> /* We're doing kernel work */
|
||||
#include <linux/module.h> /* Specifically, a module */
|
||||
|
||||
/* Deal with CONFIG_MODVERSIONS */
|
||||
#if CONFIG_MODVERSIONS==1
|
||||
#define MODVERSIONS
|
||||
#include <linux/modversions.h>
|
||||
#endif
|
||||
|
||||
|
||||
/* Necessary because we use the proc fs */
|
||||
#include <linux/proc_fs.h>
|
||||
|
||||
|
||||
|
||||
/* In 2.2.3 /usr/include/linux/version.h includes a
|
||||
* macro for this, but 2.0.35 doesn't - so I add it
|
||||
* here if necessary. */
|
||||
#ifndef KERNEL_VERSION
|
||||
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
/* Put data into the proc fs file.
|
||||
|
||||
Arguments
|
||||
=========
|
||||
1. The buffer where the data is to be inserted, if
|
||||
you decide to use it.
|
||||
2. A pointer to a pointer to characters. This is
|
||||
useful if you don't want to use the buffer
|
||||
allocated by the kernel.
|
||||
3. The current position in the file.
|
||||
4. The size of the buffer in the first argument.
|
||||
5. Zero (for future use?).
|
||||
|
||||
|
||||
Usage and Return Value
|
||||
======================
|
||||
If you use your own buffer, like I do, put its
|
||||
location in the second argument and return the
|
||||
number of bytes used in the buffer.
|
||||
|
||||
A return value of zero means you have no further
|
||||
information at this time (end of file). A negative
|
||||
return value is an error condition.
|
||||
|
||||
|
||||
For More Information
|
||||
====================
|
||||
The way I discovered what to do with this function
|
||||
wasn't by reading documentation, but by reading the
|
||||
code which used it. I just looked to see what uses
|
||||
the get_info field of proc_dir_entry struct (I used a
|
||||
combination of find and grep, if you're interested),
|
||||
and I saw that it is used in <kernel source
|
||||
directory>/fs/proc/array.c.
|
||||
|
||||
If something is unknown about the kernel, this is
|
||||
usually the way to go. In Linux we have the great
|
||||
advantage of having the kernel source code for
|
||||
free - use it.
|
||||
*/
|
||||
int procfile_read(char *buffer,
|
||||
char **buffer_location,
|
||||
off_t offset,
|
||||
int buffer_length,
|
||||
int zero)
|
||||
{
|
||||
int len; /* The number of bytes actually used */
|
||||
|
||||
/* This is static so it will still be in memory
|
||||
* when we leave this function */
|
||||
static char my_buffer[80];
|
||||
|
||||
static int count = 1;
|
||||
|
||||
/* We give all of our information in one go, so if the
|
||||
* user asks us if we have more information the
|
||||
* answer should always be no.
|
||||
*
|
||||
* This is important because the standard read
|
||||
* function from the library would continue to issue
|
||||
* the read system call until the kernel replies
|
||||
* that it has no more information, or until its
|
||||
* buffer is filled.
|
||||
*/
|
||||
if (offset > 0)
|
||||
return 0;
|
||||
|
||||
/* Fill the buffer and get its length */
|
||||
len = sprintf(my_buffer,
|
||||
"For the %d%s time, go away!\n", count,
|
||||
(count % 100 > 10 && count % 100 < 14) ? "th" :
|
||||
(count % 10 == 1) ? "st" :
|
||||
(count % 10 == 2) ? "nd" :
|
||||
(count % 10 == 3) ? "rd" : "th" );
|
||||
count++;
|
||||
|
||||
/* Tell the function which called us where the
|
||||
* buffer is */
|
||||
*buffer_location = my_buffer;
|
||||
|
||||
/* Return the length */
|
||||
return len;
|
||||
}
|
||||
|
||||
|
||||
struct proc_dir_entry Our_Proc_File =
|
||||
{
|
||||
0, /* Inode number - ignore, it will be filled by
|
||||
* proc_register[_dynamic] */
|
||||
4, /* Length of the file name */
|
||||
"test", /* The file name */
|
||||
S_IFREG | S_IRUGO, /* File mode - this is a regular
|
||||
* file which can be read by its
|
||||
* owner, its group, and everybody
|
||||
* else */
|
||||
1, /* Number of links (directories where the
|
||||
* file is referenced) */
|
||||
0, 0, /* The uid and gid for the file - we give it
|
||||
* to root */
|
||||
80, /* The size of the file reported by ls. */
|
||||
NULL, /* functions which can be done on the inode
|
||||
* (linking, removing, etc.) - we don't
|
||||
* support any. */
|
||||
procfile_read, /* The read function for this file,
|
||||
* the function called when somebody
|
||||
* tries to read something from it. */
|
||||
NULL /* We could have here a function to fill the
|
||||
* file's inode, to enable us to play with
|
||||
* permissions, ownership, etc. */
|
||||
};
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
/* Initialize the module - register the proc file */
|
||||
int init_module()
|
||||
{
|
||||
/* Success if proc_register[_dynamic] is a success,
|
||||
* failure otherwise. */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
  /* In version 2.2, proc_register assigns a dynamic
|
||||
* inode number automatically if it is zero in the
|
||||
* structure , so there's no more need for
|
||||
* proc_register_dynamic
|
||||
*/
|
||||
return proc_register(&proc_root, &Our_Proc_File);
|
||||
#else
|
||||
return proc_register_dynamic(&proc_root, &Our_Proc_File);
|
||||
#endif
|
||||
|
||||
/* proc_root is the root directory for the proc
|
||||
* fs (/proc). This is where we want our file to be
|
||||
* located.
|
||||
*/
|
||||
}
|
||||
|
||||
|
||||
/* Cleanup - unregister our file from /proc */
|
||||
void cleanup_module()
|
||||
{
|
||||
proc_unregister(&proc_root, Our_Proc_File.low_ino);
|
||||
}
|
||||
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
</sect1>
|
|
@ -0,0 +1,411 @@
|
|||
<sect1><title>Using /proc For Input</title>
|
||||
|
||||
<!-- \label{proc-input}
|
||||
\index{Input\\using /proc for}
|
||||
\index{/proc\\using for input}
|
||||
\index{proc\\using for input} -->
|
||||
|
||||
<para>So far we have two ways to generate output from kernel modules: we can
|
||||
register a device driver and <command>mknod</command> a device file, or we
|
||||
can create a <filename role="directory">/proc</filename> file. This allows
|
||||
the kernel module to tell us anything it likes. The only problem is that
|
||||
there is no way for us to talk back. The first way we'll send input to kernel
|
||||
modules will be by writing back to the <filename
|
||||
role="directory">/proc</filename> file.</para>
|
||||
|
||||
<para>Because the proc filesystem was written mainly to allow the kernel to
|
||||
report its situation to processes, there are no special provisions for input.
|
||||
The <varname>struct proc_dir_entry</varname> doesn't include a pointer to an
|
||||
input function, the way it includes a pointer to an output function. Instead,
|
||||
to write into a <filename role="directory">/proc</filename> file, we need to
|
||||
use the standard filesystem mechanism.</para>
|
||||
|
||||
<!-- \index{proc\_dir\_entry structure}
|
||||
\index{struct proc\_dir\_entry} -->
|
||||
|
||||
<para>In Linux there is a standard mechanism for file system
|
||||
registration. Since every file system has to have its own functions to handle
|
||||
inode and file operations<footnote><para>The difference between the two is
|
||||
that file operations deal with the file itself, and inode operations deal
|
||||
with ways of referencing the file, such as creating links to
|
||||
it.</para></footnote>, there is a special structure to hold pointers to all
|
||||
those functions, <varname>struct inode_operations</varname>, which includes a
|
||||
pointer to <varname>struct file_operations</varname>. In /proc, whenever we
|
||||
register a new file, we're allowed to specify which <varname>struct
|
||||
inode_operations</varname> will be used for access to it. This is the
|
||||
mechanism we use, a <varname>struct inode_operations</varname> which includes
|
||||
a pointer to a <varname>struct file_operations</varname> which includes
|
||||
pointers to our <function>module_input</function> and
|
||||
<function>module_output</function> functions.</para>
|
||||
|
||||
<!-- \index{file system registration}
|
||||
\index{registration\\file system}
|
||||
\index{struct inode\_operations}
|
||||
\index{inode\_operations structure}
|
||||
\index{struct file\_operations}
|
||||
\index{file\_operations structure} -->
|
||||
|
||||
<para>It's important to note that the standard roles of read and write are
|
||||
reversed in the kernel. Read functions are used for output, whereas write
|
||||
functions are used for input. The reason for that is that read and write
|
||||
refer to the user's point of view --- if a process reads something from the
|
||||
kernel, then the kernel needs to output it, and if a process writes something
|
||||
to the kernel, then the kernel receives it as input.</para>
|
||||
|
||||
<!-- \index{read\\in the kernel}
|
||||
\index{write\\in the kernel} -->
|
||||
|
||||
<para>Another interesting point here is the
|
||||
<function>module_permission</function> function. This function is called
|
||||
whenever a process tries to do something with the <filename
|
||||
role="directory">/proc</filename> file, and it can decide whether to allow
|
||||
access or not. Right now it is only based on the operation and the uid of the
|
||||
current user (as available in <varname>current</varname>, a pointer to a
|
||||
structure which includes information on the currently running process), but
|
||||
it could be based on anything we like, such as what other processes are doing
|
||||
with the same file, the time of day, or the last input we received.</para>
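
<para>The decision made by such a permission function can be modelled in user
space. This is a sketch under stated assumptions, not the module's code: the
name <function>model_permission</function> and the <varname>OP_</varname>
constants are hypothetical, the effective uid is passed in as a parameter
instead of being read from <varname>current</varname>, and the values 1, 2 and
4 are assumed to match the kernel's execute, write and read masks.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <errno.h>

/* Assumed to mirror the kernel's operation masks (hypothetical names) */
#define OP_EXEC  1
#define OP_WRITE 2
#define OP_READ  4

/* Return 0 to allow the operation, or a negative errno to refuse it.
 * Everybody may read; only root (effective uid 0) may write. */
int model_permission(int op, unsigned int euid)
{
    assert(op == OP_EXEC || op == OP_WRITE || op == OP_READ);

    if (op == OP_READ || (op == OP_WRITE && euid == 0))
        return 0;

    /* Anything else, including execute, is denied */
    return -EACCES;
}
```
]]></programlisting>

<para>Any other policy, such as refusing access at certain times, would only
change the condition in the <literal>if</literal> statement.</para>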
|
||||
|
||||
<!-- \index{module\_permissions} \index{permissions} \index{current pointer}
|
||||
\index{pointer\\current} -->
|
||||
|
||||
<para>The reason for <function>put_user</function> and
|
||||
<function>get_user</function> is that Linux memory (under Intel architecture,
|
||||
it may be different under some other processors) is segmented. This means
|
||||
that a pointer, by itself, does not reference a unique location in memory,
|
||||
only a location in a memory segment, and you need to know which memory
|
||||
segment it is to be able to use it. There is one memory segment for the
|
||||
kernel, and one for each of the processes.</para>
|
||||
|
||||
<!-- \index{put\_user} \index{get\_user} \index{memory segments}
|
||||
\index{segment\\memory} -->
|
||||
|
||||
|
||||
<para>The only memory segment accessible to a process is its own, so when
|
||||
writing regular programs to run as processes, there's no need to worry about
|
||||
segments. When you write a kernel module, normally you want to access the
|
||||
kernel memory segment, which is handled automatically by the system. However,
|
||||
when the content of a memory buffer needs to be passed between the currently
|
||||
running process and the kernel, the kernel function receives a pointer to the
|
||||
memory buffer which is in the process segment. The
|
||||
<function>put_user</function> and <function>get_user</function> macros allow
|
||||
you to access that memory.</para>
|
||||
|
||||
|
||||
<example><title>procfs.c</title>
|
||||
<programlisting><![CDATA[
|
||||
/* procfs.c - create a "file" in /proc, which allows
|
||||
* both input and output. */
|
||||
|
||||
/* Copyright (C) 2001 by Peter Jay Salzman */
|
||||
|
||||
|
||||
/* The necessary header files */
|
||||
|
||||
/* Standard in kernel modules */
|
||||
#include <linux/kernel.h> /* We're doing kernel work */
|
||||
#include <linux/module.h> /* Specifically, a module */
|
||||
|
||||
/* Necessary because we use proc fs */
|
||||
#include <linux/proc_fs.h>
|
||||
|
||||
|
||||
/* In 2.2.3 /usr/include/linux/version.h includes a
|
||||
* macro for this, but 2.0.35 doesn't - so I add it
|
||||
* here if necessary. */
|
||||
#ifndef KERNEL_VERSION
|
||||
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
#include <asm/uaccess.h> /* for get_user and put_user */
|
||||
#endif
|
||||
|
||||
/* The module's file functions ********************** */
|
||||
|
||||
|
||||
/* Here we keep the last message received, to prove
|
||||
* that we can process our input */
|
||||
#define MESSAGE_LENGTH 80
|
||||
static char Message[MESSAGE_LENGTH];
|
||||
|
||||
|
||||
/* Since we use the file operations struct, we can't
|
||||
* use the special proc output provisions - we have to
|
||||
* use a standard read function, which is this function */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
static ssize_t module_output(
|
||||
struct file *file, /* The file read */
|
||||
char *buf, /* The buffer to put data to (in the
|
||||
* user segment) */
|
||||
size_t len, /* The length of the buffer */
|
||||
loff_t *offset) /* Offset in the file - ignore */
|
||||
#else
|
||||
static int module_output(
|
||||
struct inode *inode, /* The inode read */
|
||||
struct file *file, /* The file read */
|
||||
char *buf, /* The buffer to put data to (in the
|
||||
* user segment) */
|
||||
int len) /* The length of the buffer */
|
||||
#endif
|
||||
{
|
||||
static int finished = 0;
|
||||
int i;
|
||||
char message[MESSAGE_LENGTH+30];
|
||||
|
||||
/* We return 0 to indicate end of file, that we have
|
||||
* no more information. Otherwise, processes will
|
||||
* continue to read from us in an endless loop. */
|
||||
if (finished) {
|
||||
finished = 0;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* We use put_user to copy the string from the kernel's
|
||||
* memory segment to the memory segment of the process
|
||||
* that called us. get_user, BTW, is
|
||||
* used for the reverse. */
|
||||
sprintf(message, "Last input:%s", Message);
|
||||
for(i=0; i<len && message[i]; i++)
|
||||
put_user(message[i], buf+i);
|
||||
|
||||
|
||||
/* Notice, we assume here that the size of the message
|
||||
* is below len, or it will be received cut. In a real
|
||||
   * life situation, if the size of the message is greater
|
||||
* than len then we'd return len and on the second call
|
||||
* start filling the buffer with the len+1'th byte of
|
||||
* the message. */
|
||||
finished = 1;
|
||||
|
||||
return i; /* Return the number of bytes "read" */
|
||||
}
|
||||
|
||||
|
||||
/* This function receives input from the user when the
|
||||
* user writes to the /proc file. */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
static ssize_t module_input(
|
||||
struct file *file, /* The file itself */
|
||||
const char *buf, /* The buffer with input */
|
||||
size_t length, /* The buffer's length */
|
||||
loff_t *offset) /* offset to file - ignore */
|
||||
#else
|
||||
static int module_input(
|
||||
struct inode *inode, /* The file's inode */
|
||||
struct file *file, /* The file itself */
|
||||
const char *buf, /* The buffer with the input */
|
||||
int length) /* The buffer's length */
|
||||
#endif
|
||||
{
|
||||
int i;
|
||||
|
||||
/* Put the input into Message, where module_output
|
||||
* will later be able to use it */
|
||||
for(i=0; i<MESSAGE_LENGTH-1 && i<length; i++)
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
get_user(Message[i], buf+i);
|
||||
/* In version 2.2 the semantics of get_user changed,
|
||||
   * it no longer returns a character, but expects a
|
||||
* variable to fill up as its first argument and a
|
||||
   * user segment pointer to fill it from as its
|
||||
* second.
|
||||
*
|
||||
* The reason for this change is that the version 2.2
|
||||
   * get_user can also read a short or an int. The way
|
||||
* it knows the type of the variable it should read
|
||||
* is by using sizeof, and for that it needs the
|
||||
* variable itself.
|
||||
*/
|
||||
#else
|
||||
Message[i] = get_user(buf+i);
|
||||
#endif
|
||||
Message[i] = '\0'; /* we want a standard, zero
|
||||
* terminated string */
|
||||
|
||||
/* We need to return the number of input characters
|
||||
* used */
|
||||
return i;
|
||||
}
|
||||
|
||||
|
||||
|
||||
/* This function decides whether to allow an operation
|
||||
* (return zero) or not allow it (return a non-zero
|
||||
* which indicates why it is not allowed).
|
||||
*
|
||||
* The operation can be one of the following values:
|
||||
 * 1 - Execute (run the "file" - meaningless in our case)
|
||||
* 2 - Write (input to the kernel module)
|
||||
* 4 - Read (output from the kernel module)
|
||||
*
|
||||
* This is the real function that checks file
|
||||
* permissions. The permissions returned by ls -l are
|
||||
 * for reference only, and can be overridden here.
|
||||
*/
|
||||
static int module_permission(struct inode *inode, int op)
|
||||
{
|
||||
/* We allow everybody to read from our module, but
|
||||
* only root (uid 0) may write to it */
|
||||
if (op == 4 || (op == 2 && current->euid == 0))
|
||||
return 0;
|
||||
|
||||
/* If it's anything else, access is denied */
|
||||
return -EACCES;
|
||||
}
|
||||
|
||||
|
||||
|
||||
|
||||
/* The file is opened - we don't really care about
|
||||
* that, but it does mean we need to increment the
|
||||
* module's reference count. */
|
||||
int module_open(struct inode *inode, struct file *file)
|
||||
{
|
||||
MOD_INC_USE_COUNT;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
/* The file is closed - again, interesting only because
|
||||
* of the reference count. */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
int module_close(struct inode *inode, struct file *file)
|
||||
#else
|
||||
void module_close(struct inode *inode, struct file *file)
|
||||
#endif
|
||||
{
|
||||
MOD_DEC_USE_COUNT;
|
||||
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
return 0; /* success */
|
||||
#endif
|
||||
}
|
||||
|
||||
|
||||
/* Structures to register as the /proc file, with
|
||||
* pointers to all the relevant functions. ********** */
|
||||
|
||||
|
||||
|
||||
/* File operations for our proc file. This is where we
|
||||
* place pointers to all the functions called when
|
||||
* somebody tries to do something to our file. NULL
|
||||
* means we don't want to deal with something. */
|
||||
static struct file_operations File_Ops_4_Our_Proc_File =
|
||||
{
|
||||
NULL, /* lseek */
|
||||
module_output, /* "read" from the file */
|
||||
module_input, /* "write" to the file */
|
||||
NULL, /* readdir */
|
||||
NULL, /* select */
|
||||
NULL, /* ioctl */
|
||||
NULL, /* mmap */
|
||||
module_open, /* Somebody opened the file */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
NULL, /* flush, added here in version 2.2 */
|
||||
#endif
|
||||
module_close, /* Somebody closed the file */
|
||||
/* etc. etc. etc. (they are all given in
|
||||
* /usr/include/linux/fs.h). Since we don't put
|
||||
* anything here, the system will keep the default
|
||||
* data, which in Unix is zeros (NULLs when taken as
|
||||
* pointers). */
|
||||
};
|
||||
|
||||
|
||||
|
||||
/* Inode operations for our proc file. We need it so
|
||||
* we'll have some place to specify the file operations
|
||||
* structure we want to use, and the function we use for
|
||||
* permissions. It's also possible to specify functions
|
||||
* to be called for anything else which could be done to
|
||||
* an inode (although we don't bother, we just put
|
||||
* NULL). */
|
||||
static struct inode_operations Inode_Ops_4_Our_Proc_File =
|
||||
{
|
||||
&File_Ops_4_Our_Proc_File,
|
||||
NULL, /* create */
|
||||
NULL, /* lookup */
|
||||
NULL, /* link */
|
||||
NULL, /* unlink */
|
||||
NULL, /* symlink */
|
||||
NULL, /* mkdir */
|
||||
NULL, /* rmdir */
|
||||
NULL, /* mknod */
|
||||
NULL, /* rename */
|
||||
NULL, /* readlink */
|
||||
NULL, /* follow_link */
|
||||
NULL, /* readpage */
|
||||
NULL, /* writepage */
|
||||
NULL, /* bmap */
|
||||
NULL, /* truncate */
|
||||
module_permission /* check for permissions */
|
||||
};
|
||||
|
||||
|
||||
/* Directory entry */
|
||||
static struct proc_dir_entry Our_Proc_File =
|
||||
{
|
||||
0, /* Inode number - ignore, it will be filled by
|
||||
* proc_register[_dynamic] */
|
||||
7, /* Length of the file name */
|
||||
"rw_test", /* The file name */
|
||||
S_IFREG | S_IRUGO | S_IWUSR,
|
||||
/* File mode - this is a regular file which
|
||||
* can be read by its owner, its group, and everybody
|
||||
* else. Also, its owner can write to it.
|
||||
*
|
||||
* Actually, this field is just for reference, it's
|
||||
* module_permission that does the actual check. It
|
||||
* could use this field, but in our implementation it
|
||||
* doesn't, for simplicity. */
|
||||
1, /* Number of links (directories where the
|
||||
* file is referenced) */
|
||||
0, 0, /* The uid and gid for the file -
|
||||
* we give it to root */
|
||||
80, /* The size of the file reported by ls. */
|
||||
&Inode_Ops_4_Our_Proc_File,
|
||||
/* A pointer to the inode structure for
|
||||
* the file, if we need it. In our case we
|
||||
* do, because we need a write function. */
|
||||
NULL
|
||||
/* The read function for the file. Irrelevant,
|
||||
* because we put it in the inode structure above */
|
||||
};
|
||||
|
||||
|
||||
|
||||
/* Module initialization and cleanup ******************* */
|
||||
|
||||
/* Initialize the module - register the proc file */
|
||||
int init_module()
|
||||
{
|
||||
/* Success if proc_register[_dynamic] is a success,
|
||||
* failure otherwise */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
/* In version 2.2, proc_register assigns a dynamic
|
||||
* inode number automatically if it is zero in the
|
||||
* structure, so there's no more need for
|
||||
* proc_register_dynamic
|
||||
*/
|
||||
return proc_register(&proc_root, &Our_Proc_File);
|
||||
#else
|
||||
return proc_register_dynamic(&proc_root, &Our_Proc_File);
|
||||
#endif
|
||||
}
|
||||
|
||||
|
||||
/* Cleanup - unregister our file from /proc */
|
||||
void cleanup_module()
|
||||
{
|
||||
proc_unregister(&proc_root, Our_Proc_File.low_ino);
|
||||
}
|
||||
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
</sect1>
|
||||
|
|
@ -0,0 +1,646 @@
|
|||
<sect1><title>Talking to Device Files (writes and IOCTLs)</title>
|
||||
|
||||
<!-- \label{dev-input} \index{device files\\input to}
|
||||
\index{input to device files} \index{ioctl}
|
||||
\index{write\\to device files} -->
|
||||
|
||||
<para>Device files are supposed to represent physical devices. Most physical
|
||||
devices are used for output as well as input, so there has to be some
|
||||
mechanism for device drivers in the kernel to get the output to send to
|
||||
the device from processes. This is done by opening the device file for
|
||||
output and writing to it, just like writing to a file. In the following
|
||||
example, this is implemented by <function>device_write</function>.</para>
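<para>From the process side, such a write is an ordinary
<function>open</function> followed by <function>write</function>. The
user-space sketch below assumes only that a device file exists at the given
path. The helper name <function>write_to_device</function> is made up for
illustration and is not part of the example module.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `msg` to the device file at `path`.
 * Returns the number of bytes the driver accepted, or -1 on error. */
ssize_t write_to_device(const char *path, const char *msg)
{
    int fd;
    ssize_t written;

    assert(path != NULL && msg != NULL);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* This write() is what reaches the driver's write handler */
    written = write(fd, msg, strlen(msg));

    close(fd);
    return written;
}
```
]]></programlisting>

<para>The return value mirrors what the driver's write function returns, so a
short count means the driver took only part of the message.</para>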
|
||||
|
||||
<para>This is not always enough. Imagine you had a serial port connected to a
|
||||
modem (even if you have an internal modem, it is still implemented from the
|
||||
CPU's perspective as a serial port connected to a modem, so you don't have to
|
||||
tax your imagination too hard). The natural thing to do would be to use the
|
||||
device file to write things to the modem (either modem commands or data to be
|
||||
sent through the phone line) and read things from the modem (either responses
|
||||
for commands or the data received through the phone line). However, this
|
||||
leaves open the question of what to do when you need to talk to the serial
|
||||
port itself, for example to send the rate at which data is sent and
|
||||
received.</para>
|
||||
|
||||
<indexterm><primary>serial port</primary></indexterm>
|
||||
<indexterm><primary>modem</primary></indexterm>
|
||||
|
||||
<para>The answer in Unix is to use a special function called
|
||||
<function>ioctl</function> (short for Input Output ConTroL). Every device can
|
||||
have its own <function>ioctl</function> commands, which can be read
|
||||
<function>ioctl</function>'s (to send information from a process to the
|
||||
kernel), write <function>ioctl</function>'s (to return information to a
|
||||
process), <footnote><para>Notice that here the roles of read and write are
|
||||
reversed <emphasis>again</emphasis>, so in <function>ioctl</function>'s read
|
||||
is to send information to the kernel and write is to receive information from
|
||||
the kernel.</para></footnote> both or neither. The
|
||||
<function>ioctl</function> function is called with three parameters: the file
|
||||
descriptor of the appropriate device file, the ioctl number, and a parameter,
|
||||
which is of type long so you can use a cast to use it to pass anything.
|
||||
<footnote><para>This isn't exact. You won't be able to pass a structure, for
|
||||
example, through an ioctl --- but you will be able to pass a pointer to the
|
||||
structure.</para></footnote></para>
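<para>As a rough illustration of the three parameters, the sketch below calls
an existing Linux ioctl, <varname>FIONREAD</varname>, and passes a pointer as
the third argument so the kernel can fill in an answer. It does not use the
example module. The function name <function>bytes_waiting</function> is an
assumption made for this sketch.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: how many bytes can be read from fd right now?
 * Returns -1 if the ioctl fails. */
int bytes_waiting(int fd)
{
    int count = 0;

    /* 1st parameter: the file descriptor
     * 2nd parameter: the ioctl number
     * 3rd parameter: here a pointer, which the kernel fills in */
    if (ioctl(fd, FIONREAD, &count) < 0)
        return -1;

    return count;
}
```
]]></programlisting>

<para>Passing a pointer this way is how larger data, such as a structure, gets
across even though the parameter itself is only one machine word.</para>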
|
||||
|
||||
<para>The ioctl number encodes the major device number, the type of the
|
||||
ioctl, the command, and the type of the parameter. This ioctl number is
|
||||
usually created by a macro call (<varname>_IO</varname>,
|
||||
<varname>_IOR</varname>, <varname>_IOW</varname> or <varname>_IOWR</varname>
|
||||
--- depending on the type) in a header file. This header file should then be
|
||||
included both by the programs which will use
|
||||
<function>ioctl</function> (so they can generate the appropriate
|
||||
<function>ioctl</function>'s) and by the kernel module (so it can understand
|
||||
it). In the example below, the header file is <filename
|
||||
class="headerfile">chardev.h</filename> and the program which uses it is
|
||||
<filename>ioctl.c</filename>.</para>
|
||||
|
||||
<!-- \index{\_IO} \index{\_IOR} \index{\_IOW} \index{\_IOWR} -->
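<para>To see what such a number contains, it can be taken apart again with the
<varname>_IOC_DIR</varname>, <varname>_IOC_TYPE</varname>,
<varname>_IOC_NR</varname> and <varname>_IOC_SIZE</varname> macros from
<filename class="headerfile">linux/ioctl.h</filename>. This is a user-space
sketch. The names <varname>DEMO_MAJOR</varname>,
<varname>DEMO_IOCTL</varname> and <function>decode_ioctl</function> are
invented for it, and the value 100 is simply the major number used in this
chapter.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <linux/ioctl.h>
#include <stddef.h>

/* Hypothetical values, chosen to match the chapter's example */
#define DEMO_MAJOR 100
#define DEMO_IOCTL _IOR(DEMO_MAJOR, 0, char *)

/* The four fields packed into one ioctl number */
struct ioctl_fields {
    unsigned int dir;   /* direction bits: _IOC_NONE, _IOC_READ, _IOC_WRITE */
    unsigned int type;  /* the "type" field, here the major number */
    unsigned int nr;    /* the command number */
    unsigned int size;  /* sizeof() the parameter type */
};

struct ioctl_fields decode_ioctl(unsigned int cmd)
{
    struct ioctl_fields f;

    f.dir  = _IOC_DIR(cmd);
    f.type = _IOC_TYPE(cmd);
    f.nr   = _IOC_NR(cmd);
    f.size = _IOC_SIZE(cmd);
    return f;
}
```
]]></programlisting>

<para>Because every field comes from the macro arguments, two drivers that
pick different first arguments can never produce the same number for the same
command.</para>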
|
||||
|
||||
<para>If you want to use <function>ioctl</function>s in your own kernel
|
||||
modules, it is best to receive an official <function>ioctl</function>
|
||||
assignment, so if you accidentally get somebody else's
|
||||
<function>ioctl</function>s, or if they get yours, you'll know something is
|
||||
wrong. For more information, consult the kernel source tree at
|
||||
<filename>Documentation/ioctl-number.txt</filename>.</para>
|
||||
|
||||
<!-- \index{official ioctl assignment} \index{ioctl\\official assignment} -->
|
||||
|
||||
<!-- \index{chardev.c, source file}\index{source\\chardev.c} -->
|
||||
|
||||
<example><title>chardev.c</title>
|
||||
<programlisting><![CDATA[
|
||||
/* chardev.c
|
||||
*
|
||||
* Create an input/output character device
|
||||
*/
|
||||
|
||||
|
||||
/* Copyright (C) 2001 by Peter Jay Salzman */
|
||||
|
||||
|
||||
|
||||
/* The necessary header files */
|
||||
|
||||
/* Standard in kernel modules */
|
||||
#include <linux/kernel.h> /* We're doing kernel work */
|
||||
#include <linux/module.h> /* Specifically, a module */
|
||||
|
||||
/* Deal with CONFIG_MODVERSIONS */
|
||||
#if CONFIG_MODVERSIONS==1
|
||||
#define MODVERSIONS
|
||||
#include <linux/modversions.h>
|
||||
#endif
|
||||
|
||||
/* For character devices */
|
||||
|
||||
/* The character device definitions are here */
|
||||
#include <linux/fs.h>
|
||||
|
||||
/* A wrapper which does next to nothing
|
||||
* at present, but may help for compatibility
|
||||
* with future versions of Linux */
|
||||
#include <linux/wrapper.h>
|
||||
|
||||
|
||||
/* Our own ioctl numbers */
|
||||
#include "chardev.h"
|
||||
|
||||
|
||||
/* In 2.2.3 /usr/include/linux/version.h includes a
|
||||
* macro for this, but 2.0.35 doesn't - so I add it
|
||||
* here if necessary. */
|
||||
#ifndef KERNEL_VERSION
|
||||
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
#include <asm/uaccess.h> /* for get_user and put_user */
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
#define SUCCESS 0
|
||||
|
||||
|
||||
/* Device Declarations ******************************** */
|
||||
|
||||
|
||||
/* The name for our device, as it will appear in
|
||||
* /proc/devices */
|
||||
#define DEVICE_NAME "char_dev"
|
||||
|
||||
|
||||
/* The maximum length of the message for the device */
|
||||
#define BUF_LEN 80
|
||||
|
||||
/* Is the device open right now? Used to prevent
|
||||
* concurrent access into the same device */
|
||||
static int Device_Open = 0;
|
||||
|
||||
/* The message the device will give when asked */
|
||||
static char Message[BUF_LEN];
|
||||
|
||||
/* How far did the process reading the message get?
|
||||
* Useful if the message is larger than the size of the
|
||||
* buffer we get to fill in device_read. */
|
||||
static char *Message_Ptr;
|
||||
|
||||
|
||||
/* This function is called whenever a process attempts
|
||||
* to open the device file */
|
||||
static int device_open(struct inode *inode,
|
||||
struct file *file)
|
||||
{
|
||||
#ifdef DEBUG
|
||||
printk ("device_open(%p)\n", file);
|
||||
#endif
|
||||
|
||||
/* We don't want to talk to two processes at the
|
||||
* same time */
|
||||
if (Device_Open)
|
||||
return -EBUSY;
|
||||
|
||||
/* If this was a process, we would have had to be
|
||||
* more careful here, because one process might have
|
||||
* checked Device_Open right before the other one
|
||||
* tried to increment it. However, we're in the
|
||||
* kernel, so we're protected against context switches.
|
||||
*
|
||||
* This is NOT the right attitude to take, because we
|
||||
* might be running on an SMP box, but we'll deal with
|
||||
* SMP in a later chapter.
|
||||
*/
|
||||
|
||||
Device_Open++;
|
||||
|
||||
/* Initialize the message */
|
||||
Message_Ptr = Message;
|
||||
|
||||
MOD_INC_USE_COUNT;
|
||||
|
||||
return SUCCESS;
|
||||
}
|
||||
|
||||
|
||||
/* This function is called when a process closes the
|
||||
* device file. It doesn't have a return value because
|
||||
* it cannot fail. Regardless of what else happens, you
|
||||
* should always be able to close a device (true for 2.0;
|
||||
* in 2.2 the function returns an int, but we always return 0).
|
||||
*/
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
static int device_release(struct inode *inode,
|
||||
struct file *file)
|
||||
#else
|
||||
static void device_release(struct inode *inode,
|
||||
struct file *file)
|
||||
#endif
|
||||
{
|
||||
#ifdef DEBUG
|
||||
printk ("device_release(%p,%p)\n", inode, file);
|
||||
#endif
|
||||
|
||||
/* We're now ready for our next caller */
|
||||
Device_Open --;
|
||||
|
||||
MOD_DEC_USE_COUNT;
|
||||
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
return 0;
|
||||
#endif
|
||||
}
|
||||
|
||||
|
||||
|
||||
/* This function is called whenever a process which
|
||||
* has already opened the device file attempts to
|
||||
* read from it. */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
static ssize_t device_read(
|
||||
struct file *file,
|
||||
char *buffer, /* The buffer to fill with the data */
|
||||
size_t length, /* The length of the buffer */
|
||||
loff_t *offset) /* offset to the file */
|
||||
#else
|
||||
static int device_read(
|
||||
struct inode *inode,
|
||||
struct file *file,
|
||||
char *buffer, /* The buffer to fill with the data */
|
||||
int length) /* The length of the buffer
|
||||
* (mustn't write beyond that!) */
|
||||
#endif
|
||||
{
|
||||
/* Number of bytes actually written to the buffer */
|
||||
int bytes_read = 0;
|
||||
|
||||
#ifdef DEBUG
|
||||
printk("device_read(%p,%p,%d)\n", file, buffer, length);
|
||||
#endif
|
||||
|
||||
/* If we're at the end of the message, return 0
|
||||
* (which signifies end of file) */
|
||||
if (*Message_Ptr == 0)
|
||||
return 0;
|
||||
|
||||
/* Actually put the data into the buffer */
|
||||
while (length && *Message_Ptr) {
|
||||
|
||||
/* Because the buffer is in the user data segment,
|
||||
* not the kernel data segment, assignment wouldn't
|
||||
* work. Instead, we have to use put_user which
|
||||
* copies data from the kernel data segment to the
|
||||
* user data segment. */
|
||||
put_user(*(Message_Ptr++), buffer++);
|
||||
length --;
|
||||
bytes_read ++;
|
||||
}
|
||||
|
||||
#ifdef DEBUG
|
||||
printk ("Read %d bytes, %d left\n", bytes_read, length);
|
||||
#endif
|
||||
|
||||
/* Read functions are supposed to return the number
|
||||
* of bytes actually inserted into the buffer */
|
||||
return bytes_read;
|
||||
}
|
||||
|
||||
|
||||
/* This function is called when somebody tries to
|
||||
* write into our device file. */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
static ssize_t device_write(struct file *file,
|
||||
const char *buffer,
|
||||
size_t length,
|
||||
loff_t *offset)
|
||||
#else
|
||||
static int device_write(struct inode *inode,
|
||||
struct file *file,
|
||||
const char *buffer,
|
||||
int length)
|
||||
#endif
|
||||
{
|
||||
int i;
|
||||
|
||||
#ifdef DEBUG
|
||||
printk ("device_write(%p,%p,%d)\n",
|
||||
file, buffer, length);
|
||||
#endif
|
||||
|
||||
for(i=0; i<length && i<BUF_LEN; i++)
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
get_user(Message[i], buffer+i);
|
||||
#else
|
||||
Message[i] = get_user(buffer+i);
|
||||
#endif
|
||||
|
||||
Message_Ptr = Message;
|
||||
|
||||
/* Again, return the number of input characters used */
|
||||
return i;
|
||||
}
|
||||
|
||||
|
||||
/* This function is called whenever a process tries to
|
||||
* do an ioctl on our device file. We get two extra
|
||||
* parameters (additional to the inode and file
|
||||
* structures, which all device functions get): the number
|
||||
* of the ioctl called and the parameter given to the
|
||||
* ioctl function.
|
||||
*
|
||||
* If the ioctl is write or read/write (meaning output
|
||||
* is returned to the calling process), the ioctl call
|
||||
* returns the output of this function.
|
||||
*/
|
||||
int device_ioctl(
|
||||
struct inode *inode,
|
||||
struct file *file,
|
||||
unsigned int ioctl_num,/* The number of the ioctl */
|
||||
unsigned long ioctl_param) /* The parameter to it */
|
||||
{
|
||||
int i;
|
||||
char *temp;
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
char ch;
|
||||
#endif
|
||||
|
||||
/* Switch according to the ioctl called */
|
||||
switch (ioctl_num) {
|
||||
case IOCTL_SET_MSG:
|
||||
/* Receive a pointer to a message (in user space)
|
||||
* and set that to be the device's message. */
|
||||
|
||||
/* Get the parameter given to ioctl by the process */
|
||||
temp = (char *) ioctl_param;
|
||||
|
||||
/* Find the length of the message */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
get_user(ch, temp);
|
||||
for (i=0; ch && i<BUF_LEN; i++, temp++)
|
||||
get_user(ch, temp);
|
||||
#else
|
||||
for (i=0; get_user(temp) && i<BUF_LEN; i++, temp++)
|
||||
;
|
||||
#endif
|
||||
|
||||
/* Don't reinvent the wheel - call device_write */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
device_write(file, (char *) ioctl_param, i, 0);
|
||||
#else
|
||||
device_write(inode, file, (char *) ioctl_param, i);
|
||||
#endif
|
||||
break;
|
||||
|
||||
case IOCTL_GET_MSG:
|
||||
/* Give the current message to the calling
|
||||
* process - the parameter we got is a pointer,
|
||||
* fill it. */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
i = device_read(file, (char *) ioctl_param, 99, 0);
|
||||
#else
|
||||
i = device_read(inode, file, (char *) ioctl_param, 99);
|
||||
#endif
|
||||
/* Warning - we assume here the buffer length is
|
||||
* 100. If it's less than that we might overflow
|
||||
* the buffer, causing the process to core dump.
|
||||
*
|
||||
* The reason we only allow up to 99 characters is
|
||||
* that the NUL which terminates the string also
|
||||
* needs room. */
|
||||
|
||||
/* Put a zero at the end of the buffer, so it
|
||||
* will be properly terminated */
|
||||
put_user('\0', (char *) ioctl_param+i);
|
||||
break;
|
||||
|
||||
case IOCTL_GET_NTH_BYTE:
|
||||
/* This ioctl is both input (ioctl_param) and
|
||||
* output (the return value of this function) */
|
||||
return (ioctl_param < BUF_LEN) ? Message[ioctl_param] : -EINVAL;
|
||||
break;
|
||||
}
|
||||
|
||||
return SUCCESS;
|
||||
}
|
||||
|
||||
|
||||
/* Module Declarations *************************** */
|
||||
|
||||
|
||||
/* This structure will hold the functions to be called
|
||||
* when a process does something to the device we
|
||||
* created. Since a pointer to this structure is kept in
|
||||
* the devices table, it can't be local to
|
||||
* init_module. NULL is for unimplemented functions. */
|
||||
struct file_operations Fops = {
|
||||
NULL, /* seek */
|
||||
device_read,
|
||||
device_write,
|
||||
NULL, /* readdir */
|
||||
NULL, /* select */
|
||||
device_ioctl, /* ioctl */
|
||||
NULL, /* mmap */
|
||||
device_open,
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
NULL, /* flush */
|
||||
#endif
|
||||
device_release /* a.k.a. close */
|
||||
};
|
||||
|
||||
|
||||
/* Initialize the module - Register the character device */
|
||||
int init_module()
|
||||
{
|
||||
int ret_val;
|
||||
|
||||
/* Register the character device (at least try) */
|
||||
ret_val = module_register_chrdev(MAJOR_NUM,
|
||||
DEVICE_NAME,
|
||||
&Fops);
|
||||
|
||||
/* Negative values signify an error */
|
||||
if (ret_val < 0) {
|
||||
printk ("%s failed with %d\n",
|
||||
"Sorry, registering the character device ",
|
||||
ret_val);
|
||||
return ret_val;
|
||||
}
|
||||
|
||||
printk ("%s The major device number is %d.\n",
|
||||
"Registration is a success.",
|
||||
MAJOR_NUM);
|
||||
printk ("If you want to talk to the device driver,\n");
|
||||
printk ("you'll have to create a device file. \n");
|
||||
printk ("We suggest you use:\n");
|
||||
printk ("mknod %s c %d 0\n", DEVICE_FILE_NAME,
|
||||
MAJOR_NUM);
|
||||
printk ("The device file name is important, because\n");
|
||||
printk ("the ioctl program assumes that's the\n");
|
||||
printk ("file you'll use.\n");
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
/* Cleanup - unregister the character device */
|
||||
void cleanup_module()
|
||||
{
|
||||
int ret;
|
||||
|
||||
/* Unregister the device */
|
||||
ret = module_unregister_chrdev(MAJOR_NUM, DEVICE_NAME);
|
||||
|
||||
/* If there's an error, report it */
|
||||
if (ret < 0)
|
||||
printk("Error in module_unregister_chrdev: %d\n", ret);
|
||||
}
|
||||
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
|
||||
<!-- \index{chardev.h, source file}\index{source\\chardev.h} -->
|
||||
|
||||
<example><title>chardev.h</title>
|
||||
<programlisting><![CDATA[
|
||||
|
||||
/* chardev.h - the header file with the ioctl definitions.
|
||||
*
|
||||
* The declarations here have to be in a header file,
|
||||
* because they need to be known both to the kernel
|
||||
* module (in chardev.c) and the process calling ioctl
|
||||
* (ioctl.c)
|
||||
*/
|
||||
|
||||
#ifndef CHARDEV_H
|
||||
#define CHARDEV_H
|
||||
|
||||
#include <linux/ioctl.h>
|
||||
|
||||
|
||||
|
||||
/* The major device number. We can't rely on dynamic
|
||||
* registration any more, because ioctls need to know
|
||||
* it. */
|
||||
#define MAJOR_NUM 100
|
||||
|
||||
|
||||
/* Set the message of the device driver */
|
||||
#define IOCTL_SET_MSG _IOR(MAJOR_NUM, 0, char *)
|
||||
/* _IOR means that we're creating an ioctl command
|
||||
* number for passing information from a user process
|
||||
* to the kernel module.
|
||||
*
|
||||
* The first argument, MAJOR_NUM, is the major device
|
||||
* number we're using.
|
||||
*
|
||||
* The second argument is the number of the command
|
||||
* (there could be several with different meanings).
|
||||
*
|
||||
* The third argument is the type we want to get from
|
||||
* the process to the kernel.
|
||||
*/
|
||||
|
||||
/* Get the message of the device driver */
|
||||
#define IOCTL_GET_MSG _IOR(MAJOR_NUM, 1, char *)
|
||||
/* This IOCTL is used for output, to get the message
|
||||
* of the device driver. However, we still need the
|
||||
* buffer to place the message in to be input,
|
||||
* as it is allocated by the process.
|
||||
*/
|
||||
|
||||
|
||||
/* Get the n'th byte of the message */
|
||||
#define IOCTL_GET_NTH_BYTE _IOWR(MAJOR_NUM, 2, int)
|
||||
/* The IOCTL is used for both input and output. It
|
||||
* receives from the user a number, n, and returns
|
||||
* Message[n]. */
|
||||
|
||||
|
||||
/* The name of the device file */
|
||||
#define DEVICE_FILE_NAME "char_dev"
|
||||
|
||||
|
||||
#endif
|
||||
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
|
||||
<!-- \index{ioctl\\defining} \index{defining ioctls}
|
||||
\index{ioctl\\header file for} \index{header file for ioctls} -->
|
||||
|
||||
|
||||
<example><title>ioctl.c</title>
|
||||
<programlisting><![CDATA[
|
||||
/* ioctl.c - the process to use ioctl's to control the
|
||||
* kernel module
|
||||
*
|
||||
* Until now we could have used cat for input and
|
||||
* output. But now we need to do ioctl's, which require
|
||||
* writing our own process.
|
||||
*/
|
||||
|
||||
/* Copyright (C) 2001 by Peter Jay Salzman */
|
||||
|
||||
|
||||
/* device specifics, such as ioctl numbers and the
|
||||
* major device file. */
|
||||
#include "chardev.h"
|
||||
|
||||
|
||||
#include <stdio.h> /* printf, putchar */
|
||||
#include <stdlib.h> /* exit */
|
||||
#include <fcntl.h> /* open */
|
||||
#include <unistd.h> /* close */
|
||||
#include <sys/ioctl.h> /* ioctl */
|
||||
|
||||
|
||||
|
||||
/* Functions for the ioctl calls */
|
||||
|
||||
void ioctl_set_msg(int file_desc, char *message)
|
||||
{
|
||||
int ret_val;
|
||||
|
||||
ret_val = ioctl(file_desc, IOCTL_SET_MSG, message);
|
||||
|
||||
if (ret_val < 0) {
|
||||
printf ("ioctl_set_msg failed:%d\n", ret_val);
|
||||
exit(-1);
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
|
||||
void ioctl_get_msg(int file_desc)
|
||||
{
|
||||
int ret_val;
|
||||
char message[100];
|
||||
|
||||
/* Warning - this is dangerous because we don't tell
|
||||
* the kernel how far it's allowed to write, so it
|
||||
* might overflow the buffer. In a real production
|
||||
* program, we would have used two ioctls - one to tell
|
||||
* the kernel the buffer length and another to give
|
||||
* it the buffer to fill
|
||||
*/
|
||||
ret_val = ioctl(file_desc, IOCTL_GET_MSG, message);
|
||||
|
||||
if (ret_val < 0) {
|
||||
printf ("ioctl_get_msg failed:%d\n", ret_val);
|
||||
exit(-1);
|
||||
}
|
||||
|
||||
printf("get_msg message:%s\n", message);
|
||||
}
|
||||
|
||||
|
||||
|
||||
void ioctl_get_nth_byte(int file_desc)
|
||||
{
|
||||
int i;
|
||||
int c = 1; /* int so errors show as negative; non-zero so the loop starts */
|
||||
|
||||
printf("get_nth_byte message:");
|
||||
|
||||
i = 0;
|
||||
while (c != 0) {
|
||||
c = ioctl(file_desc, IOCTL_GET_NTH_BYTE, i++);
|
||||
|
||||
if (c < 0) {
|
||||
printf(
|
||||
"ioctl_get_nth_byte failed at the %d'th byte:\n", i);
|
||||
exit(-1);
|
||||
}
|
||||
|
||||
putchar(c);
|
||||
}
|
||||
putchar('\n');
|
||||
}
|
||||
|
||||
|
||||
|
||||
|
||||
/* Main - Call the ioctl functions */
|
||||
int main(void)
|
||||
{
|
||||
int file_desc, ret_val;
|
||||
char *msg = "Message passed by ioctl\n";
|
||||
|
||||
file_desc = open(DEVICE_FILE_NAME, 0);
|
||||
if (file_desc < 0) {
|
||||
printf ("Can't open device file: %s\n",
|
||||
DEVICE_FILE_NAME);
|
||||
exit(-1);
|
||||
}
|
||||
|
||||
ioctl_get_nth_byte(file_desc);
|
||||
ioctl_get_msg(file_desc);
|
||||
ioctl_set_msg(file_desc, msg);
|
||||
|
||||
close(file_desc);
|
||||
}
|
||||
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
</sect1>
|
|
@ -0,0 +1,306 @@
|
|||
<sect1><title>System Calls</title>
|
||||
|
||||
<!-- \label{sys-call} \index{system calls} \index{calls\\system} -->
|
||||
|
||||
<para>So far, the only thing we've done was to use well defined kernel
|
||||
mechanisms to register <filename role="directory">/proc</filename> files and
|
||||
device handlers. This is fine if you want to do something the kernel
|
||||
programmers thought you'd want, such as write a device driver. But what if
|
||||
you want to do something unusual, to change the behavior of the system in
|
||||
some way? Then, you're mostly on your own.</para>
|
||||
|
||||
<para>This is where kernel programming gets dangerous. While writing the
|
||||
example below, I killed the <function>open()</function> system call. This
|
||||
meant I couldn't open any files, I couldn't run any programs, and I couldn't
|
||||
<command>shutdown</command> the computer. I had to pull the power switch.
|
||||
Luckily, no files died. To ensure you won't lose any files either, please run
|
||||
<command>sync</command> right before you do the <command>insmod</command> and
|
||||
the <command>rmmod</command>.</para>
|
||||
|
||||
<!-- \index{sync} \index{insmod} \index{rmmod} \index{shutdown} -->
|
||||
|
||||
<para>Forget about <filename role="directory">/proc</filename> files, forget
|
||||
about device files. They're just minor details. The <emphasis>real</emphasis>
|
||||
process to kernel communication mechanism, the one used by all processes, is
|
||||
system calls. When a process requests a service from the kernel (such as
|
||||
opening a file, forking to a new process, or requesting more memory), this is
|
||||
the mechanism used. If you want to change the behaviour of the kernel in
|
||||
interesting ways, this is the place to do it. By the way, if you want to see
|
||||
which system calls a program uses, run <command>strace
|
||||
&lt;arguments&gt;</command>.</para> <!-- \index{strace} -->
|
||||
|
||||
<para>In general, a process is not supposed to be able to access the kernel. It
|
||||
can't access kernel memory and it can't call kernel functions. The hardware
|
||||
of the CPU enforces this (that's the reason why it's called `protected
|
||||
mode').</para>
|
||||
|
||||
<para>System calls are an exception to this general rule. What happens is that
|
||||
the process fills the registers with the appropriate values and then calls
|
||||
a special instruction which jumps to a previously defined location in the
|
||||
kernel (of course, that location is readable by user processes, but it is not
|
||||
writable by them). Under Intel CPUs, this is done by means of interrupt 0x80.
|
||||
The hardware knows that once you jump to this location, you are no longer
|
||||
running in restricted user mode, but as the operating system kernel --- and
|
||||
therefore you're allowed to do whatever you want.</para>
|
||||
<!-- \index{interrupt 0x80} -->
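<para>A process rarely issues that instruction by hand, because the C library
wraps it. As a small user-space sketch, the glibc function
<function>syscall</function> takes the system call number directly, which
makes the "number in a register, then trap" idea visible. The helper name
<function>raw_getpid</function> is an assumption of this sketch, and the exact
instruction the library uses underneath depends on the architecture and
library version.</para>

<programlisting><![CDATA[
```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel for our process id by system
 * call number instead of going through the getpid() wrapper. */
long raw_getpid(void)
{
    /* SYS_getpid is the number the kernel uses to pick the service */
    return syscall(SYS_getpid);
}
```
]]></programlisting>

<para>Both routes end in the same kernel function, so the value matches what
<function>getpid</function> reports.</para>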
|
||||
|
||||
<para>The location in the kernel a process can jump to is called
|
||||
<emphasis>system_call</emphasis>. The procedure at that location checks the
|
||||
system call number, which tells the kernel what service the process
|
||||
requested. Then, it looks at the table of system calls
|
||||
(<varname>sys_call_table</varname>) to see the address of the kernel function
|
||||
to call. Then it calls the function, and after it returns, does a few system
|
||||
checks and then returns to the process (or to a different process, if the
|
||||
process time ran out). If you want to read this code, it's at the source file
|
||||
<filename>arch/&lt;architecture&gt;/kernel/entry.S</filename>, after the line
|
||||
<function>ENTRY(system_call)</function>.</para>
|
||||
|
||||
<!-- \index{system\_call} \index{ENTRY(system\_call)} \index{sys\_call\_table}
|
||||
\index{entry.S} -->
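<para>The lookup itself is nothing more than indexing an array of function
pointers. The following is a user-space model of that idea, not kernel code:
every name in it (<varname>demo_call_table</varname>,
<function>demo_system_call</function> and the two toy services) is invented
for illustration.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* A toy "system call": takes one argument, returns one result */
typedef long (*demo_call_t)(long arg);

long demo_double(long arg) { return arg * 2; }
long demo_negate(long arg) { return -arg; }

#define DEMO_NR_CALLS 2

/* Stand-in for sys_call_table: slot number -> function to call */
demo_call_t demo_call_table[DEMO_NR_CALLS] = { demo_double, demo_negate };

/* Stand-in for the code at system_call: check the number, look up
 * the handler, call it and hand back its result. */
long demo_system_call(unsigned long nr, long arg)
{
    if (nr >= DEMO_NR_CALLS || demo_call_table[nr] == NULL)
        return -ENOSYS;

    return demo_call_table[nr](arg);
}
```
]]></programlisting>

<para>Since the caller only ever supplies a number, whoever controls the table
controls what that number means.</para>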
|
||||
|
||||
<para>So, if we want to change the way a certain system call works, what we
|
||||
need to do is to write our own function to implement it (usually by adding a
|
||||
bit of our own code, and then calling the original function) and then change
|
||||
the pointer at <varname>sys_call_table</varname> to point to our function.
|
||||
Because we might be removed later and we don't want to leave the system in an
|
||||
unstable state, it's important for <function>cleanup_module</function> to
|
||||
restore the table to its original state.</para>
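<para>The save, replace and restore steps can be modelled in user space with a
one-entry table. This is only a sketch of the pattern under the assumption
that the table is writable. The names <varname>demo_table</varname>,
<function>demo_init</function> and <function>demo_cleanup</function> are
hypothetical and stand in for <varname>sys_call_table</varname>,
<function>init_module</function> and <function>cleanup_module</function>.</para>

<programlisting><![CDATA[
```c
#include <assert.h>

typedef int (*demo_open_t)(const char *path);

/* Pretend kernel implementation: absolute paths "open" as fd 3 */
int demo_sys_open(const char *path)
{
    return path[0] == '/' ? 3 : -1;
}

/* Stand-in for the slot in sys_call_table */
demo_open_t demo_table[1] = { demo_sys_open };

demo_open_t original_open;   /* saved so we can call and restore it */
int hook_calls = 0;          /* how often our function ran */

/* Our replacement: do a bit of our own work, then call the original */
int our_demo_open(const char *path)
{
    hook_calls++;
    return original_open(path);
}

/* Like init_module: remember the old pointer, install ours */
void demo_init(void)
{
    original_open = demo_table[0];
    demo_table[0] = our_demo_open;
}

/* Like cleanup_module: put the table back the way we found it */
void demo_cleanup(void)
{
    demo_table[0] = original_open;
}
```
]]></programlisting>

<para>Callers that go through the table see the same results before, during
and after the hook is installed. Only the counter shows that our code
ran.</para>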
|
||||
|
||||
<para>The source code here is an example of such a kernel module. We want to
|
||||
`spy' on a certain user, and to <function>printk()</function> a message
|
||||
whenever that user opens a file. Towards this end, we replace the system call
|
||||
to open a file with our own function, called
|
||||
<function>our_sys_open</function>. This function checks the uid (user's id)
|
||||
of the current process, and if it's equal to the uid we spy on, it calls
|
||||
<function>printk()</function> to display the name of the file to be opened.
|
||||
Then, either way, it calls the original <function>open()</function> function
|
||||
with the same parameters, to actually open the file.</para>
|
||||
|
||||
<!-- \index{open\\system call} -->
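<para>The decision logic of such a function is short. Below is a user-space
approximation, assuming we may use <function>getuid</function> and
<function>fprintf</function> where the module would use the current task's uid
and <function>printk()</function>. The names <function>spy_open</function>,
<varname>watched_uid</varname> and <varname>spy_reports</varname> are made up
for this sketch.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

uid_t watched_uid;     /* hypothetical: the user we spy on */
int spy_reports = 0;   /* number of messages printed so far */

/* Report the file name if the caller is the watched user, then
 * open the file exactly as the caller asked, either way. */
int spy_open(const char *path, int flags)
{
    assert(path != NULL);

    if (getuid() == watched_uid) {
        fprintf(stderr, "Opened file by %d: %s\n",
                (int) watched_uid, path);
        spy_reports++;
    }

    return open(path, flags);
}
```
]]></programlisting>

<para>The caller gets the same file descriptor it would have received without
the check, which is why this kind of spying is invisible to it.</para>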
|
||||
|
||||
<para>The <function>init_module</function> function replaces the appropriate
|
||||
location in <varname>sys_call_table</varname> and keeps the original pointer
|
||||
in a variable. The <function>cleanup_module</function> function uses that
|
||||
variable to restore everything back to normal. This approach is dangerous,
|
||||
because of the possibility of two kernel modules changing the same system
|
||||
call. Imagine we have two kernel modules, A and B. A's open system call will
|
||||
be A_open and B's will be B_open. Now, when A is inserted into the kernel,
|
||||
the system call is replaced with A_open, which will call the original
|
||||
sys_open when it's done. Next, B is inserted into the kernel, which replaces
|
||||
the system call with B_open, which will call what it thinks is the original
|
||||
system call, A_open, when it's done.</para>
|
||||
|
||||
<para>Now, if B is removed first, everything will be well: it will simply
|
||||
restore the system call to A_open, which calls the original. However, if A
|
||||
is removed and then B is removed, the system will crash. A's removal will
|
||||
restore the system call to the original, sys_open, cutting B out of the
|
||||
loop. Then, when B is removed, it will restore the system call to what
|
||||
<emphasis>it</emphasis> thinks is the original, A_open, which is no longer
|
||||
in memory. At first glance, it appears we could solve this particular problem
|
||||
by checking if the system call is equal to our open function and if so not
|
||||
changing it at all (so that B won't change the system call when it's
|
||||
removed), but that will cause an even worse problem. When A is removed, it
|
||||
sees that the system call was changed to B_open so that it is no longer
|
||||
pointing to A_open, so it won't restore it to sys_open before it is removed
|
||||
from memory. Unfortunately, B_open will still try to call A_open which is
|
||||
no longer there, so that even without removing B the system would
|
||||
crash.</para>
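
<para>The removal-order problem can be reproduced without risking a real
kernel. The sketch below is a user-space model, not module code: the
<structname>module</structname> structure, the <varname>loaded</varname> flag
and the functions <function>load</function>, <function>unload</function> and
<function>unload_checked</function> are all hypothetical. A call into a module
whose <varname>loaded</varname> flag is zero stands in for jumping to memory
that is no longer there, and is reported as <literal>OPEN_CRASH</literal>.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

typedef int (*open_fn)(void);

enum { OPEN_OK = 0, OPEN_CRASH = -1 };

struct module {
    open_fn hook;     /* this module's replacement function */
    open_fn saved;    /* what it believes the original call is */
    int     loaded;   /* 0 once its code has left memory */
};

static int sys_open_model(void) { return OPEN_OK; }

open_fn slot = sys_open_model;      /* the table entry for open */
struct module A, B;

/* Calling into an unloaded module is what crashes a real kernel. */
static int run(struct module *m)
{
    return m->loaded ? m->saved() : OPEN_CRASH;
}

static int A_open(void) { return run(&A); }
static int B_open(void) { return run(&B); }

/* Put the model back into its pristine state. */
void reset(void)
{
    slot = sys_open_model;
    A.hook = A_open; A.saved = NULL; A.loaded = 0;
    B.hook = B_open; B.saved = NULL; B.loaded = 0;
}

/* insmod: save the current entry, then install our hook. */
void load(struct module *m)
{
    assert(m != NULL);
    m->saved = slot;
    slot = m->hook;
    m->loaded = 1;
}

/* Plain cleanup: always write the saved pointer back. */
void unload(struct module *m)
{
    slot = m->saved;
    m->loaded = 0;
}

/* "Careful" cleanup: restore only if the entry still points at us. */
void unload_checked(struct module *m)
{
    if (slot == m->hook)
        slot = m->saved;
    m->loaded = 0;
}

int do_open(void) { return slot(); }
```
]]></programlisting>

<para>Loading A then B and unloading B then A keeps
<function>do_open</function> working throughout. Unloading A then B with
<function>unload</function> leaves the entry pointing at the departed
<function>A_open</function>. With <function>unload_checked</function> the
failure comes even earlier, as soon as A is gone, because
<function>B_open</function> still chains to it.</para>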
|
||||
|
||||
<para>I can think of two ways to prevent this problem. The first is to
|
||||
restore the call to the original value, sys_open. Unfortunately, sys_open is
|
||||
not part of the kernel system table in <filename>/proc/ksyms</filename>, so
|
||||
we can't access it. The other solution is to use the reference count to
|
||||
prevent root from <command>rmmod</command>'ing the module once it is loaded.
|
||||
This is good for production modules, but bad for an educational sample,
|
||||
which is why I didn't do it here.</para>
|
||||
|
||||
<!-- \index{rmmod}\index{MOD\_INC\_USE\_COUNT} \index{sys\_open} -->
|
||||
|
||||
|
||||
<example><title>syscall.c</title>
|
||||
<programlisting><![CDATA[
|
||||
/* syscall.c
|
||||
*
|
||||
* System call "stealing" sample
|
||||
*/
|
||||
|
||||
|
||||
/* Copyright (C) 2001 by Peter Jay Salzman */
|
||||
|
||||
|
||||
/* The necessary header files */
|
||||
|
||||
/* Standard in kernel modules */
|
||||
#include <linux/kernel.h> /* We're doing kernel work */
|
||||
#include <linux/module.h> /* Specifically, a module */
|
||||
|
||||
/* Deal with CONFIG_MODVERSIONS */
|
||||
#if CONFIG_MODVERSIONS==1
|
||||
#define MODVERSIONS
|
||||
#include <linux/modversions.h>
|
||||
#endif
|
||||
|
||||
#include <sys/syscall.h> /* The list of system calls */
|
||||
|
||||
/* For the current (process) structure, we need
|
||||
* this to know who the current user is. */
|
||||
#include <linux/sched.h>
|
||||
|
||||
|
||||
|
||||
|
||||
/* In 2.2.3 /usr/include/linux/version.h includes a
|
||||
* macro for this, but 2.0.35 doesn't - so I add it
|
||||
* here if necessary. */
|
||||
#ifndef KERNEL_VERSION
|
||||
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
#include <asm/uaccess.h>
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
/* The system call table (a table of functions). We
|
||||
* just define this as external, and the kernel will
|
||||
* fill it up for us when we are insmod'ed
|
||||
*/
|
||||
extern void *sys_call_table[];
|
||||
|
||||
|
||||
/* UID we want to spy on - will be filled from the
|
||||
* command line */
|
||||
int uid;
|
||||
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
MODULE_PARM(uid, "i");
|
||||
#endif
|
||||
|
||||
/* A pointer to the original system call. The reason
|
||||
* we keep this, rather than call the original function
|
||||
* (sys_open), is because somebody else might have
|
||||
* replaced the system call before us. Note that this
|
||||
* is not 100% safe, because if another module
|
||||
* replaced sys_open before us, then when we're inserted
|
||||
* we'll call the function in that module - and it
|
||||
* might be removed before we are.
|
||||
*
|
||||
* Another reason for this is that we can't get sys_open.
|
||||
* It is not among the symbols the kernel exports to modules. */
|
||||
asmlinkage int (*original_call)(const char *, int, int);
|
||||
|
||||
|
||||
|
||||
/* For some reason, in 2.2.3 current->uid gave me
|
||||
* zero, not the real user ID. I tried to find what went
|
||||
* wrong, but I couldn't do it in a short time, and
|
||||
* I'm lazy - so I'll just use the system call to get the
|
||||
* uid, the way a process would.
|
||||
*
|
||||
* For some reason, after I recompiled the kernel this
|
||||
* problem went away.
|
||||
*/
|
||||
asmlinkage int (*getuid_call)();
|
||||
|
||||
|
||||
|
||||
/* The function we'll replace sys_open (the function
|
||||
* called when you call the open system call) with. To
|
||||
* find the exact prototype, with the number and type
|
||||
* of arguments, we find the original function first
|
||||
* (it's at fs/open.c).
|
||||
*
|
||||
* In theory, this means that we're tied to the
|
||||
* current version of the kernel. In practice, the
|
||||
* system calls almost never change (it would wreak havoc
|
||||
* and require programs to be recompiled, since the system
|
||||
* calls are the interface between the kernel and the
|
||||
* processes).
|
||||
*/
|
||||
asmlinkage int our_sys_open(const char *filename,
|
||||
int flags,
|
||||
int mode)
|
||||
{
|
||||
int i = 0;
|
||||
char ch;
|
||||
|
||||
/* Check if this is the user we're spying on */
|
||||
if (uid == getuid_call()) {
|
||||
/* getuid_call is the getuid system call,
|
||||
* which gives the uid of the user who
|
||||
* ran the process which called the system
|
||||
* call we got */
|
||||
|
||||
/* Report the file, if relevant */
|
||||
printk("Opened file by %d: ", uid);
|
||||
do {
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
get_user(ch, filename+i);
|
||||
#else
|
||||
ch = get_user(filename+i);
|
||||
#endif
|
||||
i++;
|
||||
printk("%c", ch);
|
||||
} while (ch != 0);
|
||||
printk("\n");
|
||||
}
|
||||
|
||||
/* Call the original sys_open - otherwise, we lose
|
||||
* the ability to open files */
|
||||
return original_call(filename, flags, mode);
|
||||
}
|
||||
|
||||
|
||||
|
||||
/* Initialize the module - replace the system call */
|
||||
int init_module()
|
||||
{
|
||||
/* Warning - too late for it now, but maybe for
|
||||
* next time... */
|
||||
printk("I'm dangerous. I hope you did a ");
|
||||
printk("sync before you insmod'ed me.\n");
|
||||
printk("My counterpart, cleanup_module(), is even ");
|
||||
printk("more dangerous. If\n");
|
||||
printk("you value your file system, it will ");
|
||||
printk("be \"sync; rmmod\" \n");
|
||||
printk("when you remove this module.\n");
|
||||
|
||||
/* Keep a pointer to the original function in
|
||||
* original_call, and then replace the system call
|
||||
* in the system call table with our_sys_open */
|
||||
original_call = sys_call_table[__NR_open];
|
||||
sys_call_table[__NR_open] = our_sys_open;
|
||||
|
||||
/* To get the address of the function for system
|
||||
* call foo, go to sys_call_table[__NR_foo]. */
|
||||
|
||||
printk("Spying on UID:%d\n", uid);
|
||||
|
||||
/* Get the system call for getuid */
|
||||
getuid_call = sys_call_table[__NR_getuid];
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
||||
/* Cleanup - restore the original system call */
|
||||
void cleanup_module()
|
||||
{
|
||||
/* Return the system call back to normal */
|
||||
if (sys_call_table[__NR_open] != our_sys_open) {
|
||||
printk("Somebody else also played with the ");
|
||||
printk("open system call\n");
|
||||
printk("The system may be left in ");
|
||||
printk("an unstable state.\n");
|
||||
}
|
||||
|
||||
sys_call_table[__NR_open] = original_call;
|
||||
}
|
||||
]]></programlisting>
|
||||
</example>
|
||||
|
||||
</sect1>
|
||||
|
|
@ -0,0 +1,509 @@
|
|||
<sect1><title>Blocking Processes</title>
|
||||
|
||||
<indexterm><primary>blocking processes</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>processes</primary>
|
||||
<secondary>blocking</secondary>
|
||||
</indexterm>
|
||||
|
||||
<sect2><title>Blocking Processes</title>
|
||||
|
||||
<para>
|
||||
What do you do when somebody asks you for something you can't do right
|
||||
away? If you're a human being and you're bothered by a human being, the
|
||||
only thing you can say is: <quote>Not right now, I'm busy. <emphasis>Go
|
||||
away!</emphasis></quote>. But if you're a kernel module and you're
|
||||
bothered by a process, you have another possibility. You can put the
|
||||
process to sleep until you can service it. After all, processes are
|
||||
being put to sleep by the kernel and woken up all the time (that's the
|
||||
way multiple processes appear to run at the same time on a single
|
||||
<acronym>CPU</acronym>).
|
||||
</para>
|
||||
|
||||
<indexterm><primary>multi-tasking</primary></indexterm>
|
||||
<indexterm><primary>busy</primary></indexterm>
|
||||
|
||||
<para>
|
||||
This kernel module is an example of this. The file (called
|
||||
<filename>/proc/sleep</filename>) can only be opened by a single process
|
||||
at a time. If the file is already open, the kernel module calls
|
||||
<function>module_interruptible_sleep_on</function>.
|
||||
<indexterm><primary>module_interruptible_sleep_on</primary></indexterm>
|
||||
<indexterm><primary>interruptible_sleep_on</primary></indexterm>
|
||||
<footnote>
|
||||
<para>
|
||||
The easiest way to keep a file open is to open it with <command>tail
|
||||
-f</command>.
|
||||
</para>
|
||||
</footnote>
|
||||
This function changes the status of the task (a task is the kernel data
|
||||
structure which holds information about a process and the system call
|
||||
it's in, if any) to <parameter>TASK_INTERRUPTIBLE</parameter>,
|
||||
<indexterm><primary>TASK_INTERRUPTIBLE</primary></indexterm> which means
|
||||
that the task will not run until it is woken up somehow, and adds it to
|
||||
<structname>WaitQ</structname>, the queue of tasks waiting to access the
|
||||
file. Then, the function calls the scheduler to context switch to a
|
||||
different process, one which has some use for the <acronym>CPU</acronym>.
|
||||
</para>
|
||||
<indexterm><primary>putting processes to sleep</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>sleep</primary>
|
||||
<secondary>putting processes to</secondary>
|
||||
</indexterm>
|
||||
|
||||
<para>
|
||||
When a process is done with the file, it closes it, and
|
||||
<function>module_close</function> is called. That function wakes up all
|
||||
the processes in the queue (there's no mechanism to only wake up one of
|
||||
them). It then returns and the process which just closed the file can
|
||||
continue to run. In time, the scheduler decides that that process has
|
||||
had enough and gives control of the <acronym>CPU</acronym> to another
|
||||
process. Eventually, one of the processes which was in the queue will be
|
||||
given control of the <acronym>CPU</acronym> by the scheduler. It starts
|
||||
at the point right after the call to
|
||||
<function>module_interruptible_sleep_on</function>.
|
||||
<footnote>
|
||||
<para>
|
||||
This means that the process is still in kernel mode -- as far as the
|
||||
process is concerned, it issued the <function>open</function> system
|
||||
call and the system call hasn't returned yet. The process doesn't
|
||||
know somebody else used the <acronym>CPU</acronym> for most of the
|
||||
time between the moment it issued the call and the moment it
|
||||
returned.
|
||||
</para>
|
||||
</footnote>
|
||||
It can then proceed to set a global variable to tell all the other
|
||||
processes that the file is still open and go on with its life. When the
|
||||
other processes get a piece of the <acronym>CPU</acronym>, they'll see
|
||||
that global variable and go back to sleep.
|
||||
</para>
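
<para>The wake-everybody-and-retest behaviour can be followed step by step in
a small single-threaded model. This is a sketch only: the
<varname>task</varname> array and the functions
<function>model_open</function>, <function>model_close</function> and
<function>model_resume</function> are hypothetical, and a task in the
<literal>SLEEPING</literal> state stands in for a process waiting in the
queue. No real scheduling takes place.</para>

<programlisting><![CDATA[
```c
#include <assert.h>

#define NTASKS 3

enum state { IDLE, SLEEPING, RUNNABLE, HAS_FILE };

int        already_open;     /* the global flag every task tests */
enum state task[NTASKS];     /* every SLEEPING task is "in the queue" */

/* A task calls open(): take the file, or go to sleep in the queue. */
void model_open(int t)
{
    assert(t >= 0 && t < NTASKS);
    if (already_open) {
        task[t] = SLEEPING;
    } else {
        already_open = 1;
        task[t] = HAS_FILE;
    }
}

/* The holder closes the file: clear the flag and wake everybody. */
void model_close(int t)
{
    int i;

    assert(t >= 0 && t < NTASKS);
    task[t] = IDLE;
    already_open = 0;
    for (i = 0; i < NTASKS; i++)
        if (task[i] == SLEEPING)
            task[i] = RUNNABLE;
}

/* The scheduler gives a woken task the CPU. It resumes right after
 * the sleep call and tests the loop condition again. */
void model_resume(int t)
{
    assert(t >= 0 && t < NTASKS);
    if (task[t] == RUNNABLE)
        model_open(t);
}
```
]]></programlisting>

<para>Whichever woken task is resumed first finds the flag clear and takes
the file. The others find it set again and return to
<literal>SLEEPING</literal>.</para>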
|
||||
<indexterm><primary>waking up processes</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>processes</primary>
|
||||
<secondary>waking up</secondary>
|
||||
</indexterm>
|
||||
<indexterm><primary>multitasking</primary></indexterm>
|
||||
<indexterm><primary>scheduler</primary></indexterm>
|
||||
|
||||
<para>
|
||||
To make our life more interesting, <function>module_close</function>
|
||||
doesn't have a monopoly on waking up the processes which wait to access
|
||||
the file. A signal, such as <keycombo
|
||||
action="simul"><keycap>Ctrl</keycap><keycap>c</keycap></keycombo>
|
||||
<indexterm><primary>ctrl-c</primary></indexterm>
|
||||
(<parameter>SIGINT</parameter>)
|
||||
<indexterm><primary>signal</primary></indexterm>
|
||||
<indexterm><primary>SIGINT</primary></indexterm>
|
||||
can also wake up a process.
|
||||
<indexterm><primary>module_wake_up</primary></indexterm>
|
||||
<footnote>
|
||||
<para>
|
||||
This is because we used
|
||||
<function>module_interruptible_sleep_on</function>. We could have
|
||||
used <function>module_sleep_on</function>
|
||||
<indexterm><primary>module_sleep_on</primary></indexterm>
|
||||
<indexterm><primary>sleep_on</primary></indexterm>
|
||||
instead, but that would have resulted in extremely angry users whose
|
||||
<keycombo
|
||||
action="simul"><keycap>Ctrl</keycap><keycap>c</keycap></keycombo>s
|
||||
are ignored.
|
||||
</para>
|
||||
</footnote>
|
||||
In that case, we want to return with <parameter>-EINTR</parameter>
|
||||
<indexterm><primary>EINTR</primary></indexterm>
|
||||
immediately. This is important so users can, for example, kill the
|
||||
process before it receives the file.
|
||||
</para>
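
<para>From the calling process's side, an interrupted sleep shows up as a
failed system call with <varname>errno</varname> set to
<parameter>EINTR</parameter>. The sketch below shows this with an ordinary
pipe rather than <filename>/proc/sleep</filename>, so it runs without the
module loaded. <function>blocked_read_errno</function> and
<function>on_alarm</function> are hypothetical names, and the timer signal
merely stands in for a user pressing
<keycombo action="simul"><keycap>Ctrl</keycap><keycap>c</keycap></keycombo>.</para>

<programlisting><![CDATA[
```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static void on_alarm(int sig)
{
    (void)sig;      /* nothing to do: delivery alone interrupts read() */
}

/* Sleep in read() on an empty pipe and let a timer signal arrive after
 * `ms` milliseconds. Returns the errno left by read(), 0 if read()
 * returned normally, or -1 if the setup failed. */
int blocked_read_errno(int ms)
{
    int fds[2];
    char c;
    struct sigaction sa;
    struct itimerval timer;
    ssize_t n;
    int err;

    assert(ms > 0);
    if (pipe(fds) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            /* no SA_RESTART: do not retry read() */

    memset(&timer, 0, sizeof timer);
    timer.it_value.tv_sec = ms / 1000;
    timer.it_value.tv_usec = (ms % 1000) * 1000L;

    if (sigaction(SIGALRM, &sa, NULL) != 0 ||
        setitimer(ITIMER_REAL, &timer, NULL) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    n = read(fds[0], &c, 1);    /* nothing to read, so we sleep here */
    err = (n < 0) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```
]]></programlisting>

<para>Because the handler is installed without
<parameter>SA_RESTART</parameter>, the interrupted call is not retried and
the caller sees the error directly.</para>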
|
||||
|
||||
<indexterm>
|
||||
<primary>processes</primary>
|
||||
<secondary>killing</secondary>
|
||||
</indexterm>
|
||||
|
||||
<para>
|
||||
There is one more point to remember. Sometimes processes don't want to
|
||||
sleep; they want either to get what they want immediately, or to be told
|
||||
it cannot be done. Such processes use the
|
||||
<parameter>O_NONBLOCK</parameter>
|
||||
<indexterm><primary>O_NONBLOCK</primary></indexterm>
|
||||
<indexterm><primary>non-blocking</primary></indexterm> flag when opening
|
||||
the file. The kernel is supposed to respond by returning with the error
|
||||
code <parameter>-EAGAIN</parameter>
|
||||
<indexterm><primary>EAGAIN</primary></indexterm> from operations which
|
||||
would otherwise block, such as opening the file in this example. The
|
||||
program <command>cat_noblock</command>, available in the source directory
|
||||
for this chapter, can be used to open a file with
|
||||
<parameter>O_NONBLOCK</parameter>.
|
||||
</para>
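
<para>What such a caller does can be sketched in a few lines of user-space C.
This is not the <command>cat_noblock</command> source, only an illustration
under assumptions: <function>open_noblock</function> and
<function>describe</function> are hypothetical helper names, and the
<parameter>-EAGAIN</parameter> branch is only reached when the file being
opened behaves like <filename>/proc/sleep</filename>.</para>

<programlisting><![CDATA[
```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Open `path` for reading without ever sleeping in open().
 * Returns a file descriptor (>= 0) or a negative errno value. */
int open_noblock(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY | O_NONBLOCK);
    return fd < 0 ? -errno : fd;
}

/* Turn the result of open_noblock() into a short message. */
const char *describe(int result)
{
    if (result >= 0)
        return "opened";
    if (result == -EAGAIN)
        return "busy, try again later";
    return "failed";
}
```
]]></programlisting>

<para>A caller would typically retry later, or give up, when
<function>describe</function> reports that the file is busy, and close the
descriptor with <function>close</function> once it is done with it.</para>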
|
||||
|
||||
<indexterm><primary>blocking, how to avoid</primary></indexterm>
|
||||
|
||||
<example>
|
||||
<title>sleep.c</title>
|
||||
|
||||
<indexterm><primary>sleep.c</primary></indexterm>
|
||||
|
||||
<programlisting>
|
||||
<![CDATA[
|
||||
|
||||
/* sleep.c - create a /proc file, and if several processes try to open it at
|
||||
* the same time, put all but one to sleep
|
||||
*
|
||||
* Copyright (C) 2001 by Peter Jay Salzman
|
||||
*/
|
||||
|
||||
/* The necessary header files */
|
||||
|
||||
/* Standard in kernel modules */
|
||||
#include <linux/kernel.h> /* We're doing kernel work */
|
||||
#include <linux/module.h> /* Specifically, a module */
|
||||
|
||||
/* Deal with CONFIG_MODVERSIONS */
|
||||
#if CONFIG_MODVERSIONS==1
|
||||
#define MODVERSIONS
|
||||
#include <linux/modversions.h>
|
||||
#endif
|
||||
|
||||
/* Necessary because we use proc fs */
|
||||
#include <linux/proc_fs.h>
|
||||
|
||||
/* For putting processes to sleep and waking them up */
|
||||
#include <linux/sched.h>
|
||||
#include <linux/wrapper.h>
|
||||
|
||||
/* In 2.2.3 /usr/include/linux/version.h includes a macro for this, but 2.0.35
|
||||
* doesn't - so I add it here if necessary.
|
||||
*/
|
||||
#ifndef KERNEL_VERSION
|
||||
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
|
||||
#endif
|
||||
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
#include <asm/uaccess.h> /* for get_user and put_user */
|
||||
#endif
|
||||
|
||||
/* The module's file functions */
|
||||
|
||||
/* Here we keep the last message received, to prove that we can process our
|
||||
* input
|
||||
*/
|
||||
#define MESSAGE_LENGTH 80
|
||||
static char Message[MESSAGE_LENGTH];
|
||||
|
||||
/* Since we use the file operations struct, we can't use the special proc
|
||||
* output provisions - we have to use a standard read function, which is this
|
||||
* function
|
||||
*/
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
static ssize_t module_output (
|
||||
struct file *file, /* The file read */
|
||||
char *buf, /* The buffer to put data to (in the user segment) */
|
||||
size_t len, /* The length of the buffer */
|
||||
loff_t *offset) /* Offset in the file - ignore */
|
||||
#else
|
||||
static int module_output (
|
||||
struct inode *inode, /* The inode read */
|
||||
struct file *file, /* The file read */
|
||||
char *buf, /* The buffer to put data to (in the user segment) */
|
||||
int len) /* The length of the buffer */
|
||||
#endif
|
||||
{
|
||||
static int finished = 0;
|
||||
int i;
|
||||
char message[MESSAGE_LENGTH+30];
|
||||
|
||||
/* Return 0 to signify end of file - that we have nothing more to say at this
|
||||
* point.
|
||||
*/
|
||||
if (finished) {
|
||||
finished = 0;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* If you don't understand this by now, you're hopeless as a kernel
|
||||
* programmer.
|
||||
*/
|
||||
sprintf(message, "Last input:%s\n", Message);
|
||||
for (i = 0; i < len && message[i]; i++)
|
||||
put_user(message[i], buf+i);
|
||||
|
||||
finished = 1;
|
||||
return i; /* Return the number of bytes "read" */
|
||||
}
|
||||
|
||||
/* This function receives input from the user when the user writes to the /proc
|
||||
* file.
|
||||
*/
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
static ssize_t module_input (
|
||||
struct file *file, /* The file itself */
|
||||
const char *buf, /* The buffer with input */
|
||||
size_t length, /* The buffer's length */
|
||||
loff_t *offset) /* offset to file - ignore */
|
||||
#else
|
||||
static int module_input (
|
||||
struct inode *inode, /* The file's inode */
|
||||
struct file *file, /* The file itself */
|
||||
const char *buf, /* The buffer with the input */
|
||||
int length) /* The buffer's length */
|
||||
#endif
|
||||
{
|
||||
int i;
|
||||
|
||||
/* Put the input into Message, where module_output will later be able to use
|
||||
* it
|
||||
*/
|
||||
for(i = 0; i < MESSAGE_LENGTH-1 && i < length; i++)
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
get_user(Message[i], buf+i);
|
||||
#else
|
||||
Message[i] = get_user(buf+i);
|
||||
#endif
|
||||
/* we want a standard, zero terminated string */
|
||||
Message[i] = '\0';
|
||||
|
||||
/* We need to return the number of input characters used */
|
||||
return i;
|
||||
}
|
||||
|
||||
/* 1 if the file is currently open by somebody */
|
||||
int Already_Open = 0;
|
||||
|
||||
/* Queue of processes who want our file */
|
||||
static struct wait_queue *WaitQ = NULL;
|
||||
|
||||
/* Called when the /proc file is opened */
|
||||
static int module_open(struct inode *inode, struct file *file)
|
||||
{
|
||||
/* If the file's flags include O_NONBLOCK, it means the process doesn't want
|
||||
* to wait for the file. In this case, if the file is already open, we
|
||||
* should fail with -EAGAIN, meaning "you'll have to try again", instead of
|
||||
* blocking a process which would rather stay awake.
|
||||
*/
|
||||
if ((file->f_flags & O_NONBLOCK) && Already_Open)
|
||||
return -EAGAIN;
|
||||
|
||||
/* This is the correct place for MOD_INC_USE_COUNT because if a process is
|
||||
* in the loop, which is within the kernel module, the kernel module must
|
||||
* not be removed.
|
||||
*/
|
||||
MOD_INC_USE_COUNT;
|
||||
|
||||
/* If the file is already open, wait until it isn't */
|
||||
while (Already_Open)
|
||||
{
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
int i, is_sig = 0;
|
||||
#endif
|
||||
|
||||
/* This function puts the current process, including any system calls,
|
||||
* such as us, to sleep. Execution will be resumed right after the
|
||||
* function call, either because somebody called wake_up(&WaitQ) (only
|
||||
* module_close does that, when the file is closed) or when a signal,
|
||||
* such as Ctrl-C, is sent to the process
|
||||
*/
|
||||
module_interruptible_sleep_on(&WaitQ);
|
||||
|
||||
/* If we woke up because we got a signal we're not blocking, return
|
||||
* -EINTR (fail the system call). This allows processes to be killed or
|
||||
* stopped.
|
||||
*/
|
||||
|
||||
/*
|
||||
* Emmanuel Papirakis:
|
||||
*
|
||||
* This is a little update to work with 2.2.*. Signals now are contained in
|
||||
* two words (64 bits) and are stored in a structure that contains an array of
|
||||
* two unsigned longs. We now have to make 2 checks in our if.
|
||||
*
|
||||
* Ori Pomerantz:
|
||||
*
|
||||
* Nobody promised me they'll never use more than 64 bits, or that this book
|
||||
* won't be used for a version of Linux with a word size of 16 bits. This code
|
||||
* would work in any case.
|
||||
*/
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
for (i = 0; i < _NSIG_WORDS && !is_sig; i++)
|
||||
is_sig = (current->signal.sig[i] & ~current->blocked.sig[i]) != 0;
|
||||
|
||||
if (is_sig) {
|
||||
#else
|
||||
if (current->signal & ~current->blocked) {
|
||||
#endif
|
||||
/* It's important to put MOD_DEC_USE_COUNT here, because for processes
|
||||
* where the open is interrupted there will never be a corresponding
|
||||
* close. If we don't decrement the usage count here, we will be left
|
||||
* with a positive usage count which we'll have no way to bring down
|
||||
* to zero, giving us an immortal module, which can only be killed by
|
||||
* rebooting the machine.
|
||||
*/
|
||||
MOD_DEC_USE_COUNT;
|
||||
return -EINTR;
|
||||
}
|
||||
}
|
||||
|
||||
/* If we got here, Already_Open must be zero */
|
||||
|
||||
/* Open the file */
|
||||
Already_Open = 1;
|
||||
return 0; /* Allow the access */
|
||||
}
|
||||
|
||||
/* Called when the /proc file is closed */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
int module_close(struct inode *inode, struct file *file)
|
||||
#else
|
||||
void module_close(struct inode *inode, struct file *file)
|
||||
#endif
|
||||
{
|
||||
/* Set Already_Open to zero, so one of the processes in the WaitQ will be
|
||||
* able to set Already_Open back to one and to open the file. All the other
|
||||
* processes will be called when Already_Open is back to one, so they'll go
|
||||
* back to sleep.
|
||||
*/
|
||||
Already_Open = 0;
|
||||
|
||||
/* Wake up all the processes in WaitQ, so if anybody is waiting for the
|
||||
* file, they can have it.
|
||||
*/
|
||||
module_wake_up(&WaitQ);
|
||||
|
||||
MOD_DEC_USE_COUNT;
|
||||
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
return 0; /* success */
|
||||
#endif
|
||||
}
|
||||
|
||||
/* This function decides whether to allow an operation (return zero) or not
|
||||
* allow it (return a non-zero which indicates why it is not allowed).
|
||||
*
|
||||
* The operation can be one of the following values:
|
||||
* 0 - Execute (run the "file" - meaningless in our case)
|
||||
* 2 - Write (input to the kernel module)
|
||||
* 4 - Read (output from the kernel module)
|
||||
*
|
||||
* This is the real function that checks file permissions. The permissions
|
||||
* returned by ls -l are for reference only, and can be overridden here.
|
||||
*/
|
||||
static int module_permission(struct inode *inode, int op)
|
||||
{
|
||||
/* We allow everybody to read from our module, but only root (uid 0) may
|
||||
* write to it
|
||||
*/
|
||||
if (op == 4 || (op == 2 && current->euid == 0))
|
||||
return 0;
|
||||
|
||||
/* If it's anything else, access is denied */
|
||||
return -EACCES;
|
||||
}
|
||||
|
||||
/* Structures to register as the /proc file, with pointers to all the relevant
|
||||
* functions.
|
||||
*/
|
||||
|
||||
/* File operations for our proc file. This is where we place pointers to all
|
||||
* the functions called when somebody tries to do something to our file. NULL
|
||||
* means we don't want to deal with something.
|
||||
*/
|
||||
static struct file_operations File_Ops_4_Our_Proc_File = {
|
||||
NULL, /* lseek */
|
||||
module_output, /* "read" from the file */
|
||||
module_input, /* "write" to the file */
|
||||
NULL, /* readdir */
|
||||
NULL, /* select */
|
||||
NULL, /* ioctl */
|
||||
NULL, /* mmap */
|
||||
module_open, /* called when the /proc file is opened */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
NULL, /* flush */
|
||||
#endif
|
||||
module_close}; /* called when it's closed */
|
||||
|
||||
/* Inode operations for our proc file. We need it so we'll have somewhere to
|
||||
* specify the file operations structure we want to use, and the function we
|
||||
* use for permissions. It's also possible to specify functions to be called
|
||||
* for anything else which could be done to an inode (although we don't bother,
|
||||
* we just put NULL).
|
||||
*/
|
||||
static struct inode_operations Inode_Ops_4_Our_Proc_File = {
|
||||
&File_Ops_4_Our_Proc_File,
|
||||
NULL, /* create */
|
||||
NULL, /* lookup */
|
||||
NULL, /* link */
|
||||
NULL, /* unlink */
|
||||
NULL, /* symlink */
|
||||
NULL, /* mkdir */
|
||||
NULL, /* rmdir */
|
||||
NULL, /* mknod */
|
||||
NULL, /* rename */
|
||||
NULL, /* readlink */
|
||||
NULL, /* follow_link */
|
||||
NULL, /* readpage */
|
||||
NULL, /* writepage */
|
||||
NULL, /* bmap */
|
||||
NULL, /* truncate */
|
||||
module_permission}; /* check for permissions */
|
||||
|
||||
/* Directory entry */
|
||||
static struct proc_dir_entry Our_Proc_File = {
|
||||
0, /* Inode number - ignore, it will be filled by
|
||||
* proc_register[_dynamic]
|
||||
*/
|
||||
5, /* Length of the file name */
|
||||
"sleep", /* The file name */
|
||||
|
||||
/* File mode - this is a regular file which can be read by its owner, its
|
||||
* group, and everybody else. Also, its owner can write to it.
|
||||
*
|
||||
* Actually, this field is just for reference, it's module_permission that
|
||||
* does the actual check. It could use this field, but in our
|
||||
* implementation it doesn't, for simplicity.
|
||||
*/
|
||||
S_IFREG | S_IRUGO | S_IWUSR,
|
||||
1, /* Number of links (directories where the file is referenced) */
|
||||
0, 0, /* The uid and gid for the file - we give it to root */
|
||||
80, /* The size of the file reported by ls. */
|
||||
|
||||
/* A pointer to the inode structure for the file, if we need it. In our
|
||||
* case we do, because we need a write function.
|
||||
*/
|
||||
&Inode_Ops_4_Our_Proc_File,
|
||||
|
||||
/* The read function for the file. Irrelevant, because we put it in the
|
||||
* inode structure above
|
||||
*/
|
||||
NULL};
|
||||
|
||||
/* Module initialization and cleanup */
|
||||
|
||||
/* Initialize the module - register the proc file */
|
||||
int init_module()
|
||||
{
|
||||
/* Success if proc_register_dynamic is a success, failure otherwise */
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
|
||||
return proc_register(&proc_root, &Our_Proc_File);
|
||||
#else
|
||||
return proc_register_dynamic(&proc_root, &Our_Proc_File);
|
||||
#endif
|
||||
|
||||
/* proc_root is the root directory for the proc fs (/proc). This is where
|
||||
* we want our file to be located.
|
||||
*/
|
||||
}
|
||||
|
||||
/* Cleanup - unregister our file from /proc. This could get dangerous if
|
||||
* there are still processes waiting in WaitQ, because they are inside our
|
||||
* open function, which will get unloaded. I'll explain how to avoid removal
|
||||
* of a kernel module in such a case in chapter 10.
|
||||
*/
|
||||
void cleanup_module()
|
||||
{
|
||||
proc_unregister(&proc_root, Our_Proc_File.low_ino);
|
||||
}
|
||||
|
||||
]]>
|
||||
</programlisting>
|
||||
</example>
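
<para>The pending-signal test in <function>module_open</function> is a
word-by-word bit mask comparison, which can be exercised on its own. In this
sketch <structname>sigset_model</structname>, <literal>NWORDS</literal> and
<function>has_unblocked_signal</function> are hypothetical stand-ins for the
kernel's signal set type, <literal>_NSIG_WORDS</literal> and the loop in the
listing. Each word is compared against zero, so a signal recorded in the
upper bits of a word is not lost when the result is kept in an
<type>int</type>.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define NWORDS 2    /* stand-in for the kernel's _NSIG_WORDS */

struct sigset_model {
    unsigned long sig[NWORDS];
};

/* Return 1 if any signal is pending and not blocked, else 0. */
int has_unblocked_signal(const struct sigset_model *pending,
                         const struct sigset_model *blocked)
{
    int i;

    assert(pending != NULL && blocked != NULL);
    for (i = 0; i < NWORDS; i++)
        if ((pending->sig[i] & ~blocked->sig[i]) != 0)
            return 1;
    return 0;
}
```
]]></programlisting>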
|
||||
</sect2>
|
||||
</sect1>
|
|
@ -0,0 +1,144 @@
|
|||
<sect1><title>Replacing <function>printk</function></title>
|
||||
|
||||
<indexterm><primary>replacing printk</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>printk</primary>
|
||||
<secondary>replacing</secondary>
|
||||
</indexterm>
|
||||
|
||||
<para>Good writing style says we have a paragraph here.</para>
|
||||
|
||||
<sect2><title>Replacing <function>printk</function></title>
|
||||
|
||||
<para>
|
||||
In the beginning (in the Hello World chapter), I said that X and kernel
|
||||
module programming don't mix. That's true while developing the kernel
|
||||
module, but in actual use you want to be able to send messages to
|
||||
whichever <acronym>tty</acronym>
|
||||
<footnote>
|
||||
<para>
|
||||
<emphasis>T</emphasis>ele<emphasis>ty</emphasis>pe, originally a
|
||||
combination keyboard-printer used to communicate with a Unix system,
|
||||
and today an abstraction for the text stream used for a Unix program,
|
||||
whether it's a physical terminal, an xterm on an X display, a network
|
||||
connection used with telnet, etc.
|
||||
</para>
|
||||
</footnote>
|
||||
the command to the module came from. This is important for identifying
|
||||
errors after the kernel module is released, because it will be used
|
||||
from all of these kinds of terminal.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The way this is done is by using <parameter>current</parameter>,
|
||||
<indexterm><primary>current task</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>task</primary>
|
||||
<secondary>current</secondary>
|
||||
</indexterm>
|
||||
a pointer to the currently running task, to get the current task's
|
||||
<structname>tty</structname> structure.
|
||||
<indexterm><primary>tty_structure</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>struct</primary>
|
||||
<secondary>tty</secondary>
|
||||
</indexterm>
|
||||
Then, we look inside that <structname>tty</structname> structure to find
|
||||
a pointer to a string write function, which we use to write a string to
|
||||
the <acronym>tty</acronym>.
|
||||
</para>
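
<para>The lookup-then-call pattern can be modelled in user space. The sketch
below is an assumption-laden stand-in, not the kernel's
<structname>tty</structname> code: <structname>fake_tty</structname>,
<structname>fake_driver</structname>, <function>capture_write</function> and
<function>fake_tty_init</function> are hypothetical, and the
<quote>terminal</quote> is just a buffer that records what was written to
it. It also shows why the string is followed by a carriage return and a line
feed.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <string.h>

struct fake_tty;

/* The driver is a set of function pointers; we model only write. */
struct fake_driver {
    int (*write)(struct fake_tty *tty, int from_user,
                 const char *buf, int count);
};

struct fake_tty {
    struct fake_driver driver;
    char   out[128];    /* everything "shown" on this terminal */
    size_t used;
};

/* A write function that appends to the buffer instead of a screen. */
static int capture_write(struct fake_tty *tty, int from_user,
                         const char *buf, int count)
{
    (void)from_user;
    assert(count >= 0);
    if (tty->used + (size_t)count >= sizeof tty->out)
        return 0;                       /* no room, write nothing */
    memcpy(tty->out + tty->used, buf, (size_t)count);
    tty->used += (size_t)count;
    tty->out[tty->used] = '\0';
    return count;
}

void fake_tty_init(struct fake_tty *tty)
{
    tty->driver.write = capture_write;
    tty->used = 0;
    tty->out[0] = '\0';
}

/* Same shape as print_string in the listing: find the write function
 * through the tty, send the text, then send CR and LF (octal 015, 012). */
void print_string(struct fake_tty *tty, const char *str)
{
    if (tty == NULL)
        return;                         /* no terminal, nothing to do */
    tty->driver.write(tty, 0, str, (int)strlen(str));
    tty->driver.write(tty, 0, "\015\012", 2);
}
```
]]></programlisting>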
|
||||
|
||||
<example><title>printk.c</title>
|
||||
|
||||
<indexterm><primary>printk.c</primary></indexterm>
|
||||
|
||||
<programlisting><![CDATA[
|
||||
/* printk.c - send textual output to the tty you're running on, regardless of
|
||||
* whether it's passed through X11, telnet, etc.
|
||||
*
|
||||
* Copyright (C) 2001 by Peter Jay Salzman
|
||||
*/
|
||||
|
||||
/* The necessary header files */
|
||||
|
||||
/* Standard in kernel modules */
|
||||
#include <linux/kernel.h> /* We're doing kernel work */
|
||||
#include <linux/module.h> /* Specifically, a module */
|
||||
|
||||
/* Deal with CONFIG_MODVERSIONS */
|
||||
#if CONFIG_MODVERSIONS==1
|
||||
#define MODVERSIONS
|
||||
#include <linux/modversions.h>
|
||||
#endif
|
||||
|
||||
/* Necessary here */
|
||||
#include <linux/sched.h> /* For current */
|
||||
#include <linux/tty.h> /* For the tty declarations */
|
||||
|
||||
/* Print the string to the appropriate tty, the one the current task uses */
|
||||
void print_string(char *str)
|
||||
{
|
||||
struct tty_struct *my_tty;
|
||||
|
||||
/* The tty for the current task */
|
||||
my_tty = current->tty;
|
||||
|
||||
/* If my_tty is NULL, it means that the current task has no tty you can print
|
||||
* to (this is possible, for example, if it's a daemon). In this case,
|
||||
* there's nothing we can do. */
|
||||
if (my_tty != NULL) {
|
||||
|
||||
/* my_tty->driver is a struct which holds the tty's functions, one of
|
||||
* which (write) is used to write strings to the tty. It can be used to
|
||||
* take a string either from the user's memory segment or the kernel's
|
||||
* memory segment.
|
||||
*
|
||||
* The function's first parameter is the tty to write to, because the
|
||||
* same function would normally be used for all tty's of a certain type.
|
||||
* The second parameter controls whether the function receives a string
|
||||
* from kernel memory (false, 0) or from user memory (true, non zero).
|
||||
* The third parameter is a pointer to a string, and the fourth
|
||||
* parameter is the length of the string.
|
||||
*/
|
||||
(*(my_tty->driver).write)(
|
||||
my_tty, /* The tty itself */
|
||||
0, /* We don't take the string from user space */
|
||||
str, /* String */
|
||||
strlen(str)); /* Length */
|
||||
|
||||
/* ttys were originally hardware devices, which (usually) strictly
|
||||
* followed the ASCII standard. In ASCII, to move to a new line you
|
||||
* need two characters, a carriage return and a line feed. On Unix,
|
||||
* the ASCII line feed is used for both purposes - so we can't just
|
||||
* use \n, because it wouldn't have a carriage return and the next
|
||||
* line will start at the column right after the line feed.
|
||||
*
|
||||
* BTW, this is the reason why the text file is different between
|
||||
* Unix and Windows. In CP/M and its derivatives, such as MS-DOS and
|
||||
* Windows the ASCII standard was strictly adhered to, and therefore a
|
||||
* newline requires both a line feed and a carriage return.
|
||||
*/
|
||||
(*(my_tty->driver).write)(my_tty, 0, "\015\012", 2);
|
||||
}
|
||||
}
|
||||
|
||||
/* Module initialization and cleanup */
|
||||
|
||||
/* Initialize the module - print a message on the current tty */
|
||||
int init_module()
|
||||
{
|
||||
print_string("Module Inserted");
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* Cleanup - print a farewell message on the current tty */
|
||||
void cleanup_module()
|
||||
{
|
||||
print_string("Module Removed");
|
||||
}
|
||||
|
||||
]]></programlisting>
|
||||
</example>
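
<para>
The two ideas in <filename>printk.c</filename>, calling a write function
through a pointer stored in the tty's driver structure and ending the line
with an explicit carriage return and line feed, can be tried outside the
kernel. The following is only a user-space sketch:
<structname>sim_tty</structname>, <structname>sim_driver</structname>,
<function>sim_buffer_write</function> and
<function>sim_print_string</function> are hypothetical names made up for
this illustration, and the output goes to a memory buffer rather than to a
real terminal.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <string.h>

/* Hypothetical user-space stand-ins, NOT the kernel's tty structures */
struct sim_tty;

struct sim_driver {
    /* Pointer to the function which writes count bytes to the tty */
    int (*write)(struct sim_tty *tty, const char *buf, int count);
};

struct sim_tty {
    struct sim_driver driver;   /* The tty's functions */
    char out[64];               /* Everything "printed" so far */
    int used;                   /* Bytes used in out, without the '\0' */
};

/* A write function which appends to the tty's buffer, truncating if full */
int sim_buffer_write(struct sim_tty *tty, const char *buf, int count)
{
    int room = (int) sizeof(tty->out) - 1 - tty->used;

    if (count > room)
        count = room;
    memcpy(tty->out + tty->used, buf, (size_t) count);
    tty->used += count;
    tty->out[tty->used] = '\0';
    return count;
}

/* Print a string followed by CR LF, going through the function pointer */
void sim_print_string(struct sim_tty *tty, const char *str)
{
    /* No tty (think of a daemon), nothing we can do */
    if (tty == NULL)
        return;

    assert(tty->driver.write != NULL);
    tty->driver.write(tty, str, (int) strlen(str));
    tty->driver.write(tty, "\015\012", 2);  /* Carriage return, line feed */
}
```
]]></programlisting>

<para>
To use the sketch, zero a <structname>sim_tty</structname>, point its
<structfield>driver.write</structfield> member at
<function>sim_buffer_write</function> and call
<function>sim_print_string</function>. The buffer then holds the string
followed by the two bytes 13 and 10.
</para>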
|
||||
</sect2>
|
||||
</sect1>
|
|
@ -0,0 +1,244 @@
|
|||
<sect1><title>Scheduling Tasks</title>
|
||||
|
||||
<indexterm><primary>scheduling tasks</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>tasks</primary>
|
||||
<secondary>scheduling</secondary>
|
||||
</indexterm>
|
||||
|
||||
<para>Good writing style says we have a paragraph here.</para>
|
||||
|
||||
<sect2><title>Scheduling Tasks</title>
|
||||
|
||||
<para>
|
||||
Very often, we have <quote>housekeeping</quote>
|
||||
<indexterm><primary>housekeeping</primary></indexterm>
|
||||
<indexterm><primary>crontab</primary></indexterm> tasks which have to be
|
||||
done at a certain time, or every so often. If the task is to be done by a
|
||||
process, we do it by putting it in the <filename>crontab</filename> file.
|
||||
If the task is to be done by a kernel module, we have two possibilities.
|
||||
The first is to put a process in the <filename>crontab</filename> file
|
||||
which will wake up the module by a system call when necessary, for
|
||||
example by opening a file. This is terribly inefficient, however -- we
|
||||
run a new process off of <filename>crontab</filename>, read a new
|
||||
executable to memory, and all this just to wake up a kernel module which
|
||||
is in memory anyway.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Instead of doing that, we can create a function that will be called once
|
||||
for every timer interrupt. The way we do this is we create a task,
|
||||
<indexterm><primary>task</primary></indexterm> held in a
|
||||
<structname>tq_struct</structname>
|
||||
<indexterm><primary>tq_struct</primary></indexterm> structure, which will
|
||||
hold a pointer to the function. Then, we use
|
||||
<function>queue_task</function>
|
||||
<indexterm><primary>queue_task</primary></indexterm> to put that task on
|
||||
a task list called <structname>tq_timer</structname>,
|
||||
<indexterm><primary>tq_timer</primary></indexterm> which is the list of
|
||||
tasks to be executed on the next timer interrupt. Because we want the
|
||||
function to keep on being executed, we need to put it back on
|
||||
<structname>tq_timer</structname> whenever it is called, for the next
|
||||
timer interrupt.
|
||||
</para>
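
<para>
The re-queueing pattern is easier to see in isolation. The following is a
user-space sketch, not kernel code: <structname>sim_task</structname>,
<structname>sim_queue</structname>, <function>sim_queue_task</function>
and <function>sim_timer_tick</function> are hypothetical stand-ins which
only imitate the behaviour described above, under the assumption that the
queue is emptied before its tasks are run.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the names imitate, but are not, the kernel's */
struct sim_task {
    struct sim_task *next;      /* Next task in the queue */
    void (*routine)(void *);    /* The function to run */
    void *data;                 /* The parameter for that function */
};

struct sim_queue {
    struct sim_task *head;
};

static struct sim_queue sim_tq_timer = { NULL };

/* Put a task on a queue (the order is irrelevant for this sketch) */
void sim_queue_task(struct sim_task *t, struct sim_queue *q)
{
    assert(t != NULL && q != NULL);
    t->next = q->head;
    q->head = t;
}

/* Simulate one timer interrupt: detach the whole list, then run each task
 * once. Because the list is detached first, a task which re-queues itself
 * runs again on the NEXT tick, not in an endless loop on this one. */
void sim_timer_tick(struct sim_queue *q)
{
    struct sim_task *t = q->head;

    q->head = NULL;
    while (t != NULL) {
        struct sim_task *next = t->next;

        t->next = NULL;
        t->routine(t->data);
        t = next;
    }
}

static int sim_ticks_seen = 0;
static struct sim_task sim_counter_task;

/* Count the tick, then put ourselves back on the queue */
void sim_count_tick(void *unused)
{
    (void) unused;
    sim_ticks_seen++;
    sim_queue_task(&sim_counter_task, &sim_tq_timer);
}

/* Queue the task, run n simulated ticks, return how often it was called */
int sim_run_ticks(int n)
{
    int i;

    sim_ticks_seen = 0;
    sim_tq_timer.head = NULL;
    sim_counter_task.routine = sim_count_tick;
    sim_counter_task.data = NULL;
    sim_queue_task(&sim_counter_task, &sim_tq_timer);

    for (i = 0; i < n; i++)
        sim_timer_tick(&sim_tq_timer);
    return sim_ticks_seen;
}
```
]]></programlisting>

<para>
Calling <function>sim_run_ticks</function> with five ticks runs the
routine five times and leaves the task sitting on the queue, waiting for a
sixth. That leftover entry is exactly what makes unloading a real module
delicate.
</para>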
|
||||
|
||||
<para>
|
||||
There's one more point we need to remember here. When a module is
|
||||
removed by <command>rmmod</command>,
|
||||
<indexterm><primary>rmmod</primary></indexterm> first its reference count
|
||||
<indexterm><primary>reference count</primary></indexterm> is checked. If
|
||||
it is zero, <function>cleanup_module</function>
<indexterm><primary>cleanup_module</primary></indexterm> is called.
|
||||
Then, the module is removed from memory with all its functions. Nobody
|
||||
checks to see if the timer's task list happens to contain a pointer to
|
||||
one of those functions, which will no longer be available. Ages later
|
||||
(from the computer's perspective; from a human perspective it's nothing,
|
||||
less than a hundredth of a second), the kernel has a timer interrupt and
|
||||
tries to call the function on the task list. Unfortunately, the function
|
||||
is no longer there. In most cases, the memory page where it sat is
|
||||
unused, and you get an ugly error message. But if some other code is now
|
||||
sitting at the same memory location, things could get
|
||||
<emphasis>very</emphasis> ugly. Unfortunately, we don't have an easy way
|
||||
to unregister a task from a task list.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Since <function>cleanup_module</function> can't return with an error code
|
||||
(it's a void function), the solution is to not let it return at all.
|
||||
Instead, it calls <function>sleep_on</function>
|
||||
<indexterm><primary>sleep_on</primary></indexterm> or
|
||||
<function>module_sleep_on</function>
|
||||
<indexterm><primary>module_sleep_on</primary></indexterm>
|
||||
<footnote><para>They're really the same.</para></footnote>
|
||||
to put the <command>rmmod</command> process to sleep. Before that, it
|
||||
informs the function called on the timer interrupt to stop attaching
|
||||
itself by setting a global variable. Then, on the next timer interrupt,
|
||||
the <command>rmmod</command> process will be woken up, when our function
|
||||
is no longer in the queue and it's safe to remove the module.
|
||||
</para>
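
<para>
The handshake between cleanup and the timer routine can be sketched in
user space. Everything here is hypothetical and simplified: a single
function pointer stands in for the task queue, a flag stands in for the
wait queue, and the <quote>sleep</quote> is a loop which simulates timer
ticks. It is meant only to show the order of events, not how the kernel
implements them.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

static void (*hs_pending)(void) = NULL; /* One-slot stand-in for the queue */
static int hs_stop_requested = 0;       /* Set by cleanup */
static int hs_cleanup_may_return = 0;   /* Stand-in for waking up rmmod */
static int hs_calls = 0;                /* Times the routine has run */

/* Simulate one timer interrupt: empty the slot, then run what was in it */
void hs_tick(void)
{
    void (*fn)(void) = hs_pending;

    hs_pending = NULL;
    if (fn != NULL)
        fn();
}

/* The routine run on every tick */
void hs_routine(void)
{
    hs_calls++;
    if (hs_stop_requested)
        hs_cleanup_may_return = 1;  /* Don't requeue, release cleanup */
    else
        hs_pending = hs_routine;    /* Put ourselves back for next tick */
}

/* Stand-in for init_module: queue the routine for the first time */
void hs_init(void)
{
    hs_stop_requested = 0;
    hs_cleanup_may_return = 0;
    hs_calls = 0;
    hs_pending = hs_routine;
}

/* Stand-in for cleanup_module: ask the routine to stop, then wait until
 * it has run one last time. Returns the number of ticks waited. */
int hs_cleanup(void)
{
    int waited = 0;

    /* This sketch assumes the routine is queued; otherwise nobody would
     * ever release us and the loop below would never end. */
    assert(hs_pending != NULL);

    hs_stop_requested = 1;
    while (!hs_cleanup_may_return) {    /* Real code would sleep here */
        hs_tick();
        waited++;
    }
    return waited;
}
```
]]></programlisting>

<para>
After <function>hs_cleanup</function> returns, the slot is empty, so
nothing points at the routine any more and it would be safe to free its
memory.
</para>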
|
||||
|
||||
<example><title>sched.c</title>
|
||||
|
||||
<indexterm><primary>sched.c</primary></indexterm>
|
||||
|
||||
<programlisting><![CDATA[
|
||||
|
||||
/* sched.c - schedule a function to be called on every timer interrupt.
|
||||
*
|
||||
* Copyright (C) 2001 by Peter Jay Salzman
|
||||
*/
|
||||
|
||||
/* The necessary header files */
|
||||
|
||||
/* Standard in kernel modules */
|
||||
#include <linux/kernel.h> /* We're doing kernel work */
|
||||
#include <linux/module.h> /* Specifically, a module */
|
||||
|
||||
/* Deal with CONFIG_MODVERSIONS */
|
||||
#if CONFIG_MODVERSIONS==1
|
||||
#define MODVERSIONS
|
||||
#include <linux/modversions.h>
|
||||
#endif
|
||||
|
||||
/* Necessary because we use the proc fs */
|
||||
#include <linux/proc_fs.h>
|
||||
|
||||
/* We schedule tasks here */
|
||||
#include <linux/tqueue.h>
|
||||
|
||||
/* We also need the ability to put ourselves to sleep and wake up later */
|
||||
#include <linux/sched.h>
|
||||
|
||||
/* In 2.2.3 /usr/include/linux/version.h includes a macro for this, but
|
||||
* 2.0.35 doesn't - so I add it here if necessary.
|
||||
*/
|
||||
#ifndef KERNEL_VERSION
|
||||
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
|
||||
#endif
|
||||
|
||||
/* The number of times the timer interrupt has been called so far */
|
||||
static int TimerIntrpt = 0;
|
||||
|
||||
/* This is used by cleanup, to prevent the module from being unloaded while
|
||||
* intrpt_routine is still in the task queue
|
||||
*/
|
||||
static struct wait_queue *WaitQ = NULL;
|
||||
|
||||
static void intrpt_routine(void *);
|
||||
|
||||
/* The task queue structure for this task, from tqueue.h */
|
||||
static struct tq_struct Task = {
|
||||
NULL, /* Next item in list - queue_task will do this for us */
|
||||
0, /* A flag meaning we haven't been inserted into a task
|
||||
* queue yet
|
||||
*/
|
||||
intrpt_routine, /* The function to run */
|
||||
NULL /* The void* parameter for that function */
|
||||
};
|
||||
|
||||
/* This function will be called on every timer interrupt. Notice the void*
|
||||
* pointer - task functions can be used for more than one purpose, each time
|
||||
* getting a different parameter.
|
||||
*/
|
||||
static void intrpt_routine(void *irrelevant)
|
||||
{
|
||||
/* Increment the counter */
|
||||
TimerIntrpt++;
|
||||
|
||||
/* If cleanup wants us to die */
|
||||
if (WaitQ != NULL)
|
||||
wake_up(&WaitQ); /* Now cleanup_module can return */
|
||||
else
|
||||
/* Put ourselves back in the task queue */
|
||||
queue_task(&Task, &tq_timer);
|
||||
}
|
||||
|
||||
/* Put data into the proc fs file. */
|
||||
int procfile_read(char *buffer,
|
||||
char **buffer_location, off_t offset,
|
||||
int buffer_length, int zero)
|
||||
{
|
||||
int len; /* The number of bytes actually used */
|
||||
|
||||
/* It's static so it will still be in memory when we leave this function
|
||||
*/
|
||||
static char my_buffer[80];
|
||||
|
||||
static int count = 1;
|
||||
|
||||
/* We give all of our information in one go, so if anybody asks us
|
||||
* if we have more information the answer should always be no.
|
||||
*/
|
||||
if (offset > 0)
|
||||
return 0;
|
||||
|
||||
/* Fill the buffer and get its length */
|
||||
len = sprintf(my_buffer, "Timer called %d times so far\n", TimerIntrpt);
|
||||
count++;
|
||||
|
||||
/* Tell the function which called us where the buffer is */
|
||||
*buffer_location = my_buffer;
|
||||
|
||||
/* Return the length */
|
||||
return len;
|
||||
}
|
||||
|
||||
struct proc_dir_entry Our_Proc_File = {
|
||||
0, /* Inode number - ignore, it'll be filled by proc_register_dynamic */
|
||||
5, /* Length of the file name */
|
||||
"sched", /* The file name */
|
||||
S_IFREG | S_IRUGO, /* File mode - this is a regular file which can be
|
||||
* read by its owner, its group, and everybody else
|
||||
*/
|
||||
1, /* Number of links (directories where the file is referenced) */
|
||||
0, 0, /* The uid and gid for the file - we give it to root */
|
||||
80, /* The size of the file reported by ls. */
|
||||
NULL, /* functions which can be done on the inode (linking, removing,
|
||||
* etc.) - we don't support any.
|
||||
*/
|
||||
procfile_read, /* The read function for this file, the function called
|
||||
* when somebody tries to read something from it.
|
||||
*/
|
||||
NULL /* We could have here a function to fill the file's inode, to
|
||||
* enable us to play with permissions, ownership, etc.
|
||||
*/
|
||||
};
|
||||
|
||||
/* Initialize the module - register the proc file */
|
||||
int init_module()
|
||||
{
|
||||
/* Put the task in the tq_timer task queue, so it will be executed at
|
||||
* next timer interrupt
|
||||
*/
|
||||
queue_task(&Task, &tq_timer);
|
||||
|
||||
/* Success if registering the proc file succeeds, failure otherwise */
|
||||
#if LINUX_VERSION_CODE > KERNEL_VERSION(2,2,0)
|
||||
return proc_register(&proc_root, &Our_Proc_File);
|
||||
#else
|
||||
return proc_register_dynamic(&proc_root, &Our_Proc_File);
|
||||
#endif
|
||||
}
|
||||
|
||||
/* Cleanup */
|
||||
void cleanup_module()
|
||||
{
|
||||
/* Unregister our /proc file */
|
||||
proc_unregister(&proc_root, Our_Proc_File.low_ino);
|
||||
|
||||
/* Sleep until intrpt_routine is called one last time. This is necessary,
|
||||
* because otherwise we'll deallocate the memory holding intrpt_routine
|
||||
* and Task while tq_timer still references them. Notice that here we
|
||||
* don't allow signals to interrupt us.
|
||||
*
|
||||
* Since WaitQ is now not NULL, this automatically tells the interrupt
|
||||
* routine it's time to die.
|
||||
*/
|
||||
sleep_on(&WaitQ);
|
||||
}
|
||||
|
||||
]]></programlisting>
|
||||
</example>
|
||||
</sect2>
|
||||
</sect1>
|
||||
|
|
@ -0,0 +1,269 @@
|
|||
<sect1><title>Interrupt Handlers</title>
|
||||
|
||||
<indexterm><primary>interrupt handlers</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>handlers</primary>
|
||||
<secondary>interrupt</secondary>
|
||||
</indexterm>
|
||||
|
||||
<sect2><title>Interrupt Handlers</title>
|
||||
|
||||
<para>
|
||||
Except for the last chapter, everything we've done in the kernel so far has been
|
||||
done as a response to a process asking for it, either by dealing with a
|
||||
special file, sending an <function>ioctl</function>, or issuing a system
|
||||
call. But the job of the kernel isn't just to respond to process
|
||||
requests. Another job, which is every bit as important, is to speak to
|
||||
the hardware connected to the machine.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
There are two types of interaction between the <acronym>CPU</acronym> and
|
||||
the rest of the computer's hardware. The first type is when the
|
||||
<acronym>CPU</acronym> gives orders to the hardware; the other is when
|
||||
the hardware needs to tell the <acronym>CPU</acronym> something. The
|
||||
second, called interrupts, is much harder to implement because it has to
|
||||
be dealt with when convenient for the hardware, not the
|
||||
<acronym>CPU</acronym>. Hardware devices typically have a very small
|
||||
amount of <acronym>RAM</acronym>, and if you don't read their information
|
||||
when available, it is lost.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Under Linux, hardware interrupts are called <acronym>IRQ</acronym>s
|
||||
(short for <emphasis>I</emphasis>nterrupt
|
||||
<emphasis>R</emphasis>e<emphasis>q</emphasis>uests).
|
||||
<footnote>
|
||||
<para>
|
||||
This is standard nomenclature on the Intel architecture where Linux
|
||||
originated.
|
||||
</para>
|
||||
</footnote>
|
||||
There are two types of <acronym>IRQ</acronym>s, short and long. A short
|
||||
<acronym>IRQ</acronym> is one which is expected to take a
|
||||
<emphasis>very</emphasis> short period of time, during which the rest of
|
||||
the machine will be blocked and no other interrupts will be handled. A
|
||||
long <acronym>IRQ</acronym> is one which can take longer, and during
|
||||
which other interrupts may occur (but not interrupts from the same
|
||||
device). If at all possible, it's better to declare an interrupt handler
|
||||
to be long.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
When the <acronym>CPU</acronym> receives an interrupt, it stops whatever
|
||||
it's doing (unless it's processing a more important interrupt, in which
|
||||
case it will deal with this one only when the more important one is
|
||||
done), saves certain parameters on the stack and calls the interrupt
|
||||
handler. This means that certain things are not allowed in the interrupt
|
||||
handler itself, because the system is in an unknown state. The solution
|
||||
to this problem is for the interrupt handler to do what needs to be done
|
||||
immediately, usually read something from the hardware or send something
|
||||
to the hardware, and then schedule the handling of the new information at
|
||||
a later time (this is called the <quote>bottom half</quote>)
|
||||
<indexterm><primary>bottom half</primary></indexterm> and return. The
|
||||
kernel is then guaranteed to call the bottom half as soon as possible --
|
||||
and when it does, everything allowed in kernel modules will be allowed.
|
||||
</para>
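
<para>
The split between the work done immediately and the work put off until
later can be sketched in user space. The names below
(<function>th_top_half</function>, <function>th_run_pending</function>
and friends) are hypothetical, and a simple flag stands in for whatever
the kernel uses to remember that a bottom half is waiting. Treat it as an
illustration of the idea only.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>

static unsigned char th_latest_byte;    /* Data grabbed by the top half */
static int th_bottom_half_pending = 0;  /* Is there work waiting? */
static int th_processed = 0;            /* Bytes handled so far */
static unsigned long th_sum = 0;        /* Result of the "slow" work */

/* Top half: do only what can't wait (save the byte), then mark the
 * bottom half as pending and return right away. */
void th_top_half(unsigned char byte_from_device)
{
    th_latest_byte = byte_from_device;
    th_bottom_half_pending = 1;
}

/* Bottom half: the slower processing, run when it's safe to do so */
static void th_bottom_half(void)
{
    th_processed++;
    th_sum += th_latest_byte;
}

/* Called at a safe point. Runs the bottom half if one is pending and
 * returns 1, or returns 0 if there was nothing to do. */
int th_run_pending(void)
{
    if (!th_bottom_half_pending)
        return 0;

    th_bottom_half_pending = 0;
    th_bottom_half();
    assert(th_processed > 0);
    return 1;
}
```
]]></programlisting>

<para>
Note that this sketch has room for only one byte. If the top half runs
twice before the bottom half gets its turn, the first byte is overwritten,
so a real driver would need some kind of buffer in between.
</para>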
|
||||
|
||||
<para>
|
||||
The way to implement this is to call <function>request_irq</function>
|
||||
<indexterm><primary>request_irq</primary></indexterm> to get your
|
||||
interrupt handler called when the relevant <acronym>IRQ</acronym> is
|
||||
received (there are 15 of them, plus 1 which is used to cascade the
|
||||
interrupt controllers, on Intel platforms). This function receives the
|
||||
<acronym>IRQ</acronym> number, the name of the function, flags, a name
|
||||
for <filename>/proc/interrupts</filename>
|
||||
<indexterm><primary>/proc/interrupts</primary></indexterm> and a
|
||||
parameter to pass to the interrupt handler. The flags can include
|
||||
<parameter>SA_SHIRQ</parameter>
|
||||
<indexterm><primary>SA_SHIRQ</primary></indexterm> to indicate you're
|
||||
willing to share the <acronym>IRQ</acronym> with other interrupt handlers
|
||||
(usually because a number of hardware devices sit on the same
|
||||
<acronym>IRQ</acronym>) and <parameter>SA_INTERRUPT</parameter>
|
||||
<indexterm><primary>SA_INTERRUPT</primary></indexterm> to indicate this
|
||||
is a fast (short) interrupt. This function will only succeed if there isn't
|
||||
already a handler on this <acronym>IRQ</acronym>, or if you're both
|
||||
willing to share.
|
||||
</para>
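
<para>
The rule for when a request succeeds can be written down as a small
user-space model. This is not the kernel's implementation:
<function>sim_request_irq</function>, <structname>sim_irq_line</structname>
and the <parameter>SIM_SHARE</parameter> flag are hypothetical names, and
the model simply returns -1 for any failure.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>

#define SIM_NR_IRQS 16      /* 15 usable lines plus the cascade */
#define SIM_SHARE   0x1     /* Hypothetical stand-in for a sharing flag */

struct sim_irq_line {
    int handlers;           /* How many handlers sit on this line */
    int shared;             /* Did the current handlers agree to share? */
};

static struct sim_irq_line sim_irqs[SIM_NR_IRQS];

/* Returns 0 on success, -1 if the line is out of range or busy.
 * A request succeeds if the line is free, or if both the handlers
 * already there and the newcomer are willing to share. */
int sim_request_irq(int irq, unsigned int flags)
{
    int wants_share = (flags & SIM_SHARE) != 0;

    if (irq < 0 || irq >= SIM_NR_IRQS)
        return -1;

    if (sim_irqs[irq].handlers > 0 &&
        !(sim_irqs[irq].shared && wants_share))
        return -1;

    /* Either the line was free, or everybody shares */
    sim_irqs[irq].shared = wants_share;
    sim_irqs[irq].handlers++;
    assert(sim_irqs[irq].handlers >= 1);
    return 0;
}
```
]]></programlisting>

<para>
In this model a second request on a line fails unless both the first and
the second caller passed the sharing flag.
</para>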
|
||||
|
||||
<para>
|
||||
Then, from within the interrupt handler, we communicate with the hardware
|
||||
and then use <function>queue_task_irq</function>
|
||||
<indexterm><primary>queue_task_irq</primary></indexterm> with
|
||||
<function>tq_immediate</function>
|
||||
<indexterm><primary>tq_immediate</primary></indexterm> and
|
||||
<function>mark_bh(IMMEDIATE_BH)</function>
|
||||
<indexterm><primary>mark_bh</primary></indexterm>
|
||||
<indexterm><primary>IMMEDIATE_BH</primary></indexterm> to schedule the
|
||||
bottom half. The reason we can't use the standard
|
||||
<function>queue_task</function>
|
||||
<indexterm><primary>queue_task</primary></indexterm> in version 2.0 is
|
||||
that the interrupt might happen right in the middle of somebody else's
|
||||
<function>queue_task</function>.
|
||||
<footnote>
|
||||
<para>
|
||||
<function>queue_task_irq</function> is protected from this by a
|
||||
global lock -- in 2.2 there is no <function>queue_task_irq</function>
|
||||
and <function>queue_task</function> is protected by a lock.
|
||||
</para>
|
||||
</footnote>
|
||||
We need <function>mark_bh</function> because earlier versions of Linux
|
||||
only had an array of 32 bottom halves, and now one of them
|
||||
(<parameter>IMMEDIATE_BH</parameter>) is used for the linked list of
|
||||
bottom halves for drivers which didn't get a bottom half entry assigned
|
||||
to them.
|
||||
</para>
|
||||
</sect2>
|
||||
|
||||
<sect2 id="keyboard">
|
||||
<title>Keyboards on the Intel Architecture</title>
|
||||
|
||||
<indexterm><primary>keyboard</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>Intel architecture</primary>
|
||||
<secondary>keyboard</secondary>
|
||||
</indexterm>
|
||||
|
||||
<warning>
|
||||
<para>
|
||||
The rest of this chapter is completely Intel specific. If you're not
|
||||
running on an Intel platform, it will not work. Don't even try to
|
||||
compile the code here.
|
||||
</para>
|
||||
</warning>
|
||||
|
||||
<para>
|
||||
I had a problem with writing the sample code for this chapter. On one
|
||||
hand, for an example to be useful it has to run on everybody's computer
|
||||
with meaningful results. On the other hand, the kernel already includes
|
||||
device drivers for all of the common devices, and those device drivers
|
||||
won't coexist with what I'm going to write. The solution I've found was
|
||||
to write something for the keyboard interrupt, and disable the regular
|
||||
keyboard interrupt handler first. Since it is defined as a static symbol
|
||||
in the kernel source files (specifically,
|
||||
<filename>drivers/char/keyboard.c</filename>), there is no way to restore
|
||||
it. Before <userinput>insmod</userinput>'ing this code, run
<userinput>sleep 120 ; reboot</userinput> on another terminal if you
value your file system.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
This code binds itself to <acronym>IRQ</acronym> 1, which is the
|
||||
<acronym>IRQ</acronym> of the keyboard controller under Intel
|
||||
architectures. Then, when it receives a keyboard interrupt, it reads the
|
||||
keyboard's status (that's the purpose of the
|
||||
<userinput>inb(0x64)</userinput>)
|
||||
<indexterm><primary>inb</primary></indexterm> and the scan code, which is
|
||||
the value returned by the keyboard. Then, as soon as the kernel thinks
|
||||
it's feasible, it runs <function>got_char</function> which gives the code
|
||||
of the key used (the first seven bits of the scan code) and whether it
|
||||
has been pressed (if the 8th bit is zero) or released (if it's one).
|
||||
</para>
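
<para>
The decoding of the scan code described above is plain bit masking, and it
can be tried in user space without touching the keyboard. The structure
<structname>key_event</structname> and the function
<function>decode_scancode</function> are hypothetical names used only for
this sketch.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>

/* Hypothetical helper type for this sketch */
struct key_event {
    unsigned char key;  /* The low seven bits: which key it was */
    int released;       /* 1 if the key was released, 0 if pressed */
};

/* Split a scan code into the key code and the pressed/released bit */
struct key_event decode_scancode(unsigned char scancode)
{
    struct key_event ev;

    ev.key = scancode & 0x7F;               /* First seven bits */
    ev.released = (scancode & 0x80) != 0;   /* The 8th bit */

    assert(ev.key <= 0x7F);     /* By construction, fits in seven bits */
    return ev;
}
```
]]></programlisting>

<para>
For example, the scan codes 0x1E and 0x9E decode to the same key code,
0x1E, the first as a press and the second as a release.
</para>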
|
||||
|
||||
<example><title>intrpt.c</title>
|
||||
|
||||
<indexterm><primary>intrpt.c</primary></indexterm>
|
||||
|
||||
<programlisting><![CDATA[
|
||||
/* intrpt.c - An interrupt handler.
|
||||
*
|
||||
* Copyright (C) 2001 by Peter Jay Salzman
|
||||
*/
|
||||
|
||||
/* The necessary header files */
|
||||
|
||||
/* Standard in kernel modules */
|
||||
#include <linux/kernel.h> /* We're doing kernel work */
|
||||
#include <linux/module.h> /* Specifically, a module */
|
||||
|
||||
/* Deal with CONFIG_MODVERSIONS */
|
||||
#if CONFIG_MODVERSIONS==1
|
||||
#define MODVERSIONS
|
||||
#include <linux/modversions.h>
|
||||
#endif
|
||||
|
||||
#include <linux/sched.h>
|
||||
#include <linux/tqueue.h>
|
||||
|
||||
/* We want an interrupt */
|
||||
#include <linux/interrupt.h>
|
||||
|
||||
#include <asm/io.h>
|
||||
|
||||
/* In 2.2.3 /usr/include/linux/version.h includes a macro for this, but
|
||||
* 2.0.35 doesn't - so I add it here if necessary.
|
||||
*/
|
||||
#ifndef KERNEL_VERSION
|
||||
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
|
||||
#endif
|
||||
|
||||
/* Bottom Half - this will get called by the kernel as soon as it's safe
|
||||
* to do everything normally allowed by kernel modules.
|
||||
*/
|
||||
static void got_char(void *scancode)
|
||||
{
|
||||
printk("Scan Code %x %s.\n",
|
||||
(int) *((char *) scancode) & 0x7F,
|
||||
*((char *) scancode) & 0x80 ? "Released" : "Pressed");
|
||||
}
|
||||
|
||||
/* This function services keyboard interrupts. It reads the relevant
|
||||
* information from the keyboard and then schedules the bottom half
|
||||
* to run when the kernel considers it safe.
|
||||
*/
|
||||
void irq_handler(int irq, void *dev_id, struct pt_regs *regs)
|
||||
{
|
||||
/* These variables are static because they need to be
|
||||
* accessible (through pointers) to the bottom half routine.
|
||||
*/
|
||||
static unsigned char scancode;
|
||||
static struct tq_struct task = {NULL, 0, got_char, &scancode};
|
||||
unsigned char status;
|
||||
|
||||
/* Read keyboard status */
|
||||
status = inb(0x64);
|
||||
scancode = inb(0x60);
|
||||
|
||||
/* Schedule bottom half to run */
|
||||
#if LINUX_VERSION_CODE > KERNEL_VERSION(2,2,0)
|
||||
queue_task(&task, &tq_immediate);
|
||||
#else
|
||||
queue_task_irq(&task, &tq_immediate);
|
||||
#endif
|
||||
mark_bh(IMMEDIATE_BH);
|
||||
}
|
||||
|
||||
/* Initialize the module - register the IRQ handler */
|
||||
int init_module()
|
||||
{
|
||||
/* Since the keyboard handler won't co-exist with another handler,
|
||||
* such as us, we have to disable it (free its IRQ) before we do
|
||||
* anything. Since we don't know where it is, there's no way to
|
||||
* reinstate it later - so the computer will have to be rebooted
|
||||
* when we're done.
|
||||
*/
|
||||
free_irq(1, NULL);
|
||||
|
||||
/* Request IRQ 1, the keyboard IRQ, to go to our irq_handler.
|
||||
* SA_SHIRQ means we're willing to have other handlers on this IRQ.
|
||||
* SA_INTERRUPT can be used to make the handler into a fast interrupt.
|
||||
*/
|
||||
return request_irq(1, /* The number of the keyboard IRQ on PCs */
|
||||
irq_handler, /* our handler */
|
||||
SA_SHIRQ,
|
||||
"test_keyboard_irq_handler", NULL);
|
||||
}
|
||||
|
||||
/* Cleanup */
|
||||
void cleanup_module()
|
||||
{
|
||||
/* This is only here for completeness. It's totally irrelevant, since
|
||||
* we don't have a way to restore the normal keyboard interrupt so the
|
||||
* computer is completely useless and has to be rebooted.
|
||||
*/
|
||||
free_irq(1, NULL);
|
||||
}
|
||||
]]></programlisting>
|
||||
</example>
|
||||
</sect2>
|
||||
</sect1>
|
|
@ -0,0 +1,83 @@
|
|||
<sect1><title>Symmetrical Multi-Processing</title>
|
||||
|
||||
<indexterm><primary>SMP</primary></indexterm>
|
||||
<indexterm><primary>multi-processing</primary></indexterm>
|
||||
<indexterm><primary>symmetrical multi-processing</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>processing</primary>
|
||||
<secondary>multi</secondary>
|
||||
</indexterm>
|
||||
|
||||
<para>Good writing style says we have a paragraph here.</para>
|
||||
|
||||
<sect2><title>Symmetrical Multi-Processing</title>
|
||||
|
||||
<para>
|
||||
One of the easiest (read: cheapest) ways to improve hardware performance
|
||||
is to put more than one <acronym>CPU</acronym> on the board.
|
||||
<indexterm>
|
||||
<primary>CPU</primary>
|
||||
<secondary>multiple</secondary>
|
||||
</indexterm>
|
||||
This can be done either by making the different <acronym>CPU</acronym>s take
|
||||
on different jobs (asymmetrical multi-processing) or by making them all
|
||||
run in parallel, doing the same job (symmetrical multi-processing, a.k.a.
|
||||
<acronym>SMP</acronym>). Doing asymmetrical multi-processing effectively
|
||||
requires specialized knowledge about the tasks the computer should do,
|
||||
which is unavailable in a general purpose operating system such as Linux.
|
||||
On the other hand, symmetrical multi-processing is relatively easy to
|
||||
implement.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
By relatively easy, I mean exactly that -- not that it's
|
||||
<emphasis>really</emphasis> easy. In a symmetrical multi-processing
|
||||
environment, the <acronym>CPU</acronym>s share the same memory, and as a
|
||||
result code running in one <acronym>CPU</acronym> can affect the memory
|
||||
used by another. You can no longer be certain that a variable you've set
|
||||
to a certain value in the previous line still has that value -- the other
|
||||
<acronym>CPU</acronym> might have played with it while you weren't
|
||||
looking. Obviously, it's impossible to program like this.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
In the case of process programming this normally isn't an issue, because
|
||||
a process will normally only run on one <acronym>CPU</acronym> at a time.
|
||||
<footnote>
|
||||
<para>
|
||||
The exception is threaded processes, which can run on several
|
||||
<acronym>CPU</acronym>s at once.
|
||||
</para>
|
||||
</footnote>
|
||||
The kernel, on the other hand, could be called by different processes
|
||||
running on different <acronym>CPU</acronym>s.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
In version 2.0.x, this isn't a problem because the entire kernel is in
|
||||
one big spinlock. This means that if one <acronym>CPU</acronym> is in
|
||||
the kernel and another <acronym>CPU</acronym> wants to get in, for
|
||||
example because of a system call, it has to wait until the first
|
||||
<acronym>CPU</acronym> is done. This makes Linux <acronym>SMP</acronym>
|
||||
safe,
|
||||
<footnote>
|
||||
<para>Meaning it is safe to use it with <acronym>SMP</acronym></para>
|
||||
</footnote>
|
||||
but terribly inefficient.
|
||||
</para>
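
<para>
The idea of one lock around the whole kernel can be modelled with a single
flag. The sketch below uses the standard C11
<structname>atomic_flag</structname> type; the function names and the
counter are hypothetical, and the <quote>CPUs</quote> are just successive
calls from one thread, so it shows the locking rule rather than real
parallelism. It is not how the kernel implements its lock.
</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdatomic.h>

/* One lock for the whole (simulated) kernel */
static atomic_flag sim_big_lock = ATOMIC_FLAG_INIT;
static int sim_cpus_in_kernel = 0;

/* Try to enter the kernel. Returns 1 on success, 0 if somebody else is
 * already inside. A real spinlock would keep retrying in a loop until
 * the test-and-set succeeds. */
int sim_try_enter_kernel(void)
{
    if (atomic_flag_test_and_set(&sim_big_lock))
        return 0;   /* The flag was already set: the lock is taken */

    sim_cpus_in_kernel++;
    return 1;
}

/* Leave the kernel and let the next CPU in */
void sim_leave_kernel(void)
{
    /* With one big lock there can never be more than one CPU inside */
    assert(sim_cpus_in_kernel == 1);
    sim_cpus_in_kernel--;
    atomic_flag_clear(&sim_big_lock);
}
```
]]></programlisting>

<para>
A second attempt to enter fails until the first caller leaves, which is
both why this scheme is safe and why it wastes the other
<acronym>CPU</acronym>s.
</para>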
|
||||
|
||||
<para>
|
||||
In version 2.2.x, several <acronym>CPU</acronym>s can be in the kernel at
|
||||
the same time. This is something module writers need to be aware of. I
|
||||
got somebody to give me access to an <acronym>SMP</acronym> box, so
|
||||
hopefully the next version of this book will include more information.
|
||||
</para>
|
||||
|
||||
<!-- Unfortunately, I don't have access to an SMP box to test things, so I
|
||||
can't write a chapter about how to do it right. If anybody out there has
|
||||
access to one and is willing to help me with this, I'll be grateful. If a
|
||||
company will provide me with this access, I'll give them a free one
|
||||
paragraph ad at the top of this chapter.
|
||||
-->
|
||||
</sect2>
|
||||
</sect1>
|
|
@ -0,0 +1,61 @@
|
|||
<sect1><title>Common Pitfalls</title>
|
||||
|
||||
<sect2><title>Common Pitfalls</title>
|
||||
|
||||
<para>Before I send you on your way to go out into the world and write
|
||||
kernel modules, there are a few things I need to warn you about. If I
|
||||
fail to warn you and something bad happens, please report the problem to
|
||||
me for a full refund of the amount I was paid for your copy of the book.
|
||||
</para>
|
||||
|
||||
<indexterm><primary>refund policy</primary></indexterm>
|
||||
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
<term>Using standard libraries</term>
|
||||
<indexterm><primary>standard libraries</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>libraries</primary>
|
||||
<secondary>standard</secondary>
|
||||
</indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
You can't do that. In a kernel module you can only use kernel
|
||||
functions, which are the functions you can see in
|
||||
<filename>/proc/ksyms</filename>.
|
||||
<indexterm><primary>/proc/ksyms</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>proc file</primary>
|
||||
<secondary>ksyms</secondary>
|
||||
</indexterm>
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term>Disabling interrupts</term>
|
||||
<indexterm>
|
||||
<primary>interrupts</primary>
|
||||
<secondary>disabling</secondary>
|
||||
</indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
You might need to do this for a short time and that is OK, but if
|
||||
you don't enable them afterwards, your system will be stuck and
|
||||
you'll have to power it off.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term>Sticking your head inside a large carnivore</term>
|
||||
<listitem>
|
||||
<para>
|
||||
I probably don't have to warn you about this, but I figured I would
|
||||
anyway, just in case.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
</sect2>
|
||||
</sect1>
|
|
@ -0,0 +1,185 @@
|
|||
<sect1>
|
||||
<title>Changes between 2.0 and 2.2</title>
|
||||
|
||||
<indexterm><primary>2.2 changes</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>kernel</primary>
|
||||
<secondary>versions</secondary>
|
||||
</indexterm>
|
||||
|
||||
<para>Good writing style says we have a paragraph here.</para>
|
||||
|
||||
<sect2>
|
||||
<title>Changes between 2.0 and 2.2</title>
|
||||
|
||||
<para>
|
||||
I don't know the entire kernel well enough to document all of the
|
||||
changes. In the course of converting the examples (or actually, adapting
|
||||
Emmanuel Papirakis's changes) I came across the following differences. I
|
||||
listed all of them here together to help module programmers, especially
|
||||
those who learned from previous versions of this book and are most
|
||||
familiar with the techniques I use, convert to the new version.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
An additional resource for people who wish to convert to 2.2 is located
|
||||
on
|
||||
<ulink
|
||||
url="http://www.atnf.csiro.au/~rgooch/linux/docs/porting-to-2.2.html">
|
||||
Richard Gooch's site
|
||||
</ulink>.
|
||||
</para>
|
||||
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
<term><filename class="headerfile">asm/uaccess.h</filename></term>
|
||||
<indexterm><primary>asm/uaccess.h</primary></indexterm>
|
||||
<indexterm>
|
||||
<primary>asm</primary>
|
||||
<secondary>uaccess.h</secondary>
|
||||
</indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
If you need <function>put_user</function>
|
||||
<indexterm><primary>put_user</primary></indexterm> or
|
||||
<function>get_user</function>
|
||||
<indexterm><primary>get_user</primary></indexterm> you have to
|
||||
<userinput>#include</userinput> it.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><function>get_user</function></term>
|
||||
<listitem>
|
||||
<para>
|
||||
In version 2.2, <function>get_user</function> receives both the
|
||||
pointer into user memory and the variable in kernel memory to fill
|
||||
with the information. The reason for this is that
|
||||
<function>get_user</function> can now read two or four bytes at a
|
||||
time if the variable we read is two or four bytes long.
|
||||
</para>
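<para>
To make the new calling convention concrete, here is a minimal userspace
sketch. It is not kernel code: <function>model_get_user</function> and the
<function>fetch_*</function> helpers are hypothetical names invented for this
illustration, and a plain <function>memcpy</function> stands in for the
checked access to user memory. It only shows the shape of the call (variable
first, pointer second, status as the result) and how the width of the read
follows the type being read.
</para>

```c
#include <string.h>

/*
 * Hypothetical userspace stand-in for the 2.2-style get_user(var, ptr).
 * It copies sizeof(*(ptr)) bytes into var and evaluates to 0 (success).
 * The real kernel macro also validates the user pointer; this model
 * does not, so it can never report a fault.
 */
#define model_get_user(var, ptr) \
    (memcpy(&(var), (ptr), sizeof(*(ptr))), 0)

/* One-byte read: the pointer type makes the macro copy 1 byte. */
int fetch_u8(const unsigned char *src, unsigned char *out)
{
    unsigned char v;
    int err = model_get_user(v, src);

    if (err == 0)
        *out = v;
    return err;
}

/* Two-byte read (on platforms where short is 2 bytes): same macro. */
int fetch_u16(const unsigned short *src, unsigned short *out)
{
    unsigned short v;
    int err = model_get_user(v, src);

    if (err == 0)
        *out = v;
    return err;
}

/* Four-byte read (on platforms where int is 4 bytes): same macro again. */
int fetch_u32(const unsigned int *src, unsigned int *out)
{
    unsigned int v;
    int err = model_get_user(v, src);

    if (err == 0)
        *out = v;
    return err;
}
```

<para>
Because the destination is passed in rather than returned, the result of the
call is free to carry a success or failure status, which is why each helper
above checks it before using the value.
</para>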
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><structname>file_operations</structname></term>
|
||||
<indexterm>
|
||||
<primary>structure</primary>
|
||||
<secondary>file_operations</secondary>
|
||||
</indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
This structure now has a flush
|
||||
<indexterm><primary>flush</primary></indexterm> function between
|
||||
the <function>open</function> and <function>close</function>
|
||||
functions.
|
||||
</para>
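<para>
The extra slot matters most when the structure is filled in positionally. The
sketch below is a userspace model with heavily simplified, hypothetical
structures (<structname>model_fops_20</structname> and
<structname>model_fops_22</structname> have only the members needed to make
the point, and the member called close in the text appears as
<structfield>release</structfield>). It shows how an initializer written for
the old ordering silently lands a function in the wrong slot.
</para>

```c
#include <stddef.h>

typedef int (*model_op)(void);

/* Simplified old-style ordering: open is followed directly by release. */
struct model_fops_20 {
    model_op open;
    model_op release;
};

/* Simplified new-style ordering: flush sits between open and release. */
struct model_fops_22 {
    model_op open;
    model_op flush;
    model_op release;
};

int model_open(void)    { return 1; }
int model_release(void) { return 2; }

/* The initializer as it would have been written for the old layout. */
const struct model_fops_20 old_fops = { model_open, model_release };

/*
 * The same positional initializer reused with the new layout:
 * model_release ends up in the flush slot and release is left NULL.
 */
const struct model_fops_22 stale_fops = { model_open, model_release };

/* Corrected initializer: an explicit NULL placeholder for flush. */
const struct model_fops_22 fixed_fops = { model_open, NULL, model_release };
```

<para>
In this model the compiler accepts both versions without complaint, so the
only defence is to revisit every positional initializer when converting.
</para>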
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term>
|
||||
<function>close</function> in
|
||||
<structname>file_operations</structname>
|
||||
</term>
|
||||
<indexterm><primary>close</primary></indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
In version 2.2, the <function>close</function> function returns an
|
||||
integer, so it's allowed to fail.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term>
|
||||
<function>read</function> and <function>write</function> in
|
||||
<structname>file_operations</structname>
|
||||
</term>
|
||||
<indexterm><primary>read</primary></indexterm>
|
||||
<indexterm><primary>write</primary></indexterm>
|
||||
<indexterm><primary>ssize_t</primary></indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
The prototypes of these functions changed. They now return
|
||||
<userinput>ssize_t</userinput> instead of an integer, and their
|
||||
parameter list is different. The inode is no longer a parameter,
|
||||
while the offset into the file now is.
|
||||
</para>
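<para>
As an illustration of the new shape, here is a userspace sketch of a read
routine that takes a file, a buffer, a length and a pointer to the offset, and
returns <userinput>ssize_t</userinput>. Everything prefixed with
<literal>model_</literal> is a hypothetical stand-in invented for this
example; a real module would receive the kernel's own file structure and
would copy to user memory with <function>put_user</function> rather than
<function>memcpy</function>.
</para>

```c
#include <string.h>
#include <sys/types.h>   /* ssize_t */

typedef long long model_loff_t;      /* stand-in for the kernel's loff_t */

struct model_file {                  /* stand-in for struct file */
    const char *data;                /* contents the "device" exposes */
    size_t len;                      /* number of bytes in data */
};

/*
 * Copy up to length bytes starting at *offset into buffer, advance
 * *offset past what was copied, and return the byte count.
 * Returns 0 at end of file and -1 for a negative offset.
 */
ssize_t model_read(struct model_file *file, char *buffer,
                   size_t length, model_loff_t *offset)
{
    size_t remaining, n;

    if (*offset < 0)
        return -1;
    if ((size_t)*offset >= file->len)
        return 0;                    /* nothing left to read */

    remaining = file->len - (size_t)*offset;
    n = length < remaining ? length : remaining;

    memcpy(buffer, file->data + *offset, n);
    *offset += (model_loff_t)n;
    return (ssize_t)n;
}
```

<para>
Note that the routine never needs the inode: the offset it works with arrives
as its own parameter, and the signed return type leaves room for both a byte
count and a negative error value.
</para>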
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><function>proc_register_dynamic</function></term>
|
||||
<indexterm><primary>proc_register_dynamic</primary></indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
This function no longer exists. Instead, you call the regular
|
||||
<function>proc_register</function>
|
||||
<indexterm><primary>proc_register</primary></indexterm> and put
|
||||
zero in the inode field of the structure.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term>Signals</term>
|
||||
<indexterm><primary>signals</primary></indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
The signals in the task structure are no longer a 32-bit integer,
|
||||
but an array of <parameter>_NSIG_WORDS</parameter>
|
||||
<indexterm><primary>_NSIG_WORDS</primary></indexterm> integers.
|
||||
</para>
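<para>
The practical consequence is that code can no longer test a signal with a
single shift and mask on one integer. The following userspace sketch models
the idea with hypothetical names (<literal>MODEL_NSIG</literal>,
<structname>model_sigset</structname> and the two helper functions are not
kernel identifiers, and the value 64 is an assumption chosen for the example).
It picks the word first and the bit within the word second.
</para>

```c
#include <limits.h>

#define MODEL_NSIG       64   /* assumed number of signals in this model */
#define MODEL_NSIG_BPW   ((int)(sizeof(unsigned long) * CHAR_BIT))
#define MODEL_NSIG_WORDS ((MODEL_NSIG + MODEL_NSIG_BPW - 1) / MODEL_NSIG_BPW)

/* A signal set stored as an array of words instead of one integer. */
struct model_sigset {
    unsigned long sig[MODEL_NSIG_WORDS];
};

/* Mark signal signo (numbered from 1) as a member of the set. */
void model_sigaddset(struct model_sigset *set, int signo)
{
    int bit = signo - 1;

    set->sig[bit / MODEL_NSIG_BPW] |= 1UL << (bit % MODEL_NSIG_BPW);
}

/* Return 1 if signal signo (numbered from 1) is in the set, else 0. */
int model_sigismember(const struct model_sigset *set, int signo)
{
    int bit = signo - 1;

    return (int)((set->sig[bit / MODEL_NSIG_BPW]
                  >> (bit % MODEL_NSIG_BPW)) & 1UL);
}
```

<para>
With helpers like these, the calling code stays the same whether the set
occupies one word or several.
</para>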
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term><function>queue_task_irq</function></term>
|
||||
<indexterm><primary>queue_task_irq</primary></indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
Even if you want to schedule a task to happen from inside an
|
||||
interrupt handler, you use <function>queue_task</function>,
|
||||
<indexterm><primary>queue_task</primary></indexterm> not
|
||||
<function>queue_task_irq</function>.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<indexterm><primary>interrupts</primary></indexterm>
|
||||
<indexterm><primary>irqs</primary></indexterm>
|
||||
|
||||
<varlistentry>
|
||||
<term>Module Parameters</term>
|
||||
<indexterm>
|
||||
<primary>module</primary>
|
||||
<secondary>parameters</secondary>
|
||||
</indexterm>
|
||||
<indexterm><primary>module parameters</primary></indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
You no longer just declare module parameters as global variables.
|
||||
In 2.2 you have to also use <parameter>MODULE_PARM</parameter>
|
||||
<indexterm><primary>MODULE_PARM</primary></indexterm> to declare
|
||||
their type. This is a big improvement, because it allows the module
|
||||
to receive string parameters which start with a digit, for
|
||||
example, without getting confused.
|
||||
</para>
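<para>
To see why a declared type helps, consider the userspace sketch below. It is
only a model: <structname>model_param</structname> and
<function>model_set_param</function> are hypothetical names, and the single
letters <literal>'i'</literal> and <literal>'s'</literal> are used here simply
as tags for "integer" and "string". Once the type is declared up front, a
value such as <literal>3com</literal> is stored as a string when a string was
asked for, and rejected when an integer was asked for, with no guessing based
on its first character.
</para>

```c
#include <stdlib.h>
#include <string.h>

/* A parameter whose type is declared before any value is seen. */
struct model_param {
    char type;               /* 'i' = integer, 's' = string (model tags) */
    long int_value;          /* valid when type == 'i' */
    const char *str_value;   /* valid when type == 's' */
};

/*
 * Store text into the parameter according to its declared type.
 * Returns 0 on success, -1 if text does not fit the declared type.
 */
int model_set_param(struct model_param *p, const char *text)
{
    char *end;
    long parsed;

    switch (p->type) {
    case 'i':
        parsed = strtol(text, &end, 0);
        if (end == text || *end != '\0')
            return -1;       /* not a complete integer */
        p->int_value = parsed;
        return 0;
    case 's':
        p->str_value = text; /* kept verbatim, digits and all */
        return 0;
    default:
        return -1;           /* unknown type tag */
    }
}
```

<para>
The same text therefore yields different, but always predictable, results
depending on what the module declared.
</para>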
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term>Symmetrical Multi-Processing</term>
|
||||
<indexterm><primary>Symmetrical Multi-Processing</primary></indexterm>
|
||||
<indexterm><primary>SMP</primary></indexterm>
|
||||
<listitem>
|
||||
<para>
|
||||
The kernel is no longer inside one huge spinlock, which means that
|
||||
kernel modules have to be aware of <acronym>SMP</acronym>.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
</sect2>
|
||||
</sect1>
|
|
@ -0,0 +1,45 @@
|
|||
<sect1>
|
||||
<title>Where From Here?</title>
|
||||
|
||||
<para>Good writing style says we have a paragraph here.</para>
|
||||
|
||||
<sect2>
|
||||
<title>Where From Here?</title>
|
||||
|
||||
<para>
|
||||
I could easily have squeezed a few more chapters into this book. I could
|
||||
have added a chapter about creating new file systems, or about adding new
|
||||
protocol stacks (as if there's a need for that -- you'd have to dig
|
||||
underground to find a protocol stack not supported by Linux). I could
|
||||
have added explanations of the kernel mechanisms we haven't touched upon,
|
||||
such as bootstrapping or the disk interface.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
However, I chose not to. My purpose in writing this book was to provide
|
||||
initiation into the mysteries of kernel module programming and to teach
|
||||
the common techniques for that purpose. For people seriously interested
|
||||
in kernel programming, I recommend Juan-Mariano de Goyeneche's
|
||||
<ulink
|
||||
url="http://jungla.dit.upm.es/~jmseyas/linux/kernel/hackers-docs.html">
|
||||
list of kernel resources
|
||||
</ulink>. Also, as Linus said, the best way to learn the kernel is to
|
||||
read the source code yourself.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
If you're interested in more examples of short kernel modules, I
|
||||
recommend Phrack magazine. Even if you're not interested in security,
|
||||
and as a programmer you should be, the kernel modules there are good
|
||||
examples of what you can do inside the kernel, and they're short enough
|
||||
not to require too much effort to understand.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
I hope I have helped you in your quest to become a better programmer, or
|
||||
at least to have fun through technology. And, if you do write useful
|
||||
kernel modules, I hope you publish them under the GPL, so I can use them
|
||||
too.
|
||||
</para>
|
||||
</sect2>
|
||||
</sect1>
|
|
@ -0,0 +1,92 @@
|
|||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V3.1//EN" [
|
||||
<!ENTITY Forward SYSTEM "00-Forward.sgml">
|
||||
<!ENTITY Introduction SYSTEM "01-Introduction.sgml">
|
||||
<!ENTITY HelloWorld SYSTEM "02-HelloWorld.sgml">
|
||||
<!ENTITY Preliminaries SYSTEM "03-Preliminaries.sgml">
|
||||
<!ENTITY CharDevFiles SYSTEM "04-CharacterDeviceFiles.sgml">
|
||||
<!ENTITY TheProcFileSystem SYSTEM "05-TheProcFileSystem.sgml">
|
||||
<!ENTITY UsingProcForInput SYSTEM "06-UsingProcForInput.sgml">
|
||||
<!ENTITY TalkingToDevFiles SYSTEM "07-TalkingToDeviceFiles.sgml">
|
||||
<!ENTITY SystemCalls SYSTEM "08-SystemCalls.sgml">
|
||||
<!ENTITY BlockingProcesses SYSTEM "09-BlockingProcesses.sgml">
|
||||
<!ENTITY ReplacingPrintks SYSTEM "10-ReplacingPrintks.sgml">
|
||||
<!ENTITY SchedulingTasks SYSTEM "11-SchedulingTasks.sgml">
|
||||
<!ENTITY InterruptHandlers SYSTEM "12-InterruptHandlers.sgml">
|
||||
<!ENTITY SymmetricMultiProc SYSTEM "13-SymmetricMultiProcessing.sgml">
|
||||
<!ENTITY CommonPitfalls SYSTEM "14-CommonPitfalls.sgml">
|
||||
<!ENTITY Changes20-22 SYSTEM "A1-ChangesBet20And22.sgml">
|
||||
<!ENTITY WhereFromHere SYSTEM "A2-WhereToGoFromHere.sgml">
|
||||
]>
|
||||
<book>
|
||||
<bookinfo>
|
||||
<title>The Linux Kernel Module Programming Guide</title>
|
||||
<titleabbrev>MPG</titleabbrev>
|
||||
<authorgroup>
|
||||
<collab>
|
||||
<collabname>Peter Jay Salzman</collabname>
|
||||
</collab>
|
||||
|
||||
<collab><collabname>Ori Pomerantz</collabname></collab>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2001</year>
|
||||
<holder>Peter Jay Salzman</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>The Linux Kernel Module Programming Guide is a free book; you may
|
||||
reproduce and/or modify it under the terms of the Open Software
|
||||
License, version 1.1. You can obtain a copy of this license at <ulink
|
||||
url="http://opensource.org/licenses/osl.php"
|
||||
>http://opensource.org/licenses/osl.php</ulink>.</para>
|
||||
|
||||
<para>This book is distributed in the hope it will be useful, but
|
||||
without any warranty, without even the implied warranty of
|
||||
merchantability or fitness for a particular purpose.</para>
|
||||
|
||||
<para>The author encourages wide distribution of this book for personal
|
||||
or commercial use, provided the above copyright notice remains intact
|
||||
and the method adheres to the provisions of the Open Software License.
|
||||
In summary, you may copy and distribute this book free of charge or for
|
||||
a profit. No explicit permission is required from the author for
|
||||
reproduction of this book in any medium, physical or electronic.</para>
|
||||
|
||||
<para>Derivative works and translations of this document must be placed
|
||||
under the Open Software License, and the original copyright notice
|
||||
must remain intact. If you have contributed new material to this book,
|
||||
you must make the material and source code available for your
|
||||
revisions. Please make revisions and updates available directly to the
|
||||
document maintainer, Peter Jay Salzman <email>p@dirac.org</email>.
|
||||
This will allow for the merging of updates and provide consistent
|
||||
revisions to the Linux community.</para>
|
||||
|
||||
<para>If you publish or distribute this book commercially, donations,
|
||||
royalties, and/or printed copies are greatly appreciated by the author
|
||||
and the <ulink url="http://www.tldp.org">Linux Documentation
|
||||
Project</ulink> (LDP). Contributing in this way shows your support for
|
||||
free software and the LDP. If you have questions or comments, please
|
||||
contact the address above.</para>
|
||||
</legalnotice>
|
||||
|
||||
</bookinfo>
|
||||
|
||||
<preface><title>Foreword</title> &Forward;</preface>
|
||||
<chapter><title>Introduction</title> &Introduction;</chapter>
|
||||
<chapter><title>Hello World</title> &HelloWorld;</chapter>
|
||||
<chapter><title>Preliminaries</title> &Preliminaries;</chapter>
|
||||
<chapter><title>Character Device Files</title> &CharDevFiles;</chapter>
|
||||
<chapter><title>The /proc File System</title> &TheProcFileSystem;</chapter>
|
||||
<chapter><title>Using /proc For Input</title> &UsingProcForInput;</chapter>
|
||||
<chapter><title>Talking To Device Files</title> &TalkingToDevFiles;</chapter>
|
||||
<chapter><title>System Calls</title> &SystemCalls;</chapter>
|
||||
<chapter><title>Blocking Processes</title> &BlockingProcesses;</chapter>
|
||||
<chapter><title>Replacing Printks</title> &ReplacingPrintks;</chapter>
|
||||
<chapter><title>Scheduling Tasks</title> &SchedulingTasks;</chapter>
|
||||
<chapter><title>Interrupt Handlers</title> &InterruptHandlers;</chapter>
|
||||
<chapter><title>Symmetric Multi Processing</title>&SymmetricMultiProc;</chapter>
|
||||
<chapter><title>Common Pitfalls</title> &CommonPitfalls;</chapter>
|
||||
<appendix><title>Changes: 2.0 To 2.2</title> &Changes20-22;</appendix>
|
||||
<appendix><title>Where To Go From Here</title> &WhereFromHere;</appendix>
|
||||
|
||||
</book>
|