<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
|
|
<TITLE>Linux Kernel 2.4 Internals: Booting</TITLE>
|
|
<LINK HREF="lki-2.html" REL=next>
|
|
|
|
<LINK HREF="lki.html#toc1" REL=contents>
|
|
</HEAD>
|
|
<BODY>
|
|
<A HREF="lki-2.html">Next</A>
|
|
Previous
|
|
<A HREF="lki.html#toc1">Contents</A>
|
|
<HR>
|
|
<H2><A NAME="s1">1. Booting</A></H2>
|
|
|
|
<P>
|
|
<H2><A NAME="ss1.1">1.1 Building the Linux Kernel Image</A>
|
|
</H2>
|
|
|
|
<P>This section explains the steps taken during compilation of the Linux kernel
|
|
and the output produced at each stage.
|
|
The build process depends on the architecture so I would like to emphasize
|
|
that we only consider building a Linux/x86 kernel.
|
|
<P>When the user types 'make zImage' or 'make bzImage' the resulting bootable
|
|
kernel image is stored as
|
|
<CODE>arch/i386/boot/zImage</CODE> or
|
|
<CODE>arch/i386/boot/bzImage</CODE> respectively.
|
|
Here is how the image is built:
|
|
<OL>
|
|
<LI> C and assembly source files are compiled into ELF relocatable object format (.o) and
|
|
some of them are grouped logically into archives (.a) using
|
|
<B>ar(1)</B>.
|
|
</LI>
|
|
<LI> Using <B>ld(1)</B>, the above .o and .a are linked into <CODE>vmlinux</CODE> which is a
|
|
statically linked, non-stripped ELF 32-bit LSB 80386 executable file.
|
|
</LI>
|
|
<LI> <CODE>System.map</CODE> is produced by <B>nm vmlinux</B>; irrelevant or uninteresting
symbols are grepped out.
|
|
</LI>
|
|
<LI> Enter directory <CODE>arch/i386/boot</CODE>.
|
|
</LI>
|
|
<LI> Bootsector asm code <CODE>bootsect.S</CODE> is preprocessed either with or without
|
|
<B>-D__BIG_KERNEL__</B>, depending on whether the target is
|
|
bzImage or zImage, into <CODE>bbootsect.s</CODE> or <CODE>bootsect.s</CODE> respectively.
|
|
</LI>
|
|
<LI> <CODE>bbootsect.s</CODE> is assembled and then converted into 'raw binary' form
|
|
called <CODE>bbootsect</CODE> (or <CODE>bootsect.s</CODE> assembled and raw-converted into
|
|
<CODE>bootsect</CODE> for zImage).
|
|
</LI>
|
|
<LI> Setup code <CODE>setup.S</CODE> (<CODE>setup.S</CODE> includes <CODE>video.S</CODE>) is preprocessed into
<CODE>bsetup.s</CODE> for bzImage or <CODE>setup.s</CODE> for zImage. As with the
bootsector code, the difference is that <B>-D__BIG_KERNEL__</B> is present
for bzImage. The result is then assembled and converted into 'raw binary' form
called <CODE>bsetup</CODE> (or <CODE>setup</CODE> for zImage).
|
|
</LI>
|
|
<LI> Enter directory <CODE>arch/i386/boot/compressed</CODE> and convert
|
|
<CODE>/usr/src/linux/vmlinux</CODE> to $tmppiggy (tmp filename) in raw binary
|
|
format, removing <CODE>.note</CODE> and <CODE>.comment</CODE> ELF sections.
|
|
</LI>
|
|
<LI> <B>gzip -9 < $tmppiggy > $tmppiggy.gz</B>
|
|
</LI>
|
|
<LI> Link $tmppiggy.gz into ELF relocatable (<B>ld -r</B>) <CODE>piggy.o</CODE>.
|
|
</LI>
|
|
<LI> Compile the decompression routines <CODE>head.S</CODE> and <CODE>misc.c</CODE> (still in the
<CODE>arch/i386/boot/compressed</CODE> directory) into ELF objects <CODE>head.o</CODE> and
<CODE>misc.o</CODE>.
|
|
</LI>
|
|
<LI> Link together <CODE>head.o</CODE>, <CODE>misc.o</CODE> and <CODE>piggy.o</CODE> into <CODE>bvmlinux</CODE> (or <CODE>vmlinux</CODE> for
zImage; do not mistake this for <CODE>/usr/src/linux/vmlinux</CODE>!). Note the
difference between <B>-Ttext 0x1000</B> used for <CODE>vmlinux</CODE> and <B>-Ttext 0x100000</B>
for <CODE>bvmlinux</CODE>, i.e. for bzImage the compression loader is loaded high.
|
|
</LI>
|
|
<LI> Convert <CODE>bvmlinux</CODE> to 'raw binary' <CODE>bvmlinux.out</CODE> removing <CODE>.note</CODE> and
|
|
<CODE>.comment</CODE> ELF sections.
|
|
</LI>
|
|
<LI> Go back to <CODE>arch/i386/boot</CODE> directory and, using the program <B>tools/build</B>,
|
|
cat together <CODE>bbootsect</CODE>, <CODE>bsetup</CODE> and <CODE>compressed/bvmlinux.out</CODE> into <CODE>bzImage</CODE>
|
|
(delete extra 'b' above for <CODE>zImage</CODE>). This writes important variables
|
|
like <CODE>setup_sects</CODE> and <CODE>root_dev</CODE> at the end of the bootsector.</LI>
|
|
</OL>
|
|
|
|
The size of the bootsector is always 512 bytes. The size of the setup must
|
|
be greater than 4 sectors but is limited above by about 12K - the rule
|
|
is:
|
|
<P>0x4000 bytes >= 512 + setup_sects * 512 + room for stack while running
|
|
bootsector/setup
|
|
<P>We will see later where this limitation comes from.
|
|
<P>The upper limit on the bzImage size produced at this step is about 2.5M for
|
|
booting with LILO and 0xFFFF paragraphs (0xFFFF0 = 1048560 bytes) for
|
|
booting raw image, e.g. from floppy disk or CD-ROM (El-Torito emulation mode).
|
|
<P>Note that while <B>tools/build</B> does validate the size of boot sector, kernel image
|
|
and lower bound of setup size, it does not check the *upper* bound of said
|
|
setup size. Therefore it is easy to build a broken kernel by just adding some
|
|
large ".space" at the end of <CODE>setup.S</CODE>.
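<P>As a rough illustration of the rule above, the following C sketch computes how much room is left for the stack below offset 0x4000 for a given <CODE>setup_sects</CODE>. The function names and the <CODE>min_stack</CODE> headroom parameter are hypothetical and are not part of <B>tools/build</B>; this is only the kind of upper-bound check that the text says is missing.

```c
#include <assert.h>
#include <stdbool.h>

#define SECTOR_SIZE    512u     /* bytes per sector */
#define INITSEG_LIMIT  0x4000u  /* offset below which bootsector, setup and stack must fit */

/*
 * Bytes left for the stack below INITSEG:0x4000 after the bootsector
 * (one sector) and setup_sects sectors of setup, or -1 if they overrun.
 */
static long stack_room(unsigned setup_sects)
{
    unsigned long used;

    assert(setup_sects <= 0xFF);  /* setup_sects is stored in a single byte */
    used = SECTOR_SIZE + (unsigned long)setup_sects * SECTOR_SIZE;
    if (used > INITSEG_LIMIT)
        return -1;
    return (long)(INITSEG_LIMIT - used);
}

/* min_stack is a hypothetical headroom figure chosen by the caller. */
static bool setup_fits(unsigned setup_sects, unsigned long min_stack)
{
    long room = stack_room(setup_sects);

    return room >= 0 && (unsigned long)room >= min_stack;
}
```

<P>With the default of 4 setup sectors, <CODE>stack_room(4)</CODE> leaves 13824 bytes, while 32 setup sectors would already overrun 0x4000.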
|
|
<P>
|
|
<H2><A NAME="ss1.2">1.2 Booting: Overview</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>The boot process details are architecture-specific, so we shall
|
|
focus our attention on the IBM PC/IA32 architecture.
|
|
Due to old design and backward compatibility, the PC firmware boots the
|
|
operating system in an old-fashioned manner.
|
|
This process can be separated into the following six logical stages:
|
|
<P>
|
|
<OL>
|
|
<LI> BIOS selects the boot device.</LI>
|
|
<LI> BIOS loads the bootsector from the boot device.</LI>
|
|
<LI> Bootsector loads setup, decompression routines and compressed kernel
|
|
image.</LI>
|
|
<LI> The kernel is uncompressed in protected mode.</LI>
|
|
<LI> Low-level initialisation is performed by asm code.</LI>
|
|
<LI> High-level C initialisation.</LI>
|
|
</OL>
|
|
<P>
|
|
<H2><A NAME="ss1.3">1.3 Booting: BIOS POST</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>
|
|
<OL>
|
|
<LI> The power supply starts the clock generator and asserts #POWERGOOD
|
|
signal on the bus.</LI>
|
|
<LI> CPU #RESET line is asserted (CPU now in real 8086 mode).</LI>
|
|
<LI> %ds=%es=%fs=%gs=%ss=0, %cs=0xF000 (with hidden segment base 0xFFFF0000), %eip = 0x0000FFF0 (ROM BIOS POST code).</LI>
|
|
<LI> All POST checks are performed with interrupts disabled.</LI>
|
|
<LI> IVT (Interrupt Vector Table) initialised at address 0.</LI>
|
|
<LI> The BIOS Bootstrap Loader function is invoked via <B>int 0x19</B>,
|
|
with %dl containing the boot device 'drive number'. This loads
|
|
track 0, sector 1 at physical address 0x7C00 (0x07C0:0000).</LI>
|
|
</OL>
|
|
<P>
|
|
<H2><A NAME="ss1.4">1.4 Booting: bootsector and setup</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>The bootsector used to boot the Linux kernel can be one of:
|
|
<P>
|
|
<UL>
|
|
<LI> Linux bootsector (<CODE>arch/i386/boot/bootsect.S</CODE>),</LI>
|
|
<LI> LILO (or other bootloader's) bootsector, or</LI>
|
|
<LI> no bootsector (loadlin etc)</LI>
|
|
</UL>
|
|
<P>We consider here the Linux bootsector in detail.
|
|
The first few lines initialise the convenience macros to be used for segment
|
|
values:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<HR>
|
|
<PRE>
29 SETUPSECS = 4                /* default nr of setup-sectors */
30 BOOTSEG   = 0x07C0           /* original address of boot-sector */
31 INITSEG   = DEF_INITSEG      /* we move boot here - out of the way */
32 SETUPSEG  = DEF_SETUPSEG     /* setup starts here */
33 SYSSEG    = DEF_SYSSEG       /* system loaded at 0x10000 (65536) */
34 SYSSIZE   = DEF_SYSSIZE      /* system size: # of 16-byte clicks */
</PRE>
|
|
<HR>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>(The numbers on the left are the line numbers in <CODE>bootsect.S</CODE>.)
|
|
The values of <CODE>DEF_INITSEG</CODE>, <CODE>DEF_SETUPSEG</CODE>, <CODE>DEF_SYSSEG</CODE> and <CODE>DEF_SYSSIZE</CODE> are taken
|
|
from <CODE>include/asm/boot.h</CODE>:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<HR>
|
|
<PRE>
/* Don't touch these, unless you really know what you're doing. */
#define DEF_INITSEG     0x9000
#define DEF_SYSSEG      0x1000
#define DEF_SETUPSEG    0x9020
#define DEF_SYSSIZE     0x7F00
</PRE>
|
|
<HR>
|
|
</CODE></BLOCKQUOTE>
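<P>To see where these segment values land in physical memory, here is a small C sketch of the real-mode address arithmetic (segment shifted left by 4, plus offset). The helper names <CODE>phys_addr()</CODE> and <CODE>clicks_to_bytes()</CODE> are made up for this illustration; only the constants come from the listings above.

```c
#include <assert.h>
#include <stdint.h>

#define BOOTSEG       0x07C0
#define DEF_INITSEG   0x9000
#define DEF_SYSSEG    0x1000
#define DEF_SETUPSEG  0x9020
#define DEF_SYSSIZE   0x7F00

/* Real-mode translation: physical = segment * 16 + offset. */
static uint32_t phys_addr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* SYSSIZE is counted in 16-byte "clicks" (paragraphs). */
static uint32_t clicks_to_bytes(uint32_t clicks)
{
    assert(clicks <= 0x0FFFFFFFu);  /* keep the multiplication within 32 bits */
    return clicks * 16u;
}
```

<P>For example, <CODE>phys_addr(DEF_SETUPSEG, 0)</CODE> and <CODE>phys_addr(DEF_INITSEG, 0x200)</CODE> both give 0x90200, and a system of <CODE>DEF_SYSSIZE</CODE> clicks loaded at 0x10000 ends at 0x8F000, i.e. below <CODE>DEF_INITSEG</CODE>.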
|
|
<P>Now, let us consider the actual code of <CODE>bootsect.S</CODE>:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<HR>
|
|
<PRE>
54      movw    $BOOTSEG, %ax
55      movw    %ax, %ds
56      movw    $INITSEG, %ax
57      movw    %ax, %es
58      movw    $256, %cx
59      subw    %si, %si
60      subw    %di, %di
61      cld
62      rep
63      movsw
64      ljmp    $INITSEG, $go

65 # bde - changed 0xff00 to 0x4000 to use debugger at 0x6400 up (bde). We
66 # wouldn't have to worry about this if we checked the top of memory. Also
67 # my BIOS can be configured to put the wini drive tables in high memory
68 # instead of in the vector table. The old stack might have clobbered the
69 # drive table.

70 go:  movw    $0x4000-12, %di         # 0x4000 is an arbitrary value >=
71                                      # length of bootsect + length of
72                                      # setup + room for stack;
73                                      # 12 is disk parm size.
74      movw    %ax, %ds                # ax and es already contain INITSEG
75      movw    %ax, %ss
76      movw    %di, %sp                # put stack at INITSEG:0x4000-12.
</PRE>
|
|
<HR>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>Lines 54-63 move the bootsector code from address 0x7C00 to 0x90000.
|
|
This is achieved by:
|
|
<P>
|
|
<OL>
|
|
<LI> set %ds:%si to $BOOTSEG:0 (0x7C0:0 = 0x7C00)
|
|
</LI>
|
|
<LI> set %es:%di to $INITSEG:0 (0x9000:0 = 0x90000)
|
|
</LI>
|
|
<LI> set the number of 16bit words in %cx (256 words = 512 bytes = 1 sector)
|
|
</LI>
|
|
<LI> clear DF (direction) flag in EFLAGS to auto-increment addresses (cld)
|
|
</LI>
|
|
<LI> go ahead and copy 512 bytes (rep movsw)</LI>
|
|
</OL>
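<P>The steps above can be mimicked in user space. The C sketch below emulates <CODE>cld; rep movsw</CODE> over a simulated 1 MiB real-mode memory array; the function names and the memory array are assumptions made for illustration and have nothing to do with how the kernel source is organised.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE (1u << 20)  /* 1 MiB of simulated real-mode memory */

static uint32_t phys(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/*
 * Emulate "cld; rep movsw": copy cx 16-bit words from ds:si to es:di,
 * advancing si and di by 2 after each word (DF clear means increment).
 */
static void rep_movsw(uint8_t *mem, uint16_t ds, uint16_t si,
                      uint16_t es, uint16_t di, uint16_t cx)
{
    while (cx != 0) {
        uint32_t src = phys(ds, si);
        uint32_t dst = phys(es, di);

        assert(src + 1 < MEM_SIZE && dst + 1 < MEM_SIZE);
        mem[dst] = mem[src];
        mem[dst + 1] = mem[src + 1];
        si += 2;
        di += 2;
        cx--;
    }
}

/* Copy 256 words (one sector) from 0x07C0:0 to 0x9000:0; 1 if the copy matches. */
static int relocate_bootsector(uint8_t *mem)
{
    rep_movsw(mem, 0x07C0, 0, 0x9000, 0, 256);
    return memcmp(mem + 0x7C00, mem + 0x90000, 512) == 0;
}
```

<P>After <CODE>relocate_bootsector()</CODE> runs, the 512 bytes at 0x90000 are identical to those at 0x7C00, and nothing beyond 0x901FF has been written.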
|
|
<P>The bootsector code deliberately uses <CODE>rep movsw</CODE> rather than <CODE>rep movsd</CODE> (hint - .code16).
|
|
<P>Line 64 jumps to label <CODE>go:</CODE> in the newly made copy of the
bootsector, i.e. in segment 0x9000. This and the following four
instructions (lines 64-76) prepare the stack at $INITSEG:0x4000-0xC, i.e.
%ss = $INITSEG (0x9000) and %sp = 0x3FF4 (0x4000-0xC). This is where the
limit on setup size that we mentioned earlier comes from (see Building the
Linux Kernel Image).
|
|
<P>Lines 77-103 patch the disk parameter table for the first disk to
|
|
allow multi-sector reads:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<HR>
|
|
<PRE>
77 # Many BIOS's default disk parameter tables will not recognise
78 # multi-sector reads beyond the maximum sector number specified
79 # in the default diskette parameter tables - this may mean 7
80 # sectors in some cases.
81 #
82 # Since single sector reads are slow and out of the question,
83 # we must take care of this by creating new parameter tables
84 # (for the first disk) in RAM. We will set the maximum sector
85 # count to 36 - the most we will encounter on an ED 2.88.
86 #
87 # High doesn't hurt. Low does.
88 #
89 # Segments are as follows: ds = es = ss = cs - INITSEG, fs = 0,
90 # and gs is unused.

91      movw    %cx, %fs                # set fs to 0
92      movw    $0x78, %bx              # fs:bx is parameter table address
93      pushw   %ds
94      ldsw    %fs:(%bx), %si          # ds:si is source
95      movb    $6, %cl                 # copy 12 bytes
96      pushw   %di                     # di = 0x4000-12.
97      rep                             # don't need cld -> done on line 66
98      movsw
99      popw    %di
100     popw    %ds
101     movb    $36, 0x4(%di)           # patch sector count
102     movw    %di, %fs:(%bx)
103     movw    %es, %fs:2(%bx)
</PRE>
|
|
<HR>
|
|
</CODE></BLOCKQUOTE>
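<P>The effect of lines 91-103 can be sketched in C against a simulated memory array: follow the far pointer stored at 0000:0078, copy the 12-byte table to $INITSEG:0x4000-12, patch byte 4 to 36 and point the vector at the copy. All function names here are hypothetical; this is an illustration of the data movement under those assumptions, not a transcription of the assembly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE    (1u << 20)  /* 1 MiB of simulated real-mode memory */
#define DPT_VECTOR  0x78u       /* 0000:0078 holds a far pointer to the table */
#define DPT_SIZE    12u         /* 6 words, as copied with %cl = 6 */

static uint32_t phys(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

static uint16_t read16(const uint8_t *mem, uint32_t addr)
{
    return (uint16_t)(mem[addr] | (mem[addr + 1] << 8));
}

static void write16(uint8_t *mem, uint32_t addr, uint16_t val)
{
    mem[addr] = (uint8_t)(val & 0xFF);
    mem[addr + 1] = (uint8_t)(val >> 8);
}

/* Returns the physical address of the patched copy of the table. */
static uint32_t patch_disk_params(uint8_t *mem, uint16_t initseg, uint16_t di)
{
    /* The far pointer is stored as offset first, then segment. */
    uint32_t src = phys(read16(mem, DPT_VECTOR + 2), read16(mem, DPT_VECTOR));
    uint32_t dst = phys(initseg, di);

    assert(src + DPT_SIZE <= MEM_SIZE && dst + DPT_SIZE <= MEM_SIZE);
    memmove(mem + dst, mem + src, DPT_SIZE);  /* copy the 12-byte table */
    mem[dst + 4] = 36;                        /* patch the sector count */
    write16(mem, DPT_VECTOR, di);             /* new offset */
    write16(mem, DPT_VECTOR + 2, initseg);    /* new segment */
    return dst;
}
```

<P>Called as <CODE>patch_disk_params(mem, 0x9000, 0x4000 - 12)</CODE>, the copy lands at physical address 0x93FF4 and the original table is left untouched.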
|
|
<P>The floppy disk controller is reset using BIOS service int 0x13 function 0
|
|
(reset FDC) and setup sectors are loaded immediately after the
|
|
bootsector, i.e. at physical address 0x90200 ($INITSEG:0x200), again using
|
|
BIOS service int 0x13, function 2 (read sector(s)).
|
|
This happens during lines 107-124:
|
|
<BLOCKQUOTE><CODE>
|
|
<HR>
|
|
<PRE>
107 load_setup:
108     xorb    %ah, %ah                # reset FDC
109     xorb    %dl, %dl
110     int     $0x13
111     xorw    %dx, %dx                # drive 0, head 0
112     movb    $0x02, %cl              # sector 2, track 0
113     movw    $0x0200, %bx            # address = 512, in INITSEG
114     movb    $0x02, %ah              # service 2, "read sector(s)"
115     movb    setup_sects, %al        # (assume all on head 0, track 0)
116     int     $0x13                   # read it
117     jnc     ok_load_setup           # ok - continue

118     pushw   %ax                     # dump error code
119     call    print_nl
120     movw    %sp, %bp
121     call    print_hex
122     popw    %ax
123     jmp     load_setup

124 ok_load_setup:
</PRE>
|
|
<HR>
|
|
</CODE></BLOCKQUOTE>
|
|
|
|
<P>If loading fails for some reason (a bad floppy, or someone pulled the diskette
out during the operation), we dump the error code and retry in an endless
loop.
The only way out of it is to reboot the machine, unless a retry succeeds,
but usually it does not (if something is wrong it will only get worse).
|
|
<P>If loading setup_sects sectors of setup code succeeded we jump to label
|
|
<CODE>ok_load_setup:</CODE>.
|
|
<P>Then we proceed to load the compressed kernel image at physical
address 0x10000. This
is done to preserve the firmware data areas in low memory (0-64K).
After the kernel is loaded, we jump to $SETUPSEG:0 (<CODE>arch/i386/boot/setup.S</CODE>).
Once the firmware data is no longer needed (e.g. no more calls to BIOS) it is
overwritten by moving the entire (compressed) kernel image from 0x10000 to
0x1000 (physical addresses, of course).
This is done by <CODE>setup.S</CODE>, which sets things up for protected mode and jumps
to 0x1000, which is the head of the compressed kernel, i.e.
<CODE>arch/i386/boot/compressed/{head.S,misc.c}</CODE>.
This sets up the stack and calls <CODE>decompress_kernel()</CODE>, which uncompresses the
kernel to address 0x100000 and jumps to it.
|
|
<P>Note that old bootloaders (old versions of LILO) could only load the
|
|
first 4 sectors of setup, which is why there is code in setup to load the rest of
|
|
itself if needed. Also, the code in setup has to take care of various
|
|
combinations of loader type/version vs zImage/bzImage and is therefore
|
|
highly complex.
|
|
<P>Let us examine the kludge in the bootsector code that allows a big
kernel, also known as "bzImage", to be loaded.
The setup sectors are loaded as usual at 0x90200, but the kernel is loaded
one 64K chunk at a time using a special helper routine that calls BIOS to move
data from low to high memory. This helper routine is referred to by
<CODE>bootsect_kludge</CODE> in <CODE>bootsect.S</CODE> and is defined as <CODE>bootsect_helper</CODE> in <CODE>setup.S</CODE>.
The <CODE>bootsect_kludge</CODE> label in <CODE>setup.S</CODE> contains the value of the setup segment
and the offset of the <CODE>bootsect_helper</CODE> code in it, so that the bootsector can use the <CODE>lcall</CODE>
instruction to call it (an inter-segment call).
The reason it is in <CODE>setup.S</CODE> is simply that there is no more space left
in <CODE>bootsect.S</CODE> (which is not strictly true - there are approximately 4 spare bytes,
and at least 1, in <CODE>bootsect.S</CODE>, but that is obviously not enough).
This routine uses BIOS service int 0x15 (ax=0x8700) to move data to high memory
and resets %es to always point to 0x10000. This ensures that the code in <CODE>bootsect.S</CODE>
does not run out of low memory when copying data from disk.
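<P>The chunked loading scheme can be illustrated with a short C sketch: each chunk is first "read" into a 64K low-memory buffer and then moved up to its final place. Here <CODE>bios_block_move()</CODE> is only a stand-in (a plain <CODE>memcpy</CODE>) for the BIOS int 0x15 block move, and <CODE>load_high()</CODE> and its parameters are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 0x10000u  /* 64 KiB low-memory buffer, like the one at 0x10000 */

/* Stand-in for the BIOS block move service; a plain memcpy in this sketch. */
static void bios_block_move(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/*
 * Copy image into high through the low buffer, at most CHUNK bytes
 * at a time. Returns the number of block moves performed.
 */
static unsigned load_high(const uint8_t *image, size_t image_len,
                          uint8_t *low_buf, uint8_t *high)
{
    unsigned moves = 0;
    size_t done = 0;

    assert(image != NULL && low_buf != NULL && high != NULL);
    while (done < image_len) {
        size_t n = image_len - done;

        if (n > CHUNK)
            n = CHUNK;
        memcpy(low_buf, image + done, n);          /* "disk read" into low memory */
        bios_block_move(high + done, low_buf, n);  /* move the chunk up */
        done += n;
        moves++;
    }
    return moves;
}
```

<P>Because the low buffer is reused for every chunk, the amount of low memory needed stays at 64K regardless of the size of the image.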
|
|
<P>
|
|
<H2><A NAME="ss1.5">1.5 Using LILO as a bootloader </A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>There are several advantages in using a specialised bootloader (LILO) over
|
|
a bare bones Linux bootsector:
|
|
<OL>
|
|
<LI> Ability to choose between multiple Linux kernels or even multiple OSes.</LI>
|
|
<LI> Ability to pass kernel command line parameters (there is a patch
|
|
called BCP that adds this ability to bare-bones bootsector+setup).</LI>
|
|
<LI> Ability to load much larger bzImage kernels - up to 2.5M vs 1M.</LI>
|
|
</OL>
|
|
|
|
Old versions of LILO (v17 and earlier) could not load bzImage kernels. The
|
|
newer versions (as of a couple of years ago or earlier) use the same
|
|
technique as bootsect+setup of moving data from low into high memory by
|
|
means of BIOS services. Some people (Peter Anvin notably) argue that zImage
|
|
support should be removed. The main reason (according to Alan Cox) it stays
|
|
is that there are apparently some broken BIOSes that make it impossible to
|
|
boot bzImage kernels while loading zImage ones fine.
|
|
<P>The last thing LILO does is to jump to <CODE>setup.S</CODE> and things proceed as normal.
|
|
<P>
|
|
<H2><A NAME="ss1.6">1.6 High level initialisation </A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>By "high-level initialisation" we mean anything which is not directly
|
|
related to bootstrap, even though parts of the code to perform this are
|
|
written in asm, namely <CODE>arch/i386/kernel/head.S</CODE> which is the head of the
|
|
uncompressed kernel. The following steps are performed:
|
|
<P>
|
|
<OL>
|
|
<LI> Initialise segment values (%ds = %es = %fs = %gs = __KERNEL_DS = 0x18).</LI>
|
|
<LI> Initialise page tables.</LI>
|
|
<LI> Enable paging by setting PG bit in %cr0.</LI>
|
|
<LI> Zero-clean BSS (on SMP, only first CPU does this).</LI>
|
|
<LI> Copy the first 2k of bootup parameters (kernel commandline).</LI>
|
|
<LI> Check the CPU type using EFLAGS and, if possible, cpuid; this is able to detect
386 and higher.</LI>
<LI> The first CPU calls <CODE>start_kernel()</CODE>; all others call
<CODE>arch/i386/kernel/smpboot.c:initialize_secondary()</CODE> if ready=1,
which just reloads esp/eip and doesn't return.</LI>
|
|
</OL>
|
|
<P>The <CODE>init/main.c:start_kernel()</CODE> is written in C and does the following:
|
|
<P>
|
|
<OL>
|
|
<LI> Take a global kernel lock (it is needed so that only one CPU
|
|
goes through initialisation).</LI>
|
|
<LI> Perform arch-specific setup (memory layout analysis, copying
|
|
boot command line again, etc.).</LI>
|
|
<LI> Print Linux kernel "banner" containing the version, compiler used to
|
|
build it etc. to the kernel ring buffer for messages. This is taken
|
|
from the variable linux_banner defined in init/version.c and is the
|
|
same string as displayed by <B>cat /proc/version</B>.</LI>
|
|
<LI> Initialise traps.</LI>
|
|
<LI> Initialise irqs.</LI>
|
|
<LI> Initialise data required for scheduler.</LI>
|
|
<LI> Initialise time keeping data.</LI>
|
|
<LI> Initialise softirq subsystem.</LI>
|
|
<LI> Parse boot commandline options.</LI>
|
|
<LI> Initialise console.</LI>
|
|
<LI> If module support was compiled into the kernel, initialise the dynamic
|
|
module loading facility.</LI>
|
|
<LI> If "profile=" command line was supplied, initialise profiling buffers.</LI>
|
|
<LI> <CODE>kmem_cache_init()</CODE>, initialise most of slab allocator.</LI>
|
|
<LI> Enable interrupts.</LI>
|
|
<LI> Calculate BogoMips value for this CPU.</LI>
|
|
<LI> Call <CODE>mem_init()</CODE> which calculates <CODE>max_mapnr</CODE>, <CODE>totalram_pages</CODE> and
|
|
<CODE>high_memory</CODE> and prints out the "Memory: ..." line.</LI>
|
|
<LI> <CODE>kmem_cache_sizes_init()</CODE>, finish slab allocator initialisation.</LI>
|
|
<LI> Initialise data structures used by procfs.</LI>
|
|
<LI> <CODE>fork_init()</CODE>, create <CODE>uid_cache</CODE>, initialise <CODE>max_threads</CODE> based on
|
|
the amount of memory available and configure <CODE>RLIMIT_NPROC</CODE> for
|
|
<CODE>init_task</CODE> to be <CODE>max_threads/2</CODE>.</LI>
|
|
<LI> Create various slab caches needed for VFS, VM, buffer cache, etc.</LI>
|
|
<LI> If System V IPC support is compiled in, initialise the IPC subsystem.
|
|
Note that for System V shm, this includes mounting an internal
|
|
(in-kernel) instance of shmfs filesystem.</LI>
|
|
<LI> If quota support is compiled into the kernel, create and initialise
|
|
a special slab cache for it.</LI>
|
|
<LI> Perform arch-specific "check for bugs" and, whenever possible,
activate workarounds for processor/bus/etc bugs. Comparing various
architectures reveals that "ia64 has no bugs" and "ia32 has quite a
few bugs"; a good example is the "f00f bug", which is only checked for if the kernel
is compiled for less than 686 and worked around accordingly.</LI>
|
|
<LI> Set a flag to indicate that a schedule should be invoked at "next
|
|
opportunity" and create a kernel thread <CODE>init()</CODE> which execs
|
|
execute_command if supplied via "init=" boot parameter, or tries to
|
|
exec <B>/sbin/init</B>, <B>/etc/init</B>, <B>/bin/init</B>, <B>/bin/sh</B> in this order; if
|
|
all these fail, panic with "suggestion" to use "init=" parameter.</LI>
|
|
<LI> Go into the idle loop; this is an idle thread with pid=0.</LI>
|
|
</OL>
|
|
<P>An important thing to note here is that the <CODE>init()</CODE> kernel thread calls
|
|
<CODE>do_basic_setup()</CODE> which in turn calls <CODE>do_initcalls()</CODE> which goes through the
|
|
list of functions registered by means of <CODE>__initcall</CODE> or <CODE>module_init()</CODE> macros
|
|
and invokes them. These functions either do not depend on each other
|
|
or their dependencies have been manually fixed by the link order in the
|
|
Makefiles. This means that, depending on
|
|
the position of directories in the trees and the structure of the Makefiles,
|
|
the order in which initialisation functions are invoked can change. Sometimes, this
|
|
is important because you can imagine two subsystems A and B with B depending
|
|
on some initialisation done by A. If A is compiled statically and B is a
|
|
module then B's entry point is guaranteed to be invoked after A prepared
|
|
all the necessary environment. If A is a module, then B is also necessarily
|
|
a module so there are no problems. But what if both A and B are statically
|
|
linked into the kernel? The order in which they are invoked depends on the relative
|
|
entry point offsets in the <CODE>.initcall.init</CODE> ELF section of the kernel image.
|
|
Rogier Wolff proposed to introduce a hierarchical "priority" infrastructure
|
|
whereby modules could let the linker know in what (relative) order they
|
|
should be linked, but so far there are no patches available that implement
|
|
this in a sufficiently elegant manner to be acceptable into the kernel.
|
|
Therefore, make sure your link order is correct. If, in the example above,
|
|
A and B work fine when compiled statically once, they will always work,
|
|
provided they are listed sequentially in the same Makefile. If they don't
|
|
work, change the order in which their object files are listed.
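<P>The mechanism can be tried out in user space. The sketch below assumes GCC or Clang with an ELF linker that provides <CODE>__start_</CODE>/<CODE>__stop_</CODE> symbols for sections whose names are valid C identifiers; the section name <CODE>initcalls</CODE>, the <CODE>demo_initcall()</CODE> macro and the two init functions are invented for this example and are not the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*initcall_t)(void);

/* User-space analogue of __initcall(): store a pointer in section "initcalls". */
#define demo_initcall(fn) \
    static initcall_t demo_initcall_##fn \
        __attribute__((used, section("initcalls"))) = fn

/* Provided by the linker: start and end of the "initcalls" section. */
extern initcall_t __start_initcalls[], __stop_initcalls[];

static int order_log[8];  /* records the order in which the calls ran */
static int order_len;

static int init_a(void) { order_log[order_len++] = 'A'; return 0; }
static int init_b(void) { order_log[order_len++] = 'B'; return 0; }

demo_initcall(init_a);
demo_initcall(init_b);

/* Walk the section and invoke every registered function; returns the count. */
static int do_demo_initcalls(void)
{
    int n = 0;

    for (initcall_t *call = __start_initcalls; call < __stop_initcalls; call++) {
        assert(*call != NULL);
        (*call)();
        n++;
    }
    return n;
}
```

<P>Nothing in the C source fixes whether <CODE>init_a</CODE> or <CODE>init_b</CODE> runs first: the order is simply the order in which the toolchain laid the pointers out in the section, which is the dependency on link order discussed above.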
|
|
<P>Another thing worth noting is Linux's ability to execute an "alternative
|
|
init program" by means of passing "init=" boot commandline. This is useful
|
|
for recovering from accidentally overwritten <B>/sbin/init</B> or debugging the
|
|
initialisation (rc) scripts and <CODE>/etc/inittab</CODE> by hand, executing them
|
|
one at a time.
|
|
<P>
|
|
<H2><A NAME="ss1.7">1.7 SMP Bootup on x86</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>On SMP, the BP (bootstrap processor) goes through the normal sequence of bootsector, setup etc
until it reaches <CODE>start_kernel()</CODE>, and then on to <CODE>smp_init()</CODE> and
especially <CODE>arch/i386/kernel/smpboot.c:smp_boot_cpus()</CODE>. <CODE>smp_boot_cpus()</CODE>
loops over each apicid (up to <CODE>NR_CPUS</CODE>) and calls <CODE>do_boot_cpu()</CODE> on
it. What <CODE>do_boot_cpu()</CODE> does is create (i.e. <CODE>fork_by_hand</CODE>) an idle task for
the target cpu and write the EIP of the trampoline code found in <CODE>trampoline.S</CODE>
to well-known locations defined by the Intel MP spec (0x467/0x469). Then
it generates a STARTUP IPI to the target cpu, which makes this AP (application processor) execute the
code in <CODE>trampoline.S</CODE>.
|
|
<P>The boot CPU creates a copy of trampoline code for each CPU in
|
|
low memory. The AP code writes a magic number in its own code which is
|
|
verified by the BP to make sure that AP is executing the trampoline code.
|
|
The requirement that trampoline code must be in low memory is enforced by
|
|
the Intel MP specification.
|
|
<P>The trampoline code simply sets %bx register to 1, enters protected mode
|
|
and jumps to startup_32 which is the main entry to <CODE>arch/i386/kernel/head.S</CODE>.
|
|
<P>Now the AP starts executing <CODE>head.S</CODE> and, discovering that it is not the BP,
skips the code that clears BSS and then enters <CODE>initialize_secondary()</CODE>,
which just enters the idle task for this CPU - recall that <CODE>init_tasks[cpu]</CODE>
was already initialised by the BP executing <CODE>do_boot_cpu(cpu)</CODE>.
|
|
<P>Note that init_task can be shared but each idle thread must have its own
|
|
TSS. This is why <CODE>init_tss[NR_CPUS]</CODE> is an array.
|
|
<P>
|
|
<H2><A NAME="ss1.8">1.8 Freeing initialisation data and code</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>When the operating system initialises itself, most of the code and data
|
|
structures are never needed again.
|
|
Most operating systems (BSD, FreeBSD etc.) cannot dispose of this unneeded
|
|
information, thus wasting precious physical kernel memory.
|
|
The excuse they use (see McKusick's 4.4BSD book) is that "the relevant code
|
|
is spread around various subsystems and so it is not feasible to free it".
|
|
Linux, of course, cannot use such excuses because under Linux "if something
|
|
is possible in principle, then it is already implemented or somebody is
|
|
working on it".
|
|
<P>So, as I said earlier, the Linux kernel can only be compiled as an ELF binary, and
|
|
now we find out the reason (or one of the reasons) for that. The reason
|
|
related to throwing away initialisation code/data is that Linux provides two
|
|
macros to be used:
|
|
<P>
|
|
<UL>
|
|
<LI> <CODE>__init</CODE> - for initialisation code</LI>
|
|
<LI> <CODE>__initdata</CODE> - for data</LI>
|
|
</UL>
|
|
<P>These evaluate to gcc attribute specifiers (also known as "gcc magic")
|
|
as defined in <CODE>include/linux/init.h</CODE>:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<HR>
|
|
<PRE>
#ifndef MODULE
#define __init        __attribute__ ((__section__ (".text.init")))
#define __initdata    __attribute__ ((__section__ (".data.init")))
#else
#define __init
#define __initdata
#endif
</PRE>
|
|
<HR>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>What this means is that if the code is compiled statically into the kernel
|
|
(i.e. MODULE is not defined) then it is placed in the special ELF section
|
|
<CODE>.text.init</CODE>, which is declared in the linker map in <CODE>arch/i386/vmlinux.lds</CODE>.
|
|
Otherwise (i.e. if it is a module) the macros evaluate to nothing.
|
|
<P>What happens during boot is that the "init" kernel thread (function
|
|
<CODE>init/main.c:init()</CODE>) calls the arch-specific function <CODE>free_initmem()</CODE> which
|
|
frees all the pages between addresses <CODE>__init_begin</CODE> and <CODE>__init_end</CODE>.
|
|
<P>On a typical system (my workstation), this results in freeing about 260K of
|
|
memory.
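<P>The arithmetic behind that figure is simple. The following C sketch counts the pages in a page-aligned range; the helper names and the sample addresses used below are hypothetical, and a 4K page size is assumed as on i386.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096ul  /* assumed page size (i386) */

/* Number of whole pages in [begin, end); both ends are assumed page-aligned. */
static unsigned long init_pages(unsigned long begin, unsigned long end)
{
    assert(begin <= end);
    assert(begin % DEMO_PAGE_SIZE == 0 && end % DEMO_PAGE_SIZE == 0);
    return (end - begin) / DEMO_PAGE_SIZE;
}

/* The same range expressed in kilobytes. */
static unsigned long init_kbytes(unsigned long begin, unsigned long end)
{
    return init_pages(begin, end) * (DEMO_PAGE_SIZE / 1024ul);
}
```

<P>With made-up values of 0xC0250000 and 0xC0291000 for the start and end of the init area, this gives 65 pages, i.e. 260K.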
|
|
<P>The functions registered via <CODE>module_init()</CODE> are placed in <CODE>.initcall.init</CODE>
|
|
which is also freed in the static case. The current trend in Linux, when
|
|
designing a subsystem (not necessarily a module), is to provide
|
|
init/exit entry points from the early stages of design so that in the
|
|
future, the subsystem in question can be modularised if needed. An example of
this is pipefs, see <CODE>fs/pipe.c</CODE>. Even if a given subsystem will never become a
module, e.g. bdflush (see <CODE>fs/buffer.c</CODE>), it is still nice and tidy to use
the <CODE>module_init()</CODE> macro against its initialisation function, provided it does
not matter exactly when the function is called.
|
|
<P>There are two more macros which work in a similar manner, called <CODE>__exit</CODE> and
|
|
<CODE>__exitdata</CODE>, but they are more directly connected to the module support and
|
|
therefore will be explained in a later section.
|
|
<P>
|
|
<H2><A NAME="ss1.9">1.9 Processing kernel command line</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Let us recall what happens to the commandline passed to kernel during boot:
|
|
<P>
|
|
<OL>
|
|
<LI> LILO (or BCP) accepts the commandline using BIOS keyboard services
|
|
and stores it at a well-known location in physical memory, as well
|
|
as a signature saying that there is a valid commandline there.
|
|
</LI>
|
|
<LI> <CODE>arch/i386/kernel/head.S</CODE> copies the first 2k of it out to the zeropage.
|
|
</LI>
|
|
<LI> <CODE>arch/i386/kernel/setup.c:parse_mem_cmdline()</CODE> (called by
|
|
<CODE>setup_arch()</CODE>, itself called by <CODE>start_kernel()</CODE>) copies 256 bytes from zeropage
|
|
into <CODE>saved_command_line</CODE> which is displayed by <CODE>/proc/cmdline</CODE>. This
|
|
same routine processes the "mem=" option if present and makes appropriate
|
|
adjustments to VM parameters.
|
|
</LI>
|
|
<LI> We return to commandline in <CODE>parse_options()</CODE> (called by <CODE>start_kernel()</CODE>)
|
|
which processes some "in-kernel" parameters (currently "init=" and
|
|
environment/arguments for init) and passes each word to <CODE>checksetup()</CODE>.
|
|
</LI>
|
|
<LI> <CODE>checksetup()</CODE> goes through the code in ELF section <CODE>.setup.init</CODE> and
|
|
invokes each function, passing it the word if it matches. Note that
|
|
using the return value of 0 from the function registered via <CODE>__setup()</CODE>,
|
|
it is possible to pass the same "variable=value" to more than one
|
|
function with "value" invalid to one and valid to another.
|
|
Jeff Garzik commented: "hackers who do that get spanked :)"
|
|
Why? Because this is clearly ld-order specific, i.e. kernel linked
|
|
in one order will have functionA invoked before functionB and another
|
|
will have it in reversed order, with the result depending on the order.
|
|
</LI>
|
|
</OL>
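<P>The matching step can be sketched in user-space C. The table below is an ordinary array rather than an ELF section, and the option names <CODE>demo_mem=</CODE> and <CODE>demo_debug</CODE> and their handlers are invented for illustration; the sketch only shows the prefix match and the meaning of the handler's return value, under those assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct kernel_param {
    const char *str;
    int (*setup_func)(char *);
};

static int demo_mem_kb;  /* hypothetical state set by the handlers */
static int demo_debug;

static int mem_setup(char *arg)   { demo_mem_kb = atoi(arg); return 1; }
static int debug_setup(char *arg) { (void)arg; demo_debug = 1; return 1; }

/* Stand-in for the table the linker would collect in .setup.init. */
static struct kernel_param setup_table[] = {
    { "demo_mem=",  mem_setup },
    { "demo_debug", debug_setup },
};

/*
 * Try each entry whose string is a prefix of word, passing the handler
 * the text after the prefix. A non-zero return means the word was
 * consumed; a return of 0 lets later entries see the same word.
 */
static int checksetup(char *word)
{
    size_t i;

    assert(word != NULL);
    for (i = 0; i < sizeof(setup_table) / sizeof(setup_table[0]); i++) {
        size_t n = strlen(setup_table[i].str);

        if (strncmp(word, setup_table[i].str, n) == 0 &&
            setup_table[i].setup_func(word + n))
            return 1;
    }
    return 0;
}
```

<P>Since entries are tried in table order and a handler returning 0 passes the word on, two handlers registered for the same prefix would see it in whatever order the table happens to have.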
|
|
<P>So, how do we write code that processes boot commandline? We use the <CODE>__setup()</CODE>
|
|
macro defined in <CODE>include/linux/init.h</CODE>:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<HR>
|
|
<PRE>

/*
 * Used for kernel command line parameter setup
 */
struct kernel_param {
        const char *str;
        int (*setup_func)(char *);
};

extern struct kernel_param __setup_start, __setup_end;

#ifndef MODULE
#define __setup(str, fn) \
   static char __setup_str_##fn[] __initdata = str; \
   static struct kernel_param __setup_##fn __initsetup = \
   { __setup_str_##fn, fn }

#else
#define __setup(str,func) /* nothing */
#endif
</PRE>
|
|
<HR>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>So, you would typically use it in your code like this
(taken from a real driver, the BusLogic HBA driver <CODE>drivers/scsi/BusLogic.c</CODE>):
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<HR>
|
|
<PRE>
static int __init
BusLogic_Setup(char *str)
{
        int ints[3];

        (void)get_options(str, ARRAY_SIZE(ints), ints);

        if (ints[0] != 0) {
                BusLogic_Error("BusLogic: Obsolete Command Line Entry "
                                "Format Ignored\n", NULL);
                return 0;
        }
        if (str == NULL || *str == '\0')
                return 0;
        return BusLogic_ParseDriverOptions(str);
}

__setup("BusLogic=", BusLogic_Setup);
</PRE>
|
|
<HR>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>Note that <CODE>__setup()</CODE> does nothing for modules, so the code that wishes to
|
|
process boot commandline and can be either a module or statically linked
|
|
must invoke its parsing function manually in the module initialisation
|
|
routine. This also means that it is possible to write code that
|
|
processes parameters when compiled as a module but not when it is static or
|
|
vice versa.
|
|
<P>
|
|
<HR>
|
|
<A HREF="lki-2.html">Next</A>
|
|
Previous
|
|
<A HREF="lki.html#toc1">Contents</A>
|
|
</BODY>
|
|
</HTML>
|