<!--startcut ==========================================================-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<title>Dispelling the Myth LG #37</title>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="GENERATOR" CONTENT="Mozilla/4.07 [en] (X11; I; Linux 2.0.34 i586) [Netscape]">
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#A000A0"
ALINK="#FF0000">
<!--endcut ============================================================-->

<H4>
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>

<P> <HR> <P>
<!--===================================================================-->

<center>
<H1><font color="maroon">Dispelling the Kernel Compiling Myth</font></H1>
<H4>By <a href="mailto:jfm2@club-internet.fr">Jean Francois Martinez</a></H4>
</center>
<P> <HR> <P>

<P><BR><I>"Thou hast to recompile thy kernel".</I>
<P>This ancient curse has been cast on every Linux newcomer since the
birth of Linux. Unfortunately, as long as kernel recompiling is deemed a
necessary part of a Linux installation, it will be impossible to spread
Linux among non-nerds. In this article we will analyze in detail
the performance gains one can expect from recompiling the kernel.
<H2>
Memory savings</H2>
"Thanks to kernel recompiling you can free your installation kernel of
much unneeded bloat. You should also compile permanently used modules into
the kernel for additional savings. A leaner kernel will make your computer
faster by reducing paging".
<BR>Let's quantify this.
<P>We will begin with modules. Compiling a module into the
kernel saves a little more than 2K per module: 2K due to page alignment
plus a small amount of code for loading and unloading the module. Now,
despite being a module fanatic, I have never managed to be in a situation with
more than ten modules loaded, but let's imagine you have 20 modules loaded
and all of them are needed permanently, so you compile them into the kernel.
You would save 40K of memory, that is 0.5% of the memory of an 8 Meg computer.
<P>Now we will look at the benefits of a lean kernel. When Matt Welsh wrote
his books, kernel recompiling was undoubtedly necessary. It was not uncommon
to save more than 1.5 Megs of memory, and your average computer had
8 Megs of RAM. Thus recompiling would increase the memory available from 5.5
to 7 Megs, that is a 27% increase.
<BR>But people failed to notice that Linux has gone modular and computers
have got more memory. Today most distributions ship modular kernels, so recompiling
brings much smaller benefits than in 1995. As an example I tested recompiling
the kernel shipped in RedHat 5.2 with everything unneeded thrown out and
everything else modularized where possible. The boot messages (that
is, before any module is loaded) showed I had saved a mere 400K. In addition,
today even low-end computers have 32 Megs of RAM, which means that recompiling
your kernel will increase your available memory by only 1.25%.
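<P>The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the helper name percent_gain is ours, and 1 Meg is taken as 1000K, an assumption that reproduces the round figures quoted in the text (using 1024K changes them only slightly).

```python
# Sizes are in kilobytes. 1 Meg = 1000K is an assumption of this sketch,
# chosen because it reproduces the round figures quoted in the article.
MEG = 1000

def percent_gain(saved_kb, base_kb):
    """Return the saving as a percentage of the base amount of memory."""
    return 100.0 * saved_kb / base_kb

# 20 permanently loaded modules, about 2K saved per module, 8 Meg machine.
modules_saved = 20 * 2
print(f"modules: {modules_saved}K is {percent_gain(modules_saved, 8 * MEG):.2f}% of 8 Megs")

# 1995: 1.5 Megs saved when only 5.5 Megs were available to start with.
print(f"1995 lean kernel: {percent_gain(1.5 * MEG, 5.5 * MEG):.1f}% more memory available")

# 1999: 400K saved on a 32 Meg machine.
print(f"1999 lean kernel: {percent_gain(400, 32 * MEG):.2f}% of 32 Megs")
```

<P>Note that the 1995 figure is relative to the 5.5 Megs that were actually available, while the other two are relative to total RAM, as in the text.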
<P>It is possible to write a specially designed program that will not cause
a single page fault with N Megs of memory and will thrash horribly if you reduce
it by a single page. However, in normal situations a 1.25% increase in available
memory will make little difference. There ARE still a couple of distributions
that ship kernels good for little besides installation: huge kernels
lacking essential features, so recompiling is not a performance issue but
a requirement. Now consider what happens if a small company without a full-time
guru needs a firewall. Its resident expert can do little beyond starting
Word. If he stumbles upon a distribution with one of those broken kernels
he will fail and will end up recommending NT.
<P>Most modern distribs (Caldera, Suse, RedHat and their clones) ship fully-featured
kernels, and recompiling them will produce no appreciable
speed increase from memory savings: they are good enough out of the box.
Only a couple of "hackeristic" distribs will force you to recompile the
kernel. But for the good of Linux you should ask the maintainers to fix
them instead of making up for their deficiencies yourself. YOU can recompile but
your neighbour cannot, and he will choose NT.
<H2>
Evaluating CPU speedups due to recompiling</H2>
"Recompiling will allow you to build a faster kernel because you will be
able to compile for the right CPU".
<P>Again let's quantify this. Linux performs a number of optimizations
for CPU type, but most of them are performed at execution time and don't
depend on compilation options. We will first quantify the effect
of the alternative portions of code selected at compile time, and then take
a look at the effect of compilation options on the code generated by
GCC.
<H3>
Effect of the ifdefs</H3>
If you take a look at the source code of the 2.0 kernel you will notice
only two portions of code whose inclusion depends on CPU type. The first
one is related to selective invalidation of TLB entries and the second
one to the way bytes are swapped. In both cases the choice
is 386 versus everything else. There used to be a third portion of code that depended
on CPU type: the way blocks of memory are copied. The fastest way for
386s, PPros and Pentium IIs is slightly sub-optimal on 486s and much slower
on plain Pentiums. However, this optimization has been disabled and now,
whatever CPU you have, blocks of memory are copied the 386-PPro-PII way.
<H4>
Effect of byte swapping</H4>
Byte swapping takes place in two cases: header info when exchanging packets
over a network with a machine of different endianness, and addressing info for
SCSI peripherals. In both cases the content (e.g. what you write to a SCSI
disk) is not changed. The only effect is on headers/control info, and that
is only a minimal part of the CPU time spent on networking/SCSI activity,
so it has no noticeable effect on performance.
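<P>To illustrate why so little data is involved, here is a minimal sketch in Python. The two-byte "port" header and the function name build_packet are hypothetical, invented for this illustration; real protocol headers are larger, but the principle is the same: only the header field is put into big-endian (network) order, while the payload is copied as it is.

```python
def build_packet(port, payload):
    """Prefix payload with a 2-byte port field in network (big-endian) order."""
    header = port.to_bytes(2, "big")  # the only bytes that get reordered on x86
    return header + payload           # the payload is copied unchanged

packet = build_packet(80, b"hello")
print("header on the wire:", packet[:2].hex())
print("same value in x86 memory order:", (80).to_bytes(2, "little").hex())
print("payload:", packet[2:])
```

<P>However large the payload, the swapping work stays at two bytes per packet in this toy format.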
<H4>
Effects of selective invalidation of TLB</H4>
We will explain some basics about VM and address translation. When given
an address the CPU will first look into a page directory, and then into
a page table, in order to translate the virtual address into a real address
before being able to access the data. That means a threefold slowdown because
there are three accesses to memory instead of one. In fact it could be
much more than that if the page table entries are in slow regular
RAM while the real data is in the much faster cache. To avoid this the
CPU keeps a list of the most recently accessed pages and their translations in
an internal ultra-fast memory called the TLB (translation lookaside buffer).
Now suppose the kernel wants to unmap a page belonging to a process. It
will modify the page tables, but the problem is that they are no longer in sync
with the TLB, so if the CPU finds the address in the TLB it will not look at
the page tables and will use the wrong data. Therefore the kernel needs
to tell the CPU to stop using the TLB entry, but 386s don't support selective
invalidation of TLB entries, so the kernel invalidates the whole TLB. Now
the kernel you get with your distribution has to be able to work with 386s
as well as newer processors, so it is compiled to use total TLB invalidation,
and that means that if you are using a newer processor you lose the benefits
of selective invalidation.
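<P>The two-level lookup and the two kinds of invalidation can be sketched in Python. This is a toy model, not kernel code: it assumes the usual i386 layout with 4K pages (10 bits of directory index, 10 bits of table index, 12 bits of offset), and the names split_address and ToyTLB are ours.

```python
PAGE_SHIFT = 12        # 4K pages: the low 12 bits are the offset in the page
DIR_SHIFT = 22         # the top 10 bits select a page directory entry (4 Megs each)
TABLE_MASK = 0x3FF     # 10 bits of page table index
OFFSET_MASK = 0xFFF    # 12 bits of offset

def split_address(vaddr):
    """Split a 32-bit virtual address into (directory index, table index, offset)."""
    directory_index = vaddr >> DIR_SHIFT
    table_index = (vaddr >> PAGE_SHIFT) & TABLE_MASK
    offset = vaddr & OFFSET_MASK
    return directory_index, table_index, offset

class ToyTLB:
    """Toy cache of page number -> frame number translations."""

    def __init__(self):
        self.entries = {}

    def fill(self, page, frame):
        self.entries[page] = frame

    def invalidate_page(self, page):
        """Selective invalidation: drop one translation (486 and later)."""
        self.entries.pop(page, None)

    def flush_all(self):
        """Total invalidation: the only choice on a 386."""
        self.entries.clear()

tlb = ToyTLB()
for page in range(4):
    tlb.fill(page, 100 + page)
tlb.invalidate_page(2)
print("after selective invalidation:", len(tlb.entries), "entries kept")
tlb.flush_all()
print("after total invalidation:", len(tlb.entries), "entries kept")
print("0x00403004 splits into", split_address(0x00403004))
```

<P>After a total flush every later access has to walk the page directory and page table again, which is the cost evaluated in the rest of this section.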
<P>Let's look now at the circumstances where selective TLB invalidation
has a significant effect and let's quantify the slowdown.
<BR>First of all, if the kernel unmaps a page and then hands control to
another process it will reload CR3 and that will cause a total TLB invalidation
(different processes have entirely different mappings), so you get any benefit
only if control is handed back to the same process, either immediately
or after some time in kernel mode. Also consider that the time wasted by an
entire TLB invalidation is some microseconds while disk IO takes 10 milliseconds
in the best case, that is about a thousand times more. That means that if
disk IO follows this unmapping (due to a swap out) the benefits would be
insignificant.
<P>In fact about the only case where selective TLB invalidation would be meaningful
is the following scenario: a process frees memory so the kernel invalidates
the TLB, it hands control back to the same process and then the process scans
a large array doing only a single access for every entry. Then, just when
the TLB is fully reloaded, the process unmaps memory again, causing a new TLB invalidation,
the kernel gives back control again and the process scans the same array
entries. This is highly theoretical, and don't forget that during the second pass
the page table entries will be in cache, so address translation will be much faster
and this will reduce the benefits of selective TLB invalidation.
<P>Let's evaluate what happens in a normal process. We will arbitrarily
assume this process runs for one tick (10 ms) after the unmapping.
For everything else we will take the worst case. The slower the memory,
the more costly the translation, so we will assume this computer uses 60
ns DRAM instead of SDRAM. The larger the TLB, the bigger the benefits of
selective invalidation, so we will choose a CPU with a big TLB. In our case
it will be an AMD K6 model 7: it has a 64 entry TLB for code pages and
a 128 entry TLB for data pages. We will also assume that we never find
either page table entries or page directory entries in cache (the latter is
very unrealistic because a single directory entry covers 4 Megs
of address space), so every translation will need 2x60=120 ns and the complete
refilling of the TLB needs 120 ns * 192 TLB entries = 23 microseconds.
Because we assumed the process would be running for a whole tick, that means
the slowdown due to address translation is only 0.2 per cent.
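<P>The same worst-case estimate, written out as a small Python sketch. The constant names are ours; the values are the ones assumed in the text, with the memory access time taken as 60 ns, consistent with the 2x60=120 ns figure.

```python
MEMORY_ACCESS_NS = 60            # one access to DRAM
ACCESSES_PER_TRANSLATION = 2     # page directory entry, then page table entry
TLB_ENTRIES = 64 + 128           # code TLB plus data TLB of the K6
TICK_NS = 10_000_000             # one 10 ms tick, in nanoseconds

def refill_cost_ns(entries=TLB_ENTRIES):
    """Worst-case time to reload every TLB entry with no help from the cache."""
    return entries * ACCESSES_PER_TRANSLATION * MEMORY_ACCESS_NS

cost = refill_cost_ns()
print(f"full TLB refill: {cost / 1000:.1f} microseconds")
print(f"share of one tick: {100 * cost / TICK_NS:.2f}%")
```

<P>Every assumption here is pessimistic, so the real cost on such a machine should be lower still.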
<H3>
Effects of tuning GCC options</H3>
Precise measurement of kernel timing is quite difficult; in addition the
kernel is a mix of C and assembler. What we will do is recompile
the Byte benchmark using GCC 2.7.2.3 with the same flags used in 2.0 kernels,
both for 386s (the setting used for the kernels shipped with distributions) and for
Pentiums and above (486 is an intermediate case). These benchmarks
will give us a good idea, with perhaps a bias towards overestimation because
the Byte benchmarks are pure C so the compiler gains will be felt in full,
while the kernel is a mix of C and assembler, the latter being unaffected
by compiler optimizations.
<BR>The benchmarks were run on two computers: a Pentium 75 and an AMD K6-300.
The Pentium-tuned test was indeed faster than the 386-tuned test ...
by a mere 1.8% on the P75, and about the same on the AMD. The conclusion to
be drawn is that GCC 2.7 for the x86 family has few model-dependent
optimizations, and that its alignment optimizations are not particularly effective.
Those paltry TWO percent (rounded UP) are all you get when you listen to
the words of wisdom dispensed in magazines.
<P>If you are an expert and have a spare machine for experimenting then
you could try recompiling with more aggressive optimizations than the
standard -O2, or with a better compiler than gcc 2.7 such as egcs or pgcc.
However be warned that all 2.0 kernels up to 2.0.35, and possibly 2.0.36,
have some bugs that will break the kernel with any compiler other than gcc
2.7 (they work only thanks to gcc 2.7 bugs). Also be wary of some optimizations
like loop unrolling which, according to the egcs and pgcc docs, were never thoroughly
tested, be it in gcc, egcs or pgcc, and remember that egcs and pgcc are not as well tested
as gcc (egcs 1.0 was notorious for its FP bugs). With these warnings in mind, there
is a 7% speed difference between the Byte benchmarks compiled with -O6
and loop unrolling and those compiled with plain -O2. So playing with compilers and compiler
flags is an interesting possibility if you are an expert: it could help
the kernel developers determine which of the more aggressive optimizations
don't break the kernel. If you are not an expert then don't lose sleep
over this. The problem is that only a small part of the time spent
by your program will be spent executing those parts of kernel code affected
by compiler optimizations:
<OL>
<LI>
If your program spends 90% of its CPU time in user mode then kernel optimizations
will hardly be felt.</LI>

<LI>
Compiler optimizations have no effect whenever the kernel runs parts
written in assembler.</LI>

<LI>
Many kernel-intensive processes are in fact IO-bound: the CPU waits for
the peripheral. That means that if there is only one active process the
kernel will finish its job earlier and then wait a bit longer until the disk
is ready. In that case you will get a benefit only if you have two active
processes: the speed increase in the kernel will leave more time for running the other
process until the peripheral answers.</LI>

<LI>
Consider also that there are some peripherals (notoriously some broken
IDE disks) that force the kernel to busy-wait in a loop until it gets the
answer from the peripheral. That means that recompiling your
kernel will only affect the number of times the kernel executes the loop.</LI>

<LI>
Two cases where the kernel spends its time on pure CPU work are pipe data transfers
and disk reads when the data is found in cache. These should benefit from
tuning the compiler flags, were it not that the data transfer is done in assembler
and is not affected by compiler magic.</LI>
</OL>
Now remember that if your process spends only 10% of its time in kernel
parts written in C, then recompiling the kernel with a compiler generating
30% faster code will provide only a 3% speed increase in overall performance.
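<P>This rule of thumb is an instance of Amdahl's law, and a short Python sketch shows the two ways of reading "30% faster". The function names are ours. If the kernel's C code takes 30% less time, the run time shrinks by 3%; if instead the code runs 1.3 times as fast, the overall gain is a little lower.

```python
def time_saved(kernel_share, kernel_time_cut):
    """Fraction of total run time saved when the kernel share takes kernel_time_cut less time."""
    return kernel_share * kernel_time_cut

def amdahl_speedup(kernel_share, code_speedup):
    """Overall speedup when only the kernel share runs code_speedup times as fast."""
    return 1.0 / ((1.0 - kernel_share) + kernel_share / code_speedup)

# 10% of the time in kernel C code, compiler output 30% better.
print(f"run time saved: {100 * time_saved(0.10, 0.30):.1f}%")
print(f"overall speedup: {100 * (amdahl_speedup(0.10, 1.30) - 1):.1f}%")
```

<P>Either way the gain stays at about 3% or below, because the other 90% of the run time is untouched.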
<P>Kernel recompiling for your specific processor gives only a minimal
CPU boost when the kernel version is 2.0 and the processor is a 1998 or
earlier model of the i386 architecture. This could change in
future versions of Linux or when using newer processors.
<H2>
Advice and conclusions</H2>
Kernel compiling is not presently an effective way to optimize a Linux
box. Don't do it if it frightens you. At most, because it is easy and relatively
safe, prepare a rescue floppy, ensure you can boot from it and then recompile
changing only two things: set the processor type, and disable FPU emulation if your
CPU has an FPU (do a cat /proc/cpuinfo if you don't know). With most distributions
you will get exactly the same drivers your distribution kernel was compiled with
(keep a backup of the original modules just in case).
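<P>If you would rather not read the output of cat /proc/cpuinfo by eye, the check can be sketched in Python. This assumes the x86 format, where the file contains a line such as "fpu : yes"; the function name has_fpu is ours.

```python
def has_fpu(cpuinfo_text):
    """Return True if any 'fpu' line in /proc/cpuinfo-style text says 'yes'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "fpu" and value.strip() == "yes":
            return True
    return False

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("FPU present:", has_fpu(f.read()))
    except OSError:
        print("/proc/cpuinfo is not readable on this system")
```

<P>The script only reads the file, so it is safe to run on any machine.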
<P>Kernel compiling has been seen as the panacea for Linux optimization.
Unfortunately this doesn't stand up to serious analysis. It also has two serious
drawbacks. First, it is poor public relations for spreading Linux among
normal people. Second, it has stifled investigation of more effective
optimizations:
<UL>
<LI>
Some broken IDE disks absorb 90% of CPU time while data transfer is taking
place; tuning them with hdparm can reduce this to 20%. But tuning with
hdparm is very dangerous and everyone who has used it has suffered massive
data corruption at least once. Never use it unless you can back up your
disks, or perform your tests with a single partition mounted, that
one being expendable. But if half the energy that has been spent on
kernel compiling had been spent on hdparm we would have a database specifying
which settings can be safely used for each disk and chipset model.</LI>

<LI>
Little has been written about the placement of swap partitions; however,
smart placement can shorten the moves of the disk arm. In addition,
if you have two or more disks you can play with swap partition priorities
in order to get your pages spread evenly between two disks, thus doubling
the transfer rate. You can also try placing your swap partition on a different disk
than Linux itself.</LI>

<LI>
Your kernel can be tuned by writing to files under /proc/sys. The problem is
that there has been little experimentation to find the right values. In fact
few people know about this. Again, the emphasis on kernel compiling has precluded
serious investigation of it.</LI>
</UL>
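<P>As a starting point for experimenting with /proc/sys, the current values can be listed before anything is changed. The following Python sketch only reads; the function name read_tunables is ours, and which files exist under /proc/sys/vm depends on your kernel version.

```python
import os

def read_tunables(directory):
    """Return {name: value} for every readable regular file directly under directory."""
    values = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                values[name] = f.read().strip()
        except OSError:
            continue  # some entries are write-only or need root
    return values

if __name__ == "__main__":
    if os.path.isdir("/proc/sys/vm"):
        for name, value in read_tunables("/proc/sys/vm").items():
            print(f"{name} = {value}")
```

<P>Keeping a record of the original values makes it possible to undo an experiment that goes wrong.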
<P>The people advocating other solutions will use kernel compiling as
an argument against Linux. Let's kill this myth.

<!--===================================================================-->
<P> <hr> <P>
<center><H5>Copyright © 1999, Jean Francois Martinez <BR>
Published in Issue 37 of <i>Linux Gazette</i>, February 1999</H5></center>

<!--===================================================================-->
<P> <hr> <P>
<A HREF="./index.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif"
ALT="[ TABLE OF CONTENTS ]"></A>
<A HREF="../index.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
ALT="[ FRONT PAGE ]"></A>
<A HREF="./york.html"><IMG SRC="../gx/back2.gif"
ALT=" Back "></A>
<A HREF="./pennington.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
<P> <hr> <P>
<!--startcut ==========================================================-->
</BODY>
</HTML>
<!--endcut ============================================================-->