<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>Linux Benchmarking - Article III - Interpreting benchmark results: Benchmarks for SMP systems</TITLE>
<META NAME="GENERATOR" CONTENT="Mozilla/3.01Gold (X11; I; Linux 2.0.29 i686) [Netscape]">
</HEAD>
<BODY>

<P><A HREF="./Article3e-2.html"><IMG SRC="./gx/balsa/prev.gif" ALT="Previous" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e-4.html"><IMG SRC="./gx/balsa/next.gif" ALT="Next" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e.html#toc3"><IMG SRC="./gx/balsa/toc.gif" ALT="Contents" HEIGHT=16 WIDTH=16></A>

<HR></P>

<H2><A NAME="s3"></A>3. Benchmarks for SMP systems</H2>

<H2><A NAME="ss3.1"></A>3.1 Description of the problem</H2>

<P>SMP (<B>S</B>ymmetric <B>M</B>ulti<B>P</B>rocessing) has been implemented
in the Linux kernel for Intel Pentium, Pentium MMX, Pentium Pro and Pentium
II processors <A HREF="Article3e-7.html#smp">(4)</A> and more recently
for SPARC architectures. SMP systems are usually more expensive than their
uniprocessor counterparts because they are frequently used to implement
heavy-duty (possibly fault-tolerant) servers. For this reason potential
buyers of such systems often want to make sure that applications, OS and
hardware platform will be able to satisfy their needs in terms of overall
performance before deciding on an expensive purchase. This is precisely
where a Linux SMP benchmark would be useful. As this series of articles
focuses on using current and stable 2.0.x kernels, we will only deal with
what could be done for benchmarking Linux SMP systems with current Linux
distributions.</P>

<P>Taking advantage of the additional computing power brought to the end-user
by an SMP hardware platform puts constraints on almost all layers of the
software involved: application, runtime libraries and operating system.</P>

<P>Basically two approaches are possible depending on how the application
being considered is designed: </P>

<OL>
<LI>The application uses multiple simultaneously running processes. Those
processes are very likely to communicate with each other using standard
IPC (Inter-Process Communications) mechanisms. </LI>

<LI>The application is multi-threaded: for some of the related processes,
multiple instances of sequential execution exist in the same address space.</LI>
</OL>
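
<P>The difference between the two designs can be sketched in a few lines.
The example below is only an illustration written for this article (the
function names are ours, and it assumes a Unix-like system where fork()
is available): a thread changes data that its creator sees directly, whereas
a forked process works on a private copy and has to send its result back
through an IPC channel, here a pipe.</P>

```python
import os
import threading


def run_in_thread(shared):
    # A thread runs in its creator's address space, so this append
    # is visible to the caller without any IPC.
    worker = threading.Thread(target=shared.append, args=("from thread",))
    worker.start()
    worker.join()


def run_in_process(shared):
    # A forked child gets its own copy of the address space, so its
    # result must travel back through an IPC channel (here, a pipe).
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        os.close(read_fd)
        shared.append("lost: only the child's copy changes")
        os.write(write_fd, b"from process")
        os._exit(0)
    os.close(write_fd)  # parent process
    with os.fdopen(read_fd, "rb") as pipe:
        message = pipe.read().decode()
    os.waitpid(pid, 0)
    return message


if __name__ == "__main__":
    data = []
    run_in_thread(data)
    reply = run_in_process(data)
    print(data)   # holds the thread's append, not the child's
    print(reply)  # the child's result, received through the pipe
```

<P>The extra plumbing in the process version is the IPC mentioned in the
first design; the shared list in the thread version is what makes thread-safe
libraries necessary in the second.</P>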

<P>The table below summarizes the impact of these two designs on the software
layers involved, on the programming complexity and on the expected performance
improvement (relative to a comparable uniprocessor system):</P>

<CENTER><TABLE ALIGN=ABSCENTER BORDER=3 CELLSPACING=2 CELLPADDING=2 WIDTH="100%" >
<TR>
<TD><B>Application</B></TD>

<TD><B>Multiple single-threaded processes</B></TD>

<TD><B>Multi-threaded application</B></TD>
</TR>

<TR>
<TD>Runtime libraries special requirements</TD>

<TD>None.</TD>

<TD>Libraries must be thread safe and should preferably offer some POSIX
control over the threads.</TD>
</TR>

<TR>
<TD>Operating system special requirements: load balancing</TD>

<TD>Smart assignment of processes to processors must be implemented (static
or dynamic). </TD>

<TD>A mechanism for assigning kernel threads to processors must be supported.
</TD>
</TR>

<TR>
<TD>Example </TD>

<TD>make -j 4 vmlinux</TD>

<TD>None available AFAIK.</TD>
</TR>

<TR>
<TD>Additional programming complexity</TD>

<TD>None. </TD>

<TD>Greater than for single-threaded applications, but it can be done by
us mere mortals. </TD>
</TR>

<TR>
<TD>Expected performance improvement </TD>

<TD>Average to poor. </TD>

<TD>High (close to linear speedup) for CPU bound applications but can also
degrade to become as low as single processor performance for system call
intensive applications.</TD>
</TR>
</TABLE></CENTER>

<P>How do those issues relate to current stable Linux kernels? </P>

<P>Good results obtained from a Linux multi-threaded benchmark would be
<I>very interesting for power users</I>.</P>

<H2><A NAME="ss3.2"></A>3.2 Runtime issues</H2>

<P>Threads can be implemented at the user level as coroutines scheduled by
a library inside a single process, or as kernel threads (i.e. threads running
in user mode but scheduled by the kernel). Only kernel threads can run on
several processors at the same time. The LinuxThreads package takes this
second approach: it creates its threads with the clone() system call and
leaves all scheduling to the kernel. Until the very recent release of
glibc 2.0, which Red Hat 5.0 includes as its standard C library, finding
a thread-safe runtime library could be a tough job.</P>
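
<P>To see why thread safety is a library issue at all, consider a routine
that keeps its position in hidden shared state, in the manner of the C
library's strtok(), next to a reentrant variant where the caller owns the
state, in the manner of strtok_r(). The sketch below is an analogy only:
the two tokenizers are hypothetical helpers written for this article, and
the interleaving of two callers is played out sequentially so that the
outcome is deterministic.</P>

```python
_saved_words = []  # hidden shared state, analogous to strtok()'s static pointer


def next_word_shared(text=None):
    """Non-reentrant tokenizer: the scan position lives in a module global."""
    global _saved_words
    if text is not None:
        _saved_words = text.split()
    return _saved_words.pop(0) if _saved_words else None


def next_word_reentrant(state, text=None):
    """Reentrant tokenizer: the caller supplies the state, as with strtok_r()."""
    if text is not None:
        state[:] = text.split()
    return state.pop(0) if state else None


if __name__ == "__main__":
    # Two callers (think: two threads) interleave their use of the shared version.
    next_word_shared("alpha beta")   # caller A starts scanning
    next_word_shared("one two")      # caller B starts scanning, clobbering A's state
    print(next_word_shared())        # caller A now receives one of B's words

    state_a, state_b = [], []
    next_word_reentrant(state_a, "alpha beta")
    next_word_reentrant(state_b, "one two")
    print(next_word_reentrant(state_a))  # caller A receives its own next word
```

<P>A thread-safe runtime library either avoids such hidden state or protects
it with locks, which is what had to be added to the C library before
multi-threaded programs could rely on it.</P>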

<H2><A NAME="ss3.3"></A>3.3 Scheduling issues</H2>

<P>The issue here is the way scheduling is implemented on SMP platforms
by the current stable kernels. Quoting its implementor Alan Cox (in a paper
he wrote in 1995):</P>

<P><I>"A single lock is maintained across all processors. This lock
is required to access the kernel space. Any processor may hold it and once
it is held may also re-enter the kernel for interrupts and other services
whenever it likes until the lock is relinquished. This lock ensures that
a kernel mode process will not be pre-empted and ensures that blocking
interrupts in kernel mode behaves correctly. This is guaranteed because
only the processor holding the lock can be in kernel mode, only kernel
mode processes can disable interrupts and only the processor holding the
lock may handle an interrupt."</I></P>

<P>So a correct interpretation of this is: right now, no more than a single
process may be executing in kernel mode (i.e. executing a system call)
at any given time.</P>
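
<P>A rough way to gauge the consequence is to treat kernel-mode time as
strictly serial and user-mode time as perfectly parallel, which is simply
Amdahl's law with the single lock as the serial part. This is a
back-of-the-envelope model and an assumption of ours, not something taken
from Alan Cox's paper: it ignores I/O waits, interrupt handling and cache
effects. The function name and parameters are illustrative.</P>

```python
def max_speedup(kernel_fraction, processors):
    """Upper bound on speedup when kernel-mode time is serialized by one lock.

    kernel_fraction: share of the uniprocessor CPU time spent in kernel mode (0..1).
    processors: number of CPUs sharing the user-mode work.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    # Serial part stays as is; the user-mode remainder is divided among the CPUs.
    return 1.0 / (kernel_fraction + (1.0 - kernel_fraction) / processors)


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, max_speedup(fraction, 2))
```

<P>With two processors the bound is 2 for a purely CPU bound program and
falls to 1 for a program that lives entirely in system calls, which matches
the two extremes given in the table of section 3.1.</P>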

<P>But efforts are underway to improve the granularity of locking in future
2.2.x kernels. We should also soon be able to take interrupts without having
to take a lock. This should result in much better performance of system
call intensive applications on SMP systems running GNU/Linux.</P>

<H2><A NAME="ss3.4"></A>3.4 Further reading/links</H2>

<OL>
<LI>"An Implementation Of Multiprocessor Linux", Alan Cox, 1995.
I found this TeX article in the Linux source tree (kernel 2.0.33 source
in /Documentation/smp.tex).</LI>

<LI>A <A HREF="http://www.accessone.com/~jql/clone-faq.html">FAQ</A> about
the clone() Linux system call.</LI>

<LI>A clone() <A HREF="http://www.accessone.com/~jql/linus-example.html">utilization
example</A>.</LI>

<LI>LinuxThreads: a <A HREF="http://pauillac.inria.fr/~xleroy/linuxthreads/">package</A>
that implements POSIX threads under Linux as kernel threads created with clone().</LI>
</OL>

<H2><A NAME="ss3.5"></A>3.5 Benchmark availability</H2>

<P>If we stick to our guideline for simple, quick running, readily available
benchmarks (or more simply, K.I.S.S. benchmarks), we can use a modified
version of the Linux kernel 2.0.0 compilation benchmark (described in article
II), now for SMP systems. Andy Kahn provided us with this test and some
very interesting results. Quoting directly from some email we exchanged
on this subject:</P>

<P><TT>"...actually, it's pretty simple. GNU "make" has
an option you can specify to use multiple processes (either a default number
or a user specified number). I don't have the man page handy right now,
but i'm pretty sure it's either the -j option or the -p option (actually,
i think both options have some importance to multiple processes). Once
you specify multiple make processes, each process will have gcc compiling
something (so in effect, it's just multiple gcc processes).</TT></P>

<P>(later)</P>

<P><TT>"Andre Derrick Balsa" wrote:</TT></P>

<P><TT>-> Great news :-)</TT></P>

<P><TT>-></TT></P>

<P><TT>-> Thanks to Andy who actually tried this on a dual PPro SMP
system and</TT></P>

<P><TT>-> explained the whole thing to me, I am pleased to announce
a version of</TT></P>

<P><TT>-> the Linux 2.0.0 kernel compilation application benchmark for
SMP</TT></P>

<P><TT>-> systems:</TT></P>

<P><TT>-></TT></P>

<P><TT>-> Just replace the "make vmlinux" (was "make
zImage") by "make -j n</TT></P>

<P><TT>-> vmlinux". Replace n by 2, 3 ... and make will launch
2, 3 ... processes</TT></P>

<P><TT>-> in parallel. Since Linux SMP will transparently distribute
processes</TT></P>

<P><TT>-> between the SMP processors, there is no need to program anything
special</TT></P>

<P><TT>-> in terms of message-passing, clone(), etc...</TT></P>

<P><TT>-></TT></P>

<P><TT>-> Andy doesn't have any exact figures available, but it seems
this would</TT></P>

<P><TT>-> provide a 30% decrease in compilation time (over a single
serialized</TT></P>

<P><TT>-> process). Thanks, Andy. :-)</TT></P>

<P><TT>-> </TT></P>

<P><TT>and because I don't have any exact figures, I decided that I would
go and get some exact figures. :) </TT></P>

<P><TT>The system tested was:</TT></P>

<P><TT>Dual Pentium Pro 180MHz overclocked to 200MHz 64MB EDO RAM</TT></P>

<P><TT>Linux 2.0.27 gcc v2.7.2.1 libc v5.3.12</TT></P>

<P><TT>hda: QUANTUM TRB850A, 810MB w/96kB Cache, LBA, CHS=823/32/63 </TT></P>

<P><TT>This is more or less your "standard" PC from about 13-14
months ago. I'm not at liberty to upgrade the software on this system,
so this is as good as it gets from me with this setup.</TT></P>

<P><TT>Also, instead of doing a "sync" before issuing the final
"make" command, I propose that if the circumstances allow it
(you have root access), then umount the file system, remount it, then go
back to that directory and build the kernel.</TT></P>

<P><TT>--- THE RESULTS! ---</TT></P>

<P><TT>"time make vmlinux" 107.32user 149.01system 4:27.91elapsed
95%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (143472major+167951minor)pagefaults
0swaps</TT></P>

<P><TT>"time make -j 2 vmlinux" 131.13user 177.77system 3:28.34elapsed
148%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (169498major+168582minor)pagefaults
8903swaps</TT></P>

<P><TT>Ugh, the results are terrible (only a 22% improvement)!! Note that
in the SMP case, CPU usage was only 148%. From this, we can see that the
2nd CPU wasn't really used all that much (efficiently)."</TT> </P>
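
<P>The percentages quoted in the message can be re-derived from the two
"time" lines above. The short script below does the arithmetic; the helper
names are ours, and the only inputs are the user, system and elapsed figures
reported by Andy.</P>

```python
def to_seconds(elapsed):
    """Convert an elapsed field of the form 'm:ss.cc' to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


def summarize(user, system, elapsed):
    """Return wall-clock seconds, total CPU seconds and CPU usage in percent."""
    wall = to_seconds(elapsed)
    cpu = user + system
    return {"wall": wall, "cpu": cpu, "cpu_percent": 100.0 * cpu / wall}


# Figures copied from the two "time" lines in the message.
single = summarize(107.32, 149.01, "4:27.91")   # make vmlinux
dual = summarize(131.13, 177.77, "3:28.34")     # make -j 2 vmlinux

# Reduction in elapsed time, and the equivalent speedup factor.
improvement = 100.0 * (1.0 - dual["wall"] / single["wall"])
speedup = single["wall"] / dual["wall"]

if __name__ == "__main__":
    print("CPU usage: %.1f%% and %.1f%%" % (single["cpu_percent"], dual["cpu_percent"]))
    print("elapsed time reduced by %.1f%%" % improvement)
    print("speedup factor %.2f" % speedup)
    print("total CPU seconds: %.2f and %.2f" % (single["cpu"], dual["cpu"]))
```

<P>The elapsed time drops by about 22%, a speedup of roughly 1.29 on two
processors, while the total CPU time consumed rises from 256.33 to 308.90
seconds. Note also that more than half of the CPU time in both runs is
system time.</P>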

<P>I really appreciated Andy's attitude: not only did he improve on my previous
test procedure, but he went right ahead and produced some nice experimental
data to go with it! Plus one can feel how enthusiastic he was about doing
some hands-on experimentation!</P>

<P>Another nice feature of this simple SMP benchmark is that it provides
a basis for performance comparisons between uniprocessor and SMP GNU/Linux
systems.</P>
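
<P>To run such a comparison repeatably, the timing can be scripted. The
following is a minimal sketch rather than a finished tool: the function names
are ours, the kernel source directory is taken from the command line, and
running "make clean" before each build is our assumption about how to force
a full rebuild (it is not part of the procedure quoted above, which also
recommends remounting the file system to empty the buffer cache).</P>

```python
import os
import subprocess
import sys
import time


def time_command(argv, cwd=None):
    """Run argv once; return (elapsed, user, system) seconds for it and its children."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, cwd=cwd, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return elapsed, user, system


def make_command(jobs):
    """Build the make invocation used by the benchmark for a given job count."""
    return ["make", "-j", str(jobs), "vmlinux"]


if __name__ == "__main__" and len(sys.argv) > 1:
    source_dir = sys.argv[1]  # path to an already configured kernel source tree
    for jobs in (1, 2, 3):
        # Assumption: start every run from a clean tree so the whole kernel is rebuilt.
        subprocess.run(["make", "clean"], cwd=source_dir, check=True,
                       stdout=subprocess.DEVNULL)
        elapsed, user, system = time_command(make_command(jobs), cwd=source_dir)
        print("-j %d: %.2f elapsed, %.2f user, %.2f system"
              % (jobs, elapsed, user, system))
```

<P>Running the same script on a uniprocessor and on an SMP machine gives
directly comparable elapsed, user and system times for each value of n.</P>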

<P>Two more benchmarks deserve a thorough description, but I will
just mention them here: </P>

<OL>
<LI><A HREF="http://www.tux.org/pub/people/david-niemi/bench/index.html">UnixBench
4.1</A> has some tests that will launch simultaneous processes.</LI>

<LI>A rather complex, but complete Unix benchmark suite developed in France,
called <A HREF="http://www.afuu.fr/">SSBA</A>. François is working
on a Linux port of the latest 2.4F revision.</LI>
</OL>

<P>
<HR><A HREF="./Article3e-2.html"><IMG SRC="./gx/balsa/prev.gif" ALT="Previous" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e-4.html"><IMG SRC="./gx/balsa/next.gif" ALT="Next" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e.html#toc3"><IMG SRC="./gx/balsa/toc.gif" ALT="Contents" HEIGHT=16 WIDTH=16></A>
</P>

</BODY>
</HTML>