old-www/LDP/LG/issue24/Article3e-3.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
   <TITLE>Linux Benchmarking - Article III - Interpreting benchmark results: Benchmarks for SMP systems</TITLE>
   <META NAME="GENERATOR" CONTENT="Mozilla/3.01Gold (X11; I; Linux 2.0.29 i686) [Netscape]">
</HEAD>
<BODY>

<P><A HREF="./Article3e-2.html"><IMG SRC="./gx/balsa/prev.gif" ALT="Previous" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e-4.html"><IMG SRC="./gx/balsa/next.gif" ALT="Next" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e.html#toc3"><IMG SRC="./gx/balsa/toc.gif" ALT="Contents" HEIGHT=16 WIDTH=16></A>

<HR></P>

<H2><A NAME="s3"></A>3. Benchmarks for SMP systems</H2>

<H2><A NAME="ss3.1"></A>3.1 Description of the problem</H2>

<P>SMP (<B>S</B>ymmetric <B>M</B>ulti<B>P</B>rocessing) has been implemented
in the Linux kernel for Intel Pentium, Pentium MMX, Pentium Pro and Pentium
II processors <A HREF="Article3e-7.html#smp">(4) </A>and more recently
for SPARC architectures. SMP systems are usually more expensive than their
uniprocessor counterparts because they are frequently used to implement
heavy-duty (possibly fault-tolerant) servers. For this reason potential
buyers of such systems often want to make sure that applications, OS and
hardware platform will be able to satisfy their needs in terms of overall
performance before deciding on an expensive purchase. This is precisely
where a Linux SMP benchmark would be useful. As this series of articles
focuses on using current and stable 2.0.x kernels, we will only deal with
what could be done for benchmarking Linux SMP systems with current Linux
distributions.</P>

<P>Taking advantage of the additional computing power brought to the end-user
by an SMP hardware platform puts constraints on almost all layers of the
software involved: application, runtime libraries and operating system.</P>

<P>Basically two approaches are possible depending on how the application
being considered is designed: </P>

<OL>
<LI>The application uses multiple simultaneously running processes. Those
processes are very likely to communicate with each other using standard
IPC (Inter-Process Communications) mechanisms. </LI>

<LI>The application is multi-threaded: for some of the related processes,
multiple instances of sequential execution exist in the same address space.</LI>
</OL>

<P>The table below summarizes the impact of these two designs on the software
layers involved, on the programming complexity and on the expected performance
improvement (relatively to a comparable uniprocessor system):</P>

<CENTER><TABLE ALIGN=ABSCENTER BORDER=3 CELLSPACING=2 CELLPADDING=2 WIDTH="100%" >
<TR>
<TD><B>Application</B></TD>

<TD><B>Multiple single-threaded processes</B></TD>

<TD><B>Multi-threaded application</B></TD>
</TR>

<TR>
<TD>Runtime libraries special requirements</TD>

<TD>None.</TD>

<TD>Libraries must be thread safe and should preferably offer some POSIX
control over the threads.</TD>
</TR>

<TR>
<TD>Operating system special requirements: load balancing</TD>

<TD>Smart assignment of processes to processors must be implemented (static
or dynamic). </TD>

<TD>An assignment mechanism of kernel-threads to processors must be supported
</TD>
</TR>

<TR>
<TD>Example </TD>

<TD>make -j 4 vmlinux</TD>

<TD>None available AFAIK.</TD>
</TR>

<TR>
<TD>Additional programming complexity</TD>

<TD>None. </TD>

<TD>Greater than for single-hreaded applications, but it can be done by
us mere mortals. </TD>
</TR>

<TR>
<TD>Expected performance improvement </TD>

<TD>Average to poor. </TD>

<TD>High (close to linear speedup) for CPU bound applications but can also
degrade to become as low as single processor performance for system call
intensive applications.</TD>
</TR>
</TABLE></CENTER>

<P>How do those issues relate to current stable Linux kernels? </P>

<P>Good results obtained from a Linux multi-threaded benchmark would be
<I>very interesting for power users</I>.</P>

<H2><A NAME="ss3.2"></A>3.2 Runtime issues</H2>

<P>Threads can be implemented at the user-level as coroutines (e.g. the
LinuxThreads package), or can be kernel threads (i.e. threads running in
user mode but scheduled by the kernel). Until the very recent release of
Glibc 2.0 which RedHat 5.0 includes as its standard C library, finding
a thread safe runtime library could be a tough job.</P>

<H2><A NAME="ss3.3"></A>3.3 Scheduling issues</H2>

<P>The issue here is the way scheduling is implemented on SMP platforms
by the current stable kernels. Quoting its implementor Alan Cox (in a paper
he wrote in 1995):</P>

<P><I>&quot;A single lock is maintained across all processors. This lock
is required to access the kernel space. Any processor may hold it and once
it is held may also re-enter the kernel for interrupts and other services
whenever it likes until the lock is relinquished. This lock ensures that
a kernel mode process will not be pre-empted and ensures that blocking
interrupts in kernel mode behaves correctly. This is guaranteed because
only the processor holding the lock can be in kernel mode, only kernel
mode processes can disable interrupts and only the processor holding the
lock may handle an interrupt.&quot;</I></P>

<P>So a correct interpretation of this is: right now, no more than a single
process may be executing in kernel mode (i.e. executing a system call)
at any given time.</P>

<P>But efforts are underway to improve the granularity of locking in future
2.2.x kernels. We should also soon be able to take interrupts without having
to take a lock. This should result in much better performance of system
call intensive applications on SMP systems running GNU/Linux.</P>

<H2><A NAME="ss3.4"></A>3.4 Further reading/links</H2>

<OL>
<LI>&quot;An Implementation Of Multiprocessor Linux&quot;, Alan Cox, 1995.
I found this TeX article in the Linux source tree (kernel 2.0.33 source
in /Documentation/smp.tex).</LI>

<LI>A <A HREF="http://www.accessone.com/~jql/clone-faq.html">FAQ </A>about
the clone() Linux system call.</LI>

<LI>A clone() <A HREF="http://www.accessone.com/~jql/linus-example.html">utilization
example </A></LI>

<LI>LinuxThreads: a <A HREF="http://pauillac.inria.fr/~xleroy/linuxthreads/">package
</A>that implements user-level threads under Linux.</LI>
</OL>

<H2><A NAME="ss3.5"></A>3.5 Benchmark availability</H2>

<P>If we stick to our guideline for simple, quick running, readily available
benchmarks (or more simply, K.I.S.S. benchmarks), we can use a modified
version of the Linux kernel 2.0.0 compilation benchmark (described in article
II), now for SMP systems. Andy Kahn provided us with this test and some
very interesting results. Quoting directly from some email we exchanged
on this subject:</P>

<P><TT>&quot;...actually, it's pretty simple. GNU &quot;make&quot; has
an option you can specify to use multiple processes (either a default number
or a user specified number).I don't have the man page handy right now,
but i'm pretty sure it's either the -j option or the -p option (actually,
i think both options have some importance to multiple processes). Once
you specify multiple make processes, each process will have gcc compiling
something (so in effect, it's just multiple gcc processes).</TT></P>

<P>(later)</P>

<P><TT>&quot;Andre Derrick Balsa&quot; wrote:</TT></P>

<P><TT>-&gt; Great news :-)</TT></P>

<P><TT>-&gt;</TT></P>

<P><TT>-&gt; Thanks to Andy who actually tried this on a dual PPro SMP
system and</TT></P>

<P><TT>-&gt; explained the whole thing to me, I am pleased to announce
a version of</TT></P>

<P><TT>-&gt; the Linux 2.0.0 kernel compilation application benchmark for
SMP</TT></P>

<P><TT>-&gt; systems:</TT></P>

<P><TT>-&gt;</TT></P>

<P><TT>-&gt; Just replace the &quot;make vmlinux&quot; (was &quot;make
zImage&quot;) by &quot;make -j n</TT></P>

<P><TT>-&gt; vmlinux&quot;. Replace n by 2, 3 ... and make will launch
2, 3 ... processes</TT></P>

<P><TT>-&gt; in parallel. Since Linux SMP will transparently distribute
processes</TT></P>

<P><TT>-&gt; between the SMP processors, there is no need to program anything
special</TT></P>

<P><TT>-&gt; in terms of message-passing, clone(), etc...</TT></P>

<P><TT>-&gt;</TT></P>

<P><TT>-&gt; Andy doesn't have any exact figures available, but it seems
this would</TT></P>

<P><TT>-&gt; provide a 30% decrease in compilation time (over a single
serialized</TT></P>

<P><TT>-&gt; process). Thanks, Andy. :-)</TT></P>

<P><TT>-&gt; </TT></P>

<P><TT>and because I don't have any exact figures, I decided that I would
go and get some exact figures. :) </TT></P>

<P><TT>The system tested was:</TT></P>

<P><TT>Dual Pentium Pro 180MHz overclocked to 200MHz 64MB EDO RAM</TT></P>

<P><TT>Linux 2.0.27 gcc v2.7.2.1 libc v5.3.12</TT></P>

<P><TT>hda: QUANTUM TRB850A, 810MB w/96kB Cache, LBA, CHS=823/32/63 </TT></P>

<P><TT>This is more or less your &quot;standard&quot; PC from about 13-14
months ago. I'm not at liberty to upgrade the software on this system,
so this is as good as it gets from me with this setup.</TT></P>

<P><TT>Also, instead of doing a &quot;sync&quot; before issuing the final
&quot;make&quot; command, I propose that if the circumstances allow it
(you have root access), then umount the file system, remount it, then go
back to that directory and build the kernel.</TT></P>

<P><TT>--- THE RESULTS! ---</TT></P>

<P><TT>&quot;time make vmlinux&quot; 107.32user 149.01system 4:27.91elapsed
95%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (143472major+167951minor)pagefaults
0swaps</TT></P>

<P><TT>&quot;time make -j 2 vmlinux&quot; 131.13user 177.77system 3:28.34elapsed
148%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (169498major+168582minor)pagefaults
8903swaps</TT></P>

<P><TT>Ugh, the results are terrible (only a 22% improvement)!! Note that
in the SMP case, CPU usage was only 148%. From this, we can see that the
2nd CPU wasn't really used all that much (efficiently).&quot;</TT> </P>

<P>I really appreciated Andy's attitude: not only he improved on my previous
test procedure, but he went right ahead and produced some nice experimental
data to go with it! Plus one can feel how enthusiastic he was at doing
some hands-on experimentation!</P>

<P>Another nice feature of this simple SMP benchmark is that it provides
a basis for performance comparisons between uniprocessor and SMP GNU/Linux
systems.</P>

<P>Two more benchmarks would deserve a thorough description, but I will
just mention them here: </P>

<OL>
<LI><A HREF="http://www.tux.org/pub/people/david-niemi/bench/index.html">UnixBench
4.1</A> has some tests that will launch simultaneous processes.</LI>

<LI>A rather complex, but complete Unix benchmark suite developed in France,
called <A HREF="http://www.afuu.fr/">SSBA</A>. Fran&ccedil;ois is working
on a Linux port of the latest 2.4F revision.</LI>
</OL>

<P>
<HR><A HREF="./Article3e-2.html"><IMG SRC="./gx/balsa/prev.gif" ALT="Previous" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e-4.html"><IMG SRC="./gx/balsa/next.gif" ALT="Next" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e.html#toc3"><IMG SRC="./gx/balsa/toc.gif" ALT="Contents" HEIGHT=16 WIDTH=16></A>
</P>

</BODY>
</HTML>