<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>Linux Benchmarking - Article III - Interpreting benchmark results: Benchmarking vs. benchmarketing</TITLE>
<META NAME="GENERATOR" CONTENT="Mozilla/3.01Gold (X11; I; Linux 2.0.29 i686) [Netscape]">
</HEAD>
<BODY>
<P><A HREF="./Article3e-1.html"><IMG SRC="./gx/balsa/prev.gif" ALT="Previous" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e-3.html"><IMG SRC="./gx/balsa/next.gif" ALT="Next" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e.html#toc2"><IMG SRC="./gx/balsa/toc.gif" ALT="Contents" HEIGHT=16 WIDTH=16></A>
<HR></P>
<H2><A NAME="s2"></A>2. Benchmarking vs. benchmarketing</H2>
<P>There are two basic approaches to benchmarking in the field of computing:
the &quot;scientific&quot; or quantitative approach and the &quot;benchmarketing&quot;
approach. Both use exactly the same tools, though with slightly different
methodologies and, of course, with widely diverging objectives and results.</P>
<H2><A NAME="ss2.1"></A>2.1 The scientific/quantitative approach</H2>
<P>The first approach is to think of benchmarking as a tool for experimentation.
Benchmarking is a specific branch of Computer Science, since it produces
numbers which can then be mathematically processed and analyzed; this analysis
can later be used to draw relevant conclusions about CPU architectures,
compiler design, etc.</P>
<P>As with any scientific activity, experiments (benchmark runs and reporting)
must follow some basic guidelines or rules: </P>
<UL>
<LI>A good dose of modesty/humility (don't be too ambitious to begin with)
and common sense.</LI>
<LI>No bias or prejudice.</LI>
<LI>A clearly stated objective related to advancing the state-of-the-art.</LI>
<LI>Reproducibility.</LI>
<LI>Accuracy.</LI>
<LI>Relevance.</LI>
<LI>Correct logical/statistical inference.</LI>
<LI>Conciseness.</LI>
<LI>Sharing of information.</LI>
<LI>Quoting sources/references.</LI>
</UL>
<P>Of course, this is an idealized view of the scientific community, but
these are some of the basic rules for the experimental methods in all branches
of Science.</P>
<P>I should stress that benchmarking results in <B>documented quantitative
data</B>.</P>
<P>The correct procedure for GNU/Linux benchmarking under this approach
is: </P>
<OL>
<LI>Decide what issue is going to be investigated. <B>It
is very important to execute this step before anything else gets started</B>.
Stating clearly what we are going to investigate is half the work
done.</LI>
<LI>Also note that we are not out to prove anything: we must start with
a clean, <B>Zen-like mind</B>. This is particularly difficult for us, GNU/Linux
benchmarkers, since we are all <B>utterly convinced</B> that: </LI>
<OL>
<LI>GNU/Linux is the best OS in the universe (what &quot;best&quot; means
in this context is not clear, however; probably the same as &quot;coolest&quot;),</LI>
<LI>Wide-SCSI-3 is better than plain EIDE, (idem),</LI>
<LI><A HREF="http://www.digital.com/semiconductor/alpha/alpha-chips.html">Digital's
64-bit RISC Alpha </A>is the best platform around for GNU/Linux development
(idem), and</LI>
<LI>X Window is a good, modern GUI (no comments). </LI>
</OL>
<LI>After purifying our minds and souls ;-), we will have to select the
tools (i.e. the benchmarks) which will be used for our benchmarking experiments.
You can take a look at my previous article for a selection of GPLed benchmarks.
Another way to get the right tool for the job at hand is to devise and
implement your own benchmark. This approach takes a lot more time and energy,
and sometimes amounts to reinventing the wheel. Creativity being one of
the nicest features of the GNU/Linux world, writing a new benchmark is
recommended nonetheless, especially in the areas where such tools are sorely
lacking (graphics, 3D, multimedia, etc.). <B>Summarizing, selecting the appropriate
tool for the job is very important</B>.</LI>
<LI>Now comes the painstaking part: gathering the data. This takes huge
amounts of patience and attention to detail. See my two previous articles.
A minimal sketch of such a measurement loop follows this list.</LI>
<LI>And finally we reach the stage of data analysis and logical inference,
based on the data we gathered. This is also where one can spoil
everything by joining the Dark Side of the Force (see section 2.2 below).
Quoting Andrew Tanenbaum: &quot;Figures don't lie, but liars figure&quot;.</LI>
<LI>If relevant conclusions can be drawn, publishing them on the appropriate
mailing lists, newsgroups or on the <B><A HREF="http://www.linuxgazette.com">Linux
Gazette </A></B>is in order. Again this is very much a Zen attitude (known
as &quot;coming back to the village&quot;).</LI>
<LI>Just when you thought it was over and you could finally close the cabinet
of your CPU after having disassembled it more times than you could count,
you get a sympathetic email that mentions a small but fundamental flaw
in your entire benchmarking procedure. And you begin to understand that
benchmarking is an iterative process, much like self-improvement...</LI>
</OL>
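<P>To make the &quot;gather the data, then analyze it&quot; steps above a little
more concrete, here is a minimal sketch (in C) of a home-grown benchmark: it
times an arbitrary floating-point kernel over several runs and reports the mean
and the standard deviation, so that run-to-run variation gets documented along
with the raw numbers. The kernel, the number of runs and the iteration count are
placeholders chosen purely for illustration, not one of the GPLed tools discussed
in the previous article; a real experiment would also record the compiler version,
the compiler flags and the hardware configuration, as discussed there.</P>
<PRE>
/*
 * minibench.c -- minimal home-grown benchmark sketch (illustration only).
 * Compile with: gcc -O2 -o minibench minibench.c -lm
 */
#include &lt;stdio.h&gt;
#include &lt;math.h&gt;
#include &lt;sys/time.h&gt;

#define RUNS       10           /* arbitrary number of repetitions */
#define ITERATIONS 10000000L    /* arbitrary kernel workload       */

static double elapsed_seconds(struct timeval start, struct timeval end)
{
    return (end.tv_sec - start.tv_sec) +
           (end.tv_usec - start.tv_usec) / 1e6;
}

/* The kernel being measured: a placeholder floating-point loop. */
static double kernel(void)
{
    double x = 0.0;
    long i;

    for (i = 1; i &lt;= ITERATIONS; i++)
        x += 1.0 / (double) i;
    return x;      /* returned so the compiler cannot discard the loop */
}

int main(void)
{
    double times[RUNS], mean = 0.0, var = 0.0, sink = 0.0;
    int run;

    /* Gather the raw data: one wall-clock time per run. */
    for (run = 0; run &lt; RUNS; run++) {
        struct timeval start, end;

        gettimeofday(&amp;start, NULL);
        sink += kernel();
        gettimeofday(&amp;end, NULL);
        times[run] = elapsed_seconds(start, end);
    }

    /* Analyze it: mean and standard deviation across the runs. */
    for (run = 0; run &lt; RUNS; run++)
        mean += times[run];
    mean /= RUNS;

    for (run = 0; run &lt; RUNS; run++)
        var += (times[run] - mean) * (times[run] - mean);
    var /= RUNS;

    printf("runs: %d  mean: %.3f s  stddev: %.3f s  (checksum %g)\n",
           RUNS, mean, sqrt(var), sink);
    return 0;
}
</PRE>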
<H2><A NAME="ss2.2"></A>2.2 The benchmarketing approach</H2>
<P>This second approach is more popular than the first, as it serves
commercial purposes and attracts more subsidies (i.e. grants, sponsorship,
money, cash, dinero, l'argent, $). Benchmarketing
has one basic objective: to prove that equipment/software A
is better (faster, more powerful, better performing or with a better price/performance
ratio) than equipment/software B. The basic inspiration for this approach
is the Greek philosophical current known as Sophism. Sophistry has had
its adepts in all times and ages, but the Greeks made it into a veritable
art. Benchmarketers have continued this tradition with varying success
(also note that the first Sophists were lawyers <A HREF="Article3e-7.html#sophism">(1)</A>;
see my comment on Intel below). Of course, with this approach there is no
hope of spiritual redemption... Quoting Larry Wall (of Perl fame), as often
quoted by David Niemi:</P>
<P><I>&quot;Down that path lies madness. On the other hand the road to
Hell is paved with melting snowballs.&quot;</I></P>
<P>Benchmarketing results cover the entire range from outright lies to
subtle fallacies. Sometimes an excessive amount of data is involved, and
in other cases no quantitative data at all is provided; in both cases the
task of proving benchmarketing wrong is made more arduous.</P>
<H3>A short history of benchmarketing/CPU developments</H3>
<P>We already saw that the first widely used benchmark, Whetstone, originated
as the result of research into computer architecture and compiler design.
So the original Whetstone benchmark can be traced to the &quot;scientific
approach&quot;.</P>
<P>At the time Whetstone was written, computers were indeed rare and very
expensive, and the fact that they executed tasks impossible for human beings
was enough to justify their purchase by large organizations. </P>
<P>Very soon competition changed this. Foremost among the motivations of the
early benchmarketers was the need to justify the purchase of very expensive
mainframes (at the time called supercomputers; these early machines would not
even match my &lt; $900 AMD K6 box). This gave rise to a good number of now
obsolete benchmarks, as of course each different architecture needed a new
algorithm to justify its existence in commercial terms.</P>
<P>This <B>supercomputer market issue </B>is still not over, but two factors
contributed to its relative decline:</P>
<OL>
<LI>Jack Dongarra's effort to standardize the LINPACK benchmark as the
basic tool for supercomputer benchmarking. This was not entirely successful,
as specific &quot;optimizers&quot; were created to make LINPACK run faster
on some CPU architectures. (Note that unless you are trying to solve large
scientific problems involving matrix operations - the usual task assigned
to most supercomputers - LINPACK is not a good measure of the CPU performance
of your GNU/Linux box; a crude sketch of this kind of workload follows this
list. You can find a version of LINPACK ported
to C on Al Aburto's excellent <A HREF="ftp://ftp.nosc.mil/pub/aburto">FTP
site</A>.)</LI>
<LI>The appearance of very fast and cheap superminis, and later microprocessors,
together with the widespread use of networking technologies. These changed the idea
of a centralized computing facility and signaled the end of the supercomputer
for most applications. Modern supercomputers are themselves built with arrays
of microprocessors nowadays (notably the latest Cray machines use
up to 2048 Alpha processors), so there was a shift in focus.</LI>
</OL>
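<P>To give an idea of the kind of workload LINPACK measures, here is a crude
sketch in C that solves a dense linear system by Gaussian elimination and
reports an approximate MFLOPS figure using the standard LINPACK operation
count. This is <B>not</B> the real LINPACK benchmark (no partial pivoting, no
standard residual check, and the matrix order is an arbitrary choice for
illustration); for actual measurements, use the C port available from Al
Aburto's FTP site mentioned above.</P>
<PRE>
/*
 * gauss.c -- crude illustration of a LINPACK-style workload: solve Ax = b
 * by Gaussian elimination and report an approximate MFLOPS figure.
 * Compile with: gcc -O2 -o gauss gauss.c
 */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;sys/time.h&gt;

#define N 500                 /* matrix order, arbitrary for this sketch */

static double a[N][N], b[N];

int main(void)
{
    struct timeval start, end;
    double seconds, mflops;
    int i, j, k;

    /* Fill the system with pseudo-random data; the exact solution is all ones. */
    srand(1);
    for (i = 0; i &lt; N; i++) {
        b[i] = 0.0;
        for (j = 0; j &lt; N; j++) {
            a[i][j] = (double) rand() / RAND_MAX;
            b[i] += a[i][j];
        }
    }

    gettimeofday(&amp;start, NULL);

    /* Forward elimination (no pivoting, unlike the real LINPACK). */
    for (k = 0; k &lt; N - 1; k++)
        for (i = k + 1; i &lt; N; i++) {
            double m = a[i][k] / a[k][k];
            for (j = k; j &lt; N; j++)
                a[i][j] -= m * a[k][j];
            b[i] -= m * b[k];
        }

    /* Back substitution. */
    for (i = N - 1; i &gt;= 0; i--) {
        for (j = i + 1; j &lt; N; j++)
            b[i] -= a[i][j] * b[j];
        b[i] /= a[i][i];
    }

    gettimeofday(&amp;end, NULL);
    seconds = (end.tv_sec - start.tv_sec) +
              (end.tv_usec - start.tv_usec) / 1e6;

    /* Standard LINPACK operation count: 2n^3/3 + 2n^2 flops. */
    mflops = (2.0 * N * N * N / 3.0 + 2.0 * N * N) / seconds / 1e6;
    printf("n = %d  time = %.3f s  approx. %.1f MFLOPS  (x[0] = %f)\n",
           N, seconds, mflops, b[0]);
    return 0;
}
</PRE>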
<P>Next in line was the <B>workstation market </B>issue. A nice side-effect
of the various marketing initiatives on the part of some competitors (HP,
Sun, IBM and SGI, among others) is that they spawned the development of various
Unix benchmarks that we can now use to benchmark our GNU/Linux boxes!</P>
<P>In parallel to the workstation market development, we saw fierce competition
develop in the <B>microprocessor market</B>, with each manufacturer touting
its architecture as the &quot;superior&quot; design. In terms of microprocessor
architecture, an interesting development was the performance debate of CISC
against RISC designs. In market terms the dominating architecture is Intel's
x86 CISC design (cf. <I>Computer Architecture: A Quantitative Approach</I>, Hennessy
and Patterson, 2nd edition; it has an excellent 25-page appendix on
the x86 architecture). </P>
<P>Recently the demonstrably better-performing Alpha RISC architecture
was almost wiped out by Intel lawyers: as a settlement of a complex legal
battle over patent infringements, Intel bought Digital's microelectronics
operation (which also produced the StrongARM <A HREF="Article3e-7.html#arm">(2)
</A>and Digital's highly successful line of Ethernet chips). Note however
that Digital kept its Alpha design team, and the settlement allows Digital
to have present and future Alpha chips manufactured
by Intel.</P>
<P>The x86 market attracted <A HREF="http://www.intel.com">Intel </A>competitors
<A HREF="http://www.amd.com/products/cpg/cpg.html">AMD </A>and, more recently,
<A HREF="http://www.cyrix.com">Cyrix</A>, which created original x86 designs.
AMD also bought a small startup called NexGen, which designed the precursor
to the K6, and Cyrix had to grow under the umbrella of IBM and now <A HREF="http://www.national.com">National
Semiconductor</A>, but that's another story altogether. Intel is still the
market leader, with about 90% of the microprocessor market, even though
both the AMD K6 and Cyrix 6x86MX architectures provide better Linux performance/MHz
than Intel's best effort to date, the Pentium II (except for floating-point
operations).</P>
<P>Lastly, we have the <B>OS market</B> issue. The <A HREF="http://www.microsoft.com">Microsoft
</A>Windows (R) line of OSes is the overwhelming market leader as far as
desktop applications are concerned, but in terms of performance, security, stability
and flexibility it sometimes does not compare well with other OSes. Of course, inter-OS
benchmarking is a risky business, and OS designers are aware of that.</P>
<P>Besides, comparing GNU/Linux to other OSes using benchmarks is almost
always an exercise in futility: GNU/Linux is GPLed, whereas no other OS
can be said to be <I>free</I> (in the GNU/GPL sense). Can you compare something
that is <I>free</I> to something that is proprietary <A HREF="Article3e-7.html#freedom">(3)</A>?
Does benchmarketing apply to something that is <I>free</I>?</P>
<P>Comparing GNU/Linux to other OSes is also a good way to start a nice
flame war on comp.os.linux.advocacy, especially when GNU/Linux is compared
to the BSD Unices or Windows NT. Most debaters don't seem to realize that each
OS had different design objectives!</P>
<P>These debates usually reach a steady state when both sides are convinced
that they are &quot;right&quot; and that their opponents are &quot;wrong&quot;.
Sometimes benchmarking data is called in to prove or disprove an argument.
But even then we see that this has more to do with benchmarketing than
with benchmarking. My $0.02 of advice: <B>avoid such debates like the plague</B>.</P>
<H3>Turning benchmarking into benchmarketing</H3>
<P>The <A HREF="http://www.specbench.org">SPEC95 </A>CPU benchmark suite
(the CPU Integer and FP tests, which SPEC calls CINT95/CFP95) is an example
of a promising Jedi that succumbed to the Dark side of the Force ;-).</P>
<P>SPEC (<B>S</B>tandard <B>P</B>erformance <B>E</B>valuation <B>C</B>orporation)
originated as a non-profit corporation with the explicit objective of creating
a vendor-independent, objective, non-biased, industry-wide CPU benchmark
suite. Founding members were some universities and various CPU and systems
manufacturers, such as Intel, HP, Digital, IBM and Motorola.</P>
<P>However, for historical reasons, some technical and philosophical issues
have developed that make SPEC95 inadequate for GNU/Linux benchmarking:
</P>
<OL>
<LI><B>Cost</B>. Strangely enough, SPEC95 benchmarks are free but you have
to pay for them: last time I checked, the CINT95/CFP95 cost was $600. The
quarterly newsletter was $550. These sums correspond to &quot;administrative
costs&quot;, according to SPEC.</LI>
<LI><B>Licensing</B>. SPEC benchmarks are not placed under the GPL. In fact,
SPEC95 has a severely limiting license that makes it inadequate for GNU/Linux
users. The license is clearly geared to large corporations/organizations:
you almost need a full-time employee just to handle all the requisites
specified in the license, you cannot freely reproduce the sources, new
releases appear only every three years, etc.</LI>
<LI><B>Outright cheating</B>. Recently, a California Court ordered a major
microprocessor manufacturer to pay back $50 for each processor sold of
a given speed and model, because the manufacturer had distorted SPEC results
with a modified version of gcc, and used such results in its advertisements.
Benchmarketing seems to have backfired on this occasion.</LI>
<LI><B>Comparability</B>. Hennessy and Patterson (see reference above)
clearly identify the technical limitations of SPEC92. Basically these have
to do with each vendor optimizing benchmark runs for their specific purposes.
Even though SPEC95 was released as an update that would work around these
limitations, it does not (and cannot, in practical terms) satisfactorily
address this issue. Compiler flag issues in SPEC92 prompted SPEC to release
a 10-page document entitled &quot;SPEC Run and Reporting Rules for CPU95
Suites&quot;. It clearly shows how confident SPEC is that nobody will try
to circumvent specific CPU shortcomings with tailor-made compilers/optimizers!
Unfortunately, SPEC98 is likely to carry over these problems to the next
generation of CPU performance measurements.</LI>
<LI><B>Run time</B>. Last but not least, the SPEC95 benchmarks take about
2 days to run on the SPARC reference machine. Note that this in no way
makes them more accurate than other CPU benchmarks that run in &lt; 5 minutes
(e.g. nbench-byte, presented below)!</LI>
</OL>
<P>Summarizing, if you absolutely must compare CPU performance for different
configurations running GNU/Linux, SPEC95 is definitely <B>not</B> the recommended
benchmark. On the other hand, it's a handy tool for benchmarketers.</P>
<P>
<HR><A HREF="./Article3e-1.html"><IMG SRC="./gx/balsa/prev.gif" ALT="Previous" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e-3.html"><IMG SRC="./gx/balsa/next.gif" ALT="Next" HEIGHT=16 WIDTH=16></A>
<A HREF="./Article3e.html#toc2"><IMG SRC="./gx/balsa/toc.gif" ALT="Contents" HEIGHT=16 WIDTH=16></A>
</P>
</BODY>
</HTML>