4171 lines
196 KiB
HTML
4171 lines
196 KiB
HTML
<HTML>
|
|
<HEAD>
|
|
|
|
<TITLE>
|
|
Linux Parallel Processing HOWTO
|
|
</TITLE>
|
|
</HEAD>
|
|
<BODY BGCOLOR=white>
|
|
|
|
<HR>
|
|
<H1>Linux Parallel Processing HOWTO</H1>
|
|
|
|
<H2>Hank Dietz, <CODE>
|
|
<A HREF="mailto:hankd@engr.uky.edu">hankd@engr.uky.edu</A></CODE></H2>v2.0, 2004-06-28<P>Although this HOWTO has been "republished" (v2.0, 2004-06-28) to update the author
|
|
contact info, it has many broken links and some information is
|
|
seriously out of date. Rather than just repairing links, this
|
|
document is being heavily rewritten as a Guide which we expect
|
|
to release in July 2004. At that time, the HOWTO will be obsolete.
|
|
The prefered home URL for both the old and new documents is
|
|
<A HREF="http://aggregate.org/LDP/">http://aggregate.org/LDP/</A>
|
|
<P><HR>
|
|
<EM><B>Parallel Processing</B> refers to the concept of speeding-up the
|
|
execution of a program by dividing the program into multiple fragments
|
|
that can execute simultaneously, each on its own processor. A program
|
|
being executed across <EM>N</EM> processors might execute <EM>N</EM>
|
|
times faster than it would using a single processor. This document
|
|
discusses the four basic approaches to parallel processing that are
|
|
available to Linux users: SMP Linux systems, clusters of networked
|
|
Linux systems, parallel execution using multimedia instructions (i.e.,
|
|
MMX), and attached (parallel) processors hosted by a Linux system.</EM>
|
|
<HR>
|
|
<P>
|
|
<P>
|
|
<H2><A NAME="toc1">1.</A> <A HREF="#s1">Introduction</A></H2>
|
|
|
|
<UL>
|
|
<LI><A HREF="#ss1.1">1.1 Is Parallel Processing What I Want?</A>
|
|
<LI><A HREF="#ss1.2">1.2 Terminology</A>
|
|
<LI><A HREF="#ss1.3">1.3 Example Algorithm</A>
|
|
<LI><A HREF="#ss1.4">1.4 Organization Of This Document</A>
|
|
</UL>
|
|
<P>
|
|
<H2><A NAME="toc2">2.</A> <A HREF="#s2">SMP Linux</A></H2>
|
|
|
|
<UL>
|
|
<LI><A HREF="#ss2.1">2.1 SMP Hardware</A>
|
|
<LI><A HREF="#ss2.2">2.2 Introduction To Shared Memory Programming</A>
|
|
<LI><A HREF="#ss2.3">2.3 bb_threads</A>
|
|
<LI><A HREF="#ss2.4">2.4 LinuxThreads</A>
|
|
<LI><A HREF="#ss2.5">2.5 System V Shared Memory</A>
|
|
<LI><A HREF="#ss2.6">2.6 Memory Map Call</A>
|
|
</UL>
|
|
<P>
|
|
<H2><A NAME="toc3">3.</A> <A HREF="#s3">Clusters Of Linux Systems</A></H2>
|
|
|
|
<UL>
|
|
<LI><A HREF="#ss3.1">3.1 Why A Cluster?</A>
|
|
<LI><A HREF="#ss3.2">3.2 Network Hardware</A>
|
|
<LI><A HREF="#ss3.3">3.3 Network Software Interface</A>
|
|
<LI><A HREF="#ss3.4">3.4 PVM (Parallel Virtual Machine)</A>
|
|
<LI><A HREF="#ss3.5">3.5 MPI (Message Passing Interface)</A>
|
|
<LI><A HREF="#ss3.6">3.6 AFAPI (Aggregate Function API)</A>
|
|
<LI><A HREF="#ss3.7">3.7 Other Cluster Support Libraries</A>
|
|
<LI><A HREF="#ss3.8">3.8 General Cluster References</A>
|
|
</UL>
|
|
<P>
|
|
<H2><A NAME="toc4">4.</A> <A HREF="#s4">SIMD Within A Register (e.g., using MMX)</A></H2>
|
|
|
|
<UL>
|
|
<LI><A HREF="#ss4.1">4.1 SWAR: What Is It Good For?</A>
|
|
<LI><A HREF="#ss4.2">4.2 Introduction To SWAR Programming</A>
|
|
<LI><A HREF="#ss4.3">4.3 MMX SWAR Under Linux</A>
|
|
</UL>
|
|
<P>
|
|
<H2><A NAME="toc5">5.</A> <A HREF="#s5">Linux-Hosted Attached Processors</A></H2>
|
|
|
|
<UL>
|
|
<LI><A HREF="#ss5.1">5.1 A Linux PC Is A Good Host</A>
|
|
<LI><A HREF="#ss5.2">5.2 Did You DSP That?</A>
|
|
<LI><A HREF="#ss5.3">5.3 FPGAs And Reconfigurable Logic Computing</A>
|
|
</UL>
|
|
<P>
|
|
<H2><A NAME="toc6">6.</A> <A HREF="#s6">Of General Interest</A></H2>
|
|
|
|
<UL>
|
|
<LI><A HREF="#ss6.1">6.1 Programming Languages And Compilers</A>
|
|
<LI><A HREF="#ss6.2">6.2 Performance Issues</A>
|
|
<LI><A HREF="#ss6.3">6.3 Conclusion - It's Out There</A>
|
|
</UL>
|
|
<HR>
|
|
<H2><A NAME="s1">1.</A> <A HREF="#toc1">Introduction</A></H2>
|
|
|
|
<P>
|
|
<P><B>Parallel Processing</B> refers to the concept of speeding-up the
|
|
execution of a program by dividing the program into multiple fragments
|
|
that can execute simultaneously, each on its own processor. A program
|
|
being executed across <EM>n</EM> processors might execute <EM>n</EM>
|
|
times faster than it would using a single processor.
|
|
<P>
|
|
<P>
|
|
<P>Traditionally, multiple processors were provided within a specially
|
|
designed "parallel computer"; along these lines, Linux now supports
|
|
<B>SMP</B> systems (often sold as "servers") in which multiple
|
|
processors share a single memory and bus interface within a single
|
|
computer. It is also possible for a group of computers (for example,
|
|
a group of PCs each running Linux) to be interconnected by a network
|
|
to form a parallel-processing <B>cluster</B>. The third alternative
|
|
for parallel computing using Linux is to use the <B>multimedia
|
|
instruction extensions</B> (i.e., MMX) to operate in parallel on
|
|
vectors of integer data. Finally, it is also possible to use a Linux
|
|
system as a "host" for a specialized <B>attached</B> parallel
|
|
processing compute engine. All these approaches are discussed in
|
|
detail in this document.
|
|
<P>
|
|
<H2><A NAME="ss1.1">1.1 Is Parallel Processing What I Want?</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Although use of multiple processors can speed-up many operations, most
|
|
applications cannot yet benefit from parallel processing. Basically,
|
|
parallel processing is appropriate only if:
|
|
<P>
|
|
<UL>
|
|
<LI>Your application has enough parallelism to make good use of
|
|
multiple processors. In part, this is a matter of identifying
|
|
portions of the program that can execute independently and
|
|
simultaneously on separate processors, but you will also find that
|
|
some things that <EM>could</EM> execute in parallel might actually slow
|
|
execution if executed in parallel using a particular system. For
|
|
example, a program that takes four seconds to execute within a single
|
|
machine might be able to execute in only one second of processor time
|
|
on each of four machines, but no speedup would be achieved if it took
|
|
three seconds or more for these machines to coordinate their actions.
|
|
</LI>
|
|
<LI>Either the particular application program you are interested in
|
|
already has been <B>parallelized</B> (rewritten to take advantage of
|
|
parallel processing) or you are willing to do at least some new coding
|
|
to take advantage of parallel processing.
|
|
</LI>
|
|
<LI>You are interested in researching, or at least becoming familiar
|
|
with, issues involving parallel processing. Parallel processing using
|
|
Linux systems isn't necessarily difficult, but it is not familiar to
|
|
most computer users, and there isn't any book called "Parallel
|
|
Processing for Dummies"... at least not yet. This HOWTO is a good
|
|
starting point, not all you need to know.</LI>
|
|
</UL>
|
|
<P>
|
|
<P>The good news is that if all the above are true, you'll find that
|
|
parallel processing using Linux can yield supercomputer performance
|
|
for some programs that perform complex computations or operate on
|
|
large data sets. What's more, it can do that using cheap hardware...
|
|
which you might already own. As an added bonus, it is also easy to
|
|
use a parallel Linux system for other things when it is not busy
|
|
executing a parallel job.
|
|
<P>If parallel processing is <EM>not</EM> what you want, but you would
|
|
like to achieve at least a modest improvement in performance, there are
|
|
still things you can do. For example, you can improve performance of
|
|
sequential programs by moving to a faster processor, adding memory,
|
|
replacing an IDE disk with fast wide SCSI, etc. If that's all you are
|
|
interested in, jump to section 6.2; otherwise, read on.
|
|
<P>
|
|
<H2><A NAME="ss1.2">1.2 Terminology</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Although parallel processing has been used for many years in many
|
|
systems, it is still somewhat unfamiliar to most computer users.
|
|
Thus, before discussing the various alternatives, it is important to
|
|
become familiar with a few commonly used terms.
|
|
<P>
|
|
<DL>
|
|
<DT><B>SIMD:</B><DD><P>SIMD (Single Instruction stream, Multiple Data stream) refers to a
|
|
parallel execution model in which all processors execute the same
|
|
operation at the same time, but each processor is allowed to operate
|
|
upon its own data. This model naturally fits the concept of
|
|
performing the same operation on every element of an array, and is
|
|
thus often associated with vector or array manipulation. Because all
|
|
operations are inherently synchronized, interactions among SIMD
|
|
processors tend to be easily and efficiently implemented.
|
|
<P>
|
|
<DT><B>MIMD:</B><DD><P>MIMD (Multiple Instruction stream, Multiple Data stream) refers to a
|
|
parallel execution model in which each processor is essentially acting
|
|
independently. This model most naturally fits the concept of
|
|
decomposing a program for parallel execution on a functional basis;
|
|
for example, one processor might update a database file while another
|
|
processor generates a graphic display of the new entry. This is a
|
|
more flexible model than SIMD execution, but it is achieved at the
|
|
risk of debugging nightmares called <B>race conditions</B>, in which
|
|
a program may intermittently fail due to timing variations reordering
|
|
the operations of one processor relative to those of another.
|
|
<P>
|
|
<DT><B>SPMD:</B><DD><P>SPMD (Single Program, Multiple Data) is a restricted version of MIMD
|
|
in which all processors are running the same program. Unlike SIMD,
|
|
each processor executing SPMD code may take a different control flow
|
|
path through the program.
|
|
<P>
|
|
<DT><B>Communication Bandwidth:</B><DD><P>The bandwidth of a communication system is the maximum amount of data
|
|
that can be transmitted in a unit of time... once data transmission
|
|
has begun. Bandwidth for serial connections is often measured in
|
|
<B>baud</B> or <B>bits/second (b/s)</B>, which generally
|
|
correspond to 1/10 to 1/8 that many <B>Bytes/second (B/s)</B>. For
|
|
example, a 1,200 baud modem transfers about 120 B/s, whereas a 155
|
|
Mb/s ATM network connection is nearly 130,000 times faster,
|
|
transferring about about 17 MB/s. High bandwidth allows large blocks
|
|
of data to be transferred efficiently between processors.
|
|
<P>
|
|
<DT><B>Communication Latency:</B><DD><P>The latency of a communication system is the minimum time taken to
|
|
transmit one object, including any send and receive software
|
|
overhead. Latency is very important in parallel processing because it
|
|
determines the minimum useful <B>grain size</B>, the minimum run
|
|
time for a segment of code to yield speed-up through parallel
|
|
execution. Basically, if a segment of code runs for less time than it
|
|
takes to transmit its result value (i.e., latency), executing that
|
|
code segment serially on the processor that needed the result value
|
|
would be faster than parallel execution; serial execution would avoid
|
|
the communication overhead.
|
|
<P>
|
|
<DT><B>Message Passing:</B><DD><P>Message passing is a model for interactions between processors within
|
|
a parallel system. In general, a message is constructed by software
|
|
on one processor and is sent through an interconnection network to
|
|
another processor, which then must accept and act upon the message
|
|
contents. Although the overhead in handling each message (latency)
|
|
may be high, there are typically few restrictions on how much
|
|
information each message may contain. Thus, message passing can yield
|
|
high bandwidth making it a very effective way to transmit a large
|
|
block of data from one processor to another. However, to minimize the
|
|
need for expensive message passing operations, data structures within
|
|
a parallel program must be spread across the processors so that most
|
|
data referenced by each processor is in its local memory... this task
|
|
is known as <B>data layout</B>.
|
|
<P>
|
|
<DT><B>Shared Memory:</B><DD><P>Shared memory is a model for interactions between processors within a
|
|
parallel system. Systems like the multi-processor Pentium machines
|
|
running Linux <B>physically</B> share a single memory among
|
|
their processors, so that a value written to shared memory by one
|
|
processor can be directly accessed by any processor. Alternatively,
|
|
<B>logically</B> shared memory can be implemented for
|
|
systems in which each processor has it own memory by converting each
|
|
non-local memory reference into an appropriate inter-processor
|
|
communication. Either implementation of shared memory is generally
|
|
considered easier to use than message passing. Physically shared
|
|
memory can have both high bandwidth and low latency, but only when
|
|
multiple processors do not try to access the bus simultaneously; thus,
|
|
data layout still can seriously impact performance, and cache effects,
|
|
etc., can make it difficult to determine what the best layout is.
|
|
<P>
|
|
<DT><B>Aggregate Functions:</B><DD><P>In both the message passing and shared memory models, a communication
|
|
is initiated by a single processor; in contrast, aggregate function
|
|
communication is an inherently parallel communication model in which
|
|
an entire group of processors act together. The simplest such action
|
|
is a <B>barrier synchronization</B>, in which each individual
|
|
processor waits until every processor in the group has arrived at the
|
|
barrier. By having each processor output a datum as a side-effect of
|
|
reaching a barrier, it is possible to have the communication hardware
|
|
return a value to each processor which is an arbitrary function of the
|
|
values collected from all processors. For example, the return value
|
|
might be the answer to the question "did any processor find a
|
|
solution?" or it might be the sum of one value from each processor.
|
|
Latency can be very low, but bandwidth per processor also tends to be
|
|
low. Traditionally, this model is used primarily to control parallel
|
|
execution rather than to distribute data values.
|
|
<P>
|
|
<DT><B>Collective Communication:</B><DD><P>This is another name for aggregate functions, most often used when
|
|
referring to aggregate functions that are constructed using multiple
|
|
message-passing operations.
|
|
<P>
|
|
<DT><B>SMP:</B><DD><P>SMP (Symmetric Multi-Processor) refers to the operating system concept
|
|
of a group of processors working together as peers, so that any piece
|
|
of work could be done equally well by any processor. Typically, SMP
|
|
implies the combination of MIMD and shared memory. In the IA32 world,
|
|
SMP generally means compliant with MPS (the Intel MultiProcessor
|
|
Specification); in the future, it may mean "Slot 2"....
|
|
<P>
|
|
<DT><B>SWAR:</B><DD><P>SWAR (SIMD Within A Register) is a generic term for the concept of
|
|
partitioning a register into multiple integer fields and using
|
|
register-width operations to perform SIMD-parallel computations across
|
|
those fields. Given a machine with <EM>k</EM>-bit registers, data
|
|
paths, and function units, it has long been known that ordinary
|
|
register operations can function as SIMD parallel operations on as
|
|
many as <EM>n</EM>, <EM>k</EM>/<EM>n</EM>-bit, field values. Although
|
|
this type of parallelism can be implemented using ordinary integer
|
|
registers and instructions, many high-end microprocessors have
|
|
recently added specialized instructions to enhance the performance of
|
|
this technique for multimedia-oriented tasks. In addition to the
|
|
Intel/AMD/Cyrix <B>MMX</B> (MultiMedia eXtensions), there are:
|
|
Digital Alpha <B>MAX</B> (MultimediA eXtensions), Hewlett-Packard
|
|
PA-RISC <B>MAX</B> (Multimedia Acceleration eXtensions), MIPS
|
|
<B>MDMX</B> (Digital Media eXtension, pronounced "Mad Max"), and Sun
|
|
SPARC V9 <B>VIS</B> (Visual Instruction Set). Aside from the three
|
|
vendors who have agreed on MMX, all of these instruction set
|
|
extensions are roughly comparable, but mutually incompatible.
|
|
<P>
|
|
<DT><B>Attached Processors:</B><DD><P>Attached processors are essentially special-purpose computers that are
|
|
connected to a <B>host</B> system to accelerate specific types of
|
|
computation. For example, many video and audio cards for PCs contain
|
|
attached processors designed, respectively, to accelerate common
|
|
graphics operations and audio <B>DSP</B> (Digital Signal
|
|
Processing). There is also a wide range of attached <B>array
|
|
processors</B>, so called because they are designed to accelerate
|
|
arithmetic operations on arrays. In fact, many commercial
|
|
supercomputers are really attached processors with workstation hosts.
|
|
<P>
|
|
<DT><B>RAID:</B><DD><P>RAID (Redundant Array of Inexpensive Disks) is a simple technology for
|
|
increasing both the bandwidth and reliability of disk I/O. Although
|
|
there are many different variations, all have two key concepts in
|
|
common. First, each data block is <B>striped</B> across a group of
|
|
<EM>n+k</EM> disk drives such that each drive only has to read or
|
|
write 1/<EM>n</EM> of the data... yielding <EM>n</EM> times the
|
|
bandwidth of one drive. Second, redundant data is written so that
|
|
data can be recovered if a disk drive fails; this is important because
|
|
otherwise if any one of the <EM>n+k</EM> drives were to fail, the
|
|
entire file system could be lost. A good overview of RAID in general
|
|
is given at
|
|
<A HREF="http://www.uni-mainz.de/~neuffer/scsi/what_is_raid.html">http://www.uni-mainz.de/~neuffer/scsi/what_is_raid.html</A>, and
|
|
information about RAID options for Linux systems is at
|
|
<A HREF="http://linas.org/linux/raid.html">http://linas.org/linux/raid.html</A>. Aside from specialized RAID
|
|
hardware support, Linux also supports software RAID 0, 1, 4, and 5
|
|
across multiple disks hosted by a single Linux system; see the
|
|
Software RAID mini-HOWTO and the Multi-Disk System Tuning mini-HOWTO
|
|
for details. RAID across disk drives <EM>on multiple machines in a
|
|
cluster</EM> is not directly supported.
|
|
<P>
|
|
<DT><B>IA32:</B><DD><P>IA32 (Intel Architecture, 32-bit) really has nothing to do with
|
|
parallel processing, but rather refers to the class of processors whose
|
|
instruction sets are generally compatible with that of the Intel 386.
|
|
Basically, any Intel x86 processor after the 286 is compatible with
|
|
the 32-bit flat memory model that characterizes IA32. AMD and Cyrix
|
|
also make a multitude of IA32-compatible processors. Because Linux
|
|
evolved primarily on IA32 processors and that is where the commodity
|
|
market is centered, it is convenient to use IA32 to distinguish any of
|
|
these processors from the PowerPC, Alpha, PA-RISC, MIPS, SPARC, etc.
|
|
The upcoming IA64 (64-bit with EPIC, Explicitly Parallel Instruction
|
|
Computing) will certainly complicate matters, but Merced, the first
|
|
IA64 processor, is not scheduled for production until 1999.
|
|
<P>
|
|
<DT><B>COTS:</B><DD><P>Since the demise of many parallel supercomputer companies, COTS
|
|
(Commercial Off-The-Shelf) is commonly discussed as a requirement for
|
|
parallel computing systems. Being fanatically pure, the only COTS
|
|
parallel processing techniques using PCs are things like SMP Windows
|
|
NT servers and various MMX Windows applications; it really doesn't pay
|
|
to be that fanatical. The underlying concept of COTS is really
|
|
minimization of development time and cost. Thus, a more useful, more
|
|
common, meaning of COTS is that at least most subsystems benefit from
|
|
commodity marketing, but other technologies are used where they are
|
|
effective. Most often, COTS parallel processing refers to a cluster
|
|
in which the nodes are commodity PCs, but the network interface and
|
|
software are somewhat customized... typically running Linux and
|
|
applications codes that are freely available (e.g., copyleft or public
|
|
domain), but not literally COTS.
|
|
</DL>
|
|
<H2><A NAME="ss1.3">1.3 Example Algorithm</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>In order to better understand the use of the various parallel
|
|
programming approaches outlined in this HOWTO, it is useful to have an
|
|
example problem. Although just about any simple parallel algorithm
|
|
would do, by selecting an algorithm that has been used to demonstrate
|
|
various other parallel programming systems, it becomes a bit easier to
|
|
compare and contrast approaches. M. J. Quinn's book, <EM>Parallel
|
|
Computing Theory And Practice</EM>, second edition, McGraw Hill, New
|
|
York, 1994, uses a parallel algorithm that computes the value of Pi to
|
|
demonstrate a variety of different parallel supercomputer programming
|
|
environments (e.g., nCUBE message passing, Sequent shared memory). In
|
|
this HOWTO, we use the same basic algorithm.
|
|
<P>The algorithm computes the approximate value of Pi by summing the area
|
|
under <EM>x</EM> squared. As a purely sequential C program, the
|
|
algorithm looks like:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
#include <stdlib.h>;
|
|
#include <stdio.h>;
|
|
|
|
main(int argc, char **argv)
|
|
{
|
|
register double width, sum;
|
|
register int intervals, i;
|
|
|
|
/* get the number of intervals */
|
|
intervals = atoi(argv[1]);
|
|
width = 1.0 / intervals;
|
|
|
|
/* do the computation */
|
|
sum = 0;
|
|
for (i=0; i<intervals; ++i) {
|
|
register double x = (i + 0.5) * width;
|
|
sum += 4.0 / (1.0 + x * x);
|
|
}
|
|
sum *= width;
|
|
|
|
printf("Estimation of pi is %f\n", sum);
|
|
|
|
return(0);
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>However, this sequential algorithm easily yields an "embarrassingly
|
|
parallel" implementation. The area is subdivided into intervals, and
|
|
any number of processors can each independently sum the intervals
|
|
assigned to it, with no need for interaction between processors. Once
|
|
the local sums have been computed, they are added together to create a
|
|
global sum; this step requires some level of coordination and
|
|
communication between processors. Finally, this global sum is printed
|
|
by one processor as the approximate value of Pi.
|
|
<P>In this HOWTO, the various parallel implementations of this algorithm
|
|
appear where each of the different programming methods is discussed.
|
|
<P>
|
|
<H2><A NAME="ss1.4">1.4 Organization Of This Document</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>The remainder of this document is divided into five parts. Sections
|
|
2, 3, 4, and 5 correspond to the three different types of hardware
|
|
configurations supporting parallel processing using Linux:
|
|
<P>
|
|
<UL>
|
|
<LI>Section 2 discusses SMP Linux systems. These directly support
|
|
MIMD execution using shared memory, although message passing also is
|
|
implemented easily. Although Linux supports SMP configurations up to
|
|
16 processors, most SMP PC systems have either two or four identical
|
|
processors.
|
|
</LI>
|
|
<LI>Section 3 discusses clusters of networked machines, each running
|
|
Linux. A cluster can be used as a parallel processing system that
|
|
directly supports MIMD execution and message passing, perhaps also
|
|
providing logically shared memory. Simulated SIMD execution and
|
|
aggregate function communication also can be supported, depending on
|
|
the networking method used. The number of processors in a cluster can
|
|
range from two to thousands, primarily limited by the physical wiring
|
|
constraints of the network. In some cases, various types of machines
|
|
can be mixed within a cluster; for example, a network combining DEC
|
|
Alpha and Pentium Linux systems would be a <B>heterogeneous
|
|
cluster</B>.
|
|
</LI>
|
|
<LI>Section 4 discusses SWAR, SIMD Within A Register. This is a
|
|
very restrictive type of parallel execution model, but on the other
|
|
hand, it is a built-in capability of ordinary processors. Recently,
|
|
MMX (and other) instruction set extensions to modern processors have
|
|
made this approach even more effective.
|
|
</LI>
|
|
<LI>Section 5 discusses the use of Linux PCs as hosts for simple
|
|
parallel computing systems. Either as an add-in card or as an
|
|
external box, attached processors can provide a Linux system with
|
|
formidable processing power for specific types of applications. For
|
|
example, inexpensive ISA cards are available that provide multiple DSP
|
|
processors offering hundreds of MFLOPS for compute-bound problems.
|
|
However, these add-in boards are <EM>just</EM> processors; they
|
|
generally do not run an OS, have disk or console I/O capability, etc.
|
|
To make such systems useful, the Linux "host" must provide these
|
|
functions.</LI>
|
|
</UL>
|
|
<P>
|
|
<P>The final section of this document covers aspects that are of general
|
|
interest for parallel processing using Linux, not specific to a
|
|
particular one of the approaches listed above.
|
|
<P>As you read this document, keep in mind that we haven't tested
|
|
everything, and a lot of stuff reported here "still has a research
|
|
character" (a nice way to say "doesn't quite work like it should" ;-).
|
|
However, parallel processing using Linux is useful now, and an
|
|
increasingly large group is working to make it better.
|
|
<P>The author of this HOWTO is Hank Dietz, Ph.D., currently Professor &
|
|
James F. Hardymon Chair in Networking at the
|
|
University of Kentucky, Electrical & Computer Engineering Dept in
|
|
Lexington, KY, 40506-0046.
|
|
Dietz retains rights to this
|
|
document as per the Linux Documentation Project guidelines. Although
|
|
an effort has been made to ensure the correctness and fairness of this
|
|
presentation, neither Dietz nor University of Kentucky can be held
|
|
responsible for any problems or errors, and University of Kentucky does not
|
|
endorse any of the work/products discussed.
|
|
<P>
|
|
<HR>
|
|
<H2><A NAME="s2">2.</A> <A HREF="#toc2">SMP Linux</A></H2>
|
|
|
|
<P>
|
|
<P>This document gives a brief overview of how to use
|
|
<A HREF="http://www.linux.org.uk/SMP/title.html">SMP Linux</A> systems
|
|
for parallel processing. The most up-to-date information on SMP Linux
|
|
is probably available via the SMP Linux project mailing list; send
|
|
email to
|
|
<A HREF="mailto:majordomo@vger.rutgers.edu">majordomo@vger.rutgers.edu</A> with the text <CODE>subscribe
|
|
linux-smp</CODE> to join the list.
|
|
<P>Does SMP Linux really work? In June 1996, I purchased a brand new
|
|
(well, new off-brand ;-) two-processor 100MHz Pentium system. The
|
|
fully assembled system, including both processors, Asus motherboard,
|
|
256K cache, 32M RAM, 1.6G disk, 6X CDROM, Stealth 64, and 15" Acer
|
|
monitor, cost a total of $1,800. This was just a few hundred
|
|
dollars more than a comparable uniprocessor system. Getting SMP Linux
|
|
running was simply a matter of installing the "stock" uniprocessor
|
|
Linux, recompiling the kernel with the <CODE>SMP=1</CODE> line in the
|
|
makefile uncommented (although I find setting <CODE>SMP</CODE> to
|
|
<CODE>1</CODE> a bit ironic ;-), and informing <CODE>lilo</CODE> about the new
|
|
kernel. This system performs well enough, and has been stable enough,
|
|
to serve as my primary workstation ever since. In summary, SMP Linux
|
|
really does work.
|
|
<P>The next question is how much high-level support is available for
|
|
writing and executing shared memory parallel programs under SMP Linux.
|
|
Through early 1996, there wasn't much. Things have changed. For
|
|
example, there is now a very complete POSIX threads library.
|
|
<P>Although performance may be lower than for native shared-memory
|
|
mechanisms, an SMP Linux system also can use most parallel processing
|
|
software that was originally developed for a workstation cluster using
|
|
socket communication. Sockets (see section 3.3) work within an SMP
|
|
Linux system, and even for multiple SMPs networked as a cluster.
|
|
However, sockets imply a lot of unnecessary overhead for an SMP. Much
|
|
of that overhead is within the kernel or interrupt handlers; this
|
|
worsens the problem because SMP Linux generally allows only one
|
|
processor to be in the kernel at a time and the interrupt controller
|
|
is set so that only the boot processor can process interrupts.
|
|
Despite this, typical SMP communication hardware is so much better
|
|
than most cluster networks that cluster software will often run better
|
|
on an SMP than on the cluster for which it was designed.
|
|
<P>The remainder of this section discusses SMP hardware, reviews the
|
|
basic Linux mechanisms for sharing memory across the processes of a
|
|
parallel program, makes a few observations about atomicity, volatility,
|
|
locks, and cache lines, and finally gives some pointers to other
|
|
shared memory parallel processing resources.
|
|
<P>
|
|
<H2><A NAME="ss2.1">2.1 SMP Hardware</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Although SMP systems have been around for many years, until very
|
|
recently, each such machine tended to implement basic functions
|
|
differently enough so that operating system support was not portable.
|
|
The thing that has changed this situation is Intel's Multiprocessor
|
|
Specification, often referred to as simply <B>MPS</B>. The MPS 1.4
|
|
specification is currently available as a PDF file at
|
|
<A HREF="http://www.intel.com/design/pro/datashts/242016.htm">http://www.intel.com/design/pro/datashts/242016.htm</A>, and there
|
|
is a brief overview of MPS 1.1 at
|
|
<A HREF="http://support.intel.com/oem_developer/ial/support/9300.HTM">http://support.intel.com/oem_developer/ial/support/9300.HTM</A>,
|
|
but be aware that Intel does re-arrange their WWW site often. A wide
|
|
range of
|
|
<A HREF="http://www.uruk.org/~erich/mps-hw.html">vendors</A> are building MPS-compliant systems supporting up to
|
|
four processors, but MPS theoretically allows many more processors.
|
|
<P>The only non-MPS, non-IA32, systems supported by SMP Linux are Sun4m
|
|
multiprocessor SPARC machines. SMP Linux supports most Intel MPS
|
|
version 1.1 or 1.4 compliant machines with up to sixteen 486DX,
|
|
Pentium, Pentium MMX, Pentium Pro, or Pentium II processors.
|
|
Unsupported IA32 processors include the Intel 386, Intel 486SX/SLC
|
|
processors (the lack of floating point hardware interferes with the
|
|
SMP mechanisms), and AMD & Cyrix processors (they require
|
|
different SMP support chips that do not seem to be available at this
|
|
writing).
|
|
<P>It is important to understand that the performance of MPS-compliant
|
|
systems can vary widely. As expected, one cause for performance
|
|
differences is processor speed: faster clock speeds tend to yield
|
|
faster systems, and a Pentium Pro processor is faster than a Pentium.
|
|
However, MPS does not really specify how hardware implements shared
|
|
memory, but only how that implementation must function from a software
|
|
point of view; this means that performance is also a function of how
|
|
the shared memory implementation interacts with the characteristics of
|
|
SMP Linux and your particular programs.
|
|
<P>The primary way in which systems that comply with MPS differ is in how
|
|
they implement access to physically shared memory.
|
|
<P>
|
|
<H3>Does each processor have its own L2 cache?</H3>
|
|
|
|
<P>
|
|
<P>Some MPS Pentium systems, and all MPS Pentium Pro and Pentium II
|
|
systems, have independent L2 caches. (The L2 cache is packaged within
|
|
the Pentium Pro or Pentium II modules.) Separate L2 caches are
|
|
generally viewed as maximizing compute performance, but things are not
|
|
quite so obvious under Linux. The primary complication is that the
|
|
current SMP Linux scheduler does not attempt to keep each process on
|
|
the same processor, a concept known as <B>processor affinity</B>.
|
|
This may change soon; there has recently been some discussion about
|
|
this in the SMP Linux development community under the title "processor
|
|
binding." Without processor affinity, having separate L2 caches may
|
|
introduce significant overhead when a process is given a timeslice on
|
|
a processor other than the one that was executing it last.
|
|
<P>Many relatively inexpensive systems are organized so that two Pentium
|
|
processors share a single L2 cache. The bad news is that this causes
|
|
contention for the cache, seriously degrading performance when running
|
|
multiple independent sequential programs. The good news is that many
|
|
parallel programs might actually benefit from the shared cache because
|
|
if both processors will want to access the same line from shared
|
|
memory, only one had to fetch it into cache and contention for the bus
|
|
is averted. The lack of processor affinity also causes less damage
|
|
with a shared L2 cache. Thus, for parallel programs, it isn't really
|
|
clear that sharing L2 cache is as harmful as one might expect.
|
|
<P>Experience with our dual Pentium shared 256K cache system shows quite
|
|
a wide range of performance depending on the level of kernel activity
|
|
required. At worst, we see only about 1.2x speedup. However, we also
|
|
have seen up to 2.1x speedup, which suggests that compute-intensive
|
|
SPMD-style code really does profit from the "shared fetch" effect.
|
|
<P>
|
|
<H3>Bus configuration?</H3>
|
|
|
|
<P>
|
|
<P>The first thing to say is that most modern systems connect the
|
|
processors to one or more PCI buses that in turn are "bridged" to one
|
|
or more ISA/EISA buses. These bridges add latency, and both EISA and
|
|
ISA generally offer lower bandwidth than PCI (ISA being the lowest), so
|
|
disk drives, video cards, and other high-performance devices generally
|
|
should be connected via a PCI bus interface.
|
|
<P>Although an MPS system can achieve good speed-up for many
|
|
compute-intensive parallel programs even if there is only one PCI bus,
|
|
I/O operations occur at no better than uniprocessor performance...
|
|
and probably a little worse due to bus contention from the
|
|
processors. Thus, if you are looking to speed-up I/O, make sure that
|
|
you get an MPS system with multiple independent PCI busses and I/O
|
|
controllers (e.g., multiple SCSI chains). You will need to be careful
|
|
to make sure SMP Linux supports what you get. Also keep in mind that
|
|
the current SMP Linux essentially allows only one processor in the
|
|
kernel at any time, so you should choose your I/O controllers
|
|
carefully to pick ones that minimize the kernel time required for each
|
|
I/O operation. For really high performance, you might even consider
|
|
doing raw device I/O directly from user processes, without a system
|
|
call... this isn't necessarily as hard as it sounds, and need not
|
|
compromise security (see section 3.3 for a description of the basic
|
|
techniques).
|
|
<P>It is important to note that the relationship between bus speed and
|
|
processor clock rate has become very fuzzy over the past few years.
|
|
Although most systems now use the same PCI clock rate, it is not
|
|
uncommon to find a faster processor clock paired with a slower bus
|
|
clock. The classic example of this was that the Pentium 133 generally
|
|
used a faster bus than a Pentium 150, with appropriately
|
|
strange-looking performance on various benchmarks. These effects are
|
|
amplified in SMP systems; it is even more important to have a faster
|
|
bus clock.
|
|
<P>
|
|
<H3>Memory interleaving and DRAM technologies?</H3>
|
|
|
|
<P>
|
|
<P>Memory interleaving actually has nothing whatsoever to do with MPS,
|
|
but you will often see it mentioned for MPS systems because these
|
|
systems are typically more demanding of memory bandwidth. Basically,
|
|
two-way or four-way interleaving organizes RAM so that a block access
|
|
is accomplished using multiple banks of RAM rather than just one.
|
|
This provides higher memory access bandwidth, particularly for cache
|
|
line loads and stores.
|
|
<P>The waters are a bit muddied about this, however, because EDO DRAM and
|
|
various other memory technologies tend to improve similar kinds of
|
|
operations. An excellent overview of DRAM technologies is given in
|
|
<A HREF="http://www.pcguide.com/ref/ram/tech.htm">http://www.pcguide.com/ref/ram/tech.htm</A>.
|
|
<P>So, for example, is it better to have 2-way interleaved EDO DRAM or
|
|
non-interleaved SDRAM? That is a very good question with no simple
|
|
answer, because both interleaving and exotic DRAM technologies tend to
|
|
be expensive. The same dollar investment in more ordinary memory
|
|
configurations generally will give you a significantly larger main
|
|
memory. Even the slowest DRAM is still a heck of a lot faster than
|
|
using disk-based virtual memory....
|
|
<P>
|
|
<H2><A NAME="ss2.2">2.2 Introduction To Shared Memory Programming</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Ok, so you have decided that parallel processing on an SMP is a great
|
|
thing to do... how do you get started? Well, the first step is to
|
|
learn a little bit about how shared memory communication really works.
|
|
<P>It sounds like you simply have one processor store a value into memory
|
|
and another processor load it; unfortunately, it isn't quite that
|
|
simple. For example, the relationship between processes and
|
|
processors is very blurry; however, if we have no more active
|
|
processes than there are processors, the terms are roughly
|
|
interchangeable. The remainder of this section briefly summarizes the
|
|
key issues that could cause serious problems, if you were not aware of
|
|
them: the two different models used to determine what is shared,
|
|
atomicity issues, the concept of volatility, hardware lock
|
|
instructions, cache line effects, and Linux scheduler issues.
|
|
<P>
|
|
<H3>Shared Everything Vs. Shared Something</H3>
|
|
|
|
<P>
|
|
<P>There are two fundamentally different models commonly used for shared
|
|
memory programming: <B>shared everything</B> and <B>shared
|
|
something</B>. Both of these models allow processors to communicate
|
|
by loads and stores from/into shared memory; the distinction comes in
|
|
the fact that shared everything places all data structures in shared
|
|
memory, while shared something requires the user to explicitly
|
|
indicate which data structures are potentially shared and which are
|
|
<B>private</B> to a single processor.
|
|
<P>Which shared memory model should you use? That is mostly a question of
|
|
religion. A lot of people like the shared everything model because
|
|
they do not really need to identify which data structures should be
|
|
shared at the time they are declared... you simply put locks around
|
|
potentially-conflicting accesses to shared objects to ensure that only
|
|
one process(or) has access at any moment. Then again, that really
|
|
isn't all that simple... so many people prefer the relative safety of
|
|
shared something.
|
|
<P>
|
|
<H3>Shared Everything</H3>
|
|
|
|
<P>
|
|
<P>The nice thing about sharing everything is that you can easily take an
|
|
existing sequential program and incrementally convert it into a shared
|
|
everything parallel program. You do not have to first determine which
|
|
data need to be accessible by other processors.
|
|
<P>Put simply, the primary problem with sharing everything is that any
|
|
action taken by one processor could affect the other processors. This
|
|
problem surfaces in two ways:
|
|
<P>
|
|
<UL>
|
|
<LI>Many libraries use data structures that simply are not
|
|
sharable. For example, the UNIX convention is that most functions can
|
|
return an error code in a variable called <CODE>errno</CODE>; if two shared
|
|
everything processes perform various calls, they would interfere with
|
|
each other because they share the same <CODE>errno</CODE>. Although there
|
|
is now a library version that fixes the <CODE>errno</CODE> problem,
|
|
similar problems still exist in most libraries. For example, unless
|
|
special precautions are taken, the X library will not work if calls
|
|
are made from multiple shared everything processes.
|
|
</LI>
|
|
<LI>Normally, the worst-case behavior for a program with a bad
|
|
pointer or array subscript is that the process that contains the
|
|
offending code dies. It might even generate a <CODE>core</CODE> file that
|
|
clues you in to what happened. In shared everything parallel
|
|
processing, it is very likely that the stray accesses will bring the
|
|
demise of <EM>a process other than the one at fault</EM>, making it
|
|
nearly impossible to localize and correct the error.</LI>
|
|
</UL>
|
|
<P>Neither of these types of problems is common when shared something is
|
|
used, because only the explicitly-marked data structures are shared.
|
|
It also is fairly obvious that shared everything only works if all
|
|
processors are executing the exact same memory image; you cannot use
|
|
shared everything across multiple different code images (i.e., can use
|
|
only SPMD, not general MIMD).
|
|
<P>The most common type of shared everything programming support is a
|
|
<B>threads library</B>.
|
|
<A HREF="http://liinwww.ira.uka.de/bibliography/Os/threads.html">Threads</A> are essentially "light-weight" processes that might
|
|
not be scheduled in the same way as regular UNIX processes and, most
|
|
importantly, share access to a single memory map. The POSIX
|
|
<A HREF="http://www.humanfactor.com/pthreads/mit-pthreads.html">Pthreads</A> package has been the focus of a number of porting
|
|
efforts; the big question is whether any of these ports actually run
|
|
the threads of a program in parallel under SMP Linux (ideally, with a
|
|
processor for each thread). The POSIX API doesn't require it, and
|
|
versions like
|
|
<A HREF="http://www.aa.net/~mtp/PCthreads.html">http://www.aa.net/~mtp/PCthreads.html</A>
|
|
apparently do not implement parallel thread execution - all the threads
|
|
of a program are kept within a single Linux process.
|
|
<P>The first threads library that supported SMP Linux parallelism was the
|
|
now somewhat obsolete bb_threads library,
|
|
<A HREF="ftp://caliban.physics.utoronto.ca/pub/linux/">ftp://caliban.physics.utoronto.ca/pub/linux/</A>, a very small
|
|
library that used the Linux <CODE>clone()</CODE> call to fork new,
|
|
independently scheduled, Linux processes all sharing a single address
|
|
space. SMP Linux machines can run multiple of these "threads" in
|
|
parallel because each "thread" is a full Linux process; the trade-off
|
|
is that you do not get the same "light-weight" scheduling control
|
|
provided by some thread libraries under other operating systems. The
|
|
library used a bit of C-wrapped assembly code to install a new chunk
|
|
of memory as each thread's stack and to provide atomic access
|
|
functions for an array of locks (mutex objects). Documentation
|
|
consisted of a <CODE>README</CODE> and a short sample program.
|
|
<P>More recently, a version of POSIX threads using <CODE>clone()</CODE> has
|
|
been developed. This library,
|
|
<A HREF="http://pauillac.inria.fr/~xleroy/linuxthreads/">LinuxThreads</A>, is clearly the preferred shared everything
|
|
library for use under SMP Linux. POSIX threads are well documented,
|
|
and the
|
|
<A HREF="http://pauillac.inria.fr/~xleroy/linuxthreads/README">LinuxThreads README</A> and
|
|
<A HREF="http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html">LinuxThreads FAQ</A> are very well done. The primary problem now
|
|
is simply that POSIX threads have a lot of details to get right and
|
|
LinuxThreads is still a work in progress. There is also the problem
|
|
that the POSIX thread standard has evolved through the standardization
|
|
process, so you need to be a bit careful not to program for obsolete
|
|
early versions of the standard.
|
|
<P>
|
|
<H3>Shared Something</H3>
|
|
|
|
<P>
|
|
<P>Shared something is really "only share what needs to be shared." This
|
|
approach can work for general MIMD (not just SPMD) provided that care
|
|
is taken for the shared objects to be allocated at the same places in
|
|
each processor's memory map. More importantly, shared something makes
|
|
it easier to predict and tune performance, debug code, etc. The only
|
|
problems are:
|
|
<P>
|
|
<UL>
|
|
<LI>It can be hard to know beforehand what really needs to be shared.
|
|
</LI>
|
|
<LI>The actual allocation of objects in shared memory may be awkward,
|
|
especially for what would have been stack-allocated objects. For
|
|
example, it may be necessary to explicitly allocate shared objects in
|
|
a separate memory segment, requiring separate memory allocation
|
|
routines and introducing extra pointer indirections in each reference.</LI>
|
|
</UL>
|
|
<P>Currently, there are two very similar mechanisms that allow groups of
|
|
Linux processes to have independent memory spaces, all sharing only a
|
|
relatively small memory segment. Assuming that you didn't foolishly
|
|
exclude "System V IPC" when you configured your Linux system, Linux
|
|
supports a very portable mechanism that has generally become known as
|
|
"System V Shared Memory." The other alternative is a memory mapping
|
|
facility whose implementation varies widely across different UNIX
|
|
systems: the <CODE>mmap()</CODE> system call. You can, and should, learn
|
|
about these calls from the manual pages... but a brief overview of
|
|
each is given in sections 2.5 and 2.6 to help get you started.
|
|
<P>
|
|
<H3>Atomicity And Ordering</H3>
|
|
|
|
<P>
|
|
<P>No matter which of the above two models you use, the result is pretty
|
|
much the same: you get a pointer to a chunk of read/write memory that
|
|
is accessible by all processes within your parallel program. Does
|
|
that mean I can just have my parallel program access shared memory
|
|
objects as though they were in ordinary local memory? Well, not
|
|
quite....
|
|
<P><B>Atomicity</B> refers to the concept that an operation on an
|
|
object is accomplished as an indivisible, uninterruptible, sequence.
|
|
Unfortunately, sharing memory access does not imply that all
|
|
operations on data in shared memory occur atomically. Unless special
|
|
precautions are taken, only simple load or store operations that occur
|
|
within a single bus transaction (i.e., aligned 8, 16, or 32-bit
|
|
operations, but not misaligned nor 64-bit operations) are atomic.
|
|
Worse still, "smart" compilers like GCC will often perform
|
|
optimizations that could eliminate the memory operations needed to
|
|
ensure that other processors can see what this processor has done.
|
|
Fortunately, both these problems can be remedied... leaving only the
|
|
relationship between access efficiency and cache line size for us to
|
|
worry about.
|
|
<P>However, before discussing these issues, it is useful to point-out
|
|
that all of this assumes that memory references for each processor
|
|
happen in the order in which they were coded. The Pentium does this,
|
|
but also notes that future Intel processors might not. So, for future
|
|
processors, keep in mind that it may be necessary to surround some
|
|
shared memory accesses with instructions that cause all pending memory
|
|
accesses to complete, thus providing memory access ordering. The
|
|
<CODE>CPUID</CODE> instruction apparently is reserved to have this
|
|
side-effect.
|
|
<P>
|
|
<H3>Volatility</H3>
|
|
|
|
<P>
|
|
<P>To prevent GCC's optimizer from buffering values of shared memory
|
|
objects in registers, all objects in shared memory should be declared
|
|
as having types with the <CODE>volatile</CODE> attribute. If this is
|
|
done, all shared object reads and writes that require just one word
|
|
access will occur atomically. For example, suppose that <EM>p</EM>
|
|
is a pointer to an integer, where both the pointer and the integer it
|
|
will point at are in shared memory; the ANSI C declaration might be:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
volatile int * volatile p;
|
|
</PRE>
|
|
<HR>
|
|
<P>In this code, the first <CODE>volatile</CODE> refers to the <CODE>int</CODE>
|
|
that <CODE>p</CODE> will eventually point at; the second <CODE>volatile</CODE>
|
|
refers to the pointer itself. Yes, it is annoying, but it is the
|
|
price one pays for enabling GCC to perform some very powerful
|
|
optimizations. At least in theory, the <CODE>-traditional</CODE> option
|
|
to GCC might suffice to produce correct code at the expense of some
|
|
optimization, because pre-ANSI K&R C essentially claimed that all
|
|
variables were volatile unless explicitly declared as
|
|
<CODE>register</CODE>. Still, if your typical GCC compile looks like
|
|
<CODE>cc -O6 <EM>...</EM></CODE>, you really will want to explicitly mark
|
|
things as volatile only where necessary.
|
|
<P>There has been a rumor to the effect that using assembly-language
|
|
locks that are marked as modifying all processor registers will cause
|
|
GCC to appropriately flush all variables, thus avoiding the
|
|
"inefficient" compiled code associated with things declared as
|
|
<CODE>volatile</CODE>. This hack appears to work for statically allocated
|
|
global variables using version 2.7.0 of GCC... however, that behavior
|
|
is <EM>not</EM> required by the ANSI C standard. Still worse, other
|
|
processes that are making only read accesses can buffer the values in
|
|
registers forever, thus <EM>never</EM> noticing that the shared memory
|
|
value has actually changed. In summary, do what you want, but only
|
|
variables accessed through <CODE>volatile</CODE> are <EM>guaranteed</EM>
|
|
to work correctly.
|
|
<P>Note that you can cause a volatile access to an ordinary variable by
|
|
using a type cast that imposes the <CODE>volatile</CODE> attribute. For
|
|
example, the ordinary <CODE>int i;</CODE> can be referenced as a volatile
|
|
by <CODE>*((volatile int *) &i)</CODE>; thus, you can explicitly
|
|
invoke the "overhead" of volatility only where it is critical.
|
|
<P>
|
|
<H3>Locks</H3>
|
|
|
|
<P>
|
|
<P>If you thought that <CODE>++i;</CODE> would always work to add one to a
|
|
variable <CODE>i</CODE> in shared memory, you've got a nasty little
|
|
surprise coming: even if coded as a single instruction, the load and
|
|
store of the result are separate memory transactions, and other
|
|
processors could access <CODE>i</CODE> between these two transactions.
|
|
For example, having two processes both perform <CODE>++i;</CODE> might
|
|
only increment <CODE>i</CODE> by one, rather than by two. According to
|
|
the Intel Pentium "Architecture and Programming Manual," the
|
|
<CODE>LOCK</CODE> prefix can be used to ensure that any of the following
|
|
instructions is atomic relative to the data memory location it
|
|
accesses:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
BTS, BTR, BTC mem, reg/imm
|
|
XCHG reg, mem
|
|
XCHG mem, reg
|
|
ADD, OR, ADC, SBB, AND, SUB, XOR mem, reg/imm
|
|
NOT, NEG, INC, DEC mem
|
|
CMPXCHG, XADD
|
|
</PRE>
|
|
<HR>
|
|
<P>However, it probably is not a good idea to use all these operations.
|
|
For example, <CODE>XADD</CODE> did not even exist for the 386, so coding
|
|
it may cause portability problems.
|
|
<P>The <CODE>XCHG</CODE> instruction <EM>always</EM> asserts a lock, even
|
|
without the <CODE>LOCK</CODE> prefix, and thus is clearly the preferred
|
|
atomic operation from which to build higher-level atomic constructs
|
|
such as semaphores and shared queues. Of course, you can't get GCC to
|
|
generate this instruction just by writing C code... instead, you must
|
|
use a bit of in-line assembly code. Given a word-size volatile object
|
|
<EM>obj</EM> and a word-size register value <EM>reg</EM>, the GCC
|
|
in-line assembly code is:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
__asm__ __volatile__ ("xchgl %1,%0"
|
|
:"=r" (reg), "=m" (obj)
|
|
:"r" (reg), "m" (obj));
|
|
</PRE>
|
|
<HR>
|
|
<P>Examples of GCC in-line assembly code using bit operations for locking
|
|
are given in the source code for the
|
|
<A HREF="ftp://caliban.physics.utoronto.ca/pub/linux/">bb_threads library</A>.
|
|
<P>It is important to remember, however, that there is a cost associated
|
|
with making memory transactions atomic. A locking operation carries a
|
|
fair amount of overhead and may delay memory activity from other
|
|
processors, whereas ordinary references may use local cache. The best
|
|
performance results when locking operations are used as infrequently
|
|
as possible. Further, these IA32 atomic instructions obviously are not
|
|
portable to other systems.
|
|
<P>There are many alternative approaches that allow ordinary instructions
|
|
to be used to implement various synchronizations, including <B>mutual
|
|
exclusion</B> - ensuring that at most one processor is updating a
|
|
given shared object at any moment. Most OS textbooks discuss at least
|
|
one of these techniques. There is a fairly good discussion in the
|
|
Fourth Edition of <EM>Operating System Concepts</EM>, by Abraham
|
|
Silberschatz and Peter B. Galvin, ISBN 0-201-50480-4.
|
|
<P>
|
|
<H3>Cache Line Size</H3>
|
|
|
|
<P>
|
|
<P>One more fundamental atomicity concern can have a dramatic impact on
|
|
SMP performance: cache line size. Although the MPS standard requires
|
|
references to be coherent no matter what caching is used, the fact is
|
|
that when one processor writes to a particular line of memory, every
|
|
cached copy of the old line must be invalidated or updated. This
|
|
implies that if two or more processors are both writing data to
|
|
different portions of the same line a lot of cache and bus traffic may
|
|
result, effectively to pass the line from cache to cache. This problem
|
|
is known as <B>false sharing</B>. The solution is simply to try to
|
|
<EM>organize data so that what is accessed in parallel tends to come
|
|
from a different cache line for each process</EM>.
|
|
<P>You might be thinking that false sharing is not a problem using a
|
|
system with a shared L2 cache, but remember that there are still
|
|
separate L1 caches. Cache organization and number of separate levels
|
|
can both vary, but the Pentium L1 cache line size is 32 bytes and
|
|
typical external cache line sizes are around 256 bytes. Suppose that
|
|
the addresses (physical or virtual) of two items are <EM>a</EM> and
|
|
<EM>b</EM> and that the largest per-processor cache line size is
|
|
<EM>c</EM>, which we assume to be a power of two. To be very precise,
|
|
if <CODE>((int) <EM>a</EM>) & ~(<EM>c</EM> - 1)</CODE> is equal to
|
|
<CODE>((int) <EM>b</EM>) & ~(<EM>c</EM> - 1)</CODE>, then both
|
|
references are in the same cache line. A simpler rule is that if
|
|
shared objects being referenced in parallel are at least <EM>c</EM>
|
|
bytes apart, they should map to different cache lines.
|
|
<P>
|
|
<H3>Linux Scheduler Issues</H3>
|
|
|
|
<P>
|
|
<P>Although the whole point of using shared memory for parallel
|
|
processing is to avoid OS overhead, OS overhead can come from things
|
|
other than communication per se. We have already said that the number
|
|
of processes that should be constructed is less than or equal to the
|
|
number of processors in the machine. But how do you decide exactly
|
|
how many processes to make?
|
|
<P>For best performance, <EM>the number of processes in your parallel
|
|
program should be equal to the expected number of your program's
|
|
processes that simultaneously can be running on different
|
|
processors</EM>. For example, if a four-processor SMP typically has
|
|
one process actively running for some other purpose (e.g., a WWW
|
|
server), then your parallel program should use only three processes.
|
|
You can get a rough idea of how many other processes are active on
|
|
your system by looking at the "load average" quoted by the
|
|
<CODE>uptime</CODE> command.
|
|
<P>Alternatively, you could boost the priority of the processes in your
|
|
parallel program using, for example, the <CODE>renice</CODE> command or
|
|
<CODE>nice()</CODE> system call. You must be privileged to increase
|
|
priority. The idea is simply to force the other processes out of
|
|
processors so that your program can run simultaneously across all
|
|
processors. This can be accomplished somewhat more explicitly using
|
|
the prototype version of SMP Linux at
|
|
<A HREF="http://luz.cs.nmt.edu/~rtlinux/">http://luz.cs.nmt.edu/~rtlinux/</A>, which offers real-time
|
|
schedulers.
|
|
<P>If you are not the only user treating your SMP system as a parallel
|
|
machine, you may also have conflicts between the two or more parallel
|
|
programs trying to execute simultaneously. This standard solution is
|
|
<B>gang scheduling</B> - i.e., manipulating scheduling priority so
|
|
that at any given moment, only the processes of a single parallel
|
|
program are running. It is useful to recall, however, that using more
|
|
parallelism tends to have diminishing returns and scheduler activity
|
|
adds overhead. Thus, for example, it is probably better for a
|
|
four-processor machine to run two programs with two processes each
|
|
rather than gang scheduling between two programs with four processes
|
|
each.
|
|
<P>There is one more twist to this. Suppose that you are developing a
|
|
program on a machine that is heavily used all day, but will be fully
|
|
available for parallel execution at night. You need to write and test
|
|
your code for correctness with the full number of processes, even
|
|
though you know that your daytime test runs will be slow. Well, they
|
|
will be <EM>very</EM> slow if you have processes <B>busy waiting</B>
|
|
for shared memory values to be changed by other processes that are not
|
|
currently running (on other processors). The same problem occurs if
|
|
you develop and test your code on a single-processor system.
|
|
<P>The solution is to embed calls in your code, wherever it may loop
|
|
awaiting an action from another processor, so that Linux will give
|
|
another process a chance to run. I use a C macro, call it
|
|
<CODE>IDLE_ME</CODE>, to do this: for a test run, compile with
|
|
<CODE>cc -DIDLE_ME=usleep(1); ...</CODE>; for a "production" run,
|
|
compile with <CODE>cc -DIDLE_ME={} ...</CODE>. The
|
|
<CODE>usleep(1)</CODE> call requests a 1 microsecond sleep, which has the
|
|
effect of allowing the Linux scheduler to select a different process
|
|
to run on that processor. If the number of processes is more than
|
|
twice the number of processors available, it is not unusual for codes
|
|
to run ten times faster with <CODE>usleep(1)</CODE> calls than without
|
|
them.
|
|
<P>
|
|
<H2><A NAME="ss2.3">2.3 bb_threads</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>The bb_threads ("Bare Bones" threads) library,
|
|
<A HREF="ftp://caliban.physics.utoronto.ca/pub/linux/">ftp://caliban.physics.utoronto.ca/pub/linux/</A>, is a remarkably
|
|
simple library that demonstrates use of the Linux <CODE>clone()</CODE>
|
|
call. The <CODE>gzip tar</CODE> file is only 7K bytes! Although this
|
|
library is essentially made obsolete by the LinuxThreads library
|
|
discussed in section 2.4, bb_threads is still usable, and it is
|
|
small and simple enough to serve well as an introduction to use of
|
|
Linux thread support. Certainly, it is far less daunting to read this
|
|
source code than to browse the source code for LinuxThreads. In
|
|
summary, the bb_threads library is a good starting point, but
|
|
is not really suitable for coding large projects.
|
|
<P>The basic program structure for using the bb_threads library is:
|
|
<P>
|
|
<OL>
|
|
<LI>Start the program running as a single process.
|
|
</LI>
|
|
<LI>You will need to estimate the maximum stack space that will be
|
|
required for each thread. Guessing large is relatively harmless (that
|
|
is what virtual memory is for ;-), but remember that <EM>all</EM> the
|
|
stacks are coming from a single virtual address space, so guessing
|
|
huge is not a great idea. The demo suggests 64K. This size is set to
|
|
<EM>b</EM> bytes by
|
|
<CODE>bb_threads_stacksize(<EM>b</EM>)</CODE>.
|
|
</LI>
|
|
<LI>The next step is to initialize any locks that you will need.
|
|
The lock mechanism built-into this library numbers locks from 0 to
|
|
<CODE>MAX_MUTEXES</CODE>, and initializes lock <EM>i</EM> by
|
|
<CODE>bb_threads_mutexcreate(<EM>i</EM>)</CODE>.
|
|
</LI>
|
|
<LI>Spawning a new thread is done by calling a library routine that
|
|
takes arguments specifying what function the new thread should execute
|
|
and what arguments should be transmitted to it. To start a new thread
|
|
executing the <CODE>void</CODE>-returning function <EM>f</EM> with the
|
|
single argument <EM>arg</EM>, you do something like
|
|
<CODE>bb_threads_newthread(<EM>f</EM>, &arg)</CODE>,
|
|
where <EM>f</EM> should be declared something like <CODE>void
|
|
<EM>f</EM>(void *arg, size_t dummy)</CODE>. If you need to pass
|
|
more than one argument, pass a pointer to a structure initialized to
|
|
hold the argument values.
|
|
</LI>
|
|
<LI>Run parallel code, being careful to use
|
|
<CODE>bb_threads_lock(<EM>n</EM>)</CODE> and
|
|
<CODE>bb_threads_unlock(<EM>n</EM>)</CODE> where <EM>n</EM>
|
|
is an integer identifying which lock to use. Note that the lock and
|
|
unlock operations in this library are very basic spin locks using
|
|
atomic bus-locking instructions, which can cause excessive
|
|
memory-reference interference and do not make any attempt to ensure
|
|
fairness.
|
|
|
|
The demo program packaged with bb_threads did not correctly use
|
|
locks to prevent <CODE>printf()</CODE> from being executed simultaneously
|
|
from within the functions <CODE>fnn</CODE> and <CODE>main</CODE>... and
|
|
because of this, the demo does not always work. I'm not saying this
|
|
to knock the demo, but rather to emphasize that this stuff is <EM>very
|
|
tricky</EM>; also, it is only slightly easier using LinuxThreads.
|
|
</LI>
|
|
<LI>When a thread executes a <CODE>return</CODE>, it actually destroys
|
|
the process... but the local stack memory is not automatically
|
|
deallocated. To be precise, Linux doesn't support deallocation, but
|
|
the memory space is not automatically added back to the
|
|
<CODE>malloc()</CODE> free list. Thus, the parent process should reclaim
|
|
the space for each dead child by
|
|
<CODE>bb_threads_cleanup(wait(NULL))</CODE>.</LI>
|
|
</OL>
|
|
<P>
|
|
<P>The following C program uses the algorithm discussed in section 1.3 to
|
|
compute the approximate value of Pi using two bb_threads
|
|
threads.
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
#include <stdio.h>
|
|
#include <stdlib.h>
|
|
#include <unistd.h>
|
|
#include <sys/types.h>
|
|
#include <sys/wait.h>
|
|
#include "bb_threads.h"
|
|
|
|
volatile double pi = 0.0;
|
|
volatile int intervals;
|
|
volatile int pids[2]; /* Unix PIDs of threads */
|
|
|
|
void
|
|
do_pi(void *data, size_t len)
|
|
{
|
|
register double width, localsum;
|
|
register int i;
|
|
register int iproc = (getpid() != pids[0]);
|
|
|
|
/* set width */
|
|
width = 1.0 / intervals;
|
|
|
|
/* do the local computations */
|
|
localsum = 0;
|
|
for (i=iproc; i<intervals; i+=2) {
|
|
register double x = (i + 0.5) * width;
|
|
localsum += 4.0 / (1.0 + x * x);
|
|
}
|
|
localsum *= width;
|
|
|
|
/* get permission, update pi, and unlock */
|
|
bb_threads_lock(0);
|
|
pi += localsum;
|
|
bb_threads_unlock(0);
|
|
}
|
|
|
|
int
|
|
main(int argc, char **argv)
|
|
{
|
|
/* get the number of intervals */
|
|
intervals = atoi(argv[1]);
|
|
|
|
/* set stack size and create lock... */
|
|
bb_threads_stacksize(65536);
|
|
bb_threads_mutexcreate(0);
|
|
|
|
/* make two threads... */
|
|
pids[0] = bb_threads_newthread(do_pi, NULL);
|
|
pids[1] = bb_threads_newthread(do_pi, NULL);
|
|
|
|
/* cleanup after two threads (really a barrier sync) */
|
|
bb_threads_cleanup(wait(NULL));
|
|
bb_threads_cleanup(wait(NULL));
|
|
|
|
/* print the result */
|
|
printf("Estimation of pi is %f\n", pi);
|
|
|
|
/* check-out */
|
|
exit(0);
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>
|
|
<H2><A NAME="ss2.4">2.4 LinuxThreads</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>LinuxThreads
|
|
<A HREF="http://pauillac.inria.fr/~xleroy/linuxthreads/">http://pauillac.inria.fr/~xleroy/linuxthreads/</A> is a fairly
|
|
complete and solid implementation of "shared everything" as per the
|
|
POSIX 1003.1c threads standard. Unlike other POSIX threads ports,
|
|
LinuxThreads uses the same Linux kernel threads facility
|
|
(<CODE>clone()</CODE>) that is used by bb_threads. POSIX
|
|
compatibility means that it is relatively easy to port quite a few
|
|
threaded applications from other systems and various tutorial
|
|
materials are available. In short, this is definitely the threads
|
|
package to use under Linux for developing large-scale threaded
|
|
programs.
|
|
<P>The basic program structure for using the LinuxThreads library is:
|
|
<P>
|
|
<OL>
|
|
<LI>Start the program running as a single process.
|
|
</LI>
|
|
<LI>The next step is to initialize any locks that you will need.
|
|
Unlike bb_threads locks, which are identified by numbers, POSIX
|
|
locks are declared as variables of type
|
|
<CODE>pthread_mutex_t lock</CODE>. Use
|
|
<CODE>pthread_mutex_init(&lock,val)</CODE> to initialize
|
|
each one you will need to use.
|
|
</LI>
|
|
<LI>As with bb_threads, spawning a new thread is done by
|
|
calling a library routine that takes arguments specifying what
|
|
function the new thread should execute and what arguments should be
|
|
transmitted to it. However, POSIX requires the user to declare a
|
|
variable of type <CODE>pthread_t</CODE> to identify each thread. To
|
|
create a thread <CODE>pthread_t thread</CODE> running <CODE>f()</CODE>,
|
|
one calls <CODE>pthread_create(&thread,NULL,f,&arg)</CODE>.
|
|
</LI>
|
|
<LI>Run parallel code, being careful to use
|
|
<CODE>pthread_mutex_lock(&lock)</CODE> and
|
|
<CODE>pthread_mutex_unlock(&lock)</CODE> as appropriate.
|
|
</LI>
|
|
<LI>Use <CODE>pthread_join(thread,&retval)</CODE> to clean-up
|
|
after each thread.
|
|
</LI>
|
|
<LI>Use <CODE>-D_REENTRANT</CODE> when compiling your C code.</LI>
|
|
</OL>
|
|
<P>An example parallel computation of Pi using LinuxThreads follows. The
|
|
algorithm of section 1.3 is used and, as for the bb_threads
|
|
example, two threads execute in parallel.
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
#include <stdio.h>
|
|
#include <stdlib.h>
|
|
#include "pthread.h"
|
|
|
|
volatile double pi = 0.0; /* Approximation to pi (shared) */
|
|
pthread_mutex_t pi_lock; /* Lock for above */
|
|
volatile double intervals; /* How many intervals? */
|
|
|
|
void *
|
|
process(void *arg)
|
|
{
|
|
register double width, localsum;
|
|
register int i;
|
|
register int iproc = (*((char *) arg) - '0');
|
|
|
|
/* Set width */
|
|
width = 1.0 / intervals;
|
|
|
|
/* Do the local computations */
|
|
localsum = 0;
|
|
for (i=iproc; i<intervals; i+=2) {
|
|
register double x = (i + 0.5) * width;
|
|
localsum += 4.0 / (1.0 + x * x);
|
|
}
|
|
localsum *= width;
|
|
|
|
/* Lock pi for update, update it, and unlock */
|
|
pthread_mutex_lock(&pi_lock);
|
|
pi += localsum;
|
|
pthread_mutex_unlock(&pi_lock);
|
|
|
|
return(NULL);
|
|
}
|
|
|
|
int
|
|
main(int argc, char **argv)
|
|
{
|
|
pthread_t thread0, thread1;
|
|
void * retval;
|
|
|
|
/* Get the number of intervals */
|
|
intervals = atoi(argv[1]);
|
|
|
|
/* Initialize the lock on pi */
|
|
pthread_mutex_init(&pi_lock, NULL);
|
|
|
|
/* Make the two threads */
|
|
if (pthread_create(&thread0, NULL, process, "0") ||
|
|
pthread_create(&thread1, NULL, process, "1")) {
|
|
fprintf(stderr, "%s: cannot make thread\n", argv[0]);
|
|
exit(1);
|
|
}
|
|
|
|
/* Join (collapse) the two threads */
|
|
if (pthread_join(thread0, &retval) ||
|
|
pthread_join(thread1, &retval)) {
|
|
fprintf(stderr, "%s: thread join failed\n", argv[0]);
|
|
exit(1);
|
|
}
|
|
|
|
/* Print the result */
|
|
printf("Estimation of pi is %f\n", pi);
|
|
|
|
/* Check-out */
|
|
exit(0);
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>
|
|
<H2><A NAME="ss2.5">2.5 System V Shared Memory</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>The System V IPC (Inter-Process Communication) support consists of a
|
|
number of system calls providing message queues, semaphores, and a
|
|
shared memory mechanism. Of course, these mechanisms were originally
|
|
intended to be used for multiple processes to communicate within a
|
|
uniprocessor system. However, that implies that it also should work
|
|
to communicate between processes under SMP Linux, no matter which
|
|
processors they run on.
|
|
<P>Before going into how these calls are used, it is important to
|
|
understand that although System V IPC calls exist for things like
|
|
semaphores and message transmission, you probably should not use
|
|
them. Why not? These functions are generally slow and serialized
|
|
under SMP Linux. Enough said.
|
|
<P>The basic procedure for creating a group of processes sharing access
|
|
to a shared memory segment is:
|
|
<P>
|
|
<OL>
|
|
<LI>Start the program running as a single process.
|
|
</LI>
|
|
<LI>Typically, you will want each run of a parallel program to have
|
|
its own shared memory segment, so you will need to call
|
|
<CODE>shmget()</CODE> to create a new segment of the desired size.
|
|
Alternatively, this call can be used to get the ID of a pre-existing
|
|
shared memory segment. In either case, the return value is either the
|
|
shared memory segment ID or -1 for error. For example, to create a
|
|
shared memory segment of <EM>b</EM> bytes, the call might be <CODE>shmid
|
|
= shmget(IPC_PRIVATE, <EM>b</EM>, (IPC_CREAT | 0666))</CODE>.
|
|
</LI>
|
|
<LI>The next step is to attach this shared memory segment to this
|
|
process, literally adding it to the virtual memory map of this process.
|
|
Although the <CODE>shmat()</CODE> call allows the programmer to specify
|
|
the virtual address at which the segment should appear, the address
|
|
selected must be aligned on a page boundary (i.e., be a multiple of
|
|
the page size returned by <CODE>getpagesize()</CODE>, which is usually
|
|
4096 bytes), and will override the mapping of any memory formerly at
|
|
that address. Thus, we instead prefer to let the system pick the
|
|
address. In either case, the return value is a pointer to the base
|
|
virtual address of the segment just mapped. The code is <CODE>shmptr =
|
|
shmat(shmid, 0, 0)</CODE>.
|
|
|
|
Notice that you can allocate all your static shared variables into
|
|
this shared memory segment by simply declaring all shared variables as
|
|
members of a <CODE>struct</CODE> type, and declaring <EM>shmptr</EM> to be
|
|
a pointer to that type. Using this technique, shared variable
|
|
<EM>x</EM> would be accessed as
|
|
<EM>shmptr</EM><CODE>-></CODE><EM>x</EM>.
|
|
</LI>
|
|
<LI>Since this shared memory segment should be destroyed when the
|
|
last process with access to it terminates or detaches from it, we need
|
|
to call <CODE>shmctl()</CODE> to set-up this default action. The code is
|
|
something like <CODE>shmctl(shmid, IPC_RMID, 0)</CODE>.
|
|
</LI>
|
|
<LI>Use the standard Linux <CODE>fork()</CODE> call to make the desired
|
|
number of processes... each will inherit the shared memory segment.
|
|
</LI>
|
|
<LI>When a process is done using a shared memory segment, it really
|
|
should detach from that shared memory segment. This is done by
|
|
<CODE>shmdt(shmptr)</CODE>.</LI>
|
|
</OL>
|
|
<P>
|
|
<P>Although the above set-up does require a few system calls, once the
|
|
shared memory segment has been established, any change made by one
|
|
processor to a value in that memory will automatically be visible to
|
|
all processes. Most importantly, each communication operation will
|
|
occur without the overhead of a system call.
|
|
<P>An example C program using System V shared memory segments follows.
|
|
It computes Pi, using the same algorithm given in section 1.3.
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
#include <stdio.h>
|
|
#include <stdlib.h>
|
|
#include <unistd.h>
|
|
#include <sys/types.h>
|
|
#include <sys/stat.h>
|
|
#include <fcntl.h>
|
|
#include <sys/ipc.h>
|
|
#include <sys/shm.h>
|
|
|
|
volatile struct shared { double pi; int lock; } *shared;
|
|
|
|
inline extern int xchg(register int reg,
|
|
volatile int * volatile obj)
|
|
{
|
|
/* Atomic exchange instruction */
|
|
__asm__ __volatile__ ("xchgl %1,%0"
|
|
:"=r" (reg), "=m" (*obj)
|
|
:"r" (reg), "m" (*obj));
|
|
return(reg);
|
|
}
|
|
|
|
main(int argc, char **argv)
|
|
{
|
|
register double width, localsum;
|
|
register int intervals, i;
|
|
register int shmid;
|
|
register int iproc = 0;;
|
|
|
|
/* Allocate System V shared memory */
|
|
shmid = shmget(IPC_PRIVATE,
|
|
sizeof(struct shared),
|
|
(IPC_CREAT | 0600));
|
|
shared = ((volatile struct shared *) shmat(shmid, 0, 0));
|
|
shmctl(shmid, IPC_RMID, 0);
|
|
|
|
/* Initialize... */
|
|
shared->pi = 0.0;
|
|
shared->lock = 0;
|
|
|
|
/* Fork a child */
|
|
if (!fork()) ++iproc;
|
|
|
|
/* get the number of intervals */
|
|
intervals = atoi(argv[1]);
|
|
width = 1.0 / intervals;
|
|
|
|
/* do the local computations */
|
|
localsum = 0;
|
|
for (i=iproc; i<intervals; i+=2) {
|
|
register double x = (i + 0.5) * width;
|
|
localsum += 4.0 / (1.0 + x * x);
|
|
}
|
|
localsum *= width;
|
|
|
|
/* Atomic spin lock, add, unlock... */
|
|
while (xchg((iproc + 1), &(shared->lock))) ;
|
|
shared->pi += localsum;
|
|
shared->lock = 0;
|
|
|
|
/* Terminate child (barrier sync) */
|
|
if (iproc == 0) {
|
|
wait(NULL);
|
|
printf("Estimation of pi is %f\n", shared->pi);
|
|
}
|
|
|
|
/* Check out */
|
|
return(0);
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>In this example, I have used the IA32 atomic exchange instruction to
|
|
implement locking. For better performance and portability, substitute
|
|
a synchronization technique that avoids atomic bus-locking
|
|
instructions (discussed in section 2.2).
|
|
<P>When debugging your code, it is useful to remember that the
|
|
<CODE>ipcs</CODE> command will report the status of the System V IPC
|
|
facilities currently in use.
|
|
<P>
|
|
<H2><A NAME="ss2.6">2.6 Memory Map Call</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Using system calls for file I/O can be very expensive; in fact, that is
|
|
why there is a user-buffered file I/O library (<CODE>getchar()</CODE>,
|
|
<CODE>fwrite()</CODE>, etc.). But user buffers don't work if multiple
|
|
processes are accessing the same writeable file, and the user buffer
|
|
management overhead is significant. The BSD UNIX fix for this was the
|
|
addition of a system call that allows a portion of a file to be mapped
|
|
into user memory, essentially using virtual memory paging mechanisms to
|
|
cause updates. This same mechanism also has been used in systems from
|
|
Sequent for many years as the basis for their shared memory parallel
|
|
processing support. Despite some very negative comments in the (quite
|
|
old) man page, Linux seems to correctly perform at least some of the
|
|
basic functions, and it supports the degenerate use of this system
|
|
call to map an anonymous segment of memory that can be shared across
|
|
multiple processes.
|
|
<P>In essence, the Linux implementation of <CODE>mmap()</CODE> is a plug-in
|
|
replacement for steps 2, 3, and 4 in the System V shared memory scheme
|
|
outlined in section 2.5. To create an anonymous shared memory segment:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
shmptr =
|
|
mmap(0, /* system assigns address */
|
|
b, /* size of shared memory segment */
|
|
(PROT_READ | PROT_WRITE), /* access rights, can be rwx */
|
|
(MAP_ANON | MAP_SHARED), /* anonymous, shared */
|
|
0, /* file descriptor (not used) */
|
|
0); /* file offset (not used) */
|
|
</PRE>
|
|
<HR>
|
|
<P>The equivalent to the System V shared memory <CODE>shmdt()</CODE>
|
|
call is <CODE>munmap()</CODE>:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
munmap(shmptr, b);
|
|
</PRE>
|
|
<HR>
|
|
<P>In my opinion, there is no real benefit in using <CODE>mmap()</CODE>
|
|
instead of the System V shared memory support.
|
|
<P>
|
|
<HR>
|
|
<H2><A NAME="s3">3.</A> <A HREF="#toc3">Clusters Of Linux Systems</A></H2>
|
|
|
|
<P>
|
|
<P>This section attempts to give an overview of cluster parallel
|
|
processing using Linux. Clusters are currently both the most popular
|
|
and the most varied approach, ranging from a conventional network of
|
|
workstations (<B>NOW</B>) to essentially custom parallel machines
|
|
that just happen to use Linux PCs as processor nodes. There is also
|
|
quite a lot of software support for parallel processing using clusters
|
|
of Linux machines.
|
|
<P>
|
|
<H2><A NAME="ss3.1">3.1 Why A Cluster?</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Cluster parallel processing offers several important advantages:
|
|
<P>
|
|
<UL>
|
|
<LI>Each of the machines in a cluster can be a complete system,
|
|
usable for a wide range of other computing applications. This leads
|
|
many people to suggest that cluster parallel computing can simply
|
|
claim all the "wasted cycles" of workstations sitting idle on people's
|
|
desks. It is not really so easy to salvage those cycles, and it will
|
|
probably slow your co-worker's screen saver, but it can be done.
|
|
</LI>
|
|
<LI>The current explosion in networked systems means that most of the
|
|
hardware for building a cluster is being sold in high volume, with
|
|
correspondingly low "commodity" prices as the result. Further savings
|
|
come from the fact that only one video card, monitor, and keyboard are
|
|
needed for each cluster (although you may need to swap these into each
|
|
machine to perform the initial installation of Linux, once running, a
|
|
typical Linux PC does not need a "console"). In comparison, SMP and
|
|
attached processors are much smaller markets, tending toward somewhat
|
|
higher price per unit performance.
|
|
</LI>
|
|
<LI>Cluster computing can <EM>scale to very large systems</EM>.
|
|
While it is currently hard to find a Linux-compatible SMP with many
|
|
more than four processors, most commonly available network hardware
|
|
easily builds a cluster with up to 16 machines. With a little work,
|
|
hundreds or even thousands of machines can be networked. In fact, the
|
|
entire Internet can be viewed as one truly huge cluster.
|
|
</LI>
|
|
<LI>The fact that replacing a "bad machine" within a cluster is
|
|
trivial compared to fixing a partly faulty SMP yields much higher
|
|
availability for carefully designed cluster configurations. This
|
|
becomes important not only for particular applications that cannot
|
|
tolerate significant service interruptions, but also for general use
|
|
of systems containing enough processors so that single-machine
|
|
failures are fairly common. (For example, even though the average
|
|
time to failure of a PC might be two years, in a cluster with 32
|
|
machines, the probability that at least one will fail within 6 months
|
|
is quite high.)</LI>
|
|
</UL>
|
|
<P>
|
|
<P>OK, so clusters are free or cheap and can be very large and highly
|
|
available... why doesn't everyone use a cluster? Well, there are
|
|
problems too:
|
|
<P>
|
|
<UL>
|
|
<LI>With a few exceptions, network hardware is not designed for
|
|
parallel processing. Typically latency is very high and bandwidth
|
|
relatively low compared to SMP and attached processors. For example,
|
|
SMP latency is generally no more than a few microseconds, but is
|
|
commonly hundreds or thousands of microseconds for a cluster. SMP
|
|
communication bandwidth is often more than 100 MBytes/second; although
|
|
the fastest network hardware (e.g., "Gigabit Ethernet") offers
|
|
comparable speed, the most commonly used networks are between 10 and
|
|
1000 times slower.
|
|
|
|
The performance of network hardware is poor enough as an <EM>isolated
|
|
cluster network</EM>. If the network is not isolated from other
|
|
traffic, as is often the case using "machines that happen to be
|
|
networked" rather than a system designed as a cluster, performance can
|
|
be substantially worse.
|
|
</LI>
|
|
<LI>There is very little software support for treating a cluster as a
|
|
single system. For example, the <CODE>ps</CODE> command only reports the
|
|
processes running on one Linux system, not all processes running
|
|
across a cluster of Linux systems.</LI>
|
|
</UL>
|
|
<P>
|
|
<P>Thus, the basic story is that clusters offer great potential, but that
|
|
potential may be very difficult to achieve for most applications. The
|
|
good news is that there is quite a lot of software support that will
|
|
help you achieve good performance for programs that are well suited to
|
|
this environment, and there are also networks designed specifically to
|
|
widen the range of programs that can achieve good performance.
|
|
<P>
|
|
<H2><A NAME="ss3.2">3.2 Network Hardware</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Computer networking is an exploding field... but you already knew
|
|
that. An ever-increasing range of networking technologies and
|
|
products are being developed, and most are available in forms that
|
|
could be applied to make a parallel-processing cluster out of a group
|
|
of machines (i.e., PCs each running Linux).
|
|
<P>Unfortunately, no one network technology solves all problems best; in
|
|
fact, the range of approach, cost, and performance is at first hard to
|
|
believe. For example, using standard commercially-available hardware,
|
|
the cost per machine networked ranges from less than $5 to over
|
|
$4,000. The delivered bandwidth and latency each also vary
|
|
over four orders of magnitude.
|
|
<P>Before trying to learn about specific networks, it is important to
|
|
recognize that these things change like the wind (see
|
|
<A HREF="http://www.linux.org.uk/NetNews.html">http://www.linux.org.uk/NetNews.html</A> for Linux networking news),
|
|
and it is very difficult to get accurate data about some networks.
|
|
<P>Where I was particularly uncertain,
|
|
I've placed a <EM>?</EM>. I have spent a lot of time researching this
|
|
topic, but I'm sure my summary is full of errors and has omitted many
|
|
important things. If you have any corrections or additions, please
|
|
send email to
|
|
<A HREF="mailto:hankd@engr.uky.edu">hankd@engr.uky.edu</A>.
|
|
<P>Summaries like the LAN Technology Scorecard at
|
|
<A HREF="http://web.syr.edu/~jmwobus/comfaqs/lan-technology.html">http://web.syr.edu/~jmwobus/comfaqs/lan-technology.html</A> give
|
|
some characteristics of many different types of networks and LAN
|
|
standards. However, the summary in this HOWTO centers on the network
|
|
properties that are most relevant to construction of Linux clusters.
|
|
The section discussing each network begins with a short list of
|
|
characteristics. The following defines what these entries mean.
|
|
<P>
|
|
<DL>
|
|
<DT><B>Linux support:</B><DD><P>If the answer is <EM>no</EM>, the meaning is pretty clear. Other
|
|
answers try to describe the basic program interface that is used to
|
|
access the network. Most network hardware is interfaced via a kernel
|
|
driver, typically supporting TCP/UDP communication. Some other
|
|
networks use more direct (e.g., library) interfaces to reduce latency
|
|
by bypassing the kernel.
|
|
<P>
|
|
<P>
|
|
<P>Years ago, it used to be considered perfectly acceptable to access a
|
|
floating point unit via an OS call, but that is now clearly ludicrous;
|
|
in my opinion, it is just as awkward for each communication between
|
|
processors executing a parallel program to require an OS call. The
|
|
problem is that computers haven't yet integrated these communication
|
|
mechanisms, so non-kernel approaches tend to have portability problems.
|
|
You are going to hear a lot more about this in the near future, mostly
|
|
in the form of the new <B>Virtual Interface (VI) Architecture</B>,
|
|
<A HREF="http://www.viarch.org/">http://www.viarch.org/</A>, which is a standardized method for
|
|
most network interface operations to bypass the usual OS call layers.
|
|
The VI standard is backed by Compaq, Intel, and Microsoft, and is sure
|
|
to have a strong impact on SAN (System Area Network) designs over the
|
|
next few years.
|
|
<P>
|
|
<DT><B>Maximum bandwidth:</B><DD><P>This is the number everybody cares about. I have generally used the
|
|
theoretical best case numbers; your mileage <EM>will</EM> vary.
|
|
<P>
|
|
<DT><B>Minimum latency:</B><DD><P>In my opinion, this is the number everybody should care about even more
|
|
than bandwidth. Again, I have used the unrealistic best-case numbers,
|
|
but at least these numbers do include <EM>all</EM> sources of latency,
|
|
both hardware and software. In most cases, the network latency is just
|
|
a few microseconds; the much larger numbers reflect layers of
|
|
inefficient hardware and software interfaces.
|
|
<P>
|
|
<DT><B>Available as:</B><DD><P>Simply put, this describes how you get this type of network hardware.
|
|
Commodity stuff is widely available from many vendors, with price as
|
|
the primary distinguishing factor. Multiple-vendor things are
|
|
available from more than one competing vendor, but there are
|
|
significant differences and potential interoperability problems.
|
|
Single-vendor networks leave you at the mercy of that supplier
|
|
(however benevolent it may be). Public domain designs mean that even
|
|
if you cannot find somebody to sell you one, you or anybody else can
|
|
buy parts and make one. Research prototypes are just that; they are
|
|
generally neither ready for external users nor available to them.
|
|
<P>
|
|
<DT><B>Interface port/bus used:</B><DD><P>How does one hook-up this network? The highest performance and most
|
|
common now is a PCI bus interface card. There are also EISA, VESA
|
|
local bus (VL bus), and ISA bus cards. ISA was there first, and is
|
|
still commonly used for low-performance cards. EISA is still around
|
|
as the second bus in a lot of PCI machines, so there are a few cards.
|
|
These days, you don't see much VL stuff (although
|
|
<A HREF="http://www.vesa.org/">http://www.vesa.org/</A> would beg to differ).
|
|
<P>
|
|
<P>
|
|
<P>Of course, any interface that you can use without having to open your
|
|
PC's case has more than a little appeal. IrDA and USB interfaces are
|
|
appearing with increasing frequency. The Standard Parallel Port (SPP)
|
|
used to be what your printer was plugged into, but it has seen a lot
|
|
of use lately as an external extension of the ISA bus; this new
|
|
functionality is enhanced by the IEEE 1284 standard, which specifies
|
|
EPP and ECP improvements. There is also the old, reliable, slow RS232
|
|
serial port. I don't know of anybody connecting machines using VGA
|
|
video connectors, keyboard, mouse, or game ports... so that's about
|
|
it.
|
|
<P>
|
|
<DT><B>Network structure:</B><DD><P>A bus is a wire, set of wires, or fiber. A hub is a little box that
|
|
knows how to connect different wires/fibers plugged into it; switched
|
|
hubs allow multiple connections to be actively transmitting data
|
|
simultaneously.
|
|
<P>
|
|
<DT><B>Cost per machine connected:</B><DD><P>Here's how to use these numbers. Suppose that, not counting the
|
|
network connection, it costs $2,000 to purchase a PC for use as
|
|
a node in your cluster. Adding a Fast Ethernet brings the per node
|
|
cost to about $2,400; adding a Myrinet instead brings the cost
|
|
to about $3,800. If you have about $20,000 to spend,
|
|
that means you could have either 8 machines connected by Fast Ethernet
|
|
or 5 machines connected by Myrinet. It also can be very reasonable to
|
|
have multiple networks; e.g., $20,000 could buy 8 machines
|
|
connected by both Fast Ethernet and TTL_PAPERS. Pick the
|
|
network, or set of networks, that is most likely to yield a cluster
|
|
that will run your application fastest.
|
|
<P>
|
|
<P>
|
|
<P>By the time you read this, these numbers will be wrong... heck,
|
|
they're probably wrong already. There may also be quantity discounts,
|
|
special deals, etc. Still, the prices quoted here aren't likely to be
|
|
wrong enough to lead you to a totally inappropriate choice. It
|
|
doesn't take a PhD (although I do have one ;-) to see that expensive
|
|
networks only make sense if your application needs their special
|
|
properties or if the PCs being clustered are relatively expensive.
|
|
</DL>
|
|
<P>Now that you have the disclaimers, on with the show....
|
|
<P>
|
|
<H3>ArcNet</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>2.5 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>1,000 microseconds?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>ISA</EM></LI>
|
|
<LI>Network structure: <EM>unswitched hub or bus (logical ring)</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$200</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>ARCNET is a local area network that is primarily intended for use in
|
|
embedded real-time control systems. Like Ethernet, the network is
|
|
physically organized either as taps on a bus or one or more hubs,
|
|
however, unlike Ethernet, it uses a token-based protocol logically
|
|
structuring the network as a ring. Packet headers are small (3 or 4
|
|
bytes) and messages can carry as little as a single byte of data.
|
|
Thus, ARCNET yields more consistent performance than Ethernet, with
|
|
bounded delays, etc. Unfortunately, it is slower than Ethernet and
|
|
less popular, making it more expensive. More information is available
|
|
from the ARCNET Trade Association at
|
|
<A HREF="http://www.arcnet.com/">http://www.arcnet.com/</A>.
|
|
<P>
|
|
<H3>ATM</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel driver, AAL* library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>155 Mb/s</EM> (soon, <EM>1,200 Mb/s</EM>)</LI>
|
|
<LI>Minimum latency: <EM>120 microseconds</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$3,000</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>Unless you've been in a coma for the past few years, you have probably
|
|
heard a lot about how ATM (Asynchronous Transfer Mode) <EM>is</EM> the
|
|
future... well, sort-of. ATM is cheaper than HiPPI and faster than
|
|
Fast Ethernet, and it can be used over the very long distances that
|
|
the phone companies care about. The ATM network protocol is also
|
|
designed to provide a lower-overhead software interface and to more
|
|
efficiently manage small messages and real-time communications (e.g.,
|
|
digital audio and video). It is also one of the highest-bandwidth
|
|
networks that Linux currently supports. The bad news is that ATM isn't
|
|
cheap, and there are still some compatibility problems across
|
|
vendors. An overview of Linux ATM development is available at
|
|
<A HREF="http://lrcwww.epfl.ch/linux-atm/">http://lrcwww.epfl.ch/linux-atm/</A>.
|
|
<P>
|
|
<H3>CAPERS</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>AFAPI library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1.2 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>SPP</EM></LI>
|
|
<LI>Network structure: <EM>cable between 2 machines</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$2</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>CAPERS (Cable Adapter for Parallel Execution and Rapid
|
|
Synchronization) is a spin-off of the PAPERS project,
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>, at the Purdue University
|
|
School of Electrical and Computer Engineering. In essence, it defines
|
|
a software protocol for using an ordinary "LapLink" SPP-to-SPP cable
|
|
to implement the PAPERS library for two Linux PCs. The idea doesn't
|
|
scale, but you can't beat the price. As with TTL_PAPERS, to improve
|
|
system security, there is a minor kernel patch recommended, but not
|
|
required:
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">http://garage.ecn.purdue.edu/~papers/giveioperm.html</A>.
|
|
<P>
|
|
<H3>Ethernet</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>10 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>100 microseconds</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched or unswitched hubs, or hubless bus</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$100</EM> (hubless, <EM>$50</EM>)</LI>
|
|
</UL>
|
|
<P>
|
|
<P>For some years now, 10 Mbits/s Ethernet has been the standard network
|
|
technology. Good Ethernet interface cards can be purchased for well
|
|
under $50, and a fair number of PCs now have an Ethernet controller
|
|
built-into the motherboard. For lightly-used networks, Ethernet
|
|
connections can be organized as a multi-tap bus without a hub; such
|
|
configurations can serve up to 200 machines with minimal cost, but are
|
|
not appropriate for parallel processing. Adding an unswitched hub
|
|
does not really help performance. However, switched hubs that can
|
|
provide full bandwidth to simultaneous connections cost only about
|
|
$100 per port. Linux supports an amazing range of Ethernet
|
|
interfaces, but it is important to keep in mind that variations in the
|
|
interface hardware can yield significant performance differences. See
|
|
the Hardware Compatibility HOWTO for comments on which are supported
|
|
and how well they work; also see
|
|
<A HREF="http://cesdis1.gsfc.nasa.gov/linux/drivers/">http://cesdis1.gsfc.nasa.gov/linux/drivers/</A>.
|
|
<P>An interesting way to improve performance is offered by the 16-machine
|
|
Linux cluster work done in the Beowulf project,
|
|
<A HREF="http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html">http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html</A>, at NASA
|
|
CESDIS. There, Donald Becker, who is the author of many Ethernet card
|
|
drivers, has developed support for load sharing across multiple
|
|
Ethernet networks that shadow each other (i.e., share the same network
|
|
addresses). This load sharing is built-into the standard Linux
|
|
distribution, and is done invisibly below the socket operation level.
|
|
Because hub cost is significant, having each machine connected to two
|
|
or more hubless or unswitched hub Ethernet networks can be a very
|
|
cost-effective way to improve performance. In fact, in situations
|
|
where one machine is the network performance bottleneck, load sharing
|
|
using shadow networks works much better than using a single switched
|
|
hub network.
|
|
<P>
|
|
<H3>Ethernet (Fast Ethernet)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>100 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>80 microseconds</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched or unswitched hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$400?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>Although there are really quite a few different technologies calling
|
|
themselves "Fast Ethernet," this term most often refers to a hub-based
|
|
100 Mbits/s Ethernet that is somewhat compatible with older "10 BaseT"
|
|
10 Mbits/s devices and cables. As might be expected, anything called
|
|
Ethernet is generally priced for a volume market, and these interfaces
|
|
are generally a small fraction of the price of 155 Mbits/s ATM cards.
|
|
The catch is that having a bunch of machines dividing the bandwidth of
|
|
a single 100 Mbits/s "bus" (using an unswitched hub) yields
|
|
performance that might not even be as good on average as using 10
|
|
Mbits/s Ethernet with a switched hub that can give each machine's
|
|
connection a full 10 Mbits/s.
|
|
<P>Switched hubs that can provide 100 Mbits/s for each machine
|
|
simultaneously are expensive, but prices are dropping every day, and
|
|
these switches do yield much higher total network bandwidth than
|
|
unswitched hubs. The thing that makes ATM switches so expensive is
|
|
that they must switch for each (relatively short) ATM cell; some Fast
|
|
Ethernet switches take advantage of the expected lower switching
|
|
frequency by using techniques that may have low latency through the
|
|
switch, but take multiple milliseconds to change the switch path...
|
|
if your routing pattern changes frequently, avoid those switches. See
|
|
<A HREF="http://cesdis1.gsfc.nasa.gov/linux/drivers/">http://cesdis1.gsfc.nasa.gov/linux/drivers/</A> for information
|
|
about the various cards and drivers.
|
|
<P>Also note that, as described for Ethernet, the Beowulf project,
|
|
<A HREF="http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html">http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html</A>, at NASA
|
|
has been developing support that offers improved performance by load
|
|
sharing across multiple Fast Ethernets.
|
|
<P>
|
|
<H3>Ethernet (Gigabit Ethernet)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1,000 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>300 microseconds?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched hubs or FDRs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$2,500?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>I'm not sure that Gigabit Ethernet,
|
|
<A HREF="http://www.gigabit-ethernet.org/">http://www.gigabit-ethernet.org/</A>, has a good technological
|
|
reason to be called Ethernet... but the name does accurately reflect
|
|
the fact that this is intended to be a cheap, mass-market, computer
|
|
network technology with native support for IP. However, current
|
|
pricing reflects the fact that Gb/s hardware is still a tricky thing
|
|
to build.
|
|
<P>Unlike other Ethernet technologies, Gigabit Ethernet provides for a
|
|
level of flow control that should make it a more reliable network.
|
|
FDRs, or Full-Duplex Repeaters, simply multiplex lines, using
|
|
buffering and localized flow control to improve performance. Most
|
|
switched hubs are being built as new interface modules for existing
|
|
gigabit-capable switch fabrics. Switch/FDR products have been shipped
|
|
or announced by at least
|
|
<A HREF="http://www.acacianet.com/">http://www.acacianet.com/</A>,
|
|
<A HREF="http://www.baynetworks.com/">http://www.baynetworks.com/</A>,
|
|
<A HREF="http://www.cabletron.com/">http://www.cabletron.com/</A>,
|
|
<A HREF="http://www.networks.digital.com/">http://www.networks.digital.com/</A>,
|
|
<A HREF="http://www.extremenetworks.com/">http://www.extremenetworks.com/</A>,
|
|
<A HREF="http://www.foundrynet.com/">http://www.foundrynet.com/</A>,
|
|
<A HREF="http://www.gigalabs.com/">http://www.gigalabs.com/</A>,
|
|
<A HREF="http://www.packetengines.com/">http://www.packetengines.com/</A>.
|
|
<A HREF="http://www.plaintree.com/">http://www.plaintree.com/</A>,
|
|
<A HREF="http://www.prominet.com/">http://www.prominet.com/</A>,
|
|
<A HREF="http://www.sun.com/">http://www.sun.com/</A>, and
|
|
<A HREF="http://www.xlnt.com/">http://www.xlnt.com/</A>.
|
|
<P>There is a Linux driver,
|
|
<A HREF="http://cesdis.gsfc.nasa.gov/linux/drivers/yellowfin.html">http://cesdis.gsfc.nasa.gov/linux/drivers/yellowfin.html</A>, for
|
|
the Packet Engines "Yellowfin" G-NIC,
|
|
<A HREF="http://www.packetengines.com/">http://www.packetengines.com/</A>. Early tests under Linux achieved
|
|
about 2.5x higher bandwidth than could be achieved with the best 100
|
|
Mb/s Fast Ethernet; with gigabit networks, careful tuning of PCI bus
|
|
use is a critical factor. There is little doubt that driver
|
|
improvements, and Linux drivers for other NICs, will follow.
|
|
<P>
|
|
<H3>FC (Fibre Channel)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1,062 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI?</EM></LI>
|
|
<LI>Network structure: <EM>?</EM></LI>
|
|
<LI>Cost per machine connected: <EM>?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>The goal of FC (Fibre Channel) is to provide high-performance block
|
|
I/O (an FC frame carries a 2,048 byte data payload), particularly for
|
|
sharing disks and other storage devices that can be directly connected
|
|
to the FC rather than connected through a computer. Bandwidth-wise,
|
|
FC is specified to be relatively fast, running anywhere between 133
|
|
and 1,062 Mbits/s. If FC becomes popular as a high-end SCSI
|
|
replacement, it may quickly become a cheap technology; for now, it is
|
|
not cheap and is not supported by Linux. A good collection of FC
|
|
references is maintained by the Fibre Channel Association at
|
|
<A HREF="http://www.amdahl.com/ext/CARP/FCA/FCA.html">http://www.amdahl.com/ext/CARP/FCA/FCA.html</A><P>
|
|
<H3>FireWire (IEEE 1394)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>196.608 Mb/s</EM> (soon, <EM>393.216 Mb/s</EM>)</LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>random without cycles (self-configuring)</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$600</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>FireWire,
|
|
<A HREF="http://www.firewire.org/">http://www.firewire.org/</A>, the IEEE 1394-1995
|
|
standard, is destined to be the low-cost high-speed digital network
|
|
for consumer electronics. The showcase application is connecting DV
|
|
digital video camcorders to computers, but FireWire is intended to be
|
|
used for applications ranging from being a SCSI replacement to
|
|
interconnecting the components of your home theater. It allows up to
|
|
64K devices to be connected in any topology using busses and bridges
|
|
that does not create a cycle, and automatically detects the
|
|
configuration when components are added or removed. Short (four-byte
|
|
"quadlet") low-latency messages are supported as well as ATM-like
|
|
isochronous transmission (used to keep multimedia messages
|
|
synchronized). Adaptec has FireWire products that allow up to 63
|
|
devices to be connected to a single PCI interface card, and also has
|
|
good general FireWire information at
|
|
<A HREF="http://www.adaptec.com/serialio/">http://www.adaptec.com/serialio/</A>.
|
|
<P>Although FireWire will not be the highest bandwidth network available,
|
|
the consumer-level market (which should drive prices very low) and low
|
|
latency support might make this one of the best Linux PC cluster
|
|
message-passing network technologies within the next year or so.
|
|
<P>
|
|
<H3>HiPPI And Serial HiPPI</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1,600 Mb/s</EM> (serial is <EM>1,200 Mb/s</EM>)</LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>EISA, PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$3,500</EM> (serial is <EM>$4,500</EM>)</LI>
|
|
</UL>
|
|
<P>
|
|
<P>HiPPI (High Performance Parallel Interface) was originally intended to
|
|
provide very high bandwidth for transfer of huge data sets between a
|
|
supercomputer and another machine (a supercomputer, frame buffer, disk
|
|
array, etc.), and has become the dominant standard for
|
|
supercomputers. Although it is an oxymoron, <B>Serial HiPPI</B> is
|
|
also becoming popular, typically using a fiber optic cable instead of
|
|
the 32-bit wide standard (parallel) HiPPI cables. Over the past few
|
|
years, HiPPI crossbar switches have become common and prices have
|
|
dropped sharply; unfortunately, serial HiPPI is still pricey, and that
|
|
is what PCI bus interface cards generally support. Worse still, Linux
|
|
doesn't yet support HiPPI. A good overview of HiPPI is maintained by
|
|
CERN at
|
|
<A HREF="http://www.cern.ch/HSI/hippi/">http://www.cern.ch/HSI/hippi/</A>; they also maintain
|
|
a rather long list of HiPPI vendors at
|
|
<A HREF="http://www.cern.ch/HSI/hippi/procintf/manufact.htm">http://www.cern.ch/HSI/hippi/procintf/manufact.htm</A>.
|
|
<P>
|
|
<H3>IrDA (Infrared Data Association)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no?</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1.15 Mb/s</EM> and <EM>4 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>IrDA</EM></LI>
|
|
<LI>Network structure: <EM>thin air</EM> ;-)</LI>
|
|
<LI>Cost per machine connected: <EM>$0</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>IrDA (Infrared Data Association,
|
|
<A HREF="http://www.irda.org/">http://www.irda.org/</A>) is
|
|
that little infrared device on the side of a lot of laptop PCs. It is
|
|
inherently difficult to connect more than two machines using this
|
|
interface, so it is unlikely to be used for clustering. Don Becker
|
|
did some preliminary work with IrDA.
|
|
<P>
|
|
<H3>Myrinet</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1,280 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>9 microseconds</EM></LI>
|
|
<LI>Available as: <EM>single-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$1,800</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>Myrinet
|
|
<A HREF="http://www.myri.com/">http://www.myri.com/</A> is a local area network (LAN)
|
|
designed to also serve as a "system area network" (SAN), i.e., the
|
|
network within a cabinet full of machines connected as a parallel
|
|
system. The LAN and SAN versions use different physical media and
|
|
have somewhat different characteristics; generally, the SAN version
|
|
would be used within a cluster.
|
|
<P>Myrinet is fairly conventional in structure, but has a reputation for
|
|
being particularly well-implemented. The drivers for Linux are said
|
|
to perform very well, although shockingly large performance variations
|
|
have been reported with different PCI bus implementations for the host
|
|
computers.
|
|
<P>Currently, Myrinet is clearly the favorite network of cluster groups
|
|
that are not too severely "budgetarily challenged." If your idea of a
|
|
Linux PC is a high-end Pentium Pro or Pentium II with at least 256 MB
|
|
RAM and a SCSI RAID, the cost of Myrinet is quite reasonable. However,
|
|
using more ordinary PC configurations, you may find that your choice
|
|
is between <EM>N</EM> machines linked by Myrinet or <EM>2N</EM> linked
|
|
by multiple Fast Ethernets and TTL_PAPERS. It really depends
|
|
on what your budget is and what types of computations you care about
|
|
most.
|
|
<P>
|
|
<H3>Parastation</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>HAL or socket library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>125 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>2 microseconds</EM></LI>
|
|
<LI>Available as: <EM>single-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>hubless mesh</EM></LI>
|
|
<LI>Cost per machine connected: <EM>> $1,000</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>The ParaStation project
|
|
<A HREF="http://wwwipd.ira.uka.de/parastation">http://wwwipd.ira.uka.de/parastation</A> at University of Karlsruhe
|
|
Department of Informatics is building a PVM-compatible custom
|
|
low-latency network. They first constructed a two-processor ParaPC
|
|
prototype using a custom EISA card interface and PCs running BSD UNIX,
|
|
and then built larger clusters using DEC Alphas. Since January 1997,
|
|
ParaStation has been available for Linux. The PCI cards are being
|
|
made in cooperation with a company called Hitex (see
|
|
<A HREF="http://www.hitex.com:80/parastation/">http://www.hitex.com:80/parastation/</A>). Parastation hardware
|
|
implements both fast, reliable, message transmission and simple barrier
|
|
synchronization.
|
|
<P>
|
|
<H3>PLIP</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel driver</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1.2 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>1,000 microseconds?</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>SPP</EM></LI>
|
|
<LI>Network structure: <EM>cable between 2 machines</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$2</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>For just the cost of a "LapLink" cable, PLIP (Parallel Line Interface
|
|
Protocol) allows two Linux machines to communicate through standard
|
|
parallel ports using standard socket-based software. In terms of
|
|
bandwidth, latency, and scalability, this is not a very serious
|
|
network technology; however, the near-zero cost and the software
|
|
compatibility are useful. The driver is part of the standard Linux
|
|
kernel distributions.
|
|
<P>
|
|
<H3>SCI</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>4,000 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>2.7 microseconds</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI, proprietary</EM></LI>
|
|
<LI>Network structure: <EM>?</EM></LI>
|
|
<LI>Cost per machine connected: <EM>> $1,000</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>The goal of SCI (Scalable Coherent Interconnect, ANSI/IEEE 1596-1992)
|
|
is essentially to provide a high performance mechanism that can
|
|
support coherent shared memory access across large numbers of
|
|
machines, as well various types of block message transfers. It is
|
|
fairly safe to say that the designed bandwidth and latency of SCI are
|
|
both "awesome" in comparison to most other network technologies. The
|
|
catch is that SCI is not widely available as cheap production units,
|
|
and there isn't any Linux support.
|
|
<P>SCI primarily is used in various proprietary designs for
|
|
logically-shared physically-distributed memory machines, such as the
|
|
HP/Convex Exemplar SPP and the Sequent NUMA-Q 2000 (see
|
|
<A HREF="http://www.sequent.com/">http://www.sequent.com/</A>). However, SCI is available as a PCI
|
|
interface card and 4-way switches (up to 16 machines can be connected
|
|
by cascading four 4-way switches) from Dolphin,
|
|
<A HREF="http://www.dolphinics.com/">http://www.dolphinics.com/</A>, as their CluStar product line. A
|
|
good set of links overviewing SCI is maintained by CERN at
|
|
<A HREF="http://www.cern.ch/HSI/sci/sci.html">http://www.cern.ch/HSI/sci/sci.html</A>.
|
|
<P>
|
|
<H3>SCSI</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>5 Mb/s</EM> to over <EM>20 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI, EISA, ISA card</EM></LI>
|
|
<LI>Network structure: <EM>inter-machine bus sharing SCSI devices</EM></LI>
|
|
<LI>Cost per machine connected: <EM>?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>SCSI (Small Computer Systems Interconnect) is essentially an I/O bus
|
|
that is used for disk drives, CD ROMS, image scanners, etc. There are
|
|
three separate standards SCSI-1, SCSI-2, and SCSI-3; Fast and Ultra
|
|
speeds; and data path widths of 8, 16, or 32 bits (with FireWire
|
|
compatibility also mentioned in SCSI-3). It is all pretty confusing,
|
|
but we all know a good SCSI is somewhat faster than EIDE and can handle
|
|
more devices more efficiently.
|
|
<P>What many people do not realize is that it is fairly simple for two
|
|
computers to share a single SCSI bus. This type of configuration is
|
|
very useful for sharing disk drives between machines and implementing
|
|
<B>fail-over</B> - having one machine take over database requests
|
|
when the other machine fails. Currently, this is the only mechanism
|
|
supported by Microsoft's PC cluster product, WolfPack. However, the
|
|
inability to scale to larger systems renders shared SCSI uninteresting
|
|
for parallel processing in general.
|
|
<P>
|
|
<H3>ServerNet</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>400 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
|
|
<LI>Available as: <EM>single-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>hexagonal tree/tetrahedral lattice of hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>ServerNet is the high-performance network hardware from Tandem,
|
|
<A HREF="http://www.tandem.com">http://www.tandem.com</A>. Especially in the online transation
|
|
processing (OLTP) world, Tandem is well known as a leading producer of
|
|
high-reliability systems, so it is not surprising that their network
|
|
claims not just high performance, but also "high data integrity and
|
|
reliability." Another interesting aspect of ServerNet is that it
|
|
claims to be able to transfer data from any device directly to any
|
|
device; not just between processors, but also disk drives, etc., in a
|
|
one-sided style similar to that suggested by the MPI remote memory
|
|
access mechanisms described in section 3.5. One last comment about
|
|
ServerNet: although there is just a single vendor, that vendor is
|
|
powerful enough to potentially establish ServerNet as a major
|
|
standard... Tandem is owned by Compaq.
|
|
<P>
|
|
<H3>SHRIMP</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>user-level memory mapped interface</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>180 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>5 microseconds</EM></LI>
|
|
<LI>Available as: <EM>research prototype</EM></LI>
|
|
<LI>Interface port/bus used: <EM>EISA</EM></LI>
|
|
<LI>Network structure: <EM>mesh backplane (as in Intel Paragon)</EM></LI>
|
|
<LI>Cost per machine connected: <EM>?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>The SHRIMP project,
|
|
<A HREF="http://www.CS.Princeton.EDU/shrimp/">http://www.CS.Princeton.EDU/shrimp/</A>,
|
|
at the Princeton University Computer Science Department is building a
|
|
parallel computer using PCs running Linux as the processing elements.
|
|
The first SHRIMP (Scalable, High-Performance, Really Inexpensive
|
|
Multi-Processor) was a simple two-processor prototype using a
|
|
dual-ported RAM on a custom EISA card interface. There is now a
|
|
prototype that will scale to larger configurations using a custom
|
|
interface card to connect to a "hub" that is essentially the same mesh
|
|
routing network used in the Intel Paragon (see
|
|
<A HREF="http://www.ssd.intel.com/paragon.html">http://www.ssd.intel.com/paragon.html</A>). Considerable effort
|
|
has gone into developing low-overhead "virtual memory mapped
|
|
communication" hardware and support software.
|
|
<P>
|
|
<H3>SLIP</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>0.1 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>1,000 microseconds?</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>RS232C</EM></LI>
|
|
<LI>Network structure: <EM>cable between 2 machines</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$2</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>Although SLIP (Serial Line Interface Protocol) is firmly planted at
|
|
the low end of the performance spectrum, SLIP (or CSLIP or PPP) allows
|
|
two machines to perform socket communication via ordinary RS232 serial
|
|
ports. The RS232 ports can be connected using a null-modem RS232
|
|
serial cable, or they can even be connected via dial-up through a
|
|
modem. In any case, latency is high and bandwidth is low, so SLIP
|
|
should be used only when no other alternatives are available. It is
|
|
worth noting, however, that most PCs have two RS232 ports, so it would
|
|
be possible to network a group of machines simply by connecting the
|
|
machines as a linear array or as a ring. There is even load sharing
|
|
software called EQL.
|
|
<P>
|
|
<H3>TTL_PAPERS</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>AFAPI library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1.6 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
|
|
<LI>Available as: <EM>public-domain design, single-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>SPP</EM></LI>
|
|
<LI>Network structure: <EM>tree of hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$100</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>The PAPERS (Purdue's Adapter for Parallel Execution and Rapid
|
|
Synchronization) project,
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>, at the Purdue University
|
|
School of Electrical and Computer Engineering is building scalable,
|
|
low-latency, aggregate function communication hardware and software
|
|
that allows a parallel supercomputer to be built using unmodified
|
|
PCs/workstations as nodes.
|
|
<P>There have been over a dozen different types of PAPERS hardware built
|
|
that connect to PCs/workstations via the SPP (Standard Parallel Port),
|
|
roughly following two development lines. The versions called "PAPERS"
|
|
target higher performance, using whatever technologies are appropriate;
|
|
current work uses FPGAs, and high bandwidth PCI bus interface designs
|
|
are also under development. In contrast, the versions called
|
|
"TTL_PAPERS" are designed to be easily reproduced outside
|
|
Purdue, and are remarkably simple public domain designs that can be
|
|
built using ordinary TTL logic. One such design is produced
|
|
commercially,
|
|
<A HREF="http://chelsea.ios.com:80/~hgdietz/sbm4.html">http://chelsea.ios.com:80/~hgdietz/sbm4.html</A>.
|
|
<P>Unlike the custom hardware designs from other universities,
|
|
TTL_PAPERS clusters have been assembled at many universities
|
|
from the USA to South Korea. Bandwidth is severely limited by the SPP
|
|
connections, but PAPERS implements very low latency aggregate function
|
|
communications; even the fastest message-oriented systems cannot
|
|
provide comparable performance on those aggregate functions. Thus,
|
|
PAPERS is particularly good for synchronizing the displays of a video
|
|
wall (to be discussed further in the upcoming Video Wall HOWTO),
|
|
scheduling accesses to a high-bandwidth network, evaluating global
|
|
fitness in genetic searches, etc. Although PAPERS clusters have been
|
|
built using IBM PowerPC AIX, DEC Alpha OSF/1, and HP PA-RISC HP-UX
|
|
machines, Linux-based PCs are the platforms best supported.
|
|
<P>User programs using TTL_PAPERS AFAPI directly access the SPP
|
|
hardware port registers under Linux, without an OS call for each
|
|
access. To do this, AFAPI first gets port permission using either
|
|
<CODE>iopl()</CODE> or <CODE>ioperm()</CODE>. The problem with these calls is
|
|
that both require the user program to be privileged, yielding a
|
|
potential security hole. The solution is an optional kernel patch,
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">http://garage.ecn.purdue.edu/~papers/giveioperm.html</A>, that
|
|
allows a privileged process to control port permission for any process.
|
|
<P>
|
|
<H3>USB (Universal Serial Bus)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel driver</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>12 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>USB</EM></LI>
|
|
<LI>Network structure: <EM>bus</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$5?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>USB (Universal Serial Bus,
|
|
<A HREF="http://www.usb.org/">http://www.usb.org/</A>) is a
|
|
hot-pluggable conventional-Ethernet-speed, bus for up to 127
|
|
peripherals ranging from keyboards to video conferencing cameras. It
|
|
isn't really clear how multiple computers get connected to each other
|
|
using USB. In any case, USB ports are quickly becoming as standard on
|
|
PC motherboards as RS232 and SPP, so don't be surprised if one or two
|
|
USB ports are lurking on the back of the next PC you buy. Development
|
|
of a Linux driver is discussed at
|
|
<A HREF="http://peloncho.fis.ucm.es/~inaky/USB.html">http://peloncho.fis.ucm.es/~inaky/USB.html</A>.
|
|
<P>In some ways, USB is almost the low-performance, zero-cost, version of
|
|
FireWire that you can purchase today.
|
|
<P>
|
|
<H3>WAPERS</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>AFAPI library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>0.4 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
|
|
<LI>Available as: <EM>public-domain design</EM></LI>
|
|
<LI>Interface port/bus used: <EM>SPP</EM></LI>
|
|
<LI>Network structure: <EM>wiring pattern between 2-64 machines</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$5</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>WAPERS (Wired-AND Adapter for Parallel Execution and Rapid
|
|
Synchronization) is a spin-off of the PAPERS project,
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>, at the Purdue University
|
|
School of Electrical and Computer Engineering. If implemented
|
|
properly, the SPP has four bits of open-collector output that can be
|
|
wired together across machines to implement a 4-bit wide wired AND.
|
|
This wired-AND is electrically touchy, and the maximum number of
|
|
machines that can be connected in this way critically depends on the
|
|
analog properties of the ports (maximum sink current and pull-up
|
|
resistor value); typically, up to 7 or 8 machines can be networked by
|
|
WAPERS. Although cost and latency are very low, so is bandwidth;
|
|
WAPERS is much better as a second network for aggregate operations
|
|
than as the only network in a cluster. As with TTL_PAPERS, to
|
|
improve system security, there is a minor kernel patch recommended,
|
|
but not required:
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">http://garage.ecn.purdue.edu/~papers/giveioperm.html</A>.
|
|
<P>
|
|
<H2><A NAME="ss3.3">3.3 Network Software Interface</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Before moving on to discuss the software support for parallel
|
|
applications, it is useful to first briefly cover the basics of
|
|
low-level software interface to the network hardware. There are
|
|
really only three basic choices: sockets, device drivers, and
|
|
user-level libraries.
|
|
<P>
|
|
<H3>Sockets</H3>
|
|
|
|
<P>
|
|
<P>By far the most common low-level network interface is a socket
|
|
interface. Sockets have been a part of unix for over a decade, and
|
|
most standard network hardware is designed to support at least two
|
|
types of socket protocols: UDP and TCP. Both types of socket allow
|
|
you to send arbitrary size blocks of data from one machine to another,
|
|
but there are several important differences. Typically, both yield a
|
|
minimum latency of around 1,000 microseconds, although performance can
|
|
be far worse depending on network traffic.
|
|
<P>These socket types are the basic network software interface for most
|
|
of the portable, higher-level, parallel processing software; for
|
|
example, PVM uses a combination of UDP and TCP, so knowing the
|
|
difference will help you tune performance. For even better
|
|
performance, you can also use these mechanisms directly in your
|
|
program. The following is just a simple overview of UDP and TCP; see
|
|
the manual pages and a good network programming book for details.
|
|
<P>
|
|
<H3>UDP Protocol (SOCK_DGRAM)</H3>
|
|
|
|
<P>
|
|
<P><B>UDP</B> is the User Datagram Protocol, but you more easily can
|
|
remember the properties of UDP as Unreliable Datagram Processing. In
|
|
other words, UDP allows each block to be sent as an individual message,
|
|
but a message might be lost in transmission. In fact, depending on
|
|
network traffic, UDP messages can be lost, can arrive multiple times,
|
|
or can arrive in an order different from that in which they were
|
|
sent. The sender of a UDP message does not automatically get an
|
|
acknowledgment, so it is up to user-written code to detect and
|
|
compensate for these problems. Fortunately, UDP does ensure that if a
|
|
message arrives, the message contents are intact (i.e., you never get
|
|
just part of a UDP message).
|
|
<P>The nice thing about UDP is that it tends to be the fastest socket
|
|
protocol. Further, UDP is "connectionless," which means that each
|
|
message is essentially independent of all others. A good analogy is
|
|
that each message is like a letter to be mailed; you might send
|
|
multiple letters to the same address, but each one is independent of
|
|
the others and there is no limit on how many people you can send
|
|
letters to.
|
|
<P>
|
|
<H3>TCP Protocol (SOCK_STREAM)</H3>
|
|
|
|
<P>
|
|
<P>Unlike UDP, <B>TCP</B> is a reliable, connection-based, protocol.
|
|
Each block sent is not seen as a message, but as a block of data
|
|
within an apparently continuous stream of bytes being transmitted
|
|
through a connection between sender and receiver. This is very
|
|
different from UDP messaging because each block is simply part of the
|
|
byte stream and it is up to the user code to figure-out how to extract
|
|
each block from the byte stream; there are no markings separating
|
|
messages. Further, the connections are more fragile with respect to
|
|
network problems, and only a limited number of connections can exist
|
|
simultaneously for each process. Because it is reliable, TCP
|
|
generally implies significantly more overhead than UDP.
|
|
<P>There are, however, a few pleasant surprises about TCP. One is that,
|
|
if multiple messages are sent through a connection, TCP is able to
|
|
pack them together in a buffer to better match network hardware packet
|
|
sizes, potentially yielding better-than-UDP performance for groups of
|
|
short or oddly-sized messages. The other bonus is that networks
|
|
constructed using reliable direct physical links between machines can
|
|
easily and efficiently simulate TCP connections. For example, this was
|
|
done for the ParaStation's "Socket Library" interface software, which
|
|
provides TCP semantics using user-level calls that differ from the
|
|
standard TCP OS calls only by the addition of the prefix
|
|
<CODE>PSS</CODE> to each function name.
|
|
<P>
|
|
<H3>Device Drivers</H3>
|
|
|
|
<P>
|
|
<P>When it comes to actually pushing data onto the network or pulling data
|
|
off the network, the standard unix software interface is a part of the
|
|
unix kernel called a device driver. UDP and TCP don't just transport
|
|
data, they also imply a fair amount of overhead for socket management.
|
|
For example, something has to manage the fact that multiple TCP
|
|
connections can share a single physical network interface. In
|
|
contrast, a device driver for a dedicated network interface only needs
|
|
to implement a few simple data transport functions. These device
|
|
driver functions can then be invoked by user programs by using
|
|
<CODE>open()</CODE> to identify the proper device and then using system
|
|
calls like <CODE>read()</CODE> and <CODE>write()</CODE> on the open "file."
|
|
Thus, each such operation could transport a block of data with little
|
|
more than the overhead of a system call, which might be as fast as
|
|
tens of microseconds.
|
|
<P>Writing a device driver to be used with Linux is not hard... if you
|
|
know <EM>precisely</EM> how the device hardware works. If you are not
|
|
sure how it works, don't guess. Debugging device drivers isn't fun
|
|
and mistakes can fry hardware. However, if that hasn't scared you
|
|
off, it may be possible to write a device driver to, for example, use
|
|
dedicated Ethernet cards as dumb but fast direct machine-to-machine
|
|
connections without the usual Ethernet protocol overhead. In fact,
|
|
that's pretty much what some early Intel supercomputers did.... Look
|
|
at the Device Driver HOWTO for more information.
|
|
<P>
|
|
<H3>User-Level Libraries</H3>
|
|
|
|
<P>
|
|
<P>If you've taken an OS course, user-level access to hardware device
|
|
registers is exactly what you have been taught never to do, because
|
|
one of the primary purposes of an OS is to control device access.
|
|
However, an OS call is at least tens of microseconds of overhead. For
|
|
custom network hardware like TTL_PAPERS, which can perform a
|
|
basic network operation in just 3 microseconds, such OS call overhead
|
|
is intolerable. The only way to avoid that overhead is to have
|
|
user-level code - a user-level library - directly access hardware
|
|
device registers. Thus, the question becomes one of how a user-level
|
|
library can access hardware directly, yet not compromise the OS
|
|
control of device access rights.
|
|
<P>On a typical system, the only way for a user-level library to directly
|
|
access hardware device registers is to:
|
|
<P>
|
|
<OL>
|
|
<LI>At user program start-up, use an OS call to map the page of
|
|
memory address space containing the device registers into the user
|
|
process virtual memory map. For some systems, the <CODE>mmap()</CODE>
|
|
call (first mentioned in section 2.6) can be used to map a special
|
|
file which represents the physical memory page addresses of the I/O
|
|
devices. Alternatively, it is relatively simple to write a device
|
|
driver to perform this function. Further, this device driver can
|
|
control access by only mapping the page(s) containing the specific
|
|
device registers needed, thereby maintaining OS access control.
|
|
</LI>
|
|
<LI>Access device registers without an OS call by simply loading or
|
|
storing to the mapped addresses. For example, <CODE>*((char *) 0x1234) =
|
|
5;</CODE> would store the byte value 5 into memory location 1234
|
|
(hexadecimal).</LI>
|
|
</OL>
|
|
<P>Fortunately, it happens that Linux for the Intel 386 (and compatible
|
|
processors) offers an even better solution:
|
|
<P>
|
|
<OL>
|
|
<LI>Using the <CODE>ioperm()</CODE> OS call from a privileged process,
|
|
get permission to access the precise I/O port addresses that
|
|
correspond to the device registers. Alternatively, permission can be
|
|
managed by an independent privileged user process (i.e., a "meta OS")
|
|
using the
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">giveioperm() OS call</A> patch for Linux.
|
|
</LI>
|
|
<LI>Access device registers without an OS call by using 386 port I/O
|
|
instructions.</LI>
|
|
</OL>
|
|
<P>
|
|
<P>This second solution is preferable because it is common that multiple
|
|
I/O devices have their registers within a single page, in which case
|
|
the first technique would not provide protection against accessing
|
|
other device registers that happened to reside in the same page as the
|
|
ones intended. Of course, the down side is that 386 port I/O
|
|
instructions cannot be coded in C - instead, you will need to use a
|
|
bit of assembly code. The GCC-wrapped (usable in C programs) inline
|
|
assembly code function for a port input of a byte value is:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
extern inline unsigned char
|
|
inb(unsigned short port)
|
|
{
|
|
unsigned char _v;
|
|
__asm__ __volatile__ ("inb %w1,%b0"
|
|
:"=a" (_v)
|
|
:"d" (port), "0" (0));
|
|
return _v;
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>Similarly, the GCC-wrapped code for a byte port output is:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
extern inline void
|
|
outb(unsigned char value,
|
|
unsigned short port)
|
|
{
|
|
__asm__ __volatile__ ("outb %b0,%w1"
|
|
:/* no outputs */
|
|
:"a" (value), "d" (port));
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>
|
|
<H2><A NAME="ss3.4">3.4 PVM (Parallel Virtual Machine)</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>PVM (Parallel Virtual Machine) is a freely-available, portable,
|
|
message-passing library generally implemented on top of sockets. It
|
|
is clearly established as the de-facto standard for message-passing
|
|
cluster parallel computing.
|
|
<P>PVM supports single-processor and SMP Linux machines, as well as
|
|
clusters of Linux machines linked by socket-capable networks (e.g.,
|
|
SLIP, PLIP, Ethernet, ATM). In fact, PVM will even work across groups
|
|
of machines in which a variety of different types of processors,
|
|
configurations, and physical networks are used - <B>Heterogeneous
|
|
Clusters</B> - even to the scale of treating machines linked by the
|
|
Internet as a parallel cluster. PVM also provides facilities for
|
|
parallel job control across a cluster. Best of all, PVM has long been
|
|
freely available (currently from
|
|
<A HREF="http://www.epm.ornl.gov/pvm/pvm_home.html">http://www.epm.ornl.gov/pvm/pvm_home.html</A>), which has
|
|
led to many programming language compilers, application libraries,
|
|
programming and debugging tools, etc., using it as their "portable
|
|
message-passing target library." There is also a network newsgroup,
|
|
<A HREF="news:comp.parallel.pvm">comp.parallel.pvm</A>.
|
|
<P>It is important to note, however, that PVM message-passing calls
|
|
generally add significant overhead to standard socket operations,
|
|
which already had high latency. Further, the message handling calls
|
|
themselves do not constitute a particularly "friendly" programming
|
|
model.
|
|
<P>Using the same Pi computation example first described in section 1.3,
|
|
the version using C with PVM library calls is:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
#include <stdlib.h>
|
|
#include <stdio.h>
|
|
#include <pvm3.h>
|
|
|
|
#define NPROC 4
|
|
|
|
main(int argc, char **argv)
|
|
{
|
|
register double lsum, width;
|
|
double sum;
|
|
register int intervals, i;
|
|
int mytid, iproc, msgtag = 4;
|
|
int tids[NPROC]; /* array of task ids */
|
|
|
|
/* enroll in pvm */
|
|
mytid = pvm_mytid();
|
|
|
|
/* Join a group and, if I am the first instance,
|
|
iproc=0, spawn more copies of myself
|
|
*/
|
|
iproc = pvm_joingroup("pi");
|
|
|
|
if (iproc == 0) {
|
|
tids[0] = pvm_mytid();
|
|
pvm_spawn("pvm_pi", &argv[1], 0, NULL, NPROC-1, &tids[1]);
|
|
}
|
|
/* make sure all processes are here */
|
|
pvm_barrier("pi", NPROC);
|
|
|
|
/* get the number of intervals */
|
|
intervals = atoi(argv[1]);
|
|
width = 1.0 / intervals;
|
|
|
|
lsum = 0.0;
|
|
for (i = iproc; i<intervals; i+=NPROC) {
|
|
register double x = (i + 0.5) * width;
|
|
lsum += 4.0 / (1.0 + x * x);
|
|
}
|
|
|
|
/* sum across the local results & scale by width */
|
|
sum = lsum * width;
|
|
pvm_reduce(PvmSum, &sum, 1, PVM_DOUBLE, msgtag, "pi", 0);
|
|
|
|
/* have only the console PE print the result */
|
|
if (iproc == 0) {
|
|
printf("Estimation of pi is %f\n", sum);
|
|
}
|
|
|
|
/* Check program finished, leave group, exit pvm */
|
|
pvm_barrier("pi", NPROC);
|
|
pvm_lvgroup("pi");
|
|
pvm_exit();
|
|
return(0);
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>
|
|
<H2><A NAME="ss3.5">3.5 MPI (Message Passing Interface)</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Although PVM is the de-facto standard message-passing library, MPI
|
|
(Message Passing Interface) is the relatively new official standard.
|
|
The home page for the MPI standard is
|
|
<A HREF="http://www.mcs.anl.gov:80/mpi/">http://www.mcs.anl.gov:80/mpi/</A> and the newsgroup is
|
|
<A HREF="news:comp.parallel.mpi">comp.parallel.mpi</A>.
|
|
<P>However, before discussing MPI, I feel compelled to say a little bit
|
|
about the PVM vs. MPI religious war that has been going on for the
|
|
past few years. I'm not really on either side. Here's my attempt at
|
|
a relatively unbiased summary of the differences:
|
|
<P>
|
|
<DL>
|
|
<DT><B>Execution control environment.</B><DD><P>Put simply, PVM has one and
|
|
MPI doesn't specify how/if one is implemented. Thus, things like
|
|
starting a PVM program executing are done identically everywhere, while
|
|
for MPI it depends on which implementation is being used.
|
|
<P>
|
|
<DT><B>Support for heterogeneous clusters.</B><DD><P>PVM grew-up in the
|
|
workstation cycle-scavenging world, and thus directly manages
|
|
heterogeneous mixes of machines and operating systems. In contrast,
|
|
MPI largely assumes that the target is an MPP (Massively Parallel
|
|
Processor) or a dedicated cluster of nearly identical workstations.
|
|
<P>
|
|
<DT><B>Kitchen sink syndrome.</B><DD><P>PVM evidences a unity of purpose that
|
|
MPI 2.0 doesn't. The new MPI 2.0 standard includes a lot of features
|
|
that go way beyond the basic message passing model - things like RMA
|
|
(Remote Memory Access) and parallel file I/O. Are these things
|
|
useful? Of course they are... but learning MPI 2.0 is a lot like
|
|
learning a complete new programming language.
|
|
<P>
|
|
<DT><B>User interface design.</B><DD><P>MPI was designed after PVM, and
|
|
clearly learned from it. MPI offers simpler, more efficient, buffer
|
|
handling and higher-level abstractions allowing user-defined data
|
|
structures to be transmitted in messages.
|
|
<P>
|
|
<DT><B>The force of law.</B><DD><P>By my count, there are still
|
|
significantly more things designed to use PVM than there are to use
|
|
MPI; however, porting them to MPI is easy, and the fact that MPI is
|
|
backed by a widely-supported formal standard means that using MPI is,
|
|
for many institutions, a matter of policy.
|
|
</DL>
|
|
<P>Conclusion? Well, there are at least three independently developed,
|
|
freely available, versions of MPI that can run on clusters of Linux
|
|
systems (and I wrote one of them):
|
|
<P>
|
|
<UL>
|
|
<LI>LAM (Local Area Multicomputer) is a full implementation of the
|
|
MPI 1.1 standard. It allows MPI programs to be executed within an
|
|
individual Linux system or across a cluster of Linux systems using
|
|
UDP/TCP socket communication. The system includes simple execution
|
|
control facilities, as well as a variety of program development and
|
|
debugging aids. It is freely available from
|
|
<A HREF="http://www.osc.edu/lam.html">http://www.osc.edu/lam.html</A>.
|
|
</LI>
|
|
<LI>MPICH (MPI CHameleon) is designed as a highly portable full
|
|
implementation of the MPI 1.1 standard. Like LAM, it allows MPI
|
|
programs to be executed within an individual Linux system or across a
|
|
cluster of Linux systems using UDP/TCP socket communication. However,
|
|
the emphasis is definitely on promoting MPI by providing an efficient,
|
|
easily retargetable, implementation. To port this MPI implementation,
|
|
one implements either the five functions of the "channel interface"
|
|
or, for better performance, the full MPICH ADI (Abstract Device
|
|
Interface). MPICH, and lots of information about it and porting, are
|
|
available from
|
|
<A HREF="http://www.mcs.anl.gov/mpi/mpich/">http://www.mcs.anl.gov/mpi/mpich/</A>.
|
|
</LI>
|
|
<LI>AFMPI (Aggregate Function MPI) is a subset implementation of the
|
|
MPI 2.0 standard. This is the one that I wrote. Built on top of the
|
|
AFAPI, it is designed to showcase low-latency collective communication
|
|
functions and RMAs, and thus provides only minimal support for MPI
|
|
data types, communicators, etc. It allows C programs using MPI to run
|
|
on an individual Linux system or across a cluster connected by
|
|
AFAPI-capable network hardware. It is freely available from
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>.</LI>
|
|
</UL>
|
|
<P>No matter which of these (or other) MPI implementations one uses, it
|
|
is fairly simple to perform the most common types of communications.
|
|
<P>However, MPI 2.0 incorporates several communication paradigms that are
|
|
fundamentally different enough so that a programmer using one of them
|
|
might not even recognize the other coding styles as MPI. Thus, rather
|
|
than giving a single example program, it is useful to have an example
|
|
of each of the fundamentally different communication paradigms that
|
|
MPI supports. All three programs implement the same basic algorithm
|
|
(from section 1.3) that is used throughout this HOWTO to compute the
|
|
value of Pi.
|
|
<P>The first MPI program uses basic MPI message-passing calls for each
|
|
processor to send its partial sum to processor 0, which sums and
|
|
prints the result:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
#include <stdlib.h>
|
|
#include <stdio.h>
|
|
#include <mpi.h>
|
|
|
|
main(int argc, char **argv)
|
|
{
|
|
register double width;
|
|
double sum, lsum;
|
|
register int intervals, i;
|
|
int nproc, iproc;
|
|
MPI_Status status;
|
|
|
|
if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
|
|
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
|
|
MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
|
|
intervals = atoi(argv[1]);
|
|
width = 1.0 / intervals;
|
|
lsum = 0;
|
|
for (i=iproc; i<intervals; i+=nproc) {
|
|
register double x = (i + 0.5) * width;
|
|
lsum += 4.0 / (1.0 + x * x);
|
|
}
|
|
lsum *= width;
|
|
if (iproc != 0) {
|
|
MPI_Send(&lbuf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
|
|
} else {
|
|
sum = lsum;
|
|
for (i=1; i<nproc; ++i) {
|
|
MPI_Recv(&lbuf, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
|
|
MPI_ANY_TAG, MPI_COMM_WORLD, &status);
|
|
sum += lsum;
|
|
}
|
|
printf("Estimation of pi is %f\n", sum);
|
|
}
|
|
MPI_Finalize();
|
|
return(0);
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>The second MPI version uses collective communication (which, for this
|
|
particular application, is clearly the most appropriate):
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
#include <stdlib.h>
|
|
#include <stdio.h>
|
|
#include <mpi.h>
|
|
|
|
main(int argc, char **argv)
|
|
{
|
|
register double width;
|
|
double sum, lsum;
|
|
register int intervals, i;
|
|
int nproc, iproc;
|
|
|
|
if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
|
|
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
|
|
MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
|
|
intervals = atoi(argv[1]);
|
|
width = 1.0 / intervals;
|
|
lsum = 0;
|
|
for (i=iproc; i<intervals; i+=nproc) {
|
|
register double x = (i + 0.5) * width;
|
|
lsum += 4.0 / (1.0 + x * x);
|
|
}
|
|
lsum *= width;
|
|
MPI_Reduce(&lsum, &sum, 1, MPI_DOUBLE,
|
|
MPI_SUM, 0, MPI_COMM_WORLD);
|
|
if (iproc == 0) {
|
|
printf("Estimation of pi is %f\n", sum);
|
|
}
|
|
MPI_Finalize();
|
|
return(0);
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>The third MPI version uses the MPI 2.0 RMA mechanism for each processor
|
|
to add its local <CODE>lsum</CODE> into <CODE>sum</CODE> on processor 0:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
#include <stdlib.h>
|
|
#include <stdio.h>
|
|
#include <mpi.h>
|
|
|
|
main(int argc, char **argv)
|
|
{
|
|
register double width;
|
|
double sum = 0, lsum;
|
|
register int intervals, i;
|
|
int nproc, iproc;
|
|
MPI_Win sum_win;
|
|
|
|
if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
|
|
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
|
|
MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
|
|
MPI_Win_create(&sum, sizeof(sum), sizeof(sum),
|
|
0, MPI_COMM_WORLD, &sum_win);
|
|
MPI_Win_fence(0, sum_win);
|
|
intervals = atoi(argv[1]);
|
|
width = 1.0 / intervals;
|
|
lsum = 0;
|
|
for (i=iproc; i<intervals; i+=nproc) {
|
|
register double x = (i + 0.5) * width;
|
|
lsum += 4.0 / (1.0 + x * x);
|
|
}
|
|
lsum *= width;
|
|
MPI_Accumulate(&lsum, 1, MPI_DOUBLE, 0, 0,
|
|
1, MPI_DOUBLE, MPI_SUM, sum_win);
|
|
MPI_Win_fence(0, sum_win);
|
|
if (iproc == 0) {
|
|
printf("Estimation of pi is %f\n", sum);
|
|
}
|
|
MPI_Finalize();
|
|
return(0);
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>It is useful to note that the MPI 2.0 RMA mechanism very neatly
|
|
overcomes any potential problems with the corresponding data structure
|
|
on various processors residing at different memory locations. This is
|
|
done by referencing a "window" that implies the base address,
|
|
protection against out-of-bound accesses, and even address scaling.
|
|
Efficient implementation is aided by the fact that RMA processing may
|
|
be delayed until the next <CODE>MPI_Win_fence</CODE>. In
|
|
summary, the RMA mechanism may be a strange cross between distributed
|
|
shared memory and message passing, but it is a very clean interface
|
|
that potentially generates very efficient communication.
|
|
<P>
|
|
<H2><A NAME="ss3.6">3.6 AFAPI (Aggregate Function API)</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Unlike PVM, MPI, etc., the AFAPI (Aggregate Function Application
|
|
Program Interface) did not start life as an attempt to build a
|
|
portable abstract interface layered on top of existing network
|
|
hardware and software. Rather, AFAPI began as the very
|
|
hardware-specific low-level support library for PAPERS (Purdue's
|
|
Adapter for Parallel Execution and Rapid Synchronization; see
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>).
|
|
<P>PAPERS was discussed briefly in section 3.2; it is a public domain
|
|
design custom aggregate function network that delivers latencies as
|
|
low as a few microseconds. However, the important thing about PAPERS
|
|
is that it was developed as an attempt to build a supercomputer that
|
|
would be a better target for compiler technology than existing
|
|
supercomputers. This is qualitatively different from most Linux
|
|
cluster efforts and PVM/MPI, which generally focus on trying to use
|
|
standard networks for the relatively few sufficiently coarse-grain
|
|
parallel applications. The fact that Linux PCs are used as components
|
|
of PAPERS systems is simply an artifact of implementing prototypes in
|
|
the most cost-effective way possible.
|
|
<P>The need for a common low-level software interface across more than a
|
|
dozen different prototype implementations was what made the PAPERS
|
|
library become standardized as AFAPI. However, the model used by
|
|
AFAPI is inherently simpler and better suited for the finer-grain
|
|
interactions typical of code compiled by parallelizing compilers or
|
|
written for SIMD architectures. The simplicity of the model not only
|
|
makes PAPERS hardware easy to build, but also yields surprisingly
|
|
efficient AFAPI ports for a variety of other hardware systems, such as
|
|
SMPs.
|
|
<P>AFAPI currently runs on Linux clusters connected using TTL_PAPERS,
|
|
CAPERS, or WAPERS. It also runs (without OS calls or even bus-lock
|
|
instructions, see section 2.2) on SMP systems using a System V Shared
|
|
Memory library called SHMAPERS. A version that runs across Linux
|
|
clusters using UDP broadcasts on conventional networks (e.g.,
|
|
Ethernet) is under development. All released versions are available
|
|
from
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>. All versions
|
|
of the AFAPI are designed to be called from C or C++.
|
|
<P>The following example program is the AFAPI version of the Pi
|
|
computation described in section 1.3.
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
#include <stdlib.h>
|
|
#include <stdio.h>
|
|
#include "afapi.h"
|
|
|
|
main(int argc, char **argv)
|
|
{
|
|
register double width, sum;
|
|
register int intervals, i;
|
|
|
|
if (p_init()) exit(1);
|
|
|
|
intervals = atoi(argv[1]);
|
|
width = 1.0 / intervals;
|
|
|
|
sum = 0;
|
|
for (i=IPROC; i<intervals; i+=NPROC) {
|
|
register double x = (i + 0.5) * width;
|
|
sum += 4.0 / (1.0 + x * x);
|
|
}
|
|
|
|
sum = p_reduceAdd64f(sum) * width;
|
|
|
|
if (IPROC == CPROC) {
|
|
printf("Estimation of pi is %f\n", sum);
|
|
}
|
|
|
|
p_exit();
|
|
return(0);
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>
|
|
<H2><A NAME="ss3.7">3.7 Other Cluster Support Libraries</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>In addition to PVM, MPI, and AFAPI, the following libraries offer
|
|
features that may be useful in parallel computing using a cluster of
|
|
Linux systems. These systems are given a lighter treatment in this
|
|
document simply because, unlike PVM, MPI, and AFAPI, I have little or
|
|
no direct experience with the use of these systems on Linux clusters.
|
|
If you find any of these or other libraries to be especially useful,
|
|
please send email to me at
|
|
<A HREF="mailto:hankd@engr.uky.edu">hankd@engr.uky.edu</A> describing what you've found, and I will
|
|
consider adding an expanded section on that library.
|
|
<P>
|
|
<H3>Condor (process migration support)</H3>
|
|
|
|
<P>
|
|
<P>Condor is a distributed resource management system that can manage
|
|
large heterogeneous clusters of workstations. Its design has been
|
|
motivated by the needs of users who would like to use the unutilized
|
|
capacity of such clusters for their long-running,
|
|
computation-intensive jobs. Condor preserves a large measure of the
|
|
originating machine's environment on the execution machine, even if
|
|
the originating and execution machines do not share a common file
|
|
system and/or password mechanisms. Condor jobs that consist of a
|
|
single process are automatically checkpointed and migrated between
|
|
workstations as needed to ensure eventual completion.
|
|
<P>Condor is available at
|
|
<A HREF="http://www.cs.wisc.edu/condor/">http://www.cs.wisc.edu/condor/</A>. A
|
|
Linux port exists; more information is available at
|
|
<A HREF="http://www.cs.wisc.edu/condor/linux/linux.html">http://www.cs.wisc.edu/condor/linux/linux.html</A>. Contact
|
|
<A HREF="mailto:condor-admin@cs.wisc.edu">condor-admin@cs.wisc.edu</A> for details.
|
|
<P>
|
|
<H3>DFN-RPC (German Research Network - Remote Procedure Call)</H3>
|
|
|
|
<P>
|
|
<P>The DFN-RPC, a (German Research Network Remote Procedure Call) tool,
|
|
was developed to distribute and parallelize scientific-technical
|
|
application programs between a workstation and a compute server or a
|
|
cluster. The interface is optimized for applications written in
|
|
fortran, but the DFN-RPC can also be used in a C environment. It has
|
|
been ported to Linux. More information is at
|
|
<A HREF="ftp://ftp.uni-stuttgart.de/pub/rus/dfn_rpc/README_dfnrpc.html">ftp://ftp.uni-stuttgart.de/pub/rus/dfn_rpc/README_dfnrpc.html</A>.
|
|
<P>
|
|
<H3>DQS (Distributed Queueing System)</H3>
|
|
|
|
<P>
|
|
<P>Not exactly a library, DQS 3.0 (Distributed Queueing System) is a job
|
|
queueing system that has been developed and tested under Linux. It is
|
|
designed to allow both use and administration of a heterogeneous
|
|
cluster as a single entity. It is available from
|
|
<A HREF="http://www.scri.fsu.edu/~pasko/dqs.html">http://www.scri.fsu.edu/~pasko/dqs.html</A>.
|
|
<P>There is also a commercial version called CODINE 4.1.1 (COmputing in
|
|
DIstributed Network Environments). Information on it is available
|
|
from
|
|
<A HREF="http://www.genias.de/genias_welcome.html">http://www.genias.de/genias_welcome.html</A>.
|
|
<P>
|
|
<H2><A NAME="ss3.8">3.8 General Cluster References</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Because clusters can be constructed and used in so many different ways,
|
|
there are quite a few groups that have made interesting contributions.
|
|
The following are references to various cluster-related projects that
|
|
may be of general interest. This includes a mix of Linux-specific and
|
|
generic cluster references. The list is given in alphabetical order.
|
|
<P>
|
|
<H3>Beowulf</H3>
|
|
|
|
<P>
|
|
<P>The Beowulf project,
|
|
<A HREF="http://cesdis1.gsfc.nasa.gov/beowulf/">http://cesdis1.gsfc.nasa.gov/beowulf/</A>, centers on production of
|
|
software for using off-the-shelf clustered workstations based on
|
|
commodity PC-class hardware, a high-bandwidth cluster-internal
|
|
network, and the Linux operating system.
|
|
<P>Thomas Sterling has been the driving force behind Beowulf, and
|
|
continues to be an eloquent and outspoken proponent of Linux
|
|
clustering for scientific computing in general. In fact, many groups
|
|
now refer to their clusters as "Beowulf class" systems - even if the
|
|
cluster isn't really all that similar to the official Beowulf design.
|
|
<P>Don Becker, working in support of the Beowulf project, has produced
|
|
many of the network drivers used by Linux in general. Many of these
|
|
drivers have even been adapted for use in BSD. Don also is
|
|
responsible for many of these Linux network drivers allowing
|
|
load-sharing across multiple parallel connections to achieve higher
|
|
bandwidth without expensive switched hubs. This type of load sharing
|
|
was the original distinguishing feature of the Beowulf cluster.
|
|
<P>
|
|
<H3>Linux/AP+</H3>
|
|
|
|
<P>
|
|
<P>The Linux/AP+ project,
|
|
<A HREF="http://cap.anu.edu.au/cap/projects/linux/">http://cap.anu.edu.au/cap/projects/linux/</A>, is not exactly about
|
|
Linux clustering, but centers on porting Linux to the Fujitsu AP1000+
|
|
and adding appropriate parallel processing enhancements. The AP1000+
|
|
is a commercially available SPARC-based parallel machine that uses a
|
|
custom network with a torus topology, 25 MB/s bandwidth, and 10
|
|
microsecond latency... in short, it looks a lot like a SPARC Linux
|
|
cluster.
|
|
<P>
|
|
<H3>Locust</H3>
|
|
|
|
<P>
|
|
<P>The Locust project,
|
|
<A HREF="http://www.ecsl.cs.sunysb.edu/~manish/locust/">http://www.ecsl.cs.sunysb.edu/~manish/locust/</A>, is building a
|
|
distributed virtual shared memory system that uses compile-time
|
|
information to hide message-latency and to reduce network traffic at
|
|
run time. Pupa is the underlying communication subsystem of Locust,
|
|
and is implemented using Ethernet to connect 486 PCs under FreeBSD.
|
|
Linux?
|
|
<P>
|
|
<H3>Midway DSM (Distributed Shared Memory)</H3>
|
|
|
|
<P>
|
|
<P>Midway,
|
|
<A HREF="http://www.cs.cmu.edu/afs/cs.cmu.edu/project/midway/WWW/HomePage.html">http://www.cs.cmu.edu/afs/cs.cmu.edu/project/midway/WWW/HomePage.html</A>,
|
|
is a software-based DSM (Distributed Shared Memory) system, not unlike
|
|
TreadMarks. The good news is that it uses compile-time aids rather
|
|
than relatively slow page-fault mechanisms, and it is free. The bad
|
|
news is that it doesn't run on Linux clusters.
|
|
<P>
|
|
<H3>Mosix</H3>
|
|
|
|
<P>
|
|
<P>MOSIX modifies the BSDI BSD/OS to provide dynamic load balancing and
|
|
preemptive process migration across a networked group of PCs. This is
|
|
nice stuff not just for parallel processing, but for generally using a
|
|
cluster much like a scalable SMP. Will there be a Linux version? Look
|
|
at
|
|
<A HREF="http://www.cs.huji.ac.il/mosix/">http://www.cs.huji.ac.il/mosix/</A> for more information.
|
|
<P>
|
|
<H3>NOW (Network Of Workstations)</H3>
|
|
|
|
<P>
|
|
<P>The Berkeley NOW (Network Of Workstations) project,
|
|
<A HREF="http://now.cs.berkeley.edu/">http://now.cs.berkeley.edu/</A>, has led much of the push toward
|
|
parallel computing using networks of workstations. There is a lot
|
|
work going on here, all aimed toward "demonstrating a practical 100
|
|
processor system in the next few years." Alas, they don't use Linux.
|
|
<P>
|
|
<H3>Parallel Processing Using Linux</H3>
|
|
|
|
<P>
|
|
<P>The parallel processing using Linux WWW site,
|
|
<A HREF="http://aggregate.org/LDP/">http://aggregate.org/LDP/</A>, is the home of this HOWTO
|
|
and many related documents including online slides for a full-day
|
|
tutorial. Aside from the work on the PAPERS project, the Purdue
|
|
University School of Electrical and Computer Engineering generally has
|
|
been a leader in parallel processing; this site was established to
|
|
help others apply Linux PCs for parallel processing.
|
|
<P>Since Purdue's first cluster of Linux PCs was assembled in February
|
|
1994, there have been many Linux PC clusters assembled at Purdue,
|
|
including several with video walls. Although these clusters used 386,
|
|
486, and Pentium systems (no Pentium Pro systems), Intel recently
|
|
awarded Purdue a donation which will allow it to construct multiple
|
|
large clusters of Pentium II systems (with as many as 165 machines
|
|
planned for a single cluster). Although these clusters all have/will
|
|
have PAPERS networks, most also have conventional networks.
|
|
<P>
|
|
<H3>Pentium Pro Cluster Workshop</H3>
|
|
|
|
<P>
|
|
<P>In Des Moines, Iowa, April 10-11, 1997, AMES Laboratory held the
|
|
Pentium Pro Cluster Workshop. The WWW site from this workshop,
|
|
<A HREF="http://www.scl.ameslab.gov/workshops/PPCworkshop.html">http://www.scl.ameslab.gov/workshops/PPCworkshop.html</A>, contains
|
|
a wealth of PC cluster information gathered from all the attendees.
|
|
<P>
|
|
<H3>TreadMarks DSM (Distributed Shared Memory)</H3>
|
|
|
|
<P>
|
|
<P>DSM (Distributed Shared Memory) is a technique whereby a
|
|
message-passing system can appear to behave as an SMP. There are
|
|
quite a few such systems, most of which use the OS page-fault mechanism
|
|
to trigger message transmissions. TreadMarks,
|
|
<A HREF="http://www.cs.rice.edu/~willy/TreadMarks/overview.html">http://www.cs.rice.edu/~willy/TreadMarks/overview.html</A>, is one
|
|
of the more efficient of such systems and does run on Linux clusters.
|
|
The bad news is "TreadMarks is being distributed at a small cost to
|
|
universities and nonprofit institutions." For more information about
|
|
the software, contact
|
|
<A HREF="mailto:treadmarks@ece.rice.edu">treadmarks@ece.rice.edu</A>.
|
|
<P>
|
|
<H3>U-Net (User-level NETwork interface architecture)</H3>
|
|
|
|
<P>
|
|
<P>The U-Net (User-level NETwork interface architecture) project at
|
|
Cornell,
|
|
<A HREF="http://www2.cs.cornell.edu/U-Net/Default.html">http://www2.cs.cornell.edu/U-Net/Default.html</A>,
|
|
attempts to provide low-latency and high-bandwidth using commodity
|
|
network hardware by by virtualizing the network interface so that
|
|
applications can send and receive messages without operating system
|
|
intervention. U-Net runs on Linux PCs using a DECchip DC21140 based
|
|
Fast Ethernet card or a Fore Systems PCA-200 (not PCA-200E) ATM card.
|
|
<P>
|
|
<H3>WWT (Wisconsin Wind Tunnel)</H3>
|
|
|
|
<P>
|
|
<P>There is really quite a lot of cluster-related work at Wisconsin. The
|
|
WWT (Wisconsin Wind Tunnel) project,
|
|
<A HREF="http://www.cs.wisc.edu/~wwt/">http://www.cs.wisc.edu/~wwt/</A>, is doing all sorts of work toward
|
|
developing a "standard" interface between compilers and the underlying
|
|
parallel hardware. There is the Wisconsin COW (Cluster Of
|
|
Workstations), Cooperative Shared Memory and Tempest, the Paradyn
|
|
Parallel Performance Tools, etc. Unfortunately, there is not much
|
|
about Linux.
|
|
<P>
|
|
<HR>
|
|
<H2><A NAME="s4">4.</A> <A HREF="#toc4">SIMD Within A Register (e.g., using MMX)</A></H2>
|
|
|
|
<P>
|
|
<P>SIMD (Single Instruction stream, Multiple Data stream) Within A
|
|
Register (SWAR) isn't a new idea. Given a machine with <EM>k</EM>-bit
|
|
registers, data paths, and function units, it has long been known that
|
|
ordinary register operations can function as SIMD parallel operations
|
|
on <EM>n</EM>, <EM>k</EM>/<EM>n</EM>-bit, integer field values.
|
|
However, it is only with the recent push for multimedia that the 2x to
|
|
8x speedup offered by SWAR techniques has become a concern for
|
|
mainstream computing. The 1997 versions of most microprocessors
|
|
incorporate hardware support for SWAR:
|
|
<P>
|
|
<UL>
|
|
<LI>
|
|
<A HREF="http://www.amd.com/html/products/pcd/techdocs/appnotes/20726a.pdf">AMD K6 MMX (MultiMedia eXtensions)</A>
|
|
</LI>
|
|
<LI>
|
|
<A HREF="http://www.cyrix.com:80/process/SUPPORT/isv.htm">Cyrix M2 MMX (MultiMedia eXtensions)</A>
|
|
</LI>
|
|
<LI>
|
|
<A HREF="http://ftp.digital.com/pub/Digital/info/semiconductor/literature/alphahb2.pdf">Digital Alpha MAX (MultimediA eXtensions)</A>
|
|
</LI>
|
|
<LI>
|
|
<A HREF="http://hpcc997.external.hp.com:80/wsg/strategies/pa2go3/pa2go3.html">Hewlett-Packard PA-RISC MAX (Multimedia Acceleration eXtensions)</A>
|
|
</LI>
|
|
<LI>
|
|
<A HREF="http://developer.intel.com/drg/mmx/">Intel Pentium II & Pentium with MMX (MultiMedia eXtensions)</A>
|
|
</LI>
|
|
<LI>
|
|
<A HREF="http://www.microunity.com/www/mediaprc.htm">Microunity Mediaprocessor SIGD (Single Instruction on Groups of Data)</A>
|
|
</LI>
|
|
<LI>
|
|
<A HREF="http://www.mips.com/arch/ISA5/">MIPS Digital Media eXtension (MDMX, pronounced Mad Max)</A>
|
|
</LI>
|
|
<LI>
|
|
<A HREF="http://www.sun.com/sparc/vis/index.html">Sun SPARC V9 VIS (Visual Instruction Set)</A></LI>
|
|
</UL>
|
|
<P>There are a few holes in the hardware support provided by the new
|
|
microprocessors, quirks like only supporting some operations for some
|
|
field sizes. It is important to remember, however, that you don't
|
|
need any hardware support for many SWAR operations to be efficient.
|
|
For example, bitwise operations are not affected by the logical
|
|
partitioning of a register.
|
|
<P>
|
|
<H2><A NAME="ss4.1">4.1 SWAR: What Is It Good For?</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Although <EM>every</EM> modern processor is capable of executing with
|
|
at least some SWAR parallelism, the sad fact is that even the best
|
|
SWAR-enhanced instruction sets do not support very general-purpose
|
|
parallelism. In fact, many people have noticed that the performance
|
|
difference between Pentium and "Pentium with MMX technology" is often
|
|
due to things like the larger L1 cache that coincided with appearance
|
|
of MMX. So, realistically, what is SWAR (or MMX) good for?
|
|
<P>
|
|
<UL>
|
|
<LI>Integers only, the smaller the better. Two 32-bit values fit in
|
|
a 64-bit MMX register, but so do eight one-byte characters or even an
|
|
entire chess board worth of one-bit values.
|
|
|
|
Note: there <EM>will be a floating-point version of MMX</EM>, although
|
|
very little has been said about it at this writing. Cyrix has posted
|
|
a set of slides,
|
|
<A HREF="ftp://ftp.cyrix.com/developr/mpf97rm.pdf">ftp://ftp.cyrix.com/developr/mpf97rm.pdf</A>,
|
|
that includes a few comments about <B>MMFP</B>. Apparently, MMFP
|
|
will support two 32-bit floating-point numbers to be packed into a
|
|
64-bit MMX register; combining this with two MMFP pipelines will yield
|
|
four single-precision FLOPs per clock.
|
|
</LI>
|
|
<LI>SIMD or vector-style parallelism. The same operation is applied
|
|
to all fields simultaneously. There are ways to nullify the effects on
|
|
selected fields (i.e., equivalent to SIMD enable masking), but they
|
|
complicate coding and hurt performance.
|
|
</LI>
|
|
<LI>Localized, regular (preferably packed), memory reference
|
|
patterns. SWAR in general, and MMX in particular, are terrible at
|
|
randomly-ordered accesses; gathering a vector <CODE>x[y]</CODE> (where
|
|
<CODE>y</CODE> is an index array) is prohibitively expensive.</LI>
|
|
</UL>
|
|
<P>These are serious restrictions, but this type of parallelism occurs in
|
|
many parallel algorithms - not just multimedia applications. For the
|
|
right type of algorithm, SWAR is more effective than SMP or cluster
|
|
parallelism... and it doesn't cost anything to use it.
|
|
<P>
|
|
<H2><A NAME="ss4.2">4.2 Introduction To SWAR Programming</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>The basic concept of SWAR, SIMD Within A Register, is that operations
|
|
on word-length registers can be used to speed-up computations by
|
|
performing SIMD parallel operations on <EM>n</EM>
|
|
<EM>k</EM>/<EM>n</EM>-bit field values. However, making use of
|
|
SWAR technology can be awkward, and some SWAR operations are actually
|
|
more expensive than the corresponding sequences of serial operations
|
|
because they require additional instructions to enforce the field
|
|
partitioning.
|
|
<P>To illustrate this point, let's consider a greatly simplified SWAR
|
|
mechanism that manages four 8-bit fields within each 32-bit register.
|
|
The values in two registers might be represented as:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
PE3 PE2 PE1 PE0
|
|
+-------+-------+-------+-------+
|
|
Reg0 | D 7:0 | C 7:0 | B 7:0 | A 7:0 |
|
|
+-------+-------+-------+-------+
|
|
Reg1 | H 7:0 | G 7:0 | F 7:0 | E 7:0 |
|
|
+-------+-------+-------+-------+
|
|
</PRE>
|
|
<HR>
|
|
<P>This simply indicates that each register is viewed as essentially a
|
|
vector of four independent 8-bit integer values. Alternatively, think
|
|
of <CODE>A</CODE> and <CODE>E</CODE> as values in Reg0 and Reg1 of processing
|
|
element 0 (PE0), <CODE>B</CODE> and <CODE>F</CODE> as values in PE1's
|
|
registers, and so forth.
|
|
<P>The remainder of this document briefly reviews the basic classes of
|
|
SIMD parallel operations on these integer vectors and how these
|
|
functions can be implemented.
|
|
<P>
|
|
<H3>Polymorphic Operations</H3>
|
|
|
|
<P>
|
|
<P>Some SWAR operations can be performed trivially using ordinary 32-bit
|
|
integer operations, without concern for the fact that the operation is
|
|
really intended to operate independently in parallel on these 8-bit
|
|
fields. We call any such SWAR operation <EM>polymorphic</EM>, since
|
|
the function is unaffected by the field types (sizes).
|
|
<P>Testing if any field is non-zero is polymorphic, as are all bitwise
|
|
logic operations. For example, an ordinary bitwise-and operation (C's
|
|
<CODE>&</CODE> operator) performs a bitwise and no matter what the
|
|
field sizes are. A simple bitwise and of the above registers yields:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
PE3 PE2 PE1 PE0
|
|
+---------+---------+---------+---------+
|
|
Reg2 | D&H 7:0 | C&G 7:0 | B&F 7:0 | A&E 7:0 |
|
|
+---------+---------+---------+---------+
|
|
</PRE>
|
|
<HR>
|
|
<P>Because the bitwise and operation always has the value of result bit
|
|
<EM>k</EM> affected only by the values of the operand bit <EM>k</EM>
|
|
values, all field sizes are supported using the same single
|
|
instruction.
|
|
<P>
|
|
<H3>Partitioned Operations</H3>
|
|
|
|
<P>
|
|
<P>Unfortunately, lots of important SWAR operations are not polymorphic.
|
|
Arithmetic operations such as add, subtract, multiply, and divide are
|
|
all subject to carry/borrow interactions between fields. We call such
|
|
SWAR operations <EM>partitioned</EM>, because each such operation must
|
|
effectively partition the operands and result to prevent interactions
|
|
between fields. However, there are actually three different methods
|
|
that can be used to achieve this effect.
|
|
<P>
|
|
<H3>Partitioned Instructions</H3>
|
|
|
|
<P>
|
|
<P>Perhaps the most obvious approach to implementing partitioned
|
|
operations is to provide hardware support for "partitioned parallel
|
|
instructions" that cut the carry/borrow logic between fields. This
|
|
approach can yield the highest performance, but it requires a change
|
|
to the processor's instruction set and generally places many
|
|
restrictions on field size (e.g., 8-bit fields might be supported, but
|
|
not 12-bit fields).
|
|
<P>The AMD/Cyrix/Intel MMX, Digital MAX, HP MAX, and Sun VIS all
|
|
implement restricted versions of partitioned instructions.
|
|
Unfortunately, these different instruction set extensions have
|
|
significantly different restrictions, making algorithms somewhat
|
|
non-portable between them. For example, consider the following
|
|
sampling of partitioned operations:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
Instruction AMD/Cyrix/Intel MMX DEC MAX HP MAX Sun VIS
|
|
+---------------------+---------------------+---------+--------+---------+
|
|
| Absolute Difference | | 8 | | 8 |
|
|
+---------------------+---------------------+---------+--------+---------+
|
|
| Merge Maximum | | 8, 16 | | |
|
|
+---------------------+---------------------+---------+--------+---------+
|
|
| Compare | 8, 16, 32 | | | 16, 32 |
|
|
+---------------------+---------------------+---------+--------+---------+
|
|
| Multiply | 16 | | | 8x16 |
|
|
+---------------------+---------------------+---------+--------+---------+
|
|
| Add | 8, 16, 32 | | 16 | 16, 32 |
|
|
+---------------------+---------------------+---------+--------+---------+
|
|
</PRE>
|
|
<HR>
|
|
<P>In the table, the numbers indicate the field sizes, in bits, for which
|
|
each operation is supported. Even though the table omits many
|
|
instructions including all the more exotic ones, it is clear that
|
|
there are many differences. The direct result is that high-level
|
|
languages (HLLs) really are not very effective as programming models,
|
|
and portability is generally poor.
|
|
<P>
|
|
<H3>Unpartitioned Operations With Correction Code</H3>
|
|
|
|
<P>
|
|
<P>Implementing partitioned operations using partitioned instructions can
|
|
certainly be efficient, but what do you do if the partitioned
|
|
operation you need is not supported by the hardware? The answer is
|
|
that you use a series of ordinary instructions to perform the operation
|
|
with carry/borrow across fields, and then correct for the undesired
|
|
field interactions.
|
|
<P>This is a purely software approach, and the corrections do introduce
|
|
overhead, but it works with fully general field partitioning. This
|
|
approach is also fully general in that it can be used either to fill
|
|
gaps in the hardware support for partitioned instructions, or it can
|
|
be used to provide full functionality for target machines that have no
|
|
hardware support at all. In fact, by expressing the code sequences in
|
|
a language like C, this approach allows SWAR programs to be fully
|
|
portable.
|
|
<P>The question immediately arises: precisely how inefficient is it to
|
|
simulate SWAR partitioned operations using unpartitioned operations
|
|
with correction code? Well, that is certainly the $64k question...
|
|
but many operations are not as difficult as one might expect.
|
|
<P>Consider implementing a four-element 8-bit integer vector add of two
|
|
source vectors, <CODE>x</CODE>+<CODE>y</CODE>, using ordinary 32-bit
|
|
operations.
|
|
<P>An ordinary 32-bit add might actually yield the correct result, but
|
|
not if any 8-bit field carries into the next field. Thus, our goal is
|
|
simply to ensure that such a carry does not occur. Because adding two
|
|
<EM>k</EM>-bit fields generates an at most <EM>k</EM>+1 bit
|
|
result, we can ensure that no carry occurs by simply "masking out" the
|
|
most significant bit of each field. This is done by bitwise anding
|
|
each operand with <CODE>0x7f7f7f7f</CODE> and then performing an
|
|
ordinary 32-bit add.
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
t = ((x & 0x7f7f7f7f) + (y & 0x7f7f7f7f));
|
|
</PRE>
|
|
<HR>
|
|
<P>That result is correct... except for the most significant bit within
|
|
each field. Computing the correct value for each field is simply a
|
|
matter of doing two 1-bit partitioned adds of the most significant
|
|
bits from <CODE>x</CODE> and <CODE>y</CODE> to the 7-bit carry result
|
|
which was computed for <CODE>t</CODE>. Fortunately, a 1-bit
|
|
partitioned add is implemented by an ordinary exclusive or operation.
|
|
Thus, the result is simply:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
(t ^ ((x ^ y) & 0x80808080))
|
|
</PRE>
|
|
<HR>
|
|
<P>Ok, well, maybe that isn't so simple. After all, it is six operations
|
|
to do just four adds. However, notice that the number of operations
|
|
is not a function of how many fields there are... so, with more
|
|
fields, we get speedup. In fact, we may get speedup anyway simply
|
|
because the fields were loaded and stored in a single (integer vector)
|
|
operation, register availability may be improved, and there are fewer
|
|
dynamic code scheduling dependencies (because partial word references
|
|
are avoided).
|
|
<P>
|
|
<H3>Controlling Field Values</H3>
|
|
|
|
<P>
|
|
<P>While the other two approaches to partitioned operation implementation
|
|
both center on getting the maximum space utilization for the registers,
|
|
it can be computationally more efficient to instead control the field
|
|
values so that inter-field carry/borrow events should never occur.
|
|
For example, if we know that all the field values being added are such
|
|
that no field overflow will occur, a partitioned add operation can be
|
|
implemented using an ordinary add instruction; in fact, given this
|
|
constraint, an ordinary add instruction appears polymorphic, and is
|
|
usable for any field sizes without correction code. The question
|
|
thus becomes how to ensure that field values will not cause
|
|
carry/borrow events.
|
|
<P>One way to ensure this property is to implement partitioned
|
|
instructions that can restrict the range of field values. The Digital
|
|
MAX vector minimum and maximum instructions can be viewed as hardware
|
|
support for clipping field values to avoid inter-field carry/borrow.
|
|
<P>However, suppose that we do not have partitioned instructions that can
|
|
efficiently restrict the range of field values... is there a
|
|
sufficient condition that can be cheaply imposed to ensure
|
|
carry/borrow events will not interfere with adjacent fields? The
|
|
answer lies in analysis of the arithmetic properties. Adding two
|
|
<EM>k</EM>-bit numbers generates a result with at most
|
|
<EM>k</EM>+1 bits; thus, a field of <EM>k</EM>+1 bits can safely
|
|
contain such an operation despite using ordinary instructions.
|
|
<P>Thus, suppose that the 8-bit fields in our earlier example are now
|
|
7-bit fields with 1-bit "carry/borrow spacers":
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
PE3 PE2 PE1 PE0
|
|
+----+-------+----+-------+----+-------+----+-------+
|
|
Reg0 | D' | D 6:0 | C' | C 6:0 | B' | B 6:0 | A' | A 6:0 |
|
|
+----+-------+----+-------+----+-------+----+-------+
|
|
</PRE>
|
|
<HR>
|
|
<P>A vector of 7-bit adds is performed as follows. Let us assume that,
|
|
prior to the start of any partitioned operation, all the carry spacer
|
|
bits (<CODE>A'</CODE>, <CODE>B'</CODE>, <CODE>C'</CODE>, and <CODE>D'</CODE>) have the
|
|
value 0. By simply executing an ordinary add operation, all the
|
|
fields obtain the correct 7-bit values; however, some spacer bit
|
|
values might now be 1. We can correct this by just one more
|
|
conventional operation, masking-out the spacer bits. Our 7-bit
|
|
integer vector add, <CODE>x</CODE>+<CODE>y</CODE>, is thus:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
((x + y) & 0x7f7f7f7f)
|
|
</PRE>
|
|
<HR>
|
|
<P>This is just two instructions for four adds, clearly yielding good
|
|
speedup.
|
|
<P>The sharp reader may have noticed that setting the spacer bits to 0
|
|
does not work for subtract operations. The correction is, however,
|
|
remarkably simple. To compute <CODE>x</CODE>-<CODE>y</CODE>, we simply
|
|
ensure the initial condition that the spacers in <CODE>x</CODE> are all
|
|
1, while the spacers in <CODE>y</CODE> are all 0. In the worst case,
|
|
we would thus get:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
(((x | 0x80808080) - y) & 0x7f7f7f7f)
|
|
</PRE>
|
|
<HR>
|
|
<P>However, the additional bitwise or operation can often be optimized
|
|
out by ensuring that the operation generating the value for
|
|
<CODE>x</CODE> used <CODE>| 0x80808080</CODE> rather than <CODE>&
|
|
0x7f7f7f7f</CODE> as the last step.
|
|
<P>Which method should be used for SWAR partitioned operations? The
|
|
answer is simply "whichever yields the best speedup." Interestingly,
|
|
the ideal method to use may be different for different field sizes
|
|
within the same program running on the same machine.
|
|
<P>
|
|
<H3>Communication & Type Conversion Operations</H3>
|
|
|
|
<P>
|
|
<P>Although some parallel computations, including many operations on image
|
|
pixels, have the property that the <EM>i</EM>th value in a vector is
|
|
a function only of values that appear in the <EM>i</EM>th position
|
|
of the operand vectors, this is generally not the case. For example,
|
|
even pixel operations such as smoothing require values from adjacent
|
|
pixels as operands, and transformations like FFTs require more complex
|
|
(less localized) communication patterns.
|
|
<P>It is not difficult to efficiently implement 1-dimensional nearest
|
|
neighbor communication for SWAR using unpartitioned shift operations.
|
|
For example, to move a value from <CODE>PE</CODE><EM>i</EM> to
|
|
<CODE>PE</CODE>(<EM>i</EM>+1), a simple shift operation suffices.
|
|
If the fields are 8-bits in length, we would use:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
(x << 8)
|
|
</PRE>
|
|
<HR>
|
|
<P>Still, it isn't always quite that simple. For example, to move a
|
|
value from <CODE>PE</CODE><EM>i</EM> to
|
|
<CODE>PE</CODE>(<EM>i</EM>-1), a simple shift operation might
|
|
suffice... but the C language does not specify if shifts right
|
|
preserve the sign bit, and some machines only provide signed shift
|
|
right. Thus, in the general case, we must explicitly zero the
|
|
potentially replicated sign bits:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
((x >> 8) & 0x00ffffff)
|
|
</PRE>
|
|
<HR>
|
|
<P>Adding "wrap-around connections" is also reasonably efficient using
|
|
unpartitioned shifts. For example, to move a value from
|
|
<CODE>PE</CODE><EM>i</EM> to <CODE>PE</CODE>(<EM>i</EM>+1) with
|
|
wraparound:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
((x << 8) | ((x >> 24) & 0x000000ff))
|
|
</PRE>
|
|
<HR>
|
|
<P>The real problem comes when more general communication patterns must
|
|
be implemented. Only the HP MAX instruction set supports arbitrary
|
|
rearrangement of fields with a single instruction, which is called
|
|
<CODE>Permute</CODE>. This <CODE>Permute</CODE> instruction is really
|
|
misnamed; not only can it perform an arbitrary permutation of the
|
|
fields, but it also allows repetition. In short, it implements an
|
|
arbitrary <CODE>x[y]</CODE> operation.
|
|
<P>Unfortunately, <CODE>x[y]</CODE> is very difficult to implement without
|
|
such an instruction. The code sequence is generally both long and
|
|
inefficient; in fact, it is sequential code. This is very
|
|
disappointing. The relatively high speed of <CODE>x[y]</CODE>
|
|
operations in the MasPar MP1/MP2 and Thinking Machines CM1/CM2/CM200
|
|
SIMD supercomputers was one of the key reasons these machines performed
|
|
well. However, <CODE>x[y]</CODE> has always been slower than nearest
|
|
neighbor communication, even on those supercomputers, so many
|
|
algorithms have been designed to minimize the need for
|
|
<CODE>x[y]</CODE> operations. In short, without hardware support, it
|
|
is probably best to develop SWAR algorithms as though
|
|
<CODE>x[y]</CODE> wasn't legal... or at least isn't cheap.
|
|
<P>
|
|
<H3>Recurrence Operations (Reductions, Scans, etc.)</H3>
|
|
|
|
<P>
|
|
<P>A recurrence is a computation in which there is an apparently
|
|
sequential relationship between values being computed. However, if
|
|
these recurrences involve associative operations, it may be possible
|
|
to recode the computation using a tree-structured parallel algorithm.
|
|
<P>The most common type of parallelizable recurrence is probably the
|
|
class known as associative reductions. For example, to compute the
|
|
sum of a vector's values, one commonly writes purely sequential C code
|
|
like:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
t = 0;
|
|
for (i=0; i<MAX; ++i) t += x[i];
|
|
</PRE>
|
|
<HR>
|
|
<P>However, the order of the additions is rarely important. Floating
|
|
point and saturation math can yield different answers if the order of
|
|
additions is changed, but ordinary wrap-around integer additions will
|
|
yield the same results independent of addition order. Thus, we can
|
|
re-write this sequence into a tree-structured parallel summation in
|
|
which we first add pairs of values, then pairs of those partial sums,
|
|
and so forth, until a single final sum results. For a vector of four
|
|
8-bit values, just two addition steps are needed; the first step does
|
|
two 8-bit adds, yielding two 16-bit result fields (each containing a
|
|
9-bit result):
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
t = ((x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff));
|
|
</PRE>
|
|
<HR>
|
|
<P>The second step adds these two 9-bit values in 16-bit fields to
|
|
produce a single 10-bit result:
|
|
<P>
|
|
<HR>
|
|
<PRE>
|
|
((t + (t >> 16)) & 0x000003ff)
|
|
</PRE>
|
|
<HR>
|
|
<P>Actually, the second step performs two 16-bit field adds... but the
|
|
top 16-bit add is meaningless, which is why the result is masked to a
|
|
single 10-bit result value.
|
|
<P>Scans, also known as "parallel prefix" operations, are somewhat harder
|
|
to implement efficiently. This is because, unlike reductions, scans
|
|
produce partitioned results. For this reason, scans can be implemented
|
|
using a fairly obvious sequence of partitioned operations.
|
|
<P>
|
|
<H2><A NAME="ss4.3">4.3 MMX SWAR Under Linux</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>For Linux, IA32 processors are our primary concern. The good news is
|
|
that AMD, Cyrix, and Intel all implement the same MMX instructions.
|
|
However, MMX performance varies; for example, the K6 has only one MMX
|
|
pipeline - the Pentium with MMX has two. The only really bad news is
|
|
that Intel is still running those stupid MMX commercials.... ;-)
|
|
<P>There are really three approaches to using MMX for SWAR:
|
|
<P>
|
|
<OL>
|
|
<LI>Use routines from an MMX library. In particular, Intel has
|
|
developed several "performance libraries,"
|
|
<A HREF="http://developer.intel.com/drg/tools/ad.htm">http://developer.intel.com/drg/tools/ad.htm</A>, that offer a
|
|
variety of hand-optimized routines for common multimedia tasks. With
|
|
a little effort, many non-multimedia algorithms can be reworked to
|
|
enable some of the most compute-intensive portions to be implemented
|
|
using one or more of these library routines. These libraries are not
|
|
currently available for Linux, but could be ported.
|
|
</LI>
|
|
<LI>Use MMX instructions directly. This is somewhat complicated by
|
|
two facts. The first problem is that MMX might not be available on
|
|
the processor, so an alternative implementation must also be
|
|
provided. The second problem is that the IA32 assembler generally
|
|
used under Linux does not currently recognize MMX instructions.
|
|
</LI>
|
|
<LI>Use a high-level language or module compiler that can directly
|
|
generate appropriate MMX instructions. Such tools are currently under
|
|
development, but none is yet fully functional under Linux. For
|
|
example, at Purdue University (
|
|
<A HREF="http://dynamo.ecn.purdue.edu/~hankd/SWAR/">http://dynamo.ecn.purdue.edu/~hankd/SWAR/</A>) we are currently
|
|
developing a compiler that will take functions written in an
|
|
explicitly parallel C dialect and will generate SWAR modules that are
|
|
callable as C functions, yet make use of whatever SWAR support is
|
|
available, including MMX. The first prototype module compilers were
|
|
built in Fall 1996, however, bringing this technology to a usable
|
|
state is taking much longer than was originally expected.</LI>
|
|
</OL>
|
|
<P>In summary, MMX SWAR is still awkward to use. However, with a little
|
|
extra effort, the second approach given above can be used now. Here
|
|
are the basics:
|
|
<P>
|
|
<OL>
|
|
<LI>You cannot use MMX if your processor does not support it. The
|
|
following GCC code can be used to test if MMX is supported on your
|
|
processor. It returns 0 if not, non-zero if it is supported.
|
|
|
|
<HR>
|
|
<PRE>
|
|
inline extern
|
|
int mmx_init(void)
|
|
{
|
|
int mmx_available;
|
|
|
|
__asm__ __volatile__ (
|
|
/* Get CPU version information */
|
|
"movl $1, %%eax\n\t"
|
|
"cpuid\n\t"
|
|
"andl $0x800000, %%edx\n\t"
|
|
"movl %%edx, %0"
|
|
: "=q" (mmx_available)
|
|
: /* no input */
|
|
);
|
|
return mmx_available;
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
|
|
</LI>
|
|
<LI>An MMX register essentially holds one of what GCC would call an
|
|
<CODE>unsigned long long</CODE>. Thus, memory-based variables of this type
|
|
become the communication mechanism between your MMX modules and the C
|
|
programs that call them. Alternatively, you can declare your MMX data
|
|
as any 64-bit aligned data structure (it is convenient to ensure
|
|
64-bit alignment by declaring your data type as a <CODE>union</CODE> with
|
|
an <CODE>unsigned long long</CODE> field).
|
|
</LI>
|
|
<LI>If MMX is available, you can write your MMX code using
|
|
the <CODE>.byte</CODE> assembler directive to encode each instruction.
|
|
This is painful stuff to do by hand, but not difficult for a compiler
|
|
to generate. For example, the MMX instruction <CODE>PADDB MM0,MM1</CODE>
|
|
could be encoded as the GCC in-line assembly code:
|
|
|
|
<HR>
|
|
<PRE>
|
|
__asm__ __volatile__ (".byte 0x0f, 0xfc, 0xc1\n\t");
|
|
</PRE>
|
|
<HR>
|
|
|
|
|
|
Remember that MMX uses some of the same hardware that is used for
|
|
floating point operations, so code intermixed with MMX code must not
|
|
invoke any floating point operations. The floating point stack also
|
|
should be empty before executing any MMX code; the floating point
|
|
stack is normally empty at the beginning of a C function that does not
|
|
use floating point.
|
|
</LI>
|
|
<LI>Exit your MMX code by executing the <CODE>EMMS</CODE> instruction,
|
|
which can be encoded as:
|
|
|
|
<HR>
|
|
<PRE>
|
|
__asm__ __volatile__ (".byte 0x0f, 0x77\n\t");
|
|
</PRE>
|
|
<HR>
|
|
</LI>
|
|
</OL>
|
|
<P>If the above looks very awkward and crude, it is. However, MMX is
|
|
still quite young.... future versions of this document will offer
|
|
better ways to program MMX SWAR.
|
|
<P>
|
|
<HR>
|
|
<H2><A NAME="s5">5.</A> <A HREF="#toc5">Linux-Hosted Attached Processors</A></H2>
|
|
|
|
<P>
|
|
<P>Although this approach has recently fallen out of favor, it is
|
|
virtually impossible for other parallel processing methods to achieve
|
|
the low cost and high performance possible by using a Linux system to
|
|
host an attached parallel computing system. The problem is that very
|
|
little software support is available; you are pretty much on your own.
|
|
<P>
|
|
<H2><A NAME="ss5.1">5.1 A Linux PC Is A Good Host</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>In general, attached parallel processors tend to be specialized to
|
|
perform specific types of functions.
|
|
<P>Before becoming discouraged by the fact that you are somewhat on your
|
|
own, it is useful to understand that, although it may be difficult to
|
|
get a Linux PC to appropriately host a particular system, a Linux PC
|
|
is one of the few platforms well suited to this type of use.
|
|
<P>PCs make a good host for two primary reasons. The first is the cheap
|
|
and easy expansion capability; resources such as more memory, disks,
|
|
networks, etc., are trivially added to a PC. The second is the ease
|
|
of interfacing. Not only are ISA and PCI bus prototyping cards widely
|
|
available, but the parallel port offers reasonable performance in a
|
|
completely non-invasive interface. The IA32 separate I/O space also
|
|
facilitates interfacing by providing hardware I/O address protection
|
|
at the level of individual I/O port addresses.
|
|
<P>Linux also makes a good host OS. The free availability of full source
|
|
code, and extensive "hacking" guides, obviously are a tremendous help.
|
|
However, Linux also provides good near-real-time scheduling, and there
|
|
is even a true real-time version of Linux at
|
|
<A HREF="http://luz.cs.nmt.edu/~rtlinux/">http://luz.cs.nmt.edu/~rtlinux/</A>. Perhaps even more important
|
|
is the fact that while providing a full UNIX environment, Linux can
|
|
support development tools that were written to run under Microsoft DOS
|
|
and/or Windows. MSDOS programs can execute within a Linux process
|
|
using <CODE>dosemu</CODE> to provide a protected virtual machine that can
|
|
literally run MSDOS. Linux support for Windows 3.xx programs is even
|
|
more direct: free software such as <CODE>wine</CODE>,
|
|
<A HREF="http://www.linpro.no/wine/">http://www.linpro.no/wine/</A>, simulates Windows 3.11 well enough
|
|
for most programs to execute correctly and efficiently within a UNIX/X
|
|
environment.
|
|
<P>The following two sections give examples of attached parallel systems
|
|
that I'd like to see supported under Linux....
|
|
<P>
|
|
<H2><A NAME="ss5.2">5.2 Did You DSP That?</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>There is a thriving market for high-performance DSP (Digital Signal
|
|
Processing) processors. Although these chips were generally designed
|
|
to be embedded in application-specific systems, they also make great
|
|
attached parallel computers. Why?
|
|
<P>
|
|
<UL>
|
|
<LI>Many of them, such as the Texas Instruments (
|
|
<A HREF="http://www.ti.com/">http://www.ti.com/</A>) TMS320 and the Analog Devices (
|
|
<A HREF="http://www.analog.com/">http://www.analog.com/</A>) SHARC DSP families, are designed to
|
|
construct parallel machines with little or no "glue" logic.
|
|
</LI>
|
|
<LI>They are cheap, especially per MIP or MFLOP. Including the cost
|
|
of basic support logic, it is not unheard of for a DSP processor to be
|
|
one tenth the cost of a PC processor with comparable performance.
|
|
</LI>
|
|
<LI>They do not use much power nor generate much heat. This means
|
|
that it is possible to have a bunch of these chips powered by a
|
|
conventional PC's power supply - and enclosing them in your PC's case
|
|
will not turn it into an oven.
|
|
</LI>
|
|
<LI>There are strange-looking things in most DSP instruction sets
|
|
that high-level (e.g., C) compilers are unlikely to use well - for
|
|
example, "Bit Reverse Addressing." Using an attached parallel system,
|
|
it is possible to straightforwardly compile and run most code on the
|
|
host, while running the most time-consuming few algorithms on the DSPs
|
|
as carefully hand-tuned code.
|
|
</LI>
|
|
<LI>These DSP processors are not really designed to run a UNIX-like
|
|
OS, and generally are not very good as stand-alone general-purpose
|
|
computer processors. For example, many do not have memory management
|
|
hardware. In other words, they work best when hosted by a more
|
|
general-purpose machine... such as a Linux PC.</LI>
|
|
</UL>
|
|
<P>Although some audio cards and modems include DSP processors that Linux
|
|
drivers can access, the big payoff comes from using an attached
|
|
parallel system that has four or more DSP processors.
|
|
<P>Because the Texas Instruments TMS320 series,
|
|
<A HREF="http://www.ti.com/sc/docs/dsps/dsphome.htm">http://www.ti.com/sc/docs/dsps/dsphome.htm</A>, has been very
|
|
popular for a long time, and it is trivial to construct a TMS320-based
|
|
parallel processor, there are quite a few such systems available.
|
|
There are both integer-only and floating-point capable versions of the
|
|
TMS320; older designs used a somewhat unusual single-precision
|
|
floating-point format, but the new models support IEEE formats. The
|
|
older TMS320C4x (aka, 'C4x) achieves up to 80 MFLOPS using the
|
|
TI-specific single-precision floating-point format; in contrast, a
|
|
single 'C67x will provide up to 1 GFLOPS single-precision or 420
|
|
MFLOPS double-precision for IEEE floating point calculations, using a
|
|
VLIW-based chip architecture called VelociTI. Not only is it easy to
|
|
configure a group of these chips as a multiprocessor, but in a single
|
|
chip, the 'C8x multiprocessor will provide a 100 MFLOPS IEEE
|
|
floating-point RISC master processor along with either two or four
|
|
integer slave DSPs.
|
|
<P>The other DSP processor family that has been used in more than a few
|
|
attached parallel systems lately is the SHARC (aka, ADSP-2106x) from
|
|
Analog Devices
|
|
<A HREF="http://www.analog.com/">http://www.analog.com/</A>. These chips can be
|
|
configured as a 6-processor shared memory multiprocessor without
|
|
external glue logic, and larger systems also can be configured using
|
|
six 4-bit links/chip. Most of the larger systems seem targeted to
|
|
military applications, and are a bit pricey. However, Integrated
|
|
Computing Engines, Inc.,
|
|
<A HREF="http://www.iced.com/">http://www.iced.com/</A>, makes an
|
|
interesting little two-board PCI card set called GreenICE. This unit
|
|
contains an array of 16 SHARC processors, and is capable of delivering
|
|
a peak speed of about 1.9 GFLOPS using a single-precision IEEE format.
|
|
GreenICE costs less than $5,000.
|
|
<P>In my opinion, attached parallel DSPs really deserve a lot more
|
|
attention from the Linux parallel processing community....
|
|
<P>
|
|
<H2><A NAME="ss5.3">5.3 FPGAs And Reconfigurable Logic Computing</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>If parallel processing is all about getting the highest speedup, then
|
|
why not build custom hardware? Well, we all know the answers; it
|
|
costs too much, takes too long to develop, becomes useless when we
|
|
change the algorithm even slightly, etc. However, recent advances in
|
|
electrically reprogrammable FPGAs (Field Programmable Gate Arrays)
|
|
have nullified most of those objections. Now, the gate density is
|
|
high enough so that an entire simple processor can be built within a
|
|
single FPGA, and the time to reconfigure (reprogram) an FPGA has also
|
|
been dropping to a level where it is reasonable to reconfigure even
|
|
when moving from one phase of an algorithm to the next.
|
|
<P>This stuff is not for the weak of heart: you'll have to work with
|
|
hardware description languages like VHDL for the FPGA configuration, as
|
|
well as writing low-level code to interface to programs on the Linux
|
|
host system. However, the cost of FPGAs is low, and especially for
|
|
algorithms operating on low-precision integer data (actually, a small
|
|
superset of the stuff SWAR is good at), FPGAs can perform complex
|
|
operations just about as fast as you can feed them data. For example,
|
|
simple FPGA-based systems have yielded better-than-supercomputer times
|
|
for searching gene databases.
|
|
<P>There are other companies making appropriate FPGA-based hardware, but
|
|
the following two companies represent a good sample.
|
|
<P>Virtual Computer Company offers a variety of products using
|
|
dynamically reconfigurable SRAM-based Xilinx FPGAs. Their 8/16 bit
|
|
"Virtual ISA Proto Board"
|
|
<A HREF="http://www.vcc.com/products/isa.html">http://www.vcc.com/products/isa.html</A> is less than $2,000.
|
|
<P>The Altera ARC-PCI (Altera Reconfigurable Computer, PCI bus),
|
|
<A HREF="http://www.altera.com/html/new/pressrel/pr_arc-pci.html">http://www.altera.com/html/new/pressrel/pr_arc-pci.html</A>,
|
|
is a similar type of card, but uses Altera FPGAs and a PCI bus
|
|
interface rather than ISA.
|
|
<P>Many of the design tools, hardware description languages, compilers,
|
|
routers, mappers, etc., come as object code only that runs under
|
|
Windows and/or DOS. You could simply keep a disk partition with
|
|
DOS/Windows on your host PC and reboot whenever you need to use them,
|
|
however, many of these software packages may work under Linux using
|
|
<CODE>dosemu</CODE> or Windows emulators like <CODE>wine</CODE>.
|
|
<P>
|
|
<HR>
|
|
<H2><A NAME="s6">6.</A> <A HREF="#toc6">Of General Interest</A></H2>
|
|
|
|
<P>
|
|
<P>The material covered in this section applies to all four parallel
|
|
processing models for Linux.
|
|
<P>
|
|
<H2><A NAME="ss6.1">6.1 Programming Languages And Compilers</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>I am primarily known as a compiler researcher, so I'd like to be able
|
|
to say that there are lots of really great compilers automatically
|
|
generating efficient parallel code for Linux systems. Unfortunately,
|
|
the truth is that it is hard to beat the performance obtained by
|
|
expressing your parallel program using various explicit communication
|
|
and other parallel operations within C code that is compiled by GCC.
|
|
<P>The following language/compiler projects represent some of the best
|
|
efforts toward producing reasonably efficient code from high-level
|
|
languages. Generally, each is reasonably effective for the kinds of
|
|
programming tasks it targets, but none is the powerful general-purpose
|
|
language and compiler system that will make you forever stop writing C
|
|
programs to compile with GCC... which is fine. Use these languages
|
|
and compilers as they were intended, and you'll be rewarded with
|
|
shorter development times, easier debugging and maintenance, etc.
|
|
<P>There are plenty of languages and compilers beyond those listed here
|
|
(in alphabetical order). A list of freely available compilers (most
|
|
of which have nothing to do with Linux parallel processing) is at
|
|
<A HREF="http://www.idiom.com/free-compilers/">http://www.idiom.com/free-compilers/</A>.
|
|
<P>
|
|
<H3>Fortran 66/77/PCF/90/HPF/95</H3>
|
|
|
|
<P>
|
|
<P>At least in the scientific computing community, there will always be
|
|
Fortran. Of course, now Fortran doesn't mean the same thing it did in
|
|
the 1966 ANSI standard. Basically, Fortran 66 was pretty simple stuff.
|
|
Fortran 77 added tons of features, the most noticeable of which were the
|
|
improved support for character data and the change of <CODE>DO</CODE> loop
|
|
semantics. PCF (Parallel Computing Forum) Fortran attempted to add a
|
|
variety of parallel processing support features to 77. Fortran 90 is
|
|
a fully-featured modern language, essentially adding C++-like
|
|
object-oriented programming features and parallel array syntax to the
|
|
77 language. HPF (High-Performance Fortran,
|
|
<A HREF="http://www.crpc.rice.edu/HPFF/home.html">http://www.crpc.rice.edu/HPFF/home.html</A>), which has itself gone
|
|
through two versions (HPF-1 and HPF-2), is essentially the enhanced,
|
|
standardized, version of what many of us used to know as CM Fortran,
|
|
MasPar Fortran, or Fortran D; it extends Fortran 90 with a variety of
|
|
parallel processing enhancements, largely focussed on specifying data
|
|
layouts. Finally, Fortran 95 represents a relatively minor
|
|
enhancement and refinement of 90.
|
|
<P>What works with C generally can also work with <CODE>f2c</CODE>,
|
|
<CODE>g77</CODE> (a nice Linux-specific overview is at
|
|
<A HREF="http://linux.uni-regensburg.de/psi_linux/gcc/html_g77/g77_91.html">http://linux.uni-regensburg.de/psi_linux/gcc/html_g77/g77_91.html</A>),
|
|
or the commercial Fortran 90/95 products from
|
|
<A HREF="http://extweb.nag.co.uk/nagware/NCNJNKNM.html">http://extweb.nag.co.uk/nagware/NCNJNKNM.html</A>. This is because
|
|
all of these compilers eventually come down to the same code-generation
|
|
used in the back-end of GCC.
|
|
<P>Commercial Fortran parallelizers that can generate code for SMPs are
|
|
available from
|
|
<A HREF="http://www.kai.com/">http://www.kai.com/</A> and
|
|
<A HREF="http://www.psrv.com/vast/vast_parallel.html">http://www.psrv.com/vast/vast_parallel.html</A>. It is not
|
|
clear if these compilers will work for SMP Linux, but it should be
|
|
possible given that the standard POSIX threads (i.e., LinuxThreads)
|
|
work under SMP Linux.
|
|
<P>The Portland Group,
|
|
<A HREF="http://www.pgroup.com/">http://www.pgroup.com/</A>, has commercial
|
|
parallelizing HPF Fortran (and C, C++) compilers that generate code for
|
|
SMP Linux; they also have a version targeting clusters using MPI or
|
|
PVM. FORGE/spf/xHPF products at
|
|
<A HREF=" http://www.apri.com/"> http://www.apri.com/</A>
|
|
might also be useful for SMPs or clusters.
|
|
<P>Freely available parallelizing Fortrans that might be made to work
|
|
with parallel Linux systems include:
|
|
<P>
|
|
<UL>
|
|
<LI>ADAPTOR (Automatic DAta Parallelism TranslaTOR,
|
|
<A HREF="http://www.gmd.de/SCAI/lab/adaptor/adaptor_home.html">http://www.gmd.de/SCAI/lab/adaptor/adaptor_home.html</A>),
|
|
which can translate HPF into Fortran 77/90 code with MPI or PVM calls,
|
|
but does not mention Linux.
|
|
</LI>
|
|
<LI>Fx
|
|
<A HREF="http://www.cs.cmu.edu/~fx/Fx">http://www.cs.cmu.edu/~fx/Fx</A> at Carnegie Mellon
|
|
targets some workstation clusters, but Linux?
|
|
</LI>
|
|
<LI>HPFC (prototype HPF Compiler,
|
|
<A HREF="http://www.cri.ensmp.fr/~coelho/hpfc.html">http://www.cri.ensmp.fr/~coelho/hpfc.html</A>) generates Fortran 77
|
|
code with PVM calls. Is it usable on a Linux cluster?
|
|
</LI>
|
|
<LI>Can PARADIGM (PARAllelizing compiler for DIstributed-memory
|
|
General-purpose Multicomputers,
|
|
<A HREF="http://www.crhc.uiuc.edu/Paradigm/">http://www.crhc.uiuc.edu/Paradigm/</A>) be used with Linux?
|
|
</LI>
|
|
<LI>The Polaris compiler,
|
|
<A HREF="http://ece.www.ecn.purdue.edu/~eigenman/polaris/">http://ece.www.ecn.purdue.edu/~eigenman/polaris/</A>, generates
|
|
Fortran code for shared memory multiprocessors, and may soon be
|
|
retargeted to PAPERS Linux clusters.
|
|
</LI>
|
|
<LI>PREPARE,
|
|
<A HREF="http://www.irisa.fr/EXTERNE/projet/pampa/PREPARE/prepare.html">http://www.irisa.fr/EXTERNE/projet/pampa/PREPARE/prepare.html</A>,
|
|
targets MPI clusters... it is not clear if it can generate code to
|
|
run on IA32 processors.
|
|
</LI>
|
|
<LI>Combining ADAPT and ADLIB, shpf (Subset High Performance Fortran
|
|
compilation system,
|
|
<A HREF="http://www.ccg.ecs.soton.ac.uk/Projects/shpf/shpf.html">http://www.ccg.ecs.soton.ac.uk/Projects/shpf/shpf.html</A>) is
|
|
public domain and generates Fortran 90 with MPI calls... so, if you
|
|
have a Fortran 90 compiler under Linux....
|
|
</LI>
|
|
<LI>SUIF (Stanford University Intermediate Form, see
|
|
<A HREF="http://suif.stanford.edu/">http://suif.stanford.edu/</A>) has parallelizing compilers for both
|
|
C and Fortran. This is also the focus of the National Compiler
|
|
Infrastructure Project... so, is anybody targeting parallel Linux
|
|
systems?</LI>
|
|
</UL>
|
|
<P>I'm sure that I have omitted many potentially useful compilers for
|
|
various dialects of Fortran, but there are so many that it is difficult
|
|
to keep track. In the future, I would prefer to list only those
|
|
compilers known to work with Linux. Please email comments and/or
|
|
corrections to
|
|
<A HREF="hankd@engr.uky.edu">hankd@engr.uky.edu</A>.
|
|
<P>
|
|
<H3>GLU (Granular Lucid)</H3>
|
|
|
|
<P>
|
|
<P>GLU (Granular Lucid) is a very high-level programming system based on
|
|
a hybrid programming model that combines intensional (Lucid) and
|
|
imperative models. It supports both PVM and TCP sockets. Does it run
|
|
under Linux? More information is available at
|
|
<A HREF="http://www.csl.sri.com/GLU.html">http://www.csl.sri.com/GLU.html</A>.
|
|
<P>
|
|
<H3>Jade And SAM</H3>
|
|
|
|
<P>
|
|
<P>Jade is a parallel programming language that extends C to exploit
|
|
coarse-grain concurrency in sequential, imperative programs. It
|
|
assumes a distributed shared memory model, which is implemented by SAM
|
|
for workstation clusters using PVM. More information is available at
|
|
<A HREF="http://suif.stanford.edu/~scales/sam.html">http://suif.stanford.edu/~scales/sam.html</A>.
|
|
<P>
|
|
<H3>Mentat And Legion</H3>
|
|
|
|
<P>
|
|
<P>Mentat is an object-oriented parallel processing system that works
|
|
with workstation clusters and has been ported to Linux. Mentat
|
|
Programming Language (MPL) is an object-oriented programming language
|
|
based on C++. The Mentat run-time system uses something vaguely
|
|
resembling non-blocking remote procedure calls. More information is
|
|
available at
|
|
<A HREF="http://www.cs.virginia.edu/~mentat/">http://www.cs.virginia.edu/~mentat/</A>.
|
|
<P>Legion
|
|
<A HREF="http://www.cs.virginia.edu/~legion/">http://www.cs.virginia.edu/~legion/</A> is built on top
|
|
on Mentat, providing the appearance of a single virtual machine across
|
|
wide-area networked machines.
|
|
<P>
|
|
<H3>MPL (MasPar Programming Language)</H3>
|
|
|
|
<P>
|
|
<P>Not to be confussed with Mentat's MPL, this language was originally
|
|
developed as the native parallel C dialect for the MasPar SIMD
|
|
supercomputers. Well, MasPar isn't really in that business any more
|
|
(they are now NeoVista Solutions,
|
|
<A HREF="http://www.neovista.com">http://www.neovista.com</A>,
|
|
a data mining company), but their MPL compiler was built using GCC, so
|
|
it is still freely available. In a joint effort between the
|
|
University of Alabama at Huntsville and Purdue University, MasPar's
|
|
MPL has been retargeted to generate C code with AFAPI calls (see
|
|
section 3.6), and thus runs on both Linux SMPs and clusters. The
|
|
compiler is, however, somewhat buggy... see
|
|
<A HREF="http://www.math.luc.edu/~laufer/mspls/papers/cohen.ps">http://www.math.luc.edu/~laufer/mspls/papers/cohen.ps</A>.
|
|
<P>
|
|
<H3>PAMS (Parallel Application Management System)</H3>
|
|
|
|
<P>
|
|
<P>Myrias is a company selling a software product called PAMS (Parallel
|
|
Application Management System). PAMS provides very simple directives
|
|
for virtual shared memory parallel processing. Networks of Linux
|
|
machines are not yet supported. See
|
|
<A HREF="http://www.myrias.com/">http://www.myrias.com/</A> for more information.
|
|
<P>
|
|
<H3>Parallaxis-III</H3>
|
|
|
|
<P>
|
|
<P>Parallaxis-III is a structured programming language that extends
|
|
Modula-2 with "virtual processors and connections" for data
|
|
parallelism (a SIMD model). The Parallaxis software comprises
|
|
compilers for sequential and parallel computer systems, a debugger
|
|
(extensions to the gdb and xgbd debugger), and a large variety of
|
|
sample algorithms from different areas, especially image processing.
|
|
This runs on sequential Linux systems... an old version supported
|
|
various parallel targets, and the new version also will (e.g.,
|
|
targeting a PVM cluster). More information is available at
|
|
<A HREF="http://www.informatik.uni-stuttgart.de/ipvr/bv/p3/p3.html">http://www.informatik.uni-stuttgart.de/ipvr/bv/p3/p3.html</A>.
|
|
<P>
|
|
<H3>pC++/Sage++</H3>
|
|
|
|
<P>
|
|
<P>pC++/Sage++ is a language extension to C++ that permits data-parallel
|
|
style operations using "collections of objects" from some base
|
|
"element" class. It is a preprocessor generating C++ code that can
|
|
run under PVM. Does it run under Linux? More information is
|
|
available at
|
|
<A HREF="http://www.extreme.indiana.edu/sage/">http://www.extreme.indiana.edu/sage/</A>.
|
|
<P>
|
|
<H3>SR (Synchronizing Resources)</H3>
|
|
|
|
<P>
|
|
<P>SR (Synchronizing Resources) is a concurrent programming language in
|
|
which resources encapsulate processes and the variables they share;
|
|
operations provide the primary mechanism for process interaction. SR
|
|
provides a novel integration of the mechanisms for invoking and
|
|
servicing operations. Consequently, all of local and remote procedure
|
|
call, rendezvous, message passing, dynamic process creation,
|
|
multicast, and semaphores are supported. SR also supports shared
|
|
global variables and operations.
|
|
<P>It has been ported to Linux, but it isn't clear what parallelism it
|
|
can execute with. More information is available at
|
|
<A HREF="http://www.cs.arizona.edu/sr/www/index.html">http://www.cs.arizona.edu/sr/www/index.html</A>.
|
|
<P>
|
|
<H3>ZPL And IronMan</H3>
|
|
|
|
<P>
|
|
<P>ZPL is an array-based programming language intended to support
|
|
engineering and scientific applications. It generates calls to a
|
|
simple message-passing interface called IronMan, and the few functions
|
|
which constitute this interface can be easily implemented using nearly
|
|
any message-passing system. However, it is primarily targeted to PVM
|
|
and MPI on workstation clusters, and Linux is supported. More
|
|
information is available at
|
|
<A HREF="http://www.cs.washington.edu/research/projects/orca3/zpl/www/">http://www.cs.washington.edu/research/projects/orca3/zpl/www/</A>.
|
|
<P>
|
|
<H2><A NAME="ss6.2">6.2 Performance Issues</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>There are a lot of people who spend a lot of time benchmarking
|
|
particular motherboards, network cards, etc., trying to determine
|
|
which is the best. The problem with that approach is that by the time
|
|
you've been able to benchmark something, it is no longer the best
|
|
available; it even may have been taken off the market and replaced by
|
|
a revised model with entirely different properties.
|
|
<P>Buying PC hardware is like buying orange juice. Usually, it is made
|
|
with pretty good stuff no matter what company name is on the label.
|
|
Few people know, or care, where the components (or orange juice
|
|
concentrate) came from. That said, there are some hardware
|
|
differences that you should pay attention to. My advice is simply
|
|
that you be aware of what you can expect from the hardware under
|
|
Linux, and then focus your attention on getting rapid delivery, a good
|
|
price, and a reasonable policy for returns.
|
|
<P>An excellent overview of the different PC processors is given in
|
|
<A HREF="http://www.pcguide.com/ref/cpu/fam/">http://www.pcguide.com/ref/cpu/fam/</A>; in fact, the whole WWW
|
|
site
|
|
<A HREF="http://www.pcguide.com/">http://www.pcguide.com/</A> is full of good technical
|
|
overviews of PC hardware. It is also useful to know a bit about
|
|
performance of specific hardware configurations, and the Linux
|
|
Benchmarking HOWTO
|
|
<A HREF="http://sunsite.unc.edu/LDP/HOWTO/Benchmarking-HOWTO.html">http://sunsite.unc.edu/LDP/HOWTO/Benchmarking-HOWTO.html</A> is a
|
|
good place to start.
|
|
<P>The Intel IA32 processors have many special registers that can be used
|
|
to measure the performance of a running system in exquisite detail.
|
|
Intel VTune,
|
|
<A HREF="http://developer.intel.com/design/perftool/vtune/">http://developer.intel.com/design/perftool/vtune/</A>, uses the
|
|
performance registers extensively in a very complete code-tuning
|
|
system... that unfortunately doesn't run under Linux. A loadable
|
|
module device driver, and library routines, for accessing the Pentium
|
|
performance registers is available from
|
|
<A HREF="http://www.cs.umd.edu/users/akinlar/driver.html">http://www.cs.umd.edu/users/akinlar/driver.html</A>. Keep in mind
|
|
that these performance registers are different on different IA32
|
|
processors; this code works only with Pentium, not with 486, Pentium
|
|
Pro, Pentium II, K6, etc.
|
|
<P>Another comment on performance is appropriate, especially for those
|
|
of you who want to build big clusters and put them in small spaces.
|
|
At least some modern processors incorporate thermal sensors and
|
|
circuits that are used to slow the internal clock rate if operating
|
|
temperature gets too high (an attempt to reduce heat output and
|
|
improve reliability). I'm not suggesting that everyone should go buy
|
|
a peltier device (heat pump) to cool each CPU, but you should be aware
|
|
that high operating temperature does not just shorten component life -
|
|
it also can directly reduce system performance. Do not arrange your
|
|
computers in physical configurations that block airflow, trap heat
|
|
within confined areas, etc.
|
|
<P>Finally, performance isn't just speed, but also reliability and
|
|
availability. High reliability means that your system almost never
|
|
crashes, even when components fail... which generally requires
|
|
special features like redundant power supplies and hot-swap
|
|
motherboards. That usually isn't cheap. High availability refers to
|
|
the concept that your system is available for use nearly all the
|
|
time... the system may crash when components fail, but the system is
|
|
quickly repaired and rebooted. There is a High-Availability HOWTO
|
|
that discusses many of the basic issues. However, especially for
|
|
clusters, high availablity can be achieved simply by having a few
|
|
spares. I recommend at least one spare, and prefer to have at least
|
|
one spare for every 16 machines in a large cluster. Discarding faulty
|
|
hardware and replacing it with a spare can yield both higher
|
|
availability and lower cost than a maintenance contract.
|
|
<P>
|
|
<H2><A NAME="ss6.3">6.3 Conclusion - It's Out There</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>So, is anybody doing parallel processing using Linux? Yes!
|
|
<P>It wasn't very long ago that a lot of people were wondering if the
|
|
death of many parallel-processing supercomputer companies meant that
|
|
parallel processing was on its way out. I didn't think it was dead
|
|
then (see
|
|
<A HREF="http://dynamo.ecn.purdue.edu/~hankd/Opinions/pardead.html">http://dynamo.ecn.purdue.edu/~hankd/Opinions/pardead.html</A> for a
|
|
fun overview of what I think really happened), and it seems quite
|
|
clear now that parallel processing is again on the rise. Even Intel,
|
|
which just recently stopped making parallel supercomputers, is proud
|
|
of the parallel processing support in things like MMX and the upcoming
|
|
IA64 EPIC (Explicitly Parallel Instruction Computer).
|
|
<P>If you search for "Linux" and "parallel" with your favorite search
|
|
engine, you'll find quite a few places are involved in parallel
|
|
processing using Linux. In particular, Linux PC clusters seem to be
|
|
popping-up everywhere. The appropriateness of Linux, combined with
|
|
the low cost and high performance of PC hardware, have made parallel
|
|
processing using Linux a popular approach to supercomputing for both
|
|
small, budget-constrained, groups and large, well-funded, national
|
|
research laboratories.
|
|
<P>Various projects listed elsewhere in this document maintain lists of
|
|
"kindred" research sites that have similar parallel Linux
|
|
configurations. However, at
|
|
<A HREF="http://yara.ecn.purdue.edu/~pplinux/Sites/">http://yara.ecn.purdue.edu/~pplinux/Sites/</A>, there is a
|
|
hypertext document intended to provide photographs, descriptions, and
|
|
contact information for all the various sites using Linux systems for
|
|
parallel processing. To have information about your site posted there:
|
|
<P>
|
|
<UL>
|
|
<LI>You must have a "permanent" parallel Linux site: an SMP,
|
|
cluster of machines, SWAR system, or PC with attached processor, which
|
|
is configured to allow users to <EM>execute parallel programs under
|
|
Linux</EM>. A Linux-based software environment (e.g., PVM, MPI,
|
|
AFAPI) that directly supports parallel processing must be installed on
|
|
the system. However, the hardware need not be dedicated to parallel
|
|
processing under Linux, and may be used for completely different
|
|
purposes when parallel programs are not being run.
|
|
</LI>
|
|
<LI>Request that your site be listed. Send your site information to
|
|
<A HREF="mailto:hankd@engr.uky.edu">hankd@engr.uky.edu</A>. Please follow the format used in other
|
|
entries for your site information. <EM>No site will be listed without
|
|
an explicit request from the contact person for that site.</EM></LI>
|
|
</UL>
|
|
<P>There are 14 clusters in the current listing, but we are aware of at
|
|
least several dozen Linux clusters world-wide. Of course, listing
|
|
does not imply any endorsement, etc.; our hope is simply to increase
|
|
awareness, research, and collaboration involving parallel processing
|
|
using Linux.
|
|
<P>
|
|
<HR>
|
|
</BODY>
|
|
</HTML>
|