<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
<TITLE>Linux Parallel Processing HOWTO: Introduction</TITLE>
<LINK HREF="Parallel-Processing-HOWTO-2.html" REL=next>

<LINK HREF="Parallel-Processing-HOWTO.html#toc1" REL=contents>
</HEAD>
<BODY>
<A HREF="Parallel-Processing-HOWTO-2.html">Next</A>
Previous
<A HREF="Parallel-Processing-HOWTO.html#toc1">Contents</A>
<HR>
<H2><A NAME="s1">1. Introduction</A></H2>

<P><B>Parallel Processing</B> refers to the concept of speeding up the
execution of a program by dividing the program into multiple fragments
that can execute simultaneously, each on its own processor. A program
being executed across <EM>n</EM> processors might execute up to <EM>n</EM>
times faster than it would using a single processor.
<P>Traditionally, multiple processors were provided within a specially
designed "parallel computer"; along these lines, Linux now supports
<B>SMP</B> systems (often sold as "servers") in which multiple
processors share a single memory and bus interface within a single
computer. It is also possible for a group of computers (for example,
a group of PCs each running Linux) to be interconnected by a network
to form a parallel-processing <B>cluster</B>. A third alternative
for parallel computing using Linux is to use the <B>multimedia
instruction extensions</B> (e.g., MMX) to operate in parallel on
vectors of integer data. Finally, it is also possible to use a Linux
system as a "host" for a specialized <B>attached</B> parallel
processing compute engine. All these approaches are discussed in
detail in this document.
<H2><A NAME="ss1.1">1.1 Is Parallel Processing What I Want?</A></H2>

<P>Although the use of multiple processors can speed up many operations, most
applications cannot yet benefit from parallel processing. Basically,
parallel processing is appropriate only if:
<UL>
<LI>Your application has enough parallelism to make good use of
multiple processors. In part, this is a matter of identifying
portions of the program that can execute independently and
simultaneously on separate processors, but you will also find that
some things that <EM>could</EM> execute in parallel might actually slow
execution if executed in parallel using a particular system. For
example, a program that takes four seconds to execute within a single
machine might be able to execute in only one second of processor time
on each of four machines, but no speedup would be achieved if it took
three seconds or more for these machines to coordinate their actions.
</LI>
<LI>Either the particular application program you are interested in
already has been <B>parallelized</B> (rewritten to take advantage of
parallel processing) or you are willing to do at least some new coding
to take advantage of parallel processing.
</LI>
<LI>You are interested in researching, or at least becoming familiar
with, issues involving parallel processing. Parallel processing using
Linux systems isn't necessarily difficult, but it is not familiar to
most computer users, and there isn't any book called "Parallel
Processing for Dummies"... at least not yet. This HOWTO is a good
starting point, not all you need to know.</LI>
</UL>
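<P>The four-second example in the list above can be checked with a
little arithmetic. The following C fragment is only a sketch under a
deliberately simple assumption: the work divides evenly among the
processors and the coordination time simply adds to the run time. The
function name <CODE>estimated_speedup</CODE> is hypothetical and is not
used elsewhere in this HOWTO.
<HR>
<PRE>
/* Hypothetical helper, not part of any library: estimated speedup when
   a job that takes t_serial seconds on one processor is split evenly
   over n processors and then pays t_coord seconds of coordination.
   Assumes perfect load balance and purely additive overhead. */
double estimated_speedup(double t_serial, int n, double t_coord)
{
  double t_parallel = t_serial / n + t_coord;

  return(t_serial / t_parallel);
}
</PRE>
<HR>
<P>With <CODE>t_serial</CODE> of 4, <CODE>n</CODE> of 4, and
<CODE>t_coord</CODE> of 3, the parallel run also takes four seconds, so
the estimated speedup is 1; any longer coordination time makes the
parallel version slower than the sequential one.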
<P>The good news is that if all the above are true, you'll find that
parallel processing using Linux can yield supercomputer performance
for some programs that perform complex computations or operate on
large data sets. What's more, it can do that using cheap hardware...
which you might already own. As an added bonus, it is also easy to
use a parallel Linux system for other things when it is not busy
executing a parallel job.
<P>If parallel processing is <EM>not</EM> what you want, but you would
like to achieve at least a modest improvement in performance, there are
still things you can do. For example, you can improve performance of
sequential programs by moving to a faster processor, adding memory,
replacing an IDE disk with fast wide SCSI, etc. If that's all you are
interested in, jump to section 6.2; otherwise, read on.
<H2><A NAME="ss1.2">1.2 Terminology</A></H2>

<P>Although parallel processing has been used for many years in many
systems, it is still somewhat unfamiliar to most computer users.
Thus, before discussing the various alternatives, it is important to
become familiar with a few commonly used terms.
<DL>
<DT><B>SIMD:</B><DD><P>SIMD (Single Instruction stream, Multiple Data stream) refers to a
parallel execution model in which all processors execute the same
operation at the same time, but each processor is allowed to operate
upon its own data. This model naturally fits the concept of
performing the same operation on every element of an array, and is
thus often associated with vector or array manipulation. Because all
operations are inherently synchronized, interactions among SIMD
processors tend to be easily and efficiently implemented.
<DT><B>MIMD:</B><DD><P>MIMD (Multiple Instruction stream, Multiple Data stream) refers to a
parallel execution model in which each processor is essentially acting
independently. This model most naturally fits the concept of
decomposing a program for parallel execution on a functional basis;
for example, one processor might update a database file while another
processor generates a graphic display of the new entry. This is a
more flexible model than SIMD execution, but it is achieved at the
risk of debugging nightmares called <B>race conditions</B>, in which
a program may intermittently fail due to timing variations reordering
the operations of one processor relative to those of another.
<DT><B>SPMD:</B><DD><P>SPMD (Single Program, Multiple Data) is a restricted version of MIMD
in which all processors are running the same program. Unlike SIMD,
each processor executing SPMD code may take a different control flow
path through the program.
<DT><B>Communication Bandwidth:</B><DD><P>The bandwidth of a communication system is the maximum amount of data
that can be transmitted in a unit of time... once data transmission
has begun. Bandwidth for serial connections is often measured in
<B>baud</B> or <B>bits/second (b/s)</B>, which generally
correspond to 1/10 to 1/8 that many <B>Bytes/second (B/s)</B>. For
example, a 1,200 baud modem transfers about 120 B/s, whereas a 155
Mb/s ATM network connection is nearly 130,000 times faster,
transferring about 17 MB/s. High bandwidth allows large blocks
of data to be transferred efficiently between processors.
<DT><B>Communication Latency:</B><DD><P>The latency of a communication system is the minimum time taken to
transmit one object, including any send and receive software
overhead. Latency is very important in parallel processing because it
determines the minimum useful <B>grain size</B>, that is, the minimum run
time for a segment of code to yield speedup through parallel
execution. Basically, if a segment of code runs for less time than it
takes to transmit its result value (i.e., the latency), executing that
code segment serially on the processor that needed the result value
would be faster than parallel execution; serial execution would avoid
the communication overhead.
<DT><B>Message Passing:</B><DD><P>Message passing is a model for interactions between processors within
a parallel system. In general, a message is constructed by software
on one processor and is sent through an interconnection network to
another processor, which then must accept and act upon the message
contents. Although the overhead in handling each message (latency)
may be high, there are typically few restrictions on how much
information each message may contain. Thus, message passing can yield
high bandwidth, making it a very effective way to transmit a large
block of data from one processor to another. However, to minimize the
need for expensive message passing operations, data structures within
a parallel program must be spread across the processors so that most
data referenced by each processor is in its local memory... this task
is known as <B>data layout</B>.
<DT><B>Shared Memory:</B><DD><P>Shared memory is a model for interactions between processors within a
parallel system. Systems like the multi-processor Pentium machines
running Linux <B>physically</B> share a single memory among
their processors, so that a value written to shared memory by one
processor can be directly accessed by any processor. Alternatively,
<B>logically</B> shared memory can be implemented for
systems in which each processor has its own memory by converting each
non-local memory reference into an appropriate inter-processor
communication. Either implementation of shared memory is generally
considered easier to use than message passing. Physically shared
memory can have both high bandwidth and low latency, but only when
multiple processors do not try to access the bus simultaneously; thus,
data layout still can seriously impact performance, and cache effects,
etc., can make it difficult to determine what the best layout is.
<DT><B>Aggregate Functions:</B><DD><P>In both the message passing and shared memory models, a communication
is initiated by a single processor; in contrast, aggregate function
communication is an inherently parallel communication model in which
an entire group of processors acts together. The simplest such action
is a <B>barrier synchronization</B>, in which each individual
processor waits until every processor in the group has arrived at the
barrier. By having each processor output a datum as a side-effect of
reaching a barrier, it is possible to have the communication hardware
return a value to each processor which is an arbitrary function of the
values collected from all processors. For example, the return value
might be the answer to the question "did any processor find a
solution?" or it might be the sum of one value from each processor.
Latency can be very low, but bandwidth per processor also tends to be
low. Traditionally, this model is used primarily to control parallel
execution rather than to distribute data values.
<DT><B>Collective Communication:</B><DD><P>This is another name for aggregate functions, most often used when
referring to aggregate functions that are constructed using multiple
message-passing operations.
<DT><B>SMP:</B><DD><P>SMP (Symmetric Multi-Processor) refers to the operating system concept
of a group of processors working together as peers, so that any piece
of work could be done equally well by any processor. Typically, SMP
implies the combination of MIMD and shared memory. In the IA32 world,
SMP generally means compliant with MPS (the Intel MultiProcessor
Specification); in the future, it may mean "Slot 2"....
<DT><B>SWAR:</B><DD><P>SWAR (SIMD Within A Register) is a generic term for the concept of
partitioning a register into multiple integer fields and using
register-width operations to perform SIMD-parallel computations across
those fields. Given a machine with <EM>k</EM>-bit registers, data
paths, and function units, it has long been known that ordinary
register operations can function as SIMD parallel operations on as
many as <EM>n</EM> field values of <EM>k</EM>/<EM>n</EM> bits each. Although
this type of parallelism can be implemented using ordinary integer
registers and instructions, many high-end microprocessors have
recently added specialized instructions to enhance the performance of
this technique for multimedia-oriented tasks. In addition to the
Intel/AMD/Cyrix <B>MMX</B> (MultiMedia eXtensions), there are:
Digital Alpha <B>MAX</B> (MultimediA eXtensions), Hewlett-Packard
PA-RISC <B>MAX</B> (Multimedia Acceleration eXtensions), MIPS
<B>MDMX</B> (Digital Media eXtension, pronounced "Mad Max"), and Sun
SPARC V9 <B>VIS</B> (Visual Instruction Set). Aside from the three
vendors who have agreed on MMX, all of these instruction set
extensions are roughly comparable, but mutually incompatible.
<DT><B>Attached Processors:</B><DD><P>Attached processors are essentially special-purpose computers that are
connected to a <B>host</B> system to accelerate specific types of
computation. For example, many video and audio cards for PCs contain
attached processors designed, respectively, to accelerate common
graphics operations and audio <B>DSP</B> (Digital Signal
Processing). There is also a wide range of attached <B>array
processors</B>, so called because they are designed to accelerate
arithmetic operations on arrays. In fact, many commercial
supercomputers are really attached processors with workstation hosts.
<DT><B>RAID:</B><DD><P>RAID (Redundant Array of Inexpensive Disks) is a simple technology for
increasing both the bandwidth and reliability of disk I/O. Although
there are many different variations, all have two key concepts in
common. First, each data block is <B>striped</B> across a group of
<EM>n+k</EM> disk drives such that each drive only has to read or
write 1/<EM>n</EM> of the data... yielding <EM>n</EM> times the
bandwidth of one drive. Second, redundant data is written so that
data can be recovered if a disk drive fails; this is important because
otherwise if any one of the <EM>n+k</EM> drives were to fail, the
entire file system could be lost. A good overview of RAID in general
is given at
<A HREF="http://www.uni-mainz.de/~neuffer/scsi/what_is_raid.html">http://www.uni-mainz.de/~neuffer/scsi/what_is_raid.html</A>, and
information about RAID options for Linux systems is at
<A HREF="http://linas.org/linux/raid.html">http://linas.org/linux/raid.html</A>. Aside from specialized RAID
hardware support, Linux also supports software RAID 0, 1, 4, and 5
across multiple disks hosted by a single Linux system; see the
Software RAID mini-HOWTO and the Multi-Disk System Tuning mini-HOWTO
for details. RAID across disk drives <EM>on multiple machines in a
cluster</EM> is not directly supported.
<DT><B>IA32:</B><DD><P>IA32 (Intel Architecture, 32-bit) really has nothing to do with
parallel processing, but rather refers to the class of processors whose
instruction sets are generally compatible with that of the Intel 386.
Basically, any Intel x86 processor after the 286 is compatible with
the 32-bit flat memory model that characterizes IA32. AMD and Cyrix
also make a multitude of IA32-compatible processors. Because Linux
evolved primarily on IA32 processors and that is where the commodity
market is centered, it is convenient to use IA32 to distinguish any of
these processors from the PowerPC, Alpha, PA-RISC, MIPS, SPARC, etc.
The upcoming IA64 (64-bit with EPIC, Explicitly Parallel Instruction
Computing) will certainly complicate matters, but Merced, the first
IA64 processor, is not scheduled for production until 1999.
<DT><B>COTS:</B><DD><P>Since the demise of many parallel supercomputer companies, COTS
(Commercial Off-The-Shelf) is commonly discussed as a requirement for
parallel computing systems. If one is fanatically pure, the only COTS
parallel processing techniques using PCs are things like SMP Windows
NT servers and various MMX Windows applications; it really doesn't pay
to be that fanatical. The underlying concept of COTS is really
minimization of development time and cost. Thus, a more useful, more
common, meaning of COTS is that at least most subsystems benefit from
commodity marketing, but other technologies are used where they are
effective. Most often, COTS parallel processing refers to a cluster
in which the nodes are commodity PCs, but the network interface and
software are somewhat customized... typically running Linux and
application codes that are freely available (e.g., copyleft or public
domain), but not literally COTS.
</DL>
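<P>The bandwidth and latency definitions above can be combined into a
rough cost estimate for a single communication. The following C
fragment is a sketch only, assuming the common simplification that
transfer time equals latency plus size divided by bandwidth; the
function names <CODE>transfer_seconds</CODE> and
<CODE>worth_offloading</CODE> are hypothetical and do not come from any
library discussed in this HOWTO.
<HR>
<PRE>
/* Assumed cost model: time = latency + size / bandwidth.
   latency_s is in seconds, bandwidth_Bps in Bytes/second. */
double transfer_seconds(double latency_s, double bandwidth_Bps,
                        double bytes)
{
  return(latency_s + bytes / bandwidth_Bps);
}

/* Grain size test: returns 1 if a code segment runs longer than the
   time needed to transmit its result, 0 otherwise.  Segments that fail
   this test are better executed serially where the result is needed. */
int worth_offloading(double segment_s, double latency_s,
                     double bandwidth_Bps, double result_bytes)
{
  return(segment_s > transfer_seconds(latency_s, bandwidth_Bps,
                                      result_bytes));
}
</PRE>
<HR>
<P>For instance, ignoring latency, sending 1,200 Bytes over the 120 B/s
modem mentioned above takes about ten seconds under this model. For
small messages the latency term dominates, which is why latency rather
than bandwidth sets the minimum useful grain size.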
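<P>To make the SWAR definition above more concrete, the following C
fragment sketches how ordinary 32-bit integer instructions can add four
8-bit fields at once without letting a carry spill from one field into
the next. It is an illustration under stated assumptions (32-bit
registers, four unsigned 8-bit fields, wrap-around arithmetic within
each field), not code taken from this HOWTO; the function name
<CODE>swar_add_4x8</CODE> is hypothetical.
<HR>
<PRE>
#include <stdint.h>

/* Add four packed 8-bit fields of x and y, each modulo 256. */
uint32_t swar_add_4x8(uint32_t x, uint32_t y)
{
  const uint32_t high = 0x80808080u;  /* top bit of every field */

  /* add the low 7 bits of each field; any carry stops at bit 7 */
  uint32_t low_sum = (x & ~high) + (y & ~high);

  /* fold in the top bits with XOR, which cannot carry out */
  return(low_sum ^ ((x ^ y) & high));
}
</PRE>
<HR>
<P>Masking off the top bit of each field before the addition is what
keeps the fields independent; instruction set extensions such as MMX
provide this kind of partitioned addition as a single instruction.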
<H2><A NAME="ss1.3">1.3 Example Algorithm</A></H2>

<P>In order to better understand the use of the various parallel
programming approaches outlined in this HOWTO, it is useful to have an
example problem. Although just about any simple parallel algorithm
would do, by selecting an algorithm that has been used to demonstrate
various other parallel programming systems, it becomes a bit easier to
compare and contrast approaches. M. J. Quinn's book, <EM>Parallel
Computing Theory And Practice</EM>, second edition, McGraw Hill, New
York, 1994, uses a parallel algorithm that computes the value of Pi to
demonstrate a variety of different parallel supercomputer programming
environments (e.g., nCUBE message passing, Sequent shared memory). In
this HOWTO, we use the same basic algorithm.
<P>The algorithm computes the approximate value of Pi by summing the
areas of narrow rectangles under the curve
4/(1+<EM>x</EM><SUP>2</SUP>) between <EM>x</EM>=0 and <EM>x</EM>=1; the
exact area under that curve is Pi. As a purely sequential C program, the
algorithm looks like:
<HR>
<PRE>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  register double width, sum;
  register int intervals, i;

  /* get the number of intervals (must be a positive integer) */
  if (argc < 2 || (intervals = atoi(argv[1])) <= 0) {
    fprintf(stderr, "usage: %s intervals\n", argv[0]);
    return(1);
  }
  width = 1.0 / intervals;

  /* do the computation: sum rectangle heights at interval midpoints */
  sum = 0;
  for (i=0; i<intervals; ++i) {
    register double x = (i + 0.5) * width;
    sum += 4.0 / (1.0 + x * x);
  }
  sum *= width;

  printf("Estimation of pi is %f\n", sum);

  return(0);
}
</PRE>
<HR>
<P>However, this sequential algorithm easily yields an "embarrassingly
parallel" implementation. The area is subdivided into intervals, and
any number of processors can each independently sum the intervals
assigned to them, with no need for interaction between processors. Once
the local sums have been computed, they are added together to create a
global sum; this step requires some level of coordination and
communication between processors. Finally, this global sum is printed
by one processor as the approximate value of Pi.
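<P>The decomposition just described can be sketched in plain C before
any particular parallel programming system is introduced. The fragment
below is an illustration only: it assumes an interleaved assignment of
intervals to processors (other assignments work equally well), and it
stands in for the parallel machinery with an ordinary loop that
computes each local sum in turn. The names <CODE>partial_sum</CODE> and
<CODE>estimate_pi</CODE> are hypothetical.
<HR>
<PRE>
/* Local sum for processor number me (0 .. nproc-1).  Intervals are
   dealt out in interleaved fashion: me, me+nproc, me+2*nproc, ... */
double partial_sum(int me, int nproc, int intervals)
{
  double width = 1.0 / intervals;
  double sum = 0;
  int i;

  for (i=me; i<intervals; i+=nproc) {
    double x = (i + 0.5) * width;
    sum += 4.0 / (1.0 + x * x);
  }
  return(sum * width);
}

/* Stand-in for the global sum: here the local sums are computed one
   after another and added; in a real parallel version each call to
   partial_sum would run on its own processor. */
double estimate_pi(int nproc, int intervals)
{
  double global = 0;
  int p;

  for (p=0; p<nproc; ++p) {
    global += partial_sum(p, nproc, intervals);
  }
  return(global);
}
</PRE>
<HR>
<P>Only the final additions in <CODE>estimate_pi</CODE> involve values
from more than one processor, which is why this algorithm needs so
little communication.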
<P>In this HOWTO, the various parallel implementations of this algorithm
appear where each of the different programming methods is discussed.
<H2><A NAME="ss1.4">1.4 Organization Of This Document</A></H2>

<P>The remainder of this document is divided into five parts. Sections
2, 3, 4, and 5 correspond to the four different types of hardware
configurations supporting parallel processing using Linux:
<UL>
<LI>Section 2 discusses SMP Linux systems. These directly support
MIMD execution using shared memory, although message passing also is
implemented easily. Although Linux supports SMP configurations up to
16 processors, most SMP PC systems have either two or four identical
processors.
</LI>
<LI>Section 3 discusses clusters of networked machines, each running
Linux. A cluster can be used as a parallel processing system that
directly supports MIMD execution and message passing, perhaps also
providing logically shared memory. Simulated SIMD execution and
aggregate function communication also can be supported, depending on
the networking method used. The number of processors in a cluster can
range from two to thousands, primarily limited by the physical wiring
constraints of the network. In some cases, various types of machines
can be mixed within a cluster; for example, a network combining DEC
Alpha and Pentium Linux systems would be a <B>heterogeneous
cluster</B>.
</LI>
<LI>Section 4 discusses SWAR, SIMD Within A Register. This is a
very restrictive type of parallel execution model, but on the other
hand, it is a built-in capability of ordinary processors. Recently,
MMX (and other) instruction set extensions to modern processors have
made this approach even more effective.
</LI>
<LI>Section 5 discusses the use of Linux PCs as hosts for simple
parallel computing systems. Either as an add-in card or as an
external box, attached processors can provide a Linux system with
formidable processing power for specific types of applications. For
example, inexpensive ISA cards are available that provide multiple DSP
processors offering hundreds of MFLOPS for compute-bound problems.
However, these add-in boards are <EM>just</EM> processors; they
generally do not run an OS, have disk or console I/O capability, etc.
To make such systems useful, the Linux "host" must provide these
functions.</LI>
</UL>
<P>The final section of this document covers aspects that are of general
interest for parallel processing using Linux, not specific to a
particular one of the approaches listed above.
<P>As you read this document, keep in mind that we haven't tested
everything, and a lot of stuff reported here "still has a research
character" (a nice way to say "doesn't quite work like it should" ;-).
However, parallel processing using Linux is useful now, and an
increasingly large group is working to make it better.
<P>The author of this HOWTO is Hank Dietz, Ph.D., currently Professor &
James F. Hardymon Chair in Networking at the
University of Kentucky, Electrical & Computer Engineering Dept in
Lexington, KY, 40506-0046.
Dietz retains rights to this
document as per the Linux Documentation Project guidelines. Although
an effort has been made to ensure the correctness and fairness of this
presentation, neither Dietz nor University of Kentucky can be held
responsible for any problems or errors, and University of Kentucky does not
endorse any of the work/products discussed.
<HR>
<A HREF="Parallel-Processing-HOWTO-2.html">Next</A>
Previous
<A HREF="Parallel-Processing-HOWTO.html#toc1">Contents</A>
</BODY>
</HTML>