<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
|
|
<TITLE>Linux Parallel Processing HOWTO: Clusters Of Linux Systems</TITLE>
|
|
<LINK HREF="Parallel-Processing-HOWTO-4.html" REL=next>
|
|
<LINK HREF="Parallel-Processing-HOWTO-2.html" REL=previous>
|
|
<LINK HREF="Parallel-Processing-HOWTO.html#toc3" REL=contents>
|
|
</HEAD>
|
|
<BODY>
|
|
<A HREF="Parallel-Processing-HOWTO-4.html">Next</A>
|
|
<A HREF="Parallel-Processing-HOWTO-2.html">Previous</A>
|
|
<A HREF="Parallel-Processing-HOWTO.html#toc3">Contents</A>
|
|
<HR>
|
|
<H2><A NAME="s3">3. Clusters Of Linux Systems</A></H2>
|
|
|
|
<P>
|
|
<P>This section attempts to give an overview of cluster parallel
|
|
processing using Linux. Clusters are currently both the most popular
|
|
and the most varied approach, ranging from a conventional network of
|
|
workstations (<B>NOW</B>) to essentially custom parallel machines
|
|
that just happen to use Linux PCs as processor nodes. There is also
|
|
quite a lot of software support for parallel processing using clusters
|
|
of Linux machines.
|
|
<P>
|
|
<H2><A NAME="ss3.1">3.1 Why A Cluster?</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Cluster parallel processing offers several important advantages:
|
|
<P>
|
|
<UL>
|
|
<LI>Each of the machines in a cluster can be a complete system,
|
|
usable for a wide range of other computing applications. This leads
|
|
many people to suggest that cluster parallel computing can simply
|
|
claim all the "wasted cycles" of workstations sitting idle on people's
|
|
desks. It is not really so easy to salvage those cycles, and it will
|
|
probably slow your co-worker's screen saver, but it can be done.
|
|
</LI>
|
|
<LI>The current explosion in networked systems means that most of the
|
|
hardware for building a cluster is being sold in high volume, with
|
|
correspondingly low "commodity" prices as the result. Further savings
|
|
come from the fact that only one video card, monitor, and keyboard are
|
|
needed for each cluster (although you may need to swap these into each
|
|
machine to perform the initial installation of Linux; once running, a
|
|
typical Linux PC does not need a "console"). In comparison, SMP and
|
|
attached processors are much smaller markets, tending toward somewhat
|
|
higher price per unit performance.
|
|
</LI>
|
|
<LI>Cluster computing can <EM>scale to very large systems</EM>.
|
|
While it is currently hard to find a Linux-compatible SMP with many
|
|
more than four processors, most commonly available network hardware
|
|
easily builds a cluster with up to 16 machines. With a little work,
|
|
hundreds or even thousands of machines can be networked. In fact, the
|
|
entire Internet can be viewed as one truly huge cluster.
|
|
</LI>
|
|
<LI>The fact that replacing a "bad machine" within a cluster is
|
|
trivial compared to fixing a partly faulty SMP yields much higher
|
|
availability for carefully designed cluster configurations. This
|
|
becomes important not only for particular applications that cannot
|
|
tolerate significant service interruptions, but also for general use
|
|
of systems containing enough processors so that single-machine
|
|
failures are fairly common. (For example, even though the average
|
|
time to failure of a PC might be two years, in a cluster with 32
|
|
machines, the probability that at least one will fail within 6 months
|
|
is quite high; a worked example follows this list.)</LI>
|
|
</UL>
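<P>To make that last point concrete, suppose (purely for illustration)
that machine failures are independent and exponentially distributed
with a two-year mean time to failure.  Then, roughly:
<P>
<HR>
<PRE>
P(a given PC survives 6 months)  =  exp(-0.5/2)  =  exp(-0.25)  =  about 0.78
P(all 32 PCs survive 6 months)   =  0.78^32      =  exp(-8)     =  about 0.0003
P(at least one machine fails)    =  1 - 0.0003                  =  about 99.97%
</PRE>
<HR>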
|
|
<P>
|
|
<P>OK, so clusters are free or cheap and can be very large and highly
|
|
available... why doesn't everyone use a cluster? Well, there are
|
|
problems too:
|
|
<P>
|
|
<UL>
|
|
<LI>With a few exceptions, network hardware is not designed for
|
|
parallel processing. Typically latency is very high and bandwidth
|
|
relatively low compared to SMP and attached processors. For example,
|
|
SMP latency is generally no more than a few microseconds, but is
|
|
commonly hundreds or thousands of microseconds for a cluster. SMP
|
|
communication bandwidth is often more than 100 MBytes/second; although
|
|
the fastest network hardware (e.g., "Gigabit Ethernet") offers
|
|
comparable speed, the most commonly used networks are between 10 and
|
|
1000 times slower.
|
|
|
|
The performance of network hardware is poor enough as an <EM>isolated
|
|
cluster network</EM>. If the network is not isolated from other
|
|
traffic, as is often the case using "machines that happen to be
|
|
networked" rather than a system designed as a cluster, performance can
|
|
be substantially worse.
|
|
</LI>
|
|
<LI>There is very little software support for treating a cluster as a
|
|
single system. For example, the <CODE>ps</CODE> command only reports the
|
|
processes running on one Linux system, not all processes running
|
|
across a cluster of Linux systems.</LI>
|
|
</UL>
|
|
<P>
|
|
<P>Thus, the basic story is that clusters offer great potential, but that
|
|
potential may be very difficult to achieve for most applications. The
|
|
good news is that there is quite a lot of software support that will
|
|
help you achieve good performance for programs that are well suited to
|
|
this environment, and there are also networks designed specifically to
|
|
widen the range of programs that can achieve good performance.
|
|
<P>
|
|
<H2><A NAME="ss3.2">3.2 Network Hardware</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Computer networking is an exploding field... but you already knew
|
|
that. An ever-increasing range of networking technologies and
|
|
products are being developed, and most are available in forms that
|
|
could be applied to make a parallel-processing cluster out of a group
|
|
of machines (i.e., PCs each running Linux).
|
|
<P>Unfortunately, no one network technology solves all problems best; in
|
|
fact, the range of approach, cost, and performance is at first hard to
|
|
believe. For example, using standard commercially-available hardware,
|
|
the cost per machine networked ranges from less than $5 to over
|
|
$4,000. The delivered bandwidth and latency each also vary
|
|
over four orders of magnitude.
|
|
<P>Before trying to learn about specific networks, it is important to
|
|
recognize that these things change like the wind (see
|
|
<A HREF="http://www.linux.org.uk/NetNews.html">http://www.linux.org.uk/NetNews.html</A> for Linux networking news),
|
|
and it is very difficult to get accurate data about some networks.
|
|
<P>Where I was particularly uncertain,
|
|
I've placed a <EM>?</EM>. I have spent a lot of time researching this
|
|
topic, but I'm sure my summary is full of errors and has omitted many
|
|
important things. If you have any corrections or additions, please
|
|
send email to
|
|
<A HREF="mailto:hankd@engr.uky.edu">hankd@engr.uky.edu</A>.
|
|
<P>Summaries like the LAN Technology Scorecard at
|
|
<A HREF="http://web.syr.edu/~jmwobus/comfaqs/lan-technology.html">http://web.syr.edu/~jmwobus/comfaqs/lan-technology.html</A> give
|
|
some characteristics of many different types of networks and LAN
|
|
standards. However, the summary in this HOWTO centers on the network
|
|
properties that are most relevant to construction of Linux clusters.
|
|
The section discussing each network begins with a short list of
|
|
characteristics. The following defines what these entries mean.
|
|
<P>
|
|
<DL>
|
|
<DT><B>Linux support:</B><DD><P>If the answer is <EM>no</EM>, the meaning is pretty clear. Other
|
|
answers try to describe the basic program interface that is used to
|
|
access the network. Most network hardware is interfaced via a kernel
|
|
driver, typically supporting TCP/UDP communication. Some other
|
|
networks use more direct (e.g., library) interfaces to reduce latency
|
|
by bypassing the kernel.
|
|
<P>
|
|
<P>
|
|
<P>Years ago, it used to be considered perfectly acceptable to access a
|
|
floating point unit via an OS call, but that is now clearly ludicrous;
|
|
in my opinion, it is just as awkward for each communication between
|
|
processors executing a parallel program to require an OS call. The
|
|
problem is that computers haven't yet integrated these communication
|
|
mechanisms, so non-kernel approaches tend to have portability problems.
|
|
You are going to hear a lot more about this in the near future, mostly
|
|
in the form of the new <B>Virtual Interface (VI) Architecture</B>,
|
|
<A HREF="http://www.viarch.org/">http://www.viarch.org/</A>, which is a standardized method for
|
|
most network interface operations to bypass the usual OS call layers.
|
|
The VI standard is backed by Compaq, Intel, and Microsoft, and is sure
|
|
to have a strong impact on SAN (System Area Network) designs over the
|
|
next few years.
|
|
<P>
|
|
<DT><B>Maximum bandwidth:</B><DD><P>This is the number everybody cares about. I have generally used the
|
|
theoretical best case numbers; your mileage <EM>will</EM> vary.
|
|
<P>
|
|
<DT><B>Minimum latency:</B><DD><P>In my opinion, this is the number everybody should care about even more
|
|
than bandwidth. Again, I have used the unrealistic best-case numbers,
|
|
but at least these numbers do include <EM>all</EM> sources of latency,
|
|
both hardware and software. In most cases, the network latency is just
|
|
a few microseconds; the much larger numbers reflect layers of
|
|
inefficient hardware and software interfaces.
|
|
<P>
|
|
<DT><B>Available as:</B><DD><P>Simply put, this describes how you get this type of network hardware.
|
|
Commodity stuff is widely available from many vendors, with price as
|
|
the primary distinguishing factor. Multiple-vendor things are
|
|
available from more than one competing vendor, but there are
|
|
significant differences and potential interoperability problems.
|
|
Single-vendor networks leave you at the mercy of that supplier
|
|
(however benevolent it may be). Public domain designs mean that even
|
|
if you cannot find somebody to sell you one, you or anybody else can
|
|
buy parts and make one. Research prototypes are just that; they are
|
|
generally neither ready for external users nor available to them.
|
|
<P>
|
|
<DT><B>Interface port/bus used:</B><DD><P>How does one hook up this network? The highest performance and most
|
|
common now is a PCI bus interface card. There are also EISA, VESA
|
|
local bus (VL bus), and ISA bus cards. ISA was there first, and is
|
|
still commonly used for low-performance cards. EISA is still around
|
|
as the second bus in a lot of PCI machines, so there are a few cards.
|
|
These days, you don't see much VL stuff (although
|
|
<A HREF="http://www.vesa.org/">http://www.vesa.org/</A> would beg to differ).
|
|
<P>
|
|
<P>
|
|
<P>Of course, any interface that you can use without having to open your
|
|
PC's case has more than a little appeal. IrDA and USB interfaces are
|
|
appearing with increasing frequency. The Standard Parallel Port (SPP)
|
|
used to be what your printer was plugged into, but it has seen a lot
|
|
of use lately as an external extension of the ISA bus; this new
|
|
functionality is enhanced by the IEEE 1284 standard, which specifies
|
|
EPP and ECP improvements. There is also the old, reliable, slow RS232
|
|
serial port. I don't know of anybody connecting machines using VGA
|
|
video connectors, keyboard, mouse, or game ports... so that's about
|
|
it.
|
|
<P>
|
|
<DT><B>Network structure:</B><DD><P>A bus is a wire, set of wires, or fiber. A hub is a little box that
|
|
knows how to connect different wires/fibers plugged into it; switched
|
|
hubs allow multiple connections to be actively transmitting data
|
|
simultaneously.
|
|
<P>
|
|
<DT><B>Cost per machine connected:</B><DD><P>Here's how to use these numbers. Suppose that, not counting the
|
|
network connection, it costs $2,000 to purchase a PC for use as
|
|
a node in your cluster. Adding a Fast Ethernet brings the per node
|
|
cost to about $2,400; adding a Myrinet instead brings the cost
|
|
to about $3,800. If you have about $20,000 to spend,
|
|
that means you could have either 8 machines connected by Fast Ethernet
|
|
or 5 machines connected by Myrinet. It also can be very reasonable to
|
|
have multiple networks; e.g., $20,000 could buy 8 machines
|
|
connected by both Fast Ethernet and TTL_PAPERS. Pick the
|
|
network, or set of networks, that is most likely to yield a cluster
|
|
that will run your application fastest.
|
|
<P>
|
|
<P>
|
|
<P>By the time you read this, these numbers will be wrong... heck,
|
|
they're probably wrong already. There may also be quantity discounts,
|
|
special deals, etc. Still, the prices quoted here aren't likely to be
|
|
wrong enough to lead you to a totally inappropriate choice. It
|
|
doesn't take a PhD (although I do have one ;-) to see that expensive
|
|
networks only make sense if your application needs their special
|
|
properties or if the PCs being clustered are relatively expensive.
|
|
</DL>
|
|
<P>Now that you have the disclaimers, on with the show....
|
|
<P>
|
|
<H3>ArcNet</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>2.5 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>1,000 microseconds?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>ISA</EM></LI>
|
|
<LI>Network structure: <EM>unswitched hub or bus (logical ring)</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$200</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>ARCNET is a local area network that is primarily intended for use in
|
|
embedded real-time control systems. Like Ethernet, the network is
|
|
physically organized either as taps on a bus or one or more hubs;
|
|
however, unlike Ethernet, it uses a token-based protocol logically
|
|
structuring the network as a ring. Packet headers are small (3 or 4
|
|
bytes) and messages can carry as little as a single byte of data.
|
|
Thus, ARCNET yields more consistent performance than Ethernet, with
|
|
bounded delays, etc. Unfortunately, it is slower than Ethernet and
|
|
less popular, making it more expensive. More information is available
|
|
from the ARCNET Trade Association at
|
|
<A HREF="http://www.arcnet.com/">http://www.arcnet.com/</A>.
|
|
<P>
|
|
<H3>ATM</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel driver, AAL* library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>155 Mb/s</EM> (soon, <EM>1,200 Mb/s</EM>)</LI>
|
|
<LI>Minimum latency: <EM>120 microseconds</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$3,000</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>Unless you've been in a coma for the past few years, you have probably
|
|
heard a lot about how ATM (Asynchronous Transfer Mode) <EM>is</EM> the
|
|
future... well, sort-of. ATM is cheaper than HiPPI and faster than
|
|
Fast Ethernet, and it can be used over the very long distances that
|
|
the phone companies care about. The ATM network protocol is also
|
|
designed to provide a lower-overhead software interface and to more
|
|
efficiently manage small messages and real-time communications (e.g.,
|
|
digital audio and video). It is also one of the highest-bandwidth
|
|
networks that Linux currently supports. The bad news is that ATM isn't
|
|
cheap, and there are still some compatibility problems across
|
|
vendors. An overview of Linux ATM development is available at
|
|
<A HREF="http://lrcwww.epfl.ch/linux-atm/">http://lrcwww.epfl.ch/linux-atm/</A>.
|
|
<P>
|
|
<H3>CAPERS</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>AFAPI library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1.2 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>SPP</EM></LI>
|
|
<LI>Network structure: <EM>cable between 2 machines</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$2</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>CAPERS (Cable Adapter for Parallel Execution and Rapid
|
|
Synchronization) is a spin-off of the PAPERS project,
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>, at the Purdue University
|
|
School of Electrical and Computer Engineering. In essence, it defines
|
|
a software protocol for using an ordinary "LapLink" SPP-to-SPP cable
|
|
to implement the PAPERS library for two Linux PCs. The idea doesn't
|
|
scale, but you can't beat the price. As with TTL_PAPERS, to improve
|
|
system security, there is a minor kernel patch recommended, but not
|
|
required:
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">http://garage.ecn.purdue.edu/~papers/giveioperm.html</A>.
|
|
<P>
|
|
<H3>Ethernet</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>10 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>100 microseconds</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched or unswitched hubs, or hubless bus</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$100</EM> (hubless, <EM>$50</EM>)</LI>
|
|
</UL>
|
|
<P>
|
|
<P>For some years now, 10 Mbits/s Ethernet has been the standard network
|
|
technology. Good Ethernet interface cards can be purchased for well
|
|
under $50, and a fair number of PCs now have an Ethernet controller
|
|
built into the motherboard. For lightly-used networks, Ethernet
|
|
connections can be organized as a multi-tap bus without a hub; such
|
|
configurations can serve up to 200 machines with minimal cost, but are
|
|
not appropriate for parallel processing. Adding an unswitched hub
|
|
does not really help performance. However, switched hubs that can
|
|
provide full bandwidth to simultaneous connections cost only about
|
|
$100 per port. Linux supports an amazing range of Ethernet
|
|
interfaces, but it is important to keep in mind that variations in the
|
|
interface hardware can yield significant performance differences. See
|
|
the Hardware Compatibility HOWTO for comments on which are supported
|
|
and how well they work; also see
|
|
<A HREF="http://cesdis1.gsfc.nasa.gov/linux/drivers/">http://cesdis1.gsfc.nasa.gov/linux/drivers/</A>.
|
|
<P>An interesting way to improve performance is offered by the 16-machine
|
|
Linux cluster work done in the Beowulf project,
|
|
<A HREF="http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html">http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html</A>, at NASA
|
|
CESDIS. There, Donald Becker, who is the author of many Ethernet card
|
|
drivers, has developed support for load sharing across multiple
|
|
Ethernet networks that shadow each other (i.e., share the same network
|
|
addresses). This load sharing is built into the standard Linux
|
|
distribution, and is done invisibly below the socket operation level.
|
|
Because hub cost is significant, having each machine connected to two
|
|
or more hubless or unswitched hub Ethernet networks can be a very
|
|
cost-effective way to improve performance. In fact, in situations
|
|
where one machine is the network performance bottleneck, load sharing
|
|
using shadow networks works much better than using a single switched
|
|
hub network.
|
|
<P>
|
|
<H3>Ethernet (Fast Ethernet)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>100 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>80 microseconds</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched or unswitched hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$400?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>Although there are really quite a few different technologies calling
|
|
themselves "Fast Ethernet," this term most often refers to a hub-based
|
|
100 Mbits/s Ethernet that is somewhat compatible with older "10 BaseT"
|
|
10 Mbits/s devices and cables. As might be expected, anything called
|
|
Ethernet is generally priced for a volume market, and these interfaces
|
|
are generally a small fraction of the price of 155 Mbits/s ATM cards.
|
|
The catch is that having a bunch of machines dividing the bandwidth of
|
|
a single 100 Mbits/s "bus" (using an unswitched hub) yields
|
|
performance that might not even be as good on average as using 10
|
|
Mbits/s Ethernet with a switched hub that can give each machine's
|
|
connection a full 10 Mbits/s.
|
|
<P>Switched hubs that can provide 100 Mbits/s for each machine
|
|
simultaneously are expensive, but prices are dropping every day, and
|
|
these switches do yield much higher total network bandwidth than
|
|
unswitched hubs. The thing that makes ATM switches so expensive is
|
|
that they must switch for each (relatively short) ATM cell; some Fast
|
|
Ethernet switches take advantage of the expected lower switching
|
|
frequency by using techniques that may have low latency through the
|
|
switch, but take multiple milliseconds to change the switch path...
|
|
if your routing pattern changes frequently, avoid those switches. See
|
|
<A HREF="http://cesdis1.gsfc.nasa.gov/linux/drivers/">http://cesdis1.gsfc.nasa.gov/linux/drivers/</A> for information
|
|
about the various cards and drivers.
|
|
<P>Also note that, as described for Ethernet, the Beowulf project,
|
|
<A HREF="http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html">http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html</A>, at NASA
|
|
has been developing support that offers improved performance by load
|
|
sharing across multiple Fast Ethernets.
|
|
<P>
|
|
<H3>Ethernet (Gigabit Ethernet)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1,000 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>300 microseconds?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched hubs or FDRs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$2,500?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>I'm not sure that Gigabit Ethernet,
|
|
<A HREF="http://www.gigabit-ethernet.org/">http://www.gigabit-ethernet.org/</A>, has a good technological
|
|
reason to be called Ethernet... but the name does accurately reflect
|
|
the fact that this is intended to be a cheap, mass-market, computer
|
|
network technology with native support for IP. However, current
|
|
pricing reflects the fact that Gb/s hardware is still a tricky thing
|
|
to build.
|
|
<P>Unlike other Ethernet technologies, Gigabit Ethernet provides for a
|
|
level of flow control that should make it a more reliable network.
|
|
FDRs, or Full-Duplex Repeaters, simply multiplex lines, using
|
|
buffering and localized flow control to improve performance. Most
|
|
switched hubs are being built as new interface modules for existing
|
|
gigabit-capable switch fabrics. Switch/FDR products have been shipped
|
|
or announced by at least
|
|
<A HREF="http://www.acacianet.com/">http://www.acacianet.com/</A>,
|
|
<A HREF="http://www.baynetworks.com/">http://www.baynetworks.com/</A>,
|
|
<A HREF="http://www.cabletron.com/">http://www.cabletron.com/</A>,
|
|
<A HREF="http://www.networks.digital.com/">http://www.networks.digital.com/</A>,
|
|
<A HREF="http://www.extremenetworks.com/">http://www.extremenetworks.com/</A>,
|
|
<A HREF="http://www.foundrynet.com/">http://www.foundrynet.com/</A>,
|
|
<A HREF="http://www.gigalabs.com/">http://www.gigalabs.com/</A>,
|
|
<A HREF="http://www.packetengines.com/">http://www.packetengines.com/</A>.
|
|
<A HREF="http://www.plaintree.com/">http://www.plaintree.com/</A>,
|
|
<A HREF="http://www.prominet.com/">http://www.prominet.com/</A>,
|
|
<A HREF="http://www.sun.com/">http://www.sun.com/</A>, and
|
|
<A HREF="http://www.xlnt.com/">http://www.xlnt.com/</A>.
|
|
<P>There is a Linux driver,
|
|
<A HREF="http://cesdis.gsfc.nasa.gov/linux/drivers/yellowfin.html">http://cesdis.gsfc.nasa.gov/linux/drivers/yellowfin.html</A>, for
|
|
the Packet Engines "Yellowfin" G-NIC,
|
|
<A HREF="http://www.packetengines.com/">http://www.packetengines.com/</A>. Early tests under Linux achieved
|
|
about 2.5x higher bandwidth than could be achieved with the best 100
|
|
Mb/s Fast Ethernet; with gigabit networks, careful tuning of PCI bus
|
|
use is a critical factor. There is little doubt that driver
|
|
improvements, and Linux drivers for other NICs, will follow.
|
|
<P>
|
|
<H3>FC (Fibre Channel)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1,062 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI?</EM></LI>
|
|
<LI>Network structure: <EM>?</EM></LI>
|
|
<LI>Cost per machine connected: <EM>?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>The goal of FC (Fibre Channel) is to provide high-performance block
|
|
I/O (an FC frame carries a 2,048 byte data payload), particularly for
|
|
sharing disks and other storage devices that can be directly connected
|
|
to the FC rather than connected through a computer. Bandwidth-wise,
|
|
FC is specified to be relatively fast, running anywhere between 133
|
|
and 1,062 Mbits/s. If FC becomes popular as a high-end SCSI
|
|
replacement, it may quickly become a cheap technology; for now, it is
|
|
not cheap and is not supported by Linux. A good collection of FC
|
|
references is maintained by the Fibre Channel Association at
|
|
<A HREF="http://www.amdahl.com/ext/CARP/FCA/FCA.html">http://www.amdahl.com/ext/CARP/FCA/FCA.html</A><P>
|
|
<H3>FireWire (IEEE 1394)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>196.608 Mb/s</EM> (soon, <EM>393.216 Mb/s</EM>)</LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>random without cycles (self-configuring)</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$600</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>FireWire,
|
|
<A HREF="http://www.firewire.org/">http://www.firewire.org/</A>, the IEEE 1394-1995
|
|
standard, is destined to be the low-cost high-speed digital network
|
|
for consumer electronics. The showcase application is connecting DV
|
|
digital video camcorders to computers, but FireWire is intended to be
|
|
used for applications ranging from being a SCSI replacement to
|
|
interconnecting the components of your home theater. It allows up to
|
|
64K devices to be connected in any topology using busses and bridges
|
|
that does not create a cycle, and automatically detects the
|
|
configuration when components are added or removed. Short (four-byte
|
|
"quadlet") low-latency messages are supported as well as ATM-like
|
|
isochronous transmission (used to keep multimedia messages
|
|
synchronized). Adaptec has FireWire products that allow up to 63
|
|
devices to be connected to a single PCI interface card, and also has
|
|
good general FireWire information at
|
|
<A HREF="http://www.adaptec.com/serialio/">http://www.adaptec.com/serialio/</A>.
|
|
<P>Although FireWire will not be the highest bandwidth network available,
|
|
the consumer-level market (which should drive prices very low) and low
|
|
latency support might make this one of the best Linux PC cluster
|
|
message-passing network technologies within the next year or so.
|
|
<P>
|
|
<H3>HiPPI And Serial HiPPI</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1,600 Mb/s</EM> (serial is <EM>1,200 Mb/s</EM>)</LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>EISA, PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$3,500</EM> (serial is <EM>$4,500</EM>)</LI>
|
|
</UL>
|
|
<P>
|
|
<P>HiPPI (High Performance Parallel Interface) was originally intended to
|
|
provide very high bandwidth for transfer of huge data sets between a
|
|
supercomputer and another machine (a supercomputer, frame buffer, disk
|
|
array, etc.), and has become the dominant standard for
|
|
supercomputers. Although it is an oxymoron, <B>Serial HiPPI</B> is
|
|
also becoming popular, typically using a fiber optic cable instead of
|
|
the 32-bit wide standard (parallel) HiPPI cables. Over the past few
|
|
years, HiPPI crossbar switches have become common and prices have
|
|
dropped sharply; unfortunately, serial HiPPI is still pricey, and that
|
|
is what PCI bus interface cards generally support. Worse still, Linux
|
|
doesn't yet support HiPPI. A good overview of HiPPI is maintained by
|
|
CERN at
|
|
<A HREF="http://www.cern.ch/HSI/hippi/">http://www.cern.ch/HSI/hippi/</A>; they also maintain
|
|
a rather long list of HiPPI vendors at
|
|
<A HREF="http://www.cern.ch/HSI/hippi/procintf/manufact.htm">http://www.cern.ch/HSI/hippi/procintf/manufact.htm</A>.
|
|
<P>
|
|
<H3>IrDA (Infrared Data Association)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no?</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1.15 Mb/s</EM> and <EM>4 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>IrDA</EM></LI>
|
|
<LI>Network structure: <EM>thin air</EM> ;-)</LI>
|
|
<LI>Cost per machine connected: <EM>$0</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>IrDA (Infrared Data Association,
|
|
<A HREF="http://www.irda.org/">http://www.irda.org/</A>) is
|
|
that little infrared device on the side of a lot of laptop PCs. It is
|
|
inherently difficult to connect more than two machines using this
|
|
interface, so it is unlikely to be used for clustering. Don Becker
|
|
did some preliminary work with IrDA.
|
|
<P>
|
|
<H3>Myrinet</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1,280 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>9 microseconds</EM></LI>
|
|
<LI>Available as: <EM>single-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>switched hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$1,800</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>Myrinet
|
|
<A HREF="http://www.myri.com/">http://www.myri.com/</A> is a local area network (LAN)
|
|
designed to also serve as a "system area network" (SAN), i.e., the
|
|
network within a cabinet full of machines connected as a parallel
|
|
system. The LAN and SAN versions use different physical media and
|
|
have somewhat different characteristics; generally, the SAN version
|
|
would be used within a cluster.
|
|
<P>Myrinet is fairly conventional in structure, but has a reputation for
|
|
being particularly well-implemented. The drivers for Linux are said
|
|
to perform very well, although shockingly large performance variations
|
|
have been reported with different PCI bus implementations for the host
|
|
computers.
|
|
<P>Currently, Myrinet is clearly the favorite network of cluster groups
|
|
that are not too severely "budgetarily challenged." If your idea of a
|
|
Linux PC is a high-end Pentium Pro or Pentium II with at least 256 MB
|
|
RAM and a SCSI RAID, the cost of Myrinet is quite reasonable. However,
|
|
using more ordinary PC configurations, you may find that your choice
|
|
is between <EM>N</EM> machines linked by Myrinet or <EM>2N</EM> linked
|
|
by multiple Fast Ethernets and TTL_PAPERS. It really depends
|
|
on what your budget is and what types of computations you care about
|
|
most.
|
|
<P>
|
|
<H3>Parastation</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>HAL or socket library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>125 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>2 microseconds</EM></LI>
|
|
<LI>Available as: <EM>single-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>hubless mesh</EM></LI>
|
|
<LI>Cost per machine connected: <EM>> $1,000</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>The ParaStation project
|
|
<A HREF="http://wwwipd.ira.uka.de/parastation">http://wwwipd.ira.uka.de/parastation</A> at University of Karlsruhe
|
|
Department of Informatics is building a PVM-compatible custom
|
|
low-latency network. They first constructed a two-processor ParaPC
|
|
prototype using a custom EISA card interface and PCs running BSD UNIX,
|
|
and then built larger clusters using DEC Alphas. Since January 1997,
|
|
ParaStation has been available for Linux. The PCI cards are being
|
|
made in cooperation with a company called Hitex (see
|
|
<A HREF="http://www.hitex.com:80/parastation/">http://www.hitex.com:80/parastation/</A>). Parastation hardware
|
|
implements both fast, reliable, message transmission and simple barrier
|
|
synchronization.
|
|
<P>
|
|
<H3>PLIP</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel driver</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1.2 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>1,000 microseconds?</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>SPP</EM></LI>
|
|
<LI>Network structure: <EM>cable between 2 machines</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$2</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>For just the cost of a "LapLink" cable, PLIP (Parallel Line Internet
|
|
Protocol) allows two Linux machines to communicate through standard
|
|
parallel ports using standard socket-based software. In terms of
|
|
bandwidth, latency, and scalability, this is not a very serious
|
|
network technology; however, the near-zero cost and the software
|
|
compatibility are useful. The driver is part of the standard Linux
|
|
kernel distributions.
|
|
<P>
|
|
<H3>SCI</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>4,000 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>2.7 microseconds</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI, proprietary</EM></LI>
|
|
<LI>Network structure: <EM>?</EM></LI>
|
|
<LI>Cost per machine connected: <EM>> $1,000</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>The goal of SCI (Scalable Coherent Interconnect, ANSI/IEEE 1596-1992)
|
|
is essentially to provide a high performance mechanism that can
|
|
support coherent shared memory access across large numbers of
|
|
machines, as well as various types of block message transfers. It is
|
|
fairly safe to say that the designed bandwidth and latency of SCI are
|
|
both "awesome" in comparison to most other network technologies. The
|
|
catch is that SCI is not widely available as cheap production units,
|
|
and there isn't any Linux support.
|
|
<P>SCI primarily is used in various proprietary designs for
|
|
logically-shared physically-distributed memory machines, such as the
|
|
HP/Convex Exemplar SPP and the Sequent NUMA-Q 2000 (see
|
|
<A HREF="http://www.sequent.com/">http://www.sequent.com/</A>). However, SCI is available as a PCI
|
|
interface card and 4-way switches (up to 16 machines can be connected
|
|
by cascading four 4-way switches) from Dolphin,
|
|
<A HREF="http://www.dolphinics.com/">http://www.dolphinics.com/</A>, as their CluStar product line. A
|
|
good set of links overviewing SCI is maintained by CERN at
|
|
<A HREF="http://www.cern.ch/HSI/sci/sci.html">http://www.cern.ch/HSI/sci/sci.html</A>.
|
|
<P>
|
|
<H3>SCSI</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>5 MBytes/s</EM> to over <EM>20 MBytes/s</EM></LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>multiple-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI, EISA, ISA card</EM></LI>
|
|
<LI>Network structure: <EM>inter-machine bus sharing SCSI devices</EM></LI>
|
|
<LI>Cost per machine connected: <EM>?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>SCSI (Small Computer System Interface) is essentially an I/O bus
|
|
that is used for disk drives, CD ROMS, image scanners, etc. There are
|
|
three separate standards SCSI-1, SCSI-2, and SCSI-3; Fast and Ultra
|
|
speeds; and data path widths of 8, 16, or 32 bits (with FireWire
|
|
compatibility also mentioned in SCSI-3). It is all pretty confusing,
|
|
but we all know a good SCSI is somewhat faster than EIDE and can handle
|
|
more devices more efficiently.
|
|
<P>What many people do not realize is that it is fairly simple for two
|
|
computers to share a single SCSI bus. This type of configuration is
|
|
very useful for sharing disk drives between machines and implementing
|
|
<B>fail-over</B> - having one machine take over database requests
|
|
when the other machine fails. Currently, this is the only mechanism
|
|
supported by Microsoft's PC cluster product, WolfPack. However, the
|
|
inability to scale to larger systems renders shared SCSI uninteresting
|
|
for parallel processing in general.
|
|
<P>
|
|
<H3>ServerNet</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>no</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>400 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
|
|
<LI>Available as: <EM>single-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>PCI</EM></LI>
|
|
<LI>Network structure: <EM>hexagonal tree/tetrahedral lattice of hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>ServerNet is the high-performance network hardware from Tandem,
|
|
<A HREF="http://www.tandem.com">http://www.tandem.com</A>. Especially in the online transation
|
|
processing (OLTP) world, Tandem is well known as a leading producer of
|
|
high-reliability systems, so it is not surprising that their network
|
|
claims not just high performance, but also "high data integrity and
|
|
reliability." Another interesting aspect of ServerNet is that it
|
|
claims to be able to transfer data from any device directly to any
|
|
device; not just between processors, but also disk drives, etc., in a
|
|
one-sided style similar to that suggested by the MPI remote memory
|
|
access mechanisms described in section 3.5. One last comment about
|
|
ServerNet: although there is just a single vendor, that vendor is
|
|
powerful enough to potentially establish ServerNet as a major
|
|
standard... Tandem is owned by Compaq.
|
|
<P>
|
|
<H3>SHRIMP</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>user-level memory mapped interface</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>180 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>5 microseconds</EM></LI>
|
|
<LI>Available as: <EM>research prototype</EM></LI>
|
|
<LI>Interface port/bus used: <EM>EISA</EM></LI>
|
|
<LI>Network structure: <EM>mesh backplane (as in Intel Paragon)</EM></LI>
|
|
<LI>Cost per machine connected: <EM>?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>The SHRIMP project,
|
|
<A HREF="http://www.CS.Princeton.EDU/shrimp/">http://www.CS.Princeton.EDU/shrimp/</A>,
|
|
at the Princeton University Computer Science Department is building a
|
|
parallel computer using PCs running Linux as the processing elements.
|
|
The first SHRIMP (Scalable, High-Performance, Really Inexpensive
|
|
Multi-Processor) was a simple two-processor prototype using a
|
|
dual-ported RAM on a custom EISA card interface. There is now a
|
|
prototype that will scale to larger configurations using a custom
|
|
interface card to connect to a "hub" that is essentially the same mesh
|
|
routing network used in the Intel Paragon (see
|
|
<A HREF="http://www.ssd.intel.com/paragon.html">http://www.ssd.intel.com/paragon.html</A>). Considerable effort
|
|
has gone into developing low-overhead "virtual memory mapped
|
|
communication" hardware and support software.
|
|
<P>
|
|
<H3>SLIP</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel drivers</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>0.1 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>1,000 microseconds?</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>RS232C</EM></LI>
|
|
<LI>Network structure: <EM>cable between 2 machines</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$2</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>Although SLIP (Serial Line Internet Protocol) is firmly planted at
|
|
the low end of the performance spectrum, SLIP (or CSLIP or PPP) allows
|
|
two machines to perform socket communication via ordinary RS232 serial
|
|
ports. The RS232 ports can be connected using a null-modem RS232
|
|
serial cable, or they can even be connected via dial-up through a
|
|
modem. In any case, latency is high and bandwidth is low, so SLIP
|
|
should be used only when no other alternatives are available. It is
|
|
worth noting, however, that most PCs have two RS232 ports, so it would
|
|
be possible to network a group of machines simply by connecting the
|
|
machines as a linear array or as a ring. There is even load sharing
|
|
software called EQL.
|
|
<P>
|
|
<H3>TTL_PAPERS</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>AFAPI library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>1.6 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
|
|
<LI>Available as: <EM>public-domain design, single-vendor hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>SPP</EM></LI>
|
|
<LI>Network structure: <EM>tree of hubs</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$100</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>The PAPERS (Purdue's Adapter for Parallel Execution and Rapid
|
|
Synchronization) project,
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>, at the Purdue University
|
|
School of Electrical and Computer Engineering is building scalable,
|
|
low-latency, aggregate function communication hardware and software
|
|
that allows a parallel supercomputer to be built using unmodified
|
|
PCs/workstations as nodes.
|
|
<P>There have been over a dozen different types of PAPERS hardware built
|
|
that connect to PCs/workstations via the SPP (Standard Parallel Port),
|
|
roughly following two development lines. The versions called "PAPERS"
|
|
target higher performance, using whatever technologies are appropriate;
|
|
current work uses FPGAs, and high bandwidth PCI bus interface designs
|
|
are also under development. In contrast, the versions called
|
|
"TTL_PAPERS" are designed to be easily reproduced outside
|
|
Purdue, and are remarkably simple public domain designs that can be
|
|
built using ordinary TTL logic. One such design is produced
|
|
commercially,
|
|
<A HREF="http://chelsea.ios.com:80/~hgdietz/sbm4.html">http://chelsea.ios.com:80/~hgdietz/sbm4.html</A>.
|
|
<P>Unlike the custom hardware designs from other universities,
|
|
TTL_PAPERS clusters have been assembled at many universities
|
|
from the USA to South Korea. Bandwidth is severely limited by the SPP
|
|
connections, but PAPERS implements very low latency aggregate function
|
|
communications; even the fastest message-oriented systems cannot
|
|
provide comparable performance on those aggregate functions. Thus,
|
|
PAPERS is particularly good for synchronizing the displays of a video
|
|
wall (to be discussed further in the upcoming Video Wall HOWTO),
|
|
scheduling accesses to a high-bandwidth network, evaluating global
|
|
fitness in genetic searches, etc. Although PAPERS clusters have been
|
|
built using IBM PowerPC AIX, DEC Alpha OSF/1, and HP PA-RISC HP-UX
|
|
machines, Linux-based PCs are the platforms best supported.
|
|
<P>User programs using TTL_PAPERS AFAPI directly access the SPP
|
|
hardware port registers under Linux, without an OS call for each
|
|
access. To do this, AFAPI first gets port permission using either
|
|
<CODE>iopl()</CODE> or <CODE>ioperm()</CODE>. The problem with these calls is
|
|
that both require the user program to be privileged, yielding a
|
|
potential security hole. The solution is an optional kernel patch,
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">http://garage.ecn.purdue.edu/~papers/giveioperm.html</A>, that
|
|
allows a privileged process to control port permission for any process.
|
|
<P>
|
|
<H3>USB (Universal Serial Bus)</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>kernel driver</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>12 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>?</EM></LI>
|
|
<LI>Available as: <EM>commodity hardware</EM></LI>
|
|
<LI>Interface port/bus used: <EM>USB</EM></LI>
|
|
<LI>Network structure: <EM>bus</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$5?</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>USB (Universal Serial Bus,
|
|
<A HREF="http://www.usb.org/">http://www.usb.org/</A>) is a
|
|
hot-pluggable, conventional-Ethernet-speed bus for up to 127
|
|
peripherals ranging from keyboards to video conferencing cameras. It
|
|
isn't really clear how multiple computers get connected to each other
|
|
using USB. In any case, USB ports are quickly becoming as standard on
|
|
PC motherboards as RS232 and SPP, so don't be surprised if one or two
|
|
USB ports are lurking on the back of the next PC you buy. Development
|
|
of a Linux driver is discussed at
|
|
<A HREF="http://peloncho.fis.ucm.es/~inaky/USB.html">http://peloncho.fis.ucm.es/~inaky/USB.html</A>.
|
|
<P>In some ways, USB is almost the low-performance, zero-cost, version of
|
|
FireWire that you can purchase today.
|
|
<P>
|
|
<H3>WAPERS</H3>
|
|
|
|
<P>
|
|
<UL>
|
|
<LI>Linux support: <EM>AFAPI library</EM></LI>
|
|
<LI>Maximum bandwidth: <EM>0.4 Mb/s</EM></LI>
|
|
<LI>Minimum latency: <EM>3 microseconds</EM></LI>
|
|
<LI>Available as: <EM>public-domain design</EM></LI>
|
|
<LI>Interface port/bus used: <EM>SPP</EM></LI>
|
|
<LI>Network structure: <EM>wiring pattern between 2-64 machines</EM></LI>
|
|
<LI>Cost per machine connected: <EM>$5</EM></LI>
|
|
</UL>
|
|
<P>
|
|
<P>WAPERS (Wired-AND Adapter for Parallel Execution and Rapid
|
|
Synchronization) is a spin-off of the PAPERS project,
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>, at the Purdue University
|
|
School of Electrical and Computer Engineering. If implemented
|
|
properly, the SPP has four bits of open-collector output that can be
|
|
wired together across machines to implement a 4-bit wide wired AND.
|
|
This wired-AND is electrically touchy, and the maximum number of
|
|
machines that can be connected in this way critically depends on the
|
|
analog properties of the ports (maximum sink current and pull-up
|
|
resistor value); typically, up to 7 or 8 machines can be networked by
|
|
WAPERS. Although cost and latency are very low, so is bandwidth;
|
|
WAPERS is much better as a second network for aggregate operations
|
|
than as the only network in a cluster. As with TTL_PAPERS, to
|
|
improve system security, there is a minor kernel patch recommended,
|
|
but not required:
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">http://garage.ecn.purdue.edu/~papers/giveioperm.html</A>.
|
|
<P>
|
|
<H2><A NAME="ss3.3">3.3 Network Software Interface</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Before moving on to discuss the software support for parallel
|
|
applications, it is useful to first briefly cover the basics of
|
|
the low-level software interface to the network hardware. There are
|
|
really only three basic choices: sockets, device drivers, and
|
|
user-level libraries.
|
|
<P>
|
|
<H3>Sockets</H3>
|
|
|
|
<P>
|
|
<P>By far the most common low-level network interface is a socket
|
|
interface. Sockets have been a part of unix for over a decade, and
|
|
most standard network hardware is designed to support at least two
|
|
types of socket protocols: UDP and TCP. Both types of socket allow
|
|
you to send arbitrary size blocks of data from one machine to another,
|
|
but there are several important differences. Typically, both yield a
|
|
minimum latency of around 1,000 microseconds, although performance can
|
|
be far worse depending on network traffic.
|
|
<P>These socket types are the basic network software interface for most
|
|
of the portable, higher-level, parallel processing software; for
|
|
example, PVM uses a combination of UDP and TCP, so knowing the
|
|
difference will help you tune performance. For even better
|
|
performance, you can also use these mechanisms directly in your
|
|
program. The following is just a simple overview of UDP and TCP; see
|
|
the manual pages and a good network programming book for details.
|
|
<P>
|
|
<H3>UDP Protocol (SOCK_DGRAM)</H3>
|
|
|
|
<P>
|
|
<P><B>UDP</B> is the User Datagram Protocol, but you can more easily
|
|
remember the properties of UDP as Unreliable Datagram Processing. In
|
|
other words, UDP allows each block to be sent as an individual message,
|
|
but a message might be lost in transmission. In fact, depending on
|
|
network traffic, UDP messages can be lost, can arrive multiple times,
|
|
or can arrive in an order different from that in which they were
|
|
sent. The sender of a UDP message does not automatically get an
|
|
acknowledgment, so it is up to user-written code to detect and
|
|
compensate for these problems. Fortunately, UDP does ensure that if a
|
|
message arrives, the message contents are intact (i.e., you never get
|
|
just part of a UDP message).
|
|
<P>The nice thing about UDP is that it tends to be the fastest socket
|
|
protocol. Further, UDP is "connectionless," which means that each
|
|
message is essentially independent of all others. A good analogy is
|
|
that each message is like a letter to be mailed; you might send
|
|
multiple letters to the same address, but each one is independent of
|
|
the others and there is no limit on how many people you can send
|
|
letters to.
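<P>For example, here is a minimal sketch of sending a single UDP
datagram using the standard socket calls; the node address and port
number are invented for the example, and error checking is omitted:
<P>
<HR>
<PRE>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int
main(void)
{
  int s = socket(AF_INET, SOCK_DGRAM, 0);  /* a UDP socket */
  struct sockaddr_in to;
  char msg[] = "hello";

  memset(&to, 0, sizeof(to));
  to.sin_family = AF_INET;
  to.sin_port = htons(5000);                      /* example port */
  to.sin_addr.s_addr = inet_addr("192.168.1.2");  /* example node */

  /* each sendto() is one independent datagram; it may be lost,
     duplicated, or arrive out of order
  */
  sendto(s, msg, sizeof(msg), 0, (struct sockaddr *) &to, sizeof(to));
  return(0);
}
</PRE>
<HR>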
|
|
<P>
|
|
<H3>TCP Protocol (SOCK_STREAM)</H3>
|
|
|
|
<P>
|
|
<P>Unlike UDP, <B>TCP</B> is a reliable, connection-based, protocol.
|
|
Each block sent is not seen as a message, but as a block of data
|
|
within an apparently continuous stream of bytes being transmitted
|
|
through a connection between sender and receiver. This is very
|
|
different from UDP messaging because each block is simply part of the
|
|
byte stream and it is up to the user code to figure out how to extract
|
|
each block from the byte stream; there are no markings separating
|
|
messages. Further, the connections are more fragile with respect to
|
|
network problems, and only a limited number of connections can exist
|
|
simultaneously for each process. Because it is reliable, TCP
|
|
generally implies significantly more overhead than UDP.
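<P>For example, one common way (by no means the only one) for user code
to recover message boundaries is to prefix each block with its length
and have the receiver loop until that many bytes have arrived.  The
following is a minimal sketch of that technique, with error handling
abbreviated:
<P>
<HR>
<PRE>
#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>

/* read exactly len bytes; one read() on a TCP stream may return
   fewer bytes than requested, so keep reading until done
*/
static int
read_full(int fd, char *buf, uint32_t len)
{
  uint32_t got = 0;
  while (got < len) {
    ssize_t r = read(fd, buf + got, len - got);
    if (r <= 0) return(-1);
    got += r;
  }
  return(0);
}

/* send one message as a 4-byte length header plus the payload */
int
send_msg(int fd, const char *buf, uint32_t len)
{
  uint32_t n = htonl(len);
  if (write(fd, &n, sizeof(n)) != sizeof(n)) return(-1);
  return((write(fd, buf, len) == (ssize_t) len) ? 0 : -1);
}

/* receive one message framed by send_msg(); returns its length */
int
recv_msg(int fd, char *buf, uint32_t max)
{
  uint32_t n;
  if (read_full(fd, (char *) &n, sizeof(n)) < 0) return(-1);
  n = ntohl(n);
  if (n > max) return(-1);
  if (read_full(fd, buf, n) < 0) return(-1);
  return((int) n);
}
</PRE>
<HR>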
|
|
<P>There are, however, a few pleasant surprises about TCP. One is that,
|
|
if multiple messages are sent through a connection, TCP is able to
|
|
pack them together in a buffer to better match network hardware packet
|
|
sizes, potentially yielding better-than-UDP performance for groups of
|
|
short or oddly-sized messages. The other bonus is that networks
|
|
constructed using reliable direct physical links between machines can
|
|
easily and efficiently simulate TCP connections. For example, this was
|
|
done for the ParaStation's "Socket Library" interface software, which
|
|
provides TCP semantics using user-level calls that differ from the
|
|
standard TCP OS calls only by the addition of the prefix
|
|
<CODE>PSS</CODE> to each function name.
|
|
<P>
|
|
<H3>Device Drivers</H3>
|
|
|
|
<P>
|
|
<P>When it comes to actually pushing data onto the network or pulling data
|
|
off the network, the standard unix software interface is a part of the
|
|
unix kernel called a device driver. UDP and TCP don't just transport
|
|
data, they also imply a fair amount of overhead for socket management.
|
|
For example, something has to manage the fact that multiple TCP
|
|
connections can share a single physical network interface. In
|
|
contrast, a device driver for a dedicated network interface only needs
|
|
to implement a few simple data transport functions. These device
|
|
driver functions can then be invoked by user programs by using
|
|
<CODE>open()</CODE> to identify the proper device and then using system
|
|
calls like <CODE>read()</CODE> and <CODE>write()</CODE> on the open "file."
|
|
Thus, each such operation could transport a block of data with little
|
|
more than the overhead of a system call, which might be as fast as
|
|
tens of microseconds.
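<P>For example, a user program might move blocks through such a
dedicated-interface driver roughly as follows; the device name
<CODE>/dev/mynet0</CODE> is purely hypothetical:
<P>
<HR>
<PRE>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
  char block[1024];
  /* "/dev/mynet0" is an invented name for a hypothetical dedicated
     network interface exposed by a custom device driver
  */
  int fd = open("/dev/mynet0", O_RDWR);

  if (fd < 0) return(1);
  write(fd, block, sizeof(block));  /* push one block onto the network */
  read(fd, block, sizeof(block));   /* pull the next block off         */
  close(fd);
  return(0);
}
</PRE>
<HR>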
|
|
<P>Writing a device driver to be used with Linux is not hard... if you
|
|
know <EM>precisely</EM> how the device hardware works. If you are not
|
|
sure how it works, don't guess. Debugging device drivers isn't fun
|
|
and mistakes can fry hardware. However, if that hasn't scared you
|
|
off, it may be possible to write a device driver to, for example, use
|
|
dedicated Ethernet cards as dumb but fast direct machine-to-machine
|
|
connections without the usual Ethernet protocol overhead. In fact,
|
|
that's pretty much what some early Intel supercomputers did.... Look
|
|
at the Device Driver HOWTO for more information.
|
|
<P>
|
|
<H3>User-Level Libraries</H3>
|
|
|
|
<P>
|
|
<P>If you've taken an OS course, user-level access to hardware device
|
|
registers is exactly what you have been taught never to do, because
|
|
one of the primary purposes of an OS is to control device access.
|
|
However, an OS call is at least tens of microseconds of overhead. For
|
|
custom network hardware like TTL_PAPERS, which can perform a
|
|
basic network operation in just 3 microseconds, such OS call overhead
|
|
is intolerable. The only way to avoid that overhead is to have
|
|
user-level code - a user-level library - directly access hardware
|
|
device registers. Thus, the question becomes one of how a user-level
|
|
library can access hardware directly, yet not compromise the OS
|
|
control of device access rights.
|
|
<P>On a typical system, the only way for a user-level library to directly
|
|
access hardware device registers is to:
|
|
<P>
|
|
<OL>
|
|
<LI>At user program start-up, use an OS call to map the page of
|
|
memory address space containing the device registers into the user
|
|
process virtual memory map. For some systems, the <CODE>mmap()</CODE>
|
|
call (first mentioned in section 2.6) can be used to map a special
|
|
file which represents the physical memory page addresses of the I/O
|
|
devices. Alternatively, it is relatively simple to write a device
|
|
driver to perform this function. Further, this device driver can
|
|
control access by only mapping the page(s) containing the specific
|
|
device registers needed, thereby maintaining OS access control. (A
minimal sketch of this <CODE>mmap()</CODE> approach follows this list.)
</LI>
|
|
<LI>Access device registers without an OS call by simply loading or
|
|
storing to the mapped addresses. For example, <CODE>*((char *) 0x1234) =
|
|
5;</CODE> would store the byte value 5 into memory location 1234
|
|
(hexadecimal).</LI>
|
|
</OL>
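<P>A minimal sketch of the memory-mapping approach is given below; it
assumes a memory-mapped device whose register page starts at physical
address 0xa0000 (an arbitrary value chosen only for the example), and
it omits error handling:
<P>
<HR>
<PRE>
#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
  /* 0xa0000 stands in for the physical address of the page that
     holds the device registers; a real library would map only the
     page(s) it actually needs
  */
  off_t reg_page = 0xa0000;
  int fd = open("/dev/mem", O_RDWR);
  volatile char *regs;

  if (fd < 0) return(1);
  regs = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, reg_page);
  if (regs == MAP_FAILED) return(1);

  regs[0] = 5;   /* an ordinary store; no OS call per register access */
  return(0);
}
</PRE>
<HR>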
|
|
<P>Fortunately, it happens that Linux for the Intel 386 (and compatible
|
|
processors) offers an even better solution:
|
|
<P>
|
|
<OL>
|
|
<LI>Using the <CODE>ioperm()</CODE> OS call from a privileged process,
|
|
get permission to access the precise I/O port addresses that
|
|
correspond to the device registers. Alternatively, permission can be
|
|
managed by an independent privileged user process (i.e., a "meta OS")
|
|
using the
|
|
<A HREF="http://garage.ecn.purdue.edu/~papers/giveioperm.html">giveioperm() OS call</A> patch for Linux.
|
|
</LI>
|
|
<LI>Access device registers without an OS call by using 386 port I/O
|
|
instructions.</LI>
|
|
</OL>
|
|
<P>
|
|
<P>This second solution is preferable because it is common that multiple
|
|
I/O devices have their registers within a single page, in which case
|
|
the first technique would not provide protection against accessing
|
|
other device registers that happened to reside in the same page as the
|
|
ones intended. Of course, the down side is that 386 port I/O
|
|
instructions cannot be coded in C - instead, you will need to use a
|
|
bit of assembly code. The GCC-wrapped (usable in C programs) inline
|
|
assembly code function for a port input of a byte value is:
|
|
<P>
|
|
<HR>
|
|
<PRE>
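/* read one byte from I/O port "port" (no OS call per access) */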
|
|
extern inline unsigned char
|
|
inb(unsigned short port)
|
|
{
|
|
unsigned char _v;
|
|
__asm__ __volatile__ ("inb %w1,%b0"
|
|
:"=a" (_v)
|
|
:"d" (port), "0" (0));
|
|
return _v;
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>Similarly, the GCC-wrapped code for a byte port output is:
|
|
<P>
|
|
<HR>
|
|
<PRE>
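/* write byte "value" to I/O port "port" (no OS call per access) */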
|
|
extern inline void
|
|
outb(unsigned char value,
|
|
unsigned short port)
|
|
{
|
|
__asm__ __volatile__ ("outb %b0,%w1"
|
|
:/* no outputs */
|
|
:"a" (value), "d" (port));
|
|
}
|
|
</PRE>
|
|
<HR>
|
|
<P>
|
|
<H2><A NAME="ss3.4">3.4 PVM (Parallel Virtual Machine)</A>
|
|
</H2>

<P>
<P>PVM (Parallel Virtual Machine) is a freely-available, portable,
message-passing library generally implemented on top of sockets.  It
is clearly established as the de-facto standard for message-passing
cluster parallel computing.
<P>PVM supports single-processor and SMP Linux machines, as well as
clusters of Linux machines linked by socket-capable networks (e.g.,
SLIP, PLIP, Ethernet, ATM).  In fact, PVM will even work across groups
of machines in which a variety of different types of processors,
configurations, and physical networks are used - <B>Heterogeneous
Clusters</B> - even to the scale of treating machines linked by the
Internet as a parallel cluster.  PVM also provides facilities for
parallel job control across a cluster.  Best of all, PVM has long been
freely available (currently from
<A HREF="http://www.epm.ornl.gov/pvm/pvm_home.html">http://www.epm.ornl.gov/pvm/pvm_home.html</A>), which has
led to many programming language compilers, application libraries,
programming and debugging tools, etc., using it as their "portable
message-passing target library."  There is also a network newsgroup,
<A HREF="news:comp.parallel.pvm">comp.parallel.pvm</A>.
<P>It is important to note, however, that PVM message-passing calls
generally add significant overhead to standard socket operations,
which already have high latency.  Further, the message handling calls
themselves do not constitute a particularly "friendly" programming
model.
<P>Using the same Pi computation example first described in section 1.3,
the version using C with PVM library calls is:
<P>
<HR>
<PRE>
#include <stdlib.h>
#include <stdio.h>
#include <pvm3.h>

#define NPROC   4

main(int argc, char **argv)
{
  register double lsum, width;
  double sum;
  register int intervals, i;
  int mytid, iproc, msgtag = 4;
  int tids[NPROC];  /* array of task ids */

  /* enroll in pvm */
  mytid = pvm_mytid();

  /* Join a group and, if I am the first instance,
     iproc=0, spawn more copies of myself
  */
  iproc = pvm_joingroup("pi");

  if (iproc == 0) {
    tids[0] = pvm_mytid();
    pvm_spawn("pvm_pi", &argv[1], 0, NULL, NPROC-1, &tids[1]);
  }
  /* make sure all processes are here */
  pvm_barrier("pi", NPROC);

  /* get the number of intervals */
  intervals = atoi(argv[1]);
  width = 1.0 / intervals;

  lsum = 0.0;
  for (i = iproc; i<intervals; i+=NPROC) {
    register double x = (i + 0.5) * width;
    lsum += 4.0 / (1.0 + x * x);
  }

  /* sum across the local results & scale by width */
  sum = lsum * width;
  pvm_reduce(PvmSum, &sum, 1, PVM_DOUBLE, msgtag, "pi", 0);

  /* have only the console PE print the result */
  if (iproc == 0) {
    printf("Estimation of pi is %f\n", sum);
  }

  /* Check program finished, leave group, exit pvm */
  pvm_barrier("pi", NPROC);
  pvm_lvgroup("pi");
  pvm_exit();
  return(0);
}
</PRE>
<HR>
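<P>To illustrate the "less friendly" style referred to above, the following
fragment (my addition, not part of the original example) shows roughly what
the <CODE>pvm_reduce()</CODE> call is doing, rewritten using PVM's
lower-level pack, send, and receive calls.  It assumes the declarations from
the program above, and that <CODE>sum</CODE> already holds each task's
<CODE>lsum * width</CODE>:
<P>
<HR>
<PRE>
  /* replaces the pvm_reduce() call above */
  if (iproc != 0) {
    /* every other task ships its partial sum to group instance 0 */
    pvm_initsend(PvmDataDefault);
    pvm_pkdouble(&sum, 1, 1);
    pvm_send(pvm_gettid("pi", 0), msgtag);
  } else {
    double part;
    int n;

    for (n = 1; n < NPROC; ++n) {
      pvm_recv(-1, msgtag);          /* any sender, matching tag */
      pvm_upkdouble(&part, 1, 1);
      sum += part;
    }
  }
</PRE>
<HR>
<P>Packing, tagging, and unpacking every value by hand quickly becomes
tedious, which is why the collective <CODE>pvm_reduce()</CODE> call was used
in the example.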
<P>
<H2><A NAME="ss3.5">3.5 MPI (Message Passing Interface)</A>
</H2>

<P>
<P>Although PVM is the de-facto standard message-passing library, MPI
(Message Passing Interface) is the relatively new official standard.
The home page for the MPI standard is
<A HREF="http://www.mcs.anl.gov:80/mpi/">http://www.mcs.anl.gov:80/mpi/</A> and the newsgroup is
<A HREF="news:comp.parallel.mpi">comp.parallel.mpi</A>.
<P>However, before discussing MPI, I feel compelled to say a little bit
about the PVM vs. MPI religious war that has been going on for the
past few years.  I'm not really on either side.  Here's my attempt at
a relatively unbiased summary of the differences:
<P>
<DL>
<DT><B>Execution control environment.</B><DD><P>Put simply, PVM has one and
MPI doesn't specify how/if one is implemented.  Thus, things like
starting a PVM program are done identically everywhere, while
for MPI it depends on which implementation is being used.
<P>
<DT><B>Support for heterogeneous clusters.</B><DD><P>PVM grew up in the
workstation cycle-scavenging world, and thus directly manages
heterogeneous mixes of machines and operating systems.  In contrast,
MPI largely assumes that the target is an MPP (Massively Parallel
Processor) or a dedicated cluster of nearly identical workstations.
<P>
<DT><B>Kitchen sink syndrome.</B><DD><P>PVM evidences a unity of purpose that
MPI 2.0 doesn't.  The new MPI 2.0 standard includes a lot of features
that go way beyond the basic message passing model - things like RMA
(Remote Memory Access) and parallel file I/O.  Are these things
useful?  Of course they are... but learning MPI 2.0 is a lot like
learning a completely new programming language.
<P>
<DT><B>User interface design.</B><DD><P>MPI was designed after PVM, and
clearly learned from it.  MPI offers simpler, more efficient buffer
handling and higher-level abstractions allowing user-defined data
structures to be transmitted in messages.
<P>
<DT><B>The force of law.</B><DD><P>By my count, there are still
significantly more things designed to use PVM than there are to use
MPI; however, porting them to MPI is easy, and the fact that MPI is
backed by a widely-supported formal standard means that using MPI is,
for many institutions, a matter of policy.
</DL>
<P>Conclusion?  Well, there are at least three independently developed,
freely available versions of MPI that can run on clusters of Linux
systems (and I wrote one of them):
<P>
<UL>
<LI>LAM (Local Area Multicomputer) is a full implementation of the
MPI 1.1 standard.  It allows MPI programs to be executed within an
individual Linux system or across a cluster of Linux systems using
UDP/TCP socket communication.  The system includes simple execution
control facilities, as well as a variety of program development and
debugging aids.  It is freely available from
<A HREF="http://www.osc.edu/lam.html">http://www.osc.edu/lam.html</A>.
</LI>
<LI>MPICH (MPI CHameleon) is designed as a highly portable full
implementation of the MPI 1.1 standard.  Like LAM, it allows MPI
programs to be executed within an individual Linux system or across a
cluster of Linux systems using UDP/TCP socket communication.  However,
the emphasis is definitely on promoting MPI by providing an efficient,
easily retargetable implementation.  To port this MPI implementation,
one implements either the five functions of the "channel interface"
or, for better performance, the full MPICH ADI (Abstract Device
Interface).  MPICH, and lots of information about it and porting, are
available from
<A HREF="http://www.mcs.anl.gov/mpi/mpich/">http://www.mcs.anl.gov/mpi/mpich/</A>.
</LI>
<LI>AFMPI (Aggregate Function MPI) is a subset implementation of the
MPI 2.0 standard.  This is the one that I wrote.  Built on top of the
AFAPI, it is designed to showcase low-latency collective communication
functions and RMAs, and thus provides only minimal support for MPI
data types, communicators, etc.  It allows C programs using MPI to run
on an individual Linux system or across a cluster connected by
AFAPI-capable network hardware.  It is freely available from
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>.</LI>
</UL>
<P>No matter which of these (or other) MPI implementations one uses, it
is fairly simple to perform the most common types of communications.
<P>However, MPI 2.0 incorporates several communication paradigms that are
fundamentally different enough so that a programmer using one of them
might not even recognize the other coding styles as MPI.  Thus, rather
than giving a single example program, it is useful to have an example
of each of the fundamentally different communication paradigms that
MPI supports.  All three programs implement the same basic algorithm
(from section 1.3) that is used throughout this HOWTO to compute the
value of Pi.
<P>The first MPI program uses basic MPI message-passing calls for each
processor to send its partial sum to processor 0, which sums and
prints the result:
<P>
<HR>
<PRE>
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

main(int argc, char **argv)
{
  register double width;
  double sum, lsum, lbuf;
  register int intervals, i;
  int nproc, iproc;
  MPI_Status status;

  if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
  intervals = atoi(argv[1]);
  width = 1.0 / intervals;
  lsum = 0;
  for (i=iproc; i<intervals; i+=nproc) {
    register double x = (i + 0.5) * width;
    lsum += 4.0 / (1.0 + x * x);
  }
  lsum *= width;
  if (iproc != 0) {
    /* send this processor's partial sum to processor 0 */
    MPI_Send(&lsum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
  } else {
    /* processor 0 receives and adds the other partial sums */
    sum = lsum;
    for (i=1; i<nproc; ++i) {
      MPI_Recv(&lbuf, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
               MPI_ANY_TAG, MPI_COMM_WORLD, &status);
      sum += lbuf;
    }
    printf("Estimation of pi is %f\n", sum);
  }
  MPI_Finalize();
  return(0);
}
</PRE>
<HR>
<P>The second MPI version uses collective communication (which, for this
particular application, is clearly the most appropriate):
<P>
<HR>
<PRE>
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

main(int argc, char **argv)
{
  register double width;
  double sum, lsum;
  register int intervals, i;
  int nproc, iproc;

  if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
  intervals = atoi(argv[1]);
  width = 1.0 / intervals;
  lsum = 0;
  for (i=iproc; i<intervals; i+=nproc) {
    register double x = (i + 0.5) * width;
    lsum += 4.0 / (1.0 + x * x);
  }
  lsum *= width;
  MPI_Reduce(&lsum, &sum, 1, MPI_DOUBLE,
             MPI_SUM, 0, MPI_COMM_WORLD);
  if (iproc == 0) {
    printf("Estimation of pi is %f\n", sum);
  }
  MPI_Finalize();
  return(0);
}
</PRE>
<HR>
<P>The third MPI version uses the MPI 2.0 RMA mechanism for each processor
to add its local <CODE>lsum</CODE> into <CODE>sum</CODE> on processor 0:
<P>
<HR>
<PRE>
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

main(int argc, char **argv)
{
  register double width;
  double sum = 0, lsum;
  register int intervals, i;
  int nproc, iproc;
  MPI_Win sum_win;

  if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
  /* expose sum as a one-element RMA window on every processor */
  MPI_Win_create(&sum, sizeof(sum), sizeof(sum),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &sum_win);
  MPI_Win_fence(0, sum_win);
  intervals = atoi(argv[1]);
  width = 1.0 / intervals;
  lsum = 0;
  for (i=iproc; i<intervals; i+=nproc) {
    register double x = (i + 0.5) * width;
    lsum += 4.0 / (1.0 + x * x);
  }
  lsum *= width;
  /* add this processor's partial sum into the window on processor 0 */
  MPI_Accumulate(&lsum, 1, MPI_DOUBLE, 0, 0,
                 1, MPI_DOUBLE, MPI_SUM, sum_win);
  MPI_Win_fence(0, sum_win);
  if (iproc == 0) {
    printf("Estimation of pi is %f\n", sum);
  }
  MPI_Win_free(&sum_win);
  MPI_Finalize();
  return(0);
}
</PRE>
<HR>
<P>It is useful to note that the MPI 2.0 RMA mechanism very neatly
overcomes the potential problem of the corresponding data structure
residing at different memory locations on different processors.  This is
done by referencing a "window" that implies the base address,
protection against out-of-bound accesses, and even address scaling.
Efficient implementation is aided by the fact that RMA processing may
be delayed until the next <CODE>MPI_Win_fence</CODE>.  In
summary, the RMA mechanism may be a strange cross between distributed
shared memory and message passing, but it is a very clean interface
that potentially generates very efficient communication.
<P>
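<P>As a final hedged fragment (my addition, not part of the program above):
because the accumulated total lives in an RMA window, any processor - not
just processor 0 - could read it back with <CODE>MPI_Get()</CODE> after the
second <CODE>MPI_Win_fence()</CODE>, closing the extra access epoch with one
more fence:
<P>
<HR>
<PRE>
  double total;

  /* read the final sum out of processor 0's window */
  MPI_Get(&total, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE, sum_win);
  MPI_Win_fence(0, sum_win);
  printf("processor %d sees pi as %f\n", iproc, total);
</PRE>
<HR>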
<H2><A NAME="ss3.6">3.6 AFAPI (Aggregate Function API)</A>
</H2>

<P>
<P>Unlike PVM, MPI, etc., the AFAPI (Aggregate Function Application
Program Interface) did not start life as an attempt to build a
portable abstract interface layered on top of existing network
hardware and software.  Rather, AFAPI began as the very
hardware-specific low-level support library for PAPERS (Purdue's
Adapter for Parallel Execution and Rapid Synchronization; see
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>).
<P>PAPERS was discussed briefly in section 3.2; it is a custom aggregate
function network with a public domain design that delivers latencies as
low as a few microseconds.  However, the important thing about PAPERS
is that it was developed as an attempt to build a supercomputer that
would be a better target for compiler technology than existing
supercomputers.  This is qualitatively different from most Linux
cluster efforts and PVM/MPI, which generally focus on trying to use
standard networks for the relatively few sufficiently coarse-grain
parallel applications.  The fact that Linux PCs are used as components
of PAPERS systems is simply an artifact of implementing prototypes in
the most cost-effective way possible.
<P>The need for a common low-level software interface across more than a
dozen different prototype implementations was what led the PAPERS
library to be standardized as AFAPI.  However, the model used by
AFAPI is inherently simpler and better suited for the finer-grain
interactions typical of code compiled by parallelizing compilers or
written for SIMD architectures.  The simplicity of the model not only
makes PAPERS hardware easy to build, but also yields surprisingly
efficient AFAPI ports for a variety of other hardware systems, such as
SMPs.
<P>AFAPI currently runs on Linux clusters connected using TTL_PAPERS,
CAPERS, or WAPERS.  It also runs (without OS calls or even bus-lock
instructions, see section 2.2) on SMP systems using a System V Shared
Memory library called SHMAPERS.  A version that runs across Linux
clusters using UDP broadcasts on conventional networks (e.g.,
Ethernet) is under development.  All released versions are available
from
<A HREF="http://garage.ecn.purdue.edu/~papers/">http://garage.ecn.purdue.edu/~papers/</A>.  All versions
of the AFAPI are designed to be called from C or C++.
<P>The following example program is the AFAPI version of the Pi
computation described in section 1.3.
<P>
<HR>
<PRE>
#include <stdlib.h>
#include <stdio.h>
#include "afapi.h"

main(int argc, char **argv)
{
  register double width, sum;
  register int intervals, i;

  if (p_init()) exit(1);

  intervals = atoi(argv[1]);
  width = 1.0 / intervals;

  sum = 0;
  for (i=IPROC; i<intervals; i+=NPROC) {
    register double x = (i + 0.5) * width;
    sum += 4.0 / (1.0 + x * x);
  }

  sum = p_reduceAdd64f(sum) * width;

  if (IPROC == CPROC) {
    printf("Estimation of pi is %f\n", sum);
  }

  p_exit();
  return(0);
}
</PRE>
<HR>
<P>
<H2><A NAME="ss3.7">3.7 Other Cluster Support Libraries</A>
</H2>

<P>
<P>In addition to PVM, MPI, and AFAPI, the following libraries offer
features that may be useful in parallel computing using a cluster of
Linux systems.  These systems are given a lighter treatment in this
document simply because, unlike PVM, MPI, and AFAPI, I have little or
no direct experience with the use of these systems on Linux clusters.
If you find any of these or other libraries to be especially useful,
please send email to me at
<A HREF="mailto:hankd@engr.uky.edu">hankd@engr.uky.edu</A> describing what you've found, and I will
consider adding an expanded section on that library.
<P>
<H3>Condor (process migration support)</H3>

<P>
<P>Condor is a distributed resource management system that can manage
large heterogeneous clusters of workstations.  Its design has been
motivated by the needs of users who would like to use the unutilized
capacity of such clusters for their long-running,
computation-intensive jobs.  Condor preserves a large measure of the
originating machine's environment on the execution machine, even if
the originating and execution machines do not share a common file
system and/or password mechanisms.  Condor jobs that consist of a
single process are automatically checkpointed and migrated between
workstations as needed to ensure eventual completion.
<P>Condor is available at
<A HREF="http://www.cs.wisc.edu/condor/">http://www.cs.wisc.edu/condor/</A>.  A
Linux port exists; more information is available at
<A HREF="http://www.cs.wisc.edu/condor/linux/linux.html">http://www.cs.wisc.edu/condor/linux/linux.html</A>.  Contact
<A HREF="mailto:condor-admin@cs.wisc.edu">condor-admin@cs.wisc.edu</A> for details.
<P>
<H3>DFN-RPC (German Research Network - Remote Procedure Call)</H3>

<P>
<P>The DFN-RPC (German Research Network Remote Procedure Call) tool
was developed to distribute and parallelize scientific-technical
application programs between a workstation and a compute server or a
cluster.  The interface is optimized for applications written in
Fortran, but the DFN-RPC can also be used in a C environment.  It has
been ported to Linux.  More information is at
<A HREF="ftp://ftp.uni-stuttgart.de/pub/rus/dfn_rpc/README_dfnrpc.html">ftp://ftp.uni-stuttgart.de/pub/rus/dfn_rpc/README_dfnrpc.html</A>.
<P>
<H3>DQS (Distributed Queueing System)</H3>

<P>
<P>Not exactly a library, DQS 3.0 (Distributed Queueing System) is a job
queueing system that has been developed and tested under Linux.  It is
designed to allow both use and administration of a heterogeneous
cluster as a single entity.  It is available from
<A HREF="http://www.scri.fsu.edu/~pasko/dqs.html">http://www.scri.fsu.edu/~pasko/dqs.html</A>.
<P>There is also a commercial version called CODINE 4.1.1 (COmputing in
DIstributed Network Environments).  Information on it is available
from
<A HREF="http://www.genias.de/genias_welcome.html">http://www.genias.de/genias_welcome.html</A>.
<P>
<H2><A NAME="ss3.8">3.8 General Cluster References</A>
</H2>

<P>
<P>Because clusters can be constructed and used in so many different ways,
there are quite a few groups that have made interesting contributions.
The following are references to various cluster-related projects that
may be of general interest.  This includes a mix of Linux-specific and
generic cluster references.  The list is given in alphabetical order.
<P>
<H3>Beowulf</H3>

<P>
<P>The Beowulf project,
<A HREF="http://cesdis1.gsfc.nasa.gov/beowulf/">http://cesdis1.gsfc.nasa.gov/beowulf/</A>, centers on production of
software for using off-the-shelf clustered workstations based on
commodity PC-class hardware, a high-bandwidth cluster-internal
network, and the Linux operating system.
<P>Thomas Sterling has been the driving force behind Beowulf, and
continues to be an eloquent and outspoken proponent of Linux
clustering for scientific computing in general.  In fact, many groups
now refer to their clusters as "Beowulf class" systems - even if the
cluster isn't really all that similar to the official Beowulf design.
<P>Don Becker, working in support of the Beowulf project, has produced
many of the network drivers used by Linux in general.  Many of these
drivers have even been adapted for use in BSD.  Don is also
responsible for making many of these Linux network drivers support
load-sharing across multiple parallel connections to achieve higher
bandwidth without expensive switched hubs.  This type of load sharing
was the original distinguishing feature of the Beowulf cluster.
<P>
<H3>Linux/AP+</H3>

<P>
<P>The Linux/AP+ project,
<A HREF="http://cap.anu.edu.au/cap/projects/linux/">http://cap.anu.edu.au/cap/projects/linux/</A>, is not exactly about
Linux clustering, but centers on porting Linux to the Fujitsu AP1000+
and adding appropriate parallel processing enhancements.  The AP1000+
is a commercially available SPARC-based parallel machine that uses a
custom network with a torus topology, 25 MB/s bandwidth, and 10
microsecond latency... in short, it looks a lot like a SPARC Linux
cluster.
<P>
<H3>Locust</H3>

<P>
<P>The Locust project,
<A HREF="http://www.ecsl.cs.sunysb.edu/~manish/locust/">http://www.ecsl.cs.sunysb.edu/~manish/locust/</A>, is building a
distributed virtual shared memory system that uses compile-time
information to hide message latency and to reduce network traffic at
run time.  Pupa is the underlying communication subsystem of Locust,
and is implemented using Ethernet to connect 486 PCs under FreeBSD.
Linux?
<P>
<H3>Midway DSM (Distributed Shared Memory)</H3>

<P>
<P>Midway,
<A HREF="http://www.cs.cmu.edu/afs/cs.cmu.edu/project/midway/WWW/HomePage.html">http://www.cs.cmu.edu/afs/cs.cmu.edu/project/midway/WWW/HomePage.html</A>,
is a software-based DSM (Distributed Shared Memory) system, not unlike
TreadMarks.  The good news is that it uses compile-time aids rather
than relatively slow page-fault mechanisms, and it is free.  The bad
news is that it doesn't run on Linux clusters.
<P>
<H3>Mosix</H3>

<P>
<P>MOSIX modifies the BSDI BSD/OS to provide dynamic load balancing and
preemptive process migration across a networked group of PCs.  This is
nice stuff not just for parallel processing, but for generally using a
cluster much like a scalable SMP.  Will there be a Linux version?  Look
at
<A HREF="http://www.cs.huji.ac.il/mosix/">http://www.cs.huji.ac.il/mosix/</A> for more information.
<P>
<H3>NOW (Network Of Workstations)</H3>

<P>
<P>The Berkeley NOW (Network Of Workstations) project,
<A HREF="http://now.cs.berkeley.edu/">http://now.cs.berkeley.edu/</A>, has led much of the push toward
parallel computing using networks of workstations.  There is a lot of
work going on here, all aimed toward "demonstrating a practical 100
processor system in the next few years."  Alas, they don't use Linux.
<P>
<H3>Parallel Processing Using Linux</H3>

<P>
<P>The parallel processing using Linux WWW site,
<A HREF="http://aggregate.org/LDP/">http://aggregate.org/LDP/</A>, is the home of this HOWTO
and many related documents including online slides for a full-day
tutorial.  Aside from the work on the PAPERS project, the Purdue
University School of Electrical and Computer Engineering generally has
been a leader in parallel processing; this site was established to
help others apply Linux PCs for parallel processing.
<P>Since Purdue's first cluster of Linux PCs was assembled in February
1994, there have been many Linux PC clusters assembled at Purdue,
including several with video walls.  Although these clusters used 386,
486, and Pentium systems (no Pentium Pro systems), Intel recently
awarded Purdue a donation which will allow it to construct multiple
large clusters of Pentium II systems (with as many as 165 machines
planned for a single cluster).  Although these clusters all have/will
have PAPERS networks, most also have conventional networks.
<P>
<H3>Pentium Pro Cluster Workshop</H3>

<P>
<P>In Des Moines, Iowa, April 10-11, 1997, Ames Laboratory held the
Pentium Pro Cluster Workshop.  The WWW site from this workshop,
<A HREF="http://www.scl.ameslab.gov/workshops/PPCworkshop.html">http://www.scl.ameslab.gov/workshops/PPCworkshop.html</A>, contains
a wealth of PC cluster information gathered from all the attendees.
<P>
<H3>TreadMarks DSM (Distributed Shared Memory)</H3>

<P>
<P>DSM (Distributed Shared Memory) is a technique whereby a
message-passing system can appear to behave as an SMP.  There are
quite a few such systems, most of which use the OS page-fault mechanism
to trigger message transmissions.  TreadMarks,
<A HREF="http://www.cs.rice.edu/~willy/TreadMarks/overview.html">http://www.cs.rice.edu/~willy/TreadMarks/overview.html</A>, is one
of the more efficient of such systems and does run on Linux clusters.
The bad news is "TreadMarks is being distributed at a small cost to
universities and nonprofit institutions."  For more information about
the software, contact
<A HREF="mailto:treadmarks@ece.rice.edu">treadmarks@ece.rice.edu</A>.
<P>
<H3>U-Net (User-level NETwork interface architecture)</H3>

<P>
<P>The U-Net (User-level NETwork interface architecture) project at
Cornell,
<A HREF="http://www2.cs.cornell.edu/U-Net/Default.html">http://www2.cs.cornell.edu/U-Net/Default.html</A>,
attempts to provide low latency and high bandwidth over commodity
network hardware by virtualizing the network interface so that
applications can send and receive messages without operating system
intervention.  U-Net runs on Linux PCs using a DECchip DC21140 based
Fast Ethernet card or a Fore Systems PCA-200 (not PCA-200E) ATM card.
<P>
<H3>WWT (Wisconsin Wind Tunnel)</H3>

<P>
<P>There is really quite a lot of cluster-related work at Wisconsin.  The
WWT (Wisconsin Wind Tunnel) project,
<A HREF="http://www.cs.wisc.edu/~wwt/">http://www.cs.wisc.edu/~wwt/</A>, is doing all sorts of work toward
developing a "standard" interface between compilers and the underlying
parallel hardware.  There is the Wisconsin COW (Cluster Of
Workstations), Cooperative Shared Memory and Tempest, the Paradyn
Parallel Performance Tools, etc.  Unfortunately, there is not much
about Linux.
<P>
<HR>
<A HREF="Parallel-Processing-HOWTO-4.html">Next</A>
<A HREF="Parallel-Processing-HOWTO-2.html">Previous</A>
<A HREF="Parallel-Processing-HOWTO.html#toc3">Contents</A>
</BODY>
</HTML>