<!doctype linuxdoc system>

<article>

<title>Beowulf HOWTO

<author>Jacek Radajewski and Douglas Eadline

<date>v1.1.1, 22 November 1998

<abstract>
This document introduces the Beowulf Supercomputer architecture and
provides background information on parallel programming, including
links to other, more specific documents and web pages.
</abstract>

<toc>

<sect>Preamble

<sect1>Disclaimer
<p>
We will not accept any responsibility for any incorrect information
within this document, nor for any damage it might cause when applied.

<sect1>Copyright
<p>
Copyright © 1997 - 1998 Jacek Radajewski and Douglas Eadline.
Permission to distribute and modify this document is granted under the
GNU General Public License.

<sect1>About this HOWTO
<p>
Jacek Radajewski started work on this document in November 1997 and
was soon joined by Douglas Eadline. Over a few months the Beowulf
HOWTO grew into a large document, and in August 1998 it was split into
three documents: Beowulf HOWTO, Beowulf Architecture Design HOWTO, and
the Beowulf Installation and Administration HOWTO. Version 1.0.0 of
the Beowulf HOWTO was released to the Linux Documentation Project on
11 November 1998. We hope that this is only the beginning of what
will become a complete Beowulf Documentation Project.

<sect1>About the authors
<p>
<itemize>

<item>Jacek Radajewski works as a Network Manager, and is studying for
an honors degree in computer science at the University of Southern
Queensland, Australia. Jacek's first contact with Linux was in 1995
and it was love at first sight. Jacek built his first Beowulf cluster
in May 1997 and has been playing with the technology ever since,
always trying to find new and better ways of setting things up. You
can contact Jacek by sending e-mail to <htmlurl
name="jacek@usq.edu.au" url="mailto:jacek@usq.edu.au">

<item>Douglas Eadline, Ph.D. is President and Principal Scientist at
Paralogic, Inc., Bethlehem, PA, USA. Trained as a Physical/Analytical
Chemist, he has been involved with computers since 1978 when he built
his first single board computer for use with chemical instrumentation.
Dr. Eadline's interests now include Linux, Beowulf clusters, and
parallel algorithms. Dr. Eadline can be contacted by sending
email to <htmlurl name="deadline@plogic.com"
url="mailto:deadline@plogic.com">

</itemize>

<sect1>Acknowledgements
<p>
The writing of the Beowulf HOWTO was a long process and is finally
complete, thanks to many individuals. I would like to thank the
following people for their help and contribution to this HOWTO.
<itemize>

<item> Becky for her love, support, and understanding.

<item> Tom Sterling, Don Becker, and other people at NASA who started
the Beowulf project.

<item> Thanh Tran-Cong and the Faculty of Engineering and Surveying
for making the <it>topcat</it> Beowulf machine available for experiments.

<item> My supervisor Christopher Vance for many great ideas.

<item> My friend Russell Waldron for great programming ideas, his
general interest in the project, and support.

<item> My friend David Smith for proofreading this document.

<item> Many other people on the Beowulf mailing list who provided me
with feedback and ideas.

<item> All the people who are responsible for the Linux operating
system and all the other free software packages used on
<it>topcat</it> and other Beowulf machines.

</itemize>

<sect>Introduction
<p>

As the performance of commodity computer and network hardware
increases, and prices decrease, it becomes more and more
practical to build parallel computational systems from off-the-shelf
components, rather than buying CPU time on very expensive
Supercomputers. In fact, the price per performance ratio of a Beowulf
type machine is three to ten times better than that of
traditional supercomputers. The Beowulf architecture scales well, is
easy to construct, and you only pay for the hardware, as most of the
software is free.

<sect1>Who should read this HOWTO?
<p>
This HOWTO is designed for a person with at least some exposure to the
Linux operating system. Knowledge of Beowulf technology or
understanding of more complex operating system and networking
concepts is not essential, but some exposure to parallel computing
would be advantageous (after all, you must have some reason to read
this document). This HOWTO will not answer all possible questions you
might have about Beowulf, but hopefully will give you ideas and guide
you in the right direction. The purpose of this HOWTO is to provide
background information, links and references to more advanced
documents.

<sect1>What is a Beowulf?
<p>
<it>Famed was this Beowulf: far flew the boast of him, son of Scyld,
in the Scandian lands. So becomes it a youth to quit him well with
his father's friends, by fee and gift, that to aid him, aged, in after
days, come warriors willing, should war draw nigh, liegemen loyal: by
lauded deeds shall an earl have honor in every clan.</it> Beowulf is
the earliest surviving epic poem written in English. It is a story
about a hero of great strength and courage who defeated a monster
called Grendel. See <ref id="history" name="History"> to find out
more about the Beowulf hero.
<p>
There are probably as many Beowulf definitions as there are people who
build or use Beowulf Supercomputer facilities. Some claim that one
can call their system Beowulf only if it is built in the same way as
NASA's original machine. Others go to the other extreme and call
Beowulf any system of workstations running parallel code. My
definition of Beowulf fits somewhere between the two views described
above, and is based on many postings to the Beowulf mailing list:

<p>
Beowulf is a multi-computer architecture which can be used for
parallel computations. It is a system which usually consists of one
server node, and one or more client nodes connected together via
Ethernet or some other network. It is a system built using commodity
hardware components, like any PC capable of running Linux, standard
Ethernet adapters, and switches. It does not contain any custom
hardware components and is trivially reproducible. Beowulf also uses
commodity software like the Linux operating system, Parallel Virtual
Machine (PVM) and Message Passing Interface (MPI). The server node
controls the whole cluster and serves files to the client nodes. It
is also the cluster's console and gateway to the outside world. Large
Beowulf machines might have more than one server node, and possibly
other nodes dedicated to particular tasks, for example consoles or
monitoring stations. In most cases client nodes in a Beowulf system
are dumb, the dumber the better. Nodes are configured and controlled
by the server node, and do only what they are told to do. In a
disk-less client configuration, a client node doesn't even know its IP
address or name until the server tells it. One of the
main differences between Beowulf and a Cluster of Workstations (COW)
is the fact that Beowulf behaves more like a single machine rather
than many workstations. In most cases client nodes do not have
keyboards or monitors, and are accessed only via remote login or
possibly a serial terminal. Beowulf nodes can be thought of as a CPU +
memory package which can be plugged into the cluster, just like a CPU
or memory module can be plugged into a motherboard.

<p> Beowulf is not a special software package, new network topology, or
the latest kernel hack. Beowulf is a technology of clustering Linux
computers to form a parallel, virtual supercomputer. Although there
are many software packages such as kernel modifications, PVM and MPI
libraries, and configuration tools which make the Beowulf architecture
faster, easier to configure, and much more usable, one can build a
Beowulf class machine using a standard Linux distribution without any
additional software. If you have two networked Linux computers which
share at least the <tt>/home</tt> file system via NFS, and trust each
other to execute remote shells (rsh), then it could be argued that you
have a simple, two node Beowulf machine.
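
<p>
As a concrete (if minimal) illustration of this two node case, the
server can spawn work on the second node through the trusted remote
shell and read the result back through a pipe. The sketch below is
only an illustration: the hostname <tt>node2</tt> is a placeholder,
and it assumes rsh is configured as described above.
<verb>
/* two_node.c - a minimal sketch of using a trusted remote shell   */
/* from C. Assumes a second node named "node2" (placeholder) that  */
/* allows rsh without a password. Compile with: gcc two_node.c     */

#include <stdio.h>

int main (void) {

  char line[256];
  /* run `uname -n` on the other node and read its output locally */
  FILE *p = popen("rsh node2 uname -n", "r");

  if (p == NULL) return 1;
  while (fgets(line, sizeof(line), p) != NULL)
    fputs(line, stdout);
  pclose(p);
  return 0;
}
</verb>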

<sect1>Classification
<p>
Beowulf systems have been constructed from a variety of parts. For the sake
of performance some non-commodity components (i.e. produced by a single
manufacturer) have been employed. In order to account for the different
types of systems and to make discussions about machines a bit easier, we
propose the following simple classification scheme:
<p>
CLASS I BEOWULF:
<p>
This class of machine is built entirely from commodity "off-the-shelf" parts.
We shall use the "Computer Shopper" certification test to define commodity
"off-the-shelf" parts. (Computer Shopper is a 1 inch thick monthly
magazine/catalog of PC systems and components.) The test is as follows:

A CLASS I Beowulf is a machine that can be assembled
from parts found in at least 3 nationally/globally circulated
advertising catalogs.

The advantages of a CLASS I system are:
<itemize>
<item> hardware is available from multiple sources (low prices, easy maintenance)
<item> no reliance on a single hardware vendor
<item> driver support from the Linux community
<item> usually based on standards (SCSI, Ethernet, etc.)
</itemize>

The disadvantages of a CLASS I system are:
<itemize>
<item> best performance may require CLASS II hardware
</itemize>
<p>
CLASS II BEOWULF
<p>
A CLASS II Beowulf is simply any machine that does not pass the Computer
Shopper certification test. This is not a bad thing. Indeed, it is merely a
classification of the machine.

The advantages of a CLASS II system are:
<itemize>
<item> performance can be quite good!
</itemize>

The disadvantages of a CLASS II system are:
<itemize>
<item> driver support may vary
<item> reliance on a single hardware vendor
<item> may be more expensive than CLASS I systems
</itemize>

One CLASS is not necessarily better than the other. It all depends on your
needs and budget. This classification system is only intended to make
discussions about Beowulf systems a bit more succinct. The "System Design"
section may help determine what kind of system is best suited for your needs.

<sect>Architecture Overview
<p>

<sect1>What does it look like?
<p>
I think that the best way of describing the Beowulf supercomputer
architecture is to use an example which is very similar to the actual
Beowulf, but familiar to most system administrators. The example that
is closest to a Beowulf machine is a Unix computer laboratory with a
server and a number of clients. To be more specific I'll use the DEC
Alpha undergraduate computer laboratory at the Faculty of Sciences,
USQ as the example. The server computer is called <it>beldin</it> and
the client machines are called <it>scilab01</it>, <it>scilab02</it>,
<it>scilab03</it>, up to <it>scilab20</it>. All clients have
a local copy of the Digital Unix 4.0 operating system installed, but
get the user file space (<tt>/home</tt>) and <tt>/usr/local</tt> from
the server via NFS (Network File System). Each client has an entry
for the server and all the other clients in its
<tt>/etc/hosts.equiv</tt> file, so all clients can execute a remote
shell (rsh) to all others. The server machine is a NIS server for the
whole laboratory, so account information is the same across all the
machines. A person can sit at the <it>scilab02</it> console,
log in, and have the same environment as if he had logged onto the server
or <it>scilab15</it>. The reason all the clients have the same look
and feel is that the operating system is installed and configured
in the same way on all machines, and both the user's <tt>/home</tt>
and <tt>/usr/local</tt> areas are physically on the server and
accessed by the clients via NFS. For more information on NIS and NFS
please read the <htmlurl name="NIS"
url="http://sunsite.unc.edu/LDP/HOWTO/NIS-HOWTO.html"> and <htmlurl
name="NFS" url="http://sunsite.unc.edu/LDP/HOWTO/NFS-HOWTO.html">
HOWTOs.
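
<p>
To make the rsh trust relationship concrete, the
<tt>/etc/hosts.equiv</tt> file on each machine in the laboratory
described above would simply list every host, one per line.  This is
only a sketch; the exact contents depend on your site:
<verb>
beldin
scilab01
scilab02
...
scilab20
</verb>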

<sect1>How to utilise the other nodes?
<p>

Now that we have some idea about the system architecture, let us take
a look at how we can utilise the available CPU cycles of the machines
in the computer laboratory. Any person can log on to any of the
machines, and run a program in their home directory, but they can also
spawn the same job on a different machine simply by executing a remote
shell. For example, assume that we want to calculate the sum of the
square roots of all integers between 1 and 10 inclusive. We write a
simple program called <tt>sigmasqrt</tt> (please see <ref
id="sigmasqrt" name="source code">) which does exactly that. To
calculate the sum of the square roots of numbers from 1 to 10 we
execute:
<verb>
[jacek@beldin sigmasqrt]$ time ./sigmasqrt 1 10
22.468278

real    0m0.029s
user    0m0.001s
sys     0m0.024s
</verb>
The <tt>time</tt> command allows us to check the wall-clock time (the
elapsed time) of running this job. As we can see, this example took
only a small fraction of a second (0.029 sec) to execute, but what if
I want to add the square roots of all integers from 1 to 1 000 000 000?
Let us try this, and again calculate the wall-clock time.

<verb>
[jacek@beldin sigmasqrt]$ time ./sigmasqrt 1 1000000000
21081851083600.559000

real    16m45.937s
user    16m43.527s
sys     0m0.108s
</verb>

This time, the execution time of the program is considerably longer.
The obvious question to ask is what can we do to speed up the
execution time of the job? How can we change the way the job is
running to minimize the wall-clock time of running this job? The
obvious answer is to split the job into a number of sub-jobs and to
run these sub-jobs in parallel on all computers. We could split one
big addition task into 20 parts, calculating one range of square roots
and adding them on each node. When all nodes finish the
calculation and return their results, the 20 numbers could be added
together to obtain the final solution. Before we run this job we will
make a named pipe which will be used by all processes to write their
results.

<verb>
[jacek@beldin sigmasqrt]$ mkfifo output
[jacek@beldin sigmasqrt]$ ./prun.sh & time cat output | ./sum
[1] 5085
21081851083600.941000
[1]+  Done                    ./prun.sh

real    0m58.539s
user    0m0.061s
sys     0m0.206s
</verb>

This time we get about 58.5 seconds. This is the time from starting
the job until all the nodes have finished their computations and
written their results into the pipe. The time does not include the
final addition of the twenty numbers, but this time is a very small
fraction of a second and can be ignored. We can see that there is a
significant improvement in running this job in parallel. In fact the
parallel job ran about 17 times faster, which is very reasonable for a
20 fold increase in the number of CPUs. The purpose of the above
example is to illustrate the simplest method of parallelising
concurrent code. In practice such simple examples are rare and
different techniques (the PVM and MPI APIs) are used to achieve the
parallelism.

<sect1>How does Beowulf differ from a COW?
<p>
The computer laboratory described above is a perfect example of a
Cluster of Workstations (COW). So what is so special about Beowulf,
and how is it different from a COW? The truth is that there is not
much difference, but Beowulf does have a few unique characteristics.
First of all, in most cases client nodes in a Beowulf cluster do not
have keyboards, mice, video cards or monitors. All access to the
client nodes is done via remote connections from the server node,
dedicated console node, or a serial console. Because there is no need
for client nodes to access machines outside the cluster, nor for
machines outside the cluster to access client nodes directly, it
is a common practice for the client nodes to use private IP addresses
like the 10.0.0.0/8 or 192.168.0.0/16 address ranges (RFC 1918 <htmlurl
name="http://www.alternic.net/rfcs/1900/rfc1918.txt.html"
url="http://www.alternic.net/rfcs/1900/rfc1918.txt.html">). Usually
the only machine that is also connected to the outside world using a
second network card is the server node. The most common way of using
the system is to access the server's console directly, or to
telnet or remote login to the server node from a personal workstation.
Once on the server node, users can edit and compile their code, and
also spawn jobs on all nodes in the cluster. In most cases COWs are
used for parallel computations at night, and over weekends when people
do not actually use the workstations for every day work, thus
utilising idle CPU cycles. Beowulf on the other hand is a machine
usually dedicated to parallel computing, and optimised for this
purpose. Beowulf also gives a better price/performance ratio as it is
built from off-the-shelf components and runs mainly free software.
Beowulf also has more single system image features which help the
users to see the Beowulf cluster as a single computing workstation.

<sect>System Design
<p>
Before you purchase any hardware, it may be a good idea to consider
the design of your system. There are basically two hardware issues
involved with design of a Beowulf system: the type of nodes or
computers you are going to use, and the way you connect the computer
nodes. There is one software issue that may affect your hardware
decisions: the communication library or API. A more detailed
discussion of hardware and communication software is provided later in
this document.
<p>
While the number of choices is not large, there are some important
design decisions that must be made when constructing a Beowulf
system. Because the science (or art) of "parallel computing" has
many different interpretations, an introduction is provided below. If
you do not want to read background material, you may skip this
section, but it is advised that you read section <ref id="suitability"
name="Suitability"> before you make your final hardware decisions.

<sect1>A brief background on parallel computing.
<p>
This section provides background on parallel computing concepts. It
is NOT an exhaustive or complete description of parallel computing
science and technology. It is a brief description of the issues that
may be important to a Beowulf designer and user.
<p>
As you design and build your Beowulf, many of the issues described
below will become important in your decision process. Due to its
component nature, a Beowulf Supercomputer requires that we consider
many factors carefully because they are now under our control. In
general, it is not all that difficult to understand the issues
involved with parallel computing. Indeed, once the issues are
understood, your expectations will be more realistic and success will
be more likely. Unlike the "sequential world" where processor speed is
considered the single most important factor, processor speed in the
"parallel world" is just one of several factors that will determine
overall system performance and efficiency.
<p>

<sect1>The methods of parallel computing
<p>
Parallel computing can take many forms. From a user's perspective, it is
important to consider the advantages and disadvantages of each methodology.
The following section attempts to provide some perspective on the methods of
parallel computing and indicate where the Beowulf machine falls on this
continuum.
<p>
<sect2>Why more than one CPU?
<p>
Answering this question is important. Using 8 CPUs to run your word
processor sounds a little like "over-kill" -- and it is. What about a
web server, a database, a rendering program, or a project scheduler?
Maybe extra CPUs would help. What about a complex simulation, a fluid
dynamics code, or a data mining application? Extra CPUs definitely
help in these situations. Indeed, multiple CPUs are being used to
solve more and more problems.
<p>
The next question usually is: "Why do I need two or four CPUs? I will
just wait for the 986 turbo-hyper chip." There are several reasons:
<enum>
<item> Due to the use of multi-tasking Operating Systems, it is
possible to do several things at once. This is a natural
"parallelism" that is easily exploited by more than one low cost CPU.

<item> Processor speeds have been doubling every 18 months, but what
about RAM speeds or hard disk speeds? Unfortunately, these speeds are
not increasing as fast as the CPU speeds. Keep in mind most
applications require "out of cache memory access" and hard disk
access. Doing things in parallel is one way to get around some of
these limitations.

<item> Predictions indicate that processor speeds will not continue to double
every 18 months after the year 2005. There are some very serious obstacles to
overcome in order to maintain this trend.

<item> Depending on the application, parallel computing can speed things up by
anywhere from 2 to 500 times (in some cases even more). Such performance
is not available using a single processor. Even supercomputers that at one
time used very fast custom processors are now built from multiple
"commodity-off-the-shelf" CPUs.
</enum>

If you need speed - either due to a compute bound problem and/or an I/O bound
problem, parallel computing is worth considering. Because parallel computing
is implemented in a variety of ways, solving your problem in parallel will
require some very important decisions to be made. These decisions may
dramatically affect portability, performance, and cost of your application.
<p>
Before we get technical, let's take a look at a real "parallel computing
problem" using an example with which we are familiar - waiting in long lines
at a store.

<sect2>The Parallel Computing Store
<p>
Consider a big store with 8 cash registers grouped together in the front of
the store. Assume each cash register/cashier is a CPU and each customer is a
computer program. The size of the computer program (amount of work) is the
size of each customer's order. The following analogies can be used to
illustrate parallel computing concepts.
<p>
<sect3>Single-tasking Operating System
<p>
One cash register is open (in use) and must process each customer one at a
time.
<p>
Computer Example: MS DOS
<p>
<sect3>Multi-tasking Operating System:
<p>
One cash register is open, but now we process only a part of each order at a
time, move to the next person and process some of their order. Everyone
"seems" to be moving through the line together, but if no one else is in the
line, you will get through the line faster.
<p>
Computer Example: UNIX, NT using a single CPU
<p>
<sect3>Multitasking Operating Systems with Multiple CPUs:
<p>
Now we open several cash registers in the store. Each order can be processed
by a separate cash register and the line can move much faster. This is called
SMP - Symmetric Multi-processing. Although there are extra cash registers
open, you will still never get through the line any faster than just you and a
single cash register.
<p>
Computer Example: UNIX and NT with multiple CPUs
<p>
<sect3>Threads on Multitasking Operating Systems with extra CPUs
<p>
If you "break-up" the items in your order, you might be able to move through
the line faster by using several cash registers at one time. First, we must
assume you have a large amount of goods, because the time you invest "breaking
up your order" must be regained by using multiple cash registers. In theory,
you should be able to move through the line "n" times faster than before*,
where "n" is the number of cash registers. When the cashiers need to get
sub-totals, they can exchange information quickly by looking at and talking to
all the other "local" cash registers. They can even snoop around the other
cash registers to find information they need to work faster. There is a
limit, however, as to how many cash registers the store can effectively locate
in any one place.

* Amdahl's law will also limit the application speed-up to the slowest
sequential portion of the program.
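
For reference, Amdahl's law is usually written as shown below, where
s is the fraction of the program that must run sequentially and n is
the number of processors (cash registers):
<verb>
                        1
    speedup  =  -----------------
                 s  +  (1 - s)/n
</verb>
Even with an infinite number of cash registers, the speed-up can never
exceed 1/s.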

Computer Example: UNIX or NT with extra CPUs on the same motherboard running
multi-threaded programs.

<sect3>Sending Messages on Multitasking Operating Systems with extra CPUs:
<p>
In order to improve performance, the store adds 8 cash registers at the back
of the store. Because the new cash registers are far away from the front cash
registers, the cashiers must call on the phone to send their sub-totals to the
front of the store. This distance adds extra overhead (time) to communication
between cashiers, but if communication is minimized, it is not a problem. If
you have a really big order, one that requires all the cash registers, then as
before your speed can be improved by using all cash registers at the same
time, but the extra overhead must be considered. In some cases, the store may
have single cash registers (or islands of cash registers) located all over the
store - each cash register (or island) must communicate by phone. Since all
the cashiers working the cash registers can talk to each other by phone, it
does not matter too much where they are.

Computer Example: One or several copies of UNIX or NT with extra CPUs on the
same or different motherboards communicating through messages.

The above scenarios, although not exact, are a good representation of
constraints placed on parallel systems. Unlike with a single CPU (or cash
register), communication is an issue.

<sect1>Architectures for parallel computing
<p>
The common methods and architectures of parallel computing are presented
below. While this description is by no means exhaustive, it is enough to
understand the basic issues involved with Beowulf design.

<sect2>Hardware Architectures
<p>

There are basically two ways parallel computer hardware is put together:

<enum>
<item> Local memory machines that communicate by messages (Beowulf Clusters)
<item> Shared memory machines that communicate through memory (SMP machines)
</enum>

A typical Beowulf is a collection of single CPU machines connected using fast
Ethernet and is, therefore, a local memory machine. A 4 way SMP box is a
shared memory machine and can be used for parallel computing - parallel
applications communicate using shared memory. Just as in the computer store
analogy, local memory machines (individual cash registers) can be scaled up to
large numbers of CPUs, while the number of CPUs a shared memory machine can
have (the number of cash registers you can place in one spot) is limited by
memory contention.

It is possible, however, to connect many shared memory machines to create a
"hybrid" shared memory machine. These hybrid machines "look" like a single
large SMP machine to the user and are often called NUMA (non uniform memory
access) machines because the global memory seen by the programmer and shared
by all the CPUs can have different latencies. At some level, however, a NUMA
machine must "pass messages" between local shared memory pools.

It is also possible to connect SMP machines as local memory compute nodes.
Typical CLASS I motherboards have either 2 or 4 CPUs and are often used as a
means to reduce the overall system cost. The Linux internal scheduler
determines how these CPUs get shared. The user cannot (at this point) assign
a specific task to a specific SMP processor. The user can, however, start two
independent processes or a threaded process and expect to see a performance
increase over a single CPU system.
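
As a small sketch of the "two independent processes" case, the program
below forks a second process so that the Linux scheduler can run the
two halves of a computation on two CPUs of an SMP box. The even split
of the work and the printing of two separate partial results (which
must still be added together) are illustrative choices, not part of
any standard Beowulf tool.
<verb>
/* smp_fork.c - use two processes to occupy two CPUs on an SMP box */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <unistd.h>
#include <sys/wait.h>

/* sum the square roots of the integers from..to inclusive */
double sum_sqrt (long from, long to) {
  double s = 0.0;
  long i;
  for (i = from; i <= to; i++)
    s += sqrt((double)i);
  return s;
}

int main (void) {

  if (fork() == 0) {
    /* child process computes the second half */
    printf("child:  %lf\n", sum_sqrt(500000001L, 1000000000L));
    exit(0);
  }
  /* parent process computes the first half */
  printf("parent: %lf\n", sum_sqrt(1L, 500000000L));
  wait(NULL);
  return 0;
}
</verb>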

<sect2>Software API Architectures
<p>
There are basically two ways to "express" concurrency in a program:
<enum>
<item> Using Messages sent between processors
<item> Using operating system Threads
</enum>
<p>
Other methods do exist, but these are the two most widely used. It is
important to remember that the expression of concurrency is not necessarily
controlled by the underlying hardware. Both Messages and Threads can be
implemented on SMP, NUMA-SMP, and clusters - although, as explained below,
efficiency and portability are important issues.
<p>
<sect3>Messages
<p>
Historically, message passing technology reflected the
design of early local memory parallel computers. Messages require
copying data while Threads use data in place. The latency and speed
at which messages can be copied are the limiting factor with message
passing models. A Message is quite simple: some data and a destination
processor. Common message passing APIs are <htmlurl
url="http://www.epm.ornl.gov/pvm" name="PVM"> or <htmlurl
url="http://www.mcs.anl.gov/Projects/mpi/index.html" name="MPI">. Message
passing can be efficiently implemented using Threads, and Messages work
well both on SMP machines and between clusters of machines. The
advantage of using messages on an SMP machine, as opposed to Threads,
is that if you decide to use clusters in the future it is easy to add
machines or scale your application.
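
To give a flavour of the message passing style, the sketch below uses
MPI to split the sum-of-square-roots example from earlier in this
HOWTO across the processes of a cluster. It is only a minimal sketch:
it assumes an MPI implementation such as LAM/MPI is installed, and
that the program is compiled with mpicc and started with mpirun.
<verb>
/* mpi_sigmasqrt.c - sum of square roots of 1..N with MPI */

#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main (int argc, char** argv) {

  int rank, size;
  long i;
  const long N = 1000000000L;
  double partial = 0.0, total = 0.0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* who am I?       */
  MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many of us? */

  /* each process sums the square roots of its own slice of 1..N */
  for (i = rank + 1; i <= N; i += size)
    partial += sqrt((double)i);

  /* send all partial sums to process 0 and add them there */
  MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("%lf\n", total);

  MPI_Finalize();
  return 0;
}
</verb>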
<p>
<sect3>Threads
<p>
Operating system Threads were developed because shared memory SMP
(symmetric multiprocessing) designs allowed very fast shared memory
communication and synchronization between concurrent parts of a
program. Threads work well on SMP systems because communication is
through shared memory. For this reason the user must isolate local
data from global data, otherwise programs will not work properly. In
contrast to messages, a large amount of copying can be eliminated with
threads because the data is shared between processes (threads). Linux
supports POSIX threads. The problem with threads is that it is
difficult to extend them beyond one SMP machine and, because data is
shared between CPUs, cache coherence issues can contribute to
overhead. Extending threads beyond the SMP boundary efficiently
requires NUMA technology which is expensive and not natively
supported by Linux. Implementing threads on top of messages has been
done (<htmlurl name="(http://syntron.com/ptools/ptools_pg.htm)"
url="http://syntron.com/ptools/ptools_pg.htm">), but Threads are often
inefficient when implemented using messages.
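
As a companion to the MPI sketch above, here is the same computation
expressed with POSIX threads on a single SMP machine. Again this is
only a minimal sketch (the two-thread split is an arbitrary choice);
compile it with something like gcc thread_sigmasqrt.c -lpthread -lm.
<verb>
/* thread_sigmasqrt.c - sum of square roots of 1..N with two threads */

#include <stdio.h>
#include <math.h>
#include <pthread.h>

#define NTHREADS 2
#define N 1000000000L

double partial[NTHREADS];   /* one result slot per thread, no locking */

void* worker (void* arg) {
  long id = (long)arg;
  long i;
  double sum = 0.0;
  /* each thread sums the square roots of its own slice of 1..N */
  for (i = id + 1; i <= N; i += NTHREADS)
    sum += sqrt((double)i);
  partial[id] = sum;          /* threads share memory, no copying */
  return NULL;
}

int main (void) {

  pthread_t thread[NTHREADS];
  double total = 0.0;
  long t;

  for (t = 0; t < NTHREADS; t++)
    pthread_create(&thread[t], NULL, worker, (void*)t);
  for (t = 0; t < NTHREADS; t++) {
    pthread_join(thread[t], NULL);
    total += partial[t];
  }
  printf("%lf\n", total);
  return 0;
}
</verb>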
<p>
The following can be stated about performance:
<verb>
               SMP machine      cluster of machines     scalability
               performance      performance
               -----------      -------------------     -----------
   messages    good             best                    best

   threads     best             poor*                   poor*

   * requires expensive NUMA technology.
</verb>

<sect2>Application Architecture
<p>
In order to run an application in parallel on multiple CPUs, it must be
explicitly broken into concurrent parts. A standard single CPU application
will run no faster on multiple processors than on a single CPU.
There are some tools and compilers that can break up programs, but
parallelizing codes is not a "plug and play" operation. Depending on the
application, parallelizing code can be easy, extremely difficult, or in some
cases impossible due to algorithm dependencies.
<p>
Before the software issues can be addressed, the concept of Suitability
needs to be introduced.

<sect1>Suitability<label id="suitability">
<p>
Most questions about parallel computing have the same answer:
<p>
"It all depends upon the application."
<p>
Before we jump into the issues, there is one very important distinction that
needs to be made - the difference between CONCURRENT and PARALLEL. For the
sake of this discussion we will define these two concepts as follows:
<p>
CONCURRENT parts of a program are those that can be computed independently.
<p>
PARALLEL parts of a program are those CONCURRENT parts that are executed on
separate processing elements at the same time.
<p>
The distinction is very important, because CONCURRENCY is a property of the
program and efficient PARALLELISM is a property of the machine. Ideally,
PARALLEL execution should result in faster performance. The limiting factor
in parallel performance is the communication speed and latency between compute
nodes. (Latency also exists with threaded SMP applications due to cache
coherency.) Many of the common parallel benchmarks are highly parallel and
communication and latency are not the bottleneck. This type of problem can be
called "obviously parallel". Other applications are not so simple and
executing CONCURRENT parts of the program in PARALLEL may actually cause the
program to run slower, thus offsetting any performance gains in other
CONCURRENT parts of the program. In simple terms, the cost of communication
time must pay for the savings in computation time, otherwise the PARALLEL
execution of the CONCURRENT part is inefficient.
<p>
The task of the programmer is to determine what CONCURRENT parts of the
program SHOULD be executed in PARALLEL and what parts SHOULD NOT. The answer
to this will determine the EFFICIENCY of the application. The following graph
summarizes the situation for the programmer:
<p>
<verb>

          | *
          | *
          | *
  % of    | *
  appli-  |  *
  cations |  *
          |   *
          |   *
          |    *
          |     *
          |      *
          |       ****
          |           ****
          |               ********************
          +-----------------------------------
           communication time/processing time
</verb>
<p>
In a perfect parallel computer, communication would cost nothing (a
communication/processing ratio of zero) and anything that is CONCURRENT could
be implemented in PARALLEL. Unfortunately, real parallel computers, including
shared memory machines, are subject to the effects described in this graph.
When designing a Beowulf, the user may want to keep this graph in mind because
parallel efficiency depends upon the ratio of communication time and
processing time for A SPECIFIC PARALLEL COMPUTER. Applications may be portable
between parallel computers, but there is no guarantee they will be efficient
on a different platform.
<p>
IN GENERAL, THERE IS NO SUCH THING AS A PORTABLE AND EFFICIENT PARALLEL
PROGRAM
<p>
There is yet another consequence of the above graph. Since efficiency depends
upon the comm./process. ratio, just changing one component of the ratio does
not necessarily mean a specific application will perform faster. A change in
processor speed, while keeping the communication speed the same, may have
non-intuitive effects on your program. For example, doubling or tripling the
CPU speed, while keeping the communication speed the same, may now make some
previously efficient PARALLEL portions of your program more efficient if they
are executed SEQUENTIALLY. That is, it may now be faster to run the
previously PARALLEL parts as SEQUENTIAL. Furthermore, running inefficient
parts in parallel will actually keep your application from reaching its
maximum speed. Thus, by adding a faster processor, you may actually slow down
your application (you are keeping the new CPU from running at its maximum
speed for that application).
<p>
UPGRADING TO A FASTER CPU MAY ACTUALLY SLOW DOWN YOUR APPLICATION
<p>
So, in conclusion, to know whether or not you can use a parallel hardware
environment, you need to have some insight into the suitability of a
particular machine to your application. You need to look at a lot of issues
including CPU speeds, compiler, message passing API, network, etc. Please
note, just profiling an application does not give the whole story. You may
identify a computationally heavy portion of your program, but you do not
know the communication cost for this portion. It may be that for a given
system, the communication costs do not make parallelizing this code
efficient.
<p>
A final note about a common misconception. It is often stated that "a program
is PARALLELIZED", but in reality only the CONCURRENT parts of the program have
been located. For all the reasons given above, the program is not
PARALLELIZED. Efficient PARALLELIZATION is a property of the machine.
<p>

<sect1>Writing and porting parallel software
<p>
Once you decide that you need parallel computing and would like to design and
build a Beowulf, a few moments considering your application with respect to
the previous discussion may be a good idea.
<p>
In general there are two things you can do:
<enum>
<item>Go ahead and construct a CLASS I Beowulf and then "fit" your
application to it. Or run existing parallel applications that
you know work on your Beowulf (but beware of the portability and
efficiency issues mentioned above)

<item>Look at the applications you need to run on your Beowulf and
make some estimations as to the type of hardware and software you
need.
</enum>

In either case, at some point you will need to look at the efficiency issues.
In general, there are three things you need to do:
<enum>
<item> Determine the concurrent parts of your program
<item> Estimate the parallel efficiency
<item> Describe the concurrent parts of your program
</enum>

Let's look at these one at a time.

<sect2>Determine concurrent parts of your program
<p>
This step is often considered "parallelizing your program". Parallelization
decisions will be made in step 2. In this step, you need to determine data
dependencies.
<p>
From a practical standpoint, applications may exhibit two types of
concurrency: compute (number crunching) and I/O (database). Although in many
cases compute and I/O concurrency are orthogonal, there are applications that
require both. There are tools available that can perform concurrency analysis
on existing applications. Most of these tools are designed for FORTRAN.
There are two reasons FORTRAN is used: historically most number crunching
applications were written in FORTRAN and it is easier to analyze. If no tools
are available, then this step can be somewhat difficult for existing
applications.
<p>
<sect2>Estimate parallel efficiency
<p>
Without the help of tools, this step may require trial and error tests or just
a plain old educated guess. If you have a specific application in mind, try
to determine if it is CPU limited (compute bound) or hard disk limited (I/O
bound). The requirements of your Beowulf may be quite different depending
upon your needs. For example, a compute bound problem may need a few very
fast CPUs and a high speed, low latency network, while an I/O bound problem
may work better with more, slower CPUs and fast Ethernet.
<p>
This recommendation often comes as a surprise to most people because the
standard assumption is that faster processors are always better. While this
is true if you have an unlimited budget, real systems may have cost
constraints within which performance should be maximized. For I/O bound
problems, there is a little known rule (called the Eadline-Dedkov Law) that
is quite helpful:
<p>
For two given parallel computers with the same cumulative CPU performance
index, the one which has slower processors (and a probably correspondingly
slower interprocessor communication network) will have better performance
for I/O-dominant applications.
<p>
While the proof of this rule is beyond the scope of this document, you may
find it interesting to download the paper <it>Performance
Considerations for I/O-Dominant Applications on Parallel
Computers</it> (Postscript format 109K) <htmlurl
url="ftp://www.plogic.com/pub/papers/exs-pap6.ps"
name="(ftp://www.plogic.com/pub/papers/exs-pap6.ps)">
<p>
Once you have determined what type of concurrency you have in your
application, you will need to estimate how efficient it will be in
parallel. See Section <ref id="software" name="Software"> for a
description of Software tools.
<p>
In the absence of tools, you may try to guess your way through this step. If
a compute bound loop is measured in minutes and the data can be transferred in
seconds, then it might be a good candidate for parallelization. But remember,
if you take a 16 minute loop and break it into 32 parts, and your data
transfers require several seconds per part, then things are going to get
tight. You will reach a point of diminishing returns.
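
To put some rough numbers on that example: a 16 minute loop is 960
seconds of computation, so split 32 ways each part computes for only
about 30 seconds. If each part also needs, say, 5 seconds of data
transfer (an illustrative figure), communication alone costs a
significant fraction of the 30 seconds of computation per part, and
splitting the loop further would make the ratio worse rather than
better.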
<p>
<sect2>Describing the concurrent parts of your program
<p>
There are several ways to describe the concurrent parts of your program:
<enum>
<item> Explicit parallel execution
<item> Implicit parallel execution
</enum>

The major difference between the two is that explicit parallelism is
determined by the user, whereas implicit parallelism is determined by the
compiler.

<sect3>Explicit Methods
<p>
These are basically methods where the user must modify source code
specifically for a parallel computer. The user must either add
messages using <htmlurl url="http://www.epm.ornl.gov/pvm" name="PVM"> or
<htmlurl url="http://www.mcs.anl.gov/Projects/mpi/index.html" name="MPI"> or
add threads using POSIX threads. (Keep in mind however, threads cannot
move between SMP motherboards).
<p>
Explicit methods tend to be the most difficult to implement and debug. Users
typically embed explicit function calls in standard FORTRAN 77 or C/C++ source
code. The MPI library has added some functions to make some standard parallel
methods easier to implement (i.e. scatter/gather functions). In addition, it
is also possible to use standard libraries that have been written for parallel
computers. (Keep in mind, however, the portability vs. efficiency trade-off.)
<p>
For historical reasons, most number crunching codes are written in FORTRAN.
For this reason, FORTRAN has the largest amount of support (tools,
libraries, etc.) for parallel computing. Many programmers now use C or
re-write existing FORTRAN applications in C with the notion that C will allow
faster execution. While this may be true as C is the closest thing to a
universal machine code, it has some major drawbacks. The use of pointers in C
makes determining data dependencies extremely difficult. Automatic analysis
of pointers is extremely difficult. If you have an existing FORTRAN program
and think that you might want to parallelize it in the future - DO NOT CONVERT
IT TO C!
<p>
<sect3>Implicit Methods
<p>
Implicit methods are those where the user gives up some (or all) of the
parallelization decisions to the compiler. Examples are FORTRAN 90, High
Performance FORTRAN (HPF), Bulk Synchronous Parallel (BSP), and a whole
collection of other methods that are under development.
<p>
Implicit methods require the user to provide some information about the
concurrent nature of their application, but the compiler will then make many
decisions about how to execute this concurrency in parallel. These methods
provide some level of portability and efficiency, but there is still no "best
way" to describe a concurrent problem for a parallel computer.

<sect>Beowulf Resources
<p>

<sect1>Starting Points
<p>

<itemize>

<item>Beowulf mailing list. To subscribe send mail to <htmlurl
url="mailto:beowulf-request@cesdis.gsfc.nasa.gov"
name="beowulf-request@cesdis.gsfc.nasa.gov"> with the word
<it>subscribe</it> in the message body.

<item>Beowulf Homepage <htmlurl name="http://www.beowulf.org"
url="http://www.beowulf.org">

<item>Extreme Linux <htmlurl name="http://www.extremelinux.org"
url="http://www.extremelinux.org">

<item>Extreme Linux Software from Red Hat <htmlurl
name="http://www.redhat.com/extreme"
url="http://www.redhat.com/extreme">

</itemize>

<sect1>Documentation
<p>

<itemize>

<item>The latest version of the Beowulf HOWTO <htmlurl
name="http://www.sci.usq.edu.au/staff/jacek/beowulf"
url="http://www.sci.usq.edu.au/staff/jacek/beowulf">.

<item>Building a Beowulf System <htmlurl
url="http://www.cacr.caltech.edu/beowulf/tutorial/building.html"
name="http://www.cacr.caltech.edu/beowulf/tutorial/building.html">

<item>Jacek's Beowulf Links <htmlurl
name="http://www.sci.usq.edu.au/staff/jacek/beowulf"
url="http://www.sci.usq.edu.au/staff/jacek/beowulf">.

<item>Beowulf Installation and Administration HOWTO (DRAFT) <htmlurl
name="http://www.sci.usq.edu.au/staff/jacek/beowulf"
url="http://www.sci.usq.edu.au/staff/jacek/beowulf">.

<item>Linux Parallel Processing HOWTO <htmlurl
name="http://yara.ecn.purdue.edu/~pplinux/PPHOWTO/pphowto.html"
url="http://yara.ecn.purdue.edu/~pplinux/PPHOWTO/pphowto.html">

</itemize>

<sect1>Papers<label id="papers">
<p>

<itemize>

<item>Chance Reschke, Thomas Sterling, Daniel Ridge, Daniel Savarese,
Donald Becker, and Phillip Merkey. <it>A Design Study of Alternative
Network Topologies for the Beowulf Parallel Workstation</it>.
Proceedings, Fifth IEEE International Symposium on High Performance
Distributed Computing, 1996. <htmlurl
name="http://www.beowulf.org/papers/HPDC96/hpdc96.html"
url="http://www.beowulf.org/papers/HPDC96/hpdc96.html">

<item>Daniel Ridge, Donald Becker, Phillip Merkey, and Thomas
Sterling. <it>Harnessing the Power of Parallelism in
a Pile-of-PCs</it>. Proceedings, IEEE Aerospace, 1997. <htmlurl
name="http://www.beowulf.org/papers/AA97/aa97.ps"
url="http://www.beowulf.org/papers/AA97/aa97.ps">

<item>Thomas Sterling, Donald J. Becker, Daniel Savarese, Michael
R. Berry, and Chance Reschke. <it>Achieving a Balanced Low-Cost
Architecture for Mass Storage Management through Multiple Fast
Ethernet Channels on the Beowulf Parallel Workstation</it>.
Proceedings, International Parallel Processing Symposium, 1996.
<htmlurl name="http://www.beowulf.org/papers/IPPS96/ipps96.html"
url="http://www.beowulf.org/papers/IPPS96/ipps96.html">

<item>Donald J. Becker, Thomas Sterling, Daniel Savarese, Bruce
Fryxell, and Kevin Olson. <it>Communication Overhead for Space Science
Applications on the Beowulf Parallel Workstation</it>.
Proceedings, High Performance and Distributed Computing, 1995.
<htmlurl name="http://www.beowulf.org/papers/HPDC95/hpdc95.html"
url="http://www.beowulf.org/papers/HPDC95/hpdc95.html">

<item>Donald J. Becker, Thomas Sterling, Daniel Savarese, John
E. Dorband, Udaya A. Ranawak, and Charles V. Packer. <it>BEOWULF: A
PARALLEL WORKSTATION FOR SCIENTIFIC COMPUTATION</it>. Proceedings,
International Conference on Parallel Processing, 1995.
<htmlurl name="http://www.beowulf.org/papers/ICPP95/icpp95.html"
url="http://www.beowulf.org/papers/ICPP95/icpp95.html">

<item>Papers at the Beowulf site <htmlurl
url="http://www.beowulf.org/papers/papers.html"
name="http://www.beowulf.org/papers/papers.html">

</itemize>

<sect1>Software<label id="software">
<p>
<itemize>

<item>PVM - Parallel Virtual Machine <htmlurl
name="http://www.epm.ornl.gov/pvm/pvm_home.html"
url="http://www.epm.ornl.gov/pvm/pvm_home.html">

<item>LAM/MPI (Local Area Multicomputer / Message Passing Interface)
<htmlurl url="http://www.mpi.nd.edu/lam"
name="http://www.mpi.nd.edu/lam">

<item>BERT77 - FORTRAN conversion tool <htmlurl
url="http://www.plogic.com/bert.html"
name="http://www.plogic.com/bert.html">

<item>Beowulf software from the Beowulf Project Page <htmlurl
name="http://beowulf.gsfc.nasa.gov/software/software.html"
url="http://beowulf.gsfc.nasa.gov/software/software.html">

<item>Jacek's Beowulf-utils <htmlurl
name="ftp://ftp.sci.usq.edu.au/pub/jacek/beowulf-utils"
url="ftp://ftp.sci.usq.edu.au/pub/jacek/beowulf-utils">

<item>bWatch - cluster monitoring tool <htmlurl
url="http://www.sci.usq.edu.au/staff/jacek/bWatch"
name="http://www.sci.usq.edu.au/staff/jacek/bWatch">

</itemize>

<sect1>Beowulf Machines
<p>
<itemize>

<item>Avalon consists of 140 Alpha processors, 36 GB of RAM, and is
probably the fastest Beowulf machine, cruising at 47.7 Gflops and
ranking 114th on the Top 500 list. <htmlurl
name="http://swift.lanl.gov/avalon/"
url="http://swift.lanl.gov/avalon/">

<item>Megalon - A Massively PArallel CompuTer Resource (MPACTR)
consists of 14 quad-CPU Pentium Pro 200 nodes and 14 GB of RAM. <htmlurl
name="http://megalon.ca.sandia.gov/description.html"
url="http://megalon.ca.sandia.gov/description.html">

<item>theHIVE - Highly-parallel Integrated Virtual Environment is
another fast Beowulf Supercomputer. theHIVE is a 64 node, 128 CPU
machine with a total of 4 GB RAM. <htmlurl
name="http://newton.gsfc.nasa.gov/thehive/"
url="http://newton.gsfc.nasa.gov/thehive/">

<item>Topcat is a much smaller machine and consists of 16 CPUs and 1.2
GB RAM. <htmlurl name="http://www.sci.usq.edu.au/staff/jacek/topcat"
url="http://www.sci.usq.edu.au/staff/jacek/topcat">

<item>MAGI cluster - this is a very interesting site with many good
links. <htmlurl name="http://noel.feld.cvut.cz/magi/"
url="http://noel.feld.cvut.cz/magi/">

</itemize>

<sect1>Other Interesting Sites
<p>

<itemize>
<item>SMP Linux <htmlurl name="http://www.linux.org.uk/SMP/title.html"
url="http://www.linux.org.uk/SMP/title.html">

<item>Paralogic - Buy a Beowulf <htmlurl name="http://www.plogic.com"
url="http://www.plogic.com">

</itemize>

<sect1>History<label id="history">
<p>
<itemize>

<item> Legends - Beowulf <htmlurl name="http://legends.dm.net/beowulf/index.html"
url="http://legends.dm.net/beowulf/index.html">

<item> The Adventures of Beowulf <htmlurl
name="http://www.lnstar.com/literature/beowulf/beowulf.html"
url="http://www.lnstar.com/literature/beowulf/beowulf.html">

</itemize>

<sect>Source code
<p>
<sect1>sum.c<label id="sum">
<p>
<verb>
/* Jacek Radajewski jacek@usq.edu.au */
/* 21/08/1998 */

#include <stdio.h>
#include <stdlib.h>   /* for atof() */
#include <math.h>

int main (void) {

  double result = 0.0;
  double number = 0.0;
  char string[80];

  /* read whitespace-separated numbers from stdin and add them up */
  while (scanf("%79s", string) != EOF) {

    number = atof(string);
    result = result + number;
  }

  printf("%lf\n", result);

  return 0;

}
</verb>

<sect1>sigmasqrt.c<label id="sigmasqrt">
<p>
<verb>
/* Jacek Radajewski jacek@usq.edu.au */
/* 21/08/1998 */

#include <stdio.h>
#include <stdlib.h>   /* for atol() and exit() */
#include <math.h>

int main (int argc, char** argv) {

  long number1, number2, counter;
  double result;

  if (argc < 3) {
    printf ("usage : %s number1 number2\n",argv[0]);
    exit(1);
  } else {
    number1 = atol (argv[1]);
    number2 = atol (argv[2]);
    result = 0.0;
  }

  /* sum the square roots of the integers number1..number2 inclusive */
  for (counter = number1; counter <= number2; counter++) {
    result = result + sqrt((double)counter);
  }

  printf("%lf\n", result);

  return 0;

}
</verb>

<sect1>prun.sh<label id="prun">
<p>

<verb>
#!/bin/bash
# Jacek Radajewski jacek@usq.edu.au
# 21/08/1998

export SIGMASQRT=/home/staff/jacek/beowulf/HOWTO/example1/sigmasqrt

# $OUTPUT must be a named pipe
# mkfifo output

export OUTPUT=/home/staff/jacek/beowulf/HOWTO/example1/output

rsh scilab01 $SIGMASQRT 1 50000000 > $OUTPUT < /dev/null&
rsh scilab02 $SIGMASQRT 50000001 100000000 > $OUTPUT < /dev/null&
rsh scilab03 $SIGMASQRT 100000001 150000000 > $OUTPUT < /dev/null&
rsh scilab04 $SIGMASQRT 150000001 200000000 > $OUTPUT < /dev/null&
rsh scilab05 $SIGMASQRT 200000001 250000000 > $OUTPUT < /dev/null&
rsh scilab06 $SIGMASQRT 250000001 300000000 > $OUTPUT < /dev/null&
rsh scilab07 $SIGMASQRT 300000001 350000000 > $OUTPUT < /dev/null&
rsh scilab08 $SIGMASQRT 350000001 400000000 > $OUTPUT < /dev/null&
rsh scilab09 $SIGMASQRT 400000001 450000000 > $OUTPUT < /dev/null&
rsh scilab10 $SIGMASQRT 450000001 500000000 > $OUTPUT < /dev/null&
rsh scilab11 $SIGMASQRT 500000001 550000000 > $OUTPUT < /dev/null&
rsh scilab12 $SIGMASQRT 550000001 600000000 > $OUTPUT < /dev/null&
rsh scilab13 $SIGMASQRT 600000001 650000000 > $OUTPUT < /dev/null&
rsh scilab14 $SIGMASQRT 650000001 700000000 > $OUTPUT < /dev/null&
rsh scilab15 $SIGMASQRT 700000001 750000000 > $OUTPUT < /dev/null&
rsh scilab16 $SIGMASQRT 750000001 800000000 > $OUTPUT < /dev/null&
rsh scilab17 $SIGMASQRT 800000001 850000000 > $OUTPUT < /dev/null&
rsh scilab18 $SIGMASQRT 850000001 900000000 > $OUTPUT < /dev/null&
rsh scilab19 $SIGMASQRT 900000001 950000000 > $OUTPUT < /dev/null&
rsh scilab20 $SIGMASQRT 950000001 1000000000 > $OUTPUT < /dev/null&
</verb>

</article>