<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://docbook.org/xml/4.2/docbookx.dtd">
<article id="Beowulf-HOWTO">
<articleinfo>
<title>The Beowulf HOWTO</title>

<author>
<firstname>Kurt</firstname>
<surname>Swendson</surname>
<affiliation>
<address><email>lam32767@lycos.com</email></address>
</affiliation>
</author>

<pubdate>2004-05-17</pubdate>

<revhistory>
<revision>
<revnumber>1.0</revnumber>
<date>2005-01-08</date>
<revremark>first official release</revremark>
</revision>
<revision>
<revnumber>0.9</revnumber>
<date>2004-05-17</date>
<authorinitials>01</authorinitials>
<revremark>initial revision</revremark>
</revision>
</revhistory>

<abstract>
<para>This document gives step-by-step instructions for building a
Beowulf cluster. This version is specific to Red Hat and LAM.</para>
</abstract>
</articleinfo>

<sect1 id="intro">
<title>Introduction</title>

<para>This document gives step-by-step instructions for building a
Beowulf cluster. After reading the documentation that was available,
I felt there were enough gaps and omissions that my own document, which I
believe accurately describes how to build a Beowulf cluster, would be
beneficial.</para>

<para>I first saw Thomas Sterling's article in Scientific American, and
immediately got the book, because its title was "How to Build a Beowulf".
It is certainly a valuable reference, but it does not walk you through
exactly what to do.</para>

<para>What follows is a description of what I got to work. It is only one
example: my example. You may choose a different message passing
interface; you may choose a different Linux distribution. You may also
spend as much time as I did researching and experimenting, and learn on
your own.</para>

<sect2 id="copyright">
<title>Copyright and License</title>

<para>This document, <emphasis>The Beowulf HOWTO</emphasis>, is
copyrighted (c) 2004 by <emphasis>Kurt Swendson</emphasis>. Permission
is granted to copy, distribute and/or modify this document under the
terms of the GNU Free Documentation License, Version 1.1 or any later
version published by the Free Software Foundation; with no Invariant
Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A
copy of the license is available at <ulink
url="http://www.gnu.org/copyleft/fdl.html">http://www.gnu.org/copyleft/fdl.html</ulink>.</para>

<para>Linux is a registered trademark of Linus Torvalds.</para>
</sect2>

<sect2 id="disclaimer">
<title>Disclaimer</title>

<para>No liability for the contents of this document can be accepted.
Use the concepts, examples and information at your own risk. There may
be errors and inaccuracies which could damage your system. Though
this is highly unlikely, proceed with caution. The author(s) do not
accept responsibility for your actions.</para>

<para>All copyrights are held by their respective owners,
unless specifically noted otherwise. Use of a term in this document
should not be regarded as affecting the validity of any trademark or
service mark. Naming of particular products or brands should not be seen
as endorsements.</para>
</sect2>

<sect2 id="credits">
<title>Credits / Contributors</title>

<para>Thanks to Thomas Johnson for all of his support and encouragement
and, of course, for the hardware without which I would not have been
able to even start.</para>

<para>Thanks to my lovely wife Sharron for her understanding and
patience during my many hours spent with "the wolves".</para>

<para>The <ulink url="http://ibiblio.org/pub/Linux/docs/HOWTO/archive/Beowulf-HOWTO.html">original
Beowulf HOWTO</ulink> was written by Jacek Radajewski and Douglas Eadline.</para>
</sect2>

<sect2 id="feedback">
<title>Feedback</title>

<para>Send your additions, comments and criticisms to
<email>lam32767@lycos.com</email>.</para>
</sect2>
</sect1>

<sect1>
<title>Definitions</title>

<para>What is a Beowulf cluster? The authors of the
<ulink url="http://ibiblio.org/pub/Linux/docs/HOWTO/archive/Beowulf-HOWTO.html">original
Beowulf HOWTO</ulink>, Jacek Radajewski and Douglas Eadline,
provide a good definition in their document: "Beowulf is a
multi-computer architecture which can be used for parallel computations.
It is a system which usually consists of one server node, and one or more
client nodes connected together via Ethernet or some other network". The
site <ulink url="http://beowulf.org">beowulf.org</ulink> lists many web
pages about Beowulf systems built by individuals and organizations. These
two links expose you to a wide range of perspectives on the Beowulf
architecture, from which you can draw your own conclusions.</para>

<para>What is the difference between a true Beowulf cluster and a COW
[cluster of workstations]? Brahma gives a good definition: <ulink
url="http://www.phy.duke.edu/brahma/beowulf_book/node62.html">http://www.phy.duke.edu/brahma/beowulf_book/node62.html</ulink>.</para>

<para>If you are a "user" at your organization, and you have the use
of some nodes, you may still follow the instructions shown here to create
a COW. But if you "own" the nodes, that is, if you have complete control
of them and are able to completely erase and rebuild them, you may create
a true Beowulf cluster.</para>

<para>On Brahma's web page, he suggests that you manually configure each
box at first. Later on, after you get the feel of this whole "wolfing up"
procedure, you can set up new nodes automatically (which I will describe
in a later document).</para>
</sect1>

<sect1>
<title>Requirements</title>

<para>Let's briefly outline the requirements: <itemizedlist>
<listitem>
<para>More than one box, each equipped with a network card</para>
</listitem>

<listitem>
<para>A switch or hub to connect them</para>
</listitem>

<listitem>
<para>Linux</para>
</listitem>

<listitem>
<para>A message-passing interface [I used LAM]</para>
</listitem>
</itemizedlist>A KVM switch [you know, the switch that shares one
keyboard, video, and mouse between many boxes] is not required, but it is
convenient while setting up and debugging.</para>
</sect1>

<sect1>
<title>Set Up The Head Node</title>

<para>So let's get "wolfing." Choose the most powerful box to be the head
node. Install Linux there and choose every package you want. The only
requirement is that you choose "Network Servers" [in Red Hat terminology]
because you need to have NFS and ssh. That's all you need. In my case, I
was going to develop the Beowulf application on this box, so I added X
and C development.</para>

<para>Strictly speaking, you do not need NFS, but I found
it invaluable for copying files between nodes, and for automating the
install process. Later in this document I will describe how you can run a
simple Beowulf application without the use of NFS, but a more complex
application may use NFS or actually depend upon it.</para>

<para>Those of you researching Beowulf systems will also know that you can
put a second network card in the head node so you can access it from the
outside world. This is not required for the operation of a cluster.</para>

<para>I learned the hard way: use a password that obeys the strong
password constraints for your Linux distribution. I used an easily typed
password like "a" for my user, and the whole thing did not work. When I
changed my password to a legal password, with a mix of digits and upper
and lower case letters, it worked.</para>

<para>If you use LAM as your message passing interface, you will read in
the manual to turn OFF the firewalls, because LAM uses random port numbers
to communicate between nodes. Here is a rule: If the manual tells you to
do something, DO IT! The LAM manual also tells you to run as a non-root
user. Create the same user on every box. Build every box on the cluster
with that same user and password. I named that non-root user "wolf".
</para>

<sect2>
<title>Hosts</title>

<para>First we modify /etc/hosts. In it, you will see the comments
telling you to leave the "localhost" line alone. Ignore that advice and
change it to not include the name of your box in the loopback
address.</para>

<para>Modify the line that says: <screen>127.0.0.1 wolf00 localhost.localdomain localhost</screen></para>

<para>...to now say: <screen>127.0.0.1 localhost.localdomain localhost </screen></para>

<para>Then add all the boxes you want on your cluster. Note: This is not
required for the operation of a Beowulf cluster; only convenient, so
that you may type a simple "wolf01" when you refer to a box on your
cluster instead of the more tedious 192.168.0.101:</para>

<screen>192.168.0.100 wolf00
192.168.0.101 wolf01
192.168.0.102 wolf02
192.168.0.103 wolf03
192.168.0.104 wolf04</screen>
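
<para>If you have more than a handful of nodes, the entries can be
generated rather than typed. The following is only a minimal sketch,
assuming the numbering scheme used in this document (wolfNN at
192.168.0.100 + NN); the function name gen_hosts is made up for
illustration:</para>

<screen># Print hosts entries for wolf00 through wolfNN, where NN is the argument.
# gen_hosts is a hypothetical helper name, not a standard tool.
gen_hosts() {
    last=$1
    i=0
    while [ "$i" -le "$last" ]; do
        printf '192.168.0.%d wolf%02d\n' "$((100 + i))" "$i"
        i=$((i + 1))
    done
}

# Entries for the head node and four slave nodes.
gen_hosts 4</screen>

<para>Review the output before appending it to /etc/hosts as root.</para>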
</sect2>

<sect2>
<title>Groups</title>

<para>In order to responsibly set up your cluster, especially if you are
a "user" of your boxes [see Definitions], you should have some measure
of security.</para>

<para>After you create your user, create a group, and add the user to
the group. Then, you may modify your files and directories to only be
accessible by the users within that group:</para>

<screen>groupadd beowulf
usermod -g beowulf wolf </screen>

<para>...and add the following to /home/wolf/.bash_profile:</para>

<screen>umask 007</screen>

<para>Now any file created by the user "wolf" [or by any other user in
the group who sets the same umask] will automatically be accessible only
to its owner and to the group "beowulf".</para>
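
<para>To see what the umask does, you can try it in a scratch directory.
This is a small sketch, assuming GNU stat (as shipped with Linux); the
variable names are illustrative:</para>

<screen># Create a file under umask 007 and show its permission bits.
scratch=$(mktemp -d)
(
    # The subshell keeps the umask change from leaking into your session.
    umask 007
    touch "$scratch/example"
)
# Read/write for owner and group, nothing for others.
mode=$(stat -c '%a' "$scratch/example")
echo "$mode"
rm -r "$scratch"</screen>

<para>The mode printed is 660; a directory created the same way would
get 770.</para>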
</sect2>

<sect2>
<title>NFS</title>

<para>Refer to the following web site: <ulink
url="http://www.ibiblio.org/mdw/HOWTO/NFS-HOWTO/server.html">http://www.ibiblio.org/mdw/HOWTO/NFS-HOWTO/server.html</ulink></para>

<para>Print that out, and have it at your side. I will be directing you
how to modify your system in order to create an NFS server, but I have
found this site invaluable, as you may also.</para>

<para>Make a directory for everybody to share:</para>

<screen>mkdir /mnt/wolf
chmod 770 /mnt/wolf
chown -R wolf:beowulf /mnt/wolf </screen>

<para>Go to the /etc directory, and add your "shared" directory to the
exports file, giving the cluster network as address/netmask with the
options attached directly to it:</para>

<screen>cd /etc
cat >> exports
/mnt/wolf 192.168.0.0/255.255.255.0(rw)
&lt;control d&gt;</screen>
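
<para>In the exports file, whitespace between a client and its option
list changes the meaning of the line: the options then apply to every
host instead of to the named client. The following sketch flags such
lines. The helper name check_exports and the sample path /mnt/other are
made up for illustration, and the pattern is deliberately simple (it
would also flag a comment that contains a space followed by a
parenthesis):</para>

<screen># Report lines in which whitespace precedes an option list.
# check_exports is a hypothetical helper, not part of the NFS tools.
check_exports() {
    grep -nE '[^[:space:]][[:space:]]+\(' "$1"
}

# Try it on a scratch file: the first line is fine, the second is flagged.
sample=$(mktemp)
printf '%s\n' '/mnt/wolf 192.168.0.0/255.255.255.0(rw)' > "$sample"
printf '%s\n' '/mnt/other 192.168.0.0/255.255.255.0 (rw)' >> "$sample"
check_exports "$sample"
rm -f "$sample"</screen>

<para>For real use, pass /etc/exports as the argument; no output means
no suspicious lines were found.</para>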
</sect2>

<sect2>
<title>IP Addresses</title>

<para>My network is 192.168.0.nnn because it is one of the "private" IP
ranges. Thomas Sterling talks about it on page 106 of his book. It is
inside my firewall, and works just fine.</para>

<para>My head node, which I call "wolf00", is 192.168.0.100, and every
other node is named "wolfnn", with an IP address of 192.168.0.100 + nn.
I am following the sage advice of many of the web pages out there, and
setting myself up for an easier task of scaling up my cluster.</para>
</sect2>

<sect2>
<title>Services</title>

<para>Make sure that the services we want are up:</para>

<screen>chkconfig --add sshd
chkconfig --add nfs
chkconfig --add rexec
chkconfig --add rlogin
chkconfig --level 3 sshd on
chkconfig --level 3 nfs on
chkconfig --level 3 rexec on
chkconfig --level 3 rlogin on</screen>

<para>...And, during startup, I saw some services that I know I don't
want and that, in my opinion, could be removed. You may add or remove
others to suit your needs; just include the ones shown above.</para>

<screen>chkconfig --del atd
chkconfig --del rsh
chkconfig --del sendmail</screen>
</sect2>

<sect2>
<title>SSH</title>

<para>To be responsible, we make ssh work. While logged in as root, you
must modify the /etc/ssh/sshd_config file. The lines:</para>

<screen>#RSAAuthentication yes
#AuthorizedKeysFile .ssh/authorized_keys</screen>

<para>...are commented out, so uncomment them [remove the #].</para>
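
<para>If you would rather script the edit than make it by hand, sed can
strip the leading # from just these two directives. This is a sketch
that works on a scratch copy and prints the result; the helper name
uncomment_ssh_keys and the third sample line are illustrative
only:</para>

<screen># A scratch file stands in for /etc/ssh/sshd_config.
conf=$(mktemp)
printf '%s\n' '#RSAAuthentication yes' \
              '#AuthorizedKeysFile .ssh/authorized_keys' \
              '#PasswordAuthentication yes' > "$conf"

# Print the file with the two directives uncommented; other lines are
# left alone. uncomment_ssh_keys is a hypothetical helper name.
uncomment_ssh_keys() {
    sed -e 's/^#\(RSAAuthentication[[:space:]]\)/\1/' \
        -e 's/^#\(AuthorizedKeysFile[[:space:]]\)/\1/' "$1"
}

uncomment_ssh_keys "$conf"</screen>

<para>Compare the output with the original before replacing the real
file, and keep a backup copy of sshd_config.</para>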

<para>Reboot, and log back in as wolf, because the operation of your
cluster will always be done from the user "wolf". Also, the hosts file
modifications done earlier must take effect. Logging out and back in
will not do this. To be sure, reboot the box, and make sure your prompt
shows the hostname "wolf00".</para>

<para>To generate your public and private SSH keys, do this:</para>

<screen>ssh-keygen -b 1024 -f ~/.ssh/id_rsa -t rsa -N "" </screen>

<para>...and it will display a few messages, and tell you that it created
the public / private key pair. You will see these files, id_rsa and
id_rsa.pub, in the /home/wolf/.ssh directory.</para>

<para>Copy the id_rsa.pub file into a file called "authorized_keys"
right there in the .ssh directory. We will be using this file later.
Verify that the contents of this file show the hostname [the reason we
rebooted the box]. Modify the permissions on the file and the
directory:</para>

<screen>chmod 644 ~/.ssh/auth*
chmod 755 ~/.ssh </screen>

<para>According to the LAM user group, only the head node needs to log
on to the slave nodes; not the other way around. Therefore when we copy
the public key files, we only copy the head node's key file to each
slave node, and set up the agent on the head node. This is MUCH easier
than copying all authorized_keys files to all nodes. I will describe
this in more detail later.</para>

<para>Note: I am only documenting what the LAM distribution of the
message passing interface requires; if you choose another message passing
interface to build your cluster, your requirements may differ.</para>

<para>At the end of /home/wolf/.bash_profile, add the following
statements [again this is LAM-specific; your requirements may
vary]:</para>

<screen>export LAMRSH='ssh -x'
ssh-agent sh -c 'ssh-add &amp;&amp; bash'</screen>
</sect2>

<sect2>
<title>MPI</title>

<para>Lastly, put your message passing interface on the box. As stated
in Requirements, I used LAM. You can get LAM from here:</para>

<para><ulink url="http://www.lam-mpi.org/">http://www.lam-mpi.org/</ulink></para>

<para>...but you can use any other message passing interface or parallel
virtual machine software you want. Again, I am showing you what worked
for me.</para>

<para>You can either build LAM from the supplied source, or use their
precompiled RPM package. It is not in the scope of this document to
describe that; I just got the source and followed the directions, and in
another experiment I installed their RPM. Both of them worked fine.
Remember the whole reason we are doing this is to learn; go forth and
learn.</para>

<para>You may also read more documentation regarding LAM and other
message passing interface software <ulink
url="http://www.tldp.org/HOWTO/Scientific-Computing-with-GNU-Linux/systems.html">here</ulink>.</para>
</sect2>
</sect1>

<sect1>
<title>Set Up Slave Nodes</title>

<para>Get your network cables out. Install Linux on the first non-head
node. Follow these steps for each non-head node.</para>

<sect2>
<title>Base Linux Install</title>

<para>Going with my example node names and IP addresses, this is what I
chose during setup:</para>

<screen>Workstation
auto partition
remove all partitions on system
use LILO as the boot loader
put boot loader on the MBR
host name wolf01
ip address 192.168.0.101
add the user "wolf"
same password as on all other nodes
NO firewall</screen>

<para>The ONLY package installed: network servers. Un-select all other
packages.</para>

<para>It doesn't matter what else you choose; this is the minimum that
you need. Why fill the box up with non-essential software you will never
use? My research has concentrated on finding the minimal
configuration needed to get up and running.</para>

<para>Here's another very important point: when you move on to an
automated install and config, you really will NEVER log in to the box.
Only during setup and install do I type anything directly on the
box.</para>
</sect2>

<sect2>
<title>Hardware</title>

<para>When the computer starts up, it will complain if it does not have
a keyboard connected. I was not able to modify the BIOS, because I had
older discarded boxes with no documentation, so I just connected a
"fake" keyboard.</para>

<para>I am in the computer industry, and see hundreds of keyboards come
and go, and some occasionally end up in the garbage. I get an old dead
keyboard out of the garbage and remove JUST the cord with the tiny circuit
board up there in the corner, where the num lock and caps lock lights
are. Then I plug the cord in, and the computer thinks it has a complete
keyboard and starts without incident.</para>

<para>Again, you would be better off modifying your BIOS, if you are
able to. This is just a trick to use in case you don't have access to
the BIOS setup program.</para>
</sect2>

<sect2>
<title>Post Install Commands</title>

<para>After your newly installed box reboots, log on as root again,
and...</para>

<itemizedlist>
<listitem>
<para>run the same chkconfig commands shown above to set up the
right services.</para>
</listitem>

<listitem>
<para>modify /etc/hosts: remove "wolfnn" from the localhost line, and
add entries for wolfnn and wolf00.</para>
</listitem>

<listitem>
<para>install LAM.</para>
</listitem>

<listitem>
<para>create the /mnt/wolf directory and set up security for
it.</para>
</listitem>

<listitem>
<para>do the ssh configuration.</para>
</listitem>
</itemizedlist>

<para>Up to this point, the setup is pretty much the same as on the head
node. I do NOT modify the exports file on a slave node.</para>

<para>Also, do NOT add this line to the .bash_profile:</para>

<screen>ssh-agent sh -c 'ssh-add &amp;&amp; bash'</screen>
</sect2>

<sect2>
<title>SSH On Slave Nodes</title>

<para>Recall that on the head node, we created a file "authorized_keys".
Copy that file, created on your head node, to the ~/.ssh directory on
each slave node. The HEAD node will log on to all the SLAVE
nodes.</para>

<para>The requirement, as stated in the LAM user manual, is that there
should be no interaction required when logging in from the head to any
of the slaves. So, copying the public key from the head node into each
slave node, in the file "authorized_keys", tells each slave
that "the wolf user on wolf00 is allowed to log on here without any
password; we know it is safe."</para>

<para>However, you may recall that the documentation states that the
first time you log on, it will ask for confirmation. So, only once, after
doing the above configuration, go back to the head node, and type "ssh
wolfnn", where "wolfnn" is the name of your newly configured slave node.
It will ask you for confirmation; simply answer "yes", and
that will be the last time you have to interact.</para>

<para>Prove it by logging off and then using ssh to get back to that
node; it should log you in immediately, with no dialog whatsoever.</para>
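
<para>The copy step described above can be sketched as a loop run on the
head node as wolf. This is an illustration, not the only way to do it:
the helper name push_keys and the RUN variable are made up, and the node
names follow this document's example. By default the commands are only
printed so that you can review them. When you run them for real, scp
will still ask for the wolf password on each slave, because the key is
not installed there yet:</para>

<screen># RUN defaults to "echo", so the scp commands are printed, not executed.
# Set RUN to an empty string in the environment to execute them.
RUN=${RUN-echo}

# push_keys is a hypothetical helper: copy the head node's
# authorized_keys file into ~/.ssh on each node named as an argument.
push_keys() {
    for node in "$@"; do
        $RUN scp "$HOME/.ssh/authorized_keys" "$node:.ssh/authorized_keys"
    done
}

push_keys wolf01 wolf02 wolf03 wolf04</screen>

<para>The ~/.ssh directory must already exist on each slave node for the
copy to succeed.</para>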
</sect2>

<sect2>
<title>NFS Settings On Slave Nodes</title>

<para>As root, enter these commands:</para>

<screen>cat >> /etc/fstab
wolf00:/mnt/wolf /mnt/wolf nfs rw,hard,intr 0 0
&lt;control d&gt; </screen>

<para>What we did here was arrange for the slave node to automatically
mount the directory we exported in the /etc/exports file on the head
node. NFS is discussed further later in this document.</para>
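
<para>If you rebuild a node several times, it is easy to append the same
line to fstab more than once. One way to guard against that is sketched
below; the helper name add_nfs_entry is made up for illustration, and
the example runs against a scratch file rather than the real
/etc/fstab:</para>

<screen># Append the NFS entry only if the file does not already contain it.
# add_nfs_entry is a hypothetical helper name.
add_nfs_entry() {
    entry='wolf00:/mnt/wolf /mnt/wolf nfs rw,hard,intr 0 0'
    grep -qF "$entry" "$1" || printf '%s\n' "$entry" >> "$1"
}

# Running it twice on a scratch file still leaves a single entry.
scratch=$(mktemp)
add_nfs_entry "$scratch"
add_nfs_entry "$scratch"
cat "$scratch"
rm -f "$scratch"</screen>

<para>For real use, run it as root with /etc/fstab as the
argument.</para>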
</sect2>

<sect2>
<title>Lilo Modifications On Slave Nodes</title>

<para>Then modify /etc/lilo.conf.</para>

<para>The 2nd line of this file says</para>

<screen>timeout=nn</screen>

<para>Modify that line to say:</para>

<screen>timeout=1200</screen>

<para>LILO measures this value in tenths of a second, so 1200 means 120
seconds. After the file is modified, we apply the changes. You type
"/sbin/lilo", and it will display back "Added linux *" to confirm that
it took the changes you made to the lilo.conf file:</para>

<screen>/sbin/lilo
Added linux * </screen>

<para>Why do I do this lilo modification? If you were researching
Beowulf on the web, and understand everything I have done so far, you
may wonder, "I don't remember reading anything about lilo.conf."</para>

<para>All my Beowulf nodes share a single power strip. I turn on the
power strip, and every box on the cluster starts up immediately. As the
startup procedure progresses, it mounts file systems. Since the
non-head nodes mount the shared directory from the head node, they all
have to wait a little bit until the head node is up, with NFS ready
to go. So I make each slave node wait 2 minutes in the lilo step.
Meanwhile, the head node comes up and makes the shared directory
available. By then, the slave nodes finally start booting up because
lilo has waited 2 minutes.</para>
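
<para>If your head node needs more or less time to come up, the value
can be worked out from the delay you want. A tiny sketch, using the fact
that lilo.conf expresses timeout in tenths of a second (the variable name
wait_seconds is illustrative):</para>

<screen># Convert the desired delay in seconds to the lilo.conf unit.
wait_seconds=120
echo "timeout=$((wait_seconds * 10))"</screen>

<para>For a delay of 120 seconds this prints timeout=1200, the value
used above.</para>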
</sect2>
</sect1>

<sect1>
<title>Verification</title>

<para>All done! You are almost ready to start wolfing.</para>

<para>Reboot your boxes. Did they all come up? Can you ping the head node
from each box? Can you ping each node from the head node? Can you ssh?
Don't worry about doing ssh as root; only as wolf. Also, only worry about
ssh from the head to the slaves, not the other way around.</para>

<para>If you are logged in as wolf, and ssh to a box, does it go
automatically, without prompting for a password?</para>
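
<para>These checks can be scripted from the head node. The function below
is a sketch, not a definitive test suite: check_node is a made-up name,
and it assumes you have already answered the one-time ssh confirmation
for each slave. It relies on the ssh option BatchMode=yes, which makes
ssh fail instead of prompting, so a password prompt shows up as a
failure:</para>

<screen># check_node is a hypothetical helper: ping a node once, then try a
# non-interactive ssh login that runs the command "true".
check_node() {
    if ! ping -c 1 "$1" > /dev/null 2> /dev/null; then
        echo "$1: no ping reply"
        return 1
    fi
    if ! ssh -o BatchMode=yes "$1" true > /dev/null 2> /dev/null; then
        echo "$1: ssh needs interaction or failed"
        return 1
    fi
    echo "$1: ok"
}

# Usage, on the head node as wolf:
#   for n in wolf01 wolf02 wolf03 wolf04; do check_node "$n"; done</screen>

<para>A node that reports "ok" answered the ping and accepted the key
without any dialog.</para>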

<para>After the node boots up, log in as wolf, and say "mount". Does it
show wolf00:/mnt/wolf mounted? On the head node, copy a file into
/mnt/wolf. Can you read and write that file from the slave node?</para>

<para>The shared directory is really not required; it is merely
convenient to have a common directory reside on the head node. With a
common shared directory, you can easily use scp to copy files between
boxes. Sterling states in his book, on page 119, that a single NFS server
is a serious obstacle to scaling up to large numbers of nodes. I learned
this when I went from a small number of boxes up to a large number.</para>
</sect1>

<sect1>
<title>Run A Program</title>

<para>Once you can do all the tests shown above, you should be able to run
a program. From here on in, the instructions are LAM specific.</para>

<para>Go back to the head node, log in as wolf, and enter the following
commands:</para>

<screen>cat > /mnt/wolf/lamhosts
wolf01
wolf02
wolf03
wolf04
&lt;control d&gt;</screen>

<para>Go to the LAM examples directory, and compile "hello.c":</para>

<screen>mpicc -o hello hello.c
cp hello /mnt/wolf </screen>

<para>Then, as shown in the LAM documentation, start up LAM:</para>

<screen>[wolf@wolf00 wolf]$ lamboot -v lamhosts
LAM 7.0/MPI 2 C++/ROMIO - Indiana University
n0&lt;2572&gt; ssi:boot:base:linear: booting n0 (wolf00)
n0&lt;2572&gt; ssi:boot:base:linear: booting n1 (wolf01)
n0&lt;2572&gt; ssi:boot:base:linear: booting n2 (wolf02)
n0&lt;2572&gt; ssi:boot:base:linear: booting n3 (wolf04)
n0&lt;2572&gt; ssi:boot:base:linear: finished</screen>

<para>So we are now finally ready to run an app. [Remember, I am using
LAM; your message passing interface may have different syntax].</para>

<screen>[wolf@wolf00 wolf]$ mpirun n0-3 /mnt/wolf/hello
Hello, world! I am 0 of 4
Hello, world! I am 3 of 4
Hello, world! I am 2 of 4
Hello, world! I am 1 of 4
[wolf@wolf00 wolf]$</screen>

<para>Recall that I mentioned the drawbacks of NFS above. Here I am
telling all the nodes to use the NFS shared directory, which becomes a
bottleneck with a larger number of boxes. You could just as easily copy
the executable to each box and, in the mpirun command, specify a
node-local directory: mpirun n0-3 /home/wolf/hello. The prerequisite for
this is to have all the files available locally. In fact I have done
this, and it worked better than using the NFS shared executable. Of
course this approach breaks down if my cluster application needs to
modify a file shared across the cluster.</para>
</sect1>
</article>