mirror of https://github.com/tLDP/LDP
642 lines
25 KiB
XML
642 lines
25 KiB
XML
![]() |
<?xml version="1.0" encoding="UTF-8"?>
|
|||
|
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
|
|||
|
"http://docbook.org/xml/4.2/docbookx.dtd">
|
|||
|
<article>
|
|||
|
<articleinfo>
|
|||
|
<title>The Beowulf HOWTO</title>
|
|||
|
|
|||
|
<author>
|
|||
|
<firstname>Kurt</firstname>
|
|||
|
|
|||
|
<surname>Swendson</surname>
|
|||
|
|
|||
|
<affiliation>
|
|||
|
<address><email>lam32767@lycos.com</email></address>
|
|||
|
</affiliation>
|
|||
|
</author>
|
|||
|
|
|||
|
<pubdate>2004-05-17</pubdate>
|
|||
|
|
|||
|
<revhistory>
|
|||
|
<revision>
|
|||
|
<revnumber>1.0</revnumber>
|
|||
|
|
|||
|
<date>2004-05-17</date>
|
|||
|
|
|||
|
<authorinitials>01</authorinitials>
|
|||
|
|
|||
|
<revremark>first official release</revremark>
|
|||
|
</revision>
|
|||
|
</revhistory>
|
|||
|
|
|||
|
<abstract>
|
|||
|
<para>This document describes step by step instructions on building a
|
|||
|
Beowulf cluster. This is a Red Hat and LAM specific version of this
|
|||
|
document.</para>
|
|||
|
</abstract>
|
|||
|
</articleinfo>
|
|||
|
|
|||
|
<sect1 id="intro">
|
|||
|
<title>Introduction</title>
|
|||
|
|
|||
|
<para>This document describes step by step instructions on building a
|
|||
|
Beowulf cluster. After seeing all of the documentation that was available,
|
|||
|
I felt there were enough gaps and omissions that my own document that
|
|||
|
accurately describes how to build a Beowulf cluster would be
|
|||
|
beneficial.</para>
|
|||
|
|
|||
|
<para>I first saw Thomas Sterling’s article in Scientific American, and
|
|||
|
immediately got the book, because its title was “How to Build a Beowulf”.
|
|||
|
No doubt - it was a valuable reference, but it really does not walk you
|
|||
|
through instructions on exactly what to do.</para>
|
|||
|
|
|||
|
<para>What follows is a description of what I got to work. It is only one
|
|||
|
example - my example. You may choose a different message passing
|
|||
|
interface; you may choose a different Linux distribution. You may also
|
|||
|
spend as much time as I did researching and experimenting, and learn on
|
|||
|
your own.</para>
|
|||
|
|
|||
|
<sect2 id="copyright">
|
|||
|
<title>Copyright and License</title>
|
|||
|
|
|||
|
<para>This document, <emphasis>The Beowulf HOWTO</emphasis>, is
|
|||
|
copyrighted (c) 2002 by <emphasis>Kurt Swendson</emphasis>. Permission
|
|||
|
is granted to copy, distribute and/or modify this document under the
|
|||
|
terms of the GNU Free Documentation License, Version 1.1 or any later
|
|||
|
version published by the Free Software Foundation; with no Invariant
|
|||
|
Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A
|
|||
|
copy of the license is available at <ulink
|
|||
|
url="http://www.gnu.org/copyleft/fdl.html">
|
|||
|
http://www.gnu.org/copyleft/fdl.html</ulink>.</para>
|
|||
|
|
|||
|
<para>Linux is a registered trademark of Linus Torvalds.</para>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2 id="disclaimer">
|
|||
|
<title>Disclaimer</title>
|
|||
|
|
|||
|
<para>No liability for the contents of this document can be accepted.
|
|||
|
Use the concepts, examples and information at your own risk. There may
|
|||
|
be errors and inaccuracies, that could be damaging to your system.
|
|||
|
Proceed with caution, and although this is highly unlikely, the
|
|||
|
author(s) do not take any responsibility.</para>
|
|||
|
|
|||
|
<para>All copyrights are held by their by their respective owners,
|
|||
|
unless specifically noted otherwise. Use of a term in this document
|
|||
|
should not be regarded as affecting the validity of any trademark or
|
|||
|
service mark. Naming of particular products or brands should not be seen
|
|||
|
as endorsements.</para>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2 id="credits">
|
|||
|
<title>Credits / Contributors</title>
|
|||
|
|
|||
|
<para>Thanks to Thomas Johnson, for all of his support and
|
|||
|
encouragement, and of course, the hardware, without which I would not
|
|||
|
have been able to even start.</para>
|
|||
|
|
|||
|
<para>Thanks to my lovely wife Sharron for her understanding and
|
|||
|
patience during my many hours spent with "the wolves".</para>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2 id="feedback">
|
|||
|
<title>Feedback</title>
|
|||
|
|
|||
|
<para>Send your additions, comments and criticisms to
|
|||
|
<email>lam32767@lycos.com</email>.</para>
|
|||
|
</sect2>
|
|||
|
</sect1>
|
|||
|
|
|||
|
<sect1>
|
|||
|
<title>Definitions</title>
|
|||
|
|
|||
|
<para>What is a Beowulf cluster? <ulink
|
|||
|
url="http://www.tldp.org/HOWTO/Beowulf-HOWTO.html">Jacek and
|
|||
|
Douglas</ulink> give a good definition in their article: "Beowulf is a
|
|||
|
multi computer architecture which can be used for parallel computations.
|
|||
|
It is a system which usually consists of one server node, and one or more
|
|||
|
client nodes connected together via Ethernet or some other network". The
|
|||
|
site <ulink url="http://beowulf.org">beowulf.org</ulink> lists many web
|
|||
|
pages about Beowulf systems built by individuals and organizations. From
|
|||
|
these two links, one can be exposed to a large number of perspectives on
|
|||
|
the Beowulf architecture, and draw his / her own conclusions.</para>
|
|||
|
|
|||
|
<para>What’s the difference between a true Beowulf cluster and a COW
|
|||
|
[cluster of workstations]? Brahma gives a good definition:<ulink
|
|||
|
url="http://www.phy.duke.edu/brahma/beowulf_book/node62.html">
|
|||
|
http://www.phy.duke.edu/brahma/beowulf_book/node62.html</ulink>.</para>
|
|||
|
|
|||
|
<para>If you are a “user” at your organization, and you have the use of
|
|||
|
some nodes, you may still do the instructions shown here to create a cow.
|
|||
|
But if you “own” the nodes, that is, if you have complete control of them,
|
|||
|
and are able to completely erase and rebuild them, you may create a true
|
|||
|
Beowulf cluster.</para>
|
|||
|
|
|||
|
<para>In Brahma's web page, he suggests that you manually configure each
|
|||
|
box, and then, later on, after you get the feel of doing this whole
|
|||
|
“wolfing up” procedure, you can set up new nodes automatically, which I
|
|||
|
will describe in a later document.</para>
|
|||
|
</sect1>
|
|||
|
|
|||
|
<sect1>
|
|||
|
<title>Requirements</title>
|
|||
|
|
|||
|
<para>Let’s briefly outline your requirements: <itemizedlist>
|
|||
|
<listitem>
|
|||
|
<para>More than one box, each equipped with a network card.</para>
|
|||
|
</listitem>
|
|||
|
|
|||
|
<listitem>
|
|||
|
<para>A switch or hub to connect them</para>
|
|||
|
</listitem>
|
|||
|
|
|||
|
<listitem>
|
|||
|
<para>Linux</para>
|
|||
|
</listitem>
|
|||
|
|
|||
|
<listitem>
|
|||
|
<para>A message-passing interface [I used lam]</para>
|
|||
|
</listitem>
|
|||
|
</itemizedlist>It is not a requirement to have a kvm switch, but it is
|
|||
|
convenient while setting up and / or debugging.</para>
|
|||
|
</sect1>
|
|||
|
|
|||
|
<sect1>
|
|||
|
<title>Set up the head node</title>
|
|||
|
|
|||
|
<para>So let’s get wolfing. Choose the most powerful box to be the head
|
|||
|
node. Install Linux on there, and choose every package you want. The only
|
|||
|
requirement is that you choose “Network Servers” [in Red Hat terminology]
|
|||
|
because you need to have NFS and ssh. That’s all you need. In my case, I
|
|||
|
was going to do development of the Beowulf application, so I added X and C
|
|||
|
development.</para>
|
|||
|
|
|||
|
<para>It is my experience that you do not actually need NFS, but I found
|
|||
|
it invaluable for copying files between nodes, and for automating the
|
|||
|
install process. Later in this document I will describe how you can run a
|
|||
|
simple Beowulf application without the use of NFS, but a more complex
|
|||
|
application may use NFS or actually depend upon it.</para>
|
|||
|
|
|||
|
<para>Those of you researching Beowulf systems will also know how you can
|
|||
|
have a 2nd network card on the head node so you can access it from the
|
|||
|
outside world. This is not required for the operation of a cluster, so you
|
|||
|
may do what you want regarding this.</para>
|
|||
|
|
|||
|
<para>I learned the hard way: use a password that obeys the strong
|
|||
|
password constraints for your Linux distribution. I used an easily-typed
|
|||
|
password like “a” for my user, and the whole thing did not work. When I
|
|||
|
changed my password to a legal password, with mixed numbers, characters,
|
|||
|
upper and lower case, it worked.</para>
|
|||
|
|
|||
|
<para>If you use lam as your message passing interface, you will read in
|
|||
|
the manual to turn OFF the firewalls, because they use random port numbers
|
|||
|
to communicate between nodes. Here is a rule: If the manual tells you to
|
|||
|
do something, DO IT! The lam manual also tells you to run as a non-root
|
|||
|
user. Make the same user for every box. Build every box on the cluster
|
|||
|
with the same user “wolf” with the same password.</para>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>Hosts</title>
|
|||
|
|
|||
|
<para>First we will modify /etc/hosts. In the file, you will see the
|
|||
|
comments telling you to leave the “localhost” line alone. Ignore that
|
|||
|
advice and change it to not include the name of your box in the loopback
|
|||
|
address.</para>
|
|||
|
|
|||
|
<para>Modify the line that says: <screen>127.0.0.1 wolf00 localhost.localdomain localhost</screen></para>
|
|||
|
|
|||
|
<para>...to now say: <screen>127.0.0.1 localhost.localdomain localhost </screen></para>
|
|||
|
|
|||
|
<para>Then add all the boxes you want on your cluster. Note: This is not
|
|||
|
required for the operation of a Beowulf cluster; only convenient, so
|
|||
|
that you may type a simple “wolf01” when you refer to a box on your
|
|||
|
cluster instead of the more tedious 192.168.0.101:</para>
|
|||
|
|
|||
|
<screen>192.168.0.100 wolf00
|
|||
|
192.168.0.101 wolf01
|
|||
|
192.168.0.102 wolf02
|
|||
|
192.168.0.103 wolf03
|
|||
|
192.168.0.104 wolf04</screen>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>Groups</title>
|
|||
|
|
|||
|
<para>In order to responsibly set your cluster up, especially if you are
|
|||
|
a “user” of your boxes [see Definitions], you should have some measure
|
|||
|
of security.</para>
|
|||
|
|
|||
|
<para>After you create your user, create a group, and add the user to
|
|||
|
the group. Then, you may modify your files and directories to only be
|
|||
|
accessible by the users within that group:</para>
|
|||
|
|
|||
|
<screen>groupadd beowulf
|
|||
|
usermod -g beowulf wolf </screen>
|
|||
|
|
|||
|
<para>… and add the following to /home/wolf/.bash_profile:</para>
|
|||
|
|
|||
|
<screen>umask 007</screen>
|
|||
|
|
|||
|
<para>Now any files created by the user “wolf” [or any user within the
|
|||
|
group] will automatically be only writeable by the group
|
|||
|
“beowulf”.</para>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>NFS</title>
|
|||
|
|
|||
|
<para>Refer to the following web site: <ulink
|
|||
|
url="http://www.ibiblio.org/mdw/HOWTO/NFS-HOWTO/server.html">http://www.ibiblio.org/mdw/HOWTO/NFS-HOWTO/server.html</ulink></para>
|
|||
|
|
|||
|
<para>Print that up, and have it at your side. I will be directing you
|
|||
|
how to modify your system in order to create an NFS server, but I have
|
|||
|
found this site invaluable, as you may also.</para>
|
|||
|
|
|||
|
<para>Make a directory for everybody to share:</para>
|
|||
|
|
|||
|
<screen>mkdir /mnt/wolf
|
|||
|
chmod 770 /mnt/wolf
|
|||
|
chown wolf:beowulf /mnt/wolf -R </screen>
|
|||
|
|
|||
|
<para>Go to the /etc directory, and add your “shared” directory to the
|
|||
|
exports file:</para>
|
|||
|
|
|||
|
<screen>cd /etc
|
|||
|
cat >> exports
|
|||
|
/mnt/wolf 192.168.0.100/192.168.0.255 (rw)
|
|||
|
<control d></screen>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>IP Addresses</title>
|
|||
|
|
|||
|
<para>My network is 192.168.0.nnn because it is one of the “private” IP
|
|||
|
ranges. Thomas Sterling talks about it on page 106 of his book. It is
|
|||
|
inside my firewall, and works just fine.</para>
|
|||
|
|
|||
|
<para>My head node, which I call “wolf00” is 192.168.0.100, and every
|
|||
|
other node is named “wolfnn”, with an ip of 192.168.0.100 + nn. I am
|
|||
|
following the sage advice of many of the web pages out there, and
|
|||
|
setting myself up for an easier task of scaling up my cluster.</para>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>Services</title>
|
|||
|
|
|||
|
<para>Make sure that services we want are up:</para>
|
|||
|
|
|||
|
<screen>chkconfig -add sshd
|
|||
|
chkconfig -add telnet
|
|||
|
chkconfig -add nfs
|
|||
|
chkconfig -add rexec
|
|||
|
chkconfig -add rlogin
|
|||
|
chkconfig -level 3 rsh on
|
|||
|
chkconfig -level 3 telnet on
|
|||
|
chkconfig -level 3 nfs on
|
|||
|
chkconfig -level 3 rexec on
|
|||
|
chkconfig -level 3 rlogin on</screen>
|
|||
|
|
|||
|
<para>Telnet? I added this just as a convenience. It is not needed but
|
|||
|
it is nice to have while debugging nfs. How are you going to log into a
|
|||
|
box if you can’t ssh to the box?</para>
|
|||
|
|
|||
|
<para>…And, during startup, I saw some services that I know I don’t
|
|||
|
want, and in my opinion, could be removed. You may add or remove others
|
|||
|
that suit your needs; just include the ones shown above.</para>
|
|||
|
|
|||
|
<screen>chkconfig -del atd
|
|||
|
chkconfig -del rsh
|
|||
|
chkconfig -del sendmail</screen>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>SSH</title>
|
|||
|
|
|||
|
<para>To be responsible, we make ssh work. While logged in as root, you
|
|||
|
must modify the /etc/ssh/sshd_config file. The lines:</para>
|
|||
|
|
|||
|
<screen>#RSAAuthentication yes
|
|||
|
#AuthorizedKeysFile .ssh/authorized_keys</screen>
|
|||
|
|
|||
|
<para>… are commented out, so uncomment them [remove the #].</para>
|
|||
|
|
|||
|
<para>Reboot, and log back in as wolf, because the operation of your
|
|||
|
cluster will always be done from the user "wolf". Also, the hosts file
|
|||
|
modifications done earlier must take effect. Logging out and back in
|
|||
|
will not do this. To be sure, reboot the box, and make sure your prompt
|
|||
|
shows hostname "wolf00". </para>
|
|||
|
|
|||
|
<para>To generate your public and private SSH keys, do this:</para>
|
|||
|
|
|||
|
<screen>ssh-keygen -b 1024 -f ~/.ssh/id_rsa -t rsa -N “” </screen>
|
|||
|
|
|||
|
<para>… and it will display a few messages, and tell you that it created
|
|||
|
the public / private key pair. You will see these files, id_rsa and
|
|||
|
id_rsa.pub, in the /home/wolf/.ssh directory.</para>
|
|||
|
|
|||
|
<para>Copy the id_rsa.pub file into a file called “authorized_keys”
|
|||
|
right there in the .ssh directory. We will be using this file later.
|
|||
|
Verify that the contents of this file show the hostname [the reason we
|
|||
|
rebooted the box]. Modify the security on the files, and the
|
|||
|
directory:</para>
|
|||
|
|
|||
|
<screen>chmod 644 ~/.ssh/auth*
|
|||
|
chmod 755 ~/.ssh </screen>
|
|||
|
|
|||
|
<para>According to the LAM user group, only the head node needs to log
|
|||
|
on to the slave nodes; not the other way around. Therefore when we copy
|
|||
|
the public key files, we only copy the head node’s key file to each
|
|||
|
slave node, and set up the agent on the head node. This is MUCH easier
|
|||
|
than copying all authorized_keys files to all nodes. I will describe
|
|||
|
this in more detail later.</para>
|
|||
|
|
|||
|
<para>Note: I only am documenting what the LAM distribution of the
|
|||
|
message passing interface requires; if you chose another MPI to build
|
|||
|
your cluster your requirements may differ.</para>
|
|||
|
|
|||
|
<para>At the end of /home/wolf/.bash_profile, add the following
|
|||
|
statements [again this is lam-specific; your requirements may
|
|||
|
vary]:</para>
|
|||
|
|
|||
|
<screen>export LAMRSH=’ssh -x’
|
|||
|
ssh-agent sh -c ‘ssh-add && bash’</screen>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>MPI</title>
|
|||
|
|
|||
|
<para>Lastly, put your message - passing interface on the box. As stated
|
|||
|
in 1.2 Requirements, I used lam. You can get lam from here:</para>
|
|||
|
|
|||
|
<para><ulink url=" http://www.lam-mpi.org/">
|
|||
|
http://www.lam-mpi.org/</ulink></para>
|
|||
|
|
|||
|
<para>...but you can use any other message passing interface or parallel
|
|||
|
virtual machine software you want. Again --- I am showing you what
|
|||
|
worked for me.</para>
|
|||
|
|
|||
|
<para>You can either build LAM from the supplied source, or use their
|
|||
|
precompiled RPM package. It is not in the scope of this document to
|
|||
|
describe that - I just got the source and followed the directions, and
|
|||
|
in another experiment I installed their rpm; both of them worked fine.
|
|||
|
Remember the whole reason we are doing this is to learn - go forth and
|
|||
|
learn.</para>
|
|||
|
</sect2>
|
|||
|
</sect1>
|
|||
|
|
|||
|
<sect1>
|
|||
|
<title>Set up slave nodes</title>
|
|||
|
|
|||
|
<para>Get your network cables out. Install Linux on the first non-head
|
|||
|
node.</para>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>Base linux install</title>
|
|||
|
|
|||
|
<para>Going with my example node names and IP addresses, this is what I
|
|||
|
chose during setup:</para>
|
|||
|
|
|||
|
<screen>Workstation
|
|||
|
auto partition
|
|||
|
remove all partitions on system
|
|||
|
use LILO as the boot loader
|
|||
|
put boot loader on the MBR
|
|||
|
host name wolf01
|
|||
|
ip address 192.168.0.101
|
|||
|
add the user “wolf”
|
|||
|
same password as on all other nodes
|
|||
|
NO firewall</screen>
|
|||
|
|
|||
|
<para>The ONLY package installed: network servers. UN select all other
|
|||
|
packages.</para>
|
|||
|
|
|||
|
<para>It doesn’t matter what else you choose; this is the minimum of
|
|||
|
what you need. Why fill the box up with non-essential software you will
|
|||
|
never use? My research has been concentrated on finding that minimal
|
|||
|
configuration to get up and running.</para>
|
|||
|
|
|||
|
<para>Here’s another very important point. When you move on to an
|
|||
|
automated install and config, you really will NEVER log in to the box.
|
|||
|
Only during setup and install do I type anything directly on the
|
|||
|
box.</para>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>Hardware</title>
|
|||
|
|
|||
|
<para>When the computer starts up, it will complain if it does not have
|
|||
|
a keyboard connected. I was not able to modify the BIOS, because I had
|
|||
|
older discarded boxes with no documentation, so I just connected a
|
|||
|
“fake” keyboard.</para>
|
|||
|
|
|||
|
<para>I am in the computer industry, and see hundreds of keyboards come
|
|||
|
and go, and some occasionally end up in the garbage. I get the old dead
|
|||
|
keyboard out of the garbage, remove JUST the cord with the tiny circuit
|
|||
|
board up there in the corner, where the num lock and caps lock lights
|
|||
|
are. Then I plug the cord in, and the computer thinks it has a complete
|
|||
|
keyboard without incident.</para>
|
|||
|
|
|||
|
<para>Again, you would be better off modifying your bios, if you are
|
|||
|
able to. This is just a trick to use in the case that you don’t have the
|
|||
|
bios program.</para>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>Post install commands</title>
|
|||
|
|
|||
|
<para>After your newly installed box reboots, log on as root again,
|
|||
|
and…</para>
|
|||
|
|
|||
|
<itemizedlist>
|
|||
|
<listitem>
|
|||
|
<para>do the same chkconfig commands stated above to set up the
|
|||
|
right services.</para>
|
|||
|
</listitem>
|
|||
|
</itemizedlist>
|
|||
|
|
|||
|
<itemizedlist>
|
|||
|
<listitem>
|
|||
|
<para>modify hosts; remove “wolfnn” from localhost, and just add
|
|||
|
wolfnn and wolf00.</para>
|
|||
|
</listitem>
|
|||
|
</itemizedlist>
|
|||
|
|
|||
|
<itemizedlist>
|
|||
|
<listitem>
|
|||
|
<para>install lam</para>
|
|||
|
</listitem>
|
|||
|
</itemizedlist>
|
|||
|
|
|||
|
<itemizedlist>
|
|||
|
<listitem>
|
|||
|
<para>create the /mnt/wolf directory and set up security for
|
|||
|
it.</para>
|
|||
|
</listitem>
|
|||
|
</itemizedlist>
|
|||
|
|
|||
|
<itemizedlist>
|
|||
|
<listitem>
|
|||
|
<para>do the ssh configuration</para>
|
|||
|
</listitem>
|
|||
|
</itemizedlist>
|
|||
|
|
|||
|
<para>Up to this point, we are pretty much the same as the head node. I
|
|||
|
do NOT do the modification of the exports file.</para>
|
|||
|
|
|||
|
<para>Also, do NOT add this line to the .bash_profile:</para>
|
|||
|
|
|||
|
<screen>sh -c ‘ssh-add && bash’</screen>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>SSH on slave nodes</title>
|
|||
|
|
|||
|
<para>Recall that on the head node, we created a file “authorized_keys”.
|
|||
|
Copy that file, created on your head node, to the ~/.ssh directory on
|
|||
|
the slave nodes. The HEAD node will log on the all the SLAVE
|
|||
|
nodes.</para>
|
|||
|
|
|||
|
<para>The requirement, as stated in the LAM user manual, is that there
|
|||
|
should be no interaction required when logging in from the head to any
|
|||
|
of the slaves. So, copying the public key from the head node into each
|
|||
|
slave node, in the file “authorized_keys”, tells each slave that “wolf
|
|||
|
user on wolf00 is allowed to log on here without any password; we know
|
|||
|
it is safe.”</para>
|
|||
|
|
|||
|
<para>However you may recall that the documentation states that the
|
|||
|
first time you log on, it will ask for confirmation. So only once, after
|
|||
|
doing the above configuration, go back to the head node, and type ssh
|
|||
|
wolfnn where “wolfnn” is the name of your newly configured slave node.
|
|||
|
It will ask you for confirmation, and you simply answer “yes” to it, and
|
|||
|
that will be the last time you will have to interact.</para>
|
|||
|
|
|||
|
<para>Prove it by logging off, and then ssh back to that node, and it
|
|||
|
should just immediately log you in, with no dialog whatsoever.</para>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>NFS settings on slave nodes</title>
|
|||
|
|
|||
|
<para>As root, enter these commands:</para>
|
|||
|
|
|||
|
<screen>cat >> /etc/fstab
|
|||
|
wolf00:/mnt/wolf /mnt/wolf nfs rw,hard,intr 0 0
|
|||
|
<control d> </screen>
|
|||
|
|
|||
|
<para>What we did here was automatically mount the exported directory we
|
|||
|
put in the /etc/exports file on the head node. More discussion regarding
|
|||
|
nfs later in this document.</para>
|
|||
|
</sect2>
|
|||
|
|
|||
|
<sect2>
|
|||
|
<title>Lilo modifcations on slave nodes</title>
|
|||
|
|
|||
|
<para>Then modify /etc/lilo.conf.</para>
|
|||
|
|
|||
|
<para>The 2nd line of this file says</para>
|
|||
|
|
|||
|
<screen>timeout=nn</screen>
|
|||
|
|
|||
|
<para>Modify that line to say:</para>
|
|||
|
|
|||
|
<screen>timeout=1200</screen>
|
|||
|
|
|||
|
<para>After it is modified, we invoke the changes. You type
|
|||
|
"/sbin/lilo", and it will display back "added linux *" to confirm that
|
|||
|
it took the changes you made to the lilo.conf file:</para>
|
|||
|
|
|||
|
<screen>/sbin/lilo
|
|||
|
Added linux * </screen>
|
|||
|
|
|||
|
<para>Why do I do this lilo modification? If you were researching
|
|||
|
Beowulf on the web, and understand everything I have done so far, you
|
|||
|
may wonder, “I don’t remember reading anything about lilo.conf.”</para>
|
|||
|
|
|||
|
<para>All my Beowulf nodes share a single power strip. I turn on the
|
|||
|
power strip, and every box on the cluster starts up immediately. As the
|
|||
|
startup procedure progresses, it mounts file systems. Seeing that the
|
|||
|
non-head nodes mount the shared directory from the head node, they all
|
|||
|
will have to wait a little bit until the head node is up, with NFS ready
|
|||
|
to go. So I make each slave node wait 2 minutes in the lilo step.
|
|||
|
Meanwhile, the head node comes up, and making the shared directory
|
|||
|
available. By then, the slave nodes finally start booting up because
|
|||
|
lilo has waited 2 minutes.</para>
|
|||
|
</sect2>
|
|||
|
</sect1>
|
|||
|
|
|||
|
<sect1>
|
|||
|
<title>Verification</title>
|
|||
|
|
|||
|
<para>All done! You are almost ready to start wolfing.</para>
|
|||
|
|
|||
|
<para>Reboot your boxes. Did they all come up? Can you ping the head node
|
|||
|
from each box? Can you ping each node from the head node? Can you telnet?
|
|||
|
Can you ssh? Don’t worry about doing ssh as root; only as wolf.</para>
|
|||
|
|
|||
|
<para>If you are logged in as wolf, and ssh to a box, does it go
|
|||
|
automatically, without prompting for password?</para>
|
|||
|
|
|||
|
<para>After the node boots up, log in as wolf, and say “mount”. Does it
|
|||
|
show wolf00:/mnt/wolf mounted? On the head node, copy a file into
|
|||
|
/mnt/wolf. Can you read and write that file from the node box?</para>
|
|||
|
|
|||
|
<para>This is really not required; it is merely convenient to have a
|
|||
|
common directory reside on the head node. With a common shared directory,
|
|||
|
you can easily use scp to copy files between boxes. Sterling states in his
|
|||
|
book, on page 119, a single NFS server causes a serious obstacle to
|
|||
|
scaling up to large numbers of nodes. I learned this when I went from a
|
|||
|
small number of boxes up to a large number.</para>
|
|||
|
</sect1>
|
|||
|
|
|||
|
<sect1>
|
|||
|
<title>Run a program</title>
|
|||
|
|
|||
|
<para>Once you can do all the tests shown above, you should be able to run
|
|||
|
a program. From here on in, the instructions are lam specific.</para>
|
|||
|
|
|||
|
<para>Go back to the head node, log in as wolf, and enter the following
|
|||
|
commands:</para>
|
|||
|
|
|||
|
<screen>cat > /nnt/wolf/lamhosts
|
|||
|
wolf01
|
|||
|
wolf02
|
|||
|
wolf03
|
|||
|
wolf04
|
|||
|
<control d></screen>
|
|||
|
|
|||
|
<para>Go to the lam examples directory, and compile “hello.c”:</para>
|
|||
|
|
|||
|
<screen>mpicc -o hello hello.c
|
|||
|
cp hello /mnt/wolf </screen>
|
|||
|
|
|||
|
<para>Then, as shown in the lam documentation, start up lam:</para>
|
|||
|
|
|||
|
<screen>[wolf@wolf00 wolf]$ lamboot -v lamhosts
|
|||
|
LAM 7.0/MPI 2 C++/ROMIO - Indiana University
|
|||
|
n0<2572> ssi:boot:base:linear: booting n0 (wolf00)
|
|||
|
n0<2572> ssi:boot:base:linear: booting n1 (wolf01)
|
|||
|
n0<2572> ssi:boot:base:linear: booting n2 (wolf02)
|
|||
|
n0<2572> ssi:boot:base:linear: booting n3 (wolf04)
|
|||
|
n0<2572> ssi:boot:base:linear: finished</screen>
|
|||
|
|
|||
|
<para>So we are now finally ready to run an app. [Remember, I am using
|
|||
|
lam; your message passing interface may have different syntax].</para>
|
|||
|
|
|||
|
<screen>[wolf@wolf00 wolf]$ mpirun n0-3 /mnt/wolf/hello
|
|||
|
Hello, world! I am 0 of 4
|
|||
|
Hello, world! I am 3 of 4
|
|||
|
Hello, world! I am 2 of 4
|
|||
|
Hello, world! I am 1 of 4
|
|||
|
[wolf@wolf00 wolf]$</screen>
|
|||
|
|
|||
|
<para>Recall I mentioned the use of NFS above. I am telling the nodes to
|
|||
|
all use the nfs shared directory, which will bottleneck when using a
|
|||
|
larger number of boxes. You could easily copy the executable to each box,
|
|||
|
and in the mpirun command, specify node local directories: mpirun n0-3
|
|||
|
/home/wolf/hello. The prerequisite for this is to have all the files
|
|||
|
available locally. In fact I have done this, and it worked better than
|
|||
|
using the nfs shared executable. Of course this theory breaks down if my
|
|||
|
cluster application needs to modify a file shared across the
|
|||
|
cluster.</para>
|
|||
|
</sect1>
|
|||
|
</article>
|