old-www/HOWTO/text/Parallel-Processing-HOWTO

  Linux Parallel Processing HOWTO
  Hank Dietz, hankd@engr.uky.edu
  v2.0, 2004-06-28

  Although this HOWTO has been "republished" (v2.0, 2004-06-28) to
  update the author contact info, it has many broken links and some
  information is seriously out of date. Rather than just repairing
  links, this document is being heavily rewritten as a Guide which we
  expect to release in July 2004. At that time, the HOWTO will be
  obsolete.  The prefered home URL for both the old and new documents is
  <http://aggregate.org/LDP/>

  ______________________________________________________________________

  Table of Contents


  1. Introduction

     1.1 Is Parallel Processing What I Want?
     1.2 Terminology
     1.3 Example Algorithm
     1.4 Organization Of This Document

  2. SMP Linux

     2.1 SMP Hardware
        2.1.1 Does each processor have its own L2 cache?
        2.1.2 Bus configuration?
        2.1.3 Memory interleaving and DRAM technologies?
     2.2 Introduction To Shared Memory Programming
        2.2.1 Shared Everything Vs. Shared Something
           2.2.1.1 Shared Everything
           2.2.1.2 Shared Something
        2.2.2 Atomicity And Ordering
        2.2.3 Volatility
        2.2.4 Locks
        2.2.5 Cache Line Size
        2.2.6 Linux Scheduler Issues
     2.3 bb_threads
     2.4 LinuxThreads
     2.5 System V Shared Memory
     2.6 Memory Map Call

  3. Clusters Of Linux Systems

     3.1 Why A Cluster?
     3.2 Network Hardware
        3.2.1 ArcNet
        3.2.2 ATM
        3.2.3 CAPERS
        3.2.4 Ethernet
        3.2.5 Ethernet (Fast Ethernet)
        3.2.6 Ethernet (Gigabit Ethernet)
        3.2.7 FC (Fibre Channel)
        3.2.8 FireWire (IEEE 1394)
        3.2.9 HiPPI And Serial HiPPI
        3.2.10 IrDA (Infrared Data Association)
        3.2.11 Myrinet
        3.2.12 Parastation
        3.2.13 PLIP
        3.2.14 SCI
        3.2.15 SCSI
        3.2.16 ServerNet
        3.2.17 SHRIMP
        3.2.18 SLIP
        3.2.19 TTL_PAPERS
        3.2.20 USB (Universal Serial Bus)
        3.2.21 WAPERS
     3.3 Network Software Interface
        3.3.1 Sockets
           3.3.1.1 UDP Protocol (SOCK_DGRAM)
           3.3.1.2 TCP Protocol (SOCK_STREAM)
        3.3.2 Device Drivers
        3.3.3 User-Level Libraries
     3.4 PVM (Parallel Virtual Machine)
     3.5 MPI (Message Passing Interface)
     3.6 AFAPI (Aggregate Function API)
     3.7 Other Cluster Support Libraries
        3.7.1 Condor (process migration support)
        3.7.2 DFN-RPC (German Research Network - Remote Procedure Call)
        3.7.3 DQS (Distributed Queueing System)
     3.8 General Cluster References
        3.8.1 Beowulf
        3.8.2 Linux/AP+
        3.8.3 Locust
        3.8.4 Midway DSM (Distributed Shared Memory)
        3.8.5 Mosix
        3.8.6 NOW (Network Of Workstations)
        3.8.7 Parallel Processing Using Linux
        3.8.8 Pentium Pro Cluster Workshop
        3.8.9 TreadMarks DSM (Distributed Shared Memory)
        3.8.10 U-Net (User-level NETwork interface architecture)
        3.8.11 WWT (Wisconsin Wind Tunnel)

  4. SIMD Within A Register (e.g., using MMX)

     4.1 SWAR: What Is It Good For?
     4.2 Introduction To SWAR Programming
        4.2.1 Polymorphic Operations
        4.2.2 Partitioned Operations
           4.2.2.1 Partitioned Instructions
           4.2.2.2 Unpartitioned Operations With Correction Code
           4.2.2.3 Controlling Field Values
        4.2.3 Communication & Type Conversion Operations
        4.2.4 Recurrence Operations (Reductions, Scans, etc.)
     4.3 MMX SWAR Under Linux

  5. Linux-Hosted Attached Processors

     5.1 A Linux PC Is A Good Host
     5.2 Did You DSP That?
     5.3 FPGAs And Reconfigurable Logic Computing

  6. Of General Interest

     6.1 Programming Languages And Compilers
        6.1.1 Fortran 66/77/PCF/90/HPF/95
        6.1.2 GLU (Granular Lucid)
        6.1.3 Jade And SAM
        6.1.4 Mentat And Legion
        6.1.5 MPL (MasPar Programming Language)
        6.1.6 PAMS (Parallel Application Management System)
        6.1.7 Parallaxis-III
        6.1.8 pC++/Sage++
        6.1.9 SR (Synchronizing Resources)
        6.1.10 ZPL And IronMan
     6.2 Performance Issues
     6.3 Conclusion - It's Out There


  ______________________________________________________________________

  1.  Introduction


  Parallel Processing refers to the concept of speeding-up the execution
  of a program by dividing the program into multiple fragments that can
  execute simultaneously, each on its own processor.  A program being
  executed across n processors might execute n times faster than it
  would using a single processor.


  Traditionally, multiple processors were provided within a specially
  designed "parallel computer"; along these lines, Linux now supports
  SMP systems (often sold as "servers") in which multiple processors
  share a single memory and bus interface within a single computer.  It
  is also possible for a group of computers (for example, a group of PCs
  each running Linux) to be interconnected by a network to form a
  parallel-processing cluster.  The third alternative for parallel
  computing using Linux is to use the multimedia instruction extensions
  (i.e., MMX) to operate in parallel on vectors of integer data.
  Finally, it is also possible to use a Linux system as a "host" for a
  specialized attached parallel processing compute engine.  All these
  approaches are discussed in detail in this document.


  1.1.  Is Parallel Processing What I Want?


  Although use of multiple processors can speed-up many operations, most
  applications cannot yet benefit from parallel processing.  Basically,
  parallel processing is appropriate only if:


  <20>  Your application has enough parallelism to make good use of
     multiple processors.  In part, this is a matter of identifying
     portions of the program that can execute independently and
     simultaneously on separate processors, but you will also find that
     some things that could execute in parallel might actually slow
     execution if executed in parallel using a particular system.  For
     example, a program that takes four seconds to execute within a
     single machine might be able to execute in only one second of
     processor time on each of four machines, but no speedup would be
     achieved if it took three seconds or more for these machines to
     coordinate their actions.

  <20>  Either the particular application program you are interested in
     already has been parallelized (rewritten to take advantage of
     parallel processing) or you are willing to do at least some new
     coding to take advantage of parallel processing.

  <20>  You are interested in researching, or at least becoming familiar
     with, issues involving parallel processing.  Parallel processing
     using Linux systems isn't necessarily difficult, but it is not
     familiar to most computer users, and there isn't any book called
     "Parallel Processing for Dummies"...  at least not yet.  This HOWTO
     is a good starting point, not all you need to know.


  The good news is that if all the above are true, you'll find that
  parallel processing using Linux can yield supercomputer performance
  for some programs that perform complex computations or operate on
  large data sets.  What's more, it can do that using cheap hardware...
  which you might already own.  As an added bonus, it is also easy to
  use a parallel Linux system for other things when it is not busy
  executing a parallel job.

  If parallel processing is not what you want, but you would like to
  achieve at least a modest improvement in performance, there are still
  things you can do.  For example, you can improve performance of
  sequential programs by moving to a faster processor, adding memory,
  replacing an IDE disk with fast wide SCSI, etc.  If that's all you are
  interested in, jump to section 6.2; otherwise, read on.


  1.2.  Terminology


  Although parallel processing has been used for many years in many
  systems, it is still somewhat unfamiliar to most computer users.
  Thus, before discussing the various alternatives, it is important to
  become familiar with a few commonly used terms.

     SIMD:
        SIMD (Single Instruction stream, Multiple Data stream) refers to
        a parallel execution model in which all processors execute the
        same operation at the same time, but each processor is allowed
        to operate upon its own data.  This model naturally fits the
        concept of performing the same operation on every element of an
        array, and is thus often associated with vector or array
        manipulation.  Because all operations are inherently
        synchronized, interactions among SIMD processors tend to be
        easily and efficiently implemented.


     MIMD:
        MIMD (Multiple Instruction stream, Multiple Data stream) refers
        to a parallel execution model in which each processor is
        essentially acting independently.  This model most naturally
        fits the concept of decomposing a program for parallel execution
        on a functional basis; for example, one processor might update a
        database file while another processor generates a graphic
        display of the new entry.  This is a more flexible model than
        SIMD execution, but it is achieved at the risk of debugging
        nightmares called race conditions, in which a program may
        intermittently fail due to timing variations reordering the
        operations of one processor relative to those of another.


     SPMD:
        SPMD (Single Program, Multiple Data) is a restricted version of
        MIMD in which all processors are running the same program.
        Unlike SIMD, each processor executing SPMD code may take a
        different control flow path through the program.


     Communication Bandwidth:
        The bandwidth of a communication system is the maximum amount of
        data that can be transmitted in a unit of time...  once data
        transmission has begun.  Bandwidth for serial connections is
        often measured in baud or bits/second (b/s), which generally
        correspond to 1/10 to 1/8 that many Bytes/second (B/s).  For
        example, a 1,200 baud modem transfers about 120 B/s, whereas a
        155 Mb/s ATM network connection is nearly 130,000 times faster,
        transferring about about 17 MB/s.  High bandwidth allows large
        blocks of data to be transferred efficiently between processors.


     Communication Latency:
        The latency of a communication system is the minimum time taken
        to transmit one object, including any send and receive software
        overhead.  Latency is very important in parallel processing
        because it determines the minimum useful grain size, the minimum
        run time for a segment of code to yield speed-up through
        parallel execution.  Basically, if a segment of code runs for
        less time than it takes to transmit its result value (i.e.,
        latency), executing that code segment serially on the processor
        that needed the result value would be faster than parallel
        execution; serial execution would avoid the communication
        overhead.


     Message Passing:
        Message passing is a model for interactions between processors
        within a parallel system.  In general, a message is constructed
        by software on one processor and is sent through an
        interconnection network to another processor, which then must
        accept and act upon the message contents.  Although the overhead
        in handling each message (latency) may be high, there are
        typically few restrictions on how much information each message
        may contain.  Thus, message passing can yield high bandwidth
        making it a very effective way to transmit a large block of data
        from one processor to another.  However, to minimize the need
        for expensive message passing operations, data structures within
        a parallel program must be spread across the processors so that
        most data referenced by each processor is in its local memory...
        this task is known as data layout.


     Shared Memory:
        Shared memory is a model for interactions between processors
        within a parallel system.  Systems like the multi-processor
        Pentium machines running Linux physically share a single memory
        among their processors, so that a value written to shared memory
        by one processor can be directly accessed by any processor.
        Alternatively, logically shared memory can be implemented for
        systems in which each processor has it own memory by converting
        each non-local memory reference into an appropriate inter-
        processor communication.  Either implementation of shared memory
        is generally considered easier to use than message passing.
        Physically shared memory can have both high bandwidth and low
        latency, but only when multiple processors do not try to access
        the bus simultaneously; thus, data layout still can seriously
        impact performance, and cache effects, etc., can make it
        difficult to determine what the best layout is.


     Aggregate Functions:
        In both the message passing and shared memory models, a
        communication is initiated by a single processor; in contrast,
        aggregate function communication is an inherently parallel
        communication model in which an entire group of processors act
        together.  The simplest such action is a barrier
        synchronization, in which each individual processor waits until
        every processor in the group has arrived at the barrier.  By
        having each processor output a datum as a side-effect of
        reaching a barrier, it is possible to have the communication
        hardware return a value to each processor which is an arbitrary
        function of the values collected from all processors.  For
        example, the return value might be the answer to the question
        "did any processor find a solution?"  or it might be the sum of
        one value from each processor.  Latency can be very low, but
        bandwidth per processor also tends to be low.  Traditionally,
        this model is used primarily to control parallel execution
        rather than to distribute data values.


     Collective Communication:
        This is another name for aggregate functions, most often used
        when referring to aggregate functions that are constructed using
        multiple message-passing operations.


     SMP:
        SMP (Symmetric Multi-Processor) refers to the operating system
        concept of a group of processors working together as peers, so
        that any piece of work could be done equally well by any
        processor.  Typically, SMP implies the combination of MIMD and
        shared memory.  In the IA32 world, SMP generally means compliant
        with MPS (the Intel MultiProcessor Specification); in the
        future, it may mean "Slot 2"....


     SWAR:
        SWAR (SIMD Within A Register) is a generic term for the concept
        of partitioning a register into multiple integer fields and
        using register-width operations to perform SIMD-parallel
        computations across those fields.  Given a machine with k-bit
        registers, data paths, and function units, it has long been
        known that ordinary register operations can function as SIMD
        parallel operations on as many as n, k/n-bit, field values.
        Although this type of parallelism can be implemented using
        ordinary integer registers and instructions, many high-end
        microprocessors have recently added specialized instructions to
        enhance the performance of this technique for multimedia-
        oriented tasks.  In addition to the Intel/AMD/Cyrix MMX
        (MultiMedia eXtensions), there are: Digital Alpha MAX
        (MultimediA eXtensions), Hewlett-Packard PA-RISC MAX (Multimedia
        Acceleration eXtensions), MIPS MDMX (Digital Media eXtension,
        pronounced "Mad Max"), and Sun SPARC V9 VIS (Visual Instruction
        Set).  Aside from the three vendors who have agreed on MMX, all
        of these instruction set extensions are roughly comparable, but
        mutually incompatible.


     Attached Processors:
        Attached processors are essentially special-purpose computers
        that are connected to a host system to accelerate specific types
        of computation.  For example, many video and audio cards for PCs
        contain attached processors designed, respectively, to
        accelerate common graphics operations and audio DSP (Digital
        Signal Processing).  There is also a wide range of attached
        array processors, so called because they are designed to
        accelerate arithmetic operations on arrays.  In fact, many
        commercial supercomputers are really attached processors with
        workstation hosts.


     RAID:
        RAID (Redundant Array of Inexpensive Disks) is a simple
        technology for increasing both the bandwidth and reliability of
        disk I/O.  Although there are many different variations, all
        have two key concepts in common.  First, each data block is
        striped across a group of n+k disk drives such that each drive
        only has to read or write 1/n of the data...  yielding n times
        the bandwidth of one drive.  Second, redundant data is written
        so that data can be recovered if a disk drive fails; this is
        important because otherwise if any one of the n+k drives were to
        fail, the entire file system could be lost.  A good overview of
        RAID in general is given at  <http://www.uni-
        mainz.de/~neuffer/scsi/what_is_raid.html>, and information about
        RAID options for Linux systems is at
        <http://linas.org/linux/raid.html>.  Aside from specialized RAID
        hardware support, Linux also supports software RAID 0, 1, 4, and
        5 across multiple disks hosted by a single Linux system; see the
        Software RAID mini-HOWTO and the Multi-Disk System Tuning mini-
        HOWTO for details.  RAID across disk drives on multiple machines
        in a cluster is not directly supported.


     IA32:
        IA32 (Intel Architecture, 32-bit) really has nothing to do with
        parallel processing, but rather refers to the class of
        processors whose instruction sets are generally compatible with
        that of the Intel 386.  Basically, any Intel x86 processor after
        the 286 is compatible with the 32-bit flat memory model that
        characterizes IA32.  AMD and Cyrix also make a multitude of
        IA32-compatible processors.  Because Linux evolved primarily on
        IA32 processors and that is where the commodity market is
        centered, it is convenient to use IA32 to distinguish any of
        these processors from the PowerPC, Alpha, PA-RISC, MIPS, SPARC,
        etc.  The upcoming IA64 (64-bit with EPIC, Explicitly Parallel
        Instruction Computing) will certainly complicate matters, but
        Merced, the first IA64 processor, is not scheduled for
        production until 1999.


     COTS:
        Since the demise of many parallel supercomputer companies, COTS
        (Commercial Off-The-Shelf) is commonly discussed as a
        requirement for parallel computing systems.  Being fanatically
        pure, the only COTS parallel processing techniques using PCs are
        things like SMP Windows NT servers and various MMX Windows
        applications; it really doesn't pay to be that fanatical.  The
        underlying concept of COTS is really minimization of development
        time and cost.  Thus, a more useful, more common, meaning of
        COTS is that at least most subsystems benefit from commodity
        marketing, but other technologies are used where they are
        effective.  Most often, COTS parallel processing refers to a
        cluster in which the nodes are commodity PCs, but the network
        interface and software are somewhat customized...  typically
        running Linux and applications codes that are freely available
        (e.g., copyleft or public domain), but not literally COTS.

  1.3.  Example Algorithm


  In order to better understand the use of the various parallel
  programming approaches outlined in this HOWTO, it is useful to have an
  example problem.  Although just about any simple parallel algorithm
  would do, by selecting an algorithm that has been used to demonstrate
  various other parallel programming systems, it becomes a bit easier to
  compare and contrast approaches.  M. J.  Quinn's book, Parallel
  Computing Theory And Practice, second edition, McGraw Hill, New York,
  1994, uses a parallel algorithm that computes the value of Pi to
  demonstrate a variety of different parallel supercomputer programming
  environments (e.g., nCUBE message passing, Sequent shared memory).  In
  this HOWTO, we use the same basic algorithm.

  The algorithm computes the approximate value of Pi by summing the area
  under x squared.  As a purely sequential C program, the algorithm
  looks like:


  ______________________________________________________________________
  #include <stdlib.h>;
  #include <stdio.h>;

  main(int argc, char **argv)
  {
    register double width, sum;
    register int intervals, i;

    /* get the number of intervals */
    intervals = atoi(argv[1]);
    width = 1.0 / intervals;

    /* do the computation */
    sum = 0;
    for (i=0; i<intervals; ++i) {
      register double x = (i + 0.5) * width;
      sum += 4.0 / (1.0 + x * x);
    }
    sum *= width;

    printf("Estimation of pi is %f\n", sum);

    return(0);
  }
  ______________________________________________________________________


  However, this sequential algorithm easily yields an "embarrassingly
  parallel" implementation.  The area is subdivided into intervals, and
  any number of processors can each independently sum the intervals
  assigned to it, with no need for interaction between processors.  Once
  the local sums have been computed, they are added together to create a
  global sum; this step requires some level of coordination and
  communication between processors.  Finally, this global sum is printed
  by one processor as the approximate value of Pi.

  In this HOWTO, the various parallel implementations of this algorithm
  appear where each of the different programming methods is discussed.


  1.4.  Organization Of This Document


  The remainder of this document is divided into five parts.  Sections
  2, 3, 4, and 5 correspond to the three different types of hardware
  configurations supporting parallel processing using Linux:


  <20>  Section 2 discusses SMP Linux systems.  These directly support MIMD
     execution using shared memory, although message passing also is
     implemented easily.  Although Linux supports SMP configurations up
     to 16 processors, most SMP PC systems have either two or four
     identical processors.

  <20>  Section 3 discusses clusters of networked machines, each running
     Linux.  A cluster can be used as a parallel processing system that
     directly supports MIMD execution and message passing, perhaps also
     providing logically shared memory.  Simulated SIMD execution and
     aggregate function communication also can be supported, depending
     on the networking method used.  The number of processors in a
     cluster can range from two to thousands, primarily limited by the
     physical wiring constraints of the network.  In some cases, various
     types of machines can be mixed within a cluster; for example, a
     network combining DEC Alpha and Pentium Linux systems would be a
     heterogeneous cluster.

  <20>  Section 4 discusses SWAR, SIMD Within A Register.  This is a very
     restrictive type of parallel execution model, but on the other
     hand, it is a built-in capability of ordinary processors.
     Recently, MMX (and other) instruction set extensions to modern
     processors have made this approach even more effective.

  <20>  Section 5 discusses the use of Linux PCs as hosts for simple
     parallel computing systems.  Either as an add-in card or as an
     external box, attached processors can provide a Linux system with
     formidable processing power for specific types of applications.
     For example, inexpensive ISA cards are available that provide
     multiple DSP processors offering hundreds of MFLOPS for compute-
     bound problems.  However, these add-in boards are just processors;
     they generally do not run an OS, have disk or console I/O
     capability, etc.  To make such systems useful, the Linux "host"
     must provide these functions.


  The final section of this document covers aspects that are of general
  interest for parallel processing using Linux, not specific to a
  particular one of the approaches listed above.

  As you read this document, keep in mind that we haven't tested
  everything, and a lot of stuff reported here "still has a research
  character" (a nice way to say "doesn't quite work like it should" ;-).
  However, parallel processing using Linux is useful now, and an
  increasingly large group is working to make it better.

  The author of this HOWTO is Hank Dietz, Ph.D., currently Professor &
  James F. Hardymon Chair in Networking at the University of Kentucky,
  Electrical & Computer Engineering Dept in Lexington, KY, 40506-0046.
  Dietz retains rights to this document as per the Linux Documentation
  Project guidelines.  Although an effort has been made to ensure the
  correctness and fairness of this presentation, neither Dietz nor
  University of Kentucky can be held responsible for any problems or
  errors, and University of Kentucky does not endorse any of the
  work/products discussed.


  2.  SMP Linux


  This document gives a brief overview of how to use SMP Linux
  <http://www.linux.org.uk/SMP/title.html> systems for parallel
  processing.  The most up-to-date information on SMP Linux is probably
  available via the SMP Linux project mailing list; send email to
  majordomo@vger.rutgers.edu with the text subscribe linux-smp to join
  the list.

  Does SMP Linux really work?  In June 1996, I purchased a brand new
  (well, new off-brand ;-) two-processor 100MHz Pentium system.  The
  fully assembled system, including both processors, Asus motherboard,
  256K cache, 32M RAM, 1.6G disk, 6X CDROM, Stealth 64, and 15" Acer
  monitor, cost a total of $1,800.  This was just a few hundred dollars
  more than a comparable uniprocessor system.  Getting SMP Linux running
  was simply a matter of installing the "stock" uniprocessor Linux,
  recompiling the kernel with the SMP=1 line in the makefile uncommented
  (although I find setting SMP to 1 a bit ironic ;-), and informing lilo
  about the new kernel.  This system performs well enough, and has been
  stable enough, to serve as my primary workstation ever since.  In
  summary, SMP Linux really does work.

  The next question is how much high-level support is available for
  writing and executing shared memory parallel programs under SMP Linux.
  Through early 1996, there wasn't much.  Things have changed.  For
  example, there is now a very complete POSIX threads library.

  Although performance may be lower than for native shared-memory
  mechanisms, an SMP Linux system also can use most parallel processing
  software that was originally developed for a workstation cluster using
  socket communication.  Sockets (see section 3.3) work within an SMP
  Linux system, and even for multiple SMPs networked as a cluster.
  However, sockets imply a lot of unnecessary overhead for an SMP.  Much
  of that overhead is within the kernel or interrupt handlers; this
  worsens the problem because SMP Linux generally allows only one
  processor to be in the kernel at a time and the interrupt controller
  is set so that only the boot processor can process interrupts.
  Despite this, typical SMP communication hardware is so much better
  than most cluster networks that cluster software will often run better
  on an SMP than on the cluster for which it was designed.

  The remainder of this section discusses SMP hardware, reviews the
  basic Linux mechanisms for sharing memory across the processes of a
  parallel program, makes a few observations about atomicity,
  volatility, locks, and cache lines, and finally gives some pointers to
  other shared memory parallel processing resources.


  2.1.  SMP Hardware


  Although SMP systems have been around for many years, until very
  recently, each such machine tended to implement basic functions
  differently enough so that operating system support was not portable.
  The thing that has changed this situation is Intel's Multiprocessor
  Specification, often referred to as simply MPS.  The MPS 1.4
  specification is currently available as a PDF file at
  <http://www.intel.com/design/pro/datashts/242016.htm>, and there is a
  brief overview of MPS 1.1 at
  <http://support.intel.com/oem_developer/ial/support/9300.HTM>, but be
  aware that Intel does re-arrange their WWW site often.  A wide range
  of vendors <http://www.uruk.org/~erich/mps-hw.html> are building MPS-
  compliant systems supporting up to four processors, but MPS
  theoretically allows many more processors.

  The only non-MPS, non-IA32, systems supported by SMP Linux are Sun4m
  multiprocessor SPARC machines.  SMP Linux supports most Intel MPS
  version 1.1 or 1.4 compliant machines with up to sixteen 486DX,
  Pentium, Pentium MMX, Pentium Pro, or Pentium II processors.
  Unsupported IA32 processors include the Intel 386, Intel 486SX/SLC
  processors (the lack of floating point hardware interferes with the
  SMP mechanisms), and AMD & Cyrix processors (they require different
  SMP support chips that do not seem to be available at this writing).

  It is important to understand that the performance of MPS-compliant
  systems can vary widely.  As expected, one cause for performance
  differences is processor speed:  faster clock speeds tend to yield
  faster systems, and a Pentium Pro processor is faster than a Pentium.
  However, MPS does not really specify how hardware implements shared
  memory, but only how that implementation must function from a software
  point of view; this means that performance is also a function of how
  the shared memory implementation interacts with the characteristics of
  SMP Linux and your particular programs.

  The primary way in which systems that comply with MPS differ is in how
  they implement access to physically shared memory.


  2.1.1.  Does each processor have its own L2 cache?


  Some MPS Pentium systems, and all MPS Pentium Pro and Pentium II
  systems, have independent L2 caches.  (The L2 cache is packaged within
  the Pentium Pro or Pentium II modules.)  Separate L2 caches are
  generally viewed as maximizing compute performance, but things are not
  quite so obvious under Linux.  The primary complication is that the
  current SMP Linux scheduler does not attempt to keep each process on
  the same processor, a concept known as processor affinity.  This may
  change soon; there has recently been some discussion about this in the
  SMP Linux development community under the title "processor binding."
  Without processor affinity, having separate L2 caches may introduce
  significant overhead when a process is given a timeslice on a
  processor other than the one that was executing it last.

  Many relatively inexpensive systems are organized so that two Pentium
  processors share a single L2 cache.  The bad news is that this causes
  contention for the cache, seriously degrading performance when running
  multiple independent sequential programs.  The good news is that many
  parallel programs might actually benefit from the shared cache because
  if both processors will want to access the same line from shared
  memory, only one had to fetch it into cache and contention for the bus
  is averted.  The lack of processor affinity also causes less damage
  with a shared L2 cache.  Thus, for parallel programs, it isn't really
  clear that sharing L2 cache is as harmful as one might expect.

  Experience with our dual Pentium shared 256K cache system shows quite
  a wide range of performance depending on the level of kernel activity
  required.  At worst, we see only about 1.2x speedup.  However, we also
  have seen up to 2.1x speedup, which suggests that compute-intensive
  SPMD-style code really does profit from the "shared fetch" effect.


  2.1.2.  Bus configuration?


  The first thing to say is that most modern systems connect the
  processors to one or more PCI buses that in turn are "bridged" to one
  or more ISA/EISA buses.  These bridges add latency, and both EISA and
  ISA generally offer lower bandwidth than PCI (ISA being the lowest),
  so disk drives, video cards, and other high-performance devices
  generally should be connected via a PCI bus interface.

  Although an MPS system can achieve good speed-up for many compute-
  intensive parallel programs even if there is only one PCI bus, I/O
  operations occur at no better than uniprocessor performance...  and
  probably a little worse due to bus contention from the processors.
  Thus, if you are looking to speed-up I/O, make sure that you get an
  MPS system with multiple independent PCI busses and I/O controllers
  (e.g., multiple SCSI chains).  You will need to be careful to make
  sure SMP Linux supports what you get.  Also keep in mind that the
  current SMP Linux essentially allows only one processor in the kernel
  at any time, so you should choose your I/O controllers carefully to
  pick ones that minimize the kernel time required for each I/O
  operation.  For really high performance, you might even consider doing
  raw device I/O directly from user processes, without a system call...
  this isn't necessarily as hard as it sounds, and need not compromise
  security (see section 3.3 for a description of the basic techniques).

  It is important to note that the relationship between bus speed and
  processor clock rate has become very fuzzy over the past few years.
  Although most systems now use the same PCI clock rate, it is not
  uncommon to find a faster processor clock paired with a slower bus
  clock.  The classic example of this was that the Pentium 133 generally
  used a faster bus than a Pentium 150, with appropriately strange-
  looking performance on various benchmarks.  These effects are
  amplified in SMP systems; it is even more important to have a faster
  bus clock.


  2.1.3.  Memory interleaving and DRAM technologies?


  Memory interleaving actually has nothing whatsoever to do with MPS,
  but you will often see it mentioned for MPS systems because these
  systems are typically more demanding of memory bandwidth.  Basically,
  two-way or four-way interleaving organizes RAM so that a block access
  is accomplished using multiple banks of RAM rather than just one.
  This provides higher memory access bandwidth, particularly for cache
  line loads and stores.

  The waters are a bit muddied about this, however, because EDO DRAM and
  various other memory technologies tend to improve similar kinds of
  operations.  An excellent overview of DRAM technologies is given in
  <http://www.pcguide.com/ref/ram/tech.htm>.

  So, for example, is it better to have 2-way interleaved EDO DRAM or
  non-interleaved SDRAM?  That is a very good question with no simple
  answer, because both interleaving and exotic DRAM technologies tend to
  be expensive.  The same dollar investment in more ordinary memory
  configurations generally will give you a significantly larger main
  memory.  Even the slowest DRAM is still a heck of a lot faster than
  using disk-based virtual memory....


  2.2.  Introduction To Shared Memory Programming


  Ok, so you have decided that parallel processing on an SMP is a great
  thing to do...  how do you get started?  Well, the first step is to
  learn a little bit about how shared memory communication really works.

  It sounds like you simply have one processor store a value into memory
  and another processor load it; unfortunately, it isn't quite that
  simple.  For example, the relationship between processes and
  processors is very blurry; however, if we have no more active
  processes than there are processors, the terms are roughly
  interchangeable.  The remainder of this section briefly summarizes the
  key issues that could cause serious problems, if you were not aware of
  them:  the two different models used to determine what is shared,
  atomicity issues, the concept of volatility, hardware lock
  instructions, cache line effects, and Linux scheduler issues.


  2.2.1.  Shared Everything Vs. Shared Something


  There are two fundamentally different models commonly used for shared
  memory programming:  shared everything and shared something.  Both of
  these models allow processors to communicate by loads and stores
  from/into shared memory; the distinction comes in the fact that shared
  everything places all data structures in shared memory, while shared
  something requires the user to explicitly indicate which data
  structures are potentially shared and which are private to a single
  processor.

  Which shared memory model should you use?  That is mostly a question
  of religion.  A lot of people like the shared everything model because
  they do not really need to identify which data structures should be
  shared at the time they are declared...  you simply put locks around
  potentially-conflicting accesses to shared objects to ensure that only
  one process(or) has access at any moment.  Then again, that really
  isn't all that simple...  so many people prefer the relative safety of
  shared something.


  2.2.1.1.  Shared Everything


  The nice thing about sharing everything is that you can easily take an
  existing sequential program and incrementally convert it into a shared
  everything parallel program.  You do not have to first determine which
  data need to be accessible by other processors.

  Put simply, the primary problem with sharing everything is that any
  action taken by one processor could affect the other processors.  This
  problem surfaces in two ways:


  <20>  Many libraries use data structures that simply are not sharable.
     For example, the UNIX convention is that most functions can return
     an error code in a variable called errno; if two shared everything
     processes perform various calls, they would interfere with each
     other because they share the same errno.  Although there is now a
     library version that fixes the errno problem, similar problems
     still exist in most libraries.  For example, unless special
     precautions are taken, the X library will not work if calls are
     made from multiple shared everything processes.

  <20>  Normally, the worst-case behavior for a program with a bad pointer
     or array subscript is that the process that contains the offending
     code dies.  It might even generate a core file that clues you in to
     what happened.  In shared everything parallel processing, it is
     very likely that the stray accesses will bring the demise of a
     process other than the one at fault, making it nearly impossible to
     localize and correct the error.

  Neither of these types of problems is common when shared something is
  used, because only the explicitly-marked data structures are shared.
  It also is fairly obvious that shared everything only works if all
  processors are executing the exact same memory image; you cannot use
  shared everything across multiple different code images (i.e., can use
  only SPMD, not general MIMD).

  The most common type of shared everything programming support is a
  threads library.  Threads
  <http://liinwww.ira.uka.de/bibliography/Os/threads.html> are
  essentially "light-weight" processes that might not be scheduled in
  the same way as regular UNIX processes and, most importantly, share
  access to a single memory map.  The POSIX Pthreads
  <http://www.humanfactor.com/pthreads/mit-pthreads.html> package has
  been the focus of a number of porting efforts; the big question is
  whether any of these ports actually run the threads of a program in
  parallel under SMP Linux (ideally, with a processor for each thread).
  The POSIX API doesn't require it, and versions like
  <http://www.aa.net/~mtp/PCthreads.html> apparently do not implement
  parallel thread execution - all the threads of a program are kept
  within a single Linux process.

  The first threads library that supported SMP Linux parallelism was the
  now somewhat obsolete bb_threads library,
  <ftp://caliban.physics.utoronto.ca/pub/linux/>, a very small library
  that used the Linux clone() call to fork new, independently scheduled,
  Linux processes all sharing a single address space.  SMP Linux
  machines can run multiple of these "threads" in parallel because each
  "thread" is a full Linux process; the trade-off is that you do not get
  the same "light-weight" scheduling control provided by some thread
  libraries under other operating systems.  The library used a bit of C-
  wrapped assembly code to install a new chunk of memory as each
  thread's stack and to provide atomic access functions for an array of
  locks (mutex objects).  Documentation consisted of a README and a
  short sample program.

  More recently, a version of POSIX threads using clone() has been
  developed.  This library, LinuxThreads
  <http://pauillac.inria.fr/~xleroy/linuxthreads/>, is clearly the
  preferred shared everything library for use under SMP Linux.  POSIX
  threads are well documented, and the LinuxThreads README
  <http://pauillac.inria.fr/~xleroy/linuxthreads/README> and
  LinuxThreads FAQ
  <http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html> are very well
  done.  The primary problem now is simply that POSIX threads have a lot
  of details to get right and LinuxThreads is still a work in progress.
  There is also the problem that the POSIX thread standard has evolved
  through the standardization process, so you need to be a bit careful
  not to program for obsolete early versions of the standard.


  2.2.1.2.  Shared Something


  Shared something is really "only share what needs to be shared."  This
  approach can work for general MIMD (not just SPMD) provided that care
  is taken for the shared objects to be allocated at the same places in
  each processor's memory map.  More importantly, shared something makes
  it easier to predict and tune performance, debug code, etc.  The only
  problems are:


  <20>  It can be hard to know beforehand what really needs to be shared.

  <20>  The actual allocation of objects in shared memory may be awkward,
     especially for what would have been stack-allocated objects.  For
     example, it may be necessary to explicitly allocate shared objects
     in a separate memory segment, requiring separate memory allocation
     routines and introducing extra pointer indirections in each
     reference.

  Currently, there are two very similar mechanisms that allow groups of
  Linux processes to have independent memory spaces, all sharing only a
  relatively small memory segment.  Assuming that you didn't foolishly
  exclude "System V IPC" when you configured your Linux system, Linux
  supports a very portable mechanism that has generally become known as
  "System V Shared Memory."  The other alternative is a memory mapping
  facility whose implementation varies widely across different UNIX
  systems:  the mmap() system call.  You can, and should, learn about
  these calls from the manual pages...  but a brief overview of each is
  given in sections 2.5 and 2.6 to help get you started.


  2.2.2.  Atomicity And Ordering


  No matter which of the above two models you use, the result is pretty
  much the same:  you get a pointer to a chunk of read/write memory that
  is accessible by all processes within your parallel program.  Does
  that mean I can just have my parallel program access shared memory
  objects as though they were in ordinary local memory?  Well, not
  quite....

  Atomicity refers to the concept that an operation on an object is
  accomplished as an indivisible, uninterruptible, sequence.
  Unfortunately, sharing memory access does not imply that all
  operations on data in shared memory occur atomically.  Unless special
  precautions are taken, only simple load or store operations that occur
  within a single bus transaction (i.e., aligned 8, 16, or 32-bit
  operations, but not misaligned nor 64-bit operations) are atomic.
  Worse still, "smart" compilers like GCC will often perform
  optimizations that could eliminate the memory operations needed to
  ensure that other processors can see what this processor has done.
  Fortunately, both these problems can be remedied...  leaving only the
  relationship between access efficiency and cache line size for us to
  worry about.

  However, before discussing these issues, it is useful to point-out
  that all of this assumes that memory references for each processor
  happen in the order in which they were coded.  The Pentium does this,
  but also notes that future Intel processors might not.  So, for future
  processors, keep in mind that it may be necessary to surround some
  shared memory accesses with instructions that cause all pending memory
  accesses to complete, thus providing memory access ordering.  The
  CPUID instruction apparently is reserved to have this side-effect.


  2.2.3.  Volatility


  To prevent GCC's optimizer from buffering values of shared memory
  objects in registers, all objects in shared memory should be declared
  as having types with the volatile attribute.  If this is done, all
  shared object reads and writes that require just one word access will
  occur atomically.  For example, suppose that p is a pointer to an
  integer, where both the pointer and the integer it will point at are
  in shared memory; the ANSI C declaration might be:


  ______________________________________________________________________
  volatile int * volatile p;
  ______________________________________________________________________


  In this code, the first volatile refers to the int that p will
  eventually point at; the second volatile refers to the pointer itself.
  Yes, it is annoying, but it is the price one pays for enabling GCC to
  perform some very powerful optimizations.  At least in theory, the
  -traditional option to GCC might suffice to produce correct code at
  the expense of some optimization, because pre-ANSI K&R C essentially
  claimed that all variables were volatile unless explicitly declared as
  register.  Still, if your typical GCC compile looks like cc -O6 ...,
  you really will want to explicitly mark things as volatile only where
  necessary.

  There has been a rumor to the effect that using assembly-language
  locks that are marked as modifying all processor registers will cause
  GCC to appropriately flush all variables, thus avoiding the
  "inefficient" compiled code associated with things declared as
  volatile.  This hack appears to work for statically allocated global
  variables using version 2.7.0 of GCC...  however, that behavior is not
  required by the ANSI C standard.  Still worse, other processes that
  are making only read accesses can buffer the values in registers
  forever, thus never noticing that the shared memory value has actually
  changed.  In summary, do what you want, but only variables accessed
  through volatile are guaranteed to work correctly.

  Note that you can cause a volatile access to an ordinary variable by
  using a type cast that imposes the volatile attribute.  For example,
  the ordinary int i; can be referenced as a volatile by *((volatile int
  *) &i); thus, you can explicitly invoke the "overhead" of volatility
  only where it is critical.


  2.2.4.  Locks


  If you thought that ++i; would always work to add one to a variable i
  in shared memory, you've got a nasty little surprise coming:  even if
  coded as a single instruction, the load and store of the result are
  separate memory transactions, and other processors could access i
  between these two transactions.  For example, having two processes
  both perform ++i; might only increment i by one, rather than by two.
  According to the Intel Pentium "Architecture and Programming Manual,"
  the LOCK prefix can be used to ensure that any of the following
  instructions is atomic relative to the data memory location it
  accesses:


  ______________________________________________________________________
  BTS, BTR, BTC                     mem, reg/imm
  XCHG                              reg, mem
  XCHG                              mem, reg
  ADD, OR, ADC, SBB, AND, SUB, XOR  mem, reg/imm
  NOT, NEG, INC, DEC                mem
  CMPXCHG, XADD
  ______________________________________________________________________


  However, it probably is not a good idea to use all these operations.
  For example, XADD did not even exist for the 386, so coding it may
  cause portability problems.

  The XCHG instruction always asserts a lock, even without the LOCK
  prefix, and thus is clearly the preferred atomic operation from which
  to build higher-level atomic constructs such as semaphores and shared
  queues.  Of course, you can't get GCC to generate this instruction
  just by writing C code...  instead, you must use a bit of in-line
  assembly code.  Given a word-size volatile object obj and a word-size
  register value reg, the GCC in-line assembly code is:


  ______________________________________________________________________
  __asm__ __volatile__ ("xchgl %1,%0"
                        :"=r" (reg), "=m" (obj)
                        :"r" (reg), "m" (obj));
  ______________________________________________________________________


  Examples of GCC in-line assembly code using bit operations for locking
  are given in the source code for the bb_threads library
  <ftp://caliban.physics.utoronto.ca/pub/linux/>.

  It is important to remember, however, that there is a cost associated
  with making memory transactions atomic.  A locking operation carries a
  fair amount of overhead and may delay memory activity from other
  processors, whereas ordinary references may use local cache.  The best
  performance results when locking operations are used as infrequently
  as possible.  Further, these IA32 atomic instructions obviously are
  not portable to other systems.

  There are many alternative approaches that allow ordinary instructions
  to be used to implement various synchronizations, including mutual
  exclusion - ensuring that at most one processor is updating a given
  shared object at any moment.  Most OS textbooks discuss at least one
  of these techniques.  There is a fairly good discussion in the Fourth
  Edition of Operating System Concepts, by Abraham Silberschatz and
  Peter B. Galvin, ISBN 0-201-50480-4.


  2.2.5.  Cache Line Size


  One more fundamental atomicity concern can have a dramatic impact on
  SMP performance:  cache line size.  Although the MPS standard requires
  references to be coherent no matter what caching is used, the fact is
  that when one processor writes to a particular line of memory, every
  cached copy of the old line must be invalidated or updated.  This
  implies that if two or more processors are both writing data to
  different portions of the same line a lot of cache and bus traffic may
  result, effectively to pass the line from cache to cache.  This
  problem is known as false sharing.  The solution is simply to try to
  organize data so that what is accessed in parallel tends to come from
  a different cache line for each process.

  You might be thinking that false sharing is not a problem using a
  system with a shared L2 cache, but remember that there are still
  separate L1 caches.  Cache organization and number of separate levels
  can both vary, but the Pentium L1 cache line size is 32 bytes and
  typical external cache line sizes are around 256 bytes.  Suppose that
  the addresses (physical or virtual) of two items are a and b and that
  the largest per-processor cache line size is c, which we assume to be
  a power of two.  To be very precise, if ((int) a) & ~(c - 1) is equal
  to ((int) b) & ~(c - 1), then both references are in the same cache
  line.  A simpler rule is that if shared objects being referenced in
  parallel are at least c bytes apart, they should map to different
  cache lines.


  2.2.6.  Linux Scheduler Issues


  Although the whole point of using shared memory for parallel
  processing is to avoid OS overhead, OS overhead can come from things
  other than communication per se.  We have already said that the number
  of processes that should be constructed is less than or equal to the
  number of processors in the machine.  But how do you decide exactly
  how many processes to make?

  For best performance, the number of processes in your parallel program
  should be equal to the expected number of your program's processes
  that simultaneously can be running on different processors.  For
  example, if a four-processor SMP typically has one process actively
  running for some other purpose (e.g., a WWW server), then your
  parallel program should use only three processes.  You can get a rough
  idea of how many other processes are active on your system by looking
  at the "load average" quoted by the uptime command.

  Alternatively, you could boost the priority of the processes in your
  parallel program using, for example, the renice command or nice()
  system call.  You must be privileged to increase priority.  The idea
  is simply to force the other processes out of processors so that your
  program can run simultaneously across all processors.  This can be
  accomplished somewhat more explicitly using the prototype version of
  SMP Linux at  <http://luz.cs.nmt.edu/~rtlinux/>, which offers real-
  time schedulers.

  If you are not the only user treating your SMP system as a parallel
  machine, you may also have conflicts between the two or more parallel
  programs trying to execute simultaneously.  This standard solution is
  gang scheduling - i.e., manipulating scheduling priority so that at
  any given moment, only the processes of a single parallel program are
  running.  It is useful to recall, however, that using more parallelism
  tends to have diminishing returns and scheduler activity adds
  overhead.  Thus, for example, it is probably better for a four-
  processor machine to run two programs with two processes each rather
  than gang scheduling between two programs with four processes each.

  There is one more twist to this.  Suppose that you are developing a
  program on a machine that is heavily used all day, but will be fully
  available for parallel execution at night.  You need to write and test
  your code for correctness with the full number of processes, even
  though you know that your daytime test runs will be slow.  Well, they
  will be very slow if you have processes busy waiting for shared memory
  values to be changed by other processes that are not currently running
  (on other processors).  The same problem occurs if you develop and
  test your code on a single-processor system.

  The solution is to embed calls in your code, wherever it may loop
  awaiting an action from another processor, so that Linux will give
  another process a chance to run.  I use a C macro, call it IDLE_ME, to
  do this:  for a test run, compile with cc -DIDLE_ME=usleep(1); ...;
  for a "production" run, compile with cc -DIDLE_ME={} ....  The
  usleep(1) call requests a 1 microsecond sleep, which has the effect of
  allowing the Linux scheduler to select a different process to run on
  that processor.  If the number of processes is more than twice the
  number of processors available, it is not unusual for codes to run ten
  times faster with usleep(1) calls than without them.


  2.3.  bb_threads


  The bb_threads ("Bare Bones" threads) library,
  <ftp://caliban.physics.utoronto.ca/pub/linux/>, is a remarkably simple
  library that demonstrates use of the Linux clone() call.  The gzip tar
  file is only 7K bytes!  Although this library is essentially made
  obsolete by the LinuxThreads library discussed in section 2.4,
  bb_threads is still usable, and it is small and simple enough to serve
  well as an introduction to use of Linux thread support.  Certainly, it
  is far less daunting to read this source code than to browse the
  source code for LinuxThreads.  In summary, the bb_threads library is a
  good starting point, but is not really suitable for coding large
  projects.

  The basic program structure for using the bb_threads library is:


  1. Start the program running as a single process.

  2. You will need to estimate the maximum stack space that will be
     required for each thread.  Guessing large is relatively harmless
     (that is what virtual memory is for ;-), but remember that all the
     stacks are coming from a single virtual address space, so guessing
     huge is not a great idea.  The demo suggests 64K.  This size is set
     to b bytes by bb_threads_stacksize(b).

  3. The next step is to initialize any locks that you will need.  The
     lock mechanism built-into this library numbers locks from 0 to
     MAX_MUTEXES, and initializes lock i by bb_threads_mutexcreate(i).

  4. Spawning a new thread is done by calling a library routine that
     takes arguments specifying what function the new thread should
     execute and what arguments should be transmitted to it.  To start a
     new thread executing the void-returning function f with the single
     argument arg, you do something like bb_threads_newthread(f, &arg),
     where f should be declared something like void f(void *arg, size_t
     dummy).  If you need to pass more than one argument, pass a pointer
     to a structure initialized to hold the argument values.

  5. Run parallel code, being careful to use bb_threads_lock(n) and
     bb_threads_unlock(n) where n is an integer identifying which lock
     to use.  Note that the lock and unlock operations in this library
     are very basic spin locks using atomic bus-locking instructions,
     which can cause excessive memory-reference interference and do not
     make any attempt to ensure fairness.

     The demo program packaged with bb_threads did not correctly use
     locks to prevent printf() from being executed simultaneously from
     within the functions fnn and main...  and because of this, the demo
     does not always work.  I'm not saying this to knock the demo, but
     rather to emphasize that this stuff is very tricky; also, it is
     only slightly easier using LinuxThreads.

  6. When a thread executes a return, it actually destroys the
     process...  but the local stack memory is not automatically
     deallocated.  To be precise, Linux doesn't support deallocation,
     but the memory space is not automatically added back to the
     malloc() free list.  Thus, the parent process should reclaim the
     space for each dead child by bb_threads_cleanup(wait(NULL)).


  The following C program uses the algorithm discussed in section 1.3 to
  compute the approximate value of Pi using two bb_threads threads.


  ______________________________________________________________________
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include "bb_threads.h"

  volatile double pi = 0.0;
  volatile int intervals;
  volatile int pids[2];      /* Unix PIDs of threads */

  void
  do_pi(void *data, size_t len)
  {
    register double width, localsum;
    register int i;
    register int iproc = (getpid() != pids[0]);

    /* set width */
    width = 1.0 / intervals;

    /* do the local computations */
    localsum = 0;
    for (i=iproc; i<intervals; i+=2) {
      register double x = (i + 0.5) * width;
      localsum += 4.0 / (1.0 + x * x);
    }
    localsum *= width;

    /* get permission, update pi, and unlock */
    bb_threads_lock(0);
    pi += localsum;
    bb_threads_unlock(0);
  }

  int
  main(int argc, char **argv)
  {
    /* get the number of intervals */
    intervals = atoi(argv[1]);

    /* set stack size and create lock... */
    bb_threads_stacksize(65536);
    bb_threads_mutexcreate(0);

    /* make two threads... */
    pids[0] = bb_threads_newthread(do_pi, NULL);
    pids[1] = bb_threads_newthread(do_pi, NULL);

    /* cleanup after two threads (really a barrier sync) */
    bb_threads_cleanup(wait(NULL));
    bb_threads_cleanup(wait(NULL));

    /* print the result */
    printf("Estimation of pi is %f\n", pi);

    /* check-out */
    exit(0);
  }
  ______________________________________________________________________


  2.4.  LinuxThreads


  LinuxThreads  <http://pauillac.inria.fr/~xleroy/linuxthreads/> is a
  fairly complete and solid implementation of "shared everything" as per
  the POSIX 1003.1c threads standard.  Unlike other POSIX threads ports,
  LinuxThreads uses the same Linux kernel threads facility (clone())
  that is used by bb_threads.  POSIX compatibility means that it is
  relatively easy to port quite a few threaded applications from other
  systems and various tutorial materials are available.  In short, this
  is definitely the threads package to use under Linux for developing
  large-scale threaded programs.

  The basic program structure for using the LinuxThreads library is:


  1. Start the program running as a single process.

  2. The next step is to initialize any locks that you will need.
     Unlike bb_threads locks, which are identified by numbers, POSIX
     locks are declared as variables of type pthread_mutex_t lock.  Use
     pthread_mutex_init(&lock,val) to initialize each one you will need
     to use.

  3. As with bb_threads, spawning a new thread is done by calling a
     library routine that takes arguments specifying what function the
     new thread should execute and what arguments should be transmitted
     to it.  However, POSIX requires the user to declare a variable of
     type pthread_t to identify each thread.  To create a thread
     pthread_t thread running f(), one calls
     pthread_create(&thread,NULL,f,&arg).

  4. Run parallel code, being careful to use pthread_mutex_lock(&lock)
     and pthread_mutex_unlock(&lock) as appropriate.

  5. Use pthread_join(thread,&retval) to clean-up after each thread.

  6. Use -D_REENTRANT when compiling your C code.

  An example parallel computation of Pi using LinuxThreads follows.  The
  algorithm of section 1.3 is used and, as for the bb_threads example,
  two threads execute in parallel.


  ______________________________________________________________________
  #include <stdio.h>
  #include <stdlib.h>
  #include "pthread.h"

  volatile double pi = 0.0;  /* Approximation to pi (shared) */
  pthread_mutex_t pi_lock;   /* Lock for above */
  volatile double intervals; /* How many intervals? */

  void *
  process(void *arg)
  {
    register double width, localsum;
    register int i;
    register int iproc = (*((char *) arg) - '0');

    /* Set width */
    width = 1.0 / intervals;

    /* Do the local computations */
    localsum = 0;
    for (i=iproc; i<intervals; i+=2) {
      register double x = (i + 0.5) * width;
      localsum += 4.0 / (1.0 + x * x);
    }
    localsum *= width;

    /* Lock pi for update, update it, and unlock */
    pthread_mutex_lock(&pi_lock);
    pi += localsum;
    pthread_mutex_unlock(&pi_lock);

    return(NULL);
  }

  int
  main(int argc, char **argv)
  {
    pthread_t thread0, thread1;
    void * retval;

    /* Get the number of intervals */
    intervals = atoi(argv[1]);

    /* Initialize the lock on pi */
    pthread_mutex_init(&pi_lock, NULL);

    /* Make the two threads */
    if (pthread_create(&thread0, NULL, process, "0") ||
        pthread_create(&thread1, NULL, process, "1")) {
      fprintf(stderr, "%s: cannot make thread\n", argv[0]);
      exit(1);
    }

    /* Join (collapse) the two threads */
    if (pthread_join(thread0, &retval) ||
        pthread_join(thread1, &retval)) {
      fprintf(stderr, "%s: thread join failed\n", argv[0]);
      exit(1);
    }

    /* Print the result */
    printf("Estimation of pi is %f\n", pi);

    /* Check-out */
    exit(0);
  }
  ______________________________________________________________________


  2.5.  System V Shared Memory


  The System V IPC (Inter-Process Communication) support consists of a
  number of system calls providing message queues, semaphores, and a
  shared memory mechanism.  Of course, these mechanisms were originally
  intended to be used for multiple processes to communicate within a
  uniprocessor system.  However, that implies that it also should work
  to communicate between processes under SMP Linux, no matter which
  processors they run on.

  Before going into how these calls are used, it is important to
  understand that although System V IPC calls exist for things like
  semaphores and message transmission, you probably should not use them.
  Why not?  These functions are generally slow and serialized under SMP
  Linux.  Enough said.

  The basic procedure for creating a group of processes sharing access
  to a shared memory segment is:


  1. Start the program running as a single process.

  2. Typically, you will want each run of a parallel program to have its
     own shared memory segment, so you will need to call shmget() to
     create a new segment of the desired size.  Alternatively, this call
     can be used to get the ID of a pre-existing shared memory segment.
     In either case, the return value is either the shared memory
     segment ID or -1 for error.  For example, to create a shared memory
     segment of b bytes, the call might be shmid = shmget(IPC_PRIVATE,
     b, (IPC_CREAT | 0666)).

  3. The next step is to attach this shared memory segment to this
     process, literally adding it to the virtual memory map of this
     process.  Although the shmat() call allows the programmer to
     specify the virtual address at which the segment should appear, the
     address selected must be aligned on a page boundary (i.e., be a
     multiple of the page size returned by getpagesize(), which is
     usually 4096 bytes), and will override the mapping of any memory
     formerly at that address.  Thus, we instead prefer to let the
     system pick the address.  In either case, the return value is a
     pointer to the base virtual address of the segment just mapped.
     The code is shmptr = shmat(shmid, 0, 0).

     Notice that you can allocate all your static shared variables into
     this shared memory segment by simply declaring all shared variables
     as members of a struct type, and declaring shmptr to be a pointer
     to that type.  Using this technique, shared variable x would be
     accessed as shmptr->x.

  4. Since this shared memory segment should be destroyed when the last
     process with access to it terminates or detaches from it, we need
     to call shmctl() to set-up this default action.  The code is
     something like shmctl(shmid, IPC_RMID, 0).

  5. Use the standard Linux fork() call to make the desired number of
     processes...  each will inherit the shared memory segment.

  6. When a process is done using a shared memory segment, it really
     should detach from that shared memory segment.  This is done by
     shmdt(shmptr).


  Although the above set-up does require a few system calls, once the
  shared memory segment has been established, any change made by one
  processor to a value in that memory will automatically be visible to
  all processes.  Most importantly, each communication operation will
  occur without the overhead of a system call.

  An example C program using System V shared memory segments follows.
  It computes Pi, using the same algorithm given in section 1.3.


  ______________________________________________________________________
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>

  volatile struct shared { double pi; int lock; } *shared;

  inline extern int xchg(register int reg,
  volatile int * volatile obj)
  {
    /* Atomic exchange instruction */
  __asm__ __volatile__ ("xchgl %1,%0"
                        :"=r" (reg), "=m" (*obj)
                        :"r" (reg), "m" (*obj));
    return(reg);
  }

  main(int argc, char **argv)
  {
    register double width, localsum;
    register int intervals, i;
    register int shmid;
    register int iproc = 0;;

    /* Allocate System V shared memory */
    shmid = shmget(IPC_PRIVATE,
                   sizeof(struct shared),
                   (IPC_CREAT | 0600));
    shared = ((volatile struct shared *) shmat(shmid, 0, 0));
    shmctl(shmid, IPC_RMID, 0);

    /* Initialize... */
    shared->pi = 0.0;
    shared->lock = 0;

    /* Fork a child */
    if (!fork()) ++iproc;

    /* get the number of intervals */
    intervals = atoi(argv[1]);
    width = 1.0 / intervals;

    /* do the local computations */
    localsum = 0;
    for (i=iproc; i<intervals; i+=2) {
      register double x = (i + 0.5) * width;
      localsum += 4.0 / (1.0 + x * x);
    }
    localsum *= width;

    /* Atomic spin lock, add, unlock... */
    while (xchg((iproc + 1), &(shared->lock))) ;
    shared->pi += localsum;
    shared->lock = 0;

    /* Terminate child (barrier sync) */
    if (iproc == 0) {
      wait(NULL);
      printf("Estimation of pi is %f\n", shared->pi);
    }

    /* Check out */
    return(0);
  }
  ______________________________________________________________________


  In this example, I have used the IA32 atomic exchange instruction to
  implement locking.  For better performance and portability, substitute
  a synchronization technique that avoids atomic bus-locking
  instructions (discussed in section 2.2).

  When debugging your code, it is useful to remember that the ipcs
  command will report the status of the System V IPC facilities
  currently in use.


  2.6.  Memory Map Call


  Using system calls for file I/O can be very expensive; in fact, that
  is why there is a user-buffered file I/O library (getchar(), fwrite(),
  etc.).  But user buffers don't work if multiple processes are
  accessing the same writeable file, and the user buffer management
  overhead is significant.  The BSD UNIX fix for this was the addition
  of a system call that allows a portion of a file to be mapped into
  user memory, essentially using virtual memory paging mechanisms to
  cause updates.  This same mechanism also has been used in systems from
  Sequent for many years as the basis for their shared memory parallel
  processing support.  Despite some very negative comments in the (quite
  old) man page, Linux seems to correctly perform at least some of the
  basic functions, and it supports the degenerate use of this system
  call to map an anonymous segment of memory that can be shared across
  multiple processes.

  In essence, the Linux implementation of mmap() is a plug-in
  replacement for steps 2, 3, and 4 in the System V shared memory scheme
  outlined in section 2.5.  To create an anonymous shared memory
  segment:


  ______________________________________________________________________
  shmptr =
      mmap(0,                        /* system assigns address */
           b,                        /* size of shared memory segment */
           (PROT_READ | PROT_WRITE), /* access rights, can be rwx */
           (MAP_ANON | MAP_SHARED),  /* anonymous, shared */
           0,                        /* file descriptor (not used) */
           0);                       /* file offset (not used) */
  ______________________________________________________________________


  The equivalent to the System V shared memory shmdt() call is munmap():


  ______________________________________________________________________
  munmap(shmptr, b);
  ______________________________________________________________________


  In my opinion, there is no real benefit in using mmap() instead of the
  System V shared memory support.


  3.  Clusters Of Linux Systems


  This section attempts to give an overview of cluster parallel
  processing using Linux.  Clusters are currently both the most popular
  and the most varied approach, ranging from a conventional network of
  workstations (NOW) to essentially custom parallel machines that just
  happen to use Linux PCs as processor nodes.  There is also quite a lot
  of software support for parallel processing using clusters of Linux
  machines.


  3.1.  Why A Cluster?


  Cluster parallel processing offers several important advantages:


  <20>  Each of the machines in a cluster can be a complete system, usable
     for a wide range of other computing applications.  This leads many
     people to suggest that cluster parallel computing can simply claim
     all the "wasted cycles" of workstations sitting idle on people's
     desks.  It is not really so easy to salvage those cycles, and it
     will probably slow your co-worker's screen saver, but it can be
     done.

  <20>  The current explosion in networked systems means that most of the
     hardware for building a cluster is being sold in high volume, with
     correspondingly low "commodity" prices as the result.  Further
     savings come from the fact that only one video card, monitor, and
     keyboard are needed for each cluster (although you may need to swap
     these into each machine to perform the initial installation of
     Linux, once running, a typical Linux PC does not need a "console").
     In comparison, SMP and attached processors are much smaller
     markets, tending toward somewhat higher price per unit performance.

  <20>  Cluster computing can scale to very large systems.  While it is
     currently hard to find a Linux-compatible SMP with many more than
     four processors, most commonly available network hardware easily
     builds a cluster with up to 16 machines.  With a little work,
     hundreds or even thousands of machines can be networked.  In fact,
     the entire Internet can be viewed as one truly huge cluster.

  <20>  The fact that replacing a "bad machine" within a cluster is trivial
     compared to fixing a partly faulty SMP yields much higher
     availability for carefully designed cluster configurations.  This
     becomes important not only for particular applications that cannot
     tolerate significant service interruptions, but also for general
     use of systems containing enough processors so that single-machine
     failures are fairly common.  (For example, even though the average
     time to failure of a PC might be two years, in a cluster with 32
     machines, the probability that at least one will fail within 6
     months is quite high.)


  OK, so clusters are free or cheap and can be very large and highly
  available...  why doesn't everyone use a cluster?  Well, there are
  problems too:


  <20>  With a few exceptions, network hardware is not designed for
     parallel processing.  Typically latency is very high and bandwidth
     relatively low compared to SMP and attached processors.  For
     example, SMP latency is generally no more than a few microseconds,
     but is commonly hundreds or thousands of microseconds for a
     cluster.  SMP communication bandwidth is often more than 100
     MBytes/second; although the fastest network hardware (e.g.,
     "Gigabit Ethernet") offers comparable speed, the most commonly used
     networks are between 10 and 1000 times slower.

     The performance of network hardware is poor enough as an isolated
     cluster network.  If the network is not isolated from other
     traffic, as is often the case using "machines that happen to be
     networked" rather than a system designed as a cluster, performance
     can be substantially worse.

  <20>  There is very little software support for treating a cluster as a
     single system.  For example, the ps command only reports the
     processes running on one Linux system, not all processes running
     across a cluster of Linux systems.


  Thus, the basic story is that clusters offer great potential, but that
  potential may be very difficult to achieve for most applications.  The
  good news is that there is quite a lot of software support that will
  help you achieve good performance for programs that are well suited to
  this environment, and there are also networks designed specifically to
  widen the range of programs that can achieve good performance.


  3.2.  Network Hardware


  Computer networking is an exploding field...  but you already knew
  that.  An ever-increasing range of networking technologies and
  products are being developed, and most are available in forms that
  could be applied to make a parallel-processing cluster out of a group
  of machines (i.e., PCs each running Linux).

  Unfortunately, no one network technology solves all problems best; in
  fact, the range of approach, cost, and performance is at first hard to
  believe.  For example, using standard commercially-available hardware,
  the cost per machine networked ranges from less than $5 to over
  $4,000.  The delivered bandwidth and latency each also vary over four
  orders of magnitude.

  Before trying to learn about specific networks, it is important to
  recognize that these things change like the wind (see
  <http://www.linux.org.uk/NetNews.html> for Linux networking news), and
  it is very difficult to get accurate data about some networks.

  Where I was particularly uncertain, I've placed a ?.  I have spent a
  lot of time researching this topic, but I'm sure my summary is full of
  errors and has omitted many important things.  If you have any
  corrections or additions, please send email to hankd@engr.uky.edu.

  Summaries like the LAN Technology Scorecard at
  <http://web.syr.edu/~jmwobus/comfaqs/lan-technology.html> give some
  characteristics of many different types of networks and LAN standards.
  However, the summary in this HOWTO centers on the network properties
  that are most relevant to construction of Linux clusters.  The section
  discussing each network begins with a short list of characteristics.
  The following defines what these entries mean.


     Linux support:
        If the answer is no, the meaning is pretty clear.  Other answers
        try to describe the basic program interface that is used to
        access the network.  Most network hardware is interfaced via a
        kernel driver, typically supporting TCP/UDP communication.  Some
        other networks use more direct (e.g., library) interfaces to
        reduce latency by bypassing the kernel.
        Years ago, it used to be considered perfectly acceptable to
        access a floating point unit via an OS call, but that is now
        clearly ludicrous; in my opinion, it is just as awkward for each
        communication between processors executing a parallel program to
        require an OS call.  The problem is that computers haven't yet
        integrated these communication mechanisms, so non-kernel
        approaches tend to have portability problems.  You are going to
        hear a lot more about this in the near future, mostly in the
        form of the new Virtual Interface (VI) Architecture,
        <http://www.viarch.org/>, which is a standardized method for
        most network interface operations to bypass the usual OS call
        layers.  The VI standard is backed by Compaq, Intel, and
        Microsoft, and is sure to have a strong impact on SAN (System
        Area Network) designs over the next few years.


     Maximum bandwidth:
        This is the number everybody cares about.  I have generally used
        the theoretical best case numbers; your mileage will vary.


     Minimum latency:
        In my opinion, this is the number everybody should care about
        even more than bandwidth.  Again, I have used the unrealistic
        best-case numbers, but at least these numbers do include all
        sources of latency, both hardware and software.  In most cases,
        the network latency is just a few microseconds; the much larger
        numbers reflect layers of inefficient hardware and software
        interfaces.


     Available as:
        Simply put, this describes how you get this type of network
        hardware.  Commodity stuff is widely available from many
        vendors, with price as the primary distinguishing factor.
        Multiple-vendor things are available from more than one
        competing vendor, but there are significant differences and
        potential interoperability problems.  Single-vendor networks
        leave you at the mercy of that supplier (however benevolent it
        may be).  Public domain designs mean that even if you cannot
        find somebody to sell you one, you or anybody else can buy parts
        and make one.  Research prototypes are just that; they are
        generally neither ready for external users nor available to
        them.


     Interface port/bus used:
        How does one hook-up this network?  The highest performance and
        most common now is a PCI bus interface card.  There are also
        EISA, VESA local bus (VL bus), and ISA bus cards.  ISA was there
        first, and is still commonly used for low-performance cards.
        EISA is still around as the second bus in a lot of PCI machines,
        so there are a few cards.  These days, you don't see much VL
        stuff (although  <http://www.vesa.org/> would beg to differ).


        Of course, any interface that you can use without having to open
        your PC's case has more than a little appeal.  IrDA and USB
        interfaces are appearing with increasing frequency.  The
        Standard Parallel Port (SPP) used to be what your printer was
        plugged into, but it has seen a lot of use lately as an external
        extension of the ISA bus; this new functionality is enhanced by
        the IEEE 1284 standard, which specifies EPP and ECP
        improvements.  There is also the old, reliable, slow RS232
        serial port.  I don't know of anybody connecting machines using
        VGA video connectors, keyboard, mouse, or game ports...  so
        that's about it.


     Network structure:
        A bus is a wire, set of wires, or fiber.  A hub is a little box
        that knows how to connect different wires/fibers plugged into
        it; switched hubs allow multiple connections to be actively
        transmitting data simultaneously.


     Cost per machine connected:
        Here's how to use these numbers.  Suppose that, not counting the
        network connection, it costs $2,000 to purchase a PC for use as
        a node in your cluster.  Adding a Fast Ethernet brings the per
        node cost to about $2,400; adding a Myrinet instead brings the
        cost to about $3,800.  If you have about $20,000 to spend, that
        means you could have either 8 machines connected by Fast
        Ethernet or 5 machines connected by Myrinet.  It also can be
        very reasonable to have multiple networks; e.g., $20,000 could
        buy 8 machines connected by both Fast Ethernet and TTL_PAPERS.
        Pick the network, or set of networks, that is most likely to
        yield a cluster that will run your application fastest.


        By the time you read this, these numbers will be wrong...  heck,
        they're probably wrong already.  There may also be quantity
        discounts, special deals, etc.  Still, the prices quoted here
        aren't likely to be wrong enough to lead you to a totally
        inappropriate choice.  It doesn't take a PhD (although I do have
        one ;-) to see that expensive networks only make sense if your
        application needs their special properties or if the PCs being
        clustered are relatively expensive.

  Now that you have the disclaimers, on with the show....


  3.2.1.  ArcNet


  <20>  Linux support: kernel drivers

  <20>  Maximum bandwidth: 2.5 Mb/s

  <20>  Minimum latency: 1,000 microseconds?

  <20>  Available as: multiple-vendor hardware

  <20>  Interface port/bus used: ISA

  <20>  Network structure: unswitched hub or bus (logical ring)

  <20>  Cost per machine connected: $200


  ARCNET is a local area network that is primarily intended for use in
  embedded real-time control systems.  Like Ethernet, the network is
  physically organized either as taps on a bus or one or more hubs,
  however, unlike Ethernet, it uses a token-based protocol logically
  structuring the network as a ring.  Packet headers are small (3 or 4
  bytes) and messages can carry as little as a single byte of data.
  Thus, ARCNET yields more consistent performance than Ethernet, with
  bounded delays, etc.  Unfortunately, it is slower than Ethernet and
  less popular, making it more expensive.  More information is available
  from the ARCNET Trade Association at  <http://www.arcnet.com/>.
  3.2.2.  ATM


  <20>  Linux support: kernel driver, AAL* library

  <20>  Maximum bandwidth: 155 Mb/s (soon, 1,200 Mb/s)

  <20>  Minimum latency: 120 microseconds

  <20>  Available as: multiple-vendor hardware

  <20>  Interface port/bus used: PCI

  <20>  Network structure: switched hubs

  <20>  Cost per machine connected: $3,000


  Unless you've been in a coma for the past few years, you have probably
  heard a lot about how ATM (Asynchronous Transfer Mode) is the
  future...  well, sort-of.  ATM is cheaper than HiPPI and faster than
  Fast Ethernet, and it can be used over the very long distances that
  the phone companies care about.  The ATM network protocol is also
  designed to provide a lower-overhead software interface and to more
  efficiently manage small messages and real-time communications (e.g.,
  digital audio and video).  It is also one of the highest-bandwidth
  networks that Linux currently supports.  The bad news is that ATM
  isn't cheap, and there are still some compatibility problems across
  vendors.  An overview of Linux ATM development is available at
  <http://lrcwww.epfl.ch/linux-atm/>.


  3.2.3.  CAPERS


  <20>  Linux support: AFAPI library

  <20>  Maximum bandwidth: 1.2 Mb/s

  <20>  Minimum latency: 3 microseconds

  <20>  Available as: commodity hardware

  <20>  Interface port/bus used: SPP

  <20>  Network structure: cable between 2 machines

  <20>  Cost per machine connected: $2


  CAPERS (Cable Adapter for Parallel Execution and Rapid
  Synchronization) is a spin-off of the PAPERS project,
  <http://garage.ecn.purdue.edu/~papers/>, at the Purdue University
  School of Electrical and Computer Engineering.  In essence, it defines
  a software protocol for using an ordinary "LapLink" SPP-to-SPP cable
  to implement the PAPERS library for two Linux PCs.  The idea doesn't
  scale, but you can't beat the price.  As with TTL_PAPERS, to improve
  system security, there is a minor kernel patch recommended, but not
  required:  <http://garage.ecn.purdue.edu/~papers/giveioperm.html>.


  3.2.4.  Ethernet


  <20>  Linux support: kernel drivers

  <20>  Maximum bandwidth: 10 Mb/s

  <20>  Minimum latency: 100 microseconds

  <20>  Available as: commodity hardware

  <20>  Interface port/bus used: PCI

  <20>  Network structure: switched or unswitched hubs, or hubless bus

  <20>  Cost per machine connected: $100 (hubless, $50)


  For some years now, 10 Mbits/s Ethernet has been the standard network
  technology.  Good Ethernet interface cards can be purchased for well
  under $50, and a fair number of PCs now have an Ethernet controller
  built-into the motherboard.  For lightly-used networks, Ethernet
  connections can be organized as a multi-tap bus without a hub; such
  configurations can serve up to 200 machines with minimal cost, but are
  not appropriate for parallel processing.  Adding an unswitched hub
  does not really help performance.  However, switched hubs that can
  provide full bandwidth to simultaneous connections cost only about
  $100 per port.  Linux supports an amazing range of Ethernet
  interfaces, but it is important to keep in mind that variations in the
  interface hardware can yield significant performance differences.  See
  the Hardware Compatibility HOWTO for comments on which are supported
  and how well they work; also see
  <http://cesdis1.gsfc.nasa.gov/linux/drivers/>.

  An interesting way to improve performance is offered by the 16-machine
  Linux cluster work done in the Beowulf project,
  <http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html>, at NASA
  CESDIS.  There, Donald Becker, who is the author of many Ethernet card
  drivers, has developed support for load sharing across multiple
  Ethernet networks that shadow each other (i.e., share the same network
  addresses).  This load sharing is built-into the standard Linux
  distribution, and is done invisibly below the socket operation level.
  Because hub cost is significant, having each machine connected to two
  or more hubless or unswitched hub Ethernet networks can be a very
  cost-effective way to improve performance.  In fact, in situations
  where one machine is the network performance bottleneck, load sharing
  using shadow networks works much better than using a single switched
  hub network.


  3.2.5.  Ethernet (Fast Ethernet)


  <20>  Linux support: kernel drivers

  <20>  Maximum bandwidth: 100 Mb/s

  <20>  Minimum latency: 80 microseconds

  <20>  Available as: commodity hardware

  <20>  Interface port/bus used: PCI

  <20>  Network structure: switched or unswitched hubs

  <20>  Cost per machine connected: $400?


  Although there are really quite a few different technologies calling
  themselves "Fast Ethernet," this term most often refers to a hub-based
  100 Mbits/s Ethernet that is somewhat compatible with older "10 BaseT"
  10 Mbits/s devices and cables.  As might be expected, anything called
  Ethernet is generally priced for a volume market, and these interfaces
  are generally a small fraction of the price of 155 Mbits/s ATM cards.
  The catch is that having a bunch of machines dividing the bandwidth of
  a single 100 Mbits/s "bus" (using an unswitched hub) yields
  performance that might not even be as good on average as using 10
  Mbits/s Ethernet with a switched hub that can give each machine's
  connection a full 10 Mbits/s.

  Switched hubs that can provide 100 Mbits/s for each machine
  simultaneously are expensive, but prices are dropping every day, and
  these switches do yield much higher total network bandwidth than
  unswitched hubs.  The thing that makes ATM switches so expensive is
  that they must switch for each (relatively short) ATM cell; some Fast
  Ethernet switches take advantage of the expected lower switching
  frequency by using techniques that may have low latency through the
  switch, but take multiple milliseconds to change the switch path...
  if your routing pattern changes frequently, avoid those switches.  See
  <http://cesdis1.gsfc.nasa.gov/linux/drivers/> for information about
  the various cards and drivers.

  Also note that, as described for Ethernet, the Beowulf project,
  <http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html>, at NASA has
  been developing support that offers improved performance by load
  sharing across multiple Fast Ethernets.


  3.2.6.  Ethernet (Gigabit Ethernet)


  <20>  Linux support: kernel drivers

  <20>  Maximum bandwidth: 1,000 Mb/s

  <20>  Minimum latency: 300 microseconds?

  <20>  Available as: multiple-vendor hardware

  <20>  Interface port/bus used: PCI

  <20>  Network structure: switched hubs or FDRs

  <20>  Cost per machine connected: $2,500?


  I'm not sure that Gigabit Ethernet,  <http://www.gigabit-
  ethernet.org/>, has a good technological reason to be called
  Ethernet...  but the name does accurately reflect the fact that this
  is intended to be a cheap, mass-market, computer network technology
  with native support for IP.  However, current pricing reflects the
  fact that Gb/s hardware is still a tricky thing to build.

  Unlike other Ethernet technologies, Gigabit Ethernet provides for a
  level of flow control that should make it a more reliable network.
  FDRs, or Full-Duplex Repeaters, simply multiplex lines, using
  buffering and localized flow control to improve performance.  Most
  switched hubs are being built as new interface modules for existing
  gigabit-capable switch fabrics.  Switch/FDR products have been shipped
  or announced by at least  <http://www.acacianet.com/>,
  <http://www.baynetworks.com/>,  <http://www.cabletron.com/>,
  <http://www.networks.digital.com/>,
  <http://www.extremenetworks.com/>,  <http://www.foundrynet.com/>,
  <http://www.gigalabs.com/>,  <http://www.packetengines.com/>.
  <http://www.plaintree.com/>,  <http://www.prominet.com/>,
  <http://www.sun.com/>, and  <http://www.xlnt.com/>.

  There is a Linux driver,
  <http://cesdis.gsfc.nasa.gov/linux/drivers/yellowfin.html>, for the
  Packet Engines "Yellowfin" G-NIC,  <http://www.packetengines.com/>.
  Early tests under Linux achieved about 2.5x higher bandwidth than
  could be achieved with the best 100 Mb/s Fast Ethernet; with gigabit
  networks, careful tuning of PCI bus use is a critical factor.  There
  is little doubt that driver improvements, and Linux drivers for other
  NICs, will follow.


  3.2.7.  FC (Fibre Channel)


  <20>  Linux support: no

  <20>  Maximum bandwidth: 1,062 Mb/s

  <20>  Minimum latency: ?

  <20>  Available as: multiple-vendor hardware

  <20>  Interface port/bus used: PCI?

  <20>  Network structure: ?

  <20>  Cost per machine connected: ?


  The goal of FC (Fibre Channel) is to provide high-performance block
  I/O (an FC frame carries a 2,048 byte data payload), particularly for
  sharing disks and other storage devices that can be directly connected
  to the FC rather than connected through a computer.  Bandwidth-wise,
  FC is specified to be relatively fast, running anywhere between 133
  and 1,062 Mbits/s.  If FC becomes popular as a high-end SCSI
  replacement, it may quickly become a cheap technology; for now, it is
  not cheap and is not supported by Linux.  A good collection of FC
  references is maintained by the Fibre Channel Association at
  <http://www.amdahl.com/ext/CARP/FCA/FCA.html>


  3.2.8.  FireWire (IEEE 1394)


  <20>  Linux support: no

  <20>  Maximum bandwidth: 196.608 Mb/s (soon, 393.216 Mb/s)

  <20>  Minimum latency: ?

  <20>  Available as: multiple-vendor hardware

  <20>  Interface port/bus used: PCI

  <20>  Network structure: random without cycles (self-configuring)

  <20>  Cost per machine connected: $600


  FireWire,  <http://www.firewire.org/>, the IEEE 1394-1995 standard, is
  destined to be the low-cost high-speed digital network for consumer
  electronics.  The showcase application is connecting DV digital video
  camcorders to computers, but FireWire is intended to be used for
  applications ranging from being a SCSI replacement to interconnecting
  the components of your home theater.  It allows up to 64K devices to
  be connected in any topology using busses and bridges that does not
  create a cycle, and automatically detects the configuration when
  components are added or removed.  Short (four-byte "quadlet") low-
  latency messages are supported as well as ATM-like isochronous
  transmission (used to keep multimedia messages synchronized).  Adaptec
  has FireWire products that allow up to 63 devices to be connected to a
  single PCI interface card, and also has good general FireWire
  information at  <http://www.adaptec.com/serialio/>.

  Although FireWire will not be the highest bandwidth network available,
  the consumer-level market (which should drive prices very low) and low
  latency support might make this one of the best Linux PC cluster
  message-passing network technologies within the next year or so.


  3.2.9.  HiPPI And Serial HiPPI


  <20>  Linux support: no

  <20>  Maximum bandwidth: 1,600 Mb/s (serial is 1,200 Mb/s)

  <20>  Minimum latency: ?

  <20>  Available as: multiple-vendor hardware

  <20>  Interface port/bus used: EISA, PCI

  <20>  Network structure: switched hubs

  <20>  Cost per machine connected: $3,500 (serial is $4,500)


  HiPPI (High Performance Parallel Interface) was originally intended to
  provide very high bandwidth for transfer of huge data sets between a
  supercomputer and another machine (a supercomputer, frame buffer, disk
  array, etc.), and has become the dominant standard for supercomputers.
  Although it is an oxymoron, Serial HiPPI is also becoming popular,
  typically using a fiber optic cable instead of the 32-bit wide
  standard (parallel) HiPPI cables.  Over the past few years, HiPPI
  crossbar switches have become common and prices have dropped sharply;
  unfortunately, serial HiPPI is still pricey, and that is what PCI bus
  interface cards generally support.  Worse still, Linux doesn't yet
  support HiPPI.  A good overview of HiPPI is maintained by CERN at
  <http://www.cern.ch/HSI/hippi/>; they also maintain a rather long list
  of HiPPI vendors at
  <http://www.cern.ch/HSI/hippi/procintf/manufact.htm>.


  3.2.10.  IrDA (Infrared Data Association)


  <20>  Linux support: no?

  <20>  Maximum bandwidth: 1.15 Mb/s and 4 Mb/s

  <20>  Minimum latency: ?

  <20>  Available as: multiple-vendor hardware

  <20>  Interface port/bus used: IrDA

  <20>  Network structure: thin air ;-)

  <20>  Cost per machine connected: $0


  IrDA (Infrared Data Association,  <http://www.irda.org/>) is that
  little infrared device on the side of a lot of laptop PCs.  It is
  inherently difficult to connect more than two machines using this
  interface, so it is unlikely to be used for clustering.  Don Becker
  did some preliminary work with IrDA.


  3.2.11.  Myrinet


  <20>  Linux support: library

  <20>  Maximum bandwidth: 1,280 Mb/s

  <20>  Minimum latency: 9 microseconds

  <20>  Available as: single-vendor hardware

  <20>  Interface port/bus used: PCI

  <20>  Network structure: switched hubs

  <20>  Cost per machine connected: $1,800


  Myrinet  <http://www.myri.com/> is a local area network (LAN) designed
  to also serve as a "system area network" (SAN), i.e., the network
  within a cabinet full of machines connected as a parallel system.  The
  LAN and SAN versions use different physical media and have somewhat
  different characteristics; generally, the SAN version would be used
  within a cluster.

  Myrinet is fairly conventional in structure, but has a reputation for
  being particularly well-implemented.  The drivers for Linux are said
  to perform very well, although shockingly large performance variations
  have been reported with different PCI bus implementations for the host
  computers.

  Currently, Myrinet is clearly the favorite network of cluster groups
  that are not too severely "budgetarily challenged."  If your idea of a
  Linux PC is a high-end Pentium Pro or Pentium II with at least 256 MB
  RAM and a SCSI RAID, the cost of Myrinet is quite reasonable.
  However, using more ordinary PC configurations, you may find that your
  choice is between N machines linked by Myrinet or 2N linked by
  multiple Fast Ethernets and TTL_PAPERS.  It really depends on what
  your budget is and what types of computations you care about most.


  3.2.12.  Parastation


  <20>  Linux support: HAL or socket library

  <20>  Maximum bandwidth: 125 Mb/s

  <20>  Minimum latency: 2 microseconds

  <20>  Available as: single-vendor hardware

  <20>  Interface port/bus used: PCI

  <20>  Network structure: hubless mesh

  <20>  Cost per machine connected: > $1,000


  The ParaStation project  <http://wwwipd.ira.uka.de/parastation> at
  University of Karlsruhe Department of Informatics is building a PVM-
  compatible custom low-latency network.  They first constructed a two-
  processor ParaPC prototype using a custom EISA card interface and PCs
  running BSD UNIX, and then built larger clusters using DEC Alphas.
  Since January 1997, ParaStation has been available for Linux.  The PCI
  cards are being made in cooperation with a company called Hitex (see
  <http://www.hitex.com:80/parastation/>).  Parastation hardware
  implements both fast, reliable, message transmission and simple
  barrier synchronization.


  3.2.13.  PLIP


  <20>  Linux support: kernel driver

  <20>  Maximum bandwidth: 1.2 Mb/s

  <20>  Minimum latency: 1,000 microseconds?

  <20>  Available as: commodity hardware

  <20>  Interface port/bus used: SPP

  <20>  Network structure: cable between 2 machines

  <20>  Cost per machine connected: $2


  For just the cost of a "LapLink" cable, PLIP (Parallel Line Interface
  Protocol) allows two Linux machines to communicate through standard
  parallel ports using standard socket-based software.  In terms of
  bandwidth, latency, and scalability, this is not a very serious
  network technology; however, the near-zero cost and the software
  compatibility are useful.  The driver is part of the standard Linux
  kernel distributions.


  3.2.14.  SCI


  <20>  Linux support: no

  <20>  Maximum bandwidth: 4,000 Mb/s

  <20>  Minimum latency: 2.7 microseconds

  <20>  Available as: multiple-vendor hardware

  <20>  Interface port/bus used: PCI, proprietary

  <20>  Network structure: ?

  <20>  Cost per machine connected: > $1,000


  The goal of SCI (Scalable Coherent Interconnect, ANSI/IEEE 1596-1992)
  is essentially to provide a high performance mechanism that can
  support coherent shared memory access across large numbers of
  machines, as well various types of block message transfers.  It is
  fairly safe to say that the designed bandwidth and latency of SCI are
  both "awesome" in comparison to most other network technologies.  The
  catch is that SCI is not widely available as cheap production units,
  and there isn't any Linux support.

  SCI primarily is used in various proprietary designs for logically-
  shared physically-distributed memory machines, such as the HP/Convex
  Exemplar SPP and the Sequent NUMA-Q 2000 (see
  <http://www.sequent.com/>).  However, SCI is available as a PCI
  interface card and 4-way switches (up to 16 machines can be connected
  by cascading four 4-way switches) from Dolphin,
  <http://www.dolphinics.com/>, as their CluStar product line.  A good
  set of links overviewing SCI is maintained by CERN at
  <http://www.cern.ch/HSI/sci/sci.html>.


  3.2.15.  SCSI


  <20>  Linux support: kernel drivers

  <20>  Maximum bandwidth: 5 Mb/s to over 20 Mb/s

  <20>  Minimum latency: ?

  <20>  Available as: multiple-vendor hardware

  <20>  Interface port/bus used: PCI, EISA, ISA card

  <20>  Network structure: inter-machine bus sharing SCSI devices

  <20>  Cost per machine connected: ?


  SCSI (Small Computer Systems Interconnect) is essentially an I/O bus
  that is used for disk drives, CD ROMS, image scanners, etc.  There are
  three separate standards SCSI-1, SCSI-2, and SCSI-3; Fast and Ultra
  speeds; and data path widths of 8, 16, or 32 bits (with FireWire
  compatibility also mentioned in SCSI-3).  It is all pretty confusing,
  but we all know a good SCSI is somewhat faster than EIDE and can
  handle more devices more efficiently.

  What many people do not realize is that it is fairly simple for two
  computers to share a single SCSI bus.  This type of configuration is
  very useful for sharing disk drives between machines and implementing
  fail-over - having one machine take over database requests when the
  other machine fails.  Currently, this is the only mechanism supported
  by Microsoft's PC cluster product, WolfPack.  However, the inability
  to scale to larger systems renders shared SCSI uninteresting for
  parallel processing in general.


  3.2.16.  ServerNet


  <20>  Linux support: no

  <20>  Maximum bandwidth: 400 Mb/s

  <20>  Minimum latency: 3 microseconds

  <20>  Available as: single-vendor hardware

  <20>  Interface port/bus used: PCI

  <20>  Network structure: hexagonal tree/tetrahedral lattice of hubs

  <20>  Cost per machine connected: ?


  ServerNet is the high-performance network hardware from Tandem,
  <http://www.tandem.com>.  Especially in the online transation
  processing (OLTP) world, Tandem is well known as a leading producer of
  high-reliability systems, so it is not surprising that their network
  claims not just high performance, but also "high data integrity and
  reliability."  Another interesting aspect of ServerNet is that it
  claims to be able to transfer data from any device directly to any
  device; not just between processors, but also disk drives, etc., in a
  one-sided style similar to that suggested by the MPI remote memory
  access mechanisms described in section 3.5.  One last comment about
  ServerNet:  although there is just a single vendor, that vendor is
  powerful enough to potentially establish ServerNet as a major
  standard...  Tandem is owned by Compaq.


  3.2.17.  SHRIMP


  <20>  Linux support: user-level memory mapped interface

  <20>  Maximum bandwidth: 180 Mb/s

  <20>  Minimum latency: 5 microseconds

  <20>  Available as: research prototype

  <20>  Interface port/bus used: EISA

  <20>  Network structure: mesh backplane (as in Intel Paragon)

  <20>  Cost per machine connected: ?


  The SHRIMP project,  <http://www.CS.Princeton.EDU/shrimp/>, at the
  Princeton University Computer Science Department is building a
  parallel computer using PCs running Linux as the processing elements.
  The first SHRIMP (Scalable, High-Performance, Really Inexpensive
  Multi-Processor) was a simple two-processor prototype using a dual-
  ported RAM on a custom EISA card interface.  There is now a prototype
  that will scale to larger configurations using a custom interface card
  to connect to a "hub" that is essentially the same mesh routing
  network used in the Intel Paragon (see
  <http://www.ssd.intel.com/paragon.html>).  Considerable effort has
  gone into developing low-overhead "virtual memory mapped
  communication" hardware and support software.


  3.2.18.  SLIP


  <20>  Linux support: kernel drivers

  <20>  Maximum bandwidth: 0.1 Mb/s

  <20>  Minimum latency: 1,000 microseconds?

  <20>  Available as: commodity hardware

  <20>  Interface port/bus used: RS232C

  <20>  Network structure: cable between 2 machines

  <20>  Cost per machine connected: $2


  Although SLIP (Serial Line Interface Protocol) is firmly planted at
  the low end of the performance spectrum, SLIP (or CSLIP or PPP) allows
  two machines to perform socket communication via ordinary RS232 serial
  ports.  The RS232 ports can be connected using a null-modem RS232
  serial cable, or they can even be connected via dial-up through a
  modem.  In any case, latency is high and bandwidth is low, so SLIP
  should be used only when no other alternatives are available.  It is
  worth noting, however, that most PCs have two RS232 ports, so it would
  be possible to network a group of machines simply by connecting the
  machines as a linear array or as a ring.  There is even load sharing
  software called EQL.


  3.2.19.  TTL_PAPERS


  <20>  Linux support: AFAPI library

  <20>  Maximum bandwidth: 1.6 Mb/s

  <20>  Minimum latency: 3 microseconds

  <20>  Available as: public-domain design, single-vendor hardware

  <20>  Interface port/bus used: SPP

  <20>  Network structure: tree of hubs

  <20>  Cost per machine connected: $100


  The PAPERS (Purdue's Adapter for Parallel Execution and Rapid
  Synchronization) project,  <http://garage.ecn.purdue.edu/~papers/>, at
  the Purdue University School of Electrical and Computer Engineering is
  building scalable, low-latency, aggregate function communication
  hardware and software that allows a parallel supercomputer to be built
  using unmodified PCs/workstations as nodes.

  There have been over a dozen different types of PAPERS hardware built
  that connect to PCs/workstations via the SPP (Standard Parallel Port),
  roughly following two development lines.  The versions called "PAPERS"
  target higher performance, using whatever technologies are
  appropriate; current work uses FPGAs, and high bandwidth PCI bus
  interface designs are also under development.  In contrast, the
  versions called "TTL_PAPERS" are designed to be easily reproduced
  outside Purdue, and are remarkably simple public domain designs that
  can be built using ordinary TTL logic.  One such design is produced
  commercially,  <http://chelsea.ios.com:80/~hgdietz/sbm4.html>.

  Unlike the custom hardware designs from other universities, TTL_PAPERS
  clusters have been assembled at many universities from the USA to
  South Korea.  Bandwidth is severely limited by the SPP connections,
  but PAPERS implements very low latency aggregate function
  communications; even the fastest message-oriented systems cannot
  provide comparable performance on those aggregate functions.  Thus,
  PAPERS is particularly good for synchronizing the displays of a video
  wall (to be discussed further in the upcoming Video Wall HOWTO),
  scheduling accesses to a high-bandwidth network, evaluating global
  fitness in genetic searches, etc.  Although PAPERS clusters have been
  built using IBM PowerPC AIX, DEC Alpha OSF/1, and HP PA-RISC HP-UX
  machines, Linux-based PCs are the platforms best supported.

  User programs using TTL_PAPERS AFAPI directly access the SPP hardware
  port registers under Linux, without an OS call for each access.  To do
  this, AFAPI first gets port permission using either iopl() or
  ioperm().  The problem with these calls is that both require the user
  program to be privileged, yielding a potential security hole.  The
  solution is an optional kernel patch,
  <http://garage.ecn.purdue.edu/~papers/giveioperm.html>, that allows a
  privileged process to control port permission for any process.


  3.2.20.  USB (Universal Serial Bus)


  <20>  Linux support: kernel driver

  <20>  Maximum bandwidth: 12 Mb/s

  <20>  Minimum latency: ?

  <20>  Available as: commodity hardware

  <20>  Interface port/bus used: USB

  <20>  Network structure: bus

  <20>  Cost per machine connected: $5?


  USB (Universal Serial Bus,  <http://www.usb.org/>) is a hot-pluggable
  conventional-Ethernet-speed, bus for up to 127 peripherals ranging
  from keyboards to video conferencing cameras.  It isn't really clear
  how multiple computers get connected to each other using USB.  In any
  case, USB ports are quickly becoming as standard on PC motherboards as
  RS232 and SPP, so don't be surprised if one or two USB ports are
  lurking on the back of the next PC you buy.  Development of a Linux
  driver is discussed at  <http://peloncho.fis.ucm.es/~inaky/USB.html>.

  In some ways, USB is almost the low-performance, zero-cost, version of
  FireWire that you can purchase today.


  3.2.21.  WAPERS


  <20>  Linux support: AFAPI library

  <20>  Maximum bandwidth: 0.4 Mb/s

  <20>  Minimum latency: 3 microseconds

  <20>  Available as: public-domain design

  <20>  Interface port/bus used: SPP

  <20>  Network structure: wiring pattern between 2-64 machines

  <20>  Cost per machine connected: $5


  WAPERS (Wired-AND Adapter for Parallel Execution and Rapid
  Synchronization) is a spin-off of the PAPERS project,
  <http://garage.ecn.purdue.edu/~papers/>, at the Purdue University
  School of Electrical and Computer Engineering.  If implemented
  properly, the SPP has four bits of open-collector output that can be
  wired together across machines to implement a 4-bit wide wired AND.
  This wired-AND is electrically touchy, and the maximum number of
  machines that can be connected in this way critically depends on the
  analog properties of the ports (maximum sink current and pull-up
  resistor value); typically, up to 7 or 8 machines can be networked by
  WAPERS.  Although cost and latency are very low, so is bandwidth;
  WAPERS is much better as a second network for aggregate operations
  than as the only network in a cluster.  As with TTL_PAPERS, to improve
  system security, there is a minor kernel patch recommended, but not
  required:  <http://garage.ecn.purdue.edu/~papers/giveioperm.html>.


  3.3.  Network Software Interface


  Before moving on to discuss the software support for parallel
  applications, it is useful to first briefly cover the basics of low-
  level software interface to the network hardware.  There are really
  only three basic choices:  sockets, device drivers, and user-level
  libraries.


  3.3.1.  Sockets


  By far the most common low-level network interface is a socket
  interface.  Sockets have been a part of unix for over a decade, and
  most standard network hardware is designed to support at least two
  types of socket protocols:  UDP and TCP.  Both types of socket allow
  you to send arbitrary size blocks of data from one machine to another,
  but there are several important differences.  Typically, both yield a
  minimum latency of around 1,000 microseconds, although performance can
  be far worse depending on network traffic.

  These socket types are the basic network software interface for most
  of the portable, higher-level, parallel processing software; for
  example, PVM uses a combination of UDP and TCP, so knowing the
  difference will help you tune performance.  For even better
  performance, you can also use these mechanisms directly in your
  program.  The following is just a simple overview of UDP and TCP; see
  the manual pages and a good network programming book for details.


  3.3.1.1.  UDP Protocol (SOCK_DGRAM)


  UDP is the User Datagram Protocol, but you more easily can remember
  the properties of UDP as Unreliable Datagram Processing.  In other
  words, UDP allows each block to be sent as an individual message, but
  a message might be lost in transmission.  In fact, depending on
  network traffic, UDP messages can be lost, can arrive multiple times,
  or can arrive in an order different from that in which they were sent.
  The sender of a UDP message does not automatically get an
  acknowledgment, so it is up to user-written code to detect and
  compensate for these problems.  Fortunately, UDP does ensure that if a
  message arrives, the message contents are intact (i.e., you never get
  just part of a UDP message).

  The nice thing about UDP is that it tends to be the fastest socket
  protocol.  Further, UDP is "connectionless," which means that each
  message is essentially independent of all others.  A good analogy is
  that each message is like a letter to be mailed; you might send
  multiple letters to the same address, but each one is independent of
  the others and there is no limit on how many people you can send
  letters to.


  3.3.1.2.  TCP Protocol (SOCK_STREAM)


  Unlike UDP, TCP is a reliable, connection-based, protocol.  Each block
  sent is not seen as a message, but as a block of data within an
  apparently continuous stream of bytes being transmitted through a
  connection between sender and receiver.  This is very different from
  UDP messaging because each block is simply part of the byte stream and
  it is up to the user code to figure-out how to extract each block from
  the byte stream; there are no markings separating messages.  Further,
  the connections are more fragile with respect to network problems, and
  only a limited number of connections can exist simultaneously for each
  process.  Because it is reliable, TCP generally implies significantly
  more overhead than UDP.

  There are, however, a few pleasant surprises about TCP.  One is that,
  if multiple messages are sent through a connection, TCP is able to
  pack them together in a buffer to better match network hardware packet
  sizes, potentially yielding better-than-UDP performance for groups of
  short or oddly-sized messages.  The other bonus is that networks
  constructed using reliable direct physical links between machines can
  easily and efficiently simulate TCP connections.  For example, this
  was done for the ParaStation's "Socket Library" interface software,
  which provides TCP semantics using user-level calls that differ from
  the standard TCP OS calls only by the addition of the prefix PSS to
  each function name.


  3.3.2.  Device Drivers


  When it comes to actually pushing data onto the network or pulling
  data off the network, the standard unix software interface is a part
  of the unix kernel called a device driver.  UDP and TCP don't just
  transport data, they also imply a fair amount of overhead for socket
  management.  For example, something has to manage the fact that
  multiple TCP connections can share a single physical network
  interface. In contrast, a device driver for a dedicated network
  interface only needs to implement a few simple data transport
  functions.  These device driver functions can then be invoked by user
  programs by using open() to identify the proper device and then using
  system calls like read() and write() on the open "file." Thus, each
  such operation could transport a block of data with little more than
  the overhead of a system call, which might be as fast as tens of
  microseconds.

  Writing a device driver to be used with Linux is not hard...  if you
  know precisely how the device hardware works.  If you are not sure how
  it works, don't guess.  Debugging device drivers isn't fun and
  mistakes can fry hardware.  However, if that hasn't scared you off, it
  may be possible to write a device driver to, for example, use
  dedicated Ethernet cards as dumb but fast direct machine-to-machine
  connections without the usual Ethernet protocol overhead.  In fact,
  that's pretty much what some early Intel supercomputers did....  Look
  at the Device Driver HOWTO for more information.


  3.3.3.  User-Level Libraries


  If you've taken an OS course, user-level access to hardware device
  registers is exactly what you have been taught never to do, because
  one of the primary purposes of an OS is to control device access.
  However, an OS call is at least tens of microseconds of overhead.  For
  custom network hardware like TTL_PAPERS, which can perform a basic
  network operation in just 3 microseconds, such OS call overhead is
  intolerable.  The only way to avoid that overhead is to have user-
  level code - a user-level library - directly access hardware device
  registers.  Thus, the question becomes one of how a user-level library
  can access hardware directly, yet not compromise the OS control of
  device access rights.
  On a typical system, the only way for a user-level library to directly
  access hardware device registers is to:


  1. At user program start-up, use an OS call to map the page of memory
     address space containing the device registers into the user process
     virtual memory map.  For some systems, the mmap() call (first
     mentioned in section 2.6) can be used to map a special file which
     represents the physical memory page addresses of the I/O devices.
     Alternatively, it is relatively simple to write a device driver to
     perform this function.  Further, this device driver can control
     access by only mapping the page(s) containing the specific device
     registers needed, thereby maintaining OS access control.

  2. Access device registers without an OS call by simply loading or
     storing to the mapped addresses.  For example, *((char *) 0x1234) =
     5; would store the byte value 5 into memory location 1234
     (hexadecimal).

  Fortunately, it happens that Linux for the Intel 386 (and compatible
  processors) offers an even better solution:


  1. Using the ioperm() OS call from a privileged process, get
     permission to access the precise I/O port addresses that correspond
     to the device registers.  Alternatively, permission can be managed
     by an independent privileged user process (i.e., a "meta OS") using
     the giveioperm() OS call
     <http://garage.ecn.purdue.edu/~papers/giveioperm.html> patch for
     Linux.

  2. Access device registers without an OS call by using 386 port I/O
     instructions.


  This second solution is preferable because it is common that multiple
  I/O devices have their registers within a single page, in which case
  the first technique would not provide protection against accessing
  other device registers that happened to reside in the same page as the
  ones intended.  Of course, the down side is that 386 port I/O
  instructions cannot be coded in C - instead, you will need to use a
  bit of assembly code.  The GCC-wrapped (usable in C programs) inline
  assembly code function for a port input of a byte value is:


  ______________________________________________________________________
  extern inline unsigned char
  inb(unsigned short port)
  {
      unsigned char _v;
  __asm__ __volatile__ ("inb %w1,%b0"
                        :"=a" (_v)
                        :"d" (port), "0" (0));
      return _v;
  }
  ______________________________________________________________________


  Similarly, the GCC-wrapped code for a byte port output is:


  ______________________________________________________________________
  extern inline void
  outb(unsigned char value,
  unsigned short port)
  {
  __asm__ __volatile__ ("outb %b0,%w1"
                        :/* no outputs */
                        :"a" (value), "d" (port));
  }
  ______________________________________________________________________


  3.4.  PVM (Parallel Virtual Machine)


  PVM (Parallel Virtual Machine) is a freely-available, portable,
  message-passing library generally implemented on top of sockets.  It
  is clearly established as the de-facto standard for message-passing
  cluster parallel computing.

  PVM supports single-processor and SMP Linux machines, as well as
  clusters of Linux machines linked by socket-capable networks (e.g.,
  SLIP, PLIP, Ethernet, ATM).  In fact, PVM will even work across groups
  of machines in which a variety of different types of processors,
  configurations, and physical networks are used - Heterogeneous
  Clusters - even to the scale of treating machines linked by the
  Internet as a parallel cluster.  PVM also provides facilities for
  parallel job control across a cluster.  Best of all, PVM has long been
  freely available (currently from
  <http://www.epm.ornl.gov/pvm/pvm_home.html>), which has led to many
  programming language compilers, application libraries, programming and
  debugging tools, etc., using it as their "portable message-passing
  target library."  There is also a network newsgroup,
  comp.parallel.pvm.

  It is important to note, however, that PVM message-passing calls
  generally add significant overhead to standard socket operations,
  which already had high latency.  Further, the message handling calls
  themselves do not constitute a particularly "friendly" programming
  model.

  Using the same Pi computation example first described in section 1.3,
  the version using C with PVM library calls is:


  ______________________________________________________________________
  #include <stdlib.h>
  #include <stdio.h>
  #include <pvm3.h>

  #define NPROC   4

  main(int argc, char **argv)
  {
    register double lsum, width;
    double sum;
    register int intervals, i;
    int mytid, iproc, msgtag = 4;
    int tids[NPROC];  /* array of task ids */

    /* enroll in pvm */
    mytid = pvm_mytid();

    /* Join a group and, if I am the first instance,
       iproc=0, spawn more copies of myself
    */
    iproc = pvm_joingroup("pi");

    if (iproc == 0) {
      tids[0] = pvm_mytid();
      pvm_spawn("pvm_pi", &argv[1], 0, NULL, NPROC-1, &tids[1]);
    }
    /* make sure all processes are here */
    pvm_barrier("pi", NPROC);

    /* get the number of intervals */
    intervals = atoi(argv[1]);
    width = 1.0 / intervals;

    lsum = 0.0;
    for (i = iproc; i<intervals; i+=NPROC) {
      register double x = (i + 0.5) * width;
      lsum += 4.0 / (1.0 + x * x);
    }

    /* sum across the local results & scale by width */
    sum = lsum * width;
    pvm_reduce(PvmSum, &sum, 1, PVM_DOUBLE, msgtag, "pi", 0);

    /* have only the console PE print the result */
    if (iproc == 0) {
      printf("Estimation of pi is %f\n", sum);
    }

    /* Check program finished, leave group, exit pvm */
    pvm_barrier("pi", NPROC);
    pvm_lvgroup("pi");
    pvm_exit();
    return(0);
  }
  ______________________________________________________________________


  3.5.  MPI (Message Passing Interface)


  Although PVM is the de-facto standard message-passing library, MPI
  (Message Passing Interface) is the relatively new official standard.
  The home page for the MPI standard is
  <http://www.mcs.anl.gov:80/mpi/> and the newsgroup is
  comp.parallel.mpi.

  However, before discussing MPI, I feel compelled to say a little bit
  about the PVM vs. MPI religious war that has been going on for the
  past few years.  I'm not really on either side.  Here's my attempt at
  a relatively unbiased summary of the differences:


     Execution control environment.
        Put simply, PVM has one and MPI doesn't specify how/if one is
        implemented.  Thus, things like starting a PVM program executing
        are done identically everywhere, while for MPI it depends on
        which implementation is being used.


     Support for heterogeneous clusters.
        PVM grew-up in the workstation cycle-scavenging world, and thus
        directly manages heterogeneous mixes of machines and operating
        systems.  In contrast, MPI largely assumes that the target is an
        MPP (Massively Parallel Processor) or a dedicated cluster of
        nearly identical workstations.


     Kitchen sink syndrome.
        PVM evidences a unity of purpose that MPI 2.0 doesn't.  The new
        MPI 2.0 standard includes a lot of features that go way beyond
        the basic message passing model - things like RMA (Remote Memory
        Access) and parallel file I/O.  Are these things useful?  Of
        course they are...  but learning MPI 2.0 is a lot like learning
        a complete new programming language.


     User interface design.
        MPI was designed after PVM, and clearly learned from it.  MPI
        offers simpler, more efficient, buffer handling and higher-level
        abstractions allowing user-defined data structures to be
        transmitted in messages.


     The force of law.
        By my count, there are still significantly more things designed
        to use PVM than there are to use MPI; however, porting them to
        MPI is easy, and the fact that MPI is backed by a widely-
        supported formal standard means that using MPI is, for many
        institutions, a matter of policy.

  Conclusion?  Well, there are at least three independently developed,
  freely available, versions of MPI that can run on clusters of Linux
  systems (and I wrote one of them):


  <20>  LAM (Local Area Multicomputer) is a full implementation of the MPI
     1.1 standard.  It allows MPI programs to be executed within an
     individual Linux system or across a cluster of Linux systems using
     UDP/TCP socket communication.  The system includes simple execution
     control facilities, as well as a variety of program development and
     debugging aids.  It is freely available from
     <http://www.osc.edu/lam.html>.

  <20>  MPICH (MPI CHameleon) is designed as a highly portable full
     implementation of the MPI 1.1 standard.  Like LAM, it allows MPI
     programs to be executed within an individual Linux system or across
     a cluster of Linux systems using UDP/TCP socket communication.
     However, the emphasis is definitely on promoting MPI by providing
     an efficient, easily retargetable, implementation.  To port this
     MPI implementation, one implements either the five functions of the
     "channel interface" or, for better performance, the full MPICH ADI
     (Abstract Device Interface).  MPICH, and lots of information about
     it and porting, are available from
     <http://www.mcs.anl.gov/mpi/mpich/>.

  <20>  AFMPI (Aggregate Function MPI) is a subset implementation of the
     MPI 2.0 standard.  This is the one that I wrote.  Built on top of
     the AFAPI, it is designed to showcase low-latency collective
     communication functions and RMAs, and thus provides only minimal
     support for MPI data types, communicators, etc.  It allows C
     programs using MPI to run on an individual Linux system or across a
     cluster connected by AFAPI-capable network hardware.  It is freely
     available from  <http://garage.ecn.purdue.edu/~papers/>.

  No matter which of these (or other) MPI implementations one uses, it
  is fairly simple to perform the most common types of communications.

  However, MPI 2.0 incorporates several communication paradigms that are
  fundamentally different enough so that a programmer using one of them
  might not even recognize the other coding styles as MPI.  Thus, rather
  than giving a single example program, it is useful to have an example
  of each of the fundamentally different communication paradigms that
  MPI supports.  All three programs implement the same basic algorithm
  (from section 1.3) that is used throughout this HOWTO to compute the
  value of Pi.

  The first MPI program uses basic MPI message-passing calls for each
  processor to send its partial sum to processor 0, which sums and
  prints the result:


  ______________________________________________________________________
  #include <stdlib.h>
  #include <stdio.h>
  #include <mpi.h>

  main(int argc, char **argv)
  {
    register double width;
    double sum, lsum;
    register int intervals, i;
    int nproc, iproc;
    MPI_Status status;

    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
    intervals = atoi(argv[1]);
    width = 1.0 / intervals;
    lsum = 0;
    for (i=iproc; i<intervals; i+=nproc) {
      register double x = (i + 0.5) * width;
      lsum += 4.0 / (1.0 + x * x);
    }
    lsum *= width;
    if (iproc != 0) {
      MPI_Send(&lbuf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
      sum = lsum;
      for (i=1; i<nproc; ++i) {
        MPI_Recv(&lbuf, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                 MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        sum += lsum;
      }
      printf("Estimation of pi is %f\n", sum);
    }
    MPI_Finalize();
    return(0);
  }
  ______________________________________________________________________


  The second MPI version uses collective communication (which, for this
  particular application, is clearly the most appropriate):


  ______________________________________________________________________
  #include <stdlib.h>
  #include <stdio.h>
  #include <mpi.h>

  main(int argc, char **argv)
  {
    register double width;
    double sum, lsum;
    register int intervals, i;
    int nproc, iproc;

    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
    intervals = atoi(argv[1]);
    width = 1.0 / intervals;
    lsum = 0;
    for (i=iproc; i<intervals; i+=nproc) {
      register double x = (i + 0.5) * width;
      lsum += 4.0 / (1.0 + x * x);
    }
    lsum *= width;
    MPI_Reduce(&lsum, &sum, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (iproc == 0) {
      printf("Estimation of pi is %f\n", sum);
    }
    MPI_Finalize();
    return(0);
  }
  ______________________________________________________________________


  The third MPI version uses the MPI 2.0 RMA mechanism for each
  processor to add its local lsum into sum on processor 0:


  ______________________________________________________________________
  #include <stdlib.h>
  #include <stdio.h>
  #include <mpi.h>

  main(int argc, char **argv)
  {
    register double width;
    double sum = 0, lsum;
    register int intervals, i;
    int nproc, iproc;
    MPI_Win sum_win;

    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) exit(1);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
    MPI_Win_create(&sum, sizeof(sum), sizeof(sum),
                   0, MPI_COMM_WORLD, &sum_win);
    MPI_Win_fence(0, sum_win);
    intervals = atoi(argv[1]);
    width = 1.0 / intervals;
    lsum = 0;
    for (i=iproc; i<intervals; i+=nproc) {
      register double x = (i + 0.5) * width;
      lsum += 4.0 / (1.0 + x * x);
    }
    lsum *= width;
    MPI_Accumulate(&lsum, 1, MPI_DOUBLE, 0, 0,
                   1, MPI_DOUBLE, MPI_SUM, sum_win);
    MPI_Win_fence(0, sum_win);
    if (iproc == 0) {
      printf("Estimation of pi is %f\n", sum);
    }
    MPI_Finalize();
    return(0);
  }
  ______________________________________________________________________


  It is useful to note that the MPI 2.0 RMA mechanism very neatly
  overcomes any potential problems with the corresponding data structure
  on various processors residing at different memory locations.  This is
  done by referencing a "window" that implies the base address,
  protection against out-of-bound accesses, and even address scaling.
  Efficient implementation is aided by the fact that RMA processing may
  be delayed until the next MPI_Win_fence.  In summary, the RMA
  mechanism may be a strange cross between distributed shared memory and
  message passing, but it is a very clean interface that potentially
  generates very efficient communication.


  3.6.  AFAPI (Aggregate Function API)


  Unlike PVM, MPI, etc., the AFAPI (Aggregate Function Application
  Program Interface) did not start life as an attempt to build a
  portable abstract interface layered on top of existing network
  hardware and software.  Rather, AFAPI began as the very hardware-
  specific low-level support library for PAPERS (Purdue's Adapter for
  Parallel Execution and Rapid Synchronization; see
  <http://garage.ecn.purdue.edu/~papers/>).

  PAPERS was discussed briefly in section 3.2; it is a public domain
  design custom aggregate function network that delivers latencies as
  low as a few microseconds.  However, the important thing about PAPERS
  is that it was developed as an attempt to build a supercomputer that
  would be a better target for compiler technology than existing
  supercomputers.  This is qualitatively different from most Linux
  cluster efforts and PVM/MPI, which generally focus on trying to use
  standard networks for the relatively few sufficiently coarse-grain
  parallel applications.  The fact that Linux PCs are used as components
  of PAPERS systems is simply an artifact of implementing prototypes in
  the most cost-effective way possible.

  The need for a common low-level software interface across more than a
  dozen different prototype implementations was what made the PAPERS
  library become standardized as AFAPI.  However, the model used by
  AFAPI is inherently simpler and better suited for the finer-grain
  interactions typical of code compiled by parallelizing compilers or
  written for SIMD architectures.  The simplicity of the model not only
  makes PAPERS hardware easy to build, but also yields surprisingly
  efficient AFAPI ports for a variety of other hardware systems, such as
  SMPs.

  AFAPI currently runs on Linux clusters connected using TTL_PAPERS,
  CAPERS, or WAPERS.  It also runs (without OS calls or even bus-lock
  instructions, see section 2.2) on SMP systems using a System V Shared
  Memory library called SHMAPERS.  A version that runs across Linux
  clusters using UDP broadcasts on conventional networks (e.g.,
  Ethernet) is under development.  All released versions are available
  from  <http://garage.ecn.purdue.edu/~papers/>.  All versions of the
  AFAPI are designed to be called from C or C++.

  The following example program is the AFAPI version of the Pi
  computation described in section 1.3.


  ______________________________________________________________________
  #include <stdlib.h>
  #include <stdio.h>
  #include "afapi.h"

  main(int argc, char **argv)
  {
    register double width, sum;
    register int intervals, i;

    if (p_init()) exit(1);

    intervals = atoi(argv[1]);
    width = 1.0 / intervals;

    sum = 0;
    for (i=IPROC; i<intervals; i+=NPROC) {
      register double x = (i + 0.5) * width;
      sum += 4.0 / (1.0 + x * x);
    }

    sum = p_reduceAdd64f(sum) * width;

    if (IPROC == CPROC) {
      printf("Estimation of pi is %f\n", sum);
    }

    p_exit();
    return(0);
  }
  ______________________________________________________________________


  3.7.  Other Cluster Support Libraries


  In addition to PVM, MPI, and AFAPI, the following libraries offer
  features that may be useful in parallel computing using a cluster of
  Linux systems.  These systems are given a lighter treatment in this
  document simply because, unlike PVM, MPI, and AFAPI, I have little or
  no direct experience with the use of these systems on Linux clusters.
  If you find any of these or other libraries to be especially useful,
  please send email to me at hankd@engr.uky.edu describing what you've
  found, and I will consider adding an expanded section on that library.


  3.7.1.  Condor (process migration support)


  Condor is a distributed resource management system that can manage
  large heterogeneous clusters of workstations.  Its design has been
  motivated by the needs of users who would like to use the unutilized
  capacity of such clusters for their long-running, computation-
  intensive jobs.  Condor preserves a large measure of the originating
  machine's environment on the execution machine, even if the
  originating and execution machines do not share a common file system
  and/or password mechanisms.  Condor jobs that consist of a single
  process are automatically checkpointed and migrated between
  workstations as needed to ensure eventual completion.

  Condor is available at  <http://www.cs.wisc.edu/condor/>.  A Linux
  port exists; more information is available at
  <http://www.cs.wisc.edu/condor/linux/linux.html>.  Contact condor-
  admin@cs.wisc.edu for details.


  3.7.2.  DFN-RPC (German Research Network - Remote Procedure Call)


  The DFN-RPC, a (German Research Network Remote Procedure Call) tool,
  was developed to distribute and parallelize scientific-technical
  application programs between a workstation and a compute server or a
  cluster. The interface is optimized for applications written in
  fortran, but the DFN-RPC can also be used in a C environment.  It has
  been ported to Linux.  More information is at  <ftp://ftp.uni-
  stuttgart.de/pub/rus/dfn_rpc/README_dfnrpc.html>.


  3.7.3.  DQS (Distributed Queueing System)


  Not exactly a library, DQS 3.0 (Distributed Queueing System) is a job
  queueing system that has been developed and tested under Linux.  It is
  designed to allow both use and administration of a heterogeneous
  cluster as a single entity.  It is available from
  <http://www.scri.fsu.edu/~pasko/dqs.html>.

  There is also a commercial version called CODINE 4.1.1 (COmputing in
  DIstributed Network Environments).  Information on it is available
  from  <http://www.genias.de/genias_welcome.html>.


  3.8.  General Cluster References


  Because clusters can be constructed and used in so many different
  ways, there are quite a few groups that have made interesting
  contributions.  The following are references to various cluster-
  related projects that may be of general interest.  This includes a mix
  of Linux-specific and generic cluster references.  The list is given
  in alphabetical order.


  3.8.1.  Beowulf


  The Beowulf project,  <http://cesdis1.gsfc.nasa.gov/beowulf/>, centers
  on production of software for using off-the-shelf clustered
  workstations based on commodity PC-class hardware, a high-bandwidth
  cluster-internal network, and the Linux operating system.

  Thomas Sterling has been the driving force behind Beowulf, and
  continues to be an eloquent and outspoken proponent of Linux
  clustering for scientific computing in general. In fact, many groups
  now refer to their clusters as "Beowulf class" systems - even if the
  cluster isn't really all that similar to the official Beowulf design.

  Don Becker, working in support of the Beowulf project, has produced
  many of the network drivers used by Linux in general.  Many of these
  drivers have even been adapted for use in BSD.  Don also is
  responsible for many of these Linux network drivers allowing load-
  sharing across multiple parallel connections to achieve higher
  bandwidth without expensive switched hubs.  This type of load sharing
  was the original distinguishing feature of the Beowulf cluster.


  3.8.2.  Linux/AP+


  The Linux/AP+ project,  <http://cap.anu.edu.au/cap/projects/linux/>,
  is not exactly about Linux clustering, but centers on porting Linux to
  the Fujitsu AP1000+ and adding appropriate parallel processing
  enhancements.  The AP1000+ is a commercially available SPARC-based
  parallel machine that uses a custom network with a torus topology, 25
  MB/s bandwidth, and 10 microsecond latency...  in short, it looks a
  lot like a SPARC Linux cluster.


  3.8.3.  Locust


  The Locust project,  <http://www.ecsl.cs.sunysb.edu/~manish/locust/>,
  is building a distributed virtual shared memory system that uses
  compile-time information to hide message-latency and to reduce network
  traffic at run time.  Pupa is the underlying communication subsystem
  of Locust, and is implemented using Ethernet to connect 486 PCs under
  FreeBSD.  Linux?


  3.8.4.  Midway DSM (Distributed Shared Memory)


  Midway,
  <http://www.cs.cmu.edu/afs/cs.cmu.edu/project/midway/WWW/HomePage.html>,
  is a software-based DSM (Distributed Shared Memory) system, not unlike
  TreadMarks.  The good news is that it uses compile-time aids rather
  than relatively slow page-fault mechanisms, and it is free.  The bad
  news is that it doesn't run on Linux clusters.


  3.8.5.  Mosix


  MOSIX modifies the BSDI BSD/OS to provide dynamic load balancing and
  preemptive process migration across a networked group of PCs.  This is
  nice stuff not just for parallel processing, but for generally using a
  cluster much like a scalable SMP.  Will there be a Linux version?
  Look at  <http://www.cs.huji.ac.il/mosix/> for more information.


  3.8.6.  NOW (Network Of Workstations)


  The Berkeley NOW (Network Of Workstations) project,
  <http://now.cs.berkeley.edu/>, has led much of the push toward
  parallel computing using networks of workstations.  There is a lot
  work going on here, all aimed toward "demonstrating a practical 100
  processor system in the next few years."  Alas, they don't use Linux.


  3.8.7.  Parallel Processing Using Linux


  The parallel processing using Linux WWW site,
  <http://aggregate.org/LDP/>, is the home of this HOWTO and many
  related documents including online slides for a full-day tutorial.
  Aside from the work on the PAPERS project, the Purdue University
  School of Electrical and Computer Engineering generally has been a
  leader in parallel processing; this site was established to help
  others apply Linux PCs for parallel processing.

  Since Purdue's first cluster of Linux PCs was assembled in February
  1994, there have been many Linux PC clusters assembled at Purdue,
  including several with video walls.  Although these clusters used 386,
  486, and Pentium systems (no Pentium Pro systems), Intel recently
  awarded Purdue a donation which will allow it to construct multiple
  large clusters of Pentium II systems (with as many as 165 machines
  planned for a single cluster).  Although these clusters all have/will
  have PAPERS networks, most also have conventional networks.


  3.8.8.  Pentium Pro Cluster Workshop


  In Des Moines, Iowa, April 10-11, 1997, AMES Laboratory held the
  Pentium Pro Cluster Workshop.  The WWW site from this workshop,
  <http://www.scl.ameslab.gov/workshops/PPCworkshop.html>, contains a
  wealth of PC cluster information gathered from all the attendees.


  3.8.9.  TreadMarks DSM (Distributed Shared Memory)


  DSM (Distributed Shared Memory) is a technique whereby a message-
  passing system can appear to behave as an SMP.  There are quite a few
  such systems, most of which use the OS page-fault mechanism to trigger
  message transmissions.  TreadMarks,
  <http://www.cs.rice.edu/~willy/TreadMarks/overview.html>, is one of
  the more efficient of such systems and does run on Linux clusters.
  The bad news is "TreadMarks is being distributed at a small cost to
  universities and nonprofit institutions." For more information about
  the software, contact treadmarks@ece.rice.edu.


  3.8.10.  U-Net (User-level NETwork interface architecture)


  The U-Net (User-level NETwork interface architecture) project at
  Cornell,  <http://www2.cs.cornell.edu/U-Net/Default.html>, attempts to
  provide low-latency and high-bandwidth using commodity network
  hardware by by virtualizing the network interface so that applications
  can send and receive messages without operating system intervention.
  U-Net runs on Linux PCs using a DECchip DC21140 based Fast Ethernet
  card or a Fore Systems PCA-200 (not PCA-200E) ATM card.


  3.8.11.  WWT (Wisconsin Wind Tunnel)


  There is really quite a lot of cluster-related work at Wisconsin.  The
  WWT (Wisconsin Wind Tunnel) project,  <http://www.cs.wisc.edu/~wwt/>,
  is doing all sorts of work toward developing a "standard" interface
  between compilers and the underlying parallel hardware.  There is the
  Wisconsin COW (Cluster Of Workstations), Cooperative Shared Memory and
  Tempest, the Paradyn Parallel Performance Tools, etc.  Unfortunately,
  there is not much about Linux.


  4.  SIMD Within A Register (e.g., using MMX)


  SIMD (Single Instruction stream, Multiple Data stream) Within A
  Register (SWAR) isn't a new idea.  Given a machine with k-bit
  registers, data paths, and function units, it has long been known that
  ordinary register operations can function as SIMD parallel operations
  on n, k/n-bit, integer field values.  However, it is only with the
  recent push for multimedia that the 2x to 8x speedup offered by SWAR
  techniques has become a concern for mainstream computing.  The 1997
  versions of most microprocessors incorporate hardware support for
  SWAR:


  <20>  AMD K6 MMX (MultiMedia eXtensions)

  <20>  Cyrix M2 MMX (MultiMedia eXtensions)

  <20>  Digital Alpha MAX (MultimediA eXtensions)

  <20>  Hewlett-Packard PA-RISC MAX (Multimedia Acceleration eXtensions)

  <20>  Intel Pentium II & Pentium with MMX (MultiMedia eXtensions)

  <20>  Microunity Mediaprocessor SIGD (Single Instruction on Groups of
     Data)

  <20>  MIPS Digital Media eXtension (MDMX, pronounced Mad Max)

  <20>  Sun SPARC V9 VIS (Visual Instruction Set)

  There are a few holes in the hardware support provided by the new
  microprocessors, quirks like only supporting some operations for some
  field sizes.  It is important to remember, however, that you don't
  need any hardware support for many SWAR operations to be efficient.
  For example, bitwise operations are not affected by the logical
  partitioning of a register.


  4.1.  SWAR: What Is It Good For?


  Although every modern processor is capable of executing with at least
  some SWAR parallelism, the sad fact is that even the best SWAR-
  enhanced instruction sets do not support very general-purpose
  parallelism.  In fact, many people have noticed that the performance
  difference between Pentium and "Pentium with MMX technology" is often
  due to things like the larger L1 cache that coincided with appearance
  of MMX.  So, realistically, what is SWAR (or MMX) good for?
  <20>  Integers only, the smaller the better.  Two 32-bit values fit in a
     64-bit MMX register, but so do eight one-byte characters or even an
     entire chess board worth of one-bit values.

     Note: there will be a floating-point version of MMX, although very
     little has been said about it at this writing.  Cyrix has posted a
     set of slides,  <ftp://ftp.cyrix.com/developr/mpf97rm.pdf>, that
     includes a few comments about MMFP.  Apparently, MMFP will support
     two 32-bit floating-point numbers to be packed into a 64-bit MMX
     register; combining this with two MMFP pipelines will yield four
     single-precision FLOPs per clock.

  <20>  SIMD or vector-style parallelism.  The same operation is applied to
     all fields simultaneously.  There are ways to nullify the effects
     on selected fields (i.e., equivalent to SIMD enable masking), but
     they complicate coding and hurt performance.

  <20>  Localized, regular (preferably packed), memory reference patterns.
     SWAR in general, and MMX in particular, are terrible at randomly-
     ordered accesses; gathering a vector x[y] (where y is an index
     array) is prohibitively expensive.

  These are serious restrictions, but this type of parallelism occurs in
  many parallel algorithms - not just multimedia applications.  For the
  right type of algorithm, SWAR is more effective than SMP or cluster
  parallelism...  and it doesn't cost anything to use it.


  4.2.  Introduction To SWAR Programming


  The basic concept of SWAR, SIMD Within A Register, is that operations
  on word-length registers can be used to speed-up computations by
  performing SIMD parallel operations on n k/n-bit field values.
  However, making use of SWAR technology can be awkward, and some SWAR
  operations are actually more expensive than the corresponding
  sequences of serial operations because they require additional
  instructions to enforce the field partitioning.

  To illustrate this point, let's consider a greatly simplified SWAR
  mechanism that manages four 8-bit fields within each 32-bit register.
  The values in two registers might be represented as:


  ______________________________________________________________________
           PE3     PE2     PE1     PE0
        +-------+-------+-------+-------+
  Reg0  | D 7:0 | C 7:0 | B 7:0 | A 7:0 |
        +-------+-------+-------+-------+
  Reg1  | H 7:0 | G 7:0 | F 7:0 | E 7:0 |
        +-------+-------+-------+-------+
  ______________________________________________________________________


  This simply indicates that each register is viewed as essentially a
  vector of four independent 8-bit integer values.  Alternatively, think
  of A and E as values in Reg0 and Reg1 of processing element 0 (PE0), B
  and F as values in PE1's registers, and so forth.

  The remainder of this document briefly reviews the basic classes of
  SIMD parallel operations on these integer vectors and how these
  functions can be implemented.


  4.2.1.  Polymorphic Operations


  Some SWAR operations can be performed trivially using ordinary 32-bit
  integer operations, without concern for the fact that the operation is
  really intended to operate independently in parallel on these 8-bit
  fields.  We call any such SWAR operation polymorphic, since the
  function is unaffected by the field types (sizes).

  Testing if any field is non-zero is polymorphic, as are all bitwise
  logic operations.  For example, an ordinary bitwise-and operation (C's
  & operator) performs a bitwise and no matter what the field sizes are.
  A simple bitwise and of the above registers yields:


  ______________________________________________________________________
            PE3       PE2       PE1       PE0
        +---------+---------+---------+---------+
  Reg2  | D&H 7:0 | C&G 7:0 | B&F 7:0 | A&E 7:0 |
        +---------+---------+---------+---------+
  ______________________________________________________________________


  Because the bitwise and operation always has the value of result bit k
  affected only by the values of the operand bit k values, all field
  sizes are supported using the same single instruction.


  4.2.2.  Partitioned Operations


  Unfortunately, lots of important SWAR operations are not polymorphic.
  Arithmetic operations such as add, subtract, multiply, and divide are
  all subject to carry/borrow interactions between fields.  We call such
  SWAR operations partitioned, because each such operation must
  effectively partition the operands and result to prevent interactions
  between fields.  However, there are actually three different methods
  that can be used to achieve this effect.


  4.2.2.1.  Partitioned Instructions


  Perhaps the most obvious approach to implementing partitioned
  operations is to provide hardware support for "partitioned parallel
  instructions" that cut the carry/borrow logic between fields.  This
  approach can yield the highest performance, but it requires a change
  to the processor's instruction set and generally places many
  restrictions on field size (e.g., 8-bit fields might be supported, but
  not 12-bit fields).

  The AMD/Cyrix/Intel MMX, Digital MAX, HP MAX, and Sun VIS all
  implement restricted versions of partitioned instructions.
  Unfortunately, these different instruction set extensions have
  significantly different restrictions, making algorithms somewhat non-
  portable between them.  For example, consider the following sampling
  of partitioned operations:


  ______________________________________________________________________
    Instruction           AMD/Cyrix/Intel MMX   DEC MAX   HP MAX   Sun VIS
  +---------------------+---------------------+---------+--------+---------+
  | Absolute Difference |                     |       8 |        |       8 |
  +---------------------+---------------------+---------+--------+---------+
  | Merge Maximum       |                     |   8, 16 |        |         |
  +---------------------+---------------------+---------+--------+---------+
  | Compare             |           8, 16, 32 |         |        |  16, 32 |
  +---------------------+---------------------+---------+--------+---------+
  | Multiply            |                  16 |         |        |    8x16 |
  +---------------------+---------------------+---------+--------+---------+
  | Add                 |           8, 16, 32 |         |     16 |  16, 32 |
  +---------------------+---------------------+---------+--------+---------+
  ______________________________________________________________________


  In the table, the numbers indicate the field sizes, in bits, for which
  each operation is supported.  Even though the table omits many
  instructions including all the more exotic ones, it is clear that
  there are many differences.  The direct result is that high-level
  languages (HLLs) really are not very effective as programming models,
  and portability is generally poor.


  4.2.2.2.  Unpartitioned Operations With Correction Code


  Implementing partitioned operations using partitioned instructions can
  certainly be efficient, but what do you do if the partitioned
  operation you need is not supported by the hardware?  The answer is
  that you use a series of ordinary instructions to perform the
  operation with carry/borrow across fields, and then correct for the
  undesired field interactions.

  This is a purely software approach, and the corrections do introduce
  overhead, but it works with fully general field partitioning.  This
  approach is also fully general in that it can be used either to fill
  gaps in the hardware support for partitioned instructions, or it can
  be used to provide full functionality for target machines that have no
  hardware support at all.  In fact, by expressing the code sequences in
  a language like C, this approach allows SWAR programs to be fully
  portable.

  The question immediately arises:  precisely how inefficient is it to
  simulate SWAR partitioned operations using unpartitioned operations
  with correction code?  Well, that is certainly the $64k question...
  but many operations are not as difficult as one might expect.

  Consider implementing a four-element 8-bit integer vector add of two
  source vectors, x+y, using ordinary 32-bit operations.

  An ordinary 32-bit add might actually yield the correct result, but
  not if any 8-bit field carries into the next field.  Thus, our goal is
  simply to ensure that such a carry does not occur.  Because adding two
  k-bit fields generates an at most k+1 bit result, we can ensure that
  no carry occurs by simply "masking out" the most significant bit of
  each field.  This is done by bitwise anding each operand with
  0x7f7f7f7f and then performing an ordinary 32-bit add.


  ______________________________________________________________________
  t = ((x & 0x7f7f7f7f) + (y & 0x7f7f7f7f));
  ______________________________________________________________________


  That result is correct...  except for the most significant bit within
  each field.  Computing the correct value for each field is simply a
  matter of doing two 1-bit partitioned adds of the most significant
  bits from x and y to the 7-bit carry result which was computed for t.
  Fortunately, a 1-bit partitioned add is implemented by an ordinary
  exclusive or operation.  Thus, the result is simply:


  ______________________________________________________________________
  (t ^ ((x ^ y) & 0x80808080))
  ______________________________________________________________________


  Ok, well, maybe that isn't so simple.  After all, it is six operations
  to do just four adds.  However, notice that the number of operations
  is not a function of how many fields there are...  so, with more
  fields, we get speedup.  In fact, we may get speedup anyway simply
  because the fields were loaded and stored in a single (integer vector)
  operation, register availability may be improved, and there are fewer
  dynamic code scheduling dependencies (because partial word references
  are avoided).


  4.2.2.3.  Controlling Field Values


  While the other two approaches to partitioned operation implementation
  both center on getting the maximum space utilization for the
  registers, it can be computationally more efficient to instead control
  the field values so that inter-field carry/borrow events should never
  occur.  For example, if we know that all the field values being added
  are such that no field overflow will occur, a partitioned add
  operation can be implemented using an ordinary add instruction; in
  fact, given this constraint, an ordinary add instruction appears
  polymorphic, and is usable for any field sizes without correction
  code.  The question thus becomes how to ensure that field values will
  not cause carry/borrow events.

  One way to ensure this property is to implement partitioned
  instructions that can restrict the range of field values.  The Digital
  MAX vector minimum and maximum instructions can be viewed as hardware
  support for clipping field values to avoid inter-field carry/borrow.

  However, suppose that we do not have partitioned instructions that can
  efficiently restrict the range of field values...  is there a
  sufficient condition that can be cheaply imposed to ensure
  carry/borrow events will not interfere with adjacent fields?  The
  answer lies in analysis of the arithmetic properties.  Adding two k-
  bit numbers generates a result with at most k+1 bits; thus, a field of
  k+1 bits can safely contain such an operation despite using ordinary
  instructions.

  Thus, suppose that the 8-bit fields in our earlier example are now
  7-bit fields with 1-bit "carry/borrow spacers":


  ______________________________________________________________________
                PE3          PE2          PE1          PE0
        +----+-------+----+-------+----+-------+----+-------+
  Reg0  | D' | D 6:0 | C' | C 6:0 | B' | B 6:0 | A' | A 6:0 |
        +----+-------+----+-------+----+-------+----+-------+
  ______________________________________________________________________


  A vector of 7-bit adds is performed as follows.  Let us assume that,
  prior to the start of any partitioned operation, all the carry spacer
  bits (A', B', C', and D') have the value 0.  By simply executing an
  ordinary add operation, all the fields obtain the correct 7-bit
  values; however, some spacer bit values might now be 1.  We can
  correct this by just one more conventional operation, masking-out the
  spacer bits.  Our 7-bit integer vector add, x+y, is thus:


  ______________________________________________________________________
  ((x + y) & 0x7f7f7f7f)
  ______________________________________________________________________


  This is just two instructions for four adds, clearly yielding good
  speedup.

  The sharp reader may have noticed that setting the spacer bits to 0
  does not work for subtract operations.  The correction is, however,
  remarkably simple.  To compute x-y, we simply ensure the initial
  condition that the spacers in x are all 1, while the spacers in y are
  all 0.  In the worst case, we would thus get:


  ______________________________________________________________________
  (((x | 0x80808080) - y) & 0x7f7f7f7f)
  ______________________________________________________________________


  However, the additional bitwise or operation can often be optimized
  out by ensuring that the operation generating the value for x used |
  0x80808080 rather than & 0x7f7f7f7f as the last step.

  Which method should be used for SWAR partitioned operations?  The
  answer is simply "whichever yields the best speedup."  Interestingly,
  the ideal method to use may be different for different field sizes
  within the same program running on the same machine.


  4.2.3.  Communication & Type Conversion Operations


  Although some parallel computations, including many operations on
  image pixels, have the property that the ith value in a vector is a
  function only of values that appear in the ith position of the operand
  vectors, this is generally not the case.  For example, even pixel
  operations such as smoothing require values from adjacent pixels as
  operands, and transformations like FFTs require more complex (less
  localized) communication patterns.

  It is not difficult to efficiently implement 1-dimensional nearest
  neighbor communication for SWAR using unpartitioned shift operations.
  For example, to move a value from PEi to PE(i+1), a simple shift
  operation suffices.  If the fields are 8-bits in length, we would use:


  ______________________________________________________________________
  (x << 8)
  ______________________________________________________________________


  Still, it isn't always quite that simple.  For example, to move a
  value from PEi to PE(i-1), a simple shift operation might suffice...
  but the C language does not specify if shifts right preserve the sign
  bit, and some machines only provide signed shift right.  Thus, in the
  general case, we must explicitly zero the potentially replicated sign
  bits:


  ______________________________________________________________________
  ((x >> 8) & 0x00ffffff)
  ______________________________________________________________________


  Adding "wrap-around connections" is also reasonably efficient using
  unpartitioned shifts.  For example, to move a value from PEi to
  PE(i+1) with wraparound:


  ______________________________________________________________________
  ((x << 8) | ((x >> 24) & 0x000000ff))
  ______________________________________________________________________


  The real problem comes when more general communication patterns must
  be implemented.  Only the HP MAX instruction set supports arbitrary
  rearrangement of fields with a single instruction, which is called
  Permute.  This Permute instruction is really misnamed; not only can it
  perform an arbitrary permutation of the fields, but it also allows
  repetition.  In short, it implements an arbitrary x[y] operation.

  Unfortunately, x[y] is very difficult to implement without such an
  instruction.  The code sequence is generally both long and
  inefficient; in fact, it is sequential code.  This is very
  disappointing.  The relatively high speed of x[y] operations in the
  MasPar MP1/MP2 and Thinking Machines CM1/CM2/CM200 SIMD supercomputers
  was one of the key reasons these machines performed well.  However,
  x[y] has always been slower than nearest neighbor communication, even
  on those supercomputers, so many algorithms have been designed to
  minimize the need for x[y] operations.  In short, without hardware
  support, it is probably best to develop SWAR algorithms as though x[y]
  wasn't legal...  or at least isn't cheap.


  4.2.4.  Recurrence Operations (Reductions, Scans, etc.)


  A recurrence is a computation in which there is an apparently
  sequential relationship between values being computed.  However, if
  these recurrences involve associative operations, it may be possible
  to recode the computation using a tree-structured parallel algorithm.

  The most common type of parallelizable recurrence is probably the
  class known as associative reductions.  For example, to compute the
  sum of a vector's values, one commonly writes purely sequential C code
  like:


  ______________________________________________________________________
  t = 0;
  for (i=0; i<MAX; ++i) t += x[i];
  ______________________________________________________________________


  However, the order of the additions is rarely important.  Floating
  point and saturation math can yield different answers if the order of
  additions is changed, but ordinary wrap-around integer additions will
  yield the same results independent of addition order.  Thus, we can
  re-write this sequence into a tree-structured parallel summation in
  which we first add pairs of values, then pairs of those partial sums,
  and so forth, until a single final sum results.  For a vector of four
  8-bit values, just two addition steps are needed; the first step does
  two 8-bit adds, yielding two 16-bit result fields (each containing a
  9-bit result):


  ______________________________________________________________________
  t = ((x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff));
  ______________________________________________________________________


  The second step adds these two 9-bit values in 16-bit fields to
  produce a single 10-bit result:


  ______________________________________________________________________
  ((t + (t >> 16)) & 0x000003ff)
  ______________________________________________________________________


  Actually, the second step performs two 16-bit field adds...  but the
  top 16-bit add is meaningless, which is why the result is masked to a
  single 10-bit result value.

  Scans, also known as "parallel prefix" operations, are somewhat harder
  to implement efficiently.  This is because, unlike reductions, scans
  produce partitioned results.  For this reason, scans can be
  implemented using a fairly obvious sequence of partitioned operations.


  4.3.  MMX SWAR Under Linux


  For Linux, IA32 processors are our primary concern.  The good news is
  that AMD, Cyrix, and Intel all implement the same MMX instructions.
  However, MMX performance varies; for example, the K6 has only one MMX
  pipeline - the Pentium with MMX has two.  The only really bad news is
  that Intel is still running those stupid MMX commercials....  ;-)

  There are really three approaches to using MMX for SWAR:


  1. Use routines from an MMX library.  In particular, Intel has
     developed several "performance libraries,"
     <http://developer.intel.com/drg/tools/ad.htm>, that offer a variety
     of hand-optimized routines for common multimedia tasks.  With a
     little effort, many non-multimedia algorithms can be reworked to
     enable some of the most compute-intensive portions to be
     implemented using one or more of these library routines.  These
     libraries are not currently available for Linux, but could be
     ported.

  2. Use MMX instructions directly.  This is somewhat complicated by two
     facts.  The first problem is that MMX might not be available on the
     processor, so an alternative implementation must also be provided.
     The second problem is that the IA32 assembler generally used under
     Linux does not currently recognize MMX instructions.

  3. Use a high-level language or module compiler that can directly
     generate appropriate MMX instructions.  Such tools are currently
     under development, but none is yet fully functional under Linux.
     For example, at Purdue University (
     <http://dynamo.ecn.purdue.edu/~hankd/SWAR/>) we are currently
     developing a compiler that will take functions written in an
     explicitly parallel C dialect and will generate SWAR modules that
     are callable as C functions, yet make use of whatever SWAR support
     is available, including MMX.  The first prototype module compilers
     were built in Fall 1996, however, bringing this technology to a
     usable state is taking much longer than was originally expected.

  In summary, MMX SWAR is still awkward to use.  However, with a little
  extra effort, the second approach given above can be used now.  Here
  are the basics:


  1. You cannot use MMX if your processor does not support it.  The
     following GCC code can be used to test if MMX is supported on your
     processor.  It returns 0 if not, non-zero if it is supported.


     ___________________________________________________________________
     inline extern
     int mmx_init(void)
     {
             int mmx_available;

             __asm__ __volatile__ (
                     /* Get CPU version information */
                     "movl $1, %%eax\n\t"
                     "cpuid\n\t"
                     "andl $0x800000, %%edx\n\t"
                     "movl %%edx, %0"
                     : "=q" (mmx_available)
                     : /* no input */
             );
             return mmx_available;
     }
     ___________________________________________________________________


  2. An MMX register essentially holds one of what GCC would call an
     unsigned long long.  Thus, memory-based variables of this type
     become the communication mechanism between your MMX modules and the
     C programs that call them.  Alternatively, you can declare your MMX
     data as any 64-bit aligned data structure (it is convenient to
     ensure 64-bit alignment by declaring your data type as a union with
     an unsigned long long field).

  3. If MMX is available, you can write your MMX code using the .byte
     assembler directive to encode each instruction.  This is painful
     stuff to do by hand, but not difficult for a compiler to generate.
     For example, the MMX instruction PADDB MM0,MM1 could be encoded as
     the GCC in-line assembly code:


     ___________________________________________________________________
     __asm__ __volatile__ (".byte 0x0f, 0xfc, 0xc1\n\t");
     ___________________________________________________________________


  Remember that MMX uses some of the same hardware that is used for
  floating point operations, so code intermixed with MMX code must not
  invoke any floating point operations.  The floating point stack also
  should be empty before executing any MMX code; the floating point
  stack is normally empty at the beginning of a C function that does not
  use floating point.

  4. Exit your MMX code by executing the EMMS instruction, which can be
     encoded as:


     ___________________________________________________________________
     __asm__ __volatile__ (".byte 0x0f, 0x77\n\t");
     ___________________________________________________________________


  If the above looks very awkward and crude, it is.  However, MMX is
  still quite young....  future versions of this document will offer
  better ways to program MMX SWAR.


  5.  Linux-Hosted Attached Processors


  Although this approach has recently fallen out of favor, it is
  virtually impossible for other parallel processing methods to achieve
  the low cost and high performance possible by using a Linux system to
  host an attached parallel computing system.  The problem is that very
  little software support is available; you are pretty much on your own.


  5.1.  A Linux PC Is A Good Host


  In general, attached parallel processors tend to be specialized to
  perform specific types of functions.

  Before becoming discouraged by the fact that you are somewhat on your
  own, it is useful to understand that, although it may be difficult to
  get a Linux PC to appropriately host a particular system, a Linux PC
  is one of the few platforms well suited to this type of use.

  PCs make a good host for two primary reasons.  The first is the cheap
  and easy expansion capability; resources such as more memory, disks,
  networks, etc., are trivially added to a PC.  The second is the ease
  of interfacing.  Not only are ISA and PCI bus prototyping cards widely
  available, but the parallel port offers reasonable performance in a
  completely non-invasive interface.  The IA32 separate I/O space also
  facilitates interfacing by providing hardware I/O address protection
  at the level of individual I/O port addresses.

  Linux also makes a good host OS.  The free availability of full source
  code, and extensive "hacking" guides, obviously are a tremendous help.
  However, Linux also provides good near-real-time scheduling, and there
  is even a true real-time version of Linux at
  <http://luz.cs.nmt.edu/~rtlinux/>.  Perhaps even more important is the
  fact that while providing a full UNIX environment, Linux can support
  development tools that were written to run under Microsoft DOS and/or
  Windows.  MSDOS programs can execute within a Linux process using
  dosemu to provide a protected virtual machine that can literally run
  MSDOS.  Linux support for Windows 3.xx programs is even more direct:
  free software such as wine,  <http://www.linpro.no/wine/>, simulates
  Windows 3.11 well enough for most programs to execute correctly and
  efficiently within a UNIX/X environment.

  The following two sections give examples of attached parallel systems
  that I'd like to see supported under Linux....


  5.2.  Did You DSP That?


  There is a thriving market for high-performance DSP (Digital Signal
  Processing) processors.  Although these chips were generally designed
  to be embedded in application-specific systems, they also make great
  attached parallel computers.  Why?


  <20>  Many of them, such as the Texas Instruments ( <http://www.ti.com/>)
     TMS320 and the Analog Devices ( <http://www.analog.com/>) SHARC DSP
     families, are designed to construct parallel machines with little
     or no "glue" logic.

  <20>  They are cheap, especially per MIP or MFLOP.  Including the cost of
     basic support logic, it is not unheard of for a DSP processor to be
     one tenth the cost of a PC processor with comparable performance.

  <20>  They do not use much power nor generate much heat.  This means that
     it is possible to have a bunch of these chips powered by a
     conventional PC's power supply - and enclosing them in your PC's
     case will not turn it into an oven.

  <20>  There are strange-looking things in most DSP instruction sets that
     high-level (e.g., C) compilers are unlikely to use well - for
     example, "Bit Reverse Addressing."  Using an attached parallel
     system, it is possible to straightforwardly compile and run most
     code on the host, while running the most time-consuming few
     algorithms on the DSPs as carefully hand-tuned code.

  <20>  These DSP processors are not really designed to run a UNIX-like OS,
     and generally are not very good as stand-alone general-purpose
     computer processors.  For example, many do not have memory
     management hardware.  In other words, they work best when hosted by
     a more general-purpose machine...  such as a Linux PC.

  Although some audio cards and modems include DSP processors that Linux
  drivers can access, the big payoff comes from using an attached
  parallel system that has four or more DSP processors.

  Because the Texas Instruments TMS320 series,
  <http://www.ti.com/sc/docs/dsps/dsphome.htm>, has been very popular
  for a long time, and it is trivial to construct a TMS320-based
  parallel processor, there are quite a few such systems available.
  There are both integer-only and floating-point capable versions of the
  TMS320; older designs used a somewhat unusual single-precision
  floating-point format, but the new models support IEEE formats.  The
  older TMS320C4x (aka, 'C4x) achieves up to 80 MFLOPS using the TI-
  specific single-precision floating-point format; in contrast, a single
  'C67x will provide up to 1 GFLOPS single-precision or 420 MFLOPS
  double-precision for IEEE floating point calculations, using a VLIW-
  based chip architecture called VelociTI.  Not only is it easy to
  configure a group of these chips as a multiprocessor, but in a single
  chip, the 'C8x multiprocessor will provide a 100 MFLOPS IEEE floating-
  point RISC master processor along with either two or four integer
  slave DSPs.

  The other DSP processor family that has been used in more than a few
  attached parallel systems lately is the SHARC (aka, ADSP-2106x) from
  Analog Devices  <http://www.analog.com/>.  These chips can be
  configured as a 6-processor shared memory multiprocessor without
  external glue logic, and larger systems also can be configured using
  six 4-bit links/chip.  Most of the larger systems seem targeted to
  military applications, and are a bit pricey.  However, Integrated
  Computing Engines, Inc.,  <http://www.iced.com/>, makes an interesting
  little two-board PCI card set called GreenICE.  This unit contains an
  array of 16 SHARC processors, and is capable of delivering a peak
  speed of about 1.9 GFLOPS using a single-precision IEEE format.
  GreenICE costs less than $5,000.

  In my opinion, attached parallel DSPs really deserve a lot more
  attention from the Linux parallel processing community....


  5.3.  FPGAs And Reconfigurable Logic Computing


  If parallel processing is all about getting the highest speedup, then
  why not build custom hardware?  Well, we all know the answers; it
  costs too much, takes too long to develop, becomes useless when we
  change the algorithm even slightly, etc.  However, recent advances in
  electrically reprogrammable FPGAs (Field Programmable Gate Arrays)
  have nullified most of those objections.  Now, the gate density is
  high enough so that an entire simple processor can be built within a
  single FPGA, and the time to reconfigure (reprogram) an FPGA has also
  been dropping to a level where it is reasonable to reconfigure even
  when moving from one phase of an algorithm to the next.

  This stuff is not for the weak of heart:  you'll have to work with
  hardware description languages like VHDL for the FPGA configuration,
  as well as writing low-level code to interface to programs on the
  Linux host system.  However, the cost of FPGAs is low, and especially
  for algorithms operating on low-precision integer data (actually, a
  small superset of the stuff SWAR is good at), FPGAs can perform
  complex operations just about as fast as you can feed them data.  For
  example, simple FPGA-based systems have yielded better-than-
  supercomputer times for searching gene databases.

  There are other companies making appropriate FPGA-based hardware, but
  the following two companies represent a good sample.

  Virtual Computer Company offers a variety of products using
  dynamically reconfigurable SRAM-based Xilinx FPGAs.  Their 8/16 bit
  "Virtual ISA Proto Board"  <http://www.vcc.com/products/isa.html> is
  less than $2,000.

  The Altera ARC-PCI (Altera Reconfigurable Computer, PCI bus),
  <http://www.altera.com/html/new/pressrel/pr_arc-pci.html>, is a
  similar type of card, but uses Altera FPGAs and a PCI bus interface
  rather than ISA.

  Many of the design tools, hardware description languages, compilers,
  routers, mappers, etc., come as object code only that runs under
  Windows and/or DOS.  You could simply keep a disk partition with
  DOS/Windows on your host PC and reboot whenever you need to use them,
  however, many of these software packages may work under Linux using
  dosemu or Windows emulators like wine.


  6.  Of General Interest


  The material covered in this section applies to all four parallel
  processing models for Linux.


  6.1.  Programming Languages And Compilers


  I am primarily known as a compiler researcher, so I'd like to be able
  to say that there are lots of really great compilers automatically
  generating efficient parallel code for Linux systems.  Unfortunately,
  the truth is that it is hard to beat the performance obtained by
  expressing your parallel program using various explicit communication
  and other parallel operations within C code that is compiled by GCC.

  The following language/compiler projects represent some of the best
  efforts toward producing reasonably efficient code from high-level
  languages.  Generally, each is reasonably effective for the kinds of
  programming tasks it targets, but none is the powerful general-purpose
  language and compiler system that will make you forever stop writing C
  programs to compile with GCC...  which is fine.  Use these languages
  and compilers as they were intended, and you'll be rewarded with
  shorter development times, easier debugging and maintenance, etc.

  There are plenty of languages and compilers beyond those listed here
  (in alphabetical order).  A list of freely available compilers (most
  of which have nothing to do with Linux parallel processing) is at
  <http://www.idiom.com/free-compilers/>.


  6.1.1.  Fortran 66/77/PCF/90/HPF/95


  At least in the scientific computing community, there will always be
  Fortran.  Of course, now Fortran doesn't mean the same thing it did in
  the 1966 ANSI standard.  Basically, Fortran 66 was pretty simple
  stuff.  Fortran 77 added tons of features, the most noticeable of
  which were the improved support for character data and the change of
  DO loop semantics.  PCF (Parallel Computing Forum) Fortran attempted
  to add a variety of parallel processing support features to 77.
  Fortran 90 is a fully-featured modern language, essentially adding
  C++-like object-oriented programming features and parallel array
  syntax to the 77 language.  HPF (High-Performance Fortran,
  <http://www.crpc.rice.edu/HPFF/home.html>), which has itself gone
  through two versions (HPF-1 and HPF-2), is essentially the enhanced,
  standardized, version of what many of us used to know as CM Fortran,
  MasPar Fortran, or Fortran D; it extends Fortran 90 with a variety of
  parallel processing enhancements, largely focussed on specifying data
  layouts.  Finally, Fortran 95 represents a relatively minor
  enhancement and refinement of 90.

  What works with C generally can also work with f2c, g77 (a nice Linux-
  specific overview is at  <http://linux.uni-
  regensburg.de/psi_linux/gcc/html_g77/g77_91.html>), or the commercial
  Fortran 90/95 products from
  <http://extweb.nag.co.uk/nagware/NCNJNKNM.html>.  This is because all
  of these compilers eventually come down to the same code-generation
  used in the back-end of GCC.

  Commercial Fortran parallelizers that can generate code for SMPs are
  available from  <http://www.kai.com/> and
  <http://www.psrv.com/vast/vast_parallel.html>.  It is not clear if
  these compilers will work for SMP Linux, but it should be possible
  given that the standard POSIX threads (i.e., LinuxThreads) work under
  SMP Linux.

  The Portland Group,  <http://www.pgroup.com/>, has commercial
  parallelizing HPF Fortran (and C, C++) compilers that generate code
  for SMP Linux; they also have a version targeting clusters using MPI
  or PVM.  FORGE/spf/xHPF products at  < http://www.apri.com/> might
  also be useful for SMPs or clusters.

  Freely available parallelizing Fortrans that might be made to work
  with parallel Linux systems include:


  <20>  ADAPTOR (Automatic DAta Parallelism TranslaTOR,
     <http://www.gmd.de/SCAI/lab/adaptor/adaptor_home.html>), which can
     translate HPF into Fortran 77/90 code with MPI or PVM calls, but
     does not mention Linux.

  <20>  Fx  <http://www.cs.cmu.edu/~fx/Fx> at Carnegie Mellon targets some
     workstation clusters, but Linux?

  <20>  HPFC (prototype HPF Compiler,
     <http://www.cri.ensmp.fr/~coelho/hpfc.html>) generates Fortran 77
     code with PVM calls.  Is it usable on a Linux cluster?

  <20>  Can PARADIGM (PARAllelizing compiler for DIstributed-memory
     General-purpose Multicomputers,
     <http://www.crhc.uiuc.edu/Paradigm/>) be used with Linux?

  <20>  The Polaris compiler,
     <http://ece.www.ecn.purdue.edu/~eigenman/polaris/>, generates
     Fortran code for shared memory multiprocessors, and may soon be
     retargeted to PAPERS Linux clusters.

  <20>  PREPARE,
     <http://www.irisa.fr/EXTERNE/projet/pampa/PREPARE/prepare.html>,
     targets MPI clusters...  it is not clear if it can generate code to
     run on IA32 processors.

  <20>  Combining ADAPT and ADLIB, shpf (Subset High Performance Fortran
     compilation system,
     <http://www.ccg.ecs.soton.ac.uk/Projects/shpf/shpf.html>) is public
     domain and generates Fortran 90 with MPI calls...  so, if you have
     a Fortran 90 compiler under Linux....

  <20>  SUIF (Stanford University Intermediate Form, see
     <http://suif.stanford.edu/>) has parallelizing compilers for both C
     and Fortran.  This is also the focus of the National Compiler
     Infrastructure Project...  so, is anybody targeting parallel Linux
     systems?

  I'm sure that I have omitted many potentially useful compilers for
  various dialects of Fortran, but there are so many that it is
  difficult to keep track.  In the future, I would prefer to list only
  those compilers known to work with Linux.  Please email comments
  and/or corrections to hankd@engr.uky.edu.


  6.1.2.  GLU (Granular Lucid)


  GLU (Granular Lucid) is a very high-level programming system based on
  a hybrid programming model that combines intensional (Lucid) and
  imperative models.  It supports both PVM and TCP sockets.  Does it run
  under Linux?  More information is available at
  <http://www.csl.sri.com/GLU.html>.


  6.1.3.  Jade And SAM


  Jade is a parallel programming language that extends C to exploit
  coarse-grain concurrency in sequential, imperative programs.  It
  assumes a distributed shared memory model, which is implemented by SAM
  for workstation clusters using PVM.  More information is available at
  <http://suif.stanford.edu/~scales/sam.html>.


  6.1.4.  Mentat And Legion


  Mentat is an object-oriented parallel processing system that works
  with workstation clusters and has been ported to Linux.  Mentat
  Programming Language (MPL) is an object-oriented programming language
  based on C++.  The Mentat run-time system uses something vaguely
  resembling non-blocking remote procedure calls.  More information is
  available at  <http://www.cs.virginia.edu/~mentat/>.

  Legion  <http://www.cs.virginia.edu/~legion/> is built on top on
  Mentat, providing the appearance of a single virtual machine across
  wide-area networked machines.


  6.1.5.  MPL (MasPar Programming Language)


  Not to be confussed with Mentat's MPL, this language was originally
  developed as the native parallel C dialect for the MasPar SIMD
  supercomputers.  Well, MasPar isn't really in that business any more
  (they are now NeoVista Solutions,  <http://www.neovista.com>, a data
  mining company), but their MPL compiler was built using GCC, so it is
  still freely available.  In a joint effort between the University of
  Alabama at Huntsville and Purdue University, MasPar's MPL has been
  retargeted to generate C code with AFAPI calls (see section 3.6), and
  thus runs on both Linux SMPs and clusters.  The compiler is, however,
  somewhat buggy...  see
  <http://www.math.luc.edu/~laufer/mspls/papers/cohen.ps>.


  6.1.6.  PAMS (Parallel Application Management System)


  Myrias is a company selling a software product called PAMS (Parallel
  Application Management System).  PAMS provides very simple directives
  for virtual shared memory parallel processing.  Networks of Linux
  machines are not yet supported.  See  <http://www.myrias.com/> for
  more information.


  6.1.7.  Parallaxis-III


  Parallaxis-III is a structured programming language that extends
  Modula-2 with "virtual processors and connections" for data
  parallelism (a SIMD model).  The Parallaxis software comprises
  compilers for sequential and parallel computer systems, a debugger
  (extensions to the gdb and xgbd debugger), and a large variety of
  sample algorithms from different areas, especially image processing.
  This runs on sequential Linux systems...  an old version supported
  various parallel targets, and the new version also will (e.g.,
  targeting a PVM cluster).  More information is available at
  <http://www.informatik.uni-stuttgart.de/ipvr/bv/p3/p3.html>.


  6.1.8.  pC++/Sage++


  pC++/Sage++ is a language extension to C++ that permits data-parallel
  style operations using "collections of objects" from some base
  "element" class.  It is a preprocessor generating C++ code that can
  run under PVM.  Does it run under Linux?  More information is
  available at  <http://www.extreme.indiana.edu/sage/>.


  6.1.9.  SR (Synchronizing Resources)


  SR (Synchronizing Resources) is a concurrent programming language in
  which resources encapsulate processes and the variables they share;
  operations provide the primary mechanism for process interaction. SR
  provides a novel integration of the mechanisms for invoking and
  servicing operations. Consequently, all of local and remote procedure
  call, rendezvous, message passing, dynamic process creation,
  multicast, and semaphores are supported. SR also supports shared
  global variables and operations.

  It has been ported to Linux, but it isn't clear what parallelism it
  can execute with.  More information is available at
  <http://www.cs.arizona.edu/sr/www/index.html>.


  6.1.10.  ZPL And IronMan


  ZPL is an array-based programming language intended to support
  engineering and scientific applications.  It generates calls to a
  simple message-passing interface called IronMan, and the few functions
  which constitute this interface can be easily implemented using nearly
  any message-passing system.  However, it is primarily targeted to PVM
  and MPI on workstation clusters, and Linux is supported.  More
  information is available at
  <http://www.cs.washington.edu/research/projects/orca3/zpl/www/>.


  6.2.  Performance Issues


  There are a lot of people who spend a lot of time benchmarking
  particular motherboards, network cards, etc., trying to determine
  which is the best.  The problem with that approach is that by the time
  you've been able to benchmark something, it is no longer the best
  available; it even may have been taken off the market and replaced by
  a revised model with entirely different properties.

  Buying PC hardware is like buying orange juice.  Usually, it is made
  with pretty good stuff no matter what company name is on the label.
  Few people know, or care, where the components (or orange juice
  concentrate) came from.  That said, there are some hardware
  differences that you should pay attention to.  My advice is simply
  that you be aware of what you can expect from the hardware under
  Linux, and then focus your attention on getting rapid delivery, a good
  price, and a reasonable policy for returns.

  An excellent overview of the different PC processors is given in
  <http://www.pcguide.com/ref/cpu/fam/>; in fact, the whole WWW site
  <http://www.pcguide.com/> is full of good technical overviews of PC
  hardware.  It is also useful to know a bit about performance of
  specific hardware configurations, and the Linux Benchmarking HOWTO
  <http://sunsite.unc.edu/LDP/HOWTO/Benchmarking-HOWTO.html> is a good
  place to start.

  The Intel IA32 processors have many special registers that can be used
  to measure the performance of a running system in exquisite detail.
  Intel VTune,  <http://developer.intel.com/design/perftool/vtune/>,
  uses the performance registers extensively in a very complete code-
  tuning system...  that unfortunately doesn't run under Linux.  A
  loadable module device driver, and library routines, for accessing the
  Pentium performance registers is available from
  <http://www.cs.umd.edu/users/akinlar/driver.html>.  Keep in mind that
  these performance registers are different on different IA32
  processors; this code works only with Pentium, not with 486, Pentium
  Pro, Pentium II, K6, etc.

  Another comment on performance is appropriate, especially for those of
  you who want to build big clusters and put them in small spaces.  At
  least some modern processors incorporate thermal sensors and circuits
  that are used to slow the internal clock rate if operating temperature
  gets too high (an attempt to reduce heat output and improve
  reliability).  I'm not suggesting that everyone should go buy a
  peltier device (heat pump) to cool each CPU, but you should be aware
  that high operating temperature does not just shorten component life -
  it also can directly reduce system performance.  Do not arrange your
  computers in physical configurations that block airflow, trap heat
  within confined areas, etc.

  Finally, performance isn't just speed, but also reliability and
  availability.  High reliability means that your system almost never
  crashes, even when components fail...  which generally requires
  special features like redundant power supplies and hot-swap
  motherboards.  That usually isn't cheap.  High availability refers to
  the concept that your system is available for use nearly all the
  time...  the system may crash when components fail, but the system is
  quickly repaired and rebooted.  There is a High-Availability HOWTO
  that discusses many of the basic issues.  However, especially for
  clusters, high availablity can be achieved simply by having a few
  spares.  I recommend at least one spare, and prefer to have at least
  one spare for every 16 machines in a large cluster.  Discarding faulty
  hardware and replacing it with a spare can yield both higher
  availability and lower cost than a maintenance contract.


  6.3.  Conclusion - It's Out There


  So, is anybody doing parallel processing using Linux?  Yes!

  It wasn't very long ago that a lot of people were wondering if the
  death of many parallel-processing supercomputer companies meant that
  parallel processing was on its way out.  I didn't think it was dead
  then (see  <http://dynamo.ecn.purdue.edu/~hankd/Opinions/pardead.html>
  for a fun overview of what I think really happened), and it seems
  quite clear now that parallel processing is again on the rise.  Even
  Intel, which just recently stopped making parallel supercomputers, is
  proud of the parallel processing support in things like MMX and the
  upcoming IA64 EPIC (Explicitly Parallel Instruction Computer).

  If you search for "Linux" and "parallel" with your favorite search
  engine, you'll find quite a few places are involved in parallel
  processing using Linux.  In particular, Linux PC clusters seem to be
  popping-up everywhere.  The appropriateness of Linux, combined with
  the low cost and high performance of PC hardware, have made parallel
  processing using Linux a popular approach to supercomputing for both
  small, budget-constrained, groups and large, well-funded, national
  research laboratories.

  Various projects listed elsewhere in this document maintain lists of
  "kindred" research sites that have similar parallel Linux
  configurations.  However, at
  <http://yara.ecn.purdue.edu/~pplinux/Sites/>, there is a hypertext
  document intended to provide photographs, descriptions, and contact
  information for all the various sites using Linux systems for parallel
  processing.  To have information about your site posted there:


  <20>  You must have a "permanent" parallel Linux site:  an SMP, cluster
     of machines, SWAR system, or PC with attached processor, which is
     configured to allow users to execute parallel programs under Linux.
     A Linux-based software environment (e.g., PVM, MPI, AFAPI) that
     directly supports parallel processing must be installed on the
     system.  However, the hardware need not be dedicated to parallel
     processing under Linux, and may be used for completely different
     purposes when parallel programs are not being run.

  <20>  Request that your site be listed.  Send your site information to
     hankd@engr.uky.edu.  Please follow the format used in other entries
     for your site information.  No site will be listed without an
     explicit request from the contact person for that site.

  There are 14 clusters in the current listing, but we are aware of at
  least several dozen Linux clusters world-wide.  Of course, listing
  does not imply any endorsement, etc.; our hope is simply to increase
  awareness, research, and collaboration involving parallel processing
  using Linux.