old-www/HOWTO/Parallel-Processing-HOWTO-5...

178 lines
9.4 KiB
HTML

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
<TITLE>Linux Parallel Processing HOWTO: Linux-Hosted Attached Processors</TITLE>
<LINK HREF="Parallel-Processing-HOWTO-6.html" REL=next>
<LINK HREF="Parallel-Processing-HOWTO-4.html" REL=previous>
<LINK HREF="Parallel-Processing-HOWTO.html#toc5" REL=contents>
</HEAD>
<BODY>
<A HREF="Parallel-Processing-HOWTO-6.html">Next</A>
<A HREF="Parallel-Processing-HOWTO-4.html">Previous</A>
<A HREF="Parallel-Processing-HOWTO.html#toc5">Contents</A>
<HR>
<H2><A NAME="s5">5. Linux-Hosted Attached Processors</A></H2>
<P>
<P>Although this approach has recently fallen out of favor, it is
virtually impossible for other parallel processing methods to achieve
the low cost and high performance possible by using a Linux system to
host an attached parallel computing system. The problem is that very
little software support is available; you are pretty much on your own.
<P>
<H2><A NAME="ss5.1">5.1 A Linux PC Is A Good Host</A>
</H2>
<P>
<P>In general, attached parallel processors tend to be specialized to
perform specific types of functions.
<P>Before becoming discouraged by the fact that you are somewhat on your
own, it is useful to understand that, although it may be difficult to
get a Linux PC to appropriately host a particular system, a Linux PC
is one of the few platforms well suited to this type of use.
<P>PCs make a good host for two primary reasons. The first is the cheap
and easy expansion capability; resources such as more memory, disks,
networks, etc., are trivially added to a PC. The second is the ease
of interfacing. Not only are ISA and PCI bus prototyping cards widely
available, but the parallel port offers reasonable performance in a
completely non-invasive interface. The IA32 separate I/O space also
facilitates interfacing by providing hardware I/O address protection
at the level of individual I/O port addresses.
<P>Linux also makes a good host OS. The free availability of full source
code, and extensive "hacking" guides, obviously are a tremendous help.
However, Linux also provides good near-real-time scheduling, and there
is even a true real-time version of Linux at
<A HREF="http://luz.cs.nmt.edu/~rtlinux/">http://luz.cs.nmt.edu/~rtlinux/</A>. Perhaps even more important
is the fact that while providing a full UNIX environment, Linux can
support development tools that were written to run under Microsoft DOS
and/or Windows. MSDOS programs can execute within a Linux process
using <CODE>dosemu</CODE> to provide a protected virtual machine that can
literally run MSDOS. Linux support for Windows 3.xx programs is even
more direct: free software such as <CODE>wine</CODE>,
<A HREF="http://www.linpro.no/wine/">http://www.linpro.no/wine/</A>, simulates Windows 3.11 well enough
for most programs to execute correctly and efficiently within a UNIX/X
environment.
<P>The following two sections give examples of attached parallel systems
that I'd like to see supported under Linux....
<P>
<H2><A NAME="ss5.2">5.2 Did You DSP That?</A>
</H2>
<P>
<P>There is a thriving market for high-performance DSP (Digital Signal
Processing) processors. Although these chips were generally designed
to be embedded in application-specific systems, they also make great
attached parallel computers. Why?
<P>
<UL>
<LI>Many of them, such as the Texas Instruments (
<A HREF="http://www.ti.com/">http://www.ti.com/</A>) TMS320 and the Analog Devices (
<A HREF="http://www.analog.com/">http://www.analog.com/</A>) SHARC DSP families, are designed to
construct parallel machines with little or no "glue" logic.
</LI>
<LI>They are cheap, especially per MIP or MFLOP. Including the cost
of basic support logic, it is not unheard of for a DSP processor to be
one tenth the cost of a PC processor with comparable performance.
</LI>
<LI>They do not use much power nor generate much heat. This means
that it is possible to have a bunch of these chips powered by a
conventional PC's power supply - and enclosing them in your PC's case
will not turn it into an oven.
</LI>
<LI>There are strange-looking things in most DSP instruction sets
that high-level (e.g., C) compilers are unlikely to use well - for
example, "Bit Reverse Addressing." Using an attached parallel system,
it is possible to straightforwardly compile and run most code on the
host, while running the most time-consuming few algorithms on the DSPs
as carefully hand-tuned code.
</LI>
<LI>These DSP processors are not really designed to run a UNIX-like
OS, and generally are not very good as stand-alone general-purpose
computer processors. For example, many do not have memory management
hardware. In other words, they work best when hosted by a more
general-purpose machine... such as a Linux PC.</LI>
</UL>
<P>Although some audio cards and modems include DSP processors that Linux
drivers can access, the big payoff comes from using an attached
parallel system that has four or more DSP processors.
<P>Because the Texas Instruments TMS320 series,
<A HREF="http://www.ti.com/sc/docs/dsps/dsphome.htm">http://www.ti.com/sc/docs/dsps/dsphome.htm</A>, has been very
popular for a long time, and it is trivial to construct a TMS320-based
parallel processor, there are quite a few such systems available.
There are both integer-only and floating-point capable versions of the
TMS320; older designs used a somewhat unusual single-precision
floating-point format, but the new models support IEEE formats. The
older TMS320C4x (aka, 'C4x) achieves up to 80 MFLOPS using the
TI-specific single-precision floating-point format; in contrast, a
single 'C67x will provide up to 1 GFLOPS single-precision or 420
MFLOPS double-precision for IEEE floating point calculations, using a
VLIW-based chip architecture called VelociTI. Not only is it easy to
configure a group of these chips as a multiprocessor, but in a single
chip, the 'C8x multiprocessor will provide a 100 MFLOPS IEEE
floating-point RISC master processor along with either two or four
integer slave DSPs.
<P>The other DSP processor family that has been used in more than a few
attached parallel systems lately is the SHARC (aka, ADSP-2106x) from
Analog Devices
<A HREF="http://www.analog.com/">http://www.analog.com/</A>. These chips can be
configured as a 6-processor shared memory multiprocessor without
external glue logic, and larger systems also can be configured using
six 4-bit links/chip. Most of the larger systems seem targeted to
military applications, and are a bit pricey. However, Integrated
Computing Engines, Inc.,
<A HREF="http://www.iced.com/">http://www.iced.com/</A>, makes an
interesting little two-board PCI card set called GreenICE. This unit
contains an array of 16 SHARC processors, and is capable of delivering
a peak speed of about 1.9 GFLOPS using a single-precision IEEE format.
GreenICE costs less than $5,000.
<P>In my opinion, attached parallel DSPs really deserve a lot more
attention from the Linux parallel processing community....
<P>
<H2><A NAME="ss5.3">5.3 FPGAs And Reconfigurable Logic Computing</A>
</H2>
<P>
<P>If parallel processing is all about getting the highest speedup, then
why not build custom hardware? Well, we all know the answers; it
costs too much, takes too long to develop, becomes useless when we
change the algorithm even slightly, etc. However, recent advances in
electrically reprogrammable FPGAs (Field Programmable Gate Arrays)
have nullified most of those objections. Now, the gate density is
high enough so that an entire simple processor can be built within a
single FPGA, and the time to reconfigure (reprogram) an FPGA has also
been dropping to a level where it is reasonable to reconfigure even
when moving from one phase of an algorithm to the next.
<P>This stuff is not for the weak of heart: you'll have to work with
hardware description languages like VHDL for the FPGA configuration, as
well as writing low-level code to interface to programs on the Linux
host system. However, the cost of FPGAs is low, and especially for
algorithms operating on low-precision integer data (actually, a small
superset of the stuff SWAR is good at), FPGAs can perform complex
operations just about as fast as you can feed them data. For example,
simple FPGA-based systems have yielded better-than-supercomputer times
for searching gene databases.
<P>There are other companies making appropriate FPGA-based hardware, but
the following two companies represent a good sample.
<P>Virtual Computer Company offers a variety of products using
dynamically reconfigurable SRAM-based Xilinx FPGAs. Their 8/16 bit
"Virtual ISA Proto Board"
<A HREF="http://www.vcc.com/products/isa.html">http://www.vcc.com/products/isa.html</A> is less than $2,000.
<P>The Altera ARC-PCI (Altera Reconfigurable Computer, PCI bus),
<A HREF="http://www.altera.com/html/new/pressrel/pr_arc-pci.html">http://www.altera.com/html/new/pressrel/pr_arc-pci.html</A>,
is a similar type of card, but uses Altera FPGAs and a PCI bus
interface rather than ISA.
<P>Many of the design tools, hardware description languages, compilers,
routers, mappers, etc., come as object code only that runs under
Windows and/or DOS. You could simply keep a disk partition with
DOS/Windows on your host PC and reboot whenever you need to use them,
however, many of these software packages may work under Linux using
<CODE>dosemu</CODE> or Windows emulators like <CODE>wine</CODE>.
<P>
<HR>
<A HREF="Parallel-Processing-HOWTO-6.html">Next</A>
<A HREF="Parallel-Processing-HOWTO-4.html">Previous</A>
<A HREF="Parallel-Processing-HOWTO.html#toc5">Contents</A>
</BODY>
</HTML>