<!doctype linuxdoc system>
<article>
<title>Brief Introduction to Alpha Systems and Processors</title>
<author>
Neal Crook, Digital Equipment
(Editor: <url url="mailto:davidm@azstarnet.com" name="David Mosberger">)
</author>
<date>V0.11, 6 June 1997
<abstract>
This document is a brief overview of existing Alpha CPUs, chipsets and
systems. It has something of a hardware bias, reflecting my own area
of expertise. Although I am an employee of Digital Equipment
Corporation, this is not an official statement by Digital and any
opinions expressed are mine and not Digital's.
</abstract>
<toc>
<sect>What is Alpha
<p> "Alpha" is the name given to Digital's 64-bit RISC architecture. The Alpha
project in Digital began in mid-1989, with the goal of providing a
high-performance migration path for VAX customers. This was not the first RISC
architecture to be produced by Digital, but it was the first to reach the
market. When Digital announced Alpha, in March 1992, it made the decision to
enter the merchant semiconductor market by selling Alpha microprocessors.
<p> Alpha is also sometimes referred to as Alpha AXP, for obscure and
arcane reasons that aren't worth pursuing. Suffice it to say that they are one
and the same.
<sect>What is Digital Semiconductor
<p> <url url="http://www.digital.com/info/semiconductor/"
name="Digital Semiconductor"> (DS) is the business unit within Digital
Equipment Corporation (Digital - we don't like the name DEC) that
sells semiconductors on the merchant market. Digital's products
include CPUs, support chipsets, PCI-PCI bridges and PCI peripheral
chips for comms and multimedia.
<sect>Alpha CPUs
<p>There are currently 2 generations of CPU core that implement the Alpha
architecture:
<itemize>
<item> EV4
<item> EV5
</itemize>
<p>Opinions differ as to what "EV" stands for (Editor's note: the true
answer is of course "Electro Vlassic" <ref id="ref1" name="[1]">), but the
number represents the first generation of Digital's CMOS technology
that the core was implemented in. So, the EV4 was originally
implemented in CMOS4. As time goes by, a CPU tends to get a mid-life
performance kick by being optically shrunk into the next generation of
CMOS process. EV45, then, is the EV4 core implemented in CMOS5
process. There is a big difference between shrinking a design into a
particular technology and implementing it from scratch in that
technology (but I don't want to go into that now). There are a few
other wildcards in here: there is also a CMOS4S (optical shrink in
CMOS4) and a CMOS5L.
<p>True technophiles will be interested to know that CMOS4 is a 0.75 micron
process, CMOS5 is a 0.5 micron process and CMOS6 is a 0.35 micron process.
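<p> The naming rule above can be written down as a short sketch. The
helper name <tt/decode_core/ and the table <tt/CMOS_MICRONS/ are
hypothetical names introduced for illustration, and the helper only
understands plain names such as EV4 or EV45 (not suffixed variants such
as EV4S or LCA45).
<tscreen><verb>
```python
# Feature size in microns for each CMOS generation named in the text.
CMOS_MICRONS = {4: 0.75, 5: 0.5, 6: 0.35}


def decode_core(name):
    """Split a core name such as 'EV45' into
    (core generation, CMOS process, feature size in microns).

    Hypothetical helper: accepts only 'EV' followed by one or two digits.
    """
    digits = name[2:]
    if not name.startswith("EV") or not digits.isdigit() or len(digits) not in (1, 2):
        raise ValueError("unsupported core name: " + name)
    core = int(digits[0])
    # A single digit means the core is still in its original process;
    # a second digit names the process it was shrunk into.
    process = int(digits[-1])
    return core, process, CMOS_MICRONS[process]


for name in ("EV4", "EV45", "EV5", "EV56"):
    print(name, decode_core(name))
```
</verb></tscreen>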
<p>To map these CPU cores to <em/chips/ we get:
<descrip>
<tag/21064-150,166/ EV4 (originally), EV4S (now)
<tag/21064-200/ EV4S
<tag/21064A-233,275,300/ EV45
<tag/21066/ LCA4S (EV4 core, with EV4 FPU)
<tag/21066A-233/ LCA45 (EV4 core, but with EV45 FPU)
<tag/21164-233,300,333/ EV5
<tag/21164A-417/ EV56
<tag/21264/ <url name="EV6" url="http://www.mdronline.com/report/articles/21264/21264.html">
</descrip>
<p> The EV4 core is a dual-issue (it can issue 2 instructions per CPU
clock) superpipelined core with integer unit, floating point unit and
branch prediction. It is fully bypassed and has 64-bit internal data
paths and tightly coupled 8Kbyte caches, one each for Instruction and
Data. The caches are write-through (they never get dirty).
<p> The EV45 core has a couple of tweaks to the EV4 core: it has a
slightly improved floating point unit, and 16KB caches, one each for
Instruction and Data (it also has cache parity). (Editor's note: Neal
Crook indicated in a separate mail that the changes to the floating
point unit (FPU) improve the performance of the divider. The EV4 FPU
divider takes 34 cycles for a single-precision divide and 63 cycles
for a double-precision divide (non data-dependent). In contrast, the
EV45 divider takes typically 19 cycles (34 cycles max) for
single-precision and typically 29 cycles (63 cycles max) for a
double-precision division (data-dependent).)
<p> The EV5 core is a quad-issue core, also superpipelined, fully bypassed
etc etc. It has tightly-coupled 8Kbyte caches, one each for I and D. These
caches are write-through. It also has a tightly-coupled 96Kbyte on-chip
second-level cache (the Scache) which is 3-way set associative and write-back
(it can be dirty). The EV4->EV5 performance increase is better than just
the increase achieved by clock speed improvements. As well as the bigger
caches and quad issue, there are microarchitectural improvements to reduce
producer/consumer latencies in some paths.
<p> The EV56 core is fundamentally the same microarchitecture as the
EV5, but it adds some new instructions for 8 and 16-bit loads and
stores (see Section <ref id="byte ld/st" name="Bytes and all that
stuff">). These are primarily intended for use by device drivers. The
EV56 core is implemented in CMOS6, which is a 2.0V process.
<p> The 21064 was announced in March 1992. It uses the EV4 core, with a 128-bit
bus interface. The bus interface supports the 'easy' connection of an external
second-level cache, with a block size of 256-bits (2 data beats on the
bus). The Bcache timing is completely software configurable. The 21064 can also
be configured to use a 64-bit external bus (but I'm not sure if any shipping
system uses this mode). The 21064 does not impose any policy on the Bcache, but
it is usually configured as a write-back cache. The 21064 does contain hooks to
allow external hardware to maintain cache coherence with the Bcache and
internal caches, but this is hairy.
<p> The 21066 uses the EV4 core and integrates a memory controller and
PCI host bridge. To save pins, the memory controller has a 64-bit data
bus (but the internal caches have a block size of 256 bits, just like
the 21064, therefore a block fill takes 4 beats on the bus). The
memory controller supports an external Bcache and external DRAMs. The
timing of the Bcache and DRAMs is completely software configurable,
and can be controlled to the resolution of the CPU clock
period. Having a 4-beat process to fill a cache block isn't as bad as
it sounds because the DRAM access is done in page mode. Unfortunately,
the memory controller doesn't support any of the new esoteric DRAMs
(SDRAM, EDO or BEDO) or synchronous cache RAMs. The PCI bus interface
is fully rev2.0 compliant and runs at up to 33MHz.
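<p> The beat counts quoted for the 21064 and the 21066 follow directly
from dividing the cache block size by the width of the data bus. A
minimal sketch of that arithmetic (the function name <tt/fill_beats/ is
hypothetical):
<tscreen><verb>
```python
def fill_beats(block_bits, bus_bits):
    """Number of bus data beats needed to transfer one cache block."""
    if block_bits % bus_bits != 0:
        raise ValueError("block size must be a multiple of the bus width")
    return block_bits // bus_bits


BLOCK_BITS = 256  # cache block size quoted for both chips

print("21064, 128-bit bus:", fill_beats(BLOCK_BITS, 128), "beats")
print("21066,  64-bit bus:", fill_beats(BLOCK_BITS, 64), "beats")
```
</verb></tscreen>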
<p> The 21164 has a 128-bit data bus and supports split reads, with
up to 2 reads outstanding at any time (this allows 100% data bus
utilisation under best-case dream-on conditions, i.e., you can
theoretically transfer 128 bits of data on every bus clock). The 21164
supports easy connection of an external third-level cache (Bcache) and
has all the hooks to allow external systems to maintain full cache
coherence with all caches. Therefore, symmetric multiprocessor designs
are 'easy'.
<p> The 21164A was announced in October, 1995. It uses the EV56 core. It is
nominally pin-compatible with the 21164, but requires split power rails; all
of the power pins that were +3.3V power on the 21164 have now been split into
two groups; one group provides 2.0V power to the CPU core, the other group
supplies 3.3V to the I/O cells. Unlike older implementations, the 21164 pins
are not 5V-tolerant. The end result of this change is that 21164 systems are,
in general, not upgradeable to the 21164A (though note that it would be
relatively straightforward to design a 21164A system that could also
accommodate a 21164). The 21164A also has a couple of new pins to support
the new 8 and 16-bit loads and stores. It also improves the 21164 support for
using synchronous SRAMs to implement the external Bcache.
<sect>21064 performance vs 21066 performance
<p> The 21064 and the 21066 have the same (EV4) CPU core. If the same program
is run on a 21064 and a 21066, at the same CPU speed, then the
difference in performance comes only as a result of system
Bcache/memory bandwidth. Any code thread that has a high hit-rate on
the <em>internal</em> caches will perform the same. There are 2 big
performance killers:
<enum>
<item> Code that is write-intensive. Even though the 21064 and the 21066
have write buffers to swallow some of the delays, code that is
write-intensive will be throttled by write bandwidth at the system
bus. This arises because the on-chip caches are write-through.
<item> Code that wants to treat floats as integers. The Alpha
architecture does not allow register-register transfers from integer
registers to floating point registers. Such a conversion has to be
done via memory (and therefore, because the on-chip caches are
write-through, via the Bcache). (Editor's note: it seems that both
the EV4 and EV45 can perform the conversion through the primary data
cache (Dcache), provided that the memory is cached already. In such a
case, the store in the conversion sequence will update the Dcache and
the subsequent load is, under certain circumstances, able to read the
updated Dcache value, thus avoiding a costly roundtrip to the Bcache.
In particular, it seems best to execute the stq/ldt or stt/ldq
instructions back-to-back, which is somewhat counter-intuitive.)
</enum>
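<p> The "via memory" conversion in the second item amounts to storing a
value with one type and loading the same bytes back with the other. The
sketch below imitates that in Python, assuming an 8-byte little-endian
scratch location; the function names are hypothetical, and the comments
only indicate which Alpha store or load each step stands in for.
<tscreen><verb>
```python
import struct


def double_to_int_bits(value):
    """Reinterpret a double as a signed 64-bit integer by going through memory."""
    memory = bytearray(8)                          # one quadword of scratch memory
    struct.pack_into("<d", memory, 0, value)       # store the float (the stt step)
    (bits,) = struct.unpack_from("<q", memory, 0)  # load it as an integer (the ldq step)
    return bits


def int_bits_to_double(bits):
    """The reverse direction: store an integer, load it back as a double."""
    memory = bytearray(8)
    struct.pack_into("<q", memory, 0, bits)         # the stq step
    (value,) = struct.unpack_from("<d", memory, 0)  # the ldt step
    return value


print(hex(double_to_int_bits(1.0)))
print(int_bits_to_double(0x4000000000000000))
```
</verb></tscreen>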
<p> If you make the same comparison between a 21064A and a 21066A, there is an
additional factor due to the different Icache and Dcache sizes between the two
chips.
<p> Now, the 21164 solves both these problems: it achieves <em/much/
higher system bus bandwidths (despite having the same number of signal
pins - yes, I <em/know/ it's got about twice as many pins as a
21064, but all those extra ones are power and ground! (yes, really!!))
and it has write-back caches. The only remaining problem is the answer
to the question "how much does it cost?"
<p>
<sect>A Few Notes On Clocking
<p> All of the current Alpha CPUs use high-speed clocks, because their
microarchitectures have been designed as so-called short-tick
designs. None of the system busses have to run at horrendous speeds as
a result though:
<itemize>
<item> on the 21066(A), 21064(A), 21164 the off-chip cache (Bcache)
timing is completely programmable, to the resolution of the CPU
clock. For example, on a 275MHz CPU, the Bcache read access time can
be controlled with a resolution of 3.6ns.
<item> on the 21066(A), the DRAM timing is completely programmable, to
the resolution of the CPU clock (<em/not/ the PCI clock, the CPU clock).
<item> on the 21064(A), 21164(A), the system bus frequency is a
sub-multiple of the CPU clock frequency. Most of the 21064
motherboards use a 33MHz system bus clock.
<item> Systems that use the 21066 can run the PCI at any frequency
relative to the CPU. Generally, the PCI runs at 33MHz.
<item> Systems that use the APECS chipset (see Section <ref id="The
chip-sets">) always have their CPU system bus equal to their PCI bus
frequency. This means that both busses tend to run at either 25MHz or
33MHz (since these are the frequencies that scale up to match the CPU
frequencies). On APECS systems, the DRAM controller timings are
software programmable in terms of the CPU system bus frequency.
</itemize>
<bf/Aside:/ someone suggested that they were getting bad performance
on a 21066 because the 21066 memory controller was only running at
33MHz. Actually, it's the superfast 21064A systems that have memory
controllers that 'only' run at 33MHz.
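<p> The clocking arithmetic in this section can be sketched as follows.
The 15ns access time and the divisor of 6 are made-up example values,
not figures for any particular SRAM or CPU configuration, and the
function names are hypothetical.
<tscreen><verb>
```python
import math


def clock_period_ns(freq_mhz):
    """Length of one clock tick in nanoseconds."""
    return 1000.0 / freq_mhz


def cycles_needed(access_ns, cpu_mhz):
    """Smallest whole number of CPU ticks that covers a given access time."""
    return math.ceil(access_ns * cpu_mhz / 1000.0)


def bus_clock_mhz(cpu_mhz, divisor):
    """System bus frequency when it is an integer sub-multiple of the CPU clock."""
    return cpu_mhz / divisor


# Timing resolution on a 275MHz CPU (the 3.6ns figure quoted above).
print(round(clock_period_ns(275), 2))
# CPU ticks to program for a hypothetical 15ns Bcache access.
print(cycles_needed(15, 275))
# A 200MHz CPU with an assumed divisor of 6 lands on a 33MHz-class bus.
print(round(bus_clock_mhz(200, 6), 1))
```
</verb></tscreen>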
<sect>The chip-sets <label id="The chip-sets">
<p> DS sells two CPU support chipsets. The 2107x chipset (aka APECS) is a
21064(A) support chipset. The 2117x chipset (aka ALCOR) is a 21164 support
chipset. There will also be a 2117xA chipset (aka ALCOR 2) as a 21164A support
chipset.
<p> Both chipsets provide memory controllers and PCI host bridges for their
CPU. APECS provides a 32-bit PCI host bridge, ALCOR provides a 64-bit PCI host
bridge which (in accordance with the requirements of the PCI spec) can support
both 32-bit and 64-bit PCI devices.
<p>
APECS consists of 6, 208-pin chips (4, 32-bit data slices (DECADE), 1
system controller (COMANCHE), 1 PCI controller (EPIC)). It provides a DRAM
controller (128-bit memory bus) and a PCI interface. It also does all the work
to maintain memory coherence when a PCI device DMAs into (or out of) memory.
<p> ALCOR consists of 5 chips (4, 64-bit data slices (Data Switch, DSW) -
208-pin PQFP and 1 control (Control, I/O Address, CIA) - a 383 pin plastic PGA).
It provides a DRAM controller (256-bit memory bus) and a PCI interface. It
also does all the work required to support an external Bcache and to
maintain memory coherence when a PCI device DMAs into (or out of) memory.
<p> There is no support chipset for the 21066, since the memory controller and
PCI host bridge functionality are integrated onto the chip.
<sect>The Systems
<p> The applications engineering group in DS produces example designs using the
CPUs and support chipsets. These are typically PC-AT size motherboards, with
all the functionality that you'd typically find on a high-end Pentium
motherboard. Originally, these example designs were intended to be used as
starting points for third-parties to produce motherboard designs from. These
first-generation designs were called Evaluation Boards (EBs). As the
amount of engineering required to build a motherboard has increased (due to
higher-speed clocks and the need to meet RF emission and susceptibility
regulations) the emphasis has shifted towards providing motherboards that
are suitable for volume manufacture.
<p> Digital's system groups have produced several generations of machines using
Alpha processors. Some of these systems use support logic that is designed by
the systems groups, and some use commodity chipsets from DS. In some cases,
systems use a combination of both.
<p> Various third-parties build systems using Alpha processors. Some of these
companies design systems from scratch, and others use DS support chipsets,
clone/modify DS example designs or simply package systems using built and
tested boards from DS.
<p> The EB64: Obsolete design using 21064 with memory controller implemented
using programmable logic. I/O provided by using programmable logic to interface
a 486<->ISA bridge chip. On-board Ethernet, SuperI/O (2S, 1P, FD) and ISA.
and ISA. PC-AT size. Runs from standard PC power supply.
<p> The EB64+: Uses 21064 or 21064A and APECS. Has ISA and PCI expansion (3
ISA, 2 PCI, one pair are on a shared slot). Supports 36-bit DRAM SIMMs. ISA bus
generated by Intel SaturnI/O PCI-ISA bridge. On-board SCSI (NCR 810 on PCI),
Ethernet (Digital 21040), KBD, MOUSE (PS2 style), SuperI/O (2S, 1P, FD),
RTC/NVRAM. Boot ROM is EPROM. PC-AT size. Runs from standard PC power supply.
<p> The EB66: Uses 21066 or 21066A. I/O sub-system is identical to EB64+. Baby
PC-AT size. Runs from standard PC power supply. The EB66 schematic was
published as a marketing poster advertising the 21066 as "the first
microprocessor in the world with embedded PCI" (for trivia fans: there are
actually 2 versions of this poster - I drew the circuits and wrote the spiel
for the first version, and some Americans mauled the spiel for the second
version).
<p> The EB164: Uses 21164 and ALCOR. Has ISA and PCI expansion (3 ISA slots,
2 64-bit PCI slots (one is shared with an ISA slot) and 2 32-bit PCI slots).
Uses plug-in Bcache SIMMs. I/O sub-system provides SuperI/O (2S, 1P, FD),
KBD, MOUSE (PS2 style), RTC/NVRAM. Boot ROM is Flash. PC-AT-sized motherboard.
Requires power supply with 3.3V output.
<p> The AlphaPC64 (aka Cabriolet): derived from EB64+ but now baby-AT
with Flash boot ROM, no on-board SCSI or Ethernet. 3 ISA slots, 4 PCI
slots (one pair are on a shared slot), uses plug-in Bcache SIMMs.
Requires power supply with 3.3V output.
<p> The AXPpci33 (aka NoName), is based on the EB66. This design is produced by
Digital's Technical OEM (TOEM) group. It uses the 21066 processor running at
166MHz or 233MHz. It is a baby-AT size, and runs from a standard PC power
supply. It has 5 ISA slots and 3 PCI slots (one pair are a shared slot). There
are 2 versions, with either PS/2 or large DIN connectors for the keyboard.
<p> Other 21066-based motherboards: most if not all other 21066-based
motherboards on the market are also based on EB66 - there really aren't many
system options when designing a 21066 system, because all the control is done
on-chip.
<p> Multia (aka the Universal Desktop Box): This is a very compact
pedestal desktop system based on the 21066. It includes 2 PCMCIA
sockets, 21030 (TGA) graphics, 21040 Ethernet and NCR 810 SCSI disk
along with floppy, 2 serial ports and a parallel port. It has limited
expansion capability (one PCI slot) due to its compact size. (There is
some restriction on when you can use the PCI slot, but I can't remember
what it is.) Note that 21066A-based and Pentium-based Multias are also
available.
<p> DEC PC 150 AXP (aka Jensen): This is a very old Digital system - one of the
first-generation Alpha systems. It is only mentioned here because a number of
these systems seem to be available on the second-hand market. The Jensen is a
floor-standing tower system which used a 150MHz 21064 (later versions used
faster CPUs but I'm not sure what speeds). It used programmable logic to
interface a 486 EISA I/O bridge to the CPU.
<p> Other 21064(A) systems: There are 3 or 4 motherboard designs around (I'm
not including Digital <em>systems</em> here) and all the ones I know of
are derived from the EB64+ design. These include:
<itemize>
<item>EB64+ (some vendors package the board and sell it unmodified); AT
form-factor.
<item>Aspen Systems motherboard: EB64+ derivative; baby-AT form-factor.
<item>Aspen Systems server board: many PCI slots (includes PCI bridge).
<item>AlphaPC64 (aka Cabriolet), baby AT form-factor.
</itemize>
<p> Other 21164(A) systems: The only one I'm aware of that isn't simply
an EB164 clone is a system made by DeskStation. That system is implemented
using a memory and I/O controller proprietary to DeskStation. I don't know
what their attitude towards Linux is.
<sect>Bytes and all that stuff<label id="byte ld/st">
<p> When the Alpha architecture was introduced, it was unique amongst RISC
architectures for eschewing 8-bit and 16-bit loads and stores. It supported
32-bit and 64-bit loads and stores (longword and quadword, in Digital's
nomenclature). The co-architects (Dick Sites, Rich Witek) justified this
decision by citing two drawbacks of byte support:
<enum>
<item> Byte support in the cache and memory
sub-system tends to slow down accesses for 32-bit and 64-bit quantities.
<item> Byte support makes it hard to build high-speed error-correction
circuitry into the cache/memory sub-system.
</enum>
<p> Alpha compensates by providing powerful instructions for
manipulating bytes and byte groups within 64-bit registers. Standard
benchmarks for string operations (e.g., some of the Byte benchmarks) show
that Alpha performs very well on byte manipulation.
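<p> To give a feel for what manipulating bytes within a 64-bit register
means, here is a minimal sketch in Python. It is an illustration of the
idea only, not a definition of any Alpha instruction: the function names
are hypothetical, and it assumes little-endian byte numbering with the
low 3 bits of the address selecting the byte within the quadword.
<tscreen><verb>
```python
MASK64 = (1 << 64) - 1  # keep results within a 64-bit register


def extract_byte(quad, addr):
    """Return the byte of a 64-bit value selected by the low 3 bits of addr."""
    shift = 8 * (addr & 7)
    return (quad >> shift) & 0xFF


def insert_byte(quad, addr, byte):
    """Return quad with the byte selected by the low 3 bits of addr replaced."""
    shift = 8 * (addr & 7)
    cleared = quad & ~(0xFF << shift) & MASK64  # mask out the old byte
    return cleared | ((byte & 0xFF) << shift)   # merge in the new one


quad = 0x1122334455667788
print(hex(extract_byte(quad, 0)))
print(hex(insert_byte(quad, 1, 0xAB)))
```
</verb></tscreen>
A byte "load" is then a quadword load followed by an extract, and a
byte "store" is a quadword load, an insert and a quadword store. That
read-modify-write sequence is why the absence of byte stores matters
for semaphores and I/O.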
<p> The absence of byte loads and stores impacts some software semaphores and
impacts the design of I/O sub-systems. Digital's solution to the I/O problem is
to use some low-order address lines to specify the data size during I/O
transfers, and to decode these as byte enables. This so-called Sparse
Addressing wastes address space and has the consequence that I/O space is
non-contiguous (more on the intricacies of Sparse Addressing when I get around
to writing it). Note that I/O space, in this context, refers to all system
resources present on the PCI and therefore includes both PCI memory space and
PCI I/O space.
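<p> A rough sketch of the Sparse Addressing idea follows. The shift of
five bits and the size codes are assumptions chosen for illustration;
this document does not give the actual field layout, so check the
chipset documentation before relying on any of these values. The
function name <tt/sparse_address/ is hypothetical.
<tscreen><verb>
```python
SHIFT = 5  # assumed: the bus address sits above the five low-order bits

# Assumed encodings of the transfer size in the low-order address bits.
SIZE_CODE = {1: 0x00, 2: 0x08, 4: 0x18}


def sparse_address(base, bus_addr, size):
    """CPU address to use for a transfer of `size` bytes at `bus_addr`."""
    return base + (bus_addr << SHIFT) + SIZE_CODE[size]


# Neighbouring bus addresses end up far apart in CPU address space,
# which is why sparse space is non-contiguous and wasteful.
first = sparse_address(0, 0x3F8, 1)
second = sparse_address(0, 0x3F9, 1)
print(hex(first), hex(second), second - first)
```
</verb></tscreen>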
<p> With the 21164A introduction, the Alpha architecture was ECO'd to include
byte addressing. Executing these new instructions on an earlier CPU will cause
an OPCDEC PALcode exception, so that the PALcode will handle the access. This
will have a performance impact. The ramifications of this are that use of these
new instructions (IMO) should be restricted to device drivers rather than
applications code.
<p> These new byte loads and stores mean that future support chipsets will be
able to support contiguous I/O space.
<sect> PALcode and all that stuff
<p> This is a placeholder for a section explaining PALcode. I will write it if
there is sufficient interest.
<sect>Porting
<p> The ability of any Alpha-based machine to run Linux is really only limited
by your ability to get information on the gory details of its innards. Since
there are Linux ports for the EB66, EB64+ and EB164 boards, all systems based on
the 21066, 21064/APECS or 21164/ALCOR should run Linux with little or no
modification. The major thing that is different between any of these
motherboards is the way that they route interrupts. There are three sources of
interrupts:
<itemize>
<item> on-board devices
<item> PCI devices
<item> ISA devices
</itemize>
<p> All the systems use an Intel System I/O bridge (SIO) to act as a
bridge between PCI and ISA (the main I/O bus is PCI, the ISA bus is a
secondary bus used to support slow-speed and 'legacy' I/O
devices). The SIO contains the traditional pair of daisy-chained
8259s.
<p> Some systems (e.g., the NoName) route all of their interrupts
through the SIO and thence to the CPU. Some systems have a separate
interrupt controller and route all PCI interrupts plus the SIO
interrupt (8259 output) through that, and all ISA interrupts through
the SIO.
<p> Other differences between the systems include:
<itemize>
<item> how many slots they have
<item> what on-board PCI devices they have
<item> whether they have Flash or EPROM
</itemize>
<sect>More Information
<p> All of the DS evaluation boards and motherboard designs are
license-free and the whole documentation kit for a design costs about
\$50. That includes all the schematics, programmable parts sources,
data sheets for CPU and support chipset. The doc kits are available
from Digital Semiconductor distributors. I'm not suggesting that many
people will want to rush out and buy this, but I do want to point out
that the information is available.
<p>Hope that was helpful. Comments/updates/suggestions for expansion
to <url url="mailto:neal.crook@reo.mts.digital.com" name="Neal Crook">.
<sect>References
<p><label id="ref1"><url url="http://www.research.digital.com/wrl/publications/abstracts/TN-13.html" name="[1]">
Bill Hamburgen, Jeff Mogul, Brian Reid, Alan Eustace, Richard Swan,
Mary Jo Doherty, and Joel Bartlett. <em>Characterization of Organic
Illumination Systems</em>. DEC WRL, Technical Note 13, April 1989.
</article>