old-www/HOWTO/Unix-and-Internet-Fundament.../core-formats.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML
><HEAD
><TITLE
>How does my computer store things in memory?</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.7"><LINK
REL="HOME"
TITLE="The Unix and Internet Fundamentals HOWTO"
HREF="index.html"><LINK
REL="PREVIOUS"
TITLE="How does my computer keep processes from stepping on each other?"
HREF="memory-management.html"><LINK
REL="NEXT"
TITLE="How does my computer store things on disk?"
HREF="disk-layout.html"></HEAD
><BODY
CLASS="sect1"
BGCOLOR="#FFFFFF"
TEXT="#000000"
LINK="#0000FF"
VLINK="#840084"
ALINK="#0000FF"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
>The Unix and Internet Fundamentals HOWTO</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="memory-management.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
></TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="disk-layout.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="sect1"
><H1
CLASS="sect1"
><A
NAME="core-formats"
></A
>9. How does my computer store things in memory?</H1
><P
>You probably know that everything on a computer is stored as strings of
bits (binary digits; you can think of them as lots of little on-off
switches).  Here we'll explain how those bits are used to represent the
letters and numbers that your computer is crunching.</P
><P
>Before we can go into this, you need to understand about the
<I
CLASS="firstterm"
>word size</I
> of your computer.  The word size is the
computer's preferred size for moving units of information around;
technically it's the width of your processor's
<I
CLASS="firstterm"
>registers</I
>,
which are the holding areas your processor uses to do arithmetic and
logical calculations.  When people write about computers having bit sizes
(calling them, say, <SPAN
CLASS="QUOTE"
>"32-bit"</SPAN
> or <SPAN
CLASS="QUOTE"
>"64-bit"</SPAN
> computers), this is what
they mean.</P
><P
>Most computers now have a word size of 64 bits.  In the recent past
(early 2000s) many PCs had 32-bit words. The old 286 machines back in the
1980s had a word size of 16.  Old-style mainframes often had 36-bit
words.</P
><P
>The computer views your memory as a sequence of words numbered from
zero up to some large value dependent on your memory size. That value is
limited by your word size, which is why programs on older machines like
286s had to go through painful contortions to address large amounts of
memory.  I won't describe them here; they still give older programmers
nightmares.</P
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="numbers"
></A
>9.1. Numbers</H2
><P
>Integer numbers are represented as either words or pairs of words,
depending on your processor's word size.  One 64-bit machine word is the
most common integer representation.</P
><P
>Integer arithmetic is close to but not actually mathematical
base-two.  The low-order bit is 1, next 2, then 4 and so forth as in pure
binary.  But signed numbers are represented in
<I
CLASS="firstterm"
>twos-complement</I
>
notation.  The highest-order bit is a <I
CLASS="firstterm"
>sign
bit</I
> which
makes the quantity negative, and every negative number can be obtained from
the corresponding positive value by inverting all the bits and adding one.
This is why integers on a 64-bit machine have the range
-2<SUP
>63</SUP
> to 2<SUP
>63</SUP
> - 1.
That 64th bit is being used for sign; 0 means a positive number or zero, 1
a negative number.</P
><P
>Some computer languages give you access to <I
CLASS="firstterm"
>unsigned
arithmetic</I
> which is straight base 2 with zero and
positive numbers only.</P
><P
>Most processors and some languages can do operations in
<I
CLASS="firstterm"
>floating-point</I
>
numbers (this capability is built into all recent processor chips).
Floating-point numbers give you a much wider range of values than integers
and let you express fractions.  The ways in which this is done vary and are
rather too complicated to discuss in detail here, but the general idea is
much like so-called &#8216;scientific notation&#8217;, where one might
write (say) 1.234 * 10<SUP
>23</SUP
>; the encoding of the
number is split into a
<I
CLASS="firstterm"
>mantissa</I
>
(1.234) and the exponent part (23) for the power-of-ten multiplier (which
means the number multiplied out would have 20 zeros on it, 23 minus the
three decimal places).</P
></DIV
><DIV
CLASS="sect2"
><H2
CLASS="sect2"
><A
NAME="characters"
></A
>9.2. Characters</H2
><P
>Characters are normally represented as strings of seven bits each in
an encoding called ASCII (American Standard Code for Information
Interchange).  On modern machines, each of the 128 ASCII characters is the
low seven bits of an
<I
CLASS="firstterm"
>octet</I
>
or 8-bit byte; octets are packed into memory words so that (for example) a
six-character string only takes up one 64-bit memory word.  For an ASCII code
chart, type &#8216;man 7 ascii&#8217; at your Unix prompt.</P
><P
>The preceding paragraph was misleading in two ways.  The minor one is
that the term &#8216;octet&#8217; is formally correct but seldom actually
used; most people refer to an octet as
<I
CLASS="firstterm"
>byte</I
>
and expect bytes to be eight bits long.  Strictly speaking, the term
&#8216;byte&#8217; is more general; there used to be, for example, 36-bit
machines with 9-bit bytes (though there probably never will be
again).</P
><P
>The major one is that not all the world uses ASCII.  In fact, much of
the world can't &#8212; ASCII, while fine for American English, lacks many
accented and other special characters needed by users of other languages.
Even British English has trouble with the lack of a pound-currency
sign.</P
><P
>There have been several attempts to fix this problem.  All use the
extra high bit that ASCII doesn't, making it the low half of a
256-character set.  The most widely-used of these is the so-called
&#8216;Latin-1&#8217; character set (more formally called ISO 8859-1).
This is the default character set for Linux, older versions of HTML, and X.
Microsoft Windows uses a mutant version of Latin-1 that adds a bunch of
characters such as right and left double quotes in places proper Latin-1
leaves unassigned for historical reasons (for a scathing account of the
trouble this causes, see the <A
HREF="http://www.fourmilab.ch/webtools/demoroniser/"
TARGET="_top"
>demoroniser</A
>
page).</P
><P
>Latin-1 handles western European languages, including English,
French, German, Spanish, Italian, Dutch, Norwegian, Swedish, Danish, and
Icelandic.  However, this isn't good enough either, and as a result there
is a whole series of Latin-2 through -9 character sets to handle things
like Greek, Arabic, Hebrew, Esperanto, and Serbo-Croatian.  For details,
see the <A
HREF="http://czyborra.com/charsets/iso8859.html"
TARGET="_top"
>&#13;ISO alphabet soup</A
> page.</P
><P
>The ultimate solution is a huge standard called Unicode (and its
identical twin ISO/IEC 10646-1:1993).  Unicode is identical to Latin-1 in
its lowest 256 slots.  Above these in 16-bit space it includes Greek,
Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian,
Tibetan, Japanese Kana, the complete set of modern Korean Hangul, and a
unified set of Chinese/Japanese/Korean (CJK) ideographs. For details, see
the <A
HREF="http://www.unicode.org/"
TARGET="_top"
>Unicode Home Page</A
>.  XML
and XHTML use this character set.</P
><P
>Recent versions of Linux use an encoding of Unicode called UTF-8.  In
UTF, characters 0-127 are ASCII.  Characters 128-255 are used only in
sequences of 2 through 4 bytes that identify non-ASCII characters.</P
></DIV
></DIV
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="memory-management.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="disk-layout.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
>How does my computer keep processes from stepping on each other?</TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
>&nbsp;</TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
>How does my computer store things on disk?</TD
></TR
></TABLE
></DIV
></BODY
></HTML
>