<?xml version="1.0"?>
<!DOCTYPE article PUBLIC
   "-//OASIS//DTD DocBook XML V4.1.2//EN"
   "docbook/docbookxx.dtd" [
<!ENTITY howto         "http://www.tldp.org/HOWTO/">
<!ENTITY mini-howto    "&howto;/mini/">
<!ENTITY howto-host "www.tldp.org">
<!ENTITY howto-domain ".tldp.org">
<!ENTITY howto-toplevel ".org">
<!ENTITY home          "http://www.catb.org/~esr/">
]>

<article id="Unix-and-Internet-Fundamentals-HOWTO">
|
|
<articleinfo>
|
|
<title>The Unix and Internet Fundamentals HOWTO</title>
|
|
|
|
<author>
|
|
<firstname>Eric</firstname>
|
|
<surname>Raymond</surname>
|
|
<affiliation>
|
|
<address>
|
|
<email>esr@thyrsus.com</email>
|
|
</address>
|
|
</affiliation>
|
|
</author>
|
|
|
|
<revhistory id="revhistory">
|
|
<revision>
|
|
<revnumber>2.14</revnumber>
|
|
<date></date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
More minor corrections.
|
|
</revremark>
|
|
</revision>
|
|
<revision>
|
|
<revnumber>2.13</revnumber>
|
|
<date></date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Minor corrections.
|
|
</revremark>
|
|
</revision>
|
|
<revision>
|
|
<revnumber>2.12</revnumber>
|
|
<date>2010-07-31</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Add Farsi translation link. Note that ISA is dead.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<!--
|
|
<revhistory>
|
|
<revision>
|
|
<revnumber>2.11</revnumber>
|
|
<date>2010-01-05</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Notice that it's mostly 64 bits now. Update the sections
|
|
on booting and running programs to describe X.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.10</revnumber>
|
|
<date>2007-11-28</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Minor updates.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.9</revnumber>
|
|
<date>2004-03-03</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Minor updates.
|
|
</revremark>
|
|
</revision>
|
|
<revision>
|
|
<revnumber>2.8</revnumber>
|
|
<date>2003-10-04</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Minor updates.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.8</revnumber>
|
|
<date>2003-02-22</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
LDP site and my home site moved.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.7</revnumber>
|
|
<date>2002-07-06</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Light rewrite of permissions section.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.6</revnumber>
|
|
<date>2002-02-06</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
URL correction and typo fixes.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.5</revnumber>
|
|
<date>2002-02-02</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Corrected description of IP.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.4</revnumber>
|
|
<date>2001-06-12</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Where to find more.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.3</revnumber>
|
|
<date>2001-05-21</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Introduction to bus types.
|
|
Polish translation link.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.2</revnumber>
|
|
<date>2001-02-05</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
New section on how DNS is organized. Corrected for new
|
|
location of document. Various copy-edit fixes.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.1</revnumber>
|
|
<date>2000-11-29</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Correct explanation of twos-complement numbers. Various
|
|
copy-edit fixes.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>2.0</revnumber>
|
|
<date>2000-08-05</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
First DocBook version. Detailed description of memory hierarchy.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>1.7</revnumber>
|
|
<date>2000-03-06</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Correct and expanded the section on file permissions.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>1.4</revnumber>
|
|
<date>1999-09-25</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Be more precise about what kernel does vs. what init does.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>1.3</revnumber>
|
|
<date>1999-06-27</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
The sections ‘What happens when you log in?’ and
|
|
‘File ownership, permissions and security’.
|
|
</revremark>
|
|
</revision>
|
|
|
|
<revision>
|
|
<revnumber>1.2</revnumber>
|
|
<date>1998-12-26</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
The section ‘How does my computer store things in memory?’.
|
|
</revremark>
|
|
</revision>
|
|
-->
|
|
<revision>
|
|
<revnumber>1.0</revnumber>
|
|
<date>1998-10-29</date>
|
|
<authorinitials>esr</authorinitials>
|
|
<revremark>
|
|
Initial revision.
|
|
</revremark>
|
|
</revision>
|
|
</revhistory>
|
|
<abstract>
|
|
<para>
|
|
This document describes the working basics of PC-class computers, Unix-like
|
|
operating systems, and the Internet in non-technical language.
|
|
</para>
|
|
|
|
</abstract>
|
|
</articleinfo>
|
|
|
|
<!-- Section1: intro -->
|
|
|
|
<sect1 id="intro"><title>Introduction</title>
|
|
|
|
<sect2 id="purpose" ><title>Purpose of this document</title>
|
|
|
|
<para>This document is intended to help Linux and Internet users who are
|
|
learning by doing. While this is a great way to acquire specific skills,
|
|
sometimes it leaves peculiar gaps in one's knowledge of the basics — gaps
|
|
which can make it hard to think creatively or troubleshoot effectively,
|
|
from lack of a good mental model of what is really going on.</para>
|
|
|
|
<para>I'll try to describe in clear, simple language how it all works. The
|
|
presentation will be tuned for people using Unix or Linux on PC-class
|
|
machines. Nevertheless, I'll usually refer simply to ‘Unix’
|
|
here, as most of what I will describe is constant across different machines
|
|
and across Unix variants.</para>
|
|
|
|
<para>I'm going to assume you're using an Intel PC. The details differ
|
|
slightly if you're running a PowerPC or some other kind of computer, but
|
|
the basic concepts are the same.</para>
|
|
|
|
<para>I won't repeat things, so you'll have to pay attention, but that
|
|
also means you'll learn from every word you read. It's a good idea to just
|
|
skim when you first read this; you should come back and reread it a few
|
|
times after you've digested what you have learned.</para>
|
|
|
|
<para>This is an evolving document. I intend to keep adding sections in
|
|
response to user feedback, so you should come back and review it
|
|
periodically.</para>
|
|
|
|
</sect2>
|
|
<sect2 id="newversions"><title>New versions of this document</title>
|
|
|
|
<para>New versions of the Unix and Internet Fundamentals HOWTO will be
|
|
periodically posted to <ulink url="news:comp.os.linux.help">
|
|
comp.os.linux.help</ulink> and <ulink url="news:comp.os.linux.announce">
|
|
comp.os.linux.announce</ulink> and <ulink url="news:news.answers">
|
|
news.answers</ulink>. They will also be uploaded to various websites,
|
|
including the Linux Documentation Project home page.</para>
|
|
|
|
<para>You can view the latest version of this on the World Wide Web via the URL
|
|
<ulink
|
|
url="&howto;Unix-and-Internet-Fundamentals-HOWTO/index.html">
|
|
&howto;Unix-and-Internet-Fundamentals-HOWTO/index.html</ulink>.
|
|
</para>
|
|
|
|
<para>This document has been translated into the following languages:
<ulink url="http://mehdi.wordpress.com/unix-and-internet-fundamentals-persian/">Farsi</ulink>,
<ulink
url="http://theta.uoks.uj.edu.pl/~gszczepa/gszczepa/esr1iso2.htm">Polish</ulink>,
<ulink
url="http://es.tldp.org/Manuales-LuCAS/doc-fundamentos-unix-internet/fundamentos.html">Spanish</ulink>,
and <ulink
url="http://docs.comu.edu.tr/howto/fundementals-howto.html">Turkish</ulink>.
|
|
</para>
|
|
|
|
</sect2>
|
|
<sect2 id="feedback"><title>Feedback and corrections</title>
|
|
|
|
<para>If you have questions or comments about this document, please feel
|
|
free to mail Eric S. Raymond, at <ulink url="mailto:esr@thyrsus.com">
|
|
esr@thyrsus.com</ulink>. I welcome any suggestions or criticisms. I
|
|
especially welcome hyperlinks to more detailed explanations of individual
|
|
concepts. If you find a mistake with this document, please let me know so
|
|
I can correct it in the next version. Thanks.</para>
|
|
|
|
</sect2>
|
|
<sect2 id="resources"><title>Related resources</title>
|
|
|
|
<para>If you're reading this in order to learn how to hack, you should also
|
|
read the <ulink url="&home;faqs/hacker-howto.html">
|
|
How To Become A Hacker FAQ</ulink>. It has links to some other useful
|
|
resources.</para>
|
|
|
|
</sect2>
|
|
</sect1>
|
|
<sect1 id="anatomy"><title>Basic anatomy of your computer</title>
|
|
|
|
<para>Your computer has a processor chip inside it that does the actual
|
|
computing. It has internal memory (what DOS/Windows people call
|
|
<quote>RAM</quote> and Unix people often call <quote>core</quote>; the Unix
|
|
term is a folk memory from when RAM consisted of ferrite-core donuts). The
|
|
processor and memory live on the
|
|
<firstterm>motherboard</firstterm><indexterm><primary>motherboard</primary></indexterm>,
|
|
which is the heart of your computer.</para>
|
|
|
|
<para>Your computer has a screen and keyboard. It has hard drives and an
|
|
optical CD-ROM (or maybe a DVD drive) and maybe a floppy disk. Some of
|
|
these devices are run by <firstterm>controller cards</firstterm> that plug
|
|
into the motherboard and help the computer drive them; others are run by
|
|
specialized chipsets directly on the motherboard that fulfill the same
|
|
function as a controller card. Your keyboard is too simple to need a
|
|
separate card; the controller is built into the keyboard chassis
|
|
itself.</para>
|
|
|
|
<para>We'll go into some of the details of how these devices work later. For
|
|
now, here are a few basic things to keep in mind about how they work
|
|
together:</para>
|
|
|
|
<para>All the parts of your computer inside the case are connected by a
|
|
<firstterm>bus</firstterm><indexterm><primary>bus</primary></indexterm>.
|
|
Physically, the bus is what you plug your controller cards into (the video
|
|
card, the disk controller, a sound card if you have one). The bus is the
|
|
data highway between your processor, your screen, your disk, and everything
|
|
else.</para>
|
|
|
|
<para>(If you've seen references to ‘ISA’, ‘PCI’,
|
|
and ‘PCMCIA’ in connection with PCs and have not understood
|
|
them, these are bus types. ISA is, except in minor details, the same bus
|
|
that was used on IBM's original PCs in 1981; it is no longer used.
PCI, for Peripheral Component Interconnect, is the bus used on most
|
|
modern PCs, and on modern Macintoshes as well. PCMCIA is a variant of ISA
|
|
with smaller physical connectors used on laptop computers.)</para>
|
|
|
|
<para>The processor, which makes everything else go, can't actually see any of
|
|
the other pieces directly; it has to talk to them over the bus. The only
|
|
other subsystem that it has really fast, immediate access to is memory (the
|
|
core). In order for programs to run, then, they have to be <firstterm>in
|
|
core</firstterm> (in memory).</para>
|
|
|
|
<para>When your computer reads a program or data off the disk, what actually
|
|
happens is that the processor uses the bus to send a disk read request
|
|
to your disk controller. Some time later the disk controller uses the
|
|
bus to signal the processor that it has read the data and put it in a
|
|
certain location in memory. The processor can then use the bus to look
|
|
at that data.</para>
|
|
|
|
<para>Your keyboard and screen also communicate with the processor via the
|
|
bus, but in simpler ways. We'll discuss those later on. For now, you know
|
|
enough to understand what happens when you turn on your computer.</para>
|
|
|
|
</sect1>
|
|
<sect1 id="bootup"><title>What happens when you switch on a computer?</title>
|
|
|
|
<para>A computer without a program running is just an inert hunk of
|
|
electronics. The first thing a computer has to do when it is turned on is
|
|
start up a special program called an <anchor id="os"/><firstterm>operating
|
|
system</firstterm>. The operating system's job is to help other computer
|
|
programs to work by handling the messy details of controlling the
|
|
computer's hardware.</para>
|
|
|
|
<para>The process of bringing up the operating system is called <anchor
|
|
id="boot"/> <firstterm>booting</firstterm> (originally this was
|
|
<firstterm>bootstrapping</firstterm> and alluded to the process of pulling
|
|
yourself up <quote>by your bootstraps</quote>). Your computer knows how to
|
|
boot because instructions for booting are built into one of its chips, the
|
|
BIOS (or Basic Input/Output System) chip.</para>
|
|
|
|
<para>The BIOS chip tells it to look in a fixed place, usually on the
|
|
lowest-numbered hard disk (the <firstterm>boot disk</firstterm>) for a
|
|
special program called a <firstterm>boot loader</firstterm> (under Linux the
|
|
boot loader is called GRUB or LILO). The boot loader is pulled into memory and
|
|
started. The boot loader's job is to start the real operating
|
|
system.</para>
|
|
|
|
<para>The loader does this by looking for a
|
|
<firstterm>kernel</firstterm><indexterm><primary>kernel</primary></indexterm>,
|
|
loading it into memory, and starting it. If you boot Linux and see
|
|
"LILO" on the screen followed by a bunch of dots, it is loading the kernel.
|
|
(Each dot means it has loaded another <anchor id="diskblock"/><firstterm>disk
|
|
block</firstterm> of kernel code.)</para>
|
|
|
|
<para>(You may wonder why the BIOS doesn't load the kernel directly —
|
|
why the two-step process with the boot loader? Well, the BIOS isn't very
|
|
smart. In fact it's very stupid, and Linux doesn't use it at all after
|
|
boot time. It was originally written for primitive 8-bit PCs with tiny
|
|
disks, and literally can't access enough of the disk to load the kernel
|
|
directly. The boot loader step also lets you start one of several
|
|
operating systems off different places on your disk, in the unlikely event
|
|
that Unix isn't good enough for you.)</para>
|
|
|
|
<para>Once the kernel starts, it has to look around, find the rest of the
|
|
hardware, and get ready to run programs. It does this by poking not at
|
|
ordinary memory locations but rather at <firstterm>I/O ports</firstterm>
|
|
— special bus addresses that are likely to have device controller
|
|
cards listening at them for commands. The kernel doesn't poke at random;
|
|
it has a lot of built-in knowledge about what it's likely to find where,
|
|
and how controllers will respond if they're present. This process is
|
|
called
|
|
<firstterm>autoprobing</firstterm><indexterm><primary>autoprobing</primary></indexterm>.</para>
|
|
|
|
<para>You may or may not be able to see any of this going on. Back when
|
|
Unix systems used text consoles, you'd see boot messages scroll by on your
|
|
screen as the system started up. Nowadays, Unixes often hide the boot
messages behind a graphical splash screen. You may be able to see them by
switching to a text console view with the key combination Ctrl-Alt-F1. If
this works, you should be able to switch back to the graphical boot screen
with a different Ctrl-Alt sequence; try F7, F8, and F9.</para>
|
|
|
|
<para>Most of the messages emitted at boot time are the kernel autoprobing
|
|
your hardware through the I/O ports, figuring out what it has available to
|
|
it and adapting itself to your machine. The Linux kernel is extremely good
|
|
at this, better than most other Unixes and <emphasis>much</emphasis> better
|
|
than DOS or Windows. In fact, many Linux old-timers think the cleverness
|
|
of Linux's boot-time probes (which made it relatively easy to install) was
|
|
a major reason it broke out of the pack of free-Unix experiments to attract
|
|
a critical mass of users.</para>
|
|
|
|
<para>But getting the kernel fully loaded and running isn't the end of the
|
|
boot process; it's just the first stage (sometimes called <firstterm>run
|
|
level 1</firstterm>). After this first stage, the kernel hands control to a
|
|
special process called ‘init’ which spawns several housekeeping
|
|
processes. (Some recent Linuxes use a different program called
|
|
‘upstart’ that does similar things.)</para>
|
|
|
|
<para>The init process's first job is usually to check to make sure your disks
|
|
are OK. Disk file systems are fragile things; if they've been damaged by a
|
|
hardware failure or a sudden power outage, there are good reasons to take
|
|
recovery steps before your Unix is all the way up. We'll go into some of
|
|
this later on when we talk about <link linkend="fsck">how file systems can
|
|
go wrong</link>.</para>
|
|
|
|
<para>Init's next step is to start several <firstterm>daemons</firstterm>. A
|
|
daemon is a program like a print spooler, a mail listener or a WWW server
|
|
that lurks in the background, waiting for things to do. These special
|
|
programs often have to coordinate several requests that could conflict.
|
|
They are daemons because it's often easier to write one program that runs
|
|
constantly and knows about all requests than it would be to try to make
|
|
sure that a flock of copies (each processing one request and all running at
|
|
the same time) don't step on each other. The particular collection of
|
|
daemons your system starts may vary, but will almost always include a print
|
|
spooler (a gatekeeper daemon for your printer).</para>
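
<para>To make the one-gatekeeper idea concrete, here is a minimal sketch in
Python. It is only an illustration: the <literal>Spooler</literal> class and
its method names are invented for this example, the <quote>printer</quote> is
just a list, and a thread stands in for what would really be a separate
daemon process.</para>

<programlisting><![CDATA[
```python
import queue
import threading


class Spooler:
    """One long-running worker that handles print jobs strictly one at a time."""

    def __init__(self):
        self.jobs = queue.Queue()       # requests wait here in arrival order
        self.printed = []               # stands in for the real printer
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit(self, document):
        """Called by any number of clients; it only queues the request."""
        self.jobs.put(document)

    def _run(self):
        while True:
            document = self.jobs.get()      # sleep until there is work
            self.printed.append(document)   # only this thread touches the printer
            self.jobs.task_done()

    def wait_until_idle(self):
        """Block until every queued job has been handled."""
        self.jobs.join()


if __name__ == "__main__":
    spooler = Spooler()
    for name in ["report", "letter", "invoice"]:
        spooler.submit(name)
    spooler.wait_until_idle()
    print(spooler.printed)
```
]]></programlisting>

<para>Because only the worker ever touches the printer, clients never have to
coordinate with each other; they just drop requests in the queue.</para>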
|
|
|
|
<para>The next step is to prepare for users. Init starts a copy of a
|
|
program called <command>getty</command> to watch your screen and keyboard
|
|
(and maybe more copies to watch dial-in serial ports). Actually, nowadays
|
|
it usually starts multiple copies of <command>getty</command> so you have
|
|
several (usually 7 or 8) virtual consoles, with your screen and keyboard
|
|
connected to one of them at a time. But you likely won't see any of these,
|
|
because one of your consoles will be taken over by the X server (about
|
|
which more in a bit).</para>
|
|
|
|
<para>We're not done yet. The next step is to start up various daemons
|
|
that support networking and other services. The most important of these is
|
|
your X server. X is a daemon that manages your display, keyboard, and
|
|
mouse. Its main job is to produce the color pixel graphics you normally
|
|
see on your screen.</para>
|
|
|
|
<para>When the X server comes up, during the last part of your machine's
|
|
boot process, it effectively takes over the hardware from whatever virtual
|
|
console was previously in control. That's when you'll see a graphical
|
|
login screen, produced for you by a program called a <firstterm>display
|
|
manager</firstterm>.</para>
|
|
|
|
</sect1>
|
|
<sect1 id="login"><title>What happens when you log in?</title>
|
|
|
|
<para>When you log in, you identify yourself to the computer. On modern
|
|
Unixes you will usually do this through a graphical display manager. But
|
|
it's possible to switch virtual consoles with a Ctrl-Alt key sequence and
|
|
do a textual login, too. In that case you go through the
|
|
<command>getty</command> instance watching that console to call the
|
|
program <command>login</command>.</para>
|
|
|
|
<para>You identify yourself to the display manager or
|
|
<command>login</command> with a login name and password. That login name
|
|
is looked up in a file called /etc/passwd, which is a sequence of lines
|
|
each describing a user account.</para>
|
|
|
|
<para>One of the fields in each line is an encrypted version of the account password
|
|
(sometimes the encrypted fields are actually kept in a second /etc/shadow
|
|
file with tighter permissions; this makes password cracking harder). What
|
|
you enter as an account password is encrypted in exactly the same way, and
|
|
the <command>login</command> program checks to see if they match. The
|
|
security of this method depends on the fact that, while it's easy to go
|
|
from your clear password to the encrypted version, the reverse is very
|
|
hard. Thus, even if someone can see the encrypted version of your
|
|
password, they can't use your account. (It also means that if you forget
|
|
your password, there's no way to recover it, only to change it to something
|
|
else you choose.)</para>
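
<para>The check can be sketched in a few lines of Python. This is an
illustration, not what <command>login</command> really runs: the function
names are made up, PBKDF2 stands in for whatever one-way function your
system's password library uses, and the salt is an extra detail (not
discussed above) that real systems add so that two accounts with the same
password don't store the same value.</para>

<programlisting><![CDATA[
```python
import hashlib
import hmac
import os


def hash_password(password, salt):
    """One-way transform: easy to compute, impractical to run backwards."""
    # PBKDF2-SHA256 is an assumption made for this sketch only.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def check_password(attempt, salt, stored):
    """Transform the attempt the same way and compare it with the stored value."""
    return hmac.compare_digest(hash_password(attempt, salt), stored)


if __name__ == "__main__":
    salt = os.urandom(16)                    # random per-account salt
    stored = hash_password("xyzzy", salt)    # what the account entry would hold
    print(check_password("xyzzy", salt, stored))   # the right password
    print(check_password("plugh", salt, stored))   # a wrong one
```
]]></programlisting>

<para>Notice that nothing here ever turns the stored value back into the
password; the only operation is transform-and-compare.</para>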
|
|
|
|
<para>Once you have successfully logged in, you get all the privileges
|
|
associated with the individual account you are using. You may also be
|
|
recognized as part of a
|
|
<firstterm>group</firstterm><indexterm><primary>group</primary></indexterm>.
|
|
A group is a named collection of users set up by the system administrator.
|
|
Groups can have privileges independently of their members’ privileges. A
|
|
user can be a member of multiple groups. (For details about how Unix
|
|
privileges work, see the section below on <link linkend="permissions">permissions</link>.)</para>
|
|
|
|
<para>(Note that although you will normally refer to users and groups by
|
|
name, they are actually stored internally as numeric IDs. The password
|
|
file maps your account name to a user ID; the
|
|
<filename>/etc/group</filename><indexterm><primary>/etc/group</primary></indexterm>
|
|
file maps group names to numeric group IDs. Commands that deal with
|
|
accounts and groups do the translation automatically.)</para>
|
|
|
|
<para>Your account entry also contains your <firstterm>home
|
|
directory</firstterm><indexterm><primary>home
|
|
directory</primary></indexterm>, the place in the Unix file system where
|
|
your personal files will live. Finally, your account entry also sets your
|
|
<firstterm>shell</firstterm><indexterm><primary>shell</primary></indexterm>,
|
|
the command interpreter that <command>login</command> will start up to
|
|
accept your commands.</para>
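
<para>Here is a sketch of how one account line breaks down into fields,
assuming the conventional seven-field, colon-separated layout of
<filename>/etc/passwd</filename>. The account shown is made up, and the
<literal>Account</literal> and <literal>parse_passwd_line</literal> names
exist only for this example.</para>

<programlisting><![CDATA[
```python
from typing import NamedTuple


class Account(NamedTuple):
    name: str       # login name
    password: str   # encrypted password, or a placeholder such as "x"
    uid: int        # numeric user ID
    gid: int        # numeric ID of the user's primary group
    comment: str    # free-form description, often the user's full name
    home: str       # home directory
    shell: str      # command interpreter started at login


def parse_passwd_line(line):
    """Split one colon-separated account line into its seven fields."""
    name, password, uid, gid, comment, home, shell = line.rstrip("\n").split(":")
    return Account(name, password, int(uid), int(gid), comment, home, shell)


if __name__ == "__main__":
    # A made-up entry, not a real account.
    entry = parse_passwd_line("alice:x:1000:1000:Alice Example:/home/alice:/bin/sh")
    print(entry.name, entry.uid, entry.home, entry.shell)
```
]]></programlisting>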
|
|
|
|
<para>What happens after you have successfully logged in depends on how you
|
|
did it. On a text console, <command>login</command> will launch a shell
|
|
and you'll be off and running. If you logged in through a display
|
|
manager, the X server will bring up your graphical desktop and you will
|
|
be able to run programs from it — either through the menus, or
|
|
through desktop icons, or through a <firstterm>terminal
|
|
emulator</firstterm> running a <firstterm>shell</firstterm>.</para>
|
|
|
|
</sect1>
|
|
<sect1 id="running-programs"><title>What happens when you run programs
|
|
after boot time?</title>
|
|
|
|
<para>After boot time and before you run a program, you can think of your
|
|
computer as containing a zoo of processes that are all waiting for
|
|
something to do. They're all waiting on <firstterm>events</firstterm>. An
|
|
event can be you pressing a key or moving a mouse. Or, if your machine is
|
|
hooked to a network, an event can be a data packet coming in over that
|
|
network.</para>
|
|
|
|
<para>The kernel is one of these processes. It's a special one, because it
|
|
controls when the other <firstterm>user processes</firstterm> can run, and it
|
|
is normally the only process with direct access to the machine's hardware.
|
|
In fact, user processes have to make requests to the kernel when they want
|
|
to get keyboard input, write to your screen, read from or write to disk, or
|
|
do just about anything other than crunching bits in memory. These requests
|
|
are known as <firstterm>system calls</firstterm>.</para>
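
<para>Most programming languages give you thin wrappers around system calls.
In this sketch the Python functions <literal>os.pipe</literal>,
<literal>os.write</literal>, <literal>os.read</literal> and
<literal>os.close</literal> each ask the kernel to do the real work; the
name <literal>echo_through_pipe</literal> is made up for the example.</para>

<programlisting><![CDATA[
```python
import os


def echo_through_pipe(message):
    """Send a short byte string through a kernel pipe and read it back."""
    read_fd, write_fd = os.pipe()              # ask the kernel for a pipe
    try:
        os.write(write_fd, message)            # hand the bytes to the kernel
        return os.read(read_fd, len(message))  # ask the kernel for them back
    finally:
        os.close(read_fd)                      # give both descriptors back
        os.close(write_fd)


if __name__ == "__main__":
    print(echo_through_pipe(b"hello, kernel"))
```
]]></programlisting>

<para>The program never touches any hardware or kernel memory itself; every
step is a request that the kernel carries out on its behalf.</para>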
|
|
|
|
<para>Normally all I/O goes through the kernel so it can schedule the
|
|
operations and prevent processes from stepping on each other. A few
|
|
special user processes are allowed to slide around the kernel, usually by
|
|
being given direct access to I/O ports. X servers are the most common
|
|
example of this.</para>
|
|
|
|
<para>You will run programs in one of two ways: through your X server
|
|
or through a shell. Often, you'll actually do both, because you'll
|
|
start a terminal emulator that mimics an old-fashioned textual console,
|
|
giving you a shell to run programs from. I'll describe what happens
|
|
when you do that, then I'll return to what happens when you run a program
|
|
through an X menu or desktop icon.</para>
|
|
|
|
<para>The shell is called the shell because it wraps around and hides the
|
|
operating system kernel. It's an important feature of Unix that the shell
|
|
and kernel are separate programs communicating through a small set of
|
|
system calls. This makes it possible for there to be multiple shells,
|
|
suiting different tastes in interfaces.</para>
|
|
|
|
<para>The normal shell gives you the ‘$’ prompt that you see
|
|
after logging in (unless you've customized it to be something else). We
|
|
won't talk about shell syntax and the easy things you can see on the screen
|
|
here; instead we'll take a look behind the scenes at what's happening from
|
|
the computer's point of view.</para>
|
|
|
|
<para>The shell is just a user process, and not a particularly special one.
|
|
It waits on your keystrokes, listening (through the kernel) to the keyboard
|
|
I/O port. As the kernel sees them, it echoes them to your virtual console or
|
|
X terminal emulator. When the kernel sees an ‘Enter’ it passes
|
|
your line of text to the shell. The shell tries to interpret those
|
|
keystrokes as commands.</para>
|
|
|
|
<para>Let's say you type ‘ls’ and Enter to invoke the Unix
|
|
directory lister. The shell applies its built-in rules to figure out that
|
|
you want to run the executable command in the file
|
|
<filename>/bin/ls</filename>. It makes a system call asking the kernel to
|
|
start /bin/ls as a new <firstterm>child process</firstterm> and give it
|
|
access to the screen and keyboard through the kernel. Then the shell goes
|
|
to sleep, waiting for ls to finish.</para>
|
|
|
|
<para>When <command>/bin/ls</command> is done, it tells the kernel it's
|
|
finished by issuing an <firstterm>exit</firstterm> system call. The kernel
|
|
then wakes up the shell and tells it that it can continue running. The shell
|
|
issues another prompt and waits for another line of input.</para>
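
<para>The sequence just described can be sketched as follows. This is an
approximation of what a shell does, not code from any real shell:
<literal>run_and_wait</literal> is an invented name, and the sketch assumes a
Unix-like system where <literal>os.fork</literal> is available.</para>

<programlisting><![CDATA[
```python
import os
from typing import List


def run_and_wait(argv: List[str]) -> int:
    """Start argv as a child process, sleep until it exits, return its status."""
    pid = os.fork()                    # the kernel makes a copy of this process
    if pid == 0:
        # Child: replace this copy with the requested program.
        try:
            os.execv(argv[0], argv)
        finally:
            os._exit(127)              # only reached if the exec failed
    _, status = os.waitpid(pid, 0)     # parent sleeps here until the child exits
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    status = run_and_wait(["/bin/ls", "/"])
    print("child finished with status", status)
```
]]></programlisting>

<para>The call to <literal>os.waitpid</literal> is the point where the shell
<quote>goes to sleep</quote>; the kernel wakes it when the child makes its
exit call.</para>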
|
|
|
|
<para>Other things may be going on while your ‘ls’ is
|
|
executing, however (we'll have to suppose that you're listing a very long
|
|
directory). You might switch to another virtual console, log in there, and
|
|
start a game of Quake, for example. Or, suppose you're hooked up to the
|
|
Internet. Your machine might be sending or receiving mail while
|
|
<command>/bin/ls</command> runs.</para>
|
|
|
|
<para>When you're running programs through the X server rather than a shell
|
|
(that is, by choosing an application from a pull-down menu, or
|
|
double-clicking a desktop icon), any of several programs associated with
|
|
your X server can behave like a shell and launch the program. I'm going to
|
|
gloss over the details here because they're both variable and unimportant.
|
|
The key point is that the X server, unlike a normal shell, doesn't go to
|
|
sleep while the client program is running — instead, it sits between
|
|
you and the client, passing your mouse clicks and keypresses to it and
|
|
fulfilling its requests to paint pixels on your display.</para>
|
|
|
|
</sect1>
|
|
<sect1 id="devices"><title>How do input devices and interrupts work?</title>
|
|
|
|
<para>Your keyboard is a very simple input device; simple because it
|
|
generates small amounts of data very slowly (by a computer's standards).
|
|
When you press or release a key, that event is signalled up the keyboard
|
|
cable to raise a <firstterm>hardware
|
|
interrupt</firstterm><indexterm><primary>hardware
|
|
interrupt</primary></indexterm>.</para>
|
|
|
|
<para>It's the operating system's job to watch for such interrupts. For
|
|
each possible kind of interrupt, there will be an <firstterm>interrupt
|
|
handler</firstterm><indexterm><primary>interrupt
|
|
handler</primary></indexterm>, a part of the operating system that stashes
|
|
away any data associated with them (like your keypress/keyrelease value)
|
|
until it can be processed.</para>
|
|
|
|
<para>What the interrupt handler for your keyboard actually does is post the
|
|
key value into a system area near the bottom of memory. There, it will
|
|
be available for inspection when the operating system passes control to
|
|
whichever program is currently supposed to be reading from the keyboard.</para>
|
|
|
|
<para>More complex input devices like disk or network cards work in a similar
|
|
way. Earlier, I referred to a disk controller using the bus to signal that
|
|
a disk request has been fulfilled. What actually happens is that the disk
|
|
raises an interrupt. The disk interrupt handler then copies the retrieved
|
|
data into memory, for later use by the program that made the request.</para>
|
|
|
|
<para>Every kind of interrupt has an associated <firstterm>priority
|
|
level</firstterm><indexterm><primary>priority level</primary></indexterm>.
|
|
Lower-priority interrupts (like keyboard events) have to wait on
|
|
higher-priority interrupts (like clock ticks or disk events). Unix is
|
|
designed to give high priority to the kinds of events that need to be
|
|
processed rapidly in order to keep the machine's response smooth.</para>
|
|
|
|
<para>In your operating system's boot-time messages, you may see references
|
|
to <firstterm>IRQ</firstterm><indexterm><primary>IRQ</primary></indexterm>
|
|
numbers. You may be aware that one of the common ways to misconfigure
|
|
hardware is to have two different devices try to use the same IRQ, without
|
|
understanding exactly why. </para>
|
|
|
|
<para>Here's the answer. IRQ is short for "Interrupt Request". The operating
|
|
system needs to know at startup time which numbered interrupts each
|
|
hardware device will use, so it can associate the proper handlers with each
|
|
one. If two different devices try to use the same IRQ, interrupts will
|
|
sometimes get dispatched to the wrong handler. This will usually at least
|
|
lock up the device, and can sometimes confuse the OS badly enough that it
|
|
will flake out or crash.</para>
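
<para>A toy model may help here. The sketch below keeps one handler per IRQ
number and dispatches pending interrupts in priority order. Everything in it
is an assumption made for illustration: the class and method names are
invented, the priority numbers are arbitrary, and a real kernel's interrupt
tables are far more involved. Unlike real hardware, this model refuses a
second claim on an IRQ outright, which makes the conflict easy to see.</para>

<programlisting><![CDATA[
```python
import heapq


class InterruptController:
    """Toy model: one handler per IRQ number, dispatched in priority order."""

    def __init__(self):
        self.handlers = {}   # IRQ number -> handler function
        self.pending = []    # heap of (priority, arrival order, irq, data)
        self.sequence = 0    # tie-breaker so equal priorities keep arrival order

    def register(self, irq, handler):
        """Associate a handler with an IRQ; a second claim is a conflict."""
        if irq in self.handlers:
            raise ValueError(f"IRQ {irq} is already claimed by another device")
        self.handlers[irq] = handler

    def raise_irq(self, irq, priority, data):
        """Record an interrupt; a lower priority number means more urgent."""
        heapq.heappush(self.pending, (priority, self.sequence, irq, data))
        self.sequence += 1

    def dispatch_all(self):
        """Run handlers for all pending interrupts, most urgent first."""
        while self.pending:
            _, _, irq, data = heapq.heappop(self.pending)
            self.handlers[irq](data)


if __name__ == "__main__":
    log = []
    controller = InterruptController()
    controller.register(1, lambda data: log.append(("keyboard", data)))
    controller.register(14, lambda data: log.append(("disk", data)))
    controller.raise_irq(1, priority=5, data="key pressed")
    controller.raise_irq(14, priority=1, data="block ready")
    controller.dispatch_all()
    print(log)   # the disk event is handled before the keyboard event
```
]]></programlisting>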
|
|
|
|
</sect1>
|
|
<sect1 id="multitasking"><title>How does my computer do several things at once?</title>
|
|
|
|
<para>It doesn't, actually. Computers can only do one task (or
|
|
<firstterm>process</firstterm>) at a time. But a computer can change tasks
|
|
very rapidly, and fool slow human beings into thinking it's doing several
|
|
things at once. This is called
|
|
<firstterm>timesharing</firstterm><indexterm><primary>timesharing</primary></indexterm>.</para>
|
|
|
|
<para>One of the kernel's jobs is to manage timesharing. It has a part
|
|
called the
|
|
<firstterm>scheduler</firstterm><indexterm><primary>scheduler</primary></indexterm>
|
|
which keeps information inside itself about all the other (non-kernel)
|
|
processes in your zoo. Every 1/60th of a second, a timer goes off in the
|
|
kernel, generating a clock interrupt. The scheduler stops whatever process
|
|
is currently running, suspends it in place, and hands control to another
|
|
process.</para>
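
<para>Here is a minimal sketch of that take-turns loop. It is a
simplification under stated assumptions: the names are invented, each
pretend process is a Python generator, and each <literal>yield</literal>
marks the point where its timeslice ends. A real scheduler does not wait to
be handed control; the clock interrupt takes it away.</para>

<programlisting><![CDATA[
```python
from collections import deque


def worker(name, steps):
    """A pretend process; each yield is the point where its timeslice ends."""
    for step in range(1, steps + 1):
        yield f"{name} step {step}"


def round_robin(processes):
    """Run each process for one slice at a time until all have finished."""
    ready = deque(processes)
    trace = []
    while ready:
        current = ready.popleft()          # pick the next runnable process
        try:
            trace.append(next(current))    # let it run for one timeslice
        except StopIteration:
            continue                       # it finished; drop it from the zoo
        ready.append(current)              # suspended; back of the queue
    return trace


if __name__ == "__main__":
    for line in round_robin([worker("A", 3), worker("B", 1), worker("C", 2)]):
        print(line)
```
]]></programlisting>

<para>Each process sees its own steps happen in order, even though the
processor keeps switching away from it in between.</para>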
|
|
|
|
<para>1/60th of a second may not sound like a lot of time. But on today's
|
|
microprocessors it's enough to run millions of machine
|
|
instructions, which can do a great deal of work. So even if you have many
|
|
processes, each one can accomplish quite a bit in each of its
|
|
timeslices.</para>
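<para>The hand-off between processes can be sketched as a small simulation.
This is a simplified round-robin model under assumed names
(<literal>round_robin</literal>, <literal>work</literal>,
<literal>timeslice</literal>); a real scheduler also weighs priorities and
I/O waits, which are left out here.</para>

<programlisting><![CDATA[
```python
from collections import deque

def round_robin(work, timeslice):
    """Simulate round-robin timesharing.

    `work` maps a process name to the number of time units it still needs.
    Returns the list of (name, units_run) slices in the order they ran.
    """
    queue = deque(work.items())
    trace = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(timeslice, remaining)   # run until done or the timer fires
        trace.append((name, ran))
        if remaining > ran:               # not finished: back of the line
            queue.append((name, remaining - ran))
    return trace

print(round_robin({"editor": 3, "compiler": 5, "mailer": 1}, timeslice=2))
```
]]></programlisting>

<para>Each process gets a turn before any process gets a second one, so
short jobs like the hypothetical <literal>mailer</literal> finish quickly
even while a long job is still running.</para>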
|
|
|
|
<para>In practice, a program may not get its entire timeslice. If an
|
|
interrupt comes in from an I/O device, the kernel effectively stops the
|
|
current task, runs the interrupt handler, and then returns to the current
|
|
task. A storm of high-priority interrupts can squeeze out normal
|
|
processing; this misbehavior is called <firstterm>thrashing</firstterm> and
|
|
is fortunately very hard to induce under modern Unixes.</para>
|
|
|
|
<para>In fact, the speed of programs is only very seldom limited by the
|
|
amount of machine time they can get (there are a few exceptions to this
|
|
rule, such as sound or 3-D graphics generation). Much more often, delays
|
|
are caused when the program has to wait on data from a disk drive or
|
|
network connection.</para>
|
|
|
|
<para>An operating system that can routinely support many simultaneous
|
|
processes is called <quote>multitasking</quote>. The Unix family of operating
|
|
systems was designed from the ground up for multitasking and is very good
|
|
at it — much more effective than Windows or the old Mac OS, which both
|
|
had multitasking bolted into them as an afterthought and do it rather poorly.
|
|
Efficient, reliable multitasking is a large part of what makes Linux
|
|
superior for networking, communications, and Web service.</para>
|
|
|
|
</sect1>
|
|
<sect1 id="memory-management"><title>How does my computer keep processes from stepping on each other?</title>
|
|
|
|
<para>The kernel's scheduler takes care of dividing processes in time.
|
|
Your operating system also has to divide them in space, so that processes
|
|
can't step on each others' working memory. Even if you assume that all
|
|
programs are trying to be cooperative, you don't want a bug in one of them
|
|
to be able to corrupt others. The things your operating system does to
|
|
solve this problem are called <firstterm>memory
|
|
management</firstterm><indexterm><primary>memory
|
|
management</primary></indexterm>.</para>
|
|
|
|
<para>Each process in your zoo needs its own area of memory, as a place to
|
|
run its code from and keep variables and results in. You can think of this
|
|
set as consisting of a read-only <firstterm>code
|
|
segment</firstterm><indexterm><primary>code segment</primary></indexterm>
|
|
(containing the process's instructions) and a writeable <firstterm>data
|
|
segment</firstterm><indexterm><primary>data segment</primary></indexterm>
|
|
(containing all the process's variable storage). The data segment is truly
|
|
unique to each process, but if two processes are running the same code Unix
|
|
automatically arranges for them to share a single code segment as an
|
|
efficiency measure.</para>
|
|
|
|
<sect2 id="vm-simple"><title>Virtual memory: the simple version</title>
|
|
|
|
<para>Efficiency is important, because memory is expensive. Sometimes you
|
|
don't have enough to hold the entirety of all the programs the machine is
|
|
running, especially if you are using a large program like an X server. To
|
|
get around this, Unix uses a technique called <anchor id="vm"/>
|
|
<firstterm>virtual memory</firstterm><indexterm><primary>virtual
|
|
memory</primary></indexterm>. It doesn't try to hold all the code and data
|
|
for a process in memory. Instead, it keeps around only a relatively small
|
|
<firstterm>working set</firstterm><indexterm><primary>working
|
|
set</primary></indexterm>; the rest of the process's state is left in a
|
|
special <firstterm>swap space</firstterm><indexterm><primary>swap
|
|
space</primary></indexterm> area on your hard disk.</para>
|
|
|
|
<para>Note that in the past, the <quote>Sometimes</quote> of the previous paragraph was
|
|
<quote>Almost always</quote> — the size of memory was typically small
|
|
relative to the size of running programs, so swapping was frequent. Memory
|
|
is far less expensive nowadays and even low-end machines have quite a lot
|
|
of it. On modern single-user machines with 64MB of memory and up, it's
|
|
possible to run X and a typical mix of jobs without ever swapping after
|
|
they're initially loaded into core.</para>
|
|
|
|
</sect2>
|
|
<sect2 id="vm-details"><title>Virtual memory: the detailed version</title>
|
|
|
|
<para>Actually, the last section oversimplified things a bit. Yes,
|
|
programs see most of your memory as one big flat bank of addresses bigger
|
|
than physical memory, and disk swapping is used to maintain that illusion.
|
|
But your hardware actually has no fewer than five different kinds of memory
|
|
in it, and the differences between them can matter a good deal when
|
|
programs have to be tuned for maximum speed. To really understand what
|
|
goes on in your machine, you should learn how all of them work.</para>
|
|
|
|
<para>The five kinds of memory are these: processor registers, internal (or
|
|
on-chip) cache, external (or off-chip) cache, main memory, and disk. And
|
|
the reason there are so many kinds is simple: speed costs money. I have
|
|
listed these kinds of memory in increasing order of access time and
|
|
decreasing order of cost. Register memory is the fastest and most
|
|
expensive and can be random-accessed about a billion times a second, while
|
|
disk is the slowest and cheapest and can do about 100 random accesses a
|
|
second.</para>
|
|
|
|
<para>Here's a full list reflecting early-2000 speeds for a typical desktop
|
|
machine. While speed and capacity will go up and prices will drop, you can
|
|
expect these ratios to remain fairly constant — and it's those ratios that
|
|
shape the memory hierarchy.</para>
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>Disk</term>
|
|
<listitem><para>Size: 13000MB Accesses: 100KB/sec</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Main memory</term>
|
|
<listitem><para>Size: 256MB Accesses: 100M/sec</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>External cache</term>
|
|
<listitem><para>Size: 512KB Accesses: 250M/sec</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Internal Cache</term>
|
|
<listitem><para>Size: 32KB Accesses: 500M/sec</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Processor</term>
|
|
<listitem><para>Size: 28 bytes Accesses: 1000M/sec</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
<para>We can't build everything out of the fastest kinds of memory. It
|
|
would be way too expensive — and even if it weren't, fast memory is
|
|
volatile. That is, it loses its marbles when the power goes off. Thus,
|
|
computers have to have hard disks or other kinds of non-volatile storage
|
|
that retains data when the power goes off. And there's a huge mismatch
|
|
between the speed of processors and the speed of disks. The middle three
|
|
levels of the memory hierarchy (<firstterm>internal
|
|
cache</firstterm><indexterm><primary>internal
|
|
cache</primary></indexterm>, <firstterm>external
|
|
cache</firstterm><indexterm><primary>external
|
|
cache</primary></indexterm>, and main memory) basically exist to bridge
|
|
that gap.</para>
|
|
|
|
<para>Linux and other Unixes have a feature called <firstterm>virtual
|
|
memory</firstterm><indexterm><primary>virtual memory</primary></indexterm>.
|
|
What this means is that the operating system behaves as though it has much
|
|
more main memory than it actually does. Your actual physical main memory
|
|
behaves like a set of windows or caches on a much larger "virtual" memory
|
|
space, most of which at any given time is actually stored on disk in a
|
|
special zone called the <firstterm>swap
|
|
area</firstterm><indexterm><primary>swap area</primary></indexterm>. Out of
|
|
sight of user programs, the OS is moving blocks of data (called "pages")
|
|
between memory and disk to maintain this illusion. The end result is that
|
|
your virtual memory is much larger but not too much slower than real
|
|
memory.</para>
|
|
|
|
<para>How much slower virtual memory is than physical depends on how well
|
|
the operating system's swapping algorithms match the way your programs use
|
|
virtual memory. Fortunately, memory reads and writes that are close
|
|
together in time also tend to cluster in memory space. This tendency is
|
|
called
|
|
<firstterm>locality</firstterm><indexterm><primary>locality</primary></indexterm>,
|
|
or more formally <firstterm>locality of
|
|
reference</firstterm><indexterm><primary>locality of
|
|
reference</primary></indexterm> — and it's a good thing. If memory
|
|
references jumped around virtual space at random, you'd typically have to
|
|
do a disk read and write for each new reference and virtual memory would be
|
|
as slow as a disk. But because programs do actually exhibit strong
|
|
locality, your operating system can do relatively few swaps per
|
|
reference.</para>
|
|
|
|
<para>It's been found by experience that the most effective method for a
|
|
broad class of memory-usage patterns is very simple; it's called LRU or the
|
|
<quote>least recently used</quote> algorithm. The virtual-memory system grabs disk
|
|
blocks into its <firstterm>working
|
|
set</firstterm><indexterm><primary>working set</primary></indexterm> as it
|
|
needs them. When it runs out of physical memory for the working set, it
|
|
dumps the least-recently-used block. All Unixes, and most other
|
|
virtual-memory operating systems, use minor variations on LRU.</para>
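<para>To see why locality matters to LRU, here is a minimal sketch that
counts how often a page has to be fetched from disk. It implements
textbook LRU, not the approximations real kernels use, and the function
name <literal>lru_faults</literal> and the sample reference lists are
made up for the example.</para>

<programlisting><![CDATA[
```python
from collections import OrderedDict

def lru_faults(references, frames):
    """Count page faults for a list of page references under pure LRU."""
    resident = OrderedDict()   # pages in memory, least recently used first
    faults = 0
    for page in references:
        if page in resident:
            resident.move_to_end(page)         # now the most recently used
        else:
            faults += 1                        # page must come in from disk
            if len(resident) >= frames:
                resident.popitem(last=False)   # evict the least recently used
            resident[page] = None
    return faults

local = [1, 2, 1, 2, 3, 1, 2, 3, 1, 2]        # strong locality
scattered = [1, 5, 9, 2, 6, 10, 3, 7, 11, 4]  # every reference is new
print(lru_faults(local, frames=3), lru_faults(scattered, frames=3))
```
]]></programlisting>

<para>With three page frames, the clustered references fault only on first
touch, while the scattered ones fault every single time.</para>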
|
|
|
|
<para>Virtual memory is the first link in the bridge between disk and
|
|
processor speeds. It's explicitly managed by the OS. But there is still a
|
|
major gap between the speed of physical main memory and the speed at which
|
|
a processor can access its register memory. The external and internal
|
|
caches address this, using a technique similar to virtual memory as I've
|
|
described it.</para>
|
|
|
|
<para>Just as the physical main memory behaves like a set of windows or
|
|
caches on the disk's swap area, the external cache acts as windows on main
|
|
memory. External cache is faster (250M accesses per sec, rather than 100M)
|
|
and smaller. The hardware (specifically, your computer's memory
|
|
controller) does the LRU thing in the external cache on blocks of data
|
|
fetched from the main memory. For historical reasons, the unit of cache
|
|
swapping is called a <firstterm>line</firstterm> rather than a page.</para>
|
|
|
|
<para>But we're not done. The internal cache gives us the final step-up in
|
|
effective speed by caching portions of the external cache. It is faster
|
|
and smaller yet — in fact, it lives right on the processor chip.</para>
|
|
|
|
<para>If you want to make your programs really fast, it's useful to know
|
|
these details. Your programs get faster when they have stronger locality,
|
|
because that makes the caching work better. The easiest way to make
|
|
programs fast is therefore to make them small. If a program isn't slowed
|
|
down by lots of disk I/O or waits on network events, it will usually run at
|
|
the speed of the smallest cache that it will fit inside.</para>
|
|
|
|
<para>If you can't make your whole program small, some effort to tune the
|
|
speed-critical portions so they have stronger locality can pay off.
|
|
Details on techniques for doing such tuning are beyond the scope of this
|
|
tutorial; by the time you need them, you'll be intimate enough with some
|
|
compiler to figure out many of them yourself.</para>
|
|
|
|
</sect2>
|
|
<sect2 id="mmu"><title>The Memory Management Unit</title>
|
|
|
|
<para>Even when you have enough physical core to avoid swapping, the part
|
|
of the operating system called the <firstterm>memory manager</firstterm>
|
|
still has important work to do. It has to make sure that programs can only
|
|
alter their own data segments — that is, prevent erroneous or malicious
|
|
code in one program from garbaging the data in another. To do this, it
|
|
keeps a table of data and code segments. The table is updated whenever a
|
|
process either requests more memory or releases memory (the latter usually
|
|
when it exits).</para>
|
|
|
|
<para>This table is used to pass commands to a specialized part of the
|
|
underlying hardware called an
|
|
<firstterm>MMU</firstterm><indexterm><primary>MMU</primary></indexterm> or
|
|
<firstterm>memory management unit</firstterm><indexterm><primary>memory
|
|
management unit</primary></indexterm>. Modern processor chips have MMUs
|
|
built right onto them. The MMU has the special ability to put fences
|
|
around areas of memory, so an out-of-bound reference will be refused and
|
|
cause a special interrupt to be raised.</para>
|
|
|
|
<para>If you ever see a Unix message that says <quote>Segmentation fault</quote>,
|
|
<quote>core dumped</quote> or something similar, this is exactly what has happened;
|
|
an attempt by the running program to access memory (core) outside its
|
|
segment has raised a fatal interrupt. This indicates a bug in the program
|
|
code; the <firstterm>core dump</firstterm><indexterm><primary>core
|
|
dump</primary></indexterm> it leaves behind is diagnostic information
|
|
intended to help a programmer track it down.</para>
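<para>The fence-checking idea can be modeled in a few lines. This is only
an analogy in Python: a real MMU does this in hardware on every memory
reference, and the <literal>SegmentationFault</literal> exception, the
<literal>in_bounds</literal> and <literal>access</literal> functions, and
the segment table are assumptions invented for the sketch.</para>

<programlisting><![CDATA[
```python
class SegmentationFault(Exception):
    """Stands in for the interrupt a real MMU would raise."""

def in_bounds(segments, pid, address):
    """True if `address` falls inside one of the segments owned by `pid`.

    `segments` maps a process ID to a list of (base, length) pairs.
    """
    return any(base <= address < base + length
               for base, length in segments.get(pid, []))

def access(segments, pid, address):
    """Refuse any reference outside the process's own segments."""
    if not in_bounds(segments, pid, address):
        raise SegmentationFault(f"process {pid} touched address {address:#x}")
    return address

table = {1: [(0x1000, 0x400)], 2: [(0x8000, 0x200)]}
access(table, 1, 0x1200)          # inside process 1's segment: allowed
try:
    access(table, 1, 0x8010)      # process 2's memory: refused
except SegmentationFault as err:
    print(err)                    # process 1 touched address 0x8010
```
]]></programlisting>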
|
|
|
|
<para>There is another aspect to protecting processes from each other besides
|
|
segregating the memory they access. You also want to be able to control
|
|
their file accesses so a buggy or malicious program can't corrupt critical
|
|
pieces of the system. This is why Unix has <link linkend="permissions">
|
|
file permissions</link> which we'll discuss later.</para>
|
|
|
|
</sect2>
|
|
</sect1>
|
|
<sect1 id="core-formats"><title>How does my computer store things in memory?</title>
|
|
|
|
<para>You probably know that everything on a computer is stored as strings of
|
|
bits (binary digits; you can think of them as lots of little on-off
|
|
switches). Here we'll explain how those bits are used to represent the
|
|
letters and numbers that your computer is crunching.</para>
|
|
|
|
<para>Before we can go into this, you need to understand about the
|
|
<firstterm>word size</firstterm><indexterm><primary>word
|
|
size</primary></indexterm> of your computer. The word size is the
|
|
computer's preferred size for moving units of information around;
|
|
technically it's the width of your processor's
|
|
<firstterm>registers</firstterm><indexterm><primary>registers</primary></indexterm>,
|
|
which are the holding areas your processor uses to do arithmetic and
|
|
logical calculations. When people write about computers having bit sizes
|
|
(calling them, say, <quote>32-bit</quote> or <quote>64-bit</quote> computers), this is what
|
|
they mean.</para>
|
|
|
|
<para>Most computers now have a word size of 64 bits. In the recent past
|
|
(early 2000s) many PCs had 32-bit words. The old 286 machines back in the
|
|
1980s had a word size of 16. Old-style mainframes often had 36-bit
|
|
words.</para>
|
|
|
|
<para>The computer views your memory as a sequence of words numbered from
|
|
zero up to some large value dependent on your memory size. That value is
|
|
limited by your word size, which is why programs on older machines like
|
|
286s had to go through painful contortions to address large amounts of
|
|
memory. I won't describe them here; they still give older programmers
|
|
nightmares.</para>
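<para>The arithmetic behind that limit is simple: a register that is
<literal>n</literal> bits wide can hold 2<superscript>n</superscript>
distinct values, so it can name at most that many addresses. A quick
sketch (the helper name <literal>addressable</literal> is mine, and real
machines often wire up fewer address bits than the word size
allows):</para>

<programlisting><![CDATA[
```python
def addressable(bits):
    """Number of distinct addresses that fit in a register `bits` wide."""
    return 2 ** bits

for bits in (16, 32, 64):
    print(bits, addressable(bits))
```
]]></programlisting>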
|
|
|
|
<sect2 id="numbers"><title>Numbers</title>
|
|
|
|
<para>Integer numbers are represented as either words or pairs of words,
|
|
depending on your processor's word size. One 64-bit machine word is the
|
|
most common integer representation.</para>
|
|
|
|
<para>Integer arithmetic is close to but not actually mathematical
|
|
base-two. The low-order bit is 1, next 2, then 4 and so forth as in pure
|
|
binary. But signed numbers are represented in
|
|
<firstterm>twos-complement</firstterm><indexterm><primary>twos-complement</primary></indexterm>
|
|
notation. The highest-order bit is a <firstterm>sign
|
|
bit</firstterm><indexterm><primary>sign bit</primary></indexterm> which
|
|
makes the quantity negative, and every negative number can be obtained from
|
|
the corresponding positive value by inverting all the bits and adding one.
|
|
This is why integers on a 64-bit machine have the range
|
|
-2<superscript>63</superscript> to 2<superscript>63</superscript> - 1.
|
|
That 64th bit is being used for sign; 0 means a positive number or zero, 1
|
|
a negative number.</para>
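<para>The invert-and-add-one rule is easy to try out. The sketch below
models a 64-bit word using Python integers, which are otherwise unbounded;
the names <literal>negate</literal> and <literal>to_signed</literal> are
just illustrative helpers.</para>

<programlisting><![CDATA[
```python
BITS = 64
MASK = (1 << BITS) - 1   # 64 one-bits

def negate(value):
    """Two's-complement negation: invert every bit, then add one."""
    return ((value ^ MASK) + 1) & MASK

def to_signed(pattern):
    """Interpret a 64-bit pattern as a signed integer (top bit is the sign)."""
    return pattern - (1 << BITS) if pattern >> (BITS - 1) else pattern

pattern = negate(5)
print(f"{pattern:064b}")      # 61 ones followed by 011
print(to_signed(pattern))     # -5
print(to_signed(1 << 63), to_signed(MASK >> 1))   # smallest and largest values
```
]]></programlisting>

<para>The last line prints the two ends of the range given above: the
pattern with only the sign bit set, and the pattern with every bit except
the sign bit set.</para>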
|
|
|
|
<para>Some computer languages give you access to <firstterm>unsigned
|
|
arithmetic</firstterm><indexterm><primary>unsigned
|
|
arithmetic</primary></indexterm> which is straight base 2 with zero and
|
|
positive numbers only.</para>
|
|
|
|
<para>Most processors and some languages can do operations in
|
|
<firstterm>floating-point</firstterm><indexterm><primary>floating-point</primary></indexterm>
|
|
numbers (this capability is built into all recent processor chips).
|
|
Floating-point numbers give you a much wider range of values than integers
|
|
and let you express fractions. The ways in which this is done vary and are
|
|
rather too complicated to discuss in detail here, but the general idea is
|
|
much like so-called ‘scientific notation’, where one might
|
|
write (say) 1.234 * 10<superscript>23</superscript>; the encoding of the
|
|
number is split into a
|
|
<firstterm>mantissa</firstterm><indexterm><primary>mantissa</primary></indexterm>
|
|
(1.234) and the exponent part (23) for the power-of-ten multiplier (which
|
|
means the number multiplied out would have 20 zeros on it, 23 minus the
|
|
three decimal places).</para>
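<para>Here is a small sketch of that split, first in decimal as in the
example above and then in base two, which is what floating-point hardware
typically uses. It relies only on Python's standard
<literal>decimal</literal> and <literal>math</literal> modules; the
variable names are arbitrary.</para>

<programlisting><![CDATA[
```python
import math
from decimal import Decimal

# Decimal scientific notation, as in the text: 1.234 * 10**23.
value = Decimal("1.234E+23")
sign, digits, exponent = value.as_tuple()
print(digits, exponent)   # the digits 1, 2, 3, 4 and a power-of-ten exponent of 20
print(int(value))         # 1234 followed by 20 zeros

# Binary floats use the same idea in base two: mantissa * 2**exponent.
mantissa, exp2 = math.frexp(6.0)
print(mantissa, exp2)     # 0.75 3, since 0.75 * 2**3 == 6.0
```
]]></programlisting>

<para>Note that <literal>Decimal</literal> stores the digits as the whole
number 1234, so its exponent comes out as 20 rather than 23; that is the
same <quote>23 minus the three decimal places</quote> bookkeeping
described above.</para>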
|
|
|
|
</sect2>
|
|
<sect2 id="characters"><title>Characters</title>
|
|
|
|
<para>Characters are normally represented as strings of seven bits each in
|
|
an encoding called ASCII (American Standard Code for Information
|
|
Interchange). On modern machines, each of the 128 ASCII characters is the
|
|
low seven bits of an
|
|
<firstterm>octet</firstterm><indexterm><primary>octet</primary></indexterm>
|
|
or 8-bit byte; octets are packed into memory words so that (for example) a
|
|
six-character string only takes up one 64-bit memory word. For an ASCII code
|
|
chart, type ‘man 7 ascii’ at your Unix prompt.</para>
|
|
|
|
<para>The preceding paragraph was misleading in two ways. The minor one is
|
|
that the term ‘octet’ is formally correct but seldom actually
|
|
used; most people refer to an octet as a
|
|
<firstterm>byte</firstterm><indexterm><primary>byte</primary></indexterm>
|
|
and expect bytes to be eight bits long. Strictly speaking, the term
|
|
‘byte’ is more general; there used to be, for example, 36-bit
|
|
machines with 9-bit bytes (though there probably never will be
|
|
again).</para>
|
|
|
|
<para>The major one is that not all the world uses ASCII. In fact, much of
|
|
the world can't — ASCII, while fine for American English, lacks many
|
|
accented and other special characters needed by users of other languages.
|
|
Even British English has trouble with the lack of a pound-currency
|
|
sign.</para>
|
|
|
|
<para>There have been several attempts to fix this problem. All use the
|
|
extra high bit that ASCII doesn't, making it the low half of a
|
|
256-character set. The most widely-used of these is the so-called
|
|
‘Latin-1’ character set (more formally called ISO 8859-1).
|
|
This is the default character set for Linux, older versions of HTML, and X.
|
|
Microsoft Windows uses a mutant version of Latin-1 that adds a bunch of
|
|
characters such as right and left double quotes in places proper Latin-1
|
|
leaves unassigned for historical reasons (for a scathing account of the
|
|
trouble this causes, see the <ulink
|
|
url="http://www.fourmilab.ch/webtools/demoroniser/">demoroniser</ulink>
|
|
page).</para>
|
|
|
|
<para>Latin-1 handles western European languages, including English,
|
|
French, German, Spanish, Italian, Dutch, Norwegian, Swedish, Danish, and
|
|
Icelandic. However, this isn't good enough either, and as a result there
|
|
is a whole series of Latin-2 through -9 character sets to handle things
|
|
like Greek, Arabic, Hebrew, Esperanto, and Serbo-Croatian. For details,
|
|
see the <ulink
|
|
url="http://czyborra.com/charsets/iso8859.html">
|
|
ISO alphabet soup</ulink> page.</para>
|
|
|
|
<para>The ultimate solution is a huge standard called Unicode (and its
|
|
identical twin ISO/IEC 10646-1:1993). Unicode is identical to Latin-1 in
|
|
its lowest 256 slots. Above these in 16-bit space it includes Greek,
|
|
Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi,
|
|
Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian,
|
|
Tibetan, Japanese Kana, the complete set of modern Korean Hangul, and a
|
|
unified set of Chinese/Japanese/Korean (CJK) ideographs. For details, see
|
|
the <ulink url="http://www.unicode.org/">Unicode Home Page</ulink>. XML
|
|
and XHTML use this character set.</para>
|
|
|
|
<para>Recent versions of Linux use an encoding of Unicode called UTF-8. In
|
|
UTF-8, byte values 0-127 are ASCII. Byte values 128-255 are used only in
|
|
sequences of 2 through 4 bytes that identify non-ASCII characters.</para>
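<para>You can watch these encodings at work with a few lines of Python.
This sketch uses only the built-in <literal>str.encode</literal> method;
the helper name <literal>encodings</literal> and the choice of sample
characters are mine.</para>

<programlisting><![CDATA[
```python
def encodings(char):
    """Return (UTF-8 bytes, Latin-1 bytes or None) for a single character."""
    try:
        latin1 = char.encode("latin-1")
    except UnicodeEncodeError:
        latin1 = None          # the character has no slot in Latin-1
    return char.encode("utf-8"), latin1

# Escapes keep this listing pure ASCII: A, pound sign, e-acute, euro sign.
for char in ["A", "\u00a3", "\u00e9", "\u20ac"]:
    utf8, latin1 = encodings(char)
    print(f"U+{ord(char):04X}", "utf-8:", utf8.hex(), "latin-1:", latin1)
```
]]></programlisting>

<para>The ASCII letter is a single byte in both encodings. The pound sign
and the accented letter take one byte in Latin-1 but two in UTF-8, and the
euro sign has no Latin-1 encoding at all and takes three bytes in
UTF-8.</para>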
|
|
|
|
</sect2>
|
|
</sect1>
|
|
<sect1 id="disk-layout"><title>How does my computer store things on disk?</title>
|
|
|
|
<para>When you look at a hard disk under Unix, you see a tree of named
|
|
directories and files. Normally you won't need to look any deeper than
|
|
that, but it does become useful to know what's going on underneath if you
|
|
have a disk crash and need to try to salvage files. Unfortunately, there's
|
|
no good way to describe disk organization from the file level downwards, so
|
|
I'll have to describe it from the hardware up.</para>
|
|
|
|
<sect2 id="disk-lowlevel"><title>Low-level disk and file system structure</title>
|
|
|
|
<para>The surface area of your disk, where it stores data, is divided up
|
|
something like a dartboard — into circular tracks which are then
|
|
pie-sliced into sectors. Because tracks near the outer edge have more area
|
|
than those close to the spindle at the center of the disk, the outer tracks
|
|
have more sector slices in them than the inner ones. Each sector (or
|
|
<firstterm>disk block</firstterm><indexterm><primary>disk
|
|
block</primary></indexterm>) has the same size, which under modern Unixes
|
|
is generally 1 binary K (1024 8-bit bytes). Each disk block has a unique
|
|
address or <firstterm>disk block number</firstterm><indexterm><primary>disk
|
|
block number</primary></indexterm>.</para>
|
|
|
|
<para>Unix divides the disk into <firstterm>disk
|
|
partitions</firstterm><indexterm><primary>disk
|
|
partitions</primary></indexterm>. Each partition is a continuous span of
|
|
blocks that's used separately from any other partition, either as a file
|
|
system or as swap space. The original reasons for partitions had to do
|
|
with crash recovery in a world of much slower and more error-prone disks;
|
|
the boundaries between them reduce the fraction of your disk likely to
|
|
become inaccessible or corrupted by a random bad spot on the disk.
|
|
Nowadays, it's more important that partitions can be declared read-only
|
|
(preventing an intruder from modifying critical system files) or shared
|
|
over a network through various means we won't discuss here. The
|
|
lowest-numbered partition on a disk is often treated specially, as a
|
|
<firstterm>boot partition</firstterm><indexterm><primary>boot
|
|
partition</primary></indexterm> where you can put a kernel to be
|
|
booted.</para>
|
|
|
|
<para>Each partition is either <firstterm>swap
|
|
space</firstterm><indexterm><primary>swap space</primary></indexterm> (used
|
|
to implement <link linkend="vm">virtual memory</link>) or a <anchor
|
|
id="filesystems"/><firstterm>file system</firstterm><indexterm><primary>file
|
|
system</primary></indexterm> used to hold files. Swap-space partitions are
|
|
just treated as a linear sequence of blocks. File systems, on the other
|
|
hand, need a way to map file names to sequences of disk blocks. Because
|
|
files grow, shrink, and change over time, a file's data blocks will not be
|
|
a linear sequence but may be scattered all over its partition (from
|
|
wherever the operating system can find a free block when it needs
|
|
one). This scattering effect is called
|
|
<firstterm>fragmentation</firstterm>.</para>
|
|
|
|
</sect2>
|
|
<sect2 id="filestructure"><title>File names and directories</title>
|
|
|
|
<para>Within each file system, the mapping from names to blocks is handled
|
|
through a structure called an
|
|
<firstterm>i-node</firstterm><indexterm><primary>i-node</primary></indexterm>.
|
|
There's a pool of these things near the <quote>bottom</quote>
|
|
(lowest-numbered blocks) of each file system (the very lowest ones are used
|
|
for housekeeping and labeling purposes we won't describe here). Each
|
|
i-node describes one file. File data blocks (including directories) live
|
|
above the i-nodes (in higher-numbered blocks). </para>
|
|
|
|
<para>Every i-node contains a list of the disk block numbers in the file it
|
|
describes. (Actually this is a half-truth, only correct for small files,
|
|
but the rest of the details aren't important here.) Note that the i-node
|
|
does <emphasis>not</emphasis> contain the name of the file.</para>
|
|
|
|
<para>Names of files live in <firstterm>directory
|
|
structures</firstterm><indexterm><primary>directory
|
|
structures</primary></indexterm>. A directory structure just maps names to
|
|
i-node numbers. This is why, in Unix, a file can have multiple true names
|
|
(or <firstterm>hard links</firstterm><indexterm><primary>hard
|
|
links</primary></indexterm>); they're just multiple directory entries that
|
|
happen to point to the same i-node.</para>
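<para>Hard links are easy to observe for yourself. The following sketch
assumes a file system that supports them (nearly all native Unix ones do)
and works in a throwaway temporary directory; the file names are
arbitrary.</para>

<programlisting><![CDATA[
```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    first = os.path.join(workdir, "notes")
    second = os.path.join(workdir, "same-notes")

    with open(first, "w") as f:
        f.write("one file, two names\n")
    os.link(first, second)             # add a second directory entry

    a, b = os.stat(first), os.stat(second)
    same_inode = a.st_ino == b.st_ino  # both names lead to one i-node
    link_count = a.st_nlink            # how many names the i-node has
    os.remove(first)                   # drop one name; the data survives
    with open(second) as f:
        survived = f.read()

print(same_inode, link_count, survived.strip())
```
]]></programlisting>

<para>Removing the first name does not touch the file's contents, because
the i-node and its data blocks stay around as long as at least one
directory entry still points at them.</para>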
|
|
|
|
</sect2>
|
|
<sect2 id="mount-points"><title>Mount points</title>
|
|
|
|
<para>In the simplest case, your entire Unix file system lives in just one
|
|
disk partition. While you'll see this arrangement on some small personal
|
|
Unix systems, it's unusual. More typical is for it to be spread across
|
|
several disk partitions, possibly on different physical disks. So, for
|
|
example, your system may have one small partition where the kernel lives, a
|
|
slightly larger one where OS utilities live, and a much bigger one where
|
|
user home directories live.</para>
|
|
|
|
<para>The only partition you'll have access to immediately after system
|
|
boot is your <firstterm>root partition</firstterm><indexterm><primary>root partition</primary></indexterm>,
|
|
which is (almost always) the one you booted from. It holds the root
|
|
directory of the file system, the top node from which everything else
|
|
hangs.</para>
|
|
|
|
<para>The other partitions in the system have to be attached to this root
|
|
in order for your entire, multiple-partition file system to be accessible.
|
|
About midway through the boot process, your Unix will make these non-root
|
|
partitions accessible. It will
|
|
<firstterm>mount</firstterm><indexterm><primary>mount</primary></indexterm>
|
|
each one onto a directory on the root partition.</para>
|
|
|
|
<para>For example, if you have a Unix directory called
|
|
<filename>/usr</filename>, it is probably a mount point to a partition that
|
|
contains many programs installed with your Unix but not required during
|
|
initial boot.</para>
|
|
|
|
</sect2>
|
|
<sect2 id="iname"><title>How a file gets looked up</title>
|
|
|
|
<para>Now we can look at the file system from the top down. When you open
|
|
a file (such as, say,
|
|
<filename>/home/esr/WWW/ldp/fundamentals.xml</filename>) here is what
|
|
happens:</para>
|
|
|
|
<para>Your kernel starts at the root of your Unix file system (in the root
|
|
partition). It looks for a directory there called ‘home’.
|
|
Usually ‘home’ is a mount point to a large user partition
|
|
elsewhere, so it will go there. In the top-level directory structure of
|
|
that user partition, it will look for an entry called ‘esr’ and
|
|
extract an i-node number. It will go to that i-node, notice that its
|
|
associated file data blocks are a directory structure, and look up
|
|
‘WWW’. Extracting <emphasis>that</emphasis> i-node, it will go
|
|
to the corresponding subdirectory and look up ‘ldp’. That will
|
|
take it to yet another directory i-node. Opening that one, it will find an
|
|
i-node number for ‘fundamentals.xml’. That i-node is not a
|
|
directory, but instead holds the list of disk blocks associated with the
|
|
file.</para>
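<para>The walk just described can be sketched with a toy table of i-nodes.
Everything here is invented for illustration: the i-node numbers, the
block numbers, and the <literal>lookup</literal> function. The sketch also
ignores mount points and treats the whole tree as one file system.</para>

<programlisting><![CDATA[
```python
# Toy file system: i-node number -> directory (dict of name -> i-node number)
# or plain file (list of disk block numbers). All numbers are made up.
inodes = {
    2:  {"home": 11},
    11: {"esr": 23},
    23: {"WWW": 37},
    37: {"ldp": 41},
    41: {"fundamentals.xml": 58},
    58: [9120, 9121, 14003],
}
ROOT_INODE = 2

def lookup(path):
    """Walk `path` one component at a time and return the final i-node number."""
    current = ROOT_INODE
    for name in filter(None, path.split("/")):
        directory = inodes[current]
        if not isinstance(directory, dict):
            raise NotADirectoryError(name)
        current = directory[name]      # KeyError here means "no such file"
    return current

inode = lookup("/home/esr/WWW/ldp/fundamentals.xml")
print(inode, inodes[inode])
```
]]></programlisting>

<para>Each pass through the loop is one directory lookup: read the
directory's data, find the name, and follow the i-node number it
yields.</para>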
|
|
|
|
</sect2>
|
|
<sect2 id="permissions"><title>File ownership, permissions and security</title>
|
|
|
|
<para>To keep programs from accidentally or
|
|
maliciously stepping on data they shouldn't, Unix has
|
|
<firstterm>permission</firstterm><indexterm><primary>permission</primary></indexterm>
|
|
features. These were originally designed to support timesharing by
|
|
protecting multiple users on the same machine from each other, back in the
|
|
days when Unix ran mainly on expensive shared minicomputers.</para>
|
|
|
|
<para>In order to understand file permissions, you need to recall the
|
|
description of users and groups in the section <link linkend="login">
|
|
What happens when you log in?</link>. Each file has an owning user and an
|
|
owning group. These are initially those of the file's creator; they can be
|
|
changed with the programs
|
|
chown(1)<indexterm><primary>chown(1)</primary></indexterm> and
|
|
chgrp(1)<indexterm><primary>chgrp(1)</primary></indexterm>.</para>
|
|
|
|
<para>The basic permissions that can be associated with a file are
|
|
‘read’ (permission to read data from it), ‘write’
|
|
(permission to modify it) and ‘execute’ (permission to run it
|
|
as a program). Each file has three sets of permissions; one for its owning
|
|
user, one for any user in its owning group, and one for everyone else. The
|
|
‘privileges’ you get when you log in are just the ability to do
|
|
read, write, and execute on those files for which the permission bits match
|
|
your user ID or one of the groups you are in, or files that have been made
|
|
accessible to the world.</para>
|
|
|
|
<para>To see how these may interact and how Unix displays them, let's look
|
|
at some file listings on a hypothetical Unix system. Here's one:</para>
|
|
|
|
<screen>
|
|
snark:~$ ls -l notes
|
|
-rw-r--r-- 1 esr users 2993 Jun 17 11:00 notes
|
|
</screen>
|
|
|
|
<para>This is an ordinary data file. The listing tells us that it's owned
|
|
by the user ‘esr’ and was created with the owning group
|
|
‘users’. Probably the machine we're on puts every ordinary user in
|
|
this group by default; other groups you commonly see on timesharing
|
|
machines are ‘staff’, ‘admin’, or
|
|
‘wheel’ (for obvious reasons, groups are not very important on
|
|
single-user workstations or PCs). Your Unix may use a different default
|
|
group, perhaps one named after your user ID.</para>
|
|
|
|
<para>The string ‘-rw-r--r--’ represents the permission bits
|
|
for the file. The very first dash is the position for the directory bit;
|
|
it would show ‘d’ if the file were a directory, or would show
|
|
‘l’ if the file were a symbolic link. After that, the first
|
|
three places are user permissions, the second three group permissions, and
|
|
the third are permissions for others (often called ‘world’
|
|
permissions). On this file, the owning user ‘esr’ may read or
|
|
write the file, other people in the ‘users’ group may read it,
|
|
and everybody else in the world may read it. This is a pretty typical set
|
|
of permissions for an ordinary data file.</para>
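<para>If you want to poke at these strings programmatically, Python's
standard <literal>stat</literal> module can render a numeric mode the same
way <literal>ls -l</literal> does. The sketch below is only an
illustration; the helper name <literal>describe</literal> is mine, and the
octal modes are typical values rather than anything read from a real
file.</para>

<programlisting><![CDATA[
```python
import stat

def describe(mode):
    """Return the ls-style string and a per-class breakdown of permission bits."""
    text = stat.filemode(mode)                 # e.g. '-rw-r--r--'
    return text, {"user": text[1:4], "group": text[4:7], "other": text[7:10]}

print(describe(stat.S_IFREG | 0o644))   # an ordinary data file
print(describe(stat.S_IFDIR | 0o755))   # a typical directory
```
]]></programlisting>

<para>The octal digits map straight onto the three groups of letters: 6 is
read plus write, 4 is read only, and 7 and 5 add execute to each of
those.</para>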
|
|
|
|
<para>Now let's look at a file with very different permissions. This file
|
|
is GCC, the GNU C compiler. </para>
|
|
|
|
<screen>
|
|
snark:~$ ls -l /usr/bin/gcc
|
|
-rwxr-xr-x 3 root bin 64796 Mar 21 16:41 /usr/bin/gcc
|
|
</screen>
|
|
|
|
<para>This file belongs to a user called ‘root’ and a group
|
|
called ‘bin’; it can be written (modified) only by root, but
|
|
read or executed by anyone. This is a typical ownership and set of
|
|
permissions for a pre-installed system command. The ‘bin’
|
|
group exists on some Unixes to group together system commands (the name is
|
|
a historical relic, short for ‘binary’). Your Unix might use a
|
|
‘root’ group instead (not quite the same as the ‘root’
|
|
user!).</para>
|
|
|
|
<para>The ‘root’ user is the conventional name for numeric user
|
|
ID 0, a special, privileged account that can override all permission checks. Root
|
|
access is useful but dangerous; a typing mistake while you're logged in as
|
|
root can clobber critical system files that the same command executed from
|
|
an ordinary user account could not touch.</para>
|
|
|
|
<para>Because the root account is so powerful, access to it should be guarded
|
|
very carefully. Your root password is the single most critical piece of
|
|
security information on your system, and it is what any crackers and
|
|
intruders who ever come after you will be trying to get.</para>
|
|
|
|
<para>About passwords: Don't write them down — and don't pick a
|
|
password that can easily be guessed, like the first name of your
|
|
girlfriend/boyfriend/spouse. This is an astonishingly common bad practice
|
|
that helps crackers no end. In general, don't pick any word in the
|
|
dictionary; there are programs called <firstterm>dictionary
|
|
crackers</firstterm> that look for likely passwords by running through word
|
|
lists of common choices. A good technique is to pick a combination
|
|
consisting of a word, a digit, and another word, such as
|
|
‘shark6cider’ or ‘jump3joy’; that will make the search
|
|
space too large for a dictionary cracker. Don't use these examples, though
|
|
— crackers might expect that after reading this document and put them
|
|
in their dictionaries.</para>
|
|
|
|
<para>Now let's look at a third case:</para>
|
|
|
|
<screen>
|
|
snark:~$ ls -ld ~
|
|
drwxr-xr-x 89 esr users 9216 Jun 27 11:29 /home2/esr
|
|
snark:~$
|
|
</screen>
|
|
|
|
<para>This file is a directory (note the ‘d’ in the first
|
|
permissions slot). We see that it can be written only by esr, but read and
|
|
executed by anybody else.</para>
|
|
|
|
<para>Read permission gives you the ability to list the directory — that
|
|
is, to see the names of files and directories it contains. Write permission
|
|
gives you the ability to create and delete files in the directory. If you
|
|
remember that the directory includes a list of the names of the files and
|
|
subdirectories it contains, these rules will make sense.</para>
|
|
|
|
<para>Execute permission on a directory means you can get through the
|
|
directory to open the files and directories below it. In effect, it gives
|
|
you permission to access the i-nodes in the directory. A directory with
|
|
execute completely turned off would be useless.</para>
|
|
|
|
<para>Occasionally you'll see a directory that is world-executable but not
|
|
world-readable; this means a random user can get to files and directories
|
|
beneath it, but only by knowing their exact names (the directory cannot be
|
|
listed).</para>
|
|
|
|
<para>It's important to remember that read, write, or execute permission on a
|
|
directory is independent of the permissions on the files and directories
|
|
beneath. In particular, write access on a directory means you can
|
|
create new files or delete existing files there, but does not
|
|
automatically give you write access to existing files.</para>
|
|
|
|
<para>Finally, let's look at the permissions of the login program
|
|
itself.</para>
|
|
|
|
<screen>
|
|
snark:~$ ls -l /bin/login
|
|
-rwsr-xr-x 1 root bin 20164 Apr 17 12:57 /bin/login
|
|
</screen>
|
|
|
|
<para>This has the permissions we'd expect for a system command —
|
|
except for that ‘s’ where the owner-execute bit ought to be.
|
|
This is the visible manifestation of a special permission called the
|
|
‘set-user-id’ or <firstterm>setuid
|
|
bit</firstterm><indexterm><primary>setuid bit</primary></indexterm>.</para>
|
|
|
|
<para>The setuid bit is normally attached to programs that need to give
|
|
ordinary users the privileges of root, but in a controlled way. When it is
|
|
set on an executable program, you get the privileges of the owner of that
|
|
program file while the program is running on your behalf, whether or not
|
|
they match your own.</para>
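
<para>A short Python sketch may make the bit itself less mysterious. The mode
value here is an assumption chosen to match the listing for /bin/login, and
the helper name <literal>is_setuid</literal> is ours, invented for
illustration.</para>

<programlisting><![CDATA[
```python
import stat

# Assumed mode matching the /bin/login listing: setuid bit plus rwxr-xr-x.
login_mode = stat.S_IFREG | stat.S_ISUID | 0o755

def is_setuid(mode):
    """Return True if the set-user-id bit is set in the mode bits."""
    return bool(mode & stat.S_ISUID)

print(stat.filemode(login_mode))        # the 's' replaces the owner 'x'
print(is_setuid(login_mode))
print(is_setuid(stat.S_IFREG | 0o755))  # same file without the setuid bit
```
]]></programlisting>

<para>A script along these lines, fed with modes from real files, is one way
an administrator might hunt for setuid programs.</para>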
|
|
|
|
<para>Like the root account itself, setuid programs are useful but
|
|
dangerous. Anyone who can subvert or modify a setuid program owned by root
|
|
can use it to spawn a shell with root privileges. For this reason, opening
|
|
a file to write it automatically turns off its setuid bit on most Unixes.
|
|
Many attacks on Unix security try to exploit bugs in setuid programs in
|
|
order to subvert them. Security-conscious system administrators are
|
|
therefore extra-careful about these programs and reluctant to install new
|
|
ones.</para>
|
|
|
|
<para>There are a couple of important details we glossed over when
|
|
discussing permissions above; namely, how the owning group and permissions
|
|
are assigned when a file or directory is first created. The group is an
|
|
issue because users can be members of multiple groups, but one of them
|
|
(specified in the user's <filename>/etc/passwd</filename> entry) is the
|
|
user's <firstterm>default group</firstterm><indexterm><primary>default
|
|
group</primary></indexterm> and will normally own files created by the
|
|
user.</para>
|
|
|
|
<para>The story with initial permission bits is a little more complicated.
|
|
A program that creates a file will normally specify the permissions it is
|
|
to start with. But these will be modified by a variable in the user's
|
|
environment called the
|
|
<firstterm>umask</firstterm><indexterm><primary>umask</primary></indexterm>.
|
|
The umask specifies which permission bits to <emphasis>turn off</emphasis>
|
|
when creating a file; a common value is -------w- or 002, which turns off
the world-write bit, while many systems default to 022, which also turns off
group-write. See the
|
|
documentation of the umask command on your shell's manual page for
|
|
details.</para>
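
<para>The arithmetic is simple enough to sketch in a few lines of Python. The
function name <literal>apply_umask</literal> is made up for this example, and
we assume the creating program asks for rw-rw-rw- (octal 666), as programs
creating ordinary files commonly do. The second call uses 022, another umask
value you may run into.</para>

<programlisting><![CDATA[
```python
def apply_umask(requested, umask):
    """Clear every permission bit that is set in the umask."""
    return requested & ~umask

# The creating program is assumed to request rw-rw-rw- (0o666).
print(oct(apply_umask(0o666, 0o002)))  # world-write turned off
print(oct(apply_umask(0o666, 0o022)))  # group- and world-write turned off
```
]]></programlisting>

<para>Note that the umask can only take permissions away; it never adds bits
the program did not ask for.</para>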
|
|
|
|
<para>Initial directory group is also a bit complicated. On some Unixes a new
|
|
directory gets the default group of the creating user (this is the System V
|
|
convention); on others, it gets the owning group of the parent directory
|
|
in which it's created (this is the BSD convention). On some modern Unixes,
|
|
including Linux, the latter behavior can be selected by setting the
|
|
set-group-ID bit on the directory (chmod g+s).</para>
|
|
|
|
</sect2>
|
|
<sect2 id="fsck"><title>How things can go wrong</title>
|
|
|
|
<para>Earlier it was hinted that file systems can be fragile things.
|
|
Now we know that to get to a file you have to hopscotch through what may be
|
|
an arbitrarily long chain of directory and i-node references. Now suppose
|
|
your hard disk develops a bad spot?</para>
|
|
|
|
<para>If you're lucky, it will only trash some file data. If you're
|
|
unlucky, it could corrupt a directory structure or i-node number and leave
|
|
an entire subtree of your system hanging in limbo — or, worse, result
|
|
in a corrupted structure that points multiple ways at the same disk block
|
|
or i-node. Such corruption can be spread by normal file operations,
|
|
trashing data that was not in the original bad spot.</para>
|
|
|
|
<para>Fortunately, this kind of contingency has become quite uncommon as disk
|
|
hardware has become more reliable. Still, it means that your Unix will
|
|
want to integrity-check the file system periodically to make sure nothing
|
|
is amiss. Modern Unixes do a fast integrity check on each partition at
|
|
boot time, just before mounting it. Every few reboots they'll do a much
|
|
more thorough check that takes a few minutes longer.</para>
|
|
|
|
<para>If all of this sounds like Unix is terribly complex and
|
|
failure-prone, it may be reassuring to know that these boot-time checks
|
|
typically catch and correct normal problems <emphasis>before</emphasis>
|
|
they become really disastrous. Other operating systems don't have these
|
|
facilities, which speeds up booting a bit but can leave you much more
|
|
seriously screwed when attempting to recover by hand (and that's assuming
|
|
you have a copy of Norton Utilities or whatever in the first
|
|
place...).</para>
|
|
|
|
<para>One of the trends in current Unix designs is <firstterm>journalling
|
|
file systems</firstterm><indexterm><primary>journalling file
|
|
systems</primary></indexterm>. These arrange traffic to the disk so that
|
|
it's guaranteed to be in a consistent state that can be recovered when the
|
|
system comes back up. This will speed up the boot-time integrity check a
|
|
lot.</para>
|
|
|
|
</sect2>
|
|
</sect1>
|
|
<sect1 id="languages"><title>How do computer languages work?</title>
|
|
|
|
<para>We've already discussed <link linkend="running-programs">how programs
|
|
are run</link>. Every program ultimately has to execute as a stream of
|
|
bytes that are instructions in your computer's <firstterm>machine
|
|
language</firstterm><indexterm><primary>machine
|
|
language</primary></indexterm>. But human beings don't deal with machine
|
|
language very well; doing so has become a rare, black art even among
|
|
hackers.</para>
|
|
|
|
<para>Almost all Unix code except a small amount of direct
|
|
hardware-interface support in the kernel itself is nowadays written in a
|
|
<firstterm>high-level language</firstterm><indexterm><primary>high-level
|
|
language</primary></indexterm>. (The ‘high-level’ in this term
|
|
is a historical relic meant to distinguish these from
|
|
‘low-level’ <firstterm>assembler
|
|
languages</firstterm><indexterm><primary>assembler
|
|
languages</primary></indexterm>, which are basically thin wrappers around
|
|
machine code.)</para>
|
|
|
|
<para>There are several different kinds of high-level languages. In order
|
|
to talk about these, you'll find it useful to bear in mind that the
|
|
<firstterm>source code</firstterm><indexterm><primary>source
|
|
code</primary></indexterm> of a program (the human-created, editable
|
|
version) has to go through some kind of translation into machine code that
|
|
the machine can actually run.</para>
|
|
|
|
<sect2 id="compilers"><title>Compiled languages</title>
|
|
|
|
<para>The most conventional kind of language is a <firstterm>compiled
|
|
language</firstterm><indexterm><primary>compiled
|
|
language</primary></indexterm>. Compiled languages get translated into
|
|
runnable files of binary machine code by a special program called
|
|
(logically enough) a
|
|
<firstterm>compiler</firstterm><indexterm><primary>compiler</primary></indexterm>.
|
|
Once the binary has been generated, you can run it directly without looking
|
|
at the source code again. (Most software is delivered as compiled binaries
|
|
made from code you don't see.)</para>
|
|
|
|
<para>Compiled languages tend to give excellent performance and have the most
|
|
complete access to the OS, but also to be difficult to program in.</para>
|
|
|
|
<para>C, the language in which Unix itself is written, is by far the most
|
|
important of these (with its variant C++). FORTRAN is another compiled
|
|
language still used among engineers and scientists but years older and much
|
|
more primitive. In the Unix world no other compiled languages are in
|
|
mainstream use. Outside it, COBOL is very widely used for financial and
|
|
business software.</para>
|
|
|
|
<para>There used to be many other compiled languages, but most of them have
|
|
either gone extinct or are strictly research tools. If you are a new
|
|
Unix developer using a compiled language, it is overwhelmingly likely
|
|
to be C or C++.</para>
|
|
|
|
</sect2>
|
|
<sect2 id="interpreters"><title>Interpreted languages</title>
|
|
|
|
<para>An <firstterm>interpreted
|
|
language</firstterm><indexterm><primary>interpreted
|
|
language</primary></indexterm> depends on an interpreter program that reads
|
|
the source code and translates it on the fly into computations and system
|
|
calls. The source has to be re-interpreted (and the interpreter present)
|
|
each time the code is executed.</para>
|
|
|
|
<para>Interpreted languages tend to be slower than compiled languages, and
|
|
often have limited access to the underlying operating system and hardware.
|
|
On the other hand, they tend to be easier to program and more forgiving of
|
|
coding errors than compiled languages.</para>
|
|
|
|
<para>Many Unix utilities, including the shell and bc(1) and sed(1) and awk(1),
|
|
are effectively small interpreted languages. BASICs are usually
|
|
interpreted. So is Tcl. Historically, the most important interpretive
|
|
language has been LISP (a major improvement over most of its successors).
|
|
Today, Unix shells and the Lisp that lives inside the Emacs editor are
|
|
probably the most important pure interpreted languages.</para>
|
|
|
|
</sect2>
|
|
<sect2 id="pcode"><title>P-code languages</title>
|
|
|
|
<para>Since 1990 a kind of hybrid language that uses both compilation and
|
|
interpretation has become increasingly important. P-code languages are
|
|
like compiled languages in that the source is translated to a compact
|
|
binary form which is what you actually execute, but that form is not
|
|
machine code. Instead it's
|
|
<firstterm>pseudocode</firstterm><indexterm><primary>pseudocode</primary></indexterm>
|
|
(or
|
|
<firstterm>p-code</firstterm><indexterm><primary>p-code</primary></indexterm>),
|
|
which is usually a lot simpler but more powerful than a real machine
|
|
language. When you run the program, you interpret the p-code.</para>
|
|
|
|
<para>P-code can run nearly as fast as a compiled binary (p-code interpreters
|
|
can be made quite simple, small and speedy). But p-code languages can keep
|
|
the flexibility and power of a good interpreter.</para>
|
|
|
|
<para>Important p-code languages include Python, Perl, and Java.</para>
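
<para>Since Python is one of these languages, it can show its own two stages.
The following is only a sketch; the one-line program and the file name
example.py are made up for illustration.</para>

<programlisting><![CDATA[
```python
source = "total = 6 * 7"

# compile() translates source text into a code object holding bytecode.
code = compile(source, "example.py", "exec")

# co_code is the compact binary form; it is not native machine code.
print(type(code.co_code).__name__, len(code.co_code) > 0)

# exec() hands the bytecode to the interpreter loop, which runs it.
namespace = {}
exec(code, namespace)
print(namespace["total"])
```
]]></programlisting>

<para>The compile step happens once; the resulting p-code can then be
interpreted as many times as you like.</para>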
|
|
|
|
</sect2>
|
|
</sect1>
|
|
<sect1 id="internet"><title>How does the Internet work?</title>
|
|
|
|
<para>To help you understand how the Internet works, we'll look at the things
|
|
that happen when you do a typical Internet operation — pointing a browser
|
|
at the front page of this document at its home on the Web at the Linux
|
|
Documentation Project. This document is</para>
|
|
|
|
<screen>
|
|
&howto;Unix-and-Internet-Fundamentals-HOWTO/index.html
|
|
</screen>
|
|
|
|
<para>which means it lives in the file
|
|
HOWTO/Unix-and-Internet-Fundamentals-HOWTO/index.html under the World Wide Web
|
|
export directory of the host &howto-host;.</para>
|
|
|
|
<sect2 id="dns"><title>Names and locations</title>
|
|
|
|
<para>The first thing your browser has to do is to establish a network
|
|
connection to the machine where the document lives. To do that, it first
|
|
has to find the network location of the
|
|
<firstterm>host</firstterm><indexterm><primary>host</primary></indexterm>
|
|
&howto-host; (‘host’ is short for ‘host machine’ or ‘network host’;
|
|
&howto-host; is a typical
|
|
<firstterm>hostname</firstterm><indexterm><primary>hostname</primary></indexterm>).
|
|
The corresponding location is actually a number called an <firstterm>IP
|
|
address</firstterm><indexterm><primary>IP address</primary></indexterm>
|
|
(we'll explain the ‘IP’ part of this term later).</para>
|
|
|
|
<para>To do this, your browser queries a program called a
|
|
<firstterm>name server</firstterm><indexterm><primary>name server</primary></indexterm>. The name server
|
|
may live on your machine, but it's more likely to run on a service machine
|
|
that yours talks to. When you sign up with an ISP, part of your setup
|
|
procedure will almost certainly involve telling your Internet software the
|
|
IP address of a nameserver on the ISP's network.</para>
|
|
|
|
<para>The name servers on different machines talk to each other, exchanging
|
|
and keeping up to date all the information needed to resolve hostnames (map
|
|
them to IP addresses). Your nameserver may query three or four different
|
|
sites across the network in the process of resolving &howto-host;, but
|
|
this usually happens very quickly (as in less than a second). We'll look
|
|
at how nameservers work in detail in the next section.</para>
|
|
|
|
<para>The nameserver will tell your browser that &howto-host;'s IP
|
|
address is 152.19.254.81; knowing this, your machine will be able to
|
|
exchange bits with &howto-host; directly.</para>
|
|
|
|
</sect2>
|
|
<sect2 id="domains"><title>The Domain Name System</title>
|
|
|
|
<para>The whole network of programs and databases that cooperates to
|
|
translate hostnames to IP addresses is called ‘DNS’ (Domain
|
|
Name System). When you see references to a ‘DNS server’, that
|
|
means what we just called a nameserver. Now I'll explain how the overall
|
|
system works.</para>
|
|
|
|
<para>Internet hostnames are composed of parts separated by dots. A
|
|
<firstterm>domain</firstterm><indexterm><primary>domain</primary>
|
|
</indexterm> is a collection of machines that share a common name suffix.
|
|
Domains can live inside other domains. For example, the machine
|
|
&howto-host; lives in the &howto-domain; subdomain of the &howto-toplevel;
|
|
domain.</para>
|
|
|
|
<para>Each domain is defined by an <firstterm>authoritative name
|
|
server</firstterm><indexterm><primary>authoritative name server</primary>
|
|
</indexterm> that knows the IP addresses of the other machines in the
|
|
domain. The authoritative (or ‘primary’) name server may have backups in
|
|
case it goes down; if you see references to a <firstterm>secondary name
|
|
server</firstterm><indexterm><primary>secondary name server</primary>
|
|
</indexterm> (or ‘secondary DNS’), it's talking about one of those. These
|
|
secondaries typically refresh their information from their primaries every
|
|
few hours, so a change made to the hostname-to-IP mapping on the primary
|
|
will automatically be propagated.</para>
|
|
|
|
<para>Now here's the important part. The nameservers for a domain do
|
|
<emphasis>not</emphasis> have to know the locations of all the machines in
|
|
other domains (including their own subdomains); they only have to know the
|
|
location of the nameservers. In our example, the authoritative name server
|
|
for the &howto-toplevel; domain knows the IP address of the nameserver for
|
|
&howto-domain; but <emphasis>not</emphasis> the address of all the other
|
|
machines in &howto-domain;. </para>
|
|
|
|
<para>The domains in the DNS system are arranged like a big inverted tree.
|
|
At the top are the root servers. Everybody knows the IP addresses of the
|
|
root servers; they're wired into your DNS software.
|
|
The root servers know the IP addresses of the nameservers for the
|
|
top-level domains like .com and .org, but not the addresses of machines
|
|
inside those domains. Each top-level domain server knows where the
|
|
nameservers for the domains directly beneath it are, and so forth.</para>
|
|
|
|
<para>DNS is carefully designed so that each machine can get away with the
|
|
minimum amount of knowledge it needs to have about the shape of the tree,
|
|
and local changes to subtrees can be made simply by changing one
|
|
authoritative server's database of name-to-IP-address mappings.</para>
|
|
|
|
<para>When you query for the IP address of &howto-host;, what actually
|
|
happens is this: First, your nameserver asks a root server to tell it where
|
|
it can find a nameserver for &howto-toplevel;. Once it knows that, it then
|
|
asks the &howto-toplevel; server to tell it the IP address of a
|
|
&howto-domain; nameserver. Once it has that, it asks the &howto-domain;
|
|
nameserver to tell it the address of the host &howto-host;.</para>
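
<para>The chain of referrals just described can be modeled with a toy program.
This Python sketch is not how real DNS software is written: the tables stand
in for separate servers, the nameserver labels are invented, and only the
final address is taken from the text above.</para>

<programlisting><![CDATA[
```python
# Toy delegation data. The nameserver labels are made up for illustration;
# only the final address comes from the surrounding text.
ROOT = {"org": "org-nameserver"}
SERVERS = {
    "org-nameserver": {"tldp.org": "tldp-nameserver"},
    "tldp-nameserver": {"www.tldp.org": "152.19.254.81"},
}

def resolve(hostname):
    """Follow referrals from the root down, returning (address, steps)."""
    labels = hostname.split(".")
    steps = []
    # Ask the root which server handles the top-level domain.
    server = ROOT[labels[-1]]
    steps.append(server)
    # Ask each server about a progressively longer suffix of the name.
    for i in range(len(labels) - 2, -1, -1):
        suffix = ".".join(labels[i:])
        answer = SERVERS[server][suffix]
        if i == 0:
            return answer, steps
        server = answer
        steps.append(server)

address, steps = resolve("www.tldp.org")
print(address, steps)
```
]]></programlisting>

<para>Notice that no single table holds the whole answer; each one knows only
who to ask next.</para>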
|
|
|
|
<para>Most of the time, your nameserver doesn't actually have to work that
|
|
hard. Nameservers do a lot of caching; when yours resolves a hostname, it
|
|
keeps the association with the resulting IP address around in memory for a
|
|
while. This is why, when you surf to a new website, you'll usually only
|
|
see a message from your browser about "Looking up" the host for the first
|
|
page you fetch. Eventually the name-to-address mapping expires and your
|
|
DNS has to re-query — this is important so you don't have invalid
|
|
information hanging around forever when a hostname changes addresses. Your
|
|
cached IP address for a site is also thrown out if the host is
|
|
unreachable. </para>
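
<para>Here is a rough Python sketch of that kind of expiring cache. The class
name <literal>NameCache</literal>, the 300-second lifetime, and the explicit
<literal>now</literal> argument (used instead of a real clock so the example
is repeatable) are all assumptions made for illustration.</para>

<programlisting><![CDATA[
```python
class NameCache:
    """Remember name-to-address answers until they expire."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # hostname -> (address, expiry time)

    def put(self, hostname, address, now):
        self.entries[hostname] = (address, now + self.ttl)

    def get(self, hostname, now):
        """Return the cached address, or None if absent or expired."""
        entry = self.entries.get(hostname)
        if entry is None:
            return None
        address, expires = entry
        if now >= expires:
            del self.entries[hostname]  # stale: force a fresh lookup
            return None
        return address

cache = NameCache(ttl_seconds=300)
cache.put("www.tldp.org", "152.19.254.81", now=0)
fresh = cache.get("www.tldp.org", now=10)    # still within its lifetime
stale = cache.get("www.tldp.org", now=301)   # expired, so None
print(fresh, stale)
```
]]></programlisting>

<para>A miss (the None result) is the cue for the nameserver to go back out
and query again.</para>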
|
|
|
|
</sect2>
|
|
<sect2 id="transport"><title>Packets and routers</title>
|
|
|
|
<para>What the browser wants to do is send a command to the Web server on
|
|
&howto-host; that looks like this:</para>
|
|
|
|
<screen>
|
|
GET /HOWTO/Unix-and-Internet-Fundamentals-HOWTO/index.html HTTP/1.0
|
|
</screen>
|
|
|
|
<para>Here's how that happens. The command is made into a
|
|
<firstterm>packet</firstterm><indexterm><primary>packet</primary></indexterm>,
|
|
a block of bits like a telegram that is wrapped with three important
|
|
things: the <firstterm>source address</firstterm><indexterm><primary>source
|
|
address</primary></indexterm> (the IP address of your machine), the
|
|
<firstterm>destination address</firstterm><indexterm><primary>destination
|
|
address</primary></indexterm> (152.19.254.81), and a <firstterm>service
|
|
number</firstterm><indexterm><primary>service number</primary></indexterm>
|
|
or <firstterm>port number</firstterm><indexterm><primary>port
|
|
number</primary></indexterm> (80, in this case) that indicates that it's a
|
|
World Wide Web request.</para>
|
|
|
|
<para>Your machine then ships the packet down the wire (your connection to
|
|
your ISP, or local network) until it gets to a specialized machine called a
|
|
<firstterm>router</firstterm><indexterm><primary>router</primary></indexterm>.
|
|
The router has a map of the Internet in its memory — not always a complete
|
|
one, but one that completely describes your network neighborhood and knows
|
|
how to get to the routers for other neighborhoods on the Internet.</para>
|
|
|
|
<para>Your packet may pass through several routers on the way to its
|
|
destination. Routers are smart. They watch how long it takes for other
|
|
routers to acknowledge having received a packet. They use that
information to direct traffic over fast links. They also use it to notice when
another router (or a cable) has dropped off the network, and compensate
|
|
if possible by finding another route.</para>
|
|
|
|
<para>There's an urban legend that the Internet was designed to survive
|
|
nuclear war. This is not true, but the Internet's design is extremely good
|
|
at getting reliable performance out of flaky hardware in an uncertain
|
|
world. This is directly due to the fact that its intelligence is
|
|
distributed through thousands of routers rather than concentrated in a few
|
|
massive and vulnerable switches (like the phone network). This means that
|
|
failures tend to be well localized and the network can route around
|
|
them.</para>
|
|
|
|
<para>Once your packet gets to its destination machine, that machine uses the
|
|
service number to feed the packet to the web server. The web server can
|
|
tell where to reply to by looking at the command packet's source IP
|
|
address. When the web server returns this document, it will be broken up
|
|
into a number of packets. The size of the packets will vary according to
|
|
the transmission media in the network and the type of service.</para>
|
|
|
|
</sect2>
|
|
<sect2 id="TCP-IP"><title>TCP and IP</title>
|
|
|
|
<para>To understand how multiple-packet transmissions are handled, you need to
|
|
know that the Internet actually uses two protocols, stacked one on top
|
|
of the other.</para>
|
|
|
|
<para>The lower level,
|
|
<firstterm>IP</firstterm><indexterm><primary>IP</primary></indexterm>
|
|
(Internet Protocol), is responsible for labeling
|
|
individual packets with the source address and destination address of two
|
|
computers exchanging information over a network.
|
|
For example, when you access http://&howto-host;, the packets you send
|
|
will have your computer's IP address, such as 192.168.1.101, and the IP
|
|
address of the &howto-host; computer, 152.19.254.81. These addresses
|
|
work in much the same way that your home address works when someone sends
|
|
you a letter. The post office can read the address and determine where
|
|
you are and how best to route the letter to you, much like a router does
|
|
for Internet traffic.</para>
|
|
|
|
<para>The upper level,
|
|
<firstterm>TCP</firstterm><indexterm><primary>TCP</primary></indexterm>
|
|
(Transmission Control Protocol), gives you reliability. When two machines
|
|
negotiate a TCP connection (which they do using IP), the receiver knows to
|
|
send acknowledgements of the packets it sees back to the sender. If the
|
|
sender doesn't see an acknowledgement for a packet within some timeout
|
|
period, it resends that packet. Furthermore, the sender gives each TCP
|
|
packet a sequence number, which the receiver can use to reassemble packets
|
|
in case they show up out of order. (This can easily happen if network
|
|
links go up or down during a connection.)</para>
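
<para>Reassembly by sequence number can be sketched in a few lines of Python.
This is a simplification: real TCP numbers the bytes in the stream rather
than whole packets, and the function name and sample data are invented for
the example.</para>

<programlisting><![CDATA[
```python
def reassemble(packets):
    """Order (sequence_number, payload) pairs and join the payloads."""
    seen = {}
    for seq, payload in packets:
        # A retransmitted duplicate carries the same number; keep one copy.
        seen.setdefault(seq, payload)
    return b"".join(seen[seq] for seq in sorted(seen))

# Packets arriving out of order, with packet 2 delivered twice.
arrived = [(2, b"lo, "), (1, b"Hel"), (3, b"world"), (2, b"lo, ")]
message = reassemble(arrived)
print(message)
```
]]></programlisting>

<para>The receiver ends up with the bytes in the order they were sent, no
matter what order the network delivered them in.</para>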
|
|
|
|
<para>TCP/IP packets also contain a checksum to enable detection of data
|
|
corrupted by bad links. (The checksum is computed from the rest of the
|
|
packet in such a way that if either the rest of the packet or the
|
|
checksum is corrupted, redoing the computation and comparing is very likely
|
|
to indicate an error.) So, from the point of view of anyone using TCP/IP
|
|
and nameservers, it looks like a reliable way to pass streams of bytes
|
|
between hostname/service-number pairs. People who write network protocols
|
|
almost never have to think about all the packetizing, packet reassembly,
|
|
error checking, checksumming, and retransmission that goes on below that
|
|
level.</para>
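
<para>To make the checksum idea concrete, here is a Python sketch of the
16-bit ones'-complement sum described in RFC 1071, which is the style of
checksum used in IP and TCP headers. The function name and the sample
payloads are ours; a real TCP implementation also covers some extra header
fields that this sketch leaves out.</para>

<programlisting><![CDATA[
```python
def internet_checksum(data):
    """16-bit ones'-complement sum of the data, complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries out of the low 16 bits back in.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

good = internet_checksum(b"GET / HTTP/1.0")
# One corrupted byte in transit yields a different checksum.
damaged = internet_checksum(b"GET / HTTP/1.1")
print(good != damaged)
```
]]></programlisting>

<para>The receiver redoes the same computation and compares; a mismatch means
the packet should be thrown away and resent.</para>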
|
|
|
|
</sect2>
|
|
<sect2 id="HTTP"><title>HTTP, an application protocol</title>
|
|
|
|
<para>Now let's get back to our example. Web browsers and servers speak an
|
|
<firstterm>application protocol</firstterm><indexterm><primary>application
|
|
protocol</primary></indexterm> that runs on top of TCP/IP, using it simply
|
|
as a way to pass strings of bytes back and forth. This protocol is called
|
|
<firstterm>HTTP</firstterm><indexterm><primary>HTTP</primary></indexterm>
|
|
(Hyper-Text Transfer Protocol) and we've already seen one command in it —
|
|
the GET shown above.</para>
|
|
|
|
<para>When the GET command goes to &howto-host;'s webserver with service
|
|
number 80, it will be dispatched to a <firstterm>server
|
|
daemon</firstterm><indexterm><primary>server
|
|
daemon</primary></indexterm> listening on port 80. Most Internet services
|
|
are implemented by server daemons that do nothing but wait on ports,
|
|
watching for and executing incoming commands.</para>
|
|
|
|
<para>If the design of the Internet has one overall rule, it's that all the
|
|
parts should be as simple and human-accessible as possible. HTTP, and its
|
|
relatives (like the Simple Mail Transfer Protocol,
|
|
<firstterm>SMTP</firstterm><indexterm><primary>SMTP</primary></indexterm>,
|
|
that is used to move electronic mail between hosts) tend to use simple
|
|
printable-text commands that end with a carriage-return/line feed.</para>
|
|
|
|
<para>This is marginally inefficient; in some circumstances you could get more
|
|
speed by using a tightly-coded binary protocol. But experience has shown
|
|
that the benefits of having commands be easy for human beings to describe
|
|
and understand outweigh any marginal gain in efficiency that you might get
|
|
at the cost of making things tricky and opaque.</para>
|
|
|
|
<para>Therefore, what the server daemon ships back to you via TCP/IP is also
|
|
text. The beginning of the response will look something like this (a few
|
|
headers have been suppressed):</para>
|
|
|
|
<screen>
|
|
HTTP/1.1 200 OK
|
|
Date: Sat, 10 Oct 1998 18:43:35 GMT
|
|
Server: Apache/1.2.6 Red Hat
|
|
Last-Modified: Thu, 27 Aug 1998 17:55:15 GMT
|
|
Content-Length: 2982
|
|
Content-Type: text/html
|
|
</screen>
|
|
|
|
<para>These headers will be followed by a blank line and the text of the
|
|
web page (after which the connection is dropped). Your browser just
|
|
displays that page. The headers tell it how (in particular, the
|
|
Content-Type header tells it the returned data is really HTML).</para>
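
<para>Because the response is plain text, picking it apart takes very little
code. This Python sketch builds a shortened copy of the response above (with
a placeholder where the page text would go) and splits it the way a very
simple client might; real browsers are a good deal more careful.</para>

<programlisting><![CDATA[
```python
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Length: 2982\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "(page text would follow here)"
)

# The blank line separates the status line and headers from the body.
head, body = response.split("\r\n\r\n", 1)
status_line, *header_lines = head.split("\r\n")

# The status line carries the protocol version, a numeric code, and a reason.
version, code, reason = status_line.split(" ", 2)
headers = dict(line.split(": ", 1) for line in header_lines)

print(code, reason)
print(headers["Content-Type"])
```
]]></programlisting>

<para>Everything a program needs to decide how to display the page is there
in the headers, readable by humans and machines alike.</para>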
|
|
|
|
</sect2>
|
|
</sect1>
|
|
<sect1 id="more"><title>To Learn More</title>
|
|
<para>There is a <ulink
|
|
url="&howto;Reading-List-HOWTO/">Reading List
|
|
HOWTO</ulink> that lists books you can read to learn more about the
|
|
topics we have touched on here. You might also want to read the
|
|
<ulink url="&home;faqs/hacker-howto.html">How To Become A
|
|
Hacker</ulink> document.</para>
|
|
</sect1>
|
|
|
|
</article>
|
|
|
|
<!--
|
|
The following sets edit modes for GNU EMACS
|
|
Local Variables:
|
|
fill-column:75
|
|
compile-command: "mail -s \"Unix and Internet Fundamentals HOWTO update\" submit@en.tldp.org <Unix-and-Internet-Fundamentals-HOWTO.xml"
|
|
End:
|
|
-->
|