
<!--startcut ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>Into the Mist: How Linux Console Fonts Work LG #91</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->
<!-- *** BEGIN navbar *** -->
<A HREF="lodato.html">&lt;&lt;&nbsp;Prev</A>&nbsp;&nbsp;|&nbsp;&nbsp;<A HREF="index.html">TOC</A>&nbsp;&nbsp;|&nbsp;&nbsp;<A HREF="../index.html">Front Page</A>&nbsp;&nbsp;|&nbsp;&nbsp;<A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue91/loozzr.html">Talkback</A>&nbsp;&nbsp;|&nbsp;&nbsp;<A HREF="../faq/index.html">FAQ</A>&nbsp;&nbsp;|&nbsp;&nbsp;<A HREF="mathew.html">Next&nbsp;&gt;&gt;</A>
<!-- *** END navbar *** -->
<!--endcut ============================================================-->
<TABLE BORDER><TR><TD WIDTH="200">
<A HREF="http://www.linuxgazette.com/">
<IMG ALT="LINUX GAZETTE" SRC="../gx/2002/lglogo_200x41.png"
WIDTH="200" HEIGHT="41" border="0"></A>
<BR CLEAR="all">
<SMALL>...<I>making Linux just a little more fun!</I></SMALL>
</TD><TD WIDTH="380">
<CENTER>
<BIG><BIG><STRONG><FONT COLOR="maroon">Into the Mist: How Linux Console Fonts Work</FONT></STRONG></BIG></BIG>
<BR>
<STRONG>By <A HREF="../authors/loozzr.html">En D Loozzr</A></STRONG>
</CENTER>
</TD></TR>
</TABLE>
<P>
<!-- END header -->
<p><b>THE CONSOLE DRIVER</b>
<p>As of Linux 2.4.x, the kernel includes a console driver subdivided
into a keyboard driver and a screen driver. The console driver is being entirely
rewritten for Linux 2.6.0, but at this stage it basically works like this: the keyboard driver
sends characters to an application, the application does its job and requests
some output on the display from the screen driver. The console driver is
complemented by the kbd package, whose data files are likely to reside either in /usr/share/kbd/
or in /usr/lib/kbd/.
<p>In the path from the keyboard driver to the application and further
to the screen driver, the characters are nothing but codes (hex numbers).
And since in the end we want to see their little pictures (glyphs) on the
screen there must be a way to associate the glyphs with those codes.
<p>This article will focus on the screen driver only, taking for granted
that something happens between keyboard and application. Some basic notions
of fonts are required. Also keep the man page for the utility 'setfont'
handy. The article is based on material from:
<blockquote><a href="ftp://win.tue.nl/pub/linux-local/utils/kbd/">ftp://win.tue.nl/pub/linux-local/utils/kbd/</a>
<br><a href="ftp://ftp.debian.org/debian/pool/main/c/console-tools/">ftp://ftp.debian.org/debian/pool/main/c/console-tools/</a>
<br><a href="http://qrczak.home.ml.org/programy/linux/fonty/">http://qrczak.home.ml.org/programy/linux/fonty/</a></blockquote>
<p><br><b>UNICODE</b>
<p>Traditionally, character encodings use 8 bits and are thus limited to
2^8=256 characters, which is not enough. Of course, once upon a time printers
and monitors knew nothing about diacriticals (accents, umlauts etc.) and
further back in time they only had capitals and despised lower case. Those
times are over and in the wake of i18n (internationalisation) 256 characters
qualify as appetizers.
<p>The UCS (Universal Character Set), also known as Unicode, was created
to handle and mix all the world's scripts, including the ideographs from
China, Korea and Japan. It has more than 65000 characters for a start, and
its code space leaves room for up to 2^31.
<p>UCS is a 32-bit/4-byte encoding. It is standardised by ISO as the 10646-1
standard. The most widely used characters from UCS are contained in its
UCS-2 16-bit subset. This is the subset used now for the Linux console.
The character set Linux uses by default for N and S America, W Europe and
Africa is called latin1 or ISO 8859-1.
<p>For convenience, an encoding called UTF-8 was designed for ASCII backward
compatibility. All characters that have a UCS encoding can be expressed
as a UTF-8 sequence, and vice-versa. Nonetheless, UTF-8 and Unicode are
distinct encodings.
<p>In UTF-8 mode, the console driver treats the ASCII range exactly as
before, so old text viewers can continue to display ASCII. Characters above
the ASCII range are converted to a variable-length sequence of bytes (up
to 6 bytes per character). UTF stands for Unicode Transformation Format,
and the 8 refers to the 8-bit units (bytes) the sequences are built from,
the same unit the traditional character sets used.
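<p>To make the variable-length idea concrete, here is a minimal Python sketch
that prints the UTF-8 byte sequence for a few code points. The helper name
<code>utf8_bytes</code> is made up for this illustration. Note that Python
follows the present-day definition of UTF-8, which stops at 4 bytes per
character, so the 5 and 6 byte forms cannot be shown this way.

```python
# Sketch: how UTF-8 turns a code point into one or more bytes.
# utf8_bytes is a hypothetical helper, not part of any console tool.

def utf8_bytes(code_point):
    """Return the UTF-8 byte sequence for one code point as a list of ints."""
    return list(chr(code_point).encode("utf-8"))

# ASCII stays a single unchanged byte; everything else grows.
for cp in (0x41, 0xAA, 0x2502, 0x20AC):
    seq = " ".join(f"{b:02X}" for b in utf8_bytes(cp))
    print(f"U+{cp:04X} -> {seq}")
```

<p>The letter 'A' (U+0041) stays one byte, while the box drawing character
U+2502 needs three.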
<p>Unicode is complex. Just keep in mind that it allows an ID to be assigned
to any character. That ID has four bytes in its full form and two bytes
in the UCS-2 subset, where it looks like e.g. 0x2502, also written
as U+2502. If you know that ID, you can pick up the glyph (picture) for
that character from a suitable font. Indeed, even the names of the glyphs
are standardized and all capitals, e.g.:
<p><b>&ordf;</b> FEMININE ORDINAL INDICATOR
<p>All clear?
<blockquote>Problem 1: find out the official name for a given unicode
<p>Problem 2: get the glyph for a given unicode</blockquote>
Problem 1 is not critical as far as the Linux console driver is concerned.
The most common official names can be found in some *.trans files in kbd
directory ../consoletrans or some *.uni files in the kbd directory ../unimaps.
For more, refer to:
<blockquote><a href="http://partners.adobe.com/asn/developer/typeforum/unicodegn.html">http://partners.adobe.com/asn/developer/typeforum/unicodegn.html</a></blockquote>
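<p>If a Python interpreter is at hand, Problem 1 can also be approached with
the standard <code>unicodedata</code> module, which ships the official
character names. This is only a sketch; the helper name
<code>official_name</code> is invented for the example and has nothing to do
with the kbd package.

```python
import unicodedata

def official_name(code_point):
    """Return the standard name of a character, or None if it has none."""
    return unicodedata.name(chr(code_point), None)

print(official_name(0x00AA))
print(official_name(0x2502))
print(official_name(0x0080))  # control characters carry no name
```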
The real hassle is problem 2.
<br>
<p><b>GLYPHS</b>
<p>Although we have already been speaking of glyphs and it is kind of intuitively
clear what they are, here are some additional remarks.
<p>Launch your winword or equivalent word processor and type the letter
'a' several times changing font and size every time. All those a's look
similar while they do differ in shape and size. What they have in common
is that they all represent one glyph, the glyph for 'a'.
<p>The reference to a glyph is just an abstraction from the particular
font you will necessarily be using in order to see something.
<p>A font is a collection of glyphs in a particular shape. While in graphic
mode the typeface (shape) is emphasized, in the console we mostly bother
about which glyphs are included or not included - and possibly about font
size. A soft font for the console comes in a binary file with bit patterns
for each glyph. And there is a hardware font in the ROM of the VGA adapter.
This is the font you will see, if no soft fonts are loaded at boot time.
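<p>As an illustration of what "a binary file with bit patterns" means, the
following Python sketch reads the 4-byte header of a font in the older PSF1
layout and renders one glyph as text. The constants follow the PSF1 header as
defined in the kbd sources; the function names are invented here, the newer
PSF2 layout is not handled, and the test font is synthetic rather than a real
file.

```python
PSF1_MAGIC = bytes([0x36, 0x04])
PSF1_MODE512 = 0x01     # font holds 512 glyphs instead of 256
PSF1_MODEHASTAB = 0x02  # a unicode table (unimap) follows the glyphs

def read_psf1_header(data):
    """Decode magic, mode and glyph height from the first 4 bytes."""
    if len(data) < 4 or data[:2] != PSF1_MAGIC:
        raise ValueError("not a PSF1 font")
    mode, charsize = data[2], data[3]
    return {
        "glyphs": 512 if mode & PSF1_MODE512 else 256,
        "height": charsize,  # PSF1 glyphs are 8 pixels wide: 1 byte per row
        "has_unimap": bool(mode & PSF1_MODEHASTAB),
    }

def glyph_rows(data, index):
    """Return the rows of one glyph as strings of '#' and '.'."""
    height = read_psf1_header(data)["height"]
    start = 4 + index * height
    rows = data[start:start + height]
    return [format(b, "08b").replace("1", "#").replace("0", ".") for b in rows]

# Synthetic 256-glyph, 16-row font: all blank except a bar in glyph 0xC4.
font = bytearray(PSF1_MAGIC + bytes([PSF1_MODEHASTAB, 16]) + bytes(256 * 16))
font[4 + 0xC4 * 16 + 7] = 0xFF
print(read_psf1_header(bytes(font)))
print("\n".join(glyph_rows(bytes(font), 0xC4)))
```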
<br>
<p><b>UNIMAP</b>
<p>The Screen Font Map gives, for each position in the console font, the
list of characters it will render. Under Linux 2.4.x, the screen driver
is based on the UCS-2 encoding.
<p>The Screen Font Map is also called Unicode Map or Unimap or Console
Map or Screen Map or psf table or whatever. The terminology varies a lot
and does not contribute to easy understanding. Especially not as these
terms had a different meaning before Unicode came up. And especially not
when files that serve the same purpose and have the same format are named
with different extensions. Since it seems to be spreading and it sounds
quite distinct, let us opt for unimap and its files *.uni. If you come
across console utilities other than those from the kbd package, be wary
of the terminology jungle.
<p>There is always a unimap. It is included in the font or it is loaded
from a distinct file or - as a last resort - it is the default straight-to-font
or direct-to-font or trivial mapping or direct mapping or null mapping
or idem mapping or identity mapping. Here again terminology has not settled
and is hindering user empowerment. Idem mapping means that a request for
character e.g. 0xB3 is received and the glyph at position 0xB3 in the font
is directly picked up. To make the mess messier, the straight-to-font map
is sometimes not considered to be a unimap. We prefer to say that there
is always a unimap even if setfont from the kbd package says otherwise.
They use the option
<blockquote>setfont -u none</blockquote>
to enforce straight-to-font. mapscrn, now incorporated into setfont, used
to call straight-to-font a special unimap. This is the more sensible choice,
we'll stick to it.
<p>One glyph can do for several different unicodes. How come? Well sometimes
identical glyphs get multiple names. For instance, the capital letter 'A'
is available in Russian and English with different names. But a font that
covers both English and Russian does not need the glyph for 'A' twice.
So two different unicodes give in this case the same visual result.
<p>It can also happen that two glyphs are different but close to each other
visually and only one of them is included in the font to save space and
serves as surrogate for the other. This is analogous to old habits from the
era of the typewriter. For instance, opening and closing quotation marks
were the same although in typography they are distinct.
<p>Surrogates are formalised with the fallback entries. A fallback entry
is a series of two or more UCS-2 codes, separated by whitespace. The first
one is the unicode we want a glyph for. The following ones are those whose
glyph we want to use when no glyph designed specially for the first code
is available. The order of the codes defines a priority order (own glyph
if available, then the second char's, then the third's, etc.)
<p>Fallback entries are enabled if included in the unimap with a line like:
<blockquote>0x04a U+20AC U+00A4</blockquote>
(That means: for character numbered 0x04a we want the Euro symbol. If not
available, take the currency symbol.)
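<p>The idea that one font position can serve several unicodes can be sketched
in a few lines of Python. The parser below handles only simple lines of the
form "position followed by one or more U+xxxx values"; real *.uni files have
more syntax (ranges, for instance) that it ignores. The function name and the
sample entries are made up for the illustration.

```python
def parse_unimap(text):
    """Map each unicode value to the font position that renders it."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        position = int(fields[0], 16)
        for field in fields[1:]:
            if field.upper().startswith("U+"):
                table[int(field[2:], 16)] = position
    return table

# Hypothetical entries, not copied from any shipped file.
SAMPLE = """\
# position  unicodes rendered by the glyph at that position
0x41  U+0041 U+0410   # Latin A and Cyrillic A share one glyph
0xB3  U+2502          # box drawing vertical line
"""

unimap = parse_unimap(SAMPLE)
for code, pos in sorted(unimap.items()):
    print(f"U+{code:04X} -> glyph at 0x{pos:02X}")
```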
<br>
<p><b>SCREEN MODES</b>
<p>There are two screen modes, single byte mode (until recently the widely
used default) and UTF-8 mode. Switching the screen to and from UTF-8 mode
is done with the escape sequences '\e%G' and '\e%@' at the prompt. By issuing:
<blockquote>unicode_start
<br>unicode_stop</blockquote>
you switch both keyboard and console to and from UTF-8.
<p>In UTF-8 mode, the bytes received from the application and to be written
to the screen are interpreted as a UTF-8 sequence, turned into unicodes
and looked up in the unimap to determine the glyph to use.
<p>Single byte mode applies an additional intermediate map to the bytes
sent by the application before using the unimap.
<p>This intermediate map used to be called the Application Charset Map
or Application Console Map (ACM or acm). Unfortunately, this is the terminology
of the console-tools package that seems to have quietly passed away.
<p>The kbd package does not give any special name to the map, it refers
to it as a translation table and puts it in files with extension .trans.
The man page for setfont calls it Unicode console map which is extremely
odd since it evokes the Unicode map (unimap). As a way out of the impasse,
let us call it cmap, an abbreviation that already occurs here and there.
<p>Here is a simple diagram for the two modes:
<br><pre>
single byte mode:
application -> cmap -> unimap -> screen
(bytes) (UCS-2)
UTF-8 mode:
application -> unimap -> screen
(UTF-8 / UCS-2)
<br></pre>
<p>Memorize this diagram because it is the machete to cut through the documentation
jungle. Make sure you can tell cmap from unimap: what does the cmap do?
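<p>The two paths of the diagram can be mimicked in Python. This is a toy model
under stated assumptions, not the kernel's code: the unimap is built from
Python's own cp437 codec as a stand-in for a font laid out like the ROM font,
the cmap is the trivial latin1 one, and all names are invented for the example.
A missing glyph is simply reported as <code>None</code>.

```python
def cp437_unimap():
    """Unimap for a font laid out like cp437: unicode -> font position."""
    return {ord(bytes([pos]).decode("cp437")): pos for pos in range(256)}

# In latin1 every byte value equals its unicode value.
LATIN1_CMAP = {b: b for b in range(256)}

def single_byte_mode(data, cmap, unimap):
    """bytes -> cmap -> unicode -> unimap -> font positions"""
    return [unimap.get(cmap[b]) for b in data]

def utf8_mode(data, unimap):
    """UTF-8 bytes -> unicode -> unimap -> font positions"""
    return [unimap.get(ord(ch)) for ch in data.decode("utf-8")]

unimap = cp437_unimap()
# The inverted exclamation mark, sent as a latin1 byte and as UTF-8.
print(single_byte_mode(b"\xa1", LATIN1_CMAP, unimap))
print(utf8_mode("\u00a1".encode("utf-8"), unimap))
```

<p>Both calls end up at the same font position, which is the point of the
diagram: the cmap only exists to turn single bytes into unicodes before the
unimap takes over.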
<br>
<p><b>WHAT DOES THE CMAP DO?</b>
<p>There are several formats for the cmap, but only one lets you understand
what the map really does. As an example, have a look at the file cp437_to_iso01.trans
in directory ../consoletrans of the kbd package. Code page 437 stems from
the early DOS and is still the font in the ROM of any VGA adapter.
<p>This file has two columns of hex numbers. The first column is an enumeration
of the slots in the font, 256 positions maximum. Only 256 can be handled
by the cmap.
<p>The second column is the translation. The file under consideration makes
it possible to use a cp437 font as if it were a latin1 font. The translation
is not perfect but it works. Example:
<blockquote>0xA1 0xAD</blockquote>
The character 0xA1 in cp437 is an accented vowel which is not correct for
this code in latin1. So cmap is informing the console driver to react as
if the character request were for 0xAD. The console driver goes into the
unimap (straight-to-font) and reads the unicode at position 0xAD. This
happens to be U+00a1, the inverted exclamation mark. Next stop is the font
where the glyph for U+00a1 has to be picked up. In the end, we had a request
for 0xA1 but we did not get the character at that position in cp437, we
got the inverted exclamation mark for the position 0xA1 in latin1. Our
cp437 is behaving like a latin1 font thanks to the cmap.
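<p>A two-column table of this kind can be approximated with Python's built-in
codecs. The sketch below does not read the kbd file; it rebuilds a comparable
table from the <code>latin-1</code> and <code>cp437</code> codecs, and the
function name is invented. Where cp437 has no matching glyph it records
<code>None</code>, whereas a real translation file would typically put a
surrogate there.

```python
def latin1_to_cp437_table(miss=None):
    """For each latin1 code, find the cp437 position showing that character."""
    table = {}
    for code in range(256):
        char = bytes([code]).decode("latin-1")
        try:
            table[code] = char.encode("cp437")[0]
        except UnicodeEncodeError:
            table[code] = miss  # cp437 has no glyph for this character
    return table

table = latin1_to_cp437_table()

# Same shape as a line of the translation file: request, then position used.
print(f"0x{0xA1:02X} 0x{table[0xA1]:02X}")

misses = [c for c in range(0xA0, 0x100) if table[c] is None]
print(len(misses), "latin1 characters in 0xA0-0xFF have no glyph in cp437")
```

<p>The capital A with circumflex (0xC2 in latin1) is one of the misses, which
is why a surrogate such as a plain 'A' shows up in its place.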
<p>This example works flawlessly but since cp437 and latin1 differ a lot,
in other cases you will get a miss, represented by a generic replacement
character. Or you will get an approximation, a surrogate. For instance,
you get a capital 'A' where you would need the same letter with a circumflex
on top of it.
<p>When using 256 char fonts, a cmap that really translates means surrogates.
When no surrogates are needed, the cmap is straight-to-font: every character
is translated into itself, only the unimap is relevant. This is the most
natural and common case.
<p>However, a font may be designed to cover more than one character set.
This is evident for 512 char fonts but there are indeed 256 char fonts
that can handle more than one character set (albeit only partially). If
you are using such a font, the cmap allows you to select one of the character
sets covered. One example (lat1-16.psfu) is discussed below.
<br>
<p><b>G0/G1 LEGENDS</b>
<p>Although there is only one cmap active at a given time, the kernel knows
four of them. Three of them are built-in and never change. They define
the IBM code page 437 from early DOS versions with box draw characters,
the DEC VT100 charset also with box draw characters, and the ISO latin1
charset. The fourth kernel charset is user-defined, is by default the straight-to-font
mapping, and can only be changed by loading a soft font.
<p>The console driver has two slots labelled G0 and G1, each with a reference
to one of the four kernel charsets. G0 and G1 can vary from console to
console as long as they point to cp437, vt100, latin1. If you put a cmap
different from those three in any slot G0 or G1 in any console, all other
consoles will switch to that same user-defined charset. By default, G0
points to latin1, G1 points to vt100. G0 and G1 can be acted upon with
escape sequences at the prompt. And although they are mentioned quite often,
you had better leave them alone. Why?
<p>If you load a soft font and send escape sequences to switch between
kernel charsets, you may well be applying to your soft font a translation
that produces plenty of junk. The cmap you select must be suitable for
your font and be a team player with the current unimap. The only guarantee
you have in this respect is to rely on setfont and control both cmap and
unimap. If you start mixing setfont commands with escape sequences to the
console, also partly relying on defaults, you may (you will!) end up losing
any sense of orientation. To keep cmap and unimap under control, use fonts
that have a unimap built-in and use
<blockquote>setfont -m none this_beauty_of_font.psfu</blockquote>
when loading a 256 char soft font. This largely guarantees no interference,
provided you are not playing with keyboard tools at the same time, since keyboard
tools may affect the console font. For 512 char fonts, you must know what's
inside, and you must know the names of the charsets covered (i.e. the corresponding
files *.trans), otherwise you will not be able to switch between them.
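<p>For reference only, the escape sequences that point G0 and G1 at the four
kernel charsets can be written down as a small Python table. The final letters
are the ones documented in the console_codes man page; the function and
dictionary names are invented for this sketch. The code only builds the
strings and deliberately does not send them to the console.

```python
ESC = "\x1b"

# Final character of the sequence for each kernel charset.
CHARSET_FINAL = {
    "latin1": "B",  # ISO 8859-1 mapping
    "vt100": "0",   # VT100 graphics mapping
    "cp437": "U",   # null mapping, straight to the character ROM
    "user": "K",    # user-defined mapping
}

def designate(slot, charset):
    """Return the sequence that points slot G0 or G1 at a kernel charset."""
    intro = {"G0": "(", "G1": ")"}[slot]
    return ESC + intro + CHARSET_FINAL[charset]

# Sequences that switch the screen to and from UTF-8 mode.
UTF8_ON = ESC + "%G"
UTF8_OFF = ESC + "%@"

for name in CHARSET_FINAL:
    print(name, repr(designate("G0", name)), repr(designate("G1", name)))
```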
<p>And what about the user-defined character set? If you have loaded a
soft font (and any run of setfont loads a soft font except when you are
just saving from the current font to disk), the escape sequence to pick
up the user-defined character set from the kernel will make that soft font
active with the charset implicit to it as cmap and you will not be able
to revert to the ROM font. If you look into setfont's source code, you
will see that they are activating the soft font's character set anyway.
Forget the user-defined character set, it's none of your business, leave
it to setfont.
<p>On the other hand, if you run the ROM font and have not loaded a soft
font, requesting the user-defined charset will only reset to cp437, the
reason being that the user-defined charset has the default value straight-to-font.
For instance, assume that you have chosen vt100 which does not have lower
case letters and will immediately display junk. Send the escape sequence
for the user-defined charset (which has not been defined yet and so still
has the default value): the junk disappears, you get the lower case letters
again.
<p>There is, however, a soft font which has been explicitly made to cope
with the kernel charsets. This font is called
<blockquote>lat1-16.psfu</blockquote>
and is not a latin1 font as the name suggests, it is a mongrel. With the
cmap set to cp437 it will deliver most of cp437 (all block and box draw
elements), with the cmap set to latin1 it will deliver latin1. And it will
also deliver vt100 should anybody care for it. Requesting the user-defined
cmap unveils that the font uses the normally empty control ranges (0-31,
128-159) to pack together chars from cp437 and latin1.
<br><p>
Advice: if you are in a region where latin1 is not suitable, stick to
the font provided by your distro (and most probably kiss goodbye to
the box draw elements). If latin1 is ok, use lat1-16.psfu. That will
give you the latin1 characters plus box lines for your file
manager.<br>
<br>
<b>DOCUMENTATION OR LACK THEREOF</b>
<p>The issues around Linux console fonts are poorly documented. The man
pages are too dense, the terminology is windy, and the HOWTO that comes with
the kbd package is enough to make you despair. I wonder whether people who recommend it
ever tried to read it.
<p>The stuff presented in this article is elementary and still took quite
an effort to grasp. Let us summarize it from a different angle, it will
do no harm.
<ul>
<li>
The following distinctions should be kept in mind:</li>
</ul>
<blockquote>(i) ROM font (always 256 characters) (ii) console soft font
<br>(a) 256 characters maximum (b) 257-512 characters</blockquote>
<ul>
<li>
The ROM font is the old DOS cp437. After changing from the ROM font to
a soft font, there is no return to the ROM font except rebooting.</li>
<li>
The cmap and unimap must fit the font or junk will abound. When the mappings
fit, a font can be used with different personalities and cover multiple
character sets. This is most evident when using a 512 char font.</li>
<li>
If you think a 512 char font is 'naturally' a sensible solution, think
again - since a 512 char font will disable bold colours. Red Hat 8.0 has
such a font as a default and Red Hat users are not pleased. Check, however,
what is said below on Mandrake.</li>
<li>
You can only display characters that are in the current console font. This
is to say, you cannot (!) switch from font A to B on-the-fly, display characters
that are in B but not in A, switch back to A. When you are back to A, those
alien B characters will be overwritten at the next screen update. This
is why your file manager, say Midnight Commander, will not display box
lines if you use a pure latin1 font (which has no box elements).</li>
</ul>
Somebody is working on a new console driver for Linux 2.6.0. Can we place
an order? A trick to use console fonts bigger than 512 characters; each
console its own font; no interference of big fonts with colours. Thank
you very much.
<br>
<p><b>QUERIES &amp; ANSWERS</b>
<p>How do I enforce the ROM font in the console?
<blockquote>There might be a utility for that somewhere but it is not in
the kbd package. Without such a utility, the only way to enforce the ROM
font is to boot into the ROM font. Check your init scripts and make sure
no soft font is loaded. If you fail, rename the directory where the soft
fonts reside so it cannot be found at boot time.</blockquote>
How do I save the ROM font to a file?
<blockquote>When using the ROM font, issue
<p>echo -ne '\e(U'
<br>setfont -o cp437-16.psf
<p>at the prompt. The file cp437-16.psf contains the ROM font. This font
has a height of 16 pixels.</blockquote>
How do I find out which font the console is currently using?
<blockquote>If you mean which name the font has, look in the boot scripts
and/or the shell history to find out what soft font was loaded last (possibly
none, so the ROM font is on). If you want to see the characters in the
font according to their internal arrangement, issue
<p>echo -ne '\e(K'
<br>setfont -om current_font.trans
<p>and look inside current_font.trans with an editor. This does not work
100% because certain character ranges (0-31 and 128-159) are not properly
displayed although they may be storing glyphs. If the font has a unimap,
the unimap will list all characters with their official names. That will
often give an idea of the glyph.</blockquote>
I have created my own font based on latin1 but adding box draw elements
in the unused range 128-159. It works but the horizontal lines have little
gaps. How come?
<blockquote>The characters are 8 pixels wide but the VGA hardware adds a
9th column of blanks so as to display them at a small distance from
each other. That is very appropriate for most characters but not for horizontal
line segments that should rather close up to each other. For this reason,
the VGA hardware makes an exception for box draw elements: instead of inserting
blanks, the 9th column repeats the 8th column of pixels. So far, so good.
But how does the VGA adapter know where you put your box draw elements?
It does not. Either you put them in the same range as they were in cp437
or you will get gaps.</blockquote>
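<p>The 9th column rule can be sketched in Python. One assumption goes beyond
what is said above: the range 0xC0-0xDF is taken as the block of character
codes for which VGA text mode repeats the 8th column, as described in VGA
hardware references. The function name is invented for the example.

```python
# Assumption: character codes whose 9th column copies the 8th.
LINE_GRAPHICS = range(0xC0, 0xE0)

def nine_wide(code, row_byte):
    """Expand one 8-pixel glyph row to the 9 pixels shown on screen."""
    bits = format(row_byte, "08b")
    ninth = bits[-1] if code in LINE_GRAPHICS else "0"
    return bits + ninth

# A full-width horizontal bar at the cp437 position and in the 128-159 range.
print(nine_wide(0xC4, 0xFF))  # inside the special range: no gap
print(nine_wide(0x90, 0xFF))  # outside it: a blank 9th pixel
```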
How can I use a 512 char font and save my bold colours?
<blockquote>You will have to boot into the framebuffer; for details see
Framebuffer-HOWTO.html. Opinions about the framebuffer are divided: Mandrake
boots into the framebuffer by default, SuSE advises against it. Red Hat's
official position is not known to me, but they do not boot into the framebuffer
although they use a 512 char console font that disables bold colours.</blockquote>
The lat1-16.psfu is a 256 char font and still covers more than one charset.
How is it possible?
<blockquote>It is only possible because it covers charsets only partially
or covers charsets that are smaller than 256 characters. cp437 is full
house, it has exactly 256 characters so lat1-16.psfu covers it only partially.
On the other hand, latin1 keeps the control range 0-31 and 128-159 empty
so it has only 192 characters. vt100 is handled as 128 characters but complemented
with latin1 in the 160-255 range. So what lat1-16.psfu does is essentially
keeping box and block draw elements where they used to be in cp437 and
moving latin1 characters elsewhere. This way everything fits within 256
characters. Well done.</blockquote>
Is the console font unique for all consoles or may it vary from console
to console?
<blockquote>The console font is the same for all consoles; what can vary
are the character sets (cmaps) used in the consoles.</blockquote>
<!-- *** BEGIN copyright *** -->
<hr>
<CENTER><SMALL><STRONG>
Copyright &copy; 2003, En D Loozzr.
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
Published in Issue 91 of <i>Linux Gazette</i>, June 2003
</STRONG></SMALL></CENTER>
<!-- *** END copyright *** -->
<HR>
<!--startcut ==========================================================-->
</BODY></HTML>
<!--endcut ============================================================-->