
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
 <META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
 <TITLE>The Unicode HOWTO: Introduction</TITLE>
 <LINK HREF="Unicode-HOWTO-2.html" REL=next>

 <LINK HREF="Unicode-HOWTO.html#toc1" REL=contents>
</HEAD>
<BODY>
<A HREF="Unicode-HOWTO-2.html">Next</A>
Previous
<A HREF="Unicode-HOWTO.html#toc1">Contents</A>
<HR>
<H2><A NAME="s1">1. Introduction</A></H2>

<P>
<P>
<H2><A NAME="ss1.1">1.1 Why Unicode?</A>
</H2>

<P>
<P>People in different countries use different characters to represent the
words of their native languages. Nowadays most applications, including
email systems and web browsers, are 8-bit clean, i.e. they can operate on
and display text correctly provided that it is represented in an 8-bit
character set, like ISO-8859-1.
<P>There are far more than 256 characters in the world (think of Cyrillic,
Hebrew, Arabic, Chinese, Japanese, Korean and Thai), and new characters
are invented from time to time. The problems that come up for users are:
<P>
<UL>
<LI>It is impossible to store text with characters from different character
sets in the same document. For example, I can cite Russian papers in
a German or French publication if I use TeX, xdvi and PostScript,
but I cannot do it in plain text.</LI>
<LI>As long as every document has its own character set, and recognition
of the character set is not automatic, manual user intervention is
inevitable. For example, in order to view the homepage of the
XTeamLinux distribution
<A HREF="http://www.xteamlinux.com.cn/">http://www.xteamlinux.com.cn/</A>
I had to tell Netscape that the web page is coded in GB2312.</LI>
<LI>New symbols like the Euro are being invented. ISO has issued a new
standard, ISO-8859-15, which is mostly like ISO-8859-1 except that it
removes some rarely used characters (such as the old currency sign) and
puts new ones, among them the Euro sign, in their place. If users adopt
this standard, they will have documents in different character sets on
their disks, and they will have to think about character sets daily. But
computers should make things simpler, not more complicated.</LI>
</UL>
<P>The solution to this problem is the adoption of a single character set
that is usable world-wide. This character set is Unicode
<A HREF="http://www.unicode.org/">http://www.unicode.org/</A>.
For more info about Unicode, do `<CODE>man 7 unicode</CODE>' (manpage contained
in the man-pages-1.20 package).
<P>
<H2><A NAME="ss1.2">1.2 Unicode encodings</A>
</H2>

<P>
<P>Adopting Unicode reduces the user's problem of dealing with character sets
to a technical problem: how to transport Unicode characters using 8-bit bytes.
8-bit units are the smallest addressing units of most computers and also the
unit used by TCP/IP network connections. The use of 1 byte to represent
1 character is, however, an accident of history, caused by the fact that
computer development started in Europe and the U.S., where 96 characters were
found to be sufficient for a long time.
<P>There are basically four ways to encode Unicode characters in bytes:
<P>
<DL>
<DT><B>UTF-8</B><DD><P>128 characters are encoded using 1 byte (the ASCII characters).
1920 characters are encoded using 2 bytes (Latin, Greek, Cyrillic,
Coptic, Armenian, Hebrew, Arabic characters).
63488 characters are encoded using 3 bytes (Chinese and Japanese among
others).
The other 2147418112 characters (not assigned yet) can be encoded
using 4, 5 or 6 bytes.
For more info about UTF-8, do `<CODE>man 7 utf-8</CODE>' (manpage contained
in the man-pages-1.20 package).
<DT><B>UCS-2</B><DD><P>Every character is represented as two bytes.
This encoding can only represent the first 65536 Unicode characters.
<DT><B>UTF-16</B><DD><P>This is an extension of UCS-2 which can represent 1112064 Unicode
characters. The first 65536 Unicode characters are represented as two
bytes, the other ones as four bytes.
<DT><B>UCS-4</B><DD><P>Every character is represented as four bytes.
</DL>
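<P>The UTF-8 byte counts listed above follow directly from the bit layout of
the encoding. The following is a minimal sketch in C, not taken from any
particular library: the function name <CODE>utf8_encode</CODE> is hypothetical,
and the code assumes the original 1 to 6 byte scheme for code points up to
0x7FFFFFFF, as described in this section.

```c
#include <stddef.h>

/* Encode one code point (0 .. 0x7FFFFFFF) as UTF-8 into out[], which must
 * have room for 6 bytes. Returns the number of bytes written, or 0 if the
 * code point is out of range. Illustrative sketch only. */
size_t utf8_encode(unsigned long c, unsigned char *out)
{
    size_t n, i;

    if (c < 0x80UL) {                    /* the 128 ASCII characters */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800UL)             n = 2;  /* the next 1920 characters */
    else if (c < 0x10000UL)      n = 3;  /* the next 63488 characters */
    else if (c < 0x200000UL)     n = 4;
    else if (c < 0x4000000UL)    n = 5;
    else if (c <= 0x7FFFFFFFUL)  n = 6;
    else return 0;

    /* Fill the continuation bytes (10xxxxxx) from the end, 6 bits each. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    /* Lead byte: n one-bits, a zero bit, then the remaining high bits. */
    out[0] = (unsigned char)((0xFF << (8 - n)) | c);
    return n;
}
```

<P>For example, calling <CODE>utf8_encode(0x20AC, buf)</CODE> for the Euro sign
(U+20AC) returns 3 and stores the bytes E2 82 AC in <CODE>buf</CODE>.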
<P>The space requirements for encoding a text, compared to encodings currently
in use (8 bits per character for European languages, more for
Chinese/Japanese/Korean), are as follows. This has an influence on disk
storage space and network download speed (when no form of compression is
used).
<P>
<DL>
<DT><B>UTF-8</B><DD><P>No change for US ASCII, just a few percent more for ISO-8859-1,
50% more for Chinese/Japanese/Korean, 100% more for Greek and Cyrillic.
<DT><B>UCS-2 and UTF-16</B><DD><P>No change for Chinese/Japanese/Korean. 100% more for
US ASCII and ISO-8859-1, Greek and Cyrillic.
<DT><B>UCS-4</B><DD><P>100% more for Chinese/Japanese/Korean. 300% more for US ASCII and
ISO-8859-1, Greek and Cyrillic.
</DL>
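<P>These percentages can be checked by counting bytes per code point. The
sketch below is only an illustration: the names <CODE>text_size</CODE> and
<CODE>measure</CODE> are hypothetical, and it assumes the input is an array of
code points no larger than 0x10FFFF (the range UTF-16 can reach).

```c
#include <stddef.h>

/* Total size in bytes of a text in each encoding. */
struct text_size {
    size_t utf8;
    size_t utf16;
    size_t ucs4;
};

/* Sum the encoded size of n code points cp[0..n-1].
 * Assumption: every code point is <= 0x10FFFF. */
struct text_size measure(const unsigned long *cp, size_t n)
{
    struct text_size total = {0, 0, 0};
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned long c = cp[i];

        if (c < 0x80UL)         total.utf8 += 1;  /* ASCII */
        else if (c < 0x800UL)   total.utf8 += 2;  /* e.g. Greek, Cyrillic */
        else if (c < 0x10000UL) total.utf8 += 3;  /* e.g. Chinese, Japanese */
        else                    total.utf8 += 4;

        total.utf16 += (c < 0x10000UL) ? 2 : 4;   /* 4 = surrogate pair */
        total.ucs4  += 4;                         /* always four bytes */
    }
    return total;
}
```

<P>A six-letter Cyrillic word comes out at 12 bytes in UTF-8 and UTF-16 and
24 bytes in UCS-4, against 6 bytes in an 8-bit encoding. Three Japanese
characters come out at 9 bytes in UTF-8, 6 in UTF-16 and 12 in UCS-4,
against 6 bytes in a 2-byte legacy encoding.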
<P>Given the penalty for US and European documents caused by UCS-2, UTF-16, and
UCS-4, it seems unlikely that these encodings have a potential for wide-scale
use. The Microsoft Win32 API has supported the UCS-2 encoding since 1995 (at
least), yet this encoding has not been widely adopted for documents: SJIS
remains prevalent in Japan.
<P>UTF-8 on the other hand has the potential for wide-scale use, since it
doesn't penalize US and European users, and since many text processing
programs don't need to be changed for UTF-8 support.
<P>In the following, we will describe how to change your Linux system so
it uses UTF-8 as text encoding.
<P>
<H3>Footnotes for C/C++ developers</H3>

<P>
<P>The Microsoft Win32 approach makes it easy for developers to produce
Unicode versions of their programs: you `<CODE>#define UNICODE</CODE>' at the top
of your program and then change many occurrences of `<CODE>char</CODE>' to
`<CODE>TCHAR</CODE>', until your program compiles without warnings. The problem
with it is that you end up with two versions of your program: one which
understands UCS-2 text but no 8-bit encodings, and one which understands
only old 8-bit encodings.
<P>Moreover, there is an endianness issue with UCS-2 and UCS-4. The IANA
character set registry
<A HREF="http://www.isi.edu/in-notes/iana/assignments/character-sets">http://www.isi.edu/in-notes/iana/assignments/character-sets</A>
says about ISO-10646-UCS-2: "this needs to specify network byte order: the
standard does not specify". Network byte order is big endian. RFC 2152
is even clearer: "ISO/IEC 10646-1:1993(E) specifies that when characters in the
UCS-2 form are serialized as octets, that the most significant octet appear
first."
Microsoft, in contrast, recommends in its C/C++ development tools
the use of machine-dependent endianness (i.e. little endian on ix86 processors)
together with either a byte-order mark at the beginning of the document or some
statistical heuristics(!).
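<P>To make the endianness issue concrete, here is a small sketch of writing
and reading UCS-2 code units in a way that does not depend on the byte order
of the machine. All names (<CODE>ucs2_put_be</CODE>, <CODE>ucs2_detect_bom</CODE>,
<CODE>ucs2_get</CODE>) are hypothetical. It assumes the byte-order mark is the
character U+FEFF at the very start of the data, and it does not attempt the
statistical heuristics mentioned above.

```c
#include <stddef.h>

enum ucs2_order { UCS2_UNKNOWN, UCS2_BIG_ENDIAN, UCS2_LITTLE_ENDIAN };

/* Store one UCS-2 code unit with the most significant octet first
 * (network byte order), whatever the host byte order is. */
void ucs2_put_be(unsigned int unit, unsigned char out[2])
{
    out[0] = (unsigned char)((unit >> 8) & 0xFF);
    out[1] = (unsigned char)(unit & 0xFF);
}

/* Look for the byte-order mark U+FEFF in the first two octets. */
enum ucs2_order ucs2_detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return UCS2_BIG_ENDIAN;
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return UCS2_LITTLE_ENDIAN;
    }
    return UCS2_UNKNOWN;
}

/* Read one code unit; anything but little endian is read as big endian. */
unsigned int ucs2_get(const unsigned char in[2], enum ucs2_order order)
{
    if (order == UCS2_LITTLE_ENDIAN)
        return ((unsigned int)in[1] << 8) | in[0];
    return ((unsigned int)in[0] << 8) | in[1];
}
```

<P>Treating data without a byte-order mark as big endian is a design choice
of this sketch, made to match the network byte order quoted above.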
<P>The UTF-8 approach on the other hand keeps `<CODE>char*</CODE>' as the standard C
string type. As a result, your program will handle US ASCII text,
independently of any environment variables, and will handle both
ISO-8859-1 and UTF-8 encoded text provided the LANG environment variable
is set accordingly.
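<P>One reason ordinary `<CODE>char*</CODE>' code keeps working is that every byte
of a multi-byte UTF-8 sequence has its high bit set, so it can never be
mistaken for an ASCII character such as `/'. The sketch below illustrates
this; the function names <CODE>base_name</CODE> and <CODE>utf8_strlen</CODE> are
hypothetical, and the code assumes its input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Return the part of path after the last '/'. This byte-oriented code
 * needs no change for UTF-8, since '/' never occurs inside a multi-byte
 * sequence. */
const char *base_name(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash != NULL ? slash + 1 : path;
}

/* Count characters rather than bytes by skipping continuation bytes
 * (those of the form 10xxxxxx). Assumes valid UTF-8 input. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

<P>Only code that needs to know about character boundaries, like the second
function, has to be aware of the encoding; <CODE>strlen</CODE> still returns the
number of bytes.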
<P>
<H2><A NAME="ss1.3">1.3 Related resources</A>
</H2>

<P>
<P>Markus Kuhn's very up-to-date resource list:
<UL>
<LI>
<A HREF="http://www.cl.cam.ac.uk/~mgk25/unicode.html">http://www.cl.cam.ac.uk/~mgk25/unicode.html</A></LI>
<LI>
<A HREF="http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html">http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html</A></LI>
</UL>
<P>Roman Czyborra's overview of Unicode, UTF-8 and UTF-8 aware programs:
<A HREF="http://czyborra.com/utf/#UTF-8">http://czyborra.com/utf/#UTF-8</A>
<P>Some example UTF-8 files:
<UL>
<LI>In Markus Kuhn's ucs-fonts package:
<A HREF="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt">quickbrown.txt</A>,
<A HREF="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt">UTF-8-test.txt</A>,
<A HREF="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt">UTF-8-demo.txt</A>.</LI>
<LI>
<A HREF="http://www.columbia.edu/kermit/utf8.html">http://www.columbia.edu/kermit/utf8.html</A></LI>
<LI>
<A HREF="ftp://ftp.cs.su.oz.au/gary/x-utf8.html">ftp://ftp.cs.su.oz.au/gary/x-utf8.html</A></LI>
<LI>The file <CODE>iso10646</CODE> in Kosta Kostis' trans-1.1.1 package
<A HREF="ftp://ftp.nid.ru/pub/os/unix/misc/trans111.tar.gz">ftp://ftp.nid.ru/pub/os/unix/misc/trans111.tar.gz</A></LI>
<LI>
<A HREF="ftp://ftp.dante.de/pub/tex/info/lwc/apc/utf8.html">ftp://ftp.dante.de/pub/tex/info/lwc/apc/utf8.html</A></LI>
<LI>
<A HREF="http://www.cogsci.ed.ac.uk/~richard/unicode-sample.html">http://www.cogsci.ed.ac.uk/~richard/unicode-sample.html</A></LI>
</UL>
</BODY>
</HTML>