177 lines
8.3 KiB
HTML
177 lines
8.3 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
|
|
<TITLE>The Unicode HOWTO: Introduction</TITLE>
|
|
<LINK HREF="Unicode-HOWTO-2.html" REL=next>
|
|
|
|
<LINK HREF="Unicode-HOWTO.html#toc1" REL=contents>
|
|
</HEAD>
|
|
<BODY>
|
|
<A HREF="Unicode-HOWTO-2.html">Next</A>
|
|
Previous
|
|
<A HREF="Unicode-HOWTO.html#toc1">Contents</A>
|
|
<HR>
|
|
<H2><A NAME="s1">1. Introduction</A></H2>
|
|
|
|
<P>
|
|
<P>
|
|
<H2><A NAME="ss1.1">1.1 Why Unicode?</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>People in different countries use different characters to represent the
|
|
words of their native languages. Nowadays most applications, including
|
|
email systems and web browsers, are 8-bit clean, i.e. they can operate on
|
|
and display text correctly provided that it is represented in an 8-bit
|
|
character set, like ISO-8859-1.
|
|
<P>There are far more than 256 characters in the world - think of cyrillic,
|
|
hebrew, arabic, chinese, japanese, korean and thai -, and new characters
|
|
are being invented now and then. The problems that come up for users are:
|
|
<P>
|
|
<UL>
|
|
<LI>It is impossible to store text with characters from different character
|
|
sets in the same document. For example, I can cite russian papers in
|
|
a German or French publication if I use TeX, xdvi and PostScript,
|
|
but I cannot do it in plain text.</LI>
|
|
<LI>As long as every document has its own character set, and recognition
|
|
of the character set is not automatic, manual user intervention is
|
|
inevitable. For example, in order to view the homepage of the
|
|
XTeamLinux distribution
|
|
<A HREF="http://www.xteamlinux.com.cn/">http://www.xteamlinux.com.cn/</A>
|
|
I had to tell Netscape that the web page is coded in GB2312.</LI>
|
|
<LI>New symbols like the Euro are being invented. ISO has issued a new
|
|
standard ISO-8859-15, which is mostly like ISO-8859-1 except that it
|
|
removes some rarely used characters (the old currency sign) and
|
|
replaced it with the Euro sign. If users adopt this standard, they
|
|
have documents in different character sets on their disk, and they
|
|
start having to think about it daily. But computers should make things
|
|
simpler, not more complicated.</LI>
|
|
</UL>
|
|
<P>The solution of this problem is the adoption of a world-wide usable character
|
|
set. This character set is Unicode
|
|
<A HREF="http://www.unicode.org/">http://www.unicode.org/</A>.
|
|
For more info about Unicode, do `<CODE>man 7 unicode</CODE>' (manpage contained
|
|
in the man-pages-1.20 package).
|
|
<P>
|
|
<H2><A NAME="ss1.2">1.2 Unicode encodings</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>This reduces the user's problem of dealing with character sets to a technical
|
|
problem: How to transport Unicode characters using the 8-bit bytes?
|
|
8-bit units are the smallest addressing units of most computers and also the
|
|
unit used by TCP/IP network connections. The use of 1 byte to represent
|
|
1 character is, however, an accident of history, caused by the fact that
|
|
computer development started in Europe and the U.S. where 96 characters were
|
|
found to be sufficient for a long time.
|
|
<P>There are basically four ways to encode Unicode characters in bytes:
|
|
<P>
|
|
<DL>
|
|
<DT><B>UTF-8</B><DD><P>128 characters are encoded using 1 byte (the ASCII characters).
|
|
1920 characters are encoded using 2 bytes (Roman, Greek, Cyrillic,
|
|
Coptic, Armenian, Hebrew, Arabic characters).
|
|
63488 characters are encoded using 3 bytes (Chinese and Japanese among
|
|
others).
|
|
The other 2147418112 characters (not assigned yet) can be encoded
|
|
using 4, 5 or 6 characters.
|
|
For more info about UTF-8, do `<CODE>man 7 utf-8</CODE>' (manpage contained
|
|
in the man-pages-1.20 package).
|
|
<DT><B>UCS-2</B><DD><P>Every character is represented as two bytes.
|
|
This encoding can only represent the first 65536 Unicode characters.
|
|
<DT><B>UTF-16</B><DD><P>This is an extension of UCS-2 which can represent 1112064 Unicode
|
|
characters. The first 65536 Unicode characters are represented as two
|
|
bytes, the other ones as four bytes.
|
|
<DT><B>UCS-4</B><DD><P>Every character is represented as four bytes.
|
|
</DL>
|
|
<P>The space requirements for encoding a text, compared to encodings currently
|
|
in use (8 bit per character for European languages, more for
|
|
Chinese/Japanese/Korean), is as follows. This has an influence on disk
|
|
storage space and network download speed (when no form of compression is
|
|
used).
|
|
<P>
|
|
<DL>
|
|
<DT><B>UTF-8</B><DD><P>No change for US ASCII, just a few percent more for ISO-8859-1,
|
|
50% more for Chinese/Japanese/Korean, 100% more for Greek and Cyrillic.
|
|
<DT><B>UCS-2 and UTF-16</B><DD><P>No change for Chinese/Japanese/Korean. 100% more for
|
|
US ASCII and ISO-8859-1, Greek and Cyrillic.
|
|
<DT><B>UCS-4</B><DD><P>100% more for Chinese/Japanese/Korean. 300% more for US ASCII and
|
|
ISO-8859-1, Greek and Cyrillic.
|
|
</DL>
|
|
<P>Given the penalty for US and European documents caused by UCS-2, UTF-16, and
|
|
UCS-4, it seems unlikely that these encodings have a potential for wide-scale
|
|
use. The Microsoft Win32 API supports the UCS-2 encoding since 1995 (at
|
|
least), yet this encoding has not been widely adopted for documents - SJIS
|
|
remains prevalent in Japan.
|
|
<P>UTF-8 on the other hand has the potential for wide-scale use, since it
|
|
doesn't penalize US and European users, and since many text processing
|
|
programs don't need to be changed for UTF-8 support.
|
|
<P>In the following, we will describe how to change your Linux system so
|
|
it uses UTF-8 as text encoding.
|
|
<P>
|
|
<H3>Footnotes for C/C++ developers</H3>
|
|
|
|
<P>
|
|
<P>The Microsoft Win32 approach makes it easy for developers to produce
|
|
Unicode versions of their programs: You "#define UNICODE" at the top
|
|
of your program and then change many occurrences of `<CODE>char</CODE>' to
|
|
`<CODE>TCHAR</CODE>', until your program compiles without warnings. The problem
|
|
with it is that you end up with two versions of your program: one which
|
|
understands UCS-2 text but no 8-bit encodings, and one which understands
|
|
only old 8-bit encodings.
|
|
<P>Moreover, there is an endianness issue with UCS-2 and UCS-4. The IANA
|
|
character set registry
|
|
<A HREF="http://www.isi.edu/in-notes/iana/assignments/character-sets">http://www.isi.edu/in-notes/iana/assignments/character-sets</A>
|
|
says about ISO-10646-UCS-2: "this needs to specify network byte order: the
|
|
standard does not specify". Network byte order is big endian. And RFC 2152
|
|
is even clearer: "ISO/IEC 10646-1:1993(E) specifies that when characters the
|
|
UCS-2 form are serialized as octets, that the most significant octet appear
|
|
first."
|
|
Whereas Microsoft, in its C/C++ development tools, recommends
|
|
to use machine-dependent endianness (i.e. little endian on ix86 processors)
|
|
and either a byte-order mark at the beginning of the document, or some
|
|
statistical heuristics(!).
|
|
<P>The UTF-8 approach on the other hand keeps `<CODE>char*</CODE>' as the standard C
|
|
string type. As a result, your program will handle US ASCII text,
|
|
independently of any environment variables, and will handle both
|
|
ISO-8859-1 and UTF-8 encoded text provided the LANG environment variable
|
|
is set accordingly.
|
|
<P>
|
|
<H2><A NAME="ss1.3">1.3 Related resources</A>
|
|
</H2>
|
|
|
|
<P>
|
|
<P>Markus Kuhn's very up-to-date resource list:
|
|
<UL>
|
|
<LI>
|
|
<A HREF="http://www.cl.cam.ac.uk/~mgk25/unicode.html">http://www.cl.cam.ac.uk/~mgk25/unicode.html</A></LI>
|
|
<LI>
|
|
<A HREF="http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html">http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html</A></LI>
|
|
</UL>
|
|
<P>Roman Czyborra's overview of Unicode, UTF-8 and UTF-8 aware programs:
|
|
<A HREF="http://czyborra.com/utf/#UTF-8">http://czyborra.com/utf/#UTF-8</A><P>Some example UTF-8 files:
|
|
<UL>
|
|
<LI>In Markus Kuhn's ucs-fonts package:
|
|
<A HREF="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt">quickbrown.txt</A>,
|
|
<A HREF="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt">UTF-8-test.txt</A>,
|
|
<A HREF="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt">UTF-8-demo.txt</A>.</LI>
|
|
<LI>
|
|
<A HREF="http://www.columbia.edu/kermit/utf8.html">http://www.columbia.edu/kermit/utf8.html</A></LI>
|
|
<LI>
|
|
<A HREF="ftp://ftp.cs.su.oz.au/gary/x-utf8.html">ftp://ftp.cs.su.oz.au/gary/x-utf8.html</A></LI>
|
|
<LI>The file <CODE>iso10646</CODE> in the Kosta Kostis' trans-1.1.1 package
|
|
<A HREF="ftp://ftp.nid.ru/pub/os/unix/misc/trans111.tar.gz">ftp://ftp.nid.ru/pub/os/unix/misc/trans111.tar.gz</A></LI>
|
|
<LI>
|
|
<A HREF="ftp://ftp.dante.de/pub/tex/info/lwc/apc/utf8.html">ftp://ftp.dante.de/pub/tex/info/lwc/apc/utf8.html</A></LI>
|
|
<LI>
|
|
<A HREF="http://www.cogsci.ed.ac.uk/~richard/unicode-sample.html">http://www.cogsci.ed.ac.uk/~richard/unicode-sample.html</A></LI>
|
|
</UL>
|
|
<P>
|
|
<P>
|
|
<HR>
|
|
<A HREF="Unicode-HOWTO-2.html">Next</A>
|
|
Previous
|
|
<A HREF="Unicode-HOWTO.html#toc1">Contents</A>
|
|
</BODY>
|
|
</HTML>
|