<!doctype linuxdoc system>
|
||
|
||
<article>
|
||
|
||
<title>The Unicode HOWTO
|
||
<author>Bruno Haible,
|
||
<htmlurl url="mailto:haible@clisp.cons.org"
|
||
name="<haible@clisp.cons.org>">
|
||
<date>v1.0, 23 January 2001
|
||
<abstract>
|
||
This document describes how to change your Linux system so that it uses UTF-8
as its text encoding.
This is work in progress. Any tips, patches, pointers and URLs are very welcome.
|
||
</abstract>
|
||
|
||
<toc>
|
||
|
||
|
||
<sect>Introduction
|
||
<p>
|
||
|
||
<sect1>Why Unicode?
|
||
<p>
|
||
|
||
People in different countries use different characters to represent the
|
||
words of their native languages. Nowadays most applications, including
|
||
email systems and web browsers, are 8-bit clean, i.e. they can operate on
|
||
and display text correctly provided that it is represented in an 8-bit
|
||
character set, like ISO-8859-1.
|
||
|
||
There are far more than 256 characters in the world (think of Cyrillic,
Hebrew, Arabic, Chinese, Japanese, Korean and Thai), and new characters
are being invented now and then. The problems that come up for users are:
|
||
|
||
<itemize>
|
||
<item>
|
||
It is impossible to store text with characters from different character
|
||
sets in the same document. For example, I can cite Russian papers in
|
||
a German or French publication if I use TeX, xdvi and PostScript,
|
||
but I cannot do it in plain text.
|
||
<item>
|
||
As long as every document has its own character set, and recognition
|
||
of the character set is not automatic, manual user intervention is
|
||
inevitable. For example, in order to view the homepage of the
|
||
XTeamLinux distribution
|
||
<htmlurl url="http://www.xteamlinux.com.cn/"
|
||
name="http://www.xteamlinux.com.cn/">
|
||
I had to tell Netscape that the web page is coded in GB2312.
|
||
<item>
|
||
New symbols like the Euro are being invented. ISO has issued a new
|
||
standard ISO-8859-15, which is mostly like ISO-8859-1 except that it
replaces some rarely used characters (such as the old currency sign) with
new ones (such as the Euro sign). If users adopt this standard, they
|
||
have documents in different character sets on their disk, and they
|
||
start having to think about it daily. But computers should make things
|
||
simpler, not more complicated.
|
||
</itemize>
|
||
|
||
The solution to this problem is to adopt a character set that is usable
world-wide. This character set is Unicode
|
||
<htmlurl url="http://www.unicode.org/"
|
||
name="http://www.unicode.org/">.
|
||
For more info about Unicode, do `<tt>man 7 unicode</tt>' (manpage contained
|
||
in the man-pages-1.20 package).
|
||
|
||
<sect1>Unicode encodings
|
||
<p>
|
||
|
||
This reduces the user's problem of dealing with character sets to a technical
problem: how to transport Unicode characters using 8-bit bytes?
|
||
8-bit units are the smallest addressing units of most computers and also the
|
||
unit used by TCP/IP network connections. The use of 1 byte to represent
|
||
1 character is, however, an accident of history, caused by the fact that
|
||
computer development started in Europe and the U.S. where 96 characters were
|
||
found to be sufficient for a long time.
|
||
|
||
There are basically four ways to encode Unicode characters in bytes:
|
||
|
||
<descrip>
|
||
<tag>UTF-8</tag>
|
||
128 characters are encoded using 1 byte (the ASCII characters).
|
||
1920 characters are encoded using 2 bytes (Roman, Greek, Cyrillic,
|
||
Coptic, Armenian, Hebrew, Arabic characters).
|
||
63488 characters are encoded using 3 bytes (Chinese and Japanese among
|
||
others).
|
||
The other 2147418112 characters (not assigned yet) can be encoded
using 4, 5 or 6 bytes.
|
||
For more info about UTF-8, do `<tt>man 7 utf-8</tt>' (manpage contained
|
||
in the man-pages-1.20 package).
|
||
<tag>UCS-2</tag>
|
||
Every character is represented as two bytes.
|
||
This encoding can only represent the first 65536 Unicode characters.
|
||
<tag>UTF-16</tag>
|
||
This is an extension of UCS-2 which can represent 1112064 Unicode
|
||
characters. The first 65536 Unicode characters are represented as two
|
||
bytes, the other ones as four bytes.
|
||
<tag>UCS-4</tag>
|
||
Every character is represented as four bytes.
|
||
</descrip>
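The character counts given for UTF-8 follow directly from the code point
ranges that fit into each byte length. The following Python sketch recomputes
them. It is only an illustration: the names <tt>LIMITS</tt> and
<tt>utf8_length</tt> are made up here, and the sketch models the scheme of up
to 6 bytes described above, whereas Python's own encoder (used for the
cross-check) produces at most 4 bytes.

<tscreen><verb>
```python
# Upper bounds (exclusive) of the code point ranges that fit into
# 1, 2, 3, 4, 5 and 6 bytes in the UTF-8 scheme described above.
LIMITS = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]


def utf8_length(code_point):
    """Return how many bytes the UTF-8 encoding of code_point needs."""
    if code_point < 0:
        raise ValueError("negative code point")
    for length, limit in enumerate(LIMITS, start=1):
        if code_point < limit:
            return length
    raise ValueError("code point does not fit into 31 bits")


# How many code points fall into each class.
one_byte = LIMITS[0]
two_bytes = LIMITS[1] - LIMITS[0]
three_bytes = LIMITS[2] - LIMITS[1]
more_bytes = LIMITS[5] - LIMITS[2]
print(one_byte, two_bytes, three_bytes, more_bytes)

# Cross-check against Python's own encoder for a few characters:
# Latin A, a-umlaut, Greek alpha, Euro sign, a Chinese ideograph.
for ch in ("A", "\u00e4", "\u03b1", "\u20ac", "\u4e2d"):
    assert len(ch.encode("utf-8")) == utf8_length(ord(ch))
```
</verb></tscreen>

The four numbers printed are 128, 1920, 63488 and 2147418112, i.e. the
figures listed under UTF-8 above.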
|
||
|
||
The space requirements for encoding a text, compared to encodings currently
in use (8 bits per character for European languages, more for
Chinese/Japanese/Korean), are as follows. This has an influence on disk
storage space and network download speed (when no form of compression is
used).
|
||
|
||
<descrip>
|
||
<tag>UTF-8</tag>
|
||
No change for US ASCII, just a few percent more for ISO-8859-1,
|
||
50% more for Chinese/Japanese/Korean, 100% more for Greek and Cyrillic.
|
||
<tag>UCS-2 and UTF-16</tag>
|
||
No change for Chinese/Japanese/Korean. 100% more for
|
||
US ASCII and ISO-8859-1, Greek and Cyrillic.
|
||
<tag>UCS-4</tag>
|
||
100% more for Chinese/Japanese/Korean. 300% more for US ASCII and
|
||
ISO-8859-1, Greek and Cyrillic.
|
||
</descrip>
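These percentages can be checked on a few sample strings. The Python sketch
below compares a legacy encoding with the Unicode encodings. It rests on some
assumptions: Python's <tt>utf-16-be</tt> and <tt>utf-32-be</tt> codecs stand in
for UCS-2 and UCS-4 (which gives the same bytes for the sample characters,
since they all lie within the first 65536 code points), and the sample strings
and the helper name <tt>overhead</tt> are chosen for illustration only.

<tscreen><verb>
```python
# (sample text, legacy encoding commonly used for it)
SAMPLES = [
    ("hello", "ascii"),                              # US ASCII
    ("\u03b1\u03b2\u03b3\u03b4\u03b5", "iso8859_7"),  # five Greek letters
    ("\u4e2d\u6587", "gb2312"),                      # two Chinese characters
]


def overhead(text, legacy, unicode_encoding):
    """Extra space in percent needed by unicode_encoding compared to legacy."""
    old = len(text.encode(legacy))
    new = len(text.encode(unicode_encoding))
    return (new - old) * 100 // old


for text, legacy in SAMPLES:
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        print(legacy, encoding, overhead(text, legacy, encoding), "percent more")
```
</verb></tscreen>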
|
||
|
||
Given the penalty for US and European documents caused by UCS-2, UTF-16, and
|
||
UCS-4, it seems unlikely that these encodings have a potential for wide-scale
|
||
use. The Microsoft Win32 API has supported the UCS-2 encoding since 1995 (at
least), yet this encoding has not been widely adopted for documents; SJIS
remains prevalent in Japan.
|
||
|
||
UTF-8 on the other hand has the potential for wide-scale use, since it
|
||
doesn't penalize US and European users, and since many text processing
|
||
programs don't need to be changed for UTF-8 support.
|
||
|
||
In the following, we will describe how to change your Linux system so
|
||
it uses UTF-8 as text encoding.
|
||
|
||
<sect2>Footnotes for C/C++ developers
|
||
<p>
|
||
|
||
The Microsoft Win32 approach makes it easy for developers to produce
|
||
Unicode versions of their programs: You "#define UNICODE" at the top
|
||
of your program and then change many occurrences of `<tt>char</tt>' to
|
||
`<tt>TCHAR</tt>', until your program compiles without warnings. The problem
|
||
with it is that you end up with two versions of your program: one which
|
||
understands UCS-2 text but no 8-bit encodings, and one which understands
|
||
only old 8-bit encodings.
|
||
|
||
Moreover, there is an endianness issue with UCS-2 and UCS-4. The IANA
|
||
character set registry
|
||
<htmlurl url="http://www.isi.edu/in-notes/iana/assignments/character-sets"
|
||
name="http://www.isi.edu/in-notes/iana/assignments/character-sets">
|
||
says about ISO-10646-UCS-2: "this needs to specify network byte order: the
|
||
standard does not specify". Network byte order is big endian. And RFC 2152
|
||
is even clearer: "ISO/IEC 10646-1:1993(E) specifies that when characters the
|
||
UCS-2 form are serialized as octets, that the most significant octet appear
|
||
first."
|
||
Microsoft, in contrast, recommends in its C/C++ development tools the use of
machine-dependent endianness (i.e. little endian on ix86 processors)
together with either a byte-order mark at the beginning of the document or some
statistical heuristics(!).
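The effect of the byte order can be seen on a single character. The following
Python sketch is an illustration only; it uses Python's <tt>utf-16</tt> codecs
as a stand-in for UCS-2, which is an assumption that holds for the Euro sign
because it lies within the first 65536 code points.

<tscreen><verb>
```python
import codecs

euro = "\u20ac"  # EURO SIGN, code point U+20AC

big = euro.encode("utf-16-be")     # most significant octet first
little = euro.encode("utf-16-le")  # least significant octet first
print(big.hex(), little.hex())

# Reading big-endian data as little-endian yields a different character.
misread = big.decode("utf-16-le")
print(hex(ord(misread)))

# A byte-order mark in front of the data tells the reader which order was used.
marked = codecs.BOM_UTF16_LE + little
assert marked.decode("utf-16") == euro
```
</verb></tscreen>

Without the byte-order mark, the reader has to know or guess the byte order,
which is the problem described above.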
|
||
|
||
The UTF-8 approach on the other hand keeps `<tt>char*</tt>' as the standard C
|
||
string type. As a result, your program will handle US ASCII text,
|
||
independently of any environment variables, and will handle both
|
||
ISO-8859-1 and UTF-8 encoded text provided the LANG environment variable
|
||
is set accordingly.
|
||
|
||
<sect1>Related resources
|
||
<p>
|
||
|
||
Markus Kuhn's very up-to-date resource list:
|
||
<itemize>
|
||
<item>
|
||
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
|
||
name="http://www.cl.cam.ac.uk/~mgk25/unicode.html">
|
||
<item>
|
||
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html"
|
||
name="http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html">
|
||
</itemize>
|
||
|
||
Roman Czyborra's overview of Unicode, UTF-8 and UTF-8 aware programs:
|
||
<htmlurl url="http://czyborra.com/utf/#UTF-8"
|
||
name="http://czyborra.com/utf/#UTF-8">
|
||
|
||
Some example UTF-8 files:
|
||
<itemize>
|
||
<item>
|
||
In Markus Kuhn's ucs-fonts package:
|
||
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt"
|
||
name="quickbrown.txt">,
|
||
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt"
|
||
name="UTF-8-test.txt">,
|
||
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt"
|
||
name="UTF-8-demo.txt">.
|
||
<item>
|
||
<htmlurl url="http://www.columbia.edu/kermit/utf8.html"
|
||
name="http://www.columbia.edu/kermit/utf8.html">
|
||
<item>
|
||
<htmlurl url="ftp://ftp.cs.su.oz.au/gary/x-utf8.html"
|
||
name="ftp://ftp.cs.su.oz.au/gary/x-utf8.html">
|
||
<item>
|
||
The file <tt>iso10646</tt> in Kosta Kostis' trans-1.1.1 package
|
||
<htmlurl url="ftp://ftp.nid.ru/pub/os/unix/misc/trans111.tar.gz"
|
||
name="ftp://ftp.nid.ru/pub/os/unix/misc/trans111.tar.gz">
|
||
<item>
|
||
<htmlurl url="ftp://ftp.dante.de/pub/tex/info/lwc/apc/utf8.html"
|
||
name="ftp://ftp.dante.de/pub/tex/info/lwc/apc/utf8.html">
|
||
<item>
|
||
<htmlurl url="http://www.cogsci.ed.ac.uk/~richard/unicode-sample.html"
|
||
name="http://www.cogsci.ed.ac.uk/~richard/unicode-sample.html">
|
||
</itemize>
|
||
|
||
|
||
<sect>Display setup
|
||
<p>
|
||
|
||
We assume you have already adapted your Linux console and X11 configuration
|
||
to your keyboard and locale. This is explained in the Danish/International
|
||
HOWTO, and in the other national HOWTOs: Finnish, French, German, Italian,
|
||
Polish, Slovenian, Spanish, Cyrillic, Hebrew, Chinese, Thai, Esperanto. But
|
||
please do not follow the advice given in the Thai HOWTO, to pretend you
|
||
were using ISO-8859-1 characters (U0000..U00FF) when what you are typing
|
||
are actually Thai characters (U0E01..U0E5B). Doing so will only cause
|
||
problems when you switch to Unicode.
|
||
|
||
<sect1>Linux console
|
||
<p>
|
||
|
||
I'm not talking much about the Linux console here, because on those machines
|
||
on which I don't have xdm running, I use it only to type my login name,
|
||
my password, and "xinit".
|
||
|
||
Anyway, the kbd-0.99 package
|
||
<htmlurl url="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz"
|
||
name="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz">
|
||
and a heavily extended version, the console-tools-0.2.3 package
|
||
<htmlurl url="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-tools-0.2.3.tar.gz"
|
||
name="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-tools-0.2.3.tar.gz">
|
||
contain, in the kbd-0.99/src/ (or console-tools-0.2.3/screenfonttools/)
directory, two programs: `unicode_start' and `unicode_stop'. When you call
|
||
`unicode_start', the console's screen output is interpreted as UTF-8. Also,
|
||
the keyboard is put into Unicode mode (see "man kbd_mode"). In this mode,
|
||
Unicode characters typed as Alt-x1 ... Alt-xn (where x1,...,xn are digits on
|
||
the numeric keypad) will be emitted in UTF-8. If your keyboard or, more
|
||
precisely, your normal keymap has non-ASCII letter keys (like the German
|
||
Umlaute) which you would like to be CapsLockable, you need to apply the kernel
|
||
patch
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.2.9-keyboard.diff"
|
||
name="linux-2.2.9-keyboard.diff">
|
||
or
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.3.12-keyboard.diff"
|
||
name="linux-2.3.12-keyboard.diff">.
|
||
|
||
You will want to display characters from different scripts on the same
|
||
screen. For this, you need a Unicode console font. The
|
||
<htmlurl url="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz"
|
||
name="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz">
|
||
and
|
||
<htmlurl url="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-data-1999.08.29.tar.gz"
|
||
name="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-data-1999.08.29.tar.gz">
|
||
packages contain a font (LatArCyrHeb-{08,14,16,19}.psf) which
|
||
covers Latin, Cyrillic, Hebrew, Arabic scripts. It covers ISO 8859 parts
|
||
1,2,3,4,5,6,8,9,10 all at once. To install it, copy it to
|
||
/usr/lib/kbd/consolefonts/ and execute
|
||
"/usr/bin/setfont /usr/lib/kbd/consolefonts/LatArCyrHeb-14.psf".
|
||
|
||
A more flexible approach is given by Dmitry Yu. Bolkhovityanov
|
||
<htmlurl url="mailto:D.Yu.Bolkhovityanov@inp.nsk.su"
|
||
name="<D.Yu.Bolkhovityanov@inp.nsk.su>">
|
||
in <htmlurl url="http://www.inp.nsk.su/~bolkhov/files/fonts/univga/index.html"
|
||
name="http://www.inp.nsk.su/~bolkhov/files/fonts/univga/index.html">
|
||
and <htmlurl url="http://www.inp.nsk.su/~bolkhov/files/fonts/univga/uni-vga.tgz"
|
||
name="http://www.inp.nsk.su/~bolkhov/files/fonts/univga/uni-vga.tgz">.
|
||
To work around the constraint that a VGA font can only cover 512 characters simultaneously,
|
||
he provides a rich Unicode font (2279 characters, covering Latin, Greek, Cyrillic, Hebrew,
|
||
Armenian, IPA, math symbols, arrows, and more) in the typical 8x16 size and a script
|
||
which lets you extract any 512 characters as a console font.
|
||
|
||
If you want cut&paste to work with UTF-8 consoles, you need the patch
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.3.12-console.diff"
|
||
name="linux-2.3.12-console.diff">
|
||
from Edmund Thomas Grimley Evans and Stanislav Voronyi.
|
||
|
||
In April 2000, Edmund Thomas Grimley Evans
|
||
<htmlurl url="mailto:edmundo@rano.org"
|
||
name="<edmundo@rano.org>">
|
||
implemented a UTF-8 console terminal emulator. It uses Unicode fonts
|
||
and relies on the Linux frame buffer device.
|
||
|
||
<sect1>X11 Foreign fonts
|
||
<p>
|
||
|
||
Don't hesitate to install Cyrillic, Chinese, Japanese etc. fonts. Even
|
||
if they are not Unicode fonts, they will help in displaying Unicode
|
||
documents: at least Netscape Communicator 4 and Java will make use of
|
||
foreign fonts when available.
|
||
|
||
The following programs are useful when installing fonts:
|
||
<itemize>
|
||
<item>
|
||
"mkfontdir directory"
|
||
prepares a font directory for use by the X server, needs to be executed
|
||
after installing fonts in a directory.
|
||
<item>
|
||
"xset -q | sed -e '1,/^Font Path:/d' | sed -e '2,$d' -e 's/^ //'"
|
||
displays the X server's current font path.
|
||
<item>
|
||
"xset fp+ directory"
|
||
adds a directory to the X server's current font path.
|
||
To add a directory permanently, add a "FontPath" line to your
|
||
/etc/XF86Config file, in section "Files".
|
||
<item>
|
||
"xset fp rehash"
|
||
needs to be executed after calling mkfontdir on a directory that is
|
||
already contained in the X server's current font path.
|
||
<item>
|
||
"xfontsel"
|
||
allows you to browse the installed fonts by selecting various font
|
||
properties.
|
||
<item>
|
||
"xlsfonts -fn fontpattern"
|
||
lists all fonts matching a font pattern. Also displays various font
|
||
properties. In particular, "xlsfonts -ll -fn font" lists the font
|
||
properties CHARSET_REGISTRY and CHARSET_ENCODING, which together
|
||
determine the font's encoding.
|
||
<item>
|
||
"xfd -fn font"
|
||
displays a font page by page.
|
||
</itemize>
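In a full X font name, CHARSET_REGISTRY and CHARSET_ENCODING are the last two
of the 14 fields separated by hyphens. The following Python sketch shows how a
program could read them from a font name; the function name
<tt>font_charset</tt> is hypothetical, and the sketch assumes a complete font
name rather than an alias such as "fixed".

<tscreen><verb>
```python
def font_charset(xlfd):
    """Return (CHARSET_REGISTRY, CHARSET_ENCODING) of a full X font name."""
    fields = xlfd.split("-")
    # A full font name starts with "-" and has 14 fields after it,
    # so splitting on "-" must give an empty string plus 14 fields.
    if len(fields) != 15 or fields[0] != "":
        raise ValueError("not a complete font name: " + xlfd)
    return fields[13], fields[14]


def is_unicode_font(xlfd):
    """True if the font name announces the ISO 10646 (Unicode) encoding."""
    registry, _encoding = font_charset(xlfd)
    return registry.lower() == "iso10646"


name = "-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1"
print(font_charset(name), is_unicode_font(name))
```
</verb></tscreen>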
|
||
|
||
The following fonts are freely available (not a complete list):
|
||
<itemize>
|
||
<item>
|
||
The ones contained in XFree86, sometimes packaged in separate packages.
|
||
For example, SuSE has only normal 75dpi fonts in the base `xf86' package.
|
||
The other fonts are in the packages `xfnt100', `xfntbig', `xfntcyr',
|
||
`xfntscl'.
|
||
<item>
|
||
The Emacs international fonts,
|
||
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/intlfonts/intlfonts-1.2.tar.gz"
|
||
name="ftp://ftp.gnu.org/pub/gnu/intlfonts/intlfonts-1.2.tar.gz">
|
||
As already mentioned, they are useful even if you prefer XEmacs to
|
||
GNU Emacs or don't use any Emacs at all.
|
||
</itemize>
|
||
|
||
<sect1>X11 Unicode fonts
|
||
<p>
|
||
|
||
Applications wishing to display text belonging to different scripts (like
|
||
Cyrillic and Greek) at the same time, can do so by using different X fonts
|
||
for the various pieces of text. This is what Netscape Communicator and Java
|
||
do. However, this approach is more complicated, because instead of working
|
||
with `Font' and `XFontStruct', the programmer has to deal with `XFontSet',
|
||
and also because not all fonts in the font set need to have the same
|
||
dimensions.
|
||
|
||
<itemize>
|
||
<item>
|
||
Markus Kuhn has assembled fixed-width 75dpi fonts with Unicode encoding
|
||
covering Latin, Greek, Cyrillic, Armenian, Georgian, Hebrew scripts and
|
||
many symbols.
|
||
They cover ISO 8859 parts 1,2,3,4,5,7,8,9,10,13,14,15,16 all at once.
|
||
These fonts are required for running xterm in UTF-8 mode. They are now
contained in XFree86 4.0.1, so you need to install them manually
only if you have an older XFree86 3.x version.
|
||
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz"
|
||
name="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz">.
|
||
<item>
|
||
Markus Kuhn has also assembled double-width fixed 75dpi fonts with Unicode
|
||
encoding covering Chinese, Japanese and Korean. These fonts are contained
|
||
in XFree86 4.0.1 as well.
|
||
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts-asian.tar.gz"
|
||
name="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts-asian.tar.gz">
|
||
<item>
|
||
Roman Czyborra has assembled an 8x16 / 16x16 75dpi font with Unicode encoding
|
||
covering a huge part of Unicode. Download unifont.hex.gz and hex2bdf from
|
||
<htmlurl url="http://czyborra.com/unifont/"
|
||
name="http://czyborra.com/unifont/">.
|
||
It is not fixed-width: 8 pixels wide for European characters, 16 pixels wide
|
||
for Chinese characters. Installation instructions:
|
||
<tscreen><verb>
|
||
$ gunzip unifont.hex.gz
|
||
$ hex2bdf < unifont.hex > unifont.bdf
|
||
$ bdftopcf -o unifont.pcf unifont.bdf
|
||
$ gzip -9 unifont.pcf
|
||
# cp unifont.pcf.gz /usr/X11R6/lib/X11/fonts/misc
|
||
# cd /usr/X11R6/lib/X11/fonts/misc
|
||
# mkfontdir
|
||
# xset fp rehash
|
||
</verb></tscreen>
|
||
<item>
|
||
Primoz Peterlin has assembled an ETL font family covering Latin, Greek,
|
||
Cyrillic, Armenian, Georgian, Hebrew scripts.
|
||
<htmlurl url="ftp://ftp.x.org/contrib/fonts/etl-unicode.tar.gz"
|
||
name="ftp://ftp.x.org/contrib/fonts/etl-unicode.tar.gz">
|
||
Use the "bdftopcf" program in order to install it.
|
||
<item>
|
||
Mark Leisher has assembled a proportional, 17 pixel high (12 point), font,
|
||
called ClearlyU, covering Latin, Greek, Cyrillic, Armenian, Georgian, Hebrew,
|
||
Thai, Laotian scripts.
|
||
<htmlurl url="http://crl.nmsu.edu/~mleisher/cu.html"
|
||
name="http://crl.nmsu.edu/~mleisher/cu.html">.
|
||
Installation instructions:
|
||
<tscreen><verb>
|
||
$ bdftopcf -o cu12.pcf cu12.bdf
|
||
$ gzip -9 cu12.pcf
|
||
# cp cu12.pcf.gz /usr/X11R6/lib/X11/fonts/misc
|
||
# cd /usr/X11R6/lib/X11/fonts/misc
|
||
# mkfontdir
|
||
# xset fp rehash
|
||
</verb></tscreen>
|
||
</itemize>
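To see why a font like unifont is 8 pixels wide for some characters and 16
pixels wide for others, it helps to look at one line of a <tt>.hex</tt> file.
The Python sketch below assumes the following line format: a hexadecimal code
point, a colon, and the glyph bitmap as hexadecimal digits, 32 digits for an
8x16 glyph and 64 digits for a 16x16 glyph. This format description, the
function name <tt>parse_hex_line</tt> and the sample line are assumptions made
for illustration, not taken from the font's documentation.

<tscreen><verb>
```python
def parse_hex_line(line):
    """Split one line of a .hex font file into (code point, pixel rows)."""
    code, bitmap = line.strip().split(":")
    if len(bitmap) not in (32, 64):
        raise ValueError("unexpected bitmap length")
    width = len(bitmap) * 4 // 16   # 16 rows, 4 pixels per hex digit
    digits_per_row = width // 4
    rows = []
    for i in range(0, len(bitmap), digits_per_row):
        value = int(bitmap[i:i + digits_per_row], 16)
        rows.append(format(value, "0{}b".format(width)))
    return int(code, 16), rows


# Illustrative sample line: an 8 pixel wide glyph for U+0041.
sample = "0041:0000000018242442427E424242420000"
code_point, rows = parse_hex_line(sample)
for row in rows:
    print(row.replace("0", ".").replace("1", "#"))
```
</verb></tscreen>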
|
||
|
||
<!-- Not useful now: The `ps2pk' and `pk2bm' programs contained in the teTeX
|
||
distribution can convert existing Postscript/Type1 and TeX metafont fonts
|
||
to .bdf format for use with X11. -->
|
||
|
||
<sect1>Unicode xterm
|
||
<p>
|
||
|
||
xterm is part of X11R6 and XFree86, but is maintained separately by Tom
|
||
Dickey.
|
||
<htmlurl url="http://www.clark.net/pub/dickey/xterm/xterm.html"
|
||
name="http://www.clark.net/pub/dickey/xterm/xterm.html">
|
||
Newer versions (patch level 146 and above) contain support for converting
|
||
keystrokes to UTF-8 before sending them to the application running in the
|
||
xterm, and for displaying Unicode characters that the application outputs
as UTF-8 byte sequences. They also contain support for double-wide characters
|
||
(mostly CJK ideographs) and combining characters, contributed by Robert Brady
|
||
<htmlurl url="mailto:robert@suse.co.uk"
|
||
name="<robert@suse.co.uk>">.
|
||
|
||
To get a UTF-8 xterm running, you need to:
|
||
<itemize>
|
||
<item>
Fetch
|
||
<htmlurl url="http://www.clark.net/pub/dickey/xterm/xterm.tar.gz"
|
||
name="http://www.clark.net/pub/dickey/xterm/xterm.tar.gz">,
|
||
<item>
|
||
Configure it by calling "./configure --enable-wide-chars ...", then
|
||
compile and install it.
|
||
<item>
|
||
Have a Unicode fixed-width font installed. Markus Kuhn's ucs-fonts.tar.gz
|
||
(see above) is made for this.
|
||
<item>
|
||
Start "xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'".
|
||
The option "-u8" turns on Unicode and UTF-8 handling. The font designated
|
||
by the long "-fn" option is Markus Kuhn's Unicode font. Without this option,
|
||
the default font called "fixed" would be used, an ISO-8859-1 6x13 font.
|
||
<item>
|
||
Take a look at the sample files contained in Markus Kuhn's ucs-fonts
|
||
package:
|
||
<tscreen><verb>
|
||
$ cd .../ucs-fonts
|
||
$ cat quickbrown.txt
|
||
$ cat UTF-8-demo.txt
|
||
</verb></tscreen>
|
||
You should see (among others) Greek and Russian characters.
|
||
<item>
|
||
To make xterm come up with UTF-8 handling each time it is started,
|
||
add the lines
|
||
<tscreen><verb>
|
||
xterm*utf8: 1
|
||
xterm*VT100*font: -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
|
||
xterm*VT100*wideFont: -misc-fixed-medium-r-normal-ja-13-125-75-75-c-120-iso10646-1
|
||
xterm*VT100*boldFont: -misc-fixed-bold-r-semicondensed--13-120-75-75-c-60-iso10646-1
|
||
</verb></tscreen>
|
||
to your $HOME/.Xdefaults (for yourself only).
|
||
For CJK text processing with double-width characters, the following
|
||
settings are probably better:
|
||
<tscreen><verb>
|
||
xterm*VT100*font: -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
|
||
xterm*VT100*wideFont: -Misc-Fixed-Medium-R-Normal-ja-18-120-100-100-C-180-ISO10646-1
|
||
</verb></tscreen>
|
||
I don't recommend changing
|
||
the system-wide /usr/X11R6/lib/X11/app-defaults/XTerm, because then your
|
||
changes will be erased next time you upgrade to a new XFree86 version.
|
||
</itemize>
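If the sample files are not at hand, a few non-ASCII characters emitted as
UTF-8 are enough for a quick check of the terminal. The following Python
sketch is one way to do this; the function name <tt>sample_line</tt> is made
up for this illustration, and the choice of characters is arbitrary.

<tscreen><verb>
```python
import sys


def sample_line():
    """Return one line containing five Greek and five Russian letters."""
    greek = "".join(chr(c) for c in range(0x03B1, 0x03B6))
    russian = "".join(chr(c) for c in range(0x0430, 0x0435))
    return "Greek: " + greek + "  Russian: " + russian + "\n"


if __name__ == "__main__":
    # Write raw UTF-8 bytes, independently of the current locale.
    sys.stdout.buffer.write(sample_line().encode("utf-8"))
```
</verb></tscreen>

In a terminal that is not in UTF-8 mode, each of these letters shows up as two
unrelated characters instead of one.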
|
||
|
||
<!-- Development versions of xterm, by Robert Brady, are available from
|
||
http://susu.org.uk/~robert/xterm/
|
||
currently xterm-150 + http://susu.org.uk/~robert/xterm-23.diff.gz
|
||
but I don't know whether its stable enough (e.g. whether the memory
|
||
leak is fixed now).
|
||
-->
|
||
|
||
<sect1>TrueType fonts
|
||
<p>
|
||
|
||
The fonts mentioned above are fixed size and not scalable. For some
|
||
applications, especially printing, high resolution fonts are necessary,
|
||
though. The most important type of scalable, high resolution fonts are
|
||
TrueType fonts.
|
||
<!-- COMMENTED OUT.
|
||
Two other categories of scalable fonts are Postscript fonts (Adobe Type1,
|
||
Type3, Type42) and TeX metafonts. Converters exist between both: ps2mf
|
||
and mf2ps. But, as Juliusz Chroboczek writes in
|
||
http://www.dcs.ed.ac.uk/home/jec/programs/xfsft/renderers.html
|
||
"While the quality of the rasterisation of TrueType fonts depends solely
|
||
on the font used, in the case of Type 1 fonts it depends greatly on the
|
||
rasteriser. As it is significantly easier to produce high-quality Type 1
|
||
fonts than TrueType fonts, it is unfortunate that there is currently no
|
||
_good_ Free Type 1 rasteriser available."
|
||
But somehow, there are now Unicode TrueType fonts available, whereas
|
||
I don't know of any Unicode Postscript/metafont fonts.
|
||
-->
|
||
They are currently supported by
|
||
<itemize>
|
||
<item>
XFree86 4.0.1; you need to add the line
|
||
<tscreen><verb>
|
||
Load "freetype"
|
||
</verb></tscreen>
|
||
or
|
||
<tscreen><verb>
|
||
Load "xtt"
|
||
</verb></tscreen>
|
||
to the <tt>"Module"</tt> section of your XF86Config file.
|
||
<item>
|
||
The display engines of other operating systems.
|
||
<item>
|
||
The yudit editor, see below, and its printing engine.
|
||
<!-- COMMENTED OUT. ghostscript use of TrueType fonts is irrelevant,
|
||
because they can only be used as a substitute for TimesRoman etc.,
|
||
and not enable the use of Unicode characters.
|
||
Furthermore, http://www.dcs.ed.ac.uk/home/jec/programs/xfsft/renderers.html
|
||
shows that ghostscript does not correctly support TrueType fonts at
|
||
low resolutions.
|
||
<item>
|
||
The ghostscript Postscript rasterizer.
|
||
-->
|
||
</itemize>
|
||
|
||
Some no-cost TrueType fonts with large Unicode coverage are
|
||
<descrip>
|
||
<tag>Bitstream Cyberbit</tag>
|
||
Covers Roman, Cyrillic, Greek, Hebrew, Arabic, combining diacritical marks,
|
||
Chinese, Korean, Japanese, and more.
|
||
|
||
Downloadable from
|
||
<htmlurl url="ftp://ftp.netscape.com/pub/communicator/extras/fonts/windows/Cyberbit.ZIP"
|
||
name="ftp://ftp.netscape.com/pub/communicator/extras/fonts/windows/Cyberbit.ZIP">.
|
||
It is free for non-commercial purposes.
|
||
|
||
<tag>Microsoft Arial</tag>
|
||
Covers Roman, Cyrillic, Greek, Hebrew, Arabic, some combining diacritical
|
||
marks, Vietnamese.
|
||
|
||
Downloadable; look on a search engine for ftp-able files called
|
||
<tt>arial.ttf</tt>, <tt>ariali.ttf</tt>, <tt>arialbd.ttf</tt>,
|
||
<tt>arialbi.ttf</tt>.
|
||
<!-- COMMENTED OUT. Since redistribution of these fonts is not allowed
|
||
(Microsoft wants everyone to use a Windows platform in order to unpack
|
||
their arial32.exe), we cannot publish the URL
|
||
ftp://ftp.bora.net/pub/sw/screen-saver/themes/fonts/
|
||
-->
|
||
|
||
<tag>Lucida Sans Unicode</tag>
|
||
Covers Roman, Cyrillic, Greek, Hebrew, combining diacritical marks.
|
||
|
||
Download: contained in IBM's JDK 1.3.0 for Linux, at
|
||
<htmlurl url="http://www.ibm.com/java/jdk/linux130/"
|
||
name="http://www.ibm.com/java/jdk/linux130/">,
|
||
or directly downloadable as <tt>LucidaSansRegular.ttf</tt> and
|
||
<tt>LucidaSansOblique.ttf</tt> from
|
||
<htmlurl url="ftp://ftp.maths.tcd.ie/Linux/opt/IBMJava2-13/jre/lib/fonts/"
|
||
name="ftp://ftp.maths.tcd.ie/Linux/opt/IBMJava2-13/jre/lib/fonts/">.
|
||
|
||
<tag>Arphic</tag>
|
||
Cover Chinese (both traditional and simplified).
|
||
|
||
Download: at
|
||
<htmlurl url="ftp://ftp.gnu.org/non-gnu/chinese-fonts-truetype/"
|
||
name="ftp://ftp.gnu.org/non-gnu/chinese-fonts-truetype/">.
|
||
These fonts are truly free.
|
||
<!-- They are not Unicode fonts, but I mention them nevertheless because
|
||
they are popular in the Republic of China, and they have a good license.
|
||
-->
|
||
|
||
</descrip>
|
||
|
||
Download locations for these and other TrueType fonts can be found at
|
||
Christoph Singer's list of freely downloadable Unicode TrueType fonts
|
||
<htmlurl url="http://www.ccss.de/slovo/unifonts.htm"
|
||
name="http://www.ccss.de/slovo/unifonts.htm">.
|
||
|
||
TrueType fonts are installed similarly to fixed size fonts, except that
|
||
they go in a separate directory, and that <tt>ttmkfdir</tt> must be
|
||
called before <tt>mkfontdir</tt>:
|
||
<tscreen><verb>
|
||
# mkdir -p /usr/X11R6/lib/X11/fonts/truetype
|
||
# cp /somewhere/Cyberbit.ttf ... /usr/X11R6/lib/X11/fonts/truetype
|
||
# cd /usr/X11R6/lib/X11/fonts/truetype
|
||
# ttmkfdir > fonts.scale
|
||
# mkfontdir
|
||
# xset fp rehash
|
||
</verb></tscreen>
|
||
|
||
TrueType fonts can be converted to low resolution, non-scalable X11 fonts by
|
||
use of Mark Leisher's ttf2bdf utility
|
||
<htmlurl url="ftp://crl.nmsu.edu/CLR/multiling/General/ttf2bdf-2.8-LINUX.tar.gz"
|
||
name="ftp://crl.nmsu.edu/CLR/multiling/General/ttf2bdf-2.8-LINUX.tar.gz">.
|
||
For example, to generate a proportional Unicode font for use with cooledit:
|
||
<tscreen><verb>
|
||
# cd /usr/X11R6/lib/X11/fonts/local
|
||
# ttf2bdf ../truetype/Cyberbit.ttf > cyberbit.bdf
|
||
# bdftopcf -o cyberbit.pcf cyberbit.bdf
|
||
# gzip -9 cyberbit.pcf
|
||
# mkfontdir
|
||
# xset fp rehash
|
||
</verb></tscreen>
|
||
|
||
More information about TrueType fonts can be found in the Linux TrueType HOWTO
|
||
<htmlurl url="http://www.moisty.org/~brion/linux/TrueType-HOWTO.html"
|
||
name="http://www.moisty.org/~brion/linux/TrueType-HOWTO.html">.
|
||
|
||
<sect1>Miscellaneous
|
||
<p>
|
||
|
||
A small program which tests whether a Linux console or xterm is in UTF-8 mode
|
||
can be found in the
|
||
<htmlurl url="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/x-lt-1.24.tar.gz"
|
||
name="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/x-lt-1.24.tar.gz">
|
||
package by Ricardas Cepas, files testUTF-8.c and testUTF8.c. Most applications
|
||
should not use this, however: they should look at the environment variables,
|
||
see section "Locale environment variables".
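As a rough illustration of what "looking at the environment variables" means,
the following Python sketch derives the character encoding from the locale
variables, using the usual precedence LC_ALL, then LC_CTYPE, then LANG. The
function name <tt>charset_from_environment</tt> is hypothetical, and parsing
the locale name by hand is a simplification; a real program would normally
leave this to the C library's locale functions.

<tscreen><verb>
```python
def charset_from_environment(environ):
    """Guess the character encoding from locale environment variables."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name, "")
        if value:
            break
    else:
        return None  # none of the variables is set
    # A locale name looks like language_TERRITORY.CODESET@modifier.
    value = value.split("@", 1)[0]
    if "." not in value:
        return None  # no explicit codeset, e.g. "C" or "de_DE"
    return value.split(".", 1)[1]


print(charset_from_environment({"LANG": "de_DE.UTF-8"}))
```
</verb></tscreen>

To apply it to the running process, pass <tt>os.environ</tt> as the argument.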
|
||
|
||
|
||
<sect>Locale setup
|
||
<p>
|
||
|
||
<sect1>Files & the kernel
|
||
<p>
|
||
|
||
You can already use any Unicode characters in file names. No kernel
|
||
or file utilities need modifications. This is because file names in the
|
||
kernel can be anything not containing a null byte, and '/' is used to
|
||
delimit subdirectories. When encoded using UTF-8, non-ASCII characters
|
||
will never be encoded using null bytes or slashes. All that happens is
|
||
that file and directory names occupy more bytes than they contain characters.
|
||
For example, a filename consisting of five Greek characters will appear
to the kernel as a 10-byte filename. The kernel does not know (and does
not need to know) that these bytes are displayed as Greek.
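Both claims can be checked with a short Python sketch: that five Greek letters
occupy 10 bytes, and that the UTF-8 encoding of a non-ASCII character never
contains a null byte or a slash (in fact, no byte below 0x80 at all). The
helper name <tt>bytes_are_non_ascii</tt> is made up for this illustration, and
the check is limited to the first 65536 code points.

<tscreen><verb>
```python
name = "\u03b1\u03b2\u03b3\u03b4\u03b5"  # five Greek letters, alpha to epsilon
raw = name.encode("utf-8")
print(len(name), len(raw))  # characters, bytes


def bytes_are_non_ascii(code_point):
    """True if every UTF-8 byte of this character is 0x80 or above."""
    return all(b >= 0x80 for b in chr(code_point).encode("utf-8"))


# All non-ASCII code points up to U+FFFF, without the surrogate range,
# which does not denote characters and cannot be encoded.
checked = [cp for cp in range(0x80, 0x10000) if not 0xD800 <= cp <= 0xDFFF]
assert all(bytes_are_non_ascii(cp) for cp in checked)
```
</verb></tscreen>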
|
||
|
||
This is the general theory, as long as your files stay inside Linux. On
|
||
filesystems which are used from other operating systems, you have mount
|
||
options to control conversion of filenames to/from UTF-8:
|
||
<itemize>
|
||
<item>
|
||
The "vfat" filesystem has a mount option "utf8".
|
||
See <htmlurl url="file:/usr/src/linux/Documentation/filesystems/vfat.txt"
|
||
name="file:/usr/src/linux/Documentation/filesystems/vfat.txt">.
|
||
When you give an "iocharset" mount option different from the default
|
||
(which is "iso8859-1"), the results with and without "utf8" are not
|
||
consistent. Therefore I don't recommend the "iocharset" mount option.
|
||
<item>
|
||
The "msdos" and "umsdos" filesystems have the same mount option, but it
|
||
appears to have no effect.
|
||
<item>
|
||
The "iso9660" filesystem has a mount option "utf8".
|
||
See <htmlurl url="file:/usr/src/linux/Documentation/filesystems/isofs.txt"
|
||
name="file:/usr/src/linux/Documentation/filesystems/isofs.txt">.
|
||
<item>
|
||
Since Linux 2.2.x kernels, the "ntfs" filesystem has a mount option
|
||
"utf8". See
|
||
<htmlurl url="file:/usr/src/linux/Documentation/filesystems/ntfs.txt"
|
||
name="file:/usr/src/linux/Documentation/filesystems/ntfs.txt">.
|
||
</itemize>
|
||
The other filesystems (nfs, smbfs, ncpfs, hpfs, etc.) don't convert
|
||
filenames; therefore they support Unicode file names in UTF-8 encoding only
|
||
if the other operating system supports them.
|
||
Recall that to enable a mount option for all future remounts, you add it to
|
||
the fourth column of the corresponding /etc/fstab line.
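What this edit amounts to can be sketched in a few lines of Python. The
function name <tt>add_mount_option</tt> and the sample /etc/fstab line are
hypothetical and serve only as an illustration; the sketch also normalizes the
whitespace between columns to single spaces, which a careful tool might prefer
to preserve.

<tscreen><verb>
```python
def add_mount_option(fstab_line, option):
    """Return the fstab line with option added to its fourth column."""
    fields = fstab_line.split()
    if len(fields) < 4:
        raise ValueError("expected at least four columns")
    options = fields[3].split(",")
    if option not in options:
        options.append(option)
    fields[3] = ",".join(options)
    return " ".join(fields)


# Hypothetical entry for a vfat partition.
line = "/dev/hda1 /dos vfat defaults 0 0"
print(add_mount_option(line, "utf8"))
```
</verb></tscreen>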
|
||
|
||
<!-- COMMENTED OUT. It is questionable whether the kernel needs line editing
|
||
capabilities that go beyond those of ASCII. Multibyte and double-width
|
||
are feasible in kernel space, but combining characters and Bidi are not.
|
||
|
||
<sect1>Ttys & the kernel
|
||
<p>
|
||
|
||
Ttys are some kind of bidirectional pipes between two program, allowing
|
||
fancy features like echoing or command-line editing. When in an xterm,
|
||
you execute the "cat" command without arguments, you can enter and edit
|
||
any number of lines, and they will be echoed back line by line. The
|
||
kernel's editing actions are not correct, especially the Backspace (erase)
|
||
key and the tab key are not treated correctly.
|
||
|
||
To fix this, you need to:
|
||
<itemize>
|
||
<item>
|
||
apply the kernel patch
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.0.35-tty.diff"
|
||
name="linux-2.0.35-tty.diff"> or
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.2.9-tty.diff"
|
||
name="linux-2.2.9-tty.diff"> or
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.3.12-tty.diff"
|
||
name="linux-2.3.12-tty.diff">
|
||
and recompile your kernel,
|
||
<item>
|
||
if you are using glibc2, apply the patch
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/glibc211-tty.diff"
|
||
name="glibc211-tty.diff">
|
||
and recompile your libc (or if you are not so adventurous, it is sufficient
|
||
to patch an already installed include file:
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/glibc-tty.diff"
|
||
name="glibc-tty.diff">),
|
||
<item>
|
||
apply the patch <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/stty.diff"
|
||
name="stty.diff">
|
||
to GNU sh-utils-1.16b, and rebuild the "stty" program, then test it using
|
||
"stty -a" and "stty iutf8".
|
||
<item>
|
||
add the command "stty iutf8" to the "unicode_start" script, and
|
||
add the command "stty -iutf8" to the "unicode_stop" script.
|
||
<item>
|
||
apply the patch <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/xterm.diff"
|
||
name="xterm.diff">
|
||
to xterm-146, and rebuild "xterm", then test it by starting
|
||
"xterm -u8"/"xterm +u8" and running "stty -a" and interactive "cat" inside it.
|
||
</itemize>
|
||
-->
|
||
|
||
<!-- COMMENTED OUT. Need to find a better solution.
|
||
To make this fix persistent across rlogin and telnet, I also used to do the
|
||
following, but I have been convinced that this is not a good idea:
|
||
<itemize>
|
||
<item>
|
||
Define new values for the TERM environment variable, "linux-utf8" as an
|
||
alias to "linux", and "xterm-utf8" as an alias to "xterm".
|
||
If your system has the ncurses library and the /usr/lib/terminfo (or
|
||
/usr/share/terminfo) database, do this by running
|
||
<tscreen><verb>
|
||
$ tic linux-utf8.terminfo
|
||
$ tic xterm-utf8.terminfo
|
||
</verb></tscreen>
|
||
as non-root (this will create the terminfo entries in your $HOME/.terminfo
|
||
directory). Here are
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-utf8.terminfo"
|
||
name="linux-utf8.terminfo">
|
||
and
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/xterm-utf8.terminfo"
|
||
name="xterm-utf8.terminfo">.
|
||
I don't recommend running this as root, because it will create
|
||
the terminfo entries in /usr/lib/terminfo where they might be erased next
|
||
time you upgrade your system.
|
||
If your system has an /etc/termcap file, you should also edit that file:
|
||
copy the linux and xterm entries and give them the new names "linux-utf8"
|
||
and "xterm-utf8". For an example, see
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/termcap.diff"
|
||
name="termcap.diff">.
|
||
<item>
|
||
Each time you call "unicode_start" or "unicode_stop" from the console, also
|
||
execute "export TERM=linux-utf8" or "export TERM=linux", respectively.
|
||
<item>
|
||
Apply the patch <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/xterm2.diff"
|
||
name="xterm2.diff">
|
||
to xterm-146, rebuild "xterm", and remove any "XTerm*termName" line from
|
||
/usr/X11R6/lib/X11/app-defaults/XTerm and $HOME/.Xdefaults. Now xterm sets
|
||
the TERM variable to "xterm-utf8" instead of "xterm" when running in UTF-8
|
||
mode.
|
||
<item>
|
||
Apply the patches
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/netkit.diff"
|
||
name="netkit.diff">,
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/netkitb.diff"
|
||
name="netkitb.diff"> and
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/telnet.diff"
|
||
name="telnet.diff"> and
|
||
rebuild "rlogind" and "telnetd". Now rlogin and telnet put the tty into
|
||
UTF-8 editing mode whenever the TERM environment variable is "linux-utf8"
|
||
or "xterm-utf8".
|
||
</itemize>
|
||
-->
|
||
|
||
<sect1>Upgrading the C library
<p>

glibc-2.2 supports multibyte locales, in particular UTF-8 locales. But
glibc-2.1.x and earlier C libraries do not support them. Therefore you need
to upgrade to glibc-2.2. Upgrading from glibc-2.1.x carries little risk, because
glibc-2.2 is binary compatible with glibc-2.1.x (at least on i386 platforms,
and except for IPv6). Nevertheless, I recommend having a bootable rescue
disk handy in case something goes wrong.

Prepare the kernel sources. You must have them unpacked and configured.
/usr/src/linux/include/linux/autoconf.h must exist. Building the kernel
is not needed.

Retrieve the glibc sources
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/glibc/"
name="ftp://ftp.gnu.org/pub/gnu/glibc/">,
su to root, then unpack, build and install it:
<tscreen><verb>
# unset LD_PRELOAD
# unset LD_LIBRARY_PATH
# tar xvfz glibc-2.2.tar.gz
# tar xvfz glibc-linuxthreads-2.2.tar.gz -C glibc-2.2
# mkdir glibc-2.2-build
# cd glibc-2.2-build
# ../glibc-2.2/configure --prefix=/usr --with-headers=/usr/src/linux/include --enable-add-ons
# make
# make check
# make info
# LC_ALL=C make install
# make localedata/install-locales
</verb></tscreen>

Upgrading from glibc versions earlier than 2.1.x cannot be done this way;
consider first installing a Linux distribution based on glibc-2.1.x, and
then upgrading to glibc-2.2 as described above.

Note that if -- for any reason -- you want to rebuild GCC after having
installed glibc-2.2, you need to first apply this patch
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/gcc-glibc-2.2-compat.diff"
name="gcc-glibc-2.2-compat.diff">
to the GCC sources.

<sect1>General data conversion
<p>

You will need a program to convert your locally (probably ISO-8859-1) encoded
texts to UTF-8. (The alternative would be to keep using texts in different
encodings on the same machine; this is not fun in the long run.)
One such program is `iconv', which comes with glibc-2.2. Simply use
<tscreen><verb>
$ iconv --from-code=ISO-8859-1 --to-code=UTF-8 < old_file > new_file
</verb></tscreen>

Here are two handy shell scripts, called "i2u"
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/i2u.sh" name="i2u.sh">
(for ISO to UTF conversion) and "u2i"
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/u2i.sh" name="u2i.sh">
(for UTF to ISO conversion).
Adapt them to your current 8-bit character set.
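
A wrapper around iconv for converting a file in place can be quite short.
The following is a minimal sketch and not the i2u.sh script linked above:
the function name "to_utf8" and the CHARSET variable are assumptions made
for illustration. The original file is replaced only if iconv succeeded.
<tscreen><verb>
#!/bin/sh
# Hypothetical helper, not the i2u.sh script linked above.
# CHARSET is an assumed variable name; set it to your 8-bit character set.
CHARSET=${CHARSET:-ISO-8859-1}

# Convert one file from $CHARSET to UTF-8, in place.
to_utf8 () {
    file=$1
    tmp=$file.tmp.$$
    # Write to a temporary file first; keep the original if iconv fails.
    if iconv --from-code="$CHARSET" --to-code=UTF-8 < "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        return 1
    fi
}
</verb></tscreen>
After loading these definitions, call it as "to_utf8 letter.txt". Take care
not to run it twice on the same file: the second run would treat the UTF-8
bytes as 8-bit characters and encode them again.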

If you don't have glibc-2.2 and iconv installed, you can use GNU recode 3.6
instead.
"i2u" <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/i2u_recode.sh"
name="i2u_recode.sh"> is
"recode ISO-8859-1..UTF-8", and
"u2i" <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/u2i_recode.sh"
name="u2i_recode.sh"> is
"recode UTF-8..ISO-8859-1".
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz"
name="ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz">

Alternatively, you can use CLISP. Here are
"i2u" <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/i2u.lisp"
name="i2u.lisp"> and
"u2i" <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/u2i.lisp"
name="u2i.lisp">
written in Lisp. Note: You need a CLISP version from July 1999 or newer.
<htmlurl url="ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz"
name="ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz">.

Other data conversion programs, less powerful than GNU recode, are
`trans'
<htmlurl url="ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/trans113.tar.gz"
name="ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/trans113.tar.gz">,
`tcs' from the Plan9 operating system
<htmlurl url="ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/tcs.tar.gz"
name="ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/tcs.tar.gz">,
and
`utrans'/`uhtrans'/`hutrans'
<htmlurl url="ftp://ftp.cdrom.com/pub/FreeBSD/distfiles/i18ntools-1.0.tar.gz"
name="ftp://ftp.cdrom.com/pub/FreeBSD/distfiles/i18ntools-1.0.tar.gz">
by G. Adam Stanislav
<htmlurl url="mailto:adam@whizkidtech.net"
name="<adam@whizkidtech.net>">.

For the repeated conversion of files to UTF-8 from different character sets,
a semi-automatic tool can be used:
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/to-utf8" name="to-utf8">
presents the non-ASCII parts of a file to the user, lets the user decide on the
file's original character set, and then converts the file to UTF-8.
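
Before converting many files it helps to know which ones need attention at
all. The sketch below does not reproduce to-utf8; it only relies on iconv
rejecting invalid input when asked to read UTF-8. The function name
"is_utf8" is an assumption made for illustration.
<tscreen><verb>
#!/bin/sh
# Succeeds if the file is valid UTF-8 (pure ASCII counts as valid UTF-8).
# iconv exits with a nonzero status on an invalid UTF-8 byte sequence.
is_utf8 () {
    iconv --from-code=UTF-8 --to-code=UTF-8 < "$1" > /dev/null 2>&1
}
</verb></tscreen>
A loop such as "for f in *.txt; do is_utf8 "$f" || echo "$f"; done" then
lists the files that still have to be converted from some other character
set.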

<sect1>Locale environment variables
<p>

You may have the following environment variables set, containing locale
names:
<descrip>
<tag>LANGUAGE</tag>
override for LC_MESSAGES, used by GNU gettext only
<tag>LC_ALL</tag>
override for all other LC_* variables
<tag>LC_CTYPE, LC_MESSAGES, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_TIME</tag>
individual variables for:
character types and encoding,
natural language messages,
sorting rules,
number formatting,
money amount formatting,
date and time display
<tag>LANG</tag>
default value for all LC_* variables
</descrip>
(See `<tt>man 7 locale</tt>' for a detailed description.)

Each of the LC_* and LANG variables can contain a locale name of the
following form:

<quote>
language[_territory[.codeset]][@modifier]
</quote>

where language is an
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/ISO_639.html"
name="ISO 639">
language code (lower case), territory is an
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/ISO_3166.html"
name="ISO 3166">
country code (upper case), codeset denotes a character set, and
modifier stands for other particular attributes (for example indicating
a particular language dialect, or a nonstandard orthography).
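
To make the structure of such a name concrete, here is a small sketch that
splits a locale name into its four parts using only shell parameter
expansion. The function name "split_locale" and its output format are
assumptions made for illustration.
<tscreen><verb>
#!/bin/sh
# Split a name of the form language[_territory[.codeset]][@modifier].
split_locale () {
    name=$1
    modifier= codeset= territory=
    # Peel off the optional parts from right to left.
    case $name in
        *@*) modifier=${name#*@}; name=${name%%@*} ;;
    esac
    case $name in
        *.*) codeset=${name#*.}; name=${name%%.*} ;;
    esac
    case $name in
        *_*) territory=${name#*_}; name=${name%%_*} ;;
    esac
    echo "language=$name territory=$territory codeset=$codeset modifier=$modifier"
}
</verb></tscreen>
For example, "split_locale de_DE.UTF-8@euro" reports the language "de", the
territory "DE", the codeset "UTF-8" and the modifier "euro".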

LANGUAGE can contain several locale names, separated by colons.

In order to tell your system and all applications that you are using UTF-8,
you need to add a codeset suffix of UTF-8 to your locale names. For example,
if you were using
<tscreen><verb>
LC_CTYPE=de_DE
</verb></tscreen>
you would change this to
<tscreen><verb>
LC_CTYPE=de_DE.UTF-8
</verb></tscreen>
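
If you set several such variables, the change can be scripted. The
following sketch prints a locale name with its codeset set to UTF-8; the
function name "utf8_locale" is an assumption made for illustration, and the
resulting locale must of course exist on your system.
<tscreen><verb>
#!/bin/sh
# Print a locale name with the codeset UTF-8.
# An existing codeset is replaced; an @modifier is kept at the end.
utf8_locale () {
    name=$1
    modifier=
    case $name in
        *@*) modifier=@${name#*@}; name=${name%%@*} ;;
    esac
    echo "${name%%.*}.UTF-8$modifier"
}
</verb></tscreen>
You could then write "LC_CTYPE=$(utf8_locale de_DE); export LC_CTYPE".
Afterwards the command "locale charmap" should print "UTF-8", provided the
locale support files described in the next section are installed.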

You do <em>not</em> need to change your LANGUAGE environment variable.
GNU gettext in glibc-2.2 has the ability to convert translations to the right
encoding.

<sect1>Creating the locale support files
<p>

Use <tt>localedef</tt> to create the support files for each UTF-8 locale
you intend to use, for example:
<tscreen><verb>
$ localedef -v -c -i de_DE -f UTF-8 de_DE.UTF-8
</verb></tscreen>
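
If you need more than one locale, the same command can be generated for a
whole list. The sketch below only prints the commands, so that you can
review them before running them with sufficient privileges; the function
name "localedef_commands" is an assumption made for illustration.
<tscreen><verb>
#!/bin/sh
# Print one localedef command per locale name given as an argument.
# Nothing is executed here; the commands are only written to stdout.
localedef_commands () {
    for loc in "$@"; do
        echo "localedef -v -c -i $loc -f UTF-8 $loc.UTF-8"
    done
}
</verb></tscreen>
For example, "localedef_commands de_DE fr_FR" prints two commands. After
running them, "locale -a" should list the new locales.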

You typically don't need to create locales named "de" or "fr" without
a country suffix, because these locales are normally only used by the
LANGUAGE variable and not by the LC_* variables, and LANGUAGE is only
used as an override for LC_MESSAGES.

<sect>Specific applications
<p>

<sect1>Shells
<p>

<sect2>bash
<p>

By default, GNU bash assumes that every character is one byte long and one
column wide. A patch for bash 2.04, by Marcin 'Qrczak' Kowalczyk and
Ricardas Cepas, teaches bash about multibyte characters in UTF-8 encoding.
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/bash-2.04-diff"
name="bash-2.04-diff">

Double-width characters, combining characters and bidi are not supported by
this patch. It seems a complete redesign of the readline redisplay engine is
needed.
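
Why the one-byte-per-character assumption breaks down can be seen with a
short experiment, given here as a sketch. It assumes a "wc" that supports
the "-m" option and, for the second command, a shell running in a UTF-8
locale.
<tscreen><verb>
#!/bin/sh
# U+00E4 (a with diaeresis) is encoded in UTF-8 as the bytes 0xC3 0xA4.
# "wc -c" counts bytes, so it reports two for this single character.
printf '\303\244' | wc -c
# "wc -m" counts characters according to the current locale.
printf '\303\244' | wc -m
</verb></tscreen>
A line editor that counts bytes where it should count characters will
misplace the cursor in just the same way.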

<sect1>Networking
<p>

<!--
<sect2>rlogin
<p>

is fine with the above mentioned patches.
-->

<sect2>telnet
<p>

In some installations, telnet is not 8-bit clean by default.
In order to be able to send Unicode keystrokes to the remote host, you need to
set telnet into "outbinary" mode.
There are two ways to do this:
<tscreen><verb>
$ telnet -L <host>
</verb></tscreen>
and
<tscreen><verb>
$ telnet
telnet> set outbinary
telnet> open <host>
</verb></tscreen>

<!--
Additionally, use the above mentioned patches.
-->

<sect2>kermit
<p>

The communications program C-Kermit
<htmlurl url="http://www.columbia.edu/kermit/ckermit.html"
name="http://www.columbia.edu/kermit/ckermit.html">,
(an interactive tool for connection setup, telnet, file transfer,
with support for TCP/IP and serial lines),
in versions 7.0 or newer, understands the file and transfer encodings
UTF-8 and UCS-2, and understands the terminal encoding UTF-8, and converts
between these encodings and many others. Documentation of these features
can be found in
<htmlurl url="http://www.columbia.edu/kermit/ckermit2.html#x6.6"
name="http://www.columbia.edu/kermit/ckermit2.html#x6.6">.

<sect1>Browsers
<p>

<sect2>Netscape
<p>

Netscape 4.05 or newer can display HTML documents in UTF-8 encoding. All a
document needs is the following line between the
<head> and </head> tags:
<tscreen><verb>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</verb></tscreen>

Netscape 4.05 or newer can also display HTML and text files in UCS-2
encoding with byte-order mark.

<htmlurl url="http://www.netscape.com/computing/download/"
name="http://www.netscape.com/computing/download/">
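
Whether a UCS-2 file carries the byte-order mark mentioned above can be
checked by looking at its first two bytes. The following is a minimal
sketch that assumes the file is meant to be UCS-2; the function name
"bom_type" is an assumption made for illustration.
<tscreen><verb>
#!/bin/sh
# Report the byte order of a UCS-2 file from its byte-order mark (U+FEFF).
# "od -N2" dumps only the first two bytes of the file, in hexadecimal.
bom_type () {
    case $(od -An -tx1 -N2 "$1" | tr -d ' \n') in
        feff) echo "big-endian" ;;
        fffe) echo "little-endian" ;;
        *)    echo "no byte-order mark" ;;
    esac
}
</verb></tscreen>
A file for which this reports "no byte-order mark" would not be recognized
as UCS-2 by the mechanism described above.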

<sect2>Mozilla
<p>

Mozilla milestone M16 has much better internationalization than Netscape 4.
It can display HTML documents in UTF-8 encoding with support for more
languages. Alas, there is a cosmetic problem with CJK fonts: some glyphs
can be bigger than the line's height, thus overlapping the previous or next
line.

<htmlurl url="http://www.mozilla.org/"
name="http://www.mozilla.org/">

<sect2>Amaya
<p>

Amaya 4.2.1
(<htmlurl url="http://www.w3.org/Amaya/"
name="http://www.w3.org/Amaya/">,
<htmlurl url="http://www.w3.org/Amaya/User/SourceDist"
name="http://www.w3.org/Amaya/User/SourceDist">)
now has limited handling of UTF-8 encoded HTML pages. It
recognizes the encoding, but it displays only ISO-8859-1 and symbol
characters; it only ever accesses the fonts
<tscreen><verb>
-adobe-times-*-iso8859-1
-adobe-helvetica-*-iso8859-1
-adobe-new century schoolbook-*-iso8859-1
-adobe-courier-*-iso8859-1
-adobe-symbol-*-adobe-fontspecific
</verb></tscreen>

Amaya is in fact an HTML editor, not only a browser. Amaya's strengths among
the browsers are its speed, given enough memory, and its rendering
of mathematical formulas (MathML support).

<sect2>lynx
<p>

lynx-2.8 has an options screen (key 'O') which lets you set the display
character set. When running in an xterm or Linux console in UTF-8 mode,
set this to "UNICODE UTF-8". Note that for this setting to take effect
in the current browser session, you have to confirm on the "Accept Changes"
field, and for this setting to take effect in future browser sessions, you
have to enable the "Save options to disk" field and then confirm it on
the "Accept Changes" field.

Now, again, all a document needs is the following line between the
<head> and </head> tags:
<tscreen><verb>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</verb></tscreen>

When you are viewing text files in UTF-8 encoding, you also need to
pass the command-line option "-assume_local_charset=UTF-8" (affects only
file:/... URLs) or "-assume_charset=UTF-8" (affects all URLs).
In lynx-2.8.2 you can alternatively, in the options screen (key 'O'),
change the assumed document character set to "utf-8".

There is also an option in the options screen to set the "preferred document
character set". But it has no effect, at least with file:/... URLs
and with http://... URLs served by apache-1.3.0.

There is a spacing and line-breaking problem, however. (Look at the
Russian section of x-utf8.html, or at utf-8-demo.txt.)

Also, in lynx-2.8.2, configured with --enable-prettysrc, the nice colour
scheme does not work correctly any more when the display character set
has been set to "UNICODE UTF-8". This is fixed by a simple patch
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/lynx282.diff" name="lynx282.diff">.

The Lynx developers say: "For any serious use of UTF-8 screen output with
lynx, compiling with slang lib and -DSLANG_MBCS_HACK is still recommended."

Latest stable release:
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/lynx/lynx-2.8.2.tar.gz"
name="ftp://ftp.gnu.org/pub/gnu/lynx/lynx-2.8.2.tar.gz">

<htmlurl url="http://lynx.isc.org/"
name="http://lynx.isc.org/">

General home page:
<htmlurl url="http://lynx.browser.org/"
name="http://lynx.browser.org/">

<htmlurl url="http://www.slcc.edu/lynx/"
name="http://www.slcc.edu/lynx/">

Newer development snapshots:
<htmlurl url="http://lynx.isc.org/current/"
name="http://lynx.isc.org/current/">,
<htmlurl url="ftp://lynx.isc.org/current/"
name="ftp://lynx.isc.org/current/">

<sect2>w3m
<p>

w3m by Akinori Ito
<htmlurl url="http://ei5nazha.yz.yamagata-u.ac.jp/~aito/w3m/eng/"
name="http://ei5nazha.yz.yamagata-u.ac.jp/~aito/w3m/eng/">
is a text mode browser for HTML pages and plain-text files.
Its layout of HTML tables, enumerations etc. is much prettier than that of lynx.
w3m can also be used as a high quality HTML to plain text converter.

w3m 0.1.10 has command line options for the three major Japanese encodings, but
can also be used for UTF-8 encoded files. Without command line options,
you often have to press Ctrl-L to refresh the display, and line breaking
in Cyrillic and CJK paragraphs is not good.

To fix this, Hironori Sakamoto has a patch
<htmlurl url="http://www2u.biglobe.ne.jp/~hsaka/w3m/"
name="http://www2u.biglobe.ne.jp/~hsaka/w3m/">
which adds UTF-8 as display encoding.

<sect2>Test pages
<p>

Some test pages for browsers can be found at the pages of Alan Wood
<htmlurl url="http://www.hclrss.demon.co.uk/unicode/#links"
name="http://www.hclrss.demon.co.uk/unicode/#links">
and James Kass
<htmlurl url="http://home.att.net/~jameskass/"
name="http://home.att.net/~jameskass/">.

<sect1>Editors
<p>

<sect2>yudit
<p>

yudit by Gáspár Sinai
<htmlurl url="http://www.yudit.org/"
name="http://www.yudit.org/">
is a first-class Unicode text editor for the X Window System.
It supports simultaneous processing of many languages, input methods,
and conversions for local character standards.
It has facilities for entering text in all languages with only
an English keyboard, using keyboard configuration maps.

<sect3>yudit-1.5
<p>

It can be compiled in three versions: Xlib GUI, KDE GUI, or Motif GUI.

Customization is very easy. Typically you will first customize your font.
From the font menu I chose "Unicode". Then, since the command
"xlsfonts '*-*-iso10646-1'" still showed some ambiguity, I chose a font
size of 13 (to match Markus Kuhn's 13-pixel fixed font).

Next, you will customize your input method. The input methods "Straight",
"Unicode" and "SGML" are the most remarkable. For details about the other
built-in input methods, look in /usr/local/share/yudit/data/.

To change the default for the next session, edit your $HOME/.yuditrc
file.

The general editor functionality is limited to editing, cut&paste
and search&replace. There is no undo.

<sect3>yudit-2.1
<p>

This version is less easy to learn, because it comes with a home-brewed
GUI and no easily accessible help. But it has undo functionality and
should therefore be more usable than version 1.5.

<sect3>Fonts for yudit
<p>

yudit can display text using a TrueType font; see section "TrueType fonts"
above. The Bitstream Cyberbit font gives good results. For yudit to find the
font, symlink it to <tt>/usr/local/share/yudit/data/cyberbit.ttf</tt>.

<sect2>vim
<p>

vim (as of version 6.0r) has good support for UTF-8: when started in a
UTF-8 locale, it assumes UTF-8 encoding for the console and the text files
being edited. It also supports double-width (CJK) characters and
combining characters, and therefore fits perfectly into a UTF-8 enabled
xterm.

Installation: Download from
<htmlurl url="http://www.vim.org/"
name="http://www.vim.org/">.
After unpacking the four parts, call <tt>./configure</tt> with
<tt>--with-features=big</tt> <tt>--enable-multibyte</tt> arguments
(or edit src/Makefile to include the <tt>--with-features=big</tt> and
<tt>--enable-multibyte</tt> options). This will turn on the feature
FEAT_MBYTE. Then do "make" and "make install".

vim can be used to edit files in other encodings. For example, to edit
a BIG5 encoded file: <tt>:e ++cc=BIG5 filename</tt>. All encoding names
supported by iconv are accepted. In addition, vim automatically distinguishes
UTF-8 and ISO-8859-1 files without needing any command line option.

<sect2>cooledit
<p>

cooledit by Paul Sheer
<htmlurl url="http://www.cooledit.org/"
name="http://www.cooledit.org/">
is a good text editor for the X Window System. Since version 3.15, it has
support for Unicode, including Bidi for Hebrew (but not Arabic).

A build error message about a missing "vga_setpage" function is
worked around by adding "-DDO_NOT_USE_VGALIB" to the CFLAGS.

To view UTF-8 files in a UTF-8 locale you have to modify a setting in
the "Options -> Switches" panel: Enable the checkbox "Display characters
outside locale". I also found it necessary to disable "Spellcheck as you
type".

For viewing texts with both European and CJK characters, cooledit needs a
font which contains both, for example the GNU unifont (see section
"X11 Unicode fonts"). Start it once with
<tscreen><verb>
$ cooledit -fn -gnu-unifont-medium-r-normal--16-160-75-75-c-80-iso10646-1
</verb></tscreen>
cooledit will then use this font in all future invocations.

Unfortunately, the only characters that can be entered through the keyboard
are ISO-8859-1 characters and, through a cooledit specific compose mechanism,
ISO-8859-2 characters. Inputting arbitrary Unicode characters in cooledit is
possible, but a bit tedious.

<sect2>emacs
<p>

First of all, you should read the section "International Character Set Support"
(node "International") in the Emacs manual. In particular, note that you need
to start Emacs using the command
<tscreen><verb>
$ emacs -fn fontset-standard
</verb></tscreen>
so that it will use a font set comprising a lot of international characters.

In the short term, there are two packages for using UTF-8 in Emacs. Neither
of them requires recompiling Emacs.
<itemize>
<item>
The emacs-utf package
<htmlurl url="http://www.cs.ust.hk/faculty/otfried/Mule/"
name="http://www.cs.ust.hk/faculty/otfried/Mule/">
by Otfried Cheong provides a "unicode-utf8" encoding to Emacs.
<item>
The oc-unicode package
<htmlurl url="http://www.cs.ust.hk/faculty/otfried/Mule/"
name="http://www.cs.ust.hk/faculty/otfried/Mule/">,
by Otfried Cheong, an extension of the Mule-UCS package
<htmlurl url="ftp://etlport.etl.go.jp/pub/mule/Mule-UCS/Mule-UCS-0.70.tar.gz"
name="ftp://etlport.etl.go.jp/pub/mule/Mule-UCS/Mule-UCS-0.70.tar.gz">
(mirrored at
<htmlurl url="http://riksun.riken.go.jp/archives/misc/mule/Mule-UCS/Mule-UCS-0.70.tar.gz"
name="http://riksun.riken.go.jp/archives/misc/mule/Mule-UCS/Mule-UCS-0.70.tar.gz">
and
<htmlurl url="ftp://ftp.m17n.org/pub/mule/Mule-UCS/Mule-UCS-0.70.tar.gz"
name="ftp://ftp.m17n.org/pub/mule/Mule-UCS/Mule-UCS-0.70.tar.gz">)
by Hisashi Miyashita, provides a "utf-8" encoding to Emacs.
</itemize>

You can use either of these packages, or both together. The advantages
of the emacs-utf "unicode-utf8" encoding are: it loads faster, and it deals
better with combining characters (important for Thai).
The advantages of the Mule-UCS / oc-unicode "utf-8" encoding are: it can apply
to a process buffer (such as M-x shell), not only to loading and saving of
files; and it respects the widths of characters better (important for
Ethiopian). However, it is less reliable: After heavy editing of a file, I
have seen some Unicode characters replaced with U+FFFD after the file was
saved. (But maybe those were bugs in Emacs 20.5 and 20.6 which are fixed in
Emacs 20.7.)

To install the emacs-utf package, compile the program "utf2mule" and install
it somewhere in your $PATH, and also install unicode.el, muleuni-1.el and
unicode-char.el somewhere. Then add the lines
<tscreen><verb>
(setq load-path (cons "/home/user/somewhere/emacs" load-path))
(if (not (string-match "XEmacs" emacs-version))
  (progn
    (require 'unicode)
    ;(setq unicode-data-path "..../UnicodeData-3.0.0.txt")
    (if (eq window-system 'x)
      (progn
        (setq fontset12
          (create-fontset-from-fontset-spec
            "-misc-fixed-medium-r-normal-*-12-*-*-*-*-*-fontset-standard"))
        (setq fontset13
          (create-fontset-from-fontset-spec
            "-misc-fixed-medium-r-normal-*-13-*-*-*-*-*-fontset-standard"))
        (setq fontset14
          (create-fontset-from-fontset-spec
            "-misc-fixed-medium-r-normal-*-14-*-*-*-*-*-fontset-standard"))
        (setq fontset15
          (create-fontset-from-fontset-spec
            "-misc-fixed-medium-r-normal-*-15-*-*-*-*-*-fontset-standard"))
        (setq fontset16
          (create-fontset-from-fontset-spec
            "-misc-fixed-medium-r-normal-*-16-*-*-*-*-*-fontset-standard"))
        (setq fontset18
          (create-fontset-from-fontset-spec
            "-misc-fixed-medium-r-normal-*-18-*-*-*-*-*-fontset-standard"))
        ; (set-default-font fontset15)
        ))))
</verb></tscreen>
to your $HOME/.emacs file. To activate any of the font sets, use the Mule
menu item "Set Font/FontSet" or Shift-down-mouse-1. The Unicode coverage
of the font sets at different sizes may depend on the installed fonts;
here are screen shots at various sizes of UTF-8-demo.txt (
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-12.gif"
name="12">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-13.gif"
name="13">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-14.gif"
name="14">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-15.gif"
name="15">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-16.gif"
name="16">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-18.gif"
name="18">)
and of the Mule script examples (
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-12.gif"
name="12">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-13.gif"
name="13">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-14.gif"
name="14">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-15.gif"
name="15">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-16.gif"
name="16">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-18.gif"
name="18">).
To designate a font set as the initial font set for the first frame at startup,
uncomment the <tt>set-default-font</tt> line in the code snippet above.

To install the oc-unicode package, execute the command
<tscreen><verb>
$ emacs -batch -l oc-comp.el
</verb></tscreen>
and install the resulting file <tt>un-define.elc</tt>, as well as
<tt>oc-unicode.el</tt>, <tt>oc-charsets.el</tt>, <tt>oc-tools.el</tt>,
somewhere. Then add the lines
<tscreen><verb>
(setq load-path (cons "/home/user/somewhere/emacs" load-path))
(if (not (string-match "XEmacs" emacs-version))
  (progn
    (require 'oc-unicode)
    ;(setq unicode-data-path "..../UnicodeData-3.0.0.txt")
    (if (eq window-system 'x)
      (progn
        (setq fontset12
          (oc-create-fontset
            "-misc-fixed-medium-r-normal-*-12-*-*-*-*-*-fontset-standard"
            "-misc-fixed-medium-r-normal-ja-12-*-iso10646-*"))
        (setq fontset13
          (oc-create-fontset
            "-misc-fixed-medium-r-normal-*-13-*-*-*-*-*-fontset-standard"
            "-misc-fixed-medium-r-normal-ja-13-*-iso10646-*"))
        (setq fontset14
          (oc-create-fontset
            "-misc-fixed-medium-r-normal-*-14-*-*-*-*-*-fontset-standard"
            "-misc-fixed-medium-r-normal-ja-14-*-iso10646-*"))
        (setq fontset15
          (oc-create-fontset
            "-misc-fixed-medium-r-normal-*-15-*-*-*-*-*-fontset-standard"
            "-misc-fixed-medium-r-normal-ja-15-*-iso10646-*"))
        (setq fontset16
          (oc-create-fontset
            "-misc-fixed-medium-r-normal-*-16-*-*-*-*-*-fontset-standard"
            "-misc-fixed-medium-r-normal-ja-16-*-iso10646-*"))
        (setq fontset18
          (oc-create-fontset
            "-misc-fixed-medium-r-normal-*-18-*-*-*-*-*-fontset-standard"
            "-misc-fixed-medium-r-normal-ja-18-*-iso10646-*"))
        ; (set-default-font fontset15)
        ))))
</verb></tscreen>
to your $HOME/.emacs file. You can choose your appropriate font set as with
the emacs-utf package.

In order to open a UTF-8 encoded file, you will type
<tscreen><verb>
M-x universal-coding-system-argument unicode-utf8 RET
M-x find-file filename RET
</verb></tscreen>
or
<tscreen><verb>
C-x RET c unicode-utf8 RET
C-x C-f filename RET
</verb></tscreen>
(or utf-8 instead of unicode-utf8, if you prefer oc-unicode/Mule-UCS).

In order to start a shell buffer with UTF-8 I/O, you will type
<tscreen><verb>
M-x universal-coding-system-argument utf-8 RET
M-x shell RET
</verb></tscreen>
(This works with oc-unicode/Mule-UCS only.)

There is a newer version, Mule-UCS-0.81. Unfortunately you need to rebuild emacs
from source in order to use it.

Note that all this works with Emacs 20 in windowing mode only, not in terminal
mode. None of the mentioned packages works in Emacs 21, as of this writing.

Richard Stallman plans to add integrated UTF-8 support to Emacs in the long
term, and so do the XEmacs developers.
|
||
|
||
<sect2>xemacs
|
||
<p>
|
||
|
||
(This section is written by Gilbert Baumann.)
|
||
|
||
Here is how to teach XEmacs (20.4 configured with MULE) the UTF-8 encoding.
|
||
Unfortunately you need its sources to be able to patch it.
|
||
|
||
First you need these files provided by Tomohiko Morioka:
|
||
|
||
<htmlurl url="http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-21.0-b55-emc-b55-ucs.diff"
|
||
name="http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-21.0-b55-emc-b55-ucs.diff">
|
||
and
|
||
<htmlurl url="http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-ucs-conv-0.1.tar.gz"
|
||
name="http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-ucs-conv-0.1.tar.gz">
|
||
|
||
The .diff is a diff against the C sources. The tar ball is elisp code,
|
||
which provides lots of code tables to map to and from Unicode. As the
|
||
name of the diff file suggests it is against XEmacs-21; I needed to
|
||
help `patch' a bit. The most notable difference to my XEmacs-20.4
|
||
sources is that file-coding.[ch] was called mule-coding.[ch].
|
||
|
||
For those unfamiliar with the XEmacs-MULE stuff (as I am) a quick
|
||
guide:
|
||
|
||
What we call an encoding is called by MULE a `coding-system'. The most
|
||
important commands are:
|
||
|
||
<tscreen><verb>
|
||
M-x set-file-coding-system
|
||
M-x set-buffer-process-coding-system [comint buffers]
|
||
</verb></tscreen>
|
||
|
||
and the variable `file-coding-system-alist', which guides `find-file'
|
||
to guess the encoding used. After stuff was running, the very first
|
||
thing I did was <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/gb-hacks.el" name="this">.
|
||
|
||
This code looks into the special mode line introduced by -*- somewhere
|
||
in the first 600 bytes of the file about to be opened; if there is a
|
||
field "Encoding: xyz;" and the xyz encoding ("coding system" in Emacs speak)
|
||
exists, choose that. So now you could do e.g.
|
||
|
||
<tscreen><verb>
|
||
;;; -*- Mode: Lisp; Syntax: Common-Lisp; Package: CLEX; Encoding: utf-8; -*-
|
||
</verb></tscreen>
|
||
|
||
and XEmacs goes into utf-8 mode here.
|
||
|
||
After everything was running I defined \u03BB (greek lambda) as a
|
||
macro like:
|
||
|
||
<tscreen><verb>
|
||
(defmacro \u03BB (x) `(lambda .,x))
|
||
</verb></tscreen>
|
||
|
||
<sect2>nedit
|
||
<p>
|
||
|
||
<sect2>xedit
|
||
<p>
|
||
|
||
With XFree86-4.0.1, xedit is able to edit UTF-8 files if you set the locale
|
||
accordingly (see above), and add the line "Xedit*international: true" to
|
||
your $HOME/.Xdefaults file.
|
||
|
||
<sect2>axe
|
||
<p>
|
||
|
||
As of version 6.1.2, aXe supports only 8-bit locales. If you add the line
|
||
"Axe*international: true" to your $HOME/.Xdefaults file, it will simply dump
|
||
core.
|
||
|
||
<sect2>pico
|
||
<p>
|
||
|
||
As of version 4.30, pine cannot be reasonably used to view or edit UTF-8
|
||
files. In an UTF-8 enabled xterm, it has severe redraw problems.
|
||
|
||
<sect2>mined98
|
||
<p>
|
||
|
||
mined98 is a small text editor by Michiel Huisjes, Achim Müller and
|
||
Thomas Wolff.
|
||
<htmlurl url="http://www.inf.fu-berlin.de/~wolff/mined98.tar.gz"
|
||
name="http://www.inf.fu-berlin.de/~wolff/mined98.tar.gz">
|
||
It lets you edit UTF-8 or 8-bit encoded files, in an UTF-8 or 8-bit xterm.
|
||
It also has powerful capabilities for entering Unicode characters.
|
||
|
||
mined lets you edit both 8-bit encoded and UTF-8 encoded files. By default
|
||
it uses an autodetection heuristic. If you don't want to rely on heuristics,
|
||
pass the command-line option <tt>-u</tt> when editing an UTF-8 file, or
|
||
<tt>+u</tt> when editing an 8-bit encoded file. You can change the
|
||
interpretation at any time from within the editor: It displays the encoding
|
||
("L:h" for 8-bit, "U:h" for UTF-8) in the menu line. Click on the first
|
||
of these characters to change it.
|
||
|
||
mined knows about double-width and combining characters and displays them
|
||
correctly. It also has a special display mode for combining characters.
|
||
|
||
mined also has a scrollbar and very nice pull-down menus. Alas, the "Home",
|
||
"End", "Delete" keys do not work.
|
||
|
||
<sect2>qemacs
|
||
<p>
|
||
|
||
qemacs 0.2 is a small text editor by Fabrice Bellard
|
||
<htmlurl url="http://www-stud.enst.fr/~bellard/qemacs/"
|
||
name="http://www-stud.enst.fr/~bellard/qemacs/">
|
||
with Emacs keybindings. It runs in an UTF-8 console or xterm, and can edit
|
||
both 8-bit encoded and UTF-8 encoded files. It still has a few rough edges,
|
||
but further development is underway.
|
||
|
||
<sect1>Mailers
|
||
<p>
|
||
|
||
MIME: RFC 2279 defines UTF-8 as a MIME charset, which can be transported
|
||
under the 8bit, quoted-printable and base64 encodings. The older MIME
|
||
UTF-7 proposal (RFC 2152) is considered to be deprecated and should not
|
||
be used any further.
|
||
|
||
Mail clients released after January 1, 1999, should be capable of sending and
|
||
displaying UTF-8 encoded mails, otherwise they are considered deficient.
|
||
But these mails have to carry the MIME labels
|
||
<tscreen><verb>
|
||
Content-Type: text/plain; charset=UTF-8
|
||
Content-Transfer-Encoding: 8bit
|
||
</verb></tscreen>
|
||
Simply piping an UTF-8 file into "mail" without caring about the MIME labels
|
||
will not work.
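
For a program that generates such mails itself, the labels can be written
in front of the body before the message is handed to the mail transport.
The following is only a minimal sketch: the function name
<tt>write_utf8_mime_headers</tt> is made up for illustration, and the
<tt>MIME-Version</tt> line is added because the MIME standard requires it
in every MIME message.
<tscreen><verb>
#include <stdio.h>

/* Hypothetical helper: write the MIME labels for an unencoded UTF-8 text
   body to OUT, followed by the empty line that ends the header section.
   Returns 0 on success, -1 on a write error. */
int write_utf8_mime_headers(FILE *out)
{
    if (fputs("MIME-Version: 1.0\n"
              "Content-Type: text/plain; charset=UTF-8\n"
              "Content-Transfer-Encoding: 8bit\n"
              "\n", out) == EOF)
        return -1;
    return 0;
}
</verb></tscreen>
The caller is assumed to write the remaining headers (To, Subject and so on)
before calling this function, and the UTF-8 encoded body after it.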
|
||
|
||
Mail client implementors should take a look at
|
||
<htmlurl url="http://www.imc.org/imc-intl/"
|
||
name="http://www.imc.org/imc-intl/">
|
||
and
|
||
<htmlurl url="http://www.imc.org/mail-i18n.html"
|
||
name="http://www.imc.org/mail-i18n.html">.
|
||
|
||
Now about the individual mail clients (or "mail user agents"):
|
||
|
||
<sect2>pine
|
||
<p>
|
||
|
||
The situation for an unpatched pine version 4.30 is as follows.
|
||
|
||
Pine does not do character set conversions. But it allows you to view
|
||
UTF-8 mails in an UTF-8 text window (Linux console or xterm).
|
||
|
||
Normally, Pine will warn about different character sets each time you view
|
||
an UTF-8 encoded mail. To get rid of this warning, choose S (setup), then
|
||
C (config), then change the value of "character-set" to UTF-8. This option
|
||
will not do anything, except to reduce the warnings, as Pine has no built-in
|
||
knowledge of UTF-8.
|
||
|
||
Also note that Pine's notion of Unicode characters is pretty limited: It
|
||
will display Latin and Greek characters, but not other kinds of Unicode
|
||
characters.
|
||
|
||
A patch by Robert Brady
|
||
<htmlurl url="mailto:robert@suse.co.uk"
|
||
name="<robert@suse.co.uk>">
|
||
<htmlurl url="http://www.ents.susu.soton.ac.uk/~robert/pine-utf8-0.1.diff"
|
||
name="http://www.ents.susu.soton.ac.uk/~robert/pine-utf8-0.1.diff">
|
||
adds UTF-8 support to Pine. With this patch, it decodes and prints headers
|
||
and bodies properly. The patch depends on the GNOME libunicode
|
||
<htmlurl url="http://cvs.gnome.org/lxr/source/libunicode/"
|
||
name="http://cvs.gnome.org/lxr/source/libunicode/">.
|
||
|
||
However, alignment remains broken in many places; replying to a mail does
|
||
not cause the character set to be converted as appropriate; and the editor,
|
||
pico, cannot deal with multibyte characters.
|
||
|
||
<sect2>kmail
|
||
<p>
|
||
|
||
kmail (as of KDE 1.0) does not support UTF-8 mails at all.
|
||
|
||
<sect2>Netscape Communicator
|
||
<p>
|
||
|
||
Netscape Communicator's Messenger can send and display mails in UTF-8
|
||
encoding, but it needs a little bit of manual user intervention.
|
||
|
||
To send an UTF-8 encoded mail: After opening the "Compose" window, but before
|
||
starting to compose the message, select from the menu
|
||
"View -> Character Set -> Unicode (UTF-8)". Then compose the message and
|
||
send it.
|
||
|
||
When you receive an UTF-8 encoded mail, Netscape unfortunately does not
|
||
display it in UTF-8 right away, and does not even give a visual clue that
|
||
the mail was encoded in UTF-8. You have to manually select from the menu
|
||
"View -> Character Set -> Unicode (UTF-8)".
|
||
|
||
For displaying UTF-8 mails, Netscape uses different fonts. You can adjust
|
||
your font settings in the "Edit -> Preferences -> Fonts" dialog; choose
|
||
the "Unicode" font category.
|
||
|
||
<sect2>emacs (rmail, vm)
|
||
<p>
|
||
|
||
<sect2>mutt
|
||
<p>
|
||
|
||
mutt-1.2.x, as available from
|
||
<htmlurl url="http://www.mutt.org/"
|
||
name="http://www.mutt.org/">,
|
||
has only rudimentary support for UTF-8: it can convert
|
||
from UTF-8 into an 8-bit display charset. The mutt-1.3.x
|
||
development branch also supports UTF-8 as the display charset,
|
||
so you can run Mutt in an UTF-8 xterm, and has thorough support
|
||
for MIME and charset conversion (relying on iconv).
|
||
|
||
<sect2>exmh
|
||
<p>
|
||
|
||
exmh 2.1.2 with Tk 8.4a1 can recognize and correctly display UTF-8 mails
|
||
(without CJK characters) if you add the following lines to your
|
||
<tt>$HOME/.Xdefaults</tt> file.
|
||
<tscreen><verb>
|
||
!
|
||
! Exmh
|
||
!
|
||
exmh.mimeUCharsets: utf-8
|
||
exmh.mime_utf-8_registry: iso10646
|
||
exmh.mime_utf-8_encoding: 1
|
||
exmh.mime_utf-8_plain_families: fixed
|
||
exmh.mime_utf-8_fixed_families: fixed
|
||
exmh.mime_utf-8_proportional_families: fixed
|
||
exmh.mime_utf-8_title_families: fixed
|
||
</verb></tscreen>
|
||
|
||
<sect1>Text processing
|
||
<p>
|
||
|
||
<sect2>groff
|
||
<p>
|
||
|
||
groff 1.16.1, the GNU implementation of the traditional Unix text processing
|
||
system troff/nroff, can output UTF-8 formatted text. Simply use
|
||
`<tt>groff -Tutf8</tt>' instead of `<tt>groff -Tlatin1</tt>' or
|
||
`<tt>groff -Tascii</tt>'.
|
||
|
||
<sect2>TeX
|
||
<p>
|
||
|
||
The teTeX 0.9 (and newer) distribution contains an Unicode adaptation of TeX,
|
||
called Omega
|
||
(<htmlurl url="http://www.gutenberg.eu.org/omega/"
|
||
name="http://www.gutenberg.eu.org/omega/">,
|
||
<htmlurl url="ftp://ftp.ens.fr/pub/tex/yannis/omega"
|
||
name="ftp://ftp.ens.fr/pub/tex/yannis/omega">).
|
||
Together with the unicode.tex file contained in
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/utf8-tex-0.1.tar.gz"
|
||
name="utf8-tex-0.1.tar.gz">
|
||
it enables you to use UTF-8 encoded sources as input for TeX. About a thousand
|
||
Unicode characters are currently supported.
|
||
|
||
All that changes is that you run `omega' (instead of `tex') or `lambda'
|
||
(instead of `latex'), and insert the following lines at the head of
|
||
your source input.
|
||
<tscreen><verb>
|
||
\ocp\TexUTF=inutf8
|
||
\InputTranslation currentfile \TexUTF
|
||
</verb></tscreen>
|
||
<tscreen><verb>
|
||
\input unicode
|
||
</verb></tscreen>
|
||
|
||
Other possibly related links:
|
||
<htmlurl url="http://www.dante.de/projekte/nts/NTS-FAQ.html"
|
||
name="http://www.dante.de/projekte/nts/NTS-FAQ.html">,
|
||
<htmlurl url="ftp://ftp.dante.de/pub/tex/language/chinese/CJK/"
|
||
name="ftp://ftp.dante.de/pub/tex/language/chinese/CJK/">.
|
||
|
||
<sect1>Databases
|
||
<p>
|
||
|
||
<sect2>PostgreSQL
|
||
<p>
|
||
|
||
PostgreSQL 6.4 or newer can be built with the configuration option
|
||
<tt>--with-mb=UNICODE</tt>.
|
||
|
||
<sect2>Interbase
|
||
<p>
|
||
|
||
Borland/Inprise's Interbase 6.0 can store string fields in UTF-8 format
|
||
if the option "CHARACTER SET UNICODE_FSS" is given.
|
||
|
||
<sect1>Other text-mode applications
|
||
<p>
|
||
|
||
<sect2>less
|
||
<p>
|
||
|
||
With
|
||
<htmlurl url="http://www.flash.net/~marknu/less/less-358.tar.gz"
|
||
name="http://www.flash.net/~marknu/less/less-358.tar.gz">
|
||
you can browse UTF-8 encoded text files in an UTF-8 xterm or console.
|
||
Make sure that the environment variable LESSCHARSET is not set (or is set
|
||
to utf-8). If you also have a LESSKEY environment variable set, also make
|
||
sure that the file it points to does not define LESSCHARSET. If necessary,
|
||
regenerate this file using the `lesskey' command, or unset the LESSKEY
|
||
environment variable.
|
||
|
||
<sect2>lv
|
||
<p>
|
||
|
||
lv-4.49.3 by Tomio Narita
|
||
<htmlurl url="http://www.ff.iij4u.or.jp/~nrt/lv/"
|
||
name="http://www.ff.iij4u.or.jp/~nrt/lv/">
|
||
is a file viewer with builtin character set converters. To view UTF-8 files
|
||
in an UTF-8 console, use "lv -Au8". But it can also be used to view
|
||
files in other CJK encodings in an UTF-8 console.
|
||
|
||
There is a small glitch: lv turns off xterm's cursor and doesn't turn it on
|
||
again.
|
||
|
||
<sect2>expand
|
||
<p>
|
||
|
||
Get the GNU textutils-2.0 and apply the patch
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/textutils-2.0.diff"
|
||
name="textutils-2.0.diff">,
|
||
then configure, add "#define HAVE_FGETWC 1", "#define HAVE_FPUTWC 1" to
|
||
config.h. Then rebuild.
|
||
|
||
<sect2>col, colcrt, colrm, column, rev, ul
|
||
<p>
|
||
|
||
Get the util-linux-2.9y package, configure it, then define ENABLE_WIDECHAR in
|
||
defines.h, change the "#if 0" to "#if 1" in lib/widechar.h. In
|
||
text-utils/Makefile, modify CFLAGS and LDFLAGS so that they include the
|
||
directories where libutf8 is installed. Then rebuild.
|
||
|
||
<sect2>figlet
|
||
<p>
|
||
|
||
figlet 2.2 has an option for UTF-8 input: "figlet -C utf8".
|
||
|
||
<sect2>Base utilities
|
||
<p>
|
||
|
||
The Li18nux list of commands and utilities that ought to be made interoperable
|
||
with UTF-8 is as follows. Useful information needs to get added here; I just
|
||
didn't get around to it yet :-)
|
||
|
||
As of glibc-2.2, regular expressions only work for 8-bit characters.
|
||
In an UTF-8 locale, regular expressions that contain non-ASCII characters
|
||
or that expect to match a single multibyte character with "." do not work.
|
||
This affects all commands and utilities listed below.
|
||
<!-- In particular:
|
||
- ed, vi, emacs, use regular expressions for search/replace commands,
|
||
- less, more, use regular expressions for searching,
|
||
- csplit, uses regular expressions,
|
||
- diff, uses regular expressions for its -I and -F options,
|
||
- expr, uses regular expressions for its ":" operator,
|
||
- find, uses regular expressions for its -ok, -regex and -iregex operations,
|
||
- m4, uses regular expressions for its regexp builtin function,
|
||
- nl, uses regular expressions for types starting with p,
|
||
- tac, uses regular expressions for its -r option.
|
||
-->
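
To find out whether a given C library has this limitation, one can test
whether "." matches a single multibyte character. The following is a minimal
sketch using the POSIX <tt>regcomp</tt> and <tt>regexec</tt> functions. The
function name <tt>dot_matches_whole</tt> is made up for illustration, and the
caller is assumed to have selected an UTF-8 locale through
<tt>setlocale(LC_ALL,"")</tt> beforehand.
<tscreen><verb>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the regular expression "^.$" matches
   the whole string S in the current locale, 0 if it does not match, and
   -1 if the expression could not be compiled. */
int dot_matches_whole(const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, "^.$", REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);  /* 0 means "matched" */
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
</verb></tscreen>
Called in an UTF-8 locale on a string that consists of one non-ASCII
character, a result of 0 indicates the limitation described above.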
|
||
|
||
<descrip>
|
||
<tag>alias</tag>
|
||
No info available yet.
|
||
<tag>ar</tag>
|
||
No info available yet.
|
||
<tag>arch</tag>
|
||
No info available yet.
|
||
<tag>arp</tag>
|
||
No info available yet.
|
||
<tag>at</tag>
|
||
As of at-3.1.8: The two uses of isalnum in at.c are invalid and should be
|
||
replaced with a use of quotearg.c or an exclude list of the (fixed) list
|
||
of shell metacharacters. The two uses of %8s in at.c and atd.c are invalid
|
||
and should become arbitrary length.
|
||
<tag>awk</tag>
|
||
No info available yet.
|
||
<tag>basename</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>batch</tag>
|
||
No info available yet.
|
||
<tag>bc</tag>
|
||
No info available yet.
|
||
<tag>bg</tag>
|
||
No info available yet.
|
||
<tag>bunzip2</tag>
|
||
No info available yet.
|
||
<tag>bzip2</tag>
|
||
No info available yet.
|
||
<tag>bzip2recover</tag>
|
||
No info available yet.
|
||
<tag>cal</tag>
|
||
No info available yet.
|
||
<!--
|
||
<tag>cancel(LEGACY)</tag>
|
||
No info available yet.
|
||
-->
|
||
<tag>cat</tag>
|
||
No info available yet.
|
||
<tag>cd</tag>
|
||
No info available yet.
|
||
<tag>cflow</tag>
|
||
No info available yet.
|
||
<tag>chgrp</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>chmod</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>chown</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>chroot</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>cksum</tag>
|
||
As of textutils-2.0e: OK.
|
||
<tag>clear</tag>
|
||
No info available yet.
|
||
<tag>cmp</tag>
|
||
No info available yet.
|
||
<tag>col</tag>
|
||
No info available yet.
|
||
<tag>comm</tag>
|
||
No info available yet.
|
||
<tag>command</tag>
|
||
No info available yet.
|
||
<tag>compress</tag>
|
||
No info available yet.
|
||
<tag>cp</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>cpio</tag>
|
||
No info available yet.
|
||
<tag>crontab</tag>
|
||
No info available yet.
|
||
<tag>csplit</tag>
|
||
No info available yet.
|
||
<tag>ctags</tag>
|
||
No info available yet.
|
||
<tag>cut</tag>
|
||
No info available yet.
|
||
<tag>date</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>dd</tag>
|
||
As of fileutils-4.0u: The conv=lcase, conv=ucase options don't work correctly.
|
||
<tag>df</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>diff</tag>
|
||
As of diffutils-2.7.2: the --side-by-side mode doesn't compute
|
||
column width correctly.
|
||
<tag>diff3</tag>
|
||
No info available yet.
|
||
<tag>dirname</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>domainname</tag>
|
||
No info available yet.
|
||
<tag>du</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>echo</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>ed</tag>
|
||
No info available yet.
|
||
<tag>egrep</tag>
|
||
No info available yet.
|
||
<tag>env</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>ex</tag>
|
||
No info available yet.
|
||
<tag>expand</tag>
|
||
No info available yet.
|
||
<tag>expr</tag>
|
||
As of sh-utils-2.0i: The operators "match", "substr", "index", "length"
|
||
don't work correctly.
|
||
<tag>false</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>fc</tag>
|
||
No info available yet.
|
||
<tag>fg</tag>
|
||
No info available yet.
|
||
<tag>fgrep</tag>
|
||
No info available yet.
|
||
<tag>file</tag>
|
||
No info available yet.
|
||
<tag>find</tag>
|
||
As of findutils-4.1.6: The "-iregex" does not work correctly; this needs a
|
||
fix in function find/parser.c:insert_regex.
|
||
<tag>fold</tag>
|
||
No info available yet.
|
||
<tag>ftp[BSD]</tag>
|
||
No info available yet.
|
||
<tag>fuser</tag>
|
||
No info available yet.
|
||
<tag>gencat</tag>
|
||
No info available yet.
|
||
<tag>getconf</tag>
|
||
No info available yet.
|
||
<tag>getopts</tag>
|
||
No info available yet.
|
||
<tag>gettext</tag>
|
||
No info available yet.
|
||
<tag>grep</tag>
|
||
No info available yet.
|
||
<tag>gunzip</tag>
|
||
No info available yet.
|
||
<tag>gzip</tag>
|
||
gzip-1.3 is UTF-8 capable, but it uses only English messages in ASCII
|
||
charset. Proper internationalization would require: Use gettext. Call
|
||
setlocale. In function check_ofname (file gzip.c), use the function rpmatch
|
||
from GNU text/sh/fileutils instead of asking for "y" or "n". The use
|
||
of strlen in gzip.c:852 is wrong; it needs to use the function mbswidth.
|
||
<tag>hash</tag>
|
||
No info available yet.
|
||
<tag>head</tag>
|
||
No info available yet.
|
||
<tag>hostname</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>iconv</tag>
|
||
No info available yet.
|
||
<tag>id</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>ifconfig</tag>
|
||
No info available yet.
|
||
<tag>imake</tag>
|
||
No info available yet.
|
||
<tag>ipcrm</tag>
|
||
No info available yet.
|
||
<tag>ipcs</tag>
|
||
No info available yet.
|
||
<tag>jobs</tag>
|
||
No info available yet.
|
||
<tag>join</tag>
|
||
No info available yet.
|
||
<tag>kill</tag>
|
||
No info available yet.
|
||
<tag>killall</tag>
|
||
No info available yet.
|
||
<tag>ldd</tag>
|
||
No info available yet.
|
||
<tag>less</tag>
|
||
No complete info available yet.
|
||
<tag>lex</tag>
|
||
No info available yet.
|
||
<tag>ln</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>locale</tag>
|
||
As of glibc-2.2: OK.
|
||
<tag>localedef</tag>
|
||
As of glibc-2.2: OK.
|
||
<tag>logger</tag>
|
||
No info available yet.
|
||
<tag>logname</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>lp</tag>
|
||
No info available yet.
|
||
<tag>lpc[BSD]</tag>
|
||
No info available yet.
|
||
<tag>lpq[BSD]</tag>
|
||
No info available yet.
|
||
<tag>lpr[BSD]</tag>
|
||
No info available yet.
|
||
<tag>lprm[BSD]</tag>
|
||
No info available yet.
|
||
<tag>lpstat(LEGACY)</tag>
|
||
No info available yet.
|
||
<tag>ls</tag>
|
||
As of fileutils-4.0y: OK.
|
||
<tag>m4</tag>
|
||
No info available yet.
|
||
<tag>mailx</tag>
|
||
No info available yet.
|
||
<tag>make</tag>
|
||
No info available yet.
|
||
<tag>man</tag>
|
||
No info available yet.
|
||
<tag>mesg</tag>
|
||
No info available yet.
|
||
<tag>mkdir</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>mkfifo</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>mkfs</tag>
|
||
No info available yet.
|
||
<tag>mkswap</tag>
|
||
No info available yet.
|
||
<tag>more</tag>
|
||
No info available yet.
|
||
<tag>mount</tag>
|
||
No info available yet.
|
||
<tag>msgfmt</tag>
|
||
No info available yet.
|
||
<tag>msgmerge</tag>
|
||
No info available yet.
|
||
<tag>mv</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>netstat</tag>
|
||
No info available yet.
|
||
<tag>newgrp</tag>
|
||
No info available yet.
|
||
<tag>nice</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>nl</tag>
|
||
No info available yet.
|
||
<tag>nohup</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>nslookup</tag>
|
||
No info available yet.
|
||
<tag>nm</tag>
|
||
No info available yet.
|
||
<tag>od</tag>
|
||
No info available yet.
|
||
<tag>passwd[BSD]</tag>
|
||
No info available yet.
|
||
<tag>paste</tag>
|
||
No info available yet.
|
||
<tag>patch</tag>
|
||
No info available yet.
|
||
<tag>pathchk</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>ping</tag>
|
||
No info available yet.
|
||
<tag>pr</tag>
|
||
No info available yet.
|
||
<tag>printf</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>ps</tag>
|
||
No info available yet.
|
||
<tag>pwd</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>read</tag>
|
||
No info available yet.
|
||
<tag>reboot</tag>
|
||
No info available yet.
|
||
<tag>renice</tag>
|
||
No info available yet.
|
||
<tag>rm</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>rmdir</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>sed</tag>
|
||
No info available yet.
|
||
<tag>shar[BSD]</tag>
|
||
No info available yet.
|
||
<tag>shutdown</tag>
|
||
No info available yet.
|
||
<tag>sleep</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>sort</tag>
|
||
No info available yet.
|
||
<tag>split</tag>
|
||
No info available yet.
|
||
<tag>strings</tag>
|
||
No info available yet.
|
||
<tag>strip</tag>
|
||
No info available yet.
|
||
<tag>stty</tag>
|
||
As of sh-utils-2.0.11: OK.
|
||
<tag>su[BSD]</tag>
|
||
No info available yet.
|
||
<tag>sum</tag>
|
||
As of textutils-2.0e: OK.
|
||
<tag>tail</tag>
|
||
No info available yet.
|
||
<tag>talk</tag>
|
||
No info available yet.
|
||
<tag>tar</tag>
|
||
As of tar-1.13.17: OK, if user and group names are always ASCII.
|
||
<tag>tclsh</tag>
|
||
No info available yet.
|
||
<tag>tee</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>telnet</tag>
|
||
No info available yet.
|
||
<tag>test</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>time</tag>
|
||
No info available yet.
|
||
<tag>touch</tag>
|
||
As of fileutils-4.0u: OK.
|
||
<tag>tput</tag>
|
||
No info available yet.
|
||
<tag>tr</tag>
|
||
No info available yet.
|
||
<tag>true</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>tsort</tag>
|
||
No info available yet.
|
||
<tag>tty</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>type</tag>
|
||
No info available yet.
|
||
<tag>ulimit</tag>
|
||
No info available yet.
|
||
<tag>umask</tag>
|
||
No info available yet.
|
||
<tag>umount</tag>
|
||
No info available yet.
|
||
<tag>unalias</tag>
|
||
No info available yet.
|
||
<tag>uname</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>uncompress</tag>
|
||
No info available yet.
|
||
<tag>unexpand</tag>
|
||
No info available yet.
|
||
<tag>uniq</tag>
|
||
No info available yet.
|
||
<tag>uudecode</tag>
|
||
No info available yet.
|
||
<tag>uuencode</tag>
|
||
No info available yet.
|
||
<tag>vi</tag>
|
||
No info available yet.
|
||
<tag>wait</tag>
|
||
No info available yet.
|
||
<tag>wc</tag>
|
||
As of textutils-2.0.8: OK.
|
||
<tag>who</tag>
|
||
As of sh-utils-2.0i: OK.
|
||
<tag>wish</tag>
|
||
No info available yet.
|
||
<tag>write</tag>
|
||
No info available yet.
|
||
<tag>xargs</tag>
|
||
As of findutils-4.1.5: The program uses strstr; a patch has been submitted
|
||
to the maintainer.
|
||
<tag>xgettext</tag>
|
||
No info available yet.
|
||
<tag>yacc</tag>
|
||
No info available yet.
|
||
<tag>zcat</tag>
|
||
No info available yet.
|
||
</descrip>
|
||
|
||
<sect1>Other X11 applications
|
||
<p>
|
||
|
||
Owen Taylor is currently developing a library for rendering multilingual
|
||
text, called pango.
|
||
<htmlurl url="http://www.labs.redhat.com/~otaylor/pango/"
|
||
name="http://www.labs.redhat.com/~otaylor/pango/">,
|
||
<htmlurl url="http://www.pango.org/"
|
||
name="http://www.pango.org/">.
|
||
|
||
|
||
<sect>Printing
|
||
<p>
|
||
|
||
Since Postscript itself does not support Unicode fonts, the burden of
|
||
Unicode support in printing is on the program creating the Postscript
|
||
document, not on the Postscript renderer.
|
||
|
||
The existing Postscript fonts I've seen - .pfa/.pfb/.afm/.pfm/.gsf -
|
||
support only a small range of glyphs and are not Unicode fonts.
|
||
|
||
<!-- Can teTeX's `ttf2afm' program be any useful here?? -->
|
||
|
||
<!-- I don't think ghostscript's bdftops program is useful here. -->
|
||
|
||
<sect1>Printing using TrueType fonts
|
||
<p>
|
||
|
||
Both the uniprint and wprint programs produce good printed output
|
||
for Unicode plain text. They require a TrueType font; see section
|
||
"TrueType fonts" above. The Bitstream Cyberbit gives good results.
|
||
|
||
<sect2>uniprint
|
||
<p>
|
||
|
||
The "uniprint" program contained in the yudit package can convert a text
|
||
file to Postscript. For uniprint to find the Cyberbit font, symlink it to
|
||
<tt>/usr/local/share/yudit/data/cyberbit.ttf</tt>.
|
||
|
||
<sect2>wprint
|
||
<p>
|
||
|
||
The "wprint" (WorldPrint) program by Eduardo Trapani
|
||
<htmlurl url="http://ttt.esperanto.org.uy/programoj/angle/wprint.html"
|
||
name="http://ttt.esperanto.org.uy/programoj/angle/wprint.html">
|
||
postprocesses Postscript output produced by Netscape Communicator or Mozilla
|
||
from HTML pages or plain text files.
|
||
|
||
The output is nearly perfect; only in Cyrillic paragraphs the line breaking
|
||
is incorrect: the lines are only about half as wide as they should be.
|
||
|
||
<sect2>Comparison
|
||
<p>
|
||
|
||
For plain text, uniprint has a better overall layout. On the other hand,
|
||
only wprint gets Thai output correct.
|
||
|
||
<sect1>Printing using fixed-size fonts
|
||
<p>
|
||
|
||
Generally, printing using fixed-size fonts does not give output as professional
as printing using TrueType fonts.
|
||
|
||
<sect2>txtbdf2ps
|
||
<p>
|
||
|
||
The txtbdf2ps 0.7 program by Serge Winitzki
|
||
<htmlurl url="http://members.linuxstart.com/~winitzki/txtbdf2ps.html"
|
||
name="http://members.linuxstart.com/~winitzki/txtbdf2ps.html">
|
||
converts a plain text file to Postscript, by use of a BDF font.
|
||
Installation:
|
||
<tscreen><verb>
|
||
# install -m 777 txtbdf2ps-dev.txt /usr/local/bin/txtbdf2ps
|
||
</verb></tscreen>
|
||
Example with a proportional font:
|
||
<tscreen><verb>
|
||
$ txtbdf2ps -BDF=cyberbit.bdf -UTF-8 -nowrap < input.txt > output.ps
|
||
</verb></tscreen>
|
||
Example with a fixed-width font:
|
||
<tscreen><verb>
|
||
$ txtbdf2ps -BDF=unifont.bdf -UTF-8 -nowrap < input.txt > output.ps
|
||
</verb></tscreen>
|
||
|
||
Note: txtbdf2ps does not support combining characters and bidi.
|
||
|
||
<sect1>The classical approach
|
||
<p>
|
||
|
||
Another way to print with TrueType fonts is to convert the TrueType font to
|
||
a Postscript font using the <tt>ttf2pt1</tt> utility
|
||
(<htmlurl url="http://www.netspace.net.au/~mheath/ttf2pt1/"
|
||
name="http://www.netspace.net.au/~mheath/ttf2pt1/">,
|
||
<htmlurl url="http://quadrant.netspace.net.au/ttf2pt1/"
|
||
name="http://quadrant.netspace.net.au/ttf2pt1/">,
|
||
<htmlurl url="http://ttf2pt1.sourceforge.net/"
|
||
name="http://ttf2pt1.sourceforge.net/">). Details can be
|
||
found in Juliusz Chroboczek's "Printing with TrueType fonts in Unix" writeup,
|
||
<htmlurl url="http://www.dcs.ed.ac.uk/home/jec/programs/xfsft/printing.html"
|
||
name="http://www.dcs.ed.ac.uk/home/jec/programs/xfsft/printing.html">.
|
||
|
||
<sect2>TeX, Omega
|
||
<p>
|
||
|
||
TODO: CJK, metafont, omega, dvips, odvips, utf8-tex-0.1
|
||
|
||
<!-- Not useful now: The `ps2pk' and `afm2tfm' programs, contained in the
|
||
teTeX distribution, can make use of existing Postscript/Type1 fonts for use
|
||
with TeX. So can the `ps2mf' program, not included with teTeX. -->
|
||
|
||
<sect2>DocBook
|
||
<p>
|
||
|
||
TODO: db2ps, jadetex
|
||
|
||
<sect2>groff -Tps
|
||
<p>
|
||
|
||
"groff -Tps" produces Postscript output. Its Postscript output driver
|
||
supports only a very limited number of Unicode characters (only what
|
||
Postscript supports by itself).
|
||
|
||
<!-- Not useful now: The `afmtodit' and `pfbtops' programs, contained in the
|
||
groff package, can make use of existing Postscript/Type1 fonts for use with
|
||
groff. -->
|
||
|
||
<sect1>No luck with...
|
||
|
||
<sect2>Netscape's "Print..."
|
||
<p>
|
||
|
||
As of version 4.72, Netscape Communicator cannot correctly print HTML
|
||
pages in UTF-8 encoding. You really have to use wprint.
|
||
|
||
<sect2>Mozilla's "Print..."
|
||
<p>
|
||
|
||
As of version M16, printing of HTML pages is apparently not implemented.
|
||
|
||
<sect2>html2ps
|
||
<p>
|
||
|
||
As of version 1.0b1, the html2ps HTML to Postscript converter does not support
|
||
UTF-8 encoded HTML pages and has no special treatment of fonts: the generated
|
||
Postscript uses the standard Postscript fonts.
|
||
|
||
<sect2>a2ps
|
||
<p>
|
||
|
||
As of version 4.12, a2ps doesn't support printing UTF-8 encoded text.
|
||
|
||
<sect2>enscript
|
||
<p>
|
||
|
||
As of version 1.6.1, enscript doesn't support printing UTF-8 encoded text.
|
||
By default, it uses only the standard Postscript fonts, but it can also
|
||
include a custom Postscript font in the output.
|
||
|
||
|
||
<sect>Making your programs Unicode aware
|
||
<p>
|
||
|
||
<sect1>C/C++
|
||
<p>
|
||
|
||
The C `<tt>char</tt>' type is 8-bit and will stay 8-bit because it denotes
|
||
the smallest addressable data unit. Various facilities are available:
|
||
|
||
<sect2>For normal text handling
|
||
<p>
|
||
|
||
The ISO/ANSI C standard contains, in an amendment which was added in 1995,
|
||
a "wide character" type `<tt>wchar_t</tt>', a set of functions like those
|
||
found in <tt><string.h></tt> and <tt><ctype.h></tt> (declared in
|
||
<tt><wchar.h></tt> and <tt><wctype.h></tt>, respectively), and
|
||
a set of conversion functions between `<tt>char *</tt>' and
|
||
`<tt>wchar_t *</tt>' (declared in <tt><stdlib.h></tt>).
|
||
|
||
Good references for this API are
|
||
<itemize>
|
||
<item>
|
||
the GNU libc-2.1 manual, chapters 4 "Character Handling" and
|
||
6 "Character Set Handling",
|
||
<item>
|
||
the manual pages
|
||
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/man-mbswcs.tar.gz"
|
||
name="man-mbswcs.tar.gz">, now contained in
|
||
<htmlurl url="ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz"
|
||
name="ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz">,
|
||
<item>
|
||
the OpenGroup's introduction
|
||
<htmlurl url="http://www.unix-systems.org/version2/whatsnew/login_mse.html"
|
||
name="http://www.unix-systems.org/version2/whatsnew/login_mse.html">,
|
||
<item>
|
||
the OpenGroup's Single Unix specification
|
||
<htmlurl url="http://www.UNIX-systems.org/online.html"
|
||
name="http://www.UNIX-systems.org/online.html">,
|
||
<item>
|
||
the ISO/IEC 9899:1999 (ISO C 99) standard. The latest draft before it was
|
||
adopted is called n2794. You find it at
|
||
<htmlurl url="ftp://ftp.csn.net/DMK/sc22wg14/review/"
|
||
name="ftp://ftp.csn.net/DMK/sc22wg14/review/">
|
||
or
|
||
<htmlurl url="http://java-tutor.com/docs/c/"
|
||
name="http://java-tutor.com/docs/c/">.
|
||
<item>
|
||
Clive Feather's introduction
|
||
<htmlurl url="http://www.lysator.liu.se/c/na1.html"
|
||
name="http://www.lysator.liu.se/c/na1.html">,
|
||
<item>
|
||
the Dinkumware C library reference
|
||
<htmlurl url="http://www.dinkumware.com/htm_cl/"
|
||
name="http://www.dinkumware.com/htm_cl/">.
|
||
</itemize>
|
||
|
||
Advantages of using this API:
<itemize>
<item>
It's a vendor-independent standard.
<item>
The functions do the right thing, depending on the user's locale.
All a program needs to call is <tt>setlocale(LC_ALL,"");</tt>.
</itemize>
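
As a rough illustration of the second point, the following sketch converts a
multibyte string to wide characters in whatever encoding the user's locale
prescribes. The helper names <tt>init_locale</tt> and <tt>to_wide</tt> are
made up for this example; only <tt>setlocale</tt> and <tt>mbstowcs</tt> come
from the standard library.

```c
#include <locale.h>   /* setlocale */
#include <stdlib.h>   /* mbstowcs, wchar_t */

/* Hypothetical helper: adopt the user's locale once at program startup.
   Returns 1 on success, 0 if the locale named in the environment is unknown. */
int init_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: convert the multibyte string s, interpreted in the
   encoding of the current locale, into buf (room for bufsize wide chars).
   Returns the number of wide characters stored, not counting the
   terminating null, or (size_t)-1 if s is invalid in that encoding. */
size_t to_wide(const char *s, wchar_t *buf, size_t bufsize)
{
    size_t n;

    if (bufsize == 0)
        return (size_t)-1;
    /* Convert at most bufsize-1 characters so the terminator always fits. */
    n = mbstowcs(buf, s, bufsize - 1);
    if (n == (size_t)-1)
        return n;
    buf[n] = L'\0';
    return n;
}
```

A program would call <tt>init_locale()</tt> once and then use
<tt>to_wide()</tt> on any text it reads; the same binary then works in 8-bit
and in UTF-8 locales, provided the platform supports them.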

Drawbacks of this API:
<itemize>
<item>
Some of the functions are not multithread-safe, because they keep a hidden
internal state between function calls.
<item>
There is no first-class locale datatype. Therefore this API cannot reasonably
be used for anything that needs more than one locale or character set at the
same time.
<item>
Support for this API is incomplete or buggy on many operating systems
(see the portability survey below).
</itemize>

<sect3>Portability notes
<p>

A `<tt>wchar_t</tt>' may or may not be encoded in Unicode; this is
platform and sometimes also locale dependent. A multibyte sequence
`<tt>char *</tt>' may or may not be encoded in UTF-8; this is platform
and sometimes also locale dependent.

In detail, here is what the
<htmlurl url="http://www.UNIX-systems.org/online.html"
name="Single Unix specification">
says about the `<tt>wchar_t</tt>' type:
<em>All wide-character codes in a given process consist of an equal number
of bits. This is in contrast to characters, which can consist of a
variable number of bytes. The byte or byte sequence that represents a
character can also be represented as a wide-character code.
Wide-character codes thus provide a uniform size for manipulating text
data. A wide-character code having all bits zero is the null
wide-character code, and terminates wide-character strings. The
wide-character value for each member of the Portable Character Set
</em> (i.e. ASCII) <em>
will equal its value when used as the lone character in an integer
character constant. Wide-character codes for other characters are
locale- and implementation-dependent. State shift bytes do not have a
wide-character code representation.</em>

One particular consequence is that in portable programs you shouldn't use
non-ASCII characters in string literals. That means, even though you
know the Unicode double quotation marks have the codes U+201C and U+201D,
you shouldn't write a string literal <tt>L"\u201cHello\u201d, he said"</tt>
or <tt>"\xe2\x80\x9cHello\xe2\x80\x9d, he said"</tt> in C programs. Instead,
use GNU gettext, write it as <tt>gettext("'Hello', he said")</tt>, and create
a message database en.po which translates "'Hello', he said" to
"\u201cHello\u201d, he said".
|
||
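
The gettext approach described above might look like this in a program. This
is only a sketch: the text domain name "myprog" and the catalog directory
"/usr/share/locale" are placeholders chosen for the example, and the helper
names are hypothetical. When no translation catalog is found, <tt>gettext</tt>
simply returns its ASCII argument.

```c
#include <libintl.h>  /* gettext, bindtextdomain, textdomain */
#include <locale.h>   /* setlocale */

/* Hypothetical startup helper. "myprog" and the directory are assumptions;
   a real program uses its own package name and installation prefix. */
void init_messages(void)
{
    setlocale(LC_ALL, "");
    bindtextdomain("myprog", "/usr/share/locale");
    textdomain("myprog");
}

/* The source code contains only ASCII. A message catalog such as en.po may
   map this msgid to a string with real Unicode quotation marks, encoded
   in whatever character set the user's locale uses. */
const char *quoted_hello(void)
{
    return gettext("'Hello', he said");
}
```

The point of the design is that the choice between ASCII apostrophes and
U+201C/U+201D is made at run time, per locale, rather than being frozen into
a string literal.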

Here is a survey of the portability of the ISO/ANSI C facilities on various
Unix flavours.

<descrip>
<tag>GNU glibc-2.2.x</tag>
<itemize>
<item><wchar.h> and <wctype.h> exist.
<item>Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
<item>Has five UTF-8 locales.
<item>mbrtowc works.
</itemize>
<tag>GNU glibc-2.0.x, glibc-2.1.x</tag>
<itemize>
<item><wchar.h> and <wctype.h> exist.
<item>Has wcs/mbs functions, but no fgetwc/fputwc/wprintf.
<item>No UTF-8 locale.
<item>mbrtowc returns EILSEQ for bytes >= 0x80.
</itemize>
<tag>AIX 4.3</tag>
<itemize>
<item><wchar.h> and <wctype.h> exist.
<item>Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
<item>Has many UTF-8 locales, one for every country.
<item>Needs -D_XOPEN_SOURCE=500 in order to define mbstate_t.
<item>mbrtowc works.
</itemize>
<tag>Solaris 2.7</tag>
<itemize>
<item><wchar.h> and <wctype.h> exist.
<item>Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
<item>Has the following UTF-8 locales:
en_US.UTF-8, de.UTF-8, es.UTF-8, fr.UTF-8, it.UTF-8, sv.UTF-8.
<item>mbrtowc returns -1/EILSEQ (instead of -2) for bytes >= 0x80.
</itemize>
<tag>OSF/1 4.0d</tag>
<itemize>
<item><wchar.h> and <wctype.h> exist.
<item>Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
<item>Has an add-on universal.utf8@ucs4 locale, see "man 5 unicode".
<item>mbrtowc does not know about UTF-8.
</itemize>
<tag>Irix 6.5</tag>
<itemize>
<item><wchar.h> and <wctype.h> exist.
<item>Has wcs/mbs functions and fgetwc/fputwc, but not wprintf.
<item>Has no multibyte locales.
<item>Has only a dummy definition for mbstate_t.
<item>Doesn't have mbrtowc.
</itemize>
<tag>HP-UX 11.00</tag>
<itemize>
<item><wchar.h> exists, <wctype.h> does not.
<item>Has wcs/mbs functions and fgetwc/fputwc, but not wprintf.
<item>Has a C.utf8 locale.
<item>Doesn't have mbstate_t.
<item>Doesn't have mbrtowc.
</itemize>
</descrip>

As a consequence, I recommend that you use the restartable and multithread-safe
wcsr/mbsr functions, forget about those systems which don't have them (Irix,
HP-UX, AIX), and use the UTF-8 locale plug-in libutf8_plug.so (see below)
on those systems which permit you to compile programs which use these
wcsr/mbsr functions (Linux, Solaris, OSF/1).
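
To show what "restartable" means in practice, here is a small sketch built on
<tt>mbrtowc</tt>. The conversion state lives in an explicit
<tt>mbstate_t</tt> object owned by the caller instead of in hidden static
storage, which is what makes the function usable from several threads. The
function name <tt>count_chars_r</tt> is invented for this example.

```c
#include <string.h>   /* memset */
#include <wchar.h>    /* mbrtowc, mbstate_t */

/* Hypothetical helper: count the characters in the first len bytes of s,
   interpreted in the encoding of the current locale.
   Returns (size_t)-1 if the input is invalid or ends in the middle of a
   multibyte character. */
size_t count_chars_r(const char *s, size_t len)
{
    mbstate_t state;
    size_t count = 0;

    memset(&state, 0, sizeof state);      /* initial conversion state */
    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, len, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;            /* invalid or incomplete sequence */
        if (n == 0)
            n = 1;                        /* a null character occupies one byte */
        s += n;
        len -= n;
        count++;
    }
    return count;
}
```

On the systems in the survey where <tt>mbrtowc</tt> is reported as broken for
bytes >= 0x80, a function like this would report an error for any non-ASCII
input, which is one way to detect the problem at run time.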

Similar advice, given by Sun in
<htmlurl url="http://www.sun.com/software/white-papers/wp-unicode/"
name="http://www.sun.com/software/white-papers/wp-unicode/">,
section "Internationalized Applications with Unicode", is:

<em>To properly internationalize an application, use the following
guidelines:</em>
<enum>
<item><em>Avoid direct access with Unicode. This is a task of the platform's
internationalization framework.</em>
<item><em>Use the POSIX model for multibyte and wide-character interfaces.</em>
<item><em>Only call the APIs that the internationalization framework
provides for language and cultural-specific operations.</em>
<item><em>Remain code-set independent.</em>
</enum>

If, for some reason, in some piece of code, you really have to assume that
`wchar_t' is Unicode (for example, if you want to do special treatment of
some Unicode characters), you should make that piece of code conditional
upon the result of <tt>is_locale_utf8()</tt>. Otherwise you will mess up
your program's behaviour in other locales or on other platforms. The
function <tt>is_locale_utf8</tt> is declared in
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/utf8locale.h" name="utf8locale.h">
and defined in
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/utf8locale.c" name="utf8locale.c">.
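
The contents of utf8locale.c are not reproduced here. As a rough idea of how
such a check can be made on systems that provide
<tt>nl_langinfo(CODESET)</tt>, consider the sketch below. It is an assumption
about one possible approach, not the implementation from that file, and the
function names are invented for this example.

```c
#include <langinfo.h>  /* nl_langinfo, CODESET */
#include <strings.h>   /* strcasecmp */

/* Hypothetical helper: does this codeset name denote UTF-8?
   Platforms spell the name differently ("UTF-8", "utf8", ...), so the
   comparison ignores case and accepts both spellings. */
int codeset_name_is_utf8(const char *name)
{
    return strcasecmp(name, "UTF-8") == 0 || strcasecmp(name, "UTF8") == 0;
}

/* Hypothetical helper: is the current locale's character encoding UTF-8?
   Only meaningful after setlocale(LC_ALL, "") has been called. */
int current_locale_is_utf8(void)
{
    return codeset_name_is_utf8(nl_langinfo(CODESET));
}
```

Code that hard-wires Unicode assumptions about `wchar_t' would then be
wrapped in <tt>if (current_locale_is_utf8()) { ... }</tt>, with a
locale-neutral fallback in the other branch.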

<sect3>The libutf8 library
<p>

A portable implementation of the ISO/ANSI C API, which supports 8-bit locales
and UTF-8 locales, can be found in
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/libutf8-0.7.3.tar.gz"
name="libutf8-0.7.3.tar.gz">.

Advantages:
<itemize>
<item>
Unicode UTF-8 support now, portably, even on OSes whose multibyte character
support does not work or which don't have multibyte/wide character support
at all.
<item>
The same binary works in all OS supported 8-bit locales and in UTF-8 locales.
<item>
When an OS vendor adds proper multibyte character support, you can take
advantage of it by simply recompiling without the -DHAVE_LIBUTF8 compiler option.
</itemize>

<sect3>The Plan9 way
<p>

The Plan9 operating system, a variant of Unix, uses UTF-8 as character
encoding in all applications. Its wide character type is called
`<tt>Rune</tt>', not `<tt>wchar_t</tt>'. Parts of its libraries, written by
Rob Pike and Howard Trickey, are available at
<htmlurl url="ftp://ftp.cdrom.com/pub/netlib/research/9libs/9libs-1.0.tar.gz"
name="ftp://ftp.cdrom.com/pub/netlib/research/9libs/9libs-1.0.tar.gz">.
Another similar library, written by Alistair G. Crooks, is
<htmlurl url="ftp://ftp.cdrom.com/pub/NetBSD/packages/distfiles/libutf-2.10.tar.gz"
name="ftp://ftp.cdrom.com/pub/NetBSD/packages/distfiles/libutf-2.10.tar.gz">.
In particular, each of these libraries contains a UTF-8 aware regular
expression matcher.

Drawback of this API:
<itemize>
<item>
UTF-8 is compiled in, not optional. Programs compiled in this universe lose
support for the 8-bit encodings which are still frequently used in Europe.
</itemize>
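
To make the drawback concrete, here is what a hard-wired UTF-8 decoder looks
like. This is an illustrative sketch only: it is not the Plan9 library code
and does not use its function names, and it assumes the modern restriction of
UTF-8 to at most four bytes and code points up to U+10FFFF. Nothing in it
consults the locale, so text in ISO-8859-1 or any other 8-bit encoding is
simply rejected.

```c
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most len bytes)
   and store the code point in *out. Returns the number of bytes consumed,
   or 0 if the input is truncated or not valid UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *out)
{
    unsigned long c, min;
    size_t n, i;

    if (len == 0)
        return 0;
    c = s[0];
    if (c < 0x80) {                     /* plain ASCII, one byte */
        *out = c;
        return 1;
    } else if ((c & 0xE0) == 0xC0) {    /* 110xxxxx: two bytes */
        n = 2; c &= 0x1F; min = 0x80;
    } else if ((c & 0xF0) == 0xE0) {    /* 1110xxxx: three bytes */
        n = 3; c &= 0x0F; min = 0x800;
    } else if ((c & 0xF8) == 0xF0) {    /* 11110xxx: four bytes */
        n = 4; c &= 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or invalid lead */
    }
    if (len < n)
        return 0;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)      /* continuation bytes are 10xxxxxx */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *out = c;
    return n;
}
```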

<sect2>For graphical user interface
<p>

The Qt-2.0 library
<htmlurl url="http://www.troll.no/"
name="http://www.troll.no/">
contains a fully-Unicode QString class. You can use the member functions
QString::utf8 and QString::fromUtf8 to convert to/from UTF-8 encoded text.
The QString::ascii and QString::latin1 member functions should not be used
any more.

<sect2>For advanced text handling
<p>

The previously mentioned libraries implement Unicode aware versions of
the ASCII concepts. Here are libraries which deal with Unicode concepts,
such as titlecase (a third letter case, different from uppercase and
lowercase), distinction between punctuation and symbols, canonical
decomposition, combining classes, canonical ordering and the like.

<descrip>
<tag>ucdata-2.4</tag>
The ucdata library by Mark Leisher
<htmlurl url="http://crl.nmsu.edu/~mleisher/ucdata.html"
name="http://crl.nmsu.edu/~mleisher/ucdata.html">
deals with character properties, case conversion, decomposition, combining
classes. The companion package ure-0.5
<htmlurl url="http://crl.nmsu.edu/~mleisher/ure-0.5.tar.gz"
name="http://crl.nmsu.edu/~mleisher/ure-0.5.tar.gz">
is a Unicode regular expression matcher.

<tag>ustring</tag>
The ustring C++ library by Rodrigo Reyes
<htmlurl url="http://ustring.charabia.net/"
name="http://ustring.charabia.net/">
deals with character properties, case conversion, decomposition, combining
classes, and includes a Unicode regular expression matcher.

<tag>ICU</tag>
International Components for Unicode
<htmlurl url="http://oss.software.ibm.com/icu/"
name="http://oss.software.ibm.com/icu/">.
IBM's very comprehensive internationalization library featuring Unicode strings,
resource bundles, number formatters, date/time formatters, message formatters,
collation and more. Lots of supported locales. Portable to Unix and Win32,
but compiles out of the box only on Linux libc6, not libc5.

<tag>libunicode</tag>
The GNOME libunicode library
<htmlurl url="http://cvs.gnome.org/lxr/source/libunicode/"
name="http://cvs.gnome.org/lxr/source/libunicode/">
by Tom Tromey and others. It covers character set conversion, character
properties, decomposition.

</descrip>

<sect2>For conversion
<p>

Three kinds of conversion libraries, which support UTF-8 and a large number
of 8-bit character sets, are available:

<sect3>iconv
<p>

The iconv implementation by Ulrich Drepper, contained in the GNU glibc-2.2.
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/glibc/glibc-2.2.tar.gz"
name="ftp://ftp.gnu.org/pub/gnu/glibc/glibc-2.2.tar.gz">.
The iconv manpages are now contained in
<htmlurl url="ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz"
name="ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz">.

The portable iconv implementation by Bruno Haible.
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/gnu/libiconv-1.5.1.tar.gz"
name="ftp://ftp.ilog.fr/pub/Users/haible/gnu/libiconv-1.5.1.tar.gz">

The portable iconv implementation by Konstantin Chuguev.
<htmlurl url="mailto:joy@urc.ac.ru"
name="<joy@urc.ac.ru>">
<htmlurl url="ftp://ftp.urc.ac.ru/pub/local/OS/Unix/converters/iconv-0.4.tar.gz"
name="ftp://ftp.urc.ac.ru/pub/local/OS/Unix/converters/iconv-0.4.tar.gz">

Advantages:
<itemize>
<item>
iconv is standardized by POSIX, so programs using iconv to convert from/to UTF-8
will also run under Solaris. However, the names for the character sets differ
between platforms. For example, "EUC-JP" under glibc is "eucJP" under HP-UX.
(The official IANA name for this character set is "EUC-JP", so it's clearly
an HP-UX deficiency.)
<item>
On glibc-2.1 systems, no additional library is needed. On other systems, one of
the two other iconv implementations can be used.
</itemize>
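
A minimal sketch of a conversion with the POSIX iconv interface follows. The
wrapper name <tt>convert_string</tt> is invented for this example; the
character set names are passed in by the caller because, as noted above, they
differ between platforms. Note that <tt>iconv_open</tt> takes the target
encoding first.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: convert inlen bytes of `in' from encoding `from'
   to encoding `to', writing at most outsize bytes into `out'.
   Returns the number of bytes written, or (size_t)-1 on any failure
   (unknown encoding name, invalid input, or output buffer too small). */
size_t convert_string(const char *from, const char *to,
                      const char *in, size_t inlen,
                      char *out, size_t outsize)
{
    iconv_t cd = iconv_open(to, from);   /* target first, then source */
    char *inptr = (char *)in;            /* iconv does not modify the input */
    char *outptr = out;
    size_t inleft = inlen;
    size_t outleft = outsize;
    size_t r;

    if (cd == (iconv_t)-1)
        return (size_t)-1;
    r = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    if (r != (size_t)-1)
        /* Flush any pending shift sequence for stateful encodings. */
        r = iconv(cd, NULL, NULL, &outptr, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return (size_t)-1;
    return outsize - outleft;
}
```

For instance, converting the single ISO-8859-1 byte 0xE9 (e with acute
accent) to UTF-8 with the glibc names "ISO-8859-1" and "UTF-8" yields the two
bytes 0xC3 0xA9.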

<sect3>librecode
<p>

librecode by François Pinard
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz"
name="ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz">.

Advantages:
<itemize>
<item>
Support for transliteration, i.e. conversion of non-ASCII characters
to sequences of ASCII characters in order to preserve readability by
humans, even when a lossless transformation is impossible.
</itemize>

Drawbacks:
<itemize>
<item>
Non-standard API.
<item>
Slow initialization.
</itemize>

<sect3>ICU
<p>

International Components for Unicode 1.7
<htmlurl url="http://oss.software.ibm.com/icu/"
name="http://oss.software.ibm.com/icu/">.
IBM's internationalization library also has conversion facilities, declared
in `<tt>ucnv.h</tt>'.

Advantages:
<itemize>
<item>
Comprehensive set of supported encodings.
</itemize>

Drawbacks:
<itemize>
<item>
Non-standard API.
</itemize>

<sect2>Other approaches
<p>

<descrip>
<tag>libutf-8</tag>
libutf-8 by G. Adam Stanislav
<htmlurl url="mailto:adam@whizkidtech.net"
name="<adam@whizkidtech.net>">
contains a few functions for on-the-fly conversion from/to UTF-8 encoded
`FILE*' streams.
<htmlurl url="http://www.whizkidtech.net/i18n/libutf-8-1.0.tar.gz"
name="http://www.whizkidtech.net/i18n/libutf-8-1.0.tar.gz">

Advantages:
<itemize>
<item>
Very small.
</itemize>

Drawbacks:
<itemize>
<item>
Non-standard API.
<item>
UTF-8 is compiled in, not optional. Programs compiled with this library
lose support for the 8-bit encodings which are still frequently used in Europe.
<item>
Installation is nontrivial: the Makefile must be edited by hand, since there
is no autoconfiguration.
</itemize>

</descrip>

<sect1>Java
<p>

Java has Unicode support built into the language. The type `char' denotes
a Unicode character, and the `java.lang.String' class denotes a string
built up from Unicode characters.

Java can display any Unicode characters through its windowing system AWT,
provided that
<enum>
<item>you set the Java system property "user.language" appropriately,
<item>the /usr/lib/java/lib/font.properties.<it>language</it> font set
definitions are appropriate, and
<item>the fonts specified in that file are installed.
</enum>
For example, in order to display text containing Japanese characters,
you would install Japanese fonts and run "java -Duser.language=ja ...".
You can combine font sets: In order to display Western European, Greek
and Japanese characters simultaneously, you would combine the files
"font.properties" (covers ISO-8859-1), "font.properties.el"
(covers ISO-8859-7) and "font.properties.ja" into a single file.
??This is untested??

The interfaces java.io.DataInput and java.io.DataOutput have methods called
`readUTF' and `writeUTF' respectively. But note that they don't use UTF-8;
they use a modified UTF-8 encoding: the NUL character is encoded as the
two-byte sequence 0xC0 0x80 instead of 0x00, so the encoded data never
contains a 0x00 byte. `writeUTF' prefixes the encoded data with a two-byte
length field. The same modified encoding is used for JNI strings, which
instead have a 0x00 byte added at the end, so that the C <string.h> functions
like strlen() and strcpy() can be used to manipulate them.
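
The per-character rule of this modified encoding can be sketched in C as
follows. The function name <tt>modified_utf8_put</tt> is invented for this
example, and the sketch covers only the encoding of one 16-bit Java `char'
value; it says nothing about how the resulting bytes are framed.

```c
#include <stddef.h>

/* Hypothetical helper: encode one 16-bit character value in Java's modified
   UTF-8 and store the bytes in out (room for 3 bytes).
   Returns the number of bytes written. */
size_t modified_utf8_put(unsigned int ch, unsigned char *out)
{
    ch &= 0xFFFF;
    if (ch != 0 && ch < 0x80) {
        /* U+0001..U+007F: one byte, identical to ASCII and to UTF-8 */
        out[0] = (unsigned char)ch;
        return 1;
    }
    if (ch < 0x800) {
        /* U+0080..U+07FF, and also U+0000, which becomes 0xC0 0x80 */
        out[0] = (unsigned char)(0xC0 | (ch >> 6));
        out[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    }
    /* U+0800..U+FFFF: three bytes */
    out[0] = (unsigned char)(0xE0 | (ch >> 12));
    out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return 3;
}
```

A strict UTF-8 decoder rejects the sequence 0xC0 0x80 as an overlong form,
which is why data written this way should not be handed to programs that
expect standard UTF-8.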

<sect1>Lisp
<p>

The Common Lisp standard specifies two character types: `base-char' and
`character'. It's up to the implementation to support Unicode or not.
The language also specifies a keyword argument `:external-format' to `open',
as the natural place to specify a character set or encoding.

Among the free Common Lisp implementations, only CLISP
<htmlurl url="http://clisp.cons.org/"
name="http://clisp.cons.org/">
supports Unicode. You need a CLISP version from March 2000 or newer.
<htmlurl url="ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz"
name="ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz">.
The types `base-char' and `character' are both equivalent to 16-bit Unicode.
The functions <tt>char-width</tt> and <tt>string-width</tt> provide an
API comparable to <tt>wcwidth()</tt> and <tt>wcswidth()</tt>.
The encoding used for file or socket/pipe I/O can be specified through the
`:external-format' argument. The encodings used for tty I/O and the default
encoding for file/socket/pipe I/O are locale dependent.

Among the commercial Common Lisp implementations:

LispWorks
<htmlurl url="http://www.xanalys.com/software_tools/products/"
name="http://www.xanalys.com/software_tools/products/">
supports Unicode.
The type `base-char' is equivalent to ISO-8859-1, and the type `simple-char'
(subtype of `character') contains all Unicode characters.
The encoding used for file I/O can be specified through the
`:external-format' argument, for example '(:UTF-8).
Limitations: Encodings cannot be used for socket I/O. The editor cannot edit
UTF-8 encoded files.

Eclipse
<htmlurl url="http://www.elwood.com/eclipse/eclipse.htm"
name="http://www.elwood.com/eclipse/eclipse.htm">
supports Unicode. See
<htmlurl url="http://www.elwood.com/eclipse/char.htm"
name="http://www.elwood.com/eclipse/char.htm">.
The type `base-char' is equivalent
to ISO-8859-1, and the type `character' contains all Unicode characters.
The encoding used for file I/O can be specified through a combination of
the `:element-type' and `:external-format' arguments to `open'.
Limitations: Character attribute functions are locale dependent. Source and
compiled source files cannot contain Unicode string literals.

The commercial Common Lisp implementation Allegro CL, in version 6.0, has
Unicode support. The types `base-char' and `character' are both equivalent
to 16-bit Unicode. The encoding used for file I/O can be specified through the
`:external-format' argument, for example <tt>:external-format :utf8</tt>.
The default encoding is locale dependent. More details are at
<htmlurl url="http://www.franz.com/support/documentation/6.0/doc/iacl.htm"
name="http://www.franz.com/support/documentation/6.0/doc/iacl.htm">.

<sect1>Ada95
<p>

Ada95 was designed for Unicode support and the Ada95 standard library
features special ISO 10646-1 data types Wide_Character and Wide_String,
as well as numerous associated procedures and functions. The GNU Ada95
compiler (gnat-3.11 or newer) supports UTF-8 as the external encoding of
wide characters. This allows you to use UTF-8 in both source code and
application I/O. To activate it in the application, use "WCEM=8" in the
FORM string when opening a file, and use compiler option "-gnatW8" if
the source code is in UTF-8. See the GNAT
(<htmlurl url="ftp://cs.nyu.edu/pub/gnat/"
name="ftp://cs.nyu.edu/pub/gnat/">)
and Ada95
(<htmlurl url="ftp://ftp.cnam.fr/pub/Ada/PAL/userdocs/docadalt/rm95/index.htm"
name="ftp://ftp.cnam.fr/pub/Ada/PAL/userdocs/docadalt/rm95/index.htm">)
reference manuals for details.

<sect1>Python
<p>

Python 2.0
(<htmlurl url="http://www.python.org/2.0/"
name="http://www.python.org/2.0/">,
<htmlurl url="http://www.python.org/pipermail/python-announce-list/2000-October/000889.html"
name="http://www.python.org/pipermail/python-announce-list/2000-October/000889.html">,
<htmlurl url="http://starship.python.net/crew/amk/python/writing/new-python/new-python.html#SECTION000300000000000000000"
name="http://starship.python.net/crew/amk/python/writing/new-python/new-python.html">)
contains Unicode support. It has a new fundamental data type
`unicode', representing a Unicode string, a module `unicodedata' for the
character properties, and a set of converters for the most important encodings.
See
<htmlurl url="http://starship.python.net/crew/lemburg/unicode-proposal.txt"
name="http://starship.python.net/crew/lemburg/unicode-proposal.txt">,
or the file <tt>Misc/unicode.txt</tt> in the distribution, for details.

<sect1>JavaScript/ECMAscript
<p>

Since JavaScript version 1.3, strings are always Unicode. There is no
character type, but you can use the \uXXXX notation for Unicode characters
inside strings. No normalization is done internally, so scripts are expected
to receive text in Unicode Normalization Form C, which the W3C recommends. See
<htmlurl url="http://developer.netscape.com/docs/manuals/communicator/jsref/js13.html#Unicode"
name="http://developer.netscape.com/docs/manuals/communicator/jsref/js13.html#Unicode">
for details and
<htmlurl url="http://developer.netscape.com/docs/javascript/e262-pdf.pdf"
name="http://developer.netscape.com/docs/javascript/e262-pdf.pdf">
for the complete ECMAscript specification.

<sect1>Tcl
<p>

Tcl/Tk started using Unicode as its base character set with version 8.1.
Its internal representation for strings is UTF-8. It supports the \uXXXX
notation for Unicode characters. See
<htmlurl url="http://dev.scriptics.com/doc/howto/i18n.html"
name="http://dev.scriptics.com/doc/howto/i18n.html">.

<sect1>Perl
<p>

Perl 5.6 stores strings internally in UTF-8 format, if you write
<tscreen><verb>
use utf8;
</verb></tscreen>
at the beginning of your script. <tt>length()</tt> returns the number of
characters of a string. For details, see the Perl-i18n FAQ at
<htmlurl url="http://rf.net/~james/perli18n.html"
name="http://rf.net/~james/perli18n.html">.

Support for other (non-8-bit) encodings is available through the iconv
interface module
<htmlurl url="http://cpan.perl.org/modules/by-module/Text/Text-Iconv-1.1.tar.gz"
name="http://cpan.perl.org/modules/by-module/Text/Text-Iconv-1.1.tar.gz">.

<sect1>Related reading
<p>

Tomohiro Kubota has written an introduction to internationalization
<htmlurl url="http://www.debian.org/doc/manuals/intro-i18n/"
name="http://www.debian.org/doc/manuals/intro-i18n/">.
The emphasis of his document is on writing software that runs in any locale,
using the locale's encoding.

<sect>Other sources of information
<p>

<sect1>Mailing lists
<p>

Broader audiences can be reached at the following mailing lists.

Note that where I write `at', you should write `@'. (Anti-spam device.)

<sect2>linux-utf8
<p>

Address: <tt>linux-utf8</tt> at <tt>nl.linux.org</tt>

This mailing list is about internationalization with Unicode, and covers
a broad range of topics from the keyboard driver to the X11 fonts.

Archives are at
<htmlurl url="http://mail.nl.linux.org/linux-utf8/"
name="http://mail.nl.linux.org/linux-utf8/">.

To subscribe, send a message to <tt>majordomo</tt> at <tt>nl.linux.org</tt>
with the line "subscribe linux-utf8" in the body.

<sect2>li18nux
<p>

Address: <tt>linux-i18n</tt> at <tt>sun.com</tt>

This mailing list is focused on organizing internationalization work on
Linux, and arranging meetings between people.

To subscribe, fill in the form at
<htmlurl url="http://www.li18nux.org/"
name="http://www.li18nux.org/">
and send it to <tt>linux-i18n-request</tt> at <tt>sun.com</tt>.

<sect2>unicode
<p>

Address: <tt>unicode</tt> at <tt>unicode.org</tt>

This mailing list is focused on the standardization and continuing development
of the Unicode standard, and related technologies, such as Bidi and sorting
algorithms.

Archives are at
<htmlurl url="ftp://ftp.unicode.org/Public/MailArchive/"
name="ftp://ftp.unicode.org/Public/MailArchive/">,
but they are not regularly updated.

For subscription information, see
<htmlurl url="http://www.unicode.org/unicode/consortium/distlist.html"
name="http://www.unicode.org/unicode/consortium/distlist.html">.

<sect2>X11 internationalization
<p>

Address: <tt>i18n</tt> at <tt>xfree86.org</tt>

This mailing list addresses the people who work on better internationalization
of the X11/XFree86 system.

Archives are at
<htmlurl url="http://devel.xfree86.org/archives/i18n/"
name="http://devel.xfree86.org/archives/i18n/">.

To subscribe, send mail to the friendly person at <tt>i18n-request</tt> at
<tt>xfree86.org</tt> explaining your motivation.

<sect2>X11 fonts
<p>

Address: <tt>fonts</tt> at <tt>xfree86.org</tt>

This mailing list addresses the people who work on Unicode fonts and the
font subsystem for the X11/XFree86 system.

Archives are at
<htmlurl url="http://devel.xfree86.org/archives/fonts/"
name="http://devel.xfree86.org/archives/fonts/">.

To subscribe, send mail to the overworked person at <tt>fonts-request</tt> at
<tt>xfree86.org</tt> explaining your motivation.

</article>