<!doctype linuxdoc system>
<article>
<title>The Unicode HOWTO
<author>Bruno Haible,
<htmlurl url="mailto:haible@clisp.cons.org"
name="&lt;haible@clisp.cons.org&gt;">
<date>v1.0, 23 January 2001
<abstract>
This document describes how to change your Linux system so it uses UTF-8
as text encoding.
This is work in progress. Any tips, patches, pointers, and URLs are very welcome.
</abstract>
<toc>
<sect>Introduction
<p>
<sect1>Why Unicode?
<p>
People in different countries use different characters to represent the
words of their native languages. Nowadays most applications, including
email systems and web browsers, are 8-bit clean, i.e. they can operate on
and display text correctly provided that it is represented in an 8-bit
character set, like ISO-8859-1.
There are far more than 256 characters in the world (think of Cyrillic,
Hebrew, Arabic, Chinese, Japanese, Korean and Thai), and new characters
are invented from time to time. The problems that come up for users are:
<itemize>
<item>
It is impossible to store text with characters from different character
sets in the same document. For example, I can cite Russian papers in
a German or French publication if I use TeX, xdvi and PostScript,
but I cannot do it in plain text.
<item>
As long as every document has its own character set, and recognition
of the character set is not automatic, manual user intervention is
inevitable. For example, in order to view the homepage of the
XTeamLinux distribution
<htmlurl url="http://www.xteamlinux.com.cn/"
name="http://www.xteamlinux.com.cn/">
I had to tell Netscape that the web page is coded in GB2312.
<item>
New symbols like the Euro are being invented. ISO has issued a new
standard ISO-8859-15, which is mostly like ISO-8859-1 except that it
removes some rarely used characters (such as the old currency sign) and
replaces them with others, including the Euro sign. If users adopt this standard, they
have documents in different character sets on their disk, and they
start having to think about it daily. But computers should make things
simpler, not more complicated.
</itemize>
The solution to this problem is the adoption of a world-wide usable character
set. This character set is Unicode
<htmlurl url="http://www.unicode.org/"
name="http://www.unicode.org/">.
For more info about Unicode, do `<tt>man 7 unicode</tt>' (manpage contained
in the man-pages-1.20 package).
<sect1>Unicode encodings
<p>
This reduces the user's problem of dealing with character sets to a technical
problem: how to transport Unicode characters using 8-bit bytes?
8-bit units are the smallest addressing units of most computers and also the
unit used by TCP/IP network connections. The use of 1 byte to represent
1 character is, however, an accident of history, caused by the fact that
computer development started in Europe and the U.S. where 96 characters were
found to be sufficient for a long time.
There are basically four ways to encode Unicode characters in bytes:
<descrip>
<tag>UTF-8</tag>
128 characters are encoded using 1 byte (the ASCII characters).
1920 characters are encoded using 2 bytes (Roman, Greek, Cyrillic,
Coptic, Armenian, Hebrew, Arabic characters).
63488 characters are encoded using 3 bytes (Chinese and Japanese among
others).
The other 2147418112 characters (not assigned yet) can be encoded
using 4, 5 or 6 bytes.
For more info about UTF-8, do `<tt>man 7 utf-8</tt>' (manpage contained
in the man-pages-1.20 package).
<tag>UCS-2</tag>
Every character is represented as two bytes.
This encoding can only represent the first 65536 Unicode characters.
<tag>UTF-16</tag>
This is an extension of UCS-2 which can represent 1112064 Unicode
characters. The first 65536 Unicode characters are represented as two
bytes, the other ones as four bytes.
<tag>UCS-4</tag>
Every character is represented as four bytes.
</descrip>
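The UTF-8 byte counts above can be checked from a shell. The following is
only a minimal sketch: the helper name <tt>utf8_bytes</tt> is made up for this
illustration, and the characters are written as octal escapes of their UTF-8
bytes, because a POSIX <tt>printf</tt> does not guarantee other escapes.
<tscreen><verb>
# Print the number of bytes in a string given as printf-style
# octal escapes. The argument is deliberately used as the printf
# format so that the escapes are expanded.
utf8_bytes() {
    printf "$1" | wc -c | tr -d ' '
}

utf8_bytes 'A'              # U+0041, ASCII letter
utf8_bytes '\303\244'       # U+00E4, a with diaeresis
utf8_bytes '\316\261'       # U+03B1, Greek alpha
utf8_bytes '\342\202\254'   # U+20AC, Euro sign
</verb></tscreen>
With a standard <tt>printf</tt> and <tt>wc</tt>, this should print 1, 2, 2
and 3, matching the 1-byte, 2-byte and 3-byte ranges listed above.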
The space requirements for encoding a text, compared to encodings currently
in use (8 bits per character for European languages, more for
Chinese/Japanese/Korean), are as follows. This has an influence on disk
storage space and network download speed (when no form of compression is
used).
<descrip>
<tag>UTF-8</tag>
No change for US ASCII, just a few percent more for ISO-8859-1,
50% more for Chinese/Japanese/Korean, 100% more for Greek and Cyrillic.
<tag>UCS-2 and UTF-16</tag>
No change for Chinese/Japanese/Korean. 100% more for
US ASCII and ISO-8859-1, Greek and Cyrillic.
<tag>UCS-4</tag>
100% more for Chinese/Japanese/Korean. 300% more for US ASCII and
ISO-8859-1, Greek and Cyrillic.
</descrip>
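These ratios can be reproduced with <tt>iconv</tt>. The sketch below assumes
an <tt>iconv</tt> program that knows the encoding names <tt>UCS-2BE</tt> and
<tt>UCS-4BE</tt> (the one in glibc does); the helper name
<tt>encoded_size</tt> and the sample strings are made up for this illustration.
<tscreen><verb>
# Print the size in bytes of a UTF-8 string (given as printf-style
# octal escapes) after conversion to the encoding named in $2.
encoded_size() {
    printf "$1" | iconv -f UTF-8 -t "$2" | wc -c | tr -d ' '
}

ascii='Hello'                                      # 5 ASCII characters
greek='\316\261\316\262\316\263\316\264\316\265'   # 5 Greek letters

for enc in UTF-8 UCS-2BE UCS-4BE; do
    echo "$enc: ASCII $(encoded_size "$ascii" "$enc")," \
         "Greek $(encoded_size "$greek" "$enc")"
done
</verb></tscreen>
In an 8-bit character set each of the two sample strings occupies 5 bytes.
The loop should report 5 and 10 bytes for UTF-8, 10 and 10 for UCS-2, and
20 and 20 for UCS-4.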
Given the penalty for US and European documents caused by UCS-2, UTF-16, and
UCS-4, it seems unlikely that these encodings have a potential for wide-scale
use. The Microsoft Win32 API has supported the UCS-2 encoding since 1995 (at
least), yet this encoding has not been widely adopted for documents - SJIS
remains prevalent in Japan.
UTF-8 on the other hand has the potential for wide-scale use, since it
doesn't penalize US and European users, and since many text processing
programs don't need to be changed for UTF-8 support.
In the following, we will describe how to change your Linux system so
it uses UTF-8 as text encoding.
<sect2>Footnotes for C/C++ developers
<p>
The Microsoft Win32 approach makes it easy for developers to produce
Unicode versions of their programs: You "#define UNICODE" at the top
of your program and then change many occurrences of `<tt>char</tt>' to
`<tt>TCHAR</tt>', until your program compiles without warnings. The problem
with it is that you end up with two versions of your program: one which
understands UCS-2 text but no 8-bit encodings, and one which understands
only old 8-bit encodings.
Moreover, there is an endianness issue with UCS-2 and UCS-4. The IANA
character set registry
<htmlurl url="http://www.isi.edu/in-notes/iana/assignments/character-sets"
name="http://www.isi.edu/in-notes/iana/assignments/character-sets">
says about ISO-10646-UCS-2: "this needs to specify network byte order: the
standard does not specify". Network byte order is big endian. And RFC 2152
is even clearer: "ISO/IEC 10646-1:1993(E) specifies that when characters the
UCS-2 form are serialized as octets, that the most significant octet appear
first."
Microsoft, by contrast, recommends in its C/C++ development tools
using machine-dependent endianness (i.e. little endian on ix86 processors)
and either a byte-order mark at the beginning of the document, or some
statistical heuristics(!).
The UTF-8 approach on the other hand keeps `<tt>char*</tt>' as the standard C
string type. As a result, your program will handle US ASCII text,
independently of any environment variables, and will handle both
ISO-8859-1 and UTF-8 encoded text provided the LANG environment variable
is set accordingly.
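The endianness issue mentioned above becomes visible when a single character
is serialized in both byte orders. This is a small sketch, assuming an
<tt>iconv</tt> that knows the names <tt>UCS-2BE</tt> and <tt>UCS-2LE</tt>;
the helper name <tt>hex_bytes</tt> is made up for this illustration.
<tscreen><verb>
# Print the bytes of U+0041 (LATIN CAPITAL LETTER A) in the
# encoding named in $1, as hexadecimal digits.
hex_bytes() {
    printf 'A' | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

hex_bytes UCS-2BE   # big endian, i.e. network byte order
hex_bytes UCS-2LE   # little endian, as on ix86 processors
</verb></tscreen>
The first call should print 0041 and the second 4100. A reader who guesses
the wrong byte order sees U+4100 instead of U+0041, which is why a
byte-order mark or some other convention is needed for UCS-2, but not for
UTF-8.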
<sect1>Related resources
<p>
Markus Kuhn's very up-to-date resource list:
<itemize>
<item>
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
name="http://www.cl.cam.ac.uk/~mgk25/unicode.html">
<item>
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html"
name="http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html">
</itemize>
Roman Czyborra's overview of Unicode, UTF-8 and UTF-8 aware programs:
<htmlurl url="http://czyborra.com/utf/#UTF-8"
name="http://czyborra.com/utf/#UTF-8">
Some example UTF-8 files:
<itemize>
<item>
In Markus Kuhn's ucs-fonts package:
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt"
name="quickbrown.txt">,
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt"
name="UTF-8-test.txt">,
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt"
name="UTF-8-demo.txt">.
<item>
<htmlurl url="http://www.columbia.edu/kermit/utf8.html"
name="http://www.columbia.edu/kermit/utf8.html">
<item>
<htmlurl url="ftp://ftp.cs.su.oz.au/gary/x-utf8.html"
name="ftp://ftp.cs.su.oz.au/gary/x-utf8.html">
<item>
The file <tt>iso10646</tt> in Kosta Kostis' trans-1.1.1 package
<htmlurl url="ftp://ftp.nid.ru/pub/os/unix/misc/trans111.tar.gz"
name="ftp://ftp.nid.ru/pub/os/unix/misc/trans111.tar.gz">
<item>
<htmlurl url="ftp://ftp.dante.de/pub/tex/info/lwc/apc/utf8.html"
name="ftp://ftp.dante.de/pub/tex/info/lwc/apc/utf8.html">
<item>
<htmlurl url="http://www.cogsci.ed.ac.uk/~richard/unicode-sample.html"
name="http://www.cogsci.ed.ac.uk/~richard/unicode-sample.html">
</itemize>
<sect>Display setup
<p>
We assume you have already adapted your Linux console and X11 configuration
to your keyboard and locale. This is explained in the Danish/International
HOWTO, and in the other national HOWTOs: Finnish, French, German, Italian,
Polish, Slovenian, Spanish, Cyrillic, Hebrew, Chinese, Thai, Esperanto. But
please do not follow the advice given in the Thai HOWTO, to pretend you
were using ISO-8859-1 characters (U0000..U00FF) when what you are typing
are actually Thai characters (U0E01..U0E5B). Doing so will only cause
problems when you switch to Unicode.
<sect1>Linux console
<p>
I'm not talking much about the Linux console here, because on those machines
on which I don't have xdm running, I use it only to type my login name,
my password, and "xinit".
Anyway, the kbd-0.99 package
<htmlurl url="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz"
name="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz">
and a heavily extended version, the console-tools-0.2.3 package
<htmlurl url="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-tools-0.2.3.tar.gz"
name="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-tools-0.2.3.tar.gz">
contain, in the kbd-0.99/src/ (or console-tools-0.2.3/screenfonttools/)
directory, two programs: `unicode_start' and `unicode_stop'. When you call
`unicode_start', the console's screen output is interpreted as UTF-8. Also,
the keyboard is put into Unicode mode (see "man kbd_mode"). In this mode,
Unicode characters typed as Alt-x1 ... Alt-xn (where x1,...,xn are digits on
the numeric keypad) will be emitted in UTF-8. If your keyboard or, more
precisely, your normal keymap has non-ASCII letter keys (like the German
Umlaute) which you would like to be CapsLockable, you need to apply the kernel
patch
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.2.9-keyboard.diff"
name="linux-2.2.9-keyboard.diff">
or
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.3.12-keyboard.diff"
name="linux-2.3.12-keyboard.diff">.
You will want to display characters from different scripts on the same
screen. For this, you need a Unicode console font. The
<htmlurl url="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz"
name="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz">
and
<htmlurl url="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-data-1999.08.29.tar.gz"
name="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-data-1999.08.29.tar.gz">
packages contain a font (LatArCyrHeb-{08,14,16,19}.psf) which
covers Latin, Cyrillic, Hebrew, Arabic scripts. It covers ISO 8859 parts
1,2,3,4,5,6,8,9,10 all at once. To install it, copy it to
/usr/lib/kbd/consolefonts/ and execute
"/usr/bin/setfont /usr/lib/kbd/consolefonts/LatArCyrHeb-14.psf".
A more flexible approach is given by Dmitry Yu. Bolkhovityanov
<htmlurl url="mailto:D.Yu.Bolkhovityanov@inp.nsk.su"
name="&lt;D.Yu.Bolkhovityanov@inp.nsk.su&gt;">
in <htmlurl url="http://www.inp.nsk.su/~bolkhov/files/fonts/univga/index.html"
name="http://www.inp.nsk.su/~bolkhov/files/fonts/univga/index.html">
and <htmlurl url="http://www.inp.nsk.su/~bolkhov/files/fonts/univga/uni-vga.tgz"
name="http://www.inp.nsk.su/~bolkhov/files/fonts/univga/uni-vga.tgz">.
To work around the constraint that a VGA font can only cover 512 characters simultaneously,
he provides a rich Unicode font (2279 characters, covering Latin, Greek, Cyrillic, Hebrew,
Armenian, IPA, math symbols, arrows, and more) in the typical 8x16 size and a script
which allows extracting any 512 characters as a console font.
If you want cut&amp;paste to work with UTF-8 consoles, you need the patch
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.3.12-console.diff"
name="linux-2.3.12-console.diff">
from Edmund Thomas Grimley Evans and Stanislav Voronyi.
In April 2000, Edmund Thomas Grimley Evans
<htmlurl url="mailto:edmundo@rano.org"
name="&lt;edmundo@rano.org&gt;">
implemented a UTF-8 console terminal emulator. It uses Unicode fonts
and relies on the Linux frame buffer device.
<sect1>X11 Foreign fonts
<p>
Don't hesitate to install Cyrillic, Chinese, Japanese etc. fonts. Even
if they are not Unicode fonts, they will help in displaying Unicode
documents: at least Netscape Communicator 4 and Java will make use of
foreign fonts when available.
The following programs are useful when installing fonts:
<itemize>
<item>
"mkfontdir directory"
prepares a font directory for use by the X server, needs to be executed
after installing fonts in a directory.
<item>
"xset -q | sed -e '1,/^Font Path:/d' | sed -e '2,$d' -e 's/^ //'"
displays the X server's current font path.
<item>
"xset fp+ directory"
adds a directory to the X server's current font path.
To add a directory permanently, add a "FontPath" line to your
/etc/XF86Config file, in section "Files".
<item>
"xset fp rehash"
needs to be executed after calling mkfontdir on a directory that is
already contained in the X server's current font path.
<item>
"xfontsel"
allows you to browse the installed fonts by selecting various font
properties.
<item>
"xlsfonts -fn fontpattern"
lists all fonts matching a font pattern. Also displays various font
properties. In particular, "xlsfonts -ll -fn font" lists the font
properties CHARSET_REGISTRY and CHARSET_ENCODING, which together
determine the font's encoding.
<item>
"xfd -fn font"
displays a font page by page.
</itemize>
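The CHARSET_REGISTRY and CHARSET_ENCODING properties also appear as the last
two fields of a font's full X name, so the encoding can be read off the names
that <tt>xlsfonts</tt> prints. The following is a minimal sketch; the helper
name <tt>xlfd_charset</tt> is made up for this illustration, and it assumes a
complete 14-field font name rather than an alias like "fixed".
<tscreen><verb>
# Print CHARSET_REGISTRY-CHARSET_ENCODING of a full X font name.
# The name starts with "-", so cut sees an empty first field and
# the two charset fields are numbers 14 and 15.
xlfd_charset() {
    printf '%s\n' "$1" | cut -d- -f14,15
}

xlfd_charset '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'
</verb></tscreen>
For the font name shown, this should print iso10646-1, which marks a font
with Unicode encoding.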
The following fonts are freely available (not a complete list):
<itemize>
<item>
The ones contained in XFree86, sometimes packaged in separate packages.
For example, SuSE has only normal 75dpi fonts in the base `xf86' package.
The other fonts are in the packages `xfnt100', `xfntbig', `xfntcyr',
`xfntscl'.
<item>
The Emacs international fonts,
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/intlfonts/intlfonts-1.2.tar.gz"
name="ftp://ftp.gnu.org/pub/gnu/intlfonts/intlfonts-1.2.tar.gz">
As already mentioned, they are useful even if you prefer XEmacs to
GNU Emacs or don't use any Emacs at all.
</itemize>
<sect1>X11 Unicode fonts
<p>
Applications wishing to display text belonging to different scripts (like
Cyrillic and Greek) at the same time, can do so by using different X fonts
for the various pieces of text. This is what Netscape Communicator and Java
do. However, this approach is more complicated, because instead of working
with `Font' and `XFontStruct', the programmer has to deal with `XFontSet',
and also because not all fonts in the font set need to have the same
dimensions.
<itemize>
<item>
Markus Kuhn has assembled fixed-width 75dpi fonts with Unicode encoding
covering Latin, Greek, Cyrillic, Armenian, Georgian, Hebrew scripts and
many symbols.
They cover ISO 8859 parts 1,2,3,4,5,7,8,9,10,13,14,15,16 all at once.
These fonts are required for running xterm in UTF-8 mode. They are now
contained in XFree86 4.0.1, therefore you need to install them manually
only if you have an older XFree86 3.x version.
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz"
name="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz">.
<item>
Markus Kuhn has also assembled double-width fixed 75dpi fonts with Unicode
encoding covering Chinese, Japanese and Korean. These fonts are contained
in XFree86 4.0.1 as well.
<htmlurl url="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts-asian.tar.gz"
name="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts-asian.tar.gz">
<item>
Roman Czyborra has assembled an 8x16 / 16x16 75dpi font with Unicode encoding
covering a huge part of Unicode. Download unifont.hex.gz and hex2bdf from
<htmlurl url="http://czyborra.com/unifont/"
name="http://czyborra.com/unifont/">.
It is not fixed-width: 8 pixels wide for European characters, 16 pixels wide
for Chinese characters. Installation instructions:
<tscreen><verb>
$ gunzip unifont.hex.gz
$ hex2bdf < unifont.hex > unifont.bdf
$ bdftopcf -o unifont.pcf unifont.bdf
$ gzip -9 unifont.pcf
# cp unifont.pcf.gz /usr/X11R6/lib/X11/fonts/misc
# cd /usr/X11R6/lib/X11/fonts/misc
# mkfontdir
# xset fp rehash
</verb></tscreen>
<item>
Primoz Peterlin has assembled an ETL font family covering Latin, Greek,
Cyrillic, Armenian, Georgian, Hebrew scripts.
<htmlurl url="ftp://ftp.x.org/contrib/fonts/etl-unicode.tar.gz"
name="ftp://ftp.x.org/contrib/fonts/etl-unicode.tar.gz">
Use the "bdftopcf" program in order to install it.
<item>
Mark Leisher has assembled a proportional, 17 pixel high (12 point), font,
called ClearlyU, covering Latin, Greek, Cyrillic, Armenian, Georgian, Hebrew,
Thai, Laotian scripts.
<htmlurl url="http://crl.nmsu.edu/~mleisher/cu.html"
name="http://crl.nmsu.edu/~mleisher/cu.html">.
Installation instructions:
<tscreen><verb>
$ bdftopcf -o cu12.pcf cu12.bdf
$ gzip -9 cu12.pcf
# cp cu12.pcf.gz /usr/X11R6/lib/X11/fonts/misc
# cd /usr/X11R6/lib/X11/fonts/misc
# mkfontdir
# xset fp rehash
</verb></tscreen>
</itemize>
<!-- Not useful now: The `ps2pk' and `pk2bm' programs contained in the teTeX
distribution can convert existing Postscript/Type1 and TeX metafont fonts
to .bdf format for use with X11. -->
<sect1>Unicode xterm
<p>
xterm is part of X11R6 and XFree86, but is maintained separately by Tom
Dickey.
<htmlurl url="http://www.clark.net/pub/dickey/xterm/xterm.html"
name="http://www.clark.net/pub/dickey/xterm/xterm.html">
Newer versions (patch level 146 and above) contain support for converting
keystrokes to UTF-8 before sending them to the application running in the
xterm, and for displaying Unicode characters that the application outputs
as UTF-8 byte sequences. They also contain support for double-wide characters
(mostly CJK ideographs) and combining characters, contributed by Robert Brady
<htmlurl url="mailto:robert@suse.co.uk"
name="&lt;robert@suse.co.uk&gt;">.
To get a UTF-8 xterm running, you need to:
<itemize>
<item>
Fetch
<htmlurl url="http://www.clark.net/pub/dickey/xterm/xterm.tar.gz"
name="http://www.clark.net/pub/dickey/xterm/xterm.tar.gz">,
<item>
Configure it by calling "./configure --enable-wide-chars ...", then
compile and install it.
<item>
Have a Unicode fixed-width font installed. Markus Kuhn's ucs-fonts.tar.gz
(see above) is made for this.
<item>
Start "xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'".
The option "-u8" turns on Unicode and UTF-8 handling. The font designated
by the long "-fn" option is Markus Kuhn's Unicode font. Without this option,
the default font called "fixed" would be used, an ISO-8859-1 6x13 font.
<item>
Take a look at the sample files contained in Markus Kuhn's ucs-fonts
package:
<tscreen><verb>
$ cd .../ucs-fonts
$ cat quickbrown.txt
$ cat UTF-8-demo.txt
</verb></tscreen>
You should be seeing (among others) Greek and Russian characters.
<item>
To make xterm come up with UTF-8 handling each time it is started,
add the lines
<tscreen><verb>
xterm*utf8: 1
xterm*VT100*font: -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
xterm*VT100*wideFont: -misc-fixed-medium-r-normal-ja-13-125-75-75-c-120-iso10646-1
xterm*VT100*boldFont: -misc-fixed-bold-r-semicondensed--13-120-75-75-c-60-iso10646-1
</verb></tscreen>
to your $HOME/.Xdefaults (for yourself only).
For CJK text processing with double-width characters, the following
settings are probably better:
<tscreen><verb>
xterm*VT100*font: -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
xterm*VT100*wideFont: -Misc-Fixed-Medium-R-Normal-ja-18-120-100-100-C-180-ISO10646-1
</verb></tscreen>
I don't recommend changing
the system-wide /usr/X11R6/lib/X11/app-defaults/XTerm, because then your
changes will be erased next time you upgrade to a new XFree86 version.
</itemize>
<!-- Development versions of xterm, by Robert Brady, are available from
http://susu.org.uk/~robert/xterm/
currently xterm-150 + http://susu.org.uk/~robert/xterm-23.diff.gz
but I don't know whether its stable enough (e.g. whether the memory
leak is fixed now).
-->
<sect1>TrueType fonts
<p>
The fonts mentioned above are fixed size and not scalable. For some
applications, especially printing, high resolution fonts are necessary,
though. The most important type of scalable, high resolution fonts are
TrueType fonts.
<!-- COMMENTED OUT.
Two other categories of scalable fonts are Postscript fonts (Adobe Type1,
Type3, Type42) and TeX metafonts. Converters exist between both: ps2mf
and mf2ps. But, as Juliusz Chroboczek writes in
http://www.dcs.ed.ac.uk/home/jec/programs/xfsft/renderers.html
"While the quality of the rasterisation of TrueType fonts depends solely
on the font used, in the case of Type 1 fonts it depends greatly on the
rasteriser. As it is significantly easier to produce high-quality Type 1
fonts than TrueType fonts, it is unfortunate that there is currently no
_good_ Free Type 1 rasteriser available."
But somehow, there are now Unicode TrueType fonts available, whereas
I don't know of any Unicode Postscript/metafont fonts.
-->
They are currently supported by
<itemize>
<item>
XFree86 4.0.1; you need to add the line
<tscreen><verb>
Load "freetype"
</verb></tscreen>
or
<tscreen><verb>
Load "xtt"
</verb></tscreen>
to the <tt>"Module"</tt> section of your XF86Config file.
<item>
The display engines of other operating systems.
<item>
The yudit editor, see below, and its printing engine.
<!-- COMMENTED OUT. ghostscript use of TrueType fonts is irrelevant,
because they can only be used as a substitute for TimesRoman etc.,
and not enable the use of Unicode characters.
Furthermore, http://www.dcs.ed.ac.uk/home/jec/programs/xfsft/renderers.html
shows that ghostscript does not correctly support TrueType fonts at
low resolutions.
<item>
The ghostscript Postscript rasterizer.
-->
</itemize>
Some no-cost TrueType fonts with large Unicode coverage are
<descrip>
<tag>Bitstream Cyberbit</tag>
Covers Roman, Cyrillic, Greek, Hebrew, Arabic, combining diacritical marks,
Chinese, Korean, Japanese, and more.
Downloadable from
<htmlurl url="ftp://ftp.netscape.com/pub/communicator/extras/fonts/windows/Cyberbit.ZIP"
name="ftp://ftp.netscape.com/pub/communicator/extras/fonts/windows/Cyberbit.ZIP">.
It is free for non-commercial purposes.
<tag>Microsoft Arial</tag>
Covers Roman, Cyrillic, Greek, Hebrew, Arabic, some combining diacritical
marks, Vietnamese.
Downloadable; look on a search engine for ftp-able files called
<tt>arial.ttf</tt>, <tt>ariali.ttf</tt>, <tt>arialbd.ttf</tt>,
<tt>arialbi.ttf</tt>.
<!-- COMMENTED OUT. Since redistribution of these fonts is not allowed
(Microsoft wants everyone to use a Windows platform in order to unpack
their arial32.exe), we cannot publish the URL
ftp://ftp.bora.net/pub/sw/screen-saver/themes/fonts/
-->
<tag>Lucida Sans Unicode</tag>
Covers Roman, Cyrillic, Greek, Hebrew, combining diacritical marks.
Download: contained in IBM's JDK 1.3.0 for Linux, at
<htmlurl url="http://www.ibm.com/java/jdk/linux130/"
name="http://www.ibm.com/java/jdk/linux130/">,
or directly downloadable as <tt>LucidaSansRegular.ttf</tt> and
<tt>LucidaSansOblique.ttf</tt> from
<htmlurl url="ftp://ftp.maths.tcd.ie/Linux/opt/IBMJava2-13/jre/lib/fonts/"
name="ftp://ftp.maths.tcd.ie/Linux/opt/IBMJava2-13/jre/lib/fonts/">.
<tag>Arphic</tag>
Cover Chinese (both traditional and simplified).
Download: at
<htmlurl url="ftp://ftp.gnu.org/non-gnu/chinese-fonts-truetype/"
name="ftp://ftp.gnu.org/non-gnu/chinese-fonts-truetype/">.
These fonts are truly free.
<!-- They are not Unicode fonts, but I mention them nevertheless because
they are popular in the Republic of China, and they have a good license.
-->
</descrip>
Download locations for these and other TrueType fonts can be found at
Christoph Singer's list of freely downloadable Unicode TrueType fonts
<htmlurl url="http://www.ccss.de/slovo/unifonts.htm"
name="http://www.ccss.de/slovo/unifonts.htm">.
TrueType fonts are installed similarly to fixed size fonts, except that
they go in a separate directory, and that <tt>ttmkfdir</tt> must be
called before <tt>mkfontdir</tt>:
<tscreen><verb>
# mkdir -p /usr/X11R6/lib/X11/fonts/truetype
# cp /somewhere/Cyberbit.ttf ... /usr/X11R6/lib/X11/fonts/truetype
# cd /usr/X11R6/lib/X11/fonts/truetype
# ttmkfdir > fonts.scale
# mkfontdir
# xset fp rehash
</verb></tscreen>
TrueType fonts can be converted to low resolution, non-scalable X11 fonts by
use of Mark Leisher's ttf2bdf utility
<htmlurl url="ftp://crl.nmsu.edu/CLR/multiling/General/ttf2bdf-2.8-LINUX.tar.gz"
name="ftp://crl.nmsu.edu/CLR/multiling/General/ttf2bdf-2.8-LINUX.tar.gz">.
For example, to generate a proportional Unicode font for use with cooledit:
<tscreen><verb>
# cd /usr/X11R6/lib/X11/fonts/local
# ttf2bdf ../truetype/Cyberbit.ttf > cyberbit.bdf
# bdftopcf -o cyberbit.pcf cyberbit.bdf
# gzip -9 cyberbit.pcf
# mkfontdir
# xset fp rehash
</verb></tscreen>
More information about TrueType fonts can be found in the Linux TrueType HOWTO
<htmlurl url="http://www.moisty.org/~brion/linux/TrueType-HOWTO.html"
name="http://www.moisty.org/~brion/linux/TrueType-HOWTO.html">.
<sect1>Miscellaneous
<p>
A small program which tests whether a Linux console or xterm is in UTF-8 mode
can be found in the
<htmlurl url="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/x-lt-1.24.tar.gz"
name="ftp://sunsite.unc.edu/pub/Linux/system/keyboards/x-lt-1.24.tar.gz">
package by Ricardas Cepas, files testUTF-8.c and testUTF8.c. Most applications
should not use this, however: they should look at the environment variables,
see section "Locale environment variables".
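For a shell script, looking at the environment variables could be sketched
as follows. This is an illustration only, not code from any package mentioned
here: the function name <tt>locale_is_utf8</tt> is made up, and it merely
inspects the locale name, so a locale whose name does not mention its
character set is reported as not UTF-8. Where the <tt>locale</tt> program is
available, "locale charmap" is the more reliable test.
<tscreen><verb>
# Succeed if the locale name that governs character handling
# mentions UTF-8. Precedence is LC_ALL, then LC_CTYPE, then LANG;
# an empty value counts as unset.
locale_is_utf8() {
    loc=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}
    case $loc in
        *UTF-8* | *utf-8* | *UTF8* | *utf8*) return 0 ;;
        *) return 1 ;;
    esac
}

if locale_is_utf8; then
    echo "UTF-8 locale"
else
    echo "not a UTF-8 locale"
fi
</verb></tscreen>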
<sect>Locale setup
<p>
<sect1>Files &amp; the kernel
<p>
You can already use any Unicode characters in file names. No kernel
or file utilities need modifications. This is because file names in the
kernel can be anything not containing a null byte, and '/' is used to
delimit subdirectories. When encoded using UTF-8, non-ASCII characters
will never be encoded using null bytes or slashes. All that happens is
that file and directory names occupy more bytes than they contain characters.
For example, a filename consisting of five Greek characters will appear
to the kernel as a 10-byte filename. The kernel does not know (and does
not need to know) that these bytes are displayed as Greek.
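The example with five Greek characters can be verified with standard tools.
In this sketch the variable <tt>name</tt> holds the letters alpha through
epsilon as octal escapes of their UTF-8 bytes; the choice of letters is
arbitrary.
<tscreen><verb>
# Five Greek letters, alpha through epsilon, in UTF-8.
name='\316\261\316\262\316\263\316\264\316\265'

# Length of the name in bytes, as the kernel sees it.
printf "$name" | wc -c | tr -d ' '

# Number of bytes in the name that are NUL (octal 000) or
# '/' (octal 057), the only two bytes the kernel treats specially.
printf "$name" | tr -c -d '\000\057' | wc -c | tr -d ' '
</verb></tscreen>
This should print 10 and then 0: ten bytes, none of which is a null byte or
a slash.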
This is the general theory, as long as your files stay inside Linux. On
filesystems which are used from other operating systems, you have mount
options to control conversion of filenames to/from UTF-8:
<itemize>
<item>
The "vfat" filesystem has a mount option "utf8".
See <htmlurl url="file:/usr/src/linux/Documentation/filesystems/vfat.txt"
name="file:/usr/src/linux/Documentation/filesystems/vfat.txt">.
When you give an "iocharset" mount option different from the default
(which is "iso8859-1"), the results with and without "utf8" are not
consistent. Therefore I don't recommend the "iocharset" mount option.
<item>
The "msdos", "umsdos" filesystems have the same mount option, but it
appears to have no effect.
<item>
The "iso9660" filesystem has a mount option "utf8".
See <htmlurl url="file:/usr/src/linux/Documentation/filesystems/isofs.txt"
name="file:/usr/src/linux/Documentation/filesystems/isofs.txt">.
<item>
Since Linux 2.2.x kernels, the "ntfs" filesystem has a mount option
"utf8". See
<htmlurl url="file:/usr/src/linux/Documentation/filesystems/ntfs.txt"
name="file:/usr/src/linux/Documentation/filesystems/ntfs.txt">.
</itemize>
The other filesystems (nfs, smbfs, ncpfs, hpfs, etc.) don't convert
filenames; therefore they support Unicode file names in UTF-8 encoding only
if the other operating system supports them.
Recall that to enable a mount option for all future remounts, you add it to
the fourth column of the corresponding /etc/fstab line.
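As an illustration, an /etc/fstab line for a vfat partition with the "utf8"
option might look like the following. The device <tt>/dev/hda1</tt> and the
mount point <tt>/dos</tt> are placeholders; substitute your own.
<tscreen><verb>
# device     mount point   type   options   dump   pass
/dev/hda1    /dos          vfat   utf8      0      0
</verb></tscreen>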
<!-- COMMENTED OUT. It is questionable whether the kernel needs line editing
capabilities that go beyond those of ASCII. Multibyte and double-width
are feasible in kernel space, but combining characters and Bidi are not.
<sect1>Ttys &amp; the kernel
<p>
Ttys are some kind of bidirectional pipes between two program, allowing
fancy features like echoing or command-line editing. When in an xterm,
you execute the "cat" command without arguments, you can enter and edit
any number of lines, and they will be echoed back line by line. The
kernel's editing actions are not correct, especially the Backspace (erase)
key and the tab key are not treated correctly.
To fix this, you need to:
<itemize>
<item>
apply the kernel patch
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.0.35-tty.diff"
name="linux-2.0.35-tty.diff"> or
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.2.9-tty.diff"
name="linux-2.2.9-tty.diff"> or
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-2.3.12-tty.diff"
name="linux-2.3.12-tty.diff">
and recompile your kernel,
<item>
if you are using glibc2, apply the patch
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/glibc211-tty.diff"
name="glibc211-tty.diff">
and recompile your libc (or if you are not so adventurous, it is sufficient
to patch an already installed include file:
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/glibc-tty.diff"
name="glibc-tty.diff">),
<item>
apply the patch <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/stty.diff"
name="stty.diff">
to GNU sh-utils-1.16b, and rebuild the "stty" program, then test it using
"stty -a" and "stty iutf8".
<item>
add the command "stty iutf8" to the "unicode_start" script, and
add the command "stty -iutf8" to the "unicode_stop script.
<item>
apply the patch <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/xterm.diff"
name="xterm.diff">
to xterm-146, and rebuild "xterm", then test it by starting
"xterm -u8"/"xterm +u8" and running "stty -a" and interactive "cat" inside it.
</itemize>
-->
<!-- COMMENTED OUT. Need to find a better solution.
To make this fix persistent across rlogin and telnet, I also used to do the
following, but I have been convinced that this is not a good idea:
<itemize>
<item>
Define new values for the TERM environment variable, "linux-utf8" as an
alias to "linux", and "xterm-utf8" as an alias to "xterm".
If your system has the ncurses library and the /usr/lib/terminfo (or
/usr/share/terminfo) database, do this by running
<tscreen><verb>
$ tic linux-utf8.terminfo
$ tic xterm-utf8.terminfo
</verb></tscreen>
as non-root (this will create the terminfo entries in your $HOME/.terminfo
directory). Here are
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/linux-utf8.terminfo"
name="linux-utf8.terminfo">
and
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/xterm-utf8.terminfo"
name="xterm-utf8.terminfo">.
I don't recommend running this as root, because it will create
the terminfo entries in /usr/lib/terminfo where they might be erased next
time you upgrade your system.
If your system has an /etc/termcap file, you should also edit that file:
copy the linux and xterm entries and give them the new names "linux-utf8"
and "xterm-utf8". For an example, see
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/termcap.diff"
name="termcap.diff">.
<item>
Each time you call "unicode_start" or "unicode_stop" from the console, also
execute "export TERM=linux-utf8" or "export TERM=linux", respectively.
<item>
Apply the patch <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/xterm2.diff"
name="xterm2.diff">
to xterm-146, rebuild "xterm", and remove any "XTerm*termName" line from
/usr/X11R6/lib/X11/app-defaults/XTerm and $HOME/.Xdefaults. Now xterm sets
the TERM variable to "xterm-utf8" instead of "xterm" when running in UTF-8
mode.
<item>
Apply the patches
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/netkit.diff"
name="netkit.diff">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/netkitb.diff"
name="netkitb.diff"> and
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/telnet.diff"
name="telnet.diff"> and
rebuild "rlogind" and "telnetd". Now rlogin and telnet put the tty into
UTF-8 editing mode whenever the TERM environment variable is "linux-utf8"
or "xterm-utf8".
</itemize>
-->
<sect1>Upgrading the C library
<p>
glibc-2.2 supports multibyte locales, in particular UTF-8 locales. But
glibc-2.1.x and earlier C libraries do not support it. Therefore you need
to upgrade to glibc-2.2. Upgrading from glibc-2.1.x carries little risk, because
glibc-2.2 is binary compatible with glibc-2.1.x (at least on i386 platforms,
and except for IPv6). Nevertheless, I recommend having a bootable rescue
disk handy in case something goes wrong.
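Before upgrading, it can help to check which C library version is currently
installed. The following is only a minimal sketch: the function name
<tt>version_at_least</tt> is hypothetical, and the usage part assumes that
<tt>getconf</tt> knows the <tt>GNU_LIBC_VERSION</tt> variable, which older
C libraries may not provide. In that case the script simply reports that
glibc 2.2 was not detected.
<tscreen><verb>
#!/bin/sh
# version_at_least HAVE WANT: succeed if dotted version HAVE >= WANT.
# Only the first two components (major.minor) are compared, and both
# arguments are assumed to contain at least "major.minor".
version_at_least() {
    have_major=${1%%.*}
    have_rest=${1#*.}
    have_minor=${have_rest%%.*}
    want_major=${2%%.*}
    want_rest=${2#*.}
    want_minor=${want_rest%%.*}
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] &&
          [ "$have_minor" -ge "$want_minor" ]; }
}

# Assumption: getconf prints something like "glibc 2.2.5".
libc_version=$(getconf GNU_LIBC_VERSION 2>/dev/null | cut -d' ' -f2)
if version_at_least "${libc_version:-0.0}" 2.2; then
    echo "glibc $libc_version should support UTF-8 locales"
else
    echo "glibc 2.2 or newer not detected"
fi
</verb></tscreen>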
Prepare the kernel sources. You must have them unpacked and configured.
/usr/src/linux/include/linux/autoconf.h must exist. Building the kernel
is not needed.
Retrieve the glibc sources
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/glibc/"
name="ftp://ftp.gnu.org/pub/gnu/glibc/">,
su to root, then unpack, build and install it:
<tscreen><verb>
# unset LD_PRELOAD
# unset LD_LIBRARY_PATH
# tar xvfz glibc-2.2.tar.gz
# tar xvfz glibc-linuxthreads-2.2.tar.gz -C glibc-2.2
# mkdir glibc-2.2-build
# cd glibc-2.2-build
# ../glibc-2.2/configure --prefix=/usr --with-headers=/usr/src/linux/include --enable-add-ons
# make
# make check
# make info
# LC_ALL=C make install
# make localedata/install-locales
</verb></tscreen>
Upgrading from glibc versions earlier than 2.1.x cannot be done this way;
consider first installing a Linux distribution based on glibc-2.1.x, and
then upgrading to glibc-2.2 as described above.
Note that if -- for any reason -- you want to rebuild GCC after having
installed glibc-2.2, you need to first apply this patch
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/gcc-glibc-2.2-compat.diff"
name="gcc-glibc-2.2-compat.diff">
to the GCC sources.
<sect1>General data conversion
<p>
You will need a program to convert your locally (probably ISO-8859-1) encoded
texts to UTF-8. (The alternative would be to keep using texts in different
encodings on the same machine; this is not fun in the long run.)
One such program is `iconv', which comes with glibc-2.2. Simply use
<tscreen><verb>
$ iconv --from-code=ISO-8859-1 --to-code=UTF-8 &lt; old_file &gt; new_file
</verb></tscreen>
Here are two handy shell scripts, called "i2u"
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/i2u.sh" name="i2u.sh">
(for ISO to UTF conversion) and "u2i"
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/u2i.sh" name="u2i.sh">
(for UTF to ISO conversion).
Adapt them to your current 8-bit character set.
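To give an idea of what such a wrapper can look like, here is a minimal
sketch of an in-place converter built on iconv. It is an illustration only
and not the content of the "i2u" script linked above; the function name
<tt>i2u_sketch</tt> and the temporary file naming are assumptions. It uses
the short options <tt>-f</tt> and <tt>-t</tt>, which are equivalent to
<tt>--from-code</tt> and <tt>--to-code</tt>.
<tscreen><verb>
#!/bin/sh
# i2u_sketch FILE...: convert each file in place from ISO-8859-1 to UTF-8.
# The original file is replaced only if iconv succeeded.
i2u_sketch() {
    for f in "$@"; do
        tmp=$f.tmp.$$
        if iconv -f ISO-8859-1 -t UTF-8 "$f" > "$tmp"; then
            mv "$tmp" "$f"
        else
            rm -f "$tmp"
            echo "i2u_sketch: could not convert $f" >&2
            return 1
        fi
    done
}
</verb></tscreen>
Note that the converted file is a new file, so its permissions follow your
umask rather than those of the original. Running the function twice on the
same file converts it twice, which garbles all non-ASCII characters.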
If you don't have glibc-2.2 and iconv installed, you can use GNU recode 3.6
instead.
"i2u" <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/i2u_recode.sh"
name="i2u_recode.sh"> is
"recode ISO-8859-1..UTF-8", and
"u2i" <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/u2i_recode.sh"
name="u2i_recode.sh"> is
"recode UTF-8..ISO-8859-1".
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz"
name="ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz">
Alternatively, you can use CLISP. Here are
"i2u" <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/i2u.lisp"
name="i2u.lisp"> and
"u2i" <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/u2i.lisp"
name="u2i.lisp">
written in Lisp. Note: You need a CLISP version from July 1999 or newer.
<htmlurl url="ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz"
name="ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz">.
Other data conversion programs, less powerful than GNU recode, are
`trans'
<htmlurl url="ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/trans113.tar.gz"
name="ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/trans113.tar.gz">,
`tcs' from the Plan9 operating system
<htmlurl url="ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/tcs.tar.gz"
name="ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/tcs.tar.gz">,
and
`utrans'/`uhtrans'/`hutrans'
<htmlurl url="ftp://ftp.cdrom.com/pub/FreeBSD/distfiles/i18ntools-1.0.tar.gz"
name="ftp://ftp.cdrom.com/pub/FreeBSD/distfiles/i18ntools-1.0.tar.gz">
by G. Adam Stanislav
<htmlurl url="mailto:adam@whizkidtech.net"
name="&lt;adam@whizkidtech.net&gt;">.
For the repeated conversion of files to UTF-8 from different character sets,
a semi-automatic tool can be used:
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/to-utf8" name="to-utf8">
presents the non-ASCII parts of a file to the user, lets the user decide on the
file's original character set, and then converts the file to UTF-8.
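A useful first step before any such conversion is to find out which files
are already valid UTF-8. The sketch below is not part of to-utf8; the
function name <tt>is_valid_utf8</tt> is hypothetical. It relies on iconv
rejecting byte sequences that are not well-formed UTF-8. The check cannot
tell which 8-bit character set a rejected file uses, so that decision still
has to be made by a human.
<tscreen><verb>
#!/bin/sh
# is_valid_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# Plain ASCII files pass as well, since ASCII is a subset of UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Classify every file named on the command line.
for f in "$@"; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8 (or plain ASCII)"
    else
        echo "$f: not UTF-8, original character set must be determined"
    fi
done
</verb></tscreen>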
<sect1>Locale environment variables
<p>
You may have the following environment variables set, containing locale
names:
<descrip>
<tag>LANGUAGE</tag>
override for LC_MESSAGES, used by GNU gettext only
<tag>LC_ALL</tag>
override for all other LC_* variables
<tag>LC_CTYPE, LC_MESSAGES, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_TIME</tag>
individual variables for:
character types and encoding,
natural language messages,
sorting rules,
number formatting,
money amount formatting,
date and time display
<tag>LANG</tag>
default value for all LC_* variables
</descrip>
(See `<tt>man 7 locale</tt>' for a detailed description.)
Each of the LC_* and LANG variables can contain a locale name of the
following form:
<quote>
language[_territory[.codeset]][@modifier]
</quote>
where language is an
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/ISO_639.html"
name="ISO 639">
language code (lower case), territory is an
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/ISO_3166.html"
name="ISO 3166">
country code (upper case), codeset denotes a character set, and
modifier stands for other particular attributes (for example indicating
a particular language dialect, or a nonstandard orthography).
LANGUAGE can contain several locale names, separated by colons.
In order to tell your system and all applications that you are using UTF-8,
you need to add a codeset suffix of UTF-8 to your locale names. For example,
if you were using
<tscreen><verb>
LC_CTYPE=de_DE
</verb></tscreen>
you would change this to
<tscreen><verb>
LC_CTYPE=de_DE.UTF-8
</verb></tscreen>
You do <em>not</em> need to change your LANGUAGE environment variable.
GNU gettext in glibc-2.2 has the ability to convert translations to the right
encoding.
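The locale name format described above can be rewritten mechanically. The
following is a minimal sketch of a helper that replaces or adds the codeset
part of a locale name while keeping any modifier; the function name
<tt>to_utf8_locale</tt> is hypothetical, and special names such as "C" or
"POSIX" are not treated specially.
<tscreen><verb>
#!/bin/sh
# to_utf8_locale NAME: print NAME with its codeset set to UTF-8.
# NAME has the form language[_territory[.codeset]][@modifier].
to_utf8_locale() {
    name=$1
    modifier=
    case $name in
        *@*) modifier=@${name#*@}; name=${name%%@*} ;;
    esac
    name=${name%%.*}    # drop an existing codeset, if any
    printf '%s.UTF-8%s\n' "$name" "$modifier"
}
</verb></tscreen>
For example, <tt>to_utf8_locale de_DE</tt> prints <tt>de_DE.UTF-8</tt>. In a
shell startup file one could then write something like
<tt>LC_CTYPE=$(to_utf8_locale de_DE); export LC_CTYPE</tt>.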
<sect1>Creating the locale support files
<p>
Use <tt>localedef</tt> to create the support files for each UTF-8 locale
you intend to use, for example:
<tscreen><verb>
$ localedef -v -c -i de_DE -f UTF-8 de_DE.UTF-8
</verb></tscreen>
You typically don't need to create locales named "de" or "fr" without
country suffix, because these locales are normally only used by the
LANGUAGE variable and not by the LC_* variables, and LANGUAGE is only
used as an override for LC_MESSAGES.
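To check whether a locale was created successfully, you can ask the
<tt>locale</tt> utility which character map it reports for that locale. This
is a minimal sketch; the function name <tt>locale_is_utf8</tt> and the list
of locale names are only examples.
<tscreen><verb>
#!/bin/sh
# locale_is_utf8 NAME: succeed if locale NAME is installed and its
# character map is UTF-8. For a missing locale, "locale charmap" falls
# back to the C locale and reports a different character map.
locale_is_utf8() {
    [ "$(LC_ALL=$1 locale charmap 2>/dev/null)" = "UTF-8" ]
}

for l in de_DE.UTF-8 fr_FR.UTF-8; do
    if locale_is_utf8 "$l"; then
        echo "$l: ok"
    else
        echo "$l: missing, create it with localedef"
    fi
done
</verb></tscreen>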
<sect>Specific applications
<p>
<sect1>Shells
<p>
<sect2>bash
<p>
By default, GNU bash assumes that every character is one byte long and one
column wide. A patch for bash 2.04, by Marcin 'Qrczak' Kowalczyk and
Ricardas Cepas, teaches bash about multibyte characters in UTF-8 encoding.
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/bash-2.04-diff"
name="bash-2.04-diff">
Double-width characters, combining characters and bidi are not supported by
this patch. It seems a complete redesign of the readline redisplay engine is
needed.
<sect1>Networking
<p>
<!--
<sect2>rlogin
<p>
is fine with the above mentioned patches.
-->
<sect2>telnet
<p>
In some installations, telnet is not 8-bit clean by default.
In order to be able to send Unicode keystrokes to the remote host, you need to
set telnet into "outbinary" mode.
There are two ways to do this:
<tscreen><verb>
$ telnet -L <host>
</verb></tscreen>
and
<tscreen><verb>
$ telnet
telnet> set outbinary
telnet> open <host>
</verb></tscreen>
<!--
Additionally, use the above mentioned patches.
-->
<sect2>kermit
<p>
The communications program C-Kermit
<htmlurl url="http://www.columbia.edu/kermit/ckermit.html"
name="http://www.columbia.edu/kermit/ckermit.html">,
(an interactive tool for connection setup, telnet, file transfer,
with support for TCP/IP and serial lines),
in versions 7.0 or newer, understands the file and transfer encodings
UTF-8 and UCS-2, and understands the terminal encoding UTF-8, and converts
between these encodings and many others. Documentation of these features
can be found in
<htmlurl url="http://www.columbia.edu/kermit/ckermit2.html#x6.6"
name="http://www.columbia.edu/kermit/ckermit2.html#x6.6">.
<sect1>Browsers
<p>
<sect2>Netscape
<p>
Netscape 4.05 or newer can display HTML documents in UTF-8 encoding. All a
document needs is the following line between the
&lt;head&gt; and &lt;/head&gt; tags:
<tscreen><verb>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</verb></tscreen>
Netscape 4.05 or newer can also display HTML and text files in UCS-2
encoding with byte-order mark.
<htmlurl url="http://www.netscape.com/computing/download/"
name="http://www.netscape.com/computing/download/">
<sect2>Mozilla
<p>
Mozilla milestone M16 has much better internationalization than Netscape 4.
It can display HTML documents in UTF-8 encoding with support for more
languages. Alas, there is a cosmetic problem with CJK fonts: some glyphs
can be bigger than the line's height, thus overlapping the previous or next
line.
<htmlurl url="http://www.mozilla.org/"
name="http://www.mozilla.org/">
<sect2>Amaya
<p>
Amaya 4.2.1
(<htmlurl url="http://www.w3.org/Amaya/"
name="http://www.w3.org/Amaya/">,
<htmlurl url="http://www.w3.org/Amaya/User/SourceDist"
name="http://www.w3.org/Amaya/User/SourceDist">)
now has limited handling of UTF-8 encoded HTML pages. It
recognizes the encoding, but it displays only ISO-8859-1 and symbol
characters; it only ever accesses the fonts
<tscreen><verb>
-adobe-times-*-iso8859-1
-adobe-helvetica-*-iso8859-1
-adobe-new century schoolbook-*-iso8859-1
-adobe-courier-*-iso8859-1
-adobe-symbol-*-adobe-fontspecific
</verb></tscreen>
Amaya is in fact an HTML editor, not only a browser. Amaya's strengths among
the browsers are its speed, given enough memory, and its rendering
of mathematical formulas (MathML support).
<sect2>lynx
<p>
lynx-2.8 has an options screen (key 'O') which lets you set the display
character set. When running in an xterm or Linux console in UTF-8 mode,
set this to "UNICODE UTF-8". Note that for this setting to take effect
in the current browser session, you have to confirm on the "Accept Changes"
field, and for this setting to take effect in future browser sessions, you
have to enable the "Save options to disk" field and then confirm it on
the "Accept Changes" field.
Now, again, all a document needs is the following line between the
&lt;head&gt; and &lt;/head&gt; tags:
<tscreen><verb>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</verb></tscreen>
When you are viewing text files in UTF-8 encoding, you also need to
pass the command-line option "-assume_local_charset=UTF-8" (affects only
file:/... URLs) or "-assume_charset=UTF-8" (affects all URLs).
In lynx-2.8.2 you can alternatively, in the options screen (key 'O'),
change the assumed document character set to "utf-8".
There is also an option in the options screen, to set the "preferred document
character set". But it has no effect, at least with file:/... URLs
and with http://... URLs served by apache-1.3.0.
There is a spacing and line-breaking problem, however. (Look at the
Russian section of x-utf8.html, or at utf-8-demo.txt.)
Also, in lynx-2.8.2, configured with --enable-prettysrc, the nice colour
scheme does not work correctly any more when the display character set
has been set to "UNICODE UTF-8". This is fixed by a simple patch
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/lynx282.diff" name="lynx282.diff">.
The Lynx developers say: "For any serious use of UTF-8 screen output with
lynx, compiling with slang lib and -DSLANG_MBCS_HACK is still recommended."
Latest stable release:
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/lynx/lynx-2.8.2.tar.gz"
name="ftp://ftp.gnu.org/pub/gnu/lynx/lynx-2.8.2.tar.gz">
<htmlurl url="http://lynx.isc.org/"
name="http://lynx.isc.org/">
General home page:
<htmlurl url="http://lynx.browser.org/"
name="http://lynx.browser.org/">
<htmlurl url="http://www.slcc.edu/lynx/"
name="http://www.slcc.edu/lynx/">
Newer development snapshots:
<htmlurl url="http://lynx.isc.org/current/"
name="http://lynx.isc.org/current/">,
<htmlurl url="ftp://lynx.isc.org/current/"
name="ftp://lynx.isc.org/current/">
<sect2>w3m
<p>
w3m by Akinori Ito
<htmlurl url="http://ei5nazha.yz.yamagata-u.ac.jp/~aito/w3m/eng/"
name="http://ei5nazha.yz.yamagata-u.ac.jp/~aito/w3m/eng/">
is a text mode browser for HTML pages and plain-text files.
Its layout of HTML tables, enumerations etc. is much prettier than that of lynx.
w3m can also be used as a high quality HTML to plain text converter.
w3m 0.1.10 has command line options for the three major Japanese encodings, but
can also be used for UTF-8 encoded files. Without command line options,
you often have to press Ctrl-L to refresh the display, and line breaking
in Cyrillic and CJK paragraphs is not good.
To fix this, Hironori Sakamoto has a patch
<htmlurl url="http://www2u.biglobe.ne.jp/~hsaka/w3m/"
name="http://www2u.biglobe.ne.jp/~hsaka/w3m/">
which adds UTF-8 as display encoding.
<sect2>Test pages
<p>
Some test pages for browsers can be found at the pages of Alan Wood
<htmlurl url="http://www.hclrss.demon.co.uk/unicode/#links"
name="http://www.hclrss.demon.co.uk/unicode/#links">
and James Kass
<htmlurl url="http://home.att.net/~jameskass/"
name="http://home.att.net/~jameskass/">.
<sect1>Editors
<p>
<sect2>yudit
<p>
yudit by G&aacute;sp&aacute;r Sinai
<htmlurl url="http://www.yudit.org/"
name="http://www.yudit.org/">
is a first-class Unicode text editor for the X Window System.
It supports simultaneous processing of many languages, input methods,
conversions for local character standards.
It has facilities for entering text in all languages with only
an English keyboard, using keyboard configuration maps.
<sect3>yudit-1.5
<p>
It can be compiled in three versions: Xlib GUI, KDE GUI, or Motif GUI.
Customization is very easy. Typically you will first customize your font.
From the font menu I chose "Unicode". Then, since the command
"xlsfonts '*-*-iso10646-1'" still showed some ambiguity, I chose a font
size of 13 (to match Markus Kuhn's 13-pixel fixed font).
Next, you will customize your input method. The input methods "Straight",
"Unicode" and "SGML" are most remarkable. For details about the other
built-in input methods, look in /usr/local/share/yudit/data/.
To change the default for the next session, edit your $HOME/.yuditrc
file.
The general editor functionality is limited to editing, cut&amp;paste
and search&amp;replace. No undo.
<sect3>yudit-2.1
<p>
This version is less easy to learn, because it comes with a homegrown
GUI and no easily accessible help. But it has an undo functionality and
should therefore be more usable than version 1.5.
<sect3>Fonts for yudit
<p>
yudit can display text using a TrueType font; see section "TrueType fonts"
above. The Bitstream Cyberbit gives good results. For yudit to find the
font, symlink it to <tt>/usr/local/share/yudit/data/cyberbit.ttf</tt>.
<sect2>vim
<p>
vim (as of version 6.0r) has good support for UTF-8: when started in a
UTF-8 locale, it assumes UTF-8 encoding for the console and the text files
being edited. It also supports double-width (CJK) characters and
combining characters, and therefore fits well into a UTF-8 enabled
xterm.
Installation: Download from
<htmlurl url="http://www.vim.org/"
name="http://www.vim.org/">.
After unpacking the four parts, call <tt>./configure</tt> with
<tt>--with-features=big</tt> <tt>--enable-multibyte</tt> arguments
(or edit src/Makefile to include the <tt>--with-features=big</tt> and
<tt>--enable-multibyte</tt> options). This will turn on the feature
FEAT_MBYTE. Then do "make" and "make install".
vim can be used to edit files in other encodings. For example, to edit
a BIG5 encoded file: <tt>:e ++cc=BIG5 filename</tt>. All encoding names
supported by iconv are accepted. Plus: vim automatically distinguishes
UTF-8 and ISO-8859-1 files without needing any command line option.
<sect2>cooledit
<p>
cooledit by Paul Sheer
<htmlurl url="http://www.cooledit.org/"
name="http://www.cooledit.org/">
is a good text editor for the X Window System. Since version 3.15, it has
support for Unicode, including Bidi for Hebrew (but not Arabic).
A build error message about a missing "vga_setpage" function is
worked around by adding "-DDO_NOT_USE_VGALIB" to the CFLAGS.
To view UTF-8 files in a UTF-8 locale you have to modify a setting in
the "Options -> Switches" panel: Enable the checkbox "Display characters
outside locale". I also found it necessary to disable "Spellcheck as you
type".
For viewing texts with both European and CJK characters, cooledit needs a
font which contains both, for example the GNU unifont (see section
"X11 Unicode fonts"). Start it once with
<tscreen><verb>
$ cooledit -fn -gnu-unifont-medium-r-normal--16-160-75-75-c-80-iso10646-1
</verb></tscreen>
cooledit will then use this font in all future invocations.
Unfortunately, the only characters that can be entered through the keyboard
are ISO-8859-1 characters and, through a cooledit specific compose mechanism,
ISO-8859-2 characters. Inputting arbitrary Unicode characters in cooledit is
possible, but a bit tedious.
<sect2>emacs
<p>
First of all, you should read the section "International Character Set Support"
(node "International") in the Emacs manual. In particular, note that you need
to start Emacs using the command
<tscreen><verb>
$ emacs -fn fontset-standard
</verb></tscreen>
so that it will use a font set comprising a lot of international characters.
In the short term, there are two packages for using UTF-8 in Emacs. Neither
of them requires recompiling Emacs.
<itemize>
<item>
The emacs-utf package
<htmlurl url="http://www.cs.ust.hk/faculty/otfried/Mule/"
name="http://www.cs.ust.hk/faculty/otfried/Mule/">
by Otfried Cheong provides a "unicode-utf8" encoding to Emacs.
<item>
The oc-unicode package
<htmlurl url="http://www.cs.ust.hk/faculty/otfried/Mule/"
name="http://www.cs.ust.hk/faculty/otfried/Mule/">,
by Otfried Cheong, an extension of the Mule-UCS package
<htmlurl url="ftp://etlport.etl.go.jp/pub/mule/Mule-UCS/Mule-UCS-0.70.tar.gz"
name="ftp://etlport.etl.go.jp/pub/mule/Mule-UCS/Mule-UCS-0.70.tar.gz">
(mirrored at
<htmlurl url="http://riksun.riken.go.jp/archives/misc/mule/Mule-UCS/Mule-UCS-0.70.tar.gz"
name="http://riksun.riken.go.jp/archives/misc/mule/Mule-UCS/Mule-UCS-0.70.tar.gz">
and
<htmlurl url="ftp://ftp.m17n.org/pub/mule/Mule-UCS/Mule-UCS-0.70.tar.gz"
name="ftp://ftp.m17n.org/pub/mule/Mule-UCS/Mule-UCS-0.70.tar.gz">)
by Hisashi Miyashita, provides a "utf-8" encoding to Emacs.
</itemize>
You can use either of these packages, or both together. The advantages
of the emacs-utf "unicode-utf8" encoding are: it loads faster, and it deals
better with combining characters (important for Thai).
The advantage of the Mule-UCS / oc-unicode "utf-8" encoding is: it can apply
to a process buffer (such as M-x shell), not only to loading and saving of
files; and it respects the widths of characters better (important for
Ethiopian). However, it is less reliable: After heavy editing of a file, I
have seen some Unicode characters replaced with U+FFFD after the file was
saved. (But maybe those were bugs in Emacs 20.5 and 20.6 which are fixed in
Emacs 20.7.)
To install the emacs-utf package, compile the program "utf2mule" and install
it somewhere in your $PATH, also install unicode.el, muleuni-1.el,
unicode-char.el somewhere. Then add the lines
<tscreen><verb>
(setq load-path (cons "/home/user/somewhere/emacs" load-path))
(if (not (string-match "XEmacs" emacs-version))
    (progn
      (require 'unicode)
      ;(setq unicode-data-path "..../UnicodeData-3.0.0.txt")
      (if (eq window-system 'x)
          (progn
            (setq fontset12
                  (create-fontset-from-fontset-spec
                   "-misc-fixed-medium-r-normal-*-12-*-*-*-*-*-fontset-standard"))
            (setq fontset13
                  (create-fontset-from-fontset-spec
                   "-misc-fixed-medium-r-normal-*-13-*-*-*-*-*-fontset-standard"))
            (setq fontset14
                  (create-fontset-from-fontset-spec
                   "-misc-fixed-medium-r-normal-*-14-*-*-*-*-*-fontset-standard"))
            (setq fontset15
                  (create-fontset-from-fontset-spec
                   "-misc-fixed-medium-r-normal-*-15-*-*-*-*-*-fontset-standard"))
            (setq fontset16
                  (create-fontset-from-fontset-spec
                   "-misc-fixed-medium-r-normal-*-16-*-*-*-*-*-fontset-standard"))
            (setq fontset18
                  (create-fontset-from-fontset-spec
                   "-misc-fixed-medium-r-normal-*-18-*-*-*-*-*-fontset-standard"))
            ; (set-default-font fontset15)
            ))))
</verb></tscreen>
to your $HOME/.emacs file. To activate any of the font sets, use the Mule
menu item "Set Font/FontSet" or Shift-down-mouse-1. The Unicode coverage
of the font sets at different sizes may depend on the installed fonts;
here are screen shots at various sizes of UTF-8-demo.txt (
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-12.gif"
name="12">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-13.gif"
name="13">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-14.gif"
name="14">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-15.gif"
name="15">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-16.gif"
name="16">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-UTF-8-demo-18.gif"
name="18">)
and of the Mule script examples (
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-12.gif"
name="12">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-13.gif"
name="13">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-14.gif"
name="14">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-15.gif"
name="15">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-16.gif"
name="16">,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/emacs-HELLO-18.gif"
name="18">).
To designate a font set as the initial font set for the first frame at startup,
uncomment the <tt>set-default-font</tt> line in the code snippet above.
To install the oc-unicode package, execute the command
<tscreen><verb>
$ emacs -batch -l oc-comp.el
</verb></tscreen>
and install the resulting file <tt>un-define.elc</tt>, as well as
<tt>oc-unicode.el</tt>, <tt>oc-charsets.el</tt>, <tt>oc-tools.el</tt>,
somewhere. Then add the lines
<tscreen><verb>
(setq load-path (cons "/home/user/somewhere/emacs" load-path))
(if (not (string-match "XEmacs" emacs-version))
    (progn
      (require 'oc-unicode)
      ;(setq unicode-data-path "..../UnicodeData-3.0.0.txt")
      (if (eq window-system 'x)
          (progn
            (setq fontset12
                  (oc-create-fontset
                   "-misc-fixed-medium-r-normal-*-12-*-*-*-*-*-fontset-standard"
                   "-misc-fixed-medium-r-normal-ja-12-*-iso10646-*"))
            (setq fontset13
                  (oc-create-fontset
                   "-misc-fixed-medium-r-normal-*-13-*-*-*-*-*-fontset-standard"
                   "-misc-fixed-medium-r-normal-ja-13-*-iso10646-*"))
            (setq fontset14
                  (oc-create-fontset
                   "-misc-fixed-medium-r-normal-*-14-*-*-*-*-*-fontset-standard"
                   "-misc-fixed-medium-r-normal-ja-14-*-iso10646-*"))
            (setq fontset15
                  (oc-create-fontset
                   "-misc-fixed-medium-r-normal-*-15-*-*-*-*-*-fontset-standard"
                   "-misc-fixed-medium-r-normal-ja-15-*-iso10646-*"))
            (setq fontset16
                  (oc-create-fontset
                   "-misc-fixed-medium-r-normal-*-16-*-*-*-*-*-fontset-standard"
                   "-misc-fixed-medium-r-normal-ja-16-*-iso10646-*"))
            (setq fontset18
                  (oc-create-fontset
                   "-misc-fixed-medium-r-normal-*-18-*-*-*-*-*-fontset-standard"
                   "-misc-fixed-medium-r-normal-ja-18-*-iso10646-*"))
            ; (set-default-font fontset15)
            ))))
</verb></tscreen>
to your $HOME/.emacs file. You can choose your appropriate font set as with
the emacs-utf package.
In order to open a UTF-8 encoded file, you will type
<tscreen><verb>
M-x universal-coding-system-argument unicode-utf8 RET
M-x find-file filename RET
</verb></tscreen>
or
<tscreen><verb>
C-x RET c unicode-utf8 RET
C-x C-f filename RET
</verb></tscreen>
(or utf-8 instead of unicode-utf8, if you prefer oc-unicode/Mule-UCS).
In order to start a shell buffer with UTF-8 I/O, you will type
<tscreen><verb>
M-x universal-coding-system-argument utf-8 RET
M-x shell RET
</verb></tscreen>
(This works with oc-unicode/Mule-UCS only.)
There is a newer version Mule-UCS-0.81. Unfortunately you need to rebuild emacs
from source in order to use it.
Note that all this works with Emacs 20 in windowing mode only, not in terminal
mode. None of the mentioned packages works in Emacs 21, as of this writing.
Richard Stallman plans to add integrated UTF-8 support to Emacs in the long
term, and so do the XEmacs developers.
<sect2>xemacs
<p>
(This section is written by Gilbert Baumann.)
Here is how to teach XEmacs (20.4 configured with MULE) the UTF-8 encoding.
Unfortunately you need its sources to be able to patch it.
First you need these files provided by Tomohiko Morioka:
<htmlurl url="http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-21.0-b55-emc-b55-ucs.diff"
name="http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-21.0-b55-emc-b55-ucs.diff">
and
<htmlurl url="http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-ucs-conv-0.1.tar.gz"
name="http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-ucs-conv-0.1.tar.gz">
The .diff is a diff against the C sources. The tar ball is elisp code,
which provides lots of code tables to map to and from Unicode. As the
name of the diff file suggests it is against XEmacs-21; I needed to
help `patch' a bit. The most notable difference to my XEmacs-20.4
sources is that file-coding.[ch] was called mule-coding.[ch].
For those unfamiliar with the XEmacs-MULE stuff (as I am), here is a quick
guide:
What we call an encoding is called by MULE a `coding-system'. The most
important commands are:
<tscreen><verb>
M-x set-file-coding-system
M-x set-buffer-process-coding-system [comint buffers]
</verb></tscreen>
and the variable `file-coding-system-alist', which guides `find-file'
to guess the encoding used. After stuff was running, the very first
thing I did was <htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/gb-hacks.el" name="this">.
This code looks into the special mode line introduced by -*- somewhere
in the first 600 bytes of the file about to be opened; if there is a
field "Encoding: xyz;" and the xyz encoding ("coding system" in Emacs speak)
exists, choose that. So now you could do e.g.
<tscreen><verb>
;;; -*- Mode: Lisp; Syntax: Common-Lisp; Package: CLEX; Encoding: utf-8; -*-
</verb></tscreen>
and XEmacs goes into utf-8 mode here.
After everything was running I defined \u03BB (greek lambda) as a
macro like:
<tscreen><verb>
(defmacro \u03BB (x) `(lambda .,x))
</verb></tscreen>
<sect2>nedit
<p>
<sect2>xedit
<p>
With XFree86-4.0.1, xedit is able to edit UTF-8 files if you set the locale
accordingly (see above), and add the line "Xedit*international: true" to
your $HOME/.Xdefaults file.
<sect2>axe
<p>
As of version 6.1.2, aXe supports only 8-bit locales. If you add the line
"Axe*international: true" to your $HOME/.Xdefaults file, it will simply dump
core.
<sect2>pico
<p>
As of pine version 4.30, pico (the editor that ships with pine) cannot
reasonably be used to view or edit UTF-8 files. In a UTF-8 enabled xterm,
it has severe redraw problems.
<sect2>mined98
<p>
mined98 is a small text editor by Michiel Huisjes, Achim M&uuml;ller and
Thomas Wolff.
<htmlurl url="http://www.inf.fu-berlin.de/~wolff/mined98.tar.gz"
name="http://www.inf.fu-berlin.de/~wolff/mined98.tar.gz">
It lets you edit UTF-8 or 8-bit encoded files, in a UTF-8 or 8-bit xterm.
It also has powerful capabilities for entering Unicode characters.
By default, mined uses an autodetection heuristic to tell 8-bit encoded and
UTF-8 encoded files apart. If you don't want to rely on heuristics,
pass the command-line option <tt>-u</tt> when editing a UTF-8 file, or
<tt>+u</tt> when editing an 8-bit encoded file. You can change the
interpretation at any time from within the editor: It displays the encoding
("L:h" for 8-bit, "U:h" for UTF-8) in the menu line. Click on the first
of these characters to change it.
mined knows about double-width and combining characters and displays them
correctly. It also has a special display mode for combining characters.
mined also has a scrollbar and very nice pull-down menus. Alas, the "Home",
"End", "Delete" keys do not work.
<sect2>qemacs
<p>
qemacs 0.2 is a small text editor by Fabrice Bellard
<htmlurl url="http://www-stud.enst.fr/~bellard/qemacs/"
name="http://www-stud.enst.fr/~bellard/qemacs/">
with Emacs keybindings. It runs in a UTF-8 console or xterm, and can edit
both 8-bit encoded and UTF-8 encoded files. It still has a few rough edges,
but further development is underway.
<sect1>Mailers
<p>
MIME: RFC 2279 defines UTF-8 as a MIME charset, which can be transported
under the 8bit, quoted-printable and base64 encodings. The older MIME
UTF-7 proposal (RFC 2152) is considered to be deprecated and should not
be used any further.
Mail clients released after January 1, 1999, should be capable of sending and
displaying UTF-8 encoded mails, otherwise they are considered deficient.
But these mails have to carry the MIME labels
<tscreen><verb>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
</verb></tscreen>
Simply piping a UTF-8 file into "mail" without caring about the MIME labels
will not work.
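The labels can be added by hand when a message is generated from a script.
The following is a minimal sketch, not a complete mail client: the function
name <tt>utf8_mail</tt> is hypothetical, the <tt>MIME-Version</tt> header is
added because MIME requires it alongside the two labels above, and the
subject is assumed to be plain ASCII (non-ASCII header text needs its own
encoding).
<tscreen><verb>
#!/bin/sh
# utf8_mail FROM TO SUBJECT: read a UTF-8 encoded body on standard input
# and print a complete message with MIME labels on standard output.
utf8_mail() {
    printf 'From: %s\n' "$1"
    printf 'To: %s\n' "$2"
    printf 'Subject: %s\n' "$3"
    printf 'MIME-Version: 1.0\n'
    printf 'Content-Type: text/plain; charset=UTF-8\n'
    printf 'Content-Transfer-Encoding: 8bit\n'
    printf '\n'
    cat
}

# Possible usage, assuming a sendmail-compatible program is installed at
# this path and the transport is 8-bit clean:
#   utf8_mail me@example.org you@example.org Test < body.txt |
#       /usr/sbin/sendmail -t
</verb></tscreen>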
Mail client implementors should take a look at
<htmlurl url="http://www.imc.org/imc-intl/"
name="http://www.imc.org/imc-intl/">
and
<htmlurl url="http://www.imc.org/mail-i18n.html"
name="http://www.imc.org/mail-i18n.html">.
Now about the individual mail clients (or "mail user agents"):
<sect2>pine
<p>
The situation for an unpatched pine version 4.30 is as follows.
Pine does not do character set conversions. But it allows you to view
UTF-8 mails in a UTF-8 text window (Linux console or xterm).
Normally, Pine will warn about different character sets each time you view
a UTF-8 encoded mail. To get rid of this warning, choose S (setup), then
C (config), then change the value of "character-set" to UTF-8. This option
will not do anything, except to reduce the warnings, as Pine has no built-in
knowledge of UTF-8.
Also note that Pine's notion of Unicode characters is pretty limited: It
will display Latin and Greek characters, but not other kinds of Unicode
characters.
A patch by Robert Brady
<htmlurl url="mailto:robert@suse.co.uk"
name="&lt;robert@suse.co.uk&gt;">
<htmlurl url="http://www.ents.susu.soton.ac.uk/~robert/pine-utf8-0.1.diff"
name="http://www.ents.susu.soton.ac.uk/~robert/pine-utf8-0.1.diff">
adds UTF-8 support to Pine. With this patch, it decodes and prints headers
and bodies properly. The patch depends on the GNOME libunicode
<htmlurl url="http://cvs.gnome.org/lxr/source/libunicode/"
name="http://cvs.gnome.org/lxr/source/libunicode/">.
However, alignment remains broken in many places; replying to a mail does
not cause the character set to be converted as appropriate; and the editor,
pico, cannot deal with multibyte characters.
<sect2>kmail
<p>
kmail (as of KDE 1.0) does not support UTF-8 mails at all.
<sect2>Netscape Communicator
<p>
Netscape Communicator's Messenger can send and display mails in UTF-8
encoding, but it needs a little bit of manual user intervention.
To send a UTF-8 encoded mail: After opening the "Compose" window, but before
starting to compose the message, select from the menu
"View -> Character Set -> Unicode (UTF-8)". Then compose the message and
send it.
When you receive a UTF-8 encoded mail, Netscape unfortunately does not
display it in UTF-8 right away, and does not even give a visual clue that
the mail was encoded in UTF-8. You have to manually select from the menu
"View -> Character Set -> Unicode (UTF-8)".
For displaying UTF-8 mails, Netscape uses different fonts. You can adjust
your font settings in the "Edit -> Preferences -> Fonts" dialog; choose
the "Unicode" font category.
<sect2>emacs (rmail, vm)
<p>
<sect2>mutt
<p>
mutt-1.2.x, as available from
<htmlurl url="http://www.mutt.org/"
name="http://www.mutt.org/">,
has only rudimentary support for UTF-8: it can convert
from UTF-8 into an 8-bit display charset. The mutt-1.3.x
development branch also supports UTF-8 as the display charset,
so you can run Mutt in a UTF-8 xterm, and has thorough support
for MIME and charset conversion (relying on iconv).
<sect2>exmh
<p>
exmh 2.1.2 with Tk 8.4a1 can recognize and correctly display UTF-8 mails
(without CJK characters) if you add the following lines to your
<tt>$HOME/.Xdefaults</tt> file:
<tscreen><verb>
!
! Exmh
!
exmh.mimeUCharsets: utf-8
exmh.mime_utf-8_registry: iso10646
exmh.mime_utf-8_encoding: 1
exmh.mime_utf-8_plain_families: fixed
exmh.mime_utf-8_fixed_families: fixed
exmh.mime_utf-8_proportional_families: fixed
exmh.mime_utf-8_title_families: fixed
</verb></tscreen>
<sect1>Text processing
<p>
<sect2>groff
<p>
groff 1.16.1, the GNU implementation of the traditional Unix text processing
system troff/nroff, can output UTF-8 formatted text. Simply use
`<tt>groff -Tutf8</tt>' instead of `<tt>groff -Tlatin1</tt>' or
`<tt>groff -Tascii</tt>'.
<sect2>TeX
<p>
The teTeX 0.9 (and newer) distribution contains a Unicode adaptation of TeX,
called Omega
(<htmlurl url="http://www.gutenberg.eu.org/omega/"
name="http://www.gutenberg.eu.org/omega/">,
<htmlurl url="ftp://ftp.ens.fr/pub/tex/yannis/omega"
name="ftp://ftp.ens.fr/pub/tex/yannis/omega">).
Together with the unicode.tex file contained in
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/utf8-tex-0.1.tar.gz"
name="utf8-tex-0.1.tar.gz">
it enables you to use UTF-8 encoded sources as input for TeX. About a thousand
Unicode characters are currently supported.
All that changes is that you run `omega' (instead of `tex') or `lambda'
(instead of `latex'), and insert the following lines at the head of
your source input.
<tscreen><verb>
\ocp\TexUTF=inutf8
\InputTranslation currentfile \TexUTF
</verb></tscreen>
<tscreen><verb>
\input unicode
</verb></tscreen>
Other possibly related links:
<htmlurl url="http://www.dante.de/projekte/nts/NTS-FAQ.html"
name="http://www.dante.de/projekte/nts/NTS-FAQ.html">,
<htmlurl url="ftp://ftp.dante.de/pub/tex/language/chinese/CJK/"
name="ftp://ftp.dante.de/pub/tex/language/chinese/CJK/">.
<sect1>Databases
<p>
<sect2>PostgreSQL
<p>
PostgreSQL 6.4 or newer can be built with the configuration option
<tt>--with-mb=UNICODE</tt>.
<sect2>Interbase
<p>
Borland/Inprise's Interbase 6.0 can store string fields in UTF-8 format
if the option "CHARACTER SET UNICODE_FSS" is given.
<sect1>Other text-mode applications
<p>
<sect2>less
<p>
With
<htmlurl url="http://www.flash.net/~marknu/less/less-358.tar.gz"
name="http://www.flash.net/~marknu/less/less-358.tar.gz">
you can browse UTF-8 encoded text files in a UTF-8 xterm or console.
Make sure that the environment variable LESSCHARSET is not set (or is set
to utf-8). If you have a LESSKEY environment variable set, also make
sure that the file it points to does not define LESSCHARSET. If necessary,
regenerate this file using the `lesskey' command, or unset the LESSKEY
environment variable.
<sect2>lv
<p>
lv-4.49.3 by Tomio Narita
<htmlurl url="http://www.ff.iij4u.or.jp/~nrt/lv/"
name="http://www.ff.iij4u.or.jp/~nrt/lv/">
is a file viewer with builtin character set converters. To view UTF-8 files
in a UTF-8 console, use "lv -Au8". It can also be used to view
files in other CJK encodings in a UTF-8 console.
There is a small glitch: lv turns off xterm's cursor and doesn't turn it on
again.
<sect2>expand
<p>
Get the GNU textutils-2.0 and apply the patch
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/textutils-2.0.diff"
name="textutils-2.0.diff">,
then run configure, add "#define HAVE_FGETWC 1" and "#define HAVE_FPUTWC 1" to
config.h, and rebuild.
<sect2>col, colcrt, colrm, column, rev, ul
<p>
Get the util-linux-2.9y package, configure it, then define ENABLE_WIDECHAR in
defines.h and change the "#if 0" to "#if 1" in lib/widechar.h. In
text-utils/Makefile, modify CFLAGS and LDFLAGS so that they include the
directories where libutf8 is installed. Then rebuild.
<sect2>figlet
<p>
figlet 2.2 has an option for UTF-8 input: "figlet -C utf8".
<sect2>Base utilities
<p>
The Li18nux list of commands and utilities that ought to be made interoperable
with UTF-8 is as follows. Useful information still needs to be added here; I just
haven't got around to it yet :-)
As of glibc-2.2, regular expressions only work for 8-bit characters.
In an UTF-8 locale, regular expressions that contain non-ASCII characters
or that expect to match a single multibyte character with "." do not work.
This affects all commands and utilities listed below.
<!-- In particular:
- ed, vi, emacs, use regular expressions for search/replace commands,
- less, more, use regular expressions for searching,
- csplit, uses regular expressions,
- diff, uses regular expressions for its -I and -F options,
- expr, uses regular expressions for its ":" operator,
- find, uses regular expressions for its -ok, -regex and -iregex operations,
- m4, uses regular expressions for its regexp builtin function,
- nl, uses regular expressions for types starting with p,
- tac, uses regular expressions for its -r option.
-->
<descrip>
<tag>alias</tag>
No info available yet.
<tag>ar</tag>
No info available yet.
<tag>arch</tag>
No info available yet.
<tag>arp</tag>
No info available yet.
<tag>at</tag>
As of at-3.1.8: The two uses of isalnum in at.c are invalid and should be
replaced with a use of quotearg.c or an exclude list of the (fixed) list
of shell metacharacters. The two uses of %8s in at.c and atd.c are invalid
and should become arbitrary length.
<tag>awk</tag>
No info available yet.
<tag>basename</tag>
As of sh-utils-2.0i: OK.
<tag>batch</tag>
No info available yet.
<tag>bc</tag>
No info available yet.
<tag>bg</tag>
No info available yet.
<tag>bunzip2</tag>
No info available yet.
<tag>bzip2</tag>
No info available yet.
<tag>bzip2recover</tag>
No info available yet.
<tag>cal</tag>
No info available yet.
<!--
<tag>cancel(LEGACY)</tag>
No info available yet.
-->
<tag>cat</tag>
No info available yet.
<tag>cd</tag>
No info available yet.
<tag>cflow</tag>
No info available yet.
<tag>chgrp</tag>
As of fileutils-4.0u: OK.
<tag>chmod</tag>
As of fileutils-4.0u: OK.
<tag>chown</tag>
As of fileutils-4.0u: OK.
<tag>chroot</tag>
As of sh-utils-2.0i: OK.
<tag>cksum</tag>
As of textutils-2.0e: OK.
<tag>clear</tag>
No info available yet.
<tag>cmp</tag>
No info available yet.
<tag>col</tag>
No info available yet.
<tag>comm</tag>
No info available yet.
<tag>command</tag>
No info available yet.
<tag>compress</tag>
No info available yet.
<tag>cp</tag>
As of fileutils-4.0u: OK.
<tag>cpio</tag>
No info available yet.
<tag>crontab</tag>
No info available yet.
<tag>csplit</tag>
No info available yet.
<tag>ctags</tag>
No info available yet.
<tag>cut</tag>
No info available yet.
<tag>date</tag>
As of sh-utils-2.0i: OK.
<tag>dd</tag>
As of fileutils-4.0u: The conv=lcase, conv=ucase options don't work correctly.
<tag>df</tag>
As of fileutils-4.0u: OK.
<tag>diff</tag>
As of diffutils-2.7.2: the --side-by-side mode doesn't compute
column width correctly.
<tag>diff3</tag>
No info available yet.
<tag>dirname</tag>
As of sh-utils-2.0i: OK.
<tag>domainname</tag>
No info available yet.
<tag>du</tag>
As of fileutils-4.0u: OK.
<tag>echo</tag>
As of sh-utils-2.0i: OK.
<tag>ed</tag>
No info available yet.
<tag>egrep</tag>
No info available yet.
<tag>env</tag>
As of sh-utils-2.0i: OK.
<tag>ex</tag>
No info available yet.
<tag>expand</tag>
No info available yet.
<tag>expr</tag>
As of sh-utils-2.0i: The operators "match", "substr", "index", "length"
don't work correctly.
<tag>false</tag>
As of sh-utils-2.0i: OK.
<tag>fc</tag>
No info available yet.
<tag>fg</tag>
No info available yet.
<tag>fgrep</tag>
No info available yet.
<tag>file</tag>
No info available yet.
<tag>find</tag>
As of findutils-4.1.6: The "-iregex" option does not work correctly; this needs a
fix in function find/parser.c:insert_regex.
<tag>fold</tag>
No info available yet.
<tag>ftp[BSD]</tag>
No info available yet.
<tag>fuser</tag>
No info available yet.
<tag>gencat</tag>
No info available yet.
<tag>getconf</tag>
No info available yet.
<tag>getopts</tag>
No info available yet.
<tag>gettext</tag>
No info available yet.
<tag>grep</tag>
No info available yet.
<tag>gunzip</tag>
No info available yet.
<tag>gzip</tag>
gzip-1.3 is UTF-8 capable, but it uses only English messages in ASCII
charset. Proper internationalization would require: Use gettext. Call
setlocale. In function check_ofname (file gzip.c), use the function rpmatch
from GNU text/sh/fileutils instead of asking for "y" or "n". The use
of strlen in gzip.c:852 is wrong; the function mbswidth should be used instead.
<tag>hash</tag>
No info available yet.
<tag>head</tag>
No info available yet.
<tag>hostname</tag>
As of sh-utils-2.0i: OK.
<tag>iconv</tag>
No info available yet.
<tag>id</tag>
As of sh-utils-2.0i: OK.
<tag>ifconfig</tag>
No info available yet.
<tag>imake</tag>
No info available yet.
<tag>ipcrm</tag>
No info available yet.
<tag>ipcs</tag>
No info available yet.
<tag>jobs</tag>
No info available yet.
<tag>join</tag>
No info available yet.
<tag>kill</tag>
No info available yet.
<tag>killall</tag>
No info available yet.
<tag>ldd</tag>
No info available yet.
<tag>less</tag>
No complete info available yet.
<tag>lex</tag>
No info available yet.
<tag>ln</tag>
As of fileutils-4.0u: OK.
<tag>locale</tag>
As of glibc-2.2: OK.
<tag>localedef</tag>
As of glibc-2.2: OK.
<tag>logger</tag>
No info available yet.
<tag>logname</tag>
As of sh-utils-2.0i: OK.
<tag>lp</tag>
No info available yet.
<tag>lpc[BSD]</tag>
No info available yet.
<tag>lpq[BSD]</tag>
No info available yet.
<tag>lpr[BSD]</tag>
No info available yet.
<tag>lprm[BSD]</tag>
No info available yet.
<tag>lpstat(LEGACY)</tag>
No info available yet.
<tag>ls</tag>
As of fileutils-4.0y: OK.
<tag>m4</tag>
No info available yet.
<tag>mailx</tag>
No info available yet.
<tag>make</tag>
No info available yet.
<tag>man</tag>
No info available yet.
<tag>mesg</tag>
No info available yet.
<tag>mkdir</tag>
As of fileutils-4.0u: OK.
<tag>mkfifo</tag>
As of fileutils-4.0u: OK.
<tag>mkfs</tag>
No info available yet.
<tag>mkswap</tag>
No info available yet.
<tag>more</tag>
No info available yet.
<tag>mount</tag>
No info available yet.
<tag>msgfmt</tag>
No info available yet.
<tag>msgmerge</tag>
No info available yet.
<tag>mv</tag>
As of fileutils-4.0u: OK.
<tag>netstat</tag>
No info available yet.
<tag>newgrp</tag>
No info available yet.
<tag>nice</tag>
As of sh-utils-2.0i: OK.
<tag>nl</tag>
No info available yet.
<tag>nohup</tag>
As of sh-utils-2.0i: OK.
<tag>nslookup</tag>
No info available yet.
<tag>nm</tag>
No info available yet.
<tag>od</tag>
No info available yet.
<tag>passwd[BSD]</tag>
No info available yet.
<tag>paste</tag>
No info available yet.
<tag>patch</tag>
No info available yet.
<tag>pathchk</tag>
As of sh-utils-2.0i: OK.
<tag>ping</tag>
No info available yet.
<tag>pr</tag>
No info available yet.
<tag>printf</tag>
As of sh-utils-2.0i: OK.
<tag>ps</tag>
No info available yet.
<tag>pwd</tag>
As of sh-utils-2.0i: OK.
<tag>read</tag>
No info available yet.
<tag>reboot</tag>
No info available yet.
<tag>renice</tag>
No info available yet.
<tag>rm</tag>
As of fileutils-4.0u: OK.
<tag>rmdir</tag>
As of fileutils-4.0u: OK.
<tag>sed</tag>
No info available yet.
<tag>shar[BSD]</tag>
No info available yet.
<tag>shutdown</tag>
No info available yet.
<tag>sleep</tag>
As of sh-utils-2.0i: OK.
<tag>sort</tag>
No info available yet.
<tag>split</tag>
No info available yet.
<tag>strings</tag>
No info available yet.
<tag>strip</tag>
No info available yet.
<tag>stty</tag>
As of sh-utils-2.0.11: OK.
<tag>su[BSD]</tag>
No info available yet.
<tag>sum</tag>
As of textutils-2.0e: OK.
<tag>tail</tag>
No info available yet.
<tag>talk</tag>
No info available yet.
<tag>tar</tag>
As of tar-1.13.17: OK, if user and group names are always ASCII.
<tag>tclsh</tag>
No info available yet.
<tag>tee</tag>
As of sh-utils-2.0i: OK.
<tag>telnet</tag>
No info available yet.
<tag>test</tag>
As of sh-utils-2.0i: OK.
<tag>time</tag>
No info available yet.
<tag>touch</tag>
As of fileutils-4.0u: OK.
<tag>tput</tag>
No info available yet.
<tag>tr</tag>
No info available yet.
<tag>true</tag>
As of sh-utils-2.0i: OK.
<tag>tsort</tag>
No info available yet.
<tag>tty</tag>
As of sh-utils-2.0i: OK.
<tag>type</tag>
No info available yet.
<tag>ulimit</tag>
No info available yet.
<tag>umask</tag>
No info available yet.
<tag>umount</tag>
No info available yet.
<tag>unalias</tag>
No info available yet.
<tag>uname</tag>
As of sh-utils-2.0i: OK.
<tag>uncompress</tag>
No info available yet.
<tag>unexpand</tag>
No info available yet.
<tag>uniq</tag>
No info available yet.
<tag>uudecode</tag>
No info available yet.
<tag>uuencode</tag>
No info available yet.
<tag>vi</tag>
No info available yet.
<tag>wait</tag>
No info available yet.
<tag>wc</tag>
As of textutils-2.0.8: OK.
<tag>who</tag>
As of sh-utils-2.0i: OK.
<tag>wish</tag>
No info available yet.
<tag>write</tag>
No info available yet.
<tag>xargs</tag>
As of findutils-4.1.5: The program uses strstr; a patch has been submitted
to the maintainer.
<tag>xgettext</tag>
No info available yet.
<tag>yacc</tag>
No info available yet.
<tag>zcat</tag>
No info available yet.
</descrip>
<sect1>Other X11 applications
<p>
Owen Taylor is currently developing a library for rendering multilingual
text, called pango.
<htmlurl url="http://www.labs.redhat.com/~otaylor/pango/"
name="http://www.labs.redhat.com/~otaylor/pango/">,
<htmlurl url="http://www.pango.org/"
name="http://www.pango.org/">.
<sect>Printing
<p>
Since Postscript itself does not support Unicode fonts, the burden of
Unicode support in printing is on the program creating the Postscript
document, not on the Postscript renderer.
The existing Postscript fonts I've seen - .pfa/.pfb/.afm/.pfm/.gsf -
support only a small range of glyphs and are not Unicode fonts.
<!-- Can teTeX's `ttf2afm' program be any useful here?? -->
<!-- I don't think ghostscript's bdftops program is useful here. -->
<sect1>Printing using TrueType fonts
<p>
Both the uniprint and wprint programs produce good printed output
for Unicode plain text. They require a TrueType font; see section
"TrueType fonts" above. The Bitstream Cyberbit gives good results.
<sect2>uniprint
<p>
The "uniprint" program contained in the yudit package can convert a text
file to Postscript. For uniprint to find the Cyberbit font, symlink it to
<tt>/usr/local/share/yudit/data/cyberbit.ttf</tt>.
<sect2>wprint
<p>
The "wprint" (WorldPrint) program by Eduardo Trapani
<htmlurl url="http://ttt.esperanto.org.uy/programoj/angle/wprint.html"
name="http://ttt.esperanto.org.uy/programoj/angle/wprint.html">
postprocesses Postscript output produced by Netscape Communicator or Mozilla
from HTML pages or plain text files.
The output is nearly perfect; only in Cyrillic paragraphs is the line breaking
incorrect: the lines are only about half as wide as they should be.
<sect2>Comparison
<p>
For plain text, uniprint has a better overall layout. On the other hand,
only wprint gets Thai output correct.
<sect1>Printing using fixed-size fonts
<p>
Generally, printing using fixed-size fonts does not give output that looks as
professional as printing using TrueType fonts.
<sect2>txtbdf2ps
<p>
The txtbdf2ps 0.7 program by Serge Winitzki
<htmlurl url="http://members.linuxstart.com/~winitzki/txtbdf2ps.html"
name="http://members.linuxstart.com/~winitzki/txtbdf2ps.html">
converts a plain text file to Postscript, by use of a BDF font.
Installation:
<tscreen><verb>
# install -m 755 txtbdf2ps-dev.txt /usr/local/bin/txtbdf2ps
</verb></tscreen>
Example with a proportional font:
<tscreen><verb>
$ txtbdf2ps -BDF=cyberbit.bdf -UTF-8 -nowrap < input.txt > output.ps
</verb></tscreen>
Example with a fixed-width font:
<tscreen><verb>
$ txtbdf2ps -BDF=unifont.bdf -UTF-8 -nowrap < input.txt > output.ps
</verb></tscreen>
Note: txtbdf2ps does not support combining characters or bidi.
<sect1>The classical approach
<p>
Another way to print with TrueType fonts is to convert the TrueType font to
a Postscript font using the <tt>ttf2pt1</tt> utility
(<htmlurl url="http://www.netspace.net.au/~mheath/ttf2pt1/"
name="http://www.netspace.net.au/~mheath/ttf2pt1/">,
<htmlurl url="http://quadrant.netspace.net.au/ttf2pt1/"
name="http://quadrant.netspace.net.au/ttf2pt1/">,
<htmlurl url="http://ttf2pt1.sourceforge.net/"
name="http://ttf2pt1.sourceforge.net/">). Details can be
found in Juliusz Chroboczek's "Printing with TrueType fonts in Unix" writeup,
<htmlurl url="http://www.dcs.ed.ac.uk/home/jec/programs/xfsft/printing.html"
name="http://www.dcs.ed.ac.uk/home/jec/programs/xfsft/printing.html">.
<sect2>TeX, Omega
<p>
TODO: CJK, metafont, omega, dvips, odvips, utf8-tex-0.1
<!-- Not useful now: The `ps2pk' and `afm2tfm' programs, contained in the
teTeX distribution, can make use of existing Postscript/Type1 fonts for use
with TeX. So can the `ps2mf' program, not included with teTeX. -->
<sect2>DocBook
<p>
TODO: db2ps, jadetex
<sect2>groff -Tps
<p>
"groff -Tps" produces Postscript output. Its Postscript output driver
supports only a very limited number of Unicode characters (only what
Postscript supports by itself).
<!-- Not useful now: The `afmtodit' and `pfbtops' programs, contained in the
groff package, can make use of existing Postscript/Type1 fonts for use with
groff. -->
<sect1>No luck with...
<sect2>Netscape's "Print..."
<p>
As of version 4.72, Netscape Communicator cannot correctly print HTML
pages in UTF-8 encoding. You really have to use wprint.
<sect2>Mozilla's "Print..."
<p>
As of version M16, printing of HTML pages is apparently not implemented.
<sect2>html2ps
<p>
As of version 1.0b1, the html2ps HTML to Postscript converter does not support
UTF-8 encoded HTML pages and has no special treatment of fonts: the generated
Postscript uses the standard Postscript fonts.
<sect2>a2ps
<p>
As of version 4.12, a2ps doesn't support printing UTF-8 encoded text.
<sect2>enscript
<p>
As of version 1.6.1, enscript doesn't support printing UTF-8 encoded text.
By default, it uses only the standard Postscript fonts, but it can also
include a custom Postscript font in the output.
<sect>Making your programs Unicode aware
<p>
<sect1>C/C++
<p>
The C `<tt>char</tt>' type is 8-bit and will stay 8-bit because it denotes
the smallest addressable data unit. Various facilities are available:
<sect2>For normal text handling
<p>
The ISO/ANSI C standard contains, in an amendment which was added in 1995,
a "wide character" type `<tt>wchar_t</tt>', a set of functions like those
found in <tt>&lt;string.h&gt;</tt> and <tt>&lt;ctype.h&gt;</tt> (declared in
<tt>&lt;wchar.h&gt;</tt> and <tt>&lt;wctype.h&gt;</tt>, respectively), and
a set of conversion functions between `<tt>char *</tt>' and
`<tt>wchar_t *</tt>' (declared in <tt>&lt;stdlib.h&gt;</tt>).
Good references for this API are
<itemize>
<item>
the GNU libc-2.1 manual, chapters 4 "Character Handling" and
6 "Character Set Handling",
<item>
the manual pages
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/man-mbswcs.tar.gz"
name="man-mbswcs.tar.gz">, now contained in
<htmlurl url="ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz"
name="ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz">,
<item>
the OpenGroup's introduction
<htmlurl url="http://www.unix-systems.org/version2/whatsnew/login_mse.html"
name="http://www.unix-systems.org/version2/whatsnew/login_mse.html">,
<item>
the OpenGroup's Single Unix specification
<htmlurl url="http://www.UNIX-systems.org/online.html"
name="http://www.UNIX-systems.org/online.html">,
<item>
the ISO/IEC 9899:1999 (ISO C 99) standard. The latest draft before it was
adopted is called n2794. You find it at
<htmlurl url="ftp://ftp.csn.net/DMK/sc22wg14/review/"
name="ftp://ftp.csn.net/DMK/sc22wg14/review/">
or
<htmlurl url="http://java-tutor.com/docs/c/"
name="http://java-tutor.com/docs/c/">.
<item>
Clive Feather's introduction
<htmlurl url="http://www.lysator.liu.se/c/na1.html"
name="http://www.lysator.liu.se/c/na1.html">,
<item>
the Dinkumware C library reference
<htmlurl url="http://www.dinkumware.com/htm_cl/"
name="http://www.dinkumware.com/htm_cl/">.
</itemize>
Advantages of using this API:
<itemize>
<item>
It's a vendor independent standard.
<item>
The functions do the right thing, depending on the user's locale.
All a program needs to call is <tt>setlocale(LC_ALL,"");</tt>.
</itemize>
Drawbacks of this API:
<itemize>
<item>
Some of the functions are not multithread-safe, because they keep a hidden
internal state between function calls.
<item>
There is no first-class locale datatype. Therefore this API cannot reasonably
be used for anything that needs more than one locale or character set at the
same time.
<item>
Support for this API is poor on most operating systems.
</itemize>
<sect3>Portability notes
<p>
A `<tt>wchar_t</tt>' may or may not be encoded in Unicode; this is
platform and sometimes also locale dependent. A multibyte sequence
`<tt>char *</tt>' may or may not be encoded in UTF-8; this is platform
and sometimes also locale dependent.
In detail, here is what the
<htmlurl url="http://www.UNIX-systems.org/online.html"
name="Single Unix specification">
says about the `<tt>wchar_t</tt>' type:
<em>All wide-character codes in a given process consist of an equal number
of bits. This is in contrast to characters, which can consist of a
variable number of bytes. The byte or byte sequence that represents a
character can also be represented as a wide-character code.
Wide-character codes thus provide a uniform size for manipulating text
data. A wide-character code having all bits zero is the null
wide-character code, and terminates wide-character strings. The
wide-character value for each member of the Portable Character Set
</em> (i.e. ASCII) <em>
will equal its value when used as the lone character in an integer
character constant. Wide-character codes for other characters are
locale- and implementation-dependent. State shift bytes do not have a
wide-character code representation.</em>
One particular consequence is that in portable programs you shouldn't use
non-ASCII characters in string literals. That means, even though you
know the Unicode double quotation marks have the codes U+201C and U+201D,
you shouldn't write a string literal <tt>L"\u201cHello\u201d, he said"</tt>
or <tt>"\xe2\x80\x9cHello\xe2\x80\x9d, he said"</tt> in C programs. Instead,
use GNU gettext, write it as <tt>gettext("'Hello', he said")</tt>, and create
a message database en.po which translates "'Hello', he said" to
"\u201cHello\u201d, he said".
Here is a survey of the portability of the ISO/ANSI C facilities on various
Unix flavours.
<descrip>
<tag>GNU glibc-2.2.x</tag>
<itemize>
<item>&lt;wchar.h&gt; and &lt;wctype.h&gt; exist.
<item>Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
<item>Has five UTF-8 locales.
<item>mbrtowc works.
</itemize>
<tag>GNU glibc-2.0.x, glibc-2.1.x</tag>
<itemize>
<item>&lt;wchar.h&gt; and &lt;wctype.h&gt; exist.
<item>Has wcs/mbs functions, but no fgetwc/fputwc/wprintf.
<item>No UTF-8 locale.
<item>mbrtowc returns EILSEQ for bytes &gt;= 0x80.
</itemize>
<tag>AIX 4.3</tag>
<itemize>
<item>&lt;wchar.h&gt; and &lt;wctype.h&gt; exist.
<item>Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
<item>Has many UTF-8 locales, one for every country.
<item>Needs -D_XOPEN_SOURCE=500 in order to define mbstate_t.
<item>mbrtowc works.
</itemize>
<tag>Solaris 2.7</tag>
<itemize>
<item>&lt;wchar.h&gt; and &lt;wctype.h&gt; exist.
<item>Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
<item>Has the following UTF-8 locales:
en_US.UTF-8, de.UTF-8, es.UTF-8, fr.UTF-8, it.UTF-8, sv.UTF-8.
<item>mbrtowc returns -1/EILSEQ (instead of -2) for bytes &gt;= 0x80.
</itemize>
<tag>OSF/1 4.0d</tag>
<itemize>
<item>&lt;wchar.h&gt; and &lt;wctype.h&gt; exist.
<item>Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
<item>Has an add-on universal.utf8@ucs4 locale, see "man 5 unicode".
<item>mbrtowc does not know about UTF-8.
</itemize>
<tag>Irix 6.5</tag>
<itemize>
<item>&lt;wchar.h&gt; and &lt;wctype.h&gt; exist.
<item>Has wcs/mbs functions and fgetwc/fputwc, but not wprintf.
<item>Has no multibyte locales.
<item>Has only a dummy definition for mbstate_t.
<item>Doesn't have mbrtowc.
</itemize>
<tag>HP-UX 11.00</tag>
<itemize>
<item>&lt;wchar.h&gt; exists, &lt;wctype.h&gt; does not.
<item>Has wcs/mbs functions and fgetwc/fputwc, but not wprintf.
<item>Has a C.utf8 locale.
<item>Doesn't have mbstate_t.
<item>Doesn't have mbrtowc.
</itemize>
</descrip>
As a consequence, I recommend using the restartable and multithread-safe
wcsr/mbsr functions, forgetting about those systems which don't have them (Irix,
HP-UX, AIX), and using the UTF-8 locale plug-in libutf8_plug.so (see below)
on those systems which permit you to compile programs which use these
wcsr/mbsr functions (Linux, Solaris, OSF/1).
Similar advice, given by Sun in
<htmlurl url="http://www.sun.com/software/white-papers/wp-unicode/"
name="http://www.sun.com/software/white-papers/wp-unicode/">,
section "Internationalized Applications with Unicode", is:
<em>To properly internationalize an application, use the following
guidelines:</em>
<enum>
<item><em>Avoid direct access with Unicode. This is a task of the platform's
internationalization framework.</em>
<item><em>Use the POSIX model for multibyte and wide-character interfaces.</em>
<item><em>Only call the APIs that the internationalization framework
provides for language and cultural-specific operations.</em>
<item><em>Remain code-set independent.</em>
</enum>
If, for some reason, in some piece of code, you really have to assume that
`wchar_t' is Unicode (for example, if you want to do special treatment of
some Unicode characters), you should make that piece of code conditional
upon the result of <tt>is_locale_utf8()</tt>. Otherwise you will mess up
your program's behaviour in different locales or other platforms. The
function <tt>is_locale_utf8</tt> is declared in
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/utf8locale.h" name="utf8locale.h">
and defined in
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/utf8locale.c" name="utf8locale.c">.
<sect3>The libutf8 library
<p>
A portable implementation of the ISO/ANSI C API, which supports 8-bit locales
and UTF-8 locales, can be found in
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/utf8/libutf8-0.7.3.tar.gz"
name="libutf8-0.7.3.tar.gz">.
Advantages:
<itemize>
<item>
Unicode UTF-8 support now, portably, even on OSes whose multibyte character
support does not work or which don't have multibyte/wide character support
at all.
<item>
The same binary works in all OS supported 8-bit locales and in UTF-8 locales.
<item>
When an OS vendor adds proper multibyte character support, you can take
advantage of it by simply recompiling without the -DHAVE_LIBUTF8 compiler option.
</itemize>
<sect3>The Plan9 way
<p>
The Plan9 operating system, a variant of Unix, uses UTF-8 as character
encoding in all applications. Its wide character type is called
`<tt>Rune</tt>', not `<tt>wchar_t</tt>'. Parts of its libraries, written by
Rob Pike and Howard Trickey, are available at
<htmlurl url="ftp://ftp.cdrom.com/pub/netlib/research/9libs/9libs-1.0.tar.gz"
name="ftp://ftp.cdrom.com/pub/netlib/research/9libs/9libs-1.0.tar.gz">.
Another similar library, written by Alistair G. Crooks, is
<htmlurl url="ftp://ftp.cdrom.com/pub/NetBSD/packages/distfiles/libutf-2.10.tar.gz"
name="ftp://ftp.cdrom.com/pub/NetBSD/packages/distfiles/libutf-2.10.tar.gz">.
In particular, each of these libraries contains a UTF-8 aware regular
expression matcher.
Drawback of this API:
<itemize>
<item>
UTF-8 is compiled in, not optional. Programs compiled in this universe lose
support for the 8-bit encodings which are still frequently used in Europe.
</itemize>
<sect2>For graphical user interface
<p>
The Qt-2.0 library
<htmlurl url="http://www.troll.no/"
name="http://www.troll.no/">
contains a fully-Unicode QString class. You can use the member functions
QString::utf8 and QString::fromUtf8 to convert to/from UTF-8 encoded text.
The QString::ascii and QString::latin1 member functions should not be used
any more.
<sect2>For advanced text handling
<p>
The previously mentioned libraries implement Unicode aware versions of
the ASCII concepts. Here are libraries which deal with Unicode concepts,
such as titlecase (a third letter case, different from uppercase and
lowercase), distinction between punctuation and symbols, canonical
decomposition, combining classes, canonical ordering and the like.
<descrip>
<tag>ucdata-2.4</tag>
The ucdata library by Mark Leisher
<htmlurl url="http://crl.nmsu.edu/~mleisher/ucdata.html"
name="http://crl.nmsu.edu/~mleisher/ucdata.html">
deals with character properties, case conversion, decomposition, combining
classes. The companion package ure-0.5
<htmlurl url="http://crl.nmsu.edu/~mleisher/ure-0.5.tar.gz"
name="http://crl.nmsu.edu/~mleisher/ure-0.5.tar.gz">
is a Unicode regular expression matcher.
<tag>ustring</tag>
The ustring C++ library by Rodrigo Reyes
<htmlurl url="http://ustring.charabia.net/"
name="http://ustring.charabia.net/">
deals with character properties, case conversion, decomposition, combining
classes, and includes a Unicode regular expression matcher.
<tag>ICU</tag>
International Components for Unicode
<htmlurl url="http://oss.software.ibm.com/icu/"
name="http://oss.software.ibm.com/icu/">.
IBM's very comprehensive internationalization library featuring Unicode strings,
resource bundles, number formatters, date/time formatters, message formatters,
collation and more. Lots of supported locales. Portable to Unix and Win32,
but compiles out of the box only on Linux libc6, not libc5.
<tag>libunicode</tag>
The GNOME libunicode library
<htmlurl url="http://cvs.gnome.org/lxr/source/libunicode/"
name="http://cvs.gnome.org/lxr/source/libunicode/">
by Tom Tromey and others. It covers character set conversion, character
properties, decomposition.
</descrip>
<sect2>For conversion
<p>
Three kinds of conversion libraries, which support UTF-8 and a large number
of 8-bit character sets, are available:
<sect3>iconv
<p>
Three iconv implementations are available:
<itemize>
<item>
The iconv implementation by Ulrich Drepper, contained in the GNU glibc-2.2,
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/glibc/glibc-2.2.tar.gz"
name="ftp://ftp.gnu.org/pub/gnu/glibc/glibc-2.2.tar.gz">.
The iconv manpages are now contained in
<htmlurl url="ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz"
name="ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz">.
<item>
The portable iconv implementation by Bruno Haible,
<htmlurl url="ftp://ftp.ilog.fr/pub/Users/haible/gnu/libiconv-1.5.1.tar.gz"
name="ftp://ftp.ilog.fr/pub/Users/haible/gnu/libiconv-1.5.1.tar.gz">.
<item>
The portable iconv implementation by Konstantin Chuguev
<htmlurl url="mailto:joy@urc.ac.ru"
name="&lt;joy@urc.ac.ru&gt;">,
<htmlurl url="ftp://ftp.urc.ac.ru/pub/local/OS/Unix/converters/iconv-0.4.tar.gz"
name="ftp://ftp.urc.ac.ru/pub/local/OS/Unix/converters/iconv-0.4.tar.gz">.
</itemize>
Advantages:
<itemize>
<item>
iconv is standardized by POSIX, so programs using iconv to convert from/to UTF-8
will also run under Solaris. However, the names for the character sets differ
between platforms. For example, "EUC-JP" under glibc is "eucJP" under HP-UX.
(The official IANA name for this character set is "EUC-JP", so it's clearly
an HP-UX deficiency.)
<item>
On glibc-2.1 systems, no additional library is needed. On other systems, one of
the two other iconv implementations can be used.
</itemize>
<sect3>librecode
<p>
librecode by Fran&ccedil;ois Pinard
<htmlurl url="ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz"
name="ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz">.
Advantages:
<itemize>
<item>
Support for transliteration, i.e. conversion of non-ASCII characters
to sequences of ASCII characters in order to preserve readability by
humans, even when a lossless transformation is impossible.
</itemize>
Drawbacks:
<itemize>
<item>
Non-standard API.
<item>
Slow initialization.
</itemize>
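<p>
What transliteration means can be sketched in a few lines of Python. This is
not librecode's algorithm or API. It is a crude stand-in which assumes that
decomposing accented letters and dropping the combining marks is acceptable.
The function name `to_ascii' is hypothetical.
<tscreen><verb>
import unicodedata

def to_ascii(text):
    """Crude transliteration: strip accents, mark what cannot be mapped."""
    # NFKD splits e.g. U+00E7 into "c" followed by a combining cedilla.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII becomes "?" instead of raising an error.
    return without_marks.encode("ascii", "replace").decode("ascii")

print(to_ascii("Fran\u00e7ois"))       # Francois
</verb></tscreen>
Characters without a decomposition, such as the Euro sign, come out as `?'
in this sketch. A table-driven transliterator can do better.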
<sect3>ICU
<p>
International Components for Unicode 1.7
<htmlurl url="http://oss.software.ibm.com/icu/"
name="http://oss.software.ibm.com/icu/">.
IBM's internationalization library also has conversion facilities, declared
in `<tt>ucnv.h</tt>'.
Advantages:
<itemize>
<item>
Comprehensive set of supported encodings.
</itemize>
Drawbacks:
<itemize>
<item>
Non-standard API.
</itemize>
<sect2>Other approaches
<p>
<descrip>
<tag>libutf-8</tag>
libutf-8 by G. Adam Stanislav
<htmlurl url="mailto:adam@whizkidtech.net"
name="&lt;adam@whizkidtech.net&gt;">
contains a few functions for on-the-fly conversion from/to UTF-8 encoded
`FILE*' streams.
<htmlurl url="http://www.whizkidtech.net/i18n/libutf-8-1.0.tar.gz"
name="http://www.whizkidtech.net/i18n/libutf-8-1.0.tar.gz">
Advantages:
<itemize>
<item>
Very small.
</itemize>
Drawbacks:
<itemize>
<item>
Non-standard API.
<item>
UTF-8 is compiled in, not optional. Programs compiled with this library
lose support for the 8-bit encodings which are still frequently used in Europe.
<item>
Installation is nontrivial: the Makefile needs manual tweaking, because the
package does not configure itself automatically.
</itemize>
</descrip>
<sect1>Java
<p>
Java has Unicode support built into the language. The type `char' denotes
a Unicode character, and the `java.lang.String' class denotes a string
built up from Unicode characters.
Java can display any Unicode characters through its windowing system AWT,
provided that
<enum>
<item>
you set the Java system property "user.language" appropriately,
<item>
the /usr/lib/java/lib/font.properties.<it>language</it> font set
definitions are appropriate, and
<item>
the fonts specified in that file are installed.
</enum>
For example, in order to display text containing Japanese characters,
you would install Japanese fonts and run "java -Duser.language=ja ...".
You can combine font sets: In order to display Western European, Greek
and Japanese characters simultaneously, you would combine the files
"font.properties" (covers ISO-8859-1), "font.properties.el"
(covers ISO-8859-7) and "font.properties.ja" into a single file.
(This is untested.)
The interfaces java.io.DataInput and java.io.DataOutput have methods called
`readUTF' and `writeUTF' respectively. But note that they don't use standard
UTF-8; they use a modified UTF-8 encoding: the NUL character is encoded as the
two-byte sequence 0xC0 0x80 instead of 0x00, so the encoded bytes never
contain 0x00. `writeUTF' prefixes the encoded string with a two-byte length
field. The same encoding without a length prefix, terminated by a 0x00 byte,
is used by the Java Native Interface. Encoded this way, strings can contain
NUL characters and nevertheless the C &lt;string.h&gt; functions
like strlen() and strcpy() can be used to manipulate them.
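<p>
The special treatment of the NUL character described above can be sketched
as follows. The sketch is written in Python, the function name
`modified_utf8' is hypothetical, and it covers only the encoded bytes, not
any length field or terminator. It also relies on one fact beyond the text
above: Java works on 16-bit code units, so a character beyond U+FFFF comes
out as two three-byte sequences.
<tscreen><verb>
def modified_utf8(text):
    """Encode text the way Java's modified UTF-8 does (sketch)."""
    out = bytearray()
    # Work on 16-bit UTF-16 code units, as Java's `char' type does.
    units = text.encode("utf-16-be")
    for i in range(0, len(units), 2):
        unit = units[i] * 256 + units[i + 1]
        if 0x0001 <= unit <= 0x007F:
            out.append(unit)                      # plain ASCII, one byte
        elif unit <= 0x07FF:
            # Two bytes. This branch also catches NUL, giving 0xC0 0x80.
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)

print(modified_utf8("A\x00B"))                    # b'A\xc0\x80B'
</verb></tscreen>
For text without NUL characters and without characters beyond U+FFFF the
result is identical to standard UTF-8.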
<sect1>Lisp
<p>
The Common Lisp standard specifies two character types: `base-char' and
`character'. It's up to the implementation to support Unicode or not.
The language also specifies a keyword argument `:external-format' to `open',
as the natural place to specify a character set or encoding.
Among the free Common Lisp implementations, only CLISP
<htmlurl url="http://clisp.cons.org/"
name="http://clisp.cons.org/">
supports Unicode. You need a CLISP version from March 2000 or newer.
<htmlurl url="ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz"
name="ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz">.
The types `base-char' and `character' are both equivalent to 16-bit Unicode.
The functions <tt>char-width</tt> and <tt>string-width</tt> provide an
API comparable to <tt>wcwidth()</tt> and <tt>wcswidth()</tt>.
The encoding used for file or socket/pipe I/O can be specified through the
`:external-format' argument. The encodings used for tty I/O and the default
encoding for file/socket/pipe I/O are locale dependent.
Among the commercial Common Lisp implementations, LispWorks
<htmlurl url="http://www.xanalys.com/software_tools/products/"
name="http://www.xanalys.com/software_tools/products/">
supports Unicode.
The type `base-char' is equivalent to ISO-8859-1, and the type `simple-char'
(subtype of `character') contains all Unicode characters.
The encoding used for file I/O can be specified through the
`:external-format' argument, for example '(:UTF-8).
Limitations: Encodings cannot be used for socket I/O. The editor cannot edit
UTF-8 encoded files.
Eclipse
<htmlurl url="http://www.elwood.com/eclipse/eclipse.htm"
name="http://www.elwood.com/eclipse/eclipse.htm">
supports Unicode. See
<htmlurl url="http://www.elwood.com/eclipse/char.htm"
name="http://www.elwood.com/eclipse/char.htm">.
The type `base-char' is equivalent
to ISO-8859-1, and the type `character' contains all Unicode characters.
The encoding used for file I/O can be specified through a combination of
the `:element-type' and `:external-format' arguments to `open'.
Limitations: Character attribute functions are locale dependent. Source and
compiled source files cannot contain Unicode string literals.
The commercial Common Lisp implementation Allegro CL, in version 6.0, has
Unicode support. The types `base-char' and `character' are both equivalent
to 16-bit Unicode. The encoding used for file I/O can be specified through the
`:external-format' argument, for example <tt>:external-format :utf8</tt>.
The default encoding is locale dependent. More details are at
<htmlurl url="http://www.franz.com/support/documentation/6.0/doc/iacl.htm"
name="http://www.franz.com/support/documentation/6.0/doc/iacl.htm">.
<sect1>Ada95
<p>
Ada95 was designed for Unicode support and the Ada95 standard library
features special ISO 10646-1 data types Wide_Character and Wide_String,
as well as numerous associated procedures and functions. The GNU Ada95
compiler (gnat-3.11 or newer) supports UTF-8 as the external encoding of
wide characters. This allows you to use UTF-8 in both source code and
application I/O. To activate it in the application, use "WCEM=8" in the
FORM string when opening a file, and use compiler option "-gnatW8" if
the source code is in UTF-8. See the GNAT
(<htmlurl url="ftp://cs.nyu.edu/pub/gnat/"
name="ftp://cs.nyu.edu/pub/gnat/">)
and Ada95
(<htmlurl url="ftp://ftp.cnam.fr/pub/Ada/PAL/userdocs/docadalt/rm95/index.htm"
name="ftp://ftp.cnam.fr/pub/Ada/PAL/userdocs/docadalt/rm95/index.htm">)
reference manuals for details.
<sect1>Python
<p>
Python 2.0
(<htmlurl url="http://www.python.org/2.0/"
name="http://www.python.org/2.0/">,
<htmlurl url="http://www.python.org/pipermail/python-announce-list/2000-October/000889.html"
name="http://www.python.org/pipermail/python-announce-list/2000-October/000889.html">,
<htmlurl url="http://starship.python.net/crew/amk/python/writing/new-python/new-python.html#SECTION000300000000000000000"
name="http://starship.python.net/crew/amk/python/writing/new-python/new-python.html">)
contains Unicode support. It has a new fundamental data type
`unicode', representing a Unicode string, a module `unicodedata' for the
character properties, and a set of converters for the most important encodings.
See
<htmlurl url="http://starship.python.net/crew/lemburg/unicode-proposal.txt"
name="http://starship.python.net/crew/lemburg/unicode-proposal.txt">,
or the file <tt>Misc/unicode.txt</tt> in the distribution, for details.
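<p>
The three pieces mentioned above can be tried out as follows. The sketch
assumes a current Python 3 interpreter, in which the `unicode' type of
Python 2.0 has become the ordinary `str' type. The sample string is arbitrary.
<tscreen><verb>
import unicodedata

s = "\u20ac 10"                        # EURO SIGN, a space, "10"
print(len(s))                          # 4 characters
print(unicodedata.name(s[0]))          # EURO SIGN
print(unicodedata.category(s[0]))      # Sc (currency symbol)

# Converters: encode to bytes in a given encoding, and decode back.
utf8 = s.encode("utf-8")
print(len(utf8))                       # 6 bytes
print(utf8.decode("utf-8") == s)       # True
print(s.encode("iso-8859-15"))         # b'\xa4 10'
</verb></tscreen>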
<sect1>JavaScript/ECMAscript
<p>
Since JavaScript version 1.3, strings are always Unicode. There is no
character type, but you can use the \uXXXX notation for Unicode characters
inside strings. No normalization is done internally, so it expects to receive
Unicode Normalization Form C, which the W3C recommends. See
<htmlurl url="http://developer.netscape.com/docs/manuals/communicator/jsref/js13.html#Unicode"
name="http://developer.netscape.com/docs/manuals/communicator/jsref/js13.html#Unicode">
for details and
<htmlurl url="http://developer.netscape.com/docs/javascript/e262-pdf.pdf"
name="http://developer.netscape.com/docs/javascript/e262-pdf.pdf">
for the complete ECMAscript specification.
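<p>
Why the normalization form matters can be shown with a short experiment.
It is written in Python rather than JavaScript, on the assumption that
Python's standard `unicodedata' module implements the same Normalization
Form C. The variable names are arbitrary.
<tscreen><verb>
import unicodedata

composed = "\u00e9"            # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"         # "e" followed by COMBINING ACUTE ACCENT

# Without normalization the two spellings are different strings.
print(composed == decomposed)                     # False

# After conversion to Normalization Form C they are identical.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed, len(decomposed), len(nfc)) # True 2 1
</verb></tscreen>
A language that does not normalize internally will treat the two spellings
as unequal in comparisons and lookups, so it is safest to hand it text that
is already in Normalization Form C.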
<sect1>Tcl
<p>
Tcl/Tk started using Unicode as its base character set with version 8.1.
Its internal representation for strings is UTF-8. It supports the \uXXXX
notation for Unicode characters. See
<htmlurl url="http://dev.scriptics.com/doc/howto/i18n.html"
name="http://dev.scriptics.com/doc/howto/i18n.html">.
<sect1>Perl
<p>
Perl 5.6 stores strings internally in UTF-8 format, if you write
<tscreen><verb>
use utf8;
</verb></tscreen>
at the beginning of your script. <tt>length()</tt> returns the number of
characters of a string. For details, see the Perl-i18n FAQ at
<htmlurl url="http://rf.net/~james/perli18n.html"
name="http://rf.net/~james/perli18n.html">.
Support for other (non-8-bit) encodings is available through the iconv
interface module
<htmlurl url="http://cpan.perl.org/modules/by-module/Text/Text-Iconv-1.1.tar.gz"
name="http://cpan.perl.org/modules/by-module/Text/Text-Iconv-1.1.tar.gz">.
<sect1>Related reading
<p>
Tomohiro Kubota has written an introduction to internationalization
<htmlurl url="http://www.debian.org/doc/manuals/intro-i18n/"
name="http://www.debian.org/doc/manuals/intro-i18n/">.
The emphasis of his document is on writing software that runs in any locale,
using the locale's encoding.
<sect>Other sources of information
<p>
<sect1>Mailing lists
<p>
Broader audiences can be reached at the following mailing lists.
Note that where I write `at', you should write `@'. (Anti-spam device.)
<sect2>linux-utf8
<p>
Address: <tt>linux-utf8</tt> at <tt>nl.linux.org</tt>
This mailing list is about internationalization with Unicode, and covers
a broad range of topics from the keyboard driver to the X11 fonts.
Archives are at
<htmlurl url="http://mail.nl.linux.org/linux-utf8/"
name="http://mail.nl.linux.org/linux-utf8/">.
To subscribe, send a message to <tt>majordomo</tt> at <tt>nl.linux.org</tt>
with the line "subscribe linux-utf8" in the body.
<sect2>li18nux
<p>
Address: <tt>linux-i18n</tt> at <tt>sun.com</tt>
This mailing list is focused on organizing internationalization work on
Linux, and arranging meetings between people.
To subscribe, fill in the form at
<htmlurl url="http://www.li18nux.org/"
name="http://www.li18nux.org/">
and send it to <tt>linux-i18n-request</tt> at <tt>sun.com</tt>.
<sect2>unicode
<p>
Address: <tt>unicode</tt> at <tt>unicode.org</tt>
This mailing list is focused on the standardization and continuing development
of the Unicode standard, and related technologies, such as Bidi and sorting
algorithms.
Archives are at
<htmlurl url="ftp://ftp.unicode.org/Public/MailArchive/"
name="ftp://ftp.unicode.org/Public/MailArchive/">,
but they are not regularly updated.
For subscription information, see
<htmlurl url="http://www.unicode.org/unicode/consortium/distlist.html"
name="http://www.unicode.org/unicode/consortium/distlist.html">.
<sect2>X11 internationalization
<p>
Address: <tt>i18n</tt> at <tt>xfree86.org</tt>
This mailing list addresses the people who work on better internationalization
of the X11/XFree86 system.
Archives are at
<htmlurl url="http://devel.xfree86.org/archives/i18n/"
name="http://devel.xfree86.org/archives/i18n/">.
To subscribe, send mail to the friendly person at <tt>i18n-request</tt> at
<tt>xfree86.org</tt> explaining your motivation.
<sect2>X11 fonts
<p>
Address: <tt>fonts</tt> at <tt>xfree86.org</tt>
This mailing list addresses the people who work on Unicode fonts and the
font subsystem for the X11/XFree86 system.
Archives are at
<htmlurl url="http://devel.xfree86.org/archives/fonts/"
name="http://devel.xfree86.org/archives/fonts/">.
To subscribe, send mail to the overworked person at <tt>fonts-request</tt> at
<tt>xfree86.org</tt> explaining your motivation.
</article>