.\" Hey Emacs! This file is -*- nroff -*- source.
.\"
.\" Copyright (C) Markus Kuhn, 1995, 2001
.\"
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\"
.\" The GNU General Public License's references to "object code"
.\" and "executables" are to be interpreted as the output of any
.\" document formatting or typesetting system, including
.\" intermediate and printed output.
.\"
.\" This manual is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, write to the Free
.\" Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111,
.\" USA.
.\"
.\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
.\" First version written
.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk>
.\" Update
.\"
.TH UNICODE 7 2001-05-11 "GNU" "Linux Programmer's Manual"
.SH NAME
Unicode \- the Universal Character Set
.SH DESCRIPTION
The international standard
.B ISO 10646
defines the
.BR "Universal Character Set (UCS)" .
UCS contains all characters of all other character set standards.
It also guarantees
.BR "round-trip compatibility" ,
i.e., conversion tables can be built such that no information is lost
when a string is converted from any other encoding to UCS and back.
.PP
UCS contains the characters required to represent practically all
known languages.
This includes not only the Latin, Greek, Cyrillic,
Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese,
Japanese and Korean Han ideographs as well as scripts such as
Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo,
Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian,
Ogham, Myanmar, Sinhala, Thaana, Yi, and others.
For scripts not yet
covered, research on how best to encode them for computer usage is
still going on, and they will be added eventually.
This might
include not only Hieroglyphs and various historic
Indo-European languages, but even some selected artistic scripts such
as Tengwar, Cirth, and Klingon.
UCS also covers a large number of
graphical, typographical, mathematical and scientific symbols,
including those provided by TeX, PostScript, APL, MS-DOS, MS-Windows,
Macintosh, OCR fonts, as well as many word processing and publishing
systems, and more are being added.
.PP
The UCS standard (ISO 10646) describes a
.I "31-bit character set architecture"
consisting of 128 24-bit
.IR groups ,
each divided into 256 16-bit
.I planes
made up of 256 8-bit
.I rows
with 256
.I column
positions, one for each character.
Part 1 of the standard
.RB ( "ISO 10646-1" )
defines the first 65534 code positions (0x0000 to 0xfffd), which form
the
.IR "Basic Multilingual Plane (BMP)" ,
that is, plane 0 in group 0.
Part 2 of the standard
.RB ( "ISO 10646-2" )
adds characters to group 0 outside the BMP in several
.I "supplementary planes"
in the range 0x10000 to 0x10ffff.
There are no plans to add characters
beyond 0x10ffff to the standard; therefore, of the entire code space,
only a small fraction of group 0 will actually be used in the
foreseeable future.
The BMP contains all characters found in the
other commonly used character sets.
The supplementary planes added by
ISO 10646-2 cover only more exotic characters for special scientific,
dictionary printing, publishing industry, higher-level protocol and
enthusiast needs.
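.PP
As a rough illustration of the architecture described above, the
following C sketch splits a code value into its group, plane, row,
and column.
The type
.I ucs_position
and the functions
.I ucs_split
and
.I ucs_is_bmp
are hypothetical names chosen for this example; they are not part of
any library.
.nf
```c
#include <stdint.h>

/* Hypothetical helper type: the four coordinates of a UCS code position. */
struct ucs_position {
    unsigned group;  /* bits 24 to 30: 128 groups */
    unsigned plane;  /* bits 16 to 23: 256 planes per group */
    unsigned row;    /* bits 8 to 15: 256 rows per plane */
    unsigned column; /* bits 0 to 7: 256 column positions per row */
};

/* Split a 31-bit UCS code value into its architectural coordinates. */
struct ucs_position ucs_split(uint32_t code)
{
    struct ucs_position p;

    p.group = (code >> 24) & 0x7f;
    p.plane = (code >> 16) & 0xff;
    p.row = (code >> 8) & 0xff;
    p.column = code & 0xff;
    return p;
}

/* Nonzero if the code lies in the BMP, that is, plane 0 in group 0. */
int ucs_is_bmp(uint32_t code)
{
    struct ucs_position p = ucs_split(code);

    return p.group == 0 && p.plane == 0;
}
```
.fi
With these definitions, 0x10ffff falls in group 0, plane 0x10, which
is why it lies outside the BMP.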
.PP
The representation of each UCS character as a 2-byte word is referred
to as the
.B UCS-2
form (only for BMP characters), whereas
.B UCS-4
is the representation of each character by a 4-byte word.
In addition, there exist two encoding forms:
.B UTF-8
for backward compatibility with ASCII processing software and
.B UTF-16
for the backward-compatible handling of non-BMP characters up to
0x10ffff by UCS-2 software.
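.PP
The relationship between UTF-16 and UCS-2 can be sketched in C as
follows.
Codes in the BMP are stored as one 16-bit unit, exactly as in UCS-2;
codes from 0x10000 to 0x10ffff are stored as a pair of units taken
from the ranges 0xd800 to 0xdbff and 0xdc00 to 0xdfff.
These surrogate ranges come from the UTF-16 definition rather than
from this page, and
.I utf16_encode
is a hypothetical name used only for this example.
.nf
```c
#include <stdint.h>

/*
 * Encode one UCS code value as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2),
 * or 0 if the value is a surrogate code or lies beyond 0x10ffff.
 */
int utf16_encode(uint32_t code, uint16_t out[2])
{
    if (code >= 0xd800 && code <= 0xdfff)
        return 0;                 /* reserved for surrogates */
    if (code < 0x10000) {
        out[0] = (uint16_t)code;  /* BMP: identical to UCS-2 */
        return 1;
    }
    if (code > 0x10ffff)
        return 0;                 /* not representable in UTF-16 */
    code -= 0x10000;              /* 20 significant bits remain */
    out[0] = (uint16_t)(0xd800 | (code >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xdc00 | (code & 0x3ff)); /* low surrogate */
    return 2;
}
```
.fi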
.PP
The UCS characters 0x0000 to 0x007f are identical to those of the
classic
.B US-ASCII
character set and the characters in the range 0x0000 to 0x00ff
are identical to those in
.BR "ISO 8859-1 Latin-1" .
.SS "Combining Characters"
Some code points in
.B UCS
have been assigned to
.IR "combining characters" .
These are similar to the nonspacing accent keys on a typewriter.
A combining character just adds an accent to the previous character.
The most important accented characters have codes of their own in UCS;
however, the combining character mechanism allows us to add accents
and other diacritical marks to any character.
The combining characters
always follow the character which they modify.
For example, the German
character Umlaut-A ("Latin capital letter A with diaeresis") can
either be represented by the precomposed UCS code 0x00c4, or
alternatively as the combination of a normal "Latin capital letter A"
followed by a "combining diaeresis": 0x0041 0x0308.
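.PP
The two representations of Umlaut-A can be written down as wide
strings in C, as in the sketch below.
The helper names
.I is_combining_mark
and
.I count_base_chars
are hypothetical, and the test for combining characters is a
deliberate simplification: it assumes that only the range 0x0300 to
0x036f holds combining marks, whereas UCS defines combining
characters in other ranges as well.
.nf
```c
#include <stddef.h>
#include <wchar.h>

/* The two representations of A with diaeresis from the text above. */
const wchar_t precomposed[] = { 0x00c4, 0 };
const wchar_t decomposed[] = { 0x0041, 0x0308, 0 };

/*
 * Simplifying assumption: only the range 0x0300 to 0x036f is
 * treated as combining; UCS has further combining ranges.
 */
int is_combining_mark(wchar_t wc)
{
    return wc >= 0x0300 && wc <= 0x036f;
}

/* Count the characters in s that are not combining marks. */
size_t count_base_chars(const wchar_t *s)
{
    size_t n = 0;

    for (; *s != 0; s++)
        if (!is_combining_mark(*s))
            n++;
    return n;
}
```
.fi
Both strings contain one base character, but they differ in length
and compare as unequal with
.BR wcscmp (3),
which is one reason why normalization matters when comparing strings.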
.PP
Combining characters are essential, for instance, for encoding the Thai
script, for mathematical typesetting, and for users of the International
Phonetic Alphabet.
.SS "Implementation Levels"
As not all systems are expected to support advanced mechanisms like
combining characters, ISO 10646-1 specifies the following three
.I implementation levels
of UCS:
.TP 0.9i
Level 1
Combining characters and
.B Hangul Jamo
(a variant encoding of the Korean script, where a Hangul syllable
glyph is coded as a triplet or pair of vowel/consonant codes) are not
supported.
.TP
Level 2
In addition to level 1, combining characters are allowed for some
languages where they are essential (e.g., Thai, Lao, Hebrew,
Arabic, Devanagari, Malayalam).
.TP
Level 3
All
.B UCS
characters are supported.
.PP
The
.B Unicode 3.0 Standard
published by the
.B Unicode Consortium
contains exactly the
.B UCS Basic Multilingual Plane
at implementation level 3, as described in ISO 10646-1:2000.
.B Unicode 3.1
added the supplementary planes of ISO 10646-2.
The Unicode standard and
technical reports published by the Unicode Consortium provide much
additional information on the semantics and recommended usages of
various characters.
They provide guidelines and algorithms for
editing, sorting, comparing, normalizing, converting and displaying
Unicode strings.
.SS "Unicode Under Linux"
Under GNU/Linux, the C type
.I wchar_t
is a signed 32-bit integer type.
Its values are always interpreted
by the C library as
.B UCS
code values (in all locales), a convention that is signaled by the GNU
C library to applications by defining the constant
.B __STDC_ISO_10646__
as specified in the ISO C99 standard.
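.PP
An application can test for this convention at compile time.
The following is a minimal sketch;
.I wchar_is_ucs
is a hypothetical name, and the check only reports whether the macro
is defined.
.nf
```c
#include <wchar.h>

/*
 * Returns 1 if the C library promises that wchar_t values are
 * ISO 10646 (UCS) code values in every locale, 0 otherwise.
 */
int wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```
.fi
When the function returns 1, a program may compare
.I wchar_t
values directly against UCS codes such as 0x00c4.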
.PP
UCS/Unicode can be used just like ASCII in input/output streams,
terminal communication, plaintext files, filenames, and environment
variables in the ASCII compatible
.B UTF-8
multibyte encoding.
To signal the use of UTF-8 as the character
encoding to all applications, a suitable
.I locale
has to be selected via environment variables (e.g.,
"LANG=en_GB.UTF-8").
.PP
The
.B nl_langinfo(CODESET)
function returns the name of the selected encoding.
Library functions such as
.BR wctomb (3)
and
.BR mbsrtowcs (3)
can be used to transform the internal
.I wchar_t
characters and strings into the system character encoding and back,
and
.BR wcwidth (3)
tells how many positions (0\(en2) the cursor is advanced by the
output of a character.
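.PP
The calls named above might be combined as in the following sketch.
The wrapper names
.I selected_codeset
and
.I to_wide
are hypothetical; the library calls they use are
.BR setlocale (3),
.BR nl_langinfo (3),
and
.BR mbsrtowcs (3).
The result depends on the locale selected in the environment, so no
particular output is assumed here.
.nf
```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Select the locale from the environment and report its encoding. */
const char *selected_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/*
 * Convert a multibyte string in the current locale encoding to
 * wide characters. Returns the number of wide characters stored,
 * not counting the terminator, or (size_t)-1 on an invalid sequence.
 */
size_t to_wide(const char *mbs, wchar_t *dst, size_t dstlen)
{
    mbstate_t state;

    memset(&state, 0, sizeof state);
    return mbsrtowcs(dst, &mbs, dstlen, &state);
}
```
.fi
In a UTF-8 locale, the two bytes 0xc3 0x84 passed to
.I to_wide
are expected to yield the single wide character 0x00c4.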
.PP
Under Linux, in general only the BMP at implementation level 1 should
be used at the moment.
Up to two combining characters per base
character for certain scripts (in particular Thai) are also supported
by some UTF-8 terminal emulators and ISO 10646 fonts (level 2), but in
general precomposed characters should be preferred where available
(Unicode calls this
.BR "Normalization Form C" ).
.SS "Private Area"
In the
.BR BMP ,
the range 0xe000 to 0xf8ff will never be assigned to any characters by
the standard and is reserved for private usage.
For the Linux
community, this private area has been subdivided further into the
range 0xe000 to 0xefff, which can be used individually by any end-user,
and the Linux zone in the range 0xf000 to 0xf8ff, where extensions are
coordinated among all Linux users.
The registry of the characters
assigned to the Linux zone is currently maintained by H. Peter Anvin
<Peter.Anvin@linux.org>.
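.PP
The subdivision of the private area can be expressed as a small
classification function.
This is only a sketch of the ranges given above; the enumeration
.I private_zone
and the function
.I classify_private
are hypothetical names.
.nf
```c
#include <stdint.h>

/* Hypothetical labels for the zones described above. */
enum private_zone {
    NOT_PRIVATE,   /* outside 0xe000 to 0xf8ff */
    END_USER_ZONE, /* 0xe000 to 0xefff */
    LINUX_ZONE     /* 0xf000 to 0xf8ff */
};

/* Report which part of the BMP private area a code value falls in. */
enum private_zone classify_private(uint32_t code)
{
    if (code >= 0xe000 && code <= 0xefff)
        return END_USER_ZONE;
    if (code >= 0xf000 && code <= 0xf8ff)
        return LINUX_ZONE;
    return NOT_PRIVATE;
}
```
.fi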
.SS Literature
.TP 0.2i
*
Information technology \(em Universal Multiple-Octet Coded Character
Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane.
International Standard ISO/IEC 10646-1, International Organization
for Standardization, Geneva, 2000.

This is the official specification of
.BR UCS .
Available as a PDF file on CD-ROM from http://www.iso.ch/.
.TP
*
The Unicode Standard, Version 3.0.
The Unicode Consortium, Addison-Wesley,
Reading, MA, 2000, ISBN 0-201-61633-5.
.TP
*
S. Harbison, G. Steele. C: A Reference Manual. Fourth edition,
Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.

A good reference book about the C programming language.
The fourth
edition covers the 1994 Amendment 1 to the ISO C90 standard, which
adds a large number of new C library functions for handling wide and
multibyte character encodings, but it does not yet cover ISO C99,
which improved wide and multibyte character support even further.
.TP
*
Unicode Technical Reports.
.RS
http://www.unicode.org/unicode/reports/
.RE
.TP
*
Markus Kuhn: UTF-8 and Unicode FAQ for UNIX/Linux.
.RS
http://www.cl.cam.ac.uk/~mgk25/unicode.html

Provides subscription information for the
.I linux-utf8
mailing list, which is the best place to look for advice on using
Unicode under Linux.
.RE
.TP
*
Bruno Haible: Unicode HOWTO.
.RS
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
.RE
.SH BUGS
When this man page was last revised, the GNU C Library support for
.B UTF-8
locales was mature and XFree86 support was in an advanced state, but
work on making applications (most notably editors) suitable for use in
.B UTF-8
locales was still in progress.
Current general
.B UCS
support under Linux usually provides for CJK double-width characters
and sometimes even simple overstriking combining characters, but
usually does not include support for scripts with right-to-left
writing direction or ligature substitution requirements such as
Hebrew, Arabic, or the Indic scripts.
These scripts are currently only
supported in certain GUI applications (HTML viewers, word processors)
with sophisticated text rendering engines.
.\" .SH AUTHOR
.\" Markus Kuhn <mgk25@cl.cam.ac.uk>
.SH "SEE ALSO"
.BR setlocale (3),
.BR charsets (7),
.BR utf-8 (7)