mirror of https://github.com/mkerrisk/man-pages
charsets.7: Update to reflect past developments
Rewrite the introduction to make Unicode's prominence more obvious.
Reformulate parts of the text to reflect current Unicode world.
Minor clarification for ASCII/ISO sections, some other minor fixes.

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
parent
dfc41d9cfb
commit
a8ed5f7430
225
man7/charsets.7
|
@@ -11,62 +11,54 @@
|
|||
.\" This is combined from many sources, including notes by aeb and
|
||||
.\" research by esr. Portions derive from a writeup by Roman Czyborra.
|
||||
.\"
|
||||
.\" Last changed by David Starner <dstarner98@aasaa.ofe.org>.
|
||||
.\" Changes also by David Starner <dstarner98@aasaa.ofe.org>.
|
||||
.\"
|
||||
.\" FIXME This page was written long ago, and various pieces are probably
|
||||
.\" no longer quite current. A reworking by someone knowledgeable
|
||||
.\" on charsets is needed. Among other things, the page needs to
|
||||
.\" give more prominence to Unicode. mtk, May 2014
|
||||
.\"
|
||||
.TH CHARSETS 7 2014-05-28 "Linux" "Linux Programmer's Manual"
|
||||
.TH CHARSETS 7 2014-06-05 "Linux" "Linux Programmer's Manual"
|
||||
.SH NAME
|
||||
charsets \- programmer's view of character sets and internationalization
|
||||
charsets - character set standards and internationalization
|
||||
.SH DESCRIPTION
|
||||
Linux is an international operating system.
|
||||
Various of its utilities
|
||||
and device drivers (including the console driver) support multilingual
|
||||
character sets including Latin-alphabet letters with diacritical
|
||||
marks, accents, ligatures, and entire non-Latin alphabets including
|
||||
Greek, Cyrillic, Arabic, and Hebrew.
|
||||
This manual page gives an overview of different character set standards
|
||||
and how they were used on Linux before Unicode became ubiquitous.
|
||||
Some of this information is still helpful for people working with legacy
|
||||
systems and documents.
|
||||
.LP
|
||||
This manual page presents a programmer's-eye view of different
|
||||
character-set standards and how they fit together on Linux.
|
||||
Standards
|
||||
discussed include ASCII, ISO 8859, KOI8-R, Unicode, ISO 2022 and
|
||||
ISO 4873.
|
||||
The primary emphasis is on character sets actually used as
|
||||
locale character sets, not the myriad others that can be found in data
|
||||
Standards discussed include
|
||||
ASCII, GB 2312, ISO 8859, JIS, KOI8-R, KS, and Unicode.
|
||||
.LP
|
||||
The primary emphasis is on character sets that were actually used by
|
||||
locale character sets, not the myriad others that could be found in data
|
||||
from other systems.
|
||||
.SS ASCII
|
||||
ASCII (American Standard Code For Information Interchange) is the original
|
||||
7-bit character set, originally designed for American English.
|
||||
It is currently described by the ECMA-6 standard.
|
||||
Also known as US-ASCII.
|
||||
It is currently described by the ISO 646:1991 IRV
|
||||
(International Reference Version) standard.
|
||||
.LP
|
||||
Various ASCII variants replacing the dollar sign with other currency
|
||||
symbols and replacing punctuation with non-English alphabetic characters
|
||||
to cover German, French, Spanish, and others in 7 bits exist.
|
||||
All are
|
||||
deprecated; glibc doesn't support locales whose character sets aren't
|
||||
true supersets of ASCII.
|
||||
(These sets are also known as ISO-646, a close
|
||||
relative of ASCII that permitted replacing these characters.)
|
||||
symbols and replacing punctuation with non-English alphabetic
|
||||
characters to cover German, French, Spanish, and others in 7 bits
|
||||
emerged.
|
||||
All are deprecated;
|
||||
glibc does not support locales whose character sets are not true
|
||||
supersets of ASCII.
|
||||
.LP
|
||||
As Linux was written for hardware designed in the US, it natively
|
||||
supports ASCII.
|
||||
As Unicode, when using UTF-8, is ASCII-compatible, plain ASCII text
|
||||
still renders properly on modern systems that use UTF-8.
|
||||
.SS ISO 8859
|
||||
ISO 8859 is a series of 15 8-bit character sets all of which have US
|
||||
ASCII in their low (7-bit) half, invisible control characters in
|
||||
positions 128 to 159, and 96 fixed-width graphics in positions 160-255.
|
||||
ISO 8859 is a series of 15 8-bit character sets all of which have ASCII
|
||||
in their low (7-bit) half, invisible control characters in positions
|
||||
128 to 159, and 96 fixed-width graphics in positions 160-255.
|
||||
.LP
|
||||
Of these, the most important is ISO 8859-1 (Latin-1).
|
||||
It is natively
|
||||
supported in the Linux console driver, fairly well supported in X11R6,
|
||||
and is the base character set of HTML.
|
||||
Of these, the most important is ISO 8859-1
|
||||
("Latin Alphabet No. 1" / Latin-1).
|
||||
It was widely adopted and supported by different systems,
|
||||
and is gradually being replaced with Unicode.
|
||||
The ISO 8859-1 characters are also the first 256 characters of Unicode.
|
||||
.LP
|
||||
Console support for the other 8859 character sets is available under
|
||||
Linux through user-mode utilities (such as
|
||||
.BR setfont (8))
|
||||
.\" // some distributions still have the deprecated consolechars
|
||||
that modify keyboard bindings and the EGA graphics
|
||||
table and employ the "user mapping" font table in the console
|
||||
driver.
|
||||
|
@@ -74,97 +66,85 @@ driver.
|
|||
Here are brief descriptions of each set:
|
||||
.TP
|
||||
8859-1 (Latin-1)
|
||||
Latin-1 covers most Western European languages such as Albanian, Catalan,
|
||||
Danish, Dutch, English, Faroese, Finnish, French, German, Galician,
|
||||
Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish, and
|
||||
Swedish.
|
||||
The lack of the ligatures Dutch ij, French oe and old-style
|
||||
,,German`` quotation marks is considered tolerable.
|
||||
Latin-1 covers many West European languages such as Albanian, Basque,
|
||||
Danish, English, Faroese, Galician, German, Icelandic, Irish, Italian,
|
||||
Norwegian, Portuguese, Spanish, and Swedish.
|
||||
The lack of the ligatures Dutch IJ/ij, French œ, and old-style „German“
|
||||
quotation marks was considered tolerable.
|
||||
.TP
|
||||
8859-2 (Latin-2)
|
||||
Latin-2 supports most Latin-written Slavic and Central European
|
||||
languages: Croatian, Czech, German, Hungarian, Polish, Romanian,
|
||||
Latin-2 supports many Latin-written Central and East European
|
||||
languages such as Bosnian, Croatian, Czech, German, Hungarian, Polish,
|
||||
Slovak, and Slovene.
|
||||
Replacing Romanian ș/ț with ş/ţ was considered tolerable.
|
||||
.TP
|
||||
8859-3 (Latin-3)
|
||||
Latin-3 is popular with authors of Esperanto, Galician, and Maltese.
|
||||
(Turkish is now written with 8859-9 instead.)
|
||||
Latin-3 was designed to cover Esperanto, Maltese, and Turkish, but
|
||||
8859-9 later superseded it for Turkish.
|
||||
.TP
|
||||
8859-4 (Latin-4)
|
||||
Latin-4 introduced letters for Estonian, Latvian, and Lithuanian.
|
||||
It is essentially obsolete; see 8859-10 (Latin-6) and 8859-13 (Latin-7).
|
||||
Latin-4 introduced letters for North European languages such as
|
||||
Estonian, Latvian, and Lithuanian, but was superseded by 8859-10 and
|
||||
8859-13.
|
||||
.TP
|
||||
8859-5
|
||||
Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
|
||||
Russian, Serbian, and Ukrainian.
|
||||
Ukrainians read the letter "ghe"
|
||||
with downstroke as "heh" and would need a ghe with upstroke to write a
|
||||
correct ghe.
|
||||
See the discussion of KOI8-R below.
|
||||
Russian, Serbian, and (almost completely) Ukrainian.
|
||||
It was never widely used; see the discussion of KOI8-R/KOI8-U below.
|
||||
.TP
|
||||
8859-6
|
||||
Supports Arabic.
|
||||
Was created for Arabic.
|
||||
The 8859-6 glyph table is a fixed font of separate
|
||||
letter forms, but a proper display engine should combine these
|
||||
using the proper initial, medial, and final forms.
|
||||
.TP
|
||||
8859-7
|
||||
Supports Modern Greek.
|
||||
Was created for modern Greek in 1987, updated in 2003.
|
||||
.TP
|
||||
8859-8
|
||||
Supports modern Hebrew without niqud (punctuation signs).
|
||||
Niqud and full-fledged Biblical Hebrew are outside the scope of this
|
||||
character set; under Linux, UTF-8 is the preferred encoding for
|
||||
these.
|
||||
Niqud and full-fledged Biblical Hebrew were outside the scope of this
|
||||
character set.
|
||||
.TP
|
||||
8859-9 (Latin-5)
|
||||
This is a variant of Latin-1 that replaces Icelandic letters with
|
||||
Turkish ones.
|
||||
.TP
|
||||
8859-10 (Latin-6)
|
||||
Latin 6 adds the last Inuit (Greenlandic) and Sami (Lappish) letters
|
||||
that were missing in Latin 4 to cover the entire Nordic area.
|
||||
RFC 1345 listed a preliminary and different "latin6".
|
||||
Skolt Sami still
|
||||
needs a few more accents than these.
|
||||
Latin-6 added Inuit (Greenlandic) and Sami (Lappish) letters that were
|
||||
missing in Latin-4 to cover the entire Nordic area.
|
||||
.TP
|
||||
8859-11
|
||||
This exists only as a rejected draft standard.
|
||||
The draft standard
|
||||
was identical to TIS-620, which is used under Linux for Thai.
|
||||
Supports the Thai alphabet and is nearly identical to the TIS-620
|
||||
standard.
|
||||
.TP
|
||||
8859-12
|
||||
This set does not exist.
|
||||
While Vietnamese has been suggested for this
|
||||
space, it does not fit within the 96 (noncombining) characters ISO
|
||||
8859 offers.
|
||||
UTF-8 is the preferred character set for Vietnamese use
|
||||
under Linux.
|
||||
.TP
|
||||
8859-13 (Latin-7)
|
||||
Supports the Baltic Rim languages; in particular, it includes Latvian
|
||||
characters not found in Latin-4.
|
||||
.TP
|
||||
8859-14 (Latin-8)
|
||||
This is the Celtic character set, covering Gaelic and Welsh.
|
||||
This charset also contains the dotted characters needed for Old Irish.
|
||||
This is the Celtic character set, covering Old Irish, Manx, Gaelic,
|
||||
Welsh, Cornish, and Breton.
|
||||
.TP
|
||||
8859-15 (Latin-9)
|
||||
This adds the Euro sign and French and Finnish letters that were missing in
|
||||
Latin-1.
|
||||
Latin-9 is similar to widely used Latin-1 but replaces some less
|
||||
common symbols with the Euro sign and French and Finnish letters that
|
||||
were missing in Latin-1.
|
||||
.TP
|
||||
8859-16 (Latin-10)
|
||||
This set covers many of the languages covered by 8859-2, and supports
|
||||
Romanian more completely than that set does.
|
||||
.SS KOI8-R
|
||||
KOI8-R is a non-ISO character set popular in Russia.
|
||||
The lower half
|
||||
is US ASCII; the upper is a Cyrillic character set somewhat better
|
||||
designed than ISO 8859-5.
|
||||
KOI8-U is a common character set, based off
|
||||
KOI8-R, that has better support for Ukrainian.
|
||||
Neither of these sets
|
||||
are ISO-2022 compatible, unlike the ISO-8859 series.
|
||||
This set covers many Southeast European languages, and most
|
||||
importantly supports Romanian more completely than Latin-2.
|
||||
.SS KOI8-R / KOI8-U
|
||||
KOI8-R is a non-ISO character set popular in Russia before Unicode.
|
||||
The lower half is ASCII;
|
||||
the upper is a Cyrillic character set somewhat better designed than
|
||||
ISO 8859-5.
|
||||
KOI8-U, based off KOI8-R, has better support for Ukrainian.
|
||||
Neither of these sets is ISO-2022 compatible,
|
||||
unlike the ISO-8859 series.
|
||||
.LP
|
||||
Console support for KOI8-R is available under Linux through user-mode
|
||||
utilities that modify keyboard bindings and the EGA graphics table,
|
||||
|
@@ -184,7 +164,7 @@ JIS X 0208 is used
|
|||
as a component to construct encodings such as EUC-JP, Shift_JIS,
|
||||
and ISO-2022-JP.
|
||||
EUC-JP is the most important encoding for Linux
|
||||
and includes US ASCII and JIS X 0208.
|
||||
and includes ASCII and JIS X 0208.
|
||||
In EUC-JP, JIS X 0208
|
||||
characters are expressed in two bytes, each of which is the
|
||||
JIS X 0208 code plus 0x80.
|
||||
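The EUC-JP rule described in this hunk (each byte of a JIS X 0208 code plus 0x80) can be sketched as follows. This is only an illustration, not part of the commit: the helper name `jis_to_euc_jp` is hypothetical, and the sample code 0x3021 (the kanji U+4E9C) is chosen here as an example.

```python
def jis_to_euc_jp(jis_code: bytes) -> bytes:
    """Map a two-byte JIS X 0208 code to EUC-JP by adding 0x80 to each byte.

    Hypothetical helper for illustration; both input bytes must lie in the
    94x94 matrix range 0x21-0x7E.
    """
    if len(jis_code) != 2 or not all(0x21 <= b <= 0x7E for b in jis_code):
        raise ValueError("expected a two-byte JIS X 0208 code in 0x21-0x7E")
    return bytes(b + 0x80 for b in jis_code)


# JIS X 0208 code 0x3021 (row 16, cell 1) becomes the EUC-JP bytes 0xB0 0xA1.
euc = jis_to_euc_jp(b"\x30\x21")
print(euc.hex())
# Python's own "euc_jp" codec decodes those bytes to the same character.
print(euc.decode("euc_jp") == "\u4e9c")
```

Because both resulting bytes have the high bit set, they can never be confused with ASCII bytes in the same stream.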
|
@@ -195,7 +175,7 @@ JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
|
|||
KS X 1001 is used like JIS X 0208, as a component
|
||||
to construct encodings such as EUC-KR, Johab, and ISO-2022-KR.
|
||||
EUC-KR is the most important encoding for Linux and includes
|
||||
US ASCII and KS X 1001.
|
||||
ASCII and KS X 1001.
|
||||
KS C 5601 is an older name for KS X 1001.
|
||||
.SS GB 2312
|
||||
GB 2312 is a mainland Chinese national standard character set used
|
||||
|
@@ -203,37 +183,31 @@ to express simplified Chinese.
|
|||
Just like JIS X 0208, characters are
|
||||
mapped into a 94x94 two-byte matrix used to construct EUC-CN.
|
||||
EUC-CN
|
||||
is the most important encoding for Linux and includes US ASCII and
|
||||
is the most important encoding for Linux and includes ASCII and
|
||||
GB 2312.
|
||||
Note that EUC-CN is often called GB, GB 2312, or CN-GB.
|
||||
.SS Big5
|
||||
Big5 is a popular character set in Taiwan to express traditional
|
||||
Big5 was a popular character set in Taiwan to express traditional
|
||||
Chinese.
|
||||
(Big5 is both a character set and an encoding.)
|
||||
It is a superset of US ASCII.
|
||||
It is a superset of ASCII.
|
||||
Non-ASCII characters are expressed in two bytes.
|
||||
Bytes 0xa1-0xfe are used as leading bytes for two-byte characters.
|
||||
Big5 and its extension is widely used in Taiwan and Hong Kong.
|
||||
It is not ISO 2022-compliant.
|
||||
.SS TIS 620
|
||||
TIS 620 is a Thai national standard character set and a superset
|
||||
of US ASCII.
|
||||
Like ISO 8859 series, Thai characters are mapped into
|
||||
Big5 and its extension were widely used in Taiwan and Hong Kong.
|
||||
It is not ISO 2022 compliant.
|
||||
.SS TIS-620
|
||||
TIS-620 is a Thai national standard character set and a superset
|
||||
of ASCII.
|
||||
Like in the ISO 8859 series, Thai characters are mapped into
|
||||
0xa1-0xfe.
|
||||
TIS 620 is the only commonly used character set under
|
||||
Linux besides UTF-8 to have combining characters.
|
||||
.SS UNICODE
|
||||
Unicode (ISO 10646) is a standard which aims to unambiguously represent every
|
||||
character in every human language.
|
||||
.SS Unicode
|
||||
Unicode (ISO 10646) is a standard which aims to unambiguously represent
|
||||
every character in every human language.
|
||||
Unicode's structure permits 20.1 bits to encode every character.
|
||||
Since most computers don't include 20.1-bit
|
||||
integers, Unicode is usually encoded as 32-bit integers internally and
|
||||
either a series of 16-bit integers (UTF-16) (needing two 16-bit integers
|
||||
only when encoding certain rare characters) or a series of 8-bit bytes
|
||||
(UTF-8).
|
||||
Information on Unicode is available at
|
||||
.UR http://www.unicode.org
|
||||
.UE .
|
||||
Since most computers don't include 20.1-bit integers, Unicode is
|
||||
usually encoded as 32-bit integers internally and either a series of
|
||||
16-bit integers (UTF-16) (needing two 16-bit integers only when
|
||||
encoding certain rare characters) or a series of 8-bit bytes (UTF-8).
|
||||
.LP
|
||||
Linux represents Unicode using the 8-bit Unicode Transformation Format
|
||||
(UTF-8).
|
||||
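The "20.1 bits" figure quoted in this hunk can be checked with a short calculation. The sketch below assumes the Unicode code space runs from U+0000 to U+10FFFF, a range the page itself does not state; the constant name `CODE_POINTS` is chosen here for illustration.

```python
import math

# Assumption: the Unicode code space is U+0000..U+10FFFF inclusive.
CODE_POINTS = 0x10FFFF + 1

# Number of bits needed to distinguish that many values.
bits = math.log2(CODE_POINTS)
print(f"{CODE_POINTS} code points need about {bits:.1f} bits")

# In practice the value is rounded up to a whole number of bits (21),
# which is why a 32-bit integer is a convenient internal representation.
print(math.ceil(bits))
```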
|
@@ -258,19 +232,19 @@ into xxxxyyyy yyzzzzzz.
|
|||
(When UTF-8 is used to code the 31-bit ISO 10646
|
||||
then this progression continues up to 6-byte codes.)
|
||||
.LP
|
||||
For most people who use ISO-8859 character sets, this means that the
|
||||
For most texts in ISO-8859 character sets, this means that the
|
||||
characters outside of ASCII are now coded with two bytes.
|
||||
This tends
|
||||
to expand ordinary text files by only one or two percent.
|
||||
For Russian
|
||||
or Greek users, this expands ordinary text files by 100%, since text in
|
||||
or Greek texts, this expands ordinary text files by 100%, since text in
|
||||
those languages is mostly outside of ASCII.
|
||||
For Japanese users this means
|
||||
that the 16-bit codes now in common use will take three bytes.
|
||||
While there
|
||||
are algorithmic conversions from some character sets (especially ISO-8859-1) to
|
||||
Unicode, general conversion requires carrying around conversion tables,
|
||||
which can be quite large for 16-bit codes.
|
||||
While there are algorithmic conversions from some character sets
|
||||
(especially ISO 8859-1) to Unicode, general conversion requires
|
||||
carrying around conversion tables, which can be quite large for 16-bit
|
||||
codes.
|
||||
.LP
|
||||
Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other
|
||||
byte is the head of a code.
|
||||
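The size effects and the self-synchronizing property described in this hunk can be illustrated with a small sketch. The helper names `is_tail` and `resync` are hypothetical, and the sample characters are chosen here as examples; only Python's standard codecs are used.

```python
def is_tail(byte: int) -> bool:
    """True for a UTF-8 continuation byte, i.e. one of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first head byte at or after pos.

    Returns len(data) if only tail bytes remain.
    """
    while pos < len(data) and is_tail(data[pos]):
        pos += 1
    return pos


# One byte in ISO 8859-1 or KOI8-R becomes two bytes in UTF-8;
# a two-byte EUC-JP character becomes three bytes.
print(len("\u00e9".encode("latin-1")), len("\u00e9".encode("utf-8")))
print(len("\u044f".encode("koi8_r")), len("\u044f".encode("utf-8")))
print(len("\u65e5".encode("euc_jp")), len("\u65e5".encode("utf-8")))

# Starting in the middle of a character, skip tail bytes to find the next head.
encoded = "\u00e9\u044f\u65e5".encode("utf-8")
start = resync(encoded, 1)
print(encoded[start:].decode("utf-8") == "\u044f\u65e5")
```

A reader that lands at an arbitrary byte offset therefore loses at most the one character it landed inside.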
|
@@ -288,22 +262,18 @@ Rendering of Unicode data streams is typically handled through
|
|||
"subfont" tables which map a subset of Unicode to glyphs.
|
||||
Internally
|
||||
the kernel uses Unicode to describe the subfont loaded in video RAM.
|
||||
This means that in UTF-8 mode one can use a character set with 512
|
||||
different symbols.
|
||||
This means that on the Linux console in UTF-8 mode, one can use a character
|
||||
set with 512 different symbols.
|
||||
This is not enough for Japanese, Chinese and
|
||||
Korean, but it is enough for most other purposes.
|
||||
.LP
|
||||
At the current time, the console driver does not handle combining
|
||||
characters.
|
||||
So Thai, Sioux and any other script needing combining
|
||||
characters can't be handled on the console.
|
||||
.SS ISO 2022 and ISO 4873
|
||||
The ISO 2022 and 4873 standards describe a font-control model
|
||||
based on VT100 practice.
|
||||
This model is (partially) supported
|
||||
by the Linux kernel and by
|
||||
.BR xterm (1).
|
||||
It is popular in Japan and Korea.
|
||||
It used to be popular in Japan and Korea.
|
||||
.LP
|
||||
There are 4 graphic character sets, called G0, G1, G2, and G3,
|
||||
and one of them is the current character set for codes with
|
||||
|
@@ -357,9 +327,8 @@ In particular, \fB^N\fP and \fB^O\fP are not used anymore, ESC ( xx
|
|||
can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
|
||||
are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively.
|
||||
.SH SEE ALSO
|
||||
.BR iconv (1),
|
||||
.BR console (4),
|
||||
.BR console_codes (4),
|
||||
.BR console_ioctl (4),
|
||||
.BR ascii (7),
|
||||
.BR iso_8859-1 (7),
|
||||
.BR unicode (7),
|
||||