mirror of https://github.com/mkerrisk/man-pages
336 lines
13 KiB
Groff
'\" t -*- coding: UTF-8 -*-
|
|
.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com>
|
|
.\" and Copyright (c) Andries Brouwer <aeb@cwi.nl>
|
|
.\"
|
|
.\" %%%LICENSE_START(GPLv2+_DOC_ONEPARA)
|
|
.\" This is free documentation; you can redistribute it and/or
|
|
.\" modify it under the terms of the GNU General Public License as
|
|
.\" published by the Free Software Foundation; either version 2 of
|
|
.\" the License, or (at your option) any later version.
|
|
.\" %%%LICENSE_END
|
|
.\"
|
|
.\" This is combined from many sources, including notes by aeb and
|
|
.\" research by esr. Portions derive from a writeup by Roman Czyborra.
|
|
.\"
|
|
.\" Changes also by David Starner <dstarner98@aasaa.ofe.org>.
|
|
.\"
|
|
.TH CHARSETS 7 2016-07-17 "Linux" "Linux Programmer's Manual"
|
|
.SH NAME
|
|
charsets \- character set standards and internationalization
|
|
.SH DESCRIPTION
|
|
This manual page gives an overview of different character set standards
|
|
and how they were used on Linux before Unicode became ubiquitous.
|
|
Some of this information is still helpful for people working with legacy
|
|
systems and documents.
|
|
.LP
|
|
Standards discussed include
|
|
ASCII, GB 2312, ISO 8859, JIS, KOI8-R, KS, and Unicode.
|
|
.LP
|
|
The primary emphasis is on character sets that were actually used as
|
|
locale character sets, not the myriad others that could be found in data
|
|
from other systems.
|
|
.SS ASCII
|
|
ASCII (American Standard Code for Information Interchange) is the original
|
|
7-bit character set, originally designed for American English.
|
|
Also known as US-ASCII.
|
|
It is currently described by the ISO 646:1991 IRV
|
|
(International Reference Version) standard.
|
|
.LP
|
|
Various ASCII variants emerged that replaced the dollar sign with other
currency symbols and replaced punctuation with non-English alphabetic
characters to cover German, French, Spanish, and other languages in 7 bits.
|
|
All are deprecated;
|
|
glibc does not support locales whose character sets are not true
|
|
supersets of ASCII.
|
|
.LP
|
|
Because UTF-8 is ASCII-compatible, plain ASCII text
still renders properly on modern systems that use UTF-8.
|
|
.SS ISO 8859
|
|
ISO 8859 is a series of 15 8-bit character sets, all of which have ASCII
|
|
in their low (7-bit) half, invisible control characters in positions
|
|
128 to 159, and 96 fixed-width graphics in positions 160-255.
|
|
.LP
|
|
Of these, the most important is ISO 8859-1
|
|
("Latin Alphabet No. 1" / Latin-1).
|
|
It was widely adopted and supported by different systems,
|
|
and is gradually being replaced with Unicode.
|
|
The ISO 8859-1 characters are also the first 256 characters of Unicode.
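.LP
The relationship between ISO 8859-1 and Unicode can be checked with a
short sketch.
It assumes Python 3 and its built-in "latin_1" codec; the variable names
are illustrative only.
.nf
```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decoding as ISO 8859-1 should map byte N to Unicode code point N.
text = all_bytes.decode("latin_1")
code_points = [ord(ch) for ch in text]

print(code_points == list(range(256)))
```
.fi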
|
|
.LP
|
|
Console support for the other 8859 character sets is available under
|
|
Linux through user-mode utilities (such as
|
|
.BR setfont (8))
|
|
that modify keyboard bindings and the EGA graphics
|
|
table and employ the "user mapping" font table in the console
|
|
driver.
|
|
.LP
|
|
Here are brief descriptions of each set:
|
|
.TP
|
|
8859-1 (Latin-1)
|
|
Latin-1 covers many West European languages such as Albanian, Basque,
|
|
Danish, English, Faroese, Galician, Icelandic, Irish, Italian,
|
|
Norwegian, Portuguese, Spanish, and Swedish.
|
|
The lack of the ligatures Dutch IJ/ij, French œ, and old-style „German“
|
|
quotation marks was considered tolerable.
|
|
.TP
|
|
8859-2 (Latin-2)
|
|
Latin-2 supports many Latin-written Central and East European
|
|
languages such as Bosnian, Croatian, Czech, German, Hungarian, Polish,
|
|
Slovak, and Slovene.
|
|
Replacing Romanian ș/ț with ş/ţ was considered tolerable.
|
|
.TP
|
|
8859-3 (Latin-3)
|
|
Latin-3 was designed to cover Esperanto, Maltese, and Turkish, but
|
|
8859-9 later superseded it for Turkish.
|
|
.TP
|
|
8859-4 (Latin-4)
|
|
Latin-4 introduced letters for North European languages such as
|
|
Estonian, Latvian, and Lithuanian, but was superseded by 8859-10 and
|
|
8859-13.
|
|
.TP
|
|
8859-5
|
|
Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
|
|
Russian, Serbian, and (almost completely) Ukrainian.
|
|
It was never widely used; see the discussion of KOI8-R/KOI8-U below.
|
|
.TP
|
|
8859-6
|
|
Was created for Arabic.
|
|
The 8859-6 glyph table is a fixed font of separate
|
|
letter forms, but a proper display engine should combine these
|
|
using the proper initial, medial, and final forms.
|
|
.TP
|
|
8859-7
|
|
Was created for Modern Greek in 1987, updated in 2003.
|
|
.TP
|
|
8859-8
|
|
Supports Modern Hebrew without niqud (vowel points).
|
|
Niqud and full-fledged Biblical Hebrew were outside the scope of this
|
|
character set.
|
|
.TP
|
|
8859-9 (Latin-5)
|
|
This is a variant of Latin-1 that replaces Icelandic letters with
|
|
Turkish ones.
|
|
.TP
|
|
8859-10 (Latin-6)
|
|
Latin-6 added the Inuit (Greenlandic) and Sami (Lappish) letters that were
|
|
missing in Latin-4 to cover the entire Nordic area.
|
|
.TP
|
|
8859-11
|
|
Supports the Thai alphabet and is nearly identical to the TIS-620
|
|
standard.
|
|
.TP
|
|
8859-12
|
|
This set does not exist.
|
|
.TP
|
|
8859-13 (Latin-7)
|
|
Supports the Baltic Rim languages; in particular, it includes Latvian
|
|
characters not found in Latin-4.
|
|
.TP
|
|
8859-14 (Latin-8)
|
|
This is the Celtic character set, covering Old Irish, Manx, Gaelic,
|
|
Welsh, Cornish, and Breton.
|
|
.TP
|
|
8859-15 (Latin-9)
|
|
Latin-9 is similar to the widely used Latin-1 but replaces some less
|
|
common symbols with the Euro sign and French and Finnish letters that
|
|
were missing in Latin-1.
|
|
.TP
|
|
8859-16 (Latin-10)
|
|
This set covers many Southeast European languages, and most
|
|
importantly supports Romanian more completely than Latin-2.
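.LP
To see how two members of the series differ, one can compare them byte by
byte.
The following is a minimal sketch, assuming Python 3 and its bundled
"latin_1" and "iso8859_15" codecs; the helper name differing_positions
is hypothetical.
.nf
```python
def differing_positions(enc_a, enc_b):
    """Return the byte values that decode differently in the two encodings."""
    result = []
    for value in range(256):
        byte = bytes([value])
        if byte.decode(enc_a) != byte.decode(enc_b):
            result.append(value)
    return result

# Compare Latin-1 with Latin-9 and show each replaced symbol.
diffs = differing_positions("latin_1", "iso8859_15")
for value in diffs:
    byte = bytes([value])
    print(hex(value), byte.decode("latin_1"), byte.decode("iso8859_15"))
```
.fi
.LP
The same helper can be pointed at any other pair of 8-bit codecs that
define all 256 positions.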
|
|
.SS KOI8-R / KOI8-U
|
|
KOI8-R is a non-ISO character set popular in Russia before Unicode.
|
|
The lower half is ASCII;
|
|
the upper is a Cyrillic character set somewhat better designed than
|
|
ISO 8859-5.
|
|
KOI8-U, based on KOI8-R, has better support for Ukrainian.
|
|
Neither of these sets is ISO 2022 compatible,
|
|
unlike the ISO 8859 series.
|
|
.LP
|
|
Console support for KOI8-R is available under Linux through user-mode
|
|
utilities that modify keyboard bindings and the EGA graphics table,
|
|
and employ the "user mapping" font table in the console driver.
|
|
.SS GB 2312
|
|
GB 2312 is a mainland Chinese national standard character set used
|
|
to express simplified Chinese.
|
|
Just like JIS X 0208, characters are
|
|
mapped into a 94x94 two-byte matrix used to construct EUC-CN.
|
|
EUC-CN
|
|
is the most important encoding for Linux and includes ASCII and
|
|
GB 2312.
|
|
Note that EUC-CN is often called GB, GB 2312, or CN-GB.
|
|
.SS Big5
|
|
Big5 was a popular character set in Taiwan to express traditional
|
|
Chinese.
|
|
(Big5 is both a character set and an encoding.)
|
|
It is a superset of ASCII.
|
|
Non-ASCII characters are expressed in two bytes.
|
|
Bytes 0xa1-0xfe are used as leading bytes for two-byte characters.
|
|
Big5 and its extension were widely used in Taiwan and Hong Kong.
|
|
It is not ISO 2022 compliant.
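.LP
The leading-byte rule is enough to split a Big5 byte string into
characters.
The sketch below assumes Python 3 and its "big5" codec; split_big5 is a
hypothetical helper, and it does not validate trailing bytes.
.nf
```python
def split_big5(data):
    """Split a Big5 byte string into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        # A byte in 0xa1-0xfe starts a two-byte character.
        width = 2 if 0xA1 <= data[i] <= 0xFE else 1
        chars.append(data[i:i + width])
        i += width
    return chars

# One ASCII letter followed by two traditional Chinese characters.
sample = ("A" + chr(0x4E2D) + chr(0x6587)).encode("big5")
print([c.hex() for c in split_big5(sample)])
```
.fi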
|
|
.\" Thanks to Tomohiro KUBOTA for the following sections about
|
|
.\" national standards.
|
|
.SS JIS X 0208
|
|
JIS X 0208 is a Japanese national standard character set.
|
|
Although there are other Japanese national standard character sets (such as
JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one.
Characters are mapped into a 94x94 two-byte matrix,
in which each byte is in the range 0x21-0x7e.
|
|
Note that JIS X 0208 is a character set, not an encoding.
|
|
This means that JIS X 0208
|
|
itself is not used for expressing text data.
|
|
JIS X 0208 is used
|
|
as a component to construct encodings such as EUC-JP, Shift_JIS,
|
|
and ISO-2022-JP.
|
|
EUC-JP is the most important encoding for Linux
|
|
and includes ASCII and JIS X 0208.
|
|
In EUC-JP, JIS X 0208
|
|
characters are expressed in two bytes, each of which is the
|
|
JIS X 0208 code plus 0x80.
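.LP
The "plus 0x80" rule can be written out directly.
This is a sketch only, assuming Python 3 and its "euc_jp" codec; the
function name jis_to_euc is hypothetical, and 0x24 0x22 is taken here to
be the JIS X 0208 code of HIRAGANA LETTER A.
.nf
```python
def jis_to_euc(jis_code):
    """Convert a two-byte JIS X 0208 code to its EUC-JP byte pair."""
    for b in jis_code:
        if not 0x21 <= b <= 0x7E:
            raise ValueError("byte outside the 94x94 matrix")
    # Each EUC-JP byte is the JIS X 0208 byte plus 0x80.
    return bytes(b + 0x80 for b in jis_code)

euc = jis_to_euc(bytes([0x24, 0x22]))
print(euc.hex(), euc.decode("euc_jp"))
```
.fi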
|
|
.SS KS X 1001
|
|
KS X 1001 is a Korean national standard character set.
|
|
Just as in
|
|
JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
|
|
KS X 1001 is used like JIS X 0208, as a component
|
|
to construct encodings such as EUC-KR, Johab, and ISO-2022-KR.
|
|
EUC-KR is the most important encoding for Linux and includes
|
|
ASCII and KS X 1001.
|
|
KS C 5601 is an older name for KS X 1001.
|
|
.SS ISO 2022 and ISO 4873
|
|
The ISO 2022 and 4873 standards describe a font-control model
|
|
based on VT100 practice.
|
|
This model is (partially) supported
|
|
by the Linux kernel and by
|
|
.BR xterm (1).
|
|
Several ISO 2022-based character encodings have been defined,
|
|
especially for Japanese.
|
|
.LP
|
|
There are 4 graphic character sets, called G0, G1, G2, and G3,
|
|
and one of them is the current character set for codes with
|
|
high bit zero (initially G0), and one of them is the current
|
|
character set for codes with high bit one (initially G1).
|
|
Each graphic character set has 94 or 96 characters, and is
|
|
essentially a 7-bit character set.
|
|
A 96-character set uses codes 040-0177 or 0240-0377;
a 94-character set uses codes 041-0176 or 0241-0376.
|
|
G0 always has size 94 and uses codes 041-0176.
|
|
.LP
|
|
Switching between character sets is done using the shift functions
|
|
\fB^N\fP (SO or LS1), \fB^O\fP (SI or LS0), ESC n (LS2), ESC o (LS3),
|
|
ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R), ESC | (LS3R).
|
|
The function LS\fIn\fP makes character set G\fIn\fP the current one
|
|
for codes with high bit zero.
|
|
The function LS\fIn\fPR makes character set G\fIn\fP the current one
|
|
for codes with high bit one.
|
|
The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3)
|
|
the current one for the next character only (regardless of the value
|
|
of its high order bit).
|
|
.LP
|
|
A 94-character set is designated as the G\fIn\fP character set
|
|
by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),
|
|
ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol
|
|
or a pair of symbols found in the ISO 2375 International
|
|
Register of Coded Character Sets.
|
|
For example, ESC ( @ selects the ISO 646 character set as G0,
|
|
ESC ( A selects the UK standard character set (with pound
|
|
instead of number sign), ESC ( B selects ASCII (with dollar
|
|
instead of currency sign), ESC ( M selects a character set
|
|
for African languages, ESC ( ! A selects the Cuban character
|
|
set, and so on.
|
|
.LP
|
|
A 96-character set is designated as the G\fIn\fP character set
|
|
by an escape sequence ESC \- xx (for G1), ESC . xx (for G2)
|
|
or ESC / xx (for G3).
|
|
For example, ESC \- G selects the Hebrew alphabet as G1.
|
|
.LP
|
|
A multibyte character set is designated as the G\fIn\fP character set
|
|
by an escape sequence ESC $ xx or ESC $ ( xx (for G0),
|
|
ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3).
|
|
For example, ESC $ ( C selects the Korean character set for G0.
|
|
The Japanese character set selected by ESC $ B has a more
|
|
recent version selected by ESC & @ ESC $ B.
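.LP
The designation sequences described above follow a regular pattern, which
a small parser can make explicit.
The following sketch assumes Python 3; the table INTERMEDIATES and the
function parse_designation are hypothetical names, and only the forms
listed in this section are handled.
.nf
```python
ESC = chr(0x1B)

# Intermediate character -> (target set, set size), as listed above.
INTERMEDIATES = {
    "(": ("G0", 94), ")": ("G1", 94), "*": ("G2", 94), "+": ("G3", 94),
    "-": ("G1", 96), ".": ("G2", 96), "/": ("G3", 96),
}

def parse_designation(seq):
    """Return (target, kind, final) for one designation escape sequence."""
    if not seq.startswith(ESC) or len(seq) < 3:
        raise ValueError("not a designation sequence")
    body = seq[1:]
    multibyte = body.startswith("$")
    if multibyte:
        body = body[1:]
        if body[0] not in INTERMEDIATES:
            # Short form ESC $ xx designates a multibyte set as G0.
            return ("G0", "multibyte", body)
    if body[0] not in INTERMEDIATES:
        raise ValueError("unknown intermediate character")
    target, size = INTERMEDIATES[body[0]]
    kind = "multibyte" if multibyte else "%d-character" % size
    return (target, kind, body[1:])

for tail in ("(B", "-G", "$B", "$(C", "(!A"):
    print(parse_designation(ESC + tail))
```
.fi
.LP
The third element of the result is the symbol (or pair of symbols) that
would be looked up in the ISO 2375 register.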
|
|
.LP
|
|
ISO 4873 stipulates a narrower use of character sets, where G0
|
|
is fixed (always ASCII), so that G1, G2 and G3
|
|
can be invoked only for codes with the high order bit set.
|
|
In particular, \fB^N\fP and \fB^O\fP are not used anymore, ESC ( xx
|
|
can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
|
|
are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively.
|
|
.SS TIS-620
|
|
TIS-620 is a Thai national standard character set and a superset
|
|
of ASCII.
|
|
In the same fashion as the ISO 8859 series, Thai characters are mapped into
|
|
0xa1-0xfe.
|
|
.SS Unicode
|
|
Unicode (ISO 10646) is a standard which aims to unambiguously represent
|
|
every character in every human language.
|
|
Unicode's code space requires about 20.1 bits per character.
Since computers do not provide 20.1-bit integers, Unicode is
usually held as 32-bit integers internally and exchanged as either a series of
16-bit integers (UTF-16, which needs two 16-bit integers only when
encoding certain rare characters) or a series of 8-bit bytes (UTF-8).
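.LP
The figure of 20.1 bits can be reproduced with a short calculation.
The sketch assumes Python 3 and takes the upper limit of the Unicode code
space to be 0x10FFFF; the helper utf16_units is a hypothetical name.
.nf
```python
import math

CODE_POINTS = 0x10FFFF + 1   # assumed size of the Unicode code space

# Bits needed to number every code point.
bits = math.log2(CODE_POINTS)
print(round(bits, 1))

def utf16_units(ch):
    """Number of 16-bit integers UTF-16 needs for one character."""
    return len(ch.encode("utf-16-be")) // 2

# A common letter versus a character outside the first 65536 code points.
print(utf16_units("A"), utf16_units(chr(0x1F600)))
```
.fi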
|
|
.LP
|
|
Linux represents Unicode using the 8-bit Unicode Transformation Format
|
|
(UTF-8).
|
|
UTF-8 is a variable length encoding of Unicode.
|
|
It uses 1
|
|
byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes
|
|
for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.
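.LP
The byte counts above translate into a simple lookup.
This is a sketch, assuming Python 3; RANGES and utf8_length are
hypothetical names, and the 5- and 6-byte rows follow the original 31-bit
scheme described here rather than what a current encoder will emit.
.nf
```python
# (payload bits, bytes used), as listed in the text above.
RANGES = [(7, 1), (11, 2), (16, 3), (21, 4), (26, 5), (31, 6)]

def utf8_length(code_point):
    """Number of bytes the scheme above uses for a code point."""
    if code_point < 0:
        raise ValueError("negative code point")
    needed = code_point.bit_length()
    for bits, length in RANGES:
        if needed <= bits:
            return length
    raise ValueError("more than 31 bits")

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_length(cp), len(chr(cp).encode("utf-8")))
```
.fi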
|
|
.LP
|
|
Let 0,1,x stand for a zero, one, or arbitrary bit.
|
|
A byte 0xxxxxxx
|
|
stands for the Unicode 00000000 0xxxxxxx which codes the same symbol
|
|
as the ASCII 0xxxxxxx.
|
|
Thus, ASCII goes unchanged into UTF-8, and
|
|
people using only ASCII do not notice any change: not in code, and not
|
|
in file size.
|
|
.LP
|
|
A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy
|
|
is assembled into 00000xxx xxyyyyyy.
|
|
A byte 1110xxxx is the start
|
|
of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled
|
|
into xxxxyyyy yyzzzzzz.
|
|
(When UTF-8 is used to code the 31-bit ISO 10646,
this progression continues up to 6-byte codes.)
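.LP
The bit patterns above can be turned into a small decoder.
The sketch below assumes Python 3, handles only 1-, 2-, and 3-byte codes,
and does not validate tail bytes; decode_utf8_char is a hypothetical name.
.nf
```python
def decode_utf8_char(data):
    """Decode one 1-, 2-, or 3-byte UTF-8 code into a code point."""
    head = data[0]
    if head < 0x80:                       # 0xxxxxxx
        return head
    if (head & 0xE0) == 0xC0:             # 110xxxxx 10yyyyyy
        return ((head & 0x1F) << 6) | (data[1] & 0x3F)
    if (head & 0xF0) == 0xE0:             # 1110xxxx 10yyyyyy 10zzzzzz
        return (((head & 0x0F) << 12)
                | ((data[1] & 0x3F) << 6)
                | (data[2] & 0x3F))
    raise ValueError("longer codes are not handled in this sketch")

print(hex(decode_utf8_char(bytes([0xC3, 0xA9]))))
print(hex(decode_utf8_char(bytes([0xE2, 0x82, 0xAC]))))
```
.fi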
|
|
.LP
|
|
For most texts in ISO 8859 character sets, this means that the
|
|
characters outside of ASCII are now coded with two bytes.
|
|
This tends
|
|
to expand ordinary text files by only one or two percent.
|
|
For Russian
|
|
or Greek texts, this expands ordinary text files by 100%, since text in
|
|
those languages is mostly outside of ASCII.
|
|
For Japanese users this means
|
|
that the 16-bit codes now in common use will take three bytes.
|
|
While there are algorithmic conversions from some character sets
|
|
(especially ISO 8859-1) to Unicode, general conversion requires
|
|
carrying around conversion tables, which can be quite large for 16-bit
|
|
codes.
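.LP
The size effects described above can be measured on short samples.
This sketch assumes Python 3 and its "latin_1", "koi8_r", and "euc_jp"
codecs; the function growth and the sample strings are illustrative only,
and real documents will vary.
.nf
```python
def growth(text, legacy_encoding):
    """Return (legacy size, UTF-8 size) in bytes for the given text."""
    return len(text.encode(legacy_encoding)), len(text.encode("utf-8"))

samples = [
    ("latin_1", "caf" + chr(0xE9) + " au lait"),   # mostly ASCII
    ("koi8_r", "Привет"),                          # entirely Cyrillic
    ("euc_jp", "日本語"),                          # two-byte legacy codes
]

for encoding, text in samples:
    print(encoding, growth(text, encoding))
```
.fi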
|
|
.LP
|
|
Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail, any other
|
|
byte is the head of a code.
|
|
ASCII bytes occur in a UTF-8 stream only as themselves.
|
|
In particular, there are no
|
|
embedded NULs (\(aq\\0\(aq) or \(aq/\(aqs that form part of some larger code.
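.LP
Self-synchronization means that a program can find the start of a
character from any byte offset.
The following sketch assumes Python 3 and well-formed input;
char_start is a hypothetical helper.
.nf
```python
def char_start(data, offset):
    """Back up from offset to the head byte of the code that contains it."""
    # A byte of the form 10xxxxxx is a tail; anything else is a head.
    while offset > 0 and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset

# Bytes: 61 2f e2 82 ac 62 (letter, slash, Euro sign, letter).
stream = ("a/" + chr(0x20AC) + "b").encode("utf-8")

print(char_start(stream, 4))   # offset 4 lies inside the Euro sign
print(stream.count(b"/"))      # the slash byte appears only as itself
```
.fi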
|
|
.LP
|
|
Since ASCII, and, in particular, NUL and \(aq/\(aq, are unchanged, the
|
|
kernel does not notice that UTF-8 is being used.
|
|
It does not care at
|
|
all what the bytes it is handling stand for.
|
|
.LP
|
|
Rendering of Unicode data streams is typically handled through
|
|
"subfont" tables which map a subset of Unicode to glyphs.
|
|
Internally
|
|
the kernel uses Unicode to describe the subfont loaded in video RAM.
|
|
This means that in the Linux console in UTF-8 mode, one can use a character
|
|
set with 512 different symbols.
|
|
This is not enough for Japanese, Chinese, and
|
|
Korean, but it is enough for most other purposes.
|
|
.SH SEE ALSO
|
|
.BR iconv (1),
|
|
.BR ascii (7),
|
|
.BR iso_8859-1 (7),
|
|
.BR unicode (7),
|
|
.BR utf-8 (7)
|