mirror of https://github.com/mkerrisk/man-pages
utf-8.7: Minor formatting fixes
There's no need really to boldface names of standards and character sets.

Reported-by: Marko Myllynen <myllynen@redhat.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
parent f2cf1fbfe5
commit 57e792311a
139
man7/utf-8.7
@@ -30,12 +30,9 @@
 .SH NAME
 UTF-8 \- an ASCII compatible multibyte Unicode encoding
 .SH DESCRIPTION
-The
-.B Unicode 3.0
-character set occupies a 16-bit code space.
+The Unicode 3.0 character set occupies a 16-bit code space.
 The most obvious
-Unicode encoding (known as
-.BR UCS-2 )
+Unicode encoding (known as UCS-2)
 consists of a sequence of 16-bit words.
 Such strings can contain\(emas part of many 16-bit characters\(embytes
 such as \(aq\\0\(aq or \(aq/\(aq, which have a
@@ -43,69 +40,48 @@ special meaning in filenames and other C library function arguments.
 In addition, the majority of UNIX tools expect ASCII files and can't
 read 16-bit words as characters without major modifications.
 For these reasons,
-.B UCS-2
-is not a suitable external encoding of
-.B Unicode
+UCS-2 is not a suitable external encoding of Unicode
 in filenames, text files, environment variables, and so on.
-The
-.BR "ISO 10646 Universal Character Set (UCS)" ,
+The ISO 10646 Universal Character Set (UCS),
 a superset of Unicode, occupies an even larger code
 space\(em31\ bits\(emand the obvious
-.B UCS-4
-encoding for it (a sequence of 32-bit words) has the same problems.
+UCS-4 encoding for it (a sequence of 32-bit words) has the same problems.
 
-The
-.B UTF-8
-encoding of
-.B Unicode
-and
-.B UCS
+The UTF-8 encoding of Unicode and UCS
 does not have these problems and is the common way in which
-.B Unicode
-is used on UNIX-style operating systems.
+Unicode is used on UNIX-style operating systems.
 .SS Properties
-The
-.B UTF-8
-encoding has the following nice properties:
+The UTF-8 encoding has the following nice properties:
 .TP 0.2i
 *
-.B UCS
-characters 0x00000000 to 0x0000007f (the classic
-.B US-ASCII
+UCS
+characters 0x00000000 to 0x0000007f (the classic US-ASCII
 characters) are encoded simply as bytes 0x00 to 0x7f (ASCII
 compatibility).
 This means that files and strings which contain only
 7-bit ASCII characters have the same encoding under both
-.B ASCII
+ASCII
 and
-.BR UTF-8 .
+UTF-8 .
 .TP
 *
-All
-.B UCS
-characters greater than 0x7f are encoded as a multibyte sequence
+All UCS characters greater than 0x7f are encoded as a multibyte sequence
 consisting only of bytes in the range 0x80 to 0xfd, so no ASCII
 byte can appear as part of another character and there are no
 problems with, for example, \(aq\\0\(aq or \(aq/\(aq.
 .TP
 *
-The lexicographic sorting order of
-.B UCS-4
-strings is preserved.
+The lexicographic sorting order of UCS-4 strings is preserved.
 .TP
 *
-All possible 2^31 UCS codes can be encoded using
-.BR UTF-8 .
+All possible 2^31 UCS codes can be encoded using UTF-8.
 .TP
 *
-The bytes 0xc0, 0xc1, 0xfe, and 0xff are never used in the
-.B UTF-8
-encoding.
+The bytes 0xc0, 0xc1, 0xfe, and 0xff are never used in the UTF-8 encoding.
 .TP
 *
 The first byte of a multibyte sequence which represents a single non-ASCII
-.B UCS
-character is always in the range 0xc2 to 0xfd and indicates how long
+UCS character is always in the range 0xc2 to 0xfd and indicates how long
 this multibyte sequence is.
 All further bytes in a multibyte sequence
 are in the range 0x80 to 0xbf.
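
Note (not part of the commit): the first-byte and continuation-byte properties quoted in the hunk above can be sketched in Python roughly as follows. The function names are hypothetical and chosen only for illustration. The 5- and 6-byte ranges follow the original 31-bit UCS scheme that the page describes, not the narrower Unicode range.

```python
def sequence_length(first_byte):
    """Return the length in bytes of the sequence introduced by first_byte."""
    if first_byte <= 0x7F:
        return 1                      # plain ASCII, encoded as itself
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF7:
        return 4
    if 0xF8 <= first_byte <= 0xFB:
        return 5                      # always beyond Unicode's 0x10ffff limit
    if 0xFC <= first_byte <= 0xFD:
        return 6                      # always beyond Unicode's 0x10ffff limit
    # 0x80..0xbf are continuation bytes; 0xc0, 0xc1, 0xfe, 0xff are never used.
    raise ValueError("not a valid first byte: 0x%02x" % first_byte)


def count_characters(data):
    """Count characters by counting every byte that is not a continuation byte."""
    return sum(1 for byte in data if not 0x80 <= byte <= 0xBF)


def resynchronize(data, offset):
    """Skip continuation bytes so that offset lands on the start of a character."""
    while offset < len(data) and 0x80 <= data[offset] <= 0xBF:
        offset += 1
    return offset
```

Because continuation bytes occupy their own range, a reader that starts in the middle of a stream only has to skip bytes in 0x80 to 0xbf to find the next character boundary. No state from earlier bytes is needed.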
@@ -113,14 +89,10 @@ This allows easy resynchronization and
 makes the encoding stateless and robust against missing bytes.
 .TP
 *
-.B UTF-8
-encoded
-.B UCS
-characters may be up to six bytes long, however the
-.B Unicode
-standard specifies no characters above 0x10ffff, so Unicode characters
+UTF-8 encoded UCS characters may be up to six bytes long, however the
+Unicode standard specifies no characters above 0x10ffff, so Unicode characters
 can be only up to four bytes long in
-.BR UTF-8 .
+UTF-8.
 .SS Encoding
 The following byte sequences are used to represent a character.
 The sequence to be used depends on the UCS code number of the character:
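
Note (not part of the commit): how the byte sequence depends on the code number can be sketched in Python. `utf8_encode` is a hypothetical helper written for illustration. It is deliberately limited to the Unicode range (at most four bytes) and rejects the UTF-16 surrogate values, so it does not cover the 5- and 6-byte forms of the original UCS scheme.

```python
def utf8_encode(code):
    """Encode one Unicode code point as the shortest possible UTF-8 sequence."""
    if code < 0 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point: %r" % (code,))
    if code < 0x80:
        # 0xxxxxxx
        return bytes([code])
    if code < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    if code < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code >> 18),
                  0x80 | ((code >> 12) & 0x3F),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])
```

For the copyright sign, `utf8_encode(0xA9)` yields the two bytes 0xc2 0xa9. Code point 0x2260 yields 0xe2 0x89 0xa0, the three-byte sequence that appears in the Example section of the page.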
@@ -165,16 +137,10 @@ binary representation.
 Only the shortest possible multibyte sequence
 which can represent the code number of the character can be used.
 .PP
-The
-.B UCS
-code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and
-0xffff (UCS noncharacters) should not appear in conforming
-.B UTF-8
-streams.
+The UCS code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and
+0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams.
 .SS Example
-The
-.B Unicode
-character 0xa9 = 1010 1001 (the copyright sign) is encoded
+The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded
 in UTF-8 as
 .PP
 .RS
@@ -188,17 +154,13 @@ encoded as:
 11100010 10001001 10100000 = 0xe2 0x89 0xa0
 .RE
 .SS Application notes
-Users have to select a
-.B UTF-8
-locale, for example with
+Users have to select a UTF-8 locale, for example with
 .PP
 .RS
 export LANG=en_GB.UTF-8
 .RE
 .PP
-in order to activate the
-.B UTF-8
-support in applications.
+in order to activate the UTF-8 support in applications.
 .PP
 Application software that has to be aware of the used character
 encoding should always set the locale with for example
@@ -213,69 +175,46 @@ and programmers can then test the expression
 strcmp(nl_langinfo(CODESET), "UTF-8") == 0
 .RE
 .PP
-to determine whether a
-.B UTF-8
-locale has been selected and whether
+to determine whether a UTF-8 locale has been selected and whether
 therefore all plaintext standard input and output, terminal
 communication, plaintext file content, filenames and environment
-variables are encoded in
-.BR UTF-8 .
+variables are encoded in UTF-8.
 .PP
-Programmers accustomed to single-byte encodings such as
-.B US-ASCII
-or
-.B ISO 8859
+Programmers accustomed to single-byte encodings such as US-ASCII or ISO 8859
 have to be aware that two assumptions made so far are no longer valid
-in
-.B UTF-8
-locales.
+in UTF-8 locales.
 Firstly, a single byte does not necessarily correspond any
 more to a single character.
-Secondly, since modern terminal emulators
-in
-.B UTF-8
+Secondly, since modern terminal emulators in UTF-8
 mode also support Chinese, Japanese, and Korean
-.B double-width characters
-as well as nonspacing
-.BR "combining characters" ,
+double-width characters as well as nonspacing combining characters,
 outputting a single character does not necessarily advance the cursor
-by one position as it did in
-.BR ASCII .
+by one position as it did in ASCII.
 Library functions such as
 .BR mbsrtowcs (3)
 and
 .BR wcswidth (3)
 should be used today to count characters and cursor positions.
 .PP
-The official ESC sequence to switch from an
-.B ISO 2022
+The official ESC sequence to switch from an ISO 2022
 encoding scheme (as used for instance by VT100 terminals) to
-.B UTF-8
-is ESC % G
+UTF-8 is ESC % G
 ("\\x1b%G").
 The corresponding return sequence from
-.B UTF-8
-to ISO 2022 is ESC % @ ("\\x1b%@").
+UTF-8 to ISO 2022 is ESC % @ ("\\x1b%@").
 Other ISO 2022 sequences (such as
 for switching the G0 and G1 sets) are not applicable in UTF-8 mode.
 .SS Security
-The
-.BR Unicode " and " UCS
-standards require that producers of
-.B UTF-8
+The Unicode and UCS standards require that producers of UTF-8
 shall use the shortest form possible, for example, producing a two-byte
 sequence with first byte 0xc0 is nonconforming.
-.B Unicode 3.1
-has added the requirement that conforming programs must not accept
+Unicode 3.1 has added the requirement that conforming programs must not accept
 non-shortest forms in their input.
 This is for security reasons: if
 user input is checked for possible security violations, a program
-might check only for the
-.B ASCII
+might check only for the ASCII
 version of "/../" or ";" or NUL and overlook that there are many
-.RB non- ASCII
-ways to represent these things in a non-shortest
-.B UTF-8
+non-ASCII ways to represent these things in a non-shortest UTF-8
 encoding.
 .SS Standards
 ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 3629, Plan 9.
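
Note (not part of the commit): the shortest-form requirement from the Security section can be illustrated with a small Python sketch. `is_strict_utf8` is a hypothetical helper name. It relies on Python 3's built-in "utf-8" codec, which rejects non-shortest forms and encoded surrogates when decoding in its default strict mode.

```python
def is_strict_utf8(data):
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xc0 0xaf is a non-shortest (two-byte) form of "/" (0x2f).
overlong_slash = b"\xc0\xaf"

# A byte-level check for the ASCII "/" does not see it ...
ascii_check_sees_slash = b"/" in overlong_slash

# ... so the input has to be rejected before any such check is trusted.
accepted = is_strict_utf8(overlong_slash)
```

Here `ascii_check_sees_slash` and `accepted` are both false. The ASCII-only test overlooks the disguised "/", and strict decoding refuses the input, so a lenient decoder never gets the chance to turn it into a real "/".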