mirror of https://github.com/mkerrisk/man-pages
utf-8.7: Two clarifications
This patch clarifies that 0xc0 and 0xc1 are not valid in any UTF-8
encoding[0], and it also references RFC 3629 instead of RFC 2279.

[0] In order to have 0xc0, you'd have to have a two-byte encoding with
all the data bits zero in the first byte (and thus only six bits of
data), which would be an ASCII character encoded in the non-shortest
form. Similarly with 0xc1.

See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=538641

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
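The footnote's argument can be checked mechanically. The following is a minimal sketch, not part of the patch: the helper names `decode_two_byte` and `strict_decoder_rejects` are hypothetical, introduced only for illustration. It decodes the two-byte pattern 110xxxxx 10xxxxxx without any validity checks and shows that lead bytes 0xc0 and 0xc1 can only produce code points below 0x80, that is, ASCII characters in non-shortest form.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """Decode the pattern 110xxxxx 10xxxxxx with no shortest-form check.

    Hypothetical helper for illustration only.
    """
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 bit pattern")
    # Five data bits from the lead byte, six from the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


# Every code point reachable from lead bytes 0xc0 and 0xc1.
overlong = {
    decode_two_byte(lead, cont)
    for lead in (0xC0, 0xC1)
    for cont in range(0x80, 0xC0)
}

# Smallest code point reachable from the first legitimate lead byte, 0xc2.
smallest_c2 = decode_two_byte(0xC2, 0x80)


def strict_decoder_rejects(data: bytes) -> bool:
    """Return True if Python's built-in UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

Here `overlong` covers exactly the ASCII range, while `smallest_c2` is 0x80, the first code point that genuinely needs two bytes. This is why the range of valid first bytes starts at 0xc2.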
parent c3ee1c5d6a
commit 4550bf19cf
@@ -27,7 +27,7 @@
 .\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk>
 .\" Update
 .\"
-.TH UTF-8 7 2001-05-11 "GNU" "Linux Programmer's Manual"
+.TH UTF-8 7 2012-04-30 "GNU" "Linux Programmer's Manual"
 .SH NAME
 UTF-8 \- an ASCII compatible multibyte Unicode encoding
 .SH DESCRIPTION
@@ -99,14 +99,14 @@ All possible 2^31 UCS codes can be encoded using
 .BR UTF-8 .
 .TP
 *
-The bytes 0xfe and 0xff are never used in the
+The bytes 0xc0, 0xc1, 0xfe and 0xff are never used in the
 .B UTF-8
 encoding.
 .TP
 *
 The first byte of a multibyte sequence which represents a single non-ASCII
 .B UCS
-character is always in the range 0xc0 to 0xfd and indicates how long
+character is always in the range 0xc2 to 0xfd and indicates how long
 this multibyte sequence is.
 All further bytes in a multibyte sequence
 are in the range 0x80 to 0xbf.
@@ -288,7 +288,7 @@ ways to represent these things in a non-shortest
 .B UTF-8
 encoding.
 .SS Standards
-ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 2279, Plan 9.
+ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 3629, Plan 9.
 .\" .SH AUTHOR
 .\" Markus Kuhn <mgk25@cl.cam.ac.uk>
 .SH "SEE ALSO"
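The patched text says that the first byte of a multibyte sequence lies in the range 0xc2 to 0xfd and indicates the sequence length. A minimal sketch of that mapping follows, assuming the original 31-bit scheme the page describes (sequences of up to six bytes). The function name `sequence_length` is hypothetical. Note that RFC 3629 itself restricts UTF-8 to four-byte sequences, so a decoder following the RFC would also reject lead bytes above 0xf4.

```python
def sequence_length(first: int) -> int:
    """Return the sequence length implied by a first byte.

    Hypothetical helper following the 31-bit scheme described in the
    man page, not the stricter four-byte limit of RFC 3629.
    """
    if 0x00 <= first <= 0x7F:
        return 1  # ASCII, encoded as itself
    if 0xC2 <= first <= 0xDF:
        return 2  # 110xxxxx, excluding the overlong leads 0xc0 and 0xc1
    if 0xE0 <= first <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= first <= 0xF7:
        return 4  # 11110xxx
    if 0xF8 <= first <= 0xFB:
        return 5  # 111110xx
    if 0xFC <= first <= 0xFD:
        return 6  # 1111110x
    # 0x80..0xbf are continuation bytes; 0xc0, 0xc1, 0xfe and 0xff
    # are never used in the encoding.
    raise ValueError(f"0x{first:02x} cannot start a UTF-8 sequence")
```

Raising an exception for the never-used bytes keeps them distinct from valid lead bytes, which matches the first bullet of the patched hunk.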