utf-8.7: Two clarifications

This patch clarifies that 0xc0 and 0xc1 are not valid in any UTF-8
encoding[0], and it also references RFC 3629 instead of RFC 2279.

[0] A lead byte of 0xc0 or 0xc1 begins a two-byte sequence whose five
data bits are 00000 or 00001, so the sequence can carry at most seven
significant bits of data.  That would be an ASCII character encoded in
a non-shortest form, which UTF-8 forbids.

See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=538641

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Brian M. Carlson 2012-04-30 10:24:06 +12:00 committed by Michael Kerrisk
parent c3ee1c5d6a
commit 4550bf19cf
1 changed file with 4 additions and 4 deletions

@@ -27,7 +27,7 @@
.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk>
.\" Update
.\"
-.TH UTF-8 7 2001-05-11 "GNU" "Linux Programmer's Manual"
+.TH UTF-8 7 2012-04-30 "GNU" "Linux Programmer's Manual"
.SH NAME
UTF-8 \- an ASCII compatible multibyte Unicode encoding
.SH DESCRIPTION
@@ -99,14 +99,14 @@ All possible 2^31 UCS codes can be encoded using
.BR UTF-8 .
.TP
*
-The bytes 0xfe and 0xff are never used in the
+The bytes 0xc0, 0xc1, 0xfe and 0xff are never used in the
.B UTF-8
encoding.
.TP
*
The first byte of a multibyte sequence which represents a single non-ASCII
.B UCS
-character is always in the range 0xc0 to 0xfd and indicates how long
+character is always in the range 0xc2 to 0xfd and indicates how long
this multibyte sequence is.
All further bytes in a multibyte sequence
are in the range 0x80 to 0xbf.
@@ -288,7 +288,7 @@ ways to represent these things in a non-shortest
.B UTF-8
encoding.
.SS Standards
-ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 2279, Plan 9.
+ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 3629, Plan 9.
.\" .SH AUTHOR
.\" Markus Kuhn <mgk25@cl.cam.ac.uk>
.SH "SEE ALSO"