utf-8.7: Two clarifications

This patch clarifies that 0xc0 and 0xc1 are not valid in any UTF-8 encoding[0], and it also references RFC 3629 instead of RFC 2279.

[0] In order to have 0xc0, you'd have to have a two-byte encoding with all the data bits zero in the first byte (and thus only six bits of data), which would be an ASCII character encoded in the non-shortest form. Similarly with 0xc1.

See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=538641

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
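
The reasoning about 0xc0 and 0xc1 can be checked with a small sketch. It is not part of the patch; the helper names `two_byte_lead_range` and `is_overlong_lead` are hypothetical and exist only for illustration. It assumes the standard two-byte layout 110xxxxx 10xxxxxx, where the lead byte carries five data bits and the continuation byte six.

```python
def two_byte_lead_range(lead):
    """Return (lowest, highest) code point that a two-byte sequence
    starting with `lead` could encode (hypothetical helper)."""
    if not 0xC0 <= lead <= 0xDF:
        raise ValueError("not a two-byte lead byte")
    high_bits = lead & 0x1F           # five data bits from 110xxxxx
    lowest = high_bits << 6           # continuation data bits all zero
    highest = lowest | 0x3F           # continuation data bits all one
    return lowest, highest


def is_overlong_lead(lead):
    """True if every value reachable from this lead byte is ASCII,
    i.e. the two-byte form would never be the shortest encoding."""
    return two_byte_lead_range(lead)[1] < 0x80


for lead in (0xC0, 0xC1, 0xC2):
    low, high = two_byte_lead_range(lead)
    print(f"0x{lead:02x}: 0x{low:02x}..0x{high:02x} overlong={is_overlong_lead(lead)}")
```

Under these assumptions, 0xc0 covers only 0x00 to 0x3f and 0xc1 only 0x40 to 0x7f, so both can produce nothing but non-shortest forms of ASCII characters, while 0xc2 is the first lead byte that reaches 0x80. A strict decoder behaves accordingly: in Python 3, `bytes([0xC0, 0x80]).decode("utf-8")` raises `UnicodeDecodeError`.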
2012-04-30 10:24:06 +12:00
parent c3ee1c5d6a
commit 4550bf19cf
1 changed file with 4 additions and 4 deletions
--- a/man7/utf-8.7
+++ b/man7/utf-8.7
@@ -27,7 +27,7 @@
 .\" 2001-05-11  Markus Kuhn <mgk25@cl.cam.ac.uk>
 .\"      Update
 .\"
-.TH UTF-8 7 2001-05-11 "GNU" "Linux Programmer's Manual"
+.TH UTF-8 7 2012-04-30 "GNU" "Linux Programmer's Manual"
 .SH NAME
 UTF-8 \- an ASCII compatible multibyte Unicode encoding
 .SH DESCRIPTION
@@ -99,14 +99,14 @@ All possible 2^31 UCS codes can be encoded using
 .BR UTF-8 .
 .TP
 *
-The bytes 0xfe and 0xff are never used in the
+The bytes 0xc0, 0xc1, 0xfe and 0xff are never used in the
 .B UTF-8
 encoding.
 .TP
 *
 The first byte of a multibyte sequence which represents a single non-ASCII
 .B UCS
-character is always in the range 0xc0 to 0xfd and indicates how long
+character is always in the range 0xc2 to 0xfd and indicates how long
 this multibyte sequence is.
 All further bytes in a multibyte sequence
 are in the range 0x80 to 0xbf.
@@ -288,7 +288,7 @@ ways to represent these things in a non-shortest
 .B UTF-8
 encoding.
 .SS Standards
-ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 2279, Plan 9.
+ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 3629, Plan 9.
 .\" .SH AUTHOR
 .\" Markus Kuhn <mgk25@cl.cam.ac.uk>
 .SH "SEE ALSO"