From 4550bf19cf63901e5c605f641b0fb88eabc5129c Mon Sep 17 00:00:00 2001
From: "Brian M. Carlson"
Date: Mon, 30 Apr 2012 10:24:06 +1200
Subject: [PATCH] utf-8.7: Two clarifications

This patch clarifies that 0xc0 and 0xc1 are not valid in any UTF-8
encoding[0], and it also references RFC 3629 instead of RFC 2279.

[0] In order to have 0xc0, you'd have to have a two-byte encoding with
all the data bits zero in the first byte (and thus only six bits of
data), which would be an ASCII character encoded in the non-shortest
form. Similarly with 0xc1.

See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=538641

Signed-off-by: Michael Kerrisk
---
 man7/utf-8.7 | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/man7/utf-8.7 b/man7/utf-8.7
index ad159ae5b..392d90fc8 100644
--- a/man7/utf-8.7
+++ b/man7/utf-8.7
@@ -27,7 +27,7 @@
 .\" 2001-05-11 Markus Kuhn
 .\" Update
 .\"
-.TH UTF-8 7 2001-05-11 "GNU" "Linux Programmer's Manual"
+.TH UTF-8 7 2012-04-30 "GNU" "Linux Programmer's Manual"
 .SH NAME
 UTF-8 \- an ASCII compatible multibyte Unicode encoding
 .SH DESCRIPTION
@@ -99,14 +99,14 @@ All possible 2^31 UCS codes can be encoded using
 .BR UTF-8 .
 .TP
 *
-The bytes 0xfe and 0xff are never used in the
+The bytes 0xc0, 0xc1, 0xfe and 0xff are never used in the
 .B UTF-8
 encoding.
 .TP
 *
 The first byte of a multibyte sequence which represents a single non-ASCII
 .B UCS
-character is always in the range 0xc0 to 0xfd and indicates how long
+character is always in the range 0xc2 to 0xfd and indicates how long
 this multibyte sequence is.
 All further bytes in a multibyte sequence
 are in the range 0x80 to 0xbf.
@@ -288,7 +288,7 @@ ways to represent these things in a non-shortest
 .B UTF-8
 encoding.
 .SS Standards
-ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 2279, Plan 9.
+ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 3629, Plan 9.
 .\" .SH AUTHOR
 .\" Markus Kuhn
 .SH "SEE ALSO"
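
Note on footnote [0]: the bit arithmetic behind the 0xc0/0xc1 exclusion
can be checked with a short sketch. It is not part of the patch. The helper
name decode_two_byte is hypothetical, and it deliberately skips the
shortest-form check so that the raw code point range reachable from each
lead byte is visible. It assumes the two-byte layout 110xxxxx 10yyyyyy.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """Decode the bit pattern 110xxxxx 10yyyyyy with no shortest-form check.

    Hypothetical helper for illustration only; a conforming decoder
    must reject the lead bytes 0xc0 and 0xc1 outright.
    """
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 bit pattern")
    # Five data bits from the lead byte, six from the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


for lead in (0xC0, 0xC1, 0xC2):
    lowest = decode_two_byte(lead, 0x80)   # continuation data bits all zero
    highest = decode_two_byte(lead, 0xBF)  # continuation data bits all one
    print(f"lead 0x{lead:02x}: U+{lowest:04X}..U+{highest:04X}")
# lead 0xc0: U+0000..U+003F
# lead 0xc1: U+0040..U+007F
# lead 0xc2: U+0080..U+00BF
```

Under these assumptions, 0xc0 and 0xc1 can only reach code points that
already have a one-byte ASCII encoding, so any sequence starting with them
would be a non-shortest form. 0xc2 is the first lead byte that reaches a
non-ASCII code point, which matches the corrected range 0xc2 to 0xfd.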