From ad0fbddddc2758008f8b5204d44472604837bfee Mon Sep 17 00:00:00 2001
From: Shawn Landden
Date: Mon, 25 May 2015 23:53:06 -0700
Subject: [PATCH] utf-8: Include RFC 3629 and clarify endianness which is left ambiguous

The endianness is suggested by the order the bytes are displayed, but the
text is ambiguous.
---
 man7/utf-8.7 | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/man7/utf-8.7 b/man7/utf-8.7
index 597fad465..bbb016ce5 100644
--- a/man7/utf-8.7
+++ b/man7/utf-8.7
@@ -133,12 +133,14 @@ The sequence to be used depends on the UCS code number of the character:
 The
 .I xxx
 bit positions are filled with the bits of the character code number in
-binary representation.
+binary representation, most significant bit first (big-endian).
 Only the shortest possible multibyte sequence
 which can represent the code number of the character can be used.
 .PP
 The UCS code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and
-0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams.
+0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams. According
+to RFC 3629 no point above U+10FFFF should be used, which limits characters to four
+bytes.
 .SS Example
 The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded
 in UTF-8 as
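
The rules touched by this hunk (bits filled most significant first, shortest
sequence only, surrogates and 0xfffe/0xffff excluded, nothing above U+10FFFF)
can be illustrated with a small C sketch. It is not part of the patch or of
the man-pages project; the function name utf8_encode and its return
convention are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Encodes the code point cp into out[] and returns the number of bytes
 * written (1 to 4), or 0 if cp is one of the values the page says should
 * not appear in a conforming UTF-8 stream.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF)                  /* RFC 3629 upper limit */
        return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* UTF-16 surrogates */
        return 0;
    if (cp == 0xFFFE || cp == 0xFFFF)   /* noncharacters named in the page */
        return 0;

    /* Choosing the smallest range first yields the shortest sequence. */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;                          /* 0xxxxxxx */
        return 1;
    }
    if (cp < 0x800) {
        /* The most significant bits go into the first byte. */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));          /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx */
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F)); /* 10xxxxxx */
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx */
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));             /* 11110xxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));    /* 10xxxxxx */
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));     /* 10xxxxxx */
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));            /* 10xxxxxx */
    return 4;
}
```

Under these assumptions, utf8_encode(0xa9, buf) writes the two bytes 0xc2 0xa9,
and U+10FFFF, the largest value RFC 3629 permits, needs exactly four bytes,
which is the limit the added sentence describes.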