utf-8: Include RFC 3629 and clarify endianness which is left ambiguous

The endianness is suggested by the order the bytes are displayed,
but the text is ambiguous.
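
To illustrate the bit order this commit spells out, here is a minimal sketch in Python of a hand-written encoder. It is not part of the patch or the man page, and the function name `utf8_encode` is hypothetical. It assumes the rules stated in the patched text: payload bits are placed most significant first, the shortest sequence is used, surrogates are rejected, and nothing above U+10FFFF is accepted. It does not reject the noncharacters 0xfffe and 0xffff.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point to UTF-8 by hand (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF (RFC 3629 limit)")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("UTF-16 surrogates must not appear in UTF-8")

    # Pick the shortest sequence that can hold the code point.
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        n_cont, lead = 1, 0xC0   # 110xxxxx
    elif cp < 0x10000:
        n_cont, lead = 2, 0xE0   # 1110xxxx
    else:
        n_cont, lead = 3, 0xF0   # 11110xxx

    # The most significant payload bits go into the lead byte.
    out = [lead | (cp >> (6 * n_cont))]
    # Each continuation byte (10xxxxxx) takes the next 6 bits, high to low.
    for i in range(n_cont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


if __name__ == "__main__":
    # Cross-check against Python's built-in encoder.
    for cp in (0x41, 0xA9, 0x20AC, 0x10FFFF):
        assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(utf8_encode(0xA9).hex())
```

For the copyright sign, `utf8_encode(0xA9)` yields the two bytes 0xc2 0xa9. Filling the `xxx` positions least significant bit first would produce different bytes, which is the ambiguity the commit removes.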
Shawn Landden 2015-05-25 23:53:06 -07:00 committed by Michael Kerrisk
parent 5c1932ae50
commit ad0fbddddc
1 changed file with 4 additions and 2 deletions


@@ -133,12 +133,14 @@ The sequence to be used depends on the UCS code number of the character:
The
.I xxx
bit positions are filled with the bits of the character code number in
-binary representation.
+binary representation, most significant bit first (big-endian).
Only the shortest possible multibyte sequence
which can represent the code number of the character can be used.
.PP
The UCS code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and
-0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams.
+0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams. According
+to RFC 3629 no point above U+10FFFF should be used, which limits characters to four
+bytes.
.SS Example
The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded
in UTF-8 as