mirror of https://github.com/mkerrisk/man-pages
utf-8: Include RFC 3629 and clarify endianness which is left ambiguous
The endianness is suggested by the order the bytes are displayed, but the text is ambiguous.
parent 5c1932ae50
commit ad0fbddddc
@@ -133,12 +133,14 @@ The sequence to be used depends on the UCS code number of the character:
 The
 .I xxx
 bit positions are filled with the bits of the character code number in
-binary representation.
+binary representation, most significant bit first (big-endian).
 Only the shortest possible multibyte sequence
 which can represent the code number of the character can be used.
 .PP
 The UCS code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and
-0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams.
+0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams. According
+to RFC 3629 no point above U+10FFFF should be used, which limits characters to four
+bytes.
 .SS Example
 The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded
 in UTF-8 as
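The rules touched by this hunk (bits filled most significant first, shortest sequence only, no surrogates, nothing above U+10FFFF) can be illustrated with a minimal sketch. It is not part of the commit. The function name utf8_encode is hypothetical, and the sketch makes no attempt to filter the noncharacters 0xfffe and 0xffff.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (hypothetical helper, illustration only)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF (RFC 3629 limit)")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("UTF-16 surrogates must not appear in UTF-8")

    # Pick the shortest sequence that can hold the code point.
    if cp < 0x80:
        return bytes([cp])          # 0xxxxxxx
    if cp < 0x800:
        n_cont, lead = 1, 0xC0      # 110xxxxx 10xxxxxx
    elif cp < 0x10000:
        n_cont, lead = 2, 0xE0      # 1110xxxx 10xxxxxx 10xxxxxx
    else:
        n_cont, lead = 3, 0xF0      # 11110xxx plus three continuation bytes

    # The most significant bits go into the first byte; each continuation
    # byte then carries the next six bits, in order (big-endian fill).
    out = [lead | (cp >> (6 * n_cont))]
    for i in range(n_cont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


print(utf8_encode(0xA9).hex())  # c2a9
```

For the copyright sign used in the man page's example, 0xa9 = 1010 1001 splits into 10 and 101001, giving the two bytes 11000010 10101001 (0xc2 0xa9). The upper bound of 0x10FFFF is what keeps every sequence at four bytes or fewer.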