diff --git a/man7/unicode.7 b/man7/unicode.7
index 4884b566c..c616f5019 100644
--- a/man7/unicode.7
+++ b/man7/unicode.7
@@ -30,13 +30,10 @@
 .SH NAME
 Unicode \- universal character set
 .SH DESCRIPTION
-The international standard
-.B ISO 10646
-defines the
-.BR "Universal Character Set (UCS)" .
+The international standard ISO 10646 defines the
+Universal Character Set (UCS).
 UCS contains all characters of all other character set standards.
-It also guarantees
-.BR "round-trip compatibility";
+It also guarantees "round-trip compatibility";
 in other words,
 conversion tables can be built such that no information is lost
 when a string is converted from any other encoding to UCS and back.
@@ -74,14 +71,12 @@ made up of 256 8-bit
 with 256
 .I column
 positions, one for each character.
-Part 1 of the standard
-.RB ( "ISO 10646-1" )
+Part 1 of the standard (ISO 10646-1)
 defines the first 65534 code positions (0x0000 to 0xfffd),
 which form the
 .IR "Basic Multilingual Plane (BMP)" ,
 that is plane 0 in group 0.
-Part 2 of the standard
-.RB ( "ISO 10646-2" )
+Part 2 of the standard (ISO 10646-2)
 adds characters to group 0 outside the BMP in several
 .I "supplementary planes"
 in the range 0x10000 to 0x10ffff.
@@ -97,27 +92,20 @@ dictionary printing, publishing industry, higher-level protocol and
 enthusiast needs.
 .PP
 The representation of each UCS character as a 2-byte word is referred
-to as the
-.B UCS-2
-form (only for BMP characters), whereas
-.B UCS-4
-is the representation of each character by a 4-byte word.
-In addition, there exist two encoding forms
-.B UTF-8
-for backward compatibility with ASCII processing software and
-.B UTF-16
+to as the UCS-2 form (only for BMP characters),
+whereas UCS-4 is the representation of each character by a 4-byte word.
+In addition, there exist two encoding forms UTF-8
+for backward compatibility with ASCII processing software and UTF-16
 for the backward-compatible handling of non-BMP characters up to 0x10ffff
 by UCS-2 software.
 .PP
 The UCS characters 0x0000 to 0x007f are identical to those of the
-classic
-.B US-ASCII
+classic US-ASCII
 character set and the characters in the range 0x0000 to 0x00ff
 are identical to those in
-.BR "ISO 8859-1 Latin-1" .
+ISO 8859-1 (Latin-1).
 .SS Combining characters
-Some code points in
-.B UCS
+Some code points in UCS
 have been assigned to
 .IR "combining characters" .
 These are similar to the nonspacing accent keys on a typewriter.
@@ -143,8 +131,7 @@ combining characters, ISO 10646-1 specifies the following three
 of UCS:
 .TP 0.9i
 Level 1
-Combining characters and
-.B Hangul Jamo
+Combining characters and Hangul Jamo
 (a variant encoding of the Korean script, where a Hangul syllable
 glyph is coded as a triplet or pair of vovel/consonant codes) are not
 supported.
@@ -155,19 +142,13 @@ languages where they are essential (e.g., Thai, Lao, Hebrew,
 Arabic, Devanagari, Malayalam).
 .TP
 Level 3
-All
-.B UCS
-characters are supported.
+All UCS characters are supported.
 .PP
-The
-.B Unicode 3.0 Standard
-published by the
-.B Unicode Consortium
-contains exactly the
-.B UCS Basic Multilingual Plane
+The Unicode 3.0 Standard
+published by the Unicode Consortium
+contains exactly the UCS Basic Multilingual Plane
 at implementation level 3, as described in ISO 10646-1:2000.
-.B Unicode 3.1
-added the supplemental planes of ISO 10646-2.
+Unicode 3.1 added the supplemental planes of ISO 10646-2.
 The Unicode standard and technical reports published by the Unicode
 Consortium provide much additional information on the semantics and
 recommended usages of
@@ -180,8 +161,7 @@ Under GNU/Linux, the C type
 .I wchar_t
 is a signed 32-bit integer type.
 Its values are always interpreted
-by the C library as
-.B UCS
+by the C library as UCS
 code values (in all locales), a convention that is signaled by the GNU
 C library to applications by defining the constant
 .B __STDC_ISO_10646__
@@ -189,9 +169,7 @@ as specified in the ISO C99 standard.
 UCS/Unicode can be used just like ASCII in input/output streams,
 terminal communication, plaintext files, filenames, and environment
-variables in the ASCII compatible
-.B UTF-8
-multibyte encoding.
+variables in the ASCII compatible UTF-8 multibyte encoding.
 To signal the use of UTF-8 as the character encoding to all
 applications, a suitable
 .I locale
@@ -236,8 +214,7 @@ Set (UCS) \(em Part 1: Architecture and Basic Multilingual Plane.
 International Standard ISO/IEC 10646-1, International Organization
 for Standardization, Geneva, 2000.
-This is the official specification of
-.BR UCS .
+This is the official specification of UCS.
 Available from
 .UR http://www.iso.ch/
 .UE .