.\" Hey Emacs! This file is -*- nroff -*- source.
.\"
.\" Copyright (C) Markus Kuhn, 1996, 2001
.\"
.\" This is free documentation; you can redistribute it and/or
.\" modify it under the terms of the GNU General Public License as
.\" published by the Free Software Foundation; either version 2 of
.\" the License, or (at your option) any later version.
.\"
.\" The GNU General Public License's references to "object code"
.\" and "executables" are to be interpreted as the output of any
.\" document formatting or typesetting system, including
.\" intermediate and printed output.
.\"
.\" This manual is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, write to the Free
.\" Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111,
.\" USA.
.\"
.\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
.\" First version written
.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk>
.\" Update
.\"
.TH UTF-8 7 2001-05-11 "GNU" "Linux Programmer's Manual"
.SH NAME
UTF-8 \- an ASCII compatible multibyte Unicode encoding
.SH DESCRIPTION
The
.B Unicode 3.0
character set occupies a 16-bit code space.
The most obvious
Unicode encoding (known as
.BR UCS-2 )
consists of a sequence of 16-bit words.
Such strings can contain,
as parts of many 16-bit characters, bytes
such as \(aq\\0\(aq or \(aq/\(aq, which have a
special meaning in filenames and other C library function arguments.
In addition, the majority of UNIX tools expect ASCII files and can't
read 16-bit words as characters without major modifications.
For these reasons,
.B UCS-2
is not a suitable external encoding of
.B Unicode
in filenames, text files, environment variables, etc.
The
.BR "ISO 10646 Universal Character Set (UCS)" ,
a superset of Unicode, occupies an even larger 31-bit code space, and the obvious
.B UCS-4
encoding for it (a sequence of 32-bit words) has the same problems.
.PP
The
.B UTF-8
encoding of
.B Unicode
and
.B UCS
does not have these problems and is the common way in which
.B Unicode
is used on UNIX-style operating systems.
.SS Properties
The
.B UTF-8
encoding has the following nice properties:
.TP 0.2i
*
.B UCS
characters 0x00000000 to 0x0000007f (the classic
.B US-ASCII
characters) are encoded simply as bytes 0x00 to 0x7f (ASCII
compatibility).
This means that files and strings which contain only
7-bit ASCII characters have the same encoding under both
.B ASCII
and
.BR UTF-8 .
.TP
*
All
.B UCS
characters greater than 0x7f are encoded as a multibyte sequence
consisting only of bytes in the range 0x80 to 0xfd, so no ASCII
byte can appear as part of another character and there are no
problems with, for example, \(aq\\0\(aq or \(aq/\(aq.
.TP
*
The lexicographic sorting order of
.B UCS-4
strings is preserved.
.TP
*
All possible 2^31 UCS codes can be encoded using
.BR UTF-8 .
.TP
*
The bytes 0xfe and 0xff are never used in the
.B UTF-8
encoding.
.TP
*
The first byte of a multibyte sequence which represents a single non-ASCII
.B UCS
character is always in the range 0xc0 to 0xfd and indicates how long
this multibyte sequence is.
All further bytes in a multibyte sequence
are in the range 0x80 to 0xbf.
This allows easy resynchronization and
makes the encoding stateless and robust against missing bytes.
.TP
*
.B UTF-8
encoded
.B UCS
characters may be up to six bytes long; however, the
.B Unicode
standard specifies no characters above 0x10ffff, so Unicode characters
can only be up to four bytes long in
.BR UTF-8 .
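.PP
The first-byte and continuation-byte ranges listed above can be
turned into code fairly directly.
The following is a minimal sketch, not a reference implementation;
the function names
.BR utf8_seq_len ()
and
.BR utf8_resync ()
are hypothetical and chosen only for illustration.
The first function reports only the length announced by a first byte
and does not check that the sequence is otherwise valid.
.PP
.RS
.nf
#include <stddef.h>

/* Return the sequence length announced by a first byte, or 0 if the
   byte cannot start a sequence (continuation byte, 0xfe, or 0xff). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (b < 0xc0) return 0;   /* 10xxxxxx: continuation byte */
    if (b < 0xe0) return 2;   /* 110xxxxx */
    if (b < 0xf0) return 3;   /* 1110xxxx */
    if (b < 0xf8) return 4;   /* 11110xxx */
    if (b < 0xfc) return 5;   /* 111110xx */
    if (b < 0xfe) return 6;   /* 1111110x */
    return 0;                 /* 0xfe and 0xff are never used */
}

/* Resynchronize: return the offset of the first byte at or after
   pos that is not a continuation byte (or len if there is none). */
size_t utf8_resync(const unsigned char *s, size_t len, size_t pos)
{
    while (pos < len && (s[pos] & 0xc0) == 0x80)
        pos++;
    return pos;
}
.fi
.RE
.PP
A reader that starts in the middle of a stream, or that has lost a
byte, could call
.BR utf8_resync ()
to skip to the next character boundary and continue from there.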
.SS Encoding
The following byte sequences are used to represent a character.
The sequence to be used depends on the UCS code number of the character:
.TP 0.4i
0x00000000 \- 0x0000007F:
.RI 0 xxxxxxx
.TP
0x00000080 \- 0x000007FF:
.RI 110 xxxxx
.RI 10 xxxxxx
.TP
0x00000800 \- 0x0000FFFF:
.RI 1110 xxxx
.RI 10 xxxxxx
.RI 10 xxxxxx
.TP
0x00010000 \- 0x001FFFFF:
.RI 11110 xxx
.RI 10 xxxxxx
.RI 10 xxxxxx
.RI 10 xxxxxx
.TP
0x00200000 \- 0x03FFFFFF:
.RI 111110 xx
.RI 10 xxxxxx
.RI 10 xxxxxx
.RI 10 xxxxxx
.RI 10 xxxxxx
.TP
0x04000000 \- 0x7FFFFFFF:
.RI 1111110 x
.RI 10 xxxxxx
.RI 10 xxxxxx
.RI 10 xxxxxx
.RI 10 xxxxxx
.RI 10 xxxxxx
.PP
The
.I xxx
bit positions are filled with the bits of the character code number in
binary representation.
Only the shortest possible multibyte sequence
which can represent the code number of the character can be used.
.PP
The
.B UCS
code values 0xd800\(en0xdfff (UTF-16 surrogates) as well as 0xfffe and
0xffff (UCS noncharacters) should not appear in conforming
.B UTF-8
streams.
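.PP
The table above can be sketched as a small encoder.
This is an illustration under stated assumptions rather than a
definitive implementation: the name
.BR utf8_encode ()
is hypothetical, the caller is assumed to supply a buffer of at
least six bytes, and the surrogate and noncharacter values named
above are simply refused.
.PP
.RS
.nf
#include <stddef.h>

/* Encode one UCS code number (0 to 0x7fffffff) into buf, which must
   have room for 6 bytes.  Returns the number of bytes written, or 0
   if the code number is out of range or should not appear in a
   conforming stream (surrogates, 0xfffe, 0xffff). */
size_t utf8_encode(unsigned long c, unsigned char *buf)
{
    size_t n, i;
    unsigned char lead;

    if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff)
        return 0;

    /* Picking the first range that fits yields the shortest form. */
    if (c < 0x80) { buf[0] = (unsigned char)c; return 1; }
    else if (c < 0x800)       { n = 2; lead = 0xc0; }
    else if (c < 0x10000)     { n = 3; lead = 0xe0; }
    else if (c < 0x200000)    { n = 4; lead = 0xf0; }
    else if (c < 0x4000000)   { n = 5; lead = 0xf8; }
    else if (c <= 0x7fffffff) { n = 6; lead = 0xfc; }
    else return 0;

    /* Fill the 10xxxxxx bytes from the end, six bits at a time. */
    for (i = n - 1; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (c & 0x3f));
        c >>= 6;
    }
    buf[0] = (unsigned char)(lead | c);   /* remaining high bits */
    return n;
}
.fi
.RE
.PP
Because the ranges are tested in ascending order, the sketch never
produces a longer sequence than necessary.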
.SS Example
The
.B Unicode
character 0xa9 = 1010 1001 (the copyright sign) is encoded
in UTF-8 as
.PP
.RS
11000010 10101001 = 0xc2 0xa9
.RE
.PP
and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is
encoded as:
.PP
.RS
11100010 10001001 10100000 = 0xe2 0x89 0xa0
.RE
.SS "Application Notes"
Users have to select a
.B UTF-8
locale, for example with
.PP
.RS
export LANG=en_GB.UTF-8
.RE
.PP
in order to activate the
.B UTF-8
support in applications.
.PP
Application software that has to be aware of the character
encoding in use should always set the locale with, for example,
.PP
.RS
setlocale(LC_CTYPE, "")
.RE
.PP
and programmers can then test the expression
.PP
.RS
strcmp(nl_langinfo(CODESET), "UTF-8") == 0
.RE
.PP
to determine whether a
.B UTF-8
locale has been selected and whether
therefore all plaintext standard input and output, terminal
communication, plaintext file content, filenames and environment
variables are encoded in
.BR UTF-8 .
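.PP
The two calls above can be combined into one helper.
The following is a minimal sketch; the name
.BR locale_is_utf8 ()
is hypothetical, and treating a failed
.BR setlocale (3)
call as "not UTF-8" is a design choice made here, not a requirement.
.PP
.RS
.nf
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Adopt the LC_CTYPE locale from the environment and report whether
   its character encoding is UTF-8.  Returns 1 if so, 0 otherwise. */
int locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;   /* the requested locale is not available */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
.fi
.RE
.PP
A program would typically call such a helper once at startup,
before processing any text.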
.PP
Programmers accustomed to single-byte encodings such as
.B US-ASCII
or
.B ISO 8859
have to be aware that two assumptions made so far are no longer valid
in
.B UTF-8
locales.
Firstly, a single byte does not necessarily correspond any
more to a single character.
Secondly, since modern terminal emulators
in
.B UTF-8
mode also support Chinese, Japanese, and Korean
.B double-width characters
as well as nonspacing
.BR "combining characters" ,
outputting a single character does not necessarily advance the cursor
by one position as it did in
.BR ASCII .
Library functions such as
.BR mbsrtowcs (3)
and
.BR wcswidth (3)
should be used today to count characters and cursor positions.
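.PP
As an illustration of counting characters and cursor positions with
these library functions, here is a minimal sketch.
The name
.BR count_chars_and_columns ()
and the fixed 256-element buffer are assumptions made for brevity;
the caller is assumed to have already set the locale with
.BR setlocale (3).
.PP
.RS
.nf
#define _XOPEN_SOURCE 700   /* for wcswidth() */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters and terminal columns of the multibyte string
   s, interpreted in the current LC_CTYPE locale.  Returns 0 on
   success, or -1 if s is invalid, does not fit the fixed buffer,
   or contains a nonprintable character. */
int count_chars_and_columns(const char *s, size_t *chars, int *columns)
{
    wchar_t wbuf[256];
    mbstate_t state;
    const char *p = s;
    size_t n;
    int w;

    memset(&state, 0, sizeof state);    /* initial conversion state */
    n = mbsrtowcs(wbuf, &p, 256, &state);
    if (n == (size_t)-1 || p != NULL)
        return -1;                      /* invalid or too long */
    w = wcswidth(wbuf, n);
    if (w < 0)
        return -1;                      /* nonprintable character */
    *chars = n;
    *columns = w;
    return 0;
}
.fi
.RE
.PP
In a
.B UTF-8
locale, the byte length reported by
.BR strlen (3),
the character count, and the column count of the same string can
all differ.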
.PP
The official ESC sequence to switch from an
.B ISO 2022
encoding scheme (as used for instance by VT100 terminals) to
.B UTF-8
is ESC % G
("\\x1b%G").
The corresponding return sequence from
.B UTF-8
to ISO 2022 is ESC % @ ("\\x1b%@").
Other ISO 2022 sequences (such as
those for switching the G0 and G1 sets) are not applicable in UTF-8 mode.
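.PP
A program could emit these two sequences as follows.
This is only a sketch: the names
.IR utf8_mode_on ,
.IR utf8_mode_off ,
and
.BR set_terminal_utf8 ()
are hypothetical, and whether a given terminal honors the sequences
is not something the code can verify.
.PP
.RS
.nf
#include <stdio.h>

/* ESC % G and ESC % @, spelled as byte values (ESC is 0x1b). */
static const char utf8_mode_on[]  = { 0x1b, '%', 'G', 0 };
static const char utf8_mode_off[] = { 0x1b, '%', '@', 0 };

/* Ask the terminal attached to out to enter (on != 0) or leave
   (on == 0) UTF-8 mode.  Returns 0 on success, EOF on error. */
int set_terminal_utf8(FILE *out, int on)
{
    if (fputs(on ? utf8_mode_on : utf8_mode_off, out) == EOF)
        return EOF;
    return fflush(out) == EOF ? EOF : 0;
}
.fi
.RE
.PP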
.PP
It can be hoped that in the foreseeable future,
.B UTF-8
will replace
.B ASCII
and
.B ISO 8859
at all levels as the common character encoding on POSIX systems,
leading to a significantly richer environment for handling plain text.
.SS Security
The
.BR Unicode " and " UCS
standards require producers of
.B UTF-8
to use the shortest form possible; for example, producing a two-byte
sequence with first byte 0xc0 is nonconforming.
.B Unicode 3.1
has added the requirement that conforming programs must not accept
non-shortest forms in their input.
This is for security reasons: if
user input is checked for possible security violations, a program
might check only for the
.B ASCII
version of "/../" or ";" or NUL and overlook that there are many
.RB non- ASCII
ways to represent these things in a non-shortest
.B UTF-8
encoding.
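.PP
One way to refuse non-shortest forms is to compare each decoded
value against the smallest code number that needs that many bytes.
The following is a minimal sketch under that assumption; the name
.BR utf8_decode_strict ()
is hypothetical, and the function checks only well-formedness and
shortest form, not the surrogate or noncharacter ranges.
.PP
.RS
.nf
#include <stddef.h>

/* Decode one character from s (len bytes available) into *out.
   Returns the number of bytes consumed, or 0 if the input is
   truncated, malformed, or not in shortest form. */
size_t utf8_decode_strict(const unsigned char *s, size_t len,
                          unsigned long *out)
{
    /* Smallest code number that needs n bytes, indexed by n. */
    static const unsigned long min_code[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    unsigned long c;
    size_t n, i;

    if (len == 0) return 0;
    if (s[0] < 0x80)      { *out = s[0]; return 1; }
    else if (s[0] < 0xc0) return 0;          /* stray continuation */
    else if (s[0] < 0xe0) { n = 2; c = s[0] & 0x1f; }
    else if (s[0] < 0xf0) { n = 3; c = s[0] & 0x0f; }
    else if (s[0] < 0xf8) { n = 4; c = s[0] & 0x07; }
    else if (s[0] < 0xfc) { n = 5; c = s[0] & 0x03; }
    else if (s[0] < 0xfe) { n = 6; c = s[0] & 0x01; }
    else return 0;                           /* 0xfe, 0xff */

    if (len < n) return 0;                   /* truncated */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xc0) != 0x80) return 0; /* not 10xxxxxx */
        c = (c << 6) | (s[i] & 0x3f);
    }
    if (c < min_code[n]) return 0;           /* non-shortest form */
    *out = c;
    return n;
}
.fi
.RE
.PP
With this check, the two-byte sequence 0xc0 0xaf, a non-shortest
spelling of "/", is rejected instead of being decoded to 0x2f.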
.SS Standards
ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 2279, Plan 9.
.\" .SH AUTHOR
.\" Markus Kuhn <mgk25@cl.cam.ac.uk>
.SH "SEE ALSO"
.BR nl_langinfo (3),
.BR setlocale (3),
.BR charsets (7),
.BR unicode (7)