<HTML
><HEAD
><TITLE
>Character Encoding</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.7"><LINK
REL="HOME"
TITLE="Secure Programming for Linux and Unix HOWTO"
HREF="index.html"><LINK
REL="UP"
TITLE="Validate All Input"
HREF="input.html"><LINK
REL="PREVIOUS"
TITLE="Human Language (Locale) Selection"
HREF="locale.html"><LINK
REL="NEXT"
TITLE="Prevent Cross-site Malicious Content on Input"
HREF="input-protection-cross-site.html"></HEAD
><BODY
CLASS="SECT1"
BGCOLOR="#FFFFFF"
TEXT="#000000"
LINK="#0000FF"
VLINK="#840084"
ALINK="#0000FF"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
>Secure Programming for Linux and Unix HOWTO</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="locale.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
>Chapter 5. Validate All Input</TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="input-protection-cross-site.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="SECT1"
><H1
CLASS="SECT1"
><A
NAME="CHARACTER-ENCODING"
></A
>5.9. Character Encoding</H1
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="CHARACTER-ENCODING-INTRO"
></A
>5.9.1. Introduction to Character Encoding</H2
><P
>For many years Americans have exchanged text using the ASCII character set;
since essentially all U.S. systems support ASCII,
this permits easy exchange of English text.
Unfortunately, ASCII is completely inadequate in handling the characters
of nearly all other languages.
For many years different countries have adopted different techniques for
exchanging text in different languages, making it difficult to exchange
data in an increasingly interconnected world.</P
><P
>More recently, ISO has developed ISO 10646,
the ``Universal Multiple-Octet Coded Character Set'' (UCS).
UCS is a coded character set which
defines a single 31-bit value for each of the world's characters.
The first 65536 characters of the UCS (which thus fit into 16 bits)
are termed the ``Basic Multilingual Plane'' (BMP),
and the BMP is intended to cover nearly all of today's spoken languages.
The Unicode Consortium develops the Unicode standard, which concentrates on
the UCS and adds some additional conventions to aid interoperability.
Historically, Unicode and ISO 10646 were developed by competing groups,
but thankfully they realized that they needed to work together and they now
coordinate with each other.</P
><P
>If you're writing new software that handles internationalized characters,
you should be using ISO 10646/Unicode as your basis for handling
international characters.
However, you may need to process older documents in various older
(language-specific) character sets, in which case, you need to ensure that
an untrusted user cannot control the setting of another document's
character set (since this would significantly affect the document's
interpretation).</P
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="CHARACTER-ENCODING-UTF8"
></A
>5.9.2. Introduction to UTF-8</H2
><P
>Most software is not designed to handle 16-bit or 32-bit characters,
yet a universal character set requires more than 8 bits per character.
Therefore, a special format called ``UTF-8'' was developed to encode these
potentially international
characters in a format more easily handled by existing programs and libraries.
UTF-8 is defined, among other places, in IETF RFC 2279, so it's a
well-defined standard that can be freely read and used.
UTF-8 is a variable-width encoding; characters numbered 0 to 0x7f (127)
encode to themselves as a single byte,
while characters with larger values are encoded into 2 to 6 bytes of
information (depending on their value).
The encoding has been specially designed to have the following
nice properties (this information is from the RFC and Linux utf-8 man page):

<P
></P
><UL
><LI
><P
>       The classical US ASCII characters (0 to 0x7f) encode as themselves,
       so files  and strings  which  contain only 7-bit ASCII characters
       have the same encoding under both ASCII and UTF-8.
       This is fabulous for backward compatibility with the many existing
       U.S. programs and data files.</P
></LI
><LI
><P
>       All UCS characters beyond 0x7f are encoded as a multibyte
       sequence consisting only of bytes in the range 0x80 to 0xfd.
       This means that no ASCII byte can appear as part of another
       character. Many other encodings allow bytes such as an
       embedded NUL (0x00) inside a multibyte character, causing programs to fail.</P
></LI
><LI
><P
>       It's easy to convert between UTF-8 and the 2-byte and 4-byte
       fixed-width representations of characters (these are called
       UCS-2 and UCS-4 respectively).</P
></LI
><LI
><P
>       The lexicographic sorting order of UCS-4 strings is preserved,
       and the Boyer-Moore fast search algorithm can be used directly
       with UTF-8 data.</P
></LI
><LI
><P
>       All  possible 2^31 UCS codes can be encoded using UTF-8.</P
></LI
><LI
><P
>       The  first byte of a multibyte sequence which represents
       a single non-ASCII UCS character is always in the  range
       0xc0  to  0xfd  and  indicates  how  long this multibyte
       sequence is. All further bytes in a  multibyte  sequence
       are  in  the range 0x80 to 0xbf. This allows easy resynchronization;
       if a byte is missing, it's easy to skip forward to the ``next''
       character, and it's always easy to skip forward and back to the
       ``next'' or ``preceding'' character.</P
></LI
></UL
></P
><P
>In short, the UTF-8 transformation format is becoming a dominant method
for exchanging international text information because it can support all of the
world's languages, yet it is backward compatible with U.S. ASCII files
as well as having other nice properties.
For many purposes I recommend its use, particularly when storing data
in a ``text'' file.</P
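><P
>To make the byte layout concrete, here is a minimal sketch in Python of
an encoder for a single character.
The function name utf8_encode is hypothetical, and the sketch is
deliberately limited to sequences of at most four bytes (values up to 0x10FFFF)
rather than the full 2 to 6 byte range described above; it also makes no
attempt to special-case any reserved code points.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
def utf8_encode(code_point):
    """Encode one UCS code point as UTF-8 bytes (shortest form, at most 4 bytes)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range for this sketch")
    if code_point <= 0x7F:
        # ASCII characters encode to themselves as a single byte.
        return bytes([code_point])
    if code_point <= 0x7FF:
        n_trailing, lead = 1, 0xC0      # 110xxxxx
    elif code_point <= 0xFFFF:
        n_trailing, lead = 2, 0xE0      # 1110xxxx
    else:
        n_trailing, lead = 3, 0xF0      # 11110xxx
    # Each trailing byte carries 6 bits and always has the form 10xxxxxx.
    trailing = []
    for _ in range(n_trailing):
        trailing.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # The remaining high bits go into the first byte, whose prefix gives the length.
    return bytes([lead | code_point] + trailing[::-1])
```
</PRE
><P
>With this sketch, the letter ``A'' (0x41) encodes to the single byte 41,
while U+00E9 encodes to the two bytes C3 A9; every byte of the
second result is above 0x7f, as the properties listed above require.</P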
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="UTF8-SECURITY-ISSUES"
></A
>5.9.3. UTF-8 Security Issues</H2
><P
>The reason to mention UTF-8 is that
some byte sequences are not legal UTF-8, and
this might be an exploitable security hole.
UTF-8 encoders are supposed to use the ``shortest possible''
encoding, but naive decoders may accept encodings that are longer than
necessary.
Indeed, earlier standards permitted decoders to accept
``non-shortest form'' encodings.
The problem here is that this means that potentially dangerous
input could be represented multiple ways, and thus might
defeat the security routines checking for dangerous inputs.
The RFC describes the problem this way:

<A
NAME="AEN856"
></A
><BLOCKQUOTE
CLASS="BLOCKQUOTE"
><P
>Implementers of UTF-8 need to consider the security aspects of how
they handle illegal UTF-8 sequences.  It is conceivable that in some
circumstances an attacker would be able to exploit an incautious
UTF-8 parser by sending it an octet sequence that is not permitted by
the UTF-8 syntax.</P
><P
>A particularly subtle form of this attack could be carried out
against a parser which performs security-critical validity checks
against the UTF-8 encoded form of its input, but interprets certain
illegal octet sequences as characters.  For example, a parser might
prohibit the NUL character when encoded as the single-octet sequence
00, but allow the illegal two-octet sequence C0 80 (illegal because
it's longer than necessary) and interpret it
as a NUL character (00).  Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F.</P
></BLOCKQUOTE
></P
><P
>A longer discussion about this is available at
Markus Kuhn's
<EM
>UTF-8 and Unicode FAQ for Unix/Linux</EM
> at
<A
HREF="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
TARGET="_top"
>http://www.cl.cam.ac.uk/~mgk25/unicode.html</A
>.</P
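><P
>The RFC's second example can be reproduced with a few lines of Python.
This is only an illustrative sketch: naive_decode and strict_decode_fails
are hypothetical names, and the naive decoder is deliberately wrong
in the way the RFC warns about (it accepts any two-byte pattern without
checking that it is the shortest form).</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
def naive_decode(data):
    """Deliberately unsafe: accepts any 110xxxxx 10xxxxxx pair, even an overlong one."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if (b & 0xE0) == 0xC0 and i + 1 < len(data):
            # Two-byte form: 5 bits from the first byte, 6 from the second.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)


def strict_decode_fails(data):
    """True if Python's built-in strict UTF-8 decoder rejects the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


attack = bytes([0x2F, 0xC0, 0xAE, 0x2E, 0x2F])   # octets from the RFC example
filter_passes = b"/../" not in attack            # a byte-level check sees nothing
decoded = naive_decode(attack)                   # the naive decoder yields "/../"
```
</PRE
><P
>Here the byte-level check passes, yet the naive decoder produces the
forbidden string; a strict decoder refuses the same input.</P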
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="UTF8-LEGAL-VALUES"
></A
>5.9.4. UTF-8 Legal Values</H2
><P
>Thus, when accepting UTF-8 input, you need to check if the input is
valid UTF-8.
Here is a list of all legal UTF-8 sequences; any character
sequence not matching this table is not a legal UTF-8 sequence.
In the following table, the first column shows the various character
values being encoded into UTF-8.
The second column shows how those characters are encoded as binary values;
an ``x'' indicates where the data is placed (either a 0 or 1), though
some values should not be allowed because they're not the shortest possible
encoding.
The last column shows the valid values each byte can have
(in hexadecimal).
Thus, a program should check that every character meets one of the patterns
in the right-hand column.
A ``-'' indicates a range of legal values (inclusive).
Of course, just because a sequence is a legal UTF-8 sequence doesn't
mean that you should accept it (you still need to do all your other
checking), but generally you should check any UTF-8 data for UTF-8 legality
before performing other checks.
<DIV
CLASS="TABLE"
><A
NAME="AEN865"
></A
><P
><B
>Table 5-1. Legal UTF-8 Sequences</B
></P
><TABLE
BORDER="1"
CLASS="CALSTABLE"
><THEAD
><TR
><TH
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>UCS Code (Hex)</TH
><TH
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>Binary UTF-8 Format</TH
><TH
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>Legal UTF-8 Values (Hex)</TH
></TR
></THEAD
><TBODY
><TR
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>00-7F</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>0xxxxxxx</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>00-7F</TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>80-7FF</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>110xxxxx 10xxxxxx</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>C2-DF 80-BF</TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>800-FFF</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>1110xxxx 10xxxxxx 10xxxxxx</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>E0 A0*-BF 80-BF</TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>1000-FFFF</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>1110xxxx 10xxxxxx 10xxxxxx</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>E1-EF 80-BF 80-BF</TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>10000-3FFFF</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>F0 90*-BF 80-BF 80-BF</TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>40000-FFFFF</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>F1-F3 80-BF 80-BF 80-BF</TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>100000-10FFFF</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>F4 80-8F* 80-BF 80-BF</TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>200000-3FFFFFF</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>too large; see below</TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>04000000-7FFFFFFF</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</TD
><TD
WIDTH="33%"
ALIGN="LEFT"
VALIGN="TOP"
>too large; see below</TD
></TR
></TBODY
></TABLE
></DIV
></P
><P
>As I noted earlier, there are two standards for character sets,
ISO 10646 and Unicode, which have agreed to synchronize their
character assignments.
The definitions of UTF-8 in ISO/IEC 10646-1:2000 and the IETF RFC
also currently support
five and six byte sequences to encode characters outside the range
supported by the Unicode Consortium's Unicode standard, but such values can't be used to
support Unicode characters and it's expected that a future version of
ISO 10646 will have the same limits.
Thus, for most purposes the five and six byte UTF-8 encodings aren't legal,
and you should normally reject them (unless you have a special purpose
for them).</P
><P
>This set of valid values is tricky to determine, and in fact
earlier versions of this document got some entries
wrong (in some cases they permitted overlong characters).
Language developers should include a function in their libraries
to check for valid UTF-8 values, just because it's so hard to get right.</P
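><P
>As a sketch of what such a checking function might look like, the
following Python transcribes the right-hand column of Table 5-1 directly.
The names LEGAL_PATTERNS and is_legal_utf8 are hypothetical.
The sketch follows the table literally (one to four bytes, no five or
six byte forms) and adds no restrictions beyond it, so in particular
it does not exclude the surrogate range that later standards forbid.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
# Each entry: the allowed (low, high) range for every byte of one table row.
LEGAL_PATTERNS = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def is_legal_utf8(data):
    """Return True if data is a sequence of characters that each match a table row."""
    i = 0
    while i < len(data):
        for pattern in LEGAL_PATTERNS:
            chunk = data[i:i + len(pattern)]
            if len(chunk) == len(pattern) and all(
                lo <= b <= hi for b, (lo, hi) in zip(chunk, pattern)
            ):
                i += len(pattern)   # consume one whole character
                break
        else:
            return False            # no row matched: illegal, overlong, or truncated
    return True
```
</PRE
><P
>Called on the overlong octets C0 80, or on a five-byte sequence beginning
with F8, this function returns False; called on plain ASCII text it
returns True.</P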
><P
>I should note that in some cases, you might want to tolerate (or use
internally) the hexadecimal sequence C0 80.  This is an overlong sequence
that, if permitted, can represent ASCII NUL.  Since C and C++
have trouble including a NUL character in an ordinary string,
some people have taken
to using this sequence when they want to represent NUL as part of the
data stream; Java even enshrines the practice.
Feel free to use C0 80 internally while processing data, but technically
you really should translate this back to 00 before saving the data.
Depending on your needs, you might decide to be ``sloppy'' and accept
C0 80 as input in a UTF-8 data stream.
If it doesn't harm security, it's probably a good practice to accept this
sequence since accepting it aids interoperability.</P
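><P
>If you do decide to accept C0 80, one way to do so is to translate it
to 00 first and then apply a strict check to everything else.
The following Python is a minimal sketch under that assumption;
decode_allowing_c080 is a hypothetical name, and the strictness comes from
Python's built-in UTF-8 decoder.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
def decode_allowing_c080(data):
    """Translate the overlong NUL (C0 80) to 00, then decode strictly."""
    # C0 is never a legal byte in strict UTF-8, so this replacement cannot
    # alter any legal sequence; every other use of C0 is still rejected below.
    normalized = data.replace(b"\xc0\x80", b"\x00")
    return normalized.decode("utf-8")   # raises UnicodeDecodeError if illegal
```
</PRE
><P
>Any security checks (for example, a check that prohibits NUL) should then
be run on the returned text, not on the original bytes.</P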
><P
>Handling this can be tricky.
You might want to examine the C routines developed by Unicode to
handle conversions, available at
<A
HREF="ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c"
TARGET="_top"
>ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c</A
>.
It's unclear to me if these routines are open source software (the
licenses don't clearly say whether or not they can be modified), so
beware of that.</P
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="UTF8-RELATED-ISSUES"
></A
>5.9.5. UTF-8 Related Issues</H2
><P
>This section has discussed UTF-8, because it's the most popular
multibyte encoding of UCS, simplifying a lot of international text
handling issues.
However, it's certainly not the only encoding; there are other encodings,
such as UTF-16 and UTF-7, which have the same kinds of issues and
must be validated for the same reasons.</P
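><P
>As a small illustration of why the other encodings matter, the following
Python sketch shows a UTF-7 byte string that contains no angle-bracket
bytes at all, yet decodes to text containing them.
The helper has_angle_bracket is a hypothetical, deliberately simplistic
filter; the decoding is done by Python's built-in ``utf-7'' codec.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
LT, GT = chr(0x3C), chr(0x3E)    # the two angle-bracket characters


def has_angle_bracket(text):
    """A simplistic filter that looks for markup delimiters."""
    return LT in text or GT in text


raw = b"+ADw-script+AD4-"        # pure ASCII, no angle-bracket bytes

seen_as_ascii = has_angle_bracket(raw.decode("ascii"))   # False
seen_as_utf7 = has_angle_bracket(raw.decode("utf-7"))    # True
```
</PRE
><P
>A filter applied before decoding, or applied under the wrong assumption
about the encoding, sees nothing suspicious here; the check has to be made
on the decoded text.</P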
><P
>Another issue is that some phrases can be expressed in more than one
way in ISO 10646/Unicode.
For example, some accented characters can be represented as a single
character (with the accent) and also as a sequence of characters
(e.g., the base character plus a separate combining accent).
These two forms may appear identical.
There's also a zero-width space that could be inserted, with the
result that apparently-similar items are considered different.
Beware of situations where such hidden text could interfere with the program.
This is an issue that in general is hard to solve; most programs don't
have such tight control over the clients that they know completely how
a particular sequence will be displayed (since this depends on the
client's font, display characteristics, locale, and so on).</P
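><P
>For comparisons made inside the program (as opposed to what a client
displays), one partial mitigation is to reduce strings to a common form
before comparing them.
The Python sketch below is an assumption of mine, not a complete answer:
it uses the standard unicodedata module's NFC normalization and strips only
the zero-width space (U+200B), while other invisible characters exist.
The name canonical_key is hypothetical.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
import unicodedata

ZERO_WIDTH_SPACE = "\u200b"


def canonical_key(text):
    """Comparison key: drop zero-width spaces, then normalize to NFC."""
    return unicodedata.normalize("NFC", text.replace(ZERO_WIDTH_SPACE, ""))


composed = "caf\u00e9"        # e-acute as a single character
decomposed = "cafe\u0301"     # "e" followed by a combining acute accent
hidden = "ca\u200bf\u00e9"    # same word with a zero-width space inserted

raw_equal = composed == decomposed                                  # False
keys_equal = canonical_key(composed) == canonical_key(decomposed)   # True
```
</PRE
><P
>The three strings above compare as different when taken as-is, but share
the same key once reduced this way.</P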
></DIV
></DIV
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="locale.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="input-protection-cross-site.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
>Human Language (Locale) Selection</TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="input.html"
ACCESSKEY="U"
>Up</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
>Prevent Cross-site Malicious Content on Input</TD
></TR
></TABLE
></DIV
></BODY
></HTML
>