<HTML
|
|
><HEAD
|
|
><TITLE
|
|
>Filter HTML/URIs That May Be Re-presented</TITLE
|
|
><META
|
|
NAME="GENERATOR"
|
|
CONTENT="Modular DocBook HTML Stylesheet Version 1.7"><LINK
|
|
REL="HOME"
|
|
TITLE="Secure Programming for Linux and Unix HOWTO"
|
|
HREF="index.html"><LINK
|
|
REL="UP"
|
|
TITLE="Validate All Input"
|
|
HREF="input.html"><LINK
|
|
REL="PREVIOUS"
|
|
TITLE="Prevent Cross-site Malicious Content on Input"
|
|
HREF="input-protection-cross-site.html"><LINK
|
|
REL="NEXT"
|
|
TITLE="Forbid HTTP GET To Perform Non-Queries"
|
|
HREF="avoid-get-non-queries.html"></HEAD
|
|
><BODY
|
|
CLASS="SECT1"
|
|
BGCOLOR="#FFFFFF"
|
|
TEXT="#000000"
|
|
LINK="#0000FF"
|
|
VLINK="#840084"
|
|
ALINK="#0000FF"
|
|
><DIV
|
|
CLASS="NAVHEADER"
|
|
><TABLE
|
|
SUMMARY="Header navigation table"
|
|
WIDTH="100%"
|
|
BORDER="0"
|
|
CELLPADDING="0"
|
|
CELLSPACING="0"
|
|
><TR
|
|
><TH
|
|
COLSPAN="3"
|
|
ALIGN="center"
|
|
>Secure Programming for Linux and Unix HOWTO</TH
|
|
></TR
|
|
><TR
|
|
><TD
|
|
WIDTH="10%"
|
|
ALIGN="left"
|
|
VALIGN="bottom"
|
|
><A
|
|
HREF="input-protection-cross-site.html"
|
|
ACCESSKEY="P"
|
|
>Prev</A
|
|
></TD
|
|
><TD
|
|
WIDTH="80%"
|
|
ALIGN="center"
|
|
VALIGN="bottom"
|
|
>Chapter 5. Validate All Input</TD
|
|
><TD
|
|
WIDTH="10%"
|
|
ALIGN="right"
|
|
VALIGN="bottom"
|
|
><A
|
|
HREF="avoid-get-non-queries.html"
|
|
ACCESSKEY="N"
|
|
>Next</A
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
><HR
|
|
ALIGN="LEFT"
|
|
WIDTH="100%"></DIV
|
|
><DIV
|
|
CLASS="SECT1"
|
|
><H1
|
|
CLASS="SECT1"
|
|
><A
|
|
NAME="FILTER-HTML"
|
|
></A
|
|
>5.11. Filter HTML/URIs That May Be Re-presented</H1
|
|
><P
|
|
>One special case where cross-site malicious content must be
prevented is the web application
that is designed to accept HTML or XHTML from one user and then send it on
to other users
(see <A
HREF="cross-site-malicious-content.html"
>Section 7.15</A
> for
more information on cross-site malicious content).
The following subsections discuss filtering this specific kind of input,
since handling it is such a common requirement.</P
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="REMOVE-HTML-TAGS"
|
|
></A
|
|
>5.11.1. Remove or Forbid Some HTML Data</H2
|
|
><P
|
|
>It's safest to remove all possible (X)HTML tags so they cannot affect anything,
and this is relatively easy to do.
As noted above, you should already be identifying the list of legal
characters, and rejecting or removing those characters that aren't
in the list.
In this filter, simply don't include the following characters in
the list of legal characters: ``<'', ``>'', and ``&'' (and, if
the data will be used in attributes, the double-quote character ``"'').
If browsers only operated according to the HTML specifications, the ``>''
wouldn't need to be removed, but in practice it must be removed.
This is because some browsers assume that the author of the page
really meant to put in an opening ``<'' and ``helpfully'' insert one;
attackers can exploit this behavior and use the ``>'' to create an
undesired ``<''.</P
|
|
><P
|
|
>Usually the character set for transmitting HTML is
|
|
ISO-8859-1 (even when sending international text),
|
|
so the filter should also omit most control characters (linefeed and
|
|
tab are usually okay) and characters with their high-order bit set.</P
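><P
>As a minimal sketch of such a filter, assuming Python and a hypothetical
allowlist of printable ASCII plus linefeed and tab (the names LEGAL,
strip_illegal, and find_illegal are illustrative only, not part of any library):</P
>
```python
import string

# Characters that must never be on the allowlist, per the text above.
FORBIDDEN = set('<>&"')

# Assumed allowlist: printable ASCII, keeping space, tab, and linefeed but
# dropping the other control-like whitespace and the forbidden characters.
LEGAL = set(string.printable) - set("\r\x0b\x0c") - FORBIDDEN


def strip_illegal(text):
    """Silently drop every character that is not on the allowlist."""
    return "".join(ch for ch in text if ch in LEGAL)


def find_illegal(text):
    """Return the offending characters so the caller can reject the input."""
    return sorted({ch for ch in text if ch not in LEGAL})


print(strip_illegal('<b>Hi & "bye"</b>\n'))
print(find_illegal("a<b>\x00"))
```
<P
>The first function removes characters quietly; the second lets the caller
report exactly which characters were refused.</P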
|
|
><P
|
|
>One problem with this approach is that it can really surprise users,
especially those entering international text, if that
text is quietly removed.
|
|
If the invalid characters are quietly removed without warning,
|
|
that data will be irrevocably lost and cannot be reconstructed later.
|
|
One alternative is forbidding such characters and sending error messages
|
|
back to users who attempt to use them.
|
|
This at least warns users, but doesn't give them the functionality
|
|
they were looking for.
|
|
Other alternatives are encoding this data or validating this data,
|
|
which are discussed next.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="ENCODING-HTML-TAGS"
|
|
></A
|
|
>5.11.2. Encoding HTML Data</H2
|
|
><P
|
|
>An alternative that is nearly as safe
|
|
is to transform the critical characters so they won't
|
|
have their usual meaning in HTML.
|
|
This can be done by translating all "<" into "&lt;",
|
|
">" into "&gt;", and "&" into "&amp;".
|
|
Arbitrary international characters can be encoded in Latin-1
|
|
using the format "&#value;" - do not forget the ending semicolon.
|
|
Encoding the international characters means you must know what the
|
|
input encoding was, of course.</P
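><P
>A minimal sketch of this encoding step, assuming Python's standard html module
(the helper names encode_for_html and encode_bytes_for_html are hypothetical):</P
>
```python
import html


def encode_for_html(text):
    """Encode text so it loses its special meaning in HTML.

    html.escape handles "&", "<", ">" and, with quote=True, the quote
    characters. Anything outside ASCII is then written as a numeric
    character reference of the form &#value; (with the ending semicolon).
    """
    escaped = html.escape(text, quote=True)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def encode_bytes_for_html(raw, input_encoding):
    """Decode raw bytes first; the input encoding has to be known."""
    return encode_for_html(raw.decode(input_encoding))


print(encode_for_html('<b>caf\u00e9 & "tea"</b>'))
print(encode_bytes_for_html(b"caf\xe9", "iso-8859-1"))
```
<P
>Decoding the bytes before encoding is what makes the numeric values correct,
which is why the input encoding must be known.</P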
|
|
><P
|
|
>One possible danger here is that if these encodings are accidentally
|
|
interpreted twice, they will become a vulnerability.
|
|
However, this approach at least permits later users to see the
|
|
"intent" of the input.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="VALIDATING-HTML-TAGS"
|
|
></A
|
|
>5.11.3. Validating HTML Data</H2
|
|
><P
|
|
>Some applications, to work at all, must accept HTML from third parties
and send it on to their users.
|
|
Beware - you are treading dangerous ground at this point; be sure
|
|
that you really want to do this.
|
|
Even the idea of accepting HTML from arbitrary places
|
|
is controversial among some security practitioners, because it is extremely
|
|
difficult to get it right.</P
|
|
><P
|
|
>However, if your application must accept HTML, and you believe
|
|
that it's worth the risk, at least identify a list
|
|
of ``safe'' HTML commands and only permit those commands.</P
|
|
><P
|
|
>Here is a minimal set of safe HTML tags
|
|
that might be useful for applications (such as guestbooks)
|
|
that support short comments:
|
|
<p> (paragraph),
|
|
<b> (bold),
|
|
<i> (italics),
|
|
<em> (emphasis),
|
|
<strong> (strong emphasis),
|
|
<pre> (preformatted text),
|
|
<br> (forced line break - note it doesn't require a closing tag),
|
|
as well as all their ending tags.</P
|
|
><P
|
|
>Not only do you need to ensure that only a small set
|
|
of ``safe'' HTML commands are accepted, you also need to ensure
|
|
that they are properly nested and closed
|
|
(i.e., that the HTML commands are ``balanced'').
|
|
In XML, this is termed ``well-formed'' data.
|
|
A few exceptions could be made if you're accepting standard HTML
|
|
(e.g., supporting an implied </p> where not provided before a
|
|
<p> would be fine), but trying to accept HTML in its full
|
|
generality (which can infer balancing closing tags in many cases)
|
|
is not needed for most applications.
|
|
Indeed, if you're trying to stick to XHTML (instead of HTML), then
|
|
well-formedness is a requirement.
|
|
Also, HTML tags are case-insensitive; tags can be upper case,
|
|
lower case, or a mixture.
|
|
However, if you intend to accept XHTML
|
|
then you need to require all tags to be in lower case
|
|
(XML is case-sensitive; XHTML uses XML and requires the tags to be
|
|
in lower case).</P
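><P
>The tag list and the balancing rule above can be sketched as follows, assuming
Python. The function name is_safe_fragment is hypothetical, and this sketch is
deliberately strict: it accepts no attributes and no implied closing tags.
Text between tags would still need to be encoded separately.</P
>
```python
import re

ALLOWED = {"p", "b", "i", "em", "strong", "pre", "br"}
NO_CLOSING_TAG = {"br"}

# A tag is accepted only in the bare forms <name> or </name>, so any tag
# carrying attributes (or any other odd syntax) fails to match.
TAG = re.compile(r"<(/?)([A-Za-z]+)>")


def has_stray_markup(text):
    return "<" in text or ">" in text


def is_safe_fragment(fragment):
    """True if the fragment uses only allowed, properly nested tags."""
    open_tags = []
    pos = 0
    for match in TAG.finditer(fragment):
        if has_stray_markup(fragment[pos:match.start()]):
            return False
        pos = match.end()
        closing = match.group(1) == "/"
        name = match.group(2).lower()  # HTML tag names are case-insensitive
        if name not in ALLOWED:
            return False
        if name in NO_CLOSING_TAG:
            if closing:
                return False
            continue
        if closing:
            if not open_tags or open_tags.pop() != name:
                return False
        else:
            open_tags.append(name)
    if has_stray_markup(fragment[pos:]):
        return False
    return not open_tags  # everything opened must have been closed


print(is_safe_fragment("<p>Hello <b>world</b><br></p>"))
print(is_safe_fragment("<p><b>crossed</p></b>"))
```
<P
>For XHTML, dropping the lower() call and listing only lower-case names
would enforce the lower-case requirement.</P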
|
|
><P
|
|
>Here are a few random tips about doing this.
|
|
Usually you should design whatever surrounds the HTML text and the
|
|
set of permitted tags so that the contributed text cannot be misinterpreted
|
|
as text from the ``main'' site (to prevent forgeries).
|
|
Don't accept any attributes unless you've checked the attribute type and
|
|
its value; there are many attributes that support things such as
|
|
Javascript that can cause trouble for your users.
|
|
You'll notice that in the above list I didn't include any attributes at all,
|
|
which is certainly the safest course.
|
|
You should probably give a warning message if an unsafe tag is used,
|
|
but if that's not practical, encoding the critical characters
|
|
(e.g., "<" becomes "&lt;") prevents data loss while
|
|
simultaneously keeping the users safe.</P
|
|
><P
|
|
>Be careful when expanding this set, and in general be restrictive of
|
|
what you accept.
|
|
If your patterns are too generous, the browser may interpret the
|
|
sequences differently than you expect, resulting in a potential
|
|
exploit.
|
|
For example, FozZy posted on Bugtraq (1 April 2002)
|
|
some sequences that permitted
|
|
exploitation in various web-based mail systems,
|
|
which may give you an idea of the kinds of problems you need to defend
|
|
against.
|
|
Here's some exploit text that, at one time, could
|
|
subvert user accounts in Microsoft Hotmail:
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> <SCRIPT>
 </COMMENT>
 <!-- --> --></PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
Here's some similar exploit text for Yahoo! Mail:
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> <_a<script>
 <<script> (Note: this was found by BugSan)</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
Here's some exploit text for Vizzavi:
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> <b onmousover="...">go here</b>
 <img [line_break] src="javascript:alert(document.location)"></PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
|
|
Andrew Clover posted to Bugtraq (on May 11, 2002) a list of various
|
|
text that invokes Javascript yet manages to bypass many filters.
|
|
Here are his examples (which he says he cut and pasted from elsewhere);
|
|
some only apply to specific browsers
|
|
(IE means Internet Explorer, N4 means Netscape version 4).
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> <a href="javas&#99;ript&#35;[code]">
 <div onmouseover="[code]">
 <img src="javascript:[code]">
 <img dynsrc="javascript:[code]"> [IE]
 <input type="image" dynsrc="javascript:[code]"> [IE]
 <bgsound src="javascript:[code]"> [IE]
 &<script>[code]</script>
 &{[code]}; [N4]
 <img src=&{[code]};> [N4]
 <link rel="stylesheet" href="javascript:[code]">
 <iframe src="vbscript:[code]"> [IE]
 <img src="mocha:[code]"> [N4]
 <img src="livescript:[code]"> [N4]
 <a href="about:<s&#99;ript>[code]</script>">
 <meta http-equiv="refresh" content="0;url=javascript:[code]">
 <body onload="[code]">
 <div style="background-image: url(javascript:[code]);">
 <div style="behaviour: url([link to code]);"> [IE]
 <div style="binding: url([link to code]);"> [Mozilla]
 <div style="width: expression([code]);"> [IE]
 <style type="text/javascript">[code]</style> [N4]
 <object classid="clsid:..." codebase="javascript:[code]"> [IE]
 <style><!--</style><script>[code]//--></script>
 <!-- -- --><script>[code]</script><!-- -- -->
 <<script>[code]</script>
 <img src="blah"onmouseover="[code]">
 <img src="blah>" onmouseover="[code]">
 <xml src="javascript:[code]">
 <xml id="X"><a><b>&lt;script>[code]&lt;/script>;</b></a></xml>
 <div datafld="b" dataformatas="html" datasrc="#X"></div>
 [\xC0][\xBC]script>[code][\xC0][\xBC]/script> [UTF-8; IE, Opera]
 <![CDATA[<!--]] ><script>[code]//--></script> </PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
This is not a complete list, of course, but it at least is a sample
|
|
of the kinds of attacks that you must prevent by strictly limiting the
|
|
tags and attributes you can allow from untrusted users.</P
|
|
><P
|
|
>Konstantin Riabitsev has posted
|
|
<A
|
|
HREF="http://www.mricon.com/html/phpfilter.html"
|
|
TARGET="_top"
|
|
>some PHP code to filter HTML</A
|
|
> (GPL);
|
|
I've not examined it closely, but you might want to take a look.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="VALIDATING-URIS"
|
|
></A
|
|
>5.11.4. Validating Hypertext Links (URIs/URLs)</H2
|
|
><P
|
|
>Careful readers will notice that I did not include the hypertext link tag
|
|
<a> as a safe tag in HTML.
|
|
Clearly, you could add
|
|
<a href="safe URI"> (hypertext link) to the safe list
|
|
(not permitting any other attributes unless you've checked their
|
|
contents).
|
|
If your application requires it, then do so.
|
|
However, permitting third parties to create links
|
|
is much less safe, because defining a ``safe URI''<A
|
|
NAME="AEN961"
|
|
HREF="#FTN.AEN961"
|
|
><SPAN
|
|
CLASS="footnote"
|
|
>[1]</SPAN
|
|
></A
|
|
>
|
|
turns out to be very difficult.
|
|
Many browsers accept
|
|
all sorts of URIs which may be dangerous to the user.
|
|
This section discusses how to validate URIs from third parties for
|
|
re-presenting to others, including URIs incorporated into HTML.</P
|
|
><P
|
|
>First, let's look briefly at URI syntax (as defined by various specifications).
|
|
URIs can be either ``absolute'' or ``relative''.
|
|
The syntax of an absolute URI looks like this:
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
>scheme://authority[path][?query][#fragment]</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
A URI starts with a scheme name (such as ``http''), the characters ``://'',
|
|
the authority (such as ``www.dwheeler.com''), a path
|
|
(which looks like a directory or file name), a question mark followed by
|
|
a query, and a hash (``#'') followed by a fragment identifier.
|
|
The square brackets surround optional portions - e.g., many URIs don't
|
|
actually include the query or fragment.
|
|
Some schemes may not permit some of the data (e.g., paths, queries, or
|
|
fragments), and many schemes have additional requirements unique to them.
|
|
Many schemes permit the ``authority'' field to identify
|
|
optional usernames, passwords, and ports, using this syntax for the
|
|
``authority'' section:
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> [username[:password]@]host[:portnumber]</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
The ``host'' can either be a name (``www.dwheeler.com'') or an IPv4
|
|
numeric address (127.0.0.1).
|
|
A ``relative'' URI references one object relative to the ``current'' one,
|
|
and its syntax looks a lot like a filename:
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
>path[?query][#fragment]</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
There are a limited number of characters permitted in most of the URI,
|
|
so to get around this problem, other 8-bit characters may be ``URL encoded''
|
|
as %hh (where hh is the hexadecimal value of the 8-bit character).
|
|
For more detailed information on valid URIs, see IETF RFC 2396 and its
|
|
related specifications.</P
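><P
>To make the pieces concrete, here is a small sketch that splits a URI into the
parts described above, assuming Python's standard urllib.parse module.
The example URI, including its username and password, is made up for illustration.</P
>
```python
from urllib.parse import quote, unquote, urlsplit

# Absolute URI: scheme://authority[path][?query][#fragment]
parts = urlsplit(
    "http://alice:secret@www.dwheeler.com:8080/docs/index.html?topic=uri#syntax"
)
print(parts.scheme)    # scheme name
print(parts.username)  # optional username inside the authority
print(parts.password)  # optional password inside the authority
print(parts.hostname)  # host
print(parts.port)      # optional port number, as an integer
print(parts.path)
print(parts.query)
print(parts.fragment)

# Relative URI: path[?query][#fragment]; scheme and authority are empty.
relative = urlsplit("docs/index.html?topic=uri#syntax")
print(repr(relative.scheme), repr(relative.netloc), relative.path)

# "URL encoding" writes an 8-bit character as %hh.
encoded = quote("caf\u00e9", encoding="iso-8859-1")
print(encoded)
print(unquote(encoded, encoding="iso-8859-1"))
```
<P
>A validator can then apply a separate rule to each part instead of
treating the URI as one opaque string.</P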
|
|
><P
|
|
>Now that we've looked at the syntax of URIs, let's examine the risks
|
|
of each part:
|
|
<P
|
|
></P
|
|
><UL
|
|
><LI
|
|
><P
|
|
>Scheme:
|
|
Many schemes are downright dangerous.
|
|
Permitting someone to insert a ``javascript'' scheme into your material
|
|
would allow them to trivially mount denial-of-service attacks
|
|
(e.g., by repeatedly creating windows so the user's machine freezes or
|
|
becomes unusable).
|
|
More seriously, they might be able to exploit a known vulnerability in
|
|
the javascript implementation.
|
|
Some schemes can be a nuisance, such as ``mailto:'' when a mailing
|
|
is not expected, and some schemes may not be sufficiently secure
|
|
on the client machine.
|
|
Thus, it's necessary to limit the set of allowed schemes to
|
|
just a few safe schemes.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>Authority:
|
|
Ideally, you should limit user links to ``safe'' sites, but this is
|
|
difficult to do in practice.
|
|
However, you can certainly do something about usernames, passwords,
|
|
and port numbers: you should forbid them.
|
|
Systems expecting usernames (especially with passwords!) are probably
|
|
guarding more important material;
|
|
rarely is this needed in publicly-posted URIs, and someone could try
|
|
to use this functionality to convince users
|
|
to expose information they have access to and/or
|
|
use it to modify the information.
|
|
Such URIs permit semantic attacks; see
|
|
<A
|
|
HREF="semantic-attacks.html"
|
|
>Section 7.16</A
|
|
>
|
|
for more information.
|
|
Usernames without passwords are no less dangerous, since browsers typically
|
|
cache the passwords.
|
|
You should not usually permit specification of ports, because
|
|
different ports expect different protocols and the resulting
|
|
``protocol confusion'' can produce an exploit.
|
|
For example, on some systems it's possible to use the ``gopher'' scheme
|
|
and specify the SMTP (email) port to cause a user to send email of the
|
|
attacker's choosing.
|
|
You might permit a few special cases (e.g., http ports 8008 and 8080),
|
|
but on the whole it's not worth it.
|
|
The host when specified by name actually has a fairly limited character set
|
|
(using the DNS standards).
|
|
Technically, the standard doesn't permit the underscore (``_'') character,
|
|
but Microsoft ignored this part of the standard and even requires the
|
|
use of the underscore in some circumstances, so you probably should allow it.
|
|
Also, there's been a great deal of work on supporting international
|
|
characters in DNS names, which is not further discussed here.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>Path:
|
|
Permitting a path is usually okay, but unfortunately some applications
|
|
use part of the path as query data, creating an opening we'll discuss next.
|
|
Also, paths are allowed to contain phrases like ``..'', which can expose
|
|
private data in a poorly-written web server;
|
|
this is less a problem than it once was and really should be fixed
|
|
by the web server.
|
|
Since it's only the phrase ``..'' that's special, it's reasonable to
|
|
look at paths (and possibly query data) and forbid ``../'' as a content.
|
|
However, if your validator permits URL escapes, this can be difficult;
|
|
now you need to prevent versions where some of these characters are
|
|
escaped, and may also have to deal with various ``illegal'' character
|
|
encodings of these characters as well.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>Query:
|
|
Query formats (beginning with "?") can be a security risk
|
|
because some query formats actually cause actions to occur on the serving end.
|
|
They shouldn't, and your applications shouldn't either; see
<A
HREF="avoid-get-non-queries.html"
>Section 5.12</A
> for more information.
However, we have to acknowledge that in reality this is a serious problem.
|
|
In addition, many web sites are actually ``redirectors'' - they take a
|
|
parameter specifying where the user should be redirected, and send back
|
|
a command redirecting the user to the new location.
|
|
If an attacker references such sites and provides
|
|
a more dangerous URI as the redirection value, and the
|
|
browser blithely obeys the redirection, this could be a problem.
|
|
Again, the user's browser should be more careful, but not all user
|
|
browsers are sufficiently cautious.
|
|
Also, many web applications have vulnerabilities that can be
|
|
exploited with certain query values, but in general this is hard to
|
|
prevent.
|
|
The official URI specifications don't sanction the ``+'' (plus) character,
|
|
but in practice the ``+'' character often represents the space character.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>Fragment:
|
|
Fragments basically locate a portion of a document; I'm unaware of
|
|
an attack based on fragments as long as the syntax is legal, but the
|
|
legality of its syntax does need checking.
|
|
Otherwise, an attacker might be able to insert a character such as the
|
|
double-quote (") and prematurely end the URI (foiling any checking).</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>URL escapes:
|
|
URL escapes are useful because they can represent arbitrary 8-bit
|
|
characters; they can also be very dangerous for the same reasons.
|
|
In particular, URL escapes can represent control characters, which many
|
|
poorly-written web applications are vulnerable to.
|
|
In fact, with or without URL escapes, many web applications are vulnerable
|
|
to certain characters (such as backslash, ampersand, etc.), but again
|
|
this is difficult to generalize.</P
|
|
></LI
|
|
><LI
|
|
><P
|
|
>Relative URIs:
|
|
Relative URIs should be reasonably safe (if you manage the web site well),
|
|
although in some applications there's no good reason to allow them either.</P
|
|
></LI
|
|
></UL
|
|
>
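The per-part risks listed above can be turned into a checklist.
The following is only a sketch, assuming Python's urllib.parse; the function
name uri_risks, the risk labels, and the scheme allowlist are hypothetical,
and for simplicity the sketch also flags relative URIs because they have no scheme.

```python
from urllib.parse import unquote, urlsplit

SAFE_SCHEMES = {"http", "https", "ftp"}  # assumed allowlist; adjust to policy


def uri_risks(uri):
    """Return a list of risk labels for an untrusted URI (empty if none)."""
    try:
        parts = urlsplit(uri)
        port = parts.port  # raises ValueError when the port is not valid
    except ValueError:
        return ["unparseable"]
    risks = []
    if parts.scheme not in SAFE_SCHEMES:
        risks.append("scheme")            # e.g. javascript:, mailto:, gopher:
    if "@" in parts.netloc:
        risks.append("userinfo")          # username and/or password present
    if port is not None:
        risks.append("port")              # possible protocol confusion
    if ".." in unquote(parts.path).split("/"):
        risks.append("dot-dot path")      # also catches the escaped form %2e%2e
    if parts.query:
        risks.append("query")             # may trigger actions or redirects
    if any(ord(ch) < 0x20 for ch in unquote(uri)):
        risks.append("control character") # raw or URL-escaped
    return risks


print(uri_risks("http://www.dwheeler.com/secure-programs"))
print(uri_risks("gopher://example.com:25/"))
```

Returning labels rather than a single yes/no answer lets an application
decide which risks it is willing to accept.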
|
|
Of course, there is a trade-off with simplicity as well.
|
|
Simple patterns are easier to understand, but
|
|
they aren't very refined (so they tend to be too permissive or
|
|
too restrictive, even more than a refined pattern).
|
|
Complex patterns can be more exact, but they are more likely to have
|
|
errors, take more time to evaluate, and can be hard to
|
|
implement in some circumstances.</P
|
|
><P
|
|
>Here's my suggestion for a ``simple mostly safe'' URI pattern which is
|
|
very simple and can be implemented ``by hand'' or through a regular
|
|
expression; permit the following pattern:
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
>(http|ftp|https)://[-A-Za-z0-9._/]+</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></P
|
|
><P
|
|
>This pattern doesn't permit many potentially dangerous capabilities
|
|
such as queries, fragments, ports, or relative URIs,
|
|
and it only permits a few schemes.
|
|
It prevents the use of the ``%'' character, which is used in URL escapes
|
|
and can be used to specify characters that the server may not be
|
|
prepared to handle.
|
|
Since it doesn't permit either ``:'' or URL escapes, it doesn't permit
|
|
specifying port numbers, and even using it to redirect to a
|
|
more dangerous URI would be difficult (due to the lack of the escape character).
|
|
It also prevents the use of a number of other characters; again, many
|
|
poorly-designed web applications can't handle a number of
|
|
``unexpected'' characters.</P
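><P
>As a sketch of applying the ``simple mostly safe'' pattern, assuming Python's
re module (the helper name is_simple_mostly_safe is hypothetical):</P
>
```python
import re

# The "simple mostly safe" pattern from the listing above.
SIMPLE_MOSTLY_SAFE = re.compile(r"(http|ftp|https)://[-A-Za-z0-9._/]+")


def is_simple_mostly_safe(uri):
    # fullmatch requires the entire string to match the pattern, so nothing
    # (not even a trailing newline) can ride along after the matched part.
    return SIMPLE_MOSTLY_SAFE.fullmatch(uri) is not None


for candidate in (
    "http://www.dwheeler.com/secure-programs",
    "http://example.com:8080/",
    "http://example.com/article?id=42",
    "javascript:alert(1)",
    "http://127.0.0.1",
):
    print(candidate, is_simple_mostly_safe(candidate))
```
<P
>Matching the whole string, rather than searching for the pattern somewhere
inside it, is essential; otherwise dangerous text could follow a harmless prefix.</P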
|
|
><P
|
|
>Even this ``mostly safe'' URI permits
|
|
a number of questionable URIs, such as
|
|
subdirectories (via ``/'') and attempts to move up directories (via ``..'');
|
|
illegal queries of this kind should be caught by the server.
|
|
It permits some illegal host identifiers (e.g., ``20.20''),
|
|
though I know of no case where this would be a security weakness.
|
|
Some web applications treat subdirectories as query data (or worse,
|
|
as command data); this is hard to prevent in general since finding
|
|
``all poorly designed web applications'' is hopeless.
|
|
You could prevent the use of all paths, but this would make it
|
|
impossible to reference most Internet information.
|
|
The pattern also allows references to local server information
|
|
(through patterns such as "http:///", "http://localhost/", and
|
|
"http://127.0.0.1") and access to servers on an internal network;
|
|
here you'll have to depend on the servers correctly interpreting the
|
|
resulting HTTP GET request as solely a request for information and not
|
|
a request for an action,
|
|
as recommended in <A
|
|
HREF="avoid-get-non-queries.html"
|
|
>Section 5.12</A
|
|
>.
|
|
Since query forms aren't permitted by this pattern, in many environments
|
|
this should be sufficient.</P
|
|
><P
|
|
>Unfortunately, the ``mostly safe''
|
|
pattern also prevents a number of quite legitimate and useful URIs.
|
|
For example,
|
|
many web sites use the ``?'' character to identify specific documents
|
|
(e.g., articles on a news site).
|
|
The ``#'' character is useful for specifying specific sections of a document,
|
|
and permitting relative URIs can be handy in a discussion.
|
|
Various permitted characters and URL escapes aren't included in the
|
|
``mostly safe'' pattern.
|
|
For example, without permitting URL escapes, it's difficult to access
|
|
many non-English pages.
|
|
If you truly need such functionality, then you can use less safe patterns,
|
|
realizing that you're exposing your users to higher risk while
|
|
giving your users greater functionality.</P
|
|
><P
|
|
>One pattern that permits queries, but at
|
|
least limits the protocols and ports used is the following,
|
|
which I'll call the ``simple somewhat safe pattern'':
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> (http|ftp|https)://[-A-Za-z0-9._]+(\/([A-Za-z0-9\-\_\.\!\~\*\'\(\)\%\?]+))*/?</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
>
|
|
This pattern actually isn't very smart, since it permits illegal escapes,
|
|
multiple queries, queries in ftp, and so on.
|
|
It does have the advantage of being relatively simple.</P
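><P
>The weaknesses just mentioned are easy to observe. This sketch assumes
Python's re module, and the helper name is_simple_somewhat_safe is hypothetical:</P
>
```python
import re

# The "simple somewhat safe" pattern from the listing above.
SIMPLE_SOMEWHAT_SAFE = re.compile(
    r"(http|ftp|https)://[-A-Za-z0-9._]+"
    r"(\/([A-Za-z0-9\-\_\.\!\~\*\'\(\)\%\?]+))*/?"
)


def is_simple_somewhat_safe(uri):
    return SIMPLE_SOMEWHAT_SAFE.fullmatch(uri) is not None


checks = [
    "http://example.com/news/article?42",  # a query is now permitted
    "http://example.com/%zz",              # illegal escape, still accepted
    "http://example.com/a?b?c",            # multiple queries, still accepted
    "ftp://example.com/file?x",            # query in ftp, still accepted
    "http://example.com:8080/",            # port, rejected
]
for uri in checks:
    print(uri, is_simple_somewhat_safe(uri))
```
<P
>The first four URIs are accepted and the last is rejected, which shows
both what the pattern limits and what it lets through.</P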
|
|
><P
|
|
>Creating a ``somewhat safe'' pattern that really limits URIs
|
|
to legal values is quite difficult.
|
|
Here's my current attempt to do so, which I call
|
|
the ``sophisticated somewhat safe pattern'', expressed in a form
|
|
where whitespace is ignored and comments are introduced with "#":
|
|
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> (
 (
 # Handle http, https, and relative URIs:
 ((https?://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?))|
 ([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)?
 ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
 (\?( # query:
 (([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+=
 ([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+
 (\&([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+=
 ([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*)
 |
 (([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+ # isindex
 )
 ))?
 (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
 )|
 # Handle ftp:
 (ftp://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?)
 ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
 (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
 )
 )</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
> </P
|
|
><P
|
|
>Even the sophisticated pattern shown above doesn't forbid all illegal URIs.
|
|
For example, again, "20.20" isn't a legal domain name, but it's allowed
|
|
by the pattern; however, to my knowledge
|
|
this shouldn't cause any security problems.
|
|
The sophisticated pattern forbids URL escapes that represent
|
|
control characters (e.g., %00 through %1F) -
|
|
the smallest permitted escape value is %20 (ASCII space).
|
|
Forbidding control characters prevents some trouble, but it's
|
|
also limiting; change "2-9" to "0-9" everywhere if you need to support sending
|
|
all control characters to arbitrary web applications.
|
|
This pattern does permit all other URL escape values in paths,
|
|
which is useful for international characters but could cause trouble
|
|
for a few systems which can't handle it.
|
|
The pattern at least prevents spaces, linefeeds,
|
|
double-quotes, and other dangerous characters
|
|
from being in the URI, which prevents other kinds of
|
|
attacks when incorporating the URI into a generated document.
|
|
Note that the pattern permits ``+'' in many places, since in practice
|
|
the plus is often used to replace the space character
|
|
in queries and fragments.</P
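><P
>One way to use the sophisticated pattern as written is a regular expression
engine with a ``verbose'' mode. The sketch below assumes Python's re.VERBOSE
flag, which ignores whitespace and treats an unescaped "#" as a comment.
The pattern text is copied from the listing above; the helper name and the
extra empty-string check are assumptions added here, since every part of the
first alternative is optional.</P
>
```python
import re

SOPHISTICATED_SOMEWHAT_SAFE = re.compile(r"""
 (
  (
   # Handle http, https, and relative URIs:
   ((https?://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?))|
    ([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)?
   ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
   (\?(                                                      # query:
    (([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+=
     ([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+
     (\&([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+=
      ([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*)
    |
    (([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+ # isindex
    )
   ))?
   (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
  )|
  # Handle ftp:
  (ftp://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?)
   ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
   (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
  )
 )
""", re.VERBOSE)


def is_somewhat_safe(uri):
    # The empty string would match the pattern, so it is refused explicitly.
    return uri != "" and SOPHISTICATED_SOMEWHAT_SAFE.fullmatch(uri) is not None


print(is_somewhat_safe("http://www.dwheeler.com/secure-programs?topic=uri#top"))
print(is_somewhat_safe("http://example.com/a%00b"))
```
<P
>As with the simpler patterns, the whole string must match; a partial match
proves nothing about the rest of the URI.</P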
|
|
><P
|
|
>Unfortunately, as noted above,
|
|
there are attacks which can work through any technique that permits query data,
|
|
and there don't seem to be really good defenses for them once you
|
|
permit queries.
|
|
So, you could strip out the ability to use query data from the
|
|
pattern above, but permit the other forms, producing a
|
|
``sophisticated mostly safe'' pattern:
|
|
<TABLE
|
|
BORDER="0"
|
|
BGCOLOR="#E0E0E0"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
><FONT
|
|
COLOR="#000000"
|
|
><PRE
|
|
CLASS="PROGRAMLISTING"
|
|
> (
 (
 # Handle http, https, and relative URIs:
 ((https?://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?))|
 ([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)?
 ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
 (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
 )|
 # Handle ftp:
 (ftp://([A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)*\.?)
 ((/([A-Za-z0-9\-\_\.\!\~\*\'\(\)]|(%[2-9A-Fa-f][0-9a-fA-F]))+)*/?) # path
 (\#([A-Za-z0-9\-\_\.\!\~\*\'\(\)\+]|(%[2-9A-Fa-f][0-9a-fA-F]))+)? # fragment
 )
 )</PRE
|
|
></FONT
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
></P
|
|
><P
|
|
>As far as I can tell, as long as these patterns are only used to check
|
|
hypertext anchors selected by the user (the "<a>" tag)
|
|
this approach also prevents the insertion of ``web bugs''.
|
|
Web bugs are simply text that allows someone other
|
|
than the originating web server
|
|
of the main page to track information such as who read
|
|
the content and when they read it -
|
|
see <A
|
|
HREF="embedded-content-bugs.html"
|
|
>Section 8.7</A
|
|
> for more information.
|
|
This isn't true if you use the <img> (image) tag with the same
|
|
checking rules - the image is loaded immediately, permitting
|
|
someone to add a ``web bug''.
|
|
Once again, this presumes that you're not permitting any attributes;
|
|
many attributes can be quite dangerous and pierce the security you're
|
|
trying to provide.</P
|
|
><P
|
|
>Please note that all of these patterns require that the entire URI match
|
|
the pattern.
|
|
An unfortunate fact of these patterns is that they limit the
|
|
allowable patterns in a way that forbids many useful ones
|
|
(e.g., they prevent the use of new URI schemes).
|
|
Also, none of them can prevent the very real problem that some web sites
|
|
perform more than queries when presented with a query - and some of these
|
|
web sites are internal to an organization.
|
|
As a result, no URI can really be safe until there
|
|
are no web sites that accept GET queries as an action
|
|
(see <A
|
|
HREF="avoid-get-non-queries.html"
|
|
>Section 5.12</A
|
|
>).
|
|
For more information about legal URLs/URIs, see IETF RFC 2396;
|
|
domain name syntax is further discussed in IETF RFC 1034.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="OTHER-HTML-TAGS"
|
|
></A
|
|
>5.11.5. Other HTML tags</H2
|
|
><P
|
|
>You might even consider supporting more HTML tags.
|
|
Obvious next choices are the list-oriented tags, such as
|
|
<ol> (ordered list),
|
|
<ul> (unordered list),
|
|
and <li> (list item).
|
|
However, after a certain point you're really permitting
|
|
full publishing (in which case you need to trust the provider or perform more
|
|
serious checking than will be described here).
|
|
Even more importantly, every new functionality you add creates an
|
|
opportunity for error (and exploit).</P
|
|
><P
|
|
>One example would be permitting the
|
|
<img> (image) tag with the same URI pattern.
|
|
It turns out this is substantially less safe, because this
|
|
permits third parties to insert ``web bugs'' into the document,
|
|
identifying who read the document and when.
|
|
See <A
|
|
HREF="embedded-content-bugs.html"
|
|
>Section 8.7</A
|
|
> for more information on web bugs.</P
|
|
></DIV
|
|
><DIV
|
|
CLASS="SECT2"
|
|
><H2
|
|
CLASS="SECT2"
|
|
><A
|
|
NAME="RELATED-ISSUES"
|
|
></A
|
|
>5.11.6. Related Issues</H2
|
|
><P
|
|
>Web applications should also explicitly specify the character set
|
|
(usually ISO-8859-1), and not permit other characters, if data from
|
|
untrusted users is being used.
|
|
See <A
|
|
HREF="output-character-encoding.html"
|
|
>Section 9.5</A
|
|
> for more information.</P
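><P
>One way this might look, sketched in Python under the assumption that the
application builds its own response headers: declare the character set in
the Content-Type header, and reject text that cannot be represented in it.
The function names and the DECLARED_CHARSET constant are hypothetical.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
# Assumed character set for this sketch; use whatever your pages declare.
DECLARED_CHARSET = "iso-8859-1"


def content_type_header(charset: str = DECLARED_CHARSET) -> str:
    """Build a Content-Type header that names the character set explicitly."""
    return "Content-Type: text/html; charset=" + charset


def fits_declared_charset(text: str, charset: str = DECLARED_CHARSET) -> bool:
    """Return True only if every character can be encoded in the charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True
```
</PRE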
|
|
><P
|
|
>Since filtering this kind of input is easy to get wrong, other
|
|
alternatives have been discussed as well.
|
|
One option is to ask users to use a different language, much simpler
|
|
than HTML, that you've designed - and you give that language very limited
|
|
functionality.
|
|
Another approach is parsing the HTML into some internal ``safe'' format,
|
|
and then translating that safe format back to HTML.</P
|
|
><P
|
|
>Filtering can be done during input, output, or both.
|
|
CERT recommends filtering data during the output process,
|
|
just before it is rendered as part of the dynamic page.
|
|
This is because, if it is done correctly,
|
|
this approach ensures that all dynamic content is filtered.
|
|
CERT believes that filtering on the input side is less effective
because dynamic content can be entered into a web site's database(s) via
|
|
methods other than HTTP, and in this case,
|
|
the web server may never see the data as part of the input process.
|
|
Unless the filtering is implemented in all places where dynamic data
|
|
is entered, the data elements may still remain tainted.</P
|
|
><P
|
|
>However, I don't agree with CERT on this point for all cases.
|
|
The problem is that it's just as easy to forget to filter all the output
|
|
as the input, and allowing ``tainted'' input into your system
|
|
is a disaster waiting to happen anyway.
|
|
A secure program has to filter its inputs in any case, so it's sometimes better
|
|
to include all of these checks as part of the input filtering
|
|
(so that maintainers can see what the rules really are).
|
|
And finally, in some secure programs there are many different program
|
|
locations that may output a value, but only a very few ways and locations
|
|
where data can be input into it;
|
|
in such cases filtering on input may be a better idea.</P
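><P
>The two positions are not mutually exclusive.
A minimal Python sketch, assuming a hypothetical ``display name'' field:
the value is checked against a whitelist when it is input, and it is still
escaped at the point where it is written into a page.
The pattern, the length limit, and the function names are illustrative
assumptions, not rules from this document.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
import html
import re

# Hypothetical whitelist for a display name: letters, digits, space, and a
# few punctuation characters, 1 to 40 characters long.
NAME_PATTERN = re.compile(r"[A-Za-z0-9 ._-]{1,40}")


def validate_name_on_input(raw: str) -> str:
    """Input-side filter: reject anything that does not fully match."""
    if NAME_PATTERN.fullmatch(raw) is None:
        raise ValueError("display name contains characters that are not allowed")
    return raw


def render_greeting(name: str) -> str:
    """Output-side filter: escape the value even if it was validated earlier,
    since it may have reached the database by some other route."""
    return "Hello, " + html.escape(name, quote=True) + "!"
```
</PRE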
|
|
></DIV
|
|
></DIV
|
|
><H3
|
|
CLASS="FOOTNOTES"
|
|
>Notes</H3
|
|
><TABLE
|
|
BORDER="0"
|
|
CLASS="FOOTNOTES"
|
|
WIDTH="100%"
|
|
><TR
|
|
><TD
|
|
ALIGN="LEFT"
|
|
VALIGN="TOP"
|
|
WIDTH="5%"
|
|
><A
|
|
NAME="FTN.AEN961"
|
|
HREF="filter-html.html#AEN961"
|
|
><SPAN
|
|
CLASS="footnote"
|
|
>[1]</SPAN
|
|
></A
|
|
></TD
|
|
><TD
|
|
ALIGN="LEFT"
|
|
VALIGN="TOP"
|
|
WIDTH="95%"
|
|
><P
|
|
>Technically, a hypertext link can be any ``uniform resource
|
|
identifier'' (URI).
|
|
The term "Uniform Resource Locator" (URL) refers to the subset of URIs
|
|
that identify resources via a representation of their primary access
|
|
mechanism (e.g., their network "location"), rather than identifying
|
|
the resource by name or by some other attribute(s) of that resource.
|
|
Many people use the term ``URL'' as synonymous with ``URI'', since URLs
|
|
are the most common kind of URI.
|
|
For example, the encoding used in URIs is actually called ``URL encoding''.</P
|
|
></TD
|
|
></TR
|
|
></TABLE
|
|
><DIV
|
|
CLASS="NAVFOOTER"
|
|
><HR
|
|
ALIGN="LEFT"
|
|
WIDTH="100%"><TABLE
|
|
SUMMARY="Footer navigation table"
|
|
WIDTH="100%"
|
|
BORDER="0"
|
|
CELLPADDING="0"
|
|
CELLSPACING="0"
|
|
><TR
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="left"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="input-protection-cross-site.html"
|
|
ACCESSKEY="P"
|
|
>Prev</A
|
|
></TD
|
|
><TD
|
|
WIDTH="34%"
|
|
ALIGN="center"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="index.html"
|
|
ACCESSKEY="H"
|
|
>Home</A
|
|
></TD
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="right"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="avoid-get-non-queries.html"
|
|
ACCESSKEY="N"
|
|
>Next</A
|
|
></TD
|
|
></TR
|
|
><TR
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="left"
|
|
VALIGN="top"
|
|
>Prevent Cross-site Malicious Content on Input</TD
|
|
><TD
|
|
WIDTH="34%"
|
|
ALIGN="center"
|
|
VALIGN="top"
|
|
><A
|
|
HREF="input.html"
|
|
ACCESSKEY="U"
|
|
>Up</A
|
|
></TD
|
|
><TD
|
|
WIDTH="33%"
|
|
ALIGN="right"
|
|
VALIGN="top"
|
|
>Forbid HTTP GET To Perform Non-Queries</TD
|
|
></TR
|
|
></TABLE
|
|
></DIV
|
|
></BODY
|
|
></HTML
|
|
> |