XSS hunting through forensic standards-analysis.

By Kate Pearce

Brief: Web standards are complex. With request encoding, Microsoft loses if they are “compliant” and they also lose if they are not.

“Ambiguous RFC leads to Cross Site Scripting” was posted by a colleague at Neohapsis Labs (Patrick Toomey) a few weeks ago, and a related post was also put up by Rob Rachwald at Imperva’s blog. As I have read through some of the associated RFCs many times, I decided to dig a little deeper. I journeyed through the final versions of seven RFCs defining three things (URL, URI, and HTTP), in an attempt to track down just how this issue arrived in the standards and how the Internet Explorer behavior fits in.

What I seem to have found is a situation that illustrates the complexity of standards development, shows how unintended consequences can emerge as standards evolve, and also, surprisingly, how Microsoft is placed in a lose-lose situation with Internet Explorer and standards compliance. It appears that if Microsoft is fully, and minimally, standards compliant then they need to exhibit behavior that the other browsers do not. Should they add the “safe” behavior, they not only break some legacy applications, but also have to add behavior whose status the standard isn’t entirely clear on.

Microsoft loses if they are “compliant” and they lose if they are not. And that presumes you can even work out which standard is applicable in the first place….

Recap of the issue at hand:

Cross Site Scripting occurs when a web application or server takes unvalidated and unsanitized user input and displays it back in such a way that any active (or otherwise harmful) content embedded in it (such as JavaScript) will be executed. This happens because web browsers generally treat anything received from a web server as having originated there. By sending malicious content through a web server first, web browsers lose any context that content originally had and instead associate it all with the web server. Patrick’s post has a walkthrough of an example of this and how it can be abused.

The specific XSS-related problem of inconsistent percent-encoding of sensitive characters in requests across different web browsers is an interesting one. Percent encoding means that if an application directly repeats unsafe input, that input reaches the server with each unsafe character replaced by a percent sign and its hexadecimal value, rather than in raw form. So an injected input like

http://www.example.com/form.php?name=name”><script>alert(123)</script><”

will become the following in the webpage source code where it says “hello NAME”:

name%E2%80%9D%3E%3Cscript%3Ealert(123)%3C%2Fscript%3E%3C%E2%80%9D

which will not, and cannot, execute, as it is neither valid JavaScript nor valid HTML.
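For concreteness, here is a minimal sketch of that mapping using PHP’s rawurlencode(), with the hypothetical payload taken from the example URL above. Note that rawurlencode() is stricter than the browsers discussed here: it also encodes the parentheses, which they leave alone.

    <?php
    // Sketch: percent-encoding the hypothetical payload from the example URL.
    // rawurlencode() encodes everything outside A-Z a-z 0-9 - _ . ~
    $payload = 'name”><script>alert(123)</script><”';
    echo rawurlencode($payload), "\n";
    // Prints: name%E2%80%9D%3E%3Cscript%3Ealert%28123%29%3C%2Fscript%3E%3C%E2%80%9D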

Well, it turns out that Firefox, Chrome, and Safari all perform this encoding of request parameters, while Internet Explorer does not. Therefore, any website which naively repeats input from URL parameters may find that its IE-wielding users are vulnerable to XSS while those using other browsers are not.

Thus it appears that Internet Explorer increases its users’ exposure to Cross-Site Scripting.
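As a concrete sketch of that naive pattern (the page and parameter names are hypothetical, echoing the form.php example above), together with the output-encoding fix that stays safe whichever browser sent the request:

    <?php
    // Vulnerable: repeats the raw query parameter into the page.
    // Browsers that percent-encode " < > neutralize the payload in transit;
    // Internet Explorer sends it raw, so it lands in the page and executes.
    echo 'hello ' . $_GET['name'];

    // Safe regardless of browser behavior: encode on output.
    echo 'hello ' . htmlspecialchars($_GET['name'], ENT_QUOTES, 'UTF-8');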

Latest standards

Both previous posts on this issue list RFC 3986, “URI Generic Syntax”, as the root of the problem, because it lists reserved characters but neglects to mention the XML/HTML delimiters of < and > (page 12, section 2.2).

    reserved    = gen-delims / sub-delims

    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="
Interestingly, these are not listed among the unreserved characters at the bottom of the page either:
   Characters that are allowed in a URI but do not have a reserved
   purpose are called unreserved.  These include uppercase and lowercase
   letters, decimal digits, hyphen, period, underscore, and tilde.

      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

So, should they be encoded or not? They are not explicitly unsafe, nor are they explicitly safe!

“Family” history

Patrick mentions that RFC 1738 “Uniform Resource Locators” (which RFC 3986 above updated) specifically mentioned < and > as unsafe on page 2:

   The characters "<" and ">" are unsafe because they are used as the
   delimiters around URLs in free text; the quote mark (""") is used to
   delimit URLs in some systems.  The character "#" is unsafe and should
   always be encoded because it is used in World Wide Web and in other
   systems to delimit a URL from a fragment/anchor identifier that might
   follow it.

However, it occurred to me that in between these two standards there are other players: namely RFC 2396, which was obsoleted by RFC 3986, and RFC 1808, which was obsoleted by 2396. Interestingly, RFC 1738 states that it is updated by 1808, but 1808 doesn’t mention that it updates 1738. Note that 1808 is only a partial update to 1738, as it is only concerned with relative URLs.

With this chain we have, moving forward in time as we go down:

RFC 1738: Uniform Resource Locators (URL)

↓

RFC 1808: Relative Uniform Resource Locators

↓

RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax

↓

RFC 3986: Uniform Resource Identifier (URI): Generic Syntax

At the top of this chain we have < and > being encoded, but at the bottom we don’t. What happened in between?

I’ll get to that soon, but first I have to introduce another RFC family, the HTTP family of RFCs.

“Neighborly” history

Since HTTP is really what we are concerned with (it uses URIs to find resources), we need to look at the specifications for HTTP too.

Interestingly, the first IETF HTTP standard, RFC 1945 “Hypertext Transfer Protocol — HTTP/1.0”, had < and > as unsafe and requiring encoding (referencing RFC 1808), as did the first HTTP/1.1 specification, RFC 2068. But the latest HTTP RFC, RFC 2616 “Hypertext Transfer Protocol — HTTP/1.1”, does not explicitly state that they have to be encoded (instead referencing RFC 2396 on page 19).

   Characters other than those in the "reserved" and "unsafe" sets (see
   RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

   For example, the following three URIs are equivalent:

      http://abc.com:80/~smith/home.html
      http://ABC.com/%7Esmith/home.html
      http://ABC.com:/%7esmith/home.html

It does state, though, that to appear within an HTTP parameter value they need to be inside a quoted string (RFC 2616, page 16).

   Many HTTP/1.1 header field values consist of words separated by LWS
   or special characters. These special characters MUST be in a quoted
   string to be used within a parameter value (as defined in section
   3.6).

       token          = 1*<any CHAR except CTLs or separators>
       separators     = "(" | ")" | "<" | ">" | "@"
                      | "," | ";" | ":" | "\" | <">
                      | "/" | "[" | "]" | "?" | "="
                      | "{" | "}" | SP | HT

So, as of HTTP/1.1 we have < and > requiring encoding only indirectly (via RFC 2396); the HTTP specification itself no longer requires the encoding, leaving the HTTP protocol potentially vulnerable on its own. But that’s OK, because RFC 2396 still offers protection (RFC 2396, page 9):

   The angle-bracket "<" and ">" and double-quote (") characters are
   excluded because they are often used as the delimiters around URI in
   text documents and protocol fields.  The character "#" is excluded
   because it is used to delimit a URI from a fragment identifier in URI
   references (Section 4). The percent character "%" is excluded because
   it is used for the encoding of escaped characters.

   delims      = "<" | ">" | "#" | "%" | <">

The nail in the coffin: updating URI Generic Syntax.

Then the actual issue occurred. RFC 3986 updated 1738, made 2396 obsolete, and made a slight change (RFC 3986, pages 11-12):

   URIs include components and subcomponents that are delimited by
   characters in the "reserved" set.  These characters are called
   "reserved" because they may (or may not) be defined as delimiters by
   the generic syntax, by each scheme-specific syntax, or by the
   implementation-specific syntax of a URI's dereferencing algorithm.
   If data for a URI component would conflict with a reserved
   character's purpose as a delimiter, then the conflicting data must be
   percent-encoded before the URI is formed.
   ...
   reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

Notice something missing? No more < or > (or % or " for that matter, but that’s more complicated).

Maybe this RFC isn’t ambiguous though? Consider this line from the excerpt above (RFC 3986, page 11):

“If data for a URI component would conflict with a reserved character’s purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.”

Here’s the issue: the later RFC, 3986, is referring to delimiters of URIs, whereas RFC 2396 is referring to delimiters in surrounding content (ostensibly not its job as a URI standard).

Summary and timeline

In short, the problem is this: HTTP shifted decisions about its own content to a URI RFC, and that URI RFC is now obsolete, replaced by another which does not offer this protection.

    URI timeline | HTTP timeline | Notes                        | Requires encoding in URI family? | Requires encoding in HTTP family?
    1738         |               | URL (updated by 1808)        | Yes                              | N/A
    1808         |               | Relative URL (updates 1738)  | Yes                              | N/A
                 | 1945          | HTTP 1.0                     | Yes                              | Yes
                 | 2068          | HTTP 1.1                     | Yes                              | Yes
    2396         |               | URI Generic                  | Yes                              | Yes
                 | 2616          | HTTP 1.1                     | Yes                              | No
    3986         |               | URI Generic                  | No                               | No

So the error was introduced into HTTP in RFC 2616, but it did not manifest until RFC 3986 removed the mitigations from the URI syntax.

Implications and other considerations

There are a few implications that come to mind, most notably who is responsible for a decision about something in a specification, and whether this particular case may be leading to multiple-encoding vulnerabilities in applications.

Controlling responsibility for functionality in standards.

One of the core problems here was that, early on, an HTTP standard shifted control of a content-level decision to the URI specifications, and those specifications later removed constraints that existed only for HTTP’s benefit. Early in this history we had two non-conflicting layers of protection; by the end there were none. The problem is that while these two protocols may appear conceptually to form a protocol stack, with no layer depending on assumptions about another, this is not the case in practice:

[Figure: How it seems HTTP and URI interact, with HTTP sitting on top of URI syntax, making no cross-dependent assumptions.]

[Figure: How they actually interact: they intertwine slightly.]

When developing your own standards and protocols you need to carefully map out who owns what, and make security decisions about data in your component based upon your component alone, not upon unfounded and potentially dangerous assumptions about the behavior of another component. Another common example is web applications presuming that incoming TCP/IP details or the referrer header prove something: the former relies upon TCP/IP not being spoofed, while the latter presumes a non-compromised (and honest) web browser.

Double-encoding

One potential problem with this inconsistent encoding across web browsers is that it may lead developers to decode their incoming data multiple times, or to simply keep decoding incoming requests until they decode no more, so that their applications always see the same data regardless of the browser. But this may lead developers to introduce multiple-decode vulnerabilities into their applications.
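A rough sketch of that “decode until it stops changing” pattern (illustrative only, not a recommendation; the parameter name is hypothetical):

    <?php
    // Risky normalization: keep decoding until the value is stable, so every
    // layer of encoding is stripped, however many times the data was encoded.
    $value = isset($_GET['name']) ? $_GET['name'] : '';
    do {
        $previous = $value;
        $value    = urldecode($value);
    } while ($value !== $previous);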

Encoding can offer a degree of protection against some injection attacks, but not always, as it can sometimes introduce them. Furthermore, web servers, application components, or the applications themselves will often decode percent-encoded requests transparently and on-the-fly. When an application, or its architecture, does this decoding in unanticipated ways you get double- and triple-encoding vulnerabilities.

For example, %25 is a percent character and %27 is an apostrophe (‘), so %2527 can be double-decoded: first to %27, and then to an apostrophe (‘). %252527 is triple-encoded, %25252527 is quadruple-encoded, and so on. This can sometimes introduce flaws such as SQL injection in applications that check the input (and sometimes its first decoded variant) for unsafe content such as apostrophes, rather than using safe mechanisms like SQL parameterized statements.
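A quick sketch of that decoding chain:

    <?php
    // Each urldecode() call peels off one layer of percent-encoding.
    echo urldecode('%2527'), "\n";                          // %27
    echo urldecode(urldecode('%2527')), "\n";               // '
    echo urldecode(urldecode(urldecode('%252527'))), "\n";  // ' (from triple-encoded)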

If you ever have, or suspect, this kind of decoding in your application (or in a component of its architecture), ensure that:

1. Validation checks are made unnecessary, where possible, by using safe techniques (see the sketch after this list);

2. Where validation checks are required, they are made as close as possible to the point where the data is used;

3. All security testing you do checks at least triple-decoded variants.
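As a minimal sketch of point 1, a parameterized query (here via PDO, with hypothetical connection details and table) keeps the data out of the SQL grammar entirely, so a smuggled apostrophe is harmless no matter how many times it was decoded:

    <?php
    // Safe technique: a PDO prepared statement; the driver handles the data
    // separately from the SQL text, so no apostrophe checking is needed.
    $pdo  = new PDO('mysql:host=localhost;dbname=example', 'user', 'password');
    $stmt = $pdo->prepare('SELECT * FROM users WHERE name = :name');
    $stmt->execute(array(':name' => $_GET['name']));
    $rows = $stmt->fetchAll();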

