Description
The utf8toUnicode transformation function outputs hex sequences of the form %uXXXX and %uXXXXXX. Other characters are passed through as-is. For some inputs, these sequences are indistinguishable from the encoded output produced by different inputs.
This ticket is to report two separate such encoding ambiguities: no escaping for a literal %, and trailing literal hex digits after four-digit sequences.
Escaping literal % characters
This function outputs hex sequences for non-ASCII codepoints. Other characters are passed through as-is.
The % character is passed through as-is, too, and so input of the form abc%uXXXXxyz will produce output which is indistinguishable from a legitimate hex sequence generated by the function.
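To make the collision concrete, here is a minimal Python sketch. It is not the real implementation: model_utf8_to_unicode is a hypothetical stand-in that only mimics the behaviour described in this ticket (ASCII passed through, other codepoints written as %u plus lowercase hex), and the zero-padding it uses is an assumption.

```python
def model_utf8_to_unicode(data: bytes) -> str:
    """Hypothetical model of the behaviour described in this ticket.

    Assumptions: ASCII (including '%') passes through unchanged, codepoints
    up to U+FFFF become %uXXXX, and anything above becomes %uXXXXXX.
    """
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                # passed through as-is, '%' included
        elif cp <= 0xFFFF:
            out.append(f"%u{cp:04x}")     # four hex digits
        else:
            out.append(f"%u{cp:06x}")     # six hex digits
    return "".join(out)


# Input that already contains the ASCII text "%u0100".
literal = model_utf8_to_unicode(b"abc%u0100xyz")

# Input that contains the real codepoint U+0100 (bytes \xc4\x80).
encoded = model_utf8_to_unicode(b"abc\xc4\x80xyz")

# Two different inputs, one identical output.
assert literal == encoded == "abc%u0100xyz"
```

A consumer looking only at the output cannot tell which of the two inputs it came from.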
This is possibly also a security risk, in that a consumer reading hex sequences would then treat the following few characters as digits (if they are legal hex characters), and then convert the sequence to a single codepoint. So this allows for a bypass by way of "sneaking through" those characters.
Variable length sequences
utf8toUnicode encodes to four hex digits for some codepoints, and six hex digits for other codepoints.
For example, the input \xc4\x80-\xf4\x8f\xbf\xbf is encoded to: %u0100-%u10ffff
This makes the output ambiguous. For example, the input å00 produces %uXXXX00, where the 00 are low-ASCII bytes (literal hex digits) passed through as-is. That is indistinguishable from a single six-digit codepoint sequence %uXXXXXX.
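The same kind of collision can be reproduced with the %u10ffff value from the example above. This is a sketch only: model_utf8_to_unicode is a hypothetical model of the behaviour described in this ticket, not the real code, and its padding rules are an assumption.

```python
def model_utf8_to_unicode(data: bytes) -> str:
    """Hypothetical model: ASCII passes through, codepoints up to U+FFFF
    become %uXXXX, higher codepoints become %uXXXXXX (assumed padding)."""
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"%u{cp:04x}")
        else:
            out.append(f"%u{cp:06x}")
    return "".join(out)


# One codepoint, U+10FFFF, encoded with six hex digits.
single = model_utf8_to_unicode(b"\xf4\x8f\xbf\xbf")

# Two things: codepoint U+10FF (four hex digits) followed by literal "ff".
pair = model_utf8_to_unicode("\u10ff".encode("utf-8") + b"ff")

# Both inputs yield the same output string.
assert single == pair == "%u10ffff"
```

Without a fixed sequence length, a reader of the output has no way to decide whether the last two characters are part of the sequence or literal text.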
This could also be a possible bypass: if a rule looks to match %u1234 but would allow %u123456, an attacker would find some character for the 56 part which is syntactically permissible (whitespace, for example), and arrange for that to be after the disallowed character. Rather like \0-free shellcode, but the other way around.
My suggested fix for both issues is to always encode to six-digit hex sequences, and also to encode % as a hex sequence.
This breaks compatibility for rules using utf8toUnicode, and so notice would need to be given to rule authors to update those rules. Hence I would suggest replacing utf8toUnicode with a new function (which I have called utf8toHex), and to update rules accordingly. Then utf8toUnicode may be removed.
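A rough Python sketch of how the suggested encoding could behave is below. The names sketch_utf8_to_hex and sketch_decode are hypothetical, and keeping the %u prefix for the new function is my assumption here; only the two rules themselves (always six digits, and % is encoded too) come from the suggestion above.

```python
import re


def sketch_utf8_to_hex(data: bytes) -> str:
    """Hypothetical sketch of the proposed fix: every non-ASCII codepoint
    and every literal '%' becomes %u plus exactly six lowercase hex digits."""
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x80 and ch != "%":
            out.append(ch)                # other ASCII passes through
        else:
            out.append(f"%u{cp:06x}")     # fixed width, '%' included
    return "".join(out)


def sketch_decode(text: str) -> str:
    """Inverse of the sketch above: each '%' in the output always starts
    a sequence of exactly six hex digits, so parsing is unambiguous."""
    return re.sub(r"%u([0-9a-f]{6})", lambda m: chr(int(m.group(1), 16)), text)


# The inputs that collided before now produce distinct outputs.
assert sketch_utf8_to_hex(b"abc%u0100xyz") == "abc%u000025u0100xyz"
assert sketch_utf8_to_hex(b"abc\xc4\x80xyz") == "abc%u000100xyz"
assert sketch_utf8_to_hex("\u10ffff".encode("utf-8")) == "%u0010ffff"
assert sketch_utf8_to_hex(b"\xf4\x8f\xbf\xbf") == "%u10ffff"

# And each output decodes back to its own input.
for sample in ["abc%u0100xyz", "abc\u0100xyz", "\u10ffff", "\U0010ffff", "å00"]:
    assert sketch_decode(sketch_utf8_to_hex(sample.encode("utf-8"))) == sample
```

Because every % in the output starts a sequence and every sequence has the same length, trailing hex digits after a sequence can only be literal text.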