[3.6] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595) #25728

13 changes: 13 additions & 0 deletions Doc/library/urllib.parse.rst
@@ -288,6 +288,9 @@ or on combining URL components into a URL string.
   ``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
   decomposed before parsing, no error will be raised.

   Following the `WHATWG spec`_ that updates RFC 3986, ASCII newline
   ``\n``, carriage return ``\r``, and tab ``\t`` characters are stripped
   from the URL.

   .. versionchanged:: 3.6
      Out-of-range port numbers now raise :exc:`ValueError`, instead of
      returning :const:`None`.
@@ -296,6 +299,10 @@ or on combining URL components into a URL string.
      Characters that affect netloc parsing under NFKC normalization will
      now raise :exc:`ValueError`.

   .. versionchanged:: 3.6.14
      ASCII newline and tab characters are stripped from the URL.

   .. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser

.. function:: urlunsplit(parts)

@@ -633,6 +640,10 @@ task isn't already covered by the URL parsing functions above.

.. seealso::

   `WHATWG`_ - URL Living Standard
      Working Group for the URL Standard that defines URLs, domains, IP addresses, the
      application/x-www-form-urlencoded format, and their API.

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the urllib.parse module
      should conform to this. Certain deviations could be observed, which are
@@ -656,3 +667,5 @@ task isn't already covered by the URL parsing functions above.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.

.. _WHATWG: https://url.spec.whatwg.org/
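The stripping behavior documented in this change can be demonstrated directly. A minimal sketch, assuming a Python build that includes this fix (e.g. 3.6.14+ for this backport); the input mirrors the URL used in the new tests:

```python
from urllib.parse import urlsplit

# The URL below embeds a tab and newlines; on patched versions they are
# removed before the URL is split into components.
p = urlsplit("http://www.python.org/java\nscript:\talert('msg\r\n')/#frag")
print(p.path)      # /javascript:alert('msg')/
print(p.geturl())  # http://www.python.org/javascript:alert('msg')/#frag
```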
39 changes: 39 additions & 0 deletions Lib/test/test_urlparse.py
@@ -612,6 +612,45 @@ def test_urlsplit_attributes(self):
        with self.assertRaisesRegex(ValueError, "out of range"):
            p.port

    def test_urlsplit_remove_unsafe_bytes(self):
        # Remove ASCII tabs and newlines from input, for http common case scenario.
        url = "http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
        p = urllib.parse.urlsplit(url)
        self.assertEqual(p.scheme, "http")
        self.assertEqual(p.netloc, "www.python.org")
        self.assertEqual(p.path, "/javascript:alert('msg')/")
        self.assertEqual(p.query, "")
        self.assertEqual(p.fragment, "frag")
        self.assertEqual(p.username, None)
        self.assertEqual(p.password, None)
        self.assertEqual(p.hostname, "www.python.org")
        self.assertEqual(p.port, None)
        self.assertEqual(p.geturl(), "http://www.python.org/javascript:alert('msg')/#frag")

        # Remove ASCII tabs and newlines from input as bytes, for http common case scenario.
        url = b"http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
        p = urllib.parse.urlsplit(url)
        self.assertEqual(p.scheme, b"http")
        self.assertEqual(p.netloc, b"www.python.org")
        self.assertEqual(p.path, b"/javascript:alert('msg')/")
        self.assertEqual(p.query, b"")
        self.assertEqual(p.fragment, b"frag")
        self.assertEqual(p.username, None)
        self.assertEqual(p.password, None)
        self.assertEqual(p.hostname, b"www.python.org")
        self.assertEqual(p.port, None)
        self.assertEqual(p.geturl(), b"http://www.python.org/javascript:alert('msg')/#frag")

        # any scheme
        url = "x-new-scheme://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
        p = urllib.parse.urlsplit(url)
        self.assertEqual(p.geturl(), "x-new-scheme://www.python.org/javascript:alert('msg')/#frag")

        # Remove ASCII tabs and newlines from input as bytes, any scheme.
        url = b"x-new-scheme://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
        p = urllib.parse.urlsplit(url)
        self.assertEqual(p.geturl(), b"x-new-scheme://www.python.org/javascript:alert('msg')/#frag")

    def test_attributes_bad_port(self):
        """Check handling of invalid ports."""
        for bytes in (False, True):
11 changes: 11 additions & 0 deletions Lib/urllib/parse.py
@@ -76,6 +76,9 @@
                '0123456789'
                '+-.')

# Unsafe bytes to be removed per WHATWG spec
_UNSAFE_URL_BYTES_TO_REMOVE = ['\t', '\r', '\n']

# XXX: Consider replacing with functools.lru_cache
MAX_CACHE_SIZE = 20
_parse_cache = {}
@@ -409,6 +412,11 @@ def _checknetloc(netloc):
        raise ValueError("netloc '" + netloc + "' contains invalid " +
                         "characters under NFKC normalization")

def _remove_unsafe_bytes_from_url(url):
    for b in _UNSAFE_URL_BYTES_TO_REMOVE:
        url = url.replace(b, "")
    return url

def urlsplit(url, scheme='', allow_fragments=True):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>
@@ -439,6 +447,7 @@ def urlsplit(url, scheme='', allow_fragments=True):
            if '?' in url:
                url, query = url.split('?', 1)
            _checknetloc(netloc)
            url = _remove_unsafe_bytes_from_url(url)
            v = SplitResult(scheme, netloc, url, query, fragment)
            _parse_cache[key] = v
            return _coerce_result(v)
@@ -453,6 +462,8 @@ def urlsplit(url, scheme='', allow_fragments=True):
                # not a port number
                scheme, url = url[:i].lower(), rest

    url = _remove_unsafe_bytes_from_url(url)

    if url[:2] == '//':
        netloc, url = _splitnetloc(url, 2)
        if (('[' in netloc and ']' not in netloc) or
@@ -0,0 +1,6 @@
The presence of newline or tab characters in parts of a URL could allow
some forms of attack.

Following the controlling specification for URLs defined by WHATWG,
:func:`urllib.parse` now removes ASCII newlines and tabs from URLs,
preventing such attacks.
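The bytes cases covered by the new tests behave the same way, since bytes URLs are coerced internally before the unsafe characters are removed. A minimal sketch, assuming a patched Python (e.g. 3.6.14+ for this backport):

```python
from urllib.parse import urlsplit

# bytes input: tab, carriage return, and newline are all stripped,
# and the result components come back as bytes.
p = urlsplit(b"http://www.python.org/a\tb\r\nc")
print(p.path)  # b'/abc'
```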