Some functions to parse and normalize URLs.
This is based on the original work that used to live at github.com/rbaier/urltools.git but is now gone.
>>> urltools.normalize("Http://exAMPLE.com./foo")
'http://example.com/foo'
The following rules are applied to normalize a URL (a combined example follows this list):
- lowercase scheme
- lowercase host (also works with IDNs)
- remove default port
- remove ':' without port
- remove DNS root label
- unquote path, query, fragment
- collapse path (remove '//', '/./', '/../')
- sort query params and remove params without value
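Taken together, these rules yield results like the following (a sketch derived from the rules above; exact output may vary by version):

>>> urltools.normalize("HTTP://Example.COM.:80/a/./b//../c?b=2&a=1&x=")
'http://example.com/a/c?a=1&b=2'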
The result of parse and extract is a ParseResult named tuple that contains scheme, username, password, subdomain, domain, tld, port, path, query and fragment.
>>> urltools.parse("http://example.co.uk/foo/bar?x=1#abc")
ParseResult(scheme='http', username='', password='', subdomain='', domain='example', tld='co.uk', port='', path='/foo/bar', query='x=1', fragment='abc')
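Since ParseResult is a named tuple, the individual segments can be read by attribute (a minimal usage sketch):

>>> result = urltools.parse("http://example.co.uk/foo/bar?x=1#abc")
>>> result.domain
'example'
>>> result.tld
'co.uk'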
If the scheme is missing, parse interprets the URL as relative.
>>> urltools.parse("www.example.co.uk/abc")
ParseResult(scheme='', username='', password='', subdomain='', domain='', tld='', port='', path='www.example.co.uk/abc', query='', fragment='')
extract, by contrast, does not treat scheme-less URLs as relative and always tries to extract as much information as possible.
>>> urltools.extract("www.example.co.uk/abc")
ParseResult(scheme='', username='', password='', subdomain='www', domain='example', tld='co.uk', port='', path='/abc', query='', fragment='')
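Because extract fills in the host segments even without a scheme, the full hostname can be rebuilt from its parts (a minimal sketch using only the named-tuple fields):

>>> r = urltools.extract("www.example.co.uk/abc")
>>> ".".join(p for p in (r.subdomain, r.domain, r.tld) if p)
'www.example.co.uk'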
Besides the main functions described above, urltools has some more functions to manipulate segments of a URL:
- encode (IDNA, see RFC 3490)

  >>> urltools.encode("http://müller.de")
  'http://xn--mller-kva.de/'

- assemble (build a new URL from a ParseResult; see the round-trip sketch after this list)
- normalize_host
- normalize_port
- normalize_path

  >>> normalize_path("/a/b/../../c")
  '/c'

- normalize_query

  >>> normalize_query("x=1&y=&z=3")
  'x=1&z=3'

- normalize_fragment
- unquote
- split (basically the same as urlparse.urlparse)

  >>> split("http://www.example.com/abc?x=1&y=2#foo")
  SplitResult(scheme='http', netloc='www.example.com', path='/abc', query='x=1&y=2', fragment='foo')

- split_netloc

  >>> split_netloc("foo:[email protected]:8080")
  ('foo', 'bar', 'www.example.com', '8080')

- split_host

  >>> split_host("www.example.ac.at")
  ('www', 'example', 'ac.at')
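For example, assemble can turn the ParseResult from parse back into a URL string (a sketch, assuming assemble accepts the ParseResult produced by parse; output is illustrative):

>>> pr = urltools.parse("http://example.co.uk/foo/bar?x=1#abc")
>>> urltools.assemble(pr)
'http://example.co.uk/foo/bar?x=1#abc'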
Note: installing from PyPI is not working yet. Once it is, you will be able to install urltools from the Python Package Index (PyPI):

pip install urltools

Until then, get the newest version directly from GitHub:

pip install -e git://github.com/itzik-h/urltools.git#egg=urltools
urltools uses the Public Suffix List to split domain names correctly. E.g. the
TLD of example.co.uk would be .co.uk and not .uk.
I recommend using a local copy of this list; otherwise it is downloaded on every import of urltools:
export PUBLIC_SUFFIX_LIST="/path/to/effective_tld_names.dat"
For more information see http://publicsuffix.org/
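Alternatively, set the variable from Python before the first import (a sketch; the path is a placeholder):

import os

# Point urltools at a local copy of the Public Suffix List
# (placeholder path) before the module is imported.
os.environ["PUBLIC_SUFFIX_LIST"] = "/path/to/effective_tld_names.dat"

import urltools  # now reads the list from the local file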
To run the tests, I use pytest:
py.test -vrxs