xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":
>>> print(json.dumps(xmltodict.parse("""
... <mydocument has="an attribute">
... <and>
... <many>elements</many>
... <many>more elements</many>
... </and>
... <plus a="complex">
... element as well
... </plus>
... </mydocument>
... """), indent=4))
{
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}
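The mapping in this output can be read back with ordinary dict and list access. The snippet below is a small sketch that reuses the same document; the variable names are only illustrative. Attributes appear under keys prefixed with `@`, the text of an element that also has attributes appears under `#text`, and repeated sibling elements are collected into a list.

```python
import xmltodict

doc = xmltodict.parse("""
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>
""")

root = doc["mydocument"]
attribute = root["@has"]        # attribute of <mydocument>
items = root["and"]["many"]     # repeated <many> elements become a list
text = root["plus"]["#text"]    # text of an element that also has attributes

print(attribute)
print(items)
print(text)
```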
By default, xmltodict does no XML namespace processing (it just treats namespace declarations as regular node attributes), but passing process_namespaces=True will make it expand namespaces for you:
>>> xml = """
... <root xmlns="http://defaultns.com/"
... xmlns:a="http://a.com/"
... xmlns:b="http://b.com/">
... <x>1</x>
... <a:y>2</a:y>
... <b:z>3</b:z>
... </root>
... """
>>> xmltodict.parse(xml, process_namespaces=True) == {
... 'http://defaultns.com/:root': {
... 'http://defaultns.com/:x': '1',
... 'http://a.com/:y': '2',
... 'http://b.com/:z': '3',
... }
... }
True
It also lets you collapse certain namespaces to shorthand prefixes, or skip them altogether:
>>> namespaces = {
... 'http://defaultns.com/': None, # skip this namespace
... 'http://a.com/': 'ns_a', # collapse "http://a.com/" -> "ns_a"
... }
>>> xmltodict.parse(xml, process_namespaces=True, namespaces=namespaces) == {
... 'root': {
... 'x': '1',
... 'ns_a:y': '2',
... 'http://b.com/:z': '3',
... },
... }
True
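For comparison, here is a sketch of the default behaviour described earlier, with process_namespaces left off. It uses the same document as the examples above; the `default_result` name is only illustrative. The namespace declarations come back as ordinary `@`-prefixed attributes, and prefixed element names are kept verbatim.

```python
import xmltodict

xml = """
<root xmlns="http://defaultns.com/"
      xmlns:a="http://a.com/"
      xmlns:b="http://b.com/">
  <x>1</x>
  <a:y>2</a:y>
  <b:z>3</b:z>
</root>
"""

# No namespace processing: xmlns declarations are treated as plain attributes.
default_result = xmltodict.parse(xml)

print(default_result["root"]["@xmlns:a"])
print(default_result["root"]["a:y"])
```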
xmltodict is very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>>
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
... item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
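The same streaming pattern can be tried without downloading a dump. The sketch below uses a tiny in-memory document as a stand-in for the Discogs file; the XML, the `collect_artist` callback and the `names` list are made up for illustration. With item_depth=2, the callback is invoked once per `<artist>` element, receiving the path to the element and the parsed item.

```python
import xmltodict

# Stand-in for a large dump: a root element with one child per artist.
xml = """<artists>
  <artist><name>A Perfect Circle</name></artist>
  <artist><name>King Crimson</name></artist>
</artists>"""

names = []

def collect_artist(path, artist):
    # `artist` is the dict built for one depth-2 element.
    names.append(artist["name"])
    return True  # keep parsing

xmltodict.parse(xml, item_depth=2, item_callback=collect_artist)
print(names)
```

Returning True asks the parser to continue; a callback that returns a false value stops parsing early.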
You can also use a generator: set item_depth to a value greater than 0 and omit item_callback. Because the generator uses a thread under the hood, you need to use it from within a with statement so that the thread is correctly disposed of:
>>> with xmltodict.parse(GzipFile('discogs_artists.xml.gz'), item_depth=2) as gen:
...     artists_names = [artist['name'] for _, artist in gen]
...     artists_names
['A Perfect Circle', 'Fantômas', 'King Crimson', 'Chris Potter']
It can also be used from the command line to pipe objects to a script like this:
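import sys, marshal
while True:
    try:
        # Each record is a marshalled (path, item) pair read from the pipe.
        _, article = marshal.load(sys.stdin.buffer)
    except EOFError:
        break
    print(article['title'])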
$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...
Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:
$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | gzip > enwiki.dicts.gz
And you reuse the dicts with every script that needs them:
$ cat enwiki.dicts.gz | gunzip | script1.py
$ cat enwiki.dicts.gz | gunzip | script2.py
...
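A script that consumes such a stream can wrap the read loop in a small generator. This is only a sketch: `iter_records` is a hypothetical helper, and the two records written to the in-memory buffer are hand-made stand-ins for what would arrive on the pipe, assuming each record is a marshalled (path, item) pair as in the script above.

```python
import io
import marshal

def iter_records(stream):
    """Yield marshalled objects from a binary stream until it is exhausted."""
    while True:
        try:
            yield marshal.load(stream)
        except EOFError:
            return

# Stand-in for the real pipe: two hand-made (path, item) records.
buffer = io.BytesIO()
marshal.dump(([("page", None)], {"title": "Anarchism"}), buffer)
marshal.dump(([("page", None)], {"title": "Autism"}), buffer)
buffer.seek(0)

titles = [item["title"] for _, item in iter_records(buffer)]
print(titles)
```

In a real script, `sys.stdin.buffer` would take the place of the in-memory buffer.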
You can also convert in the other direction, using the unparse() function:
>>> mydict = {
... 'response': {
... 'status': 'good',
... 'last_updated': '2014-02-16T23:10:12Z',
... }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<response>
<status>good</status>
<last_updated>2014-02-16T23:10:12Z</last_updated>
</response>
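Because unparse() follows the same conventions as parse(), a dict can be round-tripped through XML. The following sketch uses a made-up `@version` attribute to show that `@`-prefixed keys are written back as attributes; the variable names are illustrative.

```python
import xmltodict

original = {
    "response": {
        "@version": "1",   # '@' keys become XML attributes
        "status": "good",
        "last_updated": "2014-02-16T23:10:12Z",
    }
}

xml = xmltodict.unparse(original)
restored = xmltodict.parse(xml)

print(xml)
print(restored == original)
```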
You just need to
$ pip install xmltodict
There is an official Fedora package for xmltodict. If you are on Fedora or RHEL, you can do:
$ sudo yum install python-xmltodict
If you love xmltodict, consider supporting the author on Gittip.