Skip to content

Parsing JavaScript objects into Python data structures

License

Notifications You must be signed in to change notification settings

Nykakin/chompjs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Chompjs

license pypi version python version downloads

Transforms JavaScript objects into Python data structures.

In web scraping, you sometimes need to transform Javascript objects embedded in HTML pages into valid Python dictionaries. chompjs is a library designed to do that as a more powerful replacement of standard json.loads:

>>> chompjs.parse_js_object("{a: 100}")
{'a': 100}
>>>
>>> json_lines = """
... {'a': 12}
... {'b': 13}
... {'c': 14}
... """
>>> for entry in chompjs.parse_js_objects(json_lines):
...     print(entry)
... 
{'a': 12}
{'b': 13}
{'c': 14}

Reference documentation

Quickstart

1. installation

> pip install chompjs

or build from source:

$ git clone https://github.com/Nykakin/chompjs
$ cd chompjs
$ python setup.py build
$ python setup.py install

Features

There are two functions available:

  • parse_js_object - try reading first encountered JSON-like object. Raises ValueError on failure
  • parse_js_objects - returns a generator yielding all encountered JSON-like objects. Can be used to read JSON Lines. Does not raise on invalid input.

An example usage with scrapy:

import chompjs
import scrapy


class MySpider(scrapy.Spider):
    # ...

    def parse(self, response):
        script_css = 'script:contains("__NEXT_DATA__")::text'
        script_pattern = r'__NEXT_DATA__ = (.*);'
        # warning: for some pages you need to pass replace_entities=True
        # into re_first to have JSON escaped properly
        script_text = response.css(script_css).re_first(script_pattern)
        try:
            json_data = chompjs.parse_js_object(script_text)
        except ValueError:
            self.log('Failed to extract data from {}'.format(response.url))
            return

        # work on json_data

Parsing of JSON5 objects is supported:

>>> data = """
... {
...   // comments
...   unquoted: 'and you can quote me on that',
...   singleQuotes: 'I can use "double quotes" here',
...   lineBreaks: "Look, Mom! \
... No \\n's!",
...   hexadecimal: 0xdecaf,
...   leadingDecimalPoint: .8675309, andTrailing: 8675309.,
...   positiveSign: +1,
...   trailingComma: 'in objects', andIn: ['arrays',],
...   "backwardsCompatible": "with JSON",
... }
... """
>>> chompjs.parse_js_object(data)
{'unquoted': 'and you can quote me on that', 'singleQuotes': 'I can use "double quotes" here', 'lineBreaks': "Look, Mom! No \n's!", 'hexadecimal': 912559, 'leadingDecimalPoint': 0.8675309, 'andTrailing': 8675309.0, 'positiveSign': '+1', 'trailingComma': 'in objects', 'andIn': ['arrays'], 'backwardsCompatible': 'with JSON'}

If the input string is not yet escaped and contains a lot of \\ characters, then unicode_escape=True argument might help to sanitize it:

>>> chompjs.parse_js_object('{\\\"a\\\": 12}', unicode_escape=True)
{'a': 12}

By default chompjs tries to start with first { or [ character it founds, omitting the rest:

>>> chompjs.parse_js_object('<div>...</div><script>foo = [1, 2, 3];</script><div>...</div>')
[1, 2, 3]

Post-processed input is parsed using json.loads by default. A different loader such as orsjon can be used with loader argument:

>>> import orjson
>>> import chompjs
>>> 
>>> chompjs.parse_js_object("{'a': 12}", loader=orjson.loads)
{'a': 12}

loader_args and loader_kwargs arguments can be used to pass options to underlying loader function. For example for default json.loads you can pass down options such as strict or object_hook:

>>> import decimal
>>> import chompjs
>>> chompjs.parse_js_object('[23.2]', loader_kwargs={'parse_float': decimal.Decimal})
[Decimal('23.2')]

Rationale

In web scraping data often is not present directly inside HTML, but instead provided as an embedded JavaScript object that is later used to initialize the page, for example:

<html>
<head>...</head>
<body>
...
<script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script>
...
</body>
</html>

Standard library function json.loads is usually sufficient to extract this data:

>>> # scrapy shell file:///tmp/test.html
>>> import json
>>> script_text = response.css('script:contains(__PRELOADED_STATE__)::text').re_first('__PRELOADED_STATE__=(.*)')
>>> json.loads(script_text)
{u'foo': u'bar'}

The problem is that not all valid JavaScript objects are also valid JSONs. For example all those strings are valid JavaScript objects but not valid JSONs:

  • "{'a': 'b'}" is not a valid JSON because it uses ' character to quote
  • '{a: "b"}'is not a valid JSON because property name is not quoted at all
  • '{"a": [1, 2, 3,]}' is not a valid JSON because there is an extra , character at the end of the array
  • '{"a": .99}' is not a valid JSON because float value lacks a leading 0

As a result, json.loads fail to extract any of those:

>>> json.loads("{'a': 'b'}")
Traceback (most recent call last):
  ...
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{a: "b"}')
Traceback (most recent call last):
  ...
ValueError: Expecting property name: line 1 column 2 (char 1)
>>> json.loads('{"a": [1, 2, 3,]}')
Traceback (most recent call last):
  ...
ValueError: No JSON object could be decoded
>>> json.loads('{"a": .99}')
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)

chompjs library was designed to bypass this limitation, and it allows to scrape such JavaScript objects into proper Python dictionaries:

>>> import chompjs
>>> 
>>> chompjs.parse_js_object("{'a': 'b'}")
{'a': 'b'}
>>> chompjs.parse_js_object('{a: "b"}')
{'a': 'b'}
>>> chompjs.parse_js_object('{"a": [1, 2, 3,]}')
{'a': [1, 2, 3]}
>>> chompjs.parse_js_object('{"a": .99}')
{'a': 0.99}

Internally chompjs use a parser written in C to iterate over raw string, fixing its issues along the way. The final result is then passed down to standard library's json.loads, ensuring a high speed as compared to full-blown JavaScript parsers such as demjson.

>>> import json
>>> import _chompjs
>>> 
>>> _chompjs.parse('{a: 1}')
'{"a":1}'
>>> json.loads(_)
{'a': 1}

Development

Pull requests are welcome.

To run unittests

$ tox

About

Parsing JavaScript objects into Python data structures

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •