
Fix --write-metadata option to support extractors with datetime metadata #251

Closed · wants to merge 1 commit
Conversation

iamleot (Contributor) commented May 8, 2019

When using the --write-metadata option with extractors that have
datetime metadata, gallery-dl fails to encode JSON because json.dump()
cannot serialize them, e.g.:

% env PYTHONPATH=. python3.7 -m gallery_dl --verbose --write-metadata 'https://www.instagram.com/p/BqvsDleB3lV/'
[...]
[instagram][debug]
Traceback (most recent call last):
  File ".../gallery-dl/gallery_dl/job.py", line 55, in run
    self.dispatch(msg)
  File ".../gallery-dl/gallery_dl/job.py", line 99, in dispatch
    self.handle_url(url, kwds)
  File ".../gallery-dl/gallery_dl/job.py", line 230, in handle_url
    pp.run(self.pathfmt)
  File ".../gallery-dl/gallery_dl/postprocessor/metadata.py", line 40, in run
    self.write(file, pathfmt)
  File ".../gallery-dl/gallery_dl/postprocessor/metadata.py", line 69, in _write_json
    ensure_ascii=self.ascii,
  File ".../json/__init__.py", line 179, in dump
    for chunk in iterable:
  File ".../json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File ".../json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File ".../json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File ".../json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
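
The failure can be reproduced with the standard library alone; a minimal sketch (the dictionary and date are made up for illustration):

import datetime
import json

# json.dumps() has no built-in representation for datetime objects and
# raises "TypeError: Object of type datetime is not JSON serializable"
json.dumps({"date": datetime.datetime(2018, 11, 29, 19, 27, 53)})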

This pull request introduces a GalleryDLJSONEncoder that serializes
datetime.datetime objects as ISO 8601 strings.

In the future, if needed, this can be extended to support more
non-basic types.

Introduce a GalleryDLJSONEncoder to encode datetime.datetime as
ISO 8601 strings.
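
The diff itself is not rendered in this conversation, but such an encoder could look roughly like the following sketch; the class name comes from the description above, while the body is only an illustration and not necessarily the exact patch:

import datetime
import json

class GalleryDLJSONEncoder(json.JSONEncoder):
    # JSON encoder that also handles datetime.datetime objects

    def default(self, obj):
        # represent datetime objects as ISO 8601 strings,
        # e.g. "2018-11-29T19:27:53"
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        # fall back to the default behaviour, which raises TypeError
        return json.JSONEncoder.default(self, obj)

# usage: json.dump(data, fp, cls=GalleryDLJSONEncoder)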
iamleot (Contributor, Author) commented May 9, 2019

JFTR, the test failures/errors seem unrelated to this change.

Here is the relevant output:

[...]
test_B4kThreadExtractor_1 (test.test_results.TestExtractorResults) ... FAIL
[...]
test_TwitterMediaExtractor_1 (test.test_results.TestExtractorResults) ... ERROR
test_TwitterTimelineExtractor_1 (test.test_results.TestExtractorResults) ... ERROR
test_TwitterTweetExtractor_1 (test.test_results.TestExtractorResults) ... ERROR
[...]
======================================================================
ERROR: test_TwitterMediaExtractor_1 (test.test_results.TestExtractorResults)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 258, in test
    self._run_test(extr, url, result)
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 61, in _run_test
    tjob.run()
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 152, in run
    for msg in self.extractor:
  File "/home/travis/build/mikf/gallery-dl/gallery_dl/extractor/twitter.py", line 36, in items
    for tweet in self.tweets():
  File "/home/travis/build/mikf/gallery-dl/gallery_dl/extractor/twitter.py", line 117, in _tweets_from_api
    data = self.request(url, params=params, headers=headers).json()
  File "/home/travis/build/mikf/gallery-dl/gallery_dl/extractor/common.py", line 107, in request
    raise exception.HttpError(msg)
gallery_dl.exception.HttpError: 404: Not Found for url: https://twitter.com/i/profiles/show/PicturesEarth/media_timeline
-------------------- >> begin captured stdout << ---------------------

https://twitter.com/PicturesEarth/media

--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
twitter: DEBUG: Using TwitterMediaExtractor for 'https://twitter.com/PicturesEarth/media'
urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): twitter.com:443
urllib3.connectionpool: DEBUG: https://twitter.com:443 "GET /i/profiles/show/PicturesEarth/media_timeline?include_available_features=1&include_entities=1&reset_error_state=false&lang=en HTTP/1.1" 404 65
--------------------- >> end captured logging << ---------------------

======================================================================
ERROR: test_TwitterTimelineExtractor_1 (test.test_results.TestExtractorResults)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 258, in test
    self._run_test(extr, url, result)
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 61, in _run_test
    tjob.run()
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 152, in run
    for msg in self.extractor:
  File "/home/travis/build/mikf/gallery-dl/gallery_dl/extractor/twitter.py", line 36, in items
    for tweet in self.tweets():
  File "/home/travis/build/mikf/gallery-dl/gallery_dl/extractor/twitter.py", line 117, in _tweets_from_api
    data = self.request(url, params=params, headers=headers).json()
  File "/home/travis/build/mikf/gallery-dl/gallery_dl/extractor/common.py", line 107, in request
    raise exception.HttpError(msg)
gallery_dl.exception.HttpError: 404: Not Found for url: https://twitter.com/i/profiles/show/PicturesEarth/timeline/tweets
-------------------- >> begin captured stdout << ---------------------

https://twitter.com/PicturesEarth

--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
twitter: DEBUG: Using TwitterTimelineExtractor for 'https://twitter.com/PicturesEarth'
urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): twitter.com:443
urllib3.connectionpool: DEBUG: https://twitter.com:443 "GET /i/profiles/show/PicturesEarth/timeline/tweets?include_available_features=1&include_entities=1&reset_error_state=false&lang=en HTTP/1.1" 404 65
--------------------- >> end captured logging << ---------------------

======================================================================
ERROR: test_TwitterTweetExtractor_1 (test.test_results.TestExtractorResults)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 258, in test
    self._run_test(extr, url, result)
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 61, in _run_test
    tjob.run()
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 152, in run
    for msg in self.extractor:
  File "/home/travis/build/mikf/gallery-dl/gallery_dl/extractor/twitter.py", line 36, in items
    for tweet in self.tweets():
  File "/home/travis/build/mikf/gallery-dl/gallery_dl/extractor/twitter.py", line 200, in tweets
    page = self.request(url).text
  File "/home/travis/build/mikf/gallery-dl/gallery_dl/extractor/common.py", line 107, in request
    raise exception.HttpError(msg)
gallery_dl.exception.HttpError: 404: Not Found for url: https://twitter.com/PicturesEarth/status/672897688871018500
-------------------- >> begin captured stdout << ---------------------

https://twitter.com/PicturesEarth/status/672897688871018500

--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
twitter: DEBUG: Using TwitterTweetExtractor for 'https://twitter.com/PicturesEarth/status/672897688871018500'
urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): twitter.com:443
urllib3.connectionpool: DEBUG: https://twitter.com:443 "GET /PicturesEarth/status/672897688871018500 HTTP/1.1" 404 1742
--------------------- >> end captured logging << ---------------------

======================================================================
FAIL: test_B4kThreadExtractor_1 (test.test_results.TestExtractorResults)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 258, in test
    self._run_test(extr, url, result)
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 82, in _run_test
    self.assertEqual(result["url"], tjob.hash_url.hexdigest())
AssertionError: 'cdd4931ac1cd00264b0b54e2e3b0d8f6ae48957e' != '9b0ae01292133268fe9178b71332da1ee25b7704'
- cdd4931ac1cd00264b0b54e2e3b0d8f6ae48957e
+ 9b0ae01292133268fe9178b71332da1ee25b7704

-------------------- >> begin captured stdout << ---------------------

https://arch.b4k.co/meta/thread/196/

--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
b4k: DEBUG: Using B4kThreadExtractor for 'https://arch.b4k.co/meta/thread/196/'
urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): arch.b4k.co:443
urllib3.connectionpool: DEBUG: https://arch.b4k.co:443 "GET /_/api/chan/thread/?board=meta&num=196 HTTP/1.1" 200 None
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 502 tests in 651.801s

FAILED (errors=3, failures=1)
[...]

mikf added a commit that referenced this pull request May 9, 2019
Simplified universal serialization support in json.dump() can be achieved
by passing 'default=str', which was already the case in DataJob.run()
for -j/--dump-json, but not for the 'metadata' post-processor.

This commit introduces util.dump_json() that (more or less) unifies the
JSON output procedure of both --write-metadata and --dump-json.

(#251, #252)
mikf (Owner) commented May 9, 2019

Thanks for your quick fix, but I'd rather solve this in a slightly different way.

Adding a whole JSONEncoder subclass is kind of overkill when passing default=str to json.dump() does pretty much the same thing. This was already done in the other place where JSON output happens, but for some reason not in the metadata post-processor. I've now added a function that gets used in both places, so they should behave the same (523ebc9).
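
For reference, the default=str approach boils down to the following minimal sketch; this is not the code from 523ebc9, which wraps it in a shared helper:

import datetime
import json
import sys

data = {"date": datetime.datetime(2018, 11, 29, 19, 27, 53)}

# default=str is called for every object json.dump() cannot serialize itself,
# so datetime values come out as str(datetime), e.g. "2018-11-29 19:27:53"
json.dump(data, sys.stdout, default=str)

One small behavioural difference compared to an isoformat()-based encoder: str() formats datetimes with a space instead of the 'T' separator.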

mikf closed this May 9, 2019
iamleot (Contributor, Author) commented May 9, 2019 via email
