Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling undefined length in CdxRecord #83

Merged
merged 6 commits into from
Oct 15, 2021

Conversation

8W9aG
Copy link
Contributor

@8W9aG 8W9aG commented Oct 13, 2021

  • Currently there is a bug that prevents the parsing of CdxRecord when length is "-".
  • The following code has the following error:
import wayback
client = wayback.WaybackClient()
records = client.search("www.cnn.com/*", matchType="domain", filter_field="statuscode:200")
len(list(records))
wayback.exceptions.UnexpectedResponseFormat: Could not parse CDX output: "com,cnn)/ 20100218025451 http://www.cnn.com/ text/html 200 JK7O4OWOMDJHBNQJ5U43X5OQRORF6FQB -" (query: {'url': 'www.cnn.com/*', 'matchType': 'domain', 'filter': 'statuscode:200', 'showResumeKey': 'true', 'resolveRevisits': 'true'})

The exception occurs on line 542 of _client.py. The problem is the service returns "-" where length should be, probably indicating that for some reason this data is missing, however the wayback client attempts to parse it to an int. In this case, use None instead.

@Mr0grog
Copy link
Member

Mr0grog commented Oct 14, 2021

Wow, I’m surprised I haven’t run into this before! Could you please add a test? After that, I’ll get this merged and released ASAP.

Since this is covering a pretty narrow scenario, it’s probably better to use a mock (most of the tests use VCR). This one is a decent example:

def test_search_removes_malformed_entries(requests_mock):
"""
The CDX index contains many lines for things that can't actually be
archived and will have no corresponding memento, like `mailto:` and `data:`
URLs. We should be stripping these out.
Because these are rare and hard to get all in a single CDX query that isn't
*huge*, we use a made-up mock for this one instead of a VCR recording. All
the lines in the mock file are lines from real CDX queries (we lost track
of the specific cases that triggered that one, and it was *very* rare).
"""
with open(Path(__file__).parent / 'test_files' / 'malformed_cdx.txt') as f:
bad_cdx_data = f.read()
with WaybackClient() as client:
requests_mock.get('http://web.archive.org/cdx/search/cdx'
'?url=https%3A%2F%2Fepa.gov%2F%2A'
'&from=20200418000000&to=20200419000000'
'&showResumeKey=true&resolveRevisits=true',
[{'status_code': 200, 'text': bad_cdx_data}])
records = client.search('https://epa.gov/*',
from_date=datetime(2020, 4, 18),
to_date=datetime(2020, 4, 19))
assert 2 == len(list(records))

But since this is such a simple case, you probably don’t need the response content in a separate file. You could just have a string right in the source code for the test.

@8W9aG
Copy link
Contributor Author

8W9aG commented Oct 14, 2021

@Mr0grog thanks for the feedback, added a test case

@@ -79,10 +79,10 @@ Ready to contribute? Here's how to set up `wayback` for local development.
5. When you're done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox::

$ flake8 wayback tests
$ python setup.py test
$ pytest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thanks for catching this!

matchType="domain",
filter_field="statuscode:200")

assert 5 == len(list(records))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably also be asserting that the .length property on the last one is None while the others are ints, e.g:

Suggested change
assert 5 == len(list(records))
items = list(records)
assert 5 == len(items)
assert isinstance(items[0].length, int)
assert items[4] is None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will make this change

@@ -539,7 +539,7 @@ def search(self, url, *, matchType=None, limit=None, offset=None,
status_code = None
else:
status_code = int(data.status_code)
length = int(data.length)
length = None if data.length == "-" else int(data.length)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, please use single quotes for non-multiline strings, too. Sorry I missed this last night.

Suggested change
length = None if data.length == "-" else int(data.length)
length = None if data.length == '-' else int(data.length)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Member

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Would you like to be listed as a contributor?

@8W9aG
Copy link
Contributor Author

8W9aG commented Oct 14, 2021

@Mr0grog That sounds great thanks!

@Mr0grog Mr0grog merged commit b791151 into edgi-govdata-archiving:main Oct 15, 2021
Mr0grog added a commit that referenced this pull request Oct 15, 2021
Mr0grog added a commit that referenced this pull request Oct 15, 2021
Mr0grog added a commit that referenced this pull request Oct 15, 2021
Also prep to release v0.3.1.
@Mr0grog
Copy link
Member

Mr0grog commented Oct 15, 2021

You can now grab this from PyPI as version 0.3.1! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants