Fix bad timestamps #85

Merged: 2 commits, Nov 17, 2021

Contributor

@8W9aG 8W9aG commented Oct 18, 2021

  • Currently, a bug prevents a CdxRecord from being parsed when its timestamp is invalid, for example "20000800241623". This timestamp is not a valid date and time because the day is 00 and the hour is 24.
  • The following code raises the error shown below it:
import wayback
client = wayback.WaybackClient()
records = client.search("www.usatoday.com/*", matchType="domain", filter_field="statuscode:200")
len(list(records))
wayback.exceptions.UnexpectedResponseFormat: Could not parse CDX output: "com,usatoday)/2000/century/tech/004.htm 20000800241623 http://www.usatoday.com:80/2000/century/tech/004.htm text/html 200 PAJWSPCRQMVBTYWV4NPJPNDQHKWJC3OO 6177" (query: {'url': 'www.usatoday.com/*', 'matchType': 'domain', 'filter': 'statuscode:200', 'showResumeKey': 'true', 'resolveRevisits': 'true'})

The exception occurs on line 549 of _client.py. The service returns an invalid timestamp, for reasons I cannot fathom. This tends to happen only on older logs, which probably indicates a problem with certain records. However, the wayback client attempts to parse the timestamp into a datetime without any attempt at correction. This PR attempts to correct the flaws before parsing.
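To make the idea concrete, here is a minimal sketch of one way to correct such a timestamp before building a datetime. It is not the code in this PR: the function name `parse_lenient_timestamp` is hypothetical, and clamping each out-of-range field to the nearest valid value (day 00 becomes 01, hour 24 becomes 23) and treating the result as UTC are assumptions made for illustration. A strict `datetime.strptime(raw, "%Y%m%d%H%M%S")` rejects the example value with a ValueError.

```python
import calendar
from datetime import datetime, timezone


def parse_lenient_timestamp(raw):
    """Parse a 14-digit YYYYMMDDhhmmss timestamp, clamping bad fields.

    Hypothetical helper for illustration only. Clamping is an assumed
    repair strategy; it is not necessarily what the library does.
    """
    if len(raw) != 14 or not raw.isdigit():
        raise ValueError(f"Not a 14-digit timestamp: {raw!r}")

    year = int(raw[0:4])
    # Clamp the month into 1..12.
    month = min(max(int(raw[4:6]), 1), 12)
    # Clamp the day into 1..(last day of that month and year).
    last_day = calendar.monthrange(year, month)[1]
    day = min(max(int(raw[6:8]), 1), last_day)
    # Clamp the time fields to their maximum valid values.
    hour = min(int(raw[8:10]), 23)
    minute = min(int(raw[10:12]), 59)
    second = min(int(raw[12:14]), 59)

    # Assumption: the timestamp is interpreted as UTC.
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)


print(parse_lenient_timestamp("20000800241623"))
# 2000-08-01 23:16:23+00:00
```

Clamping keeps the repaired value as close as possible to the original digits. An alternative would be to roll hour 24 over to 00 on the following day, which shifts the date instead of the time.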

Member

@Mr0grog Mr0grog left a comment

I thought I’d already hit a lot of the weird cases, but you’re finding issues I haven’t seen left and right! Nice find. I am clearly working on too narrow a dataset. 😜

I posted on the Internet Archive Slack about this to see if they had any guidance on the most correct way to handle this case, since they might know of more ways the timestamp could be malformed we should account for, or if this just means the record is bad and we should skip it (for example, no Memento is returned for this URL + timestamp). I’ll probably wait a little bit to see if I get some feedback from them before merging. That said, I think what you’re doing here makes sense.

wayback/_client.py (outdated review thread, resolved)
* Delete this data
* Add 00 to the end of the timestamp
Member

@Mr0grog Mr0grog left a comment

Sorry, I completely lost track of this. The Archive folks never wound up having any better thoughts about general approaches here, so I think this is good!

@Mr0grog Mr0grog merged commit 02099ca into edgi-govdata-archiving:main Nov 17, 2021
@8W9aG 8W9aG deleted the fix-bad-timestamps branch November 17, 2021 23:38
@Mr0grog Mr0grog mentioned this pull request Sep 24, 2022