ValueError: time data #88

Closed
edsu opened this issue Sep 24, 2022 · 2 comments · Fixed by #89

Contributor

edsu commented Sep 24, 2022

I happened to be doing this:

from wayback import WaybackClient

ia = WaybackClient()
for result in ia.search('lapdonline.org', matchType='prefix'):
    print(result)

and noticed that after running for 10 minutes or so it blew up with:

Traceback (most recent call last):
  File "/Users/edsummers/Projects/wayback/wayback/_client.py", line 543, in search
    capture_time = _utils.parse_timestamp(data.timestamp)
  File "/Users/edsummers/Projects/wayback/wayback/_utils.py", line 57, in parse_timestamp
    .strptime(''.join(timestamp_chars), URL_DATE_FORMAT)
  File "/usr/local/Cellar/python@3.10/3.10.6_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/usr/local/Cellar/python@3.10/3.10.6_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '20000008241731' does not match format '%Y%m%d%H%M%S'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/edsummers/Projects/wayback/./x.py", line 6, in <module>
    for result in ia.search('lapdonline.org', matchType='prefix'):
  File "/Users/edsummers/Projects/wayback/wayback/_client.py", line 547, in search
    raise UnexpectedResponseFormat(
wayback.exceptions.UnexpectedResponseFormat: Could not parse CDX output: "org,lapdonline)/community/op_valley_bureau/north_hollywood/map/map.htm 20000008241731 http://www.lapdonline.org:80/community/op_valley_bureau/north_hollywood/map/map.htm text/html 200 2GPKQMU3BLZXOEZ5EWDQEYHPMKWEHNT3 1158" (query: {'url': 'lapdonline.org', 'matchType': 'prefix', 'showResumeKey': 'true', 'resolveRevisits': 'true'})

It looks like the CDX API returned a timestamp of 20000008241731, which raises an exception during parsing because 00 isn't a valid month?
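
For anyone who wants to reproduce the parsing failure without a ten-minute CDX query, here is a minimal sketch using only the standard library. `URL_DATE_FORMAT` is set to the format string shown in the traceback; `try_parse` is a hypothetical helper for this example, not part of wayback.

```python
from datetime import datetime

# Format string copied from the traceback above.
URL_DATE_FORMAT = '%Y%m%d%H%M%S'


def try_parse(timestamp):
    """Hypothetical helper: return a datetime, or None if parsing fails."""
    try:
        return datetime.strptime(timestamp, URL_DATE_FORMAT)
    except ValueError as error:
        print(f'Could not parse {timestamp!r}: {error}')
        return None


# The timestamp from the failing CDX record has `00` where the month should be.
print(try_parse('20000008241731'))
# The same digits with the `00` removed and `00` seconds appended parse fine.
print(try_parse('20000824173100'))
```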

I don't know what the solution is here:

  • ignore the record?
  • see if the new CDX API is better behaved and switch to it?
  • something else?
Contributor Author

edsu commented Sep 24, 2022

I see in _utils.parse_timestamp there is already some logic to guard against a day of 00. But I'm confused by the logic.

In the test it looks like this timestamp 20000800241623 has the day 00 removed leaving 200008241623 and then 00 is appended on the end leaving 20000824162300. This means the date 2000-08-00 24:16:23 is corrected to 2000-08-24 16:23:00?

Did someone from the Internet Archive indicate that this was a good way to handle the situation? Absent any other information I would have been inclined to rewrite the 00 as 01, and also log a warning? But in this case that would have left an hour of 24 which is invalid. Was the 20000800241623 timestamp actually found in the wild, or was it fabricated for the test?
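
To make the two options in this comment concrete, here is a rough sketch comparing them on the timestamp from the test. Both function names are hypothetical and neither is the code in `_utils.parse_timestamp`; they only mirror the behavior described above.

```python
from datetime import datetime

URL_DATE_FORMAT = '%Y%m%d%H%M%S'


def drop_zero_day(timestamp):
    """Hypothetical: remove a `00` day and append `00` as the seconds."""
    if timestamp[6:8] == '00':
        return timestamp[:6] + timestamp[8:] + '00'
    return timestamp


def zero_day_to_first(timestamp):
    """Hypothetical: rewrite a `00` day as `01`, leaving the rest untouched."""
    if timestamp[6:8] == '00':
        return timestamp[:6] + '01' + timestamp[8:]
    return timestamp


original = '20000800241623'

# Option 1: the behavior the existing test expects.
shifted = drop_zero_day(original)
print(shifted, '->', datetime.strptime(shifted, URL_DATE_FORMAT))

# Option 2: rewriting `00` as `01` still leaves an hour of 24, which cannot parse.
patched = zero_day_to_first(original)
try:
    datetime.strptime(patched, URL_DATE_FORMAT)
except ValueError as error:
    print(patched, '-> still invalid:', error)
```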

edsu added a commit to edsu/wayback that referenced this issue Sep 24, 2022
This commit extends the existing logic for handling invalid days of `00` to months that are `00`. It also adds a warning to be logged in both situations.

So a timestamp of `20200001120000` will get rewritten to `20200112000000` prior to conversion to a datetime.

I have tested on live CDX API data that was failing, and this fix causes the
full result to be returned. If more information is known about why this
approach is taken it would be good to add in a comment?

Closes edgi-govdata-archiving#88
Member

Mr0grog commented Sep 24, 2022

We talked on Slack, but summarizing here for transparency…

> Did someone from the Internet Archive indicate that this was a good way to handle the situation? Absent any other information I would have been inclined to rewrite the 00 as 01

Yep! Rolling the 24 over to a day of 01 and an hour of 00 was what we initially thought was right, but someone from the Archive found the actual archived content and it turned out that was not correct — somehow an extra 00 had just been stuck in the middle. More here: #85 (comment)

IIRC, nobody was sure whether it was good to generalize or not, but since a 00 day will always be invalid, this seemed as good a fix as any.

> Was the 20000800241623 timestamp actually found in the wild, or was it fabricated for the test?

In the wild! More details in #85.

Mr0grog pushed a commit that referenced this issue Sep 30, 2022
Mr0grog pushed a commit to edsu/wayback that referenced this issue Sep 30, 2022
Mr0grog added a commit that referenced this issue Sep 30, 2022
This commit extends the existing logic for handling invalid days of `00` to months that are `00`. It also adds a warning to be logged in both situations. For example, a timestamp of `20200001120000` will get rewritten to `20200112000000` prior to conversion to a datetime.

Fixes #88

Co-authored-by: Rob Brackett <[email protected]>
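
The rewrite described in the commit message above could look roughly like the sketch below. This is an illustration under assumptions, not the merged code: `repair_timestamp`, the logger setup, and the warning text are all hypothetical, and the only behavior taken from this thread is "drop a `00` month or day, append `00`, and log a warning".

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

URL_DATE_FORMAT = '%Y%m%d%H%M%S'


def repair_timestamp(timestamp):
    """Hypothetical: drop a `00` month or day and pad the end with `00`."""
    repaired = timestamp
    # In a YYYYMMDDHHMMSS string the month starts at index 4, the day at 6.
    for part, start in (('month', 4), ('day', 6)):
        if repaired[start:start + 2] == '00':
            repaired = repaired[:start] + repaired[start + 2:] + '00'
            logger.warning('Timestamp %s has a %s of "00"; rewriting it as %s',
                           timestamp, part, repaired)
    return repaired


# The example from the commit message, then the record that triggered this issue.
for raw in ('20200001120000', '20000008241731'):
    fixed = repair_timestamp(raw)
    print(raw, '->', fixed, '->', datetime.strptime(fixed, URL_DATE_FORMAT))
```

Timestamps with a valid month and day pass through unchanged, so a helper like this could run before every parse.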