Wayback redirects without scheme + domain don’t work #59

Closed
Mr0grog opened this issue Nov 2, 2020 · 0 comments · Fixed by #60
Labels
bug Something isn't working

Mr0grog commented Nov 2, 2020

New bug in v0.3.0a1:

Some Wayback redirects use a Location: header with a scheme and domain, e.g.:

Location: http://web.archive.org/web/20201027215555id_/https://www.whitehouse.gov/administration/eop/ostp/about/student/faqs

But others don’t, e.g.:

Location: /web/20201027215555id_/https://www.whitehouse.gov/ostp/about/student/faqs

The latter will cause Wayback v0.3.0a1 to fail when trying to parse the headers:

>>> import wayback
>>> c = wayback.WaybackClient()
>>> c.get_memento('http://web.archive.org/web/20201027215555id_/https://www.whitehouse.gov/administration/eop/ostp/about/student/faqs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rbrackett/Dev/datarescue/wayback/wayback/_client.py", line 724, in get_memento
    headers=Memento.parse_memento_headers(response.headers),
  File "/Users/rbrackett/Dev/datarescue/wayback/wayback/_models.py", line 285, in parse_memento_headers
    headers['Location'], _, _ = memento_url_data(raw_headers['Location'])
  File "/Users/rbrackett/Dev/datarescue/wayback/wayback/_utils.py", line 122, in memento_url_data
    raise ValueError(f'"{memento_url}" is not a memento URL')
ValueError: "/web/20201027215555id_/https://www.whitehouse.gov/ostp/about/student/faqs" is not a memento URL
Mr0grog added the bug label Nov 2, 2020
Mr0grog added a commit that referenced this issue Nov 3, 2020
Most redirects in Wayback redirect to a complete URL, with headers like:

    Location: http://web.archive.org/web/20201027215555id_/https://www.whitehouse.gov/administration/eop/ostp/about/student/faqs

But some include only an absolute path (which is still valid), e.g.:

    Location: /web/20201027215555id_/https://www.whitehouse.gov/ostp/about/student/faqs

We weren't correctly handling the latter case, leading to exceptions while parsing headers.

Fixes #59.
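
For illustration, one common way to handle both forms of `Location:` header is to resolve the header value against the URL that was requested before parsing it. The sketch below uses only the standard library's `urllib.parse.urljoin`. The helper name `resolve_location` is hypothetical and is not taken from the wayback codebase, so this should not be read as the actual fix in #60.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Resolve a Location header value against the URL that was requested.

    Hypothetical helper for illustration only. A complete URL comes back
    unchanged; an absolute path picks up the scheme and domain of the
    request URL.
    """
    return urljoin(request_url, location)


# URLs copied from the issue description above.
request_url = (
    'http://web.archive.org/web/20201027215555id_/'
    'https://www.whitehouse.gov/administration/eop/ostp/about/student/faqs'
)
relative = '/web/20201027215555id_/https://www.whitehouse.gov/ostp/about/student/faqs'

absolute = resolve_location(request_url, relative)
print(absolute)
```

After this step the value has a scheme and domain in both cases, so it has the same shape as the complete-URL form that was already parsed successfully.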
Mr0grog added a commit that referenced this issue Nov 4, 2020
Most redirects in Wayback redirect to a complete URL, with headers like:

    Location: http://web.archive.org/web/20201027215555id_/https://www.whitehouse.gov/administration/eop/ostp/about/student/faqs

But some include only an absolute path (which is still valid), e.g.:

    Location: /web/20201027215555id_/whitehouse.gov/ostp/about/student/faqs

We weren't correctly handling the latter case, leading to exceptions while parsing headers.

Fixes #59.