Some original headers are getting lost #98

Closed
Mr0grog opened this issue Oct 29, 2022 · 1 comment · Fixed by #101
Labels
bug Something isn't working

Comments

Member

Mr0grog commented Oct 29, 2022

It looks like something has changed about either Requests or the Wayback Machine, and we are no longer including all the original archived headers in a Memento object’s headers property. For example:

from wayback import WaybackClient
c = WaybackClient()

memento = c.get_memento('https://robbrackett.com/', datetime='20220315020402')
memento.headers
# {'Content-Type': 'text/html'}

But the value of memento.headers should really be something like:

{'date': 'Tue, 15 Mar 2022 02:04:02 GMT', 'server': 'Apache', 'upgrade': 'h2,h2c', 'connection': 'Upgrade, Keep-Alive', 'last-modified': 'Mon, 30 Nov 2020 22:51:03 GMT', 'accept-ranges': 'bytes', 'content-length': '13182', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=15, max=768', 'Content-Type': 'text/html'}

(Based on https://web.archive.org/web/20220315020402id_/http://robbrackett.com/)

Member Author

Mr0grog commented Oct 29, 2022

At a quick glance, it looks like archive.org has started returning x-archive-orig-* headers with lower-case header names, and we are looking for capitalized ones (which they used to be):

wayback/wayback/_models.py

Lines 273 to 277 in bce65fd

prefix = 'X-Archive-Orig-'
headers = {
    key[len(prefix):]: value for key, value in raw_headers.items()
    if key.startswith(prefix)
}

I’m guessing this started happening when they added HTTP/2 support (in HTTP/2, all header names are lower-case). That said, we can’t just switch to looking for lower-case here, since archive.org’s HTTP/1.1 responses still include upper-cased names for standard headers like Date and Location.
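One way around this would be to compare header names case-insensitively rather than picking one casing. The following is only a rough sketch, not necessarily the change that eventually landed; `extract_original_headers` is a hypothetical helper name, and `raw_headers` is assumed to be a plain mapping of header names to values:

```python
def extract_original_headers(raw_headers, prefix='x-archive-orig-'):
    """Return archived headers with the prefix stripped.

    Hypothetical helper: matches the prefix regardless of case, so both
    'X-Archive-Orig-Date' and 'x-archive-orig-date' are picked up.
    """
    prefix = prefix.lower()
    return {
        # Keep the original casing of whatever follows the prefix.
        key[len(prefix):]: value
        for key, value in raw_headers.items()
        if key.lower().startswith(prefix)
    }


raw = {
    'x-archive-orig-date': 'Tue, 15 Mar 2022 02:04:02 GMT',
    'X-Archive-Orig-Server': 'Apache',
    'Content-Type': 'text/html',
}
print(extract_original_headers(raw))
```

Note that this only fixes the matching step. Lookups on the resulting dict are still case-sensitive, so `headers['Date']` and `headers['date']` would behave differently.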

@Mr0grog Mr0grog added the bug Something isn't working label Oct 29, 2022
Mr0grog added a commit that referenced this issue Oct 31, 2022
This fixes #98, which was caused by two changes:

1. The Internet Archive now returns most *archived* header names (i.e. those prefixed with 'x-archive-orig-') in lower-case.
2. In HTTP/2 (now possible since we are using HTTPS as of #97), *all* headers are lower-case/case-insensitive.

This also meant I needed to make the `Memento.headers` attribute case-insensitive. I've implemented that using code largely taken from Requests, since their implementation is not public so we can't just use it directly (plus we plan to switch off of Requests at some point anyway).
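
For readers unfamiliar with the idea, a case-insensitive header mapping can be sketched roughly as below. This is an illustration only, not the code from the commit; the class name `CaseInsensitiveHeaders` is hypothetical, and the approach (storing entries under a lower-cased key while remembering the original name) is an assumption about the general technique:

```python
from collections.abc import MutableMapping


class CaseInsensitiveHeaders(MutableMapping):
    """Hypothetical sketch of a mapping whose string keys ignore case."""

    def __init__(self, data=None):
        # Maps lower-cased name -> (name as originally set, value).
        self._store = {}
        if data:
            self.update(data)

    def __setitem__(self, key, value):
        self._store[key.lower()] = (key, value)

    def __getitem__(self, key):
        return self._store[key.lower()][1]

    def __delitem__(self, key):
        del self._store[key.lower()]

    def __iter__(self):
        # Iterate using the casing each header was originally set with.
        return (original for original, _ in self._store.values())

    def __len__(self):
        return len(self._store)

    def __repr__(self):
        return repr(dict(self.items()))


headers = CaseInsensitiveHeaders({'Content-Type': 'text/html', 'date': 'Tue, 15 Mar 2022 02:04:02 GMT'})
print(headers['content-type'], headers['Date'])
```

Keeping the original casing around means iteration and `repr()` still show header names the way they arrived, while lookups work with any casing.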
@Mr0grog Mr0grog closed this as completed in 0b2134b Nov 1, 2022