Use the Link header to get a memento’s URL #108

Merged
Mr0grog merged 5 commits into main from 99-the-url-in-the-request-might-not-be-the-captured-url
Nov 14, 2022


@Mr0grog Mr0grog commented Nov 12, 2022

This fixes an issue where the Memento.url property could be slightly incorrect because it was based on the URL the memento was requested from (e.g. https://web.archive.org/web/20221010000000/<url>) rather than the URL the memento was actually captured from. The Wayback Machine matches a requested URL to capture records by SURT key rather than by the exact URL, so the two can differ.

For example, requesting an archived copy of http://fws.gov/ might return a capture of https://www.fws.gov/ instead. The returned Memento object’s url property used to be http://fws.gov/ in this case, but this changes it to be https://www.fws.gov/.

Fixes #99.

This adds a test for #99, which currently fails.
When getting a memento, parse the `Link` header to get the URL the Memento is a capture of, rather than parsing the Memento's URL, which isn't necessarily accurate. Fixes #99.
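To illustrate the idea, here is a minimal sketch of pulling the captured URL out of a `Link` header by looking for the `rel="original"` entry. It is not the code from this PR: the helper name `parse_link_header`, the regular expressions, and the sample header value are all assumptions made for the example, and the parser only handles the simple `<url>; name="value"` shape.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
# Quoted values may contain commas (e.g. the datetime parameter).
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+(?:\s*=\s*(?:"[^"]*"|[^",;\s]*))?)*)'
)
_PARAM_RE = re.compile(r';\s*([\w-]+)(?:\s*=\s*(?:"([^"]*)"|([^",;\s]*)))?')


def parse_link_header(header):
    """Hypothetical helper: map each rel value to its URL and other params."""
    links = {}
    for match in _LINK_RE.finditer(header):
        url, raw_params = match.groups()
        params = {}
        for name, quoted, bare in _PARAM_RE.findall(raw_params):
            params[name.lower()] = quoted or bare
        extra = {k: v for k, v in params.items() if k != 'rel'}
        # A single link may carry several space-separated rel values.
        for rel in params.get('rel', '').split():
            links[rel] = {'url': url, **extra}
    return links


# Illustrative header value, not copied from a real response.
sample_header = (
    '<https://www.fws.gov/>; rel="original", '
    '<https://web.archive.org/web/20221010000000/https://www.fws.gov/>; '
    'rel="memento"; datetime="Mon, 10 Oct 2022 00:00:00 GMT"'
)

links = parse_link_header(sample_header)
captured_url = links['original']['url']
```

With a header like the sample, `captured_url` is the URL the capture was made from (`https://www.fws.gov/`), even if the memento was requested as `http://fws.gov/`.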

Mr0grog commented Nov 12, 2022

I think I might hold off on merging this for the weekend and consider adding a links property to Memento that is the parsed links, since I had to make use of them for this anyway. (See #57)

It also might make sense to parse the Memento-Datetime header (we can use Python’s built-in email.utils.parsedate_to_datetime function for that). OTOH, I don’t see any situation where I’d expect the timestamp we are already parsing from the Memento URL to be inaccurate like the capture URL was, so this may not really be meaningful.
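For reference, a small sketch of what parsing `Memento-Datetime` with `email.utils.parsedate_to_datetime` could look like, next to the timestamp taken from the memento URL. The helper name `timestamp_from_memento_url`, the regular expression, and the sample values are assumptions for illustration, not code from this PR.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def timestamp_from_memento_url(memento_url):
    """Hypothetical helper: read the 14-digit timestamp from a /web/<ts>/ URL."""
    match = re.search(r'/web/(\d{14})', memento_url)
    if match is None:
        raise ValueError(f'No timestamp found in {memento_url!r}')
    parsed = datetime.strptime(match.group(1), '%Y%m%d%H%M%S')
    # Assumes the timestamp in the URL is in UTC.
    return parsed.replace(tzinfo=timezone.utc)


# Illustrative values, not copied from a real response.
header_value = 'Mon, 10 Oct 2022 00:00:00 GMT'
memento_url = 'https://web.archive.org/web/20221010000000/https://www.fws.gov/'

from_header = parsedate_to_datetime(header_value)
from_url = timestamp_from_memento_url(memento_url)
timestamps_agree = from_header == from_url
```

If the two values always agree, as expected above, parsing the header adds little beyond a consistency check.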

@Mr0grog Mr0grog linked an issue Nov 14, 2022 that may be closed by this pull request
@Mr0grog Mr0grog merged commit 6cb13f9 into main Nov 14, 2022
@Mr0grog Mr0grog deleted the 99-the-url-in-the-request-might-not-be-the-captured-url branch November 14, 2022 20:23
This pull request was closed.