Memento.url property can be wrong if it is SURT-equivalent to the actual URL #99

Closed
Mr0grog opened this issue Oct 29, 2022 · 0 comments · Fixed by #108

Mr0grog commented Oct 29, 2022

If you request a memento with a URL that is SURT-equivalent to the memento’s actual URL but not identical to it, the url property of the resulting memento object is incorrect: it reflects the URL you requested rather than the URL that was actually captured.

For example:

from wayback import WaybackClient
c = WaybackClient()

memento = c.get_memento('http://robbrackett.com/', datetime='20220315020402')
memento.url
# 'http://robbrackett.com/'
# But the actual capture was from:
# 'https://robbrackett.com/'

# The `link` header has the right info:
memento._raw_headers['link']
# '<https://robbrackett.com/>; rel="original", ...'

The right details are in the Link header, and we should be parsing that. We’ve had a feature request to do that for a while (#57), but I hadn’t realized there was a bug like this that can only be properly fixed by doing so.
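As a rough illustration of what parsing that header could look like, here is a minimal sketch. It is not the library's implementation: `parse_link_header` and `original_url` are hypothetical helper names, the sample header is made up to resemble the one above, and the regular expressions only cover the common `<url>; name="value"` shape rather than the full header grammar.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
# Quoted values may contain commas (e.g. datetimes), so we cannot simply
# split the header on ",".
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+(?:\s*=\s*(?:"[^"]*"|[^\s;,]*))?)*)'
)
_PARAM_RE = re.compile(r';\s*([\w-]+)(?:\s*=\s*(?:"([^"]*)"|([^\s;,]*)))?')


def parse_link_header(value):
    """Map each rel name to a dict holding its target URL and parameters.

    Hypothetical helper. If a rel name appears more than once, the later
    entry overwrites the earlier one.
    """
    links = {}
    for match in _LINK_RE.finditer(value):
        target, raw_params = match.groups()
        params = {}
        for name, quoted, bare in _PARAM_RE.findall(raw_params):
            params[name.lower()] = quoted or bare
        # A rel value can list several relation types, e.g. "first memento".
        for rel in params.get('rel', '').split():
            links[rel] = {'url': target, **params}
    return links


def original_url(link_header):
    """Return the rel="original" target, or None if there is none."""
    original = parse_link_header(link_header).get('original')
    return original['url'] if original else None


# Made-up sample header, shaped like the one shown above.
sample = (
    '<https://robbrackett.com/>; rel="original", '
    '<https://web.archive.org/web/20220315020402/https://robbrackett.com/>; '
    'rel="memento"; datetime="Tue, 15 Mar 2022 02:04:02 GMT"'
)
print(original_url(sample))
```

With something like this, the url property could be taken from the `original` relation instead of from the requested URL.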

@Mr0grog Mr0grog added the bug Something isn't working label Oct 29, 2022
@Mr0grog Mr0grog added this to the v0.4.x milestone Nov 10, 2022
Mr0grog added a commit that referenced this issue Nov 12, 2022
This adds a test for #99, which currently fails.
Mr0grog added a commit that referenced this issue Nov 12, 2022
When getting a memento, parse the `Link` header to get the URL the Memento is a capture of, rather than parsing the Memento's URL, which isn't necessarily accurate. Fixes #99.
Mr0grog added a commit that referenced this issue Nov 14, 2022
This fixes an issue where the `Memento.url` property could be slightly incorrect, since it was based on the URL you requested the memento from (e.g. `https://web.archive.org/web/20221010000000/<url>`), rather than the actual URL the memento was captured from. The URL the memento is requested from matches records via SURT key rather than the URL.

For example, requesting an archived copy of `http://fws.gov/` might return a capture of `https://www.fws.gov/` instead. The returned Memento object’s `url` property used to be `http://fws.gov/` in this case, but this changes it to be `https://www.fws.gov/`.

Since this required checking the `Link` header, I also went ahead and made the parsed `links` data available on `Memento`.

Fixes #57.
Fixes #99.
@Mr0grog Mr0grog modified the milestones: v0.5.x, v0.4.x Dec 13, 2023
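To make the SURT-key matching mentioned in the commit message above more concrete, here is a heavily simplified sketch of why two different URLs can resolve to the same record. `simple_surt_key` is a hypothetical function written for this illustration. It only drops the scheme, strips a leading `www.`, and reverses the host labels, so it should be read as an approximation of the idea and not as the canonicalization the archive actually performs.

```python
from urllib.parse import urlsplit


def simple_surt_key(url):
    """Build a rough SURT-style key for a URL (illustrative only)."""
    parts = urlsplit(url)
    host = (parts.hostname or '').lower()
    # The scheme is ignored entirely and a leading "www." is dropped,
    # which is why http://fws.gov/ and https://www.fws.gov/ collide.
    if host.startswith('www.'):
        host = host[4:]
    reversed_host = ','.join(reversed(host.split('.')))
    key = f'{reversed_host}){parts.path or "/"}'
    # Real canonicalizers normalize the path and query further;
    # this sketch passes them through unchanged.
    if parts.query:
        key += '?' + parts.query
    return key


print(simple_surt_key('http://fws.gov/'))
print(simple_surt_key('https://www.fws.gov/'))
```

Both calls produce the same key, so a lookup by key cannot tell which of the two URLs was actually captured. That information has to come from the response itself.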