Mementos of redirects in view mode raise "could not be played" error #109

Closed
rhaksw opened this issue Feb 23, 2023 · 4 comments · Fixed by #110
Labels
bug Something isn't working

Comments

@rhaksw

rhaksw commented Feb 23, 2023

Hi, I'm getting an error from this code,

from wayback import WaybackClient, WaybackSession

wc = WaybackClient(session=WaybackSession(
    user_agent='agent-218947',
    timeout=10,
))
u = 'https://web.archive.org/web/20230212225711/https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/'
memento = wc.get_memento(u, exact=False)

    raise MementoPlaybackError(f'Memento at {url} could not be played')
wayback.exceptions.MementoPlaybackError: Memento at https://web.archive.org/web/20230212225711/https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/ could not be played

The comment in that section of the WaybackClient code states that this error should only occur if exact is True or if the target URL is outside the target_window. I don't think either of those applies, because I'm setting exact to False and the target URL has the same timestamp:

original url / target url (both are 20230212225711)

Anyone know what might cause this?

@Mr0grog
Member

Mr0grog commented Feb 23, 2023

Ah! It turns out the URL you are using effectively sets the mode parameter to Mode.view, which gets you a response designed for viewing in a web browser (it has lots of tweaks and extras and is not the original, archived HTTP response).

The URL you requested was a redirect, but in view mode, the Wayback Machine gives us a normal webpage (not a redirect) with info about where the redirect is going and pauses for a few seconds before redirecting with JavaScript. I obviously haven’t done rigorous-enough testing with that playback mode (we almost always use the default, which is mode=Mode.original); it looks like it’s going to be a bit tricky to detect this scenario in a way that works even if the design of the Wayback Machine’s redirect page changes.

That said, did you intend to use mode=Mode.view? If not, you should either:

  1. (Recommended) Don’t use the full Internet Archive URL when requesting a memento. Instead, use the URL of the page you want and the timestamp parameter:

    url = 'https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/'
    client.get_memento(url, timestamp='20230212225711', exact=False)

  2. Or make sure to append id_ to the end of the timestamp portion of the URL to set the mode:

    url = 'https://web.archive.org/web/20230212225711id_/https://www.reddit.com/r/Suomi/comments/110nd1i/mink%c3%a4_takia_kommentit_ei_aina_n%c3%a4y_redditiss%c3%a4/j8arudd/'
    client.get_memento(url, exact=False)
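
To make the two options above concrete, here is a minimal sketch of how a Wayback Machine URL breaks down into a timestamp, an optional mode flag, and the archived page's URL. The helper name `split_memento_url` and the regular expression are assumptions made up for illustration; they are not part of the wayback package, and real archive URLs may have shapes this pattern does not cover.

```python
import re

# Hypothetical helper for illustration only; NOT part of the wayback package.
# Assumed URL shape: https://web.archive.org/web/<timestamp><mode flag>/<original URL>
MEMENTO_URL = re.compile(
    r'^https?://web\.archive\.org/web/'
    r'(?P<timestamp>\d+)'       # e.g. 20230212225711
    r'(?P<mode>[a-z]{2}_)?'     # e.g. id_; absent means view mode
    r'/(?P<url>.+)$'            # the page that was archived
)


def split_memento_url(memento_url):
    """Return (timestamp, mode_flag, original_url) for an archive URL."""
    match = MEMENTO_URL.match(memento_url)
    if match is None:
        raise ValueError(f'Not a Wayback Machine memento URL: {memento_url}')
    # An unmatched optional group is None; normalize it to an empty string.
    return match['timestamp'], match['mode'] or '', match['url']
```

With a helper like this, a URL with no flag after the timestamp yields an empty mode (view mode, per the explanation above), while one ending in `id_` yields `'id_'`. The timestamp and original URL it returns are the values that option 1 passes to `get_memento()` separately.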

Mr0grog added the bug label Feb 23, 2023
Mr0grog added a commit that referenced this issue Feb 23, 2023
Redirects are messy in view mode (see #109). This adds a test for them, even though it currently fails.
Mr0grog added a commit that referenced this issue Feb 23, 2023
When requesting mementos of redirects in `view` mode, we get back a web page that redirects in JavaScript and that is missing some important memento headers, causing us to raise a pretty unexpected error. This attempts to work around the issue and detect that a page is a memento of a redirect.

This is a first pass, and probably needs some more cleanup.

Fixes #109.
Mr0grog changed the title from "Memento at ... could not be played" to "Mementos of redirects in view mode raise "could not be played" error" Feb 23, 2023
@rhaksw
Author

rhaksw commented Feb 24, 2023

Thank you for this detailed explanation. You're correct that I intended to use mode=Mode.original. Now I've done that via the second solution you gave. I didn't mention it before, but the URL I was using came from the view_url of results of search(). I switched to use the raw_url. Maybe I'll go back later and use the original url and timestamp instead as you recommend.
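
As a rough sketch of how the two URL forms differ, assuming (based on the discussion above) that the raw form carries the `id_` flag after the timestamp and the view form carries no flag: the function name `build_memento_url` and its parameters are hypothetical and are not the wayback package's API.

```python
from datetime import datetime, timezone


def build_memento_url(url, timestamp, mode_flag='id_'):
    """Hypothetical helper: format an archive URL for a capture.

    mode_flag='id_' is assumed to request the original archived response;
    mode_flag='' is assumed to request the browser-oriented view mode.
    """
    stamp = timestamp.strftime('%Y%m%d%H%M%S')
    return f'https://web.archive.org/web/{stamp}{mode_flag}/{url}'


captured = datetime(2023, 2, 12, 22, 57, 11, tzinfo=timezone.utc)
raw_url = build_memento_url('https://example.com/', captured)
view_url = build_memento_url('https://example.com/', captured, mode_flag='')
```

Here `raw_url` contains `20230212225711id_` and `view_url` contains `20230212225711` with no flag, which is the only difference between the two.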

@Mr0grog
Member

Mr0grog commented Feb 24, 2023

> the URL I was using came from the view_url of results of search(). I switched to use the raw_url.

If you are using CdxRecord objects from the search() method, you can just pass them directly to get_memento() and it’ll pull out the right values for you! It’s a little easier that way:

for record in client.search('https://somewhere.com/', ...):
    client.get_memento(record, exact=False)  # gets `original` mode by default
    # or: client.get_memento(record, mode=wayback.Mode.view, exact=False)

@rhaksw
Author

rhaksw commented Feb 24, 2023

Thank you, that is indeed easier. I must've missed it when first reading the docs and coding this up.

Mr0grog added a commit that referenced this issue Feb 25, 2023
When requesting mementos of redirects in `view` mode, we get back a web page that redirects in JavaScript and that is missing some important memento headers, causing us to raise a pretty unexpected error. This attempts to work around the issue and detect that a page is a memento of a redirect when in view mode.

This is a first pass, and probably needs some more cleanup.

Fixes #109.