Handle historical redirects in view mode (#110)
When requesting mementos of redirects in `view` mode, we get back a web page that redirects in JavaScript and that is missing some important memento headers, causing us to raise a pretty unexpected error. This attempts to work around the issue and detect that a page is a memento of a redirect when in view mode.

This is a first pass, and probably needs some more cleanup.

Fixes #109.
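
The detection described above can be sketched with only the standard library. This is a simplified illustration, not the code from this commit: `find_redirect_target`, the sample HTML, and the example URLs are all made up for the sketch, and a real Wayback redirect page will contain far more markup. It also skips the status-code and `x-archive-src` header checks that the commit performs, and it writes the final character class as `[\s>]`, since a `|` inside a character class is a literal pipe rather than alternation.

```python
import re
from urllib.parse import urljoin

# Same text pattern the commit uses to recognize a view-mode redirect page.
REDIRECT_PAGE_PATTERN = re.compile(
    r'Got an? HTTP 3\d\d response at crawl time', re.IGNORECASE
)


def find_redirect_target(html, page_url, timestamp):
    """Hypothetical helper: return the redirect target URL, or None."""
    if not REDIRECT_PAGE_PATTERN.search(html):
        return None

    # Only accept links to archived pages that share this page's timestamp.
    match = re.search(fr'''
        <a\s                    # <a> element
        (?:[^>\s]+\s)*          # Possible other attributes
        href=(["\'])            # href attribute and opening quote
        (                       # Archived URL with the same timestamp
            (?:https?://[^/]+)? # Optional scheme and host
            /web/{timestamp}/.*?
        )
        \1                      # Matching closing quote
        [\s>]                   # Whitespace or end of the opening tag
    ''', html, re.VERBOSE | re.IGNORECASE)
    if match is None:
        return None

    # Relative links are resolved against the URL of the page itself.
    return urljoin(page_url, match.group(2))


# Made-up page that imitates the shape of a view-mode redirect page.
SAMPLE_PAGE = '''
<html><body>
  <p>Got an HTTP 301 response at crawl time</p>
  <a id="target" href="/web/20230225000000/https://example.com/new">new page</a>
</body></html>
'''
PAGE_URL = 'https://web.archive.org/web/20230225000000/http://example.com/old'

target = find_redirect_target(SAMPLE_PAGE, PAGE_URL, '20230225000000')
not_a_redirect = find_redirect_target('<html>Hello</html>', PAGE_URL, '20230225000000')
print(target)
print(not_a_redirect)
```

Restricting the search to links with the same timestamp is what keeps the sketch from picking up unrelated links elsewhere on the page.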
Mr0grog committed Feb 25, 2023
1 parent 6cb13f9 commit 992be2d
Showing 4 changed files with 1,195 additions and 2 deletions.
16 changes: 15 additions & 1 deletion docs/source/release-history.rst
@@ -5,7 +5,14 @@ Release History
In Development
--------------

Fix an issue where the :attr:`Memento.url` attribute might be slightly off (it could have a different protocol, different upper/lower-casing, etc.). (:issue:`99`)
Breaking Changes
^^^^^^^^^^^^^^^^

N/A


Features
^^^^^^^^

:class:`wayback.Memento` now has a ``links`` property with information about other URLs that are related to the memento, such as the previous or next mementos in time. It’s a dict where the keys identify the relationship (e.g. ``'prev memento'``) and the values are dicts with additional information about the link. (:issue:`57`) For example::

@@ -40,6 +47,13 @@ One use for these is to iterate through additional mementos. For example, to get

client.get_memento(memento.links['prev memento']['url'])

Fixes & Maintenance
^^^^^^^^^^^^^^^^^^^

- Fix an issue where the :attr:`Memento.url` attribute might be slightly off (it could have a different protocol, different upper/lower-casing, etc.). (:issue:`99`)

- Fix an error when getting a memento for a redirect in ``view`` mode. If you called :meth:`wayback.WaybackClient.get_memento` with a URL that turned out to be a redirect at the given time and set the ``mode`` option to :attr:`wayback.Mode.view`, you’d get an exception saying “Memento at {url} could not be played.” Now this works just fine. (:issue:`109`)


v0.4.0 (2022-11-10)
-------------------
72 changes: 71 additions & 1 deletion wayback/_client.py
@@ -27,6 +27,7 @@
RetryError,
Timeout)
import time
from urllib.parse import urljoin
from urllib3.connectionpool import HTTPConnectionPool
from urllib3.exceptions import (ConnectTimeoutError,
MaxRetryError,
@@ -156,6 +157,62 @@ def read_and_close(response):
response.close()


REDIRECT_PAGE_PATTERN = re.compile(r'Got an? HTTP 3\d\d response at crawl time', re.IGNORECASE)


def detect_view_mode_redirect(response, current_date):
    """
    Given a response for a page in view mode, detect whether it represents
    a historical redirect and return the target URL or ``None``.

    In view mode, historical redirects aren't served as actual 3xx
    responses. Instead, they are a normal web page that displays information
    about the redirect. After a short delay, JavaScript on the page
    redirects the browser. That obviously doesn't work great for us! The
    goal here is to detect that we got one of those pages and extract the
    URL that was redirected to.

    If the page looks like a redirect but we can't find the target URL,
    this raises an exception.
    """
    if (
        response.status_code == 200
        and 'x-archive-src' in response.headers
        and REDIRECT_PAGE_PATTERN.search(response.text)
    ):
        # The page should have a link to the redirect target. Only look for URLs
        # using the same timestamp to reduce the chance of picking up some other
        # link that isn't about the redirect.
        current_timestamp = _utils.format_timestamp(current_date)
        redirect_match = re.search(fr'''
            <a\s                            # <a> element
            (?:[^>\s]+\s)*                  # Possible other attributes
            href=(["\'])                    # href attribute and quote
            (                               # URL of another archived page with the same timestamp
                (?:(?:https?:)//[^/]+)?     # Optional schema and host
                /web/{current_timestamp}/.*?
            )
            \1                              # End quote
            [\s|>]                          # Space before another attribute or end of element
        ''', response.text, re.VERBOSE | re.IGNORECASE)

        if redirect_match:
            redirect_url = redirect_match.group(2)
            if redirect_url.startswith('/'):
                redirect_url = urljoin(response.url, redirect_url)

            return redirect_url
        else:
            raise WaybackException(
                'The server sent a response in `view` mode that looks like a redirect, '
                'but the URL to redirect to could not be found on the page. Please file '
                'an issue at https://github.com/edgi-govdata-archiving/wayback/issues/ '
                'with details about what happened.'
            )

    return None


#####################################################################
# HACK: handle malformed Content-Encoding headers from Wayback.
# When you send `Accept-Encoding: gzip` on a request for a memento, Wayback
@@ -796,8 +853,21 @@ def get_memento(self, url, timestamp=None, mode=Mode.original, *,
        protocol_and_www = re.compile(r'^https?://(www\d?\.)?')
        memento = None
        while True:
            is_memento = 'Memento-Datetime' in response.headers
            current_url, current_date, current_mode = _utils.memento_url_data(response.url)

            # In view mode, redirects need special handling.
            if current_mode == Mode.view.value:
                redirect_url = detect_view_mode_redirect(response, current_date)
                if redirect_url:
                    # Fix up response properties to be like other modes.
                    redirect = requests.Request('GET', redirect_url)
                    response._next = self.session.prepare_request(redirect)
                    response.headers['Memento-Datetime'] = current_date.strftime(
                        '%a, %d %b %Y %H:%M:%S %Z'
                    )

            is_memento = 'Memento-Datetime' in response.headers

            # A memento URL will match possible captures based on its SURT
            # form, which means we might be getting back a memento captured
            # from a different URL than the one specified in the request.
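
The hunk above synthesizes a `Memento-Datetime` header with `strftime` and `%Z`. As a small illustration of what that format string yields, the sketch below formats a made-up capture time and compares it with the standard library's HTTP-date helper. It assumes `current_date` is a timezone-aware UTC `datetime`, which this diff does not confirm, and it assumes the default C locale for the day and month names.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Made-up capture time, assumed to be timezone-aware UTC.
captured = datetime(2023, 2, 25, 12, 30, 0, tzinfo=timezone.utc)

# The format string used in the commit; %Z renders the zone name.
header_value = captured.strftime('%a, %d %b %Y %H:%M:%S %Z')

# The stdlib helper for HTTP-style dates, which always ends in "GMT".
http_date = format_datetime(captured, usegmt=True)

# Parsing the HTTP-style date gives back the same instant.
round_tripped = parsedate_to_datetime(http_date)

print(header_value)
print(http_date)
```

The two strings differ only in the zone suffix, so the choice matters mainly to code that later parses the header strictly.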