
Wrap result from get_memento in a custom object #2

Closed
danielballan opened this issue Nov 18, 2019 · 3 comments · Fixed by #52

Comments

@danielballan
Contributor

Rather than exposing requests.Response directly, we could wrap it in a custom object to give us the freedom to refactor in the future. From edgi-govdata-archiving/web-monitoring-processing#477

Mr0grog added a commit that referenced this issue Feb 7, 2020
This implementation is kinda wonky, but it's the best way I've come up with to support sessions/clients and connection pooling across multiple threads. It's based on the kind of hacky implementation in edgi-govdata-archiving/web-monitoring-processing#551.

The basic idea here has two pieces, and works around the fact that urllib3 is thread-safe while requests is not:

1. `WaybackSession` is renamed to `UnsafeWaybackSession` (denoting that it should only be used on a single thread), and a new `WaybackSession` class acts as a proxy to multiple `UnsafeWaybackSession` instances, one per thread.

2. A special subclass of requests's `HTTPAdapter` takes an instance of urllib3's `PoolManager` to wrap. `HTTPAdapter` itself is really just a wrapper around `PoolManager`, but it always creates a new one; this version wraps whichever one it's given. `UnsafeWaybackSession` now takes a `PoolManager` as an argument, which, if provided, is passed to its `HTTPAdapter`. `WaybackSession` creates one `PoolManager`, which it sets on all the actual `UnsafeWaybackSession` objects it creates and proxies access to. That way a single pool of connections is shared across many threads.

This is super wonky! It definitely makes me feel like we might be better off dropping requests and using urllib3 directly (especially given #2, which means requests wouldn't be part of our public interface in any way). But this is a smaller change that *probably* carries less short-term risk.
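The per-thread proxying in piece 1 can be sketched with the stdlib's `threading.local`. The class names below are stand-ins for illustration, not the actual wayback classes, and the pool is just a placeholder object rather than a real urllib3 `PoolManager`:

```python
import threading

class UnsafeSession:
    """Stand-in for UnsafeWaybackSession: safe only on one thread."""
    def __init__(self, pool=None):
        # In the real design this would be a shared urllib3 PoolManager.
        self.pool = pool

class ThreadSafeSession:
    """Stand-in for the proxying WaybackSession: keeps one UnsafeSession
    per thread, all sharing a single connection pool."""
    def __init__(self):
        self.pool = object()  # placeholder for the shared PoolManager
        self._local = threading.local()

    @property
    def session(self):
        # Lazily create this thread's private session on first use.
        if not hasattr(self._local, 'session'):
            self._local.session = UnsafeSession(pool=self.pool)
        return self._local.session
```

Each thread gets its own (non-thread-safe) session object, while the single shared pool means connections are still reused across all of them.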
Mr0grog added a commit that referenced this issue Feb 7, 2020
@Mr0grog
Member

Mr0grog commented Aug 28, 2020

In Web Monitoring, we are adding better media type parsing (see edgi-govdata-archiving/web-monitoring-processing#621), and that should probably ultimately live here instead. I’m thinking we’d have:

```python
from dataclasses import dataclass

@dataclass
class MediaType:
    type: str         # e.g. 'text'
    subtype: str      # e.g. 'html'
    parameters: dict  # e.g. {'charset': 'utf-8'}

    @property
    def media(self):
        "The main media type without parameters, e.g. 'text/html'"
        return f'{self.type}/{self.subtype}'

    @property
    def parameter_string(self):
        "The parameters as a single string, e.g. 'charset=utf-8; other-param=whatever'"
        return '; '.join(f'{key}={value}' for key, value in self.parameters.items())

    def __str__(self):
        if self.parameters:
            return f'{self.media}; {self.parameter_string}'
        else:
            return self.media
```

And then Memento.media would be an instance of the above.
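For illustration, here is a minimal, stdlib-only sketch of splitting a Content-Type header into the type/subtype/parameters pieces described above. `parse_media_type` is a hypothetical helper, not an existing wayback function, and it ignores edge cases like quoted parameter values containing semicolons:

```python
def parse_media_type(header):
    """Split a Content-Type value into (type, subtype, parameters)."""
    parts = [part.strip() for part in header.split(';')]
    main_type, _, subtype = parts[0].partition('/')
    parameters = {}
    for param in parts[1:]:
        key, _, value = param.partition('=')
        parameters[key.strip()] = value.strip().strip('"')
    return main_type, subtype, parameters
```

A real implementation would want to follow the full media-type grammar, but this shows the shape of the data the `MediaType` class would hold.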

@Mr0grog
Member

Mr0grog commented Aug 28, 2020

Or it could be simpler, with just:

```python
memento.media_type == 'text/html'
memento.media_parameters == {'charset': 'utf-8'}
```

@Mr0grog
Member

Mr0grog commented Sep 18, 2020

OK, thinking through an ideal model for a Memento, here’s my first cut:

  • `encoding`: The text encoding of the response. (Do we need an `apparent_encoding` like Requests has that detects the encoding if one isn't specified in the headers?)

  • `content`: The content of the response in bytes.

  • `text`: The decoded content of the response as a string.

  • `headers`: The archived headers of the response (that is, the headers that came with the original snapshot, not including additional Wayback headers).

  • `history`: A list of mementos for any historical redirects that led to this memento.

  • `debug_history`: A list of URLs for any redirects (including those involved in general communication with the Wayback Machine, not just historical redirects). (Should this be a bunch of response objects of some sort instead? Memento objects that aren't actually mementos?)

  • `is_redirect`: Whether the memento is of a redirect (i.e. had a 3xx status code).

  • `ok`: Whether the status code was < 400.

  • `status_code`: The status code of the response as an integer.

  • `memento_type`: The type of memento (see also #15, "Support original URL + date or CdxRecord instances as parameters for get_memento?"), e.g. 'raw', 'view', etc. (Maybe simpler if these are the actual codes the Wayback Machine uses, i.e. '', 'id', 'im', 'js', 'cs'? See http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html#Archival_URL_Replay_Mode)

  • `timestamp` (or `time`?): The timestamp of when the memento was captured, as a `datetime` instance with a timezone (always UTC).

  • `url`: The original URL that this is a memento of, e.g. http://epa.gov, not http://web.archive.org/web/20200101000000/http://epa.gov. (Mainly useful when there were redirects.)

  • `memento_url`: The URL from which this memento was retrieved, basically the opposite of `url` above; i.e. http://web.archive.org/web/20200101000000/http://epa.gov, not http://epa.gov. (This is mainly a convenience, since you could compose it from `url` + `timestamp` + `memento_type`.)

  • `media`: Parsed media type information from the Content-Type header (see above comment).

  • `related`: An object with the URLs of related mementos (parsed from the Link header):

    • `original`
    • `previous`
    • `next`
    • `last`
    • `timemap`
    • `timegate`

    Could also be `links`? Might be confusing vs. a parsed version of the `Links` header from the original capture, rather than Wayback-related info.

  • `close()`: Closes the HTTP response. Happens automatically if you read `text` or `content`. Always safe to call (it's a no-op if already closed).
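As a rough illustration of the model above (a sketch of the proposal, not the eventual API), a minimal dataclass covering a few of these fields might look like this. The `memento_url` composition here is simplified and ignores `memento_type`:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Memento:
    """Sketch of the proposed model; field names follow the list above."""
    url: str
    timestamp: datetime
    status_code: int
    headers: dict = field(default_factory=dict)
    content: bytes = b''
    encoding: str = 'utf-8'

    @property
    def ok(self):
        return self.status_code < 400

    @property
    def is_redirect(self):
        return 300 <= self.status_code < 400

    @property
    def text(self):
        return self.content.decode(self.encoding)

    @property
    def memento_url(self):
        # Composed from url + timestamp, as noted above (ignores memento_type).
        return ('http://web.archive.org/web/'
                f'{self.timestamp:%Y%m%d%H%M%S}/{self.url}')
```

Properties like `ok`, `is_redirect`, and `text` are cheap derived views over the stored fields, which keeps the core model to just the data parsed from the response.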

Mr0grog added a commit that referenced this issue Sep 25, 2020
Instances of the `Memento` class will replace the return value from `WaybackClient.get_memento()`, which currently returns `requests.Response` objects. See #2 for more on plans here.

I've attempted to keep this relatively simple for now (e.g. headers are case-sensitive, content is not streamable/iterable) in order to minimize what needs to be done here to the essentials. We can then expand on it in later commits or PRs. I also expect this implementation to evolve at least a little over the course of the initial implementation of the feature.
Mr0grog added a commit that referenced this issue Oct 2, 2020
Mr0grog linked a pull request Oct 2, 2020 that will close this issue
Mr0grog added a commit that referenced this issue Oct 20, 2020
Mr0grog added a commit that referenced this issue Oct 21, 2020
Rather than returning a `requests.Response` object from `WaybackClient.get_memento()`, we now return a completely new `Memento` type. It’s similar to `requests.Response` in a lot of ways, but it contains additional properties that are specific to mementos (like a timestamp) and has different values that reflect the historical archive rather than the exact response the Wayback Machine sent (e.g. `url` is the URL of the archived page, not `http://web.archive.org/web/<date>/<url>`; `headers` contains only the historic headers, not additional info added by the Wayback Machine).

There are two major reasons for this:

1. We can provide some more useful conveniences like the above noted bits.

2. We can change the underlying HTTP implementation since we no longer expose it publicly. This is critical to getting thread safety, since Requests is not thread safe and it doesn’t look like that’s going to change soon or really ever be something they want to guarantee. Actually swapping out the underlying HTTP library is not part of this change, though. That’ll be done separately.

Note this drops a few features from #2 that I ultimately decided were a bit too complex (e.g. detailed media type parsing) and/or non-critical (e.g. parsing out all the timemap info from the `Links` header). We can add those in later.

Fixes #2.

Co-authored-by: Dan Allan <[email protected]>
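As background on the headers behavior described above: the Wayback Machine returns the original capture's headers prefixed with `X-Archive-Orig-`, so recovering the historic headers is roughly the following (a hedged sketch, not the library's actual code):

```python
def archived_headers(raw_headers):
    """Extract the snapshot's original headers from a Wayback response,
    where they appear under an 'X-Archive-Orig-' prefix."""
    prefix = 'X-Archive-Orig-'
    return {key[len(prefix):]: value
            for key, value in raw_headers.items()
            if key.startswith(prefix)}
```

Everything without the prefix (such as headers the Wayback Machine adds about itself) is dropped, which matches the goal of `Memento.headers` reflecting only the historical archive.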