
Wrap result from get_memento in a custom object #2

Closed
danielballan opened this issue Nov 18, 2019 · 3 comments · Fixed by #52

Comments

@danielballan
Contributor

Rather than exposing requests.Response directly, we could wrap it in a custom object to give us the freedom to refactor in the future. From edgi-govdata-archiving/web-monitoring-processing#477

Mr0grog added a commit that referenced this issue Feb 7, 2020
This implementation is kinda wonky, but it's the best way I've come up with to support sessions/clients and connection pooling across multiple threads. It's based on the kind of hacky implementation in edgi-govdata-archiving/web-monitoring-processing#551.

The basic idea here has two pieces, and works around the fact that urllib3 is thread-safe while requests is not:

1. `WaybackSession` is renamed to `UnsafeWaybackSession` (denoting that it should only be used on a single thread), and a new `WaybackSession` class acts as a proxy to multiple `UnsafeWaybackSession` instances, one per thread.

2. A special subclass of requests's `HTTPAdapter` takes an instance of urllib3's `PoolManager` to wrap. `HTTPAdapter` itself is really just a wrapper around `PoolManager`, but it always creates a new one; this version wraps whichever one it's given. `UnsafeWaybackSession` now takes a `PoolManager` as an argument, which, if provided, is passed to its `HTTPAdapter`. `WaybackSession` creates one `PoolManager`, which it sets on all the actual `UnsafeWaybackSession` objects it creates and proxies access to. That way a single pool of connections is shared across many threads.

This is super wonky! It definitely makes me feel like we might be better off dropping requests and using urllib3 directly (especially given #2, which means requests wouldn't be part of our public interface in any way). But this is a smaller change that *probably* carries less short-term risk.
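The per-thread proxying in piece 1 can be sketched with the stdlib's `threading.local`. The class names below are stand-ins for illustration, not the actual wayback classes, and the pool is just a placeholder object rather than a real urllib3 `PoolManager`:

```python
import threading

class UnsafeSession:
    """Stand-in for UnsafeWaybackSession: safe only on one thread."""
    def __init__(self, pool=None):
        # In the real design this would be a shared urllib3 PoolManager.
        self.pool = pool

class ThreadSafeSession:
    """Stand-in for the proxying WaybackSession: keeps one UnsafeSession
    per thread, all sharing a single connection pool."""
    def __init__(self):
        self.pool = object()  # placeholder for the shared PoolManager
        self._local = threading.local()

    @property
    def session(self):
        # Lazily create this thread's private session on first use.
        if not hasattr(self._local, 'session'):
            self._local.session = UnsafeSession(pool=self.pool)
        return self._local.session
```

Each thread gets its own (non-thread-safe) session object, while the single shared pool means connections are still reused across all of them.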
Mr0grog added a commit that referenced this issue Feb 7, 2020
@Mr0grog
Member

Mr0grog commented Aug 28, 2020

In Web Monitoring, we are adding better media type parsing (see edgi-govdata-archiving/web-monitoring-processing#621), and that should probably ultimately live here instead. I’m thinking we’d have:

```python
from dataclasses import dataclass

@dataclass
class MediaType:
    type: str         # e.g. 'text'
    subtype: str      # e.g. 'html'
    parameters: dict  # e.g. {'charset': 'utf-8'}

    @property
    def media(self):
        "The main media type without parameters, e.g. 'text/html'"
        return f'{self.type}/{self.subtype}'

    @property
    def parameter_string(self):
        "The parameters as a single string, e.g. 'charset=utf-8; other-param=whatever'"
        return '; '.join(f'{key}={value}' for key, value in self.parameters.items())

    def __str__(self):
        if self.parameters:
            return f'{self.media}; {self.parameter_string}'
        else:
            return self.media
```

And then Memento.media would be an instance of the above.
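For illustration, here is a minimal, stdlib-only sketch of splitting a Content-Type header into the type/subtype/parameters pieces described above. `parse_media_type` is a hypothetical helper, not an existing wayback function, and it ignores edge cases like quoted parameter values containing semicolons:

```python
def parse_media_type(header):
    """Split a Content-Type value into (type, subtype, parameters)."""
    parts = [part.strip() for part in header.split(';')]
    main_type, _, subtype = parts[0].partition('/')
    parameters = {}
    for param in parts[1:]:
        key, _, value = param.partition('=')
        parameters[key.strip()] = value.strip().strip('"')
    return main_type, subtype, parameters
```

A real implementation would want to follow the full media-type grammar, but this shows the shape of the data the `MediaType` class would hold.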

@Mr0grog
Member

Mr0grog commented Aug 28, 2020

Or it could be simpler, with just:

```python
memento.media_type == 'text/html'
memento.media_parameters == {'charset': 'utf-8'}
```

@Mr0grog
Member

Mr0grog commented Sep 18, 2020

OK, thinking through an ideal model for a Memento, here’s my first cut:

  • `encoding`: The text encoding of the response. (Do we need an `apparent_encoding` like Requests has that detects the encoding if one isn't specified in the headers?)

  • `content`: The content of the response in bytes.

  • `text`: The decoded content of the response as a string.

  • `headers`: The archived headers of the response (that is, the headers that came with the original snapshot, not including additional Wayback headers).

  • `history`: A list of mementos for any historical redirects that led to this memento.

  • `debug_history`: A list of URLs for any redirects (including those involved in general communication with the Wayback Machine, not just historical redirects). (Should this be a bunch of response objects of some sort instead? Memento objects that aren't actually mementos?)

  • `is_redirect`: Whether the memento is of a redirect (i.e. had a 3xx status code).

  • `ok`: Whether the status code was < 400.

  • `status_code`: The status code of the response as an integer.

  • `memento_type`: The type of memento (see also #15, "Support original URL + date or CdxRecord instances as parameters for get_memento?"), e.g. 'raw', 'view', etc. (Maybe simpler if these are the actual codes the Wayback Machine uses, i.e. '', 'id', 'im', 'js', 'cs'? See http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html#Archival_URL_Replay_Mode)

  • `timestamp` (or `time`?): The timestamp of when the memento was captured, as a `datetime` instance with a timezone (always UTC).

  • `url`: The original URL that this is a memento of, e.g. http://epa.gov, not http://web.archive.org/web/20200101000000/http://epa.gov. (Mainly useful when there were redirects.)

  • `memento_url`: The URL from which this memento was retrieved, basically the opposite of `url` above; i.e. http://web.archive.org/web/20200101000000/http://epa.gov, not http://epa.gov. (This is mainly a convenience, since you could compose it from `url` + `timestamp` + `memento_type`.)

  • `media`: Parsed media type information from the Content-Type header (see above comment).

  • `related`: An object with the URLs of related mementos (parsed from the Link header):

    • `original`
    • `previous`
    • `next`
    • `last`
    • `timemap`
    • `timegate`

    Could also be `links`? Might be confusing vs. a parsed version of the `Links` header from the original capture, rather than Wayback-related info.

  • `close()`: Closes the HTTP response. Happens automatically if you read `text` or `content`. Always safe to call (it's a no-op if already closed).
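As a rough illustration of the model above (a sketch of the proposal, not the eventual API), a minimal dataclass covering a few of these fields might look like this. The `memento_url` composition here is simplified and ignores `memento_type`:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Memento:
    """Sketch of the proposed model; field names follow the list above."""
    url: str
    timestamp: datetime
    status_code: int
    headers: dict = field(default_factory=dict)
    content: bytes = b''
    encoding: str = 'utf-8'

    @property
    def ok(self):
        return self.status_code < 400

    @property
    def is_redirect(self):
        return 300 <= self.status_code < 400

    @property
    def text(self):
        return self.content.decode(self.encoding)

    @property
    def memento_url(self):
        # Composed from url + timestamp, as noted above (ignores memento_type).
        return ('http://web.archive.org/web/'
                f'{self.timestamp:%Y%m%d%H%M%S}/{self.url}')
```

Properties like `ok`, `is_redirect`, and `text` are cheap derived views over the stored fields, which keeps the core model to just the data parsed from the response.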

Mr0grog added a commit that referenced this issue Sep 25, 2020
Instances of the `Memento` class will replace the return value from `WaybackClient.get_memento()`, which currently returns `requests.Response` objects. See #2 for more on plans here.

I've attempted to keep this relatively simple for now (e.g. headers are case-sensitive, content is not streamable/iterable) in order to minimize what needs to be done here to the essentials. We can then expand on it in later commits or PRs. I also expect this implementation to evolve at least a little over the course of the initial implementation of the feature.
Mr0grog added a commit that referenced this issue Oct 2, 2020
Mr0grog linked a pull request Oct 2, 2020 that will close this issue
Mr0grog added a commit that referenced this issue Oct 20, 2020
Mr0grog added a commit that referenced this issue Oct 21, 2020
Rather than returning a `requests.Response` object from `WaybackClient.get_memento()`, we now return a completely new `Memento` type. It’s similar to `requests.Response` in a lot of ways, but it contains additional properties that are specific to mementos (like a timestamp) and has different values that reflect the historical archive rather than the exact response the Wayback Machine sent (e.g. `url` is the URL of the archived page, not `http://web.archive.org/web/<date>/<url>`; `headers` contains only the historic headers, not additional info added by the Wayback Machine).

There are two major reasons for this:

1. We can provide some more useful conveniences like the above noted bits.

2. We can change the underlying HTTP implementation since we no longer expose it publicly. This is critical to getting thread safety, since Requests is not thread safe and it doesn’t look like that’s going to change soon or really ever be something they want to guarantee. Actually swapping out the underlying HTTP library is not part of this change, though. That’ll be done separately.

Note this drops a few features from #2 that I ultimately decided were a bit too complex (e.g. detailed media type parsing) and/or non-critical (e.g. parsing out all the timemap info from the `Links` header). We can add those in later.

Fixes #2.

Co-authored-by: Dan Allan <[email protected]>
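As background on the headers behavior described above: the Wayback Machine returns the original capture's headers prefixed with `X-Archive-Orig-`, so recovering the historic headers is roughly the following (a hedged sketch, not the library's actual code):

```python
def archived_headers(raw_headers):
    """Extract the snapshot's original headers from a Wayback response,
    where they appear under an 'X-Archive-Orig-' prefix."""
    prefix = 'X-Archive-Orig-'
    return {key[len(prefix):]: value
            for key, value in raw_headers.items()
            if key.startswith(prefix)}
```

Everything without the prefix (such as headers the Wayback Machine adds about itself) is dropped, which matches the goal of `Memento.headers` reflecting only the historical archive.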