-
-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrap result from get_memento
in a custom object
#2
Comments
This implementation is kinda wonky, but is the best way I've come up with to support sessions/clients across multiple threads and pooling connections across multiple threads. It's based on the kind of hacky implementation in edgi-govdata-archiving/web-monitoring-processing#551. The basic idea here includes two pieces, and works around the fact that urllib3 is thread-safe, while requests is not: 1. `WaybackSession` is renamed to `UnsafeWaybackSession` (denoting it should only be used on a single thread) and a new `WaybackSession` class just acts as a proxy to multiple UnsafeWaybackSessions, one per thread. 2. A special subclass of requests's `HTTPAdapter` that takes an instance of urllib3's `PoolManager` to wrap. `HTTPAdapter` itself is really just a wrapper around `PoolManager`, but it always creates a new one. This version just wraps whatever one is given to it. `UnsafeWaybackSession` now takes a `PoolManager` as an argument, which, if provided, is passed to its `HTTPAdapter`. `WaybackSession` creates one `PoolManager` which it sets on all the actual `UnsafeWaybackSession` objects it creates and proxies access to. That way a single pool of requests is shared across many threads. This is super wonky! It definitely makes me feel like we might just be better off dropping requests and just using urllib3 directly (especially given #2 -- which means requests wouldn't be part of our public interface in any way). But this is a smaller change that *probably* carries less short-term risk.
This implementation is kinda wonky, but is the best way I've come up with to support sessions/clients across multiple threads and pooling connections across multiple threads. It's based on the kind of hacky implementation in edgi-govdata-archiving/web-monitoring-processing#551. The basic idea here includes two pieces, and works around the fact that urllib3 is thread-safe, while requests is not: 1. `WaybackSession` is renamed to `UnsafeWaybackSession` (denoting it should only be used on a single thread) and a new `WaybackSession` class just acts as a proxy to multiple UnsafeWaybackSessions, one per thread. 2. A special subclass of requests's `HTTPAdapter` that takes an instance of urllib3's `PoolManager` to wrap. `HTTPAdapter` itself is really just a wrapper around `PoolManager`, but it always creates a new one. This version just wraps whatever one is given to it. `UnsafeWaybackSession` now takes a `PoolManager` as an argument, which, if provided, is passed to its `HTTPAdapter`. `WaybackSession` creates one `PoolManager` which it sets on all the actual `UnsafeWaybackSession` objects it creates and proxies access to. That way a single pool of requests is shared across many threads. This is super wonky! It definitely makes me feel like we might just be better off dropping requests and just using urllib3 directly (especially given #2 -- which means requests wouldn't be part of our public interface in any way). But this is a smaller change that *probably* carries less short-term risk.
In Web Monitoring, we are adding better media type parsing (see edgi-govdata-archiving/web-monitoring-processing#621), and that should probably ultimately live here instead. I’m thinking we’d have: class MediaType:
type: str # e.g. 'text'
subtype: str # e.g. 'html'
parameters: dict # e.g. {'charset': 'utf-8'}
@property
def media(self):
"The main media type without parameters, e.g. 'text/html'"
return f'{self.type}/{self.subtype}'
@property
def parameter_string(self):
"The parameters as a single string, e.g. 'charset=utf-8; other-param=whatever'"
return '; '.join(f'{key}={value}' for key, value in self.parameters)
def __str__(self):
if parameters:
return f'{self.media}; {self.parameter_string}'
else:
return self.media And then |
Or it could be simpler, with just: memento.media_type == 'text/html'
memento.media_parameters == {'charset': 'utf-8'} |
OK, thinking through an ideal model for a Memento, here’s my first cut:
|
Instances of the `Memento` class will replace the return value from `WaybackClient.get_memento()`, which currently returns `requests.Response` objects. See #2 for more on plans here. I've attempted to keep this relatively simple for now (e.g. headers are case-sensitive, content is not streamable/iterable) in order to minimize what needs to be done here to the essentials. We can then expand on it in later commits or PRs. I also expect this implementation to evolve at least a little over the course of the initial implementation of the feature.
Instances of the `Memento` class will replace the return value from `WaybackClient.get_memento()`, which currently returns `requests.Response` objects. See #2 for more on plans here. I've attempted to keep this relatively simple for now (e.g. headers are case-sensitive, content is not streamable/iterable) in order to minimize what needs to be done here to the essentials. We can then expand on it in later commits or PRs. I also expect this implementation to evolve at least a little over the course of the initial implementation of the feature.
Instances of the `Memento` class will replace the return value from `WaybackClient.get_memento()`, which currently returns `requests.Response` objects. See #2 for more on plans here. I've attempted to keep this relatively simple for now (e.g. headers are case-sensitive, content is not streamable/iterable) in order to minimize what needs to be done here to the essentials. We can then expand on it in later commits or PRs. I also expect this implementation to evolve at least a little over the course of the initial implementation of the feature.
Rather than returning a `requests.Response` object from `WaybackClient.get_memento()`, we now return a completely new `Memento` type. It’s similar to `requests.Response` in a lot of ways, but it contains additional properties that are specific to mementos (like a timestamp) and has different values that reflect the historical archive rather than the exact response the Wayback Machine sent (e.g. `url` is the URL of the archived page, not `http://web.archive.org/web/<date>/<url>`; `headers` contains only the historic headers, not additional info added by the Wayback Machine). There are two major reasons for this: 1. We can provide some more useful conveniences like the above noted bits. 2. We can change the underlying HTTP implementation since we no longer expose it publicly. This is critical to getting thread safety, since Requests is not thread safe and it doesn’t look like that’s going to change soon or really ever be something they want to guarantee. Actually swapping out the underlying HTTP library is not part of this change, though. That’ll be done separately. Note this drops a few features from #2 that I ultimately decided were a bit too complex (e.g. detailed media type parsing) and/or non-critical (e.g. parsing out all the timemap info from the `Links` header). We can add those in later. Fixes #2. Co-authored-by: Dan Allan <[email protected]>
Rather than exposing
requests.Response
directly, we could wrap it in a custom object to give us the freedom to refactor in the future. From edgi-govdata-archiving/web-monitoring-processing#477The text was updated successfully, but these errors were encountered: