Use a shared HTTPAdapter across all threads #551
Conversation
This implementation is kinda wonky, but it's the best way I've come up with to support sessions/clients across multiple threads and pooling connections across multiple threads. It's based on the kind of hacky implementation in edgi-govdata-archiving/web-monitoring-processing#551. The basic idea works around the fact that urllib3 is thread-safe while requests is not, and includes two pieces:

1. `WaybackSession` is renamed to `UnsafeWaybackSession` (denoting that it should only be used on a single thread), and a new `WaybackSession` class acts as a proxy to multiple `UnsafeWaybackSession` objects, one per thread.
2. A special subclass of requests's `HTTPAdapter` that takes an instance of urllib3's `PoolManager` to wrap. `HTTPAdapter` itself is really just a wrapper around `PoolManager`, but it always creates a new one; this version wraps whichever one is given to it.

`UnsafeWaybackSession` now takes a `PoolManager` as an argument, which, if provided, is passed to its `HTTPAdapter`. `WaybackSession` creates one `PoolManager` that it sets on all the actual `UnsafeWaybackSession` objects it creates and proxies access to. That way a single pool of connections is shared across many threads.

This is super wonky! It definitely makes me feel like we might be better off dropping requests and using urllib3 directly (especially given point 2, which means requests wouldn't be part of our public interface in any way). But this is a smaller change that *probably* carries less short-term risk.
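The adapter piece described above can be sketched roughly like this. This is an illustrative reconstruction, not the PR's actual code; the class name `SharedPoolAdapter` and its `pool_manager` argument are assumptions.

```python
import requests.adapters
import urllib3


class SharedPoolAdapter(requests.adapters.HTTPAdapter):
    """An HTTPAdapter that reuses a caller-supplied urllib3 PoolManager.

    The stock HTTPAdapter always builds a fresh PoolManager in
    init_poolmanager(); this subclass wraps whichever one it is given,
    so many sessions (one per thread) can share a single connection pool.
    """

    def __init__(self, pool_manager=None, **kwargs):
        # Must be set before super().__init__(), which calls init_poolmanager().
        self._shared_pool = pool_manager
        super().__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **pool_kwargs):
        if self._shared_pool is not None:
            # Reuse the shared, thread-safe urllib3 pool.
            self.poolmanager = self._shared_pool
        else:
            super().init_poolmanager(connections, maxsize, block=block, **pool_kwargs)
```

A session created on any thread would then mount this adapter, e.g. `session.mount('https://', SharedPoolAdapter(pool_manager=shared_pool))`, with `shared_pool` being the one `PoolManager` the proxy session owns.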
Had to add rate limiting here because of the current situation with the Internet Archive. Like the rest of this stuff, we should eventually find a way to stuff that functionality down into the Wayback package itself.
Rebased on top of #553.
Wayback has newly imposed rate limits on the memento API that we are running afoul of. This limits actual requests (rather than mementos) to an acceptable rate. (NOTE: this rate is actually problematically slow for us and we are working with Wayback folks to raise it, but 10/second is what it is now.) This also rewrites the `rate_limited` context manager as a lock-like object for more portability/re-use. This also means we don't keep a global registry of rate limiters, which is probably good.
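A lock-like rate limiter along the lines described above could look like the following. This is a minimal stdlib-only sketch under my own naming (`RateLimit`, `per_second`), not the actual implementation.

```python
import threading
import time


class RateLimit:
    """Lock-like object enforcing a minimum interval between acquisitions.

    Used as a context manager, like threading.Lock. Each instance is
    self-contained, so callers own their limiters -- no global registry.
    """

    def __init__(self, per_second=10):
        self._interval = 1.0 / per_second
        self._lock = threading.Lock()
        self._last = None

    def __enter__(self):
        # Serialize entry across threads and sleep until the next slot opens.
        with self._lock:
            if self._last is not None:
                wait = self._last + self._interval - time.monotonic()
                if wait > 0:
                    time.sleep(wait)
            self._last = time.monotonic()

    def __exit__(self, *exc):
        # Nothing to release: only the *start* of each request is throttled.
        pass
```

All worker threads would share one instance, wrapping each request in `with limiter: session.get(url)`.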
This has been running well for the past week or so. Going to remove the commented-out bits and merge tomorrow, ugly as this may be.
Took me a little longer than a day to get back to this. But it's merged now!
I spent an absurd amount of time over the last few days trying different approaches to better managing connections and errors with Wayback. Part of that was prompted by #525 and #507.
This basically rigs up a situation where we create a single `HTTPAdapter` for use across all our sessions (one session per thread). We then set the maximum number of connections in the adapter to the number of threads we are using, and tell it to block (if you don't tell it to block, it'll just go ahead and make more connections than the max anyway, which is kinda weird, but hey). This has a huge impact on error rate and performance:

Before: *(results screenshot)*

After: *(results screenshot)*
(Tested on my local machine at the office with a pretty speedy connection, alternating between different approaches to hopefully ensure they all got tested under similar load conditions at the Internet Archive.)
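The setup described above amounts to something like the following sketch. The constant `THREADS` and the `make_session` helper are illustrative names, not code from this PR.

```python
import requests

THREADS = 4  # illustrative; in practice, the number of worker threads

# One adapter shared by every session: pool capacity matches the thread
# count, and pool_block=True makes threads wait for a free connection
# instead of opening extras beyond the maximum.
shared_adapter = requests.adapters.HTTPAdapter(
    pool_connections=1,      # talking to a single host, so one pool suffices
    pool_maxsize=THREADS,    # at most one connection per worker thread
    pool_block=True,         # block rather than exceed the maximum
)


def make_session():
    # Each thread builds its own Session (requests is not thread-safe),
    # but all of them mount the same adapter, sharing its connection pool.
    session = requests.Session()
    session.mount('https://', shared_adapter)
    session.mount('http://', shared_adapter)
    return session
```

Each worker thread calls `make_session()` once and uses the resulting session for all its requests.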
Why not just share a single Session or Client? Requests is apparently not thread-safe, and it seems like its authors have been struggling with small, niggling issues around thread-safety for a while. It's not great to be sharing an `HTTPAdapter` across threads, but it's a small interface and doesn't interact much with the rest of requests. This seems OK, and I haven't run into any issues yet. Going any lower means wrapping parts of urllib3, which makes me feel like "why are we even using requests?" So I figured this is the sweet spot for now.

Along the way, I also implemented an AIMD (Additive Increase/Multiplicative Decrease) sizing algorithm for pools (this is the algorithm typically used for managing TCP window sizes to handle network congestion). It turned out that it didn't perform a whole lot better than what I ultimately did here, and it's much more complicated to implement, so I left it out for now. Maybe it'll come back someday! ¯\_(ツ)_/¯
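For reference, the core of an AIMD controller is quite small. This is a generic sketch of the technique, not the pool-sizing code that was left out of this PR; all names here are made up.

```python
class AIMDController:
    """Additive-increase/multiplicative-decrease sizing, as used for TCP
    congestion windows: grow the target size slowly on success, shrink it
    sharply on errors that suggest congestion."""

    def __init__(self, minimum=1, maximum=64, increase=1, decrease=0.5):
        self.size = minimum
        self._min, self._max = minimum, maximum
        self._inc, self._dec = increase, decrease

    def on_success(self):
        # Additive increase: creep upward while requests succeed.
        self.size = min(self._max, self.size + self._inc)

    def on_error(self):
        # Multiplicative decrease: back off quickly when errors appear.
        self.size = max(self._min, int(self.size * self._dec))
```

Wiring this to a connection pool means resizing (or recreating) the pool whenever `size` changes, which is where most of the extra complexity comes from.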
There's still a little cleanup to do to make this final, so this is not yet ready to merge. (See all the commented-out bits.)