
Use a shared HTTPAdapter across all threads #551

Merged (6 commits) Mar 19, 2020

Conversation

Mr0grog
Member

@Mr0grog Mr0grog commented Feb 6, 2020

I spent an absurd amount of time over the last few days trying different approaches to better manage connections and errors with Wayback. Part of that was prompted by #525 and #507.

This basically rigs up a situation where we create a single HTTPAdapter for use across all our sessions (one session per thread). We then set the adapter’s maximum number of connections to the number of threads we’re using, and tell it to block (if you don’t tell it to block, it will just go ahead and make more connections than the max anyway, which is kinda weird, but hey). This has a huge impact on error rate and performance:

Before:

  • Takes roughly 13-16 minutes to download 500 mementos from Wayback.
  • Has an error rate of 20-30% (we see lower error rates at the end of the day because we go back and retry everything several times).

After:

  • Takes roughly 8-9 minutes to download 500 mementos from Wayback.
  • Has an error rate of 0.0-1.2%.

(Tested on my local machine at the office with a pretty speedy connection, alternating between different approaches to hopefully ensure they all got tested under similar load conditions at the Internet Archive.)
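The shared-adapter setup described above can be sketched roughly like this. It isn’t the actual code from this PR; names like `THREAD_COUNT` and `get_session` are illustrative, and the key parts are `pool_maxsize`/`pool_block` on the one shared adapter plus a per-thread `requests.Session`:

```python
import threading
import requests
from requests.adapters import HTTPAdapter

THREAD_COUNT = 10

# One adapter shared by every thread. pool_maxsize caps connections at the
# thread count, and pool_block=True makes urllib3 wait for a free connection
# instead of opening extras beyond the max (the default is to open extras).
shared_adapter = HTTPAdapter(
    pool_connections=1,     # we effectively talk to one host (web.archive.org)
    pool_maxsize=THREAD_COUNT,
    pool_block=True,
)

_local = threading.local()

def get_session():
    # requests.Session is not thread-safe, so each thread gets its own
    # session, but they all mount the same adapter and thus share its pools.
    if not hasattr(_local, 'session'):
        session = requests.Session()
        session.mount('http://', shared_adapter)
        session.mount('https://', shared_adapter)
        _local.session = session
    return _local.session
```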

Why not just share a single Session or Client? Requests is apparently not thread-safe, and it seems like its authors have been struggling with small, niggling issues around thread-safety for a while. It’s not great to be sharing an HTTPAdapter across threads, but it’s a small interface and doesn’t interact much with the rest of requests. This seems OK, and I haven’t run into any issues yet. Going any lower means wrapping parts of urllib3, which makes me feel like “why are we even using requests?” So I figured this is the sweet spot for now.

Along the way, I also implemented an AIMD (Additive Increase/Multiplicative Decrease) sizing algorithm for pools (this is the algorithm typically used for managing TCP window sizes to optimize network congestion). It turns out that it didn’t seem to perform a whole lot better than what I ultimately did here, and it’s much more complicated to implement, so I left it out for now. Maybe it’ll come back someday! ¯\_(ツ)_/¯
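For the curious, the AIMD idea mentioned above looks something like this. This is just an illustrative sketch of the general technique (additive increase on success, multiplicative decrease on error, as in TCP congestion control), not the implementation that was left out of this PR:

```python
class AIMDPoolSizer:
    """Additively grow a pool size on success; multiplicatively shrink on error."""

    def __init__(self, initial=4, minimum=1, maximum=64,
                 increase=1, decrease_factor=0.5):
        self.size = initial
        self.minimum = minimum
        self.maximum = maximum
        self.increase = increase
        self.decrease_factor = decrease_factor

    def on_success(self):
        # Additive increase: creep the pool size up while requests succeed.
        self.size = min(self.maximum, self.size + self.increase)
        return self.size

    def on_error(self):
        # Multiplicative decrease: back off sharply when errors appear,
        # the same way TCP shrinks its congestion window.
        self.size = max(self.minimum, int(self.size * self.decrease_factor))
        return self.size
```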

There’s still a little cleanup to do to make this final, so this is not yet ready to merge. (See all the commented out bits.)

@Mr0grog
Member Author

Mr0grog commented Feb 6, 2020

Oh, also of note here: after lots of digging around in requests and urllib3 source code, I really don’t think our reset() method was doing anything useful other than paving over our connection management problems. That’s why it’s commented out here. HOWEVER, that makes this PR dependent on all the recent updates we’ve made in the wayback package to fix places we held onto open connections.

Mr0grog added a commit to edgi-govdata-archiving/wayback that referenced this pull request Feb 7, 2020
This implementation is kinda wonky, but is the best way I've come up with to support sessions/clients across multiple threads and pooling connections across multiple threads. It's based on the kind of hacky implementation in edgi-govdata-archiving/web-monitoring-processing#551.

The basic idea here includes two pieces, and works around the fact that urllib3 is thread-safe, while requests is not:

1. `WaybackSession` is renamed to `UnsafeWaybackSession` (denoting it should only be used on a single thread) and a new `WaybackSession` class just acts as a proxy to multiple `UnsafeWaybackSession` objects, one per thread.

2. A special subclass of requests's `HTTPAdapter` that takes an instance of urllib3's `PoolManager` to wrap. `HTTPAdapter` itself is really just a wrapper around `PoolManager`, but it always creates a new one. This version just wraps whatever one is given to it. `UnsafeWaybackSession` now takes a `PoolManager` as an argument, which, if provided, is passed to its `HTTPAdapter`. `WaybackSession` creates one `PoolManager` which it sets on all the actual `UnsafeWaybackSession` objects it creates and proxies access to. That way a single pool of requests is shared across many threads.

This is super wonky! It definitely makes me feel like we might just be better off dropping requests and just using urllib3 directly (especially given #2 -- which means requests wouldn't be part of our public interface in any way). But this is a smaller change that *probably* carries less short-term risk.
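Piece 2 above can be sketched as a small `HTTPAdapter` subclass. This is a hedged, minimal version, not the actual code from the wayback package, and the class name `SharedPoolHTTPAdapter` is made up for illustration; the trick is overriding `init_poolmanager`, which `HTTPAdapter.__init__` calls to build its `PoolManager`:

```python
import requests.adapters
from urllib3 import PoolManager

class SharedPoolHTTPAdapter(requests.adapters.HTTPAdapter):
    """An HTTPAdapter that wraps a PoolManager it's handed, instead of
    always creating a fresh one like the base class does."""

    def __init__(self, pool_manager=None, **kwargs):
        # Must be set before super().__init__, which calls init_poolmanager().
        self._shared_pool_manager = pool_manager
        super().__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
        if self._shared_pool_manager is not None:
            # Reuse the shared pool so many adapters (and therefore many
            # sessions/threads) draw connections from one place.
            self.poolmanager = self._shared_pool_manager
        else:
            super().init_poolmanager(connections, maxsize, block=block, **kwargs)
```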
@Mr0grog
Member Author

Mr0grog commented Feb 12, 2020

Had to add rate limiting here because of the current situation with the Internet Archive. Like the rest of this stuff, we should eventually find a way to stuff that functionality down into the Wayback package itself.

@Mr0grog
Member Author

Mr0grog commented Feb 18, 2020

Rebased on top of #553.

Wayback has newly imposed rate limits on the memento API that we are running afoul of. This limits actual requests (rather than mementos) to an acceptable rate. (NOTE: this rate is actually problematically slow for us and we are working with Wayback folks to raise it, but 10/second is what it is now.)

This also rewrites the `rate_limited` context manager as a lock-like object for better portability and re-use. That means we no longer keep a global registry of rate limiters, which is probably good.
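A lock-like rate limiter along those lines might look like the sketch below. This is an assumption about the shape of the object, not the actual code; the point is that it works like `threading.Lock` as a context manager while enforcing a minimum interval between acquisitions across threads:

```python
import threading
import time

class RateLimit:
    """Lock-like object that enforces a maximum acquisition rate."""

    def __init__(self, per_second=10):
        self._interval = 1.0 / per_second
        self._lock = threading.Lock()
        self._last = 0.0

    def __enter__(self):
        # Holding the lock while we sleep serializes waiters, so acquisitions
        # are spaced at least `_interval` apart across all threads.
        with self._lock:
            now = time.monotonic()
            wait = self._last + self._interval - now
            if wait > 0:
                time.sleep(wait)
            self._last = time.monotonic()

    def __exit__(self, *exc):
        return False

# Usage: guard each request the way you'd use a lock.
# limit = RateLimit(per_second=10)
# with limit:
#     session.get(url)
```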
@Mr0grog
Member Author

Mr0grog commented Feb 27, 2020

This has been running well for the past week or so. Going to remove the commented out bits and merge tomorrow, ugly as this may be.

@Mr0grog Mr0grog marked this pull request as ready for review March 19, 2020 18:39
@Mr0grog Mr0grog merged commit 594bf55 into master Mar 19, 2020
@Mr0grog Mr0grog deleted the lets-all-just-share-one-adapter branch March 19, 2020 18:52
@Mr0grog
Member Author

Mr0grog commented Mar 19, 2020

Took me a little longer than a day to get back to this. But it’s merged now!
