
Use a shared HTTPAdapter across all threads #551

Merged (6 commits) Mar 19, 2020

Conversation

Mr0grog
Member

@Mr0grog Mr0grog commented Feb 6, 2020

I spent an absurd amount of time over the last few days trying different approaches to better manage connections and errors with Wayback. Part of that was prompted by #525 and #507.

This basically rigs up a situation where we create a single HTTPAdapter for use across all our sessions (one session per thread). We then set the adapter’s maximum number of connections to the number of threads we’re using, and tell it to block (if you don’t tell it to block, it will just go ahead and make more connections than the max anyway, which is kinda weird, but hey). This has a huge impact on error rate and performance:

Before:

  • Takes roughly 13-16 minutes to download 500 mementos from Wayback.
  • Has an error rate of 20-30% (we see lower error rates at the end of the day because we go back and retry everything several times).

After:

  • Takes roughly 8-9 minutes to download 500 mementos from Wayback.
  • Has an error rate of 0.0-1.2%.

(Tested on my local machine at the office with a pretty speedy connection, alternating between different approaches to hopefully ensure they all got tested under similar load conditions at the Internet Archive.)
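The shared-adapter setup described above can be sketched roughly like this. It isn’t the actual code from this PR; names like `THREAD_COUNT` and `get_session` are illustrative, and the key parts are `pool_maxsize`/`pool_block` on the one shared adapter plus a per-thread `requests.Session`:

```python
import threading
import requests
from requests.adapters import HTTPAdapter

THREAD_COUNT = 10

# One adapter shared by every thread. pool_maxsize caps connections at the
# thread count, and pool_block=True makes urllib3 wait for a free connection
# instead of opening extras beyond the max (the default is to open extras).
shared_adapter = HTTPAdapter(
    pool_connections=1,     # we effectively talk to one host (web.archive.org)
    pool_maxsize=THREAD_COUNT,
    pool_block=True,
)

_local = threading.local()

def get_session():
    # requests.Session is not thread-safe, so each thread gets its own
    # session, but they all mount the same adapter and thus share its pools.
    if not hasattr(_local, 'session'):
        session = requests.Session()
        session.mount('http://', shared_adapter)
        session.mount('https://', shared_adapter)
        _local.session = session
    return _local.session
```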

Why not just share a single Session or Client? Requests is apparently not thread-safe, and it seems like its authors have been struggling with small, niggling issues around thread-safety for a while. It’s not great to be sharing an HTTPAdapter across threads, but it’s a small interface and doesn’t interact much with the rest of requests. This seems OK, and I haven’t run into any issues yet. Going any lower means wrapping parts of urllib3, which makes me feel like “why are we even using requests?” So I figured this is the sweet spot for now.

Along the way, I also implemented an AIMD (Additive Increase/Multiplicative Decrease) sizing algorithm for pools (this is the algorithm typically used for managing TCP window sizes to optimize network congestion). It turns out that it didn’t seem to perform a whole lot better than what I ultimately did here, and it’s much more complicated to implement, so I left it out for now. Maybe it’ll come back someday! ¯\_(ツ)_/¯
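For the curious, the AIMD idea mentioned above looks something like this. This is just an illustrative sketch of the general technique (additive increase on success, multiplicative decrease on error, as in TCP congestion control), not the implementation that was left out of this PR:

```python
class AIMDPoolSizer:
    """Additively grow a pool size on success; multiplicatively shrink on error."""

    def __init__(self, initial=4, minimum=1, maximum=64,
                 increase=1, decrease_factor=0.5):
        self.size = initial
        self.minimum = minimum
        self.maximum = maximum
        self.increase = increase
        self.decrease_factor = decrease_factor

    def on_success(self):
        # Additive increase: creep the pool size up while requests succeed.
        self.size = min(self.maximum, self.size + self.increase)
        return self.size

    def on_error(self):
        # Multiplicative decrease: back off sharply when errors appear,
        # the same way TCP shrinks its congestion window.
        self.size = max(self.minimum, int(self.size * self.decrease_factor))
        return self.size
```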

There’s still a little cleanup to do to make this final, so this is not yet ready to merge. (See all the commented out bits.)

@Mr0grog
Member Author

Mr0grog commented Feb 6, 2020

Oh, also of note here: after lots of digging around in requests and urllib3 source code, I really don’t think our reset() method was doing anything useful other than paving over our connection management problems. That’s why it’s commented out here. HOWEVER, that makes this PR dependent on all the recent updates we’ve made in the wayback package to fix places we held onto open connections.

Mr0grog added a commit to edgi-govdata-archiving/wayback that referenced this pull request Feb 7, 2020
This implementation is kinda wonky, but is the best way I've come up with to support sessions/clients across multiple threads and pooling connections across multiple threads. It's based on the kind of hacky implementation in edgi-govdata-archiving/web-monitoring-processing#551.

The basic idea here includes two pieces, and works around the fact that urllib3 is thread-safe, while requests is not:

1. `WaybackSession` is renamed to `UnsafeWaybackSession` (denoting it should only be used on a single thread) and a new `WaybackSession` class just acts as a proxy to multiple `UnsafeWaybackSession` objects, one per thread.

2. A special subclass of requests's `HTTPAdapter` that takes an instance of urllib3's `PoolManager` to wrap. `HTTPAdapter` itself is really just a wrapper around `PoolManager`, but it always creates a new one. This version just wraps whatever one is given to it. `UnsafeWaybackSession` now takes a `PoolManager` as an argument, which, if provided, is passed to its `HTTPAdapter`. `WaybackSession` creates one `PoolManager` which it sets on all the actual `UnsafeWaybackSession` objects it creates and proxies access to. That way a single pool of requests is shared across many threads.

This is super wonky! It definitely makes me feel like we might just be better off dropping requests and just using urllib3 directly (especially given #2 -- which means requests wouldn't be part of our public interface in any way). But this is a smaller change that *probably* carries less short-term risk.
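Piece 2 above can be sketched as a small `HTTPAdapter` subclass. This is a hedged, minimal version, not the actual code from the wayback package, and the class name `SharedPoolHTTPAdapter` is made up for illustration; the trick is overriding `init_poolmanager`, which `HTTPAdapter.__init__` calls to build its `PoolManager`:

```python
import requests.adapters
from urllib3 import PoolManager

class SharedPoolHTTPAdapter(requests.adapters.HTTPAdapter):
    """An HTTPAdapter that wraps a PoolManager it's handed, instead of
    always creating a fresh one like the base class does."""

    def __init__(self, pool_manager=None, **kwargs):
        # Must be set before super().__init__, which calls init_poolmanager().
        self._shared_pool_manager = pool_manager
        super().__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
        if self._shared_pool_manager is not None:
            # Reuse the shared pool so many adapters (and therefore many
            # sessions/threads) draw connections from one place.
            self.poolmanager = self._shared_pool_manager
        else:
            super().init_poolmanager(connections, maxsize, block=block, **kwargs)
```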
@Mr0grog
Member Author

Mr0grog commented Feb 12, 2020

Had to add rate limiting here because of the current situation with the Internet Archive. Like the rest of this stuff, we should eventually find a way to stuff that functionality down into the Wayback package itself.

@Mr0grog
Member Author

Mr0grog commented Feb 18, 2020

Rebased on top of #553.

Wayback has newly imposed rate limits on the memento API that we are running afoul of. This limits actual requests (rather than mementos) to an acceptable rate. (NOTE: this rate is actually problematically slow for us and we are working with Wayback folks to raise it, but 10/second is what it is now.)

This also rewrites the `rate_limited` context manager as a lock-like object for better portability and re-use. That means we no longer keep a global registry of rate limiters, which is probably good.
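A lock-like rate limiter along those lines might look like the sketch below. This is an assumption about the shape of the object, not the actual code; the point is that it works like `threading.Lock` as a context manager while enforcing a minimum interval between acquisitions across threads:

```python
import threading
import time

class RateLimit:
    """Lock-like object that enforces a maximum acquisition rate."""

    def __init__(self, per_second=10):
        self._interval = 1.0 / per_second
        self._lock = threading.Lock()
        self._last = 0.0

    def __enter__(self):
        # Holding the lock while we sleep serializes waiters, so acquisitions
        # are spaced at least `_interval` apart across all threads.
        with self._lock:
            now = time.monotonic()
            wait = self._last + self._interval - now
            if wait > 0:
                time.sleep(wait)
            self._last = time.monotonic()

    def __exit__(self, *exc):
        return False

# Usage: guard each request the way you'd use a lock.
# limit = RateLimit(per_second=10)
# with limit:
#     session.get(url)
```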
@Mr0grog
Member Author

Mr0grog commented Feb 27, 2020

This has been running well for the past week or so. Going to remove the commented out bits and merge tomorrow, ugly as this may be.

@Mr0grog Mr0grog marked this pull request as ready for review March 19, 2020 18:39
@Mr0grog Mr0grog merged commit 594bf55 into master Mar 19, 2020
@Mr0grog Mr0grog deleted the lets-all-just-share-one-adapter branch March 19, 2020 18:52
@Mr0grog
Member Author

Mr0grog commented Mar 19, 2020

Took me a little longer than a day to get back to this. But it’s merged now!
