
Handle too many connections to Wayback automatically #525

Open · 5 of 7 tasks
Mr0grog opened this issue Dec 4, 2019 · 3 comments
Mr0grog commented Dec 4, 2019

Long ago, we worked around an issue where we were getting lots of connection failures from Wayback with a dirty hack: if we ran out of retries but still had a failure to establish a new connection, we’d try resetting the whole session (only once!) and start over:

except Exception as error:
    # On connection failures, reset the session and try again. If we
    # don't do this, the connection pool for this thread is pretty much
    # dead. It's not clear to me whether there is a problem in urllib3
    # or Wayback's servers that requires this.
    # This unfortunately requires string checking because the error can
    # get wrapped up into multiple kinds of higher-level errors :(
    if retry_connection_failures and ('failed to establish a new connection' in str(error).lower()):
        self.wayback.session.reset()
        return self.process_record(record)

At the time, I knew this was a kind of ugly hack, and probably masking some underlying issues. After some discussion with @danielballan and some spelunking through the source of requests and urllib3 today, we realized there were two problems:

  1. We might be holding open connections for an unnecessarily long time. This can be fixed by explicitly closing our response objects (see the first sketch after this list).

  2. But that doesn’t cover everything. Ultimately, requests buries some of urllib3’s connection pooling features in a way that means each WaybackSession (we have one per thread, 36 in production right now) could create an unbounded number of connections and then hold onto up to 10 of them, so we could wind up attempting to hold 360 open connections to Wayback! Some things we can do to be smarter about this:

    • Just reduce the number of threads in production! Already on this, but it doesn’t make the problem obvious if we ever scale up too far again. We should ultimately do better.

    • When get_memento() redirects, make sure we read the body and close the response before moving on to the next one in the redirect chain. (Release connections for memento redirects wayback#20)

    • Automatically step down the number of memento threads (to a point) if one gets too many connections refused. (How Many is Too Many? Two? Five? Probably > 1 to differentiate between something spurious and Wayback actually telling us we have too many connections open. Then again, maybe the retry configuration can be partially responsible for this? Much nuance here.) We have to make sure we re-queue the CdxRecord that was being loaded in this case. The error we’d see to trigger this would include Failed to establish a new connection: [Errno 111] Connection refused

      (Update: I prototyped this out and it didn’t actually improve things over just sharing a connection pool between threads. See Use a shared HTTPAdapter across all threads #551 for details.)

    • Expose ways to set pool_maxsize and pool_block on the requests.HTTPAdapter instances used by WaybackSession so users can control this better (see the second sketch after this list). (Update: see Use a shared HTTPAdapter across all threads #551 and Sketch out a way to support multithreading wayback#23)

    • Make sure WaybackClient, WaybackSession, requests.Session, and requests.HTTPAdapter are thread-safe, and use one client with appropriate pool_maxsize and pool_block parameters on its HTTPAdapter instead of one client per thread. This is the most resilient solution, but also the most questionable: we can make our stuff thread-safe by fixing DisableAfterCloseSession, but there seems to be deep confusion among both users and maintainers of requests as to its thread-safety (urllib3 is all good).

      (Update: Given the issues in the requests package, see Use a shared HTTPAdapter across all threads #551 for a practical, short-term workaround in this repo and Sketch out a way to support multithreading wayback#23 for one approach to solving this better.)
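
To make the first point above concrete, here is a minimal sketch of what explicitly closing responses looks like with requests. The session setup, the fetch_memento name, and the URL handling are illustrative assumptions, not the actual WaybackClient code:

import requests

session = requests.Session()

def fetch_memento(url):
    # Hypothetical helper, just to illustrate the pattern.
    response = session.get(url, allow_redirects=False)
    try:
        # Read the whole body so urllib3 considers the connection reusable...
        body = response.content
    finally:
        # ...and explicitly release it back to the pool instead of leaving it
        # open until the response object is garbage-collected.
        response.close()
    return body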
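
And for the pool_maxsize / pool_block bullet, a rough sketch of where those knobs live in requests today (the values are made up for illustration, not recommendations):

import requests
from requests.adapters import HTTPAdapter

# Illustrative values; the point is that the connection pool settings live on
# the adapter, not the session, so WaybackSession has to expose them somehow.
adapter = HTTPAdapter(pool_maxsize=10, pool_block=True)

session = requests.Session()
# Mount one adapter for all http/https requests made through this session, so
# at most pool_maxsize connections stay open and additional requests block
# (instead of opening extra, unpooled connections) when the pool is exhausted.
session.mount('https://', adapter)
session.mount('http://', adapter)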


Mr0grog commented Dec 4, 2019

(Other options on the last point above include dropping requests and using urllib3 directly, but that would potentially make lots of little edge cases harder to handle, or leave them totally unknown.)
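
For reference, a minimal sketch of what going straight to urllib3 could look like, purely illustrative rather than a concrete proposal; its PoolManager takes the pooling knobs directly and is documented as safe to share between threads:

import urllib3

# Illustrative values: hold at most 10 connections per host, and block rather
# than opening extra connections when they are all in use.
http = urllib3.PoolManager(maxsize=10, block=True)

response = http.request('GET', 'https://web.archive.org/web/')
print(response.status)
# With the default preload_content=True, the body is fully read up front and
# the connection is returned to the pool automatically.
body = response.data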

Mr0grog added a commit that referenced this issue Dec 5, 2019
These are meant to partially address #525. Using memento objects (which are just requests response objects) means they are explicitly closed when we are done with them, and our HTTP connections might be managed a little better. Warning on resets will let us examine logs to see how other future fixes improve (or don't) the problem.
Mr0grog added a commit to edgi-govdata-archiving/wayback that referenced this issue Jan 12, 2020
When we handle memento redirects, we should have been loading all the content and then closing the response so that the connection is released for re-use; otherwise it just gets left hanging around :(

This contributes to solving some of the issues over in edgi-govdata-archiving/web-monitoring-processing#525
Mr0grog self-assigned this Feb 13, 2020

Mr0grog commented Mar 19, 2020

All the work that is unique to this repo has now been done! Ideally we want to get thread-safety for Wayback in edgi-govdata-archiving/wayback#23 (or some flavor of it) done soon, and we’ll then update code here to make use of it.


Mr0grog commented Oct 20, 2020

Update: over on the Wayback side, edgi-govdata-archiving/wayback#52 is about to land. It's the last stepping stone required before we can switch away from requests to make it thread-safe.
