Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pool connections across threads and/or make WaybackClient thread-safe #58

Open
Mr0grog opened this issue Oct 21, 2020 · 6 comments · May be fixed by #152
Open

Pool connections across threads and/or make WaybackClient thread-safe #58

Mr0grog opened this issue Oct 21, 2020 · 6 comments · May be fixed by #152
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@Mr0grog
Copy link
Member

Mr0grog commented Oct 21, 2020

This is a current major goal, but we didn’t have an issue tracking it!

This package should be thread-safe, but because it is based on Requests, we can’t guarantee that. Requests’s authors won’t guarantee it, and at various times have expressed everything from certainty that it’s not thread-safe to cautious optimism that it might be, but overall have expressed that they don’t plan to go out of their way to make certain (and then to maintain that status).

Thread safety is important for EDGI’s use: we need to pull lots of data and so need to make get_memento() calls concurrent. Connections also need to be pooled and shared across threads for stability (when EDGI implemented some hacks to do this, speed and reliability both improved considerably). In any case, other EDGI codebases implement some pretty nutty workarounds to make that relatively safe. That shouldn’t be necessary, and people should just be able to use a single WaybackClient across multiple threads.

Ideally, we should switch to a different HTTP client instead of Requests. Two options:

  1. urllib3 is thread-safe and is what Requests is based on, so it should be a reliable switch. Its API is very different, though, so there’s lots to update and test.

  2. httpx is new, has an API that matches Requests, and has lots of fancy features. It also has async support in case we ever want to make this package async-capable in the future. However, it’s still in beta (should be final before the end of the year), and may possibly have some funny under-the-hood differences we’ll need to account for. We’d at least need to figure out how to re-implement our crazy hack for Wayback’s gzip issues.


Some other approaches worth keeping in the back pocket, but that probably aren’t ideal:

  • A sketched out implementation in Sketch out a way to support multithreading #23 makes a funky abstraction that lets it appear as though you are using a single WaybackClient on multiple threads, when in fact you are using several. It’s clever, but also a little hacky and probably has a lot of messy corner cases as a result. I don’t feel great about it.

  • We could implement EDGI’s workaround from web-monitoring-processing under the hood so that all connections across all clients are pooled. This makes things magically seem to work even if you create separate clients on each thread, but it’s probably unexpected behavior. If a user was actually trying to isolate sets of connections, this would get in their way (and do it in a silent way so they might not even know). It also depends on a small part of Requests staying as thread-safe as it currently is, and there are no guarantees there.

@Mr0grog Mr0grog added the enhancement New feature or request label Oct 21, 2020
@Mr0grog
Copy link
Member Author

Mr0grog commented Oct 21, 2020

Now that #2 is finished, I’d like to get to this sometime relatively soon, so urllib3 is probably the better option: it’s production-safe, the same core implementation we already depend on, and compatible with the complicated hack we already have for gzip.

@Mr0grog
Copy link
Member Author

Mr0grog commented May 9, 2023

Updates, since it's been 2 and a half years:

  • urllib3 v2 is out now, which adds more high-level tooling.
  • httpx is still v0.x, but no longer markets itself as beta or unstable; it's probably safe to use, and supports HTTP/2 (but not yet 3).

Both require Python 3.7+, and we have so far tried to support 3.6. However, that’s mostly been for Google Colab users, and Colab now runs Python 3.10 (wow, they really zoomed forward — I recall checking last year they were still on 3.6). Dropping 3.6 support might be ok now.

@Mr0grog
Copy link
Member Author

Mr0grog commented May 9, 2023

Some other urllib3 v2 notes:

  • urllib3.HTTPResponse.from_httplib was removed (see HTTPConnection should return urllib3.HTTPResponse instead of http.client.HTTPResponse urllib3/urllib3#2648), so our hack for Wayback’s malformed gzip headers needs updating. OTOH, I think Wayback finally fixed that issue, so maybe we can just drop it. :

    wayback/wayback/_client.py

    Lines 157 to 186 in ee40997

    #####################################################################
    # HACK: handle malformed Content-Encoding headers from Wayback.
    # When you send `Accept-Encoding: gzip` on a request for a memento, Wayback
    # will faithfully gzip the response body. However, if the original response
    # from the web server that was snapshotted was gzipped, Wayback screws up the
    # `Content-Encoding` header on the memento response, leading any HTTP client to
    # *not* decompress the gzipped body. Wayback folks have no clear timeline for
    # a fix, hence the workaround here. More info in this issue:
    # https://github.com/edgi-govdata-archiving/web-monitoring-processing/issues/309
    #
    # This subclass of urllib3's response class identifies the malformed headers
    # and repairs them before instantiating the actual response object, so when it
    # reads the body, it knows to decode it correctly.
    #
    # See what we're overriding from urllib3:
    # https://github.com/urllib3/urllib3/blob/a6ec68a5c5c5743c59fe5c62c635c929586c429b/src/urllib3/response.py#L499-L526
    class WaybackResponse(HTTPConnectionPool.ResponseCls):
    @classmethod
    def from_httplib(cls, httplib_response, **response_kwargs):
    headers = httplib_response.msg
    pairs = headers.items()
    if ('content-encoding', '') in pairs and ('Content-Encoding', 'gzip') in pairs:
    del headers['content-encoding']
    headers['Content-Encoding'] = 'gzip'
    return super().from_httplib(httplib_response, **response_kwargs)
    HTTPConnectionPool.ResponseCls = WaybackResponse
    # END HACK
    #####################################################################
  • It looks like some exceptions have changed, so we should be careful about those. (That said, rejiggering all the exception logic will be pretty messy switching away from Requests anyway.)
  • Release notes: https://github.com/urllib3/urllib3/releases/tag/2.0.0
  • Migration guide: https://urllib3.readthedocs.io/en/latest/v2-migration-guide.html

@Mr0grog
Copy link
Member Author

Mr0grog commented Dec 13, 2023

We’ve since updated to support urllib3 v2 under the hood, but still call into it through requests, so the issues there have been handled. We are also moving up to Python 3.8+ in the next release, so version support is also a non-issue now.

At the moment, I think I am going to move forward with direct usage of urllib3 rather than switching to httpx. I would like to move to httpx in the long run so we can get async support (I don’t think urllib3 will have this in the foreseeable future) and HTTP/2 support (but hopefully that’s actually coming soon in urllib3: urllib3/urllib3#3000). For now, though, getting thread-safety done is really important, and keeping with the known behavior of urllib3 for all our weird hacks and such should make that easier than shifting to httpx and learning whatever weird quirks it might have (there are always some).

I hope to get this done over the next couple weeks and include it in a v0.5.0 release before the end of the year.

@Mr0grog Mr0grog self-assigned this Dec 13, 2023
@Mr0grog Mr0grog pinned this issue Dec 13, 2023
Mr0grog added a commit that referenced this issue Dec 14, 2023
This starts the path toward removing the requests package by rejiggering *most* of our exception handling. There is a huge amount more to do and this doesn't even begin to start using urllib3 to make requests, so basically everything should be completely broken with this commit.

Part of #58.
@Mr0grog Mr0grog linked a pull request Dec 14, 2023 that will close this issue
Mr0grog added a commit that referenced this issue Dec 16, 2023
This starts the path toward removing the requests package by rejiggering *most* of our exception handling. There is a huge amount more to do and this doesn't even begin to start using urllib3 to make requests, so basically everything should be completely broken with this commit.

Part of #58.
Mr0grog added a commit that referenced this issue Dec 18, 2023
This starts the path toward removing the requests package by rejiggering *most* of our exception handling. There is a huge amount more to do and this doesn't even begin to start using urllib3 to make requests, so basically everything should be completely broken with this commit.

Part of #58.
Mr0grog added a commit that referenced this issue Dec 18, 2023
This starts the path toward removing the requests package by rejiggering *most* of our exception handling. There is a huge amount more to do and this doesn't even begin to start using urllib3 to make requests, so basically everything should be completely broken with this commit.

Part of #58.
@Ousret
Copy link

Ousret commented Jan 16, 2024

Hello there,

Maybe, try Niquests instead?
We can assist if needed. There a non negligible chance that is going to ease your tasks ahead.

@Mr0grog
Copy link
Member Author

Mr0grog commented Jan 17, 2024

Thanks for the recommendation! I don’t think it’s necessarily any easier for us to use than httpx would be, but I’ll take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: In Progress
Development

Successfully merging a pull request may close this issue.

2 participants