Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not compatible with urllib3 2.0 #116

Closed
dgilman opened this issue May 29, 2023 · 2 comments · Fixed by #128
Closed

Not compatible with urllib3 2.0 #116

dgilman opened this issue May 29, 2023 · 2 comments · Fixed by #128
Labels
bug Something isn't working enhancement New feature or request
Milestone

Comments

@dgilman
Copy link
Contributor

dgilman commented May 29, 2023

No description provided.

@Mr0grog
Copy link
Member

Mr0grog commented May 30, 2023

Ah, yeah, at the very least, this fix for gzip encoding problems in Wayback will need to be removed or done in a different way for v2:

wayback/wayback/_client.py

Lines 260 to 289 in 6d360e9

#####################################################################
# HACK: handle malformed Content-Encoding headers from Wayback.
# When you send `Accept-Encoding: gzip` on a request for a memento, Wayback
# will faithfully gzip the response body. However, if the original response
# from the web server that was snapshotted was gzipped, Wayback screws up the
# `Content-Encoding` header on the memento response, leading any HTTP client to
# *not* decompress the gzipped body. Wayback folks have no clear timeline for
# a fix, hence the workaround here. More info in this issue:
# https://github.com/edgi-govdata-archiving/web-monitoring-processing/issues/309
#
# This subclass of urllib3's response class identifies the malformed headers
# and repairs them before instantiating the actual response object, so when it
# reads the body, it knows to decode it correctly.
#
# See what we're overriding from urllib3:
# https://github.com/urllib3/urllib3/blob/a6ec68a5c5c5743c59fe5c62c635c929586c429b/src/urllib3/response.py#L499-L526
class WaybackResponse(HTTPConnectionPool.ResponseCls):
@classmethod
def from_httplib(cls, httplib_response, **response_kwargs):
headers = httplib_response.msg
pairs = headers.items()
if ('content-encoding', '') in pairs and ('Content-Encoding', 'gzip') in pairs:
del headers['content-encoding']
headers['Content-Encoding'] = 'gzip'
return super().from_httplib(httplib_response, **response_kwargs)
HTTPConnectionPool.ResponseCls = WaybackResponse
# END HACK
#####################################################################

I had thought we set constraints for urllib3 v1.x, but it looks like we didn’t, which is causing your test failures on #117. Thanks for calling this out.

I think the first thing that needs doing here is constraining the dependency to urllib3 >=1.20 <2 as a quick fix, and then we need to fix the above issue in order to be compatible with v2 (or switch to httpx, see #58).

Mr0grog added a commit that referenced this issue May 30, 2023
Right now, this package is not compatible with urllib3 v2, so we need to set version constraints so package managers don’t try to install an incompatible version.

Partially covers #116.
Mr0grog added a commit that referenced this issue May 30, 2023
Right now, this package is not compatible with urllib3 v2, so we need to set version constraints so package managers don’t try to install an incompatible version.

Partially covers #116.
@Mr0grog Mr0grog added bug Something isn't working enhancement New feature or request labels May 30, 2023
@Mr0grog
Copy link
Member

Mr0grog commented May 30, 2023

Released v0.4.2 with correct dependency constraints.

@Mr0grog Mr0grog added this to the v0.4.3 milestone Sep 22, 2023
Mr0grog added a commit that referenced this issue Sep 25, 2023
This should fail for urllib3 v2 and pass for v1, since no changes have been made to the actual source code yet, and v2 should be incompatible. First stop on the way to fully fixing #116.
@Mr0grog Mr0grog mentioned this issue Sep 25, 2023
3 tasks
Mr0grog added a commit that referenced this issue Sep 26, 2023
This should fail for urllib3 v2 and pass for v1, since no changes have been made to the actual source code yet, and v2 should be incompatible. First stop on the way to fully fixing #116.
Mr0grog added a commit that referenced this issue Sep 26, 2023
Back in #118 we “fixed” things for urllib3 v2 by marking this package as only compatible with v1, so users wouldn't wind up with bad dependency combinations. This adds real, proper support for urllib3 v2. Fixes #116.

For the most part, urllib3 v2 just works, but there were two significant changes I had to make here:

1. We had a funky hack to deal with Wayback’s broken gzip handling that used the `from_httplib()` static method. That method no longer exists, and the new equivalent combines a bunch of other behavior that isn't really reasonable to override. Instead, we patch the `HTTPHeadersDict` class in v2.

2. VCR.py writes different, incompatible cassette files for v1 and v2 of urllib3 (see kevin1024/vcrpy#719). To address this in a way that doesn't make future contributions too hard, I added a custom serializer to make VCR behave compatibly between the two versions of urllib3, so you can record or read the same cassette files regardless of which version you are working with. It is no longer perfectly accurate to the response you received if using urllib3 v2, but it's generally good enough for our needs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants