Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WaybackSession not properly handling archived 429 (rate limit) responses #158

Closed
geosoco opened this issue Jan 31, 2024 · 1 comment · Fixed by #159
Closed

WaybackSession not properly handling archived 429 (rate limit) responses #158

geosoco opened this issue Jan 31, 2024 · 1 comment · Fixed by #159
Labels
bug Something isn't working

Comments

@geosoco
Copy link

geosoco commented Jan 31, 2024

As per the memento spec and as noted in the comments in _client.py (line ~987), 4XX status responses should also be returned by the memento server exactly as they were archived. WaybackClient has code to work around this, but Wayback session currently assumes that all 429s are real and throws an exception instead of returning the response.

Here's an example archived link where reddit returned a 429 to the archive bot: https://web.archive.org/web/20150129034904/http://www.reddit.com/r/PokemonGiveaway

This happens in _client.py around line ~458 because it has the memento-datetime header the executation is gated through the if statements. The second only checks if the response code is 429 without seeing if it has memento-datetime, and then throws an exception instead of returning the response.

untested: I suspect there should be an outer if statement that checks for the memento-datetime, and if present always return the response as these are recorded errors and the client/calling code can handle if needed. In the else, the rest can be handled? I can't currently reproduce a real 429 from archive, so I can't validate whether this would fully solve the issue.

@Mr0grog
Copy link
Member

Mr0grog commented Jan 31, 2024

Oh, good catch, and thanks for a real example! I definitely won’t have time to get to this until at least tomorrow (and maybe a couple of days), but if you have time to work on it, I am happy to review a pull request!

I suspect there should be an outer if statement that checks for the memento-datetime, and if present always return the response…

Yeah, I think you are correct here. The first thing WaybackSession.send() does with the response should be to immediately return it if is_memento(response) is True. Then the rest of the existing logic can stay as-is, although we should remove the is_memento() call from WaybackSession.should_retry() since it will now be redundant:

def should_retry(self, response):
# A memento may actually be a capture of an error, so don't retry it :P
if is_memento_response(response):
return False
return response.status_code in self.retryable_statuses

If you decide to work on a pull request (many thanks! 🙏), please be sure to include a test! Since you have a real-world example, the easiest thing to do is just use that same URL in the test, and add the @ia_vcr.use_cassette() decorator, like this test:

@ia_vcr.use_cassette()
def test_get_memento_with_date_datetime():
with WaybackClient() as client:
memento = client.get_memento('https://www.fws.gov/birds/',
timestamp=date(2017, 11, 24),
exact=False)
assert 'https://www.fws.gov/birds/' == memento.url
assert datetime(2017, 11, 24, 15, 13, 15, tzinfo=timezone.utc) == memento.timestamp
assert 'id_' == memento.mode

That will cause the test framework to record the actual server calls the first time you run it, then replay them for future test runs. (The recording will be a file named src/wayback/tests/cassettes/<name_of_test>.yml, so make sure to commit that new file.)

@Mr0grog Mr0grog added the bug Something isn't working label Jan 31, 2024
Mr0grog added a commit that referenced this issue Feb 1, 2024
Mr0grog added a commit that referenced this issue Feb 1, 2024
A memento can be a archive of an old rate limit error (status code 429) and in our feverish run to handle rate limit errors better at the end of 2023, we caused `WaybackSession.send()` to raise exceptions for both real rate limits *and* archived ones. However, the archived ones might be an actual memento that you were looking for, and should have been exempted from raising.

This solves the issue by simply checking whether a response is a memento and returning it immediately without doing any other checks, since the *effective* status code for a memento is always 200. (Checking various attributes of a memento is complicated, so it’s better to just return them right away rather than remembering to make complex exceptions in all the places where various response attributes have to be treated differently for mementos.)

Fixes #158.
Mr0grog added a commit that referenced this issue Feb 1, 2024
A memento can be a archive of an old rate limit error (status code 429) and in our feverish run to handle rate limit errors better at the end of 2023, we caused `WaybackSession.send()` to raise exceptions for both real rate limits *and* archived ones. However, the archived ones might be an actual memento that you were looking for, and should have been exempted from raising.

This solves the issue by simply checking whether a response is a memento and returning it immediately without doing any other checks, since the *effective* status code for a memento is always 200. (Checking various attributes of a memento is complicated, so it’s better to just return them right away rather than remembering to make complex exceptions in all the places where various response attributes have to be treated differently for mementos.)

Fixes #158.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants