
Rethink rate limiting #12

Closed
Mr0grog opened this issue Dec 2, 2019 · 10 comments · Fixed by #79
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

Mr0grog (Member) commented Dec 2, 2019

WaybackClient.get_memento has left-over rate-limiting behavior from web-monitoring-processing:

wayback/wayback/_client.py

Lines 645 to 648 in f1cdb1d

    with _utils.rate_limited(calls_per_second=30, group='get_memento'):
        # Correctly following redirects is actually pretty complicated. In
        # the simplest case, a memento is a simple web page, and that's
        # no problem. However...
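For context, the helper behind that call could look roughly like this sketch (a hypothetical reimplementation, not the actual `_utils` code; it assumes one shared lock and a per-group timestamp table, whereas a real version would likely use one lock per group so groups never block each other while sleeping):

```python
import threading
import time
from contextlib import contextmanager

# Each `group` tracks the time of its last call; callers sleep so that
# successive calls in the same group are at least 1/calls_per_second apart.
_last_calls = {}
_lock = threading.Lock()

@contextmanager
def rate_limited(calls_per_second=2, group='default'):
    minimum_interval = 1.0 / calls_per_second
    with _lock:
        wait = _last_calls.get(group, 0.0) + minimum_interval - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        _last_calls[group] = time.monotonic()
    yield
```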

From the perspective of this more generic module…

  • Should there be rate limiting at all?
  • If there should be, it should probably be optional and/or configurable: the rate shouldn’t be hard-coded, and should maybe be an option on the session or client constructor.
  • If there should be, should it apply across both get_memento and search (and whatever other methods WaybackClient might gain in the future)? Or maybe it should apply at the level of making a request to Wayback, rather than at the higher-level get_memento method?

/cc @danielballan


Updates

  • We should keep rate limiting.
  • It should be expanded to cover search.
  • search and get_memento should fall under separate rate limits that don’t interfere with each other.
  • The rate limit should be configurable (and should continue to apply across all client/session instances on all threads). The right API for this is not yet clear.
  • The default limit for CDX calls should be 1/second and for Memento calls should be 30/second.
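A hypothetical shape for that configuration, assuming keyword arguments on the session constructor (the names here are illustrative only, not a settled API):

```python
# Sketch: per-operation limits configured when constructing a session,
# defaulting to the rates listed above (1/s for CDX, 30/s for mementos).
class WaybackSession:
    def __init__(self, search_calls_per_second=1, memento_calls_per_second=30):
        self.rate_limits = {
            'search': search_calls_per_second,
            'get_memento': memento_calls_per_second,
        }
```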
Mr0grog added the question label Dec 2, 2019
Mr0grog (Member, Author) commented Dec 2, 2019

I guess we could also supply it as a separate helper (rather than building it into the client or session) for people to use or ignore as they see fit. There might also be better generic tools for this out there and we should just drop this [possibly too-simple] internal implementation. Not sure. ¯\_(ツ)_/¯

danielballan (Contributor) commented

There is an argument for continuing to include simple built-in rate limiting as a default so that we guide our users toward good manners. Users coming from the analysis side of things might not know they should do this, or how to do it effectively. Since this library lends itself toward scraping data in batch from IA, the potential for accidental misuse seems significant.

As a first step, I suggest we apply our simple rate limiting consistently and clearly document how to opt out and leave rate limiting up to the caller. If we discover a generic tool for rate limiting that we are willing to rely on as a dependency, great—no need to make a special one here—but IIRC your initial investigation didn’t uncover one.

Mr0grog (Member, Author) commented Dec 2, 2019

There is an argument for continuing to include simple built-in rate limiting as a default so that we guide our users toward good manners.

👍

If we discover a generic tool for rate limiting that we are willing to rely on as a dependency, great—no need to make a special one here—but IIRC your initial investigation didn’t uncover one.

I think I didn’t ever really look — I threw together this quick implementation way back when just to have something, and because I wasn’t entirely sure what pieces were worth limiting in what way at the time, so it seemed worthwhile to keep it internal and therefore pretty malleable (e.g. it has this weird group option because I thought I might want to tie multiple functions/blocks/whatever together under the same overall limit, but the code we ultimately merged didn’t really utilize that). But! Agree that, since this is working well enough, it might not be worth going out of our way for.


If we want to keep it (which it sounds like we do), I feel like there’s some design complexity here. At the moment, it applies a rate limit across all instances of a WaybackSession.

  • If it needs to be adjustable, what’s the best pattern for configuring a global value like that? Is this a setter on the WaybackSession class?

  • Should you be able to create instances that don’t get entangled with each other’s rate limits? (For our particular use case, we don’t need or want that, but not sure if there’s a use case here we should support.) If so, how would that be done? A default rate-limit group and the option to specify a new/different rate limit group when creating a session? Something else?

  • We currently only limit get_memento. Should search fall under the same collective limit or should it be separate? For example, if my rate limit is 2 calls/second and each call takes exactly 100 ms to execute…

    • I call search(); get_memento(x); get_memento(y)

  • It’s a given that we paused for 400 ms between the two get_memento calls, but

    • did we pause for 0 ms or 400 ms between the search() call and the get_memento(x) call?

    • i.e. did we:

      search()        # takes 100 ms to run
                      # No pause because search and get_memento are not limited together
      get_memento(x)  # takes 100 ms to run
      time.sleep(0.4) # Paused 400 ms to round out the 500 ms/call rate limit
      get_memento(y)

      Or:

      search()        # takes 100 ms to run
      time.sleep(0.4) # Paused 400 ms to round out the 500 ms/call rate limit
      get_memento(x)  # takes 100 ms to run
      time.sleep(0.4) # Paused 400 ms to round out the 500 ms/call rate limit
      get_memento(y)
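Keying limits by group (which the existing `group` option was meant to allow) produces the first behavior. A rough, self-contained sketch with assumed names:

```python
import threading
import time

# One limiter per group, so search and get_memento pause only calls
# within their own group. With a 2-calls/second limit, the second
# get_memento call below waits ~500 ms, but the search call never
# delays a memento call.
class GroupRateLimiter:
    def __init__(self, calls_per_second):
        self.minimum_interval = 1.0 / calls_per_second
        self.lock = threading.Lock()
        self.last_call = 0.0

    def wait(self):
        with self.lock:
            pause = self.last_call + self.minimum_interval - time.monotonic()
            if pause > 0:
                time.sleep(pause)
            self.last_call = time.monotonic()

limiters = {'search': GroupRateLimiter(2), 'get_memento': GroupRateLimiter(2)}

start = time.monotonic()
limiters['search'].wait()       # first search call: no pause
limiters['get_memento'].wait()  # different group: no pause
limiters['get_memento'].wait()  # same group: pauses ~500 ms
elapsed = time.monotonic() - start
```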

Mr0grog (Member, Author) commented Dec 2, 2019

To be clear, I really do think making it at least globally configurable is important — it’s definitely a number I’ve fiddled with a lot and don’t feel like we have a stable, perfect answer.

I can also see Wayback folks asking for the default to be turned down and then letting particular clients (like EDGI or some journalistic outfit, etc.) turn it up a little higher, and that ought to be a pattern that we can easily support.

(Also, in past non-theoretical reality, Wayback has sometimes had traffic problems for which we would’ve wanted to turn down rates.)

Mr0grog (Member, Author) commented Dec 2, 2019

Actually, going to go ahead and semi-answer my own question on one of the above:

Should you be able to create instances that don’t get entangled with each other’s rate limits?

This probably isn’t worth worrying about for now. As long as the global value is configurable, a sophisticated user could turn it off entirely and rate limit calls according to their own logic however they want.
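For example, assuming the built-in limit could be switched off (the opt-out API itself isn't designed yet), caller-side pacing might look like this sketch (`client` and `urls` in the usage comment are placeholders):

```python
import time

# Wrap any iterable of work items and sleep between them to enforce a
# chosen rate, independent of anything the library does internally.
def paced(items, calls_per_second):
    minimum_interval = 1.0 / calls_per_second
    last = 0.0
    for item in items:
        pause = last + minimum_interval - time.monotonic()
        if pause > 0:
            time.sleep(pause)
        last = time.monotonic()
        yield item

# Usage sketch:
# for url in paced(urls, calls_per_second=5):
#     client.get_memento(url)
```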

Mr0grog (Member, Author) commented Dec 2, 2019

Re: question 3 (separate or same rate limit for different operations), Kenji over at the Archive says:

current deployment allocates distinct set of backend servers to playback and CDX query endpoints, and we apply different rate limits for them. So separate rate limits should be appropriate, I suppose.

(Have also asked whether they have an ideal default, but no answer on that yet.)

Mr0grog added the enhancement label Dec 2, 2019
Mr0grog (Member, Author) commented Dec 3, 2019

Update from Wayback folks on good defaults:

So, we only have rate limit on CDXServerAPI and other API endpoints, which is 100/min (or higher) per IP. So 50/min (or like 1/s) would be a safe rate, well, currently, for those endpoints.

We don't have rate limit for playback at present, so it's kinda difficult to tell how fast is safe 🙂 ... hmm

(strictly speaking, we're not rate limiting per-IP -- it's more like per-network)

So we should limit CDX calls to 1/second and keep memento calls as they currently are (30/second).

lion-sz (Contributor) commented Feb 28, 2021

Is this still planned?

I need robust rate limits for a project, and I could write a pull request.

Mr0grog (Member, Author) commented Mar 1, 2021

@LionSzl Yes, it definitely is! Everything right now is blocked by my [not] having time to finish #64, which is the last remaining issue to moving from alpha to final release on v0.3.0.

There’s a rate limiting implementation I’ve been using in a downstream library that depends on this, and I had planned to pull that back into here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/9640655cfffcb0886e3bf1a5b06b194bb26b07b9/web_monitoring/utils.py#L193-L242 (with tests), but if you have improvements or an alternate approach that might be better I’d be happy to consider it.

That implementation also has little to do with how it would actually be surfaced in this package’s API, though. I haven’t given serious consideration to that yet, so if you have ideas or feedback, that would also be much appreciated. 😃

lion-sz (Contributor) commented Mar 1, 2021

Here I've used the already-implemented context manager. The limits are set as class attributes, with classmethods to change them.
The test is basically copied from your implementation.

The interface is probably not elegant, but I can't think of a better way to set these limits globally.
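A minimal sketch of that class-attribute pattern (names assumed, not the actual PR code): limits live on the class, so changing them via a classmethod applies globally to every session, including ones already constructed.

```python
class WaybackSession:
    # Global defaults shared by all instances.
    search_calls_per_second = 1
    memento_calls_per_second = 30

    @classmethod
    def set_rate_limits(cls, search=None, memento=None):
        # Mutates the class attributes, so the change is visible
        # everywhere, matching the current global-limit behavior.
        if search is not None:
            cls.search_calls_per_second = search
        if memento is not None:
            cls.memento_calls_per_second = memento

session = WaybackSession()
WaybackSession.set_rate_limits(search=0.5)
```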

@Mr0grog Mr0grog mentioned this issue Mar 16, 2021
@lion-sz lion-sz mentioned this issue Mar 22, 2021
@Mr0grog Mr0grog linked a pull request Nov 9, 2022 that will close this issue
Mr0grog added a commit that referenced this issue Nov 10, 2022
Add rate limiting to `search` (not just `get_memento`) and make the limits configurable in the `WaybackSession` constructor. Fixes #12.

Co-authored-by: Rob Brackett <[email protected]>