
Rethink rate limiting #12

Closed
Mr0grog opened this issue Dec 2, 2019 · 10 comments · Fixed by #79
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

Mr0grog (Member) commented Dec 2, 2019

WaybackClient.get_memento has left-over rate-limiting behavior from web-monitoring-processing:

wayback/wayback/_client.py

Lines 645 to 648 in f1cdb1d

    with _utils.rate_limited(calls_per_second=30, group='get_memento'):
        # Correctly following redirects is actually pretty complicated. In
        # the simplest case, a memento is a simple web page, and that's
        # no problem. However...
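For context, the helper behind that call could look roughly like this sketch (a hypothetical reimplementation, not the actual `_utils` code; it assumes one shared lock and a per-group timestamp table, whereas a real version would likely use one lock per group so groups never block each other while sleeping):

```python
import threading
import time
from contextlib import contextmanager

# Each `group` tracks the time of its last call; callers sleep so that
# successive calls in the same group are at least 1/calls_per_second apart.
_last_calls = {}
_lock = threading.Lock()

@contextmanager
def rate_limited(calls_per_second=2, group='default'):
    minimum_interval = 1.0 / calls_per_second
    with _lock:
        wait = _last_calls.get(group, 0.0) + minimum_interval - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        _last_calls[group] = time.monotonic()
    yield
```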

From the perspective of this more generic module…

  • Should there be rate limiting at all?
  • If there should be, it should probably be optional and/or configurable: the rate shouldn’t be hard-coded, and should maybe be an option on the session or client constructor.
  • If there should be, should it apply across both get_memento and search (and whatever other methods WaybackClient might gain in the future)? Or maybe it should apply at the level of making a request to Wayback, rather than at the higher-level get_memento method?

/cc @danielballan


Updates

  • We should keep rate limiting.
  • It should be expanded to cover search.
  • search and get_memento should fall under separate rate limits that don’t interfere with each other.
  • The rate limit should be configurable (and should continue to apply across all client/session instances on all threads). The right API for this is not yet clear.
  • The default limit for CDX calls should be 1/second and for Memento calls should be 30/second.
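A hypothetical shape for that configuration, assuming keyword arguments on the session constructor (the names here are illustrative only, not a settled API):

```python
# Sketch: per-operation limits configured when constructing a session,
# defaulting to the rates listed above (1/s for CDX, 30/s for mementos).
class WaybackSession:
    def __init__(self, search_calls_per_second=1, memento_calls_per_second=30):
        self.rate_limits = {
            'search': search_calls_per_second,
            'get_memento': memento_calls_per_second,
        }
```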
Mr0grog added the question label Dec 2, 2019
Mr0grog (Member, Author) commented Dec 2, 2019

I guess we could also supply it as a separate helper (rather than building it into the client or session) for people to use or ignore as they see fit. There might also be better generic tools for this out there and we should just drop this [possibly too-simple] internal implementation. Not sure. ¯\_(ツ)_/¯

danielballan (Contributor) commented

There is an argument for continuing to include simple built-in rate limiting as a default so that we guide our users toward good manners. Users coming from the analysis side of things might not know they should do this, or how to do it effectively. Since this library lends itself toward scraping data in batch from IA, the potential for accidental misuse seems significant.

As a first step, I suggest we apply our simple rate limiting consistently and clearly document how to opt out and leave rate limiting up to the caller. If we discover a generic tool for rate limiting that we are willing to rely on as a dependency, great—no need to make a special one here—but IIRC your initial investigation didn’t uncover one.

Mr0grog (Member, Author) commented Dec 2, 2019

There is an argument for continuing to include simple built-in rate limiting as a default so that we guide our users toward good manners.

👍

If we discover a generic tool for rate limiting that we are willing to rely on as a dependency, great—no need to make a special one here—but IIRC your initial investigation didn’t uncover one.

I think I didn’t ever really look — I threw together this quick implementation way back when just to have something, and because I wasn’t entirely sure what pieces were worth limiting in what way at the time, so it seemed worthwhile to keep it internal and therefore pretty malleable (e.g. it has this weird group option because I thought I might want to tie multiple functions/blocks/whatever together under the same overall limit, but the code we ultimately merged didn’t really utilize that). But! Agree that, since this is working well enough, it might not be worth going out of our way for.


If we want to keep it (which it sounds like we do), I feel like there’s some design complexity here. At the moment, it applies a rate limit across all instances of a WaybackSession.

  • If it needs to be adjustable, what’s the best pattern for configuring a global value like that? Is this a setter on the WaybackSession class?

  • Should you be able to create instances that don’t get entangled with each other’s rate limits? (For our particular use case, we don’t need or want that, but not sure if there’s a use case here we should support.) If so, how would that be done? A default rate-limit group and the option to specify a new/different rate limit group when creating a session? Something else?

  • We currently only limit get_memento. Should search fall under the same collective limit or should it be separate? For example, if my rate limit is 2 calls/second and each call takes exactly 100 ms to execute…

    • I call search(); get_memento(x); get_memento(y)

  • It’s a given that we paused for 400 ms between the two get_memento calls, but

    • did we pause for 0 ms or 400 ms between the search() call and the get_memento(x) call?

    • i.e. did we:

      search()        # takes 100 ms to run
                      # No pause because search and get_memento are not limited together
      get_memento(x)  # takes 100 ms to run
      time.sleep(0.4) # Paused 400 ms to round out the 500 ms/call rate limit
      get_memento(y)

      Or:

      search()        # takes 100 ms to run
      time.sleep(0.4) # Paused 400 ms to round out the 500 ms/call rate limit
      get_memento(x)  # takes 100 ms to run
      time.sleep(0.4) # Paused 400 ms to round out the 500 ms/call rate limit
      get_memento(y)
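Keying limits by group (which the existing `group` option was meant to allow) produces the first behavior. A rough, self-contained sketch with assumed names:

```python
import threading
import time

# One limiter per group, so search and get_memento pause only calls
# within their own group. With a 2-calls/second limit, the second
# get_memento call below waits ~500 ms, but the search call never
# delays a memento call.
class GroupRateLimiter:
    def __init__(self, calls_per_second):
        self.minimum_interval = 1.0 / calls_per_second
        self.lock = threading.Lock()
        self.last_call = 0.0

    def wait(self):
        with self.lock:
            pause = self.last_call + self.minimum_interval - time.monotonic()
            if pause > 0:
                time.sleep(pause)
            self.last_call = time.monotonic()

limiters = {'search': GroupRateLimiter(2), 'get_memento': GroupRateLimiter(2)}

start = time.monotonic()
limiters['search'].wait()       # first search call: no pause
limiters['get_memento'].wait()  # different group: no pause
limiters['get_memento'].wait()  # same group: pauses ~500 ms
elapsed = time.monotonic() - start
```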

Mr0grog (Member, Author) commented Dec 2, 2019

To be clear, I really do think making it at least globally configurable is important — it’s definitely a number I’ve fiddled with a lot and don’t feel like we have a stable, perfect answer.

I can also see Wayback folks asking for the default to be turned down and then letting particular clients (like EDGI or some journalistic outfit, etc.) turn it up a little higher, and that ought to be a pattern that we can easily support.

(Also, in past non-theoretical reality, Wayback has sometimes had traffic problems for which we would’ve wanted to turn down rates.)

Mr0grog (Member, Author) commented Dec 2, 2019

Actually, going to go ahead and semi-answer my own question on one of the above:

Should you be able to create instances that don’t get entangled with each other’s rate limits?

This probably isn’t worth worrying about for now. As long as the global value is configurable, a sophisticated user could turn it off entirely and rate limit calls according to their own logic however they want.
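For example, assuming the built-in limit could be switched off (the opt-out API itself isn't designed yet), caller-side pacing might look like this sketch (`client` and `urls` in the usage comment are placeholders):

```python
import time

# Wrap any iterable of work items and sleep between them to enforce a
# chosen rate, independent of anything the library does internally.
def paced(items, calls_per_second):
    minimum_interval = 1.0 / calls_per_second
    last = 0.0
    for item in items:
        pause = last + minimum_interval - time.monotonic()
        if pause > 0:
            time.sleep(pause)
        last = time.monotonic()
        yield item

# Usage sketch:
# for url in paced(urls, calls_per_second=5):
#     client.get_memento(url)
```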

Mr0grog (Member, Author) commented Dec 2, 2019

Re: question 3 (separate or same rate limit for different operations), Kenji over at the Archive says:

current deployment allocates distinct set of backend servers to playback and CDX query endpoints, and we apply different rate limits for them. So separate rate limits should be appropriate, I suppose.

(Have also asked whether they have an ideal default, but no answer on that yet.)

Mr0grog added the enhancement label Dec 2, 2019
Mr0grog (Member, Author) commented Dec 3, 2019

Update from Wayback folks on good defaults:

So, we only have rate limit on CDXServerAPI and other API endpoints, which is 100/min (or higher) per IP. So 50/min (or like 1/s) would be a safe rate, well, currently, for those endpoints.

We don't have rate limit for playback at present, so it's kinda difficult to tell how fast is safe 🙂 ... hmm

(strictly speaking, we're not rate limiting per-IP -- it's more like per-network)

So we should limit CDX calls to 1/second and keep memento calls as they currently are (30/second).

lion-sz (Contributor) commented Feb 28, 2021

Is this still planned?

I need robust rate limits for a project, and I could write a pull request.

Mr0grog (Member, Author) commented Mar 1, 2021

@LionSzl Yes, it definitely is! Everything right now is blocked by my [not] having time to finish #64, which is the last remaining issue to moving from alpha to final release on v0.3.0.

There’s a rate limiting implementation I’ve been using in a downstream library that depends on this, and I had planned to pull that back into here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/9640655cfffcb0886e3bf1a5b06b194bb26b07b9/web_monitoring/utils.py#L193-L242 (with tests), but if you have improvements or an alternate approach that might be better I’d be happy to consider it.

That implementation also has little to do with how it would actually be surfaced in this package’s API, though. I haven’t given serious consideration to that yet, so if you have ideas or feedback, that would also be much appreciated. 😃

lion-sz (Contributor) commented Mar 1, 2021

Here I've used the already-implemented context manager. The limits are set as class attributes, with classmethods to change them.
The test is basically copied from your implementation.

The interface is probably not elegant, but I can't think of a better way to set these limits globally.
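A minimal sketch of that class-attribute pattern (names assumed, not the actual PR code): limits live on the class, so changing them via a classmethod applies globally to every session, including ones already constructed.

```python
class WaybackSession:
    # Global defaults shared by all instances.
    search_calls_per_second = 1
    memento_calls_per_second = 30

    @classmethod
    def set_rate_limits(cls, search=None, memento=None):
        # Mutates the class attributes, so the change is visible
        # everywhere, matching the current global-limit behavior.
        if search is not None:
            cls.search_calls_per_second = search
        if memento is not None:
            cls.memento_calls_per_second = memento

session = WaybackSession()
WaybackSession.set_rate_limits(search=0.5)
```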

@Mr0grog Mr0grog mentioned this issue Mar 16, 2021
@lion-sz lion-sz mentioned this issue Mar 22, 2021
@Mr0grog Mr0grog linked a pull request Nov 9, 2022 that will close this issue
Mr0grog added a commit that referenced this issue Nov 10, 2022
Add rate limiting to `search` (not just `get_memento`) and make the limits configurable in the `WaybackSession` constructor. Fixes #12.

Co-authored-by: Rob Brackett <[email protected]>