Rethink rate limiting #12
Comments
I guess we could also supply it as a separate helper (rather than building it into the client or session) for people to use or ignore as they see fit. There might also be better generic tools for this out there, in which case we should just drop this [possibly too-simple] internal implementation. Not sure. ¯\_(ツ)_/¯
There is an argument for continuing to include simple built-in rate limiting as a default so that we guide our users toward good manners. Users coming from the analysis side of things might not know they should do this, or how to do it effectively. Since this library lends itself to scraping data in batch from IA, the potential for accidental misuse seems significant. As a first step, I suggest we apply our simple rate limiting consistently and clearly document how to opt out and leave rate limiting up to the caller. If we discover a generic tool for rate limiting that we are willing to rely on as a dependency, great; there is no need to make a special one here. But IIRC your initial investigation didn’t uncover one.
👍
I think I didn’t ever really look — I threw together this quick implementation way back when just to have something, and because I wasn’t entirely sure what pieces were worth limiting in what way at the time, so it seemed worthwhile to keep it internal and therefore pretty malleable (e.g. it has this weird If we want to keep it (which it sounds like we do), I feel like there’s some design complexity here. At the moment, it applies a rate limit across all instances of a
To be clear, I really do think making it at least globally configurable is important: it’s definitely a number I’ve fiddled with a lot, and I don’t feel like we have a stable, perfect answer. I can also see Wayback folks asking for the default to be turned down and then letting particular clients (like EDGI or some journalistic outfit, etc.) turn it up a little higher, and that ought to be a pattern that we can easily support. (Also, in past non-theoretical reality, Wayback has sometimes had traffic problems for which we would’ve wanted to turn down rates.)
Actually, going to go ahead and semi-answer my own question on one of the above:
This probably isn’t worth worrying about for now. As long as the global value is configurable, a sophisticated user could turn it off entirely and rate limit calls according to their own logic however they want.
Re: question 3 (separate or same rate limit for different operations), Kenji over at the Archive says:
(Have also asked whether they have an ideal default, but no answer on that yet.)
Update from Wayback folks on good defaults:
So we should limit CDX calls to 1/second and keep memento calls the same as they currently are (30/second).
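For illustration, here is a rough sketch of what two independent limits at those rates could look like. `RateLimit`, `search_limit`, and `memento_limit` are hypothetical names made up for this example, not this package's API and not the downstream implementation mentioned in this thread; the injectable `clock` and `sleep` arguments are also an assumption, included only to make the helper easy to test.

```python
import threading
import time


class RateLimit:
    """Hypothetical helper: allow at most `per_second` calls per second."""

    def __init__(self, per_second, clock=time.monotonic, sleep=time.sleep):
        # A rate of 0 or None means "no limit" (the caller opts out).
        self._interval = 1.0 / per_second if per_second else 0.0
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the next slot before sleeping so that concurrent
            # callers queue up behind each other.
            self._next_allowed = max(now, self._next_allowed) + self._interval
        if delay:
            self._sleep(delay)
        return delay


# Separate limiters, so search (CDX) traffic never delays memento traffic.
search_limit = RateLimit(per_second=1)
memento_limit = RateLimit(per_second=30)
```

A caller would invoke `search_limit.wait()` before each CDX request and `memento_limit.wait()` before each memento request. Because the two objects share no state, a slow crawl of search results does not eat into the memento budget.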
Is this still planned? I need robust rate limits for a project, and I could write a pull request.
@LionSzl Yes, it definitely is! Everything right now is blocked by my [not] having time to finish #64, which is the last remaining issue blocking the move from alpha to final release on v0.3.0. There’s a rate limiting implementation I’ve been using in a downstream library that depends on this, and I had planned to pull that back into here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/9640655cfffcb0886e3bf1a5b06b194bb26b07b9/web_monitoring/utils.py#L193-L242 (with tests), but if you have improvements or an alternate approach that might be better I’d be happy to consider it. That implementation also has little to do with how it would actually be surfaced in this package’s API, though. I haven’t given serious consideration to that yet, so if you have ideas or feedback, that would also be much appreciated. 😃
Here I've used the already implemented context manager. The limits are set as class attributes with classmethods to change them. The interface is probably not elegant, but I can't think of a better way to set these limits globally.
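As a rough picture of the "class attributes plus classmethods" pattern described here (not the code from the pull request itself), something like the following would make the limits global across instances. `SketchClient`, `set_limits`, and the attribute names are invented for this example.

```python
class SketchClient:
    """Hypothetical client whose limits are shared by every instance."""

    # Class attributes: one value for the whole process, not per instance.
    search_calls_per_second = 1
    memento_calls_per_second = 30

    @classmethod
    def set_limits(cls, search=None, memento=None):
        """Change the shared limits; arguments left as None are unchanged."""
        if search is not None:
            cls.search_calls_per_second = search
        if memento is not None:
            cls.memento_calls_per_second = memento


# Turning the memento limit down affects existing and future instances.
SketchClient.set_limits(memento=10)
a, b = SketchClient(), SketchClient()
```

The trade-off is the one raised earlier in the thread: a global value is easy to turn down for everyone, but two callers in the same process cannot hold different limits.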
Add rate limiting to `search` (not just `get_memento`) and make the limits configurable in the `WaybackSession` constructor. Fixes #12. Co-authored-by: Rob Brackett <[email protected]>
`WaybackClient.get_memento` has left-over rate-limiting behavior from web-monitoring-processing:
wayback/wayback/_client.py
Lines 645 to 648 in f1cdb1d
From the perspective of this more generic module…
`get_memento` and `search` (and whatever other methods `WaybackClient` might gain in the future)? Or maybe it should apply at the level of making a request to Wayback, rather than at the higher-level `get_memento` method?
/cc @danielballan
Updates
… `search`.
`search` and `get_memento` should fall under separate rate limits that don’t interfere with each other.