Add rate limiting functionality for coordinator #17628

tdcmeehan merged 1 commit into prestodb:master
Conversation
Let's shut down this executor when the application shuts down. See … for an example.
Why not use -1 instead to represent disabling?

0 means the available permits/tokens are 0, which is more consistent with its logical representation and easier to understand. -1 is a little hard to make sense of, right?
So if it's 0, it means we don't allow any usage at all; but for disabling rate limiting, it means unlimited slots for all. Looks like they're opposite logic?

0 == disabled == no rate limiting == unlimited slots == old behavior; they are the same. So we return immediateFuture(0.0), and the caller logic immediately moves on.
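The short-circuit described above can be sketched as follows. This is a hypothetical stand-in (the class and method names are illustrative, not the PR's actual API) using the JDK's `CompletableFuture` in place of Guava's `immediateFuture`:

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the "0 == disabled" convention discussed above.
// When the configured bucket size is 0, rate limiting is disabled and the
// caller receives an already-completed future, preserving the old behavior.
class RateLimiterSketch {
    private final int rateLimitBucketMaxSize;

    RateLimiterSketch(int rateLimitBucketMaxSize) {
        this.rateLimitBucketMaxSize = rateLimitBucketMaxSize;
    }

    // Returns the seconds spent waiting for a permit; 0.0 means no wait.
    CompletableFuture<Double> acquire(String queryId) {
        if (rateLimitBucketMaxSize == 0) {
            // Disabled: complete immediately so the caller moves on at once.
            return CompletableFuture.completedFuture(0.0);
        }
        // Otherwise the per-query token bucket would be consulted (omitted here).
        return CompletableFuture.completedFuture(0.0);
    }
}
```

With a bucket size of 0, `acquire("query_1").join()` returns 0.0 without blocking.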
- Can we also add tests for QueryBlockingRateLimiter?
- Like @rschlussel suggested above, add a commit message to introduce the context and functionality.

I would love to; acquire() itself does not have a lot of complex logic. Any suggestions on the test cases?
> 0 == disabled == no rate limiting == unlimited slots == old behavior; they are the same. So we return immediateFuture(0.0), and the caller logic immediately moves on.

But rateLimitBucketMaxSize being 0 seems to indicate the bucket's max size is 0, which also means no traffic is allowed at all, right? I think this is where it might be confusing.
What I have in mind is that we can pass in a smaller rateLimiterCache (or, if it's configurable, pass in the corresponding config) and rateLimitBucketMaxSize, and verify:
- When multiple queries call the function multiple times at the same time, they don't affect each other
- The expected behavior after exceeding the rateLimiterCache's max size
- The rate-limit functionality itself (I understand we have added that in a higher-level endpoint test, but here is a better place to test the actual rate-limit logic)
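The first suggested case (per-query independence) could look something like the sketch below. `PerQueryLimiter` is a trivial hypothetical stand-in for QueryBlockingRateLimiter, not the PR's code: each query id gets its own permit budget, so exhausting one query's permits must not affect another's:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a per-query limiter, used only to illustrate
// the suggested test case: limits are tracked per query id, never shared.
class PerQueryLimiter {
    private final int maxPermits;
    private final Map<String, Integer> permitsUsed = new ConcurrentHashMap<>();

    PerQueryLimiter(int maxPermits) {
        this.maxPermits = maxPermits;
    }

    boolean tryAcquire(String queryId) {
        // Count permits per query id; independent queries get independent counters.
        return permitsUsed.merge(queryId, 1, Integer::sum) <= maxPermits;
    }
}
```

A test would exhaust query "a"'s permits and then assert that query "b" can still acquire.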
Made cache params configurable. To further reduce the memory footprint, by default the cache holds 1000 entries with a 5-minute expiration; when the size limit is exceeded, the oldest entry gets evicted.
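For illustration only (the PR uses a Guava cache, and the 5-minute TTL is omitted here), the size-bounded, evict-oldest behavior described above can be sketched with a JDK `LinkedHashMap` in access order:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch, not the PR's implementation: a size-bounded map that
// evicts the least recently used entry once maximumSize is exceeded,
// mirroring the "1000 entries, evict oldest" default described above.
class BoundedRateLimiterCache<K, V> extends LinkedHashMap<K, V> {
    private final int maximumSize; // e.g. 1000 in the default config above

    BoundedRateLimiterCache(int maximumSize) {
        super(16, 0.75f, true); // true: iterate in access order (LRU)
        this.maximumSize = maximumSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry when the limit is exceeded.
        return size() > maximumSize;
    }
}
```

Inserting a third entry into a cache bounded at two evicts the oldest entry while keeping the newest.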
- Let's add the unit to acquirePermitTimeSeconds. Seconds is probably too coarse a grain, since in the vast majority of cases we expect no blocking at all. Let's use nanoseconds.
- Additionally, if we use nanoseconds, we can avoid the need to instantiate a new Duration, and instead directly supply the nanosecond value.
- The add(long value) is a private function.
- Seems there is no way to avoid the Duration creation; it uses NANOSECONDS internally even if you pass SECONDS:

```java
public void add(double value, TimeUnit timeUnit)
{
    this.add(new Duration(value, timeUnit));
}

public void add(Duration duration)
{
    this.add((long) duration.getValue(TimeUnit.NANOSECONDS));
}
```

- acquirePermitTimeSeconds will be updated.
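A small stdlib-only illustration of why the nanosecond granularity matters here (this is not the PR's code, just arithmetic on `TimeUnit`): a sub-second wait, which is the common case when acquisition barely blocks, truncates to 0 when expressed in whole seconds but survives as a nanosecond value:

```java
import java.util.concurrent.TimeUnit;

// Illustration: a 0.25 ms permit wait recorded in whole seconds disappears,
// while the same wait recorded in nanoseconds remains measurable.
class AcquireTimingGranularity {
    public static void main(String[] args) {
        long waitNanos = 250_000; // hypothetical 0.25 ms blocking wait
        long asWholeSeconds = TimeUnit.NANOSECONDS.toSeconds(waitNanos);
        System.out.println(asWholeSeconds + " s vs " + waitNanos + " ns");
    }
}
```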
kewang1024 left a comment
- Can we also add tests for QueryBlockingRateLimiter?
- Like @rschlussel suggested above, add a commit message to introduce the context and functionality.
NIT: use checkArgument instead, and move it to the first line of this function.
Discussed with @tdcmeehan earlier; it seems returning a future is more consistent for this function.
Can we make rateLimiterCache's maximumSize and TTL configurable, as well as the executorService's parameters, for future performance tuning?
This one was using the same default value as the gateway. I was also debating whether it is worth adding all of them as configurations, such as maximumSize and expiration, etc.
The reason I'm asking is that if we have a surge of traffic coming in (e.g. 1M requests/s), with the unlimited queue size, wouldn't that be a non-trivial amount of overhead as well? So just in case we hit such an issue in production in the future, we should be able to tune the executor pool for the rate limiter instead of waiting for another release.
Coordinator rate limiting is at the per-query level, so it will throttle that problematic query's traffic. Do you mean that if there are 1M different queries hitting the same coordinator, the overhead will be non-trivial?
Seems it's not used anywhere?
Nit: adjust the order of this.rateLimiterExecutorService and this.rateLimiterCache to match the order in which those member variables are defined.
QueuedStatementResource's stats are not exported via JMX, so adding the annotation here won't expose the stats; same for ExecutingStatementResource. Instead, why don't we move the blockingTime stat (from both QueuedStatementResource and ExecutingStatementResource) to a centralized place, QueryBlockingRateLimiter?
Thanks for the suggestion, will sync with you.
Looks like the logical structure of those two functions is nearly the same; can we extract the common logic and avoid the duplication?
For accidental bug-caused DoS, we use a delayed-processing method to reduce requests, even when the user does not have back-off logic implemented. Rate limiting is at the per-query level with token-bucket logic, based on Guava's SmoothBursty implementation. Currently the rate limiter is used on the /queued and /executing endpoints. Rate = rateLimitBucketMaxSize/second. By default, each query is allowed 100 requests/s, in a sliding-window manner.
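The token-bucket behavior described above can be sketched in a stand-alone form. The PR itself builds on Guava's RateLimiter (SmoothBursty); this stdlib-only version is purely illustrative, with rate equal to bucket capacity as stated (maxSize permits per second):

```java
// Illustrative token bucket, not the PR's Guava-based implementation.
// Tokens refill continuously at maxSize per second, capped at the bucket
// size; each request consumes one token, and a full bucket allows bursts.
class TokenBucket {
    private final double maxSize;   // rateLimitBucketMaxSize, e.g. 100
    private double tokens;          // currently available permits
    private long lastRefillNanos;

    TokenBucket(double maxSize) {
        this.maxSize = maxSize;
        this.tokens = maxSize;      // bucket starts full, allowing a burst
        this.lastRefillNanos = System.nanoTime();
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill at maxSize tokens/second, capped at the bucket capacity.
        tokens = Math.min(maxSize, tokens + (now - lastRefillNanos) / 1e9 * maxSize);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

A bucket of size 2 admits two back-to-back requests and rejects a third until tokens refill.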
As discussed offline, there is a future improvement to figure out for the ExecutorService in QueryBlockingRateLimiter: how to make sure that when one bad query id has too many requests stacked in the queue, it won't impact other query ids' requests that come later in the queue.
Regarding the …
In a typical rate-limiter setup to avoid DDoS, infra-layer defense and auto-scaling can help mitigate this issue. In current Presto, we do not have a load balancer or …
highker left a comment

I will skip the review given it has been reviewed and approved by many other folks. @tdcmeehan or @NikhilCollooru might be able to do a final pass and merge.
Add rate limiting functionality for coordinator

For accidental bug-caused DoS, we use a delayed-processing method to reduce requests, even when the user does not have back-off logic implemented. Rate limiting is at the per-query level with token-bucket logic, based on Guava's SmoothBursty implementation. Currently the rate limiter is used on the /queued and /executing endpoints. Rate = rateLimitBucketMaxSize/second. By default, each query is allowed 100 requests/s, in a sliding-window manner.

Test plan -
- testBlockingRateLimitShouldNotDelay
- testBlockingRateLimitShouldDelay