
Add rate limiting functionality for coordinator #17628

Merged
tdcmeehan merged 1 commit into prestodb:master from ericyuliu:ratelimit
Jun 6, 2022

Conversation

@ericyuliu ericyuliu (Contributor) commented Apr 11, 2022

Add rate limiting functionality for coordinator

To guard against accidental, bug-caused DoS, we use delayed processing to throttle
requests, even when the client has no back-off logic implemented.

Rate limiting is applied per query using token-bucket logic, based on Guava's
SmoothBursty RateLimiter implementation.
Currently the rate limiter is applied to the /queued and /executing endpoints.
Rate = rateLimitBucketMaxSize per second.
By default, each query is allowed 100 requests/s in a sliding-window manner.
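The per-query token-bucket idea above can be sketched in plain Java. This is an illustrative sketch only: the PR itself wraps Guava's RateLimiter (SmoothBursty); the class and method names below (SimpleTokenBucket, PerQueryRateLimiters) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative token bucket: refills at maxTokens per second, capped at maxTokens.
// The real implementation delegates to Guava's RateLimiter instead.
final class SimpleTokenBucket
{
    private final double maxTokens;   // corresponds to rateLimitBucketMaxSize
    private double tokens;
    private long lastRefillNanos;

    SimpleTokenBucket(double maxTokens)
    {
        this.maxTokens = maxTokens;
        this.tokens = maxTokens;
        this.lastRefillNanos = System.nanoTime();
    }

    // Refill proportionally to elapsed time, then try to take one token
    synchronized boolean tryAcquire()
    {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1e9;
        tokens = Math.min(maxTokens, tokens + elapsedSeconds * maxTokens);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}

// One bucket per query id, created lazily: queries throttle independently
final class PerQueryRateLimiters
{
    private final Map<String, SimpleTokenBucket> buckets = new ConcurrentHashMap<>();
    private final double bucketMaxSize;

    PerQueryRateLimiters(double bucketMaxSize)
    {
        this.bucketMaxSize = bucketMaxSize;
    }

    boolean tryAcquire(String queryId)
    {
        return buckets.computeIfAbsent(queryId, id -> new SimpleTokenBucket(bucketMaxSize)).tryAcquire();
    }
}
```

Because each query id maps to its own bucket, one chatty client only throttles its own query's requests, which matches the per-query behavior described above.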

Test plan -
testBlockingRateLimitShouldNotDelay
testBlockingRateLimitShouldDelay

== RELEASE NOTES ==

General Changes
* Add per-query rate limiting to coordinator HTTP endpoints.

@ericyuliu ericyuliu force-pushed the ratelimit branch 6 times, most recently from 2ee80d9 to 044664c Compare April 12, 2022 13:41
@ericyuliu ericyuliu marked this pull request as ready for review April 12, 2022 13:51
@ericyuliu ericyuliu requested a review from tdcmeehan April 12, 2022 13:51
@ericyuliu ericyuliu force-pushed the ratelimit branch 7 times, most recently from 4f440a7 to 7c68e41 Compare April 14, 2022 01:23
@ericyuliu ericyuliu marked this pull request as draft April 14, 2022 21:43
@ericyuliu ericyuliu closed this Apr 15, 2022
@ericyuliu ericyuliu reopened this Apr 15, 2022
@ericyuliu ericyuliu force-pushed the ratelimit branch 2 times, most recently from ab8ad04 to c2706ab Compare April 27, 2022 18:52
@ericyuliu ericyuliu marked this pull request as ready for review April 27, 2022 19:10
@ericyuliu ericyuliu force-pushed the ratelimit branch 4 times, most recently from ee8b26f to afceb3d Compare May 2, 2022 14:09
@ericyuliu ericyuliu requested a review from a team as a code owner May 2, 2022 14:09
Contributor
@tdcmeehan tdcmeehan left a comment

Minor comments
Comment on lines 62 to 66
Contributor

Let's shut down this executor when the application shuts down. See:

For an example

Contributor Author

Done

Collaborator

final

Collaborator

why not use -1 instead to represent disabling?

Contributor Author
@ericyuliu ericyuliu May 9, 2022

0 means the available permit or token count is 0, which is more consistent with its logical meaning and easier to understand.
-1 is a little harder to make sense of as a value, right?

Collaborator

So if it's 0, it means we don't allow any usage at all; but for disabling rate limiting, it means unlimited slots for all.
Doesn't that seem like opposite logic?

Contributor Author
@ericyuliu ericyuliu May 10, 2022

0 == disabled == no rate limiting == unlimited slots == old behavior;
they are all the same.
So we return immediateFuture(0.0), and the caller logic immediately moves on.
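The "0 == disabled" convention discussed here can be sketched as follows. AcquireSketch and waitForPermit are hypothetical names, and CompletableFuture stands in for the Guava ListenableFuture that immediateFuture(0.0) actually produces.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch: when the configured bucket size is 0, rate limiting
// is off and acquire() completes immediately with a wait time of 0, so the
// caller proceeds at once, preserving the old (unthrottled) behavior.
final class AcquireSketch
{
    static CompletableFuture<Double> acquire(double rateLimitBucketMaxSize)
    {
        if (rateLimitBucketMaxSize == 0) {
            // disabled: equivalent to returning immediateFuture(0.0)
            return CompletableFuture.completedFuture(0.0);
        }
        // enabled: asynchronously wait for a permit, reporting the wait in seconds
        return CompletableFuture.supplyAsync(() -> waitForPermit(rateLimitBucketMaxSize));
    }

    // placeholder for the real token-bucket wait (Guava RateLimiter.acquire)
    private static double waitForPermit(double permitsPerSecond)
    {
        return 0.0;
    }
}
```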

Contributor Author

  1. Can we also add tests for QueryBlockingRateLimiter
  2. Like @rschlussel suggested above, add a commit message to introduce the context and functionality

I would love to; acquire() itself does not have a lot of complex logic. Any suggestions on the test cases?

Collaborator

0 == disabled == no rate limiting == unlimited slots == old behavior; they are all the same. So we return immediateFuture(0.0), and the caller logic immediately moves on.

But rateLimitBucketMaxSize being 0 seems to indicate the bucket's max size is 0, which would also mean no traffic is allowed at all, right? I think this is where it might be confusing.

Collaborator
@kewang1024 kewang1024 May 10, 2022

What I have in mind is that we can pass in a smaller-sized rateLimiterCache (or, if it's configurable, pass in the corresponding config) and rateLimitBucketMaxSize.

And we can verify:

  1. When multiple queries call the function multiple times concurrently, they won't affect each other
  2. The expected behavior after exceeding the rateLimiterCache's max size
  3. The rate limit functionality itself (I understand we have added that in a higher-level endpoint test, but this is a better place to test the actual rate limit logic)

Contributor Author

Made the cache params configurable.
To further reduce the memory footprint, by default the cache holds 1000 entries with a 5-minute expiration.
When the size limit is exceeded, the oldest entry gets evicted.
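The size-bounded eviction described here can be sketched with a plain LinkedHashMap in access order. This models only the size bound, not the 5-minute expiration; the PR itself uses a Guava cache with configurable maximumSize and expiration, and BoundedRateLimiterCache is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU sketch of the bounded rateLimiterCache behavior: when the
// entry count exceeds maxEntries, the least-recently-used entry is evicted.
final class BoundedRateLimiterCache<K, V> extends LinkedHashMap<K, V>
{
    private final int maxEntries;

    BoundedRateLimiterCache(int maxEntries)
    {
        super(16, 0.75f, true); // access-order iteration enables LRU eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest)
    {
        return size() > maxEntries;
    }
}
```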

Collaborator

final

Contributor

  • Let's add the unit: acquirePermitTimeSeconds
  • Seconds is probably too coarse a grain, since we expect that in the vast majority of cases there will be no blocking at all. Let's use nanoseconds.
  • Additionally, if we use nanoseconds, we can avoid the need to instantiate a new Duration and instead supply the nanosecond value directly.

Contributor Author
@ericyuliu ericyuliu May 9, 2022

  1. The add(long value) method is private.
  2. There seems to be no way to avoid the Duration creation, and it uses NANOSECONDS internally even if SECONDS is passed:

public void add(double value, TimeUnit timeUnit)
{
    this.add(new Duration(value, timeUnit));
}

public void add(Duration duration)
{
    this.add((long) duration.getValue(TimeUnit.NANOSECONDS));
}

  3. acquirePermitTimeSeconds will be updated.

@ericyuliu ericyuliu force-pushed the ratelimit branch 3 times, most recently from 3de1b79 to ed67392 Compare May 9, 2022 16:37
Collaborator
@kewang1024 kewang1024 left a comment

  1. Can we also add tests for QueryBlockingRateLimiter
  2. Like @rschlussel suggested above, add a commit message to introduce the context and functionality

Collaborator

NIT: use checkArgument instead and move it to the first line of this function

Contributor Author

Discussed with @tdcmeehan earlier; returning a future seems more consistent for this function.

Collaborator

Can we make rateLimiterCache's maximumSize and TTL configurable, as well as the executorService's parameters, for future performance tuning?

Contributor Author

This one was using the same default values as the gateway.
I was also debating whether it is worth adding all of them as configurations, such as maximumSize, expiration, etc.

Collaborator

The reason I'm asking is that if we have a surge of traffic coming in (e.g. 1M requests/s), with an unlimited queue size, wouldn't that be a non-trivial amount of overhead as well?

So, just in case we hit such an issue in production in the future, we should be able to tune the rate limiter's executor pool instead of waiting for another release.

Contributor Author

Coordinator rate limiting is per query, so it will throttle that problematic query's traffic.
Do you mean that if there are 1M different queries hitting the same coordinator, the overhead will be non-trivial?

Comment on lines 90 to 93
Collaborator

Seems it's not used anywhere?

Comment on lines 63 to 66
Collaborator

Nit:

  1. Use this.rateLimiterExecutorService and this.rateLimiterCache
  2. Adjust the order to match the order in which those member variables are declared

Comment on lines 171 to 177
Collaborator

QueuedStatementResource's stats are not exported via JMX, so adding the annotation here won't expose the stats; the same applies to ExecutingStatementResource.

Instead, why don't we move the blockingTime stat (from both QueuedStatementResource and ExecutingStatementResource) to a centralized place, QueryBlockingRateLimiter?

Contributor Author

Thanks for the suggestion, will sync with you.

Comment on lines 148 to 180
Collaborator

The logical structure of these two functions looks nearly the same; can we extract the common logic to avoid duplication?

@ericyuliu ericyuliu changed the title add rate limiting to coordinator Add rate limiting functionality for coordinator May 10, 2022
@ericyuliu ericyuliu force-pushed the ratelimit branch 8 times, most recently from 12689f7 to d30fd22 Compare May 12, 2022 16:52
@ericyuliu ericyuliu force-pushed the ratelimit branch 2 times, most recently from 16a88ad to 9904a14 Compare May 17, 2022 22:29
Collaborator
@kewang1024 kewang1024 left a comment

As discussed offline, there is a future improvement to figure out for the ExecutorService in QueryBlockingRateLimiter: how to make sure that when one bad query id has too many requests stacked in the queue, it won't impact other query ids' requests that come later in the queue.

@kewang1024 kewang1024 requested a review from highker May 18, 2022 00:00
@ericyuliu ericyuliu (Contributor Author) commented May 20, 2022

Regarding the ExecutorService's LinkedBlockingQueue in QueryBlockingRateLimiter, there are some tradeoffs to consider.
When there is a bad actor with high QPS:

  1. If we do not set a capacity limit, it may potentially use a lot of memory in the LinkedBlockingQueue.
  2. If we give the LinkedBlockingQueue a capacity limit, it will help constrain memory usage.
    However, when the limit is reached, if we just return an immediateFailedFuture, the client can keep sending traffic and effectively turn the RateLimiter into a disabled mode, which somewhat defeats the purpose of the RateLimiter.

In a typical rate limiter setup to avoid DDoS, infra-layer defenses and auto-scaling help mitigate this issue. In current Presto, we do not have a load balancer or gateway in the critical path when a query starts to run.
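Option 2 above (a bounded queue that fails fast when full) can be sketched with a standard ThreadPoolExecutor. The factory name and parameters are illustrative, not the PR's actual code: a bounded LinkedBlockingQueue caps memory, and AbortPolicy rejects excess submissions with RejectedExecutionException rather than queueing them.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of a memory-bounded executor for the rate limiter: work beyond
// queueCapacity is rejected immediately instead of growing the queue.
final class RateLimiterExecutors
{
    static ThreadPoolExecutor newBoundedExecutor(int threads, int queueCapacity)
    {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(queueCapacity),  // capacity limit bounds memory
                new ThreadPoolExecutor.AbortPolicy());     // excess work fails fast
    }
}
```

This is exactly the tradeoff noted above: the fast failure bounds memory, but a misbehaving client that ignores errors effectively bypasses the throttling, which is why infra-layer defenses are still needed for true DDoS protection.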

@highker highker left a comment

I will skip the review given it has been reviewed and approved by many other folks. @tdcmeehan or @NikhilCollooru might be able to do a final pass and merge.

@tdcmeehan tdcmeehan merged commit c97f527 into prestodb:master Jun 6, 2022
@highker highker mentioned this pull request Jul 6, 2022
7 tasks
6 participants