
Add rate limiting functionality for coordinator #17628

Merged
tdcmeehan merged 1 commit into prestodb:master from ericyuliu:ratelimit
Jun 6, 2022

Conversation

@ericyuliu ericyuliu (Contributor) commented Apr 11, 2022

Add rate limiting functionality for coordinator

To guard against accidental, bug-caused DoS, we use delayed processing to throttle
requests, even when the client has no back-off logic implemented.

Rate limiting is applied per query using token-bucket logic, based on Guava's
SmoothBursty RateLimiter implementation.
Currently the rate limiter is applied to the /queued and /executing endpoints.
Rate = rateLimitBucketMaxSize per second.
By default, each query is allowed 100 requests/s in a sliding-window manner.
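The per-query token-bucket idea above can be sketched in plain Java. This is an illustrative sketch only: the PR itself wraps Guava's RateLimiter (SmoothBursty); the class and method names below (SimpleTokenBucket, PerQueryRateLimiters) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative token bucket: refills at maxTokens per second, capped at maxTokens.
// The real implementation delegates to Guava's RateLimiter instead.
final class SimpleTokenBucket
{
    private final double maxTokens;   // corresponds to rateLimitBucketMaxSize
    private double tokens;
    private long lastRefillNanos;

    SimpleTokenBucket(double maxTokens)
    {
        this.maxTokens = maxTokens;
        this.tokens = maxTokens;
        this.lastRefillNanos = System.nanoTime();
    }

    // Refill proportionally to elapsed time, then try to take one token
    synchronized boolean tryAcquire()
    {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1e9;
        tokens = Math.min(maxTokens, tokens + elapsedSeconds * maxTokens);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}

// One bucket per query id, created lazily: queries throttle independently
final class PerQueryRateLimiters
{
    private final Map<String, SimpleTokenBucket> buckets = new ConcurrentHashMap<>();
    private final double bucketMaxSize;

    PerQueryRateLimiters(double bucketMaxSize)
    {
        this.bucketMaxSize = bucketMaxSize;
    }

    boolean tryAcquire(String queryId)
    {
        return buckets.computeIfAbsent(queryId, id -> new SimpleTokenBucket(bucketMaxSize)).tryAcquire();
    }
}
```

Because each query id maps to its own bucket, one chatty client only throttles its own query's requests, which matches the per-query behavior described above.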

Test plan -
testBlockingRateLimitShouldNotDelay
testBlockingRateLimitShouldDelay

== RELEASE NOTES ==

General Changes
* Add per-query rate limiting to coordinator HTTP endpoints.

@ericyuliu ericyuliu force-pushed the ratelimit branch 6 times, most recently from 2ee80d9 to 044664c Compare April 12, 2022 13:41
@ericyuliu ericyuliu marked this pull request as ready for review April 12, 2022 13:51
@ericyuliu ericyuliu requested a review from tdcmeehan April 12, 2022 13:51
@ericyuliu ericyuliu force-pushed the ratelimit branch 7 times, most recently from 4f440a7 to 7c68e41 Compare April 14, 2022 01:23
@ericyuliu ericyuliu marked this pull request as draft April 14, 2022 21:43
@ericyuliu ericyuliu closed this Apr 15, 2022
@ericyuliu ericyuliu reopened this Apr 15, 2022
@ericyuliu ericyuliu force-pushed the ratelimit branch 2 times, most recently from ab8ad04 to c2706ab Compare April 27, 2022 18:52
@ericyuliu ericyuliu marked this pull request as ready for review April 27, 2022 19:10
@ericyuliu ericyuliu force-pushed the ratelimit branch 4 times, most recently from ee8b26f to afceb3d Compare May 2, 2022 14:09
@ericyuliu ericyuliu requested a review from a team as a code owner May 2, 2022 14:09
Contributor
@tdcmeehan tdcmeehan left a comment

Minor comments
Comment on lines 62 to 66
Contributor

Let's shut down this executor when the application shuts down. See:

For an example

Contributor Author

Done

Collaborator

final

Collaborator

why not use -1 instead to represent disabling?

Contributor Author
@ericyuliu ericyuliu May 9, 2022

0 means the available permit or token count is 0, which is more consistent with its logical meaning and easier to understand.
-1 is a little harder to make sense of as a value, right?

Collaborator

So if it's 0, it means we don't allow any usage at all; but for disabling rate limiting, it means unlimited slots for all.
Doesn't that seem like opposite logic?

Contributor Author
@ericyuliu ericyuliu May 10, 2022

0 == disabled == no rate limiting == unlimited slots == old behavior;
they are all the same.
So we return immediateFuture(0.0), and the caller logic immediately moves on.
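The "0 == disabled" convention discussed here can be sketched as follows. AcquireSketch and waitForPermit are hypothetical names, and CompletableFuture stands in for the Guava ListenableFuture that immediateFuture(0.0) actually produces.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch: when the configured bucket size is 0, rate limiting
// is off and acquire() completes immediately with a wait time of 0, so the
// caller proceeds at once, preserving the old (unthrottled) behavior.
final class AcquireSketch
{
    static CompletableFuture<Double> acquire(double rateLimitBucketMaxSize)
    {
        if (rateLimitBucketMaxSize == 0) {
            // disabled: equivalent to returning immediateFuture(0.0)
            return CompletableFuture.completedFuture(0.0);
        }
        // enabled: asynchronously wait for a permit, reporting the wait in seconds
        return CompletableFuture.supplyAsync(() -> waitForPermit(rateLimitBucketMaxSize));
    }

    // placeholder for the real token-bucket wait (Guava RateLimiter.acquire)
    private static double waitForPermit(double permitsPerSecond)
    {
        return 0.0;
    }
}
```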

Contributor Author

  1. Can we also add tests for QueryBlockingRateLimiter
  2. Like @rschlussel suggested above, add a commit message to introduce the context and functionality

I would love to; acquire() itself does not have a lot of complex logic. Any suggestions on the test cases?

Collaborator

0 == disabled == no rate limiting == unlimited slots == old behavior; they are all the same. So we return immediateFuture(0.0), and the caller logic immediately moves on.

But rateLimitBucketMaxSize being 0 seems to indicate the bucket's max size is 0, which would also mean no traffic is allowed at all, right? I think this is where it might be confusing.

Collaborator
@kewang1024 kewang1024 May 10, 2022

What I have in mind is that we can pass in a smaller-sized rateLimiterCache (or, if it's configurable, pass in the corresponding config) and rateLimitBucketMaxSize.

And we can verify:

  1. When multiple queries call the function multiple times concurrently, they won't affect each other
  2. The expected behavior after exceeding the rateLimiterCache's max size
  3. The rate limit functionality itself (I understand we have added that in a higher-level endpoint test, but this is a better place to test the actual rate limit logic)

Contributor Author

Made the cache params configurable.
To further reduce the memory footprint, by default the cache holds 1000 entries with a 5-minute expiration.
When the size limit is exceeded, the oldest entry gets evicted.
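The size-bounded eviction described here can be sketched with a plain LinkedHashMap in access order. This models only the size bound, not the 5-minute expiration; the PR itself uses a Guava cache with configurable maximumSize and expiration, and BoundedRateLimiterCache is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU sketch of the bounded rateLimiterCache behavior: when the
// entry count exceeds maxEntries, the least-recently-used entry is evicted.
final class BoundedRateLimiterCache<K, V> extends LinkedHashMap<K, V>
{
    private final int maxEntries;

    BoundedRateLimiterCache(int maxEntries)
    {
        super(16, 0.75f, true); // access-order iteration enables LRU eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest)
    {
        return size() > maxEntries;
    }
}
```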

Collaborator

final

Contributor

  • Let's add the unit: acquirePermitTimeSeconds
  • Seconds is probably too coarse a grain, since we expect that in the vast majority of cases there will be no blocking at all. Let's use nanoseconds.
  • Additionally, if we use nanoseconds, we can avoid the need to instantiate a new Duration and instead supply the nanosecond value directly.

Contributor Author
@ericyuliu ericyuliu May 9, 2022

  1. The add(long value) method is private.
  2. There seems to be no way to avoid the Duration creation, and it uses NANOSECONDS internally even if SECONDS is passed:

public void add(double value, TimeUnit timeUnit)
{
    this.add(new Duration(value, timeUnit));
}

public void add(Duration duration)
{
    this.add((long) duration.getValue(TimeUnit.NANOSECONDS));
}

  3. acquirePermitTimeSeconds will be updated.

@ericyuliu ericyuliu force-pushed the ratelimit branch 3 times, most recently from 3de1b79 to ed67392 Compare May 9, 2022 16:37
Collaborator
@kewang1024 kewang1024 left a comment

  1. Can we also add tests for QueryBlockingRateLimiter
  2. Like @rschlussel suggested above, add a commit message to introduce the context and functionality

Collaborator

NIT: use checkArgument instead and move it to the first line of this function

Contributor Author

Discussed with @tdcmeehan earlier; returning a future seems more consistent for this function.

Collaborator

Can we make rateLimiterCache's maximumSize and TTL configurable, as well as the executorService's parameters, for future performance tuning?

Contributor Author

This one was using the same default values as the gateway.
I was also debating whether it is worth adding all of them as configurations, such as maximumSize, expiration, etc.

Collaborator

The reason I'm asking is that if we have a surge of traffic coming in (e.g. 1M requests/s), with an unlimited queue size, wouldn't that be a non-trivial amount of overhead as well?

So, just in case we hit such an issue in production in the future, we should be able to tune the rate limiter's executor pool instead of waiting for another release.

Contributor Author

Coordinator rate limiting is per query, so it will throttle that problematic query's traffic.
Do you mean that if there are 1M different queries hitting the same coordinator, the overhead will be non-trivial?

Comment on lines 90 to 93
Collaborator

Seems it's not used anywhere?

Comment on lines 63 to 66
Collaborator

Nit:

  1. Use this.rateLimiterExecutorService and this.rateLimiterCache
  2. Adjust the order to match the order in which those member variables are declared

Comment on lines 171 to 177
Collaborator

QueuedStatementResource's stats are not exported via JMX, so adding the annotation here won't expose the stats; the same applies to ExecutingStatementResource.

Instead, why don't we move the blockingTime stat (from both QueuedStatementResource and ExecutingStatementResource) to a centralized place, QueryBlockingRateLimiter?

Contributor Author

Thanks for the suggestion, will sync with you.

Comment on lines 148 to 180
Collaborator

The logical structure of these two functions looks nearly the same; can we extract the common logic to avoid duplication?

@ericyuliu ericyuliu changed the title add rate limiting to coordinator Add rate limiting functionality for coordinator May 10, 2022
@ericyuliu ericyuliu force-pushed the ratelimit branch 8 times, most recently from 12689f7 to d30fd22 Compare May 12, 2022 16:52
@ericyuliu ericyuliu force-pushed the ratelimit branch 2 times, most recently from 16a88ad to 9904a14 Compare May 17, 2022 22:29
Collaborator
@kewang1024 kewang1024 left a comment

As discussed offline, there is a future improvement to figure out for the ExecutorService in QueryBlockingRateLimiter: how to make sure that when one bad query id has too many requests stacked in the queue, it won't impact other query ids' requests that come later in the queue.

@kewang1024 kewang1024 requested a review from highker May 18, 2022 00:00
@ericyuliu ericyuliu (Contributor Author) commented May 20, 2022

Regarding the ExecutorService's LinkedBlockingQueue in QueryBlockingRateLimiter, there are some tradeoffs to consider.
When there is a bad actor with high QPS:

  1. If we do not set a capacity limit, it may potentially use a lot of memory in the LinkedBlockingQueue.
  2. If we give the LinkedBlockingQueue a capacity limit, it will help constrain memory usage.
    However, when the limit is reached, if we just return an immediateFailedFuture, the client can keep sending traffic and effectively turn the RateLimiter into a disabled mode, which somewhat defeats the purpose of the RateLimiter.

In a typical rate limiter setup to avoid DDoS, infra-layer defenses and auto-scaling help mitigate this issue. In current Presto, we do not have a load balancer or gateway in the critical path when a query starts to run.
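Option 2 above (a bounded queue that fails fast when full) can be sketched with a standard ThreadPoolExecutor. The factory name and parameters are illustrative, not the PR's actual code: a bounded LinkedBlockingQueue caps memory, and AbortPolicy rejects excess submissions with RejectedExecutionException rather than queueing them.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of a memory-bounded executor for the rate limiter: work beyond
// queueCapacity is rejected immediately instead of growing the queue.
final class RateLimiterExecutors
{
    static ThreadPoolExecutor newBoundedExecutor(int threads, int queueCapacity)
    {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(queueCapacity),  // capacity limit bounds memory
                new ThreadPoolExecutor.AbortPolicy());     // excess work fails fast
    }
}
```

This is exactly the tradeoff noted above: the fast failure bounds memory, but a misbehaving client that ignores errors effectively bypasses the throttling, which is why infra-layer defenses are still needed for true DDoS protection.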

@highker highker left a comment

I will skip the review given it has been reviewed and approved by many other folks. @tdcmeehan or @NikhilCollooru might be able to do a final pass and merge.

@tdcmeehan tdcmeehan merged commit c97f527 into prestodb:master Jun 6, 2022
@highker highker mentioned this pull request Jul 6, 2022
7 tasks
6 participants