[P/D] Provide bucket algorithm rate limiter for proxy_server #22643
Conversation
…xy to handle concurrent requests and prevent the prefill or decode service from crashing or hanging (vllm-project#22575) Signed-off-by: frankie-ys <[email protected]>
Code Review
This pull request introduces a rate limiter and a request queue to the disaggregation proxy server to prevent crashes under high concurrency. The implementation uses a token bucket algorithm for rate limiting and a semaphore for controlling concurrent requests to the backend. While this is a good approach to solve the stability issue, there is a critical flaw in the RateLimiter.acquire method where an asyncio lock is held during an await asyncio.sleep(). This will serialize all requests and severely impact performance, defeating the purpose of using an asynchronous framework. I've provided a suggestion to fix this issue.
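For context, a token-bucket acquire that avoids this pitfall computes the wait time while holding the lock and sleeps only after releasing it. A minimal sketch under those assumptions (illustrative names, not the PR's exact code):

import asyncio
import time

class RateLimiter:
    """Token bucket: admits roughly rate_limit requests per second."""

    def __init__(self, rate_limit: int):
        self.rate_limit = rate_limit          # tokens added per second
        self.tokens = float(rate_limit)       # start with a full bucket
        self.last_refill = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        while True:
            async with self._lock:
                # Refill tokens based on elapsed time, capped at the bucket size.
                now = time.monotonic()
                self.tokens = min(
                    self.rate_limit,
                    self.tokens + (now - self.last_refill) * self.rate_limit,
                )
                self.last_refill = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Not enough tokens: compute how long until one is available.
                wait_time = (1 - self.tokens) / self.rate_limit
            # Sleep *outside* the lock so other coroutines are not serialized.
            await asyncio.sleep(wait_time)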
Signed-off-by: frankie-ys <[email protected]>
Yes, I forgot about this issue; I have recommitted the file.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; instead, it would only run … Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Signed-off-by: frankie-ys <[email protected]>
Signed-off-by: frankie-ys <[email protected]>
Signed-off-by: frankie-ys <[email protected]>
Signed-off-by: frankie-ys <[email protected]>
Signed-off-by: frankie-ys <[email protected]>
Signed-off-by: frankie-ys <[email protected]>
Retrying
Signed-off-by: frankie-ys <[email protected]>
Thanks, after merging from the main branch, it works well.
Signed-off-by: frankie-ys <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: frankie <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: frankie <[email protected]>
Modify the variables and abstract the request queue into a separate file (vllm-project#22643) Signed-off-by: frankie-ys <[email protected]>
…ject#22643) Signed-off-by: frankie-ys <[email protected]>
…ject#22643) Signed-off-by: frankie-ys <[email protected]>
Signed-off-by: frankie-ys <[email protected]>
Signed-off-by: frankie-ys <[email protected]>
KuntaiDu
left a comment
The other parts LGTM.
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=300)
# Maximum concurrent requests to backend services
MAX_CONCURRENT_REQUESTS = 100
REQUEST_QUEUE_SIZE = 500  # Maximum number of requests in the queue
RATE_LIMIT = 40  # Maximum requests per second (rate limiting)
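For orientation, constants like these are typically turned into runtime objects roughly as in the sketch below; the names here (and the use of the RateLimiter sketch from the review discussion above) are assumptions, not necessarily the PR's actual wiring:

import asyncio

# Bound in-flight backend calls, buffer waiting requests, and smooth the arrival rate.
request_semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
request_queue: asyncio.Queue = asyncio.Queue(maxsize=REQUEST_QUEUE_SIZE)
rate_limiter = RateLimiter(RATE_LIMIT)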
It would be nice if we could move these to CLI args.
Yeah, I have moved the variables to CLI args.
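For illustration, exposing these knobs as CLI args usually looks roughly like the argparse sketch below; the flag names and defaults are assumptions, not necessarily what the PR ended up with:

import argparse

parser = argparse.ArgumentParser(description="P/D disaggregation proxy server")
parser.add_argument("--max-concurrent-requests", type=int, default=100,
                    help="Maximum concurrent requests forwarded to backend services")
parser.add_argument("--request-queue-size", type=int, default=500,
                    help="Maximum number of requests waiting in the queue")
parser.add_argument("--rate-limit", type=int, default=40,
                    help="Maximum requests per second admitted by the proxy")
args = parser.parse_args()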
Signed-off-by: frankie-ys <[email protected]>
Signed-off-by: frankie-ys <[email protected]>
Signed-off-by: frankie-ys <[email protected]>
Signed-off-by: frankie-ys <[email protected]>
KuntaiDu
left a comment
The code looks much much cleaner. LGTM!
But since people are currently fixing CI, I will postpone the merge until CI is green.
The CI should be green now apart from the nightly tests, so feel free to merge.
Can you merge from the main branch?
Sure! Thanks for letting me know!
…ject#22643) Signed-off-by: frankie-ys <[email protected]> Signed-off-by: frankie <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Kuntai Du <[email protected]>
…ject#22643) Signed-off-by: frankie-ys <[email protected]> Signed-off-by: frankie <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Kuntai Du <[email protected]>
…ject#22643) Signed-off-by: frankie-ys <[email protected]> Signed-off-by: frankie <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Signed-off-by: Duncan Moss <[email protected]>
…ject#22643) Signed-off-by: frankie-ys <[email protected]> Signed-off-by: frankie <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Kuntai Du <[email protected]>
…ject#22643) Signed-off-by: frankie-ys <[email protected]> Signed-off-by: frankie <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
…ject#22643) Signed-off-by: frankie-ys <[email protected]> Signed-off-by: frankie <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Kuntai Du <[email protected]>

Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.Purpose
I found that when running vLLM with the 1P1D disaggregation example, the proxy server has no rate limiter. When request concurrency exceeds 20, the prefill or decode instance hangs or crashes. After adding a rate limiter to the proxy server, it works smoothly. This also solves the problem in https://github.com/vllm-project/vllm/issues/11247, which is caused by high request concurrency.
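As a rough, hypothetical illustration of this flow (not the PR's exact proxy code): each incoming request first passes the token-bucket limiter and then a semaphore bounding concurrent backend calls, so bursts are smoothed before they reach the prefill/decode instances. The sketch assumes the rate_limiter and request_semaphore objects from the sketches above.

import aiohttp

AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=300)

async def forward_request(url: str, payload: dict) -> dict:
    # Smooth the arrival rate, then bound the number of in-flight backend calls.
    await rate_limiter.acquire()
    async with request_semaphore:
        async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
            async with session.post(url, json=payload) as resp:
                resp.raise_for_status()
                return await resp.json()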
What's more, you can contact me via email if you have any questions.
Test Plan
No need to add new tests.
Test Result
(Optional) Documentation Update