[Feature] Add --max-unfinished-requests apiserver parameter#39492

Open
chaunceyjiang wants to merge 2 commits into vllm-project:main from chaunceyjiang:feature/max-unfinished-requests

Conversation

Collaborator

@chaunceyjiang chaunceyjiang commented Apr 10, 2026

Purpose

Implements the feature requested in #18826.

Closes #18826

Closes #21352

Add new CLI argument --max-unfinished-requests to limit concurrent unfinished requests across all API servers. When the limit is exceeded, new requests are rejected with 503 Service Unavailable.

Test Plan

vllm serve /mnt/data3/models/MiniMax/MiniMax-M2.5 -tp 4 --tool-call-parser minimax_m2 --enable-auto-tool-choice --reasoning-parser minimax_m2 --trust-remote-code --max-unfinished-requests 10 --api-server-count 4

Test Result

(ApiServer_1 pid=3380147) INFO:     127.0.0.1:52948 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(ApiServer_1 pid=3380147) INFO:     127.0.0.1:53006 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(ApiServer_1 pid=3380147) INFO:     127.0.0.1:53012 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(ApiServer_1 pid=3380147) INFO:     127.0.0.1:53012 - "POST /v1/chat/completions HTTP/1.1" 503 Service Unavailable
(ApiServer_1 pid=3380147) INFO:     127.0.0.1:53006 - "POST /v1/chat/completions HTTP/1.1" 503 Service Unavailable
(ApiServer_2 pid=3380148) INFO:     127.0.0.1:52970 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(ApiServer_3 pid=3380149) INFO:     127.0.0.1:52964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(ApiServer_3 pid=3380149) INFO:     127.0.0.1:52990 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(ApiServer_2 pid=3380148) INFO:     127.0.0.1:53022 - "POST /v1/chat/completions HTTP/1.1" 200 OK


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Add new CLI argument --max-unfinished-requests to limit concurrent
unfinished requests across all API servers. When the limit is exceeded,
new requests are rejected with 503 Service Unavailable.

Features:
- Single server mode: checks local server_load_metrics directly
- Multi-server mode: uses shared multiprocessing.Array to aggregate
  counts from all API servers and check the total
- Auto-enables --enable-server-load-tracking when this option is set
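The multi-server counting scheme described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: names like `unfinished_counts`, `try_admit_request`, and `server_index` are assumptions. The key idea is one shared slot per API server process, with the check and the increment done under the array's built-in lock.

```python
# Hypothetical sketch of the shared-counter approach described above.
# Names (unfinished_counts, try_admit_request, finish_request) are
# illustrative assumptions, not vLLM's actual identifiers.
import multiprocessing

API_SERVER_COUNT = 4
MAX_UNFINISHED_REQUESTS = 10

# One slot per API server process; created before forking so every
# worker shares the same memory.
unfinished_counts = multiprocessing.Array("i", API_SERVER_COUNT)

def try_admit_request(server_index: int) -> bool:
    """Admit a request only if the global unfinished total is below the cap."""
    with unfinished_counts.get_lock():
        total = sum(unfinished_counts)
        if total >= MAX_UNFINISHED_REQUESTS:
            return False  # caller should respond with 503
        unfinished_counts[server_index] += 1
        return True

def finish_request(server_index: int) -> None:
    """Decrement this server's slot when a request completes or errors."""
    with unfinished_counts.get_lock():
        unfinished_counts[server_index] -= 1
```

In this sketch the aggregation is a plain `sum()` over the shared array, which only stays correct because the decrement in `finish_request` runs on every completion path, including errors and client disconnects.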

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements a global request limit across multiple API servers using shared memory. It introduces the --max-unfinished-requests parameter and updates the load_aware_call decorator to reject requests with a 503 error when the limit is reached. Review feedback identifies several critical issues: the shared memory array is not updated during request completion, leading to stale data; a race condition exists in the limit-checking logic; and the shared array should use a lock to ensure consistent reads during summation.
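The race condition flagged in the review can be made concrete with a small sketch (illustrative only, not the PR's code; `racy_admit`/`safe_admit` and `LIMIT` are assumed names). With a non-atomic check-then-increment, two processes can both read a total below the limit and both admit a request; holding the array's lock across the check and the update closes the window.

```python
# Illustration of the review's race-condition concern (not vLLM code).
import multiprocessing

counts = multiprocessing.Array("i", 2)
LIMIT = 1

def racy_admit(idx: int) -> bool:
    # BUG: the sum and the increment are separate atomic steps, so two
    # processes can both observe total < LIMIT and both be admitted.
    if sum(counts) >= LIMIT:
        return False
    counts[idx] += 1
    return True

def safe_admit(idx: int) -> bool:
    # Fix: hold the array's lock across both the check and the update,
    # which also gives a consistent snapshot for the summation.
    with counts.get_lock():
        if sum(counts) >= LIMIT:
            return False
        counts[idx] += 1
        return True
```

The same lock also addresses the review's point about consistent reads: summing the array without it can interleave with increments from other servers.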

Comment threads:
- vllm/entrypoints/utils.py
- vllm/entrypoints/utils.py (outdated)
- vllm/entrypoints/cli/serve.py
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang chaunceyjiang force-pushed the feature/max-unfinished-requests branch from 9990397 to bc30971 Compare April 10, 2026 10:30
@chaunceyjiang chaunceyjiang requested a review from orozery April 10, 2026 10:32
@chaunceyjiang
Collaborator Author

#27064 (comment)

Hi @orozery PTAL.

@chaunceyjiang
Collaborator Author

/cc @DarkLight1337 PTAL.



Development

Successfully merging this pull request may close these issues.

[RFC]: Controlling the maximum length of the waiting queue