[EP/DP][API Server] Enable DP-aware routing in OpenAI API requests #24945
Conversation
Code Review
This pull request effectively enables DP-aware routing by introducing the data_parallel_rank parameter in OpenAI API requests. The changes are well-implemented across the protocol, serving layer, and client logic, and include a new test for validation. The core functionality appears solid. My review identified one critical issue in a new Kubernetes manifest file which seems to be a copy-paste error, making the file invalid.
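For orientation, here is a hedged sketch of what the body-based parameter could look like from a client's point of view; the host, port, and model below are placeholders rather than values taken from this PR.

```python
import requests

# Illustrative only: pass data_parallel_rank as a field in the request body.
# Host, port, and model are placeholders, not taken from this PR's tests.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-350m",
        "prompt": "Hello, world",
        "max_tokens": 16,
        "data_parallel_rank": 3,  # ask the server to handle this request on DP rank 3
    },
)
print(resp.json())
```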
vllm-node1-bench-fixed.yaml
Outdated
```yaml
restartPolicy: Never# vLLM Bench deployment on Node 1 targeting the router - Fixed version
apiVersion: v1
kind: Pod
metadata:
  name: vllm-node1-bench
  labels:
    app: vllm-node1-bench
    node-role: load-tester
spec:
  containers:
  - name: vllm-bench
    image: vllm/vllm-openai:latest
    command: ["/bin/bash", "-c"]
    args:
    - |
      set -e
      echo "🏗️ Setting up vLLM Bench on Node 1..."
      echo "🎯 Target: vLLM Router managing 7 servers"

      # Wait for router to be ready
      echo "⏳ Waiting for vLLM router to be ready..."
      router_url="http://vllm-router-all-nodes-service.default.svc.cluster.local:8080"

      while ! curl -s "$router_url/health" > /dev/null 2>&1; do
        echo "Router at $router_url not ready, waiting 10 seconds..."
        sleep 10
      done

      echo "✅ Router is ready! Router health:"
      curl -s "$router_url/health"

      echo "🚀 Starting vLLM bench serve load test (1-minute test)..."
      echo "📊 Configuration:"
      echo " • Target URL: $router_url"
      echo " • Model: facebook/opt-350m"
      echo " • Requests: 200 (1-minute test)"
      echo " • Concurrency: 16"
      echo " • Input length: 100 tokens"
      echo " • Output length: 50 tokens"

      # Run vLLM bench serve command (1-minute test)
      vllm bench serve \
        --dataset-name random \
        --num-prompts 200 \
        --model facebook/opt-350m \
        --random-input-len 100 \
        --random-output-len 50 \
        --endpoint /v1/completions \
        --base-url "$router_url" \
        --max-concurrency 16 \
        --save-result \
        --ignore-eos \
        --served-model-name facebook/opt-350m \
        --result-filename /tmp/bench_results.json

      echo "✅ Load test completed!"
      echo "📊 Results saved to /tmp/bench_results.json"

      # Show summary of results using simple shell commands
      if [ -f /tmp/bench_results.json ]; then
        echo "📈 Load test results available at /tmp/bench_results.json"
        echo "File size: $(wc -c < /tmp/bench_results.json) bytes"
      fi

      # Keep container alive for result analysis
      echo "📈 Keeping container alive for result analysis..."
      echo "Access results with: kubectl exec vllm-node1-bench -- cat /tmp/bench_results.json"

      while true; do
        echo "📊 $(date): Load test completed. Router status:"
        curl -s "$router_url/health" || echo "Router not responding"
        sleep 300  # Check every 5 minutes
      done
    ports:
    - containerPort: 8080
      name: http
    env:
    - name: PYTHONUNBUFFERED
      value: "1"
    - name: CUDA_VISIBLE_DEVICES
      value: ""  # No GPU needed for bench client
    volumeMounts:
    - mountPath: /tmp
      name: results-volume
    resources:
      requests:
        cpu: "4"
        memory: "8Gi"
      limits:
        cpu: "8"
        memory: "16Gi"
  volumes:
  - name: results-volume
    emptyDir: {}
  # Pin to Node 1 specifically
  nodeSelector:
    kubernetes.io/hostname: ip-192-168-45-140.us-west-2.compute.internal
  restartPolicy: Never
```
I think this should be implemented as a header rather than in the body, similar to how external processes can add a request ID (
So this could be something like
This is more canonical with how extra arguments are added to a standard API.
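For comparison, the header-based variant being suggested here could look roughly like the sketch below; the X-data-parallel-rank name is taken from the commits later in this PR, and the rest is a placeholder.

```python
import requests

# Same request as the body-based sketch above, but the DP rank travels as a
# header (name taken from this PR's later commits) instead of a body field.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"X-data-parallel-rank": "3"},
    json={"model": "facebook/opt-350m", "prompt": "Hello, world", "max_tokens": 16},
)
print(resp.json())
```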
@robertgshaw2-redhat The current approach is consistent with how SGLang handles DP-aware requests. With this change to the vLLM server, the sgl-router fork can send requests directly to the vLLM server without requiring any API modifications. That said, your suggestion is valid; P/D disaggregation also embeds the prefill/decode address in the X-Request-Id. I'll make a few adjustments to incorporate your recommendation.
```python
@router.get("/get_server_info")
```
Is this needed? I think it's usually not great to allow external users to see this type of server info.
I think that instead we should make sure the server is in DEV mode before exposing this.
This design is inherited from the sglang API, which allows its router to fetch the entire server config.
We have a couple of alternative implementation options:
- Pass data_parallel_size as an input flag to the router. This would require all vllm servers to use the same DP size (or we could further separate it into prefill and decode configs).
- Create a /get_data_parallel_size endpoint, allowing us to expose only this specific configuration to callers.
@robertgshaw2-redhat Which of these approaches would you recommend?
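To make the second option above concrete, here is a minimal sketch of what a dedicated endpoint could look like; the endpoint name comes from the option above, while the way the config is reached is an assumption rather than vLLM's actual wiring.

```python
# Sketch of option 2: expose only the DP size instead of the full server config.
# The attachment point for vllm_config here is an assumption, not vLLM's actual code.
from fastapi import APIRouter, Request

router = APIRouter()

@router.get("/get_data_parallel_size")
async def get_data_parallel_size(raw_request: Request):
    vllm_config = raw_request.app.state.vllm_config  # assumed attachment point
    return {"data_parallel_size": vllm_config.parallel_config.data_parallel_size}
```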
I agree it's good to keep the discovery part separate (can figure out in future PR). The header changes in this one look good to me now.
This pull request has merge conflicts that must be resolved before it can be merged.
Documentation preview: https://vllm--24945.org.readthedocs.build/en/24945/
Thanks @Prowindy! LGTM
Looks like there is a commit which needs DCO sign-off, see https://github.com/vllm-project/vllm/pull/24945/checks?check_run_id=53896850372, and some pre-commit errors.
Also I wonder if we could add something to the docs to cover this. Perhaps add a note to the external router section of the data parallel page: https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html#external-load-balancing
```python
        return raw_request.headers.get("X-Request-Id", default)

    @staticmethod
    def _get_data_parallel_rank(raw_request: Request | None) -> int | None:
```
An attack vector:
- user specifies an X-data-parallel-rank that is > DP size or is negative
- user specifies X-data-parallel-rank when the server is not using DP

What happens? We need to be careful as this is untrusted user input that could cause DoS issues.
Perhaps we should only parse this field if the server operator opts in, and log a warning about it. We also need to sanitize the inputs.
The engine performs a sanity check on the input DP rank range before utilizing it:
vllm/vllm/v1/engine/processor.py, lines 356 to 363 at 141e6a0:

```python
data_parallel_size = self.vllm_config.parallel_config.data_parallel_size
if data_parallel_rank is not None and not (
    0 <= data_parallel_rank < data_parallel_size
):
    raise ValueError(
        f"data_parallel_rank {data_parallel_rank} "
        f"is out of range [0, {data_parallel_size})."
    )
```
Do we want to enhance safety by clamping, or raising exceptions during parsing?
@robertgshaw2-redhat I tested the vLLM patch by sending requests with invalid DP ranks, including negative or large numbers and non-integer characters. In these cases, the system falls back to a regular request without DP rank info. Would this be our expected behavior?
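For readers following the thread, here is a hedged sketch of a parse-and-sanitize helper that matches the fallback behavior described above; it is illustrative and may differ from the PR's actual implementation.

```python
from fastapi import Request

def get_data_parallel_rank(raw_request: Request | None) -> int | None:
    """Illustrative only: extract X-data-parallel-rank, falling back to None
    (a regular, non-pinned request) on missing or malformed values."""
    if raw_request is None:
        return None
    value = raw_request.headers.get("X-data-parallel-rank")
    if value is None:
        return None
    try:
        rank = int(value)
    except ValueError:
        return None  # non-integer header: ignore and route normally
    # Reject negatives here; the engine separately range-checks the rank
    # against data_parallel_size (see the processor.py excerpt above).
    return rank if rank >= 0 else None
```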
Signed-off-by: Cong Chen <[email protected]>

This commit adds support for data parallel rank routing in the OpenAI API to enable intelligent load balancing across data parallel instances.

Key changes:
- Add _get_data_parallel_rank() method to extract X-data-parallel-rank header
- Pass data_parallel_rank to engine.generate() in chat and completion APIs
- Add /get_server_info endpoint to return DP size for router discovery
- Add test coverage for data_parallel_rank extraction

The X-data-parallel-rank header allows external routers to specify which data parallel rank should handle a request, enabling DP-aware request routing and load balancing.
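To illustrate how an external router could combine the pieces this commit describes, here is a hedged sketch that discovers the DP size via /get_server_info and round-robins requests using the X-data-parallel-rank header; the response field name, server address, and round-robin policy are assumptions, not part of this PR.

```python
import itertools
import requests

SERVER = "http://localhost:8000"  # assumed vLLM API server address

# Discover the DP size for routing (response field name is an assumption).
info = requests.get(f"{SERVER}/get_server_info").json()
dp_size = int(info.get("data_parallel_size", 1))
rank_cycle = itertools.cycle(range(dp_size))

def route_completion(prompt: str) -> dict:
    """Send a completion request pinned to the next DP rank (round-robin)."""
    rank = next(rank_cycle)
    resp = requests.post(
        f"{SERVER}/v1/completions",
        headers={"X-data-parallel-rank": str(rank)},
        json={"model": "facebook/opt-350m", "prompt": prompt, "max_tokens": 16},
    )
    return resp.json()
```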
Signed-off-by: Cong Chen [email protected]
Purpose
This PR adds support for the data_parallel_rank parameter in vLLM's OpenAI API, enabling external routers to specify which data parallel rank should handle a request. This enhancement improves load distribution and cache locality in multi-GPU deployments by allowing DP-aware routing from router to engine core.

Key functionality:
- Accept data_parallel_rank in API requests

Test Plan
- Added comprehensive logging throughout the request pipeline
- Implemented rank counters to track request distribution
- Verified GPU device assignment for each request
- Tested with multiple routing policies (round_robin, cache_aware)
Test Result
✅ Successful DP-Aware Routing Validation:
```
workers: ["http://0.0.0.0:8000@0", "http://0.0.0.0:8000@2", "http://0.0.0.0:8000@4", "http://0.0.0.0:8000@6"]
Current counters: {0: 2, 2: 3, 4: 3, 6: 2}
API_SERVER_DEBUG: Received completion request with data_parallel_rank=4
ROUTING_DEBUG: Using data_parallel_rank=4 directly
ENGINE_DISPATCH DEBUG: Request dispatched to engine identity 0400 (rank 4)
GPU_DEVICE_DEBUG: Request received on GPU device, parallel_config.rank=4
```
- EngineCore_0 pid=296341 → Rank 0 requests
- EngineCore_2 pid=296343 → Rank 2 requests
- EngineCore_4 pid=296345 → Rank 4 requests
- EngineCore_6 pid=296347 → Rank 6 requests