[Bugfix] fix DP-aware routing in OpenAI API requests #29002
njhill merged 10 commits into vllm-project:main from
Conversation
Code Review
This pull request aims to fix an issue with Data Parallelism-aware routing by propagating the data_parallel_rank to the engine's processor. The changes correctly pass the rank through serving_completion.py and serving_engine.py.
However, I've identified a critical issue in serving_engine.py where the signature of _process_inputs is changed in a way that breaks other parts of the code and has an incorrect type hint. I've left a comment with a suggested fix to make the new argument optional, which will prevent runtime errors.
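For illustration only, here is a minimal sketch of the kind of backwards-compatible signature the review is asking for; the class name, surrounding parameters, and types are simplified assumptions, not the actual vLLM code:

```python
from typing import Any


class ServingEngineSketch:
    """Illustrative stand-in for the serving layer, not the real vLLM class."""

    async def _process_inputs(
        self,
        request_id: str,
        prompt: Any,
        params: Any,
        *,
        lora_request: Any = None,
        trace_headers: dict[str, str] | None = None,
        priority: int = 0,
        # Assumed shape: defaulting to None keeps existing call sites working,
        # and callers that omit the rank fall back to default DP routing.
        data_parallel_rank: int | None = None,
    ) -> dict[str, Any]:
        # The real method would forward everything, including
        # data_parallel_rank, to the engine processor.
        return {
            "request_id": request_id,
            "prompt": prompt,
            "params": params,
            "data_parallel_rank": data_parallel_rank,
        }
```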
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed from 7d32d92 to af200e8
Signed-off-by: inkcherry <mingzhi.liu@amd.com>
Force-pushed from af200e8 to 3c79698
    lora_request=lora_request,
    trace_headers=trace_headers,
    priority=request.priority,
    data_parallel_rank=data_parallel_rank,
Looks like a similar fix is required in serving_chat.py?
It would also be good to catch this issue in the test_serving_chat_data_parallel_rank_extraction test.
Even better, it would be great to add a similar test for serving_completion!
@markmc Thanks for the comment. I've added it to serving_chat.py.
I noticed that mock objects won't trigger this error, so I added a test with a real engine for coverage, placed after the test_dp_rank_argument test.
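For context, a rough sketch of the kind of header-extraction check such a test could exercise; the helper function below is a hypothetical stand-in, not the actual serving-layer code or the real test:

```python
# Hypothetical helper mirroring how the serving layer might read the header;
# the real test drives an actual engine instead of calling a helper like this.
def parse_dp_rank_header(headers: dict[str, str]) -> int | None:
    value = headers.get("X-data-parallel-rank")
    return int(value) if value is not None else None


def test_dp_rank_header_extraction() -> None:
    # A supplied header should be parsed into an integer rank...
    assert parse_dp_rank_header({"X-data-parallel-rank": "1"}) == 1
    # ...and an absent header should fall back to None (default DP routing).
    assert parse_dp_rank_header({}) is None
```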
    lora_request=lora_request,
    trace_headers=trace_headers,
    priority=request.priority,
    data_parallel_rank=data_parallel_rank,
Could this be a breaking change for the current API?
Actually no. When unspecified, it defaults to None and the default DP load-balancing algorithm is used.
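As a client-side illustration of that behaviour (the URL, model name, and payload below are placeholders):

```python
import requests

# Requests that set the header ask to be routed to the given DP rank; requests
# that omit it fall back to the default DP load-balancing behaviour.
payload = {"model": "my-model", "prompt": "Hello", "max_tokens": 8}
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json=payload,
    headers={"X-data-parallel-rank": "0"},  # drop this header for default routing
    timeout=30,
)
print(resp.status_code)
```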
@inkcherry could you clarify what failure you saw: a crash or a request failure? Any repro steps would be helpful. I think
Signed-off-by: inkcherry <mingzhi.liu@amd.com>
Thank you for your great work and feedback. I noticed that it does not crash; instead, it fails under certain circumstances (equivalent to the rank not being specified). I have added tests to cover this. I agree with your observation; I did not add support for new endpoints.
njhill left a comment
Thanks @inkcherry!
This doesn't need to be addressed in this PR, but I don't see why we would only want to support this header on the chat and completion endpoints; it could apply similarly to all of the endpoints.
Purpose
fix #24945
In add_request, duplicate initialization is skipped, but during the earlier self.processor.process_inputs call, data_parallel_rank is not initialized. Using -H 'X-data-parallel-rank' to specify the data parallel rank would be invalid in this case. cc @njhill
vllm/v1/engine/async_llm.py, lines 283 to 302 in d69062c
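A simplified sketch of the flow the fix restores; the function and class names below are illustrative, while the real call sites live in serving_completion.py, serving_chat.py, and the engine processor:

```python
from typing import Any


class ProcessorSketch:
    """Stand-in for the engine processor, showing only the relevant argument."""

    def process_inputs(
        self, prompt: Any, *, data_parallel_rank: int | None = None
    ) -> dict[str, Any]:
        # Before the fix, the serving layer never forwarded the rank here, so a
        # rank supplied via the X-data-parallel-rank header was effectively ignored.
        return {"prompt": prompt, "data_parallel_rank": data_parallel_rank}


def handle_request(
    processor: ProcessorSketch, prompt: Any, headers: dict[str, str]
) -> dict[str, Any]:
    # Parse the header if present and propagate it down to process_inputs,
    # which is the behaviour this PR adds for the completion and chat paths.
    raw = headers.get("X-data-parallel-rank")
    rank = int(raw) if raw is not None else None
    return processor.process_inputs(prompt, data_parallel_rank=rank)
```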
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.