[LoRA] Support dual CUDA streams - Linear Layer #35721
DarkLight1337 merged 36 commits into vllm-project:main
Conversation
WIP. Triggering CI to test for any uncovered cases.
Code Review
This pull request introduces support for dual CUDA streams to enable overlapping base layer and LoRA computations, which is a great performance optimization. The implementation correctly sets up a custom PyTorch operation and an auxiliary stream. However, I've found a critical issue in the asynchronous implementation that serializes the computations, defeating the purpose of using dual streams. My review includes a comment with a suggested fix for this issue. The other changes related to plumbing for this feature seem correct.
```python
# LoRA stream waits for base layer output before reading.
self._lora_stream.wait_stream(current_stream())
```
The comment on line 212 is incorrect, and the wait_stream call on line 213 introduces a serialization point that prevents the intended overlap between the base layer and LoRA computations. The LoRA computation depends on the input x, not the output of the base layer. The wait on the current stream here forces the LoRA computation to wait until the base layer computation is complete, defeating the purpose of using a separate stream. Removing this wait will allow the two computations to run in parallel as intended.
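The corrected synchronization pattern the review describes can be sketched as follows. This is a hypothetical standalone helper, not vLLM's actual implementation: the names `dual_stream_lora_linear`, `base_weight`, `lora_a`, and `lora_b` are illustrative. The key point is that the LoRA stream waits only until the shared input `x` is ready on the main stream, so the base GEMM and the LoRA shrink/expand can overlap, and the main stream joins the LoRA stream once before the final add. A serial fallback keeps the math identical when CUDA is unavailable.

```python
import torch


def dual_stream_lora_linear(x, base_weight, lora_a, lora_b, lora_stream=None):
    """Sketch of overlapping the base GEMM and the LoRA branch on two streams.

    Computes x @ base_weight.T + (x @ lora_a.T) @ lora_b.T.
    Hypothetical helper for illustration only.
    """
    if lora_stream is None or not torch.cuda.is_available():
        # Serial fallback (CPU or single stream): same math, no overlap.
        base_out = x @ base_weight.t()
        lora_out = (x @ lora_a.t()) @ lora_b.t()
        return base_out + lora_out

    main = torch.cuda.current_stream()
    # The LoRA branch depends only on x, so wait for x to be produced on
    # the main stream -- not for the base layer output.
    lora_stream.wait_stream(main)
    with torch.cuda.stream(lora_stream):
        lora_out = (x @ lora_a.t()) @ lora_b.t()
    # Base GEMM runs concurrently on the main stream.
    base_out = x @ base_weight.t()
    # Join once before combining the two results.
    main.wait_stream(lora_stream)
    return base_out + lora_out
```

With the erroneous `wait_stream` on the base-layer output removed, the two matmuls are free to execute concurrently; the only required ordering is input-ready before the LoRA branch starts, and both branches done before the add.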
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @jeejeelee, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Hi @jeejeelee, will you extend this to MoE LoRA layers in the future? Thanks!
varun-sundar-rabindranath
left a comment
Thanks @jeejeelee. Left some comments; generally looks good to me. Will take another look tomorrow.
@jhaotingc Yeah, I will
```python
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS: bool = False
VLLM_NIXL_EP_MAX_NUM_RANKS: int = 32
VLLM_XPU_ENABLE_XPU_GRAPH: bool = False
VLLM_LORA_ENABLE_DUAL_STREAM: bool = False
```
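The new `VLLM_LORA_ENABLE_DUAL_STREAM` flag defaults to `False`, so the dual-stream path is opt-in. A minimal sketch of how such a boolean environment flag might be read (a hypothetical helper for illustration; not vLLM's actual env parsing code):

```python
import os


def lora_dual_stream_enabled() -> bool:
    """Return True if the dual-stream LoRA path is opted into via env var.

    Hypothetical helper: mirrors the common convention that unset or "0"
    means disabled, while "1"/"true" (case-insensitive) means enabled.
    """
    value = os.environ.get("VLLM_LORA_ENABLE_DUAL_STREAM", "0")
    return value.strip().lower() in ("1", "true")
```

Keeping the feature behind a default-off flag lets the dual-stream code path ship without changing behavior for existing deployments.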
A lot of the changes in this file appear to be linting-only. Is it maybe a linter version mismatch? Can you check, please? Thanks.
varun-sundar-rabindranath
left a comment
LGTM ! Thanks @jeejeelee .
Left a comment on linting in envs.py. PTAL. Thanks 🙌
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.