
[LoRA] Support dual CUDA streams-Linear Layer#35721

Merged
DarkLight1337 merged 36 commits into vllm-project:main from jeejeelee:lora-dual-stream
Apr 13, 2026
Conversation

@jeejeelee
Collaborator

@jeejeelee jeejeelee commented Mar 2, 2026

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@mergify mergify bot added the nvidia label Mar 2, 2026
@jeejeelee jeejeelee added the ready label (ONLY add when PR is ready to merge/full CI is needed) and removed the nvidia label Mar 2, 2026
@mergify mergify bot added the nvidia label Mar 2, 2026
@jeejeelee
Collaborator Author

WIP. Triggering CI to test for any uncovered cases.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for dual CUDA streams to enable overlapping base layer and LoRA computations, which is a great performance optimization. The implementation correctly sets up a custom PyTorch operation and an auxiliary stream. However, I've found a critical issue in the asynchronous implementation that serializes the computations, defeating the purpose of using dual streams. My review includes a comment with a suggested fix for this issue. The other changes related to plumbing for this feature seem correct.

Comment thread vllm/lora/layers/base_linear.py Outdated
Comment on lines +212 to +213
# LoRA stream waits for base layer output before reading.
self._lora_stream.wait_stream(current_stream())
Contributor


critical

The comment on line 212 is incorrect, and the wait_stream call on line 213 introduces a serialization point that prevents the intended overlap between the base layer and LoRA computations. The LoRA computation depends on the input x, not the output of the base layer. The wait on the current stream here forces the LoRA computation to wait until the base layer computation is complete, defeating the purpose of using a separate stream. Removing this wait will allow the two computations to run in parallel as intended.
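The dependency structure the review describes can be sketched as a small standalone helper. This is a hypothetical illustration, not vLLM's actual code (the real implementation in `vllm/lora/layers/base_linear.py` sits behind a custom op): the auxiliary stream synchronizes on the production of the input `x` *before* the base matmul is enqueued, so the two branches can overlap; on CPU-only tensors it falls back to sequential execution.

```python
import torch


def lora_linear_dual_stream(x, base_weight, lora_a, lora_b, lora_stream=None):
    """Sketch: overlap the base matmul and the LoRA matmuls on two streams.

    Hypothetical helper for illustration. Shapes follow the usual LoRA
    convention: base_weight is (out, in), lora_a is (r, in), lora_b is (out, r).
    """
    if lora_stream is not None and x.is_cuda:
        # The LoRA branch depends only on the input x, so the auxiliary
        # stream waits here -- before the base matmul is enqueued -- for the
        # ops that produced x. Waiting on the current stream *after*
        # launching the base matmul would serialize the two branches.
        lora_stream.wait_stream(torch.cuda.current_stream())
        base_out = x @ base_weight.t()  # enqueued on the current stream
        with torch.cuda.stream(lora_stream):
            # These kernels can now run concurrently with the base matmul.
            lora_out = (x @ lora_a.t()) @ lora_b.t()
        # The current stream must see the LoRA result before the add.
        torch.cuda.current_stream().wait_stream(lora_stream)
    else:
        # Sequential fallback (CPU tensors, or dual-stream disabled).
        base_out = x @ base_weight.t()
        lora_out = (x @ lora_a.t()) @ lora_b.t()
    return base_out + lora_out
```

A production version would also need to handle the caching-allocator cross-stream rules (e.g. `Tensor.record_stream`), which this sketch omits.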

@jeejeelee jeejeelee marked this pull request as draft March 2, 2026 13:18
@jeejeelee jeejeelee removed the ready ONLY add when PR is ready to merge/full CI is needed label Mar 2, 2026
@mergify
Contributor

mergify bot commented Mar 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jeejeelee.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify
Contributor

mergify bot commented Mar 31, 2026

Hi @jeejeelee, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@mergify
Contributor

mergify bot commented Apr 1, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jeejeelee.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2026
@jhaotingc
Contributor

Hi @jeejeelee, will you extend this to MoE LoRA layers in the future? Thanks!

Contributor

@varun-sundar-rabindranath varun-sundar-rabindranath left a comment


Thanks @jeejeelee. Left some comments; generally looks good to me. Will take another look tomorrow.

Comment thread tests/lora/conftest.py
Comment thread vllm/lora/layers/base_linear.py Outdated
Comment thread vllm/lora/layers/base_linear.py Outdated
Comment thread vllm/lora/layers/base_linear.py
Comment thread vllm/lora/layers/base_linear.py
Comment thread vllm/lora/ops/triton_ops/utils.py Outdated
@mergify mergify bot removed the needs-rebase label Apr 3, 2026
@jeejeelee
Collaborator Author

@jhaotingc Yeah, I will

Comment thread vllm/envs.py
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS: bool = False
VLLM_NIXL_EP_MAX_NUM_RANKS: int = 32
VLLM_XPU_ENABLE_XPU_GRAPH: bool = False
VLLM_LORA_ENABLE_DUAL_STREAM: bool = False
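The quoted `envs.py` hunk adds the opt-in flag, disabled by default. Assuming it is read as a standard boolean environment variable (as the `False` default above suggests), enabling the dual-stream LoRA path would look like:

```shell
# Opt in to the dual-stream LoRA path (off by default); set this before
# launching the server, e.g. `vllm serve <model> --enable-lora ...`.
export VLLM_LORA_ENABLE_DUAL_STREAM=1
```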
Contributor


A lot of the changes in this file appear to be linting-only. Is it maybe a linter version mismatch? Can you check please? Thanks.

Contributor

@varun-sundar-rabindranath varun-sundar-rabindranath left a comment


LGTM ! Thanks @jeejeelee .

Left a comment on linting in envs.py. PTAL. Thanks 🙌

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Apr 13, 2026
@DarkLight1337 DarkLight1337 merged commit 715681c into vllm-project:main Apr 13, 2026
57 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Apr 13, 2026
@jeejeelee jeejeelee deleted the lora-dual-stream branch April 13, 2026 02:58
wojciech-wais pushed a commit to wojciech-wais/vllm that referenced this pull request Apr 13, 2026
lisp19 pushed a commit to lisp19/vllm that referenced this pull request Apr 20, 2026

Labels

nvidia · qwen (Related to Qwen models) · ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

4 participants