Reduce kernel overhead when the number of active LoRAs is smaller than max LoRAs: multiple CUDA graphs are captured, one per number of active LoRAs. #32005
Conversation
Code Review
This pull request introduces an optimization for LoRA serving by specializing CUDA graphs for different numbers of active LoRA adapters, controlled by a new specialize_active_lora flag. The changes are well-structured and span configuration, CUDA graph dispatching logic, LoRA kernel metadata, and Triton kernels to support this specialization. The implementation is mostly correct and consistent. However, I've identified a critical issue in the CUDA graph dispatching logic that prevents the feature from working as intended. The dispatcher doesn't correctly map the runtime number of active LoRAs to the captured graphs. I've provided a detailed comment and a code suggestion to fix this.
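For illustration, a nearest-captured-count lookup of the kind this review describes might look like the sketch below. The function and variable names are assumptions for illustration, not the PR's actual code; the key point is that a runtime count must be rounded *up* to a captured graph size, falling back to the largest graph when it exceeds every captured count.

```python
import bisect

def select_captured_lora_count(num_active_loras: int,
                               captured_lora_counts: list[int]) -> int:
    """Round a runtime LoRA count up to the nearest captured graph size.

    captured_lora_counts is assumed sorted ascending, e.g. [1, 2, 4]
    when max_loras = 4 (powers of two up to max_loras).
    """
    idx = bisect.bisect_left(captured_lora_counts, num_active_loras)
    if idx == len(captured_lora_counts):
        # More active LoRAs than any specialized graph was captured for:
        # fall back to the largest captured count.
        return captured_lora_counts[-1]
    return captured_lora_counts[idx]
```

Dispatching with a mapping like this means a batch with 3 active LoRAs replays the graph captured for 4, which is safe (the kernel simply sees some inactive slots) whereas rounding down would drop adapters.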
@yugong333 Could you please check the bot comments first?

Hi @jeejeelee I have fixed the bugs. Thanks for your review!
@ProExpertProg Could you please take a look at this PR? Thank you! |
ProExpertProg left a comment:
I am not that familiar with LoRA but the cudagraph changes look good to me
```python
def _reset(self):
    self.active_lora_ids.fill_(-1)
    self.num_tokens_per_lora.fill_(0)
    self.lora_token_start_loc.fill_(0)
    self.no_lora_flag_cpu.fill_(False)
    self.num_active_loras = 0
```
Doesn't `captured_lora_counts` need a reset?
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @yugong333, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
…max loras. Multiple cuda graphs are captured for each num of active-loras. (vllm-project#32005) Signed-off-by: Yu Gong <yu3.gong@gmail.com> Signed-off-by: Pai <416932041@qq.com>
### What this PR does / why we need it?

1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required positional argument: 'is_sequence_parallel'` due to vllm-project/vllm#32567
2. Fix `TypeError: '>' not supported between instances of 'MagicMock' and 'int'` due to vllm-project/vllm#33035
3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with abstract methods forward_mha, forward_mqa` and `AttributeError: 'bool' object has no attribute 'process_weights_after_loading'` due to vllm-project/vllm#33284
4. Fix `'AscendSharedFusedMoE' object has no attribute '_routed_input_transform'` due to vllm-project/vllm#32790
5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument 'num_active_loras'` due to vllm-project/vllm#32005
6. Fix the problem caused by `'tuple' object has no attribute 'job_id'` due to vllm-project/vllm#27492
7. Fix the problem that `all_moe_layers` does not match `vllm.moe_forward`, `vllm.moe_forward_shared` due to vllm-project/vllm#33184
8. Add a patch to fix "got multiple values for keyword argument 'add_special_tokens'" due to vllm-project/vllm#32863

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
… vLLM changed. (#6958)

### What this PR does / why we need it?
Fix the LoRA e2e test accuracy issue introduced by the upstream PR vllm-project/vllm#32005.

### How was this patch tested?
`pytest -sv tests/e2e/singlecard/test_llama32_lora.py`
- vLLM version: v0.16.0
- vLLM main: vllm-project/vllm@15d76f7

Signed-off-by: paulyu12 <507435917@qq.com>
Signed-off-by: yupeng <507435917@qq.com>
Purpose
Test Plan
Benchmarked with llmperf at concurrency = 1, 2, 4, 8 with max-loras = 4.
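A server launch along these lines would exercise the feature (a sketch, not the exact command from this PR: the adapter names and paths are placeholders, `--specialize-active-lora` is the flag this PR adds, and the remaining flags are standard vLLM LoRA-serving options):

```shell
# Serve a base model with up to 4 LoRA adapters and per-count CUDA graph
# specialization enabled. Model and adapter paths below are placeholders.
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --specialize-active-lora \
  --lora-modules lora1=/path/to/adapter1 lora2=/path/to/adapter2
```

Varying how many of the configured adapters the benchmark traffic actually targets is what changes `num_active_loras` and therefore which specialized graph is replayed.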
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.
Note
Introduces specialization by active LoRA count and plumbs it end-to-end to reduce kernel/grid overhead when fewer adapters are used.
- Adds `LoRAConfig.specialize_active_lora` and the `--specialize-active-lora` CLI flag; stores `num_active_loras` in kernel metadata and `BatchDescriptor`
- `cudagraph_dispatcher` and the runner capture/dispatch multiple CUDA graphs for LoRA counts (powers of 2 up to `max_loras`), with fallback to the nearest captured count
- Updates LoRA kernels (`lora_shrink`, `lora_expand`, `fused_moe_lora`) to accept `num_active_loras` and optionally size the launch grid by it; propagated via the Punica wrapper and fake ops

Written by Cursor Bugbot for commit bb040594f60f02a5a5fd8b8bf3e9630da19d47b7. This will update automatically on new commits.
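The powers-of-two capture schedule described in the summary can be sketched as below (an illustrative reconstruction; the PR's exact set may differ, e.g. by also capturing `max_loras + 1`):

```python
def lora_capture_counts(max_loras: int) -> list[int]:
    """Active-LoRA counts to capture specialized CUDA graphs for:
    powers of two below max_loras, plus max_loras itself."""
    counts = []
    n = 1
    while n < max_loras:
        counts.append(n)
        n *= 2
    counts.append(max_loras)
    return counts
```

So with `max_loras = 4` three graphs are captured (for 1, 2, and 4 active LoRAs) per batch-size bucket, and a batch with 3 active LoRAs replays the 4-LoRA graph.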
Note
Introduces LoRA-count specialization to reduce kernel/grid overhead when fewer adapters are active.
- Adds `LoRAConfig.specialize_active_lora` and `--specialize-active-lora`; stores `num_active_loras` in `BatchDescriptor` and LoRA kernel metadata (`has_lora`, `num_active_loras`); captures graphs for powers-of-two counts up to `max_loras + 1` and dispatches to the nearest captured count
- LoRA kernels (`lora_shrink`, `lora_expand`, `fused_moe_lora`) accept `num_active_loras` and optionally size the launch grid by it; fake ops and the Punica wrapper are updated accordingly
- `LoRAKernelMeta` now tracks `num_active_loras` and an optional `captured_lora_counts` for rounding during cudagraph use

Written by Cursor Bugbot for commit cfb1a4e9f4c83820ea67821ade4e5e5dcaee9611. This will update automatically on new commits.