
Reduce the kernel overhead when num of active loras is smaller than max loras. Multiple cuda graphs are captured for each num of active-loras. #32005

Merged
ProExpertProg merged 31 commits into vllm-project:main from yugong333:new_grid on Feb 2, 2026

Conversation

@yugong333
Contributor

@yugong333 yugong333 commented Jan 9, 2026

Purpose

Test Plan

Benchmark with llmperf at concurrency = 1, 2, 4, and 8 with max-loras = 4.

Test Result

image
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Introduces specialization by active LoRA count and plumbs it end-to-end to reduce kernel/grid overhead when fewer adapters are used.

  • Adds LoRAConfig.specialize_active_lora and --specialize-active-lora CLI flag; stores num_active_loras in kernel metadata and BatchDescriptor
  • cudagraph_dispatcher and runner capture/dispatch multiple CUDA graphs for LoRA counts (powers of 2 up to max_loras), with fallback to nearest captured count
  • Updates Triton ops (lora_shrink, lora_expand, fused_moe_lora) to accept num_active_loras and optionally size the launch grid by it; propagate via Punica wrapper and fake ops
  • Adjusts dummy-run LoRA selection to activate an exact number of adapters for capture; API wiring across v1 worker and forward context
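
The capture scheme summarized above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `capture_counts` and `dummy_lora_mapping` are hypothetical helper names, the adapter id scheme (0 for "no LoRA", 1..k for adapters) is invented for the sketch, and appending `max_loras` itself when it is not a power of two is an assumption.

```python
def capture_counts(max_loras: int) -> list[int]:
    """Powers of two up to max_loras.

    Assumption: max_loras itself is appended when it is not a power of
    two, so that every runtime count has a captured count at or above it.
    """
    counts = []
    n = 1
    while n <= max_loras:
        counts.append(n)
        n *= 2
    if counts and counts[-1] != max_loras:
        counts.append(max_loras)
    return counts


def dummy_lora_mapping(num_tokens: int, num_active_loras: int) -> list[int]:
    """Assign dummy tokens round-robin to exactly num_active_loras adapters,
    so a capture run sees exactly that many distinct adapters."""
    if num_active_loras == 0:
        return [0] * num_tokens  # 0 stands for "no LoRA" in this sketch
    k = min(num_active_loras, num_tokens)
    return [1 + (i % k) for i in range(num_tokens)]


counts = capture_counts(4)             # one graph per entry
mapping = dummy_lora_mapping(8, 2)     # tokens spread over two adapters
```

With `max_loras = 4`, as in the test plan, this yields three LoRA capture sizes instead of a single graph sized for the maximum.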

Written by Cursor Bugbot for commit bb040594f60f02a5a5fd8b8bf3e9630da19d47b7.


Note

Introduces LoRA-count specialization to reduce kernel/grid overhead when fewer adapters are active.

  • Adds LoRAConfig.specialize_active_lora and --specialize-active-lora; stores num_active_loras in BatchDescriptor and LoRA kernel metadata
  • CUDAGraph capture/dispatch updated to key on (has_lora, num_active_loras); captures graphs for powers-of-two counts up to max_loras+1 and dispatches to the nearest captured count
  • Triton ops (lora_shrink, lora_expand, fused_moe_lora) accept num_active_loras and optionally size the launch grid by it; fake ops and Punica wrapper updated accordingly
  • LoRAKernelMeta now tracks num_active_loras and optional captured_lora_counts for rounding during cudagraph use
  • Runner/dispatcher changes propagate active LoRA count through dummy-run capture paths and live dispatch
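
To make the keying and grid-sizing points above concrete, here is a small sketch. It assumes nothing about the real classes: `GraphKey` is a hypothetical stand-in for whatever fields the batch descriptor keys on, `lora_grid` is an invented helper, and the two-axis grid layout is illustrative only, since the actual Triton launch grids are not shown in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphKey:
    """Hypothetical stand-in for the fields a batch descriptor keys on."""
    num_tokens: int
    has_lora: bool
    num_active_loras: int


def lora_grid(num_m_blocks: int, num_n_blocks: int, max_loras: int,
              num_active_loras: int, specialize: bool) -> tuple[int, int]:
    """Size the LoRA axis of the launch grid by the active count when
    specialization is on, otherwise by the configured maximum."""
    lora_axis = num_active_loras if specialize else max_loras
    return (num_m_blocks * num_n_blocks, lora_axis)


# One captured "graph" per key; strings stand in for real CUDA graphs.
graphs = {
    GraphKey(num_tokens=8, has_lora=True, num_active_loras=n): f"graph_lora{n}"
    for n in (1, 2, 4)
}
graphs[GraphKey(num_tokens=8, has_lora=False, num_active_loras=0)] = "graph_no_lora"

# With one active adapter and max_loras = 4, the LoRA axis shrinks from 4 to 1.
small = lora_grid(4, 2, max_loras=4, num_active_loras=1, specialize=True)
full = lora_grid(4, 2, max_loras=4, num_active_loras=1, specialize=False)
```

Because a frozen dataclass is hashable, the active count becomes part of the lookup key, so batches with different adapter counts can replay differently sized graphs.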

Written by Cursor Bugbot for commit cfb1a4e9f4c83820ea67821ade4e5e5dcaee9611.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an optimization for LoRA serving by specializing CUDA graphs for different numbers of active LoRA adapters, controlled by a new specialize_active_lora flag. The changes are well-structured and span configuration, CUDA graph dispatching logic, LoRA kernel metadata, and Triton kernels to support this specialization. The implementation is mostly correct and consistent. However, I've identified a critical issue in the CUDA graph dispatching logic that prevents the feature from working as intended. The dispatcher doesn't correctly map the runtime number of active LoRAs to the captured graphs. I've provided a detailed comment and a code suggestion to fix this.
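
The review does not quote the faulty code, so the following is only a guess at the general shape of such a mismatch, not a reproduction of the bug or of the fix. It assumes graphs are captured for counts 1, 2, and 4, and that a runtime count should round up to the next captured count (a graph captured for fewer adapters could not cover more). `round_to_captured` is a hypothetical helper.

```python
from bisect import bisect_left

captured = [1, 2, 4]  # assumed sorted capture counts for max_loras = 4
graphs = {n: f"graph_lora{n}" for n in captured}


def round_to_captured(num_active_loras: int, captured: list[int]) -> int:
    """Smallest captured count >= the runtime count, clamped to the largest.

    The zero-adapter (no-LoRA) case is assumed to be handled separately.
    """
    i = bisect_left(captured, num_active_loras)
    return captured[min(i, len(captured) - 1)]


# A raw lookup misses for 3 active adapters, since only 1, 2, 4 were captured.
assert graphs.get(3) is None
# Normalizing the count first finds the graph captured for 4.
assert graphs[round_to_captured(3, captured)] == "graph_lora4"
```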

@jeejeelee jeejeelee self-assigned this Jan 9, 2026
@jeejeelee
Collaborator

@yugong333 Could you please check the bot comments first?

@yugong333
Contributor Author

> @yugong333 Could you please check the bot comments first?

Hi @jeejeelee I have fixed the bugs. Thanks for your review!

@jeejeelee
Collaborator

@ProExpertProg Could you please take a look at this PR? Thank you!

Collaborator

@ProExpertProg ProExpertProg left a comment

I am not that familiar with LoRA but the cudagraph changes look good to me

    )

    def _reset(self):
        self.active_lora_ids.fill_(-1)
        self.num_tokens_per_lora.fill_(0)
        self.lora_token_start_loc.fill_(0)
        self.no_lora_flag_cpu.fill_(False)
        self.num_active_loras = 0

Doesn't captured_lora_counts need a reset?
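
The replies to this question are not preserved in the thread, so the sketch below does not claim to show what the PR does. It only illustrates the distinction the question turns on, under the assumption that `captured_lora_counts` is capture-time configuration while the other fields are per-batch state. `KernelMetaSketch` and its `prepare` method are hypothetical, and plain lists stand in for tensors.

```python
class KernelMetaSketch:
    """Hypothetical, list-based stand-in for the LoRA kernel metadata."""

    def __init__(self, max_loras: int, captured_lora_counts=None):
        self.max_loras = max_loras
        # Assumed capture-time configuration: set once, reused across batches.
        self.captured_lora_counts = list(captured_lora_counts or [])
        # Per-batch state: rewritten for every batch.
        self.active_lora_ids = [-1] * max_loras
        self.num_active_loras = 0

    def prepare(self, lora_id_per_token: list[int]) -> None:
        """Record the distinct adapters used by this batch (-1 = no LoRA)."""
        self.reset()
        unique = sorted({i for i in lora_id_per_token if i >= 0})
        self.active_lora_ids[:len(unique)] = unique
        self.num_active_loras = len(unique)

    def reset(self) -> None:
        # Only per-batch state is cleared in this sketch.
        self.active_lora_ids = [-1] * self.max_loras
        self.num_active_loras = 0


meta = KernelMetaSketch(max_loras=4, captured_lora_counts=[1, 2, 4])
meta.prepare([0, 0, 2, -1])
active_before_reset = meta.num_active_loras
meta.reset()
```

If the field were instead recomputed per batch, it would belong in `reset` alongside the others.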

@mergify

mergify bot commented Jan 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yugong333.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify bot commented Jan 26, 2026

Hi @yugong333, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Yu Gong <yu3.gong@gmail.com>
@mergify

mergify bot commented Feb 2, 2026

Hi @yugong333, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Yu Gong <yu3.gong@gmail.com>
@ProExpertProg ProExpertProg merged commit ffe1fc7 into vllm-project:main Feb 2, 2026
53 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 2, 2026
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…max loras. Multiple cuda graphs are captured for each num of active-loras. (vllm-project#32005)

Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Signed-off-by: Pai <416932041@qq.com>
gameofdimension pushed a commit to gameofdimension/vllm that referenced this pull request Feb 5, 2026
…max loras. Multiple cuda graphs are captured for each num of active-loras. (vllm-project#32005)

Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Feb 5, 2026
### What this PR does / why we need it?
1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required
positional argument: 'is_sequence_parallel'` due to
vllm-project/vllm#32567
2. Fix `TypeError: '>' not supported between instances of 'MagicMock'
and 'int'` due to vllm-project/vllm#33035
3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with
abstract methods forward_mha, forward_mqa` and `AttributeError: 'bool'
object has no attribute 'process_weights_after_loading'` due to
vllm-project/vllm#33284
4. Fix `'AscendSharedFusedMoE' object has no attribute
'_routed_input_transform'` due to
vllm-project/vllm#32790
5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument
'num_active_loras'` due to
vllm-project/vllm#32005
6. Fix the problem caused by `'tuple' object has no attribute 'job_id'`
due to vllm-project/vllm#27492
7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward,
vllm.moe_forward_shared due to
vllm-project/vllm#33184
8. Add patch to fix the problem "got multiple values for keyword
argument 'add_special_tokens'" due to
vllm-project/vllm#32863
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
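
Item 5 in the list above is the kind of break that appears when an upstream base class starts passing a new keyword to a method that a downstream subclass overrides. A minimal sketch of the pattern follows, using invented class names (`BaseRunner`, `StaleRunner`, `FixedRunner`) rather than the real vLLM or vllm-ascend runners; only the `_dummy_run` and `num_active_loras` names come from the error message.

```python
class BaseRunner:
    """Stand-in for an upstream runner whose hook gained a new keyword."""

    def _dummy_run(self, num_tokens: int, num_active_loras: int = 0) -> str:
        return f"base:{num_tokens}:{num_active_loras}"

    def capture(self) -> list[str]:
        # Upstream now passes the new keyword on every capture call.
        return [self._dummy_run(8, num_active_loras=n) for n in (1, 2, 4)]


class StaleRunner(BaseRunner):
    # Override written against the old signature.
    def _dummy_run(self, num_tokens: int) -> str:
        return f"stale:{num_tokens}"


class FixedRunner(BaseRunner):
    # Accept the new keyword (with a default), then use or ignore it.
    def _dummy_run(self, num_tokens: int, num_active_loras: int = 0) -> str:
        return f"fixed:{num_tokens}:{num_active_loras}"


error = ""
try:
    StaleRunner().capture()
except TypeError as exc:
    error = str(exc)  # mentions the unexpected keyword 'num_active_loras'

fixed = FixedRunner().capture()
```

Giving the new parameter a default keeps the override callable from code written against either signature.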
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
Signed-off-by: momochenchuw <chenchuw@huawei.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…max loras. Multiple cuda graphs are captured for each num of active-loras. (vllm-project#32005)

Signed-off-by: Yu Gong <yu3.gong@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
MengqingCao pushed a commit to vllm-project/vllm-ascend that referenced this pull request Mar 10, 2026
… vLLM changed. (#6958)

### What this PR does / why we need it?
Fix the LoRA e2e test accuracy issue that was introduced by the upstream PR
vllm-project/vllm#32005

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_llama32_lora.py

- vLLM version: v0.16.0
- vLLM main:
vllm-project/vllm@15d76f7
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Signed-off-by: yupeng <507435917@qq.com>
Nagisa125 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Mar 17, 2026
… vLLM changed. (vllm-project#6958)
Labels

nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

5 participants