Skip to content

[refactor] refactor model runner capture model#5230

Merged
weijinqian0 merged 2 commits intovllm-project:mainfrom
weiguihua2:new_main
Dec 30, 2025
Merged

[refactor] refactor model runner capture model#5230
weijinqian0 merged 2 commits intovllm-project:mainfrom
weiguihua2:new_main

Conversation

@weiguihua2
Copy link
Copy Markdown
Collaborator

@weiguihua2 weiguihua2 commented Dec 22, 2025

What this PR does / why we need it?

Refactor the capture_model method in model_runner to directly reuse the method from vLLM.

Currently, most of the logic in the capture_model method is similar to that in the vllm code. Directly using the vllm method can reduce the maintenance cost of the vllm-ascend code. Modify as follows:
1、refactor capture_model function, directly inheriting community methods
2、refactor initialize_aclgraph_capture function, move to initialize_attn_backend

Does this PR introduce any user-facing change?

No

How was this patch tested?

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the model capturing logic to inherit from the community GPUModelRunner, which simplifies the code and improves maintainability. It also refactors how CUDA graph support is determined for attention backends, moving from a static attribute to a dynamic method. These changes align the Ascend-specific codebase more closely with the upstream vLLM project. My review has identified a critical bug in a new context manager that fails to restore a patched function, and a performance issue in initialization logic where a function is called redundantly in a loop.

I am having trouble creating individual review comments. Click here to see my feedback.

vllm_ascend/worker/model_runner_v1.py (3201-3207)

critical

The context manager _replace_gpu_model_runner_function_wrapper does not correctly restore the original graph_capture function. The finally block re-assigns the patched function instead of restoring the original one. This will lead to a permanent patch, which can cause unexpected behavior in other parts of the codebase that might rely on the original graph_capture from the parent module.

def _replace_gpu_model_runner_function_wrapper(target_module_name):
    target_module = sys.modules[target_module_name]
    original_graph_capture = target_module.graph_capture
    try:
        target_module.graph_capture = graph_capture
        yield
    finally:
        target_module.graph_capture = original_graph_capture

vllm_ascend/worker/model_runner_v1.py (2709-2713)

high

The function get_attn_backends_for_group is called twice for each kv_cache_group_spec. It's called once in the first loop to populate attention_backend_maps and attention_backend_list, and then again in this second loop. This is inefficient. The results from the first loop should be reused here.

        for i, attn_backends_map in enumerate(attention_backend_maps):
            self.attn_groups.append(create_attn_groups(attn_backends_map, i))

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Copy link
Copy Markdown
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Comment thread vllm_ascend/worker/model_runner_v1.py Outdated
Comment thread vllm_ascend/worker/model_runner_v1.py
Comment on lines 2687 to +2710
for i, kv_cache_group_spec in enumerate(
kv_cache_config.kv_cache_groups):
attn_backends = get_attn_backends_for_group( # type: ignore
kv_cache_group_spec)
self.attn_groups.append(create_attn_groups(attn_backends, i))
self.attn_groups.append(create_attn_groups(attn_backends[0], i))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should iterate through attention_backend_maps now, please fix this in another PR.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@weiguihua2 weiguihua2 added ready read for review ready-for-test start test by label for PR labels Dec 23, 2025
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
@weijinqian0 weijinqian0 merged commit 15d73f2 into vllm-project:main Dec 30, 2025
8 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Dec 31, 2025
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (88 commits)
  [1/N] Refactor nightly test structure (vllm-project#5479)
  Docs: Remove deprecated --task parameter for embedding models (vllm-project#5257)
  Revert "moe_gating_top_k" (vllm-project#5512)
  [Doc] Fix issue link for 0.12.0 (vllm-project#5500)
  [CI]update triton ascend version (vllm-project#5392)
  moe_gating_top_k (vllm-project#5271)
  [refactor] refactor model runner capture model (vllm-project#5230)
  Update corresponding vllm commit ID to 12 29 (vllm-project#5475)
  [Kernel]update csrc cmakelist for open-source cann (vllm-project#5458)
  [OP] add custom op aclnnMoeInitRoutingCustom (vllm-project#5251)
  [Refactor][EAGLE] 1/N delete __init__ in mtp_proposer (vllm-project#5176)
  [Refactor][Triton] Move reject sample triton kernels into ops/triton (vllm-project#5324)
  [Feature] support eager mode in model runner v2 (vllm-project#5210)
  [feature] fia support sliding windows (vllm-project#5239)
  Optimize some rejectsampler functions to make npu op launch non-blocking (vllm-project#4587)
  [Feature] Support to use fullgraph with eagle (vllm-project#5118)
  [EPLB][refactor] Modification of the initialization logic for expert_map and log2phy(depend on pr5285) (vllm-project#5311)
  [Refactor]6/N Extract common code of class AscendMLAImpl (vllm-project#5314)
  [Refactor] cache cos/sin in mla & remove parameter model in builder. (vllm-project#5277)
  update vllm pin to 12.27 (vllm-project#5412)
  ...
shenchuxiaofugui pushed a commit to shenchuxiaofugui/vllm-ascend that referenced this pull request Dec 31, 2025
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse
the method from vLLM.

Currently, most of the logic in the capture_model method is similar to
that in the vllm code. Directly using the vllm method can reduce the
maintenance cost of the vllm-ascend code. Modify as follows:
1、refactor capture_model function, directly inheriting community methods
2、refactor initialize_aclgraph_capture function, move to
initialize_attn_backend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
yiz-liu pushed a commit that referenced this pull request Jan 8, 2026
### What this PR does / why we need it?
#5230 this PR introduced a problem when both mtp and full_decode_only
are enabled for the DSV32 model, the operators cannot be compiled into
the graph. This PR fixes that issue.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
guanguan0308 pushed a commit to guanguan0308/vllm-ascend that referenced this pull request Jan 13, 2026
…oject#5679)

### What this PR does / why we need it?
vllm-project#5230 this PR introduced a problem when both mtp and full_decode_only
are enabled for the DSV32 model, the operators cannot be compiled into
the graph. This PR fixes that issue.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
guanguan0308 pushed a commit to guanguan0308/vllm-ascend that referenced this pull request Jan 13, 2026
…oject#5679)

### What this PR does / why we need it?
vllm-project#5230 this PR introduced a problem when both mtp and full_decode_only
are enabled for the DSV32 model, the operators cannot be compiled into
the graph. This PR fixes that issue.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
…oject#5679)

### What this PR does / why we need it?
vllm-project#5230 this PR introduced a problem when both mtp and full_decode_only
are enabled for the DSV32 model, the operators cannot be compiled into
the graph. This PR fixes that issue.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse
the method from vLLM.

Currently, most of the logic in the capture_model method is similar to
that in the vllm code. Directly using the vllm method can reduce the
maintenance cost of the vllm-ascend code. Modify as follows:
1、refactor capture_model function, directly inheriting community methods
2、refactor initialize_aclgraph_capture function, move to
initialize_attn_backend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
…oject#5679)

### What this PR does / why we need it?
vllm-project#5230 this PR introduced a problem when both mtp and full_decode_only
are enabled for the DSV32 model, the operators cannot be compiled into
the graph. This PR fixes that issue.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse
the method from vLLM.

Currently, most of the logic in the capture_model method is similar to
that in the vllm code. Directly using the vllm method can reduce the
maintenance cost of the vllm-ascend code. Modify as follows:
1、refactor capture_model function, directly inheriting community methods
2、refactor initialize_aclgraph_capture function, move to
initialize_attn_backend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
…oject#5679)

### What this PR does / why we need it?
vllm-project#5230 this PR introduced a problem when both mtp and full_decode_only
are enabled for the DSV32 model, the operators cannot be compiled into
the graph. This PR fixes that issue.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse
the method from vLLM.

Currently, most of the logic in the capture_model method is similar to
that in the vllm code. Directly using the vllm method can reduce the
maintenance cost of the vllm-ascend code. Modify as follows:
1、refactor capture_model function, directly inheriting community methods
2、refactor initialize_aclgraph_capture function, move to
initialize_attn_backend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
…oject#5679)

### What this PR does / why we need it?
vllm-project#5230 this PR introduced a problem when both mtp and full_decode_only
are enabled for the DSV32 model, the operators cannot be compiled into
the graph. This PR fixes that issue.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
…oject#5679)

### What this PR does / why we need it?
vllm-project#5230 this PR introduced a problem when both mtp and full_decode_only
are enabled for the DSV32 model, the operators cannot be compiled into
the graph. This PR fixes that issue.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:core module:tests ready read for review ready-for-test start test by label for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants