[refactor] refactor model runner capture model #5230
weijinqian0 merged 2 commits into vllm-project:main
Conversation
Code Review
This pull request refactors the model capturing logic to inherit from the community GPUModelRunner, which simplifies the code and improves maintainability. It also refactors how CUDA graph support is determined for attention backends, moving from a static attribute to a dynamic method. These changes align the Ascend-specific codebase more closely with the upstream vLLM project. My review has identified a critical bug in a new context manager that fails to restore a patched function, and a performance issue in initialization logic where a function is called redundantly in a loop.
vllm_ascend/worker/model_runner_v1.py (3201-3207)
The context manager _replace_gpu_model_runner_function_wrapper does not correctly restore the original graph_capture function. The finally block re-assigns the patched function instead of restoring the original one. This will lead to a permanent patch, which can cause unexpected behavior in other parts of the codebase that might rely on the original graph_capture from the parent module.
Suggested fix (save the original and restore it in `finally`):

    @contextmanager
    def _replace_gpu_model_runner_function_wrapper(target_module_name):
        target_module = sys.modules[target_module_name]
        original_graph_capture = target_module.graph_capture
        try:
            target_module.graph_capture = graph_capture
            yield
        finally:
            target_module.graph_capture = original_graph_capture
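The correct patch-and-restore pattern can be sketched in isolation. The module and function names below are illustrative stand-ins, not the actual vLLM symbols; the point is that the `finally` block must assign the *saved original* back, not the patched function:

```python
import sys
import types
from contextlib import contextmanager

# Stand-in module to patch; the real code looks up the parent
# GPUModelRunner module via sys.modules instead.
demo = types.ModuleType("demo_module")
demo.graph_capture = lambda: "original"
sys.modules["demo_module"] = demo

def patched_graph_capture():
    return "patched"

@contextmanager
def replace_function(target_module_name):
    target_module = sys.modules[target_module_name]
    original = target_module.graph_capture  # save before patching
    try:
        target_module.graph_capture = patched_graph_capture
        yield
    finally:
        # Restore the saved original, NOT the patched function;
        # re-assigning the patch here was the reported bug.
        target_module.graph_capture = original

with replace_function("demo_module"):
    assert demo.graph_capture() == "patched"
assert demo.graph_capture() == "original"  # patch is fully undone
```

With the bug, the second assertion would fail: the module would stay permanently patched after the `with` block exits.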
vllm_ascend/worker/model_runner_v1.py (2709-2713)
The function get_attn_backends_for_group is called twice for each kv_cache_group_spec. It's called once in the first loop to populate attention_backend_maps and attention_backend_list, and then again in this second loop. This is inefficient. The results from the first loop should be reused here.
    for i, attn_backends_map in enumerate(attention_backend_maps):
        self.attn_groups.append(create_attn_groups(attn_backends_map, i))
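The suggested fix amounts to caching the lookup result from the first loop and reusing it in the second. A hypothetical sketch (all names here are illustrative stand-ins, not the real vLLM API):

```python
# Count how often the "expensive" lookup runs.
calls = 0

def get_attn_backends_for_group(spec):
    global calls
    calls += 1  # track invocations of the costly lookup
    return {"group": spec}

kv_cache_groups = ["full_attention", "sliding_window"]

# First loop: exactly one lookup per KV-cache group, results cached.
attention_backend_maps = [
    get_attn_backends_for_group(spec) for spec in kv_cache_groups
]

# Second loop: reuse the cached maps -- no redundant lookups.
attn_groups = [
    (i, backends_map)
    for i, backends_map in enumerate(attention_backend_maps)
]

assert calls == len(kv_cache_groups)  # once per group, not twice
```

Calling the lookup again inside the second loop would double `calls` for no benefit, which is exactly the inefficiency flagged above.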
This pull request has conflicts, please resolve those before we can evaluate the pull request.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
     for i, kv_cache_group_spec in enumerate(
             kv_cache_config.kv_cache_groups):
         attn_backends = get_attn_backends_for_group(  # type: ignore
             kv_cache_group_spec)
-        self.attn_groups.append(create_attn_groups(attn_backends, i))
+        self.attn_groups.append(create_attn_groups(attn_backends[0], i))
We should iterate through attention_backend_maps now; please fix this in another PR.
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (88 commits)
  [1/N] Refactor nightly test structure (vllm-project#5479)
  Docs: Remove deprecated --task parameter for embedding models (vllm-project#5257)
  Revert "moe_gating_top_k" (vllm-project#5512)
  [Doc] Fix issue link for 0.12.0 (vllm-project#5500)
  [CI]update triton ascend version (vllm-project#5392)
  moe_gating_top_k (vllm-project#5271)
  [refactor] refactor model runner capture model (vllm-project#5230)
  Update corresponding vllm commit ID to 12 29 (vllm-project#5475)
  [Kernel]update csrc cmakelist for open-source cann (vllm-project#5458)
  [OP] add custom op aclnnMoeInitRoutingCustom (vllm-project#5251)
  [Refactor][EAGLE] 1/N delete __init__ in mtp_proposer (vllm-project#5176)
  [Refactor][Triton] Move reject sample triton kernels into ops/triton (vllm-project#5324)
  [Feature] support eager mode in model runner v2 (vllm-project#5210)
  [feature] fia support sliding windows (vllm-project#5239)
  Optimize some rejectsampler functions to make npu op launch non-blocking (vllm-project#4587)
  [Feature] Support to use fullgraph with eagle (vllm-project#5118)
  [EPLB][refactor] Modification of the initialization logic for expert_map and log2phy(depend on pr5285) (vllm-project#5311)
  [Refactor]6/N Extract common code of class AscendMLAImpl (vllm-project#5314)
  [Refactor] cache cos/sin in mla & remove parameter model in builder. (vllm-project#5277)
  update vllm pin to 12.27 (vllm-project#5412)
  ...
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse the method from vLLM. Currently, most of the logic in the `capture_model` method is similar to that in the vLLM code; directly using the vLLM method reduces the maintenance cost of the vllm-ascend code. Changes:
1. Refactor the `capture_model` function to directly inherit the community method.
2. Refactor the `initialize_aclgraph_capture` function, moving it into `initialize_attn_backend`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
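The idea of the refactor — inherit the community `capture_model` and override only the device-specific pieces — can be sketched with hypothetical class and hook names (not the actual vLLM class hierarchy):

```python
# Illustrative sketch: the subclass does NOT re-implement the capture
# driver; it inherits it and supplies only a device-specific hook.
class GPUModelRunner:
    def capture_model(self):
        # community implementation: drives capture via a hook method
        return ["begin_capture", self._device_capture(), "end_capture"]

    def _device_capture(self):
        return "cuda-graph"

class NPUModelRunner(GPUModelRunner):
    # capture_model is inherited unchanged; only the hook differs,
    # so upstream fixes to capture_model apply here automatically.
    def _device_capture(self):
        return "acl-graph"

result = NPUModelRunner().capture_model()
assert result == ["begin_capture", "acl-graph", "end_capture"]
```

The maintenance win is that changes to the shared driver no longer need to be mirrored in a near-identical Ascend copy.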
### What this PR does / why we need it?
PR #5230 introduced a problem: when both MTP and full_decode_only are enabled for the DSV32 model, the operators cannot be compiled into the graph. This PR fixes that issue.
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>