[Feat] Support to use fullgraph with eagle #5118
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for using full-graph execution with the Eagle speculative decoding proposer. The changes include refactoring is_mtp_model to the more generic is_draft_model, introducing separate graph parameters for the draft model, and significantly updating EagleProposer to handle graph capture and replay.
My review has identified several critical issues that need to be addressed:
- A bug in `_update_attn_pa_params` where it fails to use draft model graph parameters.
- Missing import and incorrect logic for handling MLA models within `EagleProposer`, which will lead to runtime errors.

Additionally, I've pointed out a high-severity maintainability issue regarding duplicated code for graph parameter selection.
Addressing these points will improve the correctness and robustness of the new feature.
```python
def _update_attn_pa_params(update_stream, forward_context, runtime_shape):
    # TODO: Is pa should be adapted with draft_graph_params?
    graph_params = get_graph_params()
```
The function _update_attn_pa_params does not correctly handle draft models. It unconditionally uses get_graph_params(), which will fetch parameters for the main model. When this function is called for a draft model, it will lead to using incorrect graph parameters, which is a critical bug. The TODO comment you added indicates you might be aware of this. This needs to be fixed by checking forward_context.is_draft_model and calling get_draft_graph_params() accordingly.
```diff
-    graph_params = get_graph_params()
+    if forward_context.is_draft_model:
+        graph_params = get_draft_graph_params()
+    else:
+        graph_params = get_graph_params()
```
```python
if self.vllm_config.model_config.use_mla:
    update_mla_attn_params(
        self.update_stream,
        forward_context,
        num_tokens,
        self.vllm_config.speculative_config,
    )
else:
    update_attn_params(
        self.update_stream,
        forward_context,
        num_tokens,
    )
```
The function call to update_mla_attn_params will raise a NameError because it is not imported in this file. Please add the import. You can import it locally within the if block to keep the scope minimal.
```diff
 if self.vllm_config.model_config.use_mla:
+    from vllm_ascend.compilation.acl_graph import update_mla_attn_params
     update_mla_attn_params(
         self.update_stream,
         forward_context,
         num_tokens,
         self.vllm_config.speculative_config,
     )
 else:
     update_attn_params(
         self.update_stream,
         forward_context,
         num_tokens,
     )
```
```python
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
    # TODO: support mla in future.
    update_attn_params(
        self.update_stream,
        forward_context,
        num_input_tokens,
    )
```
This logic for updating attention parameters in full graph mode does not account for models using MLA (Multi-head Latent Attention). It unconditionally calls update_attn_params, which will lead to incorrect behavior or failures for MLA models. The dummy_run function correctly handles this by checking self.vllm_config.model_config.use_mla and calling update_mla_attn_params when appropriate. The same logic should be applied here. The TODO comment also suggests this is incomplete. Also, update_mla_attn_params needs to be imported.
```diff
 if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
-    # TODO: support mla in future.
-    update_attn_params(
-        self.update_stream,
-        forward_context,
-        num_input_tokens,
-    )
+    if self.vllm_config.model_config.use_mla:
+        from vllm_ascend.compilation.acl_graph import update_mla_attn_params
+        update_mla_attn_params(
+            self.update_stream,
+            forward_context,
+            num_input_tokens,
+            self.vllm_config.speculative_config,
+        )
+    else:
+        update_attn_params(
+            self.update_stream,
+            forward_context,
+            num_input_tokens,
+        )
```
```python
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
    update_attn_params(
        self.update_stream,
        forward_context,
        input_batch_size,
    )
```
Similar to a previous comment, this logic for updating attention parameters in full graph mode does not account for models using MLA (Multi-head Latent Attention). It unconditionally calls update_attn_params, which will lead to incorrect behavior or failures for MLA models. The dummy_run function correctly handles this by checking self.vllm_config.model_config.use_mla and calling update_mla_attn_params when appropriate. The same logic should be applied here.
```diff
 if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
-    update_attn_params(
-        self.update_stream,
-        forward_context,
-        input_batch_size,
-    )
+    if self.vllm_config.model_config.use_mla:
+        from vllm_ascend.compilation.acl_graph import update_mla_attn_params
+        update_mla_attn_params(
+            self.update_stream,
+            forward_context,
+            input_batch_size,
+            self.vllm_config.speculative_config,
+        )
+    else:
+        update_attn_params(
+            self.update_stream,
+            forward_context,
+            input_batch_size,
+        )
```
```python
if forward_context.is_draft_model:
    graph_params = get_draft_graph_params()
else:
    graph_params = get_graph_params()
```
This logic to select graph parameters based on whether the model is a draft model is duplicated in several places across the codebase (e.g., attention_v1.py, mla_v1.py, acl_graph.py). This increases maintenance overhead and the risk of inconsistencies. Consider refactoring this into a helper function in vllm_ascend/compilation/acl_graph.py to promote code reuse and simplify maintenance. A similar helper could be created for updating workspaces.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
```python
for i in range(self.num_speculative_tokens):
    if i > 0 and not in_graph_capturing and aclgraph_runtime_mode == CUDAGraphMode.FULL:
        aclgraph_runtime_mode = CUDAGraphMode.NONE
    with set_ascend_forward_context(
            attn_metadata,
            self.vllm_config,
            num_tokens=num_tokens,
            num_actual_tokens=0,
            in_profile_run=True,
            batch_descriptor=batch_descriptor,
            aclgraph_runtime_mode=aclgraph_runtime_mode,
            is_draft_model=True):
        forward_context = get_forward_context()
        self.model(
            input_ids=self.input_ids[:num_tokens],
            positions=self.positions[:num_tokens],
            hidden_states=self.hidden_states[:num_tokens],
        )
```
Why is this different from propose, which separates i == 0 and i > 0?
Also, it should not be `not in_graph_capturing`; see #5072.
- In propose, num_tokens differs between i == 0 and i > 0, so a smaller graph can be used when i > 0. Separating i == 0 and i > 0 also keeps the code structure consistent with propose in vLLM.
- Fixed.
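The sizing point above can be sketched as follows. This is illustrative only (the helper name and numbers are invented, not the real propose code): the first speculative step consumes the full scheduled-token batch, while later steps only process one new token per request, so a smaller captured graph suffices for i > 0.

```python
# Illustrative sketch, not the real EagleProposer.propose: on step i == 0
# the draft model sees all scheduled tokens; on i > 0 it only sees one new
# token per request, so a smaller graph can be captured and replayed.
def step_graph_size(i, num_scheduled_tokens, num_reqs):
    return num_scheduled_tokens if i == 0 else num_reqs

sizes = [step_graph_size(i, num_scheduled_tokens=256, num_reqs=8)
         for i in range(3)]
print(sizes)  # [256, 8, 8]
```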
```python
def _update_attn_pa_params(update_stream, forward_context, runtime_shape):
    # TODO: Is pa should be adapted with draft_graph_params?
```
Please take a look at full_graph_pa in model runner.
Checked. The model will not run paged attention when it is set as a draft model, because of the using_paged_attention function in vllm_ascend/attention/utils.py.
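A hypothetical reduction of that gate, for illustration only (the real check is using_paged_attention in vllm_ascend/attention/utils.py, and its actual signature and conditions differ): draft-model forward passes never take the paged-attention path, so _update_attn_pa_params is not reached with draft graph params.

```python
# Hypothetical sketch of the gate described above; the real logic lives in
# using_paged_attention in vllm_ascend/attention/utils.py and takes
# different arguments.
def takes_paged_attention_path(full_graph_mode: bool, is_draft_model: bool) -> bool:
    # Draft models are excluded from the paged-attention path.
    return full_graph_mode and not is_draft_model

print(takes_paged_attention_path(True, True))   # False: draft model skips PA
print(takes_paged_attention_path(True, False))  # True
```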
### What this PR does / why we need it?
We support using full graph with Eagle. See #5459.
Change list:
1. Distinguish between processing graph_params and draft_graph_params in attention_v1.
2. Adapt the full-graph mode in eagle_proposer, including:
   1) If full graph is enabled, create a full-graph wrapper when loading the model.
   2) Build new metadata, set the running mode to FULL, and mark the attention update in dummy_run when in full-graph mode.
   3) Fix and fill attn_metadata fields, such as attn_metadata.slot_mapping.
   4) Add a descriptor.
   5) Set the running mode and trigger the metadata update.
3. Rename is_mtp_model to is_draft_model, and add the workspace update.
NOTE:
When async_scheduling=True, the draft model is forced to execute in eager mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e
---------
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>