[Feat] Support to use fullgraph with eagle #5118
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for using full-graph execution with the Eagle speculative decoding proposer. The changes include refactoring is_mtp_model to the more generic is_draft_model, introducing separate graph parameters for the draft model, and significantly updating EagleProposer to handle graph capture and replay.
My review has identified several critical issues that need to be addressed:
- A bug in `_update_attn_pa_params` where it fails to use draft model graph parameters.
- Missing import and incorrect logic for handling MLA models within `EagleProposer`, which will lead to runtime errors.

Additionally, I've pointed out a high-severity maintainability issue regarding duplicated code for graph parameter selection.
Addressing these points will improve the correctness and robustness of the new feature.
```python
def _update_attn_pa_params(update_stream, forward_context, runtime_shape):
    # TODO: Is pa should be adapted with draft_graph_params?
    graph_params = get_graph_params()
```
The function _update_attn_pa_params does not correctly handle draft models. It unconditionally uses get_graph_params(), which will fetch parameters for the main model. When this function is called for a draft model, it will lead to using incorrect graph parameters, which is a critical bug. The TODO comment you added indicates you might be aware of this. This needs to be fixed by checking forward_context.is_draft_model and calling get_draft_graph_params() accordingly.
```diff
-    graph_params = get_graph_params()
+    if forward_context.is_draft_model:
+        graph_params = get_draft_graph_params()
+    else:
+        graph_params = get_graph_params()
```
```python
if self.vllm_config.model_config.use_mla:
    update_mla_attn_params(
        self.update_stream,
        forward_context,
        num_tokens,
        self.vllm_config.speculative_config,
    )
else:
    update_attn_params(
        self.update_stream,
        forward_context,
        num_tokens,
    )
```
The function call to update_mla_attn_params will raise a NameError because it is not imported in this file. Please add the import. You can import it locally within the if block to keep the scope minimal.
```diff
 if self.vllm_config.model_config.use_mla:
+    from vllm_ascend.compilation.acl_graph import update_mla_attn_params
     update_mla_attn_params(
         self.update_stream,
         forward_context,
         num_tokens,
         self.vllm_config.speculative_config,
     )
 else:
     update_attn_params(
         self.update_stream,
         forward_context,
         num_tokens,
     )
```
```python
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
    # TODO: support mla in future.
    update_attn_params(
        self.update_stream,
        forward_context,
        num_input_tokens,
    )
```
This logic for updating attention parameters in full graph mode does not account for models using MLA (Multi-head Latent Attention). It unconditionally calls update_attn_params, which will lead to incorrect behavior or failures for MLA models. The dummy_run function correctly handles this by checking self.vllm_config.model_config.use_mla and calling update_mla_attn_params when appropriate. The same logic should be applied here. The TODO comment also suggests this is incomplete. Also, update_mla_attn_params needs to be imported.
```diff
 if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
-    # TODO: support mla in future.
-    update_attn_params(
-        self.update_stream,
-        forward_context,
-        num_input_tokens,
-    )
+    if self.vllm_config.model_config.use_mla:
+        from vllm_ascend.compilation.acl_graph import update_mla_attn_params
+        update_mla_attn_params(
+            self.update_stream,
+            forward_context,
+            num_input_tokens,
+            self.vllm_config.speculative_config,
+        )
+    else:
+        update_attn_params(
+            self.update_stream,
+            forward_context,
+            num_input_tokens,
+        )
```
```python
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
    update_attn_params(
        self.update_stream,
        forward_context,
        input_batch_size,
    )
```
Similar to a previous comment, this logic for updating attention parameters in full graph mode does not account for models using MLA (Multi-head Latent Attention). It unconditionally calls update_attn_params, which will lead to incorrect behavior or failures for MLA models. The dummy_run function correctly handles this by checking self.vllm_config.model_config.use_mla and calling update_mla_attn_params when appropriate. The same logic should be applied here.
```diff
 if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
-    update_attn_params(
-        self.update_stream,
-        forward_context,
-        input_batch_size,
-    )
+    if self.vllm_config.model_config.use_mla:
+        from vllm_ascend.compilation.acl_graph import update_mla_attn_params
+        update_mla_attn_params(
+            self.update_stream,
+            forward_context,
+            input_batch_size,
+            self.vllm_config.speculative_config,
+        )
+    else:
+        update_attn_params(
+            self.update_stream,
+            forward_context,
+            input_batch_size,
+        )
```
```python
if forward_context.is_draft_model:
    graph_params = get_draft_graph_params()
else:
    graph_params = get_graph_params()
```
This logic to select graph parameters based on whether the model is a draft model is duplicated in several places across the codebase (e.g., attention_v1.py, mla_v1.py, acl_graph.py). This increases maintenance overhead and the risk of inconsistencies. Consider refactoring this into a helper function in vllm_ascend/compilation/acl_graph.py to promote code reuse and simplify maintenance. A similar helper could be created for updating workspaces.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
```python
for i in range(self.num_speculative_tokens):
    if i > 0 and not in_graph_capturing and aclgraph_runtime_mode == CUDAGraphMode.FULL:
        aclgraph_runtime_mode = CUDAGraphMode.NONE
    with set_ascend_forward_context(
            attn_metadata,
            self.vllm_config,
            num_tokens=num_tokens,
            num_actual_tokens=0,
            in_profile_run=True,
            batch_descriptor=batch_descriptor,
            aclgraph_runtime_mode=aclgraph_runtime_mode,
            is_draft_model=True):
        forward_context = get_forward_context()
        self.model(
            input_ids=self.input_ids[:num_tokens],
            positions=self.positions[:num_tokens],
            hidden_states=self.hidden_states[:num_tokens],
        )
```
Why is this different from propose, which separates i == 0 and i > 0?
Also, it should not be `not in_graph_capturing`; see #5072.
- In propose, num_tokens differs between i == 0 and i > 0, so a smaller graph can be used when i > 0. Separating i == 0 and i > 0 also keeps the code structure consistent with propose in vLLM.
- Fixed.
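The sizing point above can be sketched as follows. This is illustrative only (the helper name and numbers are invented, not the real propose code): the first speculative step consumes the full scheduled-token batch, while later steps only process one new token per request, so a smaller captured graph suffices for i > 0.

```python
# Illustrative sketch, not the real EagleProposer.propose: on step i == 0
# the draft model sees all scheduled tokens; on i > 0 it only sees one new
# token per request, so a smaller graph can be captured and replayed.
def step_graph_size(i, num_scheduled_tokens, num_reqs):
    return num_scheduled_tokens if i == 0 else num_reqs

sizes = [step_graph_size(i, num_scheduled_tokens=256, num_reqs=8)
         for i in range(3)]
print(sizes)  # [256, 8, 8]
```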
```python
def _update_attn_pa_params(update_stream, forward_context, runtime_shape):
    # TODO: Is pa should be adapted with draft_graph_params?
```
Please take a look at full_graph_pa in model runner.
Checked. The model will not run paged attention when it is set as a draft model, because of the using_paged_attention function in vllm_ascend/attention/utils.py.
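A hypothetical reduction of that gate, for illustration only (the real check is using_paged_attention in vllm_ascend/attention/utils.py, and its actual signature and conditions differ): draft-model forward passes never take the paged-attention path, so _update_attn_pa_params is not reached with draft graph params.

```python
# Hypothetical sketch of the gate described above; the real logic lives in
# using_paged_attention in vllm_ascend/attention/utils.py and takes
# different arguments.
def takes_paged_attention_path(full_graph_mode: bool, is_draft_model: bool) -> bool:
    # Draft models are excluded from the paged-attention path.
    return full_graph_mode and not is_draft_model

print(takes_paged_attention_path(True, True))   # False: draft model skips PA
print(takes_paged_attention_path(True, False))  # True
```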
### What this PR does / why we need it?
We support using full graph with Eagle. See #5459.
Change list:
1. Distinguish between processing graph_params and draft_graph_params in attention_v1.
2. Adapt the full-graph mode in eagle_proposer, including:
   1) If full graph is enabled, create a full-graph wrapper when loading the model.
   2) Build new metadata, set the running mode to FULL, and mark the attention update in dummy_run when in full-graph mode.
   3) Fix and fill attn_metadata fields, such as attn_metadata.slot_mapping.
   4) Add a descriptor.
   5) Set the running mode and trigger the metadata update.
3. Rename is_mtp_model to is_draft_model, and add the workspace update.
NOTE:
When async_scheduling=True, the draft model is forced to execute in eager mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e
---------
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>