
[Feat] Support to use fullgraph with eagle#5118

Merged
yiz-liu merged 4 commits into vllm-project:main from anon189Ty:eagle_graph_new
Dec 29, 2025

Conversation

@anon189Ty
Contributor

@anon189Ty anon189Ty commented Dec 17, 2025

What this PR does / why we need it?

This PR adds support for full-graph execution with the Eagle proposer. See #5459.

Change list:
1. Distinguish between graph_params and draft_graph_params when processing in attention_v1.
2. Adapt eagle_proposer to full-graph mode:
   1). When full graph is enabled, wrap the draft model in a full-graph wrapper at model load time.
   2). Build new attention metadata, set the running mode to FULL, and mark the attention update in dummy_run when in full-graph mode.
   3). Fix and fill the attn_metadata fields, such as attn_metadata.slot_mapping.
   4). Add a descriptor.
   5). Set the running mode and trigger the metadata update.
3. Rename is_mtp_model to is_draft_model, and add the workspace update.

NOTE:
When async_scheduling=True, the draft model is forced to run in eager mode.
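The NOTE above can be sketched as a small mode-selection rule. This is an illustrative stand-in, not the PR's actual code: CUDAGraphMode here is a minimal re-declaration, and the function name and flags are hypothetical.

```python
from enum import Enum


class CUDAGraphMode(Enum):
    # Minimal stand-in for the real enum; values are illustrative.
    NONE = 0
    FULL = 2


def draft_model_runtime_mode(full_graph_enabled: bool,
                             async_scheduling: bool) -> CUDAGraphMode:
    """Per the NOTE: async scheduling forces the draft model to eager (NONE)."""
    if async_scheduling:
        return CUDAGraphMode.NONE
    return CUDAGraphMode.FULL if full_graph_enabled else CUDAGraphMode.NONE
```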

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for using full-graph execution with the Eagle speculative decoding proposer. The changes include refactoring is_mtp_model to the more generic is_draft_model, introducing separate graph parameters for the draft model, and significantly updating EagleProposer to handle graph capture and replay.

My review has identified several critical issues that need to be addressed:

  • A bug in _update_attn_pa_params where it fails to use draft model graph parameters.
  • Missing import and incorrect logic for handling MLA models within EagleProposer, which will lead to runtime errors.

Additionally, I've pointed out a high-severity maintainability issue regarding duplicated code for graph parameter selection.

Addressing these points will improve the correctness and robustness of the new feature.


```python
def _update_attn_pa_params(update_stream, forward_context, runtime_shape):
    # TODO: Is pa should be adapted with draft_graph_params?
    graph_params = get_graph_params()
```
Contributor


critical

The function _update_attn_pa_params does not correctly handle draft models. It unconditionally uses get_graph_params(), which will fetch parameters for the main model. When this function is called for a draft model, it will lead to using incorrect graph parameters, which is a critical bug. The TODO comment you added indicates you might be aware of this. This needs to be fixed by checking forward_context.is_draft_model and calling get_draft_graph_params() accordingly.

Suggested change

```python
graph_params = get_graph_params()
```

```python
if forward_context.is_draft_model:
    graph_params = get_draft_graph_params()
else:
    graph_params = get_graph_params()
```

Comment on lines +231 to +243
```python
if self.vllm_config.model_config.use_mla:
    update_mla_attn_params(
        self.update_stream,
        forward_context,
        num_tokens,
        self.vllm_config.speculative_config,
    )
else:
    update_attn_params(
        self.update_stream,
        forward_context,
        num_tokens,
    )
```
Contributor


critical

The function call to update_mla_attn_params will raise a NameError because it is not imported in this file. Please add the import. You can import it locally within the if block to keep the scope minimal.

Suggested change

```python
if self.vllm_config.model_config.use_mla:
    update_mla_attn_params(
        self.update_stream,
        forward_context,
        num_tokens,
        self.vllm_config.speculative_config,
    )
else:
    update_attn_params(
        self.update_stream,
        forward_context,
        num_tokens,
    )
```

```python
if self.vllm_config.model_config.use_mla:
    from vllm_ascend.compilation.acl_graph import update_mla_attn_params
    update_mla_attn_params(
        self.update_stream,
        forward_context,
        num_tokens,
        self.vllm_config.speculative_config,
    )
else:
    update_attn_params(
        self.update_stream,
        forward_context,
        num_tokens,
    )
```

Comment on lines +456 to +462
```python
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
    # TODO: support mla in future.
    update_attn_params(
        self.update_stream,
        forward_context,
        num_input_tokens,
    )
```
Contributor


critical

This logic for updating attention parameters in full graph mode does not account for models using MLA (Multi-head Latent Attention). It unconditionally calls update_attn_params, which will lead to incorrect behavior or failures for MLA models. The dummy_run function correctly handles this by checking self.vllm_config.model_config.use_mla and calling update_mla_attn_params when appropriate. The same logic should be applied here. The TODO comment also suggests this is incomplete. Also, update_mla_attn_params needs to be imported.

Suggested change

```python
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
    # TODO: support mla in future.
    update_attn_params(
        self.update_stream,
        forward_context,
        num_input_tokens,
    )
```

```python
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
    if self.vllm_config.model_config.use_mla:
        from vllm_ascend.compilation.acl_graph import update_mla_attn_params
        update_mla_attn_params(
            self.update_stream,
            forward_context,
            num_input_tokens,
            self.vllm_config.speculative_config,
        )
    else:
        update_attn_params(
            self.update_stream,
            forward_context,
            num_input_tokens,
        )
```

Comment on lines +576 to +581
```python
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
    update_attn_params(
        self.update_stream,
        forward_context,
        input_batch_size,
    )
```
Contributor


critical

Similar to a previous comment, this logic for updating attention parameters in full graph mode does not account for models using MLA (Multi-head Latent Attention). It unconditionally calls update_attn_params, which will lead to incorrect behavior or failures for MLA models. The dummy_run function correctly handles this by checking self.vllm_config.model_config.use_mla and calling update_mla_attn_params when appropriate. The same logic should be applied here.

Suggested change

```python
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
    update_attn_params(
        self.update_stream,
        forward_context,
        input_batch_size,
    )
```

```python
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
    if self.vllm_config.model_config.use_mla:
        from vllm_ascend.compilation.acl_graph import update_mla_attn_params
        update_mla_attn_params(
            self.update_stream,
            forward_context,
            input_batch_size,
            self.vllm_config.speculative_config,
        )
    else:
        update_attn_params(
            self.update_stream,
            forward_context,
            input_batch_size,
        )
```

Comment on lines +431 to +434
```python
if forward_context.is_draft_model:
    graph_params = get_draft_graph_params()
else:
    graph_params = get_graph_params()
```
Contributor


high

This logic to select graph parameters based on whether the model is a draft model is duplicated in several places across the codebase (e.g., attention_v1.py, mla_v1.py, acl_graph.py). This increases maintenance overhead and the risk of inconsistencies. Consider refactoring this into a helper function in vllm_ascend/compilation/acl_graph.py to promote code reuse and simplify maintenance. A similar helper could be created for updating workspaces.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@anon189Ty anon189Ty force-pushed the eagle_graph_new branch 2 times, most recently from 0f3b2e2 to e8b52c5 Compare December 17, 2025 16:41
Comment on lines +214 to +231
```python
for i in range(self.num_speculative_tokens):
    if i > 0 and not in_graph_capturing and aclgraph_runtime_mode == CUDAGraphMode.FULL:
        aclgraph_runtime_mode = CUDAGraphMode.NONE
    with set_ascend_forward_context(
            attn_metadata,
            self.vllm_config,
            num_tokens=num_tokens,
            num_actual_tokens=0,
            in_profile_run=True,
            batch_descriptor=batch_descriptor,
            aclgraph_runtime_mode=aclgraph_runtime_mode,
            is_draft_model=True):
        forward_context = get_forward_context()
        self.model(
            input_ids=self.input_ids[:num_tokens],
            positions=self.positions[:num_tokens],
            hidden_states=self.hidden_states[:num_tokens],
        )
```
Collaborator


Why is this different from propose, which separates i=0 and i>0?

Collaborator


Also it should not be not in_graph_capturing, see #5072 .

Contributor Author


  1. In propose, num_tokens differs between i=0 and i>0, so a smaller graph can be used when i>0. Separating i=0 and i>0 also keeps the code structure consistent with propose in vLLM.
  2. Fixed.
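Point 1 can be illustrated with a small sketch: the first propose step runs over all scheduled tokens, while later steps only need one token per request, so a smaller captured graph can be replayed for i > 0. The function name and signature are hypothetical.

```python
def propose_step_sizes(num_tokens: int, batch_size: int,
                       num_speculative_tokens: int) -> list[int]:
    """Token counts per speculative step: full size at i=0, one token
    per request (batch_size) for every later step."""
    return [num_tokens if i == 0 else batch_size
            for i in range(num_speculative_tokens)]
```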

Comment thread vllm_ascend/compilation/acl_graph.py Outdated


```python
def _update_attn_pa_params(update_stream, forward_context, runtime_shape):
    # TODO: Is pa should be adapted with draft_graph_params?
```
Collaborator


Please take a look at full_graph_pa in model runner.

Contributor Author


Checked. The model will not run paged attention for the draft model because of the using_paged_attention function in vllm_ascend/attention/utils.py.
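A minimal sketch of the guard described here, under assumptions: the real using_paged_attention in vllm_ascend/attention/utils.py may take different arguments, and the string-valued runtime mode is a stand-in for the actual enum.

```python
def using_paged_attention(runtime_mode: str, is_draft_model: bool) -> bool:
    """The draft model never takes the paged-attention path, so the
    paged-attention update only ever sees main-model graph params."""
    return runtime_mode == "FULL" and not is_draft_model
```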


@anon189Ty anon189Ty force-pushed the eagle_graph_new branch 2 times, most recently from a179858 to fbfefa0 Compare December 23, 2025 11:42
@anon189Ty anon189Ty force-pushed the eagle_graph_new branch 2 times, most recently from cf33ae5 to e2e0c41 Compare December 24, 2025 02:36

@MengqingCao MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels Dec 24, 2025

anon189Ty and others added 3 commits December 28, 2025 16:02
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
…params

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
@yiz-liu yiz-liu merged commit 3e67e82 into vllm-project:main Dec 29, 2025
8 of 10 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Dec 29, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 29, 2025
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Dec 31, 2025
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026

Labels

module:core, module:tests, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants