[Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in eagle.py #31714

Merged: tjtanaa merged 4 commits into vllm-project:main from EmbeddedLLM:fix-rocm-eagle on Jan 6, 2026

Conversation

@vllmellm (Contributor) commented on Jan 5, 2026

Purpose

This PR fixes the issue mentioned in #30811 (comment).

Test Plan

Only tested with VLLM_ATTENTION_BACKEND=ROCM_ATTN.

  1. Run vllm serve:
     VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=0 VLLM_ATTENTION_BACKEND=ROCM_ATTN vllm serve meta-llama/Llama-3.3-70B-Instruct -tp 4 --speculative-config '{"model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "num_speculative_tokens": 3, "method":"eagle3", "draft_tensor_parallel_size":1}'
  2. Run lm_eval:
     lm_eval --model local-completions --tasks gsm8k --model_args model=meta-llama/Llama-3.3-70B-Instruct,base_url=http://localhost:8000/v1/completions --trust_remote_code --num_fewshot 5 --batch_size 128

Test Result

Tasks  Version  Filter            n-shot  Metric       Value    Stderr
gsm8k  3        flexible-extract  5       exact_match  0.9333   ± 0.0144
                strict-match      5       exact_match  0.9033   ± 0.0171

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a ValueError that occurs during speculative decoding on ROCm when using the ROCM_ATTN backend. The fix correctly includes RocmAttentionMetadata in the list of allowed attention metadata types. My review includes a suggestion to make this fix more comprehensive by also adding support for the ROCM_AITER_UNIFIED_ATTN backend, which appears to have been overlooked and would cause a similar issue.

Comment on lines +171 to +177
    from vllm.v1.attention.backends.rocm_attn import RocmAttentionMetadata

    rocm_types = [
        TritonAttentionMetadata,
        FlashAttentionMetadata,
        RocmAttentionMetadata,
    ]
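For context, the failure mode this PR addresses can be illustrated with a small self-contained sketch. The classes and the `validate_attn_metadata` helper below are placeholders, not the actual vLLM source: the point is that eagle.py checks the attention metadata against an allowlist and raises a ValueError for anything else, so a backend whose metadata class is missing from the list fails even though it would otherwise work.

```python
# Hedged sketch (placeholder classes, not the real vLLM implementation):
# an allowlist-style metadata check like the one that raised the
# "Unsupported attention metadata type" error before this fix.

class TritonAttentionMetadata: ...
class FlashAttentionMetadata: ...
class RocmAttentionMetadata: ...
class SomeOtherMetadata: ...

# Before the fix, RocmAttentionMetadata was missing from this allowlist,
# so the ROCM_ATTN backend tripped the error during speculative decoding.
rocm_types = (
    TritonAttentionMetadata,
    FlashAttentionMetadata,
    RocmAttentionMetadata,  # the entry this PR adds
)

def validate_attn_metadata(metadata: object) -> None:
    if not isinstance(metadata, rocm_types):
        raise ValueError(
            f"Unsupported attention metadata type {type(metadata).__name__}"
        )

validate_attn_metadata(RocmAttentionMetadata())  # accepted after the fix
try:
    validate_attn_metadata(SomeOtherMetadata())
except ValueError as exc:
    print(exc)  # unlisted metadata types are still rejected
```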

Severity: high

While this change correctly adds support for RocmAttentionMetadata, it appears that another ROCm attention backend, ROCM_AITER_UNIFIED_ATTN, might have been missed. This backend likely uses RocmAiterUnifiedAttentionMetadata, which is not included in rocm_types and could lead to a similar ValueError during speculative decoding. To ensure a more comprehensive fix, I recommend adding RocmAiterUnifiedAttentionMetadata to the list of allowed types.

            from vllm.v1.attention.backends.rocm_attn import RocmAttentionMetadata
            from vllm.v1.attention.backends.rocm_aiter_unified_attn import RocmAiterUnifiedAttentionMetadata

            rocm_types = [
                TritonAttentionMetadata,
                FlashAttentionMetadata,
                RocmAttentionMetadata,
                RocmAiterUnifiedAttentionMetadata,
            ]

@vllmellm (Contributor, Author) replied on Jan 5, 2026

RocmAiterUnifiedAttentionMetadata does not exist. RocmAiterUnifiedAttentionBackend uses the same metadata class, RocmAttentionMetadata, and the lm_eval scores are as follows:

Tasks  Version  Filter            n-shot  Metric       Value    Stderr
gsm8k  3        flexible-extract  5       exact_match  0.9333   ± 0.0144
                strict-match      5       exact_match  0.9133   ± 0.0163

The vllm serve command used:

VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ATTENTION_BACKEND=ROCM_AITER_UNIFIED_ATTN vllm serve meta-llama/Llama-3.3-70B-Instruct -tp 4 --speculative-config '{"model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "num_speculative_tokens": 3, "method":"eagle3", "draft_tensor_parallel_size":1}'
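The shared-metadata point above can be sketched with placeholder classes; the `get_metadata_cls` hook is modeled on vLLM's backend/metadata split but the code here is illustrative, not the real source. Because both ROCm backends report the same metadata class, a single RocmAttentionMetadata entry in the allowlist covers ROCM_ATTN and ROCM_AITER_UNIFIED_ATTN alike.

```python
# Placeholder sketch (assumed names, not the real vLLM source): two backends
# sharing one metadata class means one allowlist entry suffices for both.

class RocmAttentionMetadata:
    """Stand-in for the shared per-step attention metadata."""

class RocmAttentionBackend:
    @staticmethod
    def get_metadata_cls() -> type:
        return RocmAttentionMetadata

class RocmAiterUnifiedAttentionBackend:
    # Reuses RocmAttentionMetadata instead of defining its own metadata type,
    # which is why no RocmAiterUnifiedAttentionMetadata entry is needed.
    @staticmethod
    def get_metadata_cls() -> type:
        return RocmAttentionMetadata

rocm_types = (RocmAttentionMetadata,)  # one entry covers both backends

for backend in (RocmAttentionBackend, RocmAiterUnifiedAttentionBackend):
    metadata = backend.get_metadata_cls()()
    assert isinstance(metadata, rocm_types)
```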


    rocm_types = [
        TritonAttentionMetadata,
        FlashAttentionMetadata,

A Collaborator commented:

@vllmellm is FlashAttentionMetadata still relevant for the ROCm platform?

@vllmellm (Contributor, Author) replied:

Removed.

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
@tjtanaa (Collaborator) left a comment

LGTM. Thanks for cleaning up the conditions.

@tjtanaa added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) on Jan 6, 2026
@tjtanaa enabled auto-merge (squash) on January 6, 2026 04:29
@tjtanaa tjtanaa merged commit e971780 into vllm-project:main Jan 6, 2026
45 checks passed
LucasWilkinson pushed a commit to neuralmagic/vllm that referenced this pull request Jan 6, 2026
…e decoding in `eagle.py` (vllm-project#31714)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
…e decoding in `eagle.py` (vllm-project#31714)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
…e decoding in `eagle.py` (vllm-project#31714)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…e decoding in `eagle.py` (vllm-project#31714)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…e decoding in `eagle.py` (vllm-project#31714)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

Labels

  • ready: ONLY add when PR is ready to merge/full CI is needed
  • rocm: Related to AMD ROCm
  • speculative-decoding
  • v1
