[Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in eagle.py#31714
Conversation
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Code Review
This pull request addresses a ValueError raised during speculative decoding on ROCm when the ROCM_ATTN backend is used. The fix correctly adds RocmAttentionMetadata to the list of allowed attention metadata types. My review includes a suggestion to make the fix more comprehensive by also covering the ROCM_AITER_UNIFIED_ATTN backend, which appears to have been overlooked and would hit a similar error.
```python
from vllm.v1.attention.backends.rocm_attn import RocmAttentionMetadata

rocm_types = [
    TritonAttentionMetadata,
    FlashAttentionMetadata,
    RocmAttentionMetadata,
]
```
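The check the PR repairs can be sketched as follows. This is a hedged, self-contained sketch, not vLLM's actual code: the metadata classes are stubbed out and the function name `check_rocm_attn_metadata` is a hypothetical stand-in for the type check inside `eagle.py`.

```python
# Illustrative stubs; the real classes live in vLLM's attention backends
# and carry actual metadata fields.
class TritonAttentionMetadata: ...
class FlashAttentionMetadata: ...
class RocmAttentionMetadata: ...


def check_rocm_attn_metadata(attn_metadata) -> None:
    """Reject attention metadata types the EAGLE path cannot handle.

    Before this fix, RocmAttentionMetadata was missing from the allowed
    tuple, so ROCM_ATTN failed with this ValueError during speculative
    decoding.
    """
    rocm_types = (
        TritonAttentionMetadata,
        FlashAttentionMetadata,
        RocmAttentionMetadata,
    )
    if not isinstance(attn_metadata, rocm_types):
        raise ValueError(
            f"Unsupported attention metadata type: {type(attn_metadata).__name__}"
        )


# With the fix applied, ROCm metadata passes the check.
check_rocm_attn_metadata(RocmAttentionMetadata())
```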
While this change correctly adds support for RocmAttentionMetadata, it appears that another ROCm attention backend, ROCM_AITER_UNIFIED_ATTN, might have been missed. This backend likely uses RocmAiterUnifiedAttentionMetadata, which is not included in rocm_types and could lead to a similar ValueError during speculative decoding. To ensure a more comprehensive fix, I recommend adding RocmAiterUnifiedAttentionMetadata to the list of allowed types.
```python
from vllm.v1.attention.backends.rocm_attn import RocmAttentionMetadata
from vllm.v1.attention.backends.rocm_aiter_unified_attn import RocmAiterUnifiedAttentionMetadata

rocm_types = [
    TritonAttentionMetadata,
    FlashAttentionMetadata,
    RocmAttentionMetadata,
    RocmAiterUnifiedAttentionMetadata,
]
```
RocmAiterUnifiedAttentionMetadata does not exist. RocmAiterUnifiedAttentionBackend shares the same metadata class, RocmAttentionMetadata, and the lm_eval scores are as follows:
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.9333 | ± 0.0144 |
| | | strict-match | 5 | exact_match ↑ | 0.9133 | ± 0.0163 |
`vllm serve` command used:

```shell
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_ATTENTION_BACKEND=ROCM_AITER_UNIFIED_ATTN \
vllm serve meta-llama/Llama-3.3-70B-Instruct -tp 4 \
  --speculative-config '{"model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "num_speculative_tokens": 3, "method": "eagle3", "draft_tensor_parallel_size": 1}'
```
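The point about shared metadata can be sketched as follows. The class and method names here are illustrative stubs, not vLLM's exact definitions: the AITER unified backend simply reports RocmAttentionMetadata as its metadata class, so the existing `RocmAttentionMetadata` entry in `rocm_types` already covers it and no separate metadata class is needed.

```python
# Illustrative stubs only (hypothetical shapes, not vLLM's real API).
class RocmAttentionMetadata: ...


class RocmAttentionBackend:
    @staticmethod
    def get_metadata_cls():
        # Both ROCm backends advertise the same metadata class.
        return RocmAttentionMetadata


class RocmAiterUnifiedAttentionBackend(RocmAttentionBackend):
    # No override: the AITER unified backend reuses RocmAttentionMetadata,
    # which is why RocmAiterUnifiedAttentionMetadata does not exist.
    pass
```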
vllm/v1/spec_decode/eagle.py (outdated diff):

```python
rocm_types = [
    TritonAttentionMetadata,
    FlashAttentionMetadata,
```
@vllmellm is FlashAttentionMetadata still relevant for ROCm platform?
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
tjtanaa
left a comment
LGTM. Thanks for cleaning up the conditions.
…e decoding in `eagle.py` (vllm-project#31714) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
…e decoding in `eagle.py` (vllm-project#31714) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
This PR fixes the issue mentioned in #30811 (comment).
Test Plan
Only `VLLM_ATTENTION_BACKEND=ROCM_ATTN` was tested:

```shell
VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=0 VLLM_ATTENTION_BACKEND=ROCM_ATTN \
vllm serve meta-llama/Llama-3.3-70B-Instruct -tp 4 \
  --speculative-config '{"model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "num_speculative_tokens": 3, "method": "eagle3", "draft_tensor_parallel_size": 1}'
```

```shell
lm_eval --model local-completions --tasks gsm8k \
  --model_args model=meta-llama/Llama-3.3-70B-Instruct,base_url=http://localhost:8000/v1/completions \
  --trust_remote_code --num_fewshot 5 --batch_size 128
```

Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.