[Bugfix][ROCm] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling#32303
micah-wil wants to merge 3 commits into vllm-project:main
Conversation
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Code Review
This pull request correctly addresses a bug causing embedding lookup failures on certain hardware by replacing -1 placeholder tokens with 0. This is a good fix. I've also identified a related potential issue where these -1 placeholders could cause a crash during penalty calculations if async_scheduling is disabled. I've left a comment with a suggestion to make the fix more robust.
```python
safe_spec_token_ids = [tok if tok != -1 else 0 for tok in spec_token_ids]
self.token_ids_cpu[req_index, start_index:end_token_index] = safe_spec_token_ids
```
This is a great fix for the embedding lookup error. However, there's a related issue on the next line.
`cur_spec_token_ids` is being extended with the original `spec_token_ids`, which can contain -1 placeholders. This can cause a `RuntimeError` from `torch.bincount` when penalties (like frequency or presence penalty) are applied, as `bincount` does not support negative indices. This would happen when `async_scheduling` is `False` and penalties are enabled.
To prevent this potential crash, you should also use safe_spec_token_ids to extend cur_spec_token_ids.
Suggested change for line 469:

```python
cur_spec_token_ids.extend(safe_spec_token_ids)
```
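The suggested sanitization can be sketched end-to-end (a pure-Python sketch; `spec_token_ids` and `safe_spec_token_ids` follow the diff above, while the surrounding batch state is a hypothetical stand-in):

```python
def sanitize_spec_tokens(spec_token_ids, pad_token_id=0):
    """Replace -1 rejection placeholders with a valid token id so that
    neither the embedding gather nor bincount-style penalty counting
    ever sees a negative index.
    """
    return [tok if tok != -1 else pad_token_id for tok in spec_token_ids]

spec_token_ids = [17, 42, -1, -1]   # last two draft tokens were rejected
safe_spec_token_ids = sanitize_spec_tokens(spec_token_ids)

# Both consumers should receive the sanitized list:
token_ids_cpu_row = safe_spec_token_ids          # stand-in for the token_ids_cpu write
cur_spec_token_ids = []
cur_spec_token_ids.extend(safe_spec_token_ids)   # per the review suggestion

assert safe_spec_token_ids == [17, 42, 0, 0]
assert min(cur_spec_token_ids) >= 0
```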
@micah-wil thanks for the detailed analysis! I haven't dug into exactly what is happening here yet, but I think what's missing from what you summarized:
vllm/vllm/v1/worker/gpu_model_runner.py (line 1302 in a884bc6)

I guess either there's some circumstance where that is not happening, or it's actually a different root cause. I wonder if it could be somehow related to #30618 (only a vague thought, not sure about that at all!)
Thanks for taking a look. You brought up a good point about
…ed_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9] (vllm-project#32355)" This reverts commit 773d707. Signed-off-by: Micah Williamson <micah.williamson@amd.com>
They happen on the same stream, so they should be sequential from the device's point of view.
Hey @njhill, sorry for the delay. This actually passes with a newer ROCm version, so it looks like the bug isn't coming from vLLM. Going to close this and will revert the workaround in
Problem

The issue was exposed in the V1 Test entrypoints test group in AMD CI after #31998 enabled async scheduling by default with spec decoding. The test group has been failing ever since that PR was merged (e.g. in build #2803). It passes if you set `async_scheduling=False`. The failure can be reproduced with this command:

```
pytest -v -s "tests/v1/entrypoints/llm/test_struct_output_generate.py::test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9]"
```

and the resulting error presents as the following:

V1 Test entrypoints test_structured_output error
Root Cause

The issue is ultimately caused by using -1 to represent invalid spec tokens:

vllm/vllm/v1/core/sched/scheduler.py (line 1435 in 90c0836)

Without async scheduling, `token_ids_cpu` is populated with valid token IDs before being copied to the GPU. With async scheduling, the -1 placeholders can propagate to the embedding layer:

1. `update_draft_token_ids_in_output` pads rejected tokens with -1.
2. `update_req_spec_token_ids` writes `spec_token_ids`, which includes -1 values, to `token_ids_cpu` (vllm/vllm/v1/worker/gpu_input_batch.py, line 466 in 90c0836).
3. `_prepare_input_ids` copies the tokens to the GPU.
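The propagation path above can be illustrated with a toy stand-in (pure Python; all names and sizes here are hypothetical, and the real failure occurs inside the ROCm embedding kernel rather than in Python):

```python
# Toy model of the failure: an "embedding table" indexed by token id.
# A -1 placeholder is not a legal row for an embedding gather, even
# though Python's own negative indexing would silently wrap around.
VOCAB_SIZE = 8
embedding_table = [[float(i), float(i)] for i in range(VOCAB_SIZE)]

def embedding_lookup(token_ids):
    """Mimic a strict embedding kernel: reject out-of-range token ids."""
    for tok in token_ids:
        if not 0 <= tok < VOCAB_SIZE:
            raise IndexError(f"token id {tok} is out of range for the embedding table")
    return [embedding_table[tok] for tok in token_ids]

# Sync path: placeholders are overwritten with real ids before the copy,
# so the lookup only ever sees valid ids.
embedding_lookup([3, 5, 2])

# Async path: the copy can run before the placeholders are replaced,
# so -1 reaches the lookup and the gather fails.
try:
    embedding_lookup([3, 5, -1, -1])
except IndexError as exc:
    print("lookup failed:", exc)
```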
The issue doesn't seem to manifest on CUDA. I'm not exactly sure why; it could be a timing thing, or maybe the embedding kernel handles -1 values, but either way I believe the solution here is safest.

Solution
Get rid of the -1 placeholders before copying tokens to the GPU by replacing them with zero.

With this fix, I am seeing the following when running `pytest -v -s "tests/v1/entrypoints/llm/test_struct_output_generate.py::test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9]"`: