
[Bugfix][ROCm] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling #32303

Closed
micah-wil wants to merge 3 commits into vllm-project:main from ROCm:micah/spec-dec-async-sched

Conversation


@micah-wil micah-wil commented Jan 14, 2026

Problem

The issue was exposed in the V1 Test entrypoints test group in AMD CI after #31998 enabled async scheduling by default with spec decoding. The test group has been failing ever since that PR was merged (e.g. in build #2803). It passes if you set async_scheduling=False. The failure can be reproduced with this command:
pytest -v -s tests/v1/entrypoints/llm/test_struct_output_generate.py::test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9]
The resulting error presents as follows:

V1 Test entrypoints test_structured_output error
:0:rocdevice.cpp            :3675: 4871230665497 us:  Callback: Queue 0x7ebfec600000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.14.0rc2.dev22+g69f8a0ea3.d20260113) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=SpeculativeConfig(method='eagle', model='yuhuili/EAGLE-LLaMA3.1-Instruct-8B', num_spec_tokens=5), tokenizer='meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=True, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=120, served_model_name=meta-llama/Meta-Llama-3.1-8B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 
'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}, 
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['12-b282e46b'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={'12-b282e46b': 1020},new_block_ids=[None],num_computed_tokens=[1019],num_output_tokens=[782]), num_scheduled_tokens={12-b282e46b: 4}, total_num_scheduled_tokens=4, scheduled_spec_decode_tokens={12-b282e46b: [-1, -1, -1]}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[64], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=true, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938] Traceback (most recent call last):
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/engine/core.py", line 929, in run_engine_core
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/engine/core.py", line 956, in run_busy_loop
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     self._process_engine_step()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/engine/core.py", line 989, in _process_engine_step
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/engine/core.py", line 487, in step_with_batch_queue
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     model_output = future.result()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return self.__get_result()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     raise self._exception
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/worker/gpu_worker.py", line 578, in sample_tokens
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return self.model_runner.sample_tokens(grammar_output)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/worker/gpu_model_runner.py", line 3453, in sample_tokens
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     self.drafter.prepare_next_token_ids_padded(
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/spec_decode/eagle.py", line 627, in prepare_next_token_ids_padded
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     eagle_prepare_next_token_padded_kernel[grid](
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 623, in run
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     ^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 467, in __getattribute__
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     self._init_handles()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 457, in _init_handles
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938] RuntimeError: Triton Error [HIP]:  Code: 719, Messsage: unspecified launch failure
(EngineCore_DP0 pid=414815) Process EngineCore_DP0:
(EngineCore_DP0 pid=414815) Traceback (most recent call last):
(EngineCore_DP0 pid=414815)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=414815)     self.run()
(EngineCore_DP0 pid=414815)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=414815)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/engine/core.py", line 940, in run_engine_core
(EngineCore_DP0 pid=414815)     raise e
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/engine/core.py", line 929, in run_engine_core
(EngineCore_DP0 pid=414815)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/engine/core.py", line 956, in run_busy_loop
(EngineCore_DP0 pid=414815)     self._process_engine_step()
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/engine/core.py", line 989, in _process_engine_step
(EngineCore_DP0 pid=414815)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=414815)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/engine/core.py", line 487, in step_with_batch_queue
(EngineCore_DP0 pid=414815)     model_output = future.result()
(EngineCore_DP0 pid=414815)                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=414815)     return self.__get_result()
(EngineCore_DP0 pid=414815)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=414815)     raise self._exception
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=414815)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=414815)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=414815)     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=414815)     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/worker/gpu_worker.py", line 578, in sample_tokens
(EngineCore_DP0 pid=414815)     return self.model_runner.sample_tokens(grammar_output)
(EngineCore_DP0 pid=414815)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=414815)     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/worker/gpu_model_runner.py", line 3453, in sample_tokens
(EngineCore_DP0 pid=414815)     self.drafter.prepare_next_token_ids_padded(
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/spec_decode/eagle.py", line 627, in prepare_next_token_ids_padded
(EngineCore_DP0 pid=414815)     eagle_prepare_next_token_padded_kernel[grid](
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore_DP0 pid=414815)     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=414815)                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 623, in run
(EngineCore_DP0 pid=414815)     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
(EngineCore_DP0 pid=414815)     ^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 467, in __getattribute__
(EngineCore_DP0 pid=414815)     self._init_handles()
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 457, in _init_handles
(EngineCore_DP0 pid=414815)     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(EngineCore_DP0 pid=414815)                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) RuntimeError: Triton Error [HIP]:  Code: 719, Messsage: unspecified launch failure
Kernel Name: _ZN2at6native12_GLOBAL__N_121indexSelectSmallIndexIN3c108BFloat16EljLi2ELi2ELin2EEEvNS_4cuda6detail10TensorInfoIT_T1_EENS7_IKS8_S9_EENS7_IKT0_S9_EEiiS9_l
VGPU=0x4fe7b760 SWq=0x7ec026fa0000, HWq=0x7ebfec600000, id=1
	Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=0
	grid=[4096, 1, 1], workgroup=[512, 1, 1]
	private_seg_size=0, group_seg_size=0
	kernel_obj=0x7ebdc98ad540, kernarg_address=0x0x7ebfec2c1580
	completion_signal=0x0, correlation_id=0
	rptr=239701, wptr=240043

Root Cause

The issue is ultimately caused by using -1 to represent invalid spec tokens:

spec_token_ids.extend([-1] * num_invalid_tokens)

Without async scheduling, token_ids_cpu is populated with valid token IDs before being copied to the GPU. With async scheduling, the -1 placeholders can propagate to the embedding layer.
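To illustrate why a -1 placeholder reaching an embedding lookup is dangerous, here is a toy bounds-checked lookup in plain Python (the table and helper names are invented for illustration; a real GPU embedding kernel indexes device memory, where an out-of-range id is undefined behavior rather than a clean exception):

```python
def embedding_lookup(table: list[list[float]], token_ids: list[int]) -> list[list[float]]:
    """Toy embedding lookup enforcing the 0 <= id < vocab_size contract
    that a GPU embedding kernel implicitly relies on."""
    rows = []
    for tok in token_ids:
        if not 0 <= tok < len(table):
            raise IndexError(f"token id {tok} out of range [0, {len(table)})")
        rows.append(table[tok])
    return rows

vocab = [[float(i)] for i in range(8)]  # tiny 8-entry "embedding table"
print(embedding_lookup(vocab, [1, 2, 3]))  # valid ids resolve normally

try:
    embedding_lookup(vocab, [1, -1, -1])  # -1 placeholders are out of range
except IndexError as e:
    print("lookup failed:", e)
```

On the GPU there is no such Python-level check; the out-of-range access surfaces as the hardware exception shown in the log above.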

1. update_draft_token_ids_in_output pads rejected tokens with -1.
2. update_req_spec_token_ids writes spec_token_ids, which includes the -1 values, to token_ids_cpu:

   self.token_ids_cpu[req_index, start_index:end_token_index] = spec_token_ids

3. _prepare_input_ids copies the tokens to the GPU.

The issue doesn't seem to manifest on CUDA. I'm not exactly sure why; it could be a timing difference, or the CUDA embedding kernel may tolerate -1 values. Either way, I believe the solution here is the safest.

Solution
Get rid of the -1 placeholders before copying tokens to the GPU by replacing them with zero.
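As a sketch of the idea (illustrative only, not the actual patch; the helper name is invented, and `pad_id` stands in for the zero token used as a safe placeholder):

```python
def sanitize_spec_token_ids(spec_token_ids: list[int], pad_id: int = 0) -> list[int]:
    """Replace -1 invalid-spec-token placeholders with a safe pad id
    before the token ids are written to token_ids_cpu / copied to the GPU."""
    return [tok if tok != -1 else pad_id for tok in spec_token_ids]

print(sanitize_spec_token_ids([123, 456, -1, -1, -1]))  # → [123, 456, 0, 0, 0]
```

The replaced positions are rejected draft slots, so the value only needs to be a valid embedding index; it never influences sampled output.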

With this fix, I am seeing the following when running pytest -v -s tests/v1/entrypoints/llm/test_struct_output_generate.py::test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9]:

======================== 1 passed, 3 warnings in 35.51s ========================

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@micah-wil micah-wil changed the title [Bugfix] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling [Bugfix][ROCm] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling Jan 14, 2026
@mergify mergify bot added rocm Related to AMD ROCm v1 bug Something isn't working labels Jan 14, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a bug causing embedding lookup failures on certain hardware by replacing -1 placeholder tokens with 0. This is a good fix. I've also identified a related potential issue where these -1 placeholders could cause a crash during penalty calculations if async_scheduling is disabled. I've left a comment with a suggestion to make the fix more robust.

Comment on lines +467 to +468
safe_spec_token_ids = [tok if tok != -1 else 0 for tok in spec_token_ids]
self.token_ids_cpu[req_index, start_index:end_token_index] = safe_spec_token_ids

Severity: high

This is a great fix for the embedding lookup error. However, there's a related issue on the next line.

cur_spec_token_ids is being extended with the original spec_token_ids, which can contain -1 placeholders. This can cause a RuntimeError from torch.bincount when penalties (like frequency or presence penalty) are applied, as bincount does not support negative indices. This would happen when async_scheduling is False and penalties are enabled.

To prevent this potential crash, you should also use safe_spec_token_ids to extend cur_spec_token_ids.

Suggested change for line 469:

cur_spec_token_ids.extend(safe_spec_token_ids)
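For illustration, a pure-Python stand-in shows how -1 placeholders break a bincount-style frequency count while the sanitized list does not (sketch only; the real penalty path uses torch.bincount, which raises a RuntimeError on negative inputs):

```python
def bincount(values: list[int]) -> list[int]:
    """Pure-Python stand-in for torch.bincount: counts occurrences of each
    non-negative int and, like torch, rejects negative inputs."""
    if any(v < 0 for v in values):
        raise ValueError("bincount only supports non-negative values")
    counts = [0] * (max(values, default=-1) + 1)
    for v in values:
        counts[v] += 1
    return counts

spec_token_ids = [42, 7, -1, -1]
safe_spec_token_ids = [tok if tok != -1 else 0 for tok in spec_token_ids]

try:
    bincount(spec_token_ids)  # -1 placeholders present: counting fails
except ValueError as e:
    print("penalty path would crash:", e)

print(bincount(safe_spec_token_ids))  # sanitized ids count cleanly
```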


njhill commented Jan 16, 2026

@micah-wil thanks for the detailed analysis! I haven't dug into exactly what is happening here yet, but I think what's missing from what you summarized:

_prepare_input_ids does copy these -1's to the GPU; however, in this case it should subsequently overwrite them with "valid" draft token ids here:

self.input_ids.gpu.scatter_(

I guess either there's some circumstance where that is not happening or it's actually a different root cause.

I wonder if it could be somehow related to #30618 (only a vague thought, not sure about that at all!)
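The overwrite described above can be mimicked in plain Python (a toy sketch with invented names; the real code calls torch.Tensor.scatter_ on the GPU input_ids buffer, on the same stream as the copy):

```python
def scatter_inplace(dest: list[int], indices: list[int], src: list[int]) -> None:
    """Toy stand-in for torch.Tensor.scatter_: dest[indices[i]] = src[i]."""
    for i, idx in enumerate(indices):
        dest[idx] = src[i]

input_ids = [11, 22, -1, -1, -1]                  # -1 placeholders after the CPU->GPU copy
scatter_inplace(input_ids, [2, 3, 4], [7, 8, 9])  # overwrite with valid draft token ids
print(input_ids)  # → [11, 22, 7, 8, 9]
```

If the scatter reliably runs before any kernel reads those positions, no -1 should ever reach the embedding, which is why the observed failure suggested either a missed overwrite or a different root cause.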

@micah-wil

> @micah-wil thanks for the detailed analysis! I haven't dug into exactly what is happening here yet, but I think what's missing from what you summarized:
>
> _prepare_input_ids does copy these -1's to the GPU, however in this case it should subsequently overwrite them with "valid" draft token ids here:
>
> self.input_ids.gpu.scatter_(
>
> I guess either there's some circumstance where that is not happening or it's actually a different root cause.
>
> I wonder if it could be somehow related to #30618 (only a vague thought, not sure about that at all!)

Thanks for taking a look. You brought up a good point about scatter_; I'd come across that when looking for the root cause of this issue. I had thought that maybe there was somehow a race condition between scatter_ and copy_to_gpu that happened to run correctly on CUDA but not ROCm for whatever reason, but that really shouldn't be the case now that I'm looking again. I will collect a trace to see what is really going on there.

…ed_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9] (vllm-project#32355)"

This reverts commit 773d707.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

njhill commented Jan 17, 2026

> I had thought that maybe somehow there's a race condition between scatter_ and copy_to_gpu that happens to run correctly on CUDA but not ROCm for whatever reason

They happen on the same stream so should be sequential from device pov.

@micah-wil

@njhill Not sure if #30618 is related; I checked out that PR and I'm still seeing the same error. I've got a trace to take a look at now, will investigate it and get back to you with my findings.

@micah-wil

Hey @njhill, sorry for the delay. This actually passes with a newer ROCm version, so it looks like the bug isn't coming from vLLM. Going to close this and will revert the workaround in test_structured_output once we upgrade ROCm. Thanks


Labels

bug Something isn't working rocm Related to AMD ROCm structured-output v1
