
[Bugfix][ROCm] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling #32303

Closed
micah-wil wants to merge 3 commits into vllm-project:main from ROCm:micah/spec-dec-async-sched

Conversation


@micah-wil micah-wil commented Jan 14, 2026

Problem

The issue was exposed in the V1 Test entrypoints test group in AMD CI after #31998 enabled async scheduling by default with spec decoding. The test group has been failing ever since that PR was merged (e.g. in build #2803). It passes if you set async_scheduling=False. The failure can be reproduced with this command:
pytest -v -s tests/v1/entrypoints/llm/test_struct_output_generate.py::test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9]
The resulting error presents as follows:

V1 Test entrypoints test_structured_output error
:0:rocdevice.cpp            :3675: 4871230665497 us:  Callback: Queue 0x7ebfec600000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.14.0rc2.dev22+g69f8a0ea3.d20260113) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=SpeculativeConfig(method='eagle', model='yuhuili/EAGLE-LLaMA3.1-Instruct-8B', num_spec_tokens=5), tokenizer='meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=True, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=120, served_model_name=meta-llama/Meta-Llama-3.1-8B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 
'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}, 
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['12-b282e46b'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={'12-b282e46b': 1020},new_block_ids=[None],num_computed_tokens=[1019],num_output_tokens=[782]), num_scheduled_tokens={12-b282e46b: 4}, total_num_scheduled_tokens=4, scheduled_spec_decode_tokens={12-b282e46b: [-1, -1, -1]}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[64], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=true, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938] Traceback (most recent call last):
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/engine/core.py", line 929, in run_engine_core
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/engine/core.py", line 956, in run_busy_loop
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     self._process_engine_step()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/engine/core.py", line 989, in _process_engine_step
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/engine/core.py", line 487, in step_with_batch_queue
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     model_output = future.result()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return self.__get_result()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     raise self._exception
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/worker/gpu_worker.py", line 578, in sample_tokens
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return self.model_runner.sample_tokens(grammar_output)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/worker/gpu_model_runner.py", line 3453, in sample_tokens
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     self.drafter.prepare_next_token_ids_padded(
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/projects/vllm/vllm/v1/spec_decode/eagle.py", line 627, in prepare_next_token_ids_padded
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     eagle_prepare_next_token_padded_kernel[grid](
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 623, in run
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     ^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 467, in __getattribute__
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     self._init_handles()
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 457, in _init_handles
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938]                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) ERROR 01-14 04:47:03 [core.py:938] RuntimeError: Triton Error [HIP]:  Code: 719, Messsage: unspecified launch failure
(EngineCore_DP0 pid=414815) Process EngineCore_DP0:
(EngineCore_DP0 pid=414815) Traceback (most recent call last):
(EngineCore_DP0 pid=414815)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=414815)     self.run()
(EngineCore_DP0 pid=414815)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=414815)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/engine/core.py", line 940, in run_engine_core
(EngineCore_DP0 pid=414815)     raise e
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/engine/core.py", line 929, in run_engine_core
(EngineCore_DP0 pid=414815)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/engine/core.py", line 956, in run_busy_loop
(EngineCore_DP0 pid=414815)     self._process_engine_step()
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/engine/core.py", line 989, in _process_engine_step
(EngineCore_DP0 pid=414815)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=414815)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/engine/core.py", line 487, in step_with_batch_queue
(EngineCore_DP0 pid=414815)     model_output = future.result()
(EngineCore_DP0 pid=414815)                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=414815)     return self.__get_result()
(EngineCore_DP0 pid=414815)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=414815)     raise self._exception
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=414815)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=414815)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=414815)     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=414815)     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/worker/gpu_worker.py", line 578, in sample_tokens
(EngineCore_DP0 pid=414815)     return self.model_runner.sample_tokens(grammar_output)
(EngineCore_DP0 pid=414815)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=414815)     return func(*args, **kwargs)
(EngineCore_DP0 pid=414815)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/worker/gpu_model_runner.py", line 3453, in sample_tokens
(EngineCore_DP0 pid=414815)     self.drafter.prepare_next_token_ids_padded(
(EngineCore_DP0 pid=414815)   File "/projects/vllm/vllm/v1/spec_decode/eagle.py", line 627, in prepare_next_token_ids_padded
(EngineCore_DP0 pid=414815)     eagle_prepare_next_token_padded_kernel[grid](
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore_DP0 pid=414815)     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=414815)                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 623, in run
(EngineCore_DP0 pid=414815)     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
(EngineCore_DP0 pid=414815)     ^^^^^^^^^^
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 467, in __getattribute__
(EngineCore_DP0 pid=414815)     self._init_handles()
(EngineCore_DP0 pid=414815)   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 457, in _init_handles
(EngineCore_DP0 pid=414815)     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(EngineCore_DP0 pid=414815)                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=414815) RuntimeError: Triton Error [HIP]:  Code: 719, Messsage: unspecified launch failure
Kernel Name: _ZN2at6native12_GLOBAL__N_121indexSelectSmallIndexIN3c108BFloat16EljLi2ELi2ELin2EEEvNS_4cuda6detail10TensorInfoIT_T1_EENS7_IKS8_S9_EENS7_IKT0_S9_EEiiS9_l
VGPU=0x4fe7b760 SWq=0x7ec026fa0000, HWq=0x7ebfec600000, id=1
	Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=0
	grid=[4096, 1, 1], workgroup=[512, 1, 1]
	private_seg_size=0, group_seg_size=0
	kernel_obj=0x7ebdc98ad540, kernarg_address=0x0x7ebfec2c1580
	completion_signal=0x0, correlation_id=0
	rptr=239701, wptr=240043

Root Cause

The issue is ultimately caused by using -1 to represent invalid spec tokens:

spec_token_ids.extend([-1] * num_invalid_tokens)

Without async scheduling, token_ids_cpu is populated with valid token IDs before being copied to the GPU. With async scheduling, the -1 placeholders can propagate to the embedding layer.
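To illustrate why a -1 placeholder reaching an embedding lookup is dangerous, here is a toy bounds-checked lookup in plain Python (the table and helper names are invented for illustration; a real GPU embedding kernel indexes device memory, where an out-of-range id is undefined behavior rather than a clean exception):

```python
def embedding_lookup(table: list[list[float]], token_ids: list[int]) -> list[list[float]]:
    """Toy embedding lookup enforcing the 0 <= id < vocab_size contract
    that a GPU embedding kernel implicitly relies on."""
    rows = []
    for tok in token_ids:
        if not 0 <= tok < len(table):
            raise IndexError(f"token id {tok} out of range [0, {len(table)})")
        rows.append(table[tok])
    return rows

vocab = [[float(i)] for i in range(8)]  # tiny 8-entry "embedding table"
print(embedding_lookup(vocab, [1, 2, 3]))  # valid ids resolve normally

try:
    embedding_lookup(vocab, [1, -1, -1])  # -1 placeholders are out of range
except IndexError as e:
    print("lookup failed:", e)
```

On the GPU there is no such Python-level check; the out-of-range access surfaces as the hardware exception shown in the log above.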

1. update_draft_token_ids_in_output pads rejected tokens with -1.
2. update_req_spec_token_ids writes spec_token_ids, which includes the -1 values, to token_ids_cpu:

   self.token_ids_cpu[req_index, start_index:end_token_index] = spec_token_ids

3. _prepare_input_ids copies the tokens to the GPU.

The issue doesn't seem to manifest on CUDA. I'm not exactly sure why; it could be a timing difference, or the CUDA embedding kernel may tolerate -1 values. Either way, I believe the solution here is the safest.

Solution
Get rid of the -1 placeholders before copying tokens to the GPU by replacing them with zero.
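As a sketch of the idea (illustrative only, not the actual patch; the helper name is invented, and `pad_id` stands in for the zero token used as a safe placeholder):

```python
def sanitize_spec_token_ids(spec_token_ids: list[int], pad_id: int = 0) -> list[int]:
    """Replace -1 invalid-spec-token placeholders with a safe pad id
    before the token ids are written to token_ids_cpu / copied to the GPU."""
    return [tok if tok != -1 else pad_id for tok in spec_token_ids]

print(sanitize_spec_token_ids([123, 456, -1, -1, -1]))  # → [123, 456, 0, 0, 0]
```

The replaced positions are rejected draft slots, so the value only needs to be a valid embedding index; it never influences sampled output.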

With this fix, I am seeing the following when running pytest -v -s tests/v1/entrypoints/llm/test_struct_output_generate.py::test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9]:

======================== 1 passed, 3 warnings in 35.51s ========================

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@micah-wil micah-wil changed the title [Bugfix] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling [Bugfix][ROCm] Fix invalid spec token placeholders causing embedding lookup failure with async scheduling Jan 14, 2026
@mergify mergify bot added rocm Related to AMD ROCm v1 bug Something isn't working labels Jan 14, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a bug causing embedding lookup failures on certain hardware by replacing -1 placeholder tokens with 0. This is a good fix. I've also identified a related potential issue where these -1 placeholders could cause a crash during penalty calculations if async_scheduling is disabled. I've left a comment with a suggestion to make the fix more robust.

Comment on lines +467 to +468
safe_spec_token_ids = [tok if tok != -1 else 0 for tok in spec_token_ids]
self.token_ids_cpu[req_index, start_index:end_token_index] = safe_spec_token_ids

Severity: high

This is a great fix for the embedding lookup error. However, there's a related issue on the next line.

cur_spec_token_ids is being extended with the original spec_token_ids, which can contain -1 placeholders. This can cause a RuntimeError from torch.bincount when penalties (like frequency or presence penalty) are applied, as bincount does not support negative indices. This would happen when async_scheduling is False and penalties are enabled.

To prevent this potential crash, you should also use safe_spec_token_ids to extend cur_spec_token_ids.

Suggested change for line 469:

cur_spec_token_ids.extend(safe_spec_token_ids)
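For illustration, a pure-Python stand-in shows how -1 placeholders break a bincount-style frequency count while the sanitized list does not (sketch only; the real penalty path uses torch.bincount, which raises a RuntimeError on negative inputs):

```python
def bincount(values: list[int]) -> list[int]:
    """Pure-Python stand-in for torch.bincount: counts occurrences of each
    non-negative int and, like torch, rejects negative inputs."""
    if any(v < 0 for v in values):
        raise ValueError("bincount only supports non-negative values")
    counts = [0] * (max(values, default=-1) + 1)
    for v in values:
        counts[v] += 1
    return counts

spec_token_ids = [42, 7, -1, -1]
safe_spec_token_ids = [tok if tok != -1 else 0 for tok in spec_token_ids]

try:
    bincount(spec_token_ids)  # -1 placeholders present: counting fails
except ValueError as e:
    print("penalty path would crash:", e)

print(bincount(safe_spec_token_ids))  # sanitized ids count cleanly
```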


njhill commented Jan 16, 2026

@micah-wil thanks for the detailed analysis! I haven't dug into exactly what is happening here yet, but I think what's missing from what you summarized:

_prepare_input_ids does copy these -1's to the GPU; however, in this case it should subsequently overwrite them with "valid" draft token ids here:

self.input_ids.gpu.scatter_(

I guess either there's some circumstance where that is not happening or it's actually a different root cause.

I wonder if it could be somehow related to #30618 (only a vague thought, not sure about that at all!)
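The overwrite described above can be mimicked in plain Python (a toy sketch with invented names; the real code calls torch.Tensor.scatter_ on the GPU input_ids buffer, on the same stream as the copy):

```python
def scatter_inplace(dest: list[int], indices: list[int], src: list[int]) -> None:
    """Toy stand-in for torch.Tensor.scatter_: dest[indices[i]] = src[i]."""
    for i, idx in enumerate(indices):
        dest[idx] = src[i]

input_ids = [11, 22, -1, -1, -1]                  # -1 placeholders after the CPU->GPU copy
scatter_inplace(input_ids, [2, 3, 4], [7, 8, 9])  # overwrite with valid draft token ids
print(input_ids)  # → [11, 22, 7, 8, 9]
```

If the scatter reliably runs before any kernel reads those positions, no -1 should ever reach the embedding, which is why the observed failure suggested either a missed overwrite or a different root cause.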

@micah-wil

> @micah-wil thanks for the detailed analysis! I haven't dug into exactly what is happening here yet, but I think what's missing from what you summarized:
>
> _prepare_input_ids does copy these -1's to the GPU, however in this case it should subsequently overwrite them with "valid" draft token ids here:
>
> self.input_ids.gpu.scatter_(
>
> I guess either there's some circumstance where that is not happening or it's actually a different root cause.
>
> I wonder if it could be somehow related to #30618 (only a vague thought, not sure about that at all!)

Thanks for taking a look. You brought up a good point about scatter_; I'd come across that when looking for the root cause of this issue. I had thought that maybe there was somehow a race condition between scatter_ and copy_to_gpu that happened to run correctly on CUDA but not ROCm for whatever reason, but that really shouldn't be the case now that I'm looking again. I will collect a trace to see what is really going on there.

…ed_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9] (vllm-project#32355)"

This reverts commit 773d707.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

njhill commented Jan 17, 2026

> I had thought that maybe somehow there's a race condition between scatter_ and copy_to_gpu that happens to run correctly on CUDA but not ROCm for whatever reason

They happen on the same stream so should be sequential from device pov.

@micah-wil

@njhill Not sure if #30618 is related; I checked out that PR and I'm still seeing the same error. I've got a trace to take a look at now, will investigate it and get back to you with my findings.

@micah-wil

Hey @njhill, sorry for the delay. This actually passes with a newer ROCm version, so it looks like the bug isn't coming from vLLM. Going to close this and will revert the workaround in test_structured_output once we upgrade ROCm. Thanks


Labels

bug Something isn't working rocm Related to AMD ROCm structured-output v1
