[Bugfix] Add SM 12.1 support + Fix GPT-OSS Harmony garbled reasoning and HarmonyError crashes by ohsono · Pull Request #31607 · vllm-project/vllm

ohsono · 2026-01-01T20:42:41Z

Summary

This PR has two parts:

Part 1 — SM 12.1 (Blackwell GB10) Compatibility (original)

CUTLASS ops graceful fallback (vllm/_custom_ops.py): AttributeError handling for missing CUTLASS ops on incomplete builds
Compilation ops compatibility (vllm/compilation/matcher_utils.py): hasattr checks for FP8 quant and silu_and_mul ops

Part 2 — GPT-OSS Harmony serving fixes (new)

Partially addresses #37030

Fixes two bugs causing openai/gpt-oss-120b to produce garbled, hallucinated, or null reasoning output. Not a duplicate of #34951, #30247, or #30482 (those address developer block formatting, model identity, and chat template — different issues).

Root Cause (Part 2)

Bug 1 — System message corrupts Harmony token sequence → hallucinated reasoning

_make_request_with_harmony passed all messages including role: "system" directly to parse_chat_inputs_to_harmony_messages. The Harmony encoder does not handle raw system-role messages — it expects them as structured SystemContent. The raw tokens corrupted the Harmony context, causing the model to produce off-topic or foreign-language output.

Example (before fix): Request "What is 2+2?" with system "You are a helpful assistant":

reasoning: "We need to solve the problem. Problem statement: 'You have an
agriculturally oriented text?? ... There is problem named 'find sum of
GCD(L,R)'..." [completely unrelated Codeforces hallucination]

Bug 2 — `HarmonyError` unhandled in streaming and non-streaming paths

harmony_parser.process(token_id) had no HarmonyError handler. Any unexpected Harmony token raised an unhandled exception, crashing the stream or silently returning null content with no diagnostic. Particularly impactful on SM121/Blackwell with MXFP4 quantization (see #37030).

Changes

`vllm/entrypoints/openai/chat_completion/serving.py`

System message extraction in _make_request_with_harmony: scan request.messages for role: "system" entries, extract and concatenate their text content (handling both str and list[{type, text}] formats), pass as structured instructions to get_system_message() / get_developer_message(). Feed only non-system messages to parse_chat_inputs_to_harmony_messages.
HarmonyError handling in streaming generator: wrap harmony_parser.process(token_id) in try/except HarmonyError; log a warning and break to return partial result.

`vllm/entrypoints/openai/parser/harmony_utils.py`

HarmonyError handling in parse_output_into_messages: same pattern — catch, log warning, break with partial result.

Test Plan

TIKTOKEN_RS_CACHE_DIR=/path/to/tiktoken_encodings \
  pytest tests/entrypoints/openai/parser/test_harmony_utils.py -v
# Result: 57/57 passed

Live server validation on gpt-oss-120b (SM121, MXFP4):

System message no longer causes hallucinated Codeforces/Russian reasoning
HarmonyError now caught and logged (harmony_utils.py:342) instead of crashing
System message extraction handles both str and list[{type, text}] content formats
Baseline requests (no system message) unaffected

Note: Null content/reasoning with finish_reason: length is a pre-existing SM121 MXFP4 Marlin kernel issue (wrong first Harmony token generated) tracked separately in #37030. Our changes correctly catch this case gracefully.

AI assistance used: Claude assisted with debugging, root cause analysis, and code authoring. All changes reviewed, tested, and validated by the human submitter on live hardware.

Partially addresses [Bug]: GPT-OSS-120B gpt-oss MXFP4 on SM121 (Blackwell DGX Spark): Marlin kernel generates wrong first Harmony token, producing null content/reasoning #37030 (deeper SM121 MXFP4 Marlin kernel issue — requires kernel-level fix)
Related: [Bug]: SM120/SM100: gpt-oss gibberish with tp=2 #31422 (SM120 gibberish with tp=2 — different root cause: prefix caching + TP)

github-actions · 2026-01-01T20:42:51Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

mergify · 2026-01-01T20:46:42Z

Hi @ohsono, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

gemini-code-assist

Code Review

This pull request introduces a critical bugfix for the V1 engine's async scheduling and adds support for SM 12.1 (Blackwell). The changes are well-structured and address the described issues effectively. The core bugfix in vllm/v1/engine/core.py correctly handles a None return, preventing a crash. The compatibility improvements for SM 12.1 are implemented defensively using try-except and hasattr checks, which is great for robustness. My main feedback is regarding the new dependency on the Triton main branch, which should be pinned to a specific commit for build stability.

requirements/common.txt

njhill · 2026-01-02T00:56:19Z

Thanks @ohsono . But this should only ever happen as a secondary error after the execute_model method already raised an exception. Please share the whole log, you should see an earlier root cause error/traceback.

I agree that this is still a bug though since the secondary error is misleading. I've opened #31611 to address this.

Please remove the core.py change from this PR, the other changes can be reviewed separately.

ohsono · 2026-01-02T02:54:58Z

Thanks @ohsono . But this should only ever happen as a secondary error after the execute_model method already raised an exception. Please share the whole log, you should see an earlier root cause error/traceback.

I agree that this is still a bug though since the secondary error is misleading. I've opened #31611 to address this.

Please remove the core.py change from this PR, the other changes can be reviewed separately.

Thanks, @njhill, for the suggestion. I've updated accordingly. Although I didn't think to save the log, I found some error messages from vllm_server.log. I am not 100% sure this reflects the exact moment the crash is triggered, but it's close enough. Please let me know if you have any questions.

(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.14.0rc1.dev187+g04147dcfa.d20251231) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False), seed=0, served_model_name=openai/gpt-oss-120b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None},
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-9b7523ffdf34efc8-0-a44d6762,prompt_token_ids_len=5,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[200012, 199999], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=50, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1], [2]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={cmpl-9b7523ffdf34efc8-0-a44d6762: 5}, total_num_scheduled_tokens=5, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 1], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=7.791803023216026e-05, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=5, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=305423) [2025-12-30 23:36:10] ERROR _base.py:426: exception calling callback for <Future at 0xeadc4c0141d0 state=finished raised PTXASError>
(EngineCore_DP0 pid=305423) Traceback (most recent call last):
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 468, in make_cubin
(EngineCore_DP0 pid=305423)     subprocess.run(ptxas_cmd, check=True, close_fds=False, stderr=flog)
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/subprocess.py", line 571, in run
(EngineCore_DP0 pid=305423)     raise CalledProcessError(retcode, process.args,
(EngineCore_DP0 pid=305423) subprocess.CalledProcessError: Command '['/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas', '-lineinfo', '-v', '--gpu-name=sm_121a', '/tmp/tmpkdq4382_.ptx', '-o', '/tmp/tmpkdq4382_.ptx.o']' returned non-zero exit status 255.
(EngineCore_DP0 pid=305423)
(EngineCore_DP0 pid=305423) During handling of the above exception, another exception occurred:
(EngineCore_DP0 pid=305423)
(EngineCore_DP0 pid=305423) Traceback (most recent call last):
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 424, in add_done_callback
(EngineCore_DP0 pid=305423)     fn(self)
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 344, in callback
(EngineCore_DP0 pid=305423)     result = f.result()
(EngineCore_DP0 pid=305423)              ^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=305423)     return self.__get_result()
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=305423)     raise self._exception
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=305423)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=305423)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=305423)     return func(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/worker/worker_base.py", line 369, in execute_model
(EngineCore_DP0 pid=305423)     return self.worker.execute_model(scheduler_output, *args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=305423)     return func(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/worker/gpu_worker.py", line 624, in execute_model
(EngineCore_DP0 pid=305423)     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=305423)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=305423)     return func(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/worker/gpu_model_runner.py", line 3219, in execute_model
(EngineCore_DP0 pid=305423)     model_output = self._model_forward(
(EngineCore_DP0 pid=305423)                    ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/worker/gpu_model_runner.py", line 2848, in _model_forward
(EngineCore_DP0 pid=305423)     return self.model(
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=305423)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=305423)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/model_executor/models/gpt_oss.py", line 722, in forward
(EngineCore_DP0 pid=305423)     return self.model(input_ids, positions, intermediate_tensors, inputs_embeds)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/compilation/decorators.py", line 372, in __call__
(EngineCore_DP0 pid=305423)     return self.forward(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/model_executor/models/gpt_oss.py", line 300, in forward
(EngineCore_DP0 pid=305423)     x, residual = layer(x, positions, residual)
(EngineCore_DP0 pid=305423)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=305423)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=305423)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/model_executor/models/gpt_oss.py", line 233, in forward
(EngineCore_DP0 pid=305423)     hidden_states = self.attn(hidden_states, positions)
(EngineCore_DP0 pid=305423)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=305423)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=305423)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/model_executor/models/gpt_oss.py", line 137, in forward
(EngineCore_DP0 pid=305423)     attn_output = self.attn(q, k, v)
(EngineCore_DP0 pid=305423)                   ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=305423)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=305423)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/attention/layer.py", line 385, in forward
(EngineCore_DP0 pid=305423)     torch.ops.vllm.unified_attention_with_output(
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore_DP0 pid=305423)     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/attention/utils/kv_transfer_utils.py", line 39, in wrapper
(EngineCore_DP0 pid=305423)     return func(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/attention/layer.py", line 804, in unified_attention_with_output
(EngineCore_DP0 pid=305423)     self.impl.forward(
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/attention/backends/triton_attn.py", line 435, in forward
(EngineCore_DP0 pid=305423)     triton_reshape_and_cache_flash(
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/attention/ops/triton_reshape_and_cache_flash.py", line 162, in triton_reshape_and_cache_flash
(EngineCore_DP0 pid=305423)     reshape_and_cache_kernel_flash[grid](
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 419, in <lambda>
(EngineCore_DP0 pid=305423)     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=305423)                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 733, in run
(EngineCore_DP0 pid=305423)     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore_DP0 pid=305423)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 861, in _do_compile
(EngineCore_DP0 pid=305423)     kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore_DP0 pid=305423)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/compiler/compiler.py", line 320, in compile
(EngineCore_DP0 pid=305423)     next_module = compile_ir(module, metadata)
(EngineCore_DP0 pid=305423)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 520, in <lambda>
(EngineCore_DP0 pid=305423)     stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.target.arch)
(EngineCore_DP0 pid=305423)                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
(EngineCore_DP0 pid=305423) `ptxas` stderr:
(EngineCore_DP0 pid=305423) ptxas fatal   : Value 'sm_121a' is not defined for option 'gpu-name'
(EngineCore_DP0 pid=305423)
(EngineCore_DP0 pid=305423) Repro command: /home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_121a /tmp/tmpkdq4382_.ptx -o /tmp/tmpkdq4382_.ptx.o
(EngineCore_DP0 pid=305423)
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879] Traceback (most recent call last):
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 870, in run_engine_core
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 897, in run_busy_loop
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]     self._process_engine_step()
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 930, in _process_engine_step
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 467, in step_with_batch_queue
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]     engine_core_outputs = self.scheduler.update_from_output(
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]   File "/home/ohsono/vllm/vllm/v1/core/sched/scheduler.py", line 1066, in update_from_output
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]     sampled_token_ids = model_runner_output.sampled_token_ids
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879] AttributeError: 'NoneType' object has no attribute 'sampled_token_ids'
(EngineCore_DP0 pid=305423) Process EngineCore_DP0:
(EngineCore_DP0 pid=305423) Traceback (most recent call last):
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=305423)     self.run()
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=305423)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 881, in run_engine_core
(EngineCore_DP0 pid=305423)     raise e
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 870, in run_engine_core
(EngineCore_DP0 pid=305423)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 897, in run_busy_loop
(EngineCore_DP0 pid=305423)     self._process_engine_step()
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 930, in _process_engine_step
(EngineCore_DP0 pid=305423)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=305423)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 467, in step_with_batch_queue
(EngineCore_DP0 pid=305423)     engine_core_outputs = self.scheduler.update_from_output(
(EngineCore_DP0 pid=305423)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/core/sched/scheduler.py", line 1066, in update_from_output
(EngineCore_DP0 pid=305423)     sampled_token_ids = model_runner_output.sampled_token_ids
(EngineCore_DP0 pid=305423)                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423) AttributeError: 'NoneType' object has no attribute 'sampled_token_ids'
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543] AsyncLLM output_handler failed.
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543] Traceback (most recent call last):
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543]   File "/home/ohsono/vllm/vllm/v1/engine/async_llm.py", line 495, in output_handler
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543]     outputs = await engine_core.get_output_async()
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543]   File "/home/ohsono/vllm/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543]     raise self._format_exception(outputs) from None
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=305263) INFO:     127.0.0.1:43878 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
[rank0]:[W1230 23:36:10.698255043 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=305263) INFO:     Shutting down
(APIServer pid=305263) INFO:     Waiting for application shutdown.
(APIServer pid=305263) INFO:     Application shutdown complete.
(APIServer pid=305263) INFO:     Finished server process [305263]

njhill · 2026-01-02T07:43:43Z

Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error:

(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error

The other side-effect message is more of a red herring and will be addressed by #31611.

ohsono · 2026-01-03T00:11:31Z

Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error:
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
The other side-effect message is more of a red herring and will be addressed by #31611.

Thanks for your input on PTXAS. I actually downgrade the PTXAS and apply --gpu-name=sm_90a instead of sm_121a (GB10 or other Blackwall-compatible), which I believe the current library of PTXAS doesn't support newer architectures.

However, there is a much simpler solution for this, bypassing it altogether. The --enforce-eager flag:

Disables torch.compile completely
Prevents Triton from generating kernels that need ptxas
Avoids the SM 12.1 compatibility issue entirely
Has minimal performance impact (20-30% slower than optimized, but still fast with MXFP4)

Plus, I saw the v3.5.1 Triton build (latest) doesn't reflect the latest commit of the main branch. So I also include Triton sync to pin the specific commit. #c9a2344 which has been address by PR(triton-lang/triton#8498)
Hopefully, this issue will be addressed in a newer build of CUDA, which I hope they will expose at a later release, including support for {sm_120, sm_121, and sm_121a}.

LucasWilkinson · 2026-01-03T01:19:48Z

cc @mgoin

mgoin · 2026-01-04T19:04:00Z

Each pytorch release pins to a specific triton version, so we want to avoid affecting that pin. We may just have to wait for the next pytorch release to hopefully have the triton fix included

johnnynunez · 2026-01-04T21:00:50Z

triton 3.6.0 triton-lang/triton#8862

ProExpertProg · 2026-01-05T19:27:01Z

vllm/model_executor/layers/quantization/utils/w8a8_utils.py

-CUTLASS_BLOCK_FP8_SUPPORTED = cutlass_block_fp8_supported()
+# Handle case where CUTLASS ops might not be available (build issue)
+try:
+    CUTLASS_FP8_SUPPORTED = cutlass_fp8_supported()


Let's incorporate this logic into the function itself

ProExpertProg · 2026-01-05T19:28:09Z

vllm/_custom_ops.py

 def cutlass_scaled_mm_supports_fp4(cuda_device_capability: int) -> bool:
-    return torch.ops._C.cutlass_scaled_mm_supports_fp4(cuda_device_capability)
+    try:
+        return torch.ops._C.cutlass_scaled_mm_supports_fp4(cuda_device_capability)


Should we build these functions for cuda 12.1?

Thanks @ProExpertProg. Yes. I believe this has been addressed as of CUDA 13.1. I haven't update to my local machine yet.

johnnynunez · 2026-01-05T19:30:56Z

Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error:
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
The other side-effect message is more of a red herring and will be addressed by #31611.
Thanks for your input on PTXAS. I actually downgrade the PTXAS and apply --gpu-name=sm_90a instead of sm_121a (GB10 or other Blackwall-compatible), which I believe the current library of PTXAS doesn't support newer architectures.

However, there is a much simpler solution for this, bypassing it altogether. The --enforce-eager flag:

Disables torch.compile completely

Prevents Triton from generating kernels that need ptxas

Avoids the SM 12.1 compatibility issue entirely

Has minimal performance impact (20-30% slower than optimized, but still fast with MXFP4)

Plus, I saw the v3.5.1 Triton build (latest) doesn't reflect the latest commit of the main branch. So I also include Triton sync to pin the specific commit. #c9a2344 which has been address by PR(triton-lang/triton#8498) Hopefully, this issue will be addressed in a newer build of CUDA, which I hope they will expose at a later release, including support for {sm_120, sm_121, and sm_121a}.

It is coming i think in torch 2.10 that will be released in 2-3 weeks

ohsono · 2026-01-05T21:44:08Z

Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error:
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
The other side-effect message is more of a red herring and will be addressed by #31611.
Thanks for your input on PTXAS. I actually downgrade the PTXAS and apply --gpu-name=sm_90a instead of sm_121a (GB10 or other Blackwall-compatible), which I believe the current library of PTXAS doesn't support newer architectures.
However, there is a much simpler solution for this, bypassing it altogether. The --enforce-eager flag:

Disables torch.compile completely

Prevents Triton from generating kernels that need ptxas

Avoids the SM 12.1 compatibility issue entirely

Has minimal performance impact (20-30% slower than optimized, but still fast with MXFP4)

Plus, I saw the v3.5.1 Triton build (latest) doesn't reflect the latest commit of the main branch. So I also include Triton sync to pin the specific commit. #c9a2344 which has been address by PR(triton-lang/triton#8498) Hopefully, this issue will be addressed in a newer build of CUDA, which I hope they will expose at a later release, including support for {sm_120, sm_121, and sm_121a}.
It is coming i think in torch 2.10 that will be released in 2-3 weeks

Thanks @johnnynunez. I think I am ok to wait for a couple weeks if that is the case. Also as you mentioned, v3.6.0 Triton release on the way. so, we should keep this PR ticket open for the future reference. Since we identified the bug and address by #31611.
Thank you so much for your insight for all. cc: @njhill @johnnynunez @ProExpertProg

johnnynunez · 2026-01-06T03:20:15Z

Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error:
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
The other side-effect message is more of a red herring and will be addressed by #31611.
Thanks for your input on PTXAS. I actually downgrade the PTXAS and apply --gpu-name=sm_90a instead of sm_121a (GB10 or other Blackwall-compatible), which I believe the current library of PTXAS doesn't support newer architectures.
However, there is a much simpler solution for this, bypassing it altogether. The --enforce-eager flag:

Disables torch.compile completely

Prevents Triton from generating kernels that need ptxas

Avoids the SM 12.1 compatibility issue entirely

Has minimal performance impact (20-30% slower than optimized, but still fast with MXFP4)

Plus, I saw the v3.5.1 Triton build (latest) doesn't reflect the latest commit of the main branch. So I also include Triton sync to pin the specific commit. #c9a2344 which has been address by PR(triton-lang/triton#8498) Hopefully, this issue will be addressed in a newer build of CUDA, which I hope they will expose at a later release, including support for {sm_120, sm_121, and sm_121a}.
It is coming i think in torch 2.10 that will be released in 2-3 weeks
Thanks @johnnynunez. I think I am ok to wait for a couple weeks if that is the case. Also as you mentioned, v3.6.0 Triton release on the way. so, we should keep this PR ticket open for the future reference. Since we identified the bug and address by #31611. Thank you so much for your insight for all. cc: @njhill @johnnynunez @ProExpertProg

I have compiled all my stack with cuda 13.0 and pytorch 2.10. Maybe i am going to test in the following days

mergify · 2026-01-08T07:28:04Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ohsono.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-03-14T03:00:15Z

Hi @ohsono, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

mergify · 2026-03-14T03:37:17Z

Hi @ohsono, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

ohsono · 2026-03-14T03:59:04Z

We may need to add
diff --git a/csrc/torch_bindings.cpp b/csrc/torch_bindings.cpp
index c16b9c223..532bd03ef 100644
--- a/csrc/torch_bindings.cpp
+++ b/csrc/torch_bindings.cpp
@@ -239,10 +239,12 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {

   // Quantization ops
 #ifndef USE_ROCM
+#ifdef ENABLE_DSV3_FUSED_A_GEMM
   // DeepSeek V3 fused A GEMM (SM 9.0+, bf16 only, 1-16 tokens).
   ops.def(
       "dsv3_fused_a_gemm(Tensor! output, Tensor mat_a, Tensor mat_b) -> ()");
   ops.impl("dsv3_fused_a_gemm", torch::kCUDA, &dsv3_fused_a_gemm);
+#endif
I try to build in the main branch, but it fail to launch due to this.
I agree with this, in CMakelists.txt checks dsv3_fused_a_gemm requires SM9.0+,

When building using TORCH_CUDA_ARCH_LIST="8.9+PTX", dsv3_fused_a_gemm.cu is not compiled, will result undefined symbol linker error.

After patch this 8.9+PTX build works good

Thanks for reporting that. Hopefully, this issue has been addressed by the CUDA 13.1 release. But I am not sure.
https://docs.nvidia.com/cuda/archive/13.1.0/cuda-toolkit-release-notes/index.html

Otherwise, my recent commit, 570686b. It may bypass the linking error.

mergify · 2026-03-14T06:06:24Z

Hi @ohsono, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Fixes vllm-project#31588 This PR addresses a critical V1 engine bug affecting all users with async scheduling enabled (default) and improves SM 12.1 (Blackwell GB10) compatibility. **Problem:** V1 engine crashes with `AttributeError: 'NoneType' object has no attribute 'sampled_token_ids'` on first inference when async scheduling is enabled. **Root Cause:** The step_with_batch_queue() method did not handle None return from execute_model(), while step() method handles this correctly. **Solution:** Added None check in step_with_batch_queue() that mirrors the existing pattern in step() method (vllm/v1/engine/core.py:464-470). **Impact:** - Fixes crash on first inference with default V1 engine configuration - Affects all platforms using async scheduling (default) - Zero performance impact - only adds missing error handling 1. **CUTLASS ops graceful fallback** (vllm/_custom_ops.py) - Added AttributeError handling for missing CUTLASS ops - Prevents crashes on incomplete builds - Provides clear warning messages 2. **Compilation ops compatibility** (vllm/compilation/matcher_utils.py) - Added hasattr checks for FP8 quant and silu_and_mul ops - Prevents crashes during compilation matching 3. **MXFP4 SM range extension** (vllm/model_executor/layers/quantization/mxfp4.py) - Extended Triton kernel support range to include SM 12.0-12.1 - Enables proper fallback to Marlin backend on SM 12.1 4. **FP8 ops exception handling** (vllm/model_executor/layers/quantization/utils/w8a8_utils.py) - Wrapped CUTLASS_FP8_SUPPORTED initialization in try-except - Prevents module import crashes 5. **Triton dependency** (requirements/common.txt) - Added Triton main branch for SM 12.1 PTX support - Temporary until official release with sm_121a support Tested on: - GPU: NVIDIA GB10 (SM 12.1) - CUDA: 13.0 - PyTorch: 2.9.1+cu130 - Model: openai/gpt-oss-120b with MXFP4 quantization - vllm-project#28589 - V1 Engine fails on Blackwell GB10 - vllm-project#31128 - Add support for Blackwell SM121 - vllm-project#28621 - EP parallel + async scheduling crash Signed-off-by: Hochan Son <ohsono@gmail.com>

… and SILU_MUL_OP guard - Move try/except for CUTLASS ops into cutlass_fp8_supported() and cutlass_block_fp8_supported() per reviewer request - Fix SM 12.x Triton range to exclude SM 11.x (split into <11.0 and >=12.0) - Add SILU_MUL_OP is not None check in MatcherSiluAndMul.enabled Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hochan Son <ohsono@gmail.com>

Apply ruff formatting fixes: - Split long lines to meet 88 character limit - Reformat multi-line function calls Signed-off-by: Hochan Son <ohsono@gmail.com>

…D_A_GEMM dsv3_fused_a_gemm requires SM 9.0+ and is conditionally compiled in CMakeLists.txt, but torch_bindings.cpp unconditionally registers the op, causing an undefined symbol linker error when building with TORCH_CUDA_ARCH_LIST="8.9+PTX" or other sub-SM90 targets. Add ENABLE_DSV3_FUSED_A_GEMM compile definition in CMake when the kernel is built, and guard the ops.def in torch_bindings.cpp with #ifdef. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hochan Son <ohsono@gmail.com>

Signed-off-by: Hochan Son <ohsono@gmail.com>

…port - Exclude upstream shell scripts with pre-existing shellcheck issues (.buildkite/scripts/run-multi-node-test.sh, tests/v1/kv_connector/nixl_integration/spec_decode_acceptance_test.sh) - Add SC2089/SC2090/SC2086/SC2046/SC2048/SC2206 to .shellcheckrc disabled list - Add SM 12.1 (GB10) to CUDA_SUPPORTED_ARCHS and MARLIN_FP8_ARCHS Signed-off-by: Hochan Son <ohsono@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ssages and HarmonyError crashes Two related bugs caused GPT-OSS (gpt-oss-120b) to produce garbled output (unicode garbage, foreign language text like Russian) during reasoning: 1. System messages passed via `request.messages` with `role: "system"` were fed directly to `parse_chat_inputs_to_harmony_messages`, which does not handle the system role in Harmony encoding format. The raw tokens caused the model to see unexpected token sequences during reasoning, producing unreadable output. Fix: extract system-role messages and pass their content as structured `instructions` to `get_system_message` / `get_developer_message` before encoding, in `OpenAIServingRender`. 2. `harmony_parser.process(token_id)` in the streaming generator had no `HarmonyError` handler. Any unexpected token raised an unhandled exception, crashing the stream. Fix: wrap the call site with `try/except HarmonyError` and return the partial result with a warning log. Signed-off-by: Hochan Son <ohsono@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mergify · 2026-03-21T03:40:31Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ohsono.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Hochan Son <ohsono.email@gmail.com>

mergify · 2026-03-30T07:15:24Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ohsono.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ohsono requested review from ProExpertProg, mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth, yewentao256, youkaichao and zou3519 as code owners January 1, 2026 20:42

mergify bot added ci/build v1 labels Jan 1, 2026

gemini-code-assist bot reviewed Jan 1, 2026

View reviewed changes

requirements/common.txt Outdated Show resolved Hide resolved

ohsono force-pushed the fix-v1-engine-sm121-support branch from 8a2dcf0 to 91e9533 Compare January 1, 2026 23:05

njhill mentioned this pull request Jan 2, 2026

[Bug]: vLLM SM 12.1 (Blackwell GB10) V1 Engine Bug Report (Relates to: #28589, #31128, #28621, #27679) #31588

Closed

1 task

LucasWilkinson assigned mgoin Jan 3, 2026

njhill changed the title ~~[Bugfix] Fix V1 engine batch queue crash and add SM 12.1 support~~ [Bugfix] Add SM 12.1 support Jan 4, 2026

ProExpertProg reviewed Jan 5, 2026

View reviewed changes

mergify bot added the needs-rebase label Jan 8, 2026

github-project-automation bot moved this to To Triage in gpt-oss Issues & Enhancements Mar 14, 2026

ohsono changed the title ~~[Bugfix] Add SM 12.1 support~~ [Bugfix] Add SM 12.1 support + Fix GPT-OSS Harmony garbled reasoning and HarmonyError crashes Mar 14, 2026

mergify bot removed the needs-rebase label Mar 14, 2026

ohsono and others added 7 commits March 13, 2026 23:27

Fix pre-commit formatting issues

6cfe9e3

Apply ruff formatting fixes: - Split long lines to meet 88 character limit - Reformat multi-line function calls Signed-off-by: Hochan Son <ohsono@gmail.com>

pre-existing bug

a2530ee

Signed-off-by: Hochan Son <ohsono@gmail.com>

ohsono force-pushed the fix-v1-engine-sm121-support branch from cdafc01 to 3e2d8e1 Compare March 14, 2026 06:56

ohsono requested review from hmellor and njhill as code owners March 14, 2026 06:56

Merge branch 'vllm-project:main' into fix-v1-engine-sm121-support

73999a5

ohsono requested a review from ProExpertProg March 14, 2026 11:56

Merge branch 'main' into fix-v1-engine-sm121-support

9f0eedb

haosdent mentioned this pull request Mar 14, 2026

[Bug]: GPT-OSS-120B gpt-oss MXFP4 on SM121 (Blackwell DGX Spark): Marlin kernel generates wrong first Harmony token, producing null content/reasoning #37030

Open

Merge branch 'main' into fix-v1-engine-sm121-support

bf03c71

will-deines mentioned this pull request Mar 18, 2026

[Responses API] Sanitize leaked Harmony control tokens in tool names and recipients #35906

Open

6 tasks

mergify bot added the needs-rebase label Mar 21, 2026

Merge branch 'main' into fix-v1-engine-sm121-support

a48e5d5

Signed-off-by: Hochan Son <ohsono.email@gmail.com>

mergify bot removed the needs-rebase label Mar 26, 2026

Merge branch 'main' into fix-v1-engine-sm121-support

532413b

mergify bot added the needs-rebase label Mar 30, 2026

Uh oh!

Conversation

ohsono commented Jan 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Part 1 — SM 12.1 (Blackwell GB10) Compatibility (original)

Part 2 — GPT-OSS Harmony serving fixes (new)

Root Cause (Part 2)

Bug 1 — System message corrupts Harmony token sequence → hallucinated reasoning

Bug 2 — HarmonyError unhandled in streaming and non-streaming paths

Changes

vllm/entrypoints/openai/chat_completion/serving.py

vllm/entrypoints/openai/parser/harmony_utils.py

Test Plan

Related

Uh oh!

github-actions bot commented Jan 1, 2026

Uh oh!

mergify bot commented Jan 1, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

njhill commented Jan 2, 2026

Uh oh!

ohsono commented Jan 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

njhill commented Jan 2, 2026

Uh oh!

ohsono commented Jan 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LucasWilkinson commented Jan 3, 2026

Uh oh!

mgoin commented Jan 4, 2026

Uh oh!

johnnynunez commented Jan 4, 2026

Uh oh!

ProExpertProg Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

ProExpertProg Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

ohsono Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

johnnynunez commented Jan 5, 2026

Uh oh!

ohsono commented Jan 5, 2026

Uh oh!

johnnynunez commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mergify bot commented Jan 8, 2026

Uh oh!

mergify bot commented Mar 14, 2026

Uh oh!

mergify bot commented Mar 14, 2026

Uh oh!

ohsono commented Mar 14, 2026

Uh oh!

mergify bot commented Mar 14, 2026

Uh oh!

mergify bot commented Mar 21, 2026

Uh oh!

mergify bot commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

ohsono commented Jan 1, 2026 •

edited

Loading

Bug 2 — `HarmonyError` unhandled in streaming and non-streaming paths

`vllm/entrypoints/openai/chat_completion/serving.py`

`vllm/entrypoints/openai/parser/harmony_utils.py`

ohsono commented Jan 2, 2026 •

edited

Loading

ohsono commented Jan 3, 2026 •

edited

Loading

johnnynunez commented Jan 6, 2026 •

edited

Loading