Skip to content

[Bugfix] Add SM 12.1 support + Fix GPT-OSS Harmony garbled reasoning and HarmonyError crashes#31607

Open
ohsono wants to merge 12 commits intovllm-project:mainfrom
ohsono:fix-v1-engine-sm121-support
Open

[Bugfix] Add SM 12.1 support + Fix GPT-OSS Harmony garbled reasoning and HarmonyError crashes#31607
ohsono wants to merge 12 commits intovllm-project:mainfrom
ohsono:fix-v1-engine-sm121-support

Conversation

@ohsono
Copy link
Copy Markdown

@ohsono ohsono commented Jan 1, 2026

Summary

This PR has two parts:

Part 1 — SM 12.1 (Blackwell GB10) Compatibility (original)

Fixes #31588

  • CUTLASS ops graceful fallback (vllm/_custom_ops.py): AttributeError handling for missing CUTLASS ops on incomplete builds
  • Compilation ops compatibility (vllm/compilation/matcher_utils.py): hasattr checks for FP8 quant and silu_and_mul ops

Part 2 — GPT-OSS Harmony serving fixes (new)

Partially addresses #37030

Fixes two bugs causing openai/gpt-oss-120b to produce garbled, hallucinated, or null reasoning output. Not a duplicate of #34951, #30247, or #30482 (those address developer block formatting, model identity, and chat template — different issues).


Root Cause (Part 2)

Bug 1 — System message corrupts Harmony token sequence → hallucinated reasoning

_make_request_with_harmony passed all messages including role: "system" directly to parse_chat_inputs_to_harmony_messages. The Harmony encoder does not handle raw system-role messages — it expects them as structured SystemContent. The raw tokens corrupted the Harmony context, causing the model to produce off-topic or foreign-language output.

Example (before fix): Request "What is 2+2?" with system "You are a helpful assistant":

reasoning: "We need to solve the problem. Problem statement: 'You have an
agriculturally oriented text?? ... There is problem named 'find sum of
GCD(L,R)'..." [completely unrelated Codeforces hallucination]

Bug 2 — HarmonyError unhandled in streaming and non-streaming paths

harmony_parser.process(token_id) had no HarmonyError handler. Any unexpected Harmony token raised an unhandled exception, crashing the stream or silently returning null content with no diagnostic. Particularly impactful on SM121/Blackwell with MXFP4 quantization (see #37030).


Changes

vllm/entrypoints/openai/chat_completion/serving.py

  • System message extraction in _make_request_with_harmony: scan request.messages for role: "system" entries, extract and concatenate their text content (handling both str and list[{type, text}] formats), pass as structured instructions to get_system_message() / get_developer_message(). Feed only non-system messages to parse_chat_inputs_to_harmony_messages.
  • HarmonyError handling in streaming generator: wrap harmony_parser.process(token_id) in try/except HarmonyError; log a warning and break to return partial result.

vllm/entrypoints/openai/parser/harmony_utils.py

  • HarmonyError handling in parse_output_into_messages: same pattern — catch, log warning, break with partial result.

Test Plan

TIKTOKEN_RS_CACHE_DIR=/path/to/tiktoken_encodings \
  pytest tests/entrypoints/openai/parser/test_harmony_utils.py -v
# Result: 57/57 passed

Live server validation on gpt-oss-120b (SM121, MXFP4):

  • System message no longer causes hallucinated Codeforces/Russian reasoning
  • HarmonyError now caught and logged (harmony_utils.py:342) instead of crashing
  • System message extraction handles both str and list[{type, text}] content formats
  • Baseline requests (no system message) unaffected

Note: Null content/reasoning with finish_reason: length is a pre-existing SM121 MXFP4 Marlin kernel issue (wrong first Harmony token generated) tracked separately in #37030. Our changes correctly catch this case gracefully.

AI assistance used: Claude assisted with debugging, root cause analysis, and code authoring. All changes reviewed, tested, and validated by the human submitter on live hardware.


Related

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jan 1, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify
Copy link
Copy Markdown

mergify bot commented Jan 1, 2026

Hi @ohsono, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a critical bugfix for the V1 engine's async scheduling and adds support for SM 12.1 (Blackwell). The changes are well-structured and address the described issues effectively. The core bugfix in vllm/v1/engine/core.py correctly handles a None return, preventing a crash. The compatibility improvements for SM 12.1 are implemented defensively using try-except and hasattr checks, which is great for robustness. My main feedback is regarding the new dependency on the Triton main branch, which should be pinned to a specific commit for build stability.

@ohsono ohsono force-pushed the fix-v1-engine-sm121-support branch from 8a2dcf0 to 91e9533 Compare January 1, 2026 23:05
@njhill
Copy link
Copy Markdown
Member

njhill commented Jan 2, 2026

Thanks @ohsono . But this should only ever happen as a secondary error after the execute_model method already raised an exception. Please share the whole log, you should see an earlier root cause error/traceback.

I agree that this is still a bug though since the secondary error is misleading. I've opened #31611 to address this.

Please remove the core.py change from this PR, the other changes can be reviewed separately.

@ohsono
Copy link
Copy Markdown
Author

ohsono commented Jan 2, 2026

Thanks @ohsono . But this should only ever happen as a secondary error after the execute_model method already raised an exception. Please share the whole log, you should see an earlier root cause error/traceback.

I agree that this is still a bug though since the secondary error is misleading. I've opened #31611 to address this.

Please remove the core.py change from this PR, the other changes can be reviewed separately.

Thanks, @njhill, for the suggestion. I've updated accordingly. Although I didn't think to save the log, I found some error messages from vllm_server.log. I am not 100% sure this reflects the exact moment the crash is triggered, but it's close enough. Please let me know if you have any questions.

(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.14.0rc1.dev187+g04147dcfa.d20251231) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False), seed=0, served_model_name=openai/gpt-oss-120b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None},
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-9b7523ffdf34efc8-0-a44d6762,prompt_token_ids_len=5,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[200012, 199999], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=50, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1], [2]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={cmpl-9b7523ffdf34efc8-0-a44d6762: 5}, total_num_scheduled_tokens=5, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 1], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=7.791803023216026e-05, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=5, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=305423) [2025-12-30 23:36:10] ERROR _base.py:426: exception calling callback for <Future at 0xeadc4c0141d0 state=finished raised PTXASError>
(EngineCore_DP0 pid=305423) Traceback (most recent call last):
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 468, in make_cubin
(EngineCore_DP0 pid=305423)     subprocess.run(ptxas_cmd, check=True, close_fds=False, stderr=flog)
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/subprocess.py", line 571, in run
(EngineCore_DP0 pid=305423)     raise CalledProcessError(retcode, process.args,
(EngineCore_DP0 pid=305423) subprocess.CalledProcessError: Command '['/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas', '-lineinfo', '-v', '--gpu-name=sm_121a', '/tmp/tmpkdq4382_.ptx', '-o', '/tmp/tmpkdq4382_.ptx.o']' returned non-zero exit status 255.
(EngineCore_DP0 pid=305423)
(EngineCore_DP0 pid=305423) During handling of the above exception, another exception occurred:
(EngineCore_DP0 pid=305423)
(EngineCore_DP0 pid=305423) Traceback (most recent call last):
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 424, in add_done_callback
(EngineCore_DP0 pid=305423)     fn(self)
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 344, in callback
(EngineCore_DP0 pid=305423)     result = f.result()
(EngineCore_DP0 pid=305423)              ^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=305423)     return self.__get_result()
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=305423)     raise self._exception
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=305423)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=305423)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=305423)     return func(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/worker/worker_base.py", line 369, in execute_model
(EngineCore_DP0 pid=305423)     return self.worker.execute_model(scheduler_output, *args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=305423)     return func(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/worker/gpu_worker.py", line 624, in execute_model
(EngineCore_DP0 pid=305423)     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=305423)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=305423)     return func(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/worker/gpu_model_runner.py", line 3219, in execute_model
(EngineCore_DP0 pid=305423)     model_output = self._model_forward(
(EngineCore_DP0 pid=305423)                    ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/worker/gpu_model_runner.py", line 2848, in _model_forward
(EngineCore_DP0 pid=305423)     return self.model(
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=305423)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=305423)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/model_executor/models/gpt_oss.py", line 722, in forward
(EngineCore_DP0 pid=305423)     return self.model(input_ids, positions, intermediate_tensors, inputs_embeds)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/compilation/decorators.py", line 372, in __call__
(EngineCore_DP0 pid=305423)     return self.forward(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/model_executor/models/gpt_oss.py", line 300, in forward
(EngineCore_DP0 pid=305423)     x, residual = layer(x, positions, residual)
(EngineCore_DP0 pid=305423)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=305423)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=305423)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/model_executor/models/gpt_oss.py", line 233, in forward
(EngineCore_DP0 pid=305423)     hidden_states = self.attn(hidden_states, positions)
(EngineCore_DP0 pid=305423)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=305423)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=305423)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/model_executor/models/gpt_oss.py", line 137, in forward
(EngineCore_DP0 pid=305423)     attn_output = self.attn(q, k, v)
(EngineCore_DP0 pid=305423)                   ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=305423)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=305423)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/attention/layer.py", line 385, in forward
(EngineCore_DP0 pid=305423)     torch.ops.vllm.unified_attention_with_output(
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore_DP0 pid=305423)     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/attention/utils/kv_transfer_utils.py", line 39, in wrapper
(EngineCore_DP0 pid=305423)     return func(*args, **kwargs)
(EngineCore_DP0 pid=305423)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/attention/layer.py", line 804, in unified_attention_with_output
(EngineCore_DP0 pid=305423)     self.impl.forward(
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/attention/backends/triton_attn.py", line 435, in forward
(EngineCore_DP0 pid=305423)     triton_reshape_and_cache_flash(
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/attention/ops/triton_reshape_and_cache_flash.py", line 162, in triton_reshape_and_cache_flash
(EngineCore_DP0 pid=305423)     reshape_and_cache_kernel_flash[grid](
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 419, in <lambda>
(EngineCore_DP0 pid=305423)     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=305423)                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 733, in run
(EngineCore_DP0 pid=305423)     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore_DP0 pid=305423)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 861, in _do_compile
(EngineCore_DP0 pid=305423)     kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore_DP0 pid=305423)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/compiler/compiler.py", line 320, in compile
(EngineCore_DP0 pid=305423)     next_module = compile_ir(module, metadata)
(EngineCore_DP0 pid=305423)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 520, in <lambda>
(EngineCore_DP0 pid=305423)     stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.target.arch)
(EngineCore_DP0 pid=305423)                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
(EngineCore_DP0 pid=305423) `ptxas` stderr:
(EngineCore_DP0 pid=305423) ptxas fatal   : Value 'sm_121a' is not defined for option 'gpu-name'
(EngineCore_DP0 pid=305423)
(EngineCore_DP0 pid=305423) Repro command: /home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_121a /tmp/tmpkdq4382_.ptx -o /tmp/tmpkdq4382_.ptx.o
(EngineCore_DP0 pid=305423)
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879] Traceback (most recent call last):
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 870, in run_engine_core
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 897, in run_busy_loop
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]     self._process_engine_step()
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 930, in _process_engine_step
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 467, in step_with_batch_queue
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]     engine_core_outputs = self.scheduler.update_from_output(
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]   File "/home/ohsono/vllm/vllm/v1/core/sched/scheduler.py", line 1066, in update_from_output
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]     sampled_token_ids = model_runner_output.sampled_token_ids
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423) ERROR 12-30 23:36:10 [core.py:879] AttributeError: 'NoneType' object has no attribute 'sampled_token_ids'
(EngineCore_DP0 pid=305423) Process EngineCore_DP0:
(EngineCore_DP0 pid=305423) Traceback (most recent call last):
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=305423)     self.run()
(EngineCore_DP0 pid=305423)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=305423)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 881, in run_engine_core
(EngineCore_DP0 pid=305423)     raise e
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 870, in run_engine_core
(EngineCore_DP0 pid=305423)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 897, in run_busy_loop
(EngineCore_DP0 pid=305423)     self._process_engine_step()
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 930, in _process_engine_step
(EngineCore_DP0 pid=305423)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=305423)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/engine/core.py", line 467, in step_with_batch_queue
(EngineCore_DP0 pid=305423)     engine_core_outputs = self.scheduler.update_from_output(
(EngineCore_DP0 pid=305423)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/vllm/v1/core/sched/scheduler.py", line 1066, in update_from_output
(EngineCore_DP0 pid=305423)     sampled_token_ids = model_runner_output.sampled_token_ids
(EngineCore_DP0 pid=305423)                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=305423) AttributeError: 'NoneType' object has no attribute 'sampled_token_ids'
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543] AsyncLLM output_handler failed.
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543] Traceback (most recent call last):
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543]   File "/home/ohsono/vllm/vllm/v1/engine/async_llm.py", line 495, in output_handler
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543]     outputs = await engine_core.get_output_async()
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543]   File "/home/ohsono/vllm/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543]     raise self._format_exception(outputs) from None
(APIServer pid=305263) ERROR 12-30 23:36:10 [async_llm.py:543] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=305263) INFO:     127.0.0.1:43878 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
[rank0]:[W1230 23:36:10.698255043 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=305263) INFO:     Shutting down
(APIServer pid=305263) INFO:     Waiting for application shutdown.
(APIServer pid=305263) INFO:     Application shutdown complete.
(APIServer pid=305263) INFO:     Finished server process [305263]

@njhill
Copy link
Copy Markdown
Member

njhill commented Jan 2, 2026

Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error:

(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error

The other side-effect message is more of a red herring and will be addressed by #31611.

@ohsono
Copy link
Copy Markdown
Author

ohsono commented Jan 3, 2026

Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error:

(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error

The other side-effect message is more of a red herring and will be addressed by #31611.

Thanks for your input on PTXAS. I actually downgrade the PTXAS and apply --gpu-name=sm_90a instead of sm_121a (GB10 or other Blackwall-compatible), which I believe the current library of PTXAS doesn't support newer architectures.

However, there is a much simpler solution for this, bypassing it altogether. The --enforce-eager flag:

  • Disables torch.compile completely
  • Prevents Triton from generating kernels that need ptxas
  • Avoids the SM 12.1 compatibility issue entirely
  • Has minimal performance impact (20-30% slower than optimized, but still fast with MXFP4)

Plus, I saw the v3.5.1 Triton build (latest) doesn't reflect the latest commit of the main branch. So I also include Triton sync to pin the specific commit. #c9a2344 which has been address by PR(triton-lang/triton#8498)
Hopefully, this issue will be addressed in a newer build of CUDA, which I hope they will expose at a later release, including support for {sm_120, sm_121, and sm_121a}.

@LucasWilkinson
Copy link
Copy Markdown
Collaborator

cc @mgoin

@mgoin
Copy link
Copy Markdown
Member

mgoin commented Jan 4, 2026

Each pytorch release pins to a specific triton version, so we want to avoid affecting that pin. We may just have to wait for the next pytorch release to hopefully have the triton fix included

@johnnynunez
Copy link
Copy Markdown
Contributor

triton 3.6.0 triton-lang/triton#8862

@njhill njhill changed the title [Bugfix] Fix V1 engine batch queue crash and add SM 12.1 support [Bugfix] Add SM 12.1 support Jan 4, 2026
CUTLASS_BLOCK_FP8_SUPPORTED = cutlass_block_fp8_supported()
# Handle case where CUTLASS ops might not be available (build issue)
try:
CUTLASS_FP8_SUPPORTED = cutlass_fp8_supported()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's incorporate this logic into the function itself

def cutlass_scaled_mm_supports_fp4(cuda_device_capability: int) -> bool:
return torch.ops._C.cutlass_scaled_mm_supports_fp4(cuda_device_capability)
try:
return torch.ops._C.cutlass_scaled_mm_supports_fp4(cuda_device_capability)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we build these functions for cuda 12.1?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ProExpertProg. Yes. I believe this has been addressed as of CUDA 13.1. I haven't update to my local machine yet.

@johnnynunez
Copy link
Copy Markdown
Contributor

Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error:

(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error

The other side-effect message is more of a red herring and will be addressed by #31611.

Thanks for your input on PTXAS. I actually downgrade the PTXAS and apply --gpu-name=sm_90a instead of sm_121a (GB10 or other Blackwall-compatible), which I believe the current library of PTXAS doesn't support newer architectures.

However, there is a much simpler solution for this, bypassing it altogether. The --enforce-eager flag:

  • Disables torch.compile completely
  • Prevents Triton from generating kernels that need ptxas
  • Avoids the SM 12.1 compatibility issue entirely
  • Has minimal performance impact (20-30% slower than optimized, but still fast with MXFP4)

Plus, I saw the v3.5.1 Triton build (latest) doesn't reflect the latest commit of the main branch. So I also include Triton sync to pin the specific commit. #c9a2344 which has been address by PR(triton-lang/triton#8498) Hopefully, this issue will be addressed in a newer build of CUDA, which I hope they will expose at a later release, including support for {sm_120, sm_121, and sm_121a}.

It is coming i think in torch 2.10 that will be released in 2-3 weeks

@ohsono
Copy link
Copy Markdown
Author

ohsono commented Jan 5, 2026

Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error:

(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error

The other side-effect message is more of a red herring and will be addressed by #31611.

Thanks for your input on PTXAS. I actually downgrade the PTXAS and apply --gpu-name=sm_90a instead of sm_121a (GB10 or other Blackwall-compatible), which I believe the current library of PTXAS doesn't support newer architectures.
However, there is a much simpler solution for this, bypassing it altogether. The --enforce-eager flag:

  • Disables torch.compile completely
  • Prevents Triton from generating kernels that need ptxas
  • Avoids the SM 12.1 compatibility issue entirely
  • Has minimal performance impact (20-30% slower than optimized, but still fast with MXFP4)

Plus, I saw the v3.5.1 Triton build (latest) doesn't reflect the latest commit of the main branch. So I also include Triton sync to pin the specific commit. #c9a2344 which has been address by PR(triton-lang/triton#8498) Hopefully, this issue will be addressed in a newer build of CUDA, which I hope they will expose at a later release, including support for {sm_120, sm_121, and sm_121a}.

It is coming i think in torch 2.10 that will be released in 2-3 weeks

Thanks @johnnynunez. I think I am ok to wait for a couple weeks if that is the case. Also as you mentioned, v3.6.0 Triton release on the way. so, we should keep this PR ticket open for the future reference. Since we identified the bug and address by #31611.
Thank you so much for your insight for all. cc: @njhill @johnnynunez @ProExpertProg

@johnnynunez
Copy link
Copy Markdown
Contributor

johnnynunez commented Jan 6, 2026

Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error:

(EngineCore_DP0 pid=305423)   File "/home/ohsono/vllm/.venv/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin
(EngineCore_DP0 pid=305423)     raise PTXASError(error)
(EngineCore_DP0 pid=305423) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error

The other side-effect message is more of a red herring and will be addressed by #31611.

Thanks for your input on PTXAS. I actually downgrade the PTXAS and apply --gpu-name=sm_90a instead of sm_121a (GB10 or other Blackwall-compatible), which I believe the current library of PTXAS doesn't support newer architectures.
However, there is a much simpler solution for this, bypassing it altogether. The --enforce-eager flag:

  • Disables torch.compile completely
  • Prevents Triton from generating kernels that need ptxas
  • Avoids the SM 12.1 compatibility issue entirely
  • Has minimal performance impact (20-30% slower than optimized, but still fast with MXFP4)

Plus, I saw the v3.5.1 Triton build (latest) doesn't reflect the latest commit of the main branch. So I also include Triton sync to pin the specific commit. #c9a2344 which has been address by PR(triton-lang/triton#8498) Hopefully, this issue will be addressed in a newer build of CUDA, which I hope they will expose at a later release, including support for {sm_120, sm_121, and sm_121a}.

It is coming i think in torch 2.10 that will be released in 2-3 weeks

Thanks @johnnynunez. I think I am ok to wait for a couple weeks if that is the case. Also as you mentioned, v3.6.0 Triton release on the way. so, we should keep this PR ticket open for the future reference. Since we identified the bug and address by #31611. Thank you so much for your insight for all. cc: @njhill @johnnynunez @ProExpertProg

I have compiled all my stack with cuda 13.0 and pytorch 2.10. Maybe i am going to test in the following days
image

@mergify
Copy link
Copy Markdown

mergify bot commented Jan 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ohsono.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 8, 2026
@ohsono ohsono changed the title [Bugfix] Add SM 12.1 support [Bugfix] Add SM 12.1 support + Fix GPT-OSS Harmony garbled reasoning and HarmonyError crashes Mar 14, 2026
@mergify mergify bot removed the needs-rebase label Mar 14, 2026
@mergify
Copy link
Copy Markdown

mergify bot commented Mar 14, 2026

Hi @ohsono, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

1 similar comment
@mergify
Copy link
Copy Markdown

mergify bot commented Mar 14, 2026

Hi @ohsono, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@ohsono
Copy link
Copy Markdown
Author

ohsono commented Mar 14, 2026

We may need to add

diff --git a/csrc/torch_bindings.cpp b/csrc/torch_bindings.cpp
index c16b9c223..532bd03ef 100644
--- a/csrc/torch_bindings.cpp
+++ b/csrc/torch_bindings.cpp
@@ -239,10 +239,12 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {

   // Quantization ops
 #ifndef USE_ROCM
+#ifdef ENABLE_DSV3_FUSED_A_GEMM
   // DeepSeek V3 fused A GEMM (SM 9.0+, bf16 only, 1-16 tokens).
   ops.def(
       "dsv3_fused_a_gemm(Tensor! output, Tensor mat_a, Tensor mat_b) -> ()");
   ops.impl("dsv3_fused_a_gemm", torch::kCUDA, &dsv3_fused_a_gemm);
+#endif

I try to build in the main branch, but it fail to launch due to this.

I agree with this, in CMakelists.txt checks dsv3_fused_a_gemm requires SM9.0+,

When building using TORCH_CUDA_ARCH_LIST="8.9+PTX", dsv3_fused_a_gemm.cu is not compiled, will result undefined symbol linker error.

After patch this 8.9+PTX build works good

Thanks for reporting that. Hopefully, this issue has been addressed by the CUDA 13.1 release. But I am not sure.
https://docs.nvidia.com/cuda/archive/13.1.0/cuda-toolkit-release-notes/index.html

Otherwise, my recent commit, 570686b. It may bypass the linking error.

@mergify
Copy link
Copy Markdown

mergify bot commented Mar 14, 2026

Hi @ohsono, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

ohsono and others added 7 commits March 13, 2026 23:27
Fixes vllm-project#31588

This PR addresses a critical V1 engine bug affecting all users with async
scheduling enabled (default) and improves SM 12.1 (Blackwell GB10) compatibility.

**Problem:**
V1 engine crashes with `AttributeError: 'NoneType' object has no attribute
'sampled_token_ids'` on first inference when async scheduling is enabled.

**Root Cause:**
The step_with_batch_queue() method did not handle None return from
execute_model(), while step() method handles this correctly.

**Solution:**
Added None check in step_with_batch_queue() that mirrors the existing
pattern in step() method (vllm/v1/engine/core.py:464-470).

**Impact:**
- Fixes crash on first inference with default V1 engine configuration
- Affects all platforms using async scheduling (default)
- Zero performance impact - only adds missing error handling

1. **CUTLASS ops graceful fallback** (vllm/_custom_ops.py)
   - Added AttributeError handling for missing CUTLASS ops
   - Prevents crashes on incomplete builds
   - Provides clear warning messages

2. **Compilation ops compatibility** (vllm/compilation/matcher_utils.py)
   - Added hasattr checks for FP8 quant and silu_and_mul ops
   - Prevents crashes during compilation matching

3. **MXFP4 SM range extension** (vllm/model_executor/layers/quantization/mxfp4.py)
   - Extended Triton kernel support range to include SM 12.0-12.1
   - Enables proper fallback to Marlin backend on SM 12.1

4. **FP8 ops exception handling** (vllm/model_executor/layers/quantization/utils/w8a8_utils.py)
   - Wrapped CUTLASS_FP8_SUPPORTED initialization in try-except
   - Prevents module import crashes

5. **Triton dependency** (requirements/common.txt)
   - Added Triton main branch for SM 12.1 PTX support
   - Temporary until official release with sm_121a support

Tested on:
- GPU: NVIDIA GB10 (SM 12.1)
- CUDA: 13.0
- PyTorch: 2.9.1+cu130
- Model: openai/gpt-oss-120b with MXFP4 quantization

- vllm-project#28589 - V1 Engine fails on Blackwell GB10
- vllm-project#31128 - Add support for Blackwell SM121
- vllm-project#28621 - EP parallel + async scheduling crash

Signed-off-by: Hochan Son <ohsono@gmail.com>
… and SILU_MUL_OP guard

- Move try/except for CUTLASS ops into cutlass_fp8_supported() and
  cutlass_block_fp8_supported() per reviewer request
- Fix SM 12.x Triton range to exclude SM 11.x (split into <11.0 and >=12.0)
- Add SILU_MUL_OP is not None check in MatcherSiluAndMul.enabled

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Hochan Son <ohsono@gmail.com>
Apply ruff formatting fixes:
- Split long lines to meet 88 character limit
- Reformat multi-line function calls

Signed-off-by: Hochan Son <ohsono@gmail.com>
…D_A_GEMM

dsv3_fused_a_gemm requires SM 9.0+ and is conditionally compiled in
CMakeLists.txt, but torch_bindings.cpp unconditionally registers the op,
causing an undefined symbol linker error when building with
TORCH_CUDA_ARCH_LIST="8.9+PTX" or other sub-SM90 targets.

Add ENABLE_DSV3_FUSED_A_GEMM compile definition in CMake when the kernel
is built, and guard the ops.def in torch_bindings.cpp with #ifdef.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Hochan Son <ohsono@gmail.com>
Signed-off-by: Hochan Son <ohsono@gmail.com>
…port

- Exclude upstream shell scripts with pre-existing shellcheck issues
  (.buildkite/scripts/run-multi-node-test.sh,
   tests/v1/kv_connector/nixl_integration/spec_decode_acceptance_test.sh)
- Add SC2089/SC2090/SC2086/SC2046/SC2048/SC2206 to .shellcheckrc disabled list
- Add SM 12.1 (GB10) to CUDA_SUPPORTED_ARCHS and MARLIN_FP8_ARCHS

Signed-off-by: Hochan Son <ohsono@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ssages and HarmonyError crashes

Two related bugs caused GPT-OSS (gpt-oss-120b) to produce garbled output
(unicode garbage, foreign language text like Russian) during reasoning:

1. System messages passed via `request.messages` with `role: "system"` were
   fed directly to `parse_chat_inputs_to_harmony_messages`, which does not
   handle the system role in Harmony encoding format. The raw tokens caused
   the model to see unexpected token sequences during reasoning, producing
   unreadable output. Fix: extract system-role messages and pass their
   content as structured `instructions` to `get_system_message` /
   `get_developer_message` before encoding, in `OpenAIServingRender`.

2. `harmony_parser.process(token_id)` in the streaming generator had no
   `HarmonyError` handler. Any unexpected token raised an unhandled exception,
   crashing the stream. Fix: wrap the call site with `try/except HarmonyError`
   and return the partial result with a warning log.

Signed-off-by: Hochan Son <ohsono@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ohsono ohsono force-pushed the fix-v1-engine-sm121-support branch from cdafc01 to 3e2d8e1 Compare March 14, 2026 06:56
@ohsono ohsono requested review from hmellor and njhill as code owners March 14, 2026 06:56
@ohsono ohsono requested a review from ProExpertProg March 14, 2026 11:56
@mergify
Copy link
Copy Markdown

mergify bot commented Mar 21, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ohsono.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 21, 2026
Signed-off-by: Hochan Son <ohsono.email@gmail.com>
@mergify mergify bot removed the needs-rebase label Mar 26, 2026
@mergify
Copy link
Copy Markdown

mergify bot commented Mar 30, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ohsono.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ci/build frontend gpt-oss Related to GPT-OSS models needs-rebase v1

Projects

Status: To Triage

Development

Successfully merging this pull request may close these issues.

[Bug]: vLLM SM 12.1 (Blackwell GB10) V1 Engine Bug Report (Relates to: #28589, #31128, #28621, #27679)

8 participants