[Bugfix] Add SM 12.1 support + Fix GPT-OSS Harmony garbled reasoning and HarmonyError crashes#31607
[Bugfix] Add SM 12.1 support + Fix GPT-OSS Harmony garbled reasoning and HarmonyError crashes#31607ohsono wants to merge 12 commits intovllm-project:mainfrom
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
|
Hi @ohsono, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
There was a problem hiding this comment.
Code Review
This pull request introduces a critical bugfix for the V1 engine's async scheduling and adds support for SM 12.1 (Blackwell). The changes are well-structured and address the described issues effectively. The core bugfix in vllm/v1/engine/core.py correctly handles a None return, preventing a crash. The compatibility improvements for SM 12.1 are implemented defensively using try-except and hasattr checks, which is great for robustness. My main feedback is regarding the new dependency on the Triton main branch, which should be pinned to a specific commit for build stability.
8a2dcf0 to
91e9533
Compare
|
Thanks @ohsono . But this should only ever happen as a secondary error after the execute_model method already raised an exception. Please share the whole log, you should see an earlier root cause error/traceback. I agree that this is still a bug though since the secondary error is misleading. I've opened #31611 to address this. Please remove the |
Thanks, @njhill, for the suggestion. I've updated accordingly. Although I didn't think to save the log, I found some error messages from |
|
Thanks @ohsono .. yes so like I was explaining, the actual issue here is this error: The other side-effect message is more of a red herring and will be addressed by #31611. |
Thanks for your input on PTXAS. I actually downgrade the PTXAS and apply However, there is a much simpler solution for this, bypassing it altogether. The
Plus, I saw the v3.5.1 Triton build (latest) doesn't reflect the latest commit of the main branch. So I also include Triton sync to pin the specific commit. #c9a2344 which has been address by PR(triton-lang/triton#8498) |
|
cc @mgoin |
|
Each pytorch release pins to a specific triton version, so we want to avoid affecting that pin. We may just have to wait for the next pytorch release to hopefully have the triton fix included |
|
triton 3.6.0 triton-lang/triton#8862 |
| CUTLASS_BLOCK_FP8_SUPPORTED = cutlass_block_fp8_supported() | ||
| # Handle case where CUTLASS ops might not be available (build issue) | ||
| try: | ||
| CUTLASS_FP8_SUPPORTED = cutlass_fp8_supported() |
There was a problem hiding this comment.
Let's incorporate this logic into the function itself
| def cutlass_scaled_mm_supports_fp4(cuda_device_capability: int) -> bool: | ||
| return torch.ops._C.cutlass_scaled_mm_supports_fp4(cuda_device_capability) | ||
| try: | ||
| return torch.ops._C.cutlass_scaled_mm_supports_fp4(cuda_device_capability) |
There was a problem hiding this comment.
Should we build these functions for cuda 12.1?
There was a problem hiding this comment.
Thanks @ProExpertProg. Yes. I believe this has been addressed as of CUDA 13.1. I haven't update to my local machine yet.
It is coming i think in torch 2.10 that will be released in 2-3 weeks |
Thanks @johnnynunez. I think I am ok to wait for a couple weeks if that is the case. Also as you mentioned, v3.6.0 Triton release on the way. so, we should keep this PR ticket open for the future reference. Since we identified the bug and address by #31611. |
I have compiled all my stack with cuda 13.0 and pytorch 2.10. Maybe i am going to test in the following days |
|
This pull request has merge conflicts that must be resolved before it can be |
|
Hi @ohsono, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
1 similar comment
|
Hi @ohsono, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Thanks for reporting that. Hopefully, this issue has been addressed by the CUDA 13.1 release. But I am not sure. Otherwise, my recent commit, 570686b. It may bypass the linking error. |
|
Hi @ohsono, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Fixes vllm-project#31588 This PR addresses a critical V1 engine bug affecting all users with async scheduling enabled (default) and improves SM 12.1 (Blackwell GB10) compatibility. **Problem:** V1 engine crashes with `AttributeError: 'NoneType' object has no attribute 'sampled_token_ids'` on first inference when async scheduling is enabled. **Root Cause:** The step_with_batch_queue() method did not handle None return from execute_model(), while step() method handles this correctly. **Solution:** Added None check in step_with_batch_queue() that mirrors the existing pattern in step() method (vllm/v1/engine/core.py:464-470). **Impact:** - Fixes crash on first inference with default V1 engine configuration - Affects all platforms using async scheduling (default) - Zero performance impact - only adds missing error handling 1. **CUTLASS ops graceful fallback** (vllm/_custom_ops.py) - Added AttributeError handling for missing CUTLASS ops - Prevents crashes on incomplete builds - Provides clear warning messages 2. **Compilation ops compatibility** (vllm/compilation/matcher_utils.py) - Added hasattr checks for FP8 quant and silu_and_mul ops - Prevents crashes during compilation matching 3. **MXFP4 SM range extension** (vllm/model_executor/layers/quantization/mxfp4.py) - Extended Triton kernel support range to include SM 12.0-12.1 - Enables proper fallback to Marlin backend on SM 12.1 4. **FP8 ops exception handling** (vllm/model_executor/layers/quantization/utils/w8a8_utils.py) - Wrapped CUTLASS_FP8_SUPPORTED initialization in try-except - Prevents module import crashes 5. **Triton dependency** (requirements/common.txt) - Added Triton main branch for SM 12.1 PTX support - Temporary until official release with sm_121a support Tested on: - GPU: NVIDIA GB10 (SM 12.1) - CUDA: 13.0 - PyTorch: 2.9.1+cu130 - Model: openai/gpt-oss-120b with MXFP4 quantization - vllm-project#28589 - V1 Engine fails on Blackwell GB10 - vllm-project#31128 - Add support for Blackwell SM121 - vllm-project#28621 - EP parallel + async scheduling crash Signed-off-by: Hochan Son <ohsono@gmail.com>
… and SILU_MUL_OP guard - Move try/except for CUTLASS ops into cutlass_fp8_supported() and cutlass_block_fp8_supported() per reviewer request - Fix SM 12.x Triton range to exclude SM 11.x (split into <11.0 and >=12.0) - Add SILU_MUL_OP is not None check in MatcherSiluAndMul.enabled Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hochan Son <ohsono@gmail.com>
Apply ruff formatting fixes: - Split long lines to meet 88 character limit - Reformat multi-line function calls Signed-off-by: Hochan Son <ohsono@gmail.com>
…D_A_GEMM dsv3_fused_a_gemm requires SM 9.0+ and is conditionally compiled in CMakeLists.txt, but torch_bindings.cpp unconditionally registers the op, causing an undefined symbol linker error when building with TORCH_CUDA_ARCH_LIST="8.9+PTX" or other sub-SM90 targets. Add ENABLE_DSV3_FUSED_A_GEMM compile definition in CMake when the kernel is built, and guard the ops.def in torch_bindings.cpp with #ifdef. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hochan Son <ohsono@gmail.com>
Signed-off-by: Hochan Son <ohsono@gmail.com>
…port - Exclude upstream shell scripts with pre-existing shellcheck issues (.buildkite/scripts/run-multi-node-test.sh, tests/v1/kv_connector/nixl_integration/spec_decode_acceptance_test.sh) - Add SC2089/SC2090/SC2086/SC2046/SC2048/SC2206 to .shellcheckrc disabled list - Add SM 12.1 (GB10) to CUDA_SUPPORTED_ARCHS and MARLIN_FP8_ARCHS Signed-off-by: Hochan Son <ohsono@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ssages and HarmonyError crashes Two related bugs caused GPT-OSS (gpt-oss-120b) to produce garbled output (unicode garbage, foreign language text like Russian) during reasoning: 1. System messages passed via `request.messages` with `role: "system"` were fed directly to `parse_chat_inputs_to_harmony_messages`, which does not handle the system role in Harmony encoding format. The raw tokens caused the model to see unexpected token sequences during reasoning, producing unreadable output. Fix: extract system-role messages and pass their content as structured `instructions` to `get_system_message` / `get_developer_message` before encoding, in `OpenAIServingRender`. 2. `harmony_parser.process(token_id)` in the streaming generator had no `HarmonyError` handler. Any unexpected token raised an unhandled exception, crashing the stream. Fix: wrap the call site with `try/except HarmonyError` and return the partial result with a warning log. Signed-off-by: Hochan Son <ohsono@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cdafc01 to
3e2d8e1
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Hochan Son <ohsono.email@gmail.com>
|
This pull request has merge conflicts that must be resolved before it can be |

Summary
This PR has two parts:
Part 1 — SM 12.1 (Blackwell GB10) Compatibility (original)
Fixes #31588
vllm/_custom_ops.py):AttributeErrorhandling for missing CUTLASS ops on incomplete buildsvllm/compilation/matcher_utils.py):hasattrchecks for FP8 quant and silu_and_mul opsPart 2 — GPT-OSS Harmony serving fixes (new)
Partially addresses #37030
Fixes two bugs causing
openai/gpt-oss-120bto produce garbled, hallucinated, or null reasoning output. Not a duplicate of #34951, #30247, or #30482 (those address developer block formatting, model identity, and chat template — different issues).Root Cause (Part 2)
Bug 1 — System message corrupts Harmony token sequence → hallucinated reasoning
_make_request_with_harmonypassed all messages includingrole: "system"directly toparse_chat_inputs_to_harmony_messages. The Harmony encoder does not handle raw system-role messages — it expects them as structuredSystemContent. The raw tokens corrupted the Harmony context, causing the model to produce off-topic or foreign-language output.Example (before fix): Request
"What is 2+2?"with system"You are a helpful assistant":Bug 2 —
HarmonyErrorunhandled in streaming and non-streaming pathsharmony_parser.process(token_id)had noHarmonyErrorhandler. Any unexpected Harmony token raised an unhandled exception, crashing the stream or silently returning null content with no diagnostic. Particularly impactful on SM121/Blackwell with MXFP4 quantization (see #37030).Changes
vllm/entrypoints/openai/chat_completion/serving.py_make_request_with_harmony: scanrequest.messagesforrole: "system"entries, extract and concatenate their text content (handling bothstrandlist[{type, text}]formats), pass as structuredinstructionstoget_system_message()/get_developer_message(). Feed only non-system messages toparse_chat_inputs_to_harmony_messages.HarmonyErrorhandling in streaming generator: wrapharmony_parser.process(token_id)intry/except HarmonyError; log a warning and break to return partial result.vllm/entrypoints/openai/parser/harmony_utils.pyHarmonyErrorhandling inparse_output_into_messages: same pattern — catch, log warning, break with partial result.Test Plan
TIKTOKEN_RS_CACHE_DIR=/path/to/tiktoken_encodings \ pytest tests/entrypoints/openai/parser/test_harmony_utils.py -v # Result: 57/57 passedLive server validation on
gpt-oss-120b(SM121, MXFP4):HarmonyErrornow caught and logged (harmony_utils.py:342) instead of crashingstrandlist[{type, text}]content formatsNote: Null
content/reasoningwithfinish_reason: lengthis a pre-existing SM121 MXFP4 Marlin kernel issue (wrong first Harmony token generated) tracked separately in #37030. Our changes correctly catch this case gracefully.AI assistance used: Claude assisted with debugging, root cause analysis, and code authoring. All changes reviewed, tested, and validated by the human submitter on live hardware.
Related