Fixes the "This kernel only supports sm120." error seen when launching the Qwen3.5-35B-FP8 model on the recent container build. vLLM's FP8 CUTLASS kernels use `enable_sm120_only`, which checks `__CUDA_ARCH__ == 1200` and traps on SM12.1, where `__CUDA_ARCH__` is `1210`. Replace it with `enable_sm120_family` (`>= 1200 && < 1300`), which already exists in the codebase.

See: https://github.com/saifgithub/vllm-gb10-sm121

Note: after applying this fix, a full rebuild with `--rebuild-vllm` is required. If upgrading from a previous build, clean the Docker build cache first so that stale compiled objects are not reused:

```
docker builder prune
```

(a more radical approach that also worked: `docker system prune --all --volumes`)

```
./build-and-copy.sh --rebuild-vllm --tf5 -t vllm-node-tf5
```
Summary
- Fixes the "This kernel only supports sm120." crash when running FP8 models (e.g. Qwen3.5-35B-FP8) on DGX Spark (SM12.1)
- Replaces `enable_sm120_only` with `enable_sm120_family` in two CUTLASS kernel files, allowing SM12.1 (`__CUDA_ARCH__ == 1210`) to pass the architecture guard
- This is a temporary workaround until the upstream fix is merged: vllm-project/vllm#35568
Details
vLLM's FP8 CUTLASS kernel wrappers use `enable_sm120_only`, which checks `__CUDA_ARCH__ == 1200` and traps on any other arch. On SM12.1 (GB10, DGX Spark), `__CUDA_ARCH__` is `1210`, triggering `cudaErrorLaunchFailure` on every FP8 GEMM call. `enable_sm120_family` already exists in the codebase (`>= 1200 && < 1300`) and is used by the blockwise dispatch path. This fix applies it to the two remaining files that still use `enable_sm120_only`.

See: https://github.com/saifgithub/vllm-gb10-sm121
Rebuild notes
Requires `--rebuild-vllm`, since the fix is compiled into the binary. If upgrading from a previous build, clean the Docker build cache first (`docker builder prune`).

Note: `--tf5` and `-t vllm-node-tf5` are optional; the container name depends on your setup (default is `vllm-node`).

Test plan
- `enable_sm120_family` symbols present in compiled `_C.abi3.so`