Fix FP8 CUTLASS crash on SM12.1 (DGX Spark) #151

Open

ageev wants to merge 1 commit into eugr:main from ageev:fix-sm121-fp8-cutlass

Conversation


@ageev ageev commented Mar 29, 2026

Summary

  • Fixes "This kernel only supports sm120." crash when running FP8 models (e.g. Qwen3.5-35B-FP8) on DGX Spark (SM12.1)
  • Replaces enable_sm120_only with enable_sm120_family in two CUTLASS kernel files, allowing SM12.1 (__CUDA_ARCH__ == 1210) to pass the architecture guard
  • The change is applied as a one-line sed at build time, before final compilation
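The build-time patch can be sketched as below. The real targets are the two CUTLASS wrapper files in vLLM's csrc tree (not named here), so the filename is a stand-in:

```shell
# Illustrative only: "kernel.cuh" stands in for the two vLLM CUTLASS
# wrapper files; the substitution itself is the one the PR describes.
printf 'enable_sm120_only(fn);\n' > kernel.cuh
sed -i 's/enable_sm120_only/enable_sm120_family/g' kernel.cuh
cat kernel.cuh   # -> enable_sm120_family(fn);
```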

This is a temporary workaround until the upstream fix is merged: vllm-project/vllm#35568

Details

vLLM's FP8 CUTLASS kernel wrappers use enable_sm120_only, which checks __CUDA_ARCH__ == 1200 and traps on any other arch. On SM12.1 (GB10, DGX Spark), __CUDA_ARCH__ is 1210, triggering cudaErrorLaunchFailure on every FP8 GEMM call.

enable_sm120_family already exists in the codebase (>= 1200 && < 1300) and is used by the blockwise dispatch path. This fix applies it to the two remaining files that still use enable_sm120_only.

See: https://github.com/saifgithub/vllm-gb10-sm121

Rebuild notes

Requires --rebuild-vllm since the fix is compiled into the binary. If upgrading from a previous build, clean Docker build cache first:

docker builder prune
./build-and-copy.sh --rebuild-vllm [--tf5] [-t vllm-node-tf5]

Note: --tf5 and -t vllm-node-tf5 are optional — the container name depends on your setup (default is vllm-node).

Test plan

  • Verified on DGX Spark with Qwen3.5-35B-A3B-FP8 — model loads and serves without crash
  • Confirmed enable_sm120_family symbols present in compiled _C.abi3.so

Fixes "This kernel only supports sm120." error seen when launching
Qwen3.5-35B-FP8 model on the recent container build.

vLLM's FP8 CUTLASS kernels use enable_sm120_only, which checks
__CUDA_ARCH__ == 1200 and traps on SM12.1 where __CUDA_ARCH__ is 1210.
Replace with enable_sm120_family (>= 1200 && < 1300) which already
exists in the codebase.

See: https://github.com/saifgithub/vllm-gb10-sm121

Note: after applying this fix, a full rebuild with --rebuild-vllm is
required. If upgrading from a previous build, clean Docker build cache
first to ensure stale compiled objects are not reused:

  docker builder prune

(a more radical approach that also worked: docker system prune --all --volumes)

  ./build-and-copy.sh --rebuild-vllm --tf5 -t vllm-node-tf5
