Add default global_scratch allocator fallback for Blackwell SM 12.0 #10002
ssubbotin wants to merge 1 commit into triton-lang:main from
Conversation
On Blackwell (SM 12.0+), Triton kernels may require global_scratch memory for cooperative operations. When no explicit allocator is configured via triton.set_allocator(), the NullAllocator raises RuntimeError, crashing any kernel that uses global_scratch. This adds allocate_default_global_scratch() to GPUDriver — mirroring the existing allocate_default_profile_scratch() pattern — and uses it as a fallback in both NVIDIA and AMD backend launchers when NullAllocator is detected. Fixes kernel crashes on RTX PRO 6000, RTX 5090, and other Blackwell consumer GPUs when running Triton kernels that use global_scratch (e.g., FLA solve_tril in vLLM for MoE+Mamba models).
Performance impact of this fix

Without this fix, the standard workaround for Blackwell costs 12x performance (measured with Qwen3-Coder-Next 80B, AWQ 4-bit, on RTX PRO 6000 Blackwell). This fix restores full CUDA graph + torch.compile performance on Blackwell by providing a default global_scratch allocator. Tested on RTX PRO 6000 (SM 12.0, CUDA 13.1, Triton 3.6.0) running vLLM with FLA kernels for MoE+Mamba models.
Can you explain a bit more with an example?

More specifically, why can't you use triton.set_allocator()?
@Jokeren Sure — here's the concrete example and a standalone reproduce script. Example: FLA solve_tril.
@Jokeren Re: why not use triton.set_allocator():

1. Library code can't call triton.set_allocator() safely. The crash happens inside vLLM's FLA kernel library, which is third-party code consumed by vLLM. If FLA calls set_allocator(), it silently replaces whatever allocator the application configured. Every library that uses Triton kernels with global_scratch faces the same conflict.

2. vLLM spawns the EngineCore as a separate process, so an allocator registered in the parent does not carry over. We tried patching via the application side, but the setting does not reach the spawned process.

The fix in this PR mirrors the existing allocate_default_profile_scratch() pattern.
does the kernel in question use make_tensor_descriptor? This is the only case I am aware of that needs scratch space allocation from the user: TMA with descriptors created inside the kernel. The right solution is to create descriptors on the host via the TensorDescriptor constructor. I don't think this is an sm120-specific issue.
ThomasRaoux
left a comment
It should be the user's responsibility to attach an allocator; we don't want the compiler to do allocation under the hood.
+1
@masahi @ThomasRaoux Thank you for the review. I want to clarify an important detail: the crash happens WITHOUT TMA enabled.

To verify:

```
# FLA_USE_TMA is NOT set (defaults to '0')
# The kernel path is the non-TMA branch: tl.make_block_ptr
python reproduce_blackwell_deadlock.py
# → CRASH: Kernel requires a runtime memory allocation, but no allocator was set.
```

So the suggestion to "use host-side TensorDescriptor" doesn't apply here; the kernel isn't using TMA at all. This is why we believe a default allocator fallback is the right fix: user code shouldn't need to know that the compiler decided to use scratch space internally. Happy to investigate further which compiler pass introduces the global_scratch requirement.
On Blackwell (SM 10.0+), the Triton compiler emits global_scratch memory for autotuned kernels even when TMA is not used (FLA_USE_TMA=0). Without an allocator registered, this causes NullAllocator crashes during kernel autotuning, which corrupts CUDA synchronization state and leads to process deadlocks. The existing allocator registration only runs when IS_TMA_SUPPORTED is True (requires FLA_USE_TMA=1). This change also registers the allocator on Blackwell when TMA is disabled, since the compiler still needs scratch space for other purposes on SM 10.0+. Fixes deadlocks when running MoE+Mamba models (Qwen3-Coder-Next, Qwen3.5) on Blackwell GPUs via vLLM. See: triton-lang/triton#10002
For reference, we submitted a workaround to FLA: fla-org/flash-linear-attention#825. That PR registers a default allocator on Blackwell regardless of FLA_USE_TMA. However, we still believe Triton should provide a default allocator (or at least a better error path) when the compiler decides to use global_scratch on its own. Happy to keep this PR open or close it depending on your preference.
We shouldn't provide a default allocator. I still suspect there's something wrong. Are you able to provide a single script reproducer? Thanks
Looks like reproduce_blackwell_deadlock.py has a lot of dependencies |
Running your script with
So your reproducer uses the FLA kernel vendored in vllm. The decision to use TMA in vllm seems to have changed last week: vllm-project/vllm#38981. So if you are using vllm prior to that commit, this explains what's happening.
@masahi @Jokeren Thank you for digging into this; you were right. Our vLLM Docker image was built before vllm-project/vllm#38981 (merged April 4), which aligned vLLM's vendored FLA copy with upstream's TMA defaults. Updating to a vLLM build that includes #38981 resolves the issue, since the non-TMA path doesn't need global_scratch.

I apologize for the confusion; I should have verified more carefully which code path was active before asserting it wasn't TMA-related. Happy to close this PR. We also submitted fla-org/flash-linear-attention#825, which registers a default allocator on Blackwell as defense-in-depth, but the real fix was already in vLLM #38981. Thank you for your patience and the pointer to the vLLM change.
Closing — the root cause was our vLLM image using a pre-#38981 vendored FLA copy that unconditionally enabled TMA on Blackwell. Updating vLLM resolves the issue. Thank you for the review.
* fix: register default global_scratch allocator on Blackwell GPUs

  On Blackwell (SM 10.0+), the Triton compiler emits global_scratch memory for autotuned kernels even when TMA is not used (FLA_USE_TMA=0). Without an allocator registered, this causes NullAllocator crashes during kernel autotuning, which corrupts CUDA synchronization state and leads to process deadlocks. The existing allocator registration only runs when IS_TMA_SUPPORTED is True (requires FLA_USE_TMA=1). This change also registers the allocator on Blackwell when TMA is disabled, since the compiler still needs scratch space for other purposes on SM 10.0+. Fixes deadlocks when running MoE+Mamba models (Qwen3-Coder-Next, Qwen3.5) on Blackwell GPUs via vLLM. See: triton-lang/triton#10002

* style: fix autopep8 blank lines

* fix: use current device for capability check (review feedback)

* refactor: use IS_NVIDIA_BLACKWELL constant, update to >= 10 (review feedback)

  - Use shared IS_NVIDIA_BLACKWELL constant instead of inline capability check
  - Change IS_NVIDIA_BLACKWELL from == 10 to >= 10 for forward compatibility with future NVIDIA architectures beyond Blackwell
  - Addresses CodeRabbit and Gemini review feedback
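The capability-check refactor in the last commit amounts to the following sketch. Here get_device_capability is a stand-in for torch.cuda.get_device_capability() on the current device, and the constant name comes from the commit message:

```python
def get_device_capability(device=None):
    # Stand-in for torch.cuda.get_device_capability(device);
    # pretend the current device is SM 12.0 (consumer Blackwell).
    return (12, 0)

# Inline check the review asked to replace:
#     get_device_capability()[0] == 10
# would be False on SM 12.0 hardware. The shared constant is widened to
# >= 10 so architectures after Blackwell are covered too:
IS_NVIDIA_BLACKWELL = get_device_capability(device=None)[0] >= 10

print(IS_NVIDIA_BLACKWELL)  # → True
```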
Summary
On Blackwell (SM 12.0+), Triton kernels may require global_scratch memory for cooperative operations. When no explicit allocator is configured via triton.set_allocator(), the NullAllocator raises RuntimeError, crashing any kernel that uses global_scratch.

This adds allocate_default_global_scratch() to GPUDriver — mirroring the existing allocate_default_profile_scratch() pattern — and uses it as a fallback in both NVIDIA and AMD backend launchers when NullAllocator is detected.

Problem
This crashes on consumer Blackwell GPUs (RTX PRO 6000, RTX 5090, RTX 5080) when running Triton kernels that use global_scratch — e.g., FLA solve_tril in vLLM for MoE+Mamba models like Qwen3-Coder-Next.

On pre-Blackwell GPUs, these kernels don't use global_scratch, so the issue doesn't surface.

Fix
- Adds allocate_default_global_scratch() to GPUDriver (mirrors allocate_default_profile_scratch())
- In allocate_scratch() in both NVIDIA and AMD launchers, falls back to the new method when NullAllocator is the current allocator
- Uses torch.empty() for allocation, consistent with the existing profile scratch pattern

Testing
Verified on RTX PRO 6000 Blackwell (SM 12.0, CUDA 13.1):
- Without the fix: RuntimeError on any kernel using global_scratch
- With the fix: FLA chunk_gated_delta_rule compiles and runs correctly (27.4s first compile, correct output)

Related Issues
- solve_tril