fix: register default global_scratch allocator on Blackwell GPUs #825
Merged: zhiyuan1i merged 4 commits into fla-org:main from ssubbotin:fix/blackwell-global-scratch-allocator on Apr 13, 2026
Commits:
- a668170 fix: register default global_scratch allocator on Blackwell GPUs (ssubbotin)
- 27229be style: fix autopep8 blank lines (ssubbotin)
- 84f040a fix: use current device for capability check (review feedback) (ssubbotin)
- c426461 refactor: use IS_NVIDIA_BLACKWELL constant, update to >= 10 (review f… (ssubbotin)
🧩 Analysis chain
🌐 Web query: `triton set_allocator callback signature alignment stream requirements`

💡 Result:

The `triton.set_allocator` function in the Triton Language (triton-lang/triton Python package) allows users to specify a custom memory allocator callback for GPU kernel memory allocations. This is particularly useful for custom memory management, such as using `torch.empty` for temporary buffers or integrating with custom pools.

Callback signature: `def alloc_fn(size: int, alignment: int, stream: Optional[int] = None) -> torch.Tensor`

- `size`: Number of bytes to allocate.
- `alignment`: Required alignment in bytes for the allocation (e.g., 128 or 256 bytes, common for shared memory or cache lines; Triton's internal shared memory allocator uses a 128-byte scratchAlignment).
- `stream`: Optional CUDA stream handle (int). If provided, the allocation should be stream-ordered (e.g., using cudaMallocAsync if implementing in C++/CUDA). Allocations must respect stream ordering to avoid synchronization issues.

Usage example: `def alloc_fn(size: int, alignment: int, stream: Optional[int]): return torch.empty(size, device="cuda", dtype=torch.int8)` (or pinned memory, a custom pool slice, etc.), registered with `triton.set_allocator(alloc_fn)`.

Requirements:

- Alignment: Must return a pointer aligned to the requested `alignment` bytes. Triton enforces alignments like 128 bytes for scratch buffers (lib/Analysis/Allocation.cpp). Misalignment can cause warp memory errors (e.g., Issue #5136).
- Stream: Optional, but should use stream-ordered allocation (e.g., `torch.empty` implicitly uses the current stream; for explicit control, use CUDA stream-ordered APIs like cudaMallocAsync from the CUDA 11.2+ docs). Supports the CUDA stream-ordered memory allocator for async alloc/free without global sync.
- Return: A `torch.Tensor` (device="cuda") whose `data_ptr` is the allocated memory. The tensor should be contiguous and match size/alignment.
- Thread-safe and efficient for kernel compilation/runtime.

This feature is recent (not in all released versions; build from source or use the latest PyTorch/Triton for availability, per Issues #6191, #155584). It is used in advanced kernels like grouped-GEMM tutorials and third-party libs (e.g., gemlite). No official docs page found; inferred from source code patterns, tutorials, and issues.
The `alignment` parameter must not be ignored in the allocator callback.

While ignoring `stream` is acceptable (since `torch.empty` implicitly uses the current stream), the `alignment` parameter is part of Triton's allocator contract and must be respected. Triton enforces alignment requirements (e.g., 128 bytes for scratch buffers), and misaligned allocations can cause warp memory errors. The current implementation using `torch.empty(size, ...)` does not guarantee the required alignment. This needs to be fixed by either using an alignment-aware allocation method or manually aligning the returned pointer.
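One way to read the "manually aligning the returned pointer" suggestion is the usual over-allocate-and-offset pattern. The sketch below shows only the address arithmetic, in plain Python so it runs without a GPU. The helper names `align_up`, `padded_size`, and `aligned_offset` are hypothetical and do not come from this PR or from Triton. The wiring into `torch.empty` shown in the trailing comment is an assumption about how it could be applied, not the change that was merged.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (value + alignment - 1) & ~(alignment - 1)


def padded_size(size: int, alignment: int) -> int:
    """Bytes to request so that an aligned region of `size` bytes always fits."""
    return size + alignment - 1


def aligned_offset(base_ptr: int, alignment: int) -> int:
    """Bytes to skip from base_ptr to reach the first aligned address."""
    return align_up(base_ptr, alignment) - base_ptr


# Hypothetical use inside an allocator callback (assumption, not the PR's code):
#
#   buf = torch.empty(padded_size(size, alignment), dtype=torch.int8, device="cuda")
#   off = aligned_offset(buf.data_ptr(), alignment)
#   return buf[off:off + size]   # int8 view whose data_ptr() is aligned
```

For instance, a base address of `0x1010` with a 128-byte alignment needs an offset of 112 bytes, and a 100-byte request is padded to 227 bytes so the aligned region always fits. Whether the padding is needed in practice depends on the alignment the underlying allocator already provides.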