[GSan] Implement shadow memory allocator#9478

Merged
peterbell10 merged 4 commits into main from pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d on Mar 19, 2026

Conversation

@peterbell10 peterbell10 commented Feb 16, 2026

Commits in this PR

  1. [GSan] Implement shadow memory allocator

    This implements an allocator that hooks into PyTorch's memory allocation
    API to map tensors into a GSan-managed virtual address space. We also create
    a corresponding shadow memory region that is mapped into the lower half of the
    reserved address space.

    Usage is like:

    ```python
    from triton.experimental import gsan
    allocator = gsan.get_allocator()
    pool = torch.cuda.MemPool(allocator.allocator())
    with torch.cuda.use_mem_pool(pool):
        t = torch.empty(4096, dtype=torch.uint8, device="cuda")
    ```
  2. Misc cleanup/fixes

  3. Sync stream before dealloc

  4. More misc changes

PR chain

  1. 👉 [GSan] Implement shadow memory allocator #9478 👈 YOU ARE HERE
  2. [GSan] Instrument tl.{load,store} #9568
  3. [GSan] Partially support TMA & cp.async ops #9699
  4. [GSan] Add symmetric memory API #9493
  5. [GSan] Support atomics #9700

@peterbell10 peterbell10 requested a review from ptillet as a code owner February 16, 2026 15:59
@peterbell10 peterbell10 marked this pull request as draft February 16, 2026 16:05
@peterbell10 peterbell10 force-pushed the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch 3 times, most recently from a9e330d to c7202a5 Compare February 17, 2026 23:36
@peterbell10 peterbell10 force-pushed the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch from c7202a5 to d580ded Compare February 25, 2026 14:13
@peterbell10 peterbell10 force-pushed the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch 2 times, most recently from 2cb71fb to 9a0d8b4 Compare March 4, 2026 15:20
@peterbell10 peterbell10 force-pushed the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch 2 times, most recently from 53ab7de to 8c215f9 Compare March 13, 2026 12:27
This implements an allocator that hooks into PyTorch's memory allocation
API to map tensors into a GSan-managed virtual address space. We also create
a corresponding shadow memory region that is mapped into the lower half of the
reserved address space.

Usage is like:
```python
from triton.experimental import gsan
allocator = gsan.get_allocator()
pool = torch.cuda.MemPool(allocator.allocator())
with torch.cuda.use_mem_pool(pool):
    t = torch.empty(4096, dtype=torch.uint8, device="cuda")
```

git-pr-chain: gsan_implement_shadow_memory_allocator_cc5d
@peterbell10 peterbell10 force-pushed the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch from 8c215f9 to 661198b Compare March 13, 2026 22:27
@peterbell10 peterbell10 marked this pull request as ready for review March 17, 2026 00:38
```cpp
// Place the thread state for each device at a fixed stride for ease of
// address calculation.
static constexpr uintptr_t kPerDeviceStateStride = 1ull << 30;
static constexpr uintptr_t kMaxGPUs = 16;
```
Contributor
What are the implications if we bump it to 32?

Contributor Author

It's more or less an arbitrary constant; I bump it to 32 in a later PR. Basically, we reserve an address space equal to kMaxGPUs * kPerDeviceStateStride, and this is where all the non-shadow-memory state lives. Because it's only virtual memory, this is cheap.
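For a sense of scale, here is a quick back-of-the-envelope check (a Python sketch using the constants from the quoted C++ snippet; the loop and variable names are illustrative, not from the PR):

```python
# Constants from the quoted C++ snippet.
kPerDeviceStateStride = 1 << 30  # 1 GiB of per-device state
GiB = 1 << 30

# Total virtual address space reserved for non-shadow state at each setting.
for k_max_gpus in (16, 32):
    reserved = k_max_gpus * kPerDeviceStateStride
    print(f"kMaxGPUs={k_max_gpus}: {reserved // GiB} GiB reserved")
```

Even at 32 GiB, this is a small fraction of a 64-bit process's virtual address space, which is why bumping the constant is cheap.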


```cpp
inline GSAN_HOST_DEVICE GlobalState *getGlobalState(ThreadState *threadState) {
  auto threadAddr = (uintptr_t)threadState;
  return (GlobalState *)(threadAddr & ~(kPerDeviceStateStride - 1));
}
```
Contributor

Just curious, how is getGlobalState used, and why do we need the masking here?

Contributor Author

The global state is actually all constants, and I keep it in a common place so we don't have to store a duplicate in each thread state or keep it around in registers:

```cpp
struct GlobalState {
  // Base address of gsan managed memory
  uintptr_t reserveBase;
  uintptr_t globalsBase;
  uint32_t rngSeed;
  thread_id_t numSms;
  thread_id_t numDevices;
  // numThreads = numSms * numDevices
  thread_id_t numThreads;
  uint16_t clockBufferSize;
};
```

The masking is a bit of a trick with the memory layout. Each GPU holds

GlobalState | ThreadState0 | ThreadState1 | ThreadState2 | ...

with the global state aligned to kPerDeviceStateStride. This means you can carry a single pointer to the thread state, and by masking it down to an aligned pointer you get a pointer to the globals. This saves either carrying two pointers around or doing an extra indirection.
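The masking trick can be modeled in a few lines of Python (the specific layout constants here, the GlobalState size, per-thread state size, and base address, are hypothetical; only the alignment-and-mask pattern mirrors the C++ above):

```python
kPerDeviceStateStride = 1 << 30   # global state is aligned to this stride
GLOBAL_STATE_SIZE = 64            # hypothetical sizeof(GlobalState)
THREAD_STATE_SIZE = 4096          # hypothetical per-thread state size

# Hypothetical per-device base address, aligned to kPerDeviceStateStride.
global_state_addr = 7 * kPerDeviceStateStride

def thread_state_addr(tid: int) -> int:
    # Thread states are laid out immediately after the global state.
    return global_state_addr + GLOBAL_STATE_SIZE + tid * THREAD_STATE_SIZE

def get_global_state(thread_addr: int) -> int:
    # Same as the C++ getGlobalState: mask the low bits down to the
    # stride-aligned base to recover the global state pointer.
    return thread_addr & ~(kPerDeviceStateStride - 1)

# Any thread state pointer in the device's region masks back to the base.
assert all(
    get_global_state(thread_state_addr(t)) == global_state_addr
    for t in range(8)
)
```

This works because every thread state lives within kPerDeviceStateStride bytes of the aligned base, so clearing the low bits of any such address lands exactly on the base.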


@pawelszczerbuk pawelszczerbuk left a comment


Left some questions just for my education. Good stuff!

@peterbell10 peterbell10 merged commit d06f067 into main Mar 19, 2026
17 of 18 checks passed
@peterbell10 peterbell10 deleted the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch March 19, 2026 10:49
raymondtay pushed a commit to raymondtay/triton that referenced this pull request Mar 22, 2026
jvican pushed a commit to jvican/triton that referenced this pull request Mar 27, 2026
plognjen pushed a commit to plognjen/triton that referenced this pull request Apr 14, 2026