[GSan] Implement shadow memory allocator#9478

Merged
peterbell10 merged 4 commits into main from pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d on Mar 19, 2026

Conversation

@peterbell10 peterbell10 commented Feb 16, 2026

Commits in this PR

  1. [GSan] Implement shadow memory allocator

    This implements an allocator that hooks into PyTorch's memory allocation
    API to map tensors into a GSan-managed virtual address space. We also create
    a corresponding shadow memory region that is mapped into the lower half of the
    reserved address space.

    Usage is like:

    ```python
    from triton.experimental import gsan
    allocator = gsan.get_allocator()
    pool = torch.cuda.MemPool(allocator.allocator())
    with torch.cuda.use_mem_pool(pool):
        t = torch.empty(4096, dtype=torch.uint8, device="cuda")
    ```
  2. Misc cleanup/fixes

  3. Sync stream before dealloc

  4. More misc changes

PR chain

  1. 👉 [GSan] Implement shadow memory allocator #9478 👈 YOU ARE HERE
  2. [GSan] Instrument tl.{load,store} #9568
  3. [GSan] Partially support TMA & cp.async ops #9699
  4. [GSan] Add symmetric memory API #9493
  5. [GSan] Support atomics #9700

@peterbell10 peterbell10 requested a review from ptillet as a code owner February 16, 2026 15:59
@peterbell10 peterbell10 marked this pull request as draft February 16, 2026 16:05
@peterbell10 peterbell10 force-pushed the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch 3 times, most recently from a9e330d to c7202a5 Compare February 17, 2026 23:36
@peterbell10 peterbell10 force-pushed the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch from c7202a5 to d580ded Compare February 25, 2026 14:13
@peterbell10 peterbell10 force-pushed the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch 2 times, most recently from 2cb71fb to 9a0d8b4 Compare March 4, 2026 15:20
@peterbell10 peterbell10 force-pushed the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch 2 times, most recently from 53ab7de to 8c215f9 Compare March 13, 2026 12:27
This implements an allocator that hooks into PyTorch's memory allocation
API to map tensors into a GSan-managed virtual address space. We also create
a corresponding shadow memory region that is mapped into the lower half of the
reserved address space.

Usage is like:
```python
from triton.experimental import gsan
allocator = gsan.get_allocator()
pool = torch.cuda.MemPool(allocator.allocator())
with torch.cuda.use_mem_pool(pool):
    t = torch.empty(4096, dtype=torch.uint8, device="cuda")
```

git-pr-chain: gsan_implement_shadow_memory_allocator_cc5d
@peterbell10 peterbell10 force-pushed the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch from 8c215f9 to 661198b Compare March 13, 2026 22:27
@peterbell10 peterbell10 marked this pull request as ready for review March 17, 2026 00:38
```cpp
// Place the thread state for each device at a fixed stride for ease of
// address calculation.
static constexpr uintptr_t kPerDeviceStateStride = 1ull << 30;
static constexpr uintptr_t kMaxGPUs = 16;
```
Contributor
What are the implications if we bump it to 32?

Contributor Author

It's more or less an arbitrary constant; I bump it to 32 in a later PR. Basically, we reserve an address space equal to kMaxGPUs * kPerDeviceStateStride, and this is where all the non-shadow-memory state lives. Because it's only virtual memory, this is cheap.
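For a sense of scale, here is a quick back-of-the-envelope check (a Python sketch using the constants from the quoted C++ snippet; the loop and variable names are illustrative, not from the PR):

```python
# Constants from the quoted C++ snippet.
kPerDeviceStateStride = 1 << 30  # 1 GiB of per-device state
GiB = 1 << 30

# Total virtual address space reserved for non-shadow state at each setting.
for k_max_gpus in (16, 32):
    reserved = k_max_gpus * kPerDeviceStateStride
    print(f"kMaxGPUs={k_max_gpus}: {reserved // GiB} GiB reserved")
```

Even at 32 GiB, this is a small fraction of a 64-bit process's virtual address space, which is why bumping the constant is cheap.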


```cpp
inline GSAN_HOST_DEVICE GlobalState *getGlobalState(ThreadState *threadState) {
  auto threadAddr = (uintptr_t)threadState;
  return (GlobalState *)(threadAddr & ~(kPerDeviceStateStride - 1));
}
```
Contributor

Just curious, how is getGlobalState used, and why do we need the masking here?

Contributor Author

The global state is actually all constants, and I keep it in a common place so we don't have to store a duplicate in each thread state or keep it around in registers:

```cpp
struct GlobalState {
  // Base address of gsan managed memory
  uintptr_t reserveBase;
  uintptr_t globalsBase;
  uint32_t rngSeed;
  thread_id_t numSms;
  thread_id_t numDevices;
  // numThreads = numSms * numDevices
  thread_id_t numThreads;
  uint16_t clockBufferSize;
};
```

The masking is a bit of a trick with the memory layout. Each GPU holds

GlobalState | ThreadState0 | ThreadState1 | ThreadState2 | ...

with the global state aligned to kPerDeviceStateStride. This means you can carry a single pointer to the thread state, and by masking it down to an aligned pointer you get a pointer to the globals. This saves either carrying two pointers around or doing an extra indirection.
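The masking trick can be modeled in a few lines of Python (the specific layout constants here, the GlobalState size, per-thread state size, and base address, are hypothetical; only the alignment-and-mask pattern mirrors the C++ above):

```python
kPerDeviceStateStride = 1 << 30   # global state is aligned to this stride
GLOBAL_STATE_SIZE = 64            # hypothetical sizeof(GlobalState)
THREAD_STATE_SIZE = 4096          # hypothetical per-thread state size

# Hypothetical per-device base address, aligned to kPerDeviceStateStride.
global_state_addr = 7 * kPerDeviceStateStride

def thread_state_addr(tid: int) -> int:
    # Thread states are laid out immediately after the global state.
    return global_state_addr + GLOBAL_STATE_SIZE + tid * THREAD_STATE_SIZE

def get_global_state(thread_addr: int) -> int:
    # Same as the C++ getGlobalState: mask the low bits down to the
    # stride-aligned base to recover the global state pointer.
    return thread_addr & ~(kPerDeviceStateStride - 1)

# Any thread state pointer in the device's region masks back to the base.
assert all(
    get_global_state(thread_state_addr(t)) == global_state_addr
    for t in range(8)
)
```

This works because every thread state lives within kPerDeviceStateStride bytes of the aligned base, so clearing the low bits of any such address lands exactly on the base.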


@pawelszczerbuk pawelszczerbuk left a comment


Left some questions just for my education. Good stuff!

@peterbell10 peterbell10 merged commit d06f067 into main Mar 19, 2026
17 of 18 checks passed
@peterbell10 peterbell10 deleted the pb/pr-chain/gsan_implement_shadow_memory_allocator_cc5d branch March 19, 2026 10:49
raymondtay pushed a commit to raymondtay/triton that referenced this pull request Mar 22, 2026
jvican pushed a commit to jvican/triton that referenced this pull request Mar 27, 2026
plognjen pushed a commit to plognjen/triton that referenced this pull request Apr 14, 2026