[GSan] Implement shadow memory allocator#9478
Conversation
This implements an allocator that hooks into PyTorch's memory allocation
API to map tensors into a GSan-managed virtual address space. We also create
a corresponding shadow memory region that is mapped into the lower half of the
reserved address space.
Usage is like:
```python
from triton.experimental import gsan
allocator = gsan.get_allocator()
pool = torch.cuda.MemPool(allocator.allocator())
with torch.cuda.use_mem_pool(pool):
    t = torch.empty(4096, dtype=torch.uint8, device="cuda")
```
git-pr-chain: gsan_implement_shadow_memory_allocator_cc5d
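The exact shadow mapping is not spelled out in this description, but as a rough illustration of the "lower half of the reserved address space" idea, here is a minimal sketch with made-up constants (not the actual scheme GSan uses):

```cpp
#include <cstdint>

// Purely illustrative constants; the real reservation base/size and shadow
// granularity are defined by the GSan allocator and are not shown in this PR.
constexpr uintptr_t kReservedBase = 0x400000000000ull; // assumed base of the reserved VA range
constexpr uintptr_t kReservedSize = 1ull << 40;        // assumed size of the reserved VA range

// Does an address fall in the upper (data) half of the reservation?
inline bool inDataHalf(uintptr_t addr) {
  return addr >= kReservedBase + kReservedSize / 2 &&
         addr < kReservedBase + kReservedSize;
}

// If the shadow mirrored data byte-for-byte in the lower half, translating a
// data address to its shadow address would be a single subtraction.
inline uintptr_t shadowFor(uintptr_t dataAddr) {
  return dataAddr - kReservedSize / 2;
}
```

The actual translation GSan performs may differ (for example a compressed shadow rather than a byte-for-byte mirror); treat this only as a picture of the split layout.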
```cpp
// Place the thread state for each device at a fixed stride for ease of
// address calculation.
static constexpr uintptr_t kPerDeviceStateStride = 1ull << 30;
static constexpr uintptr_t kMaxGPUs = 16;
```
What are the implications if we bump it to 32?
It's more or less an arbitrary constant; I bump it to 32 in a later PR. Basically, we reserve a memory space equal to kMaxGPUs * kPerDeviceStateStride, and this is where all the non-shadow-memory state lives. Because it's only virtual memory, this is cheap.
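A minimal sketch of that reservation math, using the constants quoted from the diff above (the reservation mechanism itself is not shown here):

```cpp
#include <cstdint>

// Constants quoted from the diff above.
static constexpr uintptr_t kPerDeviceStateStride = 1ull << 30; // 1 GiB window per device
static constexpr uintptr_t kMaxGPUs = 16;

// Total virtual address space reserved for the non-shadow state: 16 GiB at
// kMaxGPUs = 16, and 32 GiB at kMaxGPUs = 32. Only virtual address space is
// reserved up front, so the cost of bumping the constant is negligible.
static constexpr uintptr_t kStateReservation = kMaxGPUs * kPerDeviceStateStride;
static_assert(kStateReservation == (1ull << 34), "16 GiB of reserved VA space");
```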
```cpp
inline GSAN_HOST_DEVICE GlobalState *getGlobalState(ThreadState *threadState) {
  auto threadAddr = (uintptr_t)threadState;
  return (GlobalState *)(threadAddr & ~(kPerDeviceStateStride - 1));
}
```
Just curious, how is getGlobalState used, and why do we need the masking here?
The global state values are actually all constants, and I keep them in a common place so we don't have to store duplicates in each thread state or keep them around in registers.
triton/python/triton/experimental/gsan/src/GSan.h
Lines 50 to 63 in 661198b
The masking is a bit of a trick with the memory layout. Each GPU holds
GlobalState | ThreadState0 | ThreadState1 | ThreadState2 | ...
with the global state aligned to kPerDeviceStateStride. This means you can have a single pointer to the thread state, and by masking down to an aligned pointer you get a pointer to the globals. This saves either carrying 2 pointers around, or doing an extra indirection.
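A standalone sketch of that masking trick, with made-up base and offset values purely to show the arithmetic:

```cpp
#include <cassert>
#include <cstdint>

constexpr uintptr_t kPerDeviceStateStride = 1ull << 30;

// Illustration only: an aligned base for one device's window and a thread
// state somewhere inside that window (the offset here is made up).
constexpr uintptr_t deviceBase = 4ull * kPerDeviceStateStride; // stride-aligned GlobalState address
constexpr uintptr_t threadStateAddr = deviceBase + 64 * 1024;  // a ThreadState inside the window

int main() {
  // Masking off the low bits of any pointer inside the window recovers the
  // aligned base, i.e. the GlobalState, without carrying a second pointer or
  // doing an extra indirection.
  uintptr_t globalStateAddr = threadStateAddr & ~(kPerDeviceStateStride - 1);
  assert(globalStateAddr == deviceBase);
  return 0;
}
```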
pawelszczerbuk left a comment
Left some questions just for my education. Good stuff!
Commits in this PR
[GSan] Implement shadow memory allocator
Misc cleanup/fixes
Sync stream before dealloc
More misc changes
PR chain