!test
In the PR's current form, doesn't it make the Deallocate operation obsolete? Also, since the issue is only for large matrices, could setting
wujingyue
left a comment
Do you have a minimal repro so others can help debug? It's surprising at least to see the caching allocator not reaching a steady state with a seemingly simple pattern shown in the trace.
I believe it's more than that. With this PR, all allocations are persistent even across executions. I suspect it's going to be a problem with dynamic shapes. @samnordmann, to unblock your experiment, can you create an opt-in HostIrEvaluator parameter to enable persistent allocation? I'd also consider
!test |
Right, and that's effectively the behavior we want in many cases, but today it relies on torch's allocation cache.

You're right that this is an issue with dynamic shapes...

I added an opt-in param in the last commit. Since I'm not able to reproduce the allocation cache issue in the benchmark for now, let me close this PR in the meantime.
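The opt-in persistent allocation being discussed can be sketched as a small cache held by the evaluator and keyed by the Allocate node. This is a hypothetical illustration, not the actual HostIrEvaluator API: `AllocationCache`, `FakeAllocator`, and `get_or_allocate` are made-up names, and keying on `(node_name, size)` shows why dynamic shapes would defeat such a naive cache.

```python
# Hypothetical sketch of a persistent allocation cache at the evaluator
# level; names are illustrative, not the actual HostIrEvaluator or torch API.

class FakeAllocator:
    """Stand-in for the underlying (torch) allocator."""
    def __init__(self):
        self.num_allocations = 0

    def allocate(self, size):
        self.num_allocations += 1
        return bytearray(size)  # stand-in for a device buffer


class AllocationCache:
    """Caches buffers per Allocate node so repeated executions reuse them."""
    def __init__(self, allocator, enabled=True):
        self.allocator = allocator
        self.enabled = enabled  # the opt-in knob discussed above
        self.buffers = {}  # (node_name, size) -> buffer

    def get_or_allocate(self, node_name, size):
        if not self.enabled:
            return self.allocator.allocate(size)
        # Naive key: a different size (dynamic shape) misses the cache
        # and grows it unboundedly, which is the concern raised above.
        key = (node_name, size)
        if key not in self.buffers:
            self.buffers[key] = self.allocator.allocate(size)
        return self.buffers[key]


allocator = FakeAllocator()
cache = AllocationCache(allocator)
for _ in range(10):  # repeated executions of the same host program
    buf = cache.get_or_allocate("T4_allgather_dst", 1048576)
print(allocator.num_allocations)  # -> 1: one allocation reused across runs
```

With `enabled=False` the sketch falls back to allocating every run, i.e., today's behavior of deferring entirely to torch's caching allocator.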
I'm resurrecting this PR as it became relevant again for the cuda ipc backend (#5259). Indeed, we need to make sure the ipc-registered buffers stay live across iterations for two reasons:

This is a temporary workaround. The proper fix would be either to make the cache smarter or to use pytorch's new symmetric allocator.
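The liveness requirement can be illustrated with a toy model of an IPC handle cache keyed by buffer address. This is not the actual cuda ipc backend; `share_mem_handle` is a made-up stand-in, and `id()` stands in for a device pointer. The point is that a reallocated buffer has a different address, so the handle cache misses, and in a collective handle exchange a miss on one rank but not its peers causes a hang.

```python
# Illustrative sketch (not the actual cuda ipc backend): an IPC handle
# cache keyed by the buffer's address goes stale whenever the allocator
# returns a different buffer, so each rank must keep the same buffer
# live across iterations for the cache to be hit uniformly.

ipc_handle_cache = {}  # buffer address -> opened IPC handle


def share_mem_handle(buffer):
    addr = id(buffer)  # stand-in for the device pointer
    if addr not in ipc_handle_cache:
        # In the real backend this requires a handle exchange with the
        # peer; if one rank misses the cache while its peers hit it,
        # the exchange hangs.
        ipc_handle_cache[addr] = f"handle@{addr}"
    return ipc_handle_cache[addr]


stable = bytearray(16)
h1 = share_mem_handle(stable)
h2 = share_mem_handle(stable)
assert h1 == h2 and len(ipc_handle_cache) == 1  # same buffer: cache hit

fresh = bytearray(16)  # reallocation: different buffer, cache miss
share_mem_handle(fresh)
assert len(ipc_handle_cache) == 2  # a second handle had to be opened
```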
!test
!test |
```cpp
// top-level module in bindMultiDevice to allow direct imports.
py::class_<MultiDeviceExecutorParams>(nvfuser, "MultiDeviceExecutorParams")
    .def(py::init<>())
    .def_property(
```
I don't know how to use that with nested structures.

Otherwise, I can also change the behavior and let the user "readwrite" the whole params, not only those two knobs:

```cpp
.def_readwrite("executor", &MultiDeviceExecutorParams::executor)
.def_readwrite("lower", &MultiDeviceExecutorParams::lower)
```
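The nested-structure difficulty can be shown with a pure-Python analogy (hypothetical class names). If a property getter hands back the nested params struct by value, the Python-side write `params.lower.x = ...` mutates a temporary copy and is silently lost; exposing the sub-struct as a shared reference, which is what `def_readwrite` on a struct member gives you, makes the write stick.

```python
# Pure-Python illustration (hypothetical names) of the nested-structure
# pitfall: a by-value getter returns a copy, so attribute writes on the
# nested struct are silently dropped; a shared reference behaves as the
# user expects.
import copy


class LowerParams:
    def __init__(self):
        self.persistent_allocations = False


class ParamsByValue:
    def __init__(self):
        self._lower = LowerParams()

    @property
    def lower(self):  # by-value getter: returns a fresh copy each access
        return copy.copy(self._lower)


class ParamsByReference:
    def __init__(self):
        self.lower = LowerParams()  # plain attribute: shared reference


p = ParamsByValue()
p.lower.persistent_allocations = True            # mutates a temporary copy
assert p._lower.persistent_allocations is False  # the write was lost

q = ParamsByReference()
q.lower.persistent_allocations = True            # mutates the real sub-struct
assert q.lower.persistent_allocations is True
```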
!test

!test
Adds a lowering path to generate a p2p ring pipeline backed by our recent cuda ipc backend. The performance looks great, and even beats Transformer Engine for large matrix sizes. For example, for TP column-wise (i.e., AG+Matmul), for m=32, k=16k, n=8k, the throughput (in TFLOPs) of the different implementations reads as follows:
- Fuser default, with nccl backend: 560 TFLOPs. This has the same perf as a baseline pytorch eager implementation.
- Fuser with p2p pipeline and cuda ipc backend: 678 TFLOPs
- Transformer Engine: 660 TFLOPs

<img width="786" height="473" alt="Screenshot 2025-09-29 at 16 29 42" src="https://github.com/user-attachments/assets/0bf34178-ccef-4d4d-abcf-3f4aa3704f69" />

This was measured using [DDLB](https://github.com/samnordmann/ddlb) and [this Fuser branch](https://github.com/NVIDIA/Fuser/tree/lower_to_cuda_ipc_p2p_rebased), on a single 8*H100 DGX node.

This PR depends on:
- #4466. Without the allocation cache, a rank might change the allocated buffer across iterations. Besides being a performance issue, this can create a hang if the ipc cache is not hit uniformly across ranks. A better long-term solution would be to use pytorch's recent symmetric allocator.
- (for performance only) #5325

The test written in the PR expresses a matmul

```
C = matmul(A, B), where
- A [DIDx(d), M/d, K]
- B [K, N]
- C [Stream(d), M/d, N]
```

The generated host program is:

```
%HostIrContainer { (T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                   T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                   T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    -> (T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  GetCurrentStream into Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    Synchronize Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i84 )
    IF Manual ( ( ( 8 + ( rank - streamIdx ) ) % 8 ) == rank ):
      T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = ideviceIdx.x0{8}, index = 0 )
      T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = Set( T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), cache_op=Streaming )
    ELSE:
      ShareMemHandles(P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA), P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA))
      P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA)
      P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA)
      Wait Communication 38
      Wait Communication 37
    T7_l___bfloat[iS17{128}, iS18{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i107 )
    T8_l___bfloat[iS19{128}, iS20{1024}, rS21{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx6{8}, index = i107 )
    T8_l___bfloat[iS19{128}, iS20{1024}, rS21{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = linear(T7_l___bfloat[iS17{128}, iS18{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    SetCurrentStream to Stream 0
    Synchronize Stream ( streamIdx % numberOfStreams )
} // %HostIrContainer
```
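The ring structure implied by the dump above can be sketched in a few lines. This is an inference from the `IF Manual ( ( ( 8 + ( rank - streamIdx ) ) % 8 ) == rank )` guard, assuming the symbolic indices `i84` and `i90` resolve to the recv and send peers as modeled below; `ring_step` and the peer formulas are illustrative, not the actual lowering code.

```python
# Sketch of the peer/slice computation suggested by the host program
# above: at stream iteration s, each rank produces the slice owned by
# rank (d + rank - s) % d, copying locally when that is its own slice,
# and otherwise receiving it from its owner while sending its own slice
# forward around the ring. Names and formulas are illustrative.

d = 8  # number of ranks == number of stream iterations


def ring_step(rank, s):
    slice_idx = (d + rank - s) % d  # slice produced at this iteration
    if slice_idx == rank:
        return ("local_copy", rank)
    recv_peer = slice_idx        # the rank that owns the slice (cf. i84)
    send_peer = (rank + s) % d   # the rank that needs ours now (cf. i90)
    return ("p2p", recv_peer, send_peer)


# Every rank ends up with every slice exactly once...
for rank in range(d):
    slices = {ring_step(rank, s)[1] for s in range(d)}
    assert slices == set(range(d))

# ...and the sends and recvs match up: if rank a receives from b at
# step s, then b's send at step s targets a, so no rank blocks.
for s in range(1, d):
    for a in range(d):
        _, recv_peer, _ = ring_step(a, s)
        assert ring_step(recv_peer, s)[2] == a
```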
What
Add an allocation cache at the HostIrEvaluator level that is persistent across runs. Before this, we relied only on torch's allocation cache, but I observed that it doesn't always behave as expected, especially in multiprocess and multistream programs dealing with large tensors.
The proposed solution may be too simple in the long run -- we should think about how to make it more controllable and smarter in the future.
Illustration of the problem:

We execute a simple program with Allgather + GEMM, without even overlap. The generated host program contains an Allocate node for the allgather's destination buffer. Before the current patch, we observe that, across iterations, the buffers sometimes (but not systematically) get reallocated, i.e., torch's allocation cache is not hit. This is only observed for large matrices.
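The observation can be reproduced in miniature with a probe that records the destination buffer's address each iteration and counts changes. This is a hedged sketch with a stub allocator standing in for torch's caching allocator (`run_iteration` and the 70% hit rate are made up); with real torch tensors one would record `tensor.data_ptr()` instead of `id()`.

```python
# Hypothetical probe for the instability described above: record the
# address of the allgather destination buffer each iteration and count
# the iterations where it changed. The stub allocator sometimes returns
# a fresh buffer instead of a cached one, mimicking the observed
# non-steady-state behavior of the caching allocator.
import random


def run_iteration(cached_buffers, hit_cache):
    # Stand-in for one execution of the host program.
    if hit_cache and cached_buffers:
        return cached_buffers[-1]
    buf = bytearray(1 << 20)
    cached_buffers.append(buf)
    return buf


random.seed(0)
cached, addrs = [], []
for i in range(10):
    buf = run_iteration(cached, hit_cache=random.random() < 0.7)
    addrs.append(id(buf))  # with torch: tensor.data_ptr()

reallocations = sum(a != b for a, b in zip(addrs, addrs[1:]))
print("address changes across iterations:", reallocations)
```

A steady-state allocator would report zero changes after warmup; the trace above corresponds to this count staying nonzero indefinitely.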