
[Host IR] Allocation cache#4466

Merged
samnordmann merged 3 commits into main from hir_allocation_cache on Oct 2, 2025

Conversation


@samnordmann samnordmann commented May 16, 2025

What

Add an allocation cache at the HostIrEvaluator level that is persistent across runs. Previously, we relied only on torch's allocation cache, but I observed that it doesn't always behave as expected, especially in multi-process and multi-stream programs dealing with large tensors.

The proposed solution may be too simple in the long run -- we should think about how to make it more controllable/smarter in the future.

Illustration of the problem:
We execute a simple program with Allgather + GEMM, without even overlap. The generated host program contains an Allocate node for the allgather's destination buffer. Before this patch, we observe that, across iterations, the buffers sometimes (but not systematically) get reallocated, so torch's allocation cache is not hit. This is only observed for large matrices.
[Screenshot 2025-05-16 at 18:23:37: trace showing the repeated reallocations]

@samnordmann samnordmann force-pushed the hir_allocation_cache branch from dfe4eaf to a2a671a Compare May 16, 2025 16:06

github-actions bot commented May 16, 2025

Review updated until commit ff9013e

Description

  • Added persistent allocation cache in HostIrEvaluator

  • Enabled cache control via MultiDeviceExecutorParams

  • Updated test cases to use allocation cache

  • Exposed cache toggle in Python bindings


Changes walkthrough 📝

Relevant files

Enhancement

csrc/host_ir/evaluator.cpp: Implement allocation caching logic (+16/-0)
  • Check allocation cache before allocating new tensors
  • Cache allocated tensors if use_allocation_cache is enabled
  • Reuse cached allocations across iterations

python/python_direct/multidevice.cpp: Expose allocation cache in Python API (+34/-12)
  • Bind MultiDeviceExecutorParams with use_allocation_cache property
  • Add backend_type property to params
  • Update MultiDeviceExecutor constructor to accept a params object

csrc/host_ir/evaluator.h: Declare allocation cache and params flag (+4/-0)
  • Add use_allocation_cache flag to HostIrEvaluatorParams
  • Declare allocation_cache_ in HostIrEvaluator
  • Store allocations using kir::Allocate* as key

Tests

tests/python/multidevice/test_overlap.py: Use allocation cache in overlap tests (+7/-6)
  • Update test setup to use MultiDeviceExecutorParams
  • Enable allocation cache in test cases
  • Initialize params before creating executor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Cache Lifetime and Ownership

The allocation cache stores tensors using raw pointers to kir::Allocate nodes as keys. This raises concerns about the lifetime and ownership of these keys. If the kir::Allocate node is destroyed or invalidated while the cache holds a reference to it, the cache may contain dangling pointers, leading to undefined behavior.

    if (params_.use_allocation_cache) {
      auto it = allocation_cache_.find(allocate);
      if (it != allocation_cache_.end()) {
        expr_evaluator_.bind(tv, it->second);
        return;
      }
    }
Memory Growth Risk

The current implementation unconditionally inserts allocations into the cache without any eviction policy. Over long-running or dynamic workloads, this may lead to unbounded memory growth, especially when dealing with large tensors as mentioned in the PR description.

    // Cache the allocation if enabled
    if (params_.use_allocation_cache) {
      allocation_cache_[allocate] = tensor;
    }
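As an illustration of one possible mitigation, here is a minimal sketch of a size-bounded LRU cache in Python; the class name and the cap are hypothetical and not part of the PR, whose cache lives in C++:

```python
from collections import OrderedDict

class BoundedAllocationCache:
    """Minimal LRU sketch: evicts the least recently used entry
    once the number of cached allocations exceeds max_entries."""

    def __init__(self, max_entries=16):
        self.max_entries = max_entries
        self._cache = OrderedDict()  # key -> cached buffer

    def get(self, key):
        if key not in self._cache:
            return None
        self._cache.move_to_end(key)  # mark as recently used
        return self._cache[key]

    def put(self, key, buffer):
        self._cache[key] = buffer
        self._cache.move_to_end(key)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the LRU entry
```

Even a simple cap like this would bound memory growth, at the cost of occasionally re-allocating a buffer that was evicted.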
Cache Key Stability

Using kir::Allocate* as a cache key assumes that the address uniquely and stably identifies a logical allocation across runs. However, if the IR is regenerated or modified between runs, the same logical allocation may have a different address, reducing cache effectiveness.

std::unordered_map<kir::Allocate*, at::Tensor> allocation_cache_;
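A sketch of a more stable keying scheme, purely hypothetical and not part of the PR, would derive the key from properties that survive IR regeneration (e.g. tensor name, shape, dtype) instead of the node address:

```python
def allocation_key(tv_name, shape, dtype):
    # A key built from stable allocation properties: two IR instances
    # describing the same logical allocation map to the same key,
    # whereas raw node pointers generally differ across regenerations.
    return (tv_name, tuple(shape), str(dtype))
```

This only works if those properties are concrete at cache time, which dynamic shapes would again complicate.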

    @samnordmann (Collaborator, Author) commented:

    !test

    @samnordmann samnordmann requested review from nsarka and wujingyue May 16, 2025 16:28

    nsarka commented May 16, 2025

    In the PR's current form, doesn't it make the Deallocate operation obsolete? Also, since the issue only occurs for large matrices, could setting garbage_collection_threshold from https://docs.pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management to a higher value help?
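For reference, this knob is set through the PYTORCH_CUDA_ALLOC_CONF environment variable before CUDA is initialized; a minimal sketch (0.8 is an arbitrary example value, not a recommendation from this thread):

```python
import os

# Configure torch's caching allocator to start reclaiming unused cached
# blocks only once reserved memory exceeds 80% of device capacity.
# Must be set before the first CUDA allocation (ideally before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.8"
```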

    @wujingyue (Collaborator) left a comment:

    Do you have a minimal repro so others can help debug? It's surprising at least to see the caching allocator not reaching a steady state with a seemingly simple pattern shown in the trace.

    @wujingyue (Collaborator) commented May 19, 2025

    doesn't it make the Deallocate operation obsolete?

    I believe it's more than that. With this PR, all allocations are persistent even across executions. I suspect it's going to be a problem with dynamic shapes.

    @samnordmann , to unblock your experiment, can you create an opt-in HostIrEvaluator parameter to enable persistent allocation? I'd also consider class HostIrEvaluatorV2 : public HostIrEvaluator which overrides handle(Allocate*).
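The suggested subclass can be sketched as a toy model in Python (purely illustrative; the real classes are C++, and everything here beyond the two class names is hypothetical):

```python
class HostIrEvaluator:
    """Stand-in for the base evaluator; allocates a fresh buffer each time."""

    def handle_allocate(self, allocate):
        return object()  # a fresh, unique "buffer" per call

class HostIrEvaluatorV2(HostIrEvaluator):
    """Override the Allocate handler: consult a persistent per-node cache
    first, and delegate to the base class only on a miss."""

    def __init__(self):
        self._allocation_cache = {}

    def handle_allocate(self, allocate):
        if allocate not in self._allocation_cache:
            self._allocation_cache[allocate] = super().handle_allocate(allocate)
        return self._allocation_cache[allocate]
```

Compared to a params flag, the subclass keeps the opt-in behavior isolated in one type rather than branching inside every handler.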

    @samnordmann (Collaborator, Author) commented:

    !test


    samnordmann commented May 19, 2025

    doesn't it make the Deallocate operation obsolete?

    I believe it's more than that. With this PR, all allocations are persistent even across executions.

    Right, and that's effectively the behavior we want in many cases, but today it relies on torch's allocation cache.

    I suspect it's going to be a problem with dynamic shapes.

    You're right, that's an issue with dynamic shapes...

    @samnordmann , to unblock your experiment, can you create an opt-in HostIrEvaluator parameter to enable persistent allocation? I'd also consider class HostIrEvaluatorV2 : public HostIrEvaluator which overrides handle(Allocate*).

    I added an opt-in param in the last commit.

    Since I'm currently unable to reproduce the allocation cache issue in the benchmark, let me close this PR in the meantime.

    @samnordmann samnordmann marked this pull request as draft May 19, 2025 14:08
    @samnordmann samnordmann marked this pull request as ready for review September 29, 2025 10:50

    samnordmann commented Sep 29, 2025

    I'm resurrecting this PR as it became relevant again for cuda ipc backend #5259

    Indeed, we need to make sure the ipc-registered buffers stay live across iterations, for two reasons:

    1. for performance, to avoid re-registering the buffers, which is a costly operation
    2. more critically, if one rank re-allocates a new buffer while the others do not, ShareMemHandles will hang

    This is a temporary workaround. The proper fix would be either to make the cache smarter or to use pytorch's new symmetric allocator.

    @samnordmann (Collaborator, Author) commented:

    !test

    @samnordmann (Collaborator, Author) commented:

    !test

    @wujingyue (Collaborator) left a comment:

    LGTM otherwise!

    // top-level module in bindMultiDevice to allow direct imports.
    py::class_<MultiDeviceExecutorParams>(nvfuser, "MultiDeviceExecutorParams")
    .def(py::init<>())
    .def_property(
    Review comment (Collaborator):

    Try .def_readwrite

    @samnordmann (Collaborator, Author) replied:

    I don't know how to use that with nested structures.

    Otherwise I can also change the behavior and let the user "readwrite" the whole params, and not only those two knobs:

    .def_readwrite("executor", &MultiDeviceExecutorParams::executor)
    .def_readwrite("lower", &MultiDeviceExecutorParams::lower)
    

    @samnordmann (Collaborator, Author) commented:

    !test

    @samnordmann (Collaborator, Author) commented:

    !test

    @samnordmann samnordmann merged commit 73233ad into main Oct 2, 2025
    55 checks passed
    @samnordmann samnordmann deleted the hir_allocation_cache branch October 2, 2025 16:41
    samnordmann added a commit that referenced this pull request Oct 8, 2025
    Adds a lowering path to generate a p2p ring pipeline backed by our
    recent cuda ipc backend. The performance looks great and even beats
    Transformer Engine for large matrix sizes. E.g., for TP columnwise (i.e.
    AG+Matmul) with m=32, k=16k, n=8k, the throughput (in TFLOPs) of the
    different implementations reads as follows:
    - Fuser default, with nccl backend: 560 TFLOPs. This has the same perf
    as a baseline pytorch eager implementation
    - Fuser with p2p pipeline and cuda ipc backend: 678 TFLOPs
    - Transformer Engine: 660 TFLOPs
    
    
    [Screenshot 2025-09-29 at 16:29:42: throughput comparison across the three implementations]
    
    This was measured using [DDLB](https://github.com/samnordmann/ddlb) and
    [this Fuser's
    branch](https://github.com/NVIDIA/Fuser/tree/lower_to_cuda_ipc_p2p_rebased),
    on a single 8*H100 DGX node
    
    
    This PR is dependent on:
    - #4466. Without the allocation
    cache, a rank might change the allocated buffer across iterations.
    Besides being a performance issue, it can create a hang if the ipc cache
    is not hit uniformly across ranks. A better long-term solution would be
    to use pytorch's recent symmetric allocator
    - (for performance only) #5325
    
    
    
    The test written in the PR expresses a matmul
    ```
    C = matmul(A,B), 
    where 
    - A [DIDx(d), M/d, K]
    - B[K,N],
    - C[Stream(d), M/d, N]
    ```
    The generated host program is:
    ```
    %HostIrContainer { (T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
      T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
      T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
      GetCurrentStream into Stream 0
      FOR streamIdx in istreamIdx10{8}:
        SetCurrentStream to Stream ( streamIdx % numberOfStreams )
        Synchronize Stream 0
      FOR streamIdx in istreamIdx10{8}:
        SetCurrentStream to Stream ( streamIdx % numberOfStreams )
        T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
           = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i84 )
        IF Manual ( ( ( 8 + ( rank - streamIdx ) ) % 8 ) == rank ):
          T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
             = HirAliasSelect( T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = ideviceIdx.x0{8}, index = 0 )
          T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
             = Set( T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), cache_op=Streaming )
        ELSE:
          ShareMemHandles(P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA), P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA),
          P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA)
          P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA)
          Wait Communication 38
          Wait Communication 37
        T7_l___bfloat[iS17{128}, iS18{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
           = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i107 )
        T8_l___bfloat[iS19{128}, iS20{1024}, rS21{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
           = HirAliasSelect( T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx6{8}, index = i107 )
        T8_l___bfloat[iS19{128}, iS20{1024}, rS21{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
           = linear(T7_l___bfloat[iS17{128}, iS18{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                    T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})      ,
              T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})      )
        SetCurrentStream to Stream 0
        Synchronize Stream ( streamIdx % numberOfStreams )
    } // %HostIrContainer
    
    ```
    tbqh pushed a commit that referenced this pull request Nov 12, 2025