!test
In the PR's current form, doesn't it make the Deallocate operation obsolete? Also, since the issue is only for large matrices, could setting
wujingyue
left a comment
Do you have a minimal repro so others can help debug? It's surprising at least to see the caching allocator not reaching a steady state with a seemingly simple pattern shown in the trace.
I believe it's more than that. With this PR, all allocations are persistent even across executions. I suspect it's going to be a problem with dynamic shapes. @samnordmann, to unblock your experiment, can you create an opt-in HostIrEvaluator parameter to enable persistent allocation? I'd also consider
!test |
Right, and that's effectively the behavior we want in many cases, but today it relies on torch's allocation cache.

You're right that this is an issue with dynamic shapes...

I added an opt-in param in the last commit. Since I'm not able to reproduce the allocation cache issue in the benchmark for now, let me close this PR in the meantime.
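The opt-in persistent allocation being discussed can be sketched as a small cache held by the evaluator and keyed by the Allocate node. This is a hypothetical illustration, not the actual HostIrEvaluator API: `AllocationCache`, `FakeAllocator`, and `get_or_allocate` are made-up names, and keying on `(node_name, size)` shows why dynamic shapes would defeat such a naive cache.

```python
# Hypothetical sketch of a persistent allocation cache at the evaluator
# level; names are illustrative, not the actual HostIrEvaluator or torch API.

class FakeAllocator:
    """Stand-in for the underlying (torch) allocator."""
    def __init__(self):
        self.num_allocations = 0

    def allocate(self, size):
        self.num_allocations += 1
        return bytearray(size)  # stand-in for a device buffer


class AllocationCache:
    """Caches buffers per Allocate node so repeated executions reuse them."""
    def __init__(self, allocator, enabled=True):
        self.allocator = allocator
        self.enabled = enabled  # the opt-in knob discussed above
        self.buffers = {}  # (node_name, size) -> buffer

    def get_or_allocate(self, node_name, size):
        if not self.enabled:
            return self.allocator.allocate(size)
        # Naive key: a different size (dynamic shape) misses the cache
        # and grows it unboundedly, which is the concern raised above.
        key = (node_name, size)
        if key not in self.buffers:
            self.buffers[key] = self.allocator.allocate(size)
        return self.buffers[key]


allocator = FakeAllocator()
cache = AllocationCache(allocator)
for _ in range(10):  # repeated executions of the same host program
    buf = cache.get_or_allocate("T4_allgather_dst", 1048576)
print(allocator.num_allocations)  # -> 1: one allocation reused across runs
```

With `enabled=False` the sketch falls back to allocating every run, i.e., today's behavior of deferring entirely to torch's caching allocator.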
I'm resurrecting this PR as it became relevant again for the cuda ipc backend (#5259). Indeed, we need to make sure the ipc-registered buffers stay live across iterations for two reasons:

This is a temporary workaround. The proper fix would be either to make the cache smarter or to use pytorch's new symmetric allocator.
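The liveness requirement can be illustrated with a toy model of an IPC handle cache keyed by buffer address. This is not the actual cuda ipc backend; `share_mem_handle` is a made-up stand-in, and `id()` stands in for a device pointer. The point is that a reallocated buffer has a different address, so the handle cache misses, and in a collective handle exchange a miss on one rank but not its peers causes a hang.

```python
# Illustrative sketch (not the actual cuda ipc backend): an IPC handle
# cache keyed by the buffer's address goes stale whenever the allocator
# returns a different buffer, so each rank must keep the same buffer
# live across iterations for the cache to be hit uniformly.

ipc_handle_cache = {}  # buffer address -> opened IPC handle


def share_mem_handle(buffer):
    addr = id(buffer)  # stand-in for the device pointer
    if addr not in ipc_handle_cache:
        # In the real backend this requires a handle exchange with the
        # peer; if one rank misses the cache while its peers hit it,
        # the exchange hangs.
        ipc_handle_cache[addr] = f"handle@{addr}"
    return ipc_handle_cache[addr]


stable = bytearray(16)
h1 = share_mem_handle(stable)
h2 = share_mem_handle(stable)
assert h1 == h2 and len(ipc_handle_cache) == 1  # same buffer: cache hit

fresh = bytearray(16)  # reallocation: different buffer, cache miss
share_mem_handle(fresh)
assert len(ipc_handle_cache) == 2  # a second handle had to be opened
```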
!test
!test |
```cpp
// top-level module in bindMultiDevice to allow direct imports.
py::class_<MultiDeviceExecutorParams>(nvfuser, "MultiDeviceExecutorParams")
    .def(py::init<>())
    .def_property(
```
I don't know how to use that with nested structures.

Otherwise, I can also change the behavior and let the user "readwrite" the whole params, not only those two knobs:

```cpp
.def_readwrite("executor", &MultiDeviceExecutorParams::executor)
.def_readwrite("lower", &MultiDeviceExecutorParams::lower)
```
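The nested-structure difficulty can be shown with a pure-Python analogy (hypothetical class names). If a property getter hands back the nested params struct by value, the Python-side write `params.lower.x = ...` mutates a temporary copy and is silently lost; exposing the sub-struct as a shared reference, which is what `def_readwrite` on a struct member gives you, makes the write stick.

```python
# Pure-Python illustration (hypothetical names) of the nested-structure
# pitfall: a by-value getter returns a copy, so attribute writes on the
# nested struct are silently dropped; a shared reference behaves as the
# user expects.
import copy


class LowerParams:
    def __init__(self):
        self.persistent_allocations = False


class ParamsByValue:
    def __init__(self):
        self._lower = LowerParams()

    @property
    def lower(self):  # by-value getter: returns a fresh copy each access
        return copy.copy(self._lower)


class ParamsByReference:
    def __init__(self):
        self.lower = LowerParams()  # plain attribute: shared reference


p = ParamsByValue()
p.lower.persistent_allocations = True            # mutates a temporary copy
assert p._lower.persistent_allocations is False  # the write was lost

q = ParamsByReference()
q.lower.persistent_allocations = True            # mutates the real sub-struct
assert q.lower.persistent_allocations is True
```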
!test

!test
Adds a lowering path to generate a p2p ring pipeline backed by our recent cuda ipc backend. The performance looks great, and even beats Transformer Engine for large matrix sizes. For example, for TP column-wise (i.e., AG+Matmul), for m=32, k=16k, n=8k, the throughput (in TFLOPs) of the different implementations reads as follows:
- Fuser default, with nccl backend: 560 TFLOPs. This has the same perf as a baseline pytorch eager implementation.
- Fuser with p2p pipeline and cuda ipc backend: 678 TFLOPs
- Transformer Engine: 660 TFLOPs

<img width="786" height="473" alt="Screenshot 2025-09-29 at 16 29 42" src="https://github.com/user-attachments/assets/0bf34178-ccef-4d4d-abcf-3f4aa3704f69" />

This was measured using [DDLB](https://github.com/samnordmann/ddlb) and [this Fuser branch](https://github.com/NVIDIA/Fuser/tree/lower_to_cuda_ipc_p2p_rebased), on a single 8*H100 DGX node.

This PR depends on:
- #4466. Without the allocation cache, a rank might change the allocated buffer across iterations. Besides being a performance issue, this can create a hang if the ipc cache is not hit uniformly across ranks. A better long-term solution would be to use pytorch's recent symmetric allocator.
- (for performance only) #5325

The test written in the PR expresses a matmul

```
C = matmul(A, B), where
- A [DIDx(d), M/d, K]
- B [K, N]
- C [Stream(d), M/d, N]
```

The generated host program is:

```
%HostIrContainer { (T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                   T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                   T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    -> (T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  GetCurrentStream into Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    Synchronize Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i84 )
    IF Manual ( ( ( 8 + ( rank - streamIdx ) ) % 8 ) == rank ):
      T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = ideviceIdx.x0{8}, index = 0 )
      T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = Set( T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), cache_op=Streaming )
    ELSE:
      ShareMemHandles(P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA), P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA))
      P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA)
      P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA)
      Wait Communication 38
      Wait Communication 37
    T7_l___bfloat[iS17{128}, iS18{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i107 )
    T8_l___bfloat[iS19{128}, iS20{1024}, rS21{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx6{8}, index = i107 )
    T8_l___bfloat[iS19{128}, iS20{1024}, rS21{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = linear(T7_l___bfloat[iS17{128}, iS18{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    SetCurrentStream to Stream 0
    Synchronize Stream ( streamIdx % numberOfStreams )
} // %HostIrContainer
```
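The ring structure implied by the dump above can be sketched in a few lines. This is an inference from the `IF Manual ( ( ( 8 + ( rank - streamIdx ) ) % 8 ) == rank )` guard, assuming the symbolic indices `i84` and `i90` resolve to the recv and send peers as modeled below; `ring_step` and the peer formulas are illustrative, not the actual lowering code.

```python
# Sketch of the peer/slice computation suggested by the host program
# above: at stream iteration s, each rank produces the slice owned by
# rank (d + rank - s) % d, copying locally when that is its own slice,
# and otherwise receiving it from its owner while sending its own slice
# forward around the ring. Names and formulas are illustrative.

d = 8  # number of ranks == number of stream iterations


def ring_step(rank, s):
    slice_idx = (d + rank - s) % d  # slice produced at this iteration
    if slice_idx == rank:
        return ("local_copy", rank)
    recv_peer = slice_idx        # the rank that owns the slice (cf. i84)
    send_peer = (rank + s) % d   # the rank that needs ours now (cf. i90)
    return ("p2p", recv_peer, send_peer)


# Every rank ends up with every slice exactly once...
for rank in range(d):
    slices = {ring_step(rank, s)[1] for s in range(d)}
    assert slices == set(range(d))

# ...and the sends and recvs match up: if rank a receives from b at
# step s, then b's send at step s targets a, so no rank blocks.
for s in range(1, d):
    for a in range(d):
        _, recv_peer, _ = ring_step(a, s)
        assert ring_step(recv_peer, s)[2] == a
```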
What
Add an allocation cache at the HostIrEvaluator level that is persistent across runs. Before this, we relied only on torch's allocation cache, but I observed that it doesn't always behave as expected, especially in multiprocess and multistream programs dealing with large tensors.
The proposed solution may be too simple in the long run -- we should think about how to make it more controllable and smarter in the future.
Illustration of the problem:

We execute a simple program with Allgather + GEMM, without even overlap. The generated host program contains an Allocate node for the allgather's destination buffer. Before the current patch, we observe that, across iterations, the buffers sometimes (but not systematically) get reallocated, i.e., torch's allocation cache is not hit. This is only observed for large matrices.
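The observation can be reproduced in miniature with a probe that records the destination buffer's address each iteration and counts changes. This is a hedged sketch with a stub allocator standing in for torch's caching allocator (`run_iteration` and the 70% hit rate are made up); with real torch tensors one would record `tensor.data_ptr()` instead of `id()`.

```python
# Hypothetical probe for the instability described above: record the
# address of the allgather destination buffer each iteration and count
# the iterations where it changed. The stub allocator sometimes returns
# a fresh buffer instead of a cached one, mimicking the observed
# non-steady-state behavior of the caching allocator.
import random


def run_iteration(cached_buffers, hit_cache):
    # Stand-in for one execution of the host program.
    if hit_cache and cached_buffers:
        return cached_buffers[-1]
    buf = bytearray(1 << 20)
    cached_buffers.append(buf)
    return buf


random.seed(0)
cached, addrs = [], []
for i in range(10):
    buf = run_iteration(cached, hit_cache=random.random() < 0.7)
    addrs.append(id(buf))  # with torch: tensor.data_ptr()

reallocations = sum(a != b for a, b in zip(addrs, addrs[1:]))
print("address changes across iterations:", reallocations)
```

A steady-state allocator would report zero changes after warmup; the trace above corresponds to this count staying nonzero indefinitely.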