Refactor: fix symmetric memory pool isolation per communication group #20153
wangfakang wants to merge 5 commits into sgl-project:main
Conversation
/rerun-failed-ci
3 similar comments
There is no reason why the symmetric memory cannot be registered by multiple groups. Creating extra pools increases the memory footprint and also creates more fragmentation.
@nvcastet @yizhang2077 NCCL's symmetric memory feature requires that registered memory stay consistent with the communicator/group it is bound to; otherwise the symmetric memory path is simply not enabled. As described in the updated motivation, when multiple communication groups share a single global MemPool, memory blocks released by one group's comm may be reused by another group's comm. However, symmetric memory requires buffers to be registered with a specific ncclComm via ncclCommWindowRegister, so reusing memory across groups associates the registration with the wrong communicator. This fix guarantees correctness without extra CPU overhead. Note that NCCL's window/memory lookup is scoped to the specific communicator/group used in the collective call, not global. Here's the key point from the source code:
```cpp
// src/dev_runtime.cc:1097
ncclResult_t ncclCommWindowRegister(ncclComm_t comm, void* buff, ...) {
  // ...
  ncclIntruQueueEnqueue(&comm->devrState.regTaskQueue, task);
  // ...
}

// src/dev_runtime.cc:1132-1144
ncclResult_t ncclDevrFindWindow(struct ncclComm* comm, void const* userPtr, ...) {
  // ...
  struct ncclDevrState* devr = &comm->devrState;  // Uses the comm passed in
  // Searches in devr->winSorted
  // ...
}

// src/enqueue.cc:2914-2915
ncclDevrFindWindow(comm, info->sendbuff, &sendWin);
ncclDevrFindWindow(comm, info->recvbuff, &recvWin);
```
ncclCommWindowRegister can be called multiple times on the same memory with multiple groups (i.e., you can register all the memory segments of the one pool, over time, with all the communicators that use symmetric memory).
@nvcastet I agree that
I think we can keep track of the memory segment pointers ourselves in C++, so we can access them when we want to register them with multiple groups. That way the symmetric memory pool pre-allocation can be reused by any group.
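The idea above could be sketched roughly like this in Python. This is a hedged illustration only: `SegmentTracker`, `add_segment`, `ensure_registered`, and the `register_window` callback are all hypothetical names, not SGLang, PyTorch, or NCCL APIs.

```python
# Hypothetical sketch: track the symmetric pool's memory segments ourselves
# and lazily register each (segment, group) pair exactly once.
class SegmentTracker:
    def __init__(self):
        self.segments = []       # known (ptr, size) segments, in discovery order
        self.registered = set()  # (ptr, group_name) pairs already registered

    def add_segment(self, ptr, size):
        # Called when the pool maps a new physical segment.
        self.segments.append((ptr, size))

    def ensure_registered(self, group_name, register_window):
        # Registration cost is paid only for (new segment, group) pairs;
        # in steady state this loop is just cheap set lookups.
        for ptr, size in self.segments:
            key = (ptr, group_name)
            if key not in self.registered:
                register_window(group_name, ptr, size)
                self.registered.add(key)
```

Under this sketch, every group sees every segment of the one shared pool, which is the reuse property being argued for here; the open question discussed below is what the per-allocation bookkeeping costs.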
@nvcastet @merrymercy Thanks for the suggestion. I've explored the approach of tracking memory segments in C++ to enable a shared pool across groups. However, there's a fundamental challenge:
Since there's no way to get a callback when MemPool returns cached memory, the C++ tracking approach doesn't fundamentally solve the overhead problem; it just moves the iteration from Python into C++. Given this limitation, the per-group MemPool approach in this PR remains a practical solution that guarantees correctness without runtime overhead, at the cost of some memory fragmentation (which is manageable in practice). The current PyTorch implementation also has a similar CPU-overhead issue:
@wangfakang I disagree. There is a small number of segments in the symmetric pools; the CPU overhead comes from the Python/PyTorch implementation.
Registration cost is paid only if a new segment is added or a new group is seen.
@wangfakang Please benchmark the approach to evaluate option A. |
Hello @nvcastet, I have refactored a commit based on option A and completed benchmark testing, which found that the CPU cost of
However, the current option A occasionally reports an error when CUDA Graph mode is enabled, that is, when triggering the
Friendly ping @nvcastet cc @merrymercy
I reproduced an issue where option A (different comms/groups sharing a memory pool) occasionally causes NCCL's function
Under option A, executing the following code triggers the cuMemMap error at the Section2 location. Meanwhile, if the code at Section1 is commented out, the error does not occur. Additionally, running the same code with the approach in this PR (isolated memory pools for different comms/groups) does not produce the error.
Reproduce code of the issue:
Debug NCCL patch:
Some debug info:
NCCL warn log:
@nvcastet @xiaofanl-nvidia @sjeaugey @AddyLaddy I tried to register the address list in the same order as the address list registered in
Hello @nvcastet, based on our discussion, I've refactored the code and opened a new PR. PTAL, thx.
Thanks @wangfakang ! |
@nvcastet Thank you, the issue is resolved. I registered the address list in the same order as the address list registered in
That makes sense: window registration is a collective op, so you will deadlock if the registrations don't happen in the same order on all the GPUs.
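The ordering fix described above can be sketched with a tiny hypothetical helper (`registration_order` is not an NCCL or SGLang API): since window registration is collective, every rank must issue its registrations in one identical sequence, and sorting by base address is one deterministic way to get that, assuming all ranks hold the same segment set.

```python
# Hedged sketch: produce one deterministic registration order shared by all
# ranks, so the collective registration calls pair up correctly.
def registration_order(segments):
    """segments: iterable of (base_ptr, size) tuples; sort by base address."""
    return sorted(segments, key=lambda s: s[0])
```

Each rank would then walk `registration_order(...)` and issue its collective registrations in that fixed sequence, regardless of the order in which it discovered the segments locally.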
CC @nvcastet @yizhang2077 @merrymercy @ShangmingCai @Fridge003 @ch-wan PTAL, thx.
Motivation
When multiple communication groups share a single global MemPool, memory blocks released by one group's comm may be reused by another group's comm. However, symmetric memory requires buffers to be registered with a specific ncclComm via ncclCommWindowRegister. Reusing memory across groups causes the registration to be associated with the wrong communicator.
I refactored SymmPool to replace the global MemPool with a per-group dictionary. Now each communication group has its own MemPool, ensuring proper memory registration and preventing cross-group allocation issues in multi-comm scenarios.
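The per-group layout might look roughly like this sketch; all names here (`SymmPoolRegistry`, `pool_for`, the `pool_factory` callable) are hypothetical illustrations, not the actual SGLang SymmPool code.

```python
# Minimal sketch of the per-group pool idea: each communication group gets
# its own dedicated pool, so a block freed by one group's comm can never be
# handed out to a different group's comm.
class SymmPoolRegistry:
    def __init__(self, pool_factory):
        self._pools = {}                 # group name -> dedicated memory pool
        self._pool_factory = pool_factory

    def pool_for(self, group_name):
        # Lazily create exactly one pool per group.
        if group_name not in self._pools:
            self._pools[group_name] = self._pool_factory(group_name)
        return self._pools[group_name]
```

The trade-off, as discussed above, is some extra memory footprint and fragmentation from maintaining multiple pools, in exchange for every registration staying bound to the correct communicator.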
Related PR: #19329 (comment)
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci