Skip to content

Revert "Remove unnecessary barrier in share mem handle cuda ipc"#5273

Merged
Priya2698 merged 1 commit intomainfrom
revert-5260-remove_ipc_barrier_share_mem_handle
Oct 1, 2025
Merged

Revert "Remove unnecessary barrier in share mem handle cuda ipc"#5273
Priya2698 merged 1 commit intomainfrom
revert-5260-remove_ipc_barrier_share_mem_handle

Conversation

@Priya2698
Copy link
Collaborator

Reverts #5260

RingAllgatherOverlapTest.RingAllgatherBasedPipeliningHostIRImplementationCudaIpc failed for this PR.

@Priya2698
Copy link
Collaborator Author

!test

@github-actions
Copy link

Description

  • Reintroduce barrier to ensure memhandle synchronization

  • Prevent correctness issues in IPC handle exchange

  • Fix failing RingAllgatherOverlapTest test case

  • Ensure ranks synchronize before next handle exchange


Changes walkthrough 📝

Relevant files
Bug fix
ipc_handle.cpp
Reintroduce synchronization barrier in IPC handle exchange

csrc/multidevice/ipc_handle.cpp

  • Readded barrier after inserting IPC handles
  • Ensures all ranks receive handles before proceeding
  • Prevents premature key deletion from store
  • Fixes correctness issue in handle exchange
  • +0/-3     

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review

    Missing Early Return

    The reverted code removes the early return condition when non_cached_communications is empty, which may lead to unnecessary barrier execution and potential performance impact. This change should be validated for correctness and performance implications.

      auto ipc_handles = std::make_unique<P2pIpcHandle>(
          std::move(local_ipc_handle), std::move(peer_ipc_handle));
    
      insert(communication, std::move(ipc_handles));
    }
    
    // a barrier is needed here to ensure all ranks have received the
    // memhandles and the keys are deleted from the store before the next call to
    // exchangeHandles, otherwise there is a correctness issue
    // TODO: precisely select what ranks need to wait on that barrier.

    @Priya2698 Priya2698 merged commit b90eb75 into main Oct 1, 2025
    50 of 51 checks passed
    @Priya2698 Priya2698 deleted the revert-5260-remove_ipc_barrier_share_mem_handle branch October 1, 2025 03:11
    @samnordmann
    Copy link
    Collaborator

    Thanks, and sorry for the trouble. I believe the allocation cache #4466 should fix the issue. I'll reopen a new PR later

    samnordmann added a commit that referenced this pull request Oct 7, 2025
    This PR re-introduces the performance-critical changes from #5260. The
    original PR was reverted in #5273 due to a single failing test. To
    unblock this essential performance gain, this patch temporarily disables
    the test in question:
    `RingAllgatherBasedPipeliningHostIRImplementationCudaIpc`.
    
    The rationale for disabling it is as follows:
    - The underlying pipelined algorithm is now extensively covered by
    several other tests, ensuring sufficient validation.
    - The hand-written logic within this specific test appears to be overly
    complex and may not accurately reflect our target use cases.
    - Debugging the failure proved to be non-trivial, and its resolution
    should not block development.
    
    A follow-up task can be created to either fix or remove this test
    permanently. In the meantime, merging this patch will unblock upcoming
    tasks.
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    3 participants