Skip to content

Conversation

michal-shalev
Copy link
Contributor

What?

Fix segmentation fault when releasing multiple exported batches on the same UCX endpoint.

Why?

Single exported_batch variable was overwritten by subsequent exports, causing releaseBatch() to use wrong batch handle. This caused crashes during elastic training when removing ranks with multiple active batches.

How?

  • Replace single variable with std::map<nixlUcxReq, ucp_batch_h> to track multiple batch handles
  • Add test multipleExportReleaseSameEndpoint that reproduces the bug

…same endpoint

Co-Authored-By: Roey Azran <[email protected]>

Signed-off-by: Michal Shalev <[email protected]>
Copy link

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

public:
void err_cb(ucp_ep_h ucp_ep, ucs_status_t status);
void* exported_batch{nullptr};
std::map<nixlUcxReq, ucp_batch_h> exported_batches;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use unordered map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Signed-off-by: Michal Shalev <[email protected]>
Copy link

copy-pr-bot bot commented Oct 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pull-request-size pull-request-size bot added size/L and removed size/M labels Oct 12, 2025
@michal-shalev michal-shalev force-pushed the fix-multiple-batch-export-segfault branch 2 times, most recently from b7ee68e to 0fa6861 Compare October 12, 2025 11:53
@pull-request-size pull-request-size bot added size/M and removed size/L labels Oct 12, 2025
@michal-shalev michal-shalev merged commit a241f80 into ai-dynamo:device-api Oct 12, 2025
9 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants