DEVICE/API: Wait for wireup completion in createGpuXferReq #947

michal-shalev · 2025-10-23T15:41:28Z

What?

Add blocking wait for endpoint wireup completion in createGpuXferReq().

Why?

Previously, users had to implement workarounds in their applications to wait for wireup completion before calling createGpuXferReq() (as shown in UCX tests).
This PR moves the wireup handling into the library, simplifying the API and removing the burden from application code.

How?

- Retry ucp_device_mem_list_create() in a loop while it returns UCS_ERR_NOT_CONNECTED.
- Call worker.progress() in each iteration to advance the wireup state machine.
- Document the blocking behavior in the public API.
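
The retry pattern described above can be sketched roughly as follows. This is a minimal, library-free illustration and not the code from this PR: the `Status` enum, the `createWithWireupWait` name, and the two callbacks are hypothetical stand-ins for ucp_device_mem_list_create(), UCS_ERR_NOT_CONNECTED, and worker.progress().

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the UCX status codes named in the description.
enum class Status { Ok, NotConnected, OtherError };

// Retry `create` while it reports NotConnected, calling `progress` between
// attempts. Returns the first status that is not NotConnected, or throws
// once the deadline has passed.
inline Status
createWithWireupWait(const std::function<Status()> &create,
                     const std::function<void()> &progress,
                     std::chrono::milliseconds timeout)
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const Status status = create();
        if (status != Status::NotConnected) {
            return status; // success or a non-retryable error
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            throw std::runtime_error(
                "Timeout waiting for endpoint wireup completion");
        }
        progress(); // stand-in for advancing the wireup state machine
    }
}
```

In this sketch the creation call is always attempted again after the last progress step, so a wireup that completes during that step is still observed before the deadline check can fire.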

github-actions · 2025-10-23T15:41:42Z

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

michal-shalev · 2025-10-27T15:55:09Z

/build

michal-shalev · 2025-10-27T17:26:19Z

/build

michal-shalev · 2025-10-27T20:19:14Z

/build

brminich · 2025-10-28T11:58:24Z

/build

rakhmets · 2025-10-30T14:06:16Z

src/utils/ucx/gpu_xfer_req_h.cpp

    params.num_elements = ucp_elements.size();

+    const auto start = std::chrono::steady_clock::now();
+    constexpr auto timeout = std::chrono::seconds(5);


What do you think about making the timeout configurable via an environment variable?
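
One way this could look is sketched below, purely as an illustration. The helper name `timeoutFromEnv` is hypothetical, and no particular variable name is defined by this PR; any name passed to it (for example a made-up `NIXL_EXAMPLE_WIREUP_TIMEOUT_SEC`) is an assumption.

```cpp
#include <cerrno>
#include <chrono>
#include <cstdlib>

// Hypothetical helper: read a timeout in whole seconds from the environment
// variable `name`. Falls back to `fallback` when the variable is unset,
// empty, malformed, out of range, or not positive.
inline std::chrono::seconds
timeoutFromEnv(const char *name, std::chrono::seconds fallback)
{
    const char *value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return fallback;
    }

    char *end = nullptr;
    errno = 0;
    const long parsed = std::strtol(value, &end, 10);
    if (errno == ERANGE || *end != '\0' || parsed <= 0) {
        return fallback; // reject overflow, trailing garbage, and values <= 0
    }
    return std::chrono::seconds(parsed);
}
```

The hard-coded `std::chrono::seconds(5)` would then serve as the fallback, so behavior is unchanged when the variable is not set.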

rakhmets · 2025-10-30T14:25:44Z

src/utils/ucx/gpu_xfer_req_h.cpp

+        if (std::chrono::steady_clock::now() - start > timeout) {
+            throw std::runtime_error(
+                "Timeout waiting for endpoint wireup completion has been exceeded");
+        }


I think it makes sense to swap the time check and the progress call on the workers. Otherwise, we may throw the exception even when the wireup completes on this iteration.

Optional: I'd prefer a timed loop, e.g.:

    for (const auto start = std::chrono::steady_clock::now();
         std::chrono::steady_clock::now() - start <= timeout;) {
        status = ucp_device_mem_list_create(ep.getEp(), &params, &ucx_handle);
        if (status != UCS_ERR_NOT_CONNECTED) {
            break;
        }

        for (const auto &w : workers) {
            w->progress();
        }
    }

    if (status == UCS_ERR_NOT_CONNECTED) {
        throw std::runtime_error("Timeout waiting for endpoint wireup completion has been exceeded");
    } else if (status != UCS_OK) {
        throw std::runtime_error(std::string("Failed to create device memory list: ") +
                                 ucs_status_string(status));
    }

rakhmets · 2025-10-30T14:27:02Z

src/utils/ucx/gpu_xfer_req_h.h


 nixlGpuXferReqH
 createGpuXferReq(const nixlUcxEp &ep,
+                 const std::vector<std::unique_ptr<nixlUcxWorker>> &all_workers,


The all_ prefix looks redundant for this parameter; workers would be enough.

michal-shalev self-assigned this Oct 23, 2025

michal-shalev requested review from a team, brminich, gleon99 and yosefe as code owners October 23, 2025 15:41

pull-request-size bot added the size/M label Oct 23, 2025

copy-pr-bot bot temporarily deployed to SWX_AWS October 23, 2025 15:41 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 15:41 Inactive

github-actions bot added the external-contribution label Oct 23, 2025

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 15:42 Inactive

michal-shalev force-pushed the internal-wireup branch from 6fc28c0 to f9dd38b Compare October 23, 2025 15:46

copy-pr-bot bot temporarily deployed to SWX_AWS October 23, 2025 15:47 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 15:47 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 15:48 Inactive

DEVICE/API: Wait for wireup completion in createGpuXferReq

9b37338

Signed-off-by: Michal Shalev <[email protected]>

michal-shalev force-pushed the internal-wireup branch from f9dd38b to 9b37338 Compare October 23, 2025 15:50

copy-pr-bot bot temporarily deployed to SWX_AWS October 23, 2025 15:50 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 15:50 Inactive

copy-pr-bot bot temporarily deployed to SWX_AWS October 23, 2025 15:50 Inactive

michal-shalev requested a review from ovidiusm October 23, 2025 15:52

copy-pr-bot bot temporarily deployed to GITLAB October 23, 2025 15:53 Inactive

brminich previously approved these changes Oct 28, 2025

Merge branch 'main' into internal-wireup

661090d

copy-pr-bot bot temporarily deployed to GITLAB October 28, 2025 11:58 Inactive

copy-pr-bot bot temporarily deployed to SWX_AWS October 28, 2025 11:58 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 28, 2025 11:58 Inactive

Progress all workers

173772b

Signed-off-by: Michal Shalev <[email protected]>

michal-shalev dismissed stale reviews from brminich and rakhmets via 173772b October 30, 2025 07:52

copy-pr-bot bot temporarily deployed to SWX_AWS October 30, 2025 07:52 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 30, 2025 07:52 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 30, 2025 07:53 Inactive

Increase timout to 5 seconds

63328a1

Signed-off-by: Michal Shalev <[email protected]>

copy-pr-bot bot temporarily deployed to SWX_AWS October 30, 2025 13:59 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 30, 2025 13:59 Inactive

copy-pr-bot bot temporarily deployed to SWX_AWS October 30, 2025 13:59 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 30, 2025 14:00 Inactive

rakhmets reviewed Oct 30, 2025
