@michal-shalev michal-shalev commented Oct 9, 2025

Pending openucx/ucx#10921

What?

Add device-side logging infrastructure to NIXL via a nixl_device_error macro, and use it to log failures of UCX backend operations.

Why?

Without NIXL-layer logging, we lose calling context when UCX errors occur in device code. Each layer should log its own context for proper debugging.

How?

Added the nixl_device_printf and nixl_device_error macros, which print the thread and block indices, file, line, and function. They are used in nixlGpuConvertUcsStatus and nixlGpuGetXferStatus to log "UCX backend error" when an error occurs.
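To make the description above concrete, here is a minimal host-side sketch of a logging macro of this kind. It is not the PR's definition: `example_dim3`, `example_device_snprintf`, and `example_log_error` are hypothetical names, `threadIdx` and `blockIdx` are plain stand-in variables so the snippet builds without nvcc, and the macro writes into a buffer instead of calling device `printf` so the result can be inspected.

```cpp
#include <cstdio>
#include <string>

// Stand-ins for CUDA's built-in threadIdx/blockIdx (assumption: this sketch is
// compiled as ordinary host C++; in device code the built-ins would be used).
struct example_dim3 {
    unsigned x, y, z;
};
static example_dim3 threadIdx = {0, 0, 0};
static example_dim3 blockIdx = {0, 0, 0};

// Hypothetical macro: formats one log line with thread/block indices, a level
// tag, and the file and line of the call site. Because it is a macro,
// __FILE__ and __LINE__ expand where the macro is used, not where it is defined.
#define example_device_snprintf(_buf, _size, _level, _fmt, ...) \
    std::snprintf(_buf, _size, "(%5u:%5u) %5s %s:%d: " _fmt "\n", \
                  threadIdx.x, blockIdx.x, _level, __FILE__, __LINE__, \
                  ##__VA_ARGS__)

// Hypothetical helper that logs an error reason and returns the formatted line.
inline std::string example_log_error(const char *reason) {
    char buf[1024];
    example_device_snprintf(buf, sizeof(buf), "ERROR",
                            "UCX backend error: %s", reason);
    return buf;
}
```

Calling `example_log_error("Device is busy")` returns a line that starts with the thread and block indices and ends with the message. The `##__VA_ARGS__` form is a GNU extension that GCC and Clang accept.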

github-actions bot commented Oct 9, 2025

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀


/* Helper macro to print a message from NIXL device function including the
* thread and block indices, file, line, and function */
#define nixl_device_printf(_title, _fmt, ...) \
Contributor

title -> level

Contributor Author

renamed to log_level

/* Helper macro to print a message from NIXL device function including the
 * thread and block indices, file, line, and function */
#define nixl_device_printf(_title, _fmt, ...) \
    printf("(%5d:%5d) %5s %s:%d %s: " _fmt "\n", threadIdx.x, blockIdx.x, _title, \
           __FILE__, __LINE__, __func__, ##__VA_ARGS__)
Contributor

__func__ is probably not needed. Maybe left-pad the file name and pad the line number with %-5d.

Contributor Author

removed func

I1020 23:39:53.765828 1403216 gpu_xfer_req_h.cpp:84] Created device memory list handle with 1 elements
E T0   :B0                    nixl_device.cuh:88] UCX backend error: Device is busy
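The padding discussed in this thread comes down to printf width flags: `%5d` right-aligns a value in five columns (padding on the left), while `%-5d` left-aligns it (padding on the right). The sketch below illustrates this with a hypothetical `example_prefix` helper. The column widths are assumptions chosen for illustration and are not claimed to match the layout of the log line above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical prefix formatter (not the PR's code). Thread and block indices
// are left-aligned in 4 columns, the file name is right-aligned in 20 columns,
// and the line number is left-aligned in 5 columns.
inline std::string example_prefix(unsigned thread, unsigned block,
                                  const char *file, int line) {
    char buf[128];
    std::snprintf(buf, sizeof(buf), "T%-4u:B%-4u %20s:%-5d",
                  thread, block, file, line);
    return buf;
}
```

With fixed widths, prefixes from different threads line up in columns, which makes interleaved device output easier to scan.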

         return NIXL_SUCCESS;
     }
-    printf("UCX returned error: %d\n", status);
+    nixl_device_error("UCX backend error");
Contributor

You have deleted the error code that was the reason for adding the logging. Is this intentional?

Contributor Author

I wanted to use ucs_device_status_string from openucx/ucx#10921
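One way to address the concern in this thread is to log both a readable description and the numeric code. The sketch below assumes nothing about the signature of `ucs_device_status_string`, which is part of the pending UCX change. Instead, it uses a hypothetical `example_status_t` enum and `example_status_string` function with placeholder values and strings.

```cpp
#include <cstdio>
#include <string>

// Hypothetical status codes for illustration only; these are not the values
// of UCX's ucs_status_t.
enum example_status_t : int {
    EXAMPLE_OK = 0,
    EXAMPLE_ERR_BUSY = -1,
    EXAMPLE_ERR_IO = -2
};

// Hypothetical status-to-string helper with placeholder descriptions.
inline const char *example_status_string(example_status_t status) {
    switch (status) {
    case EXAMPLE_OK:
        return "Success";
    case EXAMPLE_ERR_BUSY:
        return "Device is busy";
    case EXAMPLE_ERR_IO:
        return "Input/output error";
    }
    return "Unknown error";
}

// Builds a message that keeps the numeric code next to the description, so
// the code is not lost when the log switches to strings.
inline std::string example_error_message(example_status_t status) {
    char buf[128];
    std::snprintf(buf, sizeof(buf), "UCX backend error: %s (%d)",
                  example_status_string(status), static_cast<int>(status));
    return buf;
}
```

Keeping the number in the message also covers codes that the string helper does not know about.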

/* Helper macro to print a message from NIXL device function including the
* thread and block indices, file, line, and function */
#define nixl_device_printf(_title, _fmt, ...) \
printf("(%5d:%5d) %5s %s:%d %s: " _fmt "\n", threadIdx.x, blockIdx.x, _title, \
Contributor

@rakhmets rakhmets Oct 14, 2025

Missing a way to disable messages. E.g., using the existing environment variable, or a new one.
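A possible shape for such a switch is sketched below. Device code cannot read environment variables, so the host would read the setting once and pass the resulting level to the device, for example as a kernel argument. The variable name `EXAMPLE_DEVICE_LOG_LEVEL`, the accepted values, and all `example_*` names are hypothetical and do not refer to any existing NIXL or UCX setting.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical log levels, ordered from least to most verbose.
enum example_log_level : int {
    EXAMPLE_LOG_OFF = 0,
    EXAMPLE_LOG_ERROR = 1,
    EXAMPLE_LOG_DEBUG = 2
};

// Parses the textual setting; an unset or unrecognized value falls back to
// logging errors only.
inline example_log_level example_parse_log_level(const char *value) {
    if (value == nullptr) {
        return EXAMPLE_LOG_ERROR;
    }
    if (std::strcmp(value, "off") == 0) {
        return EXAMPLE_LOG_OFF;
    }
    if (std::strcmp(value, "debug") == 0) {
        return EXAMPLE_LOG_DEBUG;
    }
    return EXAMPLE_LOG_ERROR;
}

// Host-side read of the hypothetical environment variable.
inline example_log_level example_log_level_from_env() {
    return example_parse_log_level(std::getenv("EXAMPLE_DEVICE_LOG_LEVEL"));
}

// The check a logging macro would perform before printing a message.
inline bool example_should_log(example_log_level configured,
                               example_log_level message) {
    return message != EXAMPLE_LOG_OFF && message <= configured;
}
```

A logging macro could wrap its `printf` call in `if (example_should_log(level, EXAMPLE_LOG_ERROR))`, so that setting the level to `off` silences device messages.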

Contributor

Please review this section https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#limitations.
My biggest doubts are related to this:

> Note that the buffer is not flushed automatically when the program exits.
> The user must call cudaDeviceReset() or cuCtxDestroy() explicitly, as shown in the examples below.

Signed-off-by: Michal Shalev <[email protected]>