-
Notifications
You must be signed in to change notification settings - Fork 177
DEVICE/API: Add NIXL device-side logging infrastructure #886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Michal Shalev <[email protected]>
|
👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
src/api/gpu/ucx/nixl_device.cuh
Outdated
|
|
||
| /* Helper macro to print a message from NIXL device function including the | ||
| * thread and block indices, file, line, and function */ | ||
| #define nixl_device_printf(_title, _fmt, ...) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
title -> level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to log_level
src/api/gpu/ucx/nixl_device.cuh
Outdated
| * thread and block indices, file, line, and function */ | ||
| #define nixl_device_printf(_title, _fmt, ...) \ | ||
| printf("(%5d:%5d) %5s %s:%d %s: " _fmt "\n", threadIdx.x, blockIdx.x, _title, \ | ||
| __FILE__, __LINE__, __func__, ##__VA_ARGS__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably __func__ not needed, maybe left pad file and pad lines with %-5d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed func
I1020 23:39:53.765828 1403216 gpu_xfer_req_h.cpp:84] Created device memory list handle with 1 elements
E T0 :B0 nixl_device.cuh:88] UCX backend error: Device is busy
src/api/gpu/ucx/nixl_device.cuh
Outdated
| return NIXL_SUCCESS; | ||
| } | ||
| printf("UCX returned error: %d\n", status); | ||
| nixl_device_error("UCX backend error"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have deleted the error code that was the reason for adding the logging. Is this intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to use ucs_device_status_string from openucx/ucx#10921
src/api/gpu/ucx/nixl_device.cuh
Outdated
| /* Helper macro to print a message from NIXL device function including the | ||
| * thread and block indices, file, line, and function */ | ||
| #define nixl_device_printf(_title, _fmt, ...) \ | ||
| printf("(%5d:%5d) %5s %s:%d %s: " _fmt "\n", threadIdx.x, blockIdx.x, _title, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a way to disable messages. E.g., using the existing environment variable, or a new one.
src/api/gpu/ucx/nixl_device.cuh
Outdated
| /* Helper macro to print a message from NIXL device function including the | ||
| * thread and block indices, file, line, and function */ | ||
| #define nixl_device_printf(_title, _fmt, ...) \ | ||
| printf("(%5d:%5d) %5s %s:%d %s: " _fmt "\n", threadIdx.x, blockIdx.x, _title, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please review this section https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#limitations.
My biggest doubts are related to this:
Note that the buffer is not flushed automatically when the program exits.
The user must call cudaDeviceReset() or cuCtxDestroy() explicitly, as shown in the examples below.
Signed-off-by: Michal Shalev <[email protected]>
Signed-off-by: Michal Shalev <[email protected]>
Pending openucx/ucx#10921
What?
Add device-side logging infrastructure to NIXL with
nixl_device_errormacro and use it to log when UCX backend operations fail.Why?
Without NIXL-layer logging, we lose calling context when UCX errors occur in device code. Each layer should log its own context for proper debugging.
How?
Added
nixl_device_printfandnixl_device_errormacros that print thread/block info, file, line, and function. Used innixlGpuConvertUcsStatusandnixlGpuGetXferStatusto log "UCX backend error" when errors occur.