-
Notifications
You must be signed in to change notification settings - Fork 493
UCP/DEVICE: Make memh and local_addr optional for counter elements #10945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCP/DEVICE: Make memh and local_addr optional for counter elements #10945
Conversation
|
/azp run UCX PR |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| UCP_DEVICE_MEM_LIST_ELEM_FIELD_RKEY = UCS_BIT(1), /**< Unpacked remote memory key (always required) */ | ||
| UCP_DEVICE_MEM_LIST_ELEM_FIELD_LOCAL_ADDR = UCS_BIT(2), /**< Local address */ | ||
| UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR = UCS_BIT(3), /**< Remote address */ | ||
| UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR = UCS_BIT(3), /**< Remote address */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, it is also always required, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not always, check remote_offset in partial, users can to pass null to remote address and then the addr aa remote_offset
src/ucp/core/ucp_device.c
Outdated
| ucp_element = ¶ms->elements[i]; | ||
| local_addr = UCS_PARAM_VALUE(UCP_DEVICE_MEM_LIST_ELEM_FIELD, | ||
| ucp_element, local_addr, LOCAL_ADDR, NULL); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
src/ucp/core/ucp_device.c
Outdated
| "uct_memh=%p md_map=0x%lx local_md_index=%u", uct_memh, | ||
| ucp_element->memh->md_map, local_md_index); | ||
| ucs_assert(uct_memh != UCT_MEM_HANDLE_NULL); | ||
| if (ucp_element->memh != NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems we can't assume it is NULL without checking UCP_DEVICE_MEM_LIST_ELEM_FIELD_MEMH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR optimizes the UCP device memory list API by making memh and local_addr optional for counter elements, which only require remote addressing for atomic operations. Previously, counter elements unnecessarily required local memory registration, adding overhead. The implementation now detects local_sys_dev by allocating a temporary buffer on the current CUDA context device when memh is absent (mirroring the approach in ucp_ep_rma_batch_export). Changes span four files: API documentation updates (ucp_host.h, ucp_device_impl.h), core implementation logic in ucp_device.c (which handles detection of system device and MD map when local fields are omitted), performance test CUDA kernel refactoring (ucp_cuda_kernel.cu), and comprehensive test updates (test_ucp_device.cc) that introduce memory list modes to distinguish data-only, counter-only, and mixed scenarios.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
src/ucp/api/device/ucp_host.h |
5/5 | Documentation updates clarifying memh and local_addr are optional for counter elements while rkey remains required |
src/ucp/api/device/ucp_device_impl.h |
5/5 | Documentation update explaining data elements vs. counter elements' memory registration requirements |
src/tools/perf/cuda/ucp_cuda_kernel.cu |
4/5 | Refactored performance test to conditionally create counter elements with only remote fields when needed |
src/ucp/core/ucp_device.c |
3/5 | Core implementation detecting local_sys_dev via temporary buffer when memh absent; hardcoded CUDA assumption is concerning |
test/gtest/ucp/test_ucp_device.cc |
4/5 | Added memory list modes and tests validating optional fields, including edge cases for missing required fields and mismatched devices |
Confidence score: 3/5
- This PR requires careful review due to implementation complexity around memory detection fallback paths
- Score reflects hardcoded
UCS_MEMORY_TYPE_CUDAassumption inucp_device.c(line 255) which may break for ROCm/other accelerators, unclear handling ofUCT_MEM_HANDLE_NULLin downstream UCT layer functions, and the reliance on CUDA context device detection that may fail if no CUDA context is active - Pay close attention to
src/ucp/core/ucp_device.clines 254-264 (hardcoded CUDA type), lines 526-533 (uct_memh detection error handling), and lines360-362 (NULL checks) to ensure fallback logic is robust across all memory types and contexts
Sequence Diagram
sequenceDiagram
participant User
participant ucp_device_mem_list_create
participant ucp_device_mem_list_params_check
participant ucp_check_memh_elem
participant ucp_device_detect_local_sys_dev
participant ucp_mem_do_alloc
participant ucp_memory_detect_internal
participant ucp_device_detect_local_md_map
participant ucp_device_mem_list_create_handle
participant ucp_device_detect_uct_memh
User->>ucp_device_mem_list_create: "Create mem list with optional memh"
ucp_device_mem_list_create->>ucp_device_mem_list_params_check: "Validate params"
loop For each element
ucp_device_mem_list_params_check->>ucp_check_memh_elem: "Check if memh provided"
alt memh is NULL
ucp_check_memh_elem-->>ucp_device_mem_list_params_check: "No memh, continue"
else memh exists
ucp_check_memh_elem-->>ucp_device_mem_list_params_check: "Extract sys_dev and md_map"
end
end
alt No memh found in any element
ucp_device_mem_list_params_check->>ucp_device_detect_local_sys_dev: "Detect local_sys_dev from CUDA context"
ucp_device_detect_local_sys_dev->>ucp_mem_do_alloc: "Allocate temporary buffer on CUDA device"
ucp_mem_do_alloc-->>ucp_device_detect_local_sys_dev: "Return temporary memory"
ucp_device_detect_local_sys_dev->>ucp_memory_detect_internal: "Query sys_dev from temp buffer"
ucp_memory_detect_internal-->>ucp_device_detect_local_sys_dev: "Return sys_dev"
ucp_device_detect_local_sys_dev-->>ucp_device_mem_list_params_check: "Return detected sys_dev"
ucp_device_mem_list_params_check->>ucp_device_detect_local_md_map: "Build MD map from sys_dev"
ucp_device_detect_local_md_map-->>ucp_device_mem_list_params_check: "Return local_md_map"
end
ucp_device_mem_list_params_check-->>ucp_device_mem_list_create: "Validation complete"
ucp_device_mem_list_create->>ucp_device_mem_list_create_handle: "Create handle with params"
loop For each element without memh
ucp_device_mem_list_create_handle->>ucp_device_detect_uct_memh: "Detect uct_memh from local_addr"
alt local_addr exists
ucp_device_detect_uct_memh-->>ucp_device_mem_list_create_handle: "Return detected uct_memh"
else local_addr is NULL
ucp_device_detect_uct_memh-->>ucp_device_mem_list_create_handle: "Return UCS_ERR_NO_ELEM"
end
end
ucp_device_mem_list_create_handle-->>ucp_device_mem_list_create: "Return created handle"
ucp_device_mem_list_create-->>User: "Return handle"
5 files reviewed, 1 comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The latest changes refactor parameter validation and fallback logic for detecting local system device information when memh is not provided. The validation function ucp_device_mem_list_params_check was simplified to make length, memh, and local_addr optional (only rkey is strictly required), and a new helper ucp_device_extract_memh was introduced to extract metadata from memh when available. The core change in ucp_device_mem_list_create now uses ucp_device_detect_local_sys_dev as a fallback mechanism when memh is NULL, allocating a temporary CUDA buffer to infer the local system device ID. Test updates in test_ucp_device.cc now populate field_mask more precisely to reflect which fields are actually provided for counter versus data elements, aligning with the new optional-field semantics.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| src/ucp/core/ucp_device.c | 4/5 | Refactored validation to make memh/local_addr optional; added fallback detection for local_sys_dev using temporary CUDA buffer allocation |
| test/gtest/ucp/test_ucp_device.cc | 5/5 | Updated test setup to correctly set field_mask for counter elements (omitting memh/local_addr) vs data elements |
Confidence score: 4/5
- This PR is safe to merge with minor risk; the changes are well-scoped and the fallback detection logic is sound.
- Score reflects clean design and thorough validation updates, but deducted one point because the fallback device detection path (allocating temporary CUDA buffers) is relatively new and could introduce edge-case issues on diverse GPU configurations or when CUDA contexts are not initialized as expected.
- Pay close attention to
src/ucp/core/ucp_device.c, particularly theucp_device_detect_local_sys_devinvocation and how it handles cases where no CUDA context is available or the temporary allocation fails.
2 files reviewed, no comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The latest changes address previous reviewer feedback by refactoring the parameter validation and local device detection logic in src/ucp/core/ucp_device.c. The developer extracted the memh parameter extraction into a dedicated helper function ucp_device_extract_memh to improve code clarity and eliminated the intermediate have_memh variable by directly checking local_sys_dev against UCS_SYS_DEVICE_ID_UNKNOWN. The ucp_device_detect_local_sys_dev function now accepts a mem_type parameter to make it more generic and only sets output parameters upon success. These structural improvements align with the PR's goal of making memh and local_addr optional for counter elements by cleanly separating the detection and validation paths.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| src/ucp/core/ucp_device.c | 3/5 | Refactored parameter extraction into ucp_device_extract_memh helper, updated ucp_device_detect_local_sys_dev to accept mem_type parameter, and removed redundant tracking variables |
Confidence score: 3/5
- This PR requires careful review due to assumptions about CUDA memory types and NULL memh handling in counter-only operations
- Score reflects concerns about (1) lack of validation that multiple memh entries have matching memory types, (2) hardcoded CUDA memory type assumption when memh is absent, (3) UCT layer's ability to handle NULL memh in all code paths, and (4) the md_map derivation logic change from intersection-based to detection-based approach
- Pay close attention to lines 251-261 where local_sys_dev detection occurs for NULL memh cases, and lines 479-491 where
uct_memhmay be set toUCT_MEM_HANDLE_NULLand passed touct_iface_mem_element_pack
1 file reviewed, 1 comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The latest update applies several refactoring improvements based on prior feedback. The developer has moved field-mask initialization to a helper function (ucp_device_get_common_field_mask), simplified error-checking logic by consolidating validation into ucp_device_mem_list_params_check, removed the redundant offset variable from the class field, and improved naming conventions (renaming enums and refining local variables). The code now sets mem_type consistently and handles optional memh more gracefully by defaulting to UCS_MEMORY_TYPE_UNKNOWN when detection fails. The primary goal remains to eliminate unnecessary local memory registration overhead for counter-only elements in ucp_device_mem_list_create, but the changes focus on cleaner, more maintainable structure rather than new functional behavior.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| src/ucp/core/ucp_device.c | 4/5 | Refactored element initialization logic, removed redundant field, improved error handling, and consolidated field-mask creation into a helper function. |
Confidence score: 4/5
- This PR is safe to merge with minimal risk, pending final verification of detection fallback behavior.
- Score reflects clean refactoring and comprehensive prior review cycles; minor point deducted because the fallback value for
mem_typewhenucp_device_detect_local_sys_devfails is not explicitly verified in the diff (should default toUCS_MEMORY_TYPE_UNKNOWN), and edge-case handling for NULLuct_memhin counter-only paths needs runtime validation. - Pay close attention to
src/ucp/core/ucp_device.c, particularly the new helper functionucp_device_get_common_field_maskand the modified error-path inucp_device_detect_local_sys_devto ensuremem_typeis set correctly when detection fails.
1 file reviewed, no comments
What?
Make
memhandlocal_addroptional for counter elements inucp_device_mem_list_create.Why?
Counter elements only require remote addressing for atomic operations (
ucp_device_counter_inc, or as part ofucp_device_put_multi/put_multi_partial). Requiring local memory registration (memh/local_addr) for these elements is unnecessary overhead.How?
memhis not provided, detectlocal_sys_devby allocating a temporary buffer on the current CUDA context device (similar toucp_ep_rma_batch_export)