Skip to content

Conversation

@michal-shalev
Copy link
Contributor

What?

Make memh and local_addr optional for counter elements in ucp_device_mem_list_create.

Why?

Counter elements only require remote addressing for atomic operations (ucp_device_counter_inc, or as part of ucp_device_put_multi/put_multi_partial). Requiring local memory registration (memh/local_addr) for these elements is unnecessary overhead.

How?

  • When memh is not provided, detect local_sys_dev by allocating a temporary buffer on the current CUDA context device (similar to ucp_ep_rma_batch_export)
  • Updated tests and documentation

@Alexey-Rivkin
Copy link
Contributor

/azp run UCX PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

UCP_DEVICE_MEM_LIST_ELEM_FIELD_RKEY = UCS_BIT(1), /**< Unpacked remote memory key (always required) */
UCP_DEVICE_MEM_LIST_ELEM_FIELD_LOCAL_ADDR = UCS_BIT(2), /**< Local address */
UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR = UCS_BIT(3), /**< Remote address */
UCP_DEVICE_MEM_LIST_ELEM_FIELD_REMOTE_ADDR = UCS_BIT(3), /**< Remote address */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, it is also always required, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not always, check remote_offset in partial, users can to pass null to remote address and then the addr aa remote_offset

ucp_element = &params->elements[i];
local_addr = UCS_PARAM_VALUE(UCP_DEVICE_MEM_LIST_ELEM_FIELD,
ucp_element, local_addr, LOCAL_ADDR, NULL);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

"uct_memh=%p md_map=0x%lx local_md_index=%u", uct_memh,
ucp_element->memh->md_map, local_md_index);
ucs_assert(uct_memh != UCT_MEM_HANDLE_NULL);
if (ucp_element->memh != NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we can't assume it is NULL without checking UCP_DEVICE_MEM_LIST_ELEM_FIELD_MEMH

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR optimizes the UCP device memory list API by making memh and local_addr optional for counter elements, which only require remote addressing for atomic operations. Previously, counter elements unnecessarily required local memory registration, adding overhead. The implementation now detects local_sys_dev by allocating a temporary buffer on the current CUDA context device when memh is absent (mirroring the approach in ucp_ep_rma_batch_export). Changes span four files: API documentation updates (ucp_host.h, ucp_device_impl.h), core implementation logic in ucp_device.c (which handles detection of system device and MD map when local fields are omitted), performance test CUDA kernel refactoring (ucp_cuda_kernel.cu), and comprehensive test updates (test_ucp_device.cc) that introduce memory list modes to distinguish data-only, counter-only, and mixed scenarios.

Important Files Changed

Filename Score Overview
src/ucp/api/device/ucp_host.h 5/5 Documentation updates clarifying memh and local_addr are optional for counter elements while rkey remains required
src/ucp/api/device/ucp_device_impl.h 5/5 Documentation update explaining data elements vs. counter elements' memory registration requirements
src/tools/perf/cuda/ucp_cuda_kernel.cu 4/5 Refactored performance test to conditionally create counter elements with only remote fields when needed
src/ucp/core/ucp_device.c 3/5 Core implementation detecting local_sys_dev via temporary buffer when memh absent; hardcoded CUDA assumption is concerning
test/gtest/ucp/test_ucp_device.cc 4/5 Added memory list modes and tests validating optional fields, including edge cases for missing required fields and mismatched devices

Confidence score: 3/5

  • This PR requires careful review due to implementation complexity around memory detection fallback paths
  • Score reflects hardcoded UCS_MEMORY_TYPE_CUDA assumption in ucp_device.c (line 255) which may break for ROCm/other accelerators, unclear handling of UCT_MEM_HANDLE_NULL in downstream UCT layer functions, and the reliance on CUDA context device detection that may fail if no CUDA context is active
  • Pay close attention to src/ucp/core/ucp_device.c lines 254-264 (hardcoded CUDA type), lines 526-533 (uct_memh detection error handling), and lines360-362 (NULL checks) to ensure fallback logic is robust across all memory types and contexts

Sequence Diagram

sequenceDiagram
    participant User
    participant ucp_device_mem_list_create
    participant ucp_device_mem_list_params_check
    participant ucp_check_memh_elem
    participant ucp_device_detect_local_sys_dev
    participant ucp_mem_do_alloc
    participant ucp_memory_detect_internal
    participant ucp_device_detect_local_md_map
    participant ucp_device_mem_list_create_handle
    participant ucp_device_detect_uct_memh

    User->>ucp_device_mem_list_create: "Create mem list with optional memh"
    ucp_device_mem_list_create->>ucp_device_mem_list_params_check: "Validate params"
    
    loop For each element
        ucp_device_mem_list_params_check->>ucp_check_memh_elem: "Check if memh provided"
        alt memh is NULL
            ucp_check_memh_elem-->>ucp_device_mem_list_params_check: "No memh, continue"
        else memh exists
            ucp_check_memh_elem-->>ucp_device_mem_list_params_check: "Extract sys_dev and md_map"
        end
    end
    
    alt No memh found in any element
        ucp_device_mem_list_params_check->>ucp_device_detect_local_sys_dev: "Detect local_sys_dev from CUDA context"
        ucp_device_detect_local_sys_dev->>ucp_mem_do_alloc: "Allocate temporary buffer on CUDA device"
        ucp_mem_do_alloc-->>ucp_device_detect_local_sys_dev: "Return temporary memory"
        ucp_device_detect_local_sys_dev->>ucp_memory_detect_internal: "Query sys_dev from temp buffer"
        ucp_memory_detect_internal-->>ucp_device_detect_local_sys_dev: "Return sys_dev"
        ucp_device_detect_local_sys_dev-->>ucp_device_mem_list_params_check: "Return detected sys_dev"
        
        ucp_device_mem_list_params_check->>ucp_device_detect_local_md_map: "Build MD map from sys_dev"
        ucp_device_detect_local_md_map-->>ucp_device_mem_list_params_check: "Return local_md_map"
    end
    
    ucp_device_mem_list_params_check-->>ucp_device_mem_list_create: "Validation complete"
    
    ucp_device_mem_list_create->>ucp_device_mem_list_create_handle: "Create handle with params"
    
    loop For each element without memh
        ucp_device_mem_list_create_handle->>ucp_device_detect_uct_memh: "Detect uct_memh from local_addr"
        alt local_addr exists
            ucp_device_detect_uct_memh-->>ucp_device_mem_list_create_handle: "Return detected uct_memh"
        else local_addr is NULL
            ucp_device_detect_uct_memh-->>ucp_device_mem_list_create_handle: "Return UCS_ERR_NO_ELEM"
        end
    end
    
    ucp_device_mem_list_create_handle-->>ucp_device_mem_list_create: "Return created handle"
    ucp_device_mem_list_create-->>User: "Return handle"
Loading

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest changes refactor parameter validation and fallback logic for detecting local system device information when memh is not provided. The validation function ucp_device_mem_list_params_check was simplified to make length, memh, and local_addr optional (only rkey is strictly required), and a new helper ucp_device_extract_memh was introduced to extract metadata from memh when available. The core change in ucp_device_mem_list_create now uses ucp_device_detect_local_sys_dev as a fallback mechanism when memh is NULL, allocating a temporary CUDA buffer to infer the local system device ID. Test updates in test_ucp_device.cc now populate field_mask more precisely to reflect which fields are actually provided for counter versus data elements, aligning with the new optional-field semantics.

Important Files Changed

Filename Score Overview
src/ucp/core/ucp_device.c 4/5 Refactored validation to make memh/local_addr optional; added fallback detection for local_sys_dev using temporary CUDA buffer allocation
test/gtest/ucp/test_ucp_device.cc 5/5 Updated test setup to correctly set field_mask for counter elements (omitting memh/local_addr) vs data elements

Confidence score: 4/5

  • This PR is safe to merge with minor risk; the changes are well-scoped and the fallback detection logic is sound.
  • Score reflects clean design and thorough validation updates, but deducted one point because the fallback device detection path (allocating temporary CUDA buffers) is relatively new and could introduce edge-case issues on diverse GPU configurations or when CUDA contexts are not initialized as expected.
  • Pay close attention to src/ucp/core/ucp_device.c, particularly the ucp_device_detect_local_sys_dev invocation and how it handles cases where no CUDA context is available or the temporary allocation fails.

2 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest changes address previous reviewer feedback by refactoring the parameter validation and local device detection logic in src/ucp/core/ucp_device.c. The developer extracted the memh parameter extraction into a dedicated helper function ucp_device_extract_memh to improve code clarity and eliminated the intermediate have_memh variable by directly checking local_sys_dev against UCS_SYS_DEVICE_ID_UNKNOWN. The ucp_device_detect_local_sys_dev function now accepts a mem_type parameter to make it more generic and only sets output parameters upon success. These structural improvements align with the PR's goal of making memh and local_addr optional for counter elements by cleanly separating the detection and validation paths.

Important Files Changed

Filename Score Overview
src/ucp/core/ucp_device.c 3/5 Refactored parameter extraction into ucp_device_extract_memh helper, updated ucp_device_detect_local_sys_dev to accept mem_type parameter, and removed redundant tracking variables

Confidence score: 3/5

  • This PR requires careful review due to assumptions about CUDA memory types and NULL memh handling in counter-only operations
  • Score reflects concerns about (1) lack of validation that multiple memh entries have matching memory types, (2) hardcoded CUDA memory type assumption when memh is absent, (3) UCT layer's ability to handle NULL memh in all code paths, and (4) the md_map derivation logic change from intersection-based to detection-based approach
  • Pay close attention to lines 251-261 where local_sys_dev detection occurs for NULL memh cases, and lines 479-491 where uct_memh may be set to UCT_MEM_HANDLE_NULL and passed to uct_iface_mem_element_pack

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest update applies several refactoring improvements based on prior feedback. The developer has moved field-mask initialization to a helper function (ucp_device_get_common_field_mask), simplified error-checking logic by consolidating validation into ucp_device_mem_list_params_check, removed the redundant offset variable from the class field, and improved naming conventions (renaming enums and refining local variables). The code now sets mem_type consistently and handles optional memh more gracefully by defaulting to UCS_MEMORY_TYPE_UNKNOWN when detection fails. The primary goal remains to eliminate unnecessary local memory registration overhead for counter-only elements in ucp_device_mem_list_create, but the changes focus on cleaner, more maintainable structure rather than new functional behavior.

Important Files Changed

Filename Score Overview
src/ucp/core/ucp_device.c 4/5 Refactored element initialization logic, removed redundant field, improved error handling, and consolidated field-mask creation into a helper function.

Confidence score: 4/5

  • This PR is safe to merge with minimal risk, pending final verification of detection fallback behavior.
  • Score reflects clean refactoring and comprehensive prior review cycles; minor point deducted because the fallback value for mem_type when ucp_device_detect_local_sys_dev fails is not explicitly verified in the diff (should default to UCS_MEMORY_TYPE_UNKNOWN), and edge-case handling for NULL uct_memh in counter-only paths needs runtime validation.
  • Pay close attention to src/ucp/core/ucp_device.c, particularly the new helper function ucp_device_get_common_field_mask and the modified error-path in ucp_device_detect_local_sys_dev to ensure mem_type is set correctly when detection fails.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@michal-shalev michal-shalev merged commit fb479dd into openucx:master Oct 27, 2025
143 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants