@tsg- tsg- commented Oct 8, 2025

What?

This fixes a bug in multi-descriptor transfers whose descriptors point to different offsets within the same registered memory region. Every descriptor incorrectly used the base address of the registration (remote_md->remote_buf_addr_) instead of its own target address (remote[desc_idx].addr).

Impact: block-based transfers (iteration N would read the blocks from iteration 0, and so on), scatter-gather operations, and partial buffer updates.

Why?

Without this fix, RDMA reads always target offset 0 of the registered region. Each read should instead use the target address of its own descriptor.

Example test case:

  buffer = allocate_memory(1GB)
  register_memory(buffer)

  # Pass 0: Transfer blocks 0-127
  descriptors = [
      {addr: buffer + 0*block_size, len: block_size},      # Block 0
      {addr: buffer + 1*block_size, len: block_size},      # Block 1
      ...
      {addr: buffer + 127*block_size, len: block_size}     # Block 127
  ]

  # Pass 1: Transfer blocks 128-255
  descriptors = [
      {addr: buffer + 128*block_size, len: block_size},    # Block 128  .... Bug: reads block 0
      {addr: buffer + 129*block_size, len: block_size},    # Block 129  .... Bug: reads block 1
      ...
  ]
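
The scenario above can be reproduced in a small, self-contained simulation. This is only a sketch, not NIXL code: `Desc`, `read_blocks`, `pass_descs`, `BASE_ADDR`, and the tiny block size are hypothetical names and values chosen for illustration. It also assumes, as the comments in the example suggest, that the buggy path effectively addressed the i-th descriptor of a pass relative to the registration base rather than by its own `addr`.

```python
from dataclasses import dataclass

BLOCK_SIZE = 16      # tiny blocks keep the demo small; the PR example uses a 1GB buffer
NUM_BLOCKS = 256
BASE_ADDR = 0x1000   # hypothetical address of the registered region


@dataclass
class Desc:
    addr: int  # absolute target address inside the registered region
    len: int   # number of bytes to read


def make_region() -> bytes:
    # Block k is filled with byte value k, so any read reveals which block it hit.
    return b"".join(bytes([k]) * BLOCK_SIZE for k in range(NUM_BLOCKS))


def read_blocks(region: bytes, descs: list[Desc], per_desc_addr: bool) -> list[int]:
    """Return the index of the block that each descriptor actually read."""
    hit = []
    for i, d in enumerate(descs):
        if per_desc_addr:
            target = d.addr                 # fixed: honour the descriptor's own address
        else:
            target = BASE_ADDR + i * d.len  # assumed buggy behaviour: derived from the base
        off = target - BASE_ADDR
        hit.append(region[off:off + d.len][0])
    return hit


def pass_descs(first_block: int, count: int) -> list[Desc]:
    # One descriptor per block, all pointing into the same registered region.
    return [Desc(BASE_ADDR + b * BLOCK_SIZE, BLOCK_SIZE)
            for b in range(first_block, first_block + count)]


region = make_region()
pass1 = pass_descs(128, 128)  # Pass 1: blocks 128-255
buggy = read_blocks(region, pass1, per_desc_addr=False)
fixed = read_blocks(region, pass1, per_desc_addr=True)
```

Under these assumptions `buggy` starts with blocks 0 and 1, while `fixed` starts with blocks 128 and 129, which is the mismatch the example comments describe.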

How?

After the fix, each descriptor uses remote[desc_idx].addr, its own target address, rather than the base address of the registration.
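
The address selection can be sketched as follows. This illustrates the idea only and is not the code changed in this PR: `Registration`, `Desc`, `target_address`, and `offset_in_region` are hypothetical names, and the bounds check is an added assumption about what a caller might want, not something the PR is stated to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    base: int    # start address of the registered region
    length: int  # size of the region in bytes


@dataclass(frozen=True)
class Desc:
    addr: int  # target address requested by this descriptor
    len: int   # number of bytes to transfer


def target_address(reg: Registration, desc: Desc) -> int:
    """Return the address to use for one descriptor."""
    end = desc.addr + desc.len
    # Assumed safety check: the span must lie inside the registered region.
    if desc.addr < reg.base or end > reg.base + reg.length:
        raise ValueError("descriptor falls outside the registered region")
    return desc.addr  # the descriptor's own address, not reg.base


def offset_in_region(reg: Registration, desc: Desc) -> int:
    """Offset of the descriptor relative to the registration base."""
    return target_address(reg, desc) - reg.base


reg = Registration(base=0x1000, length=4096)
desc = Desc(addr=0x1000 + 512, len=256)
addr = target_address(reg, desc)
offset = offset_in_region(reg, desc)
```

With the pre-fix behaviour described above, every descriptor would have resolved to `reg.base`. Here each descriptor keeps its own address, so `offset` is 512 rather than 0.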

copy-pr-bot bot commented Oct 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot commented Oct 8, 2025

👋 Hi tsg-! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@tsg- tsg- force-pushed the per_desc_target_offset branch 2 times, most recently from 2d45ba0 to b948f0f Compare October 8, 2025 18:00
This fixes a bug in multi-descriptor transfers where descriptors
point to different offsets within the same registered memory region.

Without this fix, RDMA reads always target offset 0. Each read should
use the target address of its own descriptor instead.

Also impacted: block-based transfers (iteration N would read blocks
from iteration 0, and so on), partial buffer updates, etc.

Signed-off-by: Tushar Gohad <[email protected]>
@tsg- tsg- force-pushed the per_desc_target_offset branch from b948f0f to 005d3ec Compare October 8, 2025 18:13
@ovidiusm ovidiusm requested a review from mkhazraee October 9, 2025 08:48
Author

tsg- commented Oct 12, 2025

@akkart-aws @yexiang-aws we'll push some focused tests for the failing scenarios if it helps with this review. Thank you!

@ovidiusm
Contributor

/build

@ovidiusm
Contributor

Approving from my side, but it will still need code owner approval from the AWS team.

Author

tsg- commented Oct 14, 2025

@akkart-aws @yexiang-aws any comments on this change?

@ovidiusm
Contributor

/ok to test 005d3ec

3 participants