libfabric: Use desc-specific target offset #883
base: main
👋 Hi tsg-! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
2d45ba0 to b948f0f
This fixes a bug in multi-descriptor transfers where descriptors point to different offsets within the same registered memory region. Without this fix, RDMA reads always target offset 0. Each descriptor's specific target address should be extracted instead. Also impacted: block-based transfers (iteration N would read blocks from iteration 0, etc.), partial buffer updates, etc.
Signed-off-by: Tushar Gohad <[email protected]>
b948f0f to 005d3ec
@akkart-aws @yexiang-aws we'll push some focused tests for the failing scenarios if it helps with this review. Thank you!
/build
Approving from my side, but it will still need code owner approval from the AWS team.
@akkart-aws @yexiang-aws any comments on this change?
/ok to test 005d3ec
What?
This fixes a bug in multi-descriptor transfers where descriptors point to different offsets within the same registered memory region. The bug caused all descriptors to incorrectly use the base address of the registration (remote_md->remote_buf_addr_) instead of each descriptor's specific offset address (remote[desc_idx].addr).
Impact: block-based transfers (iteration N would read blocks from iteration 0, etc.), scatter-gather operations, and partial buffer updates.
Why?
Without this fix, RDMA reads always target offset 0 of the registered region, regardless of which descriptor is being transferred. Each descriptor's specific target address should be used instead.
Example test case:
How?
After fix: Each descriptor uses remote[desc_idx].addr (its specific target offset).
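To illustrate the difference, here is a minimal, self-contained sketch of the target-address selection before and after the change. It is not the actual NIXL or libfabric code: Desc, RemoteMd, blockDescs, targetsBeforeFix, and targetsAfterFix are hypothetical stand-ins, and only the field names remote_buf_addr_ and remote[desc_idx].addr are taken from the description above. No real RDMA calls are made; the functions only compute which remote address each descriptor would target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a transfer descriptor (not the real NIXL type).
struct Desc {
    std::uint64_t addr;  // absolute address of this descriptor's buffer
    std::size_t len;     // length of the buffer in bytes
};

// Hypothetical stand-in for the remote registration metadata.
struct RemoteMd {
    std::uint64_t remote_buf_addr_;  // base address of the registered region
};

// Build `count` descriptors for consecutive blocks inside one registered region.
std::vector<Desc> blockDescs(std::uint64_t base, std::size_t block_len, std::size_t count) {
    assert(block_len > 0);
    std::vector<Desc> descs;
    for (std::size_t i = 0; i < count; ++i) {
        descs.push_back(Desc{base + i * block_len, block_len});
    }
    return descs;
}

// Behavior described as the bug: every descriptor targets the registration base.
std::vector<std::uint64_t> targetsBeforeFix(const RemoteMd &remote_md,
                                            const std::vector<Desc> &remote) {
    std::vector<std::uint64_t> targets;
    for (std::size_t desc_idx = 0; desc_idx < remote.size(); ++desc_idx) {
        targets.push_back(remote_md.remote_buf_addr_);  // always offset 0
    }
    return targets;
}

// Behavior described as the fix: each descriptor targets its own address.
std::vector<std::uint64_t> targetsAfterFix(const std::vector<Desc> &remote) {
    std::vector<std::uint64_t> targets;
    for (std::size_t desc_idx = 0; desc_idx < remote.size(); ++desc_idx) {
        targets.push_back(remote[desc_idx].addr);  // descriptor-specific target
    }
    return targets;
}
```

With three 0x100-byte blocks starting at an assumed base of 0x1000, the pre-fix logic yields the base address three times, while the post-fix logic yields 0x1000, 0x1100, and 0x1200, which matches the "iteration N would read blocks from iteration 0" symptom described above.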