
[WIP] Implement RDMA P2P weight update using TransferEngine #1164

Closed
JD-ETH wants to merge 11 commits into THUDM:main from JD-ETH:jd/rdma-integration

Conversation

@JD-ETH
Contributor

JD-ETH commented Dec 20, 2025

This is a WIP PR that demonstrates an efficient implementation of remote weight updates through RDMA and TransferEngine, instead of through NCCL.

It is tested with SGLang 0.5.6.post2.

We reuse the SGLang engine memory registration introduced here: remote_instance_engine_info. To reduce registration overhead, we propose maintaining an engine replica on the trainer side, enabling zero-copy, single-registration weight transfer.

The goal of this PR is to demonstrate a first working example using Qwen3-4B with TP on SGLang and TP on the Megatron trainer.

`CUDA_VISIBLE_DEVICES="1,2" python ./tests/test_weight_transfer.py --mode=rdma` runs a minimal E2E example.

```python
from typing import Sequence

import torch

def register_memory_region_v1(named_param_with_buffers: Sequence[tuple[str, torch.Tensor]], transfer_engine):
    weight_mr_dict = {}
    for name, weight in named_param_with_buffers:
        # Per-tensor registration: one engine call per weight.
        ret = transfer_engine.register(weight.data_ptr(), weight.numel() * weight.element_size())
        weight_mr_dict[name] = ret
    return weight_mr_dict
```
Collaborator

Maybe we can use transfer_engine.batch_register_memory?

Contributor Author

I will remove this function; we only use the efficient registration version below.
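The "efficient registration version" is not shown in this excerpt. A minimal sketch of the single-registration idea from the PR description, assuming the same `transfer_engine.register(ptr, nbytes)` signature used in the snippet above (the packing helper and its layout are illustrative, not the PR's actual code):

```python
import torch

def register_packed(named_params, transfer_engine, device="cpu"):
    """Illustrative: pack all weights into one contiguous byte buffer and
    register it with the transfer engine exactly once, instead of issuing
    one registration per tensor."""
    total = sum(p.numel() * p.element_size() for _, p in named_params)
    flat = torch.empty(total, dtype=torch.uint8, device=device)
    offsets, cursor = {}, 0
    for name, p in named_params:
        nbytes = p.numel() * p.element_size()
        # Byte-view each tensor and copy it into its slot in the flat buffer.
        flat[cursor:cursor + nbytes] = p.detach().contiguous().view(torch.uint8).reshape(-1)
        offsets[name] = (cursor, nbytes)
        cursor += nbytes
    transfer_engine.register(flat.data_ptr(), total)  # single registration
    return flat, offsets
```

The `offsets` map lets the receiver address any named weight inside the one registered region, which is what makes a single registration sufficient.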

@lilei199908
Collaborator

Regarding rollout TP: the tensor slice for the down-projection layer is non-contiguous in physical memory relative to the all-gathered full parameters on the training side. How should we handle this for efficient remote transfer (e.g., RDMA)?

@JD-ETH
Contributor Author

JD-ETH commented Dec 23, 2025

With the current design, we force the source side to construct a model replica that has the exact shapes of the rollout side.
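To make the non-contiguity concrete, here is a framework-free sketch (shapes illustrative, bf16 assumed) of why a TP column-shard of the down-projection weight cannot be read as one contiguous range from the all-gathered tensor:

```python
ITEMSIZE = 2  # bytes per bf16 element (assumption for illustration)

def row_major_strides(shape, itemsize=ITEMSIZE):
    """Byte strides of a C-contiguous tensor with the given shape."""
    strides, acc = [], itemsize
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

full_shape = (4096, 12288)           # all-gathered weight on the trainer
full_strides = row_major_strides(full_shape)

# Rollout TP=4 shards the input (column) dimension: each rank's slice is a
# view that inherits the parent's row stride, so its rows sit apart in memory.
shard_shape = (4096, 12288 // 4)
shard_strides = full_strides

contiguous = shard_strides == row_major_strides(shard_shape)
print(contiguous)  # False: one RDMA read per row, or a copy into a replica
```

A trainer-side replica materializes each shard in its exact rollout layout, so every shard becomes one contiguous, once-registered buffer that a single RDMA read can cover.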

@lilei199908
Collaborator

> with the current design, we force the source side to construct a model replica that has the exact shape as the rollout side.

Doesn't creating a rollout replica occupy more GPU memory (VRAM)?

@JD-ETH
Contributor Author

JD-ETH commented Dec 24, 2025

Yep. The focus of the design is to avoid costly registration/deregistration in the transfer engine. We will add CPU offloading while keeping the virtual memory registration.
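A minimal sketch of what "CPU offloading while keeping the registration" could look like, assuming the same `transfer_engine.register` signature as the snippet above (class and method names are hypothetical, not the PR's code): weights are staged in one pinned host buffer whose address never changes, so the registration stays valid across offload cycles.

```python
import torch

class OffloadedWeights:
    """Illustrative: stage weights in one page-locked host buffer that is
    registered once; its address is stable across offload cycles, so the
    NIC's registered mapping never has to be torn down and rebuilt."""

    def __init__(self, named_params, transfer_engine):
        total = sum(p.numel() * p.element_size() for _, p in named_params)
        # Pinned host memory is both RDMA-registerable and DMA-able.
        self.host = torch.empty(
            total, dtype=torch.uint8, pin_memory=torch.cuda.is_available()
        )
        transfer_engine.register(self.host.data_ptr(), total)  # once, at init
        self.offsets, cursor = {}, 0
        for name, p in named_params:
            n = p.numel() * p.element_size()
            self.offsets[name] = (cursor, n)
            cursor += n

    def offload(self, named_params):
        """Copy the current weights into the already-registered buffer."""
        for name, p in named_params:
            off, n = self.offsets[name]
            self.host[off:off + n] = (
                p.detach().contiguous().view(torch.uint8).reshape(-1).cpu()
            )
```

The trade-off the thread discusses still applies: the registration survives offload, but whether the NIC can serve reads from host memory at the same rate as device memory depends on the transport.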

@lilei199908
Collaborator

> Yep. The focus of the design is to avoid costly registration/deregistration in the transfer engine. We will add CPU offloading while keeping the virtual memory registration.

I wonder if using virtual memory registration may degrade performance, because the NIC only reads physical addresses?

@JD-ETH
Contributor Author

JD-ETH commented Jan 4, 2026

We will reopen once we hit our design target. Stay tuned.

@JD-ETH JD-ETH closed this Jan 4, 2026
@vwxyzjn

vwxyzjn commented Jan 8, 2026

Great work! Will this support DSv3 / Kimi K2 and replicate https://abcdabcd987.com/2025/09/17/rl-weight-transfer-2/ ?

@JD-ETH
Contributor Author

JD-ETH commented Jan 9, 2026

We intend to test Kimi K2 before releasing. We aim for performance numbers similar to Perplexity's, but we are not targeting AWS infra.

@vwxyzjn

vwxyzjn commented Jan 9, 2026

@JD-ETH incredible! Looking forward to it.



4 participants