[WIP] Implement RDMA P2P weight update using TransferEngine #1164
JD-ETH wants to merge 11 commits into THUDM:main from
Conversation
def register_memory_region_v1(named_param_with_buffers: Sequence[tuple[str, torch.Tensor]], transfer_engine):
    weight_mr_dict = {}
    for name, weight in named_param_with_buffers:
        ret = transfer_engine.register(weight.data_ptr(), weight.numel() * weight.element_size())
Maybe we can use transfer_engine.batch_register_memory instead?
I will remove this function; we only use the efficient registration version below.
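For reference, a minimal sketch of what the batched variant could look like. The batch_register_memory signature used here (parallel lists of addresses and lengths, returning 0 on success) is an assumption about the TransferEngine API, so a stub engine stands in to keep the example self-contained:

```python
# Sketch only: StubTransferEngine and the batch_register_memory signature are
# assumptions, not the real TransferEngine API.

class StubTransferEngine:
    """Records registrations; a stand-in for the real TransferEngine."""
    def __init__(self):
        self.calls = 0
        self.registered = []  # (addr, length) pairs

    def batch_register_memory(self, addrs, lengths):
        self.calls += 1
        self.registered.extend(zip(addrs, lengths))
        return 0  # 0 = success, mirroring the per-tensor register() above


def register_memory_region(named_params, engine):
    """Register all weights in one batched call instead of N separate calls."""
    names, addrs, lengths = [], [], []
    for name, (addr, nbytes) in named_params:
        names.append(name)
        addrs.append(addr)
        lengths.append(nbytes)
    ret = engine.batch_register_memory(addrs, lengths)
    assert ret == 0, "batch registration failed"
    return dict(zip(names, zip(addrs, lengths)))


engine = StubTransferEngine()
# Pretend weights: (name, (data_ptr, numel * element_size))
params = [("w1", (0x1000, 4096)), ("w2", (0x3000, 8192))]
mr = register_memory_region(params, engine)
print(engine.calls, len(mr))  # one engine call covering both tensors
```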
Regarding rollout TP: the tensor slice for the down-projection layer is non-contiguous in physical memory relative to the all-gathered full parameters on the training side. How should we handle this for efficient remote transfer (e.g., RDMA)?
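To illustrate the question: for a row-major weight that is TP-sharded along its last dimension, each rank's shard maps to one disjoint byte segment per row of the all-gathered tensor, not a single contiguous region, so a naive RDMA read cannot fetch it in one shot. A small sketch (shapes and dtype are illustrative, not Qwen3-4B's):

```python
# Why a TP column shard is non-contiguous in the all-gathered weight.
# For a row-major [rows, cols] fp16 matrix sharded along the column dim,
# rank 0's shard covers a strided byte range in every row.

ELEM = 2                    # bytes per fp16 element (illustrative dtype)
rows, cols, tp = 4, 8, 2
shard_cols = cols // tp

def shard_byte_segments(rank):
    """Contiguous byte segments of rank's column shard inside the full tensor."""
    segs = []
    for r in range(rows):
        start = (r * cols + rank * shard_cols) * ELEM
        segs.append((start, start + shard_cols * ELEM))
    return segs

segs = shard_byte_segments(rank=0)
print(len(segs))  # one segment per row, not a single contiguous region

# Sharding the *first* dim instead would give one contiguous segment:
row_shard = (0, (rows // tp) * cols * ELEM)
```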
With the current design, we force the source side to construct a model replica that has exactly the same shape as the rollout side.
Doesn't creating a rollout replica occupy more GPU memory (VRAM)?
Yep. The focus of the design is to avoid costly registration/deregistration with the transfer engine. We will add CPU offloading while keeping the virtual memory registration.
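A minimal sketch of the offloading idea, using a host mmap region as a stand-in for the registered memory: the virtual range is allocated (and, in the real design, registered) exactly once, and offload/restore only move bytes in and out of that fixed range, so no re-registration is needed. This is an illustration of the concept, not the actual implementation:

```python
# Sketch only: an anonymous mmap stands in for the registered memory region;
# the real design concerns GPU/pinned host memory and a transfer engine MR.
import mmap

region = mmap.mmap(-1, 4096)           # allocated once; would be registered once

weights_on_disk = b"\x01" * 4096       # the "offloaded" copy (CPU/disk side)

# Offload: copy weights out, then reuse the region (here: zero it).
region.seek(0)
region.write(b"\x00" * 4096)

# Restore: copy the offloaded bytes back into the *same* virtual range.
# The address is unchanged, so the original registration remains valid.
region.seek(0)
region.write(weights_on_disk)

region.seek(0)
print(region.read(4) == b"\x01\x01\x01\x01")  # True: restored in place
```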
I wonder if using virtual memory registration may degrade performance, because the NIC only reads physical addresses?
We will reopen once we hit our design target, stay tuned.
Great work! Will this support DSv3 / Kimi K2 and replicate https://abcdabcd987.com/2025/09/17/rl-weight-transfer-2/?
We intend to test Kimi K2 before releasing. We aim for performance numbers similar to Perplexity's, but we are not targeting AWS infra.
@JD-ETH Incredible! Looking forward to it.

This is a WIP PR that demonstrates an efficient implementation of remote weight updates through RDMA and TransferEngine, instead of through NCCL.
It is tested with sglang 0.5.6.post2.

We reuse the sglang engine memory registration introduced in remote_instance_engine_info. To reduce registration overhead, we propose a solution with an engine replica on the trainer side to enable zero-copy, single-registration weight transfer.
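A toy sketch of the single-registration replica idea, with bytearrays standing in for GPU buffers: allocate buffers matching the rollout engine's parameter shapes once, record (register) them once, and copy every subsequent weight update into them in place, so the registered addresses never change. All names, shapes, and helpers below are illustrative, not the PR's actual code:

```python
# Sketch only: bytearrays stand in for GPU tensors, and "registration" is
# just recording the buffer; the real design registers with TransferEngine.

registrations = []          # (name, buffer) pairs registered exactly once

def build_replica(rollout_shapes, elem_size=2):
    """Allocate stable buffers with the rollout side's exact shapes."""
    replica = {}
    for name, shape in rollout_shapes.items():
        n = elem_size
        for d in shape:
            n *= d
        buf = bytearray(n)                  # stable backing memory
        replica[name] = buf
        registrations.append((name, buf))   # register(addr, len) once here
    return replica

def push_update(replica, new_weights):
    """Copy a weight update into the already-registered buffers, in place."""
    for name, payload in new_weights.items():
        buf = replica[name]
        assert len(payload) == len(buf), "shape mismatch with rollout side"
        buf[:] = payload                    # address unchanged: zero re-registration

shapes = {"layer0.down_proj": (4, 8)}       # illustrative shape
replica = build_replica(shapes)
push_update(replica, {"layer0.down_proj": b"\x07" * 64})
push_update(replica, {"layer0.down_proj": b"\x08" * 64})
print(len(registrations))   # registered once, updated twice
```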
The goal of this PR is to demonstrate a first working example using Qwen3-4B with TP on SGLang and TP on the Megatron trainer.
CUDA_VISIBLE_DEVICES="1,2" python ./tests/test_weight_transfer.py --mode=rdma runs a minimal E2E example.