[WIP] Implement RDMA P2P weight update using TransferEngine #1164
JD-ETH wants to merge 11 commits into THUDM:main from
Conversation
def register_memory_region_v1(named_param_with_buffers: Sequence[tuple[str, torch.Tensor]], transfer_engine):
    weight_mr_dict = {}
    for name, weight in named_param_with_buffers:
        ret = transfer_engine.register(weight.data_ptr(), weight.numel() * weight.element_size())
Maybe we can use transfer_engine.batch_register_memory instead?
I will remove this function; we only use the efficient registration version below.
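For reference, a minimal sketch of what the batched variant could look like. The batch_register_memory signature used here (parallel lists of addresses and lengths, returning 0 on success) is an assumption about the TransferEngine API, so a stub engine stands in to keep the example self-contained:

```python
# Sketch only: StubTransferEngine and the batch_register_memory signature are
# assumptions, not the real TransferEngine API.

class StubTransferEngine:
    """Records registrations; a stand-in for the real TransferEngine."""
    def __init__(self):
        self.calls = 0
        self.registered = []  # (addr, length) pairs

    def batch_register_memory(self, addrs, lengths):
        self.calls += 1
        self.registered.extend(zip(addrs, lengths))
        return 0  # 0 = success, mirroring the per-tensor register() above


def register_memory_region(named_params, engine):
    """Register all weights in one batched call instead of N separate calls."""
    names, addrs, lengths = [], [], []
    for name, (addr, nbytes) in named_params:
        names.append(name)
        addrs.append(addr)
        lengths.append(nbytes)
    ret = engine.batch_register_memory(addrs, lengths)
    assert ret == 0, "batch registration failed"
    return dict(zip(names, zip(addrs, lengths)))


engine = StubTransferEngine()
# Pretend weights: (name, (data_ptr, numel * element_size))
params = [("w1", (0x1000, 4096)), ("w2", (0x3000, 8192))]
mr = register_memory_region(params, engine)
print(engine.calls, len(mr))  # one engine call covering both tensors
```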
Regarding rollout TP: the tensor slice for the down-projection layer is non-contiguous in physical memory relative to the all-gathered full parameters on the training side. How should we handle this for efficient remote transfer (e.g., RDMA)?
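To illustrate the question: for a row-major weight that is TP-sharded along its last dimension, each rank's shard maps to one disjoint byte segment per row of the all-gathered tensor, not a single contiguous region, so a naive RDMA read cannot fetch it in one shot. A small sketch (shapes and dtype are illustrative, not Qwen3-4B's):

```python
# Why a TP column shard is non-contiguous in the all-gathered weight.
# For a row-major [rows, cols] fp16 matrix sharded along the column dim,
# rank 0's shard covers a strided byte range in every row.

ELEM = 2                    # bytes per fp16 element (illustrative dtype)
rows, cols, tp = 4, 8, 2
shard_cols = cols // tp

def shard_byte_segments(rank):
    """Contiguous byte segments of rank's column shard inside the full tensor."""
    segs = []
    for r in range(rows):
        start = (r * cols + rank * shard_cols) * ELEM
        segs.append((start, start + shard_cols * ELEM))
    return segs

segs = shard_byte_segments(rank=0)
print(len(segs))  # one segment per row, not a single contiguous region

# Sharding the *first* dim instead would give one contiguous segment:
row_shard = (0, (rows // tp) * cols * ELEM)
```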
With the current design, we force the source side to construct a model replica that has exactly the same shape as the rollout side.
Doesn't creating a rollout replica occupy more GPU memory (VRAM)?
Yep. The focus of the design is to avoid costly registration/deregistration with the transfer engine. We will add CPU offloading while keeping the virtual memory registration.
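A minimal sketch of the offloading idea, using a host mmap region as a stand-in for the registered memory: the virtual range is allocated (and, in the real design, registered) exactly once, and offload/restore only move bytes in and out of that fixed range, so no re-registration is needed. This is an illustration of the concept, not the actual implementation:

```python
# Sketch only: an anonymous mmap stands in for the registered memory region;
# the real design concerns GPU/pinned host memory and a transfer engine MR.
import mmap

region = mmap.mmap(-1, 4096)           # allocated once; would be registered once

weights_on_disk = b"\x01" * 4096       # the "offloaded" copy (CPU/disk side)

# Offload: copy weights out, then reuse the region (here: zero it).
region.seek(0)
region.write(b"\x00" * 4096)

# Restore: copy the offloaded bytes back into the *same* virtual range.
# The address is unchanged, so the original registration remains valid.
region.seek(0)
region.write(weights_on_disk)

region.seek(0)
print(region.read(4) == b"\x01\x01\x01\x01")  # True: restored in place
```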
I wonder if using virtual memory registration may degrade performance, because the NIC only reads physical addresses?
We will reopen once we hit our design target, stay tuned.
Great work! Will this support DSv3 / Kimi K2 and replicate https://abcdabcd987.com/2025/09/17/rl-weight-transfer-2/?
We intend to test Kimi K2 before releasing. We aim for performance numbers similar to Perplexity's, but we are not targeting AWS infra.
@JD-ETH Incredible! Looking forward to it.

This is a WIP PR that demonstrates an efficient implementation of remote weight updates through RDMA and TransferEngine, instead of through NCCL.
It is tested with sglang 0.5.6.post2.

We reuse the sglang engine memory registration introduced in remote_instance_engine_info. To reduce registration overhead, we propose a solution with an engine replica on the trainer side to enable zero-copy, single-registration weight transfer.
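A toy sketch of the single-registration replica idea, with bytearrays standing in for GPU buffers: allocate buffers matching the rollout engine's parameter shapes once, record (register) them once, and copy every subsequent weight update into them in place, so the registered addresses never change. All names, shapes, and helpers below are illustrative, not the PR's actual code:

```python
# Sketch only: bytearrays stand in for GPU tensors, and "registration" is
# just recording the buffer; the real design registers with TransferEngine.

registrations = []          # (name, buffer) pairs registered exactly once

def build_replica(rollout_shapes, elem_size=2):
    """Allocate stable buffers with the rollout side's exact shapes."""
    replica = {}
    for name, shape in rollout_shapes.items():
        n = elem_size
        for d in shape:
            n *= d
        buf = bytearray(n)                  # stable backing memory
        replica[name] = buf
        registrations.append((name, buf))   # register(addr, len) once here
    return replica

def push_update(replica, new_weights):
    """Copy a weight update into the already-registered buffers, in place."""
    for name, payload in new_weights.items():
        buf = replica[name]
        assert len(payload) == len(buf), "shape mismatch with rollout side"
        buf[:] = payload                    # address unchanged: zero re-registration

shapes = {"layer0.down_proj": (4, 8)}       # illustrative shape
replica = build_replica(shapes)
push_update(replica, {"layer0.down_proj": b"\x07" * 64})
push_update(replica, {"layer0.down_proj": b"\x08" * 64})
print(len(registrations))   # registered once, updated twice
```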
The goal of this PR is to demonstrate a first working example using Qwen3-4B with TP on SGLang and TP on the Megatron trainer.
CUDA_VISIBLE_DEVICES="1,2" python ./tests/test_weight_transfer.py --mode=rdma runs a minimal E2E example.