Support non-disturbing remote instance weight loader v2 #14997
Conversation
Code Review
This pull request introduces a new TransferEngine backend for remote instance weight loading, which is a great addition for improving instance startup time without disturbing ongoing inference. The implementation is well-structured, with changes spanning documentation, configuration, server entrypoints, and the model loading logic. The use of torch.cuda.memory.memory_snapshot for optimizing memory registration is particularly clever.
My review includes a few suggestions to improve documentation clarity, fix a type hint, remove unused code, correct a logic issue in a utility function, and fix a typo in the tests. Overall, this is a solid contribution.
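The `torch.cuda.memory.memory_snapshot` optimization mentioned above can be sketched as follows: instead of registering every weight tensor with the RDMA engine separately, the allocator snapshot is used to derive a small number of contiguous address ranges to register. The coalescing helper and the fake snapshot below are illustrative assumptions, not SGLang's actual implementation; only the snapshot entry keys (`address`, `total_size`, `device`) follow PyTorch's documented snapshot format.

```python
# Hedged sketch: coalescing CUDA allocator segments (as reported by
# torch.cuda.memory.memory_snapshot()) into maximal contiguous ranges,
# so an RDMA engine registers a few large regions instead of many tensors.

def coalesce_segments(snapshot, device=0):
    """Merge adjacent allocator segments into maximal contiguous ranges."""
    segs = sorted(
        (s["address"], s["total_size"])
        for s in snapshot
        if s.get("device", 0) == device
    )
    ranges = []
    for addr, size in segs:
        if ranges and ranges[-1][0] + ranges[-1][1] == addr:
            # Previous range ends exactly where this segment starts: extend it.
            ranges[-1][1] += size
        else:
            ranges.append([addr, size])
    return [(a, sz) for a, sz in ranges]

# Example with a fake snapshot (two touching segments plus one separate one):
fake = [
    {"address": 0x1000, "total_size": 0x1000, "device": 0},
    {"address": 0x2000, "total_size": 0x2000, "device": 0},
    {"address": 0x9000, "total_size": 0x1000, "device": 0},
]
print(coalesce_segments(fake))  # [(4096, 12288), (36864, 4096)]
```

With a real snapshot, each returned `(address, length)` pair would be handed to the engine's memory-registration call once, keeping registration count low.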
# }
# )
# }
remote_instance_transfer_engine_info: Optional[Dict] = None
Can you drop this?
We can parse from scheduler_info directly.
For now, we only changed the return value of launch_subprocesses from a single scheduler_info to a list of scheduler_info. When it is stored in global_state, only the first scheduler_info item is kept, as before.
Would you suggest storing the whole list of scheduler_info in global_state as well? I'm a little concerned that it would put too much redundant scheduler info in global state.
This commit supports a non-disturbing mode for loading weights from remote instances. Previously, SGLang already supported loading weights from remote instances using torch.distributed with the NCCL backend; however, that approach disturbs ongoing inference requests, since torch.distributed launches CUDA kernels to transfer weight tensors. This commit introduces another backend option, TransferEngine, which transfers weight tensors over RDMA without disturbing any ongoing GPU workload. Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
…14997) Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
Hi, may I ask why "Memory saver is not compatible with TransferEngine"? I tried to use TransferEngine to transfer weights with memory saver enabled, but got an error in register_memory_region(): "Transfer Engine does not support overlapped memory region". Is that the only reason you mean by "not compatible"? Is there a way to work around this problem?
TransferEngine does not work with memory-saver-managed VRAM; it requires the physical memory backing a registration to stay fixed.
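A minimal sketch of the failure mode quoted above: an RDMA registration table that rejects overlapping address ranges, which is what memory saver's re-mapped VRAM runs into. The `MemoryRegistry` class and the overlap check are hypothetical, for illustration only; they are not the real TransferEngine API.

```python
# Illustrative stub of an RDMA engine's registration table that rejects
# overlapping memory regions, mirroring the quoted error message.

def overlaps(a, b):
    """True if half-open ranges (start, length) a and b intersect."""
    a_start, a_len = a
    b_start, b_len = b
    return a_start < b_start + b_len and b_start < a_start + a_len

class MemoryRegistry:
    def __init__(self):
        self.regions = []

    def register_memory_region(self, addr, length):
        for region in self.regions:
            if overlaps(region, (addr, length)):
                raise ValueError(
                    "Transfer Engine does not support overlapped memory region"
                )
        self.regions.append((addr, length))

reg = MemoryRegistry()
reg.register_memory_region(0x1000, 0x1000)   # ok
reg.register_memory_region(0x2000, 0x1000)   # ok: adjacent, not overlapping
try:
    reg.register_memory_region(0x1800, 0x100)  # intersects the first region
except ValueError as e:
    print(e)  # Transfer Engine does not support overlapped memory region
```

When memory saver releases and later re-maps VRAM, a new mapping can land inside an already-registered range, triggering exactly this rejection.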
Motivation
In #8215, SGLang already supported a new load format, remote_instance, which allows a new instance to load weights from another running instance. This approach can greatly improve weight loading time during instance initialization. However, since it uses torch.distributed with the NCCL backend, it disturbs ongoing inference requests: torch.distributed always launches CUDA kernels to transfer weight tensors.
We introduce another backend option, TransferEngine, which does not disturb any GPU workload and still uses RDMA to transfer weights.
Modifications
We initialize one TransferEngine for each ModelRunner and register its weights to the RDMA channel during initialization.
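The per-ModelRunner initialization described above might look roughly like this sketch. `TransferEngine` here is a local stub, and the `register_memory_region` signature is an assumption for illustration; neither is SGLang's actual class or API.

```python
# Hedged sketch: one TransferEngine per ModelRunner, with every weight's
# address range registered once at init, so later RDMA reads from a client
# instance never launch CUDA kernels on the seed instance.

class TransferEngine:
    """Stub standing in for a real RDMA transfer engine."""
    def __init__(self):
        self.regions = []

    def register_memory_region(self, addr, length):
        # A real engine would pin this range and register it with the NIC.
        self.regions.append((addr, length))


class ModelRunner:
    def __init__(self, named_weights):
        # named_weights maps parameter name -> (device address, size in bytes).
        self.named_weights = dict(named_weights)
        self.engine = TransferEngine()  # one engine per runner
        for name, (addr, nbytes) in self.named_weights.items():
            self.engine.register_memory_region(addr, nbytes)


runner = ModelRunner({
    "embed.weight": (0x1000, 4096),
    "lm_head.weight": (0x2000, 8192),
})
print(len(runner.engine.regions))  # 2
```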
When initializing a new instance that wants to use the `remote_instance` load format with the TransferEngine backend:

How to use:

seed instance:

```
python -m sglang.launch_server [args] \
  --remote-instance-weight-loader-support-transfer-engine
```

client instance:

```
python -m sglang.launch_server [args] \
  --load-format remote_instance \
  --remote-instance-weight-loader-seed-instance-ip [seed_instance_ip] \
  --remote-instance-weight-loader-seed-instance-service-port [seed_instance_service_port] \
  --remote-instance-weight-loader-backend "transfer_engine"
```

Checklist