
support non disturbing remote instance weight loader v2 (#14997)

Merged
merrymercy merged 5 commits into sgl-project:main from
amysaq2023:amy/non-disturbing-remote-instance-weight-loader-v2
Dec 16, 2025

Conversation

Contributor

@amysaq2023 commented Dec 12, 2025

Motivation

In #8215, SGLang already supports a new load format, remote_instance, which allows a new instance to load weights from another running instance. This approach can greatly reduce weight loading time during instance initialization. However, since it uses torch.distributed with NCCL as the backend, it disturbs ongoing inference requests: torch.distributed always launches CUDA kernels to transfer the weight tensors.

We add another backend option, TransferEngine, which does not disturb any GPU workload while still using RDMA to transfer weights.

Modifications

We initialize one TransferEngine per ModelRunner and register its weights with the RDMA channel during initialization.

When a new instance initializes with the remote_instance load format and the TransferEngine backend:

  1. It will send an HTTP request to retrieve the source instance's TransferEngine metadata, including RDMA keys mapped to the corresponding GPU memory addresses.
  2. Using these RDMA keys, the new instance directly loads weights from the source's GPU memory.
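The two steps above can be sketched with a small stand-alone example. This is only an illustration of the flow: the payload layout, field names, and helper functions below are assumptions, not SGLang's actual API, and the real transfer is performed by mooncake's TransferEngine rather than this pure-Python stand-in.

```python
# Hypothetical sketch of the client-side flow: parse the seed instance's
# RDMA metadata (step 1), then plan one RDMA read per weight tensor (step 2).

def parse_transfer_engine_info(payload: dict) -> dict:
    """Map parameter names to (remote_addr, length, rkey) triples."""
    return {
        name: (entry["addr"], entry["len"], entry["rkey"])
        for name, entry in payload["weights"].items()
    }

def plan_rdma_reads(info: dict, local_addrs: dict) -> list:
    """One read op per tensor: (local_addr, remote_addr, length, rkey)."""
    return [
        (local_addrs[name], addr, length, rkey)
        for name, (addr, length, rkey) in info.items()
    ]

# Example payload as the seed instance might return it over HTTP (step 1).
payload = {
    "weights": {
        "model.embed_tokens.weight": {"addr": 0x7F0000000000, "len": 4096, "rkey": 17},
        "lm_head.weight": {"addr": 0x7F0000100000, "len": 4096, "rkey": 17},
    }
}
info = parse_transfer_engine_info(payload)
plan = plan_rdma_reads(info, {
    "model.embed_tokens.weight": 0x10000,
    "lm_head.weight": 0x20000,
})
print(len(plan))  # one planned RDMA read per registered tensor
```

Because the reads go straight from the seed's registered GPU memory over RDMA, no CUDA kernels are launched on the seed instance.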

How to use:
seed instance:
python -m sglang.launch_server [args] \
--remote-instance-weight-loader-support-transfer-engine

client instance:
python -m sglang.launch_server [args] \
--load-format remote_instance \
--remote-instance-weight-loader-seed-instance-ip [seed_instance_ip] \
--remote-instance-weight-loader-seed-instance-service-port [seed_instance_service_port] \
--remote-instance-weight-loader-backend "transfer_engine" \
--remote-instance-weight-loader-support-transfer-engine

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @amysaq2023, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's R-Fork functionality by integrating a TransferEngine backend for remote instance weight loading. The primary goal is to enable non-disruptive, high-performance weight transfers using RDMA, thereby preventing interference with active GPU workloads during instance initialization. This improvement aims to maintain system stability and efficiency while accelerating the boot-up process for new SGLang instances.

Highlights

  • Non-Disturbing Remote Weight Loading: Introduced a new backend option, TransferEngine, for the remote_instance weight loader, designed to load model weights from a running SGLang instance without disturbing ongoing GPU inference requests. This addresses a limitation of the existing NCCL backend which can cause disruptions.
  • TransferEngine Implementation: The TransferEngine leverages RDMA (Remote Direct Memory Access) for efficient, zero-copy weight transfer. It involves initializing a TransferEngine for each ModelRunner and registering its weights to an RDMA channel. Client instances retrieve RDMA keys and GPU memory addresses via HTTP from the source instance and then directly load weights.
  • Configuration and Usage: New command-line arguments --remote-instance-weight-loader-backend (with choices transfer_engine or nccl) and --remote-instance-weight-loader-support-transfer-engine have been added to enable and configure the new backend. Documentation for R-Fork has been updated to include detailed usage instructions for the TransferEngine.
  • API and Internal Changes: The LoadConfig and ServerArgs classes have been extended to support TransferEngine specific parameters. The HTTP server now exposes a new endpoint /get_remote_instance_transfer_engine_info to allow client instances to fetch necessary metadata. Core components like ModelRunner, Scheduler, and TPWorker have been modified to integrate TransferEngine initialization, memory registration, and weight transfer logic.
  • Compatibility and Testing: Checks are in place to ensure mooncake.engine (required for TransferEngine) is available and that TransferEngine is not used when memory saver is enabled, as they are incompatible. Unit tests for remote weight loading have been updated to include the transfer_engine backend as a test case.
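The compatibility check described in the last bullet could look roughly like the following. The function name and error messages are hypothetical stand-ins, not SGLang's actual code; only the two conditions (mooncake availability, memory-saver incompatibility) come from the PR.

```python
# Hypothetical sketch of a fail-fast compatibility check for the
# TransferEngine backend; names and messages are illustrative only.
import importlib.util

def check_transfer_engine_available(enable_memory_saver: bool) -> None:
    """Raise early when the TransferEngine backend cannot be used."""
    if enable_memory_saver:
        # RDMA registration needs the physical memory backing to stay fixed;
        # the memory saver virtualizes it, so the two are incompatible.
        raise RuntimeError("TransferEngine is not compatible with the memory saver.")
    if importlib.util.find_spec("mooncake") is None:
        raise RuntimeError("TransferEngine requires the mooncake package (mooncake.engine).")
```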

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new TransferEngine backend for remote instance weight loading, which is a great addition for improving instance startup time without disturbing ongoing inference. The implementation is well-structured, with changes spanning documentation, configuration, server entrypoints, and the model loading logic. The use of torch.cuda.memory.memory_snapshot for optimizing memory registration is particularly clever.

My review includes a few suggestions to improve documentation clarity, fix a type hint, remove unused code, correct a logic issue in a utility function, and fix a typo in the tests. Overall, this is a solid contribution.
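The memory_snapshot-based optimization mentioned above can be illustrated with a small stand-alone sketch: coalesce the CUDA caching allocator's segments into non-overlapping regions so each region is registered with the RDMA engine exactly once. The segment dicts below only mimic the shape of torch.cuda.memory_snapshot() entries ('address', 'total_size'); the actual registration call is omitted, and this is not SGLang's real implementation.

```python
# Illustrative: merge adjacent/overlapping allocator segments into the
# minimal set of (start, end) regions to register for RDMA.

def coalesce_segments(segments):
    """Merge adjacent or overlapping (address, size) ranges into regions."""
    spans = sorted((s["address"], s["address"] + s["total_size"]) for s in segments)
    regions = []
    for start, end in spans:
        if regions and start <= regions[-1][1]:  # touches or overlaps previous
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions

# Fake snapshot: two adjacent segments and one separate segment.
snapshot = [
    {"address": 0x1000, "total_size": 0x1000},
    {"address": 0x2000, "total_size": 0x1000},  # adjacent to the first
    {"address": 0x9000, "total_size": 0x0800},
]
regions = coalesce_segments(snapshot)
print(regions)  # two registration regions instead of three
```

Registering coalesced regions also sidesteps "overlapped memory region" errors from registering the same physical range twice.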

@amysaq2023 force-pushed the amy/non-disturbing-remote-instance-weight-loader-v2 branch from cb38eea to 6748f6b on December 12, 2025 15:13
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
@amysaq2023 force-pushed the amy/non-disturbing-remote-instance-weight-loader-v2 branch from 6748f6b to 00c82b9 on December 13, 2025 00:38
@amysaq2023
Contributor Author

/rerun-failed-ci

@amysaq2023
Contributor Author

/tag-and-rerun-ci

@amysaq2023
Contributor Author

/rerun-failed-ci

@amysaq2023
Contributor Author

/rerun-stage unit-test-backend-1-gpu(0)

@github-actions
Contributor

❌ Stage unit-test-backend-1-gpu(0) doesn't support isolated runs yet.

NVIDIA stages:

  • stage-a-test-1
  • stage-b-test-small-1-gpu
  • multimodal-gen-test-1-gpu
  • multimodal-gen-test-2-gpu
  • quantization-test
  • unit-test-backend-1-gpu
  • unit-test-backend-2-gpu
  • unit-test-backend-4-gpu
  • unit-test-backend-8-gpu-h200
  • unit-test-backend-8-gpu-h20
  • unit-test-backend-8-gpu-b200
  • performance-test-1-gpu-part-1
  • performance-test-1-gpu-part-2
  • performance-test-1-gpu-part-3
  • performance-test-2-gpu
  • accuracy-test-1-gpu
  • accuracy-test-2-gpu
  • unit-test-deepep-4-gpu
  • unit-test-deepep-8-gpu
  • unit-test-backend-4-gpu-b200
  • unit-test-backend-4-gpu-gb200

AMD stages:

  • sgl-kernel-unit-test-amd
  • stage-a-test-1-amd
  • unit-test-backend-1-gpu-amd
  • unit-test-backend-2-gpu-amd
  • unit-test-backend-8-gpu-amd
  • performance-test-1-gpu-part-1-amd
  • performance-test-1-gpu-part-2-amd
  • performance-test-2-gpu-amd
  • accuracy-test-1-gpu-amd
  • accuracy-test-2-gpu-amd

Other stages will be added soon. For now, use /rerun-failed-ci for those stages.

@amysaq2023
Contributor Author

/rerun-failed-ci

4 similar comments

@amysaq2023
Contributor Author

# }
# )
# }
remote_instance_transfer_engine_info: Optional[Dict] = None
Contributor


Can you drop this?

We can parse from scheduler_info directly.

Contributor Author


For now, we only change the return value of launch_subprocesses from scheduler_info to a list of scheduler_info. When it is stored in global_state, we still keep only the first scheduler_info item, as before.

Would you suggest storing the whole list of scheduler_info in global_state as well? I'm a little concerned that it would put too much redundant scheduler info in the global state.

This commit supports a non-disturbing mode for loading weights from
remote instances.
Previously, SGLang already supported loading weights from remote
instances using torch.distributed with NCCL as the backend; however,
this approach disturbs ongoing inference requests since
torch.distributed launches CUDA kernels to transfer weight tensors.
This commit introduces another backend option, TransferEngine, which
can transfer weight tensors through RDMA without disturbing any
ongoing GPU workload.

Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
@amysaq2023
Contributor Author

/rerun-failed-ci

8 similar comments

@amysaq2023
Contributor Author

/rerun-stage unit-test-backend-4-gpu

@amysaq2023
Contributor Author

/rerun-failed-ci

10 similar comments

@merrymercy merrymercy merged commit ccc8f3b into sgl-project:main Dec 16, 2025
1022 of 1116 checks passed
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 17, 2025
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
@luphye

luphye commented Apr 2, 2026

Hi, may I ask why "Memory saver is not compatible with TransferEngine"? I tried to use TransferEngine to transfer weights with memory saver enabled, but got an error in register_memory_region(): "Transfer Engine does not support overlapped memory region". Is this the only reason for the incompatibility? Is there a way to work around this problem?

@JD-ETH
Contributor

JD-ETH commented Apr 2, 2026

TransferEngine does not work with the memory saver's virtualized memory; RDMA registration requires the physical memory backing to stay fixed.


Labels

documentation (Improvements or additions to documentation), run-ci

Projects

None yet

Development


5 participants