Support loading weights from remote instance #8215
hnyls2002 merged 2 commits into sgl-project:main from
Conversation
Summary of Changes
Hello @amysaq2023, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly optimizes the SGLang engine's startup time by introducing a novel method for loading model weights. Instead of relying on local disk or system memory, instances can now fetch weights directly from a running remote SGLang instance, utilizing high-speed network communication for a substantial performance boost.
Highlights
- New Weight Loading Mechanism: Introduced a new remote_instance load format, enabling SGLang instances to load model weights directly from an already running remote SGLang instance. This bypasses slower disk/memory I/O, leveraging high-bandwidth network interfaces for faster startup.
- Distributed Communication Setup: Implemented a dedicated torch.distributed (NCCL) communication group between the 'seed' (source) and 'client' (destination) instances. This group facilitates efficient, direct GPU-to-GPU weight transfer.
- API Extensions for Remote Loading: Added new FastAPI endpoints (/init_weights_send_group_for_remote_instance and /send_weights_to_remote_instance) to the HTTP server. These endpoints coordinate the setup of the distributed communication group and trigger the actual weight broadcasting from the seed instance.
- Performance Optimization: Demonstrated significant reductions in model loading times. For example, Deepseek-R1 model loading time was reduced from 436.15s (disk) or 57.16s (main memory) to 7.76s (remote instance).
- Enhanced Configurability: Introduced new command-line arguments (--seed-instance-url, --client-instance-id, --model-config-path) and corresponding LoadConfig parameters to provide flexible control over the remote instance weight loading process.
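The two endpoints listed above are called in sequence by the client side. As a rough illustration, the requests might be assembled as below; the endpoint paths come from this PR, but the payload field names (master_address, world_size, group_name, etc.) are assumptions for illustration, not the PR's exact schema.

```python
import json

SEED_URL = "http://seed-host:30000"  # hypothetical seed instance serving URL


def build_init_group_request(master_address, master_port, rank_offset, world_size, group_name):
    """Payload for /init_weights_send_group_for_remote_instance (field names assumed)."""
    return {
        "url": f"{SEED_URL}/init_weights_send_group_for_remote_instance",
        "json": {
            "master_address": master_address,
            "master_port": master_port,
            "rank_offset": rank_offset,
            "world_size": world_size,
            "group_name": group_name,
        },
    }


def build_send_weights_request(group_name):
    """Payload for /send_weights_to_remote_instance (field names assumed)."""
    return {
        "url": f"{SEED_URL}/send_weights_to_remote_instance",
        "json": {"group_name": group_name},
    }


if __name__ == "__main__":
    # Step 1: ask the seed to set up the communication group, then
    # step 2: trigger the actual weight broadcast over that group.
    init_req = build_init_group_request("10.0.0.1", 29500, 1, 2, "weights_group")
    send_req = build_send_weights_request("weights_group")
    print(json.dumps(init_req, indent=2))
    print(json.dumps(send_req, indent=2))
```

In a real deployment these dicts would be posted with an HTTP client; the point of the sketch is only the two-call shape: group setup first, broadcast second.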
Code Review
This pull request introduces an innovative feature to load model weights from a remote running instance, which dramatically improves engine startup times. The implementation is well-structured, touching configurations, connectors, HTTP endpoints, and the model loading logic to facilitate direct GPU-to-GPU weight transfer via torch.distributed. The overall approach is sound. I've identified a few areas for improvement, including a potential crash caused by fragile URL parsing, a type hint error, and some minor bugs in f-string formatting for error messages. My suggestions aim to enhance the robustness and maintainability of this new feature.
@amysaq2023 Shall we find someone from the community to review this?
@zhaochenyang20 If no one else can get to it, I’m happy to review this PR.
Wondering what's the use case of this new load format in practice?
Please feel free to review.
python -m sglang.launch_server \
    --model-path instance://[target_instance_ip]:[communication_group_port] \
    --tokenizer-path [local_tokenizer_path] \
    --seed-instance-url http://[target_instance_ip]:[target_instance_serving_port] \
    --load-format remote_instance \
    --client-instance-id [local_instance_id] (optional)

I think the launching command is quite strange. Why
And, what's the
zhaochenyang20
left a comment
Generally good PR. A 2-gpu unit test is needed.
Will double check whether this will affect AMD CI.
if is_in_ci():
mode = random.choice(["Engine", "Server"])
test_suits = [
(1, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, mode),
]
else:
test_suits = [
(1, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
        (1, 1, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
]
if torch.cuda.device_count() >= 4:
test_suits.extend(
[
(2, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
(1, 2, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
]
)
if torch.cuda.device_count() >= 5:
test_suits.extend(
[
(2, 2, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
(2, 2, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
]
    )
Also, add the unit test file's name into
Thanks for the advice! We have added a unit test for loading weights from a remote instance.
In this test, you initialize a server as the seed at rank 0; then, for the 1/2 new instances, you create them from rank 0.
On rank 1 or rank 2, do they run exactly the same code with dp 1 in their own process? That is to say, we cannot init an SGLang server with dp 2 this way, but instead create 2 servers with dp 1 each and get the weights from the seed?
assert (
self.server_args.dp_size == 1
), "dp_size must be 1 for init_weights_send_group_for_remote_instance"
Also, add the tests to test/srt/run_suite.py @amysaq2023
For clustered deployments, multi-node weight update efficiency needs attention. Concurrent transfer mechanisms and P2P weight distribution strategies may help.
python/sglang/srt/server_args.py
Do not worry about a long name; --remote-instance-weight-loader is clearer.
@amysaq2023 By the way, just curious about the current implementation of two calls, one is initializing a sending group, and one is triggering sending action. Why are two calls needed here?
Getting the idea from how update_weights_from_distributed is implemented, we also separate loading weights from a remote instance into two steps. Besides, we are going to add P2P port negotiation between the seed and destination instances during group initialization (in a coming PR); this needs to be done before loading weights, since the destination instance needs to learn from the seed instance which ports to use.
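The ordering constraint described above — the group (and any future port negotiation) must exist before any transfer — can be sketched as a tiny state machine. Class and method names here are illustrative stand-ins, not the PR's actual symbols, and the port negotiation is simulated.

```python
class RemoteInstanceWeightSender:
    """Toy model of the seed side's two-step protocol (names hypothetical)."""

    def __init__(self):
        self._group_ready = False
        self._negotiated_port = None

    def init_weights_send_group(self, client_port_hint=29600):
        # Step 1: set up the communication group. This is also where a future
        # P2P port negotiation would happen, so the client learns which port
        # the seed will use before any data flows.
        self._negotiated_port = client_port_hint
        self._group_ready = True
        return self._negotiated_port

    def send_weights(self, named_tensors):
        # Step 2: broadcast weights. Refuses to run before step 1, which is
        # exactly why the two HTTP endpoints are separate calls.
        if not self._group_ready:
            raise RuntimeError("call init_weights_send_group first")
        return [name for name, _ in named_tensors]


if __name__ == "__main__":
    sender = RemoteInstanceWeightSender()
    port = sender.init_weights_send_group()
    sent = sender.send_weights([("model.embed_tokens.weight", b"...")])
    print(port, sent)
```

Splitting the protocol this way mirrors update_weights_from_distributed: the cheap, failure-prone setup step can be retried or renegotiated without re-sending any tensor data.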
This PR adds new server arguments and FastAPI endpoints. In my opinion, the documentation should also be updated.
Previously, during initialization, model weights were loaded from either disk or main memory. With this feature, weights can be loaded directly from another running SGLang instance. The data path is seed instance GPU memory -> seed instance NIC -> client instance NIC -> client instance GPU memory, which can significantly decrease the time spent loading weights during inference engine initialization.

Co-developed-by: Zehuan Li <lizehuan.lzh@antgroup.com>
Co-developed-by: Tianyu Zhou <wentong.zty@antgroup.com>
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
Please address the following comments
Thanks for the suggestion :) We have updated the format in a new PR: #10941
I think the original version already follows that order. Please let me know if I misunderstood this comment. Thanks
thanks! |
Add attribution comments to reference the original source of the remote instance loading functionality adapted from sgl-project/sglang. Updated files include:
- remote_instance_loader.py
- remote_instance_loader_utils.py
- api_server.py
- protocol.py
All changes add proper attribution comments referencing sgl-project/sglang#8215.

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
Hello, I'm wondering if this PR supports weight transfer/loading across multiple machines within a single training instance. I tried loading weights with a PP=2, TP=8 configuration using two machines with 16 GPUs in total, but the process keeps hanging indefinitely. If multi-machine weight loading is supported, could you please provide an example launch command for such a setup? Thank you very much!
Motivation
During engine initialization, the process of loading model weights is time-consuming, as the weights are currently fetched either from disk or from system memory, both of which are constrained by PCIe bandwidth. To optimize the startup time of the engine, we propose a method to load the weights directly from another running SGLang instance that is using the same model. In this approach, the data path for weight transfer is from the source instance's GPU memory → source instance's NIC → destination instance's NIC → destination instance's GPU memory.
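The data path above can be pictured with a toy simulation: the seed instance already holds the weights in GPU memory and pushes them to the client, so the client never touches disk or host memory. The real PR does this with a torch.distributed NCCL group over the NICs; plain Python dicts stand in for GPU tensors here, and all names are illustrative.

```python
class Instance:
    """Toy stand-in for a running SGLang instance (hypothetical sketch)."""

    def __init__(self, name, weights=None):
        self.name = name
        self.gpu_memory = dict(weights or {})  # stand-in for GPU tensors


def broadcast_weights(seed, client):
    # Seed GPU -> seed NIC -> client NIC -> client GPU; here simply a copy.
    # In the actual implementation this is a broadcast over the NCCL group.
    for tensor_name, tensor in seed.gpu_memory.items():
        client.gpu_memory[tensor_name] = list(tensor)


if __name__ == "__main__":
    seed = Instance("seed", {"layer0.weight": [0.1, 0.2], "layer0.bias": [0.0]})
    client = Instance("client")  # starts with no weights loaded
    broadcast_weights(seed, client)
    assert client.gpu_memory == seed.gpu_memory
    print(sorted(client.gpu_memory))
```

The key property is that the client's copy is produced entirely from the seed's in-memory state, which is why the approach sidesteps the PCIe bandwidth limits of disk and host-memory loading.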
Modifications
This PR introduces a new load-format option, "remote_instance", which enables a new instance to load model weights from an already running remote instance during initialization. When using the "remote_instance" load format, the new instance will:
To start a new instance using the "remote_instance" load format:

python -m sglang.launch_server \
    --model-path instance://[target_instance_ip]:[communication_group_port] \
    --tokenizer-path [local_tokenizer_path] \
    --seed-instance-url http://[target_instance_ip]:[target_instance_serving_port] \
    --load-format remote_instance \
    --client-instance-id [local_instance_id] (optional)

Performance tested with Deepseek-R1:
Checklist