
Support loading weights from remote instance#8215

Merged
hnyls2002 merged 2 commits intosgl-project:mainfrom
amysaq2023:amy/support-loading-weights-from-remote-instance
Sep 12, 2025

Conversation

@amysaq2023
Contributor

@amysaq2023 amysaq2023 commented Jul 21, 2025

Motivation

During engine initialization, the process of loading model weights is time-consuming, as the weights are currently fetched either from disk or from system memory, both of which are constrained by PCIe bandwidth. To optimize the startup time of the engine, we propose a method to load the weights directly from another running SGLang instance that is using the same model. In this approach, the data path for weight transfer is from the source instance's GPU memory → source instance's NIC → destination instance's NIC → destination instance's GPU memory.

Modifications

This PR introduces a new load-format option, "remote_instance", which enables a new instance to load model weights from an already running remote instance during initialization. When using the "remote_instance" load format, the new instance will:

  1. establish a communication group with the target instance; and
  2. transfer the weights from the target instance via the established communication group.
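The two steps above are coordinated through the seed instance's new endpoints, /init_weights_send_group_for_remote_instance and /send_weights_to_remote_instance. A minimal client-side sketch of that coordination, with an injectable post callable; the endpoint paths come from this PR, but the payload fields and response shape are illustrative assumptions, not the actual API schema:

```python
# Sketch of the client-side coordination for the two steps above.
# Endpoint paths come from this PR; the payload fields, the response
# shape ({"success": bool}), and the injectable `post` callable are
# illustrative assumptions, not the actual API schema.
def load_weights_from_seed(seed_url: str, client_instance_id: str, post):
    """Step 1: ask the seed to set up the weights-send group.
    Step 2: ask the seed to broadcast its weights over that group."""
    called = []
    for path in ("/init_weights_send_group_for_remote_instance",
                 "/send_weights_to_remote_instance"):
        resp = post(seed_url + path, {"client_instance_id": client_instance_id})
        called.append(path)
        if not resp.get("success", False):
            raise RuntimeError(f"{path} failed: {resp}")
    return called

# Usage with a stub transport; a real client would pass something like
# lambda url, body: requests.post(url, json=body).json()
order = load_weights_from_seed(
    "http://10.0.0.1:30000", "client-0",
    post=lambda url, body: {"success": True},
)
```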

To start a new instance using "remote_instance" load format:
python -m sglang.launch_server \
  --model-path instance://[target_instance_ip]:[communication_group_port] \
  --tokenizer-path [local_tokenizer_path] \
  --seed-instance-url http://[target_instance_ip]:[target_instance_serving_port] \
  --load-format remote_instance \
  --client-instance-id [local_instance_id]   (optional)

Performance tested with Deepseek-R1:

  1. from disk: ~440s
  2. from main memory: ~60s
  3. from remote instance: <10s
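For scale, a rough back-of-the-envelope check on what these timings imply, assuming a DeepSeek-R1 FP8 checkpoint of roughly 650 GB (an illustrative estimate, not a figure from this PR):

```python
# Effective load bandwidth implied by the measured timings above.
# CHECKPOINT_GB is an assumed order-of-magnitude size for DeepSeek-R1
# FP8 weights, not a number reported in this PR.
CHECKPOINT_GB = 650

timings_s = {"disk": 440, "main memory": 60, "remote instance": 10}

throughput_gbps = {src: CHECKPOINT_GB / t for src, t in timings_s.items()}
for src, bw in sorted(throughput_gbps.items(), key=lambda kv: kv[1]):
    print(f"{src:>15}: ~{bw:.1f} GB/s")
```

Under this assumption the remote-instance path lands in NIC-bandwidth territory (tens of GB/s), consistent with it bypassing the PCIe-bound disk and host-memory paths.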

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @amysaq2023, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the SGLang engine's startup time by introducing a novel method for loading model weights. Instead of relying on local disk or system memory, instances can now fetch weights directly from a running remote SGLang instance, utilizing high-speed network communication for a substantial performance boost.

Highlights

  • New Weight Loading Mechanism: Introduced a new remote_instance load format, enabling SGLang instances to load model weights directly from an already running remote SGLang instance. This bypasses slower disk/memory I/O, leveraging high-bandwidth network interfaces for faster startup.
  • Distributed Communication Setup: Implemented a dedicated torch.distributed (NCCL) communication group between the 'seed' (source) and 'client' (destination) instances. This group facilitates efficient, direct GPU-to-GPU weight transfer.
  • API Extensions for Remote Loading: Added new FastAPI endpoints (/init_weights_send_group_for_remote_instance and /send_weights_to_remote_instance) to the HTTP server. These endpoints coordinate the setup of the distributed communication group and trigger the actual weight broadcasting from the seed instance.
  • Performance Optimization: Demonstrated significant reductions in model loading times. For example, Deepseek-R1 model loading time was reduced from 436.15s (disk) or 57.16s (main memory) to 7.76s (remote instance).
  • Enhanced Configurability: Introduced new command-line arguments (--seed-instance-url, --client-instance-id, --model-config-path) and corresponding LoadConfig parameters to provide flexible control over the remote instance weight loading process.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an innovative feature to load model weights from a remote running instance, which dramatically improves engine startup times. The implementation is well-structured, touching configurations, connectors, HTTP endpoints, and the model loading logic to facilitate direct GPU-to-GPU weight transfer via torch.distributed. The overall approach is sound. I've identified a few areas for improvement, including a potential crasher due to fragile URL parsing, a type hint error, and some minor bugs in f-string formatting for error messages. My suggestions aim to enhance the robustness and maintainability of this new feature.
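The fragile URL parsing flagged here concerns the instance://[ip]:[port] model-path scheme. One defensive way to parse it, shown as a sketch (parse_instance_uri is a hypothetical helper, not the PR's actual code):

```python
# Defensive parsing of the 'instance://<ip>:<port>' model-path scheme.
# This is an illustrative sketch, not the code under review.
from urllib.parse import urlparse

def parse_instance_uri(uri: str) -> tuple[str, int]:
    """Parse 'instance://<ip>:<port>' into (ip, port), failing loudly
    on malformed input instead of raising an opaque IndexError."""
    parsed = urlparse(uri)
    if parsed.scheme != "instance" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected 'instance://<ip>:<port>', got {uri!r}")
    return parsed.hostname, parsed.port
```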

@amysaq2023 amysaq2023 force-pushed the amy/support-loading-weights-from-remote-instance branch 5 times, most recently from 7e34f68 to 805ef4d Compare July 22, 2025 03:14
@zhaochenyang20
Collaborator

@amysaq2023 Shall find someone from the community to review this

@stmatengss
Collaborator

@amysaq2023 Shall find someone from the community to review this

@zhaochenyang20 If no one else can get to it, I’m happy to review this PR

@ryang-max
Contributor

ryang-max commented Jul 24, 2025

wondering what's the use case of this new load format in practice?

@zhaochenyang20
Collaborator

@amysaq2023 Shall find someone from the community to review this

@zhaochenyang20 If no one else can get to it, I’m happy to review this PR

Please feel free to review

@zhaochenyang20 zhaochenyang20 requested a review from slin1237 as a code owner July 29, 2025 07:25
@zhaochenyang20
Collaborator

python -m sglang.launch_server
--model-path instance://[target_instance_ip]:[communication_group_port] 
--tokenizer-path [local_tokenizer_path]
--seed-instance-url http://[target_instance_ip]:[target_instance_serving_port]
--load-format remote_instance
--client-instance-id [local_instance_id](optional)

I think the launching command is quite strange. Why --model-path instance://[target_instance_ip]:[communication_group_port] ? Is this a mistake? Maybe the model_path shall be:

--model-path instance://[new instance ip]:[new instance port]

And what does local_instance_id mean here?

Collaborator

@zhaochenyang20 zhaochenyang20 left a comment


Generally good PR. A 2-gpu unit test is needed.

Comment on lines 25 to 36
Collaborator


will double check whether this will affect AMD CI.

Collaborator

@zhaochenyang20 zhaochenyang20 left a comment


@zhaochenyang20
Collaborator

zhaochenyang20 commented Aug 1, 2025

If is_in_ci(): use 2 GPUs to test, i.e. a tp1 server initializes another tp1 server.
If testing locally, also test a tp2 server initializing another tp2 server.

        if is_in_ci():
            mode = random.choice(["Engine", "Server"])
            test_suits = [
                (1, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, mode),
            ]
        else:
            test_suits = [
                (1, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
                (1, 1, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
            ]

            if torch.cuda.device_count() >= 4:
                test_suits.extend(
                    [
                        (2, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
                        (1, 2, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
                    ]
                )

            if torch.cuda.device_count() >= 5:
                test_suits.extend(
                    [
                        (2, 2, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
                        (2, 2, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
                    ]
                )

@zhaochenyang20
Collaborator

Also, add the unit test file's name into run_suite.

@amysaq2023 amysaq2023 force-pushed the amy/support-loading-weights-from-remote-instance branch from eb47cd5 to 3e09458 Compare August 12, 2025 07:19
@amysaq2023
Contributor Author

If is_in_ci(): use 2 GPUs to test, i.e. a tp1 server initializes another tp1 server. If testing locally, also test a tp2 server initializing another tp2 server.

        if is_in_ci():
            mode = random.choice(["Engine", "Server"])
            test_suits = [
                (1, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, mode),
            ]
        else:
            test_suits = [
                (1, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
                (1, 1, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
            ]

            if torch.cuda.device_count() >= 4:
                test_suits.extend(
                    [
                        (2, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
                        (1, 2, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
                    ]
                )

            if torch.cuda.device_count() >= 5:
                test_suits.extend(
                    [
                        (2, 2, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
                        (2, 2, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
                    ]
                )

Thanks for the advice! We have added a unit test for loading weights from a remote instance.

Comment on lines 283 to 310
Collaborator


In this test, you initialize a server as the seed at rank 0, and then want to create 1 or 2 new instances from rank 0.

On rank 1 or rank 2, do they run exactly the same code, each with dp 1 in its own process? That is to say, we cannot init an SGLang server with dp 2 this way, but we can create 2 servers with dp 1 and have them get the weights from the seed?

    assert (
        self.server_args.dp_size == 1
    ), "dp_size must be 1 for init_weights_send_group_for_remote_instance"

@zhaochenyang20
Collaborator

Also, add the tests to test/srt/run_suite.py @amysaq2023

@zxpdemonio
Contributor

For clustered deployments, multi‑node weight update efficiency needs attention. Concurrent transfer mechanisms and P2P weight distribution strategies may help.

Comment on lines 842 to 889

Do not worry about a long name, --remote-instance-weight-loader is clearer

Comment on lines 1373 to 1387

@hnyls2002
Collaborator

@amysaq2023 By the way, just curious about the current implementation of two calls, one is initializing a sending group, and one is triggering sending action. Why are two calls needed here?

@amysaq2023
Contributor Author

@amysaq2023 By the way, just curious about the current implementation of two calls, one is initializing a sending group, and one is triggering sending action. Why are two calls needed here?

Getting the idea from how update_weights_from_distributed is implemented, we also separate loading weights from a remote instance into two steps. Besides, we are going to add P2P port negotiation between the seed and destination instances during group initialization (in a coming PR); this needs to happen before loading weights, since the destination instance needs to learn from the seed instance which ports to use.
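The required ordering of the two calls can be illustrated with a minimal state machine (illustrative only, not the PR's implementation):

```python
# Minimal state machine illustrating why the protocol needs two calls:
# the send group (and any port negotiation) must exist before a
# broadcast can be triggered. Illustrative sketch, not the PR's code.
class SeedInstanceProtocol:
    def __init__(self):
        self.group_ready = False

    def init_weights_send_group(self, client_instance_id: str):
        # Phase 1: build the communication group; in the follow-up PR
        # this is also where seed/client P2P ports would be negotiated.
        self.group_ready = True

    def send_weights(self, client_instance_id: str):
        # Phase 2: broadcast weights over the group built in phase 1.
        if not self.group_ready:
            raise RuntimeError("send group not initialized")
        return "weights broadcast"
```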

@amysaq2023 amysaq2023 force-pushed the amy/support-loading-weights-from-remote-instance branch from 721313d to 67332a8 Compare September 11, 2025 04:23
@amysaq2023 amysaq2023 force-pushed the amy/support-loading-weights-from-remote-instance branch from 67332a8 to b49a462 Compare September 11, 2025 06:04
@stmatengss
Collaborator

This PR adds new server arguments and FastAPI endpoints. In my opinion, the documentation should also be updated.

Previously, during initialization, model weights were loaded from either
disk or main memory. With this feature, weights can be directly loaded
from another running SGLang instance.
The datapath is seed instance GPU memory -> seed instance NIC ->
client instance NIC -> client instance GPU memory, which can
significantly decrease the time used in loading weights during inference
engine initialization.

Co-developed-by: Zehuan Li <lizehuan.lzh@antgroup.com>
Co-developed-by: Tianyu Zhou <wentong.zty@antgroup.com>
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
@amysaq2023 amysaq2023 force-pushed the amy/support-loading-weights-from-remote-instance branch from b49a462 to 946f761 Compare September 11, 2025 13:35
@hnyls2002 hnyls2002 merged commit 30d20ce into sgl-project:main Sep 12, 2025
194 of 214 checks passed
@merrymercy
Contributor

Please address the following comments

  • move remote_instance_weight_loader_utils.py under model_loader
  • move tp_rank, remote_instance_weight_loader_seed_instance_ip, ... out of ModelConfig and to LoadConfig
  • move parser.add_argument("--remote-instance-weight-loader-seed-instance-ip", after parser.add_argument("--weight-loader-disable-mmap",. The order of parser.add_argument should be the same as the order of how they are defined in class ServerArgs:

@amysaq2023
Contributor Author

Please address the following comments

  • move remote_instance_weight_loader_utils.py under model_loader
  • move tp_rank, remote_instance_weight_loader_seed_instance_ip, ... out of ModelConfig and to LoadConfig

Thanks for the suggestion :) We have updated the format in a new PR: #10941

  • move parser.add_argument("--remote-instance-weight-loader-seed-instance-ip", after parser.add_argument("--weight-loader-disable-mmap",. The order of parser.add_argument should be the same as the order of how they are defined in class ServerArgs:

I think the original version already follows that order. Please let me know if I misunderstood this comment. Thanks!

@merrymercy
Contributor

thanks!

lirong-lirong added a commit to lirong-lirong/vllm that referenced this pull request Oct 27, 2025
Add attribution comments to reference the original source of the remote instance
loading functionality adapted from sgl-project/sglang.

Updated files include:
- remote_instance_loader.py
- remote_instance_loader_utils.py
- api_server.py
- protocol.py

All changes add proper attribution comments referencing:
sgl-project/sglang#8215

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
@nihao1997

Hello, I'm wondering whether this PR supports weight transfer/loading across multiple machines within a single training instance.

I tried loading weights with a PP=2, TP=8 configuration using two machines (16 GPUs in total), but the process keeps hanging indefinitely.

If multi-machine weight loading is supported, could you please provide an example launch command for such a setup?

Thank you very much!

