Support loading weights from remote instance #8215
hnyls2002 merged 2 commits into sgl-project:main from
Conversation
Summary of Changes
Hello @amysaq2023, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly optimizes the SGLang engine's startup time by introducing a novel method for loading model weights. Instead of relying on local disk or system memory, instances can now fetch weights directly from a running remote SGLang instance, utilizing high-speed network communication for a substantial performance boost.
Highlights
- New Weight Loading Mechanism: Introduced a new remote_instance load format, enabling SGLang instances to load model weights directly from an already running remote SGLang instance. This bypasses slower disk/memory I/O, leveraging high-bandwidth network interfaces for faster startup.
- Distributed Communication Setup: Implemented a dedicated torch.distributed (NCCL) communication group between the 'seed' (source) and 'client' (destination) instances. This group facilitates efficient, direct GPU-to-GPU weight transfer.
- API Extensions for Remote Loading: Added new FastAPI endpoints (/init_weights_send_group_for_remote_instance and /send_weights_to_remote_instance) to the HTTP server. These endpoints coordinate the setup of the distributed communication group and trigger the actual weight broadcasting from the seed instance.
- Performance Optimization: Demonstrated significant reductions in model loading times. For example, Deepseek-R1 model loading time was reduced from 436.15s (disk) or 57.16s (main memory) to 7.76s (remote instance).
- Enhanced Configurability: Introduced new command-line arguments (--seed-instance-url, --client-instance-id, --model-config-path) and corresponding LoadConfig parameters to provide flexible control over the remote instance weight loading process.
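The two endpoints listed above are called in sequence by the client side. As a rough illustration, the requests might be assembled as below; the endpoint paths come from this PR, but the payload field names (master_address, world_size, group_name, etc.) are assumptions for illustration, not the PR's exact schema.

```python
import json

SEED_URL = "http://seed-host:30000"  # hypothetical seed instance serving URL


def build_init_group_request(master_address, master_port, rank_offset, world_size, group_name):
    """Payload for /init_weights_send_group_for_remote_instance (field names assumed)."""
    return {
        "url": f"{SEED_URL}/init_weights_send_group_for_remote_instance",
        "json": {
            "master_address": master_address,
            "master_port": master_port,
            "rank_offset": rank_offset,
            "world_size": world_size,
            "group_name": group_name,
        },
    }


def build_send_weights_request(group_name):
    """Payload for /send_weights_to_remote_instance (field names assumed)."""
    return {
        "url": f"{SEED_URL}/send_weights_to_remote_instance",
        "json": {"group_name": group_name},
    }


if __name__ == "__main__":
    # Step 1: ask the seed to set up the communication group, then
    # step 2: trigger the actual weight broadcast over that group.
    init_req = build_init_group_request("10.0.0.1", 29500, 1, 2, "weights_group")
    send_req = build_send_weights_request("weights_group")
    print(json.dumps(init_req, indent=2))
    print(json.dumps(send_req, indent=2))
```

In a real deployment these dicts would be posted with an HTTP client; the point of the sketch is only the two-call shape: group setup first, broadcast second.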
Code Review
This pull request introduces an innovative feature to load model weights from a remote running instance, which dramatically improves engine startup times. The implementation is well-structured, touching configurations, connectors, HTTP endpoints, and the model loading logic to facilitate direct GPU-to-GPU weight transfer via torch.distributed. The overall approach is sound. I've identified a few areas for improvement, including a potential crash caused by fragile URL parsing, a type hint error, and some minor bugs in f-string formatting for error messages. My suggestions aim to enhance the robustness and maintainability of this new feature.
@amysaq2023 Shall we find someone from the community to review this?
@zhaochenyang20 If no one else can get to it, I’m happy to review this PR.
Wondering what's the use case of this new load format in practice?
Please feel free to review.
python -m sglang.launch_server \
    --model-path instance://[target_instance_ip]:[communication_group_port] \
    --tokenizer-path [local_tokenizer_path] \
    --seed-instance-url http://[target_instance_ip]:[target_instance_serving_port] \
    --load-format remote_instance \
    --client-instance-id [local_instance_id] (optional)

I think the launching command is quite strange. Why
And, what's the
zhaochenyang20
left a comment
Generally good PR. A 2-gpu unit test is needed.
Will double check whether this will affect AMD CI.
if is_in_ci():
mode = random.choice(["Engine", "Server"])
test_suits = [
(1, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, mode),
]
else:
test_suits = [
(1, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
        (1, 1, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
]
if torch.cuda.device_count() >= 4:
test_suits.extend(
[
(2, 1, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
(1, 2, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
]
)
if torch.cuda.device_count() >= 5:
test_suits.extend(
[
(2, 2, DEFAULT_SMALL_MODEL_NAME_FOR_TEST, "Engine"),
(2, 2, DEFAULT_MODEL_NAME_FOR_TEST, "Server"),
]
    )
Also, add the unit test file's name into
Thanks for the advice! We have added a unit test for loading weights from a remote instance.
In this test, you initialize a server as the seed at rank 0; then, for the 1/2 new instances, you create them from rank 0.
On rank 1 or rank 2, do they run exactly the same code with dp 1 in their own process? That is to say, we cannot init an SGLang server with dp 2 this way, but instead create 2 servers with dp 1 each and get the weights from the seed?
assert (
self.server_args.dp_size == 1
), "dp_size must be 1 for init_weights_send_group_for_remote_instance"
Also, add the tests to test/srt/run_suite.py @amysaq2023
For clustered deployments, multi-node weight update efficiency needs attention. Concurrent transfer mechanisms and P2P weight distribution strategies may help.
python/sglang/srt/server_args.py
Do not worry about a long name; --remote-instance-weight-loader is clearer.
@amysaq2023 By the way, just curious about the current implementation of two calls, one is initializing a sending group, and one is triggering sending action. Why are two calls needed here?
Getting the idea from how update_weights_from_distributed is implemented, we also separate loading weights from a remote instance into two steps. Besides, we are going to add P2P port negotiation between the seed and destination instances during group initialization (in a coming PR); this needs to be done before loading weights, since the destination instance needs to learn from the seed instance which ports to use.
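The ordering constraint described above — the group (and any future port negotiation) must exist before any transfer — can be sketched as a tiny state machine. Class and method names here are illustrative stand-ins, not the PR's actual symbols, and the port negotiation is simulated.

```python
class RemoteInstanceWeightSender:
    """Toy model of the seed side's two-step protocol (names hypothetical)."""

    def __init__(self):
        self._group_ready = False
        self._negotiated_port = None

    def init_weights_send_group(self, client_port_hint=29600):
        # Step 1: set up the communication group. This is also where a future
        # P2P port negotiation would happen, so the client learns which port
        # the seed will use before any data flows.
        self._negotiated_port = client_port_hint
        self._group_ready = True
        return self._negotiated_port

    def send_weights(self, named_tensors):
        # Step 2: broadcast weights. Refuses to run before step 1, which is
        # exactly why the two HTTP endpoints are separate calls.
        if not self._group_ready:
            raise RuntimeError("call init_weights_send_group first")
        return [name for name, _ in named_tensors]


if __name__ == "__main__":
    sender = RemoteInstanceWeightSender()
    port = sender.init_weights_send_group()
    sent = sender.send_weights([("model.embed_tokens.weight", b"...")])
    print(port, sent)
```

Splitting the protocol this way mirrors update_weights_from_distributed: the cheap, failure-prone setup step can be retried or renegotiated without re-sending any tensor data.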
This PR adds new server arguments and FastAPI endpoints. In my opinion, the documentation should also be updated.
Previously, during initialization, model weights were loaded from either disk or main memory. With this feature, weights can be loaded directly from another running SGLang instance. The data path is seed instance GPU memory -> seed instance NIC -> client instance NIC -> client instance GPU memory, which can significantly decrease the time spent loading weights during inference engine initialization.

Co-developed-by: Zehuan Li <lizehuan.lzh@antgroup.com>
Co-developed-by: Tianyu Zhou <wentong.zty@antgroup.com>
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
Please address the following comments
Thanks for the suggestion :) We have updated the format in a new PR: #10941
I think the original version already follows that order. Please let me know if I misunderstood this comment. Thanks
thanks! |
Add attribution comments to reference the original source of the remote instance loading functionality adapted from sgl-project/sglang. Updated files include:
- remote_instance_loader.py
- remote_instance_loader_utils.py
- api_server.py
- protocol.py
All changes add proper attribution comments referencing sgl-project/sglang#8215.

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
Hello, I'm wondering if this PR supports weight transfer/loading across multiple machines within a single training instance. I tried loading weights with a PP=2, TP=8 configuration using two machines with 16 GPUs in total, but the process keeps hanging indefinitely. If multi-machine weight loading is supported, could you please provide an example launch command for such a setup? Thank you very much!
Motivation
During engine initialization, the process of loading model weights is time-consuming, as the weights are currently fetched either from disk or from system memory, both of which are constrained by PCIe bandwidth. To optimize the startup time of the engine, we propose a method to load the weights directly from another running SGLang instance that is using the same model. In this approach, the data path for weight transfer is from the source instance's GPU memory → source instance's NIC → destination instance's NIC → destination instance's GPU memory.
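The data path above can be pictured with a toy simulation: the seed instance already holds the weights in GPU memory and pushes them to the client, so the client never touches disk or host memory. The real PR does this with a torch.distributed NCCL group over the NICs; plain Python dicts stand in for GPU tensors here, and all names are illustrative.

```python
class Instance:
    """Toy stand-in for a running SGLang instance (hypothetical sketch)."""

    def __init__(self, name, weights=None):
        self.name = name
        self.gpu_memory = dict(weights or {})  # stand-in for GPU tensors


def broadcast_weights(seed, client):
    # Seed GPU -> seed NIC -> client NIC -> client GPU; here simply a copy.
    # In the actual implementation this is a broadcast over the NCCL group.
    for tensor_name, tensor in seed.gpu_memory.items():
        client.gpu_memory[tensor_name] = list(tensor)


if __name__ == "__main__":
    seed = Instance("seed", {"layer0.weight": [0.1, 0.2], "layer0.bias": [0.0]})
    client = Instance("client")  # starts with no weights loaded
    broadcast_weights(seed, client)
    assert client.gpu_memory == seed.gpu_memory
    print(sorted(client.gpu_memory))
```

The key property is that the client's copy is produced entirely from the seed's in-memory state, which is why the approach sidesteps the PCIe bandwidth limits of disk and host-memory loading.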
Modifications
This PR introduces a new load-format option, "remote_instance", which enables a new instance to load model weights from an already running remote instance during initialization. When using the "remote_instance" load format, the new instance will:
To start a new instance using the "remote_instance" load format:

python -m sglang.launch_server \
    --model-path instance://[target_instance_ip]:[communication_group_port] \
    --tokenizer-path [local_tokenizer_path] \
    --seed-instance-url http://[target_instance_ip]:[target_instance_serving_port] \
    --load-format remote_instance \
    --client-instance-id [local_instance_id] (optional)

Performance tested with Deepseek-R1:
Checklist