[Core] Remote Instance Model Loader: A loader used to fetch model weights from a ready vLLM instance. by lirong-lirong · Pull Request #27417 · vllm-project/vllm

lirong-lirong · 2025-10-23T13:39:18Z

Purpose

This PR implements a new model loading mechanism that enables transferring model weights from remote vLLM instances. This feature is particularly useful in distributed deployments where multiple vLLM instances need to share model weights without individually loading them from storage.

Key features of this implementation:

Adds RemoteInstanceModelLoader for loading model weights from GPUs of other vLLM instances
Supports weight transfer between instances with the same parallelism strategy (include TP and PP)
Implements custom process groups for secure and efficient weight transfer
Adds new API endpoints (/init_weights_send_group_for_remote_instance and /send_weights_to_remote_instance) for coordinating weight transfers
Extends the model loader framework to support the new "remote_instance" load format
Includes validation mechanisms to ensure model compatibility between instances
Leverages high-speed interconnects (NVLink/RDMA) for efficient model weight distribution and loading

This PR effectively reduces the startup time of vLLM, particularly in large-scale scaling scenarios on the cloud.

Test Plan

End-to-end tests with multiple vLLM instances to verify weight transfer

Test commands:

pytest -s -v tests/model_executor/model_loader/test_remote_instance_loader.py

Test Result

log.txt

How to use

Start a vLLM instance normally as a seed instance to provide the model weight data source

python3 -m vllm.entrypoints.openai.api_server \
	--model /data0/deepseek-ai/DeepSeek-V3.1 \
	--tensor-parallel-size 8 \
	--host 0.0.0.0 --port 12346 \
	--gpu-memory-utilization 0.95 \
	--max_model_len 1024 \
	--enforce_eager

Launch new vLLM using the Remote Instance Model Loader

export REMOTE_INSTANCE_IP="10.32.11.46"
export REMOTE_INSTANCE_SERVER_PORT=12346
export REMOTE_INSTANCE_PORTS="[62000,62001,62002,62003,62004,62005,62006,62007]"

python3 -m vllm.entrypoints.openai.api_server \
	--model /data0/deepseek-ai/DeepSeek-V3.1 \
        --load_format remote_instance \
	--tensor-parallel-size 8 \
	--port 12346 \
	--gpu-memory-utilization 0.95 \
	--max_model_len 1024 \
	--enforce_eager

The three environment variables above represent the IP address of the remote instance, the service port, and the designated ports for the NCCL process group, respectively. The number of ports should match the number of workers.

Additional Context

We are continuously working on reducing the startup overhead of inference services. Previously, we implemented other solutions, such as using Mooncake Store to store processed weight tensors and pulling them directly to the GPU via GPUDirect RDMA (GDR), as well as leveraging the Mooncake transfer engine for direct GPU-to-GPU tensor transfer between old and new instances.

Recently, the sglang project introduced a new 'remote instance model loader' feature, which accomplishes the same task using the default NCCL communication backend combined with a broadcast operation. Considering the potential value of maintaining future compatibility with sglang, we have implemented the same approach in vLLM.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

gemini-code-assist

Code Review

This pull request introduces a RemoteInstanceModelLoader for fetching model weights from another running vLLM instance, which is a valuable feature for reducing startup times in distributed environments. The implementation is comprehensive, adding new API endpoints, a weight transfer connector, and utility functions for coordination. The code is generally well-structured. I've identified a critical bug that could lead to a server crash and a potential resource leak issue that should be addressed for improved robustness.

vllm/model_executor/model_loader/remote_instance_loader_utils.py

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

vllm/model_executor/model_loader/remote_instance_loader.py

…transfer across vLLM instances Signed-off-by: pengdrumli <pengdrumli@tencent.com>

This commit adds a comprehensive end-to-end test for the remote instance model loader functionality that was recently implemented. The test verifies that: 1. A seed server can be started successfully as the source of model weights 2. A client server can load weights from the remote seed server using the new remote_instance load format 3. Both servers produce identical outputs when given the same prompts The test requires 8 GPUs (4 for seed instance + 4 for client instance) with a 2x2 TP/PP configuration and is appropriately skipped when insufficient resources are available. It also sets up proper environment variables and network communication between the coordinated servers. This test ensures the reliability and correctness of the distributed weight transfer feature across vLLM instances. Signed-off-by: pengdrumli <pengdrumli@tencent.com>

Apply automatic code formatting using ruff-format and other pre-commit hooks. No functional changes. Signed-off-by: pengdrumli <pengdrumli@tencent.com>

Signed-off-by: pengdrumli <pengdrumli@tencent.com>

…project#27417 (comment)] Signed-off-by: pengdrumli <pengdrumli@tencent.com>

This commit fixes Ruff linting errors including B024, E501, G004, E402, and SIM102 to ensure code compliance with project standards and pass all pre-commit checks. Signed-off-by: pengdrumli <pengdrumli@tencent.com>

In remote_instance_loader_utils.py: - Add threading and time imports to support cleanup functionality - Improve _weights_send_group type annotation for better code clarity - Implement automatic cleanup of stale process groups: * Add cleanup_thread function to create daemon threads * Implement _cleanup_stale_group function to safely destroy process groups - Start cleanup thread after initializing weight sending group - Rename parameter tensor_nums to remote_tensor_nums for better semantic clarity Signed-off-by: pengdrumli <pengdrumli@tencent.com>

Key changes: 1. Properly initialize global_ranks_in_group with the current process rank instead of an empty list 2. Correctly map global ranks to group ranks in _world.pg_group_ranks 3. Update torch.distributed.broadcast parameter from 'src' to 'group_src' for proper group handling 4. Add torch.cuda.empty_cache() calls to optimize GPU memory management during weight transfer Signed-off-by: pengdrumli <pengdrumli@tencent.com>

tianyuzhou95 · 2025-10-27T03:12:47Z

It seems the functionality and implementation are similar to this PR: sgl-project/sglang#8215. cc @amysaq2023

Would you mind adding a reference or acknowledgment?

lirong-lirong · 2025-10-27T03:34:39Z

It seems the functionality and implementation are similar to this PR: sgl-project/sglang#8215. cc @amysaq2023

Would you mind adding a reference or acknowledgment?

Sure! As I mentioned earlier in PR‘s description, this PR is indeed based on sglang's RemoteInstanceLoader implementation. Thank you for the reminder! I will promptly add the acknowledgment. Could you please guide me on where and how you'd like me to add it? For example, should I:

Add it in the PR description?
Add comments in the code referencing the original implementation?
Or somewhere else?

I'd appreciate your guidance on the preferred way to do this in this project. Thanks!

amysaq2023 · 2025-10-27T06:20:01Z

It seems the functionality and implementation are similar to this PR: sgl-project/sglang#8215. cc @amysaq2023
Would you mind adding a reference or acknowledgment?

Sure! As I mentioned earlier, this PR is indeed based on sglang's RemoteInstanceLoader implementation. Thank you for the reminder! I will promptly add the acknowledgment. Could you please guide me on where and how you'd like me to add it? For example, should I:

Add it in the PR description?

Add comments in the code referencing the original implementation?

Or somewhere else?

I'd appreciate your guidance on the preferred way to do this in this project. Thanks!

Either way works for me. Let's respect the way how vLLM maintainers would suggest to cite the reference.

Add attribution comments to reference the original source of the remote instance loading functionality adapted from sgl-project/sglang. Updated files include: - remote_instance_loader.py - remote_instance_loader_utils.py - api_server.py - protocol.py All changes add proper attribution comments referencing: sgl-project/sglang#8215 Signed-off-by: pengdrumli <pengdrumli@tencent.com>

ovidiusm · 2025-11-17T13:15:34Z

This transfer should reuse vllm's NixlConnector instead of creating a new transfer method from scratch with torch collectives.

There are serverless usecases where the NIXL API is preferrable, since it allows, for example, fast swapping in/out of weights to cache on workers that serve several models and are oversubscribed in terms of VRAM, with models used infrequently. There are already some projects working on this.

Reusing NixlConnector will also bring for free all the performance optimizations made for KVCache transfers.

lirong-lirong · 2025-11-17T13:35:50Z

This transfer should reuse vllm's NixlConnector instead of creating a new transfer method from scratch with torch collectives.

There are serverless usecases where the NIXL API is preferrable, since it allows, for example, fast swapping in/out of weights to cache on workers that serve several models and are oversubscribed in terms of VRAM, with models used infrequently. There are already some projects working on this.

Reusing NixlConnector will also bring for free all the performance optimizations made for KVCache transfers.
Thank you for your feedback!

Initially, the custom communicator was chosen to maintain consistency with sglang, under the assumption that mixed deployment scenarios involving vLLM and sglang might arise in the future. Keeping the communicators consistent would allow for seamless model weight exchange between them.

However, I later concluded that this scenario should not be considered. In my current implementation, I reuse vLLM’s StatelessProcessGroup and PyNcclCommunicator to handle weight transfer. Recently, I encountered a peculiar performance issue: in the case of TP1, after the model transfer is completed, the overhead of destroying the process group reaches 16 seconds and blocks subsequent GPU operations.

I will promptly adopt the NixlConnector you mentioned to handle model weight distribution and evaluate its performance. Thank you very much for your suggestion! @ovidiusm

ApsarasX · 2026-01-05T07:58:47Z

vllm/distributed/kv_transfer/kv_connector/v1/weight_transfer_connector.py

+            "gpu_id and tp_rank must be specified for RemoteInstanceConnector. "
+        )
+
+        self.device_id = torch.device("cuda", gpu_id)


use current_platform rather than cuda

ApsarasX · 2026-01-05T07:59:14Z

vllm/model_executor/model_loader/remote_instance_loader.py

+        # To support tp, pp
+        global_rank = _get_rank()
+        success, message = client.build_group(
+            gpu_id=torch.cuda.current_device(),


ApsarasX · 2026-01-05T07:59:20Z

vllm/model_executor/model_loader/remote_instance_loader.py

+            t.start()
+
+        try:
+            torch.cuda.empty_cache()


ApsarasX · 2026-01-05T07:59:24Z

vllm/model_executor/model_loader/remote_instance_loader.py

+                "finish getting all weights from remote instance, time used: %.4fs",
+                end_get_weights_tic - start_get_weights_tic,
+            )
+            torch.cuda.empty_cache()


ApsarasX · 2026-01-05T07:59:30Z

vllm/model_executor/model_loader/remote_instance_loader_utils.py

+    global_rank = pp_rank * tp_size + tp_rank
+
+    ports_list = ports.split(",")
+    gpu_id = torch.cuda.current_device()


ApsarasX · 2026-01-05T07:59:35Z

vllm/model_executor/model_loader/remote_instance_loader_utils.py

+            world_size=world_size,
+            rank=group_rank,
+            group_name=group_name,
+            device_id=torch.device("cuda", gpu_id),


ApsarasX · 2026-01-05T07:59:40Z

vllm/model_executor/model_loader/remote_instance_loader_utils.py

+            return {"success": False, "message": message}
+
+        logger.info("Send weight in %s", send_group)
+        torch.cuda.empty_cache()


ApsarasX · 2026-01-05T07:59:45Z

vllm/model_executor/model_loader/remote_instance_loader_utils.py

+                    group_src=0,
+                    group=send_group,
+                )
+        torch.cuda.empty_cache()


github-actions · 2026-04-06T02:17:58Z

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

lirong-lirong requested review from 22quinn, ApostaC, NickLucche, aarnphm and chaunceyjiang as code owners October 23, 2025 13:39

lirong-lirong changed the title ~~Remote Instance Model Loader: A loader used to fetch model weights from a ready vLLM instance.~~ [Core] Remote Instance Model Loader: A loader used to fetch model weights from a ready vLLM instance. Oct 23, 2025

mergify bot added frontend v1 kv-connector labels Oct 23, 2025

gemini-code-assist bot reviewed Oct 23, 2025

View reviewed changes

vllm/model_executor/model_loader/remote_instance_loader_utils.py Outdated Show resolved Hide resolved

vllm/model_executor/model_loader/remote_instance_loader_utils.py Outdated Show resolved Hide resolved

chatgpt-codex-connector bot reviewed Oct 23, 2025

View reviewed changes

vllm/model_executor/model_loader/remote_instance_loader.py Show resolved Hide resolved

lirong-lirong added 5 commits October 23, 2025 22:23

[Core] Implement remote instance model loader for distributed weight …

e22b705

…transfer across vLLM instances Signed-off-by: pengdrumli <pengdrumli@tencent.com>

style: reformat code with pre-commit hooks

c7c4781

Apply automatic code formatting using ruff-format and other pre-commit hooks. No functional changes. Signed-off-by: pengdrumli <pengdrumli@tencent.com>

fix: fix the bug of vllm-project#27417 (comment)

0ef6f5b

Signed-off-by: pengdrumli <pengdrumli@tencent.com>

fix: Check build_group return status in remote instance loader [vllm-…

acc0a9e

…project#27417 (comment)] Signed-off-by: pengdrumli <pengdrumli@tencent.com>

lirong-lirong force-pushed the remote-instance-model-loader branch from 6b480ca to acc0a9e Compare October 23, 2025 14:24

lirong-lirong added 3 commits October 23, 2025 22:51

fix: resolve pre-commit format errors in remote instance loader files

4bfd053

This commit fixes Ruff linting errors including B024, E501, G004, E402, and SIM102 to ensure code compliance with project standards and pass all pre-commit checks. Signed-off-by: pengdrumli <pengdrumli@tencent.com>

ApsarasX suggested changes Jan 5, 2026

View reviewed changes

github-actions bot added the stale Over 90 days of inactivity label Apr 6, 2026

Uh oh!

Conversation

lirong-lirong commented Oct 23, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

How to use

Start a vLLM instance normally as a seed instance to provide the model weight data source

Launch new vLLM using the Remote Instance Model Loader

Additional Context

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

tianyuzhou95 commented Oct 27, 2025

Uh oh!

lirong-lirong commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amysaq2023 commented Oct 27, 2025

Uh oh!

ovidiusm commented Nov 17, 2025

Uh oh!

lirong-lirong commented Nov 17, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

lirong-lirong commented Oct 23, 2025 •

edited by github-actions bot

Loading

lirong-lirong commented Oct 27, 2025 •

edited

Loading