[Core] Remote Instance Model Loader: A loader used to fetch model weights from a ready vLLM instance. #27417

Open
lirong-lirong wants to merge 9 commits into vllm-project:main from lirong-lirong:remote-instance-model-loader

Conversation

@lirong-lirong
Contributor

@lirong-lirong lirong-lirong commented Oct 23, 2025

Purpose

This PR implements a new model loading mechanism that enables transferring model weights from remote vLLM instances. This feature is particularly useful in distributed deployments where multiple vLLM instances need to share model weights without individually loading them from storage.

Key features of this implementation:

  • Adds RemoteInstanceModelLoader for loading model weights from GPUs of other vLLM instances
  • Supports weight transfer between instances with the same parallelism strategy (including TP and PP)
  • Implements custom process groups for secure and efficient weight transfer
  • Adds new API endpoints (/init_weights_send_group_for_remote_instance and /send_weights_to_remote_instance) for coordinating weight transfers
  • Extends the model loader framework to support the new "remote_instance" load format
  • Includes validation mechanisms to ensure model compatibility between instances
  • Leverages high-speed interconnects (NVLink/RDMA) for efficient model weight distribution and loading

This PR significantly reduces vLLM's startup time, particularly in large-scale autoscaling scenarios in the cloud.
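For illustration, the client-side coordination against the two new endpoints can be sketched as follows. This is a minimal sketch: the endpoint paths come from this PR, but the payload fields, helper names, and request schema are illustrative assumptions, not the actual API defined in protocol.py.

```python
import json
from urllib import request


def init_group_payload(client_ip: str, ports: list[int]) -> dict:
    # Hypothetical request body; the real schema lives in protocol.py.
    return {"client_ip": client_ip, "ports": ports}


def post_json(base_url: str, path: str, payload: dict) -> dict:
    req = request.Request(
        base_url + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


def fetch_weights_from_seed(base_url: str, client_ip: str, ports: list[int]) -> None:
    # 1. Ask the seed instance to build a weights-send process group.
    post_json(base_url, "/init_weights_send_group_for_remote_instance",
              init_group_payload(client_ip, ports))
    # 2. Trigger the broadcast of all weights into that group.
    post_json(base_url, "/send_weights_to_remote_instance",
              {"client_ip": client_ip})
```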

Test Plan

End-to-end tests with multiple vLLM instances to verify weight transfer

Test commands:

pytest -s -v tests/model_executor/model_loader/test_remote_instance_loader.py

Test Result

log.txt

How to use

Start a vLLM instance normally as a seed instance to provide the model weight data source

python3 -m vllm.entrypoints.openai.api_server \
	--model /data0/deepseek-ai/DeepSeek-V3.1 \
	--tensor-parallel-size 8 \
	--host 0.0.0.0 --port 12346 \
	--gpu-memory-utilization 0.95 \
	--max_model_len 1024 \
	--enforce_eager

Launch new vLLM using the Remote Instance Model Loader

export REMOTE_INSTANCE_IP="10.32.11.46"
export REMOTE_INSTANCE_SERVER_PORT=12346
export REMOTE_INSTANCE_PORTS="[62000,62001,62002,62003,62004,62005,62006,62007]"

python3 -m vllm.entrypoints.openai.api_server \
	--model /data0/deepseek-ai/DeepSeek-V3.1 \
	--load_format remote_instance \
	--tensor-parallel-size 8 \
	--port 12346 \
	--gpu-memory-utilization 0.95 \
	--max_model_len 1024 \
	--enforce_eager

The three environment variables specify the remote instance's IP address, its service port, and the ports reserved for the NCCL process group, respectively. The number of ports must match the number of workers.
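The port-per-worker constraint can be sketched like this. The helper name is hypothetical; only the rank layout (global_rank = pp_rank * tp_size + tp_rank, visible in the PR's utils) is taken from the actual code.

```python
import json
import os


def port_for_worker(tp_rank: int, pp_rank: int, tp_size: int,
                    ports: list[int]) -> int:
    # Same rank layout as the PR's utils: global_rank = pp_rank * tp_size + tp_rank.
    global_rank = pp_rank * tp_size + tp_rank
    if global_rank >= len(ports):
        raise ValueError("REMOTE_INSTANCE_PORTS must list one port per worker")
    return ports[global_rank]


# REMOTE_INSTANCE_PORTS is a JSON-style list, e.g. "[62000,62001,...]".
ports = json.loads(os.environ.get("REMOTE_INSTANCE_PORTS",
                                  "[62000, 62001, 62002, 62003]"))
```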

Additional Context

We are continuously working on reducing the startup overhead of inference services. Previously, we implemented other solutions, such as using Mooncake Store to store processed weight tensors and pulling them directly to the GPU via GPUDirect RDMA (GDR), as well as leveraging the Mooncake transfer engine for direct GPU-to-GPU tensor transfer between old and new instances.

Recently, the sglang project introduced a new 'remote instance model loader' feature, which accomplishes the same task using the default NCCL communication backend combined with a broadcast operation. Considering the potential value of maintaining future compatibility with sglang, we have implemented the same approach in vLLM.
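The broadcast-based approach can be sketched as below (illustrative names, not the PR's actual code). The one hard requirement is that sender and receiver walk the weights in the same deterministic order; otherwise the per-tensor broadcasts desynchronize.

```python
def ordered_weight_names(state_dict: dict) -> list[str]:
    # Deterministic order shared by seed (sender) and client (receiver).
    return sorted(state_dict.keys())


def send_weights(state_dict: dict, send_group) -> None:
    import torch.distributed as dist  # deferred: only needed on the instances

    for name in ordered_weight_names(state_dict):
        # group_src=0: the seed worker is rank 0 inside the transfer group.
        dist.broadcast(state_dict[name], group_src=0, group=send_group)
```

The receiver side mirrors this loop, allocating an empty tensor per name and calling the same broadcast into it.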


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@lirong-lirong lirong-lirong changed the title Remote Instance Model Loader: A loader used to fetch model weights from a ready vLLM instance. [Core] Remote Instance Model Loader: A loader used to fetch model weights from a ready vLLM instance. Oct 23, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a RemoteInstanceModelLoader for fetching model weights from another running vLLM instance, which is a valuable feature for reducing startup times in distributed environments. The implementation is comprehensive, adding new API endpoints, a weight transfer connector, and utility functions for coordination. The code is generally well-structured. I've identified a critical bug that could lead to a server crash and a potential resource leak issue that should be addressed for improved robustness.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


…transfer across vLLM instances

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
This commit adds a comprehensive end-to-end test for the remote instance model loader functionality that was recently implemented. The test verifies that:

1. A seed server can be started successfully as the source of model weights
2. A client server can load weights from the remote seed server using the new remote_instance load format
3. Both servers produce identical outputs when given the same prompts

The test requires 8 GPUs (4 for seed instance + 4 for client instance) with a 2x2 TP/PP configuration and is appropriately skipped when insufficient resources are available. It also sets up proper environment variables and network communication between the coordinated servers.

This test ensures the reliability and correctness of the distributed weight transfer feature across vLLM instances.

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
Apply automatic code formatting using ruff-format and other pre-commit hooks.
No functional changes.

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
Signed-off-by: pengdrumli <pengdrumli@tencent.com>
@lirong-lirong lirong-lirong force-pushed the remote-instance-model-loader branch from 6b480ca to acc0a9e on October 23, 2025 14:24
This commit fixes Ruff linting errors including B024, E501, G004, E402, and SIM102
to ensure code compliance with project standards and pass all pre-commit checks.

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
In remote_instance_loader_utils.py:
   - Add threading and time imports to support cleanup functionality
   - Improve _weights_send_group type annotation for better code clarity
   - Implement automatic cleanup of stale process groups:
     * Add cleanup_thread function to create daemon threads
     * Implement _cleanup_stale_group function to safely destroy process groups
   - Start cleanup thread after initializing weight sending group
   - Rename parameter tensor_nums to remote_tensor_nums for better semantic clarity

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
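The stale-group cleanup described in that commit can be sketched with a timed daemon thread. This is a minimal sketch with assumed names (destroy_fn, done) standing in for the actual _cleanup_stale_group logic:

```python
import threading


def start_cleanup_thread(destroy_fn, done: threading.Event,
                         timeout_s: float = 60.0) -> threading.Thread:
    """Destroy a stale weights-send group if the client never finishes in time."""

    def _cleanup() -> None:
        if not done.wait(timeout_s):  # transfer did not finish before the deadline
            destroy_fn()

    t = threading.Thread(target=_cleanup, daemon=True)
    t.start()
    return t
```

On a successful transfer the server would set `done`, so the group is only torn down when a client disappears mid-handshake.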
Key changes:
1. Properly initialize global_ranks_in_group with the current process rank instead of an empty list
2. Correctly map global ranks to group ranks in _world.pg_group_ranks
3. Update torch.distributed.broadcast parameter from 'src' to 'group_src' for proper group handling
4. Add torch.cuda.empty_cache() calls to optimize GPU memory management during weight transfer

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
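Point 2 above (mapping global ranks to group ranks) reduces to an enumeration over the group's member list. A minimal sketch, assuming the members are given as an ordered list of global ranks:

```python
def group_rank_mapping(global_ranks_in_group: list[int]) -> dict[int, int]:
    # _world.pg_group_ranks maps each global rank to its rank within the group;
    # a rank's position in the member list defines its group rank.
    return {g: i for i, g in enumerate(global_ranks_in_group)}
```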
@tianyuzhou95
Contributor

It seems the functionality and implementation are similar to this PR: sgl-project/sglang#8215. cc @amysaq2023

Would you mind adding a reference or acknowledgment?

@lirong-lirong
Contributor Author

lirong-lirong commented Oct 27, 2025

It seems the functionality and implementation are similar to this PR: sgl-project/sglang#8215. cc @amysaq2023

Would you mind adding a reference or acknowledgment?

Sure! As I mentioned earlier in the PR's description, this PR is indeed based on sglang's RemoteInstanceLoader implementation. Thank you for the reminder! I will promptly add the acknowledgment. Could you please guide me on where and how you'd like me to add it? For example, should I:

  1. Add it in the PR description?
  2. Add comments in the code referencing the original implementation?
  3. Or somewhere else?

I'd appreciate your guidance on the preferred way to do this in this project. Thanks!

@amysaq2023

It seems the functionality and implementation are similar to this PR: sgl-project/sglang#8215. cc @amysaq2023
Would you mind adding a reference or acknowledgment?

Sure! As I mentioned earlier, this PR is indeed based on sglang's RemoteInstanceLoader implementation. Thank you for the reminder! I will promptly add the acknowledgment. Could you please guide me on where and how you'd like me to add it? For example, should I:

  1. Add it in the PR description?
  2. Add comments in the code referencing the original implementation?
  3. Or somewhere else?

I'd appreciate your guidance on the preferred way to do this in this project. Thanks!

Either way works for me. Let's follow whatever citation approach the vLLM maintainers suggest.

Add attribution comments to reference the original source of the remote instance
loading functionality adapted from sgl-project/sglang.

Updated files include:
- remote_instance_loader.py
- remote_instance_loader_utils.py
- api_server.py
- protocol.py

All changes add proper attribution comments referencing:
sgl-project/sglang#8215

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
@ovidiusm

This transfer should reuse vllm's NixlConnector instead of creating a new transfer method from scratch with torch collectives.

There are serverless use cases where the NIXL API is preferable, since it allows, for example, fast swapping of weights in and out of cache on workers that serve several models and are oversubscribed in terms of VRAM, with models used infrequently. There are already some projects working on this.

Reusing NixlConnector will also bring for free all the performance optimizations made for KVCache transfers.

@lirong-lirong
Contributor Author

This transfer should reuse vllm's NixlConnector instead of creating a new transfer method from scratch with torch collectives.

There are serverless use cases where the NIXL API is preferable, since it allows, for example, fast swapping of weights in and out of cache on workers that serve several models and are oversubscribed in terms of VRAM, with models used infrequently. There are already some projects working on this.

Reusing NixlConnector will also bring for free all the performance optimizations made for KVCache transfers.

Thank you for your feedback!

Initially, the custom communicator was chosen to maintain consistency with sglang, under the assumption that mixed deployment scenarios involving vLLM and sglang might arise in the future. Keeping the communicators consistent would allow for seamless model weight exchange between them.

However, I later concluded that this scenario should not be considered. In my current implementation, I reuse vLLM’s StatelessProcessGroup and PyNcclCommunicator to handle weight transfer. Recently, I encountered a peculiar performance issue: in the case of TP1, after the model transfer is completed, the overhead of destroying the process group reaches 16 seconds and blocks subsequent GPU operations.

I will promptly adopt the NixlConnector you mentioned to handle model weight distribution and evaluate its performance. Thank you very much for your suggestion! @ovidiusm

"gpu_id and tp_rank must be specified for RemoteInstanceConnector. "
)

self.device_id = torch.device("cuda", gpu_id)
use current_platform rather than cuda

# To support tp, pp
global_rank = _get_rank()
success, message = client.build_group(
gpu_id=torch.cuda.current_device(),
ditto

t.start()

try:
torch.cuda.empty_cache()
ditto

"finish getting all weights from remote instance, time used: %.4fs",
end_get_weights_tic - start_get_weights_tic,
)
torch.cuda.empty_cache()
ditto

global_rank = pp_rank * tp_size + tp_rank

ports_list = ports.split(",")
gpu_id = torch.cuda.current_device()
ditto

world_size=world_size,
rank=group_rank,
group_name=group_name,
device_id=torch.device("cuda", gpu_id),
ditto

return {"success": False, "message": message}

logger.info("Send weight in %s", send_group)
torch.cuda.empty_cache()
ditto

group_src=0,
group=send_group,
)
torch.cuda.empty_cache()
ditto

@github-actions

github-actions bot commented Apr 6, 2026

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Apr 6, 2026

Labels

frontend · kv-connector · stale (Over 90 days of inactivity) · v1


5 participants