[Core] Remote Instance Model Loader: A loader used to fetch model weights from a ready vLLM instance.#27417
Conversation
Code Review
This pull request introduces a RemoteInstanceModelLoader for fetching model weights from another running vLLM instance, which is a valuable feature for reducing startup times in distributed environments. The implementation is comprehensive, adding new API endpoints, a weight transfer connector, and utility functions for coordination. The code is generally well-structured. I've identified a critical bug that could lead to a server crash and a potential resource leak issue that should be addressed for improved robustness.
…transfer across vLLM instances Signed-off-by: pengdrumli <pengdrumli@tencent.com>
This commit adds a comprehensive end-to-end test for the recently implemented remote instance model loader functionality. The test verifies that:
1. A seed server can be started successfully as the source of model weights
2. A client server can load weights from the remote seed server using the new remote_instance load format
3. Both servers produce identical outputs when given the same prompts

The test requires 8 GPUs (4 for the seed instance + 4 for the client instance) with a 2x2 TP/PP configuration and is appropriately skipped when insufficient resources are available. It also sets up the proper environment variables and network communication between the coordinated servers. This test ensures the reliability and correctness of the distributed weight transfer feature across vLLM instances.

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
Apply automatic code formatting using ruff-format and other pre-commit hooks. No functional changes. Signed-off-by: pengdrumli <pengdrumli@tencent.com>
Signed-off-by: pengdrumli <pengdrumli@tencent.com>
…project#27417 (comment)] Signed-off-by: pengdrumli <pengdrumli@tencent.com>
This commit fixes Ruff linting errors including B024, E501, G004, E402, and SIM102 to ensure code compliance with project standards and pass all pre-commit checks. Signed-off-by: pengdrumli <pengdrumli@tencent.com>
In remote_instance_loader_utils.py:
- Add threading and time imports to support cleanup functionality
- Improve _weights_send_group type annotation for better code clarity
- Implement automatic cleanup of stale process groups:
* Add cleanup_thread function to create daemon threads
* Implement _cleanup_stale_group function to safely destroy process groups
- Start cleanup thread after initializing weight sending group
- Rename parameter tensor_nums to remote_tensor_nums for better semantic clarity
Signed-off-by: pengdrumli <pengdrumli@tencent.com>
Key changes:
1. Properly initialize global_ranks_in_group with the current process rank instead of an empty list
2. Correctly map global ranks to group ranks in _world.pg_group_ranks
3. Update the torch.distributed.broadcast parameter from 'src' to 'group_src' for proper group handling
4. Add torch.cuda.empty_cache() calls to optimize GPU memory management during weight transfer

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
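The rank bookkeeping behind points 1–2 can be illustrated with a small pure-Python sketch. The helper names here are hypothetical; in the actual code the flattened rank feeds process-group construction, and the global-to-group mapping is what torch.distributed keeps in `_world.pg_group_ranks` so that `broadcast(..., group_src=0)` can address the source by its rank *within* the group rather than its global rank.

```python
def global_rank(pp_rank: int, tp_rank: int, tp_size: int) -> int:
    """Flatten a (pp_rank, tp_rank) coordinate into a single global rank."""
    return pp_rank * tp_size + tp_rank


def build_group_rank_map(global_ranks_in_group: list[int]) -> dict[int, int]:
    """Map each global rank to its rank within the group (0..len-1).

    Hypothetical helper: this is the kind of mapping torch.distributed
    stores per process group, and the reason 'group_src' is the correct
    way to name the broadcast source inside a subgroup.
    """
    return {g: i for i, g in enumerate(sorted(global_ranks_in_group))}
```

For example, with TP=2 the worker at pp_rank=1, tp_rank=1 has global rank 3, but if the send group contains only global ranks {3, 7}, its group rank is 0 — which is why passing a global rank as `src` mis-addresses the broadcast.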
It seems the functionality and implementation are similar to this PR: sgl-project/sglang#8215. cc @amysaq2023 Would you mind adding a reference or acknowledgment?
Sure! As I mentioned earlier in the PR's description, this PR is indeed based on sglang's RemoteInstanceLoader implementation. Thank you for the reminder! I will promptly add the acknowledgment. Could you please guide me on where and how you'd like me to add it? For example, should I:
I'd appreciate your guidance on the preferred way to do this in this project. Thanks!
Either way works for me. Let's follow whichever way the vLLM maintainers suggest for citing the reference.
Add attribution comments to reference the original source of the remote instance loading functionality adapted from sgl-project/sglang. Updated files include:
- remote_instance_loader.py
- remote_instance_loader_utils.py
- api_server.py
- protocol.py

All changes add proper attribution comments referencing sgl-project/sglang#8215.

Signed-off-by: pengdrumli <pengdrumli@tencent.com>
This transfer should reuse vLLM's NixlConnector. There are serverless use cases where the NIXL API is preferable, since it allows, for example, fast swapping of weights in and out of cache on workers that serve several models and are oversubscribed in terms of VRAM, with models used infrequently. There are already some projects working on this. Reusing NixlConnector will also bring, for free, all the performance optimizations made for KV-cache transfers.
Initially, the custom communicator was chosen to maintain consistency with sglang, under the assumption that mixed deployment scenarios involving vLLM and sglang might arise in the future; keeping the communicators consistent would allow seamless model weight exchange between them. However, I later concluded that this scenario should not be considered. In my current implementation, I reuse vLLM's StatelessProcessGroup and PyNcclCommunicator to handle the weight transfer.

Recently, I encountered a peculiar performance issue: in the TP1 case, after the model transfer completes, destroying the process group takes 16 seconds and blocks subsequent GPU operations. I will promptly adopt the NixlConnector you mentioned for model weight distribution and evaluate its performance. Thank you very much for your suggestion! @ovidiusm
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Purpose
This PR implements a new model loading mechanism that enables transferring model weights from remote vLLM instances. This feature is particularly useful in distributed deployments where multiple vLLM instances need to share model weights without individually loading them from storage.
Key features of this implementation:
- `RemoteInstanceModelLoader` for loading model weights from GPUs of other vLLM instances
- New API endpoints (`/init_weights_send_group_for_remote_instance` and `/send_weights_to_remote_instance`) for coordinating weight transfers

This PR effectively reduces the startup time of vLLM, particularly in large-scale scaling scenarios on the cloud.
Test Plan
End-to-end tests with multiple vLLM instances to verify weight transfer
Test commands:
Test Result
log.txt
How to use
Start a vLLM instance normally as a seed instance to provide the model weight data source
Launch a new vLLM instance using the Remote Instance Model Loader
The three environment variables above represent the IP address of the remote instance, the service port, and the designated ports for the NCCL process group, respectively. The number of ports should match the number of workers.
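The configuration contract above — one IP, one service port, and one NCCL port per worker — can be sketched as a small validation helper. The environment variable names used here (`REMOTE_INSTANCE_IP`, `REMOTE_INSTANCE_SERVICE_PORT`, `REMOTE_INSTANCE_WEIGHT_PORTS`) are placeholders for illustration; check the PR diff for the actual names.

```python
import os


def parse_remote_instance_env(num_workers: int) -> tuple[str, int, list[int]]:
    """Read and validate the three remote-instance settings (placeholder names)."""
    ip = os.environ["REMOTE_INSTANCE_IP"]  # address of the seed instance
    service_port = int(os.environ["REMOTE_INSTANCE_SERVICE_PORT"])  # its HTTP API port
    # Comma-separated list of NCCL ports, one per worker process.
    ports = [int(p) for p in os.environ["REMOTE_INSTANCE_WEIGHT_PORTS"].split(",")]
    if len(ports) != num_workers:
        raise ValueError(
            f"expected {num_workers} NCCL ports (one per worker), got {len(ports)}"
        )
    return ip, service_port, ports
```

For a 2x2 TP/PP client (4 workers), the ports variable would therefore carry exactly four comma-separated ports.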
Additional Context
We are continuously working on reducing the startup overhead of inference services. Previously, we implemented other solutions, such as using Mooncake Store to store processed weight tensors and pulling them directly to the GPU via GPUDirect RDMA (GDR), as well as leveraging the Mooncake transfer engine for direct GPU-to-GPU tensor transfer between old and new instances.
Recently, the sglang project introduced a new 'remote instance model loader' feature, which accomplishes the same task using the default NCCL communication backend combined with a broadcast operation. Considering the potential value of maintaining future compatibility with sglang, we have implemented the same approach in vLLM.
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.