Support non-disturbing remote instance weight loader v2 #14997
Conversation
Code Review
This pull request introduces a new TransferEngine backend for remote instance weight loading, which is a great addition for improving instance startup time without disturbing ongoing inference. The implementation is well-structured, with changes spanning documentation, configuration, server entrypoints, and the model loading logic. The use of torch.cuda.memory.memory_snapshot for optimizing memory registration is particularly clever.
My review includes a few suggestions to improve documentation clarity, fix a type hint, remove unused code, correct a logic issue in a utility function, and fix a typo in the tests. Overall, this is a solid contribution.
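The `torch.cuda.memory.memory_snapshot` optimization mentioned above can be sketched as follows: instead of registering every weight tensor with the RDMA engine separately, the allocator snapshot is used to derive a small number of contiguous address ranges to register. The coalescing helper and the fake snapshot below are illustrative assumptions, not SGLang's actual implementation; only the snapshot entry keys (`address`, `total_size`, `device`) follow PyTorch's documented snapshot format.

```python
# Hedged sketch: coalescing CUDA allocator segments (as reported by
# torch.cuda.memory.memory_snapshot()) into maximal contiguous ranges,
# so an RDMA engine registers a few large regions instead of many tensors.

def coalesce_segments(snapshot, device=0):
    """Merge adjacent allocator segments into maximal contiguous ranges."""
    segs = sorted(
        (s["address"], s["total_size"])
        for s in snapshot
        if s.get("device", 0) == device
    )
    ranges = []
    for addr, size in segs:
        if ranges and ranges[-1][0] + ranges[-1][1] == addr:
            # Previous range ends exactly where this segment starts: extend it.
            ranges[-1][1] += size
        else:
            ranges.append([addr, size])
    return [(a, sz) for a, sz in ranges]

# Example with a fake snapshot (two touching segments plus one separate one):
fake = [
    {"address": 0x1000, "total_size": 0x1000, "device": 0},
    {"address": 0x2000, "total_size": 0x2000, "device": 0},
    {"address": 0x9000, "total_size": 0x1000, "device": 0},
]
print(coalesce_segments(fake))  # [(4096, 12288), (36864, 4096)]
```

With a real snapshot, each returned `(address, length)` pair would be handed to the engine's memory-registration call once, keeping registration count low.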
# }
# )
# }
remote_instance_transfer_engine_info: Optional[Dict] = None
Can you drop this?
We can parse from scheduler_info directly.
For now, we only changed the return value of launch_subprocesses from a single scheduler_info to a list of scheduler_info. When it is stored in global_state, only the first scheduler_info item is kept, as before.
Would you suggest storing the whole list of scheduler_info in global_state as well? I'm a little concerned that it would put too much redundant scheduler info in global state.
This commit supports a non-disturbing mode for loading weights from remote instances. Previously, SGLang already supported loading weights from remote instances using torch.distributed with the NCCL backend; however, that approach disturbs ongoing inference requests, since torch.distributed launches CUDA kernels to transfer weight tensors. This commit introduces another backend option, TransferEngine, which transfers weight tensors over RDMA without disturbing any ongoing GPU workload. Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
…14997) Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
Hi, may I ask why "Memory saver is not compatible with TransferEngine"? I tried to use TransferEngine to transfer weights with memory saver enabled, but got an error in register_memory_region(): "Transfer Engine does not support overlapped memory region". Is that the only reason you mean by "not compatible"? Is there a way to work around this problem?
TransferEngine does not work with memory-saver-managed VRAM; it requires the physical memory backing a registration to stay fixed.
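A minimal sketch of the failure mode quoted above: an RDMA registration table that rejects overlapping address ranges, which is what memory saver's re-mapped VRAM runs into. The `MemoryRegistry` class and the overlap check are hypothetical, for illustration only; they are not the real TransferEngine API.

```python
# Illustrative stub of an RDMA engine's registration table that rejects
# overlapping memory regions, mirroring the quoted error message.

def overlaps(a, b):
    """True if half-open ranges (start, length) a and b intersect."""
    a_start, a_len = a
    b_start, b_len = b
    return a_start < b_start + b_len and b_start < a_start + a_len

class MemoryRegistry:
    def __init__(self):
        self.regions = []

    def register_memory_region(self, addr, length):
        for region in self.regions:
            if overlaps(region, (addr, length)):
                raise ValueError(
                    "Transfer Engine does not support overlapped memory region"
                )
        self.regions.append((addr, length))

reg = MemoryRegistry()
reg.register_memory_region(0x1000, 0x1000)   # ok
reg.register_memory_region(0x2000, 0x1000)   # ok: adjacent, not overlapping
try:
    reg.register_memory_region(0x1800, 0x100)  # intersects the first region
except ValueError as e:
    print(e)  # Transfer Engine does not support overlapped memory region
```

When memory saver releases and later re-maps VRAM, a new mapping can land inside an already-registered range, triggering exactly this rejection.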
Motivation
In #8215, SGLang already supported a new load format, remote_instance, which allows a new instance to load weights from another running instance. This approach can greatly improve weight loading time during instance initialization. However, since it uses torch.distributed with the NCCL backend, it disturbs ongoing inference requests: torch.distributed always launches CUDA kernels to transfer weight tensors.
We introduce another backend option, TransferEngine, which does not disturb any GPU workload and still uses RDMA to transfer weights.
Modifications
We initialize one TransferEngine for each ModelRunner and register its weights to the RDMA channel during initialization.
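The per-ModelRunner initialization described above might look roughly like this sketch. `TransferEngine` here is a local stub, and the `register_memory_region` signature is an assumption for illustration; neither is SGLang's actual class or API.

```python
# Hedged sketch: one TransferEngine per ModelRunner, with every weight's
# address range registered once at init, so later RDMA reads from a client
# instance never launch CUDA kernels on the seed instance.

class TransferEngine:
    """Stub standing in for a real RDMA transfer engine."""
    def __init__(self):
        self.regions = []

    def register_memory_region(self, addr, length):
        # A real engine would pin this range and register it with the NIC.
        self.regions.append((addr, length))


class ModelRunner:
    def __init__(self, named_weights):
        # named_weights maps parameter name -> (device address, size in bytes).
        self.named_weights = dict(named_weights)
        self.engine = TransferEngine()  # one engine per runner
        for name, (addr, nbytes) in self.named_weights.items():
            self.engine.register_memory_region(addr, nbytes)


runner = ModelRunner({
    "embed.weight": (0x1000, 4096),
    "lm_head.weight": (0x2000, 8192),
})
print(len(runner.engine.regions))  # 2
```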
When initializing a new instance that wants to use the `remote_instance` load format with the TransferEngine backend:

How to use:

seed instance:

```
python -m sglang.launch_server [args] \
  --remote-instance-weight-loader-support-transfer-engine
```

client instance:

```
python -m sglang.launch_server [args] \
  --load-format remote_instance \
  --remote-instance-weight-loader-seed-instance-ip [seed_instance_ip] \
  --remote-instance-weight-loader-seed-instance-service-port [seed_instance_service_port] \
  --remote-instance-weight-loader-backend "transfer_engine"
```

Checklist