feat: Add checkpoint-engine IPC integration #10646
BraveY wants to merge 2 commits into sgl-project:main from BraveY:checkpoint-engine
Implement IPC-based weight updates for checkpoint-engine compatibility:
- Add SGLangCheckpointEngineWorkerExtension worker class
- Implement update_weights_from_ipc across scheduler/tokenizer/worker layers
- Add collective_rpc endpoint for vLLM API compatibility
- Support ZMQ communication with device UUID management
- Include post-loading hooks and error handling

This allows efficient model weight updates via IPC without server restart.
Similar to #10667.
Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com>
raise ValueError(
    f"Device UUID {device_uuid} not found in zmq_handles: {list(zmq_handles.keys())}"
)
update_weights_from_ipc(
Maybe directly use `from checkpoint_engine.worker import update_weights_from_ipc` instead of copying the duplicated code.
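For readers following the thread, the device-UUID check in the snippet above can be sketched as a small standalone helper. This is only an illustration: the function name `resolve_zmq_handle` and the example UUIDs and socket paths are hypothetical and are not taken from the PR or from checkpoint-engine; only the error message mirrors the diff.

```python
from typing import Dict


def resolve_zmq_handle(device_uuid: str, zmq_handles: Dict[str, str]) -> str:
    """Return the ZMQ socket path registered for this device (hypothetical helper)."""
    if device_uuid not in zmq_handles:
        # Same message as in the diff: list the known UUIDs to ease debugging.
        raise ValueError(
            f"Device UUID {device_uuid} not found in zmq_handles: {list(zmq_handles.keys())}"
        )
    return zmq_handles[device_uuid]


# Made-up UUIDs and socket paths, for illustration only.
handles = {"GPU-aaaa": "ipc:///tmp/ce-0.sock", "GPU-bbbb": "ipc:///tmp/ce-1.sock"}
```

Each worker would look up only its own device's socket, so a missing entry fails fast on that worker rather than silently skipping the update.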
return ORJSONResponse(
    {"error": {"message": str(e)}}, status_code=HTTPStatus.BAD_REQUEST
)

@app.post("/collective_rpc")
Maybe this method is unnecessary; having the client call /update_weights_from_ipc should be enough.
I think keeping this API lets us directly reuse the scripts in checkpoint-engine's examples, so users can quickly test the SGLang integration. If we remove it, we would need to add an SGLang example script, which means more duplicate code. Perhaps keeping the interface is better?
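To make the trade-off in this thread concrete, here is a minimal sketch of how a /collective_rpc compatibility layer could forward to the same handler that /update_weights_from_ipc uses. Everything here is an assumption for illustration: the `{"method", "args", "kwargs"}` payload shape, the `RPC_METHODS` allow-list, and the stand-in handler are hypothetical and not taken from the PR; only the `{"error": {"message": ...}}` shape mirrors the diff.

```python
from typing import Any, Callable, Dict, List


def update_weights_from_ipc(zmq_handles: Dict[str, str]) -> Dict[str, Any]:
    """Stand-in for the real handler; it only echoes which devices it received."""
    return {"success": True, "devices": sorted(zmq_handles)}


# Hypothetical allow-list: only explicitly registered methods are callable.
RPC_METHODS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "update_weights_from_ipc": update_weights_from_ipc,
}


def collective_rpc(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch an assumed {"method", "args", "kwargs"} body to a registered handler."""
    method = payload.get("method")
    handler = RPC_METHODS.get(method)
    if handler is None:
        return {"error": {"message": f"Unsupported method: {method}"}}
    args: List[Any] = payload.get("args") or []
    kwargs: Dict[str, Any] = payload.get("kwargs") or {}
    try:
        return handler(*args, **kwargs)
    except Exception as e:
        # Same error shape as the response in the diff above.
        return {"error": {"message": str(e)}}
```

With a thin dispatcher like this, both endpoints share one code path, so keeping /collective_rpc for script compatibility would not duplicate the update logic itself.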
Maybe this file can be moved to the checkpoint-engine/example folder
@dataclass
class UpdateWeightsFromIPCReqInput:
    # ZMQ socket paths for each device UUID
    zmq_handles: Dict[str, str]
This connection logic should probably be generalized, since it won't just be ZMQ.
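One possible direction for the generalization suggested here, sketched under stated assumptions: store a full endpoint URI per device UUID and let the URI scheme select the transport. The class name `GeneralizedUpdateWeightsReqInput`, the `handles` field, and the `transport_for` method are hypothetical and are not part of the PR.

```python
from dataclasses import dataclass
from typing import Dict
from urllib.parse import urlparse


@dataclass
class GeneralizedUpdateWeightsReqInput:
    # Endpoint URI for each device UUID; the scheme names the transport (hypothetical).
    handles: Dict[str, str]

    def transport_for(self, device_uuid: str) -> str:
        """Return the URI scheme ("ipc", "tcp", ...) of a device's endpoint."""
        scheme = urlparse(self.handles[device_uuid]).scheme
        if not scheme:
            raise ValueError(f"Endpoint for {device_uuid} has no transport scheme")
        return scheme
```

A worker could then branch on `transport_for(...)` to pick a connector, and ZMQ-specific naming would no longer leak into the request type.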
@stmatengss @XucSh let's optimize this PR on https://github.com/openanolis/sglang/commits/kaiyong/checkpoint-engine/ . I will close this PR.
Motivation
Implement IPC-based weight updates for checkpoint-engine compatibility.
Modifications
Accuracy Tests
The test command is similar to the "update weights on vLLM" test in checkpoint-engine's README. Just replace the command that starts vLLM with the command that starts SGLang.
Single Node:
Start SGLang.
Update weights with checkpoint-engine's script.
The weight update process for DeepSeek-R1 has been tested on a single machine and on two machines. After the weight update completed, the model's inference results were verified to be normal.
Co-author: @stmatengss @XucSh @zxpdemonio
Thanks to @weixiao-huang for help.
Benchmarking and Profiling
Checklist