feat: Add checkpoint-engine IPC integration #10646

Closed
BraveY wants to merge 2 commits into sgl-project:main from BraveY:checkpoint-engine

Conversation

@BraveY
Contributor

@BraveY BraveY commented Sep 19, 2025

Implement IPC-based weight updates for checkpoint-engine compatibility. This allows efficient model weight updates via IPC without a server restart.

Motivation

Implement IPC-based weight updates for checkpoint-engine compatibility.

Modifications

  • Add SGLangCheckpointEngineWorkerExtension worker class
  • Implement update_weights_from_ipc across scheduler/tokenizer/worker layers
  • Add collective_rpc endpoint for vLLM API compatibility
  • Support ZMQ communication with device UUID management
  • Include post-loading hooks and error handling

Accuracy Tests

The test commands mirror the vLLM weight-update test in checkpoint-engine's README; just replace the command that starts vLLM with the command that starts SGLang.

Single node:
Start SGLang:

python -m sglang.launch_server --model-path /workspace/Qwen/Qwen3-4B/ --tensor-parallel-size 2 --port 19730 --load-format dummy --mem-fraction-static 0.7

Update the weights with checkpoint-engine's script:

torchrun --nproc-per-node 2 examples/update.py --update-method broadcast --checkpoint-path /workspace/Qwen/Qwen3-4B/  --inference-parallel-size 2

The DeepSeek-R1 weight update was tested on a single machine and on two machines. After the update completed, the model's inference results were verified to be normal.

Co-author: @stmatengss @XucSh @zxpdemonio
Thanks to @weixiao-huang for help.

Benchmarking and Profiling

Checklist

@BraveY BraveY marked this pull request as draft September 19, 2025 05:54
@BraveY
Contributor Author

BraveY commented Sep 20, 2025

Similar to #10667.

Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com>
@BraveY BraveY marked this pull request as ready for review September 24, 2025 06:31
@BraveY BraveY changed the title from "[WIP] feat: Add checkpoint-engine IPC integration" to "feat: Add checkpoint-engine IPC integration" Sep 25, 2025
    raise ValueError(
        f"Device UUID {device_uuid} not found in zmq_handles: {list(zmq_handles.keys())}"
    )
update_weights_from_ipc(

Maybe use `from checkpoint_engine.worker import update_weights_from_ipc` directly instead of copying the duplicated code.
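
For readers following the fragment above, here is a minimal sketch of the device-UUID lookup it appears to guard. The helper name `select_zmq_handle`, the example UUIDs, and the socket paths are hypothetical and not taken from the PR; only the error message mirrors the quoted code.

```python
from typing import Dict


def select_zmq_handle(device_uuid: str, zmq_handles: Dict[str, str]) -> str:
    """Return the ZMQ socket path registered for this device, or raise."""
    if device_uuid not in zmq_handles:
        # Same message shape as the fragment quoted in the review.
        raise ValueError(
            f"Device UUID {device_uuid} not found in zmq_handles: "
            f"{list(zmq_handles.keys())}"
        )
    return zmq_handles[device_uuid]


# Hypothetical handles: these UUIDs and socket paths are made up for illustration.
handles = {
    "GPU-aaaa": "ipc:///tmp/ckpt-engine-0.sock",
    "GPU-bbbb": "ipc:///tmp/ckpt-engine-1.sock",
}
selected = select_zmq_handle("GPU-bbbb", handles)
```

Each worker would pass the selected path on to whatever routine actually receives the weights; that call is left out here because its signature is not shown in this conversation.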

    return ORJSONResponse(
        {"error": {"message": str(e)}}, status_code=HTTPStatus.BAD_REQUEST
    )
@app.post("/collective_rpc")

Maybe this method is unnecessary; having clients call /update_weights_from_ipc should be enough.

Contributor Author

I think keeping this API lets users directly reuse the scripts in checkpoint-engine's examples and quickly test the SGLang integration. If we remove it, we would need to add an SGLang example script, which means more duplicated code. Perhaps keeping the interface is better?
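
To make the trade-off concrete, here is a rough sketch of how a generic `collective_rpc`-style entry point could forward to the dedicated handler. The payload shape (`{"method": ..., "args": [...]}`), the `RPC_METHODS` registry, and the stand-in handler body are assumptions for illustration; they are not the PR's code or a confirmed vLLM or checkpoint-engine API.

```python
from typing import Any, Callable, Dict, List


def update_weights_from_ipc(zmq_handles: Dict[str, str]) -> Dict[str, Any]:
    """Stand-in handler; a real one would load weights over IPC."""
    return {"success": True, "devices": sorted(zmq_handles)}


# Registry of method names the generic endpoint is willing to forward (hypothetical).
RPC_METHODS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "update_weights_from_ipc": update_weights_from_ipc,
}


def collective_rpc(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch an assumed {"method": ..., "args": [...]} payload to a handler."""
    method = payload.get("method")
    handler = RPC_METHODS.get(method)
    if handler is None:
        return {"error": {"message": f"Unknown method: {method}"}}
    args: List[Any] = payload.get("args", [])
    return handler(*args)


result = collective_rpc(
    {"method": "update_weights_from_ipc", "args": [{"GPU-aaaa": "ipc:///tmp/a.sock"}]}
)
```

Under this sketch the generic endpoint is only a thin name-to-handler lookup, so keeping it costs little, while dropping it would push the same mapping into each client script.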

Collaborator

Maybe this file can be moved to the checkpoint-engine/example folder

@stmatengss
Collaborator

I suggest merging the two PRs (#10646 and #10667), rebasing onto main, and submitting a new PR.

@dataclass
class UpdateWeightsFromIPCReqInput:
    # ZMQ socket paths for each device UUID
    zmq_handles: Dict[str, str]
Collaborator

This connection logic should probably be generalized, since it won't just be ZMQ.
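
One possible reading of this suggestion, sketched under assumptions: carry a transport name alongside each address so the request is not tied to ZMQ. The names `IPCEndpoint`, `GeneralizedUpdateWeightsFromIPCReqInput`, and `from_zmq_handles` are hypothetical and do not appear in the PR.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IPCEndpoint:
    # Transport name, e.g. "zmq"; any other value is hypothetical.
    transport: str
    # Transport-specific address, e.g. a ZMQ socket path.
    address: str


@dataclass
class GeneralizedUpdateWeightsFromIPCReqInput:
    # One endpoint per device UUID, independent of the transport in use.
    endpoints: Dict[str, IPCEndpoint] = field(default_factory=dict)

    @classmethod
    def from_zmq_handles(
        cls, zmq_handles: Dict[str, str]
    ) -> "GeneralizedUpdateWeightsFromIPCReqInput":
        """Adapt a ZMQ-only mapping to the transport-agnostic form."""
        return cls(
            {uuid: IPCEndpoint("zmq", path) for uuid, path in zmq_handles.items()}
        )


# Hypothetical device UUID and socket path, for illustration only.
req = GeneralizedUpdateWeightsFromIPCReqInput.from_zmq_handles(
    {"GPU-aaaa": "ipc:///tmp/a.sock"}
)
```

The adapter keeps existing ZMQ callers working while leaving room for other transports later; whether that matches the reviewer's intent is not stated in the thread.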

@BraveY
Contributor Author

BraveY commented Oct 14, 2025
