[Ckpt Engine] feat: new sglang entrypoint support for update #12216

Merged
hnyls2002 merged 7 commits into sgl-project:main from openanolis:mateng/add_ckpt_engine_update_entrypoint_support on Oct 30, 2025

Conversation

@stmatengss (Collaborator) commented Oct 27, 2025

Motivation

Add a new entrypoint for ease of use, as suggested in #11755 (comment).
With this PR, weight updates can be launched through either the torchrun entrypoint or the sglang entrypoint.

Usage

python -m sglang.srt.checkpoint_engine.update --update-method broadcast --checkpoint-path /workspace/Qwen/Qwen3-4B/  --inference-parallel-size 2

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @stmatengss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the SGLang checkpoint engine by providing a user-friendly, integrated command-line interface for updating model weights. It abstracts away the complexities of distributed execution by automatically invoking torchrun, allowing users to easily manage and apply model updates in a distributed environment. The changes facilitate a more robust and efficient workflow for maintaining up-to-date models within the SGLang framework.

Highlights

  • New Entrypoint for Checkpoint Engine Update: Introduces a convenient new Python module entrypoint (python -m sglang.srt.checkpoint_engine.update) to simplify the process of updating model weights via the checkpoint engine.
  • Automated torchrun Integration: The new entrypoint automatically handles the invocation of torchrun based on the specified inference-parallel-size, streamlining distributed weight updates without manual torchrun command construction.
  • Distributed Weight Update Logic: Implements robust functions for splitting checkpoint files and tensors across distributed ranks, checking SGLang server readiness, and coordinating weight updates (supporting 'broadcast' and 'p2p' methods) using a ParameterServer.
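The per-rank file split mentioned in the last highlight can be illustrated with a simple round-robin assignment (a sketch only; the helper name and splitting strategy here are assumptions, not the PR's actual code):

```python
def split_files_across_ranks(files, world_size, rank):
    # Round-robin assignment: each rank gets a disjoint subset of the
    # sorted checkpoint files, so together the ranks cover every file once.
    return [f for i, f in enumerate(sorted(files)) if i % world_size == rank]

files = ["model-00001.safetensors", "model-00002.safetensors",
         "model-00003.safetensors"]
print(split_files_across_ranks(files, world_size=2, rank=0))
# → ['model-00001.safetensors', 'model-00003.safetensors']
```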

@gemini-code-assist (Contributor) left a comment
Code Review

This pull request introduces a new entrypoint for updating model weights via the checkpoint engine, which is a useful addition for ease of use. The implementation is well-structured. I've identified a couple of areas for improvement: one regarding the robustness of command-line argument parsing before launching torchrun, and another related to a logic bug that causes an unnecessary delay in p2p update mode. My review includes specific code suggestions to address these points.

Comment on lines +167 to +169
if update_method:
    # sleep 2s to wait destroy process group
    time.sleep(2)
Severity: medium

The time.sleep(2) is executed whenever update_method is "p2p" or "all". However, the comment "sleep 2s to wait destroy process group" suggests this delay is only necessary after the broadcast update method has run. This happens when update_method is "all", but not when it is "p2p". When update_method is "p2p", no broadcast occurs, so the sleep is unnecessary and introduces a performance penalty. The condition should be more specific to only sleep when update_method == "all". Additionally, the if update_method: check is redundant as it's always true within this block.

Suggested change
-if update_method:
-    # sleep 2s to wait destroy process group
-    time.sleep(2)
+if update_method == "all":
+    # sleep 2s to wait destroy process group
+    time.sleep(2)

Comment on lines +207 to +219
for i, arg in enumerate(args):
    if arg == "--inference-parallel-size" and i + 1 < len(args):
        try:
            inference_parallel_size = int(args[i + 1])
        except ValueError:
            pass
        break
    elif arg.startswith("--inference-parallel-size="):
        try:
            inference_parallel_size = int(arg.split("=", 1)[1])
        except ValueError:
            pass
        break
Severity: medium

The manual parsing of --inference-parallel-size uses try-except pass, which silently ignores invalid values. If a user provides a non-integer value, the script proceeds with the default value of 8 for nproc-per-node, only to fail later during the more robust argument parsing in main(). This behavior can be confusing. It's better to fail early with a clear error. Removing the try-except blocks will allow the ValueError from int() to propagate and terminate the script, which is more robust and user-friendly.

Suggested change
-for i, arg in enumerate(args):
-    if arg == "--inference-parallel-size" and i + 1 < len(args):
-        try:
-            inference_parallel_size = int(args[i + 1])
-        except ValueError:
-            pass
-        break
-    elif arg.startswith("--inference-parallel-size="):
-        try:
-            inference_parallel_size = int(arg.split("=", 1)[1])
-        except ValueError:
-            pass
-        break
+for i, arg in enumerate(args):
+    if arg == "--inference-parallel-size" and i + 1 < len(args):
+        inference_parallel_size = int(args[i + 1])
+        break
+    elif arg.startswith("--inference-parallel-size="):
+        inference_parallel_size = int(arg.split("=", 1)[1])
+        break
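Another way to address this review point would be a small pre-parse with argparse, which avoids the hand-rolled scan entirely and also fails fast with a clear error on non-integer values (an alternative sketch, not part of the PR; the helper name is hypothetical):

```python
import argparse

def peek_parallel_size(argv, default=8):
    # Pre-parse only --inference-parallel-size, ignoring everything else;
    # argparse reports a clear error and exits on a non-integer value
    # instead of silently falling back to the default.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--inference-parallel-size", type=int, default=default)
    known, _unknown = parser.parse_known_args(argv)
    return known.inference_parallel_size

print(peek_parallel_size(["--inference-parallel-size", "2", "--checkpoint-path", "x"]))
# → 2
```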

@ShangmingCai (Collaborator) left a comment
LGTM. But I wonder if we have a better solution to replace cmd and torchrun, to make the code less hacky. Will the checkpoint-engine team release a generic entrypoint?

Comment on lines +220 to +237
# Build torchrun command
cmd = ["torchrun", f"--nproc-per-node={inference_parallel_size}", __file__] + args

print(f"Running: {' '.join(cmd)}", file=sys.stderr)

# Execute torchrun with the original script
try:
    result = subprocess.run(cmd, check=False)
    sys.exit(result.returncode)
except FileNotFoundError:
    print(
        "Error: torchrun command not found. Please ensure PyTorch is installed.",
        file=sys.stderr,
    )
    sys.exit(1)
except KeyboardInterrupt:
    print("\nInterrupted by user", file=sys.stderr)
    sys.exit(130)
@ShangmingCai (Collaborator) commented
Is torchrun compulsory?

@stmatengss (Collaborator, Author) commented
If we use ParameterServer in the checkpoint engine, torchrun is compulsory.
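For context, torchrun is what provides the rendezvous environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) that torch.distributed's default env:// init method, and therefore ParameterServer's process group, reads at startup. A minimal illustration of reading them:

```python
import os

# torchrun exports these for each worker it spawns; a plain
# `python update.py` run would have to set them by hand before
# torch.distributed.init_process_group() could rendezvous.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
master_port = os.environ.get("MASTER_PORT", "29500")
print(f"rank={rank} world_size={world_size} rendezvous={master_addr}:{master_port}")
```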

@ShangmingCai (Collaborator) left a comment
LGTM

@hnyls2002 hnyls2002 merged commit 32438eb into sgl-project:main Oct 30, 2025
79 of 97 checks passed