[Ckpt Engine] feat: new sglang entrypoint support for update #12216

Merged
hnyls2002 merged 7 commits into sgl-project:main from openanolis:mateng/add_ckpt_engine_update_entrypoint_support on Oct 30, 2025

Conversation

@stmatengss (Collaborator) commented Oct 27, 2025

Motivation

Add a new entrypoint for ease of use, as suggested in #11755 (comment).
With this PR, weight updates can be launched through either the torchrun entrypoint or the sglang entrypoint.

Usage

python -m sglang.srt.checkpoint_engine.update --update-method broadcast --checkpoint-path /workspace/Qwen/Qwen3-4B/  --inference-parallel-size 2

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @stmatengss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the SGLang checkpoint engine by providing a user-friendly, integrated command-line interface for updating model weights. It abstracts away the complexities of distributed execution by automatically invoking torchrun, allowing users to easily manage and apply model updates in a distributed environment. The changes facilitate a more robust and efficient workflow for maintaining up-to-date models within the SGLang framework.

Highlights

  • New Entrypoint for Checkpoint Engine Update: Introduces a convenient new Python module entrypoint (python -m sglang.srt.checkpoint_engine.update) to simplify the process of updating model weights via the checkpoint engine.
  • Automated torchrun Integration: The new entrypoint automatically handles the invocation of torchrun based on the specified inference-parallel-size, streamlining distributed weight updates without manual torchrun command construction.
  • Distributed Weight Update Logic: Implements robust functions for splitting checkpoint files and tensors across distributed ranks, checking SGLang server readiness, and coordinating weight updates (supporting 'broadcast' and 'p2p' methods) using a ParameterServer.
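The per-rank file split mentioned in the last highlight can be illustrated with a simple round-robin assignment (a sketch only; the helper name and splitting strategy here are assumptions, not the PR's actual code):

```python
def split_files_across_ranks(files, world_size, rank):
    # Round-robin assignment: each rank gets a disjoint subset of the
    # sorted checkpoint files, so together the ranks cover every file once.
    return [f for i, f in enumerate(sorted(files)) if i % world_size == rank]

files = ["model-00001.safetensors", "model-00002.safetensors",
         "model-00003.safetensors"]
print(split_files_across_ranks(files, world_size=2, rank=0))
# → ['model-00001.safetensors', 'model-00003.safetensors']
```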

@gemini-code-assist (Contributor) left a comment
Code Review

This pull request introduces a new entrypoint for updating model weights via the checkpoint engine, which is a useful addition for ease of use. The implementation is well-structured. I've identified a couple of areas for improvement: one regarding the robustness of command-line argument parsing before launching torchrun, and another related to a logic bug that causes an unnecessary delay in p2p update mode. My review includes specific code suggestions to address these points.

Comment on lines +167 to +169
if update_method:
    # sleep 2s to wait destroy process group
    time.sleep(2)
Severity: medium

The time.sleep(2) is executed whenever update_method is "p2p" or "all". However, the comment "sleep 2s to wait destroy process group" suggests this delay is only necessary after the broadcast update method has run. This happens when update_method is "all", but not when it is "p2p". When update_method is "p2p", no broadcast occurs, so the sleep is unnecessary and introduces a performance penalty. The condition should be more specific to only sleep when update_method == "all". Additionally, the if update_method: check is redundant as it's always true within this block.

Suggested change
-if update_method:
-    # sleep 2s to wait destroy process group
-    time.sleep(2)
+if update_method == "all":
+    # sleep 2s to wait destroy process group
+    time.sleep(2)

Comment on lines +207 to +219
for i, arg in enumerate(args):
    if arg == "--inference-parallel-size" and i + 1 < len(args):
        try:
            inference_parallel_size = int(args[i + 1])
        except ValueError:
            pass
        break
    elif arg.startswith("--inference-parallel-size="):
        try:
            inference_parallel_size = int(arg.split("=", 1)[1])
        except ValueError:
            pass
        break
Severity: medium

The manual parsing of --inference-parallel-size uses try-except pass, which silently ignores invalid values. If a user provides a non-integer value, the script proceeds with the default value of 8 for nproc-per-node, only to fail later during the more robust argument parsing in main(). This behavior can be confusing. It's better to fail early with a clear error. Removing the try-except blocks will allow the ValueError from int() to propagate and terminate the script, which is more robust and user-friendly.

Suggested change
-for i, arg in enumerate(args):
-    if arg == "--inference-parallel-size" and i + 1 < len(args):
-        try:
-            inference_parallel_size = int(args[i + 1])
-        except ValueError:
-            pass
-        break
-    elif arg.startswith("--inference-parallel-size="):
-        try:
-            inference_parallel_size = int(arg.split("=", 1)[1])
-        except ValueError:
-            pass
-        break
+for i, arg in enumerate(args):
+    if arg == "--inference-parallel-size" and i + 1 < len(args):
+        inference_parallel_size = int(args[i + 1])
+        break
+    elif arg.startswith("--inference-parallel-size="):
+        inference_parallel_size = int(arg.split("=", 1)[1])
+        break
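Another way to address this review point would be a small pre-parse with argparse, which avoids the hand-rolled scan entirely and also fails fast with a clear error on non-integer values (an alternative sketch, not part of the PR; the helper name is hypothetical):

```python
import argparse

def peek_parallel_size(argv, default=8):
    # Pre-parse only --inference-parallel-size, ignoring everything else;
    # argparse reports a clear error and exits on a non-integer value
    # instead of silently falling back to the default.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--inference-parallel-size", type=int, default=default)
    known, _unknown = parser.parse_known_args(argv)
    return known.inference_parallel_size

print(peek_parallel_size(["--inference-parallel-size", "2", "--checkpoint-path", "x"]))
# → 2
```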

@ShangmingCai (Collaborator) left a comment
LGTM. But I wonder if we have a better solution to replace cmd and torchrun, to make the code less hacky. Will the checkpoint-engine team release a generic entrypoint?

Comment on lines +220 to +237
# Build torchrun command
cmd = ["torchrun", f"--nproc-per-node={inference_parallel_size}", __file__] + args

print(f"Running: {' '.join(cmd)}", file=sys.stderr)

# Execute torchrun with the original script
try:
    result = subprocess.run(cmd, check=False)
    sys.exit(result.returncode)
except FileNotFoundError:
    print(
        "Error: torchrun command not found. Please ensure PyTorch is installed.",
        file=sys.stderr,
    )
    sys.exit(1)
except KeyboardInterrupt:
    print("\nInterrupted by user", file=sys.stderr)
    sys.exit(130)
@ShangmingCai (Collaborator) commented
Is torchrun compulsory?

@stmatengss (Collaborator, Author) commented
If we use ParameterServer in the checkpoint engine, torchrun is compulsory.
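For context, torchrun is what provides the rendezvous environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) that torch.distributed's default env:// init method, and therefore ParameterServer's process group, reads at startup. A minimal illustration of reading them:

```python
import os

# torchrun exports these for each worker it spawns; a plain
# `python update.py` run would have to set them by hand before
# torch.distributed.init_process_group() could rendezvous.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
master_port = os.environ.get("MASTER_PORT", "29500")
print(f"rank={rank} world_size={world_size} rendezvous={master_addr}:{master_port}")
```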

@ShangmingCai (Collaborator) left a comment
LGTM

@hnyls2002 hnyls2002 merged commit 32438eb into sgl-project:main Oct 30, 2025
79 of 97 checks passed