
Support getting checksums in weight checker #24537

Merged
fzyzcjy merged 13 commits into sgl-project:main from fzyzcjy:weight_ft/2
May 6, 2026

Conversation

@fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented May 6, 2026

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

fzyzcjy added 11 commits May 6, 2026 21:34

Cover _random_like dtype branches, _postprocess_tensors (non-persistent
buffer skip, fp8 quant pair handling for both fp32 and ue8m0-packed
scales), _check_tensors error paths, and the WeightChecker class
lifecycle (snapshot, reset_tensors, compare, handle dispatch). Real fp8
tensors are constructed via quant_weight_ue8m0/transform_scale_ue8m0; no
mocks of fp8 utilities.
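
A minimal self-contained sketch of the round trip those tests exercise, assuming per-tensor power-of-two scaling (the real quant_weight_ue8m0/transform_scale_ue8m0 helpers in sglang operate blockwise and pack the scales; this only illustrates the ue8m0 idea that a scale is a bare 8-bit exponent, i.e. a power of two):

```python
import torch

def quant_ue8m0_sketch(w: torch.Tensor):
    # Round the needed per-tensor scale up to a power of two (448 = fp8 e4m3 max).
    amax = w.abs().amax().clamp(min=1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / 448.0)))
    qweight = (w / scale).to(torch.float8_e4m3fn)
    return qweight, scale

def dequant_sketch(qweight, scale, dtype=torch.bfloat16):
    return qweight.to(dtype) * scale

w = torch.randn(128, 128)
q, s = quant_ue8m0_sketch(w)
# fp8 e4m3 carries roughly 6% relative precision; the power-of-two scale is exact.
torch.testing.assert_close(dequant_sketch(q, s).float(), w, rtol=0.07, atol=s.item())
```
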
Launches a real sgl server and exercises snapshot/compare/reset_tensors
plus an unknown-action negative case. Cases that mutate weights are
named to sort last so they cannot affect earlier cases sharing the
server.
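
The ordering trick relies on unittest's default alphabetical test-method ordering, so a late-sorting prefix is enough to push mutating cases to the end. A hypothetical illustration (class and method names are made up):

```python
import unittest

class TestWeightCheckerE2E(unittest.TestCase):
    def test_a_snapshot_then_compare(self):  # read-only; runs first
        ...

    def test_z_reset_tensors(self):  # mutates weights; sorts last
        ...
```
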
Two new cases on top of the existing snapshot/compare/reset coverage:
  - update_weights_from_tensor with a divergent tensor must make compare
    fail and surface the param name in the error message
  - update_weights_from_tensor with byte-identical bytes (prime, snapshot,
    push the same bytes again) must keep compare passing
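
Roughly what the two cases assert, heavily hedged: `engine`, `snapshot`, and `check_weights` below are stand-ins for the real test fixtures, and only `update_weights_from_tensor` is an actual sglang entry point.

```python
import torch

name = "model.layers.0.mlp.up_proj.weight"

# Case 1: a divergent tensor must fail compare and surface the param name.
engine.update_weights_from_tensor(named_tensors=[(name, torch.randn_like(snapshot[name]))])
ok, message = check_weights(engine, action="compare")  # hypothetical helper
assert not ok and name in message

# Case 2: byte-identical bytes must keep compare passing.
engine.update_weights_from_tensor(named_tensors=[(name, snapshot[name].clone())])
ok, _ = check_weights(engine, action="compare")
assert ok
```
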
Qwen3-0.6B is smaller than the previous Llama-3.2-1B-Instruct default,
shortening server launch and prefill steps. The fused name
gate_up_proj.weight is sglang's actual on-disk parameter name (no HF
remapping in the path), so the test exercises the parameter unambiguously.

Matches sglang's inference-time nn.Parameter convention so that
_reset_tensors can do in-place copy_ without autograd rejecting it.
Fixes: RuntimeError 'a leaf Variable that requires grad is being used
in an in-place operation' from the reset / compare-after-reset cases.
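
The failure and the fix reproduce in plain PyTorch, independent of sglang:

```python
import torch
from torch import nn

p_grad = nn.Parameter(torch.randn(4))  # requires_grad=True by default
try:
    p_grad.copy_(torch.zeros(4))       # in-place write into a grad-requiring leaf
except RuntimeError as e:
    print(e)  # a leaf Variable that requires grad is being used in an in-place operation

p_infer = nn.Parameter(torch.randn(4), requires_grad=False)  # inference-time convention
p_infer.copy_(torch.zeros(4))          # fine: _reset_tensors can restore in place
```
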
_snapshot's '.detach().cpu()' is a no-op on a CPU tensor, so a CPU-only
fixture leaves the snapshot aliasing live storage and masks reset-then-
compare divergence. Putting the fixture on CUDA mirrors production
(model is always on the device) and forces _snapshot to produce a real
independent CPU copy.
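
The aliasing is easy to see in plain PyTorch: .cpu() on a tensor that already lives on the CPU returns the same storage, so the "snapshot" silently tracks later mutations, while a CUDA source forces a genuine copy.

```python
import torch

w = torch.randn(4)              # CPU weight
snap = w.detach().cpu()         # no-op move: shares storage with w
w.add_(1.0)
assert torch.equal(snap, w)     # the "snapshot" drifted along with the weight

if torch.cuda.is_available():
    w_gpu = torch.randn(4, device="cuda")
    snap2 = w_gpu.detach().cpu()   # real device transfer: independent copy
    w_gpu.add_(1.0)
    assert not torch.equal(snap2, w_gpu.cpu())
```
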
Sending the fused gate_up_proj.weight name directly trips a name.replace
collision in sglang's stacked_params_mapping (gate_up_proj contains the
substring up_proj), producing the bogus key gate_gate_up_proj.weight and
crashing the model loader. Use the HF unfused alias up_proj.weight with
shape (intermediate_size, hidden_size); sglang rewrites it onto the
fused tensor with shard_id=1, writing only the up half — sufficient to
make compare detect a divergence.
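
The collision itself is two lines of plain Python, since stacked_params_mapping entries are applied with str.replace and "gate_up_proj" contains "up_proj" as a substring:

```python
fused = "model.layers.0.mlp.gate_up_proj.weight"
print(fused.replace("up_proj", "gate_up_proj"))
# -> model.layers.0.mlp.gate_gate_up_proj.weight  (bogus key; loader crashes)

unfused = "model.layers.0.mlp.up_proj.weight"      # the HF alias sidesteps the trap
print(unfused.replace("up_proj", "gate_up_proj"))
# -> model.layers.0.mlp.gate_up_proj.weight       (lands on the fused tensor, shard_id=1)
```
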
Hoists the per-callsite skip-pattern lists in _reset_tensors and
_postprocess_tensors to a single module-level _NON_PERSISTENT_BUFFER_PATTERNS
tuple, accessed through _is_non_persistent_buffer_name. The unified set is
the union of the two prior lists: cos_sin_cache, inv_freq, freqs_cis,
_weight_fp32 — both callsites skip the same buffers now (previously
_reset_tensors was missing inv_freq and _postprocess_tensors was missing
freqs_cis).
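
A sketch of the hoisted module-level state, using the pattern names given above; substring matching is an assumption about how the helper tests names:

```python
_NON_PERSISTENT_BUFFER_PATTERNS = (
    "cos_sin_cache",
    "inv_freq",
    "freqs_cis",
    "_weight_fp32",
)

def _is_non_persistent_buffer_name(name: str) -> bool:
    # Both _reset_tensors and _postprocess_tensors now consult the same set.
    return any(pattern in name for pattern in _NON_PERSISTENT_BUFFER_PATTERNS)
```
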
Adds an action='checksum' route to WeightChecker.handle that returns a
dict produced by pydantic ChecksumInfo, containing per-tensor hashes
(hex of tensor_hash from mm_utils, GPU-accelerated via the existing
gpu_tensor_hash triton kernel) plus this rank's ParallelismInfo
(tp/dp/pp coordinates + global rank/size from torch.distributed).

The computation reuses _postprocess_tensors so fp8 weights are
dequantized to bf16 before hashing — two (qweight, scale) pairs that
dequant to the same bf16 produce the same checksum, matching the
semantics of the existing snapshot/compare path. Surrounded by
torch.cuda.synchronize() and timed via logger.info so callers can
observe per-rank duration.

handle() now returns Optional[Dict] — None for snapshot/reset/compare
and the dict payload for checksum.
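
A hedged sketch of the payload schema; field names are inferred from this description, and the real definitions (plus the tensor_hash/gpu_tensor_hash plumbing) live in sglang:

```python
from typing import Dict
from pydantic import BaseModel

class ParallelismInfo(BaseModel):
    tp_rank: int
    dp_rank: int
    pp_rank: int
    global_rank: int
    global_size: int

class ChecksumInfo(BaseModel):
    # param name -> hex digest of tensor_hash(dequantized bf16 tensor)
    tensor_hashes: Dict[str, str]
    parallelism: ParallelismInfo
```
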
Plumbs the optional dict returned by WeightChecker.handle from
model_runner up to the /weights_checker HTTP body:

- model_runner.check_weights now returns the underlying handle() value.
- CheckWeightsReqOutput gains an optional payload: Dict carrying one
  rank's ChecksumInfo dict.
- Scheduler's check_weights captures payload into the output.
- TokenizerManager.check_weights now returns
  (success, message, ranks) where ranks is the per-rank list collected
  naively in fan-out order (None when no rank produced a payload).
  FanOutCommunicator.merge_results is left untouched so the 11+ existing
  2-tuple callers keep working.
- /weights_checker HTTP body adds a top-level 'ranks' key when present,
  preserving the prior 'success'/'message' shape for the snapshot, reset,
  and compare actions.
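
Illustratively, the new HTTP surface looks like this; the endpoint path comes from the description above, while the request body shape ({"action": ...}) is an assumption:

```python
import requests

resp = requests.post("http://localhost:30000/weights_checker", json={"action": "checksum"})
body = resp.json()
# snapshot/reset/compare keep the prior shape: {"success": ..., "message": ...}
# action=checksum adds a top-level per-rank list:
#   {"success": ..., "message": ..., "ranks": [<ChecksumInfo dict>, ...]}
print(body["success"], len(body.get("ranks", [])))
```
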
Adds unit coverage for ChecksumInfo / ParallelismInfo /
_is_non_persistent_buffer_name / _hash_tensor / _compute_checksum:
hash stability and hex format, parallelism info reflection,
post-mutation hash drift, and round-trip through the strict pydantic
schema. Extends TestHandle to cover the new 'checksum' route.

The e2e test gains four cases on the shared engine: response shape,
two-call stability, hash drift after update_weights_from_tensor, and
absence of non-persistent buffer names in the checksum keys.
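
The two-call stability case reduces to something like the following, with get_checksums standing in for whatever helper the test uses to call the endpoint:

```python
first = get_checksums(engine)    # {param_name: hex_hash, ...} per rank (hypothetical helper)
second = get_checksums(engine)
assert first == second           # hashing is deterministic across calls
assert not any(_is_non_persistent_buffer_name(k) for k in first)  # buffers excluded
```
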
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@fzyzcjy
Collaborator Author

fzyzcjy commented May 6, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label May 6, 2026
@fzyzcjy fzyzcjy merged commit c4c5541 into sgl-project:main May 6, 2026
59 of 68 checks passed
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 7, 2026
* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/model_executor/model_runner.py
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026