fix: accept 0-indexed safetensors shard names in CI weight validator by alisonshao · Pull Request #24237 · sgl-project/sglang

alisonshao · 2026-05-01T18:29:19Z

Summary

_validate_sharded_model in python/sglang/srt/model_loader/ci_weight_validation.py hardcodes the expected shard-id set as set(range(1, total_shards + 1)). This works for 1-indexed releases (e.g. deepseek-ai/DeepSeek-V3) but produces a false positive for 0-indexed releases.

Concrete example seen on inclusionAI/Ring-2.5-1T: the cache holds model-00000-of-00160.safetensors through model-00159-of-00160.safetensors (all 160 shards present), but the validator returns:

Validation failed for inclusionAI/Ring-2.5-1T: Missing shards in model-of-00160.safetensors: [160]. Will attempt to download missing files.

Shard 160 doesn't exist — the highest index is 159. The follow-up snapshot_download no-ops against the complete cache, so this is mostly noise, but it can mislead any debugging session and would trigger an actual (and failing) re-download attempt against any cold-cache 0-indexed model.

Fix: derive the expected shard-id range from min(found_shards), accepting either {0..N-1} or {1..N} as valid. Both are HF naming conventions in the wild.

Verification (paper)

Case	Found	`min_idx`	Expected	Missing
1-indexed all-present	`{1,2,3}`	1	`{1,2,3}`	∅ ✅
0-indexed all-present (Ring case)	`{0,1,2}`	0	`{0,1,2}`	∅ ✅ (was: `{3}` false positive)
1-indexed, shard 3 missing	`{1,2}`	1	`{1,2,3}`	`{3}` ✅
0-indexed, shard 2 missing	`{0,1}`	0	`{0,1,2}`	`{2}` ✅

Test plan

Real-model verification: load inclusionAI/Ring-2.5-1T from a complete cache — log no longer prints Missing shards in model-of-00160.safetensors: [160].
No CI-side regression for 1-indexed models (DeepSeek-V3, Qwen, etc.) since the range start now follows the data instead of hardcoding 1.

`_validate_sharded_model` hardcoded `set(range(1, total_shards + 1))`, which only matches 1-indexed models like DeepSeek-V3. Models that publish 0-indexed shards (e.g. inclusionAI/Ring-2.5-1T: `model-00000-of-00160.safetensors` … `model-00159-of-00160.safetensors`) all 160 shards present produced a false-positive "Missing shards: [160]" because shard 160 doesn't exist — the highest is 159. Fix: derive the expected shard-id range from the minimum found shard id, so both `{0..N-1}` and `{1..N}` are valid. Adds a regression test mirroring the existing all-present case but with 0-indexed file names.

gemini-code-assist

Code Review

This pull request updates the model weight validation logic to support both 0-indexed and 1-indexed shard naming conventions, which are both valid Hugging Face conventions. It also includes a new test case to verify validation for 0-indexed shards. Feedback suggests that the current logic for determining the starting index is fragile if shards are missing and recommends a more robust heuristic, while also noting that similar logic in other parts of the codebase should be updated for consistency.

gemini-code-assist · 2026-05-01T18:31:08Z

+        min_idx = min(found_shards) if found_shards else 1
+        expected_shards = set(range(min_idx, min_idx + total_shards))


The logic using min(found_shards) to determine the starting index is fragile. If the first few shards are missing from the cache, min(found_shards) will shift the expected range, leading to confusing error messages (e.g., reporting shards beyond the actual total count as missing). Since Hugging Face models follow either 0-indexed or 1-indexed conventions, a more robust heuristic is to check if shard 0 is present. If it is, assume 0-indexing; otherwise, default to 1-indexing (which is the standard convention). Additionally, please note that validate_cache_lightweight (around line 1213) contains the same hardcoded 1-indexed logic and should be updated for consistency, although it is outside the current diff hunks.

Suggested change

min_idx = min(found_shards) if found_shards else 1

expected_shards = set(range(min_idx, min_idx + total_shards))

min_idx = 0 if 0 in found_shards else 1

expected_shards = set(range(min_idx, min_idx + total_shards))

…gl-project#24237)

gemini-code-assist Bot reviewed May 1, 2026

View reviewed changes

drop unit-test addition: test/manual/ is not run in CI

5d8263a

alisonshao mentioned this pull request May 2, 2026

ci: combine H200 8-GPU warmup steps and surface server log on every path #24253

Merged

4 tasks

Kangyan-Zhou approved these changes May 2, 2026

View reviewed changes

Kangyan-Zhou merged commit f3dbadb into main May 2, 2026
57 of 65 checks passed

Kangyan-Zhou deleted the alison/fix-sharded-model-validator-off-by-one branch May 2, 2026 07:58

vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026

fix: accept 0-indexed safetensors shard names in CI weight validator (s…

8f59e6e

…gl-project#24237)

LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026

fix: accept 0-indexed safetensors shard names in CI weight validator (s…

3d184d4

…gl-project#24237)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: accept 0-indexed safetensors shard names in CI weight validator#24237

fix: accept 0-indexed safetensors shard names in CI weight validator#24237
Kangyan-Zhou merged 2 commits into
mainfrom
alison/fix-sharded-model-validator-off-by-one

alisonshao commented May 1, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		min_idx = min(found_shards) if found_shards else 1
		expected_shards = set(range(min_idx, min_idx + total_shards))

Conversation

alisonshao commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Verification (paper)

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

alisonshao commented May 1, 2026 •

edited

Loading