Skip to content

fix: accept 0-indexed safetensors shard names in CI weight validator#24237

Merged
Kangyan-Zhou merged 2 commits into
mainfrom
alison/fix-sharded-model-validator-off-by-one
May 2, 2026
Merged

fix: accept 0-indexed safetensors shard names in CI weight validator#24237
Kangyan-Zhou merged 2 commits into
mainfrom
alison/fix-sharded-model-validator-off-by-one

Conversation

@alisonshao
Copy link
Copy Markdown
Collaborator

@alisonshao alisonshao commented May 1, 2026

Summary

_validate_sharded_model in python/sglang/srt/model_loader/ci_weight_validation.py hardcodes the expected shard-id set as set(range(1, total_shards + 1)). This works for 1-indexed releases (e.g. deepseek-ai/DeepSeek-V3) but produces a false positive for 0-indexed releases.

Concrete example seen on inclusionAI/Ring-2.5-1T: the cache holds model-00000-of-00160.safetensors through model-00159-of-00160.safetensors (all 160 shards present), but the validator returns:

Validation failed for inclusionAI/Ring-2.5-1T: Missing shards in model-of-00160.safetensors: [160]. Will attempt to download missing files.

Shard 160 doesn't exist — the highest index is 159. The follow-up snapshot_download no-ops against the complete cache, so this is mostly noise, but it can mislead any debugging session and would trigger an actual (and failing) re-download attempt against any cold-cache 0-indexed model.

Fix: derive the expected shard-id range from min(found_shards), accepting either {0..N-1} or {1..N} as valid. Both are HF naming conventions in the wild.

Verification (paper)

Case Found min_idx Expected Missing
1-indexed all-present {1,2,3} 1 {1,2,3} ∅ ✅
0-indexed all-present (Ring case) {0,1,2} 0 {0,1,2} ∅ ✅ (was: {3} false positive)
1-indexed, shard 3 missing {1,2} 1 {1,2,3} {3}
0-indexed, shard 2 missing {0,1} 0 {0,1,2} {2}

Test plan

  • Real-model verification: load inclusionAI/Ring-2.5-1T from a complete cache — log no longer prints Missing shards in model-of-00160.safetensors: [160].
  • No CI-side regression for 1-indexed models (DeepSeek-V3, Qwen, etc.) since the range start now follows the data instead of hardcoding 1.

`_validate_sharded_model` hardcoded `set(range(1, total_shards + 1))`,
which only matches 1-indexed models like DeepSeek-V3. Models that
publish 0-indexed shards (e.g. inclusionAI/Ring-2.5-1T:
`model-00000-of-00160.safetensors` … `model-00159-of-00160.safetensors`)
all 160 shards present produced a false-positive
"Missing shards: [160]" because shard 160 doesn't exist —
the highest is 159.

Fix: derive the expected shard-id range from the minimum found shard
id, so both `{0..N-1}` and `{1..N}` are valid. Adds a regression test
mirroring the existing all-present case but with 0-indexed file names.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the model weight validation logic to support both 0-indexed and 1-indexed shard naming conventions, which are both valid Hugging Face conventions. It also includes a new test case to verify validation for 0-indexed shards. Feedback suggests that the current logic for determining the starting index is fragile if shards are missing and recommends a more robust heuristic, while also noting that similar logic in other parts of the codebase should be updated for consistency.

Comment on lines +1431 to +1432
min_idx = min(found_shards) if found_shards else 1
expected_shards = set(range(min_idx, min_idx + total_shards))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic using min(found_shards) to determine the starting index is fragile. If the first few shards are missing from the cache, min(found_shards) will shift the expected range, leading to confusing error messages (e.g., reporting shards beyond the actual total count as missing). Since Hugging Face models follow either 0-indexed or 1-indexed conventions, a more robust heuristic is to check if shard 0 is present. If it is, assume 0-indexing; otherwise, default to 1-indexing (which is the standard convention). Additionally, please note that validate_cache_lightweight (around line 1213) contains the same hardcoded 1-indexed logic and should be updated for consistency, although it is outside the current diff hunks.

Suggested change
min_idx = min(found_shards) if found_shards else 1
expected_shards = set(range(min_idx, min_idx + total_shards))
min_idx = 0 if 0 in found_shards else 1
expected_shards = set(range(min_idx, min_idx + total_shards))

@Kangyan-Zhou Kangyan-Zhou merged commit f3dbadb into main May 2, 2026
57 of 65 checks passed
@Kangyan-Zhou Kangyan-Zhou deleted the alison/fix-sharded-model-validator-off-by-one branch May 2, 2026 07:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants