fix: accept 0-indexed safetensors shard names in CI weight validator #24237
`_validate_sharded_model` hardcoded `set(range(1, total_shards + 1))`,
which only matches 1-indexed models like DeepSeek-V3. For models that
publish 0-indexed shards (e.g. inclusionAI/Ring-2.5-1T:
`model-00000-of-00160.safetensors` … `model-00159-of-00160.safetensors`),
a cache with all 160 shards present produced a false-positive
"Missing shards: [160]": shard 160 doesn't exist, and
the highest is 159.
Fix: derive the expected shard-id range from the minimum found shard
id, so both `{0..N-1}` and `{1..N}` are valid. Adds a regression test
mirroring the existing all-present case but with 0-indexed file names.
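
A minimal sketch of the min-based derivation described above, for illustration only. The helper name `missing_shards` and the regex `SHARD_RE` are assumptions made for this example and are not taken from `ci_weight_validation.py`; only the two lines computing `min_idx` and the expected range follow the PR.

```python
import re

# Hypothetical pattern for names like "model-00000-of-00160.safetensors".
SHARD_RE = re.compile(r"model-(\d+)-of-(\d+)\.safetensors$")


def missing_shards(file_names):
    """Return the sorted shard ids missing from file_names (illustrative helper)."""
    found, total = set(), None
    for name in file_names:
        m = SHARD_RE.search(name)
        if m:
            found.add(int(m.group(1)))
            total = int(m.group(2))
    if total is None:
        return []  # no sharded files recognized
    # Start the expected range at the lowest shard id actually found,
    # so both {0..N-1} and {1..N} are accepted.
    min_idx = min(found) if found else 1
    expected = set(range(min_idx, min_idx + total))
    return sorted(expected - found)


zero_indexed = [f"model-{i:05d}-of-00160.safetensors" for i in range(160)]
one_indexed = [f"model-{i:05d}-of-00160.safetensors" for i in range(1, 161)]
print(missing_shards(zero_indexed))  # []
print(missing_shards(one_indexed))   # []
```

With the previous hardcoded `range(1, total_shards + 1)`, the 0-indexed list would have reported `[160]`.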
Code Review
This pull request updates the model weight validation logic to support both 0-indexed and 1-indexed shard naming conventions, which are both valid Hugging Face conventions. It also includes a new test case to verify validation for 0-indexed shards. Feedback suggests that the current logic for determining the starting index is fragile if shards are missing and recommends a more robust heuristic, while also noting that similar logic in other parts of the codebase should be updated for consistency.
`min_idx = min(found_shards) if found_shards else 1`
`expected_shards = set(range(min_idx, min_idx + total_shards))`
The logic using min(found_shards) to determine the starting index is fragile. If the first few shards are missing from the cache, min(found_shards) will shift the expected range, leading to confusing error messages (e.g., reporting shards beyond the actual total count as missing). Since Hugging Face models follow either 0-indexed or 1-indexed conventions, a more robust heuristic is to check if shard 0 is present. If it is, assume 0-indexing; otherwise, default to 1-indexing (which is the standard convention). Additionally, please note that validate_cache_lightweight (around line 1213) contains the same hardcoded 1-indexed logic and should be updated for consistency, although it is outside the current diff hunks.
Suggested change, replacing:
`min_idx = min(found_shards) if found_shards else 1`
`expected_shards = set(range(min_idx, min_idx + total_shards))`
with:
`min_idx = 0 if 0 in found_shards else 1`
`expected_shards = set(range(min_idx, min_idx + total_shards))`
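
To make the reviewer's concern concrete, the two heuristics can be compared side by side on incomplete caches. This is a sketch under stated assumptions: the function names `missing_min_based` and `missing_zero_check` are hypothetical, and the shard sets are invented for illustration.

```python
def missing_min_based(found, total):
    # PR version: the expected range starts at the lowest shard id found.
    start = min(found) if found else 1
    return sorted(set(range(start, start + total)) - set(found))


def missing_zero_check(found, total):
    # Reviewer's suggestion: 0-indexed only if shard 0 is present.
    start = 0 if 0 in found else 1
    return sorted(set(range(start, start + total)) - set(found))


# 1-indexed model with 5 shards; shards 1 and 2 are absent from the cache.
print(missing_min_based({3, 4, 5}, 5))   # [6, 7]
print(missing_zero_check({3, 4, 5}, 5))  # [1, 2]

# 0-indexed model with 5 shards; shard 0 itself is absent.
print(missing_min_based({1, 2, 3, 4}, 5))   # [5]
print(missing_zero_check({1, 2, 3, 4}, 5))  # [5]
```

In the first case the min-based version reports ids beyond the real total, as the review describes, while the zero-check reports the shards that are really missing. In the second case both heuristics name shard 5 instead of shard 0, so neither can recover the convention when shard 0 is the one that is missing.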
Summary
`_validate_sharded_model` in `python/sglang/srt/model_loader/ci_weight_validation.py` hardcodes the expected shard-id set as `set(range(1, total_shards + 1))`. This works for 1-indexed releases (e.g. `deepseek-ai/DeepSeek-V3`) but produces a false positive for 0-indexed releases.

Concrete example seen on `inclusionAI/Ring-2.5-1T`: the cache holds `model-00000-of-00160.safetensors` through `model-00159-of-00160.safetensors` (all 160 shards present), but the validator reports shard `[160]` as missing. Shard 160 doesn't exist; the highest index is 159. The follow-up `snapshot_download` no-ops against the complete cache, so this is mostly noise, but it can mislead any debugging session and would trigger an actual (and failing) re-download attempt against any cold-cache 0-indexed model.

Fix: derive the expected shard-id range from `min(found_shards)`, accepting either `{0..N-1}` or `{1..N}` as valid. Both are HF naming conventions in the wild.

Verification (paper)

| Found shards | `min_idx` | Expected | Missing |
| --- | --- | --- | --- |
| `{1,2,3}` | 1 | `{1,2,3}` | none ✅ |
| `{0,1,2}` | 0 | `{0,1,2}` | none ✅ (was `{3}`, false positive) |
| `{1,2}` | 1 | `{1,2,3}` | `{3}` ✅ |
| `{0,1}` | 0 | `{0,1,2}` | `{2}` ✅ |

Test plan

- `inclusionAI/Ring-2.5-1T` from a complete cache: log no longer prints `Missing shards in model-of-00160.safetensors: [160]`.
- `range` start now follows the data instead of hardcoding `1`.
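
The paper verification cases above can also be replayed as assertions. This is a sketch and not the regression test added by the PR: `missing_old` and `missing_new` are hypothetical stand-ins for the range logic before and after the change, and a total of 3 shards is assumed because every expected set in the verification has three ids.

```python
def missing_old(found, total):
    # Previous behavior: always expect the 1-indexed set {1..N}.
    return sorted(set(range(1, total + 1)) - found)


def missing_new(found, total):
    # New behavior: the range starts at the lowest shard id found.
    min_idx = min(found) if found else 1
    return sorted(set(range(min_idx, min_idx + total)) - found)


TOTAL = 3  # assumed from the size of the expected sets
cases = [
    ({1, 2, 3}, []),   # 1-indexed, complete
    ({0, 1, 2}, []),   # 0-indexed, complete
    ({1, 2}, [3]),     # 1-indexed, last shard missing
    ({0, 1}, [2]),     # 0-indexed, last shard missing
]
for found, want in cases:
    assert missing_new(found, TOTAL) == want

# The old logic flags a complete 0-indexed cache: the false positive.
assert missing_old({0, 1, 2}, TOTAL) == [3]
print("all verification cases pass")
```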