
Fix validation to detect missing model files before loading #14253

Merged
Kangyan-Zhou merged 4 commits into main from fix/detect-missing-model-files-in-validation
Dec 3, 2025

@alisonshao
Collaborator

Problem

The current validation logic only checks files that are found by glob pattern matching. If a model's snapshot directory exists with an index file but actual weight files are missing (due to incomplete downloads or cache corruption), the validation passes and claims "Found local HF snapshot", then crashes with FileNotFoundError when trying to load the missing files.

Example from CI:

[TP0] Found local HF snapshot for openai/gpt-oss-120b at 
/hf_home/hub/models--openai--gpt-oss-120b/snapshots/...

FileNotFoundError: No such file or directory: 
.../model-00000-of-00014.safetensors

The issue is that glob only finds files that exist on disk. If files are missing entirely, they're never validated, so the system doesn't know they should exist.
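The blind spot can be reproduced in a few lines. The snippet below is a minimal sketch, not code from this PR: the helper names `build_partial_snapshot` and `glob_vs_index`, the two-shard layout, and the tensor names are all hypothetical. It compares what a glob over the snapshot directory sees with what the index says is required.

```python
import glob
import json
import os
import tempfile

INDEX_NAME = "model.safetensors.index.json"


def build_partial_snapshot(snapshot_dir):
    """Write a two-shard index but only one shard file (hypothetical layout)."""
    index = {
        "weight_map": {
            "layer.0.weight": "model-00000-of-00002.safetensors",
            "layer.1.weight": "model-00001-of-00002.safetensors",
        }
    }
    with open(os.path.join(snapshot_dir, INDEX_NAME), "w") as f:
        json.dump(index, f)
    # Only the second shard is written; the first is "missing".
    open(os.path.join(snapshot_dir, "model-00001-of-00002.safetensors"), "wb").close()


def glob_vs_index(snapshot_dir):
    """Return (shards glob can see, shards the index requires)."""
    found = sorted(
        os.path.basename(p)
        for p in glob.glob(os.path.join(snapshot_dir, "*.safetensors"))
    )
    with open(os.path.join(snapshot_dir, INDEX_NAME)) as f:
        required = sorted(set(json.load(f)["weight_map"].values()))
    return found, required


with tempfile.TemporaryDirectory() as d:
    build_partial_snapshot(d)
    found, required = glob_vs_index(d)
    print("glob sees:     ", found)
    print("index requires:", required)
    print("never checked: ", sorted(set(required) - set(found)))
```

Any check that iterates only over `found` has nothing to say about the shard that is absent, which is why the index has to be the source of truth for what should exist.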

Solution

Added a _check_index_files_exist() function that:

  1. Reads the safetensors index file (model.safetensors.index.json)
  2. Extracts the complete list of required files from the weight_map
  3. Verifies that ALL files in the weight_map actually exist on disk
  4. Returns validation failure with specific missing filenames if any are absent

This function is integrated into _validate_sharded_model() and runs before other validation checks. When files are missing, validation fails and triggers a re-download instead of crashing during load.
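The four steps above can be sketched as follows. This is an illustrative approximation, not the merged implementation: the public name `check_index_files_exist`, the `(ok, error_message)` return shape, and the handling of an unreadable index are assumptions. Only the index filename, the `weight_map` key, and the wording of the "Missing N file(s)" message are taken from this PR.

```python
import json
import os
import tempfile
from typing import Optional, Tuple

INDEX_NAME = "model.safetensors.index.json"


def check_index_files_exist(snapshot_dir: str) -> Tuple[bool, Optional[str]]:
    """Return (ok, error_message) for the shard files listed in the index."""
    index_path = os.path.join(snapshot_dir, INDEX_NAME)
    if not os.path.isfile(index_path):
        # No index file: treat as a non-sharded model and defer to other checks.
        return True, None

    # Step 1: read the index (assumption: an unreadable index fails validation).
    try:
        with open(index_path) as f:
            index = json.load(f)
    except (OSError, ValueError) as exc:
        return False, f"Cannot read index {INDEX_NAME}: {exc}"

    # Step 2: collect every distinct shard filename named by the weight_map.
    weight_map = index.get("weight_map") if isinstance(index, dict) else None
    if not isinstance(weight_map, dict):
        return False, f"Index {INDEX_NAME} has no usable weight_map"
    required = sorted(set(weight_map.values()))

    # Step 3: verify each required shard is present on disk.
    missing = [
        name for name in required
        if not os.path.isfile(os.path.join(snapshot_dir, name))
    ]

    # Step 4: report the specific missing filenames.
    if missing:
        return False, f"Missing {len(missing)} file(s) from index {INDEX_NAME}: {missing}"
    return True, None


# Usage: a two-shard index with only the second shard on disk.
with tempfile.TemporaryDirectory() as d:
    shards = ["model-00000-of-00002.safetensors", "model-00001-of-00002.safetensors"]
    with open(os.path.join(d, INDEX_NAME), "w") as f:
        json.dump({"weight_map": {"a.weight": shards[0], "b.weight": shards[1]}}, f)
    open(os.path.join(d, shards[1]), "wb").close()
    print(check_index_files_exist(d))
```

`os.path.isfile` follows symlinks and returns `False` for a dangling one, so a snapshot entry whose target is gone is reported as missing as well.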

Testing

  • Tested with a simulated CI scenario (14-shard model with 1 missing file)
  • Validation correctly detects missing files and returns a clear error message
  • Non-sharded models (no index file) are unaffected
  • All files present: validation passes as expected

Related

Extends the validation work from #13729 and #13870, which added corruption detection but didn't check for missing files.

alisonshao requested a review from hebiao064 as a code owner on December 1, 2025 23:28
alisonshao force-pushed the fix/detect-missing-model-files-in-validation branch from c94974a to 56c6192 on December 1, 2025 23:32
@alisonshao
Collaborator Author

/tag-and-rerun-ci

github-actions bot added the run-ci label on Dec 2, 2025
@alisonshao
Collaborator Author

Local Testing Verification

Tested this implementation locally by simulating the exact CI failure scenario. All tests passed.

Test Setup

Created a test script (test_validation_manual.py) that:

  1. Creates a fake 14-shard model snapshot directory with model.safetensors.index.json
  2. Deliberately omits certain shard files to simulate incomplete downloads
  3. Runs the validation functions to verify detection
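test_validation_manual.py itself is not included in this conversation, so the following is only a sketch of how such a fixture might be built. The helper names `make_fake_snapshot` and `list_missing_shards`, the tensor names, and the one-byte placeholder contents are hypothetical. The zero-based shard numbering is an assumption based on the `model-00000-of-00014.safetensors` name in the CI log.

```python
import json
import os
import tempfile

INDEX_NAME = "model.safetensors.index.json"


def make_fake_snapshot(root, num_shards=14, omit=()):
    """Create an index naming num_shards shards, skipping those listed in omit."""
    names = [
        f"model-{i:05d}-of-{num_shards:05d}.safetensors" for i in range(num_shards)
    ]
    # One made-up tensor per shard is enough to populate the weight_map.
    weight_map = {f"layers.{i}.weight": name for i, name in enumerate(names)}
    with open(os.path.join(root, INDEX_NAME), "w") as f:
        json.dump({"metadata": {}, "weight_map": weight_map}, f)
    omitted = set(omit)
    for i, name in enumerate(names):
        if i not in omitted:
            with open(os.path.join(root, name), "wb") as shard:
                shard.write(b"\0")  # placeholder bytes, not a valid safetensors file
    return names


def list_missing_shards(root):
    """Stand-in for the validation under test: shards in the index but not on disk."""
    with open(os.path.join(root, INDEX_NAME)) as f:
        required = sorted(set(json.load(f)["weight_map"].values()))
    return [n for n in required if not os.path.isfile(os.path.join(root, n))]


# Reproduce the "multiple missing files" case: shards 0, 5 and 10 omitted.
with tempfile.TemporaryDirectory() as root:
    make_fake_snapshot(root, omit=(0, 5, 10))
    for name in list_missing_shards(root):
        print("missing:", name)
```

Because the placeholder shards are not loadable safetensors files, a fixture like this only exercises the existence check, not any corruption check that parses file contents.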

Test Cases & Results

| Test | Scenario | Result |
|------|----------|--------|
| Missing files detection | 14-shard model with shard 0 missing (exact CI failure) | ✅ PASS |
| All files present | 14-shard model with all files | ✅ PASS |
| Multiple missing files | 14-shard model with shards 0, 5, 10 missing | ✅ PASS |
| Non-sharded model | Single file model (no index) | ✅ PASS |

Example Output

Simulating the CI failure (missing model-00000-of-00014.safetensors):

Running _check_index_files_exist()...

[PASS] Validation correctly detected missing files!
  Error message: Missing 1 file(s) from index model.safetensors.index.json: ['model-00000-of-00014.safetensors']

CI Failure Reference

The original CI failure from this run:

[TP0] Found local HF snapshot for openai/gpt-oss-120b at /hf_home/hub/...
FileNotFoundError: No such file or directory: .../model-00000-of-00014.safetensors

The _check_index_files_exist() function now catches this before loading starts, preventing the crash and triggering a re-download instead.

Kangyan-Zhou merged commit 80518be into main on Dec 3, 2025
138 of 143 checks passed
Kangyan-Zhou deleted the fix/detect-missing-model-files-in-validation branch on December 3, 2025 19:36
tom-jerr pushed a commit to tom-jerr/sglang that referenced this pull request Dec 4, 2025
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 12, 2025