Fix for Windows file locking (os error 1224)#589
Conversation
Implement robust Windows file locking handling in model export
The export process was failing with os error 1224 ("file with a user-mapped section open") when trying to save merged LoRA weights on Windows. This occurred because safetensors was memory-mapping files that couldn't be replaced due to locks held by the kernel or other processes.
Changes:
- Add retry logic with exponential backoff (10 attempts, up to ~1.6s wait)
- Force garbage collection and CUDA cache cleanup before each write attempt
- Attempt to delete the original locked file before writing
- Write directly to target location instead of temp file (simpler atomic ops)
- Improved error messages with clear remediation steps
This allows GGUF exports to succeed on Windows systems with aggressive file locking (antivirus, indexing, etc.). Users can now export Gemma4 models directly in Unsloth Studio without workarounds.
Tested with: Gemma4 model, Q4_K_M quantization, Windows 11 Pro
There was a problem hiding this comment.
Code Review
This pull request introduces a retry mechanism with exponential backoff to resolve Windows file locking issues (OS error 1224) when saving tensors. The implementation includes aggressive cleanup steps such as manual garbage collection and CUDA cache clearing to help release file handles. Review feedback correctly identifies redundant local imports and inconsistent module aliasing that should be addressed to maintain code quality and consistency with the rest of the project.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f205ac82a5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 41997c33b3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Applied the safe atomic write fix to the new version. The key differences: - Write to temp file first (original stays intact) - Delete original only after successful write (no data loss) - Atomic move replaces with temp file - Cleanup on failure
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 06249dd53d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…#595) * Fix Windows file lock on resized shard rewrite with atomic os.replace The vocab-resize rewrite branch in _merge_and_overwrite_lora re-opened filename_original via safe_open before calling save_file on the same path. On Windows the Rust safetensors MapViewOfFile release can lag after the context manager exits, so the follow-up write hit WinError 1224 ("cannot be performed on a file with a user-mapped section open") during Unsloth Studio GGUF exports. Write to a sibling temp file and swap it in with os.replace, which maps to MoveFileExW(MOVEFILE_REPLACE_EXISTING) on Windows and rename(2) on POSIX. The original shard is never absent if the replacement fails, so a transient AV or indexer lock cannot cause data loss. A single gc.collect before the first attempt drops the lingering mmap, which is the load-bearing step for the 1224 case. Short retry loop with errno/winerror-based lock detection handles AV rescan jitter. Supersedes #589 with an atomic-replace implementation that closes a delete-then-move data-loss window present in that PR. Credit to @PantelisAndrianakis for identifying and reporting the issue. * Fix review findings: gate POSIX false positives, hoist save_file, preserve permissions - Remove errno-based lock detection that false-positives on POSIX (EIO=5, EACCES=13, EPIPE=32); keep winerror + Windows-specific string checks only - Move save_file() before retry loop so multi-GB shards are not re-serialized on each os.replace retry - Preserve original file permissions via os.chmod before os.replace (mkstemp creates 0o600, but originals are typically 0o644) - Gate gc.collect() + torch.cuda.empty_cache() behind os.name == 'nt' since the mmap lag is Windows-only - Use module-level safetensors.SafetensorError instead of in-function import * Fix review findings: POSIX-only direct save, tighten lock classifier, cleanup 1. Gate temp-file swap to Windows only (os.name == 'nt'). On POSIX, save_file writes directly to the original path, preserving symlinks, hardlinks, and avoiding the extra disk copy. 2. Tighten the winerror=5 lock classifier. Bare ACCESS_DENIED no longer triggers retry -- only when accompanied by "user-mapped", "being used by another process", or "sharing violation" in the error message. 3. Remove dead SafetensorError from except clause. os.replace only raises OSError; SafetensorError is unreachable. 4. Release mmap-backed tensor refs (del tensors + gc.collect) before the os.replace retry loop, not after. This drops the lingering mmap on the source shard before attempting the replace. 5. Preserve exception cause chain in outer except. RuntimeError now passes through directly; other exceptions are wrapped with 'from e' so __cause__ is not stripped. * Shorten inline comments
Implement robust Windows file locking handling in model export
The export process was failing with os error 1224 ("file with a user-mapped section open") when trying to save merged LoRA weights on Windows. This occurred because safetensors was memory-mapping files that couldn't be replaced due to locks held by the kernel or other processes.
Changes:
This allows GGUF exports to succeed on Windows systems with aggressive file locking (antivirus, indexing, etc.). Users can now export Gemma4 models directly in Unsloth Studio without workarounds.
Tested with: Gemma4 model, Q4_K_M quantization, Windows 11 Pro