Skip to content

Fix for Windows file locking (os error 1224)#589

Closed
PantelisAndrianakis wants to merge 5 commits into
unslothai:mainfrom
PantelisAndrianakis:main
Closed

Fix for Windows file locking (os error 1224)#589
PantelisAndrianakis wants to merge 5 commits into
unslothai:mainfrom
PantelisAndrianakis:main

Conversation

@PantelisAndrianakis

Copy link
Copy Markdown

Implement robust Windows file locking handling in model export

The export process was failing with os error 1224 ("file with a user-mapped section open") when trying to save merged LoRA weights on Windows. This occurred because safetensors was memory-mapping files that couldn't be replaced due to locks held by the kernel or other processes.

Changes:

  • Add retry logic with exponential backoff (10 attempts, up to ~1.6s wait)
  • Force garbage collection and CUDA cache cleanup before each write attempt
  • Attempt to delete the original locked file before writing
  • Write directly to target location instead of temp file (simpler atomic ops)
  • Improved error messages with clear remediation steps

This allows GGUF exports to succeed on Windows systems with aggressive file locking (antivirus, indexing, etc.). Users can now export Gemma4 models directly in Unsloth Studio without workarounds.

Tested with: Gemma4 model, Q4_K_M quantization, Windows 11 Pro

Implement robust Windows file locking handling in model export

The export process was failing with os error 1224 ("file with a user-mapped section open") when trying to save merged LoRA weights on Windows. This occurred because safetensors was memory-mapping files that couldn't be replaced due to locks held by the kernel or other processes.

Changes:
- Add retry logic with exponential backoff (10 attempts, up to ~1.6s wait)
- Force garbage collection and CUDA cache cleanup before each write attempt
- Attempt to delete the original locked file before writing
- Write directly to target location instead of temp file (simpler atomic ops)
- Improved error messages with clear remediation steps

This allows GGUF exports to succeed on Windows systems with aggressive file locking (antivirus, indexing, etc.). Users can now export Gemma4 models directly in Unsloth Studio without workarounds.

Tested with: Gemma4 model, Q4_K_M quantization, Windows 11 Pro

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a retry mechanism with exponential backoff to resolve Windows file locking issues (OS error 1224) when saving tensors. The implementation includes aggressive cleanup steps such as manual garbage collection and CUDA cache clearing to help release file handles. Review feedback correctly identifies redundant local imports and inconsistent module aliasing that should be addressed to maintain code quality and consistency with the rest of the project.

Comment thread unsloth_zoo/saving_utils.py Outdated
Comment thread unsloth_zoo/saving_utils.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f205ac82a5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread unsloth_zoo/saving_utils.py Outdated
PantelisAndrianakis and others added 2 commits April 10, 2026 15:48
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41997c33b3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread unsloth_zoo/saving_utils.py Outdated
Applied the safe atomic write fix to the new version.
The key differences:
- Write to temp file first (original stays intact)
- Delete original only after successful write (no data loss)
- Atomic move replaces with temp file
- Cleanup on failure

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06249dd53d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread unsloth_zoo/saving_utils.py
danielhanchen added a commit that referenced this pull request Apr 16, 2026
…#595)

* Fix Windows file lock on resized shard rewrite with atomic os.replace

The vocab-resize rewrite branch in _merge_and_overwrite_lora re-opened
filename_original via safe_open before calling save_file on the same
path. On Windows the Rust safetensors MapViewOfFile release can lag
after the context manager exits, so the follow-up write hit WinError
1224 ("cannot be performed on a file with a user-mapped section open")
during Unsloth Studio GGUF exports.

Write to a sibling temp file and swap it in with os.replace, which maps
to MoveFileExW(MOVEFILE_REPLACE_EXISTING) on Windows and rename(2) on
POSIX. The original shard is never absent if the replacement fails, so
a transient AV or indexer lock cannot cause data loss. A single
gc.collect before the first attempt drops the lingering mmap, which is
the load-bearing step for the 1224 case. Short retry loop with
errno/winerror-based lock detection handles AV rescan jitter.

Supersedes #589 with an atomic-replace implementation that closes a
delete-then-move data-loss window present in that PR. Credit to
@PantelisAndrianakis for identifying and reporting the issue.

* Fix review findings: gate POSIX false positives, hoist save_file, preserve permissions

- Remove errno-based lock detection that false-positives on POSIX (EIO=5,
  EACCES=13, EPIPE=32); keep winerror + Windows-specific string checks only
- Move save_file() before retry loop so multi-GB shards are not re-serialized
  on each os.replace retry
- Preserve original file permissions via os.chmod before os.replace (mkstemp
  creates 0o600, but originals are typically 0o644)
- Gate gc.collect() + torch.cuda.empty_cache() behind os.name == 'nt' since
  the mmap lag is Windows-only
- Use module-level safetensors.SafetensorError instead of in-function import

* Fix review findings: POSIX-only direct save, tighten lock classifier, cleanup

1. Gate temp-file swap to Windows only (os.name == 'nt'). On POSIX,
   save_file writes directly to the original path, preserving symlinks,
   hardlinks, and avoiding the extra disk copy.

2. Tighten the winerror=5 lock classifier. Bare ACCESS_DENIED no longer
   triggers retry -- only when accompanied by "user-mapped", "being used
   by another process", or "sharing violation" in the error message.

3. Remove dead SafetensorError from except clause. os.replace only raises
   OSError; SafetensorError is unreachable.

4. Release mmap-backed tensor refs (del tensors + gc.collect) before the
   os.replace retry loop, not after. This drops the lingering mmap on
   the source shard before attempting the replace.

5. Preserve exception cause chain in outer except. RuntimeError now
   passes through directly; other exceptions are wrapped with 'from e'
   so __cause__ is not stripped.

* Shorten inline comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant