Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use posix semantics when deleting files on Windows #11793

Closed

Conversation

SpexGuy
Copy link
Contributor

@SpexGuy SpexGuy commented Jun 4, 2022

When a file is deleted on Windows, it may not be immediately removed from the directory entry. This can cause problems with future scans of that directory, which will see the partially deleted file. However, it should be noted that under some workloads and system configurations, Windows files may appear to be deleted immediately. This seems to be why the CI is not failing under this bug.

This change forces file deletion to use posix semantics, removing it from the directory immediately on delete.

This fixes a bug that occurs on some windows systems/hard drives where std.fs.Dir.deleteTree fails due to assumed posix semantics. This may also fix rare bugs in other systems (like the cache system), which may assume posix semantics for file deletion.

@SpexGuy SpexGuy added standard library This issue involves writing Zig code for the standard library. os-windows labels Jun 4, 2022
@SpexGuy
Copy link
Contributor Author

SpexGuy commented Jun 5, 2022

Looks like the mac os test failed for an unrelated reason? I think this is good to go, unless there's some connection I'm missing.

@matu3ba
Copy link
Contributor

matu3ba commented Jun 5, 2022

I think this is the correct approach rather than work around Windows changing file semantics in Windows 10 (found it here https://news.ycombinator.com/item?id=23745019) https://stackoverflow.com/questions/60424732/did-the-behaviour-of-deleted-files-open-with-fileshare-delete-change-on-windows

The workaround with renaming sounds more cumbersome.
The other used approach in the linked repo did reportedly still have the same behavior XAMPPRocky/remove_dir_all#7 as of d2248555dd1c8aa08327a75fc024b0520f4afe22, for example in cargo-generate/cargo-generate#619.

One annoyance of all those issues is that reporters did not provide a script + program to reproduce.

@andrewrk
Copy link
Member

andrewrk commented Jun 5, 2022

A few notes:

  • Unfortunately the macOS failure does look relevant. Looks to be a separate self-hosted compiler bug that is revealed when trying to compile the new code in this patch. I'm happy to look into this as a prerequisite fix before merging this PR.
  • Can we please document clearly the intended semantics of DeleteFile in doc comments for this function, so that we know what it is supposed to do - particularly with regards to this subtle Windows behavior. This is mostly on me for not doing this documentation originally, but better late than never. Also let's make sure the semantics are clear in the cross-platform function std.fs.Dir.deleteFile.
  • I'm concerned that this code which deletes a file, a common file system operation, will now take two syscalls instead of one. I don't fully understand why this is necessary and I'm not yet convinced that we have to give up on this one. Are there circumstances where the former behavior is acceptable? Is this perhaps a case where the API can bifurcate, allowing the caller to request the semantics they need, so that an extra syscall is avoided when possible?

@squeek502
Copy link
Collaborator

squeek502 commented Jun 5, 2022

Related to / fixes (?) #6452

Does not affect #5537

Also worth noting that this requires Windows 10 >= 1607 (_WIN32_WINNT_WIN10_RS1)

@SpexGuy
Copy link
Contributor Author

SpexGuy commented Jun 5, 2022

Related to / fixes (?) #6452

Good find, that's the stack trace that is currently blocking me, which is why I made this PR. This triggers consistently on my computer when I try to run compiler tests (deleteTree is used by std.testing.tmpDir), so I cannot run tests locally until this is fixed.

I'm concerned that this code which deletes a file, a common file system operation, will now take two syscalls instead of one.

It's actually three now instead of two, deleting a file already requires NtCreateFile + CloseHandle. NtCreateFile is slow on Windows so the overhead of an extra syscall is not really in the same ball park. That said, we could have a different function which uses DOS semantics if needed.

I don't fully understand why this is necessary and I'm not yet convinced that we have to give up on this one.

std.fs.Dir.deleteTree makes the implicit assumption that deleted files do not appear when iterating over directories, so it does not work correctly on Windows systems where the deletion is not immediate (causing #6452). Whether deletion is immediate can depend on things like the current behavior of the virus scanner, the capacity of the hard drive, and the general speed of the processor, so it may not be visible on all systems (like the CI machine, apparently).

Are there circumstances where the former behavior is acceptable? Is this perhaps a case where the API can bifurcate, allowing the caller to request the semantics they need, so that an extra syscall is avoided when possible?

There probably are circumstances where DOS semantics are ok, however given deleteTree there are probably other places in the std lib that assume posix semantics. We would need to audit them all and decide if we can switch to the DOS semantics. This is especially difficult because dev machines tend to be fast and may not properly repro DOS semantics (the time between deletion and removal from the directory is nonzero but very small), so using DOS semantics may lead to unreliable and difficult-to-reproduce bugs that only manifest on low-end hard drives or in production.

IMO the default should be posix semantics, with a potential opt-in to DOS semantics to avoid the syscall if performance is worth introducing a possible race condition.

Also worth noting that this requires Windows 10 >= 1607 (_WIN32_WINNT_WIN10_RS1)

Ah, that's a problem. We probably need a different implementation then. It might be worth looking at another implementation like boost::fs, which implements posix semantics on windows.

@matu3ba
Copy link
Contributor

matu3ba commented Jun 7, 2022

Ah, that's a problem. We probably need a different implementation then. It might be worth looking at another implementation like boost::fs, which implements posix semantics on windows.

https://github.com/boostorg/filesystem/blob/fcc11010a5899d6a16d3f0a9d60704f88d62eada/src/operations.cpp#L1421

As I understand it, Zig currently has no bindings to SetFileInformationByHandle, which is heavily used in boost and suggested ie here https://stackoverflow.com/questions/36450222/moving-a-file-using-setfileinformationbyhandle
and generally to do atomic file stuff on Windows (by checking the resulting errors etc).

For some reason boost defaults to use file_disposition_info_class, whereas POSIX semantics offers the distinguishment between file_disposition_info_ex_class and file_basic_info_class as internal variables.

Other than that the implementations look relative identical.

Thus my questions:

    1. What is the expected behavior, if we get to a read-only file? This should be also specified in deleteTree.
    1. boost does retry setting file attributes (prone to failure, if file is replaced on filesystem while operation is executing)
      see relevant part of git show 7403ffca00b788c978d2de11c5148f5fa58555de: below. Should we do anything like this to hack around the problem or what is the expected behavior? I dont understand why they do it for Vista and later. Ideas?
    1. How can we properly test this? I feel like a few proper tests on according hardware (ie a potato) could provide stronger confidence and allow us to compare perf (loss) of the solutions. For starter points, look here: boost::filesystem::remove Access is denied on windows boostorg/filesystem#216
    The implementation uses runtime detection of the feature in the OS.
    We are also using two more implementations for file removal: one that
    employs the more recent FILE_DISPOSITION_FLAG_IGNORE_READONLY_ATTRIBUTE
    flag (available since Windows 10 1809), and FILE_DISPOSITION_INFO
    structure (supported since Windows Vista). The former allows to optimize
    removal of read-only files, and the latter allows to make file deletion
    atomic (i.e. not prone to failure if the file is replaced on the filesystem
    while the operation is executing). The implementation is chosen in
    runtime, depending on which one succeeds removing a file.

    Also, added support for deleting read-only directories, in addition
    to non-directory files, and simplified code a little.

    Closes https://github.com/boostorg/filesystem/issues/216.

Update: Crossed out clarified parts.
Update2: Crossed out more clarified parts (see comment below).

@matu3ba
Copy link
Contributor

matu3ba commented Jun 8, 2022

The relevant parts of boost (without functionality fallback) work like the following:

1. set_file_information_by_handle with file_disposition_info_ex_class for deletion; break on success;
2. ifnot ERROR_ACCESS_DENIED goto error;
3. get_file_information_by_handle_ex with file_basic_info_class (basic file info like access time and access permissions); goto error on failure;
4. set_file_information_by_handle without READ-ONLY; goto error on failure;
5. set_file_information_by_handle with file_disposition_info_ex_class for deletion; break on success;
6. set_file_information_by_handle with READ-ONLY; goto error;

Thus the (only?) remaining question is, how we should deal with read-only files.

Context
https://docs.microsoft.com/en-us/windows/win32/api/minwinbase/ne-minwinbase-file_info_by_handle_class
https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-ntquerydirectoryfile

Update: Clarified wording.

@andrewrk andrewrk force-pushed the windows-delete-posix-semantics branch from 6bd61f0 to e3fd441 Compare September 1, 2022 01:48
@andrewrk andrewrk force-pushed the windows-delete-posix-semantics branch 2 times, most recently from 33f9fad to d145320 Compare September 18, 2022 19:42
@andrewrk
Copy link
Member

andrewrk commented Oct 17, 2022

Also worth noting that this requires Windows 10 >= 1607 (_WIN32_WINNT_WIN10_RS1)

According to Microsoft, Windows 8.1 reaches end of Extended Support on January 10, 2023. source

@squeek502
Copy link
Collaborator

Also worth noting that this requires Windows 10 >= 1607 (_WIN32_WINNT_WIN10_RS1)

According to Microsoft, Windows 8.1 reaches end of Extended Support on January 10, 2023. source

Not sure if/how this factors in, but needing Windows 10 >= 1607 means that builds of Windows 10 earlier than that will also not have the feature; i.e it's not new with Windows 10, it was added in a particular build of Windows 10.

@andrewrk
Copy link
Member

It might be worth looking at another implementation like boost::fs, which implements posix semantics on windows.

Also useful: python/cpython#84324

@andrewrk
Copy link
Member

andrewrk commented Oct 17, 2022

This PR can still be merged if a target OS version check is added. builtin.target.os.version_range.windows.min.isAtLeast(.win10_rs1)

@andrewrk andrewrk force-pushed the windows-delete-posix-semantics branch from d145320 to 45307fc Compare December 12, 2022 02:47
@andrewrk andrewrk force-pushed the windows-delete-posix-semantics branch from 45307fc to de163e6 Compare January 27, 2023 06:04
@andrewrk
Copy link
Member

Unfortunately this was never quite merge-ready. Any contributor is welcome to resurrect this PR against latest master. As far as I know it only needs a OS min version check added.

@andrewrk andrewrk closed this Feb 16, 2023
matu3ba added a commit to matu3ba/zig that referenced this pull request Apr 28, 2023
This is the PR with requested fixups. Thanks to @SpexGuy for
the original PR ziglang#11793.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
os-windows standard library This issue involves writing Zig code for the standard library.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants