
fs::remove_dir_all rarely succeeds for large directories on windows #29497

Open
Diggsey opened this issue Nov 1, 2015 · 36 comments
Labels
A-io Area: `std::io`, `std::fs`, `std::net` and `std::path` C-bug Category: This is a bug. E-help-wanted Call for participation: Help is requested to fix this issue. O-windows Operating system: Windows T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@Diggsey
Contributor

Diggsey commented Nov 1, 2015

I've been trying to track this one down for a while with little success. So far I've been able to confirm that no external programs are holding long-standing locks on any of the files in the directory, and I've confirmed that it happens with trivial rust programs such as the following:

use std::fs;

fn main() {
    println!("{:?}", fs::remove_dir_all("<path>"));
}

I've also confirmed that deleting the folder from explorer (equivalent to using SHFileOperation), or using rmdir on the command line will always succeed.

Currently my best guess is that either windows or other programs are holding transient locks on some of the files in the directory, possibly as part of the directory indexing windows performs to enable efficient searching.

Several factors led me to think this:

  • fs::remove_dir_all will pretty much always succeed on the second or third invocation.
  • it seems to happen most commonly when dealing with large numbers of text files
  • unlocker and other tools show no active handles to any files in the directory

Maybe there should be some sort of automated retry system, such that temporary locks do not hinder progress?

edit:
The errors returned seem somewhat random: sometimes it gives a "directory is not empty" error, while sometimes it will show up as an "access denied" error.
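For illustration, a minimal sketch of such a retry loop over fs::remove_dir_all (this is not something std does today; the helper name, attempt count, and back-off delays are arbitrary):

use std::fs;
use std::io;
use std::path::Path;
use std::thread::sleep;
use std::time::Duration;

// Hypothetical helper: retry fs::remove_dir_all a few times with a short
// back-off, treating NotFound as success (an earlier attempt finished the job).
fn remove_dir_all_retrying(path: &Path, attempts: u32) -> io::Result<()> {
    let mut last_err = io::Error::new(io::ErrorKind::Other, "no attempts made");
    for i in 0..attempts {
        match fs::remove_dir_all(path) {
            Ok(()) => return Ok(()),
            Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(()),
            Err(e) => {
                last_err = e;
                // Give transient locks a moment to clear before the next try.
                sleep(Duration::from_millis(50 * (u64::from(i) + 1)));
            }
        }
    }
    Err(last_err)
}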

@sfackler sfackler added the O-windows Operating system: Windows label Nov 1, 2015
@retep998
Member

retep998 commented Nov 1, 2015

I think this is just race conditions at work. When you delete a file, another program might get a notification and check out what happened, and that might hold a temporary lock on something. Or it might be that the previous deletions weren't actually done by the time you tried to delete the directory.

@petrochenkov
Contributor

Hm, https://github.com/CppCon/CppCon2015/blob/master/Tutorials/Racing%20the%20Filesystem/Racing%20the%20Filesystem%20-%20Niall%20Douglas%20-%20CppCon%202015.pdf (slide "3. Deleting a directory tree") may be related; I happened to watch this presentation yesterday. If I understood correctly, the DeleteFileW used in remove_dir_all doesn't actually delete a file, it only marks it for deletion; the file can still be alive when we try to delete the parent directory, causing the "directory is not empty" error.
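For illustration of the failure mode being described (not a fix), assuming a directory that contains only plain files:

use std::fs;
use std::io;
use std::path::Path;

// On Windows, remove_file may only *schedule* each file for deletion, so the
// final remove_dir can still see a non-empty directory and fail, even though
// every delete call above it succeeded.
fn remove_flat_dir(dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        fs::remove_file(entry?.path())?;
    }
    fs::remove_dir(dir) // may fail with "directory is not empty"
}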

@Diggsey
Contributor Author

Diggsey commented Nov 1, 2015

@petrochenkov Great link - that seems like it's exactly the problem I'm experiencing.

It would be great to see some of these portable race-free idioms implemented in std.

@alexcrichton
Member

I'd be interested to dig a bit more into this to see what's going on. According to the DeleteFile documentation the only reason the file would be flagged for deletion would be if there are active open handles, which seems like a legitimate race condition? Note though that the same docs also recommend using SHFileOperation for recursively deleting a tree.

Those slides are certainly an interesting read though! I'd be a little wary of putting it into the standard library as the "transactional semantics" are a little weird. For example if you successfully rename a file to a temporary directory and then fail to delete it, what happens?

It may be the case that the best implementation for this function on Windows is to just use SHFileOperation, and that may end up just taking care of these sorts of issues on its own.

@pitdicker
Contributor

I will give fixing this a try.

On Windows XP we could use SHFileOperation, and on Vista and up IFileOperation. For Windows store applications there is no easy function. But I would still like to try implementing it by walking the directories and deleting recursively.

Current problems with remove_dir_all:

  • cannot remove contents if the path becomes longer than MAX_PATH
  • files may not be deleted immediately, causing remove_dir to fail
  • unable to remove read-only files

And what I hope will fix it:

  • use canonicalize() on the base dir to get support for long paths
  • generate a hash for the temporary rm-[hash]-[nr] names used below
  • read-only files (possibly with hard links) TODO: can the read-only flag be stale?
    • open file
    • remove read-only flag
    • ReOpenFile with FILE_FLAG_DELETE_ON_CLOSE
    • set read-only flag
    • continue as with normal files
  • normal files/dirs
    • open with FILE_FLAG_BACKUP_SEMANTICS (for opening dirs)
      FILE_FLAG_OPEN_REPARSE_POINT (for opening reparse points)
      FILE_FLAG_DELETE_ON_CLOSE
    • atomically rename with SetFileInformationByHandle and FILE_RENAME_INFO
      (normal rename on xp)
    • move all files and subdirectories to the parent of the dir to remove
      (not to %TEMP%, may not be on the same drive)
      rename them to rm-[hash]-[nr]

I think we will know that deleting is going to fail as soon as opening the file with FILE_FLAG_DELETE_ON_CLOSE fails, i.e. before moving anything.
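A rough, untested sketch of that open step (not std's implementation), using std's Windows OpenOptionsExt; the constant values are taken from the Windows SDK headers, the helper name is made up, and the share mode here is a guess:

use std::fs::{File, OpenOptions};
use std::io;
use std::os::windows::fs::OpenOptionsExt;
use std::path::Path;

// Access right and flag values from the Windows SDK headers.
const DELETE: u32 = 0x0001_0000;
const FILE_SHARE_DELETE: u32 = 0x0000_0004;
const FILE_FLAG_BACKUP_SEMANTICS: u32 = 0x0200_0000; // needed to open directories
const FILE_FLAG_OPEN_REPARSE_POINT: u32 = 0x0020_0000; // open the reparse point itself
const FILE_FLAG_DELETE_ON_CLOSE: u32 = 0x0400_0000;

// If this open fails, we already know the delete is going to fail,
// before anything has been moved.
fn open_for_delete_on_close(path: &Path) -> io::Result<File> {
    OpenOptions::new()
        .access_mode(DELETE)
        .share_mode(FILE_SHARE_DELETE)
        .custom_flags(
            FILE_FLAG_BACKUP_SEMANTICS
                | FILE_FLAG_OPEN_REPARSE_POINT
                | FILE_FLAG_DELETE_ON_CLOSE,
        )
        .open(path)
}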

@pitdicker
Contributor

Oh, and I ran into this bug when rustbuild (is that the current name?) failed after make clean because of it.

@alexcrichton
Member

@pitdicker thanks for taking this on! We may want to discuss this a bit before moving much more on a solution. One question is how principled should something like fs::remove_dir_all be. For example if fs::remove_dir_all automatically handled read-only files, should fs::remove_file also do the same? Currently we have this "duality" where some fs functions are straight counterparts to the underlying system operations, but some are somewhat fancier like create_dir_all and remove_dir_all. We'd just want to make sure to have an explicitly drawn line here. Also note I'm not really thinking of an RFC here, just some more discussion before jumping to a PR.

I personally feel that remove_dir_all is fine to be fancy and do lots of implicit operations. I would continue to want, however, that remove_file and remove_dir are as straightforward as possible (e.g. don't mirror what's happening here unless it's a similarly small set of operations as to what happens today). Your proposed strategy seems reasonable to me at first glance, and I wouldn't mind reviewing more over a PR.

Curious what others think, though? cc @rust-lang/libs

@Diggsey
Contributor Author

Diggsey commented Feb 19, 2016

@alexcrichton Hmm, I would prefer remove_file and remove_dir to try to give the same guarantees they have on linux if possible. Specifically that if they succeed, the file or directory should be completely gone.

It's inevitable that people will build their own abstractions over the top of these primitive operations, and I'd very much like to avoid these kind of bugs where it's sufficiently rare that it's almost impossible to track down the true cause of the problem.

@retep998
Member

Specifically that if they succeed, the file or directory should be completely gone.

This wouldn't really work because files don't necessarily get deleted right away. You can't just move the file somewhere else and then delete it because there isn't always a reasonable place on the same volume to move the file to. Sitting around and indefinitely waiting for the file to vanish isn't ideal either. I'd personally just use IFileOperation for remove_dir_all and call it a day.

@Diggsey
Contributor Author

Diggsey commented Feb 19, 2016

What if the file is opened with FILE_FLAG_DELETE_ON_CLOSE, no share flags specified, and then is immediately closed? Successfully opening the file means that you have exclusive access, and so closing it should delete the file.

@retep998
Member

No share flags means nobody else can open it to read/write/delete it. However, another handle can still be opened without read/write/delete permissions. Also, what happens if you fail to open the file with no share flags? Does it error, or fall back to the normal deletion method that provides no guarantees?

@pitdicker
Contributor

Great to see so many comments. Just what I hoped for!

@alexcrichton I completely agree to keep remove_file and remove_dir simple, and only let remove_dir_all do magic. Actually removing read-only files is pretty difficult when you also don't want to break hard links. http://stackoverflow.com/questions/3055668/delete-link-to-file-without-clearing-readonly-bit is an example, but I am sure I found some better way somewhere.

@Diggsey I would also like cross-platform consistency, but sometimes Windows is just too different... I think the standard library should at least expose primitives that only need one system call. Otherwise we lose performance and maybe become more vulnerable to race conditions. Maybe we can add a remove_no_matter_what in the future :)

@retep998 I have not yet written the code, so I don't know if this will really solve the problem... But if this doesn't work out I am all for IFileOperation. Meanwhile it is a nice way for me and maybe the standard library to get better at avoiding filesystem races.

I would have to test what happens with FILE_FLAG_DELETE_ON_CLOSE, but in theory, if the file opens successfully, we only have to move it to close a race window of a couple of milliseconds. Directly after opening the file, no one else can open it anymore (it is not a share flag).

What is difficult is where to move a file / dir to. I think the parent of the dir we are removing is good enough. It should be on the same volume, and I think we also have write permission to it because otherwise deleting a dir is not possible (wild speculation here, it is late :))

@brson
Contributor

brson commented Feb 20, 2016

I feel ok about going to extra effort to make remove_dir_all work as expected.

@alexcrichton
Member

@pitdicker perhaps you could draw up a list of the possible implementation strategies for these functions? That'd also help in terms of weighing the pros/cons of each strategy, and we could get a good idea of which terminology is associated with what. So far it sounds like IFileOperation may be the best option here, but I'm personally at least not following what FILE_FLAG_DELETE_ON_CLOSE would mean, especially in the case where you can't open the file because someone else already has it open.

@pitdicker
Contributor

To be fair, IFileOperation is a bit too difficult for me to implement. It should "just work" when removing directories. But as far as I can see it is an entirely different kind of API. For example, it does not work on handles but on IShellItems, and is mostly made for GUI applications.

The steps I proposed are uglier. They use the same simpler but more low-level APIs as the rest of the standard library. There is a little window where files may temporarily show up in the parent directory of the one we are removing, but those files would otherwise be the ones that currently cause remove_dir_all to fail, which is not all that often.

On Windows, deleting a file is not guaranteed to be complete before the function returns. Instead the file is scheduled for deletion, to happen once no one has it open anymore. On Unix, on the other hand, the file looks truly deleted as soon as the call returns; only deletion of the inode is deferred until everyone is done with it.

FILE_FLAG_DELETE_ON_CLOSE is a flag we can set when opening a file, together with DELETE permission as the access mode. It will schedule the file (or directory) for deletion. When the flag is set (i.e. as soon as the open succeeds), it is not possible for others to open the file anymore. The only question is what happens if others already have the file open. If they did not open the file with FILE_SHARE_DELETE, we should fail to open it (as we will not be granted DELETE access).

Using this flag should be identical to using DeleteFile. The only reason I want to use it is that it keeps a handle open, and with this handle the file can be moved to a temporary location so it does not block removing the parent dir.

But I am not sure of anything yet. Somehow I just can't get the renaming part to work yet, so all of this is theory, not tested :(.

@Diggsey
Contributor Author

Diggsey commented Feb 22, 2016

Having looked at IFileOperation, I agree with @pitdicker that it doesn't seem like the right kind of API for low-level filesystem operations: it's very much oriented towards interactive use. (The .NET framework doesn't use it either, but then Directory.Delete suffers from the exact same problem!)

I'm starting to think that maybe just using the current implementation and retrying on failure is the best option: http://stackoverflow.com/a/1703799 (the second code example there has the advantage that it only keeps retrying if some progress is made, and so retries more often for more deeply nested directories).
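A sketch of that idea in Rust (my own approximation of the linked answer, not its translation; it uses the number of remaining top-level entries as the "progress" measure, and the helper name is made up):

use std::fs;
use std::io;
use std::path::Path;

// Retry remove_dir_all only while each attempt leaves fewer entries behind
// than the previous one; otherwise give up and report the last error.
fn remove_dir_all_with_progress(path: &Path) -> io::Result<()> {
    let mut remaining = usize::MAX;
    loop {
        match fs::remove_dir_all(path) {
            Ok(()) => return Ok(()),
            Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(()),
            Err(e) => {
                // Count what is left; if nothing was deleted since the last
                // attempt, stop retrying.
                let left = fs::read_dir(path).map(|it| it.count()).unwrap_or(0);
                if left >= remaining {
                    return Err(e);
                }
                remaining = left;
            }
        }
    }
}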

@pitdicker
Contributor

YES! I've got the manual method working.

Now to clean up my messy code, and do a lot of testing...

@pitdicker
Contributor

@Diggsey Thanks for looking up how .NET handles this!

And I like the simplicity of just retrying on failure. But I don't like that it needs a little luck. So I prefer moving files the instant before they are deleted.

@timvisee
Contributor

timvisee commented Nov 15, 2016

Has a fix already been applied to std, or do we have a working fix available yet? I'm currently bumping into the same issue on Windows, and this issue seems to be open for quite a while. Sadly, the problem seems to be quite inconsistent on my machine.

Also, I don't think that retrying to delete a file a few times sounds like a good solution to put into std.

@orvly

orvly commented Jul 22, 2017

I just ran into a related problem myself, where after calling remove_dir_all on a directory which had just a single empty sub-directory, it completed successfully, but then trying to re-create this directory hierarchy immediately failed with Access Denied.
(I imagine this should be a rather common scenario with testing code which relies on the filesystem, first cleaning up any artifacts from the previous run and then re-creating them).

The problem is basically not just with remove_dir_all, but with any attempt to delete a directory - it might not be executed immediately, so any action taken right after it that assumes the directory no longer exists might fail.

But after going to the issue referenced right above this comment (995 in rustup) I found out about this new "remove_dir_all" crate from @Aaronepower (hosting just this single function), which uses a trick - it moves the directory first, then deletes the moved directory (as documented extensively there).
That solved my problem as well as the more general problem of deleting a directory on Windows.
remove_dir_all crate

Could its implementation replace the one in std::fs?

@Mark-Simulacrum Mark-Simulacrum added C-feature-request Category: A feature request, i.e: not implemented / a PR. C-bug Category: This is a bug. and removed C-feature-request Category: A feature request, i.e: not implemented / a PR. labels Jul 24, 2017
@ChrisDenton
Member

ChrisDenton commented May 9, 2021

I think there are (at least) three issues here:

  1. The main issue is that a file can be pending deletion for a while, which prevents deleting its parent directory. This should already be fixed by recent versions of Windows, but a solution that works for older versions would be welcome.
  2. If a process is still running then its .exe file cannot be deleted. I've seen this cause issues where a build system tries to cleanup exe files and fails. More generally, a file could be held open without the "FILE_SHARE_DELETE" share mode (which means there's a file lock preventing deletion attempts until after the file handle is closed).
  3. People thinking the DOS FILE_ATTRIBUTE_READONLY is the same as the POSIX read permission. This is especially an issue with applications ported from Linux. This ship has sailed at this point so maybe we should just force delete readonly files.

I think the first can be mitigated slightly by ignoring any ERROR_DIR_NOT_EMPTY messages at first, then recursively deleting any left over directories afterwards. Maybe in a retry loop that again skips ERROR_DIR_NOT_EMPTY until the final attempt.

I'm uncertain if the second can be sensibly solved by Rust's std. At least not with the current API. It's just a matter of waiting. For example, the process may need to finish running and then the .exe file handle has to be closed. This could take any amount of time. I guess we could retry a few times before giving up?

The third can be solved by either unsetting readonly attributes or using the FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE flag (on OS versions that support it).
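For the readonly case specifically, a small sketch of the "just force delete readonly files" option using only std APIs (the helper name is made up, and this ignores the hard-link caveat discussed earlier in the thread):

use std::fs;
use std::io;
use std::path::Path;

// Hypothetical fallback: if deletion fails, clear the readonly attribute and retry.
fn force_remove_file(path: &Path) -> io::Result<()> {
    match fs::remove_file(path) {
        Ok(()) => Ok(()),
        Err(_) => {
            let mut perms = fs::metadata(path)?.permissions();
            // On Windows, set_readonly(false) clears FILE_ATTRIBUTE_READONLY.
            perms.set_readonly(false);
            fs::set_permissions(path, perms)?;
            fs::remove_file(path)
        }
    }
}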

@XAMPPRocky
Member

I'm uncertain if the second can be sensibly solved by Rust's std. At least not with the current API. It's just a matter of waiting. For example, the process may need to finish running and then the .exe file handle has to be closed. This could take any amount of time. I guess we could retry a few times before giving up?

Just my opinion, but retrying seems fine to me; it is already what some APIs do. For example, symlinking in std on Windows, if it receives a certain error from Windows, just tries again with some other flags. The same could work here. I think the main thing to be aware of with retrying, though, is for it to still feel responsive, and to design it so that someone can't cause a denial of service by getting this API to repeatedly try to delete a running exe (such as itself).

I think you'd want something like a single timeout for a fixed amount of time, where the countdown only starts once a deletion has failed, and which returns ErrorKind::TimedOut or similar if it expires.

You could probably also be smarter about it and do something like collecting all the entries that fail and trying them a second time after you've deleted everything else, so that it's more likely the OS has had enough time to finish whatever it was doing.
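A sketch of that shape of API (my own interpretation, with an arbitrary poll interval): the countdown only starts at the first failure, and the underlying error is attached to the TimedOut error that is returned.

use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, Instant};

// Keep retrying until `timeout` has elapsed since the first failure.
fn remove_dir_all_with_deadline(path: &Path, timeout: Duration) -> io::Result<()> {
    let mut failing_since: Option<Instant> = None;
    loop {
        match fs::remove_dir_all(path) {
            Ok(()) => return Ok(()),
            Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(()),
            Err(e) => {
                // Start the countdown only once we have seen a failure.
                let start = *failing_since.get_or_insert_with(Instant::now);
                if start.elapsed() >= timeout {
                    return Err(io::Error::new(io::ErrorKind::TimedOut, e));
                }
                std::thread::sleep(Duration::from_millis(100));
            }
        }
    }
}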

N3xed added a commit to N3xed/embuild that referenced this issue Aug 31, 2021
Add dependency `remove_dir_all`
Note: On windows `std::fs::remove_dir_all` doesn't really work (rust-lang/rust#29497).
N3xed added a commit to N3xed/embuild that referenced this issue Sep 1, 2021
Add dependency `remove_dir_all`
Note: On windows `std::fs::remove_dir_all` doesn't really work (rust-lang/rust#29497).
N3xed added a commit to esp-rs/embuild that referenced this issue Sep 7, 2021
@pwinckles

pwinckles commented Jan 21, 2022

It looks like one of the commits that addressed the remove_dir_all security vulnerability may have also resolved this issue; there is new retry logic starting on line 1010. I haven't had a chance to test it out on a Windows system yet.

@ehuss ehuss removed the E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. label Jan 29, 2022
@rbtcollins
Contributor

@ChrisDenton Looks like FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE is only available for ZwSetInformationFile - Cygwin uses the underlying NT calls, but I had the impression Rust is trying to avoid going that low level.

@pwinckles The loop there may reduce the incidence, but since it re-enters the entire operation it's going to repeat IO down to the depth of the point where the error occurred, though it will at least bail out fast. I wouldn't want to use that version for rustup's deletion of docs trees, for instance, though I am still interested in getting a reliable version into core.

@ChrisDenton
Member

@rbtcollins FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE is in the standard headers (WinBase.h) and is also in the windows-rs crate.

Btw, the code currently in the standard library was written with some amount of haste so isn't necessarily the best example. I do plan to get back to it but lately I've been distracted with other issues.

@rbtcollins
Contributor

@ChrisDenton that's super interesting - and not visible on the MSDN doc pages :/

I meant no negativity about the current code - it is great the CVE was solved, and time mattered a lot there... and this bug is really (to my mind) about figuring out what is sensible for the stdlib - tradeoffs between complexity, generality, least surprise, etc. abound.

@ChrisDenton
Member

@ChrisDenton that's super interesting - and not visible on the MSDN doc pages :/

The newer information classes are listed (e.g. FileDispositionInfoEx and FileRenameInfoEx), but there is not yet platform version information or dedicated pages listing the constants, as there are for the others. It probably just needs someone to put in the work of writing the docs (which tbh is rarely the most popular task). In any case, being in the public headers and used by the MSVC STL is quite a vote of confidence that it isn't going away.

I meant no negativity about the current code - it is great the CVE was solved, and time mattered a lot there... and this bug is really (to my mind) about figuring out what is sensible for the stdlib - tradeoffs between complexity, generality, least surprise, etc. abound.

Sorry! I didn't take what you said as negativity. I was meaning that more as a warning that, while I'm happy it fixes the CVE, I would still like to see the code cleaned up and improved.

@rbtcollins
Contributor

@ChrisDenton so the FileDispositionInfoEx docs note that when deleting a file with POSIX semantics, "When FILE_DISPOSITION_POSIX_SEMANTICS is set, the link is removed from the visible namespace as soon as the POSIX delete handle has been closed, but the file's data streams remain accessible by" - does this mean that the directory delete will receive ERROR_DIR_NOT_EMPTY while those processes with handles open continue to consume the file? Or is it genuinely unlinked from the directory, and thus the directory is deletable?

@ChrisDenton
Member

@rbtcollins The HANDLE that sets the disposition must be closed for it to take full effect. On setting FILE_DISPOSITION_POSIX_SEMANTICS, but before closing, the file is locked such that the file can no longer be opened. However, it still remains in the directory and so attempting to delete the directory will produce ERROR_DIR_NOT_EMPTY. So essentially setting the disposition should be immediately followed by a close.

Other file handles to the same file can remain open. They have no effect at all on the POSIX delete (unless of course they lock the file).

@rbtcollins
Contributor

Thank you @ChrisDenton .

I'm having a little trouble figuring out the version support for this comment of yours: FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE flag (on OS versions that support it). It looks like it's definitely present in Windows 10 1809. Is there some easy lookup matrix for these things? In particular, to figure out whether to use it conditionally, or to just document that successful deletion of readonly trees requires that version ...

@ChrisDenton
Member

ChrisDenton commented Jan 21, 2023

The documentation is pretty limited atm so the best bet is the C/C++ headers which indeed indicate that it's available on 1809 and later (aka _WIN32_WINNT_WIN10_RS5).

I'd suggest trying the FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE flag first, checking the error returned, and then progressively falling back if necessary. The std currently checks for any of ERROR_NOT_SUPPORTED, ERROR_INVALID_FUNCTION or ERROR_INVALID_PARAMETER and then falls back to a non-POSIX delete. A more involved solution could fall back to dropping FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE first and, if that works, remember (in an atomic) that the ignore-readonly attribute isn't supported on this OS.
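A sketch of that progressive-fallback idea (not the actual std code). Here posix_delete is a hypothetical stand-in for a wrapper around SetFileInformationByHandle with FileDispositionInfoEx, where the boolean argument toggles FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE; the error-code values are the standard Windows ones.

use std::io;
use std::sync::atomic::{AtomicBool, Ordering};

// Windows error codes indicating the flag/info class isn't supported here.
const ERROR_INVALID_FUNCTION: i32 = 1;
const ERROR_NOT_SUPPORTED: i32 = 50;
const ERROR_INVALID_PARAMETER: i32 = 87;

// Remembers, process-wide, that IGNORE_READONLY_ATTRIBUTE is unsupported.
static IGNORE_READONLY_UNSUPPORTED: AtomicBool = AtomicBool::new(false);

fn delete_with_fallback(
    posix_delete: impl Fn(bool) -> io::Result<()>,
) -> io::Result<()> {
    if !IGNORE_READONLY_UNSUPPORTED.load(Ordering::Relaxed) {
        match posix_delete(true) {
            Ok(()) => return Ok(()),
            Err(e)
                if matches!(
                    e.raw_os_error(),
                    Some(ERROR_INVALID_FUNCTION)
                        | Some(ERROR_NOT_SUPPORTED)
                        | Some(ERROR_INVALID_PARAMETER)
                ) =>
            {
                // Remember the failure so later deletes skip the first attempt.
                IGNORE_READONLY_UNSUPPORTED.store(true, Ordering::Relaxed);
            }
            Err(e) => return Err(e),
        }
    }
    // Fall back to a POSIX delete without the ignore-readonly flag
    // (a further fallback to a non-POSIX delete is elided here).
    posix_delete(false)
}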
