This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
[BUGFIX] Fix flakey TemporaryDirectory() cleanup on Windows #21107
Description
On a recent PR of mine and in other PRs, I saw sporadic windows-gpu job failures, which I flagged a while ago in issue #20914. I now believe the problem is due to flakiness of tempfile.TemporaryDirectory() on Windows, as described here: https://www.scivision.dev/python-tempfile-permission-error-windows/. This PR starts with a non-functional commit to show that master suffers from this problem (it indeed failed with the expected "access violation": https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-21107/1/pipeline). This PR then adds a like-named context manager, mxnet.util.TemporaryDirectory(), that wraps tempfile.TemporaryDirectory() but ignores its cleanup issues on Windows. Finally, the PR changes all uses of tempfile.TemporaryDirectory() in the codebase to use the newly added mxnet.util.TemporaryDirectory().

Update: I realize that this PR really targets a different error, namely "access is denied", as reported in this older issue: #17558. The "access violation" is a segfault in a backend thread, and so is likely a different issue. This may be a good PR to merge if we are still seeing "access is denied", but it won't correct the frequent Windows GPU CI failures we are currently seeing.
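For readers who want to see the shape of the wrapper described above, here is a minimal sketch. It is not the code from this PR's diff: the function name tolerant_temporary_directory is hypothetical, and the choice to swallow only PermissionError, and only when os.name is "nt", is an assumption about what "ignores its cleanup issues on Windows" means in practice.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def tolerant_temporary_directory(*args, **kwargs):
    """Hypothetical wrapper around tempfile.TemporaryDirectory().

    Yields the directory path, then attempts cleanup on exit.
    A PermissionError raised during cleanup is ignored on Windows
    (assumed policy) and re-raised on every other platform.
    """
    tmpdir = tempfile.TemporaryDirectory(*args, **kwargs)
    try:
        yield tmpdir.name
    finally:
        try:
            tmpdir.cleanup()
        except PermissionError:
            # On Windows, a file that is still open or locked can make
            # the removal fail with "access is denied"; skip it there.
            if os.name != "nt":
                raise


# Usage: behaves like tempfile.TemporaryDirectory() inside a with-block.
with tolerant_temporary_directory() as path:
    with open(os.path.join(path, "example.txt"), "w") as f:
        f.write("data")
```

On Python 3.10 and later, tempfile.TemporaryDirectory(ignore_cleanup_errors=True) offers a comparable effect on all platforms; a wrapper like the one above may still be useful where older interpreters must be supported or where errors should be suppressed on Windows only.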
Checklist
Essentials