Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[BUGFIX] Fix flakey TemporaryDirectory() cleanup on Windows #21107

Merged
merged 6 commits into from
Aug 1, 2022

Conversation

DickJC123
Copy link
Contributor

@DickJC123 DickJC123 commented Jul 28, 2022

Description

On a recent PR of mine and in other PRs, I saw sporadic windows-gpu job failures, which I flagged awhile ago in issue #20914 . I feel now the problem is due to flakiness of tempfile.TemporaryDirectory() on Windows, as described here: https://www.scivision.dev/python-tempfile-permission-error-windows/. This PR starts with a non-functional commit to show that master is suffering this problem (which indeed failed with the expected "access violation": https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-21107/1/pipeline). This PR then adds a like-named context manager routine mxnet.util.TemporaryDirectory() that wraps tempfile.TemporaryDirectory(), but ignores its cleanup issues on Windows. Finally, the PR changes all uses of tempfile.TemporaryDirectory() in the codebase to use the newly added mxnet.util.TemporaryDirectory().

Update: I realize that this PR really targets a different error, namely "access is denied", as reported in this older issue: #17558

The "access violation" is a segfault in a backend thread, and so is likely a different issue. This maybe a good PR to merge, if we are still seeing "access is denied". But it won't correct the frequent windows GPU CI failures we are currently seeing.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

@mxnet-bot
Copy link

Hey @DickJC123 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-cpu, miscellaneous, centos-cpu, windows-gpu, clang, unix-cpu, unix-gpu, website, edge, centos-gpu, sanity]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 mseth10 added the pr-work-in-progress PR is still work in progress label Jul 28, 2022
@DickJC123 DickJC123 requested a review from szha as a code owner July 29, 2022 01:25
@DickJC123 DickJC123 changed the title [WIP] [BUGFIX] Explore windows gpu job failures [BUGFIX] Fix flakey TemporaryDirectory() cleanup on Windows Jul 29, 2022
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jul 29, 2022
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-review PR is waiting for code review labels Jul 30, 2022
@DickJC123 DickJC123 changed the title [BUGFIX] Fix flakey TemporaryDirectory() cleanup on Windows [WIP] [BUGFIX] Fix flakey TemporaryDirectory() cleanup on Windows Jul 30, 2022
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jul 30, 2022
Copy link
Member

@szha szha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as a fix for sporadic cleanup behavior on windows

@szha szha merged commit dedb8c9 into apache:master Aug 1, 2022
@DickJC123 DickJC123 changed the title [WIP] [BUGFIX] Fix flakey TemporaryDirectory() cleanup on Windows [BUGFIX] Fix flakey TemporaryDirectory() cleanup on Windows Aug 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-work-in-progress PR is still work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants