Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[BUGFIX] Fix threadsafety and shutdown issues with threaded_engine_perdevice #21110

Merged
merged 4 commits into from
Aug 2, 2022

Conversation

DickJC123
Copy link
Contributor

@DickJC123 DickJC123 commented Aug 1, 2022

Description

This PR makes improvements in the handling of CUDA resources by the ThreadedEnginePerDevice, and this appears to have corrected certain CI failures of the GPU jobs on Windows (6 of 6 passing), as mentioned first in issue #20914 and more recently in PR #21107.

Background: As a general policy, a process should not have an active CUDA context, with unreleased CUDA resources like streams and events, prior to forking. While it's even unclear whether this is guaranteed to work when the forked child process performs an immediate exec, in the absence of those assurances, the parent should definitely play it safe and release all CUDA resources prior to the fork. To help in that task, the following callback in initializer.cc is currently executed by the parent process prior to the fork:

void LibraryInitializer::atfork_prepare() {
  using op::custom::CustomOperator;
  CustomOperator::Get()->Stop();
  Engine::Get()->Stop();
}

The Engine::Get()->Stop() call then ultimately calls ThreadedEnginePerDevice::StopNoWait(). This PR adds to StopNoWait() a release of CUDA resources introduced by the async GPU dependency engine PR #20331 (namely streams_ and cuda_event_pool_per_worker_). In addition, a thread safety issue is corrected that might occur during the initial construction of GPUWorkers.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

@mxnet-bot
Copy link

Hey @DickJC123 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [clang, windows-cpu, miscellaneous, unix-cpu, sanity, centos-gpu, windows-gpu, website, unix-gpu, edge, centos-cpu]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 mseth10 added the pr-work-in-progress PR is still work in progress label Aug 1, 2022
@DickJC123 DickJC123 requested a review from ptrendx August 1, 2022 22:11
@DickJC123 DickJC123 changed the title [WIP] [BUGFIX] Fix threadsafety and shutdown issues with threaded_engine_perdevice [BUGFIX] Fix threadsafety and shutdown issues with threaded_engine_perdevice Aug 1, 2022
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Aug 1, 2022
Copy link
Member

@ptrendx ptrendx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ptrendx ptrendx merged commit 0b4ecdb into apache:master Aug 2, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants