This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
[BUGFIX] Fix threadsafety and shutdown issues with threaded_engine_perdevice #21110
Description
This PR improves how ThreadedEnginePerDevice handles CUDA resources. The change appears to have fixed certain CI failures in the Windows GPU jobs (6 of 6 passing), first mentioned in issue #20914 and more recently in PR #21107.
Background: As a general policy, a process should not fork while it holds an active CUDA context with unreleased CUDA resources such as streams and events. It is unclear whether this is guaranteed to work even when the forked child process performs an immediate exec. In the absence of such assurances, the parent should play it safe and release all CUDA resources prior to the fork. To help with that task, the following callback in initializer.cc is currently executed by the parent process prior to the fork:
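The callback itself is not reproduced in this text. As a rough illustration of the general pattern only, here is a minimal sketch that assumes the callback is registered as a POSIX pthread_atfork prepare handler. ToyEngine, AcquireResource, NumResources, AtForkPrepare and InstallForkHandlers are hypothetical names invented for this sketch; they are not MXNet code, and plain integers stand in for CUDA streams and events.

```cpp
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for an engine that owns device resources.
// Integers play the role of CUDA streams/events in this sketch.
class ToyEngine {
 public:
  static ToyEngine& Get() {
    static ToyEngine instance;
    return instance;
  }

  // Record a resource; the mutex guards against concurrent worker setup.
  void AcquireResource(int id) {
    std::lock_guard<std::mutex> lock(mu_);
    resources_.push_back(id);
  }

  // Release everything the engine holds so nothing is live across a fork.
  void Stop() {
    std::lock_guard<std::mutex> lock(mu_);
    resources_.clear();
  }

  std::size_t NumResources() {
    std::lock_guard<std::mutex> lock(mu_);
    return resources_.size();
  }

 private:
  std::mutex mu_;
  std::vector<int> resources_;
};

// Prepare handler: runs in the parent immediately before fork().
extern "C" void AtForkPrepare() { ToyEngine::Get().Stop(); }

// Registers the prepare handler; returns 0 on success.
inline int InstallForkHandlers() {
  return pthread_atfork(AtForkPrepare, nullptr, nullptr);
}
```

After InstallForkHandlers() succeeds, any later fork() in the process first runs AtForkPrepare in the parent, so the toy engine holds no resources at the moment the child is created.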
The Engine::Get()->Stop() call then ultimately calls ThreadedEnginePerDevice::StopNoWait(). This PR adds to StopNoWait() a release of the CUDA resources introduced by the async GPU dependency engine PR #20331 (namely streams_ and cuda_event_pool_per_worker_). In addition, it corrects a thread safety issue that might occur during the initial construction of GPUWorkers.
Checklist
Essentials