[ROCm] Avoid watchdog event queries during graph capture#176251
[ROCm] Avoid watchdog event queries during graph capture#176251chinmaydk99 wants to merge 1 commit into
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176251
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New Failure, 1 Pending, 1 Unrelated FailureAs of commit 23503e4 with merge base e274aff ( NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
What is going on with these PRs? Watchdog event queries were already happening in thread-local mode, there should be no additional fixes needed. under what circumstances would putting watchdog in the thread-local mode |
You're right that the watchdog already sets cudaStreamCaptureModeThreadLocal before event queries. On CUDA/NVIDIA this is sufficient. The issue is that current HIP runtimes don't honor THREAD_LOCAL mode for hipEventQuery. Even with the mode switch, hipEventQuery still checks the global capture list and returns hipErrorStreamCaptureUnsupported if another thread has GLOBAL capture active, which invalidates the session. This is a known HIP runtime bug. So the existing CUDAStreamCaptureModeGuard is the correct design, it just isn't effective on current HIP runtimes. This PR is a temporary workaround until the fixed runtime is baseline, at which point the guards can be removed. |
|
@pytorchbot revert -c ghfirst -m "needs additional reviews, blocking revert of another PR" |
|
@pytorchbot successfully started a revert job. Check the current status here. |
…re (#176251)" This reverts commit a346446. Reverted #176251 on behalf of https://github.com/jeffdaily due to needs additional reviews, blocking revert of another PR ([comment](#176251 (comment)))
|
@chinmaydk99 your PR has been successfully reverted. |
a3fdfc8 to
23503e4
Compare
|
@pytorchbot merge -f "target-determination job got stuck? all other trunk is passing, and 1 known flaky and other fails outside of this PR" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
) This PR introduces a workaround for the HIP runtime bug (pytorch#177309) where `hipEventQuery` from a non-capturing thread invalidates graph captures on other threads, even in `THREAD_LOCAL` mode(ROCm/rocm-systems#3176). The NCCL/RCCL watchdog's polling queries hit this. - `queryEventWithRocmWatchdogCaptureWorkaround()` wraps `CUDAEvent::query()` logic: - Watchdog calling during active capture: skips the query, returns false (not ready) - Otherwise queries normally, but catches `hipErrorCapturedEvent` / `hipErrorStreamCaptureUnsupported` from the watchdog and maps them to "not ready" for race conditions - `RocmWatchdogEventQueryContextGuard` thread-local guard set in `runLoop()` so the skip path only activates on the watchdog — main-thread `wait()`/`isCompleted()` unchanged - Timeout checks gated on `!is_graph_capture_active()` to avoid false positives while queries are skipped - `is_graph_capture_active()` reads the existing `_currently_capturing_graphs` map under its mutex - `capture_end()` erases the map entry before `AT_CUDA_CHECK` so the watchdog never sees stale state on error paths All `#ifdef USE_ROCM`. TODO to remove once the HIP runtime fix ships. Pull Request resolved: pytorch#176251 Approved by: https://github.com/jeffdaily, https://github.com/ngimel (cherry picked from commit 5ae3a6f)
…rch#176251) On HIP runtimes, `hipEventQuery` from the NCCL watchdog thread while another thread has GLOBAL stream capture active poisons the capture session (`hipErrorStreamCaptureUnsupported` → `hipErrorStreamCaptureInvalidated`). This was observed as intermittent failures in FSDP + CUDA graph tests, confirmed by `TORCH_NCCL_BLOCKING_WAIT=1` making it pass (disables async watchdog). Fix: add ROCm-only `is_graph_capture_active()` using the existing capture bookkeeping map, then use it in `ProcessGroupNCCL` to skip `hipEventQuery` and defer timeout checks while any capture is active. Also fix `capture_end()` ordering so bookkeeping cleanup happens before error propagation. Pull Request resolved: pytorch#176251 Approved by: https://github.com/pragupta, https://github.com/jeffdaily
…re (pytorch#176251)" This reverts commit a346446. Reverted pytorch#176251 on behalf of https://github.com/jeffdaily due to needs additional reviews, blocking revert of another PR ([comment](pytorch#176251 (comment)))
) This PR introduces a workaround for the HIP runtime bug (pytorch#177309) where `hipEventQuery` from a non-capturing thread invalidates graph captures on other threads, even in `THREAD_LOCAL` mode(ROCm/rocm-systems#3176). The NCCL/RCCL watchdog's polling queries hit this. - `queryEventWithRocmWatchdogCaptureWorkaround()` wraps `CUDAEvent::query()` logic: - Watchdog calling during active capture: skips the query, returns false (not ready) - Otherwise queries normally, but catches `hipErrorCapturedEvent` / `hipErrorStreamCaptureUnsupported` from the watchdog and maps them to "not ready" for race conditions - `RocmWatchdogEventQueryContextGuard` thread-local guard set in `runLoop()` so the skip path only activates on the watchdog — main-thread `wait()`/`isCompleted()` unchanged - Timeout checks gated on `!is_graph_capture_active()` to avoid false positives while queries are skipped - `is_graph_capture_active()` reads the existing `_currently_capturing_graphs` map under its mutex - `capture_end()` erases the map entry before `AT_CUDA_CHECK` so the watchdog never sees stale state on error paths All `#ifdef USE_ROCM`. TODO to remove once the HIP runtime fix ships. Pull Request resolved: pytorch#176251 Approved by: https://github.com/jeffdaily, https://github.com/ngimel
) This PR introduces a workaround for the HIP runtime bug (pytorch#177309) where `hipEventQuery` from a non-capturing thread invalidates graph captures on other threads, even in `THREAD_LOCAL` mode(ROCm/rocm-systems#3176). The NCCL/RCCL watchdog's polling queries hit this. ### Code Changes #### `ProcessGroupNCCL.cpp` - `queryEventWithRocmWatchdogCaptureWorkaround()` wraps `CUDAEvent::query()` logic: - Watchdog calling during active capture: skips the query, returns false (not ready) - Otherwise queries normally, but catches `hipErrorCapturedEvent` / `hipErrorStreamCaptureUnsupported` from the watchdog and maps them to "not ready" for race conditions - `RocmWatchdogEventQueryContextGuard` thread-local guard set in `runLoop()` so the skip path only activates on the watchdog — main-thread `wait()`/`isCompleted()` unchanged - Timeout checks gated on `!is_graph_capture_active()` to avoid false positives while queries are skipped #### `CUDAGraph.cpp/h` - `is_graph_capture_active()` reads the existing `_currently_capturing_graphs` map under its mutex - `capture_end()` erases the map entry before `AT_CUDA_CHECK` so the watchdog never sees stale state on error paths All `#ifdef USE_ROCM`. TODO to remove once the HIP runtime fix ships. Pull Request resolved: pytorch#176251 Approved by: https://github.com/jeffdaily, https://github.com/ngimel
…orch#176251)" This reverts commit 5ae3a6f.
This PR introduces a workaround for the HIP runtime bug (#177309) where
hipEventQueryfrom a non-capturing thread invalidates graph captures on other threads, even inTHREAD_LOCALmode(ROCm/rocm-systems#3176). The NCCL/RCCL watchdog's polling queries hit this.Code Changes
ProcessGroupNCCL.cppqueryEventWithRocmWatchdogCaptureWorkaround()wrapsCUDAEvent::query()logic:hipErrorCapturedEvent/hipErrorStreamCaptureUnsupportedfrom the watchdog and maps them to "not ready" for race conditionsRocmWatchdogEventQueryContextGuardthread-local guard set inrunLoop()so the skip path only activates on the watchdog — main-threadwait()/isCompleted()unchanged!is_graph_capture_active()to avoid false positives while queries are skippedCUDAGraph.cpp/his_graph_capture_active()reads the existing_currently_capturing_graphsmap under its mutexcapture_end()erases the map entry beforeAT_CUDA_CHECKso the watchdog never sees stale state on error pathsAll
#ifdef USE_ROCM. TODO to remove once the HIP runtime fix ships.cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang