Fix race condition in CUDA stream creation #1984

rafbiels · 2024-08-15T17:37:22Z

Do not increment NumComputeStreams / NumTransferStreams before cuStreamCreateWithPriority returns. Incrementing too early let other threads read the incremented count before the CUDA stream had been created and then try to use an invalid stream handle, causing crashes.

The construction:

if (condition) {
  lock_this_scope
  if (condition) {
    create_object
    update_condition
  }
}
use_object

is only thread-safe if update_condition happens after create_object is completed. This PR ensures the ordering.

intel/llvm PR: intel/llvm#15100

rafbiels · 2024-08-15T17:40:32Z

Reproducer for the crashes:

#include <array>
#include <future>
#include <vector>
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q{};

  constexpr static unsigned int numThreads{128};
  std::vector<std::future<void>> futures;
  std::array<unsigned int, numThreads> hostData{};
  std::array<unsigned int*, numThreads> devicePointers{};
  futures.reserve(numThreads);

  // Allocate one device element per thread up front.
  for (unsigned int i{0}; i < numThreads; ++i) {
    devicePointers[i] = sycl::malloc_device<unsigned int>(1, q);
  }

  // Submit copies to the same queue from many threads at once, so that
  // several threads request a stream concurrently.
  for (unsigned int i{0}; i < numThreads; ++i) {
    futures.push_back(std::async([&q, &devicePointers, &hostData, i]() {
      q.copy(hostData.data() + i, devicePointers[i], 1);
    }));
  }

  // Wait for all submitting threads to finish.
  for (unsigned int i{0}; i < numThreads; ++i) {
    futures[i].wait();
  }
  futures.clear();

  // Wait for the submitted copies themselves to complete.
  q.wait_and_throw();

  for (unsigned int i{0}; i < numThreads; ++i) {
    sycl::free(devicePointers[i], q);
  }
}

This crashes for me in ~50% of runs before this PR with:

UR CUDA ERROR:
        Value:           400
        Name:            CUDA_ERROR_INVALID_HANDLE
        Description:     invalid resource handle
        Function:        urEnqueueUSMMemcpy

hdelan

Nice catch! Tricky one

npmiller · 2024-08-19T15:34:46Z

Should we add this to v0.10.0?

rafbiels requested a review from a team as a code owner August 15, 2024 17:37

rafbiels requested a review from npmiller August 15, 2024 17:37

rafbiels mentioned this pull request Aug 15, 2024

[UR][CUDA] Fix race condition in CUDA stream creation intel/llvm#15100

Merged

hdelan approved these changes Aug 15, 2024

kbenzie added the ready to merge label Aug 16, 2024

omarahmed1111 force-pushed the rafbiels/cuda-stream-race-cond branch from 24b6ff7 to 15bca3b Compare August 19, 2024 13:14

github-actions bot added the cuda label Aug 19, 2024

omarahmed1111 merged commit cabf128 into oneapi-src:main Aug 19, 2024

steffenlarsen pushed a commit to intel/llvm that referenced this pull request Aug 20, 2024

[UR][CUDA] Fix race condition in CUDA stream creation (#15100)

2f3919e

Fix race condition in CUDA stream creation in the UR CUDA adapter See oneapi-src/unified-runtime#1984

kbenzie added the v0.10.x label Aug 20, 2024

kbenzie pushed a commit that referenced this pull request Aug 20, 2024

Merge pull request #1984 from rafbiels/rafbiels/cuda-stream-race-cond

2d0a72e

Fix race condition in CUDA stream creation

kbenzie mentioned this pull request Aug 20, 2024

Candidate for the v0.10.0 release tag #1938

Merged

53 tasks
