Commit 0a8ae04

Stream synchronize before deallocating SAM (#1655)
While investigating cuml benchmarks, I found an issue with the current `system_memory_resource` that causes a segfault. Roughly, it occurs in code like this:

```cuda
void foo(...) {
  rmm::device_uvector<T> tmp(bufferSize, stream);
  // launch cuda kernels making use of tmp
}
```

When the function returns, the `device_uvector` goes out of scope and is destroyed while the CUDA kernel might still be in flight. With `cudaFree`, the CUDA runtime performs implicit synchronization to make sure the kernel finishes before actually freeing the memory, but with system allocated memory (SAM) we don't have that guarantee, which leads to use-after-free errors.

This is a rather simple fix. In the future we may want to use CUDA events to make this less blocking.

Authors:
- Rong Ou (https://github.com/rongou)

Approvers:
- Mark Harris (https://github.com/harrism)
- Lawrence Mitchell (https://github.com/wence-)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #1655
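The ordering problem described above can be illustrated without a GPU. The following is a minimal host-only sketch, not RMM code: `toy_stream`, `toy_deallocate`, and `run_with_temporary` are hypothetical names invented for illustration. The "stream" simply queues work and runs it on `synchronize()`, which stands in for kernels that are still in flight when the temporary buffer goes out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA stream (an assumption, not an RMM type):
// work is queued and only runs when synchronize() is called, mimicking
// kernels that have been launched but have not finished yet.
class toy_stream {
 public:
  void enqueue(std::function<void()> work) { pending_.push_back(std::move(work)); }

  // Runs all queued work, analogous to waiting for in-flight kernels.
  void synchronize()
  {
    for (auto& work : pending_) { work(); }
    pending_.clear();
  }

  std::size_t in_flight() const { return pending_.size(); }

 private:
  std::vector<std::function<void()>> pending_;
};

// Hypothetical deallocation routine that follows the ordering in this commit:
// drain the stream first, then release the memory immediately.
inline void toy_deallocate(void* ptr, toy_stream& stream)
{
  stream.synchronize();
  ::operator delete(ptr);
}

// Mirrors foo() from the commit message: allocate a temporary, queue work
// that writes to it, then free it. Returns the value the queued work
// observed, so a caller can check that the work ran before the free.
inline int run_with_temporary(toy_stream& stream)
{
  int observed = 0;
  int* tmp = new (::operator new(sizeof(int))) int{41};
  stream.enqueue([tmp, &observed] {
    *tmp += 1;  // the "kernel" touches the temporary buffer
    observed = *tmp;
  });
  toy_deallocate(tmp, stream);  // queued work runs here, before the delete
  return observed;
}
```

In this sketch, deleting `tmp` before calling `synchronize()` would leave the queued lambda holding a dangling pointer, which is the host-side analogue of the use-after-free the commit fixes. Synchronizing first is simple but blocks the caller, which is the trade-off the commit message notes.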
1 parent 8adedd0 commit 0a8ae04

1 file changed: +9 -3 lines changed

include/rmm/mr/device/system_memory_resource.hpp

@@ -107,17 +107,23 @@ class system_memory_resource final : public device_memory_resource {
   /**
    * @brief Deallocate memory pointed to by \p p.
    *
-   * The stream argument is ignored.
+   * This function synchronizes the stream before deallocating the memory.
    *
    * @param ptr Pointer to be deallocated
    * @param bytes The size in bytes of the allocation. This must be equal to the value of `bytes`
    * that was passed to the `allocate` call that returned `ptr`.
-   * @param stream This argument is ignored
+   * @param stream The stream in which to order this deallocation
    */
   void do_deallocate(void* ptr,
                      [[maybe_unused]] std::size_t bytes,
-                     [[maybe_unused]] cuda_stream_view stream) override
+                     cuda_stream_view stream) override
   {
+    // With `cudaFree`, the CUDA runtime keeps track of dependent operations and does implicit
+    // synchronization. However, with SAM, since `free` is immediate, we need to wait for in-flight
+    // CUDA operations to finish before freeing the memory, to avoid potential use-after-free errors
+    // or race conditions.
+    stream.synchronize();
+
     rmm::detail::aligned_host_deallocate(
       ptr, bytes, CUDA_ALLOCATION_ALIGNMENT, [](void* ptr) { ::operator delete(ptr); });
   }
