Stream synchronize before deallocating SAM (#1655)

rongou · web-flow · commit 0a8ae04206be · 2024-08-26T20:10:28.000Z
While investigating cuml benchmarks, I found an issue with the current `system_memory_resource` that causes segfault. Roughly it's in code like this: ```cuda void foo(...) { rmm::device_uvector<T> tmp(bufferSize, stream); // launch cuda kernels making use of tmp } ``` When the function returns, the `device_uvector` would go out of scope and get deleted, while the cuda kernel might still be in flight. With `cudaFree`, the CUDA runtime would perform implicit synchronization to make sure the kernel finishes before actually freeing the memory, but with SAM we don't have that guarantee, thus causing use-after-free errors. This is a rather simple fix. In the future we may want to use CUDA events to make this less blocking. Authors: - Rong Ou (https://github.com/rongou) Approvers: - Mark Harris (https://github.com/harrism) - Lawrence Mitchell (https://github.com/wence-) - Vyas Ramasubramani (https://github.com/vyasr) URL: #1655
diff --git a/include/rmm/mr/device/system_memory_resource.hpp b/include/rmm/mr/device/system_memory_resource.hpp
@@ -107,17 +107,23 @@ class system_memory_resource final : public device_memory_resource {
   /**
    * @brief Deallocate memory pointed to by \p p.
    *
-   * The stream argument is ignored.
+   * This function synchronizes the stream before deallocating the memory.
    *
    * @param ptr Pointer to be deallocated
    * @param bytes The size in bytes of the allocation. This must be equal to the value of `bytes`
    *              that was passed to the `allocate` call that returned `ptr`.
-   * @param stream This argument is ignored
+   * @param stream The stream in which to order this deallocation
    */
   void do_deallocate(void* ptr,
                      [[maybe_unused]] std::size_t bytes,
-                     [[maybe_unused]] cuda_stream_view stream) override
+                     cuda_stream_view stream) override
   {
+    // With `cudaFree`, the CUDA runtime keeps track of dependent operations and does implicit
+    // synchronization. However, with SAM, since `free` is immediate, we need to wait for in-flight
+    // CUDA operations to finish before freeing the memory, to avoid potential use-after-free errors
+    // or race conditions.
+    stream.synchronize();
+
     rmm::detail::aligned_host_deallocate(
       ptr, bytes, CUDA_ALLOCATION_ALIGNMENT, [](void* ptr) { ::operator delete(ptr); });
   }