Conversation

@gmarkall (Contributor)

This is an incomplete follow-on from #295. Putting this PR up so others can see the general idea of it.

@copy-pr-bot (bot) commented Jun 13, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@gmarkall (Contributor, Author)

/ok to test

@gmarkall added the 2 - In Progress label on Jun 26, 2025
@gmarkall force-pushed the fix-bindings-streams-events branch from a8939c1 to cb37c69 on June 27, 2025 14:14
@gmarkall (Contributor, Author)

/ok to test

@brandon-b-miller (Contributor) left a comment

This seems like it's almost there. Locally, tests pass for me with the following patch on top of this:

diff --git a/numba_cuda/numba/cuda/cudadrv/driver.py b/numba_cuda/numba/cuda/cudadrv/driver.py
index 0dfe177..6b3d0af 100644
--- a/numba_cuda/numba/cuda/cudadrv/driver.py
+++ b/numba_cuda/numba/cuda/cudadrv/driver.py
@@ -1256,13 +1256,8 @@ class _PendingDeallocs(object):
             while self._cons:
                 [dtor, handle, size] = self._cons.popleft()
                 _logger.info("dealloc: %s %s bytes", dtor.__name__, size)
-                # The EMM plugin interface uses CUdeviceptr instances when the
-                # NVIDIA binding is enabled, but the other interfaces use ctypes
-                # objects, so we have to check what kind of object we have here.
-                # binding_types = (binding.CUdeviceptr, binding.CUmodule)
-                # if USE_NV_BINDING and not isinstance(handle, binding_types):
-                #    handle = handle.value
                 dtor(handle)
+
             self._size = 0
 
     @contextlib.contextmanager
@@ -1789,14 +1784,14 @@ def _pin_finalizer(memory_manager, ptr, alloc_key, mapped):
 
 def _event_finalizer(deallocs, handle):
     def core():
-        deallocs.add_item(driver.cuEventDestroy, handle)
+        deallocs.add_item(driver.cuEventDestroy, handle.value)
 
     return core
 
 
 def _stream_finalizer(deallocs, handle):
     def core():
-        deallocs.add_item(driver.cuStreamDestroy, handle)
+        deallocs.add_item(driver.cuStreamDestroy, handle.value)
 
     return core
 
@@ -2406,7 +2401,7 @@ class Stream(object):
             stream_callback = binding.CUstreamCallback(ptr)
             # The callback needs to receive a pointer to the data PyObject
             data = id(data)
-            handle = int(self.handle)
+            handle = self.handle.value
         else:
             stream_callback = self._stream_callback
             handle = self.handle
diff --git a/numba_cuda/numba/cuda/memory_management/nrt.py b/numba_cuda/numba/cuda/memory_management/nrt.py
index d6e4f53..1dff61d 100644
--- a/numba_cuda/numba/cuda/memory_management/nrt.py
+++ b/numba_cuda/numba/cuda/memory_management/nrt.py
@@ -143,7 +143,7 @@ class _Runtime:
             1,
             1,
             0,
-            stream.handle,
+            stream.handle.value,
             params,
             cooperative=False,
         )
diff --git a/numba_cuda/numba/cuda/tests/cudapy/test_stream_api.py b/numba_cuda/numba/cuda/tests/cudapy/test_stream_api.py
index 8367b46..c79bbfb 100644
--- a/numba_cuda/numba/cuda/tests/cudapy/test_stream_api.py
+++ b/numba_cuda/numba/cuda/tests/cudapy/test_stream_api.py
@@ -38,10 +38,7 @@ class TestStreamAPI(CUDATestCase):
         # We don't test synchronization on the stream because it's not a real
         # stream - we used a dummy pointer for testing the API, so we just
         # ensure that the stream handle matches the external stream pointer.
-        if config.CUDA_USE_NVIDIA_BINDING:
-            value = int(s.handle)
-        else:
-            value = s.handle.value
+        value = s.handle.value
         self.assertEqual(ptr, value)
 
     @skip_unless_cudasim("External streams are usable with hardware")

The APIs that create streams and events now always return ctypes objects, so moving the .value access to the finalizer should work and might be a little cleaner since you don't have to puzzle through each object during clear. Beyond that, there's a small update in Stream.add_callback that assumes the stream handle is now a ctypes object and similar changes in the tests and NRT code.
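As a minimal, self-contained sketch of that idea (toy names, not the numba-cuda code: PendingDeallocs stands in for _PendingDeallocs, and fake_stream_destroy stands in for driver.cuStreamDestroy):

import ctypes
from collections import deque


class PendingDeallocs:
    """Toy stand-in for _PendingDeallocs: queues (dtor, handle) pairs."""

    def __init__(self):
        self._cons = deque()

    def add_item(self, dtor, handle):
        self._cons.append((dtor, handle))

    def clear(self):
        # Every queued handle is already a plain int, so no type checks
        # are needed here.
        while self._cons:
            dtor, handle = self._cons.popleft()
            dtor(handle)


def fake_stream_destroy(handle):
    print(f"destroying stream 0x{handle:x}")


def make_stream_finalizer(deallocs, handle):
    # Take .value once, when the finalizer is created, so core() closes
    # over a plain int rather than a ctypes object.
    raw = handle.value

    def core():
        deallocs.add_item(fake_stream_destroy, raw)

    return core


deallocs = PendingDeallocs()
finalizer = make_stream_finalizer(deallocs, ctypes.c_void_p(0xDEADBEEF))
finalizer()
deallocs.clear()  # prints: destroying stream 0xdeadbeef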

If you think this is a reasonable way forward I can push this and debug from there.

@gmarkall force-pushed the fix-bindings-streams-events branch from 5b30ebf to e61b61d on July 3, 2025 12:06
@gmarkall (Contributor, Author) commented Jul 3, 2025

/ok to test

@gmarkall (Contributor, Author) commented Jul 3, 2025

@brandon-b-miller Thanks for the fixes! I think the current state / notes are:

  • I think your solution looks good and works for the Numba-CUDA internal cases.
  • I can't yet tell whether these changes will cause an issue for EMM plugins, so I'm going to try RMM with both bindings locally to investigate.
  • There are still some places where the test suite splits between the NVIDIA binding and the ctypes one (see the sketch after this comment):
    • The IPC tests (test_ipc.py). This is a legacy API and I'd be surprised if it's in use anywhere, so I think these tests can just be left as they are.
    • Some CUDA driver tests:
      • test_cuda_memory.py: This is because we haven't touched the EMM plugin interface, so that we don't break RMM.
      • test_context_stack.py: These either directly call driver APIs that need separate handling (test_attached_primary) or test internal parts of the API where we can't (and don't need to) maintain consistency between the two bindings (test_attached_non_primary).
      • test_managed_alloc.py: We're calling driver functions directly, so we need to handle the differences.
      • test_cuda_driver.py: These either use the driver's launch_kernel() function, an internal API that handles the bindings differently (test_cuda_driver_basic), or call the driver functions directly (test_cuda_driver_external_stream).

Assuming this is also OK with RMM, I think this completes the work we need to (and will) do in the near term to make the APIs between the two bindings consistent.
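For reference, a minimal sketch of the kind of per-binding handle normalisation those driver tests still need (hypothetical code, not from this PR: USE_NV_BINDING stands in for config.CUDA_USE_NVIDIA_BINDING, and it assumes the NVIDIA binding's handle wrappers are int-convertible, as the test diff above relies on):

import ctypes

USE_NV_BINDING = False  # stand-in for config.CUDA_USE_NVIDIA_BINDING


def raw_handle(handle):
    # The NVIDIA binding wraps handles in int-convertible objects
    # (e.g. binding.CUstream), while the ctypes binding carries the
    # pointer in .value; code calling driver APIs directly has to
    # normalise both forms to a plain int.
    if USE_NV_BINDING:
        return int(handle)
    return handle.value


print(hex(raw_handle(ctypes.c_void_p(0x1234))))  # prints: 0x1234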

@gmarkall gmarkall added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 7, 2025
@gmarkall gmarkall marked this pull request as ready for review July 7, 2025 12:01
@copy-pr-bot (bot) commented Jul 7, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@gmarkall (Contributor, Author) commented Jul 7, 2025

@brandon-b-miller This is OK with RMM, given the fix from @wence- in #317 (comment) (which addresses what I think is a deadlock caused by a combination of the behaviour of RMM and cuda-python).

The IPC tests fail when using RMM and not using cuda-python, but I don't think that is an issue resulting from this PR.

@gmarkall (Contributor, Author) commented Jul 7, 2025

/ok to test

@gmarkall changed the title from "[WIP] Fix bindings: consistency of contexts, streams, and events, similar to #295" to "Fix bindings: consistency of contexts, streams, and events, similar to #295" on Jul 7, 2025
@gmarkall added the 5 - Ready to merge label and removed the 3 - Ready for Review label on Jul 7, 2025
@gmarkall merged commit 20a2e3b into NVIDIA:main on Jul 7, 2025 (39 checks passed)
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request on Jul 18, 2025:
- Add deadlock warnings to Stream.add_callback and Stream.async_done docstrings (NVIDIA#321)
- `MemoryPointer`: ensure `CUdeviceptr` used with NVIDIA binding (NVIDIA#328)
- Fix indexing GPUs with CUdevice object (NVIDIA#319)
- Fix bindings: consistency of contexts, streams, and events, similar to NVIDIA#295 (NVIDIA#296)
- Fix nvrtc resolution when CUDA_HOME env is set (NVIDIA#314)
@gmarkall mentioned this pull request on Jul 18, 2025
gmarkall added a commit that referenced this pull request on Jul 18, 2025 (same changelog as above)
atmnp pushed a commit to atmnp/numba-cuda that referenced this pull request on Jul 21, 2025 (same changelog as above)