Capture global device arrays in kernels and device functions #666
@@ -0,0 +1,91 @@
..
   SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
   SPDX-License-Identifier: BSD-2-Clause
.. _cuda-globals:

=====================================
Global Variables and Captured Values
=====================================

Numba CUDA kernels and device functions can reference global variables defined
at module scope. This section describes how these values are captured and the
implications for your code.
Capture as constants
====================

By default, global variables referenced in kernels are captured as constants at
compilation time. This applies to scalars and host arrays (e.g. NumPy arrays).

The following example demonstrates this behavior. Both ``TAX_RATE`` and
``PRICES`` are captured when the kernel is first compiled. Because they are
embedded as constants, **modifications to these variables after compilation
have no effect**—the second kernel call still uses the original values:
.. literalinclude:: ../../../numba_cuda/numba/cuda/tests/doc_examples/test_globals.py
   :language: python
   :caption: Demonstrating constant capture of global variables
   :start-after: magictoken.ex_globals_constant_capture.begin
   :end-before: magictoken.ex_globals_constant_capture.end
   :dedent: 8
   :linenos:

Running the above code prints:

.. code-block:: text

   Value of d_totals: [ 10.8 54. 16.2 64.8 162. ]
   Value of d_totals: [ 10.8 54. 16.2 64.8 162. ]

Note that both outputs are identical—the modifications to ``TAX_RATE`` and
``PRICES`` after the first kernel call have no effect.
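A minimal, self-contained sketch of the kind of code this example exercises is
shown below. The kernel name ``compute_totals`` and the global values are
assumptions inferred from the output above, not the actual contents of
``test_globals.py``:

.. code-block:: python

   import numpy as np
   from numba import cuda

   # Globals captured as constants when the kernel is first compiled
   # (values are assumptions chosen to reproduce the output shown above).
   TAX_RATE = 0.08
   PRICES = np.array([10.0, 50.0, 15.0, 60.0, 150.0])

   @cuda.jit
   def compute_totals(totals):
       i = cuda.grid(1)
       if i < totals.size:
           # PRICES and TAX_RATE are baked into the compiled kernel here
           totals[i] = PRICES[i] * (1.0 + TAX_RATE)

   d_totals = cuda.device_array(len(PRICES))
   compute_totals[1, 32](d_totals)
   print("Value of d_totals:", d_totals.copy_to_host())

   # Rebinding or mutating the globals after compilation has no effect;
   # the second call still uses the captured values.
   TAX_RATE = 0.5
   PRICES[:] = PRICES * 2
   compute_totals[1, 32](d_totals)
   print("Value of d_totals:", d_totals.copy_to_host())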
This behavior is useful for small amounts of truly constant data like
configuration values, lookup tables, or mathematical constants. For larger
arrays, consider using device arrays instead.
Device array capture
====================

Device arrays are an exception to the constant capture rule. When a kernel
references a global device array—any object implementing
``__cuda_array_interface__``, such as CuPy arrays or Numba device arrays—the
device pointer is captured rather than the data. No copy occurs, and
modifications to the array **are** visible to subsequent kernel calls.

The following example demonstrates this behavior. The global ``PRICES`` device
array is mutated after the first kernel call, and the second kernel call sees
the updated values:
.. literalinclude:: ../../../numba_cuda/numba/cuda/tests/doc_examples/test_globals.py
   :language: python
   :caption: Demonstrating device array capture by pointer
   :start-after: magictoken.ex_globals_device_array_capture.begin
   :end-before: magictoken.ex_globals_device_array_capture.end
   :dedent: 8
   :linenos:

Running the above code prints:

.. code-block:: text

   [10. 25. 5. 15. 30.]
   [20. 50. 10. 30. 60.]

Note that the outputs are different—the mutation to ``PRICES`` after the first
kernel call *is* visible to the second call, unlike with host arrays.
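Again, a minimal sketch of what such an example might look like is shown below.
The kernel name ``copy_prices``, the use of CuPy, and the array values are
assumptions made for illustration; the actual example lives in
``test_globals.py``:

.. code-block:: python

   import cupy as cp
   from numba import cuda

   # A global device array: only its device pointer is captured when the
   # kernel is compiled (values chosen to match the output shown above).
   PRICES = cp.asarray([10.0, 25.0, 5.0, 15.0, 30.0])

   @cuda.jit
   def copy_prices(out):
       i = cuda.grid(1)
       if i < out.size:
           # Reads go through the captured device pointer, so they always
           # see the current contents of PRICES.
           out[i] = PRICES[i]

   out = cuda.device_array(len(PRICES))
   copy_prices[1, 32](out)
   print(out.copy_to_host())

   # Mutating the device array in place is visible to the next call,
   # with no recompilation.
   PRICES *= 2
   copy_prices[1, 32](out)
   print(out.copy_to_host())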
This makes device arrays suitable for global state that needs to be updated
between kernel calls without recompilation.

.. note::

   Kernels and device functions that capture global device arrays cannot use
   ``cache=True``. Because the device pointer is embedded in the compiled code,
   caching would serialize an invalid pointer. Attempting to cache such a kernel
   will raise a ``PicklingError``. See :doc:`caching` for more information on
   kernel caching.
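When caching is required, the error message added in this PR suggests passing
the array as a kernel argument instead of capturing it as a global. A hedged
sketch of that workaround follows (the names ``lookup`` and ``TABLE`` are
illustrative, not taken from the documentation above):

.. code-block:: python

   import numpy as np
   from numba import cuda

   TABLE = cuda.to_device(np.arange(16, dtype=np.float64))

   @cuda.jit(cache=True)  # caching works: no device pointer is captured
   def lookup(table, out):
       i = cuda.grid(1)
       if i < out.size:
           out[i] = table[i]

   out = cuda.device_array(16)
   lookup[1, 32](TABLE, out)  # pass the device array explicitly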
@@ -197,6 +197,16 @@ def reducer_override(self, obj):
         # Overridden to disable pickling of certain types
         if type(obj) in self.disabled_types:
             _no_pickle(obj)  # noreturn
+
+        # Prevent pickling of objects implementing __cuda_array_interface__
Contributor:

I think this has no effect because it never gets to see an object with the CUDA
Array Interface when pickling. To prevent caching, something like the following
could be done instead:

diff --git a/numba_cuda/numba/cuda/codegen.py b/numba_cuda/numba/cuda/codegen.py
index 957dd72e..9ee91e29 100644
--- a/numba_cuda/numba/cuda/codegen.py
+++ b/numba_cuda/numba/cuda/codegen.py
@@ -463,6 +463,10 @@ class CUDACodeLibrary(serialize.ReduceMixin, CodeLibrary):
         if not self._finalized:
             raise RuntimeError("Cannot pickle unfinalized CUDACodeLibrary")
+
+        if self.referenced_objects:
+            raise RuntimeError("Cannot pickle...")
+
         return dict(
             codegen=None,
             name=self.name,
Contributor (Author):

I see -- the issue is that for closure variables, we are able to raise this

I followed your suggestion to correctly handle global device arrays in
+        # These contain device pointers that would become stale after unpickling
+        if getattr(obj, "__cuda_array_interface__", None) is not None:
+            raise pickle.PicklingError(
+                "Cannot serialize kernels or device functions referencing "
+                "global device arrays. Pass the array(s) as arguments "
+                "to the kernel instead."
+            )
+
         return super().reducer_override(obj)
Reviewer:

The examples in this file look like they should be copy-pastable to execute,
but they don't run because of missing declarations. Should they be expected to
work? (If they are, it may be good to convert them to doctests, like e.g.
https://github.com/NVIDIA/numba-cuda/blob/main/numba_cuda/numba/cuda/tests/doc_examples/test_random.py /
https://github.com/NVIDIA/numba-cuda/blob/main/docs/source/user/examples.rst?plain=1)

Author:

They were meant to be representative/illustrative. If you think it's useful,
I'll replace them with real doctests; thanks!