I've broken this PR into a few parts that are ready to merge: After #2137 merges, I will start breaking up the new user guide documents into their own PRs. |
This PR improves a few small issues in the Python documentation. Split off from #2087. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #2139
This is split off of #2087. I am overhauling the RMM documentation. This is the first set of changes, which includes a new theme and a reorganization of the C++ docs. All docs now use Markdown / Myst. The next phases will include docstring tweaks to fix various formatting/cross-linking issues (see #2138 and #2139 for current progress on this), an expansion of the Python API docs, and adding user guides for various features. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Matthew Murray (https://github.com/Matt711) - Rong Ou (https://github.com/rongou) - Jake Awe (https://github.com/AyodeAwe) URL: #2137
This PR improves a few small issues in the C++ documentation. Split off from #2087. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #2138
Update all C++ code examples to:
- use `set_current_device_resource_ref()` instead of `set_current_device_resource(&ptr)`
- pass resource refs by value to adaptor constructors
- use `get_bytes_counter`/`get_allocations_counter` instead of the fictional `get_statistics()`
- add compute-sanitizer UM flags
- fix the `managed_memory` multi-GPU example
- improve the `choosing_memory_resources` managed memory example with `PrefetchResourceAdaptor`
- fix incorrect `upstream=` keyword args in `pool_allocators.md`
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: This refactoring is substantial and pervasive, affecting 500+ files across core headers, implementations, tests, benchmarks, Python bindings, and documentation. While the pattern is consistent (`device_memory_resource` → `shared_resource` delegation), the scope is massive and verifying correctness requires checking: (1) proper impl delegation in each resource type, (2) consistent allocation/deallocation signatures and alignment handling across all resources/adaptors, (3) property and equality operator correctness, (4) test suite adequacy and mock updates, (5) Python binding type correctness, and (6) documentation accuracy. The changes are heterogeneous enough that each resource/adaptor requires separate reasoning despite following a common pattern.
…ue-type resources
Lead with performance rationale, link Unified Memory and oversubscription to CUDA Programming Guide, trim CudaMemoryResource to minimal description, remove redundant Performance Considerations section, add logging adaptor to compositions and best practices.
…uide and API reference
I like this information but I want to have an agent verify that all the code will compile and run.
I haven't used it, but myst does support a {code-cell} directive: https://mystmd.org/guide/notebooks-with-markdown#code-cell
But I don't know how that would work for C++ (I know there are jupyter kernels for C++), or whether we want to run these every time we build the docs (probably not).
Some of this overlaps with the recently-improved CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html
Reviewers: What do you think about this? Keep? Reduce? Delete? Please leave signpost comments on the parts that you think are valuable to mention in the user guide.
I'm inclined to delete this page. There are a few tiny pieces of this that might be valuable, but they should probably be copied into other pages.
Reviewers: What do you think? Please leave signpost comments on the parts that you think are valuable to mention in the user guide.
As above, some of this could be deleted if we point to the CUDA Programming Guide on Asynchronous Execution. https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html
Reviewers: What do you think? Please leave signpost comments on the parts that you think are valuable to mention in the user guide.
wence- left a comment:

This needs a huge amount of work
`````{tabs}
````{code-tab} c++
#include <rmm/mr/cuda_async_memory_resource.hpp>
#include <rmm/mr/per_device_resource.hpp>

rmm::mr::cuda_async_memory_resource mr;
rmm::mr::set_current_device_resource_ref(mr);
````
````{code-tab} python
import rmm

mr = rmm.mr.CudaAsyncMemoryResource()
rmm.mr.set_current_device_resource(mr)
````
`````
Can we avoid pushing the set_current_device_resource model in our examples? As we're seeing in all the libraries we have, it is best to manage resources explicitly (not for lifetime reasons, now we have any_resource ownership).
```cpp
// Use 80% of GPU memory, rounded down to nearest 256 bytes
auto [free_memory, total_memory] = rmm::available_device_memory();
std::size_t pool_size = (static_cast<std::size_t>(total_memory * 0.8) / 256) * 256;
```
rmm::align_down is a public function.
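To make the suggestion concrete, the manual divide-and-multiply rounding in the snippet is exactly an align-down operation. The Python sketch below (illustrative only, not RMM code) mirrors the semantics, assuming a power-of-two alignment such as the 256-byte CUDA allocation alignment:

```python
def align_down(value: int, alignment: int) -> int:
    """Round value down to the nearest multiple of alignment.

    Mirrors the semantics of rmm::align_down for illustration;
    alignment is assumed to be a power of two, as CUDA alignments are.
    """
    assert alignment > 0 and (alignment & (alignment - 1)) == 0
    return value & ~(alignment - 1)

total_memory = 16 * 1024**3  # hypothetical 16 GiB device
pool_size = align_down(int(total_memory * 0.8), 256)
assert pool_size % 256 == 0
```

Using the library helper instead of hand-written arithmetic keeps the intent visible and avoids off-by-one mistakes with non-trivial alignments.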
```cpp
rmm::mr::managed_memory_resource managed_mr;
rmm::mr::pool_memory_resource pool_mr{managed_mr, pool_size};
```
Question: Should we not be recommending cuda_async_managed_memory_resource (at least on CUDA 13)?
That feature is currently experimental. I am seeing mixed results on performance, and addressing those with the CUDA memory team. Long term this should be the preferred direction.
```cpp
#include <rmm/mr/cuda_async_memory_resource.hpp>
#include <rmm/mr/per_device_resource.hpp>

rmm::mr::cuda_async_memory_resource mr;
```
question: Should we be recommending the "default" async pool, rather than this one that makes its own mempool?
No. We need the custom mempool to enable Blackwell decompression engine support and a custom release threshold. We don’t want to alter the flags on the default mempool.
OK, we should explain this, because this is a trade-off.
### CudaAsyncMemoryResource

The `CudaAsyncMemoryResource` uses CUDA's driver-managed memory pool (via `cudaMallocAsync`). This is the **recommended default** for most applications.
This is a correct, but misleading, statement.
It use a driver-managed pool. But, crucially, does not use the default mempool for the device.
Yes, we can clarify that.
```python
# Synchronize to ensure allocation completes
stream.synchronize()

# Now safe to do CPU operations with buffer.ptr
```
This is misleading. The ptr is always available on the CPU immediately, even when stream ordered.
```python
# Create a pool that maintains stream-ordered semantics
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaAsyncMemoryResource(),  # stream-ordered upstream
```
Upstream doesn't have to be stream ordered. And again, I think we shouldn't be advocating Pool around CudaAsync
```python
with stream:
    kernel[100, 10](cuda.as_cuda_array(buffer).view('float32'), 1000)
```
This doesn't launch the kernel on stream, but rather the default stream.
```python
# BAD: May access uninitialized memory
# some_function(buffer.ptr)

# GOOD: Synchronize first
stream.synchronize()
some_function(buffer.ptr)
```
This is misleading, for the same reason above. The ptr is always valid on the CPU. So "accessing" from the CPU is a meaningless statement.
```python
stream = rmm.cuda_stream()

def allocate_and_use():
    buffer = rmm.DeviceBuffer(size=1000, stream=stream)
    # Launch kernel using buffer
    kernel[...](buffer.ptr)
    # BAD: buffer is deallocated when function returns
    # but kernel may still be running!

allocate_and_use()
stream.synchronize()  # May crash - buffer already freed
```

Fix: Keep buffer alive until synchronization:

```python
stream = rmm.cuda_stream()
buffer = allocate_and_use()  # Return the buffer
stream.synchronize()  # Now safe
buffer = None  # Explicit cleanup after sync
```
This is wrong. If kernel is launched on stream then there is no problem.
The choice of resource determines the underlying type of memory and thus its accessibility from host or device.
For example, the `cuda_async_memory_resource` uses a pool of memory managed by the CUDA driver.
This resource is recommended for most applications, because of its performance and support for asynchronous (stream-ordered) allocations. See [Stream-Ordered Allocation](stream_ordered_allocation.md) for details.
As another example, the `managed_memory_resource` provides unified memory for CPU+GPU, and is recommended for applications exceeding the available GPU memory.

See [Choosing a Memory Resource](choosing_memory_resources.md) for guidance on the available memory resources, performance considerations, and how they fit into efficient CUDA application design strategies.
[NVIDIA Nsight™ Systems](https://developer.nvidia.com/nsight-systems) can be used to profile memory resource performance.
Agreed, but I do appreciate the link to "Choosing a memory resource". "Which one should I pick" is a natural first question to ask.
Having just that, after defining a memory resource, would be sufficient.
Resource adaptors wrap and add functionality to existing resources.
For example, the `statistics_resource_adaptor` can be used to track allocation statistics.
The `logging_resource_adaptor` logs allocations to a CSV file.
Adaptors are composable - wrap multiple adaptors for combined functionality.
Use "Resource adaptors" here, instead of just "Adaptors"? Or are we using those interchangeably?
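Whichever term the guide settles on, the composability claim could be illustrated with a short example. The sketch below is a toy (plain Python, not the RMM API; all class and method names are hypothetical) of the wrap-an-upstream pattern that both the statistics and logging adaptors follow:

```python
class ToyResource:
    """Stand-in upstream resource; hands out fake 'pointers'."""
    def __init__(self):
        self._next = 1
    def allocate(self, nbytes):
        ptr, self._next = self._next, self._next + 1
        return ptr
    def deallocate(self, ptr, nbytes):
        pass

class StatisticsAdaptor:
    """Counts bytes and allocations, then forwards to the wrapped resource."""
    def __init__(self, upstream):
        self.upstream = upstream
        self.current_bytes = 0
        self.allocation_count = 0
    def allocate(self, nbytes):
        self.current_bytes += nbytes
        self.allocation_count += 1
        return self.upstream.allocate(nbytes)
    def deallocate(self, ptr, nbytes):
        self.current_bytes -= nbytes
        self.upstream.deallocate(ptr, nbytes)

class LoggingAdaptor:
    """Records a CSV-like line per event, then forwards."""
    def __init__(self, upstream):
        self.upstream = upstream
        self.log = []
    def allocate(self, nbytes):
        ptr = self.upstream.allocate(nbytes)
        self.log.append(f"allocate,{ptr},{nbytes}")
        return ptr
    def deallocate(self, ptr, nbytes):
        self.log.append(f"free,{ptr},{nbytes}")
        self.upstream.deallocate(ptr, nbytes)

# Compose: logging wraps statistics, which wraps the base resource.
stats = StatisticsAdaptor(ToyResource())
mr = LoggingAdaptor(stats)
p = mr.allocate(128)
mr.deallocate(p, 128)
```

Because every adaptor exposes the same allocate/deallocate interface as the resource it wraps, any stack of adaptors still looks like a single resource to the caller.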
### 3. Containers

RMM provides [RAII](https://en.cppreference.com/w/cpp/language/raii.html) container classes that manage memory lifetime.
Note that this is C++ specific? Or generalize things a bit to apply to python or C++.
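One way to generalize it, as suggested: the RAII idea maps onto Python's context managers, where release happens deterministically at scope exit. A hypothetical sketch (not RMM's actual classes):

```python
class ScopedBuffer:
    """Toy RAII-style container: acquisition on construction,
    guaranteed release in __exit__, mirroring a C++ destructor."""
    released = []  # class-level record of freed sizes, for illustration only

    def __init__(self, nbytes):
        self.nbytes = nbytes  # pretend the allocation happens here

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        ScopedBuffer.released.append(self.nbytes)  # pretend deallocation
        return False  # do not swallow exceptions

with ScopedBuffer(1024) as buf:
    pass  # use buf.nbytes bytes of memory here
# leaving the with-block "frees" the memory deterministically
```

The same lifetime guarantee that a C++ destructor gives at end of scope is provided here by `__exit__`, even if the body raises.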
Memory resources aim to serve the needs of a wide range of applications, from data science and machine learning to high-performance simulation.

RMM's memory resources leverage CUDA features like **stream-ordered** (asynchronous) pipeline parallelism, **managed** memory (also known as unified virtual memory, UVM), and **pinned** memory, making it easier to write complex workflows that optimally use both device and host memory.
The integrations provided in RMM allow memory resources to benefit memory management across libraries frequently used together, such as **PyTorch** and **RAPIDS**.
"allow memory resources to benefit memory management" is a bit awkward.
Maybe "RMM provides integrations with other GPU libraries, enabling uniform memory handling for your entire application." or something like that.
And maybe link to the "Integration with GPU libraries" section below.
### Python: Using Memory Event Logging

Enable logging by wrapping your memory resource with `LoggingResourceAdaptor`:
General comment, we should be able to link to Python API docs with something like
{ref}`rmm.mr.LoggingResourceAdaptor`
(assuming we're building this with myst, which I think we are).
I'm not sure about the c++ side.
These page faults can significantly impact performance, especially for:
- First-touch access patterns
- Random memory access
- Large datasets that don't fit in GPU memory
I don't understand this third bullet, since I'd assume that "larger than GPU memory" is a precondition for the page fault?
My best guess is that this suggests something about repeated page faults as subsets of a large dataset are paged in and out of GPU memory?
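That guess can be made concrete with a toy model (plain Python, made-up numbers, not how the UVM driver actually pages): when the working set exceeds resident capacity, an LRU-style pager evicts and re-faults pages on every sequential pass, so faults scale with every traversal instead of occurring only on first touch:

```python
from collections import OrderedDict

def count_faults(num_pages, resident_capacity, passes):
    """Count page faults for sequential passes over num_pages,
    with an LRU set holding at most resident_capacity pages."""
    resident = OrderedDict()
    faults = 0
    for _ in range(passes):
        for page in range(num_pages):
            if page in resident:
                resident.move_to_end(page)  # mark as most recently used
            else:
                faults += 1
                resident[page] = True
                if len(resident) > resident_capacity:
                    resident.popitem(last=False)  # evict least recently used
    return faults

# Fits in memory: faults only on the first pass.
fits = count_faults(num_pages=100, resident_capacity=100, passes=3)
# Oversubscribed: sequential access + LRU eviction thrashes,
# so every access on every pass faults.
thrash = count_faults(num_pages=100, resident_capacity=50, passes=3)
```

In the fitting case only the first traversal faults; in the oversubscribed case each pass re-faults the whole dataset, which is presumably the repeated-fault behavior the third bullet is gesturing at.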
Description
Adds a comprehensive User Guide for RMM and expands the existing programming guide.
Contributes to #1562, #1694, #2035.
New pages:
- `rmm.statistics`
- `profiler`

Expanded:

All C++ examples use the 26.06 API: `set_current_device_resource_ref()`, pass-by-value adaptor constructors, `get_bytes_counter()`/`get_allocations_counter()`, required stream arguments for `device_buffer`, and copyable value-type resources.

Checklist