
Add RMM User Guide #2087

Open

bdice wants to merge 14 commits into rapidsai:staging from bdice:docs-overhaul

Conversation

@bdice
Collaborator

@bdice bdice commented Oct 11, 2025

Description

Adds a comprehensive User Guide for RMM and expands the existing programming guide.

Contributes to #1562, #1694, #2035.

New pages:

  • Introduction — Overview of RMM's purpose and key abstractions
  • Installation — Build and install instructions for C++ and Python
  • Choosing a Memory Resource — Decision guide for selecting the right MR
  • Pool Allocators — PoolMemoryResource, ArenaMemoryResource, BinningMemoryResource configuration and best practices
  • Stream-Ordered Allocation — Async allocation patterns and stream safety
  • Managed Memory — Unified memory usage with prefetching strategies
  • Logging and Profiling — Allocation logging, statistics tracking, and the rmm.statistics profiler

Expanded:

  • Programming Guide — Memory resources, containers, adaptors, library integrations (CuPy, Numba, PyTorch, Thrust), multi-device usage

All C++ examples use the 26.06 API: set_current_device_resource_ref(), pass-by-value adaptor constructors, get_bytes_counter()/get_allocations_counter(), required stream arguments for device_buffer, and copyable value-type resources.
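The counter accessors mentioned above follow the familiar wrapping-adaptor pattern. A toy Python sketch of that pattern (names hypothetical, not the RMM API):

```python
# Toy sketch (not the RMM API) of the counting-adaptor pattern behind
# accessors like get_bytes_counter()/get_allocations_counter(): wrap an
# upstream allocator and record totals on each allocation.
class CountingAdaptor:
    def __init__(self, upstream):
        self._upstream = upstream   # callable: nbytes -> buffer
        self.bytes = 0              # total bytes requested
        self.allocations = 0        # total allocation calls

    def allocate(self, nbytes):
        self.allocations += 1
        self.bytes += nbytes
        return self._upstream(nbytes)

mr = CountingAdaptor(bytearray)
buf = mr.allocate(1024)
print(mr.allocations, mr.bytes)  # 1 1024
```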

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot bot commented Oct 11, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@bdice
Collaborator Author

bdice commented Nov 11, 2025

I've broken this PR into a few parts that are ready to merge:

After #2137 merges, I will start breaking up the new user guide documents into their own PRs.

rapids-bot bot pushed a commit that referenced this pull request Nov 11, 2025
This PR improves a few small issues in the Python documentation.

Split off from #2087.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Matthew Murray (https://github.com/Matt711)

URL: #2139
@bdice bdice removed the status in libcudf Nov 11, 2025
rapids-bot bot pushed a commit that referenced this pull request Nov 12, 2025
This is split off of #2087.

I am overhauling the RMM documentation. This is the first set of changes, which includes a new theme and a reorganization of the C++ docs. All docs now use Markdown/MyST.

The next phases will include docstring tweaks to fix various formatting/cross-linking issues (see #2138 and #2139 for current progress on this), an expansion of the Python API docs, and adding user guides for various features.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Rong Ou (https://github.com/rongou)
  - Jake Awe (https://github.com/AyodeAwe)

URL: #2137
rapids-bot bot pushed a commit that referenced this pull request Nov 12, 2025
This PR improves a few small issues in the C++ documentation.

Split off from #2087.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)

URL: #2138
@bdice bdice self-assigned this Jan 9, 2026
bdice added 2 commits April 2, 2026 01:44
Update all C++ code examples to use set_current_device_resource_ref()
instead of set_current_device_resource(&ptr), pass resource refs by
value to adaptor constructors, use get_bytes_counter/get_allocations_counter
instead of fictional get_statistics(), add compute-sanitizer UM flags,
fix managed_memory multi-GPU example, improve choosing_memory_resources
managed memory example with PrefetchResourceAdaptor, and fix incorrect
upstream= keyword args in pool_allocators.md.
@bdice bdice changed the base branch from main to staging April 2, 2026 04:56
@coderabbitai

coderabbitai bot commented Apr 2, 2026

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added comprehensive user guide with programming guide, installation instructions, memory resource selection guidance, and configuration examples.
    • Added documentation for stream-ordered allocation, managed memory, pool allocators, and logging capabilities.
  • New Features

    • Improved memory resource architecture with enhanced composability and shared ownership semantics.
    • Updated resource APIs for better consistency across synchronous and asynchronous allocation patterns.

Walkthrough

This PR migrates RMM's memory resource architecture from a custom device_memory_resource base class to CUDA C++ Core Library (CCCL) memory resource concepts. The refactoring removes the legacy base class, replaces it with a cuda::mr::shared_resource<...impl> pattern across all resources and adaptors, updates allocation APIs to include alignment parameters, splits implementations into detail headers and source files, and updates the per-device resource API to use type-erased async resource references instead of raw pointers. Extensive test and documentation updates accompany the changes.
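The shared-ownership delegation described above can be sketched generically in Python (a toy model, not CCCL/RMM code): a copyable handle delegates to a single shared, non-copyable implementation object.

```python
# Toy sketch (not CCCL/RMM code) of the shared_resource pattern: copies of
# the public resource share one underlying implementation object.
class PoolImpl:
    """Non-copyable implementation holding the allocation state."""
    def __init__(self):
        self.allocated = 0

    def allocate(self, nbytes):
        self.allocated += nbytes
        return bytearray(nbytes)

class SharedResource:
    """Copyable handle delegating to a shared impl."""
    def __init__(self, impl=None):
        self._impl = impl if impl is not None else PoolImpl()

    def copy(self):
        return SharedResource(self._impl)  # copies share one impl

    def allocate(self, nbytes):
        return self._impl.allocate(nbytes)

a = SharedResource()
b = a.copy()
b.allocate(128)
print(a._impl.allocated)  # 128 -- both handles see the same state
```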

Changes

Cohort / File(s) Summary
Core Architecture Removal
cpp/include/rmm/mr/device_memory_resource.hpp, cpp/include/rmm/detail/cccl_adaptors.hpp
Removed legacy device_memory_resource base class and CCCL adaptor wrappers that previously bridged RMM and CCCL resource types.
Memory Resource Base/Core Headers
cpp/include/rmm/mr/cuda_memory_resource.hpp, cpp/include/rmm/mr/managed_memory_resource.hpp, cpp/include/rmm/mr/pinned_host_memory_resource.hpp, cpp/include/rmm/mr/system_memory_resource.hpp
Converted from inheriting device_memory_resource to direct CCCL-style implementations with allocate(cuda::stream_ref, bytes, alignment) / deallocate(...) methods, property hooks, and equality operators.
Memory Resource Adaptor Headers
cpp/include/rmm/mr/aligned_resource_adaptor.hpp, cpp/include/rmm/mr/arena_memory_resource.hpp, cpp/include/rmm/mr/binning_memory_resource.hpp, cpp/include/rmm/mr/callback_memory_resource.hpp, cpp/include/rmm/mr/failure_callback_resource_adaptor.hpp, cpp/include/rmm/mr/limiting_resource_adaptor.hpp, cpp/include/rmm/mr/logging_resource_adaptor.hpp, cpp/include/rmm/mr/pool_memory_resource.hpp, cpp/include/rmm/mr/prefetch_resource_adaptor.hpp, cpp/include/rmm/mr/statistics_resource_adaptor.hpp, cpp/include/rmm/mr/tracking_resource_adaptor.hpp, cpp/include/rmm/mr/thread_safe_resource_adaptor.hpp
Converted from templated device_memory_resource-derived classes to non-templated classes deriving from cuda::mr::shared_resource<detail::..._impl>, enabling shared ownership and implementation delegation.
Async CUDA Memory Resources
cpp/include/rmm/mr/cuda_async_memory_resource.hpp, cpp/include/rmm/mr/cuda_async_view_memory_resource.hpp, cpp/include/rmm/mr/cuda_async_managed_memory_resource.hpp
Refactored to derive from cuda::mr::shared_resource<...impl>, exposing async allocation/deallocation APIs with stream references and alignment, plus synchronous variants.
Fixed-Size & Stream-Ordered Resources
cpp/include/rmm/mr/fixed_size_memory_resource.hpp, cpp/include/rmm/mr/detail/stream_ordered_memory_resource.hpp
Updated to use new shared_resource pattern and cuda::stream_ref-based APIs; removed inheritance from device_memory_resource.
Implementation Detail Headers
cpp/include/rmm/mr/detail/*_impl.hpp (aligned, arena, binning, callback, cuda_async*, failure_callback, fixed_size, limiting, logging, pool, prefetch, sam_headroom, statistics, tracking, thread_safe)
Added new implementation headers defining non-copyable/move-deleted classes handling allocation logic, upstream storage, and property/equality semantics.
Resource Reference & Per-Device APIs
cpp/include/rmm/resource_ref.hpp, cpp/include/rmm/mr/per_device_resource.hpp
Removed CCCL adaptor dependencies and updated type aliases to directly use cuda::mr refs; changed per-device setters to return any_resource instead of raw pointers.
Source Implementation Files
cpp/src/mr/*.cpp, cpp/src/mr/detail/*.cpp
Added constructor/accessor implementations delegating to underlying cuda::mr::make_shared_resource<...impl>(...) and forwarding calls to shared implementations.
Memory Resource Support
cpp/include/rmm/detail/export.hpp
Added RMM_CONSTEXPR_FRIEND macro for Doxygen-compatible property friend declarations.
Stream CUDA Compilation
cpp/CMakeLists.txt
Added src/cuda_stream.cpp and numerous new detail/implementation source files to the rmm library build.
Test Updates
cpp/tests/CMakeLists.txt, cpp/tests/mr/*.hpp, cpp/tests/mr/*.cpp, cpp/tests/mr/*.cu, cpp/tests/mock_resource.hpp, cpp/tests/*_tests.cpp
Refactored test fixtures, mocks, and benchmarks to use new allocation/deallocation APIs with alignment, removed device_memory_resource dependencies, added new CCCL-based test suites, and disabled alignment-related failure tests.
Benchmark Updates
cpp/benchmarks/*/...bench.*
Changed from std::shared_ptr<device_memory_resource> and owning_wrapper patterns to cuda::mr::any_resource<device_accessible> type-erased references.
Python Bindings
python/rmm/rmm/librmm/*.pxd, python/rmm/rmm/pylibrmm/*.pyx, python/rmm/rmm/pylibrmm/memory_resource/*.pyx
Updated Cython declarations to use device_async_resource_ref instead of device_memory_resource*, changed internal storage from shared_ptr to unique_ptr with make_device_async_resource_ref helpers, and updated per-device APIs.
Documentation
docs/user_guide/*.md, docs/conf.py, docs/index.md
Added comprehensive user guides (introduction, installation, programming guide, choosing resources, stream allocation, managed memory, pool allocators, logging), updated documentation index and configuration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: This refactoring is substantial and pervasive, affecting 500+ files across core headers, implementations, tests, benchmarks, Python bindings, and documentation. While the pattern is consistent (device_memory_resource → shared_resource delegation), the scope is massive and verifying correctness requires checking: (1) proper impl delegation in each resource type, (2) consistent allocation/deallocation signatures and alignment handling across all resources/adaptors, (3) property and equality operator correctness, (4) test suite adequacy and mock updates, (5) Python binding type correctness, and (6) documentation accuracy. The changes are heterogeneous enough that each resource/adaptor requires separate reasoning despite following a common pattern.

Possibly related issues

Possibly related PRs

Suggested labels

breaking, improvement

Suggested reviewers

  • gforsyth
  • lamarrr

Collaborator Author

I like this information but I want to have an agent verify that all the code will compile and run.

Contributor

I haven't used it, but myst does support a {code-cell} directive: https://mystmd.org/guide/notebooks-with-markdown#code-cell

But I don't know how that would work for C++ (I know there are jupyter kernels for C++), or whether we want to run these every time we build the docs (probably not).

Collaborator Author

Some of this overlaps with the recently-improved CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html

Reviewers: What do you think about this? Keep? Reduce? Delete? Please leave signpost comments on the parts that you think are valuable to mention in the user guide.

Collaborator Author

I'm inclined to delete this page. There are a few tiny pieces of this that might be valuable, but they should probably be copied into other pages.

Reviewers: What do you think? Please leave signpost comments on the parts that you think are valuable to mention in the user guide.

Collaborator Author

As above, some of this could be deleted if we point to the CUDA Programming Guide on Asynchronous Execution. https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html

Reviewers: What do you think? Please leave signpost comments on the parts that you think are valuable to mention in the user guide.

@bdice bdice added doc Documentation non-breaking Non-breaking change labels Apr 3, 2026
@bdice bdice marked this pull request as ready for review April 3, 2026 21:28
@bdice bdice requested a review from GregoryKimball April 3, 2026 21:28
Contributor

@wence- left a comment

This needs a huge amount of work

Comment on lines +11 to +25
`````{tabs}
````{code-tab} c++
#include <rmm/mr/cuda_async_memory_resource.hpp>
#include <rmm/mr/per_device_resource.hpp>

rmm::mr::cuda_async_memory_resource mr;
rmm::mr::set_current_device_resource_ref(mr);
````
````{code-tab} python
import rmm

mr = rmm.mr.CudaAsyncMemoryResource()
rmm.mr.set_current_device_resource(mr)
````
`````
Contributor

Can we avoid pushing the set_current_device_resource model in our examples? As we're seeing in all the libraries we have, it is best to manage resources explicitly (not for lifetime reasons, now we have any_resource ownership).


// Use 80% of GPU memory, rounded down to nearest 256 bytes
auto [free_memory, total_memory] = rmm::available_device_memory();
std::size_t pool_size = (static_cast<std::size_t>(total_memory * 0.8) / 256) * 256;
Contributor

rmm::align_down is a public function.
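The round-down arithmetic in the excerpt (and in `rmm::align_down`) can be sketched in Python, assuming a power-of-two alignment:

```python
# Sketch of round-down-to-alignment arithmetic, assuming a power-of-two
# alignment: clear the low bits below the alignment boundary.
def align_down(value: int, alignment: int) -> int:
    assert alignment > 0 and alignment & (alignment - 1) == 0
    return value & ~(alignment - 1)

total_memory = 16 * 1024**3  # illustrative: a 16 GiB device
pool_size = align_down(int(total_memory * 0.8), 256)
print(pool_size % 256)  # 0
```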

Comment on lines +41 to +42
rmm::mr::managed_memory_resource managed_mr;
rmm::mr::pool_memory_resource pool_mr{managed_mr, pool_size};
Contributor

Question: Should we not be recommending cuda_async_managed_memory_resource (at least on cuda-13)?

Collaborator Author

That feature is currently experimental. I am seeing mixed results on performance, and addressing those with the CUDA memory team. Long term this should be the preferred direction.

#include <rmm/mr/cuda_async_memory_resource.hpp>
#include <rmm/mr/per_device_resource.hpp>

rmm::mr::cuda_async_memory_resource mr;
Contributor

question: Should we be recommending the "default" async pool, rather than this one that makes its own mempool?

Collaborator Author

No. We need the custom mempool to enable Blackwell decompression engine support and a custom release threshold. We don’t want to alter the flags on the default mempool.

Contributor

OK, we should explain this, because this is a trade-off.


### CudaAsyncMemoryResource

The `CudaAsyncMemoryResource` uses CUDA's driver-managed memory pool (via `cudaMallocAsync`). This is the **recommended default** for most applications.
Contributor

This is a correct, but misleading, statement.

It use a driver-managed pool. But, crucially, does not use the default mempool for the device.

Collaborator Author

Yes, we can clarify that.

# Synchronize to ensure allocation completes
stream.synchronize()

# Now safe to do CPU operations with buffer.ptr
Contributor

This is misleading. The ptr is always available on the CPU immediately, even when stream ordered.


# Create a pool that maintains stream-ordered semantics
pool = rmm.mr.PoolMemoryResource(
rmm.mr.CudaAsyncMemoryResource(), # stream-ordered upstream
Contributor

Upstream doesn't have to be stream ordered. And again, I think we shouldn't be advocating Pool around CudaAsync.

Comment on lines +171 to +172
with stream:
kernel[100, 10](cuda.as_cuda_array(buffer).view('float32'), 1000)
Contributor

This doesn't launch the kernel on stream, but rather the default stream.

Comment on lines +255 to +260
# BAD: May access uninitialized memory
# some_function(buffer.ptr)

# GOOD: Synchronize first
stream.synchronize()
some_function(buffer.ptr)
Contributor

This is misleading, for the same reason above. The ptr is always valid on the CPU. So "accessing" from the CPU is a meaningless statement.

Comment on lines +265 to +286
```python
stream = rmm.cuda_stream()

def allocate_and_use():
buffer = rmm.DeviceBuffer(size=1000, stream=stream)
# Launch kernel using buffer
kernel[...](buffer.ptr)
# BAD: buffer is deallocated when function returns
# but kernel may still be running!

allocate_and_use()
stream.synchronize() # May crash - buffer already freed
```

Fix: Keep buffer alive until synchronization:

```python
stream = rmm.cuda_stream()
buffer = allocate_and_use() # Return the buffer
stream.synchronize() # Now safe
buffer = None # Explicit cleanup after sync
```
Contributor

This is wrong. If kernel is launched on stream then there is no problem.

Comment on lines +31 to +37
The choice of resource determines the underlying type of memory and thus its accessibility from host or device.
For example, the `cuda_async_memory_resource` uses a pool of memory managed by the CUDA driver.
This resource is recommended for most applications, because of its performance and support for asynchronous (stream-ordered) allocations. See [Stream-Ordered Allocation](stream_ordered_allocation.md) for details.
As another example, the `managed_memory_resource` provides unified memory for CPU+GPU, and is recommended for applications exceeding the available GPU memory.

See [Choosing a Memory Resource](choosing_memory_resources.md) for guidance on the available memory resources, performance considerations, and how they fit into efficient CUDA application design strategies.
[NVIDIA Nsight™ Systems](https://developer.nvidia.com/nsight-systems) can be used to profile memory resource performance.
Contributor

@TomAugspurger Apr 7, 2026

Agreed, but I do appreciate the link to "Choosing a memory resource". "Which one should I pick" is a natural first question to ask.

Having just that, after defining a memory resource, would be sufficient.

Resource adaptors wrap and add functionality to existing resources.
For example, the `statistics_resource_adaptor` can be used to track allocation statistics.
The `logging_resource_adaptor` logs allocations to a CSV file.
Adaptors are composable: wrap multiple adaptors for combined functionality.
Contributor

Use "Resource adaptors" here, instead of just "Adaptors"? Or are we using those interchangeably?
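As an aside, the CSV logging and composability described in the excerpt can be sketched generically (a toy allocator pattern, not the RMM API; the column names are hypothetical):

```python
import csv
import io

# Toy sketch (not the RMM API) of a composable logging adaptor: it wraps any
# upstream allocator and records each allocation as a CSV row. The upstream
# may itself be another adaptor, which is what makes adaptors composable.
class LoggingAdaptor:
    def __init__(self, upstream, stream):
        self._upstream = upstream
        self._writer = csv.writer(stream)
        self._writer.writerow(["action", "size"])  # hypothetical log format

    def allocate(self, nbytes):
        self._writer.writerow(["allocate", nbytes])
        return self._upstream(nbytes)

log = io.StringIO()
mr = LoggingAdaptor(bytearray, log)  # upstream could be another adaptor
buf = mr.allocate(64)
print(log.getvalue().splitlines()[-1])  # allocate,64
```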


### 3. Containers

RMM provides [RAII](https://en.cppreference.com/w/cpp/language/raii.html) container classes that manage memory lifetime.
Contributor

Note that this is C++ specific? Or generalize things a bit to apply to python or C++.

Memory resources aim to serve the needs of a wide range of applications, from data science and machine learning to high-performance simulation.

RMM's memory resources leverage CUDA features like **stream-ordered** (asynchronous) pipeline parallelism, **managed** memory (also known as unified virtual memory, UVM), and **pinned** memory, making it easier to write complex workflows that optimally use both device and host memory.
The integrations provided in RMM allow memory resources to benefit memory management across libraries frequently used together, such as **PyTorch** and **RAPIDS**.
Contributor

"allow memory resources to benefit memory management" is a bit awkward.

Maybe "RMM provides integrations with other GPU libraries, enabling uniform memory handling for your entire application." or something like that.

And maybe link to the "Integration with GPU libraries below."


### Python: Using Memory Event Logging

Enable logging by wrapping your memory resource with `LoggingResourceAdaptor`:
Contributor

General comment, we should be able to link to Python API docs with something like

{ref}`rmm.mr.LoggingResourceAdaptor`

(assuming we're building this with myst, which I think we are).

I'm not sure about the c++ side.

These page faults can significantly impact performance, especially for:
- First-touch access patterns
- Random memory access
- Large datasets that don't fit in GPU memory
Contributor

I don't understand this third bullet, since I'd assume that "larger than GPU memory" is a precondition for the page fault?

My best guess is that this suggests something about repeated page faults as subsets of a large dataset are paged in and out of GPU memory?


Labels

doc Documentation non-breaking Non-breaking change

Projects

Status: Review
Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants