3 changes: 2 additions & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2020-2025, NVIDIA CORPORATION.
+# SPDX-FileCopyrightText: Copyright (c) 2020-2026, NVIDIA CORPORATION.
# SPDX-License-Identifier: Apache-2.0

# Configuration file for the Sphinx documentation builder.
@@ -57,6 +57,7 @@
"sphinx.ext.intersphinx",
"sphinx_copybutton",
"sphinx_markdown_tables",
+"sphinx_tabs.tabs",
"sphinxcontrib.jquery",
]

2 changes: 1 addition & 1 deletion docs/index.md
@@ -6,7 +6,7 @@ RMM (RAPIDS Memory Manager) is a library for allocating and managing GPU memory
:maxdepth: 2
:caption: Contents

-user_guide/guide
+user_guide/index
cpp/index
python/index
```
264 changes: 264 additions & 0 deletions docs/user_guide/choosing_memory_resources.md
@@ -0,0 +1,264 @@
# Choosing a Memory Resource

One of the most common questions when using RMM is: "Which memory resource should I use?"

This guide recommends memory resources for common workloads, focusing on allocation performance.

## Recommended Defaults

For most applications, the CUDA async memory pool provides the best allocation performance with no tuning required.

`````{tabs}
````{code-tab} c++
#include <rmm/mr/cuda_async_memory_resource.hpp>
#include <rmm/mr/per_device_resource.hpp>

rmm::mr::cuda_async_memory_resource mr;
**Contributor:**
question: Should we be recommending the "default" async pool, rather than this one that makes its own mempool?

**Collaborator (author):**
No. We need the custom mempool to enable Blackwell decompression engine support and a custom release threshold. We don’t want to alter the flags on the default mempool.

**Contributor:**
OK, we should explain this, because this is a trade-off.

rmm::mr::set_current_device_resource_ref(mr);
````
````{code-tab} python
import rmm

mr = rmm.mr.CudaAsyncMemoryResource()
rmm.mr.set_current_device_resource(mr)
````
`````
Comment on lines +11 to +25
**Contributor:**
Can we avoid pushing the set_current_device_resource model in our examples? As we're seeing in all the libraries we have, it is best to manage resources explicitly (not for lifetime reasons, now we have any_resource ownership).


For applications that require GPU memory oversubscription (allocating more memory than is physically available on the GPU), use a pooled managed memory resource with prefetching. This uses [CUDA Unified Memory](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html) (`cudaMallocManaged`) to enable automatic page migration between CPU and GPU, at the cost of slower allocation performance. Coupling the managed memory "base" allocator with adaptors that add pooling and prefetch-to-device on allocation recovers some of the performance lost to managed-allocation overhead. Note: managed memory has [limited support on WSL2](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html#unified-memory-on-windows-wsl-and-tegra).

`````{tabs}
````{code-tab} c++
#include <rmm/mr/managed_memory_resource.hpp>
#include <rmm/mr/pool_memory_resource.hpp>
#include <rmm/mr/prefetch_resource_adaptor.hpp>
#include <rmm/mr/per_device_resource.hpp>
#include <rmm/cuda_device.hpp>

// Use 80% of GPU memory, rounded down to nearest 256 bytes
auto [free_memory, total_memory] = rmm::available_device_memory();
std::size_t pool_size = (static_cast<std::size_t>(total_memory * 0.8) / 256) * 256;
**Contributor:**
rmm::align_down is a public function.


rmm::mr::managed_memory_resource managed_mr;
rmm::mr::pool_memory_resource pool_mr{managed_mr, pool_size};
Comment on lines +41 to +42
**Contributor:**
Question: Should we not be recommending `cuda_async_managed_memory_resource` (at least on CUDA 13)?

**Collaborator (author):**
That feature is currently experimental. I am seeing mixed results on performance, and addressing those with the CUDA memory team. Long term this should be the preferred direction.

rmm::mr::prefetch_resource_adaptor prefetch_mr{pool_mr};
rmm::mr::set_current_device_resource_ref(prefetch_mr);
````
````{code-tab} python
import rmm

# Use 80% of GPU memory, rounded down to nearest 256 bytes
free_memory, total_memory = rmm.mr.available_device_memory()
pool_size = int(total_memory * 0.8) // 256 * 256

mr = rmm.mr.PrefetchResourceAdaptor(
rmm.mr.PoolMemoryResource(
rmm.mr.ManagedMemoryResource(),
initial_pool_size=pool_size,
)
)
rmm.mr.set_current_device_resource(mr)
````
`````
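The 80%-of-total pool sizing shown above is plain align-down arithmetic (RMM also exposes `rmm::align_down` as a public C++ function). A minimal standalone sketch of the computation, using a hypothetical local `align_down` helper so it runs without a GPU:

```python
def align_down(value: int, alignment: int) -> int:
    """Round value down to the nearest multiple of alignment."""
    return value - (value % alignment)

# Pretend the device reported 16 GiB of total memory.
total_memory = 16 * 1024**3

# Budget 80% of it for the pool, aligned down to a 256-byte boundary.
pool_size = align_down(int(total_memory * 0.8), 256)

assert pool_size % 256 == 0
assert pool_size <= int(total_memory * 0.8)
```

The same value is what the C++ and Python snippets above pass as the initial pool size.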

## Memory Resource Considerations

Resources that use the CUDA driver's pool suballocation (`cudaMallocFromPoolAsync`) provide the best performance because the driver can manage virtual address space efficiently, avoid fragmentation, and share memory across libraries without synchronization overhead.

### CudaAsyncMemoryResource
**Contributor:**
Throughout: All of these need cross-links to the API docs.

We also need a rosetta stone to translate from the Python names (here) to the C++ names, I expect.

**Collaborator (author):**
Yes, I agree on the Rosetta stone aspect. That is one of my goals in writing a user guide that is separate from the API docs.


The `CudaAsyncMemoryResource` uses CUDA's driver-managed memory pool (via `cudaMallocAsync`). This is the **recommended default** for most applications.
**Contributor:**
This is a correct, but misleading, statement.

It uses a driver-managed pool. But, crucially, it does not use the default mempool for the device.

**Collaborator (author):**
Yes, we can clarify that.


**Advantages:**
- **Fastest allocation performance**: Driver-managed suballocation with virtual addressing eliminates fragmentation and minimizes latency
**Contributor:**
This is definitely not true. For many applications we have people still using the RMM pool exactly because it has lower latency allocation performance.

**Collaborator (author):**
I am only aware of applications using the RMM pool to reduce the cost of managed memory. Most other applications I can think of default to the async MR. Maybe we can discuss these use cases offline.

- **Cross-library sharing**: The pool is shared across all libraries on the device, even those not using RMM directly
**Contributor:**
This is also not true.

**Collaborator (author):**
My understanding is that applications using the driver’s async pool (default or custom) won’t step on each other quite as badly as if, say, cuDF were to allocate an RMM pool with 80% of the device and leave only 20% (minus overhead) for another library like PyTorch.

- **Stream-ordered semantics**: Allocations and deallocations are stream-ordered by default, avoiding pipeline stalls in multi-stream workloads
- **Zero configuration**: No pool sizes to tune — the driver manages growth automatically
**Contributor:**
This is also not true. You need to handle the release threshold and max size in case you are also using libraries that don't allocate from the pool. e.g. anything that doesn't speak RMM memory resources, or use cudaMallocFromPoolAsync. Most communication libraries, for example, are in this latter bucket.

**Collaborator (author):**
Can you point to examples where this tuning is being done?


**When to use:**
- Default choice for GPU-accelerated applications
- Multi-stream or multi-threaded applications
- Applications using multiple GPU libraries (e.g., cuDF + PyTorch)
- Most production workloads

### CudaMemoryResource

The `CudaMemoryResource` uses the legacy `cudaMalloc`/`cudaFree` APIs directly with no pooling or stream-ordering support. It is generally not recommended.
**Contributor:**
I think "legacy" is the wrong word. These are never going to be deprecated so they are not legacy.

**Collaborator (author):**
I can remove that word but it is definitely not recommended for new applications.


**When to use:**
- Debugging memory issues (to isolate allocator-related problems)
- Benchmarking baseline allocation overhead

### PoolMemoryResource

The `PoolMemoryResource` maintains a pool of memory allocated from an upstream resource. It provides fast suballocation but requires manual tuning for pool sizes and does not match the performance of `CudaAsyncMemoryResource` in multi-stream workloads.

**Advantages:**
- Fast suballocation from pre-allocated pool
- Configurable initial and maximum pool sizes for explicit memory budgeting
**Contributor:**
The async pool also has these


**Disadvantages:**
- **Slower than async MR** in multi-stream workloads due to internal locking
**Contributor:**
Sometimes.

- Can suffer from fragmentation (async MR reduces this with virtual addressing)
- Pool cannot be shared across CUDA applications unless all applications are using RMM
- May require tuning of pool size for optimal performance

**When to use:**
- Explicit memory budgeting with fixed pool sizes
- Wrapping non-CUDA memory sources (e.g., managed memory)
- Prefer `CudaAsyncMemoryResource` for new code unless you need explicit pool size control
**Contributor:**
`CudaAsyncMemoryResource` has explicit pool size control.


**Note**: If using `PoolMemoryResource`, prefer wrapping `CudaAsyncMemoryResource` as the upstream rather than `CudaMemoryResource`:
**Contributor:**
Really? This sounds wrong.

**Collaborator (author):**
Why wouldn’t you want to use the async MR as the base? It enables Blackwell DE support, for example.

**Contributor:**
The decompression engine works fine with cudaMalloc: https://docs.nvidia.com/cuda/nvcomp/decompression_engine_faq.html#id4

If you want to use the decompression engine to decompress host pointers you need to use `cudaMallocFromPoolAsync` with an appropriately configured pool, but that is not the case here.


**Example:**
```python
import rmm

pool = rmm.mr.PoolMemoryResource(
rmm.mr.CudaAsyncMemoryResource(), # upstream resource
initial_pool_size=2**32, # 4 GiB
maximum_pool_size=2**34 # 16 GiB
)
rmm.mr.set_current_device_resource(pool)
```

### ManagedMemoryResource

The `ManagedMemoryResource` allocates [CUDA Unified Memory](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html) via `cudaMallocManaged`. Unified Memory creates a single address space accessible from both CPU and GPU, with the CUDA driver migrating pages between processors on demand. This enables [GPU memory oversubscription](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html) — allocating more memory than physically available on the GPU — but generally comes with a performance cost.

**Advantages:**
- Enables GPU memory oversubscription for datasets larger than GPU memory
- Automatic page migration between CPU and GPU

**Disadvantages:**
- **Slower than device memory** due to page faults and migration overhead, especially in multi-stream workloads (see [Performance Tuning](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html#performance-tuning) in the CUDA Programming Guide)
- Requires prefetching to achieve acceptable performance (see [Managed Memory guide](managed_memory.md))
**Contributor:**
Wording: something like:

Suggested change:
```diff
-- Requires prefetching to achieve acceptable performance (see [Managed Memory guide](managed_memory.md))
+- Requires prefetching to achieve acceptable performance, see [the Managed Memory guide](managed_memory.md) for details on how to configure prefetching.
```


**Example:**
```python
import rmm

# Always combine managed memory with a pool and prefetching for acceptable
# performance. Without prefetching, page faults cause significant overhead,
# especially in multi-stream workloads.
base = rmm.mr.ManagedMemoryResource()
pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30)
prefetch_mr = rmm.mr.PrefetchResourceAdaptor(pool)
rmm.mr.set_current_device_resource(prefetch_mr)
```

**When to use:**
- Datasets larger than available GPU memory
- Always combine with a pool and prefetching (see [Managed Memory guide](managed_memory.md))

### ArenaMemoryResource

The `ArenaMemoryResource` divides a large allocation into size-binned arenas, reducing fragmentation.

**Advantages:**
- Better fragmentation characteristics than basic pool
- Good for mixed allocation sizes
- Predictable performance

**Disadvantages:**
- More complex configuration
- May waste memory if bin sizes don't match allocation patterns
Comment on lines +155 to +164
**Contributor:**
This should be contrasted with the (recommended) async pool as well.


**Example:**
```python
import rmm

arena = rmm.mr.ArenaMemoryResource(
rmm.mr.CudaMemoryResource(),
arena_size=2**28 # 256 MiB arenas
)
rmm.mr.set_current_device_resource(arena)
```

**When to use:**
- Applications with diverse allocation sizes
- Long-running services with complex allocation patterns
- When fragmentation is observed with pool allocators

## Composing Memory Resources

Memory resources can be composed (wrapped) to combine their properties. The general pattern is:

```python
# Adaptor wrapping a base resource
adaptor = rmm.mr.SomeAdaptor(base_resource)
```
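Under the hood this is ordinary decorator composition: every resource and adaptor exposes the same allocate/deallocate interface, and each adaptor forwards to its upstream while adding behavior on the way through. A toy pure-Python sketch of the pattern (hypothetical classes, not RMM's actual API):

```python
class BaseResource:
    """Stand-in for a base memory resource (e.g. a device allocator)."""
    def allocate(self, nbytes):
        return bytearray(nbytes)

    def deallocate(self, buf):
        pass

class CountingAdaptor:
    """Stand-in for an adaptor: same interface, forwards to upstream,
    and adds behavior (here, counting allocations)."""
    def __init__(self, upstream):
        self.upstream = upstream
        self.allocation_count = 0

    def allocate(self, nbytes):
        self.allocation_count += 1
        return self.upstream.allocate(nbytes)

    def deallocate(self, buf):
        self.upstream.deallocate(buf)

# Compose outermost-to-innermost, like PrefetchResourceAdaptor(PoolMemoryResource(...)).
mr = CountingAdaptor(BaseResource())
buf = mr.allocate(1024)
assert mr.allocation_count == 1 and len(buf) == 1024
```

Because every layer shares one interface, any adaptor can wrap any resource, which is why the compositions below mix and match freely.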

### Common Compositions

**Prefetching with managed memory:**
```python
import rmm

# Prefetch adaptor wrapping managed memory pool
base = rmm.mr.ManagedMemoryResource()
pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30)
prefetch = rmm.mr.PrefetchResourceAdaptor(pool)
rmm.mr.set_current_device_resource(prefetch)
```

**Statistics tracking:**
```python
import rmm

# Track allocation statistics (counts, peak, and total bytes)
base = rmm.mr.CudaAsyncMemoryResource()
stats = rmm.mr.StatisticsResourceAdaptor(base)
rmm.mr.set_current_device_resource(stats)
```

**Allocation logging:**
```python
import rmm

# Log every allocation and deallocation to a file
base = rmm.mr.CudaAsyncMemoryResource()
logged = rmm.mr.LoggingResourceAdaptor(base, log_file_name="allocations.csv")
rmm.mr.set_current_device_resource(logged)
```

## Multi-Library Applications

When using RMM with multiple GPU libraries (e.g., cuDF, PyTorch, CuPy), `CudaAsyncMemoryResource` is especially important because:

1. The driver-managed pool is shared automatically across all libraries
**Contributor:**
`CudaAsyncMemoryResource` doesn't use the default mempool, so this is not true.

Additionally, even when using the default mempool this sharing doesn't happen by magic: all participating libraries must have somehow decided to use `cudaMallocAsync` with the default mempool.

2. You don't need to configure every library to use RMM
**Contributor:**
The example literally requires configuration of PyTorch to use RMM.

3. Memory is not artificially partitioned between libraries

**Example: RMM + PyTorch**
```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Use async MR as the base
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Configure PyTorch to use RMM
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```

With this setup, both PyTorch and any other RMM-using code (like cuDF) will share the same driver-managed pool.

## Best Practices

1. **Set the memory resource before any allocations**: Changing the resource after allocations have been made can lead to crashes.

```python
import rmm

# Do this first, before any GPU allocations
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
```

2. **Use adaptors for diagnostics**: Wrap with `StatisticsResourceAdaptor` to track allocation counts and peak usage, or `LoggingResourceAdaptor` to log every allocation and deallocation (see [Logging and Profiling](logging.md)).

## See Also

- [Pool Allocators](pool_allocators.md) - Detailed guide on pool and arena allocators
- [Managed Memory](managed_memory.md) - Guide to using managed memory and prefetching
- [Stream-Ordered Allocation](stream_ordered_allocation.md) - Understanding stream-ordered semantics