Add RMM User Guide #2087

# Choosing a Memory Resource

One of the most common questions when using RMM is: "Which memory resource should I use?"

This guide recommends memory resources that offer the best allocation performance for common workloads.

## Recommended Defaults

For most applications, the CUDA async memory pool provides excellent allocation performance with little to no tuning required.

`````{tabs}
````{code-tab} c++
#include <rmm/mr/cuda_async_memory_resource.hpp>
#include <rmm/mr/per_device_resource.hpp>

rmm::mr::cuda_async_memory_resource mr;
rmm::mr::set_current_device_resource_ref(mr);
````
````{code-tab} python
import rmm

mr = rmm.mr.CudaAsyncMemoryResource()
rmm.mr.set_current_device_resource(mr)
````
`````

For applications that require GPU memory oversubscription (allocating more memory than physically available on the GPU), use a pooled managed memory resource with prefetching. This uses [CUDA Unified Memory](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html) (`cudaMallocManaged`) to enable automatic page migration between CPU and GPU at the cost of slower allocation performance. Coupling the managed memory "base" allocator with adaptors for pool allocation and prefetching to device on allocation recovers some of the performance lost to the overhead of managed allocations. Note: Managed memory has [limited support on WSL2](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html#unified-memory-on-windows-wsl-and-tegra).

`````{tabs}
````{code-tab} c++
#include <rmm/mr/managed_memory_resource.hpp>
#include <rmm/mr/pool_memory_resource.hpp>
#include <rmm/mr/prefetch_resource_adaptor.hpp>
#include <rmm/mr/per_device_resource.hpp>
#include <rmm/cuda_device.hpp>

// Use 80% of GPU memory, rounded down to nearest 256 bytes
auto [free_memory, total_memory] = rmm::available_device_memory();
std::size_t pool_size = (static_cast<std::size_t>(total_memory * 0.8) / 256) * 256;

rmm::mr::managed_memory_resource managed_mr;
rmm::mr::pool_memory_resource pool_mr{managed_mr, pool_size};
rmm::mr::prefetch_resource_adaptor prefetch_mr{pool_mr};
rmm::mr::set_current_device_resource_ref(prefetch_mr);
````
````{code-tab} python
import rmm

# Use 80% of GPU memory, rounded down to nearest 256 bytes
free_memory, total_memory = rmm.mr.available_device_memory()
pool_size = int(total_memory * 0.8) // 256 * 256

mr = rmm.mr.PrefetchResourceAdaptor(
    rmm.mr.PoolMemoryResource(
        rmm.mr.ManagedMemoryResource(),
        initial_pool_size=pool_size,
    )
)
rmm.mr.set_current_device_resource(mr)
````
`````
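
The 80%-of-memory, round-down-to-256-bytes computation above is plain integer arithmetic, so it can be sanity-checked without a GPU. The helper below (`pool_size_bytes` is a hypothetical name, not an RMM API) mirrors that computation:

```python
# Hypothetical helper mirroring the pool-size computation above:
# take a fraction of total memory, then round down to an alignment boundary.
def pool_size_bytes(total_memory: int, fraction: float = 0.8, alignment: int = 256) -> int:
    return int(total_memory * fraction) // alignment * alignment

# For a hypothetical 40 GiB device:
total = 40 * 2**30
size = pool_size_bytes(total)
assert size % 256 == 0      # always 256-byte aligned
assert size <= total * 0.8  # never exceeds the requested fraction
```

Rounding down (rather than up) guarantees the pool never requests more than the chosen fraction of device memory.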

## Memory Resource Considerations

Resources that use the CUDA driver's pool suballocation (`cudaMallocFromPoolAsync`) generally perform well because the driver can manage virtual address space efficiently, reduce fragmentation, and share memory across libraries that allocate from the same pool without synchronization overhead.

### CudaAsyncMemoryResource

The `CudaAsyncMemoryResource` uses a CUDA driver-managed memory pool. Note that it creates its own explicit pool (allocating via `cudaMallocFromPoolAsync`) rather than using the device's default `cudaMallocAsync` pool; this lets RMM configure pool attributes such as the release threshold without altering the default pool's flags. This is the **recommended default** for most applications.

**Advantages:**
- **Fast allocation performance**: Driver-managed suballocation with virtual addressing reduces fragmentation and keeps allocation latency low for most workloads
- **Cross-library sharing**: Libraries that allocate from the same driver pool can share memory without artificial partitioning (libraries that call `cudaMalloc` directly or manage their own pools do not participate)
- **Stream-ordered semantics**: Allocations and deallocations are stream-ordered by default, avoiding pipeline stalls in multi-stream workloads
- **Minimal configuration**: The driver grows the pool automatically, though you may still need to tune the release threshold or maximum size when other libraries allocate outside the pool

**When to use:**
- Default choice for GPU-accelerated applications
- Multi-stream or multi-threaded applications
- Applications using multiple GPU libraries (e.g., cuDF + PyTorch)
- Most production workloads

### CudaMemoryResource

The `CudaMemoryResource` calls `cudaMalloc`/`cudaFree` directly, with no pooling or stream-ordering support. These APIs are fully supported and are not going away, but their per-allocation overhead is high, so this resource is generally not recommended for new applications.

**When to use:**
- Debugging memory issues (to isolate allocator-related problems)
- Benchmarking baseline allocation overhead

### PoolMemoryResource

The `PoolMemoryResource` maintains a pool of memory allocated from an upstream resource. It provides fast suballocation, but requires manual tuning of pool sizes and may not match the performance of `CudaAsyncMemoryResource` in multi-stream workloads.

**Advantages:**
- Fast suballocation from a pre-allocated pool
- Configurable initial and maximum pool sizes for explicit memory budgeting (note that `CudaAsyncMemoryResource` also supports an initial pool size and a release threshold)

**Disadvantages:**
- **Can be slower than the async MR** in multi-stream workloads due to internal locking
- Can suffer from fragmentation (the async MR reduces this with virtual addressing)
- The pool cannot be shared with code that does not allocate through RMM
- May require tuning of the pool size for optimal performance

**When to use:**
- Explicit memory budgeting with fixed pool sizes
- Wrapping other memory sources (e.g., managed memory)
- Prefer `CudaAsyncMemoryResource` for new code unless you need to pool a different upstream resource

**Note**: If using `PoolMemoryResource`, either `CudaMemoryResource` or `CudaAsyncMemoryResource` can serve as the upstream resource; the trade-offs are workload-dependent, so measure both if allocation performance matters:

**Example:**
```python
import rmm

pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaAsyncMemoryResource(),  # upstream resource
    initial_pool_size=2**32,  # 4 GiB
    maximum_pool_size=2**34,  # 16 GiB
)
rmm.mr.set_current_device_resource(pool)
```
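
To build intuition for "explicit memory budgeting," here is a toy, pure-Python sketch (not RMM's actual pool implementation, which suballocates, coalesces, and is stream-aware) of an allocator that enforces a fixed maximum size, loosely analogous to `maximum_pool_size` above:

```python
# Toy sketch only: illustrates a fixed memory budget, not RMM internals.
class FixedBudgetAllocator:
    def __init__(self, maximum_size: int):
        self.maximum_size = maximum_size
        self.used = 0

    def allocate(self, nbytes: int) -> int:
        # Refuse any allocation that would push usage past the budget.
        if self.used + nbytes > self.maximum_size:
            raise MemoryError(f"budget exceeded: {self.used + nbytes} > {self.maximum_size}")
        self.used += nbytes
        return nbytes  # stand-in for a device pointer

    def deallocate(self, nbytes: int) -> None:
        self.used -= nbytes

budget = FixedBudgetAllocator(maximum_size=2**20)  # 1 MiB budget
budget.allocate(2**19)
budget.allocate(2**19)  # exactly fills the budget
# a further budget.allocate(1) would raise MemoryError
```

A fixed budget gives predictable peak usage, at the cost of failing allocations that a growable pool (like the async MR) could satisfy.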

### ManagedMemoryResource

The `ManagedMemoryResource` allocates [CUDA Unified Memory](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html) via `cudaMallocManaged`. Unified Memory creates a single address space accessible from both CPU and GPU, with the CUDA driver migrating pages between processors on demand. This enables [GPU memory oversubscription](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html) (allocating more memory than physically available on the GPU) but generally comes with a performance cost.

**Advantages:**
- Enables GPU memory oversubscription for datasets larger than GPU memory
- Automatic page migration between CPU and GPU

**Disadvantages:**
- **Slower than device memory** due to page faults and migration overhead, especially in multi-stream workloads (see [Performance Tuning](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html#performance-tuning) in the CUDA Programming Guide)
- Usually requires pooling and prefetching to achieve good performance (see [Managed Memory guide](managed_memory.md))

**Example:**
```python
import rmm

# Always combine managed memory with a pool and prefetching for acceptable
# performance. Without prefetching, page faults cause significant overhead,
# especially in multi-stream workloads.
base = rmm.mr.ManagedMemoryResource()
pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30)
prefetch_mr = rmm.mr.PrefetchResourceAdaptor(pool)
rmm.mr.set_current_device_resource(prefetch_mr)
```

**When to use:**
- Datasets larger than available GPU memory
- Always combine with a pool and prefetching (see [Managed Memory guide](managed_memory.md))

### ArenaMemoryResource

The `ArenaMemoryResource` divides a large allocation into size-binned arenas, reducing fragmentation. Compare this with the recommended `CudaAsyncMemoryResource`, which instead reduces fragmentation through driver-managed virtual addressing.

**Advantages:**
- Better fragmentation characteristics than a basic pool
- Good for mixed allocation sizes
- Predictable performance

**Disadvantages:**
- More complex configuration
- May waste memory if bin sizes don't match allocation patterns

**Example:**
```python
import rmm

arena = rmm.mr.ArenaMemoryResource(
    rmm.mr.CudaMemoryResource(),
    arena_size=2**28,  # 256 MiB arenas
)
rmm.mr.set_current_device_resource(arena)
```
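
The bin-mismatch disadvantage noted above can be made concrete with a toy sketch. Assuming, purely for illustration (RMM's arena implementation uses its own binning scheme), that requests round up to power-of-two bins, the memory lost to rounding is easy to compute:

```python
# Toy sketch: round each request up to a power-of-two "bin" and measure
# the memory wasted by the rounding (internal fragmentation).
def bin_size(nbytes: int) -> int:
    size = 1
    while size < nbytes:
        size *= 2
    return size

requests = [100, 1000, 3000, 70_000]
waste = sum(bin_size(n) - n for n in requests)
# e.g. a 3000-byte request lands in a 4096-byte bin, wasting 1096 bytes
```

When allocation sizes sit just above bin boundaries, as with the 3000-byte request here, a large fraction of each bin is wasted; bins that match the workload's sizes waste almost nothing.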

**When to use:**
- Applications with diverse allocation sizes
- Long-running services with complex allocation patterns
- When fragmentation is observed with pool allocators

## Composing Memory Resources

Memory resources can be composed (wrapped) to combine their properties. The general pattern is:

```python
# Adaptor wrapping a base resource
adaptor = rmm.mr.SomeAdaptor(base_resource)
```
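
Since the adaptor pattern is ordinary object composition, it can be illustrated in plain Python without a GPU. This toy sketch (the classes are illustrative stand-ins, not RMM's API) shows an adaptor that records statistics while forwarding every allocation to the upstream resource it wraps:

```python
# Toy sketch of the adaptor pattern: each adaptor forwards to an upstream
# resource and layers on one extra concern. Not RMM's actual classes.
class ToyBaseResource:
    def allocate(self, nbytes: int):
        return object()  # stand-in for a device pointer

class ToyStatisticsAdaptor:
    def __init__(self, upstream):
        self.upstream = upstream
        self.num_allocations = 0
        self.bytes_allocated = 0

    def allocate(self, nbytes: int):
        self.num_allocations += 1
        self.bytes_allocated += nbytes
        return self.upstream.allocate(nbytes)  # forward to the wrapped resource

mr = ToyStatisticsAdaptor(ToyBaseResource())
mr.allocate(1024)
mr.allocate(2048)
# mr.num_allocations == 2, mr.bytes_allocated == 3072
```

Because each adaptor only forwards, adaptors stack freely: a logging adaptor could wrap the statistics adaptor, which wraps the base, exactly like the RMM compositions below.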

### Common Compositions

**Prefetching with managed memory:**
```python
import rmm

# Prefetch adaptor wrapping managed memory pool
base = rmm.mr.ManagedMemoryResource()
pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30)
prefetch = rmm.mr.PrefetchResourceAdaptor(pool)
rmm.mr.set_current_device_resource(prefetch)
```

**Statistics tracking:**
```python
import rmm

# Track allocation statistics (counts, peak, and total bytes)
base = rmm.mr.CudaAsyncMemoryResource()
stats = rmm.mr.StatisticsResourceAdaptor(base)
rmm.mr.set_current_device_resource(stats)
```

**Allocation logging:**
```python
import rmm

# Log every allocation and deallocation to a file
base = rmm.mr.CudaAsyncMemoryResource()
logged = rmm.mr.LoggingResourceAdaptor(base, log_file_name="allocations.csv")
rmm.mr.set_current_device_resource(logged)
```

## Multi-Library Applications

When using RMM with multiple GPU libraries (e.g., cuDF, PyTorch, CuPy), `CudaAsyncMemoryResource` is a good choice because:

1. All libraries configured to allocate through RMM share the same driver-managed pool (libraries that allocate by other means, such as `cudaMalloc` or their own caching allocators, do not participate)
2. Each participating library needs only a one-time configuration step to route its allocations through RMM
3. Memory is not artificially partitioned between libraries

**Example: RMM + PyTorch**
```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Use the async MR as the base
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Configure PyTorch to use RMM
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```

With this setup, both PyTorch and any other RMM-using code (like cuDF) will share the same driver-managed pool.

## Best Practices

1. **Set the memory resource before any allocations**: Changing the resource after allocations have been made can lead to crashes, because memory allocated from one resource must be deallocated by that same resource.

   ```python
   import rmm

   # Do this first, before any GPU allocations
   rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
   ```

2. **Use adaptors for diagnostics**: Wrap with `StatisticsResourceAdaptor` to track allocation counts and peak usage, or `LoggingResourceAdaptor` to log every allocation and deallocation (see [Logging and Profiling](logging.md)).

## See Also

- [Pool Allocators](pool_allocators.md) - Detailed guide on pool and arena allocators
- [Managed Memory](managed_memory.md) - Guide to using managed memory and prefetching
- [Stream-Ordered Allocation](stream_ordered_allocation.md) - Understanding stream-ordered semantics