diff --git a/docs/conf.py b/docs/conf.py
index db45dff09..c090697b8 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2020-2025, NVIDIA CORPORATION.
+# SPDX-FileCopyrightText: Copyright (c) 2020-2026, NVIDIA CORPORATION.
 # SPDX-License-Identifier: Apache-2.0

 # Configuration file for the Sphinx documentation builder.
@@ -57,6 +57,7 @@
     "sphinx.ext.intersphinx",
     "sphinx_copybutton",
     "sphinx_markdown_tables",
+    "sphinx_tabs.tabs",
     "sphinxcontrib.jquery",
 ]
diff --git a/docs/index.md b/docs/index.md
index d95428e72..21c8db3e7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,7 +6,7 @@ RMM (RAPIDS Memory Manager) is a library for allocating and managing GPU memory
 :maxdepth: 2
 :caption: Contents

-user_guide/guide
+user_guide/index
 cpp/index
 python/index
 ```
diff --git a/docs/user_guide/choosing_memory_resources.md b/docs/user_guide/choosing_memory_resources.md
new file mode 100644
index 000000000..56afa67fc
--- /dev/null
+++ b/docs/user_guide/choosing_memory_resources.md
@@ -0,0 +1,264 @@
+# Choosing a Memory Resource
+
+One of the most common questions when using RMM is: "Which memory resource should I use?"
+
+This guide recommends memory resources for common workloads, based on allocation performance.
+
+## Recommended Defaults
+
+For most applications, the CUDA async memory pool provides the best allocation performance with no tuning required.
+
+`````{tabs}
+````{code-tab} c++
+#include <rmm/mr/device/cuda_async_memory_resource.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+
+rmm::mr::cuda_async_memory_resource mr;
+rmm::mr::set_current_device_resource_ref(mr);
+````
+````{code-tab} python
+import rmm
+
+mr = rmm.mr.CudaAsyncMemoryResource()
+rmm.mr.set_current_device_resource(mr)
+````
+`````
+
+For applications that require GPU memory oversubscription (allocating more memory than physically available on the GPU), use a pooled managed memory resource with prefetching.
This uses [CUDA Unified Memory](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html) (`cudaMallocManaged`) to enable automatic page migration between CPU and GPU, at the cost of slower allocation performance. Coupling the managed memory "base" allocator with adaptors that add pool allocation and prefetching to device on allocation recovers some of the performance lost to the overhead of managed allocations. Note: Managed memory has [limited support on WSL2](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html#unified-memory-on-windows-wsl-and-tegra).
+
+`````{tabs}
+````{code-tab} c++
+#include <rmm/cuda_device.hpp>
+#include <rmm/mr/device/managed_memory_resource.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+#include <rmm/mr/device/pool_memory_resource.hpp>
+#include <rmm/mr/device/prefetch_resource_adaptor.hpp>
+
+// Use 80% of GPU memory, rounded down to nearest 256 bytes
+auto [free_memory, total_memory] = rmm::available_device_memory();
+std::size_t pool_size = (static_cast<std::size_t>(total_memory * 0.8) / 256) * 256;
+
+rmm::mr::managed_memory_resource managed_mr;
+rmm::mr::pool_memory_resource<rmm::mr::managed_memory_resource> pool_mr{managed_mr, pool_size};
+rmm::mr::prefetch_resource_adaptor<decltype(pool_mr)> prefetch_mr{pool_mr};
+rmm::mr::set_current_device_resource_ref(prefetch_mr);
+````
+````{code-tab} python
+import rmm
+
+# Use 80% of GPU memory, rounded down to nearest 256 bytes
+free_memory, total_memory = rmm.mr.available_device_memory()
+pool_size = int(total_memory * 0.8) // 256 * 256
+
+mr = rmm.mr.PrefetchResourceAdaptor(
+    rmm.mr.PoolMemoryResource(
+        rmm.mr.ManagedMemoryResource(),
+        initial_pool_size=pool_size,
+    )
+)
+rmm.mr.set_current_device_resource(mr)
+````
+`````
+
+## Memory Resource Considerations
+
+Resources that use the CUDA driver's pool suballocation (`cudaMallocFromPoolAsync`) provide the best performance because the driver can manage virtual address space efficiently, avoid fragmentation, and share memory across libraries without synchronization overhead.
+
+### CudaAsyncMemoryResource
+
+The `CudaAsyncMemoryResource` uses CUDA's driver-managed memory pool (via `cudaMallocAsync`).
This is the **recommended default** for most applications. + +**Advantages:** +- **Fastest allocation performance**: Driver-managed suballocation with virtual addressing eliminates fragmentation and minimizes latency +- **Cross-library sharing**: The pool is shared across all libraries on the device, even those not using RMM directly +- **Stream-ordered semantics**: Allocations and deallocations are stream-ordered by default, avoiding pipeline stalls in multi-stream workloads +- **Zero configuration**: No pool sizes to tune — the driver manages growth automatically + +**When to use:** +- Default choice for GPU-accelerated applications +- Multi-stream or multi-threaded applications +- Applications using multiple GPU libraries (e.g., cuDF + PyTorch) +- Most production workloads + +### CudaMemoryResource + +The `CudaMemoryResource` uses the legacy `cudaMalloc`/`cudaFree` APIs directly with no pooling or stream-ordering support. It is generally not recommended. + +**When to use:** +- Debugging memory issues (to isolate allocator-related problems) +- Benchmarking baseline allocation overhead + +### PoolMemoryResource + +The `PoolMemoryResource` maintains a pool of memory allocated from an upstream resource. It provides fast suballocation but requires manual tuning for pool sizes and does not match the performance of `CudaAsyncMemoryResource` in multi-stream workloads. 
+ +**Advantages:** +- Fast suballocation from pre-allocated pool +- Configurable initial and maximum pool sizes for explicit memory budgeting + +**Disadvantages:** +- **Slower than async MR** in multi-stream workloads due to internal locking +- Can suffer from fragmentation (async MR reduces this with virtual addressing) +- Pool cannot be shared across CUDA applications unless all applications are using RMM +- May require tuning of pool size for optimal performance + +**When to use:** +- Explicit memory budgeting with fixed pool sizes +- Wrapping non-CUDA memory sources (e.g., managed memory) +- Prefer `CudaAsyncMemoryResource` for new code unless you need explicit pool size control + +**Note**: If using `PoolMemoryResource`, prefer wrapping `CudaAsyncMemoryResource` as the upstream rather than `CudaMemoryResource`: + +**Example:** +```python +import rmm + +pool = rmm.mr.PoolMemoryResource( + rmm.mr.CudaAsyncMemoryResource(), # upstream resource + initial_pool_size=2**32, # 4 GiB + maximum_pool_size=2**34 # 16 GiB +) +rmm.mr.set_current_device_resource(pool) +``` + +### ManagedMemoryResource + +The `ManagedMemoryResource` allocates [CUDA Unified Memory](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html) via `cudaMallocManaged`. Unified Memory creates a single address space accessible from both CPU and GPU, with the CUDA driver migrating pages between processors on demand. This enables [GPU memory oversubscription](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html) — allocating more memory than physically available on the GPU — but generally comes with a performance cost. 
+ +**Advantages:** +- Enables GPU memory oversubscription for datasets larger than GPU memory +- Automatic page migration between CPU and GPU + +**Disadvantages:** +- **Slower than device memory** due to page faults and migration overhead, especially in multi-stream workloads (see [Performance Tuning](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html#performance-tuning) in the CUDA Programming Guide) +- Requires prefetching to achieve acceptable performance (see [Managed Memory guide](managed_memory.md)) + +**Example:** +```python +import rmm + +# Always combine managed memory with a pool and prefetching for acceptable +# performance. Without prefetching, page faults cause significant overhead, +# especially in multi-stream workloads. +base = rmm.mr.ManagedMemoryResource() +pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30) +prefetch_mr = rmm.mr.PrefetchResourceAdaptor(pool) +rmm.mr.set_current_device_resource(prefetch_mr) +``` + +**When to use:** +- Datasets larger than available GPU memory +- Always combine with a pool and prefetching (see [Managed Memory guide](managed_memory.md)) + +### ArenaMemoryResource + +The `ArenaMemoryResource` divides a large allocation into size-binned arenas, reducing fragmentation. 
+ +**Advantages:** +- Better fragmentation characteristics than basic pool +- Good for mixed allocation sizes +- Predictable performance + +**Disadvantages:** +- More complex configuration +- May waste memory if bin sizes don't match allocation patterns + +**Example:** +```python +import rmm + +arena = rmm.mr.ArenaMemoryResource( + rmm.mr.CudaMemoryResource(), + arena_size=2**28 # 256 MiB arenas +) +rmm.mr.set_current_device_resource(arena) +``` + +**When to use:** +- Applications with diverse allocation sizes +- Long-running services with complex allocation patterns +- When fragmentation is observed with pool allocators + +## Composing Memory Resources + +Memory resources can be composed (wrapped) to combine their properties. The general pattern is: + +```python +# Adaptor wrapping a base resource +adaptor = rmm.mr.SomeAdaptor(base_resource) +``` + +### Common Compositions + +**Prefetching with managed memory:** +```python +import rmm + +# Prefetch adaptor wrapping managed memory pool +base = rmm.mr.ManagedMemoryResource() +pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30) +prefetch = rmm.mr.PrefetchResourceAdaptor(pool) +rmm.mr.set_current_device_resource(prefetch) +``` + +**Statistics tracking:** +```python +import rmm + +# Track allocation statistics (counts, peak, and total bytes) +base = rmm.mr.CudaAsyncMemoryResource() +stats = rmm.mr.StatisticsResourceAdaptor(base) +rmm.mr.set_current_device_resource(stats) +``` + +**Allocation logging:** +```python +import rmm + +# Log every allocation and deallocation to a file +base = rmm.mr.CudaAsyncMemoryResource() +logged = rmm.mr.LoggingResourceAdaptor(base, log_file_name="allocations.csv") +rmm.mr.set_current_device_resource(logged) +``` + +## Multi-Library Applications + +When using RMM with multiple GPU libraries (e.g., cuDF, PyTorch, CuPy), `CudaAsyncMemoryResource` is especially important because: + +1. The driver-managed pool is shared automatically across all libraries +2. 
You don't need to configure every library to use RMM +3. Memory is not artificially partitioned between libraries + +**Example: RMM + PyTorch** +```python +import rmm +import torch +from rmm.allocators.torch import rmm_torch_allocator + +# Use async MR as the base +rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource()) + +# Configure PyTorch to use RMM +torch.cuda.memory.change_current_allocator(rmm_torch_allocator) +``` + +With this setup, both PyTorch and any other RMM-using code (like cuDF) will share the same driver-managed pool. + +## Best Practices + +1. **Set the memory resource before any allocations**: Changing the resource after allocations have been made can lead to crashes. + + ```python + import rmm + + # Do this first, before any GPU allocations + rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource()) + ``` + +2. **Use adaptors for diagnostics**: Wrap with `StatisticsResourceAdaptor` to track allocation counts and peak usage, or `LoggingResourceAdaptor` to log every allocation and deallocation (see [Logging and Profiling](logging.md)). + +## See Also + +- [Pool Allocators](pool_allocators.md) - Detailed guide on pool and arena allocators +- [Managed Memory](managed_memory.md) - Guide to using managed memory and prefetching +- [Stream-Ordered Allocation](stream_ordered_allocation.md) - Understanding stream-ordered semantics diff --git a/docs/user_guide/guide.md b/docs/user_guide/guide.md index b6923257b..b941ebb9b 100644 --- a/docs/user_guide/guide.md +++ b/docs/user_guide/guide.md @@ -1,338 +1,447 @@ -# User Guide +# Programming Guide -Achieving optimal performance in GPU-centric workflows frequently requires -customizing how GPU ("device") memory is allocated. +This guide covers using RMM in C++ and Python applications, including memory resources, containers, and library integrations. -RMM is a package that enables you to allocate device memory -in a highly configurable way. 
For example, it enables you to -allocate and use pools of GPU memory, or to use -[managed memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) -for allocations. +## Basic Example -You can also easily configure other libraries like Numba and CuPy -to use RMM for allocating device memory. +`````{tabs} +````{code-tab} c++ +#include +#include +#include +#include -## Installation +int main() { + // Use async MR (recommended) + rmm::mr::cuda_async_memory_resource async_mr; + rmm::mr::set_current_device_resource_ref(async_mr); -See the project [README](https://github.com/rapidsai/rmm) for how to install RMM. + // Allocate device memory + rmm::cuda_stream stream; + rmm::device_buffer buffer(1024, stream.view()); -## Using RMM + std::cout << "Allocated " << buffer.size() << " bytes\n"; -There are two ways to use RMM in Python code: + return 0; +} +```` +````{code-tab} python +import rmm -1. Using the `rmm.DeviceBuffer` API to explicitly create and manage - device memory allocations -2. Transparently via external libraries such as CuPy and Numba +# Use async MR (recommended) +mr = rmm.mr.CudaAsyncMemoryResource() +rmm.mr.set_current_device_resource(mr) -RMM provides a `MemoryResource` abstraction to control _how_ device -memory is allocated in both the above uses. +# Allocate device memory +buffer = rmm.DeviceBuffer(size=1024) -### `DeviceBuffer` Objects +print(f"Allocated {buffer.size} bytes at {hex(buffer.ptr)}") +```` +````` -A `DeviceBuffer` represents an **untyped, uninitialized device memory -allocation**. `DeviceBuffer`s can be created by providing the -size of the allocation in bytes: +## Memory Resources -```python ->>> import rmm ->>> buf = rmm.DeviceBuffer(size=100) -``` +Memory resources control how device memory is allocated. RMM provides several resource types optimized for different use cases. 
-The size of the allocation and the memory address associated with it -can be accessed via the `.size` and `.ptr` attributes respectively: +### Setting the Current Resource -```python ->>> buf.size -100 ->>> buf.ptr -140202544726016 -``` +The current device resource is used by default for all allocations: -`DeviceBuffer`s can also be created by copying data from host memory: +`````{tabs} +````{code-tab} c++ +#include +#include -```python ->>> import rmm ->>> import numpy as np ->>> a = np.array([1, 2, 3], dtype='float64') ->>> buf = rmm.DeviceBuffer.to_device(a.view("uint8")) # to_device expects an unsigned 8-bit dtype ->>> buf.size -24 -``` +// Get current device resource ref +rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource_ref(); -Conversely, the data underlying a `DeviceBuffer` can be copied to the host: +// Set current device resource ref +rmm::mr::cuda_async_memory_resource async_mr; +rmm::mr::set_current_device_resource_ref(async_mr); +```` +````{code-tab} python +import rmm -```python ->>> np.frombuffer(buf.tobytes()) -array([1., 2., 3.]) -``` +# Get current device resource +mr = rmm.mr.get_current_device_resource() -#### Prefetching a `DeviceBuffer` +# Set current device resource +async_mr = rmm.mr.CudaAsyncMemoryResource() +rmm.mr.set_current_device_resource(async_mr) +```` +````` -[CUDA Unified Memory]( - https://developer.nvidia.com/blog/unified-memory-cuda-beginners/ -), also known as managed memory, can be allocated using an -`rmm.mr.ManagedMemoryResource` explicitly, or by calling `rmm.reinitialize` -with `managed_memory=True`. +> **Warning**: The default resource must be set **before** allocating any device memory on that device. Setting or changing the resource after device allocations have been made can lead to unexpected behavior or crashes. 
-A `DeviceBuffer` backed by managed memory or other -migratable memory (such as -[HMM/ATS](https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/) -memory) may be prefetched to a specified device, for example to reduce or eliminate page faults. +### Available Resources -```python ->>> import rmm ->>> rmm.reinitialize(managed_memory=True) ->>> buf = rmm.DeviceBuffer(size=100) ->>> buf.prefetch() -``` +RMM provides base memory resources (e.g., `CudaAsyncMemoryResource`, `ManagedMemoryResource`) and resource adaptors (e.g., `PoolMemoryResource`, `StatisticsResourceAdaptor`) that wrap an upstream resource to add functionality. See [Choosing a Memory Resource](choosing_memory_resources.md) for recommendations and the API references ([C++](../cpp/memory_resources/index.md), [Python](../python/index.md)) for the full list. -The above example prefetches the `DeviceBuffer` memory to the current CUDA device -on the stream that the `DeviceBuffer` last used (e.g. at construction). The -destination device ID and stream are optional parameters. +## Containers -```python ->>> import rmm ->>> rmm.reinitialize(managed_memory=True) ->>> from rmm.pylibrmm.stream import Stream ->>> stream = Stream() ->>> buf = rmm.DeviceBuffer(size=100, stream=stream) ->>> buf.prefetch(device=3, stream=stream) # prefetch to device on stream. -``` +RMM provides RAII containers that automatically manage device memory lifetime. -`DeviceBuffer.prefetch()` is a no-op if the `DeviceBuffer` is not backed -by migratable memory. +### DeviceBuffer -`rmm.pylibrmm.stream.Stream` implements the [CUDA Stream Protocol](https://nvidia.github.io/cuda-python/cuda-core/latest/interoperability.html#cuda-stream-protocol), so it can be used with -`cuda.core.`. 
+Untyped, uninitialized device memory: -```python ->>> from cuda.core import Device ->>> import rmm.pylibrmm.stream ->>> device = Device() ->>> device.set_current() ->>> rmm_stream = rmm.pylibrmm.stream.Stream() +`````{tabs} +````{code-tab} c++ +#include ->>> cuda_stream = device.create_stream(rmm_stream) -``` +rmm::cuda_stream stream; -### `MemoryResource` objects +// Allocate 1024 bytes +rmm::device_buffer buffer(1024, stream.view()); -`MemoryResource` objects are used to configure how device memory allocations are made by -RMM. +// Access pointer and size +void* ptr = buffer.data(); +std::size_t size = buffer.size(); -By default if a `MemoryResource` is not set explicitly, RMM uses the `CudaMemoryResource`, which -uses `cudaMalloc` for allocating device memory. +// Resize (may reallocate) +buffer.resize(2048, stream.view()); -`rmm.reinitialize()` provides an easy way to initialize RMM with specific memory resource options -across multiple devices. See `help(rmm.reinitialize)` for full details. +// Copy construct (deep copy) +rmm::device_buffer buffer2(buffer, stream.view()); +```` +````{code-tab} python +import rmm -For lower-level control, the `rmm.mr.set_current_device_resource()` function can be -used to set a different MemoryResource for the current CUDA device. For -example, enabling the `ManagedMemoryResource` tells RMM to use -`cudaMallocManaged` instead of `cudaMalloc` for allocating memory: +# Allocate 1024 bytes +buffer = rmm.DeviceBuffer(size=1024) -```python ->>> import rmm ->>> rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource()) -``` +# Access pointer and size +ptr = buffer.ptr +size = buffer.size -> :warning: The default resource must be set for any device **before** -> allocating any device memory on that device. Setting or changing the -> resource after device allocations have been made can lead to unexpected -> behaviour or crashes. 
+# Resize (may reallocate) +buffer.resize(2048) -As another example, `PoolMemoryResource` allows you to allocate a -large "pool" of device memory up-front. Subsequent allocations will -draw from this pool of already allocated memory. The example -below shows how to construct a PoolMemoryResource with an initial size -of 1 GiB and a maximum size of 4 GiB. The pool uses -`CudaMemoryResource` as its underlying ("upstream") memory resource: +# Copy construct (deep copy) +buffer2 = buffer.copy() +```` +````` -```python ->>> import rmm ->>> pool = rmm.mr.PoolMemoryResource( -... rmm.mr.CudaMemoryResource(), -... initial_pool_size="1GiB", # equivalent to initial_pool_size=2**30 -... maximum_pool_size="4GiB" -... ) ->>> rmm.mr.set_current_device_resource(pool) -``` +### device_uvector (C++) -Similarly, to use a pool of managed memory: +Typed, uninitialized device vector for trivially copyable types: -```python ->>> import rmm ->>> pool = rmm.mr.PoolMemoryResource( -... rmm.mr.ManagedMemoryResource(), -... initial_pool_size="1GiB", -... maximum_pool_size="4GiB" -... ) ->>> rmm.mr.set_current_device_resource(pool) +```cpp +#include +#include +#include + +rmm::cuda_stream stream; + +// Allocate 100 elements +rmm::device_uvector vec(100, stream.view()); + +// Access as pointer +int* ptr = vec.data(); + +// Access as iterators +auto begin = vec.begin(); +auto end = vec.end(); + +// Initialize with Thrust +thrust::fill(rmm::exec_policy(stream.view()), vec.begin(), vec.end(), 42); + +// Resize +vec.resize(200, stream.view()); ``` -Other `MemoryResource`s include: +### device_scalar (C++) -* `FixedSizeMemoryResource` for allocating fixed blocks of memory -* `BinningMemoryResource` for allocating blocks within specified "bin" sizes from different memory -resources +Single typed element with host-device transfer convenience: -`MemoryResource`s are highly configurable and can be composed together in different ways. -See `help(rmm.mr)` for more information. 
+```cpp +#include -## Using RMM with third-party libraries +rmm::cuda_stream stream; -A number of libraries provide hooks to control their device -allocations. RMM provides implementations of these for -[CuPy](https://cupy.dev), -[numba](https://numba.readthedocs.io/en/stable/), and [PyTorch](https://pytorch.org) in the -`rmm.allocators` submodule. All these approaches configure the library -to use the _current_ RMM memory resource for device -allocations. +// Allocate single int +rmm::device_scalar scalar(stream.view()); -### Using RMM with CuPy +// Set value from host (async on stream) +scalar.set_value(42, stream.view()); -You can configure [CuPy](https://cupy.dev/) to use RMM for memory -allocations by setting the CuPy CUDA allocator to -`rmm.allocators.cupy.rmm_cupy_allocator`: +// Get value to host (async on stream) +int value = scalar.value(stream.view()); -```python ->>> from rmm.allocators.cupy import rmm_cupy_allocator ->>> import cupy ->>> cupy.cuda.set_allocator(rmm_cupy_allocator) +// Access device pointer +int* d_ptr = scalar.data(); + +// Pass to kernel +launch_kernel<<<..., stream.value()>>>(scalar.data()); ``` -### Using RMM with Numba +## Resource Adaptors -You can configure [Numba](https://numba.readthedocs.io/en/stable/) to use RMM for memory allocations using the -Numba [EMM Plugin](https://numba.readthedocs.io/en/stable/cuda/external-memory.html#setting-emm-plugin). +Adaptors wrap resources to add functionality like statistics tracking and logging. -This can be done in two ways: +### Statistics Tracking -1. Setting the environment variable `NUMBA_CUDA_MEMORY_MANAGER`: +`````{tabs} +````{code-tab} c++ +#include +#include - ```bash - $ NUMBA_CUDA_MEMORY_MANAGER=rmm.allocators.numba python (args) - ``` +rmm::mr::cuda_async_memory_resource cuda_mr; +rmm::mr::statistics_resource_adaptor stats_mr{cuda_mr}; +rmm::mr::set_current_device_resource_ref(stats_mr); -2. 
Using the `set_memory_manager()` function provided by Numba: +// Allocate +rmm::cuda_stream stream; +rmm::device_buffer buffer(1024, stream.view()); - ```python - >>> from numba import cuda - >>> from rmm.allocators.numba import RMMNumbaManager - >>> cuda.set_memory_manager(RMMNumbaManager) - ``` +// Get statistics +auto bytes = stats_mr.get_bytes_counter(); +std::cout << "Current bytes: " << bytes.value << "\n"; +std::cout << "Peak bytes: " << bytes.peak << "\n"; +std::cout << "Total bytes: " << bytes.total << "\n"; +```` +````{code-tab} python +import rmm -### Using RMM with PyTorch +# Wrap base resource with statistics adaptor +cuda_mr = rmm.mr.CudaAsyncMemoryResource() +stats_mr = rmm.mr.StatisticsResourceAdaptor(cuda_mr) +rmm.mr.set_current_device_resource(stats_mr) -You can configure -[PyTorch](https://pytorch.org/docs/stable/notes/cuda.html) to use RMM -for memory allocations using their by configuring the current -allocator. +# Allocate +buffer = rmm.DeviceBuffer(size=1024) -```python ->>> from rmm.allocators.torch import rmm_torch_allocator ->>> import torch +# Get statistics +stats = stats_mr.allocation_counts +print(f"Current bytes: {stats.current_bytes}") +print(f"Peak bytes: {stats.peak_bytes}") +print(f"Total bytes: {stats.total_bytes}") +```` +````` ->>> torch.cuda.memory.change_current_allocator(rmm_torch_allocator) -``` +### Logging -## Memory statistics and profiling +`````{tabs} +````{code-tab} c++ +#include +#include -RMM can profile memory usage and track memory statistics by using either of the following: - - Use the context manager `rmm.statistics.statistics()` to enable statistics tracking for a specific code block. - - Call `rmm.statistics.enable_statistics()` to enable statistics tracking globally. 
+rmm::mr::cuda_async_memory_resource cuda_mr; +rmm::mr::logging_resource_adaptor log_mr{cuda_mr, "allocations.csv"}; +rmm::mr::set_current_device_resource_ref(log_mr); -Common to both usages is that they modify the currently active RMM memory resource. The current device resource is wrapped with a `StatisticsResourceAdaptor` which must remain the topmost resource throughout the statistics tracking: -```python ->>> import rmm ->>> import rmm.statistics - ->>> # We start with the default CUDA memory resource ->>> rmm.mr.get_current_device_resource() - - ->>> # When using statistics, we get a StatisticsResourceAdaptor with the context ->>> with rmm.statistics.statistics(): -... rmm.mr.get_current_device_resource() - - ->>> # We can also enable statistics globally ->>> rmm.statistics.enable_statistics() ->>> print(rmm.mr.get_current_device_resource()) - +// All allocations logged to CSV +rmm::device_buffer buffer(1024, rmm::cuda_stream_default); +```` +````{code-tab} python +import rmm + +# Wrap the current resource with logging adaptor +base = rmm.mr.CudaAsyncMemoryResource() +log_mr = rmm.mr.LoggingResourceAdaptor(base, log_file_name="allocations.csv") +rmm.mr.set_current_device_resource(log_mr) + +# All allocations logged to CSV +buffer = rmm.DeviceBuffer(size=1024) +```` +````` + +CSV format: `Thread,Time,Action,Pointer,Size,Stream` + +See [Logging and Profiling](logging.md) for more details. 
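The CSV log can also be post-processed offline with standard tooling. Below is a minimal sketch, assuming the column layout shown above; the `peak_bytes` helper and the sample rows are illustrative, not real RMM output. It replays the `Action` and `Size` columns to recover the peak number of outstanding bytes:

```python
import csv
import io

# Sample log in the documented format (rows are made up for illustration)
log_text = """Thread,Time,Action,Pointer,Size,Stream
1,0.001,allocate,0x7f00,1024,0x0
1,0.002,allocate,0x7f01,2048,0x0
1,0.003,free,0x7f00,1024,0x0
"""

def peak_bytes(log_file):
    """Replay allocate/free events and return the peak outstanding bytes."""
    current = peak = 0
    for row in csv.DictReader(log_file):
        size = int(row["Size"])
        if row["Action"] == "allocate":
            current += size
            peak = max(peak, current)
        else:  # treat anything else as a free for this sketch
            current -= size
    return peak

print(peak_bytes(io.StringIO(log_text)))  # 3072
```

The same replay works on a real `allocations.csv` by passing an open file handle instead of the in-memory sample.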
+
+### Composing Resources
+
+Adaptors can be stacked to combine functionality:
+
+`````{tabs}
+````{code-tab} c++
+#include <rmm/mr/device/cuda_async_memory_resource.hpp>
+#include <rmm/mr/device/logging_resource_adaptor.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+#include <rmm/mr/device/pool_memory_resource.hpp>
+#include <rmm/mr/device/statistics_resource_adaptor.hpp>
+
+// Base resource
+rmm::mr::cuda_async_memory_resource cuda_mr;
+
+// Add pool
+rmm::mr::pool_memory_resource<rmm::mr::cuda_async_memory_resource> pool_mr{cuda_mr, 1ULL << 30};
+
+// Add statistics
+rmm::mr::statistics_resource_adaptor<decltype(pool_mr)> stats_mr{pool_mr};
+
+// Add logging
+rmm::mr::logging_resource_adaptor<decltype(stats_mr)> log_mr{stats_mr, "log.csv"};
+
+// Set as current
+rmm::mr::set_current_device_resource_ref(log_mr);
+````
+````{code-tab} python
+import rmm
+
+# Base resource
+cuda_mr = rmm.mr.CudaAsyncMemoryResource()
+
+# Add pool
+pool_mr = rmm.mr.PoolMemoryResource(cuda_mr, initial_pool_size=2**30)
+
+# Add statistics
+stats_mr = rmm.mr.StatisticsResourceAdaptor(pool_mr)
+
+# Add logging
+log_mr = rmm.mr.LoggingResourceAdaptor(stats_mr, log_file_name="log.csv")
+
+# Set as current
+rmm.mr.set_current_device_resource(log_mr)
+````
+`````
+
+Order matters: outer adaptors see all allocations from inner resources.
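To see why order matters, consider this toy sketch. The classes here are plain-Python stand-ins, not RMM APIs: a counting adaptor placed *outside* a pool observes every user request, while the same adaptor placed *under* the pool sees only the pool's one upstream allocation.

```python
# Stand-in classes (not RMM APIs) illustrating adaptor stacking order.
class Upstream:
    def allocate(self, nbytes):
        return object()  # pretend device pointer

class CountingAdaptor:
    """Counts every allocation request that passes through it."""
    def __init__(self, upstream):
        self.upstream = upstream
        self.count = 0
    def allocate(self, nbytes):
        self.count += 1
        return self.upstream.allocate(nbytes)

class Pool:
    """Grabs one big block up front, then suballocates without touching upstream."""
    def __init__(self, upstream, pool_size):
        self.block = upstream.allocate(pool_size)  # single upstream call
    def allocate(self, nbytes):
        return object()  # served from the pool

# Counter OUTSIDE the pool: sees each of the 100 user requests.
outer = CountingAdaptor(Pool(Upstream(), 1 << 20))

# Counter INSIDE (under) the pool: sees only the pool's initial allocation.
inner = CountingAdaptor(Upstream())
pool_over_inner = Pool(inner, 1 << 20)

for _ in range(100):
    outer.allocate(256)
    pool_over_inner.allocate(256)

print(outer.count, inner.count)  # 100 1
```

This is why diagnostic adaptors such as `StatisticsResourceAdaptor` are usually placed as the outermost layer of the stack.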
+ +## Library Integrations + +### Thrust (C++) + +Use `rmm::exec_policy` to make Thrust algorithms use RMM for temporary storage: + +```cpp +#include +#include +#include +#include + +rmm::cuda_stream stream; +rmm::device_uvector vec(1000, stream.view()); + +// Fill with descending values +thrust::sequence(rmm::exec_policy(stream.view()), + vec.begin(), vec.end(), vec.size() - 1, -1); + +// Sort using current device resource for temporary storage +thrust::sort(rmm::exec_policy(stream.view()), vec.begin(), vec.end()); + +// Or use a specific memory resource for temporary storage +rmm::mr::cuda_async_memory_resource custom_mr; +thrust::sort(rmm::exec_policy(stream.view(), custom_mr), vec.begin(), vec.end()); + +stream.synchronize(); ``` -With statistics enabled, you can query statistics of the current and peak bytes and number of allocations performed by the current RMM memory resource: +### CuPy (Python) + +Configure CuPy to use RMM for all device memory allocations: + ```python ->>> buf = rmm.DeviceBuffer(size=10) ->>> rmm.statistics.get_statistics() -Statistics(current_bytes=16, current_count=1, peak_bytes=16, peak_count=1, total_bytes=16, total_count=1) +import rmm +import cupy as cp +from rmm.allocators.cupy import rmm_cupy_allocator + +# Configure RMM +mr = rmm.mr.CudaAsyncMemoryResource() +rmm.mr.set_current_device_resource(mr) + +# Set CuPy to use RMM +cp.cuda.set_allocator(rmm_cupy_allocator) + +# All CuPy arrays now use RMM +array = cp.zeros(1000) ``` -### Memory Profiler -To profile a specific block of code, first enable memory statistics by calling `rmm.statistics.enable_statistics()`. To profile a function, use `profiler` as a function decorator: -```python ->>> @rmm.statistics.profiler() -... def f(size): -... 
rmm.DeviceBuffer(size=size) ->>> f(1000) +### Numba (Python) ->>> # By default, the profiler write to rmm.statistics.default_profiler_records ->>> print(rmm.statistics.default_profiler_records.report()) -Memory Profiling -================ +Configure Numba to use RMM for device memory in CUDA JIT-compiled functions: -Legends: - ncalls - number of times the function or code block was called - memory_peak - peak memory allocated in function or code block (in bytes) - memory_total - total memory allocated in function or code block (in bytes) +```python +from numba import cuda +from rmm.allocators.numba import RMMNumbaManager +import rmm -Ordered by: memory_peak +# Configure RMM +mr = rmm.mr.CudaAsyncMemoryResource() +rmm.mr.set_current_device_resource(mr) -ncalls memory_peak memory_total filename:lineno(function) - 1 1,008 1,008 :1(f) +# Set Numba to use RMM +cuda.set_memory_manager(RMMNumbaManager) ``` -To profile a code block, use `profiler` as a context manager: -```python ->>> with rmm.statistics.profiler(name="my code block"): -... rmm.DeviceBuffer(size=20) ->>> print(rmm.statistics.default_profiler_records.report()) -Memory Profiling -================ - -Legends: - ncalls - number of times the function or code block was called - memory_peak - peak memory allocated in function or code block (in bytes) - memory_total - total memory allocated in function or code block (in bytes) - -Ordered by: memory_peak - -ncalls memory_peak memory_total filename:lineno(function) - 1 1,008 1,008 :1(f) - 1 32 32 my code block +Or use the environment variable: + +```bash +NUMBA_CUDA_MEMORY_MANAGER=rmm.allocators.numba python script.py ``` -The `profiler` supports nesting: +### PyTorch (Python) + +Configure PyTorch to use RMM for CUDA tensor allocations: + ```python ->>> with rmm.statistics.profiler(name="outer"): -... buf1 = rmm.DeviceBuffer(size=10) -... with rmm.statistics.profiler(name="inner"): -... 
buf2 = rmm.DeviceBuffer(size=10)
->>> print(rmm.statistics.default_profiler_records.report())
-Memory Profiling
-================
-
-Legends:
- ncalls - number of times the function or code block was called
- memory_peak - peak memory allocated in function or code block (in bytes)
- memory_total - total memory allocated in function or code block (in bytes)
-
-Ordered by: memory_peak
-
-ncalls memory_peak memory_total filename:lineno(function)
- 1 1,008 1,008 :1(f)
- 1 32 32 my code block
- 1 32 32 outer
- 1 16 16 inner
+import rmm
+import torch
+from rmm.allocators.torch import rmm_torch_allocator
+
+# Configure RMM
+mr = rmm.mr.CudaAsyncMemoryResource()
+rmm.mr.set_current_device_resource(mr)
+
+# Set PyTorch to use RMM
+torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
+
+# All PyTorch tensors now use RMM
+tensor = torch.zeros(1000, device='cuda')
 ```
+
+## Multi-Device Usage
+
+For multi-GPU systems, each device can have its own memory resource. Use `set_per_device_resource_ref` (C++) or `set_per_device_resource` (Python) to configure each device before allocating memory on it:
+
+`````{tabs}
+````{code-tab} c++
+#include <rmm/cuda_device.hpp>
+#include <rmm/cuda_stream.hpp>
+#include <rmm/device_buffer.hpp>
+#include <rmm/mr/device/cuda_async_memory_resource.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+
+#include <cuda_runtime_api.h>
+
+#include <vector>
+
+int num_devices;
+cudaGetDeviceCount(&num_devices);
+
+// Store resources to maintain lifetime (resources are copyable value types)
+std::vector<rmm::mr::cuda_async_memory_resource> resources;
+
+for (int i = 0; i < num_devices; ++i) {
+  // Set device BEFORE creating resource
+  cudaSetDevice(i);
+
+  // Create resource for this device
+  resources.emplace_back();
+
+  // Set as per-device resource ref
+  rmm::mr::set_per_device_resource_ref(rmm::cuda_device_id{i}, resources.back());
+}
+
+// Use device 0
+cudaSetDevice(0);
+rmm::cuda_stream stream;
+rmm::device_buffer buffer(1024, stream.view()); // Uses device 0's resource
+````
+````{code-tab} python
+import rmm
+from cuda.bindings import runtime as cudart
+
+num_devices = cudart.cudaGetDeviceCount()[1]
+
+# Store resources to maintain lifetime
+resources = []
+
+for device_id in range(num_devices):
+    # Set device BEFORE creating resource
+    cudart.cudaSetDevice(device_id)
+
+    # Create resource for this device
+    mr = rmm.mr.CudaAsyncMemoryResource()
+    resources.append(mr)
+
+    # Set as per-device resource
+    rmm.mr.set_per_device_resource(device_id, mr)
+
+# Use device 0
+cudart.cudaSetDevice(0)
+buffer = rmm.DeviceBuffer(size=1024)  # Uses device 0's resource
+````
+`````
diff --git a/docs/user_guide/index.md b/docs/user_guide/index.md
new file mode 100644
index 000000000..bafb4d1e6
--- /dev/null
+++ b/docs/user_guide/index.md
@@ -0,0 +1,14 @@
+# User Guide
+
+```{toctree}
+:maxdepth: 2
+
+introduction
+installation
+guide
+choosing_memory_resources
+stream_ordered_allocation
+managed_memory
+pool_allocators
+logging
+```
diff --git a/docs/user_guide/installation.md b/docs/user_guide/installation.md
new file mode 100644
index 000000000..800872a25
--- /dev/null
+++ b/docs/user_guide/installation.md
@@ -0,0 +1,151 @@
+# Installation
+
+This guide covers installing RMM. For general RAPIDS installation instructions, which include RMM, see the [RAPIDS Installation Guide](https://docs.rapids.ai/install/).
+
+## System Requirements
+
+See the [RAPIDS Platform Support](https://docs.rapids.ai/platform-support/) page for the supported operating systems, CUDA versions, GPU architectures, and Python versions for each release.
+
+## Installing with conda
+
+The easiest way to install RMM and all of its dependencies is using conda. You can get a minimal conda installation with [miniforge](https://conda-forge.org/download/).
+
+### Stable Release
+
+Install the latest stable release:
+
+```bash
+conda install -c rapidsai -c conda-forge rmm cuda-version=13
+```
+
+### Nightly Builds
+
+For the latest development version, install from the nightly channel:
+
+```bash
+conda install -c rapidsai-nightly -c conda-forge rmm cuda-version=13
+```
+
+Nightly builds are created from the `main` branch and may contain unreleased features or bug fixes.
+
+## Installing with pip
+
+RMM can also be installed using pip. The CUDA driver must already be installed on your system.
+ +```bash +pip install rmm-cu13 # For CUDA 13 +# or +pip install rmm-cu12 # For CUDA 12 +``` + +## Building from Source + +Building from source gives you the latest features and allows you to customize the build. + +### Clone and Create Development Environment + +The conda environment files in `conda/environments/` pin all build prerequisites (compiler, CUDA toolkit, CMake, etc.) to known-good versions: + +```bash +git clone https://github.com/rapidsai/rmm.git +cd rmm + +# Create environment for CUDA 13 +conda env create --name rmm_dev --file conda/environments/all_cuda-131_arch-$(uname -m).yaml +conda activate rmm_dev +``` + +### Build Using build.sh + +RMM provides a convenience script `build.sh` that handles the build process. +The `build.sh` script is meant to be used with the developer conda environment above, which installs all prerequisites. + +```bash +# Show help +./build.sh -h + +# Build librmm without installing +./build.sh -n librmm + +# Build rmm Python package without installing +./build.sh -n rmm + +# Build and install both +./build.sh librmm rmm +``` + +## Using RMM in a Downstream CMake Project + +To use RMM in your own CMake project, add the following to your `CMakeLists.txt`: + +```cmake +find_package(rmm REQUIRED) + +# Link your target with RMM +target_link_libraries(your_target PRIVATE rmm::rmm) +``` + +If RMM is not installed in a default location, specify its path: + +```bash +cmake .. 
-Drmm_ROOT=/path/to/rmm/install
+```
+
+### Using CPM to Fetch RMM
+
+You can use CPM to fetch RMM as a dependency:
+
+```cmake
+include(CPM)
+
+CPMAddPackage(
+  NAME rmm
+  VERSION 26.06
+  GITHUB_REPOSITORY rapidsai/rmm
+  GIT_TAG main
+  SOURCE_SUBDIR cpp
+)
+
+target_link_libraries(your_target PRIVATE rmm::rmm)
+```
+
+## Testing Installation
+
+### C++
+
+Create a test file `test_rmm.cpp`:
+
+```cpp
+#include <rmm/cuda_stream_view.hpp>
+#include <rmm/device_buffer.hpp>
+#include <rmm/mr/device/cuda_memory_resource.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+
+#include <iostream>
+
+int main() {
+  auto mr = rmm::mr::cuda_memory_resource{};
+  rmm::mr::set_current_device_resource_ref(mr);
+
+  rmm::device_buffer buf(100, rmm::cuda_stream_view{});
+  std::cout << "Allocated " << buf.size() << " bytes\n";
+
+  return 0;
+}
+```
+
+Compile and run:
+
+```bash
+nvcc -std=c++17 -I/path/to/rmm/include test_rmm.cpp -o test_rmm
+./test_rmm
+```
+
+### Python
+
+```python
+import rmm
+print(rmm.__version__)
+
+# Quick test
+buffer = rmm.DeviceBuffer(size=100)
+print(f"Allocated {buffer.size} bytes")
+```
diff --git a/docs/user_guide/introduction.md b/docs/user_guide/introduction.md
new file mode 100644
index 000000000..19cac18f9
--- /dev/null
+++ b/docs/user_guide/introduction.md
@@ -0,0 +1,148 @@
+# Introduction to RMM
+
+**RMM (RAPIDS Memory Manager)** is a library for allocating and managing GPU memory in C++ and Python. It provides a flexible interface for customizing how device memory is allocated, along with efficient implementations and containers.
+
+## Purpose
+
+Achieving optimal performance in GPU-accelerated applications frequently requires customizing memory allocation strategies. For example:
+
+- Using **memory pools** to reduce the overhead of dynamic allocation
+- Using **managed memory** to work with datasets larger than GPU memory
+- Using **pinned host memory** for faster asynchronous CPU ↔ GPU transfers
+- Customizing allocation strategies for specific workload patterns
+
+RMM provides a unified interface, called a **memory resource**, which is a building block for GPU-accelerated applications.
+
+Memory resources provide a **minimal-overhead abstraction** over memory allocation that is **pluggable at runtime**, making it possible to debug, measure performance, and optimize a CUDA application without recompiling.
+Memory resources aim to serve the needs of a wide range of applications, from data science and machine learning to high-performance simulation.
+
+RMM's memory resources leverage CUDA features like **stream-ordered** (asynchronous) execution for pipeline parallelism, **managed** memory (also known as unified virtual memory, UVM), and **pinned** memory, making it easier to write complex workflows that optimally use both device and host memory.
+RMM's integrations allow a single memory resource to manage memory across libraries that are frequently used together, such as **PyTorch** and **RAPIDS**.
+
+## Key Features
+
+RMM is built around three main concepts.
+
+### 1. Memory Resources
+
+Memory resources provide a common abstraction for device memory allocation.
+The API of RMM's memory resources is based on the CCCL memory resource design to facilitate interoperability.
+
+The choice of resource determines the underlying type of memory and thus its accessibility from host or device.
+For example, the `cuda_async_memory_resource` uses a pool of memory managed by the CUDA driver.
+This resource is recommended for most applications because of its performance and support for asynchronous (stream-ordered) allocation. See [Stream-Ordered Allocation](stream_ordered_allocation.md) for details.
+As another example, the `managed_memory_resource` provides unified memory accessible from both CPU and GPU, and is recommended for applications whose working set exceeds the available GPU memory.
+
+See [Choosing a Memory Resource](choosing_memory_resources.md) for guidance on the available memory resources, performance considerations, and how they fit into efficient CUDA application design strategies.
+[NVIDIA Nsight™ Systems](https://developer.nvidia.com/nsight-systems) can be used to profile memory resource performance.
+
+### 2. Resource Adaptors
+
+Resource adaptors wrap and add functionality to existing resources.
+For example, the `statistics_resource_adaptor` can be used to track allocation statistics.
+The `logging_resource_adaptor` logs allocations to a CSV file.
+Adaptors are composable: wrap multiple adaptors for combined functionality.
+
+### 3. Containers
+
+RMM provides [RAII](https://en.cppreference.com/w/cpp/language/raii.html) container classes that manage memory lifetime.
+Using these containers avoids common problems with raw allocation, such as memory leaks or improper stream ordering.
+
+- `device_buffer`: Untyped device memory
+- `device_uvector`: Typed, uninitialized vector of device memory (trivially copyable types)
+- `device_scalar`: Single typed element
+
+All containers use stream-ordered allocation and work with any memory resource.
+
+## Basic Example
+
+### C++
+
+```cpp
+#include <rmm/cuda_stream.hpp>
+#include <rmm/device_buffer.hpp>
+#include <rmm/mr/device/cuda_async_memory_resource.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+
+// Use CUDA async memory pool
+auto async_mr = rmm::mr::cuda_async_memory_resource{};
+rmm::mr::set_current_device_resource_ref(async_mr);
+
+// Allocate device memory asynchronously
+rmm::cuda_stream stream;
+rmm::device_buffer buffer(1024, stream.view());
+stream.synchronize();
+```
+
+### Python
+
+```python
+import rmm
+
+# Use CUDA async memory pool
+mr = rmm.mr.CudaAsyncMemoryResource()
+rmm.mr.set_current_device_resource(mr)
+
+# Allocate device memory
+buffer = rmm.DeviceBuffer(size=1024)
+```
+
+## Integration with GPU Libraries
+
+RMM integrates seamlessly with popular GPU libraries:
+
+### PyTorch
+
+Set the PyTorch allocator to use the current device resource:
+
+```python
+import rmm
+import torch
+from rmm.allocators.torch import rmm_torch_allocator
+
+mr = rmm.mr.CudaAsyncMemoryResource()
+rmm.mr.set_current_device_resource(mr)
+torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
+```
+
+### CuPy
+
+Set the
CuPy allocator to use the current device resource: + +```python +import rmm +import cupy +from rmm.allocators.cupy import rmm_cupy_allocator + +mr = rmm.mr.CudaAsyncMemoryResource() +rmm.mr.set_current_device_resource(mr) +cupy.cuda.set_allocator(rmm_cupy_allocator) + +# CuPy allocations now use RMM +array = cupy.zeros(1000) +``` + +### Numba + +When launching a script: +```bash +NUMBA_CUDA_MEMORY_MANAGER=rmm.allocators.numba python script.py +``` + +Or from Python: + +```python +import rmm +from numba import cuda +from rmm.allocators.numba import RMMNumbaManager + +mr = rmm.mr.CudaAsyncMemoryResource() +rmm.mr.set_current_device_resource(mr) +cuda.set_memory_manager(RMMNumbaManager) +``` + +## Resources and Support + +- [RMM GitHub Repository](https://github.com/rapidsai/rmm): Source code and development +- [RMM Issue Tracker](https://github.com/rapidsai/rmm/issues): Report bugs or request features +- [RAPIDS Documentation](https://docs.rapids.ai): RAPIDS ecosystem docs +- [RAPIDS Installation Guide](https://docs.rapids.ai/install): Installation instructions +- [Developer Blog: Fast, Flexible Allocation](https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/): RMM design walkthrough +- [Developer Blog: Stream-Ordered Allocation](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/): Deep dive into stream-ordered semantics diff --git a/docs/user_guide/logging.md b/docs/user_guide/logging.md new file mode 100644 index 000000000..8f77bf034 --- /dev/null +++ b/docs/user_guide/logging.md @@ -0,0 +1,577 @@ +# Logging and Profiling + +RMM provides two types of logging: **memory event logging** for tracking allocations and deallocations, and **debug logging** for troubleshooting internal behavior. + +## Memory Event Logging + +Memory event logging writes details of every allocation and deallocation to a CSV file. 
This is useful for:
+- Debugging memory issues
+- Understanding allocation patterns
+- Profiling memory usage
+- Replaying workloads for benchmarking
+
+### Python: Using Memory Event Logging
+
+Enable logging by wrapping your memory resource with `LoggingResourceAdaptor`:
+
+```python
+import rmm
+
+# Wrap the current resource with logging adaptor
+base_mr = rmm.mr.CudaAsyncMemoryResource()
+log_mr = rmm.mr.LoggingResourceAdaptor(base_mr, log_file_name="memory_log.csv")
+rmm.mr.set_current_device_resource(log_mr)
+
+# Allocations are now logged
+buffer1 = rmm.DeviceBuffer(size=1024)
+buffer2 = rmm.DeviceBuffer(size=2048)
+
+# All allocations/deallocations written to memory_log.csv
+```
+
+If `log_file_name` is not provided, the environment variable `RMM_LOG_FILE` is used:
+
+```bash
+export RMM_LOG_FILE="allocations.csv"
+python script.py
+```
+
+### C++: Using logging_resource_adaptor
+
+Wrap any memory resource with `logging_resource_adaptor`:
+
+```cpp
+#include <rmm/cuda_stream.hpp>
+#include <rmm/device_buffer.hpp>
+#include <rmm/mr/device/cuda_async_memory_resource.hpp>
+#include <rmm/mr/device/logging_resource_adaptor.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+
+int main() {
+  // Create upstream resource
+  auto cuda_mr = rmm::mr::cuda_async_memory_resource{};
+
+  // Wrap with logging adaptor
+  auto log_mr = rmm::mr::logging_resource_adaptor{cuda_mr, "memory_log.csv"};
+
+  // Set as current resource
+  rmm::mr::set_current_device_resource_ref(log_mr);
+
+  // All allocations logged to CSV
+  rmm::cuda_stream stream;
+  rmm::device_buffer buffer(1024, stream.view());
+
+  return 0;
+}
+```
+
+If a filename is not provided, the `RMM_LOG_FILE` environment variable is checked:
+
+```bash
+export RMM_LOG_FILE="allocations.csv"
+./my_app
+```
+
+### CSV Log Format
+
+Each row represents an allocation or deallocation with the following columns:
+
+```
+Thread,Time,Action,Pointer,Size,Stream
+```
+
+Example:
+```
+Thread,Time,Action,Pointer,Size,Stream
+140573312345856,1634567890.123456,allocate,0x7f8a40000000,1024,0x7f8a38001020
+140573312345856,1634567890.234567,allocate,0x7f8a40000400,2048,0x7f8a38001020
+140573312345856,1634567890.345678,deallocate,0x7f8a40000000,1024,0x7f8a38001020 +``` + +- **Thread**: Thread ID performing the operation +- **Time**: Timestamp (seconds since epoch) +- **Action**: `allocate` or `deallocate` +- **Pointer**: Memory address +- **Size**: Allocation size in bytes +- **Stream**: CUDA stream pointer + +### Analyzing Logs + +You can parse and analyze logs with Python: + +```python +import pandas as pd + +# Read log file +df = pd.read_csv("memory_log.csv") + +# Total bytes allocated +total_allocated = df[df['Action'] == 'allocate']['Size'].sum() +print(f"Total allocated: {total_allocated:,} bytes") + +# Allocation size distribution +print(df[df['Action'] == 'allocate']['Size'].describe()) + +# Peak memory usage (simple analysis) +df['Delta'] = df.apply( + lambda row: row['Size'] if row['Action'] == 'allocate' else -row['Size'], + axis=1 +) +df['Cumulative'] = df['Delta'].cumsum() +peak = df['Cumulative'].max() +print(f"Peak usage: {peak:,} bytes") +``` + +### Replay Benchmark + +When building RMM from source, logs can be used with `REPLAY_BENCHMARK`: + +```bash +cd build/gbenchmarks +./REPLAY_BENCHMARK --log_file=memory_log.csv +``` + +This replays the allocation pattern from the log, useful for: +- Benchmarking different memory resources +- Testing allocator implementations +- Profiling allocation overhead + +## Memory Statistics + +RMM provides statistics tracking for allocations using `statistics_resource_adaptor`. 
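The mechanism behind the adaptor is simple: forward each request to the upstream resource while updating counters on the way through. The real adaptor is a C++ class; the following plain-Python sketch (all names hypothetical) only illustrates the pattern:

```python
# Hypothetical sketch of the adaptor pattern used by statistics_resource_adaptor.
class StatsAdaptor:
    def __init__(self, upstream_allocate, upstream_deallocate):
        self.upstream_allocate = upstream_allocate
        self.upstream_deallocate = upstream_deallocate
        self.current_bytes = 0
        self.peak_bytes = 0
        self.total_bytes = 0

    def allocate(self, nbytes):
        # Forward to the upstream resource, then update counters
        ptr = self.upstream_allocate(nbytes)
        self.current_bytes += nbytes
        self.peak_bytes = max(self.peak_bytes, self.current_bytes)
        self.total_bytes += nbytes
        return ptr

    def deallocate(self, ptr, nbytes):
        self.upstream_deallocate(ptr)
        self.current_bytes -= nbytes

# A stand-in "upstream" so the sketch runs anywhere
mr = StatsAdaptor(upstream_allocate=lambda n: object(),
                  upstream_deallocate=lambda p: None)
a = mr.allocate(1024)
b = mr.allocate(2048)
mr.deallocate(a, 1024)
print(mr.current_bytes, mr.peak_bytes, mr.total_bytes)  # 2048 3072 3072
```

Because the adaptor exposes the same allocate/deallocate interface as the resource it wraps, it can be stacked under or over other adaptors without the rest of the application noticing.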
+
+### Python: Enabling Statistics
+
+```python
+import rmm
+
+# Enable statistics globally
+rmm.statistics.enable_statistics()
+
+# Or use context manager for specific code blocks
+with rmm.statistics.statistics():
+    buffer = rmm.DeviceBuffer(size=1024)
+
+    # Get current statistics
+    stats = rmm.statistics.get_statistics()
+    print(f"Current bytes: {stats.current_bytes}")
+    print(f"Peak bytes: {stats.peak_bytes}")
+    print(f"Total allocations: {stats.total_count}")
+```
+
+Available statistics:
+
+```python
+class Statistics:
+    current_bytes: int  # Currently allocated bytes
+    current_count: int  # Number of active allocations
+    peak_bytes: int     # Peak bytes allocated
+    peak_count: int     # Peak number of allocations
+    total_bytes: int    # Total bytes ever allocated
+    total_count: int    # Total number of allocations
+```
+
+### C++: Using statistics_resource_adaptor
+
+```cpp
+#include <rmm/cuda_stream.hpp>
+#include <rmm/device_buffer.hpp>
+#include <rmm/mr/device/cuda_async_memory_resource.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+#include <rmm/mr/device/statistics_resource_adaptor.hpp>
+
+#include <iostream>
+
+int main() {
+  auto cuda_mr = rmm::mr::cuda_async_memory_resource{};
+  auto stats_mr = rmm::mr::statistics_resource_adaptor{cuda_mr};
+  rmm::mr::set_current_device_resource_ref(stats_mr);
+
+  // Allocate
+  rmm::cuda_stream stream;
+  rmm::device_buffer buffer1(1024, stream.view());
+  rmm::device_buffer buffer2(2048, stream.view());
+
+  // Get statistics
+  auto bytes = stats_mr.get_bytes_counter();
+  auto allocs = stats_mr.get_allocations_counter();
+  std::cout << "Current bytes: " << bytes.value << "\n";
+  std::cout << "Peak bytes: " << bytes.peak << "\n";
+  std::cout << "Allocation count: " << allocs.value << "\n";
+
+  return 0;
+}
+```
+
+### Tracking Memory Growth
+
+Monitor memory usage over time:
+
+```python
+import rmm
+
+rmm.statistics.enable_statistics()
+
+def checkpoint(label):
+    stats = rmm.statistics.get_statistics()
+    print(f"{label}:")
+    print(f"  Current: {stats.current_bytes:,} bytes ({stats.current_count} allocations)")
+    print(f"  Peak: {stats.peak_bytes:,} bytes")
+
+checkpoint("Start")
+
+# Allocate
+buffers =
[rmm.DeviceBuffer(size=1024*1024) for _ in range(10)] +checkpoint("After 10x1MB allocations") + +# Free some +buffers = buffers[:5] +checkpoint("After freeing 5") + +# Allocate more +buffers.extend([rmm.DeviceBuffer(size=2*1024*1024) for _ in range(5)]) +checkpoint("After 5x2MB allocations") +``` + +## Memory Profiling + +The memory profiler tracks allocations by function/code block. + +### Python: Using the Profiler + +#### Profiling Functions + +```python +import rmm + +# Enable statistics first +rmm.statistics.enable_statistics() + +# Profile a function +@rmm.statistics.profiler() +def process_data(size): + buffer = rmm.DeviceBuffer(size=size) + # ... processing ... + return buffer + +# Run function +process_data(1000000) + +# View report +print(rmm.statistics.default_profiler_records.report()) +``` + +Output: +``` +Memory Profiling +================ + +Legends: + ncalls - number of times the function or code block was called + memory_peak - peak memory allocated in function or code block (in bytes) + memory_total - total memory allocated in function or code block (in bytes) + +Ordered by: memory_peak + +ncalls memory_peak memory_total filename:lineno(function) + 1 1,000,016 1,000,016 script.py:5(process_data) +``` + +#### Profiling Code Blocks + +```python +import rmm + +rmm.statistics.enable_statistics() + +# Profile specific code blocks +with rmm.statistics.profiler(name="data loading"): + data = rmm.DeviceBuffer(size=1000000) + +with rmm.statistics.profiler(name="processing"): + buffer1 = rmm.DeviceBuffer(size=500000) + buffer2 = rmm.DeviceBuffer(size=500000) + +# View report +print(rmm.statistics.default_profiler_records.report()) +``` + +Output: +``` +ncalls memory_peak memory_total filename:lineno(function) + 1 1,000,016 1,000,016 data loading + 1 1,000,032 1,000,032 processing +``` + +#### Nested Profiling + +```python +import rmm + +rmm.statistics.enable_statistics() + +with rmm.statistics.profiler(name="outer"): + buffer1 = rmm.DeviceBuffer(size=1000) 
+ + with rmm.statistics.profiler(name="inner"): + buffer2 = rmm.DeviceBuffer(size=2000) + + buffer3 = rmm.DeviceBuffer(size=500) + +print(rmm.statistics.default_profiler_records.report()) +``` + +Output shows both nested and total allocations: +``` +ncalls memory_peak memory_total filename:lineno(function) + 1 3,520 3,520 outer + 1 2,016 2,016 inner +``` + +### Custom Profiler Records + +Use custom profiler records for separate tracking: + +```python +import rmm + +rmm.statistics.enable_statistics() + +# Create custom profiler records +custom_records = rmm.statistics.profiler_records() + +# Use with context manager +with rmm.statistics.profiler(name="my operation", records=custom_records): + buffer = rmm.DeviceBuffer(size=1024) + +# View only custom records +print(custom_records.report()) +``` + +## Debug Logging + +RMM uses [rapids-logger](https://github.com/rapidsai/rapids-logger) for debug output. + +### Enabling Debug Logging + +Debug logs show internal RMM behavior, errors, and warnings. + +#### Output Location + +By default, logs go to stderr. Set `RMM_DEBUG_LOG_FILE` to write to a file: + +```bash +export RMM_DEBUG_LOG_FILE=/path/to/rmm_debug.log +``` + +#### Log Levels + +Set at **compile time** with CMake: + +```bash +cmake .. 
-DRMM_LOGGING_LEVEL=DEBUG +``` + +Available levels (increasing verbosity): +- `OFF` - No logging +- `CRITICAL` - Only critical errors +- `ERROR` - Errors +- `WARN` - Warnings and errors +- `INFO` - Informational messages (default) +- `DEBUG` - Detailed debug info +- `TRACE` - Very verbose tracing + +#### Runtime Log Level (Python) + +Even with verbose logging compiled in, you must enable it at runtime: + +```python +import rmm + +# Enable all logging down to TRACE level +rmm.set_logging_level("trace") + +# Now you'll see TRACE and DEBUG messages +``` + +Available Python levels: `"trace"`, `"debug"`, `"info"`, `"warn"`, `"error"`, `"critical"`, `"off"` + +#### Runtime Log Level (C++) + +```cpp +#include + +int main() { + // Enable all logging down to TRACE level + rmm::default_logger().set_level(rapids_logger::level_enum::trace); + + // Your code here + + return 0; +} +``` + +### What Gets Logged + +Debug logging shows: +- Memory resource initialization +- Allocation failures and errors +- Pool growth and shrinkage +- Stream synchronization events +- Multi-device operations +- Internal state changes + +Example debug output: +``` +[2024-01-15 10:30:45.123] [info] Initializing cuda_async_memory_resource +[2024-01-15 10:30:45.234] [debug] pool_memory_resource: allocated 1 GiB from upstream +[2024-01-15 10:30:45.345] [warn] Allocation of 10 GiB failed, pool exhausted +[2024-01-15 10:30:45.456] [debug] Growing pool by 2 GiB +``` + +## Combining Logging Features + +Use multiple logging features together: + +```python +import rmm + +# Enable memory event logging by wrapping with adaptor +base_mr = rmm.mr.CudaAsyncMemoryResource() +log_mr = rmm.mr.LoggingResourceAdaptor(base_mr, log_file_name="events.csv") +rmm.mr.set_current_device_resource(log_mr) + +# Enable statistics and profiling +rmm.statistics.enable_statistics() + +# Set debug log level +rmm.set_logging_level("debug") + +# Now all logging is active +@rmm.statistics.profiler() +def my_function(): + buffer = 
rmm.DeviceBuffer(size=1024)
+    return buffer
+
+my_function()
+
+# Get statistics
+stats = rmm.statistics.get_statistics()
+print(f"Peak bytes: {stats.peak_bytes}")
+
+# View profiler report
+print(rmm.statistics.default_profiler_records.report())
+```
+
+C++ equivalent:
+
+```cpp
+#include <rmm/cuda_stream.hpp>
+#include <rmm/device_buffer.hpp>
+#include <rmm/logger.hpp>
+#include <rmm/mr/device/cuda_async_memory_resource.hpp>
+#include <rmm/mr/device/logging_resource_adaptor.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+#include <rmm/mr/device/statistics_resource_adaptor.hpp>
+
+#include <iostream>
+
+int main() {
+  // Set debug log level
+  rmm::default_logger().set_level(rapids_logger::level_enum::debug);
+
+  // Build resource stack
+  auto cuda_mr = rmm::mr::cuda_async_memory_resource{};
+  auto stats_mr = rmm::mr::statistics_resource_adaptor{cuda_mr};
+  auto log_mr = rmm::mr::logging_resource_adaptor{stats_mr, "events.csv"};
+
+  rmm::mr::set_current_device_resource_ref(log_mr);
+
+  // Now all logging is active
+  rmm::cuda_stream stream;
+  rmm::device_buffer buffer(1024, stream.view());
+
+  // Get statistics
+  auto bytes = stats_mr.get_bytes_counter();
+  std::cout << "Peak bytes: " << bytes.peak << "\n";
+
+  return 0;
+}
+```
+
+## Use Cases
+
+### Debugging OOM Errors
+
+```python
+import rmm
+
+# Enable detailed logging
+base_mr = rmm.mr.CudaAsyncMemoryResource()
+log_mr = rmm.mr.LoggingResourceAdaptor(base_mr, log_file_name="oom_debug.csv")
+rmm.mr.set_current_device_resource(log_mr)
+rmm.set_logging_level("debug")
+rmm.statistics.enable_statistics()
+
+# Run problematic code
+try:
+    large_buffer = rmm.DeviceBuffer(size=100 * 2**30)  # 100 GiB
+except MemoryError:
+    stats = rmm.statistics.get_statistics()
+    print(f"Peak before OOM: {stats.peak_bytes / 2**30:.2f} GiB")
+    print("Check oom_debug.csv for allocation history")
+    raise
+```
+
+### Profiling Memory in Data Pipeline
+
+```python
+import rmm
+
+rmm.statistics.enable_statistics()
+
+@rmm.statistics.profiler()
+def load_data():
+    return rmm.DeviceBuffer(size=1000000)
+
+@rmm.statistics.profiler()
+def process_data(buffer):
+    temp = rmm.DeviceBuffer(size=2000000)
+    result = rmm.DeviceBuffer(size=500000)
+    return result
+
+@rmm.statistics.profiler()
+def save_data(buffer):
+    pass
+
+# Run pipeline +data = load_data() +result = process_data(data) +save_data(result) + +# Identify memory hotspots +print(rmm.statistics.default_profiler_records.report()) +``` + +### Benchmarking Memory Resources + +```python +import rmm +import time + +def benchmark_allocations(mr_name, mr): + rmm.mr.set_current_device_resource(mr) + + start = time.time() + buffers = [] + for _ in range(1000): + buffers.append(rmm.DeviceBuffer(size=1024)) + end = time.time() + + print(f"{mr_name}: {(end - start) * 1000:.2f} ms for 1000 allocations") + +# Compare resources +benchmark_allocations("CudaMemoryResource", rmm.mr.CudaMemoryResource()) +benchmark_allocations("CudaAsyncMemoryResource", rmm.mr.CudaAsyncMemoryResource()) +pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaAsyncMemoryResource(), initial_pool_size=2**20) +benchmark_allocations("PoolMemoryResource", pool) +``` + +## Best Practices + +1. **Use event logging for debugging** - CSV logs help understand allocation patterns +2. **Enable statistics for profiling** - Track memory usage over time +3. **Use profiler for hotspot analysis** - Identify which functions allocate most memory +4. **Set appropriate debug level** - Use `INFO` normally, `DEBUG`/`TRACE` when troubleshooting +5. **Disable logging in production** - Logging has overhead; only enable when needed +6. **Analyze logs with tools** - Use pandas, REPLAY_BENCHMARK, or custom scripts +7. **Combine with NVIDIA tools** - Use [NVIDIA Nsight™ Systems](https://developer.nvidia.com/nsight-systems) alongside RMM logging for a complete picture diff --git a/docs/user_guide/managed_memory.md b/docs/user_guide/managed_memory.md new file mode 100644 index 000000000..5d4c7529e --- /dev/null +++ b/docs/user_guide/managed_memory.md @@ -0,0 +1,343 @@ +# Managed Memory and Prefetching + +CUDA Managed Memory (also called Unified Memory) allows memory to be accessed from both CPU and GPU, with automatic page migration managed by the CUDA driver. 
RMM provides `ManagedMemoryResource` to leverage this capability.
+
+## What is Managed Memory?
+
+Managed memory creates a single address space accessible from both CPU and GPU:
+
+- Allocations can be accessed using the same pointer from host or device code
+- The CUDA driver automatically migrates pages between CPU and GPU as needed
+- Enables working with datasets **larger than GPU memory**
+
+## When to Use Managed Memory
+
+Managed memory is ideal for:
+
+1. **Datasets larger than GPU memory**: When your data doesn't fit in VRAM
+2. **Prototyping**: Simplifies development by removing explicit memory transfers
+3. **CPU-GPU interoperability**: When you need to access the same data from both host and device
+
+**Important**: Managed memory has performance implications. Always combine with prefetching for production workloads.
+
+## Basic Usage
+
+### Python
+
+```python
+import rmm
+
+# Use managed memory as the default resource
+rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
+
+# Allocations now use managed memory
+buffer = rmm.DeviceBuffer(size=1000000)
+```
+
+### C++
+
+```cpp
+#include <rmm/cuda_stream.hpp>
+#include <rmm/device_buffer.hpp>
+#include <rmm/mr/device/managed_memory_resource.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+
+auto managed_mr = rmm::mr::managed_memory_resource{};
+rmm::mr::set_current_device_resource_ref(managed_mr);
+
+// Allocations use managed memory
+rmm::cuda_stream stream;
+rmm::device_buffer buffer(1000000, stream.view());
+```
+
+## Performance Considerations
+
+### Page Faults and Migration
+
+When the GPU accesses managed memory that is not resident on the GPU, a **page fault** occurs:
+
+1. GPU execution pauses
+2. The driver migrates the page from CPU to GPU
+3. GPU execution resumes
+
+These page faults can significantly impact performance, especially for:
+
+- First-touch access patterns
+- Random memory access
+- Large datasets that don't fit in GPU memory
+
+### The Prefetching Solution
+
+**Prefetching** explicitly migrates data to the GPU before it's accessed, eliminating page faults.
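As a rough back-of-envelope model of why this matters (all numbers below are assumptions for illustration, not measurements), demand paging pays a per-fault latency on top of the transfer, while a bulk prefetch is a single bandwidth-bound copy:

```python
# Illustrative cost model with assumed numbers; real fault latencies,
# migration granularity, and achievable bandwidth vary by system.
data_bytes = 1 << 30          # 1 GiB to migrate
page_bytes = 2 << 20          # assumed 2 MiB migration granularity
fault_latency_s = 20e-6       # assumed cost per GPU page fault
pcie_bw = 24e9                # assumed ~24 GB/s effective (PCIe Gen4 x16)

copy_time = data_bytes / pcie_bw
num_faults = data_bytes // page_bytes
demand_paging_time = copy_time + num_faults * fault_latency_s

print(f"bulk prefetch : {copy_time * 1e3:.1f} ms")
print(f"demand paging : {demand_paging_time * 1e3:.1f} ms")
```

The gap grows with smaller migration granularity and higher fault latency, which is why eliminating faults via prefetching pays off most for large, sequentially consumed allocations.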
+ +## Prefetching Strategies + +There are two main strategies for prefetching: + +### 1. Prefetch on Allocate (Eager Prefetching) + +Automatically prefetch memory to the GPU when it's allocated. This is useful when you know the data will be used on the GPU immediately after allocation. + +**Implementation: Use `PrefetchResourceAdaptor`** + +```python +import rmm + +# Wrap managed memory with prefetch adaptor +base = rmm.mr.ManagedMemoryResource() +prefetch_mr = rmm.mr.PrefetchResourceAdaptor(base) +rmm.mr.set_current_device_resource(prefetch_mr) + +# Every allocation is automatically prefetched to the GPU +buffer = rmm.DeviceBuffer(size=1000000) +# Buffer is already on the GPU, no page faults on first access +``` + +**With a pool:** + +```python +import rmm + +# Combine managed memory, pool, and prefetching +base = rmm.mr.ManagedMemoryResource() +pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30) +prefetch_mr = rmm.mr.PrefetchResourceAdaptor(pool) +rmm.mr.set_current_device_resource(prefetch_mr) +``` + +**When to use:** +- Allocations are immediately used on the GPU +- You want automatic prefetching without code changes + +### 2. Prefetch on Access (Lazy Prefetching) + +Explicitly prefetch data just before it's used in a kernel. This gives finer control and can optimize for specific access patterns. + +**Implementation: Manual prefetch calls** + +```python +import rmm + +rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource()) + +# Allocate managed memory (not prefetched yet) +buffer = rmm.DeviceBuffer(size=1000000) + +# ... later, just before using on GPU ... +stream = rmm.cuda_stream() +buffer.prefetch(device=0, stream=stream) # Prefetch to device 0 + +# Launch kernel on the same stream +# ... kernel will not page fault ... 
+```
+
+**In C++:**
+
+```cpp
+#include <rmm/cuda_device.hpp>
+#include <rmm/cuda_stream.hpp>
+#include <rmm/device_buffer.hpp>
+#include <rmm/mr/device/managed_memory_resource.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+#include <rmm/prefetch.hpp>
+
+auto managed_mr = rmm::mr::managed_memory_resource{};
+rmm::mr::set_current_device_resource_ref(managed_mr);
+
+rmm::cuda_stream stream;
+rmm::device_buffer buffer(1000000, stream.view());
+
+// Prefetch before using
+rmm::prefetch(buffer.data(), buffer.size(),
+              rmm::get_current_cuda_device(), stream.view());
+
+// Launch kernel on the same stream (grid/block sizes defined elsewhere)
+launch_kernel<<<grid, block, 0, stream.value()>>>(buffer.data());
+```
+
+**When to use:**
+- You need fine-grained control over when data is prefetched
+- Access patterns are complex or dynamic
+- You're optimizing for specific workload characteristics
+
+## Practical Example: PyTorch with Larger-Than-VRAM Models
+
+Here's how to use managed memory with PyTorch to work with models or data larger than GPU memory:
+
+```python
+import rmm
+import torch
+from rmm.allocators.torch import rmm_torch_allocator
+
+# Use managed memory with prefetching
+base = rmm.mr.ManagedMemoryResource()
+pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30, maximum_pool_size=2**34)
+prefetch_mr = rmm.mr.PrefetchResourceAdaptor(pool)
+rmm.mr.set_current_device_resource(prefetch_mr)
+
+# Configure PyTorch to use RMM
+torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
+
+# Now you can work with larger-than-VRAM data
+# Example: Large tensor that doesn't fit in VRAM
+large_tensor = torch.randn(100000, 100000, device='cuda')  # ~40 GB
+
+# Operations will automatically page as needed
+result = large_tensor @ large_tensor.T
+```
+
+**What happens:**
+1. RMM allocates managed memory for tensors
+2. The prefetch adaptor prefetches to GPU on allocation
+3. If memory exceeds GPU capacity, pages migrate between CPU and GPU
+4. Performance is better than without prefetching
+
+## Prefetching Best Practices
+
+### 1. 
Prefetch Adaptor Should Be Outermost + +When composing memory resources, always make the prefetch adaptor the outermost layer: + +```python +# Correct: Prefetch is outermost +base = rmm.mr.ManagedMemoryResource() +pool = rmm.mr.PoolMemoryResource(base, initial_pool_size=2**30) +stats = rmm.mr.StatisticsResourceAdaptor(pool) +prefetch_mr = rmm.mr.PrefetchResourceAdaptor(stats) # Outermost +rmm.mr.set_current_device_resource(prefetch_mr) + +# Incorrect: Prefetch is not outermost +base = rmm.mr.ManagedMemoryResource() +prefetch_mr = rmm.mr.PrefetchResourceAdaptor(base) +pool = rmm.mr.PoolMemoryResource(prefetch_mr, initial_pool_size=2**30) # Wrong! +``` + +### 2. Prefetch on the Correct Stream + +When manually prefetching, use the same stream as the subsequent kernel: + +```python +stream = rmm.cuda_stream() + +# Prefetch on stream +buffer.prefetch(device=0, stream=stream) + +# Use on the same stream +with stream: + # ... operations using buffer ... +``` + +### 3. Prefetch Size Considerations + +Prefetching is most effective when: +- The prefetch size is large enough to amortize the migration cost +- Data is used shortly after prefetching +- Access patterns are predictable + +### 4. Profile and Measure + +Always profile to verify that prefetching improves performance: + +```python +import rmm +import time + +# Without prefetching +rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource()) +buffer = rmm.DeviceBuffer(size=10**9) +start = time.time() +# ... run workload ... +print(f"Without prefetch: {time.time() - start:.2f}s") + +# With prefetching +base = rmm.mr.ManagedMemoryResource() +prefetch_mr = rmm.mr.PrefetchResourceAdaptor(base) +rmm.mr.set_current_device_resource(prefetch_mr) +buffer = rmm.DeviceBuffer(size=10**9) +start = time.time() +# ... run workload ... 
+print(f"With prefetch: {time.time() - start:.2f}s") +``` + +Use [NVIDIA Nsight™ Systems](https://developer.nvidia.com/nsight-systems) to visualize page faults and data migration: + +```bash +nsys profile -o output python your_script.py +``` + +When using `compute-sanitizer` with managed memory, you may need to enable page fault tracking: + +```bash +compute-sanitizer --tool memcheck \ + --cuda-um-cpu-page-faults=true \ + --cuda-um-gpu-page-faults=true \ + python your_script.py +``` + +## Managed Memory Limitations + +### 1. Not Stream-Ordered + +`ManagedMemoryResource` uses `cudaMallocManaged`, which is **synchronous**. Allocations block until complete, unlike stream-ordered resources. + +For better performance in multi-stream applications, use `CudaAsyncMemoryResource` instead. + +### 2. Performance Overhead + +Even with prefetching, managed memory has overhead compared to explicit memory management: +- Page fault handling +- Driver page migration +- Potential CPU-GPU transfer latency + +For performance-critical code with data that fits in GPU memory, prefer `CudaAsyncMemoryResource`. + +### 3. 
PCIe Bandwidth Limitation
+
+If your workload constantly migrates data between CPU and GPU, you're limited by PCIe bandwidth:
+- PCIe Gen3 x16: ~12 GB/s
+- PCIe Gen4 x16: ~24 GB/s
+- PCIe Gen5 x16: ~48 GB/s
+
+For such workloads, consider:
+- Algorithmic changes to reduce data movement
+- Using system memory as a staging area
+- Streaming data in smaller chunks
+
+## Comparison: Prefetch Strategies
+
+| Strategy | Advantages | Disadvantages | Use Case |
+|----------|-----------|---------------|----------|
+| **PrefetchResourceAdaptor** | Automatic, no code changes | Prefetches everything, even if not needed | General-purpose, allocate-and-use patterns |
+| **Manual prefetch** | Fine-grained control, can optimize specific patterns | Requires code changes | Complex access patterns, performance tuning |
+| **No prefetching** | Simple | High page fault overhead | Prototyping only, not for production |
+
+## Multi-GPU Considerations
+
+When using managed memory with multiple GPUs:
+
+```python
+import rmm
+from cuda.bindings import runtime as cudart
+
+# Set up managed memory on each device
+for device_id in [0, 1]:
+    cudart.cudaSetDevice(device_id)
+    base = rmm.mr.ManagedMemoryResource()
+    prefetch_mr = rmm.mr.PrefetchResourceAdaptor(base)
+    rmm.mr.set_per_device_resource(device_id, prefetch_mr)
+
+# Streams used to issue the prefetches
+stream_0 = rmm.cuda_stream()
+stream_1 = rmm.cuda_stream()
+
+# Prefetch to specific devices
+buffer = rmm.DeviceBuffer(size=1000000)
+buffer.prefetch(device=0, stream=stream_0)  # Prefetch to GPU 0
+buffer.prefetch(device=1, stream=stream_1)  # Prefetch to GPU 1
+```
+
+## Summary
+
+- Managed memory enables larger-than-VRAM workloads and simplifies CPU-GPU interoperability
+- Always use prefetching in production to avoid page fault overhead
+- Use `PrefetchResourceAdaptor` for automatic, eager prefetching
+- Use manual `prefetch()` calls for fine-grained control
+- Profile with Nsight Systems to measure page fault overhead
+- For best performance with data that fits in VRAM, use `CudaAsyncMemoryResource` instead
+
+## See Also
+
+- 
[Choosing a Memory Resource](choosing_memory_resources.md) - When to use managed memory vs. other resources +- [Stream-Ordered Allocation](stream_ordered_allocation.md) - Understanding asynchronous allocation semantics +- [NVIDIA Developer Blog: Unified Memory](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) +- [NVIDIA Developer Blog: Memory Oversubscription](https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/) diff --git a/docs/user_guide/pool_allocators.md b/docs/user_guide/pool_allocators.md new file mode 100644 index 000000000..9aa168029 --- /dev/null +++ b/docs/user_guide/pool_allocators.md @@ -0,0 +1,455 @@ +# Pool Memory Allocators + +Pool allocators maintain a "pool" of pre-allocated memory to enable fast suballocation without repeatedly calling the underlying memory allocation API. RMM provides several pool-based memory resources, each with different characteristics and use cases. + +## Why Use Pool Allocators? + +Direct allocation (e.g., `cudaMalloc`) has overhead: +- Requires driver synchronization +- Can be slow for small, frequent allocations +- Forces serialization of allocation requests + +Pool allocators address this by: +- Pre-allocating large blocks of memory +- Suballocating from the pool without driver calls +- Reusing freed memory for new allocations + +## RMM's Pool Allocators + +RMM provides three main pool-like allocators: + +1. **`CudaAsyncMemoryResource`**: Driver-managed pool (recommended default) +2. **`PoolMemoryResource`**: RMM-managed coalescing pool +3. **`ArenaMemoryResource`**: Size-binned arena pool + +## CudaAsyncMemoryResource (Recommended) + +The `CudaAsyncMemoryResource` uses CUDA's driver-managed memory pool via `cudaMallocAsync`. 
+
+**Advantages:**
+- Virtual address space management (avoids fragmentation)
+- Shared across all libraries in the same process that use the driver's pool
+- Stream-ordered allocation
+- No manual tuning of pool sizes
+
+**Example:**
+```python
+import rmm
+
+rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
+```
+
+**When to use:** Default choice for most applications. See [Choosing a Memory Resource](choosing_memory_resources.md) for details.
+
+## PoolMemoryResource
+
+The `PoolMemoryResource` wraps an upstream memory resource and maintains a pool using a coalescing best-fit allocator.
+
+### Configuration
+
+```python
+import rmm
+
+pool = rmm.mr.PoolMemoryResource(
+    rmm.mr.CudaMemoryResource(),  # or CudaAsyncMemoryResource
+    initial_pool_size=2**30,  # 1 GiB - initial allocation
+    maximum_pool_size=2**32   # 4 GiB - max the pool can grow to
+)
+rmm.mr.set_current_device_resource(pool)
+```
+
+### Parameters
+
+- **`upstream`**: The underlying memory resource to allocate from
+  - Use `CudaAsyncMemoryResource()` for best results
+  - `CudaMemoryResource()` for basic CUDA memory
+  - Can be any memory resource (including another pool!)
+
+- **`initial_pool_size`**: Size of the initial allocation
+  - Larger values reduce early-stage growth overhead
+  - Should be based on your typical memory usage
+  - Use string notation: `"1GiB"`, `"512MiB"`, etc.
+  - Or use powers of 2: `2**30` (1 GiB)
+
+- **`maximum_pool_size`**: Maximum size the pool can grow to
+  - Acts as a limit on total GPU memory usage
+  - `None` means no limit (pool can grow until GPU memory is exhausted)
+  - Useful for multi-tenant or multi-process scenarios
+
+### How It Works
+
+1. **Initial allocation**: On first use, allocates `initial_pool_size` from upstream
+2. **Suballocation**: Subsequent allocations are served from the pool
+3. **Growth**: If pool is exhausted, allocates more from upstream
+4. **Coalescing**: Adjacent freed blocks are merged to reduce fragmentation
+5. 
**Shrinking**: The pool does **not** automatically return memory to upstream
+
+### Best Practices
+
+#### 1. Choose Appropriate Pool Sizes
+
+**Initial pool size:**
+- Profile your application to understand memory usage
+- Set initial size to ~80% of typical peak usage
+- Too small: frequent growth overhead
+- Too large: wastes memory, longer startup
+
+**Example:**
+```python
+import rmm
+
+# For an application that typically uses 2 GiB
+# Pool sizes must be multiples of 256 bytes, so round down
+initial_size = (int(1.6 * 2**30) // 256) * 256  # ~1.6 GiB
+pool = rmm.mr.PoolMemoryResource(
+    rmm.mr.CudaAsyncMemoryResource(),
+    initial_pool_size=initial_size,
+    maximum_pool_size=4 * 2**30  # 4 GiB max
+)
+rmm.mr.set_current_device_resource(pool)
+```
+
+#### 2. Prefer Async MR as Upstream
+
+Wrapping `CudaAsyncMemoryResource` combines benefits:
+
+```python
+# Good: Pool wrapping async MR
+pool = rmm.mr.PoolMemoryResource(
+    rmm.mr.CudaAsyncMemoryResource(),
+    initial_pool_size=2**30
+)
+```
+
+This gives:
+- Fast suballocation from RMM pool
+- Driver's virtual addressing for fragmentation resistance
+- Shared memory pool across libraries
+
+#### 3. Avoid Double Pooling
+
+Don't wrap a pool in another pool:
+
+```python
+# Bad: Double pooling
+inner_pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource(), 2**30)
+outer_pool = rmm.mr.PoolMemoryResource(inner_pool, 2**30)  # Wasteful!
+```
+
+#### 4. Monitor Fragmentation
+
+Pool allocators can suffer from fragmentation:
+
+```python
+import rmm
+
+# Enable statistics to monitor fragmentation
+pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaAsyncMemoryResource(), 2**30)
+stats_mr = rmm.mr.StatisticsResourceAdaptor(pool)
+rmm.mr.set_current_device_resource(stats_mr)
+
+# Run workload
+# ...
+
+# Check statistics
+stats = rmm.statistics.get_statistics()
+print(f"Peak bytes: {stats.peak_bytes}")
+print(f"Current bytes: {stats.current_bytes}")
+```
+
+If the pool grows much larger than `peak_bytes`, fragmentation may be occurring.
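As a back-of-the-envelope check, the counters above can be turned into a simple indicator (a hypothetical helper sketched in plain Python, not an RMM API): if the pool had to grow far beyond the workload's true peak usage, free memory is likely scattered across non-contiguous blocks.

```python
def fragmentation_ratio(pool_bytes, peak_bytes):
    """Ratio of pool size to peak requested bytes.

    Values well above 1.0 suggest fragmentation: the pool grew
    beyond what the workload ever held at once because freed
    blocks could not be reused for later allocations.
    """
    if peak_bytes == 0:
        return 0.0
    return pool_bytes / peak_bytes

# Example: a pool that grew to 4 GiB for a workload whose peak was 1 GiB
print(fragmentation_ratio(pool_bytes=4 * 2**30, peak_bytes=2**30))  # 4.0
```

A ratio near 1.0 means the pool is tightly packed; a large ratio is a hint to try `ArenaMemoryResource` or `CudaAsyncMemoryResource`.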
+ +### Common Issues + +#### Issue 1: Out of Memory (OOM) Before Max Pool Size + +**Symptom:** OOM errors even though allocated memory is less than `maximum_pool_size` + +**Cause:** Fragmentation. The pool has free memory, but not in contiguous blocks. + +**Solutions:** +1. Use `ArenaMemoryResource` instead (better fragmentation characteristics) +2. Use `CudaAsyncMemoryResource` (virtual addressing prevents fragmentation) +3. Adjust allocation patterns to reduce fragmentation + +#### Issue 2: Pool Doesn't Shrink + +**Symptom:** Memory remains allocated even after deallocations + +**Cause:** By design, pools don't return memory to the upstream resource. + +**Solutions:** +1. Destroy and recreate the pool (not recommended for long-running applications) +2. Set appropriate `maximum_pool_size` to limit growth +3. Use `CudaAsyncMemoryResource` if memory should be returned to the system + +## ArenaMemoryResource + +The `ArenaMemoryResource` divides memory into size-binned arenas to reduce fragmentation. + +### Configuration + +```python +import rmm + +arena = rmm.mr.ArenaMemoryResource( + rmm.mr.CudaMemoryResource(), + arena_size=2**28, # 256 MiB per arena + dump_log_on_failure=False +) +rmm.mr.set_current_device_resource(arena) +``` + +### How It Works + +1. Allocates memory in fixed-size "arenas" +2. Each arena is divided into size-binned "superblocks" +3. Allocations are served from the appropriate bin +4. 
Reduces fragmentation by isolating allocation sizes
+
+### When to Use
+
+- Applications with diverse allocation sizes
+- Long-running services with complex allocation patterns
+- When `PoolMemoryResource` suffers from fragmentation
+
+### Example: Mixed Allocation Sizes
+
+```python
+import rmm
+
+# Application allocates small (KB), medium (MB), and large (GB) buffers
+arena = rmm.mr.ArenaMemoryResource(
+    rmm.mr.CudaAsyncMemoryResource(),
+    arena_size=2**28  # 256 MiB arenas
+)
+rmm.mr.set_current_device_resource(arena)
+
+# Allocations are binned by size
+small = rmm.DeviceBuffer(size=1024)  # Small bin
+medium = rmm.DeviceBuffer(size=1024**2)  # Medium bin
+large = rmm.DeviceBuffer(size=1024**3)  # Large bin
+```
+
+## BinningMemoryResource
+
+The `BinningMemoryResource` routes allocations to different memory resources based on size.
+
+### Configuration
+
+```python
+import rmm
+
+# Create resources for different size ranges
+small_mr = rmm.mr.FixedSizeMemoryResource(
+    rmm.mr.CudaMemoryResource(),
+    block_size=256  # 256 bytes
+)
+large_mr = rmm.mr.PoolMemoryResource(
+    rmm.mr.CudaMemoryResource(),
+    initial_pool_size=2**30
+)
+
+# Bin allocations by size
+binning_mr = rmm.mr.BinningMemoryResource(
+    large_mr,  # Default for allocations not in bins
+)
+
+# Add bins: allocations of size <= threshold go to this resource
+binning_mr.add_bin(256, small_mr)  # <= 256 bytes -> small_mr
+binning_mr.add_bin(1024, None)     # <= 1 KiB -> auto-created fixed-size bin (suballocates from large_mr)
+# Anything > 1 KiB goes to upstream (large_mr)
+
+rmm.mr.set_current_device_resource(binning_mr)
+```
+
+### How It Works
+
+Allocations are routed based on size:
+```
+Allocation size <= bin1_threshold -> bin1_resource
+Allocation size <= bin2_threshold -> bin2_resource
+...
+Allocation size > largest_threshold -> upstream
+```
+
+### Best Practices for Binning
+
+#### 1. 
Profile Allocation Sizes + +Before configuring bins, understand your allocation patterns: + +```python +import rmm + +# Enable statistics to see allocation sizes +base = rmm.mr.CudaMemoryResource() +stats_mr = rmm.mr.StatisticsResourceAdaptor(base) +rmm.mr.set_current_device_resource(stats_mr) + +# Run workload +# ... + +# Analyze allocation patterns +stats = rmm.statistics.get_statistics() +print(stats) +``` + +#### 2. Optimize for Common Sizes + +Configure bins to match your most common allocation sizes: + +```python +import rmm + +# Based on profiling, we know: +# - Many small allocations (< 1 KiB) +# - Medium allocations (1 KiB - 1 MiB) +# - Large allocations (> 1 MiB) + +# Fixed-size resource for small allocations +small_mr = rmm.mr.FixedSizeMemoryResource( + rmm.mr.CudaAsyncMemoryResource(), + block_size=1024 # 1 KiB +) + +# Pool for medium allocations +medium_mr = rmm.mr.PoolMemoryResource( + rmm.mr.CudaAsyncMemoryResource(), + initial_pool_size=2**28 # 256 MiB +) + +# Pool for large allocations +large_mr = rmm.mr.PoolMemoryResource( + rmm.mr.CudaAsyncMemoryResource(), + initial_pool_size=2**30 # 1 GiB +) + +# Configure binning +binning_mr = rmm.mr.BinningMemoryResource(large_mr) +binning_mr.add_bin(1024, small_mr) # <= 1 KiB +binning_mr.add_bin(1024**2, medium_mr) # <= 1 MiB +# > 1 MiB goes to large_mr + +rmm.mr.set_current_device_resource(binning_mr) +``` + +#### 3. 
Consider Using ArenaMemoryResource Instead
+
+For many use cases, `ArenaMemoryResource` provides similar benefits with simpler configuration:
+
+```python
+# Simpler: Arena handles size-binning automatically
+arena = rmm.mr.ArenaMemoryResource(
+    rmm.mr.CudaAsyncMemoryResource(),
+    arena_size=2**28
+)
+rmm.mr.set_current_device_resource(arena)
+```
+
+### Example: PyTorch with Binning
+
+From issue #1958, here's a practical example for PyTorch workloads:
+
+```python
+import rmm
+import torch
+from rmm.allocators.torch import rmm_torch_allocator
+
+# Use managed memory as base (for larger-than-VRAM scenarios)
+upstream = rmm.mr.ManagedMemoryResource()
+
+# Create a pool wrapping managed memory
+pool = rmm.mr.PoolMemoryResource(
+    upstream,
+    initial_pool_size=2**20,  # 1 MiB
+    maximum_pool_size=int(80 * 2**30)  # 80 GiB max
+)
+
+# Fixed-size resource for small allocations
+fixed_mr = rmm.mr.FixedSizeMemoryResource(pool, block_size=256 * 1024)  # blocks sized to match the 256 KiB bin
+
+# Binning resource
+binning_mr = rmm.mr.BinningMemoryResource(pool)
+
+# Add bins for common PyTorch tensor sizes
+binning_mr.add_bin(256 * 1024, fixed_mr)   # <= 256 KiB -> fixed_mr
+binning_mr.add_bin(512 * 1024, None)       # <= 512 KiB -> auto-created fixed-size bin
+binning_mr.add_bin(1024 * 1024, None)      # <= 1 MiB -> auto-created fixed-size bin
+binning_mr.add_bin(2 * 1024 * 1024, None)  # <= 2 MiB -> auto-created fixed-size bin
+binning_mr.add_bin(4 * 1024 * 1024, None)  # <= 4 MiB -> auto-created fixed-size bin
+# > 4 MiB goes to pool
+
+rmm.mr.set_current_device_resource(binning_mr)
+
+# Configure PyTorch
+torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
+```
+
+**Note:** For production PyTorch workloads, prefer `CudaAsyncMemoryResource` unless you specifically need managed memory for larger-than-VRAM scenarios.
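To make the size-based routing concrete, here is a standalone sketch of the selection rule in plain Python. This models the behavior described above rather than RMM's actual implementation; the bin thresholds and resource names are illustrative.

```python
import bisect

def route(size, bins, upstream):
    """Pick the first bin whose threshold is >= size; otherwise use upstream.

    `bins` is a list of (threshold, resource_name) pairs sorted by
    threshold, mirroring the order of add_bin() calls.
    """
    thresholds = [t for t, _ in bins]
    i = bisect.bisect_left(thresholds, size)
    return bins[i][1] if i < len(bins) else upstream

# Illustrative configuration: a 256 KiB fixed-size bin and a 4 MiB pool bin
bins = [(256 * 1024, "fixed_bin"), (4 * 1024 * 1024, "pool_bin")]

for size in (1024, 2 * 1024 * 1024, 64 * 1024 * 1024):
    print(size, "->", route(size, bins, upstream="upstream_pool"))
```

Small requests land in the first bin, mid-sized requests in the second, and anything above the largest threshold falls through to the upstream resource.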
+ +## Choosing Between Pool Allocators + +| Resource | Best For | Fragmentation Handling | Complexity | +|----------|----------|------------------------|------------| +| **CudaAsyncMemoryResource** | General purpose, multi-stream apps | Excellent (virtual addressing) | Low | +| **PoolMemoryResource** | Simple pooling needs | Fair (coalescing) | Low | +| **ArenaMemoryResource** | Diverse allocation sizes | Good (size binning) | Medium | +| **BinningMemoryResource** | Custom size-based routing | Depends on configuration | High | + +## Debugging Pool Issues + +### Enable Logging + +```python +import rmm + +arena = rmm.mr.ArenaMemoryResource( + rmm.mr.CudaMemoryResource(), + arena_size=2**28, + dump_log_on_failure=True # Log on allocation failure +) +rmm.mr.set_current_device_resource(arena) +``` + +### Track Statistics + +```python +import rmm + +pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaAsyncMemoryResource(), 2**30) +stats_mr = rmm.mr.StatisticsResourceAdaptor(pool) +rmm.mr.set_current_device_resource(stats_mr) + +# Run workload +buffer = rmm.DeviceBuffer(size=1000000) + +# Check usage +stats = rmm.statistics.get_statistics() +print(f"Current bytes: {stats.current_bytes:,}") +print(f"Peak bytes: {stats.peak_bytes:,}") +print(f"Total allocations: {stats.total_count}") +``` + +### Profile with Nsight Systems + +```bash +nsys profile -o output python your_script.py +``` + +Look for: +- Allocation frequency and sizes +- Memory usage over time +- Fragmentation indicators + +## Summary + +- **For most cases**: Use `CudaAsyncMemoryResource` (driver-managed pool) +- **For simple pooling**: Use `PoolMemoryResource` wrapping `CudaAsyncMemoryResource` +- **For fragmentation issues**: Try `ArenaMemoryResource` +- **For size-based routing**: Use `BinningMemoryResource` (or `ArenaMemoryResource`) +- **Always profile**: Use statistics and Nsight Systems to understand allocation patterns +- **Set appropriate pool sizes**: Too small causes growth overhead, too large wastes memory + 
+## See Also + +- [Choosing a Memory Resource](choosing_memory_resources.md) - High-level guidance on selecting resources +- [Stream-Ordered Allocation](stream_ordered_allocation.md) - Understanding async allocation diff --git a/docs/user_guide/stream_ordered_allocation.md b/docs/user_guide/stream_ordered_allocation.md new file mode 100644 index 000000000..35d84ee16 --- /dev/null +++ b/docs/user_guide/stream_ordered_allocation.md @@ -0,0 +1,325 @@ +# Stream-Ordered Memory Allocation + +RMM provides **stream-ordered memory allocation**, which means that memory allocations and deallocations are ordered with respect to operations on a CUDA stream. This is a fundamental concept for achieving optimal performance in asynchronous CUDA applications. + +## What is Stream-Ordered Allocation? + +In stream-ordered allocation: + +1. **Allocations are asynchronous**: Calling `allocate()` schedules the allocation on a stream and returns immediately +2. **Memory is available after stream synchronization**: The allocated memory is guaranteed to be available for use by operations scheduled after the allocation on the same stream +3. **Deallocations are also stream-ordered**: Memory is not actually freed until all prior operations on the stream complete + +This allows memory operations to be interleaved with kernel launches and other CUDA operations without explicit synchronization. + +## Why Stream-Ordered Allocation Matters + +Traditional memory allocation (e.g., `cudaMalloc`) is **synchronous** - it blocks until the allocation completes. This creates bubbles in the execution pipeline where the CPU waits for GPU operations to complete. 
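The contrast can be illustrated with a toy model in plain Python (purely illustrative, no CUDA involved): a "stream" is a FIFO queue of operations, and a stream-ordered allocation is just another queued operation, so issuing it never blocks the caller while FIFO order still guarantees the allocation completes before any later operation that uses it.

```python
from collections import deque

class ToyStream:
    """Toy FIFO model of a CUDA stream: enqueue returns immediately;
    queued work only runs when the stream is synchronized."""
    def __init__(self):
        self.ops = deque()
        self.log = []

    def enqueue(self, name):
        self.ops.append(name)  # returns immediately (asynchronous)

    def synchronize(self):
        while self.ops:
            self.log.append(self.ops.popleft())  # execute in FIFO order

s = ToyStream()
s.enqueue("allocate(buffer)")  # scheduled, does not block the caller
s.enqueue("kernel(buffer)")    # ordered after the allocation
s.enqueue("free(buffer)")      # ordered after the kernel
s.synchronize()
print(s.log)  # ['allocate(buffer)', 'kernel(buffer)', 'free(buffer)']
```

Because everything on one stream executes in submission order, the allocation, the kernel that uses the memory, and the deallocation can all be enqueued back-to-back without any intermediate synchronization.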
+
+Stream-ordered allocation enables:
+- **Overlapping compute and memory operations**: Allocations can be scheduled while kernels are running
+- **Reduced synchronization overhead**: No need to synchronize the stream before allocating
+- **Better multi-stream performance**: Different streams can allocate independently
+
+## How It Works
+
+Consider the following example of allocating memory from a stream-ordered memory resource.
+
+C++:
+
+```cpp
+#include <rmm/cuda_stream_view.hpp>
+#include <rmm/device_buffer.hpp>
+
+rmm::cuda_stream_view stream;
+auto buffer = rmm::device_buffer(1000, stream);
+```
+
+Python:
+
+```python
+import rmm
+
+# Allocate on a specific stream
+stream = rmm.cuda_stream()
+buffer = rmm.DeviceBuffer(size=1000, stream=stream)
+```
+
+The following happens:
+
+1. The allocation request is **scheduled** on `stream`
+2. The function returns immediately (asynchronous)
+3. The memory is **guaranteed to be available** for operations enqueued on `stream` after the allocation
+4. You can use `buffer.data()` (the pointer) immediately in subsequent stream operations
+
+## Key Semantics
+
+### Safe to Use the Pointer Immediately
+
+**You can use the returned pointer in stream-ordered operations without synchronization:**
+
+```python
+import rmm
+import cupy as cp
+
+stream = rmm.cuda_stream()
+
+# Allocate memory on the stream
+buffer = rmm.DeviceBuffer(size=1000, stream=stream)
+
+# Use the pointer immediately in a CuPy operation on the same stream
+# This is SAFE - no synchronization needed
+with stream:
+    array = cp.ndarray(shape=(250,), dtype=cp.float32,
+                       memptr=cp.cuda.MemoryPointer(
+                           cp.cuda.UnownedMemory(buffer.ptr, buffer.size, buffer),
+                           0))
+    # Kernel launches on this stream will see the allocated memory
+    array[:] = 42
+```
+
+The allocation is guaranteed to complete before the kernel that uses it, as long as both are on the same stream.
+ +### Deallocations Are Also Stream-Ordered + +When you deallocate (e.g., a buffer goes out of scope), the deallocation is also stream-ordered: + +```python +import rmm + +stream = rmm.cuda_stream() + +# Allocate +buffer = rmm.DeviceBuffer(size=1000, stream=stream) + +# Schedule some work on the stream +# ... kernels using buffer.ptr ... + +# When buffer is destroyed, deallocation is scheduled on the stream +# The memory won't actually be freed until all prior work completes +buffer = None # triggers deallocation +``` + +This ensures that: +- Memory is not freed while still in use by a kernel +- Deallocations don't block waiting for kernels to complete + +### Stream Synchronization + +To guarantee that an allocation has completed (for example, if you need to access it from the CPU), synchronize the stream: + +```python +import rmm + +stream = rmm.cuda_stream() +buffer = rmm.DeviceBuffer(size=1000, stream=stream) + +# Synchronize to ensure allocation completes +stream.synchronize() + +# Now safe to do CPU operations with buffer.ptr +# (though accessing GPU memory from CPU usually requires managed memory) +``` + +## Memory Resources and Stream Ordering + +### Which Resources Support Stream Ordering? 
+ +- **`CudaAsyncMemoryResource`**: Fully stream-ordered (recommended) +- **`PoolMemoryResource`**: Can be stream-ordered when wrapping a stream-ordered upstream +- **`ArenaMemoryResource`**: Stream-ordered when wrapping a stream-ordered upstream +- **`CudaMemoryResource`**: NOT stream-ordered (synchronous `cudaMalloc`) +- **`ManagedMemoryResource`**: NOT stream-ordered (synchronous `cudaMallocManaged`) + +### Example: Pool Wrapping Async MR + +```python +import rmm + +# Create a pool that maintains stream-ordered semantics +pool = rmm.mr.PoolMemoryResource( + rmm.mr.CudaAsyncMemoryResource(), # stream-ordered upstream + initial_pool_size=2**30 +) +rmm.mr.set_current_device_resource(pool) + +# Allocations from this pool are stream-ordered +stream = rmm.cuda_stream() +buffer = rmm.DeviceBuffer(size=1000, stream=stream) +``` + +## Common Patterns + +### Pattern 1: Allocate and Use in Kernel + +```python +import rmm +from numba import cuda + +@cuda.jit +def kernel(data, n): + idx = cuda.grid(1) + if idx < n: + data[idx] = idx * 2 + +stream = rmm.cuda_stream() + +# Allocate +buffer = rmm.DeviceBuffer(size=1000 * 4, stream=stream) # 1000 float32s + +# Use immediately +with stream: + kernel[100, 10](cuda.as_cuda_array(buffer).view('float32'), 1000) + +# Synchronize to wait for kernel +stream.synchronize() +``` + +### Pattern 2: Allocate, Compute, Deallocate, Repeat + +```python +import rmm + +stream = rmm.cuda_stream() + +for i in range(100): + # Allocate + buffer = rmm.DeviceBuffer(size=1000000, stream=stream) + + # Use buffer in computations + # ... launch kernels on stream ... 
+ + # Deallocate (automatic, or explicitly set buffer = None) + buffer = None + +# All allocations and deallocations are stream-ordered +# No need to synchronize between iterations +``` + +### Pattern 3: Multi-Stream Allocation + +```python +import rmm + +# Create multiple streams +streams = [rmm.cuda_stream() for _ in range(4)] + +# Allocate on different streams independently +buffers = [] +for stream in streams: + # Each allocation is independent + buffer = rmm.DeviceBuffer(size=1000000, stream=stream) + buffers.append(buffer) + + # Launch work on this stream + # ... kernels using buffer ... + +# Synchronize all streams +for stream in streams: + stream.synchronize() +``` + +## Performance Implications + +### Benefits + +1. **Reduced CPU-GPU synchronization**: No blocking on allocations +2. **Better pipeline utilization**: Memory operations overlap with compute +3. **Multi-stream scalability**: Streams can allocate independently + +### Pitfalls to Avoid + +1. **Don't mix streams**: Using memory allocated on stream A in operations on stream B requires synchronization: + + ```python + stream_a = rmm.cuda_stream() + stream_b = rmm.cuda_stream() + + # Allocate on stream A + buffer = rmm.DeviceBuffer(size=1000, stream=stream_a) + + # To use on stream B, synchronize stream A first + stream_a.synchronize() + + # Now safe to use on stream B + with stream_b: + # ... operations using buffer ... + ``` + +2. **Don't access from CPU without sync**: Stream-ordered allocations are asynchronous - accessing from CPU requires synchronization: + + ```python + stream = rmm.cuda_stream() + buffer = rmm.DeviceBuffer(size=1000, stream=stream) + + # BAD: May access uninitialized memory + # some_function(buffer.ptr) + + # GOOD: Synchronize first + stream.synchronize() + some_function(buffer.ptr) + ``` + +3. 
**Resource lifetime**: Ensure buffers live until all stream operations complete:
+
+   ```python
+   stream = rmm.cuda_stream()
+
+   def allocate_and_use():
+       buffer = rmm.DeviceBuffer(size=1000, stream=stream)
+       # Launch kernel using buffer
+       kernel[...](buffer.ptr)
+       # BAD: buffer is deallocated when function returns
+       # but kernel may still be running!
+
+   allocate_and_use()
+   stream.synchronize()  # May crash - buffer already freed
+   ```
+
+   Fix: Keep the buffer alive until synchronization by returning it from the function:
+
+   ```python
+   stream = rmm.cuda_stream()
+
+   def allocate_and_use():
+       buffer = rmm.DeviceBuffer(size=1000, stream=stream)
+       kernel[...](buffer.ptr)
+       return buffer  # keep the buffer alive past the kernel launch
+
+   buffer = allocate_and_use()
+   stream.synchronize()  # Now safe
+   buffer = None  # Explicit cleanup after sync
+   ```
+
+## C++ API
+
+In C++, stream-ordered allocation is the default for most RMM containers:
+
+```cpp
+#include <rmm/cuda_stream.hpp>
+#include <rmm/device_buffer.hpp>
+#include <rmm/device_uvector.hpp>
+#include <rmm/mr/device/cuda_async_memory_resource.hpp>
+#include <rmm/mr/device/per_device_resource.hpp>
+
+// Set async MR as default
+auto async_mr = rmm::mr::cuda_async_memory_resource{};
+rmm::mr::set_current_device_resource_ref(async_mr);
+
+// Create a stream
+rmm::cuda_stream stream;
+
+// Allocate stream-ordered memory
+rmm::device_buffer buffer(1000, stream.view());
+rmm::device_uvector<float> vec(1000, stream.view());
+
+// Use immediately in stream-ordered operations
+launch_kernel<<<num_blocks, block_size, 0, stream.value()>>>(buffer.data(), vec.data());
+
+// Synchronize
+stream.synchronize();
+```
+
+## Summary
+
+- Stream-ordered allocation enables asynchronous, non-blocking memory operations
+- Allocated pointers can be used immediately in subsequent operations on the same stream
+- Deallocations are also stream-ordered, preventing use-after-free
+- `CudaAsyncMemoryResource` provides the best stream-ordered allocation support
+- Always synchronize before accessing memory from the CPU
+- Ensure buffer lifetimes extend until all stream operations complete
+
+For more details on choosing memory resources, see [Choosing a Memory Resource](choosing_memory_resources.md).