
Forward-merge release/26.04 into main #2310

Merged
bdice merged 7 commits into rapidsai:main from bdice:main-merge-release/26.04
Mar 16, 2026
Conversation

@bdice
Collaborator

@bdice bdice commented Mar 16, 2026

Manual forward merge from release/26.04 to main. This PR should not be squashed.

jameslamb and others added 6 commits March 12, 2026 22:04
Fixes these `pre-commit` errors blocking CI:

```text
verify-hardcoded-version.................................................Failed
- hook id: verify-hardcoded-version
- exit code: 1

In file RAPIDS_BRANCH:1:9:
 release/26.04
warning: do not hard-code version, read from VERSION file instead

In file RAPIDS_BRANCH:1:9:
 release/26.04

In file cpp/examples/versions.cmake:8:21:
 set(RMM_TAG release/26.04)
warning: do not hard-code version, read from VERSION file instead

In file cpp/examples/versions.cmake:8:21:
 set(RMM_TAG release/26.04)
```

Fixed by updating the `verify-hardcoded-version` configuration and by updating the C++ examples to read `RMM_TAG` from the `RAPIDS_BRANCH` file.

See rapidsai/pre-commit-hooks#121 for details

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2293
Contributes to rapidsai/build-planning#256

Broken out from rapidsai#2270 

Proposes a stricter pattern for installing `torch` wheels, to prevent bugs of the form "accidentally used a CPU-only `torch` from pypi.org". This should help us to catch compatibility issues, improving release confidence.

Other small changes:

* splits torch wheel testing into "oldest" (PyTorch 2.9) and "latest" (PyTorch 2.10)
* introduces a `require_gpu_pytorch` matrix filter so conda jobs can explicitly request `pytorch-gpu` (to similarly ensure solvers don't fall back to the CPU-only variant)
* appends `rapids-generate-pip-constraint` output to the file that `PIP_CONSTRAINT` points to
  - *(to reduce duplication and the risk of failing to apply constraints)*

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2279
…adaptor (rapidsai#2304)

For the tracking resource adaptor to be thread safe, modification of the tracked allocations must be sandwiched between the upstream calls, forming an "acquire-release" pair: `upstream.allocate` before tracking, `upstream.deallocate` after untracking. Previously this was only half true: the upstream allocation occurred before updating the tracked allocations, but the deallocation did not occur after. In multi-threaded use this could produce a logged error claiming that a deallocated pointer was not tracked.

To fix this, use the correct ordering. Additionally, guard against ABA issues by using `try_emplace` when tracking an allocation.

- Closes rapidsai#2303

Authors:
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2304
…E 754 -0.0 (rapidsai#2302)

## Description

`device_uvector::set_element_async` had a zero-value optimization that
used `cudaMemsetAsync` when `value == value_type{0}`. For IEEE 754
floating-point types, `-0.0 == 0.0` is `true` per the standard, so
`-0.0` was incorrectly routed through `cudaMemsetAsync(..., 0, ...)`
which clears all bits — including the sign bit — normalizing `-0.0` to
`+0.0`.

This corrupts the in-memory representation of `-0.0` for any downstream
library that creates scalars through RMM
(`cudf::fixed_width_scalar::set_value` →
`rmm::device_scalar::set_value_async` →
`device_uvector::set_element_async`), causing observable behavioral
divergence in spark-rapids (e.g., `cast(-0.0 as string)` returns `"0.0"`
on GPU instead of `"-0.0"`).

### Fix

Per the discussion in rapidsai#2298, remove all `constexpr` special casing in
`set_element_async` — both the `bool` `cudaMemsetAsync` path and the
`is_fundamental_v` zero-detection path — and always use
`cudaMemcpyAsync`. This preserves exact bit-level representations for
all types, which is the correct contract for a memory management library
that sits below cuDF, cuML, and cuGraph.

`set_element_to_zero_async` is unchanged — its explicit "set to zero"
semantics make `cudaMemsetAsync` the correct implementation.

### Testing

Added `NegativeZeroTest.PreservesFloatNegativeZero` and
`NegativeZeroTest.PreservesDoubleNegativeZero` regression tests that
verify the sign bit of `-0.0f` / `-0.0` survives a round-trip through
`set_element_async` → `element`. All 122 tests pass locally (CUDA 13.0,
RTX 5880).

Closes rapidsai#2298

## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/rmm/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.

Made with [Cursor](https://cursor.com)

---------

Signed-off-by: Allen Xu <allxu@nvidia.com>
## Description
I found that the `ulimit` settings for CUDA 13.1 devcontainers were
missing. This fixes it.

## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/rmm/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
This PR sets an upper bound on the `numba-cuda` dependency to `<0.29.0`.

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2306
@bdice bdice requested review from a team as code owners March 16, 2026 22:10
@bdice bdice added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Mar 16, 2026
@bdice bdice requested review from gforsyth, lamarrr and vyasr March 16, 2026 22:10
@bdice bdice force-pushed the main-merge-release/26.04 branch from 7661d92 to 7ddf10f on March 16, 2026 22:12
@bdice bdice moved this to In Progress in RMM Project Board Mar 16, 2026
@coderabbitai

coderabbitai bot commented Mar 16, 2026

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced PyTorch wheel support with improved CUDA compatibility handling and dynamic installation workflow.
  • Bug Fixes

    • Improved thread-safety in memory resource handling with additional validation checks.
  • Tests

    • Added comprehensive multithreaded concurrency tests for memory resource adaptors and statistics tracking.
  • Chores

    • Updated dependency constraints for CUDA compatibility.
    • Enhanced development environment configuration with improved resource limits.

Walkthrough

This PR makes multi-domain improvements to RMM: adds container resource limits, updates CI infrastructure for PyTorch wheel handling, tightens numba-cuda version constraints, enables dynamic CMake branch selection, removes memset optimizations from async operations, hardens resource adaptors with duplicate-tracking assertions, and introduces test utilities for concurrency validation.

Changes

Cohort / File(s) Summary
Devcontainer Configuration
.devcontainer/cuda13.1-conda/devcontainer.json, .devcontainer/cuda13.1-pip/devcontainer.json
Added file descriptor limit (--ulimit nofile=500000) to container runtime arguments.
Pre-commit and CI Foundation
.pre-commit-config.yaml, ci/release/update-version.sh
Bumped pre-commit-hooks from v1.3.3 to v1.4.2 with new exclude rules; updated copyright year and removed RMM_TAG sed commands from release script.
PyTorch Wheel Management
ci/download-torch-wheels.sh, ci/test_wheel.sh, ci/test_wheel_integrations.sh, ci/test_python_integrations.sh
Added new download script for CUDA-aware PyTorch wheels; refactored test scripts to use PIP_CONSTRAINT environment variable, dynamic wheel downloads, and CUDA version gating (12.9–13.0).
Dependency Constraints
conda/environments/all_cuda-129_arch-aarch64.yaml, conda/environments/all_cuda-129_arch-x86_64.yaml, conda/environments/all_cuda-131_arch-aarch64.yaml, conda/environments/all_cuda-131_arch-x86_64.yaml, python/rmm/pyproject.toml, dependencies.yaml
Tightened numba-cuda constraint from >=0.22.1 to >=0.22.1,<0.29.0 across multiple CUDA variant environments and test matrices.
CMake and Version Management
cpp/examples/versions.cmake
Made RMM_TAG dynamic by reading from ${_rapids_branch} variable and adding include directive to rapids_config.cmake.
Device Operations
cpp/include/rmm/device_scalar.hpp, cpp/include/rmm/device_uvector.hpp
Removed memset optimization documentation and zero-value fast-path for set_element_async, consolidating to single cudaMemcpyAsync path.
Resource Adaptors
cpp/include/rmm/aligned_resource_adaptor.hpp, cpp/include/rmm/statistics_resource_adaptor.hpp, cpp/include/rmm/tracking_resource_adaptor.hpp
Added duplicate-pointer detection via try_emplace assertions; reordered deallocation calls to occur after counter updates to ensure proper error detection and logging.
Testing Infrastructure
cpp/tests/mr/delayed_memory_resource.hpp, cpp/tests/device_uvector_tests.cpp, cpp/tests/mr/statistics_mr_tests.cpp, cpp/tests/mr/tracking_mr_tests.cpp
Introduced delayed_memory_resource test utility to inject post-deallocation delays; added negative-zero preservation tests for floating-point types and multithreaded concurrency tests for resource adaptors.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

3 - Ready for review

Suggested reviewers

  • gforsyth
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 20.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (2 passed)

  • Title check ✅ Passed: the PR title 'Forward-merge release/26.04 into main' clearly and concisely describes the primary purpose of this changeset, merging changes from a release branch into the main branch.
  • Description check ✅ Passed: the PR description accurately explains this is a manual forward merge from release/26.04 to main with a note about not squashing, which directly relates to the changeset's purpose.





@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
cpp/tests/mr/tracking_mr_tests.cpp (1)

36-80: Good multi-threaded test for ABA race detection.

The test correctly sets up the interleaving scenario documented in the comments (lines 43-60). The use of delayed_memory_resource with 300ms delay combined with Thread 1's 100ms initial sleep creates the overlapping deallocation window needed to expose ABA issues.

One observation: unlike the StatisticsTest::MultiThreaded in statistics_mr_tests.cpp, this test doesn't assert final counter/tracking state after threads join. Consider adding assertions to verify mr.get_outstanding_allocations().size() == 0 and mr.get_allocated_bytes() == 0 to confirm correct bookkeeping under concurrency.

💡 Optional: Add final state assertions
   for (auto& t : threads) {
     t.join();
   }
+  EXPECT_EQ(mr.get_outstanding_allocations().size(), 0);
+  EXPECT_EQ(mr.get_allocated_bytes(), 0);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/mr/tracking_mr_tests.cpp` around lines 36 - 80, Add final-state
assertions after the thread joins in the TrackingTest::MultiThreaded test to
ensure the tracking_resource_adaptor cleaned up correctly: call
mr.get_outstanding_allocations().size() and mr.get_allocated_bytes() and assert
they are zero (e.g., EXPECT_EQ or ASSERT_EQ) to verify no leaked allocation
entries and zero tracked bytes after concurrent allocate/deallocate; place these
checks immediately after the for-loop that joins threads and reference mr (the
tracking_resource_adaptor<delayed_memory_resource> instance) to locate the code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 73e0669c-448b-467f-b00c-a97943ba8d02

📥 Commits

Reviewing files that changed from the base of the PR and between 22f4680 and 7ddf10f.

📒 Files selected for processing (24)
  • .devcontainer/cuda13.1-conda/devcontainer.json
  • .devcontainer/cuda13.1-pip/devcontainer.json
  • .pre-commit-config.yaml
  • ci/download-torch-wheels.sh
  • ci/release/update-version.sh
  • ci/test_python_integrations.sh
  • ci/test_wheel.sh
  • ci/test_wheel_integrations.sh
  • conda/environments/all_cuda-129_arch-aarch64.yaml
  • conda/environments/all_cuda-129_arch-x86_64.yaml
  • conda/environments/all_cuda-131_arch-aarch64.yaml
  • conda/environments/all_cuda-131_arch-x86_64.yaml
  • cpp/examples/versions.cmake
  • cpp/include/rmm/device_scalar.hpp
  • cpp/include/rmm/device_uvector.hpp
  • cpp/include/rmm/mr/aligned_resource_adaptor.hpp
  • cpp/include/rmm/mr/statistics_resource_adaptor.hpp
  • cpp/include/rmm/mr/tracking_resource_adaptor.hpp
  • cpp/tests/device_uvector_tests.cpp
  • cpp/tests/mr/delayed_memory_resource.hpp
  • cpp/tests/mr/statistics_mr_tests.cpp
  • cpp/tests/mr/tracking_mr_tests.cpp
  • dependencies.yaml
  • python/rmm/pyproject.toml
💤 Files with no reviewable changes (1)
  • cpp/include/rmm/device_uvector.hpp

@bdice bdice merged commit 485d79a into rapidsai:main Mar 16, 2026
82 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in RMM Project Board Mar 16, 2026
@coderabbitai coderabbitai bot mentioned this pull request Apr 2, 2026
3 tasks

Labels

improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

Projects

Status: Done

5 participants