
Forward-merge release/26.04 into main #2310

Merged
bdice merged 7 commits into rapidsai:main from bdice:main-merge-release/26.04
Mar 16, 2026
Conversation

@bdice
Collaborator

@bdice bdice commented Mar 16, 2026

Manual forward merge from release/26.04 to main. This PR should not be squashed.

jameslamb and others added 6 commits March 12, 2026 22:04
Fixes these `pre-commit` errors blocking CI:

```text
verify-hardcoded-version.................................................Failed
- hook id: verify-hardcoded-version
- exit code: 1

In file RAPIDS_BRANCH:1:9:
 release/26.04
warning: do not hard-code version, read from VERSION file instead

In file RAPIDS_BRANCH:1:9:
 release/26.04

In file cpp/examples/versions.cmake:8:21:
 set(RMM_TAG release/26.04)
warning: do not hard-code version, read from VERSION file instead

In file cpp/examples/versions.cmake:8:21:
 set(RMM_TAG release/26.04)
```

Fixed by updating the `verify-hardcoded-version` configuration and by updating the C++ examples to read `RMM_TAG` from the `RAPIDS_BRANCH` file.

See rapidsai/pre-commit-hooks#121 for details

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2293
Contributes to rapidsai/build-planning#256

Broken out from rapidsai#2270 

Proposes a stricter pattern for installing `torch` wheels, to prevent bugs of the form "accidentally used a CPU-only `torch` from pypi.org". This should help us to catch compatibility issues, improving release confidence.

Other small changes:

* splits torch wheel testing into "oldest" (PyTorch 2.9) and "latest" (PyTorch 2.10)
* introduces a `require_gpu_pytorch` matrix filter so conda jobs can explicitly request `pytorch-gpu` (to similarly ensure solvers don't fall back to the CPU-only variant)
* appends `rapids-generate-pip-constraint` output to the file that `PIP_CONSTRAINT` points to
  - *(to reduce duplication and the risk of failing to apply constraints)*

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2279
…adaptor (rapidsai#2304)

For the tracking resource adaptor to be thread safe, modification of the tracked allocations must be sandwiched between the upstream calls, forming an "acquire-release" pair: `upstream.allocate` before tracking, `upstream.deallocate` after untracking. Previously this was only half true: the upstream allocation occurred before updating the tracked allocations, but the deallocation did not occur after. In multi-threaded use this could produce a logged error claiming that a deallocated pointer was not tracked.

To fix this, use the correct ordering. Additionally, guard against ABA issues by using `try_emplace` when tracking an allocation.

- Closes rapidsai#2303

Authors:
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2304
…E 754 -0.0 (rapidsai#2302)

## Description

`device_uvector::set_element_async` had a zero-value optimization that
used `cudaMemsetAsync` when `value == value_type{0}`. For IEEE 754
floating-point types, `-0.0 == 0.0` is `true` per the standard, so
`-0.0` was incorrectly routed through `cudaMemsetAsync(..., 0, ...)`
which clears all bits — including the sign bit — normalizing `-0.0` to
`+0.0`.

This corrupts the in-memory representation of `-0.0` for any downstream
library that creates scalars through RMM
(`cudf::fixed_width_scalar::set_value` →
`rmm::device_scalar::set_value_async` →
`device_uvector::set_element_async`), causing observable behavioral
divergence in spark-rapids (e.g., `cast(-0.0 as string)` returns `"0.0"`
on GPU instead of `"-0.0"`).

### Fix

Per the discussion in rapidsai#2298, remove all `constexpr` special casing in
`set_element_async` — both the `bool` `cudaMemsetAsync` path and the
`is_fundamental_v` zero-detection path — and always use
`cudaMemcpyAsync`. This preserves exact bit-level representations for
all types, which is the correct contract for a memory management library
that sits below cuDF, cuML, and cuGraph.

`set_element_to_zero_async` is unchanged — its explicit "set to zero"
semantics make `cudaMemsetAsync` the correct implementation.

### Testing

Added `NegativeZeroTest.PreservesFloatNegativeZero` and
`NegativeZeroTest.PreservesDoubleNegativeZero` regression tests that
verify the sign bit of `-0.0f` / `-0.0` survives a round-trip through
`set_element_async` → `element`. All 122 tests pass locally (CUDA 13.0,
RTX 5880).

Closes rapidsai#2298

## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/rmm/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.

Made with [Cursor](https://cursor.com)

---------

Signed-off-by: Allen Xu <allxu@nvidia.com>
## Description
I found that the `ulimit` settings for CUDA 13.1 devcontainers were
missing. This fixes it.

## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/rmm/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
This PR sets an upper bound on the `numba-cuda` dependency to `<0.29.0`.

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#2306
@bdice bdice requested review from a team as code owners March 16, 2026 22:10
@bdice bdice added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Mar 16, 2026
@bdice bdice requested review from gforsyth, lamarrr and vyasr March 16, 2026 22:10
@bdice bdice force-pushed the main-merge-release/26.04 branch from 7661d92 to 7ddf10f on March 16, 2026 22:12
@bdice bdice moved this to In Progress in RMM Project Board Mar 16, 2026
@coderabbitai

coderabbitai bot commented Mar 16, 2026

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced PyTorch wheel support with improved CUDA compatibility handling and dynamic installation workflow.
  • Bug Fixes

    • Improved thread-safety in memory resource handling with additional validation checks.
  • Tests

    • Added comprehensive multithreaded concurrency tests for memory resource adaptors and statistics tracking.
  • Chores

    • Updated dependency constraints for CUDA compatibility.
    • Enhanced development environment configuration with improved resource limits.

Walkthrough

This PR makes multi-domain improvements to RMM: adds container resource limits, updates CI infrastructure for PyTorch wheel handling, tightens numba-cuda version constraints, enables dynamic CMake branch selection, removes memset optimizations from async operations, hardens resource adaptors with duplicate-tracking assertions, and introduces test utilities for concurrency validation.

Changes

Cohort / File(s) Summary
Devcontainer Configuration
.devcontainer/cuda13.1-conda/devcontainer.json, .devcontainer/cuda13.1-pip/devcontainer.json
Added file descriptor limit (--ulimit nofile=500000) to container runtime arguments.
Pre-commit and CI Foundation
.pre-commit-config.yaml, ci/release/update-version.sh
Bumped pre-commit-hooks from v1.3.3 to v1.4.2 with new exclude rules; updated copyright year and removed RMM_TAG sed commands from release script.
PyTorch Wheel Management
ci/download-torch-wheels.sh, ci/test_wheel.sh, ci/test_wheel_integrations.sh, ci/test_python_integrations.sh
Added new download script for CUDA-aware PyTorch wheels; refactored test scripts to use PIP_CONSTRAINT environment variable, dynamic wheel downloads, and CUDA version gating (12.9–13.0).
Dependency Constraints
conda/environments/all_cuda-129_arch-aarch64.yaml, conda/environments/all_cuda-129_arch-x86_64.yaml, conda/environments/all_cuda-131_arch-aarch64.yaml, conda/environments/all_cuda-131_arch-x86_64.yaml, python/rmm/pyproject.toml, dependencies.yaml
Tightened numba-cuda constraint from >=0.22.1 to >=0.22.1,<0.29.0 across multiple CUDA variant environments and test matrices.
CMake and Version Management
cpp/examples/versions.cmake
Made RMM_TAG dynamic by reading from ${_rapids_branch} variable and adding include directive to rapids_config.cmake.
Device Operations
cpp/include/rmm/device_scalar.hpp, cpp/include/rmm/device_uvector.hpp
Removed memset optimization documentation and zero-value fast-path for set_element_async, consolidating to single cudaMemcpyAsync path.
Resource Adaptors
cpp/include/rmm/aligned_resource_adaptor.hpp, cpp/include/rmm/statistics_resource_adaptor.hpp, cpp/include/rmm/tracking_resource_adaptor.hpp
Added duplicate-pointer detection via try_emplace assertions; reordered deallocation calls to occur after counter updates to ensure proper error detection and logging.
Testing Infrastructure
cpp/tests/mr/delayed_memory_resource.hpp, cpp/tests/device_uvector_tests.cpp, cpp/tests/mr/statistics_mr_tests.cpp, cpp/tests/mr/tracking_mr_tests.cpp
Introduced delayed_memory_resource test utility to inject post-deallocation delays; added negative-zero preservation tests for floating-point types and multithreaded concurrency tests for resource adaptors.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

3 - Ready for review

Suggested reviewers

  • gforsyth
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 20.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (2 passed)

  • Title check ✅ Passed: the PR title 'Forward-merge release/26.04 into main' clearly and concisely describes the primary purpose of this changeset, merging changes from a release branch into the main branch.
  • Description check ✅ Passed: the PR description accurately explains this is a manual forward merge from release/26.04 to main with a note about not squashing, which directly relates to the changeset's purpose.





@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
cpp/tests/mr/tracking_mr_tests.cpp (1)

36-80: Good multi-threaded test for ABA race detection.

The test correctly sets up the interleaving scenario documented in the comments (lines 43-60). The use of delayed_memory_resource with 300ms delay combined with Thread 1's 100ms initial sleep creates the overlapping deallocation window needed to expose ABA issues.

One observation: unlike the StatisticsTest::MultiThreaded in statistics_mr_tests.cpp, this test doesn't assert final counter/tracking state after threads join. Consider adding assertions to verify mr.get_outstanding_allocations().size() == 0 and mr.get_allocated_bytes() == 0 to confirm correct bookkeeping under concurrency.

💡 Optional: Add final state assertions
   for (auto& t : threads) {
     t.join();
   }
+  EXPECT_EQ(mr.get_outstanding_allocations().size(), 0);
+  EXPECT_EQ(mr.get_allocated_bytes(), 0);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/mr/tracking_mr_tests.cpp` around lines 36 - 80, Add final-state
assertions after the thread joins in the TrackingTest::MultiThreaded test to
ensure the tracking_resource_adaptor cleaned up correctly: call
mr.get_outstanding_allocations().size() and mr.get_allocated_bytes() and assert
they are zero (e.g., EXPECT_EQ or ASSERT_EQ) to verify no leaked allocation
entries and zero tracked bytes after concurrent allocate/deallocate; place these
checks immediately after the for-loop that joins threads and reference mr (the
tracking_resource_adaptor<delayed_memory_resource> instance) to locate the code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 73e0669c-448b-467f-b00c-a97943ba8d02

📥 Commits

Reviewing files that changed from the base of the PR and between 22f4680 and 7ddf10f.

📒 Files selected for processing (24)
  • .devcontainer/cuda13.1-conda/devcontainer.json
  • .devcontainer/cuda13.1-pip/devcontainer.json
  • .pre-commit-config.yaml
  • ci/download-torch-wheels.sh
  • ci/release/update-version.sh
  • ci/test_python_integrations.sh
  • ci/test_wheel.sh
  • ci/test_wheel_integrations.sh
  • conda/environments/all_cuda-129_arch-aarch64.yaml
  • conda/environments/all_cuda-129_arch-x86_64.yaml
  • conda/environments/all_cuda-131_arch-aarch64.yaml
  • conda/environments/all_cuda-131_arch-x86_64.yaml
  • cpp/examples/versions.cmake
  • cpp/include/rmm/device_scalar.hpp
  • cpp/include/rmm/device_uvector.hpp
  • cpp/include/rmm/mr/aligned_resource_adaptor.hpp
  • cpp/include/rmm/mr/statistics_resource_adaptor.hpp
  • cpp/include/rmm/mr/tracking_resource_adaptor.hpp
  • cpp/tests/device_uvector_tests.cpp
  • cpp/tests/mr/delayed_memory_resource.hpp
  • cpp/tests/mr/statistics_mr_tests.cpp
  • cpp/tests/mr/tracking_mr_tests.cpp
  • dependencies.yaml
  • python/rmm/pyproject.toml
💤 Files with no reviewable changes (1)
  • cpp/include/rmm/device_uvector.hpp

@bdice bdice merged commit 485d79a into rapidsai:main Mar 16, 2026
82 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in RMM Project Board Mar 16, 2026
@coderabbitai coderabbitai bot mentioned this pull request Apr 2, 2026
3 tasks

Labels

improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

Projects

Status: Done

5 participants