Remove zero-value special casing in set_element_async to preserve IEEE 754 -0.0 by wjxiz1992 · Pull Request #2302 · rapidsai/rmm

wjxiz1992 · 2026-03-16T03:32:29Z

Description

device_uvector::set_element_async had a zero-value optimization that used cudaMemsetAsync when value == value_type{0}. For IEEE 754 floating-point types, -0.0 == 0.0 is true per the standard, so -0.0 was incorrectly routed through cudaMemsetAsync(..., 0, ...) which clears all bits — including the sign bit — normalizing -0.0 to +0.0.
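The root cause can be reproduced on the host without a GPU. The following is a minimal sketch, assuming `float` is a 32-bit IEEE 754 binary32 value; `std::memset` and `std::memcpy` stand in for `cudaMemsetAsync` and `cudaMemcpyAsync`, and the helper names (`bits_of`, `store_with_zero_fill`, `store_with_byte_copy`, `check_negative_zero_storage`) are illustrative and not part of RMM.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Return the raw bit pattern of a float (assumes 32-bit IEEE 754 binary32).
inline std::uint32_t bits_of(float value)
{
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
  std::uint32_t bits{};
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// Host stand-in for cudaMemsetAsync(ptr, 0, sizeof(float)): zero every byte.
inline float store_with_zero_fill()
{
  float slot = 1.0f;
  std::memset(&slot, 0, sizeof(slot));
  return slot;
}

// Host stand-in for cudaMemcpyAsync: copy the bytes of `value` verbatim.
inline float store_with_byte_copy(float value)
{
  float slot = 1.0f;
  std::memcpy(&slot, &value, sizeof(slot));
  return slot;
}

// Walk through the failure mode described above.
inline void check_negative_zero_storage()
{
  float const neg_zero = -0.0f;
  assert(neg_zero == 0.0f);                              // equal by value...
  assert(bits_of(neg_zero) != bits_of(0.0f));            // ...but different bits
  assert(!std::signbit(store_with_zero_fill()));         // zero fill drops the sign
  assert(std::signbit(store_with_byte_copy(neg_zero)));  // byte copy keeps it
}
```

The value comparison cannot tell the two zeros apart, which is why a `value == value_type{0}` guard sends `-0.0` down the zero-fill path.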

This corrupts the in-memory representation of -0.0 for any downstream library that creates scalars through RMM (cudf::fixed_width_scalar::set_value → rmm::device_scalar::set_value_async → device_uvector::set_element_async), causing observable behavioral divergence in spark-rapids (e.g., cast(-0.0 as string) returns "0.0" on GPU instead of "-0.0").

Fix

Per the discussion in #2298, remove all constexpr special casing in set_element_async — both the bool cudaMemsetAsync path and the is_fundamental_v zero-detection path — and always use cudaMemcpyAsync. This preserves exact bit-level representations for all types, which is the correct contract for a memory management library that sits below cuDF, cuML, and cuGraph.
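The before/after behavior can be pictured with a host-side model of the two dispatch strategies. This is a sketch under assumptions, not the RMM source: `set_element_old` and `set_element_new` are hypothetical names, they write to a plain host pointer, and `std::memset` / `std::memcpy` stand in for the CUDA calls.

```cpp
#include <cassert>
#include <cmath>
#include <cstring>
#include <type_traits>

// Model of the OLD dispatch: memset for bool and for fundamental values that
// compare equal to zero, byte copy otherwise.
template <typename T>
void set_element_old(T* slot, T const& value)
{
  if constexpr (std::is_same_v<T, bool>) {
    std::memset(slot, value ? 1 : 0, sizeof(T));
    return;
  } else if constexpr (std::is_fundamental_v<T>) {
    if (value == T{0}) {  // true for -0.0 as well as +0.0
      std::memset(slot, 0, sizeof(T));
      return;
    }
  }
  std::memcpy(slot, &value, sizeof(T));
}

// Model of the NEW behavior: always copy the exact bytes of `value`.
template <typename T>
void set_element_new(T* slot, T const& value)
{
  static_assert(std::is_trivially_copyable_v<T>, "byte copy needs a trivially copyable type");
  std::memcpy(slot, &value, sizeof(T));
}

// Compare both models on the problematic input.
inline void check_dispatch_models()
{
  double old_slot        = 1.0;
  double new_slot        = 1.0;
  double const neg_zero  = -0.0;
  set_element_old(&old_slot, neg_zero);
  set_element_new(&new_slot, neg_zero);
  assert(!std::signbit(old_slot));  // sign bit lost by the zero fast-path
  assert(std::signbit(new_slot));   // sign bit preserved by the byte copy
}
```

For every other input the two models store the same bytes, so in this model dropping the fast-paths only changes the result for negative zero.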

set_element_to_zero_async is unchanged — its explicit "set to zero" semantics make cudaMemsetAsync the correct implementation.

Testing

Added NegativeZeroTest.PreservesFloatNegativeZero and NegativeZeroTest.PreservesDoubleNegativeZero regression tests that verify the sign bit of -0.0f / -0.0 survives a round-trip through set_element_async → element. All 122 tests pass locally (CUDA 13.0, RTX 5880).
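Checking the sign bit matters because a plain equality assertion would pass even with the bug present. The sketch below illustrates this with two hypothetical predicates (`passes_equality_check`, `passes_signbit_check`); it is not the test code added in this PR.

```cpp
#include <cassert>
#include <cmath>

// Value-only comparison: treats -0.0 and +0.0 as the same result.
inline bool passes_equality_check(double expected, double actual)
{
  return expected == actual;
}

// Value comparison plus sign-bit comparison: distinguishes the two zeros.
inline bool passes_signbit_check(double expected, double actual)
{
  return expected == actual && std::signbit(expected) == std::signbit(actual);
}

// Simulate a buggy round-trip that returns +0.0 when -0.0 was stored.
inline void check_predicates()
{
  double const stored   = -0.0;
  double const readback = 0.0;
  assert(passes_equality_check(stored, readback));  // bug goes unnoticed
  assert(!passes_signbit_check(stored, readback));  // bug is detected
}
```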

Closes #2298

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Made with Cursor

copy-pr-bot · 2026-03-16T03:32:34Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-03-16T03:35:36Z

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Summary by CodeRabbit

Tests
- Added tests preserving negative zero for floats and new multi-threaded tests for memory-resource tracking/statistics; added a helper delayed memory-resource for ABA-style testing.
Refactor
- Unified asynchronous element-setting to a single consistent memcpy path; enforced stricter tracking checks and reordered upstream deallocation in adaptors.
Chores
- CI/scripts, dependency matrices, and pre-commit hooks updated; version/config handling made more dynamic.

Walkthrough

Removed type-specific memset optimizations in element-setting paths so set_element_async/set_value_async always use host-to-device memcpy; added tests preserving IEEE-754 negative zero for float/double; adjusted resource adaptor insertion/deallocation order and added ABA test utilities; multiple CI and dependency updates.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Element-set implementation `cpp/include/rmm/device_uvector.hpp`, `cpp/include/rmm/device_scalar.hpp` | Removed fundamental-type memset/zero-specialization and bool-specific paths; `set_element_async` / `set_value_async` now consistently use host->device memcpy (docs updated). |
| Resource adaptors (thread-safety/ordering) `cpp/include/rmm/mr/aligned_resource_adaptor.hpp`, `cpp/include/rmm/mr/tracking_resource_adaptor.hpp`, `cpp/include/rmm/mr/statistics_resource_adaptor.hpp` | Enforced unique insertion checks (try_emplace + RMM_EXPECTS) and moved upstream deallocate calls to after protected updates to avoid ABA/order issues. |
| New test helpers & MR tests `cpp/tests/mr/delayed_memory_resource.hpp`, `cpp/tests/mr/statistics_mr_tests.cpp`, `cpp/tests/mr/tracking_mr_tests.cpp` | Added delayed_memory_resource to simulate deallocation delays and multithreaded tests for statistics and tracking adaptors to exercise ABA/race scenarios. |
| Element tests `cpp/tests/device_uvector_tests.cpp` | Added tests PreservesFloatNegativeZero and PreservesDoubleNegativeZero verifying signbit preservation when setting -0.0 asynchronously; added include and year update. |
| CI scripts `ci/download-torch-wheels.sh`, `ci/test_python_integrations.sh`, `ci/test_wheel.sh`, `ci/test_wheel_integrations.sh`, `ci/release/update-version.sh` | Added wheel-downloader script, changed PyTorch wheel/constraint handling, made PyTorch GPU gating/constraint vars explicit, and removed two sed in-place updates from release script. |
| Dependencies & config `dependencies.yaml`, `.pre-commit-config.yaml`, `cpp/examples/versions.cmake` | Reworked PyTorch dependency matrices and pins (CUDA-specific wheels), added new public block `torch_only`, updated pre-commit hook revision and exclusions, and made versions.cmake derive RMM_TAG from rapids_config. |
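The resource-adaptor row above mentions two ideas: rejecting duplicate insertions with `try_emplace`, and updating the bookkeeping before handing memory back upstream. A minimal host-only sketch of that ordering follows. It assumes `std::malloc` / `std::free` as the upstream, and `tracking_model` is a hypothetical class, not the RMM adaptor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <mutex>
#include <new>
#include <stdexcept>

// Hypothetical model of a tracking adaptor over malloc/free.
class tracking_model {
 public:
  void* allocate(std::size_t bytes)
  {
    void* ptr = std::malloc(bytes);
    if (ptr == nullptr) { throw std::bad_alloc{}; }
    std::lock_guard<std::mutex> lock(mutex_);
    // try_emplace reports whether the address was already tracked.
    bool const inserted = live_.try_emplace(ptr, bytes).second;
    if (!inserted) { throw std::logic_error("address already tracked"); }
    return ptr;
  }

  void deallocate(void* ptr)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      live_.erase(ptr);  // update bookkeeping first
    }
    // Release upstream last, so another thread cannot receive and record the
    // same address while the stale entry is still in the map.
    std::free(ptr);
  }

  std::size_t live_count() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return live_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::map<void*, std::size_t> live_;
};
```

If the upstream release happened before the erase, a concurrent `allocate` could be handed the just-freed address and find it still present in the map, which is the ABA-style hazard the summary refers to.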

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

examples: read tag from RAPIDS_BRANCH file #2293 — overlapping changes to pre-commit hook exclusions and deriving RMM_TAG from rapids_config (matches .pre-commit-config.yaml and cpp/examples/versions.cmake edits).
ensure 'torch' CUDA wheels are installed in CI #2279 — adds the Torch wheel download workflow and related CI adjustments (matches ci/download-torch-wheels.sh and PyTorch constraint logic).
Fix ABA problem in tracking resource adaptor and statistics resource adaptor #2304 — implements similar ABA/thread-safety fixes and tests for resource adaptors (matches try_emplace/insertion checks, moved deallocate ordering, delayed_memory_resource, and multithreaded MR tests).

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Out of Scope Changes check | ❓ Inconclusive | Multiple supporting changes are included (pre-commit config bumps, CI scripts, dependencies.yaml rewrites, new test utilities), but their necessity to the core fix cannot be fully assessed without additional context on whether these are prerequisites or incidental. | Verify that all changes in CI scripts, dependencies.yaml, pre-commit config, and new test utilities (delayed_memory_resource.hpp) are directly related to or necessary for the -0.0 preservation fix and regression tests. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the main change: removing zero-value special casing in set_element_async to preserve IEEE 754 -0.0 sign bits. |
| Description check | ✅ Passed | The description clearly explains the bug (IEEE 754 -0.0 treated as +0.0), root cause (cudaMemsetAsync clearing sign bits), impact (downstream scalar divergence), and the fix (remove special casing and always use cudaMemcpyAsync). |
| Linked Issues check | ✅ Passed | The PR fulfills the main objectives from `#2298`: removes zero-value special casing from set_element_async, adds regression tests for -0.0 preservation, and includes supporting documentation updates across multiple files. |

Copilot

Pull request overview

Updates rmm::device_uvector::set_element_async to preserve floating-point negative zero (-0.0) by removing the previous zero/bool fast-paths that used cudaMemsetAsync, and adds regression tests to ensure the sign bit is retained.

Changes:

Remove cudaMemsetAsync-based fast-paths in device_uvector::set_element_async (zero values and bool).
Add new tests validating that -0.0f and -0.0 preserve their sign bit through set_element_async + element.
Add <cmath> include and update copyright year.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| `cpp/include/rmm/device_uvector.hpp` | Removes the memset-based “zero/bool optimization” from `set_element_async` to avoid clobbering `-0.0`’s sign bit. |
| `cpp/tests/device_uvector_tests.cpp` | Adds regression tests for preserving `-0.0` sign bit for `float` and `double`. |

cpp/include/rmm/device_uvector.hpp

coderabbitai

🧹 Nitpick comments (1)

cpp/include/rmm/device_scalar.hpp (1)

168-196: Clean up stale memset wording in adjacent comment.

Nice update on Line 168. To keep docs consistent, the deleted-overload comment on Lines 194-195 should also drop "/ memset" since this path is now memcpy-only.

Suggested doc-only diff

-  // literal can be freed before the async memcpy / memset executes.
+  // literal can be freed before the async memcpy executes.
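The deleted overload this comment documents relies on a common C++ idiom: declaring the rvalue-reference overload as `= delete` so that temporaries and literals are rejected at compile time. A minimal sketch, assuming a hypothetical `scalar_model` class that stands in for a type whose setter reads its argument later (it is not `rmm::device_scalar`):

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in: remembers the address of its argument, the way an
// asynchronous copy would need the source to stay alive until it runs.
class scalar_model {
 public:
  void set_value_async(int const& value) { pending_ = &value; }
  void set_value_async(int&&) = delete;  // reject temporaries and literals

  int const* pending() const { return pending_; }

 private:
  int const* pending_{nullptr};
};

// Detection idiom: true when scalar_model::set_value_async accepts an Arg.
template <typename Arg, typename = void>
struct can_set : std::false_type {};

template <typename Arg>
struct can_set<Arg,
               std::void_t<decltype(std::declval<scalar_model&>().set_value_async(
                 std::declval<Arg>()))>> : std::true_type {};

static_assert(can_set<int const&>::value, "named const lvalues are accepted");
static_assert(can_set<int&>::value, "named lvalues are accepted");
static_assert(!can_set<int&&>::value, "temporaries are rejected at compile time");
```

With the deleted overload in place, a call such as `s.set_value_async(42)` fails to compile, while passing a named variable that outlives the copy is accepted.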

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@cpp/include/rmm/device_scalar.hpp` around lines 168 - 196, The comment next
to the deleted rmm::device_scalar::set_value_async(value_type&&,
cuda_stream_view) overload still mentions "/ memset" which is stale because
set_value_async now uses memcpy-only; update the inline comment above the
deleted overload to remove the "/ memset" phrase so it reads something like
"Disallow passing literals to set_value to avoid race conditions where the
memory holding the literal can be freed before the async memcpy executes."
Ensure you modify the comment text near the deleted overload for set_value_async
to reference only memcpy and the race condition.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 29a906e9-8ddf-4c5f-81d6-2b92f9f71c47

📥 Commits

Reviewing files that changed from the base of the PR and between fe360fd and 20f799e.

📒 Files selected for processing (1)

cpp/include/rmm/device_scalar.hpp

bdice · 2026-03-16T14:29:51Z

/ok to test 20f799e

bdice · 2026-03-16T14:30:31Z

I will retarget this to release/26.04 so this bugfix can be released sooner.

The zero-value optimization in `set_element_async` used `cudaMemsetAsync` when `value == value_type{0}`. For IEEE 754 floating-point types, `-0.0 == 0.0` evaluates to `true`, so `-0.0` was routed through `cudaMemsetAsync(..., 0, ...)`, which clears all bits, including the sign bit, normalizing `-0.0` to `+0.0`.

Remove all `constexpr` special casing (both the `bool` memset path and the fundamental-type zero-detection path) and always use `cudaMemcpyAsync`. This preserves exact bit-level representations for all types, which is the correct behavior for a memory management library that sits below cuDF, cuML, and cuGraph.

Add regression tests that verify `-0.0f` and `-0.0` sign bits survive a round-trip through `set_element_async` / `element`.

Closes rapidsai#2298

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor

…_async

The previous commit removed the cudaMemsetAsync zero-value fast-path but left behind the doxygen comment describing it. Remove the outdated paragraph to keep documentation consistent with the implementation.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor

@note

device_scalar::set_value_async delegates to device_uvector::set_element_async, which no longer has the cudaMemsetAsync zero-value fast-path. Remove the outdated doxygen paragraph and adjust the @note to reflect that only cudaMemcpyAsync is used.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor

The root cause (RMM set_element_async normalizing -0.0 to +0.0 via cudaMemsetAsync) has been fixed upstream in rapidsai/rmm#2302 (commit 06c3562). The fix is now included in the spark-rapids-jni nightly SNAPSHOT via the updated cudf-pins RMM pin.

Remove the ColumnVector-based workaround in GpuScalar and restore the original direct Scalar.fromDouble/fromFloat calls. The exclusion removal in RapidsTestSettings (from the prior commit) is retained; the test now passes with the upstream fix alone.

Verified: RapidsSQLQuerySuite 234 tests, 0 failures, 0 errors.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor

…ry' test (issue #14116) (#14400)

## Summary

- **Root cause fixed upstream**: The `-0.0` normalization bug was in RMM's `device_uvector::set_element_async`, which used `cudaMemsetAsync` for zero values, clearing the sign bit of IEEE 754 `-0.0`. This has been fixed in [rapidsai/rmm#2302](rapidsai/rmm@06c3562).
- **This PR**: Simply removes the `.exclude()` for `"normalize special floating numbers in subquery"` in `RapidsSQLQuerySuite`, re-enabling the test now that the upstream fix has landed in the spark-rapids-jni nightly SNAPSHOT.
- **No spark-rapids code changes needed**: The original workaround in `GpuScalar` (using `ColumnVector` path to bypass `Scalar.fromDouble`) has been removed; the direct `Scalar.fromDouble`/`Scalar.fromFloat` calls now preserve `-0.0` correctly.

## Upstream fix chain

```
rapidsai/rmm#2302 (06c3562): remove zero-value cudaMemsetAsync special-casing
  → spark-rapids-jni cudf-pins updated (RMM pin now includes the fix)
  → spark-rapids-jni nightly SNAPSHOT rebuilt with fixed librmm.so
  → this test now passes on GPU without any spark-rapids workaround
```

### RAPIDS test to Spark original mapping

| RAPIDS test | Spark original | Spark file | Lines |
|---|---|---|---|
| `normalize special floating numbers in subquery` (inherited) | `normalize special floating numbers in subquery` | `sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala` | [3620-3636](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L3620-L3636) ([permalink](https://github.com/apache/spark/blob/v3.3.0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L3620-L3636)) |

## Test plan

- [x] `mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=RapidsSQLQuerySuite`: 234 tests, 0 failures, 0 errors
- [x] The previously excluded test "normalize special floating numbers in subquery" now passes
- [x] No new test failures introduced
- [x] Verified on latest `origin/main` (commit b74f7f7) with upstream RMM fix in spark-rapids-jni SNAPSHOT

Closes #14116

### Checklists

- [ ] This PR has added documentation for new or modified features or behaviors.
- [x] This PR has added new tests or modified existing tests to cover new code paths. (Re-enabled the inherited Spark test "normalize special floating numbers in subquery" by removing its `.exclude()` entry. The test validates GPU-CPU parity for `-0.0` in scalar subqueries.)
- [ ] Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Made with [Cursor](https://cursor.com)

---------

Signed-off-by: Allen Xu <allxu@nvidia.com>

Copilot AI review requested due to automatic review settings March 16, 2026 03:32

wjxiz1992 requested a review from a team as a code owner March 16, 2026 03:32

wjxiz1992 requested review from bdice and shrshi March 16, 2026 03:32

github-project-automation bot added this to RMM Project Board Mar 16, 2026

Copilot started reviewing on behalf of wjxiz1992 March 16, 2026 03:33

wjxiz1992 mentioned this pull request Mar 16, 2026

device_uvector::set_element_async loses IEEE 754 sign bit of -0.0 due to zero-optimization #2298

Closed

Copilot AI reviewed Mar 16, 2026

coderabbitai bot reviewed Mar 16, 2026

bdice assigned wjxiz1992 Mar 16, 2026

bdice moved this to In Progress in RMM Project Board Mar 16, 2026

bdice added the bug (Something isn't working) and breaking (Breaking change) labels Mar 16, 2026

bdice approved these changes Mar 16, 2026

rapidsai deleted a comment from copy-pr-bot bot Mar 16, 2026

wence- approved these changes Mar 16, 2026

wjxiz1992 added 3 commits March 16, 2026 11:12

bdice force-pushed the fix/2298-negative-zero-sign-bit branch from 20f799e to eba2114 Compare March 16, 2026 16:12

bdice requested review from a team as code owners March 16, 2026 16:12

bdice requested a review from msarahan March 16, 2026 16:12

bdice changed the base branch from main to release/26.04 March 16, 2026 16:12

bdice approved these changes Mar 16, 2026

bdice removed the request for review from msarahan March 16, 2026 16:12

bdice removed the request for review from shrshi March 16, 2026 16:12

bdice merged commit 06c3562 into rapidsai:release/26.04 Mar 16, 2026
82 of 83 checks passed

github-project-automation bot moved this from In Progress to Done in RMM Project Board Mar 16, 2026

coderabbitai bot mentioned this pull request Mar 16, 2026

Forward-merge release/26.04 into main #2310

Merged

bdice mentioned this pull request Mar 17, 2026

Merge main into staging #2311

Merged

3 tasks

greptile-apps bot mentioned this pull request Mar 18, 2026

[AutoSparkUT] Re-enable 'normalize special floating numbers in subquery' test (issue #14116) NVIDIA/spark-rapids#14400

Merged

7 tasks
