
Remove zero-value special casing in set_element_async to preserve IEEE 754 -0.0 #2302

Merged
bdice merged 3 commits into rapidsai:release/26.04 from wjxiz1992:fix/2298-negative-zero-sign-bit
Mar 16, 2026

Conversation

@wjxiz1992
Member

Description

device_uvector::set_element_async had a zero-value optimization that used cudaMemsetAsync when value == value_type{0}. For IEEE 754 floating-point types, -0.0 == 0.0 is true per the standard, so -0.0 was incorrectly routed through cudaMemsetAsync(..., 0, ...) which clears all bits — including the sign bit — normalizing -0.0 to +0.0.
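The mechanism can be reproduced on the host without a GPU. The following is a minimal sketch, assuming IEEE 754 binary32 `float`; `bits_of`, `store_with_zero_fast_path`, and `check_negative_zero_facts` are hypothetical helpers written for illustration, and `std::memset` stands in for `cudaMemsetAsync`. It is not RMM code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Raw 32-bit pattern of a float (assumes IEEE 754 binary32).
inline std::uint32_t bits_of(float f)
{
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
  std::uint32_t u{};
  std::memcpy(&u, &f, sizeof(u));
  return u;
}

// Hypothetical host stand-in for the removed fast path: when the value
// compares equal to zero, zero-fill the slot (as a memset with byte value 0
// would); otherwise copy the value's bytes.
inline float store_with_zero_fast_path(float value)
{
  float slot = 1.0f;
  if (value == 0.0f) {
    std::memset(&slot, 0, sizeof(slot));  // clears every bit, sign bit included
  } else {
    std::memcpy(&slot, &value, sizeof(slot));
  }
  return slot;
}

// The facts the bug hinges on, written as assertions.
inline void check_negative_zero_facts()
{
  float const neg_zero = std::copysign(0.0f, -1.0f);
  assert(neg_zero == 0.0f);                  // equality ignores the sign of zero
  assert(bits_of(neg_zero) == 0x80000000u);  // only the sign bit is set
  assert(bits_of(0.0f) == 0x00000000u);      // +0.0 is all-zero bits
  assert(!std::signbit(store_with_zero_fast_path(neg_zero)));  // sign lost
}
```

Calling `check_negative_zero_facts()` exercises all four observations: the equality test cannot distinguish the two zeros, so the zero-fill branch is taken and the sign bit is dropped.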

This corrupts the in-memory representation of -0.0 for any downstream library that creates scalars through RMM (cudf::fixed_width_scalar::set_value → rmm::device_scalar::set_value_async → device_uvector::set_element_async), causing observable behavioral divergence in spark-rapids (e.g., cast(-0.0 as string) returns "0.0" on GPU instead of "-0.0").

Fix

Per the discussion in #2298, remove all constexpr special casing in set_element_async — both the bool cudaMemsetAsync path and the is_fundamental_v zero-detection path — and always use cudaMemcpyAsync. This preserves exact bit-level representations for all types, which is the correct contract for a memory management library that sits below cuDF, cuML, and cuGraph.

set_element_to_zero_async is unchanged — its explicit "set to zero" semantics make cudaMemsetAsync the correct implementation.

Testing

Added NegativeZeroTest.PreservesFloatNegativeZero and NegativeZeroTest.PreservesDoubleNegativeZero regression tests that verify the sign bit of -0.0f / -0.0 survives a round-trip through set_element_async → element. All 122 tests pass locally (CUDA 13.0, RTX 5880).
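The round-trip property these tests check can be expressed in a host-only form. The sketch below is an assumption-laden analogue, not the tests from this PR: `host_buffer` is a hypothetical stand-in for a device buffer, and `negative_zero_survives` is an illustrative helper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host stand-in for a typed device buffer: elements are stored
// and fetched by raw byte copies only.
template <typename T>
class host_buffer {
 public:
  explicit host_buffer(std::size_t n) : size_(n), bytes_(n * sizeof(T)) {}

  void set_element(std::size_t i, T const& v)
  {
    assert(i < size_);
    std::memcpy(bytes_.data() + i * sizeof(T), &v, sizeof(T));
  }

  T element(std::size_t i) const
  {
    assert(i < size_);
    T out{};
    std::memcpy(&out, bytes_.data() + i * sizeof(T), sizeof(T));
    return out;
  }

 private:
  std::size_t size_;
  std::vector<unsigned char> bytes_;
};

// Write -0.0 into the buffer, read it back, and report whether the result is
// still a zero whose sign bit is set.
template <typename T>
bool negative_zero_survives()
{
  host_buffer<T> buf(4);
  T const neg_zero = std::copysign(T{0}, T{-1});
  buf.set_element(2, neg_zero);
  T const out = buf.element(2);
  return out == T{0} && std::signbit(out);
}
```

Checking `std::signbit` rather than comparing with `==` matters here, because `-0.0 == 0.0` would make an equality-based assertion pass even when the sign bit has been lost.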

Closes #2298

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Made with Cursor

Copilot AI review requested due to automatic review settings March 16, 2026 03:32
@wjxiz1992 wjxiz1992 requested a review from a team as a code owner March 16, 2026 03:32
@wjxiz1992 wjxiz1992 requested review from bdice and shrshi March 16, 2026 03:32
@copy-pr-bot

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai bot commented Mar 16, 2026

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Summary by CodeRabbit

  • Tests

    • Added tests preserving negative zero for floats and new multi-threaded tests for memory-resource tracking/statistics; added a helper delayed memory-resource for ABA-style testing.
  • Refactor

    • Unified asynchronous element-setting to a single consistent memcpy path; enforced stricter tracking checks and reordered upstream deallocation in adaptors.
  • Chores

    • CI/scripts, dependency matrices, and pre-commit hooks updated; version/config handling made more dynamic.

Walkthrough

Removed type-specific memset optimizations in element-setting paths so set_element_async/set_value_async always use host-to-device memcpy; added tests preserving IEEE-754 negative zero for float/double; adjusted resource adaptor insertion/deallocation order and added ABA test utilities; multiple CI and dependency updates.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Element-set implementation: cpp/include/rmm/device_uvector.hpp, cpp/include/rmm/device_scalar.hpp | Removed fundamental-type memset/zero-specialization and bool-specific paths; set_element_async / set_value_async now consistently use host->device memcpy (docs updated). |
| Resource adaptors (thread-safety/ordering): cpp/include/rmm/mr/aligned_resource_adaptor.hpp, cpp/include/rmm/mr/tracking_resource_adaptor.hpp, cpp/include/rmm/mr/statistics_resource_adaptor.hpp | Enforced unique insertion checks (try_emplace + RMM_EXPECTS) and moved upstream deallocate calls to after protected updates to avoid ABA/order issues. |
| New test helpers & MR tests: cpp/tests/mr/delayed_memory_resource.hpp, cpp/tests/mr/statistics_mr_tests.cpp, cpp/tests/mr/tracking_mr_tests.cpp | Added delayed_memory_resource to simulate deallocation delays and multithreaded tests for statistics and tracking adaptors to exercise ABA/race scenarios. |
| Element tests: cpp/tests/device_uvector_tests.cpp | Added tests PreservesFloatNegativeZero and PreservesDoubleNegativeZero verifying signbit preservation when setting -0.0 asynchronously; added include and year update. |
| CI scripts: ci/download-torch-wheels.sh, ci/test_python_integrations.sh, ci/test_wheel.sh, ci/test_wheel_integrations.sh, ci/release/update-version.sh | Added wheel-downloader script, changed PyTorch wheel/constraint handling, made PyTorch GPU gating/constraint vars explicit, and removed two sed in-place updates from release script. |
| Dependencies & config: dependencies.yaml, .pre-commit-config.yaml, cpp/examples/versions.cmake | Reworked PyTorch dependency matrices and pins (CUDA-specific wheels), added new public block torch_only, updated pre-commit hook revision and exclusions, and made versions.cmake derive RMM_TAG from rapids_config. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Out of Scope Changes check | ❓ Inconclusive | Multiple supporting changes are included (pre-commit config bumps, CI scripts, dependencies.yaml rewrites, new test utilities), but their necessity to the core fix cannot be fully assessed without additional context on whether these are prerequisites or incidental. | Verify that all changes in CI scripts, dependencies.yaml, pre-commit config, and new test utilities (delayed_memory_resource.hpp) are directly related to or necessary for the -0.0 preservation fix and regression tests. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the main change: removing zero-value special casing in set_element_async to preserve IEEE 754 -0.0 sign bits. |
| Description check | ✅ Passed | The description clearly explains the bug (IEEE 754 -0.0 treated as +0.0), root cause (cudaMemsetAsync clearing sign bits), impact (downstream scalar divergence), and the fix (remove special casing and always use cudaMemcpyAsync). |
| Linked Issues check | ✅ Passed | The PR fulfills the main objectives from #2298: removes zero-value special casing from set_element_async, adds regression tests for -0.0 preservation, and includes supporting documentation updates across multiple files. |



Copilot AI left a comment


Pull request overview

Updates rmm::device_uvector::set_element_async to preserve floating-point negative zero (-0.0) by removing the previous zero/bool fast-paths that used cudaMemsetAsync, and adds regression tests to ensure the sign bit is retained.

Changes:

  • Remove cudaMemsetAsync-based fast-paths in device_uvector::set_element_async (zero values and bool).
  • Add new tests validating that -0.0f and -0.0 preserve their sign bit through set_element_async + element.
  • Add <cmath> include and update copyright year.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| cpp/include/rmm/device_uvector.hpp | Removes the memset-based “zero/bool optimization” from set_element_async to avoid clobbering -0.0’s sign bit. |
| cpp/tests/device_uvector_tests.cpp | Adds regression tests for preserving -0.0 sign bit for float and double. |



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
cpp/include/rmm/device_scalar.hpp (1)

168-196: Clean up stale memset wording in adjacent comment.

Nice update on Line 168. To keep docs consistent, the deleted-overload comment on Lines 194-195 should also drop "/ memset" since this path is now memcpy-only.

Suggested doc-only diff
-  // literal can be freed before the async memcpy / memset executes.
+  // literal can be freed before the async memcpy executes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/rmm/device_scalar.hpp` around lines 168 - 196, The comment next
to the deleted rmm::device_scalar::set_value_async(value_type&&,
cuda_stream_view) overload still mentions "/ memset" which is stale because
set_value_async now uses memcpy-only; update the inline comment above the
deleted overload to remove the "/ memset" phrase so it reads something like
"Disallow passing literals to set_value to avoid race conditions where the
memory holding the literal can be freed before the async memcpy executes."
Ensure you modify the comment text near the deleted overload for set_value_async
to reference only memcpy and the race condition.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 29a906e9-8ddf-4c5f-81d6-2b92f9f71c47

📥 Commits

Reviewing files that changed from the base of the PR and between fe360fd and 20f799e.

📒 Files selected for processing (1)
  • cpp/include/rmm/device_scalar.hpp

@bdice bdice moved this to In Progress in RMM Project Board Mar 16, 2026
@bdice bdice added the bug (Something isn't working) and breaking (Breaking change) labels Mar 16, 2026
@bdice
Collaborator

bdice commented Mar 16, 2026

/ok to test 20f799e

@rapidsai rapidsai deleted a comment from copy-pr-bot bot Mar 16, 2026
@bdice
Collaborator

bdice commented Mar 16, 2026

I will retarget this to release/26.04 so this bugfix can be released sooner.

The zero-value optimization in `set_element_async` used
`cudaMemsetAsync` when `value == value_type{0}`. For IEEE 754
floating-point types, `-0.0 == 0.0` evaluates to `true`, so
`-0.0` was routed through `cudaMemsetAsync(..., 0, ...)` which
clears all bits — including the sign bit — normalizing `-0.0`
to `+0.0`.

Remove all `constexpr` special casing (both the `bool` memset
path and the fundamental-type zero-detection path) and always
use `cudaMemcpyAsync`. This preserves exact bit-level
representations for all types, which is the correct behavior
for a memory management library that sits below cuDF, cuML,
and cuGraph.

Add regression tests that verify `-0.0f` and `-0.0` sign bits
survive a round-trip through `set_element_async` / `element`.

Closes rapidsai#2298

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
…_async

The previous commit removed the cudaMemsetAsync zero-value fast-path
but left behind the doxygen comment describing it. Remove the
outdated paragraph to keep documentation consistent with the
implementation.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
device_scalar::set_value_async delegates to
device_uvector::set_element_async, which no longer has the
cudaMemsetAsync zero-value fast-path. Remove the outdated
doxygen paragraph and adjust the @note to reflect that only
cudaMemcpyAsync is used.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
@bdice bdice force-pushed the fix/2298-negative-zero-sign-bit branch from 20f799e to eba2114 Compare March 16, 2026 16:12
@bdice bdice requested review from a team as code owners March 16, 2026 16:12
@bdice bdice requested a review from msarahan March 16, 2026 16:12
@bdice bdice changed the base branch from main to release/26.04 March 16, 2026 16:12
@bdice bdice removed the request for review from msarahan March 16, 2026 16:12
@bdice bdice removed the request for review from shrshi March 16, 2026 16:12
@bdice bdice merged commit 06c3562 into rapidsai:release/26.04 Mar 16, 2026
82 of 83 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in RMM Project Board Mar 16, 2026
@bdice bdice mentioned this pull request Mar 17, 2026
3 tasks
wjxiz1992 added a commit to wjxiz1992/spark-rapids that referenced this pull request Mar 18, 2026
The root cause (RMM set_element_async normalizing -0.0 to +0.0 via
cudaMemsetAsync) has been fixed upstream in rapidsai/rmm#2302
(commit 06c3562). The fix is now included in the spark-rapids-jni
nightly SNAPSHOT via the updated cudf-pins RMM pin.

Remove the ColumnVector-based workaround in GpuScalar and restore the
original direct Scalar.fromDouble/fromFloat calls. The exclusion
removal in RapidsTestSettings (from the prior commit) is retained —
the test now passes with the upstream fix alone.

Verified: RapidsSQLQuerySuite 234 tests, 0 failures, 0 errors.
Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
wjxiz1992 added a commit to NVIDIA/spark-rapids that referenced this pull request Mar 24, 2026
…ry' test (issue #14116) (#14400)

## Summary

- **Root cause fixed upstream**: The `-0.0` normalization bug was in
RMM's `device_uvector::set_element_async`, which used `cudaMemsetAsync`
for zero values — clearing the sign bit of IEEE 754 `-0.0`. This has
been fixed in
[rapidsai/rmm#2302](rapidsai/rmm@06c3562).
- **This PR**: Simply removes the `.exclude()` for `"normalize special
floating numbers in subquery"` in `RapidsSQLQuerySuite`, re-enabling the
test now that the upstream fix has landed in the spark-rapids-jni
nightly SNAPSHOT.
- **No spark-rapids code changes needed**: The original workaround in
`GpuScalar` (using `ColumnVector` path to bypass `Scalar.fromDouble`)
has been removed — the direct `Scalar.fromDouble`/`Scalar.fromFloat`
calls now preserve `-0.0` correctly.

## Upstream fix chain

```
rapidsai/rmm#2302 (06c3562)  — remove zero-value cudaMemsetAsync special-casing
  → spark-rapids-jni cudf-pins updated (RMM pin now includes the fix)
    → spark-rapids-jni nightly SNAPSHOT rebuilt with fixed librmm.so
      → this test now passes on GPU without any spark-rapids workaround
```

### RAPIDS test to Spark original mapping

| RAPIDS test | Spark original | Spark file | Lines |
|---|---|---|---|
| `normalize special floating numbers in subquery` (inherited) | `normalize special floating numbers in subquery` | `sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala` | [3620-3636](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L3620-L3636) ([permalink](https://github.com/apache/spark/blob/v3.3.0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L3620-L3636)) |

## Test plan

- [x] `mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=RapidsSQLQuerySuite` (234 tests, 0 failures, 0 errors)
- [x] The previously excluded test "normalize special floating numbers
in subquery" now passes
- [x] No new test failures introduced
- [x] Verified on latest `origin/main` (commit b74f7f7) with upstream
RMM fix in spark-rapids-jni SNAPSHOT

Closes #14116

### Checklists

- [ ] This PR has added documentation for new or modified features or
behaviors.
- [x] This PR has added new tests or modified existing tests to cover
new code paths.
(Re-enabled the inherited Spark test "normalize special floating numbers
in subquery" by removing its `.exclude()` entry. The test validates
GPU-CPU parity for `-0.0` in scalar subqueries.)
- [ ] Performance testing has been performed and its results are added
in the PR description. Or, an issue has been filed with a link in the PR
description.

Made with [Cursor](https://cursor.com)

---------

Signed-off-by: Allen Xu <allxu@nvidia.com>

Labels

breaking Breaking change bug Something isn't working

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

device_uvector::set_element_async loses IEEE 754 sign bit of -0.0 due to zero-optimization

4 participants