Remove zero-value special casing in set_element_async to preserve IEEE 754 -0.0#2302
Conversation
|
Caution Review failedFailed to post review comments 📝 WalkthroughSummary by CodeRabbit
WalkthroughRemoved type-specific memset optimizations in element-setting paths so set_element_async/set_value_async always use host-to-device memcpy; added tests preserving IEEE-754 negative zero for float/double; adjusted resource adaptor insertion/deallocation order and added ABA test utilities; multiple CI and dependency updates. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment Tip You can make CodeRabbit's review stricter and more nitpicky using the `assertive` profile, if that's what you prefer.Change the |
There was a problem hiding this comment.
Pull request overview
Updates rmm::device_uvector::set_element_async to preserve floating-point negative zero (-0.0) by removing the previous zero/bool fast-paths that used cudaMemsetAsync, and adds regression tests to ensure the sign bit is retained.
Changes:
- Remove
cudaMemsetAsync-based fast-paths indevice_uvector::set_element_async(zero values andbool). - Add new tests validating that
-0.0fand-0.0preserve their sign bit throughset_element_async+element. - Add
<cmath>include and update copyright year.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
cpp/include/rmm/device_uvector.hpp |
Removes the memset-based “zero/bool optimization” from set_element_async to avoid clobbering -0.0’s sign bit. |
cpp/tests/device_uvector_tests.cpp |
Adds regression tests for preserving -0.0 sign bit for float and double. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cpp/include/rmm/device_scalar.hpp (1)
168-196: Clean up stalememsetwording in adjacent comment.Nice update on Line 168. To keep docs consistent, the deleted-overload comment on Lines 194-195 should also drop "
/ memset" since this path is now memcpy-only.Suggested doc-only diff
- // literal can be freed before the async memcpy / memset executes. + // literal can be freed before the async memcpy executes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/rmm/device_scalar.hpp` around lines 168 - 196, The comment next to the deleted rmm::device_scalar::set_value_async(value_type&&, cuda_stream_view) overload still mentions "/ memset" which is stale because set_value_async now uses memcpy-only; update the inline comment above the deleted overload to remove the "/ memset" phrase so it reads something like "Disallow passing literals to set_value to avoid race conditions where the memory holding the literal can be freed before the async memcpy executes." Ensure you modify the comment text near the deleted overload for set_value_async to reference only memcpy and the race condition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cpp/include/rmm/device_scalar.hpp`:
- Around line 168-196: The comment next to the deleted
rmm::device_scalar::set_value_async(value_type&&, cuda_stream_view) overload
still mentions "/ memset" which is stale because set_value_async now uses
memcpy-only; update the inline comment above the deleted overload to remove the
"/ memset" phrase so it reads something like "Disallow passing literals to
set_value to avoid race conditions where the memory holding the literal can be
freed before the async memcpy executes." Ensure you modify the comment text near
the deleted overload for set_value_async to reference only memcpy and the race
condition.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 29a906e9-8ddf-4c5f-81d6-2b92f9f71c47
📒 Files selected for processing (1)
cpp/include/rmm/device_scalar.hpp
|
/ok to test 20f799e |
|
I will retarget this to |
The zero-value optimization in `set_element_async` used
`cudaMemsetAsync` when `value == value_type{0}`. For IEEE 754
floating-point types, `-0.0 == 0.0` evaluates to `true`, so
`-0.0` was routed through `cudaMemsetAsync(..., 0, ...)` which
clears all bits — including the sign bit — normalizing `-0.0`
to `+0.0`.
Remove all `constexpr` special casing (both the `bool` memset
path and the fundamental-type zero-detection path) and always
use `cudaMemcpyAsync`. This preserves exact bit-level
representations for all types, which is the correct behavior
for a memory management library that sits below cuDF, cuML,
and cuGraph.
Add regression tests that verify `-0.0f` and `-0.0` sign bits
survive a round-trip through `set_element_async` / `element`.
Closes rapidsai#2298
Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
…_async The previous commit removed the cudaMemsetAsync zero-value fast-path but left behind the doxygen comment describing it. Remove the outdated paragraph to keep documentation consistent with the implementation. Signed-off-by: Allen Xu <allxu@nvidia.com> Made-with: Cursor
device_scalar::set_value_async delegates to device_uvector::set_element_async, which no longer has the cudaMemsetAsync zero-value fast-path. Remove the outdated doxygen paragraph and adjust the @note to reflect that only cudaMemcpyAsync is used. Signed-off-by: Allen Xu <allxu@nvidia.com> Made-with: Cursor
20f799e to
eba2114
Compare
The root cause (RMM set_element_async normalizing -0.0 to +0.0 via cudaMemsetAsync) has been fixed upstream in rapidsai/rmm#2302 (commit 06c3562). The fix is now included in the spark-rapids-jni nightly SNAPSHOT via the updated cudf-pins RMM pin. Remove the ColumnVector-based workaround in GpuScalar and restore the original direct Scalar.fromDouble/fromFloat calls. The exclusion removal in RapidsTestSettings (from the prior commit) is retained — the test now passes with the upstream fix alone. Verified: RapidsSQLQuerySuite 234 tests, 0 failures, 0 errors. Signed-off-by: Allen Xu <allxu@nvidia.com> Made-with: Cursor
…ry' test (issue #14116) (#14400) ## Summary - **Root cause fixed upstream**: The `-0.0` normalization bug was in RMM's `device_uvector::set_element_async`, which used `cudaMemsetAsync` for zero values — clearing the sign bit of IEEE 754 `-0.0`. This has been fixed in [rapidsai/rmm#2302](rapidsai/rmm@06c3562). - **This PR**: Simply removes the `.exclude()` for `"normalize special floating numbers in subquery"` in `RapidsSQLQuerySuite`, re-enabling the test now that the upstream fix has landed in the spark-rapids-jni nightly SNAPSHOT. - **No spark-rapids code changes needed**: The original workaround in `GpuScalar` (using `ColumnVector` path to bypass `Scalar.fromDouble`) has been removed — the direct `Scalar.fromDouble`/`Scalar.fromFloat` calls now preserve `-0.0` correctly. ## Upstream fix chain ``` rapidsai/rmm#2302 (06c3562) — remove zero-value cudaMemsetAsync special-casing → spark-rapids-jni cudf-pins updated (RMM pin now includes the fix) → spark-rapids-jni nightly SNAPSHOT rebuilt with fixed librmm.so → this test now passes on GPU without any spark-rapids workaround ``` ### RAPIDS test to Spark original mapping | RAPIDS test | Spark original | Spark file | Lines | |---|---|---|---| | `normalize special floating numbers in subquery` (inherited) | `normalize special floating numbers in subquery` | `sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala` | [3620-3636](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L3620-L3636) ([permalink](https://github.com/apache/spark/blob/v3.3.0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L3620-L3636)) | ## Test plan - [x] `mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=RapidsSQLQuerySuite` — 234 tests, 0 failures, 0 errors - [x] The previously excluded test "normalize special floating numbers in subquery" now passes - [x] No new test failures introduced - [x] Verified on latest `origin/main` (commit b74f7f7) with upstream RMM fix in spark-rapids-jni SNAPSHOT Closes #14116 ### Checklists - [ ] This PR has added documentation for new or modified features or behaviors. - [x] This PR has added new tests or modified existing tests to cover new code paths. (Re-enabled the inherited Spark test "normalize special floating numbers in subquery" by removing its `.exclude()` entry. The test validates GPU-CPU parity for `-0.0` in scalar subqueries.) - [ ] Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description. Made with [Cursor](https://cursor.com) --------- Signed-off-by: Allen Xu <allxu@nvidia.com>
Description
device_uvector::set_element_asynchad a zero-value optimization that usedcudaMemsetAsyncwhenvalue == value_type{0}. For IEEE 754 floating-point types,-0.0 == 0.0istrueper the standard, so-0.0was incorrectly routed throughcudaMemsetAsync(..., 0, ...)which clears all bits — including the sign bit — normalizing-0.0to+0.0.This corrupts the in-memory representation of
-0.0for any downstream library that creates scalars through RMM (cudf::fixed_width_scalar::set_value→rmm::device_scalar::set_value_async→device_uvector::set_element_async), causing observable behavioral divergence in spark-rapids (e.g.,cast(-0.0 as string)returns"0.0"on GPU instead of"-0.0").Fix
Per the discussion in #2298, remove all
constexprspecial casing inset_element_async— both theboolcudaMemsetAsyncpath and theis_fundamental_vzero-detection path — and always usecudaMemcpyAsync. This preserves exact bit-level representations for all types, which is the correct contract for a memory management library that sits below cuDF, cuML, and cuGraph.set_element_to_zero_asyncis unchanged — its explicit "set to zero" semantics makecudaMemsetAsyncthe correct implementation.Testing
Added
NegativeZeroTest.PreservesFloatNegativeZeroandNegativeZeroTest.PreservesDoubleNegativeZeroregression tests that verify the sign bit of-0.0f/-0.0survives a round-trip throughset_element_async→element. All 122 tests pass locally (CUDA 13.0, RTX 5880).Closes #2298
Checklist
Made with Cursor