
Conversation

@martin-augment
Owner

18937: To review by AI

@coderabbitai

coderabbitai bot commented Nov 26, 2025

Walkthrough

This change extends variance computation in DataFusion to support decimal data types. The variance module now includes decimal-aware accumulators and state management using i256 for precise arithmetic. The VarianceSample and VariancePopulation structs are updated to route decimal inputs through specialized implementations while maintaining backward compatibility with numeric types. Additionally, test coverage is expanded with new aggregate tests validating variance and count distinct operations on decimal values and grouped aggregations.


@claude

claude bot commented Nov 26, 2025

Code Review: Decimal Support for Variance Functions

Thank you for implementing decimal support for variance functions! This is a substantial and well-structured addition. Here's my comprehensive review:

✅ Strengths

  1. Comprehensive Implementation: The PR properly implements decimal variance for all four decimal types (Decimal32, Decimal64, Decimal128, Decimal256) for both sample and population variance.

  2. Proper Overflow Handling: Good use of checked arithmetic throughout (checked_add, checked_mul, checked_sub) to prevent silent overflow issues with decimal arithmetic; a minimal sketch of the pattern appears after this list.

  3. Test Coverage: Includes both unit tests and SQL logic tests covering various scenarios including grouped aggregations.

  4. Code Organization: Well-structured with clear separation of concerns (trait definitions, state management, accumulators).

  5. Retract Support: Properly implements retract_batch for window functions and other scenarios requiring removal of values.
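For readers unfamiliar with the pattern praised in item 2, here is a minimal standalone sketch of checked i256 accumulation. The State struct, field names, and error strings are illustrative, not the PR's actual code.

// Minimal sketch of checked i256 accumulation (illustrative names, not the PR's code).
use arrow_buffer::i256;

struct State {
    count: u64,
    sum: i256,
    sum_squares: i256,
}

impl State {
    fn new() -> Self {
        Self { count: 0, sum: i256::ZERO, sum_squares: i256::ZERO }
    }

    fn update(&mut self, value: i256) -> Result<(), String> {
        // Every step uses checked arithmetic so overflow surfaces as an error
        // instead of silently wrapping.
        let sq = value.checked_mul(value).ok_or("Decimal variance overflow")?;
        self.sum = self.sum.checked_add(value).ok_or("Decimal variance overflow")?;
        self.sum_squares = self
            .sum_squares
            .checked_add(sq)
            .ok_or("Decimal variance overflow")?;
        self.count += 1;
        Ok(())
    }
}

fn main() {
    let mut s = State::new();
    s.update(i256::from_i128(12_345)).unwrap();
    assert_eq!(s.count, 1);
}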

🔍 Issues & Suggestions

1. Critical: Potential Precision Loss in i256_to_f64_lossy (variance.rs:88-104)

The conversion from i256 to f64 can lose significant precision for large decimal values. While the function is correctly named "_lossy", this could lead to incorrect variance calculations for very large decimal values.

Recommendation: Add a comment explaining the precision implications and consider adding a test case with very large decimal values to demonstrate the behavior.
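For context, a tiny standalone demonstration of where f64 stops being exact; this is generic Rust, not the PR's code.

fn main() {
    // f64 has a 53-bit significand, so 2^53 is the largest range in which
    // every integer is exactly representable.
    let exact: i128 = 1_i128 << 53; // 9_007_199_254_740_992
    let above = exact + 1;
    assert_eq!(exact as f64 as i128, exact); // round-trips exactly
    assert_eq!(above as f64 as i128, exact); // 2^53 + 1 rounds back down to 2^53
    // Unscaled decimal values beyond this magnitude lose low-order digits when
    // converted to f64, so the resulting variance is an approximation.
}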

2. Bug: Incorrect Test in variance_decimal_input (variance.rs:1197-1205)

The test uses VarianceAccumulator (the old float-based accumulator) instead of DecimalVarianceAccumulator. This means the test is actually testing the cast-to-float path, not the new decimal-specific code path.

// Current (incorrect):
let mut pop_acc = VarianceAccumulator::try_new(StatsType::Population)?;

// Should be:
let mut pop_acc = DecimalVarianceAccumulator::<Decimal128Type>::try_new(3, StatsType::Population)?;

Impact: High - The new decimal variance implementation is not actually being tested in unit tests, only through SQL logic tests.

3. Numerical Stability Issue (variance.rs:326-330)

Clamping a negative numerator to zero is a safe guard against arithmetic anomalies, but it silently hides potential bugs in the calculation:

let numerator = if numerator < i256::ZERO {
    i256::ZERO
} else {
    numerator
};

Recommendation: Add a debug assertion or log warning when this occurs, as a negative variance (before clamping) indicates numerical issues.
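A minimal sketch of that suggestion, using i128 so it runs standalone; in the PR the same guard would sit next to the i256 clamp shown above.

fn clamped_numerator(numerator: i128) -> i128 {
    // With exact integer arithmetic the numerator should never be negative,
    // so surface a violation in debug builds instead of clamping silently.
    debug_assert!(
        numerator >= 0,
        "decimal variance numerator went negative: {numerator}"
    );
    numerator.max(0)
}

fn main() {
    assert_eq!(clamped_numerator(42), 42);
    assert_eq!(clamped_numerator(0), 0);
}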

4. Missing Edge Case Tests

The tests don't cover several important scenarios:

  • NULL values in decimal arrays
  • Empty result sets
  • Single value for sample variance (should return NULL)
  • Mixed positive/negative decimals across groups
  • Maximum scale decimals
  • Retract operations for decimal types

5. Minor: Inconsistent Error Messages

variance.rs:272 uses "Decimal variance retract underflow" while the other errors use just "Decimal variance overflow". Consider making the error messages more descriptive:

return exec_err!("Cannot retract from empty decimal variance state");

6. Performance Consideration

For each value, the code converts to i256 and performs i256 arithmetic even for Decimal32/Decimal64 where smaller types would suffice. This is a reasonable trade-off for code simplicity, but worth documenting.

7. Documentation Needed

The new decimal-specific functions (i256_to_f64_lossy, decimal_scale, etc.) lack documentation comments explaining their purpose and any gotchas.
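As an illustration, a standalone sketch of what a documented decimal_scale helper could look like. The signature is an assumption; the PR's helper presumably also handles Decimal32/Decimal64 and reports errors through DataFusion's Result type.

use arrow::datatypes::DataType;

/// Returns the scale (number of fractional digits) encoded in a decimal
/// `DataType`, or `None` for non-decimal types.
fn decimal_scale(data_type: &DataType) -> Option<i8> {
    match data_type {
        DataType::Decimal128(_, scale) | DataType::Decimal256(_, scale) => Some(*scale),
        _ => None,
    }
}

fn main() {
    assert_eq!(decimal_scale(&DataType::Decimal128(10, 3)), Some(3));
    assert_eq!(decimal_scale(&DataType::Int32), None);
}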

📋 Recommended Actions

  1. Fix the unit test to actually test DecimalVarianceAccumulator
  2. Add tests for edge cases (NULLs, empty sets, single values, retract operations)
  3. Add documentation comments for helper functions
  4. Consider adding a debug assertion for the negative variance case
  5. Add a test with very large decimal values to demonstrate precision behavior

🔒 Security Assessment

No security concerns identified. Overflow protection is properly implemented.

🎯 Test Coverage Assessment

SQL logic tests are good, but unit tests have a critical gap (testing wrong accumulator). More edge case coverage needed.

Overall Assessment

This is solid work that correctly implements the core functionality. The main concern is the unit test bug which means the decimal-specific path isn't being properly unit tested. Once that's fixed and edge cases are covered, this will be ready to merge.

Issue Reference: Addresses issue apache#18937


@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.

Comment augment review to trigger a new review at any time.

self.resize(total_num_groups);
for (row, group_index) in group_indices.iter().enumerate() {
    if let Some(filter) = opt_filter {
        if !filter.value(row) {

In DecimalVarianceGroupsAccumulator::update_batch, accessing filter.value(row) without checking validity can mis-handle NULL filters and may read an undefined bit; consider treating NULL as false by checking validity before evaluating the filter (also applies to other locations in the PR if similar logic appears).


Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! Before calling filter.value(row) it should check whether the row is valid via filter.is_valid(row) and skip the row when it is not. This prevents wrong behavior of the check when the filter entry for the row is NULL.

let mut sample_acc = VarianceAccumulator::try_new(StatsType::Sample)?;
let sample_input = [array];
sample_acc.update_batch(&sample_input)?;
assert_variance(sample_acc.evaluate()?, 11606.257924736841);

Bug: Unit test doesn't test the new decimal variance accumulator

The variance_decimal_input test creates VarianceAccumulator directly instead of DecimalVarianceAccumulator. The VarianceAccumulator casts decimal inputs to Float64, so this test doesn't actually exercise the new decimal-specific variance implementation added in this PR. The test name and setup suggest it's validating decimal variance support, but it's actually testing the Float64 path. The new DecimalVarianceAccumulator code remains untested by this unit test.


Owner Author


value:useful; category:bug; feedback:The Bugbot AI reviewer is correct that the unit test uses the old accumulator (VarianceAccumulator) instead of the one introduced in this Pull Request. Prevents regressions due to testing the wrong implementation.

            stats_type,
            _marker: PhantomData,
        }
    }

Bug: Groups accumulator missing scale validation present in regular accumulator

DecimalVarianceAccumulator::try_new validates that scale doesn't exceed DECIMAL256_MAX_SCALE and returns an error if it does. However, DecimalVarianceGroupsAccumulator::new has no such validation. This inconsistency means the same invalid scale would cause an error via the regular accumulator path but silently proceed via the groups accumulator path, potentially producing incorrect results when the scale factor calculation overflows.
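A standalone sketch of the missing check; the constant value (76, arrow's documented Decimal256 maximum) and the error text are illustrative, and the PR's constructor would return DataFusion's Result instead of a String error.

const DECIMAL256_MAX_SCALE: i8 = 76; // arrow's maximum scale for Decimal256

fn validate_scale(scale: i8) -> Result<(), String> {
    if scale > DECIMAL256_MAX_SCALE {
        return Err(format!("Decimal variance does not support scale {scale}"));
    }
    Ok(())
}

fn main() {
    assert!(validate_scale(3).is_ok());
    assert!(validate_scale(77).is_err());
}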



Owner Author


value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! DecimalVarianceGroupsAccumulator should check that "scale" does not exceed DECIMAL256_MAX_SCALE. Prevents wrong calculations caused by an overly large scale.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
datafusion/functions-aggregate/src/variance.rs (3)

361-463: Scalar decimal accumulator is well-structured; minor robustness tweaks possible

DecimalVarianceAccumulator correctly:

  • Uses the decimal-native primitive array (as_primitive::<T>()) without unnecessary casts.
  • Implements update_batch, retract_batch, and merge_batch in terms of the shared DecimalVarianceState.
  • Returns the expected Float64 scalar via state.variance.

You might optionally assert values.len() == 1 in update_batch/retract_batch for defensive programming, mirroring patterns used elsewhere, but this is not required for correctness.


743-831: VariancePopulation type checks and decimal routing look good; consider mirroring in sample

VariancePopulation::return_type now explicitly accepts both numeric and decimal inputs via is_numeric_or_decimal, and the accumulator/group-accumulator creation mirrors the sample case with StatsType::Population. That’s correct and keeps error messages clear for non‑numeric types.

For consistency, you might optionally add the same explicit type check to VarianceSample::return_type (currently it ignores arg_types) so both variants behave identically at planning time.


1172-1217: Decimal variance tests are correct but don’t exercise the new decimal accumulator

The new unit tests:

  • Verify that VariancePopulation::return_type accepts a decimal type.
  • Check numeric variance results for a Decimal128 array, but they call VarianceAccumulator::try_new (the float-based implementation) instead of the new DecimalVarianceAccumulator or the VariancePopulation/VarianceSample UDF paths.

The expectations (11025.9450285 and 11606.257924736841) match the analytical population and sample variance for the constructed data, which is good. To directly test the new decimal path, consider adding tests that:

  • Construct a VariancePopulation / VarianceSample accumulator via their accumulator methods with a decimal expr_field, or
  • Instantiate DecimalVarianceAccumulator::<Decimal128Type>::try_new(3, StatsType::Population/Sample) and run update_batch / evaluate on the same data.

This would guard against regressions in the decimal-specific state/merge logic.
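For example, a hypothetical test along those lines; it assumes it lives in variance.rs's existing #[cfg(test)] module (so DecimalVarianceAccumulator, StatsType, ScalarValue, and the Accumulator trait are already in scope) and uses the try_new(scale, stats_type) shape quoted above, which may differ from the final code.

#[test]
fn variance_decimal_accumulator_direct() -> Result<()> {
    use arrow::array::{ArrayRef, Decimal128Array};
    use arrow::datatypes::Decimal128Type;
    use std::sync::Arc;

    // Unscaled values 1000..4000 at scale 3 represent 1.0, 2.0, 3.0, 4.0.
    let array: ArrayRef = Arc::new(
        Decimal128Array::from(vec![1000_i128, 2000, 3000, 4000])
            .with_precision_and_scale(10, 3)?,
    );

    let mut pop_acc =
        DecimalVarianceAccumulator::<Decimal128Type>::try_new(3, StatsType::Population)?;
    pop_acc.update_batch(&[array])?;

    // Population variance of 1.0, 2.0, 3.0, 4.0 is 1.25; compare approximately.
    match pop_acc.evaluate()? {
        ScalarValue::Float64(Some(v)) => assert!((v - 1.25).abs() < 1e-9),
        other => panic!("expected Float64 result, got {other:?}"),
    }
    Ok(())
}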

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48b4e95 and 5b92361.

📒 Files selected for processing (2)
  • datafusion/functions-aggregate/src/variance.rs (12 hunks)
  • datafusion/sqllogictest/test_files/aggregate.slt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
datafusion/functions-aggregate/src/variance.rs (1)
datafusion/expr/src/utils.rs (1)
  • format_state_name (1257-1259)
🔇 Additional comments (5)
datafusion/sqllogictest/test_files/aggregate.slt (1)

5632-5648: Decimal variance SQL tests look consistent and well-chosen

The new aggregate_decimal_variance cases validate both population and sample variance on decimal(10,3), including grouped var_pop with very small within-group variance and correct arrow_typeof checks. The expected values match the analytical variance for the constructed data, so this should give good coverage of the new decimal variance path.

datafusion/functions-aggregate/src/variance.rs (4)

70-135: Decimal variance helpers and state layout are sound

The shared variance_signature() plus decimal_scale/decimal_variance_state_fields wiring cleanly extends variance to decimals while preserving numeric behavior. State encoding as (UInt64, FixedSizeBinary(32), FixedSizeBinary(32)) via i256_to_scalar/i256_from_bytes is consistent and matches the 256‑bit backing type, so merge paths can reliably reconstruct state. The lossy conversion to f64 is acceptable given the Float64 return type.
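A small round-trip sketch of why FixedSizeBinary(32) can carry the i256 state losslessly; the helper names mirror those described above, but the bodies are illustrative rather than the PR's code.

use arrow_buffer::i256;

fn i256_to_bytes(v: i256) -> [u8; 32] {
    // 32 little-endian bytes hold all 256 bits exactly.
    v.to_le_bytes()
}

fn i256_from_bytes(bytes: &[u8]) -> Option<i256> {
    let arr: [u8; 32] = bytes.try_into().ok()?;
    Some(i256::from_le_bytes(arr))
}

fn main() {
    let v = i256::from_i128(-123_456_789_000);
    assert_eq!(i256_from_bytes(&i256_to_bytes(v)), Some(v));
}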


137-221: Type gating and decimal accumulator factory behavior look correct

is_numeric_or_decimal and the create_decimal_variance_* helpers correctly distinguish numeric vs decimal inputs and route each decimal variant (32/64/128/256) to the appropriate accumulator. Non‑decimal types still fall back to the existing Welford-based variance accumulators, so existing behavior is preserved.


222-359: DecimalVarianceState math and overflow handling are correct

The state (count, sum, sum_squares) and variance implementation mirror the standard formulas:

  • Population: (n * Σx² − (Σx)²) / n²
  • Sample: (n * Σx² − (Σx)²) / (n * (n−1))

with all intermediate arithmetic done in i256 and guarded by checked_* plus a clear "Decimal variance overflow" error. Scaling by 10^(2*scale) is applied only at the end, so decimal scaling is handled exactly. Clamping negative numerators to zero is a safe guard against rare intermediate anomalies, and count == 0 / count <= 1 edge cases are correctly mapped to None.
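For reference, the identity behind those formulas and the final scaling step, with x_i the logical decimal values and v_i = x_i · 10^s their stored unscaled integers:

$$n\sum_i x_i^2-\Big(\sum_i x_i\Big)^2 \;=\; n\sum_i\left(x_i-\bar{x}\right)^2 \;\ge\; 0,
\qquad
\operatorname{Var}(x)\;=\;\frac{\operatorname{Var}(v)}{10^{2s}} .$$

The left identity shows the numerator is non-negative under exact integer arithmetic (so the clamp really only guards anomalies), and the right equality is why dividing by 10^(2*scale) at the end yields the variance of the logical values.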


629-709: VarianceSample integration with decimal path is consistent

Updating VarianceSample to use the shared variance_signature() plus:

  • Decimal-aware state_fields (switching to the fixed-size binary state for decimal inputs), and
  • Early dispatch to create_decimal_variance_accumulator / create_decimal_variance_groups_accumulator

cleanly extends var/var_samp to decimals while keeping VAR(DISTINCT) behavior unchanged. The Float64 return type remains appropriate.

Comment on lines +465 to +600
#[derive(Debug)]
struct DecimalVarianceGroupsAccumulator<T>
where
    T: DecimalType + ArrowNumericType + Debug,
    T::Native: DecimalNative,
{
    states: Vec<DecimalVarianceState>,
    scale: i8,
    stats_type: StatsType,
    _marker: PhantomData<T>,
}

impl<T> DecimalVarianceGroupsAccumulator<T>
where
    T: DecimalType + ArrowNumericType + Debug,
    T::Native: DecimalNative,
{
    fn new(scale: i8, stats_type: StatsType) -> Self {
        Self {
            states: Vec::new(),
            scale,
            stats_type,
            _marker: PhantomData,
        }
    }

    fn resize(&mut self, total_num_groups: usize) {
        if self.states.len() < total_num_groups {
            self.states
                .resize(total_num_groups, DecimalVarianceState::default());
        }
    }
}

impl<T> GroupsAccumulator for DecimalVarianceGroupsAccumulator<T>
where
    T: DecimalType + ArrowNumericType + Debug,
    T::Native: DecimalNative,
{
    fn update_batch(
        &mut self,
        values: &[ArrayRef],
        group_indices: &[usize],
        opt_filter: Option<&BooleanArray>,
        total_num_groups: usize,
    ) -> Result<()> {
        let array = values[0].as_primitive::<T>();
        self.resize(total_num_groups);
        for (row, group_index) in group_indices.iter().enumerate() {
            if let Some(filter) = opt_filter {
                if !filter.value(row) {
                    continue;
                }
            }
            if array.is_null(row) {
                continue;
            }
            let value = array.value(row).to_i256();
            self.states[*group_index].update(value)?;
        }
        Ok(())
    }

    fn merge_batch(
        &mut self,
        values: &[ArrayRef],
        group_indices: &[usize],
        _opt_filter: Option<&BooleanArray>,
        total_num_groups: usize,
    ) -> Result<()> {
        let counts = downcast_value!(values[0], UInt64Array);
        let sums = downcast_value!(values[1], FixedSizeBinaryArray);
        let sum_squares = downcast_value!(values[2], FixedSizeBinaryArray);
        self.resize(total_num_groups);

        for (row, group_index) in group_indices.iter().enumerate() {
            if counts.is_null(row) {
                continue;
            }
            let count = counts.value(row);
            if count == 0 {
                continue;
            }
            let sum = i256_from_bytes(sums.value(row))?;
            let sum_sq = i256_from_bytes(sum_squares.value(row))?;
            let other = DecimalVarianceState {
                count,
                sum,
                sum_squares: sum_sq,
            };
            self.states[*group_index].merge(&other)?;
        }
        Ok(())
    }

    fn evaluate(&mut self, emit_to: datafusion_expr::EmitTo) -> Result<ArrayRef> {
        let states = emit_to.take_needed(&mut self.states);
        let mut builder = Float64Builder::with_capacity(states.len());
        for state in &states {
            match state.variance(self.stats_type, self.scale)? {
                Some(value) => builder.append_value(value),
                None => builder.append_null(),
            }
        }
        Ok(Arc::new(builder.finish()))
    }

    fn state(&mut self, emit_to: datafusion_expr::EmitTo) -> Result<Vec<ArrayRef>> {
        let states = emit_to.take_needed(&mut self.states);
        let mut counts = UInt64Builder::with_capacity(states.len());
        let mut sums = FixedSizeBinaryBuilder::with_capacity(
            states.len(),
            DECIMAL_VARIANCE_BINARY_SIZE,
        );
        let mut sum_squares = FixedSizeBinaryBuilder::with_capacity(
            states.len(),
            DECIMAL_VARIANCE_BINARY_SIZE,
        );

        for state in states {
            counts.append_value(state.count);
            sums.append_value(state.sum.to_le_bytes())?;
            sum_squares.append_value(state.sum_squares.to_le_bytes())?;
        }

        Ok(vec![
            Arc::new(counts.finish()),
            Arc::new(sums.finish()),
            Arc::new(sum_squares.finish()),
        ])
    }

    fn size(&self) -> usize {
        self.states.capacity() * size_of::<DecimalVarianceState>()
    }
}

⚠️ Potential issue | 🟠 Major

FILTER with nullable boolean can be misinterpreted in decimal groups accumulator

In DecimalVarianceGroupsAccumulator::update_batch, the filter handling is:

if let Some(filter) = opt_filter {
    if !filter.value(row) {
        continue;
    }
}

This ignores the validity bitmap. If the filter expression can produce NULLs, those rows should be excluded (SQL FILTER (WHERE ...) treats NULL as false), but here a NULL row will be treated according to the underlying, undefined value bit in the data buffer, which can randomly include or exclude rows.

To align with existing aggregate semantics and the accumulate helper used by VarianceGroupsAccumulator, you should also check filter.is_valid(row) before using value(row), for example:

-            if let Some(filter) = opt_filter {
-                if !filter.value(row) {
-                    continue;
-                }
-            }
+            if let Some(filter) = opt_filter {
+                if !filter.is_valid(row) || !filter.value(row) {
+                    continue;
+                }
+            }
🤖 Prompt for AI Agents
In datafusion/functions-aggregate/src/variance.rs around lines 465 to 600, the
update_batch filter check uses filter.value(row) without checking validity, so
NULL filter entries are interpreted from undefined bits; change the check to
exclude rows when the filter is NULL or false by verifying filter.is_valid(row)
before reading value (i.e., continue if !filter.is_valid(row) ||
!filter.value(row)). Ensure the logic matches the other accumulators: treat NULL
as false and skip the row.

Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! Before calling filter.value(row) it should check whether the row is valid via filter.is_valid(row) and skip the row when it is not. This prevents wrong behavior of the check when the filter entry for the row is NULL.

@martin-augment
Owner Author

1. Critical: Potential Precision Loss in i256_to_f64_lossy (variance.rs:88-104)

The conversion from i256 to f64 can lose significant precision for large decimal values. While the function is correctly named "_lossy", this could lead to incorrect variance calculations for very large decimal values.

Recommendation: Add a comment explaining the precision implications and consider adding a test case with very large decimal values to demonstrate the behavior.

value:good-to-have; category:documentation; feedback:The Claude AI reviewer is correct that this function would benefit from having some documentation about its limits. Currently its only "documentation" is the "lossy" in its name, which does not explain in which cases precision is lost. Prevents wrong results caused by expecting more precision than the conversion can provide.

@martin-augment
Owner Author

2. Bug: Incorrect Test in variance_decimal_input (variance.rs:1197-1205)

The test uses VarianceAccumulator (the old float-based accumulator) instead of DecimalVarianceAccumulator. This means the test is actually testing the cast-to-float path, not the new decimal-specific code path.

// Current (incorrect):
let mut pop_acc = VarianceAccumulator::try_new(StatsType::Population)?;

// Should be:
let mut pop_acc = DecimalVarianceAccumulator::<Decimal128Type>::try_new(3, StatsType::Population)?;

Impact: High - The new decimal variance implementation is not actually being tested in unit tests, only through SQL logic tests.

value:useful; category:bug; feedback:The Claude AI reviewer is correct that the unit test uses the old accumulator (VarianceAccumulator) instead of the one introduced in this Pull Request. Prevents regressions due to testing the wrong implementation.

