@Mark1626
Contributor

@Mark1626 Mark1626 commented Nov 30, 2025

Which issue does this PR close?

Rationale for this change

Analysis

Other engines:

  1. ClickHouse seems to accept only "(U)Int*", "Float*", and "Decimal*" as arguments for log: https://github.com/ClickHouse/ClickHouse/blob/master/src/Functions/log.cpp#L47-L63

Libraries

  1. There is a C++ library, libdecimal, which internally uses the Intel Decimal Floating Point Library for its decimal32 operations. Intel's library itself converts the decimal32 to double and calls log. https://github.com/karlorz/IntelRDFPMathLib20U2/blob/main/LIBRARY/src/bid32_log.c
  2. There is another C++ library based on IBM's decNumber decimal library: https://github.com/semihc/CppDecimal . Its implementation of log works entirely in decimal, but I don't think this would be a very performant way to do it.

I'm going to go with an approach similar to the one inside Intel's decimal library. To begin with, the decimal32 -> double conversion is done by simple scaling.
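For illustration, the scaling approach can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `decimal32_to_f64` and `log_with_base` are hypothetical names, and it assumes the Decimal32 value is available as an unscaled `i32` together with a non-negative scale.

```rust
/// Hypothetical helper (not from the PR): interpret an unscaled Decimal32
/// value with the given scale as an f64 by dividing by 10^scale.
fn decimal32_to_f64(unscaled: i32, scale: i8) -> f64 {
    (unscaled as f64) / 10f64.powi(scale as i32)
}

/// Hypothetical helper: logarithm of `x` in the given `base`, computed in f64.
fn log_with_base(base: f64, x: f64) -> f64 {
    x.log(base)
}
```

With these helpers, `log_with_base(2.0, decimal32_to_f64(1_234_567, 2))` corresponds to the `log(2.0, 12345.67)` query shown in the testing section.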

What changes are included in this PR?

  1. Support Decimal32 for log

Are these changes tested?

Yes, unit tests have been added, and I've tested this from the datafusion cli for Decimal32

> select log(2.0, arrow_cast(12345.67, 'Decimal32(9, 2)'));
+-----------------------------------------------------------------------+
| log(Float64(2),arrow_cast(Float64(12345.67),Utf8("Decimal32(9, 2)"))) |
+-----------------------------------------------------------------------+
| 13.591717513271785                                                    |
+-----------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.021 seconds.

Are there any user-facing changes?

  1. The precision of the result for Decimal32 will change. The precision loss described in "Decimal128 implementation of log loses precision" (#18524) does not occur in this PR.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Nov 30, 2025
@Mark1626
Contributor Author

I'm still working on the Decimal64, but early feedback on the PR is much appreciated

@Mark1626 Mark1626 marked this pull request as ready for review November 30, 2025 16:59
@Mark1626
Contributor Author

Mark1626 commented Dec 1, 2025

Interesting, I didn't realise negative scales aren't allowed. I assumed they were, since Arrow allows negative scales in decimals.
https://github.com/apache/arrow-rs/blob/main/arrow-schema/src/datatype.rs#L359-L372

@Mark1626 Mark1626 requested a review from martin-g December 1, 2025 02:27
@Jefffrey
Contributor

Jefffrey commented Dec 1, 2025

Interesting, I didn't realise negative scales aren't allowed. I assumed they were, since Arrow allows negative scales in decimals. https://github.com/apache/arrow-rs/blob/main/arrow-schema/src/datatype.rs#L359-L372

Negative scales are allowed; I believe any places in our codebase where they are disallowed are mainly due to implementation limitations (i.e. not yet supported) rather than inherently not being possible.

(Haven't had a chance to review this PR yet, hopefully soon)

Contributor

@Jefffrey Jefffrey left a comment

Now that I think about it, how would this implementation be different from having our coercion/casting logic convert the input decimal arrays to floats before applying the log, as opposed to doing that decimal -> float ourselves here? 🤔

@Mark1626
Contributor Author

Mark1626 commented Dec 3, 2025

Now that I think about it, how would this implementation be different from having our coercion/casting logic convert the input decimal arrays to floats before applying the log

My take is that it depends on the function. For functions such as round, ceil, and floor, a coercion is unacceptable, whereas for functions like log and sin the results are not exact anyway, so a coercion is acceptable.

We could do it entirely in decimal, like this implementation based on IBM's version: https://github.com/semihc/CppDecimal/blob/main/src/decNumber.c#L1384-L1518

But I think that would be a rather expensive approach. Also, from #18524, if we want
select log(2.0, 100000000000000000000000000000000000::decimal(38,0)); to return 116.267483321058, the result of log needs to be a float (which it currently is); otherwise I think the result would have to be 116, since the type is (38, 0).


Another thing: Intel's decimal library does convert decimal -> binary64 and then back to decimal. The conversion itself is a bit more sophisticated than a simple (N / 10^scale): https://github.com/karlorz/IntelRDFPMathLib20U2/blob/main/LIBRARY/src/bid32_log.c

We can port their conversion logic here if needed; I wanted to get some feedback on this PR before doing that. Let me know what you suggest.

@Jefffrey
Contributor

Jefffrey commented Dec 5, 2025

In the original PR that kicked off this effort (#17023), the decimal128 is converted to its native i128 representation before doing an integer log, as converting to f64 apparently causes some precision loss. I think we should follow suit, as otherwise there is little difference from the previous behaviour of casting to float64 first before doing the log 🤔
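To make the precision concern concrete, here is a small sketch, using hypothetical helper names that are not part of DataFusion. An f64 carries only 53 bits of mantissa, so a large i128 generally does not survive a round trip through f64, while an integer log works on the exact value.

```rust
/// Returns true when a round trip through f64 changes the integer.
fn loses_precision_as_f64(v: i128) -> bool {
    (v as f64) as i128 != v
}

/// Integer base-10 logarithm (rounded down), computed without leaving i128.
/// Returns None for zero and negative inputs.
fn int_log10(v: i128) -> Option<u32> {
    v.checked_ilog10()
}

/// Base-10 logarithm computed after casting the integer to f64 first.
fn float_log10(v: i128) -> f64 {
    (v as f64).log10()
}
```

For example, with `v = 10i128.pow(38) - 1` (38 nines), `loses_precision_as_f64(v)` is true and `int_log10(v)` is `Some(37)`, whereas `float_log10(v)` operates on a rounded value that sits very close to 10^38.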

It also makes me wonder if we should handle negative scale by just casting to float to do the log so we don't lose functionality.

Thanks for looking into the other solutions from IBM and Intel; I think we can avoid porting/copying their code unless there is a strong need for what they bring to the table for us.

@Mark1626
Copy link
Contributor Author

Mark1626 commented Dec 6, 2025

OK, instead of converting to float I'll keep it as integers and perform an integer log.

Just one thing though: the log functions in DuckDB and ClickHouse return float64/double, so the behaviour might differ (but I personally think that's fine, as the user will not get an implicit type conversion from decimal to float). And if we are returning a decimal, then the following is expected across Decimal32, Decimal64, Decimal128, and Decimal256, right?

| Query | Result |
|---|---|
| select log(12345::decimal(38,0)) | 4.0 |
| select log(12345::decimal(38,2)) | 4.09 |

I went through the Intel and IBM solutions to understand potential edge cases which could affect precision. I won't be porting anything from them unless it's truly needed, point noted.

@Jefffrey
Contributor

Jefffrey commented Dec 6, 2025

OK, instead of converting to float I'll keep it as integers and perform an integer log.

Just one thing though: the log functions in DuckDB and ClickHouse return float64/double, so the behaviour might differ (but I personally think that's fine, as the user will not get an implicit type conversion from decimal to float). And if we are returning a decimal, then the following is expected across Decimal32, Decimal64, Decimal128, and Decimal256, right?

| Query | Result |
|---|---|
| select log(12345::decimal(38,0)) | 4.0 |
| select log(12345::decimal(38,2)) | 4.09 |

I went through the Intel and IBM solutions to understand potential edge cases which could affect precision. I won't be porting anything from them unless it's truly needed, point noted.

I think we should follow what the decimal128 version is doing: it performs ilog on the scaled i128 and then converts the result to f64 to return. We aren't returning the log of a decimal as a decimal; we're still converting it to f64.

The idea is that by doing the log as an ilog on the scaled decimal, we get more accurate results than by casting the decimal to float before performing the log on the float.
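A rough sketch of that flow, under stated assumptions: `decimal_ilog` is a hypothetical name, the scale is assumed to be non-negative, and the real decimal128 code in #17023 may differ in details such as error handling. The decimal is reduced to its integer part, an integer log (rounded down) is taken, and only the result is converted to f64.

```rust
/// Hypothetical sketch (not the merged code): integer log of a decimal given
/// as an unscaled i128 plus a non-negative scale, returned as f64.
/// Returns None when 10^scale overflows, the base is below 2, or the
/// integer part of the value is not positive.
fn decimal_ilog(unscaled: i128, scale: u32, base: u32) -> Option<f64> {
    // 10^scale, or None if it does not fit in an i128.
    let divisor = 10i128.checked_pow(scale)?;
    // Drop the fractional digits: integer part of unscaled / 10^scale.
    let int_part = unscaled / divisor;
    // checked_ilog rejects non-positive inputs and bases below 2.
    int_part.checked_ilog(base as i128).map(|l| l as f64)
}
```

Because the integer log rounds down, `decimal_ilog(10i128.pow(35), 0, 2)` gives `Some(116.0)` rather than the fractional value discussed earlier in the thread.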

See the original decimal128 PR for reference: #17023

Comment on lines 229 to 236
    } else if scale as u8 > precision {
        Err(ArrowError::ComputeError(format!(
            "scale {scale} is greater than precision {precision}"
        )))
    } else if scale == 0 {
        Ok(value)
    } else {
        validate_decimal32_precision(value, precision, scale)?;
Contributor

I'm curious why we differ from the decimal128 version above in having a precision parameter and doing these extra checks?

Contributor Author

This was added to validate cases where precision > 9 and also to check the max value (a Decimal32 with scale 0 has a max value of 999999999).

This conversion in the decimal128 test case is actually incorrect, as the value is greater than the max decimal128 value:
https://github.com/Mark1626/datafusion/blob/05eea8b1144487a6a698d3be8815c93b689d15a3/datafusion/functions/src/utils.rs#L411

Do you think it's redundant? If so, I'll remove it. ScalarValue does a validation, but it doesn't validate against the max value:
https://github.com/Mark1626/datafusion/blob/05eea8b1144487a6a698d3be8815c93b689d15a3/datafusion/common/src/scalar/mod.rs#L4446

Contributor

I'm not sure the checks should be done in these functions, as they are detached from the actual DataType::DecimalXX(_); it looks odd to check the precision here, since at this level it doesn't feel like this function's responsibility 🤔

Same goes for checking that scale doesn't exceed precision; it seems like something that would be checked higher in the chain.

Contributor Author

Sure, I'll remove these redundant checks

"{value} and {precision} {scale} vs {expected:?}"
);
}
Err(_) => assert!(expected.is_none()),
Contributor

Would be nice if we can assert the expected error

Contributor Author

ok, I'll add an assertion on the expected error
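One way such an assertion could look is sketched below. Everything here is hypothetical and only for illustration: `decimal32_to_i32` is a stand-in with a `String` error rather than the PR's actual function, and the error messages are invented. The point is that the test matches on both the result and the expectation, so an unexpected error message fails the test instead of being accepted by `Err(_)`.

```rust
/// Hypothetical stand-in for the conversion under test (String error for brevity).
fn decimal32_to_i32(value: i32, scale: i8) -> Result<i32, String> {
    if scale < 0 {
        Err(format!("Negative scale {scale} is not supported"))
    } else if scale == 0 {
        Ok(value)
    } else {
        // 10^scale can overflow i32, so use checked_pow.
        10i32
            .checked_pow(scale as u32)
            .map(|divisor| value / divisor)
            .ok_or_else(|| format!("Cannot compute 10 to the power of {scale}"))
    }
}

/// Checks either the expected value or a fragment of the expected error message.
fn check(value: i32, scale: i8, expected: Result<i32, &str>) {
    match (decimal32_to_i32(value, scale), expected) {
        (Ok(actual), Ok(want)) => assert_eq!(actual, want, "{value} with scale {scale}"),
        (Err(msg), Err(fragment)) => {
            assert!(msg.contains(fragment), "unexpected error: {msg}")
        }
        (actual, want) => panic!("got {actual:?}, expected {want:?}"),
    }
}
```

A case table can then list `Err("Negative scale")` for a negative scale next to `Ok(123)` for `(12345, 2)`.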

@Mark1626 Mark1626 requested a review from Jefffrey December 16, 2025 16:00
@alamb alamb requested a review from kumarUjjawal December 18, 2025 20:56
@Jefffrey Jefffrey added this pull request to the merge queue Dec 19, 2025
Merged via the queue into apache:main with commit c2747eb Dec 19, 2025
27 checks passed
@Jefffrey
Contributor

Thanks @Mark1626 @martin-g @kumarUjjawal
