-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: Support log for Decimal32 and Decimal64 #18999
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'm still working on the Decimal64, but early feedback on the PR is much appreciated |
|
Interesting, I didn't realise negative scales aren't allowed. I assumed they were as arrow allows negative scales in decimal. |
Negative scales are allowed; I believe any places in our codebase they are disallowed is mainly due to implementation limitation (i.e. not yet supported) rather than inherently not being possible. (Haven't had a chance to review this PR yet, hopefully soon) |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it, how would this implementation be different from having our coercion/casting logic convert the input decimal arrays to floats before applying the log, as opposed to doing that decimal -> float ourselves here? 🤔
My take here is that it would depend on the function. Let's say something We can doing it entirely in decimal like this one based on IBM's version https://github.com/semihc/CppDecimal/blob/main/src/decNumber.c#L1384-L1518 But I think this would be a rather expensive version. Also from #18524 if we want Another thing is Intel's decimal library does convert We can port their conversion logic here if needed, I wanted to get some feedback on this PR before that. Let me know your suggestion on this |
|
In the original PR that kicked off this effort (#17023) it converts the decimal128 to the native i128 representation before doing an integer log, as converting to f64 apparently causes some precision loss. I think we should follow suit as otherwise there is little difference than previous behaviour of casting to float64 first before doing the log 🤔 It also makes me wonder if we should handle negative scale by just casting to float to do the log so we don't lose functionality. Thanks for looking into the other solutions from IBM and Intel; I think we can avoid porting/copying their code unless there is a strong need for what they bring to the table for us. |
|
Ok, instead of converting to float I'll keep it as integers and perform and integer log. Just one thing though the
I went through the Intel and IBM solution to understand potential edge cases which could affect precision. I won't be porting anything from them unless it's truly needed, point noted |
I think we should follow what the decimal128 version is doing: it performs ilog on the scaled i128 and then converts it to f64 to return. We aren't returning log of decimal as log, we're still converting it to f64. The idea is that doing the log as ilog on scaled decimal we get more accurate results than casting decimal to float before performing log on the float. See the original decimal128 PR for reference: #17023 |
datafusion/functions/src/utils.rs
Outdated
| } else if scale as u8 > precision { | ||
| Err(ArrowError::ComputeError(format!( | ||
| "scale {scale} is greater than precision {precision}" | ||
| ))) | ||
| } else if scale == 0 { | ||
| Ok(value) | ||
| } else { | ||
| validate_decimal32_precision(value, precision, scale)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why we differ from the decimal128 version above in having a precision parameter and doing these extra checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added to validate cases where precision > 9 and also check the max value (Decimal(32, 0) max value 999999999).
This conversion in the decimal128 testcase is actually incorrect as it's greater than the max decimal128 value
https://github.com/Mark1626/datafusion/blob/05eea8b1144487a6a698d3be8815c93b689d15a3/datafusion/functions/src/utils.rs#L411
Do you think it's redundant, in which case I'll remove it? ScalarValue does a validation but it doesn't validate on the max value
https://github.com/Mark1626/datafusion/blob/05eea8b1144487a6a698d3be8815c93b689d15a3/datafusion/common/src/scalar/mod.rs#L4446
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if the checks should be done in these functions, as they are detached from the actual DataType::DecimalXX(_) so it looks weird that we check the precision even though at this level it doesn't feel like it is this functions responsibility 🤔
Same goes for checking that scale doesn't exceed precision; it seems like something that would be checked higher in the chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll remove these redundant checks
datafusion/functions/src/utils.rs
Outdated
| "{value} and {precision} {scale} vs {expected:?}" | ||
| ); | ||
| } | ||
| Err(_) => assert!(expected.is_none()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice if we can assert the expected error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll add an assertion on the expected error
ad73b04 to
2b78589
Compare
|
Thanks @Mark1626 @martin-g @kumarUjjawal |
Which issue does this PR close?
Rationale for this change
Analysis
Other engines:
"(U)Int*", "Float*", "Decimal*"as arguments for log https://github.com/ClickHouse/ClickHouse/blob/master/src/Functions/log.cpp#L47-L63Libraries
log. https://github.com/karlorz/IntelRDFPMathLib20U2/blob/main/LIBRARY/src/bid32_log.clogis fully using decimal, but I don't think this would be very performant way to do thisI'm going to go with an approach similar to the one inside Intel's decimal library. To begin with the
decimal32 -> doubleis done by a simple scalingWhat changes are included in this PR?
Are these changes tested?
Yes, unit tests have been added, and I've tested this from the datafusion cli for Decimal32
Are there any user-facing changes?