feat: Support ANSI mode sum expr (int inputs) #2600
Conversation
Draft PR to support sum function - WIP
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff              @@
##               main     #2600     +/-  ##
============================================
+ Coverage     56.12%    59.41%    +3.28%
- Complexity      976      1377      +401
============================================
  Files           119       167       +48
  Lines         11743     15330     +3587
  Branches       2251      2546      +295
============================================
+ Hits           6591      9108     +2517
- Misses         4012      4938      +926
- Partials       1140      1284      +144
```

☔ View full report in Codecov by Sentry.
Force-pushed from 970d693 to c0715aa.
@andygrove, @comphead I believe this should be ready for review (pending CI).
That's actually interesting: why would window fall back? We're planning to fall back on windows in #2726, but the test would still preserve windows, so I think we need to check the fallback reason.
Force-pushed from 6dcf47b to 13b7d68.
This is no longer a failing test after I rebased with the main branch.
@andygrove @comphead All the tests passed and the PR is ready for review.
andygrove left a comment:
@coderfender This LGTM. I am going to pull the PR locally and do some testing/benchmarking today. Could you rebase/upmerge?
Here are microbenchmark results for Legacy mode and ANSI mode.
@andygrove, please take a look whenever you get a chance. I made changes per the review and fixed merge conflicts.
@andygrove I rebased with main and fixed merge conflicts. Please review whenever you get a chance.
Resolved scalafix error.
@andygrove, @martin-g could you please take a (hopefully final) look at the PR and let me know if you think we need further changes?
The changes in the PR are pretty substantial (~800 new lines of code), so there is a lot to review. I don't see anything that concerns me at this point, but I would like to run some benchmarks to check that there are no regressions in the non-ANSI case. Unfortunately, we don't have any existing microbenchmarks that work with ANSI enabled, so we need to create something.
Thank you @andygrove, let me go ahead and work on generating benchmarks for ANSI-supported aggregate expressions.
Created #2883 to support benchmarking with ANSI mode.
Ran benchmarking with ANSI mode enabled for the SUM aggregate function only, and here are the results.
Rebased with main.
```rust
// Check for overflow for early termination
if self.eval_mode == EvalMode::Try {
    let that_has_all_nulls = states[1].as_boolean().value(0);
```
DF docs for `merge_batch` state:

```rust
/// Updates the accumulator's state from an `Array` containing one
/// or more intermediate values.
```
Should there be a check that the input has more than one element before accessing `states[1]`?
I think the check here might be overkill. We guarantee that the accumulator's state holds the sum and has_all_nulls for Try eval mode, as per Spark's semantics:
```rust
self.resize_helper(total_num_groups);
```
```rust
let that_sums_is_all_nulls = if self.eval_mode == EvalMode::Try {
    Some(values[1].as_boolean())
```
Same question as earlier. Is values guaranteed to have length > 1 at this point?
I think the check here might be overkill. We guarantee that the group accumulator's state holds the sum and has_all_nulls for Try eval mode, as per Spark's semantics.
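For context, here is a minimal sketch of the two-field state this discussion refers to. This is a hypothetical illustration, not the PR's actual code: the point is that a Try-mode accumulator always serializes its state as a [sum, has_all_nulls] pair, so `states[1]` can be assumed to exist whenever `eval_mode == Try`.

```rust
// Hypothetical sketch of a Try-mode integer SUM state (not the PR's code).
struct TrySumState {
    sum: Option<i64>,    // None means NULL: either all-null input or overflow
    has_all_nulls: bool, // distinguishes all-null input from an overflow
}

impl TrySumState {
    // Merge another partial state into this one, Try-mode style:
    // overflow during the merge turns the sum into NULL permanently.
    fn merge(&mut self, other: &TrySumState) {
        self.sum = match (self.sum, other.sum) {
            (Some(a), Some(b)) => a.checked_add(b), // None on overflow
            (s, None) if other.has_all_nulls => s,  // other side saw only nulls
            (None, s) if self.has_all_nulls => s,   // this side saw only nulls
            _ => None,                              // NULL caused by a prior overflow
        };
        self.has_all_nulls &= other.has_all_nulls;
    }
}
```

With this layout, a NULL sum together with `has_all_nulls == false` can only mean overflow, which is exactly the distinction Try mode needs.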
Which issue does this PR close?

Closes #531 (partially). The ANSI changes to support AVG will be tracked in a new PR.

Rationale for this change
DataFusion's default SUM doesn't match Spark's overflow semantics. This implementation ensures we get exactly the same behavior as Spark for integer overflow in all three eval modes (ANSI, Legacy, and Try).
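As an illustration of the three behaviors, a minimal Rust sketch (names such as `EvalMode` and `add_checked` are hypothetical here, not the PR's actual code) of what integer SUM does when an addition overflows:

```rust
// Illustrative sketch only: integer SUM overflow under Spark's three eval modes.
enum EvalMode {
    Legacy,
    Ansi,
    Try,
}

fn add_checked(acc: i64, v: i64, mode: EvalMode) -> Result<Option<i64>, String> {
    match acc.checked_add(v) {
        Some(sum) => Ok(Some(sum)),
        None => match mode {
            // Legacy mode: wrap around silently, like Java long arithmetic.
            EvalMode::Legacy => Ok(Some(acc.wrapping_add(v))),
            // ANSI mode: raise an arithmetic overflow error.
            EvalMode::Ansi => Err("ARITHMETIC_OVERFLOW: long overflow".to_string()),
            // Try mode: the whole sum becomes NULL.
            EvalMode::Try => Ok(None),
        },
    }
}
```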
What changes are included in this PR?

This PR adds a native Rust implementation for SUM aggregation on integer types (byte, short, int, long).

Native changes (Rust):
- Added a SumInteger aggregate function that handles SUM for all integer types, in coherence with Spark (inspired by SumDecimal and Spark's SUM impl).
- Implemented the code in a similar fashion to Spark, leveraging Option<i64> to represent NULL and numeric sum values, plus an additional parameter called has_all_nulls, which is leveraged in Try mode to distinguish whether a NULL sum is caused by all-NULL inputs or by overflow. (Spark does this with shouldTrackIsEmpty and by assigning NULL to the running sum, which is a long datatype.)

Scala-side changes:
- Updated CometSum to add ANSI support (ANSI and Try).
- Use eval_mode instead of fail_on_error to support the Legacy, ANSI, and Try eval modes.
- NULL handling in both simple and GROUP BY aggregations (using java.lang.Long to avoid Scala's auto-boxing feature, which auto-casts objects to primitive types, thereby casting nulls to 0s).

How are these changes tested?
CometSumto add ANSI support (ANSI and Try)eval_modeinstead offail_on_errorto supportLegacy, ANSI and Tryeval modesjava.lang.Longto avoid Scala's auto-boxing feature which auto casts objects to primitive types there by casting nulls to 0s) handling in both simple and GROUP BY agg .How are these changes tested?