
Conversation

@alamb (Contributor) commented Jul 29, 2021

Which issue does this PR close?

This PR closes #781 and #782.

Built on #786, so review that first.

Rationale for this change

  1. Grouping on columns that contain NULLs today produces incorrect results
  2. This is what I think is the minimum change required to produce correct results

This is a version of the "Alternative" approach described in #790, which I think is the minimum change to GroupByHash needed to produce correct answers when grouping on columns that contain nulls. Thanks to @jhorstmann and @Dandandan for the ideas leading to this PR.

It will likely reduce the speed of grouping as well as require more memory than the current implementation (though it does get correct answers!)

I created this PR to make it available for comparison and as a fallback in case I run into trouble or run out of time trying to implement #790, which I expect will take longer to code and review.

What changes are included in this PR?

  1. Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy #786
  2. Include a "null byte" for each column when creating the key to the hash table
  3. Fix some bugs related to NULL handling in ScalarValue
  4. Tests

On master, keys are created like this:

                            string len                   0x1234
{                          (as usize le)      "foo"    (as u16 le)
  k1: "foo"         ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
  k2: 0x1234u16     │03│00│00│00│00│00│00│00│"f│"o│"o│34│12│
}                   └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
                    0  1  2  3  4  5  6  7  8  9  10 11 12

After this PR, the keys are created as follows (note the two extra bytes, one for each grouping column):

Example of a key without any nulls:

                       0xFF byte at the start of each column
                          signifies the value is non-null
                                         │

                     ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┐

                     │        string len                 │  0x1234
{                    ▼       (as usize le)      "foo"    ▼(as u16 le)
  k1: "foo"        ╔ ═┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──╦ ═┌──┬──┐
  k2: 0x1234u16     FF║03│00│00│00│00│00│00│00│"f│"o│"o│FF║34│12│
}                  ╚ ═└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──╩ ═└──┴──┘
                    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14

Example of a key with NULL values:

                        0xFE byte at the start of k1 column
                    ┌ ─     signifies the value is NULL

                    └ ┐
                             0x1234
{                     ▼    (as u16 le)
  k1: NULL          ╔ ═╔ ═┌──┬──┐
  k2: 0x1234u16      FE║FF║34│12│
}                   ╚ ═╚ ═└──┴──┘
                      0  1  2  3
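
To make the scheme concrete, here is a minimal self-contained Rust sketch of the encoding (the GroupValue enum, encode_col, and the marker constants are illustrative names for this comment, not the actual DataFusion code):

```rust
/// Marker bytes written before each column's value in the composite key.
const IS_VALID: u8 = 0xFF;
const IS_NULL: u8 = 0xFE;

/// Simplified stand-in for a single grouping value in one row.
enum GroupValue {
    Utf8(Option<String>),
    UInt16(Option<u16>),
}

/// Append one column's bytes to the row key, per the diagrams above.
fn encode_col(key: &mut Vec<u8>, value: &GroupValue) {
    match value {
        GroupValue::Utf8(Some(s)) => {
            key.push(IS_VALID);
            // Variable-length values carry a length prefix so that, e.g.,
            // ("fo", "o") and ("f", "oo") cannot produce the same bytes.
            key.extend_from_slice(&s.len().to_le_bytes());
            key.extend_from_slice(s.as_bytes());
        }
        GroupValue::UInt16(Some(v)) => {
            key.push(IS_VALID);
            key.extend_from_slice(&v.to_le_bytes());
        }
        // NULL contributes only the 0xFE marker; no value bytes follow.
        GroupValue::Utf8(None) | GroupValue::UInt16(None) => key.push(IS_NULL),
    }
}

fn main() {
    let mut key = Vec::new();
    encode_col(&mut key, &GroupValue::Utf8(Some("foo".to_string())));
    encode_col(&mut key, &GroupValue::UInt16(Some(0x1234)));
    // Prints [FF, 03, 00, ..., 00, 66, 6F, 6F, FF, 34, 12],
    // matching the first diagram above.
    println!("{:02X?}", key);
}
```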

Are there any user-facing changes?

Correct answers!

Benchmark results

The benchmarks show a slight slowdown, which is not unexpected given that there is now more work being done:

group                                                gby_null_alternative                   master
-----                                                --------------------                   ------
aggregate_query_group_by                             1.16      3.6±0.14ms        ? ?/sec    1.00      3.1±0.10ms        ? ?/sec
aggregate_query_group_by_u64 15 12                   1.06      3.7±0.09ms        ? ?/sec    1.00      3.5±0.04ms        ? ?/sec
aggregate_query_group_by_with_filter                 1.06      2.5±0.05ms        ? ?/sec    1.00      2.3±0.04ms        ? ?/sec
aggregate_query_group_by_with_filter_u64 15 12       1.02      2.4±0.10ms        ? ?/sec    1.00      2.4±0.04ms        ? ?/sec
aggregate_query_no_group_by 15 12                    1.00  1152.2±30.38µs        ? ?/sec    1.00  1155.2±28.60µs        ? ?/sec
aggregate_query_no_group_by_count_distinct_narrow    1.14      6.1±0.14ms        ? ?/sec    1.00      5.4±0.05ms        ? ?/sec
aggregate_query_no_group_by_count_distinct_wide      1.18      8.8±0.30ms        ? ?/sec    1.00      7.4±0.10ms        ? ?/sec
aggregate_query_no_group_by_min_max_f64              1.06  1225.6±27.57µs        ? ?/sec    1.00  1160.5±29.22µs        ? ?/sec

@alamb alamb changed the title Produce correct ansers for Group BY NULL (Option 1) Produce correct answers for Group BY NULL (Option 1) Jul 29, 2021
"+-----------------+----+",
"| COUNT(UInt8(1)) | c1 |",
"+-----------------+----+",
"| 1 | |",
Contributor review comment:

👍

@alamb (Contributor Author) commented Jul 29, 2021

Clippy error is unrelated -- see fix in #794

vec.clear();
for col in group_by_keys {
    create_key_for_col(col, row, vec)?
    if !col.is_valid(row) {
Contributor review comment:

Not sure if it makes sense to improve performance here, but an optimization might be to check null_count == 0 outside of this function to avoid the is_valid call and just always add a 0xFF
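
As a sketch of that idea (hypothetical helper name, assuming only the arrow Array trait's null_count and is_valid methods; the real code shapes the loop differently):

```rust
use arrow::array::Array;

/// Sketch of the suggested optimization: decide once per column whether it
/// contains any nulls, rather than consulting the validity bitmap per row.
fn append_validity_bytes(col: &dyn Array, keys: &mut [Vec<u8>]) {
    if col.null_count() == 0 {
        // Fast path: no nulls anywhere in this column, so every row gets
        // the 0xFF marker without an is_valid() call.
        for key in keys.iter_mut() {
            key.push(0xFF);
        }
    } else {
        // Slow path: check the validity bitmap for each row.
        for (row, key) in keys.iter_mut().enumerate() {
            key.push(if col.is_valid(row) { 0xFF } else { 0xFE });
        }
    }
}
```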

Author reply:

Thank you for the suggestion.

If you don't mind, I would like to spend time on #790 which, if successful, I expect will remove most of this code.

I will attempt to add that optimization at a later date.

fn scalar_try_from_dict_datatype() {
    let data_type =
        DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8));
    let data_type = &data_type;
Contributor review comment:

🥳

Author reply:

Amusingly, supporting this behavior ended up causing a test to fail when I brought the code into IOx, and I think I traced the problem to an issue in parquet file statistics: apache/arrow-rs#641 🤣. This was not a side effect I had anticipated.

"+-----------------+----+-----+",
"| 1 | | |",
"| 2 | | bar |",
"| 3 | 0 | |",
Contributor review comment:

👍

// any newly added enum variant will require editing this list
// or else face a compile error
match (self, other) {
    (Boolean(v1), Boolean(v2)) => v1.eq(v2),
Contributor review comment:

You could also use == instead?

@alamb alamb force-pushed the alamb/gby_null_alternative branch from ddf2298 to b0d834a on July 29, 2021 19:23
@jhorstmann (Contributor) commented:

Looks good. I was trying to come up with an example where two distinct keys would end up as the same encoding but could not find any because any variable length types include the length prefix. 🚀

@alamb (Contributor Author) commented Jul 30, 2021

Thanks for the reviews @Dandandan and @jhorstmann! I plan to wait another day or two to see if anyone else has feedback on this approach. What I am thinking of doing is merging this PR (after addressing comments) so that we at least get correct answers, and then working on the more sophisticated implementation in parallel.

> Looks good. I was trying to come up with an example where two distinct keys would end up as the same encoding but could not find any because any variable length types include the length prefix. 🚀

Thanks for double checking -- this also worried me a lot, so I am glad to hear someone else did a double check too.

I convinced myself that since each key has entries for the same columns in the same order, there is no way to concoct the same bytes from different column values.
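
A tiny standalone check of that reasoning (illustrative code, not from the PR): with the length prefixes, moving bytes between adjacent variable-length columns changes the length bytes, so the keys cannot collide:

```rust
// Encode a row of non-null string columns using the scheme from this PR:
// 0xFF validity marker, little-endian length prefix, then the string bytes.
fn encode(parts: &[&str]) -> Vec<u8> {
    let mut key = Vec::new();
    for s in parts {
        key.push(0xFF);
        key.extend_from_slice(&s.len().to_le_bytes());
        key.extend_from_slice(s.as_bytes());
    }
    key
}

fn main() {
    // Without the length prefixes, both rows would flatten to the same
    // byte string; the prefixes keep the two encodings distinct.
    assert_ne!(encode(&["fo", "o"]), encode(&["f", "oo"]));
    println!("distinct keys stay distinct");
}
```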

@jorgecarleitao (Member) left a comment:

Went through this carefully, and it looks great! Great work, @alamb !

@alamb
Copy link
Contributor Author

alamb commented Jul 30, 2021

Rebased now that #786 is merged, so this PR just shows the delta.

@alamb alamb force-pushed the alamb/gby_null_alternative branch from b0d834a to b6c6a3c on July 30, 2021 17:42
@alamb (Contributor Author) commented Aug 2, 2021

#808 contains the PR that should give us back any performance we lost in this one

@alamb alamb merged commit 2bcf040 into apache:master Aug 2, 2021
@alamb alamb deleted the alamb/gby_null_alternative branch August 2, 2021 11:23
@houqp houqp added the bug Something isn't working label Aug 3, 2021
igorcalabria added a commit to igorcalabria/arrow-datafusion that referenced this pull request Oct 9, 2023
takes the relevant part out of apache#793 which was ignored by cube maintainers
cfms3 pushed a commit to inloco/arrow-datafusion that referenced this pull request May 31, 2024
takes the relevant part out of apache#793 which was ignored by cube maintainers
unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
* Add CreateArray support

* Update Spark SQL test diffs

* Use scalaExprToProto

* Specify data type

* Only do nullable elements again

* Remove unused import

* Add null to the test and add nullable element datafusion issue

* Rename test

* Update lock
H0TB0X420 pushed a commit to H0TB0X420/datafusion that referenced this pull request Oct 7, 2025
* make PyExpr::to_variant arms explicit

* update PyInList to wrap expr::InList

* update PyExists to wrap expr::Exists

* update PyInSubquery to wrap expr::InSubquery

* update Placeholder to wrap expr::Placeholder

* make PyLogicalPlan::to_variant match arms explicit

* add PySortExpr wrapper

* add PyUnnestExpr wrapper

* update PyAlias to wrap upstream Alias

* return not implemented error for unimplemented variants in PyExpr::to_variant

* added to_variant python test from the GH issue

* remove unused import

* return unsupported_variants for unimplemented variants in PyLogicalPlan::to_variant

Successfully merging this pull request may close these issues.

Wrong results when grouping with dictionary arrays with nulls
