
Conversation

@alamb (Contributor) commented Jul 29, 2021

Which issue does this PR close?

This PR closes #781 and #782.

Built on #786, so review that first.

Rationale for this change

  1. Grouping on columns that contain NULLs today produces incorrect results
  2. This is what I think is the minimum change required to produce correct results

This is a version of the "Alternative" approach described in #790, which I think is the minimum change to GroupByHash needed to produce correct answers when grouping on columns that contain nulls. Thanks to @jhorstmann and @Dandandan for the ideas leading to this PR.

It will likely reduce the speed of grouping as well as require more memory than the current implementation (though it does get correct answers!)

I created this PR to make it available for comparison and as a fallback in case I run into trouble or run out of time trying to implement #790, which I expect will take longer to code and review.

What changes are included in this PR?

  1. Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy #786
  2. Include a "null byte" for each column when creating the key to the hash table
  3. Fix some bugs related to NULL handling in ScalarValue
  4. Tests

On master, keys are created like this:

                            string len                   0x1234
{                          (as usize le)      "foo"    (as u16 le)
  k1: "foo"         ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
  k2: 0x1234u16     │03│00│00│00│00│00│00│00│"f│"o│"o│34│12│
}                   └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
                    0  1  2  3  4  5  6  7  8  9  10 11 12

After this PR, the keys are created as follows (note the two extra bytes, one for each grouping column):

Example of a key without any nulls:

                       0xFF byte at the start of each column
                          signifies the value is non-null
                                         │

                     ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┐

                     │        string len                 │  0x1234
{                    ▼       (as usize le)      "foo"    ▼(as u16 le)
  k1: "foo"        ╔ ═┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──╦ ═┌──┬──┐
  k2: 0x1234u16     FF║03│00│00│00│00│00│00│00│"f│"o│"o│FF║34│12│
}                  ╚ ═└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──╩ ═└──┴──┘
                    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14

Example of a key with NULL values:

                        0xFE byte at the start of k1 column
                    ┌ ─     signifies the value is NULL

                    └ ┐
                             0x1234
{                     ▼    (as u16 le)
  k1: NULL          ╔ ═╔ ═┌──┬──┐
  k2: 0x1234u16      FE║FF║34│12│
}                   ╚ ═╚ ═└──┴──┘
                      0  1  2  3
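
To make the scheme concrete, here is a minimal self-contained Rust sketch of the encoding (the GroupValue enum, encode_col, and the marker constants are illustrative names for this comment, not the actual DataFusion code):

```rust
/// Marker bytes written before each column's value in the composite key.
const IS_VALID: u8 = 0xFF;
const IS_NULL: u8 = 0xFE;

/// Simplified stand-in for a single grouping value in one row.
enum GroupValue {
    Utf8(Option<String>),
    UInt16(Option<u16>),
}

/// Append one column's bytes to the row key, per the diagrams above.
fn encode_col(key: &mut Vec<u8>, value: &GroupValue) {
    match value {
        GroupValue::Utf8(Some(s)) => {
            key.push(IS_VALID);
            // Variable-length values carry a length prefix so that, e.g.,
            // ("fo", "o") and ("f", "oo") cannot produce the same bytes.
            key.extend_from_slice(&s.len().to_le_bytes());
            key.extend_from_slice(s.as_bytes());
        }
        GroupValue::UInt16(Some(v)) => {
            key.push(IS_VALID);
            key.extend_from_slice(&v.to_le_bytes());
        }
        // NULL contributes only the 0xFE marker; no value bytes follow.
        GroupValue::Utf8(None) | GroupValue::UInt16(None) => key.push(IS_NULL),
    }
}

fn main() {
    let mut key = Vec::new();
    encode_col(&mut key, &GroupValue::Utf8(Some("foo".to_string())));
    encode_col(&mut key, &GroupValue::UInt16(Some(0x1234)));
    // Prints [FF, 03, 00, ..., 00, 66, 6F, 6F, FF, 34, 12],
    // matching the first diagram above.
    println!("{:02X?}", key);
}
```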

Are there any user-facing changes?

Correct answers!

Benchmark results

The benchmarks show a slight slowdown, which is not unexpected given that there is now more work being done:

group                                                gby_null_alternative                   master
-----                                                --------------------                   ------
aggregate_query_group_by                             1.16      3.6±0.14ms        ? ?/sec    1.00      3.1±0.10ms        ? ?/sec
aggregate_query_group_by_u64 15 12                   1.06      3.7±0.09ms        ? ?/sec    1.00      3.5±0.04ms        ? ?/sec
aggregate_query_group_by_with_filter                 1.06      2.5±0.05ms        ? ?/sec    1.00      2.3±0.04ms        ? ?/sec
aggregate_query_group_by_with_filter_u64 15 12       1.02      2.4±0.10ms        ? ?/sec    1.00      2.4±0.04ms        ? ?/sec
aggregate_query_no_group_by 15 12                    1.00  1152.2±30.38µs        ? ?/sec    1.00  1155.2±28.60µs        ? ?/sec
aggregate_query_no_group_by_count_distinct_narrow    1.14      6.1±0.14ms        ? ?/sec    1.00      5.4±0.05ms        ? ?/sec
aggregate_query_no_group_by_count_distinct_wide      1.18      8.8±0.30ms        ? ?/sec    1.00      7.4±0.10ms        ? ?/sec
aggregate_query_no_group_by_min_max_f64              1.06  1225.6±27.57µs        ? ?/sec    1.00  1160.5±29.22µs        ? ?/sec

@alamb alamb changed the title Produce correct ansers for Group BY NULL (Option 1) Produce correct answers for Group BY NULL (Option 1) Jul 29, 2021
"+-----------------+----+",
"| COUNT(UInt8(1)) | c1 |",
"+-----------------+----+",
"| 1 | |",
Contributor review comment:

👍

@alamb (Contributor Author) commented Jul 29, 2021

Clippy error is unrelated -- see fix in #794

vec.clear();
for col in group_by_keys {
    create_key_for_col(col, row, vec)?
    if !col.is_valid(row) {
Contributor review comment:

Not sure if it makes sense to improve performance here, but an optimization might be to check null_count == 0 outside of this function to avoid the is_valid call and just always add a 0xFF
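
As a sketch of that idea (hypothetical helper name, assuming only the arrow Array trait's null_count and is_valid methods; the real code shapes the loop differently):

```rust
use arrow::array::Array;

/// Sketch of the suggested optimization: decide once per column whether it
/// contains any nulls, rather than consulting the validity bitmap per row.
fn append_validity_bytes(col: &dyn Array, keys: &mut [Vec<u8>]) {
    if col.null_count() == 0 {
        // Fast path: no nulls anywhere in this column, so every row gets
        // the 0xFF marker without an is_valid() call.
        for key in keys.iter_mut() {
            key.push(0xFF);
        }
    } else {
        // Slow path: check the validity bitmap for each row.
        for (row, key) in keys.iter_mut().enumerate() {
            key.push(if col.is_valid(row) { 0xFF } else { 0xFE });
        }
    }
}
```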

Author reply:

Thank you for the suggestion.

If you don't mind, I would like to spend time on #790 which, if successful, I expect will remove most of this code.

I will attempt to add that optimization at a later date.

fn scalar_try_from_dict_datatype() {
    let data_type =
        DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8));
    let data_type = &data_type;
Contributor review comment:

🥳

Author reply:

Amusingly, supporting this behavior ended up causing a test to fail when I brought the code into IOx, and I think I traced the problem to an issue in parquet file statistics: apache/arrow-rs#641 🤣. This was not a side effect I had anticipated.

"+-----------------+----+-----+",
"| 1 | | |",
"| 2 | | bar |",
"| 3 | 0 | |",
Contributor review comment:

👍

// any newly added enum variant will require editing this list
// or else face a compile error
match (self, other) {
    (Boolean(v1), Boolean(v2)) => v1.eq(v2),
Contributor review comment:

You could also use == instead?

@alamb alamb force-pushed the alamb/gby_null_alternative branch from ddf2298 to b0d834a on July 29, 2021 19:23
@jhorstmann (Contributor) commented:

Looks good. I was trying to come up with an example where two distinct keys would end up as the same encoding but could not find any because any variable length types include the length prefix. 🚀

@alamb (Contributor Author) commented Jul 30, 2021

Thanks for the reviews @Dandandan and @jhorstmann! I plan to wait another day or two to see if anyone else has feedback on this approach. What I am thinking of doing is merging this PR (after addressing comments) so that we at least get correct answers, and then working on the more sophisticated implementation in parallel.

> Looks good. I was trying to come up with an example where two distinct keys would end up as the same encoding but could not find any because any variable length types include the length prefix. 🚀

Thanks for double checking -- this also worried me a lot, so I am glad to hear someone else did a double check too.

I convinced myself that since each key has entries for the same columns in the same order, there is no way to concoct the same bytes from different column values.
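
A tiny standalone check of that reasoning (illustrative code, not from the PR): with the length prefixes, moving bytes between adjacent variable-length columns changes the length bytes, so the keys cannot collide:

```rust
// Encode a row of non-null string columns using the scheme from this PR:
// 0xFF validity marker, little-endian length prefix, then the string bytes.
fn encode(parts: &[&str]) -> Vec<u8> {
    let mut key = Vec::new();
    for s in parts {
        key.push(0xFF);
        key.extend_from_slice(&s.len().to_le_bytes());
        key.extend_from_slice(s.as_bytes());
    }
    key
}

fn main() {
    // Without the length prefixes, both rows would flatten to the same
    // byte string; the prefixes keep the two encodings distinct.
    assert_ne!(encode(&["fo", "o"]), encode(&["f", "oo"]));
    println!("distinct keys stay distinct");
}
```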

@jorgecarleitao (Member) left a comment:

Went through this carefully, and it looks great! Great work, @alamb !

@alamb
Copy link
Contributor Author

alamb commented Jul 30, 2021

Rebased now that #786 is merged, so this PR just shows the delta.

@alamb alamb force-pushed the alamb/gby_null_alternative branch from b0d834a to b6c6a3c on July 30, 2021 17:42
@alamb (Contributor Author) commented Aug 2, 2021

#808 contains the PR that should give us back any performance we lost in this one

@alamb alamb merged commit 2bcf040 into apache:master Aug 2, 2021
@alamb alamb deleted the alamb/gby_null_alternative branch August 2, 2021 11:23
@houqp houqp added the bug Something isn't working label Aug 3, 2021
igorcalabria added a commit to igorcalabria/arrow-datafusion that referenced this pull request Oct 9, 2023
takes the relevant part out of apache#793 which was ignored by cube maintainers
cfms3 pushed a commit to inloco/arrow-datafusion that referenced this pull request May 31, 2024
takes the relevant part out of apache#793 which was ignored by cube maintainers
unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
* Add CreateArray support

* Update Spark SQL test diffs

* Use scalaExprToProto

* Specify data type

* Only do nullable elements again

* Remove unused import

* Add null to the test and add nullable element datafusion issue

* Rename test

* Update lock
H0TB0X420 pushed a commit to H0TB0X420/datafusion that referenced this pull request Oct 7, 2025
* make PyExpr::to_variant arms explicit

* update PyInList to wrap expr::InList

* update PyExists to wrap expr::Exists

* update PyInSubquery to wrap expr::InSubquery

* update Placeholder to wrap expr::Placeholder

* make PyLogicalPlan::to_variant match arms explicit

* add PySortExpr wrapper

* add PyUnnestExpr wrapper

* update PyAlias to wrap upstream Alias

* return not implemented error for unimplemented variants in PyExpr::to_variant

* added to_variant python test from the GH issue

* remove unused import

* return unsupported_variants for unimplemented variants in PyLogicalPlan::to_variant

Successfully merging this pull request may close these issues.

Wrong results when grouping with dictionary arrays with nulls
