fix: Correct results for grouping sets when columns contain nulls #12571

eejbyfeldt · 2024-09-21T09:30:38Z

Which issue does this PR close?

Rationale for this change

Currently we produce incorrect results when combining grouping sets and columns containing null values.

What changes are included in this PR?

The bug is fixed by introducing an internal column grouping_id when using grouping sets. This extra column makes sure that we create different groups for the nulls introduced by the grouping sets and the nulls present in the data.
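To illustrate why the extra column matters, here is a minimal, self-contained sketch. It is not the DataFusion implementation: `Row`, `Key`, `count_groups`, `sample_rows`, and the bit layout of the id are all hypothetical and chosen only for illustration. It counts rows per group for `GROUPING SETS ((a, b), (a))`, once without and once with a grouping id in the group key.

```rust
use std::collections::HashMap;

/// A row with two nullable grouping columns `a` and `b` (hypothetical data).
type Row = (Option<&'static str>, Option<&'static str>);
/// Group key: the two (possibly nulled-out) columns plus a grouping id.
type Key = (Option<&'static str>, Option<&'static str>, u8);

/// Count rows per group across all grouping sets.
/// `sets[i][c]` is true when column `c` belongs to grouping set `i`.
fn count_groups(rows: &[Row], sets: &[[bool; 2]], with_grouping_id: bool) -> HashMap<Key, usize> {
    let mut counts = HashMap::new();
    for set in sets {
        // One bit per column, set when the column is NOT in this grouping set.
        // The bit order (first column = most significant bit) is an assumption.
        let id = (u8::from(!set[0]) << 1) | u8::from(!set[1]);
        for &(a, b) in rows {
            let key: Key = (
                if set[0] { a } else { None },
                if set[1] { b } else { None },
                if with_grouping_id { id } else { 0 },
            );
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
}

/// Two rows sharing a = "x"; one of them has a real NULL in `b`.
fn sample_rows() -> Vec<Row> {
    vec![(Some("x"), None), (Some("x"), Some("y"))]
}

fn main() {
    // GROUPING SETS ((a, b), (a))
    let sets = [[true, true], [true, false]];
    let rows = sample_rows();
    println!("without id: {:?}", count_groups(&rows, &sets, false));
    println!("with id:    {:?}", count_groups(&rows, &sets, true));
}
```

Without the id, the data row `("x", NULL)` from the `(a, b)` set and the subtotal rows from the `(a)` set share one key and are merged into a single group. With the id in the key they stay separate.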

This approach is based on how it is implemented in Spark and was previously proposed in #5749. Note that this change is smaller in scope and limits the existence of the grouping_id to the ~~physical plan~~ (also added to the logical plan, as this will be needed for implementing the grouping function and it simplifies the code significantly). This is done so that we end up with a smaller PR that is easier to review. We might want to follow up by extending it to the logical plan and using it to implement the grouping function (#5647), in a manner similar to what is done in Spark.

Are these changes tested?

Existing and new sqllogictests.

Are there any user-facing changes?

alamb · 2024-09-23T17:15:07Z

Thank you @eejbyfeldt

cc @thinkharderdev, as I think you / your team originally implemented GROUPING SETS

thinkharderdev

Nice work! Had a few comments and questions :)

datafusion/sqllogictest/test_files/aggregate.slt

thinkharderdev · 2024-09-24T20:56:18Z

datafusion/physical-plan/src/aggregates/mod.rs

@@ -108,6 +110,8 @@ impl AggregateMode {
    }
 }

+const INTERNAL_GROUPING_ID: &str = "grouping_id";


What happens if this conflicts with a user-defined field? E.g. if I had a query like:

SELECT grouping_id, count(1) FROM table GROUP BY CUBE(grouping_id)

Seems to just work (which surprised me).

Pushed changes to add it to the logical plan, so name conflicts are now possible. I also renamed it to __grouping_id; this follows the pattern used for __common_expr, which can also conflict with names from the user schema. Since it is internal, I think it should be possible to follow up and make it fully unique.

datafusion/physical-plan/src/aggregates/mod.rs

eejbyfeldt · 2024-10-01T20:26:10Z

@thinkharderdev I made some changes so that the grouping id already exists in the logical plan. This simplified the logic quite a bit.

I also pushed a follow-up PR, #12704, that shows how we can use the grouping id to implement the grouping function.
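As a rough sketch of the idea (not the code in #12704), a grouping function can be derived from a grouping id by reading one bit per grouping column. The helper name `grouping_bit` and the bit order, with the first grouping column as the most significant bit, are assumptions made for this example.

```rust
/// Hypothetical helper: returns 1 if the column at `column_index` (0-based,
/// in GROUP BY order) was aggregated away in the grouping set identified by
/// `grouping_id`, and 0 if the column is part of that grouping set.
/// Assumes the first grouping column maps to the most significant bit.
fn grouping_bit(grouping_id: u32, column_index: usize, num_columns: usize) -> u32 {
    assert!(column_index < num_columns && num_columns <= 32);
    let shift = num_columns - 1 - column_index;
    (grouping_id >> shift) & 1
}

fn main() {
    // Ids for GROUPING SETS ((a, b), (a), ()) over the two columns a and b.
    for id in [0b00u32, 0b01, 0b11] {
        println!(
            "grouping_id={:02b} grouping(a)={} grouping(b)={}",
            id,
            grouping_bit(id, 0, 2),
            grouping_bit(id, 1, 2)
        );
    }
}
```

Under these assumptions, a row produced by the `(a)` grouping set has `grouping(b) = 1`, which tells a subtotal null apart from a null that came from the data.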

alamb

Thank you @eejbyfeldt and @thinkharderdev

I reviewed this PR and the logic makes sense to me. It seems well tested, well structured, and well commented. Thank you very much for this contribution.

datafusion/expr/src/logical_plan/plan.rs

alamb

Looks good -- I plan to merge this PR tomorrow unless anyone else would like more time to review.

Thanks again @eejbyfeldt

alamb · 2024-10-06T11:21:55Z

🤔 looks like some newly added CI tests are failing

eejbyfeldt · 2024-10-06T19:29:54Z

> 🤔 looks like some newly added CI tests are failing

Fixed the test. But it looks like there might be other failures that are also on master.

alamb · 2024-10-07T11:24:10Z

Gah -- I think CI / clippy will be fixed by #12724

Sorry about this @eejbyfeldt

alamb · 2024-10-07T12:54:03Z

Merged up from main to pick up #12724 and hopefully get a clean CI run

alamb · 2024-10-07T16:24:56Z

🚀


github-actions bot added physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Sep 21, 2024

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch 5 times, most recently from f4a220b to 9c840b0 Compare September 22, 2024 10:27

github-actions bot added the optimizer Optimizer rules label Sep 22, 2024

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch 2 times, most recently from 8d01437 to fdce177 Compare September 22, 2024 11:19

eejbyfeldt marked this pull request as ready for review September 23, 2024 17:08

eejbyfeldt changed the title ~~fix: Grouping sets when columns contain nulls~~ fix: Correct results for grouping sets when columns contain nulls Sep 23, 2024

thinkharderdev reviewed Sep 24, 2024

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch from fdce177 to fee8bdf Compare September 25, 2024 18:09

eejbyfeldt mentioned this pull request Sep 25, 2024

Implement GROUPING aggregate function (following Postgres behavior.) #12565

Closed

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch 2 times, most recently from 230e4ef to 37dc663 Compare September 29, 2024 08:09

github-actions bot added logical-expr Logical plan and expressions and removed optimizer Optimizer rules labels Sep 29, 2024

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch 2 times, most recently from 37d6283 to 2914017 Compare October 1, 2024 12:24

github-actions bot added the sql SQL Planner label Oct 1, 2024

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch from 2914017 to e9c7e13 Compare October 1, 2024 19:23

github-actions bot added optimizer Optimizer rules substrait labels Oct 1, 2024

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch from e9c7e13 to 377e348 Compare October 1, 2024 19:35

github-actions bot removed the sql SQL Planner label Oct 1, 2024

eejbyfeldt mentioned this pull request Oct 1, 2024

feat: Implement grouping function using grouping id #12704

Merged

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch from 377e348 to 86eb434 Compare October 1, 2024 20:16

eejbyfeldt requested a review from thinkharderdev October 1, 2024 20:18

alamb approved these changes Oct 4, 2024

datafusion/expr/src/logical_plan/plan.rs

eejbyfeldt added 5 commits October 5, 2024 09:48

Fix grouping sets behavior when data contains nulls

9f387f2

PR suggestion comment

15aafe3

Update new test case

c256737

Add grouping_id to the logical plan

9e7c314

Add doc comment next to INTERNAL_GROUPING_ID

920b384

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch from 86eb434 to 920b384 Compare October 5, 2024 08:17

alamb approved these changes Oct 6, 2024

github-actions bot added the sql SQL Planner label Oct 6, 2024

Fix unparsing of Aggregate with grouping sets

1f61ddf

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch from 543daa3 to 1f61ddf Compare October 6, 2024 19:18

Merge remote-tracking branch 'apache/main' into fix-grouping-sets-with-null-values

5a8d670

alamb merged commit ef227f4 into apache:main Oct 7, 2024
24 checks passed

Sevenannn mentioned this pull request Oct 17, 2024

Return correct results when rollup groupby contains null in groups spiceai/datafusion#47

Merged
