-
Notifications
You must be signed in to change notification settings - Fork 2.2k
fix duplicated schema name error from count wildcard #14824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 5 commits
0af4ab9
7f18e05
3ef7ddd
40385aa
a456792
3497965
d956307
e24cf29
e54d4b8
6ee5a35
2a2d0d3
98d42a1
9efb1ee
76aadc8
7edbf02
39e9d72
670f06e
6931c24
e95018f
fadf6e3
d24f151
610c9a3
cb6c975
90a7b0a
6550841
5f55161
2a8f4f4
70280b6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -148,6 +148,15 @@ impl AggregateUDFImpl for Count { | |
| "count" | ||
| } | ||
|
|
||
| // In AggregateFunctionPlanner, wildcard is converted to count(1) | ||
| // | ||
| // count() -> count(1) | ||
| // count(*) -> count(1) | ||
| // count(1) -> count(1) | ||
| // count(2) -> count(2) | ||
| // | ||
| // count(1) is named as count(*) in schema_name | ||
| // other constant remains the same | ||
| fn schema_name(&self, params: &AggregateFunctionParams) -> Result<String> { | ||
| let AggregateFunctionParams { | ||
| args, | ||
|
|
@@ -511,6 +520,11 @@ impl AggregateUDFImpl for Count { | |
| return None; | ||
| } | ||
| if let Precision::Exact(num_rows) = statistics_args.statistics.num_rows { | ||
| // handle count() | ||
| if statistics_args.exprs.is_empty() { | ||
| return Some(ScalarValue::Int64(Some(num_rows as i64))); | ||
| } | ||
|
|
||
| if statistics_args.exprs.len() == 1 { | ||
| // TODO optimize with exprs other than Column | ||
| if let Some(col_expr) = statistics_args.exprs[0] | ||
|
|
@@ -550,8 +564,6 @@ impl AggregateUDFImpl for Count { | |
| fn is_count_wildcard(args: &[Expr]) -> bool { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this function now feels a bit redundant as it is just checking for |
||
| match args { | ||
| [] => true, // count() | ||
| // All const should be coerced to int64 or rejected by the signature | ||
| [Expr::Literal(ScalarValue::Int64(Some(_)))] => true, // count(1) | ||
| _ => false, // More than one argument or non-matching cases | ||
| } | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -34,7 +34,9 @@ use crate::{ | |
| SendableRecordBatchStream, Statistics, | ||
| }; | ||
|
|
||
| use arrow::array::{ArrayRef, UInt16Array, UInt32Array, UInt64Array, UInt8Array}; | ||
| use arrow::array::{ | ||
| ArrayRef, Int64Array, UInt16Array, UInt32Array, UInt64Array, UInt8Array, | ||
| }; | ||
| use arrow::datatypes::{Field, Schema, SchemaRef}; | ||
| use arrow::record_batch::RecordBatch; | ||
| use datafusion_common::stats::Precision; | ||
|
|
@@ -1231,6 +1233,13 @@ fn evaluate( | |
| expr: &[Arc<dyn PhysicalExpr>], | ||
| batch: &RecordBatch, | ||
| ) -> Result<Vec<ArrayRef>> { | ||
| // handle count() case | ||
| if expr.is_empty() { | ||
| return Ok(vec![ | ||
| Arc::new(Int64Array::from(vec![1; batch.num_rows()])) as ArrayRef | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is equivalent to
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems that this function is not only used by |
||
| ]); | ||
| } | ||
|
|
||
| expr.iter() | ||
| .map(|expr| { | ||
| expr.evaluate(batch) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -23,6 +23,7 @@ use crate::aggregates::{ | |
| }; | ||
| use crate::metrics::{BaselineMetrics, RecordOutput}; | ||
| use crate::{RecordBatchStream, SendableRecordBatchStream}; | ||
| use arrow::array::{ArrayRef, Int64Array}; | ||
| use arrow::datatypes::SchemaRef; | ||
| use arrow::record_batch::RecordBatch; | ||
| use datafusion_common::Result; | ||
|
|
@@ -219,23 +220,26 @@ fn aggregate_batch( | |
| None => Cow::Borrowed(&batch), | ||
| }; | ||
|
|
||
| let n_rows = batch.num_rows(); | ||
|
|
||
| // 1.3 | ||
| let values = &expr | ||
| .iter() | ||
| .map(|e| { | ||
| e.evaluate(&batch) | ||
| .and_then(|v| v.into_array(batch.num_rows())) | ||
| }) | ||
| .collect::<Result<Vec<_>>>()?; | ||
| // Handle count(*) case | ||
| let values = if expr.is_empty() { | ||
| vec![Arc::new(Int64Array::from(vec![1; n_rows])) as ArrayRef] | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is equivalent to count(1) case |
||
| } else { | ||
| expr.iter() | ||
| .map(|e| e.evaluate(&batch).and_then(|v| v.into_array(n_rows))) | ||
| .collect::<Result<Vec<_>>>()? | ||
| }; | ||
|
|
||
| // 1.4 | ||
| let size_pre = accum.size(); | ||
| let res = match mode { | ||
| AggregateMode::Partial | ||
| | AggregateMode::Single | ||
| | AggregateMode::SinglePartitioned => accum.update_batch(values), | ||
| | AggregateMode::SinglePartitioned => accum.update_batch(&values), | ||
| AggregateMode::Final | AggregateMode::FinalPartitioned => { | ||
| accum.merge_batch(values) | ||
| accum.merge_batch(&values) | ||
| } | ||
| }; | ||
| let size_post = accum.size(); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6276,6 +6276,9 @@ physical_plan | |
| 05)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 | ||
| 06)----------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c5], file_type=csv, has_header=true | ||
|
|
||
| statement count 0 | ||
| drop table aggregate_test_100; | ||
|
|
||
| # test count(null) case (null with type) | ||
|
|
||
| statement count 0 | ||
|
|
@@ -6296,6 +6299,54 @@ physical_plan | |
| 01)AggregateExec: mode=Single, gby=[], aggr=[count(NULL)] | ||
| 02)--DataSourceExec: partitions=1, partition_sizes=[1] | ||
|
|
||
| statement count 0 | ||
| drop table t; | ||
|
|
||
| # test duplicated shema name issue | ||
|
|
||
| statement count 0 | ||
| create table t (a int) as values (1), (2); | ||
|
|
||
| query I | ||
| select count() from t; | ||
| ---- | ||
| 2 | ||
|
|
||
| query I | ||
| select count(1) * count(2) from t; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you also please add a test that shows just the values of For example select count(1), count(2), count(1) * count(2) from t; |
||
| ---- | ||
| 4 | ||
|
|
||
| query I | ||
| select count(1) * count(*) from t; | ||
| ---- | ||
| 4 | ||
|
|
||
| query I | ||
| select count(*) * count(*) from t; | ||
| ---- | ||
| 4 | ||
|
|
||
| query I | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Likewise here it would be nice to have |
||
| select count(1) * count(1) from t; | ||
| ---- | ||
| 4 | ||
|
|
||
| query TT | ||
| explain select count(1) * count(2) from t; | ||
| ---- | ||
| logical_plan | ||
| 01)Projection: count(Int64(1)) * count(Int64(2)) | ||
| 02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1)), count(Int64(2))]] | ||
| 03)----TableScan: t projection=[] | ||
| physical_plan | ||
| 01)ProjectionExec: expr=[count(Int64(1))@0 * count(Int64(2))@1 as count(Int64(1)) * count(Int64(2))] | ||
| 02)--AggregateExec: mode=Single, gby=[], aggr=[count(Int64(1)), count(Int64(2))] | ||
| 03)----DataSourceExec: partitions=1, partition_sizes=[1] | ||
|
|
||
| statement count 0 | ||
| drop table t; | ||
|
|
||
| ####### | ||
| # Group median test | ||
| ####### | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -80,12 +80,12 @@ query TT | |
| EXPLAIN SELECT a, COUNT() OVER (PARTITION BY a) AS count_a FROM t1; | ||
| ---- | ||
| logical_plan | ||
| 01)Projection: t1.a, count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS count_a | ||
| 02)--WindowAggr: windowExpr=[[count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] | ||
| 01)Projection: t1.a, count(Int64(1)) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS count_a | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given you implemented support for |
||
| 02)--WindowAggr: windowExpr=[[count(Int64(1)) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] | ||
| 03)----TableScan: t1 projection=[a] | ||
| physical_plan | ||
| 01)ProjectionExec: expr=[a@0 as a, count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as count_a] | ||
| 02)--WindowAggExec: wdw=[count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Ok(Field { name: "count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(NULL)), is_causal: false }] | ||
| 01)ProjectionExec: expr=[a@0 as a, count(Int64(1)) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as count_a] | ||
| 02)--WindowAggExec: wdw=[count(Int64(1)) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Ok(Field { name: "count(Int64(1)) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(NULL)), is_causal: false }] | ||
| 03)----SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[false] | ||
| 04)------DataSourceExec: partitions=1, partition_sizes=[1] | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still can't run
select count(), count(*).I suspect that using aliases to restore the original names is a simpler fix. I tried doing this on jonahgao@08206fd.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue here is quite different than the test covered in extended test.
duplicated schema case is executable now
select count(), count(*)duplicated name in projection is another issueBut I agree, this query should be executable too, and I think the way to fix it is different from the duplicated schema name one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I verified that both of those queries run in
datafusion44 and 45 but does not run on main. Thus this is a regression.I agree with @jayzhan211 that the issue is different than what is causing the sqlite tests to fail in main
I have filed a ticket to track this:
select count(), count(*)does not work #14855Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they have the same root cause, which is the rewriting by
AggregateFunctionPlannerandCount::schema_name()introducing duplicate names, and they could all be fixed by using aliases. The oldCountWildcardRuleusedNamePreserverto achieve a similar effect.