-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize COUNT(1)
: Change the sentinel value's type for COUNT(*) to Int64
#9944
Conversation
e0ecfc6
to
dbc5020
Compare
Might as well try whether /benchmark works as intended now. |
dbc5020
to
c27e9c2
Compare
Benchmark resultsBenchmarks comparing 4bd7c13 (main) and dbc5020 (PR)
|
@@ -42,7 +42,7 @@ use sqlparser::ast::{ExceptSelectItem, ExcludeSelectItem, WildcardAdditionalOpti | |||
|
|||
/// The value to which `COUNT(*)` is expanded to in | |||
/// `COUNT(<constant>)` expressions | |||
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::UInt8(Some(1)); | |||
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does however result in count(1)
being auto-named to count(*)
now.
❯ explain select count(1) from hits;
+---------------+---------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[]], aggr=[[COUNT(Int64(1))]] |
| | TableScan: hits projection=[] |
| physical_plan | ProjectionExec: expr=[99997497 as COUNT(*)] |
| | PlaceholderRowExec |
| | |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.263 seconds.
❯ select count(1) from hits;
+----------+
| COUNT(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.015 seconds.
That might be fine as it reduces ambiguity.
Otherwise it could be resolved by e.g.:
diff --git a/datafusion/core/src/physical_optimizer/aggregate_statistics.rs b/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
index df5422227..0e35b6a2b 100644
--- a/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
+++ b/datafusion/core/src/physical_optimizer/aggregate_statistics.rs
@@ -160,6 +160,11 @@ fn take_optimizable_table_count(
ScalarValue::Int64(Some(num_rows as i64)),
COUNT_STAR_NAME,
));
+ } else if lit_expr.value() == &ScalarValue::Int64(Some(1)) {
+ return Some((
+ ScalarValue::Int64(Some(num_rows as i64)),
+ "COUNT(1)",
+ ));
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does however result in count(1) being auto-named to count(*) now.
FWIW, this is what happens on main with count(uint8)
(the same thing)
❯ select count(arrow_cast(1, 'UInt8')) from (values (1));
+----------+
| COUNT(*) |
+----------+
| 1 |
+----------+
1 row(s) fetched.
Elapsed 0.002 seconds.
Though admittedly almost no one would actually type count(arrow_cast(1, 'UInt8'))
so it is likely not a big deal.
That might be fine as it reduces ambiguity.
I agree having the COUNT(*)
in the plan actually helps as it makes it clearer when the fast path is being used.
Let's start with this and if someone else has a different opinion we can make another PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gruuya -- this PR makes senes to me. I have definitely seen people do SELECT COUNT(1)
rather than SELECT COUNT(*)
so having them have the same performance seems good 👨🍳 👌
@@ -42,7 +42,7 @@ use sqlparser::ast::{ExceptSelectItem, ExcludeSelectItem, WildcardAdditionalOpti | |||
|
|||
/// The value to which `COUNT(*)` is expanded to in | |||
/// `COUNT(<constant>)` expressions | |||
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::UInt8(Some(1)); | |||
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does however result in count(1) being auto-named to count(*) now.
FWIW, this is what happens on main with count(uint8)
(the same thing)
❯ select count(arrow_cast(1, 'UInt8')) from (values (1));
+----------+
| COUNT(*) |
+----------+
| 1 |
+----------+
1 row(s) fetched.
Elapsed 0.002 seconds.
Though admittedly almost no one would actually type count(arrow_cast(1, 'UInt8'))
so it is likely not a big deal.
That might be fine as it reduces ambiguity.
I agree having the COUNT(*)
in the plan actually helps as it makes it clearer when the fast path is being used.
Let's start with this and if someone else has a different opinion we can make another PR
COUNT(1)
: Change the sentinel value's type for COUNT(*) to Int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me, nice find 👍
Thanks again @gruuya |
Which issue does this PR close?
Closes #9943.
Rationale for this change
Make
COUNT(1)
equivalent toCOUNT(*)
.What changes are included in this PR?
Change the type of
COUNT_STAR_EXPANSION
toScalarValue::Int64
.Are these changes tested?
Yes, with revised existing tests and a new one that tests the optimization for
count(1)
case.Are there any user-facing changes?