-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29197: Disable vectorization for multi-column COUNT(DISTINCT) #6114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
e75ce80
to
58d0528
Compare
58d0528
to
eb285e8
Compare
eb285e8
to
55dd006
Compare
|
@ayushtkn / @deniskuzZ / @okumin can you help to review the PR. Thanks |
enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true | ||
inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | ||
notVectorizedReason: GROUPBY operator: Aggregations with > 1 parameter are not supported unless all the extra parameters are constants count([Column[a], Column[b]]) | ||
notVectorizedReason: GROUPBY operator: Unsupported COUNT DISTINCT with multiple columns: count([Column[a], Column[b]]). Hive only supports COUNT(DISTINCT col) in vectorized execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was the original message not good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. Before, It has covered some cases like count(distinct col1, col2). Not cases like count(distinct col1, constant), count(distinct col1, col2, constant) etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before we supported multi-column aggregations with constant expressions and now we don't? At least that what the message was saying
Aggregations with > 1 parameter are not supported unless all the extra parameters are constants count([Column[a], Column[b]])
i don't get why are we changing the message? if the issue was related to filter on partition column, it shouldn't change non-partition table behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @asolimando
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create external table test_vector(id string, pid bigint, full_date int);
insert into test_vector (pid, full_date, id) values (1, '20240305', '6150');
EXPLAIN VECTORIZATION EXPRESSION
SELECT COUNT(DISTINCT(pid, full_date)) AS CNT FROM test_vector WHERE full_date=20240305;
vectorized: true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ the message was not changed for other cases. i added a new message for count udf with more than one parameter. now both partition table and non-partition one will have same behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT COUNT(DISTINCT(pid, full_date)) AS CNT FROM test_vector WHERE full_date=20240305;
works fine with partitioned table as well.
i am not sure Hive properly handles DISTINCT with missing parentheses inside COUNT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the expectation that count(distinct pid, full_date) == count(distinct(pid, full_date)) ?
-------------------------------------------------------------------------------- | ||
-- 4. COUNT(DISTINCT pid, full_date, id) (multi-col distinct → FAIL) | ||
-------------------------------------------------------------------------------- | ||
SELECT COUNT(DISTINCT pid, full_date, id) AS CNT FROM test_vector WHERE full_date=20240305; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting that it works for you — I’m getting an exception unless I wrap the distinct columns in parentheses.
org.apache.hadoop.hive.ql.exec.UDFArgumentException: DISTINCT keyword must be specified
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount.getEvaluator(GenericUDAFCount.java:73)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COUNT UDAF excepts DISTINCT to be specified, when the parameters are more than 1.
throw new UDFArgumentException("DISTINCT keyword must be specified"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try it
SET hive.vectorized.execution.enabled=true;
create external table test_vector(id string, pid bigint, full_date int);
insert into test_vector (pid, full_date, id) values (1, '20240305', '6150');
SELECT COUNT(DISTINCT pid, full_date) AS CNT FROM test_vector WHERE full_date=20240305;
exception
org.apache.hadoop.hive.ql.exec.UDFArgumentException: DISTINCT keyword must be specified
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount.getEvaluator(GenericUDAFCount.java:73)
What changes were proposed in this pull request?
Disabled vectorized execution for multi-column COUNT(DISTINCT) so queries fall back to row mode for unsupported expressions.
Why are the changes needed?
In case of query with filter on Partition column, and if the same column exists in count(distinct, col) udf, Partition column changes to constant.
Vectorized execution does not support multi-column COUNT(DISTINCT). This ensures queries run safely without exceptions.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added test case