HIVE-29197: Disable vectorization for multi-column COUNT(DISTINCT) #6114

Indhumathi27 · 2025-10-03T13:36:59Z

What changes were proposed in this pull request?

Disabled vectorized execution for multi-column COUNT(DISTINCT) so queries fall back to row mode for unsupported expressions.

Why are the changes needed?

When a query filters on a partition column and the same column also appears as an argument of a COUNT(DISTINCT ...) call, the partition column is replaced by a constant.

Vectorized execution does not support multi-column COUNT(DISTINCT). Falling back to row mode ensures that such queries run without exceptions.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added test case

sonarqubecloud · 2025-10-06T07:36:41Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Indhumathi27 · 2025-10-06T08:35:45Z

@ayushtkn / @deniskuzZ / @okumin could you please review this PR? Thanks.

deniskuzZ · 2025-10-06T09:07:57Z

ql/src/test/results/clientpositive/llap/vector_count.q.out

                enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
                inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
-                notVectorizedReason: GROUPBY operator: Aggregations with > 1 parameter are not supported unless all the extra parameters are constants count([Column[a], Column[b]])
+                notVectorizedReason: GROUPBY operator: Unsupported COUNT DISTINCT with multiple columns: count([Column[a], Column[b]]). Hive only supports COUNT(DISTINCT col) in vectorized execution. 


Was the original message not good enough?

Yes. Previously it covered only some cases, such as count(distinct col1, col2), but not cases such as count(distinct col1, constant) or count(distinct col1, col2, constant).

Before, we supported multi-column aggregations with constant expressions, and now we don't? At least that is what the message was saying:

Aggregations with > 1 parameter are not supported unless all the extra parameters are constants count([Column[a], Column[b]])

I don't understand why we are changing the message. If the issue is related to a filter on a partition column, the fix shouldn't change the behavior for non-partitioned tables.

cc @asolimando
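For readers following the discussion, the rule stated in the original message can be pictured with the small sketch below. It is a hypothetical, self-contained illustration and not Hive's Vectorizer code: the `Param` class and `isVectorizable` method are names invented here, and treating "extra parameters" as every parameter after the first is an assumption.

```java
import java.util.List;

/** Hypothetical sketch of the rule in the original message; not Hive code. */
final class AggregationParamRule {

    /** Stand-in for one aggregation argument: a column reference or a constant. */
    static final class Param {
        final String text;
        final boolean isConstant;

        Param(String text, boolean isConstant) {
            this.text = text;
            this.isConstant = isConstant;
        }
    }

    /**
     * Returns true when the parameter list passes the rule: at most one
     * parameter, or every parameter after the first is a constant
     * (assumed reading of "extra parameters").
     */
    static boolean isVectorizable(List<Param> params) {
        if (params.size() <= 1) {
            return true;
        }
        for (Param p : params.subList(1, params.size())) {
            if (!p.isConstant) {
                return false;
            }
        }
        return true;
    }
}
```

Under this reading, count(a, b) with two column arguments fails the rule, while a column followed only by constants passes it.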

create external table test_vector(id string, pid bigint, full_date int);
insert into test_vector (pid, full_date, id) values (1, '20240305', '6150');
EXPLAIN VECTORIZATION EXPRESSION
SELECT COUNT(DISTINCT(pid, full_date)) AS CNT FROM test_vector WHERE full_date=20240305;

vectorized: true

@deniskuzZ The message was not changed for other cases. I added a new message for the count UDF with more than one parameter. Now partitioned and non-partitioned tables will have the same behavior.

SELECT COUNT(DISTINCT(pid, full_date)) AS CNT FROM test_vector WHERE full_date=20240305;

This works fine with a partitioned table as well.
I am not sure Hive properly handles DISTINCT inside COUNT when the parentheses are missing.

Is the expectation that count(distinct pid, full_date) == count(distinct(pid, full_date))?


deniskuzZ · 2025-10-06T11:19:25Z

ql/src/test/queries/clientpositive/vector_count_distinct_multiarg.q

+--------------------------------------------------------------------------------
+-- 4. COUNT(DISTINCT pid, full_date, id) (multi-col distinct → FAIL)
+--------------------------------------------------------------------------------
+SELECT COUNT(DISTINCT pid, full_date, id) AS CNT FROM test_vector WHERE full_date=20240305;


Interesting that it works for you. I’m getting an exception unless I wrap the distinct columns in parentheses.

org.apache.hadoop.hive.ql.exec.UDFArgumentException: DISTINCT keyword must be specified
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount.getEvaluator(GenericUDAFCount.java:73)

The COUNT UDAF expects DISTINCT to be specified when there is more than one parameter.

hive/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java

Line 73 in 0e8749e

throw new UDFArgumentException("DISTINCT keyword must be specified");
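The behavior described here (more than one parameter requires DISTINCT) can be pictured with the following self-contained sketch. It reflects only what is stated in this thread, not the full `getEvaluator` logic: the class and method names are hypothetical, and `IllegalArgumentException` stands in for Hive's `UDFArgumentException`.

```java
/** Hypothetical sketch of the argument check discussed above; not Hive code. */
final class CountArgumentCheck {

    /**
     * Rejects a COUNT call that has more than one parameter unless the
     * DISTINCT flag is set, using the same message as the quoted line.
     */
    static void validate(int parameterCount, boolean distinct) {
        if (parameterCount > 1 && !distinct) {
            // Stand-in for: throw new UDFArgumentException("DISTINCT keyword must be specified");
            throw new IllegalArgumentException("DISTINCT keyword must be specified");
        }
    }
}
```

In this sketch, `validate(2, true)` returns normally and `validate(2, false)` throws, so the exception above suggests the DISTINCT flag was not set when the evaluator was resolved.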

Try it:

SET hive.vectorized.execution.enabled=true;
create external table test_vector(id string, pid bigint, full_date int);
insert into test_vector (pid, full_date, id) values (1, '20240305', '6150');
SELECT COUNT(DISTINCT pid, full_date) AS CNT FROM test_vector WHERE full_date=20240305;

Exception:

org.apache.hadoop.hive.ql.exec.UDFArgumentException: DISTINCT keyword must be specified
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount.getEvaluator(GenericUDAFCount.java:73)

asf-ci-hive added the tests pending label Oct 3, 2025

Indhumathi27 force-pushed the vect_distinct branch from e75ce80 to 58d0528 Compare October 3, 2025 13:40

asf-ci-hive added tests failed, tests pending, tests unstable and removed tests pending, tests failed labels Oct 3, 2025

Indhumathi27 force-pushed the vect_distinct branch from 58d0528 to eb285e8 Compare October 4, 2025 02:11

asf-ci-hive added tests pending, tests passed and removed tests unstable, tests pending labels Oct 4, 2025

HIVE-29197: Disable vectorization for multi-column COUNT(DISTINCT)

55dd006

Indhumathi27 force-pushed the vect_distinct branch from eb285e8 to 55dd006 Compare October 6, 2025 06:07

asf-ci-hive added tests pending and removed tests passed labels Oct 6, 2025

asf-ci-hive added tests passed and removed tests pending labels Oct 6, 2025

deniskuzZ reviewed Oct 6, 2025
