Indhumathi27 (Contributor) commented Oct 3, 2025

What changes were proposed in this pull request?

Disabled vectorized execution for multi-column COUNT(DISTINCT) so queries fall back to row mode for unsupported expressions.

Why are the changes needed?

When a query filters on a partition column and the same column also appears in a count(distinct ...) UDAF call, the partition column is replaced by a constant.

Vectorized execution does not support multi-column COUNT(DISTINCT). This ensures queries run safely without exceptions.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added test case

sonarqubecloud bot commented Oct 6, 2025

Indhumathi27 (Contributor Author)

@ayushtkn / @deniskuzZ / @okumin, could you please review this PR? Thanks.

enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
inputFileFormats: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
notVectorizedReason: GROUPBY operator: Aggregations with > 1 parameter are not supported unless all the extra parameters are constants count([Column[a], Column[b]])
notVectorizedReason: GROUPBY operator: Unsupported COUNT DISTINCT with multiple columns: count([Column[a], Column[b]]). Hive only supports COUNT(DISTINCT col) in vectorized execution.

Member

was the original message not good enough?

Contributor Author

Yes. Previously, the message covered only cases like count(distinct col1, col2), not cases like count(distinct col1, constant) or count(distinct col1, col2, constant).

deniskuzZ (Member) commented Oct 6, 2025

Before, we supported multi-column aggregations with constant expressions, and now we don't? At least that is what the message was saying:

Aggregations with > 1 parameter are not supported unless all the extra parameters are constants count([Column[a], Column[b]])

I don't get why we are changing the message. If the issue was related to a filter on a partition column, it shouldn't change the behavior for non-partitioned tables.

Member

create external table test_vector(id string, pid bigint, full_date int);
insert into test_vector (pid, full_date, id) values (1, '20240305', '6150');

EXPLAIN VECTORIZATION EXPRESSION
SELECT COUNT(DISTINCT(pid, full_date)) AS CNT FROM test_vector WHERE full_date=20240305;

vectorized: true

Contributor Author

@deniskuzZ the message was not changed for other cases. I added a new message for the count UDAF with more than one parameter. Now both partitioned and non-partitioned tables will have the same behavior.
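
To make the two messages concrete, here is a rough sketch of how a not-vectorized reason might be chosen between them. It is a model under stated assumptions, not the Vectorizer code from this patch: `VectorizationReasonSketch`, `Param`, and `notVectorizedReason` are hypothetical names, and the exact conditions in the PR may differ. Only the two message texts are taken from the explain output quoted earlier in this conversation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model of choosing a notVectorizedReason; not the Hive Vectorizer code.
final class VectorizationReasonSketch {

    /** Hypothetical stand-in for one aggregation parameter. */
    static final class Param {
        final String text;       // e.g. "Column[a]"
        final boolean constant;  // true if the parameter is a constant expression

        Param(String text, boolean constant) {
            this.text = text;
            this.constant = constant;
        }
    }

    /** Returns the reason vectorization is refused, or null if this sketch allows it. */
    static String notVectorizedReason(String udafName, boolean distinct, List<Param> params) {
        if (params.size() <= 1) {
            return null; // single-parameter aggregations are out of scope for this sketch
        }
        List<String> texts = params.stream().map(p -> p.text).collect(Collectors.toList());
        String call = udafName + "(" + texts + ")";

        // Assumed new rule: any multi-parameter COUNT(DISTINCT ...) falls back to row mode.
        if (distinct && "count".equalsIgnoreCase(udafName)) {
            return "GROUPBY operator: Unsupported COUNT DISTINCT with multiple columns: " + call
                + ". Hive only supports COUNT(DISTINCT col) in vectorized execution.";
        }
        // Pre-existing rule, as worded in the original message.
        boolean extrasAreConstants =
            params.subList(1, params.size()).stream().allMatch(p -> p.constant);
        if (!extrasAreConstants) {
            return "GROUPBY operator: Aggregations with > 1 parameter are not supported"
                + " unless all the extra parameters are constants " + call;
        }
        return null;
    }

    public static void main(String[] args) {
        List<Param> ab = Arrays.asList(new Param("Column[a]", false), new Param("Column[b]", false));
        System.out.println(notVectorizedReason("count", true, ab));
    }
}
```

In this sketch the COUNT(DISTINCT ...) rule is checked first, so count(distinct col1, constant) also gets the new message instead of passing the constants-only rule.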

deniskuzZ (Member) commented Oct 7, 2025

SELECT COUNT(DISTINCT(pid, full_date)) AS CNT FROM test_vector WHERE full_date=20240305;

works fine with a partitioned table as well.
I am not sure Hive properly handles DISTINCT with missing parentheses inside COUNT.

Member

Is the expectation that count(distinct pid, full_date) == count(distinct(pid, full_date))?

--------------------------------------------------------------------------------
-- 4. COUNT(DISTINCT pid, full_date, id) (multi-col distinct → FAIL)
--------------------------------------------------------------------------------
SELECT COUNT(DISTINCT pid, full_date, id) AS CNT FROM test_vector WHERE full_date=20240305;

Member

Interesting that it works for you — I’m getting an exception unless I wrap the distinct columns in parentheses.

 org.apache.hadoop.hive.ql.exec.UDFArgumentException: DISTINCT keyword must be specified
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount.getEvaluator(GenericUDAFCount.java:73)

Indhumathi27 (Contributor Author) commented Oct 6, 2025

The COUNT UDAF expects DISTINCT to be specified when there is more than one parameter.

throw new UDFArgumentException("DISTINCT keyword must be specified");
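
The check behind that exception can be modeled in isolation. The following is a simplified, self-contained sketch, not the actual `GenericUDAFCount` source: `CountArgumentCheck` and `validate` are hypothetical names, and `IllegalArgumentException` stands in for Hive's `UDFArgumentException` so the example runs without Hive on the classpath.

```java
// Simplified model of the argument check described above.
// CountArgumentCheck and validate are hypothetical names, not Hive APIs.
final class CountArgumentCheck {

    /** Rejects COUNT calls with more than one parameter unless DISTINCT was given. */
    static void validate(int parameterCount, boolean distinct) {
        if (parameterCount > 1 && !distinct) {
            // Stand-in for: throw new UDFArgumentException("DISTINCT keyword must be specified");
            throw new IllegalArgumentException("DISTINCT keyword must be specified");
        }
    }

    public static void main(String[] args) {
        validate(1, false); // one parameter, no DISTINCT: accepted
        validate(2, true);  // two parameters with DISTINCT: accepted
        try {
            validate(2, false); // two parameters without DISTINCT: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The sketch only looks at the parameter count and the DISTINCT flag as the evaluator receives them. It says nothing about how the parser or the vectorizer sets that flag.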

deniskuzZ (Member) commented Oct 7, 2025

try it

SET hive.vectorized.execution.enabled=true;

create external table test_vector(id string, pid bigint, full_date int);
insert into test_vector (pid, full_date, id) values (1, '20240305', '6150');

SELECT COUNT(DISTINCT pid, full_date) AS CNT FROM test_vector WHERE full_date=20240305;

exception

 org.apache.hadoop.hive.ql.exec.UDFArgumentException: DISTINCT keyword must be specified
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount.getEvaluator(GenericUDAFCount.java:73)
