Fix incorrect results for aggregation functions on case-sensitive types#8551
Conversation
1aadeef to
16800a0
Compare
There was a problem hiding this comment.
if it's about sorting (as we believe it is), limitation doesn't apply to GROUPING SETs.
Also it applies to some aggregation functions (min, max), but not others count, or count(DISTINCT)
There was a problem hiding this comment.
Yes. The part about difference between aggregation functions came up before but I couldn't find a nice way to model it (other than introduce a field to AggregateFunction and having each of them declare if they depend on case-sensitivity for correctness).
For the grouping set - thanks for catching. Postgres indeed doesn't need the grouping set check - only the aggregate function check.
There was a problem hiding this comment.
Re how to model - since this is function specific, it could be handled by io.trino.plugin.jdbc.expression.AggregateFunctionRule#rewrite returning empty or not.
There was a problem hiding this comment.
That's what I tried initially but it's a bit errorprone since each module has their own rewrites. I can try to move as many to base-jdbc as possible in a preparatory commit.
There was a problem hiding this comment.
Also, to confirm, you mean that the grouping sets condition can be handled in supportsAggregation like now.
And instead of handling the functions in supportsAggregation I should handle them in the rewrite rules?
There was a problem hiding this comment.
That's what I tried initially but it's a bit errorprone since each module has their own rewrites. I can try to move as many to base-jdbc as possible in a preparatory commit.
That's unlikely to work. We have separate rewrites because there were some differences.
Also, to confirm, you mean that the grouping sets condition can be handled in supportsAggregation like now.
i didn't read that part, so "maybe"
And instead of handling the functions in supportsAggregation I should handle them in the rewrite rules?
that's my intuition
There was a problem hiding this comment.
I implemented the approach you suggested - it works and thankfully the only case-sensitive rewrites live in base-jdbc already.
There was a problem hiding this comment.
count(a_varchar) should still be pushed down, as case sensitivity (or more broadly: collations), doesn't impact results.
plugin/trino-clickhouse/src/main/java/io/trino/plugin/clickhouse/ClickHouseClient.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
That's based on a wrong type.
For example, PostgreSQL's money and enum types are mapped to Trino varchar, while both will have different sorting properties.
There was a problem hiding this comment.
Since the rewrite rule is generic I think the only solution is to allow connectors to pass a list of jdbcTypeName through the constructor for types which should not be pushed down?
Or maybe add a static function to JdbcClient called isCollationSensitive(JdbcColumnHandle) and let connectors define their own impls? This method already exists in the PostgreSQL client btw and may be useful for others where we can pass explicit collations (MySQL once we drop 5.x).
There was a problem hiding this comment.
Yeah, i think we already use "is mapped to char or varchar" as a way to determine if it's potentially case insensitive (or collation-sensitive).
it may catch too much, but shouldn't catch too little, so it's fine.
leave as is
There was a problem hiding this comment.
it's not exactly about case sensitivity. For example, by default PostgreSQL is case-sensitive, but still sorts differently than Trino, so we shouldn't push min/max on varchar.
BTW do PostgresQL min/max accept COLLATE?
There was a problem hiding this comment.
I think the complaint here is that the variable is named in a misleading way? Maybe isCollationSensitive would be better?
Functions don't accept explicit collations - I couldn't find any examples and the variations I tried led to syntax errors.
There was a problem hiding this comment.
I think the complaint here is that the variable is named in a misleading way
yes
There was a problem hiding this comment.
clerk is varchar and part of grouping set so it prevents pushdown after the change. Changed to use a numeric column since we just want to test GROUP BY + TOPN.
There was a problem hiding this comment.
can this be static?
(that would help understand the state flow)
There was a problem hiding this comment.
Yes, should've been static from the start. thanks.
There was a problem hiding this comment.
let's add a correctness cases
count(DISTINCT a_string)count(DISTINCT a_string),count(DISTINCT a_bigint)` (together)
this could help avoid any regressions in #8562 cc @alexjo2144
There was a problem hiding this comment.
remote database may be case-insensitive
nit: Use "Remote database can be case insensitive" words, so that similar places in the code are searchable.
504c1ae to
9e7dd3c
Compare
|
LGTM @hashhar - thanks for working on that |
|
Rebasing to make sure nothing broke since the upgrade. Will merge once CI finishes. |
Some databases are case-insensitive (MySQL, SQL Server) while others sort textual types differently compared to Trino (PostgreSQL). For such databases pushdown of aggregation functions when the grouping set includes a textual type can lead to incorrect results. So we prevent aggregation pushdown for such cases. We also prevent pushdown for functions whose results depend on sort order (min/max) when the input is a textual type.
9e7dd3c to
6473c84
Compare
Pushdown of aggregation functions to case-insensitive databases (e.g. MySQL) on case-sensitive inputs (e.g. VARCHAR/CHAR) can lead to incorrect results.
This change disables aggregation pushdown when:
It also introduces a toggle to restore previous incorrect behaviour.Fixes #7320