Fix incorrect results for aggregation functions on case-sensitive types by hashhar · Pull Request #8551 · trinodb/trino

hashhar · 2021-07-14T07:58:30Z

Pushdown of aggregation functions to case-insensitive databases (e.g. MySQL) on case-sensitive inputs (e.g. VARCHAR/CHAR) can lead to incorrect results.

This change disables aggregation pushdown when:

- Any of the grouping sets includes a case-sensitive type
- The aggregation function is case-sensitive (e.g. max/min but not count) and any of its inputs is a case-sensitive type
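The two conditions above can be sketched roughly as follows. This is only an illustration under stated assumptions: `AggregationPushdownSketch`, `ColumnType`, `Aggregate` and `canPushDown` are hypothetical names, not Trino's API, and "case-sensitive type" is approximated as "CHAR or VARCHAR".

```java
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; this is not Trino's API.
final class AggregationPushdownSketch {
    enum ColumnType { BIGINT, DOUBLE, CHAR, VARCHAR }

    // Functions whose result depends on how the remote database compares text.
    private static final Set<String> COMPARISON_DEPENDENT_FUNCTIONS = Set.of("min", "max");

    static final class Aggregate {
        final String function;
        final List<ColumnType> inputs;

        Aggregate(String function, List<ColumnType> inputs) {
            this.function = function;
            this.inputs = inputs;
        }
    }

    static boolean isTextual(ColumnType type) {
        return type == ColumnType.CHAR || type == ColumnType.VARCHAR;
    }

    static boolean canPushDown(List<ColumnType> groupingColumns, List<Aggregate> aggregates) {
        // Condition 1: a textual grouping column blocks pushdown.
        if (groupingColumns.stream().anyMatch(AggregationPushdownSketch::isTextual)) {
            return false;
        }
        // Condition 2: a comparison-dependent function over a textual input blocks pushdown.
        for (Aggregate aggregate : aggregates) {
            if (COMPARISON_DEPENDENT_FUNCTIONS.contains(aggregate.function)
                    && aggregate.inputs.stream().anyMatch(AggregationPushdownSketch::isTextual)) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch `count` over a VARCHAR input is still eligible for pushdown, while `max` over the same input is not.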

~~It also introduces a toggle to restore previous incorrect behaviour.~~

Fixes #7320

findepi · 2021-07-16T11:19:14Z

plugin/trino-postgresql/src/main/java/io/trino/plugin/postgresql/PostgreSqlClient.java

If it's about sorting (as we believe it is), the limitation doesn't apply to GROUPING SETS.
Also, it applies to some aggregation functions (min, max), but not to others (count, count(DISTINCT)).

Yes. The part about the difference between aggregation functions came up before, but I couldn't find a nice way to model it (other than introducing a field on AggregateFunction and having each function declare whether it depends on case sensitivity for correctness).

For the grouping set - thanks for catching. Postgres indeed doesn't need the grouping set check - only the aggregate function check.

Re how to model - since this is function specific, it could be handled by io.trino.plugin.jdbc.expression.AggregateFunctionRule#rewrite returning empty or not.

That's what I tried initially but it's a bit error-prone since each module has their own rewrites. I can try to move as many to base-jdbc as possible in a preparatory commit.

Also, to confirm, you mean that the grouping sets condition can be handled in supportsAggregation like now.
And instead of handling the functions in supportsAggregation I should handle them in the rewrite rules?

> That's what I tried initially but it's a bit error-prone since each module has their own rewrites. I can try to move as many to base-jdbc as possible in a preparatory commit.

That's unlikely to work. We have separate rewrites because there were some differences.

> Also, to confirm, you mean that the grouping sets condition can be handled in supportsAggregation like now.

i didn't read that part, so "maybe"

> And instead of handling the functions in supportsAggregation I should handle them in the rewrite rules?

that's my intuition

I implemented the approach you suggested - it works and thankfully the only case-sensitive rewrites live in base-jdbc already.
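A rough sketch of what "handled by the rewrite returning empty or not" could look like. The `Rule` interface here is a simplified, hypothetical stand-in and does not match the real `AggregateFunctionRule` signature; the `textualInput` flag is likewise an assumption made for illustration.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-ins; the real AggregateFunctionRule interface looks different.
final class RewriteRulesSketch {
    interface Rule {
        // Returns the remote SQL expression, or empty if this rule declines the aggregate.
        Optional<String> rewrite(String function, String column, boolean textualInput);
    }

    // count does not compare values, so it is rewritten regardless of the input type.
    static final Rule COUNT = (function, column, textualInput) ->
            function.equals("count") ? Optional.of("count(" + column + ")") : Optional.empty();

    // min/max depend on the remote ordering, so this rule declines textual inputs.
    static final Rule MIN_MAX = (function, column, textualInput) -> {
        boolean handled = function.equals("min") || function.equals("max");
        if (!handled || textualInput) {
            return Optional.empty();
        }
        return Optional.of(function + "(" + column + ")");
    };

    // The first rule that produces an expression wins; empty means no pushdown.
    static Optional<String> rewrite(String function, String column, boolean textualInput) {
        for (Rule rule : List.of(COUNT, MIN_MAX)) {
            Optional<String> result = rule.rewrite(function, column, textualInput);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

Keeping the decision inside each rule means a function-specific restriction does not need a separate flag on the function itself.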

findepi · 2021-07-16T11:20:03Z

plugin/trino-sqlserver/src/main/java/io/trino/plugin/sqlserver/SqlServerClient.java

count(a_varchar) should still be pushed down, as case sensitivity (or more broadly: collations), doesn't impact results.
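To illustrate the point with plain Java collections rather than a real database: `String.CASE_INSENSITIVE_ORDER` stands in for a case-insensitive remote comparison, and natural `String` order stands in for a case-sensitive one. The class and method names are made up for this sketch.

```java
import java.util.Collections;
import java.util.List;
import java.util.Objects;

final class CountVersusMaxSketch {
    // max using Java's natural String order (compares UTF-16 code units).
    static String maxCaseSensitive(List<String> values) {
        return Collections.max(values);
    }

    // max using a case-insensitive comparison, as a case-insensitive remote might do.
    static String maxCaseInsensitive(List<String> values) {
        return Collections.max(values, String.CASE_INSENSITIVE_ORDER);
    }

    // count(column) only counts non-null values; no comparison between values is involved.
    static long countNonNull(List<String> values) {
        return values.stream().filter(Objects::nonNull).count();
    }
}
```

For the input `["a", "B"]` the two `max` variants disagree ("a" versus "B"), while the count is 2 under either comparison.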

plugin/trino-clickhouse/src/main/java/io/trino/plugin/clickhouse/ClickHouseClient.java

hashhar · 2021-07-19T09:06:44Z

PTAL @findepi @wendigo

I can add a session property to restore performance at the user's own discretion - what should we call it? (if we do this then I'll also address #7022 as part of it).

hashhar · 2021-07-22T09:09:02Z

@wendigo @findepi Gentle ping. 🙂

findepi · 2021-07-26T12:01:23Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/expression/ImplementMinMax.java

That's based on a wrong type.

For example, PostgreSQL's money and enum types are mapped to Trino varchar, while both will have different sorting properties.

Since the rewrite rule is generic I think the only solution is to allow connectors to pass a list of jdbcTypeName through the constructor for types which should not be pushed down?

Or maybe add a static function to JdbcClient called isCollationSensitive(JdbcColumnHandle) and let connectors define their own impls? This method already exists in the PostgreSQL client btw and may be useful for others where we can pass explicit collations (MySQL once we drop 5.x).

Yeah, i think we already use "is mapped to char or varchar" as a way to determine if it's potentially case insensitive (or collation-sensitive).
it may catch too much, but shouldn't catch too little, so it's fine.

leave as is

findepi · 2021-07-26T12:02:05Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/expression/ImplementMinMax.java

it's not exactly about case sensitivity. For example, by default PostgreSQL is case-sensitive, but still sorts differently than Trino, so we shouldn't push min/max on varchar.
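A small illustration of "case-sensitive, but sorts differently", using `java.text.Collator` as a stand-in for a locale-aware remote collation and natural `String` order as a stand-in for the engine side. Both stand-ins are assumptions of this sketch; it does not reproduce PostgreSQL's or Trino's actual comparison code.

```java
import java.text.Collator;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class CollationOrderSketch {
    // min using Java's natural String order (compares UTF-16 code units).
    static String minByCodeUnits(List<String> values) {
        return Collections.min(values);
    }

    // min using a locale-aware collation, which orders letters alphabetically first.
    static String minByLocale(List<String> values, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        return Collections.min(values, collator);
    }

    // The locale collation still distinguishes "a" from "A", so it is case-sensitive.
    static boolean distinguishesCase(Locale locale) {
        return Collator.getInstance(locale).compare("a", "A") != 0;
    }
}
```

With `["a", "B"]`, code-unit order picks "B" as the minimum and the `Locale.US` collation picks "a", even though neither comparison treats "a" and "A" as equal.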

BTW do PostgreSQL min/max accept COLLATE?

I think the complaint here is that the variable is named in a misleading way? Maybe isCollationSensitive would be better?

Functions don't accept explicit collations - I couldn't find any examples and the variations I tried led to syntax errors.

> I think the complaint here is that the variable is named in a misleading way

yes

findepi · 2021-07-26T12:03:12Z

plugin/trino-base-jdbc/src/test/java/io/trino/plugin/jdbc/BaseJdbcConnectorTest.java

why change here?

clerk is varchar and part of grouping set so it prevents pushdown after the change. Changed to use a numeric column since we just want to test GROUP BY + TOPN.

findepi · 2021-07-26T12:06:14Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/BaseJdbcClient.java

can this be static?

(that would help understand the state flow)

Yes, should've been static from the start. thanks.

findepi · 2021-07-26T12:07:49Z

plugin/trino-base-jdbc/src/test/java/io/trino/plugin/jdbc/BaseJdbcConnectorTest.java

let's add correctness cases:

- `count(DISTINCT a_string)`
- `count(DISTINCT a_string), count(DISTINCT a_bigint)` (together)

this could help avoid any regressions in #8562 cc @alexjo2144
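One way to see why such correctness cases matter, assuming a case-insensitive remote: the distinct count of a string column changes when values differing only by case are treated as equal. The sketch below uses in-memory sets and made-up names, not the connector test framework.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class CountDistinctSketch {
    // Distinct count where "a" and "A" are different values.
    static int countDistinctCaseSensitive(List<String> values) {
        return new HashSet<>(values).size();
    }

    // Distinct count where values differing only by case collapse into one.
    static int countDistinctCaseInsensitive(List<String> values) {
        Set<String> distinct = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        distinct.addAll(values);
        return distinct.size();
    }
}
```

For `["a", "A", "b"]` the case-sensitive count is 3 and the case-insensitive count is 2, so a test comparing pushed-down and non-pushed-down results on such data would catch a regression.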

findepi · 2021-07-26T12:09:31Z

plugin/trino-base-jdbc/src/main/java/io/trino/plugin/jdbc/BaseJdbcClient.java

remote database may be case-insensitive

nit: Use "Remote database can be case insensitive" words, so that similar places in the code are searchable.

findepi

% earlier feedback

hashhar · 2021-07-27T10:05:46Z

@wendigo PTAL. Applied @findepi 's comments.

wendigo · 2021-07-27T12:24:06Z

LGTM @hashhar - thanks for working on that

hashhar · 2021-08-02T15:44:08Z

Rebasing to make sure nothing broke since the upgrade. Will merge once CI finishes.

Some databases are case-insensitive (MySQL, SQL Server) while others sort textual types differently compared to Trino (PostgreSQL). For such databases, pushdown of aggregation functions when the grouping set includes a textual type can lead to incorrect results, so we prevent aggregation pushdown in such cases. We also prevent pushdown for functions whose results depend on sort order (min/max) when the input is a textual type.

hashhar · 2021-08-02T17:12:06Z

CI hit #8719 and #8432

hashhar added the WIP label Jul 14, 2021

cla-bot bot added the cla-signed label Jul 14, 2021

hashhar force-pushed the hashhar/case-insensitive-aggregation-pushdown branch from 1aadeef to 16800a0 Compare July 16, 2021 09:01

findepi reviewed Jul 16, 2021

hashhar removed the WIP label Jul 19, 2021

hashhar marked this pull request as ready for review July 19, 2021 09:06

hashhar requested review from findepi and wendigo July 19, 2021 09:06

findepi reviewed Jul 26, 2021

findepi approved these changes Jul 27, 2021

hashhar force-pushed the hashhar/case-insensitive-aggregation-pushdown branch from 504c1ae to 9e7dd3c Compare July 27, 2021 09:29

wendigo approved these changes Jul 27, 2021

findepi force-pushed the master branch from 8538e49 to 1f896ea Compare July 30, 2021 22:13

hashhar added 2 commits August 2, 2021 21:14

Provide more information to connectors to control aggregation pushdown

c0bb821

hashhar force-pushed the hashhar/case-insensitive-aggregation-pushdown branch from 9e7dd3c to 6473c84 Compare August 2, 2021 15:44

hashhar merged commit ee57029 into trinodb:master Aug 2, 2021

hashhar added this to the 361 milestone Aug 2, 2021

hashhar deleted the hashhar/case-insensitive-aggregation-pushdown branch August 2, 2021 17:13

This was referenced Aug 3, 2021

Release notes for 361 #8732

Closed

Implement aggregation pushdown in Pinot #6069

Merged

findepi mentioned this pull request Aug 24, 2021

Support pushdown of count aggregation with distinct #8562

Merged
