Fix incorrect results for aggregation functions on case-sensitive types #8551

Merged
hashhar merged 2 commits into trinodb:master from hashhar:hashhar/case-insensitive-aggregation-pushdown
Aug 2, 2021

Conversation

Member

hashhar commented Jul 14, 2021

Pushdown of aggregation functions to case-insensitive databases (e.g. MySQL) on case-sensitive inputs (e.g. VARCHAR/CHAR) can lead to incorrect results.

This change disables aggregation pushdown when:

  • Any grouping set includes a column of a case-sensitive type
  • The aggregation function is case-sensitive (e.g. max/min but not count) and any of its inputs is a case-sensitive type

It also introduces a toggle to restore previous incorrect behaviour.

Fixes #7320
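
To illustrate the problem described above, here is a minimal sketch in plain Java. It is not Trino code: `CaseSensitivityDemo` and its methods are hypothetical names, and `String.CASE_INSENSITIVE_ORDER` merely stands in for a case-insensitive remote database. The same input yields a different `max` and a different number of groups depending on which ordering is applied.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from Trino.
final class CaseSensitivityDemo
{
    // max() evaluated under a given string ordering
    static String max(List<String> values, Comparator<String> order)
    {
        return values.stream().max(order).orElseThrow();
    }

    // GROUP BY value with count(*), where key equality follows the given ordering
    static Map<String, Long> groupCounts(List<String> values, Comparator<String> order)
    {
        Map<String, Long> counts = new TreeMap<>(order);
        for (String value : values) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args)
    {
        List<String> names = List.of("apple", "Apple", "BANANA");

        // Case-sensitive ordering: lowercase letters sort after uppercase ones
        System.out.println(max(names, Comparator.naturalOrder())); // apple
        // Case-insensitive ordering (stand-in for the remote database)
        System.out.println(max(names, String.CASE_INSENSITIVE_ORDER)); // BANANA

        // Number of groups differs as well: 3 versus 2
        System.out.println(groupCounts(names, Comparator.naturalOrder()).size());
        System.out.println(groupCounts(names, String.CASE_INSENSITIVE_ORDER).size());
    }
}
```

If the aggregation were pushed down, the engine would receive the second answer in each pair while a locally evaluated query would produce the first.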

hashhar added the WIP label Jul 14, 2021
cla-bot added the cla-signed label Jul 14, 2021
hashhar force-pushed the hashhar/case-insensitive-aggregation-pushdown branch from 1aadeef to 16800a0 July 16, 2021 09:01
Member

If it's about sorting (as we believe it is), the limitation doesn't apply to GROUPING SETs.
Also, it applies to some aggregation functions (min, max) but not to others (count, count(DISTINCT)).

Member Author

Yes. The difference between aggregation functions came up before, but I couldn't find a nice way to model it (other than introducing a field in AggregateFunction and having each function declare whether it depends on case sensitivity for correctness).

For the grouping set - thanks for catching. Postgres indeed doesn't need the grouping set check - only the aggregate function check.

Member

Re how to model - since this is function specific, it could be handled by io.trino.plugin.jdbc.expression.AggregateFunctionRule#rewrite returning empty or not.

Member Author

That's what I tried initially but it's a bit error-prone since each module has its own rewrites. I can try to move as many to base-jdbc as possible in a preparatory commit.

Member Author

Also, to confirm, you mean that the grouping sets condition can be handled in supportsAggregation like now.
And instead of handling the functions in supportsAggregation I should handle them in the rewrite rules?

Member

> That's what I tried initially but it's a bit error-prone since each module has its own rewrites. I can try to move as many to base-jdbc as possible in a preparatory commit.

That's unlikely to work. We have separate rewrites because there were some differences.

> Also, to confirm, you mean that the grouping sets condition can be handled in supportsAggregation like now.

i didn't read that part, so "maybe"

> And instead of handling the functions in supportsAggregation I should handle them in the rewrite rules?

that's my intuition

Member Author

I implemented the approach you suggested - it works and thankfully the only case-sensitive rewrites live in base-jdbc already.
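
For readers following along, the approach discussed in this thread can be sketched roughly as below. This is a simplified, hypothetical model and not the code from this PR: `PushdownSketch`, `ColumnType`, and this `rewrite` signature are invented for illustration and do not match the real `AggregateFunctionRule` interface. The idea is only that a rule for an order-dependent function returns empty when its input is textual, so the engine keeps that aggregation.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical, simplified model; these types are NOT Trino's actual SPI.
final class PushdownSketch
{
    enum ColumnType { BIGINT, DOUBLE, CHAR, VARCHAR }

    // Functions whose result depends on how the remote database orders text
    private static final Set<String> ORDER_DEPENDENT = Set.of("min", "max");

    // Conservative check: anything mapped to char/varchar is treated as textual
    static boolean isTextual(ColumnType type)
    {
        return type == ColumnType.CHAR || type == ColumnType.VARCHAR;
    }

    // Returns the remote SQL expression, or empty to keep the aggregation in the engine
    static Optional<String> rewrite(String function, String column, ColumnType type)
    {
        if (ORDER_DEPENDENT.contains(function) && isTextual(type)) {
            return Optional.empty();
        }
        return Optional.of(function + "(" + column + ")");
    }

    public static void main(String[] args)
    {
        System.out.println(rewrite("max", "name", ColumnType.VARCHAR));   // Optional.empty
        System.out.println(rewrite("count", "name", ColumnType.VARCHAR)); // Optional[count(name)]
        System.out.println(rewrite("max", "price", ColumnType.DOUBLE));   // Optional[max(price)]
    }
}
```

Keeping the decision inside each rule means functions such as count remain eligible for pushdown on textual inputs without any extra bookkeeping.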

Member

count(a_varchar) should still be pushed down, as case sensitivity (or more broadly, collation) doesn't impact its result.

hashhar removed the WIP label Jul 19, 2021
Member Author

hashhar commented Jul 19, 2021

PTAL @findepi @wendigo

I can add a session property to restore performance at the user's own discretion - what should we call it? (if we do this then I'll also address #7022 as part of it).

hashhar marked this pull request as ready for review July 19, 2021 09:06
hashhar requested review from findepi and wendigo July 19, 2021 09:06
Member Author

hashhar commented Jul 22, 2021

@wendigo @findepi Gentle ping. 🙂

Member

That's based on a wrong type.

For example, PostgreSQL's money and enum types are both mapped to Trino varchar, yet each has different sorting properties.

Member Author

Since the rewrite rule is generic I think the only solution is to allow connectors to pass a list of jdbcTypeName through the constructor for types which should not be pushed down?

Or maybe add a static function to JdbcClient called isCollationSensitive(JdbcColumnHandle) and let connectors define their own impls? This method already exists in the PostgreSQL client btw and may be useful for others where we can pass explicit collations (MySQL once we drop 5.x).

Member

findepi, Jul 27, 2021

Yeah, i think we already use "is mapped to char or varchar" as a way to determine if it's potentially case insensitive (or collation-sensitive).
it may catch too much, but shouldn't catch too little, so it's fine.

leave as is

Member

it's not exactly about case sensitivity. For example, by default PostgreSQL is case-sensitive, but still sorts differently than Trino, so we shouldn't push min/max on varchar.

BTW, do PostgreSQL min/max accept COLLATE?
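
The point that a case-sensitive ordering can still disagree with the engine's ordering can be illustrated with a small Java sketch. The assumptions are loud ones: `CollationDemo` is a hypothetical name, `java.text.Collator` for `Locale.US` only approximates a locale-aware database collation, and Java's natural `String` order stands in for a code-point style comparison on the engine side.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; Collator only approximates a database collation.
final class CollationDemo
{
    // min() evaluated under a given string ordering
    static String min(List<String> values, Comparator<String> order)
    {
        return values.stream().min(order).orElseThrow();
    }

    // Locale-aware ordering: distinguishes case, but sorts "a" before "B"
    static Comparator<String> localeOrder()
    {
        Collator collator = Collator.getInstance(Locale.US);
        return collator::compare;
    }

    public static void main(String[] args)
    {
        List<String> values = List.of("a", "B");

        // Code-unit ordering: 'B' (66) sorts before 'a' (97)
        System.out.println(min(values, Comparator.naturalOrder())); // B
        // Locale-aware ordering: letters compare alphabetically first
        System.out.println(min(values, localeOrder())); // a
    }
}
```

Both orderings treat "a" and "A" as distinct values, yet they disagree on min, which is why the check is better described in terms of collation than case sensitivity.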

Member Author

I think the complaint here is that the variable is named in a misleading way? Maybe isCollationSensitive would be better?

Functions don't accept explicit collations - I couldn't find any examples and the variations I tried led to syntax errors.

Member

> I think the complaint here is that the variable is named in a misleading way

yes

Member

why change here?

Member Author

clerk is varchar and part of grouping set so it prevents pushdown after the change. Changed to use a numeric column since we just want to test GROUP BY + TOPN.

Member

can this be static?

(that would help understand the state flow)

Member Author

Yes, should've been static from the start. thanks.

Member

let's add correctness cases

  • count(DISTINCT a_string)
  • count(DISTINCT a_string), count(DISTINCT a_bigint) (together)

this could help avoid any regressions in #8562 cc @alexjo2144
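
As a rough illustration of why such correctness cases are worth having, the sketch below counts distinct strings under two notions of equality. It is plain Java with hypothetical names (`CountDistinctDemo` and its methods); the case-insensitive set merely stands in for a remote database whose default collation ignores case.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the PR's test suite.
final class CountDistinctDemo
{
    // Distinct count where "a" and "A" are different values
    static int countDistinctCaseSensitive(List<String> values)
    {
        return new HashSet<>(values).size();
    }

    // Distinct count where "a" and "A" collapse into one value
    static int countDistinctCaseInsensitive(List<String> values)
    {
        Set<String> distinct = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        distinct.addAll(values);
        return distinct.size();
    }

    public static void main(String[] args)
    {
        List<String> values = List.of("a", "A", "b");
        System.out.println(countDistinctCaseSensitive(values));   // 3
        System.out.println(countDistinctCaseInsensitive(values)); // 2
    }
}
```

A test comparing the pushed-down result with the locally computed one on mixed-case data would catch this kind of mismatch.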

Member

remote database may be case-insensitive

nit: Use "Remote database can be case insensitive" words, so that similar places in the code are searchable.

Member

findepi left a comment

% earlier feedback

hashhar force-pushed the hashhar/case-insensitive-aggregation-pushdown branch from 504c1ae to 9e7dd3c July 27, 2021 09:29
Member Author

hashhar commented Jul 27, 2021

@wendigo PTAL. Applied @findepi 's comments.

Contributor

wendigo commented Jul 27, 2021

LGTM @hashhar - thanks for working on that

Member Author

hashhar commented Aug 2, 2021

Rebasing to make sure nothing broke since the upgrade. Will merge once CI finishes.

hashhar added 2 commits August 2, 2021 21:14
Some databases are case-insensitive (MySQL, SQL Server) while others
sort textual types differently compared to Trino (PostgreSQL). For such
databases pushdown of aggregation functions when the grouping set
includes a textual type can lead to incorrect results. So we prevent
aggregation pushdown for such cases.
We also prevent pushdown for functions whose results depend on sort
order (min/max) when the input is a textual type.
hashhar force-pushed the hashhar/case-insensitive-aggregation-pushdown branch from 9e7dd3c to 6473c84 August 2, 2021 15:44
Member Author

hashhar commented Aug 2, 2021

CI hit #8719 and #8432

hashhar merged commit ee57029 into trinodb:master Aug 2, 2021
hashhar added this to the 361 milestone Aug 2, 2021
hashhar deleted the hashhar/case-insensitive-aggregation-pushdown branch August 2, 2021 17:13

Development

Successfully merging this pull request may close these issues.

Incorrect aggregation pushdown for case insensitive columns in JDBC some connectors

3 participants