Skip to content

Remove unsound varchar->char implicit coercion#10499

Closed
ebyhr wants to merge 1 commit intotrinodb:masterfrom
ebyhr:ebi/concat-varchar-v2
Closed

Remove unsound varchar->char implicit coercion#10499
ebyhr wants to merge 1 commit intotrinodb:masterfrom
ebyhr:ebi/concat-varchar-v2

Conversation

@ebyhr
Copy link
Copy Markdown
Member

@ebyhr ebyhr commented Jan 7, 2022

cc: @martint

@cla-bot cla-bot bot added the cla-signed label Jan 7, 2022
@ebyhr ebyhr force-pushed the ebi/concat-varchar-v2 branch from 0984145 to 87283c6 Compare January 7, 2022 11:04
A conversion from varchar to char adds padding, so it's unsafe to
do it automatically. For example, given a table t with a column v
of type varchar(100), and function that performs the following operation:

    function foo(x::char(100)) := '<' || x || '>'

A query such as this would produce surprising results:

    SELECT foo(v) from t

It would be more reasonable to fail the query due to a type mismatch.

Moreover, the implicit coercion between varchar and char is lossy.
The way it's currently defined, if the string is larger than the
max allowed size for the char type, it will be silently truncated.

Part of the reason this worked like it does was because of lack of
store assignment conversions during insert. That feature is now supported
by the engine, so we can change the coercion rules to fix the surprising
cases without losing that functionality.
@ebyhr ebyhr force-pushed the ebi/concat-varchar-v2 branch from 87283c6 to 217514b Compare January 8, 2022 00:14
@findepi
Copy link
Copy Markdown
Member

findepi commented Jan 10, 2022

AFAIR, the tpch/tpcds template queries need to be updated.
cc @sopel39

Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is NOGO as long as tpcds queries need to be changed (when comparing char table column with string literal).
We need to fix character string comparison too. One way to do it would be to have a cast dedicated for comparisons.
Some old doc about it https://docs.google.com/document/d/1UngGirjV_I7nPnIGU6G7SDS0cY8j7N_Jb3Wk3EoHChc/edit?usp=sharing

@findepi
Copy link
Copy Markdown
Member

findepi commented Jan 10, 2022

@sopel39 i think the conclusion was that we can fix char/varchar coercion direction only if we fix type of string literals.

@ebyhr you need to change eg 'abc' to be of type char(3) (today it's varchar(3)). cc @martint @kasiafi

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Jan 10, 2022

@sopel39 i think the conclusion was that we can fix char/varchar coercion direction only if we fix type of string literals.

wdym? We should have proper mechanism for character type comparisons in place before we flip coercion from varchar -> char

@martint
Copy link
Copy Markdown
Member

martint commented Jan 10, 2022

If I recall correctly, to be able to reverse the order of char<->varchar implicit casts we first need to:

  • treat string literals as char(n)
  • be able to encode literals of type varchar(n) in the IR. Otherwise, PPD and other optimizations will break. For instance varchar_1_column = ‘X’ would get planned as varchar_1_column = cast(‘X’ as varchar(1)), as there's currently no way to represent a literal value of type varchar(n) in the IR.

@findepi
Copy link
Copy Markdown
Member

findepi commented Jan 11, 2022

  • Otherwise, PPD and other optimizations will break. For instance varchar_1_column = ‘X’ would get planned as varchar_1_column = cast(‘X’ as varchar(1)), as there's currently no way to represent a literal value of type varchar(n) in the IR.

That's interesting. There isn't an assumption that LiteralEncoder produces an actual literal expression, and optimizations should not -- and they strive not -- depend on it.
Do you know which particular optimization needs a fix here?

BTW, this problem is preexisting. For instance varchar_2_column = 'X' today gets planned as varchar_2_column = cast('X' AS varchar(2)), as there's currently no direct way to represent a literal value of type varchar(n) but length shorter than n in the IR.

@kasiafi
Copy link
Copy Markdown
Member

kasiafi commented Jan 17, 2022

Do you know which particular optimization needs a fix here?

I've seen examples of code that tries to handle both "bare" literals, and literals wrapped in Cast. E.g. CanonicalizeExpressionRewriter. Also, UnwrapCasts wouldn't break, because it evaluates the expression.

However, some sites would be affected, e.g. PPD or FilterStatsCalculator.

@findepi
Copy link
Copy Markdown
Member

findepi commented Jan 17, 2022

However, some sites would be affected, e.g. PPD or FilterStatsCalculator.

for FilterStatsCalculator this is supposed to cover the problem

// value is a constant
return SymbolStatsEstimate.builder()
.setNullsFraction(0)
.setDistinctValuesCount(1)
.build();

for PPD -- would be good to be more explicit about problems' addresses, as the class is too long to read at once

@kasiafi
Copy link
Copy Markdown
Member

kasiafi commented Jan 17, 2022

for PPD -- would be good to be more explicit about problems' addresses, as the class is too long to read at once

I meant the part where the predicate is pushed through a ProjectNode, and the conjuncts to push are extracted: the isInliningCandidate method. The conjunct will only be pushed if a corresponding projection assignment is based on a Literal or a SymbolReference. If we change the direction of coercions, we can get a Cast where we used to have a Symbol. However, it's a symmetric situation. Sometimes we'll get a Symbol where we used to have a Cast. I think we should consider which direction of the char <-> varchar coercion is going to produce less casts on non-constants.

@kasiafi
Copy link
Copy Markdown
Member

kasiafi commented Jan 17, 2022

for FilterStatsCalculator this is supposed to cover the problem

I meant FilterStatsCalculator, not ScalarStatsCalculator:

@findepi
Copy link
Copy Markdown
Member

findepi commented Jan 18, 2022

I meant FilterStatsCalculator, not ScalarStatsCalculator:

FilterStatsCalculator relies on ScalarStatsCalculator, that's why l linked a hopefully-relevant piece of ScalarStatsCalculator.

I meant the part where the predicate is pushed through a ProjectNode, and the conjuncts to push are extracted: the isInliningCandidate method. The conjunct will only be pushed if a corresponding projection assignment is based on a Literal or a SymbolReference.

Thanks!
Sounds like pre-existing problem though, i.e. candidate for improvement that we can make today, and we may need to do prior to landing the coercion direction swap.

@findepi
Copy link
Copy Markdown
Member

findepi commented Jan 18, 2022

Sounds like pre-existing problem though, i.e. candidate for improvement that we can make today, and we may need to do prior to landing the coercion direction swap.

#10663

@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Mar 30, 2022

Let me close because there are blockers (e.g. "treat string literals as char(n)"). Thanks for reviewing the PR.

@ebyhr ebyhr closed this Mar 30, 2022
@ebyhr ebyhr deleted the ebi/concat-varchar-v2 branch March 30, 2022 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants