Skip to content

Unwrap casts in BETWEEN predicate#14452

Closed
findinpath wants to merge 2 commits intotrinodb:masterfrom
findinpath:rewrite-casts-in-between-predicate
Closed

Unwrap casts in BETWEEN predicate#14452
findinpath wants to merge 2 commits intotrinodb:masterfrom
findinpath:rewrite-casts-in-between-predicate

Conversation

@findinpath
Copy link
Copy Markdown
Contributor

@findinpath findinpath commented Oct 4, 2022

Description

This change allows the engine to infer that, for instance, given t::timestamp(6)

cast(t as date) BETWEEN DATE '2022-01-01' AND DATE '2022-01-02'

can be rewritten as

 t BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 23:59:59.999999'

The change applies for the temporal types:

  • date
  • timestamp
  • timestamp with time zone

Range predicate BetweenPredicate can be transformed into a TupleDomain and thus help with predicate pushdown.
Range-based TupleDomain representation is critical for connectors which have min/max-based metadata (like Iceberg manifests lists which play a key role in partition pruning or Iceberg data files), as ranges allow for intersection tests, something that is hard
to do in a generic manner for ConnectorExpression.

This is a spin-off from #14390

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Main
* Improve partition and data pruning when comparing casts with ranges in `BETWEEN` predicate

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering where would be the right place to unify a<=2 AND a>=2 to a=2.

@findinpath findinpath requested a review from findepi October 4, 2022 21:55
@findinpath findinpath force-pushed the rewrite-casts-in-between-predicate branch from 4eab85d to efbd6a0 Compare October 5, 2022 04:49
@findinpath findinpath force-pushed the rewrite-casts-in-between-predicate branch 2 times, most recently from 8d84a05 to 2979c17 Compare October 5, 2022 11:29
@findinpath findinpath requested a review from przemekak October 5, 2022 11:29
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you want to call treeRewriter.defaultRewrite to ensure between's child nodes are processed

you want to call unwrapCast to do something meaningful. Single pass of UnwrapCastInComparison.Visitor should do the job of unwrapping casts, and not require another pass to do its job right

@findinpath findinpath force-pushed the rewrite-casts-in-between-predicate branch from 2979c17 to faf7ccf Compare October 11, 2022 08:17
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like duplicating the logic quite much.

Did you try something like

@Override
public Expression rewriteBetweenPredicate(BetweenPredicate node, Void context, ExpressionTreeRewriter<Void> treeRewriter)
{
    BetweenPredicate expression = (BetweenPredicate) treeRewriter.defaultRewrite((Expression) node, null);

    ComparisonExpression lowBound = new ComparisonExpression(GREATER_THAN_OR_EQUAL, node.getValue(), node.getMin());
    ComparisonExpression highBound = new ComparisonExpression(LESS_THAN_OR_EQUAL, node.getValue(), node.getMax());

    Expression lowBoundUnwrapped = unwrapCast(lowBound);
    Expression highBoundUnwrapped = unwrapCast(highBound);
    if (lowBound.equals(lowBoundUnwrapped) && highBound.equals(highBoundUnwrapped)) {
        return expression;
    }
    ...
}

?
Why it cannot be made to work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #14648 with inspiration from #12797

I stumbled while trying to add the methods:

  • getPreviousValue(Object)
  • getNextValue(Object)

to DateType.
I am assuming that checking the validity of a date needs joda-time expertise which is not available in trino-spi module.

As an alternative we could use custom operators included in DateOperators to cover this need.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stumbled while trying to add the methods:

  • getPreviousValue(Object)
  • getNextValue(Object)

to DateType.

This is now covered by #12797.

@findinpath findinpath force-pushed the rewrite-casts-in-between-predicate branch 2 times, most recently from dcac104 to af873c8 Compare October 12, 2022 09:27
This change allows the engine to infer that, for instance,
given t::timestamp(6)

    cast(t as date) BETWEEN DATE '2022-01-01' AND DATE '2022-01-02'

can be rewritten as

    t BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 23:59:59.999999'

Range predicate BetweenPredicate can be transformed into a `TupleDomain`
and thus help with predicate pushdown.
Range-based `TupleDomain` representation is critical for connectors
which have min/max-based metadata (like Iceberg manifests lists which
play a key role in partition pruning or Iceberg data files), as ranges allow
for intersection tests, something that is hard
to do in a generic manner for `ConnectorExpression`.
@findinpath findinpath force-pushed the rewrite-casts-in-between-predicate branch from af873c8 to 1c3fdaf Compare October 13, 2022 08:02
join (INNER, REPLICATED):
join (INNER, REPLICATED):
join (INNER, REPLICATED):
join (INNER, PARTITIONED):
Copy link
Copy Markdown
Contributor Author

@findinpath findinpath Oct 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that this is a side-effect of the expression rewriting in tpcds q13.sql

Context: ss_net_profit has the type DECIMAL(7,2)

INPUT NODE in `UnwrapCastInComparison

(CAST("ss_net_profit" AS decimal(12, 2)) BETWEEN CAST(DECIMAL '100.00' AS decimal(12, 2)) AND CAST(DECIMAL '200.00' AS decimal(12, 2)))

OUTPUT NODE BEFORE CHANGES

(CAST("ss_net_profit" AS decimal(12, 2)) BETWEEN CAST(DECIMAL '100.00' AS decimal(12, 2)) AND CAST(DECIMAL '200.00' AS decimal(12, 2)))

OUTPUT NODE AFTER CHANGES

("ss_net_profit" BETWEEN CAST(DECIMAL '100.00' AS decimal(7, 2)) AND CAST(DECIMAL '200.00' AS decimal(7, 2)))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable

to be on the safe side, we will want to confirm this change with benchmarks (cc @przemekak )
but i don't think we're ready yet

@findinpath
Copy link
Copy Markdown
Contributor Author

CI hit #13957

@findepi
Copy link
Copy Markdown
Member

findepi commented Oct 13, 2022

trino-main is red

@findinpath
Copy link
Copy Markdown
Contributor Author

Superseded by #14648

@findinpath findinpath closed this Aug 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants