Perform range optimization for BETWEEN predicate on date_trunc and temporal casts#14390

Closed
findinpath wants to merge 4 commits into trinodb:master from findinpath:rewrite-between-for-temporal-casts

@findinpath
Contributor

@findinpath findinpath commented Sep 30, 2022

Description

This change allows the engine to infer that, for instance,
given t::timestamp(6)

    date_trunc('day', t) BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 00:00:00'

or

    cast(t as date) BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 00:00:00'

can be rewritten as

    t BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 23:59:59.999999'
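
To make the day-granularity case concrete, the following is a minimal sketch of the equivalence in plain `java.time`, not the engine's implementation. The class and method names are hypothetical, microsecond precision stands in for `timestamp(6)`, and the lower bound is assumed to be aligned to midnight, as in the example above.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical illustration; none of these names come from Trino.
final class BetweenRangeSketch {
    // Last microsecond of the day that `upper` falls in. This is the inclusive
    // upper bound on t that is equivalent to date_trunc('day', t) <= upper.
    static LocalDateTime endOfDayInclusiveMicros(LocalDateTime upper) {
        return upper.truncatedTo(ChronoUnit.DAYS)
                .plusDays(1)
                .minus(1, ChronoUnit.MICROS);
    }

    // Original form: date_trunc('day', t) BETWEEN low AND high.
    static boolean originalPredicate(LocalDateTime t, LocalDateTime low, LocalDateTime high) {
        LocalDateTime truncated = t.truncatedTo(ChronoUnit.DAYS);
        return !truncated.isBefore(low) && !truncated.isAfter(high);
    }

    // Rewritten form: t BETWEEN low AND <end of high's day>.
    // Assumes `low` is aligned to midnight.
    static boolean rewrittenPredicate(LocalDateTime t, LocalDateTime low, LocalDateTime high) {
        return !t.isBefore(low) && !t.isAfter(endOfDayInclusiveMicros(high));
    }

    public static void main(String[] args) {
        LocalDateTime low = LocalDateTime.of(2022, 1, 1, 0, 0);
        LocalDateTime high = LocalDateTime.of(2022, 1, 2, 0, 0);
        System.out.println(endOfDayInclusiveMicros(high)); // 2022-01-02T23:59:59.999999

        LocalDateTime inside = LocalDateTime.of(2022, 1, 2, 23, 59, 59, 999_999_000);
        LocalDateTime outside = LocalDateTime.of(2022, 1, 3, 0, 0);
        System.out.println(originalPredicate(inside, low, high) == rewrittenPredicate(inside, low, high));
        System.out.println(originalPredicate(outside, low, high) == rewrittenPredicate(outside, low, high));
    }
}
```

The rewritten form references the bare column `t`, which is what lets the predicate be expressed as a range over `t`.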

The change applies for the temporal types:

  • date
  • timestamp
  • timestamp with time zone

A range predicate such as BetweenPredicate can be transformed into a TupleDomain,
which helps with predicate pushdown.
A range-based TupleDomain representation is critical for connectors
that have min/max-based metadata (such as Iceberg manifest lists, which
play a key role in partition pruning, or Iceberg data files), because ranges allow
for intersection tests, something that is hard
to do in a generic manner for a ConnectorExpression.
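
As an illustration of why ranges matter here, the sketch below shows the kind of intersection test that min/max metadata makes possible. It is a simplified stand-in with hypothetical names (`Range`, `canSkip`), not the TupleDomain or Iceberg API, and it treats values as plain `long`s (for example, epoch microseconds).

```java
// Hypothetical illustration; not the Trino SPI or the Iceberg API.
final class MinMaxPruningSketch {
    // Closed interval [low, high]; the values stand for epoch microseconds.
    static final class Range {
        final long low;
        final long high;

        Range(long low, long high) {
            if (low > high) {
                throw new IllegalArgumentException("low must not exceed high");
            }
            this.low = low;
            this.high = high;
        }

        // Two closed intervals overlap when each starts no later than the other ends.
        boolean intersects(Range other) {
            return low <= other.high && other.low <= high;
        }
    }

    // A file (or manifest entry) can be skipped when its min/max statistics
    // cannot contain any value inside the predicate range.
    static boolean canSkip(Range predicate, Range fileMinMax) {
        return !predicate.intersects(fileMinMax);
    }

    public static void main(String[] args) {
        Range predicate = new Range(100, 200);
        System.out.println(canSkip(predicate, new Range(150, 400))); // false: ranges overlap
        System.out.println(canSkip(predicate, new Range(201, 400))); // true: file starts after the range ends
    }
}
```

A predicate that still wraps the column in `date_trunc` or a cast has no such interval form, which is why the rewrite to a range on the bare column helps pruning.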

Fixes #14293

Non-technical explanation

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Main
* Improve partition and data pruning when comparing temporal casts with ranges

@cla-bot cla-bot bot added the cla-signed label Sep 30, 2022
@findinpath findinpath changed the title Rewrite temporal casts comparation on ranges Perform range optimization for BETWEEN predicate on date_trunc and temporal casts Sep 30, 2022
@findinpath findinpath force-pushed the rewrite-between-for-temporal-casts branch 2 times, most recently from c01d17e to 4226391 Compare September 30, 2022 14:58
@findinpath findinpath marked this pull request as ready for review September 30, 2022 15:00
@findinpath findinpath force-pushed the rewrite-between-for-temporal-casts branch 3 times, most recently from a487c99 to aa7abc0 Compare October 2, 2022 07:19
@findinpath
Contributor Author

CI hit #11140

@findinpath findinpath requested review from findepi and martint October 2, 2022 07:21
@findinpath findinpath force-pushed the rewrite-between-for-temporal-casts branch from aa7abc0 to 5317143 Compare October 2, 2022 07:23
…expression

This change allows the engine to infer that, for instance,
given t::timestamp(6)

    date_trunc('day', t) BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 00:00:00'

can be rewritten as

    t BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 23:59:59.999999'

The change applies for the temporal types:
- date
- timestamp
- timestamp with time zone

A range predicate such as `BetweenPredicate` can be transformed into a `TupleDomain`,
which helps with predicate pushdown.
A range-based `TupleDomain` representation is critical for connectors
that have min/max-based metadata (such as Iceberg manifest lists, which
play a key role in partition pruning, or Iceberg data files), because ranges allow
for intersection tests, something that is hard
to do in a generic manner for a `ConnectorExpression`.

This change allows the engine to infer that, for instance,
given t::timestamp(6)

    cast(t as date) BETWEEN DATE '2022-01-01' AND DATE '2022-01-02'

can be rewritten as

    t BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 23:59:59.999999'

The change applies for the temporal types:
- date
- timestamp
- timestamp with time zone

A range predicate such as `BetweenPredicate` can be transformed into a `TupleDomain`,
which helps with predicate pushdown.
A range-based `TupleDomain` representation is critical for connectors
that have min/max-based metadata (such as Iceberg manifest lists, which
play a key role in partition pruning, or Iceberg data files), because ranges allow
for intersection tests, something that is hard
to do in a generic manner for a `ConnectorExpression`.
@findinpath findinpath force-pushed the rewrite-between-for-temporal-casts branch from 5317143 to a7ab471 Compare October 3, 2022 08:14
  verify(longTimestamp.getPicosOfMicro() == 0, "Unexpected picos in %s, value not rounded to %s", rangeStart, rangeUnit);
- long endInclusiveMicros = (long) calculateRangeEndInclusive(longTimestamp.getEpochMicros(), createTimestampType(6), rangeUnit);
- return new LongTimestamp(endInclusiveMicros, toIntExact(PICOSECONDS_PER_MICROSECOND - scaleFactor(timestampType.getPrecision(), 12)));
+ long endInclusiveMicros = (long) calculateRangeEndInclusive(longTimestamp.getEpochMicros(), createTimestampType(TimestampType.MAX_SHORT_PRECISION), rangeUnit);
Member

the variable name is "endInclusiveMicros"
the code used 6 and it's known that 10^(-6)s is a microsecond.

after the change the code uses TimestampType.MAX_SHORT_PRECISION. it's not obvious that it's correct (is short precision actually microseconds?). Thus, actually this change decreases readability

- long endInclusiveMicros = (long) calculateRangeEndInclusive(longTimestamp.getEpochMicros(), createTimestampType(6), rangeUnit);
- return new LongTimestamp(endInclusiveMicros, toIntExact(PICOSECONDS_PER_MICROSECOND - scaleFactor(timestampType.getPrecision(), 12)));
+ long endInclusiveMicros = (long) calculateRangeEndInclusive(longTimestamp.getEpochMicros(), createTimestampType(TimestampType.MAX_SHORT_PRECISION), rangeUnit);
+ return new LongTimestamp(endInclusiveMicros, toIntExact(PICOSECONDS_PER_MICROSECOND - scaleFactor(timestampType.getPrecision(), TimestampType.MAX_PRECISION)));
Member

similar here. the use of PICOSECONDS_PER_MICROSECOND mandates that we know we're dealing with picoseconds, i.e. 10^(-12)s, so it matched the corresponding 12 on this line

after the change, we invoke "max precision" constant, but we still rely on it having an actual value of 12
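
For readers following the arithmetic in the two comments above, here is a small self-contained sketch of how the literals 6 and 12 relate to microseconds and picoseconds. The constants and the `scaleFactor` helper are local stand-ins written for this illustration, under the assumption that `scaleFactor(from, to)` means 10^(to - from). They are not copied from Trino.

```java
// Hypothetical illustration; constants and helpers are local stand-ins.
final class LongTimestampFractionSketch {
    static final int MICROSECOND_PRECISION = 6;   // 10^(-6) s
    static final int PICOSECOND_PRECISION = 12;   // 10^(-12) s
    static final int PICOSECONDS_PER_MICROSECOND = 1_000_000;

    // Assumed meaning: how many units of toPrecision fit in one unit of fromPrecision.
    static long scaleFactor(int fromPrecision, int toPrecision) {
        if (fromPrecision > toPrecision) {
            throw new IllegalArgumentException("fromPrecision must not exceed toPrecision");
        }
        long result = 1;
        for (int i = fromPrecision; i < toPrecision; i++) {
            result *= 10;
        }
        return result;
    }

    // Largest picos-of-microsecond value representable at the given precision,
    // for precisions above microseconds (7 to 12).
    static int maxPicosOfMicro(int precision) {
        if (precision <= MICROSECOND_PRECISION || precision > PICOSECOND_PRECISION) {
            throw new IllegalArgumentException("precision must be between 7 and 12");
        }
        return Math.toIntExact(PICOSECONDS_PER_MICROSECOND - scaleFactor(precision, PICOSECOND_PRECISION));
    }

    public static void main(String[] args) {
        System.out.println(maxPicosOfMicro(9));  // 999000: one nanosecond below the next microsecond
        System.out.println(maxPicosOfMicro(12)); // 999999: one picosecond below the next microsecond
    }
}
```

The subtraction only makes sense when the second `scaleFactor` argument is 12, because `PICOSECONDS_PER_MICROSECOND` is itself 10^(12 - 6). That is the coupling the comment points out.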

@findepi
Member

findepi commented Oct 3, 2022

@findinpath let's have unwrapping of CASTs and date_trunc as separate PRs.
I'd like to focus on casts first.

@findinpath
Contributor Author

Continuing the work on #14451 and #14452

Development

Successfully merging this pull request may close these issues.

date_trunc range optimization should apply also for BETWEEN predicate