Skip to content

Use unenforced constraint when building Iceberg stats#16244

Merged
findepi merged 3 commits intotrinodb:masterfrom
alexjo2144:iceberg/stats-filtering
Mar 7, 2023
Merged

Use unenforced constraint when building Iceberg stats#16244
findepi merged 3 commits intotrinodb:masterfrom
alexjo2144:iceberg/stats-filtering

Conversation

@alexjo2144
Copy link
Member

Description

Fixes: #16239

Additional context and related issues

Iceberg files can be pruned using both the enforced constraint and the unenforced constraint when building table statistics.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Improve file pruning when generating Iceberg table statistics. 

@cla-bot cla-bot bot added the cla-signed label Feb 23, 2023
@alexjo2144 alexjo2144 force-pushed the iceberg/stats-filtering branch from 03af0e6 to bf56373 Compare February 24, 2023 14:57
@alexjo2144 alexjo2144 self-assigned this Feb 24, 2023
@alexjo2144 alexjo2144 added the iceberg Iceberg connector label Feb 24, 2023
@alexjo2144 alexjo2144 force-pushed the iceberg/stats-filtering branch 2 times, most recently from a06ba3f to 6c6d3c6 Compare February 27, 2023 20:14
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi @raunaqmorarka is there a good way to tell if these changes are good or not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, just run the benchmarks :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexjo2144 do you have the results?

@alexjo2144 alexjo2144 force-pushed the iceberg/stats-filtering branch from 6c6d3c6 to 3fd7d62 Compare February 28, 2023 18:26
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this TODO (not related to current PR), the engine does provide the columns required for the query (various connectors save this in their ConnectorTableHandle after applyProjection). Do we try to parse all columns here or only the columns accessed by the query ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a 100% sure, but I think this might lead to underestimation by the CBO.
Even though unenforcedConstraint was used to prune statistics here, that filter is going to stay on top of the scan for evaluation. FilterStatsCalculator will estimate it like any other predicate and further reduce the already filtered stats. Please check if this is the case.
cc: @findepi @sopel39

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raunaqmorarka the code here seems correct to me.
Note that the logic here should be -- and is -- aligned with how we select files for reading later on

TupleDomain<IcebergColumnHandle> fullPredicate = tableHandle.getUnenforcedPredicate()
.intersect(pushedDownDynamicFilterPredicate);
// TODO: (https://github.com/trinodb/trino/issues/9743): Consider removing TupleDomain#simplify
TupleDomain<IcebergColumnHandle> simplifiedPredicate = fullPredicate.simplify(ICEBERG_DOMAIN_COMPACTION_THRESHOLD);
boolean usedSimplifiedPredicate = !simplifiedPredicate.equals(fullPredicate);
if (usedSimplifiedPredicate) {
// Pushed down predicate was simplified, always evaluate it against individual splits
this.pushedDownDynamicFilterPredicate = TupleDomain.all();
}
TupleDomain<IcebergColumnHandle> effectivePredicate = dataColumnPredicate
.intersect(simplifiedPredicate);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that the logic here matches with file pruning in splits generation. My concern was about something else.
E.g.: Let's say before this change FilterStatsCalculator would get column stats based on 10 file's stats and uses that to estimate a predicate. After this change FilterStatsCalculator will estimate the same predicate but with column stats based on a subset of file's stats. The domain of the column stats (min/max, ndv) would be narrower this time.
This is still correct though as the row count will also be smaller and we assume uniform distribution of values.
It does change the estimates when actual distribution of values across files is non-uniform, maybe this is why a couple of TPC queries plans also changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your concern. Yes, this may result in different stats eventually calculated by FilterStatsCalculator.
Yet, this is IMO the correct thing to do. If the connector is going to return rows from 10 files, it should indicate row count coming from 100s of files.

Comment on lines 89 to 90
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private TableStatisticsReader()
{ }
private TableStatisticsReader() {}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class became utility class now

  • i have Utility class 'TableStatisticsReader' is not 'final' warning on the class declaration
  • move the constructor declaration as the first entry in the class, since this isn't real constructor anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in first commit we have two variables meaning one same thing ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and in the second commit the enforcedPredicate is ill-named, because it's not actually enforced

  • let's call it effectivePredicate.
  • let's introduce the variable in the second commit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raunaqmorarka the code here seems correct to me.
Note that the logic here should be -- and is -- aligned with how we select files for reading later on

TupleDomain<IcebergColumnHandle> fullPredicate = tableHandle.getUnenforcedPredicate()
.intersect(pushedDownDynamicFilterPredicate);
// TODO: (https://github.com/trinodb/trino/issues/9743): Consider removing TupleDomain#simplify
TupleDomain<IcebergColumnHandle> simplifiedPredicate = fullPredicate.simplify(ICEBERG_DOMAIN_COMPACTION_THRESHOLD);
boolean usedSimplifiedPredicate = !simplifiedPredicate.equals(fullPredicate);
if (usedSimplifiedPredicate) {
// Pushed down predicate was simplified, always evaluate it against individual splits
this.pushedDownDynamicFilterPredicate = TupleDomain.all();
}
TupleDomain<IcebergColumnHandle> effectivePredicate = dataColumnPredicate
.intersect(simplifiedPredicate);

@findepi
Copy link
Member

findepi commented Mar 1, 2023

"Use unenforced constraint when building Iceberg stats" title doesn't tell the whole story, leading to quesations like #16244 (comment). What would be a better title for the change?

"Return more accurate stats from Iceberg when some filters not fully enforced"?

@alexjo2144 alexjo2144 force-pushed the iceberg/stats-filtering branch from 3fd7d62 to c6e9516 Compare March 1, 2023 20:54
@alexjo2144 alexjo2144 requested a review from findepi March 1, 2023 21:19
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's capture some info from #16244 (comment) conversation
like that this matches split gen

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to sum it up in the commit message, let me know if theres anything missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to sum it up in the commit message,

code comments are easier to notice (and serve slightly different purpose)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a code comment as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexjo2144 do you have the results?

@alexjo2144
Copy link
Member Author

Here's q21 of tpch before and after the change. Doesn't seem to have made much of a difference:

Screen Shot 2023-03-02 at 11 06 21 AM

Screen Shot 2023-03-02 at 11 04 15 AM

@alexjo2144
Copy link
Member Author

And q09:
Screen Shot 2023-03-02 at 11 08 48 AM
Screen Shot 2023-03-02 at 11 09 07 AM

The unenforced component of the table predicate is still used to prune
data files when generating Iceberg splits. The same can be done when
scanning the manifest to generate table statistics.
@alexjo2144 alexjo2144 force-pushed the iceberg/stats-filtering branch from c6e9516 to c4c8bff Compare March 2, 2023 17:50
@alexjo2144
Copy link
Member Author

Not sure what's wrong with Maven Checks, but it passed the code checks part.

@alexjo2144 alexjo2144 force-pushed the iceberg/stats-filtering branch from 214e7cc to 614155a Compare March 6, 2023 15:01
@alexjo2144
Copy link
Member Author

@findepi build is green ✔️

@findepi findepi merged commit 43936fb into trinodb:master Mar 7, 2023
@findepi findepi merged commit 43936fb into trinodb:master Mar 7, 2023
@findepi
Copy link
Member

findepi commented Mar 7, 2023

(x) Release notes are required, with the following suggested text:

# Iceberg
* Improve file pruning when generating Iceberg table statistics. 

@colebow this should be reworded sth like "improve query planning performance"

@github-actions github-actions bot added this to the 410 milestone Mar 7, 2023
@alexjo2144 alexjo2144 deleted the iceberg/stats-filtering branch March 10, 2023 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

Table statistics reader ignores filter predicates

3 participants