Use unenforced constraint when building Iceberg stats by alexjo2144 · Pull Request #16244 · trinodb/trino

alexjo2144 · 2023-02-23T19:10:47Z

Description

Additional context and related issues

Iceberg files can be pruned using both the enforced constraint and the unenforced constraint when building table statistics.
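
A minimal sketch of the idea, assuming Java 17. The `Range`, `FileStats`, and `estimateRowCount` names are hypothetical stand-ins for illustration; the real code works with `TupleDomain<IcebergColumnHandle>` and Iceberg manifest entries, not these simplified types. It intersects the enforced and unenforced constraints and counts rows only from files whose min/max range can match:

```java
import java.util.List;

// Hypothetical stand-ins for illustration only; not the Trino or Iceberg API.
final class StatsPruningSketch
{
    private StatsPruningSketch() {}

    // Closed interval [low, high] on a single long-valued column.
    record Range(long low, long high)
    {
        Range intersect(Range other)
        {
            return new Range(Math.max(low, other.low), Math.min(high, other.high));
        }

        // True when this (non-empty) range shares at least one value with [min, max].
        boolean overlaps(long min, long max)
        {
            return low <= high && low <= max && min <= high;
        }
    }

    // Per-file metadata as a manifest might record it: column min/max and row count.
    record FileStats(long min, long max, long rowCount) {}

    // Sum row counts over the files that the combined predicate cannot rule out.
    static long estimateRowCount(List<FileStats> files, Range enforced, Range unenforced)
    {
        Range effective = enforced.intersect(unenforced);
        return files.stream()
                .filter(file -> effective.overlaps(file.min(), file.max()))
                .mapToLong(FileStats::rowCount)
                .sum();
    }

    public static void main(String[] args)
    {
        List<FileStats> files = List.of(
                new FileStats(0, 99, 100),
                new FileStats(100, 199, 100),
                new FileStats(200, 299, 100));
        Range all = new Range(Long.MIN_VALUE, Long.MAX_VALUE);
        System.out.println(estimateRowCount(files, all, all)); // prints 300
        System.out.println(estimateRowCount(files, all, new Range(150, 250))); // prints 200
    }
}
```

With only the enforced constraint (here unconstrained) all three files count; adding the unenforced range drops the first file from the statistics.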

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Improve file pruning when generating Iceberg table statistics.

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java

alexjo2144 · 2023-02-27T20:21:29Z

...g/trino-tests/src/test/resources/sql/presto/tpcds/iceberg/parquet/unpartitioned/q09.plan.txt

@findepi @raunaqmorarka is there a good way to tell if these changes are good or not?

Sure, just run the benchmarks :)

@alexjo2144 do you have the results?

raunaqmorarka · 2023-03-01T04:35:56Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java

Regarding this TODO (not related to the current PR): the engine does provide the columns required by the query (various connectors save them in their ConnectorTableHandle after applyProjection). Do we try to parse all columns here, or only the columns accessed by the query?

raunaqmorarka · 2023-03-01T04:51:35Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java

I'm not 100% sure, but I think this might lead to underestimation by the CBO.
Even though unenforcedConstraint was used to prune statistics here, that filter is going to stay on top of the scan for evaluation. FilterStatsCalculator will estimate it like any other predicate and further reduce the already filtered stats. Please check whether this is the case.
cc: @findepi @sopel39

@raunaqmorarka the code here seems correct to me.
Note that the logic here should be -- and is -- aligned with how we select files for reading later on

trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java

Lines 167 to 178 in 75bc783

    TupleDomain<IcebergColumnHandle> fullPredicate = tableHandle.getUnenforcedPredicate()
            .intersect(pushedDownDynamicFilterPredicate);
    // TODO: (https://github.com/trinodb/trino/issues/9743): Consider removing TupleDomain#simplify
    TupleDomain<IcebergColumnHandle> simplifiedPredicate = fullPredicate.simplify(ICEBERG_DOMAIN_COMPACTION_THRESHOLD);
    boolean usedSimplifiedPredicate = !simplifiedPredicate.equals(fullPredicate);
    if (usedSimplifiedPredicate) {
        // Pushed down predicate was simplified, always evaluate it against individual splits
        this.pushedDownDynamicFilterPredicate = TupleDomain.all();
    }
    TupleDomain<IcebergColumnHandle> effectivePredicate = dataColumnPredicate
            .intersect(simplifiedPredicate);

I get that the logic here matches file pruning in split generation. My concern was about something else.
E.g.: let's say that before this change FilterStatsCalculator would get column stats based on 10 files' stats and use them to estimate a predicate. After this change FilterStatsCalculator will estimate the same predicate, but with column stats based on a subset of those files' stats. The domain of the column stats (min/max, NDV) would be narrower this time.
This is still correct, though, as the row count will also be smaller and we assume a uniform distribution of values.
It does change the estimates when the actual distribution of values across files is non-uniform; maybe this is why a couple of TPC query plans also changed.

I see your concern. Yes, this may result in different stats eventually being calculated by FilterStatsCalculator.
Yet, this is IMO the correct thing to do. If the connector is going to return rows from 10 files, it should not report a row count coming from 100s of files.
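
The point debated above can be checked with a small worked example. This is a simplified model assuming Java 17, not Trino's FilterStatsCalculator: `FileStats`, `ColumnStats`, `prune`, `aggregate`, `estimateFilteredRows`, and `tenFiles` are hypothetical names, and the estimate is a plain uniform-distribution range fraction. It compares the filter estimate computed from all files against the estimate computed from the pruned subset, first for evenly sized files and then for a skewed layout:

```java
import java.util.List;
import java.util.stream.IntStream;

// Simplified model for illustration; not Trino's FilterStatsCalculator.
final class FilterEstimateSketch
{
    private FilterEstimateSketch() {}

    record FileStats(double min, double max, double rowCount) {}

    record ColumnStats(double min, double max, double rowCount) {}

    // Keep only files whose [min, max] range can contain values in [low, high].
    static List<FileStats> prune(List<FileStats> files, double low, double high)
    {
        return files.stream()
                .filter(file -> file.max() >= low && file.min() <= high)
                .toList();
    }

    // Combine per-file stats into table-level stats for the column.
    static ColumnStats aggregate(List<FileStats> files)
    {
        double min = files.stream().mapToDouble(FileStats::min).min().orElseThrow();
        double max = files.stream().mapToDouble(FileStats::max).max().orElseThrow();
        double rows = files.stream().mapToDouble(FileStats::rowCount).sum();
        return new ColumnStats(min, max, rows);
    }

    // Rows expected to match low <= x <= high, assuming values are uniform over [min, max].
    static double estimateFilteredRows(ColumnStats stats, double low, double high)
    {
        double overlap = Math.min(high, stats.max()) - Math.max(low, stats.min());
        if (overlap <= 0) {
            return 0;
        }
        return stats.rowCount() * overlap / (stats.max() - stats.min());
    }

    // Ten files; file i covers [100 * i, 100 * (i + 1)]. File 1 holds rowsInFileOne rows, the rest 100.
    static List<FileStats> tenFiles(double rowsInFileOne)
    {
        return IntStream.range(0, 10)
                .mapToObj(i -> new FileStats(100.0 * i, 100.0 * (i + 1), i == 1 ? rowsInFileOne : 100))
                .toList();
    }

    public static void main(String[] args)
    {
        for (double rowsInFileOne : new double[] {100, 900}) {
            List<FileStats> files = tenFiles(rowsInFileOne);
            double before = estimateFilteredRows(aggregate(files), 150, 250);
            double after = estimateFilteredRows(aggregate(prune(files, 150, 250)), 150, 250);
            System.out.println(before + " -> " + after);
        }
        // prints "100.0 -> 100.0" (uniform files) and then "180.0 -> 500.0" (skewed files)
    }
}
```

In the uniform case the narrower domain and the smaller row count cancel out, so the estimate is unchanged. In the skewed case the pruned stats give a different estimate, which is consistent with the suggestion that non-uniform data may explain the changed TPC plans.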

findepi · 2023-03-01T14:26:35Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java

Suggested change

- private TableStatisticsReader()
- { }
+ private TableStatisticsReader() {}

The class is now a utility class.

I get a "Utility class 'TableStatisticsReader' is not 'final'" warning on the class declaration.

Move the constructor declaration to be the first entry in the class, since this isn't a real constructor anymore.
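
The shape being asked for can be sketched as follows. `StatsUtilSketch` and `totalRows` are hypothetical names used only for illustration; this is not the actual TableStatisticsReader code:

```java
// Hypothetical utility class illustrating the review feedback.
final class StatsUtilSketch // 'final': a utility class is not meant to be subclassed
{
    // Private constructor declared first: it only prevents instantiation.
    private StatsUtilSketch() {}

    // All members are static, so callers never need an instance.
    static long totalRows(long... fileRowCounts)
    {
        long total = 0;
        for (long rowCount : fileRowCounts) {
            total += rowCount;
        }
        return total;
    }
}
```

Callers use it as `StatsUtilSketch.totalRows(100, 200)` without creating an instance.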

findepi · 2023-03-01T14:30:43Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java

In the first commit we have two variables meaning the same thing ...

findepi · 2023-03-01T14:32:10Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java

... and in the second commit enforcedPredicate is ill-named, because it's not actually enforced.

Let's call it effectivePredicate.

Let's introduce the variable in the second commit.

findepi · 2023-03-01T14:35:54Z

The title "Use unenforced constraint when building Iceberg stats" doesn't tell the whole story, leading to questions like #16244 (comment). What would be a better title for the change?

"Return more accurate stats from Iceberg when some filters not fully enforced"?

findepi · 2023-03-02T07:40:38Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TableStatisticsReader.java

Let's capture some info from the #16244 (comment) conversation, e.g. that this matches split generation.

I tried to sum it up in the commit message, let me know if there's anything missing.

I tried to sum it up in the commit message,

code comments are easier to notice (and serve slightly different purpose)

Added a code comment as well

alexjo2144 · 2023-03-02T16:06:51Z

Here's q21 of tpch before and after the change. Doesn't seem to have made much of a difference:

alexjo2144 · 2023-03-02T16:09:51Z

And q09:

The unenforced component of the table predicate is still used to prune data files when generating Iceberg splits. The same can be done when scanning the manifest to generate table statistics.

alexjo2144 · 2023-03-03T21:19:41Z

Not sure what's wrong with Maven Checks, but it passed the code checks part.

alexjo2144 · 2023-03-06T20:09:36Z

@findepi build is green ✔️

findepi · 2023-03-07T16:30:38Z

(x) Release notes are required, with the following suggested text:
# Iceberg
* Improve file pruning when generating Iceberg table statistics. 

@colebow this should be reworded to something like "improve query planning performance"

cla-bot bot added the cla-signed label Feb 23, 2023

alexjo2144 requested review from ebyhr, findepi and nineinchnick February 23, 2023 19:10

alexjo2144 force-pushed the iceberg/stats-filtering branch from 03af0e6 to bf56373 Compare February 24, 2023 14:57

alexjo2144 self-assigned this Feb 24, 2023

alexjo2144 added the iceberg Iceberg connector label Feb 24, 2023

findepi reviewed Feb 27, 2023

alexjo2144 force-pushed the iceberg/stats-filtering branch 2 times, most recently from a06ba3f to 6c6d3c6 Compare February 27, 2023 20:14

alexjo2144 commented Feb 27, 2023

alexjo2144 force-pushed the iceberg/stats-filtering branch from 6c6d3c6 to 3fd7d62 Compare February 28, 2023 18:26

raunaqmorarka reviewed Mar 1, 2023

raunaqmorarka added the performance label Mar 1, 2023

findepi reviewed Mar 1, 2023

Refactor Iceberg TableStatisticsReader parameters

94f2037

alexjo2144 force-pushed the iceberg/stats-filtering branch from 3fd7d62 to c6e9516 Compare March 1, 2023 20:54

alexjo2144 requested a review from findepi March 1, 2023 21:19

findepi approved these changes Mar 2, 2023

Improve stats accuracy in Iceberg when filters are not enforced

c4c8bff

The unenforced component of the table predicate is still used to prune data files when generating Iceberg splits. The same can be done when scanning the manifest to generate table statistics.

alexjo2144 force-pushed the iceberg/stats-filtering branch from c6e9516 to c4c8bff Compare March 2, 2023 17:50

findepi approved these changes Mar 2, 2023

Empty

614155a

alexjo2144 force-pushed the iceberg/stats-filtering branch from 214e7cc to 614155a Compare March 6, 2023 15:01

findepi merged commit 43936fb into trinodb:master Mar 7, 2023

github-actions bot added this to the 410 milestone Mar 7, 2023

colebow mentioned this pull request Mar 7, 2023

Add Trino 410 release notes #16422

Merged

alexjo2144 deleted the iceberg/stats-filtering branch March 10, 2023 15:40

3 participants