
Refactor Iceberg table statistics to be deterministic#9906

Merged
findepi merged 1 commit into trinodb:master from alexjo2144:iceberg/stats-refactor
Jan 21, 2022

Conversation

@alexjo2144
Member

Fixes #9716

Existing table statistics were non-deterministic because they depended on the order that data files were loaded from the Iceberg API. This hopefully cleans the code up a bit and makes it more consistent.

@cla-bot cla-bot bot added the cla-signed label Nov 8, 2021
@alexjo2144 alexjo2144 added the WIP label Nov 8, 2021
@findepi
Member

findepi commented Nov 9, 2021

@alexjo2144 the build is red

@alexjo2144 alexjo2144 force-pushed the iceberg/stats-refactor branch 3 times, most recently from d4515b5 to 248943a on January 11, 2022 20:46
@alexjo2144 alexjo2144 requested review from findepi and phd3 January 12, 2022 20:54
@alexjo2144
Member Author

@findepi finally had some time to get back to this. Mind taking a look?

@alexjo2144 alexjo2144 removed the WIP label Jan 12, 2022
@alexjo2144 alexjo2144 requested a review from jirassimok January 12, 2022 20:55
Member

@Immutable, final

Member

defensive copy

Comment on lines 127 to 128
Member

defensive copy

Member

add empty line between immutable and mutable state

Member

acceptDataFile ?

Comment on lines 196 to 199
Member

computeIfAbsent (without another get) gives you 0.5x map access and skips IcebergStatisticsBuilder allocation.
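A minimal sketch of the pattern the reviewer is pointing at; StringBuilder stands in here for the PR's statistics-builder class, purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class ComputeIfAbsentDemo
{
    public static void main(String[] args)
    {
        Map<Integer, StringBuilder> builders = new HashMap<>();

        // computeIfAbsent performs a single map access and only invokes the
        // factory lambda when the key is missing, unlike the two-step
        // containsKey/put (or get, check, put) pattern, which always
        // touches the map at least twice and may allocate needlessly.
        builders.computeIfAbsent(1, key -> new StringBuilder()).append("a");
        builders.computeIfAbsent(1, key -> new StringBuilder()).append("b");

        System.out.println(builders.get(1)); // prints "ab"
    }
}
```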

Member

Functions.compose -> plain lambda entry -> ...

Member

add // invalidate

Comment on lines 224 to 229
Member

You're doing 3 map lookups (contains, get, merge).
You can do this in one shot:

nullCounts.merge(id, nullCount, (existingCount, newCount) ->
        existingCount.isPresent() && newCount.isPresent() ? Optional.of(existingCount.get() + newCount.get()) : Optional.empty());

(you can extract lambda body to a method like sumOfOptionals)
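A sketch of what that extraction might look like; sumOfOptionals is the hypothetical helper name suggested above, and the map shape is illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SumOfOptionalsDemo
{
    // Combine two optional counts: the sum when both are present, empty otherwise.
    static Optional<Long> sumOfOptionals(Optional<Long> first, Optional<Long> second)
    {
        return first.isPresent() && second.isPresent()
                ? Optional.of(first.get() + second.get())
                : Optional.empty();
    }

    public static void main(String[] args)
    {
        Map<Integer, Optional<Long>> nullCounts = new HashMap<>();
        nullCounts.put(7, Optional.of(3L));

        // One map access instead of contains + get + merge.
        nullCounts.merge(7, Optional.of(2L), SumOfOptionalsDemo::sumOfOptionals);
        System.out.println(nullCounts.get(7)); // prints "Optional[5]"

        // A missing count on either side invalidates the total.
        nullCounts.merge(7, Optional.empty(), SumOfOptionalsDemo::sumOfOptionals);
        System.out.println(nullCounts.get(7)); // prints "Optional.empty"
    }
}
```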

Member

This method is also equivalent to this:

nullCounts.merge(id, nullCount, (oldCount, newCount) -> 
        oldCount.flatMap(oldValue ->
                newCount.map(newValue -> newValue + oldValue)));

Though after looking at this, it's definitely less clear about the intent; I would still use this form if extracting a method to add optionals.

Actually, instead of just addition, maybe mergeOptionals would be better (I see somewhere else in this file it could be used). Here, it would be used like mergeOptionals(oldCount, newCount, Long::sum).

/**
 * Apply a function to the values in two optionals, returning an optional containing the result.
 * If either argument is empty, return empty.
 */
<A, B, C> Optional<C> mergeOptionals(Optional<A> a, Optional<B> b, BiFunction<A, B, C> mergeFunction)
{
    return a.flatMap(aa -> b.map(bb -> mergeFunction.apply(aa, bb)));
}

Member

return a.flatMap(aa -> b.map(bb -> mergeFunction.apply(aa, bb)));

i thought about that, but i don't find it readable; that's why i suggested using ?:

Member

where did these go to?

Member Author

Before my changes the ORC version of this table was deemed to have invalid column metrics, thus the NULL min/max/null count rows. I think that was a bug though, and the stats are the same now for both Parquet and ORC, with a few exceptions below.

Member

@jirassimok jirassimok left a comment

I don't fully understand how this works, but overall it looks pretty good.

Member

Should the builder's method return this?

Comment on lines 154 to 155
Member

Do these need to be declared so long before they're used?

Member Author

Any closer and they'd be inside the loop

Comment on lines 160 to 187
Member

I see partitionValue.ifPresentOrElse here to avoid Optional.get.

Or maybe just an ifPresent, leaving the updateNullCountStats call outside:

partitionValues.get(id).ifPresent(partition -> {
    // ...
    updateMinMaxStats(...);
});
updateNullCountStats(id, partitionValue.map(v -> 0).orElseGet(dataFile::recordCount));

Member Author

I see what you're getting at. I kinda like the separation as it is because there's a "this partition value is non-null" block, and a "this partition value is null" block.

Comment on lines 180 to 181
Member

I'm not a fan of these null checks.

Why not inline upperBounds and lowerBounds here as optionals (changing the signature of convertBounds)?

Object lowerBound = convertBounds(idToTypeMapping, dataFile.lowerBounds())
    .map(bounds -> convertIcebergValueToTrino(column.type(), bounds))
    .orElse(null);

Member

Actually, maybe the bounds should be Optional themselves, rather than nullable.

Comment on lines 253 to 259
Member

This could be a stream.

idToMetricMap.entrySet().stream().map(...).collect(toImmutableMap(Entry::getKey, Entry::getValue))
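A minimal sketch of the suggested stream rewrite; it uses the JDK's Collectors.toUnmodifiableMap in place of Guava's toImmutableMap, and the map contents and transformation are made up for illustration:

```java
import java.util.Map;
import java.util.stream.Collectors;

public class StreamCollectDemo
{
    public static void main(String[] args)
    {
        Map<Integer, Long> idToMetricMap = Map.of(1, 10L, 2, 20L);

        // A build-up loop with repeated put calls, rewritten as a single
        // stream pipeline that collects directly into an immutable map.
        Map<Integer, Long> doubled = idToMetricMap.entrySet().stream()
                .collect(Collectors.toUnmodifiableMap(
                        Map.Entry::getKey,
                        entry -> entry.getValue() * 2));

        System.out.println(doubled.get(2)); // prints "40"
    }
}
```

Note that, like ImmutableMap, toUnmodifiableMap throws on duplicate keys, so the same caveat about key uniqueness raised below applies here too.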

Member

Or make the method return an Optional (also noted in an earlier comment).

Member

If it is, ImmutableMap will throw an exception.

Comment on lines 295 to 325
Member

This could use ifPresent, or even better, it could use map or mergeOptionals (suggested above).

this.min = this.min.map(currentMin ->
        min != null && compareTrinoValue(min, currentMin) < 0 ? min : currentMin);

this.min = mergeOptionals(this.min, Optional.ofNullable(min), (currentValue, newValue) ->
        compareTrinoValue(newValue, currentValue) < 0 ? newValue : currentValue);

(If you make the bound variables Optionals as I suggested in an earlier comment, then I think mergeOptionals is best here. Otherwise, I think the map version is better.)

Member Author

These miss a case where we want to invalidate the stats. For example, if the first file has stats for a column and then the second file does not, we should treat that the same as if the order is reversed.

@alexjo2144
Member Author

Comments addressed in the fixup; however, I realized this doesn't work with tables that have gone through some schema evolution. I need to work on a fix for that before re-review.

Member

@findepi findepi left a comment

comments to the second fixup

Member

Are Iceberg Types safe to use as map keys?

(my initial thought was to have List<> columnTrinoType and correlate with columns based on list index.)

Member

Mapping them here is useful when you have many columns of same type, with stats.
And it only matters when # files isn't huge.

Doing this in ColumnStatistics::new is simpler and IMO sufficient, as you do lookup per column.
(originally you had lookup per column x file)

Member

Why Optional?
How does empty() differ from empty map?

Member

This is what breaks the schema evolution.
I notice that you didn't have this check previously and it was deemed OK.

Maybe we want if (identityPartitionFieldIds.contains(id) && partitionValues.containsKey(id)), so that we still try to take min/max from file stats?

Actually, is identityPartitionFieldIds.contains(id) important?
if partitionValues.containsKey(id) should be enough. The current table partitioning is not important when calculating the stats.

Member Author

Actually, is identityPartitionFieldIds.contains(id) important?

I was using that as a proxy for checking whether the partition has a transform. If the partitioning is on hour(ts) we can't use the partition information to calculate max(ts), but you're right that we can't use the current partitioning; it needs to be the spec for that file.

Member

lowerBounds, upperBounds are unnecessarily Optional. "No entry" (null) is treated the same as "no map at all".

Member

Do typeToComparisonHandle.get(type) only if constructing new ColumnStatistics

Member

i missed that previously -- the ColumnStatistics captures initial bounds, so on the first round the call to updateMinMax is a no-op.

You can skip the call in a somewhat verbose manner with Map.compute

columnStatistics.compute(id, (ignored, columnStatistics) -> {
    if (columnStatistics == null) {
        columnStatistics = new ColumnStatistics(lowerBound, upperBound, comparisonHandle);
    }
    else {
        columnStatistics.updateMinMax(lowerBound, upperBound);
    }
    return columnStatistics;
});

(or document that you're doing what you're doing currently)

Member

comparisonHandle is immutable state, so fits better as first arg (as you put the field order)

Member

i'd remove Optional.of().
also Optional.empty() -> "Empty"

(the fact that value is wrapped in an Optional is obvious, and doesn't need to be talked about)

@alexjo2144
Member Author

Linking a thread I started on the iceberg slack channel on how to deal with schema evolution vs missing metrics, still don't have a clear answer for how to tell the two apart though https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1642177083095300

@alexjo2144
Member Author

AC thanks

@findepi
Member

findepi commented Jan 21, 2022

Member

@findepi findepi left a comment

@alexjo2144 please squash (sans rebase)

Member

Fix bug in Iceberg partition schema evolution

this will be squashed, right?

also, should we have a test, with two partitioning transformations over a column?

Member Author

Yeah, I'll put it in a separate PR though

@nineinchnick
Member

nineinchnick commented Jan 21, 2022

> @nineinchnick what is Pull Request Labeler / Test Report failure? https://github.com/trinodb/trino/pull/9906/checks?check_run_id=4862862822 / https://github.com/trinodb/trino/runs/4862862822

@findepi the Test Report is a check added by a separate workflow that's triggered by, but not associated with, the PR, so it doesn't have its own workflow header and looks like it belongs to the Pull Request Labeler workflow. This is a GitHub UI issue. The check contains annotations with a summary of failures in jobs of the ci workflow, so you don't have to go into individual jobs and search for failures.

It's marked as failed, because there were other failures. Marking it as successful would be a false positive if it contained error annotations.

@findepi
Member

findepi commented Jan 21, 2022

@nineinchnick could it be attached to the ci flow somehow?
it's confusing to see Pull Request Labeler fail.

@alexjo2144 alexjo2144 force-pushed the iceberg/stats-refactor branch from 64bbf63 to 712277b on January 21, 2022 15:28
@alexjo2144
Member Author

Squashed. Thanks @findepi

@findepi findepi merged commit df61aea into trinodb:master Jan 21, 2022
@github-actions github-actions bot added this to the 369 milestone Jan 21, 2022
@alexjo2144 alexjo2144 deleted the iceberg/stats-refactor branch February 7, 2022 14:43
@vincentpoon
Member

We have millions of files and find SELECT * from table$partitions hangs in the acceptDataFile here.

@findepi
Member

findepi commented Feb 18, 2022

@vincentpoon this PR probably didn't change how the stats are calculated, just made the code saner & "more correct"
or, do you mean that what you're observing is a behavioral regression?

in any case, let's have an issue


Development

Successfully merging this pull request may close these issues.

Iceberg table statistics are non-deterministic

5 participants