Skip to content

Re-commit iceberg statistics and table handle revert#22666

Merged
ZacBlanco merged 3 commits intoprestodb:masterfrom
ZacBlanco:upstream-add-estimate-functions
May 6, 2024
Merged

Re-commit iceberg statistics and table handle revert#22666
ZacBlanco merged 3 commits intoprestodb:masterfrom
ZacBlanco:upstream-add-estimate-functions

Conversation

@ZacBlanco
Copy link
Contributor

Description

This PR adds a separate commit with the prerequisite changes that are required when #22661 was merged. It re-adds the two additional commits so that they don't depend on the histogram PR.

Motivation and Context

Re-commit some changes that were reverted in #22661

Impact

N/A

Test Plan

N/A

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

ZacBlanco added 3 commits May 3, 2024 16:40
This makes working with estimates more friendly
and ergonomic. They mimic the same style
methods as java.util.Optional
Previously, the data size statistic was computed by using the
Iceberg data manifests data size field. This is value is misleading
for Presto because it represents the compressed on-disk size.
This change allows ANALYZE to read and write data size statistic
values to puffin files.

This change also updates the hive-statistics-merge-strategy
config value in the Iceberg connector to accept a comma-separated
list of valid values to override from the HMS instead of using
an independent enum. This allows for a wider variety of combinations
using less code.
The `iceberg.pushdown_filter_enabled` flag added recently
changed how the iceberg connector metadata generates
table layouts. The previous fork in the logic would cause
layouts to be generated without a partition predicate constraint
even if it existed. This can lead to table stats being incorrect
and hence incorrect statistics and poor query planning when
filtering on partitioned tables or when the config is disabled.

The original logic led to the codepath which was responsible for
filling in the predicate for tables when filter pushdown was disabled
to be unreachable. This change will now allows the predicates to
show up on partition columns in the layout.
@ZacBlanco ZacBlanco force-pushed the upstream-add-estimate-functions branch from 29885cd to 463fbe1 Compare May 3, 2024 20:52
@ZacBlanco ZacBlanco changed the title Re-commit iceberg statistics and table handle revert. Re-commit iceberg statistics and table handle revert May 3, 2024
@ZacBlanco ZacBlanco marked this pull request as ready for review May 4, 2024 02:18
@ZacBlanco ZacBlanco requested a review from presto-oss May 4, 2024 02:18
@github-actions
Copy link

github-actions bot commented May 4, 2024

Codenotify: Notifying subscribers in CODENOTIFY files for diff be20d29...463fbe1.

Notify File(s)
@steveburnett presto-docs/src/main/sphinx/connector/iceberg.rst

Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jaystarshot
Copy link
Member

Is the root cause of the revert #22661 identified?

@ZacBlanco
Copy link
Contributor Author

@jaystarshot Yes, it's identified and described in #22664. I'm looking to add some additional tests to verify that change doesn't have other adverse affects when histograms are used.

Comment on lines +208 to +214
TableMetadata tableMetadata = metadata.getTableMetadata(session, tableHandle);
List<ColumnHandle> nonHiddenColumns = ImmutableList.copyOf(tableMetadata.getColumns().stream().filter(column -> !column.isHidden())
.map(ColumnMetadata::getName)
.map(columnHandles::get)
.filter(Objects::nonNull)
.collect(Collectors.toList()));
TableStatistics tableStatistics = metadata.getTableStatistics(session, tableHandle, nonHiddenColumns, constraint);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this part is originally from #22327
Can you explain what this change is and how does it affect the non-iceberg tables?

Copy link
Contributor Author

@ZacBlanco ZacBlanco May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prevents hidden columns from being queried when a user executes SHOW STATS. The HiveMetadata already implicitly filters out hidden columns when returning from getTableStatistics without a constraint. The line that renders the table also filters out the hidden columns.

Map<String, ColumnHandle> columns = columnHandles.stream()
.map(HiveColumnHandle.class::cast)
.filter(not(HiveColumnHandle::isHidden))
.collect(toImmutableMap(HiveColumnHandle::getName, Function.identity()));

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull branch, local build of docs, everything looks good. Thanks!

@ZacBlanco ZacBlanco merged commit e51d1b1 into prestodb:master May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants