Skip to content

[Iceberg]Support setting the max number of columns for which metrics are collected#23468

Merged
hantangwangd merged 2 commits intoprestodb:masterfrom
hantangwangd:add_iceberg_table_properties
Aug 29, 2024
Merged

[Iceberg]Support setting the max number of columns for which metrics are collected#23468
hantangwangd merged 2 commits intoprestodb:masterfrom
hantangwangd:add_iceberg_table_properties

Conversation

@hantangwangd
Copy link
Member

@hantangwangd hantangwangd commented Aug 19, 2024

Description

For a very-wide-table, there will be too much metadata stored as stats in manifest files if we always collect stats for all the columns. It will occupy too much memory when loading the table metadata into memory. To cope with this, Iceberg lib support configuring the max columns that can have metrics for a table, referring to: apache/iceberg#3959, apache/iceberg#5215, apache/iceberg#5916

This PR add an Iceberg table property metrics_max_inferred_column to set the max columns number for which metrics are collected, and implement recognizing and supporting metrics_max_inferred_column on metrics collection for Iceberg table with PARQUET format.

Motivation and Context

Support setting the max number of columns for which metrics are collected

Impact

Iceberg tables with PARQUET format now only collect statistics information from the first 100 columns by default

Test Plan

  • Newly added test case to show the behavior of stats collection with columns count limit

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Iceberg Connector Changes
* Add table property `metrics_max_inferred_column` to configure the max columns number for which metrics are collected, and support `metrics_max_inferred_column` for Iceberg table with `PARQUET` format :pr:`23468`

steveburnett
steveburnett previously approved these changes Aug 19, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull branch, local doc build, looks good. Thanks!

@hantangwangd hantangwangd force-pushed the add_iceberg_table_properties branch 2 times, most recently from 360163a to 5dc6e55 Compare August 26, 2024 10:24
@hantangwangd hantangwangd changed the title [WIP]Support setting the max number of columns for which metrics are collected [Iceberg]Support setting the max number of columns for which metrics are collected Aug 26, 2024
@hantangwangd hantangwangd marked this pull request as ready for review August 26, 2024 12:02
@hantangwangd hantangwangd requested review from a team, ZacBlanco and elharo as code owners August 26, 2024 12:02
Comment on lines -110 to +112
Schema outputSchema,
PartitionSpec partitionSpec,
Table table,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help explain why we need these changes to IcebergPageSink, and why do we need to pass in the Iceberg Table when it seems we're just using the Presto PrestoIcebergTable implementation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem encountered when creating IcebergFileWriter in IcebergPageSink is that, among all the static methods provided by Iceberg Lib for building a MetricsConfig instance, we can only pass an iceberg Table object in order to specify the table's schema. Refer to the static methods in MetricsConfig.

However, at present, there isn't a way on the presto worker side for us to load an iceberg Table object through metadata manager. So we just built an iceberg Table interface's implementation that is only used for simple encapsulation of schema, properties, 'spec', and 'sortOrder'. Then pass to and use it in IcebergPageSink.

import java.util.List;
import java.util.Map;

public class PrestoIcebergTable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public class PrestoIcebergTable
public class PrestoIcebergTableForMetricsConfig

Please add a comment that this is a dummy class and is required for MetricsConfig#forTable. Consider an Iceberg issue and PR adding a more direct method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment that this is a dummy class and is required for MetricsConfig#forTable.

The comment is added, please take a look when available.

Consider an Iceberg issue and PR adding a more direct method.

Thanks for the suggestion, will consult with the Iceberg community later on whether a more direct method should be added.

@hantangwangd hantangwangd force-pushed the add_iceberg_table_properties branch from 5dc6e55 to 90e99b3 Compare August 28, 2024 04:01
@hantangwangd hantangwangd requested a review from tdcmeehan August 28, 2024 04:07
tdcmeehan
tdcmeehan previously approved these changes Aug 28, 2024
Comment on lines +51 to +53
/**
* This is a dummy class required for {@link org.apache.iceberg.MetricsConfig#forTable}
*
* */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
* This is a dummy class required for {@link org.apache.iceberg.MetricsConfig#forTable}
*
* */
/**
* This is a dummy class required for {@link org.apache.iceberg.MetricsConfig#forTable}
**/

This change get metrics configuration for the target table rather than
directly using the default metrics configuration which always collect
metrics for all columns.

Currently, only Iceberg tables with `PARQUET` format can recognize and
support metrics configuration.
@hantangwangd hantangwangd merged commit c66fa35 into prestodb:master Aug 29, 2024
@hantangwangd hantangwangd deleted the add_iceberg_table_properties branch August 29, 2024 00:18
@jaystarshot jaystarshot mentioned this pull request Nov 1, 2024
25 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants