[Iceberg] Support setting the max number of columns for which metrics are collected #23468
steveburnett left a comment
LGTM! (docs)
Pull branch, local doc build, looks good. Thanks!
360163a to 5dc6e55
Schema outputSchema,
PartitionSpec partitionSpec,
Table table,
Can you help explain why we need these changes to IcebergPageSink, and why do we need to pass in the Iceberg Table when it seems we're just using the Presto PrestoIcebergTable implementation?
The problem we hit when creating IcebergFileWriter in IcebergPageSink is that, among all the static methods the Iceberg library provides for building a MetricsConfig instance, the only way to specify the table's schema is to pass an Iceberg Table object. See the static methods in MetricsConfig.
However, there is currently no way on the Presto worker side to load an Iceberg Table object through the metadata manager. So we built an implementation of the Iceberg Table interface that simply encapsulates the schema, properties, 'spec', and 'sortOrder', and we pass it to IcebergPageSink and use it there.
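The wrapper idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `TableLike`, `WrappedTable`, and `describeMetricsConfig` are hypothetical stand-ins, not the Iceberg `Table` interface, `MetricsConfig#forTable`, or the class added in this PR. The property key and `truncate(16)` fallback are meant to mirror Iceberg's default metrics mode and are used only as sample values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TableWrapperSketch {
    // Hypothetical stand-in for the small part of a table interface
    // that a metrics-config factory actually reads.
    public interface TableLike {
        List<String> columnNames();

        Map<String, String> properties();
    }

    // Lightweight holder: it only wraps values the worker already has,
    // so no metadata lookup is needed to construct it.
    public static final class WrappedTable implements TableLike {
        private final List<String> columnNames;
        private final Map<String, String> properties;

        public WrappedTable(List<String> columnNames, Map<String, String> properties) {
            this.columnNames = Collections.unmodifiableList(new ArrayList<>(columnNames));
            this.properties = Collections.unmodifiableMap(new HashMap<>(properties));
        }

        @Override
        public List<String> columnNames() {
            return columnNames;
        }

        @Override
        public Map<String, String> properties() {
            return properties;
        }
    }

    // Hypothetical factory that, like the API discussed above, accepts only
    // a table-like object rather than a schema plus properties.
    public static String describeMetricsConfig(TableLike table) {
        // Assumed sample key and fallback value, for illustration only.
        String mode = table.properties().getOrDefault("write.metadata.metrics.default", "truncate(16)");
        return table.columnNames().size() + " columns, default mode " + mode;
    }

    public static void main(String[] args) {
        TableLike table = new WrappedTable(Arrays.asList("id", "name"), Collections.<String, String>emptyMap());
        System.out.println(describeMetricsConfig(table));
    }
}
```

The point of the sketch is the design choice: when a factory only takes a table, a thin read-only wrapper lets the caller supply the schema and properties it already holds without loading the real table.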
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java
Outdated
import java.util.List;
import java.util.Map;

public class PrestoIcebergTable
- public class PrestoIcebergTable
+ public class PrestoIcebergTableForMetricsConfig
Please add a comment that this is a dummy class and is required for MetricsConfig#forTable. Consider an Iceberg issue and PR adding a more direct method.
> Please add a comment that this is a dummy class and is required for MetricsConfig#forTable.
The comment has been added; please take a look when you have time.
> Consider an Iceberg issue and PR adding a more direct method.
Thanks for the suggestion. I will ask the Iceberg community whether a more direct method should be added.
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergPageSinkProvider.java
Outdated
5dc6e55 to 90e99b3
/**
 * This is a dummy class required for {@link org.apache.iceberg.MetricsConfig#forTable}
 *
 * */
- /**
-  * This is a dummy class required for {@link org.apache.iceberg.MetricsConfig#forTable}
-  *
-  * */
+ /**
+  * This is a dummy class required for {@link org.apache.iceberg.MetricsConfig#forTable}
+  **/
This change gets the metrics configuration for the target table rather than directly using the default metrics configuration, which always collects metrics for all columns. Currently, only Iceberg tables in `PARQUET` format recognize and support the metrics configuration.
…trics are collected
90e99b3 to 0030735
Description
For a very wide table, too much metadata is stored as stats in manifest files if we always collect stats for all columns, and it occupies too much memory when the table metadata is loaded. To cope with this, the Iceberg library supports configuring the maximum number of columns that can have metrics for a table; see apache/iceberg#3959, apache/iceberg#5215, apache/iceberg#5916.
This PR adds an Iceberg table property `metrics_max_inferred_column` to set the maximum number of columns for which metrics are collected, and implements recognition and support of `metrics_max_inferred_column` during metrics collection for Iceberg tables in `PARQUET` format.
Motivation and Context
Support setting the max number of columns for which metrics are collected
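As a rough illustration of the intended effect, the sketch below keeps metrics only for the first N columns, with N read from the table property and defaulting to 100. It is a simplified model, not the Iceberg or Presto implementation: the class and method names are hypothetical, the property value is assumed to be a plain integer string, and nested columns are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaxInferredColumnsSketch {
    // Property name as written in this PR's description.
    static final String PROPERTY = "metrics_max_inferred_column";
    // Default taken from the Impact section ("first 100 columns by default").
    static final int DEFAULT_MAX = 100;

    // Returns the columns for which metrics would be collected in this
    // simplified model: the first N columns in declaration order.
    public static List<String> columnsWithMetrics(List<String> columns, Map<String, String> properties) {
        int max = Integer.parseInt(properties.getOrDefault(PROPERTY, String.valueOf(DEFAULT_MAX)));
        int limit = Math.max(0, Math.min(max, columns.size()));
        return new ArrayList<>(columns.subList(0, limit));
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("c1", "c2", "c3", "c4");

        Map<String, String> properties = new HashMap<>();
        properties.put(PROPERTY, "2");

        // With the property set to 2, only the first two columns keep metrics.
        System.out.println(columnsWithMetrics(columns, properties)); // prints [c1, c2]

        // Without the property, the default of 100 covers all four columns.
        System.out.println(columnsWithMetrics(columns, Collections.<String, String>emptyMap()).size()); // prints 4
    }
}
```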
Impact
Iceberg tables in `PARQUET` format now collect statistics only from the first 100 columns by default.
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.