Conversation

@rdblue rdblue commented Jul 6, 2022

#3959 updated MetricsConfig to stop writing metrics by default for tables with more than 32 columns. The intent was to avoid storing too much metadata as stats in manifest files, but the implementation stopped writing metrics for all columns once a table reached 32 columns, which caused a large change in table performance after a table grew to 32+ columns.

This updates the logic so that the first 32 columns still have metrics, but no new metrics are stored for columns after that point unless the table has a global default set or columns are individually configured.
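A minimal sketch of the corrected behavior (hypothetical names, not the actual MetricsConfig code): an explicit table-wide default applies to every column, while the inferred default only covers the first 32 columns.

```java
// Hypothetical sketch -- not the actual Iceberg MetricsConfig implementation.
public class MetricsDefaultsSketch {
  static final int MAX_INFERRED_DEFAULT_COLUMNS = 32;

  // Default metrics mode for the column at a given position. An explicit
  // table-wide default (write.metadata.metrics.default) applies to every
  // column; otherwise only the first 32 columns get an inferred default.
  static String defaultModeFor(int columnPosition, String explicitDefault) {
    if (explicitDefault != null) {
      return explicitDefault;
    }
    return columnPosition < MAX_INFERRED_DEFAULT_COLUMNS ? "truncate(16)" : "none";
  }

  public static void main(String[] args) {
    System.out.println(defaultModeFor(0, null));      // first column: metrics kept
    System.out.println(defaultModeFor(40, null));     // past the limit: none
    System.out.println(defaultModeFor(40, "counts")); // explicit default wins everywhere
  }
}
```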

@github-actions github-actions bot added the core label Jul 6, 2022
@rdblue rdblue force-pushed the fix-metrics-config branch from 094d852 to 30f10f5 on July 6, 2022 21:15
public static final String METADATA_DELETE_AFTER_COMMIT_ENABLED = "write.metadata.delete-after-commit.enabled";
public static final boolean METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT = false;

public static final String METRICS_MAX_INFERRED_COLUMN_DEFAULTS =
Contributor

nit: INFERRED doesn't really make sense to me in this context. We're not inferring, we're actually explicit here.

Contributor Author

I debated what to call this. The problem is that we are defaulting the default. You can explicitly set a default using write.metadata.metrics.default, and that will apply to all columns. If you don't set a default we infer one for you, but only for the first 32 columns (at least, after this PR). That's why I used "inferred" -- I thought it was better than max-defaulted-default-columns.
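For context, the configuration surface being discussed looks roughly like this (the first two properties exist today; the last is the one this PR introduces; values are illustrative):

```properties
# explicit global default -- applies to ALL columns, regardless of the limit
write.metadata.metrics.default=truncate(16)
# per-column override -- also unaffected by the limit
write.metadata.metrics.column.event_ts=full
# cap on how many leading columns receive an inferred default when no
# explicit default is set (the property introduced by this PR)
write.metadata.metrics.max-inferred-column-defaults=32
```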

Member

Agree with this table property; initially I had made one but it was taken out during the discussions. Indeed it's a bit of a confusing config, but I don't see any other great option.

Contributor

The naming felt a little bit confusing to me too. After I read the explanation, it started to make more sense. However, I am still not sure inferred is the right word. Technically, we infer defaults for all columns (after this limit it just becomes none). To me, this is more about limiting the number of columns for which we persist metrics by default. Can the property name revolve around persist rather than infer?

Contributor Author

The problem with "persist" and similar is that it misses the distinction between an explicit default (when write.metadata.metrics.default is set) and an implicit default that comes from Iceberg. I think the right behavior is to preserve what we currently do, which is to use the explicit default for all columns. But that means that this property should obviously not apply to the explicit default. That's why I used "inferred default".

What about changing this to include unconfigured or missing? Something like missing-mode-limit?

Contributor

Let's just keep it as-is then. I don't think it is a big deal.

@rdblue rdblue added this to the Iceberg 0.14.0 Release milestone Jul 6, 2022
@github-actions github-actions bot added the data label Jul 6, 2022
Member

@szehon-ho szehon-ho left a comment

Yea this makes sense, thanks for making the change (should have considered this case earlier). Added some small comments

private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
Member

Add a precondition that it's >= 0?

Contributor Author

I'm adding a warning, but I don't think that we should fail because of an invalid value here.

Member

@szehon-ho szehon-ho Jul 7, 2022

Looks like an invalid from/to will trigger an exception in subList(from, to) anyway if from >= to; I was thinking a precondition would make the message clearer.

Contributor Author

I added a warning and set it to the default if it is invalid.
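The resulting behavior might look like this sketch (hypothetical names, not the actual patch): an invalid or negative limit logs a warning and falls back to the default rather than throwing.

```java
import java.util.Map;

// Sketch of the "warn and fall back" behavior discussed above (names hypothetical).
public class LimitSanitizer {
  static int sanitizedLimit(Map<String, String> props, String key, int defaultLimit) {
    int limit;
    try {
      limit = Integer.parseInt(props.getOrDefault(key, String.valueOf(defaultLimit)));
    } catch (NumberFormatException e) {
      limit = -1; // treat unparseable values as invalid
    }
    if (limit < 0) {
      // warn instead of failing: a bad table property should not break reads/writes
      System.err.printf("Invalid value for %s; using default %d%n", key, defaultLimit);
      return defaultLimit;
    }
    return limit;
  }
}
```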


} else {
// an inferred default mode is applied to the first few columns, up to the limit
Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));
Contributor

Quick warning: subList would return a view on top of the original list and any subsequent changes would be reflected in both. This does not seem to cause issues here but I better mention.

Contributor Author

This is a temporary schema so it should be fine.
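For readers unfamiliar with the caveat being discussed, a small standalone demonstration of subList's view semantics:

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates that List.subList returns a view backed by the original list.
public class SubListViewDemo {
  public static void main(String[] args) {
    List<String> columns = new ArrayList<>(List.of("id", "name", "ts", "extra"));
    List<String> firstTwo = columns.subList(0, 2); // a view, not a copy

    columns.set(0, "renamed");
    System.out.println(firstTwo.get(0)); // "renamed" -- the change is visible through the view

    // Structural changes to the backing list invalidate the view:
    // columns.add("more"); firstTwo.get(0); // would throw ConcurrentModificationException

    // A defensive copy decouples the two when that matters:
    List<String> copy = new ArrayList<>(columns.subList(0, 2));
    columns.set(0, "again");
    System.out.println(copy.get(0)); // still "renamed"
  }
}
```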


} else {
// an inferred default mode is applied to the first few columns, up to the limit
Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));
Contributor

@aokolnychyi aokolnychyi Jul 7, 2022

What if I have a highly nested schema? The number of stored metrics can be way more than 32 in that case?

Contributor Author

Yes, but this is the current behavior. We use the top-level columns for the current check.
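To make the nesting concern concrete, a toy sketch (not Iceberg's Types API): the limit counts top-level columns, but metrics are kept per leaf field, so a single struct column can contribute many metric entries.

```java
import java.util.List;

// Toy illustration: one top-level struct column can contain many leaf
// fields, each of which would carry its own metrics.
public class LeafCountDemo {
  record Field(String name, List<Field> children) {
    static Field leaf(String name) { return new Field(name, List.of()); }

    int leafCount() {
      return children.isEmpty() ? 1 : children.stream().mapToInt(Field::leafCount).sum();
    }
  }

  public static void main(String[] args) {
    Field address = new Field("address",
        List.of(Field.leaf("street"), Field.leaf("city"), Field.leaf("zip")));
    System.out.println(address.leafCount()); // 3 leaves under one top-level column
  }
}
```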

@rdblue rdblue merged commit a253817 into apache:master Jul 8, 2022
@bryanck
Contributor

bryanck commented Jul 9, 2022

This is already merged, but I thought I'd leave feedback anyway, in case it is useful.

As a data engineer, many tables I have maintained have more than 32 top-level columns. Often columns used for partitioning, sorting, auditing, and so forth are put at the end of a table schema, but these are some of the most frequently used in filtering. Also, additional columns are generally added at the end of the schema. The assumption that the first columns in a table schema are the most important to have stats on is not always accurate.

In testing 0.14, I ran into missing stats on tables, which was confusing and difficult to debug. I imagine that for those new to Iceberg, who are most likely to leave settings at the default, it would be even more confusing.

I feel a more sensible default is to leave it the same as previous Iceberg versions (i.e. no column limit). Then an option could be introduced to limit the number of columns so those who prefer a limit can set it on their tables, e.g. "first(32)". I feel it is better to err on the side of too many stats and dial that back as needed.

@szehon-ho
Member

I think that's an interesting enhancement idea, if I understand correctly: make write.metadata.metrics.max-inferred-column-defaults also accept first or last. An organization might be able to set last(32) as a default at the catalog level (i.e. #4011). On the other hand, sort and partition columns are already auto-promoted to have stats today; I'm curious whether you see a good impact from having stats on other columns?

@szehon-ho
Member

Thinking about it, I guess there are always correlations in the data; some column values may still be usable for filtering if you sort by or partition by correlated columns.

So, it makes sense to me to add more support like first/last, given we already are advertising this option. Will leave to everyone's opinion if the default needs to be changed.
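If the first/last idea were pursued, parsing such a spec could look roughly like this (purely hypothetical syntax and names, not an Iceberg API):

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the proposed first(n)/last(n) syntax.
public class ColumnRangeSpec {
  private static final Pattern SPEC = Pattern.compile("(first|last)\\((\\d+)\\)");

  static <T> List<T> select(List<T> columns, String spec) {
    Matcher m = SPEC.matcher(spec);
    if (!m.matches()) {
      throw new IllegalArgumentException("Invalid spec: " + spec);
    }
    int n = Math.min(Integer.parseInt(m.group(2)), columns.size());
    return m.group(1).equals("first")
        ? columns.subList(0, n)
        : columns.subList(columns.size() - n, columns.size());
  }

  public static void main(String[] args) {
    List<String> cols = List.of("a", "b", "c", "d");
    System.out.println(select(cols, "first(2)")); // [a, b]
    System.out.println(select(cols, "last(2)"));  // [c, d]
  }
}
```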

private static final Logger LOG = LoggerFactory.getLogger(MetricsConfig.class);
private static final Joiner DOT = Joiner.on('.');

// Disable metrics by default for wide tables to prevent excessive metadata
Contributor

@danielcweeks Is this a left-over comment?
