Conversation

@rdblue rdblue commented Jul 6, 2022

#3959 updated MetricsConfig to stop writing metrics by default for tables with more than 32 columns. The intent was to avoid storing too much metadata as stats in manifest files, but the implementation stopped writing metrics for all columns once a table reached 32 columns, which caused a large change in table performance after a table grew to 32+ columns.

This updates the logic so that the first 32 columns still have metrics, but no new metrics are stored for columns after that point unless the table has a global default set or columns are individually configured.
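A minimal sketch of the corrected behavior (hypothetical names, not the actual MetricsConfig code): an explicit table-wide default applies to every column, while the inferred default only covers the first 32 columns.

```java
// Hypothetical sketch -- not the actual Iceberg MetricsConfig implementation.
public class MetricsDefaultsSketch {
  static final int MAX_INFERRED_DEFAULT_COLUMNS = 32;

  // Default metrics mode for the column at a given position. An explicit
  // table-wide default (write.metadata.metrics.default) applies to every
  // column; otherwise only the first 32 columns get an inferred default.
  static String defaultModeFor(int columnPosition, String explicitDefault) {
    if (explicitDefault != null) {
      return explicitDefault;
    }
    return columnPosition < MAX_INFERRED_DEFAULT_COLUMNS ? "truncate(16)" : "none";
  }

  public static void main(String[] args) {
    System.out.println(defaultModeFor(0, null));      // first column: metrics kept
    System.out.println(defaultModeFor(40, null));     // past the limit: none
    System.out.println(defaultModeFor(40, "counts")); // explicit default wins everywhere
  }
}
```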

@github-actions github-actions bot added the core label Jul 6, 2022
@rdblue rdblue force-pushed the fix-metrics-config branch from 094d852 to 30f10f5 on July 6, 2022 21:15
public static final String METADATA_DELETE_AFTER_COMMIT_ENABLED = "write.metadata.delete-after-commit.enabled";
public static final boolean METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT = false;

public static final String METRICS_MAX_INFERRED_COLUMN_DEFAULTS =
Contributor

nit: INFERRED doesn't really make sense to me in this context. We're not inferring, we're actually explicit here.

Contributor Author

I debated what to call this. The problem is that we are defaulting the default. You can explicitly set a default using write.metadata.metrics.default, and that will apply to all columns. If you don't set a default we infer one for you, but only for the first 32 columns (at least, after this PR). That's why I used "inferred" -- I thought it was better than max-defaulted-default-columns.
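For context, the configuration surface being discussed looks roughly like this (the first two properties exist today; the last is the one this PR introduces; values are illustrative):

```properties
# explicit global default -- applies to ALL columns, regardless of the limit
write.metadata.metrics.default=truncate(16)
# per-column override -- also unaffected by the limit
write.metadata.metrics.column.event_ts=full
# cap on how many leading columns receive an inferred default when no
# explicit default is set (the property introduced by this PR)
write.metadata.metrics.max-inferred-column-defaults=32
```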

Member

Agree with this table property; initially I had made one but it was taken out during the discussions. Indeed it's a bit of a confusing config, but I don't see any other great option.

Contributor

The naming felt a little bit confusing to me too. After I read the explanation, it started to make more sense. However, I am still not sure inferred is the right word. Technically, we infer defaults for all columns (after this limit it just becomes none). To me, this is more about limiting the number of columns for which we persist metrics by default. Can the property name revolve around persist rather than infer?

Contributor Author

The problem with "persist" and similar is that it misses the distinction between an explicit default (when write.metadata.metrics.default is set) and an implicit default that comes from Iceberg. I think the right behavior is to preserve what we currently do, which is to use the explicit default for all columns. But that means that this property should obviously not apply to the explicit default. That's why I used "inferred default".

What about changing this to include unconfigured or missing? Something like missing-mode-limit?

Contributor

Let's just keep it as-is then. I don't think it is a big deal.

@rdblue rdblue added this to the Iceberg 0.14.0 Release milestone Jul 6, 2022
@github-actions github-actions bot added the data label Jul 6, 2022
Member

@szehon-ho szehon-ho left a comment

Yea this makes sense, thanks for making the change (should have considered this case earlier). Added some small comments

private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
Member

Add a precondition that it's >= 0?

Contributor Author

I'm adding a warning, but I don't think that we should fail because of an invalid value here.

Member

@szehon-ho szehon-ho Jul 7, 2022

Looks like an invalid from/to will trigger an exception in subList(from, to) anyway if from >= to; I was thinking a precondition would make the message clearer.

Contributor Author

I added a warning and set it to the default if it is invalid.
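The resulting behavior might look like this sketch (hypothetical names, not the actual patch): an invalid or negative limit logs a warning and falls back to the default rather than throwing.

```java
import java.util.Map;

// Sketch of the "warn and fall back" behavior discussed above (names hypothetical).
public class LimitSanitizer {
  static int sanitizedLimit(Map<String, String> props, String key, int defaultLimit) {
    int limit;
    try {
      limit = Integer.parseInt(props.getOrDefault(key, String.valueOf(defaultLimit)));
    } catch (NumberFormatException e) {
      limit = -1; // treat unparseable values as invalid
    }
    if (limit < 0) {
      // warn instead of failing: a bad table property should not break reads/writes
      System.err.printf("Invalid value for %s; using default %d%n", key, defaultLimit);
      return defaultLimit;
    }
    return limit;
  }
}
```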


} else {
// an inferred default mode is applied to the first few columns, up to the limit
Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));
Contributor

Quick warning: subList would return a view on top of the original list and any subsequent changes would be reflected in both. This does not seem to cause issues here but I better mention.

Contributor Author

This is a temporary schema so it should be fine.
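For readers unfamiliar with the caveat being discussed, a small standalone demonstration of subList's view semantics:

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates that List.subList returns a view backed by the original list.
public class SubListViewDemo {
  public static void main(String[] args) {
    List<String> columns = new ArrayList<>(List.of("id", "name", "ts", "extra"));
    List<String> firstTwo = columns.subList(0, 2); // a view, not a copy

    columns.set(0, "renamed");
    System.out.println(firstTwo.get(0)); // "renamed" -- the change is visible through the view

    // Structural changes to the backing list invalidate the view:
    // columns.add("more"); firstTwo.get(0); // would throw ConcurrentModificationException

    // A defensive copy decouples the two when that matters:
    List<String> copy = new ArrayList<>(columns.subList(0, 2));
    columns.set(0, "again");
    System.out.println(copy.get(0)); // still "renamed"
  }
}
```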


} else {
// an inferred default mode is applied to the first few columns, up to the limit
Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));
Contributor

@aokolnychyi aokolnychyi Jul 7, 2022

What if I have a highly nested schema? The number of stored metrics can be way more than 32 in that case?

Contributor Author

Yes, but this is the current behavior. We use the top-level columns for the current check.
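To make the nesting concern concrete, a toy sketch (not Iceberg's Types API): the limit counts top-level columns, but metrics are kept per leaf field, so a single struct column can contribute many metric entries.

```java
import java.util.List;

// Toy illustration: one top-level struct column can contain many leaf
// fields, each of which would carry its own metrics.
public class LeafCountDemo {
  record Field(String name, List<Field> children) {
    static Field leaf(String name) { return new Field(name, List.of()); }

    int leafCount() {
      return children.isEmpty() ? 1 : children.stream().mapToInt(Field::leafCount).sum();
    }
  }

  public static void main(String[] args) {
    Field address = new Field("address",
        List.of(Field.leaf("street"), Field.leaf("city"), Field.leaf("zip")));
    System.out.println(address.leafCount()); // 3 leaves under one top-level column
  }
}
```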

@rdblue rdblue merged commit a253817 into apache:master Jul 8, 2022
@bryanck
Contributor

bryanck commented Jul 9, 2022

This is already merged, but I thought I'd leave feedback anyway, in case it is useful.

As a data engineer, many tables I have maintained have more than 32 top-level columns. Often columns used for partitioning, sorting, auditing, and so forth are put at the end of a table schema, but these are some of the most frequently used in filtering. Also, additional columns are generally added at the end of the schema. The assumption that the first columns in a table schema are the most important to have stats on is not always accurate.

In testing 0.14, I ran into missing stats on tables, which was confusing and difficult to debug. I imagine that for those new to Iceberg, who are most likely to leave settings at the default, it would be even more confusing.

I feel a more sensible default is to leave it the same as previous Iceberg versions (i.e. no column limit). Then an option could be introduced to limit the number of columns so those who prefer a limit can set it on their tables, e.g. "first(32)". I feel it is better to err on the side of too many stats and dial that back as needed.

@szehon-ho
Member

I think that's an interesting enhancement idea, if I understand correctly: make write.metadata.metrics.max-inferred-column-defaults also accept first or last. An organization might be able to set last(32) as a default at the catalog level (i.e. #4011). On the other hand, sort and partition columns are already auto-promoted to have stats today; I'm curious whether you see a good impact from having stats on other columns?

@szehon-ho
Member

Thinking about it, I guess there are always correlations in the data; some column values may still be usable for filtering if you sort by or partition by correlated columns.

So, it makes sense to me to add more support like first/last, given we already are advertising this option. Will leave to everyone's opinion if the default needs to be changed.
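If the first/last idea were pursued, parsing such a spec could look roughly like this (purely hypothetical syntax and names, not an Iceberg API):

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the proposed first(n)/last(n) syntax.
public class ColumnRangeSpec {
  private static final Pattern SPEC = Pattern.compile("(first|last)\\((\\d+)\\)");

  static <T> List<T> select(List<T> columns, String spec) {
    Matcher m = SPEC.matcher(spec);
    if (!m.matches()) {
      throw new IllegalArgumentException("Invalid spec: " + spec);
    }
    int n = Math.min(Integer.parseInt(m.group(2)), columns.size());
    return m.group(1).equals("first")
        ? columns.subList(0, n)
        : columns.subList(columns.size() - n, columns.size());
  }

  public static void main(String[] args) {
    List<String> cols = List.of("a", "b", "c", "d");
    System.out.println(select(cols, "first(2)")); // [a, b]
    System.out.println(select(cols, "last(2)"));  // [c, d]
  }
}
```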

private static final Logger LOG = LoggerFactory.getLogger(MetricsConfig.class);
private static final Joiner DOT = Joiner.on('.');

// Disable metrics by default for wide tables to prevent excessive metadata
Contributor

@danielcweeks Is this a left-over comment?
