Core: Update MetricsConfig to use a default for first 32 columns #5215
public static final String METADATA_DELETE_AFTER_COMMIT_ENABLED = "write.metadata.delete-after-commit.enabled";
public static final boolean METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT = false;

public static final String METRICS_MAX_INFERRED_COLUMN_DEFAULTS =
nit: INFERRED doesn't really make sense to me in this context. We're not inferring, we're actually explicit here.
I debated what to call this. The problem is that we are defaulting the default. You can explicitly set a default using write.metadata.metrics.default, and that will apply to all columns. If you don't set a default we infer one for you, but only for the first 32 columns (at least, after this PR). That's why I used "inferred" -- I thought it was better than max-defaulted-default-columns.
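To make the explicit/inferred distinction concrete, here is a sketch of the relevant table properties. The property names follow Iceberg's `write.metadata.metrics.*` family mentioned in this thread; the column name and mode values are illustrative, not taken from this PR.

```properties
# Explicit default: applies to ALL columns; the inferred-default limit does not affect it
write.metadata.metrics.default=truncate(16)

# Per-column override for a single column (column name is illustrative)
write.metadata.metrics.column.event_ts=full
```

When neither property is set for a column, Iceberg infers a default mode, and after this PR that inferred default applies only to the first 32 columns.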
Agree with this table property; initially I had made one, but it was taken out during the discussions. Indeed it's a bit of a confusing config, but I don't see any other great option.
The naming felt a little bit confusing to me too. After I read the explanation, it started to make more sense. However, I am still not sure inferred is the right word. Technically, we infer defaults for all columns (after this limit it just becomes none). To me, this is more about limiting the number of columns for which we persist metrics by default. Can the property name revolve around persist rather than infer?
The problem with "persist" and similar is that it misses the distinction between an explicit default (when write.metadata.metrics.default is set) and an implicit default that comes from Iceberg. I think the right behavior is to preserve what we currently do, which is to use the explicit default for all columns. But that means that this property should obviously not apply to the explicit default. That's why I used "inferred default".
What about changing this to include unconfigured or missing? Something like missing-mode-limit?
Let's just keep it as-is then. I don't think it is a big deal.
szehon-ho left a comment
Yeah, this makes sense, thanks for making the change (I should have considered this case earlier). Added some small comments.
   */
- private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
+ private static MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order) {
    int maxInferredDefaultColumns = PropertyUtil.propertyAsInt(props,
Add a precondition that it's >= 0?
I'm adding a warning, but I don't think that we should fail because of an invalid value here.
Looks like an invalid from/to will trigger an exception in subList(from, to) anyway if from >= to; I was thinking a precondition would make the message clearer.
I added a warning and set it to the default if it is invalid.
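The warn-and-fall-back behavior described here can be sketched as follows. This is a minimal illustration of the pattern, not Iceberg's actual implementation; the class and method names are hypothetical.

```java
public class MetricsLimitSketch {
  static final int DEFAULT_LIMIT = 32;

  // Sketch (not Iceberg's code): an invalid (negative) configured value
  // logs a warning and reverts to the default instead of failing the write.
  static int sanitizedLimit(int configured) {
    if (configured < 0) {
      System.err.println("Invalid metrics column limit " + configured
          + "; falling back to default " + DEFAULT_LIMIT);
      return DEFAULT_LIMIT;
    }
    return configured;
  }
}
```

This matches the conclusion of the thread: surface the problem to the operator, but do not fail the commit over a bad table property.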
} else {
  // an inferred default mode is applied to the first few columns, up to the limit
  Schema subSchema = new Schema(schema.columns().subList(0, maxInferredDefaultColumns));
Quick warning: subList returns a view on top of the original list, so any subsequent changes are reflected in both. This does not seem to cause issues here, but I thought I'd better mention it.
This is a temporary schema so it should be fine.
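The view semantics the reviewer mentions can be shown in isolation with plain `java.util.List` (the column names here are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class SubListView {
  public static void main(String[] args) {
    List<String> columns = new ArrayList<>(List.of("id", "name", "ts"));
    List<String> firstTwo = columns.subList(0, 2); // a view, not a copy
    columns.set(0, "renamed");                     // non-structural change, visible via the view
    System.out.println(firstTwo.get(0));           // the view reflects the change
  }
}
```

Note that only non-structural changes (like `set`) stay safe; structurally modifying the backing list after taking the subList would make later subList operations throw ConcurrentModificationException.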
What if I have a highly nested schema? The number of stored metrics can be way more than 32 in that case?
Yes, but this is the current behavior. We use the top-level columns for the current check.
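The nested-schema concern can be illustrated with a toy column model (this is not Iceberg's Schema API, just a hypothetical structure for counting): a limit applied to top-level columns can still admit far more leaf fields, because a single struct column may contain many leaves.

```java
import java.util.List;

public class ColumnCounting {
  // Toy model: a column is a primitive (no children) or a struct of nested columns.
  record Column(String name, List<Column> children) {
    int leafCount() {
      return children.isEmpty()
          ? 1
          : children.stream().mapToInt(Column::leafCount).sum();
    }
  }

  public static void main(String[] args) {
    Column id = new Column("id", List.of());
    Column payload = new Column("payload", List.of(
        new Column("a", List.of()),
        new Column("b", List.of()),
        new Column("c", List.of())));
    List<Column> topLevel = List.of(id, payload);
    int leaves = topLevel.stream().mapToInt(Column::leafCount).sum();
    // 2 top-level columns, but 4 leaf fields that could carry metrics
    System.out.println(topLevel.size() + " top-level columns, " + leaves + " leaf fields");
  }
}
```

So a table with 32 top-level struct columns can still store metrics for many more leaf fields; as the author notes, counting top-level columns for the limit check is the pre-existing behavior.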
This is already merged, but I thought I'd leave feedback anyway in case it is useful. As a data engineer, many tables I have maintained have more than 32 top-level columns. Columns used for partitioning, sorting, auditing, and so forth are often put at the end of a table schema, yet these are some of the most frequently used in filtering. Also, additional columns are generally added at the end of the schema. The assumption that the first columns in a table schema are the most important to have stats on is not always accurate. In testing 0.14, I ran into missing stats on tables, which was confusing and difficult to debug. I imagine for those new to Iceberg, who are most likely to leave settings at the default, it would be even more confusing. I feel a more sensible default is to leave it the same as previous Iceberg versions (i.e., no column limit). Then an option could be introduced to limit the number of columns, so those who prefer can set it on their tables, e.g. "first(32)". I feel it is better to err on the side of too many stats and dial that back as needed.
I think that's an interesting enhancement idea, if I understand correctly: make write.metadata.metrics.max-inferred-column-defaults also accept first or last. An organization might be able to set last(32) as a default at the catalog level (i.e., #4011). On the other hand, sort and partition columns are already auto-promoted to have stats today; curious, do you see a good impact from having stats on other columns?
Thinking about it, I guess there are always correlations in the data; some column values may still be usable for filtering if you sort by or partition by correlated columns. So it makes sense to me to add more support like first/last, given we are already advertising this option. I'll leave it to everyone's opinion whether the default needs to be changed.
private static final Logger LOG = LoggerFactory.getLogger(MetricsConfig.class);
private static final Joiner DOT = Joiner.on('.');

// Disable metrics by default for wide tables to prevent excessive metadata
@danielcweeks Is this a left-over comment?
#3959 updated MetricsConfig to stop writing metrics by default for tables with more than 32 columns. The intent was to avoid storing too much metadata as stats in manifest files, but the implementation stopped writing metrics for all columns once a table reached 32 columns, which caused a large change in table performance after a table grew to 32+ columns.

This updates the logic so that the first 32 columns still have metrics, but no new metrics are stored for columns after that point unless the table has a global default set or columns are individually configured.
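The resolution order the PR description implies can be sketched as follows. This is a simplified illustration, assuming three inputs: explicit per-column modes, an inferred default mode, and the column limit; the names and the string modes are hypothetical, not Iceberg's internal API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DefaultModeSketch {
  // Sketch of the described behavior (not Iceberg's code): the first `limit`
  // columns get the inferred default mode, later columns get "none", and an
  // explicit per-column mode always wins regardless of position.
  static Map<String, String> resolveModes(List<String> columns,
                                          Map<String, String> explicitModes,
                                          String inferredDefault,
                                          int limit) {
    Map<String, String> modes = new HashMap<>();
    for (int i = 0; i < columns.size(); i++) {
      String col = columns.get(i);
      String explicit = explicitModes.get(col);
      if (explicit != null) {
        modes.put(col, explicit);                       // per-column config wins
      } else {
        modes.put(col, i < limit ? inferredDefault : "none");
      }
    }
    return modes;
  }
}
```

With a limit of 32 and no explicit settings, this reproduces the fixed behavior: columns 1-32 keep their inferred metrics mode, and columns beyond that silently get none, rather than the whole table losing metrics.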