-
Notifications
You must be signed in to change notification settings - Fork 3k
Core/Hive: Introduce total-files-size snapshot metric and populate HMS #2329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…S stats on commit
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
Outdated
Show resolved
Hide resolved
|
@aokolnychyi: any concerns about adding The Hive part looks good to me, so I will merge if there is no concerns about the core side. |
|
This looks good to me |
| // Set the basic statistics | ||
| parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0")); | ||
| parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0")); | ||
| parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the summary property is missing, should we set the Hive property? I think this is correct only if "0" indicates to Hive that the value is not known.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question.
Based on this Hive considers 0 as a non valid statistics value anyway:
public BasicStats(Partish p) {
partish = p;
rowCount = parseLong(StatsSetupConst.ROW_COUNT);
[..]
currentNumRows = rowCount;
[..]
if (currentNumRows > 0) {
state = State.COMPLETE;
} else {
state = State.NONE;
}
}
But I agree that unsetting the value would be more intuitive. What is the content of the SnapshotSummary in case of a new table? Setting 0 there would still be good (even if Hive does not consider it as a valid one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When creating a table, metadata.currentSnapshot() is null, therefore we won't have a summary and will create an empty map for it. I think @rdblue is right that we should only update the HMS values if they're actually present in the summary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove them if they are not present for whatever reasons?
What happens if a commit is a metadata only change? Do we still have the summary?
Thanks,
Peter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked with a PropertiesUpdate, and it does not produce a new snapshot, so metadata.currentSnapshot() will be the same as before along with its summary object.
|
Thanks for working on this. I think this is great to have the stats in the Hive table metadata. However, I think it'd still be good to implement |
I have checked the Hive code and only |
| // Set the basic statistics | ||
| if (summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP) != null) { | ||
| parameters.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove them if we do not have summary data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. For example, when we create a new table, Hive already puts numRows=0, totalSize=0, etc. into the HMS table params, so we would just end up removing those valid values here. Other than the we-just-created-the-table scenario, I think we should always have the summary object with these 3 values filled out. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Hope this is true:
Other than the we-just-created-the-table scenario, I think we should always have the summary object with these 3 values filled out
|
Thanks for the PR @marton-bod! |
|
Late +1 from me too. |
Raw Commit Message: This patch: Introduces a new snapshot summary metric for total-files-size. It was somehow missing up till now, even though it has its companion metrics added-files-size and removed-files-size. Introducing this total metric makes it consistent with the other 'metric groups'. On HiveTableOperations commit, we should populate the HMS statistics using these snapshot metrics. Having these stats populated makes the Hive read query planning significantly faster. In some cases, @pvary's research showed that it led to 10x+ improvement on query compilation times, since in the absence of HMS stats the Hive query planner will recursively list the data files to gather their sizes first before execution. Backport Reason: Accomodate (IV) for the fix apache/iceberg#2328 Author: Marton Bod <[email protected]>
Raw Commit Message: This patch: Introduces a new snapshot summary metric for total-files-size. It was somehow missing up till now, even though it has its companion metrics added-files-size and removed-files-size. Introducing this total metric makes it consistent with the other 'metric groups'. On HiveTableOperations commit, we should populate the HMS statistics using these snapshot metrics. Having these stats populated makes the Hive read query planning significantly faster. In some cases, @pvary's research showed that it led to 10x+ improvement on query compilation times, since in the absence of HMS stats the Hive query planner will recursively list the data files to gather their sizes first before execution. Backport Reason: Accomodate (IV) for the fix apache/iceberg#2328 Author: Marton Bod <[email protected]>
Raw Commit Message: This patch: Introduces a new snapshot summary metric for total-files-size. It was somehow missing up till now, even though it has its companion metrics added-files-size and removed-files-size. Introducing this total metric makes it consistent with the other 'metric groups'. On HiveTableOperations commit, we should populate the HMS statistics using these snapshot metrics. Having these stats populated makes the Hive read query planning significantly faster. In some cases, @pvary's research showed that it led to 10x+ improvement on query compilation times, since in the absence of HMS stats the Hive query planner will recursively list the data files to gather their sizes first before execution. Backport Reason: Accomodate (IV) for the fix apache/iceberg#2328 Author: Marton Bod <[email protected]>
This patch:
total-files-size. It was somehow missing up till now, even though it has its companion metricsadded-files-sizeandremoved-files-size. Introducing this total metric makes it consistent with the other 'metric groups'.