Core/Hive: Introduce total-files-size snapshot metric and populate HMS #2329

marton-bod · 2021-03-12T14:52:04Z

This patch:

Introduces a new snapshot summary metric for total-files-size. It was somehow missing up till now, even though it has its companion metrics added-files-size and removed-files-size. Introducing this total metric makes it consistent with the other 'metric groups'.
On HiveTableOperations commit, we should populate the HMS statistics using these snapshot metrics. Having these stats populated makes the Hive read query planning significantly faster. In some cases, @pvary's research showed that it led to 10x+ improvement on query compilation times, since in the absence of HMS stats the Hive query planner will recursively list the data files to gather their sizes first before execution.
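
For illustration, here is a minimal sketch of how a running total could relate to its companion metrics. The summary keys are the ones named above; the derivation itself (previous total plus added minus removed) and the class and method names are assumptions made for this example, not something this description confirms.

```java
import java.util.HashMap;
import java.util.Map;

public class TotalFilesSizeSketch {
  // Summary keys as named in the PR description.
  static final String ADDED = "added-files-size";
  static final String REMOVED = "removed-files-size";
  static final String TOTAL = "total-files-size";

  // Assumption: new total = previous total + added - removed.
  // Returns a copy of the current summary, with the total set when it can be derived.
  static Map<String, String> withTotal(Map<String, String> previous, Map<String, String> current) {
    Map<String, String> result = new HashMap<>(current);
    String previousTotal = previous.get(TOTAL);
    if (previousTotal == null) {
      // Previous total unknown: leave the total unset rather than guess.
      return result;
    }
    long total = Long.parseLong(previousTotal)
        + Long.parseLong(current.getOrDefault(ADDED, "0"))
        - Long.parseLong(current.getOrDefault(REMOVED, "0"));
    if (total >= 0) {
      result.put(TOTAL, Long.toString(total));
    }
    return result;
  }
}
```

For the first snapshot of a table, a caller of this sketch would pass a previous summary whose total is "0".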

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

pvary · 2021-03-12T16:00:18Z

@aokolnychyi: any concerns about adding TOTAL_FILE_SIZE_PROP to the SnapshotSummary?

The Hive part looks good to me, so I will merge if there are no concerns about the core side.

RussellSpitzer · 2021-03-15T15:31:45Z

This looks good to me

rdblue · 2021-03-15T19:49:56Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

+    // Set the basic statistics
+    parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
+    parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
+    parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));


If the summary property is missing, should we set the Hive property? I think this is correct only if "0" indicates to Hive that the value is not known.

Good question.
Based on this, Hive treats 0 as an invalid statistics value anyway:

    public BasicStats(Partish p) {
      partish = p;
      rowCount = parseLong(StatsSetupConst.ROW_COUNT);
      [..]
      currentNumRows = rowCount;
      [..]
      if (currentNumRows > 0) {
        state = State.COMPLETE;
      } else {
        state = State.NONE;
      }
    }

But I agree that unsetting the value would be more intuitive. What is the content of the SnapshotSummary in case of a new table? Setting 0 there would still be good (even if Hive does not consider it as a valid one)

When creating a table, metadata.currentSnapshot() is null, therefore we won't have a summary and will create an empty map for it. I think @rdblue is right that we should only update the HMS values if they're actually present in the summary.

Should we remove them if they are not present for whatever reason?

What happens if a commit is a metadata only change? Do we still have the summary?

Thanks,
Peter

I checked with a PropertiesUpdate, and it does not produce a new snapshot, so metadata.currentSnapshot() will be the same as before along with its summary object.
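
The direction this thread settles on (copy a statistic into the HMS parameters only when the snapshot summary actually has it, and otherwise leave the existing HMS value alone) can be sketched as below. This is a self-contained illustration using plain maps: the class and helper names are hypothetical, and the literal keys are assumed to be the string values behind the SnapshotSummary and StatsSetupConst constants.

```java
import java.util.HashMap;
import java.util.Map;

public class HmsStatsSketch {
  // Copies one summary value into the HMS parameters only when the summary has it.
  // A missing value leaves the existing HMS parameter untouched.
  static void setIfPresent(Map<String, String> summary, String summaryKey,
                           Map<String, String> parameters, String hmsKey) {
    String value = summary.get(summaryKey);
    if (value != null) {
      parameters.put(hmsKey, value);
    }
  }

  // Assumed key strings: "total-data-files", "total-records", "total-files-size"
  // on the summary side; "numFiles", "numRows", "totalSize" on the HMS side.
  static void setBasicStats(Map<String, String> summary, Map<String, String> parameters) {
    setIfPresent(summary, "total-data-files", parameters, "numFiles");
    setIfPresent(summary, "total-records", parameters, "numRows");
    setIfPresent(summary, "total-files-size", parameters, "totalSize");
  }
}
```

With an empty summary (the newly created table case), the parameters map is returned unchanged, so any values already present stay in place.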

edgarRd · 2021-03-15T20:47:06Z

Thanks for working on this. I think this is great to have the stats in the Hive table metadata.

However, I think it'd still be good to implement InputEstimator in the HiveIcebergStorageHandler to provide the right estimation on the scan using the filter during a join plan. It's a bit tricky in Iceberg since we'd need to compute the split tasks again or cache them. I guess initially computing them again (if we can reuse the code) would be okay to avoid caching, although it could be slow if it's a very selective scan; in that case maybe we could stop early if the scan is too large and fall back to the table stats.
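
A rough sketch of the "stop early and fall back to table stats" idea, written against plain Java types only. It does not use Hive's InputEstimator interface or Iceberg's scan planning; the class name, the method, and the maxSplits cutoff are all hypothetical and exist only to illustrate the control flow.

```java
public class ScanSizeEstimateSketch {
  // Sums the lengths of the planned splits, but gives up once more than
  // maxSplits have been seen and returns the table-level total size instead.
  static long estimate(Iterable<Long> splitLengths, long maxSplits, long tableTotalSize) {
    long total = 0;
    long seen = 0;
    for (long length : splitLengths) {
      if (seen >= maxSplits) {
        // Too many splits to walk cheaply: fall back to the table stats.
        return tableTotalSize;
      }
      seen++;
      total += length;
    }
    return total;
  }
}
```

Passing a lazily evaluated Iterable would let the early return avoid planning the remaining splits, which is the point of the cutoff.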

pvary · 2021-03-16T11:19:28Z

I think it'd still be good to implement InputEstimator in the HiveIcebergStorageHandler to provide the right estimation on the scan using the filter during a join plan. It's a bit tricky [..]

I have checked the Hive code and only InputEstimatorTestClass implements the InputEstimator.
I also checked the usages of the Estimation object. Only getTotalLength is used: once to determine whether we can use a fetch task instead of spawning an MR/Tez job, and then to generate ContentSummary.length. The latter is used in several places, so it can be useful, but we definitely need specific use cases to decide how much effort we should put into calculating the value.

pvary · 2021-03-16T14:12:28Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

+    // Set the basic statistics
+    if (summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP) != null) {
+      parameters.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+    }


Should we remove them if we do not have summary data?

I don't think so. For example, when we create a new table, Hive already puts numRows=0, totalSize=0, etc. into the HMS table params, so we would just end up removing those valid values here. Other than the we-just-created-the-table scenario, I think we should always have the summary object with these 3 values filled out. What do you think?

Makes sense. Hope this is true:

Other than the we-just-created-the-table scenario, I think we should always have the summary object with these 3 values filled out

pvary · 2021-03-17T10:45:42Z

Thanks for the PR @marton-bod!

aokolnychyi · 2021-03-22T21:44:33Z

Late +1 from me too.

@pvary

Raw Commit Message: This patch: Introduces a new snapshot summary metric for total-files-size. It was somehow missing up till now, even though it has its companion metrics added-files-size and removed-files-size. Introducing this total metric makes it consistent with the other 'metric groups'. On HiveTableOperations commit, we should populate the HMS statistics using these snapshot metrics. Having these stats populated makes the Hive read query planning significantly faster. In some cases, @pvary's research showed that it led to 10x+ improvement on query compilation times, since in the absence of HMS stats the Hive query planner will recursively list the data files to gather their sizes first before execution.
Backport Reason: Accommodate (IV) for the fix apache/iceberg#2328
Author: Marton Bod <[email protected]>

Core/Hive: Introduce total-files-size snapshot metric and populate HMS stats on commit

6c83271

github-actions bot added core hive MR labels Mar 12, 2021

pvary reviewed Mar 12, 2021

Only populate totalSize, not the rawDataSize in HMS

b464742

RussellSpitzer approved these changes Mar 15, 2021

rdblue reviewed Mar 15, 2021

marton-bod added 2 commits March 16, 2021 14:27

Only set HMS stats if they're present in summary

dc1600e

Adjust properties hive test

f85fa52

pvary approved these changes Mar 16, 2021

pvary reviewed Mar 16, 2021

pvary approved these changes Mar 16, 2021

pvary merged commit d99f1b6 into apache:master Mar 17, 2021

XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021

Core/Hive: Introduce total-files-size snapshot metric and populate HMS (apache#2329)

99c2fda

coolderli pushed a commit to coolderli/iceberg that referenced this pull request Apr 26, 2021

Core/Hive: Introduce total-files-size snapshot metric and populate HMS (apache#2329)

097df0a

stevenzwu pushed a commit to stevenzwu/iceberg that referenced this pull request Jul 28, 2021

Core/Hive: Introduce total-files-size snapshot metric and populate HMS (apache#2329)

249a30a

autumnust mentioned this pull request Feb 1, 2022

Backport https://github.com/apache/iceberg/pull/2328 and its prerequisites linkedin/iceberg#89

Merged
