@marton-bod
Collaborator

@marton-bod marton-bod commented Mar 12, 2021

This patch:

  • Introduces a new snapshot summary metric, total-files-size. It was missing until now, even though its companion metrics added-files-size and removed-files-size already exist. Introducing this total metric makes it consistent with the other 'metric groups'.
  • On HiveTableOperations commit, we should populate the HMS statistics using these snapshot metrics. Having these stats populated makes the Hive read query planning significantly faster. In some cases, @pvary's research showed that it led to 10x+ improvement on query compilation times, since in the absence of HMS stats the Hive query planner will recursively list the data files to gather their sizes first before execution.
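
As a rough illustration of how a running total relates to its added/removed companions, here is a minimal sketch. The three metric names come from the description above; the class `TotalFileSizeSketch`, the method `withTotal`, and the rule that an unknown previous total stays unset are assumptions made for this example, not Iceberg's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not Iceberg's SnapshotSummary code.
final class TotalFileSizeSketch {
  static final String ADDED = "added-files-size";
  static final String REMOVED = "removed-files-size";
  static final String TOTAL = "total-files-size";

  private TotalFileSizeSketch() {}

  // Returns a copy of `current` with the running total filled in when it can be computed.
  // For a first snapshot, the caller is expected to pass a previous total of "0".
  static Map<String, String> withTotal(Map<String, String> previous, Map<String, String> current) {
    Map<String, String> result = new HashMap<>(current);
    String previousTotal = previous.get(TOTAL);
    if (previousTotal == null) {
      // Assumption: an unknown previous total stays unknown instead of being guessed.
      return result;
    }
    long added = Long.parseLong(current.getOrDefault(ADDED, "0"));
    long removed = Long.parseLong(current.getOrDefault(REMOVED, "0"));
    result.put(TOTAL, String.valueOf(Long.parseLong(previousTotal) + added - removed));
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> previous = Map.of(TOTAL, "1000");
    Map<String, String> current = Map.of(ADDED, "300", REMOVED, "100");
    // New total = previous total + added - removed.
    System.out.println(withTotal(previous, current).get(TOTAL));
  }
}
```

Keeping the total in the summary means a consumer such as the HMS commit path can read one value instead of summing file sizes itself.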

@pvary
Contributor

pvary commented Mar 12, 2021

@aokolnychyi: any concerns about adding TOTAL_FILE_SIZE_PROP to the SnapshotSummary?

The Hive part looks good to me, so I will merge if there are no concerns about the core side.

@RussellSpitzer
Member

This looks good to me

// Set the basic statistics
parameters.put(StatsSetupConst.NUM_FILES, summary.getOrDefault(SnapshotSummary.TOTAL_DATA_FILES_PROP, "0"));
parameters.put(StatsSetupConst.ROW_COUNT, summary.getOrDefault(SnapshotSummary.TOTAL_RECORDS_PROP, "0"));
parameters.put(StatsSetupConst.TOTAL_SIZE, summary.getOrDefault(SnapshotSummary.TOTAL_FILE_SIZE_PROP, "0"));
Contributor

If the summary property is missing, should we set the Hive property? I think this is correct only if "0" indicates to Hive that the value is not known.

Contributor

Good question.
Based on this, Hive treats 0 as an invalid statistics value anyway:

  public BasicStats(Partish p) {
    partish = p;

    rowCount = parseLong(StatsSetupConst.ROW_COUNT);
[..]
    currentNumRows = rowCount;
[..]
    if (currentNumRows > 0) {
      state = State.COMPLETE;
    } else {
      state = State.NONE;
    }
  }

But I agree that unsetting the value would be more intuitive. What is the content of the SnapshotSummary in the case of a new table? Setting 0 there would still be fine (even if Hive does not consider it a valid value).
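
To make the consequence of the quoted constructor concrete, here is a small self-contained sketch of the same decision. `BasicStatsStateSketch` and `stateFor` are hypothetical names, `"numRows"` is assumed to be the literal behind `StatsSetupConst.ROW_COUNT`, and treating a missing or non-numeric value as 0 is an assumption of this example, not a statement about Hive's `parseLong`.

```java
import java.util.Map;

// Hypothetical stand-in for the quoted BasicStats constructor logic; not Hive's class.
final class BasicStatsStateSketch {
  enum State { NONE, COMPLETE }

  private BasicStatsStateSketch() {}

  // Assumption: a missing or non-numeric "numRows" value is treated as 0.
  static State stateFor(Map<String, String> tableParams) {
    long rowCount = parseOrZero(tableParams.get("numRows"));
    // Mirrors the excerpt: only a strictly positive row count counts as complete stats.
    return rowCount > 0 ? State.COMPLETE : State.NONE;
  }

  private static long parseOrZero(String value) {
    if (value == null) {
      return 0L;
    }
    try {
      return Long.parseLong(value);
    } catch (NumberFormatException e) {
      return 0L;
    }
  }

  public static void main(String[] args) {
    System.out.println(stateFor(Map.of()));                // NONE
    System.out.println(stateFor(Map.of("numRows", "0")));  // NONE
    System.out.println(stateFor(Map.of("numRows", "42"))); // COMPLETE
  }
}
```

Under these assumptions, writing "0" and leaving the parameter unset lead to the same state, which is why the default value is harmless for planning even if unsetting reads more naturally.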

Collaborator Author

When creating a table, metadata.currentSnapshot() is null, therefore we won't have a summary and will create an empty map for it. I think @rdblue is right that we should only update the HMS values if they're actually present in the summary.

Contributor

Should we remove them if they are not present for whatever reason?

What happens if a commit is a metadata only change? Do we still have the summary?

Thanks,
Peter

Collaborator Author

I checked with a PropertiesUpdate, and it does not produce a new snapshot, so metadata.currentSnapshot() will be the same as before along with its summary object.

@edgarRd
Contributor

edgarRd commented Mar 15, 2021

Thanks for working on this. I think it's great to have the stats in the Hive table metadata.

However, I think it'd still be good to implement InputEstimator in the HiveIcebergStorageHandler to provide the right estimation on the scan using the filter during a join plan. It's a bit tricky in Iceberg since we'd need to compute the split tasks again or cache them. I guess initially computing them again (if we can reuse the code) would be okay to avoid caching, although it could be slow if it's a very selective scan. In that case, maybe we could stop early if the scan is too large and fall back to the table stats.

@pvary
Contributor

pvary commented Mar 16, 2021

> I think it'd still be good to implement InputEstimator in the HiveIcebergStorageHandler to provide the right estimation on the scan using the filter during a join plan. It's a bit tricky [..]

I have checked the Hive code, and only InputEstimatorTestClass implements InputEstimator.
I also checked the usages of the Estimation object: only getTotalLength is used, once to determine whether we can use a fetch task instead of spawning an MR/Tez job, and once to generate ContentSummary.length. The latter is used in several places, so it can be useful, but we definitely need specific use cases to decide how much effort we should put into calculating the value.

// Set the basic statistics
if (summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP) != null) {
  parameters.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
}
Contributor

Should we remove them if we do not have summary data?

Collaborator Author

I don't think so. For example, when we create a new table, Hive already puts numRows=0, totalSize=0, etc. into the HMS table params, so we would just end up removing those valid values here. Other than the we-just-created-the-table scenario, I think we should always have the summary object with these 3 values filled out. What do you think?
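
The "overwrite only when present, never remove" behavior argued for here can be sketched as follows. This is an illustration under stated assumptions, not the merged code: `HmsStatsSketch`, `setBasicStats`, and `copyIfPresent` are hypothetical names, and the string literals are assumed to match the `StatsSetupConst` and `SnapshotSummary` constants used in the diff.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; plain maps stand in for the HMS table parameters and snapshot summary.
final class HmsStatsSketch {
  private HmsStatsSketch() {}

  // Copies each statistic only when the summary carries it; existing HMS values are never removed.
  static void setBasicStats(Map<String, String> hmsParams, Map<String, String> summary) {
    // Assumed literals for NUM_FILES, ROW_COUNT, TOTAL_SIZE and the matching summary properties.
    copyIfPresent(summary, "total-data-files", hmsParams, "numFiles");
    copyIfPresent(summary, "total-records", hmsParams, "numRows");
    copyIfPresent(summary, "total-files-size", hmsParams, "totalSize");
  }

  private static void copyIfPresent(
      Map<String, String> from, String fromKey, Map<String, String> to, String toKey) {
    String value = from.get(fromKey);
    if (value != null) {
      to.put(toKey, value);
    }
  }

  public static void main(String[] args) {
    // A freshly created table: Hive has already written zeros and there is no snapshot yet.
    Map<String, String> hmsParams = new HashMap<>(Map.of("numRows", "0", "totalSize", "0"));
    setBasicStats(hmsParams, Map.of());
    System.out.println(hmsParams.get("numRows")); // still 0, nothing was removed

    // After a commit that produced a snapshot summary.
    setBasicStats(hmsParams, Map.of("total-records", "500", "total-files-size", "2048"));
    System.out.println(hmsParams.get("numRows"));   // 500
    System.out.println(hmsParams.get("totalSize")); // 2048
  }
}
```

With an empty summary the call is a no-op, so the zeros Hive wrote at table creation survive, and a metadata-only commit simply rewrites the values from the unchanged current snapshot.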

Contributor

Makes sense. Hope this is true:

> Other than the we-just-created-the-table scenario, I think we should always have the summary object with these 3 values filled out

@pvary pvary merged commit d99f1b6 into apache:master Mar 17, 2021
@pvary
Contributor

pvary commented Mar 17, 2021

Thanks for the PR @marton-bod!

XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021
@aokolnychyi
Contributor

Late +1 from me too.

coolderli pushed a commit to coolderli/iceberg that referenced this pull request Apr 26, 2021
stevenzwu pushed a commit to stevenzwu/iceberg that referenced this pull request Jul 28, 2021
autumnust added a commit to autumnust/iceberg-1 that referenced this pull request Feb 1, 2022
Raw Commit Message: This patch:

Introduces a new snapshot summary metric for total-files-size. It was somehow missing up till now, even though it has its companion metrics added-files-size and removed-files-size. Introducing this total metric makes it consistent with the other 'metric groups'.
On HiveTableOperations commit, we should populate the HMS statistics using these snapshot metrics. Having these stats populated makes the Hive read query planning significantly faster. In some cases, @pvary's research showed that it led to 10x+ improvement on query compilation times, since in the absence of HMS stats the Hive query planner will recursively list the data files to gather their sizes first before execution.

Backport Reason: Accommodate (IV) for the fix apache/iceberg#2328

Author: Marton Bod <[email protected]>
autumnust added a commit to autumnust/iceberg-1 that referenced this pull request Feb 3, 2022
autumnust added a commit to linkedin/iceberg that referenced this pull request Feb 8, 2022