Skip to content

Conversation

@autumnust
Copy link

@autumnust autumnust commented Feb 1, 2022

Identifying commits required

To patch the PR of interests apache/iceberg#2328 I identified four PRs that has a bunch of overlapping with it by scanning through the file git history of HiveTableOperations.java which is the file being modified the most. Specifically, the git history of this file in Li-Iceberg and Apache Iceberg are compared, from which I identified four PRs to be ported:
apache/iceberg#2123 [Push Iceberg table property values to HMS table properties]
apache/iceberg#2252 [Change a key method's signature in HiveTableOperations.java ]
apache/iceberg#2263 [acquireLock method fixing in HiveTableOperations.java ]
apache/iceberg#2329 [Introduce total-files-size snapshot metric and populate HMS]

These are changes need to make HiveTableOperations.java right.
To logically replicate the fix into HiveMetadataPreservingTableOperations.java, did the following:

  • Change a couple of methods' access modifier to make them available in extended class,
  • Refactor HMS client code block into persistTable method, as well as the clean up method in finally block,
  • apply the try-catch of checkCommitStatus in the doCommit method,
  • remove metadataUpdatedSuccessfully which is no longer needed (and the original impl. is buggy since it only checks if the current version is exactly the same as new metadata version instead of checking if current version is in the lineage of new metadata version as the PR 2328 did).

Conflicts

There's no conflicts if all four are ported, in the HiveTableOperations.java We have two commits that are not contributed back to upstream that:

  • simply remove the "private" access modifier from "unlock", "storageDescriptor", "acquireLock" and "setParameters" methods in the HiveTableOperations.java
  • adding a code block for more log message in HiveTableOperations.java

They are easily resolved. Overall, there should be minimal work to bring HiveTableOperations.java exactly the same as upstream.

@shenodaguirguis
Copy link

thanks @autumnust for the quick rb, and the documentation of the changes!
Checks are failing with dependency issue. You didn't change dependencies though, but I re-ran the checks and it still failed. Can you please take a look? Otherwise, it looks good to me.

Iceberg table properties are the canonical source of truth
HMS table properties should be maintained as much as possible to be in sync with the Iceberg table, but it can only happen on a best effort basis
This PR makes the following changes:

Ensures that all Iceberg table properties are propagated to the HMS table during HiveTableOperations commit
All HMS table properties are pushed down to Iceberg as well during table creation (except for metadata location and spec props)
Refactors the various property check assertions scattered throughout various test cases into a single property-focused unit test case
What is left out and should be done in the future:

Push property changes occurring via Hive DDL (ALTER TABLE SET TBLPROPERTIES) down to Iceberg as well. Currently this can't be done reliably because the HiveMetaHook interface only contains a preAlterTable method, but no commitAlterTable method. We'll need to extend this interface and include the change in an upcoming Hive upstream release.

Author: Marton Bod <[email protected]>
PR: apache/iceberg#2123
Backport Reason: To accomdate(I) for fix apache/iceberg#2328
Raw commit message: Addressing apache/iceberg#2249
Backport Reason: Acommodate (II) for the fix apache/iceberg#2328

Author: Marton Bod <[email protected]>
Raw commit message: Currently, there is no way to call unlock if HiveTableOperations.acquireLock fails at waiting for lock on hive table. This PR aims to try to invoke unlock in the finally block.
Backport Reason: Accomodate (III) for the fix apache/iceberg#2328
Author: ZorTsou <[email protected]>
Raw Commit Message: This patch:

Introduces a new snapshot summary metric for total-files-size. It was somehow missing up till now, even though it has its companion metrics added-files-size and removed-files-size. Introducing this total metric makes it consistent with the other 'metric groups'.
On HiveTableOperations commit, we should populate the HMS statistics using these snapshot metrics. Having these stats populated makes the Hive read query planning significantly faster. In some cases, @pvary's research showed that it led to 10x+ improvement on query compilation times, since in the absence of HMS stats the Hive query planner will recursively list the data files to gather their sizes first before execution.

Backport Reason: Accomodate (IV) for the fix apache/iceberg#2328

Author: Marton Bod <[email protected]>
Raw commit message: #2317 - We discovered that Iceberg is currently treating all failures during commit
as full commit failures. This can lead to an unstable/corrupt table if the
catalog was successfully updated and it was only a network or other error
that prevented the client from learning of this. In this state, the client
will attempt to clean up files related to the commit while other clients and the table believe that files are successfully added to the table.

To fix this we change snapshot producer to only do a cleanup when a true CommitFailureException is thrown and stop our HMSTableOperations from removing metadata.json files when an uncertain exception is thrown.

Backport Reason: Bug fix
Author: Russell Spitzer <[email protected]>
@rzhang10
Copy link
Member

rzhang10 commented Feb 2, 2022

thanks @autumnust for the quick rb, and the documentation of the changes! Checks are failing with dependency issue. You didn't change dependencies though, but I re-ran the checks and it still failed. Can you please take a look? Otherwise, it looks good to me.

Seems like the error has to do with "palantir.bintray.com" and I see https://github.com/linkedin/iceberg/pull/88/files removed them, maybe rebase on that and see if it works?

@autumnust
Copy link
Author

Rebased, and logically applied the fix in HiveMetadataPreservingTableOperations.java
Can you take a second look ? @shenodaguirguis @rzhang10

FYI @ZihanLi58 @jack-moseley

@autumnust
Copy link
Author

Also, I decided to not to take down the patch in the HiveTableOperations since they are needed anyway once we decide to move to Apache Iceberg I assume. This also reduces the amount of code that I need to duplicate.

Copy link

@shenodaguirguis shenodaguirguis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @autumnust ! Left a minor comment, otherwise LGTM!

@rzhang10
Copy link
Member

rzhang10 commented Feb 7, 2022

Overall LGTM.

Copy link

@ZihanLi58 ZihanLi58 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks for helping fix this!

@autumnust autumnust merged commit 891354d into linkedin:li-0.11.x Feb 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants