Backport https://github.com/apache/iceberg/pull/2328 and its prerequisites #89
thanks @autumnust for the quick rb, and the documentation of the changes!
Iceberg table properties are the canonical source of truth. HMS table properties should be kept in sync with the Iceberg table as much as possible, but this can only happen on a best-effort basis.

This PR makes the following changes:
- Ensures that all Iceberg table properties are propagated to the HMS table during HiveTableOperations commit
- Pushes all HMS table properties down to Iceberg as well during table creation (except for metadata location and spec props)
- Refactors the property check assertions scattered throughout various test cases into a single property-focused unit test case

What is left out and should be done in the future:
- Push property changes occurring via Hive DDL (ALTER TABLE SET TBLPROPERTIES) down to Iceberg as well. Currently this can't be done reliably because the HiveMetaHook interface only contains a preAlterTable method, but no commitAlterTable method. We'll need to extend this interface and include the change in an upcoming Hive upstream release.

Author: Marton Bod <[email protected]>
PR: apache/iceberg#2123
Backport Reason: Accommodate (I) the fix apache/iceberg#2328
Raw commit message: Addressing apache/iceberg#2249
Backport Reason: Accommodate (II) the fix apache/iceberg#2328
Author: Marton Bod <[email protected]>
Raw commit message: Currently, there is no way to call unlock if HiveTableOperations.acquireLock fails while waiting for the lock on the Hive table. This PR tries to invoke unlock in the finally block.
Backport Reason: Accommodate (III) the fix apache/iceberg#2328
Author: ZorTsou <[email protected]>
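The unlock-in-finally idea can be sketched roughly as follows. This is only an illustration, not the upstream code: `SketchLockClient`, `RecordingLockClient`, and `LockingCommitSketch` are hypothetical names, and the real HiveTableOperations talks to the metastore client rather than to an interface like this one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a metastore lock API; not the real Hive client. */
interface SketchLockClient {
  long requestLock(String table);          // registers a lock request and returns its id
  boolean waitUntilAcquired(long lockId);  // false if the wait timed out
  void unlock(long lockId);
}

/** In-memory fake that records unlock calls, for demonstration only. */
final class RecordingLockClient implements SketchLockClient {
  final List<Long> unlocked = new ArrayList<>();
  private final boolean grantLock;

  RecordingLockClient(boolean grantLock) {
    this.grantLock = grantLock;
  }

  @Override
  public long requestLock(String table) {
    return 42L;
  }

  @Override
  public boolean waitUntilAcquired(long lockId) {
    return grantLock;
  }

  @Override
  public void unlock(long lockId) {
    unlocked.add(lockId);
  }
}

final class LockingCommitSketch {
  static void commitWithLock(SketchLockClient client, String table, Runnable commit) {
    Long lockId = null; // stays null only if the lock request itself failed
    try {
      lockId = client.requestLock(table);
      if (!client.waitUntilAcquired(lockId)) {
        throw new IllegalStateException("Timed out waiting for lock on " + table);
      }
      commit.run();
    } finally {
      // Release even when waiting failed: the request may still be queued.
      if (lockId != null) {
        client.unlock(lockId);
      }
    }
  }
}
```

Keeping the lock id in a variable declared outside the try block is what lets the finally block release a lock that was requested but never granted.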
Raw commit message: This patch:
- Introduces a new snapshot summary metric for total-files-size. It was missing until now, even though its companion metrics added-files-size and removed-files-size already exist. Introducing this total metric makes it consistent with the other 'metric groups'.
- On HiveTableOperations commit, we should populate the HMS statistics using these snapshot metrics. Having these stats populated makes Hive read query planning significantly faster. In some cases, @pvary's research showed that it led to a 10x+ improvement in query compilation times, since in the absence of HMS stats the Hive query planner recursively lists the data files to gather their sizes before execution.

Backport Reason: Accommodate (IV) the fix apache/iceberg#2328
Author: Marton Bod <[email protected]>
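As a rough illustration of copying snapshot metrics into HMS table parameters, a minimal sketch follows. Only total-files-size is named in the commit message; the other summary keys and all HMS parameter names used here (totalSize, numFiles, numRows) are assumptions for the example, and `HmsStatsSketch` is a hypothetical class, not part of Iceberg or Hive.

```java
import java.util.HashMap;
import java.util.Map;

final class HmsStatsSketch {
  // Assumed mapping from snapshot summary keys to HMS parameter names.
  private static final String[][] SUMMARY_TO_HMS = {
    {"total-files-size", "totalSize"},
    {"total-data-files", "numFiles"},
    {"total-records", "numRows"},
  };

  /** Copies the known metrics into hmsParams and drops stale values for missing ones. */
  static void populateStats(Map<String, String> snapshotSummary, Map<String, String> hmsParams) {
    for (String[] pair : SUMMARY_TO_HMS) {
      String value = snapshotSummary.get(pair[0]);
      if (value != null) {
        hmsParams.put(pair[1], value);
      } else {
        // Design choice in this sketch: a missing metric removes the old stat
        // rather than leaving a value that may no longer be accurate.
        hmsParams.remove(pair[1]);
      }
    }
  }

  /** Small usage example: returns the HMS parameters for a sample summary. */
  static Map<String, String> example() {
    Map<String, String> summary = new HashMap<>();
    summary.put("total-files-size", "1024");
    Map<String, String> params = new HashMap<>();
    params.put("numRows", "7"); // stale value with no matching metric in the summary
    populateStats(summary, params);
    return params;
  }
}
```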
Raw commit message: #2317 - We discovered that Iceberg is currently treating all failures during commit as full commit failures. This can lead to an unstable/corrupt table if the catalog was successfully updated and it was only a network or other error that prevented the client from learning of this. In this state, the client will attempt to clean up files related to the commit while other clients and the table believe that files are successfully added to the table.
To fix this we change snapshot producer to only do a cleanup when a true CommitFailureException is thrown and stop our HMSTableOperations from removing metadata.json files when an uncertain exception is thrown.
Backport Reason: Bug fix
Author: Russell Spitzer <[email protected]>
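The distinction between a definite failure and an uncertain one can be sketched like this. It is a simplified illustration under stated assumptions: `DefiniteCommitFailure` and `CleanupOnFailureSketch` are hypothetical names standing in for the exception and producer logic mentioned in the commit message, and file deletion is reduced to a callback.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical marker for "the catalog definitely rejected this commit". */
final class DefiniteCommitFailure extends RuntimeException {
  DefiniteCommitFailure(String message) {
    super(message);
  }
}

final class CleanupOnFailureSketch {
  /**
   * Runs the catalog update and deletes the files written for this commit
   * only when the update is known to have failed.
   */
  static void commit(Runnable catalogUpdate, List<String> newFiles, Consumer<String> deleteFile) {
    try {
      catalogUpdate.run();
    } catch (DefiniteCommitFailure e) {
      // The catalog was not updated, so nothing references the new files.
      newFiles.forEach(deleteFile);
      throw e;
    }
    // Any other exception propagates without cleanup: the update may have
    // succeeded, and deleting files could corrupt the table.
  }
}
```

A network error thrown from `catalogUpdate` would leave every file in place here, which matches the behaviour the commit message describes for uncertain exceptions.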
Seems like the error has to do with "palantir.bintray.com" and I see https://github.com/linkedin/iceberg/pull/88/files removed them, maybe rebase on that and see if it works?
d6256fd to 7ce5de4
Rebased, and logically applied the fix in |
Also, I decided not to take down the patches in HiveTableOperations, since they are needed anyway once we decide to move to Apache Iceberg, I assume. This also reduces the amount of code that I need to duplicate.
shenodaguirguis
left a comment
thanks @autumnust ! Left a minor comment, otherwise LGTM!
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveMetadataPreservingTableOperations.java
Overall LGTM.
ZihanLi58
left a comment
+1, thanks for helping fix this!
Identifying commits required
To backport the PR of interest, apache/iceberg#2328, I identified four PRs that overlap with it substantially by scanning the git history of HiveTableOperations.java, which is the file modified the most. Specifically, I compared the git history of this file in Li-Iceberg and Apache Iceberg, and identified four PRs to be ported:
- apache/iceberg#2123 [Push Iceberg table property values to HMS table properties]
- apache/iceberg#2252 [Change a key method's signature in HiveTableOperations.java]
- apache/iceberg#2263 [acquireLock method fix in HiveTableOperations.java]
- apache/iceberg#2329 [Introduce total-files-size snapshot metric and populate HMS]

These changes are needed to get HiveTableOperations.java right.
To logically replicate the fix into HiveMetadataPreservingTableOperations.java, I did the following:
- persistTable method, as well as the cleanup in the finally block,
- checkCommitStatus in the doCommit method,
- metadataUpdatedSuccessfully, which is no longer needed (and the original implementation is buggy, since it only checks whether the current version is exactly the same as the new metadata version, instead of checking whether the current version is in the lineage of the new metadata version, as PR 2328 does).

Conflicts
There are no conflicts if all four are ported, in HiveTableOperations.java. We have two commits that are not contributed back to upstream that:
- HiveTableOperations.java
- HiveTableOperations.java

They are easily resolved. Overall, there should be minimal work to bring HiveTableOperations.java exactly in line with upstream.
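
The difference between the exact-match check and the lineage check mentioned above can be illustrated with a small sketch. This is one reading of that description, not the code from PR 2328: `CommitStatusSketch` and its parameters are hypothetical, and metadata versions are reduced to plain location strings.

```java
import java.util.List;

final class CommitStatusSketch {
  /** Exact-match check: misses a commit that another writer has already built on. */
  static boolean exactMatch(String newMetadataLocation, String currentMetadataLocation) {
    return newMetadataLocation.equals(currentMetadataLocation);
  }

  /**
   * Lineage check: the commit counts as successful if the new metadata location
   * is the table's current location or appears among its earlier locations.
   */
  static boolean inLineage(
      String newMetadataLocation,
      String currentMetadataLocation,
      List<String> previousMetadataLocations) {
    return newMetadataLocation.equals(currentMetadataLocation)
        || previousMetadataLocations.contains(newMetadataLocation);
  }
}
```

If a second writer commits right after ours, the table's current location is already one step ahead; only the lineage check still recognizes our commit as applied.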