(#2317) Stop removal of files when catalog state is uncertain - HiveCatalog #2328
Conversation
```java
      taskOps.commit(base, updated.withUUID());
    });
...
} catch (CommitFailedException commitFailedException) {
```
@rymurr This is something I want to make sure you are ok with. We also have to avoid cleaning up data files, so I changed the rules specified in TableOperations: now, if you throw CFE we clean up; otherwise the client is on their own.
```java
  Exceptions.suppressAndThrow(commitFailedException, this::cleanAll);
} catch (RuntimeException e) {
  Exceptions.suppressAndThrow(e, this::cleanAll);
  LOG.error("Cannot determine whether the commit was successful or not, the underlying data files may or " +
```
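To make the proposed contract concrete, here is a minimal self-contained sketch of the split; persistTable() and cleanAll() are stand-ins for the real HiveTableOperations internals, not the PR's exact code:

```java
import org.apache.iceberg.exceptions.CommitFailedException;

// Illustrative sketch of the rule described above: clean up only on a
// definite CommitFailedException; keep files on an indeterminate failure.
class CommitErrorHandlingSketch {
  void commit() {
    boolean threw = true;
    try {
      persistTable();   // push the new metadata location to the metastore
      threw = false;
    } catch (CommitFailedException e) {
      // Definite failure: the catalog rejected the commit, so it is safe
      // to delete the metadata file written for this attempt.
      cleanAll();
      throw e;
    } catch (RuntimeException e) {
      // Indeterminate failure: the metastore update may have succeeded even
      // though the client saw an error, so keep every file and rethrow.
      throw e;
    } finally {
      if (threw) {
        // release the metastore lock, etc.
      }
    }
  }

  private void persistTable() { /* metastore alter_table call */ }
  private void cleanAll() { /* delete this attempt's metadata file */ }
}
```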
I am afraid this will be dangerous. This means we have to update all places that throw a specific exception like AlreadyExistsException to now throw CommitFailedException. This is error-prone and we potentially lose helpful details. We could throw CommitFailedException and add the specific exception as a cause but the first point still holds.
Also, an error log message will most likely be ignored by the user. This is a case where we really want to propagate as much info as possible.
Would it make sense to introduce a new exception type? E.g., UnknownCommitStateException? That way, we can keep the existing logic for handling exceptions, except in cases where we really don't know what happened.
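For reference, a minimal sketch of what such a type could look like (the PR ultimately named it CommitStateUnknownException, and the real class appends a longer explanatory message):

```java
package org.apache.iceberg.exceptions;

/**
 * Sketch of the proposed exception type. Carrying the original failure as
 * the cause preserves the specific details that would otherwise be lost
 * when wrapping exceptions like AlreadyExistsException.
 */
public class CommitStateUnknownException extends RuntimeException {
  public CommitStateUnknownException(Throwable cause) {
    super(cause.getMessage(), cause);
  }
}
```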
Thoughts, @RussellSpitzer @pvary @rymurr @danielcweeks @Parth-Brahmbhatt?
cc @omalley @rdblue @shardulm94 @rdsr too
I agree with @aokolnychyi here; I think changing the contract for the table ops is dangerous, especially as the change reduces the information available to the user.
I think Anton is right: rather than saying "CommitFailedException is the only valid known failure mode", we should probably say "NewExceptionTypeX means the commit failed in an unknown way and the user may be required to do some immediate intervention".
I think that's fine; I just want to make sure every catalog implements it then. If we always treat all non-CommitFailedException exceptions as unknown commit state, we know we are never deleting state we need. If we instead do it in the opposite direction, we need to be extra sure that all exceptions that may leave an unknown state are marked that way. In my mind that leaves more chance of corruption when someone forgets to mark a particular edge case.
I don't think the property should be persisted at the table level, as a single table could be written from both idempotent and non-idempotent clients (a Flink writer, which is idempotent, plus a background compaction process, which is not). We can either expose the API at the Table or the PendingUpdate level.
In the absence of more external use cases it is hard to say which is worse: corrupting a table so nothing can read or write until manual intervention, or having duplicate data that can go undetected. I would pick duplicate data based on the general use cases I have seen, so if we are voting, I would vote to default the behavior to assuming an idempotent client.
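A hypothetical shape for that operation-level flag; neither this interface nor the idempotent(...) method exists in Iceberg's API, it only illustrates the suggestion above:

```java
// Hypothetical only: sketches an operation-level idempotency hint, as
// opposed to a table-level property that all writers would share.
interface IdempotentUpdateSketch {
  // Declares that replaying this exact commit is safe, so a retry after an
  // unknown-state failure cannot duplicate data.
  IdempotentUpdateSketch idempotent(boolean isIdempotent);

  void commit();
}
```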
An API on PendingUpdate sounds good to me.
One more reason why I'd vote for not deleting by default is this situation:
Also, getting a reply from the metastore or checking the commit status may take a substantial amount of time, allowing a concurrent operation to succeed (if the lock expires). That's the worst that can happen, as we will silently corrupt the table and will detect it only while querying a specific portion of the table. I had to fix such a case, and we were simply lucky that we found it out after only a couple of days, while rewriting manifests.
@Parth-Brahmbhatt one thing to remember is that in our current code we actually get both data duplication and corrupt tables; it just depends on when the failure occurs. I.e., if you fail to check the metastore and lose connection to the file system, you won't clean up and will still throw an error. Or if, say, the commit is successful but the job fails for some other reason (OOM, interrupt, power loss, I spill my coke on the server), we still end up with a failed job which, if we retry, will duplicate data/work.
So you can still get data duplication if a container loses connectivity to the network after committing.
I think there are a couple of issues here that we should address separately.
First, I agree that it is a good idea to add a way to signal that a commit is idempotent. Some operations are already idempotent, like file rewrites because we validate that all of the rewritten files are still part of the table. If we update paths that guarantee idempotent commits to signal this and handle UnknownCommitStateException, then we really reduce the incidence of the problem.
Second, I think we still need to agree on the default behavior. While I really don't like the idea of allowing a retry that will write duplicate data, this discussion has convinced me that silently duplicating data is a better outcome than breaking the table.
In our environment, we use versioned buckets so we can always un-delete metadata files that are missing, but that's not always the case. If those files are actually gone, then it is much worse because you have missing data and don't know which data files were in the missing commit without a lot of work. I think this problem is worse than the duplicate data.
A second compelling argument for changing the default is that deleting the metadata files leaks the problem to other writers. All concurrent processes are blocked if a table is broken, rather than blocking just a single writer.
In the end, I think that the right thing is to not delete the files and to throw UnknownCommitStateException as proposed. That handles interactive cases and also makes it so schedulers can handle a job failure by blocking just a single workflow and not all workflows operating on a table. And idempotent jobs should not be affected.
I would go for an API-level flag that the change is idempotent, since this is more dependent on the client/action than on the actual table.
Also, I would go for a few connection retries and then throwing the UnknownCommitStateException if we are not able to determine the status of the commit. We should fail as fast as possible, so the user is able to mitigate the issue. If they ignore the exception, it is better to have a state we can recover from, so I would keep the files in case we are not able to handle the error cleanly.
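A rough sketch of that retry-then-fail-fast flow; checkCommitSucceeded() and cleanAll() are hypothetical stand-ins, and the retry budget is illustrative:

```java
import org.apache.iceberg.exceptions.CommitStateUnknownException;

// Sketch only: retry the status probe a few times, then fail fast without
// deleting anything whose fate is unknown.
class RetryThenThrowSketch {
  private static final int MAX_STATUS_CHECKS = 3;  // illustrative budget

  void resolveUnknownCommit(RuntimeException original) {
    Boolean committed = null;
    for (int attempt = 0; attempt < MAX_STATUS_CHECKS && committed == null; attempt++) {
      try {
        committed = checkCommitSucceeded();  // probe the catalog for our metadata location
      } catch (RuntimeException e) {
        // catalog still unreachable; loop and try again
      }
    }
    if (committed == null) {
      // Status never determined: keep all files and surface the uncertainty.
      throw new CommitStateUnknownException(original);
    }
    if (!committed) {
      cleanAll();      // provably failed, so cleanup is safe
      throw original;  // propagate the specific failure
    }
    // committed: the commit actually went through; nothing to rethrow
  }

  private boolean checkCommitSucceeded() { return false; }  // hypothetical probe
  private void cleanAll() { }                               // hypothetical cleanup hook
}
```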
```java
public class CommitStateUnknownException extends RuntimeException {

  private static final String COMMON_INFO =
      "Cannot determine whether the commit was successful or not, the underlying data files may or " +
```
Looks descriptive enough!
```java
persistTable(tbl, updateHiveTable);
threw = false;
try {
```
Looks much cleaner now!
```java
}

/**
 * Attempt to load the table and see if any current or past metadata location matches the one we were attempting
```
Thanks for the javadoc! It is going to be helpful for folks who touch this code next time.
* set read timeout on client. This is to address the issue found in apache/iceberg#2328
* fix checkstyle
* code review
* code review pt 2
Handles the case where the Nessie endpoint did not respond, or some other network error made it impossible to detect whether the Nessie server got the request and, more importantly, to get the response. This PR adds a `catch (org.projectnessie.client.http.HttpClientException)` and re-throws it as the new `CommitStateUnknownException`. Also includes related refactoring of `NessieCatalog.dropTable`/`dropTableInner`. Related to apache#2328
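A minimal sketch of that Nessie-side mapping; sendCommitRequest is a stand-in for the real client call, not the verbatim NessieTableOperations code:

```java
import org.apache.iceberg.exceptions.CommitStateUnknownException;
import org.projectnessie.client.http.HttpClientException;

// Sketch of the mapping described above: a transport-level failure means we
// cannot know whether the Nessie server applied the commit, so surface the
// uncertainty instead of treating it as a definite failure.
class NessieCommitSketch {
  void commit(Runnable sendCommitRequest) {
    try {
      sendCommitRequest.run();
    } catch (HttpClientException e) {
      // the request may or may not have reached the server
      throw new CommitStateUnknownException(e);
    }
  }
}
```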
```java
 * Attempt to load the table and see if any current or past metadata location matches the one we were attempting
 * to set. This is used as a last resort when we are dealing with exceptions that may indicate the commit has
 * failed but are not proof that this is the case. Past locations must also be searched on the chance that a second
 * committer was able to successfully commit on top of our commit.
```
Hello, I have a question about the past-locations check here: when checkCommitStatus() is called, we still hold the metastore lock, so is it possible for another committer to commit on top of our commit?
We request an EXCLUSIVE lock before committing. While we hold the lock, no other committer can get another lock on the same table, so no other committer can commit.
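A condensed sketch of the last-resort check the javadoc above describes, simplified from the PR's checkCommitStatus rather than quoted verbatim:

```java
import org.apache.iceberg.TableMetadata;

// Condensed sketch: after an ambiguous failure, reload the table and look
// for our new metadata location among the current and previous locations.
class CommitStatusCheckSketch {
  boolean commitWentThrough(String newMetadataLocation, TableMetadata refreshed) {
    if (newMetadataLocation.equals(refreshed.metadataFileLocation())) {
      return true;  // our commit is the table's current state
    }
    // A later committer may already have committed on top of ours, so our
    // location could have moved into the metadata log history.
    return refreshed.previousFiles().stream()
        .anyMatch(entry -> newMetadataLocation.equals(entry.file()));
  }
}
```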
Iceberg table properties are the canonical source of truth. HMS table properties should be kept in sync with the Iceberg table as much as possible, but that can only happen on a best-effort basis. This PR makes the following changes: ensures that all Iceberg table properties are propagated to the HMS table during HiveTableOperations commit; pushes all HMS table properties down to Iceberg as well during table creation (except for the metadata location and spec props); refactors the various property-check assertions scattered throughout various test cases into a single property-focused unit test case. What is left out and should be done in the future: push property changes occurring via Hive DDL (ALTER TABLE SET TBLPROPERTIES) down to Iceberg as well. Currently this can't be done reliably because the HiveMetaHook interface only contains a preAlterTable method, but no commitAlterTable method. We'll need to extend this interface and include the change in an upcoming Hive upstream release. Author: Marton Bod <[email protected]> PR: apache/iceberg#2123 Backport Reason: Accommodate (I) for the fix apache/iceberg#2328
Raw commit message: Addressing apache/iceberg#2249. Backport Reason: Accommodate (II) for the fix apache/iceberg#2328. Author: Marton Bod <[email protected]>
Raw commit message: Currently, there is no way to call unlock if HiveTableOperations.acquireLock fails while waiting for a lock on a Hive table. This PR aims to invoke unlock in the finally block. Backport Reason: Accommodate (III) for the fix apache/iceberg#2328. Author: ZorTsou <[email protected]>
Raw commit message: This patch introduces a new snapshot summary metric for total-files-size. It was somehow missing up till now, even though it has its companion metrics added-files-size and removed-files-size; introducing this total metric makes it consistent with the other 'metric groups'. On HiveTableOperations commit, we populate the HMS statistics using these snapshot metrics. Having these stats populated makes Hive read query planning significantly faster. In some cases, @pvary's research showed that it led to a 10x+ improvement in query compilation times, since in the absence of HMS stats the Hive query planner will recursively list the data files to gather their sizes before execution. Backport Reason: Accommodate (IV) for the fix apache/iceberg#2328. Author: Marton Bod <[email protected]>
Raw commit message: #2317 - We discovered that Iceberg currently treats all failures during commit as full commit failures. This can lead to an unstable/corrupt table if the catalog was successfully updated and only a network or other error prevented the client from learning of this. In this state, the client will attempt to clean up files related to the commit while other clients, and the table itself, believe those files were successfully added. To fix this, we change SnapshotProducer to only do a cleanup when a true CommitFailedException is thrown, and stop HiveTableOperations from removing metadata.json files when an uncertain exception is thrown. Backport Reason: Bug fix. Author: Russell Spitzer <[email protected]>
#2317 - We discovered that Iceberg currently treats all failures during commit as full commit failures. This can lead to an unstable/corrupt table if the catalog was successfully updated and only a network or other error prevented the client from learning of this. In this state, the client will attempt to clean up files related to the commit while other clients, and the table itself, believe those files were successfully added.
To fix this, we change SnapshotProducer to only do a cleanup when a true CommitFailedException is thrown, and stop HiveTableOperations from removing metadata.json files when an uncertain exception is thrown.
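A minimal sketch of the SnapshotProducer-side rule this commit message describes; cleanUncommitted() is a stand-in for the real cleanup hook, not the PR's exact code:

```java
import org.apache.iceberg.exceptions.CommitFailedException;
import org.apache.iceberg.exceptions.CommitStateUnknownException;

// Sketch: only a definite CommitFailedException triggers cleanup of
// uncommitted manifests; an unknown commit state leaves every file in
// place for manual recovery.
class SnapshotProducerCleanupSketch {
  void commitWithCleanup(Runnable doCommit) {
    try {
      doCommit.run();
    } catch (CommitStateUnknownException e) {
      throw e;  // state unknown: keep all files, the caller must intervene
    } catch (CommitFailedException e) {
      cleanUncommitted();  // definite failure: safe to remove this attempt's manifests
      throw e;
    }
  }

  private void cleanUncommitted() { /* delete manifests written for this attempt */ }
}
```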