Skip to content

Conversation

@yyanyy
Copy link
Contributor

@yyanyy yyanyy commented Apr 1, 2021

  • same fix as in (#2317) Stop removal of files when catalog state is uncertain - HiveCatalog #2328 for glue catalog. This case in theory should happen less often for Glue comparing to HMS unless there's internet connectivity issue with the current host at the time the commit is in progress, but still want to handle regardless.
  • in theory, certain exception types from glue may not worth catalog recheck, e.g. client side exception due to invalid input. Personally I think an extra glue check may be cheap enough comparing to the risk of missing a valid case handling and code complexity of listing all kinds of exception type, but I don't really feel strongly either way and comments are welcome.
  • moved checkCommitStatus() from HiveTableOperations to BaseMetastoreTableOperations to allow code reuse.
  • tests in GlueCatalogCommitFailureTest are heavily based on (#2317) Stop removal of files when catalog state is uncertain - HiveCatalog #2328; wanted to abstract them but since both classes extends a base class, and the shared code are not a lot, decided to duplicate the logic instead.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me. @jackye1995 or @yyanyy, can you confirm that the integration tests pass?

@yyanyy
Copy link
Contributor Author

yyanyy commented Apr 1, 2021

The changes look good to me. @jackye1995 or @yyanyy, can you confirm that the integration tests pass?

Thank you for the quick review! Yes I have ran the integration tests locally for both new and existing glue tests and confirmed that they all succeeded.

@aokolnychyi
Copy link
Contributor

@RussellSpitzer, could you check?

@RussellSpitzer
Copy link
Member

Looks good to me as well! Sorry the tests weren't easier to abstract, I figure with catalogs having such different operations it is probably better that we just duplicate :/

try {
persistGlueTable(glueTable, properties);
commitStatus = CommitStatus.SUCCESS;
} catch (Throwable persistFailure) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think this can be SdkException instead of all throwables, because persistGlueTable is only a call to Glue. If it is not that exception type, it is guaranteed to be a failure.
  2. can we avoid nested try and directly determine this in the catch block below? We are already catching that exception at L137 anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback!

For 1, I'm not confident if any type of network issue would end up as SdkException, so I'd prefer to maintain a list of exceptions that won't go through checkCommitStatus than a list that we will just to be on the safe side.
For 2, sounds good, I'll move the catch logic to L137 but I'll convert it to a Throwable to handle 1. This essentially means to not check commit status for ConcurrentModificationException and AlreadyExistsException which I think should be minimum risk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

catching Throwable always sound very dangerous to me, because Error indicates serious problems that a reasonable application should not try to catch, and I would much prefer just catching RuntimeException or at least just Exception. Was there any particular reason for HiveTableOperations to catch Throwable? @RussellSpitzer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is less important now that we start out with the commit state set to unknown, but in the original design we basically started out assuming that the commit had succeeded and switched a flag to indicate that it had failed. We can probably have this just be runtime exception now since now the logic is basically for deciding when to retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I was thinking if it's a Error we will very unlikely be able have a success commit before that, so it will be rethrown anyway, but on the other hand we probably want to make sure the process to die immediately. I'll update L137 to catch runtime exception.

Table glueTable = getGlueTable();
checkMetadataLocation(glueTable, base);
Map<String, String> properties = prepareProperties(glueTable, newMetadataLocation);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra line to remove

}
persistGlueTable(glueTable, properties);
commitStatus = CommitStatus.SUCCESS;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra line to remove

@yyanyy
Copy link
Contributor Author

yyanyy commented Apr 1, 2021

Thanks everyone for the quick review! The test failed but I think it may be transient, since I think I built the same code locally before updating and all tests succeeded. I've triggered another round of test.

If there is no more comment/concern before tomorrow I'll probably merge this change. Thanks again!

@yyanyy yyanyy merged commit 68c48ec into apache:master Apr 6, 2021
@yyanyy
Copy link
Contributor Author

yyanyy commented Apr 6, 2021

Merged as no further comment after last Thursday and multiple approvals were in place. Thank you again everyone for reviewing so quickly!

stevenzwu pushed a commit to stevenzwu/iceberg that referenced this pull request Jul 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants