Core: Log the new metadata location in commit. #4681

flyrain · 2022-05-02T18:03:13Z

This log makes it easy to tell which metadata version was actually committed to the catalog, which is especially useful when debugging a catalog consistency or locking issue.
cc @aokolnychyi @RussellSpitzer @szehon-ho @karuppayya

RussellSpitzer · 2022-05-02T18:16:01Z

I'm not sure we want to force all future table operations to include a literal location (or have to return one). So we may want to have the logging in the operations (HiveTableOperations, HadoopTableOperations, ... ) themselves?

flyrain · 2022-05-02T18:43:43Z

> I'm not sure we want to force all future table operations to include a literal location (or have to return one). So we may want to have the logging in the operations (HiveTableOperations, HadoopTableOperations, ... ) themselves?

Good point. But I think a table version string will be there even for a future catalog without a metadata.json file. The log can easily support that with a minor wording change, e.g. "Successfully committed to table {} with the new metadata location ({})" -> "Successfully committed to table {} with the new version ({})".

Hi @kbendick, what do you think from the perspective of the REST catalog?

rdblue · 2022-05-02T19:03:59Z

I agree with @RussellSpitzer that this should be done in the catalog, not generally.

flyrain · 2022-05-02T19:38:20Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

      cleanupMetadataAndUnlock(commitStatus, newMetadataLocation, lockId, tableLevelMutex);
    }
+
+    LOG.info("Committed to table {} with the new metadata location {}", fullName, newMetadataLocation);


Instead of putting this log in both line 278 and line 297, I put the log here.
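
A minimal sketch of that placement, assuming a simplified commit with two success paths (alter an existing table, or create a new one) inside a try block. The class, its method names, and the `events` list are hypothetical stand-ins for illustration, not the actual HiveTableOperations code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a table-operations commit.
final class SingleLogCommitSketch {
  // Records what happened, standing in for real side effects and LOG.info.
  final List<String> events = new ArrayList<>();

  void commit(String fullName, String newMetadataLocation, boolean tableExists, boolean fail) {
    try {
      persist(tableExists, fail);
    } finally {
      // Cleanup runs on both success and failure.
      events.add("cleanup");
    }

    // Reached only if the try block did not throw, so a single statement
    // covers both success paths instead of one log per branch.
    events.add(String.format(
        "Committed to table %s with the new metadata location %s", fullName, newMetadataLocation));
  }

  private void persist(boolean tableExists, boolean fail) {
    if (fail) {
      throw new IllegalStateException("commit failed");
    }
    // Two distinct success paths (assumed for illustration).
    events.add(tableExists ? "alter" : "create");
  }
}
```

Because an exception from the try block propagates past the statement after `finally`, a failed commit records only the cleanup and never the "Committed" message.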

kbendick · 2022-05-03T01:37:01Z

> I'm not sure we want to force all future table operations to include a literal location (or have to return one). So we may want to have the logging in the operations (HiveTableOperations, HadoopTableOperations, ... ) themselves?
>
> Good point. But I think a table version string will be there even for a future catalog without a metadata.json file. The log can easily support that with a minor wording change, e.g. "Successfully committed to table {} with the new metadata location ({})" -> "Successfully committed to table {} with the new version ({})".
>
> Hi @kbendick, what do you think from the perspective of the REST catalog?

For the current REST catalog, a file should be loggable. That said, I do agree that it's probably best not to force catalogs to have to always return a new metadata.json.

From the point of view of the REST catalog, presently it wouldn't matter.

That said, it looks like there would be quite a number of places to log this if we did it at the catalog level: essentially everywhere that commit is called.

For HiveCatalog, that would be:

registerTable (from Catalog interface)
createTable
Several functions in the BaseMetastoreCatalog, such as BaseMetastoreCatalog#BaseMetastoreCatalogTableBuilder::create.
The various implementations of Transactions::commitTransaction would also need to add logs (which breaks down into BaseTransaction::commitCreateTransaction, BaseTransaction::commitReplaceTransaction, and 2 more I can see from a simple pass over the code).

TL;DR: I agree that having the catalog do the logging would be ideal, but realistically that means adding logs in a lot of places, whereas a small handful of TableOperations::commit implementations would suffice, and generally speaking all of the relevant information is already in TableOperations. So in my opinion, it would be simpler for the TableOperations implementations to add the logging themselves.

For the RESTCatalog, the place would be RESTTableOperations::updateCurrentMetadata, which is called at the end of commit with the LoadTableResponse. This is just to clarify where and how this could be handled in the REST catalog.
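
To illustrate the trade-off, here is a rough sketch, assuming every catalog-level entry point ends in a single `commit` call on the table operations. `LoggingOps`, the three static methods, and the generated location string are hypothetical simplifications, not the real Iceberg classes or file-naming scheme:

```java
import java.util.ArrayList;
import java.util.List;

final class CommitFunnelSketch {
  // Hypothetical, simplified stand-in for one TableOperations implementation.
  static final class LoggingOps {
    final List<String> log = new ArrayList<>();
    private int version = 0;

    void commit(String table) {
      version++;
      // Made-up location format, for illustration only.
      String location = String.format("%s/metadata/%05d.metadata.json", table, version);
      // The only log statement: every caller below reaches it.
      log.add("Committed to table " + table + " with the new metadata location " + location);
    }
  }

  // Hypothetical catalog-level entry points. Each one funnels into ops.commit,
  // so none of them needs a log statement of its own.
  static void createTable(LoggingOps ops, String table) {
    ops.commit(table);
  }

  static void registerTable(LoggingOps ops, String table) {
    ops.commit(table);
  }

  static void commitTransaction(LoggingOps ops, String table) {
    ops.commit(table);
  }
}
```

With this shape, adding a new entry point that commits through the same operations object gets the log for free, at the cost of repeating the statement once per TableOperations implementation.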

flyrain · 2022-05-03T18:17:33Z

Hi @RussellSpitzer, @rdblue, @kbendick, I made the change only to HiveTableOperations. I can make the change to other table operations if you think it is necessary.

kbendick · 2022-05-03T23:34:58Z

> Hi @RussellSpitzer, @rdblue, @kbendick, I made the change only to HiveTableOperations. I can make the change to other table operations if you think it is necessary.

I'm good with just adding the log to HiveTableOperations for now. I'd consider adding the log to HadoopTableOperations as well, as a lot of testing takes place using that and so investigations might wind up using that.

But realistically, if people who use other table operations find need or value for this log, it can be added as a follow up (particularly by people who make more common use of those table operations).

flyrain · 2022-05-04T00:28:14Z

Added support for HadoopTableOperations. It is less likely to be confusing for Hadoop tables, though, since they always write the metadata file with the latest version number as the file name. One of the pain points for Hive tables is that there are multiple metadata files in the metadata directory, and you don't know which ones were generated by a successful job and which were not. Hadoop tables don't have this issue, since a commit can overwrite the file generated by a failed commit.

It is still valuable though. You can connect the file with the job easily by looking at the log.
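
As a sketch of how the log could be used for that, the helper below scans log lines for the message format added in this PR and separates the metadata files that appear in a "Committed" line from those that do not. `CommittedMetadataFinder` and its methods are hypothetical, and the sketch assumes the table name and location contain no whitespace:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical debugging helper, not part of Iceberg.
final class CommittedMetadataFinder {
  // Mirrors the message: "Committed to table {} with the new metadata location {}".
  private static final Pattern COMMITTED =
      Pattern.compile("Committed to table (\\S+) with the new metadata location (\\S+)");

  // Collects every metadata location logged as committed for the given table.
  static Set<String> committedLocations(List<String> logLines, String table) {
    Set<String> result = new HashSet<>();
    for (String line : logLines) {
      Matcher m = COMMITTED.matcher(line);
      if (m.find() && m.group(1).equals(table)) {
        result.add(m.group(2));
      }
    }
    return result;
  }

  // Returns the files with no matching "Committed" line, in input order.
  static List<String> withoutCommitLog(List<String> metadataFiles, Set<String> committed) {
    List<String> result = new ArrayList<>();
    for (String file : metadataFiles) {
      if (!committed.contains(file)) {
        result.add(file);
      }
    }
    return result;
  }
}
```

A file with no matching line is only a candidate for a failed commit: it may also come from a job whose logs were not collected or that ran before this log existed.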

kbendick · 2022-05-04T01:33:44Z

> Added support for HadoopTableOperations. It is less likely to be confusing for Hadoop tables, though, since they always write the metadata file with the latest version number as the file name. One of the pain points for Hive tables is that there are multiple metadata files in the metadata directory, and you don't know which ones were generated by a successful job and which were not. Hadoop tables don't have this issue, since a commit can overwrite the file generated by a failed commit.
>
> It is still valuable though. You can connect the file with the job easily by looking at the log.

Ah that makes sense. I'm good with this use case then. If other TableOperations decide it's needed, then we can add it.

kbendick

This looks good to me.

rdblue · 2022-05-04T15:45:40Z

Thanks, @flyrain!

flyrain · 2022-05-04T16:54:59Z

Thanks all for the review.

(cherry picked from commit 30b31a2)

github-actions bot added AWS core DELL hive NESSIE labels May 2, 2022

Log the new metadata location in commit.

626c981

flyrain force-pushed the log-metadata-location-in-commit branch from b8561bf to 626c981 Compare May 2, 2022 19:36

flyrain commented May 2, 2022

Resolve the comments

c2e6ed0

kbendick approved these changes May 4, 2022

rdblue approved these changes May 4, 2022

rdblue merged commit 30b31a2 into apache:master May 4, 2022

InvisibleProgrammer mentioned this pull request Dec 14, 2022

HIVE-26822: port changes before spotless apache/hive#3857

Closed

InvisibleProgrammer mentioned this pull request Jan 3, 2023

HIVE-26808: port iceberg catalog changes apache/hive#3907

Merged

sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 9, 2023

Hive: Log new metadata location in commit (apache#4681)

b92fba7

(cherry picked from commit 30b31a2)
