Skip to content

Conversation

@flyrain
Copy link
Contributor

@flyrain flyrain commented May 2, 2022

This is pretty useful to figure out which version is actually committed to the catalog from the log, especially when we debug a catalog consistency or locking issue.
cc @aokolnychyi @RussellSpitzer @szehon-ho @karuppayya

@RussellSpitzer
Copy link
Member

I'm not sure we want to force all future table operations to include a literal location (or have to return one). So we may want to have the logging in the operations (HiveTableOperations, HadoopTableOperations, ... ) themselves?

@flyrain
Copy link
Contributor Author

flyrain commented May 2, 2022

I'm not sure we want to force all future table operations to include a literal location (or have to return one). So we may want to have the logging in the operations (HiveTableOperations, HadoopTableOperations, ... ) themselves?

Good point. But I think a table version string will be there even for a future catalog without a metatdata.json file. It can easily support that, with minor word change. like Successfully committed to table {} with the new metadata location({}) -> Successfully committed to table {} with the new version ({})

Hi @kbendick, what do you think from the perspective of rest API catalog?

@rdblue
Copy link
Contributor

rdblue commented May 2, 2022

I agree with @RussellSpitzer that this should be done in the catalog, not generally.

@flyrain flyrain force-pushed the log-metadata-location-in-commit branch from b8561bf to 626c981 Compare May 2, 2022 19:36
cleanupMetadataAndUnlock(commitStatus, newMetadataLocation, lockId, tableLevelMutex);
}

LOG.info("Committed to table {} with the new metadata location {}", fullName, newMetadataLocation);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of putting this log in both line 278 and line 297, I put the log here.

@kbendick
Copy link
Contributor

kbendick commented May 3, 2022

I'm not sure we want to force all future table operations to include a literal location (or have to return one). So we may want to have the logging in the operations (HiveTableOperations, HadoopTableOperations, ... ) themselves?

Good point. But I think a table version string will be there even for a future catalog without a metatdata.json file. It can easily support that, with minor word change. like Successfully committed to table {} with the new metadata location({}) -> Successfully committed to table {} with the new version ({})

Hi @kbendick, what do you think from the perspective of rest API catalog?

For the current REST catalog, a file should be loggable. That said, I do agree that it's probably best not to force catalogs to have to always return a new metadata.json.

From the point of view of the REST catalog, presently it wouldn't matter.

That said, it looks like there would be quite a n number of places to log this if we did it at the catalog level. Essentially everywhere that commit is called..

For HiveCatalog, that would be:

  • registerTable (from Catalog interface)
  • createTable
  • Several functions in the BaseMetastoreCatalog, such as BaseMetastoreCatalog#BaseMetastoreCatalogTableBuilder::create.
  • The various implementations of Transactions::commitTransaction would also need to add logs (which breaks down into BaseTransaction::commitCreateTransaction, BaseTransaction::commitReplaceTransaction, and 2 more I can see from a simple pass over the code.

TLDR: I agree that having the catalog do the logging would be the ideal way, but realistically that's a lot of additional places to add logs vs a small handful of implementations of TableOperations::commit would suffice and generally speaking all of the relevant information is in Table Operations. So in my opinion, as long as the TableOperations implementations add the logging themselves, I think that would be simpler.

For the RESTCatalog, RESTTableOperations::updateCurrentMetadata, which is called at the end of commit with the LoadTableResponse - just for clarification on where / how this could be handled in the REST catalog.

@flyrain
Copy link
Contributor Author

flyrain commented May 3, 2022

Hi @RussellSpitzer, @rdblue @kbendick, made the change only to HiveTableOperation. I can make change to other table operations you think it is necessary.

@kbendick
Copy link
Contributor

kbendick commented May 3, 2022

Hi @RussellSpitzer, @rdblue @kbendick, made the change only to HiveTableOperation. I can make change to other table operations you think it is necessary.

I'm good with just adding the log to HiveTableOperations for now. I'd consider adding the log to HadoopTableOperations as well, as a lot of testing takes place using that and so investigations might wind up using that.

But realistically, if people who use other table operations find need or value for this log, it can be added as a follow up (particularly by people who make more common use of those table operations).

@flyrain
Copy link
Contributor Author

flyrain commented May 4, 2022

Added the support for HadoopTableOperations. But it is less likely be confusing for Hadoop table since it always write with the latest version number as the file name. One of the pain point for Hive table is that, there are multiple metadata files in the metadata directory, and you don't know which one is generated by a succeeded job, which is not. Hadoop table doesn't have this issue, it can overwrite the file generated by the failed commit.

It is still valuable though. You can connect the file with the job easily by looking at the log.

@kbendick
Copy link
Contributor

kbendick commented May 4, 2022

Added the support for HadoopTableOperations. But it is less likely be confusing for Hadoop table since it always write with the latest version number as the file name. One of the pain point for Hive table is that, there are multiple metadata files in the metadata directory, and you don't know which one is generated by a succeeded job, which is not. Hadoop table doesn't have this issue, it can overwrite the file generated by the failed commit.

It is still valuable though. You can connect the file with the job easily by looking at the log.

Ah that makes sense. I'm good with this use case then. If other TableOperations decide it's needed, then we can add it.

Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

@rdblue rdblue merged commit 30b31a2 into apache:master May 4, 2022
@rdblue
Copy link
Contributor

rdblue commented May 4, 2022

Thanks, @flyrain!

@flyrain
Copy link
Contributor Author

flyrain commented May 4, 2022

Thanks all for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants