Skip to content

Conversation

@ajantha-bhat
Copy link
Member

@ajantha-bhat ajantha-bhat commented Jun 2, 2022

Filesystem tables and metastore tables have different naming conventions for TableMetadata file as per spec.
So, when a filesystem table is registered to hive catalog (or any metastore based catalog), the register was successful but loading the table used to fail in parseVersion as it was expecting the naming convention of metastore table.

Fixed it by using metastore table naming convention while registering.

Fixes #4794

}

private boolean isMetastoreTable(String metadataFileLocation) {
return metadataFileLocation.substring(metadataFileLocation.lastIndexOf("/") + 1).contains("-");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be more specific. Why not confirm that there's a number using a regex?

TableMetadata metadata = TableMetadataParser.read(ops.io(), metadataFile);

if (!isMetastoreTable(metadataFileLocation)) {
// Metastore table and filesystem tables will have different naming conventions as per the spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessary. Why not just update parseVersion to return -1 if the version isn't valid? I think that's a better solution than rewriting the metadata.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this suggestion. For filesystem tables, along with fixing register problem for hive, it can fix register problem for other catalogs also once they support it.

@pvary
Copy link
Contributor

pvary commented Jun 5, 2022

I think this again is a step in the direction of registering / adding the same table / snapshot to different catalogs. This will cause issues when ExpireSnapshots or the RemoveOrphanFiles processes are running. When some files are not needed in one catalog anymore they might be removed, and they will be missing from the other version later.

Do we have a solution for prevent this situation? Or do I miss something?

@rdblue
Copy link
Contributor

rdblue commented Jun 5, 2022

@pvary, I think that this is a case that makes sense for the register method. We highly recommend using a real metastore rather than using the Hadoop strategy. To convert from a Hadoop catalog to a metastore catalog, you'd need this to work. To me, the register table method is mainly for migration between catalogs.

@pvary
Copy link
Contributor

pvary commented Jun 5, 2022

@rdblue: Putting that in this way, and I agree with you. If this is a migration, then this is the way to go.

Thanks for the explanation!

@ajantha-bhat
Copy link
Member Author

@pvary : Yes, this was for migration purposes. As we are aware that keeping the same table in two catalog is dangerous and leads to data mismatch. The user who migrates needs to take care of deleting the table in source catalog using drop table with purge=false option using the catalog interface.

@rdblue : Thanks for clarification and review. I will address the comments.

@ajantha-bhat
Copy link
Member Author

@rdblue : I have addressed the comments. Thanks.

try {
return Integer.valueOf(metadataLocation.substring(versionStart, versionEnd));
} catch (NumberFormatException e) {
} catch (NumberFormatException | StringIndexOutOfBoundsException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not check versionEnd? If versionEnd is < 0, then return -1. No need to catch an exception.

() -> catalog.loadTable(identifier));

// register the table to hive catalog using the latest metadata file
metadataFiles.sort(Collections.reverseOrder());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than guessing which file is the metadata file, I think you should just get the location from the table:

  String metadataLocation = ((BaseTable) table).operations().current().metadataFileLocation();

@ajantha-bhat
Copy link
Member Author

@rdblue : Thanks for the review again. I have addressed all the comments.

@ajantha-bhat ajantha-bhat changed the title API: Support registering filesystem tables to hive catalog API: Support registering filesystem tables to metastore based catalogs Jun 7, 2022
@ajantha-bhat
Copy link
Member Author

ping @rdblue

int versionStart = metadataLocation.lastIndexOf('/') + 1; // if '/' isn't found, this will be 0
int versionEnd = metadataLocation.indexOf('-', versionStart);
if (versionEnd < 0) {
// As per the spec, metastore tables and filesystem tables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like this:

If there the version is not parseable return -1 as a sign that the metadata is not part of this catalog

Or ever better remove this comment and add a javadoc for this method, since this is the same behaviour as we handle NumberFormatExceptions

@ajantha-bhat
Copy link
Member Author

@pvary : I have addressed the comments. Please check. Thanks.

Copy link
Contributor

@pvary pvary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from my side. Let's see if @rdblue has other comments, or not. If not I will merge this around at the end of the week.

Thanks @ajantha-bhat!

int versionStart = metadataLocation.lastIndexOf('/') + 1; // if '/' isn't found, this will be 0
int versionEnd = metadataLocation.indexOf('-', versionStart);
if (versionEnd < 0) {
LOG.warn("Found filesystem table's metadata : {}", metadataLocation);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a good thing to warn about. Warnings are for potential problems, while this is a normal case if you convert a table or control a name some other way.

This is also inaccurate. We can't really know why the file didn't include a version and shouldn't guess.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the log warning and added it as a code comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more we talk about this, the less convinced I am that we should differentiate between the parsing errors. Nothing really prevents a developer to create their own TableOperations and using their own metadata filename format.

I think it would be good to have a log for the event in any case, at least in info level.

Maybe rewrite the whole parsing part to this:

  private static int parseVersion(String metadataLocation) {
    try {
      int versionStart = metadataLocation.lastIndexOf('/') + 1; // if '/' isn't found, this will be 0
      int versionEnd = metadataLocation.indexOf('-', versionStart);
      return Integer.valueOf(metadataLocation.substring(versionStart, versionEnd));
    } catch (Throwable e) {
      LOG.info("Unable to parse version from metadata location: {}", metadataLocation, e);
      return -1;
    }
  }

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. waiting for @rdblue 's opinion to avoid me working back and forth on this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pvary : As there was no response from @rdblue, I went ahead with your suggestion and modified the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should check for versionEnd. That indicates that we're using a file with no version and should return -1.

If we also want to return -1 for NumberFormatException and warn, I'm all for that. But we should definitely not catch Throwable. That's dangerous. Let's handle the cases we know about.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the inputs @rdblue. Agree that catching Throwable is not a good practice. I have reverted the latest change. So, final code is same as what you described above.

@pvary
Copy link
Contributor

pvary commented Jun 21, 2022

@pvary : I have addressed the comments. Please check. Thanks.

Let's give him a few more days, then I will merge.
Thanks, Peter

@rdblue rdblue added this to the Iceberg 0.14.0 Release milestone Jun 29, 2022
@rdblue rdblue merged commit 3663497 into apache:master Jun 29, 2022
@rdblue
Copy link
Contributor

rdblue commented Jun 29, 2022

Thanks, @ajantha-bhat!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error Registering Hadoop Based Table Into Hive Metastore based table.

3 participants