-
Notifications
You must be signed in to change notification settings - Fork 2.9k
API: Support registering filesystem tables to metastore based catalogs #4946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| private boolean isMetastoreTable(String metadataFileLocation) { | ||
| return metadataFileLocation.substring(metadataFileLocation.lastIndexOf("/") + 1).contains("-"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be more specific. Why not confirm that there's a number using a regex?
| TableMetadata metadata = TableMetadataParser.read(ops.io(), metadataFile); | ||
|
|
||
| if (!isMetastoreTable(metadataFileLocation)) { | ||
| // Metastore table and filesystem tables will have different naming conventions as per the spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unnecessary. Why not just update parseVersion to return -1 if the version isn't valid? I think that's a better solution than rewriting the metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this suggestion. For filesystem tables, along with fixing register problem for hive, it can fix register problem for other catalogs also once they support it.
|
I think this again is a step in the direction of registering / adding the same table / snapshot to different catalogs. This will cause issues when ExpireSnapshots or the RemoveOrphanFiles processes are running. When some files are not needed in one catalog anymore they might be removed, and they will be missing from the other version later. Do we have a solution for prevent this situation? Or do I miss something? |
|
@pvary, I think that this is a case that makes sense for the register method. We highly recommend using a real metastore rather than using the Hadoop strategy. To convert from a Hadoop catalog to a metastore catalog, you'd need this to work. To me, the register table method is mainly for migration between catalogs. |
|
@rdblue: Putting that in this way, and I agree with you. If this is a migration, then this is the way to go. Thanks for the explanation! |
|
@pvary : Yes, this was for migration purposes. As we are aware that keeping the same table in two catalog is dangerous and leads to data mismatch. The user who migrates needs to take care of deleting the table in source catalog using drop table with purge=false option using the catalog interface. @rdblue : Thanks for clarification and review. I will address the comments. |
|
@rdblue : I have addressed the comments. Thanks. |
| try { | ||
| return Integer.valueOf(metadataLocation.substring(versionStart, versionEnd)); | ||
| } catch (NumberFormatException e) { | ||
| } catch (NumberFormatException | StringIndexOutOfBoundsException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not check versionEnd? If versionEnd is < 0, then return -1. No need to catch an exception.
| () -> catalog.loadTable(identifier)); | ||
|
|
||
| // register the table to hive catalog using the latest metadata file | ||
| metadataFiles.sort(Collections.reverseOrder()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than guessing which file is the metadata file, I think you should just get the location from the table:
String metadataLocation = ((BaseTable) table).operations().current().metadataFileLocation();|
@rdblue : Thanks for the review again. I have addressed all the comments. |
|
ping @rdblue |
| int versionStart = metadataLocation.lastIndexOf('/') + 1; // if '/' isn't found, this will be 0 | ||
| int versionEnd = metadataLocation.indexOf('-', versionStart); | ||
| if (versionEnd < 0) { | ||
| // As per the spec, metastore tables and filesystem tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about something like this:
If there the version is not parseable return -1 as a sign that the metadata is not part of this catalog
Or ever better remove this comment and add a javadoc for this method, since this is the same behaviour as we handle NumberFormatExceptions
|
@pvary : I have addressed the comments. Please check. Thanks. |
pvary
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from my side. Let's see if @rdblue has other comments, or not. If not I will merge this around at the end of the week.
Thanks @ajantha-bhat!
| int versionStart = metadataLocation.lastIndexOf('/') + 1; // if '/' isn't found, this will be 0 | ||
| int versionEnd = metadataLocation.indexOf('-', versionStart); | ||
| if (versionEnd < 0) { | ||
| LOG.warn("Found filesystem table's metadata : {}", metadataLocation); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good thing to warn about. Warnings are for potential problems, while this is a normal case if you convert a table or control a name some other way.
This is also inaccurate. We can't really know why the file didn't include a version and shouldn't guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the log warning and added it as a code comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more we talk about this, the less convinced I am that we should differentiate between the parsing errors. Nothing really prevents a developer to create their own TableOperations and using their own metadata filename format.
I think it would be good to have a log for the event in any case, at least in info level.
Maybe rewrite the whole parsing part to this:
private static int parseVersion(String metadataLocation) {
try {
int versionStart = metadataLocation.lastIndexOf('/') + 1; // if '/' isn't found, this will be 0
int versionEnd = metadataLocation.indexOf('-', versionStart);
return Integer.valueOf(metadataLocation.substring(versionStart, versionEnd));
} catch (Throwable e) {
LOG.info("Unable to parse version from metadata location: {}", metadataLocation, e);
return -1;
}
}
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. waiting for @rdblue 's opinion to avoid me working back and forth on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should check for versionEnd. That indicates that we're using a file with no version and should return -1.
If we also want to return -1 for NumberFormatException and warn, I'm all for that. But we should definitely not catch Throwable. That's dangerous. Let's handle the cases we know about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the inputs @rdblue. Agree that catching Throwable is not a good practice. I have reverted the latest change. So, final code is same as what you described above.
Let's give him a few more days, then I will merge. |
|
Thanks, @ajantha-bhat! |
Filesystem tables and metastore tables have different naming conventions for TableMetadata file as per spec.
So, when a filesystem table is registered to hive catalog (or any metastore based catalog), the register was successful but loading the table used to fail in
parseVersionas it was expecting the naming convention of metastore table.Fixed it by using metastore table naming convention while registering.
Fixes #4794