Use the FileIO submodule in Spark writers and readers. #52
Conversation
Tricky because the table operations need to be exposed. Added a mixed interface of Table + TableOperations accordingly. The Spark data source is therefore now opinionated about ensuring the returned table also implements HasTableOperations.
```diff
   }

-  protected Table findTable(DataSourceOptions options) {
+  protected TableWithTableOperations findTable(DataSourceOptions options) {
```
@rdblue this is the most interesting question raised by this patch. What do you think?
What about adding io to the Table interface? I'd rather do that since FileIO is a public interface. I think that is mostly what HasTableOperations is used for anyway.
I think that's fine.
```java
/**
 * @return a {@link FileIO} to read and write table data and metadata files
 */
FileIO io();
```
Should TableOperations#io continue to exist?
Yes, because TableOperations is still how implementations are passed.
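A minimal sketch of what that delegation can look like. The interface and class names mirror the discussion, but the bodies are stand-ins, not Iceberg's actual implementation:

```java
// Stand-in interfaces mirroring the PR's shape, not Iceberg's real code.
interface FileIO {}

interface TableOperations {
  FileIO io();
}

interface Table {
  FileIO io();
}

// A table implementation can expose FileIO on the public Table interface
// by delegating to the TableOperations instance it already holds, so the
// two entry points always return the same FileIO.
class BaseTable implements Table {
  private final TableOperations ops;

  BaseTable(TableOperations ops) {
    this.ops = ops;
  }

  @Override
  public FileIO io() {
    return ops.io(); // implementations are still passed via TableOperations
  }
}
```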
```diff
   int hash = HASH_FUNC.apply(partitionAndFilename);
-  return new Path(objectStorePath,
-      String.format("%08x/%s/%s", hash, context, partitionAndFilename));
+  return UriBuilder.fromUri(URI.create(objectStore))
```
The Path/URI methods cause problems with escape characters. This should use strings for reliability.
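A small illustration of the failure mode, using a hypothetical object-store location; `java.net.URI` is stdlib:

```java
import java.net.URI;

public class UriEscapeDemo {
  public static void main(String[] args) {
    // A legal object-store key can contain '#', which URI parses as
    // the start of a fragment rather than part of the path.
    String location = "s3://bucket/table/data/part#1.parquet";

    // URI-based handling silently truncates the path at the '#'.
    System.out.println(URI.create(location).getPath()); // /table/data/part

    // Plain string handling preserves the full location.
    System.out.println(location);
  }
}
```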
```diff
-  TestTableOperations ops() {
+  @Override
+  public TestTableOperations operations() {
```
Why was this rename needed?
The previous version didn't actually override the superclass's method.
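In other words, without `@Override` the compiler accepts a method that silently fails to override. A minimal illustration; the class names echo the diff, and the base class name and bodies are stand-ins:

```java
class TestTableOperations {}

abstract class BaseTestTable {
  abstract TestTableOperations operations();
}

class TestTable extends BaseTestTable {
  // TestTableOperations ops() { ... }  // would compile, but callers of
  // operations() would never reach it: it's a new method, not an override.

  @Override
  TestTableOperations operations() {    // name checked by the compiler
    return new TestTableOperations();
  }
}
```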
Probably because we didn't expose TableOperations at first. Now that FileIO will be public, we can probably remove HasTableOperations in a follow-up.
@mccheah, thanks for working on this! It's about ready to go, but we need to change back to using String instead of URI. There's also an unnecessary rename that we should remove.
```java
Function<PartitionKey, String> outputPathFunc = key ->
    String.format("%s/%s/%s",
        stripTrailingSlash(baseDataPath),
        stripTrailingSlash(stripLeadingSlash(key.toPath())),
```
This is always implemented by PartitionSpec#partitionToPath, which never adds a leading or trailing /, so this can rely on that and skip the extra stripping.
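Assuming that contract, the stripping collapses to plain joining. A sketch with a stand-in `PartitionKey` (not Iceberg's real class):

```java
import java.util.function.Function;

public class OutputPathDemo {
  // Stand-in for Iceberg's PartitionKey; toPath() mirrors
  // PartitionSpec#partitionToPath, which adds no surrounding slashes.
  record PartitionKey(String partitionPath) {
    String toPath() { return partitionPath; }
  }

  // With the no-slash contract, joining is plain string formatting.
  static String outputPath(String baseDataPath, PartitionKey key, String filename) {
    return String.format("%s/%s/%s", baseDataPath, key.toPath(), filename);
  }

  public static void main(String[] args) {
    Function<PartitionKey, String> outputPathFunc = key ->
        outputPath("s3://bucket/table/data", key, "00000-0-data.parquet");

    System.out.println(outputPathFunc.apply(new PartitionKey("date=2019-01-01")));
  }
}
```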
```java
String baseDataPath = stripTrailingSlash(dataLocation); // avoid calling this in the output path function
Function<PartitionKey, String> outputPathFunc = key ->
    String.format("%s/%s/%s",
        stripTrailingSlash(baseDataPath),
```
baseDataPath already stripped the trailing /.
```diff
-  Path baseDataPath = lazyDataPath(); // avoid calling this in the output path function
-  Function<PartitionKey, Path> outputPathFunc = key ->
-      new Path(new Path(baseDataPath, key.toPath()), filename);
+  String baseDataPath = stripTrailingSlash(dataLocation); // avoid calling this in the output path function
```
Since this is an internal implementation, I think that the dataLocation() method should guarantee that the trailing / is removed. That way we never need to do this in the tasks.
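A hedged sketch of that guarantee; the property key and default layout are illustrative, not Iceberg's actual constants:

```java
import java.util.Map;

public class TableLocations {
  private final Map<String, String> properties;
  private final String tableLocation;

  public TableLocations(Map<String, String> properties, String tableLocation) {
    this.properties = properties;
    this.tableLocation = tableLocation;
  }

  // Centralized guarantee: the returned location never ends with '/',
  // so write tasks can join paths without stripping anything themselves.
  public String dataLocation() {
    String location = properties.getOrDefault(
        "write.folder-storage.path",   // illustrative property key
        tableLocation + "/data");
    return stripTrailingSlash(location);
  }

  static String stripTrailingSlash(String path) {
    String result = path;
    while (result.endsWith("/")) {
      result = result.substring(0, result.length() - 1);
    }
    return result;
  }
}
```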
```diff
-      String.format("%08x/%s/%s", hash, context, partitionAndFilename));
+  return String.format(
+      "%s/%08x/%s/%s/%s",
+      stripTrailingSlash(objectStore),
```
Same general idea here as above: this shouldn't strip slashes where normalization can be done in one central place, or where the strings are generated by Iceberg code. We may need to document that other methods must not produce a leading or trailing /, but that is the better way to go.
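The documented contract might look like the following; the interface name and javadoc wording are illustrative, not the project's actual API:

```java
// Illustrative interface capturing the contract: path-producing methods
// must not emit a leading or trailing '/', so callers can join segments
// with plain string formatting and no normalization.
interface PartitionPathSource {
  /**
   * Returns the relative partition path for a data file.
   *
   * Must not start or end with '/'.
   */
  String toPath();
}
```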
@mccheah, I think this is just about ready to merge, but it doesn't need most of the calls to strip slashes. Once those are fixed, I'll merge it. No rush since you're probably just back from the holidays.
Almost there, addressed the comments, fixed the merge conflicts.
Merged. Thanks @mccheah!