Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for
write.metadata.path
#1642Add support for
write.metadata.path
#1642Changes from 7 commits
39928c6
f4c075c
6f88749
1c9f177
76c0480
b36df41
8b309d2
577bbdd
5b0b85b
f75642b
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks fine to me (and configuring metadata locations on
LocationProvider
s is cool)!LocationProvider
but would've worked before still would now fail for metadata writing because they don't have these new methods. If this is the case, it's maybe fine because we can expect them to subclassLocationProvider
as per the docs.LocationProvider
s do for data files (maybe with mass updates they can have a large number of json/avro files so they could want to do the object storage hashing inside the metadata folder too to reduce throttling?) . I don't see that strong of a use case FWIW, but if I'm wrong could maybe add this to docs about custom location providers or start a larger discussion. Any thoughts folks?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for 1, we haven't released a version with LocationProvider. so any references going forward should also have the new methods. We probably want to add a note in the location provider docs about the new metadata location feature
does "under a table" mean path relative to the table's base location? The SimpleLocation Provider uses absolute path. The ObjectStoreLocationProvider is special in that it injects some hash after the base location. But i think a custom metadata location provider can do the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah exactly, I was just curious about user demand for a custom metadata location provider. It exceeds the capability of Java custom location providers and posing the question of whether it makes sense for other implementations.
To clarify, no objections from my end. Updating the docs sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smaheshwar-pltr Regarding point 1, I had the same question while working on this. Centralizing locations/paths would definitely make future changes easier, especially with the ongoing discussions around relative path work. I found a related issue in the main repo here: apache/iceberg#6809. Which looks like it didn't get any traction but still holds some value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i see the semantic difference.
new_data_location
is an abstract method, which allows custom LocationProviders to override.load_location_provider
returnsSimpleLocationProvider
by default, which then use theSimpleLocationProvider
'snew_data_location
method in most cases.we can do the same for
new_metadata_location
, making it an abstract method so that custom LocationProviders can override. but also provide a default implementation for bothSimpleLocationProvider
andObjectStoreLocationProvider
we can then mention in the LocationProvider documentation that
new_metadata_location
can be overridenDoes this make sense? @smaheshwar-pltr is this what you were pointing out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is interesting because for data file path, we can use
TableProperties.WRITE_DATA_PATH
to override LocationProvider'sself.data_path
, which is the prefix of the data file path.And
new_data_location
to override the behavior of the fully-qualified data file location.We're trying recreate the same behavior for metadata path.
TableProperties.WRITE_METADATA_PATH
to override the LocationProvider'sself.metadata_path
, which is the prefix of the metadata file path.And
new_metadata_location
to override the behavior of the fully-qualified metadata file location.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, brushing up on my python...
i think the code is fine as is. both
new_metadata_location
andnew_table_metadata_file_location
can provide default implementation and also be overriden by custom implementationsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into this - I wasn't aware of that discussion. Quoting from it, "Object storage mode to not just data files but also metadata files" is what I was getting at here. It's interesting