-
Notifications
You must be signed in to change notification settings - Fork 226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for write.metadata.path
#1642
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Generally LGTM. I added some comments around code organization.
Could you also rebase main to fix the conflict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally lgtm, i added a few nit comments.
looks like theres a linter issue too, can you run make lint
locally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! sorry about the merge conflict again, merging a lot of changes today :P
@@ -64,6 +71,35 @@ def new_data_location(self, data_file_name: str, partition_key: Optional[Partiti | |||
str: A fully-qualified location URI for the data file. | |||
""" | |||
|
|||
def new_table_metadata_file_location(self, new_version: int = 0) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks fine to me (and configuring metadata locations on LocationProvider
s is cool)!
- Spitballing (/nits): I wonder if custom location providers that technically didn't subclass
LocationProvider
but would've worked before still would now fail for metadata writing because they don't have these new methods. If this is the case, it's maybe fine because we can expect them to subclassLocationProvider
as per the docs. - The way this is written makes me wonder whether users might want to customise their metadata locations under a table instead of providing a hard-coded path - in a similar way to custom
LocationProvider
s do for data files (maybe with mass updates they can have a large number of json/avro files so they could want to do the object storage hashing inside the metadata folder too to reduce throttling?) . I don't see that strong of a use case FWIW, but if I'm wrong could maybe add this to docs about custom location providers or start a larger discussion. Any thoughts folks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for 1, we haven't released a version with LocationProvider. so any references going forward should also have the new methods. We probably want to add a note in the location provider docs about the new metadata location feature
The way this is written makes me wonder whether users might want to customise their metadata locations under a table instead of providing a hard-coded path - in a similar way to custom LocationProviders do for data files
does "under a table" mean path relative to the table's base location? The SimpleLocation Provider uses absolute path. The ObjectStoreLocationProvider is special in that it injects some hash after the base location. But i think a custom metadata location provider can do the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah exactly, I was just curious about user demand for a custom metadata location provider. It exceeds the capability of Java custom location providers and posing the question of whether it makes sense for other implementations.
To clarify, no objections from my end. Updating the docs sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smaheshwar-pltr Regarding point 1, I had the same question while working on this. Centralizing locations/paths would definitely make future changes easier, especially with the ongoing discussions around relative path work. I found a related issue in the main repo here: apache/iceberg#6809. Which looks like it didn't get any traction but still holds some value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i see the semantic difference.
new_data_location
is an abstract method, which allows custom LocationProviders to override.
load_location_provider
returns SimpleLocationProvider
by default, which then use the SimpleLocationProvider
's new_data_location
method in most cases.
we can do the same for new_metadata_location
, making it an abstract method so that custom LocationProviders can override. but also provide a default implementation for both SimpleLocationProvider
and ObjectStoreLocationProvider
we can then mention in the LocationProvider documentation that new_metadata_location
can be overriden
Does this make sense? @smaheshwar-pltr is this what you were pointing out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is interesting because for data file path, we can use TableProperties.WRITE_DATA_PATH
to override LocationProvider's self.data_path
, which is the prefix of the data file path.
And new_data_location
to override the behavior of the fully-qualified data file location.
We're trying recreate the same behavior for metadata path.
TableProperties.WRITE_METADATA_PATH
to override the LocationProvider's self.metadata_path
, which is the prefix of the metadata file path.
And new_metadata_location
to override the behavior of the fully-qualified metadata file location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, brushing up on my python...
i think the code is fine as is. both new_metadata_location
and new_table_metadata_file_location
can provide default implementation and also be overriden by custom implementations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a related issue in the main repo here: apache/iceberg#6809. Which looks like it didn't get any traction but still holds some value.
Thanks for digging into this - I wasn't aware of that discussion. Quoting from it, "Object storage mode to not just data files but also metadata files" is what I was getting at here. It's interesting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for refactoring the code
I think we should add a comment to the Location Provider documentation and mention the metadata path
@@ -64,6 +71,35 @@ def new_data_location(self, data_file_name: str, partition_key: Optional[Partiti | |||
str: A fully-qualified location URI for the data file. | |||
""" | |||
|
|||
def new_table_metadata_file_location(self, new_version: int = 0) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for 1, we haven't released a version with LocationProvider. so any references going forward should also have the new methods. We probably want to add a note in the location provider docs about the new metadata location feature
The way this is written makes me wonder whether users might want to customise their metadata locations under a table instead of providing a hard-coded path - in a similar way to custom LocationProviders do for data files
does "under a table" mean path relative to the table's base location? The SimpleLocation Provider uses absolute path. The ObjectStoreLocationProvider is special in that it injects some hash after the base location. But i think a custom metadata location provider can do the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I'll wait for fokko and smaheshwar to chime in :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, thanks @geruh for working on this! 🙌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (Tiny docs nits)
Thank you @geruh for the location provider changes!
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: smaheshwar-pltr <[email protected]>
Thanks @geruh for the contribution! And thanks @smaheshwar-pltr @Fokko for the review! |
Adding support for writing metadata to a custom path set via
write.metadata.path
property. Since the Python library consolidates the table operation classes in both the table and catalog classes, I had to surface the metadata file location handling to the baseCatalog
class to avoid circular dependencies. This way we are also able to centralize the metadata location handling for table metadata and snapshots.Relates to #1492