Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for write.metadata.path #1642

Merged
merged 10 commits into from
Feb 14, 2025
Merged

Add support for write.metadata.path #1642

merged 10 commits into from
Feb 14, 2025

Conversation

geruh
Copy link
Contributor

@geruh geruh commented Feb 11, 2025

Adding support for writing metadata to a custom path set via write.metadata.path property. Since the Python library consolidates the table operation classes in both the table and catalog classes, I had to surface the metadata file location handling to the base Catalog class to avoid circular dependencies. This way we are also able to centralize the metadata location handling for table metadata and snapshots.

Relates to #1492

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Generally LGTM. I added some comments around code organization.

Could you also rebase main to fix the conflict?

pyiceberg/catalog/__init__.py Outdated Show resolved Hide resolved
pyiceberg/table/update/snapshot.py Outdated Show resolved Hide resolved
pyiceberg/table/update/snapshot.py Outdated Show resolved Hide resolved
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally lgtm, i added a few nit comments.

looks like theres a linter issue too, can you run make lint locally?

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
pyiceberg/table/update/snapshot.py Outdated Show resolved Hide resolved
pyiceberg/catalog/__init__.py Outdated Show resolved Hide resolved
@kevinjqliu kevinjqliu requested a review from Fokko February 11, 2025 22:30
@kevinjqliu
Copy link
Contributor

cc @Fokko @smaheshwar-pltr

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! sorry about the merge conflict again, merging a lot of changes today :P

@@ -64,6 +71,35 @@ def new_data_location(self, data_file_name: str, partition_key: Optional[Partiti
str: A fully-qualified location URI for the data file.
"""

def new_table_metadata_file_location(self, new_version: int = 0) -> str:
Copy link
Contributor

@smaheshwar-pltr smaheshwar-pltr Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks fine to me (and configuring metadata locations on LocationProviders is cool)!

  1. Spitballing (/nits): I wonder if custom location providers that technically didn't subclass LocationProvider but would've worked before still would now fail for metadata writing because they don't have these new methods. If this is the case, it's maybe fine because we can expect them to subclass LocationProvider as per the docs.
  2. The way this is written makes me wonder whether users might want to customise their metadata locations under a table instead of providing a hard-coded path - in a similar way to custom LocationProviders do for data files (maybe with mass updates they can have a large number of json/avro files so they could want to do the object storage hashing inside the metadata folder too to reduce throttling?) . I don't see that strong of a use case FWIW, but if I'm wrong could maybe add this to docs about custom location providers or start a larger discussion. Any thoughts folks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for 1, we haven't released a version with LocationProvider. so any references going forward should also have the new methods. We probably want to add a note in the location provider docs about the new metadata location feature

The way this is written makes me wonder whether users might want to customise their metadata locations under a table instead of providing a hard-coded path - in a similar way to custom LocationProviders do for data files

does "under a table" mean path relative to the table's base location? The SimpleLocation Provider uses absolute path. The ObjectStoreLocationProvider is special in that it injects some hash after the base location. But i think a custom metadata location provider can do the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah exactly, I was just curious about user demand for a custom metadata location provider. It exceeds the capability of Java custom location providers and posing the question of whether it makes sense for other implementations.

To clarify, no objections from my end. Updating the docs sounds good!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smaheshwar-pltr Regarding point 1, I had the same question while working on this. Centralizing locations/paths would definitely make future changes easier, especially with the ongoing discussions around relative path work. I found a related issue in the main repo here: apache/iceberg#6809. Which looks like it didn't get any traction but still holds some value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see the semantic difference.

new_data_location is an abstract method, which allows custom LocationProviders to override.
load_location_provider returns SimpleLocationProvider by default, which then use the SimpleLocationProvider's new_data_location method in most cases.

we can do the same for new_metadata_location, making it an abstract method so that custom LocationProviders can override. but also provide a default implementation for both SimpleLocationProvider and ObjectStoreLocationProvider

we can then mention in the LocationProvider documentation that new_metadata_location can be overriden

Does this make sense? @smaheshwar-pltr is this what you were pointing out?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is interesting because for data file path, we can use TableProperties.WRITE_DATA_PATH to override LocationProvider's self.data_path, which is the prefix of the data file path.
And new_data_location to override the behavior of the fully-qualified data file location.

We're trying recreate the same behavior for metadata path.
TableProperties.WRITE_METADATA_PATH to override the LocationProvider's self.metadata_path, which is the prefix of the metadata file path.
And new_metadata_location to override the behavior of the fully-qualified metadata file location.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, brushing up on my python...
i think the code is fine as is. both new_metadata_location and new_table_metadata_file_location can provide default implementation and also be overriden by custom implementations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a related issue in the main repo here: apache/iceberg#6809. Which looks like it didn't get any traction but still holds some value.

Thanks for digging into this - I wasn't aware of that discussion. Quoting from it, "Object storage mode to not just data files but also metadata files" is what I was getting at here. It's interesting

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for refactoring the code

I think we should add a comment to the Location Provider documentation and mention the metadata path

@@ -64,6 +71,35 @@ def new_data_location(self, data_file_name: str, partition_key: Optional[Partiti
str: A fully-qualified location URI for the data file.
"""

def new_table_metadata_file_location(self, new_version: int = 0) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for 1, we haven't released a version with LocationProvider. so any references going forward should also have the new methods. We probably want to add a note in the location provider docs about the new metadata location feature

The way this is written makes me wonder whether users might want to customise their metadata locations under a table instead of providing a hard-coded path - in a similar way to custom LocationProviders do for data files

does "under a table" mean path relative to the table's base location? The SimpleLocation Provider uses absolute path. The ObjectStoreLocationProvider is special in that it injects some hash after the base location. But i think a custom metadata location provider can do the same.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I'll wait for fokko and smaheshwar to chime in :)

mkdocs/docs/configuration.md Outdated Show resolved Hide resolved
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks @geruh for working on this! 🙌

Copy link
Contributor

@smaheshwar-pltr smaheshwar-pltr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (Tiny docs nits)

Thank you @geruh for the location provider changes!

@kevinjqliu
Copy link
Contributor

Thanks for the great suggestions, i added them and double check the docs rendering

Screenshot 2025-02-14 at 9 14 56 AM

@kevinjqliu kevinjqliu merged commit 829b7dc into apache:main Feb 14, 2025
8 checks passed
@kevinjqliu
Copy link
Contributor

Thanks @geruh for the contribution! And thanks @smaheshwar-pltr @Fokko for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants