Core: allow default data location for ObjectStorageLocationProvider #2845
Merged
To be honest, I don't understand the real meaning of `TableProperties.WRITE_NEW_DATA_LOCATION`. If it is the configured default location, why is the constant named `WRITE_NEW_DATA_LOCATION` when the actual property name is `write.folder-storage.path`?
Now in this patch, we are trying to fall back to `write.folder-storage.path` when people have not set `write.object-storage.path`. That makes it even more confusing to understand :-)
Yes, I totally agree. `TableProperties.WRITE_NEW_DATA_LOCATION` is probably not a good name, but it is what it is; we might want to deprecate and rename it if we have a major version update...
The resolution strategy I am trying to create here is:
1. if `write.object-storage.path` is set, use it
2. if not, fall back to `write.folder-storage.path`
3. if that is not set either, use `<tableLocation>/data`
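The fallback order described above can be sketched in a few lines of Java. This is a minimal illustration, not the code in this PR: the class `DataLocationResolution` and the helpers `resolveDataLocation` and `stripTrailingSlash` are hypothetical names, and only the two property keys and the `<tableLocation>/data` default come from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the three-step fallback; not Iceberg's actual implementation.
class DataLocationResolution {
  // Property keys as quoted in this thread.
  static final String OBJECT_STORAGE_PATH = "write.object-storage.path";
  static final String FOLDER_STORAGE_PATH = "write.folder-storage.path";

  // Returns the base location under which new data files would be written.
  static String resolveDataLocation(Map<String, String> properties, String tableLocation) {
    // 1. Prefer the object-storage-specific property.
    String location = properties.get(OBJECT_STORAGE_PATH);
    // 2. Otherwise fall back to the older folder-storage property.
    if (location == null) {
      location = properties.get(FOLDER_STORAGE_PATH);
    }
    // 3. Otherwise default to <tableLocation>/data.
    if (location == null) {
      location = stripTrailingSlash(tableLocation) + "/data";
    }
    return stripTrailingSlash(location);
  }

  // Removes trailing slashes so later path joins do not produce "//".
  static String stripTrailingSlash(String path) {
    String result = path;
    while (result.endsWith("/")) {
      result = result.substring(0, result.length() - 1);
    }
    return result;
  }

  public static void main(String[] args) {
    String tableLocation = "s3://bucket/warehouse/db/tablename";
    Map<String, String> props = new HashMap<>();

    // Nothing set: the table location plus "/data" is used.
    System.out.println(resolveDataLocation(props, tableLocation));

    // Only the folder-storage property set: it is used as the fallback.
    props.put(FOLDER_STORAGE_PATH, "s3://bucket/custom-data");
    System.out.println(resolveDataLocation(props, tableLocation));

    // Both set: the object-storage property wins.
    props.put(OBJECT_STORAGE_PATH, "s3://other-bucket/objects");
    System.out.println(resolveDataLocation(props, tableLocation));
  }
}
```

Keeping the lookup as a single ordered chain makes the precedence easy to audit, which matters given the concern raised here that path resolution is already complicated.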
@openinx any additional thoughts on this?
I agree with the fallback strategy, though I'd like to see a warehouse-like setup for the whole catalog.
E.g. if I don't set `path` per table, I'd like it to work as `s3://bucket/warehouse/path/<tableLocation and/or just table name>/<objectstorage_hash>/data`.
Is that more or less what this config is achieving?
Correct. The default table location is defined by the catalog implementation, which is likely `s3://bucket/warehouse/db/tablename`, or by a custom override during create table. So `<tableLocation>/data` will always be available when writing data, and this PR tries to use that as the default.
The main discussion point, I think, is whether we should have `write.folder-storage.path` as part of the fallback. I think it is worth having, but I agree that it increases the complexity of path resolution, which is already quite complicated in Iceberg. If no one objects to the current approach, I will merge this.
My only concern is with some potential confusion around the existence of `write.object-storage.*` and then falling back to `write.folder-storage.*`. I admittedly have enough trouble as it is convincing people that object-storage keys are not folders 🙂, but I don't think that's necessarily a large enough concern to introduce new table properties.
Given that it's an existing property though, I do think it makes sense.
In this example, with the object storage location provider, where would the hash / entropy characters be placed in the path?
Am I correct in assuming that the transformation would be as follows?
`s3://bucket/warehouse/db/tablename` -> `s3://bucket/warehouse/db/tablename/<hash>/data`
Or would we wind up getting `s3://bucket/warehouse/<hash>/db/tablename/data`?
My preference would be for the hash to be just after the warehouse for data files if `write.object-storage.path` is not specified, though I can understand that we don't want to complicate the resolution too much. However, keeping the entropy high enough in the path for S3 partitioning seems like a reasonable goal, even if it makes the resolution somewhat convoluted.
I have not used `WRITE_NEW_DATA_LOCATION` enough to have a strong opinion there. I agree the naming is a bit strange (`write.folder-storage.path`), but given that this config already exists (with the given name), I do agree that it would be a bit strange to ignore it entirely if users have set it.
Is there no convention that `write.folder-storage.path` and `write.folder-storage.*` apply only to non-object-storage layouts? Given that the constant is named `TableProperties.WRITE_NEW_DATA_LOCATION`, I'm guessing that the property name is just an arguably unfortunate side effect, but that we should respect the intention of this property.
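To make the two candidate layouts in the question above concrete, here is a small Java sketch that builds both paths for the same file. It makes no claim about which layout the object storage location provider actually produces: the class `HashPlacementSketch`, its methods, and the use of `String.hashCode()` as the entropy source are all assumptions made for illustration only.

```java
// Hypothetical comparison of the two hash placements discussed in this thread.
class HashPlacementSketch {
  // Stand-in entropy: 8 hex characters derived from the file name.
  // This is NOT the hash function Iceberg uses; it only illustrates placement.
  static String entropy(String fileName) {
    return String.format("%08x", fileName.hashCode());
  }

  // Candidate A: hash placed after the table location,
  // i.e. s3://bucket/warehouse/db/tablename/<hash>/data/<file>
  static String hashAfterTable(String tableLocation, String fileName) {
    return String.format("%s/%s/data/%s", tableLocation, entropy(fileName), fileName);
  }

  // Candidate B: hash placed directly after the warehouse root,
  // i.e. s3://bucket/warehouse/<hash>/db/tablename/data/<file>
  static String hashAfterWarehouse(String warehouse, String tablePath, String fileName) {
    return String.format("%s/%s/%s/data/%s", warehouse, entropy(fileName), tablePath, fileName);
  }

  public static void main(String[] args) {
    String fileName = "00000-0-data.parquet"; // hypothetical file name
    System.out.println(hashAfterTable("s3://bucket/warehouse/db/tablename", fileName));
    System.out.println(hashAfterWarehouse("s3://bucket/warehouse", "db/tablename", fileName));
  }
}
```

In candidate B the varying characters appear earlier in the key, which is the property the comment argues for when it mentions keeping entropy high in the path for S3 partitioning.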
@openinx opened a GitHub issue. As a new user of Iceberg, I find that the weird naming here makes the codebase hard to understand.
#2964