-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Core: allow default data location for ObjectStorageLocationProvider #2845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| private static String defaultDataLocation(String tableLocation, Map<String, String> properties) { | ||
| return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honestly, I did not get what's the real meaning for TableProperties.WRITE_NEW_DATA_LOCATION. If it's the configured default location, then why did it name as WRITE_NEW_DATA_LOCATION with its real name as write.folder-storage.path.
Now in this patch, we are trying to fallback to the write.folder-storage.path when people did not set the write.object-storage.path. It even make this more confuse to understand :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I totally agree TableProperties.WRITE_NEW_DATA_LOCATION is probably not a good name, but it is what it is, we might want to deprecate and rename this if we have a major version update...
The resolution strategy I am trying to create here is that:
- if
write.object-storage.pathis set, use it - if not found, fallback to
write.folder-storage.path - if not found, use
<tableLocation>/data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@openinx any additional thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on the fallback strategy, though I'd like to see a warehouse like set up for the whole catalog.
E.g. if I don't set path, per table, I'd like for it to work as s3://bucket/warehouse/path/<tableLocation and/or just table name>/<objectstorage_hash>/data
Is that more or less what this config is achieving?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Table location default is defined by the catalog implementation, which is likely s3://bucket/warehouse/db/tablename, or a custom override during create table. So that path slash data will always be there when writing data, and this PR tries to use that as the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main discussion point I think is if we should have write.folder-storage.path as a part of the fallback, which I think is worth having, but I do agree it increases the complexity for path resolution which is already quite complicated in Iceberg. If no one else objects the current approach, I will merge this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern is with some potential confusion around the existence of write.object-storage.* and then falling back to write.folder-storage.*. I admittedly have enough trouble as it is convincing people that object-storage keys are not folders 🙂, but I don't think that's necessarily a large enough concern to introduce new table properties.
Given that it's an existing property though, I do think it makes sense.
In this example, with object storage location provider, where would the hash / entropy characters be placed in the path?
Am I correct in assuming that the transformation would be as follows?
s3://bucket/warehouse/db/tablename -> s3://bucket/warehouse/db/tablename/<hash>/data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or would we wind up getting s3://bucket/warehouse/<hash>/db/tablename/data?
My preference would be for the hash to be just after the warehouse for data files if write.object-storage.path is not specified, though I can understand that we don't want to complicate the resolution to much. However, keeping the entropy high enough in the path for S3 partitioning seems like a reasonable goal for allowing it to be somewhat convoluted.
I have not made much usage of WRITE_NEW_DATA_LOCATION to have a strong opinion there. I agree the naming is a bit strange (write.folder-storage.path), but given that this config already exists (with the given name), I do agree that it would be a bit strange to ignore it entirely if users have set it.
Is there no convention around write.folder-storage.path, and write.folder-storage.* to only be applied to non-object storage layouts? Given that the property name is TableProperties.WRITE_NEW_DATA_LOCATION, I'm guessing that the name is just an arguably unfortunate side effect, but that we should respect the intention of this property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honestly, I did not get what's the real meaning for
TableProperties.WRITE_NEW_DATA_LOCATION. If it's the configured default location, then why did it name asWRITE_NEW_DATA_LOCATIONwith its real name aswrite.folder-storage.path.
@openinx opened a Github issue. As a new user to Iceberg the weird naming here is making the codebase hard to understand.
#2964
|
@jackye1995 has there been any further discussion on this? I was also going to open a PR to allow for a default location, as many users create many tables and don't want to configure a location per table (e.g. they want a warehouse-like location, but with object storage randomness in the path). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good to me.
|
@kbendick thanks for the approval. We discussed offline, and I think the use cases that we are trying to achieve require this fallback strategy. Use case 1: user creates table and want to use Use case 2: user has an existing table with So the 2 levels of fallback is required. I will go merge the PR, thanks for all the discussion. |
when table property
write.object-storage.pathis not set, use the default table data path which istableLocation/data@yyanyy