Skip to content

Conversation

@jackye1995
Copy link
Contributor

when table property write.object-storage.path is not set, use the default table data path which is tableLocation/data

@yyanyy

@github-actions github-actions bot added the core label Jul 21, 2021
}

private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honestly, I did not get what's the real meaning for TableProperties.WRITE_NEW_DATA_LOCATION. If it's the configured default location, then why did it name as WRITE_NEW_DATA_LOCATION with its real name as write.folder-storage.path.

Now in this patch, we are trying to fallback to the write.folder-storage.path when people did not set the write.object-storage.path. It even make this more confuse to understand :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I totally agree TableProperties.WRITE_NEW_DATA_LOCATION is probably not a good name, but it is what it is, we might want to deprecate and rename this if we have a major version update...

The resolution strategy I am trying to create here is that:

  • if write.object-storage.path is set, use it
  • if not found, fallback to write.folder-storage.path
  • if not found, use <tableLocation>/data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@openinx any additional thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on the fallback strategy, though I'd like to see a warehouse like set up for the whole catalog.

E.g. if I don't set path, per table, I'd like for it to work as s3://bucket/warehouse/path/<tableLocation and/or just table name>/<objectstorage_hash>/data

Is that more or less what this config is achieving?

Copy link
Contributor Author

@jackye1995 jackye1995 Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Table location default is defined by the catalog implementation, which is likely s3://bucket/warehouse/db/tablename, or a custom override during create table. So that path slash data will always be there when writing data, and this PR tries to use that as the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main discussion point I think is if we should have write.folder-storage.path as a part of the fallback, which I think is worth having, but I do agree it increases the complexity for path resolution which is already quite complicated in Iceberg. If no one else objects the current approach, I will merge this.

Copy link
Contributor

@kbendick kbendick Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is with some potential confusion around the existence of write.object-storage.* and then falling back to write.folder-storage.*. I admittedly have enough trouble as it is convincing people that object-storage keys are not folders 🙂, but I don't think that's necessarily a large enough concern to introduce new table properties.

Given that it's an existing property though, I do think it makes sense.

In this example, with object storage location provider, where would the hash / entropy characters be placed in the path?

Am I correct in assuming that the transformation would be as follows?
s3://bucket/warehouse/db/tablename -> s3://bucket/warehouse/db/tablename/<hash>/data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or would we wind up getting s3://bucket/warehouse/<hash>/db/tablename/data?

My preference would be for the hash to be just after the warehouse for data files if write.object-storage.path is not specified, though I can understand that we don't want to complicate the resolution to much. However, keeping the entropy high enough in the path for S3 partitioning seems like a reasonable goal for allowing it to be somewhat convoluted.

I have not made much usage of WRITE_NEW_DATA_LOCATION to have a strong opinion there. I agree the naming is a bit strange (write.folder-storage.path), but given that this config already exists (with the given name), I do agree that it would be a bit strange to ignore it entirely if users have set it.

Is there no convention around write.folder-storage.path, and write.folder-storage.* to only be applied to non-object storage layouts? Given that the property name is TableProperties.WRITE_NEW_DATA_LOCATION, I'm guessing that the name is just an arguably unfortunate side effect, but that we should respect the intention of this property.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honestly, I did not get what's the real meaning for TableProperties.WRITE_NEW_DATA_LOCATION. If it's the configured default location, then why did it name as WRITE_NEW_DATA_LOCATION with its real name as write.folder-storage.path.

@openinx opened a Github issue. As a new user to Iceberg the weird naming here is making the codebase hard to understand.
#2964

@kbendick
Copy link
Contributor

@jackye1995 has there been any further discussion on this?

I was also going to open a PR to allow for a default location, as many users create many tables and don't want to configure a location per table (e.g. they want a warehouse-like location, but with object storage randomness in the path).

Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me.

@jackye1995
Copy link
Contributor Author

@kbendick thanks for the approval. We discussed offline, and I think the use cases that we are trying to achieve require this fallback strategy.

Use case 1: user creates table and want to use ObjectStorageLocationProvider with the default table path. Currently this will throw a NPE because it requires write.object-storage.path to be set. But the user might not really know what is the root path of the table, and we should not break this abstraction. So a fallback to the default table root path should be desirable.

Use case 2: user has an existing table with write.folder-storage.path set, and would like to enable object storage mode by updating the table property. With change to use case 1, this will directly fallback to the default table location instead of the folder storage path, but the correct behavior is for the Iceberg table to respect the write.folder-storage.path and use it as the root.

So the 2 levels of fallback is required. I will go merge the PR, thanks for all the discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants