Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 7 additions & 4 deletions core/src/main/java/org/apache/iceberg/LocationProviders.java
Original file line number Diff line number Diff line change
Expand Up @@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
}
}

private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honestly, I did not get what's the real meaning for TableProperties.WRITE_NEW_DATA_LOCATION. If it's the configured default location, then why did it name as WRITE_NEW_DATA_LOCATION with its real name as write.folder-storage.path.

Now in this patch, we are trying to fallback to the write.folder-storage.path when people did not set the write.object-storage.path. It even make this more confuse to understand :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I totally agree TableProperties.WRITE_NEW_DATA_LOCATION is probably not a good name, but it is what it is, we might want to deprecate and rename this if we have a major version update...

The resolution strategy I am trying to create here is that:

  • if write.object-storage.path is set, use it
  • if not found, fallback to write.folder-storage.path
  • if not found, use <tableLocation>/data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@openinx any additional thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on the fallback strategy, though I'd like to see a warehouse like set up for the whole catalog.

E.g. if I don't set path, per table, I'd like for it to work as s3://bucket/warehouse/path/<tableLocation and/or just table name>/<objectstorage_hash>/data

Is that more or less what this config is achieving?

Copy link
Contributor Author

@jackye1995 jackye1995 Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Table location default is defined by the catalog implementation, which is likely s3://bucket/warehouse/db/tablename, or a custom override during create table. So that path slash data will always be there when writing data, and this PR tries to use that as the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main discussion point I think is if we should have write.folder-storage.path as a part of the fallback, which I think is worth having, but I do agree it increases the complexity for path resolution which is already quite complicated in Iceberg. If no one else objects the current approach, I will merge this.

Copy link
Contributor

@kbendick kbendick Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is with some potential confusion around the existence of write.object-storage.* and then falling back to write.folder-storage.*. I admittedly have enough trouble as it is convincing people that object-storage keys are not folders 🙂, but I don't think that's necessarily a large enough concern to introduce new table properties.

Given that it's an existing property though, I do think it makes sense.

In this example, with object storage location provider, where would the hash / entropy characters be placed in the path?

Am I correct in assuming that the transformation would be as follows?
s3://bucket/warehouse/db/tablename -> s3://bucket/warehouse/db/tablename/<hash>/data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or would we wind up getting s3://bucket/warehouse/<hash>/db/tablename/data?

My preference would be for the hash to be just after the warehouse for data files if write.object-storage.path is not specified, though I can understand that we don't want to complicate the resolution to much. However, keeping the entropy high enough in the path for S3 partitioning seems like a reasonable goal for allowing it to be somewhat convoluted.

I have not made much usage of WRITE_NEW_DATA_LOCATION to have a strong opinion there. I agree the naming is a bit strange (write.folder-storage.path), but given that this config already exists (with the given name), I do agree that it would be a bit strange to ignore it entirely if users have set it.

Is there no convention around write.folder-storage.path, and write.folder-storage.* to only be applied to non-object storage layouts? Given that the property name is TableProperties.WRITE_NEW_DATA_LOCATION, I'm guessing that the name is just an arguably unfortunate side effect, but that we should respect the intention of this property.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honestly, I did not get what's the real meaning for TableProperties.WRITE_NEW_DATA_LOCATION. If it's the configured default location, then why did it name as WRITE_NEW_DATA_LOCATION with its real name as write.folder-storage.path.

@openinx opened a Github issue. As a new user to Iceberg the weird naming here is making the codebase hard to understand.
#2964

}

static class DefaultLocationProvider implements LocationProvider {
private final String dataLocation;

DefaultLocationProvider(String tableLocation, Map<String, String> properties) {
this.dataLocation = stripTrailingSlash(properties.getOrDefault(
TableProperties.WRITE_NEW_DATA_LOCATION,
String.format("%s/data", tableLocation)));
this.dataLocation = stripTrailingSlash(defaultDataLocation(tableLocation, properties));
}

@Override
Expand All @@ -96,7 +98,8 @@ static class ObjectStoreLocationProvider implements LocationProvider {
private final String context;

ObjectStoreLocationProvider(String tableLocation, Map<String, String> properties) {
this.storageLocation = stripTrailingSlash(properties.get(OBJECT_STORE_PATH));
this.storageLocation = stripTrailingSlash(properties.getOrDefault(OBJECT_STORE_PATH,
defaultDataLocation(tableLocation, properties)));
this.context = pathContext(tableLocation);
}

Expand Down
26 changes: 26 additions & 0 deletions core/src/test/java/org/apache/iceberg/TestLocationProvider.java
Original file line number Diff line number Diff line change
Expand Up @@ -212,4 +212,30 @@ public void testInvalidArgTypesDynamicallyLoadedLocationProvider() {
() -> table.locationProvider()
);
}

@Test
public void testObjectStorageLocationProviderPathResolution() {
table.updateProperties()
.set(TableProperties.OBJECT_STORE_ENABLED, "true")
.commit();

Assert.assertTrue("default data location should be used when object storage path not set",
table.locationProvider().newDataLocation("file").contains(table.location() + "/data"));

String folderPath = "s3://random/folder/location";
table.updateProperties()
.set(TableProperties.WRITE_NEW_DATA_LOCATION, folderPath)
.commit();

Assert.assertTrue("folder storage path should be used when set",
table.locationProvider().newDataLocation("file").contains(folderPath));

String objectPath = "s3://random/object/location";
table.updateProperties()
.set(TableProperties.OBJECT_STORE_PATH, objectPath)
.commit();

Assert.assertTrue("object storage path should be used when set",
table.locationProvider().newDataLocation("file").contains(objectPath));
}
}