Skip to content
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 16 additions & 10 deletions core/src/main/java/org/apache/iceberg/LocationProviders.java
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,6 @@
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.PropertyUtil;

import static org.apache.iceberg.TableProperties.OBJECT_STORE_PATH;

public class LocationProviders {

private LocationProviders() {
Expand Down Expand Up @@ -68,16 +66,11 @@ public static LocationProvider locationsFor(String location, Map<String, String>
}
}

private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
return properties.getOrDefault(TableProperties.WRITE_FOLDER_STORAGE_LOCATION,
String.format("%s/data", tableLocation));
}

static class DefaultLocationProvider implements LocationProvider {
private final String dataLocation;

DefaultLocationProvider(String tableLocation, Map<String, String> properties) {
this.dataLocation = stripTrailingSlash(defaultDataLocation(tableLocation, properties));
this.dataLocation = stripTrailingSlash(dataLocation(properties, tableLocation));
}

@Override
Expand All @@ -99,8 +92,7 @@ static class ObjectStoreLocationProvider implements LocationProvider {
private final String context;

ObjectStoreLocationProvider(String tableLocation, Map<String, String> properties) {
this.storageLocation = stripTrailingSlash(properties.getOrDefault(OBJECT_STORE_PATH,
defaultDataLocation(tableLocation, properties)));
this.storageLocation = stripTrailingSlash(dataLocation(properties, tableLocation));
this.context = pathContext(tableLocation);
}

Expand Down Expand Up @@ -141,4 +133,18 @@ private static String stripTrailingSlash(String path) {
}
return result;
}

public static String dataLocation(Map<String, String> properties, String tableLocation) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Does this need to be public or can it be private? If it needs to be visible for testing or something, that's ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Visible for testing should use protected or package-private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I make it public because it is used in IcebergSourceBenchmark.java. Maybe we can change WRITE_FOLDER_STORAGE_LOCATION to the WRITE_DATA_PATH in IcebergSourceBenchmark.java?

String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
if (dataLocation == null) {
dataLocation = properties.get(TableProperties.OBJECT_STORE_PATH);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only correct if the table is using object storage. I think you need to update this logic to select whether to return OBJECT_STORE_PATH or WRITE_FOLDER_STORAGE_LOCATION depending on the location provider selected. Both should fall back to the table location. And the new property should take precedence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Made the change. This is going to be a little different from #2965. Modified the test case testObjectStorageLocationProviderPathResolution and added testDefaultStorageLocationProviderPathResolution.

if (dataLocation == null) {
dataLocation = properties.get(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
if (dataLocation == null) {
dataLocation = String.format("%s/data", tableLocation);
}
}
}
return dataLocation;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're falling back through several deprecated options, we might want to list out the intended fallback order / behavior (in a comment up top for example) so we can more easily verify that it happens.

Or better yet, add tests setting different combinations.

We should potentially also address whether or not it's possible for users to have set too many flags and then error out.

Lastly, might want to drop a warning level deprecation log if one of the deprecated flags is non-null (or maybe just info). :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testObjectStorageLocationProviderPathResolution is for that. Will add warn log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the log.

}
18 changes: 14 additions & 4 deletions core/src/main/java/org/apache/iceberg/TableProperties.java
Original file line number Diff line number Diff line change
Expand Up @@ -135,21 +135,31 @@ private TableProperties() {
public static final String OBJECT_STORE_ENABLED = "write.object-storage.enabled";
public static final boolean OBJECT_STORE_ENABLED_DEFAULT = false;

/**
* @deprecated will be removed in 0.14.0, use {@link #WRITE_DATA_LOCATION} instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt this will ever be removed. It isn't worth breaking on older tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, remove the comments.

*/
@Deprecated
public static final String OBJECT_STORE_PATH = "write.object-storage.path";

public static final String WRITE_LOCATION_PROVIDER_IMPL = "write.location-provider.impl";

// This only applies to files written after this property is set. Files previously written aren't
// relocated to reflect this parameter.
// If not set, defaults to a "data" folder underneath the root path of the table.
/**
* @deprecated will be removed in 0.14.0, use {@link #WRITE_DATA_LOCATION} instead
*/
@Deprecated
public static final String WRITE_FOLDER_STORAGE_LOCATION = "write.folder-storage.path";

/**
* @deprecated will be removed in 0.14.0, use {@link #WRITE_FOLDER_STORAGE_LOCATION} instead
* @deprecated will be removed in 0.14.0, use {@link #WRITE_DATA_LOCATION} instead
*/
@Deprecated
public static final String WRITE_NEW_DATA_LOCATION = "write.folder-storage.path";

// This only applies to files written after this property is set. Files previously written aren't
// relocated to reflect this parameter.
// If not set, defaults to a "data" folder underneath the root path of the table.
public static final String WRITE_DATA_LOCATION = "write.data.path";
Comment on lines +158 to +161
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do we want to update the comment for the default if not set to reflect anything about the possibility of object storage location provider?

Up to you. The more I think about it, the more I think it just complicates things and that we should just properly document the behavior on the website. But up to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can move L161 to after L144 to avoid deleting the comments and add again here.


// This only applies to files written after this property is set. Files previously written aren't
// relocated to reflect this parameter.
// If not set, defaults to a "metadata" folder underneath the root path of the table.
Expand Down
10 changes: 9 additions & 1 deletion core/src/test/java/org/apache/iceberg/TestLocationProvider.java
Original file line number Diff line number Diff line change
Expand Up @@ -114,7 +114,7 @@ public void testDefaultLocationProvider() {
@Test
public void testDefaultLocationProviderWithCustomDataLocation() {
this.table.updateProperties()
.set(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, "new_location")
.set(TableProperties.WRITE_DATA_LOCATION, "new_location")
.commit();

this.table.locationProvider().newDataLocation("my_file");
Expand Down Expand Up @@ -237,5 +237,13 @@ public void testObjectStorageLocationProviderPathResolution() {

Assert.assertTrue("object storage path should be used when set",
table.locationProvider().newDataLocation("file").contains(objectPath));

String dataPath = "s3://random/data/location";
table.updateProperties()
.set(TableProperties.WRITE_DATA_LOCATION, dataPath)
.commit();

Assert.assertTrue("object storage path should be used when set",
table.locationProvider().newDataLocation("file").contains(dataPath));
}
}
7 changes: 3 additions & 4 deletions site/docs/aws.md
Original file line number Diff line number Diff line change
Expand Up @@ -344,7 +344,7 @@ Data stored in S3 with a traditional Hive storage layout can face S3 request thr

Iceberg by default uses the Hive storage layout, but can be switched to use the `ObjectStoreLocationProvider`.
With `ObjectStoreLocationProvider`, a determenistic hash is generated for each stored file, with the hash appended
directly after the `write.object-storage.path`. This ensures files written to s3 are equally distributed across multiple [prefixes](https://aws.amazon.com/premiumsupport/knowledge-center/s3-object-key-naming-pattern/) in the S3 bucket. Resulting in minimized throttling and maximized throughput for S3-related IO operations. When using `ObjectStoreLocationProvider` having a shared and short `write.object-storage.path` across your Iceberg tables will improve performance.
directly after the `write.data.path`. This ensures files written to s3 are equally distributed across multiple [prefixes](https://aws.amazon.com/premiumsupport/knowledge-center/s3-object-key-naming-pattern/) in the S3 bucket. Resulting in minimized throttling and maximized throughput for S3-related IO operations. When using `ObjectStoreLocationProvider` having a shared and short `write.data.path` across your Iceberg tables will improve performance.

For more information on how S3 scales API QPS, checkout the 2018 re:Invent session on [Best Practices for Amazon S3 and Amazon S3 Glacier]( https://youtu.be/rHeTn9pHNKo?t=3219). At [53:39](https://youtu.be/rHeTn9pHNKo?t=3219) it covers how S3 scales/partitions & at [54:50](https://youtu.be/rHeTn9pHNKo?t=3290) it discusses the 30-60 minute wait time before new partitions are created.

Expand All @@ -358,7 +358,7 @@ CREATE TABLE my_catalog.my_ns.my_table (
USING iceberg
OPTIONS (
'write.object-storage.enabled'=true,
'write.object-storage.path'='s3://my-table-data-bucket')
'write.data.path'='s3://my-table-data-bucket')
PARTITIONED BY (category);
```

Expand All @@ -373,8 +373,7 @@ s3://my-table-data-bucket/2d3905f8/my_ns.db/my_table/category=orders/00000-0-5af
```

Note, the path resolution logic for `ObjectStoreLocationProvider` is as follows:
- if `write.object-storage.path` is set, use it
- if not found, fallback to `write.folder-storage.path`
- if `write.data.path` is set, use it
- if not found, use `<tableLocation>/data`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we have a simple path resolution strategy, I think we can just put these listing in a single sentence. Also add 2 warning blocks describing the legacy behaviors:

  • before 0.12.0, write.object-storage.path must be set
  • at 0.12.0, write.object-storage.path then write.folder-storage.path then <tableLocation>/data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jack! Made the change for the doc.


For more details, please refer to the [LocationProvider Configuration](../custom-catalog/#custom-location-provider-implementation) section.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.LocationProviders;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.UpdateProperties;
Expand Down Expand Up @@ -78,9 +79,7 @@ protected String newTableLocation() {
}

protected String dataLocation() {
Map<String, String> properties = table.properties();
return properties.getOrDefault(TableProperties.WRITE_FOLDER_STORAGE_LOCATION,
String.format("%s/data", table.location()));
return LocationProviders.dataLocation(table().properties(), table.location());
}

protected void cleanupFiles() throws IOException {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -285,7 +285,7 @@ public void testWapFilesAreKept() throws InterruptedException {
public void testMetadataFolderIsIntact() throws InterruptedException {
// write data directly to the table location
Map<String, String> props = Maps.newHashMap();
props.put(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, tableLocation);
props.put(TableProperties.WRITE_DATA_LOCATION, tableLocation);
Table table = TABLES.create(SCHEMA, SPEC, props, tableLocation);

List<ThreeColumnRecord> records = Lists.newArrayList(
Expand Down Expand Up @@ -357,7 +357,7 @@ public void testOlderThanTimestamp() throws InterruptedException {
@Test
public void testRemoveUnreachableMetadataVersionFiles() throws InterruptedException {
Map<String, String> props = Maps.newHashMap();
props.put(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, tableLocation);
props.put(TableProperties.WRITE_DATA_LOCATION, tableLocation);
props.put(TableProperties.METADATA_PREVIOUS_VERSIONS_MAX, "1");
Table table = TABLES.create(SCHEMA, SPEC, props, tableLocation);

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -145,7 +145,7 @@ public void testWriteWithCustomDataLocation() throws IOException {
File tablePropertyDataLocation = temp.newFolder("test-table-property-data-dir");
Table table = createTable(new Schema(SUPPORTED_PRIMITIVES.fields()), location);
table.updateProperties().set(
TableProperties.WRITE_FOLDER_STORAGE_LOCATION, tablePropertyDataLocation.getAbsolutePath()).commit();
TableProperties.WRITE_DATA_LOCATION, tablePropertyDataLocation.getAbsolutePath()).commit();
writeAndValidateWithLocations(table, location, tablePropertyDataLocation);
}

Expand Down Expand Up @@ -271,7 +271,7 @@ public void testNullableWithWriteOption() throws IOException {
String sourcePath = String.format("%s/nullable_poc/sourceFolder/", location.toString());
String targetPath = String.format("%s/nullable_poc/targetFolder/", location.toString());

tableProperties = ImmutableMap.of(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, targetPath);
tableProperties = ImmutableMap.of(TableProperties.WRITE_DATA_LOCATION, targetPath);

// read this and append to iceberg dataset
spark
Expand Down Expand Up @@ -312,7 +312,7 @@ public void testNullableWithSparkSqlOption() throws IOException {
String sourcePath = String.format("%s/nullable_poc/sourceFolder/", location.toString());
String targetPath = String.format("%s/nullable_poc/targetFolder/", location.toString());

tableProperties = ImmutableMap.of(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, targetPath);
tableProperties = ImmutableMap.of(TableProperties.WRITE_DATA_LOCATION, targetPath);

// read this and append to iceberg dataset
spark
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -167,7 +167,7 @@ protected Map<String, String> destTableProps() {
// remove any possible location properties from origin properties
properties.remove(LOCATION);
properties.remove(TableProperties.WRITE_METADATA_LOCATION);
properties.remove(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
properties.remove(TableProperties.WRITE_DATA_LOCATION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When snapshotting a table that has folder storage location set, should we also remove it in addition to write data location?
Also should we remove object storage location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be safer to remove both. Added in the new commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a PR #2966 for this change, you can either also port my tests here, or remove this and I will update that PR once this is merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jackye1995 , thanks for the information. I've added both WRITE_FOLDER_STORAGE_LOCATION and OBJECT_STORE_PATH. Can you rebase it in PR #2966 once this got merged?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do that


// set default and user-provided props
properties.put(TableCatalog.PROP_PROVIDER, "iceberg");
Expand Down