@cobookman
Contributor

Clarified the description of ObjectStoreLocationProvider: it generates a deterministic hash based on the filename, and the hash is placed after write.object-storage.path.

@github-actions github-actions bot added the docs label Aug 12, 2021
@jackye1995
Contributor

Thanks for rewording this and providing more clarification!

There is an ongoing PR #2845 to finalize exactly how ObjectStoreLocationProvider resolves the root path. Let's wait for that to be finalized, then update this change accordingly before merging.

@jackye1995
Contributor

@cobookman the referenced PR is merged. Could you update the documentation with the correct path resolution strategy? Thank you!

Member

@szehon-ho szehon-ho left a comment


Nice, thanks for shedding some light on this useful but under-documented feature (I struggled with it a little myself :))

@cobookman
Contributor Author

cobookman commented Aug 13, 2021

> @cobookman the PR referenced is merged, could you update the documentation with the correct path resolution strategy? Thank you!

Happy to. I just want to understand the expected write behaviour for folder-storage and S3. The following fails on my end:

CREATE TABLE my_catalog.my_ns.my_table (
    id bigint,
    data string,
    category string)
USING iceberg
OPTIONS (
    'write.object-storage.enabled'=true, 
    'write.folder-storage.path'='s3://some-bucket/some-random-folder/')
PARTITIONED BY (category);

INSERT INTO my_catalog.my_ns.my_table VALUES (1, "some data", "some category");
java.lang.NullPointerException
	at org.apache.iceberg.LocationProviders.stripTrailingSlash(LocationProviders.java:135)
	at org.apache.iceberg.LocationProviders.access$000(LocationProviders.java:34)
	at org.apache.iceberg.LocationProviders$ObjectStoreLocationProvider.<init>(LocationProviders.java:99)
	at org.apache.iceberg.LocationProviders.locationsFor(LocationProviders.java:65)
	at org.apache.iceberg.BaseMetastoreTableOperations.locationProvider(BaseMetastoreTableOperations.java:200)
	at org.apache.iceberg.BaseTable.locationProvider(BaseTable.java:219)
	at org.apache.iceberg.spark.source.SparkWrite.createWriterFactory(SparkWrite.java:172)
	at org.apache.iceberg.spark.source.SparkWrite.access$600(SparkWrite.java:87)
	at org.apache.iceberg.spark.source.SparkWrite$BaseBatchWrite.createBatchWriterFactory(SparkWrite.java:226)

Omitting 'write.object-storage.enabled'=true avoids the NullPointerException, but the write then falls back to the Hive-style path layout, writing data to:
s3://some-bucket/some-random-folder/category=some+category/00000-2-13441dd2-137a-42d1-9c6b-9ccc29a2ebeb-00001.parquet

spark-sql> CREATE TABLE my_catalog.my_ns.my_table (
         >     id bigint,
         >     data string,
         >     category string)
         > USING iceberg
         > OPTIONS (
         >     'write.folder-storage.path'='s3://some-bucket/some-random-folder/')
         > PARTITIONED BY (category);
Time taken: 2.021 seconds
spark-sql> INSERT INTO my_catalog.my_ns.my_table VALUES (1, "some data", "some category");
Time taken: 3.968 seconds

@jackye1995
Contributor

jackye1995 commented Aug 13, 2021

@cobookman The new resolution strategy is the following:

  • if write.object-storage.path is set, use it
  • otherwise, fall back to write.folder-storage.path
  • if that is not set either, use <tableLocation>/data
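
As a rough illustration of the fallback order above, here is a minimal Python sketch. It is not Iceberg's implementation: `resolve_data_location` is a hypothetical helper name, the properties are modelled as a plain dict, and stripping the trailing slash is an assumption made so the results compose cleanly into longer paths.

```python
from typing import Mapping

# Property names as they appear in this thread.
OBJECT_STORE_PATH = "write.object-storage.path"
FOLDER_STORAGE_PATH = "write.folder-storage.path"


def resolve_data_location(properties: Mapping[str, str], table_location: str) -> str:
    """Hypothetical helper: pick the root path using the three-step fallback."""
    # Steps 1 and 2: the first property that is set wins, in priority order.
    for key in (OBJECT_STORE_PATH, FOLDER_STORAGE_PATH):
        value = properties.get(key)
        if value:
            # Assumption: normalise by dropping any trailing slash.
            return value.rstrip("/")
    # Step 3: neither property is set, so default to <tableLocation>/data.
    return table_location.rstrip("/") + "/data"


if __name__ == "__main__":
    props = {FOLDER_STORAGE_PATH: "s3://my-bucket/some-random-folder/"}
    print(resolve_data_location(props, "s3://warehouse/my_ns/my_table"))
```

With only write.folder-storage.path set, as in the table definitions in this thread, the sketch returns the folder-storage path; with no properties at all it returns the table location followed by /data.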

I think you are testing using an older version. I have added some more tests in Spark SQL, in case you want some examples: https://github.com/apache/iceberg/pull/2966/files

@cobookman
Contributor Author

/facepalm Yep, that was it: I wasn't using the latest on GitHub. Got it to work, and will update the docs.

-- The hyphenated database name is quoted with backticks so Spark SQL parses it as one identifier.
CREATE TABLE my_catalog.`my-db`.my_table (
    id bigint,
    data string,
    category string)
USING iceberg
OPTIONS (
    'write.object-storage.enabled'=true,
    'write.folder-storage.path'='s3://my-bucket/some-random-folder/')
PARTITIONED BY (category);

Writes to:

s3://my-bucket/some-random-folder/21e81572/my-db.db/my_table/category=some+category/00000-0-5413ff19-edbc-41f9-ad2a-8ac1551ef3e1-00001.parquet
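
To make the layout of that path easier to read, here is a small Python sketch that assembles the same shape: root path, then a deterministic hash of the filename, then the database and table context, the partition directory, and the file. The names `placeholder_hash` and `object_store_path` are hypothetical, and CRC32 is only a stand-in for whatever hash Iceberg actually uses, so it will not reproduce `21e81572`.

```python
import zlib
from urllib.parse import quote_plus


def placeholder_hash(file_name: str) -> str:
    # Stand-in hash (assumption): CRC32 of the filename as 8 hex characters.
    # It is deterministic, but it is NOT the hash function Iceberg uses.
    return format(zlib.crc32(file_name.encode("utf-8")), "08x")


def object_store_path(prefix: str, context: str, field: str, value: str, file_name: str) -> str:
    """Hypothetical helper: <prefix>/<hash>/<context>/<field>=<value>/<file>."""
    # URL-encode the partition value, so "some category" becomes "some+category".
    partition_dir = f"{field}={quote_plus(value)}"
    parts = [prefix.rstrip("/"), placeholder_hash(file_name), context, partition_dir, file_name]
    return "/".join(parts)


if __name__ == "__main__":
    print(object_store_path(
        "s3://my-bucket/some-random-folder/",
        "my-db.db/my_table",
        "category",
        "some category",
        "00000-0-5413ff19-edbc-41f9-ad2a-8ac1551ef3e1-00001.parquet",
    ))
```

Because the hash segment sits directly after the root path, files from the same partition end up under different key prefixes, which is the point of the object-store layout.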


Clarified the description of ObjectStoreLocationProvider: it generates a deterministic hash based on the filename, and the hash is placed after `write.object-storage.path`.

Added an example S3 path for ObjectStoreLocationProvider

Docs aws.md - Updated with suggestions

Added
- path resolution information
- link to YouTube video on how S3 scales
- explained that 2d3905f8 is a hash in the s3 path
- changed text to "table properties"
Contributor

@jackye1995 jackye1995 left a comment


Thank you, lots of good information added!

@jackye1995 jackye1995 merged commit 8c490dc into apache:master Aug 15, 2021
