From 0a67940a27d01b81bf587951ca58bfe01a19f662 Mon Sep 17 00:00:00 2001
From: Colin Bookman
Date: Wed, 11 Aug 2021 21:35:21 -0700
Subject: [PATCH] Docs: Clarify ObjectStoreLocationProvider

Clarified the description on ObjectStoreLocationProvider on that it generates a deterministic hash based on the filename, and that the hash is placed after `write.object-storage.path`.

Added an example s3 path for ObjectStorageProvider

Docs aws.md - Updated with suggestions

Added
- path resolution information
- link to YT video on how S3 scales
- explained that 2d3905f8 is a hash in the s3 path
- changed text to "table properties"
---
 site/docs/aws.md | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/site/docs/aws.md b/site/docs/aws.md
index 7bf3a3cbedb2..ab37962e2505 100644
--- a/site/docs/aws.md
+++ b/site/docs/aws.md
@@ -339,13 +339,17 @@ For more details, please read [S3 ACL Documentation](https://docs.aws.amazon.com
 
 ### Object Store File Layout
 
-S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
-This means data stored in a traditional Hive storage layout has bad read and write throughput since data files of the same partition are placed under the same prefix.
-Iceberg by default uses the Hive storage layout, but can be switched to use a different `ObjectStoreLocationProvider`.
-In this mode, a hash string is added to the beginning of each file path, so that files are equally distributed across all prefixes in an S3 bucket.
-This results in minimized throttling and maximized throughput for S3-related IO operations.
-Here is an example Spark SQL command to create a table with this feature enabled:
+S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
+Data stored in S3 with a traditional Hive storage layout can face S3 request throttling because objects are stored under the same filepath prefix.
+Iceberg by default uses the Hive storage layout, but can be switched to use the `ObjectStoreLocationProvider`.
+With `ObjectStoreLocationProvider`, a deterministic hash is generated for each stored file, with the hash appended
+directly after the `write.object-storage.path`. This ensures files written to S3 are equally distributed across multiple [prefixes](https://aws.amazon.com/premiumsupport/knowledge-center/s3-object-key-naming-pattern/) in the S3 bucket, which minimizes throttling and maximizes throughput for S3-related IO operations. When using `ObjectStoreLocationProvider`, having a shared and short `write.object-storage.path` across your Iceberg tables will improve performance.
+
+For more information on how S3 scales API QPS, check out the 2018 re:Invent session on [Best Practices for Amazon S3 and Amazon S3 Glacier](https://youtu.be/rHeTn9pHNKo?t=3219). At [53:39](https://youtu.be/rHeTn9pHNKo?t=3219) it covers how S3 scales and partitions, and at [54:50](https://youtu.be/rHeTn9pHNKo?t=3290) it discusses the 30-60 minute wait time before new partitions are created.
+
+To use the `ObjectStoreLocationProvider`, add `'write.object-storage.enabled'=true` to the table's properties.
+Below is an example Spark SQL command to create a table using the `ObjectStoreLocationProvider`:
 
 ```sql
 CREATE TABLE my_catalog.my_ns.my_table (
     id bigint,
@@ -358,6 +362,21 @@ OPTIONS (
 PARTITIONED BY (category);
 ```
 
+We can then insert a single row into this new table:
+```SQL
+INSERT INTO my_catalog.my_ns.my_table VALUES (1, "Pizza", "orders");
+```
+
+This writes the data to S3 with a hash (`2d3905f8`) appended directly after the `write.object-storage.path`, ensuring reads to the table are spread evenly across [S3 bucket prefixes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html), and improving performance.
+```
+s3://my-table-data-bucket/2d3905f8/my_ns.db/my_table/category=orders/00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
+```
+
+Note, the path resolution logic for `ObjectStoreLocationProvider` is as follows:
+- if `write.object-storage.path` is set, use it
+- if not found, fall back to `write.folder-storage.path`
+- if not found, use `/data`
+
 For more details, please refer to the [LocationProvider Configuration](../custom-catalog/#custom-location-provider-implementation) section.
 
 ### S3 Strong Consistency
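
The path resolution order and the hashed layout described in this patch can be illustrated with a small Python sketch. It is not Iceberg's implementation: the function names `resolve_data_location` and `object_store_path` are hypothetical, CRC32 is only a stand-in for whatever hash the provider actually uses, and treating `/data` as relative to the table location is an assumption.

```python
import zlib


def resolve_data_location(table_location, properties):
    """Pick the base data path using the fallback order listed in the patch."""
    for key in ("write.object-storage.path", "write.folder-storage.path"):
        value = properties.get(key)
        if value:
            return value.rstrip("/")
    # Assumption: the `/data` default is relative to the table location.
    return table_location.rstrip("/") + "/data"


def object_store_path(base, namespace_db, table, partition, filename):
    """Build a path with a deterministic hash placed directly after the base path."""
    # Assumption: CRC32 stands in for the real hash; it is used here only
    # because it yields the same 8 hex digits for the same filename every run.
    digest = format(zlib.crc32(filename.encode("utf-8")) & 0xFFFFFFFF, "08x")
    return "/".join([base, digest, namespace_db, table, partition, filename])


props = {"write.object-storage.path": "s3://my-table-data-bucket"}
base = resolve_data_location("s3://warehouse/my_ns.db/my_table", props)
print(object_store_path(base, "my_ns.db", "my_table", "category=orders",
                        "00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet"))
```

Because the hash depends only on the filename, the same file always maps to the same prefix, while different files spread across many prefixes under the shared base path.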