AWS: documentation page for AWS module #1891

jackye1995 · 2020-12-08T18:59:56Z

Initial draft for documentation page of the AWS module, I will mark this as draft until #1823 and #1844 are merged.

Also, I am currently placing it under Tables tab, which is clearly wrong. I was thinking about adding a new tab Integrations for all integrations including AWS and Nessie, but the navbar becomes too long. I am thinking about moving Flink, Hive and Presto all to a single Engines tab, any thoughts on that?

ismailsimsek · 2020-12-16T11:55:02Z

site/docs/aws.md

+When this configurer is used, a STS client is initialized with default [credentials chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html) and [region chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html),
+and all the other clients (Glue, DynamoDB, S3, etc.) will use the configured assume-role credential and region.
+
+## Run Iceberg on AWS


Thank you @jackye1995, it looks great. This is more of a question: is it also possible to run Iceberg in an AWS "Glue Job", if we use `spark.jars.packages` to provide the Iceberg library?

I have not tried that yet since I am mostly focused on the Athena use case, but let me check if anyone has done that. Thanks for bringing up this use case.

yyanyy · 2020-12-16T23:56:49Z

site/docs/aws.md

+### Glue Catalog ID
+
+It is very common for an organization to store all the tables in a single Glue catalog in a single AWS account and run data computation in many different accounts. 
+In this case, you need to specify a Glue catalog ID when initializing `GlueCatalog`.


For settings like the Glue catalog ID, we probably want to mention the parameter `gluecatalog.id`. More generally, I think we may want to add a section in the configuration page that includes all configurable parameters defined in `AwsProperties`, and link to that section from this page; otherwise people won't know how to configure them. That would also serve as a centralized place to read about all configurations. Is it part of the plan, or covered by other PRs?
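To make the suggestion concrete, here is a minimal sketch of how the catalog ID could be passed alongside the other catalog properties. It assumes the `spark.sql.catalog.<name>.<property>` pattern used by the `spark-sql` example in this PR and the `gluecatalog.id` key as named in this comment (key names were still under discussion at the time). The helper `catalog_conf_flags` and the account ID are hypothetical.

```python
# Hypothetical helper (not part of Iceberg): render catalog properties
# as spark-sql --conf flags using the spark.sql.catalog.<name> prefix.
def catalog_conf_flags(catalog_name, properties):
    prefix = f"spark.sql.catalog.{catalog_name}"
    flags = [f"--conf {prefix}=org.apache.iceberg.spark.SparkCatalog"]
    for key, value in properties.items():
        flags.append(f"--conf {prefix}.{key}={value}")
    return flags


flags = catalog_conf_flags(
    "my_catalog",
    {
        "catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "warehouse": "s3://my-bucket/my-key-prefix",
        # Assumption: key name as mentioned in this thread; the value is a
        # placeholder AWS account ID that owns the Glue catalog.
        "gluecatalog.id": "123456789012",
    },
)

# Print the flags in a form that can be pasted after `spark-sql`.
print(" \\\n    ".join(flags))
```

A table of such keys in the configuration page would let readers look up each property without reading the prose sections.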

yyanyy · 2020-12-16T23:58:48Z

site/docs/aws.md

+This is because in each AWS account, there is a single Glue catalog in each AWS region,
+but the region is pre-determined by the Glue web client that is making the call.
+If you would like to access a Glue catalog in a different region, you should configure you AWS client, see more details in [AWS client configuration](#aws-client-configurations).


Nit: the logic here isn't entirely clear to me, or I misunderstood. It seems like the "this is because" part does not explain why we want to use the AWS account ID, but rather explains that we can configure the region and other settings when we need to?

I see. Please check whether the new version is easier to understand.

yyanyy · 2020-12-17T00:17:28Z

site/docs/aws.md

+To enable server side encryption, use the following configuration properties:
+
+* `s3fileio.sse.type`: `none`, `s3`, `kms` or `custom`, default to `none`
+* `s3fileio.sse.key`: a KMS Key ID or ARN for `kms` type (default to `aws/s3`), or a custom base-64 AES256 symmetric key for `custom` type.


From the current code base it seems like we are not defaulting it to anything. Is `aws/s3` a default value in the AWS SDK or something?

Yes, that is the default on the service side.
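The rules quoted above can be sketched as a small check. This is not Iceberg's validation logic; `check_sse_properties` is a hypothetical helper that only encodes what the draft text states: four allowed types, an optional key for `kms` (the service falls back to its own default when none is given), and a base-64 encoded AES256 key (32 bytes once decoded) for `custom`. Treating the key as mandatory for `custom` is an assumption.

```python
import base64

SSE_TYPES = {"none", "s3", "kms", "custom"}


def check_sse_properties(props):
    """Hypothetical sanity check for the SSE properties in the draft doc."""
    sse_type = props.get("s3fileio.sse.type", "none")
    if sse_type not in SSE_TYPES:
        raise ValueError(f"unsupported s3fileio.sse.type: {sse_type}")

    key = props.get("s3fileio.sse.key")
    if sse_type == "custom":
        # Assumption: a custom key must be supplied by the caller.
        if key is None:
            raise ValueError("s3fileio.sse.key is required for custom type")
        raw = base64.b64decode(key, validate=True)
        if len(raw) != 32:  # AES256 uses a 256-bit (32-byte) key
            raise ValueError("custom key must decode to 32 bytes")
    # For kms, a missing key is allowed: the service-side default applies.
    return sse_type, key


example_key = base64.b64encode(bytes(32)).decode("ascii")
print(check_sse_properties({"s3fileio.sse.type": "kms"}))
print(check_sse_properties(
    {"s3fileio.sse.type": "custom", "s3fileio.sse.key": example_key}
)[0])
```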

yyanyy · 2020-12-17T01:50:56Z

site/docs/aws.md

+
+## Run Iceberg on AWS
+
+[Amazon EMR](https://aws.amazon.com/emr/) is the most common platform to run Iceberg on AWS. EMR can provision clusters with [Spark](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html) (EMR 6 for Spark 3, EMR 5 for Spark 2),


Minor: did you have to pull down the aws sdk bundle to run sparksql in EMR? IIRC there's already some preinstalled aws libraries on EMR, if that's true we might be able to mention them here to save one step for the users.

The AWS SDK v2 library is not bundled with Spark.

HeartSaVioR · 2020-12-17T02:45:29Z

site/docs/aws.md

+To store data in a different local or cloud store, Glue catalog can switch to use `HadoopFileIO` 
+or any custom FileIO using the mechanism described in the [custom FileIO](../custom-catalog/#custom-file-io-implementation) section.
+
+## S3 FileIO


I was about to ask about the rationale for S3 FileIO compared to the Hadoop filesystem API with S3 support in #1945, but this section covers it. Thanks!

It's probably also worth mentioning whether the Hadoop FS API with S3 is sufficient, or whether S3 FileIO is required to avoid consistency glitches. That would help end users determine whether including the aws module is a requirement for dealing with S3.

I have updated the strong consistency section with more details. Please let me know if it is enough.

massdosage · 2020-12-17T10:34:14Z

site/docs/aws.md

+This provides maximized upload speed and minimized local disk usage during uploads.
+Here are the configurations user can tune related to this feature:
+
+* `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared pool across all output streams)


Is there a default here? Or is there no default and setting this is required?

It defaults to `Runtime.getRuntime().availableProcessors()`; let me add that to the doc.

massdosage · 2020-12-17T10:35:13Z

site/docs/aws.md

+
+* `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared pool across all output streams)
+* `s3fileio.multipart.part.size`: the size of a single part for multipart upload requests, default to 32MB
+* `s3fileio.multipart.threshold`: the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload, default to 1.5


As above: use "defaults to" or "default is".
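For readers following the threshold description, the arithmetic can be sketched as below. This is an illustration only: it assumes "32MB" means 32 MiB and that the switch happens when the object is strictly larger than part size times factor. Neither detail is confirmed in this thread, and both function names are hypothetical.

```python
MIB = 1024 * 1024


def multipart_switch_size(part_size_bytes=32 * MIB, threshold_factor=1.5):
    """Size above which a multipart upload would be used (assumed rule)."""
    return int(part_size_bytes * threshold_factor)


def uses_multipart(object_size_bytes, part_size_bytes=32 * MIB,
                   threshold_factor=1.5):
    # Assumption: strictly greater than the switch size triggers multipart.
    return object_size_bytes > multipart_switch_size(
        part_size_bytes, threshold_factor
    )


print(multipart_switch_size() // MIB)  # switch size in MiB with the defaults
print(uses_multipart(40 * MIB))        # below the switch size
print(uses_multipart(64 * MIB))        # above the switch size
```

With the documented defaults, the switch point works out to 48 MiB, so a 40 MiB file would go through a single put object request in this sketch.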

rdblue · 2020-12-22T23:18:44Z

site/docs/aws.md

+Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation.
+When used, an Iceberg `Namespace` is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html), 
+an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html),
+an Iceberg `Snapshot` is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion). 


Is this true? I thought that only one version of the table was kept and that it pointed to the Iceberg root metadata file. That file contains more than one snapshot.

Snapshot is probably a bad word, I will just use table version then.
But Glue also stores all historical table versions for each update.

rdblue · 2020-12-22T23:20:37Z

site/docs/aws.md

+
+Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation.
+When used, an Iceberg `Namespace` is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html), 
+an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html),


I would drop references to Iceberg classes here. Most readers are going to care about tables through a SQL or DataFrame interface, not through Iceberg's API. The mapping to Glue database and table names is probably pretty clear.

rdblue · 2020-12-22T23:21:21Z

site/docs/aws.md

+an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html),
+an Iceberg `Snapshot` is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion). 
+You can start using Glue catalog by specifying the `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`. 
+More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs).


I would definitely give a full example of using GlueCatalog here, in addition to linking to Spark and Flink pages.

Sorry, there's one above. Maybe just point the reader to it.

rdblue · 2020-12-22T23:21:53Z

site/docs/aws.md

+You can start using Glue catalog by specifying the `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`. 
+More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs).
+
+### Glue Catalog ID


What about adding a table of configuration options instead of sections?

The reason I am doing this is because the explanation is quite long, which makes the table look very messy.

I think it makes sense here, but I would opt for tables in most other places. Part of the appeal is it forces you to be concise.

rdblue · 2020-12-22T23:23:24Z

site/docs/aws.md

+
+By default, Glue will store all the table versions created and user can rollback a table to any historical version if needed.
+However, if you are streaming data to Iceberg, this will easily create a lot of Glue table versions.
+Therefore, it is recommended to turn off the archive feature in Glue by setting `gluecatalog.skip-archive` to false.


I think it would be a bit cleaner if these catalog options were shorter. this could simply be skip-archive in the catalog config. Similarly, the one above could be catalog-id.

This is mostly trying to be consistent with the s3fileio properties by having gluecatalog as the config namespace. We might be able to make all of them shorter, something like glue.id and s3.staging.dir. Any thoughts on this? @danielcweeks

rdblue · 2020-12-22T23:30:13Z

site/docs/aws.md

+* `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared across all output streams), defaults to the available number of processors in the system
+* `s3fileio.multipart.part.size`: the size of a single part for multipart upload requests, defaults to 32MB
+* `s3fileio.multipart.threshold`: the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload, defaults to 1.5
+* `s3fileio.staging.dir`: the directory to hold temporary files, defaults to Java's `java.io.tmpdir` property value


Great info here, but I think it may be easier to maintain as a table.

switched to table layout.

rdblue · 2020-12-22T23:33:39Z

site/docs/aws.md

+spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-key-prefix \


Does the Glue catalog support custom database locations?

good point, let me add that section

rdblue · 2020-12-22T23:34:37Z

@jackye1995 this is great! I noted a few things, but mostly what I would change is organizing the configuration into tables. The text describing all of the features is clear and really helpful.

jackye1995 · 2021-01-07T23:28:42Z

@rdblue there are currently too many tabs on the Iceberg website, and adding another tab for AWS would make the UI ugly. So I moved Flink, Trino and Hive under an Engines tab, and added a new tab called Integrations that will have topics like AWS, Nessie and JDBC.

rdblue · 2021-01-08T00:37:31Z

site/docs/aws.md

+| Property                          | Default                                            | Description                                            |
+| --------------------------------- | -------------------------------------------------- | ------------------------------------------------------ |
+| s3fileio.multipart.num-threads    | the available number of processors in the system   | number of threads to use for uploading parts to S3 (shared across all output streams)  |
+| s3fileio.multipart.part.size      | 32MB                                               | the size of a single part for multipart upload requests  |


Have we released this yet? We would normally make it `s3fileio.multipart.part-size-bytes`:

* `part` is part of "part size" and isn't really part of the hierarchy
* We prefer to have clear units
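A small sketch of why an explicit unit in the key helps: "32MB" can be read in two ways, while a byte count cannot. The property name below is the one suggested in this comment, not necessarily the final key, and the choice of the binary interpretation is an assumption.

```python
# Two common readings of "32MB".
decimal_bytes = 32 * 1000 ** 2  # 32 megabytes (SI)
binary_bytes = 32 * 1024 ** 2   # 32 mebibytes

# With a "-bytes" suffix the value is unambiguous.
# Assumption: the intended default is the binary reading.
property_line = f"s3fileio.multipart.part-size-bytes={binary_bytes}"

print(decimal_bytes)
print(binary_bytes)
print(property_line)
```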

rdblue · 2021-01-08T00:38:14Z

site/docs/aws.md

+| s3fileio.multipart.num-threads    | the available number of processors in the system   | number of threads to use for uploading parts to S3 (shared across all output streams)  |
+| s3fileio.multipart.part.size      | 32MB                                               | the size of a single part for multipart upload requests  |
+| s3fileio.multipart.threshold      | 1.5                                                | the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload  |
+| s3fileio.staging.dir              | `java.io.tmpdir` property value                    | the directory to hold temporary files  |


Is there anything else under staging, or should this be staging-dir?

Looks like there are multiple discussions around the config key names, and these names for s3fileio are not designed by me but Daniel. Let me put up another thread for this discussion before release.

let's use #2050 for the discussion

rdblue · 2021-01-08T00:39:30Z

site/docs/configuration.md

+
+| Property                          | Default            | Description                                            |
+| --------------------------------- | ------------------ | ------------------------------------------------------ |
+| lock.impl                         | null               | a custom implementation of the lock manager, the actual interface depends on the catalog used  |


The other implementation class properties are something-impl. Shouldn't this be lock-impl instead to match those? Then the other properties are in the lock namespace.

rdblue · 2021-01-08T00:40:11Z

site/docs/configuration.md

+| Property                          | Default            | Description                                            |
+| --------------------------------- | ------------------ | ------------------------------------------------------ |
+| lock.impl                         | null               | a custom implementation of the lock manager, the actual interface depends on the catalog used  |
+| lock.table                        | null               | an optional auxiliary table for locking                |


Isn't the table required if you're using Dynamo? I would not say it is optional, although it is optional if you're not using Dynamo.

Yes. This is not part of the AWS documentation, which is why I described it as optional; I can remove that. I will emphasize that it is required in the AWS doc.

Thanks for pointing that out. You're right that it is optional here. Thanks!

You may also want to link to the Dynamo docs if you haven't already, since that's the only implementation currently available.

rdblue · 2021-01-08T00:42:49Z

site/mkdocs.yml

-  - Spark:
-    - Getting Started: getting-started.md
+
+  - Engines:


The Spark page is getting long enough that I need to break it into separate pages. That's why I reorganized. I like having an Integrations tab, though.

I wonder if we can get rid of forward/back instead. And maybe move "About" into the "Project" list.

rdblue · 2021-01-08T01:31:00Z

@jackye1995, we can save some space by getting rid of the next/previous links and by moving "About" into "Project":

diff --git a/site/docs/css/extra.css b/site/docs/css/extra.css
index 4fa7c8036..3d79de02b 100644
--- a/site/docs/css/extra.css
+++ b/site/docs/css/extra.css
@@ -28,6 +28,10 @@
   float: left;
 }
 
+.navbar-right {
+  display: none;
+}
+
 .navbar-brand {
   margin-right: 1em;
 }
diff --git a/site/mkdocs.yml b/site/mkdocs.yml
index 1ce05135b..209495896 100644
--- a/site/mkdocs.yml
+++ b/site/mkdocs.yml
@@ -40,8 +40,8 @@ markdown_extensions:
   - admonition
   - pymdownx.tilde
 nav:
-  - About: index.md
-  - Project:
+  - About:
+    - Project: index.md
     - Community: community.md
     - Releases: releases.md
     - Trademarks: trademarks.md

jackye1995 · 2021-01-09T00:02:29Z

> @jackye1995, we can save some space by getting rid of the next/previous links and by moving "About" into "Project":

Thanks for the code reference, I just updated the tabs based on that

rdblue · 2021-01-09T00:07:30Z

site/docs/aws.md

+User can choose the ACL level by setting the `s3.acl` property.
+For more details, please read [S3 ACL Documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html).
+
+### ObjectStoreLocationProvider


How about using a description here rather than a class name, like "Object store file layout" or something?

Yes that's better, let me also remove other class names in titles

rdblue · 2021-01-09T00:09:07Z

site/docs/aws.md

+    data string,
+    category string)
+USING iceberg
+OPTIONS ('location'='s3://my-special-table-bucket')


You can also use the LOCATION DDL clause

oh yeah I completely forgot that

rdblue · 2021-01-09T00:12:43Z

site/docs/aws.md

+Many organizations have customized their way of configuring AWS clients with their own credential provider, access proxy, retry strategy, etc.
+Iceberg allows users to plug in their own implementation of `org.apache.iceberg.aws.AwsClientFactory` by setting the `client.factory` catalog property.
+
+### AssumeRoleAwsClientFactory


Is there a better heading for this as well?

Yes I just updated

jackye1995 marked this pull request as draft December 8, 2020 19:00

github-actions bot added the docs label Dec 8, 2020

ismailsimsek reviewed Dec 16, 2020

danielcweeks mentioned this pull request Dec 17, 2020: Revert "AWS: support S3 strong consistency (#1863)" #1945 (Merged)

yyanyy reviewed Dec 17, 2020

HeartSaVioR reviewed Dec 17, 2020

massdosage suggested changes Dec 17, 2020

rdblue reviewed Dec 22, 2020 (4 reviews)

jackye1995 marked this pull request as ready for review January 7, 2021 23:24

rdblue reviewed Jan 8, 2021

jackye1995 mentioned this pull request Jan 8, 2021: AWS: consolidate config key names before release #2050 (Merged)

AWS: documentation page for AWS module (d2f8ef6)

add deleted lines back (530157f)

rdblue reviewed Jan 9, 2021

add location sql, update title names (ae4654d)

rdblue approved these changes Jan 9, 2021

rdblue merged commit 674c9b6 into apache:master Jan 9, 2021

XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021: Docs: Add page for AWS integration (apache#1891) (6f62f26)

