Skip to content

Conversation

@jackye1995
Copy link
Contributor

Initial draft for documentation page of the AWS module, I will mark this as draft until #1823 and #1844 are merged.

Also, I am currently placing it under Tables tab, which is clearly wrong. I was thinking about adding a new tab Integrations for all integrations including AWS and Nessie, but the navbar becomes too long. I am thinking about moving Flink, Hive and Presto all to a single Engines tab, any thoughts on that?

@jackye1995 jackye1995 marked this pull request as draft December 8, 2020 19:00
@github-actions github-actions bot added the docs label Dec 8, 2020
When this configurer is used, a STS client is initialized with default [credentials chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html) and [region chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html),
and all the other clients (Glue, DynamoDB, S3, etc.) will use the configured assume-role credential and region.

## Run Iceberg on AWS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jackye1995 it looks great, this is more an question. is it also possible to run iceberg with AWS "Glue Job"? if we use spark.jars.packages to provide iceberg library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not tried that yet since I am mostly focused on Athena use case, but let me check if anyone has done that, thanks for bringing up this use case.

site/docs/aws.md Outdated
### Glue Catalog ID

It is very common for an organization to store all the tables in a single Glue catalog in a single AWS account and run data computation in many different accounts.
In this case, you need to specify a Glue catalog ID when initializing `GlueCatalog`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For things like setting glue catalog id we probably want to mention the parameter "gluecatalog.id"; or overall I think we may want to add a section in the configuration page that include all configurable parameters mentioned in AwsProperties and link to that section in this page, otherwise people won't know how to configure them. And that would serve as a centralized place to read about all configurations. Is it part of the plan/covered by other PRs?

site/docs/aws.md Outdated
Comment on lines 55 to 57
This is because in each AWS account, there is a single Glue catalog in each AWS region,
but the region is pre-determined by the Glue web client that is making the call.
If you would like to access a Glue catalog in a different region, you should configure you AWS client, see more details in [AWS client configuration](#aws-client-configurations).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I guess the logic here isn't super clear to me or I misunderstood; it seems like the "this is because" part is not explaining why we want to use aws account ID, but rather to explain we can configure region and other stuff when we need to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, please read if the new version is easier to understand.

site/docs/aws.md Outdated
To enable server side encryption, use the following configuration properties:

* `s3fileio.sse.type`: `none`, `s3`, `kms` or `custom`, default to `none`
* `s3fileio.sse.key`: a KMS Key ID or ARN for `kms` type (default to `aws/s3`), or a custom base-64 AES256 symmetric key for `custom` type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the current code base it seems like we are not defaulting it to anything, is this aws/s3 a default value on AWS SDK or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is the default at service side.

site/docs/aws.md Outdated

## Run Iceberg on AWS

[Amazon EMR](https://aws.amazon.com/emr/) is the most common platform to run Iceberg on AWS. EMR can provision clusters with [Spark](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html) (EMR 6 for Spark 3, EMR 5 for Spark 2),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: did you have to pull down the aws sdk bundle to run sparksql in EMR? IIRC there's already some preinstalled aws libraries on EMR, if that's true we might be able to mention them here to save one step for the users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The v2 library is not bundled in spark.

To store data in a different local or cloud store, Glue catalog can switch to use `HadoopFileIO`
or any custom FileIO using the mechanism described in the [custom FileIO](../custom-catalog/#custom-file-io-implementation) section.

## S3 FileIO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to ask the rationalization of S3 FileIO compared to Hadoop filesystem API with S3 support in #1945, but this section covers it. Thanks!

Probably worth to also mention whether Hadoop FS API with S3 is sufficient to work with, or S3 FileIO is required to avoid consistency glitches. That would help end users to determine whether including aws module is a kind of requirement for dealing with S3 or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated strong consistency section for more details, please let me know if it is enough.

site/docs/aws.md Outdated
This provides maximized upload speed and minimized local disk usage during uploads.
Here are the configurations user can tune related to this feature:

* `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared pool across all output streams)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a default here? Or is there no default and setting this is required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default to Runtime.getRuntime().availableProcessors(), let me add in the doc

site/docs/aws.md Outdated

* `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared pool across all output streams)
* `s3fileio.multipart.part.size`: the size of a single part for multipart upload requests, default to 32MB
* `s3fileio.multipart.threshold`: the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload, default to 1.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above defaults to or default is

site/docs/aws.md Outdated
Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation.
When used, an Iceberg `Namespace` is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html),
an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html),
an Iceberg `Snapshot` is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? I thought that only one version of the table was kept and that it pointed to the Iceberg root metadata file. That file contains more than one snapshot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snapshot is probably a bad word, I will just use table version then.
But Glue also stores all historical table versions for each update.

site/docs/aws.md Outdated

Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation.
When used, an Iceberg `Namespace` is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html),
an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would drop references to Iceberg classes here. Most readers are going to care about tables through a SQL or DataFrame interface, not through Iceberg's API. The mapping to Glue database and table names is probably pretty clear.

an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html),
an Iceberg `Snapshot` is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion).
You can start using Glue catalog by specifying the `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`.
More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would definitely give a full example of using GlueCatalog here, in addition to linking to Spark and Flink pages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, there's one above. Maybe just point the reader to it.

You can start using Glue catalog by specifying the `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`.
More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs).

### Glue Catalog ID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding a table of configuration options instead of sections?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I am doing this is because the explanation is quite long, which makes the table look very messy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense here, but I would opt for tables in most other places. Part of the appeal is it forces you to be concise.

site/docs/aws.md Outdated

By default, Glue will store all the table versions created and user can rollback a table to any historical version if needed.
However, if you are streaming data to Iceberg, this will easily create a lot of Glue table versions.
Therefore, it is recommended to turn off the archive feature in Glue by setting `gluecatalog.skip-archive` to false.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a bit cleaner if these catalog options were shorter. this could simply be skip-archive in the catalog config. Similarly, the one above could be catalog-id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly trying to be consistent with the s3fileio properties by having gluecatalog as the config namespace. We might be able to make all of them shorter, something like glue.id and s3.staging.dir. Any thoughts on this? @danielcweeks

site/docs/aws.md Outdated
* `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared across all output streams), defaults to the available number of processors in the system
* `s3fileio.multipart.part.size`: the size of a single part for multipart upload requests, defaults to 32MB
* `s3fileio.multipart.threshold`: the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload, defaults to 1.5
* `s3fileio.staging.dir`: the directory to hold temporary files, defaults to Java's `java.io.tmpdir` property value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great info here, but I think it may be easier to maintain as a table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switched to table layout.

site/docs/aws.md Outdated
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-key-prefix \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the Glue catalog support custom database locations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, let me add that section

@rdblue
Copy link
Contributor

rdblue commented Dec 22, 2020

@jackye1995 this is great! I noted a few things, but mostly what I would change is organizing the configuration into tables. The text describing all of the features is clear and really helpful.

@jackye1995 jackye1995 marked this pull request as ready for review January 7, 2021 23:24
@jackye1995
Copy link
Contributor Author

@rdblue there are too many tabs currently on Iceberg website, and adding the additional tab for AWS will make the UI ugly. So I moved Flink, Trino and Hive all under an engine's tab, and added a new tab called Integrations that will have topics like AWS, Nessie and JDBC.

site/docs/aws.md Outdated
| Property | Default | Description |
| --------------------------------- | -------------------------------------------------- | ------------------------------------------------------ |
| s3fileio.multipart.num-threads | the available number of processors in the system | number of threads to use for uploading parts to S3 (shared across all output streams) |
| s3fileio.multipart.part.size | 32MB | the size of a single part for multipart upload requests |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we released this yet? We would normally make it s3fileio.multipart.part-size-bytes:

  • part is part of "part size" and isn't really part of the hierarchy
  • We prefer to have clear units

site/docs/aws.md Outdated
| s3fileio.multipart.num-threads | the available number of processors in the system | number of threads to use for uploading parts to S3 (shared across all output streams) |
| s3fileio.multipart.part.size | 32MB | the size of a single part for multipart upload requests |
| s3fileio.multipart.threshold | 1.5 | the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload |
| s3fileio.staging.dir | `java.io.tmpdir` property value | the directory to hold temporary files |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything else under staging, or should this be staging-dir?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are multiple discussions around the config key names, and these names for s3fileio are not designed by me but Daniel. Let me put up another thread for this discussion before release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use #2050 for the discussion


| Property | Default | Description |
| --------------------------------- | ------------------ | ------------------------------------------------------ |
| lock.impl | null | a custom implementation of the lock manager, the actual interface depends on the catalog used |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other implementation class properties are something-impl. Shouldn't this be lock-impl instead to match those? Then the other properties are in the lock namespace.

| Property | Default | Description |
| --------------------------------- | ------------------ | ------------------------------------------------------ |
| lock.impl | null | a custom implementation of the lock manager, the actual interface depends on the catalog used |
| lock.table | null | an optional auxiliary table for locking |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the table required if you're using Dynamo? I would not say it is optional, although it is optional if you're not using Dynamo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is not a part of AWS documentation, that is why I say it as optional, I can remove that. I will emphasize it is required in the AWS doc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. You're right that it is optional here. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may also want to link to the Dynamo docs if you haven't already, since that's the only implementation currently available.

site/mkdocs.yml Outdated
- Spark:
- Getting Started: getting-started.md

- Engines:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Spark page is getting long enough that I need to break it into separate pages. That's why I reorganized. I like having an Integrations tab, though.

I wonder if we can get rid of forward/back instead. And maybe move "About" into the "Project" list.

@rdblue
Copy link
Contributor

rdblue commented Jan 8, 2021

@jackye1995, we can save some space by getting rid of the next/previous links and by moving "About" into "Project":

diff --git a/site/docs/css/extra.css b/site/docs/css/extra.css
index 4fa7c8036..3d79de02b 100644
--- a/site/docs/css/extra.css
+++ b/site/docs/css/extra.css
@@ -28,6 +28,10 @@
   float: left;
 }
 
+.navbar-right {
+  display: none;
+}
+
 .navbar-brand {
   margin-right: 1em;
 }
diff --git a/site/mkdocs.yml b/site/mkdocs.yml
index 1ce05135b..209495896 100644
--- a/site/mkdocs.yml
+++ b/site/mkdocs.yml
@@ -40,8 +40,8 @@ markdown_extensions:
   - admonition
   - pymdownx.tilde
 nav:
-  - About: index.md
-  - Project:
+  - About:
+    - Project: index.md
     - Community: community.md
     - Releases: releases.md
     - Trademarks: trademarks.md

@jackye1995
Copy link
Contributor Author

@jackye1995, we can save some space by getting rid of the next/previous links and by moving "About" into "Project":

Thanks for the code reference, I just updated the tabs based on that

site/docs/aws.md Outdated
User can choose the ACL level by setting the `s3.acl` property.
For more details, please read [S3 ACL Documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html).

### ObjectStoreLocationProvider
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using a description here rather than a class name, like "Object store file layout" or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's better, let me also remove other class names in titles

data string,
category string)
USING iceberg
OPTIONS ('location'='s3://my-special-table-bucket')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also use the LOCATION DDL clause

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah I completely forgot that

site/docs/aws.md Outdated
Many organizations have customized their way of configuring AWS clients with their own credential provider, access proxy, retry strategy, etc.
Iceberg allows users to plug in their own implementation of `org.apache.iceberg.aws.AwsClientFactory` by setting the `client.factory` catalog property.

### AssumeRoleAwsClientFactory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better heading for this as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I just updated

@rdblue rdblue merged commit 674c9b6 into apache:master Jan 9, 2021
XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants