-
Notifications
You must be signed in to change notification settings - Fork 3k
AWS: documentation page for AWS module #1891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| When this configurer is used, a STS client is initialized with default [credentials chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html) and [region chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html), | ||
| and all the other clients (Glue, DynamoDB, S3, etc.) will use the configured assume-role credential and region. | ||
|
|
||
| ## Run Iceberg on AWS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jackye1995 it looks great, this is more an question. is it also possible to run iceberg with AWS "Glue Job"? if we use spark.jars.packages to provide iceberg library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not tried that yet since I am mostly focused on Athena use case, but let me check if anyone has done that, thanks for bringing up this use case.
site/docs/aws.md
Outdated
| ### Glue Catalog ID | ||
|
|
||
| It is very common for an organization to store all the tables in a single Glue catalog in a single AWS account and run data computation in many different accounts. | ||
| In this case, you need to specify a Glue catalog ID when initializing `GlueCatalog`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For things like setting glue catalog id we probably want to mention the parameter "gluecatalog.id"; or overall I think we may want to add a section in the configuration page that include all configurable parameters mentioned in AwsProperties and link to that section in this page, otherwise people won't know how to configure them. And that would serve as a centralized place to read about all configurations. Is it part of the plan/covered by other PRs?
site/docs/aws.md
Outdated
| This is because in each AWS account, there is a single Glue catalog in each AWS region, | ||
| but the region is pre-determined by the Glue web client that is making the call. | ||
| If you would like to access a Glue catalog in a different region, you should configure you AWS client, see more details in [AWS client configuration](#aws-client-configurations). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I guess the logic here isn't super clear to me or I misunderstood; it seems like the "this is because" part is not explaining why we want to use aws account ID, but rather to explain we can configure region and other stuff when we need to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, please read if the new version is easier to understand.
site/docs/aws.md
Outdated
| To enable server side encryption, use the following configuration properties: | ||
|
|
||
| * `s3fileio.sse.type`: `none`, `s3`, `kms` or `custom`, default to `none` | ||
| * `s3fileio.sse.key`: a KMS Key ID or ARN for `kms` type (default to `aws/s3`), or a custom base-64 AES256 symmetric key for `custom` type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the current code base it seems like we are not defaulting it to anything, is this aws/s3 a default value on AWS SDK or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that is the default at service side.
site/docs/aws.md
Outdated
|
|
||
| ## Run Iceberg on AWS | ||
|
|
||
| [Amazon EMR](https://aws.amazon.com/emr/) is the most common platform to run Iceberg on AWS. EMR can provision clusters with [Spark](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html) (EMR 6 for Spark 3, EMR 5 for Spark 2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: did you have to pull down the aws sdk bundle to run sparksql in EMR? IIRC there's already some preinstalled aws libraries on EMR, if that's true we might be able to mention them here to save one step for the users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The v2 library is not bundled in spark.
| To store data in a different local or cloud store, Glue catalog can switch to use `HadoopFileIO` | ||
| or any custom FileIO using the mechanism described in the [custom FileIO](../custom-catalog/#custom-file-io-implementation) section. | ||
|
|
||
| ## S3 FileIO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was about to ask the rationalization of S3 FileIO compared to Hadoop filesystem API with S3 support in #1945, but this section covers it. Thanks!
Probably worth to also mention whether Hadoop FS API with S3 is sufficient to work with, or S3 FileIO is required to avoid consistency glitches. That would help end users to determine whether including aws module is a kind of requirement for dealing with S3 or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated strong consistency section for more details, please let me know if it is enough.
site/docs/aws.md
Outdated
| This provides maximized upload speed and minimized local disk usage during uploads. | ||
| Here are the configurations user can tune related to this feature: | ||
|
|
||
| * `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared pool across all output streams) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a default here? Or is there no default and setting this is required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default to Runtime.getRuntime().availableProcessors(), let me add in the doc
site/docs/aws.md
Outdated
|
|
||
| * `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared pool across all output streams) | ||
| * `s3fileio.multipart.part.size`: the size of a single part for multipart upload requests, default to 32MB | ||
| * `s3fileio.multipart.threshold`: the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload, default to 1.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above defaults to or default is
site/docs/aws.md
Outdated
| Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation. | ||
| When used, an Iceberg `Namespace` is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html), | ||
| an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html), | ||
| an Iceberg `Snapshot` is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true? I thought that only one version of the table was kept and that it pointed to the Iceberg root metadata file. That file contains more than one snapshot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Snapshot is probably a bad word, I will just use table version then.
But Glue also stores all historical table versions for each update.
site/docs/aws.md
Outdated
|
|
||
| Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation. | ||
| When used, an Iceberg `Namespace` is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html), | ||
| an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would drop references to Iceberg classes here. Most readers are going to care about tables through a SQL or DataFrame interface, not through Iceberg's API. The mapping to Glue database and table names is probably pretty clear.
| an Iceberg `Table` is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html), | ||
| an Iceberg `Snapshot` is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion). | ||
| You can start using Glue catalog by specifying the `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`. | ||
| More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would definitely give a full example of using GlueCatalog here, in addition to linking to Spark and Flink pages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, there's one above. Maybe just point the reader to it.
| You can start using Glue catalog by specifying the `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`. | ||
| More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs). | ||
|
|
||
| ### Glue Catalog ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about adding a table of configuration options instead of sections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I am doing this is because the explanation is quite long, which makes the table look very messy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense here, but I would opt for tables in most other places. Part of the appeal is it forces you to be concise.
site/docs/aws.md
Outdated
|
|
||
| By default, Glue will store all the table versions created and user can rollback a table to any historical version if needed. | ||
| However, if you are streaming data to Iceberg, this will easily create a lot of Glue table versions. | ||
| Therefore, it is recommended to turn off the archive feature in Glue by setting `gluecatalog.skip-archive` to false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be a bit cleaner if these catalog options were shorter. this could simply be skip-archive in the catalog config. Similarly, the one above could be catalog-id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly trying to be consistent with the s3fileio properties by having gluecatalog as the config namespace. We might be able to make all of them shorter, something like glue.id and s3.staging.dir. Any thoughts on this? @danielcweeks
site/docs/aws.md
Outdated
| * `s3fileio.multipart.num-threads`: number of threads to use for uploading parts to S3 (shared across all output streams), defaults to the available number of processors in the system | ||
| * `s3fileio.multipart.part.size`: the size of a single part for multipart upload requests, defaults to 32MB | ||
| * `s3fileio.multipart.threshold`: the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload, defaults to 1.5 | ||
| * `s3fileio.staging.dir`: the directory to hold temporary files, defaults to Java's `java.io.tmpdir` property value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great info here, but I think it may be easier to maintain as a table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switched to table layout.
site/docs/aws.md
Outdated
| spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,software.amazon.awssdk:bundle:2.15.40 \ | ||
| --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ | ||
| --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \ | ||
| --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-key-prefix \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the Glue catalog support custom database locations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, let me add that section
|
@jackye1995 this is great! I noted a few things, but mostly what I would change is organizing the configuration into tables. The text describing all of the features is clear and really helpful. |
|
@rdblue there are too many tabs currently on Iceberg website, and adding the additional tab for AWS will make the UI ugly. So I moved Flink, Trino and Hive all under an engine's tab, and added a new tab called Integrations that will have topics like AWS, Nessie and JDBC. |
site/docs/aws.md
Outdated
| | Property | Default | Description | | ||
| | --------------------------------- | -------------------------------------------------- | ------------------------------------------------------ | | ||
| | s3fileio.multipart.num-threads | the available number of processors in the system | number of threads to use for uploading parts to S3 (shared across all output streams) | | ||
| | s3fileio.multipart.part.size | 32MB | the size of a single part for multipart upload requests | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we released this yet? We would normally make it s3fileio.multipart.part-size-bytes:
partis part of "part size" and isn't really part of the hierarchy- We prefer to have clear units
site/docs/aws.md
Outdated
| | s3fileio.multipart.num-threads | the available number of processors in the system | number of threads to use for uploading parts to S3 (shared across all output streams) | | ||
| | s3fileio.multipart.part.size | 32MB | the size of a single part for multipart upload requests | | ||
| | s3fileio.multipart.threshold | 1.5 | the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload | | ||
| | s3fileio.staging.dir | `java.io.tmpdir` property value | the directory to hold temporary files | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything else under staging, or should this be staging-dir?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are multiple discussions around the config key names, and these names for s3fileio are not designed by me but Daniel. Let me put up another thread for this discussion before release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use #2050 for the discussion
site/docs/configuration.md
Outdated
|
|
||
| | Property | Default | Description | | ||
| | --------------------------------- | ------------------ | ------------------------------------------------------ | | ||
| | lock.impl | null | a custom implementation of the lock manager, the actual interface depends on the catalog used | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other implementation class properties are something-impl. Shouldn't this be lock-impl instead to match those? Then the other properties are in the lock namespace.
site/docs/configuration.md
Outdated
| | Property | Default | Description | | ||
| | --------------------------------- | ------------------ | ------------------------------------------------------ | | ||
| | lock.impl | null | a custom implementation of the lock manager, the actual interface depends on the catalog used | | ||
| | lock.table | null | an optional auxiliary table for locking | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the table required if you're using Dynamo? I would not say it is optional, although it is optional if you're not using Dynamo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is not a part of AWS documentation, that is why I say it as optional, I can remove that. I will emphasize it is required in the AWS doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out. You're right that it is optional here. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may also want to link to the Dynamo docs if you haven't already, since that's the only implementation currently available.
site/mkdocs.yml
Outdated
| - Spark: | ||
| - Getting Started: getting-started.md | ||
|
|
||
| - Engines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Spark page is getting long enough that I need to break it into separate pages. That's why I reorganized. I like having an Integrations tab, though.
I wonder if we can get rid of forward/back instead. And maybe move "About" into the "Project" list.
|
@jackye1995, we can save some space by getting rid of the next/previous links and by moving "About" into "Project": diff --git a/site/docs/css/extra.css b/site/docs/css/extra.css
index 4fa7c8036..3d79de02b 100644
--- a/site/docs/css/extra.css
+++ b/site/docs/css/extra.css
@@ -28,6 +28,10 @@
float: left;
}
+.navbar-right {
+ display: none;
+}
+
.navbar-brand {
margin-right: 1em;
}
diff --git a/site/mkdocs.yml b/site/mkdocs.yml
index 1ce05135b..209495896 100644
--- a/site/mkdocs.yml
+++ b/site/mkdocs.yml
@@ -40,8 +40,8 @@ markdown_extensions:
- admonition
- pymdownx.tilde
nav:
- - About: index.md
- - Project:
+ - About:
+ - Project: index.md
- Community: community.md
- Releases: releases.md
- Trademarks: trademarks.md |
Thanks for the code reference, I just updated the tabs based on that |
site/docs/aws.md
Outdated
| User can choose the ACL level by setting the `s3.acl` property. | ||
| For more details, please read [S3 ACL Documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html). | ||
|
|
||
| ### ObjectStoreLocationProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using a description here rather than a class name, like "Object store file layout" or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's better, let me also remove other class names in titles
| data string, | ||
| category string) | ||
| USING iceberg | ||
| OPTIONS ('location'='s3://my-special-table-bucket') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also use the LOCATION DDL clause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah I completely forgot that
site/docs/aws.md
Outdated
| Many organizations have customized their way of configuring AWS clients with their own credential provider, access proxy, retry strategy, etc. | ||
| Iceberg allows users to plug in their own implementation of `org.apache.iceberg.aws.AwsClientFactory` by setting the `client.factory` catalog property. | ||
|
|
||
| ### AssumeRoleAwsClientFactory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better heading for this as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I just updated
Initial draft for documentation page of the AWS module, I will mark this as draft until #1823 and #1844 are merged.
Also, I am currently placing it under
Tablestab, which is clearly wrong. I was thinking about adding a new tabIntegrationsfor all integrations including AWS and Nessie, but the navbar becomes too long. I am thinking about moving Flink, Hive and Presto all to a singleEnginestab, any thoughts on that?