Docs: Add Amazon EMR announcement #3976
site/docs/aws.md
Outdated
| [Hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html), [Flink](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html), | ||
| [Trino](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html) that can run Iceberg. | ||
| Recently, Amazon EMR 6.5.0 [announced](https://aws.amazon.com/about-aws/whats-new/2022/01/amazon-emr-supports-apache-iceberg/) support of Apache Iceberg's Spark 3 Runtime. |
Remove relative time in the doc; people might read it after many years. Let's say things in a more generic way, like “Amazon EMR added Apache Iceberg in distribution since 6.5.0.”
Agreed.
It might help to think of it this way: assume this addition to the docs might not get updated for years, but that people might still read it (we will almost certainly update it before that, but when choosing the language, think from that standpoint).
So qualifiers like “Recently” need to go (throughout). I think Jack’s phrasing makes the most sense here. If need be, I’d be happy to help with phrasing on other sections. But it should be treated like it could be read by somebody years later, when phrases like “the current release” and “recently” wouldn’t make sense.
If they add needed context, they should probably be replaced with more specific references to versions etc. In other places, they can probably just be dropped.
For example, “With the current release” is ambiguous. Thinking forward to 3 years from now, that phrasing will definitely make things more confusing.
Relative time should be avoided at all costs so the docs can better represent a snapshot in time. 🙂
Makes sense, I have pushed the changes.
site/docs/aws.md
Outdated
| Recently, Amazon EMR 6.5.0 [announced](https://aws.amazon.com/about-aws/whats-new/2022/01/amazon-emr-supports-apache-iceberg/) support of Apache Iceberg's Spark 3 Runtime. | ||
| With the current release, you can use Apache Spark 3.1.2 on EMR clusters with the Iceberg table format. Please refer the [official documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-create-cluster.html) to create a cluster with Iceberg installed. | ||
| You can use a [bootstrap action](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html) similar to the following to pre-install all necessary dependencies: |
I think this line also needs to be updated; it only applies to versions before 6.5.0.
Thanks, I have pushed the changes.
site/docs/aws.md
Outdated
| [Trino](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html) that can run Iceberg. | ||
| Recently, Amazon EMR 6.5.0 [announced](https://aws.amazon.com/about-aws/whats-new/2022/01/amazon-emr-supports-apache-iceberg/) support of Apache Iceberg's Spark 3 Runtime. | ||
| With the current release, you can use Apache Spark 3.1.2 on EMR clusters with the Iceberg table format. Please refer the [official documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-create-cluster.html) to create a cluster with Iceberg installed. |
So I understand this better myself, “the current release” is referring to EMR 6.5.0 (or EMR in general)?
I would try to leave the version compatibility information in the EMR docs personally. EMR has a large number of releases that each support a unique set of software. IIRC, there are really good docs for seeing what is supported where.
So I would be in favor of deferring to the EMR docs for the exact combination of dependencies supported etc.
Maybe: “Starting with EMR version 6.5.0, EMR clusters officially support the Apache Iceberg table format. EMR can be configured to have the necessary Apache Iceberg dependencies installed without requiring the user to use additional bootstrap actions to configure the cluster. Please refer to the official documentation on how to create a cluster with Iceberg installed.”
Additionally, links to some sort of version support / dependency matrix would go a long way. Given that this information is already well documented for EMR for many things, it seems best to keep that there if possible as the support matrix is quite large and changes over time.
Yes, the current release refers to EMR 6.5.0.
Thanks for the input! I have pushed the changes.
I don't see a version support / dependency matrix available in the EMR docs at this moment.
That’s ok. I know support is new, and I’m not sure how the EMR docs get maintained / updated, but it would be great if that information could be included with new releases, the way the choice of Spark / Flink versions etc. is included with different EMR releases (and I believe there’s a dependency matrix that goes back some number of versions).
In the mid to long term, I think that’s by far the best solution. But this makes sense for now.
Yes, absolutely! After some searching, I could find this dependency version matrix: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-650-release.html
The link appears to be for a specific release though.
site/docs/aws.md
Outdated
| You can use a [bootstrap action](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html) similar to the following to pre-install all necessary dependencies: | ||
| Amazon EMR [added Apache Iceberg](https://aws.amazon.com/about-aws/whats-new/2022/01/amazon-emr-supports-apache-iceberg/) in distribution since 6.5.0. | ||
| You can use Apache Spark 3.1.2 on EMR clusters with the Iceberg table format. Please refer the [official documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-create-cluster.html) to create a cluster with Iceberg installed. |
Is the information about Apache Spark 3.1.2 included in the link? Or is there some sort of version compatibility matrix (or one that can be made for use eventually)?
Specifying Spark version 3.1.2 is likely to leave readers with more questions about Spark 3.2 etc. That will be especially amplified in, say, 6 months, when EMR will likely support Spark 3.2.x with some version of Iceberg (just a guess, but it seems reasonable).
That’s one of my few remaining “relative time” related concerns.
Also, thanks for updating the docs @rajarshisarkar! I certainly don’t expect the docs to go 3 years without an update, but figured it might help as a way to think about it.
Agreed, I have removed the Spark version.
The official docs say that it's supported with "Apache Spark 3": https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg.html
jackye1995
left a comment
thanks, overall looks good to me!
kbendick
left a comment
Thank you so much for updating the docs!
Really appreciate it 🙂
* apache/iceberg#3723
* apache/iceberg#3732
* apache/iceberg#3749
* apache/iceberg#3766
* apache/iceberg#3787
* apache/iceberg#3796
* apache/iceberg#3809
* apache/iceberg#3820
* apache/iceberg#3878
* apache/iceberg#3890
* apache/iceberg#3892
* apache/iceberg#3944
* apache/iceberg#3976
* apache/iceberg#3993
* apache/iceberg#3996
* apache/iceberg#4008
* apache/iceberg#3758 and 3856
* apache/iceberg#3761
* apache/iceberg#2062
* apache/iceberg#3422
* remove restriction related to legacy parquet file list
Recently, Amazon EMR 6.5.0 announced support of Apache Iceberg's Spark 3 Runtime.
This PR updates the docs.