feat: add emr cross-account access to iceberg tables
1 parent b6cbc6e, commit 56713a2
Showing 1 changed file with 37 additions and 0 deletions.
@@ -0,0 +1,37 @@
---
layout: post
title: EMR Cross Account Access to Iceberg Tables
subtitle: Learn how to access Iceberg tables from EMR in another account
tags: [blog]
comments: false
---

A few weeks ago, I was working on a project where I had to access Iceberg tables from an EMR cluster in another AWS account. It was trickier to set up than I expected, so I wrote this post to help others facing the same issue.

If you follow the EMR documentation on accessing Iceberg tables, you'll find the following recommended `spark-submit` parameters:

```bash
--conf spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.<YOUR_CATALOG_NAME_HERE>=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.<YOUR_CATALOG_NAME_HERE>.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.<YOUR_CATALOG_NAME_HERE>.warehouse=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```

After setting those parameters and configuring cross-account access on the AWS Glue Data Catalog in the account where the Iceberg table lives, you'll still get an error similar to `org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table table_name. StorageDescriptor#InputFormat cannot be null for table: table_name(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)`.

Assuming the permissions are set up correctly, you can solve this by adding the following parameter. It points the catalog at the account that owns the table; without it, Iceberg's `GlueCatalog` defaults to the caller's own account ID:

```bash
--conf spark.sql.catalog.<YOUR_CATALOG_NAME_HERE>.glue.id=<ICEBERG_TABLE_ACCOUNT_ID>
```

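To see how the pieces fit together, here is a minimal sketch of a wrapper script that combines the documented parameters with the `glue.id` setting. The catalog name `my_catalog`, the account ID `111122223333`, and the job script `my_job.py` are hypothetical placeholders I made up for illustration, so swap in your own values. The script only prints the command; the real invocation is left commented out.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder values (assumptions for illustration); replace with your own.
CATALOG="my_catalog"
TABLE_ACCOUNT_ID="111122223333"   # account that owns the Glue catalog and the Iceberg table
WAREHOUSE="s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/"
JOB_SCRIPT="my_job.py"            # hypothetical PySpark job

# Keep every flag in an array so each value stays a single, safely quoted argument.
SPARK_CONF=(
  --conf "spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar"
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  --conf "spark.sql.catalog.${CATALOG}=org.apache.iceberg.spark.SparkCatalog"
  --conf "spark.sql.catalog.${CATALOG}.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog"
  --conf "spark.sql.catalog.${CATALOG}.warehouse=${WAREHOUSE}"
  --conf "spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
  # The cross-account piece: the ID of the account that owns the table.
  --conf "spark.sql.catalog.${CATALOG}.glue.id=${TABLE_ACCOUNT_ID}"
)

# Print the assembled command for review instead of running it.
echo "spark-submit ${SPARK_CONF[*]} ${JOB_SCRIPT}"

# On the EMR cluster, run it for real:
# spark-submit "${SPARK_CONF[@]}" "${JOB_SCRIPT}"
```

Keeping the account ID in a single variable makes it easy to point the same job at a different account later.
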
After this, you can query the table with Spark SQL:

```sql
SELECT *
FROM <YOUR_CATALOG_NAME_HERE>.<DATABASE>.<TABLE_NAME>
```

That's all for this post. I hope it helps!