feat: Add an example of using S3 tables with jupyter hub (#723)
Signed-off-by: Manabu McCloskey <[email protected]>
nabuskey authored Jan 14, 2025
1 parent 5769ed4 commit 41d6899
Showing 12 changed files with 517 additions and 167 deletions.
5 changes: 5 additions & 0 deletions analytics/terraform/spark-k8s-operator/README.md
@@ -31,6 +31,7 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
| <a name="module_eks"></a> [eks](#module\_eks) | terraform-aws-modules/eks/aws | ~> 20.26 |
| <a name="module_eks_blueprints_addons"></a> [eks\_blueprints\_addons](#module\_eks\_blueprints\_addons) | aws-ia/eks-blueprints-addons/aws | ~> 1.2 |
| <a name="module_eks_data_addons"></a> [eks\_data\_addons](#module\_eks\_data\_addons) | aws-ia/eks-data-addons/aws | 1.34 |
| <a name="module_jupyterhub_single_user_irsa"></a> [jupyterhub\_single\_user\_irsa](#module\_jupyterhub\_single\_user\_irsa) | terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks | ~> 5.52.0 |
| <a name="module_s3_bucket"></a> [s3\_bucket](#module\_s3\_bucket) | terraform-aws-modules/s3-bucket/aws | ~> 3.0 |
| <a name="module_spark_team_irsa"></a> [spark\_team\_irsa](#module\_spark\_team\_irsa) | aws-ia/eks-blueprints-addon/aws | ~> 1.0 |
| <a name="module_vpc"></a> [vpc](#module\_vpc) | terraform-aws-modules/vpc/aws | ~> 5.0 |
@@ -53,8 +54,11 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
| [kubernetes_annotations.gp2_default](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/annotations) | resource |
| [kubernetes_cluster_role.spark_role](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/cluster_role) | resource |
| [kubernetes_cluster_role_binding.spark_role_binding](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/cluster_role_binding) | resource |
| [kubernetes_namespace.jupyterhub](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/namespace) | resource |
| [kubernetes_namespace_v1.spark_team](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/namespace_v1) | resource |
| [kubernetes_secret_v1.jupyterhub_single_user](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/secret_v1) | resource |
| [kubernetes_secret_v1.spark_team](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/secret_v1) | resource |
| [kubernetes_service_account_v1.jupyterhub_single_user_sa](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/service_account_v1) | resource |
| [kubernetes_service_account_v1.spark_team](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/service_account_v1) | resource |
| [kubernetes_storage_class.ebs_csi_encrypted_gp3_storage_class](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/storage_class) | resource |
| [random_password.grafana](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/password) | resource |
@@ -77,6 +81,7 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
| <a name="input_eks_cluster_version"></a> [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.31"` | no |
| <a name="input_eks_data_plane_subnet_secondary_cidr"></a> [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` | <pre>[<br> "100.64.0.0/17",<br> "100.64.128.0/17"<br>]</pre> | no |
| <a name="input_enable_amazon_prometheus"></a> [enable\_amazon\_prometheus](#input\_enable\_amazon\_prometheus) | Enable AWS Managed Prometheus service | `bool` | `true` | no |
| <a name="input_enable_jupyterhub"></a> [enable\_jupyterhub](#input\_enable\_jupyterhub) | Enable Jupyter Hub | `bool` | `false` | no |
| <a name="input_enable_vpc_endpoints"></a> [enable\_vpc\_endpoints](#input\_enable\_vpc\_endpoints) | Enable VPC Endpoints | `bool` | `false` | no |
| <a name="input_enable_yunikorn"></a> [enable\_yunikorn](#input\_enable\_yunikorn) | Enable Apache YuniKorn Scheduler | `bool` | `false` | no |
| <a name="input_kms_key_admin_roles"></a> [kms\_key\_admin\_roles](#input\_kms\_key\_admin\_roles) | list of role ARNs to add to the KMS policy | `list(string)` | `[]` | no |
16 changes: 15 additions & 1 deletion analytics/terraform/spark-k8s-operator/addons.tf
@@ -424,6 +424,16 @@ module "eks_data_addons" {
    repository_password = data.aws_ecrpublic_authorization_token.token.password
  }

  #---------------------------------------------------------------
  # JupyterHub Add-on
  #---------------------------------------------------------------
  enable_jupyterhub = var.enable_jupyterhub
  jupyterhub_helm_config = {
    values = [templatefile("${path.module}/helm-values/jupyterhub-singleuser-values.yaml", {
      jupyter_single_user_sa_name = var.enable_jupyterhub ? kubernetes_service_account_v1.jupyterhub_single_user_sa[0].metadata[0].name : "not-used"
    })]
    version = "3.3.8"
  }
}

#---------------------------------------------------------------
@@ -648,6 +658,8 @@ resource "aws_secretsmanager_secret_version" "grafana" {

#---------------------------------------------------------------
# S3Table IAM policy for Karpenter nodes
# The S3 tables library does not fully support IRSA and Pod Identity as of this writing.
# We give the node role access to S3tables to work around this limitation.
#---------------------------------------------------------------
resource "aws_iam_policy" "s3tables_policy" {
  name_prefix = "${local.name}-s3tables"
@@ -665,7 +677,9 @@ resource "aws_iam_policy" "s3tables_policy" {
"s3tables:GetNamespace",
"s3tables:GetTableBucket",
"s3tables:GetTableBucketMaintenanceConfiguration",
"s3tables:GetTableBucketPolicy"
"s3tables:GetTableBucketPolicy",
"s3tables:CreateNamespace",
"s3tables:CreateTable"
]
Resource = "arn:aws:s3tables:*:${data.aws_caller_identity.current.account_id}:bucket/*"
},
@@ -0,0 +1,51 @@
#--------------------------------------------------------------------------------------------
# Dockerfile for Apache Spark 3.5.3 with S3A support on multi-arch platforms (AMD64 & ARM64)
#--------------------------------------------------------------------------------------------
# Step 1: Create a private or public ECR repo from the AWS Console or CLI
# e.g., aws ecr-public create-repository --repository-name spark --region us-east-1
#---
# Step 2: Log in to Docker:
# aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/<repoAlias>
#---
# Step 3: Build the multi-arch image and push it to ECR:
# docker buildx build --platform linux/amd64,linux/arm64 -t public.ecr.aws/<repoAlias>/spark:3.5.3-scala2.12-java17-python3-ubuntu --push .
#--------------------------------------------------------------------------------------------

# Use the official pyspark notebook base image
FROM quay.io/jupyter/pyspark-notebook:spark-3.5.3

# Arguments for version control
ARG HADOOP_VERSION=3.4.1
ARG PREV_HADOOP_VERSION=3.3.4
ARG AWS_SDK_VERSION=2.29.45
ARG ICEBERG_VERSION=1.6.1
ARG S3_TABLES_VERSION=0.1.3
ARG NOTEBOOK_USER=1000

# Set environment variables
ENV HADOOP_DIR=/usr/local/spark-3.5.3-bin-hadoop3

# Set up as root to install dependencies and tools
USER root

# Remove any old Hadoop libraries to avoid conflicts
RUN rm -f ${HADOOP_DIR}/jars/hadoop-client-* && \
    rm -f ${HADOOP_DIR}/jars/hadoop-yarn-server-web-proxy-*.jar

# Add Hadoop AWS connector and related Hadoop dependencies
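# Note: each download is saved (via wget -O) under the previous Hadoop
# version's filename, presumably so the jars removed above are replaced
# under the names the stock Spark 3.5.3 distribution shipped with.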
RUN cd ${HADOOP_DIR}/jars && \
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -O hadoop-aws-${PREV_HADOOP_VERSION}.jar && \
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client-api/${HADOOP_VERSION}/hadoop-client-api-${HADOOP_VERSION}.jar -O hadoop-client-api-${PREV_HADOOP_VERSION}.jar && \
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client-runtime/${HADOOP_VERSION}/hadoop-client-runtime-${HADOOP_VERSION}.jar -O hadoop-client-runtime-${PREV_HADOOP_VERSION}.jar && \
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/${HADOOP_VERSION}/hadoop-common-${HADOOP_VERSION}.jar -O hadoop-common-${PREV_HADOOP_VERSION}.jar && \
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-yarn-server-web-proxy/${HADOOP_VERSION}/hadoop-yarn-server-web-proxy-${HADOOP_VERSION}.jar -O hadoop-yarn-server-web-proxy-${PREV_HADOOP_VERSION}.jar

# Add Iceberg, AWS SDK bundle, and S3 Tables Catalog for Iceberg runtime
RUN cd ${HADOOP_DIR}/jars && \
    wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/${ICEBERG_VERSION}/iceberg-spark-runtime-3.5_2.12-${ICEBERG_VERSION}.jar && \
    wget https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/${AWS_SDK_VERSION}/bundle-${AWS_SDK_VERSION}.jar && \
    wget https://repo1.maven.org/maven2/software/amazon/s3tables/s3-tables-catalog-for-iceberg-runtime/${S3_TABLES_VERSION}/s3-tables-catalog-for-iceberg-runtime-${S3_TABLES_VERSION}.jar


# Switch to non-root user for security best practices
USER ${NOTEBOOK_USER}
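As a quick smoke test of the image, a notebook can create a SparkSession that registers an Iceberg catalog backed by the S3 Tables catalog jar installed above. A minimal sketch, assuming an arbitrary catalog name of `s3tablesbucket` and placeholder region, account, and bucket values:

```python
from pyspark.sql import SparkSession

# Register an Iceberg catalog whose implementation is the S3 Tables
# catalog installed into the Spark jars directory above.
spark = (
    SparkSession.builder.appName("s3tables-smoke-test")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.s3tablesbucket",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tablesbucket.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    # The warehouse is the table bucket ARN (placeholders below).
    .config("spark.sql.catalog.s3tablesbucket.warehouse",
            "arn:aws:s3tables:<REGION>:<ACCOUNT_ID>:bucket/<S3TABLE_BUCKET_NAME>")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.doeks_namespace")
spark.sql("SHOW TABLES IN s3tablesbucket.doeks_namespace").show()
```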
168 changes: 2 additions & 166 deletions analytics/terraform/spark-k8s-operator/examples/s3-tables/README.md
@@ -1,167 +1,3 @@
# S3Table with OSS Spark on EKS Guide
# S3Table with OSS Spark on EKS

This guide provides step-by-step instructions for setting up and running a Spark job on Amazon EKS using S3Table for data storage.

## Prerequisites

- Latest version of the AWS CLI installed (must include S3Tables API support)

## Step 1: Deploy Spark Cluster on EKS

Follow the steps below to deploy the Spark cluster on EKS:

[Spark Operator on EKS with YuniKorn Scheduler](https://awslabs.github.io/data-on-eks/docs/blueprints/data-analytics/spark-operator-yunikorn#prerequisites)

Once your cluster is up and running, proceed with the following steps to execute a sample Spark job using S3Tables.

## Step 2: Create Test Data for the job

Navigate to the example directory and generate the sample data:

```sh
cd analytics/terraform/spark-k8s-operator/examples/s3-tables
./input-data-gen.sh
```

This will create a file called `employee_data.csv` locally with 100 records. Modify the script to adjust the number of records as needed.
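If you would rather generate the data in Python, a rough equivalent is sketched below; the column names and value ranges are illustrative assumptions, not the actual schema defined in `input-data-gen.sh`.

```python
import csv
import random

# Illustrative only: input-data-gen.sh defines the real schema.
levels = ["junior", "mid", "senior", "exec"]
with open("employee_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "level", "salary"])  # assumed columns
    for i in range(1, 101):  # 100 records, matching the script's default
        writer.writerow([
            i,
            f"employee_{i}",
            random.choice(levels),
            round(random.uniform(50_000, 200_000), 2),
        ])
```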

## Step 3: Upload Test Input data to your S3 Bucket

Replace `<S3_BUCKET>` with the name of the S3 bucket created by your blueprint, then run the command below.

```sh
aws s3 cp employee_data.csv s3://<S3_BUCKET>/s3table-example/input/
```

## Step 4: Upload PySpark Script to S3 Bucket

Replace `<S3_BUCKET>` with the name of the S3 bucket created by your blueprint, then run the command below to upload the sample Spark job to the bucket.

```sh
aws s3 cp s3table-iceberg-pyspark.py s3://<S3_BUCKET>/s3table-example/scripts/
```

## Step 5: Create S3Table

Replace `<REGION>` and `<S3TABLE_BUCKET_NAME>` with your desired values.

```sh
aws s3tables create-table-bucket \
--region "<REGION>" \
--name "<S3TABLE_BUCKET_NAME>"
```

Make a note of the S3 Table bucket ARN returned by this command.
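If you prefer to script this step, a hedged boto3 sketch follows; it assumes a boto3 release recent enough to include the `s3tables` client, with the `arn` response key mirroring the CLI output.

```python
import boto3

# Assumes a boto3 version that ships the S3 Tables ("s3tables") client.
s3tables = boto3.client("s3tables", region_name="<REGION>")
response = s3tables.create_table_bucket(name="<S3TABLE_BUCKET_NAME>")
table_bucket_arn = response["arn"]  # keep this ARN for the steps below
print(table_bucket_arn)
```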

## Step 6: Update Spark Operator YAML File

- Open the `s3table-spark-operator.yaml` file in your preferred text editor.
- Replace `<S3_BUCKET>` with the S3 bucket created by this blueprint (check the Terraform outputs); this is the bucket where you uploaded the test data and sample Spark job in the steps above.
- Replace `<S3TABLE_ARN>` with your S3 Table bucket ARN.

## Step 7: Execute Spark Job

Apply the updated YAML file to your Kubernetes cluster to submit the Spark Job.

```sh
cd analytics/terraform/spark-k8s-operator/examples/s3-tables
kubectl apply -f s3table-spark-operator.yaml
```

## Step 8: Verify the Spark Driver log for the output

Check the Spark driver logs to verify job progress and output:

```sh
kubectl logs <spark-driver-pod-name> -n spark-team-a
```

## Step 9: Verify the S3Table using S3Table API

Use the S3Tables API to confirm the table was created successfully. Replace `<ACCOUNT_ID>`, then run the command.

```sh
aws s3tables get-table --table-bucket-arn arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table --namespace doeks_namespace --name employee_s3_table
```

The output should look like the following:

```json
{
"name": "employee_s3_table",
"type": "customer",
"tableARN": "arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table/table/55511111-7a03-4513-b921-e372b0030daf",
"namespace": [
"doeks_namespace"
],
"versionToken": "aafc39ddd462690d2a0c",
"metadataLocation": "s3://55511111-7a03-4513-bumiqc8ihp8rnxymuhyz8t1ammu7ausw2b--table-s3/metadata/00004-62cc4be3-59b5-4647-a78d-1cdf69ec5ed8.metadata.json",
"warehouseLocation": "s3://55511111-7a03-4513-bumiqc8ihp8rnxymuhyz8t1ammu7ausw2b--table-s3",
"createdAt": "2025-01-07T22:14:48.689581+00:00",
"createdBy": "<ACCOUNT_ID>",
"modifiedAt": "2025-01-09T00:06:09.222917+00:00",
"ownerAccountId": "<ACCOUNT_ID>",
"format": "ICEBERG"
}
```

Monitor the table maintenance job status:

```sh
aws s3tables get-table-maintenance-job-status --table-bucket-arn arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table --namespace doeks_namespace --name employee_s3_table
```

This command provides information about Iceberg compaction, snapshot management, and unreferenced file removal processes.

```json
{
"tableARN": "arn:aws:s3tables:us-west-2:<ACCOUNT_ID>:bucket/doeks-spark-on-eks-s3table/table/55511111-7a03-4513-b921-e372b0030daf",
"status": {
"icebergCompaction": {
"status": "Successful",
"lastRunTimestamp": "2025-01-08T01:18:08.857000+00:00"
},
"icebergSnapshotManagement": {
"status": "Successful",
"lastRunTimestamp": "2025-01-08T22:17:08.811000+00:00"
},
"icebergUnreferencedFileRemoval": {
"status": "Successful",
"lastRunTimestamp": "2025-01-08T22:17:10.377000+00:00"
}
}
}
```

## Step 10: Clean up

Set the `S3TABLE_ARN` environment variable to the ARN from Step 5, then delete the table.

```bash
aws s3tables delete-table \
--namespace doeks_namespace \
--table-bucket-arn ${S3TABLE_ARN} \
--name employee_s3_table
```

Delete the namespace.

```bash
aws s3tables delete-namespace \
--namespace doeks_namespace \
--table-bucket-arn ${S3TABLE_ARN}
```

Finally, delete the table bucket.

```bash
aws s3tables delete-table-bucket \
--region "<REGION>" \
--table-bucket-arn ${S3TABLE_ARN}
```
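The same teardown can be scripted; a hedged boto3 sketch (method and parameter names assume the S3 Tables API shape exposed by recent boto3 releases):

```python
import boto3

s3tables = boto3.client("s3tables", region_name="<REGION>")
table_bucket_arn = "<S3TABLE_ARN>"

# Delete in dependency order: table, then namespace, then the table bucket.
s3tables.delete_table(
    tableBucketARN=table_bucket_arn,
    namespace="doeks_namespace",
    name="employee_s3_table",
)
s3tables.delete_namespace(
    tableBucketARN=table_bucket_arn,
    namespace="doeks_namespace",
)
s3tables.delete_table_bucket(tableBucketARN=table_bucket_arn)
```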


## Conclusion
You have successfully set up and run a Spark job on Amazon EKS using S3Table for data storage. This setup provides a scalable and efficient way to process large datasets using Spark on Kubernetes with the added benefits of S3Table's data management capabilities.

For more advanced usage, refer to the official AWS documentation on S3Table and Spark on EKS.
**Please see [our website](https://awslabs.github.io/data-on-eks/docs/blueprints/data-analytics/spark-operator-s3tables) for details on how to use this example.**