
Commit

Merge branch 'awslabs:main' into spark-examples-update
alanty authored Nov 5, 2024
2 parents 6b36ed6 + 0d09b9f commit 061879c
Showing 26 changed files with 236 additions and 657 deletions.
2 changes: 1 addition & 1 deletion ai-ml/emr-spark-rapids/README.md
@@ -61,7 +61,7 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
| <a name="input_enable_nvidia_gpu_operator"></a> [enable\_nvidia\_gpu\_operator](#input\_enable\_nvidia\_gpu\_operator) | Enable NVIDIA GPU Operator | `bool` | `false` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-spark-rapids"` | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br/> "100.64.0.0/16"<br/>]</pre> | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br> "100.64.0.0/16"<br>]</pre> | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/21"` | no |

4 changes: 2 additions & 2 deletions ai-ml/nvidia-triton-server/README.md
@@ -79,9 +79,9 @@
| <a name="input_huggingface_token"></a> [huggingface\_token](#input\_huggingface\_token) | Hugging Face Secret Token | `string` | `"DUMMY_TOKEN_REPLACE_ME"` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"nvidia-triton-server"` | no |
| <a name="input_ngc_api_key"></a> [ngc\_api\_key](#input\_ngc\_api\_key) | NGC API Key | `string` | `"DUMMY_NGC_API_KEY_REPLACE_ME"` | no |
| <a name="input_nim_models"></a> [nim\_models](#input\_nim\_models) | NVIDIA NIM Models | <pre>list(object({<br/> name = string<br/> id = string<br/> enable = bool<br/> num_gpu = string<br/> }))</pre> | <pre>[<br/> {<br/> "enable": false,<br/> "id": "nvcr.io/nim/meta/llama-3.1-8b-instruct",<br/> "name": "llama-3-1-8b-instruct",<br/> "num_gpu": "4"<br/> },<br/> {<br/> "enable": true,<br/> "id": "nvcr.io/nim/meta/llama3-8b-instruct",<br/> "name": "llama3-8b-instruct",<br/> "num_gpu": "1"<br/> }<br/>]</pre> | no |
| <a name="input_nim_models"></a> [nim\_models](#input\_nim\_models) | NVIDIA NIM Models | <pre>list(object({<br> name = string<br> id = string<br> enable = bool<br> num_gpu = string<br> }))</pre> | <pre>[<br> {<br> "enable": false,<br> "id": "nvcr.io/nim/meta/llama-3.1-8b-instruct",<br> "name": "llama-3-1-8b-instruct",<br> "num_gpu": "4"<br> },<br> {<br> "enable": true,<br> "id": "nvcr.io/nim/meta/llama3-8b-instruct",<br> "name": "llama3-8b-instruct",<br> "num_gpu": "1"<br> }<br>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | region | `string` | `"us-west-2"` | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br/> "100.64.0.0/16"<br/>]</pre> | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br> "100.64.0.0/16"<br>]</pre> | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/21"` | no |

## Outputs
4 changes: 2 additions & 2 deletions analytics/terraform/datahub-on-eks/README.md
@@ -46,8 +46,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
| <a name="input_enable_vpc_endpoints"></a> [enable\_vpc\_endpoints](#input\_enable\_vpc\_endpoints) | Enable VPC Endpoints | `bool` | `false` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"datahub-on-eks"` | no |
| <a name="input_private_subnet_ids"></a> [private\_subnet\_ids](#input\_private\_subnet\_ids) | Ids for existing private subnets - needed when create\_vpc set to false | `list(string)` | `[]` | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.0.0/17",<br/> "10.1.128.0/18"<br/>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.255.128/26",<br/> "10.1.255.192/26"<br/>]</pre> | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.0.0/17",<br> "10.1.128.0/18"<br>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.255.128/26",<br> "10.1.255.192/26"<br>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR - must change to match the cidr of the existing VPC if create\_vpc set to false | `string` | `"10.1.0.0/16"` | no |
4 changes: 2 additions & 2 deletions analytics/terraform/emr-eks-ack/README.md
@@ -54,8 +54,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
|------|-------------|------|---------|:--------:|
| <a name="input_eks_cluster_version"></a> [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.27"` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-eks-ack"` | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.0.0/17",<br/> "10.1.128.0/18"<br/>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.255.128/26",<br/> "10.1.255.192/26"<br/>]</pre> | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.0.0/17",<br> "10.1.128.0/18"<br>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.255.128/26",<br> "10.1.255.192/26"<br>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.1.0.0/16"` | no |
4 changes: 2 additions & 2 deletions analytics/terraform/emr-eks-fargate/README.md
@@ -49,8 +49,8 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
|------|-------------|------|---------|:--------:|
| <a name="input_eks_cluster_version"></a> [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.27"` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-eks-fargate"` | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.0.0/17",<br/> "10.1.128.0/18"<br/>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br/> "10.1.255.128/26",<br/> "10.1.255.192/26"<br/>]</pre> | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 32766 Subnet1 and 16382 Subnet2 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.0.0/17",<br> "10.1.128.0/18"<br>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet | `list(string)` | <pre>[<br> "10.1.255.128/26",<br> "10.1.255.192/26"<br>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR | `string` | `"10.1.0.0/16"` | no |
2 changes: 1 addition & 1 deletion analytics/terraform/emr-eks-karpenter/README.md
@@ -89,7 +89,7 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
| <a name="input_enable_yunikorn"></a> [enable\_yunikorn](#input\_enable\_yunikorn) | Enable Apache YuniKorn Scheduler | `bool` | `false` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"emr-eks-karpenter"` | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br/> "100.64.0.0/16"<br/>]</pre> | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br> "100.64.0.0/16"<br>]</pre> | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Default tags | `map(string)` | `{}` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/21"` | no |

10 changes: 6 additions & 4 deletions analytics/terraform/spark-k8s-operator/README.md
@@ -72,16 +72,18 @@ Checkout the [documentation website](https://awslabs.github.io/data-on-eks/docs/
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_eks_cluster_version"></a> [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS Cluster version | `string` | `"1.30"` | no |
| <a name="input_eks_data_plane_subnet_secondary_cidr"></a> [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` | <pre>[<br/> "100.64.0.0/17",<br/> "100.64.128.0/17"<br/>]</pre> | no |
| <a name="input_eks_data_plane_subnet_secondary_cidr"></a> [eks\_data\_plane\_subnet\_secondary\_cidr](#input\_eks\_data\_plane\_subnet\_secondary\_cidr) | Secondary CIDR blocks. 32766 IPs per Subnet per Subnet/AZ for EKS Node and Pods | `list(string)` | <pre>[<br> "100.64.0.0/17",<br> "100.64.128.0/17"<br>]</pre> | no |
| <a name="input_enable_amazon_prometheus"></a> [enable\_amazon\_prometheus](#input\_enable\_amazon\_prometheus) | Enable AWS Managed Prometheus service | `bool` | `true` | no |
| <a name="input_enable_vpc_endpoints"></a> [enable\_vpc\_endpoints](#input\_enable\_vpc\_endpoints) | Enable VPC Endpoints | `bool` | `false` | no |
| <a name="input_enable_yunikorn"></a> [enable\_yunikorn](#input\_enable\_yunikorn) | Enable Apache YuniKorn Scheduler | `bool` | `true` | no |
| <a name="input_kms_key_admin_roles"></a> [kms\_key\_admin\_roles](#input\_kms\_key\_admin\_roles) | list of role ARNs to add to the KMS policy | `list(string)` | `[]` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the VPC and EKS Cluster | `string` | `"spark-operator-doeks"` | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` | <pre>[<br/> "10.1.1.0/24",<br/> "10.1.2.0/24"<br/>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` | <pre>[<br/> "10.1.0.0/26",<br/> "10.1.0.64/26"<br/>]</pre> | no |
| <a name="input_private_subnets"></a> [private\_subnets](#input\_private\_subnets) | Private Subnets CIDRs. 254 IPs per Subnet/AZ for Private NAT + NLB + Airflow + EC2 Jumphost etc. | `list(string)` | <pre>[<br> "10.1.1.0/24",<br> "10.1.2.0/24"<br>]</pre> | no |
| <a name="input_public_subnets"></a> [public\_subnets](#input\_public\_subnets) | Public Subnets CIDRs. 62 IPs per Subnet/AZ | `list(string)` | <pre>[<br> "10.1.0.0/26",<br> "10.1.0.64/26"<br>]</pre> | no |
| <a name="input_region"></a> [region](#input\_region) | Region | `string` | `"us-west-2"` | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br/> "100.64.0.0/16"<br/>]</pre> | no |
| <a name="input_secondary_cidr_blocks"></a> [secondary\_cidr\_blocks](#input\_secondary\_cidr\_blocks) | Secondary CIDR blocks to be attached to VPC | `list(string)` | <pre>[<br> "100.64.0.0/16"<br>]</pre> | no |
| <a name="input_spark_benchmark_ssd_desired_size"></a> [spark\_benchmark\_ssd\_desired\_size](#input\_spark\_benchmark\_ssd\_desired\_size) | Desired size for nodegroup of c5d 12xlarge instances to run data generation for Spark benchmark | `number` | `0` | no |
| <a name="input_spark_benchmark_ssd_min_size"></a> [spark\_benchmark\_ssd\_min\_size](#input\_spark\_benchmark\_ssd\_min\_size) | Minimum size for nodegroup of c5d 12xlarge instances to run data generation for Spark benchmark | `number` | `0` | no |
| <a name="input_vpc_cidr"></a> [vpc\_cidr](#input\_vpc\_cidr) | VPC CIDR. This should be a valid private (RFC 1918) CIDR range | `string` | `"10.1.0.0/16"` | no |

## Outputs
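The two new inputs above (`spark_benchmark_ssd_min_size` and `spark_benchmark_ssd_desired_size`) default the benchmark node group to zero nodes. Based on the in-line comments in the eks.tf change that follows, a benchmark data-generation run would typically raise both to 6. A minimal, hypothetical terraform.tfvars sketch (the file name and values are assumptions, not part of this commit):

```hcl
# Hypothetical terraform.tfvars for a benchmark run.
# Variable names come from the Inputs table above; the value 6 follows the
# "Change min and desired to 6 for running benchmarks" comments in eks.tf.
spark_benchmark_ssd_min_size     = 6
spark_benchmark_ssd_desired_size = 6
```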
4 changes: 2 additions & 2 deletions analytics/terraform/spark-k8s-operator/eks.tf
@@ -175,9 +175,9 @@ module "eks" {
# Node group will be created with zero instances when you deploy the blueprint.
# You can change the min_size and desired_size to 6 instances
      # desired_size might not be applied through terraform once the node group is created so this needs to be adjusted in AWS Console.
min_size = 0 # Change min and desired to 6 for running benchmarks
min_size = var.spark_benchmark_ssd_min_size # Change min and desired to 6 for running benchmarks
max_size = 8
desired_size = 0 # Change min and desired to 6 for running benchmarks
desired_size = var.spark_benchmark_ssd_desired_size # Change min and desired to 6 for running benchmarks

instance_types = ["c5d.12xlarge"] # c5d.12xlarge = 2 x 900 NVMe SSD

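The declarations for the two variables referenced above are not part of the hunk shown here. A sketch of what they would look like, assuming a conventional variables.tf and taking the descriptions, types, and defaults from the spark-k8s-operator README table above:

```hcl
# Assumed variable declarations backing the node-group sizes above
# (placement in variables.tf is an assumption; values mirror the README Inputs table).
variable "spark_benchmark_ssd_min_size" {
  description = "Minimum size for nodegroup of c5d 12xlarge instances to run data generation for Spark benchmark"
  type        = number
  default     = 0
}

variable "spark_benchmark_ssd_desired_size" {
  description = "Desired size for nodegroup of c5d 12xlarge instances to run data generation for Spark benchmark"
  type        = number
  default     = 0
}
```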
@@ -1,8 +1,7 @@
# NOTE: This example requires the following prerequisites before executing the jobs
# 1. Ensure the spark-team-a namespace exists
# 2. replace <S3_BUCKET> with your bucket name
# 2. replace `<S3_BUCKET>` with your bucket name

---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
@@ -21,22 +20,22 @@ spec:
mainClass: com.amazonaws.eks.tpcds.DataGeneration
mainApplicationFile: local:///opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar
arguments:
# TPC-DS data location
- "s3a://<S3_BUCKET>/TPCDS-TEST-1TB"
# Path to kit in the docker image
- "/opt/tpcds-kit/tools"
# Data Format
- "parquet"
# Scale factor (in GB) - S3 output size shows 309.4GB for 1000GB Input
- "1000"
# Generate data num partitions
- "200"
# Create the partitioned fact tables
- "true"
# Shuffle to get partitions coalesced into single files.
- "true"
# Logging set to WARN
- "true"
# TPC-DS data location
- "s3a://<S3_BUCKET>/TPCDS-TEST-1TB"
# Path to kit in the docker image
- "/opt/tpcds-kit/tools"
# Data Format
- "parquet"
# Scale factor (in GB) - S3 output size shows 309.4GB for 1000GB Input
- "1000"
# Generate data num partitions
- "200"
# Create the partitioned fact tables
- "true"
# Shuffle to get partitions coalesced into single files.
- "true"
# Logging set to WARN
- "true"
sparkConf:
# Expose Spark metrics for Prometheus
"spark.ui.prometheus.enabled": "true"
@@ -82,7 +81,7 @@ spec:
"spark.hadoop.fs.s3a.connection.maximum": "200"
"spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2"
"spark.kubernetes.executor.podNamePrefix": "oss-data-gen"
"spark.sql.shuffle.partitions": "2000" # Adjust according to your job size
"spark.sql.shuffle.partitions": "2000" # Adjust according to your job size
# "spark.hadoop.fs.s3a.committer.staging.conflict-mode": "append"
# Data writing and shuffle tuning
"spark.shuffle.file.buffer": "1m"
@@ -111,47 +110,47 @@ spec:
securityContext:
runAsUser: 185
volumeMounts:
- name: spark-local-dir-1
mountPath: /data1
- name: spark-local-dir-1
mountPath: /data1
env:
- name: JAVA_HOME
value: "/opt/java/openjdk"
- name: JAVA_HOME
value: "/opt/java/openjdk"
initContainers:
- name: volume-permission
image: public.ecr.aws/docker/library/busybox
command: ['sh', '-c', 'mkdir -p /data1; chown -R 185:185 /data1']
volumeMounts:
- name: spark-local-dir-1
mountPath: /data1
- name: volume-permission
image: public.ecr.aws/docker/library/busybox
command: ['sh', '-c', 'mkdir -p /data1; chown -R 185:185 /data1']
volumeMounts:
- name: spark-local-dir-1
mountPath: /data1
nodeSelector:
NodeGroupType: SparkComputeOptimized
NodeGroupType: spark_benchmark_ssd
executor:
cores: 11
# The maximum memory size of the container to the running executor is determined by the sum of
# spark.executor.memoryoverHead, spark.executor.memory, spark.memory.offHeap.size, spark.executor.pyspark.memory
memory: "15g"
memoryOverhead: "4g"
instances: 26
instances: 22
serviceAccount: spark-team-a
securityContext:
runAsUser: 185
volumeMounts:
- name: spark-local-dir-1
mountPath: /data1
initContainers:
- name: volume-permission
image: public.ecr.aws/docker/library/busybox
command: ['sh', '-c', 'mkdir -p /data1; chown -R 185:185 /data1']
volumeMounts:
- name: spark-local-dir-1
mountPath: /data1
initContainers:
- name: volume-permission
image: public.ecr.aws/docker/library/busybox
command: ['sh', '-c', 'mkdir -p /data1; chown -R 185:185 /data1']
volumeMounts:
- name: spark-local-dir-1
mountPath: /data1
env:
- name: JAVA_HOME
value: "/opt/java/openjdk"
- name: JAVA_HOME
value: "/opt/java/openjdk"
nodeSelector:
NodeGroupType: SparkComputeOptimized
NodeGroupType: spark_benchmark_ssd
volumes:
- name: spark-local-dir-1
hostPath:
path: "/mnt/k8s-disks/0"
type: DirectoryOrCreate
- name: spark-local-dir-1
hostPath:
path: "/mnt/k8s-disks/0"
type: DirectoryOrCreate
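The nodeSelector in this manifest now targets `NodeGroupType: spark_benchmark_ssd` instead of `SparkComputeOptimized`, pairing the job with the c5d.12xlarge node group sized by the new variables in eks.tf. A rough sketch of how that node group might label its nodes so the selector matches; only the instance type and size settings appear in this commit, while the node group key and the labels block are assumptions used for illustration:

```hcl
# Sketch of an EKS managed node group entry for the benchmark nodes.
# instance_types and min/max/desired sizes are taken from the eks.tf hunk above;
# the "spark_benchmark_ssd" key and the labels block are assumptions.
spark_benchmark_ssd = {
  instance_types = ["c5d.12xlarge"] # c5d.12xlarge = 2 x 900 NVMe SSD

  min_size     = var.spark_benchmark_ssd_min_size
  max_size     = 8
  desired_size = var.spark_benchmark_ssd_desired_size

  labels = {
    NodeGroupType = "spark_benchmark_ssd" # must match the SparkApplication nodeSelector
  }
}
```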
