feat: Support preloading container images into Bottlerocket data volumes with Karpenter #587

Merged
merged 29 commits into from
Aug 21, 2024
49d81c7
docs: add getting started for README.md
lindarr915 Jul 10, 2024
e64d10e
feat: add guide of creating snapshots for preloading container images…
lindarr915 Jul 16, 2024
7f4a1ee
feat: add karpenter manifest to support
lindarr915 Jul 16, 2024
ad213f1
fix: remove nodeSelector & correct import_path
lindarr915 Jul 16, 2024
0a8ba73
chore: bump nvidia_device_plugin version to v0.15.1
lindarr915 Jul 16, 2024
95a6239
feat: add Karpenter ec2nodeclass with custom EBS snapshots with prelo…
lindarr915 Jul 16, 2024
5a6cece
Merge branch 'awslabs:main' into main
lindarr915 Jul 16, 2024
d732759
fix: add node labeling in Karpenter NodePool
lindarr915 Jul 17, 2024
63104c1
fix: Uncomment nodeSelector in RayService
lindarr915 Jul 17, 2024
d61353e
Merge branch 'awslabs:main' into main
lindarr915 Jul 17, 2024
fe9b815
Merge branch 'awslabs:main' into main
lindarr915 Jul 29, 2024
d0cd671
Merge branch 'awslabs:main' into main
lindarr915 Aug 2, 2024
035bc29
Bottlerocket cache container image (#1)
lindarr915 Aug 2, 2024
b92887f
fix: add __pycache__ path in .gitignore
lindarr915 Aug 13, 2024
f64376f
Merge branch 'main' of https://github.com/lindarr915/data-on-eks
lindarr915 Aug 13, 2024
bd63f28
fix: correcting the import path
lindarr915 Aug 13, 2024
de70d0a
feat: add websites docs for preload container images on bottlerocket …
lindarr915 Aug 13, 2024
40144eb
fix: update .gitignore
lindarr915 Aug 13, 2024
f111dcc
fix: remove unsued README.md docs
lindarr915 Aug 19, 2024
1d214c4
fix: move additional IAM policy to addon.tf
lindarr915 Aug 19, 2024
5658061
fix: update page title and move end-to-end example to stable diffusio…
lindarr915 Aug 19, 2024
c265ee6
fix: remove comments
lindarr915 Aug 19, 2024
c521992
docs: add preload container image
lindarr915 Aug 19, 2024
ddc1561
fix: add additional IAM policy statements for karpenter to launch fro…
lindarr915 Aug 19, 2024
4a517a1
fix: bump data on eks addons to 1.33 to support karpenter helm resour…
lindarr915 Aug 19, 2024
70423fe
Merge branch 'awslabs:main' into main
lindarr915 Aug 19, 2024
4ba1b2a
fixes for pre-commit
askulkarni2 Aug 20, 2024
f13f1d6
Merge remote-tracking branch 'upstream/main'
askulkarni2 Aug 20, 2024
86ac090
fix pre-commit on the merged main
askulkarni2 Aug 20, 2024
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -59,3 +59,6 @@ site

# node modules
node_modules
gen-ai/inference/stable-diffusion-rayserve-gpu/locust/__pycache__/*
website/package-lock.json
website/package.json
65 changes: 0 additions & 65 deletions ai-ml/jark-stack/terraform/README.md

This file was deleted.

53 changes: 50 additions & 3 deletions ai-ml/jark-stack/terraform/addons.tf
@@ -119,6 +119,9 @@ module "eks_blueprints_addons" {
chart_version = "0.37.0"
repository_username = data.aws_ecrpublic_authorization_token.token.user_name
repository_password = data.aws_ecrpublic_authorization_token.token.password
source_policy_documents = [
data.aws_iam_policy_document.karpenter_controller_policy.json
]
}

#---------------------------------------
@@ -145,9 +148,10 @@ module "eks_blueprints_addons" {
#---------------------------------------------------------------
# Data on EKS Kubernetes Addons
#---------------------------------------------------------------

module "data_addons" {
source = "aws-ia/eks-data-addons/aws"
version = "~> 1.31.4" # ensure to update this to the latest/desired version
version = "~> 1.33"

oidc_provider_arn = module.eks.oidc_provider_arn

@@ -182,7 +186,7 @@ module "data_addons" {
#---------------------------------------------------------------
enable_nvidia_device_plugin = true
nvidia_device_plugin_helm_config = {
version = "v0.14.5"
version = "v0.16.1"
name = "nvidia-device-plugin"
values = [
<<-EOT
@@ -225,19 +229,35 @@ module "data_addons" {
#---------------------------------------------------------------
enable_karpenter_resources = true
karpenter_resources_helm_config = {

g5-gpu-karpenter = {
values = [
<<-EOT
name: g5-gpu-karpenter
clusterName: ${module.eks.cluster_name}
ec2NodeClass:
  amiFamily: Bottlerocket
  karpenterRole: ${split("/", module.eks_blueprints_addons.karpenter.node_iam_role_arn)[1]}
  subnetSelectorTerms:
    id: ${module.vpc.private_subnets[2]}
  securityGroupSelectorTerms:
    tags:
      Name: ${module.eks.cluster_name}-node
  instanceStorePolicy: RAID0
  blockDeviceMappings:
    # Root device
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
    # Data device: Container resources such as images and logs
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 300Gi
        volumeType: gp3
        encrypted: true
        ${var.bottlerocket_data_disk_snpashot_id != null ? "snapshotID: ${var.bottlerocket_data_disk_snpashot_id}" : ""}

nodePool:
labels:
@@ -276,13 +296,28 @@ module "data_addons" {
name: x86-cpu-karpenter
clusterName: ${module.eks.cluster_name}
ec2NodeClass:
  amiFamily: Bottlerocket
  karpenterRole: ${split("/", module.eks_blueprints_addons.karpenter.node_iam_role_arn)[1]}
  subnetSelectorTerms:
    id: ${module.vpc.private_subnets[3]}
  securityGroupSelectorTerms:
    tags:
      Name: ${module.eks.cluster_name}-node
  instanceStorePolicy: RAID0
  # instanceStorePolicy: RAID0
  blockDeviceMappings:
    # Root device
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
    # Data device: Container resources such as images and logs
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 300Gi
        volumeType: gp3
        encrypted: true
        ${var.bottlerocket_data_disk_snpashot_id != null ? "snapshotID: ${var.bottlerocket_data_disk_snpashot_id}" : ""}

nodePool:
labels:
@@ -352,3 +387,15 @@ resource "kubernetes_config_map_v1" "notebook" {
"dogbooth.ipynb" = file("${path.module}/src/notebook/dogbooth.ipynb")
}
}

data "aws_iam_policy_document" "karpenter_controller_policy" {
statement {
actions = [
"ec2:RunInstances",
"ec2:CreateLaunchTemplate",
]
resources = ["*"]
effect = "Allow"
sid = "KarpenterControllerAdditionalPolicy"
}
}
Empty file.
8 changes: 8 additions & 0 deletions ai-ml/jark-stack/terraform/variables.tf
@@ -50,3 +50,11 @@ variable "enable_kubecost" {
type = bool
default = false
}


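variable "bottlerocket_data_disk_snpashot_id" {
  description = "Bottlerocket data disk snapshot ID. Leave as null to launch nodes without a preloaded snapshot."
  type        = string
  # null (not "") so the `!= null` check in addons.tf skips the snapshotID line when unset
  default     = null
}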
@@ -0,0 +1,30 @@
from locust import HttpUser, task, between


class StableDiffusionUser(HttpUser):
    wait_time = between(1, 2)  # Wait 1-2 seconds between tasks

    @task
    def generate_image(self):
        prompt = "A beautiful sunset over the ocean"

        # Send the prompt as a query parameter; a GET request does not
        # need a JSON body or a Content-Type header.
        response = self.client.get("/imagine", params={"prompt": prompt})

        if response.status_code == 200:
            print(f"Generated image for prompt: {prompt}")
        else:
            print(f"Error generating image: {response.text}")

    # You can add more tasks here if needed
@@ -21,7 +21,7 @@ spec:
serveConfigV2: |
applications:
- name: stable-diffusion-deployment
import_path: "ray_serve_stablediffusion:entrypoint"
import_path: "ray_serve_sd:entrypoint"
route_prefix: "/"
runtime_env:
env_vars:
@@ -61,6 +61,7 @@ spec:
# For faster inference scaling, consider building a custom image with only your workload's essential dependencies.
# Smaller images lead to faster scaling, especially across multiple nodes.
# Notice that we are using the same image for both the head and worker nodes. You might hit ModuleNotFoundError if you use a different image for head and worker nodes.
# Preload the container image into the Bottlerocket data volume so that new Ray worker nodes start faster.
- name: head
image: public.ecr.aws/data-on-eks/ray2.11.0-py310-gpu-stablediffusion:latest
imagePullPolicy: IfNotPresent # Use the image already on the node (e.g. preloaded); pull only if it is missing
@@ -28,4 +28,3 @@ index cd29db7..6814348 100644
- choices=[8, 16, 32],
help='Token block size for contiguous chunks of '
'tokens.')

93 changes: 93 additions & 0 deletions website/docs/bestpractices/preload-container-images.md
@@ -0,0 +1,93 @@
---
title: Preload container images into data volumes
sidebar_position: 2
---
import CollapsibleContent from '../../src/components/CollapsibleContent';

# Preload container images into data volumes with EBS Snapshots

The purpose of this pattern is to reduce the cold start time of containers with large images by caching the images in the data volume of Bottlerocket OS.

Data analytics and machine learning workloads often require large container images (frequently several gigabytes), which can take several minutes to pull and extract from Amazon ECR or another image registry. Reducing image pull time is key to launching these containers faster.

Bottlerocket OS is a Linux-based open-source operating system built by AWS specifically for running containers. It has two volumes, an OS volume and a data volume; the data volume stores artifacts and container images. This sample pulls images onto the data volume and takes a snapshot of it for later use.

To demonstrate the process of caching images in EBS snapshots and launching them in an EKS cluster, this sample will use Amazon EKS optimized Bottlerocket AMIs.

For details, refer to the GitHub sample and blog post:
- [GitHub - Caching Container Images for AWS Bottlerocket Instances](https://github.com/aws-samples/bottlerocket-images-cache/tree/main)
- [Blog Post - Reduce container startup time on Amazon EKS with Bottlerocket data volume](https://aws.amazon.com/blogs/containers/reduce-container-startup-time-on-amazon-eks-with-bottlerocket-data-volume/)

## Overview of this script

![](img/bottlerocket-image-cache.png)

1. Launch an EC2 instance with the Bottlerocket for EKS AMI.
2. Access the instance via AWS Systems Manager.
3. Pull the images to be cached onto this instance using AWS Systems Manager Run Command.
4. Shut down the instance and create an EBS snapshot of the data volume.
5. Terminate the instance.
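
As a rough illustration of steps 4 and 5 only, the sketch below builds the AWS CLI invocations that would stop an instance, snapshot its data volume, and clean up. It is not what `snapshot.sh` does internally (the script's log shows it driving a CloudFormation stack); the function names and the volume ID are hypothetical, and nothing is executed here.

```python
import shlex
from typing import List


def stop_and_snapshot_commands(instance_id: str, data_volume_id: str,
                               region: str) -> List[List[str]]:
    """Step 4: stop the instance, wait until it is stopped, snapshot the data volume."""
    tail = ["--region", region]
    return [
        ["aws", "ec2", "stop-instances", "--instance-ids", instance_id] + tail,
        ["aws", "ec2", "wait", "instance-stopped", "--instance-ids", instance_id] + tail,
        ["aws", "ec2", "create-snapshot", "--volume-id", data_volume_id,
         "--description", "Bottlerocket data volume image cache"] + tail,
    ]


def cleanup_commands(instance_id: str, snapshot_id: str, region: str) -> List[List[str]]:
    """Step 5: wait for the snapshot to complete, then terminate the instance."""
    tail = ["--region", region]
    return [
        ["aws", "ec2", "wait", "snapshot-completed", "--snapshot-ids", snapshot_id] + tail,
        ["aws", "ec2", "terminate-instances", "--instance-ids", instance_id] + tail,
    ]


if __name__ == "__main__":
    # The volume ID below is a made-up placeholder.
    for cmd in stop_and_snapshot_commands("i-07d10182abc8a86e1",
                                          "vol-0123456789abcdef0", "us-west-2"):
        print(shlex.join(cmd))
```

Returning argument lists rather than shell strings keeps quoting explicit, so each command can be reviewed before it is passed to `subprocess.run`.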

## Usage Example

```
git clone https://github.com/aws-samples/bottlerocket-images-cache/
cd bottlerocket-images-cache/

# Run under nohup so the script keeps going if the terminal disconnects
❯ nohup ./snapshot.sh --snapshot-size 150 -r us-west-2 \
docker.io/rayproject/ray-ml:2.10.0-py310-gpu,public.ecr.aws/data-on-eks/ray2.11.0-py310-gpu-stablediffusion:latest &

❯ tail -f nohup.out

2024-07-15 17:18:53 I - [1/8] Deploying EC2 CFN stack ...
2024-07-15 17:22:07 I - [2/8] Launching SSM .
2024-07-15 17:22:08 I - SSM launched in instance i-07d10182abc8a86e1.
2024-07-15 17:22:08 I - [3/8] Stopping kubelet.service ..
2024-07-15 17:22:10 I - Kubelet service stopped.
2024-07-15 17:22:10 I - [4/8] Cleanup existing images ..
2024-07-15 17:22:12 I - Existing images cleaned
2024-07-15 17:22:12 I - [5/8] Pulling images:
2024-07-15 17:22:12 I - Pulling docker.io/rayproject/ray-ml:2.10.0-py310-gpu - amd64 ...
2024-07-15 17:27:50 I - docker.io/rayproject/ray-ml:2.10.0-py310-gpu - amd64 pulled.
2024-07-15 17:27:50 I - Pulling docker.io/rayproject/ray-ml:2.10.0-py310-gpu - arm64 ...
2024-07-15 17:27:58 I - docker.io/rayproject/ray-ml:2.10.0-py310-gpu - arm64 pulled.
2024-07-15 17:27:58 I - Pulling public.ecr.aws/data-on-eks/ray2.11.0-py310-gpu-stablediffusion:latest - amd64 ...
2024-07-15 17:31:34 I - public.ecr.aws/data-on-eks/ray2.11.0-py310-gpu-stablediffusion:latest - amd64 pulled.
2024-07-15 17:31:34 I - Pulling public.ecr.aws/data-on-eks/ray2.11.0-py310-gpu-stablediffusion:latest - arm64 ...
2024-07-15 17:31:36 I - public.ecr.aws/data-on-eks/ray2.11.0-py310-gpu-stablediffusion:latest - arm64 pulled.
2024-07-15 17:31:36 I - [6/8] Stopping instance ...
2024-07-15 17:32:25 I - Instance i-07d10182abc8a86e1 stopped
2024-07-15 17:32:25 I - [7/8] Creating snapshot ...
2024-07-15 17:38:36 I - Snapshot snap-0c6d965cf431785ed generated.
2024-07-15 17:38:36 I - [8/8] Cleanup.
2024-07-15 17:38:37 I - Stack deleted.
2024-07-15 17:38:37 I - --------------------------------------------------
2024-07-15 17:38:37 I - All done! Created snapshot in us-west-2: snap-0c6d965cf431785ed
```

Copy the snapshot ID (`snap-0c6d965cf431785ed` in this run) and configure it as the data volume snapshot for your worker nodes.
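
If you would rather not copy the ID by hand, a small helper can pull it out of `nohup.out`. This is a minimal sketch that assumes the log format shown above, where the snapshot ID appears as `snap-` followed by hexadecimal characters; the function name is hypothetical.

```python
import re
from typing import Optional

# EBS snapshot IDs look like "snap-" followed by hexadecimal characters.
SNAPSHOT_RE = re.compile(r"\bsnap-[0-9a-f]+\b")


def last_snapshot_id(log_text: str) -> Optional[str]:
    """Return the last snapshot ID mentioned in the log, or None if there is none."""
    matches = SNAPSHOT_RE.findall(log_text)
    return matches[-1] if matches else None


if __name__ == "__main__":
    with open("nohup.out", encoding="utf-8") as fh:
        print(last_snapshot_id(fh.read()))
```

Taking the last match rather than the first means the ID on the final "All done!" line wins if the log ever mentions more than one snapshot.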

## Using the snapshot with Amazon EKS and Karpenter

Specify `snapshotID` in the block device mapping for the data volume of a Karpenter `EC2NodeClass`:

```
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket # Ensure the OS is Bottlerocket
  blockDeviceMappings:
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 150Gi
        volumeType: gp3
        kmsKeyID: "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab" # Specify the KMS key ID if you use a custom KMS key
        snapshotID: snap-0123456789 # Specify your snapshot ID here
```
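
The Terraform change in this PR emits the `snapshotID` line only when a snapshot is configured. The same idea is sketched below in Python, with one extra guard: an EBS volume restored from a snapshot cannot be smaller than that snapshot, so the volume size should be at least the `--snapshot-size` used when building the cache. The function and parameter names are hypothetical, and the result is printed as JSON only to keep the example dependency-free.

```python
import json
from typing import Any, Dict, Optional


def data_volume_mapping(volume_gib: int,
                        snapshot_id: Optional[str] = None,
                        snapshot_gib: Optional[int] = None) -> Dict[str, Any]:
    """Build the /dev/xvdb block device mapping, adding snapshotID only when set."""
    if snapshot_id and snapshot_gib is not None and volume_gib < snapshot_gib:
        raise ValueError(
            f"volume ({volume_gib}Gi) is smaller than the snapshot ({snapshot_gib}Gi)"
        )
    ebs: Dict[str, Any] = {"volumeSize": f"{volume_gib}Gi", "volumeType": "gp3"}
    if snapshot_id:  # treats both None and "" as "no snapshot"
        ebs["snapshotID"] = snapshot_id
    return {"deviceName": "/dev/xvdb", "ebs": ebs}


if __name__ == "__main__":
    mapping = data_volume_mapping(150, "snap-0c6d965cf431785ed", snapshot_gib=150)
    print(json.dumps(mapping, indent=2))
```

Treating an empty string the same as `None` avoids rendering a dangling `snapshotID:` key when the ID has not been set.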

## End-to-end deployment example

An end-to-end deployment example can be found in [Stable Diffusion on GPU](../gen-ai/inference/stablediffusion-gpus).