feat: Add cloudwatch eks add on with enhanced monitoring for neuron (#…
ratnopamc authored Sep 18, 2024
1 parent a931290 commit 5b1788b
Showing 5 changed files with 87 additions and 9 deletions.
2 changes: 2 additions & 0 deletions ai-ml/nvidia-triton-server/README.md
@@ -43,6 +43,8 @@
| Name | Type |
|------|------|
| [aws_iam_policy.triton](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
| [aws_iam_role.cloudwatch_observability_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
| [aws_iam_role_policy_attachment.cloudwatch_observability_policy_attachment](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
| [aws_secretsmanager_secret.grafana](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/secretsmanager_secret) | resource |
| [aws_secretsmanager_secret_version.grafana](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/secretsmanager_secret_version) | resource |
| [helm_release.nim_llm](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release) | resource |
37 changes: 37 additions & 0 deletions ai-ml/nvidia-triton-server/addons.tf
@@ -76,6 +76,11 @@ module "eks_blueprints_addons" {
    kube-proxy = {}
    # VPC CNI uses worker node IAM role policies
    vpc-cni = {}

    amazon-cloudwatch-observability = {
      preserve                 = true
      service_account_role_arn = aws_iam_role.cloudwatch_observability_role.arn
    }
  }

#---------------------------------------
@@ -290,6 +295,38 @@ module "data_addons" {
}
}

#---------------------------------------------------------------
# IAM Role for Amazon CloudWatch Observability
#---------------------------------------------------------------
resource "aws_iam_role" "cloudwatch_observability_role" {
  name_prefix = format("%s-%s", local.name, "cloudwatch-agent")
  description = "The IAM role for amazon-cloudwatch-observability addon"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRoleWithWebIdentity"
        Effect = "Allow"
        Principal = {
          Federated = module.eks.oidc_provider_arn
        }
        Condition = {
          StringEquals = {
            "${replace(module.eks.cluster_oidc_issuer_url, "https://", "")}:sub" : "system:serviceaccount:amazon-cloudwatch:cloudwatch-agent",
            "${replace(module.eks.cluster_oidc_issuer_url, "https://", "")}:aud" : "sts.amazonaws.com"
          }
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "cloudwatch_observability_policy_attachment" {
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
  role       = aws_iam_role.cloudwatch_observability_role.name
}
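
The trust policy above computes the OIDC issuer host twice with the same `replace()` call. As a sketch only, and not what this commit does, the same role could be written with an `aws_iam_policy_document` data source and a single local. The names `oidc_issuer_host` and `cloudwatch_observability_assume` are hypothetical; `module.eks` and `local.name` are assumed to come from the surrounding configuration, as in the original resource.

```hcl
# Hypothetical local: the issuer URL without its scheme, computed once.
locals {
  oidc_issuer_host = replace(module.eks.cluster_oidc_issuer_url, "https://", "")
}

# Trust policy that lets only the cloudwatch-agent service account in the
# amazon-cloudwatch namespace assume the role through the cluster OIDC provider.
data "aws_iam_policy_document" "cloudwatch_observability_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [module.eks.oidc_provider_arn]
    }

    # Restrict the token subject to the add-on's service account.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer_host}:sub"
      values   = ["system:serviceaccount:amazon-cloudwatch:cloudwatch-agent"]
    }

    # Restrict the token audience to AWS STS.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer_host}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "cloudwatch_observability_role" {
  name_prefix        = format("%s-%s", local.name, "cloudwatch-agent")
  description        = "The IAM role for amazon-cloudwatch-observability addon"
  assume_role_policy = data.aws_iam_policy_document.cloudwatch_observability_assume.json
}
```

This variant would replace the `aws_iam_role` resource rather than sit beside it, since both declare the same resource address. The policy attachment is unaffected either way.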

#---------------------------------------------------------------
# Grafana Admin credentials resources
# Login to AWS secrets manager with the same role as Terraform to extract the Grafana admin password with the secret name as "grafana"
44 changes: 36 additions & 8 deletions ai-ml/trainium-inferentia/addons.tf
@@ -80,6 +80,10 @@ module "eks_blueprints_addons" {
    eks-pod-identity-agent = {}
    kube-proxy             = {}
    vpc-cni                = {}
    amazon-cloudwatch-observability = {
      preserve                 = true
      service_account_role_arn = aws_iam_role.cloudwatch_observability_role.arn
    }
  }

#---------------------------------------
@@ -130,14 +134,6 @@ module "eks_blueprints_addons" {
repository_password = data.aws_ecrpublic_authorization_token.token.password
}

-#---------------------------------------
-# CloudWatch metrics for EKS
-#---------------------------------------
-enable_aws_cloudwatch_metrics = true
-aws_cloudwatch_metrics = {
-values = [templatefile("${path.module}/helm-values/aws-cloudwatch-metrics-values.yaml", {})]
-}
-
#---------------------------------------
# Enable FSx for Lustre CSI Driver
#---------------------------------------
@@ -438,6 +434,38 @@ module "eks_data_addons" {
}
}

#---------------------------------------------------------------
# IAM Role for Amazon CloudWatch Observability
#---------------------------------------------------------------
resource "aws_iam_role" "cloudwatch_observability_role" {
  name_prefix = format("%s-%s", local.name, "cloudwatch-agent")
  description = "The IAM role for amazon-cloudwatch-observability addon"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRoleWithWebIdentity"
        Effect = "Allow"
        Principal = {
          Federated = module.eks.oidc_provider_arn
        }
        Condition = {
          StringEquals = {
            "${replace(module.eks.cluster_oidc_issuer_url, "https://", "")}:sub" : "system:serviceaccount:amazon-cloudwatch:cloudwatch-agent",
            "${replace(module.eks.cluster_oidc_issuer_url, "https://", "")}:aud" : "sts.amazonaws.com"
          }
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "cloudwatch_observability_policy_attachment" {
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
  role       = aws_iam_role.cloudwatch_observability_role.name
}

#---------------------------------------------------------------
# ETCD for TorchX
#---------------------------------------------------------------
5 changes: 5 additions & 0 deletions website/docs/blueprints/ai-ml/trainium.md
@@ -118,6 +118,11 @@ kubectl get nodes # Output shows the EKS Managed Node group nodes

</CollapsibleContent>

### Observability with Amazon CloudWatch and Neuron Monitor

This blueprint deploys the Amazon CloudWatch Observability agent as an EKS managed add-on (`amazon-cloudwatch-observability`) to monitor containerized workloads. The add-on enables Container Insights, which tracks key performance metrics such as CPU and memory utilization. It also collects GPU metrics through NVIDIA's DCGM plugin for GPU workloads. For machine learning models running on AWS Inferentia or Trainium, the [Neuron Monitor plugin](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-user-guide) captures and reports Neuron-specific metrics.

All metrics, including Container Insights, GPU performance, and Neuron metrics, are sent to Amazon CloudWatch, where you can monitor and analyze them in real time. After the deployment completes, these metrics should be available in the CloudWatch console.
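
Once these metrics reach CloudWatch, they can also drive alarms. The following is a minimal sketch, not part of this blueprint, of a Terraform alarm on the Container Insights metric `node_cpu_utilization`. The resource name `node_cpu_high`, the threshold, and the evaluation window are illustrative assumptions; `module.eks.cluster_name` and `local.name` are assumed to be available from the blueprint's configuration.

```hcl
# Hypothetical alarm: fires when average node CPU utilization across the
# cluster stays above 80% for three consecutive 5-minute periods.
resource "aws_cloudwatch_metric_alarm" "node_cpu_high" {
  alarm_name        = "${local.name}-node-cpu-high"
  alarm_description = "Average node CPU utilization above 80% for 15 minutes"

  # Container Insights publishes its metrics under this CloudWatch namespace.
  namespace   = "ContainerInsights"
  metric_name = "node_cpu_utilization"
  statistic   = "Average"

  period              = 300 # seconds
  evaluation_periods  = 3
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80 # percent

  dimensions = {
    ClusterName = module.eks.cluster_name
  }
}
```

No `alarm_actions` are set here, so the alarm only changes state; attaching an SNS topic ARN would be the usual next step for notifications.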

### Distributed PyTorch Training on Trainium with TorchX and EKS

Original file line number Diff line number Diff line change
@@ -452,7 +452,13 @@ PASS: vLLM example

## Observability

-As part of this blueprint, we have also deployed the Kube Prometheus stack, which provides Prometheus server and Grafana deployments for monitoring and observability.
### Observability with Amazon CloudWatch and Neuron Monitor

This blueprint deploys the Amazon CloudWatch Observability agent as an EKS managed add-on (`amazon-cloudwatch-observability`) to monitor containerized workloads. The add-on enables Container Insights, which tracks key performance metrics such as CPU and memory utilization. It also collects GPU metrics through NVIDIA's DCGM plugin for GPU workloads. For machine learning models running on AWS Inferentia or Trainium, the [Neuron Monitor plugin](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-user-guide) captures and reports Neuron-specific metrics.

All metrics, including Container Insights, GPU performance, and Neuron metrics, are sent to Amazon CloudWatch, where you can monitor and analyze them in real time. After the deployment completes, these metrics should be available in the CloudWatch console.

In addition to the CloudWatch EKS add-on, this blueprint deploys the Kube Prometheus stack, which provides Prometheus server and Grafana deployments for monitoring and observability.

First, let's verify the services deployed by the Kube Prometheus stack:
