feat: Add cloudwatch eks add on with enhanced monitoring for neuron #651

Merged 2 commits on Sep 18, 2024
2 changes: 2 additions & 0 deletions ai-ml/nvidia-triton-server/README.md
@@ -43,6 +43,8 @@
| Name | Type |
|------|------|
| [aws_iam_policy.triton](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_policy) | resource |
| [aws_iam_role.cloudwatch_observability_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
| [aws_iam_role_policy_attachment.cloudwatch_observability_policy_attachment](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
| [aws_secretsmanager_secret.grafana](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/secretsmanager_secret) | resource |
| [aws_secretsmanager_secret_version.grafana](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/secretsmanager_secret_version) | resource |
| [helm_release.nim_llm](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release) | resource |
37 changes: 37 additions & 0 deletions ai-ml/nvidia-triton-server/addons.tf
@@ -76,6 +76,11 @@ module "eks_blueprints_addons" {
    kube-proxy = {}
    # VPC CNI uses worker node IAM role policies
    vpc-cni = {}

    amazon-cloudwatch-observability = {
      preserve                 = true
      service_account_role_arn = aws_iam_role.cloudwatch_observability_role.arn
    }
  }

#---------------------------------------
@@ -290,6 +295,38 @@ module "data_addons" {
  }
}

#---------------------------------------------------------------
# IAM Role for Amazon CloudWatch Observability
#---------------------------------------------------------------
resource "aws_iam_role" "cloudwatch_observability_role" {
  name_prefix = format("%s-%s", local.name, "cloudwatch-agent")
  description = "The IAM role for amazon-cloudwatch-observability addon"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRoleWithWebIdentity"
        Effect = "Allow"
        Principal = {
          Federated = module.eks.oidc_provider_arn
        }
        Condition = {
          StringEquals = {
            "${replace(module.eks.cluster_oidc_issuer_url, "https://", "")}:sub" : "system:serviceaccount:amazon-cloudwatch:cloudwatch-agent",
            "${replace(module.eks.cluster_oidc_issuer_url, "https://", "")}:aud" : "sts.amazonaws.com"
          }
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "cloudwatch_observability_policy_attachment" {
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
  role       = aws_iam_role.cloudwatch_observability_role.name
}
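
The trust policy above repeats the `replace(...)` expression for each condition key. As a sketch only (not part of this PR), the same IRSA trust relationship could be expressed with the `aws_iam_policy_document` data source. The local names and the data source name `cloudwatch_observability_assume` are hypothetical, chosen for illustration.

```hcl
# Sketch only: an alternative way to write the same trust policy.
# The locals and the data source name below are hypothetical.
locals {
  # Issuer host/path without the scheme, as used in the condition keys
  oidc_issuer = replace(module.eks.cluster_oidc_issuer_url, "https://", "")

  # Namespace and service account taken from the `sub` condition in the PR
  cloudwatch_agent_namespace       = "amazon-cloudwatch"
  cloudwatch_agent_service_account = "cloudwatch-agent"
}

data "aws_iam_policy_document" "cloudwatch_observability_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [module.eks.oidc_provider_arn]
    }

    # Only the cloudwatch-agent service account may assume the role
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer}:sub"
      values   = ["system:serviceaccount:${local.cloudwatch_agent_namespace}:${local.cloudwatch_agent_service_account}"]
    }

    # The token must be issued for AWS STS
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}
```

With this variant, the role would set `assume_role_policy = data.aws_iam_policy_document.cloudwatch_observability_assume.json` in place of the `jsonencode(...)` call. Apart from how Terraform serializes it, the resulting policy should be equivalent.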

#---------------------------------------------------------------
# Grafana Admin credentials resources
# Login to AWS secrets manager with the same role as Terraform to extract the Grafana admin password with the secret name as "grafana"
44 changes: 36 additions & 8 deletions ai-ml/trainium-inferentia/addons.tf
@@ -80,6 +80,10 @@ module "eks_blueprints_addons" {
    eks-pod-identity-agent = {}
    kube-proxy = {}
    vpc-cni = {}
    amazon-cloudwatch-observability = {
      preserve                 = true
      service_account_role_arn = aws_iam_role.cloudwatch_observability_role.arn
    }
  }

#---------------------------------------
@@ -130,14 +134,6 @@ module "eks_blueprints_addons" {
    repository_password = data.aws_ecrpublic_authorization_token.token.password
  }

  #---------------------------------------
  # CloudWatch metrics for EKS
  #---------------------------------------
  enable_aws_cloudwatch_metrics = true
  aws_cloudwatch_metrics = {
    values = [templatefile("${path.module}/helm-values/aws-cloudwatch-metrics-values.yaml", {})]
  }

  #---------------------------------------
  # Enable FSx for Lustre CSI Driver
  #---------------------------------------
@@ -386,6 +382,38 @@ module "eks_data_addons" {
  }
}

#---------------------------------------------------------------
# IAM Role for Amazon CloudWatch Observability
#---------------------------------------------------------------
resource "aws_iam_role" "cloudwatch_observability_role" {
  name_prefix = format("%s-%s", local.name, "cloudwatch-agent")
  description = "The IAM role for amazon-cloudwatch-observability addon"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRoleWithWebIdentity"
        Effect = "Allow"
        Principal = {
          Federated = module.eks.oidc_provider_arn
        }
        Condition = {
          StringEquals = {
            "${replace(module.eks.cluster_oidc_issuer_url, "https://", "")}:sub" : "system:serviceaccount:amazon-cloudwatch:cloudwatch-agent",
            "${replace(module.eks.cluster_oidc_issuer_url, "https://", "")}:aud" : "sts.amazonaws.com"
          }
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "cloudwatch_observability_policy_attachment" {
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
  role       = aws_iam_role.cloudwatch_observability_role.name
}

#---------------------------------------------------------------
# ETCD for TorchX
#---------------------------------------------------------------
5 changes: 5 additions & 0 deletions website/docs/blueprints/ai-ml/trainium.md
@@ -118,6 +118,11 @@ kubectl get nodes # Output shows the EKS Managed Node group nodes

</CollapsibleContent>

### Observability with Amazon CloudWatch and Neuron Monitor

This blueprint deploys the Amazon CloudWatch Observability agent as an EKS managed add-on to monitor containerized workloads. The add-on enables Container Insights, which tracks key performance metrics such as CPU and memory utilization. It also collects GPU metrics through NVIDIA's DCGM plugin, which is essential for monitoring high-performance GPU workloads. For machine learning models running on AWS Inferentia or Trainium, the [Neuron Monitor plugin](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-user-guide) is added to capture and report Neuron-specific metrics.

All metrics, including Container Insights, GPU performance, and Neuron metrics, are sent to Amazon CloudWatch, where you can monitor and analyze them in real time. After the deployment completes, you can view these metrics directly in the CloudWatch console and use them to manage and optimize your workloads.
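
One way to confirm from Terraform that the add-on was installed is to read it back with the `aws_eks_addon` data source. The block below is a minimal sketch, not part of the blueprint. It assumes the EKS module exposes a `cluster_name` output (substitute your cluster's name otherwise), and the data source and output names are hypothetical.

```hcl
# Sketch only: read back the installed add-on after deployment.
# Assumption: module.eks exposes `cluster_name`; the names below are illustrative.
data "aws_eks_addon" "cloudwatch_observability" {
  cluster_name = module.eks.cluster_name
  addon_name   = "amazon-cloudwatch-observability"

  # Wait until the add-ons module has been applied
  depends_on = [module.eks_blueprints_addons]
}

output "cloudwatch_observability_addon_version" {
  description = "Version of the amazon-cloudwatch-observability add-on on the cluster"
  value       = data.aws_eks_addon.cloudwatch_observability.addon_version
}

output "cloudwatch_observability_role_arn" {
  description = "IAM role ARN bound to the add-on's service account"
  value       = data.aws_eks_addon.cloudwatch_observability.service_account_role_arn
}
```

If the second output matches the ARN of `aws_iam_role.cloudwatch_observability_role`, the agent is running with the role that carries `CloudWatchAgentServerPolicy`.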

### Distributed PyTorch Training on Trainium with TorchX and EKS

@@ -452,7 +452,13 @@ PASS: vLLM example

## Observability

As part of this blueprint, we have also deployed the Kube Prometheus stack, which provides Prometheus server and Grafana deployments for monitoring and observability.
### Observability with Amazon CloudWatch and Neuron Monitor

This blueprint deploys the Amazon CloudWatch Observability agent as an EKS managed add-on to monitor containerized workloads. The add-on enables Container Insights, which tracks key performance metrics such as CPU and memory utilization. It also collects GPU metrics through NVIDIA's DCGM plugin, which is essential for monitoring high-performance GPU workloads. For machine learning models running on AWS Inferentia or Trainium, the [Neuron Monitor plugin](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-user-guide) is added to capture and report Neuron-specific metrics.

All metrics, including Container Insights, GPU performance, and Neuron metrics, are sent to Amazon CloudWatch, where you can monitor and analyze them in real time. After the deployment completes, you can view these metrics directly in the CloudWatch console and use them to manage and optimize your workloads.

In addition to the CloudWatch EKS add-on, this blueprint also deploys the Kube Prometheus stack, which provides Prometheus server and Grafana deployments for monitoring and observability.

First, let's verify the services deployed by the Kube Prometheus stack:
