diff --git a/benchmarks/nixl/README.md b/benchmarks/nixl/README.md deleted file mode 100644 index 071992e4dc8a..000000000000 --- a/benchmarks/nixl/README.md +++ /dev/null @@ -1,32 +0,0 @@ -# NIXL Benchmark Technical Documentation (Kubernetes) - -This guide describes how to run the NIXL benchmark using the provided Docker image on a Kubernetes (K8s) cluster. - ---- - -## Prerequisites - -- A running Kubernetes cluster with access to NVIDIA GPUs (e.g., using NVIDIA GPU Operator or device plugin) -- `kubectl` configured to access your cluster -- deploy dynamo cloud in a namespace - ---- - -## 1. Prepare the Kubernetes Deployment - -A sample deployment YAML is provided in this repository: -`benchmarks/nixl/nixl-benchmark-deployment.yaml` - -Update the image field in sample yaml to appropiate image in your registry. - -You can use the `yq` tool to update the image field in the deployment YAML -```bash -yq -i '.spec.template.spec.containers[] |= select(.name == "nixl-benchmark") .image = "your-registry/your-nixl-benchmark:your-tag"' benchmarks/nixl/nixl-benchmark-deployment.yaml > nixl-benchmark-deployment.yaml -``` - -## 2. Deploy using kubectl -Launch using the command below: - -```bash -kubectl apply -f nixl-benchmark-deployment.yaml -``` \ No newline at end of file diff --git a/deploy/cloud/pre-deployment/README.md b/deploy/cloud/pre-deployment/README.md new file mode 100644 index 000000000000..9bcb79e589ff --- /dev/null +++ b/deploy/cloud/pre-deployment/README.md @@ -0,0 +1,172 @@ + + +# Pre-Deployment Check Script + +This directory contains a pre-deployment check script that verifies your Kubernetes cluster meets the requirements for deploying Dynamo. + +- For NCCL tests, please refer to the [NCCL tests](https://docs.nebius.com/kubernetes/gpu/nccl-test#run-tests) for more details. + +- For NIXL benchmark, please refer to the [NIXL benchmark pre-deployment checks](/deploy/cloud/pre-deployment/nixl/README.md) for more details. 
+ +## Usage + +Run the pre-deployment check before deploying Dynamo: + +```bash +./pre-deployment-check.sh +``` + +## What it checks + +The script performs a few checks and provides a detailed summary: + +### 1. kubectl Connectivity +- Verifies that `kubectl` is installed and can connect to your Kubernetes cluster + +### 2. Default StorageClass +- Verifies that a default StorageClass is configured in your cluster +- If no default StorageClass is found: + - Lists all available StorageClasses in the cluster with full details + - Provides a sample command to set a StorageClass as default + - References the official Kubernetes documentation for detailed guidance + +### 3. Cluster GPU Resources +- Checks for GPU-enabled nodes in the cluster using the label `nvidia.com/gpu.present=true` + +## Sample Output + +### Complete Script Output Example: +``` +======================================== + Dynamo Pre-Deployment Check Script +======================================== + +--- Checking kubectl connectivity --- +✅ kubectl is available and cluster is accessible + +--- Checking for default StorageClass --- +❌ No default StorageClass found + +Dynamo requires a default StorageClass for persistent volume provisioning. +Please configure a default StorageClass before proceeding with deployment. 
+ +Available StorageClasses in your cluster: +NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE +my-default-storage-class (default) compute.csi.mock Delete WaitForFirstConsumer true 65d +fast-ssd-storage kubernetes.io/gce-pd Delete Immediate true 30d + +To set a StorageClass as default, use the following command: +kubectl patch storageclass -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}' + +Example with your first available StorageClass: +kubectl patch storageclass my-default-storage-class -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}' + +For more information on managing default StorageClasses, visit: +https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/ + +--- Checking cluster gpu resources --- +✅ Found 17 gpu node(s) in the cluster +Node information: + +--- Pre-Deployment Check Summary --- +✅ kubectl Connectivity: PASSED +❌ Default StorageClass: FAILED +✅ Cluster Resources: PASSED + +Summary: 2 passed, 1 failed +❌ 1 pre-deployment check(s) failed. +Please address the issues above before proceeding with deployment. +``` + +### When all checks pass: +``` +======================================== + Dynamo Pre-Deployment Check Script +======================================== + + +--- Checking kubectl connectivity --- +✅ kubectl is available and cluster is accessible + +--- Checking for default StorageClass --- +✅ Default StorageClass found + - NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE +my-default-storage-class (default) compute.csi.mock Delete WaitForFirstConsumer true 65d + +--- Checking cluster gpu resources --- +✅ Found 17 gpu node(s) in the cluster +Node information: + + +--- Pre-Deployment Check Summary --- +✅ kubectl Connectivity: PASSED +✅ Default StorageClass: PASSED +✅ Cluster Resources: PASSED + +Summary: 3 passed, 0 failed +🎉 All pre-deployment checks passed! 
+Your cluster is ready for Dynamo deployment. +``` + +## Check Status Summary + +The script prints a summary showing the status of each check: + +| Check Name | Description | Pass/Fail Status | +|------------|-------------|------------------| +| **kubectl Connectivity** | Verifies kubectl installation and cluster access | ✅ PASSED / ❌ FAILED | +| **Default StorageClass** | Checks for default StorageClass annotation | ✅ PASSED / ❌ FAILED | +| **Cluster Resources** | Validates GPU node availability | ✅ PASSED / ❌ FAILED | + +## Setting a Default StorageClass + +If you need to set a default StorageClass, use the following command: + +```bash +kubectl patch storageclass -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}' +``` + +Insert the name of your desired StorageClass immediately after `storageclass` in the command above. + +## Troubleshooting + +### Multiple Default StorageClasses +If you have multiple StorageClasses marked as default, the script will warn you: +``` +⚠️ Warning: Multiple default StorageClasses detected + This may cause unpredictable behavior. Consider having only one default StorageClass. +``` + +To unmark a StorageClass as default (again, its name goes after `storageclass`): +```bash +kubectl patch storageclass -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}' +``` + +### No GPU Nodes Found +If no GPU nodes are found, ensure your cluster has nodes with the `nvidia.com/gpu.present=true` label. + +### No StorageClasses Available +If no StorageClasses are available in your cluster, you'll need to: +1. Install a storage provisioner (e.g., one for your cloud provider or for local storage) +2. Create appropriate StorageClass resources +3. 
Mark one as default + +## Reference + +For more information on managing default StorageClasses, visit: +[Kubernetes Documentation - Change the default StorageClass](https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/) \ No newline at end of file diff --git a/deploy/cloud/pre-deployment/nixl/README.md b/deploy/cloud/pre-deployment/nixl/README.md new file mode 100644 index 000000000000..051624e646de --- /dev/null +++ b/deploy/cloud/pre-deployment/nixl/README.md @@ -0,0 +1,292 @@ +# NIXL Benchmark Documentation + +This guide describes how to build and deploy the NIXL benchmark using the provided scripts on a Kubernetes (K8s) cluster. + +> **Note**: NIXL benchmark is part of the Dynamo platform. Before proceeding, ensure your cluster meets the basic Dynamo requirements by running the pre-deployment check script located in the parent directory (`../pre-deployment-check.sh`). + +--- + +## Prerequisites + +### Cluster Requirements +Before deploying NIXL benchmark, ensure your cluster meets the Dynamo platform requirements by running the pre-deployment check: + +```bash +# Run from the parent directory +../pre-deployment-check.sh +``` + +This script verifies: +- `kubectl` connectivity and cluster access +- GPU nodes availability (`nvidia.com/gpu.present=true` label) +- GPU Operator installation and status + +### NIXL-Specific Requirements +In addition to the cluster requirements above, NIXL benchmark requires: +- **Docker** installed and configured on your local machine (for building images) +- **Docker registry access** to push the built nixlbench images +- **ETCD service** deployed and accessible as `etcd:2379` +- **Build utilities**: `wget` and `unzip` for downloading NIXL source code + +### Verification Steps +1. **Run pre-deployment check** (recommended): + ```bash + ../pre-deployment-check.sh + ``` + Ensure all checks pass before proceeding. + +2. **Verify ETCD availability** (NIXL-specific): + ```bash + kubectl get svc etcd + ``` + +3. 
**Confirm Docker registry access**: + ```bash + docker login your-registry.com # if using private registry + ``` + +--- + +## Quick Start + +For the easiest experience, use the interactive build and deploy script: + +```bash +./build_and_deploy.sh +``` + +This script provides a flexible workflow where you can: +1. **Select architecture**: Choose between x86_64 (Intel/AMD 64-bit) or aarch64 (ARM64) +2. **Choose which steps to execute**: Select any combination of: + - Build nixlbench Docker image + - Update deployment YAML file + - Deploy to Kubernetes +3. **Provide Docker registry** (only when needed for building or updating deployment) + +--- + +## Interactive Script Features + +### Architecture Selection +The script supports two architectures: +- **Option 1**: x86_64 (Intel/AMD 64-bit) +- **Option 2**: aarch64 (ARM64) + +You can select by entering: +- `1` or `x86_64` for x86_64 architecture +- `2` or `aarch64` for aarch64 architecture + +### Step Selection +Choose which steps to execute by entering comma-separated numbers: + +- **All steps**: `1,2,3` +- **Build and update only**: `1,2` (skips Kubernetes deployment) +- **Deploy only**: `3` (useful if image is already built and deployment file exists) +- **Build only**: `1` (useful for just creating the Docker image) +- **Update deployment only**: `2` (useful for updating deployment file with new registry/version) + +### Smart Registry Prompting +The script only prompts for Docker registry information when needed: +- **Steps 1 or 2**: Registry required for building image or updating deployment +- **Step 3 only**: No registry prompt (uses existing deployment file) + +--- + +## What Each Step Does + +### Step 1: Build nixlbench Docker Image +- Downloads NIXL source code (version 0.6.0) from GitHub +- Extracts and navigates to the build directory +- Pauses for user confirmation before building +- Builds Docker image with specified registry and architecture +- Tags image as: `{registry}/nixlbench:0.6.0-{arch}` + +### 
Step 2: Update Deployment YAML File +- Copies base deployment template (`nixlbench-deployment.yaml`) +- Creates architecture-specific deployment file (`nixlbench-deployment-{arch}.yaml`) +- Updates image reference with your registry and architecture +- Preserves all other deployment configurations + +### Step 3: Deploy to Kubernetes +- Validates deployment file exists +- Applies deployment to Kubernetes cluster +- Provides monitoring commands for checking status + +--- + +## Deployment Configuration + +The deployment creates: +- **2 replicas** of the nixlbench pod +- **Resource requests/limits**: + - CPU: 10 cores + - Memory: 5Gi + - GPU: 1 NVIDIA GPU per pod +- **Environment variables**: + - `ETCD_ENDPOINTS`: Points to `etcd:2379` +- **Command**: Runs nixlbench with VRAM segments and keeps container alive + +--- + +## Usage Examples + +### Example 1: Complete Workflow +```bash +./build_and_deploy.sh +# Select: 1 (x86_64) +# Steps: 1,2,3 +# Registry: docker.io/myusername +# Confirm: y +``` + +### Example 2: Build Image Only +```bash +./build_and_deploy.sh +# Select: 2 (aarch64) +# Steps: 1 +# Registry: my-private-registry.com +# Confirm: y +``` + +### Example 3: Deploy Existing Image +```bash +./build_and_deploy.sh +# Select: 1 (x86_64) +# Steps: 3 +# Confirm: y +``` + +### Example 4: Update Deployment File Only +```bash +./build_and_deploy.sh +# Select: 2 (aarch64) +# Steps: 2 +# Registry: new-registry.com +# Confirm: y +``` + +--- + +## Generated Files + +The script creates architecture-specific deployment files: +- `nixlbench-deployment-x86_64.yaml` - For x86_64 builds +- `nixlbench-deployment-aarch64.yaml` - For aarch64 builds + +These files are customized versions of the base template with your specific: +- Docker registry +- Image tag +- Architecture + +--- + +## Monitoring Your Deployment + +After deployment, monitor your NIXL benchmark: + +```bash +# Check pod status +kubectl get pods -l app=nixl-benchmark + +# View logs +kubectl logs -l app=nixl-benchmark 
-f + +# Check resource usage +kubectl top pods -l app=nixl-benchmark + +# Get detailed pod information +kubectl describe pods -l app=nixl-benchmark +``` + +If deployed to a specific namespace: +```bash +kubectl get pods -l app=nixl-benchmark -n your-namespace +kubectl logs -l app=nixl-benchmark -f -n your-namespace +``` + +--- + + +## Troubleshooting + +### Cluster-Level Issues +For cluster-related problems, first run the pre-deployment check to identify issues: + +```bash +../pre-deployment-check.sh +``` + +This will help diagnose: +- kubectl connectivity problems +- Missing default StorageClass +- GPU node availability issues +- GPU Operator status problems + +### NIXL-Specific Issues + +1. **ETCD Connection**: + - Ensure the etcd service is running: `kubectl get svc etcd` + - Verify etcd endpoints are accessible from pods + - Check that etcd is in the same namespace as the deployment + +2. **Image Pull Issues**: + - Verify registry credentials are configured + - Check that the image exists: `docker pull {registry}/nixlbench:0.6.0-{arch}` + - Ensure the image was pushed successfully after the build + +3. **Build Failures**: + - Ensure the Docker daemon is running + - Check available disk space in `/tmp` + - Verify network connectivity to GitHub + - Confirm build utilities are installed: `which wget unzip` + +4. 
**Deployment File Not Found**: + - Run step 2 to create deployment file before step 3 + - Check file permissions in script directory + - Verify script directory path is correct + +### Debug Commands +```bash +# Check script-generated files +ls -la nixlbench-deployment-*.yaml + +# Verify deployment status +kubectl get deployment nixl-benchmark -o yaml + +# Check events for issues +kubectl get events --sort-by=.metadata.creationTimestamp +``` + +### Cleanup + +To remove the deployment: +```bash +kubectl delete deployment nixl-benchmark +``` + +Or if deployed to a specific namespace: +```bash +kubectl delete deployment nixl-benchmark -n your-namespace +``` + +To clean up generated files: +```bash +rm -f nixlbench-deployment-*.yaml +``` + +--- + +## Script Reference + +### build_and_deploy.sh +Interactive script that provides flexible build and deployment workflow: +- **Architecture selection**: x86_64 or aarch64 +- **Step selection**: Choose any combination of build, update, deploy +- **Validation**: Checks for deployment files before deploying + +### nixlbench-deployment.yaml +Base Kubernetes deployment template that gets customized by the script: +- **Template image**: `my-registry/nixlbench:version-arch` +- **Resource allocation**: 10 CPU, 5Gi memory, 1 GPU per pod +- **ETCD integration**: Pre-configured environment variables +- **Benchmark command**: Runs with VRAM segment configuration \ No newline at end of file diff --git a/deploy/cloud/pre-deployment/nixl/build_and_deploy.sh b/deploy/cloud/pre-deployment/nixl/build_and_deploy.sh new file mode 100755 index 000000000000..88f966a61bad --- /dev/null +++ b/deploy/cloud/pre-deployment/nixl/build_and_deploy.sh @@ -0,0 +1,413 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +set -euo pipefail + + +NIXL_VERSION="0.6.0" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Function to check if a command exists +command_exists() { + command -v "$1" >/dev/null 2>&1 +} + +# Function to check Docker daemon status +check_docker_daemon() { + if ! docker info >/dev/null 2>&1; then + return 1 + fi + return 0 +} + +# Function to check all required dependencies +check_dependencies() { + echo "Checking required dependencies..." + local missing_deps=() + local warnings=() + + # Check wget + if ! command_exists wget; then + missing_deps+=("wget") + else + echo "✅ wget is available" + fi + + # Check unzip + if ! command_exists unzip; then + missing_deps+=("unzip") + else + echo "✅ unzip is available" + fi + + # Check kubectl + if ! command_exists kubectl; then + missing_deps+=("kubectl") + else + echo "✅ kubectl is available" + # Test kubectl connectivity + if ! kubectl cluster-info >/dev/null 2>&1; then + warnings+=("kubectl is installed but cannot connect to cluster") + else + echo "✅ kubectl can connect to cluster" + fi + fi + + # Check Docker + if ! command_exists docker; then + missing_deps+=("docker") + else + echo "✅ docker is available" + # Check Docker daemon + if ! check_docker_daemon; then + warnings+=("Docker is installed but daemon is not running or accessible") + else + echo "✅ Docker daemon is running" + + # Additional Docker toolchain checks + if ! docker ps >/dev/null 2>&1; then + warnings+=("Docker requires sudo or user is not in docker group - consider adding user to docker group") + fi + + if ! 
docker buildx version >/dev/null 2>&1; then + warnings+=("Docker buildx not available (may affect multi-architecture builds)") + fi + fi + fi + + # Report missing dependencies + if [ ${#missing_deps[@]} -gt 0 ]; then + echo + echo "❌ Missing required dependencies:" + for dep in "${missing_deps[@]}"; do + echo " - $dep" + done + echo + echo "Please install the missing dependencies and try again." + echo + echo "Installation suggestions:" + for dep in "${missing_deps[@]}"; do + case "$dep" in + wget) + echo " wget: sudo apt-get install wget (Ubuntu/Debian) or yum install wget (RHEL/CentOS)" + ;; + unzip) + echo " unzip: sudo apt-get install unzip (Ubuntu/Debian) or yum install unzip (RHEL/CentOS)" + ;; + kubectl) + echo " kubectl: https://kubernetes.io/docs/tasks/tools/install-kubectl/" + ;; + docker) + echo " docker: https://docs.docker.com/get-docker/" + ;; + esac + done + return 1 + fi + + # Report warnings + if [ ${#warnings[@]} -gt 0 ]; then + echo + echo "⚠️ Warnings:" + for warning in "${warnings[@]}"; do + echo " - $warning" + done + echo + printf "Do you want to continue despite these warnings? (y/N): " + read continue_with_warnings + case "$continue_with_warnings" in + [Yy]|[Yy][Ee][Ss]) + echo "Continuing with warnings..." + ;; + *) + echo "Please resolve the warnings and try again." 
+ return 1 + ;; + esac + fi + + echo "✅ All required dependencies are available" + return 0 +} + +# Function to display available architectures +show_architectures() { + echo "Available architectures:" + echo "1) x86_64 (Intel/AMD 64-bit)" + echo "2) aarch64 (ARM64)" +} + +# Function to validate architecture input +validate_architecture() { + local arch=$1 + case $arch in + 1|x86_64) + echo "x86_64" + return 0 + ;; + 2|aarch64) + echo "aarch64" + return 0 + ;; + *) + return 1 + ;; + esac +} + +# Function to prompt for registry +prompt_for_registry() { + echo + printf "Enter your Docker registry (e.g., my-registry, docker.io/username): " + read REGISTRY + if [ -z "$REGISTRY" ]; then + echo "Error: Registry cannot be empty" + exit 1 + fi +} + +# Function to build nixlbench image +build_nixlbench() { + local arch=$1 + local registry=$2 + + echo "Building nixlbench image for architecture: $arch" + echo "Registry: $registry" + + NIXL_BUILD_DIR="/tmp/nixlbench-${NIXL_VERSION}" + rm -rf "${NIXL_BUILD_DIR}" + mkdir -p "${NIXL_BUILD_DIR}" + cd "${NIXL_BUILD_DIR}" + + echo "Downloading NIXL source..." + wget https://github.com/ai-dynamo/nixl/archive/refs/tags/${NIXL_VERSION}.zip + unzip "${NIXL_VERSION}.zip" + cd "nixl-${NIXL_VERSION}/benchmark/nixlbench/contrib" + read -p "Press Enter to continue" + echo "Building Docker image..." + ./build.sh --tag "${registry}/nixlbench:${NIXL_VERSION}-${arch}" --arch "${arch}" + + echo "Build completed successfully!" 
+ echo "Image: ${registry}/nixlbench:${NIXL_VERSION}-${arch}" +} + +# Function to update deployment yaml +update_deployment() { + local arch=$1 + local registry=$2 + local deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${arch}.yaml" + + echo "Creating deployment file: $deployment_file" + + # Copy the original deployment file and update the image + cp "${SCRIPT_DIR}/nixlbench-deployment.yaml" "$deployment_file" + + # Update the image field using sed + sed -i "s|my-registry/nixlbench:version-arch|${registry}/nixlbench:${NIXL_VERSION}-${arch}|g" "$deployment_file" + + echo "Deployment file updated with image: ${registry}/nixlbench:${NIXL_VERSION}-${arch}" +} + +# Function to prompt for steps to execute +prompt_for_steps() { + echo + echo "Select which steps to execute:" + echo "1) Build nixlbench Docker image" + echo "2) Update deployment YAML file" + echo "3) Deploy to Kubernetes" + echo + echo "Enter the steps you want to execute (e.g., '1,2,3' for all, '1,2' to skip deployment, '3' for deployment only):" + printf "Steps to execute: " + read steps_input + + if [ -z "$steps_input" ]; then + echo "Error: Please select at least one step" + return 1 + fi + + # Parse the input and set flags + EXECUTE_BUILD=false + EXECUTE_UPDATE=false + EXECUTE_DEPLOY=false + + # Convert comma-separated input to array + IFS=',' read -ra STEPS <<< "$steps_input" + for step in "${STEPS[@]}"; do + # Remove whitespace + step=$(echo "$step" | tr -d ' ') + case "$step" in + 1) + EXECUTE_BUILD=true + ;; + 2) + EXECUTE_UPDATE=true + ;; + 3) + EXECUTE_DEPLOY=true + ;; + *) + echo "Warning: Invalid step '$step' ignored. 
Valid steps are 1, 2, 3" + ;; + esac + done + + # Check if at least one valid step was selected + if [ "$EXECUTE_BUILD" = false ] && [ "$EXECUTE_UPDATE" = false ] && [ "$EXECUTE_DEPLOY" = false ]; then + echo "Error: No valid steps selected" + return 1 + fi + + return 0 +} + +# Function to deploy to Kubernetes +deploy_to_k8s() { + local arch=$1 + local deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${arch}.yaml" + + echo "Deploying to Kubernetes..." + kubectl apply -f "$deployment_file" + echo "Deployment applied successfully!" + echo + echo "To check the status of your deployment:" + echo "kubectl get pods -l app=nixl-benchmark" + echo + echo "To view logs:" + echo "kubectl logs -l app=nixl-benchmark -f" +} + +# Main script +main() { + echo "NIXL Benchmark Build and Deploy Script" + echo "======================================" + echo + + # Check dependencies first + if ! check_dependencies; then + exit 1 + fi + echo + + # Show available architectures + show_architectures + echo + + # Prompt for architecture + while true; do + printf "Select architecture (1-2 or enter x86_64/aarch64): " + read -r arch_input + + if [ -z "$arch_input" ]; then + echo "Error: Please select an architecture" + continue + fi + + # Test the assignment directly: under `set -e`, a failing bare assignment would exit the script before `$?` could be checked + if SELECTED_ARCH=$(validate_architecture "$arch_input"); then + break + else + echo "Error: Invalid architecture. Please select 1, 2, x86_64, or aarch64" + fi + done + + echo "Selected architecture: $SELECTED_ARCH" + + # Prompt for registry (only if building or updating deployment) + REGISTRY="" + + # Prompt for steps to execute + while true; do + if prompt_for_steps; then + break + fi + echo "Please try again." 
+ echo + done + + # Only prompt for registry if we need it + if [ "$EXECUTE_BUILD" = true ] || [ "$EXECUTE_UPDATE" = true ]; then + prompt_for_registry + fi + + echo + echo "Summary:" + echo "- Architecture: $SELECTED_ARCH" + if [ -n "$REGISTRY" ]; then + echo "- Registry: $REGISTRY" + echo "- Image will be: $REGISTRY/nixlbench:$NIXL_VERSION-$SELECTED_ARCH" + fi + echo "- Steps to execute:" + if [ "$EXECUTE_BUILD" = true ]; then + echo " ✓ Build nixlbench Docker image" + else + echo " ✗ Build nixlbench Docker image (skipped)" + fi + if [ "$EXECUTE_UPDATE" = true ]; then + echo " ✓ Update deployment YAML file" + else + echo " ✗ Update deployment YAML file (skipped)" + fi + if [ "$EXECUTE_DEPLOY" = true ]; then + echo " ✓ Deploy to Kubernetes" + else + echo " ✗ Deploy to Kubernetes (skipped)" + fi + echo + + printf "Proceed with selected steps? (y/N): " + read confirm + case "$confirm" in + [Yy]|[Yy][Ee][Ss]) + ;; + *) + echo "Process cancelled." + exit 0 + ;; + esac + + # Execute selected steps + if [ "$EXECUTE_BUILD" = true ]; then + echo + echo "=== Building nixlbench Docker image ===" + build_nixlbench "$SELECTED_ARCH" "$REGISTRY" + fi + + if [ "$EXECUTE_UPDATE" = true ]; then + echo + echo "=== Updating deployment YAML file ===" + update_deployment "$SELECTED_ARCH" "$REGISTRY" + fi + + if [ "$EXECUTE_DEPLOY" = true ]; then + echo + echo "=== Deploying to Kubernetes ===" + # Check if deployment file exists + deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${SELECTED_ARCH}.yaml" + if [ ! -f "$deployment_file" ]; then + echo "Warning: Deployment file not found at $deployment_file" + echo "You may need to run step 2 (Update deployment YAML file) first." + printf "Do you want to continue with deployment anyway? (y/N): " + read deploy_confirm + case "$deploy_confirm" in + [Yy]|[Yy][Ee][Ss]) + ;; + *) + echo "Deployment skipped." 
+ EXECUTE_DEPLOY=false + ;; + esac + fi + + if [ "$EXECUTE_DEPLOY" = true ]; then + deploy_to_k8s "$SELECTED_ARCH" + fi + fi + + echo + echo "Process completed successfully!" +} + +# Run main function +main "$@" diff --git a/benchmarks/nixl/nixl-benchmark-deployment.yaml b/deploy/cloud/pre-deployment/nixl/nixlbench-deployment.yaml similarity index 54% rename from benchmarks/nixl/nixl-benchmark-deployment.yaml rename to deploy/cloud/pre-deployment/nixl/nixlbench-deployment.yaml index b0bf1084ac20..15cd39431555 100644 --- a/benchmarks/nixl/nixl-benchmark-deployment.yaml +++ b/deploy/cloud/pre-deployment/nixl/nixlbench-deployment.yaml @@ -14,16 +14,22 @@ spec: labels: app: nixl-benchmark spec: - imagePullSecrets: - - name: nvcr-imagepullsecret containers: - name: nixl-benchmark - image: my-registry/vllm-runtime:nixlbench-e42c07a8 + image: "my-registry/nixlbench:version-arch" command: ["sh", "-c"] + env: + - name: ETCD_ENDPOINTS + value: etcd:2379 args: - - "nixlbench -etcd_endpoints http://dynamo-platform-etcd:2379 --target_seg_type VRAM --initiator_seg_type VRAM && sleep infinity" + - | + nixlbench -etcd_endpoints ${ETCD_ENDPOINTS} --target_seg_type VRAM --initiator_seg_type VRAM && sleep infinity resources: requests: - nvidia.com/gpu: "1" + cpu: "10" + memory: "5Gi" + nvidia.com/gpu: "1" limits: - nvidia.com/gpu: "1" + cpu: "10" + memory: "5Gi" + nvidia.com/gpu: "1" diff --git a/deploy/cloud/pre-deployment/pre-deployment-check.sh b/deploy/cloud/pre-deployment/pre-deployment-check.sh new file mode 100755 index 000000000000..3477718b1998 --- /dev/null +++ b/deploy/cloud/pre-deployment/pre-deployment-check.sh @@ -0,0 +1,283 @@ +#!/usr/bin/env bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +# Pre-deployment check script for Dynamo +# This script verifies that the Kubernetes cluster has the necessary prerequisites +# before deploying Dynamo platform. 
+# +# Checks performed: +# 1. kubectl connectivity - Verifies kubectl is installed and can connect to cluster +# 2. Default StorageClass - Ensures a default StorageClass is configured +# 3. Cluster GPU Resources - Validates GPU nodes are available +# 4. GPU Operator - Confirms GPU operator is installed and running + +set -e + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +# Function to print colored output +print_status() { + local color=$1 + local message=$2 + echo -e "${color}${message}${NC}" +} + +print_header() { + echo -e "\n${BLUE}========================================${NC}" + echo -e "${BLUE} Dynamo Pre-Deployment Check Script ${NC}" + echo -e "${BLUE}========================================${NC}\n" +} + +print_section() { + echo -e "\n${BLUE}--- $1 ---${NC}" +} + +# Function to check if kubectl is available and cluster is accessible +check_kubectl() { + print_section "Checking kubectl connectivity" + + if ! command -v kubectl &> /dev/null; then + print_status $RED "❌ kubectl is not installed or not in PATH" + print_status $YELLOW "Please install kubectl and ensure it's in your PATH" + return 1 + fi + + if ! 
kubectl cluster-info &> /dev/null; then + print_status $RED "❌ Cannot connect to Kubernetes cluster" + print_status $YELLOW "Please ensure kubectl is configured to connect to your cluster" + return 1 + fi + + print_status $GREEN "✅ kubectl is available and cluster is accessible" + return 0 +} + +# Function to check for default storage class +check_default_storage_class() { + print_section "Checking for default StorageClass" + + # Use JSONPath to find storage classes with the default annotation set to "true" + local default_storage_classes + default_storage_classes=$(kubectl get storageclass -o jsonpath='{range .items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")]}{.metadata.name}{"\n"}{end}' 2>/dev/null || echo "") + + if [[ -z "$default_storage_classes" ]]; then + print_status $RED "❌ No default StorageClass found" + print_status $YELLOW "\nDynamo requires a default StorageClass for persistent volume provisioning." + print_status $BLUE "Please follow the instructions below to configure a default StorageClass before proceeding with deployment.\n" + + # Show available storage classes + print_status $BLUE "Available StorageClasses in your cluster:" + local all_storage_classes + all_storage_classes=$(kubectl get storageclass 2>/dev/null || echo "") + + if [[ -z "$all_storage_classes" ]]; then + print_status $YELLOW " No StorageClasses found in the cluster" + else + echo -e "$all_storage_classes" + + local all_storage_class_names + all_storage_class_names=$(kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null || echo "") + + print_status $BLUE "\nTo set a StorageClass as default, use the following command:" + print_status $YELLOW "kubectl patch storageclass -p '{\"metadata\": {\"annotations\":{\"storageclass.kubernetes.io/is-default-class\":\"true\"}}}'" + + if [[ -n "$all_storage_class_names" ]]; then + local first_sc_name + first_sc_name=$(echo "$all_storage_class_names" | head -n1) + 
print_status $BLUE "\nExample with your first available StorageClass:" + print_status $YELLOW "kubectl patch storageclass ${first_sc_name} -p '{\"metadata\": {\"annotations\":{\"storageclass.kubernetes.io/is-default-class\":\"true\"}}}'" + fi + fi + + print_status $BLUE "\nFor more information on managing default StorageClasses, visit:" + print_status $BLUE "https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/" + return 1 + else + print_status $GREEN "✅ Default StorageClass found" + while IFS= read -r sc_name; do + if [[ -n "$sc_name" ]]; then + local default_sc + default_sc=$(kubectl get storageclass "$sc_name" 2>/dev/null || echo "unknown") + print_status $GREEN " - ${default_sc}" + fi + done <<< "$default_storage_classes" + + # Check if there are multiple default storage classes (which can cause issues) + local default_count + default_count=$(echo "$default_storage_classes" | grep -c . || true) + if [[ $default_count -gt 1 ]]; then + print_status $YELLOW "⚠️ Warning: Multiple default StorageClasses detected" + print_status $YELLOW " This may cause unpredictable behavior. Consider having only one default StorageClass." + fi + return 0 + fi +} + +check_cluster_resources() { + print_section "Checking cluster GPU resources" + + local node_count + node_count=$(kubectl get nodes -l nvidia.com/gpu.present=true -o name 2>/dev/null | wc -l || echo "0") + + if [[ $node_count -eq 0 ]]; then + print_status $RED "❌ No GPU nodes found in the cluster" + print_status $YELLOW "Dynamo requires nodes with the nvidia.com/gpu.present=true label." + print_status $BLUE "Please ensure your cluster has GPU-enabled nodes properly labeled." 
+ return 1 + else + print_status $GREEN "✅ Found ${node_count} GPU node(s) in the cluster" + return 0 + fi + + # Show basic node information (commented out for cleaner output) + # print_status $BLUE "GPU Node information:" + # kubectl get nodes -l nvidia.com/gpu.present=true -o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].type,ROLES:.metadata.labels.'node-role\.kubernetes\.io/.*',VERSION:.status.nodeInfo.kubeletVersion 2>/dev/null || true +} + +check_gpu_operator() { + print_section "Checking GPU operator" + + # Check if GPU operator pods exist and are running + local gpu_operator_pods + gpu_operator_pods=$(kubectl get pods -A -lapp=gpu-operator --no-headers 2>/dev/null || echo "") + + if [[ -z "$gpu_operator_pods" ]]; then + print_status $RED "❌ GPU operator not found in the cluster" + print_status $YELLOW "Dynamo requires GPU operator to be installed and running." + print_status $BLUE "Please install GPU operator before proceeding with deployment." + return 1 + fi + + # Check if any GPU operator pods are running + local running_pods + running_pods=$(echo "$gpu_operator_pods" | grep -c "Running" || echo "0") + local total_pods + total_pods=$(echo "$gpu_operator_pods" | wc -l) + + if [[ $running_pods -eq 0 ]]; then + print_status $RED "❌ GPU operator pods are not running" + print_status $YELLOW "Found $total_pods GPU operator pod(s) but none are in Running state:" + echo "$gpu_operator_pods" + return 1 + elif [[ $running_pods -lt $total_pods ]]; then + print_status $YELLOW "⚠️ GPU operator partially running: $running_pods/$total_pods pods running" + echo "$gpu_operator_pods" + print_status $GREEN "✅ GPU operator is available (with warnings)" + return 0 + else + print_status $GREEN "✅ GPU operator is running ($running_pods/$total_pods pods)" + return 0 + fi +} + +# Global variables to track check results (using simple arrays for compatibility) +CHECK_RESULTS="" +CHECK_ORDER="" + +# Function to record check result +record_check_result() { + local 
check_name="$1" + local status="$2" + + # Append to results string with delimiter + if [[ -z "$CHECK_RESULTS" ]]; then + CHECK_RESULTS="${check_name}:${status}" + CHECK_ORDER="${check_name}" + else + CHECK_RESULTS="${CHECK_RESULTS}|${check_name}:${status}" + CHECK_ORDER="${CHECK_ORDER}|${check_name}" + fi +} + +# Function to get check result by name +get_check_result() { + local check_name="$1" + echo "$CHECK_RESULTS" | tr '|' '\n' | grep "^${check_name}:" | cut -d':' -f2 +} + +# Function to display check summary +display_check_summary() { + print_section "Pre-Deployment Check Summary" + + local passed=0 + local failed=0 + + # Split CHECK_ORDER by delimiter and iterate + IFS='|' read -ra CHECKS <<< "$CHECK_ORDER" + for check_name in "${CHECKS[@]}"; do + local status=$(get_check_result "$check_name") + if [[ "$status" == "PASS" ]]; then + print_status $GREEN "✅ $check_name: PASSED" + ((passed++)) + else + print_status $RED "❌ $check_name: FAILED" + ((failed++)) + fi + done + + echo "" + print_status $BLUE "Summary: $passed passed, $failed failed" + + if [[ $failed -eq 0 ]]; then + print_status $GREEN "🎉 All pre-deployment checks passed!" + print_status $GREEN "Your cluster is ready for Dynamo deployment." + return 0 + else + print_status $RED "❌ $failed pre-deployment check(s) failed." + print_status $RED "Please address the issues above before proceeding with deployment." 
+ return 1 + fi +} + +# Main execution +main() { + print_header + + local overall_exit_code=0 + + # Run checks and capture results + if check_kubectl; then + record_check_result "kubectl Connectivity" "PASS" + else + record_check_result "kubectl Connectivity" "FAIL" + overall_exit_code=1 + fi + + if check_default_storage_class; then + record_check_result "Default StorageClass" "PASS" + else + record_check_result "Default StorageClass" "FAIL" + overall_exit_code=1 + fi + + if check_cluster_resources; then + record_check_result "Cluster GPU Resources" "PASS" + else + record_check_result "Cluster GPU Resources" "FAIL" + overall_exit_code=1 + fi + + if check_gpu_operator; then + record_check_result "GPU Operator" "PASS" + else + record_check_result "GPU Operator" "FAIL" + overall_exit_code=1 + fi + + # Display summary + echo "" + if ! display_check_summary; then + overall_exit_code=1 + fi + + exit $overall_exit_code +} + +# Run the script +main "$@" diff --git a/docs/kubernetes/README.md b/docs/kubernetes/README.md index 5cbac1dc432d..c7ffb22d4b3d 100644 --- a/docs/kubernetes/README.md +++ b/docs/kubernetes/README.md @@ -19,6 +19,11 @@ limitations under the License. High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides. +## Pre-deployment Checks + +Before deploying the platform, it is recommended to run the pre-deployment checks to ensure the cluster is ready for deployment. Please refer to the [pre-deployment checks](/deploy/cloud/pre-deployment/README.md) for more details. + + ## 1. Install Platform First ```bash