diff --git a/deploy/cloud/pre-deployment/nixl/README.md b/deploy/cloud/pre-deployment/nixl/README.md
index 071992e4dc8a..051624e646de 100644
--- a/deploy/cloud/pre-deployment/nixl/README.md
+++ b/deploy/cloud/pre-deployment/nixl/README.md
@@ -1,32 +1,292 @@
-# NIXL Benchmark Technical Documentation (Kubernetes)
+# NIXL Benchmark Documentation
 
-This guide describes how to run the NIXL benchmark using the provided Docker image on a Kubernetes (K8s) cluster.
+This guide describes how to build and deploy the NIXL benchmark using the provided scripts on a Kubernetes (K8s) cluster.
+
+> **Note**: NIXL benchmark is part of the Dynamo platform. Before proceeding, ensure your cluster meets the basic Dynamo requirements by running the pre-deployment check script located in the parent directory (`../pre-deployment-check.sh`).
 
 ---
 
 ## Prerequisites
 
-- A running Kubernetes cluster with access to NVIDIA GPUs (e.g., using NVIDIA GPU Operator or device plugin)
-- `kubectl` configured to access your cluster
-- deploy dynamo cloud in a namespace
+### Cluster Requirements
+Before deploying NIXL benchmark, ensure your cluster meets the Dynamo platform requirements by running the pre-deployment check:
+
+```bash
+# Run from the parent directory
+../pre-deployment-check.sh
+```
+
+This script verifies:
+- `kubectl` connectivity and cluster access
+- GPU nodes availability (`nvidia.com/gpu.present=true` label)
+- GPU Operator installation and status
+
+### NIXL-Specific Requirements
+In addition to the cluster requirements above, NIXL benchmark requires:
+- **Docker** installed and configured on your local machine (for building images)
+- **Docker registry access** to push the built nixlbench images
+- **ETCD service** deployed and accessible as `etcd:2379`
+- **Build utilities**: `wget` and `unzip` for downloading NIXL source code
+
+### Verification Steps
+1. **Run pre-deployment check** (recommended):
+   ```bash
+   ../pre-deployment-check.sh
+   ```
+   Ensure all checks pass before proceeding.
+
+2. **Verify ETCD availability** (NIXL-specific):
+   ```bash
+   kubectl get svc etcd
+   ```
+
+3. **Confirm Docker registry access**:
+   ```bash
+   docker login your-registry.com  # if using private registry
+   ```
+
+---
+
+## Quick Start
+
+For the easiest experience, use the interactive build and deploy script:
+
+```bash
+./build_and_deploy.sh
+```
+
+This script provides a flexible workflow where you can:
+1. **Select architecture**: Choose between x86_64 (Intel/AMD 64-bit) or aarch64 (ARM64)
+2. **Choose which steps to execute**: Select any combination of:
+   - Build nixlbench Docker image
+   - Update deployment YAML file
+   - Deploy to Kubernetes
+3. **Provide Docker registry** (only when needed for building or updating deployment)
+
+---
+
+## Interactive Script Features
+
+### Architecture Selection
+The script supports two architectures:
+- **Option 1**: x86_64 (Intel/AMD 64-bit)
+- **Option 2**: aarch64 (ARM64)
+
+You can select by entering:
+- `1` or `x86_64` for x86_64 architecture
+- `2` or `aarch64` for aarch64 architecture
+
+### Step Selection
+Choose which steps to execute by entering comma-separated numbers:
+
+- **All steps**: `1,2,3`
+- **Build and update only**: `1,2` (skips Kubernetes deployment)
+- **Deploy only**: `3` (useful if image is already built and deployment file exists)
+- **Build only**: `1` (useful for just creating the Docker image)
+- **Update deployment only**: `2` (useful for updating deployment file with new registry/version)
+
+### Smart Registry Prompting
+The script only prompts for Docker registry information when needed:
+- **Steps 1 or 2**: Registry required for building image or updating deployment
+- **Step 3 only**: No registry prompt (uses existing deployment file)
+
+---
+
+## What Each Step Does
+
+### Step 1: Build nixlbench Docker Image
+- Downloads NIXL source code (version 0.6.0) from GitHub
+- Extracts and navigates to the build directory
+- Pauses for user confirmation before building
+- Builds Docker image with specified registry and architecture
+- Tags image as: `{registry}/nixlbench:0.6.0-{arch}`
+
+### Step 2: Update Deployment YAML File
+- Copies base deployment template (`nixlbench-deployment.yaml`)
+- Creates architecture-specific deployment file (`nixlbench-deployment-{arch}.yaml`)
+- Updates image reference with your registry and architecture
+- Preserves all other deployment configurations
+
+### Step 3: Deploy to Kubernetes
+- Validates deployment file exists
+- Applies deployment to Kubernetes cluster
+- Provides monitoring commands for checking status
 
 ---
 
-## 1. Prepare the Kubernetes Deployment
+## Deployment Configuration
+
+The deployment creates:
+- **2 replicas** of the nixlbench pod
+- **Resource requests/limits**:
+  - CPU: 10 cores
+  - Memory: 5Gi
+  - GPU: 1 NVIDIA GPU per pod
+- **Environment variables**:
+  - `ETCD_ENDPOINTS`: Points to `etcd:2379`
+- **Command**: Runs nixlbench with VRAM segments and keeps container alive
 
-A sample deployment YAML is provided in this repository:
-`benchmarks/nixl/nixl-benchmark-deployment.yaml`
+---
 
-Update the image field in sample yaml to appropiate image in your registry.
+## Usage Examples
 
-You can use the `yq` tool to update the image field in the deployment YAML
+### Example 1: Complete Workflow
 ```bash
-yq -i '.spec.template.spec.containers[] |= select(.name == "nixl-benchmark") .image = "your-registry/your-nixl-benchmark:your-tag"' benchmarks/nixl/nixl-benchmark-deployment.yaml > nixl-benchmark-deployment.yaml
+./build_and_deploy.sh
+# Select: 1 (x86_64)
+# Steps: 1,2,3
+# Registry: docker.io/myusername
+# Confirm: y
 ```
 
-## 2. Deploy using kubectl
-Launch using the command below:
+### Example 2: Build Image Only
+```bash
+./build_and_deploy.sh
+# Select: 2 (aarch64)
+# Steps: 1
+# Registry: my-private-registry.com
+# Confirm: y
+```
 
+### Example 3: Deploy Existing Image
 ```bash
-kubectl apply -f nixl-benchmark-deployment.yaml
-```
\ No newline at end of file
+./build_and_deploy.sh
+# Select: 1 (x86_64)
+# Steps: 3
+# Confirm: y
+```
+
+### Example 4: Update Deployment File Only
+```bash
+./build_and_deploy.sh
+# Select: 2 (aarch64)
+# Steps: 2
+# Registry: new-registry.com
+# Confirm: y
+```
+
+---
+
+## Generated Files
+
+The script creates architecture-specific deployment files:
+- `nixlbench-deployment-x86_64.yaml` - For x86_64 builds
+- `nixlbench-deployment-aarch64.yaml` - For aarch64 builds
+
+These files are customized versions of the base template with your specific:
+- Docker registry
+- Image tag
+- Architecture
+
+---
+
+## Monitoring Your Deployment
+
+After deployment, monitor your NIXL benchmark:
+
+```bash
+# Check pod status
+kubectl get pods -l app=nixl-benchmark
+
+# View logs
+kubectl logs -l app=nixl-benchmark -f
+
+# Check resource usage
+kubectl top pods -l app=nixl-benchmark
+
+# Get detailed pod information
+kubectl describe pods -l app=nixl-benchmark
+```
+
+If deployed to a specific namespace:
+```bash
+kubectl get pods -l app=nixl-benchmark -n your-namespace
+kubectl logs -l app=nixl-benchmark -f -n your-namespace
+```
+
+---
+
+
+## Troubleshooting
+
+### Cluster-Level Issues
+For cluster-related problems, first run the pre-deployment check to identify issues:
+
+```bash
+../pre-deployment-check.sh
+```
+
+This will help diagnose:
+- kubectl connectivity problems
+- Missing default StorageClass
+- GPU node availability issues
+- GPU Operator status problems
+
+### NIXL-Specific Issues
+
+1. **ETCD Connection**:
+   - Ensure etcd service is running: `kubectl get svc etcd`
+   - Verify etcd endpoints are accessible from pods
+   - Check if etcd is in the correct namespace
+
+2. **Image Pull Issues**:
+   - Verify registry credentials are configured
+   - Check image exists: `docker pull {registry}/nixlbench:0.6.0-{arch}`
+   - Ensure image was pushed successfully after build
+
+3. **Build Failures**:
+   - Ensure Docker daemon is running
+   - Check available disk space in `/tmp`
+   - Verify network connectivity to GitHub
+   - Confirm build utilities are installed: `which wget unzip`
+
+4. **Deployment File Not Found**:
+   - Run step 2 to create deployment file before step 3
+   - Check file permissions in script directory
+   - Verify script directory path is correct
+
+### Debug Commands
+```bash
+# Check script-generated files
+ls -la nixlbench-deployment-*.yaml
+
+# Verify deployment status
+kubectl get deployment nixl-benchmark -o yaml
+
+# Check events for issues
+kubectl get events --sort-by=.metadata.creationTimestamp
+```
+
+### Cleanup
+
+To remove the deployment:
+```bash
+kubectl delete deployment nixl-benchmark
+```
+
+Or if deployed to a specific namespace:
+```bash
+kubectl delete deployment nixl-benchmark -n your-namespace
+```
+
+To clean up generated files:
+```bash
+rm -f nixlbench-deployment-*.yaml
+```
+
+---
+
+## Script Reference
+
+### build_and_deploy.sh
+Interactive script that provides flexible build and deployment workflow:
+- **Architecture selection**: x86_64 or aarch64
+- **Step selection**: Choose any combination of build, update, deploy
+- **Validation**: Checks for deployment files before deploying
+
+### nixlbench-deployment.yaml
+Base Kubernetes deployment template that gets customized by the script:
+- **Template image**: `my-registry/nixlbench:version-arch`
+- **Resource allocation**: 10 CPU, 5Gi memory, 1 GPU per pod
+- **ETCD integration**: Pre-configured environment variables
+- **Benchmark command**: Runs with VRAM segment configuration
\ No newline at end of file
diff --git a/deploy/cloud/pre-deployment/nixl/build_and_deploy.sh b/deploy/cloud/pre-deployment/nixl/build_and_deploy.sh
new file mode 100755
index 000000000000..88f966a61bad
--- /dev/null
+++ b/deploy/cloud/pre-deployment/nixl/build_and_deploy.sh
@@ -0,0 +1,413 @@
+#!/bin/bash
+
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+set -euo pipefail
+
+
+NIXL_VERSION="0.6.0"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# Function to check if a command exists
+command_exists() {
+    command -v "$1" >/dev/null 2>&1
+}
+
+# Function to check Docker daemon status
+check_docker_daemon() {
+    if ! docker info >/dev/null 2>&1; then
+        return 1
+    fi
+    return 0
+}
+
+# Function to check all required dependencies
+check_dependencies() {
+    echo "Checking required dependencies..."
+    local missing_deps=()
+    local warnings=()
+
+    # Check wget
+    if ! command_exists wget; then
+        missing_deps+=("wget")
+    else
+        echo "✅ wget is available"
+    fi
+
+    # Check unzip
+    if ! command_exists unzip; then
+        missing_deps+=("unzip")
+    else
+        echo "✅ unzip is available"
+    fi
+
+    # Check kubectl
+    if ! command_exists kubectl; then
+        missing_deps+=("kubectl")
+    else
+        echo "✅ kubectl is available"
+        # Test kubectl connectivity
+        if ! kubectl cluster-info >/dev/null 2>&1; then
+            warnings+=("kubectl is installed but cannot connect to cluster")
+        else
+            echo "✅ kubectl can connect to cluster"
+        fi
+    fi
+
+    # Check Docker
+    if ! command_exists docker; then
+        missing_deps+=("docker")
+    else
+        echo "✅ docker is available"
+        # Check Docker daemon
+        if ! check_docker_daemon; then
+            warnings+=("Docker is installed but daemon is not running or accessible")
+        else
+            echo "✅ Docker daemon is running"
+
+            # Additional Docker toolchain checks
+            if ! docker ps >/dev/null 2>&1; then
+                warnings+=("Docker requires sudo or user is not in docker group - consider adding user to docker group")
+            fi
+
+            if ! docker buildx version >/dev/null 2>&1; then
+                warnings+=("Docker buildx not available (may affect multi-architecture builds)")
+            fi
+        fi
+    fi
+
+    # Report missing dependencies
+    if [ ${#missing_deps[@]} -gt 0 ]; then
+        echo
+        echo "❌ Missing required dependencies:"
+        for dep in "${missing_deps[@]}"; do
+            echo "  - $dep"
+        done
+        echo
+        echo "Please install the missing dependencies and try again."
+        echo
+        echo "Installation suggestions:"
+        for dep in "${missing_deps[@]}"; do
+            case "$dep" in
+                wget)
+                    echo "  wget: sudo apt-get install wget (Ubuntu/Debian) or yum install wget (RHEL/CentOS)"
+                    ;;
+                unzip)
+                    echo "  unzip: sudo apt-get install unzip (Ubuntu/Debian) or yum install unzip (RHEL/CentOS)"
+                    ;;
+                kubectl)
+                    echo "  kubectl: https://kubernetes.io/docs/tasks/tools/install-kubectl/"
+                    ;;
+                docker)
+                    echo "  docker: https://docs.docker.com/get-docker/"
+                    ;;
+            esac
+        done
+        return 1
+    fi
+
+    # Report warnings
+    if [ ${#warnings[@]} -gt 0 ]; then
+        echo
+        echo "⚠️ Warnings:"
+        for warning in "${warnings[@]}"; do
+            echo "  - $warning"
+        done
+        echo
+        printf "Do you want to continue despite these warnings? (y/N): "
+        read -r continue_with_warnings
+        case "$continue_with_warnings" in
+            [Yy]|[Yy][Ee][Ss])
+                echo "Continuing with warnings..."
+                ;;
+            *)
+                echo "Please resolve the warnings and try again."
+                return 1
+                ;;
+        esac
+    fi
+
+    echo "✅ All required dependencies are available"
+    return 0
+}
+
+# Function to display available architectures
+show_architectures() {
+    echo "Available architectures:"
+    echo "1) x86_64 (Intel/AMD 64-bit)"
+    echo "2) aarch64 (ARM64)"
+}
+
+# Function to validate architecture input
+validate_architecture() {
+    local arch=$1
+    case $arch in
+        1|x86_64)
+            echo "x86_64"
+            return 0
+            ;;
+        2|aarch64)
+            echo "aarch64"
+            return 0
+            ;;
+        *)
+            return 1
+            ;;
+    esac
+}
+
+# Function to prompt for registry
+prompt_for_registry() {
+    echo
+    printf "Enter your Docker registry (e.g., my-registry, docker.io/username): "
+    read -r REGISTRY
+    if [ -z "$REGISTRY" ]; then
+        echo "Error: Registry cannot be empty"
+        exit 1
+    fi
+}
+
+# Function to build nixlbench image
+build_nixlbench() {
+    local arch=$1
+    local registry=$2
+
+    echo "Building nixlbench image for architecture: $arch"
+    echo "Registry: $registry"
+
+    NIXL_BUILD_DIR="/tmp/nixlbench-${NIXL_VERSION}"
+    rm -rf "${NIXL_BUILD_DIR}"
+    mkdir -p "${NIXL_BUILD_DIR}"
+    cd "${NIXL_BUILD_DIR}"
+
+    echo "Downloading NIXL source..."
+    wget "https://github.com/ai-dynamo/nixl/archive/refs/tags/${NIXL_VERSION}.zip"
+    unzip "${NIXL_VERSION}.zip"
+    cd "nixl-${NIXL_VERSION}/benchmark/nixlbench/contrib"
+    read -r -p "Press Enter to continue"
+    echo "Building Docker image..."
+    ./build.sh --tag "${registry}/nixlbench:${NIXL_VERSION}-${arch}" --arch "${arch}"
+
+    echo "Build completed successfully!"
+    echo "Image: ${registry}/nixlbench:${NIXL_VERSION}-${arch}"
+}
+
+# Function to update deployment yaml
+update_deployment() {
+    local arch=$1
+    local registry=$2
+    local deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${arch}.yaml"
+
+    echo "Creating deployment file: $deployment_file"
+
+    # Copy the original deployment file and update the image
+    cp "${SCRIPT_DIR}/nixlbench-deployment.yaml" "$deployment_file"
+
+    # Update the image field using sed
+    sed -i "s|my-registry/nixlbench:version-arch|${registry}/nixlbench:${NIXL_VERSION}-${arch}|g" "$deployment_file"
+
+    echo "Deployment file updated with image: ${registry}/nixlbench:${NIXL_VERSION}-${arch}"
+}
+
+# Function to prompt for steps to execute
+prompt_for_steps() {
+    echo
+    echo "Select which steps to execute:"
+    echo "1) Build nixlbench Docker image"
+    echo "2) Update deployment YAML file"
+    echo "3) Deploy to Kubernetes"
+    echo
+    echo "Enter the steps you want to execute (e.g., '1,2,3' for all, '1,2' to skip deployment, '3' for deployment only):"
+    printf "Steps to execute: "
+    read -r steps_input
+
+    if [ -z "$steps_input" ]; then
+        echo "Error: Please select at least one step"
+        return 1
+    fi
+
+    # Parse the input and set flags
+    EXECUTE_BUILD=false
+    EXECUTE_UPDATE=false
+    EXECUTE_DEPLOY=false
+
+    # Convert comma-separated input to array
+    IFS=',' read -ra STEPS <<< "$steps_input"
+    for step in "${STEPS[@]}"; do
+        # Remove whitespace
+        step=$(echo "$step" | tr -d ' ')
+        case "$step" in
+            1)
+                EXECUTE_BUILD=true
+                ;;
+            2)
+                EXECUTE_UPDATE=true
+                ;;
+            3)
+                EXECUTE_DEPLOY=true
+                ;;
+            *)
+                echo "Warning: Invalid step '$step' ignored. Valid steps are 1, 2, 3"
+                ;;
+        esac
+    done
+
+    # Check if at least one valid step was selected
+    if [ "$EXECUTE_BUILD" = false ] && [ "$EXECUTE_UPDATE" = false ] && [ "$EXECUTE_DEPLOY" = false ]; then
+        echo "Error: No valid steps selected"
+        return 1
+    fi
+
+    return 0
+}
+
+# Function to deploy to Kubernetes
+deploy_to_k8s() {
+    local arch=$1
+    local deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${arch}.yaml"
+
+    echo "Deploying to Kubernetes..."
+    kubectl apply -f "$deployment_file"
+    echo "Deployment applied successfully!"
+    echo
+    echo "To check the status of your deployment:"
+    echo "kubectl get pods -l app=nixl-benchmark"
+    echo
+    echo "To view logs:"
+    echo "kubectl logs -l app=nixl-benchmark -f"
+}
+
+# Main script
+main() {
+    echo "NIXL Benchmark Build and Deploy Script"
+    echo "======================================"
+    echo
+
+    # Check dependencies first
+    if ! check_dependencies; then
+        exit 1
+    fi
+    echo
+
+    # Show available architectures
+    show_architectures
+    echo
+
+    # Prompt for architecture
+    while true; do
+        printf "Select architecture (1-2 or enter x86_64/aarch64): "
+        read -r arch_input
+
+        if [ -z "$arch_input" ]; then
+            echo "Error: Please select an architecture"
+            continue
+        fi
+
+        # Test the assignment in the `if` itself so `set -e` does not abort on invalid input
+        if SELECTED_ARCH=$(validate_architecture "$arch_input"); then
+            break
+        else
+            echo "Error: Invalid architecture. Please select 1, 2, x86_64, or aarch64"
+        fi
+    done
+
+    echo "Selected architecture: $SELECTED_ARCH"
+
+    # Prompt for registry (only if building or updating deployment)
+    REGISTRY=""
+
+    # Prompt for steps to execute
+    while true; do
+        if prompt_for_steps; then
+            break
+        fi
+        echo "Please try again."
+        echo
+    done
+
+    # Only prompt for registry if we need it
+    if [ "$EXECUTE_BUILD" = true ] || [ "$EXECUTE_UPDATE" = true ]; then
+        prompt_for_registry
+    fi
+
+    echo
+    echo "Summary:"
+    echo "- Architecture: $SELECTED_ARCH"
+    if [ -n "$REGISTRY" ]; then
+        echo "- Registry: $REGISTRY"
+        echo "- Image will be: $REGISTRY/nixlbench:$NIXL_VERSION-$SELECTED_ARCH"
+    fi
+    echo "- Steps to execute:"
+    if [ "$EXECUTE_BUILD" = true ]; then
+        echo "  ✓ Build nixlbench Docker image"
+    else
+        echo "  ✗ Build nixlbench Docker image (skipped)"
+    fi
+    if [ "$EXECUTE_UPDATE" = true ]; then
+        echo "  ✓ Update deployment YAML file"
+    else
+        echo "  ✗ Update deployment YAML file (skipped)"
+    fi
+    if [ "$EXECUTE_DEPLOY" = true ]; then
+        echo "  ✓ Deploy to Kubernetes"
+    else
+        echo "  ✗ Deploy to Kubernetes (skipped)"
+    fi
+    echo
+
+    printf "Proceed with selected steps? (y/N): "
+    read -r confirm
+    case "$confirm" in
+        [Yy]|[Yy][Ee][Ss])
+            ;;
+        *)
+            echo "Process cancelled."
+            exit 0
+            ;;
+    esac
+
+    # Execute selected steps
+    if [ "$EXECUTE_BUILD" = true ]; then
+        echo
+        echo "=== Building nixlbench Docker image ==="
+        build_nixlbench "$SELECTED_ARCH" "$REGISTRY"
+    fi
+
+    if [ "$EXECUTE_UPDATE" = true ]; then
+        echo
+        echo "=== Updating deployment YAML file ==="
+        update_deployment "$SELECTED_ARCH" "$REGISTRY"
+    fi
+
+    if [ "$EXECUTE_DEPLOY" = true ]; then
+        echo
+        echo "=== Deploying to Kubernetes ==="
+        # Check if deployment file exists
+        deployment_file="${SCRIPT_DIR}/nixlbench-deployment-${SELECTED_ARCH}.yaml"
+        if [ ! -f "$deployment_file" ]; then
+            echo "Warning: Deployment file not found at $deployment_file"
+            echo "You may need to run step 2 (Update deployment YAML file) first."
+            printf "Do you want to continue with deployment anyway? (y/N): "
+            read -r deploy_confirm
+            case "$deploy_confirm" in
+                [Yy]|[Yy][Ee][Ss])
+                    ;;
+                *)
+                    echo "Deployment skipped."
+                    EXECUTE_DEPLOY=false
+                    ;;
+            esac
+        fi
+
+        if [ "$EXECUTE_DEPLOY" = true ]; then
+            deploy_to_k8s "$SELECTED_ARCH"
+        fi
+    fi
+
+    echo
+    echo "Process completed successfully!"
+}
+
+# Run main function
+main "$@"
diff --git a/deploy/cloud/pre-deployment/nixl/nixl-benchmark-deployment.yaml b/deploy/cloud/pre-deployment/nixl/nixlbench-deployment.yaml
similarity index 54%
rename from deploy/cloud/pre-deployment/nixl/nixl-benchmark-deployment.yaml
rename to deploy/cloud/pre-deployment/nixl/nixlbench-deployment.yaml
index b0bf1084ac20..15cd39431555 100644
--- a/deploy/cloud/pre-deployment/nixl/nixl-benchmark-deployment.yaml
+++ b/deploy/cloud/pre-deployment/nixl/nixlbench-deployment.yaml
@@ -14,16 +14,22 @@ spec:
       labels:
         app: nixl-benchmark
     spec:
-      imagePullSecrets:
-      - name: nvcr-imagepullsecret
       containers:
       - name: nixl-benchmark
-        image: my-registry/vllm-runtime:nixlbench-e42c07a8
+        image: "my-registry/nixlbench:version-arch"
         command: ["sh", "-c"]
+        env:
+        - name: ETCD_ENDPOINTS
+          value: etcd:2379
         args:
-        - "nixlbench -etcd_endpoints http://dynamo-platform-etcd:2379 --target_seg_type VRAM --initiator_seg_type VRAM && sleep infinity"
+        - |
+          nixlbench -etcd_endpoints ${ETCD_ENDPOINTS} --target_seg_type VRAM --initiator_seg_type VRAM && sleep infinity
         resources:
           requests:
-            nvidia.com/gpu: "1"
+            cpu: "10"
+            memory: "5Gi"
+            nvidia.com/gpu: "1"
           limits:
-            nvidia.com/gpu: "1"
+            cpu: "10"
+            memory: "5Gi"
+            nvidia.com/gpu: "1"
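
Note on the image substitution in `update_deployment`: the script relies on `sed -i`, whose argument syntax differs between GNU and BSD sed, and it does not report a template that lacks the placeholder. The following is a minimal sketch of one possible alternative, not the PR's implementation. The helper name `render_deployment` and the `PLACEHOLDER` variable are hypothetical; only the placeholder string itself is taken from the template in this change.

```shell
#!/bin/bash
set -euo pipefail

# Placeholder image string found in nixlbench-deployment.yaml (from this diff).
PLACEHOLDER="my-registry/nixlbench:version-arch"

# render_deployment TEMPLATE OUTPUT IMAGE  (hypothetical helper)
# Writes OUTPUT as a copy of TEMPLATE with PLACEHOLDER replaced by IMAGE.
# Returns non-zero if TEMPLATE does not contain the placeholder.
# Assumes IMAGE contains no '|' or '&', which are special in the sed expression.
render_deployment() {
    local template=$1
    local output=$2
    local image=$3

    if ! grep -qF "$PLACEHOLDER" "$template"; then
        echo "Error: placeholder not found in $template" >&2
        return 1
    fi

    # Redirect to a new file rather than editing in place with `sed -i`.
    sed "s|${PLACEHOLDER}|${image}|g" "$template" > "$output"
}
```

A call such as `render_deployment nixlbench-deployment.yaml nixlbench-deployment-x86_64.yaml docker.io/myusername/nixlbench:0.6.0-x86_64` would then produce the architecture-specific file described in the README, and a missing placeholder fails fast instead of producing an unchanged copy.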