feat: guides for nixl benchmarking #3584

Merged
biswapanda merged 4 commits into bis/dep-461-check-default-storage-class-before-deployment from bis/dep-460-create-guides-for-nixlnccl-test on Oct 13, 2025
Conversation

@biswapanda biswapanda commented Oct 13, 2025

Overview:

This guide describes how to build and deploy the NIXL benchmark using the provided scripts on a Kubernetes (K8s) cluster.

Summary by CodeRabbit

  • New Features

    • Added a pre-deployment check script that validates kubectl access, default StorageClass, GPU availability, and GPU operator status with a clear pass/fail summary.
    • Introduced an interactive NIXL build-and-deploy script (x86_64/aarch64) with registry prompts, step selection, and Kubernetes deployment.
    • Updated NIXL deployment manifest: new image format, ETCD endpoint via env var, improved command, and explicit CPU/memory/GPU resources.
  • Documentation

    • Added guides for pre-deployment checks and NIXL benchmark build/deploy workflow, including quick start and troubleshooting.
    • Removed outdated NIXL benchmark README.
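The "clear pass/fail summary" behaviour described above can be sketched as a small check runner. This is a minimal illustration only: `run_check`, `print_summary`, and `FAILED` are hypothetical names, not taken from `pre-deployment-check.sh`, and the real script may be organised differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a PASS/FAIL check runner.
# None of these names come from pre-deployment-check.sh.

FAILED=0

# run_check NAME COMMAND [ARGS...]
# Runs the command quietly, prints one PASS/FAIL line, and counts failures
# instead of aborting, so every check gets a chance to run.
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  ${name}"
  else
    echo "FAIL  ${name}"
    FAILED=$((FAILED + 1))
  fi
}

# print_summary prints the overall verdict and returns non-zero
# if at least one check failed, so it can drive the script's exit code.
print_summary() {
  if [ "$FAILED" -eq 0 ]; then
    echo "Overall: PASS"
    return 0
  fi
  echo "Overall: FAIL (${FAILED} check(s) failed)"
  return 1
}

# Example wiring, commented out so the sketch can be sourced without a cluster:
#   run_check "kubectl access" kubectl cluster-info
#   print_summary; exit $?
```

Counting failures rather than exiting on the first one is a design choice of this sketch; it lets a single run report every problem at once.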

@biswapanda biswapanda requested a review from a team as a code owner October 13, 2025 17:14
@github-actions github-actions Bot added the feat label Oct 13, 2025
@biswapanda biswapanda changed the base branch from main to bis/dep-461-check-default-storage-class-before-deployment October 13, 2025 17:15
@biswapanda biswapanda changed the title feat: guides for nixl benchmarking feat: guides for nixlbench Oct 13, 2025
@biswapanda biswapanda changed the title feat: guides for nixlbench feat: pre-deployment guide for nixlbench Oct 13, 2025
@biswapanda biswapanda self-assigned this Oct 13, 2025
@biswapanda biswapanda force-pushed the bis/dep-460-create-guides-for-nixlnccl-test branch from 9c079f9 to 20c34fc Compare October 13, 2025 17:20
copy-pr-bot Bot commented Oct 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@biswapanda biswapanda changed the title feat: pre-deployment guide for nixlbench feat: guides for nixl benchmarking Oct 13, 2025
@biswapanda biswapanda force-pushed the bis/dep-460-create-guides-for-nixlnccl-test branch from 20c34fc to 0b1357a Compare October 13, 2025 17:22
coderabbitai Bot commented Oct 13, 2025

Caution

Review failed

Failed to post review comments

Walkthrough

Adds Kubernetes pre-deployment checks and NIXL benchmark build/deploy tooling. Introduces two new Bash scripts, updates NIXL deployment YAML, replaces/relocates NIXL documentation, and removes an old benchmark README. No application code or APIs changed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Pre-deployment checks: `deploy/cloud/pre-deployment/pre-deployment-check.sh`, `deploy/cloud/pre-deployment/README.md` | New script and guide to validate kubectl access, default StorageClass, GPU resources, and GPU operator; prints a PASS/FAIL summary and exit code. |
| NIXL docs restructure: `benchmarks/nixl/README.md`, `deploy/cloud/pre-deployment/nixl/README.md` | Removes old benchmark README; adds a comprehensive NIXL build/deploy guide under `pre-deployment/nixl`. |
| NIXL build and deploy tooling: `deploy/cloud/pre-deployment/nixl/build_and_deploy.sh`, `deploy/cloud/pre-deployment/nixl/nixlbench-deployment.yaml` | Adds interactive script to build images (x86_64/aarch64), update deployment YAML, and deploy via kubectl; updates deployment manifest (image, env var, resources, command) and removes `imagePullSecrets`. |
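As an illustration of the "GPU resources" check, the sketch below counts nodes that advertise at least one allocatable GPU. It assumes GPUs are exposed under the `nvidia.com/gpu` resource name (the usual name with the NVIDIA device plugin); the function name `count_gpu_nodes` is hypothetical, and the actual script may query the cluster differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from pre-deployment-check.sh.
#
# count_gpu_nodes reads "node-name<TAB>allocatable-gpus" lines on stdin and
# prints how many nodes report at least one allocatable GPU.
# An empty second field (node without GPUs) counts as zero.
count_gpu_nodes() {
  awk -F '\t' '$2 + 0 > 0 { n++ } END { print n + 0 }'
}

# Against a live cluster the input could come from kubectl's JSONPath output
# (assumes the nvidia.com/gpu resource name):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
#     | count_gpu_nodes
```

Keeping the parsing separate from the `kubectl` call means the counting logic can be exercised with canned input, without cluster access.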

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant S as pre-deployment-check.sh
  participant K as kubectl
  participant API as Kubernetes API

  U->>S: Run script
  S->>K: kubectl version / cluster-info
  K->>API: Connect
  API-->>K: Response
  K-->>S: Status
  S->>K: Get default StorageClass
  K-->>S: SC list/labels
  S->>K: Get GPU nodes (labels/resources)
  K-->>S: Node counts
  S->>K: Check GPU operator pods
  K-->>S: Pod states
  S-->>U: Per-check results and overall summary (PASS/FAIL), exit code
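The "Get default StorageClass" step in the diagram above could look roughly like the following. Kubernetes marks a default StorageClass with the `storageclass.kubernetes.io/is-default-class` annotation; everything else here (function names, messages, and the decision to fail when more than one class is marked default) is an assumption of this sketch, not the script's confirmed behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not taken from pre-deployment-check.sh.

# default_storage_classes reads "name<TAB>is-default-annotation" lines on
# stdin and prints the names whose annotation value is exactly "true".
default_storage_classes() {
  awk -F '\t' '$2 == "true" { print $1 }'
}

# check_default_storage_class reads the same input and reports PASS only
# when exactly one StorageClass is marked default (a choice of this sketch).
check_default_storage_class() {
  local defaults count
  defaults="$(default_storage_classes)"
  count="$(printf '%s\n' "$defaults" | awk 'NF { n++ } END { print n + 0 }')"
  case "$count" in
    1) echo "PASS: default StorageClass is ${defaults}"; return 0 ;;
    0) echo "FAIL: no default StorageClass found"; return 1 ;;
    *) echo "FAIL: ${count} StorageClasses are marked default"; return 1 ;;
  esac
}

# Against a live cluster:
#   kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}' \
#     | check_default_storage_class
```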
sequenceDiagram
  autonumber
  actor U as User
  participant B as build_and_deploy.sh
  participant FS as Filesystem
  participant D as Docker/Build
  participant R as Registry
  participant K as kubectl
  participant API as Kubernetes API

  U->>B: Start script (select arch, steps)
  alt Build image
    B->>FS: Fetch NIXL source
    B->>D: docker build -t REG/nixlbench:VERSION-ARCH
    D->>R: Push (if configured)
    R-->>B: Image available
  end
  alt Update YAML
    B->>FS: Copy base YAML -> arch-specific file
    B->>FS: Update image ref via sed
  end
  alt Deploy
    B->>K: kubectl apply -f arch-specific YAML
    K->>API: Create/Update resources
    API-->>K: Status
    K-->>B: Apply result
  end
  B-->>U: Summary and follow-up commands
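The "Update YAML" branch of the diagram above (copy the base YAML to an arch-specific file, then rewrite the image reference with `sed`) can be sketched as follows. The `REG/nixlbench:VERSION-ARCH` tag shape comes from the diagram; the function names, the example registry and version, and the exact `sed` expression are assumptions for illustration, not the contents of `build_and_deploy.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not taken from build_and_deploy.sh.

# image_ref REGISTRY VERSION ARCH
# Prints an image reference in the REG/nixlbench:VERSION-ARCH shape.
image_ref() {
  printf '%s/nixlbench:%s-%s\n' "$1" "$2" "$3"
}

# render_manifest BASE_YAML OUT_YAML IMAGE
# Writes a copy of BASE_YAML to OUT_YAML with the value of every "image:" key
# replaced by IMAGE. The base file is left untouched. Uses "|" as the sed
# delimiter because image references contain "/"; IMAGE must therefore not
# contain "|" or "&".
render_manifest() {
  local base="$1" out="$2" image="$3"
  sed "s|^\([[:space:]]*\(-[[:space:]]*\)\{0,1\}image:[[:space:]]*\).*|\1${image}|" \
    "$base" > "$out"
}

# Example wiring (registry and version are placeholders):
#   img="$(image_ref my.registry/team 0.1 aarch64)"
#   render_manifest nixlbench-deployment.yaml nixlbench-deployment-aarch64.yaml "$img"
#   kubectl apply -f nixlbench-deployment-aarch64.yaml
```

Writing to a separate output file instead of editing in place avoids the `sed -i` differences between GNU and BSD `sed`, and all expansions are quoted so paths with spaces survive.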

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump my paws, deploy with cheer,
New scripts check clouds both far and near.
NIXL builds hum, YAMLs align,
GPUs ready—everything’s fine.
Hop, push, apply—logs scroll bright,
Carrots for green checks, all paws light! 🥕✨

Pre-merge checks

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The pull request description only provides the Overview section and a closes statement but omits the required Details and Where should the reviewer start sections from the repository’s template, leaving out specific change descriptions and file callouts. | Please complete the template by adding a Details section that describes the specific files and changes made in this PR and a Where should the reviewer start section listing the key files reviewers should focus on. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The pull request title clearly and concisely describes the addition of guides for NIXL benchmarking and aligns with the main purpose of the changeset without extraneous details. |

Comment thread deploy/cloud/pre-deployment/nixl/build_and_deploy.sh Outdated
Comment thread deploy/cloud/pre-deployment/nixl/build_and_deploy.sh
Comment thread deploy/cloud/pre-deployment/nixl/build_and_deploy.sh Outdated
Comment thread deploy/cloud/pre-deployment/nixl/build_and_deploy.sh
@atchernych atchernych left a comment

some nice to haves, i think quotes are important

@biswapanda

> some nice to haves, i think quotes are important

Makes sense, @atchernych.
I've addressed the comments.
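For context on why quoting matters in a Bash script like this one, here is a generic illustration (not the code under review; the path and the `count_args` helper are made up): an unquoted expansion is split on whitespace, so a path containing spaces turns into several arguments.

```shell
#!/usr/bin/env bash
# Generic illustration of word splitting; not taken from build_and_deploy.sh.

# count_args prints how many arguments it received.
count_args() { echo "$#"; }

# Hypothetical path containing spaces.
manifest="my manifests/nixlbench deployment.yaml"

count_args $manifest     # unquoted: split into 3 words, prints 3
count_args "$manifest"   # quoted: passed as one argument, prints 1
```

The same applies to commands such as `kubectl apply -f "$manifest"` or `docker build -t "$image" .`, where an unquoted variable could be split or glob-expanded before the command sees it.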

@biswapanda biswapanda merged commit d2b3941 into bis/dep-461-check-default-storage-class-before-deployment Oct 13, 2025
20 of 21 checks passed
@biswapanda biswapanda deleted the bis/dep-460-create-guides-for-nixlnccl-test branch October 13, 2025 20:46