Skip to content

Conversation

@alec-flowers
Copy link
Contributor

@alec-flowers alec-flowers commented Oct 9, 2025

Overview:

I ran through the profiler steps to dogfood profiling Qwen 32b. These changes were necessary to make things run and I also improved documentation where I got stuck.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Refactor

    • Updated argument handling for worker containers to pass arguments directly (removes extra quoting layer).
  • Documentation

    • Quick Start: Clarified PVC creation with RWX and how to set storageClassName; added example and storage class check; noted HF_TOKEN usage.
    • Pre-deployment profiling: Added steps to create nvcr-imagepullsecret using NGC_API_KEY and kubectl.
    • Kubernetes installation guide: Added cluster-wide install option via NS_RESTRICT_FLAGS, refined helm steps (including cd and dep build), quoted namespace flag, and updated post-install pod expectations.
    • SLA Planner Quickstart: Corrected PVC manifest path and minor formatting.

@alec-flowers alec-flowers requested review from a team as code owners October 9, 2025 18:55
@github-actions github-actions bot added the fix label Oct 9, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 9, 2025

Walkthrough

Removes a shell-quoting helper in profiler config to assign container args directly. Updates multiple docs: PVC RWX guidance and env var example, NGC image pull secret troubleshooting, Kubernetes installation instructions (cluster-wide option, path and quoting fixes), and SLA planner quickstart path/formatting corrections.

Changes

Cohort / File(s) Summary of Changes
Profiler args handling
benchmarks/profiler/utils/config.py
Removed join_arguments; assign args list directly to extraPodSpec.mainContainer.args. Deleted exported function join_arguments(args: list[str]) -> list[str].
Deploy utils documentation
deploy/utils/README.md
Added Quick Start notes on RWX PVC creation, storageClassName adjustments, accessModes example, storage class check tip, and HF_TOKEN env usage.
Benchmarks troubleshooting
docs/benchmarks/pre_deployment_profiling.md
Added steps to create nvcr-imagepullsecret using NGC_API_KEY; duplicated guidance in initial error notes and Troubleshooting section.
Kubernetes installation guide
docs/kubernetes/installation_guide.md
Added cluster-wide install option via NS_RESTRICT_FLAGS; quoted namespace flag; inserted cd into deploy/cloud/helm and helm dep build; updated expected pods list.
SLA planner quickstart
docs/kubernetes/sla_planner_quickstart.md
Updated PVC manifest path to deploy/utils/manifests/pvc.yaml; minor markdown formatting tweak in help block.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I thump my paws at lines now clean,
Args hop straight—no quotes between.
RWX fields and secrets set,
Helm paths clear, no tangled net.
Pods appear like stars at night—
I twitch, approve this tidy sight. 🐇✨

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description includes an Overview and a Related Issues section but leaves the Details and Where should the reviewer start sections as unfilled placeholders, so it does not adhere to the required template format. Please complete the Details section with a summary of the code and documentation changes and update the Where should the reviewer start section to point to the key files or areas impacted by this PR.
Docstring Coverage ⚠️ Warning Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly summarizes the primary changes by referencing bug fixes in the profiling utility and documentation updates, which matches the removal of the join_arguments function and multiple doc improvements.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9414e3b and 3dd195e.

📒 Files selected for processing (5)
  • benchmarks/profiler/utils/config.py (11 hunks)
  • deploy/utils/README.md (1 hunks)
  • docs/benchmarks/pre_deployment_profiling.md (1 hunks)
  • docs/kubernetes/installation_guide.md (2 hunks)
  • docs/kubernetes/sla_planner_quickstart.md (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: sglang
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (9)
docs/benchmarks/pre_deployment_profiling.md (1)

227-233: LGTM! Clear troubleshooting guidance added.

The instructions for creating the nvcr-imagepullsecret are clear and actionable, providing users with the exact commands needed to resolve ImagePull authentication errors.

benchmarks/profiler/utils/config.py (1)

467-467: LGTM! Correct approach for Kubernetes container args.

Directly assigning the list[str] to mainContainer.args is the proper approach for Kubernetes, which expects container arguments as a list rather than a shell-quoted string. This change removes unnecessary shell quoting that could have caused issues with argument parsing.

deploy/utils/README.md (1)

37-56: LGTM! Helpful RWX storage guidance.

The documentation clearly explains the ReadWriteMany requirement and provides actionable steps to configure the appropriate storage class. The YAML example and kubectl command tip make it easy for users to resolve storage access mode issues.

docs/kubernetes/sla_planner_quickstart.md (2)

42-42: LGTM! Path reference updated correctly.

The path update to deploy/utils/manifests/pvc.yaml correctly reflects the centralized PVC manifest location documented in the deployment utilities.


264-264: LGTM! Improved markdown formatting.

The formatting change makes the help tip more prominent and properly renders as a blockquote with bold text.

docs/kubernetes/installation_guide.md (4)

147-147: LGTM! Ensures correct working directory.

Adding the cd deploy/cloud/helm step ensures all subsequent relative paths (e.g., ./platform/) resolve correctly.


153-153: LGTM! Essential helm dependency step.

The helm dep build ./platform/ command is required to download chart dependencies before installation—a standard helm workflow step.


155-163: LGTM! Clear cluster-wide vs namespace-restricted install documentation.

The introduction of NS_RESTRICT_FLAGS makes the installation mode explicit and easy to configure. The documentation clearly explains both options:

  • Cluster-wide (empty flags or omitted)
  • Namespace-restricted (with namespaceRestriction.enabled=true)

The quoted namespace variable ("${NAMESPACE}") follows shell best practices for variable expansion.


177-177: LGTM! More accurate expected pods description.

Including nats-* in the expected pods list provides users with a complete picture of the installation result.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@tedzhouhk
Copy link
Contributor

thanks for the fixes!

@alec-flowers alec-flowers enabled auto-merge (squash) October 9, 2025 21:23
@alec-flowers alec-flowers merged commit 179f993 into main Oct 9, 2025
22 checks passed
@alec-flowers alec-flowers deleted the aflowers/k8s-doc-fix branch October 9, 2025 21:47
dagil-nvidia pushed a commit that referenced this pull request Oct 10, 2025
Signed-off-by: alec-flowers <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Co-authored-by: hongkuanz <[email protected]>
Signed-off-by: Dan Gil <[email protected]>
ziqifan617 pushed a commit that referenced this pull request Oct 20, 2025
Signed-off-by: alec-flowers <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Co-authored-by: hongkuanz <[email protected]>
nv-tusharma pushed a commit that referenced this pull request Oct 20, 2025
Signed-off-by: alec-flowers <[email protected]>
Signed-off-by: hongkuanz <[email protected]>
Co-authored-by: hongkuanz <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants