Skip to content

Conversation

@athreesh
Copy link
Contributor

@athreesh athreesh commented Aug 19, 2025

This commit consolidates and improves the documentation structure based on tech writer feedback:

New Documentation:

  • Added Grove advanced Kubernetes scheduling guide
  • Added comprehensive K8s metrics setup guide with Prometheus/Grafana

Heading Fixes:

  • Fixed redundant headings that would appear redundant in Sphinx breadcrumbs
  • Changed "Architecture" → "Design" in SLA planner docs
  • Changed "Core Components" → "Core Services" to avoid repetition
  • Removed duplicate H1 headings in component docs

Quick Start Disambiguation:

  • "Quick Start" → "SGLang Quick Start" in SGLang README
  • "Quick Start" → "TensorRT-LLM Quick Start" in TensorRT-LLM README
  • "Quick Start" → "vLLM Quick Start" in vLLM README
  • "Quick Start" → "KV Router Quick Start" in router docs
  • "Quick Start" → "Pre-deployment Steps" in Fluid caching guide

Platform Naming:

  • Updated references to use consistent "Dynamo Kubernetes Platform" naming

I had Claude Code take small snippets out of a larger PR i had, as per @rmccorm4 feedback

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Documentation
    • Added comprehensive guides for Grove (advanced Kubernetes scheduling) and for collecting/visualizing Kubernetes and GPU metrics with Prometheus/Grafana, including alerting examples and troubleshooting.
    • Clarified and standardized headings across multiple docs (e.g., “Quick Start” retitled for specific backends; architecture and planner sections renamed for clarity).
    • Restructured the Minikube setup guide for readability and updated platform references.
    • Enhanced the model caching guide with a new PVC mounting step.
    • Expanded the Hello World example with concrete Kubernetes deployment and cleanup commands.

This commit consolidates and improves the documentation structure based on tech writer feedback:

**New Documentation:**
- Added Grove advanced Kubernetes scheduling guide
- Added comprehensive K8s metrics setup guide with Prometheus/Grafana

**Heading Fixes:**
- Fixed redundant headings that would appear redundant in Sphinx breadcrumbs
- Changed "Architecture" → "Design" in SLA planner docs
- Changed "Core Components" → "Core Services" to avoid repetition
- Removed duplicate H1 headings in component docs

**Quick Start Disambiguation:**
- "Quick Start" → "SGLang Quick Start" in SGLang README
- "Quick Start" → "TensorRT-LLM Quick Start" in TensorRT-LLM README
- "Quick Start" → "vLLM Quick Start" in vLLM README
- "Quick Start" → "KV Router Quick Start" in router docs
- "Quick Start" → "Pre-deployment Steps" in Fluid caching guide

**Platform Naming:**
- Updated references to use consistent "Dynamo Kubernetes Platform" naming

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

This commit includes the full comprehensive documentation revamp that was missing:

**dynamo_cloud.md:**
- Complete restructure with 3 clear installation paths (Production, Local Dev, Custom)
- Streamlined prerequisites and setup steps
- Updated to "Dynamo Kubernetes Platform" naming throughout
- Added troubleshooting and verification sections
- Simplified deployment workflow

**README.md (dynamo_deploy):**
- Restructured as high-level deployment guide
- Added clear 3-step process: Install Platform → Choose Backend → Deploy Model
- Added DynamoGraphDeployment explanation with example YAML
- Organized backend options in clear table format
- Streamlined additional resources section

Both files now provide a much clearer, more organized user experience for deploying Dynamo on Kubernetes.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 19, 2025

Walkthrough

Renames several documentation headers, adds two new guides (Grove scheduling and Kubernetes metrics), restructures Minikube and Fluid caching guides, and updates the Hello World example with concrete Kubernetes deployment commands. No code or API changes.

Changes

Cohort / File(s) Summary of Changes
Components Index Header Update
components/README.md
Renamed “Core Components” to “Core Services”.
Backend Quick Start Header Renames
components/backends/sglang/README.md, components/backends/trtllm/README.md, components/backends/vllm/README.md
Retitled “Quick Start” headers to backend-specific variants (SGLang/TensorRT-LLM/vLLM).
Architecture Docs Header Renames
docs/architecture/architecture.md, docs/architecture/sla_planner.md
Renamed section headers (“High level architecture and key benefits”→“Key benefits”; “Architecture”→“Design”; deployment header retitled).
Router Docs Heading Update
docs/components/router/README.md
Retitled “Quick Start” to “KV Router Quick Start”.
New Grove Scheduling Guide
docs/guides/dynamo_deploy/grove.md
Added guide on Grove (scheduler extension), concepts (PodGangSet, PodClique), deployment, examples, monitoring, troubleshooting.
New Kubernetes Metrics Guide
docs/guides/dynamo_deploy/k8s_metrics.md
Added end-to-end K8s metrics collection guide (Prometheus/Grafana/DCGM), ServiceMonitors, dashboards, alerts, tuning, troubleshooting.
Minikube Guide Restructure
docs/guides/dynamo_deploy/minikube.md
Flattened headings, updated intro/next steps; content sequence unchanged.
Fluid Model Caching Guide Adjustments
docs/guides/dynamo_deploy/model_caching_with_fluid.md
Renamed “Quick Start” to “Pre-deployment Steps”; added Step 3 to mount PVC.
Hello World Example Deployment Steps
examples/runtime/hello_world/README.md
Updated to “Dynamo Kubernetes Platform”; added explicit kubectl apply/delete steps and namespace setup.

Sequence Diagram(s)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

A rabbit taps keys with gentle delight,
Headers aligned, guides taking flight.
Grove hums, pods dance in tidy arrays,
Metrics sing Grafana displays.
Hello World waves from Kubernetes blue—
Hop, apply, delete—docs crisp and true.
Carrot-stamped, we’re through! 🥕

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3> [!CAUTION]

Some comments are outside the diff and can’t be posted inline due to platform limits.

🔭 Outside diff range comments (7)
components/backends/sglang/README.md (2)

24-30: Fix TOC anchor to match renamed header

The section header was renamed to “SGLang Quick Start,” but the TOC still links to “Quick Start,” breaking the anchor.

Apply this diff to update the TOC entry:

- - [Quick Start](#quick-start)
+ - [SGLang Quick Start](#sglang-quick-start)

Also applies to: 53-53


49-51: Fix typos and minor grammar issues

A few user-facing wording issues reduce clarity:

  • Line 49: “router does not router” → “router does not route”
  • Line 164: “conjuction” → “conjunction”
  • Lines 178/186: “dp attention” → “DP attention”
  • Line 186: “use the our implementation” → “use our implementation”

Proposed patch:

- | **DP Rank Routing** | 🚧     | Direct routing supported. Dynamo KV router does not router to DP worker |
+ | **DP Rank Routing** | 🚧     | Direct routing supported. Dynamo KV router does not route to DP workers |

- Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead, we first route to a random prefill worker, select a random decode worker, and then send the request to both. Internally, SGLang's bootstrap server (which is a part of the `tokenizer_manager`) is used in conjuction with NIXL to handle the kv transfer.
+ Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead, we first route to a random prefill worker, select a random decode worker, and then send the request to both. Internally, SGLang's bootstrap server (which is a part of the `tokenizer_manager`) is used in conjunction with NIXL to handle the KV transfer.

- You can use this configuration to test out disaggregated serving with dp attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.
+ You can use this configuration to test out disaggregated serving with DP attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.

- When using MoE models, you can also use the our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, the environment variable that controls the expert distribution recording directory, and sets up the expert distribution recording mode to `stat`. You can learn more about expert parallelism load balancing [here](docs/expert-distribution-eplb.md).
+ When using MoE models, you can also use our implementation of the native SGLang endpoints to record expert distribution data. The `disagg_dp_attn.sh` script automatically sets up the SGLang HTTP server, the environment variable that controls the expert distribution recording directory, and sets the expert distribution recording mode to `stat`. You can learn more about expert parallelism load balancing [here](docs/expert-distribution-eplb.md).

Also applies to: 164-166, 176-186

docs/architecture/architecture.md (1)

91-91: Fix typo: “preceeding” → “preceding”

User-facing docs should avoid typos.

- The preceeding figures illustrate the effectiveness of KV aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency.
+ The preceding figures illustrate the effectiveness of KV aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency.
components/backends/vllm/README.md (2)

24-31: Fix TOC anchor to match renamed header

The TOC still points to “Quick Start,” but the section header is now “vLLM Quick Start,” breaking the anchor.

Apply this diff to update the TOC entry:

- - [Quick Start](#quick-start)
+ - [vLLM Quick Start](#vllm-quick-start)

Also applies to: 54-54


1-1: Switch absolute README links to relative paths

To resolve the InvalidPathToUri warnings, update the links in examples/multimodal/README.md that point to /components/backends/vllm/README.md so they use a relative path (../../components/backends/vllm/README.md) instead of an absolute one.

Files and locations to update:

  • examples/multimodal/README.md: lines 47, 125, 206, 270

Example diff for each occurrence:

- Its VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](/components/backends/vllm/README.md) example.
+ Its VllmPDWorker then prefills and decodes the prompt, just like the [LLM aggregated serving](../../components/backends/vllm/README.md) example.

- For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](/components/backends/vllm/README.md) example.
+ For more details on the roles of the prefill and decode workers, refer to the [LLM disaggregated serving](../../components/backends/vllm/README.md) example.
components/backends/trtllm/README.md (2)

36-41: Fix TOC anchor to match renamed header

The TOC links to “Quick Start” but the header is now “TensorRT-LLM Quick Start,” causing a broken anchor.

Apply this diff to update the TOC entry:

- - [Quick Start](#quick-start)
+ - [TensorRT-LLM Quick Start](#tensorrt-llm-quick-start)

Also applies to: 69-69


194-199: Deduplicate “Client” section and fix heading level consistency

There are two “Client” sections: one at Line 194 (###) and another at Line 234 (##). This duplicates content and creates anchor conflicts.

Suggested fix: keep the earlier “### Client” and remove the later duplicate. Patch:

-## Client
-
-See [client](../sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
-
-NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
-
-## Benchmarking
-
-To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
-`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
+## Benchmarking
+
+To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
+`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)

If you prefer to keep a single “Client” section near the end, move the earlier content down and make sure there’s only one heading, with consistent level (### under “Advanced Examples”).

Also applies to: 234-244

🧹 Nitpick comments (17)
components/backends/vllm/README.md (1)

56-56: Grammar nit: remove duplicate article

“all of our the common” → “all of our common”

- Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
+ Below we provide a guide that lets you run all of our common deployment patterns on a single node.
components/backends/trtllm/README.md (1)

71-71: Grammar nit: remove duplicate article

“all of our the common” → “all of our common”

- Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
+ Below we provide a guide that lets you run all of our common deployment patterns on a single node.
components/README.md (1)

32-32: Suggest adding a legacy anchor for “#core-components” (optional)

We ran:

rg -nP '\(#core-components\)|\[Core Components\]\(#core-components\)'

and found no internal references to the old slug. If you maintain any external docs or third-party sites linking to #core-components, you can preserve compatibility by adding this just above the updated heading:

-## Core Services
+<a id="core-components"></a>
+## Core Services

If you’re confident there are no external dependencies on the old anchor, feel free to skip this—this refactor is purely optional.

docs/components/router/README.md (1)

12-12: Preserve old anchor to avoid breaking inbound links.

Renaming the heading will change the auto-generated anchor (e.g., from #quick-start to #kv-router-quick-start). If other docs link to the old anchor, those links will break. Add a backward-compatible anchor alias just above the heading.

+<a id="quick-start"></a>
 ## KV Router Quick Start
docs/guides/dynamo_deploy/model_caching_with_fluid.md (1)

30-35: Clarify Step 3 with a pointer to the full usage example.

Adding a cross-link reduces ambiguity and helps readers immediately find a working spec for mounting the PVC.

-3. Mount the resulting PVC in your workload.
+3. Mount the resulting PVC in your workload. See [Usage with Dynamo](#usage-with-dynamo) for a complete example.
docs/architecture/sla_planner.md (1)

111-115: Fix absolute link in README.md and add stable page label

  • Add a MyST label at the top of docs/architecture/sla_planner.md so other docs can link to it reliably:

     docs/architecture/sla_planner.md
    +# (sla-planner)=
     # SLA-based Planner
  • Replace the absolute link in README.md (line 62) with a relative path:

     README.md
    -| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | 🚧 |
    +| [**SLA-Based Planner**](docs/architecture/sla_planner.md) | ✅ | ✅ | 🚧 |
examples/runtime/hello_world/README.md (1)

109-116: Quickstart link verified; consider adding a readiness check

The Quickstart Guide link is correct (docs/guides/dynamo_deploy/quickstart.md). To help users confirm their pods are up, you might add a simple readiness check after deployment:

 Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to install Dynamo Kubernetes Platform.
 Then deploy to kubernetes using
 
 ```bash
 export NAMESPACE=<your-namespace>
 cd dynamo
 kubectl apply -f examples/runtime/hello_world/deploy/hello_world.yaml -n ${NAMESPACE}
+kubectl get pods -n ${NAMESPACE}

</blockquote></details>
<details>
<summary>docs/guides/dynamo_deploy/grove.md (4)</summary><blockquote>

`24-43`: **Add schedulerName to the PodGangSet example for consistency**

Your later example specifies schedulerName: grove-scheduler, but this first PodGangSet does not. Add it here to avoid confusion and ensure users don’t accidentally use the default scheduler.


```diff
 spec:
   template:
     spec:
+      schedulerName: grove-scheduler
       containers:
       - name: worker
         image: dynamo/worker:latest
         resources:
           requests:
             nvidia.com/gpu: 1

75-81: Pin Grove install manifests to a versioned release, not latest

Using “releases/latest” makes docs non-reproducible and can break unexpectedly. Recommend pinning to a specific Grove version and optionally showing how to discover newer versions.

-# Install Grove CRDs and scheduler
-kubectl apply -f https://github.com/ai-dynamo/grove/releases/latest/download/grove-crds.yaml
-kubectl apply -f https://github.com/ai-dynamo/grove/releases/latest/download/grove-scheduler.yaml
+# Install Grove CRDs and scheduler (pin to a specific version)
+GROVE_VERSION=v0.3.1
+kubectl apply -f https://github.com/ai-dynamo/grove/releases/download/${GROVE_VERSION}/grove-crds.yaml
+kubectl apply -f https://github.com/ai-dynamo/grove/releases/download/${GROVE_VERSION}/grove-scheduler.yaml

If helpful, I can add a short snippet showing how to find the latest version with the GitHub API.


87-97: Clarify that label keys are examples; prefer standard/consistent labels

The accelerator label (accelerator=h100) is fine for examples, but many environments use standardized keys (e.g., nvidia.com/gpu.product, topology.kubernetes.io/zone). Consider a note clarifying these are examples and that clusters should adopt a consistent labeling convention.

You could add a sentence after the code block: “Note: Label keys/values shown are examples. Use your cluster’s standardized label keys (e.g., topology.kubernetes.io/*, nvidia.com/gpu.product) to align with your labeling policy.”


106-134: Disambiguate prefill vs decode gang composition

This PodGangSet template sets WORKER_TYPE to “prefill”, but replicas: 16 + minAvailable: 16 suggests a single gang. If prefill and decode are intended as separate gangs, consider showing two PodGangSets (one per role), or document how a single PodGangSet is partitioned by label selectors or multiple templates if supported by the CRD.

If PodGangSet currently supports only one template, show two separate objects (prefill and decode) so users don’t assume mixed roles in a single gang.

docs/guides/dynamo_deploy/k8s_metrics.md (6)

29-34: Avoid hardcoding Grafana admin password in commands

Publishing a fixed admin password (admin123) is insecure and can be copied into production. Recommend using a values file or a Secret and instruct users to set a strong password.

-helm install prometheus-stack prometheus-community/kube-prometheus-stack \
+helm install prometheus-stack prometheus-community/kube-prometheus-stack \
   --namespace monitoring \
   --create-namespace \
   --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
-  --set grafana.adminPassword=admin123
+  --values values-monitoring.yaml

And in values-monitoring.yaml:

grafana:
  adminPassword: "CHANGEME_STRONG_PASSWORD"

Optionally document using a pre-created Secret via grafana.admin.existingSecret.


95-133: Pin images instead of using :latest to ensure reproducibility

Using :latest can change behavior over time. Suggest pinning to a tested tag and/or digest, and adding a note that teams should align versions across components.

-        image: nvcr.io/nvidia/ai-dynamo/frontend:latest
+        image: nvcr.io/nvidia/ai-dynamo/frontend:0.12.3
+# or use a digest:
+#       image: nvcr.io/nvidia/ai-dynamo/frontend@sha256:<digest>

139-147: GPU Operator install flags are correct; add a note on operator version compatibility

Enabling DCGM exporter and ServiceMonitor is the recommended approach. Consider pinning the chart version and adding a note on minimum GPU Operator version tested with these flags to avoid user surprises.

-helm install gpu-operator nvidia/gpu-operator \
+helm install gpu-operator nvidia/gpu-operator \
   --namespace gpu-operator \
   --create-namespace \
   --set dcgmExporter.enabled=true \
   --set dcgmExporter.serviceMonitor.enabled=true
+# Optionally pin a version:
+#   --version 24.6.2

229-242: Quote retentionSize to match expected string type

Some configurations expect retentionSize as a string; quoting also avoids YAML parsers misreading the value.

     retention: 30d
-    retentionSize: 50GB
+    retentionSize: "50GB"

248-262: Prefer ServiceMonitor interval over static scrape jobs with Prometheus Operator

With kube-prometheus-stack, Prometheus is managed by the Operator; static scrape configs in prometheus.yml are typically supplied via additionalScrapeConfigs and can confuse readers. Since you already use ServiceMonitors, show interval override there instead.

Example (adjust interval on a ServiceMonitor endpoint):

endpoints:
- interval: 5s
  port: metrics
  path: /metrics

Consider removing the static job examples or moving them to an “Advanced (additionalScrapeConfigs)” section with a Secret-based example.


286-295: Make kubectl exec resilient to chart name/version changes

The Prometheus pod name varies. Suggest discovering the pod name dynamically or using a label selector.

# Discover the Prometheus pod name dynamically
POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n monitoring "$POD" -- cat /etc/prometheus/config_out/prometheus.env.yaml
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1945f59 and 359b4ae.

📒 Files selected for processing (12)
  • components/README.md (1 hunks)
  • components/backends/sglang/README.md (1 hunks)
  • components/backends/trtllm/README.md (1 hunks)
  • components/backends/vllm/README.md (1 hunks)
  • docs/architecture/architecture.md (1 hunks)
  • docs/architecture/sla_planner.md (2 hunks)
  • docs/components/router/README.md (1 hunks)
  • docs/guides/dynamo_deploy/grove.md (1 hunks)
  • docs/guides/dynamo_deploy/k8s_metrics.md (1 hunks)
  • docs/guides/dynamo_deploy/minikube.md (3 hunks)
  • docs/guides/dynamo_deploy/model_caching_with_fluid.md (1 hunks)
  • examples/runtime/hello_world/README.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/guides/dynamo_deploy/grove.md

[grammar] ~7-~7: There might be a mistake here.
Context: ...s' default scheduling capabilities with: - Gang scheduling: Ensures all pods in a...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...olicies ## Key Features ### PodGangSet PodGangSet is Grove's primary scheduling...

(QB_NEW_EN)


[grammar] ~45-~45: There might be a mistake here.
Context: ...lues: ["gpu-compute"] ``` ### PodClique PodClique provides fine-grained control ...

(QB_NEW_EN)


[grammar] ~70-~70: There might be a mistake here.
Context: ...ck ``` ## Deployment ### Prerequisites - Kubernetes cluster with GPU nodes - NVID...

(QB_NEW_EN)


[grammar] ~71-~71: There might be a mistake here.
Context: ...ites - Kubernetes cluster with GPU nodes - NVIDIA GPU Operator installed - Node top...

(QB_NEW_EN)


[grammar] ~72-~72: There might be a mistake here.
Context: ...PU nodes - NVIDIA GPU Operator installed - Node topology labels configured ### Ins...

(QB_NEW_EN)


[grammar] ~138-~138: There might be a mistake here.
Context: ...## Best Practices ### Resource Planning - Use minAvailable: replicas for strict ...

(QB_NEW_EN)


[grammar] ~143-~143: There might be a mistake here.
Context: ...i-node workloads ### Topology Awareness - Label nodes with rack, zone, and network...

(QB_NEW_EN)


[grammar] ~148-~148: There might be a mistake here.
Context: ...timize for your workload ### Monitoring Grove provides metrics for scheduling de...

(QB_NEW_EN)


[grammar] ~161-~161: There might be a mistake here.
Context: ...n Issues Pods stuck in Pending state: - Check if sufficient resources are availa...

(QB_NEW_EN)


[grammar] ~166-~166: There might be a mistake here.
Context: ...heduler **Gang scheduling not working:** - EnsureschedulerName: grove-scheduler` ...

(QB_NEW_EN)

docs/guides/dynamo_deploy/k8s_metrics.md

[grammar] ~7-~7: There might be a mistake here.
Context: ...rm provides extensive telemetry through: - Application metrics: Request latency, ...

(QB_NEW_EN)


[grammar] ~8-~8: There might be a mistake here.
Context: ...Request latency, throughput, error rates - Infrastructure metrics: GPU utilizatio...

(QB_NEW_EN)


[grammar] ~9-~9: There might be a mistake here.
Context: ...U utilization, memory usage, network I/O - Kubernetes metrics: Pod status, resour...

(QB_NEW_EN)


[grammar] ~10-~10: There might be a mistake here.
Context: ...us, resource consumption, cluster health - Custom Dynamo metrics: KV cache effici...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ... Kubernetes cluster with Dynamo deployed - Cluster admin permissions for installing...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...missions for installing monitoring stack - Sufficient storage for metrics retention...

(QB_NEW_EN)


[grammar] ~156-~156: There might be a mistake here.
Context: ...d the JSON file or paste the content 4. Configure data source as your Prometheus instance...

(QB_NEW_EN)


[grammar] ~268-~268: There might be a mistake here.
Context: ... Common Issues Metrics not appearing: - Verify ServiceMonitor is in correct name...

(QB_NEW_EN)


[grammar] ~269-~269: There might be a mistake here.
Context: ...appearing:** - Verify ServiceMonitor is in correct namespace - Check Prometheus ta...

(QB_NEW_EN)


[grammar] ~270-~270: There might be a mistake here.
Context: ...t namespace - Check Prometheus targets: http://prometheus:9090/targets - Ensure firewall/network policies allow s...

(QB_NEW_EN)


[grammar] ~273-~273: There might be a mistake here.
Context: ...w scraping High cardinality warnings: - Review metric labels and limit dynamic l...

(QB_NEW_EN)


[grammar] ~278-~278: There might be a mistake here.
Context: ...ary labels Grafana connection issues: - Verify Prometheus data source URL - Chec...

(QB_NEW_EN)

docs/guides/dynamo_deploy/minikube.md

[grammar] ~45-~45: There might be a mistake here.
Context: ...r-rancher ``` ## 4. Verify Installation Let's make sure everything is working co...

(QB_NEW_EN)

🪛 GitHub Actions: Docs link check
docs/guides/dynamo_deploy/grove.md

[error] 1-1: Network error: error sending request for url (https://grove.dynamo.ai/docs) Maybe a certificate error?

docs/architecture/sla_planner.md

[warning] 1-1: lychee: Warning - Error creating request: InvalidPathToUri('/docs/architecture/sla_planner.md')

components/backends/vllm/README.md

[warning] 1-1: lychee: Warning - Error creating request: InvalidPathToUri('/components/backends/vllm/README.md')


[warning] 1-1: lychee: Warning - Error creating request: InvalidPathToUri('/components/backends/vllm/README.md')


[warning] 1-1: lychee: Warning - Error creating request: InvalidPathToUri('/components/backends/vllm/README.md')


[warning] 1-1: lychee: Warning - Error creating request: InvalidPathToUri('/components/backends/vllm/README.md')

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2518/merge) by athreesh.
docs/guides/dynamo_deploy/k8s_metrics.md

[error] 199-208: Trailing whitespace detected by pre-commit hook 'trailing-whitespace'. The hook modified the file. Re-run pre-commit to re-commit changes. (Command: pre-commit run --show-diff-on-failure --color=always --all-files)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (8)
docs/architecture/architecture.md (1)

51-59: LGTM on header rename

“Key benefits” reads cleaner and aligns with breadcrumb brevity. No structural issues spotted.

docs/architecture/sla_planner.md (1)

20-28: LGTM on “Architecture” → “Design”.

The rename aligns with the PR objective to reduce redundant breadcrumb headings.

examples/runtime/hello_world/README.md (1)

120-122: LGTM on the cleanup command.

The explicit resource delete is clear and user-friendly.

docs/guides/dynamo_deploy/minikube.md (2)

20-27: LGTM on intro and section flattening.

The tone update and heading level adjustments improve clarity without altering the flow.


32-43: Please verify Minikube addon names locally

The sandbox doesn’t have Minikube installed, so you’ll need to run the following on your machine and confirm which addons are available before updating the docs:

minikube addons list

Check and update these entries in docs/guides/dynamo_deploy/minikube.md (lines 32–43):

 # Enable required addons
-minikube addons enable istio-provisioner
-minikube addons enable istio
-minikube addons enable storage-provisioner-rancher
+minikube addons enable istio-provisioner   # verify availability
+minikube addons enable istio               # verify availability
+minikube addons enable storage-provisioner # replace if needed
docs/guides/dynamo_deploy/grove.md (1)

151-155: Verify metrics service name/namespace match the install manifests

Port-forward references svc/grove-scheduler-metrics in namespace grove-system. Please confirm the installed Service name and namespace from grove-scheduler.yaml match this path to avoid 404s during copy/paste.

I can add a quick kubectl get svc -n grove-system | grep metrics step to help users locate the exact service name if desired.

docs/guides/dynamo_deploy/k8s_metrics.md (2)

41-76: ServiceMonitor scope and intervals look good

Label selectors and namespaceSelector are clear, and endpoint intervals are specified. This should work out of the box with the kube-prometheus-stack defaults.


293-295: Verify DCGM exporter label selector

Some distributions label the DCGM exporter as app.kubernetes.io/name=dcgm-exporter rather than app=nvidia-dcgm-exporter. Suggest checking the actual labels created by the chart or using a broader selector.

kubectl get pods -n gpu-operator -l 'app.kubernetes.io/name=dcgm-exporter' -o name
# Fallback:
kubectl get pods -n gpu-operator -l 'app=nvidia-dcgm-exporter' -o name

Updated to the more accurate and product-specific versions from fixes-docs:

**Grove Documentation:**
- Changed from generic Kubernetes scheduling guide to AI-focused deployment guide
- Now covers disaggregated inference orchestration specifically
- Includes integration details with NVIDIA Dynamo
- Mentions specific models like DeepSeek-R1 and Llama-4-Maverick
- Added reference to Grove installation guide
- Covers PodGangSet, PodClique, and PodCliqueScalingGroup for AI workloads

**K8s Metrics Documentation:**
- Changed from generic Prometheus/Grafana setup to Dynamo operator-integrated approach
- Uses PodMonitor instead of ServiceMonitor for better automation
- Leverages Dynamo operator's automatic labeling system
- Includes hands-on workflow starting with DynamoGraphDeployment
- Component-specific monitoring for frontend/worker/planner

Both versions are now much more technically accurate and practically useful for Dynamo users.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Copy link
Contributor

@rmccorm4 rmccorm4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM - but left a few comments on some header renames, grove reviewer, and make sure to fix the lychee link checks

@rmccorm4 rmccorm4 enabled auto-merge (squash) August 19, 2025 17:49
@rmccorm4 rmccorm4 merged commit 129a244 into main Aug 19, 2025
12 checks passed
@rmccorm4 rmccorm4 deleted the docs-consolidation-clean branch August 19, 2025 18:25
hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants