
Commit 359b4ae

athreesh and claude committed
docs: Consolidate documentation and fix redundant headings
This commit consolidates and improves the documentation structure based on tech writer feedback:

**New Documentation:**
- Added Grove advanced Kubernetes scheduling guide
- Added comprehensive K8s metrics setup guide with Prometheus/Grafana

**Heading Fixes:**
- Fixed headings that would appear redundant in Sphinx breadcrumbs
- Changed "Architecture" → "Design" in SLA planner docs
- Changed "Core Components" → "Core Services" to avoid repetition
- Removed duplicate H1 headings in component docs

**Quick Start Disambiguation:**
- "Quick Start" → "SGLang Quick Start" in SGLang README
- "Quick Start" → "TensorRT-LLM Quick Start" in TensorRT-LLM README
- "Quick Start" → "vLLM Quick Start" in vLLM README
- "Quick Start" → "KV Router Quick Start" in router docs
- "Quick Start" → "Pre-deployment Steps" in Fluid caching guide

**Platform Naming:**
- Updated references to use consistent "Dynamo Kubernetes Platform" naming

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
1 parent 1945f59 commit 359b4ae

File tree: 12 files changed (+484, -18 lines)

components/README.md (1 addition, 1 deletion)

@@ -29,7 +29,7 @@ Dynamo supports multiple inference engines (with a focus on SGLang, vLLM, and Te

 Each engine provides launch scripts for different deployment patterns in their respective `/launch` & `/deploy` directories.

-## Core Components
+## Core Services

 ### [Backends](backends/)

components/backends/sglang/README.md (1 addition, 1 deletion)

@@ -50,7 +50,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | **GB200 Support** || |


-## Quick Start
+## SGLang Quick Start

 Below we provide a guide that lets you run all of our common deployment patterns on a single node.

components/backends/trtllm/README.md (1 addition, 1 deletion)

@@ -66,7 +66,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | **DP Rank Routing**|| |
 | **GB200 Support** || |

-## Quick Start
+## TensorRT-LLM Quick Start

 Below we provide a guide that lets you run all of the common deployment patterns on a single node.

components/backends/vllm/README.md (1 addition, 1 deletion)

@@ -51,7 +51,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | **DP Rank Routing**|| Supported via external control of DP ranks |
 | **GB200 Support** | 🚧 | Container functional on main |

-## Quick Start
+## vLLM Quick Start

 Below we provide a guide that lets you run all of the common deployment patterns on a single node.

docs/architecture/architecture.md (1 addition, 1 deletion)

@@ -48,7 +48,7 @@ There are multi-faceted challenges:

 To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.

-## High level architecture and key benefits
+## Key benefits

 The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:

docs/architecture/sla_planner.md (2 additions, 2 deletions)

@@ -17,7 +17,7 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
 * **Performance interpolation**: Leverages profiling results from pre-deployment profiling for accurate scaling decisions
 * **Correction factors**: Adapts to real-world performance deviations from profiled data

-## Architecture
+## Design

 The SLA planner consists of several key components:

@@ -108,7 +108,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill

 For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../guides/dynamo_deploy/sla_planner_deployment.md).

-**Quick Start:**
+**To deploy SLA Planner:**
 ```bash
 cd components/backends/vllm/deploy
 kubectl apply -f disagg_planner.yaml -n ${NAMESPACE}

docs/components/router/README.md (1 addition, 1 deletion)

@@ -9,7 +9,7 @@ SPDX-License-Identifier: Apache-2.0

 The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.

-## Quick Start
+## KV Router Quick Start

 To launch the Dynamo frontend with the KV Router:

docs/guides/dynamo_deploy/grove.md (171 additions, 0 deletions)

@@ -0,0 +1,171 @@
# Grove: Advanced Kubernetes Scheduling

Grove is an advanced Kubernetes scheduler and batch workload manager built on top of the Dynamo Kubernetes Platform. It enables sophisticated scheduling policies for multi-node GPU workloads, with special support for large-scale LLM inference deployments.

## Overview

Grove extends Kubernetes' default scheduling capabilities with:

- **Gang scheduling**: Ensures all pods in a workload start together or not at all
- **Topology-aware placement**: Optimizes pod placement based on network topology
- **Resource-aware scheduling**: Makes intelligent decisions based on GPU memory, compute capacity, and network bandwidth
- **Priority-based queueing**: Manages workload priorities and preemption policies

## Key Features

### PodGangSet

PodGangSet is Grove's primary scheduling primitive; it groups related pods that must be scheduled together.

```yaml
apiVersion: grove.dynamo.ai/v1
kind: PodGangSet
metadata:
  name: llm-inference-gang
  namespace: default
spec:
  template:
    spec:
      containers:
      - name: worker
        image: dynamo/worker:latest
        resources:
          requests:
            nvidia.com/gpu: 1
  replicas: 8
  minAvailable: 8  # All pods must be schedulable
  scheduling:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values: ["gpu-compute"]
```

### PodClique

PodClique provides fine-grained control over pod co-location and anti-affinity rules within a gang.

```yaml
apiVersion: grove.dynamo.ai/v1
kind: PodClique
metadata:
  name: prefill-decode-clique
spec:
  selector:
    matchLabels:
      app: dynamo-worker
  topology:
    # Prefer pods to be co-located on the same rack
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            component: prefill
        topologyKey: topology.kubernetes.io/rack
```

## Deployment

### Prerequisites

- Kubernetes cluster with GPU nodes
- NVIDIA GPU Operator installed
- Node topology labels configured

### Install Grove Scheduler

```bash
# Install Grove CRDs and scheduler
kubectl apply -f https://github.com/ai-dynamo/grove/releases/latest/download/grove-crds.yaml
kubectl apply -f https://github.com/ai-dynamo/grove/releases/latest/download/grove-scheduler.yaml
```

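As a quick sanity check after installation, confirm the scheduler pods are running. This assumes the scheduler is deployed into the `grove-system` namespace, the same namespace used by the monitoring and troubleshooting commands later in this guide.

```bash
# Verify the Grove scheduler components came up
kubectl get pods -n grove-system
```
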
### Configure Node Topology

Label your nodes with topology information:

```bash
# Label nodes with rack information
kubectl label node gpu-node-01 topology.kubernetes.io/rack=rack-1
kubectl label node gpu-node-02 topology.kubernetes.io/rack=rack-1
kubectl label node gpu-node-03 topology.kubernetes.io/rack=rack-2

# Label nodes with GPU types
kubectl label node gpu-node-01 accelerator=h100
kubectl label node gpu-node-02 accelerator=h100
kubectl label node gpu-node-03 accelerator=a100
```

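To confirm the labels were applied, you can list them as extra columns; the label keys below are simply the ones used in the commands above.

```bash
# Show the rack and accelerator labels as columns in the node listing
kubectl get nodes -L topology.kubernetes.io/rack -L accelerator
```
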
## Integration with Dynamo
100+
101+
Grove integrates seamlessly with Dynamo's disaggregated serving architecture:
102+
103+
### Multi-Node Prefill/Decode Scheduling
104+
105+
```yaml
106+
apiVersion: grove.dynamo.ai/v1
107+
kind: PodGangSet
108+
metadata:
109+
name: dynamo-multinode-serving
110+
spec:
111+
template:
112+
metadata:
113+
labels:
114+
app: dynamo-worker
115+
spec:
116+
schedulerName: grove-scheduler
117+
containers:
118+
- name: dynamo-worker
119+
image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:latest
120+
env:
121+
- name: WORKER_TYPE
122+
value: "prefill" # or "decode"
123+
replicas: 16
124+
minAvailable: 16
125+
scheduling:
126+
# Ensure all workers can communicate efficiently
127+
nodeAffinity:
128+
requiredDuringSchedulingIgnoredDuringExecution:
129+
nodeSelectorTerms:
130+
- matchExpressions:
131+
- key: network-tier
132+
operator: In
133+
values: ["high-bandwidth"]
134+
```
## Best Practices

### Resource Planning

- Use `minAvailable: replicas` for strict gang scheduling, as sketched below
- Set appropriate resource requests and limits
- Consider network bandwidth requirements for multi-node workloads

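The fragment below is a minimal sketch of the first two points, reusing the PodGangSet shape from the examples above; the replica count and memory values are illustrative assumptions, not recommendations.

```yaml
# Illustrative sketch: strict gang scheduling with explicit requests and limits.
# Field layout follows the PodGangSet examples above; the values are assumptions.
apiVersion: grove.dynamo.ai/v1
kind: PodGangSet
metadata:
  name: strict-gang-example
spec:
  template:
    spec:
      containers:
      - name: worker
        image: dynamo/worker:latest
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "64Gi"
          limits:
            nvidia.com/gpu: 1
            memory: "64Gi"
  replicas: 4
  minAvailable: 4  # equal to replicas: schedule all four pods together or not at all
```
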
### Topology Awareness

- Label nodes with rack, zone, and network topology information
- Use PodClique for fine-grained placement control
- Test different affinity rules to optimize for your workload

### Monitoring

Grove provides metrics for scheduling decisions:

```bash
# View Grove scheduler metrics
kubectl port-forward -n grove-system svc/grove-scheduler-metrics 8080:8080
curl localhost:8080/metrics | grep grove_
```

## Troubleshooting

### Common Issues

**Pods stuck in Pending state:**
- Check whether sufficient resources are available across the required nodes (see the commands below)
- Verify node labels match gang affinity requirements
- Review Grove scheduler logs: `kubectl logs -n grove-system deployment/grove-scheduler`

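These are ordinary kubectl diagnostics rather than Grove-specific commands; `<pending-pod-name>` is a placeholder for your pod.

```bash
# Why is the pod unschedulable? Scheduler events appear at the bottom of the output.
kubectl describe pod <pending-pod-name>

# Recent cluster events, including FailedScheduling reasons
kubectl get events --sort-by=.lastTimestamp

# Per-node resource headroom (GPU, CPU, memory)
kubectl describe nodes | grep -A 8 "Allocated resources"
```
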
**Gang scheduling not working:**
- Ensure `schedulerName: grove-scheduler` is set in pod specs (see the fragment below)
- Verify the PodGangSet controller is running
- Check for resource conflicts with other scheduled workloads

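For reference, this is the relevant fragment of the pod template, abridged from the multi-node example earlier in this guide:

```yaml
spec:
  template:
    spec:
      schedulerName: grove-scheduler  # without this, pods fall back to the default scheduler
      containers:
      - name: dynamo-worker
        image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:latest
```
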
For more detailed troubleshooting, see the [Grove Documentation](https://grove.dynamo.ai/docs).
