13 changes: 1 addition & 12 deletions docs/README.md
@@ -14,18 +14,6 @@ Running large language models across multiple GPUs and nodes requires orchestrat
- **Parameter sweeps** - Run grid searches across configurations with a single command
- **Profiling support** - Built-in torch/nsys profiling modes

## Architecture Overview

`srtctl` orchestrates distributed inference using SGLang workers in either **disaggregated** or **aggregated** mode.

**Disaggregated Mode** separates prefill and decode into specialized workers:

- Prefill workers handle the initial prompt processing
- Decode workers handle token generation
- Frontend distribution via nginx load balancer (default) or sglang_router

**Aggregated Mode** runs combined prefill+decode on each worker; it is simpler to operate but potentially less efficient for high-throughput scenarios.
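
For intuition only, a disaggregated deployment might be described along these lines. The key names below are hypothetical sketches, not the actual srtctl schema; see the job config examples in [Installation](installation.md) for real options:

```yaml
# Hypothetical sketch of a disaggregated layout (illustrative key names only)
mode: disaggregated        # vs. aggregated
prefill:
  workers: 2               # nodes that process incoming prompts
decode:
  workers: 4               # nodes that stream generated tokens
frontend: nginx            # default load balancer; sglang_router is the alternative
```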

## How It Works

When you run `srtctl apply -f config.yaml`, the tool:
@@ -54,3 +42,4 @@ Once allocated, workers launch inside containers, discover each other through ET
- [Parameter Sweeps](sweeps.md) - Run grid searches across configurations
- [Profiling](profiling.md) - Performance analysis with torch/nsys
- [Analyzing Results](analyzing.md) - Dashboard and visualization
- [SGLang Router](sglang-router.md) - Alternative to Dynamo for PD disaggregation
14 changes: 8 additions & 6 deletions docs/SUMMARY.md
@@ -1,8 +1,10 @@
# Table of contents

* [Introduction](README.md)
* [Installation](installation.md)
* [Profiling](profiling.md)
* [Monitoring](monitoring.md)
* [Parameter Sweeps](sweeps.md)
* [Analying](analyzing.md)
- [Introduction](README.md)
- [Installation](installation.md)
- [SGLang Router](sglang-router.md)
- [Profiling](profiling.md)
- [Monitoring](monitoring.md)
- [Parameter Sweeps](sweeps.md)
- [Analyzing](analyzing.md)
- [SLURM FAQ](slurm-faq.md)
19 changes: 1 addition & 18 deletions docs/analyzing.md
@@ -6,22 +6,5 @@ uv run streamlit run analysis/dashboard/app.py

```bash
# Another way to launch dashboard
make dashboard
```
Opens interactive dashboard at http://localhost:8501


## Features

### 📊 Interactive Dashboard

- **Pareto Analysis** - TPS/GPU vs TPS/User tradeoffs
- **Latency Breakdown** - TTFT, TPOT, ITL across concurrency levels
- **Node Metrics** - Runtime metrics from prefill/decode nodes
- **Config Comparison** - Side-by-side configuration diffs
- **Run Comparison** - Performance deltas between runs

### 🚀 SLURM Job Submission

- Disaggregated (prefill/decode) or aggregated mode
- Multiple frontends with nginx load balancing (default)
- Automated benchmarking with sa-bench
- Job metadata tracking
Opens interactive dashboard at http://localhost:8501
133 changes: 34 additions & 99 deletions docs/installation.md
@@ -18,6 +18,8 @@ pip install -e .

## Gather your cluster user and target partition

These commands may not work on every cluster. If they fail, check your site documentation or ask an AI assistant for the equivalent commands on your cluster.

```bash
# user
sacctmgr -nP show assoc where user=$(whoami) format=account
```

@@ -27,6 +29,8 @@ sinfo

## Run Setup

If you are deploying onto Grace-based systems (GH200, GB200, etc.), use the `aarch64` architecture; otherwise use `x86_64`.

```bash
make setup ARCH=aarch64 # or ARCH=x86_64
```
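
If you are unsure which architecture your compute nodes use, one way to check is to run `uname -m` on a compute node (the partition name below is illustrative):

```bash
# Prints aarch64 on Grace-based nodes, x86_64 otherwise
srun --partition=batch --nodes=1 uname -m
```
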
@@ -42,8 +46,6 @@ The setup will:
3. Create `srtslurm.yaml` with your settings
4. Auto-detect and set `srtctl_root` path

Dynamo 0.7.0 is now available on PyPI and will be installed automatically from pip when workers start.

## Configure srtslurm.yaml

After setup, edit `srtslurm.yaml` to add model paths, containers, and cluster-specific settings:
@@ -56,7 +58,6 @@ The `model_paths` section maps short aliases to full filesystem paths:
```yaml
model_paths:
  deepseek-r1: "/mnt/lustre/models/DeepSeek-R1"
  deepseek-r1-fp4: "/mnt/lustre/models/deepseek-r1-0528-fp4-v2"
  llama-70b: "/mnt/lustre/models/Llama-3-70B"
```

Models must be accessible from all compute nodes (typically on a shared filesystem like Lustre or GPFS).
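
A quick way to confirm a model path is actually visible from the compute nodes is to list it through `srun` (partition and path below are illustrative):

```bash
# Should list config.json, tokenizer files, and weight shards if the path is mounted
srun --partition=batch --nodes=1 ls /mnt/lustre/models/DeepSeek-R1
```
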
@@ -67,15 +68,14 @@ The `containers` section maps version aliases to `.sqsh` container images:

```yaml
containers:
  latest: "/mnt/containers/lmsysorg+sglang+v0.5.5.sqsh"
  stable: "/mnt/containers/lmsysorg+sglang+v0.5.4.sqsh"
  container1: "/mnt/containers/lmsysorg+sglang+v0.5.5.sqsh"
  container2: "/mnt/containers/lmsysorg+sglang+v0.5.4.sqsh"
```

To create a container image from Docker:

```bash
enroot import docker://lmsysorg/sglang:v0.5.5
mv lmsysorg+sglang+v0.5.5.sqsh /mnt/containers/
```
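
After the import completes, move the resulting `.sqsh` into your shared containers directory if needed and register it under an alias in `srtslurm.yaml`; the path below is illustrative:

```yaml
containers:
  latest: "/mnt/containers/lmsysorg+sglang+v0.5.5.sqsh"
```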

### Cloud Sync (Optional)
@@ -91,42 +91,43 @@ cloud:

Then use `make sync-to-cloud` or `make sync-run RUN_ID=<run_id>`.
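
For example (the run ID below is a hypothetical placeholder; use the ID printed when your job was submitted):

```bash
# Push all local results to the configured bucket
make sync-to-cloud

# Push a single run's results
make sync-run RUN_ID=2024-01-15-deepseek-r1-sweep
```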

### Cluster Compatibility Settings

Some SLURM clusters don't support certain SBATCH directives. If you encounter errors during job submission, you may need to adjust these settings:

#### GPU Resource Specification

If you see this error when submitting jobs:

```
sbatch: error: Invalid generic resource (gres) specification
```

Your cluster doesn't support the `--gpus-per-node` directive. Disable it in `srtslurm.yaml`:

```yaml
use_gpus_per_node_directive: false
```

This will omit the `#SBATCH --gpus-per-node` directive from generated job scripts while keeping all other functionality intact.

#### Segment-Based Scheduling

If you see this error when submitting jobs:

```
sbatch: error: Invalid --segment specification
```

Your cluster doesn't support the `--segment` directive for topology-aware scheduling. Disable it in `srtslurm.yaml`:

```yaml
use_segment_sbatch_directive: false
```

The `--segment` directive ensures all allocated nodes are within the same network segment/switch for optimal interconnect performance between prefill and decode workers. If your cluster doesn't support it, SLURM will still allocate nodes but may scatter them across the cluster.

### Complete srtslurm.yaml Reference

Here's a complete example of all available options:

```yaml
# Default SLURM settings
default_account: "your-account"
default_partition: "batch"
default_time_limit: "4:00:00"

# Resource defaults
gpus_per_node: 4

# SLURM directive compatibility
use_gpus_per_node_directive: true # Set false if cluster doesn't support --gpus-per-node
use_segment_sbatch_directive: true # Set false if cluster doesn't support --segment

# Path to srtctl repo root (auto-set by make setup)
srtctl_root: "/path/to/srtctl"

# Model path aliases
model_paths:
  deepseek-r1: "/models/DeepSeek-R1"
  llama-70b: "/models/Llama-3-70B"

# Container aliases
containers:
  latest: "/containers/sglang-latest.sqsh"
  stable: "/containers/sglang-stable.sqsh"

# Cloud sync settings (optional)
cloud:
  endpoint_url: "https://s3.example.com"
  bucket: "benchmark-results"
  prefix: "my-team/"
```

## Create a Job Config

Create `configs/my-job.yaml`:
@@ -178,7 +179,7 @@ benchmark:

```yaml
  isl: 1024
  osl: 1024
  concurrencies: [256, 512]
  req_rate: "inf" # Request rate, use "inf" for max throughput
  req_rate: "inf"
```

### Backend Options
@@ -195,35 +196,6 @@ backend:

```yaml
  use_sglang_router: false # Default: false. Use sglang_router for load balancing
```
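
To try the sglang_router frontend instead of the default nginx load balancer, flip the flag shown above (other backend keys unchanged); see [SGLang Router](sglang-router.md) for details:

```yaml
backend:
  use_sglang_router: true
```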

## Profiling (torch / nsys)

You can enable profiling via a top-level `profiling` section in your job YAML:

```yaml
profiling:
  type: "torch" # one of: "none", "torch", "nsys"
  isl: 1024
  osl: 128
  concurrency: 24
  start_step: 0 # optional
  stop_step: 50 # optional

benchmark:
  type: "manual" # Required - profiling and benchmarking are mutually exclusive
```

See [Profiling](profiling.md) for detailed configuration options, constraints, and output file locations.

## Validate with Dry Run

Always validate before submitting:

```bash
srtctl dry-run -f configs/my-job.yaml
```

This validates your config, resolves aliases, generates all files, and saves them to `dry-runs/` without submitting to SLURM.
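
You can then inspect the generated artifacts before submitting; the exact directory layout may vary by version:

```bash
# List the most recent dry-run output
ls -lt dry-runs/ | head
```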

## Submit the Job

```bash
@@ -285,43 +257,6 @@ You can run custom initialization scripts on worker nodes before starting SGLan
srtctl apply -f configs/my-job.yaml --setup-script custom-setup.sh
```

The script will be executed on each worker node (prefill, decode, and aggregated) before installing Dynamo from PyPI and starting the SGLang workers. The script must be located in the `configs/` directory, which is mounted into containers at `/configs/`.
The script will be executed on each worker node (prefill, decode, or aggregated) before installing Dynamo from PyPI and starting the SGLang workers. The script must be located in the `configs/` directory, which is mounted into containers at `/configs/`.

**Note**: Setup scripts only run when you explicitly specify `--setup-script`. No default setup script will run if this flag is omitted.
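
A minimal `configs/custom-setup.sh` might look like the following; the contents are purely illustrative (export environment variables, install an extra package, etc.):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative pre-launch setup that runs on each worker node
export HF_HOME=/mnt/lustre/cache/huggingface   # hypothetical cache location
pip install --quiet some-extra-package         # hypothetical dependency
```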

## Complete srtslurm.yaml Reference

Here's a complete example of all available options:

```yaml
# Default SLURM settings
default_account: "your-account"
default_partition: "batch"
default_time_limit: "4:00:00"

# Resource defaults
gpus_per_node: 4

# SLURM directive compatibility
use_gpus_per_node_directive: true # Set false if cluster doesn't support --gpus-per-node
use_segment_sbatch_directive: true # Set false if cluster doesn't support --segment

# Path to srtctl repo root (auto-set by make setup)
srtctl_root: "/path/to/srtctl"

# Model path aliases
model_paths:
  deepseek-r1: "/models/DeepSeek-R1"
  llama-70b: "/models/Llama-3-70B"

# Container aliases
containers:
  latest: "/containers/sglang-latest.sqsh"
  stable: "/containers/sglang-stable.sqsh"

# Cloud sync settings (optional)
cloud:
  endpoint_url: "https://s3.example.com"
  bucket: "benchmark-results"
  prefix: "my-team/"
```
6 changes: 5 additions & 1 deletion docs/profiling.md
@@ -1,12 +1,16 @@
# Profiling

srtctl supports two profiling backends for performance analysis: **Torch Profiler** and **NVIDIA Nsight Systems (nsys)**. Profiling helps identify bottlenecks in prefill and decode operations.
srtctl supports two profiling backends for performance analysis: **Torch Profiler** and **NVIDIA Nsight Systems (nsys)**.

## Quick Start

Add a `profiling` section to your job YAML:

```yaml
# must set benchmark type to "manual"
benchmark:
  type: "manual"

profiling:
  type: "torch" # or "nsys"
  isl: 1024