1 change: 1 addition & 0 deletions README.md
@@ -23,6 +23,7 @@ make setup ARCH=aarch64 # or ARCH=x86_64
- [Parameter Sweeps](docs/sweeps.md) - Grid searches
- [Profiling](docs/profiling.md) - Torch/nsys profiling
- [Analyzing Results](docs/analyzing.md) - Dashboard and visualization
- [Accuracy Benchmarks](docs/accuracy.md) - Running accuracy benchmarks

## Commands

14 changes: 8 additions & 6 deletions docs/accuracy.md
@@ -27,7 +27,7 @@ For MMLU dataset, the benchmark section in yaml file can be modified in the foll
benchmark:
type: "mmlu"
num_examples: 200 # Number of examples to run
max_tokens: 2048 # Max number of output tokens
max_tokens: 8192 # Max number of output tokens.
repeat: 8 # Number of repetitions
num_threads: 512 # Number of parallel threads for running benchmark
```
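These YAML fields map directly onto `sglang.test.run_eval` flags (the same mapping this PR's `scripts/benchmarks/mmlu/bench.sh` performs); a minimal sketch of that mapping, with the base URL being an illustrative assumption:

```python
# Benchmark section as parsed from the recipe above
benchmark = {"type": "mmlu", "num_examples": 200, "max_tokens": 8192,
             "repeat": 8, "num_threads": 512}

# Build the run_eval invocation; the base URL here is assumed for illustration
cmd = ["python3", "-m", "sglang.test.run_eval",
       "--base-url", "http://localhost:8000",
       "--eval-name", benchmark["type"]]
for key in ("num_examples", "max_tokens", "repeat", "num_threads"):
    cmd += [f"--{key.replace('_', '-')}", str(benchmark[key])]

print(" ".join(cmd))
```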
@@ -40,18 +40,20 @@ srtctl apply -f config.yaml
After the benchmark finishes, `benchmark.out` will contain the accuracy results:
```
====================
Repeat: 8, mean: 0.812
Scores: ['0.790', '0.820', '0.800', '0.820', '0.820', '0.790', '0.820', '0.840']
Repeat: 8, mean: 0.895
Scores: ['0.905', '0.895', '0.900', '0.880', '0.905', '0.890', '0.890', '0.895']
====================
Writing report to /tmp/mmlu_deepseek-ai_DeepSeek-R1.html
{'other': np.float64(0.9), 'other:std': np.float64(0.30000000000000004), 'score:std': np.float64(0.36660605559646725), 'stem': np.float64(0.8095238095238095), 'stem:std': np.float64(0.392676726249301), 'humanities': np.float64(0.7428571428571429), 'humanities:std': np.float64(0.4370588154508102), 'social_sciences': np.float64(0.9583333333333334), 'social_sciences:std': np.float64(0.19982631347136331), 'score': np.float64(0.84)}
{'other': np.float64(0.9361702127659575), 'other:std': np.float64(0.24444947432076722), 'score:std': np.float64(0.3065534211193866), 'stem': np.float64(0.9285714285714286), 'stem:std': np.float64(0.25753937681885636), 'humanities': np.float64(0.8064516129032258), 'humanities:std': np.float64(0.3950789907714804), 'social_sciences': np.float64(0.9387755102040817), 'social_sciences:std': np.float64(0.23974163519328023), 'score': np.float64(0.895)}
Writing results to /tmp/mmlu_deepseek-ai_DeepSeek-R1.json
Total latency: 465.618 s
Score: 0.840
Total latency: 754.457 s
Score: 0.895
Results saved to: /logs/accuracy/mmlu_deepseek-ai_DeepSeek-R1.json
MMLU evaluation complete
```

**Note: `max-tokens` must be large enough to reach the expected accuracy. For the deepseek-r1-fp4 model, `max-tokens=8192` reaches the expected accuracy of 0.895, while `max-tokens=2048` scores only 0.81.**
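The `mean` line in the output is simply the average of the eight per-repeat scores; a quick sanity check (note that the reported `score:std` is the per-question standard deviation, not this per-repeat spread):

```python
import statistics

# Per-repeat MMLU scores as printed in benchmark.out above
scores = [0.905, 0.895, 0.900, 0.880, 0.905, 0.890, 0.890, 0.895]

mean = sum(scores) / len(scores)
print(f"Repeat: {len(scores)}, mean: {mean:.3f}")            # mean: 0.895
print(f"spread across repeats: {statistics.pstdev(scores):.4f}")
```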


## GPQA
For the GPQA dataset, the benchmark section in the yaml file can be modified as follows:
2 changes: 1 addition & 1 deletion docs/profiling.md
@@ -66,7 +66,7 @@ profiling:
profiling:
type: "torch" # Required: "none", "torch", or "nsys"

# Traffic generator parameters (required when profiling is enabled)
# Traffic generator parameters (required when profiling is enabled)
isl: 1024 # Input sequence length
osl: 128 # Output sequence length
concurrency: 24 # Batch size for profiling workload
61 changes: 61 additions & 0 deletions docs/sglang-router.md
@@ -177,6 +177,67 @@ The default bootstrap port is `30001` (matching most recipes). If you use a diff

Workers listen on port `30000` by default. This is standard sglang behavior and doesn't need configuration.

## Debugging with SGLang Source Code

When using sglang-router mode, you can mount and install sglang from source for debugging purposes. This is useful when you need to test local changes or debug issues in sglang itself.

### Configuration

Add `sglang_src_dir` to your recipe's `backend` section:

```yaml
backend:
use_sglang_router: true
sglang_src_dir: "/path/to/your/local/sglang"
```

### How It Works

1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
2. Before launching workers, the script runs: `pip install -e . --no-deps`
3. Workers use your local sglang code instead of the container's pre-installed version

Comment on lines +194 to +199
⚠️ Potential issue | 🟡 Minor

Document the automatic 'python' subdirectory resolution.

The documentation states that the local directory is mounted to /ext-sglang-src/, but the implementation in scripts/worker_setup/command.py (line 163) automatically appends /python to this path before installation. Users should be informed that sglang_src_dir must point to the SGLang repository root (not the python subdirectory), as the code will automatically resolve to the python/ subdirectory within it.

📝 Suggested documentation update
 ### How It Works
 
-1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
-2. Before launching workers, the script runs: `pip install -e . --no-deps`
+1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
+2. The installation automatically uses the `python/` subdirectory within the mounted source
+3. Before launching workers, the script runs: `pip install -e . --no-deps` from `/ext-sglang-src/python/`
-3. Workers use your local sglang code instead of the container's pre-installed version
+4. Workers use your local sglang code instead of the container's pre-installed version

### Behavior

**With `sglang_src_dir` set:**
- Mounts your local sglang source to `/ext-sglang-src/`
- Installs it in editable mode on all prefill/decode/aggregated workers
- Your local changes take effect immediately

**Without `sglang_src_dir` (or empty):**
- No mount is added
- Installation step is skipped gracefully
- Uses the container's pre-installed sglang

### Example

```yaml
name: "debug-sglang-router"

model:
path: "deepseek-r1-fp4"
container: "0.5.5.post2"

backend:
use_sglang_router: true
sglang_src_dir: "/home/username/projects/sglang" # Your local sglang checkout

sglang_config:
# ... your config
```

Then apply:
```bash
srtctl apply -f recipies/debug-sglang-router.yaml
```

### Notes

- Only works with `use_sglang_router: true` (disaggregation mode)
- The source directory must exist on the host running srtctl
- Dependencies are NOT reinstalled (uses `--no-deps`), so the container must have compatible dependencies already installed
- Useful for iterative debugging without rebuilding containers

## Complete Example

Here's a full recipe using sglang router:
123 changes: 123 additions & 0 deletions examples/fp4-disagg-nsys-profiling.yaml
@@ -0,0 +1,123 @@
name: "gb200-fp4-1p2d"

model:
path: "dsfp4"
container: "0.5.5.post2"
precision: "fp4"

resources:
gpu_type: "gb200"
prefill_nodes: 1
decode_nodes: 2
prefill_workers: 1
decode_workers: 2
gpus_per_node: 4

backend:
use_sglang_router: "true"

prefill_environment:
SGLANG_LOG_FORWARD_ITERS: "1"
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
PYTHONUNBUFFERED: "1"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
#SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
#SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
SGLANG_ENABLE_JIT_DEEPGEMM: "false"
SGLANG_ENABLE_FLASHINFER_GEMM: "true" #instead of SGLANG_FLASHINFER_FP4_GEMM_BACKEND

decode_environment:
SGLANG_LOG_FORWARD_ITERS: "1"
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
PYTHONUNBUFFERED: "1"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
# SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
# SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
SGLANG_ENABLE_JIT_DEEPGEMM: "false"
SGLANG_ENABLE_FLASHINFER_GEMM: "true" #instead of SGLANG_FLASHINFER_FP4_GEMM_BACKEND

sglang_config:
prefill:
disaggregation-mode: "prefill"
served-model-name: "deepseek-ai/DeepSeek-R1"
model-path: "/model/"
trust-remote-code: true
disable-radix-cache: true
kv-cache-dtype: "fp8_e4m3"
attention-backend: "trtllm_mla"
quantization: "modelopt_fp4"
moe-runner-backend: "flashinfer_trtllm"
stream-interval: 10
watchdog-timeout: 1000000
context-length: 2200
mem-fraction-static: 0.95
max-total-tokens: 8192
chunked-prefill-size: 8192
cuda-graph-max-bs: 256
max-running-requests: 512
scheduler-recv-interval: 10
enable-symm-mem: true
moe-dense-tp-size: 1
load-balance-method: "round_robin"
disaggregation-bootstrap-port: 30001
load-format: "dummy"
data-parallel-size: 1
tensor-parallel-size: 4
expert-parallel-size: 1

decode:
disaggregation-mode: "decode"
served-model-name: "deepseek-ai/DeepSeek-R1"
model-path: "/model/"
prefill-round-robin-balance: true
trust-remote-code: true
disable-radix-cache: true
kv-cache-dtype: "fp8_e4m3"
attention-backend: "trtllm_mla"
quantization: "modelopt_fp4"
moe-runner-backend: "flashinfer_trtllm"
disaggregation-bootstrap-port: 30001
stream-interval: 10
watchdog-timeout: 1000000
context-length: 2200
mem-fraction-static: 0.95
load-format: "dummy"
chunked-prefill-size: 8192
cuda-graph-max-bs: 256
scheduler-recv-interval: 10
enable-symm-mem: true
moe-dense-tp-size: 1
tensor-parallel-size: 4
expert-parallel-size: 1

profiling:
type: "nsys"
isl: 1024
osl: 1024
concurrency: 256
prefill:
start_step: 60
stop_step: 70
decode:
start_step: 700
stop_step: 730
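The nsys capture above is limited to a per-role step window; a small sketch of the window check, assuming an inclusive start bound and exclusive stop bound (the exact boundary semantics are an assumption):

```python
# Profiling windows from the recipe above
profiling = {
    "prefill": {"start_step": 60, "stop_step": 70},
    "decode": {"start_step": 700, "stop_step": 730},
}

def in_profile_window(role: str, step: int) -> bool:
    # True while the profiler should be capturing for this worker role
    w = profiling[role]
    return w["start_step"] <= step < w["stop_step"]

print(in_profile_window("decode", 710))   # inside 700-730
print(in_profile_window("prefill", 70))   # stop bound assumed exclusive
```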
64 changes: 64 additions & 0 deletions scripts/benchmarks/mmlu/bench.sh
@@ -0,0 +1,64 @@
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# MMLU evaluation script using sglang.test.run_eval

head_node="localhost"
head_port=8000
model_name="deepseek-ai/DeepSeek-R1" # Default model name

# Parse arguments from SLURM job
n_prefill=$1
n_decode=$2
prefill_gpus=$3
decode_gpus=$4
num_examples=${5:-200} # Default: 200
max_tokens=${6:-8192} # Default: 8192
repeat=${7:-8} # Default: 8
num_threads=${8:-512} # Default: 512

echo "MMLU Benchmark Config: num_examples=${num_examples}; max_tokens=${max_tokens}; repeat=${repeat}; num_threads=${num_threads}"

# Source utilities for wait_for_model
source /scripts/utils/benchmark_utils.sh

wait_for_model_timeout=1500 # 25 minutes
wait_for_model_check_interval=5 # check interval -> 5s
wait_for_model_report_interval=60 # wait_for_model report interval -> 60s

wait_for_model $head_node $head_port $n_prefill $n_decode $wait_for_model_check_interval $wait_for_model_timeout $wait_for_model_report_interval

# Create results directory
result_dir="/logs/accuracy"
mkdir -p $result_dir

echo "Running MMLU evaluation..."

# Set OPENAI_API_KEY if not set
if [ -z "$OPENAI_API_KEY" ]; then
export OPENAI_API_KEY="EMPTY"
fi

# Run the evaluation
python3 -m sglang.test.run_eval \
--base-url "http://${head_node}:${head_port}" \
--model ${model_name} \
--eval-name mmlu \
--num-examples ${num_examples} \
--max-tokens ${max_tokens} \
--repeat ${repeat} \
--num-threads ${num_threads}

# Copy the result file from /tmp to our logs directory
# The result file is named mmlu_{model_name}.json
result_file=$(ls -t /tmp/mmlu_*.json 2>/dev/null | head -n1)

if [ -f "$result_file" ]; then
cp "$result_file" "$result_dir/"
echo "Results saved to: $result_dir/$(basename $result_file)"
else
echo "Warning: Could not find result file in /tmp"
fi

echo "MMLU evaluation complete"
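The positional defaults above rely on bash's `${N:-default}` expansion; a standalone demo with only the four required SLURM arguments supplied:

```shell
# Simulate calling the script with just n_prefill n_decode prefill_gpus decode_gpus
set -- 1 2 4 4
num_examples=${5:-200}
max_tokens=${6:-8192}
repeat=${7:-8}
num_threads=${8:-512}
echo "num_examples=${num_examples} max_tokens=${max_tokens} repeat=${repeat} num_threads=${num_threads}"
```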