1 change: 1 addition & 0 deletions README.md
@@ -23,6 +23,7 @@ make setup ARCH=aarch64 # or ARCH=x86_64
- [Parameter Sweeps](docs/sweeps.md) - Grid searches
- [Profiling](docs/profiling.md) - Torch/nsys profiling
- [Analyzing Results](docs/analyzing.md) - Dashboard and visualization
- [Accuracy Benchmarks](docs/accuracy.md) - Running accuracy benchmarks

## Commands

14 changes: 8 additions & 6 deletions docs/accuracy.md
@@ -27,7 +27,7 @@ For MMLU dataset, the benchmark section in yaml file can be modified in the foll
benchmark:
type: "mmlu"
num_examples: 200 # Number of examples to run
max_tokens: 2048 # Max number of output tokens
max_tokens: 8192 # Max number of output tokens.
repeat: 8 # Number of repetitions
num_threads: 512 # Number of parallel threads for running benchmark
```
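These YAML fields map directly onto `sglang.test.run_eval` flags (the same mapping this PR's `scripts/benchmarks/mmlu/bench.sh` performs); a minimal sketch of that mapping, with the base URL being an illustrative assumption:

```python
# Benchmark section as parsed from the recipe above
benchmark = {"type": "mmlu", "num_examples": 200, "max_tokens": 8192,
             "repeat": 8, "num_threads": 512}

# Build the run_eval invocation; the base URL here is assumed for illustration
cmd = ["python3", "-m", "sglang.test.run_eval",
       "--base-url", "http://localhost:8000",
       "--eval-name", benchmark["type"]]
for key in ("num_examples", "max_tokens", "repeat", "num_threads"):
    cmd += [f"--{key.replace('_', '-')}", str(benchmark[key])]

print(" ".join(cmd))
```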
@@ -40,18 +40,20 @@ srtctl apply -f config.yaml
After the benchmark finishes, `benchmark.out` will contain the accuracy results:
```
====================
Repeat: 8, mean: 0.812
Scores: ['0.790', '0.820', '0.800', '0.820', '0.820', '0.790', '0.820', '0.840']
Repeat: 8, mean: 0.895
Scores: ['0.905', '0.895', '0.900', '0.880', '0.905', '0.890', '0.890', '0.895']
====================
Writing report to /tmp/mmlu_deepseek-ai_DeepSeek-R1.html
{'other': np.float64(0.9), 'other:std': np.float64(0.30000000000000004), 'score:std': np.float64(0.36660605559646725), 'stem': np.float64(0.8095238095238095), 'stem:std': np.float64(0.392676726249301), 'humanities': np.float64(0.7428571428571429), 'humanities:std': np.float64(0.4370588154508102), 'social_sciences': np.float64(0.9583333333333334), 'social_sciences:std': np.float64(0.19982631347136331), 'score': np.float64(0.84)}
{'other': np.float64(0.9361702127659575), 'other:std': np.float64(0.24444947432076722), 'score:std': np.float64(0.3065534211193866), 'stem': np.float64(0.9285714285714286), 'stem:std': np.float64(0.25753937681885636), 'humanities': np.float64(0.8064516129032258), 'humanities:std': np.float64(0.3950789907714804), 'social_sciences': np.float64(0.9387755102040817), 'social_sciences:std': np.float64(0.23974163519328023), 'score': np.float64(0.895)}
Writing results to /tmp/mmlu_deepseek-ai_DeepSeek-R1.json
Total latency: 465.618 s
Score: 0.840
Total latency: 754.457 s
Score: 0.895
Results saved to: /logs/accuracy/mmlu_deepseek-ai_DeepSeek-R1.json
MMLU evaluation complete
```

**Note: `max-tokens` must be large enough to reach the expected accuracy. For the deepseek-r1-fp4 model, `max-tokens=8192` reaches the expected accuracy of 0.895, while `max-tokens=2048` scores only 0.81.**
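The `mean` line in the output is simply the average of the eight per-repeat scores; a quick sanity check (note that the reported `score:std` is the per-question standard deviation, not this per-repeat spread):

```python
import statistics

# Per-repeat MMLU scores as printed in benchmark.out above
scores = [0.905, 0.895, 0.900, 0.880, 0.905, 0.890, 0.890, 0.895]

mean = sum(scores) / len(scores)
print(f"Repeat: {len(scores)}, mean: {mean:.3f}")            # mean: 0.895
print(f"spread across repeats: {statistics.pstdev(scores):.4f}")
```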


## GPQA
For the GPQA dataset, the benchmark section in the yaml file can be modified as follows:
2 changes: 1 addition & 1 deletion docs/profiling.md
@@ -66,7 +66,7 @@ profiling:
profiling:
type: "torch" # Required: "none", "torch", or "nsys"

# Traffic generator parameters (required when profiling is enabled)
# Traffic generator parameters (required when profiling is enabled)
isl: 1024 # Input sequence length
osl: 128 # Output sequence length
concurrency: 24 # Batch size for profiling workload
61 changes: 61 additions & 0 deletions docs/sglang-router.md
@@ -177,6 +177,67 @@ The default bootstrap port is `30001` (matching most recipes). If you use a diff

Workers listen on port `30000` by default. This is standard sglang behavior and doesn't need configuration.

## Debugging with SGLang Source Code

When using sglang-router mode, you can mount and install sglang from source for debugging purposes. This is useful when you need to test local changes or debug issues in sglang itself.

### Configuration

Add `sglang_src_dir` to your recipe's `backend` section:

```yaml
backend:
use_sglang_router: true
sglang_src_dir: "/path/to/your/local/sglang"
```

### How It Works

1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
2. Before launching workers, the script runs: `pip install -e . --no-deps`
3. Workers use your local sglang code instead of the container's pre-installed version

Comment on lines +194 to +199
⚠️ Potential issue | 🟡 Minor

Document the automatic 'python' subdirectory resolution.

The documentation states that the local directory is mounted to /ext-sglang-src/, but the implementation in scripts/worker_setup/command.py (line 163) automatically appends /python to this path before installation. Users should be informed that sglang_src_dir must point to the SGLang repository root (not the python subdirectory), as the code will automatically resolve to the python/ subdirectory within it.

📝 Suggested documentation update
 ### How It Works
 
-1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
-2. Before launching workers, the script runs: `pip install -e . --no-deps`
+1. Your local sglang directory is mounted to `/ext-sglang-src/` in the container
+2. The installation automatically uses the `python/` subdirectory within the mounted source
+3. Before launching workers, the script runs: `pip install -e . --no-deps` from `/ext-sglang-src/python/`
-3. Workers use your local sglang code instead of the container's pre-installed version
+4. Workers use your local sglang code instead of the container's pre-installed version

### Behavior

**With `sglang_src_dir` set:**
- Mounts your local sglang source to `/ext-sglang-src/`
- Installs it in editable mode on all prefill/decode/aggregated workers
- Your local changes take effect immediately

**Without `sglang_src_dir` (or empty):**
- No mount is added
- Installation step is skipped gracefully
- Uses the container's pre-installed sglang

### Example

```yaml
name: "debug-sglang-router"

model:
path: "deepseek-r1-fp4"
container: "0.5.5.post2"

backend:
use_sglang_router: true
sglang_src_dir: "/home/username/projects/sglang" # Your local sglang checkout

sglang_config:
# ... your config
```

Then apply:
```bash
srtctl apply -f recipies/debug-sglang-router.yaml
```

### Notes

- Only works with `use_sglang_router: true` (disaggregation mode)
- The source directory must exist on the host running srtctl
- Dependencies are NOT reinstalled (uses `--no-deps`), so the container must have compatible dependencies already installed
- Useful for iterative debugging without rebuilding containers

## Complete Example

Here's a full recipe using sglang router:
123 changes: 123 additions & 0 deletions examples/fp4-disagg-nsys-profiling.yaml
@@ -0,0 +1,123 @@
name: "gb200-fp4-1p2d"

model:
path: "dsfp4"
container: "0.5.5.post2"
precision: "fp4"

resources:
gpu_type: "gb200"
prefill_nodes: 1
decode_nodes: 2
prefill_workers: 1
decode_workers: 2
gpus_per_node: 4

backend:
use_sglang_router: "true"

prefill_environment:
SGLANG_LOG_FORWARD_ITERS: "1"
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
PYTHONUNBUFFERED: "1"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
#SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
#SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
SGLANG_ENABLE_JIT_DEEPGEMM: "false"
SGLANG_ENABLE_FLASHINFER_GEMM: "true" #instead of SGLANG_FLASHINFER_FP4_GEMM_BACKEND

decode_environment:
SGLANG_LOG_FORWARD_ITERS: "1"
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
PYTHONUNBUFFERED: "1"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
# SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
# SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
SGLANG_ENABLE_JIT_DEEPGEMM: "false"
SGLANG_ENABLE_FLASHINFER_GEMM: "true" #instead of SGLANG_FLASHINFER_FP4_GEMM_BACKEND

sglang_config:
prefill:
disaggregation-mode: "prefill"
served-model-name: "deepseek-ai/DeepSeek-R1"
model-path: "/model/"
trust-remote-code: true
disable-radix-cache: true
kv-cache-dtype: "fp8_e4m3"
attention-backend: "trtllm_mla"
quantization: "modelopt_fp4"
moe-runner-backend: "flashinfer_trtllm"
stream-interval: 10
watchdog-timeout: 1000000
context-length: 2200
mem-fraction-static: 0.95
max-total-tokens: 8192
chunked-prefill-size: 8192
cuda-graph-max-bs: 256
max-running-requests: 512
scheduler-recv-interval: 10
enable-symm-mem: true
moe-dense-tp-size: 1
load-balance-method: "round_robin"
disaggregation-bootstrap-port: 30001
load-format: "dummy"
data-parallel-size: 1
tensor-parallel-size: 4
expert-parallel-size: 1

decode:
disaggregation-mode: "decode"
served-model-name: "deepseek-ai/DeepSeek-R1"
model-path: "/model/"
prefill-round-robin-balance: true
trust-remote-code: true
disable-radix-cache: true
kv-cache-dtype: "fp8_e4m3"
attention-backend: "trtllm_mla"
quantization: "modelopt_fp4"
moe-runner-backend: "flashinfer_trtllm"
disaggregation-bootstrap-port: 30001
stream-interval: 10
watchdog-timeout: 1000000
context-length: 2200
mem-fraction-static: 0.95
load-format: "dummy"
chunked-prefill-size: 8192
cuda-graph-max-bs: 256
scheduler-recv-interval: 10
enable-symm-mem: true
moe-dense-tp-size: 1
tensor-parallel-size: 4
expert-parallel-size: 1

profiling:
type: "nsys"
isl: 1024
osl: 1024
concurrency: 256
prefill:
start_step: 60
stop_step: 70
decode:
start_step: 700
stop_step: 730
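The nsys capture above is limited to a per-role step window; a small sketch of the window check, assuming an inclusive start bound and exclusive stop bound (the exact boundary semantics are an assumption):

```python
# Profiling windows from the recipe above
profiling = {
    "prefill": {"start_step": 60, "stop_step": 70},
    "decode": {"start_step": 700, "stop_step": 730},
}

def in_profile_window(role: str, step: int) -> bool:
    # True while the profiler should be capturing for this worker role
    w = profiling[role]
    return w["start_step"] <= step < w["stop_step"]

print(in_profile_window("decode", 710))   # inside 700-730
print(in_profile_window("prefill", 70))   # stop bound assumed exclusive
```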
64 changes: 64 additions & 0 deletions scripts/benchmarks/mmlu/bench.sh
@@ -0,0 +1,64 @@
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# MMLU evaluation script using sglang.test.run_eval

head_node="localhost"
head_port=8000
model_name="deepseek-ai/DeepSeek-R1" # Default model name

# Parse arguments from SLURM job
n_prefill=$1
n_decode=$2
prefill_gpus=$3
decode_gpus=$4
num_examples=${5:-200} # Default: 200
max_tokens=${6:-8192} # Default: 8192
repeat=${7:-8} # Default: 8
num_threads=${8:-512} # Default: 512

echo "MMLU Benchmark Config: num_examples=${num_examples}; max_tokens=${max_tokens}; repeat=${repeat}; num_threads=${num_threads}"

# Source utilities for wait_for_model
source /scripts/utils/benchmark_utils.sh

wait_for_model_timeout=1500 # 25 minutes
wait_for_model_check_interval=5 # check interval -> 5s
wait_for_model_report_interval=60 # wait_for_model report interval -> 60s

wait_for_model $head_node $head_port $n_prefill $n_decode $wait_for_model_check_interval $wait_for_model_timeout $wait_for_model_report_interval

# Create results directory
result_dir="/logs/accuracy"
mkdir -p $result_dir

echo "Running MMLU evaluation..."

# Set OPENAI_API_KEY if not set
if [ -z "$OPENAI_API_KEY" ]; then
export OPENAI_API_KEY="EMPTY"
fi

# Run the evaluation
python3 -m sglang.test.run_eval \
--base-url "http://${head_node}:${head_port}" \
--model ${model_name} \
--eval-name mmlu \
--num-examples ${num_examples} \
--max-tokens ${max_tokens} \
--repeat ${repeat} \
--num-threads ${num_threads}

# Copy the result file from /tmp to our logs directory
# The result file is named mmlu_{model_name}.json
result_file=$(ls -t /tmp/mmlu_*.json 2>/dev/null | head -n1)

if [ -f "$result_file" ]; then
cp "$result_file" "$result_dir/"
echo "Results saved to: $result_dir/$(basename $result_file)"
else
echo "Warning: Could not find result file in /tmp"
fi

echo "MMLU evaluation complete"
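The positional defaults above rely on bash's `${N:-default}` expansion; a standalone demo with only the four required SLURM arguments supplied:

```shell
# Simulate calling the script with just n_prefill n_decode prefill_gpus decode_gpus
set -- 1 2 4 4
num_examples=${5:-200}
max_tokens=${6:-8192}
repeat=${7:-8}
num_threads=${8:-512}
echo "num_examples=${num_examples} max_tokens=${max_tokens} repeat=${repeat} num_threads=${num_threads}"
```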