This repository was archived by the owner on Apr 20, 2026. It is now read-only.
Merged
57 changes: 57 additions & 0 deletions docs/accuracy.md
@@ -0,0 +1,57 @@
# Accuracy Benchmark

In srt-slurm, users can run different accuracy benchmarks by setting the `benchmark` section in the config YAML file. Supported benchmarks are `mmlu`, `gpqa`, and `longbenchv2`.

**Note that the `context-length` argument in the config YAML must be larger than the `max_tokens` argument of the accuracy benchmark.**
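For instance, the two settings might be paired in one config as sketched below; the exact location of the `context-length` key depends on your config layout, so treat this excerpt as an assumption, not the definitive schema:

```yaml
# Hypothetical excerpt: context-length must exceed the benchmark's max_tokens
server:
  context-length: 8192   # assumed key location; adjust to your config schema

benchmark:
  type: "mmlu"
  max_tokens: 2048       # must stay below context-length
```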


## MMLU

For the MMLU dataset, the `benchmark` section in the YAML file can be modified in the following way:
```yaml
benchmark:
  type: "mmlu"
  num_examples: 200   # Number of examples to run
  max_tokens: 2048    # Max number of output tokens
  repeat: 8           # Number of repetitions
  num_threads: 512    # Number of parallel threads for running the benchmark
```

Then launch the script as usual:
```bash
srtctl apply -f config.yaml
```

After the benchmark finishes, `benchmark.out` will contain the accuracy results:
```
====================
Repeat: 8, mean: 0.812
Scores: ['0.790', '0.820', '0.800', '0.820', '0.820', '0.790', '0.820', '0.840']
====================
Writing report to /tmp/mmlu_deepseek-ai_DeepSeek-R1.html
{'other': np.float64(0.9), 'other:std': np.float64(0.30000000000000004), 'score:std': np.float64(0.36660605559646725), 'stem': np.float64(0.8095238095238095), 'stem:std': np.float64(0.392676726249301), 'humanities': np.float64(0.7428571428571429), 'humanities:std': np.float64(0.4370588154508102), 'social_sciences': np.float64(0.9583333333333334), 'social_sciences:std': np.float64(0.19982631347136331), 'score': np.float64(0.84)}
Writing results to /tmp/mmlu_deepseek-ai_DeepSeek-R1.json
Total latency: 465.618 s
Score: 0.840
Results saved to: /logs/accuracy/mmlu_deepseek-ai_DeepSeek-R1.json
MMLU evaluation complete
```
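The `mean` in the summary above is a plain average over the per-repeat scores. As a sketch, it can be reproduced from the sample scores printed above:

```python
# Per-repeat MMLU scores, copied from the sample benchmark.out above
scores = [0.790, 0.820, 0.800, 0.820, 0.820, 0.790, 0.820, 0.840]

mean = sum(scores) / len(scores)
# Population standard deviation across repeats, for a sense of run-to-run spread
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5

print(f"Repeat: {len(scores)}, mean: {mean:.3f}")  # prints: Repeat: 8, mean: 0.812
```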


## GPQA
For the GPQA dataset, the `benchmark` section in the YAML file can be modified in the following way:
```yaml
benchmark:
  type: "gpqa"
  num_examples: 198   # Number of examples to run
  max_tokens: 65536   # GPQA needs a larger output token budget
  repeat: 8           # Number of repetitions
  num_threads: 128    # Number of parallel threads for running the benchmark
```
As noted above, the `context-length` argument must be set to a value larger than `max_tokens` (65536 here).


## LongBench-V2
To be updated
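This section is not yet written. As a provisional sketch based on the defaults this PR adds in `src/srtctl/backends/sglang.py` and the new `BenchmarkConfig` fields, the section might eventually describe a config like the following (treat the values as assumptions until the section is finalized):

```yaml
benchmark:
  type: "longbenchv2"
  max_tokens: 16384          # Backend default output token budget
  max_context_length: 128000 # Backend default maximum context length
  num_threads: 16            # Backend default number of parallel threads
  # num_examples and categories default to null (all examples / all categories)
```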


11 changes: 5 additions & 6 deletions scripts/benchmarks/gpqa/bench.sh
@@ -14,12 +14,12 @@ n_decode=$2
prefill_gpus=$3
decode_gpus=$4
num_examples=${5:-198} # Default: 198
max_tokens=${6:-512} # Default: 512
max_tokens=${6:-32768} # Default: 32768
⚠️ Potential issue | 🟡 Minor

Add comments explaining the rationale for default parameter values.

The --thinking-mode removal is properly documented, but the script lacks explanatory comments for the new defaults:

  • max_tokens=32768 — clarify why this value was chosen for DeepSeek-R1 reasoning on GPQA
  • num_threads=128 — clarify the impact on sglang request parallelization

These values are reasonable for GPQA evaluation with a reasoning model, but documenting the rationale helps future maintainers understand the performance/quality trade-offs.


repeat=${7:-8} # Default: 8
num_threads=${8:-512} # Default: 512
thinking_mode=${9:-deepseek-r1} # Default: deepseek-r1
num_threads=${8:-128} # Default: 128
# Note: --thinking-mode removed because dynamo frontend doesn't support chat_template_kwargs
coderabbitai[bot] marked this conversation as resolved.

echo "GPQA Benchmark Config: num_examples=${num_examples}; max_tokens=${max_tokens}; repeat=${repeat}; num_threads=${num_threads}; thinking-mode=${thinking_mode}"
echo "GPQA Benchmark Config: num_examples=${num_examples}; max_tokens=${max_tokens}; repeat=${repeat}; num_threads=${num_threads}"

# Source utilities for wait_for_model
source /scripts/utils/benchmark_utils.sh
@@ -49,8 +49,7 @@ python3 -m sglang.test.run_eval \
--num-examples ${num_examples} \
--max-tokens ${max_tokens} \
--repeat ${repeat} \
--num-threads ${num_threads} \
--thinking-mode ${thinking_mode}
--num-threads ${num_threads}

# Copy the result file from /tmp to our logs directory
# The result file is named gpqa_{model_name}.json
4 changes: 2 additions & 2 deletions scripts/benchmarks/mmlu/bench.sh
@@ -13,8 +13,8 @@ n_prefill=$1
n_decode=$2
prefill_gpus=$3
decode_gpus=$4
num_examples=${5:-198} # Default: 198
max_tokens=${6:-512} # Default: 512
num_examples=${5:-200} # Default: 200
max_tokens=${6:-2048} # Default: 2048
repeat=${7:-8} # Default: 8
num_threads=${8:-512} # Default: 512

19 changes: 19 additions & 0 deletions src/srtctl/backends/sglang.py
@@ -262,6 +262,25 @@ def generate_slurm_script(self, config_path: Path = None, timestamp: str = None)
concurrency_str = str(concurrencies)

parsable_config = f"{isl} {osl} {concurrency_str} {req_rate}"
elif bench_type == "mmlu":
num_examples = benchmark_config.get("num_examples", 200)
max_tokens = benchmark_config.get("max_tokens", 2048)
repeat = benchmark_config.get("repeat", 8)
num_threads = benchmark_config.get("num_threads", 512)
parsable_config = f"{num_examples} {max_tokens} {repeat} {num_threads}"
elif bench_type == "gpqa":
num_examples = benchmark_config.get("num_examples", 198)
max_tokens = benchmark_config.get("max_tokens", 32768)
repeat = benchmark_config.get("repeat", 8)
num_threads = benchmark_config.get("num_threads", 128)
parsable_config = f"{num_examples} {max_tokens} {repeat} {num_threads}"
elif bench_type == "longbenchv2":
num_examples = benchmark_config.get("num_examples", None)
max_tokens = benchmark_config.get("max_tokens", 16384)
max_context_length = benchmark_config.get("max_context_length", 128000)
num_threads = benchmark_config.get("num_threads", 16)
categories = benchmark_config.get("categories", None)
parsable_config = f"{num_examples} {max_tokens} {max_context_length} {num_threads} {categories}"

# Config directory should point to where deepep_config.json lives
# This is typically the configs/ directory in the yaml-config repo
8 changes: 8 additions & 0 deletions src/srtctl/core/schema.py
@@ -165,6 +165,14 @@ class BenchmarkConfig(BaseModel):
)
req_rate: Optional[str] = Field("inf", description="Request rate")

# Accuracy benchmark arguments
num_examples: Optional[int] = Field(None, description="Number of examples")
max_tokens: Optional[int] = Field(None, description="Maximum output tokens")
repeat: Optional[int] = Field(None, description="Number of times to repeat the benchmark")
num_threads: Optional[int] = Field(None, description="Number of running threads for accuracy benchmark")
max_context_length: Optional[int] = Field(None, description="Maximum context length for LongBench-v2 accuracy benchmark")
categories: Optional[list[str]] = Field(None, description="List of categories to evaluate for LongBench-v2 (None for all)")


class ProfilingType(str, Enum):
"""Supported profiling types."""