83 changes: 82 additions & 1 deletion docs/accuracy.md
@@ -1,6 +1,6 @@
# Accuracy Benchmarks

In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa` and `longbenchv2`.
In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa`, `longbenchv2`, and `lm-eval`.

## Table of Contents

@@ -14,6 +14,7 @@
- [Example: Quick Validation](#example-quick-validation)
- [Output](#output)
- [Important Notes](#important-notes)
- [lm-eval](#lm-eval)

---

@@ -191,3 +192,83 @@ The output includes per-category scores and aggregate metrics:
4. **Categories**: Running specific categories is useful for targeted validation (e.g., just testing summarization capabilities)


## lm-eval

The `lm-eval` benchmark runner integrates [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) against the deployed OpenAI-compatible endpoint. By default, the runner invokes the `lm_eval` CLI directly (installing the `lm-eval` pip package on demand when it isn't already present in the container). An external eval workspace that exposes a compatible `benchmark_lib.sh` can be plugged in instead — see [External eval harness](#external-eval-harness) below.

### How it works

1. `do_sweep.py` starts infrastructure, workers, and the frontend for the normal recipe topology.
2. When `EVAL_ONLY=true`, `do_sweep.py` skips the throughput benchmark stage and runs `_run_post_eval()` directly after frontend startup.
3. `_run_post_eval()` waits for the OpenAI-compatible endpoint on port 8000 and, in eval-only mode, performs the full `wait_for_model()` health check for the configured prefill/decode or aggregated topology.
4. `_run_post_eval()` launches the registered `lm-eval` runner on the head node.
5. The runner script (`benchmarks/scripts/lm-eval/bench.sh`) uses `MODEL_NAME` from `do_sweep.py`, or auto-discovers the served model from `/v1/models` as a fallback.
6. The runner calls `lm_eval --model local-chat-completions --model_args base_url=<endpoint>/v1/chat/completions,model=<MODEL_NAME>,num_concurrent=<EVAL_CONCURRENT_REQUESTS>,tokenized_requests=False --tasks <LM_EVAL_TASKS>` and writes results to `/logs/eval_results/`.

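For reference, the effective command from step 6 with all defaults filled in looks roughly like this (illustrative only; the runner substitutes the resolved endpoint, served model name, task list, and concurrency):

```bash
# Illustrative expansion of step 6 with default values; the runner fills in
# the actual endpoint, MODEL_NAME, LM_EVAL_TASKS, and EVAL_CONCURRENT_REQUESTS.
lm_eval \
    --model local-chat-completions \
    --model_args "base_url=http://localhost:8000/v1/chat/completions,model=${MODEL_NAME},num_concurrent=256,tokenized_requests=False" \
    --tasks gsm8k \
    --apply_chat_template \
    --output_path /logs/eval_results
```
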
### EVAL_ONLY mode

srt-slurm supports an `EVAL_ONLY` mode for jobs that should only validate accuracy. It is controlled by environment variables:

| Env var | Description |
|---------|-------------|
| `EVAL_ONLY` | Set to `true` to skip the throughput benchmark stage and run eval only |
| `RUN_EVAL` | Set to `true` to run eval after the throughput benchmark completes |
| `EVAL_CONC` | Concurrent requests for lm-eval; falls back to max of the recipe benchmark concurrency list |
| `MODEL_NAME` | Served model alias for OpenAI-compatible requests; set by `do_sweep.py` from `config.served_model_name` |
| `LM_EVAL_TASKS` | Comma-separated lm-evaluation-harness tasks (default: `gsm8k`) |
| `LM_EVAL_MODEL` | lm-evaluation-harness model type (default: `local-chat-completions`) |
| `LM_EVAL_EXTRA_ARGS` | Extra flags appended to the `lm_eval` command line |
| `EVAL_OUTPUT_DIR` | Where eval artifacts are written inside the container (default: `/logs/eval_results`) |

When `EVAL_ONLY=true`:
- Stage 4 skips the throughput benchmark entirely. No throughput result JSON is expected from srt-slurm.
- The eval path uses the full `wait_for_model()` health check before starting lm-eval.
- `_run_post_eval()` launches the `lm-eval` runner and returns its exit code.
- Eval failure is fatal because eval is the only purpose of the job.

When `RUN_EVAL=true` (without `EVAL_ONLY`):
- Throughput benchmark runs normally.
- After benchmark completes successfully, eval runs as a post-step.
- Eval failure is non-fatal; the benchmark job still succeeds if throughput passed.

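The two modes above are selected purely through environment variables. A hypothetical launcher environment is sketched below (the exact launch command depends on the recipe; only the variables matter here):

```bash
# Accuracy-only job: skip the throughput stage; an eval failure fails the job.
export EVAL_ONLY=true
export LM_EVAL_TASKS="gsm8k,mmlu"   # comma-separated lm-evaluation-harness task names
export EVAL_CONC=64                 # concurrent requests for lm-eval

# Throughput + eval job: eval runs after the sweep; an eval failure is non-fatal.
export RUN_EVAL=true
```
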
### Topology metadata passthrough

`do_sweep.py` also forwards a set of topology/precision env vars to the eval container so downstream aggregation tooling can record them alongside the eval results:

| Env var | Purpose |
|---------|---------|
| `RUN_EVAL`, `EVAL_ONLY`, `IS_MULTINODE` | Whether eval runs and how artifacts are classified |
| `FRAMEWORK`, `PRECISION`, `MODEL_PREFIX`, `RUNNER_TYPE`, `SPEC_DECODING` | Benchmark identity metadata |
| `ISL`, `OSL`, `RESULT_FILENAME` | Sequence length and result-file metadata |
| `MODEL`, `MODEL_PATH`, `MODEL_NAME` | Model metadata and the served model alias |
| `MAX_MODEL_LEN`, `EVAL_MAX_MODEL_LEN` | Context-length metadata |
| `PREFILL_TP`, `PREFILL_EP`, `PREFILL_NUM_WORKERS`, `PREFILL_DP_ATTN` | Prefill-side topology metadata |
| `DECODE_TP`, `DECODE_EP`, `DECODE_NUM_WORKERS`, `DECODE_DP_ATTN` | Decode-side topology metadata |
| `EVAL_CONC`, `EVAL_CONCURRENT_REQUESTS` | Eval concurrency controls |

These variables are optional: they are passed through only when the launcher sets them, and the default lm-eval path works without any of them.

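As a hypothetical example, a launcher that wants the aggregation tooling to record a disaggregated run might forward a subset of these (all values below are placeholders; none are required for the default lm-eval path):

```bash
# Hypothetical values; set only what the downstream aggregation tooling needs.
export FRAMEWORK=sglang PRECISION=fp8 ISL=1024 OSL=1024
export PREFILL_TP=4 PREFILL_NUM_WORKERS=2 DECODE_TP=8 DECODE_NUM_WORKERS=1
export IS_MULTINODE=true
```
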
### Concurrency

The runner exports `EVAL_CONCURRENT_REQUESTS`, preferring `EVAL_CONC` when set and falling back to `256`:

```bash
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
```

When `EVAL_CONC` is not set, `do_sweep.py` defaults it to the max of the recipe benchmark concurrency list.

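The precedence, highest first, is therefore `EVAL_CONC`, then a pre-set `EVAL_CONCURRENT_REQUESTS`, then `256`. For example, invoking the runner script directly (illustrative):

```bash
# Highest precedence: EVAL_CONC, as set by do_sweep.py.
EVAL_CONC=64 bash /srtctl-benchmarks/lm-eval/bench.sh http://localhost:8000                 # -> 64
# Next: a pre-set EVAL_CONCURRENT_REQUESTS.
EVAL_CONCURRENT_REQUESTS=128 bash /srtctl-benchmarks/lm-eval/bench.sh http://localhost:8000 # -> 128
# Neither set: the runner falls back to 256.
bash /srtctl-benchmarks/lm-eval/bench.sh http://localhost:8000                              # -> 256
```
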
### Output

Eval artifacts are written to `/logs/eval_results/` inside the container (override with `EVAL_OUTPUT_DIR`):
- `results*.json` - lm-eval scores per task
- `sample*.jsonl` - per-sample outputs (when the harness emits them)

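Scores can be pulled straight out of the JSON; a minimal sketch using `python3` (already required by the runner), assuming the standard lm-eval results layout where per-task scores live under a top-level `results` key:

```bash
# Print per-task scores from every results file (results may sit in a
# per-model subdirectory depending on the harness version, so search recursively).
find /logs/eval_results -name 'results*.json' -print | while read -r f; do
    echo "== $f"
    python3 -c "import json,sys; print(json.dumps(json.load(open(sys.argv[1]))['results'], indent=2))" "$f"
done
```
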
### External eval harness

If the launcher needs to drive lm-eval through an existing `benchmark_lib.sh` (for example, an internal evaluation library that also handles summary generation), mount that workspace into the container and point the runner at it with one of:

- `LM_EVAL_LIB` - absolute container path to the `benchmark_lib.sh` file.
- `LM_EVAL_WORKSPACE` - host path to a workspace; it is mounted at `/lm-eval-workspace` inside the container, and the runner sources `/lm-eval-workspace/benchmarks/benchmark_lib.sh`.

When a `benchmark_lib.sh` is found, the runner sources it, calls `run_eval --framework lm-eval` and (if exported) `append_lm_eval_summary`, and copies `meta_env.json`, `results*.json`, and `sample*.jsonl` from the CWD into `EVAL_OUTPUT_DIR`. Otherwise, the default in-container `lm_eval` CLI path is used.
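
A hypothetical setup (both paths below are placeholders for your own workspace layout):

```bash
# Option 1: point directly at a benchmark_lib.sh that is already visible inside
# the container (hypothetical path).
export LM_EVAL_LIB=/opt/internal-evals/benchmarks/benchmark_lib.sh

# Option 2: mount a host workspace; the runner then sources
# /lm-eval-workspace/benchmarks/benchmark_lib.sh (hypothetical host path).
export LM_EVAL_WORKSPACE=/home/me/internal-evals
```
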
2 changes: 2 additions & 0 deletions src/srtctl/benchmarks/__init__.py
@@ -8,6 +8,7 @@
    custom,
    gpqa,
    gsm8k,
    lm_eval,
    longbenchv2,
    mmlu,
    mooncake_router,
@@ -30,6 +31,7 @@
"register_benchmark",
# Runners
"custom",
"lm_eval",
"sa_bench",
"sglang_bench",
"mmlu",
54 changes: 54 additions & 0 deletions src/srtctl/benchmarks/lm_eval.py
@@ -0,0 +1,54 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""lm-eval benchmark runner.

Runs the EleutherAI lm-evaluation-harness against the deployed OpenAI-compatible
endpoint. An external harness that exposes ``benchmark_lib.sh`` is used instead
when available — either via the ``LM_EVAL_LIB`` env var, or a workspace mounted
at ``/lm-eval-workspace`` (set host-side with ``LM_EVAL_WORKSPACE``).
"""

from __future__ import annotations

from typing import TYPE_CHECKING

from srtctl.benchmarks.base import SCRIPTS_DIR, BenchmarkRunner, register_benchmark

if TYPE_CHECKING:
    from srtctl.core.runtime import RuntimeContext
    from srtctl.core.schema import SrtConfig


@register_benchmark("lm-eval")
class LMEvalRunner(BenchmarkRunner):
    """lm-eval accuracy evaluation runner."""

    @property
    def name(self) -> str:
        return "lm-eval"

    @property
    def script_path(self) -> str:
        return "/srtctl-benchmarks/lm-eval/bench.sh"

    @property
    def local_script_dir(self) -> str:
        return str(SCRIPTS_DIR / "lm-eval")

    def validate_config(self, config: SrtConfig) -> list[str]:
        # lm-eval has sensible defaults.
        return []

    def build_command(
        self,
        config: SrtConfig,
        runtime: RuntimeContext,
    ) -> list[str]:
        endpoint = f"http://localhost:{runtime.frontend_port}"
        return [
            "bash",
            self.script_path,
            endpoint,
        ]
119 changes: 119 additions & 0 deletions src/srtctl/benchmarks/scripts/lm-eval/bench.sh
@@ -0,0 +1,119 @@
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# lm-eval accuracy evaluation.
#
# By default runs the EleutherAI lm-evaluation-harness CLI directly against
# the OpenAI-compatible endpoint. If an external eval harness is mounted at
# /lm-eval-workspace (or pointed to via LM_EVAL_LIB) and exposes a compatible
# benchmark_lib.sh, that harness is sourced and used instead.
#
# Expects: endpoint

set -e

ENDPOINT=$1

# Extract HOST and PORT from endpoint (e.g., http://localhost:8000)
HOST=$(echo "$ENDPOINT" | sed -E 's|https?://||; s|:.*||')
PORT=$(echo "$ENDPOINT" | sed -E 's|.*:([0-9]+).*|\1|')

echo "lm-eval config: endpoint=${ENDPOINT}; host=${HOST}; port=${PORT}"

# Auto-discover the served model name from /v1/models if MODEL_NAME is not set.
# This ensures we use the exact name the server recognizes, regardless of what
# $MODEL (a HuggingFace ID from the launcher) is set to.
if [[ -z "${MODEL_NAME:-}" ]]; then
DISCOVERED_MODEL=$(curl -sf "${ENDPOINT}/v1/models" 2>/dev/null \
| python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data'][0]['id'])" 2>/dev/null || true)
if [[ -n "$DISCOVERED_MODEL" ]]; then
export MODEL_NAME="$DISCOVERED_MODEL"
echo "Auto-discovered MODEL_NAME from /v1/models: ${MODEL_NAME}"
else
echo "WARNING: Could not discover model name from /v1/models, using MODEL_NAME=${MODEL_NAME:-$MODEL}"
fi
else
echo "Using MODEL_NAME from environment: ${MODEL_NAME}"
fi

# Output directory for eval artifacts. Override with EVAL_OUTPUT_DIR.
EVAL_OUTPUT_DIR="${EVAL_OUTPUT_DIR:-/logs/eval_results}"
mkdir -p "${EVAL_OUTPUT_DIR}"

# Concurrency. Prefer EVAL_CONC (set by orchestration) over EVAL_CONCURRENT_REQUESTS.
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
echo "Running lm-eval with concurrent-requests=${EVAL_CONCURRENT_REQUESTS}..."

eval_rc=0

# Let a host-provided harness take over if one is mounted. LM_EVAL_LIB points
# at a benchmark_lib.sh directly; otherwise we look for one at the standard
# /lm-eval-workspace mount point.
EXT_LIB="${LM_EVAL_LIB:-}"
if [[ -z "$EXT_LIB" && -f "/lm-eval-workspace/benchmarks/benchmark_lib.sh" ]]; then
EXT_LIB="/lm-eval-workspace/benchmarks/benchmark_lib.sh"
fi

if [[ -n "$EXT_LIB" && -f "$EXT_LIB" ]]; then
echo "Using external eval harness: $EXT_LIB"
# cd to the workspace root so relative paths (utils/evals/*.yaml) resolve.
cd "$(dirname "$(dirname "$EXT_LIB")")"
# shellcheck disable=SC1090
source "$EXT_LIB"
run_eval --framework lm-eval --port "$PORT" || eval_rc=$?

# Derive metadata env vars that append_lm_eval_summary expects but the
# launcher passes under different names.
export IS_MULTINODE="${IS_MULTINODE:-true}"
export TP="${TP:-${PREFILL_TP:-1}}"
export CONC="${CONC:-${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-1}}}"
export EP_SIZE="${EP_SIZE:-${PREFILL_EP:-1}}"
export DP_ATTENTION="${DP_ATTENTION:-${PREFILL_DP_ATTN:-false}}"
export PREFILL_DP_ATTENTION="${PREFILL_DP_ATTENTION:-${PREFILL_DP_ATTN:-${DP_ATTENTION:-false}}}"
export DECODE_DP_ATTENTION="${DECODE_DP_ATTENTION:-${DECODE_DP_ATTN:-${DP_ATTENTION:-false}}}"

if declare -F append_lm_eval_summary >/dev/null 2>&1; then
echo "Generating lm-eval summary..."
append_lm_eval_summary || true
fi

echo "Copying eval artifacts to ${EVAL_OUTPUT_DIR}/..."
cp -v meta_env.json "${EVAL_OUTPUT_DIR}/" 2>/dev/null || true
cp -v results*.json "${EVAL_OUTPUT_DIR}/" 2>/dev/null || true
cp -v sample*.jsonl "${EVAL_OUTPUT_DIR}/" 2>/dev/null || true
else
# Default path: run the EleutherAI lm-evaluation-harness CLI directly.
if ! command -v lm_eval >/dev/null 2>&1; then
echo "lm_eval CLI not found; installing lm-evaluation-harness via pip..."
python3 -m pip install --quiet --upgrade lm-eval || {
echo "ERROR: lm_eval CLI is not available and pip install failed."
exit 127
}
fi

TASKS="${LM_EVAL_TASKS:-gsm8k}"
MODEL_TYPE="${LM_EVAL_MODEL:-local-chat-completions}"
BASE_URL="${ENDPOINT%/}/v1/chat/completions"

echo "Running lm_eval on tasks=${TASKS} (model=${MODEL_NAME}, model_type=${MODEL_TYPE})"

MODEL_ARGS="base_url=${BASE_URL},model=${MODEL_NAME},num_concurrent=${EVAL_CONCURRENT_REQUESTS},tokenized_requests=False"

# shellcheck disable=SC2086
lm_eval \
--model "${MODEL_TYPE}" \
--model_args "${MODEL_ARGS}" \
--tasks "${TASKS}" \
--apply_chat_template \
--output_path "${EVAL_OUTPUT_DIR}" \
${LM_EVAL_EXTRA_ARGS:-} || eval_rc=$?
fi

if [[ "$eval_rc" -ne 0 ]]; then
echo "lm-eval evaluation failed with exit code ${eval_rc}"
exit "$eval_rc"
fi

echo "lm-eval evaluation complete"