83 changes: 82 additions & 1 deletion docs/accuracy.md
@@ -1,6 +1,6 @@
# Accuracy Benchmarks

In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa` and `longbenchv2`.
In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa`, `longbenchv2`, and `lm-eval`.

## Table of Contents

@@ -14,6 +14,7 @@
- [Example: Quick Validation](#example-quick-validation)
- [Output](#output)
- [Important Notes](#important-notes)
- [lm-eval](#lm-eval)

---

@@ -191,3 +192,83 @@ The output includes per-category scores and aggregate metrics:
4. **Categories**: Running specific categories is useful for targeted validation (e.g., just testing summarization capabilities)


## lm-eval

The `lm-eval` benchmark runner integrates [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) against the deployed OpenAI-compatible endpoint. By default, the runner invokes the `lm_eval` CLI directly (installing the `lm-eval` pip package on demand when it isn't already present in the container). An external eval workspace that exposes a compatible `benchmark_lib.sh` can be plugged in instead — see [External eval harness](#external-eval-harness) below.

### How it works

1. `do_sweep.py` starts infrastructure, workers, and the frontend for the normal recipe topology.
2. When `EVAL_ONLY=true`, `do_sweep.py` skips the throughput benchmark stage and runs `_run_post_eval()` directly after frontend startup.
3. `_run_post_eval()` waits for the OpenAI-compatible endpoint on port 8000 and, in eval-only mode, performs the full `wait_for_model()` health check for the configured prefill/decode or aggregated topology.
4. `_run_post_eval()` launches the registered `lm-eval` runner on the head node.
5. The runner script (`benchmarks/scripts/lm-eval/bench.sh`) uses `MODEL_NAME` from `do_sweep.py`, or auto-discovers the served model from `/v1/models` as a fallback.
6. The runner calls `lm_eval --model local-chat-completions --model_args base_url=<endpoint>/v1/chat/completions,model=<MODEL_NAME>,num_concurrent=<EVAL_CONCURRENT_REQUESTS>,tokenized_requests=False --tasks <LM_EVAL_TASKS>` and writes results to `/logs/eval_results/`.

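For reference, the effective command from step 6 with all defaults filled in looks roughly like this (illustrative only; the runner substitutes the resolved endpoint, served model name, task list, and concurrency):

```bash
# Illustrative expansion of step 6 with default values; the runner fills in
# the actual endpoint, MODEL_NAME, LM_EVAL_TASKS, and EVAL_CONCURRENT_REQUESTS.
lm_eval \
    --model local-chat-completions \
    --model_args "base_url=http://localhost:8000/v1/chat/completions,model=${MODEL_NAME},num_concurrent=256,tokenized_requests=False" \
    --tasks gsm8k \
    --apply_chat_template \
    --output_path /logs/eval_results
```
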
### EVAL_ONLY mode

srt-slurm supports an `EVAL_ONLY` mode for jobs that should only validate accuracy. It is controlled by environment variables:

| Env var | Description |
|---------|-------------|
| `EVAL_ONLY` | Set to `true` to skip the throughput benchmark stage and run eval only |
| `RUN_EVAL` | Set to `true` to run eval after the throughput benchmark completes |
| `EVAL_CONC` | Concurrent requests for lm-eval; falls back to max of the recipe benchmark concurrency list |
| `MODEL_NAME` | Served model alias for OpenAI-compatible requests; set by `do_sweep.py` from `config.served_model_name` |
| `LM_EVAL_TASKS` | Comma-separated lm-evaluation-harness tasks (default: `gsm8k`) |
| `LM_EVAL_MODEL` | lm-evaluation-harness model type (default: `local-chat-completions`) |
| `LM_EVAL_EXTRA_ARGS` | Extra flags appended to the `lm_eval` command line |
| `EVAL_OUTPUT_DIR` | Where eval artifacts are written inside the container (default: `/logs/eval_results`) |

When `EVAL_ONLY=true`:
- Stage 4 skips the throughput benchmark entirely. No throughput result JSON is expected from srt-slurm.
- The eval path uses the full `wait_for_model()` health check before starting lm-eval.
- `_run_post_eval()` launches the `lm-eval` runner and returns its exit code.
- Eval failure is fatal because eval is the only purpose of the job.

When `RUN_EVAL=true` (without `EVAL_ONLY`):
- Throughput benchmark runs normally.
- After benchmark completes successfully, eval runs as a post-step.
- Eval failure is non-fatal; the benchmark job still succeeds if throughput passed.

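The two modes above are selected purely through environment variables. A hypothetical launcher environment is sketched below (the exact launch command depends on the recipe; only the variables matter here):

```bash
# Accuracy-only job: skip the throughput stage; an eval failure fails the job.
export EVAL_ONLY=true
export LM_EVAL_TASKS="gsm8k,mmlu"   # comma-separated lm-evaluation-harness task names
export EVAL_CONC=64                 # concurrent requests for lm-eval

# Throughput + eval job: eval runs after the sweep; an eval failure is non-fatal.
export RUN_EVAL=true
```
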
### Topology metadata passthrough

`do_sweep.py` also forwards a set of topology/precision env vars to the eval container so downstream aggregation tooling can record them alongside the eval results:

| Env var | Purpose |
|---------|---------|
| `RUN_EVAL`, `EVAL_ONLY`, `IS_MULTINODE` | Whether eval runs and how artifacts are classified |
| `FRAMEWORK`, `PRECISION`, `MODEL_PREFIX`, `RUNNER_TYPE`, `SPEC_DECODING` | Benchmark identity metadata |
| `ISL`, `OSL`, `RESULT_FILENAME` | Sequence length and result-file metadata |
| `MODEL`, `MODEL_PATH`, `MODEL_NAME` | Model metadata and the served model alias |
| `MAX_MODEL_LEN`, `EVAL_MAX_MODEL_LEN` | Context-length metadata |
| `PREFILL_TP`, `PREFILL_EP`, `PREFILL_NUM_WORKERS`, `PREFILL_DP_ATTN` | Prefill-side topology metadata |
| `DECODE_TP`, `DECODE_EP`, `DECODE_NUM_WORKERS`, `DECODE_DP_ATTN` | Decode-side topology metadata |
| `EVAL_CONC`, `EVAL_CONCURRENT_REQUESTS` | Eval concurrency controls |

These variables are optional: they are passed through only when the launcher sets them, and the default lm-eval path works without any of them.

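As a hypothetical example, a launcher that wants the aggregation tooling to record a disaggregated run might forward a subset of these (all values below are placeholders; none are required for the default lm-eval path):

```bash
# Hypothetical values; set only what the downstream aggregation tooling needs.
export FRAMEWORK=sglang PRECISION=fp8 ISL=1024 OSL=1024
export PREFILL_TP=4 PREFILL_NUM_WORKERS=2 DECODE_TP=8 DECODE_NUM_WORKERS=1
export IS_MULTINODE=true
```
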
### Concurrency

The runner exports `EVAL_CONCURRENT_REQUESTS`, preferring `EVAL_CONC` when set and falling back to `256`:

```bash
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
```

When `EVAL_CONC` is not set, `do_sweep.py` defaults it to the max of the recipe benchmark concurrency list.

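The precedence, highest first, is therefore `EVAL_CONC`, then a pre-set `EVAL_CONCURRENT_REQUESTS`, then `256`. For example, invoking the runner script directly (illustrative):

```bash
# Highest precedence: EVAL_CONC, as set by do_sweep.py.
EVAL_CONC=64 bash /srtctl-benchmarks/lm-eval/bench.sh http://localhost:8000                 # -> 64
# Next: a pre-set EVAL_CONCURRENT_REQUESTS.
EVAL_CONCURRENT_REQUESTS=128 bash /srtctl-benchmarks/lm-eval/bench.sh http://localhost:8000 # -> 128
# Neither set: the runner falls back to 256.
bash /srtctl-benchmarks/lm-eval/bench.sh http://localhost:8000                              # -> 256
```
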
### Output

Eval artifacts are written to `/logs/eval_results/` inside the container (override with `EVAL_OUTPUT_DIR`):
- `results*.json` - lm-eval scores per task
- `sample*.jsonl` - per-sample outputs (when the harness emits them)

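Scores can be pulled straight out of the JSON; a minimal sketch using `python3` (already required by the runner), assuming the standard lm-eval results layout where per-task scores live under a top-level `results` key:

```bash
# Print per-task scores from every results file (results may sit in a
# per-model subdirectory depending on the harness version, so search recursively).
find /logs/eval_results -name 'results*.json' -print | while read -r f; do
    echo "== $f"
    python3 -c "import json,sys; print(json.dumps(json.load(open(sys.argv[1]))['results'], indent=2))" "$f"
done
```
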
### External eval harness

If the launcher needs to drive lm-eval through an existing `benchmark_lib.sh` (for example, an internal evaluation library that also handles summary generation), mount that workspace into the container and point the runner at it with one of:

- `LM_EVAL_LIB` - absolute container path to the `benchmark_lib.sh` file.
- `LM_EVAL_WORKSPACE` - host path to a workspace; it is mounted at `/lm-eval-workspace` inside the container, and the runner sources `/lm-eval-workspace/benchmarks/benchmark_lib.sh`.

When a `benchmark_lib.sh` is found, the runner sources it, calls `run_eval --framework lm-eval` and (if exported) `append_lm_eval_summary`, and copies `meta_env.json`, `results*.json`, and `sample*.jsonl` from the CWD into `EVAL_OUTPUT_DIR`. Otherwise, the default in-container `lm_eval` CLI path is used.
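
A hypothetical setup (both paths below are placeholders for your own workspace layout):

```bash
# Option 1: point directly at a benchmark_lib.sh that is already visible inside
# the container (hypothetical path).
export LM_EVAL_LIB=/opt/internal-evals/benchmarks/benchmark_lib.sh

# Option 2: mount a host workspace; the runner then sources
# /lm-eval-workspace/benchmarks/benchmark_lib.sh (hypothetical host path).
export LM_EVAL_WORKSPACE=/home/me/internal-evals
```
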
2 changes: 2 additions & 0 deletions src/srtctl/benchmarks/__init__.py
@@ -8,6 +8,7 @@
    custom,
    gpqa,
    gsm8k,
    lm_eval,
    longbenchv2,
    mmlu,
    mooncake_router,
@@ -30,6 +31,7 @@
"register_benchmark",
# Runners
"custom",
"lm_eval",
"sa_bench",
"sglang_bench",
"mmlu",
54 changes: 54 additions & 0 deletions src/srtctl/benchmarks/lm_eval.py
@@ -0,0 +1,54 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""lm-eval benchmark runner.

Runs the EleutherAI lm-evaluation-harness against the deployed OpenAI-compatible
endpoint. An external harness that exposes ``benchmark_lib.sh`` is used instead
when available — either via the ``LM_EVAL_LIB`` env var, or a workspace mounted
at ``/lm-eval-workspace`` (set host-side with ``LM_EVAL_WORKSPACE``).
"""

from __future__ import annotations

from typing import TYPE_CHECKING

from srtctl.benchmarks.base import SCRIPTS_DIR, BenchmarkRunner, register_benchmark

if TYPE_CHECKING:
    from srtctl.core.runtime import RuntimeContext
    from srtctl.core.schema import SrtConfig


@register_benchmark("lm-eval")
class LMEvalRunner(BenchmarkRunner):
    """lm-eval accuracy evaluation runner."""

    @property
    def name(self) -> str:
        return "lm-eval"

    @property
    def script_path(self) -> str:
        return "/srtctl-benchmarks/lm-eval/bench.sh"

    @property
    def local_script_dir(self) -> str:
        return str(SCRIPTS_DIR / "lm-eval")

    def validate_config(self, config: SrtConfig) -> list[str]:
        # lm-eval has sensible defaults.
        return []

    def build_command(
        self,
        config: SrtConfig,
        runtime: RuntimeContext,
    ) -> list[str]:
        endpoint = f"http://localhost:{runtime.frontend_port}"
        return [
            "bash",
            self.script_path,
            endpoint,
        ]
119 changes: 119 additions & 0 deletions src/srtctl/benchmarks/scripts/lm-eval/bench.sh
@@ -0,0 +1,119 @@
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# lm-eval accuracy evaluation.
#
# By default runs the EleutherAI lm-evaluation-harness CLI directly against
# the OpenAI-compatible endpoint. If an external eval harness is mounted at
# /lm-eval-workspace (or pointed to via LM_EVAL_LIB) and exposes a compatible
# benchmark_lib.sh, that harness is sourced and used instead.
#
# Expects: endpoint

set -e

ENDPOINT=$1

# Extract HOST and PORT from endpoint (e.g., http://localhost:8000)
HOST=$(echo "$ENDPOINT" | sed -E 's|https?://||; s|:.*||')
PORT=$(echo "$ENDPOINT" | sed -E 's|.*:([0-9]+).*|\1|')

echo "lm-eval config: endpoint=${ENDPOINT}; host=${HOST}; port=${PORT}"

# Auto-discover the served model name from /v1/models if MODEL_NAME is not set.
# This ensures we use the exact name the server recognizes, regardless of what
# $MODEL (a HuggingFace ID from the launcher) is set to.
if [[ -z "${MODEL_NAME:-}" ]]; then
DISCOVERED_MODEL=$(curl -sf "${ENDPOINT}/v1/models" 2>/dev/null \
| python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data'][0]['id'])" 2>/dev/null || true)
if [[ -n "$DISCOVERED_MODEL" ]]; then
export MODEL_NAME="$DISCOVERED_MODEL"
echo "Auto-discovered MODEL_NAME from /v1/models: ${MODEL_NAME}"
else
echo "WARNING: Could not discover model name from /v1/models, using MODEL_NAME=${MODEL_NAME:-$MODEL}"
fi
else
echo "Using MODEL_NAME from environment: ${MODEL_NAME}"
fi

# Output directory for eval artifacts. Override with EVAL_OUTPUT_DIR.
EVAL_OUTPUT_DIR="${EVAL_OUTPUT_DIR:-/logs/eval_results}"
mkdir -p "${EVAL_OUTPUT_DIR}"

# Concurrency. Prefer EVAL_CONC (set by orchestration) over EVAL_CONCURRENT_REQUESTS.
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
echo "Running lm-eval with concurrent-requests=${EVAL_CONCURRENT_REQUESTS}..."

eval_rc=0

# Let a host-provided harness take over if one is mounted. LM_EVAL_LIB points
# at a benchmark_lib.sh directly; otherwise we look for one at the standard
# /lm-eval-workspace mount point.
EXT_LIB="${LM_EVAL_LIB:-}"
if [[ -z "$EXT_LIB" && -f "/lm-eval-workspace/benchmarks/benchmark_lib.sh" ]]; then
EXT_LIB="/lm-eval-workspace/benchmarks/benchmark_lib.sh"
fi

if [[ -n "$EXT_LIB" && -f "$EXT_LIB" ]]; then
echo "Using external eval harness: $EXT_LIB"
# cd to the workspace root so relative paths (utils/evals/*.yaml) resolve.
cd "$(dirname "$(dirname "$EXT_LIB")")"
# shellcheck disable=SC1090
source "$EXT_LIB"
run_eval --framework lm-eval --port "$PORT" || eval_rc=$?

# Derive metadata env vars that append_lm_eval_summary expects but the
# launcher passes under different names.
export IS_MULTINODE="${IS_MULTINODE:-true}"
export TP="${TP:-${PREFILL_TP:-1}}"
export CONC="${CONC:-${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-1}}}"
export EP_SIZE="${EP_SIZE:-${PREFILL_EP:-1}}"
export DP_ATTENTION="${DP_ATTENTION:-${PREFILL_DP_ATTN:-false}}"
export PREFILL_DP_ATTENTION="${PREFILL_DP_ATTENTION:-${PREFILL_DP_ATTN:-${DP_ATTENTION:-false}}}"
export DECODE_DP_ATTENTION="${DECODE_DP_ATTENTION:-${DECODE_DP_ATTN:-${DP_ATTENTION:-false}}}"

if declare -F append_lm_eval_summary >/dev/null 2>&1; then
echo "Generating lm-eval summary..."
append_lm_eval_summary || true
fi

echo "Copying eval artifacts to ${EVAL_OUTPUT_DIR}/..."
cp -v meta_env.json "${EVAL_OUTPUT_DIR}/" 2>/dev/null || true
cp -v results*.json "${EVAL_OUTPUT_DIR}/" 2>/dev/null || true
cp -v sample*.jsonl "${EVAL_OUTPUT_DIR}/" 2>/dev/null || true
else
# Default path: run the EleutherAI lm-evaluation-harness CLI directly.
if ! command -v lm_eval >/dev/null 2>&1; then
echo "lm_eval CLI not found; installing lm-evaluation-harness via pip..."
python3 -m pip install --quiet --upgrade lm-eval || {
echo "ERROR: lm_eval CLI is not available and pip install failed."
exit 127
}
fi

TASKS="${LM_EVAL_TASKS:-gsm8k}"
MODEL_TYPE="${LM_EVAL_MODEL:-local-chat-completions}"
BASE_URL="${ENDPOINT%/}/v1/chat/completions"

echo "Running lm_eval on tasks=${TASKS} (model=${MODEL_NAME}, model_type=${MODEL_TYPE})"

MODEL_ARGS="base_url=${BASE_URL},model=${MODEL_NAME},num_concurrent=${EVAL_CONCURRENT_REQUESTS},tokenized_requests=False"

# shellcheck disable=SC2086
lm_eval \
--model "${MODEL_TYPE}" \
--model_args "${MODEL_ARGS}" \
--tasks "${TASKS}" \
--apply_chat_template \
--output_path "${EVAL_OUTPUT_DIR}" \
${LM_EVAL_EXTRA_ARGS:-} || eval_rc=$?
fi

if [[ "$eval_rc" -ne 0 ]]; then
echo "lm-eval evaluation failed with exit code ${eval_rc}"
exit "$eval_rc"
fi

echo "lm-eval evaluation complete"