85 changes: 84 additions & 1 deletion docs/accuracy.md
@@ -1,6 +1,6 @@
# Accuracy Benchmarks

In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa`, `longbenchv2`, and AIME (via the script under `configs/aime/`).
In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa`, `longbenchv2`, `lm-eval`, and AIME (via the script under `configs/aime/`).

## Table of Contents

@@ -16,6 +16,7 @@ In srt-slurm, users can run different accuracy benchmarks by setting the benchma
- [Example: Quick Validation](#example-quick-validation)
- [Output](#output)
- [Important Notes](#important-notes)
- [lm-eval (InferenceX)](#lm-eval-inferencex)

---

@@ -290,3 +291,85 @@ The output includes per-category scores and aggregate metrics:
3. **Throughput**: Increase `num_threads` for faster evaluation, but monitor for OOM errors
4. **Categories**: Running specific categories is useful for targeted validation (e.g., just testing summarization capabilities)


## lm-eval (InferenceX)

The `lm-eval` benchmark runner integrates [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) via InferenceX's `benchmark_lib.sh`. Unlike the built-in benchmarks above, this runner sources evaluation logic from an external InferenceX workspace mounted at `/infmax-workspace`.

This is used by InferenceX CI to run evals such as GSM8K and GPQA against NVIDIA multi-node disaggregated deployments on GB200, GB300, B200, B300, H100, and H200. AMD MI355X multi-node evals are handled by InferenceX's upstreamed AMD Slurm path, not by this srt-slurm runner.

In InferenceX CI, recipes normally keep their throughput benchmark configuration. `do_sweep.py` invokes the registered `lm-eval` runner as a post-step when `RUN_EVAL=true`, or as the only benchmark-like step when `EVAL_ONLY=true`. There is no separate `infmax-eval` benchmark type.
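
The stage ordering can be sketched roughly as follows. This is an illustrative outline of the behavior described above, not the actual `do_sweep.py` code (which is Python); `run_throughput_benchmark` and `run_lm_eval_runner` are hypothetical placeholders for the corresponding stages.

```bash
# Rough sketch of the stage ordering; the real logic lives in do_sweep.py.
# run_throughput_benchmark and run_lm_eval_runner are hypothetical placeholders.
if [[ "${EVAL_ONLY:-false}" == "true" ]]; then
  run_lm_eval_runner            # eval is the only benchmark-like step
elif [[ "${RUN_EVAL:-false}" == "true" ]]; then
  run_throughput_benchmark      # normal recipe benchmark first
  run_lm_eval_runner            # then eval as a post-step
else
  run_throughput_benchmark      # default: throughput benchmark only, no eval
fi
```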

### How it works

1. `RuntimeContext` mounts the host path from `INFMAX_WORKSPACE` at `/infmax-workspace` inside the Slurm container.
2. `do_sweep.py` starts infrastructure, workers, and the frontend for the normal recipe topology.
3. For `EVAL_ONLY=true`, `do_sweep.py` skips the throughput benchmark stage and runs `_run_post_eval()` directly after frontend startup.
4. `_run_post_eval()` waits for the OpenAI-compatible endpoint on port 8000 and, in eval-only mode, performs the full `wait_for_model()` health check for the configured prefill/decode or aggregated topology (see the sketch after this list).
5. `_run_post_eval()` launches the registered `lm-eval` runner on the head node and passes through InferenceX metadata such as framework, precision, sequence length, prefill/decode topology, and eval concurrency.
6. The runner script (`benchmarks/scripts/lm-eval/bench.sh`) uses `MODEL_NAME` from `do_sweep.py`, or auto-discovers the served model from `/v1/models` as a fallback.
7. The runner sources `/infmax-workspace/benchmarks/benchmark_lib.sh`, runs `run_eval --framework lm-eval`, and calls `append_lm_eval_summary`.
8. Eval artifacts are copied to `/logs/eval_results/` for InferenceX launcher-side artifact pickup.
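
Step 4 amounts to polling the OpenAI-compatible API until the frontend answers. The loop below is an illustrative sketch, not the actual `_run_post_eval()` implementation; the retry count and sleep interval are assumptions.

```bash
# Illustrative endpoint wait (step 4); retry count and interval are assumptions.
ENDPOINT="http://localhost:8000"
for _ in $(seq 1 120); do
  if curl -sf "${ENDPOINT}/v1/models" >/dev/null; then
    echo "Frontend is up at ${ENDPOINT}"
    break
  fi
  sleep 10
done
```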

### EVAL_ONLY mode

srt-slurm supports an `EVAL_ONLY` mode for CI jobs that should only validate accuracy. This is controlled by environment variables from the InferenceX workflow (see the sketch after the table):

| Env var | Description |
|---------|-------------|
| `EVAL_ONLY` | Set to `true` to skip the throughput benchmark stage and run eval only |
| `RUN_EVAL` | Set to `true` to run eval after the throughput benchmark completes |
| `EVAL_CONC` | Concurrent requests for lm-eval, normally set by InferenceX from the generated `eval-conc` value |
| `INFMAX_WORKSPACE` | Host path to the InferenceX checkout that should be mounted at `/infmax-workspace` |
| `MODEL_NAME` | Served model alias for OpenAI-compatible requests; set by `do_sweep.py` from `config.served_model_name` |
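
A minimal eval-only setup might look like the sketch below; the paths and values are placeholders, and `MODEL_NAME` is normally set by `do_sweep.py` rather than exported by hand.

```bash
# Illustrative eval-only environment; paths and values are placeholders.
export EVAL_ONLY=true
export INFMAX_WORKSPACE=/path/to/inferencex   # host checkout, mounted at /infmax-workspace
export EVAL_CONC=64                           # from the workflow's generated eval-conc value
# MODEL_NAME is set by do_sweep.py from config.served_model_name.
```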

When `EVAL_ONLY=true`:
- Stage 4 skips the throughput benchmark entirely. No throughput result JSON is expected from srt-slurm.
- The eval path uses the full `wait_for_model()` health check before starting lm-eval.
- `_run_post_eval()` launches the `lm-eval` runner and returns its exit code.
- Eval failure is fatal because eval is the only purpose of the job.

When `RUN_EVAL=true` (without `EVAL_ONLY`):
- Throughput benchmark runs normally.
- After the benchmark completes successfully, eval runs as a post-step.
- Eval failure is non-fatal; the benchmark job still succeeds if throughput passed (see the sketch below).
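
The difference in failure handling can be sketched as follows; again this is an outline of the described behavior, not the actual `do_sweep.py` code, and `run_lm_eval_runner` is a hypothetical placeholder.

```bash
# Sketch of eval failure propagation; run_lm_eval_runner is a hypothetical placeholder.
run_lm_eval_runner
eval_rc=$?
if [[ "${EVAL_ONLY:-false}" == "true" ]]; then
  exit "$eval_rc"   # eval-only: an eval failure fails the whole job
elif [[ "$eval_rc" -ne 0 ]]; then
  echo "lm-eval failed (rc=${eval_rc}); throughput result still stands"
fi
```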

### Environment variables

The following env vars are passed through to the lm-eval runner container:

| Env var | Purpose |
|---------|---------|
| `RUN_EVAL`, `EVAL_ONLY`, `IS_MULTINODE` | Control whether eval runs and how InferenceX classifies the artifact |
| `FRAMEWORK`, `PRECISION`, `MODEL_PREFIX`, `RUNNER_TYPE`, `SPEC_DECODING` | Benchmark identity metadata for `meta_env.json` |
| `ISL`, `OSL`, `RESULT_FILENAME` | Sequence length and result-file metadata |
| `MODEL`, `MODEL_PATH`, `MODEL_NAME` | Model metadata and the served model alias used for requests |
| `MAX_MODEL_LEN`, `EVAL_MAX_MODEL_LEN` | Context-length metadata used by InferenceX eval helpers when available |
| `PREFILL_TP`, `PREFILL_EP`, `PREFILL_NUM_WORKERS`, `PREFILL_DP_ATTN` | Prefill-side topology metadata |
| `DECODE_TP`, `DECODE_EP`, `DECODE_NUM_WORKERS`, `DECODE_DP_ATTN` | Decode-side topology metadata |
| `EVAL_CONC`, `EVAL_CONCURRENT_REQUESTS` | Eval concurrency controls |

The runner maps srt-slurm's `PREFILL_DP_ATTN` and `DECODE_DP_ATTN` names to InferenceX's `PREFILL_DP_ATTENTION` and `DECODE_DP_ATTENTION` names before calling `append_lm_eval_summary`. This is required for multi-node summary tables to preserve prefill/decode DPA state.

### Concurrency

Eval concurrency is ultimately read by InferenceX's `benchmark_lib.sh` from `EVAL_CONCURRENT_REQUESTS`. The runner script sets that value from `EVAL_CONC` when present, preserves an existing `EVAL_CONCURRENT_REQUESTS` otherwise, and falls back to `256` only if neither variable is set:

```bash
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
```

The InferenceX workflow sets `EVAL_CONC` from the generated `eval-conc` value. For multi-node configs, InferenceX selects the `8k1k` entry with the highest max eligible concurrency for each `(model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn)` group, then sets `eval-conc` to the upper median of that config's eligible concurrency list (for an eligible list of `[4, 8, 16, 32]`, for example, the upper median is `16`). If `EVAL_CONC` is not set in the environment, `do_sweep.py` falls back to the max of the recipe benchmark concurrency list.
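
The `EVAL_CONC` fallback can be pictured like this; the concurrency list is illustrative, and the real selection happens in `do_sweep.py` (Python), not shell.

```bash
# Illustrative fallback: prefer EVAL_CONC, else take the max of the recipe's
# benchmark concurrency list (values here are placeholders).
RECIPE_CONC_LIST="4 16 64 256"
if [[ -n "${EVAL_CONC:-}" ]]; then
  EFFECTIVE_EVAL_CONC="$EVAL_CONC"
else
  EFFECTIVE_EVAL_CONC=$(tr ' ' '\n' <<< "$RECIPE_CONC_LIST" | sort -n | tail -1)
fi
echo "eval concurrency: ${EFFECTIVE_EVAL_CONC}"
```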

### Output

Eval artifacts are written to `/logs/eval_results/` inside the container:
- `meta_env.json` - metadata used by InferenceX aggregation and summary tables
- `results*.json` - lm-eval scores per task
- `sample*.jsonl` - per-sample outputs

These are collected by the InferenceX NVIDIA launch scripts and uploaded as workflow artifacts. In eval-only mode the InferenceX workflow expects eval artifacts, not throughput benchmark artifacts.

### Intricacies
1. **Eval floor of 16**: one sweep config uses `conc: [1]`, which would make evals take more than 4 hours to complete, so eval concurrency is floored at 16.
2 changes: 2 additions & 0 deletions src/srtctl/benchmarks/__init__.py
@@ -8,6 +8,7 @@
custom,
gpqa,
gsm8k,
lm_eval,
longbenchv2,
mmlu,
mooncake_router,
@@ -30,6 +31,7 @@
"register_benchmark",
# Runners
"custom",
"lm_eval",
"sa_bench",
"sglang_bench",
"mmlu",
58 changes: 58 additions & 0 deletions src/srtctl/benchmarks/lm_eval.py
@@ -0,0 +1,58 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""lm-eval benchmark runner for InferenceX evals."""

from __future__ import annotations

from typing import TYPE_CHECKING

from srtctl.benchmarks.base import SCRIPTS_DIR, BenchmarkRunner, register_benchmark

if TYPE_CHECKING:
    from srtctl.core.runtime import RuntimeContext
    from srtctl.core.schema import SrtConfig


@register_benchmark("lm-eval")
class LMEvalRunner(BenchmarkRunner):
    """lm-eval accuracy evaluation using InferenceX benchmark_lib.

    Runs lm-eval via the InferenceX benchmark_lib.sh harness,
    which handles task selection, result collection, and summary generation.
    """

    @property
    def name(self) -> str:
        return "lm-eval"

    @property
    def script_path(self) -> str:
        return "/srtctl-benchmarks/lm-eval/bench.sh"

    @property
    def local_script_dir(self) -> str:
        return str(SCRIPTS_DIR / "lm-eval")

    def validate_config(self, config: SrtConfig) -> list[str]:
        # lm-eval has sensible defaults
        return []

    def build_command(
        self,
        config: SrtConfig,
        runtime: RuntimeContext,
    ) -> list[str]:
        endpoint = f"http://localhost:{runtime.frontend_port}"
        # Always use the container mount path, not the host path.
        # INFMAX_WORKSPACE env var contains the host path (used for mount setup
        # in runtime.py), but inside the container it's at /infmax-workspace.
        infmax_workspace = "/infmax-workspace"

        return [
            "bash",
            self.script_path,
            endpoint,
            infmax_workspace,
        ]
77 changes: 77 additions & 0 deletions src/srtctl/benchmarks/scripts/lm-eval/bench.sh
@@ -0,0 +1,77 @@
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# lm-eval accuracy evaluation using InferenceX benchmark_lib
# Expects: endpoint [infmax_workspace]

set -e

ENDPOINT=$1
INFMAX_WORKSPACE=${2:-/infmax-workspace}

# Extract HOST and PORT from endpoint (e.g., http://localhost:8000)
HOST=$(echo "$ENDPOINT" | sed -E 's|https?://||; s|:.*||')
PORT=$(echo "$ENDPOINT" | sed -E 's|.*:([0-9]+).*|\1|')

echo "lm-eval Config: endpoint=${ENDPOINT}; host=${HOST}; port=${PORT}; workspace=${INFMAX_WORKSPACE}"

# Auto-discover the served model name from /v1/models if MODEL_NAME is not set.
# This ensures we use the exact name the server recognizes, regardless of what
# $MODEL (the HuggingFace ID from the workflow) is set to.
if [[ -z "${MODEL_NAME:-}" ]]; then
DISCOVERED_MODEL=$(curl -sf "${ENDPOINT}/v1/models" 2>/dev/null \
| python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data'][0]['id'])" 2>/dev/null || true)
if [[ -n "$DISCOVERED_MODEL" ]]; then
export MODEL_NAME="$DISCOVERED_MODEL"
echo "Auto-discovered MODEL_NAME from /v1/models: ${MODEL_NAME}"
else
echo "WARNING: Could not discover model name from /v1/models, using MODEL_NAME=${MODEL_NAME:-$MODEL}"
fi
else
echo "Using MODEL_NAME from environment: ${MODEL_NAME}"
fi

# cd to workspace so that relative paths (e.g., utils/evals/*.yaml) resolve
cd "${INFMAX_WORKSPACE}"

# Source the InferenceX benchmark library
source "${INFMAX_WORKSPACE}/benchmarks/benchmark_lib.sh"

# Run lm-eval via benchmark_lib
# EVAL_CONC is set by the InferenceX workflow (median of conc list).
# benchmark_lib reads concurrency from EVAL_CONCURRENT_REQUESTS env var.
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
echo "Running lm-eval with concurrent-requests=${EVAL_CONCURRENT_REQUESTS}..."
eval_rc=0
run_eval --framework lm-eval --port "$PORT" || eval_rc=$?

# Derive metadata env vars that append_lm_eval_summary needs but do_sweep.py
# does not pass directly (it passes PREFILL_TP/EP/etc, not TP/EP_SIZE/CONC).
export IS_MULTINODE="${IS_MULTINODE:-true}"
export TP="${TP:-${PREFILL_TP:-1}}"
export CONC="${CONC:-${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-1}}}"
export EP_SIZE="${EP_SIZE:-${PREFILL_EP:-1}}"
export DP_ATTENTION="${DP_ATTENTION:-${PREFILL_DP_ATTN:-false}}"
# Remap srt-slurm's DP_ATTN names to InferenceX's DP_ATTENTION names
export PREFILL_DP_ATTENTION="${PREFILL_DP_ATTENTION:-${PREFILL_DP_ATTN:-${DP_ATTENTION:-false}}}"
export DECODE_DP_ATTENTION="${DECODE_DP_ATTENTION:-${DECODE_DP_ATTN:-${DP_ATTENTION:-false}}}"

# Generate the lm-eval summary
echo "Generating lm-eval summary..."
append_lm_eval_summary || true

# Copy eval artifacts to /logs/eval_results/
mkdir -p /logs/eval_results
echo "Copying eval artifacts to /logs/eval_results/..."
cp -v meta_env.json /logs/eval_results/ 2>/dev/null || true
cp -v results*.json /logs/eval_results/ 2>/dev/null || true
cp -v sample*.jsonl /logs/eval_results/ 2>/dev/null || true

if [[ "$eval_rc" -ne 0 ]]; then
echo "lm-eval evaluation failed with exit code ${eval_rc}"
exit "$eval_rc"
fi

echo "lm-eval evaluation complete"