This repository was archived by the owner on Apr 20, 2026. It is now read-only.

Multinode Evals #245

Closed
Oseltamivir wants to merge 6 commits into ishandhanani:sa-submission-q1-2026 from Oseltamivir:sa-submission-q1-2026

Conversation


@Oseltamivir Oseltamivir commented Apr 7, 2026

Summary

Add InferenceX multi-node eval support through an lm-eval benchmark runner and an eval-only orchestration path. This lets InferenceX run accuracy-only jobs against existing srt-slurm multi-node disaggregated recipes without running the throughput benchmark stage.

How

  • Add an lm-eval benchmark runner that sources InferenceX's benchmarks/benchmark_lib.sh from a mounted /infmax-workspace.
  • Mount INFMAX_WORKSPACE into the container as /infmax-workspace when provided.
  • Add EVAL_ONLY=true handling in do_sweep.py so eval-only jobs start infra/workers/frontend, run the full model health check, skip throughput, and launch lm-eval directly.
  • Keep RUN_EVAL=true behavior as a post-benchmark eval path for normal throughput jobs.
  • Pass model/framework/topology metadata into the eval container, including served MODEL_NAME, prefill/decode TP/EP/DPA/worker counts, sequence length, precision, runner type, and eval concurrency.
  • Map srt-slurm PREFILL_DP_ATTN / DECODE_DP_ATTN env vars to the InferenceX PREFILL_DP_ATTENTION / DECODE_DP_ATTENTION names expected by append_lm_eval_summary.
  • Copy eval outputs (meta_env.json, results*.json, sample*.jsonl) into /logs/eval_results/ for launcher-side artifact pickup.
  • Preserve partial eval artifacts on lm-eval failure while still returning the original eval failure code.
  • Document the InferenceX lm-eval integration in docs/accuracy.md.
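The env-var mapping step above can be sketched as a small translation pass over the container environment. This is an illustrative sketch, not the PR's actual code: the `build_eval_env` helper and `ENV_NAME_MAP` names are assumptions; only the variable names themselves come from the description.

```python
# Hypothetical sketch: srt-slurm exports PREFILL_DP_ATTN / DECODE_DP_ATTN,
# while InferenceX's append_lm_eval_summary expects the *_DP_ATTENTION names.
ENV_NAME_MAP = {
    "PREFILL_DP_ATTN": "PREFILL_DP_ATTENTION",
    "DECODE_DP_ATTN": "DECODE_DP_ATTENTION",
}

def build_eval_env(base_env: dict) -> dict:
    """Return a copy of base_env with srt-slurm names aliased to the
    InferenceX names, leaving the originals in place."""
    env = dict(base_env)
    for src, dst in ENV_NAME_MAP.items():
        if src in env and dst not in env:
            env[dst] = env[src]
    return env
```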

What

For EVAL_ONLY=true:

  • srt-slurm still starts the normal deployment topology.
  • The throughput benchmark runner is skipped.
  • wait_for_model() verifies the configured prefill/decode or aggregated worker counts.
  • lm-eval runs against the OpenAI-compatible endpoint.
  • Eval failure is fatal.
  • A low score leads to failure.
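The eval-only control flow above can be sketched as follows. The stage functions are passed in as callables; their names are illustrative stand-ins, not the actual do_sweep.py API.

```python
def run_eval_only(start_infra, wait_for_model, run_lm_eval, check_score) -> int:
    """EVAL_ONLY=true path: deploy, health-check, then run lm-eval.
    Any eval failure is fatal in this mode."""
    start_infra()                       # infra/workers/frontend still start
    wait_for_model()                    # full health check, not a port probe
    rc = run_lm_eval()                  # accuracy run against the OAI endpoint
    if rc != 0:
        return rc                       # eval failure is fatal
    return 0 if check_score() else 1    # a low score also fails the job
```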

For RUN_EVAL=true without EVAL_ONLY=true:

  • The normal benchmark runs first.
  • lm-eval runs as a post-step if throughput succeeds.
  • Eval failure is non-fatal to the benchmark result.
  • A low score leads to failure.
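By contrast, the RUN_EVAL=true path treats an eval infrastructure failure as non-fatal while still failing on a low score. Again a hedged sketch with illustrative callables, not the PR's code:

```python
def run_benchmark_then_eval(run_benchmark, run_lm_eval, check_score) -> int:
    """RUN_EVAL=true path: throughput first, lm-eval as a post-step."""
    rc = run_benchmark()
    if rc != 0:
        return rc                       # throughput failure ends the job
    if run_lm_eval() != 0:
        return 0                        # eval failure is non-fatal here
    return 0 if check_score() else 1    # a low score still fails the job
```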

Validation run

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24059388771

InferenceX PR

SemiAnalysisAI/InferenceX#1000

Oseltamivir and others added 6 commits April 4, 2026 11:38
Adds support for running lm-eval accuracy evaluations as a post-benchmark
step, leveraging the InferenceX benchmark_lib.sh harness.

- New LMEvalRunner registered as "lm-eval" benchmark type
- bench.sh script sources benchmark_lib.sh and calls run_eval/append_lm_eval_summary
- Post-benchmark eval hook in SweepOrchestrator.run() triggered by RUN_EVAL=true
- Auto-mount INFMAX_WORKSPACE into container when env var is set

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In eval-only mode the benchmark stage is skipped, which also skips
its model health check. The 30s port check in _run_post_eval is
insufficient — workers are still loading. Use wait_for_model() with
the full health check config (same as benchmark stage) when
EVAL_ONLY=true.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
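The full health check this commit substitutes for the 30s port probe might look like the polling loop below. Function name, parameters, and defaults are assumptions for illustration; the actual wait_for_model shares its config with the benchmark stage.

```python
import time

def wait_for_model(count_ready_workers, expected: int,
                   timeout_s: float = 1800.0, poll_s: float = 1.0) -> bool:
    """Poll until the expected number of workers report ready, or time out.
    count_ready_workers is a callable returning the current ready count."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if count_ready_workers() >= expected:
            return True
        time.sleep(poll_s)
    return False
```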
Instead of capping eval examples with --limit to avoid timeouts,
use the highest benchmark concurrency for eval requests. This runs
the full eval set faster by matching the throughput the server was
already benchmarked at.

do_sweep.py computes max(config.benchmark.concurrencies) and passes
it as EVAL_CONC to the lm-eval bench script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
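The concurrency selection described in this commit reduces to picking the largest configured benchmark concurrency and exporting it as EVAL_CONC. A minimal sketch; the helper name and env-dict shape mirror the description, not the code:

```python
def eval_concurrency(concurrencies: list) -> str:
    """Pick the highest benchmark concurrency for the lm-eval run,
    returned as a string suitable for an environment variable."""
    return str(max(concurrencies))

# e.g. a sweep configured with concurrencies [8, 64, 256]:
eval_env = {"EVAL_CONC": eval_concurrency([8, 64, 256])}
```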
@Oseltamivir (Author)

Continued in NVIDIA/srt-slurm#12

