This repository was archived by the owner on Apr 20, 2026. It is now read-only.
Multinode Evals #245
Closed
Oseltamivir wants to merge 6 commits into
Conversation
Adds support for running lm-eval accuracy evaluations as a post-benchmark step, leveraging the InferenceX benchmark_lib.sh harness.

- New LMEvalRunner registered as "lm-eval" benchmark type
- bench.sh script sources benchmark_lib.sh and calls run_eval/append_lm_eval_summary
- Post-benchmark eval hook in SweepOrchestrator.run() triggered by RUN_EVAL=true
- Auto-mount INFMAX_WORKSPACE into container when env var is set

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
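The pieces listed above can be sketched together as follows. Only the names SweepOrchestrator, RUN_EVAL, and INFMAX_WORKSPACE come from the commit message; the class structure and helper methods here are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch of the post-benchmark eval hook. SweepOrchestrator,
# RUN_EVAL, and INFMAX_WORKSPACE are names from the commit message; the
# structure and helpers below are assumptions, not the real code.
import os


class SweepOrchestrator:
    def run(self):
        self._run_benchmarks()
        # Post-benchmark eval hook, triggered by RUN_EVAL=true
        if os.environ.get("RUN_EVAL", "").lower() == "true":
            self._run_post_eval()

    def _run_benchmarks(self):
        """Placeholder for the existing throughput benchmark stage."""

    def _run_post_eval(self):
        """Placeholder for launching the lm-eval bench script with the
        workspace mount computed below."""
        self.mounts = self._workspace_mounts()

    @staticmethod
    def _workspace_mounts():
        # Auto-mount the InferenceX workspace into the container when the
        # env var is set, so bench.sh can source benchmark_lib.sh from it.
        workspace = os.environ.get("INFMAX_WORKSPACE")
        return [f"{workspace}:/infmax-workspace"] if workspace else []
```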
In eval-only mode the benchmark stage is skipped, which also skips its model health check. The 30s port check in _run_post_eval is insufficient: the port can open while workers are still loading the model. Use wait_for_model() with the full health check config (the same one the benchmark stage uses) when EVAL_ONLY=true.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
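A full readiness wait along these lines might look as follows. A wait_for_model() and its health-check config exist in the repo per the commit message, but this polling loop, the endpoint, and the response format are hedged assumptions for illustration:

```python
# Hedged sketch of a full model health check for eval-only mode. A bare
# 30s TCP port check is not enough because workers can accept connections
# while still loading weights; instead, poll until the configured number
# of workers report ready. Endpoint and response format are assumptions.
import time
import urllib.request


def wait_for_model(url: str, expected_workers: int,
                   timeout_s: float = 1800.0, poll_s: float = 10.0,
                   fetch=None) -> bool:
    """Poll a (hypothetical) health endpoint until expected_workers
    report ready, or until the timeout expires."""
    fetch = fetch or (lambda u: urllib.request.urlopen(u, timeout=5).read())
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            body = fetch(url)
            # Assumed response format: b"ready:<n>"
            ready = int(body.split(b":")[1])
            if ready >= expected_workers:
                return True
        except Exception:
            pass  # endpoint not up yet; keep polling
        time.sleep(poll_s)
    return False
```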
Instead of capping eval examples with --limit to avoid timeouts, use the highest benchmark concurrency for eval requests. This runs the full eval set faster by matching the throughput the server was already benchmarked at. do_sweep.py computes max(config.benchmark.concurrencies) and passes it as EVAL_CONC to the lm-eval bench script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
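The concurrency selection amounts to a one-liner over the sweep config; a minimal sketch, assuming a config shape like the following (the dataclass is hypothetical — the PR only says do_sweep.py computes max(config.benchmark.concurrencies)):

```python
# Sketch of how do_sweep.py could derive EVAL_CONC from the sweep config.
# BenchmarkConfig is an assumed shape for illustration.
from dataclasses import dataclass, field


@dataclass
class BenchmarkConfig:
    concurrencies: list = field(default_factory=list)


def eval_env(benchmark: BenchmarkConfig) -> dict:
    """Run the full eval set at the highest benchmarked concurrency
    instead of capping the number of examples with --limit."""
    return {"EVAL_CONC": str(max(benchmark.concurrencies))}
```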
Author
Continued in NVIDIA/srt-slurm#12
This was referenced Apr 17, 2026
Summary

Add InferenceX multi-node eval support through an `lm-eval` benchmark runner and an eval-only orchestration path. This lets InferenceX run accuracy-only jobs against existing srt-slurm multi-node disaggregated recipes without running the throughput benchmark stage.

How

- New `lm-eval` benchmark runner that sources InferenceX's `benchmarks/benchmark_lib.sh` from a mounted `/infmax-workspace`.
- Mounts `INFMAX_WORKSPACE` into the container as `/infmax-workspace` when provided.
- `EVAL_ONLY=true` handling in `do_sweep.py` so eval-only jobs start infra/workers/frontend, run the full model health check, skip throughput, and launch `lm-eval` directly.
- Keeps `RUN_EVAL=true` behavior as a post-benchmark eval path for normal throughput jobs.
- Passes `MODEL_NAME`, prefill/decode TP/EP/DPA/worker counts, sequence length, precision, runner type, and eval concurrency.
- Maps the `PREFILL_DP_ATTN`/`DECODE_DP_ATTN` env vars to the InferenceX `PREFILL_DP_ATTENTION`/`DECODE_DP_ATTENTION` names expected by `append_lm_eval_summary`.
- Copies eval artifacts (`meta_env.json`, `results*.json`, `sample*.jsonl`) into `/logs/eval_results/` for launcher-side artifact pickup.
- Documented in `docs/accuracy.md`.

What

For `EVAL_ONLY=true`:
- `wait_for_model()` verifies the configured prefill/decode or aggregated worker counts.
- `lm-eval` runs against the OpenAI-compatible endpoint.

For `RUN_EVAL=true` without `EVAL_ONLY=true`:
- `lm-eval` runs as a post-step if throughput succeeds.

Validation run
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24059388771
InferenceX PR
SemiAnalysisAI/InferenceX#1000
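The DP-attention env-var rename mentioned in the description can be sketched as a simple mapping. Only the variable names come from the PR; the helper itself is illustrative:

```python
# Hypothetical sketch: rename srt-slurm DP-attention env vars to the
# names InferenceX's append_lm_eval_summary expects. Variable names are
# from the PR description; the mapping helper is an assumption.
ENV_NAME_MAP = {
    "PREFILL_DP_ATTN": "PREFILL_DP_ATTENTION",
    "DECODE_DP_ATTN": "DECODE_DP_ATTENTION",
}


def map_eval_env(env: dict) -> dict:
    """Return a copy of env with the srt-slurm names replaced by the
    InferenceX names, leaving everything else untouched."""
    mapped = dict(env)
    for src, dst in ENV_NAME_MAP.items():
        if src in mapped and dst not in mapped:
            mapped[dst] = mapped.pop(src)
    return mapped
```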