examples : add llama-eval by ggerganov · Pull Request #21152 · ggml-org/llama.cpp

ggerganov · 2026-03-29T14:52:33Z

Overview

Adds a lean and mean evaluation tool:

Single Python script
Datasets: AIME, AIME2025, GSM8K, GPQA
Graders: regex, llm, custom
Store evaluation state in a json file
Realtime results
Output to stdout and HTML (with reasoning traces)
Supports stop/resume
Supports multiple eval servers

Sample usage:

# start a new AIME25 evaluation of gpt-oss-20b (low) using gpt-oss-20b (medium) as grader
python3 llama-eval.py \
  --model  gpt-oss-20b-hf-low  \
  --server http://127.0.0.1:8013 \
  --grader-type llm \
  --grader-model  gpt-oss-20b-hf-medium \
  --grader-server http://127.0.0.1:9013 \
  --dataset aime2025 --n_cases 240 \
  --temperature 1.0 --top-k 0 --top-p 1.0 --min-p 0.01 --threads 240 \
  --output aime2025-gpt-oss-20b-low-x8.json --seed 1234

# corresponding llama-server that will perform the computation
# note: no need for checkpoints and prompt caching
# note: for most evals you need at least -np 8 for reasonable eval time
./bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF -c 4194304 -np 256 \
  --port 8013 --host 0.0.0.0 \
  -cram 0 --ctx-checkpoints 0 \
  --chat-template-kwargs '{"reasoning_effort": "low"}'

# grader on port 9013
./bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF -c 32768 -np 1 \
  --port 9013 --host 0.0.0.0 \
  --chat-template-kwargs '{"reasoning_effort": "medium"}'

Sample results:

CLI

Loading AIME2025 dataset...
AIME2025 dataset loaded: 15 questions
Loading AIME2025 dataset (part 2)...
AIME2025 dataset loaded: 30 questions (total)

Tasks:
  Task ID             Dataset  Prompt (first 40 chars)                        Expected    Answer       Tokens  Status
  aime2025_000_020     AIME2025   Circle $\omega_1$ with radius 6 centered at...    293        N/A        N/A    pending
  aime2025_000_006     AIME2025   The twelve letters $A,B,C,D,E,F,G,H,I,J,K$,...    821        N/A        N/A    pending
  aime2025_000_008     AIME2025   The parabola with equation $y=x^{2}-4$ is r...    62         N/A        N/A    pending
  aime2025_000_004     AIME2025   There are $8!=40320$ eight-digit positive i...    279        N/A        N/A    pending
  aime2025_000_015     AIME2025   Six points $ A, B, C, D, E, $ and $ F $ lie...    468        N/A        N/A    pending
  aime2025_000_028     AIME2025   Let $ \triangle ABC $ be a right triangle w...    104        N/A        N/A    pending
  aime2025_000_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         N/A        N/A    pending
  aime2025_000_023     AIME2025   There are $ n $ values of $ x $ in the inte...    149        N/A        N/A    pending
  aime2025_000_022     AIME2025   From an unlimited supply of 1-cent coins, 1...    610        N/A        N/A    pending
  aime2025_000_019     AIME2025   Suppose $ \triangle ABC $ has angles $ \ang...    336        N/A        N/A    pending
  aime2025_000_012     AIME2025   Alex divides a disk into four quadrants wit...    204        N/A        N/A    pending
  aime2025_000_029     AIME2025   There are exactly three positive real numbe...    240        N/A        N/A    pending
  aime2025_000_009     AIME2025   The 27 cells of a $3\times9$ grid are fille...    81         N/A        N/A    pending
  aime2025_000_010     AIME2025   A piecewise linear periodic function is def...    259        N/A        N/A    pending
  aime2025_000_005     AIME2025   An isosceles trapezoid has an inscribed cir...    504        N/A        N/A    pending
  aime2025_000_016     AIME2025   Find the sum of all positive integers $ n $...    49         N/A        N/A    pending

Processing 240 AIME2025 tasks ...
Server: http://192.168.1.62:8014 (model: gpt-oss-20b-hf-low)
Grader: llm
Threads: 240
Sampling: temp=1.0, top-k=0, top-p=1.0, min-p=0.01
  1/240  aime2025_007_011     AIME2025   The set of points in 3-dimensional coordina...    510        78         157    ✗  [  0/  1, 0.000]
  2/240  aime2025_004_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         96         250    ✗  [  0/  2, 0.000]
  3/240  aime2025_005_023     AIME2025   There are $ n $ values of $ x $ in the inte...    149        N/A        N/A    ✗  [  0/  3, 0.000]
  4/240  aime2025_001_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         33         385    ✗  [  0/  4, 0.000]
  5/240  aime2025_002_011     AIME2025   The set of points in 3-dimensional coordina...    510        28         428    ✗  [  0/  5, 0.000]
  6/240  aime2025_006_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         42         515    ✗  [  0/  6, 0.000]
  7/240  aime2025_006_019     AIME2025   Suppose $ \triangle ABC $ has angles $ \ang...    336        336        530    ✓  [  1/  7, 0.143]
  8/240  aime2025_006_002     AIME2025   The 9 members of a baseball team went to an...    16         16         541    ✓  [  2/  8, 0.250]
  9/240  aime2025_007_029     AIME2025   There are exactly three positive real numbe...    240        0          573    ✗  [  2/  9, 0.222]
 10/240  aime2025_007_000     AIME2025   Find the sum of all integer bases $b>9$ for...    70         70         577    ✓  [  3/ 10, 0.300]
 11/240  aime2025_000_000     AIME2025   Find the sum of all integer bases $b>9$ for...    70         70         558    ✓  [  4/ 11, 0.364]
 12/240  aime2025_003_000     AIME2025   Find the sum of all integer bases $b>9$ for...    70         70         590    ✓  [  5/ 12, 0.417]
 ...

Session time: 3454.6s | Total accumulated time: 3454.6s
============================================================
Results: 91/240 correct (37.9%)
============================================================

Eval state dumped to aime2025-gpt-oss-20b-low-x8.json

HTML: test-3.json.html

Multi-server usage

Distribute evaluation tasks across multiple machines. Tasks are pulled dynamically from a shared queue — faster servers naturally get more work.

# evaluate across 3 servers with different thread counts
python3 llama-eval.py \
   --server http://192.168.0.1:8013,http://192.168.0.2:8013,http://192.168.0.3:8013 \
   --server-name server1,server2,server3 \
   --threads 64,32,16 \
   --model gpt-oss-20b-hf-low \
   --dataset aime2025 --n_cases 240 \
   --temperature 1.0 --top-k 0 --top-p 1.0 --min-p 0.01 \
   --output aime2025-gpt-oss-20b-low-x8.json --seed 1234

Additional information

I've been vibe coding this from time to time using local models and OpenCode (in the beginning and Pi in the end). Given that I don't write Python, I would guess the quality of the implementation is quite poor. Thought I've tried to keep it minimalistic.

TODOs:

Speed tracking (tok/s)
Support passing multiple evaluation servers in order to distribute the eval tasks to more machines
Better (i.e. simpler) HTML layout. Easier to read results
Result uncertainty estimate
Unslop

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, OpenCode + Qwen3 30B Coder, GLM 4.7 Flash, MiniMax M2.5, pi + Qwen3.6

strawberrymelonpanda · 2026-03-29T20:17:38Z

evaluation of gpt-oss-20b (low) using gpt-oss-20b (medium) as grader

Love the idea, but can you really trust the same 20B model to grade itself?
It's been awhile, but my own experiments with LLM grading have never been satisfactory.

I liked that #18892 seemed to be simple pass/fail, unless I've overlooked something.

ggerganov · 2026-03-29T20:29:37Z

The script also supports regex-based grader. Also a custom grader with your own script.

Generally, when using regex grading, I've seen quite a few false-negatives even when using the original gpt-oss sophisticated regexes.

With the current gpt-oss grader I haven't observed false-positives yet. Ideally, you would want to use gpt-oss-120b just to make sure. Though I think that the task of extracting a number from a paragraph of text should be solvable with gpt-oss-20b quite robustly.

Still if you spot a failure, please do report.

strawberrymelonpanda · 2026-03-29T20:36:13Z

Though I think that the task of extracting a number from a paragraph of text should be solvable with gpt-oss-20b quite robustly.

Fair. I'd also be curious what the minimum viable model for the task is. i.e., can Qwen 3.5 4B solve it reliably?
Something to tinker with.

I'll certainly pull the branch, but hoping this one makes it to a mainline tool. 😄

strawberrymelonpanda · 2026-03-29T20:42:29Z

The script also supports regex-based grader. Also a custom grader with your own script.
Generally, when using regex grading, I've seen quite a few false-negatives

I wonder if a "hybrid" option could cut down the eval time by only checking the false results, as a double-check. Seems like false-passes would be more rare.

Depending on the task that might not make a huge difference, when pass rates are well below 50%, but just musing.

Add a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. The simulator: - Implements /v1/chat/completions endpoint with OpenAI-compatible format - Loads AIME dataset from HuggingFace with local caching - Uses Levenshtein distance for intelligent question matching - Supports configurable success rate for correct/wrong answer generation - Provides debug logging for troubleshooting Also includes test scripts and documentation for testing and understanding the simulator functionality.

Extract repeating question string into TEST_QUESTION variable and create make_request() helper function to reduce code duplication. Add proper error handling for error responses.

Add summary of llama-server-simulator implementation work including features, testing results, technical decisions, and refactoring.

- Create new simplified evaluation script focused only on AIME - Implement EvalState and Processor dataclasses for structured state management - Add real-time feedback showing correct/incorrect status per case - Abstract grading interface for external grader support - Use structured JSON output for eval state - Apply HuggingFace dataset caching to avoid repeated downloads - Remove Levenshtein matching - eval script only sends requests and validates answers

- Add Grader class supporting regex and CLI-based grading - Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande - Add CLI grader interface: python script.py --answer <pred> --expected <gold> - Add HF telemetry disable to avoid warnings - Support exact match requirement for regex patterns - Add 30-second timeout for CLI grader - Handle both boxed and plain text formats for AIME answers

- Add ThreadPoolExecutor for parallel request processing controlled by --threads - Add --model argument to specify model name in request data - Refactor process() to use thread-safe _process_single_case() method - Update progress tracking to work with concurrent execution

…ter updates - Add threading support implementation details - Document ThreadPoolExecutor usage and thread safety - Add model parameter implementation details - Include testing results for both features

Replace all occurrences of "judge" with "grader" for consistency across the codebase (CLI args, Grader class fields, help text). Assisted-by: llama.cpp:local pi

Compute 95% CI on-the-fly from completed cases. Displayed in terminal output, HTML report, and JSON state.

Extract predicted_per_second from the server timings response and store it as tps_gen per task. Display in console progress, print_all_tasks, and HTML report. Assisted-by: llama.cpp:local pi

Extract predicted_ms from the server timings response and store it as t_gen_ms per task. Display in seconds with one decimal digit in console progress, print_all_tasks, and HTML report. Assisted-by: llama.cpp:local pi

…ix convention - _display suffix → display_ prefix (answer, tokens, tps, t_gen) - _escaped suffix → escaped_ prefix (response, prompt, reasoning) - _count suffix → n_ prefix (correct, incorrect, pending) Assisted-by: llama.cpp:local pi

…distribution - Add ServerConfig dataclass (url, threads, name) - Accept comma-separated --server, --threads, --server-name CLI args - Dynamic shared-queue task distribution across servers (fast servers do more work) - One ThreadPoolExecutor per server, workers pull from shared Queue - Track which server processed each task (server_name in results) - Thread-safe EvalState with threading.Lock for concurrent mutations - Server column in HTML report and console output - Backward compatible: single server works as before Assisted-by: llama.cpp:local pi

- Use HTTPServer + BaseHTTPRequestHandler instead of Flask - RequestHandler handles POST /v1/chat/completions - Server runs in daemon thread with clean Ctrl+C shutdown - Remove flask and unused asdict imports Assisted-by: llama.cpp:local pi

Assisted-by: llama.cpp:local pi

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new lightweight evaluation tool under examples/llama-eval/ for running dataset-based model evaluations against one or more llama-server instances, including grading and HTML reporting.

Changes:

Introduces llama-eval.py to run evals, resume from state JSON, and render an interactive HTML report.
Adds a llama-server-simulator.py plus a shell script to exercise the simulator locally.
Adds minimal README with quick-start usage.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.

File	Description
examples/llama-eval/llama-eval.py	Core eval runner: dataset loading, multi-server worker pool, grading, state/HTML output.
examples/llama-eval/llama-server-simulator.py	Local HTTP server that simulates `/v1/chat/completions` responses using AIME dataset.
examples/llama-eval/test-simulator.sh	Ad-hoc script to start the simulator and sanity-check responses.
examples/llama-eval/README.md	Quick start documentation and link to PR for fuller details.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        cache_path = Path.home() / ".cache" / "huggingface" / "datasets" / "AI-MO___aimo-validation-aime" / "default" / "0.0.0"
+        if cache_path.exists():
+            print(f"Using cached dataset from {cache_path}")
+            ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split, cache_dir=str(cache_path))
+        else:
+            ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split)


- Store model_name in EvalState and JSON output - Display model in HTML summary table - Verify --model matches stored model when resuming Assisted-by: llama.cpp:local pi

Assisted-by: llama.cpp:local pi

…pe llm Assisted-by: llama.cpp:local pi

Assisted-by: llama.cpp:local pi

- Replace verbose summary table with single inline bar - Shorten status text: '✓'/'✗'/'–'/'!' instead of full words - Flatten CSS: remove box-shadows, border-radius, reduce padding - Use system-ui font, 13px table, 12px details - Conditional reasoning section (only shown when present) - Single toggle JS function instead of two - Shorter column headers Assisted-by: llama.cpp:local pi

- Hit /v1/models for each server before evaluation - Exit with error if any server is unreachable - Print comma-separated model IDs per server in startup output - Sequential checks, no retries, no timeout override Assisted-by: llama.cpp:local pi

Assisted-by: llama.cpp:local pi

cmp-nct · 2026-05-12T13:01:07Z

Very good addition

JohannesGaessler · 2026-05-12T15:21:07Z

Getting something like this in llama.cpp is one of my current priorities. I'm willing to cooperate on the development but from a cursory check of the code I don't agree with all of the design decisions that were made and would want to make changes or else I would need to fork the code/make my own. Since "unslop" is currently unchecked: how finalized do you consider the design to be and do you intend to make further changes in a follow-up PR?

ggerganov · 2026-05-12T15:55:13Z

I realized I can't really unslop the implementation because I don't have a model of a "good python code" in my head, so left it as it is.

Feel free to modify it anyway you like, I don't feel strongly about any of these features, apart from being self-contained, single script without heavy dependencies. I don't plan to make changes soon, apart from fixing edge-cases when running the current functionality.

* working llama-eval mc and math suite * multi source llama-eval * Add readme * add checkpointing * examples: add llama-server simulator for testing eval scripts Add a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. The simulator: - Implements /v1/chat/completions endpoint with OpenAI-compatible format - Loads AIME dataset from HuggingFace with local caching - Uses Levenshtein distance for intelligent question matching - Supports configurable success rate for correct/wrong answer generation - Provides debug logging for troubleshooting Also includes test scripts and documentation for testing and understanding the simulator functionality. * examples: refactor test-simulator.sh for better readability Extract repeating question string into TEST_QUESTION variable and create make_request() helper function to reduce code duplication. Add proper error handling for error responses. * docs: update llama-eval-discussion.md with session work summary Add summary of llama-server-simulator implementation work including features, testing results, technical decisions, and refactoring. * examples: add simplified llama-eval-new.py for AIME evaluation - Create new simplified evaluation script focused only on AIME - Implement EvalState and Processor dataclasses for structured state management - Add real-time feedback showing correct/incorrect status per case - Abstract grading interface for external grader support - Use structured JSON output for eval state - Apply HuggingFace dataset caching to avoid repeated downloads - Remove Levenshtein matching - eval script only sends requests and validates answers * docs: remove README.md from llama-eval * examples: implement flexible grader system for answer validation - Add Grader class supporting regex and CLI-based grading - Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande - Add CLI grader interface: python script.py --answer <pred> --expected <gold> - Add HF telemetry disable to avoid warnings - Support exact match requirement for regex patterns - Add 30-second timeout for CLI grader - Handle both boxed and plain text formats for AIME answers * examples: use HF_HUB_OFFLINE to avoid HF Hub warnings * examples: remove HF_HUB_OFFLINE to allow dataset download * examples: use cached dataset path to avoid HF Hub requests * examples: use cached dataset path in simulator to avoid HF Hub requests * docs: update llama-eval-discussion.md with session work summary * examples: add threading support and model parameter to llama-eval-new.py - Add ThreadPoolExecutor for parallel request processing controlled by --threads - Add --model argument to specify model name in request data - Refactor process() to use thread-safe _process_single_case() method - Update progress tracking to work with concurrent execution * docs: update llama-eval-discussion.md with threading and model parameter updates - Add threading support implementation details - Document ThreadPoolExecutor usage and thread safety - Add model parameter implementation details - Include testing results for both features * examples: add task summary table to llama-eval-new.py * eval : print progress * eval : add prompts * test : fix path * sim : fix answer matching * eval : support multiple dataset runs * minor * improve grader * docs * remove old files * datasets : add gsm8k * add gpqa + sampling + docs * rename * grader : improve example answers * cont * datasets : add aime2025 * grader : update prompt * grade : improve regex + logs * datasets : fix aime2025 * cleanup * add AGENTS.md * ignore errors * resume eval * cleanup * fix counts * simplify * fix prompts * add html * store full response * add tokens * resoning and error handling * refactor * track total time * remove junk * eval : unify "judge" terminology to "grader" Replace all occurrences of "judge" with "grader" for consistency across the codebase (CLI args, Grader class fields, help text). Assisted-by: llama.cpp:local pi * eval : add Wilson score confidence interval to results Compute 95% CI on-the-fly from completed cases. Displayed in terminal output, HTML report, and JSON state. * llama-eval : add per-task generation speed from server timings Extract predicted_per_second from the server timings response and store it as tps_gen per task. Display in console progress, print_all_tasks, and HTML report. Assisted-by: llama.cpp:local pi * llama-eval : add per-task generation time from server timings Extract predicted_ms from the server timings response and store it as t_gen_ms per task. Display in seconds with one decimal digit in console progress, print_all_tasks, and HTML report. Assisted-by: llama.cpp:local pi * llama-eval : rename display, escaped, and count variables to use prefix convention - _display suffix → display_ prefix (answer, tokens, tps, t_gen) - _escaped suffix → escaped_ prefix (response, prompt, reasoning) - _count suffix → n_ prefix (correct, incorrect, pending) Assisted-by: llama.cpp:local pi * llama-eval : support multiple evaluation endpoints with dynamic task distribution - Add ServerConfig dataclass (url, threads, name) - Accept comma-separated --server, --threads, --server-name CLI args - Dynamic shared-queue task distribution across servers (fast servers do more work) - One ThreadPoolExecutor per server, workers pull from shared Queue - Track which server processed each task (server_name in results) - Thread-safe EvalState with threading.Lock for concurrent mutations - Server column in HTML report and console output - Backward compatible: single server works as before Assisted-by: llama.cpp:local pi * llama-server-simulator : replace Flask with stdlib http.server - Use HTTPServer + BaseHTTPRequestHandler instead of Flask - RequestHandler handles POST /v1/chat/completions - Server runs in daemon thread with clean Ctrl+C shutdown - Remove flask and unused asdict imports Assisted-by: llama.cpp:local pi * llama-eval : update README with PR link and quick-start examples Assisted-by: llama.cpp:local pi * llama-eval : track model name in eval state and verify on resume - Store model_name in EvalState and JSON output - Display model in HTML summary table - Verify --model matches stored model when resuming Assisted-by: llama.cpp:local pi * llama-server-simulator : fix comment - Dice coefficient, not Levenshtein Assisted-by: llama.cpp:local pi * llama-eval : require --grader-model or --model when using --grader-type llm Assisted-by: llama.cpp:local pi * llama-eval : protect dump() with lock for thread safety Assisted-by: llama.cpp:local pi * llama-eval : compact HTML report output - Replace verbose summary table with single inline bar - Shorten status text: '✓'/'✗'/'–'/'!' instead of full words - Flatten CSS: remove box-shadows, border-radius, reduce padding - Use system-ui font, 13px table, 12px details - Conditional reasoning section (only shown when present) - Single toggle JS function instead of two - Shorter column headers Assisted-by: llama.cpp:local pi * llama-eval : check server connectivity on startup - Hit /v1/models for each server before evaluation - Exit with error if any server is unreachable - Print comma-separated model IDs per server in startup output - Sequential checks, no retries, no timeout override Assisted-by: llama.cpp:local pi * llama-eval : use server1/server2 instead of gpu1/gpu2 in README Assisted-by: llama.cpp:local pi --------- Co-authored-by: gatbontonpc <gatbontonpc@gmail.com>

github-actions Bot added examples python python script changes labels Mar 29, 2026

ggerganov mentioned this pull request Mar 29, 2026

llama : rotate activations for better quantization #21038

Merged

2 tasks

This was referenced Mar 30, 2026

Handle reasoning budget #20297

Merged

server: save and clear idle slots on new task (--clear-idle) #20993

Merged

gatbontonpc and others added 21 commits May 10, 2026 18:13

working llama-eval mc and math suite

db8b09d

multi source llama-eval

4db4497

Add readme

c7f3ce2

add checkpointing

5cbe95b

examples: refactor test-simulator.sh for better readability

05b8425

Extract repeating question string into TEST_QUESTION variable and create make_request() helper function to reduce code duplication. Add proper error handling for error responses.

docs: update llama-eval-discussion.md with session work summary

deed078

Add summary of llama-server-simulator implementation work including features, testing results, technical decisions, and refactoring.

docs: remove README.md from llama-eval

de8eda4

examples: use HF_HUB_OFFLINE to avoid HF Hub warnings

30ea512

examples: remove HF_HUB_OFFLINE to allow dataset download

d7d2c22

examples: use cached dataset path to avoid HF Hub requests

edc766c

examples: use cached dataset path in simulator to avoid HF Hub requests

3732aea

docs: update llama-eval-discussion.md with session work summary

2fe445c

docs: update llama-eval-discussion.md with threading and model parame…

d639ee5

…ter updates - Add threading support implementation details - Document ThreadPoolExecutor usage and thread safety - Add model parameter implementation details - Include testing results for both features

examples: add task summary table to llama-eval-new.py

ee9b715

eval : print progress

940364e

eval : add prompts

1a780f7

test : fix path

64720e1

ggerganov added 4 commits May 10, 2026 18:13

track total time

e0a2cf4

remove junk

633a68d

eval : unify "judge" terminology to "grader"

7d433f7

Replace all occurrences of "judge" with "grader" for consistency across the codebase (CLI args, Grader class fields, help text). Assisted-by: llama.cpp:local pi

eval : add Wilson score confidence interval to results

81a65cf

Compute 95% CI on-the-fly from completed cases. Displayed in terminal output, HTML report, and JSON state.

ggerganov force-pushed the gg/scripts-eval branch from 1c128d9 to 81a65cf Compare May 10, 2026 15:47

ggerganov added 6 commits May 10, 2026 19:05

llama-eval : add per-task generation speed from server timings

4d5dedc

Extract predicted_per_second from the server timings response and store it as tps_gen per task. Display in console progress, print_all_tasks, and HTML report. Assisted-by: llama.cpp:local pi

llama-eval : add per-task generation time from server timings

9f10d8d

Extract predicted_ms from the server timings response and store it as t_gen_ms per task. Display in seconds with one decimal digit in console progress, print_all_tasks, and HTML report. Assisted-by: llama.cpp:local pi

llama-eval : update README with PR link and quick-start examples

094554d

Assisted-by: llama.cpp:local pi

ggerganov marked this pull request as ready for review May 10, 2026 18:25

ggerganov requested a review from Copilot May 10, 2026 18:25

Copilot AI reviewed May 10, 2026

View reviewed changes

ggerganov added 4 commits May 10, 2026 21:43

llama-eval : track model name in eval state and verify on resume

e5ac6d1

- Store model_name in EvalState and JSON output - Display model in HTML summary table - Verify --model matches stored model when resuming Assisted-by: llama.cpp:local pi

llama-server-simulator : fix comment - Dice coefficient, not Levenshtein

85c6aa0

Assisted-by: llama.cpp:local pi

llama-eval : require --grader-model or --model when using --grader-ty…

d5165e8

…pe llm Assisted-by: llama.cpp:local pi

llama-eval : protect dump() with lock for thread safety

f49c636

Assisted-by: llama.cpp:local pi

Copilot started reviewing on behalf of ggerganov May 10, 2026 19:23 View session

ggerganov added 3 commits May 12, 2026 14:50

llama-eval : use server1/server2 instead of gpu1/gpu2 in README

f634472

Assisted-by: llama.cpp:local pi

ggerganov merged commit fde69a3 into master May 12, 2026
4 checks passed

ggerganov deleted the gg/scripts-eval branch May 12, 2026 13:06

CISC mentioned this pull request May 12, 2026

examples : enable llama-eval type check #22988

Merged

Conversation

ggerganov commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Multi-server usage

Additional information

Requirements

Uh oh!

strawberrymelonpanda commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggerganov commented Mar 29, 2026

Uh oh!

strawberrymelonpanda commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

strawberrymelonpanda commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cmp-nct commented May 12, 2026

Uh oh!

JohannesGaessler commented May 12, 2026

Uh oh!

ggerganov commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

ggerganov commented Mar 29, 2026 •

edited

Loading

strawberrymelonpanda commented Mar 29, 2026 •

edited

Loading

strawberrymelonpanda commented Mar 29, 2026 •

edited

Loading

strawberrymelonpanda commented Mar 29, 2026 •

edited

Loading