examples : add llama-eval #21152

Merged
ggerganov merged 66 commits into master from gg/scripts-eval on May 12, 2026

Conversation

@ggerganov
Member

@ggerganov ggerganov commented Mar 29, 2026

Overview

ref #18195
cont #18892

Adds a lean and mean evaluation tool:

  • Single Python script
  • Datasets: AIME, AIME2025, GSM8K, GPQA
  • Graders: regex, llm, custom (a custom-grader sketch follows below)
  • Stores evaluation state in a JSON file
  • Real-time results
  • Output to stdout and HTML (with reasoning traces)
  • Supports stop/resume
  • Supports multiple eval servers
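
For the custom grader, the commit log describes the interface as an external command invoked like python script.py --answer <pred> --expected <gold>. A minimal sketch of such a grader; the stdout convention below is an assumption, so check llama-eval.py for the exact protocol:

#!/usr/bin/env python3
# my-grader.py - hypothetical custom grader sketch (not part of this PR).
# The --answer/--expected CLI matches the interface described in the
# commit log; printing "correct"/"incorrect" is an assumed convention.
import argparse

def normalize(s):
    # Strip whitespace and an optional \boxed{...} wrapper before comparing.
    s = s.strip()
    if s.startswith("\\boxed{") and s.endswith("}"):
        s = s[len("\\boxed{"):-1].strip()
    return s

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--answer", required=True)    # model's predicted answer
    ap.add_argument("--expected", required=True)  # gold answer from the dataset
    args = ap.parse_args()
    ok = normalize(args.answer) == normalize(args.expected)
    print("correct" if ok else "incorrect")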

Sample usage:

# start a new AIME25 evaluation of gpt-oss-20b (low) using gpt-oss-20b (medium) as grader
python3 llama-eval.py \
  --model  gpt-oss-20b-hf-low  \
  --server http://127.0.0.1:8013 \
  --grader-type llm \
  --grader-model  gpt-oss-20b-hf-medium \
  --grader-server http://127.0.0.1:9013 \
  --dataset aime2025 --n_cases 240 \
  --temperature 1.0 --top-k 0 --top-p 1.0 --min-p 0.01 --threads 240 \
  --output aime2025-gpt-oss-20b-low-x8.json --seed 1234

# corresponding llama-server that will perform the computation
# note: no need for checkpoints and prompt caching
# note: for most evals you need at least -np 8 for reasonable eval time
./bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF -c 4194304 -np 256 \
  --port 8013 --host 0.0.0.0 \
  -cram 0 --ctx-checkpoints 0 \
  --chat-template-kwargs '{"reasoning_effort": "low"}'

# grader on port 9013
./bin/llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF -c 32768 -np 1 \
  --port 9013 --host 0.0.0.0 \
  --chat-template-kwargs '{"reasoning_effort": "medium"}'

Sample results:

CLI
Loading AIME2025 dataset...
AIME2025 dataset loaded: 15 questions
Loading AIME2025 dataset (part 2)...
AIME2025 dataset loaded: 30 questions (total)

Tasks:
  Task ID             Dataset  Prompt (first 40 chars)                        Expected    Answer       Tokens  Status
  aime2025_000_020     AIME2025   Circle $\omega_1$ with radius 6 centered at...    293        N/A        N/A    pending
  aime2025_000_006     AIME2025   The twelve letters $A,B,C,D,E,F,G,H,I,J,K$,...    821        N/A        N/A    pending
  aime2025_000_008     AIME2025   The parabola with equation $y=x^{2}-4$ is r...    62         N/A        N/A    pending
  aime2025_000_004     AIME2025   There are $8!=40320$ eight-digit positive i...    279        N/A        N/A    pending
  aime2025_000_015     AIME2025   Six points $ A, B, C, D, E, $ and $ F $ lie...    468        N/A        N/A    pending
  aime2025_000_028     AIME2025   Let $ \triangle ABC $ be a right triangle w...    104        N/A        N/A    pending
  aime2025_000_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         N/A        N/A    pending
  aime2025_000_023     AIME2025   There are $ n $ values of $ x $ in the inte...    149        N/A        N/A    pending
  aime2025_000_022     AIME2025   From an unlimited supply of 1-cent coins, 1...    610        N/A        N/A    pending
  aime2025_000_019     AIME2025   Suppose $ \triangle ABC $ has angles $ \ang...    336        N/A        N/A    pending
  aime2025_000_012     AIME2025   Alex divides a disk into four quadrants wit...    204        N/A        N/A    pending
  aime2025_000_029     AIME2025   There are exactly three positive real numbe...    240        N/A        N/A    pending
  aime2025_000_009     AIME2025   The 27 cells of a $3\times9$ grid are fille...    81         N/A        N/A    pending
  aime2025_000_010     AIME2025   A piecewise linear periodic function is def...    259        N/A        N/A    pending
  aime2025_000_005     AIME2025   An isosceles trapezoid has an inscribed cir...    504        N/A        N/A    pending
  aime2025_000_016     AIME2025   Find the sum of all positive integers $ n $...    49         N/A        N/A    pending

Processing 240 AIME2025 tasks ...
Server: http://192.168.1.62:8014 (model: gpt-oss-20b-hf-low)
Grader: llm
Threads: 240
Sampling: temp=1.0, top-k=0, top-p=1.0, min-p=0.01
  1/240  aime2025_007_011     AIME2025   The set of points in 3-dimensional coordina...    510        78         157    ✗  [  0/  1, 0.000]
  2/240  aime2025_004_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         96         250    ✗  [  0/  2, 0.000]
  3/240  aime2025_005_023     AIME2025   There are $ n $ values of $ x $ in the inte...    149        N/A        N/A    ✗  [  0/  3, 0.000]
  4/240  aime2025_001_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         33         385    ✗  [  0/  4, 0.000]
  5/240  aime2025_002_011     AIME2025   The set of points in 3-dimensional coordina...    510        28         428    ✗  [  0/  5, 0.000]
  6/240  aime2025_006_013     AIME2025   Let $ABCDE$ be a convex pentagon with $AB=1...    60         42         515    ✗  [  0/  6, 0.000]
  7/240  aime2025_006_019     AIME2025   Suppose $ \triangle ABC $ has angles $ \ang...    336        336        530    ✓  [  1/  7, 0.143]
  8/240  aime2025_006_002     AIME2025   The 9 members of a baseball team went to an...    16         16         541    ✓  [  2/  8, 0.250]
  9/240  aime2025_007_029     AIME2025   There are exactly three positive real numbe...    240        0          573    ✗  [  2/  9, 0.222]
 10/240  aime2025_007_000     AIME2025   Find the sum of all integer bases $b>9$ for...    70         70         577    ✓  [  3/ 10, 0.300]
 11/240  aime2025_000_000     AIME2025   Find the sum of all integer bases $b>9$ for...    70         70         558    ✓  [  4/ 11, 0.364]
 12/240  aime2025_003_000     AIME2025   Find the sum of all integer bases $b>9$ for...    70         70         590    ✓  [  5/ 12, 0.417]
 ...

Session time: 3454.6s | Total accumulated time: 3454.6s
============================================================
Results: 91/240 correct (37.9%)
============================================================

Eval state dumped to aime2025-gpt-oss-20b-low-x8.json

HTML: test-3.json.html


Multi-server usage

Distribute evaluation tasks across multiple machines. Tasks are pulled dynamically from a shared queue — faster servers naturally get more work.

# evaluate across 3 servers with different thread counts
python3 llama-eval.py \
   --server http://192.168.0.1:8013,http://192.168.0.2:8013,http://192.168.0.3:8013 \
   --server-name server1,server2,server3 \
   --threads 64,32,16 \
   --model gpt-oss-20b-hf-low \
   --dataset aime2025 --n_cases 240 \
   --temperature 1.0 --top-k 0 --top-p 1.0 --min-p 0.01 \
   --output aime2025-gpt-oss-20b-low-x8.json --seed 1234
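
Per the commit log, this is implemented as one ThreadPoolExecutor per server with workers pulling from a shared Queue. A minimal sketch of that pattern, with illustrative names rather than the script's actual code:

import queue
from concurrent.futures import ThreadPoolExecutor

def process_task(url, task):
    # Placeholder: POST the prompt to url's /v1/chat/completions and grade it.
    ...

def run_distributed(tasks, servers):
    # servers: list of (url, n_threads) pairs, e.g. [("http://192.168.0.1:8013", 64), ...]
    q = queue.Queue()
    for t in tasks:
        q.put(t)

    def worker(url):
        # Workers drain the shared queue; a faster server finishes sooner
        # and comes back for more, so it naturally does more of the work.
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return
            process_task(url, task)

    pools = [ThreadPoolExecutor(max_workers=n) for _, n in servers]
    futures = [pool.submit(worker, url)
               for pool, (url, n) in zip(pools, servers)
               for _ in range(n)]
    for f in futures:
        f.result()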

Additional information

I've been vibe coding this from time to time using local models, with OpenCode in the beginning and Pi in the end. Given that I don't write Python, I would guess the quality of the implementation is quite poor, though I've tried to keep it minimalistic.

TODOs:

  • Speed tracking (tok/s)
  • Support passing multiple evaluation servers in order to distribute the eval tasks to more machines
  • Better (i.e. simpler) HTML layout. Easier to read results
  • Result uncertainty estimate
  • Unslop

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, OpenCode + Qwen3 30B Coder, GLM 4.7 Flash, MiniMax M2.5, pi + Qwen3.6

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 29, 2026

evaluation of gpt-oss-20b (low) using gpt-oss-20b (medium) as grader

Love the idea, but can you really trust the same 20B model to grade itself?
It's been a while, but my own experiments with LLM grading have never been satisfactory.

I liked that #18892 seemed to be simple pass/fail, unless I've overlooked something.

@ggerganov
Member Author

The script also supports a regex-based grader, as well as a custom grader using your own script.

Generally, when using regex grading, I've seen quite a few false negatives, even with the original gpt-oss's sophisticated regexes.

With the current gpt-oss grader I haven't observed any false positives yet. Ideally, you would want to use gpt-oss-120b just to make sure, though I think the task of extracting a number from a paragraph of text should be solvable with gpt-oss-20b quite robustly.

Still, if you spot a failure, please do report it.
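
As an illustration of where regex grading loses answers, a hypothetical extraction sketch (not the script's actual patterns); any phrasing the patterns don't anticipate turns a correct answer into a false negative:

import re

# Hypothetical AIME-style answer extraction. Answers may arrive as
# \boxed{293} or as plain text such as "The answer is 293"; anything
# these patterns miss is graded incorrect even if the model was right.
BOXED = re.compile(r"\\boxed\{\s*(\d+)\s*\}")
PLAIN = re.compile(r"answer\s+is\s*:?\s*(\d+)", re.IGNORECASE)

def extract_answer(text):
    m = BOXED.search(text) or PLAIN.search(text)
    return m.group(1) if m else None  # None -> graded as incorrect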

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 29, 2026

Though I think that the task of extracting a number from a paragraph of text should be solvable with gpt-oss-20b quite robustly.

Fair. I'd also be curious what the minimum viable model for the task is, i.e. can Qwen 3.5 4B solve it reliably?
Something to tinker with.

I'll certainly pull the branch, but I'm hoping this one makes it in as a mainline tool. 😄

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 29, 2026

The script also supports regex-based grader. Also a custom grader with your own script.
Generally, when using regex grading, I've seen quite a few false-negatives

I wonder if a "hybrid" option could cut down eval time by double-checking only the failed results with the LLM. False passes seem like they would be rarer.

Depending on the task that might not make a huge difference when pass rates are well below 50%, but just musing.
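
The hybrid idea as hypothetical pseudocode (regex_grade and llm_grade are illustrative names, not functions in llama-eval.py):

def hybrid_grade(pred, gold):
    # Trust cheap regex passes; only failures get the expensive LLM
    # double-check, on the assumption that false passes are rare.
    if regex_grade(pred, gold):
        return True
    return llm_grade(pred, gold)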

gatbontonpc and others added 21 commits May 10, 2026 18:13
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding
the simulator functionality.
Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.
Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
…ter updates

- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
ggerganov added 4 commits May 10, 2026 18:13
Replace all occurrences of "judge" with "grader" for consistency
across the codebase (CLI args, Grader class fields, help text).

Assisted-by: llama.cpp:local pi
Compute 95% CI on-the-fly from completed cases. Displayed in
terminal output, HTML report, and JSON state.
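
For reference, the Wilson score interval for k correct out of n completed cases at 95% confidence (z ≈ 1.96) is a standard formula; the sketch below illustrates it and is not the script's actual code:

import math

def wilson_ci(k, n, z=1.96):
    # 95% Wilson score confidence interval for a proportion k/n.
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

# e.g. the 91/240 sample run above: wilson_ci(91, 240) ~ (0.32, 0.44)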
ggerganov added 6 commits May 10, 2026 19:05
Extract predicted_per_second from the server timings response and store
it as tps_gen per task. Display in console progress, print_all_tasks,
and HTML report.

Assisted-by: llama.cpp:local pi
Extract predicted_ms from the server timings response and store it as
t_gen_ms per task. Display in seconds with one decimal digit in console
progress, print_all_tasks, and HTML report.

Assisted-by: llama.cpp:local pi
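
Both values come from the timings object that llama-server includes in its completion responses; a minimal sketch of the extraction, assuming that response shape (field names per the commit messages above):

def extract_timings(response_json):
    # response_json: parsed /v1/chat/completions reply from llama-server.
    timings = response_json.get("timings", {})
    return (timings.get("predicted_per_second"),  # generation speed, tok/s
            timings.get("predicted_ms"))          # generation time, ms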
…ix convention

- _display suffix → display_ prefix (answer, tokens, tps, t_gen)
- _escaped suffix → escaped_ prefix (response, prompt, reasoning)
- _count suffix → n_ prefix (correct, incorrect, pending)

Assisted-by: llama.cpp:local pi
…distribution

- Add ServerConfig dataclass (url, threads, name)
- Accept comma-separated --server, --threads, --server-name CLI args
- Dynamic shared-queue task distribution across servers (fast servers do more work)
- One ThreadPoolExecutor per server, workers pull from shared Queue
- Track which server processed each task (server_name in results)
- Thread-safe EvalState with threading.Lock for concurrent mutations
- Server column in HTML report and console output
- Backward compatible: single server works as before

Assisted-by: llama.cpp:local pi
- Use HTTPServer + BaseHTTPRequestHandler instead of Flask
- RequestHandler handles POST /v1/chat/completions
- Server runs in daemon thread with clean Ctrl+C shutdown
- Remove flask and unused asdict imports

Assisted-by: llama.cpp:local pi
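
A minimal sketch of that stdlib-only pattern, assuming an OpenAI-style response body (illustrative, not the simulator's actual code):

import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class RequestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        n = int(self.headers.get("Content-Length", 0))
        _request = json.loads(self.rfile.read(n))  # prompt, sampling params, ...
        reply = {"choices": [{"message": {"role": "assistant", "content": "42"}}]}
        data = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    srv = HTTPServer(("127.0.0.1", 8013), RequestHandler)
    # Serve in a daemon thread so Ctrl+C shuts down cleanly.
    t = threading.Thread(target=srv.serve_forever, daemon=True)
    t.start()
    try:
        t.join()
    except KeyboardInterrupt:
        srv.shutdown()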
@ggerganov ggerganov marked this pull request as ready for review May 10, 2026 18:25
@ggerganov ggerganov requested a review from Copilot May 10, 2026 18:25

Copilot AI left a comment


Pull request overview

Note: Copilot was unable to run its full agentic suite in this review.

Adds a new lightweight evaluation tool under examples/llama-eval/ for running dataset-based model evaluations against one or more llama-server instances, including grading and HTML reporting.

Changes:

  • Introduces llama-eval.py to run evals, resume from state JSON, and render an interactive HTML report.
  • Adds a llama-server-simulator.py plus a shell script to exercise the simulator locally.
  • Adds minimal README with quick-start usage.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.

  • examples/llama-eval/llama-eval.py: Core eval runner: dataset loading, multi-server worker pool, grading, state/HTML output.
  • examples/llama-eval/llama-server-simulator.py: Local HTTP server that simulates /v1/chat/completions responses using the AIME dataset.
  • examples/llama-eval/test-simulator.sh: Ad-hoc script to start the simulator and sanity-check responses.
  • examples/llama-eval/README.md: Quick-start documentation and a link to this PR for fuller details.


Comment on examples/llama-eval/llama-server-simulator.py, lines +76 to +81:
cache_path = Path.home() / ".cache" / "huggingface" / "datasets" / "AI-MO___aimo-validation-aime" / "default" / "0.0.0"
if cache_path.exists():
    print(f"Using cached dataset from {cache_path}")
    ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split, cache_dir=str(cache_path))
else:
    ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split)
ggerganov added 4 commits May 10, 2026 21:43
- Store model_name in EvalState and JSON output
- Display model in HTML summary table
- Verify --model matches stored model when resuming

Assisted-by: llama.cpp:local pi
ggerganov added 3 commits May 12, 2026 14:50
- Replace verbose summary table with single inline bar
- Shorten status text: '✓'/'✗'/'–'/'!' instead of full words
- Flatten CSS: remove box-shadows, border-radius, reduce padding
- Use system-ui font, 13px table, 12px details
- Conditional reasoning section (only shown when present)
- Single toggle JS function instead of two
- Shorter column headers

Assisted-by: llama.cpp:local pi
- Hit /v1/models for each server before evaluation
- Exit with error if any server is unreachable
- Print comma-separated model IDs per server in startup output
- Sequential checks, no retries, no timeout override

Assisted-by: llama.cpp:local pi
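
A hedged sketch of such a startup check (urllib-based; illustrative rather than the script's actual implementation):

import json
import sys
import urllib.request

def check_servers(server_urls):
    # GET /v1/models on each server; exit with an error if any is down.
    for url in server_urls:
        try:
            with urllib.request.urlopen(f"{url}/v1/models") as r:
                models = json.load(r).get("data", [])
            print(f"{url}: {', '.join(m['id'] for m in models)}")
        except OSError as e:
            sys.exit(f"error: server {url} unreachable: {e}")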
@ggerganov ggerganov merged commit fde69a3 into master May 12, 2026
4 checks passed
@cmp-nct
Contributor

cmp-nct commented May 12, 2026

Very good addition

@ggerganov ggerganov deleted the gg/scripts-eval branch May 12, 2026 13:06
@JohannesGaessler
Contributor

Getting something like this into llama.cpp is one of my current priorities. I'm willing to cooperate on the development, but from a cursory check of the code I don't agree with all of the design decisions that were made and would want to make changes, or else I would need to fork the code / make my own. Since "unslop" is currently unchecked: how finalized do you consider the design to be, and do you intend to make further changes in a follow-up PR?

@ggerganov
Member Author

I realized I can't really unslop the implementation because I don't have a model of "good Python code" in my head, so I left it as it is.

Feel free to modify it any way you like. I don't feel strongly about any of these features, apart from it being a self-contained, single script without heavy dependencies. I don't plan to make changes soon, apart from fixing edge cases when running the current functionality.

xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 12, 2026
Jcfunk pushed a commit to Jcfunk/llama.cpp that referenced this pull request May 13, 2026