[Docs] Add GSM8K accuracy benchmark example#36591
Conversation
Documentation preview: https://vllm--36591.org.readthedocs.build/en/36591/ |
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; you can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
```python
total = len(problems)
correct = 0

print(f"Evaluating {total} examples...\n")

for idx, problem in enumerate(problems):
    question = problem["question"]
    # Extract gold answer (after "####")
    gold_match = re.search(r'####\s*(-?\d+(?:\.\d+)?)', problem["answer"])
    if not gold_match:
        print(f"Warning: cannot parse gold answer for example {idx}, skipping")
        continue
    gold_answer = gold_match.group(1)

    payload = {
        "model": args.model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": args.max_tokens,
        "temperature": args.temperature,
    }

    try:
        response = requests.post(args.api_url, json=payload, timeout=60)
        response.raise_for_status()
        model_output = response.json()["choices"][0]["message"]["content"]
        pred_answer = extract_answer(model_output)

        if pred_answer == gold_answer:
            correct += 1
        else:
            print(f"Example {idx}:")
            print(f"  Q: {question[:60]}...")
            print(f"  Gold: {gold_answer}, Pred: {pred_answer}")
            print(f"  Model output snippet: {model_output[:120]}...\n")
    except Exception as e:
        print(f"Error on example {idx}: {e}")

    if (idx + 1) % 100 == 0:
        print(f"Processed {idx+1}/{total}, current accuracy: {correct/(idx+1)*100:.2f}%")

accuracy = correct / total * 100 if total > 0 else 0
print(f"\n{'='*40}")
print(f"Final accuracy: {correct}/{total} = {accuracy:.2f}%")
```
The current accuracy calculation is incorrect. The denominator `total` is the number of all problems in the dataset, but the script skips problems whose gold answer cannot be parsed and problems whose request fails. As a result, the reported accuracy is lower than the true value.

The denominator should be the number of successfully evaluated samples. I've provided a code suggestion to fix this: it introduces a `num_evaluated` counter to correctly track the evaluated samples and uses it to compute the accuracy.
```python
total = len(problems)
correct = 0
num_evaluated = 0

print(f"Evaluating {total} examples...\n")

for idx, problem in enumerate(problems):
    question = problem["question"]
    # Extract gold answer (after "####")
    gold_match = re.search(r'####\s*(-?\d+(?:\.\d+)?)', problem["answer"])
    if not gold_match:
        print(f"Warning: cannot parse gold answer for example {idx}, skipping")
        continue
    gold_answer = gold_match.group(1)

    payload = {
        "model": args.model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": args.max_tokens,
        "temperature": args.temperature,
    }

    try:
        response = requests.post(args.api_url, json=payload, timeout=60)
        response.raise_for_status()
        model_output = response.json()["choices"][0]["message"]["content"]
        pred_answer = extract_answer(model_output)

        if pred_answer == gold_answer:
            correct += 1
        else:
            print(f"Example {idx}:")
            print(f"  Q: {question[:60]}...")
            print(f"  Gold: {gold_answer}, Pred: {pred_answer}")
            print(f"  Model output snippet: {model_output[:120]}...\n")
        num_evaluated += 1
    except Exception as e:
        print(f"Error on example {idx}: {e}")

    if (idx + 1) % 100 == 0 and num_evaluated > 0:
        current_accuracy = correct / num_evaluated * 100
        print(f"Processed {idx+1}/{total}, Evaluated: {num_evaluated}, Accuracy: {current_accuracy:.2f}%")

accuracy = correct / num_evaluated * 100 if num_evaluated > 0 else 0
print(f"\n{'='*40}")
print(f"Final accuracy: {correct}/{num_evaluated} = {accuracy:.2f}%")
```
Hi @ZhuangYu07, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hook will run these checks automatically.
Signed-off-by: ZhuangYu07 <202344420843@qq.com>
Purpose
Fixes the problem reported in issue #5215, where Phi-4-mini-instruct scored only 0.08% accuracy on GSM8K. The root cause was that the user did not use the chat template, so the model did not understand the task. This PR adds documentation and an example script showing how to correctly measure accuracy through vLLM's chat interface.
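The distinction is easiest to see in the request payloads. The sketch below is illustrative (the model name and token limit are placeholders): the chat endpoint receives a `messages` list and the server applies the model's chat template before generation, while the raw completions endpoint sends the question as an untemplated `prompt`, which is what produced the near-zero score.

```python
question = "Natalia sold clips to 48 of her friends in April..."  # a GSM8K question

# Raw completion payload: the question goes out as an untemplated prompt.
# Instruct-tuned models like Phi-4-mini-instruct often fail to follow the
# task when queried this way.
completion_payload = {
    "model": "phi4-mini",
    "prompt": question,
    "max_tokens": 512,
}

# Chat completion payload: the server wraps the message in the model's chat
# template before generation, so the model sees a properly formatted turn.
chat_payload = {
    "model": "phi4-mini",
    "messages": [{"role": "user", "content": question}],
    "max_tokens": 512,
}

# The benchmark script sends chat_payload to /v1/chat/completions, e.g.:
# requests.post("http://localhost:8004/v1/chat/completions", json=chat_payload)
```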
Script:

```shell
pip install datasets requests
python examples/benchmark_gsm8k_accuracy.py --api-url http://localhost:8004/v1/chat/completions --model phi4-mini
```
Test Result
Ran all 1319 examples of the GSM8K test set and measured 74.5% accuracy, which is in line with expectations (this model normally scores around 80%). This confirms that the earlier 0.08% was caused by using the API incorrectly.