[Docs] Add GSM8K accuracy benchmark example#36591
Conversation
Documentation preview: https://vllm--36591.org.readthedocs.build/en/36591/ |
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; you can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
```python
total = len(problems)
correct = 0

print(f"Evaluating {total} examples...\n")

for idx, problem in enumerate(problems):
    question = problem["question"]
    # Extract gold answer (after "####")
    gold_match = re.search(r'####\s*(-?\d+(?:\.\d+)?)', problem["answer"])
    if not gold_match:
        print(f"Warning: cannot parse gold answer for example {idx}, skipping")
        continue
    gold_answer = gold_match.group(1)

    payload = {
        "model": args.model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": args.max_tokens,
        "temperature": args.temperature,
    }

    try:
        response = requests.post(args.api_url, json=payload, timeout=60)
        response.raise_for_status()
        model_output = response.json()["choices"][0]["message"]["content"]
        pred_answer = extract_answer(model_output)

        if pred_answer == gold_answer:
            correct += 1
        else:
            print(f"Example {idx}:")
            print(f"  Q: {question[:60]}...")
            print(f"  Gold: {gold_answer}, Pred: {pred_answer}")
            print(f"  Model output snippet: {model_output[:120]}...\n")
    except Exception as e:
        print(f"Error on example {idx}: {e}")

    if (idx + 1) % 100 == 0:
        print(f"Processed {idx+1}/{total}, current accuracy: {correct/(idx+1)*100:.2f}%")

accuracy = correct / total * 100 if total > 0 else 0
print(f"\n{'='*40}")
print(f"Final accuracy: {correct}/{total} = {accuracy:.2f}%")
```
The current accuracy calculation is incorrect. The denominator `total` is the number of all problems in the dataset, but the script skips problems whose gold answer cannot be parsed and problems whose request fails. As a result, the reported accuracy is lower than the true value.

The denominator should be the number of successfully evaluated samples. I've provided a code suggestion to fix this: it introduces a `num_evaluated` counter to correctly track the evaluated samples and uses it to compute the accuracy.
```python
total = len(problems)
correct = 0
num_evaluated = 0

print(f"Evaluating {total} examples...\n")

for idx, problem in enumerate(problems):
    question = problem["question"]
    # Extract gold answer (after "####")
    gold_match = re.search(r'####\s*(-?\d+(?:\.\d+)?)', problem["answer"])
    if not gold_match:
        print(f"Warning: cannot parse gold answer for example {idx}, skipping")
        continue
    gold_answer = gold_match.group(1)

    payload = {
        "model": args.model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": args.max_tokens,
        "temperature": args.temperature,
    }

    try:
        response = requests.post(args.api_url, json=payload, timeout=60)
        response.raise_for_status()
        model_output = response.json()["choices"][0]["message"]["content"]
        pred_answer = extract_answer(model_output)

        if pred_answer == gold_answer:
            correct += 1
        else:
            print(f"Example {idx}:")
            print(f"  Q: {question[:60]}...")
            print(f"  Gold: {gold_answer}, Pred: {pred_answer}")
            print(f"  Model output snippet: {model_output[:120]}...\n")
        num_evaluated += 1
    except Exception as e:
        print(f"Error on example {idx}: {e}")

    if (idx + 1) % 100 == 0 and num_evaluated > 0:
        current_accuracy = correct / num_evaluated * 100
        print(f"Processed {idx+1}/{total}, Evaluated: {num_evaluated}, Accuracy: {current_accuracy:.2f}%")

accuracy = correct / num_evaluated * 100 if num_evaluated > 0 else 0
print(f"\n{'='*40}")
print(f"Final accuracy: {correct}/{num_evaluated} = {accuracy:.2f}%")
```
Hi @ZhuangYu07, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hook will run these checks automatically.
Signed-off-by: ZhuangYu07 <202344420843@qq.com>
Purpose
Fixes the problem reported in issue #5215, where Phi-4-mini-instruct scored only 0.08% accuracy on GSM8K. The root cause was that the user did not use the chat template, so the model did not understand the task. This PR adds documentation and an example script showing how to correctly measure accuracy through vLLM's chat interface.
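The distinction is easiest to see in the request payloads. The sketch below is illustrative (the model name and token limit are placeholders): the chat endpoint receives a `messages` list and the server applies the model's chat template before generation, while the raw completions endpoint sends the question as an untemplated `prompt`, which is what produced the near-zero score.

```python
question = "Natalia sold clips to 48 of her friends in April..."  # a GSM8K question

# Raw completion payload: the question goes out as an untemplated prompt.
# Instruct-tuned models like Phi-4-mini-instruct often fail to follow the
# task when queried this way.
completion_payload = {
    "model": "phi4-mini",
    "prompt": question,
    "max_tokens": 512,
}

# Chat completion payload: the server wraps the message in the model's chat
# template before generation, so the model sees a properly formatted turn.
chat_payload = {
    "model": "phi4-mini",
    "messages": [{"role": "user", "content": question}],
    "max_tokens": 512,
}

# The benchmark script sends chat_payload to /v1/chat/completions, e.g.:
# requests.post("http://localhost:8004/v1/chat/completions", json=chat_payload)
```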
Script:

```shell
pip install datasets requests
python examples/benchmark_gsm8k_accuracy.py --api-url http://localhost:8004/v1/chat/completions --model phi4-mini
```
Test Result
Ran all 1319 examples of the GSM8K test set and measured 74.5% accuracy, which is in line with expectations (this model normally scores around 80%). This confirms that the earlier 0.08% was caused by using the API incorrectly.