
[CI/Build] Update Llama4 eval yaml#27070

Merged
yeqcharlotte merged 1 commit into vllm-project:main from zhewenl:patch-llama4-yaml on Oct 17, 2025

Conversation

@zhewenl (Collaborator) commented on Oct 17, 2025

Purpose

Fix some missing pieces in #21810

Test Plan

ChartQA:

> bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 100 -t 8
...
| Tasks |Version|Filter|n-shot|     Metric      |   |Value|   |Stderr|
|-------|------:|------|-----:|-----------------|---|----:|---|-----:|
|chartqa|      0|none  |     0|anywhere_accuracy|↑  | 0.81|±  |0.0394|
|       |       |none  |     0|exact_match      |↑  | 0.51|±  |0.0502|
|       |       |none  |     0|relaxed_accuracy |↑  | 0.81|±  |0.0394|

MMLU-Pro:

bash .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 250 -t 8 -f 5
...
|       Tasks       |Version|    Filter    |n-shot|  Metric   |   |Value |   |Stderr|
|-------------------|------:|--------------|-----:|-----------|---|-----:|---|-----:|
|mmlu_pro           |    2.0|custom-extract|      |exact_match|↑  |0.7743|±  |0.0069|
| - biology         |    2.1|custom-extract|     5|exact_match|↑  |0.8920|±  |0.0197|
| - business        |    2.1|custom-extract|     5|exact_match|↑  |0.8360|±  |0.0235|
| - chemistry       |    2.1|custom-extract|     5|exact_match|↑  |0.8280|±  |0.0239|
| - computer_science|    2.1|custom-extract|     5|exact_match|↑  |0.8040|±  |0.0252|
| - economics       |    2.1|custom-extract|     5|exact_match|↑  |0.8800|±  |0.0206|
| - engineering     |    2.1|custom-extract|     5|exact_match|↑  |0.6640|±  |0.0299|
| - health          |    2.1|custom-extract|     5|exact_match|↑  |0.7680|±  |0.0268|
| - history         |    2.1|custom-extract|     5|exact_match|↑  |0.7240|±  |0.0283|
| - law             |    2.1|custom-extract|     5|exact_match|↑  |0.5480|±  |0.0315|
| - math            |    2.1|custom-extract|     5|exact_match|↑  |0.8640|±  |0.0217|
| - other           |    2.1|custom-extract|     5|exact_match|↑  |0.6840|±  |0.0295|
| - philosophy      |    2.1|custom-extract|     5|exact_match|↑  |0.7080|±  |0.0288|
| - physics         |    2.1|custom-extract|     5|exact_match|↑  |0.8360|±  |0.0235|
| - psychology      |    2.1|custom-extract|     5|exact_match|↑  |0.8040|±  |0.0252|

| Groups |Version|    Filter    |n-shot|  Metric   |   |Value |   |Stderr|
|--------|------:|--------------|------|-----------|---|-----:|---|-----:|
|mmlu_pro|      2|custom-extract|      |exact_match|↑  |0.7743|±  |0.0069|
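Given the measured numbers above, the baseline asserted in the eval YAML should track the ~0.7743 exact_match. As a rough illustration only (the field names below are assumptions, not the actual contents of `Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml`), the relevant part of such a baseline config might look like:

```yaml
# Hypothetical sketch of an lm-eval baseline config; schema is assumed,
# not copied from the real file in .buildkite/lm-eval-harness/configs/.
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tasks:
  - name: "mmlu_pro"
    metrics:
      - name: "exact_match"
        value: 0.7743   # measured in this PR's test plan
limit: 250
num_fewshot: 5
```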

Signed-off-by: zhewenli <zhewenli@meta.com>
@zhewenl zhewenl marked this pull request as ready for review October 17, 2025 03:11
@mergify mergify bot added the ci/build and llama labels on Oct 17, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

https://github.com/vllm-project/vllm/blob/45f1d12f2b8c11143ee308e7c2b7586377e9735a/.buildkite/lm-eval-harness/configs/Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml#L7-L12

P1: Update MMLU baseline to measured accuracy

The updated instructions and test plan document MMLU‑Pro results of about 0.7743 exact_match, yet the YAML still asserts a ground truth of 0.80. test_lm_eval_correctness.py consumes this file and checks np.isclose against the value field, so the regression test will continue to fail even when the model reproduces the numbers reported in this commit. Please lower the expected value (or justify raising the actual score) so the baseline reflects the measured metric.
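A minimal sketch of the kind of comparison the reviewer describes (the real logic lives in `test_lm_eval_correctness.py`; the config shape, tolerance, and helper below are illustrative assumptions, not vLLM's actual code):

```python
# Hypothetical sketch of an np.isclose baseline check; the dict mimics
# (but is not) the real YAML schema, and RTOL is an assumed tolerance.
import numpy as np

# Stands in for the parsed baseline YAML; "value" is the asserted ground truth.
config = {
    "tasks": [
        {"name": "mmlu_pro",
         "metrics": [{"name": "exact_match", "value": 0.80}]},
    ]
}

# Measured in this PR's test plan.
measured = {("mmlu_pro", "exact_match"): 0.7743}

RTOL = 0.02  # assumed; the real test's tolerance may differ


def baseline_ok(config, measured, rtol=RTOL):
    """Return True only if every measured metric is close to its baseline."""
    for task in config["tasks"]:
        for metric in task["metrics"]:
            got = measured[(task["name"], metric["name"])]
            if not np.isclose(got, metric["value"], rtol=rtol):
                return False
    return True


print(baseline_ok(config, measured))  # False with this assumed tolerance
```

With the stale 0.80 baseline the check fails; lowering `value` to the measured 0.7743 would make it pass, which is the fix the review asks for.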


@yeqcharlotte yeqcharlotte enabled auto-merge (squash) October 17, 2025 03:17
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Oct 17, 2025
@yeqcharlotte yeqcharlotte merged commit 9c2c228 into vllm-project:main Oct 17, 2025
21 of 22 checks passed
Zhuul pushed a commit to Zhuul/vllm that referenced this pull request Oct 17, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>

Labels

ci/build · llama (Related to Llama models) · ready (ONLY add when PR is ready to merge/full CI is needed)
