
[CI/Build] Update Llama4 eval yaml#27070

Merged
yeqcharlotte merged 1 commit into vllm-project:main from zhewenl:patch-llama4-yaml on Oct 17, 2025

Conversation

@zhewenl (Collaborator) commented on Oct 17, 2025

Purpose

Fix some missing pieces in #21810

Test Plan

ChartQA:

> bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 100 -t 8
...
| Tasks |Version|Filter|n-shot|     Metric      |   |Value|   |Stderr|
|-------|------:|------|-----:|-----------------|---|----:|---|-----:|
|chartqa|      0|none  |     0|anywhere_accuracy|↑  | 0.81|±  |0.0394|
|       |       |none  |     0|exact_match      |↑  | 0.51|±  |0.0502|
|       |       |none  |     0|relaxed_accuracy |↑  | 0.81|±  |0.0394|

MMLU-Pro:

bash .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 250 -t 8 -f 5
...
|       Tasks       |Version|    Filter    |n-shot|  Metric   |   |Value |   |Stderr|
|-------------------|------:|--------------|-----:|-----------|---|-----:|---|-----:|
|mmlu_pro           |    2.0|custom-extract|      |exact_match|↑  |0.7743|±  |0.0069|
| - biology         |    2.1|custom-extract|     5|exact_match|↑  |0.8920|±  |0.0197|
| - business        |    2.1|custom-extract|     5|exact_match|↑  |0.8360|±  |0.0235|
| - chemistry       |    2.1|custom-extract|     5|exact_match|↑  |0.8280|±  |0.0239|
| - computer_science|    2.1|custom-extract|     5|exact_match|↑  |0.8040|±  |0.0252|
| - economics       |    2.1|custom-extract|     5|exact_match|↑  |0.8800|±  |0.0206|
| - engineering     |    2.1|custom-extract|     5|exact_match|↑  |0.6640|±  |0.0299|
| - health          |    2.1|custom-extract|     5|exact_match|↑  |0.7680|±  |0.0268|
| - history         |    2.1|custom-extract|     5|exact_match|↑  |0.7240|±  |0.0283|
| - law             |    2.1|custom-extract|     5|exact_match|↑  |0.5480|±  |0.0315|
| - math            |    2.1|custom-extract|     5|exact_match|↑  |0.8640|±  |0.0217|
| - other           |    2.1|custom-extract|     5|exact_match|↑  |0.6840|±  |0.0295|
| - philosophy      |    2.1|custom-extract|     5|exact_match|↑  |0.7080|±  |0.0288|
| - physics         |    2.1|custom-extract|     5|exact_match|↑  |0.8360|±  |0.0235|
| - psychology      |    2.1|custom-extract|     5|exact_match|↑  |0.8040|±  |0.0252|

| Groups |Version|    Filter    |n-shot|  Metric   |   |Value |   |Stderr|
|--------|------:|--------------|------|-----------|---|-----:|---|-----:|
|mmlu_pro|      2|custom-extract|      |exact_match|↑  |0.7743|±  |0.0069|
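Given the measured numbers above, the baseline asserted in the eval YAML should track the ~0.7743 exact_match. As a rough illustration only (the field names below are assumptions, not the actual contents of `Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml`), the relevant part of such a baseline config might look like:

```yaml
# Hypothetical sketch of an lm-eval baseline config; schema is assumed,
# not copied from the real file in .buildkite/lm-eval-harness/configs/.
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tasks:
  - name: "mmlu_pro"
    metrics:
      - name: "exact_match"
        value: 0.7743   # measured in this PR's test plan
limit: 250
num_fewshot: 5
```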

Signed-off-by: zhewenli <zhewenli@meta.com>
@zhewenl zhewenl marked this pull request as ready for review October 17, 2025 03:11
@mergify mergify bot added the ci/build and llama labels on Oct 17, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

https://github.com/vllm-project/vllm/blob/45f1d12f2b8c11143ee308e7c2b7586377e9735a/.buildkite/lm-eval-harness/configs/Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml#L7-L12

P1: Update MMLU baseline to measured accuracy

The updated instructions and test plan document MMLU‑Pro results of about 0.7743 exact_match, yet the YAML still asserts a ground truth of 0.80. test_lm_eval_correctness.py consumes this file and checks np.isclose against the value field, so the regression test will continue to fail even when the model reproduces the numbers reported in this commit. Please lower the expected value (or justify raising the actual score) so the baseline reflects the measured metric.
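A minimal sketch of the kind of comparison the reviewer describes (the real logic lives in `test_lm_eval_correctness.py`; the config shape, tolerance, and helper below are illustrative assumptions, not vLLM's actual code):

```python
# Hypothetical sketch of an np.isclose baseline check; the dict mimics
# (but is not) the real YAML schema, and RTOL is an assumed tolerance.
import numpy as np

# Stands in for the parsed baseline YAML; "value" is the asserted ground truth.
config = {
    "tasks": [
        {"name": "mmlu_pro",
         "metrics": [{"name": "exact_match", "value": 0.80}]},
    ]
}

# Measured in this PR's test plan.
measured = {("mmlu_pro", "exact_match"): 0.7743}

RTOL = 0.02  # assumed; the real test's tolerance may differ


def baseline_ok(config, measured, rtol=RTOL):
    """Return True only if every measured metric is close to its baseline."""
    for task in config["tasks"]:
        for metric in task["metrics"]:
            got = measured[(task["name"], metric["name"])]
            if not np.isclose(got, metric["value"], rtol=rtol):
                return False
    return True


print(baseline_ok(config, measured))  # False with this assumed tolerance
```

With the stale 0.80 baseline the check fails; lowering `value` to the measured 0.7743 would make it pass, which is the fix the review asks for.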


@yeqcharlotte yeqcharlotte enabled auto-merge (squash) October 17, 2025 03:17
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Oct 17, 2025
@yeqcharlotte yeqcharlotte merged commit 9c2c228 into vllm-project:main Oct 17, 2025
21 of 22 checks passed
Zhuul pushed a commit to Zhuul/vllm that referenced this pull request Oct 17, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: zhewenli <zhewenli@meta.com>

Labels

ci/build · llama (Related to Llama models) · ready (ONLY add when PR is ready to merge/full CI is needed)
