
[CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI#21810

Merged
yeqcharlotte merged 10 commits into vllm-project:main from zhewenl:add-more-large-model
Oct 15, 2025

Conversation

@zhewenl
Collaborator

@zhewenl zhewenl commented Jul 29, 2025

Purpose

Add more large models to accuracy testing. Note that they cannot run on A100, so we will add them only once we have H100/MI300X capacity.
Evals for small models will be added as "LM Eval Small Multimodal Models", which takes ~50 min in CI (example).

This PR also picks up #19959, which added support for MM evals.

Test Plan

# MMLU Pro large models
pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-large-h100.txt \
    --tp-size=8

# MM large models
pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-mm-large-h100.txt \
    --tp-size=8

# MM small models
pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-mm-small.txt \
    --tp-size=1 

# validate current evals
pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-small.txt \
    --tp-size=1
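For context, the correctness check these commands exercise boils down to comparing each measured metric against a ground-truth value from the model's YAML config, within a tolerance. A minimal sketch follows; the function name, tolerance value, and config shape are assumptions for illustration, not the actual test code:

```python
# Hedged sketch of an lm-eval correctness check: a run passes when the
# measured score is no more than `rtol` below the configured ground truth.
# RTOL and the helper name are illustrative assumptions.
RTOL = 0.05

def metric_within_tolerance(ground_truth: float, measured: float,
                            rtol: float = RTOL) -> bool:
    """True when the measured score is within rtol of the ground truth."""
    return measured >= ground_truth - rtol

# e.g. the MMLU Pro run below: ground_truth=0.8, measured=0.8002857...
assert metric_within_tolerance(0.8, 0.8002857142857143)
```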

Test Result

MMLU Pro:

Running generate_until requests: 100%|███████████████████████████████████████████████████████████████████████████████| 3500/3500 [02:46<00:00, 21.03it/s]
mmlu_pro | exact_match,custom-extract: ground_truth=0.8 | measured=0.8002857142857143
PASSED

=================================================================== warnings summary ====================================================================
../../../../uv_env/vllm/lib64/python3.12/site-packages/schemathesis/generation/coverage.py:305
  /home/zhewenli/uv_env/vllm/lib64/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
    ref_error: type[Exception] = jsonschema.RefResolutionError,

.buildkite/lm-eval-harness/test_lm_eval_correctness.py::test_lm_eval_correctness_param[config_filename0]
  /home/zhewenli/uv_env/vllm/lib64/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
    import pynvml  # type: ignore[import]

.buildkite/lm-eval-harness/test_lm_eval_correctness.py::test_lm_eval_correctness_param[config_filename0]
  /usr/lib64/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=2381108) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

.buildkite/lm-eval-harness/test_lm_eval_correctness.py: 98 warnings
  /home/zhewenli/uv_env/vllm/lib64/python3.12/site-packages/datasets/utils/_dill.py:385: DeprecationWarning: co_lnotab is deprecated, use co_lines instead.
    obj.co_lnotab,  # for < python 3.10 [not counted in args]

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 1 passed, 101 warnings in 273.62s (0:04:33) ======================================================

ChartQA large models:

chartqa | relaxed_accuracy,none: ground_truth=0.9 | measured=0.84
PASSED

======================================================= 1 passed, 3 warnings in 133.34s (0:02:13) =======================================================

ChartQA small models:

chartqa | relaxed_accuracy,none: ground_truth=0.855 | measured=0.8604
PASSED

======================================================= 1 passed, 3 warnings in 289.29s (0:04:49) =======================================================

Current:

======================================================= 6 passed, 3 warnings in 366.26s (0:06:06) =======================================================

@zhewenl zhewenl requested review from mgoin and simon-mo as code owners July 29, 2025 07:11
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Jul 29, 2025
@zhewenl zhewenl requested a review from houseroad July 29, 2025 07:12
@mergify mergify bot added deepseek Related to DeepSeek models llama Related to Llama models labels Jul 29, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for two new large models, DeepSeek-V3 and Llama-4-Maverick-FP8, to the accuracy testing suite, targeting H100/MI300X hardware. The changes involve adding new YAML configuration files for these models and updating the test_lm_eval_correctness.py script to allow for configurable gpu_memory_utilization and batch_size. While the changes are well-structured, I've identified a potential issue with the default gpu_memory_utilization value, which could lead to test instability.

num_fewshot: 8
trust_remote_code: True
max_model_len: 1024
batch_size: 1
Collaborator Author


I am testing on my local H100 with this config. It's not ideal (batch size = 1 and only 1k seq len), and perhaps we should test it on MI300X, which has much more GPU memory.

Collaborator


Noted that the publicly available H100 only has 80GB, so it'll OOM here. Wonder if we have AMD or H200 in CI?

If not, could you validate how much difference there is between full DS3 vs DS2 with TP/EP and so on?

If we cannot get DS3 added in CI, then let's try to add them to CD @huydnn if it's not already there.

Collaborator Author


I don't think we have any MI300X/H200 in CI; I will follow up with @huydhn on whether we can add it to the PyTorch CI (which has H100/MI300X now).


@zhewenl zhewenl force-pushed the add-more-large-model branch from c48e0de to 6496ade Compare July 29, 2025 22:12
@mergify mergify bot added documentation Improvements or additions to documentation frontend new-model Requests to new models performance Performance-related issues labels Jul 29, 2025
@zhewenl zhewenl force-pushed the add-more-large-model branch from 7a7d83e to 9fb7562 Compare July 30, 2025 00:20
@zhewenl zhewenl changed the title [RFC][CI/Build] Add Deepseek v3 and Llama4 Maverick FP8 [RFC][CI/Build] Add Llama4 Maverick FP8 GSM8K + ChartQA Accuracy Tests Jul 30, 2025
@zhewenl zhewenl force-pushed the add-more-large-model branch from 9fb7562 to 851ccc9 Compare July 30, 2025 00:28
@@ -0,0 +1,11 @@
# For hf script, without -t option (tensor parallel size).
Collaborator Author

@zhewenl zhewenl Jul 30, 2025


We will need to define model name + tasks (text/MM), which is not very scalable. We could refactor the YAML/code to support task_groups, where we define the test suite per model like this:
cc @robertgshaw2-redhat

task_groups:
    mm_tasks:
        name: "chartqa"
        ...
    text_tasks:
        name: "gsm8k"

@zhewenl zhewenl force-pushed the add-more-large-model branch from 851ccc9 to 813b018 Compare September 10, 2025 21:57
Collaborator

@houseroad houseroad left a comment


why change so many files?

@zhewenl zhewenl force-pushed the add-more-large-model branch 3 times, most recently from 114bf81 to 694588d Compare September 30, 2025 03:44
)
results = lm_eval.simple_evaluate(
-    model="vllm",
+    model=eval_config["backend"],
Member

@DarkLight1337 DarkLight1337 Oct 14, 2025


Can we use eval_config.get with default for the backend so we don't have to modify so many files?

Collaborator Author


Can we use eval_config.get with default for the backend so we don't have to modify so many files?

@DarkLight1337 Thanks for the suggestion! Updated PR and verified existing tests are working:

================================================ 6 passed, 3 warnings in 406.75s (0:06:46) ================================================
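The change suggested above can be sketched as follows; this is a simplified stand-in for the real test code, and the dict shape and the "vllm-vlm" value are illustrative assumptions:

```python
# Simplified sketch: fall back to the "vllm" backend when a model's eval
# config does not specify one, so the existing YAML configs need no edits.
def resolve_backend(eval_config: dict) -> str:
    return eval_config.get("backend", "vllm")

assert resolve_backend({}) == "vllm"                           # legacy config
assert resolve_backend({"backend": "vllm-vlm"}) == "vllm-vlm"  # MM config
```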

Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
source_file_dependencies:
- vllm/multimodal/
- vllm/inputs/
- vllm/model_executor/models
Collaborator


we should not have to run this test on every single model change

Collaborator Author


we should not have to run this test on every single model change

Discussed offline: we need to expand the current ci-infra to support globs/regex: https://github.com/vllm-project/ci-infra/blob/69766cdb77b731a1ac6371d40c577f028e68fa17/buildkite/test-template-ci.j2#L49
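The glob/regex support discussed here might look roughly like the following hypothetical Python sketch; the real ci-infra template is Jinja2-based and its matching logic may differ, and the function and parameter names are illustrative only:

```python
# Hypothetical sketch: run a test step only when some changed file matches
# one of its source_file_dependencies, treating entries that contain
# wildcard characters as glob patterns and plain entries as path prefixes.
from fnmatch import fnmatch

def step_should_run(changed_files: list[str], dependencies: list[str]) -> bool:
    for path in changed_files:
        for dep in dependencies:
            if any(ch in dep for ch in "*?["):
                if fnmatch(path, dep):
                    return True
            elif path.startswith(dep):
                return True
    return False

assert step_should_run(["vllm/multimodal/base.py"], ["vllm/multimodal/"])
assert not step_should_run(["docs/index.md"], ["vllm/multimodal/"])
```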

Signed-off-by: zhewenli <zhewenli@meta.com>
Collaborator

@yeqcharlotte yeqcharlotte left a comment


thanks! let's monitor how this goes. cc: @ywang96 @DarkLight1337

@yeqcharlotte yeqcharlotte changed the title [CI/Build] Add Llama4 Maverick FP8 MMLU Pro + ChartQA Accuracy Tests [CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI Oct 15, 2025
@mergify mergify bot added the qwen Related to Qwen models label Oct 15, 2025
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
@yeqcharlotte yeqcharlotte enabled auto-merge (squash) October 15, 2025 04:31
@yeqcharlotte yeqcharlotte merged commit f3c378f into vllm-project:main Oct 15, 2025
18 checks passed
@zhewenl zhewenl deleted the add-more-large-model branch October 15, 2025 16:57
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
bogdanminko pushed a commit to bogdanminko/vllm that referenced this pull request Oct 16, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Signed-off-by: bogdan01m <minkobogdan2001@gmail.com>
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 16, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>

Labels

ci/build deepseek Related to DeepSeek models documentation Improvements or additions to documentation frontend llama Related to Llama models new-model Requests to new models performance Performance-related issues qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed
