
[CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI#21810

Merged
yeqcharlotte merged 10 commits into vllm-project:main from zhewenl:add-more-large-model
Oct 15, 2025

Conversation

@zhewenl
Collaborator

@zhewenl zhewenl commented Jul 29, 2025

Purpose

Add more large models to accuracy testing. Note that they cannot run on A100, so we will add them only once we have H100/MI300X capacity.
Evals for small models will be added as "LM Eval Small Multimodal Models", which takes ~50 min in CI (example).

This PR also picks up #19959, which added support for MM evals.

Test Plan

# MMLU Pro large models
pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-large-h100.txt \
    --tp-size=8

# MM large models
pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-mm-large-h100.txt \
    --tp-size=8

# MM small models
pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-mm-small.txt \
    --tp-size=1 

# validate current evals
pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-small.txt \
    --tp-size=1
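For context, the correctness check these commands exercise boils down to comparing each measured metric against a ground-truth value from the model's YAML config, within a tolerance. A minimal sketch follows; the function name, tolerance value, and config shape are assumptions for illustration, not the actual test code:

```python
# Hedged sketch of an lm-eval correctness check: a run passes when the
# measured score is no more than `rtol` below the configured ground truth.
# RTOL and the helper name are illustrative assumptions.
RTOL = 0.05

def metric_within_tolerance(ground_truth: float, measured: float,
                            rtol: float = RTOL) -> bool:
    """True when the measured score is within rtol of the ground truth."""
    return measured >= ground_truth - rtol

# e.g. the MMLU Pro run below: ground_truth=0.8, measured=0.8002857...
assert metric_within_tolerance(0.8, 0.8002857142857143)
```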

Test Result

MMLU Pro:

Running generate_until requests: 100%|███████████████████████████████████████████████████████████████████████████████| 3500/3500 [02:46<00:00, 21.03it/s]
mmlu_pro | exact_match,custom-extract: ground_truth=0.8 | measured=0.8002857142857143
PASSED

=================================================================== warnings summary ====================================================================
../../../../uv_env/vllm/lib64/python3.12/site-packages/schemathesis/generation/coverage.py:305
  /home/zhewenli/uv_env/vllm/lib64/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
    ref_error: type[Exception] = jsonschema.RefResolutionError,

.buildkite/lm-eval-harness/test_lm_eval_correctness.py::test_lm_eval_correctness_param[config_filename0]
  /home/zhewenli/uv_env/vllm/lib64/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
    import pynvml  # type: ignore[import]

.buildkite/lm-eval-harness/test_lm_eval_correctness.py::test_lm_eval_correctness_param[config_filename0]
  /usr/lib64/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=2381108) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

.buildkite/lm-eval-harness/test_lm_eval_correctness.py: 98 warnings
  /home/zhewenli/uv_env/vllm/lib64/python3.12/site-packages/datasets/utils/_dill.py:385: DeprecationWarning: co_lnotab is deprecated, use co_lines instead.
    obj.co_lnotab,  # for < python 3.10 [not counted in args]

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 1 passed, 101 warnings in 273.62s (0:04:33) ======================================================

ChartQA large models:

chartqa | relaxed_accuracy,none: ground_truth=0.9 | measured=0.84
PASSED

======================================================= 1 passed, 3 warnings in 133.34s (0:02:13) =======================================================

ChartQA small models:

chartqa | relaxed_accuracy,none: ground_truth=0.855 | measured=0.8604
PASSED

======================================================= 1 passed, 3 warnings in 289.29s (0:04:49) =======================================================

Current:

======================================================= 6 passed, 3 warnings in 366.26s (0:06:06) =======================================================

@zhewenl zhewenl requested review from mgoin and simon-mo as code owners July 29, 2025 07:11
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Jul 29, 2025
@zhewenl zhewenl requested a review from houseroad July 29, 2025 07:12
@mergify mergify bot added deepseek Related to DeepSeek models llama Related to Llama models labels Jul 29, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for two new large models, DeepSeek-V3 and Llama-4-Maverick-FP8, to the accuracy testing suite, targeting H100/MI300X hardware. The changes involve adding new YAML configuration files for these models and updating the test_lm_eval_correctness.py script to allow for configurable gpu_memory_utilization and batch_size. While the changes are well-structured, I've identified a potential issue with the default gpu_memory_utilization value, which could lead to test instability.

num_fewshot: 8
trust_remote_code: True
max_model_len: 1024
batch_size: 1
Collaborator Author


I am testing on my local H100 with this config. It's not ideal (batch size = 1 and only 1k seq len), and perhaps we should test it on MI300X, which has much more GPU memory.

Collaborator


Noted that the publicly available H100 only has 80GB, so it'll OOM here. Wonder if we have AMD or H200 in CI?

If not, could you validate how much difference there is between full DS3 vs DS2 with TP/EP and so on?

If we cannot get DS3 added in CI, then let's try to add them to CD @huydnn if it's not already there.

Collaborator Author


I don't think we have any MI300X/H200 in CI; I will follow up with @huydhn on whether we can add it to the PyTorch CI (which has H100/MI300X now).


@zhewenl zhewenl force-pushed the add-more-large-model branch from c48e0de to 6496ade Compare July 29, 2025 22:12
@mergify mergify bot added documentation Improvements or additions to documentation frontend new-model Requests to new models performance Performance-related issues labels Jul 29, 2025
@zhewenl zhewenl force-pushed the add-more-large-model branch from 7a7d83e to 9fb7562 Compare July 30, 2025 00:20
@zhewenl zhewenl changed the title [RFC][CI/Build] Add Deepseek v3 and Llama4 Maverick FP8 [RFC][CI/Build] Add Llama4 Maverick FP8 GSM8K + ChartQA Accuracy Tests Jul 30, 2025
@zhewenl zhewenl force-pushed the add-more-large-model branch from 9fb7562 to 851ccc9 Compare July 30, 2025 00:28
@@ -0,0 +1,11 @@
# For hf script, without -t option (tensor parallel size).
Collaborator Author

@zhewenl zhewenl Jul 30, 2025


We will need to define model name + tasks (text/MM), which is not very scalable. We could refactor the YAML/code to support task_groups, where we define the test suite per model like this:
cc @robertgshaw2-redhat

task_groups:
    mm_tasks:
        name: "chartqa"
        ...
    text_tasks:
        name: "gsm8k"

@zhewenl zhewenl force-pushed the add-more-large-model branch from 851ccc9 to 813b018 Compare September 10, 2025 21:57
Collaborator

@houseroad houseroad left a comment


why change so many files?

@zhewenl zhewenl force-pushed the add-more-large-model branch 3 times, most recently from 114bf81 to 694588d Compare September 30, 2025 03:44
)
results = lm_eval.simple_evaluate(
-    model="vllm",
+    model=eval_config["backend"],
Member

@DarkLight1337 DarkLight1337 Oct 14, 2025


Can we use eval_config.get with default for the backend so we don't have to modify so many files?

Collaborator Author


Can we use eval_config.get with default for the backend so we don't have to modify so many files?

@DarkLight1337 Thanks for the suggestion! Updated PR and verified existing tests are working:

================================================ 6 passed, 3 warnings in 406.75s (0:06:46) ================================================
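The change suggested above can be sketched as follows; this is a simplified stand-in for the real test code, and the dict shape and the "vllm-vlm" value are illustrative assumptions:

```python
# Simplified sketch: fall back to the "vllm" backend when a model's eval
# config does not specify one, so the existing YAML configs need no edits.
def resolve_backend(eval_config: dict) -> str:
    return eval_config.get("backend", "vllm")

assert resolve_backend({}) == "vllm"                           # legacy config
assert resolve_backend({"backend": "vllm-vlm"}) == "vllm-vlm"  # MM config
```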

Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
source_file_dependencies:
- vllm/multimodal/
- vllm/inputs/
- vllm/model_executor/models
Collaborator


we should not have to run this test on every single model change

Collaborator Author


we should not have to run this test on every single model change

Discussed offline: we need to expand the current ci-infra to support globs/regex: https://github.com/vllm-project/ci-infra/blob/69766cdb77b731a1ac6371d40c577f028e68fa17/buildkite/test-template-ci.j2#L49
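The glob/regex support discussed here might look roughly like the following hypothetical Python sketch; the real ci-infra template is Jinja2-based and its matching logic may differ, and the function and parameter names are illustrative only:

```python
# Hypothetical sketch: run a test step only when some changed file matches
# one of its source_file_dependencies, treating entries that contain
# wildcard characters as glob patterns and plain entries as path prefixes.
from fnmatch import fnmatch

def step_should_run(changed_files: list[str], dependencies: list[str]) -> bool:
    for path in changed_files:
        for dep in dependencies:
            if any(ch in dep for ch in "*?["):
                if fnmatch(path, dep):
                    return True
            elif path.startswith(dep):
                return True
    return False

assert step_should_run(["vllm/multimodal/base.py"], ["vllm/multimodal/"])
assert not step_should_run(["docs/index.md"], ["vllm/multimodal/"])
```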

Signed-off-by: zhewenli <zhewenli@meta.com>
Collaborator

@yeqcharlotte yeqcharlotte left a comment


thanks! let's monitor how this goes. cc: @ywang96 @DarkLight1337

@yeqcharlotte yeqcharlotte changed the title [CI/Build] Add Llama4 Maverick FP8 MMLU Pro + ChartQA Accuracy Tests [CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI Oct 15, 2025
@mergify mergify bot added the qwen Related to Qwen models label Oct 15, 2025
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
@yeqcharlotte yeqcharlotte enabled auto-merge (squash) October 15, 2025 04:31
@yeqcharlotte yeqcharlotte merged commit f3c378f into vllm-project:main Oct 15, 2025
18 checks passed
@zhewenl zhewenl deleted the add-more-large-model branch October 15, 2025 16:57
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
bogdanminko pushed a commit to bogdanminko/vllm that referenced this pull request Oct 16, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Signed-off-by: bogdan01m <minkobogdan2001@gmail.com>
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 16, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…llm-project#21810)

Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>

Labels

ci/build deepseek Related to DeepSeek models documentation Improvements or additions to documentation frontend llama Related to Llama models new-model Requests to new models performance Performance-related issues qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed
