[CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI#21810
yeqcharlotte merged 10 commits into vllm-project:main from
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request introduces support for two new large models, DeepSeek-V3 and Llama-4-Maverick-FP8, to the accuracy testing suite, targeting H100/MI300X hardware. The changes involve adding new YAML configuration files for these models and updating the test_lm_eval_correctness.py script to allow for configurable gpu_memory_utilization and batch_size. While the changes are well-structured, I've identified a potential issue with the default gpu_memory_utilization value, which could lead to test instability.
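To illustrate the configurability the review summary describes, here is a hedged sketch of how the test script might assemble model arguments from a YAML-derived config dict, falling back to defaults when a key is absent. The helper name `build_model_args` and the default values are illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch: read optional tuning knobs from an eval_config dict
# (parsed from the per-model YAML) with safe fallbacks, so existing config
# files that omit the new keys keep working unchanged.
def build_model_args(eval_config: dict) -> dict:
    return {
        "pretrained": eval_config["model_name"],
        # Assumed defaults: 90% GPU memory fraction, batch size 1.
        "gpu_memory_utilization": eval_config.get("gpu_memory_utilization", 0.9),
        "batch_size": eval_config.get("batch_size", 1),
        "max_model_len": eval_config.get("max_model_len", 4096),
    }

cfg = {"model_name": "Qwen/Qwen2.5-VL-7B-Instruct", "max_model_len": 1024}
print(build_model_args(cfg)["gpu_memory_utilization"])  # 0.9 (default applied)
```

With this shape, only configs that need a non-default value (e.g. a large model that would otherwise OOM) have to set the new keys.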
num_fewshot: 8
trust_remote_code: True
max_model_len: 1024
batch_size: 1
I am testing on my local H100 using this config; it's not ideal (batch size = 1 and only 1k seq len). Perhaps we should test it on MI300X, which has much more GPU memory.
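For reference, the flat config snippet above can be turned into a Python dict with a small stdlib-only parser. This is a hedged sketch of roughly what `yaml.safe_load` would produce for this flat format; the helper `parse_flat_config` is hypothetical, not part of the PR.

```python
# Hypothetical stdlib-only parser for the flat "key: value" eval-config
# format shown above, with basic type coercion (bool, int, float, str).
def parse_flat_config(text: str) -> dict:
    cfg = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line or ":" not in line:
            continue
        key, _, raw = line.partition(":")
        raw = raw.strip()
        if raw in ("True", "False"):
            value = raw == "True"
        else:
            try:
                value = int(raw)
            except ValueError:
                try:
                    value = float(raw)
                except ValueError:
                    value = raw
        cfg[key.strip()] = value
    return cfg

sample = "num_fewshot: 8\ntrust_remote_code: True\nmax_model_len: 1024\nbatch_size: 1"
print(parse_flat_config(sample)["max_model_len"])  # 1024
```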
noted that publicly available H100 only has 80GB, so it'll OOM here. Wonder if we have AMD or H200 in CI?
If not, could you validate how much difference there is between full DS3 vs DS2 with tp/ep/and so on?
If we cannot get DS3 added in CI, then let's try to add them to CD @huydnn if it's not already there.
I don't think we have any MI300X/H200 in CI, and will follow up with @huydhn whether we can add it to Pytorch CI (it has H100/MI300X now)
@@ -0,0 +1,11 @@
# For hf script, without -t option (tensor parallel size).
we will need to define model name + tasks (text/MM), which is not that scalable. We can think about refactoring the yaml/code to support task_groups, where we can define the test suite per model like this:
cc @robertgshaw2-redhat
task_groups:
  mm_tasks:
    name: "chartqa"
    ...
  text_tasks:
    name: "gsm8k"
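The proposed refactor could be consumed by the test script roughly like this. A hedged sketch only: the `task_groups` shape and the `expand_tasks` helper are assumptions about the suggestion above, not existing code.

```python
# Hypothetical sketch of the proposed task_groups config: one entry per
# model lists multimodal and text task suites, and the test harness
# flattens them into the flat task list lm-eval expects.
task_groups = {
    "mm_tasks": [{"name": "chartqa"}],
    "text_tasks": [{"name": "gsm8k"}],
}

def expand_tasks(task_groups: dict) -> list:
    """Flatten grouped task definitions into a flat list of task names."""
    return [task["name"] for group in task_groups.values() for task in group]

print(expand_tasks(task_groups))  # ['chartqa', 'gsm8k']
```

This keeps each model's suite in one YAML file while letting the harness run text and multimodal tasks from a single code path.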
houseroad left a comment
why change so many files?
  )
  results = lm_eval.simple_evaluate(
-     model="vllm",
+     model=eval_config["backend"],
Can we use eval_config.get with default for the backend so we don't have to modify so many files?
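The suggested change can be sketched as a one-line default lookup. A hedged illustration: the helper name `resolve_backend` is hypothetical, and `"vllm-vlm"` as the multimodal backend value is an assumption about the lm-eval backend names used here.

```python
# Hedged sketch of the suggestion: default the backend to "vllm" via
# dict.get, so legacy YAML configs without a "backend" key need no edits.
def resolve_backend(eval_config: dict) -> str:
    return eval_config.get("backend", "vllm")

print(resolve_backend({}))                       # vllm (legacy text-only config)
print(resolve_backend({"backend": "vllm-vlm"}))  # vllm-vlm (multimodal config)
```

The resolved value would then be passed as `model=resolve_backend(eval_config)` to `lm_eval.simple_evaluate`, touching only the one call site instead of every existing config file.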
@DarkLight1337 Thanks for the suggestion! Updated PR and verified existing tests are working:
============================================================ warnings summary =============================================================
../../../../uv_env/vllm/lib64/python3.12/site-packages/schemathesis/generation/coverage.py:305
/home/zhewenli/uv_env/vllm/lib64/python3.12/site-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
ref_error: type[Exception] = jsonschema.RefResolutionError,
.buildkite/lm-eval-harness/test_lm_eval_correctness.py::test_lm_eval_correctness_param[config_filename0]
/home/zhewenli/uv_env/vllm/lib64/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
.buildkite/lm-eval-harness/test_lm_eval_correctness.py::test_lm_eval_correctness_param[config_filename0]
/usr/lib64/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=2667961) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================ 6 passed, 3 warnings in 406.75s (0:06:46) ================================================
Signed-off-by: zhewenli <zhewenli@meta.com>
.buildkite/test-pipeline.yaml (outdated)
source_file_dependencies:
- vllm/multimodal/
- vllm/inputs/
- vllm/model_executor/models
we should not have to run this test on every single model change
offline discussed: we need to expand the current ci-infra to support globs/regexes: https://github.com/vllm-project/ci-infra/blob/69766cdb77b731a1ac6371d40c577f028e68fa17/buildkite/test-template-ci.j2#L49
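The glob support discussed above could look roughly like this on the template side. A hedged sketch only: `step_should_run` is a hypothetical helper, and the patterns shown are illustrative, not the actual ci-infra logic.

```python
import fnmatch

# Hypothetical sketch of glob matching for source_file_dependencies:
# a pipeline step runs only when some changed file matches some pattern,
# so broad model-directory changes need not trigger every eval step.
def step_should_run(changed_files: list, dependency_globs: list) -> bool:
    return any(
        fnmatch.fnmatch(path, pattern)
        for path in changed_files
        for pattern in dependency_globs
    )

print(step_should_run(
    ["vllm/model_executor/models/qwen2_vl.py"],
    ["vllm/multimodal/*", "vllm/model_executor/models/*"],
))  # True
print(step_should_run(["docs/index.md"], ["vllm/multimodal/*"]))  # False
```

Note that Python's `fnmatch` treats `*` as matching path separators too, so `vllm/multimodal/*` would also match nested files; a real implementation might want `pathlib.PurePath.match` or regexes for finer control.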
yeqcharlotte left a comment
thanks! let's monitor how this goes. cc: @ywang96 @DarkLight1337
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
…llm-project#21810) Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com> Signed-off-by: zhewenli <zhewenli@meta.com> Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com> Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com> Signed-off-by: bbartels <benjamin@bartels.dev>
Purpose
Add more large models to accuracy testing. Note they are unable to run on A100, so we will add them only after we get H100/MI300X capacity.
Evals for small models will be added as LM Eval Small Multimodal Models, which will take ~50 min in CI (example). This PR also picks up #19959, where we added support for MM evals.
Test Plan
Test Result
MMLU Pro:
ChartQA large models:
ChartQA small models:
Current: