
[Script] Update new VLM accuracy test for Qwen3-VL #135

Merged

qichu-yun merged 1 commit into dev/perf from update_vlm_test on Jan 6, 2026

Conversation

qichu-yun (Collaborator) commented Jan 6, 2026

Update new VLM accuracy test for Qwen3-VL

Inspired by sgl-project#15205

Start evaluating:

    export OPENAI_API_KEY=EMPTY
    export OPENAI_API_BASE=http://localhost:9000/v1
    export PYTHONPATH=/the/path/to/your/sglang/python

    python3 -m lmms_eval \
        --model=openai_compatible \
        --model_args model_version=/mnt/raid0/models/Qwen3-VL-235B-A22B-Instruct-FP8-dynamic/ \
        --tasks mmmu_val \
        --batch_size 16

The expected result with TP8 on MI308:

    openai_compatible (model_version=/mnt/raid0/models/Qwen3-VL-235B-A22B-Instruct-FP8-dynamic/), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 16
    | Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
    |--------|------:|------|-----:|--------|---|-----:|---|------|
    |mmmu_val|      0|none  |     0|mmmu_acc||0.607 |±  |   N/A|
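If the score is needed programmatically (e.g. to gate a CI run), it can be pulled out of the summary table above. A minimal sketch, assuming the result row has exactly the shape printed by lmms_eval:

```shell
# Extract the mmmu_acc value from the lmms_eval summary row.
# Splitting on '|' puts the value in field 8 (field 7 is the empty
# column produced by the '||' in the row).
RESULT='|mmmu_val|      0|none  |     0|mmmu_acc||0.607 |±  |   N/A|'
echo "$RESULT" | awk -F'|' '{ gsub(/ /, "", $8); print $8 }'
# → 0.607
```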

@qichu-yun qichu-yun merged commit 6d2106d into dev/perf Jan 6, 2026
4 checks passed
sammysun0711 (Collaborator) commented Jan 7, 2026

May I know why the MMMU accuracy measured with lmms_eval (0.607) differs from the 0.584 measured with benchmark/mmmu/README.md in CI? https://github.com/zejunchen-zejun/sglang/actions/runs/20593023968/job/59141786653

qichu-yun (Collaborator, Author) commented Jan 7, 2026

> May I know why the MMMU accuracy measured with lmms_eval (0.607) differs from the 0.584 measured with benchmark/mmmu/README.md in CI? https://github.com/zejunchen-zejun/sglang/actions/runs/20593023968/job/59141786653

Because the accuracy often fluctuates within a certain range, scores between 0.57 and 0.61 are all considered acceptable. When I use the method in benchmark/mmmu/README.md, I often get different results.
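The tolerance described above can be sketched as a simple pass/fail check. The 0.57–0.61 bounds come from this comment; the check itself is a hypothetical helper, not part of the test script:

```shell
# Hypothetical tolerance gate: treat an mmmu_acc between 0.57 and 0.61
# (inclusive) as a pass, per the acceptable range quoted above.
ACC=0.607
if awk -v a="$ACC" 'BEGIN { exit !(a >= 0.57 && a <= 0.61) }'; then
    echo "mmmu_acc $ACC within tolerance"
else
    echo "mmmu_acc $ACC outside tolerance"
fi
```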

@qichu-yun qichu-yun deleted the update_vlm_test branch March 11, 2026 07:27
@qichu-yun qichu-yun restored the update_vlm_test branch March 11, 2026 07:27
@qichu-yun qichu-yun deleted the update_vlm_test branch March 16, 2026 08:45
