InternVL and MMMU dataset evaluation #4509
Conversation
@Fxycst1213 Well done. Maybe you can update the Qwen2-VL and InternVL MMMU accuracy results in #4456.

@zhaochenyang20 I have fixed some bugs. Could you help me test it again?

Reran. Thanks! If you need an immediate rerun, ping me on WeChat.

This pull request has been automatically closed due to inactivity. Please feel free to reopen it if needed.
Motivation
1. Based on the #3351 branch, I implemented the InternVL model with the language part from Qwen (a rough sketch of the composition follows).
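The sketch below is illustrative only and is not the PR's actual code: with assumed class, attribute, and dimension names, it shows how an InternVL-style model can reuse Qwen2 as its language backbone by projecting vision features into the LLM embedding space and splicing them in at the `<image>` placeholder positions.

```python
# Illustrative sketch only -- not the PR's real implementation. Class,
# attribute, and dimension choices here are assumptions for exposition.
import torch
from torch import nn
from transformers import Qwen2Config, Qwen2ForCausalLM


class InternVLWithQwen2(nn.Module):
    def __init__(self, vision_tower: nn.Module, vision_dim: int, llm_config: Qwen2Config):
        super().__init__()
        self.vision_tower = vision_tower                    # e.g. an InternViT encoder
        self.language_model = Qwen2ForCausalLM(llm_config)  # the Qwen language part
        # Project vision features into the LLM's hidden size.
        self.projector = nn.Linear(vision_dim, llm_config.hidden_size)

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor, image_token_id: int):
        vision_feats = self.vision_tower(pixel_values)      # (n_img, n_patch, vision_dim)
        vision_embeds = self.projector(vision_feats)

        # Embed the text, then overwrite the <image> placeholder positions
        # with the projected image embeddings (token counts must match).
        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
        image_mask = input_ids == image_token_id
        inputs_embeds[image_mask] = vision_embeds.reshape(
            -1, vision_embeds.size(-1)
        ).to(inputs_embeds.dtype)

        return self.language_model(inputs_embeds=inputs_embeds)
```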



2. When evaluating the model on the MMMU dataset, we encountered issues with multi-image inputs: the generated prompt was incomplete. We compared against VLMEvalKit, with the results as follows.
VLMEvalKit: (generated prompt screenshot)
bench_hf: (generated prompt screenshot)
Additionally, different models require different prompt modifications. For example, InternVL requires "Image-i" as the separator for multi-image inputs. When evaluating InternVL with bench_hf, I found that its prompt differs from the one generated by VLMEvalKit. The correct prompt format should be as follows:
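The original screenshot of the exact prompt is not preserved; as a hedged reconstruction of the convention described above (the helper name and question text are made up), the separator-based format looks like this:

```python
def build_internvl_prompt(question: str, num_images: int) -> str:
    """Illustrative helper: label each image placeholder with "Image-i:",
    the multi-image separator InternVL expects."""
    image_lines = [f"Image-{i + 1}: <image>" for i in range(num_images)]
    return "\n".join(image_lines) + "\n" + question


# For two images this produces:
#   Image-1: <image>
#   Image-2: <image>
#   Which option matches the figure?
print(build_internvl_prompt("Which option matches the figure?", 2))
```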
The Qwen2-VL-7B model scored lower than the results measured with VLMEvalKit.
Modifications
Added the InternVL model and tested it. After modifying the MMMU evaluation script, we evaluated InternVL-38B and Qwen2-VL-7B, and updated the results in the README.
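As a quick smoke test of the new integration, one could run something like the sketch below. It is written under assumptions: the model path, image file names, and sampling parameters are placeholders, and the single-prompt return shape of `Engine.generate` may differ across sglang versions.

```python
import sglang as sgl

if __name__ == "__main__":
    # Load an InternVL checkpoint through sglang's offline Engine API
    # (model path is a placeholder, not necessarily the one tested in this PR).
    engine = sgl.Engine(model_path="OpenGVLab/InternVL2_5-38B")

    # Multi-image MMMU-style prompt using the "Image-i" separators.
    prompt = "Image-1: <image>\nImage-2: <image>\nWhich option is correct?"
    out = engine.generate(
        prompt=prompt,
        image_data=["img1.png", "img2.png"],  # placeholder image files
        sampling_params={"temperature": 0.0, "max_new_tokens": 64},
    )
    print(out["text"])  # assumes a single-prompt call returns a dict
    engine.shutdown()
```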
InternVL-38B
- VLMEvalKit: (results screenshot)
- bench_hf: (results screenshot)
- bench_sglang: (results screenshot)

Qwen2-VL-7B
- VLMEvalKit: (results screenshot)
- bench_hf: (results screenshot)
- bench_sglang: (results screenshot)
Checklist