add record benchmark result table #80

Merged
merged 3 commits into from
Nov 28, 2024

Conversation

Silviase
Collaborator

  • Added a script that generates a CSV file for recording benchmark results.
  • The rows are built from the (model name, benchmark name, score name) entries saved under result.
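
As a rough illustration of the idea in the bullets above, here is a minimal sketch that turns per-model, per-benchmark scores into a long-format CSV. The header matches the one in the results posted in this thread; the function names and the in-memory `results` mapping are assumptions made for illustration and are not taken from the PR's actual script.

```python
import csv
import io

# Header as it appears in the CSV posted in this conversation.
HEADER = ["Model Name", "Benchmark Name", "Metric Name", "Score"]


def build_rows(results):
    """Turn {(model, benchmark): {metric: score}} into a list of CSV rows.

    The shape of `results` is a hypothetical stand-in for whatever the
    real script reads from the result directory.
    """
    rows = []
    for (model, benchmark), metrics in results.items():
        for metric, score in metrics.items():
            rows.append([model, benchmark, metric, score])
    return rows


def write_csv(rows, fp):
    """Write the header followed by one row per (model, benchmark, metric)."""
    writer = csv.writer(fp, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


def results_to_csv_text(results):
    """Convenience wrapper that returns the CSV as a string."""
    buf = io.StringIO()
    write_csv(build_rows(results), buf)
    return buf.getvalue()
```

Writing one row per metric keeps the file easy to extend: a new benchmark or metric adds rows without changing the columns.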

@speed1313
Could you run this and check the results? (It would also be a big help if you could add JMMMU, multi-image, and so on.)

@odashi
Could you also grant Sugiura access to Cloud Storage?

@Silviase Silviase linked an issue Nov 27, 2024 that may be closed by this pull request
@speed1313
Collaborator

Thank you.
I made a small adjustment so that metrics stored as nested dicts are supported.
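
One way nested dict metrics could be handled is to flatten them into single-level names before writing rows. The sketch below is only an assumption about the approach: the function name `flatten_metrics` and the `/` separator are hypothetical, and the actual change in this PR may name nested metrics differently.

```python
def flatten_metrics(metrics, prefix=""):
    """Flatten a possibly nested metric dict into {name: score}.

    Nested keys are joined with "/" (an illustrative choice, not
    necessarily what the PR does).
    """
    flat = {}
    for name, value in metrics.items():
        key = f"{prefix}{name}"
        if isinstance(value, dict):
            # Recurse into the nested dict, carrying the parent key along.
            flat.update(flatten_metrics(value, prefix=f"{key}/"))
        else:
            flat[key] = value
    return flat
```

For example, `flatten_metrics({"rougel": 1.0, "by_type": {"yesno": 0.5}})` gives `{"rougel": 1.0, "by_type/yesno": 0.5}`, so each leaf score can become one CSV row.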

Running it produced the following results.

Model Name,Benchmark Name,Metric Name,Score
llava-hf-llava-1.5-7b-hf,japanese-heron-bench,conv,3.238095238095238
llava-hf-llava-1.5-7b-hf,japanese-heron-bench,conv_rel,34.34343434343434
llava-hf-llava-1.5-7b-hf,japanese-heron-bench,detail,3.0952380952380953
llava-hf-llava-1.5-7b-hf,japanese-heron-bench,detail_rel,35.51912568306011
llava-hf-llava-1.5-7b-hf,japanese-heron-bench,complex,3.275
llava-hf-llava-1.5-7b-hf,japanese-heron-bench,complex_rel,35.89041095890411
llava-hf-llava-1.5-7b-hf,japanese-heron-bench,parse_error_count,0
llava-hf-llava-1.5-7b-hf,japanese-heron-bench,overall,3.2233009708737863
llava-hf-llava-1.5-7b-hf,japanese-heron-bench,overall_rel,35.25099032846619
OpenGVLab-InternVL2-8B,japanese-heron-bench,conv,3.880952380952381
OpenGVLab-InternVL2-8B,japanese-heron-bench,conv_rel,41.902313624678655
OpenGVLab-InternVL2-8B,japanese-heron-bench,detail,4.0476190476190474
OpenGVLab-InternVL2-8B,japanese-heron-bench,detail_rel,50.29585798816568
OpenGVLab-InternVL2-8B,japanese-heron-bench,complex,3.925
OpenGVLab-InternVL2-8B,japanese-heron-bench,complex_rel,44.857142857142854
OpenGVLab-InternVL2-8B,japanese-heron-bench,parse_error_count,0
OpenGVLab-InternVL2-8B,japanese-heron-bench,overall,3.9320388349514563
OpenGVLab-InternVL2-8B,japanese-heron-bench,overall_rel,45.68510482332906
llava-hf-llava-v1.6-mistral-7b-hf,japanese-heron-bench,conv,2.5952380952380953
llava-hf-llava-v1.6-mistral-7b-hf,japanese-heron-bench,conv_rel,37.58620689655172
llava-hf-llava-v1.6-mistral-7b-hf,japanese-heron-bench,detail,1.380952380952381
llava-hf-llava-v1.6-mistral-7b-hf,japanese-heron-bench,detail_rel,20.13888888888889
llava-hf-llava-v1.6-mistral-7b-hf,japanese-heron-bench,complex,1.925
llava-hf-llava-v1.6-mistral-7b-hf,japanese-heron-bench,complex_rel,25.328947368421055
llava-hf-llava-v1.6-mistral-7b-hf,japanese-heron-bench,parse_error_count,0
llava-hf-llava-v1.6-mistral-7b-hf,japanese-heron-bench,overall,2.087378640776699
llava-hf-llava-v1.6-mistral-7b-hf,japanese-heron-bench,overall_rel,27.684681051287225
gpt-4o-2024-05-13,japanese-heron-bench,conv,6.857142857142857
gpt-4o-2024-05-13,japanese-heron-bench,conv_rel,79.12087912087912
gpt-4o-2024-05-13,japanese-heron-bench,detail,6.761904761904762
gpt-4o-2024-05-13,japanese-heron-bench,detail_rel,91.02564102564102
gpt-4o-2024-05-13,japanese-heron-bench,complex,6.9
gpt-4o-2024-05-13,japanese-heron-bench,complex_rel,84.92307692307692
gpt-4o-2024-05-13,japanese-heron-bench,parse_error_count,0
gpt-4o-2024-05-13,japanese-heron-bench,overall,6.854368932038835
gpt-4o-2024-05-13,japanese-heron-bench,overall_rel,85.02319902319903
llava-hf-llava-1.5-7b-hf,ja-vg-vqa-500,rougel,13.946308219535375
llava-hf-llava-1.5-7b-hf,ja-vlm-bench-in-the-wild,rougel,40.679229480800295
llava-hf-llava-1.5-7b-hf,ja-vlm-bench-in-the-wild,llm_as_a_judge,2.52
llava-hf-llava-v1.6-mistral-7b-hf,ja-vlm-bench-in-the-wild,rougel,28.62429363991303
llava-hf-llava-v1.6-mistral-7b-hf,ja-vlm-bench-in-the-wild,llm_as_a_judge,2.5
llm-jp-VILA-ja,jdocqa,yesno_exact,0.6666666666666666
llm-jp-VILA-ja,jdocqa,factoid_exact,0.045454545454545456
llm-jp-VILA-ja,jdocqa,numerical_exact,0.07692307692307693
llm-jp-VILA-ja,jdocqa,open-ended_bleu,2.611098779484095
llava-hf-llava-1.5-7b-hf,jdocqa,yesno_exact,0.8666666666666667
llava-hf-llava-1.5-7b-hf,jdocqa,factoid_exact,0.045454545454545456
llava-hf-llava-1.5-7b-hf,jdocqa,numerical_exact,0.0
llava-hf-llava-1.5-7b-hf,jdocqa,open-ended_bleu,7.344318860699513
OpenGVLab-InternVL2-8B,jdocqa,yesno_exact,0.8
OpenGVLab-InternVL2-8B,jdocqa,factoid_exact,0.13636363636363635
OpenGVLab-InternVL2-8B,jdocqa,numerical_exact,0.07692307692307693
OpenGVLab-InternVL2-8B,jdocqa,open-ended_bleu,8.088679765402643
Qwen-Qwen2-VL-7B-Instruct,jdocqa,yesno_exact,0.6
Qwen-Qwen2-VL-7B-Instruct,jdocqa,factoid_exact,0.4090909090909091
Qwen-Qwen2-VL-7B-Instruct,jdocqa,numerical_exact,0.0
Qwen-Qwen2-VL-7B-Instruct,jdocqa,open-ended_bleu,18.57594965065007
gpt-4o-2024-05-13,jdocqa,yesno_exact,0.6
gpt-4o-2024-05-13,jdocqa,factoid_exact,0.22727272727272727
gpt-4o-2024-05-13,jdocqa,numerical_exact,0.0
gpt-4o-2024-05-13,jdocqa,open-ended_bleu,7.9305565320043065
llava-hf-llava-1.5-7b-hf,ja-multi-image-vqa,rougel,35.98891855337162
llava-hf-llava-1.5-7b-hf,jmmmu,Overall-Art and Psychology,0.22222
llava-hf-llava-1.5-7b-hf,jmmmu,Design,0.26667
llava-hf-llava-1.5-7b-hf,jmmmu,Music,0.26667
llava-hf-llava-1.5-7b-hf,jmmmu,Psychology,0.13333
llava-hf-llava-1.5-7b-hf,jmmmu,Overall-Business,0.27333
llava-hf-llava-1.5-7b-hf,jmmmu,Accounting,0.33333
llava-hf-llava-1.5-7b-hf,jmmmu,Economics,0.26667
llava-hf-llava-1.5-7b-hf,jmmmu,Finance,0.2
llava-hf-llava-1.5-7b-hf,jmmmu,Manage,0.3
llava-hf-llava-1.5-7b-hf,jmmmu,Marketing,0.26667
llava-hf-llava-1.5-7b-hf,jmmmu,Overall-Science,0.29167
llava-hf-llava-1.5-7b-hf,jmmmu,Biology,0.36667
llava-hf-llava-1.5-7b-hf,jmmmu,Chemistry,0.23333
llava-hf-llava-1.5-7b-hf,jmmmu,Math,0.3
llava-hf-llava-1.5-7b-hf,jmmmu,Physics,0.26667
llava-hf-llava-1.5-7b-hf,jmmmu,Overall-Health and Medicine,0.34667
llava-hf-llava-1.5-7b-hf,jmmmu,Basic_Medical_Science,0.43333
llava-hf-llava-1.5-7b-hf,jmmmu,Clinical_Medicine,0.36667
llava-hf-llava-1.5-7b-hf,jmmmu,Diagnostics_and_Laboratory_Medicine,0.26667
llava-hf-llava-1.5-7b-hf,jmmmu,Pharmacy,0.3
llava-hf-llava-1.5-7b-hf,jmmmu,Public_Health,0.36667
llava-hf-llava-1.5-7b-hf,jmmmu,Overall-Tech and Engineering,0.30952
llava-hf-llava-1.5-7b-hf,jmmmu,Agriculture,0.3
llava-hf-llava-1.5-7b-hf,jmmmu,Architecture_and_Engineering,0.33333
llava-hf-llava-1.5-7b-hf,jmmmu,Computer_Science,0.4
llava-hf-llava-1.5-7b-hf,jmmmu,Electronics,0.2
llava-hf-llava-1.5-7b-hf,jmmmu,Energy_and_Power,0.3
llava-hf-llava-1.5-7b-hf,jmmmu,Materials,0.33333
llava-hf-llava-1.5-7b-hf,jmmmu,Mechanical_Engineering,0.3
llava-hf-llava-1.5-7b-hf,jmmmu,Overall,0.29621
OpenGVLab-InternVL2-8B,jmmmu,Overall-Art and Psychology,0.4
OpenGVLab-InternVL2-8B,jmmmu,Design,0.56667
OpenGVLab-InternVL2-8B,jmmmu,Music,0.2
OpenGVLab-InternVL2-8B,jmmmu,Psychology,0.43333
OpenGVLab-InternVL2-8B,jmmmu,Overall-Business,0.38
OpenGVLab-InternVL2-8B,jmmmu,Accounting,0.5
OpenGVLab-InternVL2-8B,jmmmu,Economics,0.3
OpenGVLab-InternVL2-8B,jmmmu,Finance,0.2
OpenGVLab-InternVL2-8B,jmmmu,Manage,0.36667
OpenGVLab-InternVL2-8B,jmmmu,Marketing,0.53333
OpenGVLab-InternVL2-8B,jmmmu,Overall-Science,0.33333
OpenGVLab-InternVL2-8B,jmmmu,Biology,0.33333
OpenGVLab-InternVL2-8B,jmmmu,Chemistry,0.33333
OpenGVLab-InternVL2-8B,jmmmu,Math,0.36667
OpenGVLab-InternVL2-8B,jmmmu,Physics,0.3
OpenGVLab-InternVL2-8B,jmmmu,Overall-Health and Medicine,0.37333
OpenGVLab-InternVL2-8B,jmmmu,Basic_Medical_Science,0.53333
OpenGVLab-InternVL2-8B,jmmmu,Clinical_Medicine,0.36667
OpenGVLab-InternVL2-8B,jmmmu,Diagnostics_and_Laboratory_Medicine,0.36667
OpenGVLab-InternVL2-8B,jmmmu,Pharmacy,0.33333
OpenGVLab-InternVL2-8B,jmmmu,Public_Health,0.26667
OpenGVLab-InternVL2-8B,jmmmu,Overall-Tech and Engineering,0.32381
OpenGVLab-InternVL2-8B,jmmmu,Agriculture,0.33333
OpenGVLab-InternVL2-8B,jmmmu,Architecture_and_Engineering,0.23333
OpenGVLab-InternVL2-8B,jmmmu,Computer_Science,0.5
OpenGVLab-InternVL2-8B,jmmmu,Electronics,0.3
OpenGVLab-InternVL2-8B,jmmmu,Energy_and_Power,0.3
OpenGVLab-InternVL2-8B,jmmmu,Materials,0.3
OpenGVLab-InternVL2-8B,jmmmu,Mechanical_Engineering,0.3
OpenGVLab-InternVL2-8B,jmmmu,Overall,0.38864
gpt-4o-2024-05-13,jmmmu,Overall-Art and Psychology,0.54444
gpt-4o-2024-05-13,jmmmu,Design,0.7
gpt-4o-2024-05-13,jmmmu,Music,0.4
gpt-4o-2024-05-13,jmmmu,Psychology,0.53333
gpt-4o-2024-05-13,jmmmu,Overall-Business,0.40667
gpt-4o-2024-05-13,jmmmu,Accounting,0.3
gpt-4o-2024-05-13,jmmmu,Economics,0.5
gpt-4o-2024-05-13,jmmmu,Finance,0.33333
gpt-4o-2024-05-13,jmmmu,Manage,0.53333
gpt-4o-2024-05-13,jmmmu,Marketing,0.36667
gpt-4o-2024-05-13,jmmmu,Overall-Science,0.39167
gpt-4o-2024-05-13,jmmmu,Biology,0.26667
gpt-4o-2024-05-13,jmmmu,Chemistry,0.43333
gpt-4o-2024-05-13,jmmmu,Math,0.43333
gpt-4o-2024-05-13,jmmmu,Physics,0.43333
gpt-4o-2024-05-13,jmmmu,Overall-Health and Medicine,0.52
gpt-4o-2024-05-13,jmmmu,Basic_Medical_Science,0.66667
gpt-4o-2024-05-13,jmmmu,Clinical_Medicine,0.36667
gpt-4o-2024-05-13,jmmmu,Diagnostics_and_Laboratory_Medicine,0.5
gpt-4o-2024-05-13,jmmmu,Pharmacy,0.63333
gpt-4o-2024-05-13,jmmmu,Public_Health,0.43333
gpt-4o-2024-05-13,jmmmu,Overall-Tech and Engineering,0.38095
gpt-4o-2024-05-13,jmmmu,Agriculture,0.56667
gpt-4o-2024-05-13,jmmmu,Architecture_and_Engineering,0.4
gpt-4o-2024-05-13,jmmmu,Computer_Science,0.5
gpt-4o-2024-05-13,jmmmu,Electronics,0.26667
gpt-4o-2024-05-13,jmmmu,Energy_and_Power,0.33333
gpt-4o-2024-05-13,jmmmu,Materials,0.3
gpt-4o-2024-05-13,jmmmu,Mechanical_Engineering,0.3
gpt-4o-2024-05-13,jmmmu,Overall,0.52424
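
For anyone who wants to compare models on one benchmark, a long-format CSV like the one above can be regrouped by model. This is a minimal sketch using only the standard library; `pivot_scores` is a hypothetical helper and is not part of this PR.

```python
import csv
import io
from collections import defaultdict


def pivot_scores(csv_text, benchmark):
    """Return {model: {metric: score}} for a single benchmark.

    Expects the four-column header used in the results above.
    """
    table = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Benchmark Name"] == benchmark:
            table[row["Model Name"]][row["Metric Name"]] = float(row["Score"])
    return dict(table)


# Two rows copied from the results above, plus one from another benchmark.
sample = """Model Name,Benchmark Name,Metric Name,Score
llava-hf-llava-1.5-7b-hf,ja-vlm-bench-in-the-wild,rougel,40.679229480800295
llava-hf-llava-1.5-7b-hf,ja-vlm-bench-in-the-wild,llm_as_a_judge,2.52
llava-hf-llava-1.5-7b-hf,ja-vg-vqa-500,rougel,13.946308219535375
"""

wild = pivot_scores(sample, "ja-vlm-bench-in-the-wild")
```

Here `wild` holds one entry for `llava-hf-llava-1.5-7b-hf` with its `rougel` and `llm_as_a_judge` scores, and the `ja-vg-vqa-500` row is filtered out.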

@Silviase
Collaborator Author

This is a huge help!!! Thank you.

@Silviase Silviase merged commit 48fe6c0 into master Nov 28, 2024
@speed1313 speed1313 deleted the 70-record branch November 28, 2024 03:34
Successfully merging this pull request may close these issues.

Use streamlit as a visualization tool