pfgen-benchmark is a benchmark designed to evaluate Japanese text generation, specifically for pretrained models. Unlike conventional benchmarks that use templates containing explicit instructions, this benchmark relies solely on providing numerous examples. Expectations such as the question-answering nature of the task, a response length of approximately 100 characters, and output resembling formal public documents are conveyed purely through examples, which minimizes the influence of differences in instructions or templates. In addition, outputs are evaluated with n-gram-based methods, enabling quick, low-cost, and deterministic evaluation, unlike the LLM-as-a-Judge approach.
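To give a feel for what deterministic, n-gram-based scoring looks like, here is a minimal sketch that computes character n-gram overlap between a candidate answer and a reference. It is an illustration only; the repository's actual Fluency/Truthfulness/Helpfulness metrics are defined in the paper and computed by the evaluation code here, and the function names below are invented for this sketch.

```python
# Minimal sketch (illustration only): a deterministic character-n-gram overlap
# score between a candidate answer and a reference answer. The actual
# pfgen-benchmark metrics (Fluency / Truthfulness / Helpfulness) are defined
# in the paper and computed by this repository's evaluation code.
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_overlap(candidate: str, reference: str, n: int = 3) -> float:
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = char_ngrams(candidate, n)
    ref = char_ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

# Identical inputs always give the same score -- no LLM judge involved.
print(ngram_overlap("富士山は日本で最も高い山です。", "日本で最も高い山は富士山です。"))
```

Because the score depends only on the two strings, re-running an evaluation reproduces exactly the same numbers, which is what makes the benchmark cheap and deterministic.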
To enable comparisons across as many models as possible, the leaderboard actively includes a wide range of models. These include openly accessible models, models mentioned in academic papers, and those announced by companies through press releases. Contributions of model outputs are encouraged, and results can be submitted via pull requests. For detailed instructions on how to contribute, please refer to the "How to Contribute" section.
See more details: Jxiv preprint (arXiv: TBD)
Everything in this repository other than the LLM outputs is licensed under the Apache License, Version 2.0. The license of each LLM output depends on the license of the corresponding model.
You can evaluate a model using run-hf.py (which uses transformers) or run-vllm.py (which uses vLLM). For the full list of parameters, refer to --help. The --num-trials parameter sets the number of prompt patterns for which the model generates answers; choose it by weighing execution time against the required accuracy.
# Run a model using the Hugging Face Transformers library or vLLM.
python ./run-hf.py --model=pfnet/plamo-13b --num-trials=5
# Evaluate output and update leaderboard.
make
Follow the instructions in the "How to Evaluate Model" section to run the evaluation. This process will generate config.json and trials.jsonl.xz files under the result directory. Please create a pull request containing only these two files.
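Before opening the pull request, a small check like the following can confirm that the result directory contains exactly those two files. This is an illustrative helper, not part of the repository; the "result/<vendor>/<model>" path used at the bottom is an assumption for this sketch.

```python
# Illustrative pre-PR check: verify that a result directory contains exactly
# config.json and trials.jsonl.xz. The "result/<vendor>/<model>" layout used
# below is an assumption for this sketch, not a requirement of the repository.
from pathlib import Path
import sys

def check_result_dir(result_dir: str) -> None:
    expected = {"config.json", "trials.jsonl.xz"}
    found = {p.name for p in Path(result_dir).iterdir() if p.is_file()}
    missing, extra = expected - found, found - expected
    if missing or extra:
        sys.exit(f"{result_dir}: missing {sorted(missing)}, unexpected {sorted(extra)}")
    print(f"{result_dir}: OK ({', '.join(sorted(expected))})")

if __name__ == "__main__":
    check_result_dir(sys.argv[1] if len(sys.argv) > 1 else "result/pfnet/plamo-13b")
```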
To ensure a more accurate ranking among models, set the number of trials (--num-trials) as high as possible, up to a limit of 100.
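For intuition on this trade-off, the sketch below shows how aggregating per-trial scores tightens the estimate of the mean as the number of trials grows. It is illustrative only; exactly how the leaderboard's "(±x/√n)" column is computed is defined by the evaluation code, not by this snippet.

```python
# Illustration of why more trials give a more accurate ranking: the implied
# uncertainty of the mean score shrinks roughly with the square root of the
# number of trials. (The leaderboard's exact "(±x/√n)" computation lives in
# the evaluation code; this sketch only shows the general effect.)
import statistics

def mean_with_spread(trial_scores: list[float]) -> str:
    """Format per-trial scores as 'mean (±stdev/√n)', echoing the leaderboard-style notation."""
    n = len(trial_scores)
    mean = statistics.mean(trial_scores)
    stdev = statistics.stdev(trial_scores) if n > 1 else 0.0
    return f"{mean:.4f} (±{stdev:.4f}/√{n})"

# With more trials, √n grows, so stdev/√n (the spread of the mean) shrinks.
print(mean_with_spread([0.93, 0.92, 0.94, 0.91, 0.95]))
```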
Rank | Score | Model | Length | Fluency | Truthfulness | Helpfulness
---|---|---|---|---|---|---
N/A | 1.0501 (±0.0000/√1) | 👑 system/ground-truth | 100.0 (±0.0) | 1.155 | 0.996 | 1.000
1 | 0.9303 (±0.0083/√10) | 💬 anthropic/claude-3-5-sonnet-20240620 | 102.2 (±10.4) | 0.949 | 0.959 | 0.883
2 | 0.9144 (±0.0037/√2) | 💬 deepseek-ai/DeepSeek-V3 | 87.4 (±14.9) | 0.960 | 0.983 | 0.800
3 | 0.8615 (±0.0092/√10) | 💬 openai/gpt-4o | 84.5 (±18.6) | 0.919 | 0.980 | 0.686
N/A | 0.8494 (±0.0253/√1000) | 🎯 system/criteria | 100.0 (±3.4) | 0.936 | 0.978 | 0.505
4 | 0.8270 (±0.0229/√10) | 💬 anthropic/claude-3-opus-20240229 | 102.3 (±9.5) | 0.911 | 0.944 | 0.627
5 | 0.8059 (±0.0169/√5) | 💬 google/gemini-2.0-flash-exp | 68.0 (±17.7) | 0.834 | 0.984 | 0.600
6 | 0.8036 (±0.0133/√10) | 💬 openai/gpt-4-turbo | 86.5 (±17.4) | 0.820 | 0.959 | 0.632
7 | 0.7916 (±0.0146/√10) | 💬 openai/gpt-4 | 107.2 (±11.6) | 0.888 | 0.951 | 0.536
8 | 0.7821 (±0.0166/√5) | 💬 Qwen/Qwen2.5-72B-Instruct | 98.3 (±14.9) | 0.871 | 0.933 | 0.542
9 | 0.7789 (±0.0213/√100) | 🟢 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 | 109.1 (±36.8) | 0.890 | 0.941 | 0.506
10 | 0.7773 (±0.0168/√100) | 💬 pfnet/plamo-1.0-prime | 178.2 (±114.5) | 0.874 | 0.942 | 0.516
11 | 0.7768 (±0.0113/√5) | 💬 mlx-community/Qwen2.5-72B-Instruct-4bit | 100.8 (±17.7) | 0.860 | 0.933 | 0.538
12 | 0.7766 (±0.0276/√100) | 🟢 tokyotech-llm/Swallow-70b-NVE-hf | 104.1 (±17.9) | 0.884 | 0.938 | 0.507
13 | 0.7756 (±0.0264/√100) | 🟢 tokyotech-llm/Swallow-70b-NVE-instruc... | 104.1 (±18.5) | 0.878 | 0.938 | 0.510
14 | 0.7748 (±0.0000/√1) | 💬 openai/chatgpt-o1 | 76.3 (±17.7) | 0.755 | 0.960 | 0.610
15 | 0.7650 (±0.0263/√100) | 🟢 tokyotech-llm/Swallow-70b-instruct-hf | 102.5 (±14.4) | 0.872 | 0.929 | 0.494
16 | 0.7643 (±0.0000/√1) | 💬 openai/chatgpt-o1-pro | 79.5 (±17.3) | 0.748 | 0.955 | 0.590
17 | 0.7628 (±0.0275/√100) | 🟢 tokyotech-llm/Swallow-70b-hf | 103.5 (±16.1) | 0.876 | 0.930 | 0.483
18 | 0.7469 (±0.0270/√100) | 🟢 pfnet/plamo-100b-base | 115.2 (±64.0) | 0.861 | 0.920 | 0.460
19 | 0.7458 (±0.0085/√5) | 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... | 69.5 (±25.9) | 0.842 | 0.972 | 0.424
20 | 0.7444 (±0.0260/√100) | 🟢 sbintuitions/sarashina2-70b | 120.0 (±49.4) | 0.825 | 0.923 | 0.485
21 | 0.7423 (±0.0302/√100) | 💬 cyberagent/Llama-3.1-70B-Japanese-Ins... | 199.2 (±110.3) | 0.817 | 0.905 | 0.505
22 | 0.7365 (±0.0218/√100) | 🟢 CohereForAI/c4ai-command-r-plus | 107.5 (±42.3) | 0.818 | 0.913 | 0.478
23 | 0.7336 (±0.0254/√100) | 🟢 tokyotech-llm/Llama-3-Swallow-70B-v0.1 | 108.2 (±24.7) | 0.837 | 0.908 | 0.456
24 | 0.7320 (±0.0201/√10) | 💬 anthropic/claude-3-sonnet-20240229 | 114.3 (±18.9) | 0.810 | 0.910 | 0.476
25 | 0.7249 (±0.0247/√100) | 💬 cyberagent/calm3-22b-chat | 136.8 (±46.7) | 0.813 | 0.907 | 0.455
26 | 0.7217 (±0.0219/√100) | 🟢 cyberagent/calm3-22b-chat | 105.0 (±13.1) | 0.824 | 0.916 | 0.425
27 | 0.7194 (±0.0321/√10) | 💬 google/text-bison | 77.6 (±31.9) | 0.790 | 0.968 | 0.401
28 | 0.7185 (±0.0000/√1) | 💬 elyza/Llama-3-ELYZA-JP-70B | 98.6 (±33.8) | 0.837 | 0.931 | 0.388
29 | 0.7175 (±0.0257/√100) | 🟢 nvidia/nemotron-4-340b-instruct | 107.3 (±28.4) | 0.816 | 0.908 | 0.429
30 | 0.7046 (±0.0248/√100) | 💬 nvidia/nemotron-4-340b-instruct | 94.5 (±39.1) | 0.768 | 0.910 | 0.435
31 | 0.7024 (±0.0238/√100) | 🟢 rinna/nekomata-14b | 104.3 (±18.0) | 0.812 | 0.912 | 0.383
32 | 0.7008 (±0.0318/√100) | 🟢 tokyotech-llm/Swallow-13b-instruct-hf | 104.5 (±13.0) | 0.812 | 0.898 | 0.392
33 | 0.6990 (±0.0288/√100) | 🟢 tokyotech-llm/Swallow-13b-NVE-hf | 106.2 (±19.2) | 0.820 | 0.906 | 0.371
34 | 0.6945 (±0.0300/√100) | 🟢 sbintuitions/sarashina2-13b | 107.8 (±28.3) | 0.794 | 0.900 | 0.390
35 | 0.6938 (±0.0217/√100) | 🟢 weblab-GENIAC/Tanuki-8B-dpo-v1.0 | 111.5 (±22.8) | 0.800 | 0.893 | 0.389
36 | 0.6891 (±0.0255/√100) | 🟢 tokyotech-llm/Swallow-13b-hf | 104.8 (±17.7) | 0.811 | 0.901 | 0.355
37 | 0.6842 (±0.0171/√5) | 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... | 92.9 (±20.0) | 0.804 | 0.932 | 0.317
38 | 0.6794 (±0.0243/√100) | 🟢 cyberagent/Llama-3.1-70B-Japanese-Ins... | 128.8 (±72.2) | 0.764 | 0.883 | 0.391
39 | 0.6759 (±0.0232/√10) | 🟢 meta-llama/Meta-Llama-3.1-405B | 101.2 (±15.1) | 0.767 | 0.892 | 0.368
40 | 0.6745 (±0.0152/√10) | 💬 google/gemini-1.5-pro-001 | 52.4 (±15.0) | 0.666 | 0.980 | 0.377
41 | 0.6737 (±0.0276/√100) | 🟢 sbintuitions/sarashina1-13b | 105.4 (±23.4) | 0.775 | 0.882 | 0.364
42 | 0.6697 (±0.0277/√100) | 🟢 nvidia/nemotron-4-340b-base | 106.9 (±26.5) | 0.768 | 0.884 | 0.357
43 | 0.6677 (±0.0250/√100) | 🟢 llm-jp/llm-jp-3-13b | 101.1 (±9.7) | 0.770 | 0.884 | 0.349
44 | 0.6673 (±0.0225/√100) | 🟢 sbintuitions/sarashina1-65b | 104.2 (±20.0) | 0.776 | 0.894 | 0.332
45 | 0.6663 (±0.0262/√100) | 🟢 tokyotech-llm/Swallow-7b-plus-hf | 106.1 (±18.1) | 0.780 | 0.880 | 0.339
46 | 0.6656 (±0.0169/√10) | 💬 google/gemini-1.5-flash-001 | 55.1 (±21.7) | 0.687 | 0.967 | 0.342
47 | 0.6625 (±0.0140/√10) | 💬 anthropic/claude-3-haiku-20240307 | 81.9 (±31.0) | 0.747 | 0.943 | 0.298
48 | 0.6590 (±0.0133/√10) | 💬 google/gemini-2.0-flash-thinking-exp-... | 49.8 (±11.0) | 0.639 | 0.984 | 0.354
49 | 0.6473 (±0.0182/√100) | 💬 Qwen/Qwen2-72B-Instruct | 108.7 (±24.8) | 0.703 | 0.853 | 0.386
50 | 0.6456 (±0.0255/√100) | 🟢 sbintuitions/sarashina2-7b | 105.6 (±22.8) | 0.746 | 0.874 | 0.316
51 | 0.6445 (±0.0241/√100) | 🟢 tokyotech-llm/Llama-3-Swallow-8B-v0.1 | 110.3 (±28.4) | 0.748 | 0.867 | 0.319
52 | 0.6368 (±0.0207/√100) | 🟢 tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 | 105.5 (±21.0) | 0.753 | 0.870 | 0.287
53 | 0.6350 (±0.0260/√100) | 🟢 karakuri-ai/karakuri-lm-8x7b-instruct... | 104.0 (±16.9) | 0.755 | 0.863 | 0.287
54 | 0.6337 (±0.0265/√100) | 🟢 tokyotech-llm/Swallow-7b-hf | 106.5 (±18.7) | 0.746 | 0.866 | 0.289
55 | 0.6335 (±0.0252/√100) | 🟢 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 | 103.2 (±16.6) | 0.766 | 0.872 | 0.263
56 | 0.6318 (±0.0264/√100) | 🟢 tokyotech-llm/Llama-3-Swallow-70B-Ins... | 119.2 (±74.3) | 0.724 | 0.861 | 0.311
57 | 0.6303 (±0.0252/√100) | 🟢 cyberagent/calm2-7b-chat-dpo-experime... | 110.0 (±24.3) | 0.735 | 0.863 | 0.293
58 | 0.6285 (±0.0239/√100) | 🟢 pfnet/nekomata-14b-pfn-qfin-inst-merge | 124.7 (±47.2) | 0.725 | 0.866 | 0.295
59 | 0.6279 (±0.0252/√100) | 🟢 tokyotech-llm/Swallow-7b-NVE-hf | 108.1 (±24.5) | 0.747 | 0.870 | 0.267
60 | 0.6274 (±0.0772/√100) | 🟢 rinna/nekomata-14b-instruction | 98.3 (±24.2) | 0.732 | 0.855 | 0.295
61 | 0.6267 (±0.0263/√100) | 🟢 sbintuitions/sarashina1-7b | 106.7 (±25.1) | 0.737 | 0.866 | 0.276
62 | 0.6252 (±0.0246/√100) | 🟢 karakuri-ai/karakuri-lm-70b-v0.1 | 106.0 (±27.0) | 0.713 | 0.852 | 0.310
63 | 0.6214 (±0.0063/√10) | 💬 google/gemini-1.0-pro-001 | 47.4 (±15.2) | 0.635 | 0.976 | 0.254
64 | 0.6202 (±0.0251/√100) | 🟢 stabilityai/japanese-stablelm-base-be... | 107.3 (±19.2) | 0.733 | 0.848 | 0.280
65 | 0.6197 (±0.0258/√100) | 🟢 stockmark/stockmark-13b | 108.9 (±49.3) | 0.727 | 0.860 | 0.272
66 | 0.6191 (±0.0284/√100) | 🟢 stockmark/stockmark-13b-instruct | 108.0 (±46.8) | 0.720 | 0.859 | 0.278
67 | 0.6178 (±0.0230/√100) | 🟢 karakuri-ai/karakuri-lm-70b-chat-v0.1 | 104.7 (±27.5) | 0.706 | 0.842 | 0.306
68 | 0.6176 (±0.0249/√100) | 🟢 tokyotech-llm/Swallow-7b-instruct-hf | 106.3 (±17.8) | 0.716 | 0.851 | 0.285
69 | 0.6136 (±0.0143/√10) | 💬 openai/gpt-35-turbo | 64.0 (±22.2) | 0.658 | 0.944 | 0.239
70 | 0.6095 (±0.0225/√100) | 💬 rinna/llama-3-youko-70b-instruct | 135.3 (±46.8) | 0.683 | 0.817 | 0.328
71 | 0.6091 (±0.0277/√100) | 🟢 pfnet/nekomata-14b-pfn-qfin | 85.1 (±28.4) | 0.672 | 0.893 | 0.262
72 | 0.6087 (±0.1545/√100) | 💬 tokyotech-llm/Swallow-70b-NVE-instruc... | 135.7 (±74.0) | 0.678 | 0.804 | 0.344
73 | 0.6060 (±0.0238/√100) | 🟢 Qwen/Qwen2-72B | 105.5 (±23.5) | 0.703 | 0.836 | 0.279
74 | 0.6037 (±0.0239/√100) | 🟢 tokyotech-llm/Swallow-7b-NVE-instruct-hf | 105.7 (±16.4) | 0.719 | 0.847 | 0.245
75 | 0.6030 (±0.0287/√100) | 💬 karakuri-ai/karakuri-lm-8x7b-instruct... | 197.4 (±72.1) | 0.703 | 0.832 | 0.274
76 | 0.6029 (±0.0223/√100) | 🟢 Qwen/Qwen2-72B-Instruct | 106.0 (±26.7) | 0.684 | 0.825 | 0.299
77 | 0.5987 (±0.0264/√100) | 🟢 cyberagent/calm2-7b-chat | 107.5 (±20.8) | 0.701 | 0.843 | 0.253
78 | 0.5971 (±0.0235/√100) | 🟢 stockmark/stockmark-100b | 107.2 (±24.7) | 0.709 | 0.842 | 0.240
79 | 0.5945 (±0.1370/√100) | 💬 tokyotech-llm/Swallow-13b-instruct-hf | 167.3 (±116.4) | 0.670 | 0.790 | 0.323
80 | 0.5921 (±0.0211/√100) | 🟢 elyza/Llama-3-ELYZA-JP-8B | 115.6 (±44.8) | 0.685 | 0.831 | 0.260
81 | 0.5832 (±0.0220/√100) | 🟢 augmxnt/shisa-gamma-7b-v1 | 106.7 (±21.8) | 0.706 | 0.831 | 0.213
82 | 0.5825 (±0.0249/√100) | 🟢 tokyotech-llm/Swallow-MS-7b-v0.1 | 106.4 (±25.9) | 0.702 | 0.828 | 0.218
83 | 0.5811 (±0.0218/√100) | 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... | 103.6 (±15.6) | 0.675 | 0.816 | 0.252
84 | 0.5808 (±0.0220/√100) | 🟢 stabilityai/japanese-stablelm-base-ga... | 106.9 (±17.2) | 0.690 | 0.822 | 0.230
85 | 0.5783 (±0.0217/√100) | 🟢 microsoft/Phi-3-medium-4k-instruct | 105.9 (±20.0) | 0.675 | 0.826 | 0.234
86 | 0.5777 (±0.0228/√100) | 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... | 105.2 (±14.5) | 0.675 | 0.811 | 0.247
87 | 0.5754 (±0.0182/√100) | 🟢 Xwin-LM/Xwin-LM-70B-V0.1 | 105.4 (±26.8) | 0.681 | 0.833 | 0.213
88 | 0.5737 (±0.0209/√100) | 🟢 microsoft/Phi-3-medium-128k-instruct | 107.7 (±24.7) | 0.674 | 0.825 | 0.223
89 | 0.5735 (±0.0216/√100) | 🟢 google/gemma-2-9b-it | 95.9 (±22.0) | 0.674 | 0.837 | 0.209
90 | 0.5734 (±0.1980/√100) | 💬 tokyotech-llm/Swallow-70b-instruct-hf | 130.9 (±105.0) | 0.636 | 0.758 | 0.326
91 | 0.5724 (±0.0209/√100) | 🟢 rinna/llama-3-youko-70b | 104.6 (±20.6) | 0.681 | 0.826 | 0.210
92 | 0.5716 (±0.0230/√100) | 🟢 sbintuitions/sarashina2.1-1b | 116.9 (±41.3) | 0.668 | 0.821 | 0.226
93 | 0.5712 (±0.0194/√100) | 💬 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 | 244.4 (±49.3) | 0.678 | 0.816 | 0.220
94 | 0.5710 (±0.0226/√100) | 🟢 rinna/llama-3-youko-8b-instruct | 111.6 (±23.4) | 0.672 | 0.809 | 0.232
95 | 0.5659 (±0.0234/√100) | 🟢 meta-llama/Meta-Llama-3.1-70B | 103.7 (±20.1) | 0.665 | 0.822 | 0.211
96 | 0.5656 (±0.0226/√100) | 💬 meta-llama/Meta-Llama-3-70B-Instruct | 110.2 (±36.4) | 0.665 | 0.777 | 0.254
97 | 0.5646 (±0.0240/√100) | 💬 microsoft/Phi-3-medium-4k-instruct | 131.3 (±50.6) | 0.633 | 0.807 | 0.253
98 | 0.5642 (±0.0261/√100) | 🟢 stabilityai/japanese-stablelm-instruc... | 105.1 (±19.5) | 0.646 | 0.799 | 0.247
99 | 0.5620 (±0.0254/√100) | 🟢 meta-llama/Meta-Llama-3-70B | 102.0 (±17.2) | 0.664 | 0.809 | 0.213
100 | 0.5588 (±0.0230/√100) | 🟢 stabilityai/japanese-stablelm-instruc... | 105.6 (±17.0) | 0.673 | 0.812 | 0.191
101 | 0.5574 (±0.0216/√100) | 🟢 rinna/nekomata-7b | 108.4 (±18.0) | 0.678 | 0.816 | 0.178
102 | 0.5569 (±0.0244/√100) | 🟢 rinna/llama-3-youko-8b | 104.9 (±17.0) | 0.670 | 0.813 | 0.188
103 | 0.5568 (±0.0200/√100) | 🟢 meta-llama/Meta-Llama-3-70B-Instruct | 111.8 (±55.9) | 0.655 | 0.780 | 0.236
104 | 0.5562 (±0.0952/√100) | 💬 stockmark/stockmark-13b-instruct | 137.2 (±89.6) | 0.633 | 0.798 | 0.238
105 | 0.5537 (±0.0204/√100) | 🟢 tokyotech-llm/Llama-3-Swallow-8B-Inst... | 114.4 (±48.5) | 0.657 | 0.812 | 0.192
106 | 0.5516 (±0.1016/√100) | 💬 cyberagent/calm2-7b-chat-dpo-experime... | 181.1 (±120.1) | 0.644 | 0.775 | 0.236
107 | 0.5511 (±0.0203/√100) | 🟢 google/gemma-2-27b-it | 110.3 (±56.8) | 0.599 | 0.836 | 0.218
108 | 0.5500 (±0.0605/√100) | 💬 tokyotech-llm/Llama-3-Swallow-70B-Ins... | 156.5 (±106.5) | 0.633 | 0.780 | 0.237
109 | 0.5500 (±0.0467/√100) | 💬 tokyotech-llm/Swallow-7b-instruct-hf | 121.9 (±77.3) | 0.612 | 0.812 | 0.225
110 | 0.5437 (±0.0218/√100) | 💬 Xwin-LM/Xwin-LM-70B-V0.1 | 200.7 (±63.1) | 0.652 | 0.782 | 0.198
111 | 0.5436 (±0.0246/√100) | 🟢 llm-jp/llm-jp-3-3.7b | 101.3 (±10.4) | 0.646 | 0.795 | 0.189
112 | 0.5432 (±0.0208/√100) | 💬 CohereForAI/c4ai-command-r-plus | 48.9 (±16.5) | 0.505 | 0.931 | 0.194
113 | 0.5429 (±0.0238/√100) | 🟢 meta-llama/Meta-Llama-3.1-70B-Instruct | 157.6 (±221.7) | 0.636 | 0.770 | 0.222
114 | 0.5387 (±0.0269/√100) | 💬 rinna/llama-3-youko-8b-instruct | 265.4 (±104.1) | 0.635 | 0.771 | 0.210
115 | 0.5386 (±0.0215/√100) | 💬 microsoft/Phi-3-medium-128k-instruct | 91.9 (±44.7) | 0.589 | 0.834 | 0.193
116 | 0.5377 (±0.0481/√100) | 💬 meta-llama/Meta-Llama-3.1-70B-Instruct | 135.8 (±194.8) | 0.617 | 0.779 | 0.218
117 | 0.5349 (±0.0203/√100) | 💬 google/gemma-2-27b-it | 74.7 (±42.7) | 0.545 | 0.874 | 0.186
118 | 0.5347 (±0.0188/√100) | 🟢 rinna/youri-7b | 107.6 (±16.3) | 0.654 | 0.802 | 0.148
119 | 0.5316 (±0.0273/√100) | 💬 lightblue/karasu-7B-chat | 111.8 (±46.5) | 0.621 | 0.800 | 0.174
120 | 0.5301 (±0.0476/√100) | 💬 lightblue/karasu-7B-chat-plus | 107.1 (±46.7) | 0.615 | 0.798 | 0.178
121 | 0.5283 (±0.0585/√100) | 💬 lightblue/karasu-7B-chat-plus-unleashed | 104.6 (±45.3) | 0.614 | 0.794 | 0.177
122 | 0.5179 (±0.0264/√100) | 🟢 cyberagent/calm2-7b | 106.0 (±26.2) | 0.601 | 0.770 | 0.182
123 | 0.5164 (±0.0209/√100) | 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... | 109.3 (±33.5) | 0.606 | 0.788 | 0.155
124 | 0.5143 (±0.0212/√100) | 🟢 llm-jp/llm-jp-13b-v2.0 | 104.1 (±11.2) | 0.604 | 0.760 | 0.180
125 | 0.5143 (±0.0170/√100) | 🟢 moneyforward/houou-instruction-7b-v3 | 112.2 (±37.8) | 0.629 | 0.778 | 0.135
126 | 0.5085 (±0.0160/√100) | 🟢 moneyforward/houou-instruction-7b-v1 | 105.9 (±41.0) | 0.617 | 0.781 | 0.128
127 | 0.5080 (±0.0306/√100) | 💬 stabilityai/japanese-stablelm-instruc... | 111.3 (±58.3) | 0.548 | 0.782 | 0.195
128 | 0.5073 (±0.0208/√100) | 💬 Qwen/Qwen2-57B-A14B-Instruct | 154.8 (±89.5) | 0.615 | 0.734 | 0.173
129 | 0.5045 (±0.0208/√100) | 🟢 Qwen/Qwen2-57B-A14B | 106.7 (±22.5) | 0.617 | 0.757 | 0.139
130 | 0.5041 (±0.0225/√100) | 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... | 106.2 (±29.3) | 0.579 | 0.778 | 0.155
131 | 0.5022 (±0.0221/√100) | 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... | 95.0 (±36.2) | 0.579 | 0.795 | 0.132
132 | 0.5013 (±0.0196/√100) | 🟢 google/gemma-2-9b | 107.3 (±26.0) | 0.595 | 0.761 | 0.148
133 | 0.5013 (±0.0375/√100) | 💬 karakuri-ai/karakuri-lm-70b-chat-v0.1 | 427.4 (±151.5) | 0.579 | 0.723 | 0.202
134 | 0.5002 (±0.0218/√100) | 🟢 Qwen/Qwen-72B-Chat | 223.0 (±258.3) | 0.614 | 0.716 | 0.171
135 | 0.4995 (±0.0211/√100) | 💬 Qwen/Qwen1.5-72B-Chat | 119.3 (±58.1) | 0.582 | 0.708 | 0.208
136 | 0.4963 (±0.0189/√100) | 🟢 Qwen/Qwen1.5-72B-Chat | 128.1 (±77.7) | 0.586 | 0.698 | 0.206
137 | 0.4959 (±0.0235/√100) | 🟢 llm-jp/llm-jp-13b-v1.0 | 115.0 (±40.9) | 0.576 | 0.756 | 0.156
138 | 0.4953 (±0.0203/√100) | 🟢 meta-llama/Llama-2-70b-hf | 110.4 (±25.8) | 0.596 | 0.745 | 0.145
139 | 0.4949 (±0.0177/√100) | 💬 moneyforward/houou-instruction-7b-v1 | 180.5 (±66.6) | 0.604 | 0.734 | 0.146
140 | 0.4931 (±0.0247/√100) | 🟢 Rakuten/RakutenAI-7B-instruct | 105.6 (±33.1) | 0.598 | 0.750 | 0.132
141 | 0.4921 (±0.0219/√100) | 🟢 Rakuten/RakutenAI-7B-chat | 114.9 (±44.7) | 0.592 | 0.760 | 0.124
142 | 0.4916 (±0.0201/√100) | 🟢 moneyforward/houou-instruction-7b-v2 | 104.7 (±41.2) | 0.588 | 0.770 | 0.116
143 | 0.4895 (±0.0440/√100) | 💬 llm-jp/llm-jp-13b-instruct-full-dolly... | 268.1 (±133.1) | 0.548 | 0.722 | 0.199
144 | 0.4872 (±0.0237/√100) | 🟢 lightblue/karasu-7B | 110.1 (±19.0) | 0.586 | 0.739 | 0.137
145 | 0.4870 (±0.0215/√100) | 🟢 Qwen/Qwen-72B | 134.6 (±114.6) | 0.593 | 0.715 | 0.152
146 | 0.4868 (±0.0163/√100) | 💬 google/gemma-2-9b-it | 47.6 (±14.6) | 0.477 | 0.880 | 0.104
147 | 0.4863 (±0.1167/√100) | 💬 pfnet/nekomata-14b-pfn-qfin-inst-merge | 93.4 (±55.0) | 0.544 | 0.721 | 0.194
148 | 0.4862 (±0.0221/√100) | 🟢 Qwen/Qwen2-57B-A14B-Instruct | 116.9 (±82.5) | 0.601 | 0.734 | 0.124
149 | 0.4857 (±0.0168/√100) | 💬 moneyforward/houou-instruction-7b-v2 | 207.0 (±57.3) | 0.591 | 0.719 | 0.147
150 | 0.4829 (±0.0211/√100) | 🟢 Qwen/Qwen1.5-72B | 136.2 (±85.6) | 0.591 | 0.705 | 0.153
151 | 0.4827 (±0.0464/√100) | 💬 llm-jp/llm-jp-13b-instruct-full-ac_00... | 269.1 (±131.5) | 0.542 | 0.716 | 0.191
152 | 0.4762 (±0.0810/√100) | 💬 stabilityai/japanese-stablelm-instruc... | 126.2 (±67.4) | 0.545 | 0.726 | 0.158
153 | 0.4746 (±0.0210/√100) | 🟢 rinna/youri-7b-chat | 102.1 (±16.4) | 0.571 | 0.752 | 0.100
154 | 0.4744 (±0.0227/√100) | 🟢 pfnet/plamo-13b | 108.2 (±28.5) | 0.558 | 0.749 | 0.116
155 | 0.4743 (±0.0987/√100) | 💬 tokyotech-llm/Swallow-7b-NVE-instruct-hf | 129.0 (±72.8) | 0.535 | 0.725 | 0.163
156 | 0.4730 (±0.0166/√100) | 🟢 Xwin-LM/Xwin-LM-13B-V0.2 | 109.7 (±27.4) | 0.582 | 0.723 | 0.114
157 | 0.4723 (±0.0204/√100) | 💬 Rakuten/RakutenAI-7B-chat | 233.0 (±133.0) | 0.565 | 0.734 | 0.118
158 | 0.4723 (±0.0808/√100) | 💬 tokyotech-llm/Llama-3-Swallow-8B-Inst... | 199.3 (±155.6) | 0.563 | 0.699 | 0.154
159 | 0.4698 (±0.0200/√100) | 🟢 Rakuten/RakutenAI-7B | 105.4 (±25.6) | 0.576 | 0.721 | 0.113
160 | 0.4692 (±0.0161/√100) | 🟢 shisa-ai/shisa-v1-qwen2-7b | 109.0 (±23.9) | 0.563 | 0.712 | 0.133
161 | 0.4661 (±0.0210/√100) | 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... | 111.6 (±44.2) | 0.536 | 0.756 | 0.106
162 | 0.4659 (±0.0438/√100) | 💬 deepseek-ai/deepseek-llm-67b-chat | 146.0 (±62.1) | 0.555 | 0.703 | 0.139
163 | 0.4659 (±0.0202/√100) | 🟢 llm-jp/llm-jp-3-1.8b | 105.0 (±16.9) | 0.568 | 0.725 | 0.105
164 | 0.4648 (±0.1659/√100) | 💬 cyberagent/calm2-7b-chat | 124.7 (±95.9) | 0.536 | 0.688 | 0.171
165 | 0.4622 (±0.0195/√100) | 🟢 Qwen/Qwen-14B-Chat | 135.5 (±84.3) | 0.572 | 0.718 | 0.097
166 | 0.4619 (±0.0162/√100) | 💬 lmsys/vicuna-13b-v1.5-16k | 126.5 (±48.4) | 0.574 | 0.715 | 0.097
167 | 0.4609 (±0.0113/√10) | 🟢 google/gemma-2-2b-jpn-it | 69.4 (±24.1) | 0.509 | 0.805 | 0.069
168 | 0.4607 (±0.0165/√100) | 🟢 SakanaAI/EvoLLM-JP-v1-7B | 111.2 (±30.4) | 0.579 | 0.708 | 0.095
169 | 0.4601 (±0.0184/√100) | 🟢 shisa-ai/shisa-v1-llama3-8b | 112.9 (±31.4) | 0.557 | 0.703 | 0.120
170 | 0.4597 (±0.0268/√100) | 🟢 CohereForAI/c4ai-command-r-v01 | 179.2 (±166.3) | 0.590 | 0.592 | 0.197
171 | 0.4586 (±0.0141/√100) | 🟢 google/gemma-2-2b-it | 88.2 (±30.8) | 0.536 | 0.761 | 0.079
172 | 0.4561 (±0.0202/√100) | 🟢 pfnet/plamo-13b-instruct | 144.0 (±147.7) | 0.532 | 0.763 | 0.073
173 | 0.4559 (±0.0201/√100) | 🟢 pfnet/plamo-13b-instruct-nc | 156.0 (±183.1) | 0.523 | 0.768 | 0.077
174 | 0.4558 (±0.0156/√100) | 🟢 rinna/japanese-gpt-neox-3.6b-instruct... | 75.3 (±26.6) | 0.488 | 0.804 | 0.076
175 | 0.4543 (±0.0217/√100) | 🟢 rinna/youri-7b-instruction | 96.2 (±29.5) | 0.530 | 0.743 | 0.090
176 | 0.4535 (±0.0348/√100) | 💬 Rakuten/RakutenAI-7B-instruct | 128.6 (±83.2) | 0.527 | 0.726 | 0.108
177 | 0.4535 (±0.0183/√100) | 🟢 THUDM/glm-4-9b | 110.3 (±36.9) | 0.554 | 0.689 | 0.118
178 | 0.4527 (±0.0146/√100) | 🟢 lmsys/vicuna-13b-v1.5-16k | 107.9 (±25.9) | 0.576 | 0.708 | 0.075
179 | 0.4504 (±0.0224/√100) | 🟢 rinna/nekomata-7b-instruction | 96.4 (±23.7) | 0.528 | 0.734 | 0.089
180 | 0.4486 (±0.0161/√100) | 💬 Qwen/Qwen2-7B-Instruct | 163.6 (±61.4) | 0.547 | 0.688 | 0.111
181 | 0.4484 (±0.0191/√100) | 💬 SakanaAI/EvoLLM-JP-v1-7B | 123.9 (±68.1) | 0.545 | 0.706 | 0.094
182 | 0.4477 (±0.0205/√100) | 🟢 rinna/llama-3-youko-70b-instruct | 130.7 (±95.3) | 0.527 | 0.670 | 0.146
183 | 0.4426 (±0.0204/√100) | 🟢 elyza/ELYZA-japanese-Llama-2-13b-inst... | 111.1 (±28.2) | 0.544 | 0.687 | 0.097
184 | 0.4409 (±0.1064/√100) | 💬 lightblue/karasu-7B | 138.1 (±92.9) | 0.512 | 0.679 | 0.131
185 | 0.4404 (±0.0146/√100) | 🟢 rinna/bilingual-gpt-neox-4b-instructi... | 75.9 (±22.7) | 0.493 | 0.773 | 0.056
186 | 0.4387 (±0.0655/√100) | 💬 Qwen/Qwen-72B-Chat | 117.7 (±137.1) | 0.541 | 0.632 | 0.143
187 | 0.4385 (±0.0285/√100) | 💬 rinna/youri-7b-chat | 95.4 (±41.1) | 0.500 | 0.733 | 0.083
188 | 0.4377 (±0.0107/√100) | 🟢 google/gemma-1.1-7b-it | 86.8 (±21.4) | 0.509 | 0.732 | 0.072
189 | 0.4374 (±0.0217/√100) | 🟢 Qwen/Qwen1.5-32B-Chat | 127.0 (±57.0) | 0.538 | 0.642 | 0.133
190 | 0.4336 (±0.0168/√100) | 🟢 stabilityai/japanese-stablelm-base-be... | 107.1 (±17.2) | 0.539 | 0.689 | 0.073
191 | 0.4335 (±0.0221/√100) | 🟢 Qwen/Qwen-14B | 118.1 (±71.6) | 0.530 | 0.675 | 0.096
192 | 0.4332 (±0.0164/√100) | 🟢 Qwen/Qwen2-7B-Instruct | 119.1 (±45.7) | 0.531 | 0.670 | 0.098
193 | 0.4330 (±0.0149/√100) | 💬 google/gemma-2-2b-it | 56.0 (±27.8) | 0.445 | 0.788 | 0.066
194 | 0.4320 (±0.0171/√100) | 🟢 Qwen/Qwen2-7B | 109.1 (±40.1) | 0.532 | 0.671 | 0.093
195 | 0.4296 (±0.0322/√100) | 💬 Qwen/Qwen-14B-Chat | 159.0 (±69.7) | 0.522 | 0.675 | 0.092
196 | 0.4295 (±0.0157/√100) | 🟢 elyza/ELYZA-japanese-Llama-2-7b-instruct | 111.5 (±31.4) | 0.530 | 0.676 | 0.083
197 | 0.4292 (±0.0181/√100) | 💬 Xwin-LM/Xwin-LM-13B-V0.2 | 240.7 (±48.4) | 0.533 | 0.670 | 0.085
198 | 0.4282 (±0.0193/√100) | 🟢 stabilityai/japanese-stablelm-3b-4e1t... | 110.8 (±26.0) | 0.518 | 0.688 | 0.078
199 | 0.4272 (±0.0273/√100) | 🟢 mistralai/Mistral-Nemo-Instruct-2407 | 155.8 (±132.8) | 0.548 | 0.611 | 0.122
200 | 0.4265 (±0.0115/√100) | 💬 google/gemma-1.1-7b-it | 78.7 (±28.4) | 0.475 | 0.739 | 0.066
201 | 0.4256 (±0.0270/√100) | 🟢 rinna/japanese-gpt-neox-3.6b | 129.8 (±73.4) | 0.485 | 0.685 | 0.106
202 | 0.4228 (±0.0185/√100) | 🟢 stabilityai/japanese-stablelm-base-ja... | 110.4 (±28.6) | 0.528 | 0.668 | 0.073
203 | 0.4222 (±0.0138/√100) | 🟢 Xwin-LM/Xwin-LM-7B-V0.2 | 110.6 (±29.3) | 0.520 | 0.677 | 0.070
204 | 0.4220 (±0.0185/√100) | 🟢 lmsys/vicuna-7b-v1.5-16k | 111.8 (±31.8) | 0.522 | 0.670 | 0.074
205 | 0.4207 (±0.0189/√100) | 🟢 stabilityai/japanese-stablelm-3b-4e1t... | 112.8 (±27.0) | 0.507 | 0.683 | 0.072
206 | 0.4201 (±0.0177/√100) | 💬 lmsys/vicuna-7b-v1.5-16k | 128.1 (±52.5) | 0.514 | 0.668 | 0.078
207 | 0.4164 (±0.0244/√100) | 🟢 google/gemma-7b | 135.5 (±132.3) | 0.533 | 0.631 | 0.085
208 | 0.4150 (±0.0212/√100) | 💬 Qwen/Qwen1.5-32B-Chat | 125.7 (±250.5) | 0.496 | 0.620 | 0.130
209 | 0.4149 (±0.0375/√100) | 💬 llm-jp/llm-jp-13b-instruct-full-dolly... | 186.6 (±108.4) | 0.469 | 0.685 | 0.090
210 | 0.4144 (±0.0149/√100) | 💬 01-ai/Yi-1.5-34B-Chat | 170.6 (±47.1) | 0.514 | 0.628 | 0.101
211 | 0.4140 (±0.0208/√100) | 🟢 meta-llama/Meta-Llama-3-8B-Instruct | 116.8 (±44.3) | 0.523 | 0.637 | 0.082
212 | 0.4125 (±0.0303/√100) | 💬 CohereForAI/c4ai-command-r-v01 | 137.7 (±324.6) | 0.519 | 0.562 | 0.157
213 | 0.4122 (±0.0199/√100) | 🟢 rinna/bilingual-gpt-neox-4b | 121.0 (±43.6) | 0.485 | 0.660 | 0.092
214 | 0.4097 (±0.0187/√100) | 🟢 meta-llama/Meta-Llama-3.1-8B | 108.7 (±35.4) | 0.512 | 0.650 | 0.068
215 | 0.4087 (±0.0201/√100) | 🟢 meta-llama/Llama-2-70b-chat-hf | 161.3 (±140.8) | 0.519 | 0.608 | 0.099
216 | 0.4087 (±0.0146/√100) | 🟢 microsoft/Phi-3-small-8k-instruct | 109.1 (±24.1) | 0.514 | 0.644 | 0.068
217 | 0.4076 (±0.0142/√100) | 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast-... | 109.0 (±32.9) | 0.503 | 0.644 | 0.076
218 | 0.4074 (±0.0207/√100) | 💬 elyza/ELYZA-japanese-Llama-2-13b-inst... | 156.6 (±65.9) | 0.490 | 0.646 | 0.086
219 | 0.4073 (±0.0175/√100) | 🟢 stabilityai/japanese-stablelm-instruc... | 110.0 (±26.5) | 0.490 | 0.663 | 0.070
220 | 0.4058 (±0.0295/√100) | 💬 rinna/youri-7b-instruction | 97.0 (±57.0) | 0.439 | 0.713 | 0.065
221 | 0.4050 (±0.0191/√100) | 🟢 mistralai/Mixtral-8x22B-v0.1 | 115.6 (±55.4) | 0.517 | 0.615 | 0.084
222 | 0.4048 (±0.0175/√100) | 🟢 meta-llama/Meta-Llama-3-8B | 109.0 (±19.8) | 0.505 | 0.641 | 0.068
223 | 0.4045 (±0.0186/√100) | 🟢 rinna/japanese-gpt-neox-3.6b-instruct... | 133.1 (±57.4) | 0.475 | 0.678 | 0.061
224 | 0.4042 (±0.0131/√100) | 🟢 microsoft/Orca-2-13b | 115.5 (±42.6) | 0.510 | 0.630 | 0.073
225 | 0.4041 (±0.0218/√100) | 💬 meta-llama/Meta-Llama-3-8B-Instruct | 131.4 (±88.3) | 0.508 | 0.614 | 0.090
226 | 0.4035 (±0.0151/√100) | 🟢 SakanaAI/EvoLLM-JP-A-v1-7B | 110.4 (±31.3) | 0.508 | 0.633 | 0.069
227 | 0.4033 (±0.0164/√100) | 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast... | 107.2 (±28.5) | 0.495 | 0.643 | 0.072
228 | 0.4032 (±0.0237/√100) | 🟢 Qwen/Qwen1.5-32B | 150.3 (±104.8) | 0.505 | 0.605 | 0.100
229 | 0.4024 (±0.0187/√100) | 🟢 01-ai/Yi-1.5-34B | 109.9 (±28.2) | 0.493 | 0.631 | 0.083
230 | 0.4011 (±0.0236/√100) | 🟢 cyberagent/open-calm-7b | 143.8 (±97.0) | 0.472 | 0.641 | 0.091
231 | 0.4006 (±0.0166/√100) | 💬 microsoft/Phi-3-small-8k-instruct | 189.7 (±84.1) | 0.500 | 0.630 | 0.073
232 | 0.4001 (±0.0199/√100) | 🟢 rinna/japanese-gpt-neox-3.6b-instruct... | 117.6 (±48.9) | 0.464 | 0.684 | 0.052
233 | 0.3985 (±0.0161/√100) | 🟢 elyza/ELYZA-japanese-Llama-2-13b | 138.4 (±51.8) | 0.493 | 0.634 | 0.069
234 | 0.3960 (±0.0199/√100) | 🟢 line-corporation/japanese-large-lm-1.7b | 179.2 (±174.5) | 0.474 | 0.650 | 0.065
235 | 0.3949 (±0.0193/√100) | 💬 meta-llama/Meta-Llama-3.1-8B-Instruct | 216.6 (±345.2) | 0.487 | 0.624 | 0.074
236 | 0.3948 (±0.0190/√100) | 💬 Qwen/Qwen1.5-14B-Chat | 127.9 (±50.6) | 0.500 | 0.604 | 0.080
237 | 0.3946 (±0.0201/√100) | 🟢 Qwen/Qwen1.5-14B | 130.9 (±67.8) | 0.509 | 0.609 | 0.066
238 | 0.3934 (±0.0201/√100) | 🟢 stabilityai/japanese-stablelm-instruc... | 107.8 (±38.0) | 0.466 | 0.648 | 0.066
239 | 0.3914 (±0.0172/√100) | 🟢 mistralai/Mixtral-8x7B-Instruct-v0.1 | 95.1 (±25.2) | 0.488 | 0.636 | 0.050
240 | 0.3863 (±0.0160/√100) | 🟢 Qwen/Qwen1.5-14B-Chat | 131.4 (±55.8) | 0.491 | 0.593 | 0.075
241 | 0.3837 (±0.0188/√100) | 🟢 rinna/bilingual-gpt-neox-4b-instructi... | 117.4 (±42.4) | 0.462 | 0.649 | 0.041
242 | 0.3823 (±0.0645/√100) | 💬 mistralai/Mistral-Nemo-Instruct-2407 | 157.9 (±140.3) | 0.484 | 0.563 | 0.100
243 | 0.3822 (±0.0647/√100) | 💬 llm-jp/llm-jp-13b-instruct-full-dolly... | 97.6 (±76.2) | 0.397 | 0.664 | 0.086
244 | 0.3819 (±0.0265/√100) | 🟢 google/gemma-2-27b | 214.2 (±183.3) | 0.450 | 0.608 | 0.087
245 | 0.3804 (±0.0161/√100) | 🟢 Qwen/Qwen-7B-Chat | 140.8 (±65.1) | 0.485 | 0.612 | 0.045
246 | 0.3803 (±0.0249/√100) | 💬 elyza/ELYZA-japanese-Llama-2-7b-instruct | 136.4 (±70.7) | 0.452 | 0.619 | 0.070
247 | 0.3772 (±0.0162/√100) | 💬 microsoft/Phi-3-small-128k-instruct | 199.7 (±111.9) | 0.473 | 0.590 | 0.069
248 | 0.3760 (±0.0236/√100) | 🟢 cyberagent/open-calm-3b | 123.2 (±79.0) | 0.442 | 0.624 | 0.062
249 | 0.3759 (±0.0149/√100) | 🟢 lmsys/longchat-7b-v1.5-32k | 116.9 (±31.6) | 0.474 | 0.609 | 0.045
250 | 0.3740 (±0.0164/√100) | 🟢 meta-llama/Llama-2-13b-hf | 108.5 (±21.8) | 0.474 | 0.603 | 0.045
251 | 0.3737 (±0.0197/√100) | 🟢 meta-llama/Meta-Llama-3.1-8B-Instruct | 204.5 (±303.4) | 0.478 | 0.589 | 0.055
252 | 0.3720 (±0.0622/√100) | 💬 Xwin-LM/Xwin-LM-7B-V0.2 | 205.3 (±79.1) | 0.466 | 0.590 | 0.060
253 | 0.3720 (±0.0157/√100) | 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast | 177.5 (±147.2) | 0.458 | 0.598 | 0.061
254 | 0.3699 (±0.0345/√100) | 💬 Qwen/Qwen-7B-Chat | 182.9 (±110.3) | 0.468 | 0.600 | 0.042
255 | 0.3694 (±0.0103/√100) | 🟢 google/gemma-7b-it | 89.7 (±21.6) | 0.446 | 0.640 | 0.022
256 | 0.3685 (±0.0173/√100) | 🟢 elyza/ELYZA-japanese-Llama-2-7b | 140.0 (±52.8) | 0.462 | 0.596 | 0.047
257 | 0.3673 (±0.0089/√100) | 💬 google/gemma-7b-it | 110.0 (±47.6) | 0.448 | 0.633 | 0.020
258 | 0.3655 (±0.0116/√100) | 🟢 deepseek-ai/deepseek-llm-7b-chat | 113.9 (±24.7) | 0.474 | 0.579 | 0.043
259 | 0.3642 (±0.0165/√100) | 🟢 llm-jp/llm-jp-1.3b-v1.0 | 134.0 (±62.6) | 0.437 | 0.612 | 0.044
260 | 0.3637 (±0.0223/√100) | 🟢 cyberagent/open-calm-large | 122.3 (±73.9) | 0.424 | 0.611 | 0.056
261 | 0.3637 (±0.0152/√100) | 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast | 168.0 (±77.4) | 0.452 | 0.587 | 0.052
262 | 0.3632 (±0.0237/√100) | 💬 elyza/ELYZA-japanese-Llama-2-7b-fast-... | 178.6 (±113.6) | 0.443 | 0.582 | 0.064
263 | 0.3628 (±0.0145/√100) | 🟢 Qwen/Qwen-7B | 117.3 (±39.0) | 0.468 | 0.582 | 0.039
264 | 0.3554 (±0.0178/√100) | 🟢 meta-llama/Llama-2-7b-chat-hf | 139.3 (±93.1) | 0.464 | 0.570 | 0.031
265 | 0.3545 (±0.0445/√100) | 💬 llm-jp/llm-jp-13b-instruct-full-jaste... | 48.8 (±50.1) | 0.283 | 0.723 | 0.058
266 | 0.3543 (±0.0439/√100) | 💬 lmsys/longchat-7b-v1.5-32k | 160.1 (±73.5) | 0.448 | 0.572 | 0.043
267 | 0.3538 (±0.0175/√100) | 🟢 01-ai/Yi-1.5-9B | 113.0 (±29.4) | 0.457 | 0.555 | 0.050
268 | 0.3531 (±0.0159/√100) | 🟢 mistralai/Mixtral-8x7B-v0.1 | 94.3 (±20.8) | 0.450 | 0.573 | 0.037
269 | 0.3514 (±0.0102/√100) | 🟢 google/gemma-1.1-2b-it | 80.4 (±21.6) | 0.404 | 0.625 | 0.025
270 | 0.3495 (±0.0268/√100) | 🟢 cyberagent/open-calm-1b | 141.3 (±110.0) | 0.412 | 0.578 | 0.059
271 | 0.3471 (±0.0131/√100) | 🟢 microsoft/Orca-2-7b | 131.1 (±70.7) | 0.447 | 0.555 | 0.039
272 | 0.3465 (±0.0202/√100) | 💬 deepseek-ai/deepseek-llm-7b-chat | 167.2 (±76.5) | 0.435 | 0.562 | 0.042
273 | 0.3463 (±0.0178/√100) | 💬 mistralai/Mixtral-8x7B-Instruct-v0.1 | 147.1 (±111.8) | 0.448 | 0.548 | 0.043
274 | 0.3449 (±0.0986/√100) | 💬 stabilityai/japanese-stablelm-instruc... | 109.4 (±66.2) | 0.397 | 0.585 | 0.053
275 | 0.3440 (±0.0978/√100) | 💬 stabilityai/japanese-stablelm-3b-4e1t... | 127.8 (±80.5) | 0.401 | 0.576 | 0.055
276 | 0.3436 (±0.0126/√100) | 💬 01-ai/Yi-1.5-9B-Chat | 143.6 (±60.1) | 0.438 | 0.540 | 0.053
277 | 0.3428 (±0.0163/√100) | 🟢 meta-llama/Llama-2-7b-hf | 112.3 (±28.0) | 0.440 | 0.550 | 0.038
278 | 0.3408 (±0.0225/√100) | 🟢 anthracite-org/magnum-32b-v2 | 191.9 (±223.2) | 0.442 | 0.507 | 0.073
279 | 0.3393 (±0.0225/√100) | 🟢 stockmark/gpt-neox-japanese-1.4b | 92.2 (±63.7) | 0.351 | 0.641 | 0.025
280 | 0.3322 (±0.0151/√100) | 🟢 Qwen/Qwen1.5-7B-Chat | 127.7 (±117.0) | 0.431 | 0.520 | 0.045
281 | 0.3315 (±0.0203/√100) | 🟢 Qwen/Qwen1.5-7B | 141.8 (±126.5) | 0.445 | 0.504 | 0.046
282 | 0.3313 (±0.0115/√100) | 🟢 google/gemma-2b-it | 85.9 (±24.7) | 0.393 | 0.577 | 0.024
283 | 0.3293 (±0.0252/√100) | 💬 Qwen/Qwen1.5-7B-Chat | 195.7 (±113.1) | 0.429 | 0.503 | 0.056
284 | 0.3276 (±0.0709/√100) | 💬 elyza/ELYZA-japanese-Llama-2-13b-fast... | 134.0 (±98.8) | 0.395 | 0.543 | 0.045
285 | 0.3272 (±0.0101/√100) | 💬 01-ai/Yi-1.5-6B-Chat | 194.4 (±75.0) | 0.426 | 0.530 | 0.025
286 | 0.3187 (±0.0142/√100) | 🟢 Qwen/Qwen2-1.5B-Instruct | 131.4 (±46.7) | 0.421 | 0.513 | 0.022
287 | 0.3172 (±0.0150/√100) | 🟢 Qwen/Qwen2-1.5B | 120.9 (±30.7) | 0.422 | 0.511 | 0.019
288 | 0.3161 (±0.0119/√100) | 🟢 deepseek-ai/deepseek-llm-7b-base | 113.7 (±21.6) | 0.424 | 0.501 | 0.024
289 | 0.3147 (±0.0175/√100) | 💬 Qwen/Qwen2-1.5B-Instruct | 180.7 (±101.0) | 0.408 | 0.511 | 0.025
290 | 0.3078 (±0.0195/√100) | 🟢 cyberagent/open-calm-medium | 117.3 (±59.4) | 0.363 | 0.537 | 0.024
291 | 0.3058 (±0.1106/√100) | 💬 rinna/nekomata-7b-instruction | 61.2 (±57.0) | 0.307 | 0.567 | 0.043
292 | 0.3053 (±0.0177/√100) | 🟢 google/gemma-2b | 151.5 (±113.6) | 0.410 | 0.480 | 0.026
293 | 0.3050 (±0.0190/√100) | 🟢 Qwen/Qwen1.5-MoE-A2.7B | 146.4 (±90.3) | 0.412 | 0.468 | 0.035
294 | 0.2993 (±0.0095/√100) | 🟢 01-ai/Yi-1.5-6B-Chat | 133.3 (±46.2) | 0.394 | 0.481 | 0.022
295 | 0.2993 (±0.0107/√100) | 🟢 tiiuae/falcon-11B | 121.6 (±31.5) | 0.398 | 0.483 | 0.016
296 | 0.2957 (±0.0641/√100) | 💬 meta-llama/Llama-2-13b-chat-hf | 305.2 (±299.7) | 0.402 | 0.453 | 0.032
297 | 0.2953 (±0.0442/√100) | 🟢 augmxnt/shisa-base-7b-v1 | 200.4 (±160.3) | 0.378 | 0.478 | 0.030
298 | 0.2924 (±0.0506/√100) | 💬 Qwen/Qwen1.5-MoE-A2.7B-Chat | 245.1 (±209.1) | 0.381 | 0.453 | 0.043
299 | 0.2914 (±0.0133/√100) | 🟢 mistralai/Mistral-7B-v0.1 | 117.4 (±40.4) | 0.402 | 0.454 | 0.018
300 | 0.2907 (±0.0175/√100) | 🟢 Qwen/Qwen1.5-MoE-A2.7B-Chat | 149.8 (±91.0) | 0.388 | 0.448 | 0.036
301 | 0.2853 (±0.0163/√100) | 🟢 Qwen/Qwen1.5-4B-Chat | 127.8 (±71.2) | 0.395 | 0.441 | 0.019
302 | 0.2809 (±0.0133/√100) | 🟢 Qwen/Qwen1.5-1.8B-Chat | 178.3 (±92.0) | 0.381 | 0.445 | 0.017
303 | 0.2770 (±0.0131/√100) | 🟢 mistralai/Mistral-7B-Instruct-v0.2 | 146.2 (±70.1) | 0.387 | 0.419 | 0.024
304 | 0.2769 (±0.0324/√100) | 💬 llm-jp/llm-jp-13b-instruct-full-jaste... | 16.9 (±24.6) | 0.125 | 0.693 | 0.013
305 | 0.2769 (±0.1029/√100) | 💬 stabilityai/japanese-stablelm-instruc... | 117.0 (±115.0) | 0.307 | 0.489 | 0.035
306 | 0.2666 (±0.0241/√100) | 🟢 deepseek-ai/deepseek-llm-67b-chat | 140.2 (±83.0) | 0.351 | 0.440 | 0.009
307 | 0.2661 (±0.0128/√100) | 🟢 Qwen/Qwen1.5-1.8B | 129.7 (±65.7) | 0.360 | 0.424 | 0.014
308 | 0.2613 (±0.0136/√100) | 🟢 Qwen/Qwen2-0.5B-Instruct | 176.8 (±98.9) | 0.351 | 0.426 | 0.007
309 | 0.2604 (±0.0148/√100) | 🟢 mistralai/Mistral-7B-Instruct-v0.1 | 139.8 (±101.3) | 0.367 | 0.400 | 0.014
310 | 0.2598 (±0.0129/√100) | 🟢 Qwen/Qwen2-0.5B | 122.7 (±43.5) | 0.350 | 0.420 | 0.009
311 | 0.2581 (±0.0196/√100) | 🟢 cyberagent/open-calm-small | 119.1 (±54.1) | 0.310 | 0.460 | 0.004
312 | 0.2555 (±0.0163/√100) | 🟢 Qwen/Qwen1.5-4B | 149.2 (±76.6) | 0.363 | 0.388 | 0.015
313 | 0.2543 (±0.0266/√100) | 🟢 mosaicml/mpt-30b-chat | 121.3 (±46.4) | 0.327 | 0.428 | 0.008
314 | 0.2414 (±0.0281/√100) | 💬 Qwen/Qwen1.5-1.8B-Chat | 480.0 (±210.3) | 0.329 | 0.392 | 0.003
315 | 0.2394 (±0.0745/√100) | 💬 Qwen/Qwen1.5-4B-Chat | 105.3 (±104.1) | 0.307 | 0.390 | 0.021
316 | 0.2317 (±0.0455/√100) | 💬 mistralai/Mistral-7B-Instruct-v0.1 | 202.3 (±153.9) | 0.320 | 0.362 | 0.012
317 | 0.2231 (±0.0166/√100) | 💬 mistralai/Mistral-7B-Instruct-v0.2 | 261.2 (±166.3) | 0.316 | 0.334 | 0.019
318 | 0.2182 (±0.0152/√100) | 🟢 microsoft/phi-1 | 47.6 (±34.3) | 0.234 | 0.420 | 0.000
319 | 0.2177 (±0.0110/√100) | 🟢 Qwen/Qwen1.5-0.5B-Chat | 143.4 (±52.1) | 0.317 | 0.327 | 0.009
320 | 0.2169 (±0.0561/√100) | 💬 Qwen/Qwen2-0.5B-Instruct | 129.5 (±114.3) | 0.265 | 0.379 | 0.006
321 | 0.2169 (±0.0218/√100) | 🟢 mosaicml/mpt-30b-instruct | 109.8 (±36.1) | 0.274 | 0.370 | 0.008
322 | 0.2146 (±0.0151/√100) | 🟢 microsoft/phi-2 | 78.0 (±31.4) | 0.287 | 0.356 | 0.001
323 | 0.2061 (±0.0820/√100) | 💬 meta-llama/Llama-2-70b-chat-hf | 523.3 (±444.5) | 0.271 | 0.303 | 0.045
324 | 0.2040 (±0.0152/√100) | 🟢 Qwen/Qwen1.5-0.5B | 138.6 (±55.9) | 0.296 | 0.314 | 0.003
325 | 0.2038 (±0.0538/√100) | 🟢 mosaicml/mpt-30b | 236.5 (±433.3) | 0.271 | 0.334 | 0.007
326 | 0.1885 (±0.0194/√100) | 🟢 microsoft/phi-1_5 | 77.5 (±33.6) | 0.258 | 0.306 | 0.001
327 | 0.1833 (±0.0406/√100) | 💬 google/gemma-1.1-2b-it | 32.6 (±26.7) | 0.171 | 0.376 | 0.003
328 | 0.1765 (±0.0439/√100) | 💬 Qwen/Qwen1.5-0.5B-Chat | 214.3 (±172.6) | 0.251 | 0.276 | 0.002
329 | 0.1687 (±0.0172/√100) | 🟢 upstage/SOLAR-10.7B-v1.0 | 171.0 (±87.1) | 0.265 | 0.237 | 0.004
330 | 0.1544 (±0.0132/√100) | 🟢 01-ai/Yi-1.5-34B-Chat | 730.0 (±533.6) | 0.201 | 0.256 | 0.006
331 | 0.1475 (±0.0826/√100) | 💬 mosaicml/mpt-30b-chat | 112.2 (±112.4) | 0.182 | 0.254 | 0.007
332 | 0.1241 (±0.0558/√100) | 💬 google/gemma-2b-it | 24.1 (±24.6) | 0.115 | 0.257 | 0.000
333 | 0.1226 (±0.0240/√100) | 🟢 Deci/DeciLM-7B | 174.0 (±165.5) | 0.190 | 0.174 | 0.003
334 | 0.1160 (±0.0081/√100) | 🟢 rinna/japanese-gpt-neox-3.6b-instruct... | 212.1 (±148.9) | 0.153 | 0.195 | 0.000
335 | 0.1009 (±0.0846/√100) | 💬 meta-llama/Llama-2-7b-chat-hf | 241.5 (±336.2) | 0.136 | 0.158 | 0.009
336 | 0.1004 (±0.0094/√100) | 🟢 rinna/japanese-gpt-neox-3.6b-instruct... | 123.1 (±128.8) | 0.119 | 0.182 | 0.000
337 | 0.0987 (±0.0145/√100) | 🟢 deepseek-ai/deepseek-llm-67b-base | 154.2 (±77.3) | 0.174 | 0.121 | 0.000
338 | 0.0982 (±0.1596/√100) | 💬 rinna/nekomata-14b-instruction | 16.0 (±38.1) | 0.115 | 0.141 | 0.039
339 | 0.0955 (±0.0102/√100) | 🟢 rinna/japanese-gpt-neox-3.6b-instruct... | 129.5 (±141.0) | 0.116 | 0.170 | 0.000
340 | 0.0939 (±0.0064/√100) | 🟢 sbintuitions/tiny-lm-chat | 250.2 (±275.6) | 0.133 | 0.149 | 0.000
341 | 0.0936 (±0.0082/√100) | 💬 sbintuitions/tiny-lm-chat | 276.7 (±209.6) | 0.135 | 0.145 | 0.000
342 | 0.0921 (±0.0058/√100) | 🟢 sbintuitions/tiny-lm | 471.9 (±199.0) | 0.135 | 0.142 | 0.000
343 | 0.0880 (±0.0334/√100) | 🟢 rinna/bilingual-gpt-neox-4b-instructi... | 134.0 (±144.7) | 0.105 | 0.159 | 0.000
344 | 0.0762 (±0.0033/√100) | 🟢 line-corporation/japanese-large-lm-3.6b | 1066.6 (±31.6) | 0.125 | 0.103 | 0.000
345 | 0.0760 (±0.0032/√100) | 🟢 line-corporation/japanese-large-lm-3.... | 1066.4 (±31.8) | 0.125 | 0.103 | 0.000
346 | 0.0758 (±0.0034/√100) | 💬 line-corporation/japanese-large-lm-3.... | 1067.2 (±31.8) | 0.125 | 0.102 | 0.000
347 | 0.0673 (±0.0085/√100) | 🟢 moneyforward/houou-instruction-7b-v3 | 143.2 (±112.2) | 0.098 | 0.104 | 0.000
348 | 0.0625 (±0.0169/√100) | 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... | 31.6 (±10.3) | 0.088 | 0.099 | 0.000
349 | 0.0429 (±0.0440/√100) | 🟢 rinna/bilingual-gpt-neox-4b-instructi... | 31.7 (±54.7) | 0.045 | 0.084 | 0.000
350 | 0.0406 (±0.0028/√100) | 🟢 microsoft/Phi-3-small-128k-instruct | 268.1 (±123.4) | 0.083 | 0.039 | 0.000
351 | 0.0337 (±0.0026/√100) | 🟢 augmxnt/shisa-7b-v1 | 590.7 (±238.2) | 0.076 | 0.025 | 0.000
352 | 0.0284 (±0.0012/√100) | 🟢 lightblue/karasu-7B-chat-plus | 285.1 (±53.8) | 0.080 | 0.005 | 0.000
353 | 0.0225 (±0.0702/√100) | 💬 SakanaAI/EvoLLM-JP-A-v1-7B | 5.9 (±27.6) | 0.026 | 0.037 | 0.005
354 | 0.0180 (±0.0039/√100) | 🟢 mistralai/Mistral-Nemo-Base-2407 | 607.5 (±344.5) | 0.039 | 0.015 | 0.000
355 | 0.0047 (±0.0024/√100) | 🟢 ai-forever/mGPT-13B | 321.1 (±266.7) | 0.008 | 0.006 | 0.000
356 | 0.0022 (±0.0006/√100) | 🟢 lightblue/qarasu-14B-chat-plus-unleashed | 937.5 (±557.0) | 0.004 | 0.002 | 0.000
357 | 0.0019 (±0.0002/√100) | 🟢 01-ai/Yi-1.5-9B-Chat | 1440.0 (±51.9) | 0.005 | 0.001 | 0.000
358 | 0.0018 (±0.0004/√100) | 🟢 CohereForAI/aya-23-8B | 1676.6 (±351.0) | 0.004 | 0.002 | 0.000
359 | 0.0006 (±0.0002/√100) | 🟢 meta-llama/Llama-2-13b-chat-hf | 1523.9 (±43.5) | 0.001 | 0.001 | 0.000
360 | 0.0000 (±0.0000/√100) | 🟢 01-ai/Yi-1.5-6B | 0.0 (±0.0) | 0.000 | 0.000 | 0.000
361 | 0.0000 (±0.0000/√100) | 🟢 lightblue/karasu-1.1B | 0.0 (±0.0) | 0.000 | 0.000 | 0.000
362 | 0.0000 (±0.0000/√100) | 🟢 lightblue/karasu-7B-chat-plus-unleashed | 0.0 (±0.0) | 0.000 | 0.000 | 0.000
363 | 0.0000 (±0.0000/√100) | 🟢 lightblue/karasu-7B-chat | 0.0 (±0.0) | 0.000 | 0.000 | 0.000
364 | 0.0000 (±0.0000/√100) | 🟢 lightblue/suzume-llama-3-8B-japanese | 300.0 (±0.0) | 0.000 | 0.000 | 0.000
365 | 0.0000 (±0.0000/√100) | 🟢 lightblue/suzume-llama-3-8B-multilingual | 300.0 (±0.0) | 0.000 | 0.000 | 0.000
If you use this repository, please cite the following paper:
@preprint{Imos2024-pre-pfgen,
title={{pfgen-bench: 日本語事前学習モデルのための文章生成性能評価ベンチマーク}},
author={今城, 健太郎 and 平野, 正徳 and 鈴木, 脩司 and 三上, 裕明},
doi={10.51094/jxiv.1008},
year={2024}
}
Or cite this repository directly:
@misc{imajo2024-pfgen,
title={{Preferred Generation Benchmark}},
author={Kentaro Imajo and Masanori Hirano and Shuji Suzuki and Hiroaki Mikami},
year={2024},
url = {https://github.com/pfnet-research/pfgen-bench}
}