Preferred Generation Benchmark

pfgen-benchmark is a benchmark designed to evaluate Japanese text generation, specifically for pretrained models. Unlike conventional benchmarks that use templates containing explicit instructions, this benchmark relies solely on providing numerous examples. By conveying expectations purely through examples (the question-answering nature of the task, responses of approximately 100 characters, and outputs resembling formal public documents), it minimizes the influence of differences in instructions or templates. Additionally, outputs are evaluated with n-gram-based methods, which, unlike the LLM-as-a-Judge approach, make evaluation quick, inexpensive, and deterministic.
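As a rough illustration of why n-gram scoring is quick and deterministic, the sketch below computes a character-trigram F1 between a candidate answer and a reference text. This is not the benchmark's actual metric (the function names are ours); it only shows that such scoring needs nothing beyond counting substrings:

from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    # Count overlapping character n-grams in a string.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_f1(candidate: str, reference: str, n: int = 3) -> float:
    # F1 overlap of character n-grams; cheap and fully deterministic.
    cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(ngram_f1("富士山は日本で最も高い山です。", "富士山は日本一高い山である。"))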

To enable comparisons across as many models as possible, the leaderboard actively includes a wide range of models: openly accessible models, models mentioned in academic papers, and models announced by companies through press releases. Contributions of model outputs are encouraged and can be submitted via pull requests; for detailed instructions, see the "How to contribute" section.

See more details: Jxiv preprint (doi:10.51094/jxiv.1008)


License of LLM output

Everything in this repository other than the LLM outputs is licensed under the Apache License, Version 2.0. The license of each LLM output depends on the license of the model that produced it.

How to evaluate a model

You can evaluate a model using run-hf.py (which uses transformers) or run-vllm.py (which uses vLLM); refer to --help for the detailed parameters. The --num-trials parameter, the number of patterns for which the model generates answers, should be chosen by weighing execution time against the required accuracy.

# Run a model with the Hugging Face transformers library.
python ./run-hf.py --model=pfnet/plamo-13b --num-trials=5
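# Or run it with vLLM (run-vllm.py is assumed to take the same basic flags; see --help).
python ./run-vllm.py --model=pfnet/plamo-13b --num-trials=5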

# Evaluate output and update leaderboard.
make

How to contribute

Follow the instructions in the "How to evaluate a model" section to run the evaluation. This generates config.json and trials.jsonl.xz under the result directory. Please create a pull request containing only these two files.

To ensure a more accurate ranking among models, the number of trials (--num-trials) should be as large as possible, up to the limit of 100.
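As a quick sanity check before opening a pull request, you might verify that the generated trials.jsonl.xz is a well-formed xz-compressed JSON-Lines file and count the recorded trials. The path below is a hypothetical example; the record schema is not inspected:

import json
import lzma

# Hypothetical result path; adjust to the model you evaluated.
path = "result/pfnet/plamo-13b/trials.jsonl.xz"

with lzma.open(path, "rt", encoding="utf-8") as f:
    num_records = 0
    for line in f:
        json.loads(line)  # raises if a line is not valid JSON
        num_records += 1

print(f"{path}: {num_records} records")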

Leaderboard

Rank Score (mean ±dev/√trials)   Model                                       Length (chars ±dev) Fluency Truthfulness Helpfulness
N/A 1.0501 (±0.0000/√1) 👑 system/ground-truth 100.0 (±0.0) 1.155 0.996 1.000
1 0.9303 (±0.0083/√10) 💬 anthropic/claude-3-5-sonnet-20240620 102.2 (±10.4) 0.949 0.959 0.883
2 0.9144 (±0.0037/√2) 💬 deepseek-ai/DeepSeek-V3 87.4 (±14.9) 0.960 0.983 0.800
3 0.8615 (±0.0092/√10) 💬 openai/gpt-4o 84.5 (±18.6) 0.919 0.980 0.686
N/A 0.8494 (±0.0253/√1000) 🎯 system/criteria 100.0 (±3.4) 0.936 0.978 0.505
4 0.8270 (±0.0229/√10) 💬 anthropic/claude-3-opus-20240229 102.3 (±9.5) 0.911 0.944 0.627
5 0.8059 (±0.0169/√5) 💬 google/gemini-2.0-flash-exp 68.0 (±17.7) 0.834 0.984 0.600
6 0.8036 (±0.0133/√10) 💬 openai/gpt-4-turbo 86.5 (±17.4) 0.820 0.959 0.632
7 0.7916 (±0.0146/√10) 💬 openai/gpt-4 107.2 (±11.6) 0.888 0.951 0.536
8 0.7821 (±0.0166/√5) 💬 Qwen/Qwen2.5-72B-Instruct 98.3 (±14.9) 0.871 0.933 0.542
9 0.7789 (±0.0213/√100) 🟢 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 109.1 (±36.8) 0.890 0.941 0.506
10 0.7773 (±0.0168/√100) 💬 pfnet/plamo-1.0-prime 178.2 (±114.5) 0.874 0.942 0.516
11 0.7768 (±0.0113/√5) 💬 mlx-community/Qwen2.5-72B-Instruct-4bit 100.8 (±17.7) 0.860 0.933 0.538
12 0.7766 (±0.0276/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-hf 104.1 (±17.9) 0.884 0.938 0.507
13 0.7756 (±0.0264/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-instruc... 104.1 (±18.5) 0.878 0.938 0.510
14 0.7748 (±0.0000/√1) 💬 openai/chatgpt-o1 76.3 (±17.7) 0.755 0.960 0.610
15 0.7650 (±0.0263/√100) 🟢 tokyotech-llm/Swallow-70b-instruct-hf 102.5 (±14.4) 0.872 0.929 0.494
16 0.7643 (±0.0000/√1) 💬 openai/chatgpt-o1-pro 79.5 (±17.3) 0.748 0.955 0.590
17 0.7628 (±0.0275/√100) 🟢 tokyotech-llm/Swallow-70b-hf 103.5 (±16.1) 0.876 0.930 0.483
18 0.7469 (±0.0270/√100) 🟢 pfnet/plamo-100b-base 115.2 (±64.0) 0.861 0.920 0.460
19 0.7458 (±0.0085/√5) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 69.5 (±25.9) 0.842 0.972 0.424
20 0.7444 (±0.0260/√100) 🟢 sbintuitions/sarashina2-70b 120.0 (±49.4) 0.825 0.923 0.485
21 0.7423 (±0.0302/√100) 💬 cyberagent/Llama-3.1-70B-Japanese-Ins... 199.2 (±110.3) 0.817 0.905 0.505
22 0.7365 (±0.0218/√100) 🟢 CohereForAI/c4ai-command-r-plus 107.5 (±42.3) 0.818 0.913 0.478
23 0.7336 (±0.0254/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-v0.1 108.2 (±24.7) 0.837 0.908 0.456
24 0.7320 (±0.0201/√10) 💬 anthropic/claude-3-sonnet-20240229 114.3 (±18.9) 0.810 0.910 0.476
25 0.7249 (±0.0247/√100) 💬 cyberagent/calm3-22b-chat 136.8 (±46.7) 0.813 0.907 0.455
26 0.7217 (±0.0219/√100) 🟢 cyberagent/calm3-22b-chat 105.0 (±13.1) 0.824 0.916 0.425
27 0.7194 (±0.0321/√10) 💬 google/text-bison 77.6 (±31.9) 0.790 0.968 0.401
28 0.7185 (±0.0000/√1) 💬 elyza/Llama-3-ELYZA-JP-70B 98.6 (±33.8) 0.837 0.931 0.388
29 0.7175 (±0.0257/√100) 🟢 nvidia/nemotron-4-340b-instruct 107.3 (±28.4) 0.816 0.908 0.429
30 0.7046 (±0.0248/√100) 💬 nvidia/nemotron-4-340b-instruct 94.5 (±39.1) 0.768 0.910 0.435
31 0.7024 (±0.0238/√100) 🟢 rinna/nekomata-14b 104.3 (±18.0) 0.812 0.912 0.383
32 0.7008 (±0.0318/√100) 🟢 tokyotech-llm/Swallow-13b-instruct-hf 104.5 (±13.0) 0.812 0.898 0.392
33 0.6990 (±0.0288/√100) 🟢 tokyotech-llm/Swallow-13b-NVE-hf 106.2 (±19.2) 0.820 0.906 0.371
34 0.6945 (±0.0300/√100) 🟢 sbintuitions/sarashina2-13b 107.8 (±28.3) 0.794 0.900 0.390
35 0.6938 (±0.0217/√100) 🟢 weblab-GENIAC/Tanuki-8B-dpo-v1.0 111.5 (±22.8) 0.800 0.893 0.389
36 0.6891 (±0.0255/√100) 🟢 tokyotech-llm/Swallow-13b-hf 104.8 (±17.7) 0.811 0.901 0.355
37 0.6842 (±0.0171/√5) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 92.9 (±20.0) 0.804 0.932 0.317
38 0.6794 (±0.0243/√100) 🟢 cyberagent/Llama-3.1-70B-Japanese-Ins... 128.8 (±72.2) 0.764 0.883 0.391
39 0.6759 (±0.0232/√10) 🟢 meta-llama/Meta-Llama-3.1-405B 101.2 (±15.1) 0.767 0.892 0.368
40 0.6745 (±0.0152/√10) 💬 google/gemini-1.5-pro-001 52.4 (±15.0) 0.666 0.980 0.377
41 0.6737 (±0.0276/√100) 🟢 sbintuitions/sarashina1-13b 105.4 (±23.4) 0.775 0.882 0.364
42 0.6697 (±0.0277/√100) 🟢 nvidia/nemotron-4-340b-base 106.9 (±26.5) 0.768 0.884 0.357
43 0.6677 (±0.0250/√100) 🟢 llm-jp/llm-jp-3-13b 101.1 (±9.7) 0.770 0.884 0.349
44 0.6673 (±0.0225/√100) 🟢 sbintuitions/sarashina1-65b 104.2 (±20.0) 0.776 0.894 0.332
45 0.6663 (±0.0262/√100) 🟢 tokyotech-llm/Swallow-7b-plus-hf 106.1 (±18.1) 0.780 0.880 0.339
46 0.6656 (±0.0169/√10) 💬 google/gemini-1.5-flash-001 55.1 (±21.7) 0.687 0.967 0.342
47 0.6625 (±0.0140/√10) 💬 anthropic/claude-3-haiku-20240307 81.9 (±31.0) 0.747 0.943 0.298
48 0.6590 (±0.0133/√10) 💬 google/gemini-2.0-flash-thinking-exp-... 49.8 (±11.0) 0.639 0.984 0.354
49 0.6473 (±0.0182/√100) 💬 Qwen/Qwen2-72B-Instruct 108.7 (±24.8) 0.703 0.853 0.386
50 0.6456 (±0.0255/√100) 🟢 sbintuitions/sarashina2-7b 105.6 (±22.8) 0.746 0.874 0.316
51 0.6445 (±0.0241/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-v0.1 110.3 (±28.4) 0.748 0.867 0.319
52 0.6368 (±0.0207/√100) 🟢 tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 105.5 (±21.0) 0.753 0.870 0.287
53 0.6350 (±0.0260/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-instruct... 104.0 (±16.9) 0.755 0.863 0.287
54 0.6337 (±0.0265/√100) 🟢 tokyotech-llm/Swallow-7b-hf 106.5 (±18.7) 0.746 0.866 0.289
55 0.6335 (±0.0252/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 103.2 (±16.6) 0.766 0.872 0.263
56 0.6318 (±0.0264/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-Ins... 119.2 (±74.3) 0.724 0.861 0.311
57 0.6303 (±0.0252/√100) 🟢 cyberagent/calm2-7b-chat-dpo-experime... 110.0 (±24.3) 0.735 0.863 0.293
58 0.6285 (±0.0239/√100) 🟢 pfnet/nekomata-14b-pfn-qfin-inst-merge 124.7 (±47.2) 0.725 0.866 0.295
59 0.6279 (±0.0252/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-hf 108.1 (±24.5) 0.747 0.870 0.267
60 0.6274 (±0.0772/√100) 🟢 rinna/nekomata-14b-instruction 98.3 (±24.2) 0.732 0.855 0.295
61 0.6267 (±0.0263/√100) 🟢 sbintuitions/sarashina1-7b 106.7 (±25.1) 0.737 0.866 0.276
62 0.6252 (±0.0246/√100) 🟢 karakuri-ai/karakuri-lm-70b-v0.1 106.0 (±27.0) 0.713 0.852 0.310
63 0.6214 (±0.0063/√10) 💬 google/gemini-1.0-pro-001 47.4 (±15.2) 0.635 0.976 0.254
64 0.6202 (±0.0251/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.3 (±19.2) 0.733 0.848 0.280
65 0.6197 (±0.0258/√100) 🟢 stockmark/stockmark-13b 108.9 (±49.3) 0.727 0.860 0.272
66 0.6191 (±0.0284/√100) 🟢 stockmark/stockmark-13b-instruct 108.0 (±46.8) 0.720 0.859 0.278
67 0.6178 (±0.0230/√100) 🟢 karakuri-ai/karakuri-lm-70b-chat-v0.1 104.7 (±27.5) 0.706 0.842 0.306
68 0.6176 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-7b-instruct-hf 106.3 (±17.8) 0.716 0.851 0.285
69 0.6136 (±0.0143/√10) 💬 openai/gpt-35-turbo 64.0 (±22.2) 0.658 0.944 0.239
70 0.6095 (±0.0225/√100) 💬 rinna/llama-3-youko-70b-instruct 135.3 (±46.8) 0.683 0.817 0.328
71 0.6091 (±0.0277/√100) 🟢 pfnet/nekomata-14b-pfn-qfin 85.1 (±28.4) 0.672 0.893 0.262
72 0.6087 (±0.1545/√100) 💬 tokyotech-llm/Swallow-70b-NVE-instruc... 135.7 (±74.0) 0.678 0.804 0.344
73 0.6060 (±0.0238/√100) 🟢 Qwen/Qwen2-72B 105.5 (±23.5) 0.703 0.836 0.279
74 0.6037 (±0.0239/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-instruct-hf 105.7 (±16.4) 0.719 0.847 0.245
75 0.6030 (±0.0287/√100) 💬 karakuri-ai/karakuri-lm-8x7b-instruct... 197.4 (±72.1) 0.703 0.832 0.274
76 0.6029 (±0.0223/√100) 🟢 Qwen/Qwen2-72B-Instruct 106.0 (±26.7) 0.684 0.825 0.299
77 0.5987 (±0.0264/√100) 🟢 cyberagent/calm2-7b-chat 107.5 (±20.8) 0.701 0.843 0.253
78 0.5971 (±0.0235/√100) 🟢 stockmark/stockmark-100b 107.2 (±24.7) 0.709 0.842 0.240
79 0.5945 (±0.1370/√100) 💬 tokyotech-llm/Swallow-13b-instruct-hf 167.3 (±116.4) 0.670 0.790 0.323
80 0.5921 (±0.0211/√100) 🟢 elyza/Llama-3-ELYZA-JP-8B 115.6 (±44.8) 0.685 0.831 0.260
81 0.5832 (±0.0220/√100) 🟢 augmxnt/shisa-gamma-7b-v1 106.7 (±21.8) 0.706 0.831 0.213
82 0.5825 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-MS-7b-v0.1 106.4 (±25.9) 0.702 0.828 0.218
83 0.5811 (±0.0218/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 103.6 (±15.6) 0.675 0.816 0.252
84 0.5808 (±0.0220/√100) 🟢 stabilityai/japanese-stablelm-base-ga... 106.9 (±17.2) 0.690 0.822 0.230
85 0.5783 (±0.0217/√100) 🟢 microsoft/Phi-3-medium-4k-instruct 105.9 (±20.0) 0.675 0.826 0.234
86 0.5777 (±0.0228/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 105.2 (±14.5) 0.675 0.811 0.247
87 0.5754 (±0.0182/√100) 🟢 Xwin-LM/Xwin-LM-70B-V0.1 105.4 (±26.8) 0.681 0.833 0.213
88 0.5737 (±0.0209/√100) 🟢 microsoft/Phi-3-medium-128k-instruct 107.7 (±24.7) 0.674 0.825 0.223
89 0.5735 (±0.0216/√100) 🟢 google/gemma-2-9b-it 95.9 (±22.0) 0.674 0.837 0.209
90 0.5734 (±0.1980/√100) 💬 tokyotech-llm/Swallow-70b-instruct-hf 130.9 (±105.0) 0.636 0.758 0.326
91 0.5724 (±0.0209/√100) 🟢 rinna/llama-3-youko-70b 104.6 (±20.6) 0.681 0.826 0.210
92 0.5716 (±0.0230/√100) 🟢 sbintuitions/sarashina2.1-1b 116.9 (±41.3) 0.668 0.821 0.226
93 0.5712 (±0.0194/√100) 💬 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 244.4 (±49.3) 0.678 0.816 0.220
94 0.5710 (±0.0226/√100) 🟢 rinna/llama-3-youko-8b-instruct 111.6 (±23.4) 0.672 0.809 0.232
95 0.5659 (±0.0234/√100) 🟢 meta-llama/Meta-Llama-3.1-70B 103.7 (±20.1) 0.665 0.822 0.211
96 0.5656 (±0.0226/√100) 💬 meta-llama/Meta-Llama-3-70B-Instruct 110.2 (±36.4) 0.665 0.777 0.254
97 0.5646 (±0.0240/√100) 💬 microsoft/Phi-3-medium-4k-instruct 131.3 (±50.6) 0.633 0.807 0.253
98 0.5642 (±0.0261/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.1 (±19.5) 0.646 0.799 0.247
99 0.5620 (±0.0254/√100) 🟢 meta-llama/Meta-Llama-3-70B 102.0 (±17.2) 0.664 0.809 0.213
100 0.5588 (±0.0230/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.6 (±17.0) 0.673 0.812 0.191
101 0.5574 (±0.0216/√100) 🟢 rinna/nekomata-7b 108.4 (±18.0) 0.678 0.816 0.178
102 0.5569 (±0.0244/√100) 🟢 rinna/llama-3-youko-8b 104.9 (±17.0) 0.670 0.813 0.188
103 0.5568 (±0.0200/√100) 🟢 meta-llama/Meta-Llama-3-70B-Instruct 111.8 (±55.9) 0.655 0.780 0.236
104 0.5562 (±0.0952/√100) 💬 stockmark/stockmark-13b-instruct 137.2 (±89.6) 0.633 0.798 0.238
105 0.5537 (±0.0204/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-Inst... 114.4 (±48.5) 0.657 0.812 0.192
106 0.5516 (±0.1016/√100) 💬 cyberagent/calm2-7b-chat-dpo-experime... 181.1 (±120.1) 0.644 0.775 0.236
107 0.5511 (±0.0203/√100) 🟢 google/gemma-2-27b-it 110.3 (±56.8) 0.599 0.836 0.218
108 0.5500 (±0.0605/√100) 💬 tokyotech-llm/Llama-3-Swallow-70B-Ins... 156.5 (±106.5) 0.633 0.780 0.237
109 0.5500 (±0.0467/√100) 💬 tokyotech-llm/Swallow-7b-instruct-hf 121.9 (±77.3) 0.612 0.812 0.225
110 0.5437 (±0.0218/√100) 💬 Xwin-LM/Xwin-LM-70B-V0.1 200.7 (±63.1) 0.652 0.782 0.198
111 0.5436 (±0.0246/√100) 🟢 llm-jp/llm-jp-3-3.7b 101.3 (±10.4) 0.646 0.795 0.189
112 0.5432 (±0.0208/√100) 💬 CohereForAI/c4ai-command-r-plus 48.9 (±16.5) 0.505 0.931 0.194
113 0.5429 (±0.0238/√100) 🟢 meta-llama/Meta-Llama-3.1-70B-Instruct 157.6 (±221.7) 0.636 0.770 0.222
114 0.5387 (±0.0269/√100) 💬 rinna/llama-3-youko-8b-instruct 265.4 (±104.1) 0.635 0.771 0.210
115 0.5386 (±0.0215/√100) 💬 microsoft/Phi-3-medium-128k-instruct 91.9 (±44.7) 0.589 0.834 0.193
116 0.5377 (±0.0481/√100) 💬 meta-llama/Meta-Llama-3.1-70B-Instruct 135.8 (±194.8) 0.617 0.779 0.218
117 0.5349 (±0.0203/√100) 💬 google/gemma-2-27b-it 74.7 (±42.7) 0.545 0.874 0.186
118 0.5347 (±0.0188/√100) 🟢 rinna/youri-7b 107.6 (±16.3) 0.654 0.802 0.148
119 0.5316 (±0.0273/√100) 💬 lightblue/karasu-7B-chat 111.8 (±46.5) 0.621 0.800 0.174
120 0.5301 (±0.0476/√100) 💬 lightblue/karasu-7B-chat-plus 107.1 (±46.7) 0.615 0.798 0.178
121 0.5283 (±0.0585/√100) 💬 lightblue/karasu-7B-chat-plus-unleashed 104.6 (±45.3) 0.614 0.794 0.177
122 0.5179 (±0.0264/√100) 🟢 cyberagent/calm2-7b 106.0 (±26.2) 0.601 0.770 0.182
123 0.5164 (±0.0209/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 109.3 (±33.5) 0.606 0.788 0.155
124 0.5143 (±0.0212/√100) 🟢 llm-jp/llm-jp-13b-v2.0 104.1 (±11.2) 0.604 0.760 0.180
125 0.5143 (±0.0170/√100) 🟢 moneyforward/houou-instruction-7b-v3 112.2 (±37.8) 0.629 0.778 0.135
126 0.5085 (±0.0160/√100) 🟢 moneyforward/houou-instruction-7b-v1 105.9 (±41.0) 0.617 0.781 0.128
127 0.5080 (±0.0306/√100) 💬 stabilityai/japanese-stablelm-instruc... 111.3 (±58.3) 0.548 0.782 0.195
128 0.5073 (±0.0208/√100) 💬 Qwen/Qwen2-57B-A14B-Instruct 154.8 (±89.5) 0.615 0.734 0.173
129 0.5045 (±0.0208/√100) 🟢 Qwen/Qwen2-57B-A14B 106.7 (±22.5) 0.617 0.757 0.139
130 0.5041 (±0.0225/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 106.2 (±29.3) 0.579 0.778 0.155
131 0.5022 (±0.0221/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 95.0 (±36.2) 0.579 0.795 0.132
132 0.5013 (±0.0196/√100) 🟢 google/gemma-2-9b 107.3 (±26.0) 0.595 0.761 0.148
133 0.5013 (±0.0375/√100) 💬 karakuri-ai/karakuri-lm-70b-chat-v0.1 427.4 (±151.5) 0.579 0.723 0.202
134 0.5002 (±0.0218/√100) 🟢 Qwen/Qwen-72B-Chat 223.0 (±258.3) 0.614 0.716 0.171
135 0.4995 (±0.0211/√100) 💬 Qwen/Qwen1.5-72B-Chat 119.3 (±58.1) 0.582 0.708 0.208
136 0.4963 (±0.0189/√100) 🟢 Qwen/Qwen1.5-72B-Chat 128.1 (±77.7) 0.586 0.698 0.206
137 0.4959 (±0.0235/√100) 🟢 llm-jp/llm-jp-13b-v1.0 115.0 (±40.9) 0.576 0.756 0.156
138 0.4953 (±0.0203/√100) 🟢 meta-llama/Llama-2-70b-hf 110.4 (±25.8) 0.596 0.745 0.145
139 0.4949 (±0.0177/√100) 💬 moneyforward/houou-instruction-7b-v1 180.5 (±66.6) 0.604 0.734 0.146
140 0.4931 (±0.0247/√100) 🟢 Rakuten/RakutenAI-7B-instruct 105.6 (±33.1) 0.598 0.750 0.132
141 0.4921 (±0.0219/√100) 🟢 Rakuten/RakutenAI-7B-chat 114.9 (±44.7) 0.592 0.760 0.124
142 0.4916 (±0.0201/√100) 🟢 moneyforward/houou-instruction-7b-v2 104.7 (±41.2) 0.588 0.770 0.116
143 0.4895 (±0.0440/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 268.1 (±133.1) 0.548 0.722 0.199
144 0.4872 (±0.0237/√100) 🟢 lightblue/karasu-7B 110.1 (±19.0) 0.586 0.739 0.137
145 0.4870 (±0.0215/√100) 🟢 Qwen/Qwen-72B 134.6 (±114.6) 0.593 0.715 0.152
146 0.4868 (±0.0163/√100) 💬 google/gemma-2-9b-it 47.6 (±14.6) 0.477 0.880 0.104
147 0.4863 (±0.1167/√100) 💬 pfnet/nekomata-14b-pfn-qfin-inst-merge 93.4 (±55.0) 0.544 0.721 0.194
148 0.4862 (±0.0221/√100) 🟢 Qwen/Qwen2-57B-A14B-Instruct 116.9 (±82.5) 0.601 0.734 0.124
149 0.4857 (±0.0168/√100) 💬 moneyforward/houou-instruction-7b-v2 207.0 (±57.3) 0.591 0.719 0.147
150 0.4829 (±0.0211/√100) 🟢 Qwen/Qwen1.5-72B 136.2 (±85.6) 0.591 0.705 0.153
151 0.4827 (±0.0464/√100) 💬 llm-jp/llm-jp-13b-instruct-full-ac_00... 269.1 (±131.5) 0.542 0.716 0.191
152 0.4762 (±0.0810/√100) 💬 stabilityai/japanese-stablelm-instruc... 126.2 (±67.4) 0.545 0.726 0.158
153 0.4746 (±0.0210/√100) 🟢 rinna/youri-7b-chat 102.1 (±16.4) 0.571 0.752 0.100
154 0.4744 (±0.0227/√100) 🟢 pfnet/plamo-13b 108.2 (±28.5) 0.558 0.749 0.116
155 0.4743 (±0.0987/√100) 💬 tokyotech-llm/Swallow-7b-NVE-instruct-hf 129.0 (±72.8) 0.535 0.725 0.163
156 0.4730 (±0.0166/√100) 🟢 Xwin-LM/Xwin-LM-13B-V0.2 109.7 (±27.4) 0.582 0.723 0.114
157 0.4723 (±0.0204/√100) 💬 Rakuten/RakutenAI-7B-chat 233.0 (±133.0) 0.565 0.734 0.118
158 0.4723 (±0.0808/√100) 💬 tokyotech-llm/Llama-3-Swallow-8B-Inst... 199.3 (±155.6) 0.563 0.699 0.154
159 0.4698 (±0.0200/√100) 🟢 Rakuten/RakutenAI-7B 105.4 (±25.6) 0.576 0.721 0.113
160 0.4692 (±0.0161/√100) 🟢 shisa-ai/shisa-v1-qwen2-7b 109.0 (±23.9) 0.563 0.712 0.133
161 0.4661 (±0.0210/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 111.6 (±44.2) 0.536 0.756 0.106
162 0.4659 (±0.0438/√100) 💬 deepseek-ai/deepseek-llm-67b-chat 146.0 (±62.1) 0.555 0.703 0.139
163 0.4659 (±0.0202/√100) 🟢 llm-jp/llm-jp-3-1.8b 105.0 (±16.9) 0.568 0.725 0.105
164 0.4648 (±0.1659/√100) 💬 cyberagent/calm2-7b-chat 124.7 (±95.9) 0.536 0.688 0.171
165 0.4622 (±0.0195/√100) 🟢 Qwen/Qwen-14B-Chat 135.5 (±84.3) 0.572 0.718 0.097
166 0.4619 (±0.0162/√100) 💬 lmsys/vicuna-13b-v1.5-16k 126.5 (±48.4) 0.574 0.715 0.097
167 0.4609 (±0.0113/√10) 🟢 google/gemma-2-2b-jpn-it 69.4 (±24.1) 0.509 0.805 0.069
168 0.4607 (±0.0165/√100) 🟢 SakanaAI/EvoLLM-JP-v1-7B 111.2 (±30.4) 0.579 0.708 0.095
169 0.4601 (±0.0184/√100) 🟢 shisa-ai/shisa-v1-llama3-8b 112.9 (±31.4) 0.557 0.703 0.120
170 0.4597 (±0.0268/√100) 🟢 CohereForAI/c4ai-command-r-v01 179.2 (±166.3) 0.590 0.592 0.197
171 0.4586 (±0.0141/√100) 🟢 google/gemma-2-2b-it 88.2 (±30.8) 0.536 0.761 0.079
172 0.4561 (±0.0202/√100) 🟢 pfnet/plamo-13b-instruct 144.0 (±147.7) 0.532 0.763 0.073
173 0.4559 (±0.0201/√100) 🟢 pfnet/plamo-13b-instruct-nc 156.0 (±183.1) 0.523 0.768 0.077
174 0.4558 (±0.0156/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 75.3 (±26.6) 0.488 0.804 0.076
175 0.4543 (±0.0217/√100) 🟢 rinna/youri-7b-instruction 96.2 (±29.5) 0.530 0.743 0.090
176 0.4535 (±0.0348/√100) 💬 Rakuten/RakutenAI-7B-instruct 128.6 (±83.2) 0.527 0.726 0.108
177 0.4535 (±0.0183/√100) 🟢 THUDM/glm-4-9b 110.3 (±36.9) 0.554 0.689 0.118
178 0.4527 (±0.0146/√100) 🟢 lmsys/vicuna-13b-v1.5-16k 107.9 (±25.9) 0.576 0.708 0.075
179 0.4504 (±0.0224/√100) 🟢 rinna/nekomata-7b-instruction 96.4 (±23.7) 0.528 0.734 0.089
180 0.4486 (±0.0161/√100) 💬 Qwen/Qwen2-7B-Instruct 163.6 (±61.4) 0.547 0.688 0.111
181 0.4484 (±0.0191/√100) 💬 SakanaAI/EvoLLM-JP-v1-7B 123.9 (±68.1) 0.545 0.706 0.094
182 0.4477 (±0.0205/√100) 🟢 rinna/llama-3-youko-70b-instruct 130.7 (±95.3) 0.527 0.670 0.146
183 0.4426 (±0.0204/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-inst... 111.1 (±28.2) 0.544 0.687 0.097
184 0.4409 (±0.1064/√100) 💬 lightblue/karasu-7B 138.1 (±92.9) 0.512 0.679 0.131
185 0.4404 (±0.0146/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 75.9 (±22.7) 0.493 0.773 0.056
186 0.4387 (±0.0655/√100) 💬 Qwen/Qwen-72B-Chat 117.7 (±137.1) 0.541 0.632 0.143
187 0.4385 (±0.0285/√100) 💬 rinna/youri-7b-chat 95.4 (±41.1) 0.500 0.733 0.083
188 0.4377 (±0.0107/√100) 🟢 google/gemma-1.1-7b-it 86.8 (±21.4) 0.509 0.732 0.072
189 0.4374 (±0.0217/√100) 🟢 Qwen/Qwen1.5-32B-Chat 127.0 (±57.0) 0.538 0.642 0.133
190 0.4336 (±0.0168/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.1 (±17.2) 0.539 0.689 0.073
191 0.4335 (±0.0221/√100) 🟢 Qwen/Qwen-14B 118.1 (±71.6) 0.530 0.675 0.096
192 0.4332 (±0.0164/√100) 🟢 Qwen/Qwen2-7B-Instruct 119.1 (±45.7) 0.531 0.670 0.098
193 0.4330 (±0.0149/√100) 💬 google/gemma-2-2b-it 56.0 (±27.8) 0.445 0.788 0.066
194 0.4320 (±0.0171/√100) 🟢 Qwen/Qwen2-7B 109.1 (±40.1) 0.532 0.671 0.093
195 0.4296 (±0.0322/√100) 💬 Qwen/Qwen-14B-Chat 159.0 (±69.7) 0.522 0.675 0.092
196 0.4295 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-instruct 111.5 (±31.4) 0.530 0.676 0.083
197 0.4292 (±0.0181/√100) 💬 Xwin-LM/Xwin-LM-13B-V0.2 240.7 (±48.4) 0.533 0.670 0.085
198 0.4282 (±0.0193/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 110.8 (±26.0) 0.518 0.688 0.078
199 0.4272 (±0.0273/√100) 🟢 mistralai/Mistral-Nemo-Instruct-2407 155.8 (±132.8) 0.548 0.611 0.122
200 0.4265 (±0.0115/√100) 💬 google/gemma-1.1-7b-it 78.7 (±28.4) 0.475 0.739 0.066
201 0.4256 (±0.0270/√100) 🟢 rinna/japanese-gpt-neox-3.6b 129.8 (±73.4) 0.485 0.685 0.106
202 0.4228 (±0.0185/√100) 🟢 stabilityai/japanese-stablelm-base-ja... 110.4 (±28.6) 0.528 0.668 0.073
203 0.4222 (±0.0138/√100) 🟢 Xwin-LM/Xwin-LM-7B-V0.2 110.6 (±29.3) 0.520 0.677 0.070
204 0.4220 (±0.0185/√100) 🟢 lmsys/vicuna-7b-v1.5-16k 111.8 (±31.8) 0.522 0.670 0.074
205 0.4207 (±0.0189/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 112.8 (±27.0) 0.507 0.683 0.072
206 0.4201 (±0.0177/√100) 💬 lmsys/vicuna-7b-v1.5-16k 128.1 (±52.5) 0.514 0.668 0.078
207 0.4164 (±0.0244/√100) 🟢 google/gemma-7b 135.5 (±132.3) 0.533 0.631 0.085
208 0.4150 (±0.0212/√100) 💬 Qwen/Qwen1.5-32B-Chat 125.7 (±250.5) 0.496 0.620 0.130
209 0.4149 (±0.0375/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 186.6 (±108.4) 0.469 0.685 0.090
210 0.4144 (±0.0149/√100) 💬 01-ai/Yi-1.5-34B-Chat 170.6 (±47.1) 0.514 0.628 0.101
211 0.4140 (±0.0208/√100) 🟢 meta-llama/Meta-Llama-3-8B-Instruct 116.8 (±44.3) 0.523 0.637 0.082
212 0.4125 (±0.0303/√100) 💬 CohereForAI/c4ai-command-r-v01 137.7 (±324.6) 0.519 0.562 0.157
213 0.4122 (±0.0199/√100) 🟢 rinna/bilingual-gpt-neox-4b 121.0 (±43.6) 0.485 0.660 0.092
214 0.4097 (±0.0187/√100) 🟢 meta-llama/Meta-Llama-3.1-8B 108.7 (±35.4) 0.512 0.650 0.068
215 0.4087 (±0.0201/√100) 🟢 meta-llama/Llama-2-70b-chat-hf 161.3 (±140.8) 0.519 0.608 0.099
216 0.4087 (±0.0146/√100) 🟢 microsoft/Phi-3-small-8k-instruct 109.1 (±24.1) 0.514 0.644 0.068
217 0.4076 (±0.0142/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast-... 109.0 (±32.9) 0.503 0.644 0.076
218 0.4074 (±0.0207/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-inst... 156.6 (±65.9) 0.490 0.646 0.086
219 0.4073 (±0.0175/√100) 🟢 stabilityai/japanese-stablelm-instruc... 110.0 (±26.5) 0.490 0.663 0.070
220 0.4058 (±0.0295/√100) 💬 rinna/youri-7b-instruction 97.0 (±57.0) 0.439 0.713 0.065
221 0.4050 (±0.0191/√100) 🟢 mistralai/Mixtral-8x22B-v0.1 115.6 (±55.4) 0.517 0.615 0.084
222 0.4048 (±0.0175/√100) 🟢 meta-llama/Meta-Llama-3-8B 109.0 (±19.8) 0.505 0.641 0.068
223 0.4045 (±0.0186/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 133.1 (±57.4) 0.475 0.678 0.061
224 0.4042 (±0.0131/√100) 🟢 microsoft/Orca-2-13b 115.5 (±42.6) 0.510 0.630 0.073
225 0.4041 (±0.0218/√100) 💬 meta-llama/Meta-Llama-3-8B-Instruct 131.4 (±88.3) 0.508 0.614 0.090
226 0.4035 (±0.0151/√100) 🟢 SakanaAI/EvoLLM-JP-A-v1-7B 110.4 (±31.3) 0.508 0.633 0.069
227 0.4033 (±0.0164/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast... 107.2 (±28.5) 0.495 0.643 0.072
228 0.4032 (±0.0237/√100) 🟢 Qwen/Qwen1.5-32B 150.3 (±104.8) 0.505 0.605 0.100
229 0.4024 (±0.0187/√100) 🟢 01-ai/Yi-1.5-34B 109.9 (±28.2) 0.493 0.631 0.083
230 0.4011 (±0.0236/√100) 🟢 cyberagent/open-calm-7b 143.8 (±97.0) 0.472 0.641 0.091
231 0.4006 (±0.0166/√100) 💬 microsoft/Phi-3-small-8k-instruct 189.7 (±84.1) 0.500 0.630 0.073
232 0.4001 (±0.0199/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 117.6 (±48.9) 0.464 0.684 0.052
233 0.3985 (±0.0161/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b 138.4 (±51.8) 0.493 0.634 0.069
234 0.3960 (±0.0199/√100) 🟢 line-corporation/japanese-large-lm-1.7b 179.2 (±174.5) 0.474 0.650 0.065
235 0.3949 (±0.0193/√100) 💬 meta-llama/Meta-Llama-3.1-8B-Instruct 216.6 (±345.2) 0.487 0.624 0.074
236 0.3948 (±0.0190/√100) 💬 Qwen/Qwen1.5-14B-Chat 127.9 (±50.6) 0.500 0.604 0.080
237 0.3946 (±0.0201/√100) 🟢 Qwen/Qwen1.5-14B 130.9 (±67.8) 0.509 0.609 0.066
238 0.3934 (±0.0201/√100) 🟢 stabilityai/japanese-stablelm-instruc... 107.8 (±38.0) 0.466 0.648 0.066
239 0.3914 (±0.0172/√100) 🟢 mistralai/Mixtral-8x7B-Instruct-v0.1 95.1 (±25.2) 0.488 0.636 0.050
240 0.3863 (±0.0160/√100) 🟢 Qwen/Qwen1.5-14B-Chat 131.4 (±55.8) 0.491 0.593 0.075
241 0.3837 (±0.0188/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 117.4 (±42.4) 0.462 0.649 0.041
242 0.3823 (±0.0645/√100) 💬 mistralai/Mistral-Nemo-Instruct-2407 157.9 (±140.3) 0.484 0.563 0.100
243 0.3822 (±0.0647/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 97.6 (±76.2) 0.397 0.664 0.086
244 0.3819 (±0.0265/√100) 🟢 google/gemma-2-27b 214.2 (±183.3) 0.450 0.608 0.087
245 0.3804 (±0.0161/√100) 🟢 Qwen/Qwen-7B-Chat 140.8 (±65.1) 0.485 0.612 0.045
246 0.3803 (±0.0249/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-instruct 136.4 (±70.7) 0.452 0.619 0.070
247 0.3772 (±0.0162/√100) 💬 microsoft/Phi-3-small-128k-instruct 199.7 (±111.9) 0.473 0.590 0.069
248 0.3760 (±0.0236/√100) 🟢 cyberagent/open-calm-3b 123.2 (±79.0) 0.442 0.624 0.062
249 0.3759 (±0.0149/√100) 🟢 lmsys/longchat-7b-v1.5-32k 116.9 (±31.6) 0.474 0.609 0.045
250 0.3740 (±0.0164/√100) 🟢 meta-llama/Llama-2-13b-hf 108.5 (±21.8) 0.474 0.603 0.045
251 0.3737 (±0.0197/√100) 🟢 meta-llama/Meta-Llama-3.1-8B-Instruct 204.5 (±303.4) 0.478 0.589 0.055
252 0.3720 (±0.0622/√100) 💬 Xwin-LM/Xwin-LM-7B-V0.2 205.3 (±79.1) 0.466 0.590 0.060
253 0.3720 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast 177.5 (±147.2) 0.458 0.598 0.061
254 0.3699 (±0.0345/√100) 💬 Qwen/Qwen-7B-Chat 182.9 (±110.3) 0.468 0.600 0.042
255 0.3694 (±0.0103/√100) 🟢 google/gemma-7b-it 89.7 (±21.6) 0.446 0.640 0.022
256 0.3685 (±0.0173/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b 140.0 (±52.8) 0.462 0.596 0.047
257 0.3673 (±0.0089/√100) 💬 google/gemma-7b-it 110.0 (±47.6) 0.448 0.633 0.020
258 0.3655 (±0.0116/√100) 🟢 deepseek-ai/deepseek-llm-7b-chat 113.9 (±24.7) 0.474 0.579 0.043
259 0.3642 (±0.0165/√100) 🟢 llm-jp/llm-jp-1.3b-v1.0 134.0 (±62.6) 0.437 0.612 0.044
260 0.3637 (±0.0223/√100) 🟢 cyberagent/open-calm-large 122.3 (±73.9) 0.424 0.611 0.056
261 0.3637 (±0.0152/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast 168.0 (±77.4) 0.452 0.587 0.052
262 0.3632 (±0.0237/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-fast-... 178.6 (±113.6) 0.443 0.582 0.064
263 0.3628 (±0.0145/√100) 🟢 Qwen/Qwen-7B 117.3 (±39.0) 0.468 0.582 0.039
264 0.3554 (±0.0178/√100) 🟢 meta-llama/Llama-2-7b-chat-hf 139.3 (±93.1) 0.464 0.570 0.031
265 0.3545 (±0.0445/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 48.8 (±50.1) 0.283 0.723 0.058
266 0.3543 (±0.0439/√100) 💬 lmsys/longchat-7b-v1.5-32k 160.1 (±73.5) 0.448 0.572 0.043
267 0.3538 (±0.0175/√100) 🟢 01-ai/Yi-1.5-9B 113.0 (±29.4) 0.457 0.555 0.050
268 0.3531 (±0.0159/√100) 🟢 mistralai/Mixtral-8x7B-v0.1 94.3 (±20.8) 0.450 0.573 0.037
269 0.3514 (±0.0102/√100) 🟢 google/gemma-1.1-2b-it 80.4 (±21.6) 0.404 0.625 0.025
270 0.3495 (±0.0268/√100) 🟢 cyberagent/open-calm-1b 141.3 (±110.0) 0.412 0.578 0.059
271 0.3471 (±0.0131/√100) 🟢 microsoft/Orca-2-7b 131.1 (±70.7) 0.447 0.555 0.039
272 0.3465 (±0.0202/√100) 💬 deepseek-ai/deepseek-llm-7b-chat 167.2 (±76.5) 0.435 0.562 0.042
273 0.3463 (±0.0178/√100) 💬 mistralai/Mixtral-8x7B-Instruct-v0.1 147.1 (±111.8) 0.448 0.548 0.043
274 0.3449 (±0.0986/√100) 💬 stabilityai/japanese-stablelm-instruc... 109.4 (±66.2) 0.397 0.585 0.053
275 0.3440 (±0.0978/√100) 💬 stabilityai/japanese-stablelm-3b-4e1t... 127.8 (±80.5) 0.401 0.576 0.055
276 0.3436 (±0.0126/√100) 💬 01-ai/Yi-1.5-9B-Chat 143.6 (±60.1) 0.438 0.540 0.053
277 0.3428 (±0.0163/√100) 🟢 meta-llama/Llama-2-7b-hf 112.3 (±28.0) 0.440 0.550 0.038
278 0.3408 (±0.0225/√100) 🟢 anthracite-org/magnum-32b-v2 191.9 (±223.2) 0.442 0.507 0.073
279 0.3393 (±0.0225/√100) 🟢 stockmark/gpt-neox-japanese-1.4b 92.2 (±63.7) 0.351 0.641 0.025
280 0.3322 (±0.0151/√100) 🟢 Qwen/Qwen1.5-7B-Chat 127.7 (±117.0) 0.431 0.520 0.045
281 0.3315 (±0.0203/√100) 🟢 Qwen/Qwen1.5-7B 141.8 (±126.5) 0.445 0.504 0.046
282 0.3313 (±0.0115/√100) 🟢 google/gemma-2b-it 85.9 (±24.7) 0.393 0.577 0.024
283 0.3293 (±0.0252/√100) 💬 Qwen/Qwen1.5-7B-Chat 195.7 (±113.1) 0.429 0.503 0.056
284 0.3276 (±0.0709/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-fast... 134.0 (±98.8) 0.395 0.543 0.045
285 0.3272 (±0.0101/√100) 💬 01-ai/Yi-1.5-6B-Chat 194.4 (±75.0) 0.426 0.530 0.025
286 0.3187 (±0.0142/√100) 🟢 Qwen/Qwen2-1.5B-Instruct 131.4 (±46.7) 0.421 0.513 0.022
287 0.3172 (±0.0150/√100) 🟢 Qwen/Qwen2-1.5B 120.9 (±30.7) 0.422 0.511 0.019
288 0.3161 (±0.0119/√100) 🟢 deepseek-ai/deepseek-llm-7b-base 113.7 (±21.6) 0.424 0.501 0.024
289 0.3147 (±0.0175/√100) 💬 Qwen/Qwen2-1.5B-Instruct 180.7 (±101.0) 0.408 0.511 0.025
290 0.3078 (±0.0195/√100) 🟢 cyberagent/open-calm-medium 117.3 (±59.4) 0.363 0.537 0.024
291 0.3058 (±0.1106/√100) 💬 rinna/nekomata-7b-instruction 61.2 (±57.0) 0.307 0.567 0.043
292 0.3053 (±0.0177/√100) 🟢 google/gemma-2b 151.5 (±113.6) 0.410 0.480 0.026
293 0.3050 (±0.0190/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B 146.4 (±90.3) 0.412 0.468 0.035
294 0.2993 (±0.0095/√100) 🟢 01-ai/Yi-1.5-6B-Chat 133.3 (±46.2) 0.394 0.481 0.022
295 0.2993 (±0.0107/√100) 🟢 tiiuae/falcon-11B 121.6 (±31.5) 0.398 0.483 0.016
296 0.2957 (±0.0641/√100) 💬 meta-llama/Llama-2-13b-chat-hf 305.2 (±299.7) 0.402 0.453 0.032
297 0.2953 (±0.0442/√100) 🟢 augmxnt/shisa-base-7b-v1 200.4 (±160.3) 0.378 0.478 0.030
298 0.2924 (±0.0506/√100) 💬 Qwen/Qwen1.5-MoE-A2.7B-Chat 245.1 (±209.1) 0.381 0.453 0.043
299 0.2914 (±0.0133/√100) 🟢 mistralai/Mistral-7B-v0.1 117.4 (±40.4) 0.402 0.454 0.018
300 0.2907 (±0.0175/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B-Chat 149.8 (±91.0) 0.388 0.448 0.036
301 0.2853 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B-Chat 127.8 (±71.2) 0.395 0.441 0.019
302 0.2809 (±0.0133/√100) 🟢 Qwen/Qwen1.5-1.8B-Chat 178.3 (±92.0) 0.381 0.445 0.017
303 0.2770 (±0.0131/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.2 146.2 (±70.1) 0.387 0.419 0.024
304 0.2769 (±0.0324/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 16.9 (±24.6) 0.125 0.693 0.013
305 0.2769 (±0.1029/√100) 💬 stabilityai/japanese-stablelm-instruc... 117.0 (±115.0) 0.307 0.489 0.035
306 0.2666 (±0.0241/√100) 🟢 deepseek-ai/deepseek-llm-67b-chat 140.2 (±83.0) 0.351 0.440 0.009
307 0.2661 (±0.0128/√100) 🟢 Qwen/Qwen1.5-1.8B 129.7 (±65.7) 0.360 0.424 0.014
308 0.2613 (±0.0136/√100) 🟢 Qwen/Qwen2-0.5B-Instruct 176.8 (±98.9) 0.351 0.426 0.007
309 0.2604 (±0.0148/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.1 139.8 (±101.3) 0.367 0.400 0.014
310 0.2598 (±0.0129/√100) 🟢 Qwen/Qwen2-0.5B 122.7 (±43.5) 0.350 0.420 0.009
311 0.2581 (±0.0196/√100) 🟢 cyberagent/open-calm-small 119.1 (±54.1) 0.310 0.460 0.004
312 0.2555 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B 149.2 (±76.6) 0.363 0.388 0.015
313 0.2543 (±0.0266/√100) 🟢 mosaicml/mpt-30b-chat 121.3 (±46.4) 0.327 0.428 0.008
314 0.2414 (±0.0281/√100) 💬 Qwen/Qwen1.5-1.8B-Chat 480.0 (±210.3) 0.329 0.392 0.003
315 0.2394 (±0.0745/√100) 💬 Qwen/Qwen1.5-4B-Chat 105.3 (±104.1) 0.307 0.390 0.021
316 0.2317 (±0.0455/√100) 💬 mistralai/Mistral-7B-Instruct-v0.1 202.3 (±153.9) 0.320 0.362 0.012
317 0.2231 (±0.0166/√100) 💬 mistralai/Mistral-7B-Instruct-v0.2 261.2 (±166.3) 0.316 0.334 0.019
318 0.2182 (±0.0152/√100) 🟢 microsoft/phi-1 47.6 (±34.3) 0.234 0.420 0.000
319 0.2177 (±0.0110/√100) 🟢 Qwen/Qwen1.5-0.5B-Chat 143.4 (±52.1) 0.317 0.327 0.009
320 0.2169 (±0.0561/√100) 💬 Qwen/Qwen2-0.5B-Instruct 129.5 (±114.3) 0.265 0.379 0.006
321 0.2169 (±0.0218/√100) 🟢 mosaicml/mpt-30b-instruct 109.8 (±36.1) 0.274 0.370 0.008
322 0.2146 (±0.0151/√100) 🟢 microsoft/phi-2 78.0 (±31.4) 0.287 0.356 0.001
323 0.2061 (±0.0820/√100) 💬 meta-llama/Llama-2-70b-chat-hf 523.3 (±444.5) 0.271 0.303 0.045
324 0.2040 (±0.0152/√100) 🟢 Qwen/Qwen1.5-0.5B 138.6 (±55.9) 0.296 0.314 0.003
325 0.2038 (±0.0538/√100) 🟢 mosaicml/mpt-30b 236.5 (±433.3) 0.271 0.334 0.007
326 0.1885 (±0.0194/√100) 🟢 microsoft/phi-1_5 77.5 (±33.6) 0.258 0.306 0.001
327 0.1833 (±0.0406/√100) 💬 google/gemma-1.1-2b-it 32.6 (±26.7) 0.171 0.376 0.003
328 0.1765 (±0.0439/√100) 💬 Qwen/Qwen1.5-0.5B-Chat 214.3 (±172.6) 0.251 0.276 0.002
329 0.1687 (±0.0172/√100) 🟢 upstage/SOLAR-10.7B-v1.0 171.0 (±87.1) 0.265 0.237 0.004
330 0.1544 (±0.0132/√100) 🟢 01-ai/Yi-1.5-34B-Chat 730.0 (±533.6) 0.201 0.256 0.006
331 0.1475 (±0.0826/√100) 💬 mosaicml/mpt-30b-chat 112.2 (±112.4) 0.182 0.254 0.007
332 0.1241 (±0.0558/√100) 💬 google/gemma-2b-it 24.1 (±24.6) 0.115 0.257 0.000
333 0.1226 (±0.0240/√100) 🟢 Deci/DeciLM-7B 174.0 (±165.5) 0.190 0.174 0.003
334 0.1160 (±0.0081/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 212.1 (±148.9) 0.153 0.195 0.000
335 0.1009 (±0.0846/√100) 💬 meta-llama/Llama-2-7b-chat-hf 241.5 (±336.2) 0.136 0.158 0.009
336 0.1004 (±0.0094/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 123.1 (±128.8) 0.119 0.182 0.000
337 0.0987 (±0.0145/√100) 🟢 deepseek-ai/deepseek-llm-67b-base 154.2 (±77.3) 0.174 0.121 0.000
338 0.0982 (±0.1596/√100) 💬 rinna/nekomata-14b-instruction 16.0 (±38.1) 0.115 0.141 0.039
339 0.0955 (±0.0102/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 129.5 (±141.0) 0.116 0.170 0.000
340 0.0939 (±0.0064/√100) 🟢 sbintuitions/tiny-lm-chat 250.2 (±275.6) 0.133 0.149 0.000
341 0.0936 (±0.0082/√100) 💬 sbintuitions/tiny-lm-chat 276.7 (±209.6) 0.135 0.145 0.000
342 0.0921 (±0.0058/√100) 🟢 sbintuitions/tiny-lm 471.9 (±199.0) 0.135 0.142 0.000
343 0.0880 (±0.0334/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 134.0 (±144.7) 0.105 0.159 0.000
344 0.0762 (±0.0033/√100) 🟢 line-corporation/japanese-large-lm-3.6b 1066.6 (±31.6) 0.125 0.103 0.000
345 0.0760 (±0.0032/√100) 🟢 line-corporation/japanese-large-lm-3.... 1066.4 (±31.8) 0.125 0.103 0.000
346 0.0758 (±0.0034/√100) 💬 line-corporation/japanese-large-lm-3.... 1067.2 (±31.8) 0.125 0.102 0.000
347 0.0673 (±0.0085/√100) 🟢 moneyforward/houou-instruction-7b-v3 143.2 (±112.2) 0.098 0.104 0.000
348 0.0625 (±0.0169/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 31.6 (±10.3) 0.088 0.099 0.000
349 0.0429 (±0.0440/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 31.7 (±54.7) 0.045 0.084 0.000
350 0.0406 (±0.0028/√100) 🟢 microsoft/Phi-3-small-128k-instruct 268.1 (±123.4) 0.083 0.039 0.000
351 0.0337 (±0.0026/√100) 🟢 augmxnt/shisa-7b-v1 590.7 (±238.2) 0.076 0.025 0.000
352 0.0284 (±0.0012/√100) 🟢 lightblue/karasu-7B-chat-plus 285.1 (±53.8) 0.080 0.005 0.000
353 0.0225 (±0.0702/√100) 💬 SakanaAI/EvoLLM-JP-A-v1-7B 5.9 (±27.6) 0.026 0.037 0.005
354 0.0180 (±0.0039/√100) 🟢 mistralai/Mistral-Nemo-Base-2407 607.5 (±344.5) 0.039 0.015 0.000
355 0.0047 (±0.0024/√100) 🟢 ai-forever/mGPT-13B 321.1 (±266.7) 0.008 0.006 0.000
356 0.0022 (±0.0006/√100) 🟢 lightblue/qarasu-14B-chat-plus-unleashed 937.5 (±557.0) 0.004 0.002 0.000
357 0.0019 (±0.0002/√100) 🟢 01-ai/Yi-1.5-9B-Chat 1440.0 (±51.9) 0.005 0.001 0.000
358 0.0018 (±0.0004/√100) 🟢 CohereForAI/aya-23-8B 1676.6 (±351.0) 0.004 0.002 0.000
359 0.0006 (±0.0002/√100) 🟢 meta-llama/Llama-2-13b-chat-hf 1523.9 (±43.5) 0.001 0.001 0.000
360 0.0000 (±0.0000/√100) 🟢 01-ai/Yi-1.5-6B 0.0 (±0.0) 0.000 0.000 0.000
361 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-1.1B 0.0 (±0.0) 0.000 0.000 0.000
362 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat-plus-unleashed 0.0 (±0.0) 0.000 0.000 0.000
363 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat 0.0 (±0.0) 0.000 0.000 0.000
364 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-japanese 300.0 (±0.0) 0.000 0.000 0.000
365 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-multilingual 300.0 (±0.0) 0.000 0.000 0.000

Citation

If you use this repository, please cite the following paper:

@preprint{Imos2024-pre-pfgen,
  title={{pfgen-bench: 日本語事前学習モデルのための文章生成性能評価ベンチマーク}},
  author={今城, 健太郎 and 平野, 正徳 and 鈴木, 脩司 and 三上, 裕明},
  doi={10.51094/jxiv.1008},
  year={2024}
}

Or cite this repository directly:

@misc{imajo2024-pfgen,
  title={{Preferred Generation Benchmark}},
  author={Kentaro Imajo and Masanori Hirano and Shuji Suzuki and Hiroaki Mikami},
  year={2024},
  url={https://github.com/pfnet-research/pfgen-bench}
}
