[ Japanese | English ]
This tool automatically evaluates Japanese multi-modal large language models across multiple datasets. It offers the following features:
- Uses existing Japanese evaluation data and converts it into multi-modal text generation tasks for evaluation.
- Calculates task-specific evaluation metrics using inference results created by users.
For details on the data format and the list of supported data, please check DATASET.md.
You can also use this tool via PyPI.

- Use the `pip` command to install `eval_mm` into the virtual environment where you want to run it:

  ```bash
  pip install eval_mm
  ```
- This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Please create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` if you’re using Azure, or `OPENAI_API_KEY` if you’re using the OpenAI API, as in the example below.
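For reference, such a `.env` file could look like the following sketch; the placeholder values are illustrative, and only the variable names come from this README:

```
# Either the Azure pair...
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_KEY=<your-azure-openai-key>
# ...or the OpenAI API key:
OPENAI_API_KEY=<your-openai-api-key>
```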
That’s it for environment setup.
If you prefer to clone the repository and use it, please follow the instructions below.
`eval-mm` uses `uv` to manage virtual environments.
- Clone the repository and move into it:

  ```bash
  git clone git@github.com:llm-jp/llm-jp-eval-mm.git
  cd llm-jp-eval-mm
  ```
- Build the environment with `uv`. Please install `uv` by referring to the official doc.

  ```bash
  cd llm-jp-eval-mm
  uv sync
  ```
- Following the sample `.env.sample`, create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`, or `OPENAI_API_KEY`.
That’s all you need for the setup.
(Currently, the llm-jp-eval-mm repository is private. You can download the examples
directory from the Source Distribution at https://pypi.org/project/eval-mm/#files.)
We provide sample code, `examples/sample.py`, for running an evaluation.

Models listed as `examples/{model_name}.py` are supported only in terms of their inference method.

If you want to run an evaluation with a new inference method or a new model, create a similar file by referencing an existing `examples/{model_name}.py`, and you can run the evaluation in the same way.

For example, to evaluate the `llava-hf/llava-1.5-7b-hf` model on the japanese-heron-bench task, run the following command:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench \
  --result_dir test \
  --metrics "llm_as_a_judge_heron_bench" \
  --judge_model "gpt-4o-2024-05-13" \
  --overwrite
```
The evaluation score and output results will be saved in `test/{task_id}/evaluation/{model_id}.jsonl` and `test/{task_id}/prediction/{model_id}.jsonl`.
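The result files use the JSONL extension, so each line should hold one JSON record. A minimal sketch for inspecting the predictions is shown below; the concrete path (including how slashes in the model id map to the file name) and the record fields are assumptions, not a documented schema:

```python
import json
from pathlib import Path

# Path pattern from above: test/{task_id}/prediction/{model_id}.jsonl.
# The exact file name used for a model id containing "/" is an assumption here.
pred_path = Path("test/japanese-heron-bench/prediction/llava-1.5-7b-hf.jsonl")

with pred_path.open(encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # one JSON object per line
        print(record.keys())       # inspect which fields the tool wrote
```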
If you want to evaluate multiple models on multiple tasks, please check `eval_all.sh`.
The leaderboard is available here.
Right now, the following benchmark tasks are supported (see DATASET.md for details on each dataset):

Japanese tasks:

English tasks:
Different models require different libraries. In this repository, we use uv’s dependency groups to manage the libraries needed for each model.

When using the following models, please specify the `normal` group:

stabilityai/japanese-instructblip-alpha, stabilityai/japanese-stable-vlm, cyberagent/llava-calm2-siglip, llava-hf/llava-1.5-7b-hf, llava-hf/llava-v1.6-mistral-7b-hf, neulab/Pangea-7B-hf, meta-llama/Llama-3.2-11B-Vision-Instruct, meta-llama/Llama-3.2-90B-Vision-Instruct, OpenGVLab/InternVL2-8B, Qwen/Qwen2-VL-7B-Instruct, OpenGVLab/InternVL2-26B, Qwen/Qwen2-VL-72B-Instruct, gpt-4o-2024-05-13

```bash
uv sync --group normal
```
When using the following model, please specify the `evovlm` group:

SakanaAI/Llama-3-EvoVLM-JP-v2

```bash
uv sync --group evovlm
```
When using the following models, please specify the `vilaja` group:

llm-jp/llm-jp-3-vila-14b, Efficient-Large-Model/VILA1.5-13b

```bash
uv sync --group vilaja
```
When using the following model, please specify the `pixtral` group:

mistralai/Pixtral-12B-2409

```bash
uv sync --group pixtral
```
When running the script, make sure to specify the group:

```bash
uv run --group normal python ...
```

If you add a new group, don’t forget to configure `conflict`.
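For reference, uv’s conflict declaration lives in `pyproject.toml`; a sketch using the group names from this README might look like the following (check the project’s existing `pyproject.toml` for the authoritative configuration):

```toml
# Sketch only: declares that the "normal" and "evovlm" dependency groups
# cannot be installed into the same environment.
[tool.uv]
conflicts = [
    [
        { group = "normal" },
        { group = "evovlm" },
    ],
]
```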
- JDocQA: to construct the JDocQA dataset, you need the pdf2image library. Since pdf2image depends on poppler-utils, please install it with the command below (a minimal pdf2image usage example follows the list):

  ```bash
  sudo apt-get install poppler-utils
  ```
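For context, pdf2image converts PDF pages into images via poppler. A minimal standalone usage looks like this; the file name is a placeholder and the snippet is not taken from this repository:

```python
# Requires poppler-utils to be installed on the system (see above).
from pdf2image import convert_from_path

pages = convert_from_path("document.pdf", dpi=200)  # one PIL.Image per page
pages[0].save("page_1.png")
```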
This repository is licensed under the Apache-2.0 License. For the licenses of each evaluation dataset, please see DATASET.md.
- If you find any issues or have suggestions, please report them on the Issue tracker.
- If you add new benchmark tasks, metrics, or VLM model inference code, or if you fix bugs, please send us a Pull Request.
Tasks are defined in the `Task` class.

Please refer to the code in src/eval_mm/tasks and implement your `Task` class. You’ll need methods to convert the dataset into a format suitable as input to the VLM model, and methods to calculate the score.
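As an illustration only: the real base-class interface lives in src/eval_mm/tasks and may differ, and every method name and the dataset id below are assumptions rather than the repository’s API. A new task could be sketched roughly like this, with the score calculation handled by a scorer as described in the next section:

```python
# A minimal sketch of a new task. Copy the interface from an existing task in
# src/eval_mm/tasks instead of from this snippet.
from datasets import load_dataset


class MyNewTask:
    """Converts a dataset into inputs for a VLM and exposes the references."""

    def __init__(self) -> None:
        # Hypothetical dataset id; replace with the dataset your task evaluates.
        self.dataset = load_dataset("my-org/my-japanese-vqa", split="test")

    def doc_to_text(self, doc: dict) -> str:
        # Text prompt handed to the VLM for one example.
        return doc["question"]

    def doc_to_visual(self, doc: dict) -> list:
        # Image(s) associated with one example.
        return [doc["image"]]

    def doc_to_reference(self, doc: dict) -> str:
        # Gold answer that a scorer compares against the generated output.
        return doc["answer"]
```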
Metrics are defined in the `Scorer` class.

Please refer to the code in src/eval_mm/metrics and implement your `Scorer` class. You’ll need to implement a `score()` method for sample-level scoring that compares references and generated outputs, and an `aggregate()` method for population-level metric calculation.
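For illustration, an exact-match metric could be sketched as below. Only the method names `score()` and `aggregate()` come from this README; their signatures here are assumptions, so mirror an existing scorer in src/eval_mm/metrics when implementing a real one:

```python
# A minimal exact-match scorer sketch; signatures are assumptions.
class ExactMatchScorer:
    @staticmethod
    def score(refs: list[str], preds: list[str]) -> list[int]:
        # Sample-level scores: 1 if the generated text equals the reference.
        return [int(r.strip() == p.strip()) for r, p in zip(refs, preds)]

    @staticmethod
    def aggregate(scores: list[int]) -> float:
        # Population-level metric: mean of the sample-level scores.
        return sum(scores) / len(scores) if scores else 0.0
```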
Inference code for VLM models is defined in the `VLM` class.

Please refer to examples/base_vlm and implement your `VLM` class. You’ll need a `generate()` method that produces output text from images and prompts.
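A rough sketch of such a wrapper is shown below using Hugging Face Transformers; the base-class interface in examples/base_vlm may differ, and the model id, prompt format, and generation settings are assumptions for illustration:

```python
# Sketch of a VLM wrapper exposing generate(); details are assumptions.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


class MyVLM:
    def __init__(self, model_id: str = "llava-hf/llava-1.5-7b-hf") -> None:
        self.processor = AutoProcessor.from_pretrained(model_id)
        self.model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

    def generate(self, images: list[Image.Image], prompt: str) -> str:
        # LLaVA-style models expect an "<image>" placeholder inside the prompt.
        inputs = self.processor(images=images, text=prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=256)
        return self.processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```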
To add a new dependency (optionally to a specific dependency group):

```bash
uv add <package_name>
uv add --group <group_name> <package_name>
```
To format and lint the code:

```bash
uv run ruff format src
uv run ruff check --fix src
```
To publish a new release, push a version tag:

```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```

Or you can manually create a new release on GitHub.
For the project website, please refer to github_pages/README.md.
- Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
- lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.
We also thank the developers of the evaluation datasets for their hard work.