[ Japanese | English ]
This tool automatically evaluates Japanese multi-modal large language models across multiple datasets. It offers the following features:
- Uses existing Japanese evaluation data and converts it into multi-modal text generation tasks for evaluation.
- Calculates task-specific evaluation metrics using inference results created by users.
For details on the data format and the list of supported data, please check DATASET.md.
You can also use this tool via PyPI.

- Use the `pip` command to install `eval_mm` into the virtual environment where you want to run it:

  ```bash
  pip install eval_mm
  ```
- This tool uses the LLM-as-a-judge method for evaluation, which sends requests to GPT-4o via the OpenAI API. Please create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` if you’re using Azure, or `OPENAI_API_KEY` if you’re using the OpenAI API, as in the example below.
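For reference, such a `.env` file could look like the following sketch; the placeholder values are illustrative, and only the variable names come from this README:

```
# Either the Azure pair...
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_KEY=<your-azure-openai-key>
# ...or the OpenAI API key:
OPENAI_API_KEY=<your-openai-api-key>
```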
That’s it for environment setup.
If you prefer to clone the repository and use it, please follow the instructions below.
`eval-mm` uses `uv` to manage virtual environments.
- Clone the repository and move into it:

  ```bash
  git clone git@github.com:llm-jp/llm-jp-eval-mm.git
  cd llm-jp-eval-mm
  ```
- Build the environment with `uv`. Please install `uv` by referring to the official doc.

  ```bash
  cd llm-jp-eval-mm
  uv sync
  ```
- Following the sample `.env.sample`, create a `.env` file and set `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY`, or `OPENAI_API_KEY`.
That’s all you need for the setup.
(Currently, the llm-jp-eval-mm repository is private. You can download the examples
directory from the Source Distribution at https://pypi.org/project/eval-mm/#files.)
We provide sample code, `examples/sample.py`, for running an evaluation.

Models listed as `examples/{model_name}.py` are supported only in terms of their inference method.

If you want to run an evaluation with a new inference method or a new model, create a similar file by referencing an existing `examples/{model_name}.py`, and you can run the evaluation in the same way.

For example, to evaluate the `llava-hf/llava-1.5-7b-hf` model on the japanese-heron-bench task, run the following command:
```bash
uv sync --group normal
uv run --group normal python examples/sample.py \
  --model_id llava-hf/llava-1.5-7b-hf \
  --task_id japanese-heron-bench \
  --result_dir test \
  --metrics "llm_as_a_judge_heron_bench" \
  --judge_model "gpt-4o-2024-05-13" \
  --overwrite
```
The evaluation score and output results will be saved in `test/{task_id}/evaluation/{model_id}.jsonl` and `test/{task_id}/prediction/{model_id}.jsonl`.
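The result files use the JSONL extension, so each line should hold one JSON record. A minimal sketch for inspecting the predictions is shown below; the concrete path (including how slashes in the model id map to the file name) and the record fields are assumptions, not a documented schema:

```python
import json
from pathlib import Path

# Path pattern from above: test/{task_id}/prediction/{model_id}.jsonl.
# The exact file name used for a model id containing "/" is an assumption here.
pred_path = Path("test/japanese-heron-bench/prediction/llava-1.5-7b-hf.jsonl")

with pred_path.open(encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # one JSON object per line
        print(record.keys())       # inspect which fields the tool wrote
```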
If you want to evaluate multiple models on multiple tasks, please check `eval_all.sh`.
The leaderboard is available here.
Right now, the following benchmark tasks are supported (see DATASET.md for details on each dataset):

Japanese tasks:

English tasks:
Different models require different libraries. In this repository, we use uv’s dependency groups to manage the libraries needed for each model.

When using the following models, please specify the `normal` group:

stabilityai/japanese-instructblip-alpha, stabilityai/japanese-stable-vlm, cyberagent/llava-calm2-siglip, llava-hf/llava-1.5-7b-hf, llava-hf/llava-v1.6-mistral-7b-hf, neulab/Pangea-7B-hf, meta-llama/Llama-3.2-11B-Vision-Instruct, meta-llama/Llama-3.2-90B-Vision-Instruct, OpenGVLab/InternVL2-8B, Qwen/Qwen2-VL-7B-Instruct, OpenGVLab/InternVL2-26B, Qwen/Qwen2-VL-72B-Instruct, gpt-4o-2024-05-13

```bash
uv sync --group normal
```
When using the following model, please specify the `evovlm` group:

SakanaAI/Llama-3-EvoVLM-JP-v2

```bash
uv sync --group evovlm
```
When using the following models, please specify the `vilaja` group:

llm-jp/llm-jp-3-vila-14b, Efficient-Large-Model/VILA1.5-13b

```bash
uv sync --group vilaja
```
When using the following model, please specify the `pixtral` group:

mistralai/Pixtral-12B-2409

```bash
uv sync --group pixtral
```
When running the script, make sure to specify the group:

```bash
uv run --group normal python ...
```

If you add a new group, don’t forget to configure `conflict`.
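For reference, uv’s conflict declaration lives in `pyproject.toml`; a sketch using the group names from this README might look like the following (check the project’s existing `pyproject.toml` for the authoritative configuration):

```toml
# Sketch only: declares that the "normal" and "evovlm" dependency groups
# cannot be installed into the same environment.
[tool.uv]
conflicts = [
    [
        { group = "normal" },
        { group = "evovlm" },
    ],
]
```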
- JDocQA: to construct the JDocQA dataset, you need the pdf2image library. Since pdf2image depends on poppler-utils, please install it with the command below (a minimal pdf2image usage example follows the list):

  ```bash
  sudo apt-get install poppler-utils
  ```
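For context, pdf2image converts PDF pages into images via poppler. A minimal standalone usage looks like this; the file name is a placeholder and the snippet is not taken from this repository:

```python
# Requires poppler-utils to be installed on the system (see above).
from pdf2image import convert_from_path

pages = convert_from_path("document.pdf", dpi=200)  # one PIL.Image per page
pages[0].save("page_1.png")
```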
This repository is licensed under the Apache-2.0 License. For the licenses of each evaluation dataset, please see DATASET.md.
- If you find any issues or have suggestions, please report them on the Issue tracker.
- If you add new benchmark tasks, metrics, or VLM model inference code, or if you fix bugs, please send us a Pull Request.
Tasks are defined in the `Task` class.

Please refer to the code in src/eval_mm/tasks and implement your `Task` class. You’ll need methods to convert the dataset into a format suitable as input to the VLM model, and methods to calculate the score.
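As an illustration only: the real base-class interface lives in src/eval_mm/tasks and may differ, and every method name and the dataset id below are assumptions rather than the repository’s API. A new task could be sketched roughly like this, with the score calculation handled by a scorer as described in the next section:

```python
# A minimal sketch of a new task. Copy the interface from an existing task in
# src/eval_mm/tasks instead of from this snippet.
from datasets import load_dataset


class MyNewTask:
    """Converts a dataset into inputs for a VLM and exposes the references."""

    def __init__(self) -> None:
        # Hypothetical dataset id; replace with the dataset your task evaluates.
        self.dataset = load_dataset("my-org/my-japanese-vqa", split="test")

    def doc_to_text(self, doc: dict) -> str:
        # Text prompt handed to the VLM for one example.
        return doc["question"]

    def doc_to_visual(self, doc: dict) -> list:
        # Image(s) associated with one example.
        return [doc["image"]]

    def doc_to_reference(self, doc: dict) -> str:
        # Gold answer that a scorer compares against the generated output.
        return doc["answer"]
```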
Metrics are defined in the `Scorer` class.

Please refer to the code in src/eval_mm/metrics and implement your `Scorer` class. You’ll need to implement a `score()` method for sample-level scoring that compares references and generated outputs, and an `aggregate()` method for population-level metric calculation.
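For illustration, an exact-match metric could be sketched as below. Only the method names `score()` and `aggregate()` come from this README; their signatures here are assumptions, so mirror an existing scorer in src/eval_mm/metrics when implementing a real one:

```python
# A minimal exact-match scorer sketch; signatures are assumptions.
class ExactMatchScorer:
    @staticmethod
    def score(refs: list[str], preds: list[str]) -> list[int]:
        # Sample-level scores: 1 if the generated text equals the reference.
        return [int(r.strip() == p.strip()) for r, p in zip(refs, preds)]

    @staticmethod
    def aggregate(scores: list[int]) -> float:
        # Population-level metric: mean of the sample-level scores.
        return sum(scores) / len(scores) if scores else 0.0
```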
Inference code for VLM models is defined in the `VLM` class.

Please refer to examples/base_vlm and implement your `VLM` class. You’ll need a `generate()` method that produces output text from images and prompts.
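A rough sketch of such a wrapper is shown below using Hugging Face Transformers; the base-class interface in examples/base_vlm may differ, and the model id, prompt format, and generation settings are assumptions for illustration:

```python
# Sketch of a VLM wrapper exposing generate(); details are assumptions.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


class MyVLM:
    def __init__(self, model_id: str = "llava-hf/llava-1.5-7b-hf") -> None:
        self.processor = AutoProcessor.from_pretrained(model_id)
        self.model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

    def generate(self, images: list[Image.Image], prompt: str) -> str:
        # LLaVA-style models expect an "<image>" placeholder inside the prompt.
        inputs = self.processor(images=images, text=prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=256)
        return self.processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```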
To add a new dependency (optionally to a specific dependency group):

```bash
uv add <package_name>
uv add --group <group_name> <package_name>
```
To format and lint the code:

```bash
uv run ruff format src
uv run ruff check --fix src
```
To publish a new release, push a version tag:

```bash
git tag -a v0.x.x -m "version 0.x.x"
git push origin --tags
```

Or you can manually create a new release on GitHub.
For the project website, please refer to github_pages/README.md.
- Heron: We refer to the Heron code for the evaluation of the Japanese Heron Bench task.
- lmms-eval: We refer to the lmms-eval code for the evaluation of the JMMMU and MMMU tasks.
We also thank the developers of the evaluation datasets for their hard work.