<div align="center"> <div> <a href="https://github.com/Q-Future/"><img src="https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fvqassessment%2FA-Bench&count_bg=%23E97EBA&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=visitors&edge_flat=false"/></a> <a href="https://github.com/Q-Future/A-Bench"><img src="https://img.shields.io/github/stars/Q-Future/A-Bench"/></a> <a href="https://arxiv.org/pdf/2406.03070"><img src="https://img.shields.io/badge/Arxiv-2406.03070-blue"/></a> <a href="https://huggingface.co/datasets/q-future/A-Bench"><img src="https://img.shields.io/badge/Data-Release-green"></a> </div> <div style="width: 100%; text-align: center; margin:auto;"> <img style="width:100%" src="a-bench.png"> </div> <div style="width: 100%; text-align: center; margin:auto;"> <img style="width:100%" src="teaser.jpg"> </div> <h1>A-Bench: Are LMMs Masters at Evaluating AI-generated Images?</h1> _What do we expect from LMMs as AIGI evaluators and how do they perform?_ <div> <a href="https://zzc-1998.github.io/" target="_blank">Zicheng Zhang</a><sup>1</sup><sup>*</sup>, <a href="https://teowu.github.io/" target="_blank">Haoning Wu</a><sup>2</sup><sup>*</sup>, <a href="https://github.com/lcysyzxdxc" target="_blank">Chunyi Li</a><sup>1</sup><sup>*</sup>, <a href="https://scholar.google.com/citations?hl=zh-CN&user=85yWgIcAAAAJ" target="_blank">Yingjie Zhou</a><sup>1</sup>, <a href="https://scholar.google.com/citations?hl=zh-CN&user=nDlEBJ8AAAAJ" target="_blank">Wei Sun</a><sup>1</sup>, </div> <div> <a href="https://minxiongkuo.github.io/" target="_blank">Xiongkuo Min</a><sup>1</sup>, <a href="https://scholar.google.com/citations?hl=zh-CN&user=NSR4UkMAAAAJ" target="_blank">Zijian Chen</a><sup>1</sup>, <a href="https://scholar.google.ca/citations?user=Tq2hoMQAAAAJ&hl=en" target="_blank">Xiaohong Liu</a><sup>1</sup>, <a href="https://personal.ntu.edu.sg/wslin/Home.html" target="_blank">Weisi Lin</a><sup>2</sup>, <a href="https://ee.sjtu.edu.cn/en/FacultyDetail.aspx?id=24&infoid=153&flag=153" target="_blank">Guangtao Zhai</a><sup>1</sup><sup>#</sup> </div> <div> <sup>1</sup>Shanghai Jiaotong University, <sup>2</sup>Nanyang Technological University </div> <div> <sup>*</sup>Equal contribution. <sup>#</sup>Corresponding author. </div> <a href="https://github.com/Q-Future/A-Bench/blob/main/A_Bench__Are_LMMs_Masters_at_Evaluating_AI_generated_Images_.pdf"><strong>Paper</strong></a> | <a href="https://a-bench-sjtu.github.io/"><strong>Project Page</strong></a> | <a href="https://github.com/Q-Future/A-Bench"><strong>Github</strong></a> | <a href="https://huggingface.co/datasets/q-future/A-Bench"><strong>Data</strong></a> <div style="width: 100%; text-align: center; margin:auto;"> <img style="width:100%" src="spotlight.png"> </div> <div align="left"> T2I models aim to create images that accurately align with the text and showcase high perceptual quality. Therefore, the proposed A-Bench includes two parts to diagnose whether LMMs are masters at evaluating AIGIs: **1) Semantic Understanding, 2) Quality Perception**. ## Release - [2025/1] 🔥**A-Bench** is accepted by ICLR 2025, and more recent LMMs' performance is added. - [2024/9/26]🔥 Update the performance of GPT \& Gemini with the latest version on **A-Bench**. - [2024/8/1]🔥 The **A-Bench** is released on [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), come and test your LMM with one command. 
- [2024/6/17] 🔥 **A-Bench** has joined [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), which makes it easier to test LMMs!
- [2024/6/5] 🔥 We are releasing the **A-Bench** data and meta information at [Huggingface](https://huggingface.co/datasets/q-future/A-Bench).
- [2024/6/3] 🔥 The [Github repo](https://github.com/Q-Future/A-Bench) for **A-Bench** is online. Do you want to find out whether your LMM is a master at evaluating AI-generated images? Come and test it on **A-Bench**!

## A-Bench Construction

Two key diagnostic subsets are defined: **A-Bench-P1** → high-level semantic understanding, and **A-Bench-P2** → low-level quality perception.

For high-level semantic understanding, **A-Bench-P1** targets three critical areas: *Basic Recognition*, *Bag-of-Words Pitfalls Discrimination*, and *Outside Knowledge Realization*, which progressively test an LMM's capability in AIGI semantic understanding, moving from simple to complex prompt-related content.

For low-level quality perception, **A-Bench-P2** concentrates on *Technical Quality Perception*, *Aesthetic Quality Evaluation*, and *Generative Distortion Assessment*, which cover both common quality issues and AIGI-specific quality problems.

Specifically, a comprehensive dataset of 2,864 AIGIs sourced from various T2I models is compiled, including 1,408 AIGIs for **A-Bench-P1** and 1,456 for **A-Bench-P2**. Each AIGI is paired with a question-answer set annotated by human experts.

We are open to **submission-based evaluation** for **A-Bench**. The details for submission are given in the **Evaluate your model on A-Bench** section.

<div style="width: 100%; text-align: center; margin:auto;">
<img style="width:100%" src="examples.png">
</div>

## Glance at A-Bench Performance

Among *open-source* models, **LLaVA-NeXT (Qwen-110B)** takes first place. Among *closed-source* models, **Gemini 1.5 Pro** takes first place.

<div align="center">

<div style="width: 100%; text-align: center; margin:auto;">
<img style="width:80%" src="overall.png">
</div>

**A Quick Look at the A-Bench Outcomes.**

| **Participant Name** | Major↑ | Minor↑ | Attr.↑ | N. Adj.↑ | Comp.↑ | Number↑ | Term↑ | Contra.↑ | P1 Overall↑ | Technical↑ | Aesthetic↑ | Generative↑ | P2 Overall↑ |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 1.5 Pro | 93.82% | 95.18% | 94.35% | 80.27% | 72.14% | 79.35% | 72.88% | 61.56% | 84.70% | 71.22% | 77.61% | 59.07% | 69.12% |
| GPT-4V | 92.95% | 96.00% | 87.40% | 82.67% | 64.39% | 68.84% | 77.60% | 66.73% | 83.60% | 67.82% | 68.34% | 58.02% | 64.31% |
| GPT-4o | 94.34% | 95.14% | 91.99% | 79.54% | 76.40% | 73.30% | 77.47% | 68.59% | 85.44% | 70.59% | 61.61% | 67.92% | 66.88% |
| Qwen-VL-Max | 92.56% | 94.75% | 91.99% | 85.78% | 68.94% | 75.85% | 78.94% | 65.05% | 84.47% | 71.31% | 69.77% | 58.56% | 66.21% |
| Human (Worst) | 95.18% | 94.24% | 96.78% | 88.70% | 85.49% | 82.46% | 81.76% | 88.91% | 92.40% | 94.32% | 84.49% | 86.25% | 90.56% |
| Human (Best) | 95.40% | 95.21% | 99.42% | 95.17% | 93.34% | 91.73% | 84.29% | 96.05% | 94.02% | 94.69% | 86.01% | 93.00% | 92.22% |

<div align="left">

We release the performance of top-tier *closed-source* LMMs against humans.
Two conclusions can be drawn:

1) **LMMs excel at basic recognition tasks but tend to be less effective at nuanced semantic understanding.**

2) **LMMs are poor quality evaluators.**

## Evaluate your model on A-Bench

### With LMMs-Eval

Use [LMMs-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to automatically evaluate on A-Bench:

```shell
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
pip install -e .
export NUM_GPUS=8
export MODEL_NAME=idefics2
python3 -m accelerate.commands.launch --num_processes=$NUM_GPUS -m lmms_eval --model $MODEL_NAME --tasks abench_dev --batch_size 1 --log_samples --log_samples_suffix ${MODEL_NAME}_a_bench --output_path ./logs/
```

### With VLMEvalKit

Use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to automatically evaluate on A-Bench:

```shell
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```

For example, to quickly test InternVL2-1B on the val and test sets of A-Bench:

```shell
python run.py --data A-Bench_VAL A-Bench_TEST --model InternVL2-1B --verbose
```

The val set includes the correct answers, so you can obtain accuracy results directly. For test set performance, please submit your results via [e-mail](mailto:zzc1998@sjtu.edu.cn).

### With the `datasets` API

To evaluate your custom model, you can use our [converted dataset](https://huggingface.co/datasets/q-future/A-Bench-HF) in Huggingface `datasets` format:

```shell
pip install datasets
```

```python
from datasets import load_dataset

ds = load_dataset("q-future/A-Bench-HF")
ds["dev"][0]
```

The output should be as follows:

```
{'id': 0, 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x288>, 'question': 'May I ask where the scene in the picture is located?', 'option0': 'Forest', 'option1': 'Riverside', 'option2': 'Desert', 'option3': 'N/A', 'category': 'part1 -> bag_of_words -> attribute', 'correct_choice': 'B'}
```

Each item can then be converted into your own model's MCQ format. For example, if your model follows LLaVA's format, the prompt can be built as follows:

```python
di = ds["dev"][0]
prompt = di["question"] + "\n"
for i in range(4):
    # Skip placeholder options marked as "N/A".
    if di[f"option{i}"] != "N/A":
        prompt += chr(ord("A") + i) + ". " + di[f"option{i}"] + "\n"
prompt = prompt + "Answer with the option's letter from the given choices directly."
print(prompt)
```

The prompt for the previous data item should be

```
May I ask where the scene in the picture is located?
A. Forest
B. Riverside
C. Desert
Answer with the option's letter from the given choices directly.
```

A minimal sketch for scoring the `dev` split locally is given below, after the **Legacy** subsection.

### Legacy

First, download the dataset and meta information from [Huggingface](https://huggingface.co/datasets/q-future/A-Bench). The *imgs.zip* archive contains all the AI-generated images, and *Abench.json* contains all the meta information, including the img_path, questions, answers, and categories. Each item of *Abench.json* is structured like:

```json
{
    "img_path": "part1_0000.png",
    "question": "What is the color of the windows in the house in the picture?",
    "answers": [
        "white",
        "yellow",
        "blue"
    ],
    "category": "part1 -> basic_recognition -> major"
}
```

The "img_path" indicates the path to the image in *imgs.zip*, the "question" is a string, and the "answers" field is a list of answer candidates (several false answers plus the correct answer). The correct answers are kept confidential to ensure A-Bench retains its long-term value as a benchmark for assessing AIGI evaluation capabilities.
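Because the `dev` split of the converted [A-Bench-HF](https://huggingface.co/datasets/q-future/A-Bench-HF) dataset above exposes a `correct_choice` field, you can sanity-check your pipeline locally before submitting test-set results. The following is a minimal sketch, not the official evaluation pipeline (use LMMs-eval or VLMEvalKit for reported numbers); `my_model` and `build_prompt` are hypothetical helpers, and the per-category accuracy breakdown is our own illustration:

```python
from collections import defaultdict

from datasets import load_dataset

ds = load_dataset("q-future/A-Bench-HF")


def build_prompt(item):
    # Same MCQ format as the LLaVA-style example above.
    prompt = item["question"] + "\n"
    for i in range(4):
        if item[f"option{i}"] != "N/A":
            prompt += chr(ord("A") + i) + ". " + item[f"option{i}"] + "\n"
    return prompt + "Answer with the option's letter from the given choices directly."


def my_model(image, prompt):
    # Hypothetical placeholder: replace with your own LMM's inference call.
    return "A"


correct, total = defaultdict(int), defaultdict(int)
for item in ds["dev"]:
    pred = my_model(item["image"], build_prompt(item)).strip()
    # Compare only the leading option letter against the ground truth.
    hit = pred[:1].upper() == item["correct_choice"]
    category = item["category"]  # e.g. 'part1 -> bag_of_words -> attribute'
    correct[category] += hit
    total[category] += 1

for category in sorted(total):
    print(f"{category}: {correct[category] / total[category]:.2%}")
print(f"overall: {sum(correct.values()) / sum(total.values()):.2%}")
```

Matching only the leading letter keeps the check robust to models that answer with the full option text (e.g., "B. Riverside") instead of just "B".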
### Test without API

To test with your LMM, we suggest using the following prompt:

```python
import json

with open("Abench.json", "r") as f:
    data = json.load(f)

for item in data:
    image_file = 'path-to-imgs' + item["img_path"]
    # Build the MCQ prompt from the question and the answer candidates.
    message = item["question"] + "\n"
    for choice, ans in zip(["A.", "B.", "C.", "D."], item["answers"]):
        message += f"{choice} {ans}\n"
    message = message + "Answer with the option's letter from the given choices directly."
    print(message)
    # What is the color of the windows in the house in the picture?
    # A. white
    # B. yellow
    # C. blue
    # Answer with the option's letter from the given choices directly.

    # Do your test here:
    response = LMM(image_file, message)  # replace with your own model's inference call
    item['response'] = response
    with open("results.jsonl", "a") as wf:
        json.dump(item, wf)
        wf.write("\n")
```

After finishing inference, you can submit the results via [e-mail](mailto:zzc1998@sjtu.edu.cn) to get your LMM's results on A-Bench!

## Contact

Please contact any of the first authors of this paper for queries.

- Zicheng Zhang, `zzc1998@sjtu.edu.cn`, @zzc-1998
- Haoning Wu, `haoning001@e.ntu.edu.sg`, @teowu

## Citation

If you find our work interesting, please feel free to cite our paper:

```bibtex
@misc{zhang2024abench,
      title={A-Bench: Are LMMs Masters at Evaluating AI-generated Images?},
      author={Zicheng Zhang and Haoning Wu and Chunyi Li and Yingjie Zhou and Wei Sun and Xiongkuo Min and Zijian Chen and Xiaohong Liu and Weisi Lin and Guangtao Zhai},
      year={2024},
      eprint={2406.03070},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```