
Commit c9b2252

Bump version to 0.2.0.dev0
1 parent 465bd42 commit c9b2252

File tree

3 files changed: +228 -213 lines changed


Diff for: README.md

+105-212
@@ -9,76 +9,23 @@
🏠 [LMMs-Lab Homepage](https://lmms-lab.github.io/) | 🎉 [Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/) | 📚 [Documentation](docs/README.md) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab) | <a href="https://emoji.gg/emoji/1684-discord-thread"><img src="https://cdn3.emoji.gg/emojis/1684-discord-thread.png" width="14px" height="14px" alt="Discord_Thread"></a> [discord/lmms-eval](https://discord.gg/zdkwKUqrPy)

Though many new evaluation datasets have been proposed recently, the efficient evaluation pipeline for LMMs is still in its infancy, and there is no unified evaluation framework that can be used to evaluate LMMs across a wide range of datasets. To address this challenge, we introduce **lmms-eval**, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.

We humbly absorbed the exquisite and efficient design of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). Building upon its foundation, we implemented our `lmms-eval` framework with performance optimizations specifically for LMMs.

## Necessity of lmms-eval

We believe our effort could provide an efficient interface for the detailed comparison of publicly available models to discern their strengths and weaknesses. It is also useful for research institutions and production-oriented companies to accelerate the development of large multimodal models. With `lmms-eval`, we have significantly accelerated the lifecycle of model iteration. Inside the LLaVA team, the use of `lmms-eval` has largely improved the efficiency of the model development cycle, as we are able to evaluate hundreds of checkpoints trained each week on 20-30 datasets, identify their strengths and weaknesses, and then make targeted improvements.

# Announcement

- [2024-06] `lmms-eval/v0.2` has been upgraded to support video evaluations and other feature updates. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details.
- [2024-03] We have released the first version of `lmms-eval`. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details.

## Contribution Guidance

We've added guidance on contributing new datasets and models. Please refer to our [documentation](docs/README.md). If you need assistance, you can contact us via [discord/lmms-eval](https://discord.gg/ebAMGSsS).

## v0.1.0 Released

The first version of `lmms-eval` has been released. We are working on providing a one-command evaluation suite to accelerate the development of LMMs.

> In [LLaVA Next](https://llava-vl.github.io/blog/2024-01-30-llava-next/) development, we internally utilize this suite to evaluate multiple model versions on various datasets. It significantly accelerates the model development cycle thanks to its easy integration and fast evaluation speed.

The main features include:

<p align="center" width="100%">
<img src="https://i.postimg.cc/sgzNmJx7/teaser.png" width="100%" height="80%">
</p>

### One-command evaluation, with detailed logs and samples.

You can evaluate models on multiple datasets with a single command. No model or data preparation is needed: run one command, wait a few minutes, and get the results. You get not just a result number but also detailed logs and samples, including the model args, input question, model response, and ground-truth answer.

```bash
# Evaluating LLaVA on multiple datasets
accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme_mmbenchen --output_path ./logs/
```

### Accelerator support and task grouping.

We support using `accelerate` to wrap the model for distributed evaluation, with multi-GPU and tensor-parallel setups. With **Task Grouping**, all instances from all tasks are grouped and evaluated in parallel, which significantly improves evaluation throughput. After evaluation, all instances are sent to the postprocessing module for metric calculations and potential GPT4-eval queries.
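
For example, the same evaluation can either be scaled out with one model replica per GPU or run as a single process that shards one large model across the visible GPUs. The sketch below reuses launch patterns that appear later in this README; the checkpoints and tasks are only placeholders:

```bash
# Data parallelism: 8 processes, one model replica per GPU
python3 -m accelerate.commands.launch --num_processes=8 \
    -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme --batch_size 1 --output_path ./logs/

# Naive model sharding: a single process spreads one large model across all visible GPUs
python3 -m lmms_eval --model llava \
    --model_args pretrained=lmms-lab/llava-next-72b,conv_template=qwen_1_5,device_map=auto,model_name=llava_qwen \
    --tasks mme --batch_size 1 --output_path ./logs/
```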
Below are the total runtimes on different datasets using 4 x A100 40G.

| Dataset (#num)          | LLaVA-v1.5-7b      | LLaVA-v1.5-13b     |
| :---------------------- | :----------------- | :----------------- |
| mme (2374)              | 2 mins 43 seconds  | 3 mins 27 seconds  |
| gqa (12578)             | 10 mins 43 seconds | 14 mins 23 seconds |
| scienceqa_img (2017)    | 1 min 58 seconds   | 2 mins 52 seconds  |
| ai2d (3088)             | 3 mins 17 seconds  | 4 mins 12 seconds  |
| coco2017_cap_val (5000) | 14 mins 13 seconds | 19 mins 58 seconds |

### All-In-One HF dataset hubs.

We are hosting more than 40 (and increasing) datasets on [huggingface/lmms-lab](https://huggingface.co/lmms-lab). We carefully converted these datasets from their original sources and included all variants, versions, and splits. They can now be accessed directly without any burden of data preprocessing, and they also serve to visualize the data and give a sense of how the evaluation tasks are distributed.

<p align="center" width="100%">
<img src="https://i.postimg.cc/8PXFW9sk/WX20240228-123110_2x.png" width="100%" height="80%">
</p>

### Detailed Logging Utilities

We provide detailed logging utilities to help you understand the evaluation process and results. The logs include the model args, generation parameters, input question, model response, and ground-truth answer. You can also record every detail and visualize it inside runs on Weights & Biases.

<p align="center" width="100%">
<img src="https://i.postimg.cc/W1c1vBDJ/Wechat-IMG1993.png" width="100%" height="80%">
</p>
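
For illustration, the flags below (all of which appear in the usage section later in this README) write per-sample logs to disk and can additionally stream results to a Weights & Biases run; the W&B project name is just a placeholder:

```bash
python3 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_v1.5_mme \
    --output_path ./logs/ \
    --wandb_args=project=lmms-eval,job_type=eval
```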
# Why `lmms-eval`?

In today's world, we're on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.

To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI.

However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we're on a treasure hunt, but the maps are scattered everywhere.

In the field of language models, a valuable precedent has been set by the work of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). It offers integrated data and model interfaces, enables rapid evaluation of language models, serves as the backend support framework for the [open-llm-leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), and has gradually become the underlying ecosystem of the era of foundation models.

We humbly absorbed the exquisite and efficient design of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and introduce **lmms-eval**, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.

# Installation

@@ -95,37 +42,110 @@ pip install -e .

If you want to test llava, you will have to clone the corresponding repo, [LLaVA](https://github.com/haotian-liu/LLaVA) for v1.5 or [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) for v1.6, and install it:

```bash
# for llava 1.5
# git clone https://github.com/haotian-liu/LLaVA
# cd LLaVA
# pip install -e .

# for llava-next (1.6)
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .
```

<details>
<summary>Reproduction of LLaVA-1.5's paper results</summary>

You can check the [environment install script](miscs/repr_scripts.sh) and [torch environment info](miscs/repr_torch_envs.txt) to **reproduce LLaVA-1.5's paper results**. We found that differences in torch/cuda versions can cause small variations in the results; we provide a [results check](miscs/llava_result_check.md) across different environments.

</details>

If you want to test on caption datasets such as `coco`, `refcoco`, and `nocaps`, you will need `java==1.8.0` for the pycocoeval API to work. If you don't have it, you can install it with conda:

```
conda install openjdk=8
```

You can then check your java version with `java -version`.
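
For reference, the check could look like the following (a sketch; the exact build string varies by platform and conda channel):

```bash
java -version
# expect the output to mention a 1.8.0 (Java 8) runtime
```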

# Multiple Usages
```bash
# Evaluation of LLaVA on MME
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_v1.5_mme \
    --output_path ./logs/

# Evaluation of LLaVA on multiple datasets
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme,mmbench_en \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_v1.5_mme_mmbenchen \
    --output_path ./logs/

# For other llava variants. Note that `conv_template` is an arg of the init function of llava in `lmms_eval/models/llava.py`
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.6-mistral-7b,conv_template=mistral_instruct" \
    --tasks mme,mmbench_en \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_v1.5_mme_mmbenchen \
    --output_path ./logs/

# Evaluation of larger lmms (llava-v1.6-34b)
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.6-34b,conv_template=mistral_direct" \
    --tasks mme,mmbench_en \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_v1.5_mme_mmbenchen \
    --output_path ./logs/

# Evaluation with a set of configurations, supporting evaluation of multiple models and datasets
python3 -m accelerate.commands.launch --num_processes=8 -m lmms_eval --config ./miscs/example_eval.yaml

# Evaluation with naive model sharding for a bigger model (llava-next-72b)
python3 -m lmms_eval \
    --model=llava \
    --model_args=pretrained=lmms-lab/llava-next-72b,conv_template=qwen_1_5,device_map=auto,model_name=llava_qwen \
    --tasks=pope,vizwiz_vqa_val,scienceqa_img \
    --batch_size=1 \
    --log_samples \
    --log_samples_suffix=llava_qwen \
    --output_path="./logs/" \
    --wandb_args=project=lmms-eval,job_type=eval,entity=llava-vl

# Evaluation with SGLang for a bigger model (llava-next-72b)
python3 -m lmms_eval \
    --model=llava_sglang \
    --model_args=pretrained=lmms-lab/llava-next-72b,tokenizer=lmms-lab/llavanext-qwen-tokenizer,conv_template=chatml-llava,tp_size=8,parallel=8 \
    --tasks=mme \
    --batch_size=1 \
    --log_samples \
    --log_samples_suffix=llava_qwen \
    --output_path=./logs/ \
    --verbosity=INFO
```

# Model Results

<details>
<summary>Comprehensive Evaluation Results of LLaVA Family Models</summary>
<br>

As demonstrated by the extensive table below, we aim to provide detailed information for readers to understand the datasets included in lmms-eval and some specific details about these datasets (we remain grateful for any corrections readers may have during our evaluation process).

@@ -137,146 +157,19 @@ We provide a Google Sheet for the detailed results of the LLaVA series models on

We also provide the raw data exported from Weights & Biases for the detailed results of the LLaVA series models on different datasets. You can access the raw data [here](https://docs.google.com/spreadsheets/d/1AvaEmuG4csSmXaHjgu4ei1KBMmNNW8wflOD_kkTDdv8/edit?usp=sharing).

</details>
<br>

Development will continue on the main branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or to ask questions, either in issues or PRs on GitHub.

## Supported models

- GPT4V (API, only generation-based evaluation)
- LLaVA-v1.5/v1.6-7B/13B/34B (ppl-based, generation-based)
- Qwen-VL series (ppl-based, generation-based)
- Fuyu series (ppl-based, generation-based)
- InstructBLIP series (generation-based)

Please check [supported models](lmms_eval/models/__init__.py) for more details.
## Supported datasets
> () indicates the task name in `lmms_eval`. The task name is also used to specify the dataset in the configuration file.

- AI2D (ai2d)
- ChartQA (chartqa)
- CMMMU (cmmmu)
  - CMMMU Validation (cmmmu_val)
  - CMMMU Test (cmmmu_test)
- COCO Caption (coco_cap)
  - COCO 2014 Caption (coco2014_cap)
  - COCO 2014 Caption Validation (coco2014_cap_val)
  - COCO 2014 Caption Test (coco2014_cap_test)
  - COCO 2017 Caption (coco2017_cap)
  - COCO 2017 Caption MiniVal (coco2017_cap_val)
  - COCO 2017 Caption MiniTest (coco2017_cap_test)
- [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench)
- DOCVQA (docvqa)
  - DOCVQA Validation (docvqa_val)
  - DOCVQA Test (docvqa_test)
- Ferret (ferret)
- Flickr30K (flickr30k)
- Ferret Test (ferret_test)
- GQA (gqa)
- HallusionBenchmark (hallusion_bench_image)
- Infographic VQA (info_vqa)
  - Infographic VQA Validation (info_vqa_val)
  - Infographic VQA Test (info_vqa_test)
- LLaVA-Bench (llava_in_the_wild)
- LLaVA-Bench-COCO (llava_bench_coco)
- MathVerse (mathverse)
  - MathVerse Text Dominant (mathverse_testmini_text_dominant)
  - MathVerse Text Only (mathverse_testmini_text_only)
  - MathVerse Text Lite (mathverse_testmini_text_lite)
  - MathVerse Vision Dominant (mathverse_testmini_vision_dominant)
  - MathVerse Vision Intensive (mathverse_testmini_vision_intensive)
  - MathVerse Vision Only (mathverse_testmini_vision_only)
- MathVista (mathvista)
  - MathVista Validation (mathvista_testmini)
  - MathVista Test (mathvista_test)
- MMBench (mmbench)
  - MMBench English (mmbench_en)
  - MMBench English Dev (mmbench_en_dev)
  - MMBench English Test (mmbench_en_test)
  - MMBench Chinese (mmbench_cn)
  - MMBench Chinese Dev (mmbench_cn_dev)
  - MMBench Chinese Test (mmbench_cn_test)
- MME (mme)
- MMMU (mmmu)
  - MMMU Validation (mmmu_val)
  - MMMU Test (mmmu_test)
- MMUPD (mmupd)
  - MMUPD Base (mmupd_base)
  - MMAAD Base (mmaad_base)
  - MMIASD Base (mmiasd_base)
  - MMIVQD Base (mmivqd_base)
  - MMUPD Option (mmupd_option)
  - MMAAD Option (mmaad_option)
  - MMIASD Option (mmiasd_option)
  - MMIVQD Option (mmivqd_option)
  - MMUPD Instruction (mmupd_instruction)
  - MMAAD Instruction (mmaad_instruction)
  - MMIASD Instruction (mmiasd_instruction)
  - MMIVQD Instruction (mmivqd_instruction)
- MMVet (mmvet)
- Multi-DocVQA (multidocvqa)
  - Multi-DocVQA Validation (multidocvqa_val)
  - Multi-DocVQA Test (multidocvqa_test)
- NoCaps (nocaps)
  - NoCaps Validation (nocaps_val)
  - NoCaps Test (nocaps_test)
- OKVQA (ok_vqa)
  - OKVQA Validation 2014 (ok_vqa_val2014)
- POPE (pope)
- RefCOCO (refcoco)
  - refcoco_seg_test
  - refcoco_seg_val
  - refcoco_seg_testA
  - refcoco_seg_testB
  - refcoco_bbox_test
  - refcoco_bbox_val
  - refcoco_bbox_testA
  - refcoco_bbox_testB
- RefCOCO+ (refcoco+)
  - refcoco+_seg
  - refcoco+_seg_val
  - refcoco+_seg_testA
  - refcoco+_seg_testB
  - refcoco+_bbox
  - refcoco+_bbox_val
  - refcoco+_bbox_testA
  - refcoco+_bbox_testB
- RefCOCOg (refcocog)
  - refcocog_seg_test
  - refcocog_seg_val
  - refcocog_bbox_test
  - refcocog_bbox_val
- ScienceQA (scienceqa_full)
  - ScienceQA Full (scienceqa)
  - ScienceQA IMG (scienceqa_img)
- ScreenSpot (screenspot)
  - ScreenSpot REC / Grounding (screenspot_rec)
  - ScreenSpot REG / Instruction Generation (screenspot_reg)
- SeedBench (seedbench)
- SeedBench 2 (seedbench_2)
- ST-VQA (stvqa)
- TextCaps (textcaps)
  - TextCaps Validation (textcaps_val)
  - TextCaps Test (textcaps_test)
- TextVQA (textvqa)
  - TextVQA Validation (textvqa_val)
  - TextVQA Test (textvqa_test)
- VizWizVQA (vizwiz_vqa)
  - VizWizVQA Validation (vizwiz_vqa_val)
  - VizWizVQA Test (vizwiz_vqa_test)
- VQAv2 (vqav2)
  - VQAv2 Validation (vqav2_val)
  - VQAv2 Test (vqav2_test)
- WebSRC (websrc)
  - WebSRC Validation (websrc_val)
  - WebSRC Test (websrc_test)

## Datasets to be added and tested

- TallyQA (tallyqa)
- VSR (vsr)
- Winoground (winoground)
- NLVR2 (nlvr2)
- RavenIQ-Test (raveniq)
- IconQA (iconqa)
- VistBench (vistbench)
## Supported tasks

Please check [supported tasks](lmms_eval/docs/current_tasks.md) for more details.
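
To see which task names are available in your local installation, the harness-style task listing should help (a small sketch; it assumes your version of `lmms_eval` keeps the `--tasks list` flag from upstream lm-evaluation-harness):

```bash
python3 -m lmms_eval --tasks list
```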

# Add Customized Model and Dataset
