
Commit 465bd42

Merge branch 'main' of https://github.com/EvolvingLMMs-Lab/lmms-eval into internal_main_dev

2 parents: e43bd84 + d99a24a

71 files changed (+3517 −29 lines changed)


Diff for: LICENSE

+56
@@ -0,0 +1,56 @@
+# For the main pipeline structure-related code, we maintain the original license provided with lm-evaluation-harness, which is the MIT License.
+
+MIT License
+
+Copyright (c) 2024 LMMs-Lab
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+# For the multimodal models and datasets that we have added (defined as code in the lmms_eval/tasks and lmms_eval/models folders), we apply the Apache License.
+
+Apache 2.0 License
+
+Copyright (c) 2024 LMMs-Lab
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+When modifying the code, please include the following information about the original lmms-eval source:
+# Adopted from lmms-eval from https://github.com/EvolvingLMMs-Lab/lmms-eval. Below is the original copyright:
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
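
As a concrete illustration of the attribution requirement above, a file derived from lmms-eval would carry that comment header before its own code. The module path and function below are hypothetical; only the header text comes from the LICENSE.

```python
# my_project/my_task_utils.py -- hypothetical downstream module
#
# Adopted from lmms-eval from https://github.com/EvolvingLMMs-Lab/lmms-eval. Below is the original copyright:
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


def build_prompt(question: str) -> str:
    """Toy example of downstream code living under the required header."""
    return f"Question: {question}\nAnswer:"
```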

Diff for: README.md

+28 −1
@@ -9,7 +9,7 @@
 🏠 [LMMs-Lab Homepage](https://lmms-lab.github.io/) | 🎉 [Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/) | 📚 [Documentation](docs/README.md) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab) | <a href="https://emoji.gg/emoji/1684-discord-thread"><img src="https://cdn3.emoji.gg/emojis/1684-discord-thread.png" width="14px" height="14px" alt="Discord_Thread"></a> [discord/lmms-eval](https://discord.gg/zdkwKUqrPy)
 
 
-In today's world, we're on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks. These advancements bring us closer to achieving AGI.
+In today's world, we're on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.
 
 To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI. However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we're on a treasure hunt, but the maps are scattered everywhere.
 
@@ -163,6 +163,7 @@ We also provide the raw data exported from Weights & Biases for the detailed res
 - COCO 2017 Caption (coco2017_cap)
 - COCO 2017 Caption MiniVal (coco2017_cap_val)
 - COCO 2017 Caption MiniTest (coco2017_cap_test)
+- [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench)
 - DOCVQA (docvqa)
 - DOCVQA Validation (docvqa_val)
 - DOCVQA Test (docvqa_test)
@@ -176,6 +177,13 @@ We also provide the raw data exported from Weights & Biases for the detailed res
 - Infographic VQA Test (info_vqa_test)
 - LLaVA-Bench (llava_in_the_wild)
 - LLaVA-Bench-COCO (llava_bench_coco)
+- MathVerse (mathverse)
+- MathVerse Text Dominant (mathverse_testmini_text_dominant)
+- MathVerse Text Only (mathverse_testmini_text_only)
+- MathVerse Text Lite (mathverse_testmini_text_lite)
+- MathVerse Vision Dominant (mathverse_testmini_vision_dominant)
+- MathVerse Vision Intensive (mathverse_testmini_vision_intensive)
+- MathVerse Vision Only (mathverse_testmini_vision_only)
 - MathVista (mathvista)
 - MathVista Validation (mathvista_testmini)
 - MathVista Test (mathvista_test)
@@ -190,6 +198,19 @@ We also provide the raw data exported from Weights & Biases for the detailed res
 - MMMU (mmmu)
 - MMMU Validation (mmmu_val)
 - MMMU Test (mmmu_test)
+- MMUPD (mmupd)
+- MMUPD Base (mmupd_base)
+- MMAAD Base (mmaad_base)
+- MMIASD Base (mmiasd_base)
+- MMIVQD Base (mmivqd_base)
+- MMUPD Option (mmupd_option)
+- MMAAD Option (mmaad_option)
+- MMIASD Option (mmiasd_option)
+- MMIVQD Option (mmivqd_option)
+- MMUPD Instruction (mmupd_instruction)
+- MMAAD Instruction (mmaad_instruction)
+- MMIASD Instruction (mmiasd_instruction)
+- MMIVQD Instruction (mmivqd_instruction)
 - MMVet (mmvet)
 - Multi-DocVQA (multidocvqa)
 - Multi-DocVQA Validation (multidocvqa_val)
@@ -226,6 +247,9 @@ We also provide the raw data exported from Weights & Biases for the detailed res
 - ScienceQA (scienceqa_full)
 - ScienceQA Full (scienceqa)
 - ScienceQA IMG (scienceqa_img)
+- ScreenSpot (screenspot)
+- ScreenSpot REC / Grounding (screenspot_rec)
+- ScreenSpot REG / Instruction Generation (screenspot_reg)
 - SeedBench (seedbench)
 - SeedBench 2 (seedbench_2)
 - ST-VQA (stvqa)
@@ -241,6 +265,9 @@ We also provide the raw data exported from Weights & Biases for the detailed res
 - VQAv2 (vqav2)
 - VQAv2 Validation (vqav2_val)
 - VQAv2 Test (vqav2_test)
+- WebSRC (websrc)
+- WebSRC Validation (websrc_val)
+- WebSRC Test (websrc_test)
 
 ## Datasets to be added and tested
 - TallyQA (tallyqa)
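
The keys in parentheses are what get passed to the evaluator. Below is a minimal sketch of running a few of the tasks added in this commit, reusing the CLI flags shown in the llava_hf.py example usage later in this diff; the comma-separated `--tasks` list and the subprocess wrapper are assumptions, not part of this change.

```python
# Sketch only: build the documented `accelerate launch -m lmms_eval` command
# for some of the task keys added in this commit. Assumes the CLI accepts a
# comma-separated --tasks list; model and flags mirror the llava_hf docstring.
import subprocess

tasks = ["conbench", "mathverse_testmini_text_lite", "websrc_val"]

cmd = [
    "accelerate", "launch", "--num_processes=8", "--main_process_port", "12345",
    "-m", "lmms_eval",
    "--model", "llava_hf",
    "--model_args", "pretrained=llava-hf/llava-1.5-7b-hf",
    "--tasks", ",".join(tasks),
    "--batch_size", "1",
    "--output_path", "./logs/",
    "--log_samples",
]
subprocess.run(cmd, check=True)
```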

Diff for: lmms_eval/models/__init__.py

+1
@@ -27,6 +27,7 @@
     "llava_onevision": "Llava_OneVision",
     "from_log": "FromLog",
     "mplug_owl_video": "mplug_Owl",
+    "phi3v": "Phi3v",
 }
 
 for model_name, model_class in AVAILABLE_MODELS.items():
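
For context on what this one-line registry entry does: AVAILABLE_MODELS maps a model key to a class name, and the loop above imports each model module and exposes its class. A minimal sketch of that resolution is below; the `get_model` helper is illustrative, not the exact loop body in lmms_eval/models/__init__.py.

```python
# Illustrative sketch, assuming each key names a module under lmms_eval.models
# and each value names a class defined in that module (e.g. "phi3v" -> Phi3v).
import importlib

AVAILABLE_MODELS = {
    "llava_hf": "LlavaHf",
    "phi3v": "Phi3v",  # entry added by this commit
}

def get_model(model_name: str):
    """Import lmms_eval.models.<model_name> and return the registered class."""
    class_name = AVAILABLE_MODELS[model_name]
    module = importlib.import_module(f"lmms_eval.models.{model_name}")
    return getattr(module, class_name)
```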

Diff for: lmms_eval/models/idefics2.py

+1
@@ -203,6 +203,7 @@ def _collate(x):
             gen_kwargs["max_new_tokens"] = 1024
         if "temperature" not in gen_kwargs:
             gen_kwargs["temperature"] = 0
+
         prompts = []
         for context, visual in zip(contexts, visuals):
             content = []
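
The context lines above show the generation-default pattern used across the model wrappers: missing gen_kwargs entries are filled in before prompts are built. A small hedged sketch of that pattern follows; the helper name is made up, and only the default values come from the diff.

```python
# Sketch of the defaulting visible above: 1024 new tokens and greedy decoding
# (temperature 0) unless the task config overrides them.
def fill_generation_defaults(gen_kwargs: dict) -> dict:
    gen_kwargs.setdefault("max_new_tokens", 1024)
    gen_kwargs.setdefault("temperature", 0)
    return gen_kwargs
```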

Diff for: lmms_eval/models/llava.py

+2 −13
@@ -26,19 +26,11 @@
 try:
     from llava.model.builder import load_pretrained_model
     from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
-    from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
-    from llava.conversation import conv_templates, SeparatorStyle
+    from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
+    from llava.conversation import conv_templates
 except Exception as e:
     eval_logger.debug("LLaVA is not installed. Please install LLaVA to use this model.\nError: %s" % e)
 
-from transformers.integrations.deepspeed import (
-    is_deepspeed_zero3_enabled,
-    set_hf_deepspeed_config,
-    unset_hf_deepspeed_config,
-)
-
-from transformers.utils import is_flash_attn_2_available
-
 # inference implementation for attention, can be "sdpa", "eager", "flash_attention_2". Seems FA2 is not effective during inference: https://discuss.huggingface.co/t/flash-attention-has-no-effect-on-inference/73453/5
 # if is_flash_attn_2_available:
 #     best_fit_attn_implementation = "flash_attention_2" # flash_attn has a bug that says: ERROR Error query and key must have the same dtype in generating
@@ -60,10 +52,7 @@ def __init__(
         pretrained: str = "liuhaotian/llava-v1.5-7b",
         truncation: Optional[bool] = True,
         device: Optional[str] = "cuda:0",
-        dtype: Optional[Union[str, torch.dtype]] = "auto",
         batch_size: Optional[Union[int, str]] = 1,
-        trust_remote_code: Optional[bool] = False,
-        revision=None,
         model_name=None,
         attn_implementation=best_fit_attn_implementation,
         device_map="cuda:0",
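
The removed flash-attention import relates to the comment kept above: flash_attention_2 is left disabled for inference, so the wrapper falls back to an "sdpa"/"eager" default. A hedged sketch of how such a default could be chosen is below; the actual assignment of `best_fit_attn_implementation` in llava.py may differ.

```python
# Assumption-labelled sketch: prefer PyTorch SDPA when available, otherwise
# fall back to eager attention; flash_attention_2 stays disabled because of
# the dtype error referenced in the comment above.
from transformers.utils import is_torch_sdpa_available

best_fit_attn_implementation = "sdpa" if is_torch_sdpa_available() else "eager"
```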

Diff for: lmms_eval/models/llava_hf.py

+28 −10
@@ -8,7 +8,7 @@
 from accelerate import Accelerator, DistributedType
 from accelerate.state import AcceleratorState
 from typing import List, Optional, Union, Tuple
-from transformers import LlavaForConditionalGeneration, AutoProcessor
+from transformers import LlavaForConditionalGeneration, LlavaNextForConditionalGeneration, AutoProcessor
 
 import warnings
 
@@ -31,10 +31,10 @@ class LlavaHf(lmms):
 
     Example usage:
 
-    accelerate launch --num_processes=8 -m lmms_eval \
+    accelerate launch --num_processes=8 --main_process_port 12345 -m lmms_eval \
         --model llava_hf \
         --model_args pretrained=llava-hf/llava-1.5-7b-hf \
-        --tasks mme \
+        --tasks seedbench \
         --batch_size 1 \
         --output_path ./logs/ \
         --log_samples
@@ -67,7 +67,16 @@ def __init__(
         self.device_map = device_map
         if isinstance(dtype, str) and dtype != "auto":
             dtype = getattr(torch, dtype)
-        self._model = LlavaForConditionalGeneration.from_pretrained(pretrained, revision=revision, torch_dtype=dtype, device_map=self.device_map, trust_remote_code=trust_remote_code, attn_implementation=attn_implementation)
+
+        if "1.5" in pretrained:
+            self._model = LlavaForConditionalGeneration.from_pretrained(pretrained, revision=revision, torch_dtype=dtype, device_map=self.device_map, trust_remote_code=trust_remote_code, attn_implementation=attn_implementation)
+        elif "1.6" in pretrained:
+            self._model = LlavaNextForConditionalGeneration.from_pretrained(pretrained, revision=revision, torch_dtype=dtype, device_map=self.device_map, trust_remote_code=trust_remote_code, attn_implementation=attn_implementation)
+        else:
+            eval_logger.info("Not sure whether you use 1.5 or 1.6. Use 1.5 by default. This might cause bugs if you are actually using 1.6")
+            self._model = LlavaForConditionalGeneration.from_pretrained(pretrained, revision=revision, torch_dtype=dtype, device_map=self.device_map, trust_remote_code=trust_remote_code, attn_implementation=attn_implementation)
+
+        self.pretrained = pretrained
         self._image_processor = AutoProcessor.from_pretrained(pretrained, revision=revision, trust_remote_code=trust_remote_code)
         # Pad from left for batched generation: https://huggingface.co/docs/transformers/v4.39.3/en/model_doc/llava#usage-tips
         self._image_processor.tokenizer.padding_side = "left"
@@ -106,6 +115,7 @@ def __init__(
             self.model.to(self._device)
             self._rank = 0
             self._word_size = 1
+        self.accelerator = accelerator
 
     @property
     def config(self):
@@ -199,8 +209,8 @@ def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
             labels[: len(contxt_id)] = -100
 
             if self.accelerator.is_main_process and doc_id % 100 == 0:
-                eval_logger.info(f"Prompt for doc ID {doc_id}:\n\n{formatted_contexts[0]}\n")
-                eval_logger.info(f"Prompt and continuation for doc ID {doc_id}:\n\n{formatted_continuation[0]}\n")
+                eval_logger.debug(f"Prompt for doc ID {doc_id}:\n\n{formatted_contexts[0]}\n")
+                eval_logger.debug(f"Prompt and continuation for doc ID {doc_id}:\n\n{formatted_continuation[0]}\n")
 
             with torch.inference_mode():
                 outputs = self.model(**model_inputs, labels=labels)
@@ -268,7 +278,9 @@ def _collate(x):
 
             # Some benchmarks like MME do not contain image tokens, so we prepend them to the prompt.
             if DEFAULT_IMAGE_TOKEN not in context:
-                context = f"{DEFAULT_IMAGE_TOKEN}\n{context}"
+                image_tokens = [DEFAULT_IMAGE_TOKEN] * len(visuals)
+                image_tokens = " ".join(image_tokens)
+                context = f"{image_tokens}\n{context}"
             # Apply chat template
             messages = [{"role": "user", "content": context}]
             if self.chat_template is not None:
@@ -281,7 +293,7 @@ def _collate(x):
                 text = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 
             if self.accelerator.is_main_process and doc_id[0] % 100 == 0:
-                eval_logger.info(f"Prompt for doc ID {doc_id[0]}:\n\n{text}\n")
+                eval_logger.debug(f"Prompt for doc ID {doc_id[0]}:\n\n{text}\n")
 
             inputs = self._image_processor(images=visuals, text=text, return_tensors="pt").to(self._device, self.model.dtype)
 
@@ -303,15 +315,21 @@ def _collate(x):
                     num_beams=gen_kwargs["num_beams"],
                     max_new_tokens=gen_kwargs["max_new_tokens"],
                     use_cache=self.use_cache,
+                    pad_token_id=self.tokenizer.eos_token_id,
                 )
             except Exception as e:
                 eval_logger.error(f"Error {e} in generating")
                 cont = ""
             text_outputs = self.tokenizer.batch_decode(cont, skip_special_tokens=True)[0]
-            text_outputs = text_outputs.split("ASSISTANT:")[-1].strip()
+            if "1.5" in self.pretrained:
+                text_outputs = text_outputs.split("ASSISTANT:")[-1].strip()
+            elif "mistral" in self.pretrained:
+                text_outputs = text_outputs.split("[/INST]")[-1].strip()
+            else:
+                text_outputs = text_outputs.split("ASSISTANT:")[-1].strip()
 
             if self.accelerator.is_main_process and doc_id[0] % 100 == 0:
-                eval_logger.info(f"Generated text for doc ID {doc_id[0]}:\n\n{text_outputs}\n")
+                eval_logger.debug(f"Generated text for doc ID {doc_id[0]}:\n\n{text_outputs}\n")
 
             res.append(text_outputs)
             self.cache_hook.add_partial("generate_until", (context, gen_kwargs), text_outputs)
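
Taken together, the llava_hf.py changes do three things: pick the model class from the checkpoint name, prepend one image token per visual when the prompt has none, and strip the chat prefix that matches the checkpoint family. A condensed, hedged sketch of that behaviour follows, written as standalone helpers rather than the class methods in the diff; DEFAULT_IMAGE_TOKEN is assumed to be "<image>".

```python
from transformers import (
    AutoProcessor,
    LlavaForConditionalGeneration,
    LlavaNextForConditionalGeneration,
)

DEFAULT_IMAGE_TOKEN = "<image>"  # assumption: the token llava_hf prepends

def load_llava_hf(pretrained: str, **hf_kwargs):
    # "1.6" checkpoints need the LLaVA-NeXT class; everything else falls back
    # to the plain LLaVA class, as in the __init__ branch above.
    cls = LlavaNextForConditionalGeneration if "1.6" in pretrained else LlavaForConditionalGeneration
    return cls.from_pretrained(pretrained, **hf_kwargs), AutoProcessor.from_pretrained(pretrained)

def prepend_image_tokens(context: str, num_visuals: int) -> str:
    # Benchmarks such as MME ship prompts without image tokens; one token per
    # visual is prepended, mirroring the new multi-image handling.
    if DEFAULT_IMAGE_TOKEN in context:
        return context
    return f"{' '.join([DEFAULT_IMAGE_TOKEN] * num_visuals)}\n{context}"

def strip_chat_prefix(pretrained: str, decoded: str) -> str:
    # Mistral-based checkpoints close the prompt with "[/INST]"; the
    # Vicuna-style 1.5 checkpoints use "ASSISTANT:".
    sep = "[/INST]" if "mistral" in pretrained else "ASSISTANT:"
    return decoded.split(sep)[-1].strip()
```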
