# LVIS-INSTRUCT4V

## Introduction

We introduce LVIS-INSTRUCT4V, a fine-grained visual instruction dataset containing 220K visually aligned and context-aware instructions, produced by prompting the powerful GPT-4V with images from LVIS. Please refer to the arXiv paper for more details.
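
The released annotations are expected to follow the LLaVA-style conversation format: a JSON list of records, each pairing an image with multi-turn human/GPT exchanges. The sketch below is one way to inspect the file once downloaded; the filename is a placeholder for whichever release JSON you fetched, and the field layout is an assumption to verify against the actual data.

```python
import json

# Placeholder filename: substitute the JSON file you downloaded from the
# LVIS-INSTRUCT4V release.
with open("lvis_instruct4v_220k.json", "r", encoding="utf-8") as f:
    records = json.load(f)  # expected: a list of LLaVA-style instruction records

print(f"loaded {len(records)} instruction records")
# Each record is expected to reference an image and carry a multi-turn
# "conversations" list of human/gpt turns; print one to check the schema.
print(json.dumps(records[0], indent=2)[:800])
```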

## Usage

Please follow LLaVA to set up the code.

LVIS-INSTRUCT4V is available at LVIS-INSTRUCT4V. To achieve better results on the QA benchmarks, we follow LLaVA 1.5 and mix LVIS-INSTRUCT4V with academic-task-related data (see Tables 1 and 7 in the LLaVA 1.5 paper); the resulting mixes can be found at LVIS-INSTRUCT4V-Nodetail-mix619k, LVIS-INSTRUCT4V-mix730k, and LVIS-Instruct4V-LLaVA-Instruct-mix880k.
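
If you prefer to build your own mix rather than use the released files, the mixing step amounts to concatenating LLaVA-format JSON files into a single training file. A minimal sketch, with placeholder filenames standing in for the data you actually downloaded:

```python
import json

# Placeholder filenames: substitute the instruction JSON files you downloaded.
sources = ["lvis_instruct4v.json", "academic_task_data.json"]

mixed = []
for path in sources:
    with open(path, "r", encoding="utf-8") as f:
        mixed.extend(json.load(f))  # each file is a list of LLaVA-format records

with open("lvis_instruct4v_mix.json", "w", encoding="utf-8") as f:
    json.dump(mixed, f)

print(f"wrote {len(mixed)} records to lvis_instruct4v_mix.json")
```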

## Model Zoo

| Version | Data | Size | Schedule | Checkpoint | VQAv2 | GQA | VizWiz | SQA | T-VQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | LVIS-Instruct4V-Nodetail-mix619k | 7B | full_ft-1e | LVIS-Instruct4V-Nodetail-mix619k-7b | 79.2 | 62.6 | 52.5 | 68.4 | 57.6 | 84.0 | 1472.9 | 67.1 | 60.0 | 60.8 | 70.4 | 34.6 |
| LLaVA-1.5 | LVIS-Instruct4V-mix730k | 7B | full_ft-1e | LVIS-Instruct4V-mix730k-7b | 79.4 | 62.6 | 52.6 | 68.9 | 58.4 | 85.1 | 1495.3 | 66.6 | 59.6 | 60.5 | 67.1 | 33.3 |
| LLaVA-1.5 | LVIS-Instruct4V-LLaVA-Instruct-mix880k | 7B | full_ft-1e | LVIS-Instruct4V-LLaVA-Instruct-mix880k-7b | 79.6 | 62.6 | 51.8 | 68.3 | 58.7 | 86.0 | 1528.2 | 66.2 | 60.4 | 60.6 | 67.0 | 31.5 |
| LLaVA-1.5<sup>1</sup> | LVIS-Instruct4V-Nodetail-mix619k | 13B | full_ft-1e | LVIS-Instruct4V-Nodetail-mix619k-13b | 80.1 | 63.8 | 51.4 | 69.0 | 62.1 | 85.3 | 1572.0 | 67.8 | 61.0 | 62.5 | 76.7 | 40.2 |
| LLaVA-1.5 | LVIS-Instruct4V-LLaVA-Instruct-mix880k | 13B | full_ft-1e | LVIS-Instruct4V-LLaVA-Instruct-mix880k-13b | 80.7 | 63.6 | 57.2 | 70.6 | 62.5 | 86.0 | 1574.9 | 68.0 | 61.1 | 61.6 | 71.3 | 37.4 |
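
The checkpoints above are full fine-tunes of LLaVA-1.5, so they should load with the LLaVA codebase set up in Usage. A minimal sketch, assuming your LLaVA checkout exposes `load_pretrained_model` and `get_model_name_from_path` as below (verify against your version); the checkpoint path is a placeholder for one of the entries in the table:

```python
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

# Placeholder: point this at a downloaded Model Zoo checkpoint (or its
# Hugging Face repo id), e.g. the 7B Nodetail-mix619k model above.
model_path = "path/to/LVIS-Instruct4V-Nodetail-mix619k-7b"

# Returns the tokenizer, the multimodal model, the image processor, and the
# context length, following the LLaVA builder's interface.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,  # full fine-tuned checkpoint, so no separate base is needed
    model_name=get_model_name_from_path(model_path),
)
print(type(model).__name__, context_len)
```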

## Reference

If you find our work useful for your research or applications, please cite using this BibTeX:

@article{wang2023instruct4v,
  title={To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning},
  author={Wang, Junke and Meng, Lingchen and Weng, Zejia and He, Bo and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2311.07574},
  year={2023}
}

## Acknowledgement

We thank the authors of LLaVA for their contribution to the open-source community.

## Footnotes

  1. We find TextVQA is sensitive to the number of beams, and for the 13B models we use beam = 3 on TextVQA (see the illustrative config below).
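
In Hugging Face terms, this corresponds to beam search with `num_beams=3` in the generation settings used for TextVQA evaluation. A minimal, illustrative config; the `max_new_tokens` value is an assumption, not taken from the paper:

```python
from transformers import GenerationConfig

# Beam search with 3 beams for TextVQA with the 13B models; other settings
# here (e.g. max_new_tokens) are illustrative defaults, not from the paper.
textvqa_generation = GenerationConfig(num_beams=3, max_new_tokens=128)
print(textvqa_generation)
```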