BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

A multi-modal LLM capable of jointly understanding text, vision and audio, and of grounding knowledge in visual objects.

[Project Page] [Arxiv] [Demo Video] [Gradio] [Data] [Model]

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Yang Zhao*, Zhijie Lin*, Daquan Zhou, Zilong Huang, Jiashi Feng and Bingyi Kang† (*Equal Contribution, †Project Lead)
Bytedance Inc.

News🔥

2023/07/21 - Hugging Face demo released!

Setup

Clone this repository and navigate to the current folder.

Environment

Our code is based on Python 3.9, CUDA 11.7 and PyTorch 2.0.1.

pip3 install -r pre-requirements.txt
pip3 install -r requirements.txt
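
After installing, it can help to confirm that the interpreter and PyTorch build match the versions listed above. The following is a minimal sketch, not part of the repository: the function names `python_matches` and `torch_report` are hypothetical, and the script only reports what it finds without enforcing anything.

```python
import sys
from importlib import util

# Versions stated in the Environment section above.
EXPECTED_PYTHON = (3, 9)
EXPECTED_TORCH = "2.0.1"
EXPECTED_CUDA = "11.7"


def python_matches(version_info=sys.version_info, expected=EXPECTED_PYTHON):
    """Return True when the major.minor Python version equals the expected one."""
    return tuple(version_info[:2]) == expected


def torch_report():
    """Return the installed PyTorch and CUDA versions, or None if torch is absent."""
    if util.find_spec("torch") is None:
        return None
    import torch

    return {
        "torch": torch.__version__,      # may carry a suffix such as "+cu117"
        "cuda": torch.version.cuda,      # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print("Python matches 3.9:", python_matches())
    report = torch_report()
    if report is None:
        print("PyTorch is not installed")
    else:
        print("PyTorch:", report["torch"], "(expected", EXPECTED_TORCH + ")")
        print("CUDA:", report["cuda"], "(expected", EXPECTED_CUDA + ")")
        print("CUDA available:", report["cuda_available"])
```

A mismatch does not necessarily mean the code will fail; these are simply the versions the authors report using.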

Models

Follow the instructions to prepare the pretrained Vicuna weights, then update llama_model in bubogpt/configs/models/mmgpt4.yaml to point to them.

# Get the pre-trained checkpoints
mkdir checkpoints && cd checkpoints
wget https://huggingface.co/spaces/Vision-CAIR/minigpt4/resolve/main/blip2_pretrained_flant5xxl.pth
wget https://huggingface.co/spaces/xinyu1205/recognize-anything/resolve/main/ram_swin_large_14m.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://huggingface.co/spaces/abhishek/StableSAM/resolve/main/sam_vit_h_4b8939.pth
wget https://huggingface.co/magicr/BuboGPT-ckpt/resolve/main/bubogpt_7b.pth
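
Large downloads sometimes fail silently, so a quick check that all five files landed in checkpoints can save a confusing error later. This is a small sketch under the assumption that the files keep the names from the URLs above; the helper `missing_checkpoints` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the download URLs above.
EXPECTED_CHECKPOINTS = [
    "blip2_pretrained_flant5xxl.pth",
    "ram_swin_large_14m.pth",
    "groundingdino_swint_ogc.pth",
    "sam_vit_h_4b8939.pth",
    "bubogpt_7b.pth",
]


def missing_checkpoints(directory="checkpoints"):
    """Return the expected checkpoint names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All expected checkpoints are present.")
```

Run it from the repository root. It only checks that the files exist, not that they are complete or uncorrupted.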

For training, download the MiniGPT-4 checkpoint into the checkpoints directory.

Data

Stage1

Image-Text Data: Follow MiniGPT-4's instructions to prepare the stage 1 dataset.
Audio-Text Data: Follow our audio data instructions to prepare it.

Stage2

MiniGPT-4's visual instruction-following data: follow their preparation doc.
LLaVA's visual instruction-following data: refer to the LLaVA Visual Instruct 150K Dataset Card.
Our audio instruction-following data: follow our audio data instructions to prepare it.
Our image-audio sound localization data: follow our image-audio data instructions to prepare it.
Our image-audio negatively paired data: follow our image-audio data instructions to prepare it.

Usage

Gradio demo

Run the Gradio demo with:

python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0

Training

Browse the dataset config folder, and replace the storage item with path/to/your/data for each dataset.
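
To see which entries still need editing, one option is to scan the config files for storage lines. The sketch below assumes the dataset configs are YAML files in which each storage item sits on its own `storage: ...` line; that layout and the helper name `storage_entries` are assumptions, not taken from the repository.

```python
import re

# Matches a YAML line such as "      storage: path/to/your/data".
# Commented-out lines (starting with "#") do not match.
STORAGE_RE = re.compile(r"^\s*storage\s*:\s*(.*?)\s*$")


def storage_entries(yaml_text):
    """Return (line_number, value) for every `storage:` line in the text."""
    entries = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        match = STORAGE_RE.match(line)
        if match:
            entries.append((number, match.group(1)))
    return entries


if __name__ == "__main__":
    # Hypothetical config content, used only to illustrate the output shape.
    sample = (
        "datasets:\n"
        "  demo_dataset:\n"
        "    build_info:\n"
        "      storage: path/to/your/data\n"
    )
    for number, value in storage_entries(sample):
        print(f"line {number}: storage = {value}")
```

Applying it to each file in the dataset config folder (for example by reading the file with `pathlib.Path.read_text`) lists the paths that remain to be replaced.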

Stage 1: Audio pre-training

bash dist_train.sh train_configs/mmgpt4_stage1_audio.yaml

Stage 2: Multi-modal instruct tuning

Set ckpt in train_configs/mmgpt4_stage2_mm.yaml to path/to/stage1/ckpt.

bash dist_train.sh train_configs/mmgpt4_stage2_mm.yaml
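
The ckpt setting for stage 2 can be edited by hand, or scripted when running several experiments. The following is a rough sketch that rewrites the first `ckpt:` line in a YAML text. It assumes the key appears on its own line; the helper `set_ckpt` and the sample config content are hypothetical, and the actual layout of mmgpt4_stage2_mm.yaml may differ.

```python
import re

# First line of the form "<indent>ckpt: <anything>"; only spaces and tabs are
# allowed around the key so the match never spills onto the next line.
CKPT_RE = re.compile(r"^([ \t]*ckpt[ \t]*:[ \t]*).*$", re.MULTILINE)


def set_ckpt(yaml_text, ckpt_path):
    """Return yaml_text with the first `ckpt:` value replaced by ckpt_path."""
    new_text, count = CKPT_RE.subn(
        lambda m: f"{m.group(1)}'{ckpt_path}'", yaml_text, count=1
    )
    if count == 0:
        raise ValueError("no 'ckpt:' line found in the config text")
    return new_text


if __name__ == "__main__":
    # Hypothetical config content, used only for illustration.
    sample = "model:\n  ckpt: ''\nrun:\n  max_epoch: 1\n"
    print(set_ckpt(sample, "path/to/stage1/ckpt"))
```

A text-level edit like this keeps comments and key order intact, which a load-and-dump through a YAML library would not. Check the result before launching training.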
