Awesome Unified Multimodal Models

This repository collects papers, code, and other resources related to unified multimodal models.

🤔 What are unified multimodal models?

Traditional multimodal models fall broadly into two categories: multimodal understanding and multimodal generation. Unified multimodal models aim to integrate both tasks within a single framework. The community also refers to them as Any-to-Any generation models. They take multimodal input and produce multimodal output, so a single model can both process and generate content across modalities.

🔆 This project is ongoing; pull requests are welcome!

If you have any suggestions (missing papers, new papers, or typos), please feel free to edit the list and open a pull request. Simply telling us the title of a paper is also a valuable contribution. You can do this by opening an issue or contacting us directly via email.

⭐ If you find this repo useful, please star it!!!

Paper List

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (Sep. 2024, arXiv)
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (Aug. 2024, arXiv)
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (Aug. 2024, arXiv)
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation (Jul. 2024, arXiv)
X-VILA: Cross-Modality Alignment for Large Language Model (May 2024, arXiv)
Chameleon: Mixed-Modal Early-Fusion Foundation Models (May 2024, arXiv)
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (Apr. 2024, arXiv)
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (Mar. 2024, arXiv)
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (Feb. 2024, arXiv)
World Model on Million-Length Video And Language With Blockwise RingAttention (Feb. 2024, arXiv)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (Feb. 2024, arXiv)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer (Jan. 2024, arXiv)
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (Dec. 2023, arXiv)
Emu2: Generative Multimodal Models are In-Context Learners (Dec. 2023, CVPR)
Gemini: A Family of Highly Capable Multimodal Models (Dec. 2023, arXiv)
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation (Dec. 2023, arXiv)
DreamLLM: Synergistic Multimodal Comprehension and Creation (Dec. 2023, ICLR)
NExT-GPT: Any-to-Any Multimodal LLM (Sep. 2023, ICML)
LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (Sep. 2023, ICLR)
Emu: Generative Pretraining in Multimodality (Jul. 2023, ICLR)
CoDi: Any-to-Any Generation via Composable Diffusion (May 2023, NeurIPS)

tattrongvu/Awesome-Unified-Multimodal-Models