IDA-VLM

This is the code base for IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model.

We propose visual instruction tuning with ID reference, which unleashes the potential of LVLMs for identity memory and recognition across diverse scenes, and develop an ID-aware LVLM, IDA-VLM. This paper paves the way for future artificial intelligence systems to handle multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives such as movies.

Samples:

Animation image URLs:

  • https://img1.doubanio.com/view/photo/l/public/p2625512480.webp
  • https://img1.doubanio.com/view/photo/m/public/p2901199610.webp
  • https://img2.doubanio.com/view/photo/m/public/p2896107391.webp
  • https://img2.doubanio.com/view/photo/l/public/p2895851711.webp
  • https://olimg.3dmgame.com/uploads/images/xiaz/2021/0924/1632447816995.jpg
  • https://i0.hdslb.com/bfs/archive/0384c2f5139013b1ceae84395bbd58fae25898ef.jpg
  • https://act-webstatic.mihoyo.com/event-static/2023/08/15/9797cacf6d60a54f91fb6f68546b43e1_6723404097102093983.jpg?x-oss-process=image/quality,Q_80/resize,m_lfit,s_700

Todo list:

  • Release code.
  • Release benchmark images and tuning data.
  • Release model weights and easy start.

We make three main contributions: MM-ID, tuning data construction, and model training.

In MM-ID, we introduce the task format and evaluation methods. ID_reference_data contains the processing code for producing the instruction tuning data. Model includes the training and inference code, which is based on Qwen-VL-Chat.

For a quick start, you need to download the MM-ID images (or prepare your own ID images and test images) and the model weights to complete the instruction task with ID reference, as detailed in Model.
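As a rough illustration of what an ID-reference query could look like, the sketch below assembles a prompt that pairs each ID image with a character name before asking a question about a test image. This is a hypothetical helper, not part of the released code: the file names, the `build_id_prompt` function, and the exact prompt layout are assumptions for illustration; only the `<img>...</img>` image-tag convention comes from Qwen-VL's chat format.

```python
# Hypothetical sketch of an ID-reference query for a Qwen-VL-Chat-style
# model: ID images are interleaved with their character names, followed by
# the test image and the question. Paths and layout are illustrative only.

def build_id_prompt(id_refs, test_image, question):
    """id_refs: list of (image_path, character_name) pairs."""
    parts = []
    for path, name in id_refs:
        # Qwen-VL marks images with <img>...</img> tags in its chat format.
        parts.append(f"<img>{path}</img>This is {name}.")
    parts.append(f"<img>{test_image}</img>{question}")
    return "\n".join(parts)

prompt = build_id_prompt(
    [("ids/alice.jpg", "Alice"), ("ids/bob.jpg", "Bob")],
    "scenes/frame_001.jpg",
    "What are Alice and Bob doing in this scene?",
)
print(prompt)
```

The resulting string would then be passed to the model's chat interface; see Model for the actual inference entry point.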

License

The majority of this project is licensed under the Qwen-VL License.

Acknowledgements

  • Qwen-VL: The codebase we build upon.
  • MovieNet: The main dataset we use for tuning data construction.
