
MULQA: Adapting Multimodal Models to Unimodal Tasks by Ensembling FLAVA with ALBERT

CSCI 566 Project: Deep Learning and its Applications

Team members:

1. Akash Gujju
2. Anushka Kamath
3. Trisha Mandal
4. Varsha Kini

Abstract:

This project presents a study on enhancing the ability of multimodal models to perform unimodal tasks. By ensembling FLAVA, a foundational language and vision alignment model, with ALBERT, a lite version of BERT focused on efficient language understanding, the work explores how well the combined models handle tasks that require understanding either text or vision alone, rather than both. The report compares the baseline FLAVA model against the ensembled MULQA model across several datasets, showing that the adapted model can significantly improve performance on language-only and vision-only tasks. This adaptation suggests a promising direction for future research in multimodal learning and contributes to the understanding of how such models can be optimized for specific unimodal applications. The experiments span a range of datasets, from TextVQA and CommonsenseQA to image classification datasets such as Fashion MNIST and SVHN, showcasing the model's versatility.
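The sketch below illustrates one plausible late-fusion ensemble for a language-only task: FLAVA's text encoder and ALBERT encode the same question, and their [CLS] embeddings are concatenated and passed to a classification head. The checkpoint names, the `TextEnsembleClassifier` class, and the fusion head are illustrative assumptions, not the exact MULQA architecture described in the report.

```python
# Hedged sketch of a FLAVA-text + ALBERT ensemble for text classification.
# Checkpoints and the fusion head are assumptions for illustration only.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, FlavaTextModel, AlbertModel


class TextEnsembleClassifier(nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        # Pretrained encoders (assumed Hugging Face hub checkpoints).
        self.flava_text = FlavaTextModel.from_pretrained("facebook/flava-full")
        self.albert = AlbertModel.from_pretrained("albert-base-v2")
        hidden = (self.flava_text.config.hidden_size
                  + self.albert.config.hidden_size)
        # Simple late-fusion head over the concatenated sentence embeddings.
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, flava_inputs, albert_inputs):
        # Use the [CLS] token representation from each encoder.
        flava_cls = self.flava_text(**flava_inputs).last_hidden_state[:, 0]
        albert_cls = self.albert(**albert_inputs).last_hidden_state[:, 0]
        fused = torch.cat([flava_cls, albert_cls], dim=-1)
        return self.classifier(fused)


# Usage: tokenize the same question once per encoder, then classify.
flava_tok = AutoTokenizer.from_pretrained("facebook/flava-full")
albert_tok = AutoTokenizer.from_pretrained("albert-base-v2")
question = "What color is the stop sign?"
model = TextEnsembleClassifier(num_labels=5)
logits = model(flava_tok(question, return_tensors="pt"),
               albert_tok(question, return_tensors="pt"))
```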

FLAVA baseline model - https://github.com/facebookresearch/multimodal/tree/main/examples/flava

Project Report - https://github.com/Varsha-Kini/MULQA--Adapting-Multimodal-Models-to-Unimodal-Tasks-by-Ensembling-FLAVA-with-ALBERT/blob/bb77ad367d6732244cfd4da7bdf4edda313da72f/FinalReport.pdf
