Akash Gujju, Anushka Kamath, Trisha Mandal, Varsha Kini
This research paper presents a study on enhancing the ability of multimodal models to perform unimodal tasks. By integrating FLAVA, a foundational language-and-vision alignment model, with ALBERT, a lite version of BERT focused on efficient language understanding, the research explores how the combined model performs on tasks that require understanding text alone or vision alone, rather than both. The paper compares the baseline FLAVA model against the ensembled MULQA model across several datasets, demonstrating that the adapted model can significantly improve performance on language-only and vision-only tasks. This adaptation not only suggests a promising direction for future research in multimodal learning but also contributes to the understanding of how such models can be optimized for specific unimodal applications. The experiments span a range of datasets, from TextVQA and CommonsenseQA to image classification datasets such as Fashion MNIST and SVHN, showcasing the model's versatility.
FLAVA baseline model - https://github.com/facebookresearch/multimodal/tree/main/examples/flava
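For context, the sketch below shows one minimal way the two backbones could be loaded and their text representations combined for a language-only input. It uses the Hugging Face `transformers` checkpoints `facebook/flava-full` and `albert-base-v2`; these checkpoint names and the concatenation-based pooling are illustrative assumptions, not the exact MULQA ensembling procedure described in the paper.

```python
# Illustrative sketch only: load FLAVA and ALBERT and pool their text
# representations for a language-only input. The checkpoints and the
# concatenation step are assumptions, not the paper's exact MULQA pipeline.
import torch
from transformers import AlbertModel, AlbertTokenizer, FlavaModel, FlavaProcessor

flava = FlavaModel.from_pretrained("facebook/flava-full").eval()
flava_processor = FlavaProcessor.from_pretrained("facebook/flava-full")
albert = AlbertModel.from_pretrained("albert-base-v2").eval()
albert_tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")


@torch.no_grad()
def text_features(sentence: str) -> torch.Tensor:
    # FLAVA text-encoder features, mean-pooled over tokens
    flava_inputs = flava_processor(text=[sentence], return_tensors="pt", padding=True)
    flava_emb = flava.get_text_features(**flava_inputs).mean(dim=1)

    # ALBERT features, mean-pooled over tokens
    albert_inputs = albert_tokenizer(sentence, return_tensors="pt")
    albert_emb = albert(**albert_inputs).last_hidden_state.mean(dim=1)

    # Naive ensemble: concatenate the two unimodal representations
    return torch.cat([flava_emb, albert_emb], dim=-1)


print(text_features("Where is the stop sign in the image?").shape)
```

In a language-only task such as CommonsenseQA, a fused representation of this kind could feed a task-specific classification head; a vision-only path (e.g., for Fashion MNIST or SVHN) would be analogous, using FLAVA's image encoder in place of the text encoders.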