EMNLP-2023-Papers

Speech & Multimodality

Title	Repo	Video
A Video is Worth 4096 Tokens: Verbalize Story Videos to Understand them in Zero Shot		➖
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks		➖
Three Stream based Multi-Level Event Contrastive Learning for Text-Video Event Extraction	➖	➖
Reading Order Matters: Information Extraction from Visually-Rich Documents by Token Path Prediction		➖
MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational Transcript Cleanup		➖
Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation		➖
Rethinking and Improving Multi-Task Learning for End-to-End Speech Translation		➖
Unsupervised Sounding Pixel Learning	➖	➖
Homophone Disambiguation Reveals Patterns of Context Mixing in Speech Transformers		➖
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition		➖
Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction	➖	➖