EMNLP-2023-Papers

Speech & Multimodality

Title | Repo | Paper | Video

A Video is Worth 4096 Tokens: Verbalize Story Videos to Understand them in Zero Shot ➖
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks ➖
Three Stream based Multi-Level Event Contrastive Learning for Text-Video Event Extraction ➖ ➖
Reading Order Matters: Information Extraction from Visually-Rich Documents by Token Path Prediction ➖
MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational Transcript Cleanup ➖
Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation ➖
Rethinking and Improving Multi-Task Learning for End-to-End Speech Translation ➖
Unsupervised Sounding Pixel Learning ➖ ➖
Homophone Disambiguation Reveals Patterns of Context Mixing in Speech Transformers ➖
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition ➖
Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction ➖ ➖