Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM
Zinuo Li, Xian Zhang, Yongxin Guo, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi, Luqi Gong, and Qiuhong Ke
🏆 NeurIPS 2025 | Advances in Neural Information Processing Systems | CCF-A | CORE A*
We present TriSense, a novel multimodal large language model for understanding audio-visual-speech moments in videos. Our approach integrates visual, audio, and speech modalities to enable comprehensive understanding of video content, as illustrated by the sketch below.
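To make the tri-modal idea concrete, here is a minimal, hypothetical sketch of one common way to fuse three modality streams into tokens for an LLM backbone: project each encoder's features into a shared embedding space and concatenate them along the token axis. All module names, dimensions, and the concatenation strategy below are illustrative assumptions, not TriSense's actual architecture.

```python
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    """Hypothetical sketch: project per-modality features into a shared
    token space and concatenate them as input tokens for an LLM.
    Dimensions and design choices are assumptions, not from the paper."""

    def __init__(self, vis_dim=1024, aud_dim=768, spe_dim=768, llm_dim=4096):
        super().__init__()
        # One linear projector per modality into the LLM token space.
        self.vis_proj = nn.Linear(vis_dim, llm_dim)
        self.aud_proj = nn.Linear(aud_dim, llm_dim)
        self.spe_proj = nn.Linear(spe_dim, llm_dim)

    def forward(self, vis_feats, aud_feats, spe_feats):
        # Each input: (batch, num_tokens, modality_dim).
        tokens = torch.cat(
            [
                self.vis_proj(vis_feats),  # visual tokens
                self.aud_proj(aud_feats),  # audio tokens
                self.spe_proj(spe_feats),  # speech (transcript) tokens
            ],
            dim=1,
        )
        # (batch, total_tokens, llm_dim), ready to prepend to text tokens.
        return tokens

# Toy usage with random tensors standing in for encoder outputs.
fusion = TriModalFusion()
vis = torch.randn(1, 32, 1024)  # e.g. frame features
aud = torch.randn(1, 16, 768)   # e.g. audio-clip features
spe = torch.randn(1, 16, 768)   # e.g. speech-transcript features
print(fusion(vis, aud, spe).shape)  # torch.Size([1, 64, 4096])
```

A simple linear projection per modality is only one possible design; weighted or query-based fusion would let the model re-balance modalities per task, but the shared-token-space idea is the same.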