TriSense
Understanding Audio-Visual-Speech Moments with Multimodal LLM
TriSense is a multimodal large language model designed to understand audio-visual-speech moments in videos. By combining visual, audio, and speech modalities in a single model, it aims at comprehensive video understanding rather than single-modality analysis.
Key Features
- Multimodal Integration: Seamlessly combines visual, audio, and speech information
- Temporal Understanding: Captures temporal relationships across different modalities
- Real-world Applications: Designed for practical video understanding tasks
Technical Approach
Our approach leverages the latest advances in large language models and extends them to handle multimodal inputs. The key innovations include:
- Unified Multimodal Architecture: A single model that can process and reason across visual, audio, and speech modalities
- Attention Mechanisms: Specialized attention mechanisms for cross-modal understanding
- Temporal Modeling: Advanced techniques for capturing temporal dependencies in video sequences
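The paper describes the full architecture; as a rough illustration of the idea behind the list above, the sketch below shows one way cross-modal attention over visual, audio, and speech token sequences could produce fused tokens for a language model. The module names, dimensions, query-token fusion scheme, and PyTorch code here are illustrative assumptions, not the actual TriSense implementation.

```python
# Hypothetical sketch of cross-modal attention fusion over visual, audio, and
# speech token sequences. Dimensions, module names, and the fusion scheme are
# illustrative assumptions, not the TriSense architecture.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, num_queries: int = 32):
        super().__init__()
        # Learnable query tokens that pool information from all modalities.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio, speech):
        # visual/audio/speech: (batch, seq_len, dim), assumed already projected
        # into a shared embedding space with temporal position encodings added.
        tokens = torch.cat([visual, audio, speech], dim=1)
        queries = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        # Query tokens attend over the concatenated multimodal sequence.
        fused, _ = self.attn(queries, tokens, tokens)
        # The fused tokens would then be passed to the LLM as prefix tokens.
        return self.norm(fused)


if __name__ == "__main__":
    fusion = CrossModalFusion()
    v = torch.randn(2, 64, 512)   # e.g. 64 visual frame tokens
    a = torch.randn(2, 128, 512)  # e.g. 128 audio tokens
    s = torch.randn(2, 40, 512)   # e.g. 40 speech (transcript) tokens
    print(fusion(v, a, s).shape)  # torch.Size([2, 32, 512])
```

In this kind of design, the query tokens act as a fixed-size bottleneck, so the LLM sees a constant number of multimodal prefix tokens regardless of video length; whether TriSense uses this particular scheme is not claimed here.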
Results
TriSense achieves state-of-the-art performance on several video understanding tasks across multimodal benchmarks and shows promising results for real-world applications.
Publication
This work has been accepted to NeurIPS 2025, one of the top venues in machine learning and artificial intelligence.
Citation:
Li, Z., Ke, Q., Bennamoun, M., & Boussaid, F. (2025).
Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM.
Advances in Neural Information Processing Systems.