TriSense
Understanding Audio-Visual-Speech Moments with Multimodal LLM
TriSense is a multimodal large language model designed to understand audio-visual-speech moments in videos. By combining visual, audio, and speech modalities in a single model, it aims at comprehensive video understanding rather than single-modality analysis.
Key Features
- Multimodal Integration: Seamlessly combines visual, audio, and speech information
- Temporal Understanding: Captures temporal relationships across different modalities
- Real-world Applications: Designed for practical video understanding tasks
Technical Approach
Our approach leverages the latest advances in large language models and extends them to handle multimodal inputs. The key innovations include:
- Unified Multimodal Architecture: A single model that can process and reason across visual, audio, and speech modalities
- Attention Mechanisms: Specialized attention mechanisms for cross-modal understanding
- Temporal Modeling: Advanced techniques for capturing temporal dependencies in video sequences
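The paper describes the full architecture; as a rough illustration of the idea behind the list above, the sketch below shows one way cross-modal attention over visual, audio, and speech token sequences could produce fused tokens for a language model. The module names, dimensions, query-token fusion scheme, and PyTorch code here are illustrative assumptions, not the actual TriSense implementation.

```python
# Hypothetical sketch of cross-modal attention fusion over visual, audio, and
# speech token sequences. Dimensions, module names, and the fusion scheme are
# illustrative assumptions, not the TriSense architecture.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, num_queries: int = 32):
        super().__init__()
        # Learnable query tokens that pool information from all modalities.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio, speech):
        # visual/audio/speech: (batch, seq_len, dim), assumed already projected
        # into a shared embedding space with temporal position encodings added.
        tokens = torch.cat([visual, audio, speech], dim=1)
        queries = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        # Query tokens attend over the concatenated multimodal sequence.
        fused, _ = self.attn(queries, tokens, tokens)
        # The fused tokens would then be passed to the LLM as prefix tokens.
        return self.norm(fused)


if __name__ == "__main__":
    fusion = CrossModalFusion()
    v = torch.randn(2, 64, 512)   # e.g. 64 visual frame tokens
    a = torch.randn(2, 128, 512)  # e.g. 128 audio tokens
    s = torch.randn(2, 40, 512)   # e.g. 40 speech (transcript) tokens
    print(fusion(v, a, s).shape)  # torch.Size([2, 32, 512])
```

In this kind of design, the query tokens act as a fixed-size bottleneck, so the LLM sees a constant number of multimodal prefix tokens regardless of video length; whether TriSense uses this particular scheme is not claimed here.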
Results
TriSense achieves state-of-the-art performance on several video understanding tasks across multimodal benchmarks and shows promising results for real-world applications.
Publication
This work has been accepted to NeurIPS 2025, one of the top venues in machine learning and artificial intelligence.
Citation:
Li, Z., Ke, Q., Bennamoun, M., & Boussaid, F. (2025).
Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM.
Advances in Neural Information Processing Systems.