My Journey into Multimodal Large Language Models

As a PhD student working on multimodal large language models (MLLMs), I’ve been fascinated by the rapid progress in this field. The ability to understand and reason across different modalities - vision, audio, and text - opens up incredible possibilities for AI applications.

The Challenge of Multimodal Understanding

One of the most exciting challenges in AI today is teaching machines to understand the world the way humans do - through multiple senses simultaneously. When we watch a video, we don’t just see the visual content; we hear the audio, understand the speech, and integrate all these signals to form a comprehensive understanding.

Our Approach: TriSense

In our recent work “TriSense,” we tackled the problem of understanding audio-visual-speech moments in videos. The key insight was that these three modalities are not independent - they work together to create meaning. A person’s facial expressions, their tone of voice, and the words they speak all contribute to the overall message.
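
To make the idea of modality interplay concrete, here is a minimal sketch of one common fusion pattern: per-modality embeddings combined through a learned softmax gate, so the model can lean on whichever signal is most informative at a given moment. The module, names, and dimensions below are illustrative assumptions for this post, not the actual TriSense architecture.

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Toy fusion module (hypothetical, for illustration only):
    weights visual, audio, and speech embeddings with a learned
    softmax gate before summing them into one representation."""

    def __init__(self, dim: int):
        super().__init__()
        # One gating logit per modality, conditioned on all three embeddings.
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, visual, audio, speech):
        # Each input: a (batch, dim) embedding from a per-modality encoder.
        stacked = torch.stack([visual, audio, speech], dim=1)   # (batch, 3, dim)
        logits = self.gate(torch.cat([visual, audio, speech], dim=-1))
        weights = logits.softmax(dim=-1).unsqueeze(-1)          # (batch, 3, 1)
        return (weights * stacked).sum(dim=1)                   # (batch, dim)

# Example: fuse random stand-in embeddings for a batch of two clips.
fusion = ModalityFusion(dim=256)
v, a, s = (torch.randn(2, 256) for _ in range(3))
fused = fusion(v, a, s)  # (2, 256) fused representation
```

The appeal of a gated sum over plain concatenation is that the weights make the fusion inspectable: for a speech-heavy moment, the speech weight should dominate, which matches the intuition that the modalities contribute unequally from moment to moment.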

Looking Forward

The field of multimodal AI is evolving rapidly, and I’m excited to be part of this journey. As we continue to push the boundaries of what’s possible, I believe we’ll see increasingly sophisticated AI systems that can understand and interact with the world in more human-like ways.

Stay tuned for more updates on my research journey!
