Zinuo Li

PhD Student @ UWA

Perth, Australia

I am a second-year Ph.D. student in Computer Science at the University of Western Australia (UWA), advised by Prof. Mohammed Bennamoun and Prof. Farid Boussaid, and jointly advised by Dr. Qiuhong Ke at Monash University. I am currently a Qingyun Research Intern at Tencent HY.

My research focuses on advancing Video Understanding and Multimodal Large Language Models, with particular interests in Agentic Reinforcement Learning and Visual Reasoning. Beyond research, I love anime and am passionate about exploring ACG-related AI topics. Feel free to contact me if you share similar interests or ideas.

👀 News

Mar 2026	🚀 Started as a Qingyun Research Intern at Tencent HY, working on Reinforcement Learning for Video Understanding.
Oct 2025	🚀 Started as a Research Intern at Tencent Youtu Lab, working on Reinforcement Learning for Video Understanding.
Oct 2025	🎉 Our paper “Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM” has been accepted to NeurIPS 2025.
Mar 2024	🎓 Started my PhD in Computer Science at the University of Western Australia.

🔬 Research Experience

Tencent Hunyuan | Qingyun Research Intern
2026.03 - Present

Lab: Tencent HY
Research Works:
- Video Understanding; Reinforcement Learning

Tencent Youtu Lab | Research Intern
2025.10 - 2026.03

Lab: Tencent Youtu Lab
Research Works:
- Video Understanding; Reinforcement Learning

University of Western Australia | PhD Student
2024.03 - Present

Lab: Department of Computer Science and Software Engineering
Research Works:
- Video Understanding; Multimodal Large Language Models; RLVR

University of Macau | Research Assistant
2022.09 - 2024.03

Lab: University of Macau
Research Works:
- Document analysis and image enhancement projects

Chinese Academy of Sciences, SIAT Shenzhen | Research Assistant
2022.10 - 2024.03

Lab: Shenzhen Institute of Advanced Technology
Research Works:
- Contributed to publications in top-tier conferences

📖 Selected Publications

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Zinuo Li, Xian Zhang, Yongxin Guo, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi, Luqi Gong, and Qiuhong Ke

🏆 NeurIPS 2025 | Advances in Neural Information Processing Systems | CCF-A | CORE-A*

We present TriSense, a novel multimodal large language model that can understand audio-visual-speech moments in videos. Our approach combines visual, audio, and speech modalities to provide comprehensive video understanding capabilities.

High-resolution Document Shadow Removal via A Large-scale Real-world Dataset and A Frequency-aware Shadow Erasing Net

Zinuo Li, Xuhang Chen, Chi-Man Pun, and Xiaodong Cun

🏆 ICCV 2023 | IEEE/CVF International Conference on Computer Vision | CCF-A | CORE-A*

We introduce SD7K, a large-scale real-world dataset for document shadow removal, along with a frequency-aware shadow erasing network. Our approach achieves state-of-the-art performance on document shadow removal tasks.

A Large-scale Film Style Dataset for Learning Multi-frequency Driven Film Enhancement

Zinuo Li, Xuhang Chen, Chi-Man Pun, and Shuqiang Wang

🏆 IJCAI 2023 | International Joint Conference on Artificial Intelligence | CCF-A | CORE-A*

We present FilmNet, a comprehensive framework for film enhancement using a large-scale film style dataset. Our multi-frequency driven approach enables high-quality film style transfer and enhancement.

Devignet: High-Resolution Vignetting Removal via a Dual Aggregated Fusion Transformer With Adaptive Channel Expansion

Shenghong Luo, Xuhang Chen, Weiwen Chen, Zinuo Li, Shuqiang Wang, and Chi-Man Pun

🏆 AAAI 2024 | AAAI Conference on Artificial Intelligence | CCF-A | CORE-A*

We propose DeVigNet, a novel dual aggregated fusion transformer for high-resolution vignetting removal. Our approach uses adaptive channel expansion to effectively remove vignetting effects while preserving image details.

🌟 Selected Honors & Awards

UWA Full Scholarship 2024

Full scholarship for PhD studies at the University of Western Australia