Cong Wei

Profile

I am a third-year PhD student in Computer Science at the University of Waterloo, advised by Wenhu Chen.

Previously, I earned my master’s degree in Computer Science from the University of Toronto, where I was advised by Florian Shkurti. I also completed my undergraduate studies at the University of Toronto, working under the supervision of David Duvenaud. During my undergraduate years, I was a student researcher at the Vector Institute, advised by David Duvenaud and Gennady Pekhimenko.

Email  /  Google Scholar  /  Twitter  /  GitHub  /  LinkedIn


News

02/2026: Visual-Aware CoT is accepted to CVPR 2026.
01/2026: UniVideo is accepted to ICLR 2026.
10/2025: MoCha is accepted to NeurIPS 2025 (Spotlight).
03/2025: Thrilled to introduce MoCha! Enjoy the promotional video!
02/2025: OmniEdit is accepted to ICLR 2025.
10/2024: I will join Meta GenAI as a Research Scientist Intern in Winter 2024.

Research

I'm broadly interested in multimodal generation and multimodal understanding. I build scalable data-synthesis pipelines for creating large-scale training data, and design unified architectures that share and leverage data across multiple tasks.

1. Multimodal Generation

2. Multimodal Understanding

  • UniIR: a framework for unified and compositional multimodal information retrieval
  • Mantis / Vamba: advanced vision-language models
  • MMMU: a commonly used multimodal evaluation benchmark

Publications

(*: equal contribution)

Context Forcing: Consistent Autoregressive Video Generation with Long Context
Shuo Chen*, Cong Wei*, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen
arXiv 2026
website / paper / code

Real-time generation of 60-second-plus videos with long context

UniVideo: Unified Understanding, Generation, and Editing for Videos
Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen
ICLR 2026
website / hf page / paper / code

A unified model for image/video understanding and generation

MoCha: Towards Movie-Grade Talking Character Synthesis
Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, Wenhu Chen
NeurIPS 2025 (Spotlight Presentation)
website / hf page / paper / tweet

Automated filmmaking with movie-grade talking characters

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
Cong Wei*, Zheyang Xiong*, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen
ICLR 2025
paper / dataset / website / tweet

A method to scale up image-editing training data: multi-expert generation plus LLM filtering

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen
ECCV 2024 (Oral Presentation)
paper / website / tweet

A unified multimodal retriever guided by instructions

MANTIS: Interleaved Multi-Image Instruction Tuning
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen
TMLR 2024 (Outstanding/Best Paper Award)
paper / website / code

Multi-image understanding

Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers
Cong Wei*, Brendan Duke*, Ruowei Jiang, Parham Aarabi, Graham W. Taylor, Florian Shkurti
CVPR 2023
paper / website / video

Learning to sparsify attention patterns in ViTs

AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks
Max Ku*, Cong Wei*, Weiming Ren*, Harry Yang, Wenhu Chen
TMLR 2024 (Reproducibility Certification)
tweet / website / paper

A training-free V2V method that can be used to generate video editing data.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
CVPR 2024 (Oral Presentation, Best Paper Finalist)
paper / website

A large-scale benchmark for multimodal LLM evaluation

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen
ICCV 2025
paper / website / code

Hybrid Mamba-Transformer for efficient hour-long video understanding

Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhan Luo
CVPR 2026
website / paper

Visually aware chain-of-thought for high-fidelity visual consistency in unified models

Education

University of Waterloo, Canada
Ph.D. in Computer Science • May 2023 to present
University of Toronto, Canada
Master of Science in Applied Computing • Sep 2021 to Jun 2023
University of Toronto, Canada
Honours Bachelor of Science • Sep 2017 to May 2021
Computer Science Specialist & Statistics Major & Mathematics Minor

Experience

Kuaishou Technology (KlingAI), China
May 2025 - Present
Research Scientist Intern
Meta GenAI, US
Oct 2024 - Apr 2025
Research Scientist Intern
ModiFace, Canada
May 2022 - Nov 2022
Machine Learning Researcher Intern
Vector Institute, Canada
Sep 2020 - Sep 2021
Undergraduate Researcher



Website template from Jon Barron.