MoCha

Towards Movie-Grade Talking Character Synthesis
¹GenAI, Meta ²University of Waterloo
*Work done during the first author’s internship at GenAI, Meta †Project Lead

MoCha is a pioneering model for Dialogue-driven Movie Shot Generation.

All talking characters are generated solely from Speech + Text. Click ▶️ to bring them to life.
The full set with prompts is available at 🤗Our-Results-on-MoChaBench☕.
We released a benchmark, 🤗MoChaBench☕, tailored for Dialogue-driven Movie Shot Generation.
When combined with a TTS model, MoCha can also achieve Text → Speech + Video generation, similar to the Veo 3 setting.
All videos presented in this project are solely for research demonstration purposes and have no commercial use.



Camera Control




Emotion Control




Action Control




Multi-Characters




Multi-character Conversation with Turn-based Dialogue




Portrait Talking Characters




Text → Video + Audio

We provide a pipeline solution that directly generates video and audio from text.
We first feed the character's spoken text description into a TTS model to synthesize speech.
Then, we provide both the text prompt and the synthesized speech to MoCha for video generation.
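The two steps above can be sketched as a small function that chains a speech synthesizer and a video generator. This is only an illustrative sketch: `text_to_talking_shot`, `TalkingShot`, and the two placeholder callables are hypothetical names introduced here, not MoCha's or any TTS provider's actual API. In practice the placeholders would be swapped for real calls to a TTS model and to MoCha.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases (assumptions, not part of any released API).
SpeechSynthesizer = Callable[[str], bytes]       # spoken text -> speech audio
VideoGenerator = Callable[[str, bytes], bytes]   # (text prompt, speech) -> video


@dataclass
class TalkingShot:
    """Generated video together with the speech that drives it."""
    video: bytes
    audio: bytes


def text_to_talking_shot(
    prompt: str,
    spoken_text: str,
    synthesize_speech: SpeechSynthesizer,
    generate_video: VideoGenerator,
) -> TalkingShot:
    # Step 1: turn the character's spoken text into speech with a TTS model.
    speech = synthesize_speech(spoken_text)
    # Step 2: condition the video model on both the text prompt and the speech.
    video = generate_video(prompt, speech)
    return TalkingShot(video=video, audio=speech)


# Placeholder stand-ins so the sketch runs without any model access.
def placeholder_tts(spoken_text: str) -> bytes:
    return b"SPEECH:" + spoken_text.encode("utf-8")


def placeholder_video_model(prompt: str, speech: bytes) -> bytes:
    return b"VIDEO:" + prompt.encode("utf-8") + b"|" + speech


shot = text_to_talking_shot(
    prompt="A close-up shot of a woman speaking to the camera.",
    spoken_text="Hello there.",
    synthesize_speech=placeholder_tts,
    generate_video=placeholder_video_model,
)
```

Passing the two models in as callables keeps the sketch independent of any particular TTS service, and the same synthesized speech is returned alongside the video so the two can be muxed together afterwards.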
Below is a Text → Video + Audio example generated using MoCha in combination with the OpenAI TTS model.



Comparison with SoTA

"A realistic medium shot with smooth camera movement captures a charming woman outdoors on a grassy lawn. She is wearing a white shirt paired with a white jacket, and she adorns a necklace and earrings, adding elegance to her appearance. As she speaks, she gracefully tucks her hair behind her ear, adding a natural and expressive touch to her gestures. The woman is walking around an area enclosed by a wooden fence, moving in a gentle arc as she walks past it. The background features a lush green lawn and tent-like structures, creating a serene and refreshing atmosphere. The lighting is ample, highlighting the natural beauty of the scene."



"A close-up shot of a woman speaking to the camera, speaking to the camera while facing straight ahead. The background is slightly blurred, revealing rolling fields and scattered trees bathed in the warm glow of the late afternoon sun. She has long, dark hair tucked behind her ears and wears a fitted leather jacket over a casual top. She is sitting in the driver's seat of a car. Her left hand rests firmly on the steering wheel, fingers gripping it with ease. As the video progresses, she continues speaking while looking straight ahead. The static camera captures her face and upper body."



"A close-up shot of a woman standing in a modern house, facing slightly to the left as she speaks. Her eyes filled with tears and her expression heavy with sorrow. Two distinct streams of tears run down her cheeks, glistening in the soft natural light. Her posture is tense. The background is slightly blurred, showing sleek furniture, large windows, and warm ambient lighting. Her shoulder-length dark hair frames her face, and her eyes sparkle. She is dressed in a stylish beige blouse. She holds a teddy bear tightly. As the video progresses, she continues speaking, fresh tears well up in her eyes, while still facing slightly to the right. The static camera captures her face and upper body."


Failure Cases




Citation

🌟 If you find our work helpful, please cite our paper:

@article{wei2025mocha,
  title={MoCha: Towards Movie-Grade Talking Character Synthesis},
  author={Wei, Cong and Sun, Bo and Ma, Haoyu and Hou, Ji and Juefei-Xu, Felix and He, Zecheng and Dai, Xiaoliang and Zhang, Luxin and Li, Kunpeng and Hou, Tingbo and others},
  journal={arXiv preprint arXiv:2503.23307},
  year={2025}
}