MoCha

Towards Movie-Grade Talking Character Synthesis

Cong Wei^{* 1,2} Bo Sun^{† 1} Haoyu Ma¹ Ji Hou¹
Felix Juefei-Xu¹ Zecheng He¹ Xiaoliang Dai¹ Luxin Zhang¹ Kunpeng Li¹
Tingbo Hou¹ Animesh Sinha¹ Peter Vajda¹ Wenhu Chen²

¹GenAI, Meta ²University of Waterloo
^*Work done during the first author’s internship at GenAI, Meta ^†Project Lead

arXiv Paper 🤗 HF Paper MoChaBench 🤗 Visualizing MoChaBench 🤗 Visualizing Our Results on MoChaBench 📚 Citation

MoCha is a pioneering model for Dialogue-driven Movie Shot Generation.

All talking characters are generated solely from Speech + Text. Click ▶️ to bring them to life.
The full set with prompts is available at 🤗Our-Results-on-MoChaBench☕.
We released a benchmark 🤗MoChaBench☕ tailored for Dialogue-driven Movie Shot Generation.
When combined with a TTS model, MoCha can also achieve Text → Speech + Video generation, similar to the Veo 3 setting. All videos presented in this project are solely for research demonstration purposes and have no commercial use.

Camera Control

A tilt up shot ...

A tracking low-angle shot ...

A camera tracking shot ...

A dolly zoom shot ...

A medium dolly-in shot ...

The camera pans slowly to the right ...

A handheld shot ...

A camera pulls out shot ...

A circling shot ...

Emotion Control

... A sad expression ...

... His face glowing with joy ...

... His expression filled with warmth and quiet joy ...

... A serious expression ...

... Joyful expression ...

... Sadness ...

... Angry Expression ...

... A sad expression ...

... A concerned expression ...

... Her expression is filled with sadness ...

Action Control

... She waves her tennis racket ...

... He speaks while stirring the food ...

... She gracefully tucks her hair behind her ear ...

... He raises his hand in a confident thumbs-up gesture ...

... painting with a paintbrush on a canvas ...

... holding its long trunk gently with his hands ...

... He speaks while getting ready to dribble ...

... He speaks while adjusting the knot ...

... He speaks while chopping vegetables ...

... She looks down at her sleeping white and brown tabby cat with a tender, sorrowful gaze, then slowly lifts her eyes and turns to face slightly to the right of the frame...

... As the video progresses, she raises her hand to comb her hair back ...

... She grips a teddy bear tightly. Suddenly, she throws it away with force ...

Multi-Characters

Multi-character Conversation with Turn-based Dialogue

Portrait Talking Characters

Text → Video + Audio

We provide a pipeline solution that directly generates video and audio from text.
We first feed the character's spoken text into a TTS model to synthesize speech.
Then, we provide both the text prompt and the synthesized speech to MoCha for video generation.
Below is a Text → Video + Audio example generated using MoCha in combination with the OpenAI TTS model.

SPOKEN TEXT: "Remember. With great power comes great responsibility"
+
PROMPT: "A beautiful woman, aged between 25 and 35, is dressed in a professional yet chic outfit, emphasizing her intellectual character. She exudes a sense of sophistication and confidence. The spring and autumn seasons are reflected in her attire, which could be a stylish blazer paired with a fashionable blouse, marking her as a host. She is captured in an outdoor setting on a high-end street, possibly lined with upscale boutiques or cafes. Despite being outside, she maintains a professional demeanor, gesturing with her hands while speaking. She raises her right hand slightly, palm facing outward. She speaks into a microphone or engages with an unseen audience, showing her poise and eloquence as a host."
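The two-stage pipeline described above (spoken text to speech, then prompt plus speech to video) can be sketched as follows. This is only an illustrative outline: the page does not document a MoCha API, so `text_to_video_and_audio`, `TalkingClip`, and both callable signatures are hypothetical, and the two stand-in back ends exist purely so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: neither signature comes from MoCha or from a TTS library.
SpeechSynthesizer = Callable[[str], bytes]      # spoken text -> speech audio
VideoGenerator = Callable[[str, bytes], bytes]  # (text prompt, speech audio) -> video


@dataclass
class TalkingClip:
    """Output of the two-stage pipeline: the speech track and the video."""
    audio: bytes
    video: bytes


def text_to_video_and_audio(
    spoken_text: str,
    prompt: str,
    synthesize_speech: SpeechSynthesizer,
    generate_video: VideoGenerator,
) -> TalkingClip:
    """Run TTS on the spoken text, then condition video generation on prompt + speech."""
    if not spoken_text.strip():
        raise ValueError("spoken_text must not be empty")
    audio = synthesize_speech(spoken_text)  # stage 1: text -> speech
    video = generate_video(prompt, audio)   # stage 2: prompt + speech -> video
    return TalkingClip(audio=audio, video=video)


# Stand-in back ends (assumptions for illustration only, not real models).
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_video_model(prompt: str, speech: bytes) -> bytes:
    return b"VIDEO:" + prompt.encode("utf-8") + b"|" + speech


if __name__ == "__main__":
    clip = text_to_video_and_audio(
        "Remember. With great power comes great responsibility",
        "A host speaking on a high-end street",
        fake_tts,
        fake_video_model,
    )
    print(len(clip.audio), len(clip.video))
```

Passing the two stages in as callables keeps the sketch independent of any particular TTS service or video model; a real setup would replace the stand-ins with actual back ends.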

Comparison with SoTA

"A realistic medium shot with smooth camera movement captures a charming woman outdoors on a grassy lawn. She is wearing a white shirt paired with a white jacket, and she adorns a necklace and earrings, adding elegance to her appearance. As she speaks, she gracefully tucks her hair behind her ear, adding a natural and expressive touch to her gestures. The woman is walking around an area enclosed by a wooden fence, moving in a gentle arc as she walks past it. The background features a lush green lawn and tent-like structures, creating a serene and refreshing atmosphere. The lighting is ample, highlighting the natural beauty of the scene."

MoCha (Ours)

Hallo3

SadTalker

AniPortrait

"A close-up shot of a woman speaking to the camera, speaking to the camera while facing straight ahead. The background is slightly blurred, revealing rolling fields and scattered trees bathed in the warm glow of the late afternoon sun. She has long, dark hair tucked behind her ears and wears a fitted leather jacket over a casual top. She is sitting in the driver's seat of a car. Her left hand rests firmly on the steering wheel, fingers gripping it with ease. As the video progresses, she continues speaking while looking straight ahead. The static camera captures her face and upper body."

MoCha (Ours)

Hallo3

SadTalker

AniPortrait

"A close-up shot of a woman standing in a modern house, facing slightly to the left as she speaks. Her eyes filled with tears and her expression heavy with sorrow. Two distinct streams of tears run down her cheeks, glistening in the soft natural light. Her posture is tense. The background is slightly blurred, showing sleek furniture, large windows, and warm ambient lighting. Her shoulder-length dark hair frames her face, and her eyes sparkle. She is dressed in a stylish beige blouse. She holds a teddy bear tightly. As the video progresses, she continues speaking, fresh tears well up in her eyes, while still facing slightly to the right. The static camera captures her face and upper body."

MoCha (Ours)

Hallo3

SadTalker

AniPortrait

Failure Cases

If the caption is too vague and does not describe facial attributes or the shot type, the model may generate a wide shot in which the character is far from the camera, making lip sync difficult to observe. One example is the prompt 'A man playing skateboard at a skatepark'.

When multiple characters appear in the same scene, lip-sync quality sometimes degrades for a character who is far from the camera, potentially due to limited training data. One example is the prompt: "A medium shot set in a dimly lit tavern. The central figure, a rugged man with long, wet-looking hair and a thick beard, sits on a rustic wooden bench. He wears a weathered wool cloak over leather armor, evoking the image of a battle-hardened warrior. His expression is intense as he speaks, holding a short sword in his right hand. To his left, another man with tied-back hair and fur-lined garments watches him closely."

When the speech classifier-free guidance (CFG) scale is increased from the default value of 7.5 to 12, the model tends to generate overly expressive characters. One example is the prompt: 'A tracking shot circling around the man as he ties a tie over his blue suit. He speaks to the camera while adjusting the knot, maintaining eye contact throughout.'
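To see why a larger CFG value exaggerates the speech-driven behavior, here is a minimal sketch of the standard classifier-free guidance formula, which extrapolates from the unconditional prediction toward the conditional one. It assumes a single conditioning signal and plain lists of floats; the page does not specify how MoCha combines speech and text guidance, so `apply_cfg` and the toy inputs are hypothetical.

```python
from typing import List, Sequence

DEFAULT_SPEECH_CFG = 7.5  # default value quoted in the text above


def apply_cfg(uncond: Sequence[float], cond: Sequence[float], scale: float) -> List[float]:
    """Standard CFG: uncond + scale * (cond - uncond), applied element-wise."""
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    # Toy predictions: only the first component depends on the condition.
    uncond = [0.0, 1.0]
    cond = [1.0, 1.0]
    print(apply_cfg(uncond, cond, DEFAULT_SPEECH_CFG))  # [7.5, 1.0]
    print(apply_cfg(uncond, cond, 12.0))                # [12.0, 1.0]
```

In this toy case the condition-dependent component grows linearly with the scale, while the component the condition does not affect stays unchanged. This is consistent with the more exaggerated expressions reported at a scale of 12.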

Citation

🌟 If you find our work helpful, please cite our paper:

@article{wei2025mocha,
  title={MoCha: Towards Movie-Grade Talking Character Synthesis},
  author={Wei, Cong and Sun, Bo and Ma, Haoyu and Hou, Ji and Juefei-Xu, Felix and He, Zecheng and Dai, Xiaoliang and Zhang, Luxin and Li, Kunpeng and Hou, Tingbo and others},
  journal={arXiv preprint arXiv:2503.23307},
  year={2025}
}