UniVideo

Unified Understanding, Generation, and Editing for Videos
Cong Wei*¹,², Quande Liu†², Zixuan Ye², Qiulin Wang², Xintao Wang²,
Pengfei Wan², Kun Gai², Wenhu Chen†¹
¹University of Waterloo ²Kuaishou Technology
*Work done during an internship at KwaiVGI, Kuaishou Technology. †Corresponding authors

In‑Context Generation

Reference image

Instruction: "Panoramic shot, a man leaning against a tree, playing a beautiful melody on the guitar in his hand."

Reference image 1 Reference image 2 Reference image 3

Instruction: "A man dressed in a vibrant Hawaiian shirt with a colorful floral pattern, sits on a beach lounge chair. On his shoulder, a Pikachu with a small detective hat perches. The man holds an ice cream cone, taking a bite."

Reference image 1 Reference image 2

Instruction: "A man wearing a black T-shirt rides a majestic tiger across a sunlit plain. He holds a giant RTX 4090 graphics card in one hand, maintaining perfect balance as the tiger moves gracefully."

Reference image

Instruction: "Panoramic shot of a dog pulling a Christmas sleigh, camera pans around the dog, from the side to the back"

Reference image

Instruction: "A cute dog strolls through the supermarket, pushing a tiny shopping cart with its paws."

Reference image 1 Reference image 2 Reference image 3

Instruction: "In a cozy, softly lit living room, the woman sits casually on a plush couch, showcasing a very small fragrance bottle. She is wearing a cream jacket with green patterns."

Reference image 1 Reference image 2 Reference image 3

Instruction: "Camera follows a car driving on the road"

Reference image 1 Reference image 2 Reference image 3 Reference image 4

Instruction: "Mona Lisa wearing casual sportswear and walking with a dog in front of a building"

Reference image 1 Reference image 2

Instruction: "Lego tiger is walking in the forest"

Reference image 1 Reference image 2 Reference image 3

Instruction: "A dog rides an electric bike on the lunar surface"

Reference image 1 Reference image 2 Reference image 3 Reference image 4

Instruction: "An anthropomorphic dog in a jacket plays guitar on stage"

Reference image 1 Reference image 2 Reference image 3 Reference image 4

Instruction: "A woman steps out of the car, gracefully sits on the sofa, and reaches for the bag."

Visual Prompt Understanding

Input Visual prompt image
Output
Input Visual prompt image
Output
Input Visual prompt image
Output
Input Visual prompt image
Output

In‑Context Editing

Input1 Reference image
Input2
Output

Instruction: "Superman flies in from the right side of the frame and lands on the wing."

Input1 Reference image
Input2
Output

Instruction: "Replace the woman in the video with the man from the reference image, and add a burning effect to the barbell."

Input1 Reference image
Input2
Output

Instruction: "Add Wukong to the left side of the video."

Input1 Reference image
Input2 Reference image
Input3
Output

Instruction: "Add a pale-skinned male with a skull-like face, dressed in a dark, form-fitting outfit, sitting on a sofa in the room, eating a red apple"

Input1 Reference image
Input2
Output

Instruction: "Add the pair of glasses from the reference image to the woman in the video."

Input1 Reference image
Input2
Output

Instruction: "Add a little flying dragon near the man."

Input1 Reference image
Input2 Reference image
Input3
Output

Instruction: "Add Superman from the left side and Batman from the right side, walking along with the person in the video."

Input1 Reference image
Input2
Output

Instruction: "Change the sofa the man is sitting on in the video into a car."

Input1 Reference image
Input2
Output

Instruction: "Replace the dog with a robotic quadruped dog."

Input1 Reference image
Input2
Output

Instruction: "Replace the dog in the video with the alpaca in the reference image."

Input1 Reference image
Input2
Output

Instruction: "Replace the man in the video with the polar bear in the image."

Input1 Reference image
Input2
Output

Instruction: "Remove the horse and turn the background into autumn."

Instruction-based Editing

Input
Output

Instruction: "Change the material of the characters from bread to ice"

Input
Output

Instruction: "After opening the refrigerator, what you see is a small erupting volcano"

Input
Output

Instruction: "Green screen the man and the woman"

Input
Output

Instruction: "Change the bench in the video to yellow, and add a woman"

Input
Output 1
Output 2
Output 3

Instruction: "Convert the video to Miyazaki Hayao style."

/ "Set the background during a storm, with a tornado in the distance and lightning flashing in the sky."

/ "Transform the photorealistic color video of a dragon flight into a black and white pencil sketch animation."

Input
Output 1
Output 2

Instruction: "Change the environment to nighttime, with the car headlights turned on."

/ "Change the environment to nighttime, turn on the car’s headlights, and change the car’s color to red."

Input
Output

Instruction: "Make the woman look like glass."

Input
Output

Instruction: "Convert the video to a 3D Pixar style."

Input
Output

Instruction: "Remove the people walking around in the scene."

Input
Output 1
Output 2

Instruction: "The man and woman are surrounded by snow."

/ "The man and woman are surrounded by flames."

Input
Output

Instruction: "Change the white T-shirt in the video to blue."