UniVideo: Unified Understanding, Generation, and Editing for Videos
Cong Wei*¹,², Quande Liu†², Zixuan Ye², Qiulin Wang², Xintao Wang²,
Pengfei Wan², Kun Gai², Wenhu Chen†¹
¹University of Waterloo, ²Kling Team, Kuaishou Technology
*Work done during an internship at Kling Team, Kuaishou Technology. †Corresponding authors.

In‑Context Generation

Reference image

Instruction: "Panoramic shot, a man leaning against a tree, playing a beautiful melody on the guitar in his hand."

Reference image 1 Reference image 2 Reference image 3

Instruction: "A man dressed in a vibrant Hawaiian shirt with a colorful floral pattern, sits on a beach lounge chair. On his shoulder, a Pikachu with a small detective hat perches. The man holds an ice cream cone, taking a bite."

Reference image 1 Reference image 2

Instruction: "A man wearing a black T-shirt rides a majestic tiger across a sunlit plain. He holds a giant RTX 4090 graphics card in one hand, maintaining perfect balance as the tiger moves gracefully."

Reference image 1 Reference image 2 Reference image 3

Instruction: "Wu kong, clad in ornate golden armor adorned with intricate red and black patterns, strides confidently through the aisles of a brightly lit modern supermarket."

Reference image 1 Reference image 2

Instruction: "A futuristic stainless-steel Tesla Cybertruck glides smoothly across the surface of a calm ocean under bright sunlight"

Reference image 1 Reference image 2

Instruction: "A highly detailed Lego tank, constructed from interlocking bricks, rolls steadily through a dense, sun-dappled forest."

Visual Prompt Understanding

(Figure: four examples, each pairing an input visual prompt image with its output.)

In-Context Editing

(Figure: nine examples. Eight take one reference image (Input1) and a second input (Input2); one takes two reference images (Input1, Input2) and a third input (Input3). Each is shown with its output.)

Free-form Editing

(Figure: five inputs shown with their edited outputs.)