¹University of Waterloo ²Kuaishou Technology
*Work done during an internship at KwaiVGI, Kuaishou Technology. †Corresponding authors.
Instruction: "Panoramic shot, a man leaning against a tree, playing a beautiful melody on the guitar in his hand."
Instruction: "A man dressed in a vibrant Hawaiian shirt with a colorful floral pattern, sits on a beach lounge chair. On his shoulder, a Pikachu with a small detective hat perches. The man holds an ice cream cone, taking a bite."
Instruction: "A man wearing a black T-shirt rides a majestic tiger across a sunlit plain. He holds a giant RTX 4090 graphics card in one hand, maintaining perfect balance as the tiger moves gracefully."
Instruction: "Panoramic shot of a dog pulling a Christmas sleigh, camera pans around the dog, from the side to the back"
Instruction: "A cute dog strolls through the supermarket, pushing a tiny shopping cart with its paws."
Instruction: "In a cozy, softly lit living room, the woman sits casually on a plush couch, showcasing a very small fragrance bottle. She is wearing a cream jacket with green patterns."
Instruction: "Camera follows a car driving on the road"
Instruction: "Mona Lisa wearing casual sportswear and walking with a dog in front of a building"
Instruction: "Lego tiger is walking in the forest"
Instruction: "A dog rides an electric bike on the lunar surface"
Instruction: "An anthropomorphic dog in a jacket plays guitar on stage"
Instruction: "A woman steps out of the car, gracefully sits on the sofa, and reaches for the bag."
Visual Prompt Understanding
In Context Editing
Instruction: "Superman flies in from the right side of the frame and lands on the wing."
Instruction: "Replace the woman in the video with the man from the reference image, and add a burning effect to the barbell."
Instruction: "Add Wukong to the left side of the video."
Instruction: "Add a pale-skinned male with a skull-like face, dressed in a dark, form-fitting outfit, sitting on a sofa in the room, eating a red apple"
Instruction: "Add the pair of glasses from the reference image to the woman in the video."
Instruction: "Add a little flying dragon near the man."
Instruction: "Add Superman from the left side and Batman from the right side, walking along with the person in the video."
Instruction: "Change the sofa the man is sitting on in the video into a car."
Instruction: "Replace the dog with a robotic quadruped dog."
Instruction: "Replace the dog in the video with the alpaca in the reference image."
Instruction: "Replace the man in the video with the polar bear in the image."
Instruction: "Remove the horse and turn the background into autumn."
Instruction-based Editing
Instruction: "Change the material of the characters from bread to ice"
Instruction: "After opening the refrigerator, what you see is a small erupting volcano"
Instruction: "Green screen the man and the woman"
Instruction: "Change the bench in the video to yellow, and add a woman"
Instruction: "Convert the video to Miyazaki Hayao style."
/ "Set the background during a storm, with a tornado in the distance and lightning flashing in the sky."
/ "Transform the photorealistic color video of a dragon flight into a black and white pencil sketch animation."
Instruction: "Change the environment to nighttime, with the car headlights turned on."
/ "Change the environment to nighttime, turn on the car’s headlights, and change the car’s color to red."
Instruction: "Make the woman look like glass."
Instruction: "Convert the video to a 3D Pixar style."
Instruction: "Remove the people walking around in the scene."
Instruction: "The man and woman are surrounded by snow."
/ "The man and woman are surrounded by flames."
Instruction: "Change the white T-shirt in the video to blue."