By Dave DeFusco
Researchers at the Katz School of Science and Health have developed a new artificial intelligence system that can edit and generate videos using simple text instructions, an advance that could make video creation faster, more flexible and far easier for people without technical expertise.
Published in the journal IEEE Access, their paper, “4EV: Adaptive Video Editing with Spatial Temporal Dynamics and Motion Pathways,” describes the system, called 4EV, which uses artificial intelligence to understand written prompts and then change videos accordingly. A user might type a command such as “make the car drive to the left,” “zoom in on the bird” or “change the background from a city to a beach.” The model then edits the video to match those instructions.
“This research explores how AI can give people much more precise control over video editing through simple text prompts,” said Namrata Patel, a co-author of the study and a graduate of the Katz School’s M.S. in Artificial Intelligence who is now a Ph.D. student in the Graduate Department of Computer Science and Engineering. “Instead of manually adjusting dozens of parameters, users can describe the change they want, and the system generates it while keeping the motion in the video smooth and realistic.”
Artificial intelligence systems have already become skilled at generating images from text descriptions. Popular systems can create pictures of almost anything a user imagines, but video is far more complicated. A video is essentially a sequence of many images played in rapid order. For a computer to generate or edit video successfully, it must ensure that objects remain consistent from one frame to the next and that movement appears natural.
“Video editing with AI isn’t just about creating a single image,” said Lakshmi Priya Ramisetty, a graduate of the M.S. in Artificial Intelligence and now a software engineer with BlueArc. “You also have to preserve the continuity of motion across frames so that a person walking or a car turning behaves the way we expect in the real world.”
“Humans are remarkably sensitive to unnatural motion. We notice immediately when something moves wrong, even if we can’t explain why,” Ramisetty added. “The real challenge in AI video editing is generating changes that don’t just look right in isolation, but feel right as part of a continuous, moving world.”
The 4EV system builds on a class of AI tools known as diffusion models, similar to those behind popular image generators such as Stable Diffusion. These systems start with random visual noise and gradually refine it into a coherent image or video that matches a user’s text description.
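In toy form, that noise-to-image refinement loop can be sketched as follows. The `toy_denoise` function and its fixed step size are illustrative stand-ins: a real diffusion model uses a trained neural network, conditioned on the text prompt, to predict the noise to remove at each step.

```python
import numpy as np

def toy_denoise(target, steps=50, seed=0):
    """Toy illustration of diffusion-style refinement: start from pure
    noise and nudge the sample toward a target a little at each step.
    (Real diffusion models predict the noise with a text-conditioned
    neural network; moving toward a known target stands in for that.)"""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # begin with random visual noise
    for _ in range(steps):
        # A real model estimates the noise to subtract at this step;
        # here we simply move a fixed fraction of the way to the target.
        x = x + 0.1 * (target - x)
    return x

target = np.ones((4, 4))   # stand-in for a "frame" the prompt describes
result = toy_denoise(target)
print(float(np.abs(result - target).mean()))  # small residual remains
```

After 50 steps, only a factor of 0.9^50 of the original noise survives, which is why the final sample sits very close to the target.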
To handle motion, the researchers designed a method called spatial-temporal attention. In simple terms, the AI learns to pay attention to both space (what appears in each frame) and time (how things move between frames). This helps objects stay consistent as they move.
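A minimal sketch of the two attention passes, using toy NumPy tensors: spatial attention lets patches within one frame attend to each other, while temporal attention lets each patch position attend across frames. The shapes, the `attention` helper and the additive combination are assumptions for illustration, not the 4EV architecture itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

# Illustrative shapes: T frames, N spatial patches per frame, d channels.
T, N, d = 6, 16, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((T, N, d))

# Spatial attention: within each frame, the N patches attend to one another.
spatial_out = attention(x, x, x)                          # shape (T, N, d)

# Temporal attention: each patch position attends across the T frames.
xt = np.swapaxes(x, 0, 1)                                 # shape (N, T, d)
temporal_out = np.swapaxes(attention(xt, xt, xt), 0, 1)   # back to (T, N, d)

out = spatial_out + temporal_out  # combined spatial-temporal update
print(out.shape)
```

The key point is the axis swap: the same attention operation tracks appearance when applied within a frame and tracks motion when applied across frames.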
The team also created a custom training dataset called Motion4EV, which contains videos paired with text prompts describing movement. These prompts include instructions for different kinds of motion, such as objects following paths, zooming in or out or changing direction. The dataset was designed specifically to help the model understand motion dynamics.
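The article does not give the Motion4EV schema, but a motion-annotated video-prompt pair might look something like this hypothetical record (field names and structure are assumptions for illustration):

```python
import json

# Hypothetical layout for one training pair in a motion-focused dataset;
# the actual Motion4EV schema is not specified in the article.
record = {
    "video": "clips/car_0001.mp4",           # path to the source clip
    "prompt": "make the car drive to the left",
    "motion": {
        "type": "translate",                 # e.g. translate / zoom / turn
        "path": [[0.9, 0.5], [0.5, 0.5], [0.1, 0.5]],  # normalized waypoints
    },
}
print(json.dumps(record, indent=2))
```

Pairing each clip with both a natural-language prompt and an explicit motion annotation is what lets a model learn the mapping from phrases like “drive to the left” to actual trajectories.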
“We generated many examples of objects moving in different ways so the model could learn how motion behaves in video,” said Jialu Li, a student in the M.S. in Artificial Intelligence. “By exposing the system to diverse motion patterns, we trained it to respond more accurately to prompts describing movement.”
The researchers also introduced new components in the model architecture that allow the AI to preserve motion from the original video while adding changes requested in text prompts. For example, the system can replace an object, such as turning a bicycle into a motorcycle, while keeping the same motion path.
One challenge in AI-generated media is ensuring that the output actually matches the user鈥檚 instructions. To address this, the team implemented a technique called attention map injection, which helps the system align the text description with the visual content more precisely.
“Attention maps help the model understand which parts of the image correspond to the text prompt,” said Aditya Singh Parmar, a graduate of the M.S. in Artificial Intelligence who is now a machine vision engineer with Corning Incorporated. “By injecting these maps into the system, we can guide the model so it modifies the correct objects without disrupting the rest of the scene.”
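A toy version of the idea might look like the following. The function name, matrix shapes and simple column-copy rule are assumptions for illustration; real systems perform the injection inside a diffusion model’s cross-attention layers, where rows are spatial locations and columns are text tokens.

```python
import numpy as np

def inject_attention(edited_attn, source_attn, token_idx):
    """Toy sketch of attention map injection: copy the source video's
    attention map for one chosen text token into the editing pass, so
    the edit stays anchored to the same spatial region. (Simplified;
    real systems do this inside cross-attention layers.)"""
    out = edited_attn.copy()
    out[:, token_idx] = source_attn[:, token_idx]  # reuse source map for that token
    # Renormalize so each spatial location's weights still sum to 1.
    return out / out.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n_patches, n_tokens = 16, 5
source = rng.random((n_patches, n_tokens))
source /= source.sum(axis=1, keepdims=True)
edited = rng.random((n_patches, n_tokens))
edited /= edited.sum(axis=1, keepdims=True)

result = inject_attention(edited, source, token_idx=2)
print(np.allclose(result.sum(axis=1), 1.0))  # rows remain valid distributions
```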
The researchers evaluated their system against several existing AI video-editing models, including systems such as Text2LIVE and CogVideo. In key tests measuring how closely the video matched the prompt and how smoothly motion was preserved, 4EV performed better than competing systems.
One metric, known as a CLIP score, measures how well the generated visuals match the written description. The team reported that 4EV achieved higher scores than other models, indicating stronger alignment between text and video content. While the research is still in development, the technology could have wide-ranging applications.
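At its core, a CLIP score is a cosine similarity between an embedding of the generated visuals and an embedding of the prompt, averaged over frames. This sketch uses random stand-in vectors rather than a trained CLIP encoder, but it shows why a frame aligned with the prompt scores higher than unrelated content:

```python
import numpy as np

def clip_style_score(image_emb, text_emb):
    """Cosine similarity between a frame embedding and a text embedding,
    the quantity a CLIP score averages over frames. A real CLIP score
    uses embeddings from the trained CLIP model; these are stand-ins."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

rng = np.random.default_rng(0)
text = rng.standard_normal(512)
aligned_frame = text + 0.1 * rng.standard_normal(512)  # close to the prompt
unrelated_frame = rng.standard_normal(512)             # unrelated content

aligned = clip_style_score(aligned_frame, text)
unrelated = clip_style_score(unrelated_frame, text)
print(aligned > unrelated)
```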
In film and media production, AI-driven editing tools could speed up visual effects and pre-production work. Game designers might use the technology to generate animated scenes quickly. In education, teachers could create customized visual demonstrations simply by writing a prompt. The system may also support emerging technologies such as virtual reality, where environments must change dynamically in response to user actions.
“Our goal is to make video creation more accessible and interactive,” said Patel. “AI systems like 4EV could allow creators to experiment with ideas quickly, even if they don’t have extensive editing experience.”