By Dave DeFusco
Researchers at the Katz School of Science and Health have developed a new artificial intelligence system that can edit and generate videos using simple text instructions, an advance that could make video creation faster, more flexible and far easier for people without technical expertise.
Published in the journal IEEE Access, their paper, “4EV: Adaptive Video Editing with Spatial Temporal Dynamics and Motion Pathways,” describes the system, called 4EV, which uses artificial intelligence to understand written prompts and then change videos accordingly. A user might type a command such as “make the car drive to the left,” “zoom in on the bird” or “change the background from a city to a beach.” The model then edits the video to match those instructions.
“This research explores how AI can give people much more precise control over video editing through simple text prompts,” said Namrata Patel, a co-author of the study and a graduate of the Katz School’s M.S. in Artificial Intelligence who is now a Ph.D. student in the Graduate Department of Computer Science and Engineering. “Instead of manually adjusting dozens of parameters, users can describe the change they want, and the system generates it while keeping the motion in the video smooth and realistic.”
Artificial intelligence systems have already become skilled at generating images from text descriptions. Popular systems can create pictures of almost anything a user imagines, but video is far more complicated. A video is essentially a sequence of many images played in rapid order. For a computer to generate or edit video successfully, it must ensure that objects remain consistent from one frame to the next and that movement appears natural.
“Video editing with AI isn’t just about creating a single image,” said Lakshmi Priya Ramisetty, a graduate of the M.S. in Artificial Intelligence and now a software engineer with BlueArc. “You also have to preserve the continuity of motion across frames so that a person walking or a car turning behaves the way we expect in the real world.”
“Humans are remarkably sensitive to unnatural motion. We notice immediately when something moves wrong, even if we can’t explain why,” Ramisetty added. “The real challenge in AI video editing is generating changes that don’t just look right in isolation, but feel right as part of a continuous, moving world.”
The 4EV system builds on a class of AI tools known as diffusion models, similar to those behind popular image generators such as Stable Diffusion. These systems start with random visual noise and gradually refine it into a coherent image or video that matches a user’s text description.
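In toy form, that noise-to-image refinement loop can be sketched as follows. The `toy_denoise` function and its fixed step size are illustrative stand-ins: a real diffusion model uses a trained neural network, conditioned on the text prompt, to predict the noise to remove at each step.

```python
import numpy as np

def toy_denoise(target, steps=50, seed=0):
    """Toy illustration of diffusion-style refinement: start from pure
    noise and nudge the sample toward a target a little at each step.
    (Real diffusion models predict the noise with a text-conditioned
    neural network; moving toward a known target stands in for that.)"""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # begin with random visual noise
    for _ in range(steps):
        # A real model estimates the noise to subtract at this step;
        # here we simply move a fixed fraction of the way to the target.
        x = x + 0.1 * (target - x)
    return x

target = np.ones((4, 4))   # stand-in for a "frame" the prompt describes
result = toy_denoise(target)
print(float(np.abs(result - target).mean()))  # small residual remains
```

After 50 steps, only a factor of 0.9^50 of the original noise survives, which is why the final sample sits very close to the target.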
To handle motion, the researchers designed a method called spatial-temporal attention. In simple terms, the AI learns to pay attention to both space (what appears in each frame) and time (how things move between frames). This helps objects stay consistent as they move.
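A minimal sketch of the two attention passes, using toy NumPy tensors: spatial attention lets patches within one frame attend to each other, while temporal attention lets each patch position attend across frames. The shapes, the `attention` helper and the additive combination are assumptions for illustration, not the 4EV architecture itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

# Illustrative shapes: T frames, N spatial patches per frame, d channels.
T, N, d = 6, 16, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((T, N, d))

# Spatial attention: within each frame, the N patches attend to one another.
spatial_out = attention(x, x, x)                          # shape (T, N, d)

# Temporal attention: each patch position attends across the T frames.
xt = np.swapaxes(x, 0, 1)                                 # shape (N, T, d)
temporal_out = np.swapaxes(attention(xt, xt, xt), 0, 1)   # back to (T, N, d)

out = spatial_out + temporal_out  # combined spatial-temporal update
print(out.shape)
```

The key point is the axis swap: the same attention operation tracks appearance when applied within a frame and tracks motion when applied across frames.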
The team also created a custom training dataset called Motion4EV, which contains videos paired with text prompts describing movement. These prompts include instructions for different kinds of motion, such as objects following paths, zooming in or out or changing direction. The dataset was designed specifically to help the model understand motion dynamics.
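The article does not give the Motion4EV schema, but a motion-annotated video-prompt pair might look something like this hypothetical record (field names and structure are assumptions for illustration):

```python
import json

# Hypothetical layout for one training pair in a motion-focused dataset;
# the actual Motion4EV schema is not specified in the article.
record = {
    "video": "clips/car_0001.mp4",           # path to the source clip
    "prompt": "make the car drive to the left",
    "motion": {
        "type": "translate",                 # e.g. translate / zoom / turn
        "path": [[0.9, 0.5], [0.5, 0.5], [0.1, 0.5]],  # normalized waypoints
    },
}
print(json.dumps(record, indent=2))
```

Pairing each clip with both a natural-language prompt and an explicit motion annotation is what lets a model learn the mapping from phrases like “drive to the left” to actual trajectories.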
“We generated many examples of objects moving in different ways so the model could learn how motion behaves in video,” said Jialu Li, a student in the M.S. in Artificial Intelligence. “By exposing the system to diverse motion patterns, we trained it to respond more accurately to prompts describing movement.”
The researchers also introduced new components in the model architecture that allow the AI to preserve motion from the original video while adding changes requested in text prompts. For example, the system can replace an object, such as turning a bicycle into a motorcycle, while keeping the same motion path.
One challenge in AI-generated media is ensuring that the output actually matches the user鈥檚 instructions. To address this, the team implemented a technique called attention map injection, which helps the system align the text description with the visual content more precisely.
“Attention maps help the model understand which parts of the image correspond to the text prompt,” said Aditya Singh Parmar, a graduate of the M.S. in Artificial Intelligence who is now a machine vision engineer with Corning Incorporated. “By injecting these maps into the system, we can guide the model so it modifies the correct objects without disrupting the rest of the scene.”
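A toy version of the idea might look like the following. The function name, matrix shapes and simple column-copy rule are assumptions for illustration; real systems perform the injection inside a diffusion model’s cross-attention layers, where rows are spatial locations and columns are text tokens.

```python
import numpy as np

def inject_attention(edited_attn, source_attn, token_idx):
    """Toy sketch of attention map injection: copy the source video's
    attention map for one chosen text token into the editing pass, so
    the edit stays anchored to the same spatial region. (Simplified;
    real systems do this inside cross-attention layers.)"""
    out = edited_attn.copy()
    out[:, token_idx] = source_attn[:, token_idx]  # reuse source map for that token
    # Renormalize so each spatial location's weights still sum to 1.
    return out / out.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n_patches, n_tokens = 16, 5
source = rng.random((n_patches, n_tokens))
source /= source.sum(axis=1, keepdims=True)
edited = rng.random((n_patches, n_tokens))
edited /= edited.sum(axis=1, keepdims=True)

result = inject_attention(edited, source, token_idx=2)
print(np.allclose(result.sum(axis=1), 1.0))  # rows remain valid distributions
```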
The researchers evaluated their system against several existing AI video-editing models, including systems such as Text2LIVE and CogVideo. In key tests measuring how closely the video matched the prompt and how smoothly motion was preserved, 4EV performed better than competing systems.
One metric, known as a CLIP score, measures how well the generated visuals match the written description. The team reported that 4EV achieved higher scores than other models, indicating stronger alignment between text and video content. While the research is still in development, the technology could have wide-ranging applications.
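At its core, a CLIP score is a cosine similarity between an embedding of the generated visuals and an embedding of the prompt, averaged over frames. This sketch uses random stand-in vectors rather than a trained CLIP encoder, but it shows why a frame aligned with the prompt scores higher than unrelated content:

```python
import numpy as np

def clip_style_score(image_emb, text_emb):
    """Cosine similarity between a frame embedding and a text embedding,
    the quantity a CLIP score averages over frames. A real CLIP score
    uses embeddings from the trained CLIP model; these are stand-ins."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return float(a @ b)

rng = np.random.default_rng(0)
text = rng.standard_normal(512)
aligned_frame = text + 0.1 * rng.standard_normal(512)  # close to the prompt
unrelated_frame = rng.standard_normal(512)             # unrelated content

aligned = clip_style_score(aligned_frame, text)
unrelated = clip_style_score(unrelated_frame, text)
print(aligned > unrelated)
```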
In film and media production, AI-driven editing tools could speed up visual effects and pre-production work. Game designers might use the technology to generate animated scenes quickly. In education, teachers could create customized visual demonstrations simply by writing a prompt. The system may also support emerging technologies such as virtual reality, where environments must change dynamically in response to user actions.
“Our goal is to make video creation more accessible and interactive,” said Patel. “AI systems like 4EV could allow creators to experiment with ideas quickly, even if they don’t have extensive editing experience.”