Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two main limitations. (i) They emphasize subject-centric prompts or static-camera scenes, leaving camera motion, which is essential for producing cinematic shots, and the behavior of existing metrics under such dynamic motion largely unexplored. (ii) They typically aggregate video-level scores into a single model-level score for ranking generative models. Such aggregation, however, overlooks video-level evaluation, which is vital for selecting the better video among the candidates generated for a given prompt. To address these gaps, we introduce DynamicEval, a benchmark of systematically curated prompts emphasizing dynamic camera motion, paired with 45k human annotations on video pairs drawn from 3k videos generated by ten T2V models. DynamicEval evaluates two key dimensions of video quality: background scene consistency and foreground object consistency. For background scene consistency, we derive interpretable error maps from the VBench motion smoothness metric. Our key observation from these error maps is that while the VBench motion smoothness metric shows promising alignment with human judgments, it fails in two cases, namely occlusions and disocclusions arising from camera and foreground object movements. Building on this, we propose a new background consistency metric that leverages object error maps to correct these two failure cases in a principled manner. Our second contribution is a foreground consistency metric that tracks points and their neighbors within each object instance to better assess object fidelity. Extensive experiments demonstrate that our proposed metrics correlate more strongly with human preferences at both the video level and the model level (an improvement of more than 2 percentage points), establishing DynamicEval as a more comprehensive benchmark for evaluating T2V models under dynamic camera motion.
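To make the foreground consistency idea concrete, here is a minimal sketch under stated assumptions: point tracks for a single object instance are assumed to come from an off-the-shelf point tracker and to be grouped per instance upstream; the neighbor selection and deviation statistic below are illustrative stand-ins rather than the exact formulation used in the benchmark.

# Sketch: neighbor-based foreground consistency for one object instance.
# Assumes tracks are precomputed by an external point tracker (hypothetical input);
# the scoring function is a simplified illustration of the idea, not the paper's metric.
import numpy as np

def foreground_consistency(tracks: np.ndarray, k: int = 4) -> float:
    """tracks: (T, N, 2) xy positions of N tracked points over T frames."""
    T, N, _ = tracks.shape
    # k nearest neighbors of each point, chosen once in the first frame
    d0 = np.linalg.norm(tracks[0, :, None] - tracks[0, None, :], axis=-1)
    nbrs = np.argsort(d0, axis=1)[:, 1:k + 1]          # skip self (distance 0)
    disp = np.diff(tracks, axis=0)                     # (T-1, N, 2) per-frame motion
    nbr_disp = disp[:, nbrs].mean(axis=2)              # (T-1, N, 2) mean neighbor motion
    dev = np.linalg.norm(disp - nbr_disp, axis=-1)     # deviation from local coherent motion
    return float(np.exp(-dev.mean()))                  # higher = more consistent object

A point whose motion drifts away from that of its neighbors within the same instance signals object deformation or identity breakage, which is the behavior the foreground consistency metric is designed to penalize.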
We show a grid of videos for a set of prompts from our DynamicEval dataset. The columns represent the 10 models, and the rows represent 3 random generations from each model.
Prompt: "In a cozy nursery filled with toys and a crib, a single cat sniffs at food on the floor, while a wide-angle camera pans around, revealing colorful walls and a rocking chair in the background."
Prompt: "In a cinematic pull-out shot, a lone person is seen paragliding over a sprawling junkyard, their shadow gliding over heaps of scrap metal and abandoned cars, revealing the chaotic landscape below."
Prompt: "A drone follows two men carrying a basket through the sprawling, complex machinery of an outdoor power plant, revealing towering structures and intricate pipes against a backdrop of clear blue sky."
Prompt: "An aerial shot captures a cat precariously perched on the edge of a crevasse. In slow motion, the camera moves in to reveal the depth of the icy chasm, then retreats, showing the cat's daring position amidst the expansive landscape."
Prompt: "In a cozy dining room, a woman holds a bottle as the wide-angle camera performs a follow shot, revealing a wooden table and chairs in the foreground and a vintage cabinet in the background."
We propose a debiased background consistency metric that isolates moving objects and occlusion regions to provide a more accurate evaluation of temporal smoothness. Specifically, we construct edge masks around object boundaries using SAM-2 segmentation and morphological operations, and generate foreground masks by extracting dynamic object names from the generation prompt (via LLM) and localizing them with GroundingDINO, followed by mask propagation with SAM-2. These masks are used to down-weight contributions from object regions and occlusion edges when computing motion smoothness errors, yielding debiased error maps that better reflect background temporal stability independent of camera motion or moving objects.
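The sketch below illustrates the final weighting step under stated assumptions: per-frame foreground masks (from the LLM, GroundingDINO, SAM-2 pipeline) and per-frame motion smoothness error maps are assumed to be precomputed, and the specific edge-band width and weights are illustrative choices, not the exact configuration used in the benchmark.

# Sketch: debiasing the motion-smoothness error map with foreground and edge masks.
# Assumes error_maps and fg_masks are precomputed upstream (hypothetical inputs);
# the weighting scheme is an illustration of the down-weighting idea.
import cv2
import numpy as np

def edge_band(fg_mask: np.ndarray, band_px: int = 7) -> np.ndarray:
    """Thin band around object boundaries, widened by morphological dilation."""
    kernel = np.ones((3, 3), np.uint8)
    boundary = cv2.morphologyEx(fg_mask.astype(np.uint8), cv2.MORPH_GRADIENT, kernel)
    band = cv2.dilate(boundary, np.ones((band_px, band_px), np.uint8))
    return band.astype(bool)

def debiased_bg_score(error_maps, fg_masks, fg_weight=0.0, edge_weight=0.0):
    """Weighted average of motion-smoothness errors over background pixels."""
    scores = []
    for err, fg in zip(error_maps, fg_masks):       # per-frame (H, W) arrays
        w = np.ones_like(err, dtype=np.float32)
        w[fg.astype(bool)] = fg_weight              # suppress moving-object regions
        w[edge_band(fg)] = edge_weight              # suppress occlusion/disocclusion band
        scores.append((err * w).sum() / max(w.sum(), 1e-6))
    return 1.0 - float(np.mean(scores))             # higher = more stable background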
@article{babu2025dynamiceval,
title = {{DynamicEval}: Rethinking Evaluation for Dynamic Text-to-Video Synthesis},
author = {Babu, Nithin C and Mahapatra, Aniruddha and Rangwani, Harsh and Soundararajan, Rajiv and Kulkarni, Kuldeep},
year = {2025},
eprint = {2510.07441},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/pdf/2510.07441},
}