Overview
Vista4D is a video reshooting framework that synthesizes the dynamic scene captured by an input source video from novel camera trajectories and viewpoints. We bridge the distribution shift between training and inference for point-cloud-grounded video reshooting: by training on noisy, reconstructed multiview videos, Vista4D is robust to point cloud artifacts from imprecise 4D reconstruction of real-world videos. Our 4D point cloud with temporally persistent static points also explicitly preserves scene content and improves camera control. Vista4D generalizes to real-world applications such as dynamic scene expansion (using a casual video capture of the scene as a background reference), 4D scene recomposition (point cloud editing), and long video inference with memory.
Video reshooting
Comparison to baselines
We show qualitative comparisons of Vista4D to baseline methods on a variety of real-world monocular videos and target cameras.
4D scene recomposition
We directly edit the 4D point cloud to manipulate the dynamic scene. To prevent conditioning conflicts between the unedited input source video (column 1) and the render of the edited point cloud (column 3), we instead condition on the edited source video (column 2): the edited point cloud rendered from the source cameras without static pixel temporal persistence.
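The conditioning choice above can be sketched as follows; `render`, the per-frame point cloud list, and the camera arguments are hypothetical stand-ins for the actual rendering and reconstruction code, not the paper's implementation:

```python
def recomposition_conditions(edited_clouds, source_cams, target_cams, render):
    """Build conditioning signals for 4D scene recomposition (a sketch).

    Conditioning on the *edited* source video (the edited per-frame point
    clouds rendered from the source cameras, without static pixel temporal
    persistence) keeps the video condition consistent with the edited
    point cloud render from the target cameras.
    """
    # Column 2: edited point cloud rendered from the source cameras.
    edited_source_video = [render(cloud, cam)
                           for cloud, cam in zip(edited_clouds, source_cams)]
    # Column 3: edited point cloud rendered from the target cameras.
    target_render = [render(cloud, cam)
                     for cloud, cam in zip(edited_clouds, target_cams)]
    return edited_source_video, target_render
```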
Dynamic scene expansion
We jointly reconstruct the source video and the casual scene capture into a 4D point cloud that contains (static) scene information beyond the source video, which grounds the model's generative priors in the real-world capture and improves camera accuracy. We show inference without (top) and with (bottom) this casual scene capture, along with their corresponding point cloud render conditions.
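A minimal sketch of how the jointly reconstructed capture points might augment each frame's cloud; the function names and the flat `(N, 6)` xyz+rgb layout are illustrative assumptions, not the paper's actual data format:

```python
import numpy as np

def expand_scene(source_clouds, capture_static_points):
    """Append static points from the casual scene capture (assumed jointly
    reconstructed into the same world frame) to every per-frame point
    cloud, so renders contain scene content beyond the source video."""
    return [np.concatenate([cloud, capture_static_points], axis=0)
            for cloud in source_clouds]
```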
Long video inference
We chunk the source video into clips and run inference clip-by-clip. To explicitly preserve generated content, we continuously integrate the point cloud from the newly generated video clip into the existing one after each inference pass. This is visualized in the point cloud render below, where the point cloud grows after each clip. To ensure seamless transitions between clips, we run inference with a Vista4D checkpoint trained from a separate image-to-video model (our main checkpoint is trained from a text-to-video model).
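The clip-by-clip loop with point cloud integration can be sketched as follows. `reshoot_clip`, `reconstruct`, the `(N, 6)` xyz+rgb layout, and the voxel deduplication are hypothetical stand-ins for the actual model and reconstruction code:

```python
import numpy as np

def integrate(cloud, new_points, voxel=0.05):
    """Merge newly generated points into the persistent cloud,
    deduplicating by voxel key to keep memory bounded (an assumption;
    the actual integration strategy may differ)."""
    merged = np.concatenate([cloud, new_points], axis=0)
    keys = np.round(merged[:, :3] / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return merged[np.sort(idx)]

def reshoot_long_video(frames, cameras, clip_len, reshoot_clip, reconstruct):
    """Run inference clip-by-clip, growing the point cloud after each pass
    so later clips are conditioned on content generated earlier."""
    cloud = np.empty((0, 6))  # xyz + rgb
    outputs = []
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        cams = cameras[start:start + clip_len]
        generated = reshoot_clip(clip, cams, cloud)  # condition on cloud so far
        cloud = integrate(cloud, reconstruct(generated, cams))
        outputs.extend(generated)
    return outputs, cloud
```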
Ablations
We show samples for the ablations "No depth artifacts & source video conditioning" and "No temporal persistence". All samples, including those from our full method "Vista4D (ours)", are from checkpoints with fewer training steps than the final 672×384 checkpoint.
No depth artifacts & source video conditioning
We show samples of our method with ablations on the following design choices:
- (a) No source video: We remove source video conditioning and condition only on the point cloud render.
- (b) Source video via cross-attention: Instead of in-context conditioning (i.e., self-attention through frame concatenation), we condition on the source video via cross-attention following TrajectoryCrafter.
- (c) No depth artifacts: For the multiview dynamic dataset (MultiCamVideo dataset from ReCamMaster), instead of the point cloud render being the source video rendered from the cameras of the target video (main paper, Figure 3(b)), we always do double reprojection (main paper, Figure 3(a)) to remove depth artifacts, so the point cloud render is always spatially aligned with the target video.
- (d) No depth artifacts + no source video: Combination of (c) and (a).
- (e) No depth artifacts + source cross-attn: Combination of (c) and (b).
We observe two major artifacts/problems when we remove depth artifacts and/or source video conditioning during training:
1. Geometry artifacts from imprecise depth estimation: The model is unable to correct obvious depth estimation artifacts and thus produces output artifacts.
2. Temporal jittering: One artifact of real-world depth estimation/4D reconstruction is temporal jittering of the resulting point cloud. Here, the model is unable to correct this jittering.
This sample exemplifies 1, where the 4D reconstruction artifacts on the car carry over to all ablations except (b), where cross-attention was unable to properly transfer the car's geometry while keeping its size correct as the camera flies back.
This sample also shows 1, where every ablation (a) to (e) displays artifacts from depth estimation, especially around the man's right arm and hand. We also see a slight case of 2 in all the ablations, with the man jittering left and right due to depth temporal jittering.
This sample exemplifies 2, where (a), (c), (d), and (e) all show significant temporal jittering on the man due to jittering depths. Though (b) has no jittering, the source video was not well preserved via cross-attention: the man does not take his hand off his hat halfway through, and the output exhibits noticeably brighter and more saturated colors compared to the source video.
We see both 1 and 2 here. (a), (d), and (e) show geometry artifacts on the goat ((b) instead fails to accurately transfer the goat's geometry and appearance via cross-attention), and (a), (c), (d), and (e) show noticeable to significant temporal jittering.
No temporal persistence
We show samples of our method trained with and without point cloud static pixel temporal persistence, and we also show the corresponding point cloud conditioning with or without temporal persistence. We observe two major artifacts/problems when we remove temporal persistence:
1. Not preserving seen (static) content: The model struggles to preserve seen content from the source video.
2. Imprecise camera control: The model has less accurate camera control during target camera frames that have little overlap with the source video point cloud.
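The difference between the two conditions comes down to which static points each frame's render sees. A minimal sketch, assuming flat point arrays and a hypothetical `render` callable:

```python
import numpy as np

def persistent_renders(static_points, dynamic_per_frame, cameras, render):
    """With static pixel temporal persistence, every frame's render sees the
    full static point cloud, preserving seen content and camera cues even
    when the current frame overlaps little with the source video."""
    return [render(np.concatenate([static_points, dyn], axis=0), cam)
            for dyn, cam in zip(dynamic_per_frame, cameras)]

def per_frame_renders(static_per_frame, dynamic_per_frame, cameras, render):
    """Ablation: each render sees only that frame's own static points,
    giving the model less scene and camera information."""
    return [render(np.concatenate([st, dyn], axis=0), cam)
            for st, dyn, cam in zip(static_per_frame, dynamic_per_frame, cameras)]
```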
This sample showcases 1, where the no-temporal-persistence model struggles to preserve the right side of the scene. There are also symptoms of 2, where the camera at the end of the video moves forward slightly instead of panning and moving to the side per the point cloud render.
This sample shows 2: the no-temporal-persistence model exhibits strange, floaty camera motion due to the reduced camera information in the per-frame point cloud. Partly because of this, the BMX rider also has strange motion with geometry artifacts.
This sample shows 2, where the no-temporal-persistence model does not replicate the point cloud render's camera dolly-out smoothly and also jitters. The hiker also shows geometry artifacts.
This sample encompasses both 1 and 2, where the no-temporal-persistence model struggles to faithfully synthesize the snow and rock mountain behind the snowboarder, as the per-frame point cloud render never explicitly sees it. The camera is also inaccurate throughout the video, especially at the beginning, and the snowboarder's motion is strange, appearing to float on top of the snow.
Acknowledgements
We would like to thank Aleksander Hołyński, Wenqi Xian, Dan Zheng, Mohsen Mousavi, Li Ma, and Lingxiao Li for their technical discussions; Ryan Tabrizi, Tianyi Lorena Yan, and Shreyas Havaldar for appearing in our demo videos; Lukas Lepicovsky, David Rhodes, Nhat Phong Tran, Dacklin Young, and Johnson Thomasson for their production support; Jeffrey Shapiro, Ritwik Kumar, and Hossein Taghavi for their executive support; Jennifer Lao and Lianette Alnaber for their operational support.
BibTeX
@inproceedings{lin2026vista4d,
author = {Lin, {Kuan Heng} and Liu, Zhizheng and Salamanca, Pablo and Kant, Yash and Burgert, Ryan and Namekata, Koichi and Zhao, Yiwei and Zhou, Bolei and Goldblum, Micah and Debevec, Paul and Yu, Ning},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
title = {{Vista4D}: Video Reshooting with 4D Point Clouds},
year = {2026}
}