Vista4D: Video Reshooting with 4D Point Clouds

Eyeline Labs · Netflix · Columbia University · UCLA · Stony Brook University · University of Oxford
Work done during an internship at Eyeline Labs
CVPR 2026 Highlight

Overview

Vista4D is a video reshooting framework that re-renders the dynamic scene of an input source video from novel camera trajectories and viewpoints. By training on noisy, reconstructed multiview videos, Vista4D bridges the distribution shift between training and inference for point-cloud-grounded video reshooting and is robust to the point cloud artifacts produced by imprecise 4D reconstruction of real-world videos. Our 4D point cloud with temporally persistent static points also explicitly preserves scene content and improves camera control. Vista4D generalizes to real-world applications such as dynamic scene expansion (using a casual video capture of the scene as a background reference), 4D scene recomposition (point cloud editing), and long video inference with memory.

Video reshooting

Comparison to baselines

We show qualitative comparisons of Vista4D to baseline methods on a variety of real-world monocular videos and target cameras.

4D scene recomposition

We directly edit the 4D point cloud to manipulate the dynamic scene. To prevent conditioning conflicts between the unedited input source video (column 1) and the render of the edited point cloud (column 3), we instead condition on the edited source video (column 2), i.e., the edited point cloud rendered from the source cameras without static-pixel temporal persistence.

Dynamic scene expansion

We jointly reconstruct the source video and the casual scene capture to form a 4D point cloud that contains (static) scene information beyond the source video, which grounds the model's generative priors to the real-world capture and improves camera accuracy. We show inference without (top) and with (bottom) this casual scene capture, along with their corresponding point cloud render conditions.

Long video inference

We chunk the source video into clips and run inference clip by clip. To explicitly preserve generated content, we integrate the point cloud from each newly generated video clip into the existing one after every inference pass. This is visualized in the point cloud render below, where the point cloud grows after each clip. To ensure seamless transitions between clips, we run inference with a Vista4D checkpoint trained from a separate image-to-video model (our main checkpoint is trained from a text-to-video model).
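
The clip-by-clip loop described above can be sketched as follows. Here `reconstruct`, `render`, and `generate` are hypothetical stand-ins for the 4D reconstruction, point cloud rasterization, and video generation steps; this is an illustrative sketch, not the released API:

```python
import numpy as np

def reshoot_long_video(source_frames, target_cams, clip_len,
                       reconstruct, render, generate):
    """Clip-by-clip inference with point cloud memory (illustrative sketch).

    After each clip, the newly generated content is folded back into a
    growing 4D point cloud so that later clips stay consistent with it.
    """
    point_cloud = np.empty((0, 3))  # accumulated 3D points (the "memory")
    outputs = []
    for start in range(0, len(source_frames), clip_len):
        clip = source_frames[start:start + clip_len]
        cams = target_cams[start:start + clip_len]
        # Reconstruct points for this source clip and merge into memory.
        point_cloud = np.vstack([point_cloud, reconstruct(clip)])
        # Condition generation on renders of the accumulated cloud.
        renders = render(point_cloud, cams)
        generated = generate(clip, renders)
        # Integrate the generated clip's points so content persists.
        point_cloud = np.vstack([point_cloud, reconstruct(generated)])
        outputs.extend(generated)
    return outputs, point_cloud
```

The key design choice this sketch reflects is that the point cloud only ever grows: content generated in earlier clips remains available as an explicit geometric condition for all later clips.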

Ablations

We show samples for ablations on no depth artifacts & source video conditioning and no temporal persistence. All samples, including those from our full method "Vista4D (ours)", are from checkpoints with fewer training steps than the final 672×384 checkpoint.

No depth artifacts & source video conditioning

We show samples of our method with ablations on the following design choices:

  1. No source video: We take away source video conditioning and only condition on the point cloud render.
  2. Source video via cross-attention: Instead of in-context conditioning (i.e., self-attention through frame concatenation), we condition the source video via cross-attention following TrajectoryCrafter.
  3. No depth artifacts: For the multiview dynamic dataset (the MultiCamVideo dataset from ReCamMaster), instead of the point cloud render being the source video rendered in the cameras of the target video (main paper, Figure 3(b)), we always perform double reprojection (main paper, Figure 3(a)) to remove depth artifacts, so the point cloud render is always spatially aligned with the target video.
  4. No depth artifacts + no source video: Combination of 3 and 1.
  5. No depth artifacts + source cross-attn: Combination of 3 and 2.
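
As a rough illustration of the depth-based reprojection underlying these point cloud renders, the sketch below unprojects source pixels with their depth into 3D and projects them into a target camera. This uses a standard pinhole model with illustrative parameter names; it is a minimal sketch, not the paper's exact double-reprojection pipeline:

```python
import numpy as np

def reproject(depth, K_src, K_tgt, T_src_to_tgt):
    """Reproject source pixels into a target camera (pinhole sketch).

    depth:        (H, W) per-pixel depth in the source camera.
    K_src, K_tgt: 3x3 intrinsics (illustrative).
    T_src_to_tgt: 4x4 rigid transform from source to target camera frame.
    Returns (H*W, 2) target-image pixel coordinates, one per source pixel.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(float)
    # Back-project: X_src = depth * K_src^{-1} @ homogeneous pixel
    rays = pix @ np.linalg.inv(K_src).T
    pts_src = rays * depth.reshape(-1, 1)
    # Rigid transform into the target camera frame.
    pts_h = np.hstack([pts_src, np.ones((pts_src.shape[0], 1))])
    pts_tgt = (pts_h @ T_src_to_tgt.T)[:, :3]
    # Perspective projection with the target intrinsics.
    proj = pts_tgt @ K_tgt.T
    return proj[:, :2] / proj[:, 2:3]
```

With an identity transform and matching intrinsics, each pixel maps back to itself; with a different target pose, errors in `depth` shift the reprojected pixels, which is exactly the class of depth artifact the ablation above manipulates.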

We observe two major artifacts/problems when we remove depth artifacts and/or source video conditioning during training:

  1. Geometry artifacts from imprecise depth estimation: The model is unable to correct obvious depth estimation artifacts and thus produces artifacts in its output.
  2. Temporal jittering: One artifact of real-world depth estimation/4D reconstruction is temporal jittering of the resulting point cloud. Here, the model is unable to correct this jittering.

No temporal persistence

We show samples of our method trained with and without point cloud static pixel temporal persistence, and we also show the corresponding point cloud conditioning with or without temporal persistence. We observe two major artifacts/problems when we remove temporal persistence:

  1. Not preserving seen (static) content: The model struggles to preserve seen content from the source video.
  2. Imprecise camera control: The model has less accurate camera control for target camera frames that have little overlap with the source video point cloud.
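
The static-pixel temporal persistence ablated here can be illustrated with a minimal sketch: static points accumulate across frames, while dynamic points are kept only for their own frame. The inputs `frames_points` and `is_static` are illustrative, not the paper's data structures:

```python
import numpy as np

def build_persistent_clouds(frames_points, is_static):
    """Per-frame point clouds with static-point persistence (sketch).

    frames_points: list of (N_i, 3) point arrays, one per frame.
    is_static:     list of (N_i,) boolean masks marking static points.
    Returns one point cloud per frame, where static points from all
    earlier frames persist while dynamic points are frame-local.
    """
    static_bank = np.empty((0, 3))
    per_frame_clouds = []
    for pts, mask in zip(frames_points, is_static):
        # Static points join a bank that is never discarded.
        static_bank = np.vstack([static_bank, pts[mask]])
        # Each frame's cloud = all static points so far + its own dynamic points.
        per_frame_clouds.append(np.vstack([static_bank, pts[~mask]]))
    return per_frame_clouds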

Acknowledgements

We would like to thank Aleksander Hołyński, Wenqi Xian, Dan Zheng, Mohsen Mousavi, Li Ma, and Lingxiao Li for their technical discussions; Ryan Tabrizi, Tianyi Lorena Yan, and Shreyas Havaldar for appearing in our demo videos; Lukas Lepicovsky, David Rhodes, Nhat Phong Tran, Dacklin Young, and Johnson Thomasson for their production support; Jeffrey Shapiro, Ritwik Kumar, and Hossein Taghavi for their executive support; Jennifer Lao and Lianette Alnaber for their operational support.

BibTeX

@inproceedings{lin2026vista4d,
    author = {Lin, {Kuan Heng} and Liu, Zhizheng and Salamanca, Pablo and Kant, Yash and Burgert, Ryan and Namekata, Koichi and Zhao, Yiwei and Zhou, Bolei and Goldblum, Micah and Debevec, Paul and Yu, Ning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    title = {{Vista4D}: Video Reshooting with 4D Point Clouds},
    year = {2026}
}