Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures

SIGGRAPH Asia 2025

Eyeline Labs · Netflix

Abstract

We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component on recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), with lighting variability added by a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation via two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time. It further supports scene and real-life video customization, as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production.

Human Data from Volumetric Captures

We record multi-view performances, apply 4D Gaussian Splatting, and render videos with diverse camera motion and lighting for customization.

Single-subject data rendered from 4DGS with diverse camera motion and multi-view coverage of the subjects.
Relit data with diverse lighting conditions.
Joint-subject dataset featuring both subjects in the same video.
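
As a rough illustration of this pipeline, the sketch below renders one training clip from a 4DGS capture under a sampled camera trajectory and then relights it. The three stages are passed in as callables because the concrete 4DGS renderer, relighting model, and trajectory sampler are not specified on this page; everything in the sketch is a hypothetical placeholder rather than our actual implementation.

```python
# Minimal, hypothetical sketch of the customization-data pipeline described above.
# The 4DGS renderer, the video relighting model, and the camera-trajectory
# sampler are supplied as callables; none of them are real APIs from the paper.
from typing import Callable, List
import numpy as np

def build_customization_clip(
    render_4dgs: Callable[[int, np.ndarray], np.ndarray],          # (frame idx, 4x4 cam-to-world) -> HxWx3 image
    relight: Callable[[List[np.ndarray], str], List[np.ndarray]],  # video relighting model
    sample_pose: Callable[[float], np.ndarray],                    # t in [0, 1] -> 4x4 camera pose
    num_frames: int = 49,
    lighting: str = "sunset",
):
    """Render one training clip: a 4DGS performance viewed from a sampled
    camera trajectory, then relit to add lighting variability."""
    poses = [sample_pose(i / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = [render_4dgs(i, pose) for i, pose in enumerate(poses)]
    frames = relight(frames, lighting)
    # The clip, its per-frame camera poses, and the lighting tag form one training sample.
    return frames, poses, lighting
```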

Comparison with Text-to-Video Customization Baselines

Our approach shows higher multi-view identity preservation than the baselines. Notably, our method outperforms ConsisID, which uses only a single facial image at inference and struggles with identity consistency across views, which highlights the value of multi-view data in our pipeline.

Comparison grids for two subjects (Alex and Emily): reference images alongside results from Magic-Me, DreamVideo, VideoBooth, MotionBooth, ConsisID, and ours.

Text-to-Video Generation with Camera Conditions

Here are more results of subject-specific customization with camera control, shown alongside the input camera trajectories. Our method supports precise 3D camera motion, such as rotation and forward movement, unlike prior customization methods such as MotionBooth, which are limited to simple 2D translations.
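
For illustration, the snippet below constructs one such trajectory, an orbit combined with a forward dolly, as a sequence of per-frame camera-to-world matrices. It is only a sketch of how a 3D trajectory can be parameterized; how the trajectory is actually encoded and injected into the diffusion model is not covered here.

```python
# Illustrative only: a 3D camera trajectory combining rotation with forward
# (dolly-in) motion, expressed as per-frame 4x4 camera-to-world matrices.
import numpy as np

def look_at_pose(position, target=np.zeros(3), world_up=np.array([0.0, 1.0, 0.0])):
    forward = target - position
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, world_up)
    right = right / np.linalg.norm(right)
    up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, up, -forward, position
    return pose  # OpenGL-style convention: camera looks down its -z axis

def orbit_and_dolly(num_frames=49, start_radius=4.0, end_radius=2.0,
                    height=1.6, sweep_deg=90.0):
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        radius = (1 - t) * start_radius + t * end_radius   # move forward toward the subject
        angle = np.deg2rad(t * sweep_deg)                  # rotate around the subject
        position = np.array([radius * np.sin(angle), height, radius * np.cos(angle)])
        poses.append(look_at_pose(position))
    return np.stack(poses)  # (num_frames, 4, 4), one extrinsic per generated frame

trajectory = orbit_and_dolly()
```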

Importance of Multi-view Data

To examine the role of multi-view data, we conducted an ablation using only frontal-view training videos. The results show noticeably worse identity preservation from side views, highlighting the importance of multi-view training for consistent identity across angles.

Importance of Relit Data

Without relit data, the generated videos show flat and unrealistic lighting. In contrast, adding relit data significantly improves lighting realism and diversity.

Multi-subject Generation

To enable text-to-video generation for multi-subject scenarios, we customized the model on individual subject datasets and a small joint dataset with both subjects. Results show that our model can accurately generate both identities together in the same scene.
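
One simple way to set up such joint customization is to mix the two single-subject datasets with the small joint-subject dataset during training. The sketch below uses PyTorch's WeightedRandomSampler for this; the dataset objects and the upweighting ratio for the joint set are illustrative assumptions, not the exact recipe used in the paper.

```python
# Illustrative sketch: mix two single-subject datasets with a small
# joint-subject dataset, upweighting the joint set so it is not drowned out.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_joint_loader(subject_a, subject_b, joint_ab, batch_size=1, joint_weight=3.0):
    mixed = ConcatDataset([subject_a, subject_b, joint_ab])
    # Per-sample weights: samples from the small joint dataset are drawn more often.
    weights = torch.cat([
        torch.ones(len(subject_a)),
        torch.ones(len(subject_b)),
        torch.full((len(joint_ab),), joint_weight),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```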

Multi-subject Generation with Noise Blending

Instead of jointly customizing one model on all subjects, we propose an alternative in which each subject is customized independently, yielding separate single-subject models. At inference time, we use a noise blending technique that combines these single-subject models to generate multi-subject videos. This produces realistic videos with both subjects, without joint training.
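
A minimal sketch of one denoising step with noise blending is shown below, assuming a diffusers-style denoiser that returns `.sample` and a scheduler whose `step` returns `prev_sample`. The scalar weighted average of the two noise predictions is an illustrative blending rule; per-subject spatial masks or other schemes could be substituted.

```python
# Sketch of noise blending at inference time: two independently customized
# single-subject denoisers predict noise for the same latents, and their
# predictions are blended before the scheduler update. Interfaces follow the
# diffusers convention and are assumptions, not the paper's exact code.
import torch

@torch.no_grad()
def blended_denoising_step(latents, timestep, prompt_embeds,
                           model_a, model_b, scheduler, weight_a=0.5):
    # Each single-subject model predicts the noise for the shared latents.
    eps_a = model_a(latents, timestep, encoder_hidden_states=prompt_embeds).sample
    eps_b = model_b(latents, timestep, encoder_hidden_states=prompt_embeds).sample
    # Blend the predictions so both identities guide the same sample.
    # A per-subject spatial mask could replace the scalar weight.
    eps = weight_a * eps_a + (1.0 - weight_a) * eps_b
    return scheduler.step(eps, timestep, latents).prev_sample
```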

Importance of Joint-subject Data

To evaluate the impact of joint-subject data, i.e., videos featuring both subjects in the same shot (shown on the left), we compare models trained with and without it. As shown on the right, including joint-subject data leads to more realistic spatial relationships and more natural interactions between the two subjects.

Image-to-Video Customization is Necessary

We also explore customization for image-to-video models. Even when the input frame preserves the identity well, the generated videos show noticeable identity loss over time without multi-view customization. In contrast, customizing the model with multi-view data helps maintain consistent identity throughout the entire video.

Image-to-Video Customization with Camera Control

Our approach also enables image-to-video customization with camera conditions.

Real-life Data

Beyond 4DGS human data, we validate our method on a real-world customization dataset featuring a cat with estimated camera parameters. As shown, our model generates videos of the same cat in new contexts, preserving identity across views and enabling controllable camera motion.

Scene Customization

Our method also supports scene customization, allowing the model to capture interactions between subjects and their environment. Here, we show examples with both customized scenes and one or more subjects, where the generated videos depict realistic interactions between the subjects within the shared environment.

Motion and Spatial Layout Control

Beyond camera control, we further explore controlling the subjects' motion and spatial layout within the video by customizing the Go-with-the-Flow model with optical flow conditions. The resulting videos largely preserve the motion patterns and spatial arrangements from the source video while generating the appearance of the customized subjects.
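
As an example of the flow-extraction side of this setup, the snippet below computes per-frame optical flow from a source video with torchvision's RAFT model. The resulting flow fields would serve as the motion condition for the customized model; the Go-with-the-Flow conditioning interface itself is not shown, and the function here is only a sketch.

```python
# Illustrative: extract frame-to-frame optical flow from the source video with
# torchvision's RAFT. The flow would then condition the customized model.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

@torch.no_grad()
def extract_flow(frames):
    """frames: (T, 3, H, W) float tensor in [0, 1], with H and W divisible by 8.
    Returns (T-1, 2, H, W) flow fields between consecutive frames."""
    weights = Raft_Large_Weights.DEFAULT
    model = raft_large(weights=weights).eval()
    preprocess = weights.transforms()
    src, dst = frames[:-1], frames[1:]
    src, dst = preprocess(src, dst)
    # RAFT returns a list of iteratively refined flow fields; keep the final one.
    return model(src, dst)[-1]
```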

BibTeX

If you find our work useful, please consider citing our paper:


BibTeX entry coming soon.