SIGGRAPH Asia 2025
We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component on recorded volumetric capture performances that are re-rendered along diverse camera trajectories via 4D Gaussian Splatting (4DGS), with lighting variability obtained from a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation via two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time. It further supports scene and real-life video customization, as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production.
We record multi-view performances, apply 4D Gaussian Splatting, and render videos with diverse camera motion and lighting for customization.
Our approach achieves stronger multi-view identity preservation. Notably, our method outperforms ConsisID, which uses only a single facial image at inference and struggles with identity consistency across views, highlighting the value of multi-view data in our pipeline.
Here are more results of subject-specific customization with camera control, along with the input camera trajectories. Our method supports precise 3D camera motion, such as rotation and forward movement, unlike prior customization methods such as MotionBooth, which are limited to simple 2D translations.
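As a rough illustration of the kind of camera inputs involved (not our exact conditioning format; all function and parameter names below are placeholders), a trajectory such as an orbit combined with a forward dolly can be written as a sequence of per-frame camera-to-world matrices:

```python
import numpy as np

def look_at(cam_pos, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a 4x4 camera-to-world matrix with the camera at cam_pos looking at target."""
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    c2w = np.eye(4)
    # OpenGL-style convention: the camera looks down its -Z axis.
    c2w[:3, 0] = right
    c2w[:3, 1] = true_up
    c2w[:3, 2] = -forward
    c2w[:3, 3] = cam_pos
    return c2w

def orbit_dolly_trajectory(num_frames=49, radius_start=3.0, radius_end=1.5,
                           height=1.6, orbit_degrees=90.0):
    """Per-frame extrinsics that orbit the subject while dollying forward (illustrative values)."""
    target = np.array([0.0, height, 0.0])  # assume the subject sits near the origin
    poses = []
    for t in np.linspace(0.0, 1.0, num_frames):
        theta = np.deg2rad(orbit_degrees * t)
        radius = (1.0 - t) * radius_start + t * radius_end  # forward (dolly-in) motion
        cam_pos = np.array([radius * np.sin(theta), height, radius * np.cos(theta)])
        poses.append(look_at(cam_pos, target))
    return np.stack(poses)  # (num_frames, 4, 4) camera-to-world matrices

trajectory = orbit_dolly_trajectory()
print(trajectory.shape)  # (49, 4, 4)
```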
To examine the role of multi-view data, we conducted an ablation using only frontal-view training videos. The results show noticeably worse identity preservation from side views, highlighting the importance of multi-view training for consistent identity across angles.
Without relit data, the generated videos show flat and unrealistic lighting. In contrast, adding relit data significantly improves lighting realism and diversity.
To enable text-to-video generation for multi-subject scenarios, we customized the model on individual subject datasets and a small joint dataset with both subjects. Results show that our model can accurately generate both identities together in the same scene.
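As a rough sketch of this training mixture (the dataset sizes and the 2x oversampling of the joint clips are illustrative assumptions, and the tensors below stand in for real video clips), the per-subject sets can be combined with the small joint set using weighted sampling:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Stand-ins for the real video datasets: two single-subject customization sets
# and a small joint set containing clips with both subjects.
subject_a = TensorDataset(torch.zeros(200, 3))   # clips of subject A (dummy data)
subject_b = TensorDataset(torch.zeros(200, 3))   # clips of subject B (dummy data)
joint_ab  = TensorDataset(torch.zeros(20, 3))    # small joint set with both subjects

mixed = ConcatDataset([subject_a, subject_b, joint_ab])

# Up-weight the small joint set so multi-subject clips are seen regularly during
# customization; the 2x emphasis is an assumption, not an exact recipe.
weights = torch.cat([
    torch.full((len(subject_a),), 1.0 / len(subject_a)),
    torch.full((len(subject_b),), 1.0 / len(subject_b)),
    torch.full((len(joint_ab),),  2.0 / len(joint_ab)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=1, sampler=sampler)
```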
Instead of jointly customizing one model on all subjects, we propose an alternative approach in which each subject is customized independently, resulting in separate single-subject models. At inference time, we use a noise blending technique that leverages these single-subject models to generate multi-subject videos. This produces realistic videos with both subjects, without any joint training.
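As a minimal sketch of the idea (the exact blending rule and how the per-subject spatial weights are obtained are simplified here, and the denoiser interfaces are placeholders rather than our actual model API), each single-subject model predicts noise for the shared latent at every denoising step, and the predictions are merged before the sampler update:

```python
import torch

@torch.no_grad()
def blended_denoising_step(latents, t, denoiser_a, denoiser_b,
                           prompt_a, prompt_b, mask_a, mask_b):
    """Blend noise predictions from two independently customized single-subject models.

    latents:          (B, C, F, H, W) shared video latent at timestep t
    denoiser_a/b:     placeholder callables for the two customized diffusion models
    mask_a / mask_b:  (1, 1, 1, H, W) soft spatial masks assigning latent regions to
                      each subject (how the regions are chosen is a design choice,
                      e.g. from a layout prior; shown here only for illustration)
    """
    eps_a = denoiser_a(latents, t, prompt_a)   # noise prediction from subject-A model
    eps_b = denoiser_b(latents, t, prompt_b)   # noise prediction from subject-B model

    # Normalize the masks so the result is a convex combination everywhere.
    w_a = mask_a / (mask_a + mask_b + 1e-6)
    w_b = 1.0 - w_a
    return w_a * eps_a + w_b * eps_b
```

The blended prediction then goes through the sampler's usual update, so the single-subject models are composed purely at inference time, without retraining.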
To evaluate the impact of joint-subject data, i.e., videos featuring both subjects in the same clip (shown on the left), we compare models trained with and without it. As shown on the right, including joint-subject data leads to more realistic spatial relationships and more natural interactions between the two subjects.
We also explore customization for image-to-video models. Even when the input frame preserves the identity well, videos generated without multi-view customization show noticeable identity loss over time. In contrast, customizing the model with multi-view data maintains consistent identity throughout the entire video.
Our approach also enables image-to-video customization with camera conditions.
Beyond 4DGS human data, we validate our method on a real-world customization dataset featuring a cat, with estimated camera parameters. As shown, our model generates videos of the same cat in new contexts, preserving identity across views and enabling controllable camera motion.
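One straightforward way to obtain such camera parameters for in-the-wild footage (shown only as an illustration; any structure-from-motion tool can serve this role, and the paths below are assumptions) is COLMAP via its pycolmap bindings:

```python
import os
import pycolmap

image_dir = "cat_frames"                 # frames extracted from the source videos (assumed path)
output_dir = "colmap_out"
os.makedirs(output_dir, exist_ok=True)
database_path = os.path.join(output_dir, "database.db")

# Standard COLMAP pipeline: feature extraction, exhaustive matching, incremental SfM.
pycolmap.extract_features(database_path, image_dir)
pycolmap.match_exhaustive(database_path)
maps = pycolmap.incremental_mapping(database_path, image_dir, output_dir)

if maps:
    rec = maps[0]
    # Each registered image in the reconstruction stores estimated intrinsics and pose.
    print(f"Registered {len(rec.images)} of the extracted frames.")
```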
Our method also supports scene customization, allowing the model to capture interactions between subjects and their environment. Here, we show examples with both customized scenes and one or more subjects, where the generated videos depict realistic interactions between the subjects within the shared environment.
Beyond camera control, we further explore controlling the subjects' motion and spatial layout within the video by customizing the Go-with-the-Flow model with optical flow conditions. The resulting videos largely preserve the motion patterns and spatial arrangements from the source video while generating the appearance of the customized subjects.
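As an illustration of how such flow conditions can be extracted from a source video (torchvision's RAFT is used here as one off-the-shelf estimator; it is not necessarily the one in our pipeline, and the frame tensor is a dummy stand-in):

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).to(device).eval()
preprocess = weights.transforms()

@torch.no_grad()
def video_optical_flow(frames):
    """Forward optical flow between consecutive frames of a source video.

    frames: (T, 3, H, W) uint8 tensor with H and W divisible by 8 (RAFT requirement).
    Returns a (T-1, 2, H, W) tensor of flow fields to be used as the motion condition.
    """
    flows = []
    for i in range(frames.shape[0] - 1):
        img1, img2 = preprocess(frames[i:i + 1], frames[i + 1:i + 2])
        flow = model(img1.to(device), img2.to(device))[-1]  # final refinement iteration
        flows.append(flow.cpu())
    return torch.cat(flows, dim=0)

# Dummy frames; in practice these come from the source video driving the layout.
dummy = torch.randint(0, 255, (8, 3, 256, 256), dtype=torch.uint8)
flow = video_optical_flow(dummy)
print(flow.shape)  # (7, 2, 256, 256)
```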
If you find our work useful, please consider citing our paper:
@article{coming soon
}