CineScale: Free Lunch in High-Resolution
Cinematic Visual Generation

Haonan Qiu1,2 Ning Yu*2 Ziqi Huang1 Paul Debevec2 Ziwei Liu*1

1 Nanyang Technological University    2 Netflix Eyeline Studios
(* Corresponding Author)

[arXiv]      [Code]


UNet-based results can be found in our previous work, FreeScale.

Abstract

Visual diffusion models have achieved remarkable progress, yet they are typically trained at limited resolutions due to the scarcity of high-resolution data and constrained computational resources, which hampers their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle is the inevitable increase in high-frequency information when the model generates visual content beyond its training resolution, which leads to undesirable repetitive patterns arising from accumulated errors. In this work, we propose CineScale, a novel inference paradigm that enables higher-resolution visual generation. To tackle the distinct issues introduced by the two types of video generation architectures (UNet-based and DiT-based), we propose a dedicated variant tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning.

Methodology

Overall framework of CineScale. (a) Tailored Self-Cascade Upscaling. CineScale first upsamples an image or video generated at the training resolution, then gradually adds noise to the high-resolution latent, and finally denoises it to reconstruct detail. Part of the clean latent is reintroduced during denoising to stabilize generation and control the amount of synthesized detail. (b) Scale Fusion. For the UNet architecture, we modify the self-attention layers to combine global and local attention, fusing high-frequency details with low-frequency semantics via Gaussian blur for the final output. We also apply Restrained Dilated Convolution to adapt the model's convolution layers to high resolutions and reduce repetition. (c) DiT Extension. To support DiT models, we additionally introduce NTK-RoPE and Attentional Scaling. Building on this tuning-free setup, Minimal LoRA Fine-Tuning is further introduced to help the model better adapt to the modified RoPE, leading to improved performance. Sketches of the core steps are given below.
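
To make step (a) concrete, here is a minimal PyTorch sketch of the self-cascade upscaling loop. The denoiser and scheduler interfaces, timestep values, and reinjection hyperparameters are illustrative assumptions (loosely mirroring the diffusers scheduler API), not CineScale's actual implementation:

import torch
import torch.nn.functional as F

def self_cascade_upscale(latent, denoiser, scheduler, scale=2.0,
                         start_t=600, reinject_steps=5, reinject_ratio=0.3):
    """Illustrative sketch of tailored self-cascade upscaling.

    latent:    clean latent generated at the training resolution,
               shaped (B, C, H, W); for video, add a temporal axis
               and keep it unscaled.
    denoiser:  hypothetical callable (x_t, t) -> model output in the
               form expected by scheduler.step (e.g. predicted noise).
    scheduler: hypothetical object mirroring the diffusers interface,
               i.e. add_noise(x0, noise, t) and step(out, t, x_t).
    """
    # 1. Upsample the clean latent to the target resolution.
    hires = F.interpolate(latent, scale_factor=scale, mode="nearest")

    # 2. Re-noise it to an intermediate timestep instead of starting
    #    from pure noise, so coarse structure is preserved.
    noise = torch.randn_like(hires)
    t0 = torch.tensor([start_t], device=hires.device)
    x_t = scheduler.add_noise(hires, noise, t0)

    # 3. Denoise back to t = 0; in the first few steps, blend the
    #    re-noised clean latent back in to stabilize generation and
    #    control how much new detail is synthesized.
    timesteps = [t for t in scheduler.timesteps if t <= start_t]
    for i, t in enumerate(timesteps):
        out = denoiser(x_t, t)
        x_t = scheduler.step(out, t, x_t).prev_sample
        if i < reinject_steps:
            anchor = scheduler.add_noise(hires, noise, t)
            x_t = reinject_ratio * anchor + (1.0 - reinject_ratio) * x_t
    return x_t

For step (c), NTK-RoPE rescales the rotary base so that positions beyond the training resolution fall into a phase range the model has already seen, which suppresses the repeated patterns caused by out-of-range positional encodings. Below is a sketch of the standard NTK-aware rescaling; the exact variant used in CineScale may differ:

import torch

def ntk_rope_freqs(head_dim, positions, base=10000.0, scale=2.0):
    """NTK-aware RoPE frequencies for a `scale`x resolution extension.

    head_dim:  even number of channels devoted to this spatial axis.
    positions: 1D tensor of token positions along that axis.
    """
    # Standard NTK-aware trick: enlarge the rotary base so the lowest
    # frequency is stretched by roughly `scale` while the highest
    # frequencies stay nearly unchanged.
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / ntk_base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    phases = torch.outer(positions.float(), inv_freq)
    return torch.cos(phases), torch.sin(phases)

In this reading, Attentional Scaling is a single multiplicative correction on the attention logits that keeps their entropy stable as the token count grows, and Minimal LoRA Fine-Tuning then lets the model adapt to the modified RoPE with very few trainable parameters.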

T2V Ablation (960 × 1664)

Although all variants can generate rough results, our full method performs the best.

T2V Comparison (960 × 1664)

Although other baselines can produce reasonable results at moderately higher resolutions, they still suffer from varying degrees of blurriness. In contrast, CineScale generates high-quality videos with rich visual details.

T2V Comparison (1920 × 3328)

Qualitative comparisons with other baselines. At resolutions several times higher than those used during training, LTX and Wan-DI tend to fail completely. While UAV, a video super-resolution approach, can still produce visually reasonable results, it is unable to recover fine details that are ambiguous or missing in the low-resolution inputs. In contrast, CineScale consistently generates high-quality videos with rich and faithful visual details.

Image-to-Video Ablation (960 × 1664)

Although all variants can generate rough results, our full method performs the best.

Image-to-Video Generation (2176 × 3840)

With minimal LoRA fine-tuning, CineScale can achieve 4k image-to-video generation.

ReCamMaster Video-to-Video Ablation (960 × 1664)

Without NTK-RoPE, repeated patterns are prone to occur due to errors in positional encoding. Although all variants can generate rough results, our full method performs the best.

Local Semantic Editing (2176 × 3840)

CineScale supports efficient editing: users can preview results at low resolution, then modify local semantics of the high-resolution output via prompts.
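
As a hypothetical usage sketch of this workflow, reusing the self_cascade_upscale helper above: generate a cheap low-resolution preview, then pass an edited prompt to the denoiser during the high-resolution pass, so the newly synthesized detail follows the edit while the reinjected clean latent preserves the overall layout. The generate and model calls below are illustrative stand-ins, not CineScale's actual API:

# Hypothetical workflow sketch; `generate`, `model`, and `scheduler`
# are illustrative stand-ins, not CineScale's real interface.
preview = generate(prompt="a foggy harbor at dawn",
                   resolution=(960, 1664))  # fast low-res preview

edited = self_cascade_upscale(
    preview,
    denoiser=lambda x_t, t: model(
        x_t, t, prompt="a foggy harbor at dawn, red lighthouse on the pier"),
    scheduler=scheduler,
    scale=2.0,
)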

BibTeX

If you find this paper useful in your research, please consider citing:

@article{qiu2025cinescale,
  title={CineScale: Free Lunch in High-Resolution Cinematic Visual Generation}, 
  author={Haonan Qiu and Ning Yu and Ziqi Huang and Paul Debevec and Ziwei Liu},
  journal={arXiv preprint arXiv:2508.15774},
  year={2025}
}