AI-generated content has attracted considerable attention recently, but photo-realistic video synthesis remains challenging. Although many attempts have been made in this area with GANs and autoregressive models, the visual quality and length of generated videos are still far from satisfactory. Diffusion models (DMs) are another class of deep generative models that have recently achieved remarkable performance on various image synthesis tasks. However, training image diffusion models usually requires substantial computational resources to achieve high performance, which makes extending diffusion models to high-dimensional video synthesis even more computationally expensive. To ease this problem while leveraging the advantages of DMs, we introduce lightweight video diffusion models that synthesize high-fidelity, arbitrarily long videos from pure noise. Specifically, we propose to perform diffusion and denoising in a low-dimensional 3D latent space, which significantly outperforms previous methods operating in 3D pixel space under a limited computational budget. In addition, although trained on tens of frames, our models can generate videos of arbitrary length, i.e., thousands of frames, in an autoregressive manner. Finally, conditional latent perturbation is further introduced to reduce performance degradation when synthesizing long-duration videos. Extensive experiments on various datasets and generation lengths suggest that our framework samples much more realistic and longer videos than previous approaches, including GAN-based, autoregressive, and diffusion-based methods.
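For intuition on why the latent formulation is cheaper, below is a minimal PyTorch-style sketch of the idea: a 3D autoencoder compresses a video clip into a low-dimensional spatio-temporal latent, and diffusion/denoising runs on that latent instead of raw pixels. The layer choices, channel counts, and downsampling factors here are illustrative placeholders, not the actual LVDM architecture.

import torch
import torch.nn as nn

# Illustrative clip: a batch of 16-frame 256x256 RGB videos, shape (B, C, T, H, W).
video = torch.randn(2, 3, 16, 256, 256)

# Placeholder 3D autoencoder: 8x spatial downsampling into a 4-channel latent.
# (The real LVDM autoencoder is a learned, multi-layer network; factors may differ.)
encoder = nn.Conv3d(3, 4, kernel_size=(3, 8, 8), stride=(1, 8, 8), padding=(1, 0, 0))
decoder = nn.ConvTranspose3d(4, 3, kernel_size=(3, 8, 8), stride=(1, 8, 8), padding=(1, 0, 0))

z0 = encoder(video)      # (2, 4, 16, 32, 32): ~48x fewer elements than the pixel clip

# Forward diffusion in latent space: noise the latent according to a sampled timestep t.
num_steps = 1000
betas = torch.linspace(1e-4, 2e-2, num_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

t = torch.randint(0, num_steps, (z0.shape[0],))
noise = torch.randn_like(z0)
ac = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
z_t = ac.sqrt() * z0 + (1.0 - ac).sqrt() * noise   # the denoiser is trained to predict `noise`

# Sampling would run the learned reverse process from pure noise in latent space,
# then decode the result back to pixels: video_hat = decoder(z_sampled).

Because every diffusion step operates on the compressed latent, the per-step cost of training and sampling shrinks roughly in proportion to this compression ratio.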
We provide 1,000-frame videos at a spatial resolution of 256 × 256 sampled by our approach:
We present LVDM, a novel diffusion model (DM)-based framework for video generation. Diffusion and denoising are performed in a video latent space learned by a 3D autoencoder, and an unconditional DM is trained in this latent space to generate short video clips. To extend generation to arbitrary lengths, we further propose two frame-conditional models, a prediction DM and an infilling DM, which synthesize long-duration videos in autoregressive and hierarchical manners, respectively. We use noisy conditions, obtained by diffusing the condition frames to a diffusion timestep s, to mitigate the condition errors accumulated during autoregressive sampling. The frame-conditional DMs are jointly trained with unconditional inputs, where the conditional and unconditional sampling frequencies are controlled by their corresponding probabilities p_c and p_u.
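To make the frame-conditional training concrete, here is a hedged PyTorch-style sketch of how the joint conditional/unconditional training with probabilities p_c and p_u, together with the noisy (perturbed) conditions at timestep s, could be wired up. The denoiser signature denoiser(z_t, t, cond), the number of condition frames, and the default values of p_c and s are illustrative assumptions, not the exact LVDM implementation.

import torch
import torch.nn.functional as F

def perturb_condition(z_cond, s, alphas_cumprod):
    # Conditional latent perturbation: diffuse the condition latents to a (small)
    # timestep s so the model tolerates imperfect, previously generated conditions.
    ac = alphas_cumprod[s]
    return ac.sqrt() * z_cond + (1.0 - ac).sqrt() * torch.randn_like(z_cond)

def training_step(denoiser, z0, alphas_cumprod, p_c=0.9, s=100, n_cond_frames=4):
    # One joint training step: with probability p_u = 1 - p_c the condition is
    # dropped (zeroed), so a single model handles both conditional and
    # unconditional generation. z0 has shape (B, C, T, H, W).
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z0.device)
    noise = torch.randn_like(z0)
    ac = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    z_t = ac.sqrt() * z0 + (1.0 - ac).sqrt() * noise

    cond = perturb_condition(z0[:, :, :n_cond_frames], s, alphas_cumprod)
    if torch.rand(1).item() > p_c:               # unconditional branch, probability p_u
        cond = torch.zeros_like(cond)

    pred = denoiser(z_t, t, cond)                # hypothetical denoiser signature
    return F.mse_loss(pred, noise)

# Example usage with the placeholder schedule from the sketch above:
# alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
# loss = training_step(my_denoiser, z0, alphas_cumprod)

At sampling time, the prediction DM would be applied autoregressively, conditioning each new clip on the last few generated latent frames (again perturbed to timestep s), while the infilling DM fills frames between sparse keyframes to enable hierarchical generation of very long videos.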
We compare randomly generated samples from our approach with those from DIGAN (Yu et al. 2022) and TATS (Ge et al. 2022).
We compare short video generation performance across approaches on the Sky Time-lapse, UCF-101, and TaiChi datasets.
Our approach achieves state-of-the-art performance at a higher spatial resolution.
We compare randomly generated samples from our approach with those from TATS (Ge et al. 2022), using both autoregressive and hierarchical generation.
We compare our approach's long video generation performance (1024 frames) with TATS (Ge et al. 2022) on UCF-101 and Sky Time-lapse.
Our approach surpasses TATS by a large margin at a higher spatial resolution. (The default sampling method is DDPM; * indicates DDIM sampling.)
@article{he2022lvdm,
title={Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths},
author={Yingqing He and Tianyu Yang and Yong Zhang and Ying Shan and Qifeng Chen},
year={2022},
journal={arXiv preprint arXiv:2211.13221},
}