Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths

Yingqing He1, Tianyu Yang2, Yong Zhang2, Ying Shan2, Qifeng Chen1

1The Hong Kong University of Science and Technology, 2Tencent AI Lab



Abstract

AI-generated content has attracted considerable attention recently, but photo-realistic video synthesis remains challenging. Although many attempts have been made with GANs and autoregressive models, the visual quality and length of generated videos are far from satisfactory. Diffusion models (DMs) are another class of deep generative models and have recently achieved remarkable performance on various image synthesis tasks. However, training image diffusion models usually requires substantial computational resources to reach high performance, which makes extending diffusion models to high-dimensional video synthesis even more computationally expensive. To ease this problem while leveraging their advantages, we introduce lightweight video diffusion models that synthesize high-fidelity, arbitrarily long videos from pure noise. Specifically, we propose to perform diffusion and denoising in a low-dimensional 3D latent space, which significantly outperforms previous pixel-space methods under a limited computational budget. In addition, although trained on tens of frames, our models can generate videos of arbitrary length, i.e., thousands of frames, in an autoregressive manner. Finally, conditional latent perturbation is further introduced to reduce the performance degradation that arises when synthesizing long-duration videos. Extensive experiments on various datasets and generation lengths show that our framework samples much more realistic and longer videos than previous approaches, including GAN-based, autoregressive, and diffusion-based methods.
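To make the latent-space idea concrete, the sketch below illustrates diffusing and denoising in a low-dimensional 3D video latent with PyTorch. The architecture, channel counts, shapes, and noise schedule are illustrative assumptions for exposition, not the released implementation.

```python
# Illustrative sketch: diffusion in a 3D video latent space (all modules,
# shapes, and hyper-parameters here are assumptions, not the authors' code).
import torch
import torch.nn as nn

class Toy3DAutoencoder(nn.Module):
    """Compresses a video (B, C, T, H, W) into a low-dimensional 3D latent."""
    def __init__(self, in_ch=3, z_ch=4):
        super().__init__()
        # Downsample space and time with strided 3D convolutions.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 64, 3, stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(64, z_ch, 3, stride=(2, 2, 2), padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(z_ch, 64, 4, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.ConvTranspose3d(64, in_ch, (3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

class ToyDenoiser(nn.Module):
    """Stand-in for the latent-space denoising network; predicts the added noise."""
    def __init__(self, z_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(z_ch + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv3d(64, z_ch, 3, padding=1),
        )

    def forward(self, z_t, t):
        # Broadcast the (normalised) timestep as an extra channel.
        t_map = t.float().view(-1, 1, 1, 1, 1).expand(-1, 1, *z_t.shape[2:]) / 1000.0
        return self.net(torch.cat([z_t, t_map], dim=1))

def q_sample(z0, t, alphas_cumprod):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    return a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps, eps

ae, denoiser = Toy3DAutoencoder(), ToyDenoiser()
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)

video = torch.randn(2, 3, 16, 64, 64)              # (B, C, T, H, W) toy batch
with torch.no_grad():
    z0 = ae.encode(video)                          # low-dimensional 3D latent
t = torch.randint(0, 1000, (video.shape[0],))
z_t, eps = q_sample(z0, t, alphas_cumprod)
loss = nn.functional.mse_loss(denoiser(z_t, t), eps)   # standard epsilon-prediction loss
loss.backward()
```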

Results

We provide videos with 1000 frames at a spatial resolution of 256 × 256 sampled by our approach.

Framework

We present LVDM, a novel diffusion model (DM)-based framework for video generation. Diffusion and denoising are performed in a video latent space learned by a 3D autoencoder. An unconditional DM is then trained on this latent space to generate short video clips. To extend videos to arbitrary lengths, we further propose two frame-conditional models, a prediction DM and an infilling DM, which synthesize long-duration videos in autoregressive and hierarchical manners, respectively. We utilize noisy conditions at a diffusion timestep s to mitigate the conditioning error accumulated during autoregressive sampling. The frame-conditional DMs are jointly trained with unconditional inputs, where the frequencies of conditional and unconditional samples are controlled by the corresponding probabilities p_c and p_u.
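The following is a rough sketch of the joint conditional/unconditional training just described, under one assumed conditioning mechanism: lightly noised context latents are pasted into the input and flagged with a mask channel. The values of n_cond, s_max, p_c, and the denoiser interface are illustrative assumptions, not the paper's exact design.

```python
# Sketch of frame-conditional training with conditional latent perturbation
# and p_c / p_u mixing (assumed conditioning mechanism, not the authors' code).
import torch
import torch.nn.functional as F

def frame_conditional_step(denoiser, z0, alphas_cumprod, n_cond=4, p_c=0.9, s_max=100):
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps

    # Conditional latent perturbation: noise the context frames to a small timestep s.
    s = torch.randint(0, s_max, (B,), device=z0.device)
    a_bar_s = alphas_cumprod[s].view(-1, 1, 1, 1, 1)
    ctx = z0[:, :, :n_cond]
    ctx = a_bar_s.sqrt() * ctx + (1 - a_bar_s).sqrt() * torch.randn_like(ctx)

    # Mask channel marks context frames. With probability p_u = 1 - p_c the
    # condition is dropped and the step becomes unconditional.
    mask = torch.zeros_like(z_t[:, :1])
    keep = (torch.rand(B, device=z0.device) < p_c).view(-1, 1, 1, 1, 1).float()
    z_in = z_t.clone()
    z_in[:, :, :n_cond] = keep * ctx + (1 - keep) * z_t[:, :, :n_cond]
    mask[:, :, :n_cond] = keep

    pred = denoiser(torch.cat([z_in, mask], dim=1), t)   # epsilon prediction
    return F.mse_loss(pred, eps)

# Usage with a throwaway denoiser mapping (B, z_ch+1, T, h, w) -> (B, z_ch, T, h, w):
toy = torch.nn.Conv3d(5, 4, 3, padding=1)
z0 = torch.randn(2, 4, 8, 16, 16)
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
loss = frame_conditional_step(lambda x, t: toy(x), z0, alphas_cumprod)
loss.backward()
```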


Comparisons on Short Video Generation


Qualitative Comparisons

We compare our approach with DIGAN (Yu et al. 2022) and TATS (Ge et al. 2022) using randomly generated samples.


Sky Time-lapse

DIGAN (128 × 128 × 16 frames)

TATS (128 × 128 × 16 frames)

LVDM (Ours) (256 × 256 × 16 frames)


UCF-101

DIGAN (128 × 128 × 16 frames)

TATS (128 × 128 × 16 frames)

LVDM (Ours) (256 × 256 × 16 frames)


TaiChi

DIGAN (128 × 128 × 16 frames)

TATS (128 × 128 × 16 frames)

LVDM (Ours) (256 × 256 × 16 frames)

Quantitative Comparisons

We compare the short video generation performance among various approaches on Sky Time-lapse, UCF-101, and TaiChi datasets.

Our approach achieves state-of-the-art performance at a higher spatial resolution.




Comparisons on Long Video Generation

Qualitative Comparisons

We compare our approach with TATS (Ge et al. 2022) using randomly generated samples from both the autoregressive and hierarchical variants.


UCF-101

TATS - autoregressive (128 × 128 × 1000 frames)

Ours - autoregressive (256 × 256 × 1000 frames)

Ours - hierarchical (256 × 256 × 1000 frames)


Sky Time-lapse

TATS - hierarchical (128 × 128 × 1000 frames)

Ours - hierarchical (256 × 256 × 1000 frames)

Quantitative Comparisons

We compare the long video generation performance (1024 frames) of our approach and TATS (Ge et al. 2022) on UCF-101 and Sky Time-lapse.

Our approach surpasses TATS by a large margin at a higher spatial resolution. (The default sampling method is DDPM; * indicates DDIM sampling.)
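For reference, the sketch below shows how a frame-conditional prediction DM can extend a clip autoregressively to such lengths: each round denoises a new latent chunk conditioned on the tail of what has been generated, only the new frames are appended, and the result is decoded back to pixels. The DDPM update, chunk shapes, and the stand-in denoiser/decoder are assumptions for illustration, not the released sampler.

```python
# Sketch of autoregressive long-video sampling with a frame-conditional DM.
import torch

@torch.no_grad()
def sample_chunk(denoiser, cond, shape, betas):
    """Simple DDPM ancestral sampling of one latent chunk (epsilon-prediction)."""
    alphas = 1 - betas
    a_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)
    for t in reversed(range(len(betas))):
        eps = denoiser(z, torch.full((shape[0],), t), cond)
        mean = (z - betas[t] / (1 - a_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        z = mean + betas[t].sqrt() * torch.randn_like(z) if t > 0 else mean
    return z

@torch.no_grad()
def sample_long_video(denoiser, decode, chunk_shape, n_chunks, betas, n_cond=4):
    """Autoregressive extension: each new chunk is conditioned on the last
    n_cond latent frames generated so far; only its new frames are kept."""
    video = sample_chunk(denoiser, None, chunk_shape, betas)        # unconditional first clip
    for _ in range(n_chunks - 1):
        cond = video[:, :, -n_cond:]                                # last generated latent frames
        chunk = sample_chunk(denoiser, cond, chunk_shape, betas)    # new chunk given the condition
        video = torch.cat([video, chunk[:, :, n_cond:]], dim=2)     # append only the new frames
    return decode(video)                                            # 3D decoder back to pixel space

# Usage with stand-ins (identity decoder, zero-noise predictor) just to show the flow:
betas = torch.linspace(1e-4, 2e-2, 50)                  # short schedule for illustration
toy_denoiser = lambda z, t, cond: torch.zeros_like(z)   # stand-in epsilon predictor
toy_decode = lambda z: z                                # stand-in for the 3D decoder
frames = sample_long_video(toy_denoiser, toy_decode, (1, 4, 8, 16, 16), n_chunks=4, betas=betas)
```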




BibTeX


@article{he2022lvdm,
  title={Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths},
  author={Yingqing He and Tianyu Yang and Yong Zhang and Ying Shan and Qifeng Chen},
  year={2022},
  journal={arXiv preprint arXiv:2211.13221},
}