LVDM for video generation

A corgi is swimming fast

There is a table by a window with sunlight streaming through illuminating a pile of books.

A glass bead falling into water with a huge splash. Sunset in the background

Aerial autumn forest with a river in the mountains.

astronaut riding a horse

A clear wine glass with turquoise-colored waves inside it.

A bear dancing and jumping to upbeat music, moving his whole body.

A bigfoot walking in the snowstorm.

An iron man surfing in the sea

Filling a glass with warm coffee

3d fluffy Lion grinned, closeup cute and adorable, long fuzzy fur, Pixar render

A big palace is flying away, anime style, best quality

A teddy bear is drinking a big wine

A giant spaceship is landing on mars in the sunset. High Definition.

A happy elephant wearing a big birthday hat walking under the sea, 4k

Albert Einstein washing dishes

AI-generated content has attracted considerable attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models that leverage a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space so that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation in long video generation, we propose conditional latent perturbation and unconditional guidance, which effectively mitigate the errors accumulated as the video length is extended. Extensive experiments on small-domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
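The abstract names conditional latent perturbation and unconditional guidance without giving formulas. The sketch below is one plausible reading, not the authors' implementation: it assumes the perturbation is a standard forward-diffusion noising of the conditioning latents, and that the guidance is a classifier-free-guidance-style blend of conditional and unconditional noise predictions. The function names, the `alpha_bar` parameter, and the `scale` parameter are all hypothetical.

```python
import math
import random


def perturb_condition(latents, alpha_bar, rng):
    """Noise the conditioning latents with one forward-diffusion step.

    Assumption: alpha_bar in (0, 1] is the cumulative signal rate at a
    chosen perturbation level; alpha_bar = 1.0 leaves the latents unchanged.
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * z + noise * rng.gauss(0.0, 1.0) for z in latents]


def guided_noise(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    Assumption: scale = 1.0 returns the conditional prediction and
    scale = 0.0 returns the unconditional one.
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy one-dimensional "latents" standing in for conditioning frames.
    noisy_condition = perturb_condition([0.5, -1.0, 2.0], 0.9, rng)
    print(len(noisy_condition))
    # Toy noise predictions from a hypothetical denoiser.
    print(guided_noise([1.0, 2.0], [0.0, 1.0], 1.5))
```

The intuition under these assumptions: noising the conditioning frames during training exposes the model to imperfect conditions like those it meets when extending its own outputs, while the blend lets sampling lean on the unconditional prediction when the condition has drifted.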

Research Paper

Latent Video Diffusion Models for High-Fidelity Long Video Generation

¹The Hong Kong University of Science and Technology ²Tencent AI Lab

Text-to-Video Generation

(Extension)

Unconditional Long Video Generation

(Thousands of frames)

Dataset: UCF-101

Dataset: Sky Timelapse

Unconditional Short Video Generation

(16 frames)

Dataset: UCF-101

Dataset: Sky Timelapse

Dataset: TaiChi

Abstract

Framework

Hierarchical LVDM Pipeline

[Video comparison: Ours-Hierarchical, Ours-Autoregressive, TATS-Autoregressive]

[Video comparison: Ours-Hierarchical, TATS-Hierarchical]

[Video comparison: Ours, DIGAN, TATS]

[Video comparison: Ours, DIGAN, TATS]
