ScaleCrafter: Tuning-free Higher-Resolution Visual Generation
with Diffusion Models

Yingqing He* 1 Shaoshu Yang* 2 Haoxin Chen3 Xiaodong Cun3 Menghan Xia3 Yong Zhang# 3 Xintao Wang3 Ran He2 Qifeng Chen# 1 Ying Shan3
1 Hong Kong University of Science and Technology      2 Chinese Academy of Sciences       3 Tencent AI Lab      
(* Equal Contribution    # Corresponding Author)
ICLR 2024 Spotlight

Generate an image of “A panda is surfing in the universe” at a resolution of 1024 x 1024 using SD 2.1 trained on 512 x 512

Compared methods: 1. Direct Inference, 2. Scaling Attention [Jin et al. 2023], 3. MultiDiffusion [Bar-Tal et al. 2023], 4. SyncDiffusion [Lee et al. 2023], 5. Ours

Generate a video of “An astronaut is waving his hands on the moon” at a resolution of 1024 x 640
using a latent video diffusion model trained on 512 x 320

Compared methods: 1. Direct Inference, 2. Scaling Attention, 3. Ours


Abstract

In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot address these issues well. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation approach that can dynamically adjust the convolutional perception field during inference. We further propose dispersed convolution and noise-damped classifier-free guidance, which enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
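
To make the re-dilation idea concrete, below is a minimal PyTorch sketch of the mechanism described above: at inference time, the pretrained convolution weights are reused unchanged but applied with an enlarged dilation, which widens the receptive field to match the higher resolution without any retraining. The function name redilated_conv2d and the fixed dilation factor of 2 are illustrative assumptions rather than the official ScaleCrafter code; the actual method derives dilation factors from the ratio between inference and training resolutions and applies them only to selected layers and timesteps.

```python
import torch
import torch.nn.functional as F

def redilated_conv2d(x, conv, dilation=2):
    """Apply a pretrained Conv2d layer with an enlarged dilation at inference.

    The original kernel weights are reused unchanged; only the dilation (and
    hence the receptive field) grows. Padding is rescaled so the spatial size
    is preserved for stride-1, odd-sized kernels.
    """
    pad = tuple(dilation * (k // 2) for k in conv.kernel_size)
    return F.conv2d(x, conv.weight, conv.bias,
                    stride=conv.stride, padding=pad,
                    dilation=dilation, groups=conv.groups)

# Toy usage: a 3x3 conv trained at low resolution, applied with dilation 2
# when sampling at 2x the training resolution (hypothetical standalone demo).
conv = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
x = torch.randn(1, 4, 128, 128)            # latent at the higher resolution
y = redilated_conv2d(x, conv, dilation=2)  # same spatial size, wider receptive field
assert y.shape == x.shape
```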



Results

Higher-Resolution Synthesized Images with SD XL

Utilizing a model pretrained at a resolution of 1024 x 1024, our method can generate images at resolutions of up to 4096 x 4096 without any training or optimization.


Higher-Resolution Synthesized Videos with Latent Video Diffusion Models

Utilizing a model pretrained at a resolution of 512 x 320, our method can generate videos at a resolution of 1024 x 640 without any training or optimization.


Higher-Resolution Synthesized Images with SD 2.1

Utilizing a model pretrained at a resolution of 512 x 512, our method can generate images at resolutions of up to 2048 x 2048 without any training or optimization.


Higher-Resolution Synthesized Images with SD 1.5

Utilizing a model pretrained at a resolution of 512 x 512, our method can generate images at resolutions of up to 2048 x 2048 without any training or optimization.


Comparison with Baselines on Text-to-Panorama Generation

We generate panoramas of resolution 3072 x 512 following SyncDiffusion.
As shown below, our approach produces much more plausible and natural global structures than the baselines,
which exhibit repetitive semantics (1, 2, 3, 4) and incoherent layouts (3, 4).