Generated videos with resolution of 2048 x 1152.
Input: “a girl in astronaut outfit in the spaceship turning around, cinematic, super detailed face”
Input: “Aerial autumn forest with a river in the mountains”
Input: “A beautiful girl on a boat”
Input: “A majestic eagle soars high above the treetops, surveying its territory.”
Input: “an iron man walking on the new york times square, cinematic”
Input: “a beautiful woman smiling on a bridge, facing camera, close-up shot, sunset, cinematic, super detailed face”
Input: “A fluffy baby sloth with an orange knitted hat”
Generate an image of “A panda is surfing in the universe” of resolution 1024 x 1024 using SD 2.1 trained on 512 x 512
1. Direct Inference | 2. Scaling Attention [Jin et al. 2023] | 3. MultiDiffusion [Bar-Tal et al. 2023] | 4. SyncDiffusion [Lee et al. 2023] | 5. Ours
Generate a video of “An astronaut is waving his hands on the moon” of resolution 1024 x 640 using a latent video diffusion model trained on 512 x 320
1. Direct Inference | 2. Scaling Attention | 3. Ours
In this work, we investigate the capability of pre-trained diffusion models to generate images at much higher resolutions than the training image sizes. In addition, the generated images should support arbitrary aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion trained on images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, do not adequately address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach addresses the repetition issue well and achieves state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a diffusion model pre-trained on low-resolution images can be used directly for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
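The idea behind re-dilation can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it uses a 1D pure-Python convolution, assumes an integer dilation factor equal to the ratio of inference to training resolution, and uses nearest-neighbour upsampling as a stand-in for higher-resolution content. The function names `conv1d`, `receptive_field`, and `redilation_factor` are hypothetical and chosen only for illustration.

```python
def conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D cross-correlation with a (possibly dilated) kernel."""
    span = dilation * (len(kernel) - 1) + 1  # input samples covered by the kernel
    return [
        sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one kernel application."""
    return dilation * (kernel_size - 1) + 1


def redilation_factor(train_res, target_res):
    """Assumed rule for this sketch: dilation = inference / training resolution."""
    if target_res % train_res != 0:
        raise ValueError("this sketch only handles integer resolution ratios")
    return target_res // train_res


kernel = [1, -2, 1]                      # fixed "pre-trained" weights
low = [0, 1, 4, 9, 16, 25]               # signal at training resolution
high = [v for v in low for _ in (0, 1)]  # same content at 2x resolution

d = redilation_factor(512, 1024)         # dilation used at inference time

at_train_res = conv1d(low, kernel)               # what the kernel saw in training
direct = conv1d(high, kernel)                    # direct inference at 2x
redilated = conv1d(high, kernel, dilation=d)     # same weights, dilated by d

print("receptive field:", receptive_field(len(kernel), 1), "->",
      receptive_field(len(kernel), d))
print("training res :", at_train_res)
print("direct 2x    :", direct)
print("re-dilated 2x:", redilated)
```

In this toy setting, the undilated kernel applied at 2x resolution responds to a narrower slice of the content than it did at training resolution, while the dilated kernel reproduces the training-resolution responses at every second position. No weights are changed; only the sampling positions of the kernel differ, which is why no training or optimization is involved.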
Utilizing a pretrained model trained on a resolution of 1024 x 1024, our method can generate images with a resolution of up to 4096 x 4096 without the need for training/optimization.
Resolution: 4096x4096; Prompt: “Miniature house with plants in the potted area, hyper realism, dramatic ambient lighting, high detail”
Resolution: 4096x4096; Prompt: “Close up of a pair of vibrant koi fish swim upstream, surmounting a waterfall, oil painting style”
Resolution: 4096x4096; Prompt: “North sunset in Norway. Stunning sea view from mountains. Beautiful bay in evening sunlight. Lofoten islands landscape.”
Resolution: 4096x2048; Prompt: “A cherry blossom tree in full bloom amidst an arctic tundra showering petals on a polar bear”
Resolution: 4096x2048; Prompt: “baby succulents from ikea interior pinterest plants cacti and flora. Black Bedroom Furniture Sets. Home Design Ideas”
Resolution: 4096x2048; Prompt: “Handmade Watercolor Painting, home decor”
Resolution: 4096x2048; Prompt: “Placemats Sunflower Summer Landscape Heat Stain Resistant Non-Slip Place Mats for Kitchen Dining Table 12 x 18 Inch 4 Pc”
Resolution: 4096x2048; Prompt: “Spring Magnolia and Thistle bouquet. Featuring bronze magnolia leaves, woodland greeneries, pussywillow, blush, scarlet and pink peonies, cream roses, sea holly thistle and cream, gold and navy ribbons..jpg”
Resolution: 2048x2048; Prompt: “16 Inch Anime The Jungle Book Backpack For Teenagers Boys Girls School Bags Travel Bag Children School Backpacks Gift”
Utilizing a pretrained model trained on a resolution of 512 x 320, our method can generate videos of 1024 x 640 resolution without the need for training/optimization.
Resolution: 1024x640; Prompt: “Ironman is fighting against the enemy, big fire in the background, photorealistic, 4k”
Resolution: 1024x640; Prompt: “A solemn tortoise slowly ambles along, carrying its home on its back.”
Resolution: 1024x640; Prompt: “A charming raccoon stealthily rummages through a homeowner's trash can.”
Resolution: 1024x640; Prompt: “A panda playing on a swing set”
Resolution: 1024x640; Prompt: “Close up of grapes on a rotating table. High Definition.”
Resolution: 1024x640; Prompt: “A fat rabbit wearing a purple robe walking through a fantasy landscape.”
Resolution: 1024x640; Prompt: “Teddy bear walking down 5th Avenue, front view, beautiful sunset, close up, high definition, 4k.”
Utilizing a pretrained model trained on a resolution of 512 x 512, our method can generate images with a resolution of up to 2048 x 2048 without the need for training/optimization.
Resolution: 2048x2048; Prompt: “HDR Image of Yosemite Falls”
Resolution: 2048x2048; Prompt: “Maine Coon”
Resolution: 2048x2048; Prompt: “Picture the sky, stones, Norway, Mountain Sunrice”
Resolution: 2048x2048; Prompt: “winter sunset lighthouse snow iceland”
Resolution: 2048x1024; Prompt: “A corgi sits on a beach chair on a beautiful beach, with palm trees behind, high details”
Resolution: 2048x1024; Prompt: “A car in a garden, with a lake and Eiffel Tower”
Resolution: 2048x1024; Prompt: “a photo of wonder woman”
Resolution: 2048x1024; Prompt: “The winding Great Wall of China in autumn”
Resolution: 2048x1024; Prompt: “A Cute Puppy with wings, Cartoon Drawings, high details”
Resolution: 2048x1024; Prompt: “A photo of a raccoon wearing an astronaut helmet, looking out”
Resolution: 2048x1024; Prompt: “A Cute Puppy with wings, in sky, photorealistic, high detail”
Resolution: 2048x1024; Prompt: “Sushi Roll on a wooden table”
Resolution: 1024x1024; Prompt: “photo of a cat holding a pineapple”
Resolution: 1024x1024; Prompt: “A rabbit is skateboarding in Time Square”
Resolution: 1024x1024; Prompt: “cherry with water splashed in all directions in a bowl”
Resolution: 1024x1024; Prompt: “A rabbit is riding a bicycle in New York Street”
Resolution: 2048x2048; Prompt: “Stack coins with plants to grow into steps”
Utilizing a pretrained model trained on a resolution of 512 x 512, our method can generate images with a resolution of up to 2048 x 2048 without the need for training/optimization.
Resolution: 2048x2048; Prompt: “A rustic wooden cabin nestled in a snowy forest”
Resolution: 2048x1024; Prompt: “A dog wearing a Superhero outfit with red cape flying through the sky”
Resolution: 2048x1024; Prompt: “A picturesque mountain scene with a clear lake reflecting the surrounding peaks”
Resolution: 2048x1024; Prompt: “A squirrel eating an acorn in a forest”
Resolution: 1024x1024; Prompt: “A beautiful sunset over a calm ocean with a lighthouse in the distance”
Resolution: 1024x1024; Prompt: “A butterfly landing on a sunflower”
Resolution: 1024x1024; Prompt: “A charming raccoon stealthily rummages through a homeowner's trash can”
Resolution: 1024x1024; Prompt: “A confused grizzly bear in calculus class”
Resolution: 1024x1024; Prompt: “A fox peeking out from behind a bush”
Resolution: 1024x1024; Prompt: “A picturesque mountain scene with a clear lake reflecting the surrounding peaks”
Resolution: 1024x1024; Prompt: “A tranquil garden filled with blooming flowers and a small fountain”
We generate panoramas at a resolution of 3072 x 512, following SyncDiffusion.
As shown below, our approach generates much more plausible and natural global structures than the baselines,
which exhibit repetitive semantics (1, 2, 3, 4) and incoherent layouts (3, 4).
“A photo of a lake under the northern lights”
“A beach with palm trees”
“A film photo of a beachside street under the sunset”