Bigger is not Always Better:

Scaling Properties of Latent Diffusion Models

Kangfu Mei^{1}\(^{1,2}\), Zhengzhong Tu\(^{1}\), Mauricio Delbracio\(^{1}\),

Hossein Talebi\(^{1}\), Vishal M. Patel\(^{2}\), Peyman Milanfar\(^{1}\)

April 01, 2024

We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architectures and inference algorithms have been shown to effectively boost the sampling efficiency of diffusion models, the role of model size, a critical determinant of sampling efficiency, has not been thoroughly examined. Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results. Moreover, we extend our study to demonstrate the generalizability of these findings by applying various diffusion samplers, exploring diverse downstream tasks, evaluating post-distilled models, and comparing performance relative to training compute. These findings open up new pathways for the development of LDM scaling strategies that can be employed to enhance generative capabilities within limited inference budgets.

Latent diffusion models (LDMs) [@rombach2022high], and diffusion models in general, trained on large-scale, high-quality data [@lin2014microsoft; @schuhmann2022laion] have emerged as a powerful and robust framework for generating impressive results in a variety of tasks, including image synthesis and editing [@rombach2022high; @podell2023sdxl; @delbracio2023inversion; @ren2023multiscale; @qi2023tip], video creation [@mei2023vidm; @mei2023t1; @wu2023tune; @singer2022make], audio production [@liu2023audioldm], and 3D synthesis [@lin2023magic3d; @liu2023zero]. Despite their versatility, the major barrier against wide deployment in real-world applications [@du2023exploring; @choi2023squeezing] comes from their low *sampling efficiency*. The essence of this challenge lies in the inherent reliance of LDMs on multi-step sampling [@song2020score; @ho2020denoising] to produce high-quality outputs, where the total cost of sampling is the product of the number of sampling steps and the cost of each step. Specifically, the go-to approach involves using 50-step DDIM sampling [@song2020denoising; @rombach2022high], a process that, despite ensuring output quality, still incurs a relatively long latency on modern mobile devices with post-quantization. In contrast to single-shot generative models, e.g., generative adversarial networks (GANs) [@goodfellow2020generative], which bypass the need for iterative refinement [@goodfellow2020generative; @karras2019style], the operational latency of LDMs creates a pressing need for efficiency optimization to further facilitate their practical applications.

Recent advancements in this field [@li2023snapfusion; @zhao2023mobilediffusion; @peebles2023scalable; @kim2023bk; @kim2023architectural; @choi2023squeezing] have primarily focused on developing faster network architectures with comparable model size to reduce the inference time per step, along with improved sampling algorithms that allow for using fewer sampling steps [@song2020denoising; @dockhorn2022genie; @karras2022elucidating; @lu2022dpm; @liu2023instaflow; @xu2023ufogen]. Further progress has been made through diffusion-distillation techniques [@luhman2021knowledge; @salimans2022progressive; @song2023consistency; @sauer2023adversarial; @gu2023boot; @mei2023conditional], which simplify the process by learning multi-step sampling results in a single forward pass and then broadcasting this single-step prediction multiple times. These distillation techniques leverage the redundant learning capability in LDMs, enabling the distilled models to assimilate additional distillation knowledge. Despite these efforts to improve diffusion models, the sampling efficiency of smaller, less redundant models has not received adequate attention. A significant barrier to this area of research is the scarcity of available modern accelerator clusters [@jouppi2023tpu], as training high-quality text-to-image (T2I) LDMs from scratch is both time-consuming and expensive, often requiring several weeks and hundreds of thousands of dollars.

In this paper, we empirically investigate the scaling properties of LDMs, with a particular focus on understanding how their scaling properties impact the sampling efficiency across various model sizes. We trained a suite of 12 text-to-image LDMs from scratch, ranging from 39 million to 5 billion parameters, under a constrained budget. Example results are depicted in Fig. 1. All models were trained on TPUv5 using internal data sources with about 600 million aesthetically-filtered text-to-image pairs. Our study reveals that there exists a scaling trend within LDMs, notably that smaller models may be able to surpass larger models under an equivalent sampling budget. Furthermore, we investigate how the size of pre-trained text-to-image LDMs affects their sampling efficiency across diverse downstream tasks, such as real-world super-resolution [@saharia2022image; @sahak2023denoising] and subject-driven text-to-image synthesis (i.e., Dreambooth) [@ruiz2023dreambooth].

Our key findings for scaling latent diffusion models in text-to-image generation and various downstream tasks are as follows:

**Pretraining performance scales with training compute.** We demonstrate a clear link between compute resources and LDM performance by scaling models from 39 million to 5 billion parameters. This suggests potential for further improvement
with increased scaling. See Section 3.1 for details.

**Downstream performance scales with pretraining.** We demonstrate a strong correlation between pretraining performance and success in downstream tasks. Smaller models, even with extra training, cannot fully bridge the gap created by the
pretraining quality of larger models. This is explored in detail in Section 3.2.

**Smaller models sample more efficiently.** Smaller models initially outperform larger models in image quality for a given sampling budget, but larger models surpass them in detail generation when computational constraints are relaxed. This is further elaborated in Section 3.3.1 and Section 3.3.2.

**Sampler does not change the scaling efficiency.** Smaller models consistently demonstrate superior sampling efficiency, regardless of the diffusion sampler used. This holds true for deterministic DDIM [@song2020denoising], stochastic DDPM [@ho2020denoising], and higher-order DPM-Solver++ [@lu2022dpm2].
For more details, see Section 3.4.

**Smaller models sample more efficiently on the downstream tasks with fewer steps.** The advantage of smaller models in terms of sampling efficiency extends to the downstream tasks when using fewer than 20 sampling steps. This is further elaborated in Section 3.5.

**Diffusion distillation does not change scaling trends.** Even with diffusion distillation, smaller models maintain competitive performance against larger distilled models when sampling budgets are constrained. This suggests distillation
does not fundamentally alter scaling trends. See Section 3.6 for in-depth analysis.

Recent Large Language Models (LLMs) including GPT [@brown2020language], PaLM [@anil2023palm], and LLaMa [@touvron2023llama] have dominated language generative modeling tasks. The foundational works [@kaplan2020scaling; @brown2020language; @hoffmann2022training] investigating their scaling behavior have shown that performance can be predicted from the model size. They also investigated the factors that affect the scaling properties of language models, including training compute, dataset size and quality, learning rate schedule, etc. Those experimental clues have effectively guided later language model development, leading to the emergence of several parameter-efficient LLMs [@hoffmann2022training; @touvron2023llama; @zhou2023brainformers; @alabdulmohsin2024getting]. However, the scaling of generative text-to-image models is relatively unexplored, and existing efforts have only investigated the scaling properties on small datasets or small models, like scaling UNet [@nichol2021improved] to 270 million parameters and DiT [@peebles2023scalable] on ImageNet (14 million), or on less-efficient autoregressive models [@chen2020generative]. Different from these attempts, our work investigates the scaling properties by scaling down the efficient and capable diffusion models, *i.e.*, LDMs [@rombach2022high], on internal data sources that have about 600 million aesthetics-filtered text-to-image pairs, in order to characterize the sampling efficiency of scaled LDMs. We also scale LDMs in various scenarios, such as finetuning LDMs on downstream tasks [@wang2021real; @ruiz2023dreambooth] and distilling LDMs [@mei2023conditional] for faster sampling, to demonstrate the generalizability of the scaled sampling efficiency.

Nichol et al. [@nichol2021improved] show that the generative performance of diffusion models improves as the model size increases. Based on this preliminary observation, the model size of widely used LDMs, *e.g.*, Stable Diffusion [@rombach2022high], has been empirically increased to billions of parameters [@ramesh2022hierarchical; @podell2023sdxl]. However, such large models cannot fit into the common inference budget of practical scenarios. Recent work on improving the sampling efficiency focuses on improving network architectures [@li2023snapfusion; @zhao2023mobilediffusion; @peebles2023scalable; @kim2023bk; @kim2023architectural; @choi2023squeezing] or the sampling procedures [@song2020denoising; @dockhorn2022genie; @karras2022elucidating; @lu2022dpm; @liu2023instaflow; @xu2023ufogen]. We explore sampling efficiency by training smaller, more compact LDMs. Our analysis involves scaling down the model size, training from scratch, and comparing performance at equivalent inference cost.


Params | 39M | 83M | 145M | 223M | 318M | 430M | 558M | 704M | 866M | 2B | 5B |
---|---|---|---|---|---|---|---|---|---|---|---|
Filters \((c)\) | 64 | 96 | 128 | 160 | 192 | 224 | 256 | 288 | 320 | 512 | 768 |
GFLOPS | 25.3 | 102.7 | 161.5 | 233.5 | 318.5 | 416.6 | 527.8 | 652.0 | 789.3 | 1887.5 | 4082.6 |
Norm. Cost | 0.07 | 0.13 | 0.20 | 0.30 | 0.40 | 0.53 | 0.67 | 0.83 | 1.00 | 2.39 | 5.17 |
FID \(\downarrow\) | 25.30 | 24.30 | 24.18 | 23.76 | 22.83 | 22.35 | 22.15 | 21.82 | 21.55 | 20.98 | 20.14 |
CLIP \(\uparrow\) | 0.305 | 0.308 | 0.310 | 0.310 | 0.311 | 0.312 | 0.312 | 0.312 | 0.312 | 0.312 | 0.314 |

Compared to diffusion models, other generative models such as Variational Autoencoders (VAEs) [@kingma2013auto; @rezende2015variational; @makhzani2015adversarial; @vahdat2020nvae], Generative Adversarial Networks (GANs) [@goodfellow2020generative; @mao2017least; @karras2019style; @reed2016generative; @miyato2018spectral], and Masked Models [@devlin2018bert; @raffel2020exploring; @he2022masked; @chang2022maskgit; @chang2023muse] are more efficient, as they rely less on an iterative refinement process. Sauer et al. [@sauer2023stylegan] recently scaled up StyleGAN [@karras2019style] to 1 billion parameters and demonstrated the single-step GANs’ effectiveness in modeling text-to-image generation. Chang et al. [@chang2023muse] scaled up masked transformer models for text-to-image generation. These non-diffusion generative models can generate high-quality images at lower inference cost, since they require fewer sampling steps than diffusion models and autoregressive models, but they need more parameters, *i.e.*, 4 billion parameters.

We developed a family of powerful Latent Diffusion Models (LDMs) built upon the widely-used `866M` Stable Diffusion v1.5 standard [@rombach2022high]^{2}. The denoising UNet of our models offers a flexible range of sizes, with parameters spanning from `39M` to `5B`. We incrementally increase the number of filters in the residual blocks while keeping the other architectural elements the same, enabling predictably controlled scaling. Table 1 shows the architectural differences among our scaled models, together with the relative cost of each model against the baseline model. Fig. 2 shows the architectural differences during scaling. Models were trained using internal data sources with about 600 million aesthetically-filtered text-to-image pairs. All the models are trained for 500K steps with a batch size of 2048 and a learning rate of 1e-4. This allows all the models to reach a point where we observe diminishing returns. Fig. 1 demonstrates the consistent generation capabilities across our scaled models. We used the common practice of 50 sampling steps with the DDIM sampler and a classifier-free guidance rate of 7.5 for text-to-image generation. The visual quality of the results exhibits a clear improvement as model size increases.
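The relative cost column of Table 1 appears to track the per-step GFLOPS of each model divided by the GFLOPS of the 866M baseline. The sketch below recomputes that ratio from the table values. Treating the normalized cost as a pure FLOPs ratio is an assumption inferred from the numbers, not something stated in the text, and the names `GFLOPS` and `normalized_cost` are illustrative.

```python
# Per-step GFLOPS copied from Table 1 (keys are model sizes).
GFLOPS = {
    "39M": 25.3, "83M": 102.7, "145M": 161.5, "223M": 233.5,
    "318M": 318.5, "430M": 416.6, "558M": 527.8, "704M": 652.0,
    "866M": 789.3, "2B": 1887.5, "5B": 4082.6,
}


def normalized_cost(model: str, baseline: str = "866M") -> float:
    """Assumed definition: GFLOPS of `model` relative to the baseline model."""
    return GFLOPS[model] / GFLOPS[baseline]


for name in GFLOPS:
    print(f"{name:>5}: {normalized_cost(name):.2f}")
```

Under this assumption the ratio matches the tabulated normalized cost for the 83M through 5B models after rounding to two decimals. The 39M entry comes out lower than the tabulated 0.07, so the table may rely on a different measurement for that model.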

To evaluate the scaled models, we test their text-to-image performance on the validation set of COCO 2014 [@lin2014microsoft] with 30k samples. For downstream performance, specifically real-world super-resolution, we test the scaled models on the validation set of DIV2K with 3k randomly cropped patches, which are degraded with the RealESRGAN degradation [@wang2021real].
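For reference, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to feature embeddings of real and generated images. The symbols below are standard notation rather than notation taken from this paper, and the choice of feature extractor (typically an Inception network) is an assumption, since the text does not specify it:

\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
\]

Here \(\mu_r, \Sigma_r\) are the mean and covariance of the real-image features and \(\mu_g, \Sigma_g\) those of the generated-image features. Lower values indicate that the two feature distributions are closer.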

We find that our scaled LDMs, across various model sizes, exhibit similar trends in generative performance relative to training compute cost, especially after training stabilizes, which typically occurs after 200K iterations. These trends demonstrate a smooth scaling in learning capability between different model sizes. To elaborate, Fig. 3 illustrates a series of training runs with models varying in size from 39 million to 5 billion parameters, where the training compute cost is quantified as the product of relative cost shown in Table 1 and training iterations. Model performance is evaluated by using the same sampling steps and sampling parameters. In scenarios with moderate training compute (i.e., \(<1G\), see Fig. 3), the generative performance of T2I models scales well with additional compute resources.
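The training compute described above is a simple product, which can be sketched as follows. The dictionary copies the normalized costs from Table 1; the function names and the matched-compute example are hypothetical illustrations, not part of the reported experiments.

```python
# Normalized per-step cost from Table 1 (866M baseline = 1.00).
NORM_COST = {
    "39M": 0.07, "83M": 0.13, "145M": 0.20, "223M": 0.30,
    "318M": 0.40, "430M": 0.53, "558M": 0.67, "704M": 0.83,
    "866M": 1.00, "2B": 2.39, "5B": 5.17,
}


def training_compute(model: str, iterations: int) -> float:
    """Relative training compute: normalized cost x training iterations."""
    return NORM_COST[model] * iterations


def iterations_for_compute(model: str, compute: float) -> int:
    """Whole iterations `model` can run within a given relative compute."""
    return int(compute // NORM_COST[model])


# Hypothetical question: how long could the 83M model train on the
# compute that the 866M model spends in 200K iterations?
budget = training_compute("866M", 200_000)
print(iterations_for_compute("83M", budget))
```

The compute here is unitless and expressed in multiples of one training iteration of the 866M baseline.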

Starting from the scaled models pretrained on text-to-image data, we finetune them on the downstream tasks of real-world super-resolution [@saharia2022image; @sahak2023denoising] and DreamBooth [@ruiz2023dreambooth]. The performance of these pretrained models is shown in Table 1. In the left panel of Fig. 4, we present the generative performance FID versus training compute on the super-resolution (SR) task. It can be seen that the performance of SR models depends more on the model size than on training compute. Our results demonstrate a clear limitation of smaller models: they cannot reach the same performance levels as larger models, regardless of training compute.

While the distortion metric LPIPS shows some inconsistencies compared to the generative metric FID (Fig. 4), Fig. 5 clearly demonstrates that larger models excel in recovering fine-grained details compared to smaller models.

The key takeaway from Fig. 4 is that large super-resolution models achieve superior results even after short finetuning periods compared to smaller models. This suggests that pretraining performance (dominated by the pretraining model sizes) has a greater influence on the super-resolution FID scores than the duration of finetuning (*i.e.*, training compute for finetuning).

Furthermore, we compare the visual results of the DreamBooth finetuning on the different models in Fig. 6. We observe a similar trend between visual quality and model size. *Please see our supplement for more
discussions on the other quality metrics.*

Text-to-image generative models require nuanced evaluation beyond single metrics. Sampling parameters are vital for customization, with the Classifier-Free Guidance (CFG) rate [@ho2022classifier] directly influencing the balance between visual fidelity and semantic alignment with the text prompt. Rombach et al. [@rombach2022high] experimentally demonstrate that different CFG rates result in different CLIP and FID scores.
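A common formulation of classifier-free guidance, as used in Stable Diffusion style samplers, combines the conditional and unconditional noise predictions. The notation below is an assumption for illustration, since the paper does not write the formula out, with \(w\) denoting the CFG rate, \(z_t\) the noisy latent, \(c\) the text condition, and \(\varnothing\) the empty prompt:

\[
\hat{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \varnothing) + w\,\bigl(\epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing)\bigr)
\]

Under this convention \(w = 1\) recovers the purely conditional prediction, and larger \(w\) pushes the sample further toward the prompt. Some references parameterize the guidance scale with an offset of one relative to this form.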

In this study, we find that the CFG rate as a sampling parameter yields inconsistent results across different model sizes. Hence, it is interesting to quantitatively determine the *optimal* CFG rate for each model size and number of sampling steps using either the FID or the CLIP score. We demonstrate this by sampling the scaled models using different CFG rates, *i.e.*, \((1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)\), and comparing their quantitative and qualitative results. In Fig. 7, we present visual results of two models under varying CFG rates, highlighting the impact on the visual quality. We observed that changes in CFG rates impact visual quality more significantly than prompt semantic accuracy and therefore opted to use the FID score for quantitative determination of the optimal CFG rate. Fig. 8 shows how different classifier-free guidance rates affect the FID scores in text-to-image generation (see figure caption for more details).
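The selection procedure described above amounts to an argmin over the swept CFG rates for each model and step count. A minimal sketch follows, assuming FID has already been measured for every rate. The helper `optimal_cfg` is hypothetical, and the FID values are placeholders that only exercise the code; they are not results from the paper.

```python
from typing import Dict, Tuple

# CFG rates swept in the text.
CFG_RATES = (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)


def optimal_cfg(fid_by_cfg: Dict[float, float]) -> Tuple[float, float]:
    """Return (cfg_rate, fid) with the lowest FID among the measured rates."""
    best = min(fid_by_cfg, key=fid_by_cfg.get)
    return best, fid_by_cfg[best]


# Placeholder FID values, NOT results from the paper.
placeholder_fids = {cfg: 20.0 + (cfg - 3.0) ** 2 for cfg in CFG_RATES}
best_cfg, best_fid = optimal_cfg(placeholder_fids)
print(best_cfg, best_fid)
```

In practice the same call would be repeated for every combination of model size and sampling steps, giving one optimal rate per combination.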

Using the optimal CFG rates established for each model at various numbers of sampling steps, we analyze the optimal performance to understand the sampling efficiency of different LDM sizes. Specifically, in Fig. 9, we present a comparison between different models and their optimal performance given the sampling cost (normalized cost \(\times\) sampling steps). By tracing the points of optimal performance across various sampling costs (represented by the dashed vertical line), we observe a consistent trend: smaller models frequently outperform larger models across a range of sampling costs in terms of FID scores. Furthermore, to visually substantiate the better-quality results generated by smaller models against larger ones, Fig. 10 compares the results of different scaled models, which highlights that the performance of smaller models can indeed match their larger counterparts under similar sampling cost conditions. *Please see our supplement for more visual comparisons.*
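To make the sampling cost concrete, the sketch below computes how many sampling steps each model can afford within a fixed budget, using the normalized costs from Table 1. The function names and the budget value are illustrative assumptions; the code only performs the arithmetic and says nothing about the resulting image quality.

```python
# Normalized per-step cost from Table 1 (866M baseline = 1.00).
NORM_COST = {
    "39M": 0.07, "83M": 0.13, "145M": 0.20, "223M": 0.30,
    "318M": 0.40, "430M": 0.53, "558M": 0.67, "704M": 0.83,
    "866M": 1.00, "2B": 2.39, "5B": 5.17,
}


def sampling_cost(model: str, steps: int) -> float:
    """Sampling cost as defined in the text: normalized cost x sampling steps."""
    return NORM_COST[model] * steps


def max_steps(model: str, budget: float) -> int:
    """Largest number of steps whose sampling cost stays within `budget`."""
    # The small epsilon guards against floating-point round-off at exact ratios.
    return int(budget / NORM_COST[model] + 1e-9)


budget = 6.0  # hypothetical budget, in units of one 866M sampling step
for name in NORM_COST:
    print(f"{name:>5}: up to {max_steps(name, budget)} steps")
```

With this budget the 866M model can take 6 steps while the 83M model can take 46, which is the kind of trade-off the comparison at equal sampling cost examines.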

To assess the generalizability of observed scaling trends in sampling efficiency, we compared scaled LDM performance using different diffusion samplers. In addition to the default DDIM sampler, we employed two representative alternatives: the stochastic DDPM sampler [@ho2020denoising] and the high-order DPM-Solver++ [@lu2022dpm2].

Experiments illustrated in Fig. 11 reveal that the DDPM sampler typically produces lower-quality results than DDIM with fewer sampling steps, while the DPM-Solver++ sampler generally outperforms DDIM in image quality (see the figure caption for details). Importantly, we observe the same sampling-efficiency trends with the DDPM and DPM-Solver++ samplers as with the default DDIM: smaller models tend to achieve better performance than larger models under the same sampling cost. Since the DPM-Solver++ sampler is not designed for use beyond 20 steps, we focused our testing within this range. This finding demonstrates that the scaling properties of LDMs remain consistent regardless of the diffusion sampler used.
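For readers unfamiliar with the default sampler, the following is a minimal sketch of one deterministic DDIM update (the \(\eta = 0\) case), written for a scalar latent to stay dependency-free. It follows the standard DDIM update rule rather than the authors' implementation, and the function and argument names are illustrative.

```python
import math


def ddim_step(x_t: float, eps_pred: float,
              alpha_bar_t: float, alpha_bar_prev: float) -> float:
    """One deterministic DDIM update (eta = 0) for a scalar latent."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Move that estimate to the previous, less noisy timestep.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


# Sanity check with a known clean value and a perfect noise prediction.
x0_true, eps_true, a_t = 2.0, 0.5, 0.25
x_t = math.sqrt(a_t) * x0_true + math.sqrt(1.0 - a_t) * eps_true
print(ddim_step(x_t, eps_true, a_t, 1.0))
```

Each such update requires one forward pass of the denoising network to obtain `eps_pred`, which is why the per-step cost scales with model size.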

Here, we investigate the scaling sampling-efficiency of LDMs on downstream tasks, specifically focusing on the super-resolution task. Unlike our earlier discussions on optimal sampling performance, there is limited literature demonstrating the positive impacts of SR performance without using classifier-free guidance. Thus, our approach directly uses the SR sampling result without applying classifier-free guidance. Inspired by Fig. 4, where the scaled downstream LDMs show a significant performance difference in 50-step sampling, we investigate sampling efficiency from two different aspects, *i.e.*, fewer sampling steps \([4, 20]\) and more sampling steps \((20, 250]\). As shown in the left part of Fig. 12, the scaling sampling-efficiency still holds in the SR tasks when the number of sampling steps is less than or equal to 20. Beyond this threshold, however, larger models demonstrate greater sampling-efficiency than smaller models, as illustrated in the right part of Fig. 12. This observation suggests that the sampling efficiency of scaled models at fewer sampling steps is consistent from text-to-image generation to super-resolution tasks.

We have featured the scaling sampling-efficiency of latent diffusion models, which demonstrates that smaller model sizes exhibit higher sampling efficiency. A notable caveat, however, is that smaller models typically imply reduced modeling capability.
This poses a challenge for recent diffusion distillation methods [@luhman2021knowledge; @salimans2022progressive; @song2023consistency; @sauer2023adversarial;
@gu2023boot; @mei2023conditional; @luo2023latent; @lin2024sdxl] that heavily depend on modeling capability. One might expect a contradictory conclusion and believe the distilled large models sample faster than distilled small models. In order to
demonstrate the sampling efficiency of scaled models after distillation, we distill our previously scaled models with conditional consistency distillation [@song2023consistency;
@mei2023conditional] on text-to-image data and compare those distilled models on their optimal performance. *Please see our supplement for more distillation details.*

To elaborate, we test all distilled models with the same 4-step sampling, which is shown to achieve the best sampling performance; we then compare each distilled model with the undistilled one on the normalized sampling cost. We follow the same practice discussed in Section 3.3.1 for selecting the optimal CFG rate and compare them under the same relative inference cost. The results shown in the left part of Fig. 13 demonstrate that distillation significantly improves the generative performance for all models in 4-step sampling, with FID improvements across the board. By comparing these distilled models with the undistilled models in the right part of Fig. 13, we demonstrate that distilled models outperform undistilled models at the same sampling cost. However, at a specific sampling cost, *i.e.*, sampling cost \(\approx 8\), the smaller undistilled 83M model still achieves similar performance to the larger distilled 866M model. This observation further supports our proposed scaling sampling-efficiency of LDMs, which still holds under diffusion distillation.

In this paper, we investigated scaling properties of Latent Diffusion Models (LDMs), specifically through scaling model size from 39 million to 5 billion parameters. We trained these scaled models from scratch on a large-scale text-to-image dataset and then finetuned the pretrained models for downstream tasks. Our findings unveil that, under identical sampling costs, smaller models frequently outperform larger models, suggesting a promising direction for accelerating LDMs in terms of model size. We further show that the sampling efficiency is consistent in multiple axes. For example, it is invariant to various diffusion samplers (stochastic and deterministic), and also holds true for distilled models. We believe this analysis of scaling sampling efficiency would be instrumental in guiding future developments of LDMs, specifically for balancing model size against performance and efficiency in a broad spectrum of practical applications.

This work utilizes visual quality inspection alongside established metrics like FID and CLIP scores. We opted to avoid human evaluations due to the immense number of different combinations needed for the more than 1000 variants considered in this study. However, it is important to acknowledge the potential discrepancy between visual quality and quantitative metrics, which is actively discussed in recent works [@zhang2021cross; @jayasumana2023rethinking; @cho2023davidsonian]. Furthermore, claims regarding the scalability of latent models are made specifically for the particular model family studied in this work. Extending this analysis to other model families, particularly those incorporating transformer-based backbones [@peebles2023scalable; @mei2023t1; @ma2024sit], would be a valuable direction for future research.

We are grateful to Keren Ye and Jason Baldridge for their valuable feedback. We also extend our gratitude to Shlomi Fruchter, Kevin Murphy, Mohammad Babaeizadeh, and Han Zhang for their instrumental contributions in facilitating the initial implementation of the latent diffusion models.

In order to provide detailed visual comparisons for Fig. 1 in the main manuscript, Fig. 14, Fig. 15, and Fig. 16 show the generated results with the same prompt and the same sampling parameters (*i.e.*, 50-step DDIM sampling and 7.5 CFG rate).

To provide more metrics for the super-resolution experiments in Fig. 4 of the main manuscript, Fig. 17 shows the generative metric IS for the super-resolution results. Fig. 18 provides additional visual super-resolution results to complement the visual comparisons of Fig. 5 in the main manuscript.

Diffusion distillation methods for accelerating sampling are generally derived from Progressive Distillation (PD) [@salimans2022progressive] and Consistency Models (CM) [@song2023consistency]. In the main paper, we have shown that CoDi [@mei2023conditional], which is based on CM, is scalable to different model sizes. Here we show that another investigated method, *i.e.*, guided distillation [@meng2023distillation], has inconsistent acceleration effects across different model sizes. Fig. 19 shows guided distillation results for the 83M and 223M models, respectively, where `s16` and `s8` denote different distillation stages. It is easy to see that the performance improvement of these two models is inconsistent.

Fig. 20 shows the visual results of the CoDi distilled models and the undistilled models under the same sampling cost to demonstrate the sampling-efficiency.

To provide more visual comparisons in addition to Fig. 10 in the main paper, Fig. 21, Fig. 22, and Fig. 23 present visual comparisons between different scaled models under a uniform sampling cost. This highlights that the performance of smaller models can indeed match their larger counterparts under similar sampling cost.

This work was done during an internship at Google.

We adopted SD v1.5 since it is among the most popular diffusion models: https://huggingface.co/models?sort=likes