stability-ai/stable-diffusion

A latent text-to-image diffusion model capable of generating photo-realistic images given any text input

$0.0007 / request
GPU: A100

Pricing

Pricing for Synexa AI models works differently from that of other providers: instead of being billed by compute time, you are billed per input and output, which makes costs more predictable.

Output: $0.0007 per image, or roughly 1,428 images per $1.

For example, generating 100 images should cost around $0.07.

Check out our docs for more information about how per-request pricing works on Synexa.

Provider  | Price ($) | Saving (%)
Synexa    | 0.0007    | -
Replicate | 0.0015    | 53.3
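
As a quick sanity check, here is the arithmetic behind the $0.07 example and the saving versus Replicate, using the prices listed above (illustration only, not a billing API):

```python
# Prices as listed above (USD per generated image).
PRICE_SYNEXA = 0.0007
PRICE_REPLICATE = 0.0015

images = 100
print(f"Cost for {images} images: ${images * PRICE_SYNEXA:.2f}")  # $0.07
print(f"Images per $1: {int(1 / PRICE_SYNEXA)}")                  # 1428

saving = (1 - PRICE_SYNEXA / PRICE_REPLICATE) * 100
print(f"Saving vs. Replicate: {saving:.1f}%")                     # 53.3%
```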

Readme

Stable Diffusion is a cutting-edge latent text-to-image diffusion model, expertly engineered to generate highly realistic images from any given text prompt. This innovative model opens up new possibilities in image generation and manipulation for research and creative applications.

Model Overview

Stable Diffusion is built upon the principles of diffusion models and leverages a latent space approach for efficient and high-quality image synthesis. It is designed to understand and translate textual descriptions into detailed and visually compelling images.

Key Features:

  • Text-to-Image Generation: Generates photo-realistic images based on textual input.
  • Latent Diffusion Model: Operates in a latent space to enhance efficiency and image quality.
  • CLIP Text Encoder: Utilizes a pre-trained CLIP ViT-L/14 text encoder for robust text understanding.
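
For illustration, here is how the open-source release of this model is commonly run with the diffusers library and the public CompVis/stable-diffusion-v1-4 weights. This is a minimal sketch of the model itself, not the Synexa API; see the Synexa docs for the hosted endpoint.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the public v1 weights; fp16 halves memory use on a CUDA GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# Generate a single 512x512 image from a text prompt.
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```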

Developed by: Robin Rombach, Patrick Esser

Model Type: Diffusion-based text-to-image generation model

Language: English

License: CreativeML OpenRAIL-M

This model is released under the Open RAIL M license, which is derived from the collaborative work of BigScience and the RAIL Initiative in responsible AI licensing. For further details, refer to the article about the BLOOM Open RAIL license, which served as the foundation for this license.

Model Description:

Stable Diffusion is a powerful tool for generating and modifying images from text prompts. It employs a Latent Diffusion Model architecture, utilizing a fixed, pre-trained text encoder (CLIP ViT-L/14), as inspired by the Imagen paper. This combination allows for nuanced and controlled image generation based on textual descriptions.
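
As a small sketch of the text-conditioning path described above, the following loads the same CLIP ViT-L/14 text encoder via the transformers library; Stable Diffusion feeds the non-pooled token embeddings (77 x 768 per prompt) into the UNet through cross-attention.

```python
from transformers import CLIPTokenizer, CLIPTextModel

# The fixed, pre-trained text encoder used by Stable Diffusion v1.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    ["a photograph of an astronaut riding a horse"],
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
)
# Non-pooled output: one 768-dim embedding per token position.
embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 77, 768])
```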

Intended Uses

Stable Diffusion is primarily intended for research purposes to facilitate exploration and advancement in the field of generative models. Specific areas of research and application include:

  • Safe Deployment Research: Investigating methods for the safe and responsible deployment of models capable of generating potentially harmful content.
  • Understanding Model Limitations and Biases: Probing and analyzing the inherent limitations and biases present in generative models to improve their fairness and reliability.
  • Artistic and Design Applications: Generating artworks, aiding in design processes, and exploring other creative applications within artistic workflows.
  • Educational and Creative Tools: Integration into educational platforms and creative tools to enhance learning and creative expression.
  • Generative Model Research: Broadening the scope of research on generative models and their capabilities.

Responsible Use: Misuse, Limitations, and Biases

It is crucial to acknowledge the potential for misuse and limitations associated with Stable Diffusion. Users are urged to employ this model responsibly and ethically.

Misuse and Malicious Use Considerations

Note: This section is adapted from the DALL-E Mini model card and is equally applicable to Stable Diffusion v1.

Stable Diffusion should not be used to generate or distribute images that could foster hostile or alienating environments. This encompasses content that is reasonably perceived as disturbing, distressing, or offensive, or that perpetuates harmful stereotypes, whether historical or contemporary.

Out-of-Scope Use Cases

The model's training data and objectives were not designed for generating factually accurate or truthful depictions of individuals or events. Therefore, utilizing Stable Diffusion to create such content falls outside its intended scope and capabilities.

Prohibited Misuse Scenarios

Using Stable Diffusion to generate content that is deliberately harmful or cruel to individuals constitutes a misuse of this technology. This includes, but is not limited to:

  • Generating demeaning, dehumanizing, or otherwise harmful portrayals of individuals, their environments, cultures, or religions.
  • Intentionally promoting or spreading discriminatory content or harmful stereotypes.
  • Impersonating individuals without explicit consent.
  • Creating sexual content without the consent of those who may view it.
  • Generating and disseminating misinformation or disinformation.
  • Depicting egregious violence and gore.
  • Sharing copyrighted or licensed material in violation of its terms of use.
  • Distributing content that is an unauthorized alteration of copyrighted or licensed material.

Limitations and Biases

Model Limitations

Stable Diffusion exhibits certain limitations that users should be aware of:

  • Photorealism Imperfections: The model may not always achieve perfect photorealistic results.
  • Text Rendering Challenges: Generating legible text within images is not reliably achieved.
  • Compositional Complexity: Performance may degrade on tasks requiring complex compositional understanding, such as rendering specific spatial arrangements of objects (e.g., "A red cube on top of a blue sphere").
  • Facial and Human Representation Issues: The generation of faces and human figures may sometimes be flawed or inaccurate.
  • Language Dependence: Primarily trained on English captions, the model's performance may be reduced for prompts in other languages.
  • Lossy Autoencoding: The model's autoencoding stage is lossy, so images reconstructed from latents lose some fine detail.
  • Training Data Concerns: Trained on the large-scale LAION-5B dataset (https://laion.ai/blog/laion-5b/), which includes adult content, making it unsuitable for product use without additional safety measures.
  • Dataset Deduplication: Lack of dataset deduplication may lead to memorization effects for frequently duplicated images in the training data. The training data can be explored at https://rom1504.github.io/clip-retrieval/ to aid in detecting potential memorized images.

Model Bias

While image generation models like Stable Diffusion demonstrate remarkable capabilities, they can also inadvertently reinforce or amplify societal biases. Stable Diffusion v1 was trained on subsets of LAION-2B(en), primarily composed of images with English descriptions. This linguistic focus can lead to underrepresentation of communities and cultures that primarily use other languages. Consequently, the model's output may be skewed towards white and western cultural norms, which are often implicitly set as defaults. Furthermore, the model's effectiveness in processing non-English prompts is significantly lower compared to its performance with English prompts.

Training Details

Training Dataset:

The model was trained using the following datasets:

  • LAION-2B (en) and its subsets (refer to subsequent section for specifics)

Training Procedure:

Stable Diffusion v1 is a latent diffusion model that combines an autoencoder with a diffusion model trained in the autoencoder's latent space. The training process involves the following steps, sketched in code after the list:

  • Image Encoding: Images are transformed into latent representations using an encoder. The autoencoder employs a downsampling factor of 8, mapping images of shape H x W x 3 to latents of shape H/f x W/f x 4.
  • Text Prompt Encoding: Textual prompts are encoded using a ViT-L/14 text encoder.
  • UNet Integration: The non-pooled output from the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention mechanisms.
  • Loss Function: The training objective is a reconstruction loss, measured between the noise added to the latent representation and the prediction generated by the UNet.
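
The sketch below assembles one such training step from diffusers components. It is illustrative only (random tensors stand in for a real image batch and for the CLIP text embeddings), not the original CompVis training code:

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

# Pre-trained components from the public v1-4 release (illustrative choice).
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
scheduler = DDPMScheduler(num_train_timesteps=1000)

images = torch.randn(1, 3, 512, 512)       # stand-in for a training batch
text_embeddings = torch.randn(1, 77, 768)  # stand-in for non-pooled CLIP ViT-L/14 output

# 1. Image encoding: 512x512x3 -> 64x64x4 latents (downsampling factor f = 8).
latents = vae.encode(images).latent_dist.sample() * 0.18215

# 2. Add noise to the latents at a random timestep.
noise = torch.randn_like(latents)
t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = scheduler.add_noise(latents, noise, t)

# 3. The UNet predicts the noise, conditioned on the text via cross-attention.
noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample

# 4. Reconstruction objective: MSE between the added noise and the prediction.
loss = F.mse_loss(noise_pred, noise)
```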

Model Checkpoints:

Three distinct checkpoints are provided: sd-v1-1.ckpt, sd-v1-2.ckpt, and sd-v1-3.ckpt. Their training progression is as follows:

  • sd-v1-1.ckpt: Trained for 237k steps at 256x256 resolution on laion2B-en, followed by 194k steps at 512x512 resolution on laion-high-resolution (comprising 170M examples from LAION-5B with resolution >= 1024x1024).
  • sd-v1-2.ckpt: Resumed training from sd-v1-1.ckpt for 515k steps at 512x512 resolution on “laion-improved-aesthetics”. This dataset is a subset of laion2B-en, filtered for images with original size >= 512x512, estimated aesthetics score > 5.0, and watermark probability < 0.5. Aesthetics scores are estimated using an improved aesthetics estimator, and watermark estimates are from LAION-5B metadata.
  • sd-v1-3.ckpt: Resumed training from sd-v1-2.ckpt for 195k steps at 512x512 resolution on “laion-improved-aesthetics”, incorporating a 10% dropout of text-conditioning to enhance classifier-free guidance sampling.

Training Hardware and Parameters:

  • Hardware: 32 x 8 x A100 GPUs
  • Optimizer: AdamW
  • Gradient Accumulation: 2
  • Batch Size: 2048 (32 x 8 x 2 x 4)
  • Learning Rate: Warm-up to 0.0001 over 10,000 steps, then held constant (see the optimizer sketch below).
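
A minimal sketch of that optimizer setup in PyTorch, assuming a linear warmup; `model` here is a placeholder for the trainable diffusion model:

```python
import torch

model = torch.nn.Linear(4, 4)  # placeholder for the actual UNet / diffusion model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Linear warmup to the target learning rate over 10,000 steps, then constant.
warmup_steps = 10_000
lr_schedule = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

# Effective batch size matches the listed factorization:
# 32 nodes x 8 GPUs x 2 gradient-accumulation steps x 4 samples per GPU = 2048.
```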

Evaluation Results

Evaluations with classifier-free guidance scales of 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, and 8.0 and 50 PLMS sampling steps demonstrate the relative improvements across the checkpoints.

These evaluations used 10,000 random prompts from the COCO2017 validation set, assessed at 512x512 resolution, and were not specifically optimized for FID scores.
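
For reference, the guidance scale swept in these evaluations blends a text-conditioned and an unconditional noise prediction at each sampling step; a minimal sketch of that classifier-free guidance update:

```python
import torch

def classifier_free_guidance(noise_uncond: torch.Tensor,
                             noise_text: torch.Tensor,
                             guidance_scale: float) -> torch.Tensor:
    """Blend unconditional and text-conditioned noise predictions.

    guidance_scale = 1.0 disables guidance; the evaluation above swept 1.5-8.0.
    """
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)
```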

Environmental Impact

Stable Diffusion v1 - Estimated Emissions

Based on the training infrastructure and duration, we estimate the following CO2 emissions using the Machine Learning Impact calculator as detailed in Lacoste et al. (2019). This estimation considers hardware specifications, runtime, cloud provider, and compute region to assess the carbon impact.

  • Hardware Type: A100 PCIe 40GB
  • Hours Used: 150,000
  • Cloud Provider: AWS
  • Compute Region: US-east
  • Estimated Carbon Emissions: 11,250 kg CO2 eq., computed as power consumption x time x carbon intensity of the local power grid (a rough arithmetic check follows below).
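
A back-of-the-envelope check of that figure, assuming a 250 W board power for the A100 PCIe 40GB and a grid carbon intensity of roughly 0.3 kg CO2eq/kWh for the compute region (both are assumptions, not values from the model card):

```python
gpu_power_kw = 0.25      # assumed A100 PCIe 40GB board power (250 W TDP)
gpu_hours = 150_000      # hours used, as reported above
carbon_intensity = 0.3   # assumed kg CO2eq per kWh for the grid

emissions_kg = gpu_power_kw * gpu_hours * carbon_intensity
print(f"{emissions_kg:,.0f} kg CO2 eq.")  # 11,250 kg CO2 eq.
```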

Model card authored by: Robin Rombach and Patrick Esser, inspired by the DALL-E Mini model card.