stability-ai/sdxl
A text-to-image generative AI model that creates beautiful images
Pricing
Pricing for Synexa AI models works differently from other providers. Instead of being billed by time, you are billed by input and output, making pricing more predictable.
For example, generating 100 images should cost around $0.20.
Check out our docs for more information about how per-request pricing works on Synexa.
| Provider | Price ($) | Saving (%) |
|---|---|---|
| Synexa | $0.0020 | - |
| Replicate | $0.0040 | 50.0% |
Readme
SDXL is a cutting-edge text-to-image generative AI model developed by Stability AI, designed to create high-quality and visually appealing images from textual descriptions. As the successor to the widely recognized Stable Diffusion, SDXL builds upon its foundation to offer enhanced capabilities and image generation fidelity.
Key Features of SDXL
This section outlines the core functionalities of SDXL, showcasing its versatility in image generation and manipulation.
Text-to-Image Generation
SDXL's primary function is generating images directly from text prompts. Provide a descriptive `prompt` to guide the model, then run it to produce the corresponding image.
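As a rough, hedged sketch of what such a call can look like, the snippet below uses the Replicate Python client (also referenced in the fine-tuning section); the model identifier, parameters, and output shape are assumptions based on typical SDXL deployments, not values taken from this page.

```python
# Minimal text-to-image sketch using the Replicate Python client
# (`pip install replicate`). The model reference may also require a
# version hash, e.g. "stability-ai/sdxl:<version-hash>".
import replicate

output = replicate.run(
    "stability-ai/sdxl",
    input={
        "prompt": "an astronaut riding a horse on the moon, cinematic lighting",
        "width": 1024,
        "height": 1024,
    },
)
print(output)  # typically a list of URLs to the generated image(s)
```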
Image In-painting
SDXL incorporates powerful in-painting capabilities, enabling users to seamlessly modify existing images by filling in masked areas with generated content. This feature offers a creative way to refine and extend images.
In-painting Workflow:
- Prompt Input: Provide a `prompt` that describes the content you wish to generate within the masked area.
- Image Selection: Upload an input image using the `image` field as the base for in-painting.
- Mask Application: In the `mask` field, upload a black-and-white mask image that matches the dimensions of your input image. White pixels in the mask indicate areas to be in-painted based on the prompt, while black pixels represent regions to be preserved (see the sketch below the list).
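The following hedged sketch shows how the three in-painting inputs fit together, again via the Replicate Python client; the file names and prompt are illustrative only.

```python
# In-painting sketch: white areas of the mask are regenerated from the
# prompt, black areas of the input image are preserved.
import replicate

output = replicate.run(
    "stability-ai/sdxl",
    input={
        "prompt": "a stone fireplace with a roaring fire",
        "image": open("living_room.png", "rb"),
        "mask": open("fireplace_mask.png", "rb"),
    },
)
print(output)
```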
Image-to-Image Generation
Transform existing images based on textual prompts with SDXL's image-to-image generation. This functionality allows you to guide the transformation of an input image towards a described visual style or content. For example, you can transform a simple drawing of a castle into a photorealistic depiction.
Image-to-Image Process:
- Descriptive Prompt: Enter a `prompt` detailing the desired appearance of the output image.
- Input Image: Upload the image to be transformed using the `image` field.
- Prompt Strength Adjustment: Use the `prompt_strength` field to control how strongly the prompt influences the transformation of the input image; higher values give the prompt more influence (see the sketch below the list).
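As a hedged sketch of the castle example above, the snippet below passes an input image together with a `prompt_strength` value; the file name and the value 0.8 are purely illustrative.

```python
# Image-to-image sketch: transform a simple drawing toward a photorealistic
# result. Higher prompt_strength lets the prompt override more of the input.
import replicate

output = replicate.run(
    "stability-ai/sdxl",
    input={
        "prompt": "a photorealistic medieval castle on a cliff at sunset",
        "image": open("castle_sketch.png", "rb"),
        "prompt_strength": 0.8,
    },
)
print(output)
```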
Refinement Capabilities
SDXL offers advanced refinement options to enhance the detail and quality of generated images using a separate refiner model. This can be implemented through two distinct approaches:
Ensemble of Experts Refinement
- In this mode, the SDXL base model handles the initial generation steps at higher noise levels, then hands off to the refiner model for the final steps at lower noise levels.
- This expert-ensemble approach produces more detailed images in fewer processing steps.
- The handover point, which determines when the refiner takes over, is adjustable; the default is 0.8 (80%). A sketch showing how both refinement modes might be selected follows the next subsection.
Sequential Base and Refiner Model
- In this configuration, the final output from the SDXL base model serves as the input for the refiner model.
- Users have the flexibility to define the number of steps the refiner model executes to further enhance the image.
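Both refinement modes are typically selected through input parameters. The hedged sketch below assumes parameter names (`refine`, `high_noise_frac`, `refine_steps`) commonly used by hosted SDXL deployments; check the model's input schema for the exact names and defaults.

```python
import replicate

prompt = "a detailed oil painting of a lighthouse in a storm"

# Ensemble-of-experts: the base model covers the high-noise steps and the
# refiner takes over at the handover fraction (0.8 by default).
ensemble = replicate.run(
    "stability-ai/sdxl",
    input={"prompt": prompt, "refine": "expert_ensemble_refiner", "high_noise_frac": 0.8},
)

# Sequential: the refiner runs a fixed number of extra steps on the
# base model's finished output.
sequential = replicate.run(
    "stability-ai/sdxl",
    input={"prompt": prompt, "refine": "base_image_refiner", "refine_steps": 20},
)
```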
Fine-tuning SDXL Models
SDXL models can be further customized through fine-tuning using the Replicate fine-tuning API. Replicate supports various programming languages, and the following example demonstrates fine-tuning using the Python SDK:
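The snippet below is a minimal sketch of such a fine-tuning job via the Replicate training API; the version hash, training-images URL, and destination model are placeholders rather than values from this page.

```python
import replicate

# Create a fine-tuning (training) job. Replace the placeholder version hash,
# zip URL, and destination with real values before running.
training = replicate.trainings.create(
    version="stability-ai/sdxl:<version-hash>",
    input={
        "input_images": "https://example.com/my-training-images.zip",
    },
    destination="your-username/sdxl-finetuned",
)
print(training.status)
```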
Pre-processing of Input Images:
Prior to fine-tuning initiation, input images undergo a preprocessing pipeline incorporating:
- SwinIR Upscaling: Enhances image resolution.
- BLIP Captioning: Generates descriptive captions for images.
- CLIPSeg Masking: Identifies and removes non-essential or less informative image regions to focus training on salient parts.
Model Description and Background
- Developed by: Stability AI
- Model type: Diffusion-based text-to-image generative model
- License: CreativeML Open RAIL++-M License
- Description: SDXL is a Latent Diffusion Model designed for generating and modifying images based on text prompts. It leverages two pre-trained text encoders: OpenCLIP-ViT/G and CLIP-ViT/L. The underlying architecture is described in the Latent Diffusion Model paper.
- Further Resources: For more in-depth information, refer to the Stability AI GitHub Repository and the SDXL report on arXiv.
Performance Evaluation

User Preference Study:
The chart above presents a comparative evaluation of user preference between SDXL (with and without refinement), SDXL 0.9, and Stable Diffusion versions 1.5 and 2.1. The results indicate that the SDXL base model significantly outperforms previous iterations. Furthermore, incorporating the refinement module with the base model achieves the highest level of overall user preference.
Intended Uses
Direct Research Use
SDXL is primarily intended for research exploration and development. Potential research areas and applications include:
- Creation of digital art and integration into design and artistic workflows.
- Applications in educational platforms and creative tools.
- Advancing research in the field of generative models.
- Investigating safe deployment strategies for models capable of generating potentially harmful content.
- Examining and understanding the limitations and inherent biases of generative models.
Out-of-Scope Usage
It is important to note that SDXL is not designed to generate factually accurate or truthful representations of real-world people or events. Using the model for such purposes falls outside its intended capabilities.
Limitations and Biases
Limitations
SDXL exhibits certain limitations inherent to current generative AI models:
- Imperfect photorealism in generated images.
- Inability to reliably render legible text within images.
- Challenges with complex tasks requiring compositional understanding, such as generating images from prompts like “A red cube on top of a blue sphere”.
- Potential inconsistencies in generating realistic faces and human figures.
- The autoencoding process in the model is inherently lossy, which may affect fine detail.
Biases
As with many large AI models, SDXL may reflect or amplify societal biases present in its training data. Users should be aware that image generation models can inadvertently reinforce or exacerbate existing social biases.
Model Architecture

SDXL employs a mixture-of-experts pipeline based on latent diffusion principles. The process unfolds in two primary stages:
- Base Model Generation: The base SDXL model initially generates latent representations, which inherently contain noise.
- Refinement Model Processing: These noisy latents are then passed to a specialized refinement model (available at: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/), designed to perform final denoising steps, enhancing image detail and quality. The base model can also be utilized independently as a standalone module.
Alternative Two-Stage Pipeline:
Alternatively, a two-stage pipeline can be employed:
- Base Model Latent Generation: The base model is used to generate latents at the desired output image size.
- High-Resolution Refinement via SDEdit: These latents are then refined using a specialized high-resolution model and a technique called SDEdit (https://arxiv.org/abs/2108.01073, also known as "img2img"), applied with the same prompt to the latents from the first stage. This approach is slightly slower because it adds computation steps, but it offers an alternative refinement path.
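For readers who want to reproduce the base-plus-refiner pipeline locally rather than through the hosted API, the following is a hedged sketch using the Hugging Face diffusers library; it assumes diffusers is installed, a CUDA GPU is available, and the fp16 weights are used.

```python
import torch
from diffusers import DiffusionPipeline

# Stage 1: the base model generates latents for the prompt.
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

# Stage 2: the refiner reuses the base model's second text encoder and VAE
# and performs the final denoising of those latents.
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

prompt = "a majestic lion jumping from a big stone at night"

latents = base(prompt=prompt, output_type="latent").images
image = refiner(prompt=prompt, image=latents).images[0]
image.save("lion_refined.png")
```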