wan-video/wan2.1
Generate 5s 480p videos using Wan 2.1 14B, a comprehensive video foundation model that pushes the boundaries of video generation.
Pricing
Pricing for Synexa AI models works differently from that of other providers. Instead of billing by execution time, Synexa bills per request based on input and output, which makes pricing more predictable.
For example, at $0.20 per video, generating 100 videos costs around 100 × $0.20 = $20.00.
Check out our docs for more information about how per-request pricing works on Synexa.
| Provider | Price ($) | Saving (%) |
|---|---|---|
| Synexa | $0.2000 | - |
| replicate | $0.6000 | 66.7% |
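As a quick sanity check on the numbers above, here is the arithmetic behind the example and the saving column. The per-video prices come from the table; everything else is plain arithmetic, not an official Synexa calculator.

```python
# Per-video prices from the table above (USD).
SYNEXA_PRICE = 0.20
REPLICATE_PRICE = 0.60

n_videos = 100
print(f"Synexa cost for {n_videos} videos: ${n_videos * SYNEXA_PRICE:.2f}")  # $20.00

# Saving from using Synexa instead of replicate for the same video.
saving = 1 - SYNEXA_PRICE / REPLICATE_PRICE
print(f"Saving vs. replicate: {saving:.1%}")  # 66.7%
```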
Readme

Introduction to Wan2.1
Wan2.1 is built on the mainstream diffusion-transformer paradigm and achieves significant advancements in generative capability through a series of innovations. These include our novel spatio-temporal variational autoencoder (VAE), scalable training strategies, large-scale data construction, and automated evaluation metrics. Collectively, these contributions enhance the model's performance and versatility.
3D Variational Autoencoders
We propose a novel 3D causal VAE architecture, termed Wan-VAE, designed specifically for video generation. By combining multiple strategies, we improve spatio-temporal compression, reduce memory usage, and ensure temporal causality. Wan-VAE demonstrates significant advantages in performance and efficiency compared to other open-source VAEs. Furthermore, Wan-VAE can encode and decode 1080P videos of unlimited length without losing historical temporal information, making it particularly well-suited for video generation tasks.
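To make the temporal-causality idea concrete, here is a minimal PyTorch sketch of a causal 3D convolution: the temporal axis is padded only on the "past" side, so the output at frame t never sees future frames. The class name and layer sizes are illustrative, not Wan-VAE's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the time axis:
    output at frame t depends only on frames <= t."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        k = kernel_size
        # F.pad order for 5D input: (W_left, W_right, H_top, H_bottom, T_front, T_back).
        # Pad all k-1 temporal positions on the front (past), none on the back (future);
        # pad symmetrically in space so H and W are preserved.
        self.pad = (k // 2, k // 2, k // 2, k // 2, k - 1, 0)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=k)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.conv(F.pad(x, self.pad))

x = torch.randn(1, 3, 9, 64, 64)   # batch, channels, frames, height, width
y = CausalConv3d(3, 16)(x)
print(y.shape)  # torch.Size([1, 16, 9, 64, 64]) -- frame count unchanged
```

Because each output frame only depends on past inputs, frames can be encoded and decoded chunk by chunk, which is what allows arbitrarily long videos to be processed without breaking temporal consistency.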

Video Diffusion DiT
Wan2.1 is designed using the Flow Matching framework within the paradigm of mainstream Diffusion Transformers. The architecture uses a T5 encoder to encode multilingual text input, and cross-attention in each transformer block injects the text into the model. Additionally, we employ an MLP with a Linear layer and a SiLU layer to process the input time embeddings and predict six modulation parameters. This MLP is shared across all transformer blocks, with each block learning a distinct set of biases (see the sketch after the table below). Our experiments show a significant performance improvement with this approach at the same parameter scale.

| Model | Dimension | Input Dimension | Output Dimension | Feedforward Dimension | Frequency Dimension | Number of Heads | Number of Layers |
|---|---|---|---|---|---|---|---|
| 1.3B | 1536 | 16 | 16 | 8960 | 256 | 12 | 30 |
| 14B | 5120 | 16 | 16 | 13824 | 256 | 40 | 40 |
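The following PyTorch sketch illustrates the shared time-embedding MLP with per-block biases described above, assuming adaLN-style modulation. The width follows the 1.3B row of the table; the layer ordering, class names, and the exact use of the six parameters are illustrative assumptions, not Wan2.1's actual code.

```python
import torch
import torch.nn as nn

DIM = 1536  # hidden size of the 1.3B model (see table above)

# One small MLP shared by every transformer block; it maps the time
# embedding to six per-channel modulation parameters (e.g. scale,
# shift, and gate for the attention and feed-forward paths).
shared_mlp = nn.Sequential(
    nn.SiLU(),
    nn.Linear(DIM, 6 * DIM),
)

class BlockModulation(nn.Module):
    """Each block reuses the shared MLP but learns its own bias,
    so blocks can specialize without duplicating the MLP weights."""
    def __init__(self, shared):
        super().__init__()
        self.shared = shared
        self.bias = nn.Parameter(torch.zeros(6 * DIM))

    def forward(self, t_emb):  # t_emb: (B, DIM)
        # Returns six (B, DIM) modulation tensors.
        return (self.shared(t_emb) + self.bias).chunk(6, dim=-1)

t_emb = torch.randn(2, DIM)
mods = BlockModulation(shared_mlp)(t_emb)
print(len(mods), mods[0].shape)  # 6 torch.Size([2, 1536])
```

Sharing the MLP and learning only a bias vector per block is what keeps the parameter count low while still letting each block apply its own modulation.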
Data
We curated and deduplicated a candidate dataset comprising a vast amount of image and video data. During data curation, we designed a four-step cleaning process focused on fundamental dimensions, visual quality, and motion quality. This robust data-processing pipeline lets us easily obtain high-quality, diverse, and large-scale training sets of images and videos.

Comparisons to SOTA
We compared Wan2.1 with leading open-source and closed-source models to evaluate its performance. Using a carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. We then computed a total score as a weighted combination of the per-dimension scores, with weights derived from human preferences collected during the matching process. The detailed results are shown in the table below and demonstrate our model's superior performance compared to both open-source and closed-source models.
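The aggregation is a preference-weighted mean over dimension scores. A minimal sketch follows; the dimension names, scores, and weights here are invented for illustration (the real evaluation spans 14 major dimensions and 26 sub-dimensions).

```python
# Hypothetical per-dimension scores and human-preference weights.
scores = {"visual_quality": 0.82, "motion_quality": 0.78, "text_alignment": 0.85}
weights = {"visual_quality": 0.5, "motion_quality": 0.3, "text_alignment": 0.2}

# Weighted mean: sum of weight * score, normalized by total weight.
total = sum(weights[d] * scores[d] for d in scores) / sum(weights.values())
print(f"total score: {total:.3f}")  # 0.814
```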

Citation
If you find our work helpful, please cite us.
@article{wan2.1,
  title   = {Wan: Open and Advanced Large-Scale Video Generative Models},
  author  = {Wan Team},
  journal = {},
  year    = {2025}
}
License Agreement
The models in this repository are licensed under the Apache 2.0 License. We claim no rights over your generated contents, granting you the freedom to use them while ensuring that your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, harms individuals or groups, disseminates personal information with intent to cause harm, spreads misinformation, or targets vulnerable populations. For a complete list of restrictions and details regarding your rights, please refer to the full text of the license.
Acknowledgements
We would like to thank the contributors to the SD3, Qwen, umt5-xxl, diffusers, and HuggingFace repositories for their open research.
Contact Us
If you would like to leave a message for our research or product teams, feel free to join our Discord or WeChat groups!