stepfun-ai/step-video-t2v

SoTA text-to-video model with 30 billion parameters and the ability to generate videos of up to 204 frames.

$0.4 / request
GPU: H100

Pricing

Pricing for Synexa AI models works differently from that of other providers. Instead of being billed by time, you are billed by input and output, which makes pricing more predictable.

Output
$0.4000 / video (204 frames)
or
2 videos (204 frames) / $1

For example, generating 100 videos (204 frames each) should cost around $40.00.
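The cost is linear in the number of requests. As a minimal sketch of that arithmetic (the $0.40 rate comes from the figures above; the helper function is purely illustrative):

```python
# Per-request pricing sketch: cost scales linearly with the number of videos.
PRICE_PER_VIDEO_USD = 0.40  # one 204-frame video, per the rate above

def estimated_cost(num_videos: int) -> float:
    """Return the estimated charge in USD for generating `num_videos` videos."""
    return num_videos * PRICE_PER_VIDEO_USD

print(estimated_cost(100))  # 40.0 -> matches the $40.00 example above
```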

Check out our docs for more information about how per-request pricing works on Synexa.

| Provider | Price ($) | Saving (%) |
| --- | --- | --- |
| Synexa | $0.4000 | - |
| Replicate | $0.6900 | 42.0% |

Readme

Step-Video-T2V Video Demonstrations

Explore the video generation capabilities of Step-Video-T2V through the following demonstrations.


1. Introduction to Step-Video-T2V: A 30B Parameter Text-to-Video Model

Step-Video-T2V is introduced as a cutting-edge text-to-video (T2V) pre-trained model, distinguished by its substantial 30 billion parameters and the ability to generate videos of up to 204 frames. Addressing the challenges of training and inference efficiency in video generation, Step-Video-T2V incorporates a deep compression Variational Autoencoder (VAE) for videos. This VAE achieves significant compression ratios of 16x16 spatially and 8x temporally. To further refine the quality of generated videos, Direct Preference Optimization (DPO) is employed in the final stage. The performance of Step-Video-T2V has been rigorously evaluated using Step-Video-T2V-Eval, a newly developed video generation benchmark. Results from this benchmark demonstrate Step-Video-T2V's superior text-to-video generation quality when compared to both open-source and commercially available engines, establishing it as a state-of-the-art (SoTA) model in its class.

2. Step-Video-T2V Model Architecture and Key Components

Step-Video-T2V leverages a high-compression Video-VAE for efficient video representation, achieving 16x16 spatial and 8x temporal compression. The model is designed to process user prompts in both English and Chinese, utilizing two bilingual pre-trained text encoders. At its core, a Diffusion Transformer (DiT) architecture with 3D full attention is trained using Flow Matching. This DiT denoises input noise into coherent latent frames, conditioned by text embeddings and timesteps. To ensure enhanced visual fidelity in the output videos, a video-based DPO approach is integrated. This method effectively minimizes artifacts, resulting in video outputs that are smoother and more visually realistic.
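A rough sketch of that flow is shown below, using generic placeholder components. None of the class, function, or argument names come from the actual Step-Video-T2V codebase, and the latent shape is illustrative; only the overall pipeline (bilingual text encoding, Flow-Matching denoising with a DiT, Video-VAE decoding) follows the description above.

```python
import torch

def generate_video(prompt, text_encoders, dit, vae, num_steps=50,
                   latent_shape=(1, 16, 26, 34, 62)):
    """Denoise Gaussian noise into latent frames, then decode with the Video-VAE.

    `text_encoders`, `dit`, and `vae` are placeholder callables standing in for the
    two bilingual text encoders, the DiT with 3D full attention, and the Video-VAE.
    """
    # Encode the (English or Chinese) prompt with both bilingual text encoders.
    text_emb = torch.cat([enc(prompt) for enc in text_encoders], dim=-1)

    # Start from pure noise in the compressed latent space (the shape here is
    # illustrative; the real shape follows from the 16x16x8 VAE compression).
    latents = torch.randn(latent_shape)

    # Flow-Matching-style Euler integration: the DiT predicts a velocity field,
    # conditioned on the text embeddings and the current timestep, that moves
    # the noisy latents toward the video distribution.
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        velocity = dit(latents, t=t, text_emb=text_emb)
        latents = latents + (t_next - t) * velocity

    # Decode the compact latents back into pixel-space video frames.
    return vae.decode(latents)
```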

2.1. Deep Compression Video-VAE

The foundation of Step-Video-T2V's efficiency is a deep compression Variational Autoencoder (Video-VAE). This component is specifically engineered for video generation, providing 16x16 spatial and 8x temporal compression ratios without sacrificing video reconstruction quality. This high level of compression substantially accelerates both the training and inference processes, aligning with the diffusion model's proficiency in handling compact representations.
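A quick back-of-the-envelope calculation shows how much this shrinks the representation the diffusion model must process. Only the 16x16x8 ratios come from the text above; the example clip size and the integer rounding of each axis are illustrative assumptions.

```python
# Illustration of what 16x16 spatial / 8x temporal compression buys.
frames, height, width = 204, 544, 992           # example clip size (assumption)
latent_t = frames // 8                          # 8x temporal compression  -> 25
latent_h, latent_w = height // 16, width // 16  # 16x16 spatial compression -> 34 x 62

pixel_positions = frames * height * width
latent_positions = latent_t * latent_h * latent_w
print(f"{pixel_positions:,} pixel positions -> {latent_positions:,} latent positions "
      f"(~{pixel_positions / latent_positions:.0f}x fewer positions to diffuse over)")
```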

2.2. DiT with 3D Full Attention Mechanism

Step-Video-T2V is built upon the Diffusion Transformer (DiT) architecture. The DiT comprises 48 layers, each with 48 attention heads of dimension 128. AdaLN-Single is used to incorporate timestep conditioning, and QK-Norm is applied in the self-attention mechanism to ensure stable training dynamics. In addition, 3D RoPE (Rotary Positional Embedding) is integrated to handle video sequences of varying lengths and resolutions.
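For reference, these hyperparameters can be collected into a small configuration sketch. The dataclass and its field names are illustrative rather than taken from the official implementation; only the numbers (48 layers, 48 heads, head dimension 128) and the listed techniques come from the description above.

```python
from dataclasses import dataclass

@dataclass
class StepVideoDiTConfig:
    # Values stated in the text above; field names are illustrative.
    num_layers: int = 48
    num_heads: int = 48
    head_dim: int = 128                    # hidden size = 48 * 128 = 6144
    timestep_conditioning: str = "AdaLN-Single"
    qk_norm: bool = True                   # stabilizes self-attention training
    positional_embedding: str = "3D RoPE"  # handles varying lengths/resolutions

config = StepVideoDiTConfig()
print(config.num_heads * config.head_dim)  # 6144-dimensional hidden states per layer
```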

2.3. Video-based Direct Preference Optimization (DPO)

To further refine the visual quality of videos generated by Step-Video-T2V, Direct Preference Optimization (DPO) is incorporated, leveraging human feedback. DPO fine-tunes the model based on human preference data, thereby aligning the generated video content more closely with human aesthetic expectations. The DPO pipeline is integral to enhancing both the consistency and overall quality of the video generation process.
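As a minimal sketch, the standard DPO objective on (preferred, rejected) sample pairs is shown below; Step-Video-T2V's exact video-based formulation is not reproduced here, and the function and argument names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO objective on (preferred, rejected) sample pairs.

    Each argument is a tensor of log-likelihoods of the human-preferred ("win") or
    dispreferred ("lose") sample under the policy or the frozen reference model.
    """
    # Reward margin implied by the policy, measured relative to the reference model.
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    # Maximize the probability that the preferred sample is ranked higher.
    return -F.logsigmoid(beta * margin).mean()
```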

3. Download Step-Video-T2V Models

The pre-trained Step-Video-T2V models are available for download via the following links:

| Models | 🤗 Huggingface | 🤖 Modelscope |
| --- | --- | --- |
| Step-Video-T2V | download | download |
| Step-Video-T2V-Turbo (Inference Step Distillation) | download | download |
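If you prefer to download programmatically from Hugging Face, a minimal sketch using huggingface_hub is shown below. The repo id is an assumption based on the model name; confirm it against the download links above.

```python
from huggingface_hub import snapshot_download

# Download the full model weights locally. The repo_id below is an assumption
# based on the model name; verify it against the table of download links above.
snapshot_download(
    repo_id="stepfun-ai/stepvideo-t2v",
    local_dir="./stepvideo-t2v",
)
```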

3.1. Single-GPU Inference and Quantization

For users seeking to optimize resource utilization, the open-source project DiffSynth-Studio by ModelScope provides single-GPU inference and quantization support. This significantly reduces the Video RAM (VRAM) requirements for running Step-Video-T2V. For detailed implementation and usage examples, please refer to their examples.

4. Optimal Inference Settings for Step-Video-T2V

Step-Video-T2V is designed for robust inference performance, consistently producing high-fidelity and dynamic videos. Experiments have shown that adjusting inference hyperparameters can significantly impact the trade-off between video fidelity and dynamics. To achieve the best possible results, the following best practices for inference parameter tuning are recommended:

| Models | infer_steps | cfg_scale | time_shift | num_frames |
| --- | --- | --- | --- | --- |
| Step-Video-T2V | 30-50 | 9.0 | 13.0 | 204 |
| Step-Video-T2V-Turbo (Inference Step Distillation) | 10-15 | 5.0 | 17.0 | 204 |
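These recommendations can be kept as plain dictionaries and passed to whichever inference entry point you use (official scripts, DiffSynth-Studio, etc.). The key names below match the table's columns; the helper function itself is purely illustrative and not part of any official API.

```python
# Recommended hyperparameters from the table above, as keyword-argument dicts.
STEP_VIDEO_T2V = {"infer_steps": 50, "cfg_scale": 9.0, "time_shift": 13.0, "num_frames": 204}
STEP_VIDEO_T2V_TURBO = {"infer_steps": 15, "cfg_scale": 5.0, "time_shift": 17.0, "num_frames": 204}

def pick_settings(turbo: bool = False) -> dict:
    """Return the recommended settings for the base or Turbo (step-distilled) model."""
    return STEP_VIDEO_T2V_TURBO if turbo else STEP_VIDEO_T2V

print(pick_settings(turbo=True))
```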

5. Step-Video-T2V-Eval Benchmark for Video Generation Quality Assessment

Introducing Step-Video-T2V-Eval, a novel benchmark for evaluating text-to-video generation quality. This benchmark comprises 128 Chinese prompts sourced directly from real users. Step-Video-T2V-Eval assesses video generation quality across 11 diverse categories: Sports, Food, Scenery, Animals, Festivals, Combination Concepts, Surreal, People, 3D Animation, Cinematography, and Style. Access the benchmark details at Step-Video-T2V-Eval.

6. Access Step-Video-T2V Online Engine

Experience Step-Video-T2V directly through the online engine available at 跃问视频. Explore a range of impressive video generation examples and interact with the model.

7. Citation Information for Step-Video-T2V

If you utilize Step-Video-T2V in your research or applications, please cite the following paper:

@misc{ma2025stepvideot2vtechnicalreportpractice,
  title={Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model},
  author={Guoqing Ma and Haoyang Huang and others},
  year={2025},
  eprint={2502.10248},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.10248},
}

8. Acknowledgements

We extend our sincere gratitude to the following teams and projects for their invaluable contributions:

  • The xDiT team for their exceptional support and parallelization strategy.
  • Huggingface/Diffusers for integrating our code into their official repository.
  • The FastVideo team for their ongoing collaboration; we look forward to launching inference acceleration solutions together in the near future.