bytedance/seedance-2.0

ByteDance's advanced video generation model with native audio, keyframe control, and multimodal references for consistent character and style.

$1 / request
GPU: H100

Readme

Seedance 2.0 is ByteDance's next-generation video model, generating high-quality video with synchronized native audio in a single pass.

Generation modes:

  • Text-to-video — describe a scene in natural language
  • Keyframe mode — provide start_image (and optional end_image) to anchor the video
  • Multi-ref mode — combine up to 11 references in total (images plus videos), of which at most 3 may be videos, for character consistency and motion reference

Keyframe mode and multi-ref mode are mutually exclusive within a single request.
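
The mode rules above can be checked client-side before a request is sent. The following is a minimal sketch, not the model's actual validation logic: start_image and end_image are the names used in this README, while reference_images, reference_videos, and the pick_mode helper are hypothetical names chosen for illustration. It also assumes that end_image is only valid together with start_image, which is one reading of "optional end_image".

```python
from typing import Optional, Sequence

MAX_REFERENCES = 11        # images + videos combined, per the limits above
MAX_REFERENCE_VIDEOS = 3   # at most 3 of those references may be videos


def pick_mode(
    start_image: Optional[str] = None,
    end_image: Optional[str] = None,
    reference_images: Sequence[str] = (),  # hypothetical field name
    reference_videos: Sequence[str] = (),  # hypothetical field name
) -> str:
    """Return the generation mode implied by the inputs, or raise ValueError."""
    uses_keyframes = start_image is not None or end_image is not None
    uses_refs = bool(reference_images) or bool(reference_videos)

    if uses_keyframes and uses_refs:
        raise ValueError("keyframe mode and multi-ref mode are mutually exclusive")
    if end_image is not None and start_image is None:
        # Assumption: end_image only makes sense alongside start_image.
        raise ValueError("end_image requires start_image")
    if len(reference_videos) > MAX_REFERENCE_VIDEOS:
        raise ValueError(f"at most {MAX_REFERENCE_VIDEOS} reference videos allowed")
    if len(reference_images) + len(reference_videos) > MAX_REFERENCES:
        raise ValueError(f"at most {MAX_REFERENCES} references allowed in total")

    if uses_keyframes:
        return "keyframe"
    if uses_refs:
        return "multi-ref"
    return "text-to-video"
```

Calling pick_mode() with no images or videos falls through to text-to-video, so the same helper covers all three modes.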

Capabilities:

  • Native audio generation — dialogue, sound effects, and music generated together with video
  • Better motion and physics for complex interactions like sports, dancing, and object collisions
  • Character consistency across multi-shot narratives via reference images
  • Video editing and extension via reference videos

Prompt references:

  • Reference images: @IMG_1 .. @IMG_11
  • Reference videos: @VID_1 .. @VID_3
  • For dialogue, put spoken words in double quotes (e.g. The man said: "Remember this moment.") to drive lip-sync
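
A prompt that uses these tokens can be assembled programmatically. This is an illustrative sketch only: the helper names image_ref, video_ref, and spoken are hypothetical, and it assumes the tokens are numbered from 1 in the order the references are supplied, which this README does not state explicitly.

```python
MAX_IMAGE_REFS = 11  # @IMG_1 .. @IMG_11
MAX_VIDEO_REFS = 3   # @VID_1 .. @VID_3


def image_ref(index: int) -> str:
    """Return the prompt token for a reference image (1-based index)."""
    if not 1 <= index <= MAX_IMAGE_REFS:
        raise ValueError(f"image index must be 1..{MAX_IMAGE_REFS}, got {index}")
    return f"@IMG_{index}"


def video_ref(index: int) -> str:
    """Return the prompt token for a reference video (1-based index)."""
    if not 1 <= index <= MAX_VIDEO_REFS:
        raise ValueError(f"video index must be 1..{MAX_VIDEO_REFS}, got {index}")
    return f"@VID_{index}"


def spoken(speaker: str, words: str) -> str:
    """Wrap dialogue in double quotes so it can drive lip-sync."""
    if '"' in words:
        raise ValueError("spoken words must not contain double quotes")
    return f'{speaker} said: "{words}"'


prompt = " ".join([
    f"{image_ref(1)} walks along a beach at dusk, "
    f"following the camera move in {video_ref(1)}.",
    spoken("The man", "Remember this moment."),
])
print(prompt)
```

Rejecting double quotes inside the spoken text keeps the quoted span unambiguous; whether the model tolerates nested quotes is not covered here.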

Supported durations: 5s, 10s, 15s
Supported resolutions: 480p, 720p (default), 1080p
Supported aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4, 21:9
Reference video constraints: each <=15s, longest side <=1280px (<=720p)
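
The supported values listed above lend themselves to a small pre-flight check. The sketch below is an assumption-laden illustration: the function names and the idea of passing the duration as an integer number of seconds are hypothetical, and the actual input schema may differ.

```python
from typing import List

SUPPORTED_DURATIONS = (5, 10, 15)  # seconds
SUPPORTED_RESOLUTIONS = ("480p", "720p", "1080p")
DEFAULT_RESOLUTION = "720p"
SUPPORTED_ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4", "21:9")

MAX_REF_VIDEO_SECONDS = 15
MAX_REF_VIDEO_LONGEST_SIDE = 1280  # pixels


def validate_output_settings(
    duration: int, aspect_ratio: str, resolution: str = DEFAULT_RESOLUTION
) -> List[str]:
    """Return a list of problems with the output settings (empty if valid)."""
    problems = []
    if duration not in SUPPORTED_DURATIONS:
        problems.append(f"unsupported duration: {duration}s")
    if resolution not in SUPPORTED_RESOLUTIONS:
        problems.append(f"unsupported resolution: {resolution}")
    if aspect_ratio not in SUPPORTED_ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio}")
    return problems


def validate_reference_video(seconds: float, width: int, height: int) -> List[str]:
    """Return a list of problems with one reference video (empty if valid)."""
    problems = []
    if seconds > MAX_REF_VIDEO_SECONDS:
        problems.append(f"reference video too long: {seconds}s")
    if max(width, height) > MAX_REF_VIDEO_LONGEST_SIDE:
        problems.append(f"longest side too large: {max(width, height)}px")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every invalid setting at once.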

Pricing: $0.20 per second of output video.
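
As a worked example of the per-second price, the cost of each supported duration can be computed as follows. This sketch assumes the rate is flat across resolutions and modes, which this README does not confirm; estimate_cost_usd is a hypothetical helper name.

```python
from decimal import Decimal

PRICE_PER_SECOND_USD = Decimal("0.20")  # from the pricing line above
SUPPORTED_DURATIONS = (5, 10, 15)       # seconds


def estimate_cost_usd(duration_seconds: int) -> Decimal:
    """Estimate the cost of one request from its output duration."""
    if duration_seconds not in SUPPORTED_DURATIONS:
        raise ValueError(f"unsupported duration: {duration_seconds}s")
    return PRICE_PER_SECOND_USD * duration_seconds


for seconds in SUPPORTED_DURATIONS:
    print(f"{seconds}s -> ${estimate_cost_usd(seconds)}")
# 5s -> $1.00
# 10s -> $2.00
# 15s -> $3.00
```

Under this assumption a 5-second clip comes to $1.00, which matches the "$1 / request" figure shown at the top of the page. Decimal is used instead of float to avoid rounding artifacts in currency arithmetic.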