Feature Summary
A micro-budget text-to-video diffusion transformer from Motif Technologies
Detailed Description
https://huggingface.co/Motif-Technologies/Motif-Video-2B
https://huggingface.co/Motif-Technologies/Motif-Video-2B-GGUF
paper: https://arxiv.org/abs/2604.16503
"Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. Motif-Video 2B asks whether competitive text-to-video quality is reachable at a much smaller budget — fewer than 10M training clips and under 100,000 H200 GPU hours — and shows that the answer is yes, provided the model design explicitly separates objectives that scaling would otherwise leave entangled.
Our central observation is that prompt alignment, temporal consistency, and fine-detail recovery interfere with one another when handled through the same pathway. Motif-Video 2B addresses this objective interference architecturally rather than relying on scale alone, through two contributions:
Shared Cross-Attention. A residual cross-attention mechanism that reuses self-attention K/V weights to stabilize text–video alignment under long-context token sparsity, where standard joint attention dilutes text influence as the video token sequence grows.
Three-stage DDT-style backbone. 12 dual-stream + 16 single-stream + 8 DDT decoder layers, separating early modality fusion, joint representation learning, and high-frequency detail reconstruction into dedicated components. Per-block attention analysis shows that the DDT decoder spontaneously develops inter-frame attention structure absent from the encoder layers.
These are paired with a micro-budget training recipe combining TREAD token routing and early-phase REPA with a frozen V-JEPA teacher — to our knowledge, the first time this combination has been applied to text-to-video training.
On VBench, Motif-Video 2B reaches 83.76%, the highest Total Score among the open-source models we evaluate, surpassing Wan2.1-14B with 7× fewer parameters and roughly an order of magnitude less training data."
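For anyone triaging this request, here is a minimal PyTorch sketch of the shared cross-attention idea described in the quote: the text tokens are keyed through the same K/V projection weights as the video self-attention, and the cross-attention output is added as a residual so text conditioning is not diluted as the video sequence grows. This is my reading of the description, not the released implementation; all module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedCrossAttention(nn.Module):
    """Hypothetical sketch: one set of K/V weights serves both the video
    self-attention and the text cross-attention; the cross path is residual."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Single Q/K/V projection, shared between self- and cross-attention.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def _heads(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        q = self._heads(self.to_q(video))
        # Standard self-attention over the (long) video token sequence.
        self_out = F.scaled_dot_product_attention(
            q, self._heads(self.to_k(video)), self._heads(self.to_v(video))
        )
        # Cross-attention to text tokens, reusing the SAME K/V weights,
        # added as a residual so text influence survives long sequences.
        cross_out = F.scaled_dot_product_attention(
            q, self._heads(self.to_k(text)), self._heads(self.to_v(text))
        )
        b, _, n, _ = self_out.shape
        merged = (self_out + cross_out).transpose(1, 2).reshape(b, n, -1)
        return self.out(merged)
```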
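The 12 + 16 + 8 stage layout quoted above could be wired roughly as follows. The block classes here are plain placeholder transformer layers; the actual dual-stream, single-stream, and DDT decoder blocks differ internally, and the hidden size and head count are guesses. Only the staging is what this sketch is meant to show.

```python
import torch
import torch.nn as nn


def _block(dim: int, heads: int) -> nn.Module:
    # Placeholder transformer block; stands in for the real dual-stream,
    # single-stream, and DDT decoder blocks.
    return nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
    )


class ThreeStageBackbone(nn.Module):
    """Schematic of the 12 dual-stream + 16 single-stream + 8 DDT-decoder
    layout described in the model card. Dimensions are illustrative."""

    def __init__(self, dim: int = 1536, heads: int = 16):
        super().__init__()
        self.dual_stream = nn.ModuleList(_block(dim, heads) for _ in range(12))
        self.single_stream = nn.ModuleList(_block(dim, heads) for _ in range(16))
        self.ddt_decoder = nn.ModuleList(_block(dim, heads) for _ in range(8))

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Stage 1: early modality fusion; text and video keep separate streams.
        for blk in self.dual_stream:
            video, text = blk(video), blk(text)
        # Stage 2: joint representation learning over the fused sequence.
        x = torch.cat([text, video], dim=1)
        for blk in self.single_stream:
            x = blk(x)
        # Stage 3: DDT decoder reconstructs high-frequency detail on the
        # video tokens only.
        x = x[:, text.size(1):]
        for blk in self.ddt_decoder:
            x = blk(x)
        return x
```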
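And for the training recipe, a sketch of the two pieces as I understand them from the TREAD and REPA papers: TREAD-style routing drops a random subset of tokens around the expensive middle blocks and reinserts them afterwards, while a REPA-style loss aligns intermediate student features with a frozen teacher (V-JEPA here) during early training. The exact hook points in Motif-Video 2B are not public, so treat this as illustrative only.

```python
import torch
import torch.nn.functional as F


def tread_route(tokens: torch.Tensor, keep_ratio: float = 0.5):
    """TREAD-style routing: keep a random per-sample subset of tokens for
    the middle blocks; return the kept tokens and their indices."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)
    keep_idx = idx[:, :n_keep]
    kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return kept, keep_idx


def tread_unroute(processed: torch.Tensor, original: torch.Tensor,
                  keep_idx: torch.Tensor) -> torch.Tensor:
    """Scatter the processed tokens back into the full sequence, leaving
    the routed-around tokens untouched."""
    out = original.clone()
    d = original.size(-1)
    out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, d), processed)
    return out


def repa_loss(student_feats: torch.Tensor,
              teacher_feats: torch.Tensor) -> torch.Tensor:
    """REPA-style alignment: maximize cosine similarity between (projected)
    student features and frozen teacher features, e.g. from V-JEPA."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)  # teacher stays frozen
    return 1.0 - (s * t).sum(dim=-1).mean()
```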
Alternatives you considered
No response
Additional context
No response