Feature Summary
A micro-budget text-to-video diffusion transformer from Motif Technologies
Detailed Description
https://huggingface.co/Motif-Technologies/Motif-Video-2B
https://huggingface.co/Motif-Technologies/Motif-Video-2B-GGUF
paper: https://arxiv.org/abs/2604.16503
"Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. Motif-Video 2B asks whether competitive text-to-video quality is reachable at a much smaller budget — fewer than 10M training clips and under 100,000 H200 GPU hours — and shows that the answer is yes, provided the model design explicitly separates objectives that scaling would otherwise leave entangled.
Our central observation is that prompt alignment, temporal consistency, and fine-detail recovery interfere with one another when handled through the same pathway. Motif-Video 2B addresses this objective interference architecturally rather than relying on scale alone, through two contributions:
Shared Cross-Attention. A residual cross-attention mechanism that reuses self-attention K/V weights to stabilize text–video alignment under long-context token sparsity, where standard joint attention dilutes text influence as the video token sequence grows.
Three-stage DDT-style backbone. 12 dual-stream + 16 single-stream + 8 DDT decoder layers, separating early modality fusion, joint representation learning, and high-frequency detail reconstruction into dedicated components. Per-block attention analysis shows that the DDT decoder spontaneously develops inter-frame attention structure absent from the encoder layers.
These are paired with a micro-budget training recipe combining TREAD token routing and early-phase REPA with a frozen V-JEPA teacher — to our knowledge, the first time this combination has been applied to text-to-video training.
On VBench, Motif-Video 2B reaches 83.76%, the highest Total Score among the open-source models we evaluate, surpassing Wan2.1-14B with 7× fewer parameters and roughly an order of magnitude less training data."
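For anyone triaging this request, here is a minimal PyTorch sketch of the shared cross-attention idea described in the quote: the text tokens are keyed through the same K/V projection weights as the video self-attention, and the cross-attention output is added as a residual so text conditioning is not diluted as the video sequence grows. This is my reading of the description, not the released implementation; all module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedCrossAttention(nn.Module):
    """Hypothetical sketch: one set of K/V weights serves both the video
    self-attention and the text cross-attention; the cross path is residual."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Single Q/K/V projection, shared between self- and cross-attention.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def _heads(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        q = self._heads(self.to_q(video))
        # Standard self-attention over the (long) video token sequence.
        self_out = F.scaled_dot_product_attention(
            q, self._heads(self.to_k(video)), self._heads(self.to_v(video))
        )
        # Cross-attention to text tokens, reusing the SAME K/V weights,
        # added as a residual so text influence survives long sequences.
        cross_out = F.scaled_dot_product_attention(
            q, self._heads(self.to_k(text)), self._heads(self.to_v(text))
        )
        b, _, n, _ = self_out.shape
        merged = (self_out + cross_out).transpose(1, 2).reshape(b, n, -1)
        return self.out(merged)
```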
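The 12 + 16 + 8 stage layout quoted above could be wired roughly as follows. The block classes here are plain placeholder transformer layers; the actual dual-stream, single-stream, and DDT decoder blocks differ internally, and the hidden size and head count are guesses. Only the staging is what this sketch is meant to show.

```python
import torch
import torch.nn as nn


def _block(dim: int, heads: int) -> nn.Module:
    # Placeholder transformer block; stands in for the real dual-stream,
    # single-stream, and DDT decoder blocks.
    return nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
    )


class ThreeStageBackbone(nn.Module):
    """Schematic of the 12 dual-stream + 16 single-stream + 8 DDT-decoder
    layout described in the model card. Dimensions are illustrative."""

    def __init__(self, dim: int = 1536, heads: int = 16):
        super().__init__()
        self.dual_stream = nn.ModuleList(_block(dim, heads) for _ in range(12))
        self.single_stream = nn.ModuleList(_block(dim, heads) for _ in range(16))
        self.ddt_decoder = nn.ModuleList(_block(dim, heads) for _ in range(8))

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Stage 1: early modality fusion; text and video keep separate streams.
        for blk in self.dual_stream:
            video, text = blk(video), blk(text)
        # Stage 2: joint representation learning over the fused sequence.
        x = torch.cat([text, video], dim=1)
        for blk in self.single_stream:
            x = blk(x)
        # Stage 3: DDT decoder reconstructs high-frequency detail on the
        # video tokens only.
        x = x[:, text.size(1):]
        for blk in self.ddt_decoder:
            x = blk(x)
        return x
```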
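And for the training recipe, a sketch of the two pieces as I understand them from the TREAD and REPA papers: TREAD-style routing drops a random subset of tokens around the expensive middle blocks and reinserts them afterwards, while a REPA-style loss aligns intermediate student features with a frozen teacher (V-JEPA here) during early training. The exact hook points in Motif-Video 2B are not public, so treat this as illustrative only.

```python
import torch
import torch.nn.functional as F


def tread_route(tokens: torch.Tensor, keep_ratio: float = 0.5):
    """TREAD-style routing: keep a random per-sample subset of tokens for
    the middle blocks; return the kept tokens and their indices."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)
    keep_idx = idx[:, :n_keep]
    kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return kept, keep_idx


def tread_unroute(processed: torch.Tensor, original: torch.Tensor,
                  keep_idx: torch.Tensor) -> torch.Tensor:
    """Scatter the processed tokens back into the full sequence, leaving
    the routed-around tokens untouched."""
    out = original.clone()
    d = original.size(-1)
    out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, d), processed)
    return out


def repa_loss(student_feats: torch.Tensor,
              teacher_feats: torch.Tensor) -> torch.Tensor:
    """REPA-style alignment: maximize cosine similarity between (projected)
    student features and frozen teacher features, e.g. from V-JEPA."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)  # teacher stays frozen
    return 1.0 - (s * t).sum(dim=-1).mean()
```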
Alternatives you considered
No response
Additional context
No response