Chinese tech giant Alibaba has launched Wan 2.1, its open-source video foundation model, along with the code and weights. The model can generate videos with complex motions that accurately simulate real-world physics.
“Wan2.1 consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks,” the company said in a blog post.
The company has released several models optimised for video generation, offering capabilities in text-to-video, image-to-video, video editing, text-to-image, and video-to-audio. The suite includes three main models: Wan2.1-I2V-14B, Wan2.1-T2V-14B, and Wan2.1-T2V-1.3B.
The I2V-14B model generates videos at 480P and 720P resolutions, producing complex visual scenes and motion patterns. The T2V-14B model supports the same resolutions and is “the only video model capable of generating both Chinese and English text.”
The T2V-1.3B model is designed for consumer-grade GPUs, requiring 8.19 GB of VRAM to generate a five-second 480P video in about four minutes on an RTX 4090.
The model outperforms OpenAI’s Sora on the VBench Leaderboard, which evaluates video generation quality across 16 dimensions, including subject identity consistency, motion smoothness, temporal flickering, and spatial relationships.
According to the company, the technical advancements in Wan2.1 are based on a new spatio-temporal variational autoencoder (VAE), scalable pre-training strategies, large-scale data construction, and automated evaluation metrics.
“We propose a novel 3D causal VAE architecture specifically designed for video generation,” the company said. The model implements a feature cache mechanism, reducing memory usage and preserving temporal causality.
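The idea behind such a feature cache can be illustrated with a toy example. Wan2.1's actual implementation is not spelled out in the article, so the snippet below is a minimal sketch under a simple assumption: a causal convolution over the time axis only needs the last few frames from the previous chunk to produce identical results to a full pass, so long videos can be encoded chunk by chunk with bounded memory.

```python
import numpy as np

def causal_temporal_conv(frames, kernel, cache=None):
    """Causal convolution over the time axis of per-frame features.

    frames: array of shape (T, D) -- one feature vector per frame.
    kernel: array of shape (K,)   -- temporal filter taps.
    cache:  the last K-1 input frames of the previous chunk (None at start).

    Returns (outputs, new_cache). Each output depends only on the current
    and past frames, so chunked processing matches a single full pass.
    """
    K = len(kernel)
    if cache is None:
        cache = np.zeros((K - 1, frames.shape[1]))  # zero-pad the very start
    padded = np.concatenate([cache, frames], axis=0)
    out = np.stack([
        # output at frame t uses input frames t-K+1 .. t only (causal)
        np.tensordot(kernel, padded[t:t + K], axes=1)
        for t in range(frames.shape[0])
    ])
    return out, padded[-(K - 1):]  # carry the last K-1 frames forward

# Processing a 6-frame clip in 2-frame chunks reproduces the full pass,
# while never holding more than one chunk plus the small cache in memory.
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))
kernel = np.array([0.5, 0.3, 0.2])
full, _ = causal_temporal_conv(frames, kernel)
cache, chunks = None, []
for i in range(0, 6, 2):
    out, cache = causal_temporal_conv(frames[i:i + 2], kernel, cache)
    chunks.append(out)
assert np.allclose(np.concatenate(chunks), full)
```

The same principle extends to the 3D convolutions of a video VAE: because no layer looks at future frames, a small rolling cache replaces the full temporal context.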
Performance tests indicate that Wan2.1’s VAE reconstructs video at 2.5 times the speed of HunYuanVideo on an A800 GPU. “This speed advantage will be further demonstrated at higher resolutions due to the small size design of our VAE model and the feature cache mechanism,” the company explained.
Wan2.1 employs the Flow Matching framework within the Diffusion Transformer (DiT) paradigm. It integrates the T5 encoder to process multi-language text inputs via cross-attention mechanisms. “Our experimental findings reveal a significant performance improvement with this approach at the same parameter scale,” the company said.
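For context, the standard (rectified) flow matching objective trains the network to predict a constant velocity along a straight path between noise and data. The article does not give Wan2.1's exact formulation, but the commonly used conditional objective is:

$$
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\, x_0,\, x_1}\,\bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|^2,
\qquad x_t = (1 - t)\, x_0 + t\, x_1,
$$

where $x_0$ is Gaussian noise, $x_1$ is the video latent, $c$ is the text conditioning (here, T5 embeddings injected via cross-attention), and $v_\theta$ is the velocity predicted by the DiT. Sampling then integrates this learned velocity field from noise to data, typically in fewer, more stable steps than classic diffusion.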
Wan2.1’s data pipeline involved curating and deduplicating 1.5 billion videos and 10 billion images.
Alibaba recently launched QwQ-Max-Preview, a new reasoning model in its Qwen AI family. The company plans to invest over $52 billion in cloud computing and artificial intelligence over the next three years.
The post Alibaba Releases Open-Source Video Generation Model Wan 2.1, Outperforms OpenAI’s Sora appeared first on Analytics India Magazine.