Meta recently launched MarDini, its new series of video diffusion models, in collaboration with King Abdullah University of Science and Technology (KAUST). MarDini’s masked auto-regression (MAR) technology makes creating smooth, high-quality videos simpler and more flexible than ever.
This model can tackle various tasks, such as filling in missing frames in the middle of a video, turning a single image into a moving scene, or extending a short clip by adding natural, continuous frames.
“Meta presents MarDini: Masked Autoregressive Diffusion for Video Generation at Scale,” AK (@_akhaliq) posted on X on October 29, 2024.
With this move, Meta is carving out a stronger position in the generative AI video field. Last year, Meta released the text-to-video and editing models Emu Video and Emu Edit. Earlier this year, before the release of this series of diffusion models, it also unveiled Movie Gen, its state-of-the-art video generation and editing model.
What Can It Do?
With video interpolation, MarDini seamlessly fills in frames for smooth transitions between scenes. It also produces videos that rival those made by far more advanced and expensive models.
By building on masked auto-regression (MAR) within a single diffusion model (DM), it combines adaptability with speed, making it easy for creators to generate or expand videos with a natural flow.
Meta claims this empowers video creators with control and quality, all in one tool.
How Does It Function?
MarDini’s video generation technology is an advanced yet efficient tool for creating high-quality, flexible videos. Its architecture consists of two main parts: a planning model and a generation model.
The planning model interprets low-resolution input frames using a MAR (masked auto-regression) approach, generating guiding signals for any frames that need to be created.
Then, the lightweight generation model steps in to produce detailed, high-resolution frames through a diffusion process, which ensures the final video looks smooth and coherent.
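To make that two-stage flow concrete, here is a minimal, hypothetical PyTorch sketch of how a planning model and a generation model could be wired together. All module names, layer choices, and tensor shapes below are illustrative assumptions for exposition, not Meta’s actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a MarDini-style two-stage design (names/shapes are assumptions).
class PlanningModel(nn.Module):
    """Reads low-res frames (masked ones replaced by a learned token) and
    emits one conditioning vector per frame for the generation stage."""
    def __init__(self, in_ch=3, dim=256):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=8, stride=8)  # patchify each frame
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )

    def forward(self, lowres_frames, frame_mask):
        # lowres_frames: (B, T, 3, H, W); frame_mask: (B, T), True = frame must be generated
        B, T = lowres_frames.shape[:2]
        feats = self.embed(lowres_frames.flatten(0, 1)).mean(dim=(2, 3)).view(B, T, -1)
        feats = torch.where(frame_mask[..., None], self.mask_token.expand(B, T, -1), feats)
        return self.temporal(feats)  # (B, T, dim) planning signals


class GenerationModel(nn.Module):
    """Lightweight denoiser: predicts the noise in high-res frames,
    conditioned on the planning signal for each frame."""
    def __init__(self, in_ch=3, dim=256):
        super().__init__()
        self.cond_proj = nn.Linear(dim, in_ch)
        self.denoise = nn.Conv3d(in_ch * 2, in_ch, kernel_size=3, padding=1)

    def forward(self, noisy_hires, planning_signal):
        # noisy_hires: (B, T, 3, H, W); planning_signal: (B, T, dim)
        cond = self.cond_proj(planning_signal)[..., None, None].expand_as(noisy_hires)
        x = torch.cat([noisy_hires, cond], dim=2).transpose(1, 2)     # (B, 6, T, H, W)
        return self.denoise(x).transpose(1, 2)                        # predicted noise


# Toy forward pass: 16 low-res frames, the middle 8 masked (to be generated).
planner, generator = PlanningModel(), GenerationModel()
lowres = torch.randn(1, 16, 3, 64, 64)
mask = torch.zeros(1, 16, dtype=torch.bool)
mask[:, 4:12] = True
signals = planner(lowres, mask)
noise_pred = generator(torch.randn(1, 16, 3, 256, 256), signals)
print(noise_pred.shape)  # torch.Size([1, 16, 3, 256, 256])
```

The design point this sketch tries to capture is that the planning stage only ever sees cheap low-resolution frames, while the heavier per-pixel denoising is left to the lighter, diffusion-based generation stage.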
Unlike many video models that rely on complex, pre-trained image models, MarDini is claimed to train from scratch on unlabelled video data, thanks to its progressive training strategy.
This approach adapts the way frames are masked during training, making the model more flexible and capable of handling different frame configurations.
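As a rough illustration of what such a progressive masking schedule might look like, the sketch below samples a training-time frame mask whose masking ratio grows as training progresses. The specific ratios and schedule are assumptions made for exposition, not the paper’s exact recipe.

```python
import random

def sample_frame_mask(num_frames: int, progress: float) -> list[bool]:
    """Illustrative progressive masking schedule (an assumption, not MarDini's
    exact recipe): early in training only a few frames are masked, and the
    masking ratio grows as training progresses.
    `progress` runs from 0.0 (start) to 1.0 (end); True means 'generate this frame'."""
    min_masked = max(1, int(0.25 * num_frames))                # start around 25% masked
    max_masked = max(min_masked, int(progress * (num_frames - 1)))
    num_masked = random.randint(min_masked, max_masked)
    masked_idx = set(random.sample(range(num_frames), k=num_masked))
    return [i in masked_idx for i in range(num_frames)]

# Example: a 16-frame clip at three stages of training.
for p in (0.1, 0.5, 1.0):
    mask = sample_frame_mask(16, p)
    print(f"progress={p:.1f} masked={sum(mask)}/16")
```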
Why Is It Interesting?
What makes MarDini stand out are two key features: its flexibility and its performance. This adaptability means it is not only powerful but also efficient and capable of scaling to larger tasks.
This model can handle a range of tasks: video interpolation (where it fills in middle frames to smooth transitions), image-to-video generation (turning a single frame into a flowing video by generating all following frames), and video expansion (extending a short clip by adding continuous frames).
These abilities make MarDini adaptable for multiple video generation needs, whether you’re smoothing out existing footage or creating entire sequences from scratch.
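The sketch below illustrates, under an assumed clip length, how those three tasks can all be expressed as different frame masks over the same clip: interpolation masks the middle frames, image-to-video masks everything after the first frame, and expansion masks the frames beyond the existing clip. The helper names and the 16-frame clip length are hypothetical.

```python
import numpy as np

NUM_FRAMES = 16  # illustrative clip length (an assumption)

def interpolation_mask(known_first=2, known_last=2):
    """Keep a few frames at each end; generate the frames in between."""
    mask = np.ones(NUM_FRAMES, dtype=bool)        # True = frame to generate
    mask[:known_first] = False
    mask[NUM_FRAMES - known_last:] = False
    return mask

def image_to_video_mask():
    """Only the first frame is given; every following frame is generated."""
    mask = np.ones(NUM_FRAMES, dtype=bool)
    mask[0] = False
    return mask

def video_expansion_mask(known_frames=8):
    """A short clip occupies the first frames; the rest are generated to extend it."""
    mask = np.ones(NUM_FRAMES, dtype=bool)
    mask[:known_frames] = False
    return mask

# K = known/conditioning frame, G = frame the model generates.
for name, m in [("interpolation", interpolation_mask()),
                ("image-to-video", image_to_video_mask()),
                ("expansion", video_expansion_mask())]:
    print(f"{name:>15}: " + "".join("G" if gen else "K" for gen in m))
```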
In terms of performance, it sets new benchmarks, generating high-quality videos in far fewer steps than traditional models. This efficiency makes MarDini both cost-effective and time-saving compared with more complex alternatives.
The official research paper, released on 23 October, concluded: “Our investigation shows that our modelling strategy is powerful enough to obtain competitive results on various interpolation and animation benchmarks while doing it at a lower computational need than counterparts with comparable parameter size.”