Researchers from NVIDIA, Stanford University, University of California San Diego, UC Berkeley and UT Austin have developed a new AI model that can whip up one-minute Tom and Jerry-style animated videos from just text storyboards. Picture this: dynamic, multi-scene adventures full of the iconic chaos and mischief that fans love, all generated from simple written prompts.
Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training.
We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency.
Every video below is produced directly by… pic.twitter.com/Bh2impMBWA— Karan Dalal (@karansdalal) April 7, 2025
The model, called TTT-MLP (Test-Time Training Multilayer Perceptron), uses TTT layers, which extend pre-trained transformers by letting their hidden states themselves be neural networks. This makes the layer's memory more expressive and longer-range, which is crucial for producing coherent videos with complex narratives.
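The paper's actual layer design is more involved, but a minimal sketch conveys the idea: the layer's hidden state is the weight matrices of a small two-layer MLP, and each incoming token triggers a gradient step on a self-supervised reconstruction loss before the updated MLP produces that token's output. The projection names, loss, and initialization below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class TTTMLPLayer(torch.nn.Module):
    def __init__(self, dim: int, hidden: int, lr: float = 0.1):
        super().__init__()
        # Projections producing the self-supervised input/target pair
        # (names and the reconstruction objective are illustrative assumptions).
        self.proj_in = torch.nn.Linear(dim, dim)
        self.proj_target = torch.nn.Linear(dim, dim)
        self.dim, self.hidden, self.lr = dim, hidden, lr

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim). The "hidden state" is (W1, W2), the weights of a small MLP.
        W1 = (0.02 * torch.randn(self.dim, self.hidden, device=x.device, dtype=x.dtype)).requires_grad_(True)
        W2 = (0.02 * torch.randn(self.hidden, self.dim, device=x.device, dtype=x.dtype)).requires_grad_(True)
        outputs = []
        for t in range(x.shape[0]):
            inp = self.proj_in(x[t])
            tgt = self.proj_target(x[t])
            # Inner-loop step: nudge the hidden MLP toward reconstructing the target.
            pred = F.gelu(inp @ W1) @ W2
            loss = F.mse_loss(pred, tgt)
            g1, g2 = torch.autograd.grad(loss, (W1, W2))
            W1 = (W1 - self.lr * g1).detach().requires_grad_(True)
            W2 = (W2 - self.lr * g2).detach().requires_grad_(True)
            # The token's output comes from the freshly updated hidden network.
            outputs.append(F.gelu(x[t] @ W1) @ W2)
        return torch.stack(outputs)
```

The token-by-token loop is only meant to show why the hidden state behaves like a learned, compressive memory rather than a fixed-size cache; the real system processes chunks efficiently inside a diffusion transformer.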
“Adding TTT layers into a pre-trained transformer enables it to generate one-minute videos from text storyboards,” the researchers said.
Notably, the researchers created a dataset based on Tom and Jerry cartoons to test their model.
TTT-MLP outperforms all other baselines in temporal consistency, motion smoothness, and overall aesthetics, as measured by Elo scores aggregated from human evaluations. In these evaluations, videos generated with TTT layers led strong baselines such as Mamba 2 and Gated DeltaNet by 34 Elo points.
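Elo here is the same rating scheme used in chess and in LLM arenas: each human preference between two models' videos is treated as a match, and ratings are updated after every comparison. The toy illustration below assumes the standard Elo update; the paper's exact protocol, starting ratings, and K-factor are not specified here.

```python
def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Standard Elo update for a single pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Toy example: start every model at 1000 and replay rater judgments.
ratings = {"TTT-MLP": 1000.0, "Mamba 2": 1000.0, "Gated DeltaNet": 1000.0}
judgments = [("TTT-MLP", "Mamba 2", True), ("TTT-MLP", "Gated DeltaNet", True)]  # made-up data
for a, b, a_preferred in judgments:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], a_preferred)
```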
One of the AI-generated videos shows Tom walking into an office, taking the elevator, and sitting at his desk. However, things quickly turn wild when Jerry cuts a wire, kicking off their usual cat-and-mouse chase, this time in a bustling New York City office.
While the results are promising, the researchers noted that the videos still contain artifacts, likely due to limitations of the pre-trained model used. They also highlighted the potential to extend this approach to longer videos and more complex stories. According to the researchers, achieving this would require significantly larger hidden states: instead of a simple two-layer MLP, they said, the hidden-state network could be a full-fledged neural network, possibly even a transformer.
They added that several promising directions for future work exist, including a faster implementation. The current TTT-MLP kernel runs into performance issues due to register spills and suboptimal ordering of asynchronous instructions. The researchers believe this could be improved by reducing register pressure and making the implementation more compiler-friendly.
They also pointed out that using bidirectionality and learned gates is only one way to integrate TTT layers into a pre-trained model. Exploring better integration strategies could improve generation quality and speed up fine-tuning, and other kinds of video generation models, such as autoregressive architectures, may need entirely different methods.
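As a rough picture of what "bidirectionality and learned gates" means here: the TTT layer is run over the token sequence in both directions, and its output is folded back into the residual stream through a tanh gate that starts small, so fine-tuning begins close to the behavior of the original pre-trained model. The wrapper below is a simplified sketch under those assumptions, not the paper's code.

```python
import torch

class GatedBidirectionalTTT(torch.nn.Module):
    def __init__(self, ttt_layer: torch.nn.Module, dim: int):
        super().__init__()
        self.ttt = ttt_layer
        # Per-channel gate parameter; zero init means tanh(alpha) == 0, so the block
        # is initially a no-op (an assumption made for this sketch).
        self.alpha = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim). Apply the TTT layer forward and over the reversed sequence.
        forward_out = self.ttt(x)
        backward_out = torch.flip(self.ttt(torch.flip(x, dims=[0])), dims=[0])
        mixed = forward_out + backward_out
        # Residual connection with a learned tanh gate.
        return x + torch.tanh(self.alpha) * mixed
```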