A team of researchers from Google DeepMind has introduced a novel framework called Mixture of Nested Experts (MoNE) that significantly reduces computational costs for processing images and videos without sacrificing accuracy. This new approach dynamically allocates computational resources to different visual tokens based on their importance.

MoNE achieves a more than two-fold reduction in inference-time compute compared to baseline models while maintaining equivalent performance on standard image and video datasets. On the Something-Something-v2 video dataset, MoNE matched the baseline accuracy of 64.4% while using only 162.8 GFLOPs compared to the baseline’s 376.3 GFLOPs.
The framework builds on the concept of nested models, where smaller sub-models are contained within larger ones. MoNE uses a router network to dynamically assign visual tokens to these nested experts of varying sizes. This allows more important or informative tokens to be processed by larger, more computationally expensive models while less important tokens are handled by smaller nested models.
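To make the routing idea concrete, here is a minimal Python sketch of this kind of capacity-constrained routing. It is illustrative only, not the authors' implementation: the token counts, per-expert capacities, and the modelling of nested experts as slices of one shared weight matrix are all assumptions for demonstration.

```python
# Illustrative sketch of MoNE-style token routing (assumptions, not the
# paper's code): nested experts are modelled as slices of one shared
# weight matrix, with a fixed token capacity per expert.
import numpy as np

rng = np.random.default_rng(0)

num_tokens, dim = 16, 64        # visual tokens and full model width
nested_dims = [16, 32, 64]      # widths of the nested experts, small to large
capacities = [8, 5, 3]          # tokens each expert may process (sums to 16)

W = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # shared full-width weight
router_w = rng.standard_normal(dim)                 # router scoring vector
tokens = rng.standard_normal((num_tokens, dim))

# The router scores every token; higher scores mark more informative tokens.
scores = tokens @ router_w
order = np.argsort(-scores)     # token indices, most important first

# Fill experts from largest to smallest so the highest-scoring tokens
# land in the biggest (most expensive) nested model.
assignment = np.empty(num_tokens, dtype=int)
start = 0
for expert in range(len(nested_dims) - 1, -1, -1):
    take = capacities[expert]
    assignment[order[start:start + take]] = expert
    start += take

# A token routed to expert of width d only uses its first d features and
# the top-left d x d block of the shared weight, so tokens sent to smaller
# experts cost fewer FLOPs.
out = np.zeros_like(tokens)
for i, tok in enumerate(tokens):
    d = nested_dims[assignment[i]]
    out[i, :d] = tok[:d] @ W[:d, :d]

print("tokens per expert:", np.bincount(assignment, minlength=3))
```

Because the capacities sum to the token count, every token is still processed exactly once, just at different widths, which is where the compute savings come from.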
The team evaluated MoNE on image classification using the ImageNet-21k dataset and on video classification using the Kinetics-400 and Something-Something-v2 datasets. They found that MoNE consistently outperformed baseline models and other approaches like Mixture of Depths, especially at lower computational budgets.
A key advantage of MoNE is its ability to adapt to different inference-time compute budgets using a single trained model. This flexibility allows the framework to meet varying computational constraints without requiring retraining.
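A hypothetical back-of-the-envelope illustration of how that flexibility could work: the trained router and weights stay fixed, and only the share of tokens sent to each nested width changes to hit a compute target. The cost model below (roughly 2·d² multiply-adds per token at width d) is an assumption for demonstration, not a figure from the paper.

```python
# Hypothetical inference-time budget control: reuse the same trained model,
# vary only how many tokens each nested expert is allowed to take.
nested_dims = [16, 32, 64]      # widths of the nested experts

def matmul_flops(caps, dims):
    # rough cost model: a token routed to width d pays ~2*d*d multiply-adds
    return sum(c * 2 * d * d for c, d in zip(caps, dims))

low_budget  = [12, 3, 1]        # most tokens go to the smallest nested model
high_budget = [2, 4, 10]        # most tokens go to the full-width model
for caps in (low_budget, high_budget):
    print(caps, "->", matmul_flops(caps, nested_dims), "FLOPs")
```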
Visualisations showed that MoNE effectively identified important regions in images and videos, routing tokens from these areas to larger nested models. This demonstrates the framework’s ability to focus computational resources on the most informative parts of visual inputs.
While primarily designed for encoder architectures, the researchers note that extending MoNE to autoregressive decoding in large language models remains a challenge for future work. They also highlight the potential societal impact of MoNE in reducing energy usage and carbon emissions during model inference, as well as democratising access to AI by allowing broader use of trained models without requiring extensive computational resources.
As deep learning models continue to grow in size and complexity, approaches like MoNE that significantly reduce computational costs while maintaining performance are likely to become increasingly important for practical applications of AI in resource-constrained environments.
Google DeepMind is constantly developing advanced AI models and frameworks. Last month, they introduced Foundational Large Autorater Models (FLAMe) for various quality assessment tasks, and also launched the MatFormer Framework, which enhances on-device capabilities by allowing users to mix and match AI models within a single framework to optimise performance for specific tasks.