Meta AI and Stanford have introduced Apollo, a family of video-based large multimodal models (LMMs) designed to understand video content efficiently and accurately. Apollo aims to bridge the gap between image-based multimodal models and video comprehension, addressing the high computational demands and design challenges that video poses.
Apollo models excel at video tasks by addressing key challenges, including how videos are sampled and encoded and how the models are trained. The paper gains added significance in light of OpenAI co-founder Ilya Sutskever's recent talk on pre-training hitting a wall.
Apollo leverages scaling consistency to reduce reliance on large datasets and models while improving task-specific performance.
“We discovered scaling consistency, which allows us to design effective solutions using smaller models and datasets, reducing computational overhead,” the researchers explained in the paper.
Two key improvements make the models better at understanding videos. First, fps sampling, which selects frames at a constant rate per second of video, works better than uniform sampling, which picks a fixed number of frames spread evenly across the clip regardless of its length. Second, combining SigLIP-SO400M (which captures fine image detail) with InternVideo2 (which captures motion and timing) helps the model understand both still visuals and movement in videos.
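To make the sampling distinction concrete, here is a minimal Python sketch (not taken from the Apollo code; the function names and parameter values are illustrative): uniform sampling returns the same number of frames for every clip, while fps sampling scales the frame count with video duration, preserving the temporal pacing the model sees.

```python
# Illustrative sketch (not Apollo's actual implementation): contrast uniform
# sampling, which picks a fixed number of frames regardless of video length,
# with fps sampling, which keeps the time between sampled frames constant.

def uniform_sample(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` indices spread evenly across the whole video."""
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

def fps_sample(total_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """Pick indices at a fixed rate of `target_fps` frames per second of video."""
    step = video_fps / target_fps  # source frames between consecutive samples
    indices, i = [], 0.0
    while i < total_frames:
        indices.append(int(i))
        i += step
    return indices

# A 10-second clip vs. a 60-second clip, both recorded at 30 fps:
print(len(uniform_sample(300, 32)), len(uniform_sample(1800, 32)))  # 32, 32
print(len(fps_sample(300, 30, 2)), len(fps_sample(1800, 30, 2)))    # 20, 120
```

With uniform sampling, a long video is squeezed into the same frame budget as a short one, so motion gets skipped; fps sampling keeps the interval between frames fixed, which is the behavior the Apollo study found to work better.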
Smaller Models with Superior Performance
Apollo-3B outperforms larger 7B models with a score of 68.7 on the MLVU benchmark. Meanwhile, Apollo-7B sets a new standard in its category, scoring 70.9 on the same benchmark and even surpassing some 30B models.
The team also introduced ApolloBench, a faster, more efficient evaluation suite for video understanding that runs 41 times faster than existing benchmarks. “Our results prove that smart design and training can deliver top performance without relying on massive model sizes,” the researchers said.
Apollo marks a significant leap in video AI, opening doors to applications like content analysis and autonomous systems.