Top 6 Papers Presented by Meta at CVPR 2023

CVPR 2023 showcased research papers at the forefront of computer vision. The event featured an array of contributions from Meta spanning video understanding, video-language embeddings, object detection, reinforcement learning, and image similarity.

Let us take a look at the remarkable papers presented by Meta, which introduce cutting-edge models, frameworks, techniques, and benchmarks aimed at enhancing the performance and scalability of computer vision systems. Larry Zitnick, senior research scientist in the Fundamental AI Research (FAIR) team at Meta AI, was a keynote speaker at the conference.

Read more: Top 9 Papers Presented by Google at CVPR 2023

Egocentric Video Task Translation

The paper discusses the limitations of treating different video understanding tasks in isolation and proposes a unified approach called EgoTask Translation (EgoT2) for egocentric (wearable-camera) video. EgoT2 uses separate task-specific models and a shared task translator to improve performance on multiple tasks simultaneously. The authors demonstrate the effectiveness of EgoT2 on various video tasks and achieve top-ranked results in benchmark challenges. Two variants of EgoT2, EgoT2-s and EgoT2-g, are introduced, with EgoT2-s focusing on translating auxiliary task features into predictions for the primary task, and EgoT2-g conducting task translation for multiple tasks concurrently. The authors provide qualitative analysis showcasing how auxiliary tasks contribute to the primary task prediction. The experiments are conducted on a diverse set of egocentric video tasks from the Ego4D dataset.
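To make the task-translation idea concrete, here is a minimal sketch of how features from frozen task-specific models could be fused by a shared transformer "translator" to predict a primary task. The module names, dimensions, and fusion scheme are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: frozen task-specific backbones produce features,
# and a shared transformer fuses them into a prediction for the primary task.
import torch
import torch.nn as nn

class TaskTranslator(nn.Module):
    def __init__(self, feat_dim=256, num_classes=10):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.translator = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, primary_feat, aux_feats):
        # primary_feat: (B, D); aux_feats: list of (B, D) tensors produced by
        # frozen auxiliary-task models.
        tokens = torch.stack([primary_feat, *aux_feats], dim=1)  # (B, 1+K, D)
        fused = self.translator(tokens)                          # (B, 1+K, D)
        return self.head(fused[:, 0])                            # primary-task logits

# Toy usage with random features standing in for the frozen task-specific models.
model = TaskTranslator()
primary = torch.randn(8, 256)
aux = [torch.randn(8, 256) for _ in range(3)]
logits = model(primary, aux)  # shape (8, 10)
```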

Learning Video Representations from Large Language Models

This paper introduces LaViLa, a new method for learning video-language representations using large language models (LLMs). The authors repurpose pre-trained LLMs to be conditioned on visual input and fine-tune them to create automatic video narrators. These auto-generated narrations offer several advantages, such as comprehensive coverage of lengthy videos, improved synchronisation of visual information and text, and increased text diversity. The video-text embedding, which is learned through contrastive methods using these narrations, surpasses the previous state-of-the-art for both first-person and third-person video tasks. This improvement is observed in both zero-shot and fine-tuned scenarios, with LaViLa achieving notable absolute gains of 10.1% on EGTEA classification and 5.9% on Epic-Kitchens-100 multi-instance retrieval benchmarks. Moreover, even when trained on only half the narrations from the Ego4D dataset, LaViLa outperforms baseline models trained on the entire set, demonstrating positive scaling behaviour with larger pre-training data and model size.

📺 Learning Video Representations from Large Language Models
This work repurposed pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators.
Paper ➡ https://t.co/OsZt3AnAEM

— Meta AI (@MetaAI) June 20, 2023
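As an illustration of the contrastive video-text objective described above, below is a minimal CLIP-style InfoNCE sketch over (clip, narration) pairs. The embeddings are random stand-ins; LaViLa's actual video and text backbones and its LLM-based narrator are not reproduced here.

```python
# Minimal sketch of a symmetric contrastive (InfoNCE) loss between paired
# video-clip and narration embeddings; assumed setup, not LaViLa's full pipeline.
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (B, D) embeddings of matched clip/narration pairs.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(v.size(0))       # matched pairs lie on the diagonal
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings standing in for encoder outputs.
loss = clip_style_loss(torch.randn(16, 512), torch.randn(16, 512))
```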

PACO: Parts and Attributes of Common Objects

PACO is a dataset that goes beyond traditional object masks by also providing part masks and attribute annotations for common objects. It covers 75 object categories, 456 object part categories, and 55 attributes in both image (LVIS) and video (Ego4D) datasets. With 641K part masks annotated across 260K object boxes, PACO provides extensive attribute annotations for about half of them. Evaluation metrics and benchmark results are provided for three dataset tasks: part mask segmentation, object and part attribute prediction, and zero-shot instance detection. The introduction of PACO aims to facilitate research on joint detection of objects, parts, and attributes, presenting unique challenges compared to traditional object detection tasks.
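To picture what such annotations contain, here is a hypothetical schema capturing the object, parts, and attributes hierarchy. It is an assumed illustration only and does not mirror PACO's actual file format or category names.

```python
# Hypothetical annotation schema: one object box with nested part masks and
# attribute lists. Field names and the example values are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PartAnnotation:
    part_category: str            # e.g. one of the 456 part categories
    mask_rle: str                 # run-length-encoded segmentation mask
    attributes: List[str] = field(default_factory=list)  # e.g. ["metal", "grey"]

@dataclass
class ObjectAnnotation:
    object_category: str          # one of the 75 object categories
    bbox: List[float]             # [x, y, width, height]
    parts: List[PartAnnotation] = field(default_factory=list)

example = ObjectAnnotation(
    object_category="mug",
    bbox=[34.0, 50.0, 120.0, 140.0],
    parts=[PartAnnotation("handle", "<rle>", ["ceramic", "white"])],
)
```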

Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-per-Second

Galactic is a comprehensive framework for simulating and applying reinforcement learning (RL) to robotic mobile manipulation in indoor settings. The framework centers on a simulated Fetch mobile manipulator tasked with rearranging objects within a home environment. Galactic boasts remarkable speed, achieving 421,000 steps-per-second (SPS) of simulation speed, rendering and physics combined, on an 8-GPU node. This is a 54-fold improvement over Habitat 2.0. Moreover, Galactic is designed to optimise the entire process, including rendering, physics, RL inference, and learning, achieving over 108,000 SPS, an 88-fold increase over Habitat 2.0. These significant speed enhancements lead to reduced training time and enable large-scale experiments. For instance, Galactic achieves a mobile pick skill accuracy of over 80% in under 16 minutes, a 100-fold acceleration compared to Habitat 2.0. Additionally, it enables a groundbreaking rearrangement experiment using 5 billion steps of experience within 46 hours, equivalent to 20 years of robot experience. A single neural network, comprising task-agnostic components, achieves an impressive 85% success rate in GeometricGoal rearrangement, in contrast to Habitat 2.0’s 0% success rate.
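For context on what a steps-per-second figure measures, here is a minimal sketch of how throughput is typically timed for a batched simulator, where each call advances many parallel environments at once. The BatchedEnv class below is a dummy stand-in and bears no relation to Galactic's actual API.

```python
# Illustrative SPS measurement for a batched (vectorized) environment;
# the environment here does no real physics or rendering.
import time
import numpy as np

class BatchedEnv:
    """Dummy vectorized environment stepping many parallel scenes per call."""
    def __init__(self, num_envs=1024, obs_dim=64):
        self.num_envs, self.obs_dim = num_envs, obs_dim

    def step(self, actions):
        # A real simulator would run physics and rendering here.
        return np.random.rand(self.num_envs, self.obs_dim)

env = BatchedEnv()
actions = np.zeros((env.num_envs, 4))
start, steps = time.time(), 0
for _ in range(100):
    env.step(actions)
    steps += env.num_envs          # every call advances all parallel envs
print(f"{steps / (time.time() - start):,.0f} steps-per-second")
```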

GeneCIS: A Benchmark for General Conditional Image Similarity

The paper introduces the GeneCIS benchmark, which evaluates models’ ability to adapt to various similarity conditions in a zero-shot setting. The benchmark focuses on an open set of similarity conditions, allowing for a broader evaluation. The study reveals that powerful CLIP models struggle with GeneCIS and that performance on the benchmark is only weakly correlated with accuracy on ImageNet. This suggests that simply scaling existing methods is not effective. To address this, the authors propose a simple and scalable solution that involves mining information from image-caption datasets. This approach significantly improves performance on GeneCIS and enhances zero-shot performance on other image retrieval benchmarks. Even though the model is evaluated in a zero-shot manner, it outperforms state-of-the-art supervised models on MIT-States.
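To illustrate the conditional-retrieval setup that GeneCIS evaluates, below is a minimal sketch in which a reference-image embedding is combined with a text embedding of the condition and used to rank a gallery by cosine similarity. The simple additive combiner is an assumption for illustration, not the method proposed in the paper.

```python
# Illustrative conditional image retrieval: rank gallery images by similarity
# to a (reference image + condition text) query embedding.
import torch
import torch.nn.functional as F

def rank_gallery(ref_img_emb, condition_emb, gallery_embs):
    # ref_img_emb, condition_emb: (D,); gallery_embs: (N, D)
    query = F.normalize(ref_img_emb + condition_emb, dim=-1)
    gallery = F.normalize(gallery_embs, dim=-1)
    scores = gallery @ query                 # cosine similarity per gallery image
    return torch.argsort(scores, descending=True)

# Toy usage with random embeddings standing in for CLIP-style encoder outputs.
order = rank_gallery(torch.randn(512), torch.randn(512), torch.randn(100, 512))
```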

HierVL: Learning Hierarchical Video-Language Embeddings

HierVL is a novel approach to video-language embeddings that addresses the limitations of existing methods by considering both short-term and long-term associations. By utilizing videos accompanied by timestamped text descriptions and high-level summaries, HierVL introduces a hierarchical contrastive training objective to achieve alignment between text and visual elements at both the clip and video levels. This approach captures not only the immediate actions depicted in each video clip but also the broader context and intent behind the activity. HierVL surpasses single-level representations in terms of clip-level performance and achieves state-of-the-art results in long-term video modeling tasks. Moreover, it demonstrates successful transferability to various challenging downstream tasks in different datasets, including EPIC-KITCHENS-100, Charades-Ego, and HowTo100M, in both zero-shot and fine-tuned settings.
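As a concrete sketch of a two-level contrastive objective in this spirit, the snippet below aligns clip embeddings with their timestamped narrations and a pooled video-level embedding with the summary text. The mean pooling and the loss weighting are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative hierarchical contrastive loss: clip-level term plus video-level term.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE between paired embeddings a and b, each (N, D).
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def hierarchical_loss(clip_embs, narration_embs, summary_embs, alpha=0.5):
    # clip_embs, narration_embs: (B, T, D) paired clips and narrations per video;
    # summary_embs: (B, D) text embeddings of the video-level summaries.
    clip_loss = info_nce(clip_embs.flatten(0, 1), narration_embs.flatten(0, 1))
    video_embs = clip_embs.mean(dim=1)          # simple pooling over clips
    video_loss = info_nce(video_embs, summary_embs)
    return alpha * clip_loss + (1 - alpha) * video_loss

# Toy usage with random embeddings standing in for encoder outputs.
loss = hierarchical_loss(torch.randn(4, 8, 256), torch.randn(4, 8, 256),
                         torch.randn(4, 256))
```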

Read more: While My AI Gently Weeps – A New Beatles Finale!

