Meta FAIR, the company’s fundamental AI research division, has released several new research assets in pursuit of its goal of advanced machine intelligence (AMI), while promoting open science and reproducibility.
Meta’s latest releases include the updated Segment Anything Model 2.1 (SAM 2.1) for image and video segmentation, along with research on improving large language models, post-quantum cryptography, model training, and inorganic materials discovery.
“We introduced SPIRIT-LM, a speech and text generative language model based on Llama 2 that can generate both speech and text in a cross-modal manner. We showed that by alternating speech and text in the input sequence during training, the model is able to generate content fluidly, changing from one modality to another,” the company said in its research paper.
Meta evaluated the models on a collection of speech and text metrics and plans further improvements both in model capability and in transparency and safety.
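The interleaving described in the quote can be pictured as flattening both modalities into a single training token stream. The snippet below is a toy sketch of that idea, not Meta’s actual tokeniser; the modality markers and helper function are invented for illustration.

```python
# Toy sketch of an interleaved speech/text training sequence.
# Marker tokens and helpers are invented for clarity; Meta's real tokeniser differs.

TEXT_MARKER, SPEECH_MARKER = "[TEXT]", "[SPEECH]"

def interleave(segments):
    """Flatten alternating (modality, tokens) segments into one training sequence."""
    sequence = []
    for modality, tokens in segments:
        sequence.append(TEXT_MARKER if modality == "text" else SPEECH_MARKER)
        sequence.extend(tokens)
    return sequence

# Text words followed by discrete speech units, then text again.
example = interleave([
    ("text",   ["the", "cat", "sat"]),
    ("speech", ["<hu_102>", "<hu_47>", "<hu_310>"]),  # discrete speech units
    ("text",   ["on", "the", "mat"]),
])
print(example)
```

Training a language model on sequences like this is what lets it continue a prompt in either modality, since switching from speech to text is just another token pattern it has seen.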
Meta Isn’t the Only Player in This Space
Meanwhile, Anthropic launched Claude 3.5 Sonnet, an AI model designed to compete with generative AI systems such as Google’s NotebookLM and Meta’s Spirit LM. However, Anthropic has not introduced anything akin to NotebookLM’s podcast-style audio generation or the expressive voice capabilities of Spirit LM.
The latest announcements came after a Meta paper in August detailed how such models rely on the ‘chain of thought’ mechanism, a technique OpenAI also uses in its recent o1 models, which reason before they respond.
Google and Anthropic, too, have published research on reinforcement learning from AI feedback, although their implementations are not yet available for public use.
Meta’s group of AI researchers under FAIR said that the new releases support the company’s goal of achieving advanced machine intelligence while also aiding open science and reproducibility. The newly released assets include the updated Segment Anything Model 2.1 for images and videos, Meta Spirit LM, Layer Skip, SALSA, Meta Lingua, OMat24, MEXMA, and the Self-Taught Evaluator.
Meta describes the Self-Taught Evaluator, a model capable of validating the work of other AI models, as a “strong generative reward model with synthetic data”. The company claims this is a new method for generating preference data to train reward models without relying on human annotations.
“This approach generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces for evaluation and final judgments, with an iterative self-improvement scheme,” the company said in its official blog post.
Essentially, this is a method that generates its own data to train reward models, without the need for humans to label it. It produces contrasting outputs from AI models and then uses another model to assess those outputs and improve its own judgments, repeating the process iteratively.
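Put as a loop, the blog post’s description could look roughly like the sketch below. All of the helper functions are toy stand-ins invented for illustration, not Meta’s released code or API.

```python
# Illustrative sketch of an LLM-as-a-Judge self-training loop.
# Every helper below is a toy stand-in, not Meta's released code.

import random

def generate_contrasting_outputs(prompt):
    """Toy stand-in: return a (better, deliberately weaker) response pair."""
    return f"good answer to {prompt}", f"weak answer to {prompt}"

def judge(model, prompt, first, second):
    """Toy stand-in for an LLM judge: return a reasoning trace and a verdict."""
    trace = f"{model}: '{first}' addresses '{prompt}' more completely than '{second}'."
    return trace, "first" if random.random() > 0.2 else "second"

def finetune(model, preference_data):
    """Toy stand-in: pretend to fine-tune the judge on its own correct traces."""
    return f"{model}+{len(preference_data)}ex"

def self_taught_evaluator(judge_model, prompts, iterations=3):
    for _ in range(iterations):
        preference_data = []
        for prompt in prompts:
            better, worse = generate_contrasting_outputs(prompt)   # synthetic pair
            trace, verdict = judge(judge_model, prompt, better, worse)
            if verdict == "first":                                 # keep only judgments that pick the known-better response
                preference_data.append((prompt, better, worse, trace))
        judge_model = finetune(judge_model, preference_data)       # iterate on the judge itself
    return judge_model

print(self_taught_evaluator("judge-v0", ["What is RLHF?", "Explain RAG."]))
```

The key point is that the only supervision signal comes from pairs where one response is known to be deliberately weaker, so no human annotation enters the loop.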
According to Meta, the resulting model is powerful and outperforms models that rely on human-labelled data, such as GPT-4.
Recently, Kyutai, a French non-profit AI research laboratory, launched Moshi, a native multimodal foundation model capable of conversing with humans in real time, much like what OpenAI’s Advanced Voice Mode was intended to do.
Hume AI introduced EVI 2, a new foundational voice-to-voice AI model that promises to enhance human-like interactions. Available in beta, EVI 2 can engage in rapid, fluent conversations with users, interpreting tone and adapting its responses accordingly. The model supports a variety of personalities, accents, and speaking styles and includes multilingual capabilities.
Meanwhile, Amazon Alexa is partnering with Anthropic to improve its conversational abilities, making interactions more natural and human-like.
Best of All Worlds?
Spirit LM, Meta’s open-source multimodal AI model, integrates text and speech capabilities, directly competing with OpenAI’s GPT-4o. Both models represent advancements in the field, particularly in how they handle multimodal inputs and outputs.
Meta’s model uses phonetic, pitch, and tone tokens to improve the expressiveness of its speech outputs. This allows it to capture emotional nuances such as anger, joy, or surprise, and is designed to produce more expressive and natural-sounding speech than traditional AI voice systems, which often sound robotic and lack emotional depth.
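To picture how those extra tokens carry expressiveness, consider the toy stream below. The vocabulary is invented for illustration and is not Spirit LM’s actual token set, which is defined in Meta’s paper.

```python
# Toy illustration of an "expressive" speech token stream: phonetic units
# interleaved with pitch and style tokens. Vocabulary invented for clarity.

expressive_stream = [
    "[STYLE:excited]",                                # style token capturing overall affect
    "[PITCH:high]", "<ph_k>", "<ph_ae>", "<ph_t>",    # phonetic units for "cat"
    "[PITCH:rising]", "<ph_s>", "<ph_ae>", "<ph_t>",  # phonetic units for "sat"
]

# A plain stream keeps only the phonetic units and drops the pitch/style tokens,
# which is roughly why text-only pipelines tend to sound flat.
plain_stream = [tok for tok in expressive_stream if tok.startswith("<ph_")]

print(expressive_stream)
print(plain_stream)
```

Discarding the pitch and style tokens, as a cascaded speech-to-text pipeline effectively does, is what produces the robotic delivery described above.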
Meanwhile, OpenAI’s GPT-4o was introduced as a multimodal language foundation model that is faster, cheaper, and more powerful than its predecessors. It aims to streamline the interaction between users and AI by processing audio inputs natively without the need for multiple models.
It also eliminates the need for separate models for ASR and TTS, resulting in faster response times (as low as 232 ms) and improved accuracy in processing audio input. It is designed for applications prioritising audio interactions and is available to developers via OpenAI’s API, enabling the creation of new audio-first apps.
In a post on X, Meta explained how the Spirit LM models could overcome the limitations of such pipelines on both the input and output side, generating more natural-sounding speech while learning new tasks across ASR, TTS, and speech classification.
Today we released Meta Spirit LM — our first open source multimodal language model that freely mixes text and speech.
Many existing AI voice experiences today use ASR techniques to process speech before synthesizing with an LLM to generate text — but these approaches… pic.twitter.com/gMpTQVq0nE
— AI at Meta (@AIatMeta), October 18, 2024
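The contrast Meta is drawing is between a cascaded ASR-LLM-TTS pipeline and a single speech-native model. The sketch below is a generic illustration of that difference; every function in it is a stand-in, not a real API from Meta or OpenAI.

```python
# Toy contrast between a cascaded voice pipeline and a speech-native model.
# All functions are stand-ins that only mark where information flows.

def asr(audio):            return "transcribed text"                 # tone and emotion are lost here
def llm(text):             return f"reply to: {text}"                # the LLM only ever sees flat text
def tts(text):             return f"synthesised audio for '{text}'"  # expressiveness must be re-invented
def speech_native(audio):  return f"expressive audio reply to {audio}"

def cascaded_assistant(audio_in):
    """ASR -> LLM -> TTS: each hop drops or re-invents expressive detail."""
    return tts(llm(asr(audio_in)))

def native_assistant(audio_in):
    """One model consumes and produces speech tokens directly."""
    return speech_native(audio_in)

print(cascaded_assistant("user_audio_clip"))
print(native_assistant("user_audio_clip"))
```

In the cascaded version, tone is discarded at the ASR step and has to be reconstructed at the TTS step, which is the limitation both Spirit LM and GPT-4o’s native audio handling aim to remove.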
Meanwhile, Claude 3.5 Sonnet, developed by Anthropic, excels in complex reasoning tasks and natural language processing and operates at twice the speed of its predecessor, Claude 3 Opus, enhancing its efficiency for real-time applications and multi-step workflows.
Users report that it consistently delivers high-quality, nearly bug-free code on the first try, outperforming models like GPT-4 in programming tasks and making it a top choice for developers seeking dependable coding support.
| Feature | Spirit LM | GPT-4o |
| --- | --- | --- |
| Release Date | October 18, 2024 | May 13, 2024 |
| Model Type | Open-source multimodal | Proprietary multimodal |
| Speech Capabilities | Phonetic, pitch, and tone tokens | Native processing of audio inputs |
| Response Time | Not specified | As low as 232 milliseconds |
| Emotional Expression | Yes (anger, joy, surprise) | Yes (observes tone and background noise) |
| Use Cases | Virtual assistants, customer service bots | Audio-first applications |
Both Spirit LM and GPT-4o represent strides in the development of multimodal AI systems. While Spirit LM focuses on open-source accessibility and expressive speech generation, GPT-4o aims for efficiency and speed in processing multimodal interactions.
So, Which is Better?
Claude 3.5 Sonnet, Spirit LM, and NotebookLM each bring unique strengths tailored to specific user needs. Claude 3.5 Sonnet stands out for its advanced reasoning and coding proficiency, with a fast response time and an interactive “Artifacts” feature that enhances collaboration.
Spirit LM offers a user-friendly interface and tight integration of speech and text, making it well suited to non-technical users who want easy speech and text interaction. Meanwhile, NotebookLM uses retrieval-augmented generation (RAG) to pull relevant information from a user’s documents, boosting accuracy for document-heavy tasks, albeit with slightly slower performance.
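As a rough idea of what that retrieval step involves, the sketch below picks the most relevant snippet before building a prompt. It is a generic illustration of the RAG pattern, not NotebookLM’s implementation.

```python
# Generic RAG sketch: retrieve the most relevant document chunk, then prompt.
# An illustration of the pattern only, not how NotebookLM is actually built.

def retrieve(query, chunks):
    """Score chunks by naive word overlap and return the best match."""
    def overlap(chunk):
        return len(set(query.lower().split()) & set(chunk.lower().split()))
    return max(chunks, key=overlap)

def answer(query, chunks):
    context = retrieve(query, chunks)
    # In a real system this prompt would go to an LLM; here we just return it.
    return f"Answer '{query}' using only this source: {context}"

docs = [
    "Spirit LM mixes text and speech tokens in one model.",
    "NotebookLM grounds its answers in documents a user uploads.",
]
print(answer("How does NotebookLM ground answers?", docs))
```

Grounding the prompt in a retrieved source is what keeps answers tied to the user’s documents rather than the model’s general training data.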
While each model has minor drawbacks—like limitations in expressiveness or speed—they are all versatile and powerful, making them well-suited to a variety of applications.