Meta has unveiled Meta Spirit LM, an open-source multimodal language model focused on the seamless integration of speech and text.
The new model improves on existing voice pipelines, which typically rely on automatic speech recognition (ASR) to transcribe speech, a large language model (LLM) to generate a text response, and text-to-speech (TTS) to convert it back to audio. Such cascaded methods often lose the expressive qualities of the original speech.
Meta Spirit LM employs a word-level interleaving method during training, utilising both speech and text datasets to facilitate cross-modality generation.
The model comes in two versions: Spirit LM Base, which utilises phonetic tokens for speech modelling, and Spirit LM Expressive, which adds pitch and style tokens to convey tone, capturing emotions such as excitement or anger.
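To make the idea concrete, here is a minimal Python sketch of what word-level interleaving could look like: a single training sequence alternates between text and speech tokens at word boundaries, with the Expressive variant adding pitch and style units. The token names ([TEXT], [SPEECH], <hubert_*>, <pitch_*>, <style_*>) and the helper functions are hypothetical stand-ins for illustration, not Meta's actual tokenisers or implementation.

```python
# Hedged sketch of word-level interleaving; token names and helpers are hypothetical.

TEXT_TOKEN = "[TEXT]"
SPEECH_TOKEN = "[SPEECH]"

def speech_tokens(word, expressive=False):
    """Stand-in for a speech tokeniser: a phonetic unit per word, plus
    pitch/style units when modelling the Expressive variant."""
    units = [f"<hubert_{hash(word) % 500}>"]           # hypothetical phonetic unit
    if expressive:
        units += ["<pitch_12>", "<style_excited>"]     # hypothetical expressive units
    return units

def interleave(words, speech_spans, expressive=False):
    """Build one training sequence that switches modality at word boundaries.
    `speech_spans` is the set of word indices rendered as speech tokens."""
    seq, prev_modality = [], None
    for i, word in enumerate(words):
        modality = SPEECH_TOKEN if i in speech_spans else TEXT_TOKEN
        if modality != prev_modality:                  # mark only when the modality changes
            seq.append(modality)
            prev_modality = modality
        seq += speech_tokens(word, expressive) if modality == SPEECH_TOKEN else [word]
    return seq

# Example: the middle word is rendered as speech, the rest as text.
print(interleave(["the", "cat", "sat"], speech_spans={1}, expressive=True))
```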
The model lets users generate more natural-sounding speech and can learn new tasks across modalities, including ASR, TTS, and speech classification. Meta hopes the release will spur further work on speech and text integration within the research community.
Meta is not Alone
Similar to Spirit LM, Google recently launched NotebookLM, which can convert any text into a podcast. With this feature, users can input a link, article, or document, and the AI assistant generates a podcast featuring two AI commentators engaged in a lively discussion on the topic. They summarise the material, draw connections between subjects, and engage in banter.
NotebookLM is powered by Google’s Gemini 1.5 model for AI-driven content generation and voice models for lifelike audio outputs. It is supported by a custom-built tool called Content Studio, which provides editorial control.
OpenAI recently launched its Advanced Voice Mode on ChatGPT, and since then, people have been experimenting with it. Deedy Das from Menlo Ventures used it for a dramatic reenactment of a scene in Hindi from the Bollywood movie Dangal. Another user posted a video on X in which ChatGPT sang a duet with him. The possibilities with the voice feature of ChatGPT are endless.
Recently, Kyutai, a French non-profit AI research laboratory, launched Moshi, a real-time native multimodal foundation model capable of conversing with humans in real time, much like what OpenAI's Advanced Voice Mode was intended to do.
Hume AI introduced EVI 2, a new foundational voice-to-voice AI model that promises to enhance human-like interactions. Available in beta, EVI 2 can engage in rapid, fluent conversations with users, interpreting tone and adapting its responses accordingly. The model supports a variety of personalities, accents, and speaking styles and includes multilingual capabilities.
Meanwhile, Amazon Alexa is partnering with Anthropic to improve its conversational abilities, making interactions more natural and human-like.