Voice remains one of the most crucial modalities for AI-driven solutions across urban India and rural Bharat. The problem, though, is the lack of sufficient data for all the Indian languages and their hundreds of dialects, often called the missing link for Indian language chatbots.
However, several organisations and companies are looking to solve this problem.
AI4Bharat, the research group based at IIT Madras, recently introduced Indic Parler-TTS, an open-source text-to-speech (TTS) model built for over a billion Indic speakers. The model was released in collaboration with Hugging Face and aims to bring accessibility and high-quality speech to diverse linguistic communities.
It is trained on 1,806 hours of multilingual Indic and English data and currently supports 20 of the 22 scheduled Indian languages, along with English in US, British, and Indian accents. The model is released under a permissive license with unrestricted usage, making it suitable for developers, researchers, and companies.
The model includes 69 unique voices and can render emotions in 10 languages. An interesting feature is that its output can be customised through a plain-text description, controlling attributes such as background noise, pitch, reverberation, speaking rate, and expressivity. The model can also automatically detect the language of the input prompt.
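In practice, this means generating speech comes down to tokenising two strings: the text to speak and a description of how to speak it. Below is a minimal sketch following the standard Parler-TTS usage pattern; the `ai4bharat/indic-parler-tts` Hub ID matches the release, while the prompt and description texts here are purely illustrative.

```python
# Minimal sketch of prompt-controlled TTS with Indic Parler-TTS.
# Assumes `pip install parler-tts soundfile`; prompt/description texts are illustrative.
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
repo = "ai4bharat/indic-parler-tts"  # Hub ID from the release announcement

model = ParlerTTSForConditionalGeneration.from_pretrained(repo).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo)
# The release's examples load a separate tokenizer for the description,
# taken from the text encoder referenced in the model config.
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

# The text to speak; the model detects the language from the prompt itself.
prompt = "Namaste! Welcome to the world of Indic speech technology."
# Voice attributes (pitch, pace, noise, expressivity) are steered by plain text.
description = ("A female speaker delivers a slightly expressive and animated speech "
               "with a moderate speed and pitch. The recording is of very high "
               "quality, with no background noise.")

input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate a waveform conditioned on both the description and the prompt.
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("indic_parler_out.wav", audio, model.config.sampling_rate)
```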
AI4Bharat, IISc, and EkStep Lead the Way
AI4Bharat has introduced several datasets and models, particularly aimed at speech translation. These include the BhasaAnuvaad dataset covering 13 Indian languages and the IndicConformer ASR model for 22 scheduled languages in India.
In September, the organisation launched a series of innovations to advance Indian language technology, including IndicASR for 22 Indian languages and Rasa, India’s first multilingual expressive TTS dataset.
Speech data, of course, is the foundation of all this work. Bhashini, the government-led service for Indic language technology, launched a crowdsourcing initiative called Bhasha Daan in July to collect voice and text data in multiple Indian languages, to which anyone can contribute. In collaboration with Nasscom, it also launched the ‘Be our Sahayogi’ program on National Technology Day to crowdsource multilingual AI problem statements.
In September, IISc’s AI and Robotics Technology Park (ARTPARK) also announced that it is set to open-source 16,000 hours of spontaneous speech data from 80 districts as part of Project Vaani, under the Ministry of Electronics and Information Technology’s flagship AI initiative, Bhashini.
The ambitious project, created in collaboration with Google, aims to curate datasets of 150,000 hours of natural speech and text from approximately one million people across 773 districts in India. The first phase of the project, launched at the end of 2022, is nearing completion.
In its second phase, Project Vaani will target 160 districts, collecting 200 hours of speech data from about 1,000 people per district. So far, voice data in 58 different language variants or dialects has been gathered from 80 districts and will soon be made publicly available.
In a past conversation with AIM, Prasanta Ghosh, assistant professor in the department of electrical engineering at IISc and leader of Project Vaani, said, “We record people with impairments and are building technology that can understand them. Maybe humans can’t, but AI would be able to.” He added that while collecting the data, he realised there was not even a corpus to build technologies for people without impairments.
“And then the first thing that came to my mind was the idea to build a good foundational model for them.” He added that the primary goal is to use this dataset as the training data for speech-to-text AI models, particularly benefiting conversational AI platforms and chatbots requiring diverse voice datasets.
Furthermore, three years ago, EkStep Foundation open-sourced wav2vec2-based models trained on 10,000 hours of speech data in 23 Indic languages. The Vakyansh team at the foundation was one of the first in the country to build automatic speech recognition (ASR) and TTS models.
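For context, running ASR inference with such a wav2vec2 checkpoint is straightforward with the transformers library. Here is a minimal sketch assuming a fine-tuned CTC checkpoint; the Hub ID below is an assumed example, as the actual Vakyansh checkpoints vary by language.

```python
# Minimal sketch of ASR inference with a wav2vec2 CTC checkpoint.
# The Hub ID is an assumed example; actual Vakyansh checkpoints vary by language.
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "Harveenchadha/vakyansh-wav2vec2-hindi-him-4200"  # assumed checkpoint name
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# wav2vec2 models expect 16 kHz mono audio.
speech, rate = sf.read("sample_hindi.wav")
inputs = processor(speech, sampling_rate=rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token per frame, then collapse repeats.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```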
Startups Have Realised the Need
When it comes to startups, Sarvam AI and CoRover.ai have been focused heavily on building speech models. Speaking at Cypher 2024, Sarvam AI chief Vivek Raghavan demoed the speech capabilities of its AI models, leaving everyone at Cypher speechless.
The company also launched a range of products, including voice-based agents accessible via telephone and WhatsApp. These agents can also be integrated into an app, allowing users to interact by voice whenever they choose.
Sarvam also released Bulbul, a multilingual TTS model with six different voices. According to the company’s website, its mission is to enable speech-first applications for India.
Similarly, Ankush Sabharwal, CEO of CoRover.ai and the creator of BharatGPT, told AIM that the company is actively working on voice models that can work on WhatsApp for translation and Q&A, with the goal of making the interaction fully voice-to-voice.
In June, the Indian startup smallest.ai launched its TTS model, AWAAZ. With state-of-the-art mean opinion scores (MOS) in Hindi and Indian English, AWAAZ can speak fluently in over ten accents, reflecting India’s diverse linguistic landscape. Most recently, the company also introduced Lightning, a TTS model capable of generating up to 10 seconds of audio within 100 milliseconds.
“When we started building, we realised that the models required for a voice bot were not mature for Indian languages. Existing models for non-English languages were nowhere close to production,” CEO Sudarshan Kamath told AIM.
Meanwhile, the buzz around voice-based models is only growing. Pragya Misra, lead (public policy and partnerships) for OpenAI in India, told AIM that the company will focus on multimodal AI going forward, and that solving for Indic languages is also on its radar.
With the focus on building voice models for the Bharatiya market, Indian companies are understandably invested in Indic language models. And if the future of interaction is voice, including for the millions of UPI users in the country, speech models are a necessity.