These Indian Researchers Have the Cure to OpenAI’s Indic Trouble

Recently, researchers from the Digital University Kerala (DUK) identified a critical flaw in automatic speech recognition (ASR) models, including OpenAI’s Whisper, when dealing with Indic scripts.

The research team, led by Elizabeth Sherly from the Virtual Resource Centre for Language Computing at DUK, found that ASR models are not evaluated accurately on Indian languages such as Malayalam, Hindi, and Tamil.

The team extensively explored OpenAI’s Whisper for research purposes, but its results were poor for low-resource languages. Even for a widely spoken language like Malayalam, discrepancies were found due to its complex vowel signs and unique script.

In an exclusive interview with AIM, Sherly, who has almost 24 years of experience in the field, revealed that the team was able to achieve 50-55% accuracy for the Malayalam language after conducting multi-step fine-tuning on OpenAI’s Whisper.

Though the paper mentions only OpenAI’s Whisper, the team has also benchmarked other tools, such as Meta’s SeamlessM4T. While Whisper was used to test the Malasar language, the team is currently testing the Malayalam language with Meta’s models.

When asked to acknowledge the errors and about any development in this area, neither of the tech giants responded to the emails sent by AIM.

Sherly’s colleague Kavya Manohar, a computational linguist at DUK, discussed the use of tools like OpenAI’s Whisper and Meta’s models, emphasising the need for quality data and proper normalisation. With a PhD in speech technology, she found that the initial word error rates (10-12%) reported for OpenAI’s Whisper after launch were implausibly low for Indic languages.

“In our investigation, we discovered that Whisper’s accuracy-checking process does not properly account for critical elements in Indian scripts, specifically vowel signs and modifiers like the chandrakkala (virama). For example, if the Malayalam script of ‘Digital University’ (ഡിജിറ്റൽ യൂണിവേഴ്സിറ്റി) loses these components, it becomes ഡ ജ റ റ ൽ യ ണ വ ഴ സ റ റ, leading to a loss in readability,” Leena G Pillai, a research scientist at DUK, wrote in a LinkedIn post.

This was explained in detail in their paper, ‘What is lost in normalisation? Exploring pitfalls in multilingual ASR model evaluations’.
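
The effect is easy to reproduce. Whisper’s basic text normaliser strips characters whose Unicode category begins with “M” (combining marks), which in Indic scripts covers vowel signs and the virama. A minimal Python sketch mimicking that behaviour:

import unicodedata

def strip_marks(text: str) -> str:
    # Drop combining marks (Unicode category M*), as Whisper's basic
    # normaliser does; in Indic scripts these include vowel signs and
    # the virama/chandrakkala.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))

phrase = "ഡിജിറ്റൽ യൂണിവേഴ്സിറ്റി"  # "Digital University" in Malayalam
print(strip_marks(phrase))  # ഡജററൽ യണവഴസററ: vowel signs and virama are gone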

This prompted the team to research better normalisation routines, since they believed that the actual error rates could be much higher, at around 30-40%.
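
To see how this inflates apparent accuracy, one can score a single misrecognised word with and without mark-stripping using the open-source jiwer evaluation library. The word pair below is illustrative, not drawn from the paper’s test set:

import unicodedata
import jiwer  # common open-source WER/CER evaluation library

def strip_marks(text: str) -> str:
    # Same mark-stripping normalisation as in the previous sketch.
    return "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.category(c).startswith("M"))

reference = "ഡിജിറ്റൽ"   # correct word
hypothesis = "ഡിജിററൽ"  # ASR output missing the chandrakkala (virama)

print(jiwer.wer(reference, hypothesis))  # 1.0 -- the word is wrong
print(jiwer.wer(strip_marks(reference), strip_marks(hypothesis)))  # 0.0 -- the error disappears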

How Does Meta Perform?

Kavya Manohar explained that the team found improved outcomes with post-processing and model adjustments. To achieve better results for Malayalam, they have explored adding an external language model on top of Meta’s SeamlessM4T.
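
Manohar did not detail the integration, but a common way to combine an external language model with an ASR system is n-best rescoring (a form of shallow fusion), where the language model re-scores the recogniser’s candidate transcripts. The sketch below is purely illustrative: the toy unigram model, the hypotheses, their scores, and the interpolation weight are all invented for the example, not taken from the team’s setup.

import math
from collections import Counter

# Toy unigram LM over a tiny corpus; a real system would use a large
# n-gram (e.g. KenLM) or neural LM trained on Malayalam text.
corpus = "നല്ല ദിവസം നല്ല വാർത്ത".split()
counts = Counter(corpus)
total = sum(counts.values())
vocab = len(counts)

def lm_logprob(sentence: str) -> float:
    # Add-one smoothed unigram log-probability of the sentence.
    return sum(math.log((counts[w] + 1) / (total + vocab + 1))
               for w in sentence.split())

# Hypothetical n-best list from the ASR model: (transcript, acoustic log-prob).
# The second hypothesis scores higher acoustically but is misspelled.
nbest = [("നല്ല ദിവസം", -3.2), ("നല ദിവസം", -3.0)]

lam = 0.5  # interpolation weight for the LM score
best = max(nbest, key=lambda h: h[1] + lam * lm_logprob(h[0]))
print(best[0])  # the LM pulls the correctly spelled transcript ahead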

She also touched upon the use of Indian languages in pre-training global models. “If the quantity of pre-training data and quality of Indian languages is higher, we can expect better accuracy,” she said.

The team has not received a response from OpenAI yet, but they stress that while a response would add to the research, the work was primarily done to create general awareness and deeper knowledge among Indian developers.

Meanwhile, Sunil Abraham, the public policy director at Meta, spoke at the Bangalore Tech Summit and said that the magic of a self-supervised learning paradigm is that the developer need not understand the language.

For example, a Kannadiga developer using standard methods can make a reasonably performing Santali language model as long as the developer has access to small corpora.

Sherly also expressed that she looks forward to creating “a language model where such discriminators can be addressed” and expects that Indian AI companies will be better able to incorporate these developments.

What Happens With Data Scarcity?

Speech is an integral part of communication, and automatic speech recognition (ASR) systems have improved significantly since the introduction of deep learning. Sherly highlights that the key issue for underrepresented languages is not just the lack of datasets, but also the quality and specificity of the data, and annotating it correctly.

“How can we come up with a good result if we don’t have a large volume of data?” she said.

These are key issues that multistage fine-tuning can address by refining models through focused, context-specific training. The process starts with collecting, cleaning, and preparing the data.

“There are a number of normalisation techniques available, and with respect to the data, we need to figure out what kind of cleaning to do,” she said. This involves fixing common issues like misplaced punctuation or mismatched symbols, which are frequent in raw data.
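
As an illustration of this kind of cleaning, the snippet below applies a few generic rules; the fixes in a real pipeline would be language- and corpus-specific, and these rules are examples rather than the team’s actual routine.

import re
import unicodedata

def clean_transcript(text: str) -> str:
    # Generic, illustrative cleaning rules for raw transcript text.
    text = unicodedata.normalize("NFC", text)        # one canonical Unicode form
    text = re.sub(r"\s+([,.!?])", r"\1", text)       # fix misplaced punctuation
    text = text.replace("“", '"').replace("”", '"')  # unify mismatched quote symbols
    text = re.sub(r"\s{2,}", " ", text).strip()      # collapse stray whitespace
    return text

print(clean_transcript("നല്ല  ദിവസം ,  “ശരി”"))  # -> നല്ല ദിവസം, "ശരി"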

The challenge is even greater for languages like Malasar because they don’t have a written script—they’re only spoken. The research team tackled this by transcribing the language into Tamil, as their sounds are closely aligned.

Data collection was no easy task either. With fewer than 10,000 speakers globally, the team relied on a community of Malasar speakers who voluntarily contributed audio recordings. The team also had funding to carry out this research effort.

Recently, AI4Bharat introduced BhasaAnuvaad, a speech translation dataset covering 13 Indian languages, alongside the IndicConformer ASR model for the 22 scheduled languages of India. It has also released what it describes as India’s first multilingual expressive TTS dataset for Indian languages.

Other initiatives include Bhasha Daan by Bhashini, and Project Vaani by IISc and the AI and Robotics Technology Park (ARTPARK), which is set to open-source 16,000 hours of spontaneous speech data from 80 districts.

The Art of Multistage Model Training

Sherly’s research was driven by her early experiences with language technology, particularly her work on machine translation systems in India. Early statistical machine translation methods faced significant accuracy challenges, even as they improved.

As deep learning techniques became more prominent, Sherly’s team adopted neural machine translation (NMT), which yielded better results but still faced limitations due to data scarcity. This led to her exploration of multistage fine-tuning, where models are trained and adjusted with specialised datasets to improve accuracy.

In contrast to a single training phase, this approach allows for the gradual enhancement of models using smaller, more targeted datasets. This is especially crucial when working with low-resource languages.
These languages include not only widely spoken ones like Malayalam and Tamil but also regional ones like Poula (an Angami-Pochuri language spoken in parts of Nagaland and Manipur), Malasar (a southern Dravidian language spoken by tribes of the Western Ghats) and Santali, a language spoken in parts of eastern India.
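
In outline, the staged approach can be expressed with the Hugging Face Transformers API roughly as follows. This is a conceptual sketch, not the team’s code: load_stage_dataset is a hypothetical helper standing in for data preparation, and the corpora, model size, and hyperparameters are illustrative.

from transformers import (WhisperForConditionalGeneration,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

stages = [
    ("related", "tamil_corpus", 1e-5),   # stage 1: larger related-language data
    ("target", "malasar_corpus", 5e-6),  # stage 2: small target-language data
]

for name, corpus, lr in stages:
    train_set = load_stage_dataset(corpus)  # hypothetical data-loading helper
    args = Seq2SeqTrainingArguments(
        output_dir=f"whisper-ft-{name}",
        per_device_train_batch_size=8,
        learning_rate=lr,  # lower learning rate in later stages to avoid forgetting
        num_train_epochs=3,
    )
    Seq2SeqTrainer(model=model, args=args, train_dataset=train_set).train()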
