As part of several initiatives that Google has taken up in India to improve Indic LLM capabilities, Google vice president Ambarish Kenghe announced the launch of IndicGenBench.
IndicGenBench, a benchmark for evaluating the generative capabilities of Indic LLMs, is part of a slew of updates released during Google I/O Bengaluru 2024. Kenghe said the benchmark covers 29 languages, including several Indian languages that currently have no benchmarks.
Additionally, Kenghe announced the open sourcing of DeepMind’s Composition to Augment Language Models (CALM), allowing developers to combine specialised language models with Google’s Gemma models. Notably, the research on CALM was carried out by the Google DeepMind and Google Research teams in India, with the paper released earlier this year.
“Let’s say you’re building a coding assistant that can converse in English. Now, by composing a Kannada specialist model with CALM, you may be able to offer coding assistance to Kannada users as well,” explained Kenghe.
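The composition Kenghe describes follows the CALM paper's core idea: both models stay frozen, and a small set of newly trained cross-attention layers lets the anchor model's representations attend over the specialist model's representations. The sketch below is a simplified, illustrative toy of that mechanism only; the array sizes, weight names (`W_q`, `W_k`, `W_v`), and the `cross_attend` function are assumptions for illustration, not CALM's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy layer outputs (seq_len x dim) standing in for two frozen models:
anchor_h = rng.normal(size=(5, 16))   # e.g. the coding assistant's hidden states
augment_h = rng.normal(size=(5, 16))  # e.g. the Kannada specialist's hidden states

# Learned projections -- in CALM-style composition, these new cross-attention
# parameters are the only weights trained; both base models remain frozen.
W_q = rng.normal(size=(16, 16)) * 0.1
W_k = rng.normal(size=(16, 16)) * 0.1
W_v = rng.normal(size=(16, 16)) * 0.1

def cross_attend(anchor, augment):
    """Anchor tokens attend over the augmenting model's representations."""
    q = anchor @ W_q
    k = augment @ W_k
    v = augment @ W_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Residual connection: the composed representation feeds the anchor's
    # next layer, enriching it with the specialist's knowledge.
    return anchor + attn @ v

composed = cross_attend(anchor_h, augment_h)
print(composed.shape)  # (5, 16)
```

Because only the bridging layers are trained, neither base model's weights are modified, which is what makes it practical to bolt a language specialist onto an existing assistant.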
This focus on Indic language LLMs comes as DeepMind expands Project Vaani, a collaborative effort between Google and the Indian Institute of Science (IISc), under which over 14,000 hours of speech data in 58 languages have been made accessible to developers. The data was collected from over 80,000 speakers across 80 districts in the country.
As previously covered by AIM, this data is being open-sourced as part of MeitY’s flagship AI initiative, Bhashini. These capabilities are set to expand further, as Bhashini has also launched Bhasha Daan, an initiative to crowdsource voice and text data in multiple Indian languages.