The Dire Need for an Indic LLM Leaderboard

A few months ago, AIM pointed out the dire need to create benchmarks for Indian languages, since most of the well-known ones, such as MMLU and HumanEval, do not necessarily include much data in Indian languages. Now that Indic LLMs are coming into the picture, it is high time India got its own LLM benchmark for its models.

In his talk at MLDS 2024, Tamil Llama creator Abhinand Balachandran also highlighted the importance of creating benchmarks specifically tailored for evaluating Indic language models.

Similarly, in an exclusive interview with AIM, Shantipriya Parida, the creator of Odia Llama, revealed that he was planning to create a benchmark for Indic language models. “We will build an LLM benchmark. You go, choose your model, and it will automatically tell you your model’s accuracy per task. It can be a fair comparison for anybody who wants to pick a model for research or any other purpose,” he said.
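To make the idea of "accuracy per task" concrete, here is a minimal Python sketch of how such a comparison might work. It is an illustration only, not Parida's planned benchmark: the function names, model names, task names, and sample predictions are all hypothetical.

```python
from collections import defaultdict


def accuracy_per_task(records):
    """Return {task: accuracy} from (task, prediction, gold_answer) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, gold in records:
        total[task] += 1
        if prediction == gold:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def best_model_per_task(results):
    """Return {task: model_name} given {model_name: {task: accuracy}}."""
    best = {}
    for model, scores in results.items():
        for task, acc in scores.items():
            # Keep the first model seen unless a later one scores strictly higher.
            if task not in best or acc > results[best[task]][task]:
                best[task] = model
    return best


# Hypothetical predictions from two made-up models on two made-up tasks.
records_a = [
    ("sentiment", "pos", "pos"),
    ("sentiment", "neg", "pos"),
    ("qa", "B", "B"),
    ("qa", "C", "C"),
]
records_b = [
    ("sentiment", "pos", "pos"),
    ("sentiment", "pos", "pos"),
    ("qa", "A", "B"),
    ("qa", "C", "C"),
]

results = {
    "model_a": accuracy_per_task(records_a),
    "model_b": accuracy_per_task(records_b),
}
leaders = best_model_per_task(results)
```

With per-task scores laid out this way, someone choosing a model can look up the leader for the task they care about instead of relying on a single overall number.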

What about the current benchmarks?

Essentially, creating a good benchmark requires large amounts of quality data. Indic data is still scarce, which is also hindering the training of AI models. “LLMs require very large amounts of high-quality data. For many Indian languages, we do not have this right now,” Pratyush Kumar, co-founder at Sarvam AI and AI4Bharat, told AIM.

Though Indic models do not currently have a benchmark such as the Hugging Face Open LLM Leaderboard, several evaluation datasets are available on which creators can test their models.

AI4Bharat, for instance, has created the IndicSentiment dataset, which was used to evaluate Airavata, the recent Indic language model. The dataset has over 1,000 downloads on Hugging Face, which shows that a lot of innovation is indeed happening in the Indic landscape, but we still need a standard metric for all.

In August last year, AI4Bharat, along with IIT Madras, IIT Kharagpur, and Microsoft India, published a paper titled Vistaar, which presented a benchmark and training set for automatic speech recognition (ASR) in Indian languages. Though this was exclusively for speech and voice models, it also had a language dataset for around 12 Indic languages.

Interestingly, a lesser-known benchmark on Hugging Face, titled IndicBenchmarkData and created by Sambit Sekhar, includes Indic benchmark datasets for Gujarati, Bengali, Telugu, Tamil, and several other Indic languages.

A leaderboard is all we need

It all boils down to the simple problem of data. A majority of the benchmarks on Hugging Face include datasets based on exams like SAT, LSAT, and US History. India needs to do the same for Indic languages and include datasets that cover its most important exams, such as UPSC or JEE.

The way India ultimately uses these LLMs may differ completely from the rest of the world. To evaluate Indian models on various tasks, we need to create benchmarks based on regional or vernacular datasets. These benchmarks would also help evaluate models like GPT-4 and Llama 2 on Indic tasks and compare them with models originating in India.

With models such as BharatGPT and Ola’s Krutrim, which their makers claim are being built from scratch, it becomes necessary for researchers to come up with solid and trustworthy evaluation benchmarks for these languages. This would eventually lead to the creation of a leaderboard and allow people to decide which LLM to choose for which task.

Several of the current models in different languages, such as Tamil, Telugu, Odia, and Malayalam, are built on top of the open-source Llama 2 and fine-tuned on only a small number of language tokens. As these models scale, and we eventually shift to building Indic models from scratch, it will become essential to benchmark these models on Indian metrics.

However, creating a benchmark is easier said than done. It requires a lot of computational infrastructure, mainly GPUs, to run those models against the datasets and measure their efficacy on different metrics. Since we are collecting the dataset, and NVIDIA is definitely giving us GPUs, an Indic LLM leaderboard could be just on the horizon.

The post The Dire Need for an Indic LLM Leaderboard appeared first on Analytics India Magazine.
