Meet the Creator of ଓଡ଼ିଆ Llama 

When ChatGPT was introduced a year ago, Shantipriya Parida — the creator of Odia Llama — was quite disappointed that it did not understand any cultural context related to Odisha.

Cut to the present: he has built Odia Llama (a Llama 2 model fine-tuned for the Odia language) and started an open-source project called Odia Generative AI. Odia is spoken by over 35 million people and has a rich literary tradition and a distinct identity.

“I attempted to ask ChatGPT about local contexts. For example, I asked, ‘Can you tell me the recipe for rasgulla?’ It couldn’t provide an answer. It missed all the local context, and the answers weren’t even correct,” said Parida.

Parida currently works in Finland as a senior AI scientist at Silo.ai, which recently released its own LLM, Poro. “If Europeans can build an LLM that can provide better answers than OpenAI’s ChatGPT in their local language, I thought, why not for our own Indic languages,” said Parida. “That’s the reason we are working hard, and it feels good when people appreciate our effort,” he added.

Journey of Odia Llama

To build Odia Llama, Parida formulated a three-step plan. “We’ll begin with a fine-tuned model, then move on to a pre-trained model, and finally, proceed to app deployment. Once we have the fine-tuned and pre-trained models ready, app development will be relatively straightforward,” he explained.

The Odia Llama currently available on Hugging Face is the fine-tuned version. Parida noted that his team of researchers is now working on the pretrained version.

“All the fine-tuned models have some limitations. In the foundation model, if you have only 0.5% or 2% of data in your local language, then, after a certain point, no matter how much you fine-tune it, it will get stuck,” he said.

To train a pretrained model, Parida’s team is currently collecting data. “We are collecting a lot of tokens. I think we have already collected around 30 million tokens, and we are targeting at least 40 to 50 million tokens. This way, we can expedite the release of our first pretrained model,” said Parida.

The data is sourced from various online platforms, including blogs, Wikipedia, Odia newspapers, local textbooks, literature, magazines, and government websites. Parida said the team has also developed two in-house tools, Olive Scraper and Olive Farm. Olive Scraper is a web scraping tool for extracting Odia content from various sources (e.g., websites, PDF, DOC, etc.), while Olive Farm generates LLM instruction sets in Indic languages. Olive Farm presently supports Hindi and Odia and is designed to scale to additional languages.
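One step any Odia scraping pipeline needs is separating Odia text from surrounding navigation and English boilerplate. The sketch below illustrates that idea only; it is not Olive Scraper's code, and the function names and the 0.5 threshold are assumptions made for this example. It relies on the fact that Odia characters sit in the Unicode block U+0B00 to U+0B7F.

```python
import re

# Odia (Oriya) script occupies the Unicode block U+0B00..U+0B7F.
ODIA_CHAR = re.compile(r"[\u0B00-\u0B7F]")


def odia_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Odia block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    odia = sum(1 for c in chars if ODIA_CHAR.match(c))
    return odia / len(chars)


def keep_odia_lines(raw: str, threshold: float = 0.5) -> list[str]:
    """Keep non-empty lines that are mostly Odia (threshold is an assumed default)."""
    lines = (line.strip() for line in raw.splitlines())
    return [line for line in lines if line and odia_ratio(line) >= threshold]


# Hypothetical scraped text: two Odia lines and one line of English page residue.
sample = "ଓଡ଼ିଆ ଭାଷା\nSubscribe to our newsletter\nଓଡ଼ିଶା ର ରାଜଧାନୀ ଭୁବନେଶ୍ୱର"
odia_only = keep_odia_lines(sample)
print(len(odia_only))
```

A character-ratio filter like this is cheap and language-model-free, but it cannot tell good Odia prose from spam, so a real pipeline would likely add deduplication and quality checks on top.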

For compute and infrastructure, Parida said the team received support from E2E Networks. For fine-tuning, he added, resources are ample, as the team consists of independent researchers who have access to GPUs that they use for research.

AI Tutor

Odia Generative AI recently created an AI tutor named Acharya. It facilitates self-learning in Hindi for students, offering real-time doubt resolution, and was developed using a fine-tuned Mistral-7B Hindi LLM and Retrieval Augmented Generation (RAG).

“For example, if you’re a tutor and you want to create a lesson plan on a specific topic, then this can also assist you in preparing a comprehensive lesson plan. Similarly, if you want to evaluate, for instance, create a set of questions and assess them, it can also be helpful. So, it has multiple use cases,” said Parida.

Acharya operates as a client-server web application, with the client built in JavaScript and the server in Python with FastAPI. Initial assessment scores indicate BERT (F1): 0.72 and RAGAS Answer Relevancy: 0.72.
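For readers unfamiliar with RAG, the retrieval half can be sketched in a few lines: rank stored passages against the student's question, then place the best matches in the prompt sent to the LLM. The code below is a toy illustration, not Acharya's implementation. The passages, function names, and prompt wording are invented for this example, and word-overlap cosine similarity stands in for the dense embeddings a production system would more likely use.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: cosine(q, tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str], k: int = 2) -> str:
    """Assemble the retrieved context and the question into an LLM prompt."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical textbook snippets (English is used here only for readability).
passages = [
    "Photosynthesis is the process by which green plants make food using sunlight.",
    "Friction is a force that opposes motion between two surfaces in contact.",
    "The Indian Parliament consists of the Lok Sabha and the Rajya Sabha.",
]
question = "What is friction?"
top = retrieve(question, passages, k=1)
prompt = build_prompt(question, passages, k=1)
print(prompt)
```

The resulting prompt would then be passed to the language model; grounding the answer in retrieved textbook text is what lets a tutor stay close to the syllabus.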

The current demonstration version caters to class 8 subjects of the CBSE board. It’s worth noting that Acharya will soon support various languages, cover a wide range of subjects, and be freely accessible.

What’s Next?

Given that several Indic LLMs are now on the market, such as OpenHathi, Airavat, Krutrim and BharatGPT, Parida wants to create an Indic LLM benchmark next.

“We are planning to build an LLM benchmark. You go, choose your model, and per task, it will automatically tell you your model’s accuracy for that particular task. It can be a fair comparison for anybody who wants to pick any model for research or any purpose they want to use,” said Parida.
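The per-task scoring Parida describes can be pictured as grouping a model's outputs by task and computing accuracy within each group. The sketch below is an assumption about how such a report might be computed, not the planned benchmark itself. The task names, records, and exact-match scoring rule are all hypothetical.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute exact-match accuracy per task.

    records: iterable of (task, prediction, gold_answer) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, gold in records:
        total[task] += 1
        correct[task] += int(prediction.strip() == gold.strip())
    return {task: correct[task] / total[task] for task in total}


# Hypothetical model outputs for two tasks.
records = [
    ("sentiment", "positive", "positive"),
    ("sentiment", "negative", "positive"),
    ("qa", "Bhubaneswar", "Bhubaneswar"),
    ("qa", "Cuttack ", "Cuttack"),
]
results = per_task_accuracy(records)
print(results)
```

Exact match suits classification and short-answer tasks; generative tasks such as summarisation would need a different metric per task.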

Along the same lines as the AI tutor, Parida wants to build more AI apps focused on government budgets and policies, which would make it easier for local citizens to get information in their native languages.

“As a citizen, many times one wants to know about government policies and what the government is trying to do in your area. But you don’t know exactly whom to ask. Nowadays, information is available in the public domain, so it’s easy to build an AI app using an open-source model and using RAG,” said Parida.

“We are not a company, and we are building without the intention of selling anything. We started with one objective, to ensure that our Odia language does not lag behind. So, whatever we are building is solely for the benefit of the people,” concluded Parida.

The post Meet the Creator of ଓଡ଼ିଆ Llama appeared first on Analytics India Magazine.
