When Sam Altman, CEO of OpenAI, said during his visit to India that it’s ‘pretty hopeless’ to try to develop something like ChatGPT at a relatively low cost of around USD 10 million, it irked many of us. However, Ozonetel, a Hyderabad-based telephony company involved with the Swecha Movement, took this as a challenge and developed a 7-billion-parameter Telugu small language model (SLM) called ‘AI Chandamama Kathalu’ at a significantly lower cost.
“Upon Sam Altman expressing the challenge of building a ChatGPT clone, we were intrigued by the possibility. Viewing it as an opportunity, we decided to embark on this challenge and successfully crafted a contextualised model, beginning with the Telugu language,” Chaitanya Chokkareddy, co-founder and chief technology officer at Ozonetel Communications, told AIM.
Chokkareddy said he has been involved with the Swecha Movement, a non-profit free software foundation based in Andhra Pradesh, for over 15 years. “Hence, we have connections to colleges and educational institutions and we can get access to more than 10,000 students at a moment’s notice.”
“We did a datathon; volunteers at Swecha collaborated with nearly 25-30 colleges, and over 10,000 students were involved in the translation, correction and digitisation of 40,000-45,000 pages of Telugu folk tales. My R&D team and I at Ozonetel supported them with the graphics processing units (GPUs) to train the model,” Chokkareddy said.

(AI Chandamama Kathalu launch, Credit: Twitter handle @cinevinodam)
An open-source Telugu SLM
The team at Ozonetel first tried building the model by fine-tuning open-source models like Meta’s Llama and Mistral. However, building on top of other models posed challenges, given the time constraint of releasing the model in January.
The team also tried fine-tuning Google’s mT5 open-source model, but finally settled on building a model similar to GPT-2 from scratch. Training it on a cluster of NVIDIA A100 GPUs took nearly a week.
Interestingly, the estimated GPU cost of training the Telugu SLM was around USD 1,000, excluding the labour of Chokkareddy’s team and the datathon participants. According to Chokkareddy, the model works really well, which motivated the team to dream bigger.
Now, the team aims to release more SLMs for other domains, such as education and coding, and plans to conduct five to six more datathons.
“Next, we want to build an SLM for coding in Telugu. We already have the instruction set for Python, and we are going to translate that,” Chokkareddy continued. “We are also going to train a model on textbooks for students.”
IIT Hyderabad, IIIT Hyderabad and around 20 other engineering colleges based in the city are expected to be involved in the project.
Once the team has developed around 10 different SLMs, the strategy is to combine the models and construct an LLM on top of them, an approach reportedly also used by OpenAI while developing GPT-4.
While the team is currently focused on Telugu, it may expand to other Indic languages such as Tamil and Hindi.

(Datathon by Swecha, Credit: Twitter handle @mr_rusher_143)
Born out of the Telugu open-source movement
The Swecha movement is part of the Free Software Movement of India (FSMI) and advocates for free software in the Indian states of Telangana and Andhra Pradesh. The Swecha team, a community of developers, also created a Telugu operating system based on Ubuntu in 2005.
Chokkareddy said this very philosophy led them to open source everything about their model, including the dataset. “When you look at popular open-source models like Llama or Mistral, they did not disclose anything about their datasets. However, we have disclosed everything about the Telugu SLM, including the datasets, so others can build upon it.”
Moreover, the ongoing debate between open-source and closed-source AI is a significant motivating factor, given that they are steadfast advocates of open-source principles. In fact, given the team’s commitment to openness, Swecha is deliberating the creation of a distinct licence governing the reusability of the datasets used to train AI Chandamama Kathalu.
They hold the belief that applying software licences to datasets may not be an effective approach. “We need to consider carefully how we define these licences, which we are actively addressing in collaboration with Swecha and the Software Freedom Law Centre (SFLC).
“For instance, if we release a dataset and someone else utilises it but keeps it closed, it hinders innovation in this space. Therefore, it is crucial to establish proper licensing, akin to the general public license (GPL), to ensure openness and collaboration,” Chokkareddy said.
The rise of Indic LLMs
The motive for building the Telugu SLM, according to Chokkareddy, was to show that what Sam Altman deemed impossible could actually be done. “We did it just to show that: hey look, we can collect the data, we can build the model, and others can also use it.”
While it may not prove Sam Altman entirely wrong, it opens the possibility for more Indic LLMs to come out within the country. So far, we have seen multiple projects like Tech Mahindra’s Project Indus, Ola’s Krutrim, and Sarvam AI, focusing solely on creating Indic LLMs.
Interestingly, Project Indus was also conceived after Altman’s critical comments, as former Tech Mahindra chief CP Gurnani likewise took up the prospect of developing an Indic LLM as a challenge. The IT giant plans to release its model, which has 539 million parameters and was trained on 10 billion tokens of Hindi and its dialects, by April.
Chokkareddy further mentioned that Ozonetel has created a series of Indic LLMs for its enterprise applications, leveraging diverse models such as GPT-4.
The post An Open Source Movement Led to India’s First Telugu SLM appeared first on Analytics India Magazine.