IIT Gandhinagar Channels Rivers Ganga, Yamuna into LLMs

IIT Gandhinagar recently released Ganga-1B, a pre-trained large language model (LLM) for Hindi, as part of its Unity project. According to the team, Ganga-1B was built from scratch on the largest curated Hindi dataset and outperforms all open-source LLMs supporting Hindi of up to 7B parameters.

Releasing the first product of the #Unity project @lingoiitgn , Ganga-1B , a pretrained LLM for Hindi. Created from Scratch using the largest curated Hindi Dataset. Ganga-1B is outperforming all open-source LLMs supporting Hindi with sizes till 7B.

— Mayank (@mayank_iitgn) July 3, 2024

The Unity project is building Indic LLMs and plans to release what it calls the largest-ever curated datasets, along with state-of-the-art (SOTA) models, for other Indian languages. It is part of Lingo, a research group at IIT Gandhinagar that takes on projects and collaborations to advance natural language processing (NLP) and AI.

“We are neither a non-profit organisation nor a for-profit organisation. We are a small research group at IIT Gandhinagar. For the past six months, we have been devoted to this particular project,” said Mayank Singh, assistant professor of computer science & engineering at IIT Gandhinagar, in an exclusive interview with AIM.

Singh claimed that this is the first open-source Indic model from an academic research lab in India and revealed that it was built from scratch for under INR 10 lakh. Coincidentally, this is far less than the roughly $5 million (about INR 4,100 lakh) that Tech Mahindra spent on Project Indus.
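The gap between the two budgets can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the exchange rate is not an independent input but simply the one implied by equating $5 million with INR 4,100 lakh.

```python
# Figures as quoted in the article; 1 lakh = 100,000.
LAKH = 100_000

ganga_cost_inr = 10 * LAKH      # upper bound: "under INR 10 lakh"
indus_cost_inr = 4100 * LAKH    # article's INR figure for Project Indus
indus_cost_usd = 5_000_000      # article's USD figure for Project Indus

# Exchange rate implied by the article's own two figures (not a market quote).
implied_rate = indus_cost_inr / indus_cost_usd

# Since Ganga-1B cost *under* INR 10 lakh, this ratio is a lower bound.
cost_ratio = indus_cost_inr / ganga_cost_inr

print(f"Implied rate: {implied_rate} INR per USD")
print(f"Project Indus cost at least {cost_ratio}x the Ganga-1B budget")
```

On these numbers, the reported Project Indus spend is at least a few hundred times the Ganga-1B budget.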

“The idea was to dedicate this to Ganga – the longest river flowing through the Hindi mainland. Our next set of models will follow a similar nomenclature,” said Singh.

The project was initiated by two MTech students, Hitesh Lodwal and Siddhesh Dosi. “Hitesh was primarily involved in data curation for Hindi, while Siddhesh focused on modeling, architecture, and designing the system,” explained Singh.

He said that the model has been trained on CDAC servers. “We got some funding from several sponsored agencies, and then we paid CDAC to get some dedicated nodes on the CDAC servers.”

Singh added that computing costs are much cheaper for educational institutes compared to those on Azure or Google Cloud. “Ganga was trained on one NVIDIA DGX A100, which has eight NVIDIA A100 Tensor Core GPUs, and we started the training almost six months back,” said Singh.

He said that Lingo has secured additional funds and plans to train the next iterations of the model using Yotta’s infrastructure.

Meanwhile, BharatGPT, a consortium led by Ganesh Ramakrishnan, a professor at IIT Bombay, is also working on building Indic LLMs and is expected to release its models soon.

Not a Wrapper

“We have not built on top of any open-source models like Llama. It’s a simple decoder model,” said Singh, adding that they explored multiple architectures before developing Ganga, which is quite similar to Mistral. “It has a decoder-only architecture with 16 layers and 24 attention heads,” he added.

He added that the benefit of such a small model is its versatility: it can run almost anywhere, even on CPUs, and can be used for inference on edge devices as well.

He further said that they will release open-source LLMs for other languages as well. “We are still curating the data for Hindi. We have already curated clean data for Tamil and Telugu. We will start training, maybe within one month,” said Singh.

Ganga’s tokeniser is also trained from scratch. “If you compare it with other tokenisers like Google’s Gemma, a majority of the tokens come from non-Hindi or non-Indic languages. In our case, we are creating it for Hindi; almost all tokens are actually Hindi tokens, and this is the reason our fertility scores are way lower than what you see in the case of Gemma,” said Singh.
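Fertility is commonly defined as the average number of tokens a tokeniser produces per word, so a lower score means text is split into fewer pieces. The sketch below illustrates that definition only; it is not Lingo's evaluation code. The `fertility` function, the two toy tokenisers and the sample sentence are all hypothetical stand-ins, and words are assumed to be whitespace-separated.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in input")
    return total_tokens / total_words


def whitespace_tokenize(text: str) -> List[str]:
    # Toy best case: every word is exactly one token.
    return text.split()


def char_tokenize(text: str) -> List[str]:
    # Toy worst case: every non-space code point is its own token,
    # roughly what happens when a vocabulary has few Hindi entries.
    return [ch for ch in text if not ch.isspace()]


sample = ["गंगा भारत की नदी है"]  # hypothetical sample sentence

word_level = fertility(sample, whitespace_tokenize)
char_level = fertility(sample, char_tokenize)
print(f"word-level fertility: {word_level}")
print(f"character-level fertility: {char_level}")
```

A real comparison would pass each model's actual tokeniser as the `tokenize` argument and run it over a large held-out Hindi corpus rather than a single sentence.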

Singh said that they utilised existing datasets and curated new ones to train the model. “We searched for existing Hindi data used in language models similar to Bloom,” he said.

“For this specific model, we gathered around 9200 GB of plain text data from diverse sources, including our own scraping efforts and publicly accessible sources,” he added.

Future Plans

“Our short-term plan is to create high-quality Indian datasets and make them publicly available as open-source,” said Singh. He also spoke about their upcoming initiative to build benchmarks for Indic LLMs.

In the future, Lingo will also offer models through APIs. “We will create APIs for these models so that users can access them easily, similar to what other companies are doing,” said Singh.

Singh told AIM that they are continuing discussions with the Gujarat government to develop solutions, although nothing has been finalised yet.

The Lingo team is currently working on a technique called ‘model editing’, which offers developers an alternative to the traditional retrieval-augmented generation (RAG) and fine-tuning methods. In fine-tuning, all parameters are adjusted, whereas in model editing, only selected parameters are changed.
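The contrast between adjusting every parameter and adjusting only a chosen subset can be shown with a toy example. This is a minimal sketch of that distinction alone, not Lingo's model-editing method: the parameter names, values, update sizes and the `apply_update` helper are all hypothetical, and real editing techniques also have to decide which parameters to change and by how much.

```python
from typing import Dict, List, Optional, Set

Params = Dict[str, List[float]]


def apply_update(params: Params, deltas: Params,
                 editable: Optional[Set[str]] = None) -> Params:
    """Return updated copies of the parameters.

    editable=None mimics fine-tuning (every tensor may change);
    a set of names mimics editing (only those tensors change).
    """
    updated: Params = {}
    for name, values in params.items():
        if editable is None or name in editable:
            updated[name] = [w + d for w, d in zip(values, deltas[name])]
        else:
            updated[name] = list(values)  # left untouched
    return updated


# Hypothetical toy "model" with three named parameter groups.
params: Params = {
    "layer0.attn": [0.5, -0.25],
    "layer0.mlp": [1.0, 2.0],
    "layer1.mlp": [0.0, 0.75],
}
deltas: Params = {name: [0.125] * len(v) for name, v in params.items()}

fine_tuned = apply_update(params, deltas)                        # all groups change
edited = apply_update(params, deltas, editable={"layer1.mlp"})   # one group changes

changed_by_edit = [n for n in params if edited[n] != params[n]]
changed_by_finetune = [n for n in params if fine_tuned[n] != params[n]]
print("changed by editing:", changed_by_edit)
print("changed by fine-tuning:", changed_by_finetune)
```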

Moreover, Singh said that they are also developing a framework in which an update made to the LLM in one language is automatically reflected in all supported languages.

The post IIT Gandhinagar Channels Rivers Ganga, Yamuna into LLMs appeared first on Analytics India Magazine.
