In a world of Kannada, Tamil, Telugu and Odia Llamas, and SarvamAI’s bilingual LLM OpenHathi supporting both Hindi and English, Tech Mahindra’s Project Indus is gearing up to build LLMs from the ground up. These will cater specifically to speakers of Hindi across dialects such as Bhojpuri, Kangdi, Angika, Nagpuri, Khortha, Kudmali, Panch-Parganiya and Hadauti.
“Hindi is spoken by 615 million speakers, much more than anybody who considers English as their first language as native speakers as well,” said Nikhil Malhotra, CIO at Tech Mahindra and the brain behind Project Indus, sharing with AIM the details of the project, data challenges, tech stack, and the roadmap ahead in an exclusive interaction.
Malhotra said that most of the models in India are built on top of Llama or use existing foundational model APIs. Those who seem to be building from scratch are yet to reveal what they are really built on—underlining that “it’s INDUS and BharatGPT, which are actually built from the ground up”.
Project Indus started in June last year to build India’s very own foundational models from scratch, and there has been no turning back since. “When I came back from the US, the whole idea was how do you start the language revolution?” said Malhotra, sharing his interaction with CP Gurnani, who took Sam Altman’s words a little too seriously.
Coincidentally, around that time, Malhotra was also working on a language model called BHAML, or Bharat Mark-up Language, an AI system that kids used to code in a language of their choice. He was roped in to build Project Indus.
“India did not have a foundational model at that point in time. So, we had different models. We had translational models like Bhashini, we had different, but there was no cut foundational modelling in the system,” shared Malhotra, speaking at length about the data challenges around low-resource Indic languages, particularly in Hindi (which has more than 49 dialects).
He stressed the underrepresentation of Indic languages in existing AI systems and models, including GPT-4, Llama and others.
“Because most of these dialects are on the endangered list. 80% of the Indians that do not speak English could also communicate,” said Malhotra.
Malhotra emphasised the significance of linguistic diversity, mentioning less common dialects like Angika, Nagpuri, Khortha, Kudmali, Panch-Parganiya, and Hadauti. “The aim is to serve languages spoken by 100,000 to 200,000 people, ensuring inclusivity across various linguistic groups.”
He reiterated, “The vision was to become a platform that can serve every Indian and every business… It could be a housewife, a patient, a farmer, or even machines.” This inclusivity defines the versatile, persona-oriented nature of Project Indus.
Data Collection and Language Diversity
Indus is built on a dataset of 10 billion tokens, sourced predominantly from various parts of northern India and from platforms such as Bhashini and Bhasha-dan.
“I sent teams to the country’s northern belt; we went to Madhya Pradesh, Rajasthan, and some parts of Bihar,” said Malhotra. The teams’ task was to collect Hindi and dialect data by interacting with professors and leveraging the Bhasha-dan portal available on ProjectIndus.in.
He also highlighted the initiative to involve Tech Mahindra employees in contributing everyday interactions through sentences like ‘Main ghar se bahar jata hu’ to the portal to gather diverse linguistic prompts.
Unique as they were, these methods alone would have taken years to yield a large, diverse, high-quality corpus, so the team turned to the Falcon structure and incorporated data from the Pile, a free open-source dataset. They also translated some datasets from the Pile into Hindi, which contributed significantly to the overall data volume.
Malhotra mentioned the development of toolkits to manage biases in the data, including a bias tool that identifies nine different types of bias in continuous text. About 70,000 to 80,000 sentences were annotated by people to build a bias analogy or classification algorithm, which allows biases in the dataset to be identified and handled.
Model Architecture and Innovation
The Project Indus team has used decoder-architecture-based transformers. However, Malhotra noted that, “These are transformers, but it has Tokenisers, which will be only Hindi.”
Malhotra highlighted the significance of the Hindi-only tokenisation, with approximately “10 billion tokens and 539 million parameters.” The decoder stack is pre-trained and then integrated with Hugging Face’s Transformers library for open-source use.
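To illustrate what restricting tokenisation to Hindi can mean at the simplest level, here is a minimal sketch that keeps only runs of Devanagari characters (Unicode block U+0900 to U+097F) and counts them into a word-level vocabulary. This is an illustration under stated assumptions, not the Indus tokeniser: the function names `devanagari_words` and `build_vocab` are hypothetical, the sample sentences are made up, and a production tokeniser would typically learn subword units rather than whole words.

```python
import re
from collections import Counter

# Devanagari Unicode block: U+0900 to U+097F.
# Assumption: "Hindi-only" is approximated here as "Devanagari-only".
DEVANAGARI_RUN = re.compile(r"[\u0900-\u097F]+")


def devanagari_words(text):
    """Return maximal runs of Devanagari characters, dropping everything else."""
    return DEVANAGARI_RUN.findall(text)


def build_vocab(corpus, min_count=1):
    """Count Devanagari words across the corpus; keep those seen >= min_count times."""
    counts = Counter(w for line in corpus for w in devanagari_words(line))
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical sample corpus; the second line mixes in Latin-script text.
corpus = [
    "मैं घर से बाहर जाता हूँ",
    "मैं घर पर हूँ (I am at home)",
]
vocab = build_vocab(corpus)
print(sorted(vocab.items(), key=lambda kv: -kv[1])[:3])
```

Latin-script words, digits and ASCII punctuation never reach the vocabulary, so every vocabulary slot is spent on Hindi text. That is one plausible motivation for a Hindi-only tokeniser, though the article does not describe how the team's tokeniser is actually built.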
The LLMOps system involves data collection, pretraining, fine-tuning, understanding model behaviour, monitoring, and production deployment. Malhotra mentioned using open-source tools like CommEt and Switch for LLMOps management.
Additionally, to address computational efficiency and economies of scale, he said the team will use quantum-inspired Tensor Networks to optimise the performance of the model, which the team is expected to release in April. “We needed multi-parallel GPUs in terms of what we were doing. So we actually used C-DAC’s GPUs, about 48 large 40 GB GPUs, and trained them for at least four days, four to five days,” said Malhotra.
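For a rough sense of scale, the figures quoted in this article (10 billion tokens, 539 million parameters, 48 GPUs with 40 GB each, four to five days of training) can be turned into back-of-the-envelope numbers. The sketch below only does arithmetic on those quoted values; it assumes all 48 GPUs ran around the clock for the whole period, which the article does not confirm.

```python
# Figures as quoted in the article.
TOKENS = 10e9          # training tokens
PARAMS = 539e6         # model parameters
GPUS = 48              # number of GPUs
GPU_MEMORY_GB = 40     # memory per GPU
DAYS_LOW, DAYS_HIGH = 4, 5


def gpu_hours(gpus, days):
    """GPU-hours, assuming every GPU is busy 24 hours a day (an assumption)."""
    return gpus * days * 24


tokens_per_param = TOKENS / PARAMS
gpu_hours_low = gpu_hours(GPUS, DAYS_LOW)
gpu_hours_high = gpu_hours(GPUS, DAYS_HIGH)
total_gpu_memory_gb = GPUS * GPU_MEMORY_GB

print(f"Tokens per parameter: {tokens_per_param:.1f}")
print(f"GPU-hours: {gpu_hours_low} to {gpu_hours_high}")
print(f"Aggregate GPU memory: {total_gpu_memory_gb} GB")
```

Under these assumptions the run works out to roughly 4,600 to 5,800 GPU-hours and about 18.6 training tokens per parameter.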
What’s next?
Project Indus is gearing up for its use in diverse applications, including rural finance, agri-tech and media and entertainment. It aims to empower rural communities with a language model that understands their dialects, reducing reliance on call centres.
The team said that the model will be offered in open-source and enterprise-source formats, catering to innovators and enterprise applications. An innovative proposal involves making tractors conversational to address farmers’ challenges on the edge.
Malhotra said that after the Hindi model, the focus of the 14-member team will shift to Bangla, which is spoken by up to 400 million people.
Furthermore, he emphasised India’s potential to lead in AI research and development. With abundant data, talent, and unique use cases, he urged the Indian ecosystem to focus on defining new algorithms for AI. The call is to leapfrog in terms of innovation and sustainability, ensuring that AI serves the diverse needs of the Indian populace.
“We’ve caught on to the trend of LLMs, but we are still not leading the trend. (But) I think it’s time for Indian researchers and the Indian ecosystem to lead the trend,” concluded Malhotra, outlining the key factors that position India uniquely for AI advancements: an abundant data source and a pool of talent.
The post [Exclusive] Tech Mahindra’s Nikhil Malhotra on Making Foundational Models for India appeared first on Analytics India Magazine.