Meta’s Llama 3.1 is the Missing Link for Indic Datasets

Earlier this year, when Meta released the Llama 3.1 405B model, it also updated the Llama licence to let developers use outputs from Llama models, including the 405B, to improve other models. This benefited developers and AI startups in India building Indic LLMs.

Yann LeCun, Meta’s chief AI scientist, acknowledged the restrictions that earlier Llama licences placed on developers who wanted to build models on Llama outputs. Speaking at Meta’s Build with AI Summit in Bengaluru, LeCun said the company took note of the feedback and addressed the concerns in the newer version.

Sharing the stage with Yann LeCun in a fireside chat, Nandan Nilekani, influential Indian entrepreneur and co-founder of Infosys, said that this development will help Indian AI startups use LLMs easily and that it isn’t necessary for India to build LLMs from scratch.

“Our goal should not be to create another LLM. Let the big players in Silicon Valley handle that,” Nilekani said. “India should become the use-case capital of the world and focus on building small models quickly.” He added that India will use Llama to create synthetic data, build small language models quickly, and train them on appropriate data.

Nilekani is at least partially right on cost: using OpenAI’s GPT-4o or Anthropic’s Claude 3.5 Sonnet APIs can be expensive. In contrast, Llama 3.1 405B’s weights are freely available on Hugging Face, and the model competes well with top foundation models such as GPT-4, GPT-4o, and Claude 3.5 Sonnet.

Nilekani opined that the correct approach for Indian AI companies is to create appropriate data. Notably, in 2022, he invested in AI4Bharat, a research lab dedicated to creating open-source datasets, tools, models, and applications for Indian languages.

Llama Powers Sarvam AI

Speaking at Cypher 2024, Vivek Raghavan, chief of Sarvam AI, revealed that the company used Llama 3.1 405B to build Sarvam 2B. He explained that it is a 2-billion-parameter model trained on 4 trillion tokens, of which 2 trillion are Indian-language tokens.

Sarvam 2B is part of a class of small language models (SLMs) that includes Microsoft’s Phi series, Llama 3 (8 billion), and Google’s Gemma models. It serves as a viable alternative to using large models, such as those from OpenAI and Anthropic, while also being more efficient for specific use cases.

“If you look at the 100 billion tokens in Indian languages, we used a clever method to create synthetic data for building these models using Llama 3.1 405B. We trained the model on 1,024 NVIDIA H100s in India, and it took only 15 days,” said Raghavan.

Regarding Sarvam 2B, he further said that the model performs well on Indic tasks. “It is extremely good for summarisation in Indian languages, and for any kind of NLP task in Indian languages—this will outperform models that are much bigger.”
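
The approach Raghavan outlines, using a very large teacher model to generate Indian-language text that then becomes training data for a much smaller model, can be sketched roughly as follows. This is a minimal illustration rather than Sarvam’s actual pipeline: the seed topics, prompt, and output schema are assumptions, and it presumes access to a hosted Llama 3.1 405B Instruct endpoint via the huggingface_hub library.

```python
# Minimal sketch of teacher-driven synthetic data generation.
# The endpoint, seed topics, prompt, and output schema are assumptions,
# not Sarvam's pipeline.
import json

from huggingface_hub import InferenceClient

# Assumes a hosted inference endpoint serving the 405B instruct model.
client = InferenceClient("meta-llama/Llama-3.1-405B-Instruct")

SEED_TOPICS = ["मानसून", "क्रिकेट", "भारतीय रेलवे"]  # hypothetical Hindi seeds

with open("synthetic_hindi.jsonl", "w", encoding="utf-8") as f:
    for topic in SEED_TOPICS:
        # Ask the teacher model for a short Hindi paragraph on the topic.
        prompt = f"इस विषय पर हिंदी में एक छोटा अनुच्छेद लिखिए: {topic}"
        response = client.chat_completion(
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
            temperature=0.8,  # some diversity helps pretraining corpora
        )
        record = {"topic": topic, "text": response.choices[0].message.content}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

In practice, a real pipeline would scale the seed set to millions of prompts, deduplicate and filter the outputs, and mix them with human-written text before pretraining.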

The company recently launched its latest model, Sarvam-1, which it says outperforms Google’s Gemma-2 and Llama 3.2 on Indic tasks. It claims the secret sauce is 2 trillion tokens of synthetic Indic data, equivalent to 6-8 trillion regular tokens thanks to its highly efficient tokeniser.
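
The tokeniser claim is easy to sanity-check in principle: a tokeniser trained on Indic scripts should emit far fewer tokens for the same Hindi sentence than one trained mostly on English text. A rough check along those lines follows, with the caveat that the model IDs and sample sentence are illustrative assumptions, not Sarvam’s own benchmark.

```python
# Rough tokeniser-efficiency check on one Hindi sentence.
# Model IDs are assumptions (the Llama repo is gated on the Hub);
# any Indic-focused vs. general-purpose tokeniser pair illustrates the point.
from transformers import AutoTokenizer

sentence = "भारत विश्व की सबसे तेज़ी से बढ़ती अर्थव्यवस्थाओं में से एक है।"

for model_id in ["sarvamai/sarvam-1", "meta-llama/Llama-3.1-8B"]:
    tok = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tok.encode(sentence, add_special_tokens=False))
    print(f"{model_id}: {n_tokens} tokens")

# Fewer tokens for the same text means more content per training token,
# which is the basis of the "2T synthetic tokens equal 6-8T regular" claim.
```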

Sarvam AI is not alone. In a recent interaction with AIM, Meta VP Manohar Paluri revealed that even Ola Krutrim uses Llama. “Not just small companies or developers, even large companies build on top of this ecosystem, which really gives us confidence that we have the momentum.”

He further added that people are now using Llama as their de facto intelligence layer and building their businesses on top of it. “We are actually trying to bring high-quality Indian tokens into Llama so that Llama will work in Indian languages,” he said.

Paluri explained that since Llama is the engine behind Meta AI, the assistant will support Indian languages, benefiting the hundreds of millions of people in India who use Meta AI on WhatsApp, Facebook, and Instagram.

Meta AI boasts over 500 million monthly active users globally and is on track to become the most widely used AI chatbot by the end of 2024. It was recently launched in Hindi, and the company plans to integrate more Indian languages into its future models.

Exploring Llama 3.1’s Multilingual Capabilities

In the Llama 3.1 research paper, Meta states that the model was trained on multilingual data and that the team generated high-quality instruction-tuning data for languages such as German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Meta also collected high-quality, manually annotated data from linguists and native speakers. Moreover, with a context length of 128K tokens, the model can process and generate longer, more complex pieces of text, which is useful for creating diverse synthetic datasets.

“Synthetic data generation is one of the main use cases for these very large models. This can be extremely helpful in domains where obtaining a large, high-quality dataset is challenging due to cost, privacy concerns, or simply a lack of available data,” said Hamid Shojanazeri, ML engineer at Meta.
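
Combined with the 128K-token context, one common pattern the quote points to is feeding an entire long document to the model and having it emit question-answer pairs, producing instruction-tuning data for domains where labelled examples are scarce. The sketch below is a hedged illustration: the model ID, input file, and prompt are assumptions, and a smaller Llama instruct model stands in because the 405B model requires multi-GPU serving or a hosted endpoint.

```python
# Sketch: turn one long document into synthetic Hindi QA pairs.
# A smaller instruct model stands in for Llama 3.1 405B here.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed, gated on the Hub
    device_map="auto",
)

# Hypothetical long Hindi document; fits easily in a 128K-token window.
document = open("long_report_hi.txt", encoding="utf-8").read()

messages = [{
    "role": "user",
    "content": "नीचे दिए गए दस्तावेज़ से हिंदी में 5 प्रश्न-उत्तर जोड़े बनाइए:\n\n" + document,
}]
result = generator(messages, max_new_tokens=512)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply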

Meta is not limited to LLMs. It intends to empower developers with the resources to create custom agents and discover new types of agentic behaviours.
