Indian AI startup Sarvam AI has launched Sarvam-1, which the company describes as the first LLM optimised specifically for Indian languages.
A 2-billion-parameter model, Sarvam-1 supports 10 major Indian languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu) alongside English.
Despite its relatively small size, Sarvam-1 shows strong performance on Indic language tasks, outperforming larger models such as Gemma-2 and Llama-3 on benchmarks including MMLU, ARC-Challenge, and IndicGenBench. It also delivers 4 to 6 times faster inference, making it suitable for deployment on edge devices.
For instance, on the TriviaQA benchmark, Sarvam-1 achieved an accuracy of 86.11 across Indic languages, significantly surpassing Llama-3.1 8B’s score of 61.47.
Sarvam-1’s performance on IndicGenBench, which tests cross-lingual tasks such as summarization, translation, and question answering, also stood out. It achieved an average chrF++ score of 46.81 on Flores, a benchmark for English-to-Indic translation, surpassing the larger Llama-3.1 8B model.
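For context, chrF++ is a character- and word-n-gram overlap metric for machine translation and is commonly computed with the sacrebleu library. The snippet below is a minimal, illustrative sketch; the example sentences are made up and are not taken from the actual IndicGenBench or Flores evaluation.

```python
# Minimal sketch of computing a chrF++ score with sacrebleu.
# The hypothesis/reference sentences are illustrative placeholders only.
from sacrebleu.metrics import CHRF

chrf_pp = CHRF(word_order=2)  # word_order=2 gives the chrF++ variant

hypotheses = ["भारत एक विविधतापूर्ण देश है"]        # model translations
references = [["भारत एक विविधता से भरा देश है"]]     # one reference stream

score = chrf_pp.corpus_score(hypotheses, references)
print(score)  # prints something like "chrF2++ = 47.12"
```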
The model aims to bridge the gap for Indian-language speakers by offering advanced natural language processing (NLP) capabilities that have so far been concentrated in English and other high-resource languages.
A key feature of Sarvam-1 is its efficiency in handling Indic scripts, a long-standing challenge for earlier LLMs. Most existing multilingual models have high token fertility, meaning they need far more tokens per word for Indian languages than for English.
Sarvam-1’s tokenizer significantly reduces this inefficiency, achieving fertility of 1.4 to 2.1 tokens per word for the supported Indian languages, much closer to the roughly 1.4 tokens per word typical of English, as sketched below. This enables more streamlined training and better model performance across Indian languages.
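As a rough way to see what fertility means in practice, the sketch below counts tokens per whitespace-separated word with a Hugging Face tokenizer. The repo id "sarvamai/sarvam-1" and the sample sentences are assumptions for illustration, not details confirmed in the announcement.

```python
# Minimal sketch: estimate token fertility (tokens per word) for a tokenizer.
# "sarvamai/sarvam-1" is an assumed Hugging Face repo id.
from transformers import AutoTokenizer

def token_fertility(tokenizer, text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    tokens = tokenizer.encode(text, add_special_tokens=False)
    return len(tokens) / max(len(words), 1)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")  # assumed repo id
    hindi_sample = "भारत एक विशाल और विविधतापूर्ण देश है"
    english_sample = "India is a large and diverse country"
    print("Hindi fertility:", round(token_fertility(tok, hindi_sample), 2))
    print("English fertility:", round(token_fertility(tok, english_sample), 2))
```

Lower fertility means fewer tokens per word, which translates directly into shorter sequences, cheaper training, and faster inference for Indic text.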
The model’s training corpus, Sarvam-2T, consists of approximately 2 trillion tokens, with content evenly distributed across the 10 supported languages, except for Hindi, which constitutes about 20% of the dataset. The dataset also includes a substantial portion of English and programming languages, which helps the model perform across both monolingual and multilingual tasks.
Sarvam-2T emphasises high-quality, diverse data, addressing limitations of existing Indic datasets such as Sangraha, which are often web-crawled and of uneven quality. It also includes longer documents and richer scientific and technical content, enhancing the model’s ability to handle complex reasoning tasks.
Another key feature of Sarvam-1 is its computational efficiency. The model offers 4 to 6 times faster inference speeds compared to larger models like Gemma-2-9B and Llama-3.1-8B, while maintaining competitive performance levels. This makes Sarvam-1 particularly suitable for deployment in production environments, including edge devices where computing resources may be limited.
Sarvam-1 was trained over five days using 1,024 GPUs on Yotta’s Shakti cluster, leveraging NVIDIA’s NeMo framework for training optimisations.
The model is available for download on Hugging Face’s model hub, where developers can access and explore its capabilities for a range of Indic language applications, from translation to conversational AI and more.
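For developers who want to try it, a standard transformers loading flow should work along the lines of the sketch below. The repo id "sarvamai/sarvam-1" and the generation settings are assumptions; since this is a base model rather than an instruction-tuned one, completion-style prompting is used.

```python
# Minimal sketch of loading Sarvam-1 from the Hugging Face hub for generation.
# The repo id "sarvamai/sarvam-1" is an assumption, not a confirmed hub name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-1"  # assumed hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # a 2B-parameter model fits on a single GPU in bf16
    device_map="auto",
)

prompt = "भारत की राजधानी"  # "The capital of India" in Hindi
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```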