9 Indian AI Research Papers That Deserve More Attention in 2025


Indian AI research in 2025 is fast becoming less about imitation and more about invention. Beyond the flashy announcements, from multilingual datasets to generative tools for accessibility, researchers are building what global labs often overlook: AI grounded in India’s languages, laws and lived realities.

These papers show where the country’s AI research is heading and why it matters.

Sakshm AI: Advancing AI‑Assisted Coding Education for Engineering Students in India Through Socratic Tutoring and Comprehensive Feedback

This study, authored by Raj Gupta, Harshita Goyal, Dhruv Kumar, Apurv Mehra, Sanchit Sharma, Kashish Mittal and Jagat Sesh Challa, introduces ‘Sakshm AI’, an intelligent tutoring system tailored for engineering students in India.

The platform embeds a chatbot named Disha, which uses a Socratic approach—offering context-aware hints and structured feedback instead of outright answers, while maintaining conversational memory.

It tackles a real gap in coding education in India—most tools either give direct answers or lack context awareness and feedback. By focusing on Socratic guidance and scalable intelligent tutoring, it promises deeper learning gains, especially in settings where expert human tutors may not be accessible. The mixed-methods evaluation strengthens its claims.
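
To make the approach concrete, here is a minimal sketch of a Socratic tutoring loop with conversational memory. The prompt wording and the generate() stub are illustrative assumptions, not Sakshm AI’s actual implementation.

```python
# Minimal sketch of a Socratic-style tutoring loop with conversational memory.
# The system prompt and the generate() stub are assumptions for illustration.

SYSTEM_PROMPT = (
    "You are Disha, a coding tutor. Never give the full solution. "
    "Ask a guiding question or give a small, context-aware hint instead."
)

def generate(prompt: str) -> str:
    """Placeholder for whatever LLM backend the tutor would call."""
    return "Hint: what should your loop condition be when low and high cross?"

class SocraticTutor:
    def __init__(self):
        self.history = []  # conversational memory across turns

    def ask(self, student_message: str) -> str:
        self.history.append(("student", student_message))
        transcript = "\n".join(f"{role}: {text}" for role, text in self.history)
        reply = generate(f"{SYSTEM_PROMPT}\n\n{transcript}\ntutor:")
        self.history.append(("tutor", reply))
        return reply

tutor = SocraticTutor()
print(tutor.ask("My binary search loops forever. Can you fix it?"))
```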

Nyay‑Darpan: Enhancing Decision Making Through Summarisation and Case Retrieval for Consumer Law in India

This paper comes from a team led by Swapnil Bhattacharyya and Pushpak Bhattacharyya, among others, at institutions including IIT Bombay and the National Law School of India University. It addresses a clear gap: AI tools exist for criminal and civil law, but consumer-law disputes in India are poorly served.

The authors propose a two‐in‐one system that first summarises consumer case files and then retrieves related judgments to support decision-making. The tool achieves over 75% accuracy in finding similar cases and ~70% in summary quality metrics. By releasing the dataset and framework, the work aims to democratise legal tech for consumers and smaller actors.
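
As a rough illustration of the two-stage design, here is a minimal summarise-then-retrieve sketch using TF-IDF similarity. The toy judgments, the trivial summariser and the retrieval method are assumptions, not the authors’ actual pipeline.

```python
# Minimal sketch of a summarise-then-retrieve pipeline for consumer-law cases.
# The data and both stages are simplified stand-ins for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_judgments = [
    "Refund ordered after an e-commerce seller shipped a defective phone.",
    "Insurance claim rejected; commission directed the insurer to pay with interest.",
    "Builder delayed flat possession; buyer awarded compensation.",
]

def summarise(case_file: str) -> str:
    """Stand-in summariser: keep only the first two sentences of the case file."""
    return " ".join(case_file.split(". ")[:2])

def retrieve_similar(summary: str, top_k: int = 2) -> list[str]:
    vec = TfidfVectorizer().fit(past_judgments + [summary])
    sims = cosine_similarity(vec.transform([summary]), vec.transform(past_judgments))[0]
    ranked = sorted(zip(sims, past_judgments), reverse=True)
    return [doc for _, doc in ranked[:top_k]]

summary = summarise("Consumer bought a phone online. It arrived defective. Seller refused a refund.")
print(retrieve_similar(summary))
```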

ILID: Native Script Language Identification for Indian Languages

Authored by Yash Ingle and Pruthwik Mishra (2025) from Indian institutions, the paper presents a dataset of 2,50,000 sentences covering English as well as all 22 official Indian languages, labelled for sentence-level, native-script language identification. The authors point out that many Indian languages share scripts or are code-mixed, making language identification a surprisingly challenging preprocessing task.

They provide baseline models (classical ML and fine-tuned transformers) and show performance drops for low-resource languages. It matters because accurate language detection is foundational for any multilingual Indian NLP pipeline; if that fails, downstream tasks like translation, summarisation or QA will misfire.
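
A minimal baseline in the spirit of the paper’s classical-ML systems might look like the sketch below. The three-language toy data and the model choice are assumptions; ILID itself covers English plus the 22 official languages.

```python
# Minimal sentence-level language-ID baseline: character n-gram TF-IDF
# features plus logistic regression. Toy data for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_sentences = ["यह एक वाक्य है", "हे एक वाक्य आहे", "this is a sentence"]
train_labels = ["hin", "mar", "eng"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),  # script-aware char n-grams
    LogisticRegression(max_iter=1000),
)
model.fit(train_sentences, train_labels)
print(model.predict(["ही भाषा कोणती आहे"]))  # Hindi and Marathi share the Devanagari script
```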

MorphTok: Morphologically Grounded Tokenization for Indian Languages

Authored by M Brahma et al (2025), with professor Ganesh Ramakrishnan of IIT Bombay, the paper observes that standard BPE tokenisation often mis-segments Indian-language words, especially compound or sandhi forms in Hindi and Marathi. The authors propose a morphology-aware pre-tokenisation step along with Constrained BPE (CBPE), which handles dependent vowels and script peculiarities.

They build a new dataset for Hindi and Marathi sandhi splitting and show downstream improvements (e.g., reduced token fertility, better MT and LM performance). Tokenisation may seem mundane, but for Indian languages, the ‘right’ units matter a lot; improvements here ripple into many tasks.
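
To illustrate the core constraint, the sketch below flags segmentations that strand a dependent vowel (matra) at the start of a token. The rule is a simplified stand-in, not the paper’s exact CBPE algorithm.

```python
# Sketch of the constraint idea: a token boundary should never separate a
# dependent vowel sign from the consonant it attaches to. Simplified rule only.

import unicodedata

def starts_with_dependent_vowel(piece: str) -> bool:
    # Dependent vowel signs in Indic scripts are combining marks (categories Mc/Mn)
    return bool(piece) and unicodedata.category(piece[0]) in {"Mc", "Mn"}

def valid_segmentation(pieces: list[str]) -> bool:
    """A segmentation is valid if no piece begins with a stranded matra."""
    return not any(starts_with_dependent_vowel(p) for p in pieces)

print(valid_segmentation(["कि", "ताब"]))   # True: the matra stays with क
print(valid_segmentation(["क", "िताब"]))   # False: ि is stranded at a token start
```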

COMI-LINGUA: Expert Annotated Large-Scale Dataset for Hindi-English Code-Mixed NLP

Authored by Rajvee Sheth, Himanshu Beniwal and Mayank Singh (IIT Gandhinagar), this dataset represents the largest manually annotated Hindi-English code-mixed collection with over 1,25,000 high-quality instances across five core NLP tasks.

Each instance is annotated by three bilingual annotators, yielding over 3,76,000 expert annotations with strong inter-annotator agreement (Fleiss’ Kappa ≥ 0.81). The dataset covers both Devanagari and Roman scripts and spans diverse domains, including social media, news and informal conversations. This addresses a critical gap: Hinglish (Hindi-English code-mixing) dominates urban Indian communication, yet most NLP tools trained on monolingual data fail on this mixed-language phenomenon.
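
For readers unfamiliar with the agreement statistic the authors report, here is a small worked sketch of Fleiss’ kappa over a made-up annotation matrix; the numbers are purely illustrative, not the paper’s data.

```python
# Worked sketch of Fleiss' kappa for multi-annotator agreement.

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of annotators assigning item i to category j."""
    n_raters = counts.sum(axis=1)[0]            # annotators per item (3 here)
    p_j = counts.sum(axis=0) / counts.sum()      # overall category proportions
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# 4 items, 2 categories, 3 annotators each (e.g. "code-mixed" vs "monolingual")
counts = np.array([[3, 0], [3, 0], [2, 1], [0, 3]])
print(round(fleiss_kappa(counts), 3))  # 0.625 on this toy matrix
```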

IndianBailJudgments‑1200: A Multi‑Attribute Dataset for Legal NLP on Indian Bail Orders

Sneha Deshmukh and Prathmesh Kamble compiled 1,200 Indian court judgments on bail decisions. Each judgment is annotated with over 20 structured attributes (bail outcome, IPC sections, crime type, court name, legal reasoning). They used GPT-4o prompts to bootstrap labels and manually verified a subset.

Bail orders directly affect millions of undertrial prisoners in India; a structured dataset could help legal NLP tools assist lawyers, judges or reformers in analysing bail decisions systematically.
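
As a rough sketch of what one structured record might look like, here is a hypothetical schema built from the attributes listed above; the field names and example values are illustrative, not the dataset’s exact format.

```python
# Hypothetical structured record for a single bail order. Field names follow
# the attributes mentioned in the article; the real schema may differ.

from dataclasses import dataclass, field

@dataclass
class BailRecord:
    court_name: str
    crime_type: str
    ipc_sections: list[str] = field(default_factory=list)
    bail_outcome: str = "unknown"          # e.g. "granted" / "rejected"
    legal_reasoning: str = ""

record = BailRecord(
    court_name="High Court of Bombay",
    crime_type="economic offence",
    ipc_sections=["420", "406"],
    bail_outcome="granted",
    legal_reasoning="No flight risk; investigation already complete.",
)
print(record)
```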

TathyaNyaya and FactLegalLlama: Advancing Factual Judgment Prediction and Explanation in the Indian Legal Context

Authored by Shubham K Nigam, Balaram Deepak Patnaik et al, the paper introduces TathyaNyaya, a dataset focused on factual statements in Indian legal judgments (from the Supreme Court and High Courts) rather than full texts, and FactLegalLlama, an instruction-tuned LLaMA variant that predicts judgments and explains them.

Judgment prediction paired with explanation is rare in the Indian legal domain. The dataset and model aim to make AI-assisted legal analysis more transparent, rather than delivering black-box outputs.
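
A single instruction-tuning example for judgment prediction with explanation could look roughly like the sketch below; the field names and text are assumptions, not drawn from TathyaNyaya itself.

```python
# Illustrative instruction-tuning record: facts in, predicted outcome plus
# explanation out. Purely a hypothetical format.

import json

example = {
    "instruction": "Given the facts of the case, predict the outcome and explain why.",
    "input": "The appellant was dismissed from service without a departmental inquiry ...",
    "output": "Appeal allowed. Dismissal without an inquiry violates natural justice ...",
}
print(json.dumps(example, indent=2))
```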

LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification

Shubham K Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh and Arnab Bhattacharya created LegalSeg, a dataset of over 7,000 documents and 1.4 million sentences annotated with seven rhetorical roles (e.g., facts, arguments, judgements) in Indian legal judgments. They benchmark different architectures, including role-aware transformers, and find that leveraging document structure helps.

Understanding the internal structure of legal texts is key to summarisation, information extraction and building legal-AI systems tailored to India’s judicial style.
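
To show the shape of the task, here is a minimal labelling sketch. The keyword rules are a trivial stand-in for the role-aware transformers the authors benchmark, and the full role inventory shown is an assumption beyond the examples quoted above.

```python
# Minimal sketch of rhetorical-role labelling: assign each sentence of a
# judgment one of seven roles. Keyword rules used purely for illustration.

ROLES = ["facts", "arguments", "statute", "precedent", "reasoning", "judgement", "other"]

def classify(sentence: str) -> str:
    text = sentence.lower()
    if "held" in text or "appeal is allowed" in text:
        return "judgement"
    if "contended" in text or "submitted" in text:
        return "arguments"
    return "facts"

judgment = [
    "The appellant was arrested on 3 March.",
    "Counsel contended that the arrest was illegal.",
    "We hold that the appeal is allowed.",
]
for sentence in judgment:
    role = classify(sentence)
    assert role in ROLES
    print(f"{role:10s} | {sentence}")
```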

DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models’ Understanding on Indian Culture

Arijit Maji, Raghvendra Kumar, Akash Ghosh et al propose DRISHTIKON, a benchmark spanning 15 languages with 64,288 aligned text-image pairs covering Indian cultural themes (festivals, cuisine, attire, heritage). They evaluate vision-language models and show that they struggle to reason about culturally grounded multimodal content.

As AI becomes global, culture matters, and Indian cultural content remains massively under-represented. This benchmark enables the evaluation of models’ cultural competence in Indian contexts.
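
A benchmark harness for this kind of evaluation might be sketched as below; the vlm_answer() stub, the sample item and the multiple-choice framing are assumptions and may not match DRISHTIKON’s actual task format.

```python
# Sketch of a benchmark evaluation loop: query a vision-language model on each
# text-image pair and score its answers. All data and the model stub are toy.

def vlm_answer(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a real vision-language model call."""
    return options[0]

benchmark = [
    {"image": "pongal.jpg", "question": "Which festival is shown?",
     "options": ["Pongal", "Bihu", "Onam"], "answer": "Pongal"},
]
correct = sum(
    vlm_answer(item["image"], item["question"], item["options"]) == item["answer"]
    for item in benchmark
)
print(f"accuracy: {correct / len(benchmark):.2%}")
```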
