OpenAI’s New Benchmark IndQA to Evaluate AI Models on Indian Language & Culture 

OpenAI has introduced IndQA, a new benchmark to evaluate how well AI models understand and reason about Indian languages and culture. The benchmark aims to measure AI performance in areas beyond translation or multiple-choice tasks, focusing instead on reasoning and cultural understanding.

“Our mission is to make AGI benefit all of humanity,” OpenAI said in a statement. “If AI is going to be useful for everyone, it needs to work well across languages and cultures.”

OpenAI noted that most multilingual benchmarks, such as MMMLU, are now saturated—top models cluster near perfect scores—making them less effective in tracking progress. IndQA was created to address these gaps by testing AI systems on culturally grounded and reasoning-heavy tasks in Indian contexts.

Covers 12 Languages and 10 Cultural Domains

IndQA includes 2,278 questions spanning 12 languages, including Bengali, English, Hindi, Telugu and Tamil, and 10 domains such as architecture, design, food and cuisine, history, media and entertainment, and sports.

Each question is authored by domain experts and includes a rubric for evaluation, an English translation for auditability and an ideal answer. Responses are graded by a model-based system that checks whether specific expert-defined criteria are met.
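A minimal sketch of how such rubric-based, model-graded evaluation might work. The criterion structure, the `judge` interface and the weighted-fraction scoring here are illustrative assumptions, not OpenAI's published implementation:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    description: str   # an expert-defined requirement the answer must meet
    weight: float = 1.0


def grade_response(response: str,
                   rubric: List[Criterion],
                   judge: Callable[[str, str], bool]) -> float:
    """Score a response as the weighted fraction of rubric criteria met.

    `judge` stands in for a model-based grader: it returns True when the
    response satisfies the given criterion (hypothetical interface).
    """
    total = sum(c.weight for c in rubric)
    met = sum(c.weight for c in rubric if judge(response, c.description))
    return met / total if total else 0.0


# Toy usage with a keyword-matching "judge" standing in for a grader model
rubric = [Criterion("mentions the dish's regional origin"),
          Criterion("names a key ingredient", weight=2.0)]
toy_judge = lambda resp, crit: ("origin" in crit and "Hyderabad" in resp) \
                            or ("ingredient" in crit and "saffron" in resp)
score = grade_response("Biryani from Hyderabad uses saffron.", rubric, toy_judge)
```

In a real setup the `judge` would itself be an LLM call that reads the response, the criterion and the expert's ideal answer, but the scoring arithmetic stays the same.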

Expert-Led, Adversarially Filtered

OpenAI said the benchmark was built in collaboration with 261 experts from across India, including linguists, journalists, artists, professors and practitioners.

“We worked with partners to find experts in India across 10 different domains,” the company explained. “They drafted reasoning-focused prompts tied to their regions and specialities.”

The questions underwent adversarial filtering—each was tested against OpenAI’s strongest models, including GPT-4o, OpenAI o3, GPT-4.5, and GPT-5—and only those that most models failed to answer were retained. “This process preserves headroom for progress,” OpenAI noted.
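The filtering step described above can be sketched as a single pass over candidate questions. The retention rule (keep a question if at most one model answers it correctly) and the callable-per-model interface are assumptions for illustration; OpenAI has not published its exact threshold:

```python
from typing import Callable, Dict, List


def adversarial_filter(questions: List[str],
                       models: Dict[str, Callable[[str], bool]],
                       max_passes: int = 1) -> List[str]:
    """Keep only questions that at most `max_passes` models answer correctly.

    Each model is a callable returning True when it answers the question
    correctly; retaining hard questions preserves headroom for progress.
    """
    kept = []
    for question in questions:
        passes = sum(1 for answers in models.values() if answers(question))
        if passes <= max_passes:
            kept.append(question)
    return kept


# Toy usage: two "models" that only answer questions containing "easy"
models = {"model_a": lambda q: "easy" in q,
          "model_b": lambda q: "easy" in q}
kept = adversarial_filter(["easy q", "hard q"], models, max_passes=1)
```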

Performance Comparison

OpenAI used IndQA to evaluate leading AI models. The results showed GPT-5 (Thinking High) achieved the highest overall score of 34.9%, followed by Gemini 2.5 Pro at 34.3% and Grok 4 at 28.5%. GPT-4o and earlier versions scored lower, indicating measurable improvement in recent models.

When analysed by language, GPT-5 performed best across most Indian languages, though OpenAI cautioned against interpreting IndQA as a cross-language leaderboard. “Because questions are not identical across languages, cross-language scores shouldn’t be interpreted as direct comparisons,” the company said.

Cultural Depth and Regional Expertise

The benchmark reflects deep cultural diversity through expert contributions, including a Nandi Awards-winning Telugu actor and screenwriter, a Marathi journalist at Tarun Bharat, a Kannada linguistics scholar, a Tamil writer and activist and a Gujarati heritage curator.

“IndQA pushes AI systems to go beyond surface-level translation and demonstrate real cultural and contextual understanding,” a participating expert noted.

Towards Broader Global Benchmarks

OpenAI said IndQA is part of its broader effort to improve AI accessibility in India—ChatGPT’s second-largest market—and to develop similar benchmarks for other languages and regions.

“We hope IndQA will inspire the research community to build culturally grounded evaluations,” the company said. “Creating similar benchmarks can help AI systems learn more about languages and domains they struggle with today.”

The post OpenAI’s New Benchmark IndQA to Evaluate AI Models on Indian Language & Culture appeared first on Analytics India Magazine.
