‘How many R’s does the word strawberry have?’ is a question we have all asked LLMs. When OpenAI finally revealed the o1 series of models, the answer came back correct. From there, we witnessed a shift that put “PhD-level intelligence” within almost everyone’s reach.
In a recent podcast, Diana Hu, general partner at Y Combinator, said that the rise of reasoning models can be traced back to OpenAI’s early work with DOTA, where they implemented reinforcement learning techniques inspired by AlphaGo and AlphaZero.
The o1 models did something different underneath: they used chain-of-thought (CoT) prompting and reasoning tokens to answer complex questions, which was not possible earlier. Among the technical breakthroughs behind reasoning models, one of the most significant pieces of research was Reflection, which explored how models can reason.
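The exact prompts and hidden reasoning tokens used by o1 are not public, but a minimal sketch of the chain-of-thought idea looks something like this (the prompt wording below is illustrative, not OpenAI’s):

```python
# A minimal, illustrative sketch of chain-of-thought prompting.
# Neither prompt is from o1; they only contrast a direct question
# with one that asks the model to reason step by step.

direct_prompt = "How many r's are in the word 'strawberry'? Answer with a number."

cot_prompt = (
    "How many r's are in the word 'strawberry'?\n"
    "Think step by step: spell the word letter by letter, "
    "count each 'r' as you go, then state the final count."
)

# Ground truth a model's answer could be checked against:
print("strawberry".count("r"))  # 3
```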
Things took a different turn when China quietly entered the reasoning model race; it now boasts multiple reasoning models that are on par with, or better than, the o1 series. Aravind Srinivas, CEO of Perplexity AI, mentioned in a recent X post that China is ahead of the USA in open-source models that can reason and do maths well.
“This is simply due to time wasted in discussing the dangers of open source and money spent lobbying against it. America needs to fight back for its open-source supremacy,” he added.
Now, if you visit Hugging Face, you will realise there are multiple open-source models built for reasoning.
What Fuelled the Tsunami of Reasoning Models?
The traditional “scaling laws” theory, which holds that simply adding more data and computing power will continuously improve AI capabilities, is being questioned, as major AI labs like OpenAI, Google, and Anthropic are no longer seeing the dramatic improvements in their models that they once did. This limitation has pushed researchers to explore new approaches, including reasoning models.
In recent times, we have seen multiple frameworks that have become the standard for developing models that can reason well. The core building blocks, Transformers and CoT, remain the same across these reasoning models, but the technical components around them differ.
Researchers and developers have been exploring hybrid approaches that combine multiple technologies. The integration of neuro-symbolic AI with traditional deep learning has emerged as a promising direction, allowing systems to both learn from data and reason with explicit knowledge.
A key advancement in these models is the integration of test-time computation with process supervision. The OpenR framework unifies data acquisition, reinforcement learning training, and non-autoregressive decoding into a cohesive platform.
This approach has demonstrated substantial improvements, with process reward models and guided search enhancing test-time reasoning performance by approximately 10%.
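OpenR’s actual pipeline is considerably more involved, but the basic idea of process-reward-guided search at test time can be sketched as follows. The functions `generate_candidates` and `score_step` are hypothetical stand-ins for a policy model and a process reward model (PRM); they are not part of OpenR’s API.

```python
# Sketch of process-reward-guided search at inference time, in the spirit of
# frameworks like OpenR. All functions here are illustrative placeholders.
import random

def generate_candidates(partial_solution, k=4):
    # Placeholder: a real system would sample k candidate next reasoning
    # steps from a language model conditioned on the partial solution.
    return [f"{partial_solution} -> step option {i}" for i in range(k)]

def score_step(candidate):
    # Placeholder: a real PRM scores how promising each intermediate
    # reasoning step is, rather than only judging the final answer.
    return random.random()

def guided_search(problem, depth=3, k=4):
    """Greedy step-by-step search: at each depth, keep the candidate
    step that the PRM scores highest (beam width 1 for simplicity)."""
    solution = problem
    for _ in range(depth):
        candidates = generate_candidates(solution, k)
        solution = max(candidates, key=score_step)
    return solution

print(guided_search("Prove that the sum of two even numbers is even"))
```

Real systems typically keep several partial solutions alive (beam search or tree search) rather than a single greedy path, which is where most of the test-time compute goes.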
Nous Research’s recent introduction of its Reasoning API marked a significant milestone. It demonstrated how smaller models could compete with larger ones through enhanced reasoning capabilities. Their 70B Hermes model, when integrated with Monte Carlo Tree Search, Chain of Code, and Mixture of Agents, outperformed larger models in complex mathematical challenges.
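Nous Research has not published the internals of this pipeline, but the Mixture-of-Agents part of the recipe can be sketched roughly as below: several draft answers are collected and a final model synthesises them. The `ask` function and model names are hypothetical placeholders, not a real API.

```python
# Rough sketch of a Mixture-of-Agents aggregation step.
# `ask` is a hypothetical stand-in for a call to an LLM endpoint.

def ask(model_name, prompt):
    # Placeholder: a real implementation would call a hosted model here.
    return f"[{model_name}] draft answer to: {prompt[:60]}..."

def mixture_of_agents(prompt, proposers, aggregator):
    # Collect independent drafts, then ask an aggregator model to
    # synthesise the best final answer from the candidates.
    drafts = [ask(m, prompt) for m in proposers]
    synthesis_prompt = (
        prompt
        + "\n\nCandidate answers:\n"
        + "\n".join(f"- {d}" for d in drafts)
        + "\n\nSynthesise the single best final answer from the candidates."
    )
    return ask(aggregator, synthesis_prompt)

print(mixture_of_agents(
    "Solve: integrate x * exp(x) dx",
    proposers=["hermes-70b", "hermes-70b", "hermes-70b"],
    aggregator="hermes-70b",
))
```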
The Dataset Part
One possible reason behind the recent surge in reasoning models is the availability of dedicated reasoning datasets. For example, MetaMathQA is a mathematical reasoning dataset that offers comprehensive problem-solving scenarios with high-quality annotations.
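For readers who want to inspect such data, the dataset is published on Hugging Face and can be loaded with the `datasets` library. The snippet below assumes the public `meta-math/MetaMathQA` dataset card; field names may differ, so check the card if the keys do not match.

```python
# Load a mathematical reasoning dataset for inspection.
# Assumes `pip install datasets` and network access to Hugging Face.
from datasets import load_dataset

ds = load_dataset("meta-math/MetaMathQA", split="train")

example = ds[0]
print(example["query"])     # the maths problem
print(example["response"])  # the step-by-step solution
```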
The Polymathic AI team has released two enormous datasets totalling 115 terabytes, dwarfing the 45 terabytes used to train GPT-3. These datasets cover diverse scientific fields and are crucial for training multidisciplinary AI models.
One of the datasets, called the Well, contains 15 terabytes of data from 16 diverse datasets, focusing on modelling partial differential equations. This emphasis could lead to breakthroughs in various scientific domains.
On the other hand, the Marco-o1 dataset collection stands out as a recent innovation. It has three essential components, totalling over 60,000 samples. These include a filtered Open-O1 CoT Dataset and a synthetically generated Marco-o1 CoT Dataset specifically designed to enhance structured reasoning patterns.
For images, we have the ReMI dataset, which focuses on multi-image reasoning capabilities. It covers diverse reasoning domains, including mathematics, physics, logic, code comprehension, and spatio-temporal reasoning. This dataset particularly addresses the growing need to evaluate language models’ ability to reason across multiple visual inputs.
Benchmarks also play a crucial role here, and recent reasoning models have significantly benefited from comprehensive benchmark collections that test various aspects of reasoning capabilities.
The Rainbow framework is a prominent example, unifying six critical commonsense reasoning benchmarks that span social and physical reasoning domains. This collection includes aNLI for abductive reasoning, Cosmos QA for reading comprehension, HellaSWAG for situation prediction, Physical IQa for physical interaction understanding, Social IQa for social commonsense, and WinoGrande for Winograd-style pronoun resolution.
While the industry buzzes with excitement about reasoning models, the reality appears more nuanced.
Apple’s researchers recently noted that LLMs likely perform a form of probabilistic pattern matching, searching for the closest data seen during training, without a proper understanding of the underlying concepts. This suggests that while we’re seeing impressive advances in pattern recognition and problem-solving, true AI reasoning remains an elusive goal.
The journey toward genuine AI reasoning continues, marked by both breakthrough moments and sobering reality checks.