OpenAI’s progress from GPT-4 to Orion has slowed, The Information reported recently. According to the report, although OpenAI has completed only 20% of Orion’s training, the model is already on par with GPT-4 in intelligence, task fulfilment, and question-answering abilities. While Orion outperforms previous models, the quality improvement is less dramatic than the leap from GPT-3 to GPT-4.
This led many to wonder—Have LLM improvements hit a wall? No one seemed more thrilled about it than the most celebrated AI critic, Gary Marcus, who promptly posted on X, “Folks, game over. I won. GPT is hitting a period of diminishing returns, just like I said it would.”
However, it appears Uncle Gary may have celebrated a bit too early. One of the article’s authors quickly responded to Marcus, clarifying, “With all due respect, the article introduces a new AI scaling law that could replace the old one. The sky isn’t falling.”
Similarly, OpenAI researchers were quick to correct the narrative, asserting that the article portrays the progress of OpenAI’s upcoming models inaccurately, or at least misleadingly.
“There are now two key dimensions of scaling for models like the o1 series—training time and inference time,” said Adam Goldberg, a founding member of OpenAI’s go-to-market (GTM) team. He explained that while traditional scaling laws focusing on pre-training larger models for longer are still relevant, there’s now another important factor.
“[This] aspect of scale remains foundational. However, the introduction of this second scaling dimension is set to unlock amazing new capabilities,” he added.
He was elaborating on OpenAI researcher Noam Brown’s earlier statement claiming that o1 is trained with reinforcement learning (RL) to “think” before responding via a private chain of thought. “The longer it thinks, the better it performs on reasoning tasks,” he had said. This, Brown explained, introduces a new dimension to scaling. “We’re no longer bottlenecked by pretraining. We can now scale inference compute as well,” he added.
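To picture what scaling inference compute means in practice, here is a minimal sketch, not OpenAI’s implementation: a hypothetical ask_with_thinking_budget helper that prompts a model to reason in a private scratchpad up to a token budget before emitting a final answer. The call_model stub stands in for any text-generation API.

```python
# Minimal sketch of sequential test-time scaling (illustrative only):
# give the model a larger "thinking" budget, keep the chain of thought
# private, and return only the final answer.

def call_model(prompt: str, max_tokens: int) -> str:
    # Hypothetical stand-in for a real text-generation API call.
    return "...private reasoning...\nFINAL ANSWER: 42"

def ask_with_thinking_budget(question: str, thinking_tokens: int) -> str:
    prompt = (
        "Reason step by step in a private scratchpad, then write "
        "'FINAL ANSWER:' followed by your answer.\n\n" + question
    )
    # A larger budget means more sequential test-time compute.
    completion = call_model(prompt, max_tokens=thinking_tokens + 50)
    # Discard the scratchpad; only the final answer is shown to the user.
    return completion.rsplit("FINAL ANSWER:", 1)[-1].strip()

# The claim is that accuracy on hard reasoning tasks improves as the
# thinking budget grows, independently of any further pretraining.
for budget in (256, 1024, 4096):
    print(budget, ask_with_thinking_budget("What is 6 * 7?", budget))
```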
Jason Wei, also a researcher at OpenAI, defended o1 and explained the difference in the chain of thought before and after o1. He explained that the traditional chain-of-thought reasoning used by AI models like GPT was more of a mimicry than a true “thinking” process. He said the model would often reproduce reasoning paths it encountered during its pretraining, like solutions to math problems or other tasks.
He added that the o1 system introduces a more robust and authentic “thinking” process. In this paradigm, the chain of thought reflects more of an internal reasoning process, similar to how humans think. He explained that instead of simply spitting out an answer, the model engages in an “inner monologue” or “stream of consciousness,” where it actively considers and evaluates options.
“You can see the model backtracking; it says things like ‘alternatively, let’s try’ or ‘wait, but’,” he added. This back-and-forth process is a more dynamic and thoughtful approach to solving problems.
“People underestimate how powerful test-time compute is: compute for longer, in parallel, or fork and branch arbitrarily—like cloning your mind 1,000 times and picking the best thoughts,” said Peter Welinder, VP of product at OpenAI.
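Welinder’s “fork and branch” framing maps onto the simplest form of parallel test-time compute: sample many candidate answers and keep the best one. Here is a rough sketch under that assumption, with a hypothetical sample_answer stub and a plain majority vote standing in for a learned verifier.

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Hypothetical stand-in for one stochastic model call (temperature > 0).
    return random.choice(["42", "42", "41"])  # toy, noisy answer distribution

def best_of_n(question: str, n: int = 1000) -> str:
    # "Cloning your mind 1,000 times": draw n independent answers in parallel...
    answers = [sample_answer(question) for _ in range(n)]
    # ...then "pick the best thoughts", here via a simple majority vote.
    return Counter(answers).most_common(1)[0][0]

print(best_of_n("What is 6 * 7?"))
```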
Earlier, when OpenAI released o1-mini and o1-preview, they mentioned in their blog post that o1’s performance consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).
Regarding inference time scaling, they said, “The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.”
It appears that OpenAI has exhausted the data available for pre-training its models and is now exploring new methods to improve o1. According to The Information’s report, Orion was partially trained on AI-generated data (or synthetic data) produced by other OpenAI models, including GPT-4 and the recently released reasoning models.
Jensen to the Rescue
When NVIDIA CEO Jensen Huang recently said “We Are Going to Take Everybody with Us,” he really meant it. In a recent podcast with No Priors, Huang shared that one of the major challenges NVIDIA is currently facing in computing is inference time scaling, which involves generating tokens at incredibly low latency.
Huang explained that, in the future, AI systems will need to perform tasks like tree search, chain of thought, and mental simulations, reflecting on their own answers. The model would prompt itself and generate text internally, all while responding in real time, ideally within a second. This approach subtly points to the capabilities of the o1 system.
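One way to picture the latency constraint Huang describes is an anytime reasoning loop: keep refining an internal draft until a real-time deadline expires. This is a toy sketch, not NVIDIA’s or OpenAI’s design, with a hypothetical refine step standing in for the model prompting itself (tree search, reflection, simulation).

```python
import time

def refine(draft: str) -> str:
    # Hypothetical stand-in for one round of the model prompting itself
    # (tree search, self-reflection, mental simulation) to improve a draft.
    time.sleep(0.001)  # simulated per-round generation latency
    return draft + "*"

def answer_within(question: str, deadline_s: float = 1.0) -> str:
    start = time.monotonic()
    draft, rounds = f"first guess for: {question}", 0
    # Spend inference compute only while the real-time budget allows; how many
    # rounds fit depends on how fast tokens can be generated (Huang's point).
    while time.monotonic() - start < deadline_s:
        draft, rounds = refine(draft), rounds + 1
    return f"{draft}  ({rounds} refinement rounds fit in the budget)"

print(answer_within("plan a route home", deadline_s=0.05))
```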
AGI Coming Soon?
While others remain uncertain, OpenAI chief Sam Altman is confident that artificial general intelligence (AGI) is closer than many think. In a recent interview with Y Combinator’s Garry Tan, Altman suggested that AGI could emerge as soon as 2025. “I think we are going to get there faster than people expect,” he said, underscoring OpenAI’s accelerated progress.
Further, he said OpenAI had fewer resources than DeepMind and others. “So we said, ‘Okay, they are going to try a lot of things and we have just got to pick one and really concentrate’,” he added.
“I’ve heard people claim that Sam is just drumming up hype, but from what I’ve seen everything he’s saying matches the median view of OpenAI researchers on the ground,” said Brown.
OpenAI has yet to release o1 fully. It may not perform well in math and coding at this stage, but that doesn’t mean it won’t improve over time. Many believe that o1 could be the first commercial application of System 2 thinking.
In Epoch AI’s FrontierMath benchmark, which tests LLMs on some of the hardest unpublished problems in mathematics, it was revealed that LLMs successfully solved only around 2% of the problems. While all models performed poorly, o1-preview showed a positive sign, as it was able to solve some problems correctly and consistently across repeated tests.
Apple recently published a paper titled ‘Understanding the Limitations of Mathematical Reasoning in Large Language Models’, which argued that current LLMs can’t reason. The researchers introduced GSM-Symbolic, a new benchmark for testing mathematical reasoning in LLMs, arguing that GSM8K is no longer accurate or reliable enough for evaluating the reasoning abilities of LLMs.
Surprisingly, on this benchmark, OpenAI’s o1 demonstrated “strong performance on various reasoning and knowledge-based benchmarks”, according to the researchers. However, its performance dropped by 30% when the researchers introduced the GSM-NoOp experiment, which involved adding irrelevant information to the questions.
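To make the GSM-NoOp idea concrete, here is a hedged, toy illustration (the paper’s actual templates differ): append a clause that changes nothing numerically and check whether the model’s answer shifts.

```python
# Toy illustration of a GSM-NoOp-style perturbation (not the paper's exact
# templates): add irrelevant information that should not change the answer.

base_question = (
    "Liam picks 8 apples on Monday and 5 apples on Tuesday. "
    "How many apples does he have?"
)
noop_clause = " Three of the apples were slightly smaller than average."

perturbed_question = base_question + noop_clause

# The correct answer (13) is identical for both versions, so any accuracy
# drop on the perturbed set points to pattern matching rather than reasoning.
for q in (base_question, perturbed_question):
    print(q)
```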
N-o1 Saw This Coming, Not Even OpenAI
Subbarao Kambhampati, a computer science and AI professor at Arizona State University, said that some of the claims about LLMs being capable of reasoning are “exaggerated”. He argued that LLMs need additional tools to handle System 2 tasks (reasoning), and that techniques like fine-tuning or chain of thought are not adequate for this.
“When we develop AI systems that can actually reason, they will involve deep learning (as one of two major components, the other being discrete search). Some people might argue that this ‘proves’ deep learning can reason,” said François Chollet, the creator of Keras. “But that’s not true. It will prove that deep learning alone isn’t enough and that we need to combine it with discrete search,” Chollet added.
Pointing to the inclusion of Gemini in AlphaProof, he described it as “basically cosmetic and for marketing purposes”. He argued that this reflects a wider trend—using the ‘LLM’ brand name as a blanket term for all AI progress, even though much of it is unrelated to LLMs.
When OpenAI released o1, claiming that the model thinks and reasons, Hugging Face CEO Clem Delangue was not impressed. “Once again, an AI system is not ‘thinking’; it’s ‘processing,’ ‘running predictions’… just like Google or computers do,” said Delangue, adding that OpenAI is “selling cheap snake oil”.
However, all is not lost for OpenAI. Google DeepMind recently published a paper titled ‘Chain of Thought Empowers Transformers to Solve Inherently Serial Problems’. While sharing the research on X, Denny Zhou mentioned, “We have mathematically proven that Transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.”
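A toy way to picture the result, not taken from the paper: a task that requires many sequential steps, such as the running parity of a bit string, becomes a chain of trivial steps once the model is allowed to write one intermediate result per step instead of producing the answer in a single forward pass.

```python
# Toy illustration (not from the DeepMind paper): an inherently serial task
# becomes easy when intermediate results can be emitted one "token" at a time.

def parity_with_intermediate_tokens(bits: str) -> int:
    state = 0
    trace = []  # plays the role of the chain-of-thought tokens
    for b in bits:
        state ^= int(b)           # one trivial step per emitted token
        trace.append(str(state))  # write the intermediate result out
    print("chain of thought:", " ".join(trace))
    return state                  # the final token is the answer

print(parity_with_intermediate_tokens("1101001"))
```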
Zhou’s claim echoes AI researcher Andrej Karpathy’s recent remarks on next-token prediction frameworks, suggesting that they could become a universal tool for solving a wide range of problems, far beyond text or language alone.