As OpenAI’s ‘12 days of shipmas’ comes to a close, the company soft-announced AGI through the introduction of the next-generation frontier models o3 and o3 Mini. These models achieve state-of-the-art performance, nearing 90%, on the ARC-AGI benchmark, surpassing human performance.
Much has changed in a span of one month. In November, Sam Altman hinted that they might have achieved this benchmark internally. However, Francois Chollet, the creator of ARC-AGI benchmark, disregarded this claim as premature. Yesterday, with the ‘o’ family of models virtually saturated the benchmark, with the ARC team announcing a newer, upgraded evaluation.
Although not yet publicly available, these frontier models will now be accessible to researchers for public safety testing. o3 Mini is slated for release in January 2025, with o3 to follow shortly after.
“We view this as sort of the beginning of the next phase of AI,” said Altman on the livestream.
But Chollet opines that OpenAI is still not there with AGI. “While the new model is very impressive and represents a big milestone on the way towards AGI, I don’t believe this is AGI – there’s still a fair number of very easy ARC-AGI-1 tasks that o3 can’t solve, and we have early indications that ARC-AGI-2 will remain extremely challenging for o3,” Chollet posted on X.
While it was widely awaited that OpenAI would announce the AGI during the 12-days of shipmas, Altman has tread cautiously with a soft announcement as it would disrupt the existing clause in the contract with its lead investor, Microsoft, which would then cease access to openAI’s technology. Also, announcing AGI would mean more scrutiny and tickle competitors like Google and Anthropic.
Architecture is Everything
Companies are actively going to scale reasoning capabilities in the coming year. Google recently released Gemini 2.0 Flash Thinking with advanced reasoning capabilities, alongside showcasing its thoughts.
This joins Chinese models Qwen and DeepSeek. Besides, Meta has hinted at releasing reasoning models next year, with xAI’s Grok and Anthropic expected to follow.
OpenAI researchers are heavily betting on the Reinforcement Learning (RL) architecture to further this new paradigm of reasoning, aligning with OpenAI co-founder Ilya Sutskever’s claim that the era of pretraining has officially ended.
“o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on the chain of thought to scale inference compute. Way faster than pretraining the paradigm of a new model every 1-2 years,” OpenAI’s Jason Wei said on X.
Interestingly, the RL technique aligns closely with Google DeepMind’s expertise. “While o3 is very impressive, I feel like the test time inference/RL models play perfectly into Google’s strength,” said Finbarr Timbers, former researcher at Google Deepmind.
o3 Beats the ARC-AGI Benchmark
OpenAI skipped the name “o2” to avoid trademark concerns with an existing telephone company with the same name. It scaled from 0-87.5%, from GPT2 to o3 in a span of five years. It scored 75.7% on the ARC-AGI semi-private set under standard compute conditions. With high-compute settings, it reached 87.5%, surpassing the 85% human-level performance threshold.
The ARC team noted that o3 is the costliest model at test-time but marks a new era where greater compute unlocks extraordinary performance.
“My personal expectation is that token prices will fall and that the most important news here is that we now have methods to turn test-time compute into improved performance up to a very large scale,” shared Nat McAleese from OpenAI’s research team.
Human Coding is a Thing of the Past
The o3 model also in software engineering benchmarks, achieving 71.7% accuracy on SWE Bench Verified, a 20% improvement over its predecessor, o1. This benchmark focuses on real-world coding tasks. With this new milestone, human software engineering is a thing of the past.
On the Epic AI Frontier Math Benchmark, regarded as the toughest mathematical test available, o3 achieved an impressive 25% accuracy, a huge leap from the SOTA 2%. This benchmark includes novel, unpublished problems that challenge professional mathematicians.
OpenAI’s o3 ranks 2727 on Codeforces, equal to the 175th best human coder worldwide. “This is an absolutely superhuman result for AI and technology at large,” shared VC analyst Deedy Das on X.
In addition to these benchamrks, the team showed that o3 Mini supports API features like function calling, structured outputs, and developer messages.
A demo on the livestream showed o3 Mini creating a ChatGPT-like UI to self-evaluate itself on GPQA, generating a Python script, processing inputs, and grading its performance.
Safety in the Age of Acceleration
Altman stressed that as their models get more and more capable, safety testing will be taken even more seriously. To this end, OpenAI is also opening public safety testing for researchers.
OpenAI also introduced the concept of deliberative alignment, a new safety technique that uses o3’s advanced reasoning capabilities to identify and reject unsafe prompts more effectively. This approach has demonstrated significant improvements in both rejection accuracy and over-refusal rates, enabling the model to detect subtle user intents designed to bypass safety mechanisms.
Anthropic, too, released research on this. “AI models will get extremely good at deceiving humans if we teach them to lie,” said the newly appointed AI Czar David Sacks on the need for trust and safety.
Incubators like Y Combinator are also increasingly funding startups that solve for a post-AGI world. These include government software, public safety, US manufacturing with AI and robotics, LLM chip design, space tech, human-centric jobs, and energy-efficient computing, among others.
YC chief Garry Tan urged that in this new reality, actual dedication to craft will take center stage. “Actually make something people want. Software and coding won’t be the gating factor,” he said.
On the whole, systemic changes such as Universal Basic Income (UBI) and Universal Basic Compute (UBC) will be the foundation for this new reality – where GDP will grow because of AI, and not extra work hours. With the ongoing progress in robotics, Universal Basic Robot (UBR) is also beginning to become a huge theme for 2025.
The post OpenAI soft-launches AGI with o3 models, Enters Next Phase of AI appeared first on Analytics India Magazine.