OpenAI Soft-launches AGI with o3 Models, Enters Next Phase of AI

As OpenAI’s ‘12 days of shipmas’ came to a close, the company soft-announced AGI by introducing its next-generation frontier models, o3 and o3 Mini. o3 achieves state-of-the-art performance on the ARC-AGI benchmark, scoring 87.5% in its high-compute configuration, above the 85% human-level threshold.

Much has changed in the span of a month. In November, Sam Altman hinted that OpenAI might have beaten this benchmark internally. However, Francois Chollet, the creator of the ARC-AGI benchmark, dismissed the claim as premature. Yesterday, with the ‘o’ family of models virtually saturating the benchmark, the ARC team announced a newer, upgraded evaluation, ARC-AGI-2.

Although not yet publicly available, these frontier models will now be accessible to researchers for public safety testing. o3 Mini is slated for release in January 2025, with o3 to follow shortly after.

“We view this as sort of the beginning of the next phase of AI,” said Altman on the livestream.

But Chollet opines that OpenAI is still not there with AGI. “While the new model is very impressive and represents a big milestone on the way towards AGI, I don’t believe this is AGI – there’s still a fair number of very easy ARC-AGI-1 tasks that o3 can’t solve, and we have early indications that ARC-AGI-2 will remain extremely challenging for o3,” Chollet posted on X.

While OpenAI was widely expected to announce AGI during the 12 days of shipmas, Altman has moved cautiously with a soft announcement. A formal declaration would trigger a clause in OpenAI’s contract with its lead investor, Microsoft, under which Microsoft would lose access to OpenAI’s technology. Announcing AGI would also invite more scrutiny and provoke competitors like Google and Anthropic.

Scaling the Right Architecture Is All You Need

Companies are expected to actively scale reasoning capabilities in the coming year. Google recently released Gemini 2.0 Flash Thinking, a model with advanced reasoning capabilities.

It joins reasoning models from China’s Qwen and DeepSeek. Meta has also hinted at releasing reasoning models next year, with xAI’s Grok and Anthropic expected to follow.

OpenAI researchers are betting heavily on reinforcement learning (RL) to further this new paradigm of reasoning.

“o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on the chain of thought to scale inference compute. Way faster than the pretraining paradigm of a new model every 1-2 years,” OpenAI’s Jason Wei said on X.

Interestingly, the RL technique aligns closely with Google DeepMind’s expertise. “While o3 is very impressive, I feel like the test time inference/RL models play perfectly into Google’s strength,” said Finbarr Timbers, a former researcher at Google DeepMind.

o3 Beats the ARC-AGI Benchmark

OpenAI skipped the name “o2” to avoid trademark conflicts with an existing telecom company of the same name. On ARC-AGI, OpenAI’s models went from 0% with GPT-2 to 87.5% with o3 in a span of five years. o3 scored 75.7% on the ARC-AGI semi-private set under standard compute conditions. With high-compute settings, it reached 87.5%, surpassing the 85% human-level performance threshold.

The ARC team noted that o3 is the costliest model at test-time but marks a new era where greater compute unlocks extraordinary performance.

“My personal expectation is that token prices will fall and that the most important news here is that we now have methods to turn test-time compute into improved performance up to a very large scale,” shared Nat McAleese from OpenAI’s research team.

Every Benchmark Will Be Saturated

The o3 model also excels in software engineering benchmarks, achieving 71.7% accuracy on SWE-bench Verified, more than 20 percentage points higher than its predecessor, o1. This benchmark focuses on real-world coding tasks. The milestone has fuelled speculation that human software engineering could soon become a thing of the past.

On Epoch AI’s FrontierMath benchmark, regarded as the toughest mathematical test available, o3 achieved an impressive 25% accuracy, a huge leap from the previous state of the art of 2%. This benchmark includes novel, unpublished problems that challenge professional mathematicians.

OpenAI’s o3 achieved a rating of 2727 on Codeforces, equivalent to the 175th best human coder worldwide. “This is an absolutely superhuman result for AI and technology at large,” shared VC analyst Deedy Das on X.

In addition to these benchmarks, the team showed that o3 Mini supports API features like function calling, structured outputs, and developer messages.

A demo on the livestream showed o3 Mini creating a ChatGPT-like UI to evaluate itself on GPQA: it generated a Python script, processed the inputs, and graded its own performance.

Safety in the Age of Acceleration

Altman stressed that as their models get more and more capable, safety testing will be taken even more seriously. To this end, OpenAI is also opening public safety testing for researchers.

OpenAI also introduced the concept of deliberative alignment, a new safety technique that uses o3’s advanced reasoning capabilities to identify and reject unsafe prompts more effectively.

Anthropic, too, recently released research on AI safety. “AI models will get extremely good at deceiving humans if we teach them to lie,” said the newly appointed AI Czar David Sacks on the need for trust and safety.

Incubators like Y Combinator are also increasingly funding startups that are building for a post-AGI world. The areas include government software, public safety, US manufacturing with AI and robotics, LLM chip design, space tech, human-centric jobs, and energy-efficient computing, among others.

YC chief Garry Tan said that in this new reality, actual dedication to craft will take centre stage. “Actually make something people want. Software and coding won’t be the gating factor,” he said.

On the whole, systemic changes such as Universal Basic Income (UBI) and Universal Basic Compute (UBC) are being pitched as the foundation for this new reality, one where GDP grows because of AI rather than extra work hours. With the ongoing progress in robotics, Universal Basic Robot (UBR) is also emerging as a major theme for 2025.

With this, OpenAI hinted that it is just getting started. ASI does not feel very far away.

The post OpenAI Soft-launches AGI with o3 Models, Enters Next Phase of AI appeared first on Analytics India Magazine.