OpenAI’s o3: AI Benchmark Discrepancy Reveals Gaps in Performance Claims

FrontierMath accuracy for OpenAI’s o3 and o4-mini compared to leading models. Image: Epoch AI

The latest results from FrontierMath, a benchmark test for generative AI on advanced math problems, show OpenAI’s o3 model performed worse than OpenAI initially stated. While newer OpenAI models now outperform o3, the discrepancy highlights the need to scrutinize AI benchmarks carefully.

Epoch AI, the research institute that created and administers the test, released its latest findings on April 18.

OpenAI claimed 25% completion of the test in December

Last year, the FrontierMath score for OpenAI o3 was part of the nearly overwhelming number of announcements and promotions released during OpenAI’s 12-day holiday event. The company claimed OpenAI o3, then its most powerful reasoning model, had solved more than 25% of the problems on FrontierMath. By comparison, most rival AI models scored around 2%, according to TechCrunch.

SEE: For Earth Day, organizations could factor generative AI’s power consumption into their sustainability efforts.

On April 18, Epoch AI released test results showing OpenAI o3 scored closer to 10%. So why is there such a big difference? Both the model and the test may have been different back in December. The version of OpenAI o3 submitted for benchmarking last year was a prerelease build, and FrontierMath itself has changed since December, with a different number of math problems. This isn’t necessarily a reason to distrust benchmarks; instead, remember to dig into the version numbers, as the sketch below illustrates.
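To make the versioning point concrete, here is a minimal Python sketch of how the same headline metric can shift when both the model build and the problem set change. The model labels, problem counts, and solve counts are hypothetical, chosen only so the arithmetic lands near the reported 25% and 10% figures:

```python
# Minimal sketch: a benchmark score depends on both the model version
# and the benchmark version. All names and numbers below are hypothetical,
# chosen only to illustrate why a headline score can shift between runs.

from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkRun:
    model_version: str      # e.g., a prerelease build vs. the shipped model
    benchmark_version: str  # e.g., the December vs. April problem set
    problems_total: int     # size of the problem set in that version
    problems_solved: int

    @property
    def accuracy(self) -> float:
        return self.problems_solved / self.problems_total

runs = [
    # Hypothetical: a prerelease model on a smaller December problem set.
    BenchmarkRun("o3-prerelease", "2024-12", problems_total=180, problems_solved=46),
    # Hypothetical: the released model on a larger, revised problem set.
    BenchmarkRun("o3-release", "2025-04", problems_total=290, problems_solved=29),
]

for run in runs:
    print(f"{run.model_version} on {run.benchmark_version}: "
          f"{run.accuracy:.1%} ({run.problems_solved}/{run.problems_total})")

# Prints ~25.6% vs. 10.0% -- the same model family, but a different
# build and a different problem set, yields a very different score.
```

A leaderboard entry that records only "o3: 25%" erases exactly the two variables that explain the gap, which is why pinning both versions matters when comparing results.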

OpenAI o4-mini and o3-mini score highest on new FrontierMath results

The updated results show OpenAI o4-mini with reasoning performed best, scoring between 15% and 19%. It was followed by OpenAI o3-mini, with o3 in third. Other rankings include:

  • OpenAI o1
  • Grok-3 mini
  • Claude 3.7 Sonnet (16K)
  • Grok-3
  • Claude 3.7 Sonnet (64K)

Although Epoch AI administers the test independently, OpenAI initially commissioned FrontierMath and owns its content.

Criticisms of AI benchmarking

Benchmarks are a common way to compare generative AI models, but critics say the results can be influenced by test design or a lack of transparency. A July 2024 study raised concerns that benchmarks often overemphasize narrow task accuracy and suffer from non-standardized evaluation practices.
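As a toy illustration of the narrow-accuracy criticism (the models, categories, and scores below are all invented), a single aggregate number can make two very different models look identical:

```python
# Toy illustration (hypothetical models and numbers): a single aggregate
# accuracy can hide large per-category differences between models.

scores = {
    "model_a": {"algebra": 0.90, "number_theory": 0.10, "geometry": 0.50},
    "model_b": {"algebra": 0.50, "number_theory": 0.50, "geometry": 0.50},
}

for model, by_category in scores.items():
    aggregate = sum(by_category.values()) / len(by_category)
    detail = ", ".join(f"{cat}: {acc:.0%}" for cat, acc in by_category.items())
    print(f"{model}: aggregate {aggregate:.0%} ({detail})")

# Both models report a 50% aggregate, yet model_a is far stronger in
# algebra and far weaker in number theory -- a difference a one-number
# leaderboard never surfaces.
```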
