OpenAI o1 Can’t Do Maths, But Excels at Making Excuses

A few days ago, Epoch AI released FrontierMath, a new benchmark to evaluate the mathematical capabilities of large language models.

The results revealed a startling low: the LLMs all performed poorly at maths, albeit on problems far harder than anything they had been benchmarked on before.

The effectiveness of benchmarks has long been debated. In a research paper, Apple argued that despite their benchmark scores, LLMs aren’t genuinely good at mathematical reasoning; their output results from pattern recognition and the replication of steps seen in training data.

Even OpenAI has said it does not want to benchmark o1 on MATH and GSM8K, since these evaluations are now outdated and most LLMs easily achieve high scores on them. “Recent frontier models do so well on MATH and GSM8K that these benchmarks are no longer effective at differentiating models,” said OpenAI in a blog post.

In light of such concerns, FrontierMath tasks LLMs with solving mathematical problems of unprecedented difficulty. According to Epoch AI, these problems demand hours of work from human scientists and mathematicians.

Moreover, the problems in the benchmark are all new and unpublished, alleviating concerns of ‘contamination’, where test problems leak into a model’s training data. They were developed in collaboration with 60 mathematicians.

So, how does the benchmark work exactly, and what does it say about LLMs’ capabilities today?

Long Live Mathematicians

If there is any evidence that LLMs remain years behind human intelligence, FrontierMath is it. The benchmark results show that the LLMs solved a mere 2% of the problems correctly.

On the other hand, LLMs solve over 60% of the problems on benchmarks like Omni-MATH, MathVista, and GSM8K.

“Each problem demands hours of work from expert mathematicians. Even the most advanced AI systems today, including GPT-4 and Gemini, solve less than 2% of them,” revealed Epoch AI.

Several mathematicians praised the benchmark, describing it as one of the most challenging problem sets they had encountered.
“To understand expert perspectives on FrontierMath’s difficulty and relevance, we interviewed several prominent mathematicians…They unanimously characterised the problems as exceptionally challenging, requiring deep domain expertise and significant time investment to solve,” mentioned Epoch AI in the research paper.

“These are extremely challenging. I think that in the near term, basically, the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages,” Terence Tao, the Fields Medalist in 2006, said.

Moreover, Epoch AI said that testing LLMs on mathematical benchmarks this demanding may be a better way to assess their overall capabilities, since it avoids the subjective judgement that several other evaluation methods rely on.

“To understand and measure the progress in artificial intelligence, we need carefully designed benchmarks that can assess how well AI systems engage in complex scientific reasoning.

“Mathematics offers a unique opportunity for this assessment—it requires extended chains of precise reasoning, with each step building exactly on what came before,” said Epoch AI in the research paper.

The test problems have integer answers, and the solutions were verified automatically using Python scripts. Epoch AI also claims that the problems are “guess proof”, meaning each one has to be fully solved to arrive at the correct answer.

“As a rule of thumb, we require that there should not be a greater than 1% chance of guessing the correct answer without doing most of the work that one would need to do to “correctly” find the solution,” said Epoch AI.
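To illustrate what this kind of automated grading could look like, here is a minimal Python sketch of an integer-answer verifier. The function names, the regex-based answer extraction and the sample answers are assumptions made purely for illustration; Epoch AI has not published its verification scripts here.

```python
# A minimal, illustrative sketch of integer-answer verification (assumption:
# this is NOT Epoch AI's actual grading code). Each problem stores one exact
# integer as its ground truth; a submission counts only on an exact match.
import re


def extract_integer_answer(model_output):
    """Return the last integer found in the model's output, or None."""
    matches = re.findall(r"-?\d+", model_output.replace(",", ""))
    return int(matches[-1]) if matches else None


def verify(model_output, ground_truth):
    """Exact-match check: partial credit and approximations count as wrong."""
    answer = extract_integer_answer(model_output)
    return answer is not None and answer == ground_truth


if __name__ == "__main__":
    # Hypothetical problem whose stored answer is 9240.
    print(verify("After simplification, the final answer is 9240.", 9240))  # True
    print(verify("The answer is roughly 9000.", 9240))                      # False
```

An exact-match check of this kind is what makes large, “guess proof” integer answers attractive: there is no partial credit and no room for a grader’s judgement.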

“I think they will resist AI for several years at least,” said Tao, asserting that we’re years away from a powerful LLM that can solve these problems.

Interestingly, Andrej Karpathy, founder of Eureka Labs, took to X and compared the benchmark to Moravec’s paradox. “This is Moravec’s paradox in disguise, which observed 30+ years ago that what is easy/hard for humans can be non-intuitively very different to what is easy/hard for computers,” he said.

Moravec's paradox in LLM evals
I was reacting to this new benchmark of frontier math where LLMs only solve 2%. It was introduced because LLMs are increasingly crushing existing math benchmarks. The interesting issue is that even though by many accounts (/evals), LLMs are inching… https://t.co/3Ebm7MWX1G

— Andrej Karpathy (@karpathy) November 10, 2024

o1 Did Win an Important Challenge

While OpenAI claims that o1 is its best LLM to date, it did not perform well on this mathematical benchmark, much like its showing on coding benchmarks. Claude 3.5 and Gemini 1.5 Pro beat o1 in the results, but their performance wasn’t notable either.

As mentioned, none of these models were able to solve more than 2% of the problems.

AI skeptics: LLMs are copy-paste engines, incapable of original thought, basically worthless.
Professionals who track AI progress: We've worked with 60 mathematicians to build a hard test that modern systems get 2% on. Hope this benchmark lasts more than a couple of years. pic.twitter.com/zEw5Kd9F5N

— Jack Clark (@jackclarkSF) November 9, 2024

However, there is an important takeaway. To perform a fairer evaluation, the researchers re-ran the models repeatedly on the four problems that had been solved correctly at least once. They noted that o1-preview performed the strongest across these repeated trials.

“When re-evaluating these problems that were solved at least once, o1-preview demonstrated the strongest performance across repeated trials,” said Epoch AI in the research paper.

That is certainly a ray of hope. Perhaps o1’s stronger reasoning capabilities help it produce consistent output, preventing significant deviations across runs. Moreover, it will be interesting to see how o1 performs on FrontierMath once it is out of preview and released with its full capabilities. Or will it be overtaken by the likes of Gemini 2.0?

Epoch AI’s future plans include developing more such tests and implementing other methods for better assessment.

“For example, we will test the effects of increasing the token limit, allowing models to reason for longer and run more experiments per problem. We also plan to conduct multiple runs for each model-problem pair, enabling us to report statistics and confidence intervals across attempts,” Epoch AI wrote.
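As a rough illustration of what reporting statistics across attempts might involve, here is a short Python sketch that turns hypothetical per-problem pass/fail outcomes into pass rates with Wilson score confidence intervals. The problem names, attempt counts and outcomes are invented, and Epoch AI has not said which interval method it will use.

```python
# Illustrative sketch (assumption: not Epoch AI's actual analysis code) of how
# repeated attempts per problem could be turned into pass rates with a
# binomial confidence interval, here the Wilson score interval at ~95%.
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0, 0.0
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical per-problem outcomes: 1 = solved, 0 = failed, 8 attempts each.
attempts = {
    "problem_A": [1, 0, 1, 1, 0, 1, 0, 1],
    "problem_B": [0, 0, 0, 1, 0, 0, 0, 0],
}

for name, runs in attempts.items():
    low, high = wilson_interval(sum(runs), len(runs))
    rate = sum(runs) / len(runs)
    print(f"{name}: pass rate {rate:.2f}, 95% CI ({low:.2f}, {high:.2f})")
```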

However, assessing these models on such tough benchmarks isn’t everything. “I also think it’s an interesting challenge to create evals for all the ‘easy’ stuff that is secretly hard. Very long-context windows, coherence, autonomy, common sense, multimodal I/O that works…

“How do we build good ‘menial job’ evals? The kinds of things you’d expect from any entry-level intern on your team,” said Karpathy.
