OpenAI’s o3 is Genius, Scores 135 in Hardest IQ Test


OpenAI’s o3 model has emerged as the most cognitively capable AI system in a new benchmark test conducted by Voronoi, based on data from Tracking AI. The test, which uses Norway’s Mensa IQ test, a high-difficulty assessment typically reserved for human intelligence evaluation, placed o3 at an IQ score of 135, well above the human average of 90–110.

Other high scorers include Anthropic’s Claude-4 Sonnet at 127 and Google’s Gemini 2.0 Flash at 126.

The evaluation covered 24 major AI models, with the top positions mostly occupied by text-only models, while vision-enabled systems scored significantly lower.

GPT-4o with vision, for example, received an IQ score of 63, while Grok-3 Think (Vision) followed with 60.

These results suggest that while language-based reasoning capabilities in AI are rapidly surpassing human benchmarks, vision-based and multimodal systems still lag in abstract problem-solving tasks.

The test results raise important questions about how AI models are architected and trained, particularly when it comes to general intelligence versus domain-specific strengths.

Voronoi’s findings reflect a broader trend in AI development, where performance gains in language models continue to dominate, but genuine multimodal reasoning remains a key challenge.
Not Thinking, Yet

Coincidentally, in a new paper titled The Illusion of Thinking, researchers from the Cupertino-based company argued that even the most advanced AI models, including the so-called large reasoning models (LRMs), don’t actually think. Instead, they simulate reasoning without truly understanding or solving complex problems.

The paper, released just ahead of Apple’s Worldwide Developers Conference, examined leading AI models, including OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, using specially designed algorithmic puzzle environments rather than standard benchmarks.

The researchers argue that traditional benchmarks, like math and coding tests, are flawed due to “data contamination” and fail to reveal how these models actually “think”.

“We show that state-of-the-art LRMs still fail to develop generalisable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments,” the paper noted.

The post OpenAI’s o3 is Genius, Scores 135 in Hardest IQ Test appeared first on Analytics India Magazine.
