LLMs Hit a New Low on ARC-AGI-2 Benchmark, Pure LLMs Score 0%

ARC Prize, a non-profit organisation that evaluates the ability of AI models to display human-like intelligence, has announced the ARC-AGI-2 benchmark.

The new benchmark is a successor to the ARC-AGI benchmark launched a few years ago. Like its predecessor, it tests AI models on tasks that are relatively easy for humans but difficult for artificial systems.

The ARC-AGI-2 benchmark poses even greater challenges than its predecessor, as it factors in efficiency (cost per task) alongside performance. The tasks require AI models to interpret symbols beyond their visual patterns, simultaneously apply interrelated rules, and apply different rules depending on context.

The results revealed that AI models found all of the above tasks challenging. Non-reasoning models, or 'pure LLMs', scored 0% on the benchmark, while other publicly available reasoning models received single-digit percentage scores of less than 4%. In contrast, a human panel solving the tasks achieved a perfect score of 100%.

“AI systems are already superhuman in many specific domains (e.g., playing Go and image recognition). However, these are narrow, specialised capabilities. The ‘human-AI gap’ reveals what is missing for general intelligence: highly efficient acquisition of new skills,” the organisation said.

OpenAI’s unreleased o3 reasoning model achieved the highest score of 4.0%. On the earlier ARC-AGI-1 benchmark, it scored 75.7%. However, Sam Altman, CEO of OpenAI, has disclosed that it will not be released as a standalone model. Instead, o3’s reasoning capabilities will be integrated into a hybrid GPT-5 model.

Beyond that, there were no noteworthy scores from other AI models. Even the recently released Claude 3.7 Sonnet model, often considered the best model for coding, scored 0.7%, while the DeepSeek-R1 model scored 1.3%. The leaderboard also listed the cost (in USD) of performing each task.

Source: ARC Prize

“All other AI benchmarks focus on superhuman capabilities or specialised knowledge by testing ‘PhD++’ skills. ARC-AGI is the only benchmark that takes the opposite design choice by focusing on tasks that are relatively easy for humans, yet hard, or impossible, for AI,” the organisation added.

François Chollet, creator of Keras and a former Google researcher, is one of the creators of the ARC-AGI benchmark. He said it is “the only AI benchmark that measures progress towards general intelligence”.

Recently, Chollet, together with Zapier co-founder Mike Knoop, launched Ndea, a new research lab dedicated to developing artificial general intelligence (AGI).

The post LLMs Hit a New Low on ARC-AGI-2 Benchmark, Pure LLMs Score 0% appeared first on AIM.

