
Are artificial intelligence (AI) models really surpassing human ability? Or are current tests simply too easy for them?
On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity's Last Exam (HLE), a new academic benchmark aiming to "test the limits of AI knowledge at the frontiers of human expertise," Scale AI said in a release. The test consists of 3,000 text and multi-modal questions on more than 100 subjects like math, science, and humanities, submitted by experts in a variety of fields.
Also: Roll over, Darwin: How Google DeepMind's 'mind evolution' could enhance AI thinking
Anthropic's Michael Gerstenhaber, head of API technologies, noted to Bloomberg last fall that AI models frequently outpace benchmarks (part of why the Chatbot Arena leaderboard changes so quickly when new models are released). For example, many LLMs now score over 90% on massive multitask language understanding (MMLU), a commonly used benchmark. This is known as benchmark saturation.
By contrast, Scale reported that current models answered fewer than 10 percent of the HLE benchmark's questions correctly.
Researchers from the two organizations initially collected over 70,000 questions for HLE, narrowing them to 13,000 that were reviewed by human experts and then distilled once more into the final 3,000. They tested the questions on top models like OpenAI's o1 and GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro alongside the MMLU, MATH, and GPQA benchmarks.
"After I launched the MATH benchmark — a difficult competitors arithmetic dataset — in 2021, the very best mannequin scored lower than 10%; few predicted that scores greater than 90% could be achieved simply three years later," stated Dan Hendrycks, CAIS co-founder and government director. "Proper now, Humanity's Final Examination exhibits that there are nonetheless some knowledgeable closed-ended questions that fashions aren’t in a position to reply. We’ll see how lengthy that lasts."
Also: DeepSeek's new open-source AI model can outperform o1 for a fraction of the cost
Scale and CAIS gave contributors cash prizes for the top questions: $5,000 went to each of the top 50, while the next best 500 received $500. Though the final questions are now public, the two organizations kept another set of questions private to address "model overfitting," or when a model is so closely trained to a dataset that it's unable to make accurate predictions on new data.
The benchmark's creators note that they're still accepting test questions, but will no longer award cash prizes, though contributors are eligible for co-authorship.
CAIS and Scale AI plan to release the dataset to researchers so that they can further study new AI systems and their limitations. You can view the benchmark and sample questions at lastexam.ai.