The AI field welcomes a brand new benchmark: Humanity’s Last Exam (HLE), launched by the Center for AI Safety (CAIS) and Scale AI to test AI systems on expert-level knowledge. The dataset consists of 3,000 questions crowdsourced from 1,000 contributors across 500 institutions in 50 countries, including professors and PhD holders. It covers mathematics, the humanities, and the natural sciences in a multi-format approach that includes text, diagrams, and images.
The benchmark tested models like GPT-4o, Claude 3.5, and DeepSeek, with none scoring above 10%, revealing their struggle with complex, interdisciplinary problems. It also showed that DeepSeek R1 – a cheaper and less powerful open-source model – outperformed the full o1 model known for its reasoning abilities.

HLE was created to address “benchmark saturation,” where AI models excel on standard tests but fail on novel challenges.
“I wrote 5 questions in the new benchmark that even the top AI models score less than 10% on: Humanity’s Last Exam,” said Jeremy Nguyen on X.
The project involved contributors from diverse academic and research backgrounds. Summer Yue, Scale AI’s Director of Research, said the benchmark was designed to push AI models to their reasoning limits.
Benchmarks in the AGI era
“Starting to see new well-built hard benchmarks in AI since almost everything else has already been exceeded. We have this (with humanities questions), ARC-AGI 2, and Frontier Math. We also need some benchmarks for new knowledge creation rather than testing known problems,” wrote Wharton’s Ethan Mollick on X.
Last week, there were concerns about OpenAI’s involvement with FrontierMath. For context, in December, OpenAI announced its o3 models, reporting 25% accuracy on the Epoch AI FrontierMath benchmark, a significant improvement from the previous 2% achieved by other models.
Epoch AI recently clarified that OpenAI commissioned it to create 300 math questions for the FrontierMath benchmark. OpenAI owns these questions and has access to their statements and solutions, apart from a 50-question private holdout set.
The statement also noted that Epoch AI can evaluate and publish results on any model using the FrontierMath problem set but cannot share the questions or answers without OpenAI’s written permission.
“We can evaluate other models and have done so already. We will publish more results in the next few weeks, perhaps including DeepSeek’s,” Epoch’s Tamay Besiroglu clarified to AIM, addressing how FrontierMath approaches evaluating models from other companies.
Regarding the holdout set, Epoch AI explained that it is finalising a 50-question set for which OpenAI will only receive the problem statements, not the solutions.
AI evaluations largely remain underfunded, and harder benchmarks are essential as we progress towards AGI. “Going forward, we’ll ensure all contributors have access to information about commercial funding and data access agreements before participating, and proactively publicly disclose benchmark sponsorship and data access agreements,” read Epoch’s statement.