OpenAI’s o3 benchmark controversy is starting to look like a second Theranos: the company claimed record-breaking performance on Epoch AI’s FrontierMath benchmark while having access to much of the test data, and while funding the benchmark itself.
Epoch AI’s associate director, Tamay Besiroglu, admitted they were contractually restricted from disclosing OpenAI’s involvement, while six contributing mathematicians revealed they were unaware of the exclusive access.
Besiroglu said, “We made a mistake in not being more transparent about OpenAI’s involvement.” He revealed that the company was restricted from disclosing the partnership until the o3 model was launched.
“Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset. We own this mistake and are committed to doing better in the future,” he added.
Besiroglu also acknowledged that OpenAI had access to a large portion of the FrontierMath problems and solutions. However, an ‘unseen-by-OpenAI hold-out set’ helped verify the model’s capabilities.
“Six mathematicians who significantly contributed to the FrontierMath benchmark confirmed this is true – that they were unaware that OpenAI would have exclusive access to this benchmark (and others won’t). Most express they are not sure they would have contributed had they known,” revealed Carina Hong, a PhD candidate at Stanford, on X.
AI experts like Gary Marcus are questioning the legitimacy of OpenAI’s claims, comparing the situation directly to Theranos.
In December last year, when OpenAI announced its new o3 family of models, the company claimed that o3 achieved an impressive 25% accuracy on the Epoch AI FrontierMath benchmark. It was a huge leap over the previous high score of just 2% from other powerful models. The benchmark tasks LLMs with solving mathematical problems of unprecedented difficulty.
In an exclusive interaction with AIM earlier, Besiroglu revealed that Epoch AI significantly reduces data contamination issues by producing novel problems for the benchmark. He also said, “The [benchmark] data is private, so it’s not used for training.”
A user on LessWrong discovered that the latest version of FrontierMath’s research paper describing the benchmark included a footnote stating, “We gratefully acknowledge OpenAI for their support in creating the benchmark.”
Mikhail Samin, executive director at the AI Governance and Safety Institute, said on X that “OpenAI has a history of misleading behaviour – from deceiving its own board to secret non-disparagement agreements that former employees had to sign – so I guess this shouldn’t be too surprising.”
OpenAI also claimed the o3 model scored nearly 90% on the ARC-AGI benchmark, exceeding human performance. The benchmark is said to be the “only AI benchmark that measures progress towards general intelligence.” However, François Chollet, creator of the ARC-AGI benchmark, stated, “I don’t believe this is AGI – there are still easy ARC-AGI-1 tasks that o3 can’t solve.”
Since the model’s launch, Marcus has remained sceptical of the results. Earlier, he also said, “Not one person outside of OpenAI has evaluated o3’s robustness across different types of problems.”
Amid the benchmark controversy, OpenAI’s Sam Altman seems super excited to launch o3-mini in the coming weeks.
The post OpenAI Just Pulled a Theranos With o3 appeared first on Analytics India Magazine.