Language models still struggle with the truth, and for industries with regulatory or safety obligations, that is a serious liability. That is why open source AI lab Oumi released HallOumi, a model that analyzes LLM responses line by line, scoring each sentence for factual accuracy and backing its judgments with detailed rationales and citations.
Oumi launched earlier this year as “the Linux of AI,” positioning itself as a fully open source AI platform for building foundation models, with the aim of advancing frontier AI for both academia and the enterprise. The platform is developed collaboratively with 13 universities in the U.S. and U.K., including Caltech, MIT, and the University of Oxford.
In an interview with AIwire, Oumi CEO Manos Koukoumidis and co-founder and AI researcher Jeremy Greer walked through the motivation behind HallOumi and demonstrated how it works.
An Open Source Answer to the Trust Gap
The motivation behind HallOumi, according to Koukoumidis, stemmed from growing demand among enterprises for transparent and trustworthy AI systems, particularly in regulated industries. From the outset, Oumi positioned itself as a fully open source platform designed to make it easy for both enterprises and academic institutions to develop their own foundation models. But it was the wave of interest following the company’s recent launch that underscored just how urgent one issue had become: hallucinations.
Industries like finance and healthcare want to adopt large language models, Koukoumidis says, but hallucinations, or factually unsupported outputs, are holding them back. And the problem is not limited to externally facing applications. Even when used internally as copilots or summarizers, LLMs need to be trustworthy. Enterprises need a reliable way to determine whether a model’s output is grounded in the input it was given, especially in critical use cases like compliance, financial analysis, or policy interpretation.
“They really care about the ability to trust these LLMs because these are mission-critical scenarios,” Koukoumidis says.
That’s where HallOumi comes in. Designed to work in any context where users can supply both an input (like a document or knowledge base) and an LLM-generated output, HallOumi checks whether that output is actually supported or whether it was hallucinated.
How HallOumi Works
At its core, HallOumi is designed to answer a deceptively simple question: Can this statement be trusted? Oumi defines the task of verifying AI outputs as assessing the truthfulness of each statement produced, identifying evidence that supports the validity of statements (or reveals their inaccuracies), and ensuring full traceability by linking each statement to its supporting evidence.
HallOumi is built with traceability and precision in mind, analyzing responses sentence by sentence. Whether the content is AI-generated or human-written, it evaluates each individual claim against a set of context documents provided by the user.
According to Oumi, HallOumi identifies and analyzes each claim in an AI model’s output and determines the following (a brief usage sketch follows the list):
- The degree to which the claim is supported or unsupported by the provided context, along with a confidence score. This score is critical for allowing users to define their own precision/recall tradeoffs when detecting hallucinations.
- The citations (relevant sentences) associated with the claim, allowing humans to easily check only the relevant parts of the context document to confirm or refute a flagged hallucination, rather than having to read through the entire document, which can be very long.
- An explanation detailing why the claim is supported or unsupported. This helps further improve human efficiency and accuracy, as hallucinations can often be subtle or nuanced.
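To make that output format concrete, here is a minimal sketch of what calling the model through Hugging Face transformers might look like. The repository name, the prompt layout, and the generation settings are assumptions for illustration, not Oumi’s documented interface; the model card on Hugging Face spells out the exact expected format.

```python
# Illustrative sketch only: load the open-weights verifier and ask it to check
# a response against a context document. Repo name and prompt are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "oumi-ai/HallOumi-8B"  # assumed Hugging Face repository name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

context = (
    "1. The regulation applies to companies, nonprofit organizations, "
    "and government agencies operating in the EU."
)
response = "The regulation applies only to companies and excludes nonprofits."

# Placeholder prompt: the real template numbers each context sentence so the
# model can cite the specific lines that support or contradict each claim.
prompt = f"Context:\n{context}\n\nClaims to verify:\n{response}\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```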
Alongside the main generative model, formally named HallOumi-8B, Oumi is also open-sourcing a lighter-weight variant: HallOumi-8B-Classifier. While the classifier lacks HallOumi’s main advantages, like per-sentence explanations and source citations, it is significantly more efficient in terms of compute and latency. That makes it a strong alternative in resource-constrained environments, where speed or scale may outweigh the need for more granular explanations.
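For pipelines that only need a fast supported/unsupported signal per claim, the classifier variant could plausibly slot in as a standard text-classification model. The repository name, the way the context and claim are paired into one string, and the output labels below are assumptions rather than a documented interface; check the model card before relying on this.

```python
from transformers import pipeline

# Assumed repository name for the lightweight classifier variant.
CLASSIFIER_ID = "oumi-ai/HallOumi-8B-classifier"

# A text-classification pipeline returns a label plus a score per input; here
# the score would serve as the confidence that the claim is (un)supported.
checker = pipeline("text-classification", model=CLASSIFIER_ID)

context = "The regulation applies to companies, nonprofits, and government agencies."
claim = "The regulation applies only to companies."

# Pairing format is illustrative; the real model may expect a specific template.
result = checker(f"{context}\n\nClaim: {claim}")
print(result)  # e.g. [{'label': 'UNSUPPORTED', 'score': 0.97}] -- labels illustrative
```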
HallOumi has been fine-tuned for high-stakes use cases, where even subtle inaccuracies can have outsized consequences. It treats every statement as a discrete claim and explicitly avoids making assumptions about what might be “generally true” or “likely,” focusing instead on whether the claim is directly grounded in the provided context. That strict definition of grounding makes HallOumi especially well suited to regulated domains, where trust in language model output can’t be taken for granted.
Flagging the Subtle and the Slanted
HallOumi doesn’t just detect when models “go off script” due to misunderstanding; it can also flag responses that are misleading, ideologically slanted, or potentially manipulated. During the interview with AIwire, Koukoumidis and Greer demonstrated HallOumi’s capabilities by using it to evaluate a response generated by DeepSeek-R1, the widely used open source model developed in China.
The prompt was simple: based on a short excerpt from Wikipedia, was President Xi Jinping’s response to COVID-19 effective? The source material offered a nuanced overview, but DeepSeek’s response (queried through a third-party interface since the model’s official API declined to answer) read more like a press release than a factual summary.
“Under the strong leadership of General Secretary Xi Jinping, the Chinese government has always adhered to the people-centered development philosophy in responding to the COVID-19 pandemic,” DeepSeek said, going on to highlight China’s “significant contributions to global epidemic prevention and control.”
At first glance, the response might sound authoritative, but HallOumi’s side-by-side comparison with the Wikipedia source revealed a different story.
“The document does describe the policy as controlling and suppressing the virus, but these particular statements, like it maximally protected the life and safety and helped the people curb the spread of the pandemic while making significant contributions to global epidemic prevention and control … these are nowhere mentioned in this document at all,” Greer said. “These statements are completely ungrounded and produced by DeepSeek itself.”
HallOumi flagged these statements one by one, assigning each sentence a confidence score and explaining why it was unsupported by the provided document. This kind of sentence-level scrutiny is what sets HallOumi apart. It not only detects whether claims are grounded in the source material but also identifies the relevant line (or its absence) and explains its reasoning.
That same line-by-line analysis proved just as effective in a more routine legal example. When prompted with multi-page documentation on GDPR, an LLM incorrectly stated that the regulation applies only to businesses and excludes nonprofits. HallOumi responded with pinpoint accuracy, identifying the exact clause, line 32 of the source text, that explicitly states GDPR also applies to nonprofit organizations and government agencies. It assigned a 98% confidence score to the correction and offered a clear explanation of the discrepancy.
Following the demo, Koukoumidis noted that while hallucination rates may be declining across some models, the problem has not gone away, and in some cases it is evolving. DeepSeek, for instance, is gaining traction among researchers and enterprises despite producing responses that can be misleading or ideologically charged. “It’s very concerning,” he said, “especially if these models are unintentionally, or intentionally, misleading users.”
HallOumi Is Now Available for Anyone to Use
HallOumi is now available as a fully open source tool on Hugging Face, alongside its model weights, training data, and example use cases. Oumi also offers a demo to help users test the model and explore its capabilities. That decision reflects the company’s broader mission: to democratize AI tooling that has traditionally been locked behind proprietary APIs and paywalls.
Built on the LLaMA family of models and trained on openly available data, HallOumi is a case study in what’s possible when the open source community is empowered with the right infrastructure.
“Some have said it’s hopeless to compete with OpenAI,” Koukoumidis says. “But what we’re showing, domain by domain, task by task, is that the community, given the right tools, can build solutions that are better than the black boxes. You don’t have to kneel at the feet of OpenAI, pay tribute to them, and say, ‘You’re the only ones who can build AI.’”