Language models still struggle with the truth, and for industries with regulatory or safety responsibilities, that is a serious liability. That is why open source AI lab Oumi released HallOumi, a model that analyzes LLM responses line by line, scoring each sentence for factual accuracy and backing its judgments with detailed rationales and citations.
Oumi launched earlier this year as "the Linux of AI," positioning itself as a fully open source AI platform for building foundation models that aims to advance frontier AI for both academia and the enterprise. The platform is developed collaboratively with 13 universities in the U.S. and U.K., including Caltech, MIT, and the University of Oxford.
In an interview with AIwire, Oumi CEO Manos Koukoumidis and co-founder and AI researcher Jeremy Greer walked through the motivation behind HallOumi and demonstrated how it works.
An Open Source Answer to the Trust Gap
The motivation behind HallOumi, according to Koukoumidis, stemmed from growing demand among enterprises for transparent and trustworthy AI systems, particularly in regulated industries. From the outset, Oumi positioned itself as a fully open source platform designed to make it easy for both enterprises and academic institutions to develop their own foundation models. But it was the wave of interest following the company's recent launch that underscored just how urgent one issue had become: hallucinations.
Industries like finance and healthcare want to adopt large language models, Koukoumidis says, but hallucinations, or factually unsupported outputs, are holding them back. And the problem is not limited to externally facing applications. Even when used internally as copilots or summarizers, LLMs need to be trustworthy. Enterprises need a reliable way to determine whether a model's output is grounded in the input it was given, especially in critical use cases like compliance, financial analysis, or policy interpretation.
"They really care about the ability to trust these LLMs because these are mission-critical scenarios," Koukoumidis says.
That's where HallOumi comes in. Designed to work in any context where users can supply both an input (like a document or knowledge base) and an LLM-generated output, HallOumi checks whether that output is actually supported or whether it was hallucinated.
How HallOumi Works
At its core, HallOumi is designed to answer a deceptively simple question: Can this statement be trusted? Oumi defines the task of verifying AI outputs as assessing the truthfulness of each statement produced, identifying evidence that supports the validity of statements (or exposes their inaccuracies), and ensuring full traceability by linking each statement to its supporting evidence.
HallOumi is built with traceability and precision in mind, analyzing responses sentence by sentence. Whether the content is AI-generated or human-written, it evaluates each individual claim against a set of context documents provided by the user.
According to Oumi, HallOumi identifies and analyzes each claim in an AI model's output and determines the following (a rough usage sketch appears after the list):
- The degree to which the claim is supported or unsupported by the provided context, along with a confidence score. This score is key to letting users define their own precision/recall tradeoffs when detecting hallucinations.
- The citations (relevant sentences) associated with the claim, allowing humans to easily check only the relevant parts of the context document to confirm or refute a flagged hallucination, rather than needing to read through the entire document, which could be very long.
- An explanation detailing why the claim is supported or unsupported. This helps further improve human efficiency and accuracy, as hallucinations can often be subtle or nuanced.
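To make that concrete, here is a minimal sketch of running that kind of per-claim check with the generative model's open weights (named HallOumi-8B, as discussed below). The repo id, prompt layout, and generation settings are illustrative assumptions rather than Oumi's documented interface; the Hugging Face model card describes the exact input format.

```python
# A minimal sketch, assuming HallOumi-8B is published as a standard text-generation
# checkpoint on Hugging Face. The repo id and prompt layout are illustrative guesses,
# not Oumi's documented interface.
from transformers import pipeline

verifier = pipeline(
    "text-generation",
    model="oumi-ai/HallOumi-8B",  # assumed repo id; check the Hugging Face model card
    device_map="auto",
)

context = (
    "The GDPR applies to businesses, nonprofit organizations, and government "
    "agencies that process the personal data of EU residents."
)
response = "The GDPR applies only to businesses and excludes nonprofits."

# Hypothetical instruction wrapper: ask the model to judge each sentence of the
# response against the context and return support labels, citations, and rationales.
prompt = (
    "<context>\n" + context + "\n</context>\n"
    "<response>\n" + response + "\n</response>\n"
    "For each sentence of the response, state whether it is supported by the "
    "context, cite the relevant context sentences, and explain why."
)

result = verifier(prompt, max_new_tokens=512, do_sample=False)
print(result[0]["generated_text"])
```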
Alongside the main generative model, formally named HallOumi-8B, Oumi is also open-sourcing a lighter-weight variant: HallOumi-8B-Classifier. While the classifier lacks HallOumi's biggest advantages, like per-sentence explanations and source citations, it is significantly more efficient in terms of compute and latency. That makes it a strong alternative in resource-constrained environments, where speed or scale may outweigh the need for more granular explanations.
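For the classifier route, a similarly hedged sketch: this assumes HallOumi-8B-Classifier ships as a standard sequence-classification checkpoint that scores a (context, claim) pair, which is a guess about packaging rather than a documented fact.

```python
# A minimal sketch, assuming HallOumi-8B-Classifier is a sequence-classification
# checkpoint that scores a (context, claim) pair as supported vs. unsupported.
# Repo id, label names, and the pair-encoding convention are assumptions.
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="oumi-ai/HallOumi-8B-Classifier",  # assumed repo id
    device_map="auto",
)

context = "The GDPR applies to businesses, nonprofit organizations, and government agencies."
claim = "The GDPR applies only to businesses."

# One label and score per claim, with no citation or rationale: the tradeoff for
# the lower compute and latency described above.
print(clf({"text": context, "text_pair": claim}))
```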
HallOumi has been fine-tuned for high-stakes use cases, where even subtle inaccuracies can have outsized consequences. It treats every statement as a discrete claim and explicitly avoids making assumptions about what might be "generally true" or "likely," focusing instead on whether the claim is directly grounded in the provided context. That strict definition of grounding makes HallOumi especially well-suited for regulated domains, where trust in language model output cannot be taken for granted.
Flagging the Subtle and the Slanted
HallOumi doesn't just detect when models "go off script" due to misunderstanding; it can also flag responses that are misleading, ideologically slanted, or potentially manipulated. During the interview with AIwire, Koukoumidis and Greer demonstrated HallOumi's capabilities by using it to evaluate a response generated by DeepSeek-R1, the widely used open source model developed in China.
The prompt was simple: based on a short excerpt from Wikipedia, was President Xi Jinping's response to COVID-19 effective? The source material offered a nuanced overview, but DeepSeek's response (queried through a third-party interface since the model's official API declined to answer) read more like a press release than a factual summary.
"Under the strong leadership of General Secretary Xi Jinping, the Chinese government has always adhered to the people-centered development philosophy in responding to the COVID-19 pandemic," DeepSeek said, while going on to highlight China's "significant contributions to global epidemic prevention and control."
At first glance, the response might sound authoritative, but HallOumi's side-by-side comparison with the Wikipedia source revealed a different story.
"The document does describe the policy as controlling and suppressing the virus, but these particular statements, like it maximally protected the life and safety and helped the people curb the spread of the pandemic while making significant contributions to global epidemic prevention and control … these are nowhere mentioned in this document at all," Greer said. "These statements are completely ungrounded and produced by DeepSeek itself."
HallOumi flagged these statements one by one, assigning each sentence a confidence score and explaining why it was unsupported by the provided document. This kind of sentence-level scrutiny is what sets HallOumi apart. It not only detects whether claims are grounded in the source material but also identifies the relevant line (or its absence) and explains its reasoning.
That same line-by-line analysis proved just as effective in a more routine legal example. When prompted with multi-page documentation on GDPR, an LLM incorrectly stated that the regulation applies only to businesses and excludes nonprofits. HallOumi responded with pinpoint accuracy, identifying the exact clause, line 32 of the source text, that explicitly states GDPR also applies to nonprofit organizations and government agencies. It assigned a 98% confidence score to the correction and provided a clear explanation of the discrepancy.
Following the demo, Koukoumidis noted that while hallucination rates may be declining across some models, the problem has not gone away, and in some cases it is evolving. DeepSeek, for instance, is gaining traction among researchers and enterprises despite producing responses that can be misleading or ideologically charged. "It's very concerning," he said, "especially if these models are unintentionally, or intentionally, misleading users."
HallOumi Is Now Available for Anyone to Use
HallOumi is now available as a fully open source tool on Hugging Face, alongside its model weights, training data, and example use cases. Oumi also offers a demo to help users test the model and explore its capabilities. That decision reflects the company's broader mission: to democratize AI tooling that has traditionally been locked behind proprietary APIs and paywalls.
Built using the LLaMA family of models and trained on openly available data, HallOumi is a case study in what's possible when the open source community is empowered with the right infrastructure.
"Some have said it's hopeless to compete with OpenAI," Koukoumidis says. "But what we're showing, domain by domain and task by task, is that the community, given the right tools, can build solutions that are better than the black boxes. You don't have to kneel at the feet of OpenAI, pay tribute to them, and say, 'You're the only ones who can build AI.'"