Following the recent launch of SearchGPT, users have reported issues with hallucinations—a challenge that’s hardly new. Even Perplexity, a popular search engine, struggles with hallucinations. But there’s good news!
A few days ago, a team of researchers from China published a paper titled ‘HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems’. This paper could offer a breakthrough in tackling hallucination problems in AI search engines.
The paper discusses the use of the HTML format for Retrieval-augmented generation (RAG) systems, which aim to enhance the performance of LLMs by providing them with external knowledge. The authors argue that using HTML instead of plain text can better preserve the structural and semantic information inherent in web pages.
To use HTML effectively, the authors propose techniques such as a two-stage pruning algorithm, which shortens the input context passed to the LLM without losing key information.
What Problem is it Solving?
RAG has been a popular approach to enhance the knowledge capabilities of LLMs and mitigate their tendency to hallucinate information. Commercial systems like ChatGPT and Perplexity increasingly rely on web search engines as their primary retrieval systems. The pipeline typically involves retrieving search results, downloading their HTML sources, and extracting plain text.
However, this conventional RAG process often leads to losing valuable structural and semantic information inherent in HTML. Critical elements such as headings and table structures are stripped away during text extraction, undermining the potential depth and contextual richness of the retrieved information.
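To make the loss concrete, here is a minimal sketch of conventional text extraction using only Python's standard library. The sample page and the `extract_plain_text` helper are made up for illustration and are not taken from the paper.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes only and discards every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_plain_text(html: str) -> str:
    # Hypothetical helper: flatten a page into one string of text.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page: a heading followed by a small pricing table.
SAMPLE_HTML = """
<h1>Pricing</h1>
<table>
  <tr><th>Plan</th><th>Price</th></tr>
  <tr><td>Basic</td><td>$5</td></tr>
  <tr><td>Pro</td><td>$20</td></tr>
</table>
"""

plain = extract_plain_text(SAMPLE_HTML)
print(plain)  # Pricing Plan Price Basic $5 Pro $20
```

In the flattened output, nothing marks "Pricing" as a heading or tells the model which price belongs to which plan. That is the structural information the article says is stripped away.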
Elvis Saravia, co-founder at DAIR.AI, mentioned in his LinkedIn post that to address the challenge of HTML documents being too lengthy for LLM context windows, the authors developed a two-step pruning method. The first step cleans out unnecessary HTML elements (reducing the length by 94%). The second is a block-tree-based pruning approach that combines embedding-based and generative pruning to further reduce the content while maintaining important information.
However, incorporating HTML brings in significant complexities, including additional elements like tags, JavaScript, and CSS specifications that potentially introduce noise and increase input tokens.
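A rough idea of what such cleaning could look like is sketched below with Python's standard `html.parser`. The rules here (drop scripts, styles, comments and all attributes, then collapse whitespace) are an assumption for illustration, not the authors' exact procedure, and `clean_html` is a hypothetical name.

```python
from html.parser import HTMLParser

# Assumption: these tags are dropped together with their content.
DROP_WITH_CONTENT = {"script", "style"}
# Void elements never get a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlCleaner(HTMLParser):
    """Re-emits the page with noise removed but the tag structure kept."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.out.append(text)

    # Comments are dropped: the base class handle_comment does nothing.


def clean_html(html: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


RAW = """<div class="post" style="color:red">
<script>track("page");</script>
<style>.post { margin: 0 }</style>
<!-- ad slot -->
<h2 id="t">Results</h2>
<p onclick="x()">HTML keeps   structure.</p>
</div>"""

cleaned = clean_html(RAW)
print(cleaned)  # <div><h2>Results</h2><p>HTML keeps structure.</p></div>
```

The heading and paragraph tags survive, while the script, stylesheet, comment and attributes are gone. How much a real page shrinks depends on the page; the 94% figure comes from the paper, not from this sketch.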
The proposed approach centres on a two-step block-tree-based pruning method, strategically removing unnecessary HTML blocks and selectively retaining only the most relevant document components. This technique enables more efficient and precise knowledge integration without sacrificing semantic depth or contextual richness.
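The pruning idea can be sketched as follows. This is a simplified illustration under stated assumptions: a word-overlap score stands in for the embedding model, the generative pruning stage is not shown, and the block list, `relevance` and `prune_blocks` are hypothetical names rather than the authors' code.

```python
import re


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def strip_tags(block: str) -> str:
    return re.sub(r"<[^>]+>", " ", block)


def relevance(query: str, block: str) -> float:
    # Stand-in for an embedding similarity: the share of query words
    # that also appear in the block's text.
    q = set(tokens(query))
    b = set(tokens(strip_tags(block)))
    return len(q & b) / len(q) if q else 0.0


def prune_blocks(query: str, blocks: list[str], max_words: int) -> list[str]:
    """Keep the most relevant blocks that fit the budget, in document order."""
    scores = [relevance(query, b) for b in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        if scores[i] == 0:
            continue  # no overlap with the query at all
        cost = len(tokens(strip_tags(blocks[i])))
        if used + cost <= max_words:
            kept.add(i)
            used += cost
    return [blocks[i] for i in sorted(kept)]


# Made-up blocks from an already cleaned page.
blocks = [
    "<nav>Home About Contact</nav>",
    "<h2>Battery life</h2>",
    "<p>The phone battery lasts 12 hours on a single charge.</p>",
    "<footer>Copyright 2024 Example Corp</footer>",
]
query = "how long does the phone battery last"

pruned = prune_blocks(query, blocks, max_words=20)
print(pruned)
```

With this input, the navigation and footer blocks score zero and are dropped, while the heading and the paragraph are kept with their tags intact.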
Apart from HTML, there is also a discussion around implementing Markdown, a lightweight markup language that allows users to format plain text with special characters. The appeal, especially for developers, lies in its simplicity—modern note-taking apps are often built around Markdown.
Software development consultant and instructor Cristina Belderrain suggested that Markdown would be a better choice here. “It provides semantic and structural information as well, all without needing cleaning and/or pruning,” she added, implying that Markdown would reduce the pruning work that HTML requires.
The problem here is that Markdown, while popular, cannot be compared with HTML, especially given how much of the internet is still powered by HTML. For a potential breakthrough in RAG, building on a technology as versatile as HTML makes more sense.
What’s so Special About HtmlRAG?
If you dig deeper, you will realise that similar approaches have already been implemented. For example, Google’s Vertex AI Search has implemented advanced HTML processing capabilities and Microsoft’s Azure AI Search offers vector search capabilities for HTML documents.
HtmlRAG takes a fundamentally different approach by directly feeding HTML structure into the RAG pipeline rather than converting it to plain text first. This preserves crucial semantic and structural information like headings, tables, and hierarchical relationships that would otherwise be lost.
Features like block-tree-based pruning and granularity-adjustable block tree structure allow HtmlRAG to process web content more effectively than traditional RAG systems while maintaining the rich contextual information inherent in HTML documents.
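One plausible reading of "granularity-adjustable" is sketched below: a subtree is kept as a single block if its text fits within a word limit, and is otherwise split into its children. This is an assumption made for illustration, not the authors' implementation. It uses `xml.etree.ElementTree`, so it only works on well-formed markup like the made-up sample here.

```python
import xml.etree.ElementTree as ET


def word_count(node: ET.Element) -> int:
    return len(" ".join(node.itertext()).split())


def split_into_blocks(node: ET.Element, max_words: int) -> list[str]:
    """Return the subtree as one block if it is small enough, else recurse."""
    children = list(node)
    if word_count(node) <= max_words or not children:
        return [ET.tostring(node, encoding="unicode").strip()]
    # Simplification: text sitting directly inside a split parent,
    # before its first child, is dropped.
    blocks = []
    for child in children:
        blocks.extend(split_into_blocks(child, max_words))
    return blocks


DOC = """<div>
  <section><h2>Specs</h2><p>Weight 180 g</p></section>
  <section><h2>Review</h2><p>The camera is sharp and the battery easily lasts a full day.</p></section>
</div>"""

root = ET.fromstring(DOC)
coarse = split_into_blocks(root, max_words=50)  # whole page as one block
fine = split_into_blocks(root, max_words=8)     # smaller, finer blocks

print(len(coarse), len(fine))  # 1 3
```

A larger `max_words` yields fewer, coarser blocks, and a smaller value yields more, finer ones. A pruning step can then score and drop blocks at whichever granularity suits the context budget.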
The post HTML to Prevent LLMs from Overdosing on Hallucinations appeared first on Analytics India Magazine.