When Meta introduced the long-awaited next generation of its open-source model, Llama 4, debates emerged on social media about whether this marks the end of retrieval-augmented generation (RAG), owing to the model’s 10-million-token context window. The massive context window allows the model to process significantly large amounts of information in a single query, raising several questions about whether RAG is still necessary.
LLAMA4 has 10M context window
Why even RAG anymore? pic.twitter.com/fcPGV3hgKj— Peter Yang (@petergyang) April 5, 2025
Shorter-context models typically rely on external retrieval to access knowledge. However, Llama 4’s larger context enables it to handle more information internally, reducing the need for external sources when reasoning over or processing static data. But is this enough to signal the end of RAG?
Leave RAG Alone, Please
Several developers and industry experts rallied to defend RAG, which has faced many challenges. On costs, pushing 10 million tokens into a context window won’t be cheap: it would exceed a dollar per query and take ‘tens of seconds’ to generate a response, as pointed out by Marco D’Alia, a software architect, on X.
People are saying the 10 million context size of @meta Llama 4 means RAG is dead.
I have two questions for you:
1) Do you want to spend $1+ for each message?
2) Do you want to wait a VERY long time on every message to process all those tokens?— Tristan Rhodes (@tristanbob) April 5, 2025
Additionally, many emphasised that longer context windows were never meant to replace RAG, whose role is primarily to add relevant chunks of information to the input.
“RAG isn’t about solving for a finite context window, it’s about filtering for signal from a noisy dataset. No matter how big and powerful your context window gets, removing junk data from the input will always improve performance,” said Jamie Voynow, a machine learning engineer, on X.
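To illustrate Voynow’s point, a minimal retrieval-and-filtering step might look like the sketch below. This is only a toy illustration, not anyone’s actual pipeline: the word-overlap scorer stands in for a real embedding model, and the corpus, query and `top_k` value are made-up placeholders.

```python
# Minimal sketch of RAG-style filtering: keep only the chunks most relevant
# to the query instead of stuffing an entire corpus into the context.
# The word-overlap scorer below is a toy stand-in for a real embedding model.

def score(query: str, chunk: str) -> float:
    """Crude relevance score: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k highest-scoring chunks for the query."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_k]

# Hypothetical corpus and query for demonstration only.
corpus = [
    "Llama 4 Scout supports a 10-million-token context window.",
    "RAG retrieves only the passages relevant to a query.",
    "The cafeteria menu changes every Tuesday.",
]
context = "\n".join(retrieve("What context window does Llama 4 support?", corpus, top_k=1))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
print(prompt)
```

The design point is the same one Voynow makes: whatever the context budget, only the highest-signal chunks reach the prompt, and everything else is discarded before the model ever sees it.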
Gokul JS, a founding engineer at Aerotime, summarised the entire debate with a simple analogy: “Imagine handing someone a dense page of text, taking it away, then asking questions. They’ll remember bits, not everything,” he said in a post on X. He added that LLMs are no different in such situations, and that just because they can handle more context doesn’t always guarantee an accurate response.
Moreover, a 10-million-token context window is huge, but it may not cover every use case. Granted, RAG use cases have certainly shrunk over time, given how easily most AI models can now retrieve information from a few PDFs, but several practical use cases will need to go beyond that.
“Most enterprises have terabytes of documents. No context window can encompass a pharmaceutical company’s 50K+ research papers and decades of regulatory submissions,” said Skylar Payne, a former ML systems engineer at Google and LinkedIn.
It would make sense if we’re talking about how gpt-3.5 used to have 4k context and we needed RAG for an arxiv paper but we don’t have to now.
Back to the present: Even with 10M context, we’ll probably still RAG for arxiv papers from 2025 alone, and I’m not sure loading 10M worth…— Eugene Yan (@eugeneyan) April 6, 2025
Moreover, AI models have knowledge cutoffs. This means they cannot answer queries that depend on the latest real-time information unless it is retrieved dynamically, which requires using RAG.
Furthermore, if someone plans to run Llama 4 on inference providers like Groq or Together AI, those services offer context limits significantly lower than 10 million tokens. Groq provides roughly 130,000 tokens for both Llama 4 Scout and Maverick, while Together AI offers about 300,000 tokens for Llama 4 Scout and roughly 520,000 tokens for Llama 4 Maverick.
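In practice, that means checking whether a document set even fits under a provider’s limit before choosing between long-context prompting and retrieval. The sketch below is a rough illustration under stated assumptions: the limits are the approximate figures cited above, and the chars-per-token heuristic is a crude estimate, not a real tokenizer.

```python
# Rough sketch: decide between long-context prompting and a retrieval fallback
# based on a provider's context limit. Token counts use a ~4-characters-per-token
# heuristic, not a real tokenizer; limits are the approximate figures cited above.

PROVIDER_LIMITS = {
    "groq/llama-4-scout": 130_000,          # approximate, per the article
    "together/llama-4-maverick": 520_000,   # approximate, per the article
}

def estimate_tokens(text: str) -> int:
    """Very rough heuristic: roughly 4 characters per token for English text."""
    return len(text) // 4

def fits_in_context(documents: list[str], provider: str, reserve: int = 4_000) -> bool:
    """True if all documents plus a reserved answer budget fit in the provider's context."""
    total = sum(estimate_tokens(d) for d in documents)
    return total + reserve <= PROVIDER_LIMITS[provider]

docs = ["..." * 200_000]  # one large synthetic document, ~150K estimated tokens
if fits_in_context(docs, "groq/llama-4-scout"):
    print("Stuff everything into the prompt.")
else:
    print("Fall back to retrieval over chunks.")
```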
LLMs Perform Poorly Beyond 32,000 Tokens
Moreover, a study revealed that LLMs exhibit a decline in performance beyond roughly 30,000 tokens of context. Although it did not include the Llama 4 models, the study found that at 32K tokens, 10 of the 12 tested AI models performed below half their short-context baseline. Even OpenAI’s GPT-4o, one of the top performers, dropped from a baseline score of 99.3% to 69.7%.
“Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information,” the study read.
The study also noted that conflicting information within the context can confuse the AI model, making it necessary to apply a filtering step to remove irrelevant or misleading content. “That’s usually not a problem with RAG, but if we indiscriminately put everything into the context, we’ll also need a filtering step,” said D’Alia, who cited the study to back his arguments.
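The filtering step D’Alia describes could, in a very simplified form, look like the sketch below: prune passages that fall below a relevance threshold before they reach the prompt, rather than selecting a fixed top-k. The scorer, threshold and example passages are illustrative assumptions only.

```python
# Sketch of a pre-prompt filtering step: if everything has already been dumped
# into the context, drop passages below a minimum relevance score before
# building the final prompt. Scorer and threshold are illustrative only.

def keyword_score(query: str, passage: str) -> float:
    """Toy relevance score based on query-word overlap."""
    q = set(query.lower().split())
    return len(q & set(passage.lower().split())) / max(len(q), 1)

def prune(query: str, passages: list[str], min_score: float = 0.2) -> list[str]:
    """Keep only passages whose score clears the threshold."""
    return [p for p in passages if keyword_score(query, p) >= min_score]

passages = [
    "Llama 4 supports a very long context window.",
    "Unrelated trivia about office plants.",
]
print(prune("How long is Llama 4's context window?", passages))
```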
All things considered, Meta’s Llama 4 is indeed a big step forward in open-source AI.
Artificial Analysis, a platform that evaluates AI models, said that Llama 4 Maverick beats Claude 3.7 Sonnet but trails DeepSeek-V3, while being more efficient. Meanwhile, Llama 4 Scout offers performance parity with GPT-4o mini.
On the MMLU-Pro benchmark, which evaluates LLMs on reasoning-focused questions, Llama 4 Maverick scored 80%, matching Claude 3.7 Sonnet (80%) and OpenAI’s o3-mini (79%).
On the GPQA Diamond benchmark, which tests AI models on graduate-level science questions, Llama 4 Maverick scored 60%, on par with Gemini 2.0 Flash (60%) and lower than DeepSeek V3 (66%).
