DeepSeek Has ‘Cracked’ Cheap Long Context for LLMs With Its New Model

DeepSeek, the China-based AI lab, released DeepSeek-V3.2-Exp, an experimental AI model, on September 29. The company claims the model achieves ‘significant efficiency improvements in both training and inference’.

It is built upon DeepSeek-V3.1-Terminus, itself an upgraded version of the DeepSeek-V3.1 model.

It introduces what the company calls ‘DeepSeek Sparse Attention (DSA)’, a sparse attention mechanism designed to explore and validate optimisations for training and inference efficiency in long-context scenarios.

How does @deepseek_ai Sparse Attention (DSA) work?
It has 2 components: the Lightning Indexer and Sparse Multi-Latent Attention (MLA). The indexer keeps a small key cache of 128 per token (vs. 512 for MLA). It scores incoming queries. The top-2048 tokens are then passed to Sparse MLA. https://t.co/WXwKDHnkXB pic.twitter.com/QzzPRvAaNa

— vLLM (@vllm_project) September 29, 2025
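In rough terms, the indexer acts as a cheap filter in front of the expensive attention step. The sketch below is a minimal, single-head PyTorch illustration of that idea, not DeepSeek's actual implementation (which uses MLA latents, causal masking and custom kernels); the function and variable names are made up for this example.

```python
import torch
import torch.nn.functional as F

def dsa_sketch(q, k, v, idx_q, idx_k, top_k=2048):
    """Toy single-head sketch of the sparse-attention idea described above.

    q, k, v      : full query/key/value tensors, shape [seq, d_head]
    idx_q, idx_k : small 'lightning indexer' projections, shape [seq, d_idx]
                   (the tweet cites a 128-wide indexer cache vs. 512 for MLA)
    Each query is scored against every cached token by the cheap indexer,
    and only the top_k best-scoring tokens reach the real attention step.
    Causal masking and the MLA latent structure are omitted for brevity.
    """
    seq_len, d_head = q.shape
    top_k = min(top_k, seq_len)

    # 1. Cheap indexer scores over the whole cache: [seq, seq]
    index_scores = idx_q @ idx_k.T

    # 2. Pick the top_k key positions for each query.
    sel = index_scores.topk(top_k, dim=-1).indices            # [seq, top_k]

    # 3. Gather those keys/values and run ordinary attention on them only.
    k_sel, v_sel = k[sel], v[sel]                              # [seq, top_k, d_head]
    scores = torch.einsum("qd,qkd->qk", q, k_sel) / d_head ** 0.5
    return torch.einsum("qk,qkd->qd", F.softmax(scores, dim=-1), v_sel)

# Toy usage: 4,096 tokens, 64-dim head, 128-dim indexer projections.
q, k, v = (torch.randn(4096, 64) for _ in range(3))
idx_q, idx_k = torch.randn(4096, 128), torch.randn(4096, 128)
print(dsa_sketch(q, k, v, idx_q, idx_k).shape)  # torch.Size([4096, 64])
```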

DeepSeek says the model performs on par with V3.1-Terminus despite using a much simpler and faster attention method that processes far fewer tokens in long-context tasks.

For context, this model scored 58 on the Artificial Analysis Intelligence Index, which aggregates a model’s performance across 10 benchmarks in diverse domains. Anthropic’s Claude Opus 4.1 scores 59, Gemini 2.5 Pro scores 60, and OpenAI’s GPT-5 (high) scores 68.

For more details on the architecture, refer to the technical report, available here.

“The DeepSeek team cracked cheap long context for LLMs: a ~3.5x cheaper prefill and ~10x cheaper decode at 128k context at inference with the same quality,” said Deedy Das, partner at Menlo Ventures, reacting to the announcement on X.
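A rough back-of-envelope calculation (not DeepSeek's or Das's own accounting) shows where these savings come from: with a 128K-token cache, dense decode must touch every cached token at each step, while DSA's heavy attention maths only touches the tokens the indexer selects.

```python
# Rough illustration of the effect of a fixed top-k budget at 128K context.
context_len = 128_000   # tokens in the KV cache
top_k = 2_048           # tokens each query actually attends to under DSA

print(context_len / top_k)   # ~62.5x fewer key/value reads in the attention step
# The lightweight indexer still scans the full cache (with a much smaller
# per-token key), so the realised end-to-end speedup is smaller, in line
# with the ~10x decode figure quoted above.
```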

DeepSeek casually unlocked 50x attention efficiency in ~1 year
> MLA is ~5.6x faster than MHA
> DSA is 9x faster than MLA
never doubted you, you big beautiful whale

— Ahmad (@TheAhmadOsman) September 29, 2025

The model is available on the DeepSeek app, web and API, and its weights are available on Hugging Face.

The company also announced that API pricing has been cut by 50% or more. DeepSeek has reduced input costs from $0.07 to $0.028 per 1M tokens for cache hits and from $0.56 to $0.28 for cache misses, while output costs have dropped from $1.68 to $0.42.
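Plugging the quoted figures into a quick calculation shows the reductions range from 50% to 75% depending on the token type:

```python
# Quick check of the quoted per-1M-token API prices (USD), old vs. new.
old = {"input (cache hit)": 0.07, "input (cache miss)": 0.56, "output": 1.68}
new = {"input (cache hit)": 0.028, "input (cache miss)": 0.28, "output": 0.42}

for kind in old:
    cut = 1 - new[kind] / old[kind]
    print(f"{kind}: ${old[kind]:.3f} -> ${new[kind]:.3f} ({cut:.0%} cheaper)")
# input (cache hit): 60% cheaper, input (cache miss): 50% cheaper, output: 75% cheaper
```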

DeepSeek-V3.2 shows:
– Chinese chips are rising: Day-0 support for Huawei Ascend & Cambricon;
– ML compiler: DeepSeek uses TileLang, letting you write Python → compile to optimized kernels on diverse hardware. E.g., 80 lines of Python can reach 95% of FlashMLA’s (CUDA written… pic.twitter.com/QxOaAq6r5J

— Yuchen Jin (@Yuchenj_UW) September 29, 2025

“This experimental release represents our ongoing research into more efficient transformer architectures, particularly focusing on improving computational efficiency when processing extended text sequences,” said DeepSeek in the blog post.

