DeepSeek Might Have Just Killed the Text Tokeniser

Researchers at DeepSeek have introduced DeepSeek-OCR, a new model that explores how visual inputs can help large language models (LLMs) handle longer text efficiently. Instead of feeding text directly into a model, DeepSeek-OCR compresses it into visual tokens, which are essentially ‘images of text’ that carry the same information in fewer tokens.

This approach, called contexts optical compression, could help LLMs overcome one of their biggest limitations, the cost of processing long contexts, by allowing them to handle long documents without a proportional rise in compute.

This efficiency makes it practical for large-scale use. According to the paper, a single A100 GPU can process over 200,000 pages per day, generating training data for LLMs or vision-language models at an industrial scale.

How DeepSeek-OCR Works

The DeepSeek-OCR model is built around two main components — DeepEncoder and DeepSeek3B-MoE, a mixture-of-experts decoder. The DeepEncoder converts document images into a compact set of vision tokens by combining elements from the SAM and CLIP architectures.

It includes a token compression module that reduces the number of vision tokens while preserving essential visual information. Once the image is encoded, the decoder reconstructs the original text from these compressed visual tokens, completing the process from image to readable text.
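In rough code, the flow from page image to text looks something like the sketch below. The class and function names, shapes, and compression step are illustrative assumptions for explanation only, not DeepSeek-OCR's actual API.

```python
import torch

class DeepEncoder(torch.nn.Module):
    """Stand-in for the SAM+CLIP-style encoder with a token compression module."""
    def __init__(self, dim=1024, compressed_tokens=256):
        super().__init__()
        self.patchify = torch.nn.Conv2d(3, dim, kernel_size=16, stride=16)  # image -> patch tokens
        self.compress = torch.nn.AdaptiveAvgPool1d(compressed_tokens)       # reduce token count

    def forward(self, image):                        # image: (B, 3, H, W)
        patches = self.patchify(image).flatten(2)    # (B, dim, H/16 * W/16)
        vision_tokens = self.compress(patches)       # (B, dim, compressed_tokens)
        return vision_tokens.transpose(1, 2)         # (B, compressed_tokens, dim)

def ocr_page(image, encoder, decoder, tokenizer):
    vision_tokens = encoder(image)                              # a few hundred tokens per page
    output_ids = decoder.generate(inputs_embeds=vision_tokens)  # MoE decoder reconstructs the text
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```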

DeepSeek-OCR also supports multiple resolution modes, ranging from small 512×512 images to large 1280×1280 ones, depending on the input type. This flexibility allows it to handle various document layouts and formats, including charts, chemical formulas, and geometric figures.
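The resolution modes matter because the vision-token budget grows with image size. Assuming 16×16 patches and the roughly 16× token compression the paper describes (both assumptions for this back-of-the-envelope sketch), the arithmetic looks like this:

```python
# Rough arithmetic for how resolution maps to vision-token count.
# The 16x16 patch size and ~16x compression factor are assumptions.
def vision_tokens(side, patch=16, compression=16):
    patches = (side // patch) ** 2    # patch tokens before compression
    return patches // compression     # tokens actually fed to the decoder

for side in (512, 1024, 1280):
    print(f"{side}x{side}: ~{vision_tokens(side)} vision tokens")
# 512x512:   ~64 vision tokens
# 1024x1024: ~256 vision tokens
# 1280x1280: ~400 vision tokens
```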

Andrej Karpathy, founding member of OpenAI, shared his thoughts on X about the DeepSeek-OCR paper. While he called it “a good OCR model,” he added that what truly interests him is the larger question it raises: should LLMs process pixels instead of text?

Karpathy wrote that perhaps “all inputs to LLMs should only ever be images.” Even pure text, he suggested, could be rendered visually before being fed into the model. This, he argued, would allow for better information compression, more efficient context handling, and a richer, more general input stream, one that includes not only words but also visual cues like bold or coloured text.

He also criticised the traditional tokeniser, calling it an “ugly, separate, non-end-to-end stage” that brings along unnecessary complexity from Unicode and byte encodings. In his view, tokenisers make visually identical characters appear as different tokens to the model, breaking the natural link between how humans see text and how machines interpret it.

Along similar lines, Sean Goedecke, staff software engineer at GitHub, explained in a blog post that text tokens are discrete: there is a fixed vocabulary of them, typically around 50,000, and each token corresponds to a single point in the embedding space. Image tokens, by contrast, are continuous and can take on any value in that space, allowing a single image token to represent far more nuanced information than a text token.
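The contrast is easy to see in code. In the minimal sketch below (the vocabulary size, patch size, and dimensions are arbitrary), a text token is an index into a fixed embedding table, while an image token is a continuous vector projected directly from pixels:

```python
import torch

vocab_size, dim = 50_000, 768

# Text token: a discrete ID looked up in a fixed table -> only ~50k possible vectors.
embedding = torch.nn.Embedding(vocab_size, dim)
text_token = embedding(torch.tensor([1234]))        # shape (1, 768)

# Image token: a 16x16 RGB patch projected into the same space -> any point in R^768.
patch_proj = torch.nn.Linear(16 * 16 * 3, dim)
patch = torch.rand(1, 16 * 16 * 3)                  # raw pixel values
image_token = patch_proj(patch)                     # shape (1, 768), continuous-valued
```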

“An image token can be far more expressive than a series of text tokens,” he said. He further suggested that this might even mirror how humans process information.

However, Adithya S K Kolavi, an AI researcher at Microsoft Research, told AIM that he still believes text will remain a crucial part of these systems, especially since the outputs of most models are still token-based. “But converting text into images and feeding that to the model could definitely lead to some surprising and novel capabilities,” he added.

Kolavi said that Karpathy’s point about tokenisation being a bottleneck is valid, as it introduces inefficiencies and arbitrary segmentation of meaning. Representing text as images could remove those boundaries.

However, he added that this approach also brings new challenges, especially around pixel-to-language alignment and scaling. “Training such models will require a lot more data and compute to bridge the gap between visual and semantic understanding,” he said.

Benchmark Results

In tests, DeepSeek-OCR showed strong results across multiple benchmarks.

On the Fox benchmark, which measures text compression and decoding accuracy, DeepSeek-OCR achieved around 97% accuracy at a ten-times compression ratio, meaning it used only one-tenth of the normal token count. Even at twenty-times compression, it maintained roughly 60% accuracy.
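Put concretely, a page that would normally cost about 1,000 text tokens (an assumed figure for illustration) needs only around 100 vision tokens at that ratio:

```python
# Illustrative token savings; the 1,000-text-token page is an assumed example.
text_tokens = 1_000
for ratio, accuracy in [(10, 0.97), (20, 0.60)]:
    vision_tokens = text_tokens // ratio
    print(f"{ratio}x compression: {vision_tokens} vision tokens, ~{accuracy:.0%} decoding accuracy")
# 10x compression: 100 vision tokens, ~97% decoding accuracy
# 20x compression: 50 vision tokens, ~60% decoding accuracy
```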

“DeepSeek-OCR is the best OCR ever. It parses this extremely hard-to-read handwritten letter written by mathematician Ramanujan in 1913 with a frightening degree of accuracy,” said Deedy Das of Menlo Ventures in a post on X.

This result shows that the model can decode text efficiently from highly compressed visual input. For LLMs that struggle with long contexts, this kind of optical compression could reduce token usage dramatically without major information loss.

On OmniDocBench, a standard benchmark for document parsing, DeepSeek-OCR also performed strongly, matching or beating established document-parsing models while using far fewer vision tokens per page.

Why It Matters

DeepSeek-OCR offers both a practical OCR tool and a new research direction. By proving that text can be represented visually without major accuracy loss, it opens the door to combining vision and language in new ways.

One of the more interesting ideas discussed in the paper is how optical compression could model human-like memory decay. The researchers propose that as time passes, older context in a conversation could be compressed into smaller, blurrier visual representations, much like how human memories fade.
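A toy sketch of that idea: render older conversation turns at progressively lower resolution so that distant context costs fewer vision tokens. The halving schedule and function below are hypothetical, not the paper's exact mechanism.

```python
from PIL import Image

def compress_history(pages, base_side=1280, min_side=256):
    """Downscale older context images so distant 'memories' cost fewer tokens.

    `pages` is a list of rendered context images ordered oldest -> newest;
    the resolution-halving schedule is an assumed illustration.
    """
    compressed = []
    for age, page in enumerate(reversed(pages)):        # age 0 = most recent turn
        side = max(min_side, base_side // (2 ** age))   # halve resolution per step back in time
        compressed.append(page.resize((side, side)))
    return list(reversed(compressed))                   # restore oldest -> newest order
```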
