The AI world is hungry for data. When it comes to English, there is an abundance of digitised, high-quality content that researchers can use to train AI models. However, there is a dearth of digitised content for native Indian languages like Hindi, Odia, Marathi and Telugu. Notably, most of such content is stored in libraries and old texts.
This is where optical character recognition (OCR) comes in. OCR has long been a cornerstone technique for converting various forms of written text, such as scanned documents, images, and PDFs, into machine-readable data.
However, since modern large language models (LLMs) can process large volumes of data in multiple languages simply by uploading PDFs or images, the relevance of OCR is in question. Homegrown initiatives like Bhashini and AI4Bharat, or startups like Sarvam, have built frameworks and applications for scanning text from images into machine-readable format.
Even so, this falls short of gathering massive amounts of data. Though companies could digitise content using OCR, it would take a lot of time and manual effort. At the same time, they still want high-quality data from the hundreds of thousands of books in Indic languages, which can realistically only be provided by OCR.
This is where LLMs have been playing a crucial role in helping them.
Are LLMs Killing OCR?
Indian startups like Sarvam AI have started training their models using synthetic data generated from Meta Llama 3.3. This allows companies to use model-generated data to train their own models.
The success of such approaches is evident in projects like Sarvam AI’s Sarvam 2B, which was trained on 2 trillion synthetic Indic tokens. This demonstrates how such data can efficiently train smaller, purpose-built models while retaining high performance.
Collecting data with OCR is very difficult, as it is largely a manual process of scanning documents. Hamid Shojanazeri, partner engineering manager (PyTorch and Llama) at Meta, said synthetic data generation solves critical bottlenecks in domains where collecting real-world datasets is too costly or impractical. “Synthetic data is important for advancing AI in privacy-sensitive areas or low-resource languages,” he added.
This is exactly why OCR is taking a back seat for startups that are focusing on the English language.
Traditional OCR systems have been instrumental in digitising printed text, but they often struggle with handwritten content, complex layouts, and diverse fonts. Recent models like GPT-4o mini have been able to identify text with much more accuracy than any OCR system, strengthening the case against traditional OCR.
For instance, platforms like Amazon Textract combine OCR with machine learning (ML) to extract text and data from virtually any document, improving accuracy and functionality.
Miguel Ríos Berríos, co-founder and CTO of Parcha, recently wrote on X that OCR remains relevant for simple text extraction tasks. However, in high-stakes applications like document verification, it is being overtaken by more advanced AI models that combine vision, language, and metadata analysis for real-time and adaptable decision-making.
“OCR and text-based rules only see half the picture. A document isn’t just its text content – it’s the relationship between visual elements, fonts’ consistency, official seals’ placement, and even the metadata traces left by editing software,” Berríos said. He added that modern vision models can process all these signals simultaneously, flagging subtle inconsistencies that traditional approaches miss.
Some believe that while LLMs are good at extracting text from clean images, they still struggle with complex documents like handwritten text, low-quality scans, or unusual fonts. Arham Raza, AI engineer at Clouxi Plexi, said, “OCR systems like ABBYY FineReader are specifically designed to handle these issues and remain far superior in these scenarios.”
According to him, OCR is much faster at processing large batches of text, while LLMs can be slower and have token limits. “OCR is far from dead, especially for things like legal or medical documents!”
India Keeps OCR Afloat
India’s linguistic diversity, with 22 officially recognised languages and numerous dialects, presents unique challenges for digital accessibility. Many documents, historical records, and literary works are available only in printed or handwritten forms in various Indian languages. This is what the Indian initiatives continue to focus on, and OCR may be the best option for now.
Ori Shachar, co-founder and CEO of Autom8Labs, wrote on LinkedIn, “After working with all the large LLM providers on analysing images of scanned text, I can declare the death of standard OCR applications. These LLMs simply extract the text and read it from the image seamlessly, doing a better job than OCR.”
But this is largely true for English. For other languages, the results are still not as precise as required.
Indian startups have long been dedicated to scaling their OCR capabilities. Although they could obtain a lot of Indic data through synthetic generation, the quality of text in many books cannot be guaranteed without OCR. However, this is also gradually changing, with LLMs being able to detect Indic language text when documents are uploaded.
While there are definitely experts who disagree that OCR is in trouble, citing problems with vision language models (VLMs) and LLMs such as hallucinations and the high cost of processing each image, the future of OCR looks hazy. The cost of running such models is decreasing. LLMs may be overkill for many tasks, but OCR might soon not be enough.
The post OCR is Dying, but Not in India appeared first on Analytics India Magazine.