How Johnson & Johnson India Built an In-House OCR Tool to Scan 4 Million Documents

When the data science team at Johnson & Johnson in India faced the daunting task of extracting text from nearly 4 million documents, they had two choices. They could either rely on third-party Optical Character Recognition (OCR) services at a significant cost or develop an in-house solution tailored to their needs.

Venkata Karthik T, senior manager of data science at Johnson & Johnson, shared his journey at MLDS 2025 and offered an insightful look into how necessity, innovation, and cost constraints led the team to build a robust, scalable OCR tool.

“At the end of the day, cost is a very critical factor for any project. So taking care of that is essential,” Karthik explained. Additionally, using third-party solutions raised privacy concerns.

The project, spanning over a year with several iterations, demonstrated the power of in-house innovation. “It’s more like we picked up what was really important and started implementing them. 95% of the problem gets solved with these things,” Karthik noted.

Handling sensitive internal documents required greater control over the data. Moreover, building an internal tool offered the advantage of customisation, enabling them to tailor the OCR engine to various use cases beyond document scanning.

The Thought Process Behind Building the Tool In-House

Karthik said that his team evaluated several OCR frameworks, prioritising activity metrics, capabilities, and usability. They created datasets for testing, categorising documents into digital, noisy, and handwritten types.

Digital documents were generated using SynthTIGER, noisy documents were sourced from the FUNSD dataset, and handwritten text was taken from the IAM dataset. After rigorous testing, they narrowed their focus to four key OCR models: PaddleOCR, Tesseract, EasyOCR, and HDR Pipeline.

Each framework had its strengths and weaknesses. PaddleOCR excelled at table extraction but struggled with dense text. Tesseract worked well on dense text but had issues with tables. HDR Pipeline performed best for handwritten text.

“So instead of choosing one, we combined PaddleOCR and Tesseract. We took both outputs and saw which one had the highest confidence score. If one model identified text and another missed it, we merged results, improving overall accuracy,” Karthik said.
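The merging rule Karthik describes can be sketched roughly as follows. This is a minimal illustration and not the team's implementation: the `Word` record, the box-overlap matching via `iou`, and the `min_iou` threshold are all assumptions made for the example, and real engine outputs would first need to be converted into this shape.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical normalised OCR result: text, confidence in [0, 1], pixel box."""
    text: str
    conf: float
    box: tuple  # (x0, y0, x1, y1)


def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def merge_outputs(first, second, min_iou=0.5):
    """Where two engines report overlapping words, keep the more confident one.
    Words found by only one engine are kept as they are."""
    merged, used = [], set()
    for w in first:
        best_j, best_score = None, min_iou
        for j, v in enumerate(second):
            if j in used:
                continue
            score = iou(w.box, v.box)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is None:
            merged.append(w)  # only the first engine saw this word
        else:
            used.add(best_j)
            v = second[best_j]
            merged.append(w if w.conf >= v.conf else v)
    # Words only the second engine saw
    merged.extend(v for j, v in enumerate(second) if j not in used)
    return merged


# Toy data standing in for the two engines' outputs
paddle = [Word("Total", 0.91, (0, 0, 50, 10)), Word("1O0", 0.40, (60, 0, 90, 10))]
tess = [Word("100", 0.88, (61, 0, 90, 10)), Word("USD", 0.80, (100, 0, 130, 10))]
print([w.text for w in merge_outputs(paddle, tess)])  # ['Total', '100', 'USD']
```

Matching on box overlap is only one way to decide that two engines are looking at the same word; confidence scores from different engines are also not necessarily calibrated against each other, so a production version would likely need some normalisation.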

The existing process was straightforward: send PDFs or images to a third-party service, extract text, and use it for downstream applications. However, the team sought to replicate and improve this with their own OCR pipeline.

The tool was evaluated using word error rate (WER), character error rate (CER), and accuracy. While third-party APIs achieved nearly 98-99% accuracy on digital and noisy documents, the hybrid model significantly improved internal performance. HDR Pipeline was particularly effective for handwritten text, reaching 85% accuracy.
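For readers unfamiliar with the two error metrics, here is a small sketch of how WER and CER are conventionally computed: the edit distance between the reference and the OCR output, divided by the reference length in words or characters. The function names are illustrative, and the article does not say how the team computed its separate "accuracy" figure, so that is left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


truth = "the quick brown fox"
ocr = "the quick brwn fox"
print(wer(truth, ocr))  # 0.25 (one wrong word out of four)
print(cer(truth, ocr))  # one missing character out of 19
```

Because a single wrong character makes the whole word count as an error, WER is usually the harsher of the two numbers on the same output.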

How AI Helped with Cost – the Biggest Factor

Cost efficiency was undoubtedly another key consideration. “We don’t want the API up and running all day. It’s an unnecessary cost,” Karthik said. Instead of a costly front-end UI, the team deployed a backend API on Kubernetes. They optimised infrastructure using batch processing, streaming mode, and event-based triggering to ensure the system ran only when needed.

To further refine text extraction, the team introduced AI-powered error correction. Using ChatGPT for low-confidence words improved accuracy by 3%. They also experimented with fine-tuned BERT models to correct OCR errors without relying on expensive third-party APIs.
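A rough sketch of the low-confidence correction step might look like the following. The threshold value, the `(text, confidence)` input shape, and the `corrector` callable are all assumptions for illustration; in practice the corrector would wrap a language model call (or a fine-tuned BERT model), which is stubbed here with a lookup table.

```python
def correct_low_confidence(words, corrector, threshold=0.6):
    """Send only words below the confidence threshold to the corrector.

    words:     list of (text, confidence) pairs from the OCR stage
    corrector: callable (word, context) -> corrected word; hypothetical
               stand-in for a language model
    """
    context = " ".join(text for text, _ in words)
    corrected = []
    for text, conf in words:
        if conf < threshold:
            text = corrector(text, context)
        corrected.append(text)
    return corrected


# Stub corrector: a real one would use the surrounding context
fixes = {"t0tal": "total"}


def toy_corrector(word, context):
    return fixes.get(word, word)


line = [("Invoice", 0.97), ("t0tal", 0.41), ("due", 0.93)]
print(correct_low_confidence(line, toy_corrector))  # ['Invoice', 'total', 'due']
```

Restricting the model call to low-confidence words keeps the number of paid API requests small, which fits the cost concern raised above.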

Additionally, they developed six pre-built extraction templates to streamline data retrieval. These templates allowed users to specify regions of interest, such as key-value pairs, structured tables, or spatial relationships within documents. This reduced the need for manual adjustments and sped up adoption within the organisation.

Since many documents contained tabular data, the team leveraged Microsoft’s Table Transformer. This model identified tables and their components, including rows, columns, and headers, before feeding them into PaddleOCR for text extraction.
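One common way to connect a table-structure detector to an OCR engine is to intersect the detected row and column boxes into cell regions and then run OCR on each cell. The article does not say whether the team worked at cell level, so the sketch below is an assumption; `cell_boxes` is a hypothetical helper and the boxes are made-up `(x0, y0, x1, y1)` pixel coordinates.

```python
def cell_boxes(rows, cols):
    """Build a grid of cell boxes from detected row and column boxes.

    Rows are ordered top to bottom and columns left to right; each cell
    takes its horizontal extent from the column and its vertical extent
    from the row.
    """
    rows = sorted(rows, key=lambda b: b[1])  # sort by top edge
    cols = sorted(cols, key=lambda b: b[0])  # sort by left edge
    return [[(c[0], r[1], c[2], r[3]) for c in cols] for r in rows]


# Two rows and two columns, deliberately given out of order
rows = [(0, 20, 100, 40), (0, 0, 100, 20)]
cols = [(50, 0, 100, 40), (0, 0, 50, 40)]
grid = cell_boxes(rows, cols)
print(grid[0][0])  # (0, 0, 50, 20): top-left cell
print(grid[1][1])  # (50, 20, 100, 40): bottom-right cell
```

Each cell box could then be cropped from the page image and passed to the OCR engine, which keeps the recognised text aligned with the table structure.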

For barcodes, they used a combination of YOLOv5 for detection and multiple open-source decoders. If these failed, they applied super-resolution techniques to enhance barcode clarity, boosting decoding accuracy to 84%.
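The fallback logic for barcodes can be sketched as a simple chain: try every decoder on the detected crop, and only if all of them fail, enhance the crop and try again. The decoders and the `enhance` function are passed in as plain callables because the article does not name the specific libraries used; the stubs at the bottom are purely illustrative.

```python
def try_decoders(image, decoders):
    """Return the first non-empty decode result, or None if every decoder fails."""
    for decode in decoders:
        try:
            result = decode(image)
        except Exception:
            result = None  # treat a crashing decoder like a failed read
        if result:
            return result
    return None


def decode_barcode(crop, decoders, enhance):
    """Decode a detected barcode crop, enhancing it only when needed.

    Returns (payload, stage), where stage records which attempt succeeded.
    """
    result = try_decoders(crop, decoders)
    if result is not None:
        return result, "original"
    result = try_decoders(enhance(crop), decoders)
    if result is not None:
        return result, "enhanced"
    return None, "failed"


# Stubs: an "image" is a dict, and decoding works only on sharp images
def stub_decoder(image):
    return image["payload"] if image["sharp"] else None


def stub_enhance(image):
    return {**image, "sharp": True}


blurry = {"sharp": False, "payload": "0123456789"}
print(decode_barcode(blurry, [stub_decoder], stub_enhance))  # ('0123456789', 'enhanced')
```

Running the enhancement step only after the cheap attempts fail avoids paying its compute cost on barcodes that decode cleanly the first time.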

Despite significant progress, Karthik said that challenges remain with handwritten text, where even humans struggle to decipher poor handwriting. However, the team is optimistic about integrating vision-language models (VLMs) like OCR-free RAG models to bypass traditional OCR altogether.

The post How Johnson & Johnson India Built an In-House OCR Tool to Scan 4 Million Documents appeared first on Analytics India Magazine.