How redBus Uses Raw Data from 150 billion Data Points

Various apps and services collect data, which must be processed, organised, and managed to be meaningful. Several technologies enable companies to accomplish this. How does a company like redBus, a bus ticket booking platform that processes 150 billion data points every day, handle this?

At DES 2025, Ravikumar Kumarasamy, VP of engineering at redBus, explained how the company rebuilt its data platform to make raw events queryable, evolvable, and ready for reuse. He highlighted schema drift, efficient storage, and why it pays to treat raw data as a long-term asset.

The company uses the raw data behind those data points, which is streamed in through applications and services. It was not stored in a way that supported historical querying or scalable reprocessing.

The team therefore set out to change that by building a storage framework that could infer schema on the fly, compact files for efficiency, and allow users to retrieve filtered datasets without engineering intervention.

Turning JSON haystacks into Apache Parquet

The approach begins with raw event ingestion over Apache Kafka, an open-source event streaming platform, with events arriving as JSON payloads. Instead of relying on fixed API contracts, the system infers the schema dynamically from each event.

Kumarasamy said, “I already have the schema, but the challenge is that the schema is ever evolving. We introduce new fields and information into the schema. So, instead of tracking the schema, we thought, why don’t we infer the schema from the raw data?”

The raw data includes time, source, country, ID description, and quantity. “So, now we derive a basic schema from that,” Kumarasamy said.
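A minimal sketch of inferring a schema from one JSON event, assuming a flat payload. The sample field names loosely follow those mentioned in the talk; the type labels and the function names `infer_type` and `infer_schema` are illustrative assumptions, not redBus's actual implementation.

```python
import json

def infer_type(value):
    """Map a decoded JSON value to a simple type label (labels are illustrative)."""
    # bool must be checked before int, because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        # Values that fit in 32 bits are labelled "int", larger ones "long"
        return "int" if -2**31 <= value < 2**31 else "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    return "struct"  # nested JSON object

def infer_schema(payload: str) -> dict:
    """Infer a field -> type mapping from one raw JSON event."""
    event = json.loads(payload)
    return {field: infer_type(value) for field, value in event.items()}

# Hypothetical event; the values are made up for illustration
raw = ('{"time": "2025-06-01T10:15:00+00:00", "source": "android", '
       '"country": "IN", "id": 9000000000, "quantity": 2}')
schema = infer_schema(raw)
```

Because the schema is derived from each payload, a newly introduced field simply shows up as one more key in the result, with no contract to update first.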

Once a schema is identified, it is versioned, bucketed, and used to cast the incoming data into a generalised format.
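One way to version an inferred schema is to fingerprint its canonical form, so that events with the same fields and types map to the same version regardless of key order. The sketch below assumes this hashing approach and a hypothetical in-memory `SchemaRegistry`; the talk does not describe how redBus assigns versions.

```python
import hashlib
import json

def schema_fingerprint(schema: dict) -> str:
    """Return a stable short ID for a field -> type mapping."""
    # sort_keys gives a canonical form, so field order does not change the ID
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

class SchemaRegistry:
    """Hypothetical in-memory registry that numbers each new schema it sees."""

    def __init__(self):
        self._versions = {}  # fingerprint -> version number

    def version_for(self, schema: dict) -> int:
        fingerprint = schema_fingerprint(schema)
        if fingerprint not in self._versions:
            self._versions[fingerprint] = len(self._versions) + 1
        return self._versions[fingerprint]

registry = SchemaRegistry()
v1 = registry.version_for({"country": "string", "quantity": "int"})
v1_again = registry.version_for({"quantity": "int", "country": "string"})
v2 = registry.version_for({"country": "string", "quantity": "int", "coupon": "string"})
```

Here the first two calls return the same version because only the key order differs, while the added `coupon` field produces a new one.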

Type mismatches are handled through defined casting rules: integers can be upcast to long data types, strings are treated as fallback types, and unsupported conversions are rejected early. This generalisation allows the system to normalise diverse data sources without losing their original fidelity.
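The three rules above can be sketched as a small casting function. The type labels, the `ALLOWED_UPCASTS` table, and the choice of `TypeError` for rejected conversions are assumptions made for illustration.

```python
# Widening casts that are treated as lossless (assumed rule table)
ALLOWED_UPCASTS = {("int", "long")}

def cast_value(value, source_type: str, target_type: str):
    """Cast one value to the target type, or raise if the cast is unsupported."""
    if source_type == target_type:
        return value
    if (source_type, target_type) in ALLOWED_UPCASTS:
        return int(value)  # Python ints are arbitrary precision, so nothing is lost
    if target_type == "string":
        return str(value)  # string acts as the fallback type
    # Anything else, such as string -> long, is rejected up front
    raise TypeError(f"unsupported cast: {source_type} -> {target_type}")

widened = cast_value(7, "int", "long")
fallback = cast_value(7, "int", "string")
```

Rejecting a cast such as string to long at this point keeps bad values out of the stored files instead of surfacing later at query time.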

The metadata extraction step is also notable. redBus pulls common fields like country, event source, and event time from every payload and appends them to the data, creating a consistent layer of information that supports filtering and querying without scanning entire datasets.
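A rough sketch of that extraction step, assuming the common fields sit at the top level of the JSON payload under the key names `country`, `source`, and `time`. The `meta_` prefix and the envelope layout are hypothetical.

```python
import json

# Assumed key names for the common fields
COMMON_FIELDS = ("country", "source", "time")

def with_metadata(payload: str) -> dict:
    """Return a record holding the extracted common fields plus the untouched payload."""
    event = json.loads(payload)
    record = {f"meta_{name}": event.get(name) for name in COMMON_FIELDS}
    record["payload"] = payload  # the raw event is kept as-is
    return record

raw = '{"time": "2025-06-01T10:15:00+00:00", "source": "android", "country": "IN", "quantity": 2}'
record = with_metadata(raw)
```

With the common fields available as their own columns, a query can filter on `meta_country` or `meta_time` without parsing every payload.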

Events are stored in Parquet format, which keeps them compact and easy to query. redBus uses an automated system to merge the large number of small files. Kumarasamy said this setup cuts storage needs by 93% while making the data easy to access.
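The merge step can be pictured as grouping small files into batches close to a target size and rewriting each batch as one larger file. The sketch below only plans the batches; the 128 MB target, the file names, and the greedy grouping are assumptions for illustration, since the talk does not describe redBus's actual compaction logic.

```python
MB = 1024 * 1024

def plan_compaction(file_sizes: dict, target_bytes: int = 128 * MB) -> list:
    """Group files (name -> size in bytes) into batches that stay within target_bytes."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Close the current batch if adding this file would exceed the target
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Hypothetical small files produced by streaming writes
files = {
    "part-0001.parquet": 40 * MB,
    "part-0002.parquet": 50 * MB,
    "part-0003.parquet": 60 * MB,
    "part-0004.parquet": 10 * MB,
}
batches = plan_compaction(files)
```

Each resulting batch would then be read and rewritten as a single Parquet file, which reduces the number of files a query has to open.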

Reusability of Data

To support exploration, the team developed an internal serverless tool called ‘Lenses’. It allows teams to extract datasets from raw data buckets using a simple interface with filter options like geography, event type, and time range. Behind the scenes, the tool creates Parquet files and gives users a link to download the data for analysis or tests.
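A request of this kind boils down to a filter over a few metadata fields. The sketch below shows only the selection step over in-memory records; the field names `country`, `event_type`, and `time` and the function `select_events` are hypothetical, and writing the Parquet output and download link is left out.

```python
from datetime import datetime, timezone

def select_events(events, country=None, event_type=None, start=None, end=None):
    """Yield events matching the filters; a filter left as None is not applied."""
    for event in events:
        if country is not None and event["country"] != country:
            continue
        if event_type is not None and event["event_type"] != event_type:
            continue
        timestamp = datetime.fromisoformat(event["time"])
        # The time range is half-open: start is included, end is excluded
        if start is not None and timestamp < start:
            continue
        if end is not None and timestamp >= end:
            continue
        yield event

# Hypothetical records for illustration
events = [
    {"country": "IN", "event_type": "search", "time": "2025-06-01T10:15:00+00:00"},
    {"country": "IN", "event_type": "booking", "time": "2025-06-01T11:05:00+00:00"},
    {"country": "SG", "event_type": "search", "time": "2025-06-01T10:45:00+00:00"},
]
selected = list(select_events(
    events,
    country="IN",
    start=datetime(2025, 6, 1, 10, 0, tzinfo=timezone.utc),
    end=datetime(2025, 6, 1, 11, 0, tzinfo=timezone.utc),
))
```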

“Our CRM and marketing teams, product managers and engineers use Lenses to solve problems,” Kumarasamy said.

While concerns about losing data fidelity in the transformation process were considered, Kumarasamy clarified that only the structure is reconstructed; the raw information remains intact.

A key advantage of the architecture is its temporal precision. Historical events can be retrieved accurately, down to specific hours on specific days, using metadata embedded in the file names and bucket IDs. This allows both real-time and retrospective analysis from a single unified storage layer.
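Hour-level retrieval works when the storage path itself encodes the time, so a query only needs to list the matching prefixes. The layout below (`bucket=<id>/date=YYYY-MM-DD/hour=HH/`) is a hypothetical naming scheme; the talk only says that the metadata is embedded in file names and bucket IDs.

```python
from datetime import datetime, timedelta

def hourly_prefixes(bucket_id: str, start: datetime, end: datetime) -> list:
    """List the storage prefixes for every hour that overlaps [start, end)."""
    # Round the start down to the beginning of its hour
    hour = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while hour < end:
        prefixes.append(f"bucket={bucket_id}/date={hour:%Y-%m-%d}/hour={hour:%H}/")
        hour += timedelta(hours=1)
    return prefixes

# Hypothetical bucket ID and time range that crosses midnight
prefixes = hourly_prefixes("b17", datetime(2025, 6, 1, 22, 30), datetime(2025, 6, 2, 1, 0))
```

Only the files under the returned prefixes need to be read, so a request for a few hours touches a few directories no matter how much history is stored.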

Together, these innovations form a modular, scalable system that balances flexibility, cost, and ease of use, transforming raw event streams into a trusted, queryable system of record.

redBus has built a flexible system that avoids the complexity of traditional data lakes. While it still faces some challenges like managing versions and merging files, the platform now treats raw data as more than just passing events; it is a structured, reusable source of information.

The post How redBus Uses Raw Data from 150 billion Data Points appeared first on Analytics India Magazine.
