Inside OpenAI’s $10 Bn Shortcut to Real-Time AI

As demand for real-time AI applications grows, the focus is turning to inference infrastructure. Low-latency performance is emerging as a key bottleneck in building applications like coding agents and voice-based interactions, forcing AI developers to look beyond traditional GPU-heavy architectures.

Real-time inference is critical for AI models to make instantaneous decisions, fuelling real-time applications such as autonomous driving and financial fraud detection.

OpenAI now has a first-mover advantage after entering a multi-year partnership with AI chipmaker Cerebras to deploy 750 megawatts of wafer-scale AI systems for inference. The rollout will begin in 2026 in multiple phases, with the infrastructure designed to serve OpenAI customers globally. The deal is valued at more than $10 billion, according to the Wall Street Journal.

“Cerebras adds a dedicated low-latency inference solution to our platform. That means faster responses, more natural interactions, and a stronger foundation to scale real-time AI to many more people,” Sachin Katti of OpenAI said in a statement.

The partnership builds on years of engagement between OpenAI and Cerebras; Sam Altman was one of Cerebras’ early investors. The deal comes at a critical moment for OpenAI as it diversifies its AI infrastructure, and on the heels of Apple and Google’s surprise partnership to bring Google’s AI technology into iOS, including the updated version of Siri.

Pressure on NVIDIA

Competition in the AI inference market is intensifying as AMD and Intel build lower-cost alternatives to GPUs, while hyperscalers such as Google and Amazon deploy their own custom accelerators, including Google’s TPUs and Amazon’s Trainium chips.

In this heated environment, OpenAI seems to have struck gold.

Cerebras claims its systems can run large language models up to 15x faster than GPU-based alternatives. The company’s current chip architecture is the Wafer Scale Engine-3 (WSE-3), a wafer-scale processor with around 4 trillion transistors and roughly 900,000 AI-optimised cores, which powers its latest system, the CS-3.

According to Cerebras, the CS-3 system is up to 21x faster than NVIDIA’s DGX B200 Blackwell system and operates at about one-third the cost and power, supporting applications including conversational AI, real-time code generation and reasoning tasks.

In an exclusive conversation with AIM in October last year, Andrew Feldman, co-founder and CEO of Cerebras, said wafer-scale computing sits at the heart of the company’s next phase of growth. “This is the largest chip in the history of the computer industry,” he boasted, adding that by keeping far more data on a single chip, Cerebras can process information faster, move data less frequently, consume less power, and deliver results in far less time.

“AI becomes exciting when the response is real time,” Feldman noted. “Nobody wants to wait 40 seconds or four minutes for an answer.”

NVIDIA is not a bystander either. Recently, Groq, a US-based company that builds specialised hardware for AI inference, announced a non-exclusive licensing agreement with NVIDIA valued at about $20 billion. As part of the deal, Groq founder Jonathan Ross, president Sunny Madra, and several other employees joined the company, bringing with them Groq’s low-latency language processing unit (LPU) technology.

Economics of Inference

Beyond speed, the economics of inference are central to Big Tech’s AI strategy. Faster inference can translate into lower cost per token by reducing compute time, energy consumption, and infrastructure overhead.
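As a rough illustration of that relationship, the sketch below derives cost per million tokens from an hourly system cost and a sustained throughput figure. The numbers are placeholders for the sake of arithmetic, not vendor pricing.

```python
# Back-of-envelope cost-per-token model (hypothetical numbers, not vendor pricing).
def cost_per_million_tokens(hourly_system_cost_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return hourly_system_cost_usd / tokens_per_hour * 1_000_000

# A system costing $40/hour that sustains 2,000 tokens/s beats a $30/hour
# system sustaining 500 tokens/s on cost per token, despite the higher rate.
print(cost_per_million_tokens(40.0, 2000))  # ~$5.56 per million tokens
print(cost_per_million_tokens(30.0, 500))   # ~$16.67 per million tokens
```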

“B200-class GPUs can be cost-effective when utilisation is high, traffic can be deeply batched and the software stack is well optimised,” Carmen Li, CEO of Silicon Data, tells AIM.

Li adds that many interactive inference workloads such as chat, agents and voice are bursty and sensitive to latency, which limits batching and creates inefficiencies. “These workloads don’t behave well on heavily batched systems,” she says.

Batching involves grouping multiple data inputs to process them together as a single batch to boost computational throughput.
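A minimal sketch of that trade-off, using made-up timing constants rather than real hardware numbers: grouping requests amortises fixed per-pass overhead and lifts throughput, but every request in the batch now waits for the whole batch to finish.

```python
# Toy model of static batching (illustrative constants, not measured figures).
FIXED_OVERHEAD_MS = 50   # per-forward-pass overhead (weight loads, scheduling)
PER_REQUEST_MS = 5       # marginal compute per request added to the batch

def latency_and_throughput(batch_size: int):
    batch_latency_ms = FIXED_OVERHEAD_MS + PER_REQUEST_MS * batch_size
    throughput_rps = batch_size / (batch_latency_ms / 1000)
    return batch_latency_ms, throughput_rps

for batch_size in (1, 8, 32):
    latency_ms, rps = latency_and_throughput(batch_size)
    print(f"batch={batch_size:>2}  latency={latency_ms} ms  throughput={rps:.0f} req/s")
# Larger batches raise throughput but stretch per-request latency,
# which is why bursty, latency-sensitive traffic batches poorly.
```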

Li notes that wafer-scale systems perform better economically in such scenarios by reducing the need for multi-GPU coordination and interconnect overhead, consolidating compute and memory bandwidth into a single system, and delivering more predictable latency when strict service-level requirements must be met.

Feldman highlights that GPUs still make sense for slower, throughput-oriented tasks like synthetic data generation. But for agentic AI, real-time reasoning, and customer-facing applications, wafer-scale has a decisive edge.

Escaping CUDA Lock-in

One of the biggest barriers to moving away from GPUs is software lock-in, particularly around NVIDIA’s parallel computing platform CUDA. Cerebras claims it has largely eliminated that friction.

“The way you move quickly and disintermediate CUDA is through the use of an API,” Feldman says. “Most application developers don’t want anything to do with CUDA.”

Instead, developers connect to Cerebras much like they would to any cloud AI API, by changing just a few lines of code.
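In practice that usually means pointing an existing OpenAI-style client at a different endpoint. The snippet below is a sketch under that assumption; the base URL and model identifier are illustrative placeholders, not confirmed values.

```python
# Sketch: routing an OpenAI-compatible client to a Cerebras-hosted endpoint.
# The base_url and model below are illustrative assumptions, not confirmed values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # hypothetical inference endpoint
    api_key="YOUR_CEREBRAS_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarise wafer-scale inference in one line."}],
)
print(response.choices[0].message.content)
```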

However, Li points to software constraints, saying wafer-scale platforms rely on specialised programming models, compilers and APIs that are narrowly optimised for machine learning. This limits flexibility compared to the CUDA ecosystem and suggests wafer-scale will function as a specialised inference tier rather than a universal replacement for GPUs.

Li notes wafer-scale inference can be faster and more power-efficient for workloads that fit its architecture, but fabrication yield remains a key variable. Even with fault tolerance, wafer-scale manufacturing is difficult, and if yields drive up costs, the performance benefits may not fully offset higher capital expenditure.

She adds that wafer-scale systems do not eliminate the costs of distributed computing. “Once workloads exceed a single system, or when geo-distribution and high availability matter, familiar scaling penalties reappear,” Li notes, adding that the approach primarily optimises single-node latency and efficiency rather than large-scale distributed inference.

What’s Next for Cerebras

Cerebras is reportedly in talks to raise $1 billion at a valuation of about $22 billion, nearly tripling its previous valuation. Last September, the company raised $1.1 billion in an oversubscribed Series G funding round, valuing it at $8.1 billion post-money.

In addition to OpenAI, Cerebras works with Abu Dhabi-based AI group G42. The company filed confidentially for an IPO in September 2024 but withdrew the filing in October 2025 amid scrutiny from the Committee on Foreign Investment in the United States over its ties to G42.

Beyond OpenAI and G42, Cerebras’ customers include AWS, Meta, IBM, Mistral, Cognition, and Hugging Face.

