With Rust, Cloudflare is Trying to Tackle the Industry’s Inference Bottleneck

Cloudflare has introduced Infire, a new LLM inference engine built in Rust, to run AI workloads on its distributed network.

Unlike hyperscalers that rely upon large centralised data centres packed with expensive GPUs, Cloudflare operates a lean global network that sits within 50 milliseconds of 95% of internet users. That unique architecture demands a more efficient way to serve inference.

Mari Galicer, group product manager at Cloudflare, in an interaction with AIM, explained how inference is a different challenge for them compared to hyperscalers. “Most hyperscalers operate large, centralised data centres with nodes dedicated to AI compute, whereas Cloudflare operates a lean, distributed network, with each compute node needing to serve different types of traffic.”

“This makes CPU and GPU overhead a challenge, and as a consequence, we have to manage resources much more dynamically and efficiently,” she said.

In a blog post, the company explained the motivation: it did not want to tackle scaling challenges simply by throwing money at the problem and buying more GPUs. Instead, it needed to utilise every bit of idle capacity and be agile about where each model is deployed.

Building Infire in Rust

The company had initially relied on vLLM, the widely used open-source inference engine, but discovered it was not optimised for dynamic, distributed edge workloads. Running Python-based vLLM also required sandboxing for security, which slowed performance and consumed valuable CPU cycles.

So, the team decided to build Infire in Rust. “The primary tradeoff was the up-front development cost of building something from the ground up,” Galicer said. “But because we have quite a few engineers with deep expertise in Rust, we found this was a worthwhile investment.”

This makes sense: Rust, as a choice of programming language, could spell trouble for a team without that depth of expertise.

Rust’s safety guarantees play a central role. Galicer explained that Rust’s compile-time memory safety protects against common vulnerabilities without the performance overhead of a garbage collector. That security lets Cloudflare deploy Infire directly and reliably on its servers, alongside other services, without resource-intensive sandboxing.
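To illustrate the point in general terms (this is not Infire code), the snippet below deliberately fails to compile: Rust’s borrow checker rejects a use-after-free before the program ever runs, which is the class of bug a garbage collector or a runtime sandbox would otherwise have to guard against.

```rust
// Generic illustration, not Infire code: this deliberately fails to compile.
fn main() {
    let buffer = vec![0u8; 1024];
    let view = &buffer[..16]; // borrow a slice of the buffer
    drop(buffer);             // error[E0505]: cannot move out of `buffer`
                              // because it is borrowed
    println!("{:?}", view);   // the borrow is still live here
}
```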

The architectural shift might explain Infire’s improved CPU efficiency. Benchmarks show Infire sustaining over 40 requests per second while using just 25% of CPU resources, compared with vLLM’s 140% usage on bare metal.

“Infire’s design reduces CPU overhead primarily by being built in Rust,” Galicer explained.

“This eliminates the need for a heavy security sandbox like gVisor, which Cloudflare had to use with the Python-based vLLM, thereby removing a major source of CPU consumption.”

Also Read: Cloudflare Just Became an Enemy of All AI Companies

Performance Edge

At its technical core, Infire employs techniques such as continuous batching, paged KV caching, and just-in-time CUDA kernel compilation, the last of which is tailored to Nvidia Hopper GPUs.
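Cloudflare has not published Infire’s internals, but the general idea behind continuous batching can be sketched in a few lines of Rust: waiting requests are folded into the running batch before every decode step, and finished requests free their slots immediately instead of holding the batch until the slowest request completes. All names in the sketch (Request, decode_step, serve) are illustrative, not Infire’s.

```rust
use std::collections::VecDeque;

// Hypothetical types and functions, for illustration only.
struct Request {
    id: u64,
    generated: Vec<u32>,
    done: bool,
}

// Stand-in for one forward pass of the model: every live request in the
// batch gets one more token.
fn decode_step(batch: &mut [Request]) {
    for r in batch.iter_mut() {
        r.generated.push(0);
        if r.generated.len() >= 8 {
            r.done = true; // pretend we hit end-of-sequence
        }
    }
}

fn serve(mut incoming: VecDeque<Request>, max_batch: usize) {
    let mut batch: Vec<Request> = Vec::new();
    while !batch.is_empty() || !incoming.is_empty() {
        // Continuous batching: top the batch up with waiting requests
        // before *every* decode step, not only between full batches.
        while batch.len() < max_batch {
            match incoming.pop_front() {
                Some(r) => batch.push(r),
                None => break,
            }
        }
        decode_step(&mut batch);
        // Finished requests leave immediately, freeing their slots.
        for r in batch.iter().filter(|r| r.done) {
            println!("request {} finished with {} tokens", r.id, r.generated.len());
        }
        batch.retain(|r| !r.done);
    }
}

fn main() {
    let incoming: VecDeque<Request> = (0..5)
        .map(|id| Request { id, generated: Vec::new(), done: false })
        .collect();
    serve(incoming, 2);
}
```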

“Infire compiles CUDA kernels at runtime that are specifically tailored for the exact model architecture and Cloudflare’s Nvidia Hopper GPUs,” Galicer said.

She highlighted that customised kernel generation, tailored to the specific operations and parameter sizes of a model, offers superior optimisation opportunities compared to the traditional method of integrating generic, pre-compiled kernels.
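Again as a rough illustration rather than Infire’s actual code, the sketch below generates CUDA source with a model’s dimensions baked in as compile-time constants, which is what lets a JIT compiler unroll loops and allocate registers for that exact shape. The ModelConfig type and the commented-out compile_to_ptx call are hypothetical stand-ins; a real implementation would hand the generated source to a JIT compiler such as NVRTC.

```rust
// Sketch only: generate CUDA C source with the model's dimensions baked in
// as compile-time constants, then JIT-compile it for the target GPU.
// `compile_to_ptx` is a hypothetical stand-in for an NVRTC binding.

struct ModelConfig {
    hidden_size: usize,
    num_heads: usize,
}

fn specialized_kernel_source(cfg: &ModelConfig) -> String {
    format!(
        r#"
#define HIDDEN_SIZE {hidden}
#define NUM_HEADS   {heads}
#define HEAD_DIM    (HIDDEN_SIZE / NUM_HEADS)

extern "C" __global__ void qk_dot(const float* q, const float* k, float* out) {{
    // One thread per attention head; because HEAD_DIM is a compile-time
    // constant here, the compiler can fully unroll the loop below.
    int head = blockIdx.x * blockDim.x + threadIdx.x;
    if (head >= NUM_HEADS) return;
    float acc = 0.0f;
    #pragma unroll
    for (int d = 0; d < HEAD_DIM; ++d) {{
        acc += q[head * HEAD_DIM + d] * k[head * HEAD_DIM + d];
    }}
    out[head] = acc;
}}
"#,
        hidden = cfg.hidden_size,
        heads = cfg.num_heads,
    )
}

fn main() {
    // Llama 3.1 8B dimensions, as an example input to the generator.
    let cfg = ModelConfig { hidden_size: 4096, num_heads: 32 };
    let source = specialized_kernel_source(&cfg);
    // let ptx = compile_to_ptx(&source); // hypothetical JIT step (e.g. via NVRTC)
    println!("{source}");
}
```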

The result is higher throughput, lower latency, and greater GPU utilisation. Benchmarks show Infire delivering up to 7% faster completions than vLLM on unloaded hardware and significantly better performance under real-world load.

According to Galicer, workloads with “high-throughput and many concurrent requests, which are typical in a large, distributed edge network, see the most improvement.”

Currently, Infire powers the Llama 3.1 8B model in Workers AI, and Cloudflare says more models will follow.

While the company has not yet tested models like DeepSeek or Qwen, Galicer confirmed that Infire will evolve alongside Cloudflare’s AI catalogue.

Future Direction

When asked about open sourcing, Galicer was cautious: “We’re in the very early stages of developing Infire, and as the project matures, we will continue to evaluate whether we should open source it.”

For Cloudflare, the project sits at the centre of its long-term strategy.

“Infire is a foundational part of our AI strategy because it provides a highly efficient and secure engine for running AI inference directly on our globally distributed network,” Galicer said.

Infire serves as the platform through which the company believes it can address numerous performance hurdles, ultimately leading to faster and more cost-effective inference at Cloudflare.

Also Read: Why Gleam Could Be The Next Most Admired Programming Language After Rust
