The Future of Storage for HPC and AI
October 15, 2025 by Alex Woodie
The AI arms race has made “GPU” and “gigawatt” household words, and for good reason: What’s happening with the scale of compute is unprecedented. But what about the underlying storage layer? How are organizations going to store all the data for AI and keep those hungry GPUs fed? It turns out there’s a revolution occurring in storage for HPC and AI, too.
Welcome to our special HPCwire series about the future of storage for HPC and AI. In this first story, we’re going to lay out the current state of storage for AI and HPC and touch on some of the broad challenges that organizations are facing. In future pieces, we’ll dig into different aspects of the HPC and AI storage industry, and give our best data-driven bet at where this is all headed.
For starters, some things have changed with AI and HPC storage, but some things haven’t. On the hardware front, while solid-state drives (SSDs) based on NVMe flash have become dominant, there are still roles for spinning disk and even tape in the storage mix. Support for RDMA, whether over InfiniBand or Ethernet, along with Nvidia’s GPUDirect technologies, is helping to keep GPUs fed.
Gigawatt-scale data centers need ample storage (Source: Shutterstock)
From a software point of view, there is a broad mix of file systems and object stores in use. Parallel file systems that have powered traditional HPC workloads, such as Lustre, PanFS, and IBM Storage Scale (formerly Spectrum Scale and GPFS), are experiencing a renaissance thanks to the AI buildout. Training large AI models is similar in some ways to traditional HPC workloads like modeling and simulation. Both require moving large amounts of data, in relatively large block sizes, at high speed into a GPU and its associated memory, something that traditional parallel file systems are good at.
At the same time, some organizations are basing their AI storage on network-attached storage (NAS) systems that use NFS or parallel NFS (pNFS). A handful of software-only vendors from the NFS and pNFS world are finding success. Many storage vendors, whether they offer a traditional parallel file system or pNFS, and whether they are software-only plays or appliance sellers, are integrating S3-compatible object storage into the mix, primarily to serve AI inference workloads. Ethernet and InfiniBand are the predominant interconnects in AI and HPC, with RDMA used to speed up data transfers over both.
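For readers less familiar with the object side of the stack, “S3-compatible” simply means the store speaks Amazon’s S3 API, so standard client libraries work against it. The sketch below shows the idea using Python’s boto3 library; the endpoint URL, bucket, key, and credentials are hypothetical placeholders, not any particular vendor’s configuration.

```python
# Minimal sketch: reading an object from an S3-compatible store in an inference pipeline.
# The endpoint, bucket, key, and credentials below are invented placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",  # vendor's S3-compatible endpoint (hypothetical)
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Fetch a document that an inference or RAG pipeline might feed into a model.
response = s3.get_object(Bucket="inference-data", Key="docs/report-2025.pdf")
payload = response["Body"].read()
print(f"Fetched {len(payload)} bytes")
```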
What has changed is the scale of the storage and the way it’s used. A petabyte of storage used to be considered “big data,” but thanks to today’s super-dense flash, organizations can store an exabyte of data in a single rack. Gigawatt-scale data centers built by the likes of Meta, OpenAI, and Google will contain thousands of racks of storage servers to go along with compute clusters containing hundreds of thousands of GPUs. Some of these will rely on the latest proprietary networking technology from Nvidia, such as its NVLink interconnect.
Ascendant AI workloads bring slightly different requirements compared to HPC, including more data ingest, labeling, preparation, and sorting before the real work (model training) even begins. Once the model is trained, inference workloads bring yet another set of performance and capability requirements. File sizes range from large to small, and inputs into a chatbot or agentic AI interaction may call on pieces of data from a variety of different systems. Data orchestration becomes an issue, as do security, privacy, and data residency requirements.
Emerging tech, like the Nvidia NVSwitch, which strings multiple GPUs together using Nvidia’s NVLink technology to create a single GPU hypercluster, will press the bounds of storage
While commercial organizations often share compute and storage infrastructure between scientific computing and AI, the workloads have different requirements, said Addison Snell, the CEO of analyst firm Intersect360 Research. “And there’s a widening gap between what end users are asking for and what vendors are providing,” he said.
It used to be there were two storage tiers, disk and tape. “Now you get five, six, seven tiers in most of these environments,” Snell continued. “And performance now isn’t so much about how much bandwidth do I have to this one tier. It’s about how have I optimized it, what data is on which tier.”
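To make Snell’s point concrete, here is a toy placement policy, in Python, that decides which tier a file should live on based purely on how recently it was touched. The tier names and age thresholds are invented for illustration; real data-management software weighs far more signals than access recency.

```python
# Toy illustration of tier placement: decide where a file lives based on how
# recently it was accessed. Tier names and thresholds are hypothetical.
from datetime import datetime, timedelta

TIERS = [
    ("nvme-flash", timedelta(days=7)),        # hot: active training/inference data
    ("capacity-flash", timedelta(days=30)),
    ("disk", timedelta(days=365)),
    ("tape", None),                            # cold archive: everything older
]

def place(last_accessed: datetime, now: datetime | None = None) -> str:
    """Return the first tier whose age threshold the file still satisfies."""
    now = now or datetime.now()
    age = now - last_accessed
    for tier, max_age in TIERS:
        if max_age is None or age <= max_age:
            return tier
    return TIERS[-1][0]

print(place(datetime.now() - timedelta(days=3)))    # -> nvme-flash
print(place(datetime.now() - timedelta(days=400)))  # -> tape
```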
All of the companies chasing the HPC and AI storage market need to supply the underlying infrastructure to support the core capabilities, said Mark Nossokoff, the storage industry analyst at Hyperion Research. “But that’s just baseline and table stakes,” he told HPCwire. “You need stuff on top of it to really be able to manage and understand what’s going on with the data that’s being moved and stored and get it at the right place at the right time.”
AI training clusters will often feature specialized flash-based tiers called “burst buffers” to help smooth out rough I/O patches, such as checkpointing, during training. At inference time, many storage vendors have integrated key-value caches into their storage platforms that allow them to maintain state across the life of an AI interaction, or even store components of the conversation for later use.
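As a rough illustration of that inference-side idea, the sketch below keeps per-session conversation state in a small least-recently-used cache keyed by a session ID. It is a conceptual toy, not any vendor’s implementation; production key-value caches persist transformer attention state and conversation context at far larger scale and across nodes.

```python
# Conceptual sketch of a key-value cache that maintains state across an AI
# interaction, keyed by session ID, with least-recently-used eviction.
from collections import OrderedDict

class SessionKVCache:
    def __init__(self, max_sessions: int = 1024):
        self.max_sessions = max_sessions
        self._store: OrderedDict[str, list[str]] = OrderedDict()

    def append(self, session_id: str, turn: str) -> None:
        """Add one conversation turn, evicting the oldest session if the cache is full."""
        history = self._store.pop(session_id, [])
        history.append(turn)
        self._store[session_id] = history
        if len(self._store) > self.max_sessions:
            self._store.popitem(last=False)  # evict least-recently-used session

    def get(self, session_id: str) -> list[str]:
        """Return the stored turns for a session (empty if evicted or unknown)."""
        if session_id in self._store:
            self._store.move_to_end(session_id)  # mark as recently used
            return self._store[session_id]
        return []

cache = SessionKVCache()
cache.append("user-42", "What is a burst buffer?")
print(cache.get("user-42"))
```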
Coordinating data and metadata is an emerging problem in HPC and AI storage
Metadata management has become a bigger deal with AI storage, particularly when data is spread across multiple systems, including on-prem and in the cloud. Even cataloging, managing, and governing this metadata across a single exabyte-scale storage cluster is a challenge, and every vendor seems to implement this feature differently.
“AI wants access to all data across all locations, and that’s not how storage was typically built out. So that to me is the crux of what organizations are dealing with,” says Molly Presley, SVP of global marketing for Hammerspace. “Customers don’t know how to put all these pieces together. There’s a lot of new application technology that they’ve never worked with. And how do they decide which piece of the whole stack to use?”
Surveys have indicated that many (if not most) HPC organizations are already using their clusters to run AI workloads, whether in direct support of traditional modeling and simulation workloads or for other use cases, such as data analysis, literature review, hypothesis formulation, or assistance with scientific experiments. While there are similarities between the two types of workloads, there are important differences.
“It’s like a big zoo in HPC. You can pick an example from HPC that does anything: all writes, all reads, latency sensitive, throughput sensitive,” says James Coomer, senior vice president of products at DDN, who started out in the HPC business 30 years ago as a PhD-holding researcher.
“Anything you like, whether it’s fluid dynamics or crash simulation or cosmology or quantum mechanics modeling or whatever it is, you’ll find an application that does something weird to storage in a different way, whereas AI is actually, in that sense, more sensible,” Coomer says. “Training…loads those models, loads datasets, checkpoints. That’s pretty much it.”
The future of AI and HPC storage is bright
The challenges with fitting storage to AI are different. “We have customers who literally spend $1 billion,” Coomer continues. “Thirty percent is spent on the data center, cooling, and infrastructure power; 50% to 60% is on GPUs; 10% on networking; and basically 5% is on the storage. But if you spend that 5% of your budget on the wrong storage, you can really kill the productivity of that whole pie. You can get 25% less output because you spend this hidden time waiting for the data to move.”
Storage for AI is changing drastically, and yesterday’s concepts don’t apply to tomorrow’s problems, says WEKA CTO Shimon Ben-David. “If in the past you only talked about storage, you sold storage for backup, shared storage for block devices. That’s not something that will be sustainable much longer, because customers honestly are expecting much more.”
Nobody wants to buy storage today; instead, everybody wants to buy an outcome, Ben-David continued. “So you can’t just say, here’s my storage environment. What you are able to show is, I have an environment that accelerates your inferencing five times, 10 times more,” he says. “Or I have an environment that fully saturates your GPUs. Or here’s an environment that already contains vector databases and RDBMS databases that you can just use.”
Gartner recently published a report claiming that 60% of AI projects will be abandoned through 2026 due to a lack of AI-ready data, according to Jeff Baxter, vice president of marketing for NetApp. “And we’re seeing more and more [customers] run into the problem of the models are great, the data science is sound, but there’s not AI-ready data that’s easily accessible and governed in a way that can drive those experiments,” he said.
All told, these are great times to be in the high-end storage business, according to Erik Salo, the vice president of marketing for VDURA, the original developer of the Panasas file system, PanFS. “It’s just the coolest arms race I’ve ever seen in my whole career,” says Salo. “A couple of years ago, it was uncommon for me to see an RFQ for a terabyte a second of bandwidth. Now I’m seeing four, five, eight, nine terabytes a second for these systems. They’re just getting bigger and bigger.”
Stay tuned for our next article in this series.
This article first appeared on our sister publication, HPCwire.
About the author: Alex Woodie
Alex Woodie has written about IT as a technology journalist for more than a decade. He brings extensive experience from the IBM midrange marketplace, including topics such as servers, ERP applications, programming, databases, security, high availability, storage, business intelligence, cloud, and mobile enablement. He resides in the San Diego area.