The CUDA Killer

While NVIDIA’s fame rests on its GPUs, the real magic comes from CUDA, the software the company can’t do without. In a recent interview with No Priors, NVIDIA chief Jensen Huang said the goal is for AI engineers to “build once, run everywhere”.

“The investment in software is the most expensive,” Huang said, speaking about CUDA and its critical role in supporting and maximising the potential of hardware.

He further said that NVIDIA maintains a strong commitment to supporting its software indefinitely, citing programming language C as an example of this approach. “We’ve never given up on a piece of software,” he said, adding that NVIDIA will continue to maintain the software it develops “for as long as we shall live”.

In another interview, he revealed that the psychology of protecting the software began in 1993 (the year NVIDIA was founded) and has been the company’s priority ever since. “The reason why NVIDIA’s CUDA has such a massive install base is because we have always protected it,” said Huang.

Today, five million developers across about 40,000 companies use CUDA. It provides a robust environment with over 300 code libraries, 600 AI models, and support for 3,700 GPU-accelerated applications, catering to diverse computing needs.

CUDA primarily supports C, C++, and Fortran and includes a well-established API with extensive libraries for parallel processing, such as cuBLAS for linear algebra, cuDNN for deep learning, and Thrust for parallel algorithms. Developers can also use frameworks like PyTorch and TensorFlow, which come with built-in CUDA support.
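
To make that built-in support concrete, here is a minimal sketch in PyTorch: the same Python code runs on CPU or GPU, and on a CUDA device an operation like matrix multiplication is typically dispatched to libraries such as cuBLAS under the hood. It assumes a CUDA build of PyTorch and an NVIDIA GPU.

    import torch

    # Use the GPU when PyTorch was built with CUDA support and a device is present.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Allocate two random matrices directly on the chosen device.
    a = torch.randn(1024, 1024, device=device)
    b = torch.randn(1024, 1024, device=device)

    # On a CUDA device, this matmul is typically serviced by cuBLAS.
    c = a @ b
    print(c.device)  # e.g. "cuda:0"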

AMD ROCm Rolls

To challenge NVIDIA’s CUDA, AMD launched ROCm 6.2, which introduces support for essential AI features such as the FP8 datatype, Flash Attention 3, Kernel Fusion, and more. These updates enable ROCm 6.2 to deliver up to a 2.4x performance boost in inference and a 1.8x improvement in training across a range of LLMs compared to ROCm 6.0.

In an exclusive interview with AIM, Sasank Chilamkurthy, the founder of Johnaic, said that ROCm has an advantage over CUDA due to its strong support for PyTorch. He added that companies that prefer to own their code may benefit from using ROCm. Chilamkurthy also shared a fun fact with us: ROCm was initially built for NVIDIA GPUs; AMD only later prioritised it for its own GPUs.
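
That PyTorch support is visible at the API level: a ROCm build of PyTorch exposes AMD GPUs through the familiar torch.cuda namespace (backed by HIP), so CUDA-oriented PyTorch code can often run unchanged. A small sketch, assuming a ROCm build of PyTorch and a supported AMD GPU:

    import torch

    # On a ROCm build, torch.version.hip is set and torch.cuda is backed by HIP.
    print(torch.version.hip)          # a ROCm version string here; None on CUDA builds
    print(torch.cuda.is_available())  # True when a supported AMD GPU is visible

    # The device is still addressed as "cuda", which maps to the AMD GPU here.
    x = torch.randn(4096, 4096, device="cuda")
    y = torch.nn.functional.relu(x @ x)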

Notably, the company recently introduced its new MI325X accelerators for training and inferencing LLMs.

During the Advancing AI 2024 summit, Vamsi Boppana, SVP of AI at AMD, shared more details about ROCm.

“ROCm is a complete set of libraries, runtime compilers, and tools needed to develop and deploy AI workloads. We designed ROCm to be modular and open source, allowing for rapid contributions from AI communities,” said Boppana, adding that it is also designed to connect easily with ecosystem components and frameworks like PyTorch and model hubs like Hugging Face.

He explained that they have expanded support for newer frameworks like JAX and implemented powerful new features, algorithms, and optimisations to deliver the best performance for generative AI workloads.

AMD also supports various open-source frameworks, including vLLM, Triton, SGLang, ONNX Runtime, and more. Boppana revealed that today, over 1 million Hugging Face models run on AMD.

“We have a very deep partnership with PyTorch. We are fully upstreamed in it, run over 200,000 tests nightly, and are a tier-one citizen within the PyTorch community,” said Bradley McCredie, corporate vice president at AMD, adding that there are only two compute platforms that are fully upstreamed in PyTorch, and AMD is one of them.

AMD also works closely with Triton, an open-source programming language and compiler developed by OpenAI for GPU programming. Though Triton was originally designed for NVIDIA GPUs, recent developments have made it compatible with AMD GPUs through the ROCm platform.

“Triton is an extremely strategic platform for our industry as it provides a highly productive environment with a high level of abstraction for coders, enabling excellent performance,” said McCredie. He added that it removes the dependence on hardware-specific languages like CUDA, allowing programmers to write code at this level of abstraction and compile it directly for the AMD platform.
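
For a sense of what that abstraction looks like, below is the standard vector-addition example in Triton — a minimal sketch rather than anything AMD-specific. The same Python-embedded kernel source can be compiled for NVIDIA GPUs or, via the ROCm backend, for AMD GPUs.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        # Each program instance handles one BLOCK-sized slice of the vectors.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n  # guard the tail block against out-of-bounds accesses
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    n = 10_000
    x = torch.rand(n, device="cuda")  # on ROCm PyTorch, "cuda" targets the AMD GPU
    y = torch.rand(n, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(n, 1024),)    # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    assert torch.allclose(out, x + y)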

Moreover, AMD GPUs with ROCm offer a more cost-effective option when compared to NVIDIA GPUs with CUDA. Although AMD’s top-tier GPUs may lag NVIDIA’s by 10-30% in raw performance, the price difference can be significant.

AMD showcased testimonials for ROCm at Advancing AI 2024 by inviting startup leaders, including Amit Jain, the CEO of Luma AI; Ashish Vaswani, the CEO of Essential AI; Dani Yogatama, the CEO of Reka AI; and Dmytro Dzhulgakov, the CTO of Fireworks AI.

Luma AI recently launched a video-generation model called Dream Machine. “The models we’re training are very challenging and don’t resemble LLMs at all. However, we were impressed with how quickly we could get the model running on ROCm and MI300X GPUs. It took us just a few days to establish the end-to-end pipeline, which is quite fantastic,” said Jain.

Advantage CUDA?

AMD’s ROCm is not quite as mature as CUDA. NVIDIA’s ecosystem around CUDA is extensive, with a large developer community, thorough documentation, and a broad set of tools for debugging and profiling.

Most deep learning frameworks, HPC (high-performance computing) applications, and libraries are developed with CUDA in mind, making it the go-to choice for many developers.

NVIDIA introduced CUDA in 2006 as a proprietary parallel computing platform and programming model, while ROCm did not arrive until 2016, giving CUDA nearly a decade’s head start.

AMD is still developing ROCm to catch up. A user on Hacker News suggested that for ROCm to have a chance against CUDA, AMD would need to commit billions to ecosystem-building. This involves supporting developers, creating resources, and nurturing a long-term platform that can rival CUDA’s popularity.

The user explained that historically, hardware companies like Intel and AMD have struggled with building and maintaining strong software ecosystems. Using OpenCL as an example, he pointed out that although it was supported by hardware companies, it failed to develop into a strong competitor due to inconsistent support and a lack of ecosystem investment.

The only problem with CUDA is that it is closed source and works only for NVIDIA GPU workloads. Though people have devised several workarounds for this restriction, CUDA remains the best performer on NVIDIA GPUs, and since everyone is using NVIDIA’s GPUs, the moat only grows bigger.

Today, NVIDIA controls 95% of the AI chip market. “CUDA is dominant in GPU programming because NVIDIA dominates the GPU market when it comes to AI, ML, and other GPU programming applications. CUDA is exclusive to NVIDIA,” posted a user on Reddit.

In India, startups like Unscript and Sarvam AI revealed to AIM that they work with NVIDIA GPUs and CUDA and haven’t adopted ROCm yet.

Last year, AMD acquired Nod.ai to provide AI customers with open software for easily deploying high-performance AI models optimised for AMD hardware. The company recently acquired Europe’s private AI lab, Silo AI. “We recently completed the acquisition of Silo AI, which adds a world-class team with tremendous experience in training and optimising LLMs, as well as delivering customer-specific AI solutions,” said AMD chief Lisa Su.

All in all, while ROCm offers a compelling alternative, CUDA’s maturity and widespread adoption make it hard to surpass.
