Amazon Web Services (AWS) is shaping up as a strong contender to NVIDIA, with its 2015 acquisition of Annapurna Labs – an Israeli startup named after the Annapurna mountain range in the Himalayas – proving to be an advantage. At AWS re:Invent in Las Vegas, the cloud giant announced new chips, including Trainium2, Graviton4 and Inferentia.
AWS claims that Trainium2 offers 30-40% better price performance than the previous generation of GPU-based Amazon Elastic Compute Cloud (EC2) instances. Customers such as Anthropic, Databricks, Adobe, Qualcomm, poolside, and even Apple are already on board.
“Today, there’s really only one choice on the GPU side, and it’s just NVIDIA,” said Matt Garman, CEO at AWS. “We think that customers would appreciate having multiple choices.”
It is worth noting that AWS recently invested $4 billion in Anthropic, making AWS the startup's primary cloud provider and training partner. The company also introduced Trn2 UltraServers and the next-generation Trainium3 AI training chip.
AWS is working with Anthropic to build Project Rainier – a large AI compute cluster powered by thousands of Trainium2 chips. This will help Anthropic develop its models, including optimising its flagship product Claude, to run on Trainium2 hardware.
“This cluster is going to be five times the number of exaflops as the current cluster that Anthropic used to train their leading set of Claude models that are out there in the world,” Garman added.
On the other hand, OpenAI plans to partner with Taiwan Semiconductor Manufacturing Company (TSMC) and Broadcom to launch its first in-house AI chip by 2026. Meanwhile, OpenAI is also banking on NVIDIA’s Blackwell GPU architecture to scale its o1 model and test-time compute.
Notably, Anthropic CEO Dario Amodei said in a recent podcast that the cost of training AI models today can reach up to $1 billion. While models like GPT-4 cost approximately $100 million, he predicts that within the next three years, training costs could escalate to $10 billion or even $100 billion.
Advantage Trainium?
According to Garman, Trainium2 delivers 30-40% better price performance than current GPU-powered instances. The new Trn2 instances come equipped with 16 custom-built chips interconnected via NeuronLink, a high-speed, low-latency interconnect. This configuration provides up to 20.8 petaflops of compute from a single node.
The company also introduced Trn2 UltraServers, which combine four Trn2 servers into a single system and offer 83.2 petaflops of compute power for better scalability. These servers feature 64 interconnected Trainium2 chips. For comparison, NVIDIA’s Blackwell B200 is expected to provide up to 720 petaflops of FP8 performance with a rack of 72 GPUs.
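The quoted figures are internally consistent, as a quick back-of-the-envelope check shows (a sketch based only on the numbers AWS cites; real-world throughput depends on precision, workload, and interconnect efficiency):

```python
# Sanity-check the compute figures AWS quotes for Trn2 hardware.
TRN2_INSTANCE_PFLOPS = 20.8       # one Trn2 instance: 16 Trainium2 chips
CHIPS_PER_INSTANCE = 16
INSTANCES_PER_ULTRASERVER = 4     # an UltraServer combines four Trn2 servers

# Implied peak compute of a single Trainium2 chip
per_chip_pflops = TRN2_INSTANCE_PFLOPS / CHIPS_PER_INSTANCE

# 64 chips across four interconnected instances
ultraserver_pflops = TRN2_INSTANCE_PFLOPS * INSTANCES_PER_ULTRASERVER

print(f"Per Trainium2 chip: {per_chip_pflops} petaflops")
print(f"Trn2 UltraServer (64 chips): {ultraserver_pflops} petaflops")
```

The 83.2-petaflop UltraServer figure is exactly four instances' worth, i.e. 64 chips at roughly 1.3 petaflops each, so the UltraServer's gain is scale-out capacity rather than faster individual chips.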
Few expected Apple to use AWS Trainium2 to train its models. Benoit Dupin, Apple’s senior director of machine learning and AI, revealed how deeply the company relies on AWS for its AI and ML capabilities. Dupin credited the decade-long partnership with AWS for enabling Apple innovations like Siri, iCloud Music, and Apple TV. “AWS has consistently supported our dynamic needs at scale and globally,” he said.
Apple has leveraged AWS solutions, including the Graviton and Inferentia chips, achieving milestones like a 40% efficiency boost by migrating to Graviton instances. Dupin also teased early success with AWS Trainium2 chips, which could deliver a 50% leap in pre-training efficiency.
“We knew that the first iteration of Trainium wasn’t perfect for every workload, but we saw enough traction to give us confidence we were on the right path,” Garman revealed.
Trainium chips are programmed through the Neuron SDK, which optimises AI workloads for the hardware. It supports both deep learning training and inference, integrates with TensorFlow and PyTorch, and avoids closed-source dependencies. However, it still faces an uphill battle against NVIDIA’s CUDA ecosystem.
Switching from NVIDIA to Trainium requires hundreds of hours of testing and rewriting code – a barrier few companies want to cross. Acknowledging this challenge internally, AWS called CUDA the single biggest reason customers stick with NVIDIA.
Meanwhile, Amazon’s cloud rivals, Microsoft and Google, are working on their own AI chips to reduce their reliance on NVIDIA. Google recently announced the general availability of Trillium, its sixth generation of Tensor Processing Unit (TPU). “Using a combination of our TPUs and GPUs, LG AI Research reduced inference processing time for its multimodal model by more than 50% and operating costs by 72%,” said Google chief Sundar Pichai during the recent earnings call.
Numerous other companies are likewise chasing NVIDIA’s chip market share, including AI chip startups such as Groq, Cerebras Systems, and SambaNova Systems.
The AI chip market is projected to hit $100 billion in the coming years, and AWS is pouring billions of dollars into Trainium to stake its claim. Dethroning NVIDIA, however, won’t be an easy feat.
The post AWS Thinks it Can Solve NVIDIA’s Customer Problems appeared first on Analytics India Magazine.