Have LLMs Hit a Wall? Microsoft chief Satya Nadella tackled this hot-button issue at Microsoft Ignite 2024, offering a refreshingly candid take on the discussion.
“There’s a lot of debate on whether we have hit the wall with scaling laws. Is it going to continue? The thing to remember, at the end of the day, is that these are not physical laws. They are just empirical observations that held true, much like how Moore’s Law did for a long time,” he said.
Nadella welcomed the scepticism and debate, saying they help push innovation in areas such as model architectures, data regimes, and systems architecture. He also discussed OpenAI’s new scaling law, which focuses on test-time computing, and how it will be integrated into features like Copilot Think Deeper, powered by OpenAI’s o1.
In a recent earnings call, NVIDIA chief Jensen Huang said that OpenAI o1 had introduced a new scaling law called ‘test-time scaling’, which consumed a lot of computing resources. Microsoft is working closely with NVIDIA to scale test-time computing for OpenAI.
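In broad strokes, test-time scaling means spending more compute at inference to get a better answer, for example by sampling several candidate responses and keeping the highest-scoring one. The sketch below is a minimal best-of-N illustration of that idea in Python; the `generate` and `score` functions are hypothetical stand-ins, not OpenAI’s or Microsoft’s actual method.

```python
import random

def generate(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for one sampled model response.
    rng = random.Random(seed)
    return f"candidate answer #{rng.randint(0, 99)} for: {prompt}"

def score(prompt: str, answer: str) -> float:
    # Hypothetical verifier / reward model; here just a dummy heuristic.
    return float(sum(ord(c) for c in answer) % 101)

def best_of_n(prompt: str, n: int) -> str:
    """Spend more inference-time compute (larger n) to pick a better answer."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

if __name__ == "__main__":
    # Doubling n roughly doubles the test-time compute spent on this one prompt.
    print(best_of_n("Why did pre-training scaling slow down?", n=8))
```

The point of the toy is only that quality becomes a function of inference budget (n), which is why test-time scaling drives up demand for serving compute.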
Nadella emphasised the importance of maximising value in the most efficient way. “Last month, we introduced new clusters with H200s that became available. We’re very excited about it,” said Nadella. He added that with their stack optimisation between H100 and H200, Azure can deliver performance for everything from inference to training.
Efficiency Wars: Tokens, Watts, and Dollars
“Tokens per watt plus dollar is the best way to think about the new currency of performance,” said Nadella, adding that Microsoft will continue to build new data centre intelligence factories.
Nadella was describing a metric that captures the efficiency of token generation in terms of both energy consumption (watts) and cost (dollars): the more tokens a system produces for every watt drawn and every dollar spent, the better the infrastructure is performing.
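To make the “tokens per watt plus dollar” idea concrete, one can compute tokens per watt-hour and tokens per dollar for a serving run. The Python sketch below does exactly that with made-up example numbers; it illustrates the arithmetic only and is not Azure or NVIDIA data.

```python
def token_efficiency(tokens: float, avg_power_w: float,
                     runtime_s: float, cost_usd: float) -> dict:
    """Illustrative efficiency figures for one serving run (all inputs are
    made-up example values, not vendor data)."""
    energy_wh = avg_power_w * runtime_s / 3600          # watt-hours consumed
    return {
        "tokens_per_wh": tokens / energy_wh,             # energy efficiency
        "tokens_per_dollar": tokens / cost_usd,          # cost efficiency
    }

if __name__ == "__main__":
    # Example: 1M tokens served in one hour on a 700 W accelerator rented at $2/hour.
    print(token_efficiency(tokens=1_000_000, avg_power_w=700,
                           runtime_s=3600, cost_usd=2.0))
```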
Despite the progress, NVIDIA has yet to fully solve the inference challenge. Acknowledging the difficulties involved, Huang said the company’s goal is to produce tokens at low latency.
“Inference is super hard. And the reason…is that you need the accuracy to be high…You need the throughput to be high so that the cost can be as low as possible. But you also need the latency to be low. And computers that are high-throughput and have [low] latency are incredibly hard to build,” he said.
“Our hopes and dreams are that, someday, the world will do a ton of inference,” said Huang, adding that there will be thousands of AI-native start-ups that will generate tokens.
Microsoft also announced the preview of NVIDIA Blackwell AI infrastructure on Azure.
“Blackwell is pretty amazing. It has 72 GPUs on a single NVLink domain, and when combined with InfiniBand on the backend, these racks are optimised for the most cutting-edge training and inference workloads. We are very excited about having Blackwell,” said Nadella.
Besides NVIDIA, Microsoft is also working closely with AMD. “We were the first cloud to offer VMs powered by AMD’s MI300X GPU, and we’re using that infrastructure to power Azure OpenAI. Today, we’re introducing Azure HBv5, which we co-engineered with AMD,” he said.
Nadella confirmed that Azure HBv5 is up to eight times faster than any other cloud virtual machine, setting a new standard for high-performance computing, and it will be generally available next year.
Data Centre as a Product
In a recent podcast with No Priors, Huang explained that NVIDIA now views data centres as a product rather than just GPUs.
“We have an initiative in our company called data centre as a product. We don’t sell it as a product, but we have to treat it like it’s a product—everything from planning for it to standing it up, optimising it, tuning it, and keeping it operational,” he said, adding that their goal is for data centres to be as beautiful as iPhones.
However, he acknowledged that energy, capital, and supply chains are major challenges when it comes to scaling compute. Interestingly, with the growing demand for data centres to power AI technology, tech giants such as Microsoft, Google, and Amazon have struck deals with nuclear power plant operators.
Huang highlighted that intelligence is built on top of computing power, marvelling at the evolution of data centres. Initially, data centres were simply used for storing data, but now, they are generating new tokens.
“We are creating single-tenant data centres that don’t just store files; instead, they produce tokens. These tokens are then reconstituted into something that seems like intelligence,” he elaborated.
According to Huang, intelligence or tokens can take many forms. “It could be robotic motion, sequences of amino acids, or chemical chains – the possibilities are countless.”
Meanwhile, Groq chief Jonathan Ross shared similar views in a recent LinkedIn post, comparing generative AI to the internet and mobile phones. He explained that the internet was part of the Information Age, and its primary function was to take a piece of data, replicate it with high fidelity, and distribute it globally.
Generative AI, on the other hand, is different. “It’s not about copying,” Ross said. “It’s not about data or information. It’s about compute.”
Compute is the New Currency
In an interview with Lex Fridman earlier this year, OpenAI chief Sam Altman said, “Compute is going to be the currency of the future. It may become the most valuable commodity in the world and we should invest significantly in expanding compute resources.”
In a similar vein, Altman proposed a concept where everyone would have access to a portion of GPT-7’s computing resources. “I wonder if the future looks something more like ‘universal basic compute’ than universal basic income, where everyone receives a slice of GPT-7 compute,” Altman speculated.
This explains why OpenAI plans to partner with TSMC and Broadcom to launch its first in-house AI chip by 2026.
On the other hand, Elon Musk’s xAI has built the world’s largest and most powerful AI supercomputer, Colossus, a liquid-cooled cluster in Memphis, comprising 100,000 NVIDIA H100 GPUs. xAI is now working to double its size to a combined total of 200,000 NVIDIA Hopper GPUs.
Inference War
Moreover, OpenAI’s new scaling method has prompted major inference chip makers like Groq, SambaNova, and Cerebras to push their inference performance, enabling them to run Llama models at record-breaking speeds.
Cerebras recently shared that Llama 3.1 405B now runs at 969 tokens per second on its hardware. Groq, meanwhile, launched a new endpoint for Llama 3.1 70B that achieves 1,665 tokens per second by leveraging speculative decoding, while SambaNova’s Llama 3.1 405B deployments now run at up to 200 tokens per second.
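Speculative decoding, which Groq credits for its Llama 3.1 70B speed-up, has a small draft model propose the next few tokens and a large target model verify them, so accepted tokens come almost for free and throughput rises. The Python sketch below is a toy greedy-verification version of that loop with dummy models; it illustrates the general technique, not Groq’s implementation.

```python
import random

def draft_model(context: list[str]) -> str:
    # Hypothetical cheap draft model: guesses the next token quickly.
    return f"tok{random.randint(0, 3)}"

def target_model(context: list[str]) -> str:
    # Hypothetical large target model: the token we actually want to emit.
    return f"tok{random.randint(0, 3)}"

def speculative_decode(prompt: list[str], max_new: int, k: int = 4) -> list[str]:
    """Toy greedy-verification loop: the draft model proposes up to k tokens,
    the target model checks them, and generation falls back to the target's
    own token at the first disagreement."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        for _ in range(k):
            proposal = draft_model(out)
            verified = target_model(out)   # real systems verify all k proposals
                                           # in one batched forward pass
            if proposal == verified:
                out.append(proposal)       # accepted: a near-free token
            else:
                out.append(verified)       # rejected: keep the target's token
                break
            if len(out) - len(prompt) >= max_new:
                break
    return out

if __name__ == "__main__":
    print(speculative_decode(["<s>"], max_new=12))
```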
Meanwhile, NVIDIA has reportedly asked its suppliers to redesign the Blackwell server racks multiple times to address overheating issues. In response, Groq’s Sunny Madra posted a picture of Groq’s cluster with the caption, “Air cooled here.”
However, Huang has dismissed such reports, asserting that Blackwell production is running at full steam. NVIDIA’s CFO revealed that the company shipped 13,000 GPU samples to customers in the third quarter, including one of the first Blackwell DGX engineering samples to OpenAI.
On the other hand, SambaNova AI’s Rodrigo Liang said, “SambaNova’s DataScale rack weighs just 738 pounds, requires no special cooling or power, and outperforms an NVIDIA rack while using only one-tenth of the power.”