Meta has unveiled key details about its latest hardware infrastructure, built for AI training and, as Yann LeCun pointed out, used for training Llama 3. The company revealed insights into its two 24,576-GPU data centre-scale clusters, which support its current and forthcoming AI models, including Llama 3, the successor to Llama 2.
Representing a significant investment in AI hardware, the clusters underscore the pivotal role of infrastructure in shaping the future of AI. They are designed to power Meta’s long-term vision of building AGI openly and responsibly, and of making it widely accessible.
The computing infrastructure for Llama-3 training. https://t.co/xpUfvCHjW0
— Yann LeCun (@ylecun) March 12, 2024
Meta has deployed two variants of its 24,576-GPU cluster, each built on a different network fabric. One cluster uses a remote direct memory access (RDMA) over converged Ethernet (RoCE) fabric, while the other uses an NVIDIA Quantum2 InfiniBand fabric. Both provide 400 Gbps endpoints, enabling seamless interconnectivity for large-scale training tasks.
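From a training-job perspective, both fabrics are typically reached through the same collective-communication layer. Below is a minimal, hypothetical sketch of how a multi-node PyTorch job launched with torchrun might join a process group over NCCL, which speaks both InfiniBand and RoCE; the interface and HCA names are illustrative assumptions, not Meta’s actual configuration.

```python
import os
import torch
import torch.distributed as dist

# Hint NCCL at the high-speed fabric. "mlx5_0" and "eth0" are placeholder
# device names, not Meta's real settings.
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # bootstrap/fallback interface

def init_distributed() -> None:
    """Join the job's process group; rank and world size come from torchrun's env vars."""
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

if __name__ == "__main__":
    init_distributed()
    # All-reduce a tensor across every GPU as a simple fabric connectivity check.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()} sees world size {int(t.item())}")
```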
Notably, the company’s AI Research SuperCluster (RSC), introduced in 2022 with 16,000 NVIDIA A100 GPUs, has been pivotal in advancing open and responsible AI research and in developing advanced AI models such as Llama and Llama 2.
Through meticulous co-design of its network, software, and model architecture, Meta has harnessed both the RoCE and InfiniBand clusters while mitigating network bottlenecks in large-scale AI workloads. This includes an ongoing Llama 3 training run on Meta’s RoCE cluster, demonstrating the infrastructure’s effectiveness for advanced AI training tasks.
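One standard software-side technique for hiding network latency, which the sketch below illustrates, is overlapping gradient communication with computation. This is a hedged, generic example rather than Meta’s code: PyTorch’s DistributedDataParallel buckets gradients and launches their all-reduce while the backward pass is still running, keeping high-bandwidth links busy alongside the GPUs. The model shape and bucket size are arbitrary placeholders.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # launched via torchrun; env supplies rank/world size
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# A toy model standing in for a much larger transformer.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
# bucket_cap_mb groups gradients per all-reduce call; larger buckets mean fewer,
# bigger messages on the fabric (50 MB here is an illustrative choice).
ddp_model = DDP(model, bucket_cap_mb=50)

optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda")
loss = ddp_model(x).pow(2).mean()
loss.backward()   # gradient all-reduces overlap with the remaining backward computation
optimizer.step()
```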
By the end of 2024, Meta aims to expand its infrastructure footprint to include 350,000 NVIDIA H100 GPUs, as part of a broader compute portfolio with capacity equivalent to nearly 600,000 H100s.