Meta Reveals Llama-3 Training Infrastructure: 49,000 H100 GPUs

Internet news · published 4 months ago · by tree

Tags: AI, Meta, Llama-3, GPU cluster, data storage

Summary


Meta, a leading social and technology company, has announced two new 24K H100 GPU clusters, totaling 49,152 GPUs, built specifically for training the large model Llama-3. The Llama-3 training setup uses RoCEv2 networking, is backed by Tectonic/Hammerspace NFS/FUSE network storage, and continues to use the PyTorch machine learning library. Llama-3 is expected to launch between the end of April and mid-May; it may be a multi-modal model and is expected to remain open-source.

Meta’s commitment to AI is evident in its significant investments, aiming to create AGI (Artificial General Intelligence) for the benefit of humanity. In January 2022, Meta revealed details of its AI Research SuperCluster (RSC) with 16,000 Nvidia A100 GPUs, which has played a crucial role in developing popular models like Llama and Llama 2, as well as advancements in computer vision, NLP, speech recognition, and image generation.

The new GPU clusters build on the success of the RSC; each contains 24,576 H100 GPUs and can support the training of more complex, higher-parameter models. Meta expects its infrastructure portfolio to include compute equivalent to 600,000 H100 GPUs by the end of 2024.
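The cluster sizes quoted above are easy to sanity-check. A quick Python sketch, using only figures from the article (the 600,000 figure is Meta's stated end-of-2024 target in H100-equivalents, not a measured quantity):

```python
# Back-of-the-envelope figures for the cluster sizes quoted above.
GPUS_PER_CLUSTER = 24_576      # each of the two new clusters
NUM_CLUSTERS = 2

total_h100 = GPUS_PER_CLUSTER * NUM_CLUSTERS
print(total_h100)              # 49152 H100s across both clusters

# Share of Meta's stated end-of-2024 target of ~600,000
# H100-equivalents that these two clusters represent.
target_equivalents = 600_000
share = total_h100 / target_equivalents
print(f"{share:.1%}")          # 8.2% of the planned fleet
```

So the two Llama-3 clusters account for under a tenth of the compute Meta plans to field by year's end.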

To handle the hundreds of trillions of AI model executions processed daily, Meta relies on an efficient and flexible network to keep its data centers running safely and stably. One cluster is built from Arista 7800 switches together with Wedge400 and Minipack2 OCP rack switches, forming a fabric based on RDMA over Converged Ethernet (RoCE). The other uses an NVIDIA Quantum2 InfiniBand fabric. Both interconnect endpoints at 400 Gbps.
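From the training job's point of view, the choice between the two fabrics is largely invisible: PyTorch's NCCL backend selects an RDMA transport (InfiniBand verbs, which also cover RoCE) or plain sockets at runtime, steered by well-known NCCL environment variables. A minimal sketch of that per-job configuration; the interface and HCA names below are placeholders, not values from the article:

```python
import os

# Hypothetical transport configuration for one training job. NCCL
# reads these environment variables at init time; the names here
# ("eth0", "mlx5") are placeholders that depend on the actual host.
os.environ["NCCL_IB_DISABLE"] = "0"          # allow RDMA (IB or RoCE)
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"    # placeholder NIC name
os.environ["NCCL_IB_HCA"] = "mlx5"           # placeholder HCA prefix

# With the environment set, the training script initializes the
# process group as usual and the fabric choice stays out of model code:
#   import torch.distributed as dist
#   dist.init_process_group(backend="nccl")

for var in ("NCCL_IB_DISABLE", "NCCL_SOCKET_IFNAME", "NCCL_IB_HCA"):
    print(var, "=", os.environ[var])
```

Keeping the fabric choice in environment configuration is what lets Meta run the same workload on both clusters and compare the interconnects directly.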

By utilizing these two different clusters, Meta can assess the suitability and scalability of various interconnects for large-scale training, providing valuable experience for the design and construction of even larger and more extensive clusters in the future. Meta has successfully used RoCE and InfiniBand clusters for large generative AI workloads, including the ongoing training of Llama-3, without any network bottlenecks.

The new clusters use Grand Teton, Meta’s internally designed open GPU hardware platform, which was first introduced on October 18, 2022. Grand Teton is built on multiple generations of AI systems, integrating power, control, computing, and structural interfaces into a single chassis for better overall performance, signal integrity, and thermal performance.

As large models increasingly become multi-modal, consuming vast amounts of image, video, audio, and text data, the demand for data storage has grown rapidly. Meta's new cluster storage deployment is served through a custom user-space Linux file system (FUSE) API backed by Meta's Tectonic distributed storage solution, which is optimized for flash media. This setup lets thousands of GPUs save and load checkpoints synchronously, a challenge for any storage solution, while also offering flexible, high-throughput byte-level storage for data loading.
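To see why synchronous checkpointing at this scale stresses storage, it helps to estimate how much data one checkpoint moves. A rough sketch; the 70B parameter count and the bf16-weights-plus-fp32-Adam-state layout are illustrative assumptions, not figures from the article:

```python
# Rough estimate of the data volume of one synchronous checkpoint.
# Model size and optimizer layout are illustrative assumptions.
params = 70e9                  # hypothetical 70B-parameter model
bytes_weights = params * 2     # bf16 weights: 2 bytes per parameter

# Adam-style optimizer state: fp32 master weights plus two fp32
# moment tensors, i.e. 3 x 4 bytes per parameter.
bytes_optimizer = params * 4 * 3

total_bytes = bytes_weights + bytes_optimizer
print(f"checkpoint ~ {total_bytes / 1e12:.2f} TB")   # 0.98 TB

# Sharded evenly across one 24,576-GPU cluster, each rank writes:
per_gpu = total_bytes / 24_576
print(f"per GPU ~ {per_gpu / 1e6:.1f} MB")           # 39.9 MB
```

Each rank's shard is small, but tens of thousands of ranks writing (and later re-reading) it at the same moment is the burst that Tectonic's flash-optimized design is meant to absorb.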

Meta has also collaborated with Hammerspace to develop, deploy, and utilize a parallel network file system (NFS) to meet the storage requirements of super AI clusters. Hammerspace allows engineers to interactively debug jobs with thousands of GPUs, as all nodes in the environment can immediately access code changes. Combining Meta’s Tectonic distributed storage solution with Hammerspace enables rapid feature iteration without compromising scale.

Source

Original article: click here to read the full text
Original author: AIGC开放社区
