Llama-3's Training Infrastructure Revealed: 49,000 H100 GPUs in Use

Tags: AI, Meta, Llama-3, GPU clusters, data storage

Summary


Meta, a leading social media and technology company, has announced two new 24K H100 GPU clusters, 49,152 GPUs in total, built specifically for training the large model Llama-3. Llama-3 is trained over RoCEv2 networking and on NFS/FUSE network storage built from Meta's Tectonic file system and Hammerspace's NFS, and continues to use the PyTorch machine learning library. Llama-3 is expected to launch in late April or mid-May, may be a multi-modal model, and is expected to remain open source.

Meta’s commitment to AI is evident in its significant investments, aiming to create AGI (Artificial General Intelligence) for the benefit of humanity. In January 2022, Meta revealed details of its AI Research SuperCluster (RSC) with 16,000 Nvidia A100 GPUs, which has played a crucial role in developing popular models like Llama and Llama 2, as well as advancements in computer vision, NLP, speech recognition, and image generation.

The new GPU clusters build on the success of the RSC. Each contains 24,576 H100 GPUs and can support the training of larger, more complex models with higher parameter counts. Meta expects to hold compute equivalent to nearly 600,000 H100 GPUs by the end of 2024.

To handle the hundreds of trillions of AI model executions it processes daily, Meta relies on an efficient, flexible network to keep its data centers running safely and stably. One of the new clusters uses a Remote Direct Memory Access over Converged Ethernet (RoCE) network fabric built from Arista 7800 switches together with Wedge400 and Minipack2 OCP rack switches; the other uses an NVIDIA Quantum-2 InfiniBand fabric. Both interconnect endpoints at 400 Gbps.
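
Meta has not published its job-level configuration, but in a PyTorch stack the fabric choice mostly surfaces through NCCL settings. A minimal sketch, assuming a torchrun launch; the environment-variable values here are illustrative assumptions, not Meta's actual config:

```python
import os

import torch
import torch.distributed as dist

# Illustrative NCCL tuning. On a RoCE fabric NCCL still drives the NICs
# through the InfiniBand verbs API, so the NCCL_IB_* variables apply to
# both cluster types; these values are assumptions, not Meta's settings.
os.environ.setdefault("NCCL_IB_HCA", "mlx5")     # hypothetical HCA name prefix
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")  # RoCEv2 commonly sits at GID index 3

# torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for every worker.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# NCCL then selects the InfiniBand or RoCE transport automatically.
dist.init_process_group(backend="nccl")
```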

By operating these two different clusters, Meta can assess the suitability and scalability of each interconnect for large-scale training, gaining experience that will inform the design and construction of even larger clusters in the future. Meta has run large generative AI workloads on both the RoCE and InfiniBand clusters, including the ongoing training of Llama-3, without hitting network bottlenecks.
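
The article does not say how Meta compared the two fabrics; a common proxy for such comparisons is measured all-reduce bus bandwidth. A minimal PyTorch sketch of that kind of benchmark, assuming a torchrun launch (the tensor size and iteration counts are arbitrary choices, not Meta's methodology):

```python
import os
import time

import torch
import torch.distributed as dist

def allreduce_busbw(numel: int = 256 * 1024 * 1024, iters: int = 20) -> float:
    """Return measured all-reduce bus bandwidth in GB/s for a float16 tensor."""
    world = dist.get_world_size()
    x = torch.ones(numel, dtype=torch.float16, device="cuda")
    for _ in range(5):                       # warm-up to settle NCCL channels
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    per_iter = (time.perf_counter() - start) / iters
    payload = x.numel() * x.element_size()
    # A ring all-reduce moves 2 * (n - 1) / n of the payload per GPU.
    return payload * 2 * (world - 1) / world / per_iter / 1e9

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    bw = allreduce_busbw()
    if dist.get_rank() == 0:
        print(f"all-reduce bus bandwidth: {bw:.1f} GB/s")
    dist.destroy_process_group()
```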

The new clusters use Grand Teton, Meta's internally designed open GPU hardware platform, first introduced on October 18, 2022. Grand Teton builds on multiple generations of AI systems, integrating power, control, compute, and fabric interfaces into a single chassis for better overall performance, signal integrity, and thermal behavior.

As large models become increasingly multi-modal, consuming vast amounts of image, video, audio, and text data, demand for data storage has grown rapidly. Storage in the new clusters is served through a home-grown Linux Filesystem in Userspace (FUSE) API backed by Meta's Tectonic distributed storage solution, which is optimized for flash media. This solution enables thousands of GPUs to save and load checkpoints in a synchronized fashion, a challenge for any storage system, while also providing flexible, high-throughput, exabyte-scale storage for data loading.
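
Tectonic itself is not publicly available, but the synchronized-checkpoint load pattern it has to absorb is easy to picture. A minimal PyTorch sketch, assuming each rank writes its own shard to a shared mount (the /mnt/checkpoints path is hypothetical):

```python
import os

import torch
import torch.distributed as dist

CKPT_ROOT = "/mnt/checkpoints"  # hypothetical mount point for the shared file system

def save_sharded_checkpoint(model: torch.nn.Module, step: int) -> None:
    """Every rank writes its own shard at once, then all ranks synchronize.

    With thousands of ranks this creates a burst of simultaneous writes,
    which is the pattern the article says Tectonic is optimized to handle.
    """
    rank = dist.get_rank()
    shard = os.path.join(CKPT_ROOT, f"step{step:08d}", f"rank{rank:05d}.pt")
    os.makedirs(os.path.dirname(shard), exist_ok=True)
    torch.save(model.state_dict(), shard)
    dist.barrier()  # no rank resumes training until every shard is durable

def load_sharded_checkpoint(model: torch.nn.Module, step: int) -> None:
    """The mirror image: every rank reads back its own shard in parallel."""
    rank = dist.get_rank()
    shard = os.path.join(CKPT_ROOT, f"step{step:08d}", f"rank{rank:05d}.pt")
    model.load_state_dict(torch.load(shard, map_location="cuda"))
```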

Meta has also partnered with Hammerspace to develop, deploy, and operate a parallel network file system (NFS) that meets the storage requirements of these AI clusters. Hammerspace lets engineers interactively debug jobs running across thousands of GPUs, since every node in the environment sees code changes immediately. Combining Meta's Tectonic distributed storage with Hammerspace enables fast feature iteration without compromising on scale.
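
The article does not describe Hammerspace's internals; the property it leans on is the shared namespace, where a file written on one node is immediately visible to all. A minimal sketch of that property, assuming a hypothetical /mnt/shared mount present at the same path on every node:

```python
import os

import torch.distributed as dist

SHARED_ROOT = "/mnt/shared"  # hypothetical Hammerspace mount, same path on every node

def publish_note(step: int, text: str) -> str:
    """Rank 0 drops a file into the shared namespace; every rank reads it back.

    Because all nodes mount the same parallel-NFS namespace, the write is
    visible fleet-wide as soon as the barrier releases the readers, which is
    what makes interactive debugging across thousands of GPUs practical.
    """
    path = os.path.join(SHARED_ROOT, f"debug_note_{step:08d}.txt")
    if dist.get_rank() == 0:
        with open(path, "w") as f:
            f.write(text)
    dist.barrier()  # ensure the file exists before any rank reads it
    with open(path) as f:
        return f.read()
```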

Source


Original article: click here to read the full text
Original author: AIGC开放社区
