NVIDIA GH200 Super Server: the AI Trend, Upgraded Edition

At COMPUTEX 2023, NVIDIA announced the NVIDIA DGX GH200, which marks another breakthrough in GPU-accelerated computing, built to power the most demanding giant AI workloads. In addition to describing key aspects of the NVIDIA DGX GH200 architecture, this article discusses how to use NVIDIA Base Command to enable rapid deployment, speed user onboarding, and simplify system administration.
The unified memory programming model for GPUs has been the cornerstone of breakthroughs in complex accelerated computing applications over the past seven years. In 2016, NVIDIA introduced NVLink technology and, with CUDA 6, a unified memory programming model designed to increase the memory available to GPU-accelerated workloads.
Since then, the core of every DGX system has been a GPU complex on a baseboard interconnected with NVLink, where each GPU can access the memory of every other at NVLink speed. Many DGX systems with such GPU complexes are interconnected through high-speed networks to form larger supercomputers, such as the NVIDIA Selene supercomputer. However, an emerging class of trillion-parameter giant AI models either requires months of training or cannot be solved at all, even on today's best supercomputers.
To empower scientists who need an advanced platform capable of solving these extraordinary challenges, NVIDIA combined the NVIDIA Grace Hopper Superchip with the NVLink Switch System to integrate up to 256 GPUs in the NVIDIA DGX GH200 system. In the DGX GH200 system, the GPU shared memory programming model can access 144 TB of memory at high speed over NVLink.
Compared to a single NVIDIA DGX A100 320 GB system, the NVIDIA DGX GH200 provides nearly 500 times more memory to the GPU shared memory programming model via NVLink, forming one huge data-center-sized GPU. The NVIDIA DGX GH200 is the first supercomputer to break the 100 TB barrier for GPU-accessible memory over NVLink.
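The memory figures above can be sanity-checked with simple arithmetic. As a minimal sketch, assuming each Grace Hopper Superchip contributes roughly 480 GB of CPU LPDDR5X plus 96 GB of GPU HBM3 to the NVLink-addressable pool (a per-node breakdown not stated in this article):

```python
# Back-of-envelope check of the DGX GH200 memory figures quoted above.
# Assumption (not from this article): each Grace Hopper Superchip exposes
# 480 GB LPDDR5X (CPU) + 96 GB HBM3 (GPU) to the NVLink-addressable pool.
PER_SUPERCHIP_GB = 480 + 96   # assumed memory per superchip, in GB
NUM_GPUS = 256                # superchips in a DGX GH200, from the article

total_gb = PER_SUPERCHIP_GB * NUM_GPUS
total_tb = total_gb / 1024    # binary units (1 TB = 1024 GB)
print(total_tb)               # -> 144.0, matching the 144 TB in the article

# Ratio versus a single DGX A100 320 GB system
ratio = total_gb / 320
print(round(ratio))           # -> 461, i.e. "nearly 500 times"
```

The "nearly 500 times" claim is thus a rounded ratio: 144 TB of NVLink-addressable memory against the 320 GB of a single DGX A100 works out to roughly 460×.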
A800 PCIe single card
A800 NVLink 8-GPU module
A100 Supermicro NV server
A100 PCIe single card
H800 Supermicro NV server
H100 PCIe single card
H100 Supermicro NV server
GH200 super server