Hosting Qwen on NVIDIA Blackwell GB200 NVL72
NVIDIA’s GB200 NVL72 rack architecture enables the deployment of trillion-parameter AI models, such as Qwen, by integrating high-density Blackwell GPUs with advanced interconnect systems.
The GB200 NVL72 is designed as an exascale computer contained within a single rack, providing the necessary memory and communication speeds to support high-performance computing (HPC) and large-scale AI workloads.
Hardware Architecture and Memory
The rack comprises 18 compute nodes, each containing two Arm-based NVIDIA Grace CPUs and four Blackwell GPUs, for a total of 36 CPUs and 72 GPUs.
To hold the weights and working data of trillion-parameter models, each GPU is equipped with 180 GB of High Bandwidth Memory (HBM).
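As a quick sanity check, the rack totals follow from simple multiplication. The sketch below uses only the counts stated in this section (18 nodes, two CPUs and four GPUs per node, 180 GB of HBM per GPU). The variable names are illustrative, and the decimal GB-to-TB conversion is an assumption rather than something the source specifies.

```python
# Rack totals derived from the per-node figures quoted in the text.
NODES = 18
CPUS_PER_NODE = 2
GPUS_PER_NODE = 4
HBM_PER_GPU_GB = 180

total_cpus = NODES * CPUS_PER_NODE
total_gpus = NODES * GPUS_PER_NODE
total_hbm_gb = total_gpus * HBM_PER_GPU_GB
total_hbm_tb = total_hbm_gb / 1000  # assumption: decimal units (1 TB = 1000 GB)

print(f"{total_cpus} CPUs, {total_gpus} GPUs, {total_hbm_tb:.2f} TB of HBM")
```

On these figures the rack holds just under 13 TB of HBM across its 72 GPUs.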
Interconnect and Communication Bandwidth
The 72 GPUs within the rack are interconnected via NVLink and 18 NVLink Switch ASICs. This configuration provides 1,800 GB/s (1.8 TB/s) of bandwidth between any two GPUs in the rack.

According to NVIDIA, the NVLink Switch System provides a total of 130 terabytes per second (TB/s) of low-latency GPU communications.
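The 130 TB/s total is consistent with the per-GPU figure if the aggregate is read as the sum of each GPU's NVLink bandwidth. That reading is an assumption, not something the source states, and the names below are illustrative. A minimal check:

```python
# Aggregate NVLink bandwidth, assuming it is the sum of per-GPU bandwidth.
GPU_COUNT = 72
NVLINK_GB_PER_S_PER_GPU = 1800  # per-GPU figure quoted in the text

aggregate_tb_per_s = GPU_COUNT * NVLINK_GB_PER_S_PER_GPU / 1000
print(f"{aggregate_tb_per_s:.1f} TB/s")
```

The result, 129.6 TB/s, rounds to the 130 TB/s that NVIDIA quotes.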
The system also includes ConnectX-7 InfiniBand adapters to meet the deployment's networking requirements.
Capacity for Large-Scale AI Models
The combination of the Blackwell GPU architecture and the NVLink interconnect provides the memory capacity and throughput required to host models with a trillion parameters.
By consolidating these resources into a single rack, the GB200 NVL72 reduces the complexity of deploying the most demanding AI models while maintaining the communication speeds necessary for efficient inference and training.
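To make the capacity claim concrete, the sketch below estimates the memory taken by the weights of a one-trillion-parameter model and compares it with the rack's total HBM (72 GPUs at 180 GB each). The precisions and bytes-per-parameter values are illustrative assumptions, and the estimate covers weights only: KV cache, activations, optimizer state, and framework overhead are all ignored.

```python
# Weights-only footprint of a trillion-parameter model vs. rack HBM.
PARAMS = 1_000_000_000_000   # one trillion parameters
RACK_HBM_GB = 72 * 180       # GPU count x HBM per GPU, from the text

# Assumed storage cost per parameter at common precisions.
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weight_footprint_gb(params: int, bytes_per_param: float) -> float:
    """Return the size of the model weights in decimal gigabytes."""
    return params * bytes_per_param / 1e9


for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_footprint_gb(PARAMS, nbytes)
    share = gb / RACK_HBM_GB
    print(f"{precision}: {gb:,.0f} GB of weights, {share:.1%} of rack HBM")
```

Under these assumptions, 16-bit weights occupy about 2 TB, roughly 15% of the rack's HBM, which leaves headroom for the runtime state that the estimate omits.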
