Earlier this week, Nvidia surprise-announced their new Vera Rubin architecture (no relation to the recently unveiled telescope) at the Consumer Electronics Show in Las Vegas. The new platform, set to reach customers later this year, is advertised to offer a ten-fold reduction in inference costs and a four-fold reduction in how many GPUs it would take to train certain models, as compared to Nvidia’s Blackwell architecture.
The usual suspect for improved performance is the GPU. Indeed, the new Rubin GPU boasts 50 quadrillion floating-point operations per second (petaFLOPS) of 4-bit computation, as compared to 10 petaFLOPS on Blackwell, at least for transformer-based inference workloads like large language models.
However, focusing on just the GPU misses the bigger picture. There are a total of six new chips in the Vera-Rubin-based computers: the Vera CPU, the Rubin GPU, and four distinct networking chips. To achieve performance advantages, the components have to work in concert, says Gilad Shainer, senior vice president of networking at Nvidia.
“The same unit connected in a different way will deliver a fully different level of performance,” Shainer says. “That’s why we call it extreme co-design.”
Expanded “in-network compute”
AI workloads, both training and inference, run on large numbers of GPUs together. “Two years back, inferencing was mainly run on a single GPU, a single box, a single server,” Shainer says. “Right now, inferencing is becoming distributed, and it’s not just in a rack. It’s going to go across racks.”
To accommodate these hugely distributed tasks, as many GPUs as possible need to effectively work as one. This is the aim of the so-called scale-up network: the connection of GPUs within a single rack. Nvidia handles this connection with their NVLink networking chip. The new line includes the NVLink6 switch, with double the bandwidth of the previous version (3,600 gigabytes per second for GPU-to-GPU connections, as compared to 1,800 GB/s for the NVLink5 switch).
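To put the bandwidth doubling in concrete terms, here is a rough back-of-the-envelope sketch in Python. It uses only the two GPU-to-GPU bandwidth figures quoted above. The 9-gigabyte payload is a hypothetical number picked purely for illustration, and the calculation assumes an idealized link with no latency, protocol overhead, or contention, so it gives a lower bound rather than a prediction of real-world performance.

```python
# GPU-to-GPU bandwidth figures quoted in the article, in gigabytes per second.
NVLINK5_GB_PER_S = 1_800
NVLINK6_GB_PER_S = 3_600


def ideal_transfer_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized time in milliseconds to move payload_gb at the given bandwidth.

    Ignores latency, protocol overhead, and contention, so a real transfer
    would take longer than this lower bound.
    """
    return payload_gb / bandwidth_gb_per_s * 1_000


# Hypothetical payload size, chosen only for illustration (not from Nvidia).
payload_gb = 9.0

t5 = ideal_transfer_ms(payload_gb, NVLINK5_GB_PER_S)
t6 = ideal_transfer_ms(payload_gb, NVLINK6_GB_PER_S)
print(f"NVLink5: {t5:.1f} ms, NVLink6: {t6:.1f} ms, speedup: {t5 / t6:.1f}x")
```

Under these simplifying assumptions the transfer time halves regardless of payload size, which is all that a doubling of bandwidth can promise on its own.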
In addition to the bandwidth doubling, the scale-up chips also include double the number of SerDes, or serializer/deserializers (which allow data to be sent across fewer wires), and an expanded number of calculations that can be done within the network.
“The scale-up network is not really the network itself,” Shainer says. “It’s computing infrastructure, and some of the computing operations ar
