NVIDIA is significantly boosting the performance and cost-efficiency of AI inference with its Blackwell Ultra platform, particularly for the rapidly growing field of agentic AI. New data reveals substantial gains over the previous generation, Hopper, and even the recently released GB200 NVL72 systems.
The surge in demand for AI agents and coding assistants is driving a dramatic increase in programming-related AI queries. According to OpenRouter’s State of Inference report, the share of these queries has jumped from 11% to approximately 50% over the past year. These applications demand both low latency for real-time responsiveness and the ability to process vast amounts of data, such as entire codebases, in order to reason effectively.
Breakthrough Performance with GB300 NVL72
NVIDIA’s GB300 NVL72 systems, powered by the Blackwell Ultra GPU, are at the heart of these improvements. Analysis indicates a 50x increase in throughput per megawatt compared to the Hopper platform, translating to a 35x reduction in cost per token. This leap in efficiency is the result of a holistic approach to design, encompassing chip architecture, system-level innovations, and software optimizations.
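To see why throughput per megawatt translates so directly into token economics, consider a simplified, back-of-the-envelope decomposition (an assumed illustration, not taken from NVIDIA’s analysis) of the energy portion of inference cost:

$$
\text{energy cost per token} \;=\; \frac{\text{electricity price per MWh}}{\text{tokens generated per MWh}} \;\propto\; \frac{1}{\text{throughput per megawatt}}
$$

Under this view, a 50x gain in tokens per megawatt cuts the energy component of cost per token by 50x; the overall 35x figure presumably also folds in cost components, such as hardware amortization, that do not scale with power efficiency.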
The GB200 NVL72, already a significant step forward, delivered more than 10x more tokens per watt than Hopper, reducing the cost per token to one-tenth. The GB300 NVL72 builds on this foundation, with continuous improvements to software like NVIDIA TensorRT-LLM, NVIDIA Dynamo, Mooncake, and SGLang further enhancing throughput for mixture-of-experts (MoE) inference across various latency requirements. For example, improvements to the TensorRT-LLM library have yielded a 5x performance boost on GB200 for low-latency workloads in just four months.
Key to this performance gain are several technical advancements: higher-performance GPU kernels optimized for efficiency and low latency, NVIDIA NVLink Symmetric Memory, which enables direct GPU-to-GPU memory access, and programmatic dependent launch (PDL), which minimizes GPU idle time by overlapping a kernel's setup with the completion of the kernel that precedes it.
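As a rough illustration of that last item, the sketch below shows how programmatic dependent launch is expressed in CUDA: a producer kernel signals early completion, and a dependent kernel launched with the PDL attribute can begin its setup before the producer fully exits. The kernel names, sizes, and workload here are illustrative placeholders, not code from NVIDIA's inference libraries; PDL itself requires Hopper-class GPUs (sm_90) or newer.

```cuda
// Minimal programmatic dependent launch (PDL) sketch.
// Build with: nvcc -arch=sm_90 pdl_example.cu
#include <cuda_runtime.h>
#include <cstdio>

__global__ void producerKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = static_cast<float>(i);

    // Signal that the dependent kernel may start before this grid fully exits.
    cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void consumerKernel(float* data, int n) {
    // Setup work that does not depend on the producer's output could run here,
    // overlapping with the tail of the producer kernel.

    // Wait for the producer grid before reading its results.
    cudaGridDependencySynchronize();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMalloc(&data, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    producerKernel<<<grid, block>>>(data, n);

    // Launch the consumer with the PDL attribute so its prologue can overlap
    // with the end of the producer on the same stream.
    cudaLaunchConfig_t config = {};
    config.gridDim = grid;
    config.blockDim = block;
    cudaLaunchAttribute attrs[1];
    attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attrs[0].val.programmaticStreamSerializationAllowed = 1;
    config.attrs = attrs;
    config.numAttrs = 1;
    cudaLaunchKernelEx(&config, consumerKernel, data, n);

    cudaDeviceSynchronize();
    cudaFree(data);
    printf("done\n");
    return 0;
}
```

In an inference engine, the dependent kernel's early phase would typically cover work, such as address calculation or loading weights, that does not depend on the previous kernel's output, which is what shrinks the idle gap between back-to-back kernels.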
Economic Advantages for Long-Context Workloads
While both GB200 NVL72 and GB300 NVL72 excel at delivering low latency, the benefits of GB300 NVL72 become particularly pronounced when dealing with long-context workloads. For tasks involving 128,000-token inputs and 8,000-token outputs – common in AI coding assistants analyzing extensive codebases – the GB300 NVL72 achieves up to 1.5x lower cost per token compared to the GB200 NVL72.
As AI agents process more code to improve understanding, computational demands increase. Blackwell Ultra addresses this with 1.5x higher NVFP4 compute performance and 2x faster attention processing, allowing agents to efficiently analyze entire codebases.
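A rough way to quantify that growth (a generic property of transformer self-attention, not a GB300-specific figure) is that the attention work in the prefill phase scales roughly quadratically with context length:

$$
\text{attention FLOPs} \;\approx\; O\!\left(n_{\text{ctx}}^{2} \cdot d_{\text{model}}\right)
$$

So expanding the context from a 16,000-token file to a 128,000-token codebase multiplies the attention workload by roughly (128,000 / 16,000)^2 = 64, which is why faster attention processing and higher NVFP4 throughput matter so much for long-context agentic workloads.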
Industry Adoption and Future Developments
Major cloud providers and AI innovators are already deploying NVIDIA GB200 NVL72 at scale and are now transitioning to GB300 NVL72 for production workloads. Microsoft, CoreWeave, and OCI are among those integrating GB300 NVL72 to support low-latency and long-context applications like agentic coding and coding assistants.
“As inference moves to the center of AI production, long-context performance and token efficiency become critical,” said Chen Goldberg, senior vice president of engineering at CoreWeave. “Grace Blackwell NVL72 addresses that challenge directly, and CoreWeave’s AI cloud, including CKS and SUNK, is designed to translate GB300 systems’ gains, building on the success of GB200, into predictable performance and cost efficiency. The result is better token economics and more usable inference for customers running workloads at scale.”
Looking ahead, NVIDIA’s Rubin platform promises even greater advancements. Combining six new chips into a single AI supercomputer, Rubin is projected to deliver up to 10x higher throughput per megawatt compared to Blackwell for MoE inference, reducing the cost per million tokens to one-tenth. Rubin is expected to require only one-fourth the number of GPUs compared to Blackwell for training large MoE models.
The NVIDIA Vera Rubin NVL72 system, built on the Rubin platform, represents the next generation of AI infrastructure, poised to unlock further performance and cost improvements as software optimizations continue to evolve.
