Microsoft Azure ND GB300 Inference: 1.1 Million Tokens/Sec
Azure GB300 Achieves Industry-First: Over 1 Million Tokens Per Second in AI Inference
Published November 4, 2024, at 05:56 AM PST
Microsoft announced on November 2, 2024, that its new Azure ND GB300 virtual machine has achieved a breakthrough in artificial intelligence (AI) inference performance, surpassing one million tokens per second. This marks a significant milestone, representing an industry first and demonstrating substantial improvements over previous generations of hardware. The performance was independently validated by Signal65, a performance-validation and benchmarking firm.
Signal65 Validation and Performance Gains
According to a blog post by Signal65, the Azure ND GB300 delivers a 27% improvement in inference performance over the previous NVIDIA GB200 generation, with only a 17% increase in power consumption. This efficiency gain is crucial for reducing operational costs and environmental impact.
Signal65 further reported that the GB300 offers nearly a 10x increase in inference performance over the NVIDIA H100 generation, coupled with a nearly 2.5x improvement in power efficiency when measured at the rack level. This substantial improvement positions Azure as a leader in providing high-performance, energy-efficient AI infrastructure.
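Taken together, the reported figures imply a net gain in performance per watt over the GB200. A quick back-of-the-envelope check, using only the percentages cited above (not measured power draws):

```python
# Back-of-the-envelope perf-per-watt estimate from the reported figures:
# 27% more inference performance at 17% more power (GB300 vs. GB200).
perf_ratio = 1.27   # GB300 throughput relative to GB200
power_ratio = 1.17  # GB300 power draw relative to GB200

perf_per_watt_gain = perf_ratio / power_ratio
print(f"Perf-per-watt vs GB200: {perf_per_watt_gain:.3f}x")  # ~1.085x
```

In other words, the headline numbers work out to roughly an 8–9% improvement in performance per watt over the prior generation, on top of the raw throughput gain.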
“This milestone is significant not just for breaking the one-million-token-per-second barrier and being an industry-first, but for doing so on a platform architected to meet the dynamic use and data governance needs of modern enterprises,” said Russ Fellows, VP of Labs at Signal65.
Understanding Tokens and AI Inference
In the context of large language models (LLMs), a “token” is a unit of text - it can be a word, part of a word, or even a single character. The number of tokens processed per second (tokens/s) is a key metric for evaluating the speed of AI inference. Higher tokens/s rates translate to faster response times for AI applications, such as chatbots, content generation tools, and code completion assistants.
AI inference is the process of using a trained AI model to make predictions or generate outputs based on new input data. Efficient inference is critical for deploying AI applications in real-world scenarios where low latency and high throughput are essential.
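To illustrate why the tokens/s metric matters, throughput and response time relate directly: at a fixed answer length, higher tokens/s means a shorter wait. A minimal sketch (the answer length below is hypothetical; the throughput figures mirror the rough per-rack estimates cited in this article):

```python
def generation_time(answer_tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate an answer at a given aggregate throughput."""
    return answer_tokens / tokens_per_second

# Hypothetical 500-token chatbot reply at H100-class (~100k tokens/s)
# vs. GB300-class (~1.1M tokens/s) aggregate rack throughput.
print(generation_time(500, 100_000))    # 0.005 s
print(generation_time(500, 1_100_000))  # ~0.00045 s
```

Note that these are aggregate rack-level rates shared across many concurrent requests, so the latency an individual user sees will be higher in practice; the sketch only illustrates how the metric scales.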
Azure ND GB300: Key Specifications
While detailed specifications are still emerging, the Azure ND GB300 VMs are built around NVIDIA GB300 GPUs. These GPUs feature significant architectural improvements designed to accelerate AI workloads. Microsoft has optimized the Azure platform to fully leverage the capabilities of the GB300, resulting in the observed performance gains.
| Metric | Azure ND GB300 | NVIDIA GB200 | NVIDIA H100 |
|---|---|---|---|
| Inference Performance | > 1 Million tokens/s | ~740,000 Tokens/s (estimated) | ~100,000 Tokens/s (estimated) |
| Performance Improvement (vs GB200) | 27% | – | – |
| Performance Improvement (vs H100) | ~10x | – | – |