AI Inference: The Race for Tokens Per Watt & Goodput Efficiency
AI datacenters are increasingly viewed as “factories,” converting power into tokens – the fundamental units of output in generative AI. But maximizing efficiency in these facilities isn’t simply about adding more processing power. It’s a complex equation balancing token throughput, user experience, and cost, a dynamic explored in recent analysis by SemiAnalysis’s InferenceX and detailed in reporting by The Register.
The core challenge, as Nvidia CEO Jensen Huang recently stated, is that “inference tokens per watt translates directly to the revenues of the CSPs” (cloud service providers). Generating enough tokens to cover infrastructure, power, and operational costs is paramount. However, scaling inference isn’t a straightforward process. As Dave Salvator, director of accelerated computing products at Nvidia, points out, “It’s not one size fits all in terms of the answer. There’s different SLAs, there’s different application types.”
This complexity introduces the concept of “goodput” – throughput that actually meets service-level targets, such as time to first token or per-user generation rate, rather than raw tokens emitted. Organizations must optimize for both total token throughput per megawatt and user interactivity. The ideal scenario, illustrated by the efficiency Pareto curve from InferenceX, lies in maximizing both aspects simultaneously.
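The distinction matters because a cluster can post impressive raw throughput while most requests miss their latency targets. A minimal sketch, with illustrative request data and SLA thresholds (none of these figures come from InferenceX):

```python
# Sketch: "goodput" counts only tokens from requests that met the SLA.
# All figures below are illustrative, not benchmark data.

SLA_TTFT_S = 0.5          # max time to first token, seconds (assumed)
SLA_TOKENS_PER_S = 30.0   # min per-user generation rate (assumed)

requests = [
    # (time_to_first_token_s, tokens_per_s, tokens_generated)
    (0.2, 45.0, 800),   # meets SLA
    (0.4, 32.0, 600),   # meets SLA
    (1.1, 60.0, 900),   # fast decode, but first token arrived too late
    (0.3, 12.0, 700),   # responsive, but generation rate too slow
]

throughput = sum(tokens for _, _, tokens in requests)
goodput = sum(
    tokens
    for ttft, rate, tokens in requests
    if ttft <= SLA_TTFT_S and rate >= SLA_TOKENS_PER_S
)

print(f"raw throughput: {throughput} tokens")  # all tokens generated
print(f"goodput:        {goodput} tokens")     # only SLA-compliant tokens
```

Here less than half of the generated tokens count as goodput, even though every request finished.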
Not All Tokens Are Created Equal
Maximizing token throughput at the expense of user experience is a readily achievable, yet ultimately undesirable, outcome. The key lies in finding the right balance. As Salvator explained to The Register, the optimal approach depends heavily on specific service-level agreements (SLAs) and application requirements.
InferenceX’s benchmark data reveals a trade-off. Chips can be configured for high throughput, generating over 3.5 million tokens per second per megawatt, but at the cost of interactivity. These tokens are cheap to produce but slow to deliver – the “city bus” of token serving. Conversely, prioritizing interactivity reduces throughput, resulting in more expensive, “premium” tokens.
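A back-of-envelope calculation shows why throughput-optimized tokens are cheap. Taking the 3.5 million tokens per second per megawatt figure and an assumed electricity price (the $0.08/kWh below is our assumption, not from the article), the power cost per million tokens is a fraction of a cent:

```python
# Back-of-envelope: what 3.5M tokens/s per MW implies for the power
# cost of a million tokens. The electricity price is an assumption.

tokens_per_s_per_mw = 3.5e6
power_price_per_kwh = 0.08              # USD, assumed

tokens_per_mwh = tokens_per_s_per_mw * 3600   # tokens per MW-hour
cost_per_mwh = power_price_per_kwh * 1000     # 1 MWh = 1000 kWh

cost_per_million_tokens = cost_per_mwh / (tokens_per_mwh / 1e6)
print(f"${cost_per_million_tokens:.4f} per million tokens (power only)")
```

Power is only one input, of course – hardware depreciation, networking, and staffing dominate the full cost picture – but it illustrates how directly tokens-per-watt flows into unit economics.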
The sweet spot, dubbed the “Goldilocks zone,” offers a balance between throughput and interactivity, providing cost-effective performance. Achieving this balance requires careful consideration of both hardware and software.
Software’s Critical Role
Hardware alone isn’t sufficient. The choice of inference serving framework significantly impacts performance. Frameworks like vLLM, SGLang, and TensorRT-LLM exhibit varying levels of efficiency depending on the model in question. Nvidia is actively addressing this through its inference microservices (NIMs), aiming to simplify deployment and optimize performance.
Recent data from InferenceX demonstrates the impact of software optimization. TensorRT-LLM, running on Nvidia’s B200 GPUs, significantly outperforms SGLang when serving models like DeepSeek R1. However, open-source inference engines remain valuable to hyperscalers and model houses due to their customizability.
Disaggregated Compute and Emerging Architectures
Further gains in efficiency are being realized through disaggregated compute frameworks like Nvidia’s Dynamo and AMD’s MoRI. These frameworks distribute workloads across a pool of GPUs, separating compute-intensive prefill (prompt processing) from bandwidth-limited decode (token generation). The optimal ratio of prefill to decode GPUs varies depending on the application, with latency-sensitive applications like code assistants requiring a different configuration than high-throughput scenarios.
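The sizing intuition can be sketched with simple arithmetic: each stage needs enough GPUs to keep up with the target request rate, and the prompt-to-output length ratio of the workload determines how the pool splits. All per-GPU throughput figures below are hypothetical, chosen only to show the shape of the calculation:

```python
# Sketch: sizing a disaggregated inference pool. Prefill is
# compute-bound, decode is bandwidth-bound, so their per-GPU
# throughputs differ by an order of magnitude. Figures are
# hypothetical, not from Dynamo or MoRI documentation.
import math

prompt_tokens_per_req = 2000     # average prompt length (assumed)
output_tokens_per_req = 500      # average generation length (assumed)
req_per_s_target = 100.0

prefill_tok_per_s_per_gpu = 40_000.0   # compute-bound stage (assumed)
decode_tok_per_s_per_gpu = 4_000.0     # bandwidth-bound stage (assumed)

prefill_gpus = math.ceil(req_per_s_target * prompt_tokens_per_req
                         / prefill_tok_per_s_per_gpu)
decode_gpus = math.ceil(req_per_s_target * output_tokens_per_req
                        / decode_tok_per_s_per_gpu)

print(f"prefill GPUs: {prefill_gpus}, decode GPUs: {decode_gpus}")
```

Shift the workload toward long prompts and short answers (a code assistant summarizing a repository, say) and the balance tips toward prefill; long free-form generation tips it toward decode – which is why no single fixed ratio suits every application.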
Techniques like multi-token prediction, a form of speculative decoding, further enhance efficiency by moving the Pareto curve upwards and to the right. The shift towards Mixture of Experts (MoE) model architectures, which utilize subsets of the entire model, is driving a move towards larger, rack-scale architectures like Nvidia’s NVL72, AMD’s Helios, and AWS’ Trainium3. These architectures feature high-speed interconnects to reduce latency and boost throughput.
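The gain from speculative techniques can be estimated with a standard back-of-envelope model: if a cheap draft proposes k tokens per step and the target model accepts each with probability a, the expected number of tokens committed per (expensive) verification step follows a geometric sum. This is the textbook analysis of speculative decoding in general, not a description of any vendor's specific multi-token-prediction implementation:

```python
# Sketch: expected tokens committed per verification step in
# speculative decoding, assuming independent per-token acceptance
# probability a and k drafted tokens per step. Values of a and k
# are illustrative.

def expected_tokens_per_step(a: float, k: int) -> float:
    # Geometric sum: 1 + a + a^2 + ... + a^k
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.8):
    for k in (3, 5):
        print(f"acceptance={a}, draft length={k}: "
              f"{expected_tokens_per_step(a, k):.2f} tokens/step")
```

Higher acceptance rates push the curve up and to the right, which is exactly the Pareto-shifting effect described above: more tokens per unit of expensive compute without sacrificing interactivity.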
While Nvidia currently leads in rack-scale platforms, AMD is poised to enter the market with its MI455X-based Helios systems in the second half of the year, promising comparable performance to Nvidia’s next-generation Vera-Rubin racks.
The Race to Lower Precision
The pursuit of efficiency extends to data precision. Lower precision formats, such as FP4, require less memory capacity, bandwidth, and compute. While traditionally FP8 and FP16 have been the standards, models are increasingly adopting FP4, provided optimized kernels are available. However, reducing precision can impact accuracy, a trade-off that must be carefully managed.
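The memory arithmetic is straightforward: halving the bits per weight halves the footprint, which in turn halves the memory bandwidth needed to stream weights during decode. A quick illustration for a hypothetical 70-billion-parameter model (weights only; activations and KV cache excluded):

```python
# Sketch: weight-memory footprint of a 70B-parameter model at common
# precisions. Weights only -- activations and KV cache are excluded,
# and the model size is chosen purely for illustration.

params = 70e9
bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    gib = params * nbytes / 2**30
    print(f"{fmt}: {gib:,.1f} GiB of weights")
```

At FP16 the weights alone overflow a single 80 GB accelerator; at FP4 they fit comfortably, with bandwidth savings to match – which is why optimized low-precision kernels matter so much for decode throughput.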
AMD and Nvidia’s latest accelerators employ clever mathematical techniques to mitigate accuracy loss in FP4, expanding the range of representable values. This is a key area of ongoing development, with both companies continually optimizing their hardware and software stacks.
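One widely used idea in this space is block scaling, as in microscaling (MX-style) formats: each small block of values shares one scale factor, so a 4-bit code can span a far wider dynamic range than raw FP4 alone. The sketch below uses signed integer codes for simplicity and is an illustration of the principle, not either vendor's actual scheme:

```python
# Simplified sketch of block scaling in the spirit of microscaling
# formats: a shared per-block scale stretches a narrow 4-bit code
# range over the block's actual dynamic range. Illustrative only --
# not AMD's or Nvidia's exact FP4 scheme.

def quantize_block(values, levels=7):
    """Map a block to signed integer codes in [-levels, levels]
    plus one shared scale factor."""
    scale = max(abs(v) for v in values) / levels or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    return [c * scale for c in codes]

block = [0.02, -0.5, 3.1, -0.07]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print("codes:", codes, "scale:", round(scale, 4))
print("restored:", [round(v, 4) for v in restored])
```

Note the trade-off the article describes: the largest value is reconstructed exactly, but small values near zero collapse to the same code – the accuracy loss that kernel and scaling-granularity choices try to keep manageable.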
A Commodity Market
For inference providers serving open-weight models, tokens are becoming a commodity. The competitive landscape is driving a “race to the bottom,” where providers strive to offer the most desirable models, the highest quality tokens, or the fastest tokens at the lowest cost. Some, like Cerebras, are leveraging unique hardware architectures to deliver low-latency tokens, securing contracts with companies like OpenAI. Others, like Fireworks, are focusing on customization tools to enable customers to tailor models to their specific applications.
As open-weight models continue to converge in quality with closed models, customization becomes increasingly appealing. However, even fine-tuned model serving is becoming commoditized, forcing smaller providers to differentiate themselves through constant optimization and innovation. The state of AI, as Salvator put it, is “very much a moving target,” requiring continuous adaptation and improvement in both software and hardware.