Both Anthropic and OpenAI have recently introduced “fast mode” options for their leading coding models, promising significantly increased interaction speeds. However, the approaches taken by the two companies are fundamentally different, reflecting distinct philosophies and technical capabilities. While both aim to reduce latency, the underlying mechanisms and resulting trade-offs vary considerably.
Anthropic’s fast mode, available for Claude Opus 4.6, achieves its speed gains through a technique akin to prioritizing individual users over maximizing overall throughput. The company offers up to 2.5x the tokens per second – around 170, compared to the standard 65 for Opus 4.6. This is accomplished through low-batch-size inference. Traditional AI inference relies on “batching” – grouping multiple user requests together to maximize GPU utilization – which introduces latency, since users must wait for a batch to fill before processing begins. Anthropic’s fast mode essentially eliminates this wait, processing requests as soon as they arrive, at the cost of overall throughput. The analogy is a bus system: giving each passenger a dedicated bus is fast, but it sharply reduces how many passengers the fleet can carry.
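To make that trade-off concrete, here is a toy model of decode-step timing – not Anthropic’s serving code, and the two timing constants are invented values chosen so that a batch of one lands near 170 tokens per second and a large batch near 65. The intuition it captures: reading the weights is a roughly fixed cost shared by everyone in the batch, so larger batches raise aggregate throughput while slowing each individual user.

```python
# Illustrative model of the batching trade-off in LLM decoding.
# The timing constants are invented for this sketch, not measurements
# of Opus 4.6 or any real deployment.

def step_time_ms(batch_size: int,
                 weight_read_ms: float = 5.5,
                 per_seq_compute_ms: float = 0.15) -> float:
    """One decode step: a fixed cost to stream the weights through the
    chip, shared by the whole batch, plus a small per-sequence cost."""
    return weight_read_ms + per_seq_compute_ms * batch_size

for batch in (1, 4, 16, 64):
    step = step_time_ms(batch)
    per_user_tps = 1000.0 / step        # tokens/sec seen by one user
    total_tps = per_user_tps * batch    # tokens/sec across the hardware
    print(f"batch={batch:>3}  per-user {per_user_tps:6.1f} tok/s  "
          f"aggregate {total_tps:7.1f} tok/s")
```

With these invented constants, a batch of one yields roughly 177 tokens per second for a single user, while a batch of 64 drops each user to about 66 tokens per second but delivers over 4,000 tokens per second in aggregate – the bus-system trade-off in miniature.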
OpenAI’s approach, implemented for GPT-5.3-Codex via a new model called GPT-5.3-Codex-Spark, is far more ambitious and technically complex. Spark generates over 1,000 tokens per second – roughly a 15x improvement over the standard 65 tokens per second of GPT-5.3-Codex. That speed comes at a price, however: Spark is not the full GPT-5.3-Codex model. It is a smaller, distilled version, optimized for speed but with reduced capabilities. According to reports, Spark is more error-prone and struggles with tasks – particularly tool calls – that the full GPT-5.3-Codex handles reliably.
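For readers unfamiliar with distillation, the sketch below shows the standard Hinton-style knowledge-distillation objective in PyTorch: a small “student” model is trained to match the softened output distribution of a large “teacher.” OpenAI has not published how Spark was trained, so the temperature, loss weighting, and tensor shapes here are generic textbook assumptions, not a description of Spark’s actual training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,   # [N, vocab]
                      teacher_logits: torch.Tensor,   # [N, vocab]
                      labels: torch.Tensor,           # [N]
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Weighted mix of (a) KL divergence between the student's and the
    teacher's temperature-softened token distributions and (b) ordinary
    cross-entropy against the ground-truth next tokens."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps the KL gradients on the same scale as the CE term.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```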
The key to OpenAI’s speedup lies in specialized hardware: Cerebras chips. Announced in January 2026, the partnership leverages Cerebras’ wafer-scale engine technology. Unlike conventional GPUs, which have limited on-chip memory and stream model weights from external memory, Cerebras chips are enormous – roughly 70 square inches versus about one square inch for a typical H100 die – and carry a massive amount of on-chip SRAM. This allows OpenAI to keep the entire model resident on the chip, eliminating the bottleneck of constantly streaming weights in from external memory. The result is a fifteenfold increase in inference speed.
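A rough calculation shows why keeping weights in on-chip SRAM changes the picture. During decoding, generating each token requires reading essentially all of the active model weights, so per-sequence speed is capped at roughly memory bandwidth divided by bytes of weights read per token. The model size and SRAM bandwidth below are illustrative assumptions – neither the size of GPT-5.3-Codex nor Cerebras’ deployment details are public – while the H100 figure is its published ~3.35 TB/s of HBM3 bandwidth.

```python
# Back-of-the-envelope decode ceiling: tokens/sec per sequence is roughly
# memory_bandwidth / bytes_of_weights_read_per_token.
# The 100B-parameter model size and the SRAM bandwidth are assumptions.

def max_tokens_per_sec(active_params_billion: float,
                       bytes_per_param: float,
                       bandwidth_tb_per_sec: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_per_sec * 1e12 / bytes_per_token

hbm_ceiling = max_tokens_per_sec(100, 1.0, 3.35)    # H100-class HBM3
sram_ceiling = max_tokens_per_sec(100, 1.0, 100.0)  # hypothetical on-chip SRAM
print(f"HBM-bound ceiling:  ~{hbm_ceiling:4.0f} tok/s per sequence")
print(f"SRAM-bound ceiling: ~{sram_ceiling:4.0f} tok/s per sequence")
```

Under these assumptions the HBM-bound ceiling sits at a few dozen tokens per second per sequence, while even a conservative SRAM figure clears 1,000 – the regime Spark is reported to operate in. Batching is how GPU deployments normally claw back throughput despite the per-sequence cap.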
The difference in approach highlights a fundamental trade-off. Anthropic prioritizes serving the full model, even at the cost of some speed. OpenAI is willing to serve a less capable model to achieve significantly faster performance. The choice reflects differing priorities and access to resources: Anthropic’s solution is comparatively straightforward to implement, while OpenAI’s requires substantial investment in specialized hardware and in model distillation.
The implications of these developments extend beyond simple speed improvements. While Anthropic’s fast mode is six times more expensive than standard inference, it provides access to the full Opus 4.6 model. OpenAI’s Spark, while faster, represents a step down in model quality. This raises questions about the value of speed versus accuracy and whether users are willing to trade capabilities for responsiveness.
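One way to frame the speed-versus-accuracy question is cost per successfully completed task rather than cost per token. The sketch below uses entirely made-up prices, token counts, and success rates – none of these figures are published – purely to show how a cheaper but less reliable model can end up costing more per solved task once retries are factored in.

```python
# Toy comparison: expected cost per *solved* task, not per token.
# All prices, token counts, and success rates are invented for illustration,
# and retries are assumed to be independent attempts.

def cost_per_success(price_per_mtok: float,
                     tokens_per_attempt: int,
                     success_rate: float) -> float:
    attempt_cost = price_per_mtok * tokens_per_attempt / 1e6
    return attempt_cost / success_rate   # expected attempts = 1 / success_rate

standard   = cost_per_success(25.0, 20_000, 0.90)   # full model, base price
fast_full  = cost_per_success(150.0, 20_000, 0.90)  # full model at 6x the price
fast_small = cost_per_success(25.0, 20_000, 0.60)   # smaller, less reliable model
print(f"standard:            ${standard:.2f} per solved task")
print(f"fast, full model:    ${fast_full:.2f} per solved task")
print(f"fast, smaller model: ${fast_small:.2f} per solved task")
```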
It’s important to note that OpenAI’s use of Cerebras chips isn’t simply about raw speed. The large SRAM capacity allows for a different kind of inference – one where the entire model resides in memory, eliminating the latency associated with data transfer. This is a significant architectural shift with potentially far-reaching consequences.
Whether fast AI inference will become a core feature of AI systems remains to be seen. The current implementations suggest that speed gains often come at the cost of accuracy or require substantial investment in specialized hardware. As one observer noted, the usefulness of AI agents is primarily determined by their reliability, not their speed. A faster model that makes more mistakes may ultimately be less valuable than a slower, more accurate one.
However, the emergence of these fast modes signals a growing focus on optimizing AI inference. As models continue to grow in size and complexity, reducing latency will become increasingly critical. The approaches taken by Anthropic and OpenAI represent two distinct paths towards this goal, each with its own strengths and weaknesses. The future of AI inference may well involve a combination of these techniques, tailored to specific applications and user needs.
