
Maia 200: AI Accelerator for Inference


Today, we’re proud to introduce Maia 200, a breakthrough inference accelerator engineered to dramatically improve the economics of AI token generation. Maia 200 is an AI inference powerhouse: an accelerator built on TSMC’s 3nm process with native FP8/FP4 tensor cores, a redesigned memory system with 216GB of HBM3e at 7 TB/s and 272MB of on-chip SRAM, plus data movement engines that keep massive models fed, fast and highly utilized. This makes Maia 200 the most performant first-party silicon from any hyperscaler, with three times the FP4 performance of the third-generation Amazon Trainium and FP8 performance above Google’s seventh-generation TPU. Maia 200 is also the most efficient inference system Microsoft has ever deployed, with 30% better performance per dollar than the latest generation of hardware in our fleet today.

Maia 200 is part of our heterogeneous AI infrastructure and will serve multiple models, including the latest GPT-5.2 models from OpenAI, bringing a performance-per-dollar advantage to Microsoft Foundry and Microsoft 365 Copilot. The Microsoft superintelligence team will use Maia 200 for synthetic data generation and reinforcement learning to improve next-generation in-house models. For synthetic data pipelines, Maia 200’s design accelerates the rate at which high-quality, domain-specific data can be generated and filtered, feeding downstream training with fresher, more targeted signals.
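To make the generate-then-filter pattern concrete, here is a minimal sketch of such a pipeline. The `generate` and `quality_score` callables are hypothetical placeholders standing in for a model served on Maia 200 and a domain-specific filter; nothing here is Maia SDK API.

```python
# Sketch of the generate-and-filter loop described above.
# Assumption: `generate` and `quality_score` are hypothetical stand-ins
# for a served inference model and a domain-specific quality filter.
from typing import Callable, Iterable, Iterator

def synthetic_pipeline(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    quality_score: Callable[[str], float],
    threshold: float = 0.8,
) -> Iterator[str]:
    """Yield only generations that pass the quality filter."""
    for prompt in prompts:
        sample = generate(prompt)            # inference-heavy step
        if quality_score(sample) >= threshold:
            yield sample                     # fresher, targeted training signal
```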

Maia 200 is deployed in our US Central datacenter region near Des Moines, Iowa, with the US West 3 datacenter region near Phoenix, Arizona, coming next and further regions to follow. Maia 200 integrates seamlessly with Azure, and we are previewing the Maia SDK with a complete set of tools to build and optimize models for Maia 200. It includes PyTorch integration, a Triton compiler and optimized kernel library, and access to Maia’s low-level programming language. This gives developers fine-grained control when needed while enabling easy model porting across heterogeneous hardware accelerators.
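To illustrate the portability the Triton path enables, here is a minimal, hardware-agnostic Triton kernel (element-wise add). This is standard Triton code, not Maia-specific; the assumption is that the Maia SDK’s Triton compiler accepts ordinary kernels like this and retargets them to Maia 200.

```python
# A minimal, hardware-agnostic Triton kernel (element-wise add).
# Assumption: the Maia SDK's Triton compiler consumes standard Triton
# source like this; nothing below is Maia-specific API.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```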


Engineered for AI inference

Fabricated on TSMC’s cutting-edge 3-nanometer process, each Maia 200 chip contains over 140 billion transistors and is tailored for large-scale AI workloads while delivering efficient performance per dollar. Maia 200 is designed for the latest models using low-precision compute: each chip delivers over 10 petaFLOPS in 4-bit precision (FP4) and over 5 petaFLOPS of 8-bit (FP8) performance, all within a 750W SoC TDP envelope. In practical terms, Maia 200 can run today’s largest models with headroom for even bigger models in the future.
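For a rough sense of what those figures imply for token generation, consider a back-of-envelope bound: batch-1 autoregressive decode must stream the full weight set from HBM once per token, so memory bandwidth caps tokens per second. A minimal sketch using the specs quoted above and a hypothetical 400B-parameter FP4 model (the model size is an illustrative assumption, not a figure from this announcement):

```python
# Back-of-envelope decode ceiling from the quoted Maia 200 specs.
# Assumption: a hypothetical 400B-parameter model stored in FP4
# (0.5 bytes/parameter); real serving batches and overlaps work, so
# this is an upper bound on batch-1 decode speed, not a benchmark.
HBM_BANDWIDTH = 7e12        # bytes/s (7 TB/s, from the article)
PARAMS = 400e9              # hypothetical model size
BYTES_PER_PARAM = 0.5       # FP4 = 4 bits

weight_bytes = PARAMS * BYTES_PER_PARAM        # 200 GB, fits in 216 GB HBM
time_per_token = weight_bytes / HBM_BANDWIDTH  # one full weight sweep/token
print(f"~{1 / time_per_token:.0f} tokens/s ceiling at batch size 1")
# -> ~35 tokens/s
```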

Crucially, FLOPS aren’t the only ingredient for faster AI; keeping the compute units fed with data matters just as much. Maia 200 attacks this bottleneck with a redesigned memory subsystem centered on narrow-precision datatypes, a specialized DMA engine, on-die SRAM and a specialized NoC fabric for high-bandwidth on-chip data movement.
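One way to see why the bandwidth investment matters: the ratio of peak compute to memory bandwidth sets the arithmetic intensity a kernel needs to be compute-bound rather than bandwidth-bound. A quick calculation with the quoted peak numbers (our arithmetic, not a figure from the announcement):

```python
# Roofline balance point implied by the quoted peak figures.
# Assumption: simple peak-ratio arithmetic; real kernels achieve lower
# effective numbers, and on-chip SRAM reuse shifts the picture.
PEAK_FP4_FLOPS = 10e15   # 10 petaFLOPS FP4 (from the article)
HBM_BANDWIDTH = 7e12     # 7 TB/s (from the article)

balance = PEAK_FP4_FLOPS / HBM_BANDWIDTH
print(f"~{balance:.0f} FLOPs per byte of HBM traffic to stay compute-bound")
# -> ~1429 FLOPs/byte; kernels below this are bandwidth-bound, which is
# why the data movement engines and 272 MB of on-chip SRAM matter.
```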

As a result of these investments, AI models were running on Maia 200 silicon within days of the first packaged parts arriving. Time from first silicon to first datacenter rack deployment was less than half that of comparable AI infrastructure programs. This end-to-end approach, from chip to software to datacenter, translates directly into higher utilization, faster time to production and sustained improvements in performance per dollar and per watt at cloud scale.

A view of the Maia 200 rack and the HXU cooling unit.

Sign up for the Maia SDK preview

The era of large-scale AI is just beginning, and infrastructure will define what’s possible. Our Maia AI accelerator program is designed to be multi-generational. As we deploy Maia 200 across our global infrastructure, we are already designing future generations, and we expect each one to set new benchmarks and deliver ever better performance and efficiency for the most important AI workloads.

Today, we’re inviting developers, AI startups and academics to begin exploring early model and workload optimization with the new Maia 200 software development kit (SDK). The SDK includes a Triton compiler, PyTorch support, low-level programming in NPL, and a Maia simulator and cost calculator for finding efficiencies earlier in the code lifecycle. Sign up for the preview here.
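The SDK’s PyTorch support suggests models can be ported with device-agnostic code. Here is a hypothetical sketch: the `"maia"` device string and the fallback logic are illustrative guesses, not documented Maia SDK API, and the snippet simply shows the device-agnostic style that out-of-tree PyTorch backends typically enable.

```python
# Hypothetical sketch of device-agnostic PyTorch porting.
# Assumption: the Maia SDK registers an out-of-tree PyTorch backend
# reachable via a device string (here "maia"); that string is an
# illustrative guess, not documented SDK API.
import torch
import torch.nn as nn

def pick_device() -> torch.device:
    # Fall back gracefully when the assumed Maia backend isn't present.
    try:
        return torch.device("maia")   # assumed Maia device string
    except RuntimeError:
        return torch.device("cpu")

device = pick_device()
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU()).to(device)
x = torch.randn(8, 4096, device=device)
with torch.inference_mode():
    y = model(x)                      # same code runs on any backend
print(y.shape)
```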

Get more photos, video and resources on our Maia 200 site, where you can read more details.

Scott Guthrie is responsible for hyperscale cloud computing solutions and services including Azure, Microsoft’s cloud computing platform, generative AI solutions, data platforms and cybersecurity. These platforms and services help organizations worldwide solve urgent challenges and drive long-term transformation.
