Why Universal Transformers Keep Winning: Design Choices, Dead Ends, and ADRs in ML Research Code
In a reflective essay published on April 27, 2026, machine learning researcher Grigory Sapunov revisits his long-standing fascination with Universal Transformers, a recurrent variant of the Transformer architecture that has quietly shaped advances in artificial intelligence since its introduction in 2018. While not a traditional entertainment story, Sapunov's account offers a rare glimpse into the intersection of cutting-edge AI research and the creative processes that drive technological innovation, an area increasingly relevant to film, television, music, and gaming industries as they integrate AI tools into production workflows.
The Return to Universal Transformers
Sapunov, writing on his Gonzo ML Substack, traces his obsession with Universal Transformers back to a 2018 paper by Dehghani, Gouws, Kaiser, and their collaborators. The paper proposed a radical shift in how neural networks process information: instead of stacking distinct layers like traditional Transformers, Universal Transformers apply a single shared block recursively, refining token representations over multiple steps. This design introduced a form of adaptive computation, allowing the model to determine how much processing to allocate to each input—a concept Sapunov describes as “more like a step-machine than a feedforward function.”
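The core recurrence is simple to sketch. In the minimal example below, `shared_block` is a toy transformation standing in for a full self-attention-plus-feed-forward block; the function names, dimensions, and step count are illustrative, not taken from UTM-Jax:

```python
import numpy as np

def shared_block(x, W):
    # One shared "layer": a toy stand-in for self-attention + feed-forward.
    # The same weight matrix W is reused at every recurrent step.
    return np.tanh(x @ W)

def universal_transformer_encode(x, W, n_steps):
    # Apply the single shared block recursively: effective depth is
    # n_steps, but the parameter count is that of one block.
    for _ in range(n_steps):
        x = shared_block(x, W)
    return x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))    # 4 tokens, model dimension 8
W = rng.normal(size=(8, 8)) * 0.1   # parameters of the one shared block
out = universal_transformer_encode(tokens, W, n_steps=6)
print(out.shape)  # (4, 8): same shape in and out, refined 6 times
```

The key contrast with a standard Transformer is that depth here is a loop count rather than a stack of separately parameterized layers, which is what makes the number of refinement steps a runtime choice.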
The appeal of this approach, he argues, lies in its theoretical elegance. By decoupling model size from effective depth, Universal Transformers achieve greater parameter efficiency while maintaining the parallelism and global receptive field that made Transformers dominant in natural language processing. The framework also incorporates Adaptive Computation Time (ACT), a mechanism that dynamically halts processing for tokens that have reached sufficient refinement, further optimizing computational resources.
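The halting idea behind ACT can also be sketched: each token accumulates a halting probability per step, and once the cumulative value crosses a threshold the token is frozen while others keep refining. The halting function, `eps`, and the toy "refinement" update below are invented for illustration, not the paper's exact formulation:

```python
import numpy as np

def act_halting(tokens, halting_prob_fn, max_steps=8, eps=0.01):
    # ACT sketch: per-token cumulative halting probabilities.
    # A token stops receiving updates once its cumulative value
    # passes 1 - eps; remaining tokens continue up to max_steps.
    state = np.array(tokens, dtype=float)
    cum = np.zeros(len(state))
    steps_used = np.zeros(len(state), dtype=int)
    for _ in range(max_steps):
        active = cum < 1.0 - eps
        if not active.any():
            break
        p = halting_prob_fn(state)            # per-token halting probability
        cum[active] += p[active]
        steps_used[active] += 1
        state[active] = np.tanh(state[active] + 0.5)  # toy refinement step
    return state, steps_used

# Toy halting function: tokens with larger magnitude halt sooner.
state, steps = act_halting(
    [0.1, 2.0, -1.5],
    lambda s: 0.2 + 0.3 * np.abs(np.tanh(s)),
)
print(steps)  # different tokens spend different numbers of steps
```

This is the mechanism that lets the model spend more computation on "hard" tokens and less on "easy" ones, which is where the resource savings come from.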
From Theory to Practice: UTM-Jax
Sapunov’s latest project, UTM-Jax, is a JAX-based implementation of Universal Transformers that reflects his ongoing experimentation with the architecture. In his essay, he details the design choices behind the model, including the use of shared parameters across recurrent steps and the integration of dynamic halting mechanisms. While the technical specifics are dense, Sapunov frames the work as part of a broader resurgence of “UT-flavored ideas” in AI research, including looped transformers and recurrent-depth reasoning models.
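The parameter-efficiency argument for sharing weights across steps is easy to quantify. The count below uses a rough, standard approximation for one Transformer block (four attention projections plus a two-layer feed-forward network); the specific dimensions are illustrative and not UTM-Jax's configuration:

```python
def block_params(d_model, d_ff):
    # Rough parameter count for one Transformer block:
    # 4 attention projections (Q, K, V, output) + 2 feed-forward matrices.
    return 4 * d_model * d_model + 2 * d_model * d_ff

d_model, d_ff, depth = 512, 2048, 6

stacked = depth * block_params(d_model, d_ff)  # standard: distinct layers
shared = block_params(d_model, d_ff)           # UT: one block, reused

print(stacked, shared)  # the shared variant is `depth` times smaller
```

Decoupling model size from effective depth in this way is exactly the trade-off the essay highlights: the shared model reaches the same effective depth with a fraction of the parameters, at the cost of per-layer specialization.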

He acknowledges that Universal Transformers have not yet achieved the widespread adoption he once predicted. “In 2020, I assumed by 2024 we would all be running adaptive-depth models at scale,” he writes. “We aren’t quite—but the last year or two have brought a real resurgence.” This resurgence, he notes, is evident in recent papers on looped transformers and mechanistic analyses of recurrent architectures, suggesting that the core questions Universal Transformers sought to answer—such as how much computation a network should allocate per input—remain central to AI development.
Why It Matters for Entertainment
While Sapunov’s essay is rooted in technical research, its implications extend to the entertainment industry, where AI tools are increasingly used for tasks ranging from script analysis to visual effects and music composition. Universal Transformers’ ability to adaptively refine representations could prove valuable in creative applications where computational efficiency and flexibility are critical. For example, AI-driven animation or post-production tools might leverage such architectures to optimize rendering times without sacrificing quality, while music generation models could benefit from adaptive computation to produce more nuanced compositions.
The entertainment sector has already seen AI models influence content creation, from deepfake technology in film to AI-generated soundtracks in video games. Universal Transformers, with their recurrent structure and parameter efficiency, could further bridge the gap between raw computational power and creative expression. Sapunov’s work on UTM-Jax, though still in the research phase, hints at a future where AI models are not just tools but collaborative partners in the creative process, capable of dynamically adjusting their output based on the complexity of the task at hand.
Design Choices and Dead Ends
Sapunov’s essay also offers a candid look at the challenges of implementing Universal Transformers. He describes the project as a series of “design choices and dead ends,” emphasizing the iterative nature of AI research. One key decision was the use of JAX, a high-performance numerical computing library, which allowed for efficient experimentation with the model’s architecture. However, he also notes the difficulties in balancing theoretical innovation with practical implementation, particularly when scaling the model for real-world applications.
Another focus of his work is the use of Architecture Decision Records (ADRs), a documentation practice that captures the rationale behind key design choices. Sapunov argues that ADRs are essential for maintaining clarity in research code, especially in collaborative environments where multiple contributors may be working on the same project. “The companion arXiv paper is up,” he writes, “and a separate review post is coming with the experimental findings; here I want to talk about the architecture itself and the design decisions behind it.”
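For readers unfamiliar with the practice, an ADR is a short, numbered document recording one decision, its context, and its consequences. The example below follows the common Nygard-style template; its number, title, and content are invented here for illustration and do not come from the UTM-Jax repository:

```markdown
# ADR 0007: Share block weights across recurrent steps

## Status
Accepted

## Context
Stacking independently parameterized layers multiplies parameter count
with depth. Reusing one block makes effective depth a runtime choice
without changing model size.

## Decision
Use a single parameter set for the shared block, applied for up to
`max_steps` iterations with ACT-style halting.

## Consequences
Lower memory footprint; depth becomes tunable at inference. Per-layer
specialization is lost and must be recovered via step embeddings.
```

The value of the format is that the "Context" and "Consequences" sections survive long after the Slack thread or notebook where the decision was actually argued out has vanished.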
The Broader AI Landscape
Sapunov’s reflections on Universal Transformers arrive at a time when the AI research community is increasingly exploring alternatives to traditional Transformer architectures. While standard Transformers have dominated the field since their introduction in 2017, their limitations—particularly in terms of computational efficiency and adaptability—have prompted researchers to experiment with recurrent and looped variants. Universal Transformers represent one such alternative, offering a middle ground between the parallelism of Transformers and the sequential refinement of recurrent neural networks (RNNs).
The resurgence of interest in Universal Transformers aligns with broader trends in AI, including the development of mixture-of-experts (MoE) models and sparse routing techniques. These innovations aim to improve performance on complex tasks such as algorithmic reasoning and language modeling, areas that are increasingly relevant to entertainment applications. For instance, AI models capable of advanced reasoning could enhance interactive storytelling in video games or generate more coherent scripts for film and television.
What Comes Next
While Sapunov’s essay does not provide a roadmap for the future of Universal Transformers, it underscores the ongoing evolution of AI architectures. His work on UTM-Jax suggests that the model’s potential is far from exhausted, particularly as researchers continue to refine its adaptive computation mechanisms. For the entertainment industry, this could mean more sophisticated AI tools that seamlessly integrate into creative workflows, from pre-production to post-production and beyond.
As AI continues to permeate the entertainment sector, the lessons from Universal Transformers—such as the importance of adaptive computation and parameter efficiency—may prove invaluable. While the technology is still in the research phase, its principles could shape the next generation of AI-driven creative tools, offering new possibilities for filmmakers, musicians, game developers, and other artists.
For now, Sapunov’s essay serves as a reminder that the most impactful innovations often emerge from revisiting and refining existing ideas. “The question UT was asking—should the network choose how much computation to spend per input?—was always the right one,” he concludes. “We see it being asked again, under different names. It keeps pulling me back.”
