DeepMind: Video Models Like LLMs for Visual Tasks

by Lisa Park - Tech Editor September 29, 2025


Here is a breakdown of the key findings, focusing on Veo 3 and its capabilities:

Key Points about Veo 3:

* “Chain-of-Frames” Reasoning: Veo 3 uses a process called “chain-of-frames,” a visual counterpart to the “chain-of-thought” reasoning used in large language models (LLMs). This suggests the model is not just seeing but reasoning about what it sees.
* Visual Prompting Matters: How prompts are designed and visually presented considerably affects Veo 3’s performance. Factors such as background color (a green background improves segmentation) and prompt phrasing can change the outcome.
* LLM Assistance: An LLM is used as a prompt rewriter for certain tasks. In some cases (such as Sudoku), the LLM may be doing the actual solving rather than the video model.
* Beyond LLM Capabilities: Crucially, on core visual reasoning tasks (robot navigation, maze solving, symmetry detection), Gemini 2.5 Pro (a powerful LLM) cannot solve the problems directly from images, while Veo 3 can. This suggests the video model possesses reasoning abilities beyond those of current LLMs.
* “Black Box” but Promising: The researchers don’t fully understand how Veo 3 achieves these results and call it a “black box.” However, they believe it indicates that a new form of reasoning is emerging within the video model itself.
* Catching Up to Specialists: Veo 3 is not yet as good as specialized models such as Meta’s SAMv2 (for image segmentation), but it is improving rapidly.
* Rapid Advancement: The model has made significant progress in just six months.

In essence, the article portrays Veo 3 as a significant step forward in video understanding and reasoning, with capabilities that go beyond what current LLMs achieve when given visual input.


Lisa Park - Tech Editor

Lisa Park is a leading technology journalist with 11 years of experience covering Silicon Valley, emerging technologies, and digital innovation. Lisa holds a Master's in Computer Science, and her expertise spans artificial intelligence, blockchain technology, cybersecurity, and venture capital. She has exclusive access to tech executives, startup founders, and industry insiders, making her a trusted voice in technology reporting.
