Veo 3 Steps into the GPT-3 Era: Seeing, Modeling, Reasoning

Introduction & Comparison
September 29, 2025

The Google DeepMind team published the paper “Video Models Are Zero-Shot Learners and Reasoners”, arguing that Veo 3 marks a “GPT-3 moment” for visual AI.

Across 18,384 generated videos spanning dozens of tasks, Veo 3 demonstrated remarkable zero-shot abilities on a wide range of visual problems. Beyond video generation, Veo 3 can attempt complex visual tasks it was never explicitly trained for, such as navigating mazes or solving Sudoku puzzles. This suggests that video models are on a path to becoming vision foundation models, just as large language models (LLMs) became foundation models for language.

The researchers describe visual intelligence as a hierarchy of four levels:

Perception: The foundation of understanding visual information

In traditional computer vision, perception tasks are often siloed: one model handles edge detection, another handles image segmentation, and yet another handles some other task.

With Veo 3, classic computer vision tasks can be performed with just a prompt. For example, you can upload a blurry image and ask Veo 3 to remove the blur, and it generates a sharper version without a dedicated deblurring model.

Modeling: Understanding the rules that govern the world

After “seeing” the world, the next level is to understand how it works. Video models, which inherently process temporal sequences, have a natural advantage in learning intuitive physics.

Veo 3 demonstrates a strong grasp of fundamental physical principles. For example, in one test, researchers asked the model to simulate placing a stone and a bottle cap into water. Veo 3 accurately generated a video showing the stone sinking quickly while the bottle cap floated, showcasing its understanding of buoyancy.
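The rule behind this test is simple to state in code. The following is a minimal sketch of the physics only, not of anything Veo 3 computes internally; the density values are rough, illustrative assumptions and do not come from the paper.

```python
WATER_DENSITY = 1000.0  # kg/m^3, approximate value for fresh water


def floats_in_water(density_kg_m3: float) -> bool:
    """An object floats when it is less dense than water."""
    return density_kg_m3 < WATER_DENSITY


# Illustrative, approximate densities (assumptions, not values from the paper).
objects = {"stone": 2700.0, "plastic bottle cap": 950.0}

outcome = {
    name: "floats" if floats_in_water(density) else "sinks"
    for name, density in objects.items()
}
print(outcome)
```

A video model has no such rule written down; it has to pick up the same regularity from the footage it was trained on.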

Manipulation: Becoming an all-in-one visual editor

With perception and modeling abilities, the model naturally learns how to manipulate the world. This translates into a range of powerful zero-shot image and video editing capabilities.

Tasks like background removal, style transfer, 3D object pose adjustment, or even turning a selfie into a professional business portrait can all be completed directly with a prompt using Veo 3.

Reasoning: The visual version of “Chain-of-Thought”

This represents the highest level of visual intelligence. When a model faces a task requiring multi-step planning and logic, how does it think?

In the field of LLMs, the concept of Chain-of-Thought (CoT) allows a model to output its reasoning step by step, significantly improving its ability to solve complex problems.

The authors of the paper propose a corresponding concept for video models: Chain-of-Frames (CoF). A video unfolds frame by frame, and each new frame has to stay consistent with the frames before it. The process amounts to step-by-step reasoning across both time and space, with each generated frame playing the role of a reasoning step in an LLM’s chain of thought.
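To make the analogy concrete, here is a toy sketch of a frame-by-frame rollout. It is an illustration under heavy assumptions: a "frame" is reduced to the position of a single object, and `constant_velocity` is a hypothetical stand-in for a video model, not Veo 3's architecture or API.

```python
from typing import Callable, List, Tuple

Frame = Tuple[int, int]  # toy "frame": the (x, y) position of a single object


def rollout(first_frame: Frame,
            next_frame: Callable[[List[Frame]], Frame],
            num_frames: int) -> List[Frame]:
    """Generate frames one at a time; each step receives every earlier frame."""
    frames = [first_frame]
    while len(frames) < num_frames:
        frames.append(next_frame(frames))
    return frames


def constant_velocity(history: List[Frame]) -> Frame:
    """Hypothetical stand-in for a model: continue the motion seen so far."""
    if len(history) < 2:
        x, y = history[-1]
        return (x + 1, y)  # no motion observed yet, so assume a step to the right
    (x0, y0), (x1, y1) = history[-2], history[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


frames = rollout((0, 0), constant_velocity, num_frames=4)
print(frames)
```

The list `frames` plays the role of the chain: every entry is an intermediate step that the next one builds on, much like the lines of a written chain of thought.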

Chain-of-Frames reasoning enables Veo 3 to tackle complex tasks requiring visual planning, such as:

Maze solving: The model can generate a complete animation of an object moving from the maze’s start to finish, strictly following the rules.
Completing visual sequences: By analyzing the progression of earlier shapes, the model can infer which shape should fill the final blank.
Tool use: In one test, the task was to generate a video of “removing a walnut from a fish tank.” Veo 3 successfully generated a person using a tool, such as a spoon, to accomplish the task.
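For the maze case, "strictly following the rules" can be spelled out as a check on the sequence of frames. The sketch below is an assumption-laden illustration: the grid, the helper names, and the rule set are hypothetical, the breadth-first search is a classical solver used only to produce a reference path, and none of this is the paper's evaluation code.

```python
from collections import deque
from typing import List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col)

# Hypothetical maze: S = start, G = goal, # = wall, . = free cell.
MAZE = [
    "S.#",
    ".##",
    "..G",
]


def find(maze: List[str], ch: str) -> Cell:
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not in maze")


def solve(maze: List[str]) -> Optional[List[Cell]]:
    """Breadth-first search; the path is one agent position per frame."""
    start, goal = find(maze, "S"), find(maze, "G")
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(maze) and 0 <= nc < len(maze[nr])
                    and maze[nr][nc] != "#" and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


def follows_rules(maze: List[str], path: List[Cell]) -> bool:
    """Check start, goal, no wall cells, and one orthogonal step per frame."""
    if not path or path[0] != find(maze, "S") or path[-1] != find(maze, "G"):
        return False
    if any(maze[r][c] == "#" for r, c in path):
        return False
    return all(abs(r1 - r0) + abs(c1 - c0) == 1
               for (r0, c0), (r1, c1) in zip(path, path[1:]))


path = solve(MAZE)
print(path, follows_rules(MAZE, path))
```

A generated video could be scored the same way in principle: extract the agent's position from each frame, then run a check like `follows_rules` on the resulting sequence.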

With Veo 3, we may be witnessing the dawn of AI mastering physics and spatial intelligence. This breakthrough is more than just video generation; it’s about giving AI the ability to see, understand, and interact with the world in ways once thought impossible.

Curious about what’s next? Try Veo 3 and watch the future of AI unfold frame by frame, right before your eyes.
