AI’s Next Leap: Understanding the Physical World


Large language models (LLMs) are hitting a wall. While they excel at processing text, today’s AI struggles in real-world applications like robotics and autonomous driving because it lacks a fundamental understanding of how things work. This limitation is driving massive investment into “world models”—AI systems that simulate physics and causality, not just predict the next word. Investors have already poured over $2 billion into startups like AMI Labs and World Labs, signaling a major shift in AI development.

The Problem with Pure Prediction

LLMs operate by predicting the most likely next token, a word fragment or an image patch. They mimic human language without truly understanding the physical consequences of actions. Turing Award winner Richard Sutton warns that this approach limits AI’s ability to learn from experience and adapt to change. Google DeepMind CEO Demis Hassabis calls the result “jagged intelligence”: AI that can ace abstract tests but fails at basic physics, and breaks when inputs shift even slightly.

The core issue is that current AI doesn’t model the world; it mimics what people say about it. This is why even advanced vision-language models (VLMs) can behave erratically in unpredictable environments.

Three Approaches to Building World Models

Researchers are now prioritizing AI systems that act as internal simulators, testing hypotheses before taking action. This has led to three main architectural approaches, each with unique strengths and weaknesses.

JEPA: Real-Time Efficiency

The first approach, championed by AMI Labs, focuses on latent representations: learning the core rules of interaction without memorizing every detail. Based on the Joint Embedding Predictive Architecture (JEPA), this method mimics human cognition: we track trajectories, not every leaf in the background.

JEPA models discard irrelevant data, making them computationally efficient. This is ideal for robotics, self-driving cars, and other real-time applications where speed is critical. AMI Labs is already partnering with healthcare companies to reduce cognitive load in fast-paced settings. According to Yann LeCun, JEPA-based models are designed to pursue goals in a controllable way.
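The latent-space idea can be illustrated with a toy sketch. Everything below is hypothetical (a real JEPA learns its encoder and predictor as neural networks); the point is only that the loss is computed between embeddings, so pixel-level detail the encoder discards never enters the objective.

```python
import math

def encode(frame):
    """Toy 'encoder': collapse a frame (a list of pixel values) into a
    2-D latent — here just the average of each half of the frame.
    Fine-grained pixel detail is deliberately thrown away."""
    mid = len(frame) // 2
    return (sum(frame[:mid]) / mid, sum(frame[mid:]) / (len(frame) - mid))

def predict_latent(z, action):
    """Toy predictor: given the current latent and an action (here, a
    uniform brightness shift), guess the next latent. A real JEPA
    learns this mapping from data."""
    return (z[0] + action, z[1] + action)

def latent_loss(z_pred, z_true):
    """JEPA-style objective: distance measured between embeddings,
    not between raw pixels."""
    return math.dist(z_pred, z_true)

# Two consecutive "frames": the whole scene brightens uniformly by 1.0.
frame_t  = [0.0, 1.0, 2.0, 3.0]
frame_t1 = [1.0, 2.0, 3.0, 4.0]

z_pred = predict_latent(encode(frame_t), action=1.0)
loss = latent_loss(z_pred, encode(frame_t1))
print(loss)  # 0.0 — the prediction is exact in latent space
```

Because the encoder averages away per-pixel detail, a prediction can be exact in latent space even when individual pixels would be hard to forecast, which is precisely the efficiency the latent approach is after.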

Gaussian Splats: Spatial Immersion

World Labs takes a different route, building complete 3D environments from prompts using generative models and Gaussian splats (mathematical particles that define geometry and lighting). This drastically reduces the cost of creating interactive 3D spaces, addressing the “wordsmith in the dark” problem identified by World Labs founder Fei-Fei Li.
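Concretely, each splat is a small bundle of parameters. The sketch below uses illustrative field names, loosely following common 3D Gaussian splatting formulations, to show roughly what a renderer needs per particle.

```python
from dataclasses import dataclass

@dataclass
class GaussianSplat:
    """One 'particle' in a splat-based scene. A full scene is simply
    millions of these, sorted by depth and alpha-blended by the renderer."""
    mean: tuple       # (x, y, z) position of the Gaussian's center
    scale: tuple      # per-axis extent of the ellipsoid
    rotation: tuple   # orientation quaternion (w, x, y, z)
    opacity: float    # alpha in [0, 1] used for blending
    color: tuple      # RGB; real systems typically store view-dependent
                      # spherical-harmonic coefficients instead

# A faint red ellipsoid stretched along the x-axis at the origin:
splat = GaussianSplat(
    mean=(0.0, 0.0, 0.0),
    scale=(2.0, 0.5, 0.5),
    rotation=(1.0, 0.0, 0.0, 0.0),  # identity quaternion: no rotation
    opacity=0.3,
    color=(1.0, 0.0, 0.0),
)
print(splat.opacity)
```

Because the representation is explicit geometry rather than pixels, it can be exported, edited, and lit like any other 3D asset.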

These 3D representations are directly compatible with physics and 3D engines like Unreal Engine, allowing seamless interaction. While not suited to split-second execution, the approach has massive potential for spatial computing, entertainment, and industrial design. Autodesk is investing heavily in the technology to integrate it into its design applications.

End-to-End Generation: Scalable Simulation

DeepMind’s Genie 3 and Nvidia’s Cosmos represent a third approach: generating entire scenes, physics, and reactions on the fly. The model is the engine, processing prompts and actions in real-time.

This enables massive synthetic data generation, allowing developers to test rare or dangerous scenarios without physical risk. Waymo is adapting Genie 3 to train its self-driving cars, and Nvidia uses Cosmos for autonomous vehicle development. The downside is high computational cost, but the ability to simulate complete physical interactions on demand makes the tradeoff worthwhile.
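The “model is the engine” loop can be sketched abstractly: a single learned generator maps (frame history, action) to the next frame, replacing a hand-written physics engine. The stub below is purely illustrative; in a system like Genie 3 or Cosmos, the stub’s role is played by a large generative model.

```python
def world_model(history, action):
    """Stub for a learned frame generator: given past frames and the
    user's action, emit the next frame. Here it trivially 'moves' an
    object by the action amount and advances a step counter."""
    last = history[-1]
    return {"position": last["position"] + action, "step": last["step"] + 1}

def rollout(model, initial_frame, actions):
    """Game-engine-style loop where the model IS the engine: every
    frame is generated on the fly, not computed by hard-coded physics.
    This same loop can mass-produce synthetic training scenarios."""
    history = [initial_frame]
    for action in actions:
        history.append(model(history, action))
    return history

frames = rollout(world_model, {"position": 0, "step": 0}, actions=[1, 1, -2])
print(frames[-1])  # {'position': 0, 'step': 3}
```

Swapping in different action sequences is how rare or dangerous scenarios can be rehearsed cheaply: the cost is compute per generated frame, not physical risk.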

The Future: Hybrid Architectures

LLMs will remain crucial for reasoning and communication. However, world models are becoming the foundational infrastructure for physical and spatial data pipelines. The next wave will likely be hybrid systems that combine the strengths of each approach—prediction, spatial immersion, and scalable simulation. The goal remains the same: to create AI that doesn’t just talk about the world, but understands it.