Apple Research Explores LLM Spatial Understanding and Annotation
Apple researchers have developed a new multimodal large language model (MLLM) called MM-Spatial, designed to overcome the current limitations of AI in understanding three-dimensional space. While existing multimodal models typically excel at interpreting 2D visual data, they often struggle to reason about 3D environments. The development of MM-Spatial aims to bridge this gap, specifically focusing on indoor scenes.
The research, published in September 2025 and presented at the International Conference on Computer Vision (ICCV), introduces both a new model and the infrastructure required to train and evaluate it. The team created a specialized supervised fine-tuning dataset and a corresponding evaluation benchmark to improve how AI perceives depth, distance, and spatial relationships.
The Cubify Anything VQA Dataset
Central to the creation of MM-Spatial is a novel dataset known as Cubify Anything VQA (CA-VQA). This dataset builds on large-scale, high-quality 3D scene data with open-set annotations to expose the model to a diverse range of spatial tasks.
The CA-VQA dataset focuses on several critical spatial understanding tasks, including:
- Predicting spatial relationships between objects.
- Estimating metric size and distance.
- Performing 3D grounding to locate objects within a space.
To enhance the model’s accuracy, the researchers incorporated multiple types of input signals. These include single images, multi-frame or multi-view inputs, and metric depth obtained both from sensors and from depth estimation.
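To make the description above concrete, the following is a minimal, hypothetical sketch of what a CA-VQA-style training sample could look like. The field names, task labels, and file paths are illustrative assumptions for this article, not the dataset’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch of a CA-VQA-style sample; the field names and structure
# are illustrative assumptions, not the dataset's actual schema.
@dataclass
class SpatialVQASample:
    question: str                     # e.g. "How far is the lamp from the couch?"
    answer: str                       # metric answer, e.g. "about 1.2 meters"
    task: str                         # "relationship" | "metric_estimation" | "3d_grounding"
    image_paths: List[str]            # single image or multiple views of the scene
    depth_map_paths: List[str] = field(default_factory=list)  # sensor- or estimation-derived metric depth
    target_box_3d: Optional[List[float]] = None                # e.g. [x, y, z, w, h, d] for 3D grounding

# A single-view sample asking for a metric distance between two objects.
sample = SpatialVQASample(
    question="How far is the floor lamp from the armchair?",
    answer="approximately 0.9 meters",
    task="metric_estimation",
    image_paths=["scene_042/frame_00.png"],
    depth_map_paths=["scene_042/depth_00.png"],
)
```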
Technical Capabilities and Performance
The researchers found that integrating metric depth and multi-view inputs significantly improved the model’s 3D understanding. According to the study, training on this data alone gave MM-Spatial depth perception capabilities comparable to those of dedicated monocular depth estimation models.
MM-Spatial also supports Chain-of-Thought spatial reasoning. This process allows the model to work through multi-step reasoning that combines 2D grounding and depth estimation to arrive at a spatial conclusion, and the model can additionally leverage depth input through tool use.
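The sketch below illustrates the general shape of such tool-assisted spatial reasoning: ground two objects in 2D, look up metric depth at their centers (the "tool" call), and back-project into 3D to compute a distance. The function names, pipeline, and numbers are assumptions made for illustration, not Apple’s implementation.

```python
import numpy as np

def box_center(box):
    """Return the (u, v) pixel center of a 2D box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift a pixel with metric depth into a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def object_distance(box_a, box_b, depth_map, intrinsics):
    """CoT-style steps: 2D grounding -> depth lookup (the 'tool') -> metric distance."""
    fx, fy, cx, cy = intrinsics
    points = []
    for box in (box_a, box_b):
        u, v = box_center(box)
        d = float(depth_map[int(v), int(u)])  # read metric depth at the box center
        points.append(backproject(u, v, d, fx, fy, cx, cy))
    return float(np.linalg.norm(points[0] - points[1]))

# Toy usage with a synthetic depth map and made-up boxes and intrinsics.
depth = np.full((480, 640), 2.0, dtype=np.float32)  # every pixel 2 m away
print(object_distance((100, 200, 180, 300), (400, 220, 470, 330), depth, (600, 600, 320, 240)))
```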
In testing, MM-Spatial achieved state-of-the-art performance on various 3D spatial understanding benchmarks, including the newly developed CA-VQA benchmark.
Research and Development
The project was led by a team of researchers including Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, and Peter Grasch.
By developing a generalist MLLM capable of sophisticated 3D reasoning, the research provides a framework for AI to better interact with and understand the physical geometry of indoor environments, moving beyond the limitations of flat image analysis.
