Apple Research Explores LLM Spatial Understanding and Annotation
Apple researchers have developed a new multimodal large language model (MLLM) called MM-Spatial, designed to overcome the current limitations of AI in understanding three-dimensional space. While existing multimodal models typically excel at interpreting 2D visual data, they often struggle to reason about 3D environments. The development of MM-Spatial aims to bridge this gap, specifically focusing on indoor scenes.
The research, published in September 2025 and presented at the International Conference on Computer Vision (ICCV), introduces both a new model and the infrastructure required to train and evaluate it. The team created a specialized supervised fine-tuning dataset and a corresponding evaluation benchmark to improve how AI perceives depth, distance, and spatial relationships.
The Cubify Anything VQA Dataset
Central to the creation of MM-Spatial is a novel dataset known as Cubify Anything VQA (CA-VQA). This dataset builds on large-scale, high-quality 3D scene data with open-set annotations to expose the model to a diverse range of spatial tasks.
The CA-VQA dataset focuses on several critical spatial understanding tasks, including:
- Predicting spatial relationships between objects.
- Estimating metric size and distance.
- Performing 3D grounding to locate objects within a space.
To enhance the model’s accuracy, the researchers incorporated multiple types of input signals. These include single images, multi-frame or multi-view inputs, and metric depth obtained both from sensors and from depth estimation.
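To make the description above concrete, the following is a minimal, hypothetical sketch of what a CA-VQA-style training sample could look like. The field names, task labels, and file paths are illustrative assumptions for this article, not the dataset’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch of a CA-VQA-style sample; the field names and structure
# are illustrative assumptions, not the dataset's actual schema.
@dataclass
class SpatialVQASample:
    question: str                     # e.g. "How far is the lamp from the couch?"
    answer: str                       # metric answer, e.g. "about 1.2 meters"
    task: str                         # "relationship" | "metric_estimation" | "3d_grounding"
    image_paths: List[str]            # single image or multiple views of the scene
    depth_map_paths: List[str] = field(default_factory=list)  # sensor- or estimation-derived metric depth
    target_box_3d: Optional[List[float]] = None                # e.g. [x, y, z, w, h, d] for 3D grounding

# A single-view sample asking for a metric distance between two objects.
sample = SpatialVQASample(
    question="How far is the floor lamp from the armchair?",
    answer="approximately 0.9 meters",
    task="metric_estimation",
    image_paths=["scene_042/frame_00.png"],
    depth_map_paths=["scene_042/depth_00.png"],
)
```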
Technical Capabilities and Performance
The researchers found that integrating metric depth and multi-view inputs significantly improved the model’s 3D understanding. According to the study, training on this data alone gave MM-Spatial depth perception capabilities comparable to those of dedicated monocular depth estimation models.
MM-Spatial also supports Chain-of-Thought spatial reasoning. This process allows the model to work through multi-step reasoning that combines 2D grounding and depth estimation to arrive at a spatial conclusion, and the model can additionally leverage depth input through tool use.
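The sketch below illustrates the general shape of such tool-assisted spatial reasoning: ground two objects in 2D, look up metric depth at their centers (the "tool" call), and back-project into 3D to compute a distance. The function names, pipeline, and numbers are assumptions made for illustration, not Apple’s implementation.

```python
import numpy as np

def box_center(box):
    """Return the (u, v) pixel center of a 2D box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift a pixel with metric depth into a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def object_distance(box_a, box_b, depth_map, intrinsics):
    """CoT-style steps: 2D grounding -> depth lookup (the 'tool') -> metric distance."""
    fx, fy, cx, cy = intrinsics
    points = []
    for box in (box_a, box_b):
        u, v = box_center(box)
        d = float(depth_map[int(v), int(u)])  # read metric depth at the box center
        points.append(backproject(u, v, d, fx, fy, cx, cy))
    return float(np.linalg.norm(points[0] - points[1]))

# Toy usage with a synthetic depth map and made-up boxes and intrinsics.
depth = np.full((480, 640), 2.0, dtype=np.float32)  # every pixel 2 m away
print(object_distance((100, 200, 180, 300), (400, 220, 470, 330), depth, (600, 600, 320, 240)))
```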
In testing, MM-Spatial achieved state-of-the-art performance on various 3D spatial understanding benchmarks, including the newly developed CA-VQA benchmark.
Research and Development
The project was led by a team of researchers including Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, and Peter Grasch.
By developing a generalist MLLM capable of sophisticated 3D reasoning, the research provides a framework for AI to better interact with and understand the physical geometry of indoor environments, moving beyond the limitations of flat image analysis.
