4D spatial intelligence involves perceiving and reasoning about how objects move or change over time. Humans naturally possess 4D spatial intelligence, which supports a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks, which are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation suite consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations across a wide variety of 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark facilitates the development of more capable MLLMs toward human-level 4D spatial intelligence.
Figure 1: An overview of Spatial4D-Bench. Spatial4D-Bench is a large-scale, multi-task evaluation benchmark that comprehensively assesses MLLMs’ 4D spatial reasoning abilities. It consists of ~40,000 question-answer pairs covering 18 well-defined tasks, organized into 6 categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning.
Table 1: Comparison of Spatial4D-Bench with state-of-the-art spatial intelligence benchmarks. We evaluate coverage across 6 cognitive categories: object understanding (size, attribute, count, and affordance), scene understanding (room size, scene class, and grounding), spatial relationships (absolute/relative distance and orientation), spatiotemporal (S.T.) relationships (action, order, memory, and state change), spatial reasoning (egocentric reasoning and route planning), and spatiotemporal reasoning (prediction and physical plausibility). Unlike prior work, Spatial4D-Bench provides a significantly larger data scale and comprehensive coverage of all 18 tasks, offering a robust evaluation of MLLMs’ 4D reasoning capabilities.
Figure 2: Distribution of question-answer pairs provided by our Spatial4D-Bench.
Figure 3: Spatial4D-Bench Task Taxonomy. We organize 18 distinct tasks into 6 progressive categories spanning the spectrum of spatial cognition. The taxonomy progresses from perception and understanding at the object and scene level, through spatial and spatiotemporal relationship understanding, to spatial and spatiotemporal reasoning, mirroring the cognitive abilities of human intelligence.
Evaluates the ability to perceive and understand object properties and characteristics in spatial environments.
Assesses the capability to comprehend entire spatial scenes and their structural characteristics.
Measures the ability to recognize and quantify spatial relationships between objects in 3D space.
Evaluates the understanding of spatial relationships that evolve over time, along with their temporal dynamics.
Tests the capacity to reason about spatial information and solve spatial problems in static environments.
Assesses the ability to reason about spatial events and predict outcomes over time in dynamic environments.
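The six categories above and their 18 constituent tasks can be summarized as a small mapping — a minimal sketch using the task names from Table 1; the dictionary keys and identifiers are illustrative, not the benchmark's official task IDs:

```python
# Illustrative mapping of Spatial4D-Bench's 6 cognitive categories to its
# 18 tasks. Task names follow Table 1; the string keys are NOT official
# identifiers from the benchmark's data files.
TAXONOMY = {
    "object_understanding": [
        "object_attribute", "object_size", "object_count", "affordance",
    ],
    "scene_understanding": [
        "room_size", "scene_class", "3d_grounding",
    ],
    "spatial_relationship_understanding": [
        "absolute_distance", "relative_distance", "relative_orientation",
    ],
    "spatiotemporal_relationship_understanding": [
        "action_recognition", "appearance_order", "spatial_memory", "state_change",
    ],
    "spatial_reasoning": [
        "egocentric_reasoning", "route_plan",
    ],
    "spatiotemporal_reasoning": [
        "action_prediction", "physical_plausibility",
    ],
}

# Sanity check: 6 categories covering 18 tasks in total.
assert len(TAXONOMY) == 6
assert sum(len(tasks) for tasks in TAXONOMY.values()) == 18
```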
Current standings of AI models evaluated on the comprehensive Spatial4D-Bench benchmark with 18 subtasks.
| Rank | Model | Type | Institution | Overall | Obj. Attr. | Obj. Size | Obj. Count | Affordance | Room Size | Scene Class | 3D Grounding | Abs Dist | Rel Dist | Rel Orient | Action Recog | App Order | Spatial Mem | State Change | Ego Reason | Route Plan | Action Pred | Phys Plaus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Baseline Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| - | Human Level | Human | Human Annotators | 78.02% | 74.61% | 89.09% | 66.79% | 81.48% | 55.00% | 83.33% | 78.85% | 48.08% | 71.15% | 69.23% | 100.00% | 83.33% | 73.33% | 93.33% | 95.00% | 91.67% | 83.33% | 66.67% |
| - | Chance-level (Random) | Random | Random Guess | - | - | 25.00% | - | 25.00% | - | 25.00% | 25.00% | - | 25.00% | 25.00% | 25.00% | 25.00% | 25.00% | 25.00% | 25.00% | 5.03% | 25.00% | 25.00% |
| - | Chance-level (Frequency) | Frequency | Most Frequent Answer | - | - | 29.32% | - | 28.27% | - | 26.11% | 33.90% | - | 30.57% | 25.12% | 25.80% | 26.14% | 30.45% | 29.33% | 32.57% | - | 27.73% | 30.10% |
| **MLLM Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 1 | GPT-5 | Proprietary | OpenAI | 60.90% | 78.64% | 68.71% | 54.49% | 67.41% | 45.56% | 75.16% | 70.59% | 37.69% | 68.57% | 49.25% | 71.60% | 68.45% | 58.80% | 83.20% | 58.80% | 32.83% | 66.67% | 38.78% |
| 2 | Qwen3-VL-235B-A22B | Open | Alibaba Cloud | 56.17% | 79.76% | 62.21% | 64.70% | 57.82% | 56.62% | 64.38% | 60.88% | 44.52% | 60.23% | 55.40% | 61.12% | 66.17% | 49.52% | 68.20% | 44.20% | 19.50% | 57.69% | 38.11% |
| 3 | Gemini-2.5-Pro | Proprietary | Google Research | 54.68% | 74.14% | 67.25% | 32.40% | 56.82% | 49.19% | 65.59% | 70.44% | 30.00% | 63.37% | 42.33% | 55.05% | 67.20% | 52.29% | 79.57% | 55.80% | 30.67% | 50.48% | 41.56% |
| 4 | Qwen3-VL-30B-A3B | Open | Alibaba Cloud | 53.29% | 80.10% | 58.92% | 67.74% | 52.41% | 67.22% | 54.72% | 50.55% | 42.20% | 58.11% | 53.94% | 42.95% | 61.48% | 47.42% | 72.73% | 41.90% | 12.00% | 56.46% | 38.33% |
| 5 | InternVL3.5-241B-A28B | Open | Shanghai AI Lab | 50.89% | 62.17% | 57.83% | 63.63% | 59.86% | 47.62% | 58.06% | 47.35% | 31.80% | 62.81% | 31.16% | 60.32% | 60.53% | 45.22% | 71.13% | 40.90% | 21.83% | 54.29% | 39.44% |
| 6 | InternVL3.5-38B | Open | Shanghai AI Lab | 49.47% | 65.63% | 55.00% | 60.59% | 63.36% | 55.02% | 61.18% | 38.09% | 28.51% | 55.48% | 52.47% | 48.55% | 54.77% | 43.88% | 69.60% | 36.80% | 15.50% | 45.47% | 36.33% |
| 7 | Qwen2.5-VL-32B | Open | Alibaba Cloud | 43.61% | 62.27% | 56.08% | 37.89% | 51.00% | 50.35% | 52.32% | 29.41% | 28.15% | 45.28% | 42.29% | 43.53% | 37.20% | 47.13% | 65.37% | 34.90% | 14.17% | 57.69% | 29.89% |
| 8 | Qwen2.5-VL-72B | Open | Alibaba Cloud | 43.26% | 65.47% | 59.58% | 33.65% | 54.86% | 39.97% | 51.81% | 31.14% | 24.90% | 44.16% | 21.02% | 47.33% | 39.70% | 47.51% | 68.60% | 40.70% | 14.17% | 63.40% | 30.78% |
| 9 | InternVL3.5-8B | Open | Shanghai AI Lab | 41.87% | 47.39% | 47.96% | 55.99% | 49.41% | 50.16% | 48.40% | 33.86% | 25.89% | 48.51% | 40.17% | 40.37% | 50.27% | 40.15% | 62.07% | 31.10% | 9.83% | 42.99% | 29.22% |
| 10 | VideoLlama3-7B | Open | Alibaba Cloud | 38.30% | 33.21% | 52.88% | 52.69% | 33.91% | 28.41% | 50.86% | 26.99% | 23.84% | 41.99% | 31.01% | 42.90% | 40.27% | 43.21% | 55.83% | 40.80% | 14.67% | 40.68% | 35.22% |
| 11 | Qwen2.5-VL-7B | Open | Alibaba Cloud | 37.13% | 35.63% | 52.29% | 46.84% | 40.00% | 39.82% | 42.71% | 25.96% | 18.93% | 38.96% | 24.79% | 39.25% | 39.36% | 40.25% | 55.13% | 36.00% | 13.50% | 46.94% | 31.89% |
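The two chance-level baseline rows above can be sketched in a few lines — a minimal illustration, assuming multiple-choice questions whose ground-truth options are available as a list; the function names and the example answer distribution are hypothetical, not taken from the benchmark:

```python
from collections import Counter

def random_guess_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing: 1 / #options.
    A 4-option task yields 25.00%; a task with a larger option set
    (e.g. route plan) yields a correspondingly lower chance level."""
    return 1.0 / num_options

def most_frequent_answer_accuracy(answers: list[str]) -> float:
    """Accuracy of always predicting the most common ground-truth option,
    i.e. the 'Most Frequent Answer' baseline."""
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

# Hypothetical 4-option task with a slightly skewed answer distribution.
answers = ["A"] * 30 + ["B"] * 25 + ["C"] * 25 + ["D"] * 20
print(f"{random_guess_accuracy(4):.2%}")                # 25.00%
print(f"{most_frequent_answer_accuracy(answers):.2%}")  # 30.00%
```

Note that the frequency baseline always matches or exceeds random guessing on skewed answer distributions, which is why the leaderboard reports both.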
| Date | Dataset | Status / Note |
|---|---|---|
| Feb 15, 2026 | Spatial4D-Bench1K-mini | Official Release, Code:h481 |
| Late March, 2026 | Spatial4D-Bench40K | Coming Soon! |
| TBA | Spatial4D-Bench9K | Coming Soon: To be used as a Challenge dataset! |
This dataset is published under a segmented licensing model. By accessing or using the data, you agree to comply with the following terms:
Route Plan Data: This specific component is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). You are free to share and adapt this material for any purpose, including commercial use, provided that you give appropriate credit and distribute your contributions under the same license.
All Other Data (Remaining Components): All other parts of the dataset are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). These components may NOT be used for commercial purposes without prior written consent.