Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration

VisionXLab/avi-math

¹East China Normal University, ²Nanyang Technological University, ³SenseTime Research, ⁴Shanghai Jiao Tong University, ⁵Harbin Institute of Technology

Paper | Dataset


📢 Latest Updates

  • [2025.09.15] We released the benchmark and evaluation code.
  • [2025.09.08] Accepted by ISPRS JPRS.

Abstract

Mathematical reasoning is critical for tasks such as precise distance and area computation, trajectory estimation, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications.

ARI: arithmetic, CNT: counting, ALG: algebra, STA: statistics, LOG: logic, GEO: geometry.

🏆 Contributions

  • Benchmark: We introduce AVI-Math, the first multimodal benchmark for mathematical reasoning in UAV imagery, covering six subjects and real-world UAV scenarios.

  • Analysis: We provide a comprehensive analysis, uncovering the limitations of current VLMs in mathematical reasoning and offering insights for future improvements.

  • Exploration: We explore the potential of Chain-of-Thought prompting and fine-tuning techniques to enhance VLM performance, providing a 215k-sample instruction set for VLMs to learn domain-specific knowledge in UAV scenarios.


💬 Benchmark

Examples of six mathematical reasoning subjects in AVI-Math.

Download the dataset first, then use the code in the evaluation directory to run inference and compute the scores.


🔍 Analysis

Accuracy scores on AVI-Math. AVG: average accuracy over the six subjects. FRE: free-form question, CHO: multiple-choice question, T/F: true-or-false question. The highest scores among models in each part and overall are highlighted in blue and red, respectively. All results use the original model weights without fine-tuning.
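
As a rough illustration of how the AVG column can be read, the sketch below computes per-subject accuracy and then averages the subject scores with equal weight. The record fields (`subject`, `correct`) and the function names are hypothetical and are not taken from the repository's evaluation code. Equal weighting of subjects is an assumption based on the caption above.

```python
from collections import defaultdict

# Subject abbreviations as listed in the README.
SUBJECTS = ("ARI", "CNT", "ALG", "STA", "LOG", "GEO")


def subject_accuracy(records):
    """Return {subject: accuracy in percent} for every subject that has records.

    Each record is a dict with hypothetical keys:
      "subject": one of SUBJECTS
      "correct": truthy if the model's answer was judged correct
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["subject"]] += 1
        hits[rec["subject"]] += int(bool(rec["correct"]))
    return {
        s: 100.0 * hits[s] / totals[s]
        for s in SUBJECTS
        if totals.get(s, 0) > 0
    }


def average_accuracy(per_subject):
    """Unweighted mean over subjects (assumed reading of the AVG column)."""
    if not per_subject:
        return 0.0
    return sum(per_subject.values()) / len(per_subject)


# Toy records for illustration only; these are not benchmark results.
records = [
    {"subject": "ARI", "correct": True},
    {"subject": "ARI", "correct": True},
    {"subject": "GEO", "correct": True},
    {"subject": "GEO", "correct": False},
]
per_subject = subject_accuracy(records)
avg = average_accuracy(per_subject)
```

Averaging per subject rather than per question keeps a subject with few questions from being drowned out by larger ones. If the benchmark instead weights by question count, `average_accuracy` would need to change accordingly.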


🚀 Exploration

Chain-of-Thought and fine-tuning results on various VLMs.
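
For readers unfamiliar with Chain-of-Thought prompting, the following minimal sketch shows the general idea: ask the model to reason step by step, then parse the final answer from its response. The prompt wording, the `Answer:` convention, and both function names are hypothetical. The exact prompts used in the paper are not reproduced here.

```python
import re


def build_prompt(question, use_cot=False):
    """Build a text prompt for a question (hypothetical wording)."""
    if use_cot:
        return (
            f"{question}\n"
            "Let's think step by step, then give the final answer "
            "on a new line starting with 'Answer:'."
        )
    return f"{question}\nAnswer with the final result only."


def extract_answer(response):
    """Return the text after the last 'Answer:' marker, or the whole
    response (stripped) if no marker is present."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else response.strip()


# Toy example with a hand-written response; not real model output.
prompt = build_prompt("How many vehicles are parked in the left row?", use_cot=True)
response = "The left row has 3 groups of 4 vehicles.\n3 * 4 = 12\nAnswer: 12"
answer = extract_answer(response)
```

Taking the last `Answer:` match is a deliberate choice, since intermediate reasoning may restate the marker before the model commits to a final value.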

📜 Citation

@ARTICLE{zhou2025avimath,
  author={Zhou, Yue and Feng, Litong and Lan, Mengcheng and Yang, Xue and Li, Qingyun and Ke, Yiping and Jiang, Xue and Zhang, Wayne},
  journal={ISPRS Journal of Photogrammetry and Remote Sensing}, 
  title={Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration}, 
  year={2025},
  volume={},
  number={},
  pages={},
  doi={}
}

Contact

[email protected]
