This is the code for the paper Visual Agentic AI for Spatial Reasoning with a Dynamic API by Damiano Marsili, Rohun Agrawal, Yisong Yue and Georgia Gkioxari.
Project Page | Paper | Dataset | BibTeX
Clone the repo:
git clone https://github.com/damianomarsili/VADAR.git
Set up the environment and download models:
cd VADAR
python -m venv venv
source venv/bin/activate
sh setup.sh
echo YOUR_OPENAI_API_KEY > api.key
Note: This setup assumes CUDA 12.2 and Python 3.10. If you are using a different CUDA version, replace the --index-url in setup.sh with the PyTorch wheel index that matches your CUDA runtime. For example, for CUDA 11.8, use --index-url https://download.pytorch.org/whl/cu118.
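After running setup.sh, a quick check can confirm that the installed PyTorch build matches your CUDA driver. The snippet below is a minimal sketch using only the standard torch API; run it inside the activated virtual environment:

# Sanity check: the PyTorch build should report a CUDA runtime compatible with your driver.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA runtime bundled with PyTorch:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))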
VADAR uses SAM2, UniDepth and GroundingDINO.
For a quick exploration of VADAR's functionality, we have compiled a notebook at demo-notebook/quickstart.ipynb. For evaluating on larger datasets, please refer to the "Evaluating VADAR" section below.
Omni3D-Bench contains 500 (image, question, answer) tuples of diverse real-world scenes sourced from Omni3D. The dataset is released under the Creative Commons Non-Commercial license. View samples from the dataset here.
Omni3D-Bench is hosted on HuggingFace. The benchmark can be accessed with the following code:
from datasets import load_dataset
dataset = load_dataset("dmarsili/Omni3D-Bench")
Additionally, a .zip of the dataset can be downloaded at the above link.
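Once loaded, individual samples can be inspected directly from the returned DatasetDict. The sketch below picks the first available split rather than assuming its name; the field names follow the annotation format described in the next section, so adjust them if your copy differs:

from datasets import load_dataset

dataset = load_dataset("dmarsili/Omni3D-Bench")
split_name = list(dataset.keys())[0]   # pick the first available split
sample = dataset[split_name][0]

print(sample["question"])                       # query text
print(sample["answer_type"], sample["answer"])  # expected answer type and ground-truth answer
sample["image"].show()                          # image decoded as a PIL Image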
Samples in Omni3D-Bench consist of images, questions, and ground-truth answers. The annotations can be loaded as a Python dictionary with the following format:
<!-- annotations.json -->
{
"questions": [
{
"image_index" : str, image ID
"question_index" : str, question ID
"image" : PIL Image, image for query
"question" : str, query
"answer_type" : str, expected answer type - {int, float, str}
"answer" : str|int|float, ground truth response to the query
},
{
...
},
...
]
}
Both Omni3D-Bench and the subset of CLEVR used in the paper can be downloaded with:
sh download_data.sh
You can use a custom dataset by placing it in the data directory. Your dataset folder should contain an images folder and an annotations.json file in the format specified in the "Omni3D-Bench" section above.
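Before running the evaluation, it can help to verify that a custom dataset follows this layout. The sketch below is illustrative: the folder name my_dataset is hypothetical, and it only checks the scalar fields from the annotation format above (the image field is left out, since a local annotations.json cannot store a PIL Image directly):

import json
from pathlib import Path

dataset_dir = Path("data/my_dataset")  # hypothetical custom dataset name
assert (dataset_dir / "images").is_dir(), "missing images/ folder"

with open(dataset_dir / "annotations.json") as f:
    annotations = json.load(f)

# Each entry should provide the fields listed in the format above.
required = {"image_index", "question_index", "question", "answer_type", "answer"}
for q in annotations["questions"]:
    missing = required - q.keys()
    assert not missing, f"question {q.get('question_index')} is missing {missing}"
print(f"{len(annotations['questions'])} questions passed the format check")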
To evaluate VADAR, run the following command:
python evaluate.py --annotations-json data/[DATASET_NAME]/annotations.json --image-pth data/[DATASET_NAME]/images/
Note: If evaluating VADAR on the CLEVR or GQA datasets, add the --dataset clevr or --dataset gqa flag. If omitted, the prompts and API for Omni3D-Bench will be used.
The evaluation script will produce the following files:
results/[timestamp]/
├── signature_generator # signatures generated by Signature Agent
│ ├── image_1_question_2.html
│ ├── image_5_question_8.html
│ ├── image_9_question_14.html
│ └── ...
├── api_generator # method implementations generated by API Agent
│ ├── method_1
│ │ ├── executable_program.py # python implementation of method
│ │ └── result.json # Unit test result
│ ├── method_2
│ │ ├── executable_program.py # python implementation of method
│ │ └── result.json # Unit test result
│ ├── ...
│ └── api.json # JSON of generated API.
├── program_generator # programs generated by Program Agent
│ ├── image_0_question_0.html
│ ├── image_0_question_1.html
│ ├── image_1_question_2.html
│ ├── ...
│ └── programs.json # JSON of generated programs.
├── program_execution # execution log of programs.
│ ├── image_0_questions_0
│ │ ├── executable_program.py # python implementation of query solution
│ │ ├── result.json # JSON of program output
│ │ └── trace.html # Visualization of output trace.
│ ├── image_0_questions_1
│ │ ├── executable_program.py # python implementation of query solution
│ │ ├── result.json # JSON of program output
│ │ └── trace.html # Visualization of output trace.
│ ├── ...
│ └── execution.json # JSON of experiment execution
├── execution.csv # CSV with full execution log.
└── results.txt # Summarized Results
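The per-question outputs can also be aggregated programmatically. The sketch below relies only on the paths shown in the tree above; the timestamp folder name is a placeholder, and the structure of each result.json should be checked against one file from your own run:

import json
from pathlib import Path

run_dir = Path("results/2025-01-01_12-00-00")  # placeholder: use your run's timestamp folder

# Collect the per-question program outputs written during execution.
for result_file in sorted(run_dir.glob("program_execution/*/result.json")):
    output = json.loads(result_file.read_text())
    print(result_file.parent.name, output)

# results.txt holds the summarized metrics for the run.
print((run_dir / "results.txt").read_text())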
See RESULTS.md for detailed VADAR performance on Omni3D-Bench, CLEVR, and GQA, as well as a comparison with other methods.
If you use VADAR or the Omni3D-Bench dataset in your research, please use the following BibTeX entry.
@misc{marsili2025visualagenticaispatial,
title={Visual Agentic AI for Spatial Reasoning with a Dynamic API},
author={Damiano Marsili and Rohun Agrawal and Yisong Yue and Georgia Gkioxari},
year={2025},
eprint={2502.06787},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.06787},
}