Sara Ghaboura *
Ketan More *
Ritesh Thawkar
Wafa Alghallabi
Omkar Thawakar
Fahad Shahbaz Khan
Hisham Cholakkal
Salman Khan
Rao M. Anwer
*Equal Contribution
🤗 [19 Feb 2025] TimeTravel dataset available on HuggingFace.
🔥 [20 Feb 2025] TimeTravel, the first comprehensive open-source benchmark on Historical and Cultural Artifacts, is released.
TimeTravel is the first comprehensive benchmark for AI-driven historical artifact analysis, designed to identify artifacts within their historical era and cultural context. Spanning 266 cultural groups across 10 regions, it prioritizes historical knowledge, contextual reasoning, and cultural preservation, unlike generic object recognition benchmarks. With over 10,000 expert-verified samples, TimeTravel sets a new standard for evaluating multimodal models in historical research, cross-civilizational analysis, and AI-powered cultural heritage preservation.
Figure 1. Left: TimeTravel Taxonomy maps artifacts from 10 civilizations, 266 cultures, and 10k+ verified samples for AI-driven historical analysis. Right: Regional dataset distribution by archaeological provenance, with Greece holding the largest share (18%) and balanced regional coverage.
- First Historical Artifact Benchmark: The first large-scale multimodal benchmark for AI-driven historical artifact analysis.
- Broad Coverage: Spans 10 civilizations and 266 cultural groups.
- Expert-Verified Samples: Over 10k samples include manuscripts, inscriptions, sculptures, and archaeological artifacts, manually curated by historians and archaeologists.
- Structured Taxonomy: Provides a hierarchical framework for artifact classification, interpretation, and cross-civilizational analysis.
- AI Evaluation Framework: Assesses GPT-4V, LLaVA, and other LMMs on historical knowledge, contextual reasoning, and multimodal understanding.
- Bridging AI and Cultural Heritage: Enables AI-driven historical research, archaeological analysis, and cultural preservation.
- Open-Source & Standardized: A publicly available dataset and evaluation framework to advance AI applications in history and archaeology.
The TimeTravel dataset follows a structured pipeline to ensure the accuracy, completeness, and contextual richness of historical artifacts.
Figure 2. TimeTravel Data Pipeline: A structured workflow for collecting, processing, and refining museum artifact data, integrating GPT-4o-generated descriptions with expert validation to ensure benchmark accuracy.
Our approach consists of four key phases:
- Data Selection: Curated 10,250 artifacts from museum collections, spanning 266 cultural groups, with expert validation to ensure historical accuracy and diversity.
- Data Cleaning: Addressed missing or incomplete metadata (titles, dates, iconography) by cross-referencing museum archives and academic sources, ensuring data consistency.
- Generation & Verification: Used GPT-4o to generate context-aware descriptions, which were refined and validated by historians and archaeologists for authenticity.
- Data Aggregation: Standardized and structured the dataset into image-text pairs, making it a valuable resource for AI-driven historical analysis and cultural heritage research.
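The aggregation step above standardizes each artifact into an image-text pair. A minimal sketch of what such a record might look like; the field names here are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class ArtifactSample:
    """One TimeTravel-style image-text pair (field names are illustrative)."""
    image_path: str   # path or URL of the artifact image
    title: str        # artifact title from museum metadata
    culture: str      # one of the 266 cultural groups
    region: str       # one of the 10 regions
    description: str  # GPT-4o-generated, expert-verified description

# Build one hypothetical record and convert it to a plain dict,
# the shape typically serialized to JSON for image-text benchmarks.
sample = ArtifactSample(
    image_path="images/0001.jpg",
    title="Funerary stele",
    culture="Attic",
    region="Greece",
    description="A carved marble stele depicting a seated figure.",
)
record = asdict(sample)
print(record["region"])
```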
The following tables present a comprehensive evaluation of various multimodal models on the TimeTravel benchmark. The first table compares model performance across multiple metrics, while the second analyzes their ability to describe archaeological artifacts from different civilizations, highlighting variations in accuracy and descriptive depth.
Model | BLEU | METEOR | ROUGE-L | SPICE | BERTScore | LLM-Judge |
---|---|---|---|---|---|---|
GPT-4o-0806 | 0.1758🏅 | 0.2439 | 0.1230🏅 | 0.1035🏅 | 0.8349🏅 | 0.3013🏅 |
Gemini-2.0-Flash | 0.1072 | 0.2456 | 0.0884 | 0.0919 | 0.8127 | 0.2630 |
Gemini-1.5-Pro | 0.1067 | 0.2406 | 0.0848 | 0.0901 | 0.8172 | 0.2276 |
GPT-4o-mini-0718 | 0.1369 | 0.2658🏅 | 0.1027 | 0.1001 | 0.8283 | 0.2492 |
Llama-3.2-Vision-Inst | 0.1161 | 0.2072 | 0.1027 | 0.0648 | 0.8111 | 0.1255 |
Qwen-2.5-VL | 0.1155 | 0.2648 | 0.0887 | 0.1002 | 0.8198 | 0.1792 |
Llava-Next | 0.1118 | 0.2340 | 0.0961 | 0.0799 | 0.8246 | 0.1161 |
Table: Performance comparison of various closed and open-source models on our proposed TimeTravel benchmark.
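ROUGE-L, one of the metrics in the table above, scores the longest common subsequence (LCS) between a model's description and the reference text. A minimal pure-Python sketch of the computation; actual evaluations typically use a packaged implementation, and the example sentences are made up:

```python
def lcs_length(a, b):
    # Dynamic-programming longest common subsequence over two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference, beta=1.2):
    # ROUGE-L F-score: LCS-based precision and recall, combined with weight beta.
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

score = rouge_l_f1(
    "a bronze votive figurine from ancient greece",
    "a bronze figurine from greece",
)
```

Because the score rewards shared word order rather than exact overlap, it captures partial credit for descriptions that are correct but phrased differently, which matters for free-form artifact captions.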
Model | India | Roman Emp. | China | British Isles | Iran | Iraq | Japan | Cent. America | Greece | Egypt |
---|---|---|---|---|---|---|---|---|---|---|
GPT-4o-0806 | 0.2491🏅 | 0.4463🏅 | 0.2491🏅 | 0.1899🏅 | 0.3522🏅 | 0.3545🏅 | 0.2228🏅 | 0.3144🏅 | 0.2757🏅 | 0.3649🏅 |
Gemini-2.0-Flash | 0.1859 | 0.3358 | 0.2059 | 0.1556 | 0.3376 | 0.3071 | 0.2000 | 0.2677 | 0.2582 | 0.3602 |
Gemini-1.5-Pro | 0.1118 | 0.2632 | 0.2139 | 0.1545 | 0.3320 | 0.2587 | 0.1871 | 0.2708 | 0.2088 | 0.2908 |
GPT-4o-mini-0718 | 0.2311 | 0.3612 | 0.2207 | 0.1866 | 0.2991 | 0.2632 | 0.2087 | 0.3195 | 0.2101 | 0.2501 |
Llama-3.2-Vision-Inst | 0.0744 | 0.1450 | 0.1227 | 0.0777 | 0.2000 | 0.1155 | 0.1075 | 0.1553 | 0.1351 | 0.1201 |
Qwen-2.5-VL | 0.0888 | 0.1578 | 0.1192 | 0.1713 | 0.2515 | 0.1576 | 0.1771 | 0.1442 | 0.1442 | 0.2660 |
Llava-Next | 0.0788 | 0.0961 | 0.1455 | 0.1091 | 0.1464 | 0.1194 | 0.1353 | 0.1917 | 0.1111 | 0.0709 |
Figures 3 and 4 showcase the cultural and material diversity of the TimeTravel dataset alongside a cross-model comparison, highlighting variations in artifact representation, historical periods, material compositions, and descriptive accuracy across different AI models.

Figure 3. Cultural and Material Diversity: TimeTravel spans civilizations from Ancient Egypt to Japan, covering prehistoric to medieval eras with artifacts in ceramics, metals, and stone, showcasing historical craftsmanship and cultural heritage.

Please refer to the Evaluation folder to reproduce the results.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or suggestions, feel free to reach out to us on GitHub Discussions.
If you use the TimeTravel dataset in your research, please consider citing:
@misc{ghaboura2025timetravelcomprehensivebenchmark,
title={Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts},
author={Sara Ghaboura and Ketan More and Ritesh Thawkar and Wafa Alghallabi and Omkar Thawakar and Fahad Shahbaz Khan and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
year={2025},
eprint={2502.14865},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.14865},
}