Sara Ghaboura *
Ketan More *
Ritesh Thawkar
Wafa Alghallabi
Omkar Thawakar
Fahad Shahbaz Khan
Hisham Cholakkal
Salman Khan
Rao M. Anwer
*Equal Contribution
🤗 [19 Feb 2025] TimeTravel dataset available on HuggingFace.
🔥 [20 Feb 2025] TimeTravel, the first comprehensive open-source benchmark on Historical and Cultural Artifacts, is released.
TimeTravel is the first comprehensive benchmark for AI-driven historical artifact analysis, designed to identify artifacts within their historical era and cultural context. Spanning 266 cultural groups across 10 regions, it prioritizes historical knowledge, contextual reasoning, and cultural preservation, unlike generic object recognition benchmarks. With over 10,000 expert-verified samples, TimeTravel sets a new standard for evaluating multimodal models in historical research, cross-civilizational analysis, and AI-powered cultural heritage preservation.
Figure 1. Left: TimeTravel Taxonomy maps artifacts from 10 civilizations, 266 cultures, and 10k+ verified samples for AI-driven historical analysis. Right: Regional dataset distribution by archaeological provenance, with Greece holding the largest share (18%) and balanced regional coverage.
- First Historical Artifact Benchmark: The first large-scale multimodal benchmark for AI-driven historical artifact analysis.
- Broad Coverage: Spans 10 civilizations and 266 cultural groups.
- Expert-Verified Samples: Over 10k samples include manuscripts, inscriptions, sculptures, and archaeological artifacts, manually curated by historians and archaeologists.
- Structured Taxonomy: Provides a hierarchical framework for artifact classification, interpretation, and cross-civilizational analysis.
- AI Evaluation Framework: Assesses GPT-4V, LLaVA, and other LMMs on historical knowledge, contextual reasoning, and multimodal understanding.
- Bridging AI and Cultural Heritage: Enables AI-driven historical research, archaeological analysis, and cultural preservation.
- Open-Source & Standardized: A publicly available dataset and evaluation framework to advance AI applications in history and archaeology.
The TimeTravel dataset follows a structured pipeline to ensure the accuracy, completeness, and contextual richness of historical artifacts.
Figure 2. TimeTravel Data Pipeline: A structured workflow for collecting, processing, and refining museum artifact data, integrating GPT-4o-generated descriptions with expert validation to ensure benchmark accuracy.
Our approach consists of four key phases:
- Data Selection: Curated 10,250 artifacts from museum collections, spanning 266 cultural groups, with expert validation to ensure historical accuracy and diversity.
- Data Cleaning: Addressed missing or incomplete metadata (titles, dates, iconography) by cross-referencing museum archives and academic sources, ensuring data consistency.
- Generation & Verification: Used GPT-4o to generate context-aware descriptions, which were refined and validated by historians and archaeologists for authenticity.
- Data Aggregation: Standardized and structured the dataset into image-text pairs, making it a valuable resource for AI-driven historical analysis and cultural heritage research.
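The aggregation step above standardizes each artifact into an image-text pair. A minimal sketch of what such a record might look like; the field names here are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class ArtifactSample:
    """One TimeTravel-style image-text pair (field names are illustrative)."""
    image_path: str   # path or URL of the artifact image
    title: str        # artifact title from museum metadata
    culture: str      # one of the 266 cultural groups
    region: str       # one of the 10 regions
    description: str  # GPT-4o-generated, expert-verified description

# Build one hypothetical record and convert it to a plain dict,
# the shape typically serialized to JSON for image-text benchmarks.
sample = ArtifactSample(
    image_path="images/0001.jpg",
    title="Funerary stele",
    culture="Attic",
    region="Greece",
    description="A carved marble stele depicting a seated figure.",
)
record = asdict(sample)
print(record["region"])
```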
The following tables present a comprehensive evaluation of various multimodal models on the TimeTravel benchmark. The first table compares model performance across multiple metrics, while the second analyzes their ability to describe archaeological artifacts from different civilizations, highlighting variations in accuracy and descriptive depth.
Model | BLEU | METEOR | ROUGE-L | SPICE | BERTScore | LLM-Judge |
---|---|---|---|---|---|---|
GPT-4o-0806 | 0.1758🏅 | 0.2439 | 0.1230🏅 | 0.1035🏅 | 0.8349🏅 | 0.3013🏅 |
Gemini-2.0-Flash | 0.1072 | 0.2456 | 0.0884 | 0.0919 | 0.8127 | 0.2630 |
Gemini-1.5-Pro | 0.1067 | 0.2406 | 0.0848 | 0.0901 | 0.8172 | 0.2276 |
GPT-4o-mini-0718 | 0.1369 | 0.2658🏅 | 0.1027 | 0.1001 | 0.8283 | 0.2492 |
Llama-3.2-Vision-Inst | 0.1161 | 0.2072 | 0.1027 | 0.0648 | 0.8111 | 0.1255 |
Qwen-2.5-VL | 0.1155 | 0.2648 | 0.0887 | 0.1002 | 0.8198 | 0.1792 |
Llava-Next | 0.1118 | 0.2340 | 0.0961 | 0.0799 | 0.8246 | 0.1161 |
Table: Performance comparison of various closed and open-source models on our proposed TimeTravel benchmark.
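ROUGE-L, one of the metrics in the table above, scores the longest common subsequence (LCS) between a model's description and the reference text. A minimal pure-Python sketch of the computation; actual evaluations typically use a packaged implementation, and the example sentences are made up:

```python
def lcs_length(a, b):
    # Dynamic-programming longest common subsequence over two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference, beta=1.2):
    # ROUGE-L F-score: LCS-based precision and recall, combined with weight beta.
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

score = rouge_l_f1(
    "a bronze votive figurine from ancient greece",
    "a bronze figurine from greece",
)
```

Because the score rewards shared word order rather than exact overlap, it captures partial credit for descriptions that are correct but phrased differently, which matters for free-form artifact captions.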
Model | India | Roman Emp. | China | British Isles | Iran | Iraq | Japan | Cent. America | Greece | Egypt |
---|---|---|---|---|---|---|---|---|---|---|
GPT-4o-0806 | 0.2491🏅 | 0.4463🏅 | 0.2491🏅 | 0.1899🏅 | 0.3522🏅 | 0.3545🏅 | 0.2228🏅 | 0.3144🏅 | 0.2757🏅 | 0.3649🏅 |
Gemini-2.0-Flash | 0.1859 | 0.3358 | 0.2059 | 0.1556 | 0.3376 | 0.3071 | 0.2000 | 0.2677 | 0.2582 | 0.3602 |
Gemini-1.5-Pro | 0.1118 | 0.2632 | 0.2139 | 0.1545 | 0.3320 | 0.2587 | 0.1871 | 0.2708 | 0.2088 | 0.2908 |
GPT-4o-mini-0718 | 0.2311 | 0.3612 | 0.2207 | 0.1866 | 0.2991 | 0.2632 | 0.2087 | 0.3195 | 0.2101 | 0.2501 |
Llama-3.2-Vision-Inst | 0.0744 | 0.1450 | 0.1227 | 0.0777 | 0.2000 | 0.1155 | 0.1075 | 0.1553 | 0.1351 | 0.1201 |
Qwen-2.5-VL | 0.0888 | 0.1578 | 0.1192 | 0.1713 | 0.2515 | 0.1576 | 0.1771 | 0.1442 | 0.1442 | 0.2660 |
Llava-Next | 0.0788 | 0.0961 | 0.1455 | 0.1091 | 0.1464 | 0.1194 | 0.1353 | 0.1917 | 0.1111 | 0.0709 |
Figures 3 and 4 showcase the cultural and material diversity of the TimeTravel dataset alongside a cross-model comparison, highlighting variations in artifact representation, historical periods, material compositions, and descriptive accuracy across different AI models.

Figure 3. Cultural and Material Diversity: TimeTravel spans civilizations from Ancient Egypt to Japan, covering prehistoric to medieval eras with artifacts in ceramics, metals, and stone, showcasing historical craftsmanship and cultural heritage.

Please refer to the Evaluation folder to reproduce the results.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or suggestions, feel free to reach out to us on GitHub Discussions.
If you use the TimeTravel dataset in your research, please consider citing:
@misc{ghaboura2025timetravelcomprehensivebenchmark,
title={Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts},
author={Sara Ghaboura and Ketan More and Ritesh Thawkar and Wafa Alghallabi and Omkar Thawakar and Fahad Shahbaz Khan and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
year={2025},
eprint={2502.14865},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.14865},
}