[Preprint 2025] Official code for the paper “TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility”

Keywords: Video-Language Models, Physics Plausibility, Video Reasoning, Trajectory-aware Attention, Benchmarking
Saman Motamed1,2✉,
Minghao Chen2,
Luc Van Gool1,
Iro Laina2
1INSAIT, Sofia University "St. Kliment Ohridski" 2Visual Geometry Group, University of Oxford
Please support our work by leaving a star on our repo! ⭐⭐⭐
- [202509/10] 📢 📢 TRAVL tuning dataset is released.
- [202509/10] 📢 📢 ImplausiBench benchmark is released.
- [202509/10] 📢 📢 The paper is on arXiv
- Training code (LLaVA-NeXT + TRAVL)
- LLM Judge evaluation script
- Release LLaVA-NeXT + TRAVL weights
- Modern VLMs summarize a video's overall content well, yet they fail to reason about fine-grained physical interactions within it.
- TRAVL is a lightweight, modular attention recipe (spatial plus trajectory-aware temporal attention) that helps VLMs judge physics implausibility more reliably.
- ImplausiBench is our 300-video benchmark (150 real, 150 implausible) with paired, style-matched videos and grounded MCQs to evaluate visual-temporal reasoning beyond language shortcuts.
- TRAVL Dataset is our curated dataset of 3,482 videos with 19,708 physics‑focused Q/A pairs.
- Paper: https://arxiv.org/abs/2510.07550
- Project page: https://sam-motamed.github.io/projects/TRAVL
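To make the "spatial plus trajectory-aware temporal" idea more concrete, here is a minimal sketch of how such attention masks could be built. It is an illustration only and not the implementation from the paper: the function name `build_masks`, the `traj_ids` input (a per-frame, per-patch trajectory ID, assumed to come from some point tracker), and the convention that `-1` marks an untracked patch are all assumptions made for this example.

```python
import numpy as np

def build_masks(traj_ids):
    """Build boolean attention masks over T*N video tokens.

    traj_ids: int array of shape (T, N) giving a (hypothetical) trajectory ID
    for each of N patch tokens in each of T frames; -1 means "untracked".
    Returns (spatial, temporal), each of shape (T*N, T*N), where True means
    "query token (row) may attend to key token (column)".
    """
    traj_ids = np.asarray(traj_ids)
    T, N = traj_ids.shape
    frame_idx = np.repeat(np.arange(T), N)   # frame index of each flattened token
    flat_ids = traj_ids.reshape(-1)          # trajectory ID of each flattened token

    # Spatial mask: tokens attend only to tokens in the same frame.
    spatial = frame_idx[:, None] == frame_idx[None, :]

    # Temporal mask: tokens attend to tokens on the same trajectory in any
    # frame; untracked tokens (ID < 0) attend only to themselves.
    same_traj = (flat_ids[:, None] == flat_ids[None, :]) & (flat_ids[:, None] >= 0)
    temporal = same_traj | np.eye(T * N, dtype=bool)
    return spatial, temporal
```

In a setup like this, the two masks would typically be passed to separate attention layers, so that one layer mixes information within a frame and the other follows tracked points across frames.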
A 300-video benchmark (150 real, 150 implausible) for evaluating visual-temporal physics plausibility with paired clips (shared first frame & style) and grounded MCQs that reduce language-only shortcuts.
- Hugging Face → https://huggingface.co/datasets/INSAIT-Institute/ImplausiBench
- What’s inside
  - ImplausiBench/real/*.mp4 & ImplausiBench/implausible/*.mp4
  - ImplausiBench-MCQA.json: grounded multiple-choice questions per pair
- Metrics reported: Human & LLM-judge accuracy on Real / Implausible subsets (150 each)
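Once the benchmark is available locally, a quick sanity check is to index the two video folders and load the question file. The sketch below is a hedged example: `index_benchmark` is a hypothetical helper, it assumes `root` points at the folder that contains `real/` and `implausible/`, and it assumes `ImplausiBench-MCQA.json` sits next to those folders. It does not assume anything about the JSON schema and simply returns the parsed object.

```python
import json
from pathlib import Path

def index_benchmark(root):
    """List the real and implausible clips under `root` and load the MCQA file.

    Assumed layout (adjust to the actual download):
        root/real/*.mp4
        root/implausible/*.mp4
        root/ImplausiBench-MCQA.json
    """
    root = Path(root)
    real = sorted(p.name for p in (root / "real").glob("*.mp4"))
    implausible = sorted(p.name for p in (root / "implausible").glob("*.mp4"))

    mcqa_path = root / "ImplausiBench-MCQA.json"
    mcqa = None
    if mcqa_path.exists():
        # The schema is not documented here, so the parsed JSON is returned as-is.
        mcqa = json.loads(mcqa_path.read_text(encoding="utf-8"))

    return {"real": real, "implausible": implausible, "mcqa": mcqa}
```

With a complete download, each of the two lists should contain 150 entries, matching the subset sizes stated above.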
git lfs install
git clone https://huggingface.co/datasets/INSAIT-Institute/ImplausiBench data/implausibench

The TRAVL tuning dataset:

- Scale: 3,482 videos • 19,708 QA pairs
- Composition: real + implausible clips (e.g., Physics-IQ, Impossible Videos, Video-ChatGPT)
- Link: https://huggingface.co/datasets/INSAIT-Institute/TRAVL
# Option A: huggingface_hub
pip install -U huggingface_hub
python - << 'PY'
from huggingface_hub import snapshot_download
snapshot_download(repo_id="INSAIT-Institute/TRAVL", repo_type="dataset", local_dir="data/travl")
PY
# Option B: git-lfs
git lfs install
git clone https://huggingface.co/datasets/INSAIT-Institute/TRAVL data/travl

Accuracies (%) on the Implausible (generated) and Real subsets of ImplausiBench (150 videos each).
We report both Human and LLM-judge scores. Sorted by Implausible — Human (best → worst).
| Model | Implausible — Human | Implausible — LLM | Real — Human | Real — LLM | 
|---|---|---|---|---|
| LLaVA-NeXT (TRAVL) |  52.7 |  28.7 |  47.3 |  31.3 | 
| Gemini 2.5 Pro |  41.3 |  29.3 |  100.0 |  78.0 | 
| LLaVA-NeXT (SFT) |  34.0 |  22.0 |  45.3 |  23.3 | 
| GPT-4o |  32.7 |  21.3 |  84.7 |  64.0 | 
| Qwen2.5-VL |  18.7 |  12.0 |  96.7 |  74.7 | 
| InternVL 2.5 |  12.7 |  4.7 |  92.7 |  76.0 | 
| LLaVA-NeXT (pretrained) |  3.3 |  2.7 |  98.7 |  62.7 | 
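Because each subset has 150 videos, every percentage in the table corresponds to a whole number of correctly judged videos. The short sketch below shows that arithmetic, assuming each score is simply correct answers divided by 150 and rounded to one decimal place; the helper names are illustrative and not part of the released evaluation script.

```python
SUBSET_SIZE = 150  # videos per subset (Real or Implausible)

def accuracy_pct(num_correct, total=SUBSET_SIZE):
    """Accuracy in percent, rounded to one decimal as in the table."""
    return round(100.0 * num_correct / total, 1)

def implied_correct(pct, total=SUBSET_SIZE):
    """Number of correct videos implied by a reported percentage."""
    return round(pct / 100.0 * total)

# Example: LLaVA-NeXT (TRAVL), Implausible subset, human-judged.
print(implied_correct(52.7))  # 79
print(accuracy_pct(79))       # 52.7
```

Read this way, the jump from 3.3% (pretrained) to 52.7% (TRAVL) on the human-judged Implausible subset corresponds to going from 5 to 79 correct videos out of 150.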
@article{motamed2025travl,
    title={TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility},
    author={Saman Motamed and Minghao Chen and Luc Van Gool and Iro Laina},
    year={2025},
    eprint={2510.07550},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Questions or feedback? Reach us at [email protected].
Our work builds on the efforts of several prior works. Thanks to all their contributors!

