Temporal Instructional Diagram Grounding in Unconstrained Videos

Jiahao Zhang¹; Frederic Z. Zhang²; Cristian Rodriguez²; Yizhak Ben-Shabat¹,³; Anoop Cherian⁴; Stephen Gould¹

¹The Australian National University ²The Australian Institute for Machine Learning
³Technion Israel Institute of Technology ⁴Mitsubishi Electric Research Labs

Abstract

We study the challenging problem of simultaneously localizing a sequence of queries, given as instructional diagrams, in a video. This requires understanding not only the individual queries but also their interrelationships. However, most existing methods ground one query at a time, ignoring the inherent structure among queries, such as their general mutual exclusiveness and temporal order. Consequently, the predicted timespans of different step diagrams may overlap considerably or violate the temporal order, harming accuracy. In this paper, we tackle this issue by simultaneously grounding a sequence of step diagrams. Specifically, we propose composite queries, constructed by exhaustively pairing the visual content features of the step diagrams with a fixed number of learnable positional embeddings. Our insight is that self-attention lets composite queries carrying different content features suppress each other, reducing timespan overlaps in predictions, while cross-attention corrects temporal misalignment via joint content and position guidance. We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and on the YouCook2 benchmark for grounding natural language queries, significantly outperforming existing methods while grounding multiple queries simultaneously.
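The exhaustive pairing behind composite queries can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the function name `build_composite_queries`, the tuple layout, and the toy vectors are assumptions made for illustration, and the sketch deliberately keeps the content and positional parts separate rather than guessing how the model combines them.

```python
from itertools import product


def build_composite_queries(content_features, positional_embeddings):
    """Pair every step-diagram content feature with every positional embedding.

    Returns a list of (step_index, slot_index, content, position) tuples.
    The names and the tuple layout are hypothetical, chosen for illustration.
    """
    queries = []
    for (i, content), (j, position) in product(
        enumerate(content_features), enumerate(positional_embeddings)
    ):
        queries.append((i, j, content, position))
    return queries


# Toy inputs: 3 step diagrams and 2 positional embeddings (plain lists stand
# in for feature vectors and learnable parameters).
content = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
positions = [[0.1], [0.9]]

queries = build_composite_queries(content, positions)
print(len(queries))  # 3 diagrams x 2 positions = 6 composite queries
```

With N step diagrams and K positional embeddings, the pairing yields N × K composite queries, so every diagram is represented at every positional slot.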

Method

Prerequisites

Installation

# clone the project and enter its directory
git clone https://github.com/DavidZhang73/TDGV.git
cd TDGV

# create and activate a conda virtual environment
conda create -n TDGV python=3.10
conda activate TDGV

# install PyTorch following the official instructions at https://pytorch.org/get-started/locally/
# tested on PyTorch 2.1.2 only
# for CUDA 11.8:
# conda install pytorch==2.1.2 torchvision==0.16.2 pytorch-cuda=11.8 -c pytorch -c nvidia
# for CUDA 12.1:
conda install pytorch==2.1.2 torchvision==0.16.2 pytorch-cuda=12.1 -c pytorch -c nvidia

# install the other requirements (run from the project directory)
pip install -r requirements.txt
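
Since the project is tested on PyTorch 2.1.2 only, it can help to confirm the installed versions before training. The following is a minimal sketch using only the standard library. The helper names (`installed_version`, `matches`, `check_environment`) are hypothetical, and the sketch assumes the packages are registered under the distribution names `torch` and `torchvision`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the installation commands above.
EXPECTED = {"torch": "2.1.2", "torchvision": "0.16.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches(installed, expected):
    """Compare versions, ignoring a local build suffix such as '+cu121'."""
    return installed is not None and installed.split("+")[0] == expected


def check_environment(expected=EXPECTED):
    """Return {package: installed_version} for every package that does not match."""
    found = {pkg: installed_version(pkg) for pkg in expected}
    return {pkg: ver for pkg, ver in found.items() if not matches(ver, expected[pkg])}


# An empty dict means both packages match the pinned versions.
print(check_environment())
```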

Data Preparation

Use the huggingface_hub CLI to download the pre-processed datasets into ./data:

huggingface-cli download --repo-type dataset DavidZhang73/TDGVDatasets --local-dir ./data
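
The same download can be scripted from Python with `snapshot_download` from the `huggingface_hub` package. This is a sketch equivalent to the CLI command above, not something the repository provides. The wrapper name `download_datasets` is an assumption, and the import is deferred so the snippet loads even where `huggingface_hub` is not installed.

```python
REPO_ID = "DavidZhang73/TDGVDatasets"  # dataset repository from the CLI command


def download_datasets(local_dir="./data"):
    """Download the pre-processed datasets and return the local directory path."""
    # Imported lazily so this module can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)


# Usage (performs a network download, so it is left commented out here):
# path = download_datasets("./data")
```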

Training

Train the model on the IAW dataset:

python src/main.py fit -c configs/iaw_ours_aligned.yaml --trainer.logger.name iaw_ours_aligned

Train the model on the YouCook2 dataset:

python src/main.py fit -c configs/youcook2_ours_aligned.yaml --trainer.logger.name youcook2_ours_aligned

Train the model on the ActivityNet Captions dataset:

python src/main.py fit -c configs/anet_ours_aligned.yaml --trainer.logger.name anet_ours_aligned
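
The three training commands share one pattern: the config file name and the logger name are the same string. To launch them in sequence, a small wrapper along these lines could be used. It is a sketch only: `build_fit_command` and `run_all` are hypothetical helpers, not part of the repository, and the script is assumed to run from the project root.

```python
import subprocess
import sys

# Experiment names taken from the training commands above.
EXPERIMENTS = ["iaw_ours_aligned", "youcook2_ours_aligned", "anet_ours_aligned"]


def build_fit_command(name, python=sys.executable):
    """Build the argument list for one training run."""
    return [
        python, "src/main.py", "fit",
        "-c", f"configs/{name}.yaml",
        "--trainer.logger.name", name,
    ]


def run_all(names=EXPERIMENTS):
    """Run each experiment in order, stopping at the first failure."""
    for name in names:
        subprocess.run(build_fit_command(name), check=True)


# Print the commands without starting any training.
for experiment in EXPERIMENTS:
    print(" ".join(build_fit_command(experiment, python="python")))
```

Calling `run_all()` starts the actual training runs. With `check=True`, a non-zero exit code raises `subprocess.CalledProcessError`, so a failed run is not followed by the next one.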

Citation

@inproceedings{Zhang2025Temporally,
  title={Temporally Grounding Instructional Diagrams in Unconstrained Videos},
  author={Zhang, Jiahao and Zhang, Frederic Z and Rodriguez, Cristian and Ben-Shabat, Yizhak and Cherian, Anoop and Gould, Stephen},
  booktitle={Winter Conference on Applications of Computer Vision (WACV)},
  year={2025}
}
