arXiv, Project, MultimodalGround Dataset
Visual grounding focuses on detecting objects in images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models on large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate between Diet Coke and regular Coke in an image. In this case, additional reference images of Diet Coke and regular Coke, if available, can help ground these similar objects.
In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Given these reference images and a language expression, the model is required to detect a target object in a query image. We first introduce a new dataset to study the MRVG problem, and then propose a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images for few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to state-of-the-art LVLMs such as Qwen2.5-VL-7B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding.
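As a rough illustration of the pipeline described above, here is a minimal Python sketch: few-shot detection over reference images followed by LLM-based matching against the language expression. The function names (ground_expression, detect_candidates, match_with_llm) are hypothetical placeholders, not the actual MRVG-Net API.

```python
# Hypothetical sketch of the MRVG pipeline: few-shot detection with reference
# images, followed by LLM-based matching against the language expression.
# Function names and signatures are illustrative, not the actual MRVG-Net code.

def ground_expression(query_image, expression, reference_db,
                      detect_candidates, match_with_llm):
    """Return the bounding box of the object described by `expression`.

    reference_db: dict mapping object names to lists of reference images.
    detect_candidates: few-shot detector that proposes (box, object_name)
        pairs for the query image, given the reference images.
    match_with_llm: LLM call that picks the object name best matching the
        language expression.
    """
    # 1. Few-shot detection: propose candidate boxes labeled with database objects.
    candidates = detect_candidates(query_image, reference_db)  # [(box, name), ...]

    # 2. LLM matching: select the database object that the expression refers to.
    target_name = match_with_llm(expression, [name for _, name in candidates])

    # 3. Return the box of the matched object (None if nothing matched).
    for box, name in candidates:
        if name == target_name:
            return box
    return None
```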
Our environment is built on NIDS-Net, so you can refer to the installation instructions provided in the NIDS-Net repository.
We provide the MultimodalGround Dataset in this link. The ground-truth annotations are in the file "merged_coco_annotations.json". Please put them into a data folder organized as follows (an annotation-loading sketch follows the layout):
data_folder
│
└───templates
│   │
│   └───001_a_and_w_root_beer_soda_pop_bottle
│   │   │   images, depth maps, masks
│   │
│   └───002_coca-cola_soda_diet_pop_bottle
│   │   │   images, depth maps, masks
│   │   ...
│
└───scenes
│   scene_001
│   scene_002
│   ...
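The file name suggests that the ground-truth annotations follow the standard COCO format. The snippet below is a minimal sketch for inspecting them with pycocotools; the COCO assumption and the annotation path are ours and may differ from the actual files.

```python
# Minimal sketch for inspecting the ground-truth annotations, assuming
# "merged_coco_annotations.json" follows the standard COCO format and sits in
# the data folder above (both are assumptions).
from pycocotools.coco import COCO

coco = COCO("data_folder/merged_coco_annotations.json")

# List the object categories (e.g., the bottle templates above).
categories = coco.loadCats(coco.getCatIds())
print([c["name"] for c in categories])

# Load the annotations of the first image.
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
print(f"image {img_id}: {len(anns)} annotated objects")
```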
- Check GroundingDINO and SAM: to test GroundingDINO + SAM together, run test_gdino.py.
- Download the NIDS-Net files here and place them in the root folder, including the template embeddings and the adapter weights (a loading sketch follows this list):
  - adapter weights: "adapter_weights/refer_weight_1004_temp_0.05_epoch_640_lr_0.001_bs_1024_vec_reduction_4_weights.pth"
  - refined template embeddings: "adapted_obj_feats/refer_weight_1004_temp_0.05_epoch_640_lr_0.001_bs_1024_vec_reduction_4.json"
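A minimal loading sketch for these two files, assuming the .pth file is a standard PyTorch checkpoint and the .json file maps object identifiers to embedding vectors; the actual structure used by NIDS-Net may differ.

```python
# Minimal sketch for loading the downloaded NIDS-Net files. Assumes the .pth
# file is a PyTorch checkpoint (state dict) and the .json file maps object
# identifiers to embedding vectors; the real structure may differ.
import json
import torch

adapter_state = torch.load(
    "adapter_weights/refer_weight_1004_temp_0.05_epoch_640_lr_0.001_bs_1024_vec_reduction_4_weights.pth",
    map_location="cpu",
)
print("adapter entries:", list(adapter_state.keys())[:5])

with open(
    "adapted_obj_feats/refer_weight_1004_temp_0.05_epoch_640_lr_0.001_bs_1024_vec_reduction_4.json"
) as f:
    template_embeddings = json.load(f)
print("number of template entries:", len(template_embeddings))
```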
- Run inference with our method using eval_our_method.py; the results are saved to a JSON file. Then use eval_results.py to evaluate the predictions. An illustrative IoU-based check is sketched below, followed by the command that runs the full pipeline.
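The sketch below is only a rough illustration of a standard IoU check for grounding evaluation; it is not the actual eval_results.py, and the prediction/annotation formats it assumes (aligned lists of (x1, y1, x2, y2) boxes) may differ from ours.

```python
# Illustrative IoU-based grounding check; the metric and file formats used by
# eval_results.py may differ.

def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions with IoU >= threshold; assumes aligned lists."""
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```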
The full pipeline can be run with:

bash run_method.sh

If you find the method useful in your research, please consider citing:
@misc{lu2025multimodalreferencevisualgrounding,
title={Multimodal Reference Visual Grounding},
author={Yangxiao Lu and Ruosen Li and Liqiang Jing and Jikai Wang and Xinya Du and Yunhui Guo and Nicholas Ruozzi and Yu Xiang},
year={2025},
eprint={2504.02876},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.02876},
}

This project is based on the following repositories:
- NIDS-Net
- GroundingDINO
- SAM




