arXiv, Project, MultimodalGround Dataset
Visual grounding focuses on detecting objects in images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models on large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate between Diet Coke and regular Coke in an image. In this case, additional reference images of Diet Coke and regular Coke, if available, can help ground these similar objects.
In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Given these reference images and a language expression, the model is required to detect a target object in a query image. We first introduce a new dataset to study the MRVG problem, and then propose a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images for few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to state-of-the-art LVLMs such as Qwen2.5-VL-7B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding.
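As a rough illustration of the pipeline described above, here is a minimal Python sketch: few-shot detection over reference images followed by LLM-based matching against the language expression. The function names (ground_expression, detect_candidates, match_with_llm) are hypothetical placeholders, not the actual MRVG-Net API.

```python
# Hypothetical sketch of the MRVG pipeline: few-shot detection with reference
# images, followed by LLM-based matching against the language expression.
# Function names and signatures are illustrative, not the actual MRVG-Net code.

def ground_expression(query_image, expression, reference_db,
                      detect_candidates, match_with_llm):
    """Return the bounding box of the object described by `expression`.

    reference_db: dict mapping object names to lists of reference images.
    detect_candidates: few-shot detector that proposes (box, object_name)
        pairs for the query image, given the reference images.
    match_with_llm: LLM call that picks the object name best matching the
        language expression.
    """
    # 1. Few-shot detection: propose candidate boxes labeled with database objects.
    candidates = detect_candidates(query_image, reference_db)  # [(box, name), ...]

    # 2. LLM matching: select the database object that the expression refers to.
    target_name = match_with_llm(expression, [name for _, name in candidates])

    # 3. Return the box of the matched object (None if nothing matched).
    for box, name in candidates:
        if name == target_name:
            return box
    return None
```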
Our environment is built on NIDS-Net, so you can refer to the installation instructions provided in the NIDS-Net repository.
We provide the MultimodalGround Dataset in this link. The ground-truth annotations are in the file "merged_coco_annotations.json". Please put them into a data folder organized as follows (an annotation-loading sketch follows the layout):
data_folder
│
└───templates
│   │
│   └───001_a_and_w_root_beer_soda_pop_bottle
│   │   │   images, depth maps, masks
│   │
│   └───002_coca-cola_soda_diet_pop_bottle
│   │   │   images, depth maps, masks
│   │   ...
│
└───scenes
│   scene_001
│   scene_002
│   ...
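The file name suggests that the ground-truth annotations follow the standard COCO format. The snippet below is a minimal sketch for inspecting them with pycocotools; the COCO assumption and the annotation path are ours and may differ from the actual files.

```python
# Minimal sketch for inspecting the ground-truth annotations, assuming
# "merged_coco_annotations.json" follows the standard COCO format and sits in
# the data folder above (both are assumptions).
from pycocotools.coco import COCO

coco = COCO("data_folder/merged_coco_annotations.json")

# List the object categories (e.g., the bottle templates above).
categories = coco.loadCats(coco.getCatIds())
print([c["name"] for c in categories])

# Load the annotations of the first image.
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
print(f"image {img_id}: {len(anns)} annotated objects")
```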
- Check GroundingDINO and SAM: to test GroundingDINO + SAM together, run test_gdino.py.
- Download the NIDS-Net files here and place them in the root folder, including the template embeddings and the adapter weights (a loading sketch follows this list):
  - adapter weights: "adapter_weights/refer_weight_1004_temp_0.05_epoch_640_lr_0.001_bs_1024_vec_reduction_4_weights.pth"
  - refined template embeddings: "adapted_obj_feats/refer_weight_1004_temp_0.05_epoch_640_lr_0.001_bs_1024_vec_reduction_4.json"
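A minimal loading sketch for these two files, assuming the .pth file is a standard PyTorch checkpoint and the .json file maps object identifiers to embedding vectors; the actual structure used by NIDS-Net may differ.

```python
# Minimal sketch for loading the downloaded NIDS-Net files. Assumes the .pth
# file is a PyTorch checkpoint (state dict) and the .json file maps object
# identifiers to embedding vectors; the real structure may differ.
import json
import torch

adapter_state = torch.load(
    "adapter_weights/refer_weight_1004_temp_0.05_epoch_640_lr_0.001_bs_1024_vec_reduction_4_weights.pth",
    map_location="cpu",
)
print("adapter entries:", list(adapter_state.keys())[:5])

with open(
    "adapted_obj_feats/refer_weight_1004_temp_0.05_epoch_640_lr_0.001_bs_1024_vec_reduction_4.json"
) as f:
    template_embeddings = json.load(f)
print("number of template entries:", len(template_embeddings))
```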
- Run inference with our method using eval_our_method.py; the results are saved to a JSON file. Then use eval_results.py to evaluate the predictions. An illustrative IoU-based check is sketched below, followed by the command that runs the full pipeline.
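The sketch below is only a rough illustration of a standard IoU check for grounding evaluation; it is not the actual eval_results.py, and the prediction/annotation formats it assumes (aligned lists of (x1, y1, x2, y2) boxes) may differ from ours.

```python
# Illustrative IoU-based grounding check; the metric and file formats used by
# eval_results.py may differ.

def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions with IoU >= threshold; assumes aligned lists."""
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```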
The full pipeline can be run with:

bash run_method.sh

If you find the method useful in your research, please consider citing:
@misc{lu2025multimodalreferencevisualgrounding,
title={Multimodal Reference Visual Grounding},
author={Yangxiao Lu and Ruosen Li and Liqiang Jing and Jikai Wang and Xinya Du and Yunhui Guo and Nicholas Ruozzi and Yu Xiang},
year={2025},
eprint={2504.02876},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.02876},
}

This project is based on the following repositories:
- NIDS-Net
- GroundingDINO
- SAM




