
Code for Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack






Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack

Please feel free to contact [email protected] if you have any questions.

Brief Introduction

Vision-language pre-training (VLP) models excel at interpreting both images and text but remain vulnerable to multimodal adversarial examples (AEs). Advancing the generation of transferable AEs, which succeed across unseen models, is key to developing more robust and practical VLP models. Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process, aiming to improve transferability by expanding the contrast space of image-text features. However, these methods focus solely on diversity around the current AEs, yielding limited gains in transferability. To address this issue, we propose to increase the diversity of AEs by leveraging the intersection regions along the adversarial trajectory during optimization. Specifically, we propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity. We provide a theoretical analysis to demonstrate the effectiveness of the proposed adversarial evolution triangle. Moreover, we find that redundant inactive dimensions can dominate similarity calculations, distorting feature matching and making AEs model-dependent with reduced transferability. Hence, we propose to generate AEs in the semantic image-text feature contrast space, which can project the original feature space into a semantic corpus subspace. The proposed semantic-aligned subspace can reduce the image feature redundancy, thereby improving adversarial transferability. Extensive experiments across different datasets and models demonstrate that the proposed method can effectively improve adversarial transferability and outperform state-of-the-art adversarial attack methods.
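The triangle sampling idea described above can be illustrated with a minimal sketch. It is not the repository's implementation: the function names are hypothetical, the inputs are tiny flat lists instead of image tensors, and uniform barycentric weights are an assumption about how points inside the triangle might be drawn.

```python
import random

def sample_barycentric(rng):
    """Draw weights (a, b, c) uniformly from the simplex a + b + c = 1, all >= 0."""
    # The gaps between two sorted uniform numbers on [0, 1] are uniform on the simplex.
    u, v = sorted((rng.random(), rng.random()))
    return u, v - u, 1.0 - v

def sample_from_triangle(clean, historical, current, rng):
    """Return a convex combination of three equally sized flat lists (hypothetical helper)."""
    a, b, c = sample_barycentric(rng)
    return [a * x + b * h + c * y for x, h, y in zip(clean, historical, current)]

rng = random.Random(0)
clean = [0.0, 0.0, 0.0]        # stand-in for the clean example
historical = [1.0, 0.0, 2.0]   # stand-in for an adversarial example from an earlier step
current = [0.0, 1.0, 2.0]      # stand-in for the current adversarial example
samples = [sample_from_triangle(clean, historical, current, rng) for _ in range(5)]
```

Every sample lies inside the triangle spanned by the three inputs, so the sampled points cover the region between the clean example and the adversarial trajectory rather than only the neighborhood of the current adversarial example.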

Quick Start

1. Install dependencies

pip install torch==2.1.0 torchvision==0.16.0 --index-url
pip install -r requirements.txt

2. Prepare datasets and models

Download the datasets, Flickr30k and MSCOCO (the annotations are provided in ./data_annotation/). Set the root path of the dataset via the image_root entry in ./configs/Retrieval_flickr.yaml.
The checkpoints of the fine-tuned VLP models are available from ALBEF, TCL, and CLIP.
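If you prefer to set image_root programmatically, the following is a minimal sketch. It assumes image_root is a top-level, single-line entry in the YAML file; the helper name set_image_root and the demo file contents are hypothetical, and a YAML library would be more robust than this line-based rewrite.

```python
import re
import tempfile
from pathlib import Path

def set_image_root(config_path, new_root):
    """Rewrite the top-level `image_root:` line of a YAML config in place."""
    path = Path(config_path)
    text = path.read_text(encoding="utf-8")
    pattern = re.compile(r"^image_root\s*:.*$", flags=re.MULTILINE)
    if not pattern.search(text):
        raise KeyError(f"no top-level image_root entry in {path}")
    # A function replacement keeps backslashes in new_root from being read as escapes.
    new_text = pattern.sub(lambda _m: f"image_root: '{new_root}'", text, count=1)
    path.write_text(new_text, encoding="utf-8")
    return new_text

# Demo on a throwaway file; the contents below are placeholders, not the real config.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "Retrieval_flickr.yaml"
    demo.write_text("image_root: '/old/path'\nbatch_size: 32\n", encoding="utf-8")
    updated = set_image_root(demo, "/data/flickr30k")
```

For the real config you would call set_image_root("./configs/Retrieval_flickr.yaml", ...) with your own dataset path.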

Prepare datasets:

You can download the datasets from this link. Or you can use the following instruction:

wget --no-check-certificate '' -O datasets.tar.gz

Prepare checkpoints for models:

Please create a directory named checkpoints first, then use the following instructions:

  1. ALBEF Pre-Trained on Flickr30K
wget -O albef_flickr.pth
  2. ALBEF Pre-Trained on MSCOCO
wget -O albef_mscoco.pth
  3. TCL Pre-Trained on Flickr30K (Invalid Now)
wget --no-check-certificate '' -O tcl_flickr.pth
  4. TCL Pre-Trained on MSCOCO (Invalid Now)
wget --no-check-certificate '' -O tcl_mscoco.pth

Common Issue: The pretrained weights were removed from the TCL repository, so we uploaded the weights we had saved to Hugging Face. Please download all the weights from the Hugging Face repository Sensen02/VLPTransferAttackCheckpoints.
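Before running an evaluation, it can help to confirm that the checkpoints are in place. The sketch below is an illustration, not part of the repository: it creates the checkpoints directory if needed and lists which of the four file names used in the download commands above are still missing. The helper name missing_checkpoints is hypothetical.

```python
import tempfile
from pathlib import Path

# File names taken from the `-O` arguments of the download commands above.
EXPECTED_CHECKPOINTS = (
    "albef_flickr.pth",
    "albef_mscoco.pth",
    "tcl_flickr.pth",
    "tcl_mscoco.pth",
)

def missing_checkpoints(checkpoint_dir="checkpoints"):
    """Create the directory if needed and return the expected files not found in it."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED_CHECKPOINTS if not (directory / name).is_file()]

# Demo in a temporary directory with one placeholder file present.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "checkpoints"
    demo_dir.mkdir()
    (demo_dir / "albef_flickr.pth").touch()
    still_missing = missing_checkpoints(demo_dir)
```

In the repository root, calling missing_checkpoints() with no argument checks ./checkpoints.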

3. Parameter settings

Our method has two adjustable hyperparameters. In the ImageAttacker class, you can set the attribute sample_numbers, with a default value of 5. In the TextAttacker class, you can set the attribute text_ratios, with a default value of [0.6, 0.2, 0.2].

Transferability Evaluation

1. Image-Text Retrieval Attack Evaluation

You can choose to import either SGAttacker or RAttacker (ours) for the Image-Text Retrieval Attack Evaluation. The running parameters are:

--config: the path to the config file
--cuda_id: the ID of the GPU to use
--model_list: all VLP models to evaluate; ALBEF, TCL, and CLIP (ViT and CNN) are provided
--source_model: the VLP model used to generate multimodal adversarial examples
--albef_ckpt: the checkpoint for ALBEF
--tcl_ckpt: the checkpoint for TCL
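The sketch below shows one way flags like these could be parsed. It is an assumption for illustration, not the repository's actual parser: build_parser and parse_model_list are hypothetical names. It also shows that a POSIX shell removes the single quotes from a value such as ['ALBEF','TCL'] before the program sees it, so the list arrives as a single bracketed string.

```python
import argparse
import shlex

def parse_model_list(value):
    """Turn "[ALBEF,TCL]" or "'ALBEF', 'TCL'" into ["ALBEF", "TCL"]."""
    inner = value.strip().strip("[]")
    return [item.strip().strip("'\"") for item in inner.split(",") if item.strip()]

def build_parser():
    """Hypothetical stand-in for the evaluation script's argument parser."""
    parser = argparse.ArgumentParser(description="illustrative parser, not the real script")
    parser.add_argument("--config")
    parser.add_argument("--cuda_id", type=int)
    parser.add_argument("--model_list", type=parse_model_list)
    parser.add_argument("--source_model")
    parser.add_argument("--albef_ckpt")
    parser.add_argument("--tcl_ckpt")
    return parser

# shlex.split mimics POSIX shell quote removal (it does not perform filename expansion).
command = "--cuda_id 0 --model_list ['ALBEF','TCL','CLIP_ViT','CLIP_CNN'] --source_model CLIP_CNN"
tokens = shlex.split(command)
args = build_parser().parse_args(tokens)
```

Because an unquoted bracketed value can also be subject to filename expansion in some shells, wrapping the whole list in double quotes on the command line is a cautious choice.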

Here is an example for Flickr30K dataset.

python --config ./configs/Retrieval_flickr.yaml \
	--cuda_id 0 \
	--model_list ['ALBEF','TCL','CLIP_ViT','CLIP_CNN'] \
	--source_model CLIP_CNN \
	--albef_ckpt ./checkpoints/albef_flickr.pth \
	--tcl_ckpt ./checkpoints/tcl_flickr.pth \
	--original_rank_index_path ./std_eval_idx/flickr30k/

Main Results

Here is an example for MSCOCO dataset.

python --config ./configs/Retrieval_coco.yaml \
	--cuda_id 0 \
	--model_list ['ALBEF','TCL','CLIP_ViT','CLIP_CNN'] \
	--source_model CLIP_CNN \
	--albef_ckpt ./checkpoints/albef_mscoco.pth \
	--tcl_ckpt ./checkpoints/tcl_mscoco.pth \
	--original_rank_index_path ./std_eval_idx/coco/

Main Results

2. Cross-Task Attack Evaluation

We present two cross-task attack evaluations, ITR->VG and ITR->IC.


Visual Grounding (ITR->VG)

First, use the MSCOCO dataset and the provided files ./data_annotation/refcoco+_test_for_adv.json and ./data_annotation/refcoco+_val_for_adv.json to generate adversarial images (10K images).

After that, refer to the grounding evaluation in ALBEF (use '--evaluate') and replace the clean images in the MSCOCO dataset with the adversarial images. You can then measure the performance of the ALBEF model on the adversarial images, corresponding to the Val, TestA, and TestB metrics.


Image Captioning (ITR->IC)

First, use the MSCOCO dataset and the provided files ./data_annotation/coco_karpathy_test.json and ./data_annotation/coco_karpathy_val.json to generate adversarial images (3K images).

After that, refer to the captioning evaluation in BLIP (use '--evaluate') and replace the clean images in the MSCOCO dataset with the adversarial images. You can then measure the performance of the BLIP model on the adversarial images, corresponding to the B@4, METEOR, ROUGE-L, CIDEr, and SPICE metrics.
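Replacing clean images with adversarial ones can be done on a copy of the dataset so the original stays untouched. The following is a minimal sketch under the assumption that adversarial images are saved with the same relative file names as their clean counterparts; the helper name overlay_adversarial_images and the demo file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def overlay_adversarial_images(clean_root, adv_root, output_root):
    """Copy the clean dataset to output_root, then overwrite files that have an adversarial version."""
    clean_root, adv_root, output_root = map(Path, (clean_root, adv_root, output_root))
    shutil.copytree(clean_root, output_root, dirs_exist_ok=True)
    replaced = []
    for adv_file in sorted(adv_root.rglob("*")):
        if not adv_file.is_file():
            continue
        relative = adv_file.relative_to(adv_root)
        target = output_root / relative
        if target.is_file():  # only replace images that exist in the clean dataset
            shutil.copy2(adv_file, target)
            replaced.append(relative.as_posix())
    return replaced

# Demo with placeholder text files standing in for images.
with tempfile.TemporaryDirectory() as tmp:
    clean, adv, out = (Path(tmp) / name for name in ("clean", "adv", "out"))
    clean.mkdir()
    adv.mkdir()
    (clean / "a.jpg").write_text("clean-a")
    (clean / "b.jpg").write_text("clean-b")
    (adv / "a.jpg").write_text("adv-a")
    replaced = overlay_adversarial_images(clean, adv, out)
    out_a = (out / "a.jpg").read_text()
    out_b = (out / "b.jpg").read_text()
    clean_a = (clean / "a.jpg").read_text()
```

Pointing the evaluation's image root at the output directory then evaluates the model on the adversarial images while leaving every other image unchanged.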

Main Results:

3. Transfer Attack on LLMs

Please send the adversarial images to LLMs and prompt these systems with the query "Describe this image".

Main Results:


1. Visualization on Multimodal Dataset

2. Visualization on Image Captioning

3. Visualization on Visual Grounding

4. Visualization on LLMs




Kindly include a reference to this paper in your publications if it helps your research:

  title={Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory},
  author={Gao, Sensen and Jia, Xiaojun and Ren, Xuhong and Tsang, Ivor and Guo, Qing},
  journal={arXiv preprint arXiv:2403.12445},











