Skip to content

Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models

Notifications You must be signed in to change notification settings

Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

48 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Visual Adversarial Examples Jailbreak
Aligned Large Language Models

Xiangyu Qi1,* ,  Kaixuan Huang1,* ,  Ashwinee Panda1
Peter Henderson2 ,  Mengdi Wang1 ,  Prateek Mittal1  
*Equal Contribution
Princeton University1    Stanford University2

AAAI (Oral), 2024

arXiv 


!!! Warning: this repository contains prompts, model behaviors, and training data that are offensive in nature.




Overview: A single visual adversarial example can jailbreak MiniGPT-4.

Note

  1. For each instruction below, we've sampled 100 random outputs, calculating the refusal and obedience ratios via manual inspection. A representative, redacted output is showcased for each.
  2. We use ɛ = 16/255 in the following demo.

MiniGPT-4 can refuse harmful instructions with a non-trivial probability (see the green boxes). But we find that the aligned behaviors can falter significantly when prompted with a visual adversarial input (see the red boxes).

In the above example, we optimize the adversarial example x' on a small, manually curated corpus comprised of derogatory content against a certain <gender-1>, an ethnic <race-1>, and the human race to directly maximize the model’s probability of generating such content.

Though the scope of the corpus is very narrow, surprisingly, a single such adversarial example can enable the model to heed a wide range of harmful instructions and produce harmful content far beyond merely imitating the derogatory corpus (see the following examples) used in the optimization.

Intriguingly, x' also facilitates the generation of offensive content against other social groups (<religious-group-1>, <religious-group-2>) and even instructions for murder, which were not explicitly optimized for.

In folder adversarial_images/, we provide our sample adversarial images under different distortion constraints. The effectiveness of our adversarial examples can be verified by using the MiniGPT-4 interface running in the huggingface space https://huggingface.co/spaces/Vision-CAIR/minigpt4.




Step-by-Step Instructions for Reimplementing Our Experiments on MiniGPT-4

Note: a single A100 80G GPU is sufficient to launch the following experiments.


Installation

We take MiniGPT-4 (13B) as the sandbox to showcase our attacks. The following installation instructions are adapted from the MiniGPT-4 repository.

1. Set up the environment

git clone https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models.git

cd Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

conda env create -f environment.yml
conda activate minigpt4

2. Prepare the pretrained weights for MiniGPT-4

As we directly inherit the MiniGPT-4 code base, the guide from the MiniGPT-4 repository can also be directly used to get all the weights.

  • Get Vicuna: MiniGPT-4 (13B) is built on the v0 version of Vicuna-13B. Please refer to this guide from the MiniGPT-4 repository to get the weights of Vicuna.

    Then, set the path to the vicuna weight in the model config file here at Line 16.

  • Get MiniGPT-4 (the 13B version) checkpoint: download from here.

    Then, set the path to the pretrained checkpoint in the evaluation config file in eval_configs/minigpt4_eval.yaml at Line 11.


Generate Visual Adversarial Examples

Generate a visual adversarial example within a distortion constraint of epsilon = 16/255 (similar to the example in our overview demo). The final adversarial examples will be saved to $save_dir/bad_prompt.bmp, and we also save intermediate checkpoints every 100 iterations.

The argument can be adjusted (e.g., --eps=32,64,128) to evaluate the effectiveness of the attacks under different distortion budgets.

python minigpt_visual_attack.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --n_iters 5000 --constrained --eps 16 --alpha 1 --save_dir visual_constrained_eps_16

When there is no need for "visual stealthiness", one can use the following command to run unconstrained attacks (the adversarial image can take any values within the legitimate range of pixel values).

python minigpt_visual_attack.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --n_iters 5000  --alpha 1 --save_dir visual_unconstrained

Evaluation

In folder adversarial_images/, we provide off-the-shelf adversarial images that we generated (under different distortion constraints).

To verify the effectiveness of our adversarial examples:

  • Play with the web-based interface of MiniGPT-4:

  • Testing on a diverse set of 40 manually curated harmful instructions

    (Warning: this will involve materials that are offensive in nature)

    python minigpt_test_manual_prompts_visual_llm.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --image_path  adversarial_images/prompt_unconstrained.bmp

    The argument --image_path can be customized to the path of any input image.

  • Testing on the RealToxicityPrompts Dataset

    • Download the RealToxicityPrompts dataset and copy/soft-link the dataset file to ./rtp_prompts.jsonl.

      The inference.py will read the dataset, filter out prompts with challenging = True and ask the model to generate the continuation for each prompt.

      python minigpt_inference.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --image_file  adversarial_images/prompt_unconstrained.bmp --output_file result.jsonl

      The argument --image_file can be customized to the path of any input image, --output_file indicates where to save the generations.

    • Request and place your Perspective API key in .perspective_api_key.

      The get_metric.py will calculate the toxic scores using both Perspective API and Detoxify.

      python get_metric.py --input result.jsonl --output result_eval.jsonl

Generate Textual Adversarial Examples

We also provide codes for optimizing adversarial text tokens w.r.t. the same attack targets as our visual attacks. A running example:

python minigpt_textual_attack.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --n_iters 5000 --n_candidates 50 --save_dir textual_unconstrained



Attacks on Other Models

We also implement our attacks on two other open-sourced VLMs, including InstructBLIP and LLaVA. To launch experiments on these models, we suggest users to create a separate conda environment for each model and install dependencies following the instructions of the original repositories of these models.

Instruct BLIP

  • Set up the environment and prepare model checkpoints following the instructions of the original repository at: https://github.com/salesforce/LAVIS/tree/main/projects/instructblip

    • Choose the 13B version model built upon vicuna13b, i.e., blip2_vicuna_instruct-vicuna13b
    • Download the version 1.1 model of vicuna-13b to the directory ./ckpts/vicuna-13b-v1.1
  • Attack

    python -u instructblip_visual_attack.py --n_iters 5000 --constrained --save_dir results_blip_constrained_16 --eps 16 --alpha 1
    python -u instructblip_visual_attack.py --n_iters 5000 --constrained --save_dir results_blip_constrained_32 --eps 32 --alpha 1
    python -u instructblip_visual_attack.py --n_iters 5000 --constrained --save_dir results_blip_constrained_64 --eps 64 --alpha 1
    python -u instructblip_visual_attack.py --n_iters 5000 --save_dir results_blip_unconstrained --alpha 1
  • Testing on the RealToxicityPrompts Dataset

    Make inference on the dataset:

    python instructblip_inference.py --image_file path_to_the_adversarial_example --output_file result.jsonl

    The get_metric.py will calculate the toxic scores using both Perspective API and Detoxify.

    python get_metric.py --input result.jsonl --output result_eval.jsonl

    Then, you can run cal_metrics.py to summarize the evaluation results from the two evaluation:

    python cal_metrics.py --input result_eval.jsonl

LLaVA-LLaMA-2

  • Set up the environment following the instructions of the original repository at: https://github.com/haotian-liu/LLaVA

    • After get the model, save it to the path ./ckpts/llava_llama_2_13b_chat_freeze
  • Attack

    python -u llava_llama_v2_visual_attack.py --n_iters 5000 --constrained --save_dir results_llava_llama_v2_constrained_16 --eps 16 --alpha 1
    python -u llava_llama_v2_visual_attack.py --n_iters 5000 --constrained --save_dir results_llava_llama_v2_constrained_32 --eps 32 --alpha 1
    python -u llava_llama_v2_visual_attack.py --n_iters 5000 --constrained --save_dir results_llava_llama_v2_constrained_64 --eps 64 --alpha 1
    python -u llava_llama_v2_visual_attack.py --n_iters 5000 --save_dir results_llava_llama_v2_unconstrained --alpha 1
  • Testing on the RealToxicityPrompts Dataset

    Make inference on the dataset:

    python -u llava_llama_v2_inference.py --image_file path_to_the_adversarial_example --output_file result.jsonl

    The get_metric.py will calculate the toxic scores using both Perspective API and Detoxify.

    python get_metric.py --input result.jsonl --output result_eval.jsonl

    Then, you can run cal_metrics.py to summarize the evaluation results from the two evaluation:

    python cal_metrics.py --input result_eval.jsonl



Citation

If you find this useful in your research, please consider citing:

@misc{qi2023visual,
      title={Visual Adversarial Examples Jailbreak Aligned Large Language Models}, 
      author={Xiangyu Qi and Kaixuan Huang and Ashwinee Panda and Peter Henderson and Mengdi Wang and Prateek Mittal},
      year={2023},
      eprint={2306.13213},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}

About

Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages