# Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work primarily overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we evaluate the impact of various attack settings on LLM performance and provide a baseline benchmark for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 354 experiments with about 55,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs.
- 2024.08 Update1: We now support more LLMs, including the Llama 2, Llama 3, Mistral, Qwen, Tulu, and Vicuna families. See all supported models and details HERE.
- 2024.08 Update2: We add two new attack methods: DrAttack and MultiJail. DrAttack is a new prompt-level jailbreak method that requires preprocessing the data; we provide the preprocessing results for the datasets used in this repo. MultiJail (ICLR 2024) is a jailbreak method that uses manually created multilingual prompts to attack LLMs. MultiJail supports 10 languages: English (en); high-resource languages (HRL): Chinese (zh), Italian (it), Vietnamese (vi); medium-resource languages (MRL): Arabic (ar), Korean (ko), Thai (th); low-resource languages (LRL): Bengali (bn), Swahili (sw), Javanese (jv).
- 2024.08 Update3: We add a step-by-step guide for obtaining the outputs of the model-level defense methods. We provide the code to train the adversarial training model and the safety training model, as well as the code to run the RPO method and obtain the defensive adversarial suffix. Please see Section 1.3.2 for details. After the paper is public, we will also provide the pre-trained models in the official repository.
- 2024.08 Update4: We recently noticed a new defense method called PrimeGuard, presented at NextGenAISafety@ICML 2024. The paper proposes a dynamic routing mechanism that detects different types of prompts and uses in-context learning to prevent jailbreak responses. We have added the PrimeGuard defense method to the codebase. However, the original implementation of PrimeGuard relies on API-style LLM calls to construct the interaction environment, so you need to run this defense in a multi-GPU environment and start the local server first. We will continue to optimize our PrimeGuard implementation.
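The repository does not pin a specific serving stack for that local endpoint, so the exact launch command depends on your setup. As a minimal sketch, assuming you expose one of the supported chat models through a vLLM OpenAI-compatible server on a spare GPU (the model name, port, and environment variable below are illustrative assumptions, not the repo's required configuration):

```bash
# Assumption: serve Llama-2-7b-chat via vLLM's OpenAI-compatible API on GPU 1.
# Run this in a separate terminal (or under nohup) before starting PrimeGuard runs.
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000

# Point the API-based calls at the local endpoint instead of api.openai.com
# (the variable name depends on your openai client version).
export OPENAI_API_BASE="http://localhost:8000/v1"
```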
- 2024.10 Update: We are working on a new version of JailTrickBench, to be released in the future. The new version will include more attack methods, more defense methods, and more tricks, and will support more LLMs. We will also provide pre-trained models and datasets in the new version. Stay tuned!
🌟 If you find this resource helpful, please consider starring this repository and citing our NeurIPS'24 paper:
@inproceedings{NEURIPS2024_xu2024bag,
  author    = {Xu, Zhao and Liu, Fan and Liu, Hao},
  booktitle = {Advances in Neural Information Processing Systems},
  title     = {Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs},
  year      = {2024}
}

@article{xu2024bag,
  title   = {Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs},
  author  = {Xu, Zhao and Liu, Fan and Liu, Hao},
  journal = {arXiv preprint arXiv:2406.09324},
  year    = {2024}
}

@article{liu2024adversarial,
  title   = {Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs},
  author  = {Liu, Fan and Xu, Zhao and Liu, Hao},
  journal = {arXiv preprint arXiv:2406.06622},
  year    = {2024}
}
git clone <Repo Link>  # or download the zip file from the anonymous GitHub page
cd ./Bag_of_Tricks_for_LLM_Jailbreaking-4E10/
pip install -r requirements.txt
Several baselines call the OpenAI GPT service as part of their method, and we also use a GPT-4-based agent to evaluate the attack success rate. Please search for `YOUR_KEY_HERE` and replace it with your OpenAI API key in the following files:
- `./bag_of_tricks_for_llm_jailbreaking/GPTEvaluatorAgent/language_models.py`
- `./bag_of_tricks_for_llm_jailbreaking/baseline/TAP/language_models.py`
- `./bag_of_tricks_for_llm_jailbreaking/baseline/PAIR/language_models.py`
- `./bag_of_tricks_for_llm_jailbreaking/baseline/GPTFuzz/gptfuzzer/llm/llm.py`
- `./bag_of_tricks_for_llm_jailbreaking/baseline/AutoDAN/utils/opt_utils.py`
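If you prefer not to edit these files by hand, the sketch below locates and substitutes every placeholder at once (GNU sed syntax; the key value is a placeholder you must supply yourself):

```bash
# List every occurrence of the placeholder first.
grep -rln "YOUR_KEY_HERE" ./bag_of_tricks_for_llm_jailbreaking/

# Replace all occurrences in place (use `sed -i ''` on macOS/BSD sed).
grep -rl "YOUR_KEY_HERE" ./bag_of_tricks_for_llm_jailbreaking/ \
  | xargs sed -i 's/YOUR_KEY_HERE/sk-your-openai-key/g'
```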
After setting the API key, you can run the tricks experiments with the one-click scripts below.
To reproduce the main experiment results, you need to prepare the following models:
- Attack Model:
  - AdvPrompter: You need to train the AdvPrompter model to obtain the LoRA adapter and merge it into the base model. Save the merged attack model under `./models/attack/`.
  - AmpleGCG: If you cannot load the Hugging Face model directly with `AutoModelForCausalLM.from_pretrained()`, you may need to download the attack models `osunlp/AmpleGCG-llama2-sourced-llama2-7b-chat` and `osunlp/AmpleGCG-llama2-sourced-vicuna-7b` from the Hugging Face Hub first (see the download sketch after the defense-model list below).
- Defense Model:
  - RPO: Run the RPO method from the original repository to obtain the defensive adversarial suffix. We also provide the results we obtained.
  - Adversarial Training: Train the adversarial training model and save the resulting defense model under `./models/defense/`. Due to space limitations, we cannot provide the pre-trained model in the anonymous repository; after the paper is public, we will provide it in the official repository.
  - Safety Training: Train the safety training model and save the resulting defense model under `./models/defense/`. The same space limitation applies, and the pre-trained model will likewise be released in the official repository after the paper is public.
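For the AmpleGCG attack models mentioned above, one way to fetch the checkpoints from the Hugging Face Hub ahead of time is sketched below; `huggingface-cli download` is the standard tool, but the `--local-dir` targets are illustrative assumptions, so point them wherever your scripts expect the weights:

```bash
# Pre-download the two AmpleGCG attack models referenced in this repo.
# The --local-dir layout is an assumption; match it to your configuration.
huggingface-cli download osunlp/AmpleGCG-llama2-sourced-llama2-7b-chat \
    --local-dir ./models/attack/AmpleGCG-llama2-sourced-llama2-7b-chat
huggingface-cli download osunlp/AmpleGCG-llama2-sourced-vicuna-7b \
    --local-dir ./models/attack/AmpleGCG-llama2-sourced-vicuna-7b
```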
- Adversarial Training: You need to generate adversarial data to train the model. In this repo, you can use any attack method in the `baseline/` folder to generate the adversarial data. In our experiments, we use GCG to generate adversarial prompts and GPT-4 to generate the safe responses. We also provide the adversarial data sample we obtained in `./defenses/data/AdversarialTraining/adv_training_data_sample.json` to help you better understand the whole process. You can use this adversarial data to train the adversarial training model. To fine-tune LLMs efficiently and easily, we recommend the LLaMA-Factory tools; use the following commands to fine-tune the model with the adversarial data:
# Dependence Installation
conda create -n llama_factory python=3.10
conda activate llama_factory
cd defenses/LLaMA-Factory/
pip install -e .[metrics]
# Fine-tune the model
## We have provided the adversarial data in the `./defenses/data/AdversarialTraining/adv_training_data.json`. You can use the following command to fine-tune the model.
cd defenses/LLaMA-Factory/my_scripts/bag_of_trick/
bash adversarial_training_llama2.sh
If you want to train another LLM or train the model with your own adversarial data, remember to add your dataset in `LLaMA-Factory/data/dataset_info.json` and modify `YOUR_LLM_PATH`, `CHANGE_TO_YOUR_DATASET_NAME`, and `YOUR_LLM_NAME` in the `adversarial_training_llama2.sh` script as follows:
# lora
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--config_file ../../examples/accelerate/single_config.yaml \
../../src/train.py \
--stage sft \
--do_train \
--model_name_or_path YOUR_LLM_PATH \
--dataset CHANGE_TO_YOUR_DATASET_NAME \
--dataset_dir ../../data \
--template llama2 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir ../../saves/bag_of_trick/LLaMA2-7B/lora/YOUR_LLM_NAME \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 4096 \
--preprocessing_num_workers 32 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 2 \
--warmup_steps 4 \
--save_steps 20 \
--eval_steps 20 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 8.0 \
--val_size 0.05 \
--ddp_timeout 180000000 \
--plot_loss \
--bf16

# Merge the LoRA adapter into the base model and export it
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli export \
--model_name_or_path YOUR_LLM_PATH \
--adapter_name_or_path ../../saves/bag_of_trick/LLaMA2-7B/lora/YOUR_LLM_NAME \
--template llama2 \
--finetuning_type lora \
--export_dir ../../saves/at_transfer/LLaMA2-7B/merged/YOUR_LLM_NAME \
--export_size 2 \
--export_device cpu \
--export_legacy_format False
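After the export finishes, the merged weights sit under the `export_dir` above (i.e., `defenses/LLaMA-Factory/saves/at_transfer/LLaMA2-7B/merged/YOUR_LLM_NAME` relative to the repo root). A minimal sketch of placing them in the `./models/defense/` location mentioned earlier follows; the destination directory name is an assumption, so match it to the path you pass to the run scripts:

```bash
# Copy the merged adversarial-training model to the defense-model directory.
# The destination name is hypothetical; use whatever path your scripts reference.
mkdir -p ./models/defense/
cp -r defenses/LLaMA-Factory/saves/at_transfer/LLaMA2-7B/merged/YOUR_LLM_NAME \
      ./models/defense/AdversarialTraining_YOUR_LLM_NAME
```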
- Safety Training: You need to generate safety data to train the model. For example, you can use GPT-4 to generate safe responses to harmful prompts. We also provide the sample safety-training data we obtained in `./defenses/data/SafetyTraining/safety_training_data_sample.json`. You can use this data to train the safety training model. If you have installed LLaMA-Factory, you can fine-tune the model with the safety data as follows:
conda activate llama_factory
cd defenses/LLaMA-Factory/my_scripts/bag_of_trick/
bash safety_training_llama2.sh
- Unlearning: For unlearning, you can reuse the output of the adversarial training step, i.e., the adversarial data: directly take the output file of the GCG attack and perform unlearning on it. We also provide the sample unlearning data we obtained in `./defenses/data/Unlearning/unlearning_data_sample.json`. You can use this data to train the unlearning model with the following command:
cd defenses/Unlearning/
python unlearning.py --model_name YOUR_LLM_NAME --data_path ./data/Unlearning/unlearning_data_sample.json --model_save_dir ./models/defense/Unlearning/
- RPO: Although RPO is not a model-level defense method, it is used to generate the defensive adversarial suffix. We provide the code to run the RPO method; use the following command to obtain the suffix for Llama-2:
cd defenses/RPO/
python rpo.py --model_path meta-llama/Llama-2-7b-chat-hf --instructions_path ./data/RPO/harmful_bench_train.json --template_name llama2
For each trick, we provide an example script in the `./scripts/` folder. You can reproduce the results in the paper by running the following commands:
# 1. Attack Budget
bash scripts/1_trick_atk_budget_gcg.sh
bash scripts/1_trick_atk_budget_pair.sh
# 2. Attack Ability
bash scripts/2_trick_atk_ability_pair.sh
# 3. Attack Suffix Length
bash scripts/3_trick_atk_suffix_length.sh
# 4. Attack Intention
bash scripts/4_trick_atk_intension_autodan.sh
bash scripts/4_trick_atk_intension_pair.sh
# 5. Target Model Size
bash scripts/5_trick_target_size_autodan.sh
bash scripts/5_trick_target_size_pair.sh
# 6. Target Safety Fine-tuning Alignment
bash scripts/6_trick_target_align_autodan.sh
bash scripts/6_trick_target_align_pair.sh
# 7. Target System Prompt
bash scripts/7_trick_target_system_autodan.sh
bash scripts/7_trick_target_system_pair.sh
# 8. Target Template Type
bash scripts/8_trick_target_template_autodan.sh
bash scripts/8_trick_target_template_pair.sh
Note: As some baselines take a long time to run, we provide a feature to run the experiments in parallel (we used 50 A800 GPUs to accelerate the experiments). You can set `--data_split_idx` and `--data_split_total_num` to split a run across processes. For example, set `--data_split_total_num 2` and `--data_split_idx 0` in the script to run the first half of the data, and `--data_split_total_num 2` and `--data_split_idx 1` to run the second half. After all splits have finished, the program automatically merges the results.
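As a concrete sketch, the two halves of an AutoDAN / no-defense run can be launched concurrently on two GPUs (the GPU assignment and passing the split flags directly on the command line, rather than editing the script, are illustrative choices; the other flags come from the example commands later in this README):

```bash
# Run the two data splits of the AutoDAN / no-defense experiment in parallel, one per GPU.
CUDA_VISIBLE_DEVICES=0 python -u main.py \
    --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack AutoDAN \
    --instructions_path ./data/harmful_bench_50.csv --save_result_path ./exp_results/main_vicuna/ \
    --exp_name main_vicuna_none_defense --data_split_total_num 2 --data_split_idx 0 &

CUDA_VISIBLE_DEVICES=1 python -u main.py \
    --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack AutoDAN \
    --instructions_path ./data/harmful_bench_50.csv --save_result_path ./exp_results/main_vicuna/ \
    --exp_name main_vicuna_none_defense --data_split_total_num 2 --data_split_idx 1 &

wait  # results are merged automatically once both splits finish
```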
# Example: Use vicuna as target model
# We provide 98 scripts for the main experiments (7 attacks * 7 defenses * 2 datasets). For the no-defense setting, run the following scripts:
# 1. Run AutoDAN attack on No defense
bash scripts/main_vicuna/1_data1_None_defense.sh
# 2. Run PAIR attack on No defense
bash scripts/main_vicuna/2_data1_None_defense.sh
# 3. Run TAP attack on No defense
bash scripts/main_vicuna/3_data1_None_defense.sh
# 4. Run GPTFuzz attack on No defense
bash scripts/main_vicuna/4_data1_None_defense.sh
# 5. Run GCG attack on No defense
bash scripts/main_vicuna/5_data1_None_defense.sh
# 6. Run AdvPrompter attack on No defense
bash scripts/main_vicuna/6_data1_None_defense.sh
# 7. Run AmpleGCG attack on No defense
bash scripts/main_vicuna/7_data1_None_defense.sh
Note1: As some baselines take a long time to run, we provide a feature to run the experiments in parallel (we used 50 A800 GPUs to accelerate the experiments). You can set `--data_split_idx` and `--data_split_total_num` to split a run across processes. For example, set `--data_split_total_num 2` and `--data_split_idx 0` in the script to run the first half of the data, and `--data_split_total_num 2` and `--data_split_idx 1` to run the second half. After all splits have finished, the program automatically merges the results.
Note2: We provide a notebook, `./generate_scripts.ipynb`, to generate the scripts for the main experiments. You may also write your own script to generate the scripts for the tricks experiments.
For detailed arguments and options, please refer to `initialie_args.py` or the help message of `main.py`:
python main.py -h
Example commands for each attack on the Vicuna-13B target with no defense:

- AutoDAN:
python -u main.py --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack AutoDAN \
  --instructions_path ./data/harmful_bench_50.csv --save_result_path ./exp_results/main_vicuna/ \
  --agent_evaluation --resume_exp --agent_recheck --exp_name main_vicuna_none_defense

- PAIR:
python -u main.py --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack PAIR \
  --attack_model lmsys/vicuna-13b-v1.5 \
  --instructions_path ./data/harmful_bench_50.csv --save_result_path ./exp_results/main_vicuna/ \
  --agent_evaluation --resume_exp --agent_recheck --exp_name main_vicuna_none_defense

- TAP:
python -u main.py --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack TAP \
  --attack_model lmsys/vicuna-13b-v1.5 \
  --instructions_path ./data/harmful_bench_50.csv --save_result_path ./exp_results/main_vicuna/ \
  --agent_evaluation --resume_exp --agent_recheck --exp_name main_vicuna_none_defense

- GPTFuzz:
python -u main.py --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack GPTFuzz \
  --instructions_path ./data/harmful_bench_50.csv --save_result_path ./exp_results/main_vicuna/ \
  --agent_evaluation --resume_exp --agent_recheck

- GCG:
python -u main.py --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack GCG \
  --instructions_path ./data/harmful_bench_50.csv --save_result_path ./exp_results/main_vicuna/ \
  --agent_evaluation --resume_exp --agent_recheck --exp_name main_vicuna_none_defense

- AdvPrompter:
python -u main.py --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack AdvPrompter \
  --instructions_path ./data/harmful_bench_50.csv --save_result_path ./exp_results/main_vicuna/ \
  --agent_evaluation --resume_exp --agent_recheck --exp_name main_vicuna_none_defense \
  --adv_prompter_model_path ./models/attack/advprompter_vicuna_7b

- AmpleGCG:
python -u main.py --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack AmpleGCG \
  --instructions_path ./data/harmful_bench_50.csv --save_result_path ./exp_results/main_vicuna/ \
  --agent_evaluation --resume_exp --agent_recheck --exp_name main_vicuna_none_defense \
  --attack_source vicuna

- DrAttack:
python -u main.py --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack DrAttack \
  --instructions_path ./data/harmful_bench_50.csv --save_result_path ./exp_results/main_vicuna/ \
  --prompt_info_path ./baseline/DrAttack/dratk_data/attack_prompt_data/harmful_bench_test_info.json \
  --agent_evaluation --resume_exp --agent_recheck --exp_name main_vicuna_none_defense

- MultiJail:
python -u main.py --target_model_path lmsys/vicuna-13b-v1.5 --defense_type None_defense --attack MultiJail \
  --instructions_path ./baseline/MultiJail/multijail_data/1_MultiJail_en.csv --save_result_path ./exp_results/main_vicuna/ \
  --agent_evaluation --resume_exp --agent_recheck --exp_name main_vicuna_none_defense
The following defense methods are supported:
- **No Defense**
- **Self Reminder**: Defending ChatGPT against jailbreak attack via self-reminders. (Nature Machine Intelligence, 2023) Paper
- **RPO**: Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks. (SeTLLM@ICLR 2024) Paper
- **SmoothLLM**: Defending Large Language Models Against Jailbreaking Attacks. Paper
- **Unlearning**: Large Language Model Unlearning. (SoLaR@NeurIPS 2023) Paper
- **Adversarial Training**
- **Safety Training**
Due to the huge amount of experimental results, we provide the `./analysis/report_trick.py` script to analyze them. For example, you can run the following command to analyze the results of the main experiments:
python analysis/report_trick.py
| Model | Model size |
|---|---|
| Llama | 7B/13B/33B/65B |
| Llama 2 | 7B/13B/70B |
| Llama 3/Llama 3.1 | 8B/70B |
| Mistral/Mixtral | 7B/8x7B/8x22B |
| Qwen/Qwen1.5/Qwen2 | 0.5B/1.5B/4B/7B/14B/32B/72B/110B |
| Vicuna | 7B/13B |
For models larger than 13B/14B, we use 4-bit quantization to reduce memory usage.
In the implementation of this project, we have referred to the code from the following repositories or papers:
- Attack methods: GCG, AutoDAN, PAIR, TAP, GPTFuzz, AdvPrompter, AmpleGCG, DrAttack, MultiJail
- Defense methods: Self Reminder, RPO, SmoothLLM, Unlearning