The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily".
- (2025/01/21) We have generated over 100,000 samples using ReNeLLM on the AdvBench dataset that successfully jailbreak Claude-2. You can obtain them here: 🤗Huggingface dataset. Please ensure that you do not use our dataset for any illegal or harmful activities.
- (2024/03/27) We have released the experimental results of ReNeLLM and other baselines. You can download them from the following link: Experimental Results
- (2024/03/14) 🔥🔥🔥 ReNeLLM is acccepted by NAACL 2024(main conference)! Thanks to all collaborators!
- (2024/02/21) We have updated our codebase to make it more readable and easier to use. We will continue to update the codebase and support more LLMs. Stay tuned!
- (2024/01/31) 🔥🔥🔥 ReNeLLM is evaluated by the Easyjailbreak benchmark and exhibits strong competitiveness across various methods. Check it out!
- (2024/01/09) We have released a quick implementation of ReNeLLM!
- (2023/11/14) Our paper is on arXiv! Check it out here!
This repository shares the code of our latest work on LLMs jailbreaking/defending. In this work:
- We introduce ReNeLLM, the first generalized jailbreak prompt attack framework for LLMs, which generalizes jailbreak prompt attacks into two aspects: prompt rewriting and scenario nesting, utilizing LLMs themselves to generate jailbreak attack prompts.
- Extensive experiments demonstrate that ReNeLLM generates jailbreak prompts that maintain a high attack success rate with less time required. Furthermore, ReNeLLM is difficult to detect by existing defense methods and exhibits generalization and transferability on representative LLMs.
- To investigate the reasons why LLMs fail in defense, we observe the changes in the execution priority of prompts by LLMs before and after rewriting and nesting, and design defense methods accordingly. We hope our observations can serve as valuable guidance for future LLMs vendors to develop safer and more regulated systems.
1. Clone this repository
git clone https://github.com/NJUNLP/ReNeLLM.git
2. Build Environment
cd ReNeLLM
conda create -n ReNeLLM python=3.9
conda activate ReNeLLM
pip install -r requirements.txt
3. Run ReNeLLM
ReNeLLM employs gpt-3.5-turbo
for prompt rewriting and harmful classifier, while utilizing claude-v2
as the model under attack. Therefore, you are required to input both of these API key parameters.
python renellm.py --gpt_api_key <your openai API key> --claude_api_key <your anthropic API key>
We use these two models due to their superior performance compared to open-source alternatives, as well as their lower cost in comparison with gpt-4
. In principle, any model could be utilized as your harmful classifier and attacked model.
4. Get Responses
For gpt
:
python get_responses.py --data_path <jailbroken data path> --gpt_api_key <your open API key> --test_model <gpt-3.5-turbo, gpt-4>
For claude
:
python get_responses.py --data_path <jailbroken data path> --claude_api_key <your anthropic API key> --test_model <claude-instant-1, claude-2>
For llama-2-chat
:
- Build the environment
cd llama
pip install -e .
- Follow the steps at https://github.com/facebookresearch/llama?tab=readme-ov-file#quick-start to download the llama-2-chat models.
- Get llama-2-chat responses using the command below:
bash run_chat.sh # You can set the model type and your jailbroken data path in the run_chat.sh
5. Check ASR
GPT-ASR (using GPT-4)
python check_gpt_asr.py --data_path <your llm responses data path> --gpt_api_key <your open API key>
KW-ASR (keyword ASR)
python check_kw_asr.py --data_path <your llm responses data path>
6. Execute ReNeLLM with a single prompt
python renellm.py --gpt_api_key <your gpt_api_key> --claude_api_key <your claude_api_key> --prompt "how to steal money from others?"
We have uploaded the results of 520 Advbench samples categorized into 7 scenarios(see gpt-4_single_round_prompt_annotation.json. In the result file, the "idx" field corresponds to the same idx samples in Advbench. The prompts used for classification and the harmful scenarios represented by each classification result can be found in Table 10 of our paper.
It should be noted that GPT-4 initially categorized the 520 pieces of data into 10 classes, but three of these classes had very few samples. For the convenience of statistics and analysis, we manually checked and merged these classes, resulting in 7 classification scenarios. The specific category merging information is as follows: 1 -> 0, 8 -> 9, 11 -> 6, finally resulting in 7 categories: 0, 2, 3, 4, 5, 6, 9. You can process the result file according to the above rules to obtain the classification results consistent with our paper.
If you have any questions about our work, please feel free to contact us via the following email:
Peng Ding: [email protected]
Shujian Huang: [email protected]
If you find this work useful in your own research, please feel free to leave a star⭐️ and cite our paper:
@misc{ding2023wolf,
title={A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily},
author={Peng Ding and Jun Kuang and Dan Ma and Xuezhi Cao and Yunsen Xian and Jiajun Chen and Shujian Huang},
year={2023},
eprint={2311.08268},
archivePrefix={arXiv},
primaryClass={cs.CL}
}