An Objective for Nuanced LLM Jailbreaks

This codebase implements the target prefix generation pipeline from our nuanced LLM jailbreaks paper. For a given user request and victim LLM, the pipeline automatically generates and selects target prefixes. Replacing the original "Sure, here is ..." with these prefixes enables more nuanced jailbreak attacks.

[arXiv]

Pre-Generated Target Prefixes (Ready to Use)

We pre-generated target prefixes for commonly used jailbreak requests in AdvBench so that users can use and evaluate them directly. These prefixes are stored in the data folder. Note: The CSV files do not include the original jailbreak requests (i.e., the 'goal' column is omitted) to avoid directly hosting the original AdvBench dataset. Please run `python recover_requests.py` to recover the original jailbreak requests.
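Before using a pre-generated file, it can help to check whether its 'goal' column has been restored. The following is a minimal sketch using only the Python standard library. The helper name `has_goal_column` and the example file name are hypothetical; only the 'goal' column name comes from the note above.

```python
import csv
from pathlib import Path


def has_goal_column(csv_path):
    """Return True if the CSV header row contains a 'goal' column."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return "goal" in header


if __name__ == "__main__":
    # Hypothetical file name: substitute an actual CSV from the data folder.
    path = Path("data") / "example_prefixes.csv"
    if path.exists():
        print(f"{path}: goal column present = {has_goal_column(path)}")
    else:
        print(f"{path} not found; point this at a real file in the data folder")
```

If the check returns False for a file in the data folder, the original requests have presumably not been recovered yet for that file.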

We processed 100 jailbreak requests in advance: 50 curated by us and 50 from PAIR. For these requests, we considered four victim LLMs: Llama-2, Llama-3, Llama-3.1, and Gemma-2.

Quick Start

Create conda environment (optional):

conda create -n nuancedjb python=3.11
conda activate nuancedjb
pip install -r requirements.txt

Generating prefixes for Llama-3-8B:

python pipeline.py \
   --config="./default_config.py" \
   --config.victim_model="meta-llama/Meta-Llama-3-8B-Instruct" \
   --config.input_csv="./input/demo_requests.csv" \
   --config.output_dir="./output" \
   --config.start_step=1

(Optional) Load from an existing checkpoint and redo the selection with a different weight for PASR:

python pipeline.py \
   --config="./default_config.py" \
   --config.victim_model="meta-llama/Meta-Llama-3-8B-Instruct" \
   --config.input_csv="./input/demo_requests.csv" \
   --config.output_dir="./output" \
   --config.start_step=7 \
   --config.run_id={existing run id like ABCD}
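When running the pipeline for several victim models or resuming several runs, it may be convenient to assemble these commands programmatically. The sketch below is one way to do that, assuming the flags behave exactly as in the two commands above. The helper name `build_pipeline_command` is hypothetical and not part of this repository, and "ABCD" is only the placeholder run id from the example.

```python
import shlex


def build_pipeline_command(victim_model,
                           input_csv="./input/demo_requests.csv",
                           output_dir="./output",
                           start_step=1,
                           run_id=None):
    """Assemble the pipeline.py argument list using the flags shown above."""
    cmd = [
        "python", "pipeline.py",
        "--config=./default_config.py",
        f"--config.victim_model={victim_model}",
        f"--config.input_csv={input_csv}",
        f"--config.output_dir={output_dir}",
        f"--config.start_step={start_step}",
    ]
    # run_id is only passed when resuming from an existing checkpoint.
    if run_id is not None:
        cmd.append(f"--config.run_id={run_id}")
    return cmd


if __name__ == "__main__":
    model = "meta-llama/Meta-Llama-3-8B-Instruct"
    # Fresh run starting at step 1.
    print(shlex.join(build_pipeline_command(model)))
    # Resume an existing run at step 7 ("ABCD" is a placeholder run id).
    print(shlex.join(build_pipeline_command(model, start_step=7, run_id="ABCD")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues with model names or paths.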

Optional Arguments

Please refer to default_config.py for more optional arguments and their descriptions.

License

This repository is made available under a CC-BY-NC license; however, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.

Citation

If you find our work helpful, please cite our paper.
