Code and data to go with the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks"

facebookresearch/jailbreak-objectives

An Objective for Nuanced LLM Jailbreaks

This codebase implements the target prefix generation pipeline from our paper on nuanced LLM jailbreaks. Given a set of user requests and a victim LLM, the pipeline automatically generates and selects target prefixes. Replacing the original "Sure, here is ..." target with these prefixes enables more nuanced jailbreak attacks.

[arXiv]

Pre-Generated Target Prefixes (Ready to Use)

We pre-generated target prefixes for commonly used jailbreak requests from AdvBench so that users can use and evaluate them directly. These prefixes are stored in the data folder. Note: The CSV files omit the original jailbreak requests (i.e., the 'goal' column is missing) to avoid directly hosting the original AdvBench dataset. Please run `python recover_requests.py` to recover the original jailbreak requests.

We processed 100 jailbreak requests in advance: 50 curated by us and 50 from PAIR. For these requests, we considered four victim LLMs: Llama-2, Llama-3, Llama-3.1, and Gemma-2.
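Because the shipped CSV files lack the 'goal' column until `recover_requests.py` has been run, a quick header check can confirm whether a file is ready to use. The following is a minimal sketch, not part of the repository: the helper name `missing_columns` and the `target_prefix` column are hypothetical, and only the 'goal' column name comes from the note above.

```python
import csv
import io


def missing_columns(csv_file, required=("goal",)):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in required if name not in header]


# Hypothetical in-memory stand-ins for a file in the data folder,
# before and after running recover_requests.py. The 'target_prefix'
# column name is an assumption made for illustration only.
before = io.StringIO("target_prefix\nexample prefix\n")
after = io.StringIO("goal,target_prefix\nexample request,example prefix\n")

print(missing_columns(before))  # ['goal']
print(missing_columns(after))   # []
```

For a real file, pass an open file handle, e.g. `with open(path, newline="") as f: missing_columns(f)`.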

Quick Start

  1. Create conda environment (optional):

    conda create -n nuancedjb python=3.11
    conda activate nuancedjb
    pip install -r requirements.txt
  2. Generate prefixes for Llama-3-8B:

    python pipeline.py \
       --config="./default_config.py" \
       --config.victim_model="meta-llama/Meta-Llama-3-8B-Instruct" \
       --config.input_csv="./input/demo_requests.csv" \
       --config.output_dir="./output" \
       --config.start_step=1
  3. (Optional) Load an existing checkpoint and redo selection with a different weight for PASR:

    python pipeline.py \
       --config="./default_config.py" \
       --config.victim_model="meta-llama/Meta-Llama-3-8B-Instruct" \
       --config.input_csv="./input/demo_requests.csv" \
       --config.output_dir="./output" \
       --config.start_step=7 \
       --config.run_id={existing run id like ABCD}
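When running the pipeline for several models or resuming several runs, it can help to assemble the command programmatically. The sketch below only rebuilds the command lines shown above; the helper `build_pipeline_command` is hypothetical and not part of the repository, and the flags are limited to those that appear in the Quick Start steps.

```python
import shlex


def build_pipeline_command(victim_model, input_csv, output_dir,
                           start_step=1, run_id=None,
                           config="./default_config.py"):
    """Assemble the pipeline.py invocation as an argument list."""
    cmd = [
        "python", "pipeline.py",
        f"--config={config}",
        f"--config.victim_model={victim_model}",
        f"--config.input_csv={input_csv}",
        f"--config.output_dir={output_dir}",
        f"--config.start_step={start_step}",
    ]
    # run_id is only passed when resuming from an existing checkpoint.
    if run_id is not None:
        cmd.append(f"--config.run_id={run_id}")
    return cmd


# Fresh run, mirroring step 2.
fresh = build_pipeline_command(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "./input/demo_requests.csv",
    "./output",
)

# Resume at the selection step, mirroring step 3 ("ABCD" is a placeholder run id).
resume = build_pipeline_command(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "./input/demo_requests.csv",
    "./output",
    start_step=7,
    run_id="ABCD",
)

print(shlex.join(fresh))
print(shlex.join(resume))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root; printing it first makes it easy to confirm the arguments before launching a long run.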

Optional Arguments

Please refer to default_config.py for more optional arguments and their descriptions.

License

This repository is made available under a CC-BY-NC license. However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.

Citation

If you find our work helpful, please cite it with
