This document describes high-level results of testing the GCG attack using various publicly-available large language models.
- Model families with publicly-available versions capable of handling chat interaction
- Other model families that can be used with Broken Hill
- Other model families that do not currently work with Broken Hill
- Conversation template name: `llama2`
- Azurro APT3 collection at Hugging Face
- Conversation template names:
  - For GLM-4: `glm4`
- For GLM-4: Zhipu AI's GLM-4-9B-Chat page at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
You will need to specify the `--trust-remote-code` option to use these models.
Broken Hill includes a custom `glm4` conversation template based on the `fschat` `chatglm3` template, due to differences in the system prompt formatting.
Currently, only GLM-4 works in Broken Hill, because earlier versions of the model ship custom code that is incompatible with modern versions of Transformers. We've tried coming up with instructions to make the earlier versions work, but it looks like a deep rabbit hole.
You may encounter the following error when using GLM-4, depending on whether or not the GLM developers have updated their code by the time you've cloned their repository:
TypeError: ChatGLM4Tokenizer._pad() got an unexpected keyword argument 'padding_side'
As a workaround, you can make the following change to the `tokenization_chatglm.py` file included with the model:
Locate the following code:
def _pad(
self,
encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
max_length: Optional[int] = None,
padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
pad_to_multiple_of: Optional[int] = None,
return_attention_mask: Optional[bool] = None,
) -> dict:
Add a definition for the missing parameter, so that the method signature looks like this:
def _pad(
self,
encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
max_length: Optional[int] = None,
padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
pad_to_multiple_of: Optional[int] = None,
return_attention_mask: Optional[bool] = None,
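# Added parameter: newer versions of Transformers pass padding_side to _pad().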
padding_side: Optional[str] = "left",
) -> dict:
Locate this line:
assert self.padding_side == "left"
Comment it out, so it looks like the following:
#assert self.padding_side == "left"
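After making both changes, a quick way to confirm that batch padding no longer raises the `TypeError` is a sketch like the following (the model path is a placeholder, and a recent version of Transformers is assumed):

```python
# Verifies that the patched ChatGLM4Tokenizer pads a batch without the
# TypeError; the model path is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/path/to/glm-4-9b-chat", trust_remote_code=True
)
# Padding a batch is what causes recent versions of Transformers to call
# _pad() with the padding_side keyword.
batch = tokenizer(
    ["Hello", "A somewhat longer test prompt"], padding=True, return_tensors="pt"
)
print(batch["input_ids"].shape)
```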
- Conversation template name: `falcon`
- TII's Falcon LLM website
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Conversation template name: `falcon-mamba`
- TII's FalconMamba 7B page at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
Broken Hill includes a custom `falcon-mamba` conversation template for this model.
- Conversation template name: `gemma`
- Google's Gemma model family documentation
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: Yes
- Tool can generate adversarial content that defeats those restrictions: Yes
Broken Hill includes a custom `gemma` chat template because `fschat` seems to go back and forth between including one and not including one, and the most recent version we checked added a spurious extra `\n<end_of_turn>` to the end of the conversation.
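For reference, a minimal sketch (outside Broken Hill, assuming a locally downloaded Gemma chat model at a placeholder path) of the turn formatting the custom template is trying to match:

```python
# A sketch of Gemma's expected turn formatting, assuming a local copy of a
# Gemma chat model; the path below is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/gemma-7b-it")
messages = [{"role": "user", "content": "Please write a haiku about potatoes."}]
# add_generation_prompt=True appends the '<start_of_turn>model\n' header that
# precedes the model's reply.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# Expected output (roughly):
# <bos><start_of_turn>user
# Please write a haiku about potatoes.<end_of_turn>
# <start_of_turn>model
```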
Gemma is strongly conditioned to avoid discussing certain topics. We'll be adding a separate discussion about this.
- Conversation template name: `gemma`
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Depends on variation
- Will generally follow system prompt instructions that restrict information given to the user: Depends on variation
- Conversation template name: `gemma`
- Google's Gemma model family documentation
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: Yes
- Tool can generate adversarial content that defeats those restrictions: Yes
Broken Hill includes a custom `gemma` chat template because `fschat` seems to go back and forth between including one and not including one, and the most recent version we checked added a spurious extra `\n<end_of_turn>` to the end of the conversation.
Gemma 2 is strongly conditioned to avoid discussing certain topics. We'll be adding a separate discussion about this.
As with their GPT-J, GPT-Neo, and Pythia models, Eleuther AI only publishes GPT-NeoX as base models, and (as of this writing) all GPT-NeoX variations fine-tuned for chat have been published by unrelated third parties.
- Conversation template name: `gptneox`
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Eleuther AI's GPT-NeoX page at Hugging Face
- Conversation template name: `gptneox`
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Conversation template name: `gptneox`
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Depends on variation
- Will generally follow system prompt instructions that restrict information given to the user: Depends on variation
Some models based on GPT-NeoX do not include their own tokenizer, e.g. tiny-random-GPTNeoXForCausalLM-safetensors. If you receive a "Can't load tokenizer" error, try explicitly specifying the path to the GPT-NeoX 20B tokenizer, e.g. `--tokenizer LLMs/EleutherAI/gpt-neox-20b`. However, `tiny-random-GPTNeoXForCausalLM-safetensors` specifically will still cause Broken Hill to crash, so don't use that model unless your goal is to make Broken Hill crash.
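Outside of Broken Hill, the equivalent workaround is simply to load the derived model alongside the GPT-NeoX-20B tokenizer; a minimal sketch (the model path is a placeholder):

```python
# A sketch of pairing a GPT-NeoX-derived model that ships without a tokenizer
# with the GPT-NeoX-20B tokenizer; the model path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("/path/to/some-gptneox-derived-model")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```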
- Conversation template name: `guanaco`
- Guanaco-7B at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
Guanaco is a PEFT adapter trained on top of the original Llama. To use Tim Dettmers' canonical version, you'll need to specify the corresponding Llama model using the `--model` option and refer to Guanaco using the `--peft-adapter` option, e.g.:
--model /mnt/md0/Machine_Learning/LLMs/huggyllama/llama-7b \
--peft-adapter /mnt/md0/Machine_Learning/LLMs/timdettmers/guanaco-7b \
"TheBloke"'s guanaco-7B-HF version unifies Guanaco into a single model. Using the `--peft-adapter` option is unnecessary with that variation.
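For context, loading a PEFT adapter on top of its base model (roughly what the `--peft-adapter` option asks Broken Hill to do) looks something like the following sketch using the `peft` library directly; both paths are placeholders:

```python
# A sketch of loading the Guanaco PEFT adapter on top of its Llama base model
# outside of Broken Hill; both paths are placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("/path/to/huggyllama/llama-7b")
tokenizer = AutoTokenizer.from_pretrained("/path/to/huggyllama/llama-7b")
model = PeftModel.from_pretrained(base_model, "/path/to/timdettmers/guanaco-7b")
# merge_and_unload() folds the adapter weights into the base model, which is
# essentially what pre-merged releases such as guanaco-7B-HF already provide.
model = model.merge_and_unload()
```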
Even though Guanaco is a model layered on top of Llama, it uses its own conversation template. The format is similar to the `fschat` `zero_shot` template, but not identical, so Broken Hill includes a custom `guanaco` template.
- Conversation template name: `guanaco`
- Guanaco-3B-Uncensored-v2 at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: No
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
Broken Hill can successfully load the original Llama model, but we haven't been able to find any documentation on the specific format it expects conversation messages in. Using the templates that seem like they'd work (`llama2`, `zero_shot`, `guanaco`) produces output similar to what other models generate when given input in a conversation template that doesn't match the data they were trained on. In other words, it's unclear how useful the results are. If you have reliable information on the correct conversation format, please let us know.
- Conversation template name: `alpaca`
- Conversation template name: `felladrin-llama-chat`
- Conversation template name: `vikhr`
- Conversation template name: `llama2` or `llama-2` (see discussion below)
- Meta's Llama LLM family website
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
`fschat` includes a template for Llama-2 named `llama-2`, but it is slightly incorrect (for example, it does not add the leading `<s>` at the beginning of the conversation, and it adds a trailing `<s>` to the conversation). Fixing the template completely seems like it will require code changes to `fschat`. Broken Hill includes a modified version of the template named `llama2` that can be used as a workaround. The custom template has a different name in this case so that operators can easily choose whichever option they believe is the "least worst" for their purposes.
The custom template is also slightly incorrect, but seems to be "less wrong" regarding the parts of the output that are more likely to affect Broken Hill's results. Specifically, it adds the leading `<s>` at the beginning of the conversation when a system prompt is present, and it sets a default empty system message so that the system message block is included in all conversations. It still leaves a trailing `<s>` at the end of the conversation.
Until this issue is resolved, Broken Hill will report one or more warnings when the Llama-2 templates are used.
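For reference, a sketch of the single-turn Llama-2 chat layout that both templates are trying to reproduce (the system prompt and user message below are placeholders):

```python
# A sketch of the Llama-2 chat layout discussed above; the system prompt and
# user message are placeholders.
SYSTEM_PROMPT = "You are a helpful assistant."
USER_MESSAGE = "Please write a haiku about potatoes."

prompt = (
    "<s>[INST] <<SYS>>\n"
    f"{SYSTEM_PROMPT}\n"
    "<</SYS>>\n\n"
    f"{USER_MESSAGE} [/INST] "
)
# The model's reply follows '[/INST] ', and each completed turn is closed
# with '</s>'.
print(prompt)
```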
- Conversation template name: `llama2` or `llama-2` (see discussion above)
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Depends on variation
- Will generally follow system prompt instructions that restrict information given to the user: Depends on variation
- Conversation template name: `llama-3` (see instructions below)
- Meta's Llama LLM family website
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Conversation template name: `llama-3` (see instructions above)
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Depends on variation
- Will generally follow system prompt instructions that restrict information given to the user: Depends on variation
- Conversation template name: `minicpm`
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Will generally follow system prompt instructions that restrict information given to the user: TBD
Broken Hill includes a custom `minicpm` conversation template for first-party MiniCPM models.
- Conversation template name: `vikhr`
- Vikhr-tiny-0.1 model page at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Conversation template name: `mistral`
- Mistral AI homepage
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Conversation template name: `daredevil`
- Daredevil-7B model page at Hugging Face
- NeuralDaredevil-7B model page at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
These models are derived from Mistral (and other models), and their chat format is similar, but not identical. Broken Hill includes a custom `daredevil` conversation template to use with them.
- Conversation template name: `mistral`
- Intel Neural-Chat-v3-3 model page at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: No
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Conversation template name: `mistral`
- Mistral AI homepage
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Conversation template name: `mpt`
- Databricks' Mosaic Research website, which includes MPT
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
`fschat` includes a template for MPT, but for some reason there are two templates named `mpt-7b-chat` and `mpt-30b-chat`, which are completely different. Broken Hill includes a shortcut template definition for `mpt` that points to `mpt-7b-chat`.
One might assume that - because Broken Hill supports a model named MPT - it would also support the very similarly named model mpt-1b-redpajama-200b-dolly. That assumption would be incorrect. `mpt-1b-redpajama-200b-dolly` has its own custom template in Broken Hill (`mpt-redpajama`, because it uses a completely different conversation format), but a GCG attack cannot currently be performed against it, because the model's interface doesn't support generation using an `inputs_embeds` keyword (or equivalent):
"inputs_embeds is not implemented for MosaicGPT yet"
-- `mosaic_gpt.py:400`
Maybe someday someone will add that model to the Transformers library and give it the appropriate code.
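For context, a rough sketch of why `inputs_embeds` support matters for this kind of attack: the gradient step operates on a one-hot encoding of the prompt tokens, so the forward pass has to accept embeddings rather than token IDs. This is the general technique, not Broken Hill's exact implementation:

```python
# A rough sketch of the gradient step that requires inputs_embeds support.
# `model` is any Hugging Face causal LM and `input_ids` is a 1-D tensor of
# token IDs; this is illustrative, not Broken Hill's actual code.
import torch

def token_gradients(model, input_ids):
    embedding_matrix = model.get_input_embeddings().weight      # (vocab, dim)
    one_hot = torch.nn.functional.one_hot(
        input_ids, num_classes=embedding_matrix.shape[0]
    ).to(embedding_matrix.dtype)
    one_hot.requires_grad_()
    inputs_embeds = one_hot @ embedding_matrix                  # (seq, dim)
    # This is the call that MosaicGPT rejects.
    outputs = model(inputs_embeds=inputs_embeds.unsqueeze(0))
    # A real attack would compute a loss against the target string here;
    # summing the logits is just a stand-in to show the backward pass.
    outputs.logits.sum().backward()
    return one_hot.grad
```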
- Conversation template name: `zero_shot`
- Microsoft's Orca-2-7b model at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Conversation template name: `phi2`
- Microsoft's Phi-2 model at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: Yes
- Will generally follow system prompt instructions that restrict information given to the user: Yes
- Tool can generate adversarial content that defeats those restrictions: Yes
Broken Hill includes a custom `phi2` chat template because `fschat` does not currently include one.
- Conversation template name: `phi3`
- Microsoft's Phi-3 model collection at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: Yes
- Will generally follow system prompt instructions that restrict information given to the user: Yes
- Tool can generate adversarial content that defeats those restrictions: Yes
Broken Hill includes a custom `phi3` chat template because `fschat` does not currently include one.
The `Phi-3-small-128k-instruct` version of Phi-3 requires `--trust-remote-code`, even though other versions of the model (such as `Phi-3-medium-128k-instruct`) no longer require it. Additionally, that version will cause Broken Hill to crash when processed on the CPU instead of a CUDA device, with the following error:
Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
We're researching a workaround for this.
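For reference, a sketch of loading that version outside Broken Hill (assuming a CUDA device and that the model's additional dependencies, such as its custom attention kernels, are installed):

```python
# A sketch of loading Phi-3-small-128k-instruct; it requires remote code, and
# (as noted above) processing it on the CPU currently fails with a Triton error.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-small-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="cuda",
)
```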
- phi3
- phi3:3.8b-mini-128k-instruct-q8_0
- phi3:3.8b-mini-128k-instruct-q2_K
- phi3:3.8b-mini-128k-instruct-q4_0
- phi3:3.8b-mini-128k-instruct-fp16
As with their GPT-J, GPT-Neo, and GPT-NeoX models, Eleuther AI only publishes Pythia as base models, and (as of this writing) all Pythia variations fine-tuned for chat have been published by unrelated third parties.
- Conversation template name: ``
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Pythia GitHub repository
- pythia-14m
- pythia-70m
- pythia-70m-deduped
- pythia-160m
- pythia-410m
- pythia-1b
- pythia-1b-deduped
- pythia-1.4b
- pythia-2.8b
- Conversation template name: `oasst_pythia`
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Depends on variation
- Will generally follow system prompt instructions that restrict information given to the user: Depends on variation
- huge-roadrunner-pythia-1b-deduped-oasst
- Does not appear to be trained to avoid discussing topics
- oasst_pythia-70m-deduped_webgpt
- Does not appear to be trained to avoid discussing topics
- Conversation template name: `qwen`
- Alibaba's Qwen model family page at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: Yes
- Tool can generate adversarial content that defeats those restrictions: Yes
`fschat` includes a template for Qwen and Qwen 2, but for some reason it's named `qwen-7b-chat` specifically, and it specifies the use of an `<|endoftext|>` stop string that the models' `apply_chat_template` function does not add. As a result, Broken Hill includes a custom `qwen` template definition.
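A quick way to check that claim, assuming a locally downloaded Qwen 2 chat model (the path below is a placeholder):

```python
# Confirms that apply_chat_template does not append <|endoftext|>; the model
# path is a placeholder for a local Qwen/Qwen 2 chat model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/Qwen2-7B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello."},
]
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
# ChatML turns end with <|im_end|>; no <|endoftext|> stop string is added.
print(rendered.endswith("<|endoftext|>"))  # expected: False
```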
- Conversation template name: `qwen`
- Conversation template name: `qwen2`
- Alibaba's Qwen model family page at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: Yes
- Tool can generate adversarial content that defeats those restrictions: Yes
`fschat` includes a template for Qwen and Qwen 2, but for some reason it's named `qwen-7b-chat` specifically, and it specifies the use of an `<|endoftext|>` stop string that the models' `apply_chat_template` function does not add. As a result, Broken Hill includes a custom `qwen` template definition.
- Conversation template name: `redpajama-incite`
- RedPajama-INCITE-7B-Chat model page at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Conversation template name: `smollm`
- Hugging Face blog post introducing SmolLM
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
Broken Hill includes a custom `smollm` chat template because `fschat` does not currently include one.
- Conversation template name: `solar`
- Upstage's SOLAR-10.7B-Instruct model page at Hugging Face
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
`fschat` includes a `solar` template, but its output is missing the `### System:` header for the system prompt, so Broken Hill includes a custom `solar` chat template with that issue corrected.
- Conversation template name: `stablelm`
- Stability AI StableLM family GitHub repository
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Depends
- Will generally follow system prompt instructions that restrict information given to the user: TBD
The smaller versions of this model family don't seem to have any built-in restrictions regarding controversial topics. However, the larger versions (e.g. `stablelm-tuned-alpha-7b`) do exhibit restrictions.
- Conversation template name: `stablelm2`
- Stability AI StableLM family GitHub repository
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Depends
- Will generally follow system prompt instructions that restrict information given to the user: TBD
As discussed in the documentation for stablelm-2-1_6b-chat and the documentation for stablelm-2-zephyr-1_6b, the smaller versions of this model family don't have any built-in restrictions regarding controversial topics.
However, the larger versions (e.g. `stablelm-2-12b-chat`) do exhibit restrictions.
- Conversation template name: `TinyLlama`
- TinyLlama GitHub repository
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: Yes
- Tool can generate adversarial content that defeats those restrictions: Yes
Vicuna is based on Llama, but has so many sub-variations that it's been given its own section.
Despite `fschat`'s overly-specific `vicuna_v1.1` template name, versions up to 1.5 have been successfully tested in Broken Hill.
- Conversation template name: `vicuna_v1.1`
- The Large Model Systems Organization's Vicuna web page
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
- Conversation template name: `stable-vicuna`
- CarperAI's original version of StableVicuna
- "TheBloke"'s pre-merged version of StableVicuna
- Trained to avoid discussing a variety of potentially-dangerous and controversial topics: Yes
- Tool can generate adversarial content that defeats those restrictions: TBD
- Will generally follow system prompt instructions that restrict information given to the user: TBD
- Tool can generate adversarial content that defeats those restrictions: TBD
StableVicuna was originally released by CarperAI as a set of weight differences to be applied to the Llama model. We have only tested it using "TheBloke"'s pre-merged version of the model.
These model families can be used in the tool, but publicly-available versions are not trained to handle chat-type interactions. Broken Hill can handle them in case someone runs across a derived model that's been trained for chat-like interaction. If you encounter a derived model, you'll likely need to add a custom chat template to the code to generate useful results.
BART currently requires `--max-new-tokens-final 512` (or lower) to be manually specified.
Currently, only the `blenderbot_small-90M` version of BlenderBot works in Broken Hill. We are unsure why other versions of the model cause it to crash, but plan to investigate the issue.
BlenderBot currently requires `--max-new-tokens 32` (or lower) and `--max-new-tokens-final 32` (or lower).
GPT-2 currently requires `--max-new-tokens-final 512` (or lower) to be manually specified.
- (OpenAI) gpt2 (not the same model as the next entry)
- (openai-community) gpt2 (not the same model as the previous entry)
- gpt2-medium
- gpt2-large
- gpt2-xl
- tiny-gpt2
- Conversation template name: `zero_shot`
- State Spaces' "Transformers-compatible Mamba" page at Hugging Face
- `--suppress-attention-mask` is required.
- You must use the "-hf" ("Transformers-compatible") variations of Mamba (see the sketch after this list).
- A better conversation template to use instead of `zero_shot` would likely improve results.
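A minimal sketch of using one of the "-hf" variations directly with Transformers (a recent Transformers release with Mamba support is assumed); note that no attention mask is passed, in the spirit of the `--suppress-attention-mask` option above:

```python
# A sketch of running a Transformers-compatible Mamba variant; no attention
# mask is passed, matching the --suppress-attention-mask behavior noted above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=16)
print(tokenizer.decode(output[0]))
```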
OPT currently requires `--max-new-tokens-final 512` (or lower) to be explicitly specified.
Broken Hill can work with Pegasus again as of version 0.32, but don't expect useful results unless you're working with a trained derivative.
Pegasus requires `--max-new-tokens-final 512` (or lower).
Pegasus-X models are not currently supported.
No model based on T5 can currently be tested with Broken Hill, because Broken Hill does not yet support its architecture.