You can find the scripts for running LLMs with human-crafted safety prompts and for training continuous safety prompts in the scripts folder. Note that for local runs, you should set the environment variable HF_MODELS to the folder where the LLMs are saved.
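
For example, a local run might look like the following sketch (the path is a placeholder for wherever you keep your model checkpoints):

export HF_MODELS=/path/to/saved/llms   # placeholder: folder containing the downloaded LLM checkpoints
bash scripts/run_mistral-v1.sh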

If you find this repository useful or our work is related to your research, please kindly cite it:

@article{llm-safeguard,
  title={Prompt-Driven LLM Safeguarding via Directed Representation Optimization},
  author={Chujie Zheng and Fan Yin and Hao Zhou and Fandong Meng and Jie Zhou and Kai-Wei Chang and Minlie Huang and Nanyun Peng},
  journal={arXiv preprint arXiv:2401.18018},
  year={2024}
}

If you find the chat templates used in this project useful, please also kindly cite:

@misc{zheng-2023-chat-templates,
  author = {Zheng, Chujie},
  title = {Chat Templates for HuggingFace Large Language Models},
  year = {2023},
  howpublished = {\url{https://github.com/chujiezheng/chat_templates}}
}

How to Run Code

To get generation and evaluation results with human-crafted safety prompts, run:

bash scripts/run_mistral-v1.sh
bash scripts/run_mistral-v1_harmless.sh
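
For example, to run both scripts back to back and keep the console output for later inspection (the logs directory is only a convention here, not something the scripts require):

mkdir -p logs
bash scripts/run_mistral-v1.sh 2>&1 | tee logs/run_mistral-v1.log
bash scripts/run_mistral-v1_harmless.sh 2>&1 | tee logs/run_mistral-v1_harmless.log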

To train continuous safety prompts and then get generation and evaluation results, run:

bash scripts/forward.sh
bash scripts/forward_harmless.sh
bash scripts/train_mistral-v1.sh
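
A minimal sketch that chains the three steps in the order listed above, assuming they are run from the repository root:

bash scripts/forward.sh && \
bash scripts/forward_harmless.sh && \
bash scripts/train_mistral-v1.sh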

You may uncomment the unlikelihood line to reproduce the vanilla Prompt Tuning baseline.
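
To locate that line, you can search the repository for the keyword, for example:

grep -rn "unlikelihood" .   # prints file names and line numbers of the (commented) unlikelihood option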

To visualize the hidden states with estimated boundaries, run:

bash scripts/compare_gather.sh

Experimental Results

Our experimental results are released in another data repo: https://github.com/chujiezheng/LLM-Safeguard_data

Acknowledgement

Our codebase builds upon the following repository: https://github.com/Princeton-SysML/Jailbreak_LLM