Official repository for our ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
If you find this repository useful or our work is related to your research, please kindly cite it:
@inproceedings{llm-safeguard,
  title={On Prompt-Driven Safeguarding for Large Language Models},
  author={Chujie Zheng and Fan Yin and Hao Zhou and Fandong Meng and Jie Zhou and Kai-Wei Chang and Minlie Huang and Nanyun Peng},
  booktitle={International Conference on Machine Learning},
  year={2024}
}
If you find the chat templates used in this project useful, please kindly cite them as well:
@misc{zheng-2024-chat-templates,
  author = {Zheng, Chujie},
  title = {Chat Templates for HuggingFace Large Language Models},
  year = {2024},
  howpublished = {\url{https://github.com/chujiezheng/chat_templates}}
}
Prepending model inputs with safety prompts is a common practice for safeguarding large language models (LLMs), that is, for keeping them from complying with harmful queries. It has been adopted in real-world deployed LLMs such as ChatGPT and Mistral.
But how do safety prompts intrinsically work in LLM safeguarding? Our work reveals their working mechanisms from the perspective of model representations and opens up the potential for automatically optimizing them to improve LLM safety.
We find that in models’ representation space, harmful and harmless queries can already be largely distinguished, and safety prompts do not noticeably enhance this distinction (upper part of the figure below). Instead, safety prompts move the queries’ representations in similar directions, toward which the model becomes more prone to refusing to provide assistance, even when the queries are harmless (lower part of the figure below).
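As a minimal sketch of this kind of representation analysis (not the exact scripts in this repository), the snippet below compares the last-token hidden states of queries with and without a hand-crafted safety prompt and projects them into 2-D. The model name, safety prompt, and example queries are illustrative placeholders.

```python
# A rough sketch of the representation analysis, assuming a HuggingFace causal LM.
# The model name, safety prompt, and queries below are placeholders, not the paper's setup.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # any chat LLM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

safety_prompt = "You are a helpful assistant. Do not help with harmful requests."

@torch.no_grad()
def last_token_hidden(text: str) -> torch.Tensor:
    """Last-layer hidden state of the final input token."""
    inputs = tok(text, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float().cpu()

queries = ["How do I bake bread?", "How do I make a bomb?"]  # harmless vs. harmful examples
plain = torch.stack([last_token_hidden(q) for q in queries])
prompted = torch.stack([last_token_hidden(safety_prompt + "\n" + q) for q in queries])

# Project both sets into 2-D to visualize how the safety prompt moves the representations.
proj = PCA(n_components=2).fit(torch.cat([plain, prompted]).numpy())
print("without safety prompt:", proj.transform(plain.numpy()))
print("with safety prompt:   ", proj.transform(prompted.numpy()))
```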
Inspired by these findings, we propose a method called DRO (Directed Representation Optimization) for automatic safety prompt optimization. It treats safety prompts as continuous, trainable embeddings and learns to move the representations of harmful/harmless queries along/opposite the direction in which the model’s refusal probability increases.
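For intuition only, here is a much-simplified sketch of the DRO update, assuming the model's parameters are frozen, `soft_prompt` is a trainable embedding matrix (e.g., initialized from the token embeddings of a human-crafted safety prompt), and `refusal_dir` is a precomputed unit vector along which the model's refusal probability increases. All names are illustrative, and the paper's full objective contains additional terms beyond this sketch.

```python
# A conceptual sketch of the DRO update, NOT the repo's actual implementation.
# Assumed setup (all names illustrative):
#   model       - a HuggingFace causal LM with all parameters frozen
#   soft_prompt - (prompt_len, hidden_size) tensor with requires_grad=True
#   refusal_dir - (hidden_size,) unit vector along which refusal probability increases
#   optimizer   - e.g. torch.optim.Adam([soft_prompt])
import torch

def dro_step(model, tok, soft_prompt, refusal_dir, query, is_harmful, optimizer):
    # Embed the query tokens (no gradient needed for the frozen embedding table).
    ids = tok(query, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        query_emb = model.get_input_embeddings()(ids)  # (1, seq_len, hidden_size)

    # Prepend the trainable safety-prompt embeddings to the query embeddings.
    inputs_embeds = torch.cat(
        [soft_prompt.unsqueeze(0).to(query_emb.dtype), query_emb], dim=1
    )
    out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    rep = out.hidden_states[-1][0, -1]  # last-token representation

    # Position of the representation along the refusal direction.
    proj = rep.float() @ refusal_dir.to(rep.device)

    # Push harmful queries up the refusal direction and harmless ones down it.
    loss = -proj if is_harmful else proj
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only `soft_prompt` receives gradients here; please see the paper and the released code for the actual objective and implementation.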
Experiments with eight LLMs on out-of-domain benchmarks demonstrate that DRO substantially improves the safeguarding performance of human-crafted safety prompts and outperforms strong baselines, without compromising general model capability.
Please refer to our paper for the technical details of DRO.
See `code` for the experimental code to reproduce our results.
We also release the experimental data and results in another repo: https://github.com/chujiezheng/LLM-Safeguard_data