A simple jailbreak detection tool for safeguarding LLMs. Available as a fine-tuned model on HuggingFace at jackhhao/jailbreak-classifier.
Jailbreaking is a technique that involves creating prompts to bypass standard safety/moderation controls for LLMs. If succesful, it can lead to dangerous downstream attacks and unrestricted output. This tool serves as a way to proactively detect and defend against such attacks.
- Python 3
To install, run pip install -r requirements.txt
.
There are three options available to start using this model:
- Use the HuggingFace inference pipeline
- Use the Cohere API
- Train and run the model locally
Simply run the following snippet:
from transformers import pipeline
pipe = pipeline("text-classification", model="jackhhao/jailbreak-classifier")
print(pipe("is this a jailbreak?"))
- Obtain a trial API key from the Cohere dashboard.
- Create a
.env
file (example one provided) with the API key. - Go to
cohere_client.py
and replace the classifier input with your own examples.
- Run
train.py
(uses the data underdata/
). - Run
classify.py
, replacing the classifier input with your own examples if desired.
- Create CLI tool for easy input + prediction
- Build Streamlit app to classify prompts via UI (& switch between models)
- Add moderation score / toxicity as additional model feature
This project is licensed under the MIT License - see the LICENSE file for details.
Jack Hao - https://www.linkedin.com/in/jackhhao
Thanks to the Cohere team for providing such an easy-to-use & powerful API!
And shout-out to the HuggingFace team for hosting a great platform for open-source datasets & models :)