LLM Warden

A simple jailbreak detection tool for safeguarding LLMs. Available as a fine-tuned model on HuggingFace at jackhhao/jailbreak-classifier.

Description

Jailbreaking is a technique that involves creating prompts to bypass standard safety/moderation controls for LLMs. If succesful, it can lead to dangerous downstream attacks and unrestricted output. This tool serves as a way to proactively detect and defend against such attacks.

Getting Started

Dependencies

Python 3

Installation

To install, run pip install -r requirements.txt.

Usage

There are three options available to start using this model:

Use the HuggingFace inference pipeline
Use the Cohere API
Train and run the model locally

Using the inference pipeline

Simply run the following snippet:

from transformers import pipeline

pipe = pipeline("text-classification", model="jackhhao/jailbreak-classifier")

print(pipe("is this a jailbreak?"))

Using Cohere

Obtain a trial API key from the Cohere dashboard.
Create a .env file (example one provided) with the API key.
Go to cohere_client.py and replace the classifier input with your own examples.

Running locally

Run train.py (uses the data under data/).
Run classify.py, replacing the classifier input with your own examples if desired.

Roadmap

Create CLI tool for easy input + prediction
Build Streamlit app to classify prompts via UI (& switch between models)
Add moderation score / toxicity as additional model feature

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

Jack Hao - https://www.linkedin.com/in/jackhhao

Acknowledgments

Thanks to the Cohere team for providing such an easy-to-use & powerful API!

And shout-out to the HuggingFace team for hosting a great platform for open-source datasets & models :)

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
datasets		datasets
src		src
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Warden

Description

Getting Started

Dependencies

Installation

Usage

Using the inference pipeline

Using Cohere

Running locally

Roadmap

License

Contact

Acknowledgments

About

Releases

Packages

Languages

License

jackhhao/llm-warden

Folders and files

Latest commit

History

Repository files navigation

LLM Warden

Description

Getting Started

Dependencies

Installation

Usage

Using the inference pipeline

Using Cohere

Running locally

Roadmap

License

Contact

Acknowledgments

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages