deccp

Evaling and unaligning Chinese LLM censorship

Summary

This current code is a PoC for un-censoring Qwen 2 Instruct models. These prompts were hand-checked to see if they caused refusals specifically with Qwen/Qwen2-7B-Instruct and you'd need to apply this process to any other models yourself.

Everything is Apache 2.0 licensed:

This code is primarily based off of https://github.com/Sumandora/remove-refusals-with-transformers
LLM-assisted, hand-tested refusal dataset: https://huggingface.co/datasets/augmxnt/deccp
Abliterated model: https://huggingface.co/augmxnt/Qwen2-7B-Instruct-deccp

I've posted a full analysis/writeup here: https://huggingface.co/blog/leonardlin/chinese-llm-censorship-analysis

This repo includes the adapted abliteration (single vector refusal removal). For more about this, see:

Original introduction of the technique by Andi Arditi, et al: Refusal in LLMs is mediated by a single direction
This writeup by FailSpy, the coiner of the term "abliterated" to refer to the orthogonalized-refusal modification: Abliterated-v3: Details about the methodology, FAQ, source code; New Phi-3-mini-128k and Phi-3-vision-128k, re-abliterated Llama-3-70B-Instruct, and new "Geminified" model.
mlabonne's accessible writeup: Uncensor any LLM with abliteration

Those with an interest in vector steering may want to take a look at Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages - this seems to be a technique has has been popular for a few months in Japan as you can get very good language transfer results with very low compute requirements.

Make Your Own

This is a working repo and my understanding of torch, einops, and uh, linear algebra is patchy at best, and the code is mostly cut-and-pasted from smarter people (with some rock-banging from my end), but it does seem to work.

I've renamed the scripts for the actual workflow from 01-04, which should get you to modified weights on huggingface with only a few variable changes (for Qwen2 models - otherwise you're going to need to look at your architecture's layer setup), so feel free to fork this and give it a spin if you want (but no, I won't be supporting this codebase at all).

You should also modify the "harmful" and "harmless" text files to taste. I don't love the nomenclature, but I was also too lazy to change it so ¯_(ツ)_/¯

Future Work

This was more of a one-off curiousity so I probably won't be working on it more, however if anyone were to continue work:

Create a single potentially-censored list and do per-model checks on what's actually censored or not (EN+CN)
For these prompts, create gold-standard responses from GPT4, Claude3 Opus, etc.
Chinese Model Eval Framework
- Use LLM-as-a-Judge to first categorize if the responses to the censored list are refusals or not
- Use LLM-as-a-Judge to classify/analyze non-censored responses vs gold-standard responses to characterize misinformation
Abliteration should be improved (eg, integrate optimizations from https://github.com/FailSpy/abliterator ) for layer selection (combined w/ evals)
KTO or some other direct reward/contrastive RL method would probably be best to try to efficiently re-align some of the problematic answers (multiple good answers to try to unlearn the default bad ones)

I found one other review of Chinese LLM alignment from 2024-03 that takes a different approach to testing (not trying to find refusals, but probing for political views and biases): https://www.chinatalk.media/p/censorships-impact-on-chinas-chatbots

Update

Someone pointed me to TC260-003. Here's some more info:

Following the release of TC260-003 "Basic Requirements for the Security of Generative Artificial Intelligence Services" （TC260 doc）by China’s National Cybersecurity Standardization Technical Committee (TC260) on March 4th, the committee has now issued another draft national standard titled "Cybersecurity Technology - Basic Requirements for the Security of Generative Artificial Intelligence Services." This new standard is open for public comments until July 22nd.

TC260-003: Basic Requirements for the Security of Generative Artificial Intelligence Services

Professional English Translation: https://cset.georgetown.edu/wp-content/uploads/t0588_generative_AI_safety_EN.pdf

The following Chinese standard for generative AI establishes very specific oversight processes that Chinese AI companies must adopt in regard to their model training data, model-generated content, and more. The standard names more than 30 specific safety risks, some of which—algorithmic bias, disclosure of personally identifiable information, copyright infringement—are widely recognized internationally. Others, such as guidelines on how to answer questions about China’s political system and Chinese history, are specific to the tightly censored Chinese internet. One notable addition to this document, relative to a preliminary draft released in October 2023, is a clause requiring a supply chain security assessment of Chinese generative AI models’ underlying hardware and software.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
data		data
deccp_dataset		deccp_dataset
.gitignore		.gitignore
01-compute_refusal_dir.py		01-compute_refusal_dir.py
02-test-vector-results.py		02-test-vector-results.py
03-save-model-weights.py		03-save-model-weights.py
04-upload-model-to-hf.py		04-upload-model-to-hf.py
LICENSE		LICENSE
Qwen_Qwen2-7B-Instruct_refusal_dir.pt		Qwen_Qwen2-7B-Instruct_refusal_dir.pt
README.md		README.md
abliterator.py		abliterator.py
harmful.txt		harmful.txt
harmless.txt		harmless.txt
inference.py		inference.py
multilayer-compute.py		multilayer-compute.py
multilayer-inference.py		multilayer-inference.py
output-deccp.txt		output-deccp.txt
output-original.txt		output-original.txt
qwen2-abliterate.py		qwen2-abliterate.py
test-model.py		test-model.py
upload-dataset.py		upload-dataset.py
writeup.md		writeup.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

deccp

Summary

Make Your Own

Future Work

Update

About

Releases

Packages

Languages

License

AUGMXNT/deccp

Folders and files

Latest commit

History

Repository files navigation

deccp

Summary

Make Your Own

Future Work

Update

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages