Skip to content

AUGMXNT/deccp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

deccp

Evaling and unaligning Chinese LLM censorship

Summary

This current code is a PoC for un-censoring Qwen 2 Instruct models. These prompts were hand-checked to see if they caused refusals specifically with Qwen/Qwen2-7B-Instruct and you'd need to apply this process to any other models yourself.

Everything is Apache 2.0 licensed:

I've posted a full analysis/writeup here: https://huggingface.co/blog/leonardlin/chinese-llm-censorship-analysis

This repo includes the adapted abliteration (single vector refusal removal). For more about this, see:

Those with an interest in vector steering may want to take a look at Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages - this seems to be a technique has has been popular for a few months in Japan as you can get very good language transfer results with very low compute requirements.

Make Your Own

This is a working repo and my understanding of torch, einops, and uh, linear algebra is patchy at best, and the code is mostly cut-and-pasted from smarter people (with some rock-banging from my end), but it does seem to work.

I've renamed the scripts for the actual workflow from 01-04, which should get you to modified weights on huggingface with only a few variable changes (for Qwen2 models - otherwise you're going to need to look at your architecture's layer setup), so feel free to fork this and give it a spin if you want (but no, I won't be supporting this codebase at all).

You should also modify the "harmful" and "harmless" text files to taste. I don't love the nomenclature, but I was also too lazy to change it so ¯_(ツ)_/¯

Future Work

This was more of a one-off curiousity so I probably won't be working on it more, however if anyone were to continue work:

  • Create a single potentially-censored list and do per-model checks on what's actually censored or not (EN+CN)
  • For these prompts, create gold-standard responses from GPT4, Claude3 Opus, etc.
  • Chinese Model Eval Framework
    • Use LLM-as-a-Judge to first categorize if the responses to the censored list are refusals or not
    • Use LLM-as-a-Judge to classify/analyze non-censored responses vs gold-standard responses to characterize misinformation
  • Abliteration should be improved (eg, integrate optimizations from https://github.com/FailSpy/abliterator ) for layer selection (combined w/ evals)
  • KTO or some other direct reward/contrastive RL method would probably be best to try to efficiently re-align some of the problematic answers (multiple good answers to try to unlearn the default bad ones)

I found one other review of Chinese LLM alignment from 2024-03 that takes a different approach to testing (not trying to find refusals, but probing for political views and biases): https://www.chinatalk.media/p/censorships-impact-on-chinas-chatbots

Update

Someone pointed me to TC260-003. Here's some more info:

Following the release of TC260-003 "Basic Requirements for the Security of Generative Artificial Intelligence Services" (TC260 doc)by China’s National Cybersecurity Standardization Technical Committee (TC260) on March 4th, the committee has now issued another draft national standard titled "Cybersecurity Technology - Basic Requirements for the Security of Generative Artificial Intelligence Services." This new standard is open for public comments until July 22nd.

TC260-003: Basic Requirements for the Security of Generative Artificial Intelligence Services

Professional English Translation: https://cset.georgetown.edu/wp-content/uploads/t0588_generative_AI_safety_EN.pdf

The following Chinese standard for generative AI establishes very specific oversight processes that Chinese AI companies must adopt in regard to their model training data, model-generated content, and more. The standard names more than 30 specific safety risks, some of which—algorithmic bias, disclosure of personally identifiable information, copyright infringement—are widely recognized internationally. Others, such as guidelines on how to answer questions about China’s political system and Chinese history, are specific to the tightly censored Chinese internet. One notable addition to this document, relative to a preliminary draft released in October 2023, is a clause requiring a supply chain security assessment of Chinese generative AI models’ underlying hardware and software.

See also:

About

Evaling and unaligning Chinese LLM censorship

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages