This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable".


Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable


[📕 Paper] [🤗 DirectRefusal] [🤗 SafeChain Aligned model] [🤗 DirectRefusal Aligned model]

Aligned LRM production pipeline

A two-stage sequential pipeline is considered.

  1. Reasoning training. At this stage, we supervised fine-tune (SFT) the model on a reasoning dataset (e.g., s1k) to produce the Large Reasoning Model (LRM).
  2. Safety alignment. At this stage, we SFT the model on a safety dataset (e.g., SafeChain, DirectRefusal) to safety-align the LRM. A minimal sketch of the two-stage pipeline is shown after this list.
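
The repo's actual training code is built on simplescaling s1 (see Acknowledgment below). The snippet below is only a minimal sketch of the two-stage pipeline using TRL's SFTTrainer; the base model, split names, hyperparameters, and the DirectRefusal repo id are illustrative assumptions, not the paper's exact settings.

# Minimal sketch of the two-stage pipeline (illustrative assumptions; not the repo's scripts).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

BASE_MODEL = "Qwen/Qwen2.5-32B-Instruct"  # assumed base model, for illustration only

# Stage 1: reasoning training -> produces the LRM.
reasoning_data = load_dataset("TianshengHuang/s1k", split="train")  # split name is an assumption
stage1 = SFTTrainer(
    model=BASE_MODEL,
    train_dataset=reasoning_data,
    args=SFTConfig(output_dir="ckpt/lrm", num_train_epochs=5),
)
stage1.train()
stage1.save_model("ckpt/lrm")  # the LRM

# Stage 2: safety alignment -> safety-aligned LRM (which pays the Safety Tax).
safety_data = load_dataset("TianshengHuang/DirectRefusal", split="train")  # hypothetical repo id
stage2 = SFTTrainer(
    model="ckpt/lrm",
    train_dataset=safety_data,
    args=SFTConfig(output_dir="ckpt/lrm_aligned", num_train_epochs=5),
)
stage2.train()
stage2.save_model("ckpt/lrm_aligned")  # the safety-aligned LRM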

Safety Tax

We identify an important trade-off at the safety alignment stage. In particular, safety alignment can restore the safety of the LRM (a lower harmful score). However, this comes at the cost of degrading the model's reasoning ability (a lower reasoning accuracy). You can't get a safety-aligned model and a good reasoning model at the same time! We name this phenomenon the Safety Tax.

Preparation

  1. Install the required packages.
conda create --name s1k python=3.12.0
source activate s1k
pip install -r requirements.txt
  2. Install the evaluation benchmark lm_eval.
cd eval/lm-evaluation-harness
pip install -e .[math,vllm]
  3. All the datasets have already been processed and uploaded to Hugging Face, so there is no need to prepare them yourself. We also provide a Python script for producing the datasets and pushing them to your own Hugging Face repo; check out the directory /data.

  4. Some models (e.g., Llama) are gated and require permission to access. If you are told you don't have permission, request access on the model's Hugging Face page. Once access is granted, you also need to enter your token in the file huggingface_token.txt (see the sketch after this list).

  5. Please enter your OpenAI key and Hugging Face key in the scripts in script/safety_alignment. The lm-eval framework needs the GPT-4o API to evaluate the results.
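
Before launching the jobs below, you can sanity-check your tokens and access with a short Python snippet. This is only a sketch, not a repo script: the huggingface_token.txt file name and the dataset/model ids come from this README, while the OPENAI_API_KEY variable name is the usual lm-eval convention and should be treated as an assumption.

# Sanity-check Hugging Face access and the dataset before submitting Slurm jobs (a sketch).
import os
from datasets import load_dataset
from huggingface_hub import login
from transformers import AutoTokenizer

# Log in with the token the repo expects in huggingface_token.txt.
with open("huggingface_token.txt") as f:
    login(token=f.read().strip())

# lm-eval's GPT-4o judging typically reads the key from this environment variable;
# the repo's scripts in script/safety_alignment expect you to paste the key there instead.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # placeholder, not a real key

# Load one processed dataset and a model tokenizer to confirm access
# (gated models such as Llama will only load after your request is approved).
data = load_dataset("TianshengHuang/s1k", split="train")  # split name is an assumption
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
print(len(data), type(tok).__name__)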

Command to reproduce results

All the scripts are available in script/safety_alignment. We recommend using Slurm to reproduce the results, as the log files will be automatically organized into the script directory (if you don't use Slurm, just replace sbatch with bash in our examples).

The following commands reproduce all the results. Note that we use 8xH200 GPUs; if you don't have 8 GPUs, you may need to change the GPU count in the scripts.

s1.1-32B

sbatch sft.sh TianshengHuang/s1k 
sbatch sft_cot.sh TianshengHuang/s1k 
sbatch original.sh TianshengHuang/s1k 

DeepSeek32B

sbatch sft.sh deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
sbatch sft_cot.sh deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
sbatch original.sh deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

LIMO

sbatch sft.sh GAIR/LIMO
sbatch sft_cot.sh GAIR/LIMO
sbatch original.sh GAIR/LIMO

Epoch experiments (for producing the thumbnail figure)

sbatch sft.sh TianshengHuang/s1k 1
sbatch sft.sh TianshengHuang/s1k 2
sbatch sft.sh TianshengHuang/s1k 3
sbatch sft.sh TianshengHuang/s1k 4

Acknowledgment

The repo is built upon the code base of simplescaling s1. Special thanks to the simplescaling team!

Citation

@misc{huang2025safetytax,
      title={Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable}, 
      author={Tiansheng Huang and Sihao Hu and Fatih Ilhan and Selim Furkan Tekin and Zachary Yahn and Yichang Xu and Ling Liu},
      year={2025},
      eprint={2503.00555},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2503.00555}, 
}
