Code for the paper ["Precise In-Parameter Concept Erasure in Large Language Models"](https://arxiv.org/abs/2505.22586).


The codebase is still being finalized, but for now I uploaded the main files used in the paper. These are:

- `evals.py`: includes code for running all of our evaluations, as well as wrapper classes that make the API access to `transformers` and `transformer_lens` models identical.
- `feature_finder.py`: used to find the SAE features that will be used to unlearn a concept. There is currently no demo of this (planning on adding one soon), but all of the code is there.
- `editor.py`: includes the code used to perform the unlearning. The most important function there is `unlearn_concept`, which receives a `Concept` object containing the relevant features and hyperparameters for unlearning.
We also have a demo for erasing the Harry Potter concept (`erasing_harry_potter.ipynb`), which shows how to use the framework. Note that the editing code currently only works with the `transformer_lens` library.
In `data/cvs.json` you can find all of the data used in our evaluations.
Please cite as:

```bibtex
@misc{gurarieh2025preciseinparameterconcepterasure,
      title={Precise In-Parameter Concept Erasure in Large Language Models},
      author={Yoav Gur-Arieh and Clara Suslik and Yihuai Hong and Fazl Barez and Mor Geva},
      year={2025},
      eprint={2505.22586},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.22586},
}
```
Feel free to reach out if you have any thoughts, questions, or suggestions :)