Text anonymization is a Python library for anonymizing sensitive information in text data, with a focus on Swiss French banking data.
It is based on presidio for PII detection and camembert for NER.
You must have conda and git installed.
Create a conda environment with Python 3.10 and activate it:
conda create -n my_env python=3.10
conda activate my_env
Clone the project and install it:
git clone https://github.com/idiap/anonymization.git
cd anonymization
pip install -e . # install in editable mode
configure # Download models
pytest -sv tests # (optional) run the test suite to make sure everything is working as expected
Anonymize your text (.txt), CSV (.csv) or Excel (.xlsx) file by calling:
anonymize -f /path/to/my_file.xlsx
This generates an anonymized file at /path/to/my_file_anonymized.xlsx
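When several files need processing, the command above can also be driven from Python. The following is a minimal sketch, not part of the package: the helper names anonymized_path and build_command are hypothetical, and the output naming simply mirrors the _anonymized suffix described above. Only the -f and -c flags shown in this README are used.

```python
from pathlib import Path
from typing import List, Optional


def anonymized_path(source: Path) -> Path:
    """Expected output path: same folder, '_anonymized' inserted before the extension."""
    return source.with_name(f"{source.stem}_anonymized{source.suffix}")


def build_command(source: Path, config: Optional[Path] = None) -> List[str]:
    """Assemble the argument list for the anonymize command line tool."""
    command = ["anonymize", "-f", str(source)]
    if config is not None:
        # Optional custom configuration file, passed with -c.
        command += ["-c", str(config)]
    return command
```

With the package installed, each argument list can be handed to subprocess.run(build_command(path), check=True), and anonymized_path(path) tells you where to look for the result.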
You can use the test example:
anonymize -f ./tests/example.txt -c ./tests/config.json
You can pass a customized configuration to run your anonymization.
To generate a default configuration file (used by default when running anonymize):
gen_config
This creates a .json file with the following fields:
Keyword | Description |
---|---|
entities | List of entities you want to anonymize. By default, all available entities are listed. For example: "Mon nom est Alfred, voici mon numéro: 079563684" results in "Mon nom est <ANONYM_PER>, voici mon numéro <ANONYM_PHONE>" |
flag_only | Boolean. If True, the anonymization only flags the sensitive components of the text and does not remove them. For example: "Mon nom est Alfred, voici mon numéro: 079563684" results in "Mon nom est , voici mon numéro <FLAG 079563684>". |
language | Language selection among "fr", "en" and "de". The current version is, however, specialized for French. |
process_columns | List of integers. If your input file is an Excel or CSV file, the anonymization is only applied to the specified columns of the data. |
pseudonymize | List of entities to pseudonymize, i.e. replace the flagged text with fake data (e.g. fake names). It should only list entities already present in the entities list. Entities that are not pseudonymized are anonymized. For example, if only "PERSON" is given to pseudonymize: "Mon nom est Alfred, voici mon numéro: 079563684" results in "Mon nom est Bernard, voici mon numéro <ANONYM_PHONE>" |
use_camembert | Boolean. If true, use the French camembert_ner model for NER. Detectors are cumulative (default all used). |
use_spacy | Boolean. If true, use spacy for NER and PII detection. Detectors are cumulative (default all used). |
use_swiss_ner | Boolean. If true, use spacy for NER specialized in Swiss entity recognition. Detectors are cumulative (default all used). |
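To make the difference between anonymizing and pseudonymizing concrete, here is a toy sketch in plain Python. It does not use the package's detectors or its internal logic: the spans are located by hand, and the function names, the FAKE_VALUES table and the labels PER and PHONE (borrowed from the placeholders in the examples above) are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity label)

# Hypothetical replacement values; the real package generates its own fake data.
FAKE_VALUES: Dict[str, str] = {"PER": "Bernard"}


def find_span(text: str, value: str, label: str) -> Span:
    """Locate a known value by hand (a stand-in for a real NER/PII detector)."""
    start = text.index(value)
    return (start, start + len(value), label)


def redact(text: str, spans: List[Span], pseudonymize: List[str]) -> str:
    """Replace each span with a fake value if its label is listed in
    pseudonymize, otherwise with an <ANONYM_LABEL> placeholder."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        if label in pseudonymize:
            replacement = FAKE_VALUES[label]
        else:
            replacement = f"<ANONYM_{label}>"
        text = text[:start] + replacement + text[end:]
    return text


text = "Mon nom est Alfred, voici mon numéro: 079563684"
spans = [find_span(text, "Alfred", "PER"), find_span(text, "079563684", "PHONE")]

print(redact(text, spans, pseudonymize=[]))       # both entities anonymized
print(redact(text, spans, pseudonymize=["PER"]))  # name replaced by a fake one
```

Entities listed in pseudonymize receive a readable substitute, while every other detected entity falls back to its placeholder, which matches the behavior the table describes.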
To use a customized config.json configuration file:
anonymize -f /path/to/my_file.xlsx -c config.json
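A customized configuration can also be written from Python instead of editing the generated file by hand. The sketch below rests on assumptions: the keys come from the table above, but the values (in particular the entity name PERSON, the column indices and whether they are zero-based) are illustrative and should be checked against the file produced by gen_config. The helper names check_config and write_config are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Keys follow the table above; the values are illustrative assumptions.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "entities": ["PERSON"],
    "flag_only": False,
    "language": "fr",
    "process_columns": [0, 2],
    "pseudonymize": ["PERSON"],
    "use_camembert": True,
    "use_spacy": True,
    "use_swiss_ner": True,
}


def check_config(config: Dict[str, Any]) -> None:
    """Entities to pseudonymize should already appear in the entities list."""
    missing = set(config["pseudonymize"]) - set(config["entities"])
    if missing:
        raise ValueError(f"pseudonymize lists entities not in entities: {sorted(missing)}")


def write_config(config: Dict[str, Any], path: Path) -> Path:
    """Validate the configuration and save it as JSON."""
    check_config(config)
    path.write_text(json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8")
    return path
```

Calling write_config(EXAMPLE_CONFIG, Path("config.json")) produces a file that can then be passed to anonymize with -c.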
For more help:
anonymize -h
Please be advised that the use of this code comes with no guarantees or warranties. Users are responsible for its application, and no liability is assumed by the developer for any consequences arising from its use.
This package was developed with the support of the Banque Cantonale du Valais (BCVs).
SPDX-FileCopyrightText: Copyright © 2023 Idiap Research Institute [email protected]
SPDX-FileContributor: Théophile Gentilhomme [email protected]
SPDX-License-Identifier: GPL-3.0-only