This project demonstrates how to implement privacy-preserving record linkage (PPRL) using different approaches to Bloom filter encoding. Implemented on the FEBRL synthetic dataset, it covers the following steps:
- Data pre-processing (data cleaning, phonetic encoding)
- Privacy preservation (field-level and record-level Bloom filters, Bloom filter hardening techniques)
- Blocking and indexing
- Comparison (Dice coefficient similarity)
- Classification (supervised and unsupervised)
- Evaluation (blocking: pair completeness, reduction ratio; linkage: accuracy, F1 score, precision, recall)
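The privacy preservation and comparison steps above can be sketched as follows. This is a minimal illustration, not the code used in this project: the function names (`qgrams`, `bloom_encode`, `dice`), the filter length of 1000 bits, the 20 hash functions, the bigram padding, and the double-hashing scheme built from SHA-1 and SHA-256 are all assumptions chosen for the example.

```python
import hashlib


def qgrams(value, q=2):
    """Split a string into a set of overlapping q-grams, padded with '_'."""
    padded = "_" * (q - 1) + value.lower().strip() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def bloom_encode(value, length=1000, num_hashes=20, q=2):
    """Encode one field value as the set of bit positions set to 1.

    Assumed scheme: double hashing, position_i = (h1 + i * h2) mod length,
    with h1 and h2 taken from two hashlib digests of each q-gram.
    """
    positions = set()
    for gram in qgrams(value, q):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % length)
    return positions


def dice(a, b):
    """Dice coefficient of two Bloom filters given as sets of 1-bit positions."""
    if not a and not b:
        return 0.0  # convention chosen here: two empty filters do not match
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    smith = bloom_encode("smith")
    smyth = bloom_encode("smyth")
    jones = bloom_encode("jones")
    print("smith vs smyth:", round(dice(smith, smyth), 3))
    print("smith vs jones:", round(dice(smith, jones), 3))
```

Because similar strings share most of their bigrams, their filters share most of their set bits, so a misspelled name should score well above an unrelated one without the plaintext values being compared directly. A record-level filter could be built along the same lines by hashing the q-grams of several fields into one filter, and hardening techniques would then be applied to the resulting bit positions.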
Compiled using:
- Python 3.9.7
- pandas 1.3.4
- hashlib: https://docs.python.org/3/library/hashlib.html
- Python Record Linkage Toolkit (recordlinkage 0.15): https://pypi.org/project/recordlinkage/
This project builds on the following projects and resources:
- Implementation examples of Python Record Linkage Toolkit: https://github.com/mayerantoine/recordlinkage
- PPRL hardening technique and attack: https://dmm.anu.edu.au/pprlattack/
- FEBRL: http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/manual.html
- Python Record Linkage Toolkit documentation: https://recordlinkage.readthedocs.io/en/latest/about.html
- Bloom filter examples: https://github.com/onnovalkering/notebooks/tree/master/record-linkage
This project is licensed under the GNU General Public License v3.0.