Python Datafly

datafly.py is a Python implementation of the Datafly algorithm.

Datafly is a greedy heuristic algorithm that anonymizes a table so that it satisfies k-anonymity.
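To illustrate the target property: a table is k-anonymous with respect to a set of quasi-identifier attributes when every combination of quasi-identifier values that appears in it is shared by at least k rows. The sketch below checks that condition on an in-memory table. The function name `is_k_anonymous` and the sample rows are made up for illustration and are not part of datafly.py.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every QI value combination occurs in at least k rows.

    Hypothetical helper for illustration; not taken from datafly.py.
    """
    # Count how many rows share each combination of quasi-identifier values.
    counts = Counter(tuple(row[attr] for attr in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())


# Made-up sample rows; "disease" is not a quasi-identifier and is ignored.
rows = [
    {"age": "30-45", "zip_code": "100**", "disease": "flu"},
    {"age": "30-45", "zip_code": "100**", "disease": "asthma"},
    {"age": "45-60", "zip_code": "200**", "disease": "flu"},
]

print(is_k_anonymous(rows, ["age", "zip_code"], 1))  # True
print(is_k_anonymous(rows, ["age", "zip_code"], 2))  # False: one group has a single row
```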

It currently supports the CSV format.

Usage

Use the --help option to show the help message:

usage: datafly.py [-h] --private_table PRIVATE_TABLE --quasi_identifier
                  QUASI_IDENTIFIER [QUASI_IDENTIFIER ...]
                  --domain_gen_hierarchies DOMAIN_GEN_HIERARCHIES
                  [DOMAIN_GEN_HIERARCHIES ...] -k K --output OUTPUT

Python implementation of the Datafly algorithm. Finds a k-anonymous
representation of a table.

optional arguments:
  -h, --help            show this help message and exit
  --private_table PRIVATE_TABLE, -pt PRIVATE_TABLE
                        Path to the CSV table to K-anonymize.
  --quasi_identifier QUASI_IDENTIFIER [QUASI_IDENTIFIER ...], -qi QUASI_IDENTIFIER [QUASI_IDENTIFIER ...]
                        Names of the attributes which are Quasi Identifiers.
  --domain_gen_hierarchies DOMAIN_GEN_HIERARCHIES [DOMAIN_GEN_HIERARCHIES ...], -dgh DOMAIN_GEN_HIERARCHIES [DOMAIN_GEN_HIERARCHIES ...]
                        Paths to the generalization files (must have same
                        order as the QI name list.
  -k K                  Value of K.
  --output OUTPUT, -o OUTPUT
                        Path to the output file.

Domain Generalization Hierarchy file format

A corresponding Domain Generalization Hierarchy (DGH) must be specified for each Quasi Identifier attribute; it is used to generalize the attribute's values.

Each DGH is specified in a DGH file, in which each line gives the hierarchy relationship of one value of that attribute. For example, for an attribute age, the file could look like this:

...
42,30-45,30-60,1-60,1-120
43,30-45,30-60,1-60,1-120
44,30-45,30-60,1-60,1-120
45,30-45,30-60,1-60,1-120
46,45-60,30-60,1-60,1-120
...

As shown above, each line gives the hierarchy of one ungeneralized value (generalization level 0) in the form level 0,level 1,level 2,...,level n (from the ungeneralized value to the most generic one).
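The line format described above can be read with a few lines of Python. This is a minimal sketch and not the parser used by datafly.py or dgh.py: the names `load_dgh` and `generalize` are hypothetical, and capping the level at the most generic value is a design choice made here for the example.

```python
import csv
import io


def load_dgh(lines):
    """Map each level-0 value to its full chain [level 0, level 1, ..., level n]."""
    hierarchy = {}
    for record in csv.reader(lines):
        if record:  # skip blank lines
            hierarchy[record[0]] = record
    return hierarchy


def generalize(hierarchy, value, level):
    """Return `value` generalized to `level`, capped at the most generic level."""
    chain = hierarchy[value]
    return chain[min(level, len(chain) - 1)]


# A few lines in the format shown above; a real file would be opened with open().
sample = io.StringIO(
    "44,30-45,30-60,1-60,1-120\n"
    "45,30-45,30-60,1-60,1-120\n"
    "46,45-60,30-60,1-60,1-120\n"
)
dgh = load_dgh(sample)

print(generalize(dgh, "45", 0))  # 45
print(generalize(dgh, "45", 1))  # 30-45
print(generalize(dgh, "46", 1))  # 45-60
print(generalize(dgh, "46", 2))  # 30-60
```

Keeping the whole chain per value makes generalizing to any level a single list lookup, at the cost of repeating the upper levels for every value.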

Example of anonymization:

The ./example folder contains a sample database (db_100.csv), and some Domain Generalization Hierarchy files (age_generalization.csv, city_birth_generalization.csv, zip_code_generalization.csv).

The following command anonymizes the table db_100.csv and writes a new table, db_100_3_anon.csv, which is 3-anonymous (k = 3):

$ python datafly.py -pt "example/db_100.csv" -qi "age" "city_birth" "zip_code" -dgh "example/age_generalization.csv" "example/city_birth_generalization.csv" "example/zip_code_generalization.csv" -k 3 -o "example/db_100_3_anon.csv"

Note that the list of Quasi Identifier names and the list of corresponding DGH file paths must be in the same order.
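The ordering requirement follows from the two lists being matched by position. The sketch below shows that kind of positional pairing using the arguments from the example command. The helper `pair_qi_with_dgh` is hypothetical and only illustrates the idea; it is not how datafly.py is known to handle its arguments.

```python
def pair_qi_with_dgh(qi_names, dgh_paths):
    """Pair each quasi-identifier name with the DGH path at the same position."""
    if len(qi_names) != len(dgh_paths):
        raise ValueError("expected one DGH file per quasi-identifier")
    return dict(zip(qi_names, dgh_paths))


# Values taken from the example command above.
qi_names = ["age", "city_birth", "zip_code"]
dgh_paths = [
    "example/age_generalization.csv",
    "example/city_birth_generalization.csv",
    "example/zip_code_generalization.csv",
]

mapping = pair_qi_with_dgh(qi_names, dgh_paths)
print(mapping["zip_code"])  # example/zip_code_generalization.csv
```

Swapping two entries in only one of the lists would silently attach the wrong hierarchy to an attribute, which is why the order matters.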

Authors

Alessio Vierti - Initial work

License

This project is licensed under the MIT License; see the LICENSE.md file for details.
