- Gonymizer
The Gonymizer project (Go + Anonymizer) is a project that was built at SmithRx in hope to simplify the QA process. Gonymizer is written in Golang and is meant to help database administrators and infrastructure folks easily anonymize production database dumps before loading this data into a QA environment.
We have built in support, and examples, for:
- Kubernetes CRONJOB scheduling
- AWS-S3 Storage processing and loading
We plan to have built-in:
- CRONJOB BASH scripts to use local disk as storage (see tasks, we need help!)
- AWS-Lambda Job scheduling (see tasks, we need help!)
Our API is an easy one to follow and we encourage others to join in by trying Gonymizer with their own development and staging environments either directly using the CLI or using the API. We include in our documentation: example configurations, best practices, Kubernetes CRONJOB examples, examples for AWS-Lambda, and other infrastructure tools. Please see the docs directory in this application to see a full how-to guide and where to get started.
Currently Gonymizer only supports PostgreSQL 9.x-13.x. We have not tested Gonymizer on versions 12+, but plan to in the near future. If you would like to help by adding support for other database management systems, new processors, or general questions please join by checking the CONTRIBUTING.md file in this repository.
- HIPAA: Health Insurance Portability and Accountability Act of 1996
- PCI DSS: Payment Card Industry Data Security Standard
- PHI: Protected Health Information
- PII: Personally identifiable information
In this document/codebase, we use them interchangeably.
If you are a seasoned Go veteran or already have an environment which contains Go>= 1.11 then you can skip to the next section.
Gonymizer requires that one has complete install of Go >= 1.11. To install Go on OSX you can run the following:
brew install go
Once this is complete we will need to make sure our Go paths are set correctly in our BASH profile. NOTE: You may need to change the directories below to match your setup.
echo "
export GOPATH=~/go
export GOROOT=/usr/local/Cellar/go/1.11.2/libexec
export GO111MODULE=on
" >> ~/.profile
It is recommended to put all Go source code under ~/go. Once this is complete we can attempt to build the application:
cd ~/go/src/github.com/smithoss/gonymizer/scripts
./build.sh
The build script will build two binaries. One for MacOS on the amd64 architecture as well as a Linux amd64 binary. These binaries are stored under the Gonymizer/bin directory. Now that we have a built binary we can attempt to download a map file using our JSON configuration:
./gonymizer-darwin -c ~/conf/gonymizer-config-file.json dump
Use the following steps to get up and going. Commands should be similar for Debian 9.x and Ubuntu 18.04.
- Install Golang and Git
sudo apt-get install go git
- Add go path to profile
echo "
export GOPATH=~/go
export GO111MODULE=on
" >> ~/.bashrc
- Git checkout
mkdir -p ~/go/src/github.com/smithoss/
cd ~/go/src/github.com/smithoss/
git clone https://github.com/smithoss/Gonymizer.git gonymizer
- Build the project
cd gonymizer/cmd/
go build -o ../bin/gonymizer .
- Run the binary
cd ../bin
./gonymizer --help
Gonymizer has many different configuration settings that can be enabled or disabled using the command line options.
It is recommended that one run gonymizer --help
or gonymizer CMD --help
where CMD is one of the commands to see
which options are available at any given time.
Below we give examples of both the CLI configuration as well as examples on how to create your map file.
Gonymizer was built using the Cobra + Viper Golang libraries to allow for easy configuration however you like it. We recommend using a JSON, YAML, or TOML file to configure Gonymizer. Below we will go over an example configuration for running Gonymizer.
For an example of how to set up a CLI configuration check our Dell Store 2 example in docs/demo/dellstore2/gonymizer_config.json
{
"comment": "This example is viewable under docs/demo/dellstore2",
"num-workers": 2,
"dump": {
"database": "store",
"disable-ssl": true,
"dump-file": "phi_dump.sql",
"exclude-schema": [
"pg*",
"information_schema"
],
"host": "localhost",
"port": 5432,
"schema": ["public"],
"row-count-file": "row-counts.csv",
"username": "levi"
}
}
}
comment
: is used to leave for comments for the reader and is not used by the application.
log-level
: is the level the application uses to know what should be displayed to the screen. Choices are: FATAL,
ERROR, WARN, INFO, DEBUG. We use the Logrus Golang library for logging so please read the documentation
here for more information.
database
: is the master database with PHI and PII that will be used for dumping a SQL dump file from.
host
: is the hostname for the master database with PHI and PII that will be used for dumping a SQL dump file from.
port
: is the host port that will be used to connect to the master database with PHI and PII.
username
: is the username that will be used to connect to the master database with PHI and PII.
password
: is the password that will be used to connect to the master database with PHI and PII.
disable-ssl
: is the master database with PHI and PII that will be used for dumping a SQL dump file from.
dump-file
: is where Gonymizer will store the SQL statements from the dump
command.
map-file
: is the file that gonymizer uses to map out which columns need to be anonymized and how. When using the
map
command in conjunction with --map-file
, or in the configuration above, a file is named similarly to the
map-file
, but with skeleton
in the name instead. More on this below in the map section.
exclude-table
: is list of tables that are not to be included during the pg_dump step of the extraction process.
This allows us to only focus on tables that are needed for our base application to work. Using this option minimizes
the size of our dump file and in return decreases the amount of time needed for dumping, processing, and
reloading. This option operates in the same fashion as pg_dump's --exclude-table
option.
exclude-table-data
: allows you to create a list of tables we would like to include in the pg_dump process but do not
want to include any of the data (table schema only). The usage and advantages are the same as the exclude-table
feature explained above and is identical to pg_dump's --exclude-table-data
option.
schema
: is a list of schemas the Gonymizer should dump from the master database. This option must be in the form
of a list if you are using the configuration methods mentioned above.
exclude-schema
: is a list of system level schemas that Gonymizer should ignore when adding CREATE SCHEMA statements
to the dump file. These schemas may still be included in the --schema
option, for example the public
schema.
schema-prefix
: is the prefix used for a schema environment where there is a prefix that matches other schemas. This
is same as a sharded architecture design which is outside the scope of this article and it is recommended to read
here if you are unfamiliar with this design paradigm.
For example: [company_1, company2, company_..., company_n-1, company_n] would be
--schema-prefix=company_ --schemas=company
--oids
: allows you to provide the --oids
option for older versions of pg_dump (prior to version 12)
NOTE: Some arguments are not included here. It is recommended to use gonymizer --help
and
gonymizer [COMMAND] --help
for more information and configuration options.
Once one has created a skeleton map file it is recommended to create a new true map file which will be used to let gonymizer know which columns need to be anonymized in the database and which columns do not. There are two methods in which gonymizer map files work (inclusive and exclusive).
NOTE: Currently SmithRx is using an exclusive dump file which can be found under map_files/prod_map.json
Below is a list of fake data creators and scramblers. This table may not be up to date so please make sure to check
processor.go
for a full list.
Processor Name | Use |
---|---|
AlphaNumericScrambler | Scrambles strings. If a number is in the string it will replace it with another random number |
EmptyJson | Replaces a JSON with an empty one ({} ) |
FakeStreetAddress | Used to replace a real US address with a fake one |
FakeCity | Used to replace a city column |
FakeLatitude | Used to replace a latitude column |
FakeLongitude | Used to replace a longitude column |
FakeCompanyName | Used to replace a company name |
FakeParagraph | Used to generate a random paragraph |
FakeUserAgent | Used to replace user agent with fake one |
FakeEmailAddress | Used to replace e-mail with a fake one |
FakeGender | Used to replace gender with a fake one |
FakeFirstName | Used to replace a person's first name with a fake first name (non-gender specific) |
FakeIPv4 | Used to replace an IPv4 with a fake one |
FakeIPv6 | Used to replace IPv6 with a fake one |
FakeCurrency | Used to replace currency with a fake one |
FakeLastName | Used to replace a person's last name with a fake last name |
ProcessorFullName | Used to replace a person's full name with fake one |
ProcessorLanguage | Used to replace a person's language with fake one |
FakePhoneNumber | Used to replace a person's phone number with fake phone number |
FakeState | Used to replace a state (full state name, non-abbreviated) |
FakeStateAbbrev | Used to replace a state abbreviation |
FakeUsername | Used to replace a username with a fake one |
FakeZip | Used to replace a real zip code with another zip code |
Identity | Used to notify Gonymizer not to anonymize the column (same as leaving the column out of the map file) |
RandomBoolean | Randomizes boolean fields |
RandomDate | Randomizes Day and Month, but keeps year the same (HIPAA only requires month and day be changed) |
RandomDigits | Randomizes a string of digit(s), but keeps the same length |
RandomUUID | Randomizes a UUID string, but keep a mapping of the old UUID and map it to the new UUID. If the old is found elsewhere in the database the new UUID will be used instead of creating another one. Useful for UUID primary key mapping (relationships). |
ScrubString | Replaces a string with *'s. Useful for password hashes. |
UniqueAlphaNumericScrambler | Similar to AlphaNumericScrambler but that all scrambled strings in the table column will be unique. |
An inclusive map file is a map file which includes every column in every table that is contained in a list of schemas
that is configurable by using the --schemas
option. If you are using a sharded/group configuration only one copy of
the column will be added to the file. An example map file can be found in map_files/example_db_map.json
.
Once there is an up to date skeleton file one can then walk through the file and modify the "Processors"."Name" field
for any column that needs to be anonymized. This can be done by simply replacing the "Identity" processor with one
listed in the table above. For example to pick a fake first name for a column labeled first_name
one would add the
FakeFirstName
to the "Processors"."Name" field like so:
{
"TableSchema": "public",
"TableName": "users",
"ColumnName": "first_name",
"DataType": "character varying",
"ParentSchema": "",
"ParentTable": "",
"ParentColumn": "",
"OrdinalPosition": 6,
"IsNullable": false,
"Processors": [
{
"Name": "FakeFirstName",
"Max": 0,
"Min": 0,
"Variance": 0,
"Comment": ""
}
],
"Comment": ""
}
An exclusive map file is a map file that contains only the columns that need to be anonymized. This is the only difference from the inclusive map file method and should make map files smaller and simpler to navigate since they will not contain any columns using the "Identity" processor. It is assumed that all columns that are not listed in the map file are considered to be OK to add to the dump file WITHOUT any scrambling or anonymization. This means that the user must add column definitions for every schema change that requires anonymization.
Pro Tip: An east way to handle schema changes is to run the map
command to create a new map file and copy/paste
the new columns into your map file while adding the proper processors at the same time.
Relationship mapping allows the user to define columns that should remain congruent during the processing/anonymization
step. For example if a user is identified by a unique UUID that is used across multiple tables in the database one may
select the RandomUUID
processor which keeps a global hash map of OLD-UUID => NEW-UUID
. The
global hash map then can be used by the processor and can also be stored to disk for back-tracing values to
debug the application. The only way to enable this type of logging is to edit the generator.go file and add the
function call the writeDebugMap function. Adding this to your run-time is outside of the scope of this documentation
and it is recommended to NEVER use this option when working with real PHI and PII data. If this file is compromised
and stolen, an attacker will gain full access of the mapping of (PHI, PII) => (Non-PHI, Non-PII)
.
Currently we only allow for global mapping of the following processors (more may be added later):
- AlphaNumericScrambler
- UniqueAlphaNumericScrambler
- RandomUUID
They can be found in the processor.go file:
var UUIDMap = map[uuid.UUID]uuid.UUID{}
var AlphaNumericMap = map[string]map[string]string{}
There are plans to add more globally aware processors in the future, but at this time only 2 are available.
To map a relationship one can do this quite easily by notifying Gonymizer that there is a parent table and column that exist that the column should be mapped to. Below is an example where we identify the parent schema, table, and column:
{
"TableSchema": "public",
"TableName": "credit_scores",
"ColumnName": "ssn",
"DataType": "integer",
"ParentSchema": "public",
"ParentTable": "user",
"ParentColumn": "ssn",
"OrdinalPosition": 6,
"IsNullable": false,
"Processors": [
{
"Name": "AlphaNumericScrambler",
"Max": 0,
"Min": 0,
"Variance": 0,
"Comment": ""
}
]
"Comment": ""
}
In the example above we are mapping the social security number (SSN) from the credit_scores
table to the users
table by simply notifying gonymizer that there exists a map for ssn that is tied to the users.ssn
table and column.
Gonymizer will see this and look the value up in the global AlphaNumericMap variable mentioned earlier. If the
original SSN key does not exist in the map the Gonymizer will automatically scramble the SSN and add an entry in the
map such that:
map["old SSN"]: "new value (new SSN)"
Every time gonymizer checks a value in the SSN column it will look up this value and replace it with the previously anonymized SSN. This allows us to map keys between tables.
Also make sure to add the parent table itself as a parent when creating a relationship mapping. From the example above the same would be true:
{
"TableSchema": "public",
"TableName": "user",
"ColumnName": "ssn",
"DataType": "integer",
"ParentSchema": "public",
"ParentTable": "user",
"ParentColumn": "ssn",
"OrdinalPosition": 6,
"IsNullable": false,
"Processors": [
{
"Name": "AlphaNumericScrambler",
"Max": 0,
"Min": 0,
"Variance": 0,
"Comment": ""
}
]
"Comment": ""
}
Notice that we added the column as a parent of itself. If this step is missing all other columns will be mapped to the correct value, but the parent column will not be mapped to the same hash map so it will contain different values than expected.
Note 1: Multiple tables can link back to the user table by simply adding the schema, table, and column names to the parent fields in the map file for the specified column.
Sharding is a type of database partitioning that separates very large databases the into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole. Explanation is outside the scope of this READ.me and more information can be found at this [Wikipedia article](https://en.wikipedia.org/wiki/Shard_(database_architecture\)).
NOTE: When working with a database that contains many schemas matching the schema-prefix (shards), one will need to make sure that all tables and columns are identical across each schema. Manging the DDL for each schema is outside the scope of Gonymizer project and should be done by external database administration tools.
- Create a map file:
gonymizer -c config/production-conf.json map
- Edit dump file to define which columns need to be anonymized.
- Create a PII encumbered dump file:
gonymizer -c config/prod-conf.json dump
- Use the Process command to anonymize the PII dump file:
gonymizer -c config/prod-conf.json process
- Use the Load command to load the anonymized database file into the database
gonymizer -c config/staging.json load
Also check out our slides from Percona Live 2019 here
-
Step 1: Generate a Map Skeleton (should only need to use the first time or during schema changes)
This will generate a new skeleton (defined, but empty) config file from scratch:
./gonymizer -c config/prod-conf.json map
If you already have a map file and just need to due to migrations, schema changes, etc (2nd -> nth runs) change the path to the real map file. The map command will NOT overwrite your map file, instead it will create a new file with "skeleton" in the name. This will also append new columns to the bottom:
./gonymizer -c config/prod-conf.json --map-file=db_mapper.prod_map.json map
Will output a file named:
db_mapper.prod_map.json.skeleton.json
-
Step 2: Copy the newly created skeleton file to a new production map file
Pro Tip: It is recommended to leave OUT column definitions from your map file that are to be skipped by the gonymizer. This is to keep the map file simple and clean. The gonymizer will skip any column that is not in the map file and continue on. The purpose of the skeleton file is to use it as a base line and to copy/paste your anonymized columns from the skeleton file into your true map file. This map file will be used in the processing step later. See Map Configuration above for more information.
mv db_mapper.prod_map.json.skeleton.json db_mapper.prod_map.json
Edit every field (removing unneeded columns if going Pro Tip route). Add processors or Min/Max as necessary.
-
Step 3: Generate PHI & PII-encumbered dumpfile
CAUTION!! This dump file will contain PII! Only do this on secure machines with encrypted block devices only!
./gonymizer -c config/prod-config.json dump --dump-file=dump-pii.sql
-
Step 4: Generate altered data using the dumpfile built in step 3
If you've correctly configured db_mapper.j
./gonymizer -c config/prod-conf.json --map-file=db_mapper.prod_nap.json\ --dump-file=dump-pii.sql --s3-file-path=s3://my-bucket-name.s3.us-west-2.amazonaws.com/db-dump-processed.sql process
-
Step 5. Use the Load command to load the data into the database to verify that the data is correctly scrambled
The processed SQL file can simply be imported using PSQL.
./gonymizer -c config/staging-conf.json --load-file=s3://my-bucket-name.s3.us-west-2.amazonaws.com/db-dump-processed.sql load
Testing for Gonymizer is different than expected for typical projects. When adding a test to the project one will
need to make sure the test is called from the main_test.go
test harness file in the root directory of the project.
All tests should be added to the seqUnitTests
function in the proper position in the test sequence. This sequence
creates, imports, modifies, and drops the database in the local test database.
To run tests you will want to use the command (in the root directory of the project)
go test -v -run TestStart
To specify username / password you can use the following environment variables:
PGUSER
PGDATABASE
Lets assume we created a new processor function for anonymizing IP addresses as seen in #64. In this case we create the
test using normal methods, but will need to add the function to the main_test.go
by adding the following line:
t.Run("ProcessorIPV4", TestProcessorIPv4)
Please make sure to read our license agreement here LICENSE.txt. We may state throughout our documentation that we are using this application to anonymize data for HIPAA requirements, but this is in our own environment and we give NO guarantee this will be the same for other's uses. Considering everyone's data set is completely different and the configuration of this application is very involved we cannot guarantee that this application will guarantee any compliance of any type. This is the application users responsibility to verify with council that the dataset that is processed by the application is indeed HIPAA/PCI/PHI/PII compliant.
THERE IS ABSOLUTELY NO GUARANTEE THAT USING THIS SOFTWARE WILL COMPLETE A CORRECT ANONYMIZATION OF YOUR DATA SET FOR COMPLIANCE PURPOSES. PLEASE SEE LICENSE.txt FOR MORE INFORMATION.
All graphics used in this project are released under the Create Common License 3.0
The Gonymizer Gophers logo was created by Levi Junkert which uses the Go Gopher that Takuya Ueda made from the original design of the Go Gopher which was created by Renee French