This repository holds code for the NHSX Analytics Unit PhD internship project (previously known as Synthetic Data Generation - VAE), undertaken by Dominic Danks, which contextualises and investigates the potential use of Variational AutoEncoders (VAEs) for synthetic health data generation.
Project Description - Synthetic Data Exploration: Variational Autoencoders
Note: No data, public or private, are shared in this repository.
- The main code is found in the root of the repository (see Usage below for more information)
- The accompanying report is also available in the reports folder
- More information about the VAE with Differential Privacy can be found in the model card
N.B. A modified copy of Opacus (v0.14.0), a library for training PyTorch models with differential privacy, is contained within the repository. See the model card for more details.
To get a local copy up and running, follow these steps.
To clone the repo:
git clone https://github.com/nhsx/SynthVAE.git
To create a suitable environment:
python -m venv synth_env
source synth_env/bin/activate
pip install -r requirements.txt
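Before installing or running anything, it can help to confirm that the virtual environment is actually active. The following is a minimal sketch using only the standard library; in_virtualenv is a hypothetical helper name and is not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv.

    Inside an environment created with `python -m venv`, sys.prefix points
    at the environment while sys.base_prefix points at the base installation.
    """
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

If this prints False after running the activate command, the packages from requirements.txt would be installed into the base interpreter instead of synth_env.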
To reproduce the experiments contained in the report involving the SDV baseline models (e.g. CopulaGAN, CTGAN, GaussianCopula and TVAE), run sdv_baselines.py. The parameters can be found using the --help flag:
python sdv_baselines.py --help
usage: sdv_baselines.py [-h] [--n_runs N_RUNS] [--model_type {CopulaGAN,CTGAN,GaussianCopula,TVAE}]
optional arguments:
-h, --help show this help message and exit
--n_runs N_RUNS set number of runs/seeds
--model_type {CopulaGAN,CTGAN,GaussianCopula,TVAE}
set model for baseline experiment
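The help text above suggests a parser along the following lines. This is a minimal sketch reconstructed from the usage message, not the script's actual source: the int type for --n_runs and both default values are assumptions, and build_parser is a hypothetical name.

```python
import argparse

# Model names copied from the choices shown in the help output.
MODEL_TYPES = ["CopulaGAN", "CTGAN", "GaussianCopula", "TVAE"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the sdv_baselines.py usage message."""
    parser = argparse.ArgumentParser(prog="sdv_baselines.py")
    parser.add_argument(
        "--n_runs",
        type=int,   # assumption: the help output does not state the type
        default=1,  # assumption: the real default is not shown
        help="set number of runs/seeds",
    )
    parser.add_argument(
        "--model_type",
        choices=MODEL_TYPES,
        default="CopulaGAN",  # assumption: the real default is not shown
        help="set model for baseline experiment",
    )
    return parser


# Parse an explicit argument list rather than sys.argv for illustration.
args = build_parser().parse_args(["--n_runs", "3", "--model_type", "TVAE"])
print(args.n_runs, args.model_type)
```

Because --model_type uses choices, argparse rejects any value outside the four listed models before the experiment starts.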
To reproduce the experiments contained in the report involving the VAE with/without differential privacy, run scratch_vae_expts.py. The parameters can be found using the --help flag:
python scratch_vae_expts.py --help
usage: scratch_vae_expts.py [-h] [--n_runs N_RUNS] [--diff_priv DIFF_PRIV] [--savefile SAVEFILE]
optional arguments:
-h, --help show this help message and exit
--n_runs N_RUNS set number of runs/seeds
--diff_priv DIFF_PRIV
run VAE with differential privacy
--savefile SAVEFILE save trained model's state_dict to file
Code to load a saved model and generate correlation heatmaps is contained within plot.py. The file containing the saved model's state_dict should be provided via a command line argument:
python plot.py --help
usage: plot.py [-h] --savefile SAVEFILE
optional arguments:
-h, --help show this help message and exit
--savefile SAVEFILE load trained model's state_dict from file
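To illustrate what a correlation heatmap compares, the sketch below computes Pearson correlation matrices for a real and a synthetic table and reports their largest element-wise gap. It is a dependency-free illustration and not the logic in plot.py: the function names and the toy data are hypothetical, and in practice pandas DataFrame.corr() would produce the same matrices, which could then be drawn with a plotting library.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # A constant column has zero variance and would divide by zero here.
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise Pearson correlations between columns (one list per column)."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


def max_abs_difference(a: List[List[float]], b: List[List[float]]) -> float:
    """Largest absolute element-wise gap between two same-shaped matrices."""
    return max(
        abs(va - vb) for row_a, row_b in zip(a, b) for va, vb in zip(row_a, row_b)
    )


# Toy data, invented for illustration only.
real = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
synthetic = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 2, 3, 4]]

gap = max_abs_difference(correlation_matrix(real), correlation_matrix(synthetic))
print(f"largest correlation gap: {gap:.2f}")
```

A small gap indicates that the synthetic data preserves the pairwise linear relationships of the real data; it says nothing about higher-order structure or privacy.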
Experiments are run against the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT) dataset, accessed via the pycox Python library.
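A minimal sketch of fetching SUPPORT through pycox, assuming pycox is installed and the machine has network access on first use, since pycox downloads and caches the dataset. load_support and summarise are hypothetical helper names; the repository's own loading code may differ.

```python
def load_support():
    """Return the SUPPORT dataset as a pandas DataFrame via pycox."""
    # Imported lazily so that defining this helper does not require pycox.
    from pycox.datasets import support

    return support.read_df()


def summarise(df) -> str:
    """One-line size summary of any object exposing a (rows, cols) .shape."""
    rows, cols = df.shape
    return f"{rows} rows x {cols} columns"
```

Usage, once pycox is available:

```python
df = load_support()
print(summarise(df))
```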
See the open issues for a list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidance.
Distributed under the MIT License. See LICENSE for more information.
To find out more about the Analytics Unit visit our project website or get in touch at [email protected].