# Transformer model for biomarker prediction: Evaluating the impact of ECDF normalization on model robustness in clinical data
This repository contains the source code for the master's thesis. The full thesis manuscript is available here.
- Project Setup
- Data Access
- Environment Variables
- Snakemake Profile
- Executing Tests (Optional)
- Running the Experiments
## TL;DR with defaults

```bash
git clone https://github.com/mshavliuk/thesis_code.git ecdf_thesis
cd ecdf_thesis

conda env create -f workflow/envs/gcloud.yml -n gcloud
conda activate gcloud
gcloud auth login
gcloud config set project $(gcloud projects list --format="value(projectId)" --limit=1)
conda deactivate

conda env create -f environment.yml -n ecdf_thesis
mkdir temp
conda env config vars set \
    PYTHONPATH=$(pwd) \
    DATA_DIR=$(pwd)/data \
    WANDB_PROJECT=ECDF-thesis \
    TEMP_DIR=$(pwd)/temp \
    SNAKEMAKE_PROFILE=$(pwd)/workflow/profiles/workstation \
    -n ecdf_thesis
conda activate ecdf_thesis
wandb login

snakemake generate_unittest_dataset
pytest
snakemake
```
## Project Setup

To get started, clone the repository and set up the environment:

```bash
git clone https://github.com/mshavliuk/thesis_code.git ecdf_thesis
cd ecdf_thesis

# Create and activate the Conda environment
conda env create -f environment.yml -n ecdf_thesis
conda activate ecdf_thesis

# Log in to Weights & Biases for checkpointing and logging
wandb login
```
A less developer-friendly but more reproducible approach is to install the exact dependency versions used during development (ignoring patch-level updates). This is generally not recommended unless the approach above breaks:

```bash
conda create --file environment.txt -n ecdf_thesis
conda activate ecdf_thesis
pip install -r requirements.txt
```
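As a quick sanity check that the environment resolved correctly, you can try importing dependencies the workflow relies on (`wandb` and `snakemake` are both used later in this guide; extend the list as needed):

```bash
# Verify that key project dependencies import in the active environment
python -c "import wandb, snakemake; print('environment OK')"
```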
## Data Access

The MIMIC-III dataset is distributed under restricted access by PhysioNet. To obtain the dataset, you must complete additional certification. Refer to the MIMIC-III dataset page for more information.

To download the dataset via Google Cloud:

- Connect your PhysioNet account to Google Cloud in Cloud Settings.
- Request Google Cloud access.
- Set up a Google Cloud Project and configure a billing account.
- Create a designated Conda environment with the Google Cloud SDK. A separate environment is needed because, at the time of writing (20.11.2024), google-cloud-sdk is incompatible with the Python 3.12 used in this project:

  ```bash
  conda env create -f workflow/envs/gcloud.yml -n gcloud
  conda activate gcloud
  ```

- Authenticate using the following command:

  ```bash
  gcloud auth login
  ```

- Set the proper PROJECT_ID or keep the default one:

  ```bash
  # see available projects
  gcloud projects list
  # choose the default project to use
  gcloud config set project PROJECT_ID
  ```

- Switch back to the main Conda environment:

  ```bash
  conda activate ecdf_thesis
  ```
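Once access is granted, you can optionally verify it from the `gcloud` environment before running the pipeline. A minimal sketch, assuming the standard PhysioNet bucket name for MIMIC-III v1.4 (the bucket is requester-pays, hence the `-u` billing-project flag):

```bash
# List the MIMIC-III bucket to confirm access (bucket name assumed)
conda activate gcloud
gsutil -u $(gcloud config get-value project) ls gs://mimiciii-1.4.physionet.org
conda deactivate
```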
## Environment Variables

Set the following environment variables to ensure proper functionality:

- `PYTHONPATH`: Include the root project directory.
- `TEMP_DIR`: Path for temporary files; must point to an existing directory (e.g., `$(pwd)/temp`).
- `DATA_DIR`: Path for datasets and plots.
- `WANDB_PROJECT`: Name of the Weights & Biases project used for logging (e.g., `ECDF-thesis`).

On clusters with restricted filesystem access, you may also need to set the following (see the sketch below):

- `MPLCONFIGDIR`: `/tmp/ecdf_thesis/matplotlib`
- `SNAKEMAKE_OUTPUT_CACHE`: `/tmp/ecdf_thesis/snakemake-cache`
- `XDG_CACHE_HOME`: `/tmp/ecdf_thesis/xdg_cache`
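A minimal sketch of creating and persisting these cache locations, using the same `conda env config vars` mechanism as the main variables below (the `/tmp/ecdf_thesis` paths are taken from the list above and assumed writable on your cluster):

```bash
# Create the cache directories and persist the variables in the environment
mkdir -p /tmp/ecdf_thesis/{matplotlib,snakemake-cache,xdg_cache}
conda env config vars set \
    MPLCONFIGDIR=/tmp/ecdf_thesis/matplotlib \
    SNAKEMAKE_OUTPUT_CACHE=/tmp/ecdf_thesis/snakemake-cache \
    XDG_CACHE_HOME=/tmp/ecdf_thesis/xdg_cache \
    -n ecdf_thesis
```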
To persist the main set of variables in the current Conda environment:

```bash
mkdir temp
conda env config vars set \
    PYTHONPATH=$(pwd) \
    DATA_DIR=$(pwd)/data \
    WANDB_PROJECT=ECDF-thesis \
    TEMP_DIR=$(pwd)/temp

# reactivate the environment for changes to take effect
conda deactivate
conda activate ecdf_thesis
```
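To confirm the variables were recorded:

```bash
conda env config vars list -n ecdf_thesis
```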
## Snakemake Profile

It is recommended to configure a Snakemake profile suited to your hardware. This project includes two profiles:

- `workstation`: For local workstations with a single GPU.
- `tuni-cluster`: For Slurm clusters with a quota of 12 nodes.

To set the profile:

```bash
# For local usage
conda env config vars set SNAKEMAKE_PROFILE=$(pwd)/workflow/profiles/workstation

# For Slurm clusters
conda env config vars set SNAKEMAKE_PROFILE=$(pwd)/workflow/profiles/tuni-cluster

# Reactivate the environment
conda deactivate
conda activate ecdf_thesis
```

For more details, refer to the Snakemake documentation.
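Once a profile is set, a dry run is a cheap way to check that it is picked up and to preview the scheduled jobs without executing anything:

```bash
# -n / --dry-run: show what would be executed
snakemake -n
```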
## Executing Tests (Optional)

To ensure the code and dependencies are properly set up, run the unit tests:

- Preprocess datasets for unit testing:

  ```bash
  snakemake generate_unittest_dataset
  ```

- Run the tests:

  ```bash
  pytest
  ```
Expected output:

```
========================================== test session starts ===========================================
platform linux -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
rootdir: /home/user/projects/thesis_code
collected 58 items

src/util/tests/test_collator.py .....                                                        [  8%]
src/util/tests/test_dataset.py ..................                                            [ 39%]
src/util/tests/test_variable_ecdf_scaler.py ......                                           [ 50%]
workflow/tests/test_data_extractor.py ....................                                   [ 84%]
workflow/tests/test_data_processing_job.py .........                                         [100%]

==================================== 58 passed, 11 warnings in 24.63s ====================================
```
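While iterating, you can also run a single test module with verbose output (paths taken from the listing above):

```bash
pytest src/util/tests/test_variable_ecdf_scaler.py -v
```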
## Running the Experiments

The experiment pipeline consists of the following steps:

- Download the MIMIC-III dataset from Google Cloud and store it temporarily.
- Convert the downloaded `.csv.gz` files to `.parquet`.
- Preprocess the data to generate datasets with different noise levels (listed in `config.yaml`).
- Pretrain the model.
- Compute test metrics for pretrained models and mark the best ones for finetuning.
- Fine-tune the model.
- Analyze the results.

Adjust the number of cross-validation folds and data fractions in `config.yaml`.
Snakemake automates the entire experiment pipeline (except the analysis). To reproduce the experiments, simply run:

```bash
snakemake
```

The computation may take several days, depending on hardware.

Alternatively, you can run specific experiments. For example, to execute all jobs for the ECDF experiments:

```bash
snakemake results/finetune_config/ecdf.SUCCESS
```
The Snakemake DAG for this pipeline:
After completion, Snakemake creates empty `.SUCCESS` files at `results/job_type/job_name.SUCCESS` to track completed jobs. This ensures that subsequent runs do not re-execute unless input files or configurations are modified.
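Consequently, to force a job to re-execute, it is enough to delete its marker file and re-run Snakemake; using the ECDF target from above as an example:

```bash
rm results/finetune_config/ecdf.SUCCESS
snakemake results/finetune_config/ecdf.SUCCESS
```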
For more flexibility, you can run individual experiments or steps directly using Bash commands. For example, to run model pretraining:

```bash
python src/pretrain --config experiments/pretrain/ours-noise-13.yaml
```

Refer to the `shell` sections in the `Snakefile` for more examples.
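To see which experiment configurations are available, list the config directories (layout inferred from the example above):

```bash
ls experiments/pretrain/
```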