
A framework for AI applications for healthcare

View repo · Report Bug · Request Feature

New project template

Introduction

This repository contains a skeleton project template for new projects built on the csc-mlops development platform. The template provides a starting point with helper classes and functions to facilitate rapid development and deployment of applications.

Getting started

This project template makes use of classes and functions provided by the csc-mlops package, which can be installed into your local environment with pip:

pip install csc-mlops

Once you've set up your project following the guidelines below, begin a training run with the csc-mlops run command:

mlops run scripts/train.py -c config/config.cfg

Run the following for usage information.

mlops run -h

The first thing to do after cloning this template is to rename the appropriate files and folders so the directory is project specific; in particular, the project directory should be renamed to make clear that it contains your project files.

There are 5 main components that need to be completed after cloning the template:

1. config/config.cfg and config/local_config.cfg

The config file contains all the information that is used for configuring the project, experiment, and tracking server. This includes training parameters and XNAT configurations.

The config file path is also passed as an argument to the MLOps Experiment class where the experiment and tracking are configured.

As there will be differences between local development and running on DGX (for example XNAT configurations), you are highly encouraged to make use of local_config when developing locally.

Note: The values present in the template config files are examples; you can remove any except those in [server] and [project], which are necessary for MLOps. Beyond these, you are encouraged to add to and modify the config files as relevant to your project.
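To illustrate how such a config might be consumed in code, the sketch below reads config/config.cfg with Python's configparser. It is illustrative only: the [server] and [project] sections come from the template, but the key names shown (name, tracking_uri, learning_rate) and the [training] section are hypothetical examples rather than the template's actual schema.

# Minimal sketch: reading the project config with configparser.
# The key names and the [training] section below are hypothetical examples,
# not the template's actual schema.
from configparser import ConfigParser

config = ConfigParser()
config.read('config/config.cfg')

project_name = config.get('project', 'name', fallback='my-project')
tracking_uri = config.get('server', 'tracking_uri', fallback='')
learning_rate = config.getfloat('training', 'learning_rate', fallback=1e-3)

print(project_name, tracking_uri, learning_rate)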

2. project/Network.py

This file is used to define the PyTorch LightningModule class.

This is where you set the Network architecture and flow that you will use for training, validation, and testing.

Here you can set up which metrics are calculated and at which stage in the flow these occur, along with the model and optimiser.

The example has numerous metrics and steps that are not always necessary; feel free to delete or add as relevant to your project. A minimal sketch of the general shape of this file is shown below.
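The following LightningModule sketch is illustrative only: the architecture, loss, metric, and hyperparameters are placeholders, not the template's actual Network class.

# Minimal sketch of a LightningModule (illustrative only; the template's
# Network.py will differ in architecture, metrics, and configuration).
import torch
import pytorch_lightning as pl
from torchmetrics.classification import BinaryAccuracy


class Network(pl.LightningModule):
    def __init__(self, learning_rate: float = 1e-3):
        super().__init__()
        self.learning_rate = learning_rate
        self.model = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(28 * 28, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )
        self.loss_fn = torch.nn.BCEWithLogitsLoss()
        self.val_accuracy = BinaryAccuracy()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self(x).squeeze(1), y.float())
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x).squeeze(1)
        self.log('val_loss', self.loss_fn(logits, y.float()))
        self.log('val_acc', self.val_accuracy(torch.sigmoid(logits), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.learning_rate)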

3. project/DataModule.py

This file is used to define the PyTorch LightningDataModule class.

This is where you define the data that is used for training, validation, and testing.

The example involves retrieving data from XNAT (more on this below), which may not be necessary for your project. There are also additional data validation steps that might not be relevant; feel free to delete or add as relevant to your project. A minimal sketch of the general shape of this file follows.
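The LightningDataModule sketch below uses randomly generated local tensors for orientation only; the real DataModule in the template retrieves and validates data from XNAT, so the class body here is a placeholder.

# Minimal sketch of a LightningDataModule (illustrative only; the template's
# DataModule.py retrieves data from XNAT rather than generating tensors).
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset, random_split


class DataModule(pl.LightningDataModule):
    def __init__(self, batch_size: int = 8):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Placeholder data; replace with project-specific loading and validation.
        x = torch.randn(100, 1, 28, 28)
        y = torch.randint(0, 2, (100,))
        dataset = TensorDataset(x, y)
        self.train_set, self.val_set = random_split(dataset, [80, 20])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)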

4. scripts/train.py

This file is used to define the training run.

This is where the DataModule and Network are pulled together.

The example includes callbacks to retrieve the best model parameters; feel free to delete or add as relevant to your project. A minimal sketch of this wiring is shown below.
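The sketch below assumes class names Network and DataModule from the files above, hard-codes hyperparameters, and uses a plain PyTorch Lightning Trainer with a ModelCheckpoint callback as a stand-in for the template's best-model callbacks; the actual scripts/train.py configures the run through the MLOps Experiment class and the config file instead.

# Minimal training-script sketch (illustrative only; the template's
# scripts/train.py uses the MLOps Experiment class and config file).
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

from project.DataModule import DataModule
from project.Network import Network


def train():
    network = Network(learning_rate=1e-3)
    data_module = DataModule(batch_size=8)

    # Keep the best checkpoint according to validation loss.
    checkpoint = ModelCheckpoint(monitor='val_loss', mode='min', save_top_k=1)

    trainer = pl.Trainer(max_epochs=5, callbacks=[checkpoint])
    trainer.fit(network, datamodule=data_module)


if __name__ == '__main__':
    train()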

5. Dockerfile

This Dockerfile sets up the Docker image that the MLOps run will utilise.

In the example this is just a simple environment running Python 3.10. You will most likely need to adapt this for your project.

Examples of projects utilising these components:

https://github.com/GSTT-CSC/CARNAX-Neonatal-Abdominal-X-Ray

https://github.com/GSTT-CSC/wrist-fracture-x-ray

https://github.com/GSTT-CSC/dental-classifier

For further information on MLOps please refer to the MLOps tutorial repo:

https://github.com/GSTT-CSC/MLOps-tutorial

Additional steps that are strongly recommended for project setup:

1. Set up GitHub Actions

To run your tests using GitHub Actions, the .github/workflows/development_test.yml and .github/workflows/production_test.yml files should be modified.

These workflows use environment variables, defined at the top of each workflow, to make testing easier.

The production tests also use a GitHub secret to authenticate the writing of a gist that stores the test coverage badge: auth: ${{ secrets.PYTEST_COVERAGE_COMMENT }}. GitHub secrets are hidden string variables stored at the repository level; they can be defined in the repository settings.

More information about how the test coverage badge is defined can be found here.

2. Set up Git Hooks

This repository contains a pre-commit hook that helps prevent committing sensitive information to the repository by scanning your commits for certain patterns like names, addresses, phone numbers, patient IDs, etc.

2.1. Set up the Pre-commit Hook

The pre-commit hook script is located in the .github/hooks directory. Copy the pre-commit script from this directory to the .git/hooks/ directory in your local repository.

cp .github/hooks/pre-commit .git/hooks/ 

Make the script executable:

chmod +x .git/hooks/pre-commit

The script will now automatically check the files you're about to commit for any sensitive information patterns.

2.2. Set up Pre-commit Hook Exceptions

Sometimes, there may be legitimate cases where these patterns are allowed. In these cases, you can add exceptions to the .sensitive_exceptions and .files_exceptions files. Populating these files is not mandatory for the git hooks to work, but if used they should be kept in the root of the project directory.

The .sensitive_exceptions file should contain any specific instances of the forbidden patterns that you want to allow. Each exception should be on its own line. You can, for instance, add specific addresses or dates you wish to push to the remote.

The .files_exceptions file should contain any files/directories that you want to exclude from the checks. Each file should be on its own line.

These files are added to .gitignore as committing them is not advised.

2.3. Resolving Pre-commit Hook Issues

When the pre-commit hook identifies potential sensitive information in a commit, it will prevent the commit from being completed and output information about the offending files and patterns.

How you view this output will depend on your method of committing:

  • VSCode: If you're using VSCode UI to commit your changes, you can view the pre-commit hook output by clicking on "Show command output" when the error is thrown.

  • Terminal: If you're committing via terminal, the output will be displayed directly in the terminal.

Utility functions that may be useful

XNAT data handler

Accessing data stored in an XNAT archive is performed in two steps: first, the XNAT database is queried for project subjects using the DataBuilderXNAT class; the resulting list is then loaded using the PyTorch-style data loading transform LoadImageXNATd.

1. Create list of data samples

A list of subjects is extracted from the XNAT archive for the specified project. This is done automatically by the helper function xnat_build_dataset.

from utils.tools import xnat_build_dataset

# XNAT configuration; XNAT_LOCATION, XNAT_USER and XNAT_USER_PASSWORD are placeholders
PROJECT_ID = 'my_project'
xnat_configuration = {'server': XNAT_LOCATION,
                      'user': XNAT_USER,
                      'password': XNAT_USER_PASSWORD,
                      'project': PROJECT_ID}

xnat_data_list = xnat_build_dataset(xnat_configuration)

Each element in the list xnat_data_list is a dictionary with two keys, where these fields are unique references to each subject:

{
    'subject_id': <subject_id>,
    'subject_uri': <subject_uri>
}

2. Download relevant data using LoadImageXNATd and actions

A MONAI transform LoadImageXNATd is used to download the data from XNAT. This transform can be used in place of the conventional LoadImaged transform provided by MONAI to access local data.

A worked example is given below that creates a valid dataloader containing the sag_t2_tse scans from XNAT, where each subject has two experiments. The first thing that is required is an action function. This is a function that operates on an XNAT SubjectData object and returns the desired ImageScanData object from the archive, along with the key under which it will be stored in the dataset. For example, the function below will extract the 'sag_t2_tse' scans from the archive.

from xnat.mixin import ImageScanData, SubjectData


def fetch_sag_t2_tse(subject_data: SubjectData = None) -> (ImageScanData, str):
    """
    Identifies and returns the required xnat ImageScanData object from an xnat SubjectData object,
    along with the 'key' that will be used to access it.
    """
    for exp in subject_data.experiments:
        if 'MR_2' in subject_data.experiments[exp].label:
            for scan in subject_data.experiments[exp].scans:
                if 'sag_t2_tse' in subject_data.experiments[exp].scans[scan].series_description:
                    return subject_data.experiments[exp].scans[scan], 'sag_t2_tse'

In this example, the fetch_sag_t2_tse function will loop over all experiments available for the subject; if one of these experiments has 'MR_2' in the label, it will loop over all the scans in this experiment until it finds one with 'sag_t2_tse' in the series_description. The matching scan object is then returned along with the key it will be stored under in the data dictionary, in this case 'sag_t2_tse'.

We can now pass this action function to the LoadImageXNATd transform, which will apply each action function in the list sequentially; if multiple datasets are required for each subject, multiple functions can be used.

from transforms.LoadImageXNATd import LoadImageXNATd
from monai.transforms import Compose, ToTensord
from torch.utils.data import DataLoader
from monai.data import CacheDataset
from xnat.mixin import ImageScanData, SubjectData
from monai.data.utils import list_data_collate

# list of actions to be applied sequentially
actions = [fetch_sag_t2_tse]

train_transforms = Compose(
    [
        LoadImageXNATd(keys=['subject_uri'], actions=actions, xnat_configuration=xnat_configuration),
        ToTensord(keys=['sag_t2_tse'])
    ]
)

dataset = CacheDataset(data=xnat_data_list, transform=train_transforms)
data_loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=0, collate_fn=list_data_collate)

If further transforms are required they can be added to the Compose transform list as usual.
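For instance, standard MONAI dictionary transforms can be appended after LoadImageXNATd, keyed on the data returned by the action function; the transforms below are illustrative only and assume channel-first image data.

from monai.transforms import Compose, NormalizeIntensityd, Resized, ToTensord

train_transforms = Compose(
    [
        LoadImageXNATd(keys=['subject_uri'], actions=actions, xnat_configuration=xnat_configuration),
        # Additional transforms operate on the key returned by the action function.
        NormalizeIntensityd(keys=['sag_t2_tse']),
        Resized(keys=['sag_t2_tse'], spatial_size=(256, 256)),
        ToTensord(keys=['sag_t2_tse'])
    ]
)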

Contact

For bug reports and feature requests please raise a GitHub issue on this repository.