- Author: @nadia-eecs
- Acknowledgements & Appreciation To: @igorsusmelj
This repository demonstrates a complete workflow for training a machine learning model with the aid of active learning, using Lightly and Label Studio. It has been adapted heavily (especially this README.md) from Lightly_LabelStudio_AL.
Labeling data is expensive! We can make the task easier by using active learning to select a subset of unlabeled data to be labeled and then used to train a model. With a well-chosen subset, the model can achieve similar or better performance than if it were trained on the entire dataset (see the lightly.ai blog post).
Assume we have a new unlabelled dataset and want to train a new model. We do not want to label all samples because not all of them are valuable. Lightly can help select a good subset of samples to kick off labeling and model training. The loop is as follows:
- Lightly chooses a subset of the unlabelled samples.
- This subset is labeled using Label Studio.
- A machine learning model is trained on the labeled data and generates predictions for the entire dataset.
- Lightly consumes predictions and performs Active Learning to choose the next batch of samples to be labeled.
- This new batch of samples is labeled in Label Studio.
- The machine learning model is re-trained on the enriched labeled dataset to achieve better performance.
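The loop above can be sketched in a few lines of Python (the helper functions are stand-ins for the Lightly, Label Studio, and training steps, not real API calls):

```python
# A minimal, hypothetical sketch of the active-learning loop. The helpers
# below are placeholders for the steps in this README, NOT real API calls.

def select_batch(pool, predictions, k):
    # Lightly step: pick k samples (diversity first, active learning later).
    return pool[:k]

def label_samples(batch):
    # Label Studio step: a human assigns "track" / "no track".
    return [(sample, "track") for sample in batch]

def train_and_predict(labeled, pool):
    # Training step: fit a model, then score every remaining sample.
    return {sample: 0.5 for sample in pool}

pool = [f"img_{i}.jpg" for i in range(200)]
labeled, predictions = [], None
for round_idx in range(2):
    batch = select_batch(pool, predictions, k=50)
    pool = [s for s in pool if s not in batch]
    labeled += label_samples(batch)
    predictions = train_and_predict(labeled, pool)
print(len(labeled))  # 100 after two rounds of 50
```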
Let's get started!
Make sure you have an account for the Lightly Web App.
You also need to know your API token, which is shown under your USERNAME -> Preferences.
Clone this repo and install all Python package requirements from the `requirements.txt` file, e.g. with pip:
```shell
git clone https://github.com/blue-marble-space-institute-of-science/mouse-bps-labeler.git
cd mouse-bps-labeler
pip install -r requirements.txt
```
Datasources enable Lightly to access data in the cloud. They need to be configured with credentials. To create a datasource you must specify a dataset, the credentials, and a `resource_path`, which must point to a directory within the storage bucket.
A Lightly Dataset can support the following image file types:

- png
- jpeg
- bmp
- gif
- tiff
The input datasource is the raw input that Lightly reads. Lightly requires list and read access to it. Lightly needs read, list, write, and delete permissions:

- `s3:GetObject` (read; sufficient, together with `s3:ListBucket`, for read-only access)
- `s3:ListBucket` (list)
- `s3:PutObject` (write)
- `s3:DeleteObject` (delete)
For detailed documentation see AWS S3 Lightly Documentation
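As a concrete illustration, an IAM policy granting these four permissions might look like the following sketch (the bucket name is a placeholder; consult the AWS and Lightly documentation for the authoritative policy):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LightlyObjectAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET/*"
    },
    {
      "Sid": "LightlyListAccess",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::YOUR_BUCKET"
    }
  ]
}
```

Note that `s3:ListBucket` applies to the bucket itself, while the object actions apply to the objects inside it.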
Identify Lightly as a user with a role in the AWS account. Use this if internal or external policies require it and if security and compliance are important.
- Log into the AWS IAM Console
- Create a role
- Select AWS Account and configure the ID and access policy for Lightly
The code in this repository expects all keys to be placed in a `.env` file at the repository root (alongside `.git`), which is ignored via `.gitignore`. The format of the `.env` file is as follows:
```
MY_LIGHTLY_TOKEN=<token>
LIGHTLY_WORKER_ID=<id>
S3_REGION=<s3 region>
S3_ROLE_ARN=<s3 arn>
S3_EXTERNAL_ID=<s3 external id>
...
```
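For illustration, the `.env` file could be read with nothing but the standard library (the repository's scripts may instead rely on a package such as python-dotenv; this is a sketch, not the repo's actual loader):

```python
# Hypothetical .env loader: populate os.environ from KEY=value lines,
# skipping blank lines and comments. Not the repository's actual code.
import os

def load_env(path=".env"):
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

load_env()
token = os.environ.get("MY_LIGHTLY_TOKEN")  # None if the variable is unset
```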
We want to train a classifier to predict whether a microscopy image contains a linear arrangement of 53BP1 accumulation on chromatin surrounding DNA damage, i.e. a track of irradiation-induced foci. We use this dataset: Biological and Physical Sciences (BPS) Microscopy Benchmark Training Dataset.
1.1 00_download_Gyhi_4hr_from_s3_source.sh
To Download High Radiation Dosage and 4 Hour Post Exposure Conditions of the BPS Data
Run `00_download_Gyhi_4hr_from_s3_source.sh` to download a subset of the publicly available BPS mouse microscopy data locally. In this example, we use the portion of the dataset for high radiation (Gy) exposure levels at 4 hours post-radiation, and store it in a directory called `data_Gyhi_4hr`:
```shell
chmod +x bps_labeler/scripts/a_data_setup/00_download_Gyhi_4hr_from_s3_source.sh
./bps_labeler/scripts/a_data_setup/00_download_Gyhi_4hr_from_s3_source.sh
```
After downloading the data, you will see the data directory as follows:
```
data_Gyhi_4hr/
├── meta.csv
├── filtered_files.txt
├── filtered_meta.csv
├── P242_73665006707-A6_003_013_proj.tif
├── P242_73665006707-A6_008_034_proj.tif
├── P242_73665006707-A6_009_007_proj.tif
...
```
Since metadata is available in `filtered_meta.csv`, we need to separate it out for use with Lightly, which requires a specific format (see the Lightly metadata format). In other words, each row of the CSV file must be written as a JSON file with the same file stem as the corresponding `.tif` image.
```shell
python bps_labeler/scripts/a_data_setup/00_generate_json_from_meta_csv.py
```
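The conversion that script performs can be sketched as follows (the CSV column name `filename` is an assumption for illustration, not necessarily the column used in the repository):

```python
# Hypothetical sketch: write each CSV row as a JSON file named after the
# image's file stem, e.g. P242_..._proj.tif -> P242_..._proj.json.
import csv
import json
import os

def csv_rows_to_json(csv_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            stem = os.path.splitext(row["filename"])[0]
            with open(os.path.join(out_dir, stem + ".json"), "w") as out:
                json.dump(row, out, indent=2)
```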
`01_local_data_setup_lightly.py` prepares the local data for AWS S3 datasource configuration with the Lightly platform. It performs the following tasks:

- Converts the 16-bit uint TIFF files into JPG for easy rendering in the Lightly and Label Studio UIs
- Extracts metadata from `metadata.csv` to create individual JSON files for each TIFF
- Splits the dataset into training and validation sets with their respective data and metadata
- Generates the Lightly metadata `schema.json` for processing the individual JSON metadata files
The BPS mouse data is distributed in 16-bit uint TIFF format. For images to render in both the Lightly Platform and Label Studio, they must be converted to JPG format. In this implementation, to save space, we replace the files with their JPG versions.
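The intensity rescaling at the heart of that conversion can be sketched as follows (the actual script may use Pillow or OpenCV end-to-end; this shows only the 16-bit to 8-bit mapping with NumPy):

```python
# Hypothetical sketch of the 16-bit -> 8-bit rescaling needed before a
# TIFF can be saved as a JPG. Not the repository's actual implementation.
import numpy as np

def to_uint8(img16):
    """Linearly rescale a uint16 image to the full uint8 range."""
    img = img16.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # Flat image: nothing to stretch, return all zeros.
        return np.zeros_like(img16, dtype=np.uint8)
    return ((img - lo) / (hi - lo) * 255.0).round().astype(np.uint8)
```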
We extract metadata from CSV to JSON, generating individual JSON files with the same file stem as the images, containing the filename, the radiation exposure, the particle type, and the post-exposure time period. These files are also saved to `data_Gyhi_4hr`, and while the script runs you will see the data directory as follows:
```
data_Gyhi_4hr/
├── meta.csv
├── filtered_files.txt
├── filtered_meta.csv
├── P242_73665006707-A6_003_013_proj.jpg
├── P242_73665006707-A6_003_013_proj.json
├── P242_73665006707-A6_003_013_proj.tif
├── P242_73665006707-A6_008_034_proj.jpg
├── P242_73665006707-A6_008_034_proj.json
├── P242_73665006707-A6_008_034_proj.tif
├── P242_73665006707-A6_009_007_proj.jpg
├── P242_73665006707-A6_009_007_proj.json
├── P242_73665006707-A6_009_007_proj.tif
...
```
Running `01_local_data_setup_lightly.py` additionally splits the files into the directories `train_set` and `val_set`, migrating the files and their corresponding metadata into the subdirectories `data` and `meta`. The parent directory remains `data_Gyhi_4hr`, and you will see the data directory as follows:
```
data_Gyhi_4hr/
├── train_set/
│   ├── data/
│   │   ├── P242_73665006707-A6_003_013_proj.jpg
│   │   ├── P242_73665006707-A6_008_034_proj.jpg
│   │   ├── P242_73665006707-A6_009_007_proj.jpg
│   │   ...
│   └── meta/
│       ├── P242_73665006707-A6_003_013_proj.json
│       ├── P242_73665006707-A6_008_034_proj.json
│       ├── P242_73665006707-A6_009_007_proj.json
│       ...
├── val_set/
│   ├── data/
│   │   ├── P242_73665006707-G3_005_027_proj.jpg
│   │   ...
│   └── meta/
│       ├── P242_73665006707-G3_005_027_proj.json
│       ...
├── filtered_files.txt
├── filtered_meta.csv
├── full_train.json
├── meta.csv
└── val.json
```
After this, note the following files and directories in the current directory:

- `train_set`: Directory containing all samples to be used for training the model, along with the associated metadata for each `.jpg` image.
- `val_set`: Directory containing all samples to be used for model validation. It may be helpful to label them to check performance, though for our purposes it is not necessary.
- `full_train.json`: JSON file recording the paths to all files in `train_set`.
- `val.json`: JSON file recording the paths to all files in `val_set`.
For the Lightly platform to read the associated metadata for each individual `.tif` file, a `schema.json` file must contain a list of configuration entries. One is generated based on the metadata of the BPS Mouse Microscopy dataset.
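As a sketch, a metadata `schema.json` for this dataset might look like the following (the entry-field names follow the Lightly metadata-schema documentation; the specific keys and default values here are illustrative assumptions, not the generated file):

```json
[
    {
        "name": "Particle Type",
        "path": "particle_type",
        "defaultValue": "unknown",
        "valueDataType": "CATEGORICAL_STRING"
    },
    {
        "name": "Post-Exposure Hours",
        "path": "hr_post_exposure",
        "defaultValue": 0,
        "valueDataType": "NUMERIC_INT"
    }
]
```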
In this tutorial, samples are stored in the cloud, and Lightly Worker will read the samples from the cloud data source. For details, please refer to Set Up Your First Dataset. Here we use Amazon S3 as an example.
Under your S3 bucket, create two directories: `data` and `lightly`. We will upload all training samples to `data`. For example, run:
```shell
chmod +x bps_labeler/scripts/a_data_setup/02_upload_training_set_Gyhi_4hr_to_s3_dest.sh
./bps_labeler/scripts/a_data_setup/02_upload_training_set_Gyhi_4hr_to_s3_dest.sh
```
After uploading the samples, your S3 bucket should look like this:
```
s3://bucket/
├── lightly/
│   └── .lightly/
│       └── metadata/
│           ├── schema.json
│           ├── P242_73665006707-A6_003_013_proj.json
│           ├── P242_73665006707-A6_008_034_proj.json
│           ├── P242_73665006707-A6_009_007_proj.json
│           ...
└── data/
    ├── P242_73665006707-A6_003_013_proj.jpg
    ├── P242_73665006707-A6_008_034_proj.jpg
    ├── P242_73665006707-A6_009_007_proj.jpg
    ├── ...
```
To set up the Lightly Worker on your machine, run the following script:

```shell
./bps_labeler/scripts/b_label_first_selection/03_start_lightly_worker.sh
```
Now, with all unlabelled data samples in your training dataset, we want to select a good subset, label them, and train our classification model with them. Lightly can do this selection for you in a simple way. The script 03_run_first_selection.py does the job for you. You need to first set up Lightly Worker on your machine and put the correct configuration values in the script. Please refer to Install Lightly and Set Up Your First Dataset for more details.
Run the script after your worker is ready:
```shell
python bps_labeler/scripts/b_label_first_selection/03_run_first_selection.py
```
In this script, the Lightly Worker first creates a dataset named `nasa-bps-microscopy` in the Lightly Platform, then selects 50 samples based on embeddings of the training samples (to ensure diverse sampling) and particle-type metadata (to ensure balance), and records them in this dataset. These 50 samples are the ones we will label in the first round. You can see the selected samples in the Web App.
We do this using the open source labeling tool Label Studio, which is a browser-based tool hosted on your machine. You have already installed it and can run it from the command line. It will need access to your local files. We will first download the selected samples, import them in Label Studio, label them, and export the annotations.
Curious to get started with Label Studio? Check out this tutorial for help getting started!
We can download the selected samples from the Lightly Platform. The 04_download_samples.py script will do everything for you and download the samples to a local directory called `data_Gyhi_4hr/samples_for_labeling`.
```shell
python bps_labeler/scripts/b_label_first_selection/04_download_samples.py
```
Lightly Worker created a tag for the selected samples. This script pulls information about samples in this tag and downloads the samples.
Now we can launch Label Studio:

```shell
export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true && label-studio start
```
You should see it in your browser. Create an account and log in.
Create a new project called "nasa-bps-microscopy".
Then, head to `Settings` -> `Cloud Storage` -> `Add Source Storage` -> `Storage Type`: `Local files`.
Set the `Absolute local path` to the absolute path of the directory `samples_for_labeling`.
Enable the option `Treat every bucket object as a source file`.
Then click `Add Storage`. It will show you that you have added a storage.
Now click on `Sync Storage` to finally load the 50 images.
In `Settings` -> `Instructions` you may insert labeling instructions:
Please review the following NASA BPS fluorescence microscopy images of individual nuclei from mouse fibroblast cells. Cells that have been irradiated with high-energy radiation may incur double-stranded DNA damage. Tracks, or linear arrangements of bright, circular fluorescent foci, indicate 53BP1 repair mechanisms. For the images featured, please do your best to classify each image as having a:
- track
- no track
Thank you for your contributions!
In `Settings` -> `Labeling Interface`, under `Code`, insert:
```xml
<View>
  <Image name="image" value="$image"/>
  <Choices name="choice" toName="image">
    <Choice value="track"/>
    <Choice value="no track"/>
  </Choices>
</View>
```
This tells Label Studio that there is an image classification task with two distinct choices.
Now if you click on your project again, you see 50 tasks and the corresponding images.
Click on Label All Tasks
and get those 50 images labeled.
Pro tip! Use the keys `1` and `2` on your keyboard as hotkeys to be faster!
Export the labels via `Export` in the `JSON-MIN` format. Rename the file to `annotation-0.json` and place it in a directory called `data_Gyhi_4hr/ls_annotations/` in the data directory of this repository.
We can now train a classification model with the 50 labeled samples. The `train_model_01_resnet.py` script loads samples from `annotation-0.json` and performs this task.
```shell
python bps_labeler/scripts/c_train_model/train_model_01_resnet.py
```
The following steps are performed in this script:
- Load the annotations and the labeled images.
- Load the validation set (optional, as this depends on having a small set of ground-truth labels, which we do not have).
- Fine-tune a simple model, as in resnet50.py.
- Use the model trained on a sampling of the data to compute label predictions for all training samples, including unlabeled ones. These will be used for balancing the next dataset split.
- Dump the predictions in Lightly Prediction format into the directory `lightly_predictions` in your AWS datasource.
The model is okay for now; we will improve it. The predictions will be used for active learning.
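As an illustration, dumping one prediction file per image in Lightly's classification prediction format might look like this (the helper, file name, and probabilities are made up; see Lightly's predictions documentation for the authoritative schema):

```python
# Hypothetical sketch: one JSON prediction file per image, carrying the
# predicted category and the class probabilities. Illustrative only.
import json
import os

def write_prediction(out_dir, file_name, probs):
    os.makedirs(out_dir, exist_ok=True)
    record = {
        "file_name": file_name,
        "predictions": [
            {
                # Index of the highest probability, e.g. 0 = track.
                "category_id": int(max(range(len(probs)), key=probs.__getitem__)),
                "probabilities": probs,
            }
        ],
    }
    stem = os.path.splitext(file_name)[0]
    with open(os.path.join(out_dir, stem + ".json"), "w") as f:
        json.dump(record, f)
```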
Lightly Worker also does active learning for you based on predictions. It consumes predictions stored in the data source. We need to place the predictions we just acquired in the data source. For detailed information, please refer to Predictions Folder Structure. Here we still use the AWS S3 bucket as an example.
In the `lightly` directory you created earlier in your S3 bucket, create a subdirectory `.lightly/predictions` where predictions are kept. You also need the following additional files, which you can create directly by copying the code blocks below.
We only have one task here; let's name it `bps-classification`. The `tasks.json` file lists it:

```json
["bps-classification"]
```
The prediction `schema.json` for the task defines the categories:

```json
{
    "task_type": "classification",
    "categories": [
        {
            "id": 0,
            "name": "track"
        },
        {
            "id": 1,
            "name": "no track"
        }
    ]
}
```
Place these files in the `lightly` directory in your bucket, along with the predictions from your local directory `lightly_predictions`, by running:
```shell
./bps_labeler/scripts/c_train_model/06_upload_predictions_s3.sh
```
After uploading these files, your S3 bucket should look like this:
```
s3://bucket/
├── lightly/
│   └── .lightly/
│       ├── metadata/
│       │   └── ...
│       └── predictions/
│           ├── tasks.json
│           └── bps-classification/
│               ├── schema.json
│               ├── P242_73665006707-A6_003_013_proj.json
│               ├── P242_73665006707-A6_008_034_proj.json
│               ├── P242_73665006707-A6_009_007_proj.json
│               ├── ...
└── data/
    ├── P242_73665006707-A6_003_013_proj.jpg
    ├── P242_73665006707-A6_008_034_proj.jpg
    ├── P242_73665006707-A6_009_007_proj.jpg
    ├── ...
```
The files uploaded are the local prediction files from `lightly_predictions`, placed under `lightly/.lightly/predictions/bps-classification` in the S3 bucket.
With the predictions, the Lightly Worker can perform active learning and select new samples for us. The `07_run_second_selection.py` script does the job.
```shell
python bps_labeler/scripts/d_label_second_selection/07_run_second_selection.py
```
Note: if your Lightly Worker is not started you may need to rerun:
```shell
./bps_labeler/scripts/b_label_first_selection/03_start_lightly_worker.sh
```
This time, Lightly Worker goes through all training samples again and selects another 50 samples based on active learning scores computed from the predictions we uploaded in the previous step. For more details, please refer to Selection Scores and Active Learning Scorer.
You can see the results in the Web App.
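To build intuition for what such scores measure, here is a minimal illustration of one common active-learning score, least-confidence (this is a generic example, not necessarily the scorer Lightly uses internally):

```python
# Least-confidence scoring: samples whose top predicted probability is
# lowest score highest, so the most uncertain samples get labeled next.
def least_confidence(probs):
    return 1.0 - max(probs)

preds = {
    "a.jpg": [0.95, 0.05],  # confident -> low score
    "b.jpg": [0.55, 0.45],  # uncertain -> high score
}
ranked = sorted(preds, key=lambda k: least_confidence(preds[k]), reverse=True)
```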
You can repeat step 3 to download and label new samples.
```shell
python bps_labeler/scripts/b_label_first_selection/04_download_samples.py
```
To import the new samples, go to `Settings` -> `Cloud Storage` and click `Sync Storage` on the Source Cloud Storage you created earlier. A message `Synced 50 task(s)` should show up.
Then, go back to the project page and label the new samples. After you finish annotating, export the annotations again. Rename the file to `annotation-2.json` and place it in the root directory of this repository.
Very similar to the script in step 4, the 08_train_model_02_resnet.py script loads samples from `annotation-2.json` and trains the classification model again, now with all 100 labeled samples.
```shell
python bps_labeler/e_train_model/08_train_model_02_resnet.py
```
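For reference, combining the JSON-MIN exports into a single training list could be sketched like this (the `image` and `choice` keys are assumed from a simple image-classification JSON-MIN export; the repository's training scripts may parse the exports differently):

```python
# Hypothetical sketch: merge Label Studio JSON-MIN exports into a single
# list of (image, label) pairs for training. Illustrative only.
import json

def load_annotations(*paths):
    samples = []
    for path in paths:
        with open(path) as f:
            for row in json.load(f):
                samples.append((row["image"], row["choice"]))
    return samples
```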