Dataset Preparation

Prerequisite artifacts:

  • Annotated stacks (in a GCP bucket) that we will use to create the dataset
  • A dataset configuration file (on your local machine); a sketch of such a file follows this list

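The exact contents of the configuration file are defined by the example files in the repository's configs folder; the sketch below is only a hypothetical illustration (the field names are made up for this example), so copy and edit a real file from configs/ rather than this one.

```yaml
# Hypothetical sketch of a dataset-preparation config; the field names
# here are illustrative, not the repository's actual schema.
dataset_id: composite_0123   # becomes <dataset_ID> under <gcp_bucket>/datasets/
stacks:                      # annotated stacks (already in the GCP bucket) to draw from
  - stack_0123
dataset_split:               # fractions of the images assigned to each split
  train: 0.8
  validation: 0.1
  test: 0.1
```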
Infrastructure that will be used:

  • A GCP bucket from which the segmented stacks will be accessed
  • A GCP bucket where the prepared dataset will be stored
  • A GCP virtual machine to run the dataset preparation on
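Before provisioning anything, you can sanity-check that both buckets are reachable. A minimal check, assuming the gsutil CLI is installed and authenticated on your machine:

```sh
# <gcp_bucket> is a placeholder for your bucket, e.g. gs://sandbox.
gsutil ls <gcp_bucket>            # bucket holding the annotated stacks
gsutil ls <gcp_bucket>/datasets/  # destination folder; may not exist before the first run
```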

Workflow

  1. If the stacks are not in a GCP bucket, see the previous workflow Copying the raw data into the cloud for storage and usage.
  2. Either edit the configuration file configs/data_preparation.yaml or create your own configuration file and place it in the configs folder.
  3. Use Terraform to start the appropriate GCP virtual machine (terraform apply). This copies the current code base from your local machine to the GCP machine, so make sure any changes to the configuration file are saved before running this step.
  4. Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine named <project_name>-<user_name> has been created, where <project_name> is the name of your GCP project and <user_name> is your GCP user name.
  5. To create a dataset, SSH into the virtual machine <project_name>-<user_name>, start tmux (tmux), cd into the code directory (cd necstlab-damage-segmentation), and run python3 prepare_dataset.py --gcp-bucket <gcp_bucket> --config-file configs/<config_filename>.yaml. (An end-to-end session sketch follows this list.)
  6. Once dataset preparation has finished, you should see that the folder <gcp_bucket>/datasets/<dataset_ID> has been created and populated, where <dataset_ID> is defined in your configuration file.
  7. Use Terraform to terminate the GCP virtual machine (terraform destroy). Once Terraform finishes, you can check the GCP virtual machine console to ensure the virtual machine has been destroyed.
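Putting steps 3 through 7 together, a typical session might look like the following sketch. It assumes the gcloud CLI is used for SSH (your setup may differ), and all angle-bracketed names are placeholders:

```sh
# On your local machine: provision the VM (this also copies the code base up).
terraform apply

# SSH into the VM. This assumes the gcloud CLI; use whatever SSH method
# your project is set up with.
gcloud compute ssh <project_name>-<user_name>

# On the VM: run the preparation inside tmux so it survives a dropped connection.
tmux
cd necstlab-damage-segmentation
python3 prepare_dataset.py --gcp-bucket <gcp_bucket> --config-file configs/<config_filename>.yaml

# Back on your local machine: verify the output, then tear the VM down.
gsutil ls <gcp_bucket>/datasets/<dataset_ID>
terraform destroy
```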

Summary of command line arguments of prepare_dataset.py:

  • --gcp-bucket: type=str, help='The GCP bucket where the processed data is located and to use to store the prepared dataset.'
  • --config-file: type=str, help='The location of the data preparation configuration file.'

Example command line input:

```sh
python3 prepare_dataset.py --gcp-bucket gs://sandbox --config-file configs/config_sandbox/dataset-composite_0123.yaml
```

Tips:

  • From an SSH session on the VM, you can use the nano text editor to edit files previously uploaded to the VM, e.g., nano configs/dataset-medium.yaml to edit dataset-medium.yaml.
  • To create one VM without destroying others (when terraform apply would otherwise both create and destroy), use the -target flag: terraform apply -lock=false -target=google_compute_instance.vm[<#>] creates only VM number <#>. The same syntax works with terraform destroy to target a specific VM. (An example follows this list.)
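As a concrete illustration of the targeting tip above, assuming the VMs are declared as google_compute_instance.vm in the project's Terraform files:

```sh
# Create only VM number 0, leaving other instances untouched. Quote the
# address so the shell does not interpret the [0] brackets.
terraform apply -lock=false -target='google_compute_instance.vm[0]'

# The same flag works for destroying a single VM.
terraform destroy -lock=false -target='google_compute_instance.vm[0]'
```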