To allow fair comparison of different models, compute F1 and IoU metrics on 8 independent images from a hold-out set (see 00a1c5b) that was handpicked to come from different geographic localities than the train/val set. The evaluate function now has a boolean `calc_loss` parameter so that the expensive loss computation can be skipped during the `test_step` loop (i.e. compute metrics only). Note on issues and workarounds: computing the test metrics means the full-size masks need to be loaded, and for some reason the geographical extent of the mask doesn't align with the image, so a `rio.clip_box` call is needed. Also, the predicted tensor shapes are sometimes not exactly 5x that of the input image, so they are interpolated to the right size (which isn't strictly proper, I know). Also needed to use the CPU for test_step because the GPU did not have enough memory, and 32-bit precision too.
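The metric computation described above can be sketched roughly as follows. This is a minimal sketch, not the actual `evaluate` implementation: the function signature, the BCE loss choice, and the 0.5 threshold are assumptions, and F1/IoU are computed by hand from TP/FP/FN counts rather than via a metrics library.

```python
import torch
import torch.nn.functional as F


def evaluate(y_pred: torch.Tensor, y_true: torch.Tensor, calc_loss: bool = False):
    """Compute binary F1 and IoU; optionally also the (expensive) loss.

    Sketch only: signature and loss function are assumptions.
    """
    # Workaround: predicted tensor is sometimes not exactly the expected
    # size, so resize it to match the mask before computing metrics.
    if y_pred.shape[-2:] != y_true.shape[-2:]:
        y_pred = F.interpolate(
            y_pred, size=y_true.shape[-2:], mode="bilinear", align_corners=False
        )

    pred = y_pred.sigmoid() > 0.5  # binarize logits (0.5 threshold assumed)
    true = y_true.bool()

    tp = (pred & true).sum().float()
    fp = (pred & ~true).sum().float()
    fn = (~pred & true).sum().float()

    f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
    iou = tp / (tp + fp + fn + 1e-8)

    # Skip the costly loss computation during test_step unless asked for.
    loss = None
    if calc_loss:
        loss = F.binary_cross_entropy_with_logits(y_pred, y_true.float())
    return f1, iou, loss
```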
With the DeepSpeed ZeRO strategy, the test_step computation can now happen on the GPU instead of the CPU. However, at least 2 GPUs are needed (i.e. run only on the HPC server's A40s) to have enough GPU memory for a full-size Sentinel-2 image. Also made a one-line fix to get the height and width from the segmmask tensor instead of the superres tensor at L362, and forced using a single (CPU or GPU) device in test_s2s2net.
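As a rough sketch, the Trainer setup this describes would look something like the following (the exact argument values and the `model`/`datamodule` names are assumptions based on the description above, not the actual code):

```python
from pytorch_lightning import Trainer

# DeepSpeed ZeRO Stage 2 shards optimizer state and gradients across the
# 2 GPUs (e.g. the HPC server's A40s), so a full-size Sentinel-2 image
# fits in GPU memory during test_step without dropping back to the CPU.
trainer = Trainer(
    accelerator="gpu",
    devices=2,
    strategy="deepspeed_stage_2",
    precision=32,  # 32-bit precision, as noted above
)

# trainer.test(model=model, datamodule=datamodule)  # hypothetical names
```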
Clip the Sentinel-2 image to the bounding box extent of the binary mask instead of the other way around! This code has actually been living in a separate file for months, but now it's all in the S2S2Dataset class! The mask itself is first clipped to its own non-NaN bbox extent, rounded to 10m resolution instead of the mask's 2m resolution. Then the Sentinel-2 image is clipped to this clipped mask's bbox extent. There are quite a few chained operations to prevent a memory leak that was causing GPU out-of-memory issues.
Pull in the fix that ensures checkpoint states are saved to a common filepath with DeepSpeed (Lightning-AI/pytorch-lightning#12887).
Save the best model (based on highest validation F1) while training! Previously we just saved the neural network model at the end of the training run, but that might not be the best one since the metrics fluctuate up and down, so now we save the model with the maximum validation F1 instead. The model checkpoint when using the DeepSpeed ZeRO Stage 2 strategy is actually a folder (?), so it needs to be converted into a regular single-file model checkpoint using a PyTorch Lightning utility function. Increased the number of training epochs from 27 to 52, which takes about 1hr10min to run with DeepSpeed :D Also increased num_workers from 1 to 4 for the predict and test dataloaders.
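A sketch of that checkpointing setup: the monitored metric key (`val_f1`) and the output filename are assumptions, but `convert_zero_checkpoint_to_fp32_state_dict` is the PyTorch Lightning utility that collates a DeepSpeed checkpoint directory into a single file.

```python
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.utilities.deepspeed import (
    convert_zero_checkpoint_to_fp32_state_dict,
)

# Keep only the checkpoint with the highest validation F1 seen so far,
# instead of whatever the model happened to be at the last epoch.
checkpoint_callback = ModelCheckpoint(
    monitor="val_f1",  # assumed logging key for the validation F1 score
    mode="max",
    save_top_k=1,
)

# With DeepSpeed ZeRO Stage 2 the "checkpoint" is a directory of shards;
# after training, collate it into a regular single-file .ckpt.
convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir=checkpoint_callback.best_model_path,
    output_file="s2s2net.ckpt",  # hypothetical output filename
)
```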
Edit: fixed with model sharding ⚡ DeepSpeed ZeRO Stage 2 model parallel training #2. TODO: