To allow fair comparison of different models, compute F1 and IoU metrics on 8 independent images from a hold-out set (see 00a1c5b) that was handpicked to come from different geographic localities than the train/val set. The evaluate function now has a boolean `calc_loss` parameter so that the expensive loss computation can be skipped during the `test_step` loop (i.e. compute metrics only). Note on issues and workarounds: computing the test metrics means the full-size masks need to be loaded, and for some reason the geographical extent of the mask doesn't align with the image, so a `rio.clip_box` call is needed. Also, the predicted tensor shapes are sometimes not exactly 5x that of the input image, so they are interpolated to the right size (which isn't strictly proper, I know). Also needed to use the CPU for test_step because the GPU did not have enough memory, and 32-bit precision too.
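The metric computation described above can be sketched roughly as follows. This is a minimal sketch, not the actual `evaluate` implementation: the function signature, the BCE loss choice, and the 0.5 threshold are assumptions, and F1/IoU are computed by hand from TP/FP/FN counts rather than via a metrics library.

```python
import torch
import torch.nn.functional as F


def evaluate(y_pred: torch.Tensor, y_true: torch.Tensor, calc_loss: bool = False):
    """Compute binary F1 and IoU; optionally also the (expensive) loss.

    Sketch only: signature and loss function are assumptions.
    """
    # Workaround: predicted tensor is sometimes not exactly the expected
    # size, so resize it to match the mask before computing metrics.
    if y_pred.shape[-2:] != y_true.shape[-2:]:
        y_pred = F.interpolate(
            y_pred, size=y_true.shape[-2:], mode="bilinear", align_corners=False
        )

    pred = y_pred.sigmoid() > 0.5  # binarize logits (0.5 threshold assumed)
    true = y_true.bool()

    tp = (pred & true).sum().float()
    fp = (pred & ~true).sum().float()
    fn = (~pred & true).sum().float()

    f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
    iou = tp / (tp + fp + fn + 1e-8)

    # Skip the costly loss computation during test_step unless asked for.
    loss = None
    if calc_loss:
        loss = F.binary_cross_entropy_with_logits(y_pred, y_true.float())
    return f1, iou, loss
```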
With the DeepSpeed ZeRO strategy, the test_step computation can now happen on the GPU instead of the CPU. However, at least 2 GPUs are needed (i.e. run only on the HPC server's A40s) to have enough GPU memory for a full-size Sentinel-2 image. Also made a one-line fix to get the height and width from the segmmask tensor instead of the superres tensor at L362, and forced using a single (CPU or GPU) device in test_s2s2net.
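As a rough sketch, the Trainer setup this describes would look something like the following (the exact argument values and the `model`/`datamodule` names are assumptions based on the description above, not the actual code):

```python
from pytorch_lightning import Trainer

# DeepSpeed ZeRO Stage 2 shards optimizer state and gradients across the
# 2 GPUs (e.g. the HPC server's A40s), so a full-size Sentinel-2 image
# fits in GPU memory during test_step without dropping back to the CPU.
trainer = Trainer(
    accelerator="gpu",
    devices=2,
    strategy="deepspeed_stage_2",
    precision=32,  # 32-bit precision, as noted above
)

# trainer.test(model=model, datamodule=datamodule)  # hypothetical names
```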
Clip the Sentinel-2 image to the bounding box extent of the binary mask instead of the other way around! This code has actually been living in a separate file for months, but now it's all in the S2S2Dataset class! The mask itself is first clipped to its own non-NaN bbox extent, rounded to 10m resolution instead of the mask's 2m resolution. Then the Sentinel-2 image is clipped to this clipped mask's bbox extent. There are quite a few chained operations to prevent a memory leak that was causing GPU out-of-memory issues.
Pull in the fix that ensures checkpoint states are saved to a common filepath with DeepSpeed (Lightning-AI/pytorch-lightning#12887).
Save the best model (based on highest validation F1) while training! Previously we just saved the neural network model at the end of the training run, but that might not be the best one since the metrics fluctuate up and down, so now we save the model with the maximum validation F1 instead. The model checkpoint when using the DeepSpeed ZeRO Stage 2 strategy is actually a folder (?), so it needs to be converted into a regular single-file model checkpoint using a PyTorch Lightning utility function. Increased the number of training epochs from 27 to 52, which takes about 1hr10min to run with DeepSpeed :D Also increased num_workers from 1 to 4 for the predict and test dataloaders.
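A sketch of that checkpointing setup: the monitored metric key (`val_f1`) and the output filename are assumptions, but `convert_zero_checkpoint_to_fp32_state_dict` is the PyTorch Lightning utility that collates a DeepSpeed checkpoint directory into a single file.

```python
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.utilities.deepspeed import (
    convert_zero_checkpoint_to_fp32_state_dict,
)

# Keep only the checkpoint with the highest validation F1 seen so far,
# instead of whatever the model happened to be at the last epoch.
checkpoint_callback = ModelCheckpoint(
    monitor="val_f1",  # assumed logging key for the validation F1 score
    mode="max",
    save_top_k=1,
)

# With DeepSpeed ZeRO Stage 2 the "checkpoint" is a directory of shards;
# after training, collate it into a regular single-file .ckpt.
convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir=checkpoint_callback.best_model_path,
    output_file="s2s2net.ckpt",  # hypothetical output filename
)
```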
Edit: fixed with model sharding ⚡ DeepSpeed ZeRO Stage 2 model parallel training #2. TODO: