
Conversation

@weiji14 weiji14 commented Jan 1, 2019

Supersedes #28. Produce even more perceptually realistic bed elevation! Shift from Ledig et al. 2016's SRGAN to Wang et al. 2018's Enhanced Super Resolution Generative Adversarial Network (ESRGAN). ESRGAN won first place in the PIRM 2018 challenge on perceptual super resolution (Region 3).

[Figure: Ablation study showing the main changes made to enhance the performance of the Super Resolution Generative Adversarial Network]

Much of the improvement appears to come from Lim et al. 2017's Enhanced Deep Residual Network (EDSR) paper, which won first place in the New Trends in Image Restoration and Enhancement (NTIRE) 2017 challenge on image super-resolution.

Will also need to look into where we want our Super Resolution neural network model to lie in the Perception-Distortion tradeoff space (see Blau & Michaeli, 2017). The intuition, as I see it for our geospatial problem of creating a Super Resolution Digital Elevation Model (DEM) from a Low Resolution DEM, is that distortion is about making each individual elevation value close to the groundtruth elevation, whereas perception is about ensuring that the whole 2D grid looks like a physically valid topography (e.g. no overly steep cliffs or crazy terrain).
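
For reference, a rough restatement of how Blau & Michaeli (2017) formalize this (notation adapted from their paper): distortion is the expected reconstruction error between each output and its groundtruth, while perception is the distance between the distribution of outputs and the distribution of natural (here, physically valid) terrain.

```latex
% Perception-distortion function: best attainable perceptual quality
% at a given distortion budget D
P(D) = \min_{p_{\hat{X}|Y}} d\left(p_{X}, p_{\hat{X}}\right)
\quad \text{subject to} \quad
\mathbb{E}\left[\Delta(X, \hat{X})\right] \le D
```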

[Figure: The Perception-Distortion tradeoff curve]

Blog posts:

Github repositories:

References:

  • Blau, Y., & Michaeli, T. (2017). The Perception-Distortion Tradeoff. ArXiv:1711.06077 [Cs]. Retrieved from https://arxiv.org/abs/1711.06077
  • Blau, Y., Mechrez, R., Timofte, R., Michaeli, T., & Zelnik-Manor, L. (2018). 2018 PIRM Challenge on Perceptual Image Super-resolution. Retrieved from https://arxiv.org/abs/1809.07517v2
  • Jolicoeur-Martineau, A. (2018). The relativistic discriminator: a key element missing from standard GAN. ArXiv:1807.00734 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1807.00734
  • Lim, B., Son, S., Kim, H., Nah, S., & Lee, K. M. (2017). Enhanced Deep Residual Networks for Single Image Super-Resolution. ArXiv:1707.02921 [Cs]. Retrieved from https://arxiv.org/abs/1707.02921
  • Timofte, R., Agustsson, E., Gool, L. V., Yang, M.-H., Zhang, L., Lim, B., … Guo, Q. (2017). NTIRE 2017 Challenge on Single Image Super-Resolution: Methods and Results. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 1110–1121). Honolulu, HI, USA: IEEE. https://doi.org/10.1109/CVPRW.2017.149
  • Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., … Tang, X. (2018). ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. ArXiv:1809.00219 [Cs]. Retrieved from https://arxiv.org/abs/1809.00219
  • Zhao, H., Gallo, O., Frosio, I., & Kautz, J. (2017). Loss Functions for Image Restoration With Neural Networks. IEEE Transactions on Computational Imaging, 3(1), 47–57. https://doi.org/10.1109/TCI.2016.2644865

TODO modifications to turn SRGAN into ESRGAN

🏗️ Model architecture changes

  • Remove BatchNormalization from Generator Network (55c238a)
  • LeakyReLU instead of PReLU in Generator Network (c2f3ec2)
  • Deeper and more complex Generator Neural Network
    • Residual in Residual Dense Blocks (RRDB) with residual scaling of 0.2 (c607fd3; see the Chainer sketch after this list)
    • Smaller weight initialization (i.e. He Normal initialization times 0.1) (5b87c0c)
  • Two convolutional layers (with a LeakyReLU in between) using kernel size 3 after the upsampling layers, instead of one convolutional layer using kernel size 9 (aee52cc)
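
Putting those architecture changes together, here is a minimal Chainer sketch of the RRDB trunk (class names, channel counts and the growth rate are illustrative assumptions, not the exact values used in this repository's notebook):

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L


class DenseBlock(chainer.Chain):
    """Five 3x3 convolutions with dense connections and LeakyReLU activations."""

    def __init__(self, in_ch=64, growth=32):
        super().__init__()
        init = chainer.initializers.HeNormal(scale=0.1)  # smaller weight initialization
        with self.init_scope():
            self.conv1 = L.Convolution2D(in_ch, growth, ksize=3, pad=1, initialW=init)
            self.conv2 = L.Convolution2D(in_ch + growth, growth, ksize=3, pad=1, initialW=init)
            self.conv3 = L.Convolution2D(in_ch + 2 * growth, growth, ksize=3, pad=1, initialW=init)
            self.conv4 = L.Convolution2D(in_ch + 3 * growth, growth, ksize=3, pad=1, initialW=init)
            self.conv5 = L.Convolution2D(in_ch + 4 * growth, in_ch, ksize=3, pad=1, initialW=init)

    def __call__(self, x):
        c1 = F.leaky_relu(self.conv1(x), slope=0.2)
        c2 = F.leaky_relu(self.conv2(F.concat((x, c1))), slope=0.2)
        c3 = F.leaky_relu(self.conv3(F.concat((x, c1, c2))), slope=0.2)
        c4 = F.leaky_relu(self.conv4(F.concat((x, c1, c2, c3))), slope=0.2)
        c5 = self.conv5(F.concat((x, c1, c2, c3, c4)))
        return x + 0.2 * c5  # inner residual scaling


class RRDB(chainer.Chain):
    """Residual-in-Residual: three dense blocks plus an outer scaled skip connection."""

    def __init__(self, in_ch=64):
        super().__init__()
        with self.init_scope():
            self.block1 = DenseBlock(in_ch)
            self.block2 = DenseBlock(in_ch)
            self.block3 = DenseBlock(in_ch)

    def __call__(self, x):
        h = self.block3(self.block2(self.block1(x)))
        return x + 0.2 * h  # outer residual scaling


rrdb = RRDB()
out = rrdb(np.zeros((1, 64, 16, 16), dtype=np.float32))  # shape preserved: (1, 64, 16, 16)
```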

⚡ Loss function changes

📈 Training hyperparameter details

  • Adam learning rate of 1e-4 and epsilon of 1e-8 in generator and discriminator (0e4b599)
  • Let Generator's Perceptual Loss = 1 * Content Loss (MAE) + 0.5 * Adversarial Loss (BCE) (a2d9749; sketched below)
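
A small runnable sketch of that loss weighting in Chainer (array shapes and variable names are toy assumptions):

```python
import numpy as np
import chainer.functions as F

# Toy stand-ins for the real tensors; shapes are illustrative only.
real_hr = np.random.rand(1, 1, 32, 32).astype(np.float32)  # groundtruth DEM tile
fake_hr = np.random.rand(1, 1, 32, 32).astype(np.float32)  # generator output
fake_logits = np.random.randn(1, 1).astype(np.float32)     # discriminator logits
ones = np.ones(shape=fake_logits.shape, dtype=np.int32)    # "real" target labels

content_loss = F.mean_absolute_error(fake_hr, real_hr)         # pixel-wise MAE
adversarial_loss = F.sigmoid_cross_entropy(fake_logits, ones)  # BCE on logits
perceptual_loss = 1.0 * content_loss + 0.5 * adversarial_loss
```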

🗃️ Other changes (noted here, but won't be implemented in this pull request)

  • Content Loss before Activation layer in VGG (not using VGG loss for our geospatial problem)
  • 2x nearest neighbour upsample + conv block instead of pixelshuffle (Implement in #160 "Replace PixelShuffle Upsampling with NearestNeighbour resize followed by Conv2D". Chainer does not have a nearest neighbour upsample, only bilinear (and that doesn't support ONNX export), and PixelShuffle works fine anyway, so sticking with it).
  • More data (will do in separate pull request!)

First major change in the shift to an Enhanced Super Resolution Generative Adversarial Network (ESRGAN). Remove the BatchNormalization layer in our Generator Network (but keep it in the Discriminator Network). Fewer parameters mean less GPU memory usage, and faster model training and evaluation! This commit references experiment at https://www.comet.ml/weiji14/deepbedmap/8a28c3ff40c24a6289d56f67d159f17f.

Reduce the number of training epochs from 100 to 60, as the generator network faces exploding gradients after about 80 epochs (see e.g. https://www.comet.ml/weiji14/deepbedmap/ff07e90f755c4a6cb4b78be0632fc1ee). This also keeps our training to about 10 minutes per run, half of the 20 minutes before! Breaking down the time savings: roughly 5 minutes comes from having fewer parameters to compute, and another 5 minutes from running 40 fewer training epochs.

The caveat is that the Root Mean Squared Error on our Pine Island Glacier test dataset is extremely high, at about 500 compared to under 50 before. Predicted results are not good at all. The histogram shows high bias and high spread, although it is now a unimodal instead of a bimodal distribution... Up next is to implement more changes along the ESRGAN lines, namely modifying the neural network architecture and the loss function components.
@weiji14 weiji14 added enhancement ✨ New feature or request model 🏗️ Pull requests that update neural network model labels Jan 1, 2019
@weiji14 weiji14 added this to the v0.6.0 milestone Jan 1, 2019
@weiji14 weiji14 self-assigned this Jan 1, 2019
@weiji14 weiji14 changed the title Enhancing the Super Resolution Generative Adversarial Network WIP Enhancing the Super Resolution Generative Adversarial Network Jan 1, 2019
Change the Content Loss function in the generator network from an L2-based Mean Squared Error to an L1-based Mean Absolute Error. See the paper by [Zhao et al., 2017](https://doi.org/10.1109/TCI.2016.2644865) for the rationale behind using an L1 loss function in image generation. Basically, better convergence without falling into local minima.
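
For reference, the two pixel-wise content losses in question, with $y_i$ the groundtruth elevation and $\hat{y}_i$ the prediction over $n$ grid cells:

```latex
\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}
\qquad
\mathcal{L}_{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```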

Back to training for 100 epochs, taking about 15min compared to 10min for 60 epochs before. The Root Mean Squared Error on our test area is back down to about 50! The histogram also looks better, being a unimodal distribution instead of a bimodal one as before in c6106f9. It seems as though we can continue to train the model for several more epochs, as the PSNR metric has not plateaued. More ESRGAN improvements to come!
More closely following the ESRGAN implementation by switching from a Parametric Rectified Linear Unit (PReLU) to a Leaky Rectified Linear Unit (LeakyReLU). The RMSE performance metric on the test region has become somewhat worse, now about 100 instead of 50 before. This commit references experiment at https://www.comet.ml/weiji14/deepbedmap/a3b308f217b34a73906b3c3673b49706.

Noting down here that there have also been some experimental runs with exploding gradients. However, switching away from PReLU decreases the number of training parameters, potentially speeding up the neural network training. Coming up next in this gradual transition to ESRGAN are more massive changes to the model architecture, the loss functions, and perhaps even an overhaul of the deep learning framework being used!
Patches c2f3ec2. The residual block was not constructed properly, as we used an uppercase 'X' instead of a lowercase 'x' inside the Python function. That means the generator network's parameter counts have changed again...

Only retrained twice to test whether this fixed residual block performs better. This commit specifically references experiment at https://www.comet.ml/weiji14/deepbedmap/497bd90c68d74aaa97a63818161b3897. The RMSE_test metric is slightly above 100, similar to the best result before, but perhaps less prone to exploding/vanishing gradients? Visual inspection of the results in deepbedmap.ipynb does look better than the previous run though.
The Kahutea HPC server I've been using has just updated its CUDA drivers to version 410.79, so I'm matching that (almost). Note that we are sticking to cudatoolkit 9.0 instead of 10.0 and cudnn 7.1.2 instead of 7.2.1 (or higher) as that is what the tensorflow-gpu binaries on pypi are pre-compiled with. It is possible to use an older cudatoolkit on a newer driver (as we are doing now), but not vice versa. I.e. cudatoolkit 9.0 works on a >=410.48 driver, but cudatoolkit 10.0 won't work on a 396.xx version driver. Basically, just get the latest driver!!

Also made a slight adjustment to the Dockerfile script, adding the conda-forge/label/dev channel system-wide for consistency.
Chainer - A flexible deep learning framework. Currently only CPU compute is supported; GPU compute via CuPy will come later. Added Open Neural Network Exchange (ONNX) for cross-framework compatibility. Also pinned tornado to 5.1.1 as 6.0.0 was breaking some other packages.
Part of enabling Chainer GPU support via CuPy. Dropping tensorflow-gpu for now, as we would like to update cudatoolkit from 9.0 to 9.2 (which the official pre-compiled tensorflow-gpu binaries 1.10 to 1.12 do not support; they stick to cudatoolkit 9.0). Cudnn is also updated from 7.1.2 to 7.2.1, the latest available on conda. Note that keras/tensorflow training/evaluation should still work on CPU (just an order of magnitude slower without GPU enabled).
Enable GPU support in Chainer! Using pre-compiled CuPy binaries on CUDA 9.2.
Changing the Super Resolution Generative Adversarial Network's *Generator Model* from Keras to Chainer, while keeping the network components the same. For example, the parameter count remains exactly the same at 1604929, the same Glorot Uniform weight initialization is used, etc. The model is now 'compartmentalized', with input and residual class blocks that plug into a wrapper function, instead of a more intuitive linear Keras-like structure.

Unit doctests now include those declared in classes. The generator network is tested with one forward computation pass, which initializes the weights and allows the proper parameter count to be returned. Also made some cosmetic changes to the markdown section headings as the code is getting more complicated.
As close as possible a replacement of the original Keras implementation with a Chainer implementation of the discriminator model within the Super Resolution Generative Adversarial Network. The parameter count is down from 6828033 to 6824193, a difference of 3840. This is possibly due to some differences between Chainer's and Keras's BatchNormalization layer implementations, which I couldn't quite track down (not a simple apples-to-apples comparison).
Chainer uses NCHW tensor format instead of Keras/Tensorflow's NHWC format, so this commit has the necessary rollaxis code (CPU/GPU enabled!). Chainer also has the concept of a dataset iterator, which we create here to help with training/evaluation runs later.

Also removing scikit-learn, which was only used for doing a train_test_split. Using Chainer's native random split implementation instead.
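
For illustration, the axis rolling involved looks roughly like this (toy shapes; the same calls work on CuPy arrays):

```python
import numpy as np

nhwc = np.zeros(shape=(8, 32, 32, 1), dtype=np.float32)  # Keras/TF: batch, height, width, channels
nchw = np.rollaxis(nhwc, axis=3, start=1)                # Chainer: (8, 1, 32, 32)
back = np.rollaxis(nchw, axis=1, start=4)                # and back: (8, 32, 32, 1)
```
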
More work on the transition to Chainer. Reimplement the Peak Signal-Noise Ratio calculation in CuPy/Numpy instead of Keras. Made several adjustments to the previous implementation that was 'incorrect': 1) remove the epsilon (fingers crossed that it works later!), 2) output a single mean squared error value (instead of many values per tile), 3) change the data range from 2**16 to 2**32, which is for int32. These changes effectively produce output exactly the same as skimage.measure.compare_psnr, except that it works on a GPU, which is helpful as it saves time on copying arrays back and forth between GPU and CPU memory.
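
A sketch of that PSNR calculation under those three adjustments (swap `numpy` for `cupy` to keep everything on the GPU; the exact function in the notebook may differ in detail):

```python
import numpy as np


def psnr(y_true, y_pred, data_range=2 ** 32):
    """Peak Signal-Noise Ratio, analogous to skimage.measure.compare_psnr."""
    mse = np.mean((y_true.astype(np.float64) - y_pred.astype(np.float64)) ** 2)
    return 10.0 * np.log10((data_range ** 2) / mse)  # single value, no epsilon
```
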
Not a GAN. This is just a Super Resolution Convolutional Neural Network, similar to https://github.com/weiji14/deepbedmap/blob/v0.2.0/srcnn_train.ipynb at commit 9272ecc. Obviously more advanced now though :P.

This commit is mainly just setting up a manual neural network training loop that can run using Chainer. The train_generator function has been refactored into a train_eval_generator function that replicates most of the previous functionality, except that it now has an evaluation-only mode. Looking to further extend this work to handle more complicated loss functions later. Some LaTeX math equations were added on how the Content Loss function is defined. Cosmetic refactoring of the tqdm progressbar code was made in the main training loop.

Next up is to add the discriminator network back, so that we can call it a GAN again!
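
The manual training loop in Chainer boils down to something like this minimal sketch (a toy linear model stands in for the generator; the real train_eval_generator function does considerably more):

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

model = L.Linear(in_size=4, out_size=1)          # toy stand-in for the generator
optimizer = chainer.optimizers.Adam(alpha=1e-4)
optimizer.setup(model)

x = np.random.rand(8, 4).astype(np.float32)      # toy low resolution inputs
y = np.random.rand(8, 1).astype(np.float32)      # toy groundtruth elevations

for epoch in range(5):
    loss = F.mean_absolute_error(model(x), y)    # L1 content loss
    model.cleargrads()                           # clear stale gradients
    loss.backward()                              # backpropagate
    optimizer.update()                           # apply the Adam step
```
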
Lengthy mathematical definition of a numerically stable Adversarial Loss (Sigmoid Binary Cross Entropy) documented in Markdown/LaTeX. The Generator Loss function was fixed to remove a division by batchsize (it was already divided, so it didn't need to be done again), and unit tests were added for good measure. Also added the adversarial loss to the generator loss in a 1:1 fashion, and wrote up the code for calculating the standard discriminator loss. The Sigmoid activation in the Discriminator was removed since it is now defined in the loss function.
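
For reference, the standard numerically stable form of that sigmoid binary cross entropy on a logit $x$ with target $t \in \{0, 1\}$ (the notebook's derivation may use different notation):

```latex
\ell(x, t) = \max(x, 0) - x t + \log\left(1 + e^{-|x|}\right)
```
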
Back to a real Super Resolution *Generative Adversarial Network*, now using Chainer instead of Keras! Discriminator training function reimplemented in Chainer using the Adversarial Loss (sigmoid cross entropy) function defined in the previous commit. Logging only four metrics in total now (2 losses, 1 generator psnr, 1 discriminator accuracy), with some cosmetic refactoring made to the main training for-loop code block.

Great to see that model training is 3 minutes faster than in e1e7144. Yet to properly evaluate the predicted results, so will do that in the next commit by re-adapting the deepbedmap.ipynb scripts.
Going from keras's model save functions to Chainer-based ones. Using Chainer's serializer to save the trained model's parameters to Numpy's zipped format. Using ONNX_chainer to save the model's parameters to ONNX format (binary and text).
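
A minimal round-trip with Chainer's serializer, using a toy model in place of the generator (the ONNX export via onnx_chainer follows the same pattern but is omitted here):

```python
import numpy as np
import chainer.links as L
from chainer import serializers

model = L.Linear(in_size=None, out_size=2)        # toy stand-in for the generator
model(np.zeros((1, 4), dtype=np.float32))         # one forward pass initializes weights

serializers.save_npz("model_weights.npz", model)  # save to Numpy's zipped format
serializers.load_npz("model_weights.npz", model)  # reload for inference later
```
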
Replacing the backend of our DeepBedMap application, from Keras to Chainer! Most changes are to do with changing the hardcoded image format from NHWC to NCHW in deepbedmap.ipynb. Model weights now loaded from Chainer's .npz format (see d63f3a8) instead of Keras's hdf5 format.

Also, we are logging our experiment data in Comet.ML again, though some information is missing (e.g. system metrics, graph output, etc). This commit references https://www.comet.ml/weiji14/deepbedmap/ea5d527c62d441a2bec3004fb130a9dc. Next step will be to remove the deprecated code still referencing Keras.
Bye bye Keras, cue "Thanks for the Memories". Remove Keras and Tensorflow from dependencies, which should really thin down our docker image size! Unit tests should run much faster with the removal of the old keras model architecture and compile functions.
Updating model/weights/README.md and removing the old keras architecture json file in light of d63f3a8 and ee1e9df. Specifically, the generator model architecture is stored in ONNX instead of JSON format, and model weights are stored in NPZ instead of HDF5 format.
Closes #81. Moving from good ol' Keras to Chainer allows more flexibility in defining our loss function. Especially helpful for training 'state-of-the-art' Generative Adversarial Networks that have non-standard Generator/Discriminator loss functions.
Implement a Relativistic Standard GAN (RSGAN) according to paper at https://arxiv.org/abs/1807.00734, github repo at https://github.com/AlexiaJM/RelativisticGAN. Specifically, relativism (if that's a word) is added via the adversarial loss. This is found in both the generator and discriminator loss functions, though the calculation is mirrored as both networks are competing against each other (i.e. minimizing the cost function in different directions).

Root Mean Squared Error test results seem a bit better than with a Standard GAN. The discriminator loss seems quite small and flat, indicating little training, perhaps because the discriminator model has already converged? Generator training does not appear to have reached a plateau, so there is more room for improvement. TODO: step up to a Relativistic Average GAN (RaGAN), and document this improved Adversarial Loss function in the jupyter notebook.
Upgrade our Relativistic Standard GAN (RSGAN) to a Relativistic Average GAN (RaGAN)! We calculate the difference between our (real/fake) predicted labels and *averaged* (fake/real) predicted labels, and set the target difference accordingly depending on whether we're training the generator or discriminator neural network.

This commit specifically references experimental run at https://www.comet.ml/weiji14/deepbedmap/0121a4b4d7ad46b991f9cc9efdb319da. The results are promising, giving a RMSE of 87.2 compared with >100 in the last few non-RaGAN experimental runs. Caution though, as we haven't actually tested this extensively for more experimental runs.
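
A hedged Chainer sketch of the relativistic average discriminator loss described above (function and variable names are assumptions; the generator's loss mirrors this with the target labels swapped):

```python
import numpy as np
import chainer
import chainer.functions as F


def ragan_discriminator_loss(real_logits, fake_logits):
    """Relativistic average GAN loss (Jolicoeur-Martineau, 2018), discriminator side."""
    xp = chainer.backend.get_array_module(real_logits)
    ones = xp.ones(shape=real_logits.shape, dtype=xp.int32)
    zeros = xp.zeros(shape=fake_logits.shape, dtype=xp.int32)
    avg_fake = F.broadcast_to(F.mean(fake_logits), real_logits.shape)
    avg_real = F.broadcast_to(F.mean(real_logits), fake_logits.shape)
    # Real samples should look more real than the average fake, and vice versa
    return F.sigmoid_cross_entropy(real_logits - avg_fake, ones) + \
           F.sigmoid_cross_entropy(fake_logits - avg_real, zeros)


d_loss = ragan_discriminator_loss(
    np.random.randn(4, 1).astype(np.float32),  # toy real logits
    np.random.randn(4, 1).astype(np.float32),  # toy fake logits
)
```
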
Implement a denser, deeper, Residual-in-Residual Dense Block (RRDB) in the main trunk of our Generator Network! After several manual experimental tweaks of the hyperparameters (number of residual blocks and number of training epochs), plus some luck, we finally have a great-ish result! The GeneratorNetwork class can now simply be called to build the network, as we have set some default parameters inside.

Using only 2 RRDB blocks, with a total of 30 convolutional layers (2x5x3), up from the 16 convolutional layers we used before. The original ESRGAN paper by Wang et al. 2018 uses 23 RRDB blocks. Also training for just 50 epochs instead of 100. RMSE is under 60, better than bicubic, but there is a degree of luck involved. Visual inspection of the results also shows heavy checkerboard artifacts in the topography.

In order to train a much deeper network without exploding/vanishing gradients, it does appear that we will need to follow the paper's recommendation to use a smaller initialization (i.e. He initialization x0.1) than the one currently used (i.e. Glorot Uniform, a hangover default initialization from using Keras).
Using a smaller weight initialization (He Normal) with a scaling of 0.1, instead of the previous weight initialization (Glorot Uniform) with a scaling of 1.0. See https://github.com/xinntao/BasicSR/blob/master/codes/models/networks.py for the original Pytorch weight initialization implementation details. Also made the Generator Network deeper, with 4 Residual-in-Residual Dense Blocks instead of 2, translating to 60 convolutional layers (4x3x5).

Results are much better than before, with an RMSE of 41.26 compared to 58.92 before. Less checkerboard artifacts, but there are some strange stripes. Histogram of elevation error is looking great! This commit references experiment at https://www.comet.ml/weiji14/deepbedmap/de9e8151127b43069780be18c8d738d8.
Stabilize training by decreasing Adam learning rate from 1e-3 to 2e-4, and epsilon value from 1e-7 to 1e-8. Also started to properly log those hyperparameter settings to Comet.ML, instead of writing down notes on what's been tweaked between experiments.

Now following Wang et al. 2018's Pytorch ESRGAN training details more closely (see https://github.com/xinntao/BasicSR/blob/ef680f2de06a6501ede8c797e0d8d2f3ea46ca81/codes/options/train/train_ESRGAN.json#L56-L65). From the paper, they first use a learning rate of 2e-4 to train a PSNR-oriented network, and drop it to 1e-4 to train with the perceptual loss function. We will stick to just 2e-4, and not use weight decay. Results are actually worse than the previous run, with an RMSE of 84.69 instead of 41.26 before. But the training looks more stable, and it seems that we can increase the number of training epochs to get a better result without exploding gradients.
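
As a sketch, the corresponding optimizer setup in Chainer (the model variables are hypothetical; note that Chainer names the Adam learning rate `alpha`):

```python
import chainer

g_optimizer = chainer.optimizers.Adam(alpha=2e-4, eps=1e-8)  # generator
d_optimizer = chainer.optimizers.Adam(alpha=2e-4, eps=1e-8)  # discriminator
# g_optimizer.setup(generator_model)      # attach to the respective networks
# d_optimizer.setup(discriminator_model)
```
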
Set Generator Loss = 5e-3 * Adversarial Loss + 1e-2 * Content Loss. Basically giving 2x more weight to the pixel-wise Content Loss (Mean Absolute Error) than to the relativistic average Adversarial Loss. The results don't seem to be significantly better (actually slightly worse), but at least the generator loss plot is nicer to look at, with small values instead of a couple of hundred units.

Also removing the "Work in Progress" note, since we are almost done! Just a few more modifications to the architecture perhaps to resolve some checkerboarding artifacts.
Attempt to make Super-Resolution results better by using two convolutional layers (kernel size 3) with a LeakyReLU in between after the pixelshuffle upsampling layers, instead of just one convolutional layer (kernel size 9). This follows Wang et al. 2018's ESRGAN implementation more closely. Checkerboard artifacts do remain, but maybe longer training will help.
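
A rough Chainer sketch of that tail, assuming 4x upsampling and illustrative channel counts (not necessarily the exact layer names used here):

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L


class UpsampleTail(chainer.Chain):
    """Pixelshuffle upsample followed by two 3x3 convs with a LeakyReLU in between."""

    def __init__(self, in_ch=64, out_ch=1, scale=4):
        super().__init__()
        with self.init_scope():
            # depth2space needs in_ch * scale**2 channels before rearranging pixels
            self.pre = L.Convolution2D(in_ch, in_ch * scale ** 2, ksize=3, pad=1)
            self.conv1 = L.Convolution2D(in_ch, in_ch, ksize=3, pad=1)
            self.conv2 = L.Convolution2D(in_ch, out_ch, ksize=3, pad=1)
        self.scale = scale

    def __call__(self, x):
        h = F.depth2space(self.pre(x), self.scale)   # pixelshuffle upsampling
        h = F.leaky_relu(self.conv1(h), slope=0.2)   # first ksize=3 conv + LeakyReLU
        return self.conv2(h)                         # second ksize=3 conv


tail = UpsampleTail()
out = tail(np.zeros((1, 64, 9, 9), dtype=np.float32))  # -> (1, 1, 36, 36)
```
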
Patch the Residual in Residual Dense Block from c607fd3 by adding the residual scaling factor.
Trying to get the results to beat the bicubic baseline. Didn't really want to deviate from the 2e-4 learning rate setting as in the ESRGAN paper, but having exhausted tweaking most other hyperparameter settings, the learning rate seems to be the one to tweak. This commit references experiment at https://www.comet.ml/weiji14/deepbedmap/d64dd9dd8dc54b3397a36d26337080c3.

Using a middle-ground learning rate value of 6e-4. This is halfway between the 1e-3 used by our last 'champion' model at 5b87c0c (tracked at https://www.comet.ml/weiji14/deepbedmap/de9e8151127b43069780be18c8d738d8) and the 2e-4 recommended by Wang et al. 2018's ESRGAN paper. The learning rate of 1e-3 is quite unstable and hard to replicate, while 2e-4 did not give good RMSE results. This 6e-4 compromise is still a bit finicky, but it will do for now until we do some hyperparameter optimization later.
@weiji14 weiji14 changed the title WIP Enhancing the Super Resolution Generative Adversarial Network Enhancing the Super Resolution Generative Adversarial Network Feb 3, 2019
@weiji14 weiji14 merged commit a87d2eb into master Feb 3, 2019
weiji14 added a commit that referenced this pull request Feb 3, 2019
Closes #78. Enhancing the Super Resolution Generative Adversarial Network.
@weiji14 weiji14 deleted the model/esrgan branch February 6, 2019 05:11
weiji14 added a commit that referenced this pull request Mar 19, 2019
Noticed that the discriminator network doesn't quite follow ESRGAN (it is more like SRGAN's). Patching #78 by increasing the depth of the discriminator from 9 to 10 blocks, using a kernel size of 4 in some Conv2D layers, and setting the penultimate fully connected layer to 100 instead of 1024 neurons. See https://github.com/xinntao/BasicSR/blame/902b4ae1f4beec7359de6e62ed0aebfc335d8dfd/codes/models/modules/architecture.py#L86-L129 for the original Pytorch implementation details.

The discriminator has become stronger, and it actually took a few experiments to get a good RMSE test result. That means there will be a need to retune our hyperparameters. This commit references experiment at https://www.comet.ml/weiji14/deepbedmap/80c51658b2074743ba5151cde7d24560 with an RMSE test of 46.46.
weiji14 added a commit that referenced this pull request Mar 22, 2019
All this time we had a higher adversarial weighting than the content loss?!! Critical patch for a2d9749 in #78. Unit tests updated, and rightly so, the loss is higher meaning we have some optimization work to do!
weiji14 added a commit that referenced this pull request Mar 28, 2019
Yet another critical patch for #78 and #129, can't believe it... Change discriminator to use HeNormal initialization instead of GlorotUniform, a hangover from using Keras, see #81. Refer to relevant code in ESRGAN's original Pytorch implementation at https://github.com/xinntao/BasicSR/blob/477e14e97eca4cb776d3b37667d42f8484b8b68b/codes/models/networks.py (where it's called kaiming initialization). This initializer change was recorded in one successful training round reviewable at https://www.comet.ml/weiji14/deepbedmap/17cfbfd5a54043c3a39b5ba183b1cc68.

Also noticed that Chainer's BatchNormalization behaves differently depending on whether the global_config.train flag is set to True or False. My assumption that simply not passing in an optimizer during the evaluation stage would be enough was incorrect. To be precise, the setting should only affect the Discriminator neural network since that's where we have BatchNormalization layers, but we've added the config flag to both the train_eval_generator and train_eval_discriminator functions to be extra sure. With that, the final recorded Comet.ML experiment this commit references is at https://www.comet.ml/weiji14/deepbedmap/44acdbc1127f4440891ed905846401cf.

Note that we have not retuned any hyperparameters, though that would be a smart thing to do. If you review the Comet.ML experiments, you'll notice that there were two cases of exploding gradients when these hotfixes were implemented. The discriminator's loss and accuracy charts look very different now, and in particular, there is a significant gap between the training and validation metrics.
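
A self-contained illustration of the Chainer config behaviour in question:

```python
import numpy as np
import chainer
import chainer.links as L

bn = L.BatchNormalization(size=3)
x = np.random.rand(2, 3).astype(np.float32)

y_train = bn(x)  # global_config.train is True by default: uses batch statistics
with chainer.using_config("train", False):
    y_eval = bn(x)  # evaluation mode: uses the accumulated moving statistics
```
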
weiji14 added a commit that referenced this pull request Apr 15, 2020
Took some time to write up this section, as it required drawing a good-looking model architecture figure! Writing-wise, mainly outlining the key differences between our model and ESRGAN; hopefully I've listed them all. See also #78 if necessary.

The deepbedmap_model_architecture pdf figure is produced using Tikz/TeX via a Python script (both of which I've provided). Specifically giving credit to the code at HarisIqbal88/PlotNeuralNet@894567f. The input BEDMAP2 image was downloaded from http://cdn.antarcticglaciers.org/wp-content/uploads/2013/08/bedmap2_preview-300x300.png, though it's mirror-flipped in the figure, which I'll have to fix. Also changed the incorrect numbering of the LaTeX equations from the last commit...