
Conversation


@weiji14 weiji14 commented Jan 9, 2019

Required by #78. The Relativistic Discriminator in the Enhanced Super Resolution Generative Adversarial Network (ESRGAN) is more advanced, requiring a custom loss function that Keras itself cannot intuitively handle. Switching to Chainer's "Define-By-Run" scheme (similar to Pytorch) gives us greater flexibility in designing and training our neural network, at the expense of some user-friendliness.
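For context, here is a rough sketch of what the Relativistic average Discriminator loss from the ESRGAN paper could look like under Chainer's define-by-run scheme. The function name, the use of raw logits, and the reliance on arithmetic broadcasting are illustrative assumptions on my part, not the final implementation:

```python
import chainer
import chainer.functions as F


def relativistic_discriminator_loss(real_logits, fake_logits):
    """Relativistic average discriminator loss: D_Ra = sigmoid(C(x_r) - E[C(x_f)])."""
    xp = chainer.backend.get_array_module(real_logits.array)
    ones = xp.ones(shape=real_logits.shape, dtype=xp.int32)    # real tiles labelled 1
    zeros = xp.zeros(shape=fake_logits.shape, dtype=xp.int32)  # generated tiles labelled 0
    # sigmoid_cross_entropy applies the sigmoid internally, so raw logits go in
    loss_real = F.sigmoid_cross_entropy(real_logits - F.mean(fake_logits), ones)
    loss_fake = F.sigmoid_cross_entropy(fake_logits - F.mean(real_logits), zeros)
    return loss_real + loss_fake
```

Expressing a loss like this as plain Python code on intermediate outputs is exactly the kind of thing that is awkward to do through Keras's compile/fit interface.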

Other benefits of Chainer over Keras:

  • Better floating point 16 support (for faster training)
  • Higher Pull-Request to Issues ratio on Github (possibly easier to contribute to code)
  • Uses CuPy which fits into the PyData ecosystem.
  • etc

TODO:

  • Update CUDA driver to 10.0 (4dd92c8) and other cuda libraries (b22841d)
  • Pip install chainer (e6a3066) and cupy-cuda100 (a2a46a0)
  • Reimplement Generator Network in Chainer (ff53f7f)
  • Reimplement Discriminator Network in Chainer (c8bc374)
  • Ensure that SRGAN training runs properly (072ddda)
  • Switch deepbedmap app from Keras to Chainer backend (29e97f9)
  • Remove Keras/Tensorflow from dependencies (ee1e9df)

The Kahutea HPC server I've been using has just updated its CUDA drivers to version 410.79, so I'm matching that (almost). Note that we are sticking to cudatoolkit 9.0 instead of 10.0 and cudnn 7.1.2 instead of 7.2.1 (or higher) as that is what the tensorflow-gpu binaries on pypi are pre-compiled with. It is possible to use an older cudatoolkit on a newer driver (as we are doing now), but not vice versa. I.e. cudatoolkit 9.0 works on a >=410.48 driver, but cudatoolkit 10.0 won't work on a 396.xx version driver. Basically, just get the latest driver!!

Also made a slight adjustment to the Dockerfile: the conda-forge/label/dev channel is now added system-wide for consistency.
@weiji14 weiji14 added enhancement ✨ New feature or request model 🏗️ Pull requests that update neural network model labels Jan 9, 2019
@weiji14 weiji14 added this to the v0.6.0 milestone Jan 9, 2019
@weiji14 weiji14 self-assigned this Jan 9, 2019
Chainer - A flexible deep learning framework. Currently only CPU compute is supported; GPU compute via CuPy will come later. Added Open Neural Network Exchange (ONNX) for cross-framework compatibility. Also pinned tornado to 5.1.1 as 6.0.0 was breaking some other packages.
Part of enabling Chainer GPU support via CuPy. Dropping tensorflow-gpu for now, as we would like to update cudatoolkit from 9.0 to 9.2 (which the official pre-compiled tensorflow-gpu binaries 1.10 to 1.12 do not support, as they stick to cudatoolkit 9.0). Cudnn is also updated from 7.1.2 to 7.2.1, which is the latest available on conda. Note that keras/tensorflow training/evaluation should still work on CPU (just an order of magnitude slower without GPU enabled).
Enable GPU support in Chainer! Using pre-compiled CuPy binaries on CUDA 9.2.
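A quick way to confirm that the CuPy-backed GPU support is actually picked up (just a sanity check, not part of the repository code):

```python
import chainer

chainer.print_runtime_info()            # shows Chainer, NumPy and CuPy/CUDA versions
print(chainer.backends.cuda.available)  # True when a CUDA-capable GPU is usable
```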
Changing the Super Resolution Generative Adversarial Network's *Generator Model* from Keras to Chainer, while keeping the network components the same. For example, the parameter count remains exactly the same at 1604929, the same Glorot Uniform weight initialization is used, etc. The model is now 'compartmentalized', with input and residual blocks defined as classes that plug into a wrapper function, instead of the more intuitive linear Keras-like structure.

Unit doctests now include those declared in classes. The Generator network is tested for one forward computation pass, which initializes the weights and allows the proper parameter count to be returned. Also made some cosmetic changes to the markdown section headings as the code is getting more complicated.
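As an illustration of the 'compartmentalized' style described above (the class name and layer sizes below are placeholders, not the exact DeepBedMap blocks), a residual block in Chainer looks roughly like this:

```python
import chainer
import chainer.functions as F
import chainer.links as L


class ResidualBlock(chainer.Chain):
    """A minimal residual block: two 3x3 convolutions plus a skip connection."""

    def __init__(self, channels=64):
        super().__init__()
        init = chainer.initializers.GlorotUniform()  # same initializer as the Keras model
        with self.init_scope():
            self.conv1 = L.Convolution2D(channels, channels, ksize=3, pad=1, initialW=init)
            self.conv2 = L.Convolution2D(channels, channels, ksize=3, pad=1, initialW=init)

    def __call__(self, x):
        h = F.relu(self.conv1(x))
        h = self.conv2(h)
        return h + x  # skip connection back to the block input
```

Several such blocks can then be chained together inside a wrapper Chain, rather than stacking layers in one long Sequential-style list.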
A replacement, as close as possible, of the original Keras implementation of the discriminator model within the Super Resolution Generative Adversarial Network with a Chainer one. The parameter count is down from 6828033 to 6824193, a difference of 3840. This is possibly due to some differences between Chainer's and Keras's BatchNormalization layer implementations, which I couldn't quite track down (not a simple apples to apples comparison).
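The parameter counts above can be reproduced with a one-liner over Chainer's params() generator (remembering that the weights only get their shapes after one forward pass):

```python
def count_params(model):
    """Total number of parameters in a Chainer Link/Chain."""
    return sum(param.size for param in model.params())


# e.g. count_params(discriminator) should report 6824193 after one forward pass
```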
Chainer uses NCHW tensor format instead of Keras/Tensorflow's NHWC format, so this commit has the necessary rollaxis code (CPU/GPU enabled!). Chainer also has the concept of a dataset iterator, which we create here to help with training/evaluation runs later.

Also removing scikit-learn which was only used for doing a train_test_split. Using Chainer's native random split implementation instead.
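A minimal sketch of the axis shuffle and the dataset/iterator setup described above (the tile sizes and the roughly 90/10 split below are illustrative assumptions):

```python
import numpy as np
import chainer

# Keras/Tensorflow tiles are NHWC; roll the channel axis forward to get Chainer's NCHW
tiles_nhwc = np.random.rand(128, 32, 32, 1).astype(np.float32)  # (batch, height, width, channel)
tiles_nchw = np.rollaxis(tiles_nhwc, axis=3, start=1)           # (batch, channel, height, width)

# Chainer-native train/dev split and iterator, replacing sklearn's train_test_split
dataset = chainer.datasets.TupleDataset(tiles_nchw, tiles_nchw)
train_set, dev_set = chainer.datasets.split_dataset_random(dataset, first_size=115, seed=42)
train_iter = chainer.iterators.SerialIterator(train_set, batch_size=32, repeat=True, shuffle=True)
```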
More work on the transition to Chainer. Reimplement the Peak Signal-Noise Ratio calculation in CuPy/Numpy instead of Keras. Made several adjustments to the previous implementation that was 'incorrect': 1) remove the epsilon (fingers crossed that it works later!), 2) output a single value mean squared error (instead of many values per tile), 3) change the data range from 2**16 to 2**32, which is for int32. These changes effectively result in an output exactly the same as skimage.measure.compare_psnr, except that it works on a GPU, which is helpful as it saves time copying the arrays back and forth between GPU and CPU memory.
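A hedged sketch of what such a GPU-friendly PSNR can look like (the exact function in the notebook may differ):

```python
import chainer


def psnr(y_pred, y_true, data_range=2 ** 32):
    """Peak Signal-to-Noise Ratio over a whole batch, on either numpy or cupy arrays."""
    xp = chainer.backend.get_array_module(y_pred)  # numpy on CPU, cupy on GPU
    mse = xp.mean((y_pred.astype("float64") - y_true.astype("float64")) ** 2)  # single value
    return 10.0 * xp.log10((data_range ** 2) / mse)
```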
Not a GAN. This is just a Super Resolution Convolutional Neural Network, similar to https://github.com/weiji14/deepbedmap/blob/v0.2.0/srcnn_train.ipynb at commit 9272ecc. Obviously more advanced now though :P.

This commit is mainly just setting up a manual neural network training loop that can run using Chainer. The train_generator function has been refactored into a train_eval_generator function that replicates most of the previous functionality, except that it now has an evaluation-only mode. Looking to further extend this work to handle more complicated loss functions later. Some LaTeX math equations were added on how the Content Loss function is defined. Cosmetic refactoring of the tqdm progressbar code was done in the main training loop.

Next up is to add the discriminator network back, so that we can call it a GAN again!
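In spirit, the manual training loop above boils down to something like the sketch below, where `generator` and `train_iter` are the placeholder model and iterator from the earlier sketches and the hyperparameter values are illustrative only:

```python
import chainer
import chainer.functions as F

optimizer = chainer.optimizers.Adam(alpha=0.001)
optimizer.setup(generator)  # hypothetical Chainer generator model

while train_iter.epoch < 100:
    batch = train_iter.next()
    x_lowres, y_highres = chainer.dataset.concat_examples(batch)
    y_pred = generator(x_lowres)
    content_loss = F.mean_squared_error(y_pred, y_highres)  # content loss only, no GAN yet
    generator.cleargrads()
    content_loss.backward()
    optimizer.update()
```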
Lengthy mathematical definition of a numerically stable Adversarial Loss (Sigmoid Binary Cross Entropy) documented in Markdown/LaTeX. The Generator Loss function was fixed to remove a division by batch size (it was already divided, so it didn't need to be done again), and unit tests added for good measure. Also added the adversarial loss to the generator loss in a 1:1 fashion, and wrote up the code for calculating the standard discriminator loss. The sigmoid activation in the Discriminator was removed since it is now applied within the loss function.
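For reference, one common numerically stable form of that sigmoid binary cross entropy on a raw logit x with label z in {0, 1} is:

```latex
\ell(x, z) = -z \log \sigma(x) - (1 - z) \log\bigl(1 - \sigma(x)\bigr)
           = \max(x, 0) - x z + \log\bigl(1 + e^{-\lvert x \rvert}\bigr)
```

Working on raw logits like this is why the sigmoid layer can be dropped from the Discriminator itself.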
Back to a real Super Resolution *Generative Adversarial Network*, now using Chainer instead of Keras! Discriminator training function reimplemented in Chainer using the Adversarial Loss (sigmoid cross entropy) function defined in the previous commit. Logging only four metrics in total now (2 losses, 1 generator psnr, 1 discriminator accuracy), with some cosmetic refactoring made to the main training for-loop code block.

Great to see that model training is 3 minutes faster than in e1e7144. I've yet to properly evaluate the predicted results, which I'll do in the next commit by re-adapting the deepbedmap.ipynb scripts.
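A minimal sketch of that standard (non-relativistic) discriminator loss in Chainer; the function and variable names here are assumptions, not the notebook's exact code:

```python
import chainer
import chainer.functions as F


def discriminator_loss(real_logits, fake_logits):
    """Standard adversarial loss: real tiles labelled 1, generated tiles labelled 0."""
    xp = chainer.backend.get_array_module(real_logits.array)
    real_labels = xp.ones(shape=real_logits.shape, dtype=xp.int32)
    fake_labels = xp.zeros(shape=fake_logits.shape, dtype=xp.int32)
    # the sigmoid is applied inside the loss, matching the removal of the Discriminator's sigmoid layer
    return F.sigmoid_cross_entropy(real_logits, real_labels) + F.sigmoid_cross_entropy(fake_logits, fake_labels)
```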
Going from Keras's model save functions to Chainer-based ones. Using Chainer's serializer to save the trained model's parameters to Numpy's zipped format. Using onnx_chainer to save the model's parameters to ONNX format (binary and text).
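Roughly, the saving step boils down to two calls; the file names, the `generator` placeholder, and the dummy input shape here are illustrative only:

```python
import numpy as np
import chainer
import onnx_chainer

# Trained weights go into NumPy's zipped .npz format via Chainer's serializer
chainer.serializers.save_npz("generator_model_weights.npz", generator)

# Architecture + weights exported to ONNX for cross-framework compatibility
dummy_input = np.zeros(shape=(1, 1, 32, 32), dtype=np.float32)  # one NCHW tile
onnx_chainer.export(generator, dummy_input, filename="generator_model_architecture.onnx")
```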
Replacing the backend of our DeepBedMap application, from Keras to Chainer! Most changes are to do with changing the hardcoded image format from NHWC to NCHW in deepbedmap.ipynb. Model weights now loaded from Chainer's .npz format (see d63f3a8) instead of Keras's hdf5 format.

Also, we are logging our experiment data in Comet.ML again, though there is some missing information (e.g. system metrics, graph output, etc). This commit references https://www.comet.ml/weiji14/deepbedmap/ea5d527c62d441a2bec3004fb130a9dc. The next step will be to remove the deprecated code still referencing Keras.
Bye bye Keras, cue "Thanks for the Memories". Remove Keras and Tensorflow from dependencies, which should really thin down our docker image size! Unit tests should run much faster with the removal of the old keras model architecture and compile functions.
Updating model/weights/README.md and removing the old Keras architecture JSON file in light of d63f3a8 and ee1e9df. Specifically, the generator model architecture is now stored in ONNX instead of JSON format, and model weights are stored in NPZ instead of HDF5 format.
@weiji14 weiji14 changed the title WIP Switch deep learning framework from Keras to Chainer Switch deep learning framework from Keras to Chainer Jan 22, 2019
@weiji14 weiji14 merged commit a0818e2 into model/esrgan Jan 22, 2019
weiji14 added a commit that referenced this pull request Jan 22, 2019
Closes #81. Moving from good ol' Keras to Chainer allows more flexibility in defining our loss function. Especially helpful for training 'state-of-the-art' Generative Adversarial Networks that have non-standard Generator/Discriminator loss functions.
@weiji14 weiji14 deleted the keras_to_chainer branch January 22, 2019 19:54
weiji14 added a commit that referenced this pull request Feb 11, 2019
Defaulting to the Number-Channel-Height-Width (NCHW) format for our arrays. This patches 8979878 in the #81 Keras to Chainer move. Mostly just removing the hardcoded np.rollaxis lines which change the shape from (1,height,width) to (height,width,1) in data_prep.ipynb and deepbedmap.ipynb.

Also making sure that all our recent library upgrades work properly, and they do! Pandas HTML table formatting seems to have changed a bit but is otherwise fine.
weiji14 added a commit that referenced this pull request Feb 11, 2019
Defaulting to the Number-Channel-Height-Width (NCHW) format for our arrays. This patches ffd650c in the #81 Keras to Chainer move. Mostly just removing the hardcoded np.rollaxis lines which change the shape from (1,height,width) to (height,width,1) in data_prep.ipynb and deepbedmap.ipynb.

Also making sure that all our recent library upgrades work properly, and they do! Pandas HTML table formatting seems to have changed a bit but is otherwise fine.
weiji14 added a commit that referenced this pull request Mar 28, 2019
Yet another critical patch for #78 and #129, can't believe it... Change discriminator to use HeNormal initialization instead of GlorotUniform, a hangover from using Keras, see #81. Refer to relevant code in ESRGAN's original Pytorch implementation at https://github.com/xinntao/BasicSR/blob/477e14e97eca4cb776d3b37667d42f8484b8b68b/codes/models/networks.py (where it's called kaiming initialization). This initializer change was recorded in one successful training round reviewable at https://www.comet.ml/weiji14/deepbedmap/17cfbfd5a54043c3a39b5ba183b1cc68.

Also noticed that Chainer's BatchNormalization behaves differently when the global_config.train flag is set to True/False. My assumption that simply not passing in an optimizer during the evaluation stage would be enough was incorrect. To be precise, the setting should only affect the Discriminator neural network since that's where we have BatchNormalization layers, but we've added the config flag to both the train_eval_generator and train_eval_discriminator functions to be extra sure. With that, the final recorded Comet.ML experiment this commit references is at https://www.comet.ml/weiji14/deepbedmap/44acdbc1127f4440891ed905846401cf.

Note that we have not retuned any hyperparameters, though that would be a smart thing to do. If you review the Comet.ML experiments, you'll notice that there were two cases of exploding gradients when these hotfixes were implemented. The discriminator's loss and accuracy charts look very different now, and in particular, there is a significant gap between the training and validation metrics.
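The two fixes above boil down to something like this, where `generator`, `discriminator` and `X_dev` are placeholders rather than the notebook's exact names:

```python
import chainer

# 1. He (Kaiming) initialization for the Discriminator's convolutions, as in the Pytorch ESRGAN
init = chainer.initializers.HeNormal()

# 2. BatchNormalization must know whether we are training or evaluating,
#    so evaluation-only forward passes are wrapped explicitly
with chainer.using_config("train", False):
    fake_tiles = generator(X_dev)
    fake_logits = discriminator(fake_tiles)
```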