
What is the learning rate decay and preprocessing you used in your training? #56

kwotsin opened this issue Jun 6, 2017 · 13 comments

kwotsin commented Jun 6, 2017

Thanks for providing the source code of this fantastic architecture. I am trying to clarify the learning rate decay mentioned in your opts.lua file - is the learning rate decayed by 1e-1 or 1e-7 every 100 epochs? From your training, it seems that you didn't set the -d parameter, so would the decay default to 1e-7?

However, the comment you gave for lrDecayEvery is:

--lrDecayEvery (default 100) Decay learning rate every X epoch by 1e-1

So I'd like to ask if the decay rate should be 1e-7 or 1e-1 every 100 epochs.

Also, what do you mean by # samples in this line?

-d,--learningRateDecay (default 1e-7) learning rate decay (in # samples)


Also, could I know how you performed your preprocessing for the training/evaluation data?

kwotsin changed the title from "What is the learning rate decay you used in your training?" to "What is the learning rate decay and preprocessing you used in your training?" on Jun 6, 2017

codeAC29 commented Jun 6, 2017

  1. --learningRateDecay is 1e-1; the "# samples" in that comment is there by mistake and has no meaning (a short sketch of the schedule follows below).
  2. We do not perform any preprocessing.
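For concreteness, a minimal sketch of the schedule described in point 1, i.e. multiplying the learning rate by 1e-1 every lrDecayEvery epochs; the variable names and the base learning rate are illustrative assumptions, not values taken from opts.lua:

        -- sketch: step decay of the learning rate (assumed names and values)
        local baseLR       = 5e-4   -- example initial learning rate
        local lrDecayEvery = 100    -- decay every 100 epochs
        local lrDecay      = 1e-1   -- multiply the learning rate by 0.1 at each step

        local function learningRate(epoch)
           local steps = math.floor((epoch - 1) / lrDecayEvery)
           return baseLR * (lrDecay ^ steps)
        end

        -- e.g. epochs 1-100 use baseLR, 101-200 use baseLR * 0.1, 201-300 use baseLR * 0.01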


kwotsin commented Jun 7, 2017

Thank you for your reply! May I also confirm with you that, for each dataset, you trained the model for a total of 300 epochs and performed the decay only every 100 epochs?


codeAC29 commented Jun 7, 2017

Yes that is correct.


kwotsin commented Jun 8, 2017

Thank you for the confirmation. Could I also know whether you kept dropout and batch norm active when evaluating the test data? For many models, I think turning them off at test time is the standard thing to do; however, on my side I see a large difference in performance when I turn off batch norm and dropout.

Also, could I confirm the dataset you were using is equivalent to what is found here: https://github.com/alexgkendall/SegNet-Tutorial/tree/master/CamVid

Thank you once again.


codeAC29 commented Jun 9, 2017

  1. If by turning it off you mean deleting batchnorm, then no, you cannot do that. You need to adjust the weights of the previous conv layer before getting rid of the batchnorm layer (a sketch of this folding follows below). Once that is done, I don't think there will be any difference in performance.
  2. Yes, the dataset used here is equivalent to the one you mentioned in your comment.
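To illustrate the adjustment mentioned in point 1, here is a minimal, hypothetical sketch of folding a trained nn.SpatialBatchNormalization into the nn.SpatialConvolution that precedes it; the helper name and the assumption that the convolution has a bias term are illustrative, this is not code from the repository:

        require 'nn'

        -- hypothetical helper: fold a batchnorm layer into the preceding convolution
        local function foldBatchNorm(conv, bn)
           -- scale = gamma / sqrt(running_var + eps)
           local invstd = bn.running_var:clone():add(bn.eps):sqrt():pow(-1)
           local gamma  = bn.affine and bn.weight or torch.ones(invstd:size(1))
           local beta   = bn.affine and bn.bias   or torch.zeros(invstd:size(1))
           local scale  = torch.cmul(gamma, invstd)

           -- w' = w * scale (per output channel), b' = (b - running_mean) * scale + beta
           for c = 1, conv.weight:size(1) do
              conv.weight[c]:mul(scale[c])
           end
           conv.bias:add(-1, bn.running_mean):cmul(scale):add(beta)  -- assumes conv.bias exists
        end

After this folding, the batchnorm layer can be removed and the inference output should stay numerically the same, which is the adjustment referred to above.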


kwotsin commented Jun 9, 2017

I have tested on the test dataset with dropout and batch_norm activated, and the results seem to be better than with either batch_norm or dropout turned off (or both). Did you have to turn off dropout when evaluating the test dataset? I see that in many models it is common to disable dropout for the test dataset.

Further, regarding the ordering of the classes in CamVid, I noted that the original dataset gives class labels from 0-11, where 11 is the void class. If the dataset you've used is the one from the SegNet tutorial as well, did you have to relabel all the segmentations from 1-12 (in Lua's case), since you've put class 1 as void? Is there a particular reason why void is the first class instead of the last?

CamVid Labelling: https://github.com/alexgkendall/SegNet-Tutorial/blob/c922cc4a4fcc7ce279dd998fb2d4a8703f34ebd7/Scripts/test_segmentation_camvid.py#L60

Your Labelling:

local conClasses = {'Sky', 'Building', 'Column-Pole',

Could I also confirm with you whether you performed median frequency balancing to obtain the weighted cross-entropy loss? For reference, these are the class weights used for the CamVid dataset:

https://github.com/alexgkendall/SegNet-Tutorial/blob/c922cc4a4fcc7ce279dd998fb2d4a8703f34ebd7/Models/segnet_train.prototxt#L1538

Thank you for your help once again.

@codeAC29

  1. As I said in my previous comment, you cannot just delete the batchnorm layer. Before doing that, you need to modify the weights of the previous conv layer. In our case we did not get rid of these layers while testing (a sketch of switching such layers to inference mode without removing them follows below).

  2. We do not include the Unlabelled class in our confusion matrix. Giving it label 1 made writing the code easier, because Cityscapes has Unlabelled as its first class.

  3. As mentioned in the paper, we use our own weight calculation scheme, which gave us better results than median frequency balancing.
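For reference, the standard Torch nn way to put batchnorm and dropout layers into inference mode without deleting them is model:evaluate(); the toy model below only illustrates that API and is not taken from this project:

        require 'nn'

        -- toy model containing the layer types discussed above
        local model = nn.Sequential()
        model:add(nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1))
        model:add(nn.SpatialBatchNormalization(16))
        model:add(nn.PReLU())
        model:add(nn.SpatialDropout(0.1))

        -- evaluate() sets train = false in every module: batchnorm then uses its
        -- running statistics and dropout becomes a no-op
        model:evaluate()
        local output = model:forward(torch.randn(1, 3, 32, 32))

        -- training() switches both layers back to their training behaviour
        model:training()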


kwotsin commented Jun 14, 2017

@codeAC29 thanks for your excellent response! I have been mulling over it, and I've tried to create a version that deactivates both batch_norm and spatial dropout during evaluation; however, this gives me a very poor result. As you mentioned, during testing, batch_norm and spatial dropout are kept on. Is it correct to say that these two layers are critical when evaluating images?

On the other hand, if batch_norm is critical to helping the model perform, would evaluating single images give very poor results? From my results, there is quite a bit of difference in the output when evaluating single images versus a batch of images. Would there be a way to effectively evaluate single images with the network? I am currently only performing feature standardization to alleviate the effects.

Your paper has a great amount of content which I'm still learning to appreciate. Would you share how, in particular, p_class is calculated in the weighting formula w_class = 1.0 / ln(c + p_class)? From your code, is it right to assume that p_class is the number of occurrences of a certain pixel label in all images, divided by the total number of pixels in all images? Is there a particular reason why the class weights should be restricted to between 1 and 50? Using median frequency balancing, I see that the weights do not exceed 10.

Also, to verify with you, the spatial dropout you have used is spatial dropout in 2D (channel-wise dropping) - is this correct?

@codeAC29

  1. As I have said in two of my previous comments: "You cannot just delete batch norm". Before removing batchnorm, you will have to do something like this:
        -- x is the old model and y is the new model;
        -- i and j are loop indices from the surrounding code: i selects a container
        -- in x, j the position of the batchnorm layer inside each of its sub-models
         local xsub    = x.modules[i].modules
         local xsubsub = x.modules[i].modules[1].modules
         local output   = xsubsub[j].running_mean:nElement()
         local eps      = xsubsub[j].eps
         local momentum = xsubsub[j].momentum
         local affine   = xsubsub[j].affine
         -- create one batchnorm layer wide enough for all sub-models and freeze it
         y:add(nn.BatchNormalization(output*#xsub, eps, momentum, affine))
         y.modules[#y.modules].train = false

         -- concatenate distributed parameters over different models
         for k = 1, #xsub do
            local range = {output*(k-1)+1, output*k}
            y.modules[#y.modules].running_mean[{range}]:copy(xsub[k].modules[j].running_mean)
            y.modules[#y.modules].running_var[{range}]:copy(xsub[k].modules[j].running_var)
            if affine then
               y.modules[#y.modules].weight[{range}]:copy(xsub[k].modules[j].weight)
               y.modules[#y.modules].bias[{range}]:copy(xsub[k].modules[j].bias)
            end
         end
  2. Yes, p_class is what you said. The weights need to be such that, while training, you give equal importance to all the classes: if x_i is the number of pixels occupied by class i, then the weight w_i should be chosen so that x_i * w_i is roughly constant across all the classes. If there is a huge class imbalance, then weights varying between 1 and 50 are also fine, which is what you found in this case (a sketch of the computation follows after this list).

  3. Yes, that is correct.
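To make point 2 concrete, here is a minimal, hypothetical sketch of the w_class = 1 / ln(c + p_class) weighting discussed above, where p_class is the fraction of all pixels belonging to a class; the function name, the 1..numClasses label layout and the value c = 1.02 are illustrative assumptions, not values confirmed in this thread:

        require 'torch'

        -- hypothetical sketch of the 1 / ln(c + p_class) class weighting
        -- labelTensors: a table of label tensors with values in 1..numClasses
        local function computeClassWeights(labelTensors, numClasses, c)
           c = c or 1.02                                 -- assumed value of the hyperparameter c
           local counts = torch.zeros(numClasses)
           local total  = 0
           for _, labels in ipairs(labelTensors) do
              for class = 1, numClasses do
                 counts[class] = counts[class] + labels:eq(class):sum()
              end
              total = total + labels:nElement()
           end
           local weights = torch.zeros(numClasses)
           for class = 1, numClasses do
              local p = counts[class] / total            -- p_class: fraction of all pixels
              weights[class] = 1 / math.log(c + p)       -- w_class = 1 / ln(c + p_class)
           end
           return weights                                -- e.g. passed to a weighted criterion
        end

With c = 1.02 the weights naturally stay in roughly the 1-50 range: 1/ln(1.02) is about 50 for a vanishingly rare class and 1/ln(2.02) is about 1.4 for a class covering every pixel, which matches the 1-50 restriction discussed above.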


kwotsin commented Jun 15, 2017

@codeAC29 Thank you for your excellent response once again. I am currently trying a variant of ENet in TensorFlow, where batch_norm can be turned off by setting the argument is_training=False in the standard batch norm function. Setting the implementation aside, theoretically speaking, would you say that spatial dropout and batch norm are crucial for getting good results?

If batch_norm and dropout are both turned on during testing, how have you handled the test images differently from the validation images? Was there any change to the model when you were performing inference on the test images? If there weren't any changes, could the test images be included in the mix of train/validation images instead, given that there is no difference between evaluating test images and validation images? That is, of course, assuming there are no changes to the model during testing.

Also, what inspired you not to perform any preprocessing on the images? Is there a conceptual reason behind this? It would be interesting to learn why no preprocessing works well for these datasets.

In your paper, you mentioned that all the ReLUs are replaced with PReLUs. However, in the decoder implementation here: https://github.com/e-lab/ENet-training/blob/master/train/models/decoder.lua
it seems that ReLUs are used again instead of PReLUs. Should ReLUs be used in the decoder rather than PReLUs?

@codeAC29

  1. Yes, batchnorm and spatial dropout are very important for getting good results, because they force your network to learn better and become more general.
  2. We did not change our model for inference on test images. Once you have your trained model, you can of course calculate accuracies on the test set the same way they were calculated on the train/val sets.
  3. In the case of the decoder, ReLUs gave better results, most probably because PReLUs have extra parameters and our network was not pretrained on any other, bigger dataset.


kwotsin commented Jun 27, 2017

@codeAC29 can I verify with you that your reported ENet test accuracy was the result of testing on the test and validation datasets combined? It seems to me this is a natural choice, given that there are no architectural changes between the test and validation datasets and that the only difference comes from the data itself. In fact, perhaps the test dataset could even be distributed across both the training and validation datasets?

@codeAC29

@kwotsin No, we performed testing only on the test dataset. Combining the test data into training/validation would give you a better result, but then the whole point of the test data would be lost. So you should always train and validate your network using the respective data and then, at the end, when you have your trained network, run it on the test data.
