
[src,scripts,egs] Implement and tune resnets. #1620

Merged
danpovey merged 1 commit into kaldi-asr:kaldi_52 from danpovey:kaldi_52_resnet_final on May 13, 2017

Conversation

@danpovey
Contributor

No description provided.

@danpovey
Contributor Author

@freewym, this (run_resnet_1b.sh) is probably the setup we want to use to test out backstitch, since its results are quite competitive: we are below 5% error (4.8%) on CIFAR-10 with only 1.3 million parameters and 16 convolutional layers, and I'm not aware of any other system below 5% error that is anywhere near as small or as shallow. The CIFAR-100 error is 24%.
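
For reference, a minimal sketch of how backstitch might be switched on for this recipe. It assumes the run script calls steps/nnet3/train_raw_dnn.py and that the backstitch options are exposed under the names below; both are assumptions to verify against the branch that adds backstitch.

# Hedged sketch only: flag names are assumptions to check against
# steps/nnet3/train_raw_dnn.py in the backstitch branch.
backstitch_scale=0.3       # hypothetical value to tune
backstitch_interval=1      # apply backstitch on every minibatch

# In a copy of local/nnet3/run_resnet_1b.sh, the existing trainer call would
# gain the two extra options, roughly:
steps/nnet3/train_raw_dnn.py \
  --trainer.optimization.backstitch-training-scale $backstitch_scale \
  --trainer.optimization.backstitch-training-interval $backstitch_interval \
  --dir exp/resnet1b_backstitch_cifar10   # plus the recipe's existing options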

Interestingly, in our setup I did not see any benefit from per-channel mean subtraction, which is normally done on CIFAR. Perhaps the natural gradient makes it unnecessary.
In PR #1613 there is a binary, 'image-preprocess.cc' (https://github.com/kaldi-asr/kaldi/pull/1613/files#diff-3c933de8236d0eab9bcfc480733bc2a1), which can do various types of preprocessing, including this one, and the version of get_egs.sh there is modified to support its use; but I'm not checking that in here, as I didn't see any benefit from any form of preprocessing.

@freewym
Contributor

freewym commented May 13, 2017

Great! I will test based on this PR.

@danpovey
Contributor Author

I'm going to merge this now, since the checks completed.

@hhadian, there are some things that might improve this, and it would be helpful if you could test them out. But start from 1a for most tuning experiments, as it's faster than 1b (it has fewer epochs). A rough sketch of where these knobs live follows the list.

  • I'm pretty sure a larger model would be better. E.g. try increasing $nf1 from 48 to 64. A much larger model would probably be quite a bit better, but let's not go overboard.
  • The --proportional-shrink value, which acts like l2 regularization, is a critical tuning parameter, and the current 50.0 is only tuned to within a factor of 2: 25.0 was worse, and 100.0 was a lot worse. So maybe try 40.0 and see if it's better.
  • I'd like to confirm that the natural gradient helps, so try disabling it.
  • I suspect that setting alpha-out=2.0 (the default is 4.0) for the natural gradient would be better; can you please try that? It might affect the optimization speed more than the final results.
  • The config generation doesn't support this yet, but IIRC in the wide-resnet paper they put dropout in the middle of the res-block, with probability 0.3, and it was helpful. So that might be worth a try.
  • Can you please try a similar setup on SVHN?
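
As mentioned above, here is a rough sketch of where these knobs live, assuming the layout of local/nnet3/run_resnet_1a.sh; the variable and option names are from memory and should be checked against the actual script.

# Hedged sketch of the tuning changes suggested above; names are assumptions.
# Near the top of a copy of local/nnet3/run_resnet_1a.sh:
nf1=64                       # was 48: more filters in the first block group
proportional_shrink=40.0     # was 50.0; acts like l2 regularization

# The xconfig heredoc uses $nf1 in lines like
#   conv-relu-batchnorm-layer name=conv1 ... num-filters-out=$nf1
# and the trainer call passes the shrink value along the lines of
#   --trainer.optimization.proportional-shrink $proportional_shrink

# Dropout inside the res-block is not supported by the config generation yet;
# a purely illustrative form of what such an option might look like:
#   res-block name=res2 num-filters=$nf1 ... dropout-proportion=0.3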

danpovey merged commit d2d0738 into kaldi-asr:kaldi_52 on May 13, 2017
@danpovey
Contributor Author

Oh, @hhadian, one more thing.
In this CIFAR setup, the final model combination seems to be important: it gives a couple of percent improvement. This is surprising, as for speech tasks it was never really critical. Because there are a lot of coefficients to estimate (15 * num-layers), the training error is very small, and we estimate the coefficients on the training data, I'm concerned that there may be a lot of noise in the combination-parameter estimates, which could be degrading results a little. So can you please:

  • Modify image/nnet3/get_egs.sh so that it takes an option combine_subset_egs, defaulting to 5k, and, instead of making the combine subset the same as the train-diagnostic subset, take it from the tail rather than the head of the list, so that combine.egs uses different data (this will make the training diagnostics more meaningful after combination). A sketch of this change follows the list.
  • Run an experiment with a larger --combine-subset-egs value, e.g. 25k. It's already very slow with 5k (takes at least half an hour with the resnets I'm currently using), so this will be extremely slow, but it's just to see if it helps.
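
A minimal sketch of the intended head/tail change; the list names and directory below are hypothetical, and the real get_egs.sh builds its subsets differently, so treat this only as an illustration of the idea.

# Hedged sketch: take the combine subset from the tail of the shuffled
# training-subset list rather than reusing the head (which feeds the
# train-diagnostic egs).  All names here are illustrative, not the script's.
dir=exp/cifar10_egs              # hypothetical egs directory
combine_subset_egs=5000          # new option, e.g. --combine-subset-egs 5000

# train_subset_list should hold at least 2*combine_subset_egs entries so the
# head and tail below do not overlap.
head -n $combine_subset_egs $dir/train_subset_list > $dir/train_diagnostic_list
tail -n $combine_subset_egs $dir/train_subset_list > $dir/combine_list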

I am thinking of using importance sampling and per-example weighting to speed up both the training and the final combination, since I believe the derivative magnitudes of different examples will be extremely different, depending on how easy each example is to classify. But that is something we can definitely look at later on, after the NIPS deadline, not now.

@hhadian
Contributor

hhadian commented May 13, 2017

Will do all

@danpovey
Contributor Author

And one more thing... it would be nice to know whether the self-repair makes any difference in this setup. So please try setting the self-repair-scale to 0.0 and see if it affects performance.
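
A hedged example of what that toggle might look like; whether every layer type in this config accepts a self-repair-scale option is an assumption to check against the xconfig layer code.

# Hedged sketch: add self-repair-scale=0.0 to the nonlinearity-bearing layers
# in the xconfig heredoc of a copied run script, e.g.
#   conv-relu-batchnorm-layer name=conv1 num-filters-out=$nf1 ... self-repair-scale=0.0
# One mechanical (untested) way to apply it to every such line:
sed 's/^\( *conv-relu-batchnorm-layer .*\)$/\1 self-repair-scale=0.0/' \
    local/nnet3/run_resnet_1a.sh > local/nnet3/run_resnet_norepair.sh
chmod +x local/nnet3/run_resnet_norepair.sh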

@hhadian
Contributor

hhadian commented May 13, 2017

OK will try it

@hhadian
Contributor

hhadian commented May 17, 2017

These are the results:

# System                 baseline(1a)    bigger_1stlayer bigger_2nd      pshrink40       pshrink55       noNG_2jobs      alpha2.0
# final test accuracy:   0.949           0.9481          0.9469          0.9441          0.9455          0.9305          0.947
# final train accuracy:  0.9992          0.9984          0.999           0.9994          0.9964          0.9802          0.999
# final test objf:      -0.169885       -0.170357       -0.170604       -0.176587       -0.166264       -0.200331       -0.172659
# final train objf:     -0.0117902      -0.0107631      -0.00934661     -0.00685971     -0.0177978      -0.071727       -0.0106905
# num-parameters:        1322730         1401578         1668490         1322730         1322730         1322730         1322730

@hhadian
Contributor

hhadian commented May 17, 2017

And the results with a separate combine subset (taken from the tail):

# System                 baseline(1a)  tail_combineset_5k  tail_combineset_25k
# final test accuracy:        0.949          0.9467          0.9458
# final train accuracy:       0.9992          0.9962         0.9976
# final test objf:         -0.169885       -0.172149      -0.169397
# final train objf:       -0.0117902       -0.015221      -0.0132601
# num-parameters:           1322730         1322730         1322730

The accuracies before the final combination step are almost the same as the baseline's.
Baseline exp dir: /home/hhadian/xent/egs/cifar/v1/exp/resnet1a_cifar10/
And the exp dirs for the 5k and 25k experiments:
/home/hhadian/xent/egs/cifar/v1/exp/resnet1g5k_cifar10/
/home/hhadian/xent/egs/cifar/v1/exp/resnet1g25k_cifar10/

@hhadian
Contributor

hhadian commented May 19, 2017

# System                baseline(1a) dropout_in_resblock no_selfrepair noNG_1job
# final test accuracy:        0.949      0.9441      0.9469      0.9328
# final train accuracy:       0.9992      0.9944       0.998      0.9934
# final test objf:         -0.169885   -0.167914    -0.17233    -0.20722
# final train objf:       -0.0117902  -0.0247912  -0.0125956   -0.028442
# num-parameters:           1322730     1322730     1322730     1322730

I think it might be worth trying dropout_in_resblock on CIFAR-100, as it is the only change that improved the objective function.

@danpovey
Contributor Author

danpovey commented May 19, 2017 via email

@YiwenShaoStephen
Contributor

@hhadian I've recently also been working on tuning the number of filters (nf1, nf2, nf3). Previously I only tried increasing nf3. I see you showed results for "bigger_1stlayer" and "bigger_2nd". Can you give me more detail about these experiments, such as the values of nf1 and nf2?

@hhadian
Contributor

hhadian commented Nov 22, 2017

@YiwenShaoStephen, for the bigger 1st layer I used 64 filters instead of 48, and for the bigger 2nd layer I used 128 instead of 96.
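
For anyone reproducing these runs, a hedged sketch of where those values live near the top of local/nnet3/run_resnet_1a.sh; the exact names and the third value are assumptions to check against the script.

# Hedged sketch; the baseline values are as described above, the rest is assumed.
nf1=48   # filters in the first group of res-blocks ("bigger_1stlayer" used 64)
nf2=96   # filters in the second group ("bigger_2nd" used 128)
# nf3 (third group) was left at its checked-in value in those experiments.
# These variables feed the xconfig heredoc via options like num-filters-out=$nf1.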
