Training capability much worse than 2.0.5 #8430
Out of curiosity, could you try with TF itself as a baseline, and/or try with another backend (if installing it isn't too complicated for you)?
Do you also observe a difference if you use …?
I faced a similar situation with 2.0.0 (with tf 1.1, cuda 8, cudnn 5.1) vs 2.0.9 (with tf 1.4, cuda 8, cudnn 6), so I'm wondering whether the loss averaging across batches per epoch has changed somehow. Another issue with keras 2.1.1: when I use fit_generator with batch_size = 16 and steps_per_epoch=348 (more than the actual number of samples, i.e. 174), it ends the epoch at batch 11/348 and starts a new epoch. Not sure why it is breaking the epoch; this did not happen with 2.0.9 and older. Clearly, the 11th batch will have fewer than 16 samples, but I don't know why that should cause a problem.
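For reference, here is a minimal, self-contained sketch of how steps_per_epoch is usually matched to the generator's capacity (the toy model, data, and batch_generator below are placeholders, not the commenter's actual code); with 174 samples and a batch size of 16 this gives ceil(174/16) = 11 steps per epoch:

import math
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

batch_size = 16
num_samples = 174  # as in the comment above

x = np.random.rand(num_samples, 8)
y = np.random.rand(num_samples, 1) * 2 - 1

def batch_generator(x, y, batch_size):
    # Keras expects the generator to loop forever; it stops reading
    # after steps_per_epoch batches per epoch.
    while True:
        for i in range(0, len(x), batch_size):
            yield x[i:i + batch_size], y[i:i + batch_size]

# One epoch should consume each sample exactly once:
# ceil(174 / 16) = 11 steps, the last batch holding only 14 samples.
steps_per_epoch = math.ceil(num_samples / batch_size)

model = Sequential([Dense(16, activation='relu', input_shape=(8,)),
                    Dense(1, activation='tanh')])
model.compile(optimizer='adadelta', loss='mse')
model.fit_generator(batch_generator(x, y, batch_size),
                    steps_per_epoch=steps_per_epoch, epochs=2)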
While further investigating this issue I ended up getting NaN as the loss when I tried different optimizers. To handle that, I uninstalled everything NVIDIA on my machine and did a clean install. I also reinstalled keras (now 2.1.2) and TF (1.4). After removing all the batchnorm layers the NaN is gone, but the model still fails to train with adam while working very well with adadelta, as expected (OK, this is probably specific to my data/problem). Now I'm after the batchnorm issue.
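For anyone trying to reproduce that optimizer sensitivity, a hedged sketch of comparing adam and adadelta from identical initial weights (the toy model and data stand in for the commenter's real setup):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

x = np.random.rand(512, 8)
y = np.random.rand(512, 1) * 2 - 1

model = Sequential([Dense(32, activation='relu', input_shape=(8,)),
                    Dense(1, activation='tanh')])
initial_weights = model.get_weights()

# Restart from the same weights before each run so any difference in
# final loss comes from the optimizer alone.
for opt in ('adam', 'adadelta'):
    model.set_weights(initial_weights)
    model.compile(optimizer=opt, loss='mse')
    history = model.fit(x, y, batch_size=16, epochs=5, verbose=0)
    print(opt, history.history['loss'][-1])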
Hi, I'm having the same problem. Are there any updates? My project is a music classification model using Keras v2.0.4 + TF v1.1.0, which are quite outdated. I investigated by decreasing the Keras version from v2.1.3 down to v2.0.5 (with the TF version fixed at 1.1.0). I read the release notes and compared the changes to figure out what's wrong, but I couldn't find it! Do you have any ideas about what could possibly be the problem? Thanks!
@tae-jun
@AndrasEros Thanks for your quick response! https://github.com/tae-jun/sample-cnn
The symptom is obvious even after only ONE epoch. Below is the training history for each version.
Keras 2.0.5
Epoch 1/100
6631/6631 [==============================] - 1139s - loss: 0.2004 - val_loss: 0.1710
Epoch 2/100
6631/6631 [==============================] - 1133s - loss: 0.1710 - val_loss: 0.1602
Epoch 3/100
6631/6631 [==============================] - 1135s - loss: 0.1616 - val_loss: 0.1571
Epoch 4/100
6631/6631 [==============================] - 1135s - loss: 0.1566 - val_loss: 0.1535
Keras 2.0.6
Epoch 1/100
6631/6631 [==============================] - 1159s - loss: 0.2083 - val_loss: 0.1825
Epoch 2/100
6631/6631 [==============================] - 1150s - loss: 0.1824 - val_loss: 0.1736
Epoch 3/100
6631/6631 [==============================] - 1147s - loss: 0.1726 - val_loss: 0.1601
Epoch 4/100
6631/6631 [==============================] - 1149s - loss: 0.1666 - val_loss: 0.1583
Hardware Setup
Software Setup
Please ask me for any other information you need! Thanks 😄
Very interesting!
Differences:
Now I'm just brainstorming what we can check:
Additional ideas are welcome from anyone!
How long do your train-eval loops take? If they're short(ish), you could use git bisect to find the problematic commit.
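A rough outline of that bisect, assuming a driver script (test.sh here is hypothetical) that trains briefly and exits 0 only when the loss reaches the 2.0.5-level target; the exact tag names may differ in the repo:

git clone https://github.com/fchollet/keras.git && cd keras
git bisect start
git bisect bad 2.0.6      # first release showing the regression
git bisect good 2.0.5     # last release that trained well
git bisect run ./test.sh  # git walks the commits in between automatically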
Please make sure that the boxes below are checked before you submit your issue. If your issue is an implementation question, please ask your question on StackOverflow or join the Keras Slack channel and ask there instead of filing a GitHub issue.
Thank you!
Check that you are up-to-date with the master branch of Keras. You can update with:
pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.
If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with:
pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps
Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).
I happened to re-run a model that I trained some months ago, and was surprised that I couldn't train it to the same level as before. I did the following investigation using different versions of Keras and Tensorflow. For better interpretation of the results, note that Y is normalized between -1 and +1.
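(As a concrete reading of that normalization, a small sketch of min-max scaling targets into [-1, +1]; the raw y below is made up:)

import numpy as np

y = np.random.rand(1000) * 50.0  # hypothetical raw targets
# Map min(y) to -1 and max(y) to +1 linearly.
y_scaled = 2.0 * (y - y.min()) / (y.max() - y.min()) - 1.0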
Model used:
Results with Keras 2.0.9 and TF 1.4.0:
Loss does not decrease below 0.08 even after a longer time.
Results with Keras 2.0.5 and TF 1.2.0:
I have run these a few times and the results always show the same difference between the two versions. It looks clear that the older version is superior in training speed and converges to a much lower loss: 2.0.5 achieves mse 0.015 after 9 epochs while 2.0.9 is at mse 0.08 after epoch 15. Version 2.0.9 also never reaches mse 0.015; it gets stuck around 0.08, lagging significantly behind the earlier version. To trace back when the change was introduced between the two versions, I ran:
Results with Keras 2.0.6 and TF 1.2.0
It appears that upgrading keras from 2.0.5 to 2.0.6 causes the decreased training efficiency, so the problem was likely introduced with 2.0.6 and has been there ever since. I ran all tests in the same environment with the same data; the only change was upgrading/downgrading keras and TF.
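That upgrade/downgrade cycle can be pinned explicitly so each run differs only in the library versions; a sketch, with train.py standing in for the actual training script:

# Same script, same data; only the framework versions change between runs.
pip install keras==2.0.5 tensorflow==1.2.0 && python train.py   # trains well
pip install keras==2.0.6 && python train.py                     # first suspect release
pip install keras==2.0.9 tensorflow==1.4.0 && python train.py   # stuck around mse 0.08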
Can someone check with another model that 2.0.5 is that much better?
Is there perhaps a known issue? I could not find one by searching.