Training capability much worse than 2.0.5 #8430

Closed
3 of 4 tasks
AndrasEros opened this issue Nov 8, 2017 · 10 comments

@AndrasEros

AndrasEros commented Nov 8, 2017

Please make sure that the boxes below are checked before you submit your issue. If your issue is an implementation question, please ask your question on StackOverflow or join the Keras Slack channel and ask there instead of filing a GitHub issue.

Thank you!

  • Check that you are up-to-date with the master branch of Keras. You can update with:
    pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps

  • If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.

  • If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with:
    pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps

  • Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).

I recently re-ran a model that I trained some months ago and was surprised that I can no longer train it to the same level as before. I investigated using different versions of Keras and TensorFlow. To interpret the results below, note that Y is normalized between -1 and +1.

Model used:

from nntools import helper
import numpy as np
import random
import win32gui
import win32con


from keras.layers import Input, LSTM, Dense, concatenate, BatchNormalization
from keras.models import Model
from keras import optimizers
from keras.callbacks import EarlyStopping, CSVLogger, ReduceLROnPlateau
from keras.utils import plot_model
import os
from phased_lstm_keras.PhasedLSTM import PhasedLSTM as PLSTM

import tensorflow as tf

os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

hwnd = win32gui.GetForegroundWindow()
win32gui.ShowWindow(hwnd, win32con.SW_MAXIMIZE)

timesteps = 40
holdout_percentage = 0.05  #Not used now
pretrain_epochs = 40
#early_stopping_patiente = 50
datafile = "C:/Data/data40_new3.csv"

xEMA_,y_ = helper.rnn_csv_toXY(datafile,timesteps,["P","ATR"],"T1",False)

adadelta_EMA = optimizers.adadelta()
adam_EMA=optimizers.adam()
sgd_EMA = optimizers.SGD(lr=0.01, decay=4e-5, momentum=0.9, nesterov=False)  #LSTM
sgd_EMA_PLSTM = optimizers.SGD(lr=0.01, decay=4e-5, momentum=0.2, nesterov=False)  #PLSTM
reduce_lr_EMA = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1, cooldown=1, min_lr=0.005)  #For SGD
#im = IncreaseMomentum(step=0.2, max_momentum=0.7)
#early_stopping_EMA = EarlyStopping(monitor='val_loss', min_delta=0.0001,
                                                                        #patience=early_stopping_patiente, mode='auto')
with tf.device('/gpu:0'):
    ema_in = Input(name='ema_in', shape=(xEMA_.shape[1],xEMA_.shape[2]))
    ema_in_BN = BatchNormalization()(ema_in)
    ema_lstm1 = LSTM(1200, name='ema_lstm1', implementation=2, return_sequences=True)(ema_in_BN)
    ema_lstm1_BN = BatchNormalization()(ema_lstm1)
    ema_lstm2 = LSTM(1200, name='ema_lstm2', implementation=2, return_sequences=False)(ema_lstm1_BN)
    ema_lstm2_BN = BatchNormalization()(ema_lstm2)
    ema_dense1 = Dense(2400, name='ema_dense1', activation='tanh')(ema_lstm2_BN)
    ema_dense1_BN = BatchNormalization()(ema_dense1)
    ema_dense2 = Dense(1200, name='ema_dense2', activation='tanh')(ema_dense1_BN)
    ema_dense2_BN = BatchNormalization()(ema_dense2)
    ema_dense3 = Dense(600, name='ema_dense3', activation='tanh')(ema_dense2_BN)
    ema_dense3_BN = BatchNormalization()(ema_dense3)
    ema_output = Dense(1, name='ema_output', activation='tanh')(ema_dense3_BN)

    ema_model = Model(inputs=[ema_in], outputs=[ema_output])
    reduce_lr_M_adadelta = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1, cooldown=1, min_lr=0.5)  #For adadelta
    reduce_lr_M = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1, cooldown=1, min_lr=0.005)  #For SGD
    early_stopping_M = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=40, mode='auto')
    csv_logger = CSVLogger('training_log3_209_LSTM_SGD.csv')

    ema_model.compile(optimizer=sgd_EMA,
              loss={'ema_output': 'mean_squared_error'}, metrics=['mae'])
    #ema_model.compile(optimizer=adadelta_EMA,
    #          loss={'ema_output': 'mean_squared_error'}, metrics=['mae'])
    print("Train EMA on GPU0...")
    ema_model.fit({'ema_in': xEMA_},
          {'ema_output': y_},
          epochs=500, batch_size=40, validation_split=0.1,
          callbacks=[csv_logger, reduce_lr_M, early_stopping_M])

Results with Keras 2.0.9 and TF 1.4.0:

epoch,loss,mean_absolute_error,val_loss,val_mean_absolute_error
0,1.05968262962,0.985275764846,1.10819295021,1.03706740068
1,0.979859521839,0.939838988512,1.0076873856,0.979688630307
2,0.946857691283,0.913535889255,0.750179696942,0.825602595291
3,0.762938497789,0.785929248137,1.08659711466,1.02602888577
4,0.501235865148,0.589149763212,0.132253268072,0.307935305328
5,0.2807936117,0.413878542905,0.114725752961,0.292098902317
6,0.139581092516,0.291962343971,0.070717339354,0.21767100456
7,0.114901034081,0.263645202189,0.0706408966877,0.217224851168
8,0.101012383458,0.245739355278,0.0468780662885,0.176116148001
9,0.0997591257731,0.244218098339,0.0869001861954,0.24957025293
10,0.0957704882136,0.238081805803,0.0793308527111,0.237788404863
11,0.0917768303432,0.232730614943,0.0355999692114,0.154031230826
12,0.0880096098064,0.227360123548,0.0304724577584,0.140374521908
13,0.0856908053515,0.223671286127,0.0345248061943,0.150562275536
14,0.0844831893862,0.22217891928,0.0360884426361,0.154798901412
15,0.0844001181336,0.221470015169,0.030373785525,0.140394812572

The loss does not drop below about 0.08 even with longer training.

Results with Keras 2.0.5 and TF 1.2.0:

epoch,loss,mean_absolute_error,val_loss,val_mean_absolute_error
0,0.674994680866,0.787532132172,0.210298314454,0.401246211335
1,0.50003129797,0.640251753075,0.186895873197,0.360440107238
2,0.0864018025559,0.23287392271,0.0184035980327,0.112992629679
3,0.0415469702159,0.159678566147,0.0345238949807,0.166836664362
4,0.0323593134869,0.141381580302,0.0117207404594,0.0815250731591
5,0.0255187416594,0.125329670333,0.00788067547078,0.0707015042412
6,0.0199052717129,0.110094056107,0.0146587090298,0.0996316193543
7,0.0194325417159,0.108653233154,0.022364237375,0.12307138551
8,0.0179296625274,0.103960103221,0.0104151797266,0.0821187130446
9,0.0156120462513,0.0965747672158,0.00753274433028,0.0676462595845

I have run these a few times and the results always show the same difference between the two versions. The older version clearly trains faster and converges to a much lower loss: 2.0.5 reaches an MSE of about 0.015 at epoch 9, while 2.0.9 is still at about 0.08 at epoch 15. Version 2.0.9 never reaches 0.015; it gets stuck around 0.08, lagging significantly behind the earlier version. To trace when the change was introduced between the two versions, I ran:

Results with Keras 2.0.6 and TF 1.2.0:

epoch,loss,mean_absolute_error,val_loss,val_mean_absolute_error
0,1.01219834391,0.958042641546,0.323887029367,0.507626796452
1,0.987153389447,0.947846812448,0.991314166002,0.972727180805
2,0.913755040475,0.895092398818,0.960553074588,0.954757456198
3,0.887703684235,0.877553070937,0.319060129579,0.459220583437
4,0.42077836859,0.534649828892,0.427964422614,0.602694774897
5,0.166042508874,0.321681825808,0.0520719601093,0.188639527333
6,0.130638394979,0.283327733159,0.0684811605399,0.21664535411
7,0.111096850008,0.259380427971,0.0437811681681,0.170154755492
8,0.105542496341,0.25173293699,0.0437140094443,0.16746392877
9,0.0990599608695,0.243373310091,0.0838070973706,0.238413533012
10,0.0927036643884,0.234348856499,0.0374346287566,0.158595657002
11,0.0888211610787,0.228353776273,0.0463573170836,0.176301127744
12,0.0889625790833,0.228434735812,0.0604750635987,0.202052195814
13,0.0873379450592,0.226243785039,0.0420134068426,0.164519093145
14,0.0842049010805,0.221496316829,0.0536444546858,0.186482944605
15,0.0756696959965,0.20801132382,0.0432103228089,0.169859484031
16,0.0757876485275,0.208129870598,0.0326403788479,0.145170186645
17,0.075430783716,0.207450285694,0.0314801485713,0.143648955088
18,0.0749894291144,0.206845118805,0.0369722390038,0.154727127577

It appears that upgrading Keras from 2.0.5 to 2.0.6 causes the slower, worse training, so the problem was likely introduced in 2.0.6 and has been present ever since. I ran all tests in the same environment with the same data; the only change was upgrading or downgrading Keras and TF.
Can someone check with another model whether 2.0.5 is really that much better?
Is there perhaps a known issue? I could not find one by searching.
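
For a quick side-by-side of the three runs, the CSVLogger files can be reduced to their best training loss. The sketch below is only an illustration: the helper name `best_loss` is hypothetical, and it uses just the last two rows of each log copied from this post rather than the full files.

```python
import csv
import io


def best_loss(csv_text, column="loss"):
    """Return (epoch, value) for the lowest `column` value in a CSVLogger-style log."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = [(int(row["epoch"]), float(row[column])) for row in reader]
    return min(pairs, key=lambda pair: pair[1])


HEADER = "epoch,loss,mean_absolute_error,val_loss,val_mean_absolute_error\n"

# Last two rows of each log, copied from the results in this post.
logs = {
    "Keras 2.0.5 / TF 1.2.0": HEADER
    + "8,0.0179296625274,0.103960103221,0.0104151797266,0.0821187130446\n"
    + "9,0.0156120462513,0.0965747672158,0.00753274433028,0.0676462595845\n",
    "Keras 2.0.6 / TF 1.2.0": HEADER
    + "17,0.075430783716,0.207450285694,0.0314801485713,0.143648955088\n"
    + "18,0.0749894291144,0.206845118805,0.0369722390038,0.154727127577\n",
    "Keras 2.0.9 / TF 1.4.0": HEADER
    + "14,0.0844831893862,0.22217891928,0.0360884426361,0.154798901412\n"
    + "15,0.0844001181336,0.221470015169,0.030373785525,0.140394812572\n",
}

summary = {name: best_loss(text) for name, text in logs.items()}
for name, (epoch, loss) in sorted(summary.items()):
    print("{}: best loss {:.4f} at epoch {}".format(name, loss, epoch))
```

Passing `column="val_loss"` gives the same comparison on the validation split.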

AndrasEros changed the title from "Training capability worse than 2.0.5" to "Training capability much worse than 2.0.5" on Nov 8, 2017
@roya0045

Out of curiosity, could you try with TF itself as a baseline and/or with another backend (if installing it isn't too complicated for you)?

@fchollet
Collaborator

Do you also observe a difference if you use GRU instead of LSTM?

Do you also observe a difference if you use tf.keras in TF 1.4 instead of PyPI Keras?

@gmrhub

gmrhub commented Nov 15, 2017

I faced a similar situation with 2.0.0 (with tf 1.1, cuda 8, cudnn 5.1) vs 2.0.9 (with tf 1.4, cuda 8, cudnn 6).
With so many differences, I did not have time to investigate further.

So I am wondering whether the way the error is averaged across the batches in each epoch has changed somehow.

Another issue with keras 2.1.1: when I use fit_generator with batch_size = 16 and steps_per_epoch=348 (more than the actual number of samples, i.e. 174), it ends the epoch at only the 11/384 th batch and starts a new epoch. I'm not sure why it is cutting the epoch short; this did not happen before, with 2.0.9 and older. Clearly, the 11th batch will have fewer than 16 samples, but I don't know why that should cause a problem.
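
The batch arithmetic behind that observation can be checked in a few lines. This is only a sketch of the numbers quoted in the comment (174 samples, batch size 16); it shows that one pass over the data ends at batch 11, and it does not establish why Keras 2.1.1 stops there.

```python
import math

samples = 174     # number of samples, from the comment above
batch_size = 16   # batch size, from the comment above

# Number of full batches and the size of the final, partial batch.
full_batches, last_batch_size = divmod(samples, batch_size)

# Steps needed for exactly one pass over the data.
steps_for_one_pass = math.ceil(samples / batch_size)

print("full batches:", full_batches)
print("samples in the last batch:", last_batch_size)
print("steps for one pass:", steps_for_one_pass)
```

So the 11th batch is the point at which a generator that does not loop would run out of data.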

@AndrasEros
Author

AndrasEros commented Dec 7, 2017

While investigating this issue further, I started getting NaN as the loss when I tried different optimizers. To deal with that, I uninstalled everything Nvidia-related on my machine and did a clean install. I also reinstalled keras (now 2.1.2) and TF (1.4). I also removed all the batchnorm layers, and the NaN is gone; however, the model still fails to train with adam but works very well with adadelta, as expected (OK, this is probably specific to my data/problem). Now I'm looking into the batchnorm issue.

@tae-jun
Contributor

tae-jun commented Feb 6, 2018

Hi, I'm having the same problem. Are there any updates?

My project is music classification and uses Keras v2.0.4 + TF v1.1.0, which are quite outdated.
So I upgraded them to Keras v2.1.3 + TF v1.4.1, but the performance (ROC-AUC) was much worse.

I therefore investigated by stepping the Keras version down from v2.1.3 to v2.0.5 (with the TF version fixed at 1.1.0).
It turns out that the performance is bad from v2.1.3 down to v2.0.6 and becomes fine at v2.0.5.
I guess something went wrong in the change to v2.0.6.

I read the release notes and compared the changes to figure out what's wrong, but I couldn't find it!

Do you have any idea what could be causing the problem?
It isn't fixed in the latest version, so we should figure it out together!

Thanks!

@AndrasEros
Author

@tae-jun
We must have something in common. It seems most users haven't noticed anything; it's only a few of us. Why?
Can you please share your code that behaves differently with different versions?
Can you share your hardware setup?
Can you share your software setup?
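
To make the software setups easier to compare, a small script along these lines could print the relevant versions. It is a sketch: `collect_versions` is a hypothetical helper, it relies on each package exposing `__version__`, and it does not cover the CUDA, cuDNN or GPU driver versions, which have to be reported separately.

```python
import importlib
import platform
import sys


def collect_versions(module_names):
    """Map each module name to its __version__, or to a short status string."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
        except Exception:
            # The package exists but failed while importing (e.g. a broken backend).
            versions[name] = "import failed"
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


report = {"python": sys.version.split()[0], "platform": platform.platform()}
report.update(collect_versions(["keras", "tensorflow", "numpy"]))

for key, value in sorted(report.items()):
    print("{}: {}".format(key, value))
```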

@tae-jun
Contributor

tae-jun commented Feb 8, 2018

@AndrasEros Thanks for your quick response!

https://github.com/tae-jun/sample-cnn
This is the project I'm working on! Its task is music classification. The CNN architecture is HERE.

The symptom is so obvious even after only ONE epoch. Below is the training history for each version.

Keras 2.0.5

Epoch 1/100
6631/6631 [==============================] - 1139s - loss: 0.2004 - val_loss: 0.1710
Epoch 2/100
6631/6631 [==============================] - 1133s - loss: 0.1710 - val_loss: 0.1602
Epoch 3/100
6631/6631 [==============================] - 1135s - loss: 0.1616 - val_loss: 0.1571
Epoch 4/100
6631/6631 [==============================] - 1135s - loss: 0.1566 - val_loss: 0.1535

Keras 2.0.6

Epoch 1/100
6631/6631 [==============================] - 1159s - loss: 0.2083 - val_loss: 0.1825
Epoch 2/100
6631/6631 [==============================] - 1150s - loss: 0.1824 - val_loss: 0.1736
Epoch 3/100
6631/6631 [==============================] - 1147s - loss: 0.1726 - val_loss: 0.1601
Epoch 4/100
6631/6631 [==============================] - 1149s - loss: 0.1666 - val_loss: 0.1583

Hardware Setup

  • GTX 1080Ti x2
    (Which information could be helpful?)

Software Setup

  • CentOS 7.3
  • CUDA 8 / CuDNN 6
  • Anaconda 4.3.24 (with Python 3.5)
  • TensorFlow 1.1.0

Please ask if you need any other information! Thanks 😄

@AndrasEros
Author

Very interesting!
Similarities that matter:

  • I have the same hardware, 2x GTX 1080Ti
  • We both use Batchnormalization that was touched in Keras 2.0.6

Differences:

  • I'm on Windows 10
  • I have Anaconda 4.2.0 (64 bit)

Now I'm just brainstorming what we can check:

  • I'm not sure whether the Windows and Linux drivers are related, but we can check the GPU driver. Mine is 388.13. I upgraded it at the same time as I upgraded Keras to 2.0.6. We can search open issues or try to downgrade.
  • We both have a multi-GPU environment, and Keras has multi-GPU processing that is called from fit_generator, which was touched in 2.0.6. We should both try hiding one of the GPUs so that TF and Keras can't see it. How to do it.
  • Remove Batchnorm completely from our models (don't even import it) and compare 2.0.5 vs. 2.0.6 again.
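
One common way to hide a GPU is the `CUDA_VISIBLE_DEVICES` environment variable, set before TensorFlow or Keras is imported. The sketch below assumes that approach; the helper name `restrict_visible_gpus` is hypothetical, and the device index `0` is only an example.

```python
import os


def restrict_visible_gpus(indices):
    """Expose only the listed GPU indices to CUDA by setting CUDA_VISIBLE_DEVICES."""
    value = ",".join(str(index) for index in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Must run before TensorFlow/Keras is imported, because the variable is read
# when CUDA is initialized.
restrict_visible_gpus([0])

# import tensorflow as tf   # import only after the variable is set
# import keras
```

With a single visible device, the remaining GPU shows up as `/gpu:0` inside the process, so the model script above would not need to change.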

Additional ideas are welcome from anyone!

@tae-jun
Contributor

tae-jun commented Feb 9, 2018

  • My GPU driver version is 381.22
  • I have 2 GPUs but use only one GPU for training, so I guess that's not the reason 😥
  • I didn't know that BatchNorm was touched in 2.0.6! I should remove the BatchNorm layers and compare performance. I will let you know the result! 😄

@lukedeo
Contributor

lukedeo commented Feb 9, 2018

How long do your train-eval loops take? If short(ish) you could use git bisect to find the problematic commit.
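
As a sketch of how that could be automated, `git bisect run` accepts a script whose exit code marks each commit: 0 for good, 125 for "skip this commit", and other codes from 1 to 127 for bad. The names below (`bisect_exit_code`, `main`, the 0.05 threshold) are hypothetical, and the training function is left as something each reporter would supply from their own model.

```python
import math

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def bisect_exit_code(final_loss, threshold):
    """Translate a final training loss into a git-bisect exit code."""
    if final_loss is None or math.isnan(final_loss):
        return SKIP  # choice made here: treat an unusable run as untestable
    return GOOD if final_loss <= threshold else BAD


def main(train_fn, threshold=0.05):
    """Run a short training job and return the exit code for this commit.

    `train_fn` is a caller-supplied function that trains for a few epochs and
    returns the final loss. The 0.05 default is an assumed cut-off; it has to
    be chosen between the good and bad losses observed for a given model.
    """
    try:
        loss = train_fn()
    except Exception:
        return SKIP  # the commit could not be trained at all
    return bisect_exit_code(loss, threshold)
```

To use it, the script would end with `sys.exit(main(my_training_function))` (after `import sys`), and Keras would need to be installed from the checked-out source on each step so that the commit under test is the one actually imported.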


6 participants