
Early-stopping does not work properly in Keras 3 when used in a for loop #20256

Senantq opened this issue Sep 13, 2024 · 8 comments

Senantq commented Sep 13, 2024

Hello,
I am using Keras 3.5 with TF 2.17. My code is more or less the following (it is not a grid search: in the real code I also vary some other variables that are not directly linked to the network):


import gc
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, BatchNormalization, Flatten, Dense
from tensorflow.keras.regularizers import l1
from tensorflow.keras.callbacks import EarlyStopping

def create_conv(nb_cc_value, l1_value):
    model = Sequential()
    model.add(tensorflow.keras.layers.RandomFlip(mode="horizontal"))
    model.add(Conv2D(32, (3,3), activation = 'relu', kernel_regularizer=l1(l1_value)))
    model.add(MaxPool2D())
    model.add(BatchNormalization())
    model.add(Conv2D(64, (3,3), activation = 'relu', kernel_regularizer=l1(l1_value)))
    model.add(MaxPool2D())
    model.add(BatchNormalization())
    model.add(Conv2D(512, (3,3), activation = 'relu', kernel_regularizer=l1(l1_value)))
    model.add(MaxPool2D())
    model.add(BatchNormalization())
    model.add(Conv2D(1024, (3,3), activation = 'relu', kernel_regularizer=l1(l1_value)))
    model.add(BatchNormalization())
    model.add(MaxPool2D())
    model.add(Conv2D(2048, (3,3), activation = 'relu', kernel_regularizer=l1(l1_value)))
    model.add(Flatten())
    model.add(BatchNormalization())
    model.add(Dense(nb_cc_value, activation='relu', kernel_regularizer=l1(l1_value)))
    model.add(Dense(56, activation = 'sigmoid'))
    model.build((None,150,150,1))
    
    lr_schedule = tensorflow.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.7, staircase=False)
    optimizer = tensorflow.keras.optimizers.SGD(learning_rate=lr_schedule, momentum = 0.9)
    model.compile(loss= ['mse'], optimizer = optimizer, metrics = ['mse'])
    return model

# %%--------------------------------------------------Initialization
early_stopping = EarlyStopping(monitor='val_mse', min_delta = 0.001, patience=5, restore_best_weights=True)

nb_cc = [2, 6, 12, 102, 302, 602]
l1_values = [2.220446049250313e-16, 0.0000001, 0.0001]

for nb_cc_value in nb_cc:
    for l1_value in l1_values:
        for run in range(1,3):
            model = create_conv(nb_cc_value, l1_value)
            history = model.fit(X_train, y_train, epochs=epoques, callbacks=[early_stopping], validation_data=(X_test, y_test), batch_size=6, shuffle=True, verbose=1)
            # Cleanup
            del X_train, y_train, X_test, y_test, vectors_dict, ethnie_dict, test_image_counts, model, history, prediction
            tensorflow.keras.backend.clear_session()
            gc.collect()

However, when I run it, only the very first run in the whole script works fine. All the others stop after something like 1 or 2 epochs, even though 'val_mse' is still decreasing. I ran the same code with Keras 2.15.0 (tensorflow 2.15.0.post1) and it worked fine there.

Any help is much appreciated, thank you

mehtamansi29 (Collaborator) commented:

Hi @Senantq -

Can you help me with the dataset so I can reproduce the issue?

Senantq (Author) commented Sep 13, 2024

Hi @mehtamansi29
Sure! Here is the link to a Google Drive where you can find the full code as well as the folder containing the dataset: https://drive.google.com/drive/folders/1W6y-X_UlUNDoHHV8gG4CT5K30LwJWZvc?usp=drive_link

mehtamansi29 (Collaborator) commented:

Hi @Senantq -

Thanks, but the drive link is not accessible to me. Can you provide an accessible link?

ghsanti (Contributor) commented Sep 16, 2024

@Senantq Some possible causes that wouldn't be a bug:

  1. If one changes patience=2 to patience=5 but does not re-run the cell (this does not explain the variation, though).
  2. A variation of one unit due to early_stopping not being defined within the mentioned loop. Because it's outside the loop, the first loop iteration needs an extra epoch.

I do not see further differences, but it may depend on the actual code, if it differs from what's included.


In the OP's code, for a standard classification one would usually use SparseCE or CE rather than MSE, but I assume the OP knows this and it's used for a reason.

It's easier to help if one includes a minimal, self-contained code snippet for the issue. Datasets are very easy to load, e.g. from keras.api.datasets.cifar10 import load_data.

Example.
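
For instance, a minimal repro could look roughly like the sketch below (cifar10 standing in for the real dataset, a tiny model, and the same callback instance reused across fit() calls, as in the loop above):

import keras
from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train[:512].astype("float32") / 255.0
x_test = x_test[:128].astype("float32") / 255.0
y_train = keras.utils.to_categorical(y_train[:512], 10)
y_test = keras.utils.to_categorical(y_test[:128], 10)

# Same EarlyStopping instance reused across fit() calls, as in the original loop
early_stopping = keras.callbacks.EarlyStopping(monitor='val_mse', min_delta=0.001, patience=5, restore_best_weights=True)

for run in range(3):
    model = keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        keras.layers.Conv2D(8, (3, 3), activation='relu'),
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation='sigmoid'),
    ])
    model.compile(loss='mse', optimizer='sgd', metrics=['mse'])
    history = model.fit(x_train, y_train, epochs=30, validation_data=(x_test, y_test), callbacks=[early_stopping], verbose=0)
    print(f"run {run}: stopped after {len(history.history['loss'])} epochs")

Printing the number of completed epochs per run makes it easy to see whether later runs stop much earlier than the first one.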

mehtamansi29 (Collaborator) commented:

Hi @Senantq -

I am unable to run your exact code with your dataset, as your drive link is not accessible.

But I ran your model with some of its layers on the mnist dataset with the same early stopping callback, and it seems to work fine. With EarlyStopping(monitor='val_mse', min_delta=0.001, patience=5, restore_best_weights=True), patience=5 and monitor='val_mse', so training stops once 'val_mse' has not decreased for 5 epochs.

Attached gist here for your reference.

Senantq (Author) commented Sep 19, 2024

Hi everyone,
I am very sorry for the delayed response. The link is now accessible; it contains my whole script, the dataset, and my conda environment YAML.

If one changes patience=2 to patience=5 but does not re-run the cell (this does not explain the variation, though).

The code is run as a .py script, so the problem does not come from there.

A variation of one unit due to early_stopping not being defined within the mentioned loop. Because it's outside the loop, the first loop iteration needs an extra epoch.

It could be, maybe, but then I don't see why it works perfectly fine with TF 2.15/Keras 2.

In the OP's code, for a standard classification one would usually use SparseCE or CE rather than MSE, but I assume the OP knows this and it's used for a reason.

This is completely intentional, thank you for the reminder.

It's easier to help if one includes a minimal, self-contained code snippet for the issue. Datasets are very easy to load, e.g. from keras.api.datasets.cifar10 import load_data.

Understood. I will try to provide the simplest possible code next time, but I hesitated here because of the particularities of the training.

I am also encountering another problem with the very same script on a cluster: the code stops within the first 30 minutes due to an OOM error on an A100, but runs for 7 hours straight on a V100, which has 8 GB less memory than the A100. So I am beginning to suspect a memory leak that could be due to the CUDA libs.
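
To check that, I might log the GPU memory between runs with something like this rough sketch (using tf.config.experimental.get_memory_info, assuming a single visible 'GPU:0'):

import tensorflow as tf

def log_gpu_memory(tag):
    # Current and peak GPU memory in use, in bytes, for device 'GPU:0'
    info = tf.config.experimental.get_memory_info('GPU:0')
    print(f"{tag}: current={info['current'] / 1e6:.1f} MB, peak={info['peak'] / 1e6:.1f} MB")

# At the end of each loop iteration, after clear_session() and gc.collect():
log_gpu_memory(f"after run {run}")
tf.config.experimental.reset_memory_stats('GPU:0')  # reset the peak counter for the next run

If the "current" value keeps growing across iterations despite the cleanup, that would point to a leak rather than a single oversized batch.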

Thank you for the time spent

mehtamansi29 (Collaborator) commented:

Hi @Senantq -

I am very sorry for the delayed response. The link is now accessible; it contains my whole script, the dataset, and my conda environment YAML.

Thanks for the code. This is what I get after running it:

Ethnicity: Caucasians - Subfolders kept in the training dataset: 0, test dataset: 0
Ethnicity: Afro_Americans - Subfolders kept in the training dataset: 20, test dataset: 20

Code:

for nb_cc_value in nb_cc:
    for ethnie in ethnies:
        for proportion in prop:
            proportion = proportion/100.
            for l1_value in l1_values:
                for run in range(1,3): #(1, 11)
                    X_train, y_train, X_test, y_test, vectors_dict, ethnie_dict, test_image_counts = load_images_and_vectors(target_folder=ethnie, base_dir = base_directory, proportion=proportion, ethnie_exclue=ethnie_exclue, target_size=(150,150), test_proportion=0.15)
                    print(X_train.shape)

It means no images are coming through for training. Because of the loop, the model is initialized and trained for a few epochs, and once an iteration gets zero training images, training stops.
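
A quick guard right after loading would make this visible and skip such configurations (rough sketch, reusing the names from your loop):

                    X_train, y_train, X_test, y_test, vectors_dict, ethnie_dict, test_image_counts = load_images_and_vectors(target_folder=ethnie, base_dir = base_directory, proportion=proportion, ethnie_exclue=ethnie_exclue, target_size=(150,150), test_proportion=0.15)
                    # Skip configurations that yield no training images instead of fitting on them
                    if len(X_train) == 0:
                        print(f"Skipping ethnie={ethnie}, proportion={proportion}: no training images")
                        continue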

Senantq (Author) commented Sep 19, 2024

The fact that one of the main folders (here Caucasians) has no training images at the beginning of the 'proportion in prop' loop is expected; this is for research purposes related to my PhD in psychology. The model should still receive plenty of training images from the other main folder (Afro_Americans, something like 20*130 images). I don't think this should stop the training, however.
