
Early-stopping does not work properly in Keras 3 when used in a for loop #20256

Senantq opened this issue Sep 13, 2024 · 8 comments

Senantq commented Sep 13, 2024

Hello,
I am using Keras 3.5 with TF 2.17. My code is more or less the following (it is not a grid search: in the real code I also vary some other variables that are not directly linked to the network):


import gc
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, BatchNormalization, Flatten, Dense
from tensorflow.keras.regularizers import l1
from tensorflow.keras.callbacks import EarlyStopping

def create_conv(nb_cc_value, l1_value):
    model = Sequential()
    model.add(tensorflow.keras.layers.RandomFlip(mode="horizontal"))
    model.add(Conv2D(32, (3,3), activation = 'relu', kernel_regularizer=l1(l1_value)))
    model.add(MaxPool2D())
    model.add(BatchNormalization())
    model.add(Conv2D(64, (3,3), activation = 'relu', kernel_regularizer=l1(l1_value)))
    model.add(MaxPool2D())
    model.add(BatchNormalization())
    model.add(Conv2D(512, (3,3), activation = 'relu', kernel_regularizer=l1(l1_value)))
    model.add(MaxPool2D())
    model.add(BatchNormalization())
    model.add(Conv2D(1024, (3,3), activation = 'relu', kernel_regularizer=l1(l1_value)))
    model.add(BatchNormalization())
    model.add(MaxPool2D())
    model.add(Conv2D(2048, (3,3), activation = 'relu', kernel_regularizer=l1(l1_value)))
    model.add(Flatten())
    model.add(BatchNormalization())
    model.add(Dense(nb_cc_value, activation='relu', kernel_regularizer=l1(l1_value)))
    model.add(Dense(56, activation = 'sigmoid'))
    model.build((None,150,150,1))
    
    lr_schedule = tensorflow.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.7, staircase=False)
    optimizer = tensorflow.keras.optimizers.SGD(learning_rate=lr_schedule, momentum = 0.9)
    model.compile(loss= ['mse'], optimizer = optimizer, metrics = ['mse'])
    return model

# %%--------------------------------------------------Initialization
early_stopping = EarlyStopping(monitor='val_mse', min_delta = 0.001, patience=5, restore_best_weights=True)

nb_cc = [2, 6, 12, 102, 302, 602]
l1_values = [2.220446049250313e-16, 0.0000001, 0.0001]

for nb_cc_value in nb_cc:
    for l1_value in l1_values:
        for run in range(1,3):
            model = create_conv(nb_cc_value, l1_value)
            history = model.fit(X_train, y_train, epochs=epoques, callbacks=[early_stopping], validation_data=(X_test, y_test), batch_size=6, shuffle=True, verbose=1)
            # Cleanup
            del X_train, y_train, X_test, y_test, vectors_dict, ethnie_dict, test_image_counts, model, history, prediction
            tensorflow.keras.backend.clear_session()
            gc.collect()

However, when I run it, only the very first run in the whole script works fine. All the others stop after something like 1 or 2 epochs, even though 'val_mse' is still decreasing. I ran the same code with Keras 2.15.0 (tensorflow 2.15.0.post1) and it worked fine there.

Any help is much appreciated, thank you

mehtamansi29 (Collaborator) commented:

Hi @Senantq -

Can you help me with the dataset so I can reproduce the issue?

Senantq (Author) commented Sep 13, 2024

Hi @mehtamansi29
Sure! Here is the link to a Google Drive where you can find the full code as well as the folder containing the dataset: https://drive.google.com/drive/folders/1W6y-X_UlUNDoHHV8gG4CT5K30LwJWZvc?usp=drive_link

mehtamansi29 (Collaborator) commented:

Hi @Senantq -

Thanks, but the drive link is not accessible to me. Can you provide an accessible link?

ghsanti (Contributor) commented Sep 16, 2024

@Senantq Some possible causes that wouldn't be a bug:

  1. If one changes patience=2 to patience=5 but does not re-run the cell (this does not explain the variation, though).
  2. A variation of one unit due to early_stopping not being defined within the mentioned loop. Because it's outside the loop, the first loop iteration needs an extra epoch.

I do not see further differences, but it may depend on the actual code, if it differs from what's included.


In the OP's code, for a standard classification one would usually use SparseCE or CE rather than MSE, but I assume the OP knows this and it's used for a reason.

It's easier to help if one includes a minimal, self-contained code snippet for the issue. Datasets are very easy to load, e.g. from keras.api.datasets.cifar10 import load_data.

Example.
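
For instance, a minimal repro could look roughly like the sketch below (cifar10 standing in for the real dataset, a tiny model, and the same callback instance reused across fit() calls, as in the loop above):

import keras
from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train[:512].astype("float32") / 255.0
x_test = x_test[:128].astype("float32") / 255.0
y_train = keras.utils.to_categorical(y_train[:512], 10)
y_test = keras.utils.to_categorical(y_test[:128], 10)

# Same EarlyStopping instance reused across fit() calls, as in the original loop
early_stopping = keras.callbacks.EarlyStopping(monitor='val_mse', min_delta=0.001, patience=5, restore_best_weights=True)

for run in range(3):
    model = keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        keras.layers.Conv2D(8, (3, 3), activation='relu'),
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation='sigmoid'),
    ])
    model.compile(loss='mse', optimizer='sgd', metrics=['mse'])
    history = model.fit(x_train, y_train, epochs=30, validation_data=(x_test, y_test), callbacks=[early_stopping], verbose=0)
    print(f"run {run}: stopped after {len(history.history['loss'])} epochs")

Printing the number of completed epochs per run makes it easy to see whether later runs stop much earlier than the first one.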

mehtamansi29 (Collaborator) commented:

Hi @Senantq -

I am unable to run your exact code with your dataset, as your drive link is not accessible.

But I ran your model with some of its layers on the mnist dataset with the same early stopping callback, and it seems to work fine. With EarlyStopping(monitor='val_mse', min_delta=0.001, patience=5, restore_best_weights=True), patience=5 and monitor='val_mse', so training stops once 'val_mse' has not decreased for 5 epochs.

Attached gist here for your reference.

Senantq (Author) commented Sep 19, 2024

Hi everyone,
I am very sorry for the delayed response. The link is now accessible; it contains my whole script, the dataset, and my conda environment YAML.

If one changes patience=2 to patience=5 but does not re-run the cell (this does not explain the variation, though).

The code is run as a .py script, so the problem does not come from there.

A variation of one unit due to early_stopping not being defined within the mentioned loop. Because it's outside the loop, the first loop iteration needs an extra epoch.

It could be, maybe, but then I don't see why it works perfectly fine with TF 2.15/Keras 2.

In the OP's code, for a standard classification one would usually use SparseCE or CE rather than MSE, but I assume the OP knows this and it's used for a reason.

This is completely intentional, thank you for the reminder.

It's easier to help if one includes a minimal, self-contained code snippet for the issue. Datasets are very easy to load, e.g. from keras.api.datasets.cifar10 import load_data.

Understood. I will try to provide the simplest possible code next time, but I hesitated here because of the particularities of the training.

I am also encountering another problem with the very same script on a cluster: the code stops within the first 30 minutes due to an OOM error on an A100, but runs for 7 hours straight on a V100, which has 8 GB less memory than the A100. So I am beginning to suspect a memory leak that could be due to the CUDA libs.
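
To check that, I might log the GPU memory between runs with something like this rough sketch (using tf.config.experimental.get_memory_info, assuming a single visible 'GPU:0'):

import tensorflow as tf

def log_gpu_memory(tag):
    # Current and peak GPU memory in use, in bytes, for device 'GPU:0'
    info = tf.config.experimental.get_memory_info('GPU:0')
    print(f"{tag}: current={info['current'] / 1e6:.1f} MB, peak={info['peak'] / 1e6:.1f} MB")

# At the end of each loop iteration, after clear_session() and gc.collect():
log_gpu_memory(f"after run {run}")
tf.config.experimental.reset_memory_stats('GPU:0')  # reset the peak counter for the next run

If the "current" value keeps growing across iterations despite the cleanup, that would point to a leak rather than a single oversized batch.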

Thank you for the time spent

mehtamansi29 (Collaborator) commented:

Hi @Senantq -

I am very sorry for the delayed response. The link is now accessible; it contains my whole script, the dataset, and my conda environment YAML.

Thanks for the code. This is what I get after running it:

Ethnicity: Caucasians - Subfolders kept in the training dataset: 0, test dataset: 0
Ethnicity: Afro_Americans - Subfolders kept in the training dataset: 20, test dataset: 20

Code:

for nb_cc_value in nb_cc:
    for ethnie in ethnies:
        for proportion in prop:
            proportion = proportion/100.
            for l1_value in l1_values:
                for run in range(1,3): #(1, 11)
                    X_train, y_train, X_test, y_test, vectors_dict, ethnie_dict, test_image_counts = load_images_and_vectors(target_folder=ethnie, base_dir = base_directory, proportion=proportion, ethnie_exclue=ethnie_exclue, target_size=(150,150), test_proportion=0.15)
                    print(X_train.shape)

It means no images are coming through for training. Because of the loop, the model is initialized and trained for a few epochs, and once an iteration gets zero training images, training stops.
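
A quick guard right after loading would make this visible and skip such configurations (rough sketch, reusing the names from your loop):

                    X_train, y_train, X_test, y_test, vectors_dict, ethnie_dict, test_image_counts = load_images_and_vectors(target_folder=ethnie, base_dir = base_directory, proportion=proportion, ethnie_exclue=ethnie_exclue, target_size=(150,150), test_proportion=0.15)
                    # Skip configurations that yield no training images instead of fitting on them
                    if len(X_train) == 0:
                        print(f"Skipping ethnie={ethnie}, proportion={proportion}: no training images")
                        continue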

Senantq (Author) commented Sep 19, 2024

The fact that one of the main folders (here Caucasians) has no training images at the beginning of the 'proportion in prop' loop is expected; this is for research purposes related to my PhD in psychology. The model should still receive plenty of training images from the other main folder (Afro_Americans, something like 20*130 images). I don't think this should stop the training, however.
