Using a large dataset from a TFRecord, the model doesn't train #2444

emmanuelol · 2024-05-14T04:08:55Z

Current Behavior:

I'm using TensorFlow nightly as the backend for KerasCV nightly, running the TensorFlow Docker image from Docker Hub. I'm trying to train object detection models on a larger dataset: almost 70K images for training and 10K for testing. For the past two weeks, I've been trying to train RetinaNet and YOLOV8 on this dataset, but as soon as the first epoch starts, the following message is printed many times:

15555 gpu_timer.cc:114] Skipping the delay kernel, measurement accuracy will be reduced

After that, no training happens; everything freezes. When I check the GPU resources, almost all of the memory is in use and there are occasional peaks of GPU activity. I have waited for hours, but nothing progresses, and the system eventually kills the training process.
When I try YOLOV8m, I get an additional message:

/usr/local/lib/python3.11/dist-packages/keras_cv/src/metrics/coco/pycoco_wrapper.py:98: RuntimeWarning: invalid value encountered in scalar multiply

Expected Behavior:

With a subset of fewer than 5K images from the same dataset, the same message appears, but the model starts training. I expect the full dataset to train as well.

Steps To Reproduce:

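import tensorflow as tf
import keras_cv
from keras_cv import bounding_box

# train_tfrecord_file, val_tfrecord_file, BATCH_SIZE and strategy are
# defined elsewhere in the script (not shown here).
train_dataset = tf.data.TFRecordDataset([train_tfrecord_file])
val_dataset = tf.data.TFRecordDataset([val_tfrecord_file])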

### Parse TFRecord
def parse_tfrecord_fn(example):
    feature_description = {
        'image/encoded': tf.io.FixedLenFeature([], tf.string),
        'image/height': tf.io.FixedLenFeature([], tf.int64),
        'image/width': tf.io.FixedLenFeature([], tf.int64),
        'image/object/bbox/xmin': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/xmax': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/ymin': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/ymax': tf.io.VarLenFeature(tf.float32),
        #'image/object/class/text': tf.io.VarLenFeature(tf.string),
        'image/object/class/label': tf.io.VarLenFeature(tf.int64),
    }
    
    parsed_example = tf.io.parse_single_example(example, feature_description)

    # Decode the JPEG image; pixel values stay in the [0, 255] range (no normalization).
    img = tf.image.decode_jpeg(parsed_example['image/encoded'], channels=3) # Returned as uint8

    # Get the bounding box coordinates and class labels.
    xmin = tf.sparse.to_dense(parsed_example['image/object/bbox/xmin'])
    xmax = tf.sparse.to_dense(parsed_example['image/object/bbox/xmax'])
    ymin = tf.sparse.to_dense(parsed_example['image/object/bbox/ymin'])
    ymax = tf.sparse.to_dense(parsed_example['image/object/bbox/ymax'])
    #labels = tf.sparse.to_dense(parsed_example['image/object/class/text'])
    labels = tf.sparse.to_dense(parsed_example['image/object/class/label'])

    # Stack the bounding box coordinates to create a [num_boxes, 4] tensor.
    rel_boxes = tf.stack([xmin, ymin, xmax, ymax], axis=-1)
    boxes = keras_cv.bounding_box.convert_format(rel_boxes, source='rel_xyxy', target='xyxy', images=img)

    # Create the final dictionary.
    image_dataset = {
        'images': img,
        'bounding_boxes': {
            'classes': labels,
            'boxes': boxes
        }
    }
    return image_dataset

def dict_to_tuple(inputs):
    return inputs["images"], bounding_box.to_dense(
        inputs["bounding_boxes"], max_boxes=50
    )

train_dataset = train_dataset.map(parse_tfrecord_fn)
val_dataset = val_dataset.map(parse_tfrecord_fn)
train_dataset = train_dataset.cache()
val_dataset = val_dataset.cache()
train_dataset = train_dataset.shuffle(8 * strategy.num_replicas_in_sync)

train_dataset = train_dataset.ragged_batch(BATCH_SIZE)
val_dataset = val_dataset.ragged_batch(BATCH_SIZE)

The rest of the script follows the RetinaNet example in this repository.
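
One way to check whether the stall is in the input pipeline rather than in the model is to iterate the dataset on its own and time the first few batches. The sketch below is plain Python; `time_first_batches` is a hypothetical helper (not part of KerasCV or TensorFlow), and it assumes the dataset can be iterated eagerly, as a `tf.data.Dataset` normally can.

```python
import time


def time_first_batches(batches, n=5):
    """Pull up to n items from any iterable and return the seconds each one took.

    Hypothetical diagnostic helper: it only measures how long the iterable
    takes to produce items; it does not touch the model.
    """
    timings = []
    iterator = iter(batches)
    for _ in range(n):
        start = time.perf_counter()
        try:
            next(iterator)
        except StopIteration:
            # The iterable had fewer than n items.
            break
        timings.append(time.perf_counter() - start)
    return timings
```

For example, `time_first_batches(train_dataset, n=5)` could be called on the pipeline above before `model.fit`. If that call itself never returns, the model is not involved in the stall.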

Version:

Docker image: tensorflow/tensorflow:nightly-gpu
Docker 26.1.1
NVIDIA Container Toolkit
Ubuntu 22.04
NVIDIA driver 550.67

Thanks in advance. Any help is always welcome.


mehtamansi29 · 2024-05-20T08:46:35Z

Hi @emmanuelol

Could you please provide a dummy dataset to reproduce this issue?

emmanuelol · 2024-05-22T18:37:44Z

Hi, one of the datasets where this issue occurs is BDD100K. I downloaded it and converted it to TFRecord; in the past, I've used the same kind of TFRecords to train models with the TensorFlow Object Detection API without issues.
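
For a rough sense of scale: in the snippet from the original report, `.cache()` is called after `parse_tfrecord_fn`, so the cache holds decoded `uint8` images, and without a filename argument `cache()` keeps them in host memory. The back-of-the-envelope sketch below assumes 1280x720 RGB images (the usual BDD100K frame size; the report does not state it) and ignores boxes, labels, and any framework overhead. `decoded_cache_bytes` is a hypothetical helper for illustration only.

```python
def decoded_cache_bytes(num_images, height, width, channels=3, bytes_per_value=1):
    """Estimate the bytes needed to hold decoded images in memory.

    Assumes every image has the same shape and one byte per channel value
    (uint8); real datasets may vary.
    """
    return num_images * height * width * channels * bytes_per_value


GIB = 1024 ** 3

# Assumed image size: 1280x720 RGB (not stated in the report).
full_set = decoded_cache_bytes(70_000, height=720, width=1280)
subset = decoded_cache_bytes(5_000, height=720, width=1280)

print(f"full training set: {full_set / GIB:.1f} GiB")
print(f"5K subset:         {subset / GIB:.1f} GiB")
```

Under these assumptions, the full training set would need roughly 180 GiB of host memory, versus roughly 13 GiB for the 5K subset. This is only an estimate and does not by itself confirm the cause of the freeze.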

emmanuelol · 2024-05-28T18:45:41Z

https://www.kaggle.com/datasets/pa928human/bdd100k-multiclass-tfrecords-val-part-1
This dataset is an example where the bug is present.

github-actions bot assigned sachinprasadhs May 14, 2024

mehtamansi29 self-assigned this May 20, 2024

mehtamansi29 added the type:Bug Something isn't working label May 20, 2024

mehtamansi29 added the stat:awaiting response from contributor label May 20, 2024

sachinprasadhs removed the stat:awaiting response from contributor label Jun 4, 2024
