
ImageDetIter looping forever in MXNet-1.3.0 #13037

Closed
Wallart opened this issue Oct 30, 2018 · 5 comments · Fixed by #13550

Comments

Wallart commented Oct 30, 2018

Hello everybody!
I might have found a bug in ImageDetIter.
I recently ran some old code written by following the Gluon object detection tutorial (https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html).

It is fully functional up to MXNet 1.2.1, but on MXNet 1.3.0 the loop that iterates over batches never terminates.

@frankfliu (Contributor)

@mxnet-label-bot [Bug, Python]

kalyc (Contributor) commented Nov 14, 2018

Thanks for reporting the issue, @Wallart.
Do you mind sharing the code snippet you used and details of your development environment?

Wallart (Author) commented Nov 15, 2018

I'm running my code in nvidia-docker containers (Ubuntu 17.10) with CUDA 9.2. I compiled each of my MXNet versions from source with OpenCV and MKL-DNN support.
On the hardware side I'm using two GTX 1080 Ti cards.

Consider the following code snippet, which uses a dataset in ImageRecord format (approximately 200 images):

    # Imports used by this excerpt; FocalLoss, SmoothL1Loss, training_targets,
    # net, trainer, train_data, ctx, batch_size, epochs and log_interval are
    # defined earlier, following the Gluon object detection tutorial.
    import time
    import mxnet as mx
    from mxnet import autograd, nd

    cls_loss = FocalLoss()
    box_loss = SmoothL1Loss()
    cls_metric = mx.metric.Accuracy()
    box_metric = mx.metric.MAE()

    for epoch in range(start_epoch, epochs):
        # reset iterator and tick
        train_data.reset()
        cls_metric.reset()
        box_metric.reset()
        epoch_tick = time.time()

        # iterate through all batches
        for i, batch in enumerate(train_data):
            batch_tick = time.time()

            # record gradients
            with autograd.record():
                x = batch.data[0].as_in_context(ctx)
                y = batch.label[0].as_in_context(ctx)

                default_anchors, class_predictions, box_predictions = net(x)
                box_target, box_mask, cls_target = training_targets(default_anchors, class_predictions, y)

                # losses
                loss1 = cls_loss(class_predictions, cls_target)
                loss2 = box_loss(box_predictions, box_target, box_mask)

                # sum all losses
                loss = loss1 + loss2

                # backpropagate
                loss.backward()

            # apply
            trainer.step(batch_size)

            # update metrics
            cls_metric.update([cls_target], [nd.transpose(class_predictions, (0, 2, 1))])
            box_metric.update([box_target], [box_predictions * box_mask])

            if (i + 1) % log_interval == 0:
                name1, val1 = cls_metric.get()
                name2, val2 = box_metric.get()
                print('[Epoch %d Batch %d] speed: %f samples/s, training: %s=%f, %s=%f'
                      % (epoch, i, batch_size / (time.time() - batch_tick), name1, val1, name2, val2))

        # end of epoch logging
        name1, val1 = cls_metric.get()
        name2, val2 = box_metric.get()
        print('[Epoch %d] training: %s=%f, %s=%f' % (epoch, name1, val1, name2, val2))
        print('[Epoch %d] time cost: %f' % (epoch, time.time() - epoch_tick))

On MXNet 1.2.1 it works as expected and the epochs keep flowing through the console:

[Epoch 0] training: accuracy=0.833192, mae=0.004929
[Epoch 0] time cost: 1.240091
[Epoch 1] training: accuracy=0.966545, mae=0.004379
[Epoch 1] time cost: 0.610014
[Epoch 2] training: accuracy=0.976884, mae=0.003983
[Epoch 2] time cost: 0.631764
[Epoch 3] training: accuracy=0.983173, mae=0.004638

But on MXNet 1.3.0 a single epoch gets divided into what seems like an endless sequence of batches:

[Epoch 0 Batch 19] speed: 1155.356185 samples/s, training: accuracy=0.923830, mae=0.004783
[Epoch 0 Batch 39] speed: 1105.710115 samples/s, training: accuracy=0.954663, mae=0.004561
[Epoch 0 Batch 59] speed: 1169.286568 samples/s, training: accuracy=0.966536, mae=0.004413
[Epoch 0 Batch 79] speed: 1132.142250 samples/s, training: accuracy=0.973061, mae=0.004393
[Epoch 0 Batch 99] speed: 1115.432219 samples/s, training: accuracy=0.977253, mae=0.004304
[Epoch 0 Batch 119] speed: 1139.079420 samples/s, training: accuracy=0.980220, mae=0.004205

It shouldn't take this long to complete a single epoch; there are only about 200 images in the dataset.
I suspect a change in the data iterator in MXNet 1.3.0.
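
For reference, the iterator can be isolated from the training loop with a minimal sketch like the one below (the .rec/.idx paths, data shape, and batch size are placeholders, not my actual setup). On MXNet 1.2.1 the inner loop ends after roughly 200 / batch_size batches per epoch; on 1.3.0 it just keeps producing batches.

    import mxnet as mx

    # Placeholder RecordIO files and shapes; substitute your own dataset.
    train_data = mx.image.ImageDetIter(
        batch_size=32,
        data_shape=(3, 256, 256),
        path_imgrec='train.rec',
        path_imgidx='train.idx',
        shuffle=True)

    # With ~200 images and batch_size=32, each epoch should yield at most 7 batches.
    for epoch in range(2):
        train_data.reset()
        num_batches = 0
        for batch in train_data:
            num_batches += 1
        print('epoch %d: %d batches' % (epoch, num_batches))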

@zhreshold (Member)

@stu1130 Can you have a look?

I can confirm that the changes made in #12131 break the behavior of next_sample, causing an infinite loop in the ImageDetIter subclass in image.detection.
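
For readers following along, the sketch below illustrates that failure mode in isolation. It is not the actual MXNet source, just a toy base class and subclass: a subclass that detects the end of an epoch by catching StopIteration from next_sample loops forever once the base class is changed to wrap around silently instead of raising.

    # Toy illustration only (not MXNet code). DetIter detects end-of-epoch by
    # catching StopIteration from next_sample(); if the base class wraps around
    # instead of raising, DetIter never terminates.

    class BaseIter:
        def __init__(self, num_samples, wrap_around):
            self.num_samples = num_samples
            self.wrap_around = wrap_around  # True mimics the changed base-class behavior
            self.cur = 0

        def next_sample(self):
            if self.cur >= self.num_samples:
                if not self.wrap_around:
                    raise StopIteration     # old behavior: signal end of epoch
                self.cur = 0                # new behavior: silently restart
            self.cur += 1
            return self.cur - 1


    class DetIter(BaseIter):
        """Subclass that still expects next_sample() to raise at end of epoch."""

        def __init__(self, num_samples, batch_size, wrap_around):
            super().__init__(num_samples, wrap_around)
            self.batch_size = batch_size

        def __iter__(self):
            return self

        def __next__(self):
            batch = []
            try:
                while len(batch) < self.batch_size:
                    batch.append(self.next_sample())
            except StopIteration:
                if not batch:
                    raise                   # epoch finished, nothing left to return
            return batch


    # With the raising base class, 200 samples in batches of 32 give 7 batches:
    print(sum(1 for _ in DetIter(200, 32, wrap_around=False)))

    # With the wrap-around base class, next_sample() never raises, so iteration
    # would run forever; cap it just to demonstrate the runaway behavior.
    for i, _ in enumerate(DetIter(200, 32, wrap_around=True)):
        if i > 1000:
            print('more than 1000 batches from 200 samples: infinite loop')
            break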

stu1130 (Contributor) commented Dec 1, 2018

@zhreshold OK, I'll look into it.
