
[MXNET-737]Add last batch handle for imageiter #12131

Merged

Conversation

@stu1130 (Contributor) commented Aug 10, 2018

Description

Add last_batch_handle parameter to ImageIter based on #11883

  • last_batch_handle supports 'pad' (default), 'discard', and 'roll_over'

Note that reading record files (.rec) without shuffle (i.e. sequential read) does not support 'discard'.
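
For illustration, a minimal usage sketch of the new parameter (the record file paths below are hypothetical):

import mxnet as mx

# Hypothetical paths; on a .rec file, 'discard' needs shuffle=True (with an
# index file), because a sequential read cannot know the total image count.
train_iter = mx.image.ImageIter(
    batch_size=32,
    data_shape=(3, 224, 224),        # channels-first (c, h, w)
    path_imgrec='data/train.rec',
    path_imgidx='data/train.idx',
    shuffle=True,
    last_batch_handle='discard',     # 'pad' (default), 'discard', or 'roll_over'
)

for epoch in range(2):
    for batch in train_iter:
        # With 'pad', batch.pad reports how many trailing samples of the
        # final batch are padding; with 'discard' every batch is full.
        pass
    train_iter.reset()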

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add the last_batch_handle feature to ImageIter and adjust test_imageiter to test the last_batch_handle parameter
  • Change the shuffle behavior to match NDArrayIter, where the data is shuffled only during iterator initialization

Comments

N/A
@zhreshold

@stu1130 stu1130 requested a review from szha as a code owner August 10, 2018 22:55
@sandeep-krishnamurthy sandeep-krishnamurthy added the Gluon and pr-work-in-progress (PR is still work in progress) labels Aug 12, 2018
@@ -1059,16 +1059,21 @@ class ImageIter(io.DataIter):
Label name for provided symbols.
dtype : str
Label data type. Default: float32. Other options: int32, int64, float64
last_batch_handle : str, optional

@stu1130 (Contributor Author):

The original NDArrayIter calls it last_batch_handle, but I think the name last_batch already gives enough information about what it does. So I would go with your suggestion.

@@ -1059,16 +1059,21 @@ class ImageIter(io.DataIter):
Label name for provided symbols.
dtype : str
Label data type. Default: float32. Other options: int32, int64, float64
last_batch_handle : str, optional
How to handle the last batch. This parameter can be 'pad', 'discard' or 'roll_over'.
'discard' is not supported when reading from a record file (.rec) without shuffle (shuffle=False).

Member:

why is discard not supported explicitly? We can add a warning about this behavior

Member:

and please add an explanation of what each option stands for

@stu1130 (Contributor Author), Aug 13, 2018:

The reason why discard is not supported is that when we read the .rec file sequentially we don't know how many images are in the file. Therefore, there is no way we can precalculate the number of images we need to discard. The only two solutions I came up with are:

  • iterate over the file during the initialization of the data iterator
  • allow users to input the number of images

The first solution would take a lot of time during initialization if the file is large. The second one is not user-friendly. So I decided to give up this option.

Member:

I think you can check if valid batch.shape[0] is smaller than batch_size without checking how many images are there.
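
To illustrate the suggestion (this is not the PR's code; next_sample mirrors the iterator's helper of the same name): a short final batch can be detected simply by counting how many samples were actually collected, with no need to know the dataset size up front.

def read_batch(data_iter, batch_size):
    """Collect up to batch_size samples; return None when the final batch is
    short and should be discarded (illustration only)."""
    samples = []
    try:
        for _ in range(batch_size):
            samples.append(data_iter.next_sample())
    except StopIteration:
        pass
    if len(samples) < batch_size:
        return None   # short (or empty) final batch -> drop it under 'discard'
    return samples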

@@ -1149,22 +1157,53 @@ def __init__(self, batch_size, data_shape, label_width=1,
else:
self.auglist = aug_list
self.cur = 0
self.is_iterated_over = False

Member:

please make internal state variables private with leading underscore

if pad != 0:
_ = self.iterate(batch_data, batch_label, i, self._imgrec)

if self.last_batch_handle != 'roll_over':

Member:

is it possible to cache the last batch in memory for 'roll_over'? I think opening/closing recordio is problematic
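
A rough sketch of this idea (assumed helper, not the merged code): keep the leftover samples of the incomplete batch in memory and prepend them to the first batch of the next epoch, instead of rewinding the RecordIO reader mid-epoch.

class RollOverCache:
    """Holds the leftover samples of an incomplete final batch between epochs."""

    def __init__(self):
        self._data = []
        self._label = []

    def stash(self, data_list, label_list):
        # Remember the samples that did not fill a whole batch.
        self._data = list(data_list)
        self._label = list(label_list)

    def take(self):
        # Hand the cached samples back (and clear the cache) at the start
        # of the next epoch so they lead the first batch.
        data, label = self._data, self._label
        self._data, self._label = [], []
        return data, label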

@sandeep-krishnamurthy (Contributor) left a comment:

Nice work!
Welcome to the MXNet community. Thanks for your first contribution!

A few changes requested.

if self.imgrec is not None:
self.imgrec.reset()
self.cur = 0
self._is_allowed_reading = True

Contributor:

why do we need to set this again?

@stu1130 (Contributor Author):

We need to reset the RecordIO reader and the current cursor as long as last_batch_handle is not 'roll_over'. As for self._is_allowed_reading, it needs to be set back to True after reset. Consider the situation where a data iter with 'pad' or 'discard' has iterated to the last batch: self._is_allowed_reading is set to False (after the padding is done, the value is set to False in case users call next() again), so we need to set it back to True to prepare for the next next() call.
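
A condensed sketch of the reset flow described above (attribute names assumed, not copied verbatim from the PR):

class _ResetSketch:
    """Illustrates the reset behavior; assumes the attributes named below."""

    def reset(self):
        if self.last_batch_handle != 'roll_over':
            if self.imgrec is not None:
                self.imgrec.reset()   # rewind the .rec reader
            self.cur = 0              # rewind the cursor into self.seq
        # 'pad'/'discard' disable reading once the final batch is produced,
        # so reading must be re-enabled for the next epoch.
        self._allow_read = True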


def next_sample(self):
"""Helper function for reading in next sample."""
if self._is_allowed_reading is False:
raise StopIteration

Contributor:

Please add a user-comprehensible message when raising errors.

Member:

StopIteration is normal behavior, not for users

Contributor:

Yup, my bad. It is a built-in exception for iterators. Please ignore my comment.

if self.cur < self.num_image:
idx = self.seq[self.cur]
else:
if self.last_batch != 'discard':

Contributor:

should this be != discard and == rollover?

@stu1130 (Contributor Author):

Whether it is 'pad' or 'roll_over', we need to "reset the cursor" (assign 0 to self.cur) to do the padding.
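
For clarity, a sketch of that cursor handling (assumed names, illustrative only): 'discard' stops at the end of the index list, while 'pad' and 'roll_over' wrap the cursor back to the beginning so the partial batch can be filled.

class _CursorSketch:
    """Assumes self.cur, self.seq, self.num_image, and self.last_batch_handle
    exist as in the iterator."""

    def _next_index(self):
        if self.cur < self.num_image:
            idx = self.seq[self.cur]
        else:
            if self.last_batch_handle == 'discard':
                raise StopIteration
            # 'pad' or 'roll_over': wrap around and reuse leading samples
            self.cur = 0
            idx = self.seq[self.cur]
        self.cur += 1
        return idx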

def next(self):
"""Returns the next batch of data."""
batch_size = self.batch_size
c, h, w = self.data_shape

Contributor:

Are we assuming always channels_first?

@stu1130 (Contributor Author):

Yes we are

path_imglist=path_imglist, path_root='', dtype=dtype)
for _ in range(3):
for batch in test_iter:
pass

Contributor:

can we assert the returned batch size and shape here?
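
One way this could look (illustrative; the batch_size of 3, data_shape of (3, 224, 224), and label_width=1 are assumed from the surrounding test):

def check_batches(test_iter, batch_size=3, data_shape=(3, 224, 224)):
    """Iterate a few epochs and assert the returned batch size and shape."""
    for _ in range(3):
        for batch in test_iter:
            assert batch.data[0].shape == (batch_size,) + data_shape
            assert batch.label[0].shape == (batch_size,)   # label_width=1
        test_iter.reset()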

# handle the last batch
if self.seq and last_batch == 'discard':
new_seq_n = len(self.seq) - len(self.seq) % batch_size
self.seq = self.seq[:new_seq_n]

Member:

this is not correct; you have now changed the behavior from per-epoch shuffle to shuffle at initialization only, so self.seq must be shuffled in reset, and the slice cannot be fixed.

@stu1130 (Contributor Author), Aug 15, 2018:

According to NDArrayIter, it shuffles only during initialization. But you are right; we should shuffle for each epoch.
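
A sketch of that per-epoch shuffle (assumed attribute names): shuffle the full index list inside reset() and only then apply the 'discard' trimming, so the dropped samples change from epoch to epoch instead of being fixed at construction time.

import random

class _ShuffleSketch:
    """Assumes self.shuffle, self.seq, self.batch_size, self.last_batch_handle,
    self.cur, and self._allow_read exist as in the iterator."""

    def reset(self):
        if self.shuffle:
            random.shuffle(self.seq)
        if self.last_batch_handle == 'discard':
            # Trim into a separate view so self.seq keeps its full length
            # and can be reshuffled (and re-trimmed) next epoch.
            keep = len(self.seq) - len(self.seq) % self.batch_size
            self._epoch_seq = self.seq[:keep]
        else:
            self._epoch_seq = self.seq
        self.cur = 0
        self._allow_read = True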

raise StopIteration
header, img = recordio.unpack(s)
return header.label, img

def next(self):
"""Returns the next batch of data."""
def iterate(self, batch_data, batch_label, start=0):

Member:

use _collect_batch as name instead

Member:

actually _batchify could be better IMO

pad = batch_size - i
# handle padding for sequential read
if pad != 0:
if self.seq is not None:

Member:

the logic of last_batch here is kind of weird

# calculate the padding
pad = batch_size - i
# handle padding for 'pad' and 'roll_over' for the last batch
if pad != 0:

@stu1130 (Contributor Author):

refactor the padding logic here
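
For context, a small sketch of how the pad count can be carried on the returned batch (assumed helper names, not the PR's exact code):

import mxnet as mx

def make_batch(batch_data, batch_label, num_collected, batch_size):
    """Wrap the collected arrays into a DataBatch and record how many
    trailing samples are padding (0 for a full batch)."""
    pad = batch_size - num_collected
    return mx.io.DataBatch([batch_data], [batch_label], pad=pad)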

@szha szha requested review from zhreshold and removed request for szha August 16, 2018 23:05

@zhreshold (Member) left a comment:

Essentially LGTM now, please address remaining comments and we are good to merge.

random.shuffle(self.seq)
if self.imgrec is not None:
self.imgrec.reset()
self.cur = 0
self._is_allowed_reading = True

Member:

self._allow_read for shorter but same meaning

first_batch_roll_over_twice = test_iter.next()
assert np.array_equal(
first_batch_roll_over_twice.data[0][2].asnumpy(), first_image.asnumpy())
assert first_batch_roll_over_twice.pad == 1

Member:

assert second epoch with size 6?
Also, can you add a test for shuffle=True, just as a sanity test? No value assert is required for shuffle mode.

@stu1130 (Contributor Author) left a comment:

All the code changes are complete.
Thanks, @zhreshold and @sandeep-krishnamurthy

# test the third epoch with size 6
assert i == 6
# test shuffle option for sanity test
test_iter = mx.image.ImageIter(3, (3, 224, 224), label_width=1, imglist=imglist, shuffle=True,

@stu1130 (Contributor Author):

add shuffle test case

for _ in test_iter:
i += 1
# test the third epoch with size 6
assert i == 6

@stu1130 (Contributor Author):

test epoch size

@sandeep-krishnamurthy (Contributor) left a comment:

LGTM.
Thank you @stu1130 @zhreshold

if self.imgrec is not None:
self.imgrec.reset()
self.cur = 0
if self._allow_read is False:

Contributor:

nit: just self._allow_read = True would work here?

@stu1130 (Contributor Author):

Yes, it works as well. Only an iter with 'pad' would change this flag, though.

@sandeep-krishnamurthy sandeep-krishnamurthy changed the title [MXNET-737][WIP] Add last batch handle for imageiter [MXNET-737]Add last batch handle for imageiter Aug 18, 2018
@sandeep-krishnamurthy sandeep-krishnamurthy merged commit afb77f8 into apache:master Aug 18, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
[MXNET-737]Add last batch handle for imageiter
@stu1130 stu1130 deleted the add_last_batch_handle_on_imageiter branch January 10, 2020 21:48
Labels
Gluon, pr-work-in-progress (PR is still work in progress)

3 participants