
Generalize IO API to support any number of data / labels #468

Merged: 30 commits from multi-data into apache:master, Nov 7, 2015

Conversation

@pluskid (Contributor) commented Nov 2, 2015

  • Modify MXDataIter interface
  • Modify Model train to use extended API
  • Modify MNIST example
  • Modify Model predict to use extended API
  • Modify array iterator
  • Lint and clean up
  • Fix unit-test

@piiswrong (Contributor)

You might want to merge in my PR here: #456

@pluskid (Contributor, Author) commented Nov 2, 2015

@piiswrong Yes, please merge that PR and I will rebase later.

@piiswrong (Contributor)

Do we need to differentiate between data and label?

@pluskid (Contributor, Author) commented Nov 3, 2015

The only difference is that during training both data and label are copied into the network, while during prediction only data is copied. During evaluation on the validation set, data and label are copied into different places. I also found it a bit redundant, but I did not have a better way to handle this in the more general case.

@pluskid changed the title from "[WIP] Generalize IO API to support any number of data / labels" to "Generalize IO API to support any number of data / labels" on Nov 4, 2015
@piiswrong (Contributor)

I see.
What's the change to the interface?
Could you summarize it and put it somewhere, to make things easier for people working with the old interface?

@pluskid (Contributor, Author) commented Nov 4, 2015

@piiswrong The main change is:

  • Originally, we used the postfixes _data and _label to detect which input is data and which is label, and only supported one data and one label as input.
  • Now there is no special naming convention. The user specifies the names of the data in the data iterator, and those names should match the corresponding names in the symbol (by default data and softmax_label, to keep maximum backward compatibility). Multiple data/label arrays can be used if the data iterator provides them; see the sketch below.

I will make a more detailed document soon.
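For concreteness, here is a minimal sketch of the naming convention, written against the mx.io.NDArrayIter API from later MXNet releases (dict-of-named-arrays constructor); the exact constructor in this PR may differ:

```python
import numpy as np
import mxnet as mx

# Build a symbol whose input names are 'data' and 'softmax_label'
# (the defaults mentioned above).
data = mx.symbol.Variable('data')
fc = mx.symbol.FullyConnected(data=data, num_hidden=10)
net = mx.symbol.SoftmaxOutput(data=fc, name='softmax')  # label name: 'softmax_label'

X = np.random.rand(100, 20).astype('float32')
y = np.random.randint(0, 10, size=(100,)).astype('float32')

# The iterator declares its input names; they must match the symbol's
# argument names. With dicts, any number of data/label arrays works.
train_iter = mx.io.NDArrayIter(data={'data': X},
                               label={'softmax_label': y},
                               batch_size=10)
```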

@pluskid (Contributor, Author) commented Nov 4, 2015

@tqchen Hmm, I did not notice that. I agree this is a potential problem with MXDataIter. I'm not familiar with the underlying data loading and prefetching implementation, but I think there might be a simple way (without making the code too messy) to wire up the Python iterator API so that the first mini-batch can be fetched to check the data shape and batch size without an explicit reset.

@tqchen (Member) commented Nov 4, 2015

Yes, that will need some special treatment in the Python API, i.e. have a cached instance that is returned by the call to next instead of calling MXIterNext, and support a function like peek to get the instance without advancing the iterator (in the implementation, put it into the cached instance).

@piiswrong (Contributor)

Thanks for the summary!

With regard to the shape check, probably implement a peek method?

@tqchen (Member) commented Nov 4, 2015

I guess a peek method in the Python API might be a good idea. We may need to implement peek and next in the base class, and ask the child class to implement a special _next method to fetch the data.
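A minimal sketch of that scheme (illustrative, not the PR's exact code): the base class implements next/peek on top of a child-supplied _next, caching one batch so peek does not advance the iterator.

```python
class DataIter(object):
    """Sketch of a base iterator with a one-batch cache for peek()."""

    def __init__(self):
        self._cache = None

    def _next(self):
        # Child classes fetch the next batch here; return None at the end.
        raise NotImplementedError

    def peek(self):
        # Look at the upcoming batch without consuming it.
        if self._cache is None:
            self._cache = self._next()
        return self._cache

    def next(self):
        # Serve the cached batch first if peek() already fetched one.
        if self._cache is not None:
            batch, self._cache = self._cache, None
        else:
            batch = self._next()
        if batch is None:
            raise StopIteration
        return batch

    __next__ = next  # Python 3 compatibility

    def __iter__(self):
        return self
```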

@tqchen (Member) commented Nov 4, 2015

I think the current code is ready to be merged once we get the peek.

@pluskid (Contributor, Author) commented Nov 4, 2015

OK, I should be able to figure that out by the weekend.

@pluskid (Contributor, Author) commented Nov 4, 2015

Actually, I would like to propose simplifying the current data iter base class. The idea is to have as few methods as possible, so that users can write a customized data iter very easily, say by just writing a for loop with the Python generator API. Common tools, like shuffling within a large cache of in-memory pre-loaded mini-batches, could be provided so that users do not need to implement the same thing over and over again. But this requires more effort and discussion about the interface.

@tqchen (Member) commented Nov 4, 2015

As long as we have a fixed set of APIs, it is OK. For example, we could have a subclass Iter that takes a generator, with default implementations for all the other functions.
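Building on the DataIter sketch above, such a generator-backed subclass might look like this (GeneratorIter is a hypothetical name, not part of MXNet):

```python
class GeneratorIter(DataIter):
    """Wraps a user-supplied generator factory in the iterator interface."""

    def __init__(self, gen_fn):
        super(GeneratorIter, self).__init__()
        self._gen_fn = gen_fn        # callable returning a fresh generator
        self._gen = gen_fn()

    def _next(self):
        # Translate the generator protocol into the base class's
        # "return None at the end" convention.
        try:
            return next(self._gen)
        except StopIteration:
            return None

    def reset(self):
        # Start a new epoch with a fresh generator; drop any cached batch.
        self._gen = self._gen_fn()
        self._cache = None

# Usage: any generator of batches will do, e.g.
#   it = GeneratorIter(lambda: (make_batch(i) for i in range(100)))
# where make_batch is whatever produces one mini-batch.
```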

@pluskid (Contributor, Author) commented Nov 5, 2015

@tqchen Yes, but writing default impls, for example for iter_next or getpad, on the iterator object is not easy if all we assume we have from the end user is a generator. My understanding is that those functions exist to make it easier to implement the wrapper MXDataIter (or maybe even the array data iter). But there is no need to require all data iterators to implement those methods, because:

  • Those methods (e.g. iter_next, getdata, etc.) are never explicitly called, since all usage of a data iterator looks like for batch in data_iter: ... -- therefore, the only required interface is the Python iterator interface, which the user can implement explicitly by writing __iter__ and next, or by just wrapping a generator. But I might be wrong; if there are other potential usages of data iterators that require a richer interface, please let me know.
  • It is not true that under different implementation paradigms it is equally easy to implement all the interface functions. For example, suppose I simply provide a generator that iterates over all the data. In order to implement iter_next, I would need to call the underlying generator, cache the result, and catch the end-of-iteration exception, etc. Similarly, in order to implement getdata on the iterator object, I would need to explicitly fetch data from the underlying iterator and store it somewhere.

@tqchen (Member) commented Nov 5, 2015

I get your point; yes, I agree that iter and next are the two things we need. The rest of the interface is helpful but not required:

  • getpad, which was used to report the number of padded instances (for the predictor), but as you did in the update, it can readily be incorporated into the data batch.

We might still need provide_data and provide_label, e.g. as lists describing each input. In that sense, I agree that peek should be implemented in the subclass, to support provide_data and provide_label for MXDataIter.
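One hedged way the pieces could fit together, again building on the DataIter sketch above (the data_names/label_names attributes and the batch fields are assumptions for illustration, not this PR's exact code):

```python
class NamedDataIter(DataIter):
    """Derives provide_data/provide_label from peek(), with no reset()."""

    def __init__(self, data_names, label_names):
        super(NamedDataIter, self).__init__()
        self.data_names = data_names
        self.label_names = label_names

    @property
    def provide_data(self):
        # Peek at the first batch to learn shapes without consuming it.
        batch = self.peek()
        return list(zip(self.data_names, [d.shape for d in batch.data]))

    @property
    def provide_label(self):
        batch = self.peek()
        return list(zip(self.label_names, [l.shape for l in batch.label]))
```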

@pluskid (Contributor, Author) commented Nov 5, 2015

Yes, getpad is not being removed; it now lives in the batch object, which is returned in each iteration and contains the data, label, and other information (including the pad) for that mini-batch.
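A minimal sketch of such a batch object; the field names mirror the discussion (data, label, pad), the rest is assumed:

```python
class DataBatch(object):
    """One mini-batch, as returned in each iteration."""

    def __init__(self, data, label, pad=0):
        self.data = data    # list of arrays, one per named data input
        self.label = label  # list of arrays, one per named label
        self.pad = pad      # number of padded instances in this batch
```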

@tqchen (Member) commented Nov 5, 2015

Sounds good to me

@piiswrong (Contributor)

Since we now support multiple data/label, the metrics interface probably should also be updated?

@pluskid (Contributor, Author) commented Nov 6, 2015

Yes, that is one aspect we also need to consider a bit. Currently I just replaced all existing metrics with ones that naively deal with pred and label lists of the same length, evaluating them pair by pair and accumulating. This should be backward compatible with existing behavior, but in the long term we need to think about what the best interface would be.
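A hedged sketch of that naive scheme, written as an accuracy metric over NumPy arrays (not the PR's exact code):

```python
import numpy as np

class Accuracy(object):
    """Walks equal-length lists of predictions and labels, accumulating."""

    def __init__(self):
        self.sum_metric = 0.0
        self.num_inst = 0

    def update(self, labels, preds):
        assert len(labels) == len(preds)
        for label, pred in zip(labels, preds):
            pred_cls = np.argmax(pred, axis=1)  # predicted class per row
            self.sum_metric += (pred_cls == label.astype('int64')).sum()
            self.num_inst += label.shape[0]

    def get(self):
        return 'accuracy', self.sum_metric / max(self.num_inst, 1)
```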

@tqchen (Member) commented Nov 6, 2015

Yes, I guess that should be addressed in a separate issue; let us first aim to get this merged.

@tqchen tqchen mentioned this pull request Nov 6, 2015
@pluskid (Contributor, Author) commented Nov 7, 2015

Please review and merge. I added a simple cache for the first batch to avoid calling reset. The discussed simplification of the base iter interface can wait; otherwise, accumulating conflicts with the base branch could get quite complicated. I hope I did not overwrite anything during merging.


An inline review comment (Member) on this code:

```python
# reset the training data if reach the end of train_data, we only
# need to deal with the following two situations:
# 1. epoch_size is None:
```

This logic was recently added by mu for distributed training, so it may be needed here. The idea is to run epoch_size batches on each machine, to avoid different numbers of epochs across machines.

@tqchen (Member) commented Nov 7, 2015

I have one comment on supporting the epoch_size parameter, which was added by @mli for distributed training. We may need to add that back.

It is used when different machines have different numbers of batches in their local data, and we need to make sure each machine runs the same number of batches per epoch in the distributed setting. Otherwise, one machine will run an additional batch and hang, because the other machines did not send their statistics over in the BSP setting.
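A hedged sketch of the epoch_size behavior described above (train_batch and the iterator protocol are placeholders): each worker runs exactly epoch_size batches per epoch, wrapping the iterator around when local data runs out, so no machine stalls waiting for the others.

```python
def run_epoch(train_data, train_batch, epoch_size=None):
    """Run one epoch: a full pass if epoch_size is None, otherwise
    exactly epoch_size batches, resetting train_data mid-epoch if needed."""
    nbatch = 0
    data_iter = iter(train_data)
    while epoch_size is None or nbatch < epoch_size:
        try:
            batch = next(data_iter)
        except StopIteration:
            if epoch_size is None:
                break                  # one pass over the data == one epoch
            train_data.reset()         # wrap around to keep batch counts equal
            data_iter = iter(train_data)
            batch = next(data_iter)
        train_batch(batch)             # placeholder for the actual training step
        nbatch += 1
```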

@tqchen (Member) commented Nov 7, 2015

Other parts LGTM

@pluskid (Contributor, Author) commented Nov 7, 2015

I tried to recover that part here.

tqchen added a commit that referenced this pull request Nov 7, 2015
Generalize IO API to support any number of data / labels
@tqchen tqchen merged commit a1a6f65 into apache:master Nov 7, 2015
@tqchen (Member) commented Nov 7, 2015

cool, this is merged

@pluskid deleted the multi-data branch on November 9, 2015