
Generalize IO API to support any number of data / labels #468

Merged: 30 commits from multi-data into apache:master, Nov 7, 2015

Conversation

@pluskid (Contributor) commented Nov 2, 2015

  • Modify MXDataIter interface
  • Modify Model train to use extended API
  • Modify MNIST example
  • Modify Model predict to use extended API
  • Modify array iterator
  • Lint and clean up
  • Fix unit-test

@piiswrong (Contributor)

You might want to merge in my PR here: #456

@pluskid (Contributor, Author) commented Nov 2, 2015

@piiswrong Yes, please merge that PR and I will rebase later.

@piiswrong (Contributor)

Do we need to differentiate between data and label?

@pluskid (Contributor, Author) commented Nov 3, 2015

The only difference is that during training both data and label are copied into the network, while during prediction only data is copied. During evaluation on the validation set, data and label are copied into different places. I also found it a bit redundant, but I did not have a better way to handle this in the more general case.

@pluskid changed the title from "[WIP] Generalize IO API to support any number of data / labels" to "Generalize IO API to support any number of data / labels" on Nov 4, 2015
@piiswrong (Contributor)

I see.
What's the change to the interface?
Could you summarize it and put it somewhere, to make things easier for people working with the old interface?

@pluskid (Contributor, Author) commented Nov 4, 2015

@piiswrong The main change is:

  • Originally, we used the postfixes _data and _label to detect which input is data and which is label, and only supported one data and one label as input.
  • Now there is no special naming convention. The user specifies the names of the data in the data iterator, and those names should match the corresponding names in the symbol (by default data and softmax_label, to keep maximum backward compatibility). Multiple data/label arrays can be used if the data iterator provides them; see the sketch below.

I will make a more detailed document soon.
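For concreteness, here is a minimal sketch of the naming convention, written against the mx.io.NDArrayIter API from later MXNet releases (dict-of-named-arrays constructor); the exact constructor in this PR may differ:

```python
import numpy as np
import mxnet as mx

# Build a symbol whose input names are 'data' and 'softmax_label'
# (the defaults mentioned above).
data = mx.symbol.Variable('data')
fc = mx.symbol.FullyConnected(data=data, num_hidden=10)
net = mx.symbol.SoftmaxOutput(data=fc, name='softmax')  # label name: 'softmax_label'

X = np.random.rand(100, 20).astype('float32')
y = np.random.randint(0, 10, size=(100,)).astype('float32')

# The iterator declares its input names; they must match the symbol's
# argument names. With dicts, any number of data/label arrays works.
train_iter = mx.io.NDArrayIter(data={'data': X},
                               label={'softmax_label': y},
                               batch_size=10)
```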

@pluskid (Contributor, Author) commented Nov 4, 2015

@tqchen Hmm, I did not notice that. I agree this is a potential problem with MXDataIter. I'm not familiar with the underlying data loading and prefetching implementation, but I think there might be a simple way (without making the code too messy) to wire up the Python iterator API so that the first mini-batch can be fetched to check the data shape and batch size without an explicit reset.

@tqchen (Member) commented Nov 4, 2015

Yes, that will need some special treatment in the Python API, i.e. have a cached instance that is returned by the call to next instead of calling MXIterNext, and support a function like peek to get the instance without advancing the iterator (in the implementation, put it into the cached instance).

@piiswrong (Contributor)

Thanks for the summary!

With regard to the shape check, probably implement a peek method?

@tqchen (Member) commented Nov 4, 2015

I guess a peek method in the Python API might be a good idea. We may need to implement peek and next in the base class, and ask the child class to implement a special _next method to fetch the data.
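A minimal sketch of that scheme (illustrative, not the PR's exact code): the base class implements next/peek on top of a child-supplied _next, caching one batch so peek does not advance the iterator.

```python
class DataIter(object):
    """Sketch of a base iterator with a one-batch cache for peek()."""

    def __init__(self):
        self._cache = None

    def _next(self):
        # Child classes fetch the next batch here; return None at the end.
        raise NotImplementedError

    def peek(self):
        # Look at the upcoming batch without consuming it.
        if self._cache is None:
            self._cache = self._next()
        return self._cache

    def next(self):
        # Serve the cached batch first if peek() already fetched one.
        if self._cache is not None:
            batch, self._cache = self._cache, None
        else:
            batch = self._next()
        if batch is None:
            raise StopIteration
        return batch

    __next__ = next  # Python 3 compatibility

    def __iter__(self):
        return self
```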

@tqchen (Member) commented Nov 4, 2015

I think the current code is ready to be merged once we get the peek.

@pluskid (Contributor, Author) commented Nov 4, 2015

OK, I should be able to figure that out by the weekend.

@pluskid (Contributor, Author) commented Nov 4, 2015

Actually, I would like to propose simplifying the current data iter base class. The idea is to have as few methods as possible, so that users can write a customized data iter very easily, say by just writing a for loop with the Python generator API. Common tools, like shuffling within a large cache of in-memory pre-loaded mini-batches, could be provided so that users do not need to implement the same thing over and over again. But this requires more effort and discussion about the interface.

@tqchen (Member) commented Nov 4, 2015

As long as we have a fixed set of APIs, it is OK. For example, we could have a subclass Iter that takes a generator, with default implementations for all the other functions.
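Building on the DataIter sketch above, such a generator-backed subclass might look like this (GeneratorIter is a hypothetical name, not part of MXNet):

```python
class GeneratorIter(DataIter):
    """Wraps a user-supplied generator factory in the iterator interface."""

    def __init__(self, gen_fn):
        super(GeneratorIter, self).__init__()
        self._gen_fn = gen_fn        # callable returning a fresh generator
        self._gen = gen_fn()

    def _next(self):
        # Translate the generator protocol into the base class's
        # "return None at the end" convention.
        try:
            return next(self._gen)
        except StopIteration:
            return None

    def reset(self):
        # Start a new epoch with a fresh generator; drop any cached batch.
        self._gen = self._gen_fn()
        self._cache = None

# Usage: any generator of batches will do, e.g.
#   it = GeneratorIter(lambda: (make_batch(i) for i in range(100)))
# where make_batch is whatever produces one mini-batch.
```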

@pluskid (Contributor, Author) commented Nov 5, 2015

@tqchen Yes, but writing default impls, for example for iter_next or getpad, on the iterator object is not easy if all we assume we have from the end user is a generator. My understanding is that those functions exist to make it easier to implement the wrapper MXDataIter (or maybe even the array data iter). But there is no need to require all data iterators to implement those methods, because:

  • Those methods (e.g. iter_next, getdata, etc.) are never explicitly called, since all usage of a data iterator looks like for batch in data_iter: ... -- therefore, the only required interface is the Python iterator interface, which the user can implement explicitly by writing __iter__ and next, or by just wrapping a generator. But I might be wrong; if there are other potential usages of data iterators that require a richer interface, please let me know.
  • It is not true that under different implementation paradigms it is equally easy to implement all the interface functions. For example, suppose I simply provide a generator that iterates over all the data. In order to implement iter_next, I would need to call the underlying generator, cache the result, and catch the end-of-iteration exception, etc. Similarly, in order to implement getdata on the iterator object, I would need to explicitly fetch data from the underlying iterator and store it somewhere.

@tqchen (Member) commented Nov 5, 2015

I get your point; yes, I agree that iter and next are the two things we need. The rest of the interface is helpful but not required:

  • getpad, which was used to report the number of padded instances (for the predictor), but as you did in the update, it can readily be incorporated into the data batch.

We might still need provide_data and provide_label, e.g. as lists describing each input. In that sense, I agree that peek should be implemented in the subclass, to support provide_data and provide_label for MXDataIter.
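One hedged way the pieces could fit together, again building on the DataIter sketch above (the data_names/label_names attributes and the batch fields are assumptions for illustration, not this PR's exact code):

```python
class NamedDataIter(DataIter):
    """Derives provide_data/provide_label from peek(), with no reset()."""

    def __init__(self, data_names, label_names):
        super(NamedDataIter, self).__init__()
        self.data_names = data_names
        self.label_names = label_names

    @property
    def provide_data(self):
        # Peek at the first batch to learn shapes without consuming it.
        batch = self.peek()
        return list(zip(self.data_names, [d.shape for d in batch.data]))

    @property
    def provide_label(self):
        batch = self.peek()
        return list(zip(self.label_names, [l.shape for l in batch.label]))
```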

@pluskid (Contributor, Author) commented Nov 5, 2015

Yes, getpad is not being removed; it now lives in the batch object, which is returned in each iteration and contains the data, label, and other information (including the pad) for that mini-batch.
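A minimal sketch of such a batch object; the field names mirror the discussion (data, label, pad), the rest is assumed:

```python
class DataBatch(object):
    """One mini-batch, as returned in each iteration."""

    def __init__(self, data, label, pad=0):
        self.data = data    # list of arrays, one per named data input
        self.label = label  # list of arrays, one per named label
        self.pad = pad      # number of padded instances in this batch
```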

@tqchen (Member) commented Nov 5, 2015

Sounds good to me

@piiswrong (Contributor)

Since we now support multiple data/label, the metrics interface probably should also be updated?

@pluskid (Contributor, Author) commented Nov 6, 2015

Yes, that is one aspect we also need to consider a bit. Currently I just replaced all existing metrics with ones that naively deal with pred and label lists of the same length, evaluating them pair by pair and accumulating. This should be backward compatible with existing behavior, but in the long term we need to think about what the best interface would be.
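A hedged sketch of that naive scheme, written as an accuracy metric over NumPy arrays (not the PR's exact code):

```python
import numpy as np

class Accuracy(object):
    """Walks equal-length lists of predictions and labels, accumulating."""

    def __init__(self):
        self.sum_metric = 0.0
        self.num_inst = 0

    def update(self, labels, preds):
        assert len(labels) == len(preds)
        for label, pred in zip(labels, preds):
            pred_cls = np.argmax(pred, axis=1)  # predicted class per row
            self.sum_metric += (pred_cls == label.astype('int64')).sum()
            self.num_inst += label.shape[0]

    def get(self):
        return 'accuracy', self.sum_metric / max(self.num_inst, 1)
```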

@tqchen (Member) commented Nov 6, 2015

Yes, I guess that should be addressed in a separate issue; let us first aim to get this merged.

@tqchen tqchen mentioned this pull request Nov 6, 2015
@pluskid (Contributor, Author) commented Nov 7, 2015

Please review and merge. I added a simple cache for the first batch to avoid calling reset. The discussed simplification of the base iter interface can wait; otherwise, accumulating conflicts with the base branch could get quite complicated. I hope I did not overwrite anything during merging.


An inline review comment (Member) on this code:

```python
# reset the training data if reach the end of train_data, we only
# need to deal with the following two situations:
# 1. epoch_size is None:
```

This logic was recently added by mu for distributed training, so it may be needed here. The idea is to run epoch_size batches on each machine, to avoid different numbers of epochs across machines.

@tqchen (Member) commented Nov 7, 2015

I have one comment on supporting the epoch_size parameter, which was added by @mli for distributed training. We may need to add that back.

It is used when different machines have different numbers of batches in their local data, and we need to make sure each machine runs the same number of batches per epoch in the distributed setting. Otherwise, one machine will run an additional batch and hang, because the other machines did not send their statistics over in the BSP setting.
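A hedged sketch of the epoch_size behavior described above (train_batch and the iterator protocol are placeholders): each worker runs exactly epoch_size batches per epoch, wrapping the iterator around when local data runs out, so no machine stalls waiting for the others.

```python
def run_epoch(train_data, train_batch, epoch_size=None):
    """Run one epoch: a full pass if epoch_size is None, otherwise
    exactly epoch_size batches, resetting train_data mid-epoch if needed."""
    nbatch = 0
    data_iter = iter(train_data)
    while epoch_size is None or nbatch < epoch_size:
        try:
            batch = next(data_iter)
        except StopIteration:
            if epoch_size is None:
                break                  # one pass over the data == one epoch
            train_data.reset()         # wrap around to keep batch counts equal
            data_iter = iter(train_data)
            batch = next(data_iter)
        train_batch(batch)             # placeholder for the actual training step
        nbatch += 1
```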

@tqchen (Member) commented Nov 7, 2015

Other parts LGTM

@pluskid (Contributor, Author) commented Nov 7, 2015

I tried to recover that part here.

tqchen added a commit that referenced this pull request Nov 7, 2015
Generalize IO API to support any number of data / labels
@tqchen tqchen merged commit a1a6f65 into apache:master Nov 7, 2015
@tqchen (Member) commented Nov 7, 2015

cool, this is merged

@pluskid deleted the multi-data branch on November 9, 2015