API proposal: container or data class #2056

twiecki · 2017-04-19T10:07:23Z

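Currently it's a bit clunky (but at least possible) to swap out data in a model. E.g.:

import numpy as np
import pymc3 as pm
import theano

# Wrap the data in NumPy arrays so that theano.shared creates tensor variables.
x = theano.shared(np.array([1., 2., 3.]))
y = theano.shared(np.array([1., 2., 3.]))
with pm.Model() as model:
    beta = pm.Normal('beta', 0, 1)
    obs = pm.Normal('obs', x * beta, 1, observed=y)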

Then, if I want to predict on new data, I have to call:

x.set_value([4, 5, 6])

The fact that the model has no idea of the data is a bit problematic for cases where we want to automate this process, e.g. when predicting or running ppc on hold-out data.

I'm proposing a pm.Data container (better ideas for names welcome) that would work like this:

with pm.Model() as model:
    x = pm.Data('x', [1, 2, 3])
    y = pm.Data('y', [1, 2, 3])
    beta = pm.Normal('beta', 0, 1)
    obs = pm.Normal('obs', x * beta, 1, observed=y)

Then, to predict on new data, I would call:

model.replace(x=[4,5,6], y=[4,5,6])

or, with a nicer API:

predictions = model.predict(trace, x=[4,5,6])

and predictions would be a dict like {'y': [[4, 5, 6], ...]}

What is pm.Data? Just a theano.shared that is known to the model.
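
A minimal pure-Python sketch of what "known to the model" could mean, assuming a context-manager model that keeps a registry of named data containers. The `Model`, `Data`, and `replace` names below are hypothetical stand-ins that mirror the proposal; none of this is PyMC3 or Theano code.

```python
class Model:
    """Toy model that tracks the data containers created inside its context."""
    _stack = []  # currently active models, innermost last

    def __init__(self):
        self.data = {}

    def __enter__(self):
        Model._stack.append(self)
        return self

    def __exit__(self, *exc):
        Model._stack.pop()
        return False

    @classmethod
    def current(cls):
        if not cls._stack:
            raise RuntimeError("Data must be created inside a `with Model()` block")
        return cls._stack[-1]

    def replace(self, **new_values):
        """Swap the values of registered containers by name."""
        for name, value in new_values.items():
            if name not in self.data:
                raise KeyError(f"unknown data container: {name!r}")
            self.data[name].set_value(value)


class Data:
    """Toy stand-in for a shared variable that registers itself with the model."""

    def __init__(self, name, value):
        self.name = name
        self._value = list(value)
        Model.current().data[name] = self

    def get_value(self):
        return list(self._value)

    def set_value(self, value):
        self._value = list(value)


with Model() as model:
    x = Data('x', [1, 2, 3])
    y = Data('y', [1, 2, 3])

model.replace(x=[4, 5, 6])
print(x.get_value())       # [4, 5, 6]
print(sorted(model.data))  # ['x', 'y']
```

Because the registry lives on the model, helper functions can look up inputs by name instead of requiring the user to keep references to every shared variable.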

The model is now aware of its inputs and outputs. For example, if the model is a GLM, we could very easily have an API that just plots the PPC over a range, e.g. with model: pm.plot_posterior_glm(trace, eval={'x': np.linspace(-3, 3, 100)}). Behind the scenes it would replace the value, call ppc, and plot the result.

Another thing I haven't thought through is whether this can also help with mini-batching by making that API nicer. Maybe @ferrine has some thoughts on this.


ferrine · 2017-04-19T11:05:32Z

There are several problems.

As I remember, sample_ppc relies on the distribution shape, which is currently fixed. So it will be (and already is) confusing for users that they can only predict on data with the same shape. Some refactoring is needed; what is the progress of "Proposal: Dist shape refactor" #1125?
pm.Data should be independent of Model, and some more options should be available that minimize data transfers, which are costly on GPU:

data_gen = pm.Data(numpy_array, in_memory=10000, minibatch=500, memory_update=custom_generator)
# creates a shared variable of size 10000 and randomly slices 500 samples from it
# also has data_gen.callback with signature `(*_)`
# the callback updates the in_memory storage
# the model collects those callbacks and creates a single callback `minibatch_update`
# minibatch_update should be called on demand; the recommended delay is 10000/500
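
The buffering scheme described in the comments above can be sketched in plain Python. This is only an illustration under stated assumptions: `MinibatchBuffer` is a hypothetical class, the refresh policy (a fresh random subset of the source) is a guess at what `memory_update` might do, and no GPU transfer is modelled.

```python
import random


class MinibatchBuffer:
    """Keep `in_memory` rows resident and serve random minibatches from them."""

    def __init__(self, source, in_memory, minibatch, seed=None):
        self.source = list(source)
        self.in_memory = in_memory
        self.minibatch = minibatch
        self.rng = random.Random(seed)
        self.buffer = []
        self.callback()  # fill the resident buffer once up front

    def callback(self, *_):
        """Refresh the resident buffer with a new random subset of the source."""
        k = min(self.in_memory, len(self.source))
        self.buffer = self.rng.sample(self.source, k)

    def sample(self):
        """Draw one random minibatch from the resident buffer."""
        k = min(self.minibatch, len(self.buffer))
        return self.rng.sample(self.buffer, k)


data_gen = MinibatchBuffer(range(100_000), in_memory=10_000, minibatch=500, seed=0)
refresh_every = data_gen.in_memory // data_gen.minibatch  # 10000 / 500 = 20

for step in range(1, 61):
    batch = data_gen.sample()
    if step % refresh_every == 0:
        data_gen.callback()  # the expensive transfer happens only here
```

Refreshing every `in_memory / minibatch` steps means the number of rows drawn between refreshes roughly equals the buffer size, which matches the delay suggested in the comments.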

fonnesbeck · 2017-04-19T13:05:30Z

It would be nice if our PyMC nodes were able to infer shape from the data, and change when data are swapped out.

Not sure why we would want the Data object to be independent of the model. I thought that would be part of the point of having a class. What's your thinking there @ferrine ?

I like the Data name.

springcoil · 2017-04-19T13:11:09Z

Hi all, Agreed with what's been said so far from @fonnesbeck and @twiecki. I like the whole data class - I think it's intuitive and is what we do.


twiecki · 2017-04-19T14:03:57Z

Agree it should be part of the model, that's the point.
@fonnesbeck Currently you can change the dimensionality (at least add or remove data points along an existing dimension) using theano.shared.

fonnesbeck · 2017-04-19T14:09:38Z

But we don't currently have consistent shape inference. For example, try parameterizing a MvNormal without specifying shape.

ferrine · 2017-04-19T14:20:09Z

"It would be nice if our PyMC nodes were able to infer shape from the data, and change when data are swapped out."

How can we track shape changes through arbitrary theano operations? I think it's like a dream.

data = pm.Data(...)
data = data**3

fonnesbeck · 2017-04-19T14:24:10Z

Well, I did say "it would be nice" ...

But if we have a robust Data class, we could at least deal with the most common operations, no?

twiecki · 2017-04-19T14:28:43Z

@ferrine That's possible if Data inherited from theano.shared (or provided the API).
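
A toy sketch of the idea, assuming shapes are computed lazily from the underlying container rather than stored at construction time. `Shared` and `Elementwise` are hypothetical classes written for illustration and are not Theano types; only elementwise operations are covered.

```python
class Shared:
    """Toy mutable container; a stand-in for a shared variable."""

    def __init__(self, value):
        self.set_value(value)

    def set_value(self, value):
        self._value = list(value)

    def eval(self):
        return list(self._value)

    @property
    def shape(self):
        # Computed on access, so it always reflects the current value.
        return (len(self._value),)

    def __pow__(self, exponent):
        return Elementwise(self, lambda v: v ** exponent)


class Elementwise:
    """Lazy elementwise expression built on top of another node."""

    def __init__(self, parent, fn):
        self.parent = parent
        self.fn = fn

    def eval(self):
        return [self.fn(v) for v in self.parent.eval()]

    @property
    def shape(self):
        # Elementwise operations preserve the parent's shape.
        return self.parent.shape

    def __pow__(self, exponent):
        return Elementwise(self, lambda v: v ** exponent)


data = Shared([1, 2, 3])
cubed = data ** 3
print(cubed.shape, cubed.eval())  # (3,) [1, 8, 27]

data.set_value([1, 2, 3, 4, 5])
print(cubed.shape, cubed.eval())  # (5,) [1, 8, 27, 64, 125]
```

Reductions, indexing, and reshapes would each need their own shape rule, which is where tracking shape through arbitrary operations becomes hard.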

junpenglao · 2017-04-19T14:58:46Z

Or with a batchsize arg indicating which dimension is flexible?
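
One way to read this suggestion, sketched in plain Python: new data is accepted if it matches the declared shape on every axis except the one marked as flexible. The `check_new_shape` function and its `flexible_axis` parameter are hypothetical names, not part of any proposed API.

```python
def check_new_shape(declared, new, flexible_axis):
    """Return True if `new` matches `declared` on every axis except `flexible_axis`."""
    if len(declared) != len(new):
        return False
    return all(
        d == n
        for axis, (d, n) in enumerate(zip(declared, new))
        if axis != flexible_axis
    )


print(check_new_shape((100, 3), (250, 3), flexible_axis=0))  # True
print(check_new_shape((100, 3), (250, 4), flexible_axis=0))  # False
```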

jmloyola · 2019-02-21T21:16:56Z

I would like to work on this issue.
I'm interested in participating in GSoC this year and I will use this opportunity to start learning the codebase of PyMC3.
Reading the discussion here, it seems that there might be some details to be considered. Do you think this is a good first issue?

Which of these is the solution you're looking for?

Replace the use of theano.shared with a new pm.Data container. The user will only have to swap theano.shared for pm.Data. To implement this, it only requires that pm.Data is a theano.shared variable.
Remove the use of theano.shared without creating a new Data container. The user will create models as usual, but if she wants to predict on new data she can use a new API: predictions = model.predict(trace, x=[4,5,6]). I think this requires more back-end changes but is a lot more user-friendly.

twiecki · 2019-02-22T12:27:45Z

Sounds great @jmloyola! It's the first: create a new data container (that probably inherits from theano.shared) that registers itself to the model to allow API as outlined above.

jmloyola · 2019-03-25T17:37:29Z

This issue can be closed. 😃

springcoil added the enhancements label Apr 19, 2017

twiecki added the beginner friendly label Apr 20, 2017

ferrine mentioned this issue May 10, 2017

boost minibatches #2171

Merged

4 tasks

twiecki added the good_first_issue label Jan 17, 2018

jmloyola mentioned this issue Mar 1, 2019

Add data container and pm.set_data #3389

Merged

ColCarroll closed this as completed Mar 25, 2019
