API proposal: container or data class #2056

Closed
twiecki opened this issue Apr 19, 2017 · 12 comments

Comments

@twiecki
Member

twiecki commented Apr 19, 2017

Currently it's a bit clunky (but at least possible) to change out data in a model. E.g.:

import numpy as np
import pymc3 as pm
import theano

x = theano.shared(np.array([1., 2., 3.]))
y = theano.shared(np.array([1., 2., 3.]))
with pm.Model() as model:
    beta = pm.Normal('beta', 0, 1)
    obs = pm.Normal('obs', x * beta, 1, observed=y)

Then, if I want to predict on new data, I have to call:

x.set_value([4, 5, 6])

The model has no knowledge of its data, which is a problem when we want to automate this process, e.g. when predicting or running ppc on hold-out data.

I'm proposing a pm.Data container (better ideas for names welcome) that would work like this:

with pm.Model() as model:
    x = pm.Data('x', [1, 2, 3])
    y = pm.Data('y', [1, 2, 3])
    beta = pm.Normal('beta', 0, 1)
    obs = pm.Normal('obs', x * beta, 1, observed=y)

Then, if I want to predict on new data, I would call:

model.replace(x=[4,5,6], y=[4,5,6])

or, with a nicer API:

predictions = model.predict(trace, x=[4,5,6])

and predictions would be a dict like {'y': [[4, 5, 6], ...]}

What is pm.Data? Just a theano.shared that is known to the model.
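
A minimal sketch of what "known to the model" could mean, in plain Python with no Theano dependency. The ToyModel and ToyData classes and the replace method are hypothetical stand-ins for illustration, not PyMC3 code: a container created inside a with block registers itself with the active model by name, so the model can later swap values without the user holding on to the variable.

```python
class ToyModel:
    """Hypothetical stand-in for pm.Model: tracks named data containers."""

    _stack = []  # active models, innermost last

    def __init__(self):
        self.data = {}

    def __enter__(self):
        ToyModel._stack.append(self)
        return self

    def __exit__(self, *exc):
        ToyModel._stack.pop()
        return False

    @classmethod
    def current(cls):
        if not cls._stack:
            raise RuntimeError("ToyData must be created inside a model context")
        return cls._stack[-1]

    def replace(self, **new_values):
        """Swap the values of registered containers by name."""
        for name, value in new_values.items():
            if name not in self.data:
                raise KeyError(f"no data container named {name!r}")
            self.data[name].set_value(value)


class ToyData:
    """Hypothetical stand-in for pm.Data: a mutable value that registers itself."""

    def __init__(self, name, value):
        model = ToyModel.current()
        if name in model.data:
            raise ValueError(f"data container {name!r} already exists")
        self.name = name
        self._value = list(value)
        model.data[name] = self

    def get_value(self):
        return list(self._value)

    def set_value(self, value):
        self._value = list(value)


with ToyModel() as model:
    x = ToyData("x", [1, 2, 3])
    y = ToyData("y", [1, 2, 3])

# The model finds the containers by name; no reference to x or y is needed.
model.replace(x=[4, 5, 6], y=[4, 5, 6])
```

In this toy, the registration happens in the container's constructor, which mirrors how random variables attach themselves to the enclosing model context.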

The model is now aware of its inputs and outputs. For example, if the model is a GLM, we could very easily have an API that just plots the PPC over a range, e.g. with model: pm.plot_posterior_glm(trace, eval={'x': np.linspace(-3, 3, 100)}). Behind the scenes it would replace the value, call ppc, and plot the result.
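
The "replace the value, call ppc" step could be sketched roughly as follows. This is a toy in plain Python: the swapped and predict helpers, the dict-based data store, and the list-of-dicts trace are all assumptions made for illustration, and the predictive mean x * beta simply mirrors the example model in this issue. Restoring the original values afterwards is one possible design choice, not something the proposal specifies.

```python
from contextlib import contextmanager


@contextmanager
def swapped(data, **new_values):
    """Temporarily replace entries of `data`, restoring them on exit."""
    unknown = set(new_values) - set(data)
    if unknown:
        raise KeyError(f"unknown data names: {sorted(unknown)}")
    old = {name: data[name] for name in new_values}
    data.update(new_values)
    try:
        yield data
    finally:
        data.update(old)


def predict(trace, data, **new_values):
    """Evaluate the mean of `obs` (x * beta) for every posterior draw."""
    with swapped(data, **new_values):
        return {
            "obs": [[xi * draw["beta"] for xi in data["x"]] for draw in trace]
        }


data = {"x": [1, 2, 3]}
trace = [{"beta": 0.5}, {"beta": 2.0}]  # hypothetical posterior draws
predictions = predict(trace, data, x=[4, 5, 6])
```

After the call, data["x"] holds the training values again, so a later fit or PPC on the original data is unaffected.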

Something I haven't thought through is whether this can also help with mini-batching and make that API nicer. Maybe @ferrine has some thoughts on this.

@ferrine
Member

ferrine commented Apr 19, 2017

There are several problems.

  • As I remember, sample_ppc relies on the distribution shape, which is currently fixed. So it will be (and already is) confusing for the user to predict on data with the same shape. Some refactoring is needed. What is the progress of Proposal: Dist shape refactor #1125?
  • pm.Data should be independent of Model, and more options should be available
    to minimize data transfers, which are costly on the GPU:
data_gen = pm.Data(numpy_array, in_memory=10000, minibatch=500, memory_update=custom_generator)
# creates shared with size 10000 and slices randomly for 500 samples
# also has data_gen.callback with signature `(*_)`
# callback updates in_memory storage
# the model collects these callbacks and creates a single callback `minibatch_update`
# minibatch_update should be called on demand; the recommended delay is 10000/500
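
The buffering scheme described in these comments might look roughly like this in plain Python. It is only a sketch under assumptions: BufferedMinibatch, chunks, and sample are hypothetical names, a Python list stands in for the GPU-resident shared variable, and the refresh is triggered automatically every in_memory // minibatch draws rather than through a model-level minibatch_update callback.

```python
import random


class BufferedMinibatch:
    """Toy in-memory buffer that serves random minibatches."""

    def __init__(self, source, in_memory, minibatch, seed=None):
        if minibatch > in_memory:
            raise ValueError("minibatch must not exceed in_memory")
        self.source = source        # iterator yielding chunks of rows
        self.in_memory = in_memory
        self.minibatch = minibatch
        self.rng = random.Random(seed)
        self.buffer = []
        self.draws_since_update = 0
        self.callback()             # fill the buffer once up front

    def callback(self, *_):
        """Refill the buffer from the source (the costly transfer)."""
        self.buffer = list(next(self.source))
        self.draws_since_update = 0

    def sample(self):
        """Return a random minibatch, refreshing the buffer when it is used up."""
        if self.draws_since_update >= self.in_memory // self.minibatch:
            self.callback()
        self.draws_since_update += 1
        return self.rng.sample(self.buffer, self.minibatch)


def chunks(rows, size):
    """Cycle through `rows` forever in consecutive chunks of `size`."""
    start = 0
    while True:
        yield [rows[(start + i) % len(rows)] for i in range(size)]
        start = (start + size) % len(rows)


gen = BufferedMinibatch(chunks(list(range(100)), 20), in_memory=20, minibatch=5, seed=0)
batches = [gen.sample() for _ in range(8)]
```

With in_memory=20 and minibatch=5, the first four batches are drawn from rows 0 to 19, after which the buffer is refilled with rows 20 to 39; only the refill touches the full data set.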

@fonnesbeck
Member

It would be nice if our PyMC nodes were able to infer shape from the data, and change when data are swapped out.

Not sure why we would want the Data object to be independent of the model. I thought that would be part of the point of having a class. What's your thinking there @ferrine ?

I like the Data name.

@springcoil
Contributor

springcoil commented Apr 19, 2017 via email

@twiecki
Member Author

twiecki commented Apr 19, 2017

Agree it should be part of the model, that's the point.
@fonnesbeck Currently you can change the dimensionality (at least add/remove data points in an existing dimension) using theano.shared.

@fonnesbeck
Member

But we don't currently have consistent shape inference. For example, try parameterizing a MvNormal without specifying shape.

@ferrine
Member

ferrine commented Apr 19, 2017

It would be nice if our PyMC nodes were able to infer shape from the data, and change when data are swapped out.

How can we track shape changes through arbitrary theano operations? I think it's like a dream.

data = pm.Data(...)
data = data**3

@fonnesbeck
Member

Well, I did say "it would be nice" ...

But if we have a robust Data class, we could deal with at least the most common operations, no?
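
One way to picture handling "the most common operations" is a toy in which shapes are never stored but recomputed from the data container on demand, with a shape rule per supported operation. This is a sketch only: Lazy and ShapedData are hypothetical classes that track shapes and nothing else, and it says nothing about how Theano or PyMC3 handle shapes internally.

```python
class Lazy:
    """Toy symbolic node: its shape is recomputed from its inputs on demand."""

    def __init__(self, shape_fn):
        self._shape_fn = shape_fn

    @property
    def shape(self):
        return self._shape_fn()

    def __pow__(self, exponent):
        # Elementwise op with a scalar exponent: the shape is unchanged.
        return Lazy(lambda: self.shape)

    def sum(self, axis):
        # Reduction: drop the reduced axis from the shape.
        return Lazy(
            lambda: tuple(n for i, n in enumerate(self.shape) if i != axis)
        )


class ShapedData(Lazy):
    """Toy data container holding a rectangular list of rows."""

    def __init__(self, rows):
        self.rows = rows
        super().__init__(lambda: (len(self.rows), len(self.rows[0])))

    def set_value(self, rows):
        self.rows = rows


data = ShapedData([[1, 2], [3, 4], [5, 6]])
cubed = data ** 3
shape_before = cubed.shape

data.set_value([[1, 2]])   # swap in data with a different number of rows
shape_after = cubed.shape  # follows the swap without rebuilding `cubed`
```

Operations without a known shape rule would still be out of reach, which is the limitation raised earlier in the thread.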

@twiecki
Member Author

twiecki commented Apr 19, 2017

@ferrine That's possible if Data inherited from theano.shared (or provided the API).

@junpenglao
Member

Or with a batchsize arg indicating which dimension is flexible?
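
As a rough illustration of that idea, a container could validate replacement data against its original shape while allowing one declared axis to vary. The check_new_shape helper and its flexible_axis parameter are hypothetical names used only for this sketch.

```python
def check_new_shape(old_shape, new_shape, flexible_axis=0):
    """Return True if the shapes differ at most along `flexible_axis`."""
    if len(old_shape) != len(new_shape):
        return False
    return all(
        old == new
        for axis, (old, new) in enumerate(zip(old_shape, new_shape))
        if axis != flexible_axis
    )


# 100 training rows with 3 features, then 20 hold-out rows with 3 features.
ok = check_new_shape((100, 3), (20, 3))
# Changing the number of features is rejected when axis 0 is the flexible one.
rejected = check_new_shape((100, 3), (100, 4))
```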

@jmloyola
Contributor

I would like to work on this issue.
I'm interested in participating in GSoC this year and I will use this opportunity to start learning the PyMC3 codebase.
Reading the discussion here, it seems that there might be some details to consider. Do you think this is a good first issue?

Which solution are you looking for?

  • Replace the use of theano.shared with a new pm.Data container. The user will only have to swap theano.shared for pm.Data. Implementing this only requires that pm.Data is a theano.shared variable.
  • Remove the use of theano.shared without creating a new Data container. The user will create models as usual, but if she wants to predict on new data she can use a new API: predictions = model.predict(trace, x=[4,5,6]). I think this requires more back-end changes but is a lot more user-friendly.

@twiecki
Member Author

twiecki commented Feb 22, 2019

Sounds great @jmloyola! It's the first: create a new data container (that probably inherits from theano.shared) that registers itself to the model to allow API as outlined above.

@jmloyola
Contributor

This issue can be closed. 😃
