Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow specification of dims instead of shape #3551

Merged
merged 29 commits into from
Jun 10, 2020

Conversation

aseyboldt
Copy link
Member

@aseyboldt aseyboldt commented Jul 18, 2019

Right now we always have to specify the shapes of random variables as numbers. If we use arviz to analyse the trace we then need to specify the dimension names and coordinates for those variables. It would be much easier if we could specify the dimension names and coordinates while building the model.
This is a WIP implementation of that idea. There are two ways to specify new dimensions and their coordinates:

  • Using a pm.Data variable with a pandas dataset. We can extract the index and columns of this and remember it for later
  • Using the new coords argument to model.

The stochastic volatility model from the getting started notebook could look this this:

# Has a datetime index
returns = pd.read_csv(pm.get_data('SP500.csv'), parse_dates=True, index_col=0)

with pm.Model() as sp500_model:
    # The model remembers the datetime index with the name 'date'
    # Maybe we should add a `store_coords=True` argument, so that
    # it becomes more obvious where the new dimension / coordinate
    # is coming from.
    values = pm.Data('returns', returns['change'], dims='date')
    
    nu = pm.Exponential('nu', 1/10., testval=5.)
    sigma = pm.Exponential('sigma', 1/0.02, testval=.1)

    # We can now figure out the shape of this variable based on the
    # index of the dataset
    s = pm.GaussianRandomWalk('s', sigma=sigma, dims='date')
    volatility_process = pm.Deterministic('volatility_process', pm.math.exp(-2*s)**0.5, dims='date')

    r = pm.StudentT('r', nu=nu, sigma=volatility_process, observed=values, dims='date')

Arviz can then look at model.coords and model.RV_dims so that the coordinates show up in the trace:

>>> sp500_model.coords
{'date': DatetimeIndex(['2008-05-02', '2008-05-05', '2008-05-06', '2008-05-07',
                '2008-05-08', '2008-05-09', '2008-05-12', '2008-05-13',
                '2008-05-14', '2008-05-15',
                ...
                '2009-11-18', '2009-11-19', '2009-11-20', '2009-11-23',
                '2009-11-24', '2009-11-25', '2009-11-27', '2009-11-30',
                '2009-12-01', '2009-12-02'],
               dtype='datetime64[ns]', name='date', length=401, freq=None)}

>>> sp500_model.RV_dims
{'returns': ('date',),
 's': ('date',),
 'volatility_process': ('date',),
 'r': ('date',)}

We can specify completely new coordinates like this:

coords = {
    'a_dim2': pd.Index([6, 7, 8, 9]),
}

with pm.Model(coords=coords) as model:
    pm.Data('foo', pd.Series([1, 2, 3]), dims='a_dim')
    pm.Normal('a', dims=('a_dim', 'a_dim2'))

TODO

  • More testing and documentation. Do we want to use this in the documentation? (eg getting started).
  • Do we need to change the dims keyword in pm.Data to make it more obvious that we are also declaring a new dimension?
  • Add support for model.coords and model.RV_dims to arviz. --> That's WIP Import coords and dims from pymc3 arviz-devs/arviz#757 and will depend on this PR

pymc3/model.py Outdated Show resolved Hide resolved
pymc3/model.py Outdated Show resolved Hide resolved
@fonnesbeck
Copy link
Member

Perhaps change the getting started notebook to use this? Would be nice to have the example handy.

@aseyboldt
Copy link
Member Author

We can also add a pm.TidyData, that deals with translating strings in a dataframe into integer arrays for slicing, and makes sure we still get the right labels in the output.
Screenshot from 2019-07-19 11-47-22

@rpgoldman
Copy link
Contributor

What are the implications of the interaction between this and changing the shape of the model, e.g., by resetting a pm.Data? So, for example, if I change the shape of a set of predictors, and I had a variable whose dims included a set of rows?

@aseyboldt
Copy link
Member Author

@rpgoldman It neither fixes nor breaks (as far as I know) current functionality of changing the data. So basically, it doesn't work. :-(
You can always change the values, but if you change the shape, that might also change the shape of unobserved variables. And a lot of our code assumes that those shapes can't change.
I've been experimenting with fixes for this, but for the PR here I think that is out of scope.

@rpgoldman
Copy link
Contributor

@rpgoldman It neither fixes nor breaks (as far as I know) current functionality of changing the data. So basically, it doesn't work. :-(
You can always change the values, but if you change the shape, that might also change the shape of unobserved variables. And a lot of our code assumes that those shapes can't change.
I've been experimenting with fixes for this, but for the PR here I think that is out of scope.

As my adviser used to say, "we have reduced this to a previously unsolved problem." 😉

I take your point -- this is really a problem with pm.Data in general, not a problem with coordinates and dimensions. Follow-up suggestion: if the coords are a function of a pm.Data, what about making it possible to extract them from the pm.Data? So if the pm.Data has a number of rows, instead of making us separately specify the row indices, how about providing a mechanism for pulling them out of the pm.Data object? My guess is that there is already a way to do this, but it probably requires knowledge of the underlying theano or at least the implementation of pm.Data.

@aseyboldt
Copy link
Member Author

"we have reduced this to a previously unsolved problem."

I like that one 😆

Getting the coords is just df.index and df.columns, but we could definitely add that to pm.Data. I'm not sure what the use-case would be though.
Maybe just pm.Data().coords and pm.Data().dims? The first would return a dict from dim name to df.columns and df.index, the second (df.index.name or kwargs['dims'][0]], df.columns.name or kwargs['dims'][1]).

@rpgoldman
Copy link
Contributor

"we have reduced this to a previously unsolved problem."

I like that one 😆

Getting the coords is just df.index and df.columns, but we could definitely add that to pm.Data. I'm not sure what the use-case would be though.
Maybe just pm.Data().coords and pm.Data().dims? The first would return a dict from dim name to df.columns and df.index, the second (df.index.name or kwargs['dims'][0]], df.columns.name or kwargs['dims'][1]).

pm.Data doesn't always come from a pandas dataframe, though, so if a modeler is writing code that will build a model with pm.Data, they might want to use one dimension of the shape of a numpy array, for example.

@rpgoldman
Copy link
Contributor

We can also add a pm.TidyData, that deals with translating strings in a dataframe into integer arrays for slicing, and makes sure we still get the right labels in the output.

This looks really useful! I have a model where I have done a lot of indexing (building on the ideas that @junpenglao showed in his schizophrenia study example), and I think this would make that sort of thing much easier.

@aseyboldt
Copy link
Member Author

aseyboldt commented Jul 20, 2019 via email

@rpgoldman
Copy link
Contributor

What case study is that? I've been looking for a nice little example that we can use to demonstrate this.

This is the one:

https://colab.research.google.com/github/junpenglao/advance-bayesian-modelling-with-PyMC3/blob/master/Notebooks/Code10%20-%20Schizophrenic_case_study.ipynb

@michaelosthege
Copy link
Member

michaelosthege commented Jun 5, 2020

I just realize that I have a use case for this feature!
Anybody else interested in re-activating this PR?

Oh and the rt.live model could actually use this too. Currently the coords there are tracked as a separate dict. (cc @twiecki @AlexAndorra @canyon289 )

@AlexAndorra
Copy link
Contributor

AlexAndorra commented Jun 5, 2020 via email

@michaelosthege
Copy link
Member

@aseyboldt @AlexAndorra I fixed the conflicts - let's see what Travis says.
Getting this into 3.9.0 would play nicely with the new return_inferencedata option.

@michaelosthege michaelosthege added this to the 3.9.0 milestone Jun 5, 2020
@AlexAndorra
Copy link
Contributor

I don't know about you, but on my browser the tests are still pending, which is weird, since your latest commit is from 21 hours ago 🤔

@michaelosthege
Copy link
Member

@AlexAndorra I triggered a pipeline build (https://travis-ci.org/github/pymc-devs/pymc3/builds/695670690) but the commit IDs don't match. I don't understand it..

But please review and there'll likely be some change that will trigger them again.

Copy link
Contributor

@AlexAndorra AlexAndorra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice, thanks a lot for reviving it @michaelosthege 👏
I left a bunch of comments, as this is my first review on this PR. Tell me if anything is unclear.

Also, this will definitely need some new tests. I guess updating the Data container example NB will give us ideas about those tests 😉

Travis is weirdly still hanging on GH, but when you go to Travis page, it seems like the tests passed, so this looks a like a bug -- let's see if it disappears with the new commits 🤷‍♂️

RELEASE-NOTES.md Outdated Show resolved Hide resolved
RELEASE-NOTES.md Outdated Show resolved Hide resolved
pymc3/data.py Outdated Show resolved Hide resolved
pymc3/data.py Outdated Show resolved Hide resolved
pymc3/data.py Outdated Show resolved Hide resolved
pymc3/data.py Outdated Show resolved Hide resolved
pymc3/model.py Outdated Show resolved Hide resolved
pymc3/model.py Outdated Show resolved Hide resolved
pymc3/model.py Outdated Show resolved Hide resolved
pymc3/model.py Outdated Show resolved Hide resolved
@michaelosthege
Copy link
Member

I just pushed a few commits, including the removal of isidentifier. I changed my opinion about that restriction: As Alex pointed out, one big advantage of the feature is to remove the need for keeping dim names around separately. Restricting the names, would mean that the user still has to drag them around itself.

Some questions are still unresolved:

  • the name logic looks like it's closely connected to how the API shall be used
  • it's unclear to me if we want to keep TidyData around...

As this is a new feature, I'd personally prefer to keep it simple & not introduce 4 different ways how coords/dims can be specified.

@AlexAndorra
Copy link
Contributor

Ok, just pushed my changes 🎉

  • Fixed some stuff in data.py
  • Updated the "Data container" and "Getting started" NBs with the new functionalities and reran them with latest PyMC and ArviZ (ReviewNB won't trigger because the PR was created before we connected the app to our repo I think)

I think the most important is to check the implementation in the code and the new docstrings I wrote for pm.Data.
I won't have time to write the tests today, so feel free to take the lead on that if my changes look good and you have some time 🙂

@michaelosthege
Copy link
Member

GitHub doesn't show it, but the latest Travis build is green 🎉

@AlexAndorra please mark your approval and merge :)

@AlexAndorra
Copy link
Contributor

@michaelosthege I approved but I can't merge yet because Travis didn't tell GH it sucessfully ran 🤦

@AlexAndorra AlexAndorra merged commit 76f3a7b into pymc-devs:master Jun 10, 2020
@StanczakDominik
Copy link

A note to the next unfortunate soul who searches: the updated notebooks are in the repo, but not yet on the website!

Also, thanks for this, this looks like it'll simplify my models a fair bit! 🙇‍♂️

@OlafHaag
Copy link

Is there any comprehensible documentation on how to use it? The Primer on Bayesian Methods for Multilevel Modeling uses the dims argument, but never explains what it actually does. I'm new to pymc3 and would like to understand how to use dims/shape and what happens with the slicing notation on TransformedRV/FreeRV with TensorSharedVariables like in the example:

coords = {"Level": ["Basement", "Floor"], "obs_id": np.arange(floor.size)}
with pm.Model(coords=coords) as pooled_model:
    floor_idx = pm.Data("floor_idx", floor, dims="obs_id")
    a = pm.Normal("a", 0.0, sigma=10.0, dims="Level")

    theta = a[floor_idx]
    sigma = pm.Exponential("sigma", 1.0)

    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")

What happens when dims are a tuple and when to use it?
How would one add a state-level above counties using dims in the primer example?

Or take the temperature example. What if the data was in long format with an unequal number of measurements for each city? And how to add a level, e.g. size (small, medium, big), climate zone, or continents using dims?

@arvids
Copy link

arvids commented Apr 4, 2021

Or take the temperature example. What if the data was in long format with an unequal number of measurements for each city? And how to add a level, e.g. size (small, medium, big), climate zone, or continents using dims?

Did you find a solution for this type of problem?

@michaelosthege
Copy link
Member

@arvids I'm sorry our documentation is still sparse on this topic. However, the dims will become more important with the upcoming PyMC3 v4and we will work on the docs for that specifically.

To answer your question: If the number of measurements for each city differs, that's a typical scenario for imputation. . It should work with dims just fine. Essentially you just put None into the array positions where you don't have data.

Just a note: If your data is very big and very sparse at the same time, you may run into performance limitations because of the imputed values take some memory too. In these cases one can model the data in its long-form and use tensor indexing when constructing deterministics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants