Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Feature: Add a TidyData Class #3953

Closed
AlexAndorra opened this issue Jun 10, 2020 · 17 comments
Closed

New Feature: Add a TidyData Class #3953

AlexAndorra opened this issue Jun 10, 2020 · 17 comments

Comments

@AlexAndorra
Copy link
Contributor

This issue is up for grabs, with the goal of developing a new feature called pm.TidyData. The goal is to deal with automatically translating strings from a dataframe into integer arrays for indexing in models, and then make sure we still get the right labels in the output, for plots and diagnostics.

@aseyboldt implemented a first version and an example, but the implementation is not mature enough, so we decided to not include it in #3551.

example_tidydata

This would be a very useful new feature though, which is why we created this issue -- to remind ourselves of it and in case someone wants to give it a try 😉

@NowanIlfideme
Copy link

Any reason to not support this explicitly via xarray? The coords and dims mapping between pandas and xarray works pretty well (see below), and xarray already natively has coords and dims (+ ArviZ's InferenceData is a pack of xr.Dataset objects as well). This would help workflows that use multidimensional or hierarchical data, for example.

For pandas use cases conversion should be very easy: df.to_xarray() converts a DataFrame to a Dataset with one or more coordinates (e.g. from a MultiIndex; see docs), and the inverse conversion works as well.

@twiecki
Copy link
Member

twiecki commented Aug 3, 2020

@NowanIlfideme That sounds like a good approach to me. Want to do a PR?

@AlexAndorra
Copy link
Contributor Author

AlexAndorra commented Aug 3, 2020

Yeah, feel free to open a PR, basing it on @aseyboldt's previous approach or not. ArviZ and xarray evolved a lot since Adrian's first thought about this, so there may be easier approches now. This issue is just a guide to express the need and explicit the goal

@kc611
Copy link
Contributor

kc611 commented Jan 28, 2021

@twiecki @AlexAndorra Does this mean that the TidyData logic first added in #3551 should be simply re-implemented via xarray instead of numpy.ndarray. Or were there some features expected from the initial implementation which it did not fulfill ?

@AlexAndorra
Copy link
Contributor Author

Hi @kc611 ! Mmmh, I think it's more about extending the current implementation to work with xarray datasets (and thus Pandas dataframe), as shown in the example above. That way, you get all the dims, coords and associated indexes defined at the same time and place, and conveniently so.
Of course, my memory of the details of what we did for this in August is a bit fuzzy, and ArviZ/xarray evolved since then, so you may find that another approach is more appropriate if you take that on.
How does that sound?

@kc611
Copy link
Contributor

kc611 commented Jan 28, 2021

Alright I'll see if I can come up with something.

Should this class also support implicit conversion for pd.DataFrame objects (or should inputs be only limited to xarray Datasets)?

@AlexAndorra
Copy link
Contributor Author

AlexAndorra commented Jan 29, 2021

Awesome, looking forward to it @kc611 !

Should this class also support implicit conversion for pd.DataFrame objects (or should inputs be only limited to xarray Datasets)?

I think inputs could be both pd.Dataframe and xr.Dataset, but the conversion should not be made by PyMC, it should be made by the user

@twiecki
Copy link
Member

twiecki commented Jan 30, 2021

I don't think anyone would input xarrays, we only need DataFrames.

@kc611
Copy link
Contributor

kc611 commented Jan 30, 2021

@twiecki I have a related draft running(implemented using xarray and ). We could do a .to_xrray() if the input is a pd.Dataframe over there. But if not (and we want the support limited to pd.Dataframes ) then I guess I will have to revert back to the original TidyData class implementation. In any case I think we should try to have support for both since the interconversion between two looks quite seamless.

I think it'll be a better idea if you have a look at it (the draft PR) anyway in it's current state.

@NowanIlfideme
Copy link

I don't think anyone would input xarrays, we only need DataFrames.

Disagree completely - I would love to input xarray datasets, in fact for several of my latest models I've had to manually add the values to the model.

Dataframes are good too, but 2d Dataframes are easily converted to xarray, while Multi-Index ones can be trickier (the implementers just needs to make sure that they don't auto-expand the MultiIndex during conversion, ie need to choose the proper converter).

Finally, with arviz integrating with xarray more and more, I think pymc3 doing so would be great as well.

@ricardoV94
Copy link
Member

I am trying to understand the difference between the proposed TidyData and the current Data.

Is there a reason why their functionalities need to be in separate objects? Is TidyData something that should not be changed after model specification? Is TidyData a "2D" version of Data?

@kc611
Copy link
Contributor

kc611 commented Feb 1, 2021

I am trying to understand the difference between the proposed TidyData and the current Data.

For now it's expected to do indexing for string data. Maybe it's functionality can be extended to doing things like cleaning/tidying up data ( as the name suggests ) like filling in missing values automatically (using nearest neighbours) or one-hot encoding for categorical data. I don't know the extent of it's use in PyMC models tho.

Is TidyData something that should not be changed after model specification?

No, I think it'll be a good idea to add that functionality too. That said I think it'll be a better idea to properly discuss the scope and use cases of this new class ( and if a new class is needed at all ) before I continue adding random stuff in my PR :-p

@twiecki
Copy link
Member

twiecki commented Feb 1, 2021

Started looking a bit closer and I'm a bit confused about the API.

Currently, we can specify dims with:

coords = {"date": df_data.index, "city": df_data.columns}
with pm.Model(coords=coords) as model:
    city_offset = pm.Normal("city_offset", mu=0.0, sd=3.0, dims="city")

I think TidyData is supposed to hook into coords, but I'm not sure how as TidyData is defined inside the model whereas coords in the model context.

Anyone have a full grasp on this?

CC @aseyboldt

@ricardoV94
Copy link
Member

ricardoV94 commented Feb 1, 2021

I think TidyData is supposed to hook into coords, but I'm not sure how as TidyData is defined inside the model whereas coords in the model context.

It can simply make use of model.add_coords:

https://github.com/pymc-devs/pymc3/blob/43e757229e7aac66209a7a06efe1a36b9570daa4/pymc3/model.py#L1097

pm.Data can also do this (and is also instanciated only inside the model context): https://github.com/pymc-devs/pymc3/blob/acb326149adffe03fd06aac6515ed58b682f646b/pymc3/data.py#L540-L541

@fonnesbeck
Copy link
Member

This has been dormant for quite a while. Is there still interest in pursuing this, or can it be closed?

@NowanIlfideme
Copy link

I'm interested as a user. Usually I need to wrap the model class into something I dynamically construct, using my own encoding scheme. Having this built into PyMC would make this much more standardized.

@twiecki
Copy link
Member

twiecki commented Nov 18, 2021

@NowanIlfideme Are you interested in giving this a shot? We'd help of course.

@pymc-devs pymc-devs locked and limited conversation to collaborators Jan 10, 2022
@ricardoV94 ricardoV94 converted this issue into discussion #5335 Jan 10, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Projects
None yet
Development

No branches or pull requests

6 participants