New Feature: Add a TidyData Class #3953

AlexAndorra · 2020-06-10T10:17:25Z

This issue is up for grabs. The goal is to develop a new feature called pm.TidyData that automatically translates strings from a dataframe into integer arrays for indexing in models, and then makes sure we still get the right labels in the output, for plots and diagnostics.
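To make the translation step concrete, here is a minimal sketch using pandas.factorize. It is not the pm.TidyData implementation (none is settled in this thread); the DataFrame, its column names, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format ("tidy") data: one row per observation.
df = pd.DataFrame(
    {
        "city": ["Berlin", "Paris", "Berlin", "Rome", "Paris"],
        "temperature": [12.1, 15.3, 11.8, 19.4, 14.9],
    }
)

# factorize maps each distinct string to an integer code (in order of first
# appearance) and returns the labels needed to translate the codes back.
city_idx, city_labels = pd.factorize(df["city"])

print(city_idx)           # integer array, usable as offset[city_idx] in a model
print(list(city_labels))  # labels to attach to the output as coordinates
```

The integer codes are what a model would index with, and the labels are what plots and diagnostics would need to show in place of 0, 1, 2.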

@aseyboldt implemented a first version and an example, but the implementation is not mature enough, so we decided to not include it in #3551.

This would be a very useful new feature though, which is why we created this issue -- to remind ourselves of it and in case someone wants to give it a try 😉

NowanIlfideme · 2020-08-03T10:57:14Z

Any reason to not support this explicitly via xarray? The coords and dims mapping between pandas and xarray works pretty well (see below), and xarray already natively has coords and dims (+ ArviZ's InferenceData is a pack of xr.Dataset objects as well). This would help workflows that use multidimensional or hierarchical data, for example.

For pandas use cases conversion should be very easy: df.to_xarray() converts a DataFrame to a Dataset with one or more coordinates (e.g. from a MultiIndex; see docs), and the inverse conversion works as well.
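A small sketch of the round trip mentioned above, assuming xarray is installed alongside pandas. The data and the names city, year and temperature are illustrative only.

```python
import pandas as pd
import xarray as xr

# Hypothetical tidy data with two index levels.
df = pd.DataFrame(
    {
        "city": ["Berlin", "Berlin", "Paris", "Paris"],
        "year": [2019, 2020, 2019, 2020],
        "temperature": [10.3, 10.9, 12.4, 12.8],
    }
).set_index(["city", "year"])

# Each MultiIndex level becomes a named dimension with its own coordinate.
ds = df.to_xarray()
print(ds["temperature"].dims)    # dimension names taken from the index levels
print(list(ds["city"].values))   # coordinate labels for the "city" dimension

# The inverse conversion gives back a DataFrame indexed by (city, year).
roundtrip = ds.to_dataframe()
```

Label-based selection such as ds["temperature"].sel(city="Paris", year=2020) then works without any manual bookkeeping of integer positions.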

twiecki · 2020-08-03T11:50:10Z

@NowanIlfideme That sounds like a good approach to me. Want to do a PR?

AlexAndorra · 2020-08-03T11:58:32Z

Yeah, feel free to open a PR, basing it on @aseyboldt's previous approach or not. ArviZ and xarray evolved a lot since Adrian's first thought about this, so there may be easier approaches now. This issue is just a guide to express the need and make the goal explicit.

kc611 · 2021-01-28T09:00:37Z

@twiecki @AlexAndorra Does this mean that the TidyData logic first added in #3551 should simply be re-implemented via xarray instead of numpy.ndarray? Or were there some features expected from the initial implementation that it did not fulfill?

AlexAndorra · 2021-01-28T09:38:32Z

Hi @kc611! Mmmh, I think it's more about extending the current implementation to work with xarray Datasets (and thus pandas DataFrames), as shown in the example above. That way, you get all the dims, coords and associated indexes defined at the same time and place, and conveniently so.
Of course, my memory of the details of what we did for this in August is a bit fuzzy, and ArviZ/xarray evolved since then, so you may find that another approach is more appropriate if you take that on.
How does that sound?

kc611 · 2021-01-28T14:49:25Z

Alright I'll see if I can come up with something.

Should this class also support implicit conversion for pd.DataFrame objects (or should inputs be only limited to xarray Datasets)?

AlexAndorra · 2021-01-29T10:43:12Z

Awesome, looking forward to it @kc611 !

> Should this class also support implicit conversion for pd.DataFrame objects (or should inputs be only limited to xarray Datasets)?

I think inputs could be both pd.DataFrame and xr.Dataset, but the conversion should not be made by PyMC; it should be made by the user.

twiecki · 2021-01-30T15:53:56Z

I don't think anyone would input xarrays, we only need DataFrames.

kc611 · 2021-01-30T16:30:34Z

@twiecki I have a related draft running (implemented using xarray and ). We could call .to_xarray() there if the input is a pd.DataFrame. But if not (and we want support limited to pd.DataFrames), then I guess I will have to revert to the original TidyData class implementation. In any case, I think we should try to support both, since the conversion between the two looks quite seamless.

I think it'll be a better idea if you have a look at it (the draft PR) anyway in its current state.

NowanIlfideme · 2021-01-30T16:52:26Z

> I don't think anyone would input xarrays, we only need DataFrames.

Disagree completely - I would love to input xarray datasets, in fact for several of my latest models I've had to manually add the values to the model.

DataFrames are good too, but 2D DataFrames are easily converted to xarray, while MultiIndex ones can be trickier (the implementers just need to make sure that they don't auto-expand the MultiIndex during conversion, i.e. they need to choose the proper converter).
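A minimal sketch of the auto-expansion concern, assuming xarray is installed so that DataFrame.to_xarray() is available. The data is made up, and keeping the index levels as plain columns is only one possible way to avoid the expansion, not something the thread settled on.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced data: there is no (Paris, 2020) observation.
sparse = pd.DataFrame(
    {
        "city": ["Berlin", "Berlin", "Paris"],
        "year": [2019, 2020, 2019],
        "temperature": [10.3, 10.9, 12.4],
    }
)

# Converting with a MultiIndex unstacks it into a dense city x year grid,
# so the missing combination appears as NaN.
expanded = sparse.set_index(["city", "year"]).to_xarray()
n_missing = int(np.isnan(expanded["temperature"].values).sum())
print(expanded["temperature"].shape, n_missing)

# Leaving city and year as ordinary columns keeps one row per observation
# along a single "index" dimension, with no padding.
flat = sparse.to_xarray()
print(dict(flat.sizes))
```

Which form is preferable depends on the model: the dense grid suits array-shaped likelihoods, while the flat form matches the integer-indexing pattern discussed in this issue.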

Finally, with arviz integrating with xarray more and more, I think pymc3 doing so would be great as well.

ricardoV94 · 2021-01-31T10:18:31Z

I am trying to understand the difference between the proposed TidyData and the current Data.

Is there a reason why their functionalities need to be in separate objects? Is TidyData something that should not be changed after model specification? Is TidyData a "2D" version of Data?

kc611 · 2021-02-01T05:39:06Z

> I am trying to understand the difference between the proposed TidyData and the current Data.

For now it's expected to do indexing for string data. Maybe its functionality can be extended to doing things like cleaning/tidying up data (as the name suggests), like filling in missing values automatically (using nearest neighbours) or one-hot encoding for categorical data. I don't know the extent of its use in PyMC models tho.

> Is TidyData something that should not be changed after model specification?

No, I think it'll be a good idea to add that functionality too. That said, I think it'll be a better idea to properly discuss the scope and use cases of this new class (and whether a new class is needed at all) before I continue adding random stuff in my PR :-p

twiecki · 2021-02-01T09:41:26Z

Started looking a bit closer and I'm a bit confused about the API.

Currently, we can specify dims with:

import pymc3 as pm

# Assumes df_data is a DataFrame indexed by date with one column per city.
coords = {"date": df_data.index, "city": df_data.columns}
with pm.Model(coords=coords) as model:
    city_offset = pm.Normal("city_offset", mu=0.0, sd=3.0, dims="city")

I think TidyData is supposed to hook into coords, but I'm not sure how, as TidyData is defined inside the model whereas coords is defined in the model context.

Anyone have a full grasp on this?

CC @aseyboldt

ricardoV94 · 2021-02-01T12:04:38Z

> I think TidyData is supposed to hook into coords, but I'm not sure how, as TidyData is defined inside the model whereas coords is defined in the model context.

It can simply make use of model.add_coords:

https://github.com/pymc-devs/pymc3/blob/43e757229e7aac66209a7a06efe1a36b9570daa4/pymc3/model.py#L1097

pm.Data can also do this (and is also instantiated only inside the model context): https://github.com/pymc-devs/pymc3/blob/acb326149adffe03fd06aac6515ed58b682f646b/pymc3/data.py#L540-L541

fonnesbeck · 2021-11-18T01:22:22Z

This has been dormant for quite a while. Is there still interest in pursuing this, or can it be closed?

NowanIlfideme · 2021-11-18T01:43:49Z

I'm interested as a user. Usually I need to wrap the model class into something I dynamically construct, using my own encoding scheme. Having this built into PyMC would make this much more standardized.

twiecki · 2021-11-18T10:44:59Z

@NowanIlfideme Are you interested in giving this a shot? We'd help of course.

AlexAndorra added the enhancements, feature request, and help wanted labels Jun 10, 2020

StanczakDominik mentioned this issue Jun 19, 2020

Use PyMC3 3.9 dims, Model(coords) arguments from the Model context manager instead of manually adding dims for az.from_pymc3? arviz-devs/arviz#1250

Closed

kc611 mentioned this issue Jan 29, 2021

[WIP] Introduced pm.TidyData class #4447

Closed

pymc-devs locked and limited conversation to collaborators Jan 10, 2022

ricardoV94 converted this issue into discussion #5335 Jan 10, 2022

This issue was moved to a discussion.
