This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New Feature: Add a TidyData Class #3953
Comments
Any reason to not support this explicitly via For |
@NowanIlfideme That sounds like a good approach to me. Want to do a PR? |
Yeah, feel free to open a PR, basing it on @aseyboldt's previous approach or not. ArviZ and xarray have evolved a lot since Adrian's first thoughts about this, so there may be easier approaches now. This issue is just a guide to express the need and make the goal explicit |
@twiecki @AlexAndorra Does this mean that the |
Hi @kc611 ! Mmmh, I think it's more about extending the current implementation to work with xarray datasets (and thus Pandas dataframe), as shown in the example above. That way, you get all the dims, coords and associated indexes defined at the same time and place, and conveniently so. |
Alright I'll see if I can come up with something. Should this class also support implicit conversion for |
Awesome, looking forward to it @kc611 !
I think inputs could be both pd.DataFrame and xr.Dataset, but the conversion should not be made by PyMC; it should be made by the user |
I don't think anyone would input xarrays, we only need DataFrames. |
@twiecki I have a related draft running (implemented using I think it'll be a better idea if you have a look at it (the draft PR) anyway in its current state. |
Disagree completely - I would love to input xarray datasets; in fact, for several of my latest models I've had to manually add the values to the model. DataFrames are good too, but 2D DataFrames are easily converted to xarray, while MultiIndex ones can be trickier (the implementers just need to make sure that they don't auto-expand the MultiIndex during conversion, i.e. they need to choose the proper converter). Finally, with ArviZ integrating with xarray more and more, I think pymc3 doing so would be great as well. |
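To illustrate the MultiIndex concern above, here is a minimal plain-Python sketch of what "auto-expanding" means: a long-format table with two index levels only stores the observed combinations, while a dense conversion builds the full grid of level values and has to fill the unobserved cells. The `observations` data and the `(city, year)` levels are made up for illustration, and the sketch assumes the conversion in question builds the full Cartesian product of the levels.

```python
from itertools import product

# Hypothetical long-format observations keyed by two index levels (city, year).
observations = {
    ("Berlin", 2019): 1.0,
    ("Berlin", 2020): 1.5,
    ("Paris", 2020): 2.0,
}

# The distinct values of each index level.
cities = sorted({city for city, _ in observations})
years = sorted({year for _, year in observations})

# Dense expansion: one cell per (city, year) pair, None where nothing was observed.
dense = {key: observations.get(key) for key in product(cities, years)}

# Number of cells that exist only because of the expansion.
n_missing = sum(value is None for value in dense.values())
```

With many sparsely observed levels, the number of filled-in cells can grow much faster than the number of actual rows, which is why the choice of converter matters.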
I am trying to understand the difference between the proposed TidyData and the current Data. Is there a reason why their functionalities need to be in separate objects? Is TidyData something that should not be changed after model specification? Is TidyData a "2D" version of Data? |
For now it's expected to do indexing for string data. Maybe its functionality can be extended to doing things like cleaning/tidying up data (as the name suggests), like filling in missing values automatically (using nearest neighbours) or one-hot encoding for categorical data. I don't know the extent of its use in PyMC models tho.
No, I think it'll be a good idea to add that functionality too. That said I think it'll be a better idea to properly discuss the scope and use cases of this new class ( and if a new class is needed at all ) before I continue adding random stuff in my PR :-p |
Started looking a bit closer and I'm a bit confused about the API. Currently, we can specify dims with:
    coords = {"date": df_data.index, "city": df_data.columns}
    with pm.Model(coords=coords) as model:
        city_offset = pm.Normal("city_offset", mu=0.0, sd=3.0, dims="city")
I think Anyone have a full grasp on this? CC @aseyboldt |
It can simply make use of
|
This has been dormant for quite a while. Is there still interest in pursuing this, or can it be closed? |
I'm interested as a user. Usually I need to wrap the model class into something I dynamically construct, using my own encoding scheme. Having this built into PyMC would make this much more standardized. |
@NowanIlfideme Are you interested in giving this a shot? We'd help of course. |
This issue is up for grabs, with the goal of developing a new feature called pm.TidyData. The goal is to deal with automatically translating strings from a dataframe into integer arrays for indexing in models, and then make sure we still get the right labels in the output, for plots and diagnostics.
@aseyboldt implemented a first version and an example, but the implementation is not mature enough, so we decided to not include it in #3551.
This would be a very useful new feature though, which is why we created this issue -- to remind ourselves of it and in case someone wants to give it a try 😉
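As a rough illustration of the translation step described above, here is a minimal plain-Python sketch, not the proposed `pm.TidyData` implementation. The helper `factorize` is a hypothetical name written for this example (pandas provides `pd.factorize` for the same purpose), and the city names and `city_offset` values are made up.

```python
def factorize(values):
    """Map each distinct label to an integer code, in order of first appearance."""
    labels = []   # distinct labels, position == integer code
    code_of = {}  # label -> integer code
    codes = []    # one integer code per input value
    for value in values:
        if value not in code_of:
            code_of[value] = len(labels)
            labels.append(value)
        codes.append(code_of[value])
    return codes, labels


# A string column as it might appear in a tidy dataframe.
cities = ["Berlin", "Paris", "Berlin", "Rome", "Paris"]
city_idx, city_labels = factorize(cities)

# Made-up per-city values standing in for a model output with one entry per city.
city_offset = [0.5, -1.2, 0.3]

# Integer codes index into the per-city values, one lookup per data row.
per_row = [city_offset[i] for i in city_idx]

# The stored labels let the output be reported by name again.
labelled = dict(zip(city_labels, city_offset))
```

The integer codes are what a model would use for indexing, while the label list is what would be passed along as coordinates so that plots and diagnostics show city names rather than positions.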