Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset.from_dataframe: deprecate expanding the multi-index #8166

Open
benbovy opened this issue Sep 10, 2023 · 4 comments · May be fixed by #8170
Open

Dataset.from_dataframe: deprecate expanding the multi-index #8166

benbovy opened this issue Sep 10, 2023 · 4 comments · May be fixed by #8170

Comments

@benbovy
Copy link
Member

benbovy commented Sep 10, 2023

What is your issue?

Let's continue here the discussion about changing the behavior of Dataset.from_dataframe (see #8140 (comment)).

The current behaviour of Dataset.from_dataframe where it always unstacks feels wrong to me.
To me, it seems sensible that Dataset.from_dataframe(df) automatically creates a Dataset with PandasMultiIndex if df has a MultiIndex. The user can then use that or quite easily unstack to a dense or sparse array.

If we don't unstack anymore the multi-index in Dataset.from_dataframe, are we OK that the "Dataset -> DataFrame -> Dataset" round-trip will not yield expected results unless we unstack explicitly?

ds = xr.Dataset(
    {"foo": (("x", "y"), [[1, 2], [3, 4]])},
    coords={"x": ["a", "b"], "y": [1, 2]},
)

df = ds.to_dataframe()
ds2 = xr.Dataset.from_dataframe(df, dim="z")

ds2.identical(ds)  # False

ds2.unstack("z").identical(ds)  # True

cc @max-sixty @dcherian

@benbovy benbovy added needs triage Issue that has not been reviewed by xarray team member design question and removed needs triage Issue that has not been reviewed by xarray team member labels Sep 10, 2023
@max-sixty
Copy link
Collaborator

That's a good point, and these invariants are indeed nice to uphold.

Is there a branch with the dim= code on? Or it's just a mental model atm? (I wrote a message but not sure it's correct so removed it, will rewrite with either the code or more thought!)

@dcherian
Copy link
Contributor

dcherian commented Sep 11, 2023

Sorry I wasn't very clear in that thread.

I think we should avoid the dim argument for this reason.

We could just use "dim_X" if Index.name is None, and have the user manually rename to a name they like.

@benbovy benbovy linked a pull request Sep 11, 2023 that will close this issue
5 tasks
@benbovy
Copy link
Member Author

benbovy commented Sep 11, 2023

Is there a branch with the dim= code on?

See #8170

@max-sixty
Copy link
Collaborator

Without any magical ideas for maintaining the from_dataframe / to_dataframe round-trip, I would be +1 on deprecating unstacking / expanding the multi-index; to the extent it helps us with finishing off the index refactor and fixing bugs such as #8646.

(personally I don't even use from_dataframe, I just do xr.Dataset(df), which doesn't unstack... So this would also have the advantage of unifying that behavior...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants