Skip to content

Conversation

@rhshadrach
Copy link
Member

@rhshadrach rhshadrach added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 23, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Jun 23, 2023
- Bug in :meth:`DataFrame.stack` losing extension dtypes when columns is a :class:`MultiIndex` and frame contains mixed dtypes (:issue:`45740`)
- Bug in :meth:`DataFrame.stack` sorting columns lexicographically (:issue:`53786`)
- Bug in :meth:`DataFrame.stack` sorting columns lexicographically in rare cases (:issue:`53786`)
- Bug in :meth:`DataFrame.stack` sorting index lexicographically in rare cases (:issue:`53824`)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tons of tests for stacking not sorting the order; only one of them is impacted by this bug. I haven't been able to figure out a way to describe the circumstances this happens under.

… into stack_sort_bug_2

# Conflicts:
#	doc/source/whatsnew/v2.1.0.rst
#	pandas/core/reshape/reshape.py
@rhshadrach
Copy link
Member Author

rhshadrach commented Jun 25, 2023

@mroeschke - this removes all uses of sort in DataFrame.stack. Looking at #15105 again, it appear to me the cause of all the issues was the unstack behavior. #53298 adding sort=True/False to unstack resolves that issue alone, and the use of sort in unstack argument controls the sorting of the values.

On the other hand, #53282 does not modify the sorting of the values (values here being the levels + codes combination), but rather just the sorting of the levels (the codes are then also adjusted so the values come out the same). This PR essentially only implements sort=False. So I think we can remove the sort argument from stack and still resolve #15105 with this PR. Would you be okay with me doing this here?

@mroeschke
Copy link
Member

So I think we can remove the sort argument from stack and still resolve #15105 with this PR. Would you be okay with me doing this here?

Hmm it would be nice for API consistency for stack/unstack to both have sort keywords. Would it be difficult to implement sort=True for stack?

@rhshadrach
Copy link
Member Author

rhshadrach commented Jun 26, 2023

Would it be difficult to implement sort=True for stack?

No - it's not. But I was leaning the other way: remove sort from unstack after changing it's default to False. To sort, it would just be a call to sort_index(axis=1) after. There is a perf impact here:

arrays = [np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)]
index = MultiIndex.from_arrays(arrays)
df = DataFrame(np.random.randn(10000, 4), index=index)

%timeit df.unstack(1, sort=True)
602 µs ± 5.38 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df.unstack(1, sort=False)
638 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df.unstack(1, sort=False).sort_index(axis=1)
1.04 ms ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In general I advocate for not having arguments when it's just an additional call, but perhaps the perf benefit is worth it here?

Somewhat like unstack, we can implement sort=True in stack more efficiently than having to call .sort_index(axis=1) after. If we're going this route, then we'd need to have sort=False as the default for stack, sort=True as the default for unstack, and align their defaults after a deprecation. I'd suggest sort=False as the default.

@mroeschke
Copy link
Member

In general I advocate for not having arguments when it's just an additional call, but perhaps the perf benefit is worth it here?

Yeah I agree with not having arguments when its an additional call, additionally I think the sort=False behavior is a more natural default behavior so I wouldn't mind moving toward an eventual removal of the keyword where we do not sort by default

@rhshadrach rhshadrach requested a review from mroeschke June 28, 2023 00:25
@mroeschke mroeschke merged commit 5307062 into pandas-dev:main Jun 28, 2023
@mroeschke
Copy link
Member

Nice thanks @rhshadrach

@rhshadrach rhshadrach deleted the stack_sort_bug_2 branch June 28, 2023 18:25
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
…v#53825)

* BUG: DataFrame.stack with sort=True and unsorted MultiIndex levels

* revert ignoring warnings

* BUG: DataFrame.stack sorting columns

* whatsnew

* Docstring fixup

* Merge cleanup

* WIP

* BUG: DataFrame.stack sometimes sorting the resulting index

* mypy fixups

* Remove sort argument from DataFrame.stack
rhshadrach added a commit that referenced this pull request Jul 10, 2023
rhshadrach added a commit that referenced this pull request Jul 13, 2023
…54068)

Revert "BUG: DataFrame.stack sometimes sorting the resulting index (#53825)"

This reverts commit 5307062.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.stack sorting index values in rare cases

2 participants