-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
ENH: Implement MultiIndex.insert_level for inserting levels at specified positions #62610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Implement insert_level method for MultiIndex to insert new levels at specified positions - Add comprehensive test cases for the new functionality - Fix level names handling to match expected behavior Resolves: MultiIndex level insertion feature request
- Implement insert_level method for MultiIndex to insert new levels at specified positions - Add comprehensive test cases for the new functionality - Fix level names handling to match expected behavior Resolves: MultiIndex level insertion feature request
- Implement insert_level method for MultiIndex to insert new levels at specified positions - Add comprehensive test cases for the new functionality - Fix level names handling to match expected behavior Resolves: MultiIndex level insertion feature request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Here are some suggestions against the test suite.
It seems that you committed fastparquet
and pyarrow
accidentally.
Also, can you add an entry to whatsnew?
def setup_method(self): | ||
self.simple_idx = pd.MultiIndex.from_tuples( | ||
[("A", 1), ("B", 2), ("C", 3)], names=["level1", "level2"] | ||
) | ||
self.empty_idx = pd.MultiIndex.from_tuples([], names=["level1", "level2"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: I would prefer that you define this in the test body that should use it.
) | ||
self.empty_idx = pd.MultiIndex.from_tuples([], names=["level1", "level2"]) | ||
|
||
def test_insert_level_basic(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parametrize this test?
result = self.simple_idx.insert_level(0, "new_val", name="new_level") | ||
assert result.names[0] == "new_level" | ||
|
||
def test_insert_level_edge_positions(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can go into the first test.
result_end = self.simple_idx.insert_level(2, "end") | ||
assert result_end.nlevels == 3 | ||
|
||
def test_insert_level_error_cases(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parametrize this?
with pytest.raises(ValueError, match="Length of values must match"): | ||
self.simple_idx.insert_level(1, ["too", "few"]) | ||
|
||
def test_insert_level_with_different_data_types(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests could go into the first test too.
def test_debug_names(): | ||
idx = pd.MultiIndex.from_tuples( | ||
[("A", 1), ("B", 2), ("C", 3)], names=["level1", "level2"] | ||
) | ||
print("Original names:", idx.names) | ||
|
||
result = idx.insert_level(0, "new_value") | ||
print("Result names:", result.names) | ||
|
||
expected = pd.MultiIndex.from_tuples( | ||
[("new_value", "A", 1), ("new_value", "B", 2), ("new_value", "C", 3)], | ||
names=[None, "level1", "level2"], | ||
) | ||
print("Expected names:", expected.names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was a leftover debug function from earlier development. I've deleted it from the test file.
|
||
tm.assert_index_equal(original, self.simple_idx) | ||
|
||
assert result.nlevels == original.nlevels + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion feels redundant against the first test.
def test_insert_level_with_name(self): | ||
result = self.simple_idx.insert_level(0, "new_val", name="new_level") | ||
assert result.names[0] == "new_level" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also test adding names in the first test and remove this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience!!!! I've made all the requested improvements such as “Moved test data into individual test methods”. All tests continue to pass and the implementation is ready for review!
(2, "end"), | ||
], | ||
) | ||
def test_insert_level_edge_positions(self, position, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test would also be better in the first test. Asserting the whole series is better than asserting its size.
"value", | ||
[100, 1.5, None], | ||
) | ||
def test_insert_level_with_different_data_types(self, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, you are only asserting its size. Move it to the first test to verify with tm.assert_index_equal(result, expected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've consolidated all appropriate tests into the main parametrized test to provide comprehensive validation with tm.assert_index_equal
, if here are some other problems, I would fix them immediately, thank you!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in this file seems unrelated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, It might be remnants of a merge conflict - looks like an issue from resolving conflicts.I have made it the same as the main branch.
fastparquet
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you committed this by accident
pyarrow
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you committed this by accident
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git rm pyarrow
doc/source/whatsnew/v3.0.0.rst
Outdated
- | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to remove the -
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I fixed it
pandas/core/indexes/multi.py
Outdated
new_tuple = list(tup) | ||
new_tuple.insert(position, value[i]) | ||
new_tuples.append(tuple(new_tuple)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really necessary, this conversion of tuple -> list -> tuple?
pandas/core/indexes/multi.py
Outdated
new_tuple = list(tup) | ||
new_tuple.insert(position, value[i]) | ||
new_tuples.append(tuple(new_tuple)) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this branch? Do any of your test cases fall here?
pandas/core/indexes/multi.py
Outdated
if self.names is not None: | ||
new_names = list(self.names) | ||
else: | ||
new_names = [None] * self.nlevels | ||
|
||
new_names.insert(position, name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self.names is not None: | |
new_names = list(self.names) | |
else: | |
new_names = [None] * self.nlevels | |
new_names.insert(position, name) | |
if self.names is not None: | |
new_names = self.names[:position] + [name] + self.names[position + 1:] | |
else: | |
new_names = [None] * (position) + [name] + [None] * (self.nlevel - position) |
Is there a case where self.names
is None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I research the Constructors, If the user has not named it, the result._names = [None] * len(levels)
would make it a list class. Can it be regard as self.names
would never be None? Maybe here can be new_names = self.names[:position] + [name] + self.names[position:]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be regard as self.names would never be None?
The MultiIndex
constructor contains these lines
pandas/pandas/core/indexes/multi.py
Lines 329 to 332 in a329dc3
result._names = [None] * len(levels) | |
if names is not None: | |
# handles name validation | |
result._set_names(names) |
That indicates it would never be None
. So I think it's safe to remove the branching.
pyarrow
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git rm pyarrow
…ersions & REF: Remove unnecessary else branch in MultiIndex.insert_level & REF: Simplify names handling in MultiIndex.insert_level
pandas/core/indexes/multi.py
Outdated
if position == 0: | ||
new_tuple = (value[i],) + tup | ||
elif position == len(tup): | ||
new_tuple = tup + (value[i],) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this if-else
conditions. The slicing in the else
branch handles the edges.
pandas/core/indexes/multi.py
Outdated
>>> idx.insert_level(1, ["L1", "L2"], name="z") | ||
MultiIndex([('A', 'L1', 1), ('B', 'L2', 2)], | ||
names=['x', 'z', 'y']) | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you removed the documentation accidentally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@Alvaro-Kothe Just checking if there's anything else needed before this can be merged? All checks are passing and the feature is ready. Thanks! |
It needs the approval of a core member. Just wait. |
- :meth:`pandas.concat` will raise a ``ValueError`` when ``ignore_index=True`` and ``keys`` is not ``None`` (:issue:`59274`) | ||
- :py:class:`frozenset` elements in pandas objects are now natively printed (:issue:`60690`) | ||
- Add ``"delete_rows"`` option to ``if_exists`` argument in :meth:`DataFrame.to_sql` deleting all records of the table before inserting data (:issue:`37210`). | ||
- Added :meth:`MultiIndex.insert_level` to insert new levels at specified positions in a MultiIndex (:issue:`62558`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put insert_level
in the API reference?
pandas/core/indexes/multi.py
Outdated
Value(s) to use for the new level. If scalar, broadcast to all items. | ||
If array-like, length must match the length of the index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this accept only an array-like?
pandas/core/indexes/multi.py
Outdated
result = self._reorder_ilevels(order) | ||
return result | ||
|
||
def insert_level(self, position: int, value, name=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def insert_level(self, position: int, value, name=None): | |
def insert_level(self, position: int, value, name: Hashable=lib.no_default) -> MultiIndex: |
for i, tup in enumerate(self): | ||
new_tuple = tup[:position] + (value[i],) + tup[position:] | ||
new_tuples.append(new_tuple) | ||
|
||
new_names = self.names[:position] + [name] + self.names[position:] | ||
|
||
return MultiIndex.from_tuples(new_tuples, names=new_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using from_tuples
(which will be slow), can you calculate the level
and codes
of the new values
(you can find examples of this in this file) and interpose that with the existing self.levels
and self.codes
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mroeschke
I'm working on this and have a question about the optimal approach.
I started with a levels/codes based implementation:
def insert_level(self, position: int, value, name: Hashable = lib.no_default) -> MultiIndex:
#...
new_level = Index(value)
new_codes_for_level = new_level.get_indexer(value)
new_levels = self.levels[:position] + [new_level] + self.levels[position:]
new_codes = self.codes[:position] + [new_codes_for_level] + self.codes[position:]
new_names = self.names[:position] + [name] + self.names[position:]
return MultiIndex(levels=new_levels, codes=new_codes, names=new_names, verify_integrity=False)
However, I'm encountering issues with None value handling where new_level.get_indexer(value) fails when the new level contains duplicate values (like [None, None, None]).Should I:Continue with levels/codes optimization and implement custom codes mapping to handle duplicates, similar to how from_arrays handles it internally?Or just use the simpler from_tuples approach that builds new tuples and delegates to the well-tested MultiIndex.from_tuples method?The from_tuples approach would be more reliable for edge cases but potentially less performant for large indices.
Thanks for your guidance!
def test_insert_level_integration(): | ||
idx = MultiIndex.from_tuples([("A", 1), ("B", 2)]) | ||
|
||
df = pd.DataFrame({"data": [10, 20]}, index=idx) | ||
new_idx = idx.insert_level(0, "group1") | ||
df_new = df.set_index(new_idx) | ||
|
||
assert df_new.index.nlevels == 3 | ||
assert len(df_new) == 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove this test? I don't think it's necessarily testing anything new
import pandas._testing as tm | ||
|
||
|
||
class TestMultiIndexInsertLevel: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove this class and make all the tests free functions?
Adds
insert_level
method to MultiIndex for inserting new levels at specified positions.This addresses the feature request in issue #62558 for a simple way to add levels to MultiIndex at given positions.
insert_level
method inpandas/core/indexes/multi.py
pandas/tests/indexes/multi/test_insert_level.py
idx = pd.MultiIndex.from_tuples([('A', 1), ('B', 2)])
result = idx.insert_level(1, 'new_level')