-
-
Couldn't load subscription status.
- Fork 19.2k
BUG: .loc with duplicated label may have incorrect index dtype #11497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/index.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this logic should go in Index._shallow_copy itself.
pandas/core/index.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty convoluted
why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to handle freq reset for DTI / keep for PeriodIndex operation in the function. Will consider better way and remove from this PR.
eee51c5 to
3dc6db0
Compare
|
Updated and now green. |
pandas/core/index.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is infer not good enough here? (or maybe just reprupose and make infer a class attribute? (lke what you are calling _infer_as_myclass). e.g. confusing that we need/have 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To achieve below result. .loc against DatetimeIndex internally converts the index to Int64Index. Thus, there should be a logic to revert to DatetimeIndex.
pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3], infer=True)
# Int64Index([1, 2, 3], dtype='int64')
idx = pd.date_range('2011-01-01', freq='D', periods=3)
idx._shallow_copy(idx.asi8, infer=True)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')
In current master, the below case results in:
idx._shallow_copy(idx.asi8, infer=True)
Int64Index([1293840000000000000, 1293926400000000000, 1294012800000000000], dtype='int64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand. The point is that maybe if we have a class attribute, we don't need the instance one (e.g. passing infer=True at all). OR (if you think we need both) I would name them the same, and only use the instance passed one if its not None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is the inference of the above example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below is the behavior of current master.
pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3], infer=True)
# Int64Index([1, 2, 3], dtype='int64')
pd.Index([1., 2., 3.])._shallow_copy([1, 2, 3])
# Float64Index([1, 2, 3], dtype='int64')
idx = pd.date_range('2011-01-01', freq='D', periods=3)
idx._shallow_copy(idx.asi8, infer=True)
# Int64Index([1293840000000000000, 1293926400000000000, 1294012800000000000], dtype='int64')
idx._shallow_copy(idx.asi8)
# DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')
.loc must infer because user may pass different dtype.
pd.Index([1., 2., 3.])._shallow_copy(['a', 'b', 'c'])
# Float64Index([u'a', u'b', u'c'], dtype='|S1')
# -> NG, incorrect dtype
Thus, I understand _shallow_copy should cover followings.
- Return the same
Indexif values are guaranteed to be the same dtype as the original (without inference) - Return the correct type of
Indexif values can have different dtypes from the original (with inference)- Inference prioritizing original class (used in datetime-like
Indexaccepting internal repr likeasi8) - Normal inference (simply pass values to
Index)
- Inference prioritizing original class (used in datetime-like
Class attribute _infer_as_myclass intends to split the logic as below (or needs to override _shallow_copy in tseries/base)
pd.Index([1., 2., 3.])._shallow_copy(['2011-01', '2011-02', '2011-02'], infer=True)
# Index([u'2011-01', u'2011-02', u'2011-02'], dtype='object')
idx._shallow_copy(['2011-01', '2011-02', '2011-02'], infer=True)
# Index([u'2011-01', u'2011-02', u'2011-02'], dtype='object')
# -> want DatetimeIndex in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sinhrks not questing the need for this! just whether we need infer AND _infer_as_myclass or could we just use a single class variable for this (rather than the passed kw infer). Just confusing to have 2 sort-of-similar ways of doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. One idea is splitting the method. Based on the current impl:
- Use
_simple_newwhen inference is not needed. - Use
_shallow_copywhen inference is needed.
I've once tried it, but it isn't work by the simple change because of some impl differences, for example MultiIndex doesn't have _simple_new.
Let me try it in a separate PR, or change the milestone to 0.18.0.
|
@sinhrks can you update |
|
@sinhrks yeh, prob requires some playing around to get this to work nicely. Like your separation of concerns though. We should document this prob at top of core/index.py |
4a232a9 to
6e3ccb0
Compare
|
How about adding
We can replace If OK, I'll squash. |
|
@sinhrks seems reasonable call it |
|
@jreback OK, renamed. Also added general explanation about |
BUG: .loc with duplicated label may have incorrect index dtype
|
thanks @sinhrks and nice docs! |
.locresult with duplicated keys may have incorredIndexdtype.After the PR:
Above OK results are unchanged.