-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update has_dtypes to allow functions as dict values of parametr items #37
base: master
Are you sure you want to change the base?
Conversation
A change to allow has_dtypes parameter ``items`` to have functions as dict values. A few discussion point: I check for ``FunctionType`` not whether the input is callable. This is because variuos types are themselves callables (``int``, ``float`` etc.). Secondly, the check function must return ``True`` to pass. A truthy value is not enough. I'm thinking that users will sometimes return the series or dtype, and explicitly requiring ``True`` will minimize some errors (or so my thinking goes). Thirdly, the check function checks the *series*, not the series' dtype. My thinking on this is that by checking the series rather than the dtype only, the function can also check orderedness and categories on a categorical, and these checks seem logical to place in ``has_dtypes``. I can also see the counter argument, that we're not strictly checking ``dtype``, but ``pd.api.types`` accepts both series and dtypes, so no great loss should come from using series. I can write some tests and update docs, if this proposal is accepted. See #36 for the issue.
Small cleanup
check ``pd.Series.dtype``, not ``pd.Series``.
Thanks for this, and sorry for not getting around to reviewing it. I'll have more time later. I think accepting a function is fine, but one even simpler fix may be to replace the if not dtypes[k] == v: withn from pandas.api.types import is_dtype_equal
if not is_dtype_equal(dtypes[k], v): That at least covers the basics like |
Thanks for replying. I think this doesn't really solve the issue, as dtype checks still have to be precise and my issue is broader dtype checks (i.e. I check for broad dtype int (as an example), but do not care about which int sub-dtype i'm dealing with (int16, int 32, int64 - doesn't matter) ....). To check for broader dtypes, you have to use the functions in
this cannnot be done by using I've make my proposal a bit clearer wrt. the requirement the check-function to return booleans. |
Clearer errors for users.
BTW, |
I've made some additions for tests and docs and I think I'm finished unless you see anything. |
Hi @TomAugspurger Have you looked further at this? On a second look, a solution that IMO would be cleaner, would be for items.values to use strings that correspond to the possible output values of
I like this better than the function-based solution, as you don't have to do imports of |
I'm fine with only supporting the latest version of pandas
In [20]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int')
Out[20]: True
In [21]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int64')
Out[21]: True
In [22]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int32')
Out[22]: False So Are there any other cases where you see that not working? |
Hmm, strange. I get: In [2]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int')
Out[2]: False Are you sure you get It is very strange that we get different return values. I'm on Pandas version 0.20.2 on Python 3.6.1 on windows 10. EDIT: I can add that specific dtype works as expected: In [3]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int64')
Out[3]: True |
The specific dtypes of the two examples are different: The return value of The definition of def is_dtype_equal(source, target):
try:
source = _get_dtype(source)
target = _get_dtype(target)
return source == target
except (TypeError, AttributeError):
return False So in this case we are just doing In[1] dtype('int32') == dtype('int64') which evaluates to Hence using Do you not agree? |
Hi, @TomAugspurger. Do you not agree that >>>pd.api.types.is_dtype_equal(np.dtype('int64'), 'int')
False and not I'm on Pandas 0.20.2, python 3.6.1 and Windows 10, in case you're on non-windows and this is a platform-dependant difference. |
Yeah, it comes down to the python / platform. On 64-bit OSes it'll be True
(and `is_dtype_equal(np.dtype('int32'), 'int') will be False)
I'll need to think a bit about what the desired behavior is here.
…On Tue, Jun 20, 2017 at 4:04 PM, topper-123 ***@***.***> wrote:
Hi, @TomAugspurger <https://github.com/tomaugspurger>. Do you not agree
that
>>>pd.api.types.is_dtype_equal(np.dtype('int64'), 'int')False
and not True?
I'm on Pandas 0.20.2, python 3.6.1 and Windows 10, in case you're on
non-windows and this is a platform-dependant difference.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#37 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIvV6lYlu8ehWNEjkbrPM0sbQtetkks5sGDPZgaJpZM4NodTm>
.
|
Oh that makes sense, now you explain it. This does make I'll lay of this issue and let you decide what you want to do. |
Yeah, that was my initial thought too. I just want to think more about whether there are cases where we would want to follow the platform integer size (I haven't come up with any yet though). |
I promised to stay away, but I couldn't and tried to solve this... :-) The fundamental issue of testing for dtype hierarchy is that numpy is quite a bit illogical in its own dtype hierarchy checking: >>>import numpy as np
>>>np.issubdtype(np.dtype('int32'), np.int64)
False
>>>np.issubdtype(np.dtype('int32'), 'int64') # note: a string
True # WTF This behaviour doesn't make any sense wrt. type checking. I don't know if there are other reasons for the implementation in numpy, but I doubt it really. Probably it's the way it is for legacy reasons. I've asked about this in numpy issue #3218. Because of the above, the solution for sane dtype hierarchy checking in pandas (see below) is a bit wordy. Using this solution though, we get maximum usability and can check both for specific dtypes and dtypes classes, e.g.: @has_dtypes({
'A': 'int', #any int
'A': np.int # same as above
'A': 'int32' #int32 specifically, fails on int64 etc.
'A': np.int32 # same as above
'A': np.number # any int, float or other numpy number
'A': np.generic # any numpy object
'B': object # any python object
'B': 'category' # pandas categorical
'C': bool # a boolean
'D': 'datetime64' # np.datetime64 object
'D': np.datetime64 # same as above
})
def check():
return some_df I think this is quite nice and clean code from the user's perspective. The code for a new def has_dtypes(df, items):
is_dtype_equal = pd.api.types.is_dtype_equal
def issubdtype(arg1, arg2):
"""Mimics np.issubdype, but without the .mro() nonsense.
"""
if np.issubclass_(arg2, np.generic):
return issubclass(np.dtype(arg1).type, arg2)
return issubclass(np.dtype(arg1).type, np.dtype(arg2).type)
for k, v in items.items():
d = df.dtypes[k]
try:
assert issubdtype(d, v), ("Column {!r} has the wrong dtype {!r}. Check requires {!r}"
.format(k, d, v))
except TypeError:
assert is_dtype_equal(d, v), ("Column {!r} has the wrong dtype {!r}. check requires {!r}"
.format(k, d, v))
return df Would you be willing to use this in engarde? I also wonder whether something from the code above should be added to pandas type checking, unless |
Hi @TomAugspurger , Have you considered a solution to this? It would be very nice if That would mean something similar to the code I've submitted, though I do think that an approach like the above Regards. |
Haven't had a chance to look yet.
…On Wed, Jun 28, 2017 at 2:21 AM, topper-123 ***@***.***> wrote:
Hi @TomAugspurger <https://github.com/tomaugspurger> ,
Have you considered a solution to this? It would be very nice if
has_dtypes could accept np.integer, np.number etc. IMO the idiomatic way
to do this would be to use the functions in pd.api.types as that is
what's officially supported by pandas for dtype checking.
That would mean something similar to the code I've submitted, though I do
think that an approach like the above has_dtypes would be 'better', but
it's not the pandas waý. I've opened a ticket (#16768) on the pandas issue
tracker on this, we'll see if it'll be accepted.
Regards.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#37 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIhxMhJryzTXozN1UY2hcoImLz9Z7ks5sIf8BgaJpZM4NodTm>
.
|
A change to allow
has_dtypes
parameteritems
to have functions as dict values. See #36 for the issue.A few discussion point:
I check for
FunctionType
in the if-statement, not whether the input is a callable. This is because various types are themselves callables (int
,float
etc.), and checking for callability would be ambiguous in many cases .Secondly, the check function must return
True
to pass. A truthy value is not enough. I'm thinking that users will sometimes return the series or dtype in the check function, and explicitly requiringTrue
will minimize some errors (or so my thinking goes).Thirdly, the check function checks the series.dtype, not the series, as the function name implies. This means the check functions can't checks for orderedness and categories on a
categorical
, but these are not part of the dtype, and can be checked in the `´checks.verify*``-functions.I'm open to revert on the above points, if you want me to. I will write the tests and update docs, if this proposal is accepted.