fix mean for datetime-like using the respective time resolution unit #9977
base: main
Conversation
…ion unit, adapting tests
Works fine, thanks a lot!
Great, let's wait for some more feedback. @dcherian and @spencerkclark would you mind taking a look here?
xarray/core/duck_array_ops.py (Outdated)
# from version 2025.01.2 xarray uses np.datetime64[unit] where unit
# is one of "s", "ms", "us", "ns"
# the respective unit is used for the timedelta representation
unit, _ = np.datetime_data(offset.dtype)
Can we just pass offset.dtype in the astype call?
Unfortunately not, because offset is a datetime64, not a timedelta64. I'd happily change this if there is a better method to propagate the resolution.
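For illustration, a minimal numpy-only sketch of the distinction (hypothetical values, not the PR diff): offset carries a datetime64 dtype, while the quantity being added back is a timedelta, so the unit has to be extracted with np.datetime_data and re-applied as a timedelta64 of that unit.

import numpy as np

offset = np.datetime64("2000-01-01", "ms")
unit, _ = np.datetime_data(offset.dtype)  # ("ms", 1)

delta_mean = 1500.0  # a mean computed as float, in `unit` ticks (made-up value)
# offset.dtype is datetime64[ms]; the delta needs a timedelta64[ms] instead
print(offset + np.timedelta64(int(delta_mean), unit))
# 2000-01-01T00:00:01.500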
@dcherian Your comment made me think of something else, and it seems the current solution is not clever enough. For instance:
import xarray as xr
import numpy as np
timestamps = xr.DataArray(
    [
        np.datetime64("1970-01-01 00:00:01"),
        np.datetime64("1970-01-01 00:00:02"),
        np.datetime64("NaT", "s"),
    ]
)
timestamps.mean()
Out[2]:
<xarray.DataArray ()> Size: 8B
array('1970-01-01T01:00:01', dtype='datetime64[s]')
This will not cover the needed higher resolution, as we expect the 0.500 seconds here (the true mean of 1 s and 2 s is 1.5 s). This wasn't an issue before, as we were already at ns resolution at that point.
There is a solution over in _numbers_to_timedelta (line 574 in 1c7ee65):
def _numbers_to_timedelta(
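To see the resolution loss at the numpy level (a sketch of the issue, not the PR code): the delta mean computed as float still holds the 0.5 s, but adding it back at second resolution truncates it, while a finer timedelta unit would preserve it.

import numpy as np

values = np.array(
    ["1970-01-01 00:00:01", "1970-01-01 00:00:02"], dtype="datetime64[s]"
)
offset = values.min()
delta_mean = (values - offset).astype("float64").mean()  # 0.5 (seconds)

# adding back at "s" resolution truncates the fractional second
print(offset + np.timedelta64(int(delta_mean), "s"))
# 1970-01-01T00:00:01

# a finer unit for offset and delta keeps it
print(offset.astype("datetime64[ms]") + np.timedelta64(int(delta_mean * 1000), "ms"))
# 1970-01-01T00:00:01.500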
It's a good question @kmuehlbauer. Both pandas and NumPy seem to maintain the dtype, despite producing an imprecise result:
>>> times = pd.date_range("2000", freq="s", periods=2, unit="s")
>>> times.mean()
Timestamp('2000-01-01 00:00:00')
NumPy doesn't currently support taking the mean of np.datetime64 values (numpy/numpy#12901), which is why we have our own logic here, but we can do something analogous with np.timedelta64 values:
>>> np.array([0, 1]).astype("timedelta64[s]").mean()
np.timedelta64(0,'s')
I wonder if we should just do the same for now?
I guess an argument against something fancier is that it could cause problems with overflow—consider the following example:
>>> pd.date_range("1500", freq="us", periods=2, unit="us").mean()
Timestamp('1500-01-01 00:00:00')
Technically we would need nanosecond-precision to resolve the true mean, but it would be for an out-of-bounds time at that precision.
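For reference (a quick check, assuming pandas >= 2.0 where Timestamp.as_unit is available), a 1500-era timestamp cannot even be represented at nanosecond precision:

>>> import pandas as pd
>>> pd.Timestamp("1500-01-01").as_unit("ns")  # raises OutOfBoundsDatetime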
I think I've found a backwards-compatible solution to this. If we assume our datetime64 is of some resolution unit, then the output of _mean will also be in this resolution. As offset still has the same resolution, we can just add the int64 representation of _mean. So casting to int64 will do.
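A minimal sketch of that idea (hypothetical names, not the exact duck_array_ops code): the float mean of the deltas is expressed in the same unit as offset, so casting it to an integer and wrapping it in a timedelta64 of that unit is sufficient.

import numpy as np

def mean_datetime_sketch(array):
    offset = array.min()  # same datetime64 unit as the input
    unit, _ = np.datetime_data(offset.dtype)
    dmean = (array - offset).astype("float64").mean()  # float, in `unit` ticks
    # integer cast; the value is interpreted in the same unit as offset
    return offset + np.timedelta64(int(dmean), unit)

times = np.array(["2000-01-01", "2000-01-03"], dtype="datetime64[s]")
print(mean_datetime_sketch(times))  # 2000-01-02T00:00:00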
NumPy doesn't currently support taking the mean of np.datetime64 values (numpy/numpy#12901), which is why we have our own logic here, but we can do something analogous with np.timedelta64 values:
>>> np.array([0, 1]).astype("timedelta64[s]").mean()
np.timedelta64(0,'s')
The current solution (just casting _mean output to int64) will work as usual for "ns" resolution input. For lower resolution input it will align with pandas and numpy. Would that be good enough for now?
Thanks @kmuehlbauer for jumping on this—a couple more comments. I think it was useful to revisit this code independent of the non-nanosecond update.
@@ -715,8 +713,11 @@ def mean(array, axis=None, skipna=None, **kwargs):
    if dtypes.is_datetime_like(array.dtype):
        offset = _datetime_nanmin(array)
This was an issue before, but I think it would be safest to use the Unix epoch ("1970-01-01") for the offset, since this would guarantee the timedeltas would be representable with the same precision as the datetimes. For example, this produces the incorrect result independent of this PR:
>>> array = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")
>>> times = xr.DataArray(array, dims=["time"])
>>> times.mean()
<xarray.DataArray ()> Size: 8B
array('2261-04-11T23:47:16.854775808', dtype='datetime64[ns]')
I'm happy to address that in another PR if you would like (it is possible there are further simplifications we could make to this logic).
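A rough numpy-level sketch of the epoch suggestion (not the PR code): with the minimum as offset, the delta between the two values above spans roughly 582 years, which overflows int64 nanoseconds, whereas deltas relative to the Unix epoch stay within the representable +/-292-year range.

import numpy as np

array = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")

offset = np.datetime64("1970-01-01", "ns")  # Unix epoch, same unit as the data
delta_mean = (array - offset).astype("float64").mean()
print(offset + np.timedelta64(int(delta_mean), "ns"))
# approximately 1969-01-01, the expected midpoint of 1678 and 2260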
Co-authored-by: Spencer Clark <[email protected]>
whats-new.rst