Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: handle NaT values in dt-accessor #8084

Merged
merged 11 commits into from
Sep 14, 2023

Conversation

kmuehlbauer
Copy link
Contributor

@kmuehlbauer kmuehlbauer commented Aug 18, 2023

@keewis
Copy link
Collaborator

keewis commented Aug 18, 2023

if I understand correctly, pandas will return float arrays / series if there's NaT, so checking the output dtypes might be faster?

@kmuehlbauer
Copy link
Contributor Author

if I understand correctly, pandas will return float arrays / series if there's NaT, so checking the output dtypes might be faster?

Thanks! Yes, indeed, so we only need to special case isocalendar.

@kmuehlbauer
Copy link
Contributor Author

@keewis Do you have a recommendation how to retrieve resulting dtype for the dask-part?

@keewis
Copy link
Collaborator

keewis commented Aug 18, 2023

right, with dask this could be tricky, since we cannot know if the data contains missing values or not... I really would like to have nullable integer dtypes in situations like these.

@kmuehlbauer
Copy link
Contributor Author

with dask this could be tricky, since we cannot know if the data contains missing values or not...

Even testing a small subset does not reveal the dtype in any case, since the NaT might be in any chunk we do not test. So we would have to know a priori to be on the safe side also for further processing in the dask world.

@kmuehlbauer
Copy link
Contributor Author

One idea for the dask-part is to just keep current behaviour. On compute, this RuntimeError is raised:

RuntimeWarning: invalid value encountered in cast

The user could then just use .astype("float64") before computation. No proper solution, but maybe sufficient? I'll add the needed changes in a day or two.

@kmuehlbauer
Copy link
Contributor Author

kmuehlbauer commented Aug 21, 2023

The current solution works for most of the .dt.accessor out of the box (it just keeps float64 if provided by pandas). For isocalendar it is more of a workaround until a proper fix is added upstream (proposed pandas-dev/pandas#54657). The workaround checks for presence of <NA> in the pandas output and handles appropriately. This might introduce a performance penalty.

Note: This does not help for the dask part, which is left unchanged. Here we would somehow need a priory knowledge if NaT is involved or not.

Question: Should we add tests in this PR for handling in presence of NaT? This is currently not tested at all.

@kmuehlbauer
Copy link
Contributor Author

This is good to go from my side.

Unfortunately there has not been any comment on the upstream-issue so far: pandas-dev/pandas#54657. We can revisit and change when this is resolved over at pandas.

@kmuehlbauer kmuehlbauer added the plan to merge Final call for comments label Aug 31, 2023
@kmuehlbauer
Copy link
Contributor Author

@keewis I did not touch the dask part. I'd appreciate if you or others could have a final look for approval. Thanks.

@kmuehlbauer
Copy link
Contributor Author

kmuehlbauer commented Sep 14, 2023

Will merge after tests are passing.

@kmuehlbauer kmuehlbauer merged commit 0c4f165 into pydata:main Sep 14, 2023
@kmuehlbauer kmuehlbauer deleted the handle-nat-dt-accessor branch September 14, 2023 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.dt accessor returns int instead of float, resulting in misrepresentation of NaT values
2 participants