Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor datetime and timedelta encoding for increased robustness #9498

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

spencerkclark
Copy link
Member

@spencerkclark spencerkclark commented Sep 15, 2024

This PR makes the proposed updates in #9488 (comment), removing the guard against overflow in dtype casting. In so doing I have added units checking / potential modification to the cftime code path to bring things up to speed with the NumPy code path, and also added a regression test for #9134.

Now we more robustly raise an error in the case of chunked times where an integer dtype encoding is specified, but the units do not allow for an accurate roundtrip. This was the original intent of the code, though we can also discuss whether it is better to warn instead of raise in this instance—one way or another this would also make it safer to switch the default units with which we encode chunked times (#9154).

As @kmuehlbauer notes, this would be another way to fix #9488; in some ways it is complementary to #9497. Note also, as I mentioned in #9488 (comment), it also removes guardrails preventing #8542. So there is a bit of a trade-off, but in general I think the positives outweigh the negatives—this is not the correct place in the encoding path for dtype casting, and we do not have similar guardrails in other contexts.

* Implement robust units checking for lazy encoding
* Remove dtype casting, which interferes with missing value encoding
Copy link

@langmore langmore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After line 332, we could have an int32 value for num_dates (e.g. if the user provided an int32 value to begin with). This will be passed down to _decode_datetime_with_pandas, which, upon calling pd.to_timedelta(flat_num_dates.min(), time_units) will raise OutOfBoundsTimedelta: Cannot cast 24 from h to 'ns' without overflow due to pandas not handling the int32 first argument. This is preventable by casting num_dates to int64.

I believe this is related to pandas-dev/pandas#56996

@spencerkclark
Copy link
Member Author

Thanks for pointing this out @langmore. Indeed I think we actually have tests that would fail because of this in our suite, but we don't (yet) have a CI environment that has NumPy >= 2, but no cftime. In other words the fact that we can fall back to decoding times with cftime masks this.

Your suggested workaround makes sense to me; we may just want to make sure that we cast signed integers to int64 and unsigned integers to uint64 to avoid overflow (and avoid doing any casting for floats). If you'd like, I think we'd be happy to take a PR; I can also take care of it if you prefer. Unfortunately my attempt to fix this upstream stalled.

@langmore
Copy link

Thanks for pointing this out @langmore. Indeed I think we actually have tests that would fail because of this in our suite, but we don't (yet) have a CI environment that has NumPy >= 2, but no cftime. In other words the fact that we can fall back to decoding times with cftime masks this.

Your suggested workaround makes sense to me; we may just want to make sure that we cast signed integers to int64 and unsigned integers to uint64 to avoid overflow (and avoid doing any casting for floats). If you'd like, I think we'd be happy to take a PR; I can also take care of it if you prefer. Unfortunately my attempt to fix this upstream stalled.

@spencerkclark I won't have time to fix this, since we're also falling back on cftime so it works for us, for now.

@spencerkclark
Copy link
Member Author

No worries @langmore—I went ahead with something along the lines of your suggested fix in #9518.

@spencerkclark spencerkclark marked this pull request as ready for review February 23, 2025 15:20
@dcherian
Copy link
Contributor

dcherian commented Mar 7, 2025

Is this one ready to go in?

@spencerkclark
Copy link
Member Author

Test failures look relevant upon the latest merge—I'll take a look when I get a chance.

@spencerkclark
Copy link
Member Author

OK I think things are ready from my perspective—new zarr test failures are unrelated—though it would probably be good for @kmuehlbauer to give this a once over when he has a chance. Some of the changes here interacted with changes made in #10050. Specifically, this PR restores the responsibility of final dtype casting to later in the encoding chain, so we need to retain the "dtype" key in the encoding. In general though, I think this slightly simplifies the logic in the datetime and timedelta coders, which is probably a good thing. See 2970532 for the necessary updates.

Copy link
Contributor

@kmuehlbauer kmuehlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @spencerkclark, this looks really good! I've added two suggestions to fix minor issues.

Copy link
Member Author

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @kmuehlbauer—very helpful comments! Things should hopefully be all good now.

@dcherian dcherian requested a review from kmuehlbauer March 17, 2025 22:04
Copy link
Contributor

@kmuehlbauer kmuehlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good to go now Thanks @spencerkclark, those are really tricky issues solved here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants