Skip to content

Ensure kwarg order matches sklearn expectations#6716

Closed
jcrist wants to merge 1 commit intorapidsai:branch-25.06from
jcrist:sklearn-sig-check
Closed

Ensure kwarg order matches sklearn expectations#6716
jcrist wants to merge 1 commit intorapidsai:branch-25.06from
jcrist:sklearn-sig-check

Conversation

@jcrist
Copy link
Member

@jcrist jcrist commented May 13, 2025

Some common sklearn methods take positional-or-keyword arguments (like sample_weight). While users should pass these in by keyword, the standard signatures accept them by position as well (and some parts of the sklearn test suite and examples pass them positionally). For optimal sklearn compatibility, our signatures should have these parameters in the same standard order sklearn expects.

This PR adds a test for this, and fixes the few locations where we weren't compatible.

Some common sklearn methods take positional-or-keyword arguments (like
`sample_weight`). While users _should_ pass these in by keyword, the
standard signatures accept them by position as well (and some parts of
the sklearn test suite and examples pass them positionally). For optimal
sklearn compatibility, our signatures should have these parameters in
the same standard order sklearn expects.

This PR adds a test for this, and fixes the few locations where we
weren't compatible.
@jcrist jcrist requested a review from a team as a code owner May 13, 2025 20:19
@jcrist jcrist requested review from betatim and vyasr May 13, 2025 20:19
@jcrist jcrist added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cuml-accel Issues related to cuml.accel labels May 13, 2025
@jcrist jcrist self-assigned this May 13, 2025
@github-actions github-actions bot added the Cython / Python Cython or Python issue label May 13, 2025
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While users should pass them by keyword, in practice they might not, so this represents a breaking change. We need to perform an assessment of impact and possible mitigation.

One option would be to add a future warning for 25.06 and fix the order in 25.08.

@jcrist
Copy link
Member Author

jcrist commented May 13, 2025

We need to perform an assessment of impact and possible mitigation.

Not one of our tests or examples needed updating, which feels to me like a decent measure of "did any code rely on the ordering here". While I agree this is technically a breaking change, I don't expect this to break anyone's workflow. Further, I don't see a sane way of cleanly deprecating this. Personally I would recommend we mark it as a breaking change so it shows up in the release notes and move on. We've certainly made more disruptive breaking changes in the past.

If you disagree, can you please recommend a practical course of action with tangible steps we should take.

@jcrist jcrist added breaking Breaking change and removed non-breaking Non-breaking change labels May 13, 2025
@jcrist
Copy link
Member Author

jcrist commented May 13, 2025

For some precedent, we (err, I) did a similar change a few months ago in #6260 and it was merged without a deprecation period even though technically this changed the signatures.

@betatim
Copy link
Member

betatim commented May 14, 2025

I'm for this change, it is these "little things" that tend to be why swapping scikit-learn for cuml doesn't "just work". Making adoption hard for users. I'm not sure there is a great solution here for making the change though. It is hard to evaluate how many people rely on this. It is hard to do a deprecation cycle.

My best idea (not sure I love it though) for how to do a deprecation cycle is to look at the type of sample_weights and out_dtype(or what ever the other argument is) and rely on the fact that they can't be the same type. out_dtype can be a str or Numpy dtype, but not an array like. Same is true for sample_weight. I don't know if this is bulletproof or just a heuristic. However it would give us a chance to raise a warning about the args having swapped order. Maybe we can also inspect the arguments to determine which were passed as positional and which as keywords? I don't know off the top off my head, my guess is that it is unlikely (without much much dark magic).

I think here we are caught between a rock and a hard place. The signatures shouldn't have been introduced with this order of arguments, but that didn't get caught during review. Solving this via a deprecation cycle will require quite a bit of finessing and unclear if it is not just a heuristic (-> some people will still be caught out by it). One actionable take away from this is that we should try and up our game when reviewing things to try and prevent introducing such brain teasers ;)


A more philosophical point in reply to "we made a similar change here, so lets do it again": I like having rules (shocking for a German, I know) that we can apply without making exceptions. I'm against exceptions because once you start making them, you keep making them because "everyone is a special snowflake". This also means that the rule(s) we make for ourselves have to be made so that we can apply them and make progress.

As an example, it is popular to have a rule like "no merging if the CI is red". This is a fine rule to have, however you need pretty good CI (no flaky tests, reliable runners, test suite that completes in a reasonable amount of time, etc) for this rule to be consistently applied and make progress. If you don't have the prerequisites yet, I think it makes more sense to make a rule that recognises this, maybe something like "try to merge PRs only if the CI is green, and leave a brief comment explaining what was failing&why it is irrelevant to the change". Having the more strict rule and making exceptions is IMHO the worst of both worlds because it removes the usefulness of having CI and any incentive to put in the work to improve the CI.

In conclusion, I don't want to support this change "as is" with the justification of "we did this before already", but that doesn't mean I don't support making this change

@jcrist
Copy link
Member Author

jcrist commented May 14, 2025

A more philosophical point in reply to "we made a similar change here, so lets do it again"...

I agree that past mistakes don't justify repeating them. If we view merging #6260 without a deprecation period as a mistake, then sure we probably shouldn't merge this one as is.

The only reason I brought up #6260 is because we made breaking changes to several of these signatures then as well and no one complained. Seemed like the best metric we'd have for "does this break anyone's code". There's a possibility with all changes to break someone's workflow (https://xkcd.com/1172/), and deprecation cycles can slow improvements for users or be painful for devs. IMO an intention of "try to deprecate when feasible" makes more sense in these situations than a hard rule.


Tim, I confess I'm having a hard time parsing out what you're recommending here. Are you saying "this is unfortunate, but should merge it anyway"? Or "this is unfortunate, and we should find a way to deprecate things as cleanly as possible"? If the latter, I'll find a way, just trying to understand where we stand here.

@csadorf
Copy link
Contributor

csadorf commented May 14, 2025

@jcrist Thank you for referencing the precedent and I believe that #6260 clearly should at the very least have been flagged as breaking.

Like both of you pointed out, deprecating this behavior is difficult, albeit I'd argue not impossible. I've blocked this PR, because I want to make sure that we give breaking changes the appropriate consideration. If we come to the conclusion that we are either not able to cleanly (or uncleanly) deprecate the current (arguably false) API, then we should do that, but it must be an explicit decision where we have weighed trade-offs and considered mitigations.

I am happy to have another close look at this today and see if I can come up with something. I had pondered possible solutions like the ones Tim pointed out.

@jcrist
Copy link
Member Author

jcrist commented May 14, 2025

Mmk. If we want to deprecate things, I think the cleanest (and maybe nicest end state) would be to go even further and deprecate all non-standard kwargs passed positionally, moving things like convert_dtype to keyword-only. I have (I think) a sane way of doing that. The end state is signatures would look like:

def fit(self, X, y, sample_weight=None, *, convert_dtype=True, other_cuml_only_keyword=...):
   ...

This is maybe a nicer end state in total since all the cuml-only things would be keyword only, leave the positional-or-keyword arguments open to better track sklearn conventions as they evolve.

If this sounds good to y'all, I'll close this and open a new one.

@csadorf
Copy link
Contributor

csadorf commented May 14, 2025

Mmk. If we want to deprecate things, I think the cleanest (and maybe nicest end state) would be to go even further and deprecate all non-standard kwargs passed positionally, moving things like convert_dtype to keyword-only. I have (I think) a sane way of doing that. The end state is signatures would look like:

def fit(self, X, y, sample_weight=None, *, convert_dtype=True, other_cuml_only_keyword=...):
   ...

This is maybe a nicer end state in total since all the cuml-only things would be keyword only, leave the positional-or-keyword arguments open to better track sklearn conventions as they evolve.

If this sounds good to y'all, I'll close this and open a new one.

Fantastic idea. I had actually thought about making all arguments after (X, y) keyword-only and then dismissed the idea, because of the parity-regression with scikit-learn. But making sure that at least "our" arguments are keyword-only is the perfect compromise.

We could also consider the option to make all arguments (after X, y) keyword-only and allow additional positional arguments only for the cuml.accel proxy. I'm not sure how much effort that would be though.

@jcrist
Copy link
Member Author

jcrist commented May 15, 2025

Closing this for now, I'll open a new PR with the changes above when ready.

@jcrist jcrist closed this May 15, 2025
rapids-bot bot pushed a commit that referenced this pull request May 15, 2025
The sklearn interface has a bunch of standard methods (`fit`, `transform`, ...). cuml strives to match this interface. Some cuml methods take cuml-specific keyword arguments. To improve our sklearn compatibility and avoid potential signature compatibility issues in the future if sklearn makes changes, this PR deprecates passing cuml-specific kwargs positionally.

Addressing the suggestion made here: #6716 (comment)

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Simon Adorf (https://github.com/csadorf)

URL: #6738
@betatim
Copy link
Member

betatim commented May 16, 2025

Tim, I confess I'm having a hard time parsing out what you're recommending here. Are you saying "this is unfortunate, but should merge it anyway"? Or "this is unfortunate, and we should find a way to deprecate things as cleanly as possible"? If the latter, I'll find a way, just trying to understand where we stand here.

Sorry for the delayed reply, I think the reason it isn't clear to you is that it is also not clear to me how to do this, because in general I am +100 for removing these "little" stumbling blocks.

Having pondered it a bit and also the conversation we had yesterday in the meeting I think today I'd vote to "just do it, record it as a breaking change".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Breaking change cuml-accel Issues related to cuml.accel Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants