Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c-api][python-package][R-package] expose feature num bin #5048

Merged
merged 12 commits into from
Mar 15, 2022

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented Mar 3, 2022

This adds the LGBM_DatasetGetFeatureNumBin function to the C API to be able to get the number of constructed bins for a feature. I also added this to the Python API as the Dataset method feature_num_bin and added a test for it.

Closes #3406.

@jmoralez jmoralez changed the title expose FeatureNumBin in C api expose FeatureNumBin in C API Mar 3, 2022
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this PR!

LGTM for the Python part and just one minor suggestion for the test.

Also, could you please open a new issue or rename the old original one requesting to add the same functionality into the R-package?

tests/python_package_test/test_basic.py Outdated Show resolved Hide resolved
@jmoralez
Copy link
Collaborator Author

jmoralez commented Mar 5, 2022

I'll include the changes for the R package

Copy link
Collaborator

@shiyu1994 shiyu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Very nice test cases.

@jmoralez jmoralez requested a review from Laurae2 as a code owner March 7, 2022 17:43
@jmoralez
Copy link
Collaborator Author

jmoralez commented Mar 7, 2022

Some checks (1, 2, 3, 4, 5) are failing with the message Error in `FUN(X[[i]], ...)`: object 'LGBM_DatasetGetFeatureNumBin_R' not found. Do I have to export it somewhere else?

@StrikerRUS
Copy link
Collaborator

@jmoralez

Some checks (1, 2, 3, 4, 5) are failing with the message Error in `FUN(X[[i]], ...)`: object 'LGBM_DatasetGetFeatureNumBin_R' not found. Do I have to export it somewhere else?

Yeah, please check #4829 as example. New function should be registered here:

// .Call() calls
static const R_CallMethodDef CallEntries[] = {

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm travelling right now but would really like the chance to review this, now that changes to the R package's public API have been included.

Could you please either reserve the R changes for a separate PR, or otherwise wait 2 days to merge this? I'll be able to provide a review the day after tomorrow.

@jmoralez
Copy link
Collaborator Author

jmoralez commented Mar 7, 2022

Thank you @jameslamb, I'd really appreciate your review. I don't think we're in a rush to merge this, should I change this to draft or something similar to avoid merging it?

@jameslamb jameslamb changed the title expose FeatureNumBin in C API WIP: expose FeatureNumBin in C API Mar 9, 2022
@jameslamb
Copy link
Collaborator

should I change this to draft or something similar to avoid merging it?

I was going to say that adding "WIP" to the title will trigger a blocking CI job, but I guess that job doesn't block merging (I can still see the merge button available).

Screen Shot 2022-03-09 at 9 34 38 AM

So changing this to draft would be fine. As I kind of implied in my comment, it would also have been fine to split the R changes into a separate PR if you wanted.

Not a big deal either way though, I'm back and able to review this morning!

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! The changes on the R side look good to me, and I appreciate you making the interface there 1-based to match R's conventions. LightGBM's R package has historically been inconsistent about that (see, for example, discussion in #4970).

I left a few very minor suggestions for the R package.


I also think there's one other thing we should talk about now, on this PR. I believe it'd be useful to have a similar interface, but where you could get the number of bins by feature name.

I think that exposing that would prevent the need for users to write code like this to achieve the same thing when using named data objects like data frames:

feat_indx = [
    i, _
    for feat_name in df.columns
    if feat_name = "my_cool_feature"
][0]
ds.feature_num_bin(feat_indx)

I see three options for that:

  1. Don't do this at all.
  2. Allow the input to feature_num_bin() in the Python and R packages to accept Union[int, str]
    • assume that a string passed to this method should be matched exactly in feature_names
  3. Keep feature_num_bin() in the Python and R packages as-is and later add a new Dataset method like feature_num_bin_by_feature_name()
    • this accepts a string, matches it to feature_names to get the index (if feature_names have been defined), then calls feature_num_bin() with that index.

I only bring this up here in this review because if we want to eventually do Option 2, then I'd recommend changing the argument name to feature instead of feature_idx.

I have a mild preference for Option 3, as I think having explicit methods without union types makes it easier to reason about how the code is going to work. But this project already uses union types in so many other places, I'm not opposed to Option 2.

What do you think? also tagging @StrikerRUS @shiyu1994 @guolinke for other opinions.

R-package/R/lgb.Dataset.R Outdated Show resolved Hide resolved
R-package/tests/testthat/test_dataset.R Show resolved Hide resolved
R-package/tests/testthat/test_dataset.R Outdated Show resolved Hide resolved
@jmoralez
Copy link
Collaborator Author

jmoralez commented Mar 9, 2022

I was going to say that adding "WIP" to the title will trigger a blocking CI job, but I guess that job doesn't block merging

That check is not marked as required.
image

Although it'd definitely be useful because otherwise the WIP doesn't help much I think.

@jmoralez
Copy link
Collaborator Author

jmoralez commented Mar 9, 2022

I like option 2 and I think in the R package it's very straightforward to implement because the feature names get stored as column names. However for the Python package it would involve getting the feature names each time. Would it be in the scope of this PR to add a buffer or something so that they're only computed once?

@jameslamb
Copy link
Collaborator

Would it be in the scope of this PR to add a buffer or something so that they're only computed once?

Sorry, I'm not proposing also accepting feature names as part of this PR. I don't want to delay this PR by adding on more things. Just thought we should make a decision about how we think we might support the use of feature names, since it could impact the choice of argument name for the new methods introduced in this PR.

@jameslamb jameslamb mentioned this pull request Mar 9, 2022
@StrikerRUS
Copy link
Collaborator

@jameslamb Thanks a lot for the suggestion of adding access by feature name! I'm for option 2 for the consistency with some other APIs, e.g.

feature : int or str
The feature name or index the histogram is plotted for.
If int, interpreted as index.
If str, interpreted as name.

feature : int or str
The feature name or index the histogram is calculated for.
If int, interpreted as index.
If str, interpreted as name.

@jmoralez jmoralez requested a review from jameslamb March 13, 2022 14:48
@jameslamb
Copy link
Collaborator

Ok works for me @jmoralez and @StrikerRUS ! Let's go with option 2.

@jmoralez for this PR, could you just change the keyword argument name to feature? And then in a follow-up PR, add support for accepting feature names (or document that in a feature request and add a link to #2302).

@jmoralez jmoralez changed the title WIP: expose FeatureNumBin in C API expose FeatureNumBin in C API Mar 13, 2022
@jmoralez jmoralez changed the title expose FeatureNumBin in C API [c-api][python-package][R-package] expose feature num bin Mar 13, 2022
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! Thanks very much for the changes and excellent tests.

@StrikerRUS StrikerRUS merged commit d10372e into microsoft:master Mar 15, 2022
@jmoralez jmoralez deleted the feature-num-bin branch March 15, 2022 04:26
@shiyu1994
Copy link
Collaborator

I see three options for that:

Don't do this at all.
Allow the input to feature_num_bin() in the Python and R packages to accept Union[int, str]
assume that a string passed to this method should be matched exactly in feature_names
Keep feature_num_bin() in the Python and R packages as-is and later add a new Dataset method like feature_num_bin_by_feature_name()
this accepts a string, matches it to feature_names to get the index (if feature_names have been defined), then calls feature_num_bin() with that index.
I only bring this up here in this review because if we want to eventually do Option 2, then I'd recommend changing the argument name to feature instead of feature_idx.

I have a mild preference for Option 3, as I think having explicit methods without union types makes it easier to reason about how the code is going to work. But this project already uses union types in so many other places, I'm not opposed to Option 2.

@jameslamb Sorry for missing the message. I also agree with option 2.

@jameslamb
Copy link
Collaborator

No problem! Thanks for taking the time to come back and comment!

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose number of bins used by the model while binning continuous features to C API
4 participants