Add support for pandas nullable types to the sklearn api #4173

Closed
freddyaboulton opened this issue Apr 13, 2021 · 9 comments · Fixed by #4927

freddyaboulton commented Apr 13, 2021

Summary

I would like to use lightgbm's sklearn API with the new pandas nullable types. Currently, lightgbm does not recognize these types as valid. Reproducer using lightgbm 3.0.0 and pandas 0.24.1:

from lightgbm.sklearn import LGBMClassifier
import pytest
import pandas as pd

for dtype in ['Int64', 'Float64', 'boolean']:
    
    lr = LGBMClassifier()

    # ValueError if only features use nullable types
    df = pd.DataFrame({"a": pd.Series([True, False, True, False], dtype=dtype),
                       "b": pd.Series([4, 5, 6, 7])})
    y = pd.Series([1, 0, 1, 0])
    
    with pytest.raises(ValueError):
        lr.fit(df, y)
    
    # ValueError if only target uses nullable types
    df = pd.DataFrame({"a": pd.Series([True, False, True, False])})
    y = pd.Series([1, 0, 1, 0], dtype=dtype)
    
    with pytest.raises(ValueError):
        lr.fit(df, y)

# No ValueError if use "old" types
for dtype in ['int', 'float', 'bool']:
    
    df = pd.DataFrame({"a": pd.Series([True, False, True, False], dtype=dtype)})
    y = pd.Series([1, 0, 1, 0])

    lr = LGBMClassifier()
    lr.fit(df, y)

Motivation

As a user, I would like to build a data processing pipeline using the latest pandas features. I can work around lightgbm's limitations but this is cumbersome and some information could be lost in the conversion.

Thank you!

References

Pandas nullable integer dtype documentation.

@freddyaboulton freddyaboulton changed the title Add support for nullable types to the sklearn api Add support for pandas nullable types to the sklearn api Apr 13, 2021
@StrikerRUS
Collaborator

I'd like to highlight that

IntegerArray is currently experimental. Its API or implementation may change without warning.

But this warning has been there at least since 2019: #2486 (comment).

@StrikerRUS
Collaborator

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

@jmoralez
Collaborator

jmoralez commented Jan 4, 2022

Reopening since I'm working on this.

My understanding of

data = data.values
if data.dtype != np.float32 and data.dtype != np.float64:
    data = data.astype(np.float32)

is that the data gets converted to either float32 or float64, depending on the common dtype.

So my approach is checking if there are any 'Float64' (nullable dtype) or 'float64' (regular numpy dtype) in the dataframe. If there are any then the target dtype is 'float64', otherwise it is 'float32'. Then use data.to_numpy with the target dtype.
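A minimal sketch of the approach described above, assuming pandas 1.2 or later (needed for the 'Float64' dtype). The helper name `pandas_to_float_array` is hypothetical and not part of lightgbm, and this is not necessarily the code that was eventually merged:

```python
import numpy as np
import pandas as pd


def pandas_to_float_array(df: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: convert a numeric DataFrame to a float ndarray."""
    # Target float64 if any column is 'Float64' (nullable) or 'float64' (numpy);
    # otherwise fall back to float32.
    has_float64 = any(str(dtype) in ("Float64", "float64") for dtype in df.dtypes)
    target_dtype = np.float64 if has_float64 else np.float32
    # na_value=np.nan maps pd.NA in nullable columns to NaN.
    return df.to_numpy(dtype=target_dtype, na_value=np.nan)


df = pd.DataFrame({
    "a": pd.Series([1, None, 3], dtype="Int64"),
    "b": pd.Series([True, False, None], dtype="boolean"),
})
arr = pandas_to_float_array(df)

df64 = df.assign(c=pd.Series([0.5, None, 1.5], dtype="Float64"))
arr64 = pandas_to_float_array(df64)
```

Here `arr` would be float32 because no column is 'Float64' or 'float64', while adding the 'Float64' column `c` switches the target to float64.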

I'd like to get feedback on this approach before opening a PR. The latency for predictions actually seems to be lower with this, since this allows me to change the logic in

def _get_bad_pandas_dtypes(dtypes):
to just calling pd.api.types.is_numeric_dtype on each dtype.
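For illustration, a dtype check built only on `pd.api.types.is_numeric_dtype` could look roughly like this. The function name `find_non_numeric_columns` is hypothetical, and the sketch ignores any special handling lightgbm applies to categorical columns:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def find_non_numeric_columns(df: pd.DataFrame) -> list:
    """Hypothetical stand-in for a bad-dtype check: names of non-numeric columns."""
    return [
        str(column)
        for column, dtype in df.dtypes.items()
        if not is_numeric_dtype(dtype)
    ]


df = pd.DataFrame({
    "a": pd.Series([1, None, 3], dtype="Int64"),
    "b": pd.Series([True, False, None], dtype="boolean"),
    "c": pd.Series([0.1, 0.2, 0.3], dtype="float64"),
    "d": pd.Series(["x", "y", "z"], dtype="object"),
})
bad_columns = find_non_numeric_columns(df)
```

Nullable integer and boolean columns pass the check alongside the regular numpy dtypes, so only the object column `d` is reported.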

@jmoralez jmoralez reopened this Jan 4, 2022
jameslamb pushed a commit that referenced this issue Feb 24, 2022
…4927)

* map nullable dtypes to regular float dtypes

* cast x3 to float after introducing missing values

* add test for regular dtypes

* use .astype and then values. update nullable_dtypes test and include test for regular numpy dtypes

* more specific allowed dtypes. test no copy when single float dtype df

* use np.find_common_type. set np.float128 to None when it isn't supported

* set default as type(None)

* move tests that use lgb.train to test_engine

* include np.float32 when finding common dtype

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

* add linebreak

Co-authored-by: Nikita Titov <[email protected]>
@DanielMS93

Hi @jmoralez thanks for pushing this ahead.

I am currently using lightgbm v3.3.2 and attempting to use Int64 as a dtype for some features, but I am still seeing the usual error

ValueError: DataFrame.dtypes for data must be int, float or bool.

Can you confirm the status of this feature? Is this a bug or am I missing something?

@jmoralez
Collaborator

Hi @DanielMS93. The changes were merged recently, so they weren't included in the 3.3.2 version; they will be available in the next release. If you want to use them now you can install from GitHub.

@StrikerRUS
Collaborator

If you want to use them now you can install from GitHub.

... or install a nightly build if you are not comfortable compiling C++ code
https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#nightly-builds

@leahmcguire

It seems like they also failed to make it into the 3.3.3 release. At least I don't see it in the release notes, and I am still seeing the error with an Int64 column:

 File "/usr/local/lib/python3.9/site-packages/lightgbm/basic.py", line 594, in _data_from_pandas
    raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n

Any word on when this will make it in?

@jameslamb
Collaborator

@leahmcguire v3.3.3 was a special patch release just to keep CRAN from removing the R package. It doesn't contain the fix from #4927.

The change from #4927 will be in v4.0.0. We don't have an estimated date, but you can subscribe to #5153 to be notified when that release goes out.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023