Add support for pandas nullable types to the sklearn api #4173

freddyaboulton · 2021-04-13T19:19:18Z

Summary

I would like to use LightGBM's scikit-learn API with the new pandas nullable types. Currently, LightGBM does not recognize these dtypes as valid. Reproducer using lightgbm 3.0.0 and pandas 0.24.1:

from lightgbm.sklearn import LGBMClassifier
import pytest
import pandas as pd

for dtype in ['Int64', 'Float64', 'boolean']:
    
    lr = LGBMClassifier()

    # ValueError if only features use nullable types
    df = pd.DataFrame({"a": pd.Series([True, False, True, False], dtype=dtype),
                       "b": pd.Series([4, 5, 6])})
    y = pd.Series([1, 0, 1, 0])
    
    with pytest.raises(ValueError):
        lr.fit(df, y)
    
    # ValueError if only target uses nullable types
    df = pd.DataFrame({"a": pd.Series([True, False, True, False])})
    y = pd.Series([1, 0, 1, 0], dtype=dtype)
    
    with pytest.raises(ValueError):
        lr.fit(df, y)

# No ValueError if use "old" types
for dtype in ['int', 'float', 'bool']:
    
    df = pd.DataFrame({"a": pd.Series([True, False, True, False], dtype=dtype)})
    y = pd.Series([1, 0, 1, 0])

    lr = LGBMClassifier()
    lr.fit(df, y)

Motivation

As a user, I would like to build a data processing pipeline using the latest pandas features. I can work around LightGBM's limitations, but doing so is cumbersome and some information could be lost in the conversion.
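For illustration, one possible form of such a workaround is sketched below. It assumes that casting every nullable numeric or boolean column to float64 is acceptable, which is exactly where information can be lost (large integers lose precision and `pd.NA` becomes `np.nan`). The helper name `to_numpy_dtypes` is hypothetical and is not part of lightgbm or pandas.

```python
import numpy as np
import pandas as pd
from pandas.api.types import (
    is_bool_dtype,
    is_extension_array_dtype,
    is_numeric_dtype,
)


def to_numpy_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Cast nullable numeric/boolean columns to float64 (hypothetical helper).

    pd.NA is mapped to np.nan. Other columns are left untouched.
    """
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        # Only touch pandas extension dtypes that hold numbers or booleans.
        if is_extension_array_dtype(dtype) and (
            is_numeric_dtype(dtype) or is_bool_dtype(dtype)
        ):
            out[col] = out[col].to_numpy(dtype="float64", na_value=np.nan)
    return out


df = pd.DataFrame({
    "a": pd.Series([1, None, 3, 4], dtype="Int64"),
    "b": pd.Series([True, False, None, True], dtype="boolean"),
    "c": [0.5, 1.5, 2.5, 3.5],
})
converted = to_numpy_dtypes(df)
```

After the conversion every column in `converted` is a plain numpy float64 column, so it can be passed to `LGBMClassifier.fit` on versions that reject nullable dtypes.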

Thank you!

Description

References

Pandas nullable integer dtype documentation.


StrikerRUS · 2021-04-14T20:08:07Z

I'd like to highlight that

IntegerArray is currently experimental. Its API or implementation may change without warning.

But this warning has been there at least since 2019: #2486 (comment).

StrikerRUS · 2021-04-14T20:10:59Z

Closed in favor of tracking this in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.

jmoralez · 2022-01-04T03:59:37Z

Reopening since I'm working on this.

My understanding of

LightGBM/python-package/lightgbm/basic.py

Lines 549 to 551 in af5b40e

    
    data = data.values
    if data.dtype != np.float32 and data.dtype != np.float64:
        data = data.astype(np.float32)

is that the data gets converted to either float32 or float64, depending on the common dtype.

So my approach is checking if there are any 'Float64' (nullable dtype) or 'float64' (regular numpy dtype) in the dataframe. If there are any then the target dtype is 'float64', otherwise it is 'float32'. Then use data.to_numpy with the target dtype.
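A minimal sketch of the approach described above, for readers following along. It assumes a reasonably recent pandas in which `DataFrame.to_numpy` accepts `na_value`; the function name `pandas_to_float_array` is hypothetical, and this is not necessarily the code that was eventually merged.

```python
import numpy as np
import pandas as pd


def pandas_to_float_array(df: pd.DataFrame) -> np.ndarray:
    """Convert a numeric DataFrame to a float32 or float64 array (sketch)."""
    # 'Float64' (nullable) and 'float64' (numpy) both select float64;
    # comparing lowercased names covers the two spellings.
    has_float64 = any(str(dtype).lower() == "float64" for dtype in df.dtypes)
    target_dtype = np.float64 if has_float64 else np.float32
    # na_value maps pd.NA in nullable columns to np.nan.
    return df.to_numpy(dtype=target_dtype, na_value=np.nan)


# No 64-bit float column: the result is float32.
df32 = pd.DataFrame({
    "a": pd.Series([1, None, 3], dtype="Int64"),
    "b": [True, False, True],
})
arr32 = pandas_to_float_array(df32)

# One nullable Float64 column: the result is float64.
df64 = pd.DataFrame({
    "a": pd.Series([1, None, 3], dtype="Int64"),
    "c": pd.Series([0.5, None, 2.5], dtype="Float64"),
})
arr64 = pandas_to_float_array(df64)
```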

I'd like to get feedback on this approach before opening a PR. The latency for predictions actually seems to be lower with this, since this allows me to change the logic in

LightGBM/python-package/lightgbm/basic.py

Line 504 in af5b40e

def _get_bad_pandas_dtypes(dtypes):

to just calling pd.api.types.is_numeric_dtype on each dtype.
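As a rough illustration of that simplification, the sketch below flags every column whose dtype is not numeric. The function name and return shape are assumptions made for this example and may differ from the actual `_get_bad_pandas_dtypes` in lightgbm. Note that `is_numeric_dtype` treats both numpy `bool` and the nullable `boolean` dtype as numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def get_bad_pandas_dtypes(dtypes: pd.Series) -> list:
    """Return (column, dtype name) pairs for non-numeric columns (sketch)."""
    return [
        (column, str(dtype))
        for column, dtype in dtypes.items()
        if not is_numeric_dtype(dtype)
    ]


df = pd.DataFrame({
    "int_nullable": pd.Series([1, None], dtype="Int64"),
    "flag": pd.Series([True, None], dtype="boolean"),
    "ratio": [0.1, 0.2],
    "label": ["x", "y"],
})
# Only the string column is reported as unsupported.
bad = get_bad_pandas_dtypes(df.dtypes)
```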

…4927)
* map nullable dtypes to regular float dtypes
* cast x3 to float after introducing missing values
* add test for regular dtypes
* use .astype and then values. update nullable_dtypes test and include test for regular numpy dtypes
* more specific allowed dtypes. test no copy when single float dtype df
* use np.find_common_type. set np.float128 to None when it isn't supported
* set default as type(None)
* move tests that use lgb.train to test_engine
* include np.float32 when finding common dtype
* Apply suggestions from code review
  Co-authored-by: Nikita Titov <[email protected]>
* add linebreak
  Co-authored-by: Nikita Titov <[email protected]>

DanielMS93 · 2022-03-28T17:28:23Z

Hi @jmoralez thanks for pushing this ahead.

I am currently using lightgbm v3.3.2 and attempting to use Int64 as a dtype for some features, but I am still seeing the usual error:

ValueError: DataFrame.dtypes for data must be int, float or bool.

Can you confirm the status of this feature? Is this a bug, or am I missing something?

jmoralez · 2022-03-28T17:51:21Z

Hi @DanielMS93. The changes were merged recently, so they weren't included in version 3.3.2; they will be available in the next release. If you want to use them now, you can install from GitHub.

StrikerRUS · 2022-03-28T22:30:04Z

If you want to use them now you can install from GitHub.

... or install a nightly build if you are not comfortable compiling C++ code:
https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#nightly-builds

leahmcguire · 2022-10-31T18:03:22Z

It seems like they also failed to make it into the 3.3.3 release. At least I don't see it in the release notes, and I am still seeing the error with an Int64:

 File "/usr/local/lib/python3.9/site-packages/lightgbm/basic.py", line 594, in _data_from_pandas
    raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n

Any word on when this will make it in?

jameslamb · 2022-10-31T18:33:38Z

@leahmcguire v3.3.3 was a special patch release just to keep CRAN from removing the R package. It doesn't contain the fix from #4927.

The change from #4927 will be in v4.0.0. We don't have an estimated date, but you can subscribe to #5153 to be notified when that release goes out.
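For anyone who needs to guard code on this, a small sketch of a version check follows. It assumes only what is stated above, namely that released versions from 4.0.0 onward contain the change; the helper name is hypothetical, and builds from source or nightly builds made before that release may report a 3.x version even though they include the change.

```python
def has_nullable_dtype_support(version: str) -> bool:
    """Return True if a released version string is 4.0.0 or later (sketch)."""
    # Only the major component matters for this check.
    major = int(version.split(".")[0])
    return major >= 4


old_release = has_nullable_dtype_support("3.3.3")
new_release = has_nullable_dtype_support("4.0.0")
```

In practice the argument would be `lightgbm.__version__`.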

github-actions · 2023-08-15T20:14:54Z

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

freddyaboulton changed the title from "Add support for nullable types to the sklearn api" to "Add support for pandas nullable types to the sklearn api" Apr 13, 2021

shiyu1994 added the feature request label Apr 14, 2021

StrikerRUS mentioned this issue Apr 14, 2021

Feature Requests & Voting Hub #2302

Open

StrikerRUS closed this as completed Apr 14, 2021

StrikerRUS added the help wanted label Apr 14, 2021

jmoralez reopened this Jan 4, 2022

jmoralez mentioned this issue Jan 6, 2022

[python-package] add support for pandas nullable types #4927

Merged

jameslamb closed this as completed in #4927 Feb 24, 2022

github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
