[dask] DaskRegressor.predict() fails on DataFrame / Series input #3861

jameslamb opened this issue Jan 26, 2021 · 15 comments


jameslamb commented Jan 26, 2021

How are you using LightGBM?

LightGBM component: Python-package

Environment info

Operating System: Ubuntu 18.04

C++ compiler version: gcc 8.3.0

CMake version: 3.13.4

Python version:

output of 'conda info'
     active environment : saturn
    active env location : /opt/conda/envs/saturn
            shell level : 0
       user config file : /home/jovyan/.condarc
 populated config files : /opt/conda/.condarc
          conda version : 4.8.2
    conda-build version : not installed
         python version :
       virtual packages : __glibc=2.28
       base environment : /opt/conda  (writable)
           channel URLs :
          package cache : /opt/conda/pkgs
       envs directories : /opt/conda/envs
               platform : linux-64
             user-agent : conda/4.8.2 requests/2.22.0 CPython/3.7.7 Linux/4.14.203-156.332.amzn2.x86_64 debian/10 glibc/2.28
                UID:GID : 1000:100
             netrc file : None
           offline mode : False

LightGBM version or commit hash:

Error message and / or logs

Training with lightgbm.dask.DaskLGBMRegressor succeeds, and .predict() fails with this error.

ValueError: Metadata inference failed in `_predict_part`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
TypeError('Unknown type of parameter:y, got:Series')

  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/dask/dataframe/", line 174, in raise_on_meta_error
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/dask/dataframe/", line 5165, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/", line 319, in _predict_part
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/", line 707, in predict
    pred_leaf=pred_leaf, pred_contrib=pred_contrib, **kwargs)
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/", line 3118, in predict
    predictor = self._to_predictor(deepcopy(kwargs))
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/", line 3204, in _to_predictor
    predictor = _InnerPredictor(booster_handle=self.handle, pred_parameter=pred_parameter)
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/", line 638, in __init__
    self.pred_parameter = param_dict_to_str(pred_parameter)
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/", line 221, in param_dict_to_str
    % (key, type(val).__name__))

Reproducible example(s)

I'll update this with a better, smaller reproducible example soon. I'm rushing right now to finish something else for work, but wanted to be sure I document this so search engines return this issue if others google that error message.

I'm training and trying to .predict() on a Dask DataFrame. Something like this.

import dask.dataframe as dd
import lightgbm as lgb
from dask.distributed import wait

taxi_train = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    storage_options={"anon": True},
).sample(frac=0.01, replace=False)

def prep_df(df: dd.DataFrame, target_col: str) -> dd.DataFrame:
    """
    Prepare a raw taxi dataframe for training.
        * computes the target ('tip_fraction')
        * adds features
        * removes unused features
    """
    numeric_feat = [
    categorical_feat = [
    features = numeric_feat + categorical_feat
    df = df[df.fare_amount > 0]  # avoid divide-by-zero
    df[target_col] = df.tip_amount / df.fare_amount

    df["pickup_weekday"] = df.tpep_pickup_datetime.dt.weekday
    df["pickup_weekofyear"] = df.tpep_pickup_datetime.dt.isocalendar().week
    df["pickup_hour"] = df.tpep_pickup_datetime.dt.hour
    df["pickup_week_hour"] = (df.pickup_weekday * 24) + df.pickup_hour
    df["pickup_minute"] = df.tpep_pickup_datetime.dt.minute
    df = df[features + [target_col]].astype(float).fillna(-1)

    return df

target_col = "tip_fraction"
taxi_train = prep_df(taxi_train, target_col)

taxi_train = taxi_train.persist()
_ = wait(taxi_train)

features = [c for c in taxi_train.columns if c != target_col]

data = taxi_train[features]
label = taxi_train[target_col]

dask_reg = lgb.dask.DaskLGBMRegressor(
    categorical_features=[6, 7]
)

taxi_test = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-02.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    storage_options={"anon": True},
).sample(frac=0.01, replace=False)

taxi_test = prep_df(taxi_test, target_col=target_col)

taxi_test = taxi_test.persist()
_ = wait(taxi_test)

preds = dask_reg.predict(

See the output of conda env export below for versions of Dask and its dependencies.

I think that changing the uses of map_blocks() and map_partitions() based on this description of meta from the Dask docs could fix this issue.


An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

But I'm confused and concerned about this error showing up, since it does not show up in any of the existing tests, and we test against Dask DataFrame inputs there.

For anyone new to LightGBM looking to help with this before I get to it, here's the place where we're using _predict_part() in map_partitions() -->

if isinstance(data, dd._Frame):
    return data.map_partitions(

Collaborator Author

I marked this "good first issue" only because I think that for someone who's experienced with Dask, they might be able to fix this without needing too much LightGBM knowledge.

Hi, James. There's a .values at the end of the map_partitions

Collaborator Author

😱😱😱 good eye! I think that behavior is inconsistent and should change, but it still doesn't explain the bug, right? Because if that method returned a Dask DataFrame, presumably you'd get this same error calling .compute() on that result, right?

Collaborator Author

is taxi_train != taxi_test?

Was going too fast, sorry. That was a very hastily-written issue, and I'll add a better reproducible example when I can. I just edited it to define taxi_train correctly.

Collaborator Author

I think that behavior is inconsistent and should change

I take it back, now I remember why there's a .values. This is so .predict() in the Dask interface always returns a Dask array regardless of input type, just like .predict() in the sklearn interface always returns a numpy array.

import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor

reg = LGBMRegressor()


num_features = 20
num_rows = 1000

X = pd.DataFrame({
    "col" + str(i): np.random.random(num_rows)
    for i in range(num_features)
})

y = np.random.random(num_rows)
reg.fit(X, y)

preds = reg.predict(X)

print(f"input type: {type(X)}, \npred type: {type(preds)}")

That makes sense. I couldn't reproduce the error, I tried on local and remote clusters.

Collaborator Author

I see some issues that suggest that this error might happen when an input contains NaNs:

I've also found the place where this happens (I think). There are some calls in DataFrame.map_partitions() that try to figure out the metadata of the response from the function being called with .map_partitions().

This is the internal function in Dask that raises the error in the original post here:

I'll try soon to create a clean reproducible example. I believe I know how to fix this, but without that repro we won't be able to test a fix.

jmoralez commented Feb 3, 2021

Not sure if it's entirely related but the predict also fails if there are categoricals.

import dask
import lightgbm as lgb
from dask.distributed import Client

client = Client()
dtypes = {
    'name': 'category',
    'id': int,
    'x': float,
    'y': float
}

ddf = dask.datasets.timeseries(freq='1H', dtypes=dtypes)
X, y = ddf.drop('y', axis=1), ddf.y
reg = lgb.dask.DaskLGBMRegressor().fit(X, y)
ValueError: Metadata inference failed in `_predict_part`.                
You have supplied a custom function and Dask is unable to                
determine the type of output that that function returns.                 
To resolve this please provide a meta= keyword.                          
The docstring of the Dask function you ran should have more information.                                                                          
Original error is below:                                                                                                                          
ValueError("could not convert string to float: 'Alice'")                 
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/dask/dataframe/", line 167, in raise_on_meta_error
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/dask/dataframe/", line 5310, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/lightgbm/", line 352, in _predict_part
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/lightgbm/", line 697, in predict
    X = _LGBMCheckArray(X, accept_sparse=True, force_all_finite=False)
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/sklearn/utils/", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/sklearn/utils/", line 616, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/numpy/core/", line 83, in asarray
    return array(a, dtype, copy=False, order=order)

If I do:

Xc = X.compute()

It works as expected. I use categoricals a lot so I would really like to see this work, I'd like to work on this. Do you think it should be a separate issue or is related?

Collaborator Author

jameslamb commented Feb 3, 2021

AH @jmoralez !!!! Maybe you found the secret to reproducing this!

Collaborator Author

jameslamb commented Feb 3, 2021

Thank you for the nice reproducible example, this could be the issue!

I'd like to work on this

I'd love if you can fix this. Do you have time to work on it over the next few days? Sorry for the rush, but this is one of the issues I want to fix before we do a 3.2.0 release of lightgbm: #3872 (comment). If you're not comfortable committing to working on this over the next few days, I'll make this my next priority and start on it tomorrow. If you are comfortable with the time constraints, then this is all yours.

jmoralez commented Feb 3, 2021

Oh I meant it as in: if no one's taking it I'd like to check it out, haha. I'm not sure I'd be able to pull it off, I prefer maybe helping you with some findings or discussions.

Collaborator Author

Haha ok, thanks! I actually get some dedicated time to work on LightGBM, so how about I try this tomorrow and open a draft PR, and maybe I'll @ you for a review and other ideas?

Now that you found a small reproducible example, it should go quickly.

@jameslamb jameslamb self-assigned this Feb 3, 2021
Collaborator Author

jameslamb commented Feb 4, 2021

Alright, I think I have a fix for this in #3908. I wanted to post some more debugging information here.

Thanks to your huge help discovering that category columns were the issue, @jmoralez, I came up with the reproducible example below. I wanted something a little lower-level than using dask.datasets.timeseries, just so we could more easily change it while experimenting.

import dask
import dask.array as da
import dask.dataframe as dd
import pandas as pd
import numpy as np
import lightgbm as lgb
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=3)
client = Client(cluster)

def _create_data() -> pd.DataFrame:
    num_rows = 1000
    return pd.DataFrame({
        "float_col1": pd.Series(np.random.random(num_rows), dtype="float"),
        "float_col2": pd.Series(np.random.random(num_rows), dtype="float"),
        "cat_col": pd.Series(np.random.choice(["a", "b", "y", "z"], num_rows), dtype="category"),
    })

parts = [dask.delayed(_create_data)() for _ in range(5)]
ddf = dd.from_delayed(
    parts,
    meta={
        "float_col1": "float",
        "float_col2": "float",
        "cat_col": "category"
    }
)

label = da.random.random((5000, 1), (1000, 1)).to_dask_dataframe()[0]
reg = lgb.DaskLGBMRegressor()
reg.fit(ddf, y=label)

# this will fail
preds = reg.predict(ddf)

.predict() will fail with the error message in #3861 (comment). However, I noticed something really important deeper in the stack trace:

  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/", line 598, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: '__UNKNOWN_CATEGORIES__'

That comes from the logic in Dask's DataFrame.map_partitions() that tries to execute the function being mapped on a small amount of placeholder data, to infer the type of its output.
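
A rough pandas-only sketch of why that trial run fails. The two-row frame below is built by hand, as an assumption about what the placeholder data looks like based on the value in the error message; it is not Dask's own code. Converting a frame with a string-valued category column to a float array raises the same kind of error:

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the small fake partition used during metadata
# inference. The placeholder string is taken from the error message.
fake_part = pd.DataFrame({
    "float_col1": [1.0, 1.0],
    "float_col2": [1.0, 1.0],
    "cat_col": pd.Series(["__UNKNOWN_CATEGORIES__"] * 2, dtype="category"),
})

try:
    # Roughly what sklearn's check_array() does in the traceback.
    np.asarray(fake_part, dtype=float)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

This would also be consistent with the 'Alice' variant of the error reported earlier, where the categories were known and a real category value appeared instead of the placeholder.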

full error log
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/", line 167, in raise_on_meta_error
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/", line 5268, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/LightGBM/python-package/lightgbm/", line 366, in _predict_part
    result = model.predict(
  File "/opt/LightGBM/python-package/lightgbm/", line 697, in predict
    X = _LGBMCheckArray(X, accept_sparse=True, force_all_finite=False)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/", line 72, in inner_f
    return f(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/", line 598, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: '__UNKNOWN_CATEGORIES__'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/LightGBM/python-package/lightgbm/", line 728, in predict
    return _predict(
  File "/opt/LightGBM/python-package/lightgbm/", line 426, in _predict
    return data.map_partitions(
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/", line 652, in map_partitions
    return map_partitions(func, self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/", line 5318, in map_partitions
    meta = _emulate(func, *args, udf=True, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/", line 5268, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/conda/lib/python3.8/", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/", line 188, in raise_on_meta_error
    raise ValueError(msg) from e
ValueError: Metadata inference failed in `_predict_part`.

You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
ValueError("could not convert string to float: '__UNKNOWN_CATEGORIES__'")

  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/", line 167, in raise_on_meta_error
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/", line 5268, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/LightGBM/python-package/lightgbm/", line 366, in _predict_part
    result = model.predict(
  File "/opt/LightGBM/python-package/lightgbm/", line 697, in predict
    X = _LGBMCheckArray(X, accept_sparse=True, force_all_finite=False)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/", line 72, in inner_f
    return f(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/", line 598, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

