
[CLI] Categorical: Read string and convert to int on the fly #789

Closed
AbdealiLoKo opened this issue Aug 6, 2017 · 36 comments

Comments

@AbdealiLoKo

This is a feature request

It would be very useful to be able to read in Categorical values as strings ("abc", "def", etc) and convert that to integers internally.

This adds some overhead, but it would be much easier for users. Since there would be a performance cost, it could be gated behind a flag, or we could automatically check whether the column contains any alphabetic characters.

@limexp
Contributor

limexp commented Aug 8, 2017

@AbdealiJK, @guolinke,

There are two points to consider.

  1. Reading data. Should the categorical_column parameter be required, or should a feature become categorical as soon as any non-numeric value is found? I'm quite sure the decision must be based on the categorical_column parameter. Also, string representations (like 'NA' or 'nan') for missing values must be defined.

  2. Saving and exporting. We can save an additional block with the string-to-integer mapping, or we can try to use the initial string values. The latter approach could lead to many errors (for example, with legacy code).

@AbdealiLoKo
Author

I think:

  1. categorical_column would be the better and simpler approach
  2. a string-to-integer mapping is the best way to do it. Using the initial string values throughout would require a lot of refactoring, and the integer mapping is also more memory efficient

@limexp
Contributor

limexp commented Aug 8, 2017

@AbdealiJK
2. I didn't suggest refactoring everything and adding strings to the kernel. The question was about the representation of the split value in the saved model: it could be an integer index (with the mapping saved separately) or the string value itself.

@Laurae2
Contributor

Laurae2 commented Aug 8, 2017

@limexp a string value would increase the model size and is not efficient; it can also cause major issues (not to mention potential name clashes, and the question of what to do with characters that are not compliant with the text format used by LightGBM).

For instance, one space character can differ from another while being visually identical.

@limexp
Contributor

limexp commented Aug 8, 2017

@Laurae2, I totally agree. And each decision has its pros and cons.
This feature would be great for CLI, so data can be used without preprocessing. Is it really important for python or R interfaces?
In any case we lose control over NaN values.

@Laurae2
Contributor

Laurae2 commented Aug 8, 2017

@limexp R and Python already have the conversion from categoricals to integers.

But if the model deployment in production is done using CLI, one must create a script to convert categoricals to their appropriate integers (usually done with SQL or any other data warehousing software).

@dah33
Contributor

dah33 commented Aug 10, 2017

I was using categorical features on a Kaggle Kernel. pandas encodes NaNs as -1 when you convert categorical values to integer codes. This causes LightGBM to bomb out with a fatal error, which is silent on Kaggle and kills the kernel.

The solution is:

for col in categorical_vars:
    df[col] = pd.Categorical(df[col].cat.codes+1)

This assumes you have created Categorical columns in Pandas in the first place, e.g.

df[col] = df[col].astype('category')
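A minimal, self-contained check of the failure mode and the +1 workaround described above (toy data, not from the original kernel):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; any categorical column with missing values shows the issue.
df = pd.DataFrame({"pet": ["cat", "dog", np.nan, "cat"]})
df["pet"] = df["pet"].astype("category")

# pandas encodes missing values as -1, which LightGBM rejects for categoricals.
print(df["pet"].cat.codes.tolist())  # [0, 1, -1, 0]

# Shifting codes by +1 maps NaN to 0 and the real categories to 1, 2, ...
shifted = pd.Categorical(df["pet"].cat.codes + 1)
print(list(shifted))  # [1, 2, 0, 1]
```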

@guolinke
Collaborator

@dah33 you can use NA to represent missing values.
@wxchan can we add a conversion from -1 to NaN in the python package?

@Laurae2
Contributor

Laurae2 commented Aug 16, 2017

@guolinke @wxchan In the Python package we may enforce that anything negative is NaN for categorical variables (treating strictly -1 only as NaN would be strange).

@guolinke
Collaborator

@Laurae2 okay, maybe I can support this on the c++ side and give a warning for this conversion.

@henry0312
Contributor

henry0312 commented Aug 16, 2017

There is a very difficult problem: we cannot pass categories (the list of values) to the auto conversion.
This becomes a problem when one knows the true categories and not all of them appear in the training data.

for col in categorical_vars:
    df[col] = pd.Categorical(df[col], categories=['A', 'B', 'C', ... ]).codes + 1

@henry0312
Contributor

henry0312 commented Aug 16, 2017

There is another problem: we cannot know whether a value is not in the categories or is a missing value, because pandas.Categorical encodes both of them as -1.

@henry0312
Contributor

henry0312 commented Aug 16, 2017

Additionally, pandas.Categorical encodes labels to int according to their order of appearance, I guess, so we may not reproduce the same encoding when predicting.

This will be solved by passing categories (like #789 (comment)).
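A small sketch of the reproducibility problem, and how passing an explicit category list (as suggested above) fixes it. The labels are assumed for illustration:

```python
import pandas as pd

train = pd.Series(["b", "c", "b"]).astype("category")
test = pd.Series(["a", "b", "a"]).astype("category")

# Codes are derived only from the values each dataset happens to contain,
# so the same label can get different integers at train and predict time.
print(train.cat.codes.tolist())  # b=0, c=1 -> [0, 1, 0]
print(test.cat.codes.tolist())   # a=0, b=1 -> [0, 1, 0]

# Fixing the category list up front makes the encoding reproducible;
# values outside the list become -1.
cats = ["a", "b", "c"]
train_fixed = pd.Categorical(["b", "c", "b"], categories=cats)
test_fixed = pd.Categorical(["a", "b", "d"], categories=cats)
print(train_fixed.codes.tolist())  # [1, 2, 1]
print(test_fixed.codes.tolist())   # [0, 1, -1]
```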

@dah33
Contributor

dah33 commented Aug 16, 2017

Is there a distinction to be made between nominal and ordinal features when worrying about known categories that are not present in the training data?

Can ordinal values be encoded as floats just using Series.cat.codes.astype(float)? They would then be ordered correctly.

For nominal values not present in the training set, is there any advantage in encoding them as anything other than NaN? When would a node ever use them in a decision?
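The ordinal idea above can be sketched with pandas' ordered categorical dtype, assuming a hypothetical low/medium/high scale in which "medium" never appears in the data:

```python
import pandas as pd

# Hypothetical ordinal scale; "medium" never appears in the data.
s = pd.Series(["low", "high", "low"]).astype(
    pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
)

# Codes follow the declared order, and absent levels keep their slot,
# so casting to float preserves the ordinal relationship for LightGBM.
print(s.cat.codes.astype(float).tolist())  # [0.0, 2.0, 0.0]
```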

@Laurae2
Contributor

Laurae2 commented Aug 16, 2017

@dah33 ordinal features should be numerical, while nominal features should be categorical.

@dah33
Contributor

dah33 commented Aug 16, 2017

@henry0312 pandas encodes nominal categories to integers in order of appearance. Ordinal categories are encoded in a way that preserves order (0 for A, 1 for B, 2 for C, etc.), even if a category did not appear in the data frame.

@dah33
Contributor

dah33 commented Aug 16, 2017

@Laurae2 thanks for the tip. The terminology is a bit misleading, as pandas calls ordinals just a "Categorical" with an order, whereas LightGBM has categorical_features='auto', which detects Categoricals but really should only be handed nominals, as you say.

@geoHeil

geoHeil commented Aug 17, 2017

@Laurae2 I have a question regarding the encoding:
https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L532 uses eval_set[i] = (valid_x, self._le.transform(valid_y)), but _le is a scikit-learn LabelEncoder. That one definitely fails on unseen labels during transform. How does it then magically work, as outlined in #804, to properly handle unseen values?

@Laurae2
Contributor

Laurae2 commented Aug 17, 2017

@geoHeil I don't use scikit-learn transformers, as they are known to have issues in transformer / supervised machine learning settings (like scikit-learn/scikit-learn#3956, about your LabelEncoder).

It is better to prepare the categorical features yourself before feeding the DMatrix to LightGBM, or to use custom converters like the one I made in R for LightGBM; this way you know you are doing the right preprocessing out of the box:

@Laurae2
Contributor

Laurae2 commented Aug 17, 2017

@dah33 Whether an ordinal scale is continuous or discrete is still debatable (in theory). But for LightGBM, it is better to feed ordinals as numeric because:

  • Continuous treatment (numeric) respects the ordinality rule in LightGBM (greater than or less than)
  • Discrete treatment (categorical) breaks the ordinality rule in LightGBM (equal to, plus some potential conversions that lose order)

@geoHeil

geoHeil commented Aug 17, 2017

@Laurae2 a couple of days ago you mentioned that

The Python wrapper abstracts the categorical conversion (String -> Int) and converts it for you.
and that is https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/compat.py#L75
so I wonder whether I should use LightGBM's python wrapper to automate this conversion, as it still only uses a LabelEncoder, which as far as I know can't handle unseen data.

@Laurae2
Contributor

Laurae2 commented Aug 17, 2017

@geoHeil I recommend using a separate converter, because there's no way a saved model can remember something that is not native (a LightGBM saved model does not know what Python is).

As with any preprocessing step, it must be separate from the LightGBM interface (explicit), not inside it (abstracted). The conversion is done for user convenience (like what lgb.cv does), but there are inherent drawbacks the user should be aware of, as it is a preprocessing step the LightGBM model itself cannot store.

I don't know exactly how it is done in the Python package, but @wxchan probably knows more about how categorical features are handled when predicting from a model (whether it is a freshly loaded model or a newly trained model).

In R, you must apply the rule converter as a preprocessing step, which does the heavy lifting for features. If the rule converter is not saved and used, then you cannot predict properly on new data.

@geoHeil

geoHeil commented Aug 17, 2017

@Laurae2 thanks for the clarification. @wxchan can you clarify this for python?

@wxchan
Contributor

wxchan commented Aug 17, 2017

The cat.codes of pandas categorical features are saved to the model after training and read from the model during prediction.

@geoHeil

geoHeil commented Aug 17, 2017

@wxchan thanks. So https://github.com/Microsoft/LightGBM/blob/cc771df49941f1045bcca52ea97c00288d319dca/python-package/lightgbm/basic.py#L240 is storing it, but where is this information used in the transform part, i.e. where are possibly unseen categories handled?

@wxchan
Contributor

wxchan commented Aug 18, 2017

@geoHeil stored at L231, read back at L237.
Unseen values will be -1, per the pandas cat.codes rule, I think; you can make up a small dataset to check.
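One such toy check of that claim, reusing the training-time categorical dtype on new data:

```python
import pandas as pd

# Hypothetical check: reuse the dtype learned at training time on new data.
train = pd.Series(["cat", "dog"]).astype("category")
new = pd.Series(["dog", "fish"]).astype(train.dtype)

# "fish" was never seen at training time, so its code is -1.
print(new.cat.codes.tolist())  # [1, -1]
```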

@geoHeil

geoHeil commented Aug 18, 2017

@wxchan thanks. Regarding the number of levels: for a string address field, the number of distinct categorical levels is pretty big. What would you suggest in this case?

@wxchan
Contributor

wxchan commented Aug 18, 2017

@geoHeil not sure I understand your question. Do you mean a street address? I think you can either merge several rare categories into one big category, or extract some common information from this feature (like the city of the address).
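A minimal sketch of the first option, merging rare levels below an arbitrary count threshold into one bucket (toy data, hypothetical threshold of 2):

```python
import pandas as pd

# Hypothetical high-cardinality column; the threshold of 2 is arbitrary.
s = pd.Series(["oak st", "oak st", "elm rd", "pine av", "oak st"])

# Merge categories rarer than the threshold into a single "other" bucket.
counts = s.value_counts()
rare = counts[counts < 2].index
merged = s.where(~s.isin(rare), "other").astype("category")
print(merged.tolist())  # ['oak st', 'oak st', 'other', 'other', 'oak st']
```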

@geoHeil

geoHeil commented Aug 18, 2017

@wxchan exactly. I thought I had seen some min_cat and max_cat parameters in the LightGBM documentation, but I can't seem to find them now.

Is that first type of handling (merge by frequency) indeed implemented?

@wxchan
Contributor

wxchan commented Aug 18, 2017

@geoHeil no, you need to implement it on your own. I am actually not sure what the status of categorical feature handling is right now; it seems guolinke reverted it this afternoon. I will check later.

@guolinke
Collaborator

I feel like having this in the CLI version is not needed, and it is also too heavy.
A tradeoff solution is providing a python script that converts the strings to ints, which is much easier.
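A pure-python sketch of such a conversion script (hypothetical helper, not an official tool). It keeps one label-to-int mapping per column so train and test data stay consistent:

```python
def encode_column(rows, col, mapping=None):
    """Replace string values in one column with integer codes.

    `rows` is a list of lists (e.g. parsed CSV), `col` a column index, and
    `mapping` an existing label->int dict so train and test stay consistent.
    """
    mapping = {} if mapping is None else mapping
    for row in rows:
        value = row[col]
        if value not in mapping:
            mapping[value] = len(mapping)
        row[col] = mapping[value]
    return mapping

# Hypothetical usage: encode column 1 of an in-memory table.
rows = [["1.5", "cat"], ["2.0", "dog"], ["0.3", "cat"]]
mapping = encode_column(rows, 1)
print(rows)  # [['1.5', 0], ['2.0', 1], ['0.3', 0]]
```

Saving `mapping` alongside the model would also address the export question discussed earlier in the thread.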

@guolinke guolinke changed the title Categorical: Read string and convert to int on the fly [CLI] Categorical: Read string and convert to int on the fly Oct 26, 2017
@AbdealiLoKo
Author

Having a python script would not be ideal for some projects, because installations on clusters can be tedious and adding more dependencies is not a good idea.
Especially if lightgbm is later modified to work with S3, Redshift, or other filesystems, that would get messy, as file IO for a new filesystem would have to be handled twice: in C and in python.

@guolinke
Collaborator

@AbdealiJK thanks for your thoughts.
However, having this is not trivial. It would break much of the IO code in the current implementation and also have a large impact on IO speed.

For the dependencies problem, you can use a binary program instead of scripts.
Also, I don't think it needs many dependencies; for example, you could implement this in pure python.

@AbdealiLoKo
Author

I was actually looking to contribute this and realized that it indeed is not very trivial.

I agree that reading and writing can be handled by wrappers or pre-processing scripts, as implementing this would not be worth the effort.

@jsh9

jsh9 commented Mar 17, 2018

I use lightgbm in Python, and I also would love for lightgbm to encode categorical features internally (i.e., "on the fly").

Having read the discussions above, I acknowledge that this is not trivial, but I think it can be implemented in the data preparation stage (i.e., when the user creates the lightgbm.Dataset object), so the training stage does not need any changes.

Here are some details:

  • If a column is not among the "categorical features" but string values are encountered in it, throw an error
  • If a column is a legitimate "categorical feature", use "label encoding" to turn it into 0, 1, 2, 3, ...
  • If a null value (None or numpy.nan) is in the column, e.g., ['cat', 'dog', np.nan, 'cat', 'dog', ...], convert the null into the string "N/A"
  • Save the label-encoding dictionaries (e.g., {'pet': {1: 'cat', 2: 'dog', 3: 'fish'}, 'country': {1: 'US', 2: 'UK', 3: 'France'}}, a dictionary of dictionaries) as an attribute of the lightgbm.Dataset object. Users can later query these dictionaries should they want to "decode" [1, 2, 3, ...] back to ['cat', 'dog', 'fish', ...]

In this way, the lightgbm.Dataset object being passed to lightgbm.train is still a matrix with only numerical values, so the training subroutines do not need to be altered at all.
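A rough sketch of this proposal as a standalone helper. The function name and return shape are hypothetical, not the actual lightgbm API; codes here follow pandas' sorted-category order rather than the 1-based numbering in the example dictionaries:

```python
import numpy as np
import pandas as pd

def encode_categoricals(df, categorical_features):
    """Sketch of the proposed Dataset-side encoding (hypothetical helper).

    Returns a numeric frame plus per-column decoding dictionaries.
    """
    df = df.copy()
    decoders = {}
    for col in df.columns:
        has_strings = df[col].map(lambda v: isinstance(v, str)).any()
        if col not in categorical_features:
            if has_strings:
                raise ValueError(f"string values in non-categorical column {col!r}")
            continue
        # Null values become an explicit "N/A" category before encoding.
        filled = df[col].where(df[col].notna(), "N/A").astype("category")
        decoders[col] = dict(enumerate(filled.cat.categories))
        df[col] = filled.cat.codes
    return df, decoders

# Hypothetical usage
df = pd.DataFrame({"pet": ["cat", "dog", np.nan], "x": [1.0, 2.0, 3.0]})
encoded, decoders = encode_categoricals(df, ["pet"])
print(encoded["pet"].tolist())  # [1, 2, 0] ('N/A' sorts first)
print(decoders["pet"])          # {0: 'N/A', 1: 'cat', 2: 'dog'}
```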

@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.


10 participants