
[CLI] Categorical: Read string and convert to int on the fly #789

Closed
AbdealiLoKo opened this issue Aug 6, 2017 · 36 comments

Comments

@AbdealiLoKo

This is a feature request

It would be very useful to be able to read in Categorical values as strings ("abc", "def", etc) and convert that to integers internally.

This adds some overhead, but it would be much easier for users. Since there would be a performance cost, it could be gated behind a flag, or we could automatically check whether the column contains any alphabetic characters.

@limexp
Contributor

limexp commented Aug 8, 2017

@AbdealiJK, @guolinke,

There are two points to consider.

  1. Reading data. Should the categorical_column parameter be required, or should a feature become categorical as soon as any non-numeric value is found? I'm quite sure the decision must be based on the categorical_column parameter. Also, string representations (like 'NA' or 'nan') for missing values must be defined.

  2. Saving and exporting. We can save an additional block with the string-to-integer mapping, or we can try to use the initial string values. The latter approach could lead to many errors (for example, with legacy code).

@AbdealiLoKo
Author

I think:

  1. categorical_column would be the better and simpler approach
  2. a string-to-integer mapping is the best way to do it. Using the initial string values throughout would require a lot of refactoring, and the integer mapping is also more memory efficient

@limexp
Contributor

limexp commented Aug 8, 2017

@AbdealiJK
2. I didn't suggest refactoring everything and adding strings to the kernel. The question was about the representation of the split value in the saved model: it could be an integer index (with the mapping saved separately) or the string value itself.

@Laurae2
Contributor

Laurae2 commented Aug 8, 2017

@limexp a string value would increase the model size and is not efficient; it can also cause major issues (not to mention potential name clashes, and the question of what to do with characters that are not compliant with the text format used by LightGBM).

For instance, one space character can differ from another while being visually identical.

@limexp
Contributor

limexp commented Aug 8, 2017

@Laurae2, I totally agree. And each decision has its pros and cons.
This feature would be great for CLI, so data can be used without preprocessing. Is it really important for python or R interfaces?
In any case we lose control over NaN values.

@Laurae2
Contributor

Laurae2 commented Aug 8, 2017

@limexp R and Python already have the conversion from categoricals to integers.

But if the model deployment in production is done using CLI, one must create a script to convert categoricals to their appropriate integers (usually done with SQL or any other data warehousing software).

@dah33
Contributor

dah33 commented Aug 10, 2017

I was using categorical features on a Kaggle Kernel. pandas encodes NaNs as -1 when you convert categorical values to integer codes. This causes LightGBM to bomb out with a fatal error, which is silent on Kaggle and kills the kernel.

The solution is:

for col in categorical_vars:
    df[col] = pd.Categorical(df[col].cat.codes+1)

This assumes you have created Categorical columns in Pandas in the first place, e.g.

df[col] = df[col].astype('category')
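A minimal, self-contained check of the failure mode and the +1 workaround described above (toy data, not from the original kernel):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; any categorical column with missing values shows the issue.
df = pd.DataFrame({"pet": ["cat", "dog", np.nan, "cat"]})
df["pet"] = df["pet"].astype("category")

# pandas encodes missing values as -1, which LightGBM rejects for categoricals.
print(df["pet"].cat.codes.tolist())  # [0, 1, -1, 0]

# Shifting codes by +1 maps NaN to 0 and the real categories to 1, 2, ...
shifted = pd.Categorical(df["pet"].cat.codes + 1)
print(list(shifted))  # [1, 2, 0, 1]
```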

@guolinke
Collaborator

@dah33 you can use NA to represent missing values.
@wxchan can we add a conversion from -1 to NaN in the python package?

@Laurae2
Contributor

Laurae2 commented Aug 16, 2017

@guolinke @wxchan In the Python package we may enforce that anything negative is NaN for categorical variables (treating strictly -1 only as NaN would be strange).

@guolinke
Collaborator

@Laurae2 okay, maybe I can support this on the c++ side and give a warning for this conversion.

@henry0312
Contributor

henry0312 commented Aug 16, 2017

There is a very difficult problem: we cannot pass categories (the list of values) to the auto conversion.
This becomes a problem when one knows the true categories and not all of them appear in the training data.

for col in categorical_vars:
    df[col] = pd.Categorical(df[col], categories=['A', 'B', 'C', ... ]).codes + 1

@henry0312
Contributor

henry0312 commented Aug 16, 2017

There is another problem: we cannot know whether a value is not in the categories or is a missing value, because pandas.Categorical encodes both of them as -1.

@henry0312
Contributor

henry0312 commented Aug 16, 2017

Additionally, pandas.Categorical encodes labels to int according to their order of appearance, I guess, so we may not reproduce the same encoding when predicting.

This will be solved by passing categories (like #789 (comment)).
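A small sketch of the reproducibility problem, and how passing an explicit category list (as suggested above) fixes it. The labels are assumed for illustration:

```python
import pandas as pd

train = pd.Series(["b", "c", "b"]).astype("category")
test = pd.Series(["a", "b", "a"]).astype("category")

# Codes are derived only from the values each dataset happens to contain,
# so the same label can get different integers at train and predict time.
print(train.cat.codes.tolist())  # b=0, c=1 -> [0, 1, 0]
print(test.cat.codes.tolist())   # a=0, b=1 -> [0, 1, 0]

# Fixing the category list up front makes the encoding reproducible;
# values outside the list become -1.
cats = ["a", "b", "c"]
train_fixed = pd.Categorical(["b", "c", "b"], categories=cats)
test_fixed = pd.Categorical(["a", "b", "d"], categories=cats)
print(train_fixed.codes.tolist())  # [1, 2, 1]
print(test_fixed.codes.tolist())   # [0, 1, -1]
```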

@dah33
Contributor

dah33 commented Aug 16, 2017

Is there a distinction to be made between nominal and ordinal features when worrying about known categories that are not present in the training data?

Can ordinal values be encoded as floats just using Series.cat.codes.astype(float)? They would then be ordered correctly.

For nominal values not present in the training set, is there any advantage in encoding them as anything other than NaN? When would a node ever use them in a decision?
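The ordinal idea above can be sketched with pandas' ordered categorical dtype, assuming a hypothetical low/medium/high scale in which "medium" never appears in the data:

```python
import pandas as pd

# Hypothetical ordinal scale; "medium" never appears in the data.
s = pd.Series(["low", "high", "low"]).astype(
    pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
)

# Codes follow the declared order, and absent levels keep their slot,
# so casting to float preserves the ordinal relationship for LightGBM.
print(s.cat.codes.astype(float).tolist())  # [0.0, 2.0, 0.0]
```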

@Laurae2
Contributor

Laurae2 commented Aug 16, 2017

@dah33 ordinal features should be numerical, while nominal features should be categorical.

@dah33
Contributor

dah33 commented Aug 16, 2017

@henry0312 pandas encodes nominal categories to integers in order of appearance. Ordinal categories are encoded in a way that preserves order (0 for A, 1 for B, 2 for C, etc.), even if a category did not appear in the data frame.

@dah33
Contributor

dah33 commented Aug 16, 2017

@Laurae2 thanks for the tip. The terminology is a bit misleading, as pandas calls ordinals just a "Categorical" with an order, whereas LightGBM has categorical_features='auto', which detects Categoricals but really should only be handed nominals, as you say.

@geoHeil

geoHeil commented Aug 17, 2017

@Laurae2 I have a question regarding the encoding:
https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L532 uses eval_set[i] = (valid_x, self._le.transform(valid_y)), but _le is a scikit-learn LabelEncoder. That one definitely fails on unseen labels during transform. How does it then magically work, as outlined in #804, to properly handle unseen values?

@Laurae2
Contributor

Laurae2 commented Aug 17, 2017

@geoHeil I don't use scikit-learn transformers, as they are known to have issues in transformer / supervised machine learning settings (like scikit-learn/scikit-learn#3956, about your LabelEncoder).

It is better to prepare the categorical features yourself before feeding the DMatrix to LightGBM, or to use custom converters like the one I made in R for LightGBM; this way you know you are doing the right preprocessing out of the box:

@Laurae2
Contributor

Laurae2 commented Aug 17, 2017

@dah33 Whether an ordinal scale is continuous or discrete is still debatable (in theory). But for LightGBM, it is better to feed ordinals as numeric because:

  • Continuous treatment (numeric) respects the ordinality rule in LightGBM (greater than or less than)
  • Discrete treatment (categorical) breaks the ordinality rule in LightGBM (equal to, plus some potential conversions that lose order)

@geoHeil

geoHeil commented Aug 17, 2017

@Laurae2 a couple of days ago you mentioned that

The Python wrapper abstracts the categorical conversion (String -> Int) and converts it for you.
and that is https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/compat.py#L75
so I wonder whether I should use LightGBM's python wrapper to automate this conversion, as it still only uses a LabelEncoder, which as far as I know can't handle unseen data.

@Laurae2
Contributor

Laurae2 commented Aug 17, 2017

@geoHeil I recommend using a separate converter, because there's no way a saved model can remember something that is not native (a LightGBM saved model does not know what Python is).

As with any preprocessing step, it must be separate from the LightGBM interface (explicit), not inside it (abstracted). The conversion is done for user convenience (like what lgb.cv does), but there are inherent drawbacks the user should be aware of, as it is a preprocessing step the LightGBM model itself cannot store.

I don't know exactly how it is done in the Python package, but @wxchan probably knows more about how categorical features are handled when predicting from a model (whether it is a freshly loaded model or a newly trained model).

In R, you must apply the rule converter as a preprocessing step, which does the heavy lifting for features. If the rule converter is not saved and used, then you cannot predict properly on new data.

@geoHeil

geoHeil commented Aug 17, 2017

@Laurae2 thanks for the clarification. @wxchan can you clarify this for python?

@wxchan
Contributor

wxchan commented Aug 17, 2017

The cat.codes of pandas categorical features are saved to the model after training and read from the model during prediction.

@geoHeil

geoHeil commented Aug 17, 2017

@wxchan thanks. So https://github.com/Microsoft/LightGBM/blob/cc771df49941f1045bcca52ea97c00288d319dca/python-package/lightgbm/basic.py#L240 is storing it, but where is this information used in the transform part, i.e. where are possibly unseen categories handled?

@wxchan
Contributor

wxchan commented Aug 18, 2017

@geoHeil stored at L231, read back at L237.
Unseen values will be -1, per the pandas cat.codes rule, I think; you can make up a small dataset to check.
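One such toy check of that claim, reusing the training-time categorical dtype on new data:

```python
import pandas as pd

# Hypothetical check: reuse the dtype learned at training time on new data.
train = pd.Series(["cat", "dog"]).astype("category")
new = pd.Series(["dog", "fish"]).astype(train.dtype)

# "fish" was never seen at training time, so its code is -1.
print(new.cat.codes.tolist())  # [1, -1]
```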

@geoHeil

geoHeil commented Aug 18, 2017

@wxchan thanks. Regarding the number of levels: for a string address field, the number of distinct categorical levels is pretty big. What would you suggest in this case?

@wxchan
Contributor

wxchan commented Aug 18, 2017

@geoHeil not sure I understand your question. Do you mean a street address? I think you can either merge several rare categories into one big category, or extract some common information from this feature (like the city of the address).
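A minimal sketch of the first option, merging rare levels below an arbitrary count threshold into one bucket (toy data, hypothetical threshold of 2):

```python
import pandas as pd

# Hypothetical high-cardinality column; the threshold of 2 is arbitrary.
s = pd.Series(["oak st", "oak st", "elm rd", "pine av", "oak st"])

# Merge categories rarer than the threshold into a single "other" bucket.
counts = s.value_counts()
rare = counts[counts < 2].index
merged = s.where(~s.isin(rare), "other").astype("category")
print(merged.tolist())  # ['oak st', 'oak st', 'other', 'other', 'oak st']
```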

@geoHeil

geoHeil commented Aug 18, 2017

@wxchan exactly. I thought I had seen some min_cat and max_cat parameters in the LightGBM documentation, but I can't seem to find them now.

Is that first type of handling (merge by frequency) indeed implemented?

@wxchan
Contributor

wxchan commented Aug 18, 2017

@geoHeil no, you need to implement it on your own. I am actually not sure what the status of categorical feature handling is right now; it seems guolinke reverted it this afternoon. I will check later.

@guolinke
Collaborator

I feel like having this in the CLI version is not needed, and it is also too heavy.
A tradeoff solution is providing a python script that converts the strings to ints, which is much easier.
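A pure-python sketch of such a conversion script (hypothetical helper, not an official tool). It keeps one label-to-int mapping per column so train and test data stay consistent:

```python
def encode_column(rows, col, mapping=None):
    """Replace string values in one column with integer codes.

    `rows` is a list of lists (e.g. parsed CSV), `col` a column index, and
    `mapping` an existing label->int dict so train and test stay consistent.
    """
    mapping = {} if mapping is None else mapping
    for row in rows:
        value = row[col]
        if value not in mapping:
            mapping[value] = len(mapping)
        row[col] = mapping[value]
    return mapping

# Hypothetical usage: encode column 1 of an in-memory table.
rows = [["1.5", "cat"], ["2.0", "dog"], ["0.3", "cat"]]
mapping = encode_column(rows, 1)
print(rows)  # [['1.5', 0], ['2.0', 1], ['0.3', 0]]
```

Saving `mapping` alongside the model would also address the export question discussed earlier in the thread.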

@guolinke guolinke changed the title Categorical: Read string and convert to int on the fly [CLI] Categorical: Read string and convert to int on the fly Oct 26, 2017
@AbdealiLoKo
Author

Having a python script would not be ideal for some projects, because installations on clusters can be tedious and adding more dependencies is not a good idea.
Especially if lightgbm is later modified to work with S3, Redshift, or other filesystems, that would get messy, as file IO for a new filesystem would have to be handled twice: in C and in python.

@guolinke
Collaborator

@AbdealiJK thanks for your thoughts.
However, having this is not trivial. It would break much of the IO code in the current implementation and also have a large impact on IO speed.

For the dependencies problem, you can use a binary program instead of scripts.
Also, I don't think it needs many dependencies; for example, you could implement this in pure python.

@AbdealiLoKo
Author

I was actually looking to contribute this and realized that it indeed is not very trivial.

I agree that reading and writing can be handled by wrappers or pre-processing scripts, as implementing this would not be worth the effort.

@jsh9

jsh9 commented Mar 17, 2018

I use lightgbm in Python, and I also would love for lightgbm to encode categorical features internally (i.e., "on the fly").

Having read the discussions above, I acknowledge that this is not trivial, but I think it can be implemented in the data preparation stage (i.e., when the user creates the lightgbm.Dataset object), so the training stage does not need any changes.

Here are some details:

  • If a column is not among the "categorical features" but string values are encountered in it, throw an error
  • If a column is a legitimate "categorical feature", use "label encoding" to turn it into 0, 1, 2, 3, ...
  • If a null value (None or numpy.nan) is in the column, e.g., ['cat', 'dog', np.nan, 'cat', 'dog', ...], convert the null into the string "N/A"
  • Save the label-encoding dictionaries (e.g., {'pet': {1: 'cat', 2: 'dog', 3: 'fish'}, 'country': {1: 'US', 2: 'UK', 3: 'France'}}, a dictionary of dictionaries) as an attribute of the lightgbm.Dataset object. Users can later query these dictionaries should they want to "decode" [1, 2, 3, ...] back to ['cat', 'dog', 'fish', ...]

In this way, the lightgbm.Dataset object being passed to lightgbm.train is still a matrix with only numerical values, so the training subroutines do not need to be altered at all.
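A rough sketch of this proposal as a standalone helper. The function name and return shape are hypothetical, not the actual lightgbm API; codes here follow pandas' sorted-category order rather than the 1-based numbering in the example dictionaries:

```python
import numpy as np
import pandas as pd

def encode_categoricals(df, categorical_features):
    """Sketch of the proposed Dataset-side encoding (hypothetical helper).

    Returns a numeric frame plus per-column decoding dictionaries.
    """
    df = df.copy()
    decoders = {}
    for col in df.columns:
        has_strings = df[col].map(lambda v: isinstance(v, str)).any()
        if col not in categorical_features:
            if has_strings:
                raise ValueError(f"string values in non-categorical column {col!r}")
            continue
        # Null values become an explicit "N/A" category before encoding.
        filled = df[col].where(df[col].notna(), "N/A").astype("category")
        decoders[col] = dict(enumerate(filled.cat.categories))
        df[col] = filled.cat.codes
    return df, decoders

# Hypothetical usage
df = pd.DataFrame({"pet": ["cat", "dog", np.nan], "x": [1.0, 2.0, 3.0]})
encoded, decoders = encode_categoricals(df, ["pet"])
print(encoded["pet"].tolist())  # [1, 2, 0] ('N/A' sorts first)
print(decoders["pet"])          # {0: 'N/A', 1: 'cat', 2: 'dog'}
```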

@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.


10 participants