[CLI] Categorical: Read string and convert to int on the fly #789
Comments
@AbdealiJK, @guolinke, There are two points to consider.
I think:
@AbdealiJK
@limexp String values would increase model size and are not efficient. They can also cause major issues, not to mention the potential name clashes and the question of what to do with characters that are not compliant with the text format used by LightGBM. For instance, one space can differ from another space while being visually identical.
@Laurae2, I totally agree, and each decision has its pros and cons.
@limexp R and Python already have the conversion from categoricals to integers. But if the model is deployed in production using the CLI, one must create a script to convert categoricals to their appropriate integers (usually done with SQL or other data warehousing software).
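The kind of conversion script mentioned above could look like the following minimal sketch. It assumes a comma-separated file with a header row. The function name `encode_columns`, the column names, and the choice to assign codes in order of first appearance are all assumptions made for illustration, not something LightGBM prescribes.

```python
import csv
import io


def encode_columns(rows, categorical_cols, mapping=None):
    """Replace string values in the given columns with integer codes.

    `mapping` maps column name -> {string value: int code}. Codes are
    assigned in order of first appearance (an arbitrary choice here).
    """
    mapping = {} if mapping is None else mapping
    encoded = []
    for row in rows:
        new_row = dict(row)
        for col in categorical_cols:
            codes = mapping.setdefault(col, {})
            value = row[col]
            if value not in codes:
                codes[value] = len(codes)  # next unused integer
            new_row[col] = codes[value]
        encoded.append(new_row)
    return encoded, mapping


# Hypothetical input data, read from a string instead of a real file.
raw = "label,city,age\n1,Paris,30\n0,Tokyo,25\n1,Paris,41\n"
rows = list(csv.DictReader(io.StringIO(raw)))
encoded, mapping = encode_columns(rows, ["city"])
```

The returned `mapping` has to be kept and passed back in when encoding data for prediction, otherwise the same string may end up with a different integer.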
I was using categorical features on a Kaggle Kernel. pandas encodes NaNs as -1 when you convert the values of a categorical to integer codes. This causes LightGBM to bomb out with a fatal error. The failure is silent on Kaggle and kills the kernel. The solution is:
This assumes you have created Categorical columns in Pandas in the first place, e.g.
@Laurae2 Okay, maybe I can support this on the C++ side and give a warning for this conversion.
There is a very difficult problem: we cannot pass categories (a list of values) for automatic conversion.
    for col in categorical_vars:
        df[col] = pd.Categorical(df[col].cat.codes+1, categories=['A', 'B', 'C', ... ])
There is another problem: we cannot tell whether a value is absent from the categories or is a missing value, because pandas.Categorical encodes both of them as -1.
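The ambiguity described here can be reproduced in a few lines. This is only a sketch: the category list and values are made up, and checking the raw values before encoding is one possible workaround, not what LightGBM does internally.

```python
import pandas as pd

known = ["A", "B", "C"]
values = ["A", "D", None, "C"]  # "D" is not a known category, None is missing

# Both the unknown value and the missing value receive the code -1.
codes = pd.Categorical(values, categories=known).codes

# To keep the two cases apart, inspect the raw values before encoding.
raw = pd.Series(values, dtype="object")
is_missing = raw.isna()
is_unseen = ~is_missing & ~raw.isin(known)
```

After encoding, the information about which -1 was which is gone, so any distinction has to be made on the raw values.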
Additionally, pandas.Categorical encodes labels to integers according to their order of appearance, I guess, so we may not reproduce the same encoding when predicting. This will be solved by passing categories (like #789 (comment)).
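Whatever order pandas uses when it infers categories, the inferred set depends on the data at hand, which is why the encoding can differ between training and prediction. The sketch below shows the effect and how passing an explicit category list avoids it. The data is made up, and reusing the training categories through `pd.CategoricalDtype` is an illustration of "passing categories", not the mechanism LightGBM uses.

```python
import pandas as pd

train = pd.Series(["b", "a", "c"])
new = pd.Series(["c", "b"])  # "a" does not occur at prediction time

# Categories inferred separately from each data set.
inferred_train = train.astype("category")
inferred_new = new.astype("category")

# Reusing the category list from training gives a stable encoding.
dtype = pd.CategoricalDtype(categories=list(inferred_train.cat.categories))
fixed_new = new.astype(dtype)

train_code_of_c = inferred_train.cat.codes.iloc[2]   # code of "c" in training
inferred_code_of_c = inferred_new.cat.codes.iloc[0]  # code of "c" when re-inferred
fixed_code_of_c = fixed_new.cat.codes.iloc[0]        # code of "c" with fixed categories
```

With re-inferred categories the code for "c" changes, while the fixed category list reproduces the training code.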
Is there a distinction to be made between nominal and ordinal features when worrying about known categories that are not present in the training data? Can ordinal values be encoded as floats just using Series.cat.codes.astype(float)? They would then be ordered correctly. For nominal values not present in the training set, is there any advantage in encoding them as anything other than NaN? When would a node ever use them in a decision?
@dah33 Ordinal features should be numerical, while nominal features should be categorical.
@henry0312 pandas encodes nominal categories to integers in order of appearance. Ordinal categories are encoded in a way that preserves order (0 for A, 1 for B, 2 for C, etc.), even if a category did not appear in the data frame.
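The ordinal case can be sketched as follows, combining it with the earlier idea of using `Series.cat.codes.astype(float)`. The level names follow the A, B, C example above; replacing the -1 code with NaN is an assumption about how one might want missing values to look to LightGBM.

```python
import numpy as np
import pandas as pd

levels = ["A", "B", "C"]  # declared order: A < B < C
dtype = pd.CategoricalDtype(categories=levels, ordered=True)

# "B" never appears in the data, yet it keeps its slot in the encoding.
s = pd.Series(["C", "A", None]).astype(dtype)

codes = s.cat.codes.astype(float)
codes = codes.where(codes >= 0, np.nan)  # turn the -1 "missing" code into NaN
```

Because the codes come from the declared levels rather than from the values present, "C" maps to 2 even though "B" is absent.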
@Laurae2 Thanks for the tip. The terminology is a bit misleading, as pandas calls an ordinal just a "Categorical" with an order, whereas LightGBM has categorical_feature='auto', which detects Categoricals but, as you say, should really only be handed nominals.
@Laurae2 I have a question regarding the encoding:
@geoHeil I don't use scikit-learn transformers, as they have known issues in supervised machine learning (like scikit-learn/scikit-learn#3956 about your LabelEncoder). It is better to prepare the categorical features yourself before feeding the data to LightGBM, or to use custom converters like the one I made in R for LightGBM. This way you know you are doing the right preprocessing out of the box:
@dah33 Whether an ordinal scale is continuous or discrete is still debatable (in theory). But for LightGBM, it is better to feed ordinal features as numeric because:
@Laurae2 A couple of days ago you mentioned that
@geoHeil I recommend using a separate converter because there's no way a saved model can remember something which is not native (a LightGBM saved model does not know what Python is). As with any preprocessing steps, they must be separate from the LightGBM interface (explicit), not inside the LightGBM interface (abstracted). The conversion is done for the user's convenience. I don't know exactly how it is done in the Python package, but @wxchan probably knows more about how categorical features are handled when predicting from a model (whether it is a freshly loaded model or a newly trained model). In R, you must pass the rule converter as a preprocessing step, which does the heavy lifting for the features. If the rule converter is neither saved nor used, then you cannot predict properly from new data.
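A separate, explicit converter of the kind described here can be sketched as below. This is not the R converter mentioned above: `fit_rules`, `apply_rules`, and the JSON storage are hypothetical names and choices for illustration, and returning `None` for unseen values is an assumption about how one might want to treat them (for example as missing).

```python
import json


def fit_rules(columns):
    """Build {column: {value: code}} from training data given as {column: [values]}."""
    return {
        col: {v: i for i, v in enumerate(sorted(set(values)))}
        for col, values in columns.items()
    }


def apply_rules(columns, rules, unseen=None):
    """Encode columns with previously fitted rules; unknown values map to `unseen`."""
    return {
        col: [rules[col].get(v, unseen) for v in values]
        for col, values in columns.items()
    }


# Training time: fit the rules and serialize them alongside the model.
rules = fit_rules({"color": ["red", "green", "red", "blue"]})
saved = json.dumps(rules)

# Prediction time: restore the rules and apply them to new data.
restored = json.loads(saved)
encoded = apply_rules({"color": ["green", "purple"]}, restored)
```

Keeping the rules in their own file makes the preprocessing step explicit: without the saved rules, new data cannot be encoded the way the training data was.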
cat.codes of pandas categorical features will be saved to the model after training and read from the model during prediction.
@wxchan Thanks. So https://github.com/Microsoft/LightGBM/blob/cc771df49941f1045bcca52ea97c00288d319dca/python-package/lightgbm/basic.py#L240 is storing it, but where is this information used in the transform part, i.e. where possibly unseen categories are handled?
@geoHeil It is stored at L231 and read at L237.
@wxchan Thanks. Regarding the number of levels: for a string address field, for example, the number of distinct categorical levels is pretty big. What would you suggest in this case?
@geoHeil Not sure I understand your question. Do you mean a street address? I think you can either merge several rare categories into one big category, or extract some common information from this feature (like the city of the address).
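The first suggestion, merging rare categories into one big category, could be done along these lines before the data reaches LightGBM. The function name `merge_rare`, the threshold, and the placeholder label are all hypothetical.

```python
from collections import Counter


def merge_rare(values, min_count=2, other="__other__"):
    """Replace every value occurring fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


cities = ["Paris", "Paris", "Tokyo", "Oslo", "Tokyo", "Lima"]
merged = merge_rare(cities)
```

The set of values that were kept has to be remembered for prediction, so that values rare in training are also sent to the merged category later.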
@wxchan exactly, I thought I had seen some Is that first type of handling (merge by frequency) indeed implemented?
@geoHeil No, you need to implement it on your own. I am actually not sure what the status of categorical feature handling is right now; it seems guolinke reverted it this afternoon. I will check it later.
I feel that having this in the CLI version is not needed, and it would also be too heavy.
Having a Python script would not be ideal for some projects, because installations on clusters can be tedious and having more dependencies would not be a good idea.
@AbdealiJK Thanks for your thoughts. For the dependency problem, you can use a binary program instead of scripts.
I was actually looking to contribute this and realized that it is indeed not trivial. I agree that reading and writing can be handled by wrappers or pre-processing scripts, as implementing this would not be worth the effort.
I use lightgbm in Python, and I also would love for lightgbm to encode categorical features internally (i.e., "on the fly"). Having read the discussions above, I acknowledge that this is not trivial, but I think it can be implemented in the data preparation stage (i.e., when user creates the Here are some details:
In this way, the
Closed in favor of being in #2302. We decided to keep all feature requests in one place. Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.
This is a feature request
It would be very useful to be able to read in Categorical values as strings ("abc", "def", etc) and convert that to integers internally.
This adds a bit of overhead, but would be much easier for users. Because of that overhead, a flag could be added for users who want the conversion, or we could automatically check whether the column contains any alphabetic characters.
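The automatic check suggested here could be approximated by testing whether every non-empty cell of a column parses as a number. The sketch below is an assumption about how such a check might work, written in Python for illustration, and is not part of the LightGBM CLI. The helper names are hypothetical.

```python
def is_numeric(text):
    """Return True if the cell parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def string_columns(header, rows):
    """Names of columns holding at least one non-empty, non-numeric cell."""
    return [
        name
        for i, name in enumerate(header)
        if any(row[i] != "" and not is_numeric(row[i]) for row in rows)
    ]


header = ["label", "city", "age"]
rows = [["1", "Paris", "30"], ["0", "Tokyo", "25.5"], ["1", "", ""]]
detected = string_columns(header, rows)
```

Such a check costs a full pass over the data, and markers like "NA" would flag an otherwise numeric column, which is one reason an explicit flag may be the safer option.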