Inconsistent performance using ordered categorical variables vs numerical equivalents with LightGBM #971
Unanswered
harshvardhaniimi asked this question in Q&A
Replies: 1 comment · 1 reply
-
I have found that using the numerical equivalents of ordered categorical variables consistently results in better performance than using the categorical variables directly. Here's a brief explanation of the issue:

When using pandas, I can set an order for categorical variables with `cat.set_categories(..., ordered=True)`. This lets me define how the categories should be ordered when using `cat.codes` to obtain their numerical equivalents. However, when I use these categorical variables directly in LightGBM models, the performance is consistently worse than when I use the numerical equivalents.

As far as I understand, `pd.Categorical` variables should be treated as numeric variables under the hood (please correct me if I'm wrong), so I am curious why the models trained on the numerical equivalents consistently perform better. Is there an issue with how FLAML handles ordered categorical variables for LightGBM? Or is there a specific reason why the performance differs between these two approaches? Any help or explanation would be greatly appreciated.
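To make the two encodings concrete, here is a minimal sketch of what is being compared; the category levels and values are hypothetical:

```python
import pandas as pd

# Hypothetical ordered levels, as in the question.
levels = ["primary", "secondary", "bachelor", "master"]

s = pd.Series(["bachelor", "primary", "master", "secondary"])

# Ordered categorical: the dtype records primary < secondary < bachelor < master.
s_cat = s.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Numerical equivalent: integer codes that respect the declared order.
s_num = s_cat.cat.codes

print(list(s_cat.cat.categories))  # ['primary', 'secondary', 'bachelor', 'master']
print(list(s_num))                 # [2, 0, 3, 1]
```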
-
FLAML doesn't vary the categorical features. They are passed as-is to LightGBM. Do you observe the same issue when using LightGBM without FLAML?
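One way to follow the reply's suggestion is to train plain LightGBM on both encodings of the same column, with FLAML out of the loop. Note that LightGBM's native categorical handling searches over unordered partitions of the category set rather than `<= threshold` splits, so a declared pandas ordering is effectively ignored; integer codes, by contrast, allow ordered threshold splits, which can matter when the target is monotone in the level. The synthetic data, column name, and hyperparameters below are illustrative only:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical monotone relationship: the target grows with the ordered level.
levels = ["low", "medium", "high", "very_high"]
codes = rng.integers(0, len(levels), size=2000)
y = codes + rng.normal(scale=0.5, size=2000)

cat_col = pd.Series(pd.Categorical.from_codes(codes, categories=levels, ordered=True))

X_cat = pd.DataFrame({"level": cat_col})            # pandas category dtype
X_num = pd.DataFrame({"level": cat_col.cat.codes})  # plain integer codes

for name, X in [("categorical", X_cat), ("numeric codes", X_num)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = lgb.LGBMRegressor(n_estimators=200, random_state=0)
    # A pandas category column is auto-detected as a categorical feature.
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = float(np.sqrt(np.mean((pred - y_te) ** 2)))
    print(f"{name}: RMSE = {rmse:.4f}")
```

If the gap persists in this plain-LightGBM comparison, the behavior comes from LightGBM's categorical handling rather than from FLAML.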