Inconsistent performance using ordered categorical variables vs numerical equivalents with LightGBM #971
Unanswered
harshvardhaniimi asked this question in Q&A
Replies: 1 comment · 1 reply
-
I have found that using the numerical equivalents of ordered categorical variables consistently results in better performance than using the categorical variables directly. Here's a brief explanation of the issue:

When using pandas, I can set an order for categorical variables with `cat.set_categories(..., ordered=True)`. This lets me define how the categories should be ordered when using `cat.codes` to obtain their numerical equivalents. However, when I use these categorical variables directly in LightGBM models, the performance is consistently worse than when I use the numerical equivalents.

As far as I understand, `pd.Categorical` variables should be treated as numeric variables under the hood (please correct me if I'm wrong), so I am curious why the models trained on the numerical equivalents consistently perform better. Is there an issue with how FLAML handles ordered categorical variables for LightGBM? Or is there a specific reason why the performance differs between these two approaches? Any help or explanation would be greatly appreciated.
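To make the two encodings concrete, here is a minimal sketch of what is being compared; the category levels and values are hypothetical:

```python
import pandas as pd

# Hypothetical ordered levels, as in the question.
levels = ["primary", "secondary", "bachelor", "master"]

s = pd.Series(["bachelor", "primary", "master", "secondary"])

# Ordered categorical: the dtype records primary < secondary < bachelor < master.
s_cat = s.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Numerical equivalent: integer codes that respect the declared order.
s_num = s_cat.cat.codes

print(list(s_cat.cat.categories))  # ['primary', 'secondary', 'bachelor', 'master']
print(list(s_num))                 # [2, 0, 3, 1]
```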
-
FLAML doesn't vary the categorical features. They are passed as-is to LightGBM. Do you observe the same issue when using LightGBM without FLAML?
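One way to follow the reply's suggestion is to train plain LightGBM on both encodings of the same column, with FLAML out of the loop. Note that LightGBM's native categorical handling searches over unordered partitions of the category set rather than `<= threshold` splits, so a declared pandas ordering is effectively ignored; integer codes, by contrast, allow ordered threshold splits, which can matter when the target is monotone in the level. The synthetic data, column name, and hyperparameters below are illustrative only:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical monotone relationship: the target grows with the ordered level.
levels = ["low", "medium", "high", "very_high"]
codes = rng.integers(0, len(levels), size=2000)
y = codes + rng.normal(scale=0.5, size=2000)

cat_col = pd.Series(pd.Categorical.from_codes(codes, categories=levels, ordered=True))

X_cat = pd.DataFrame({"level": cat_col})            # pandas category dtype
X_num = pd.DataFrame({"level": cat_col.cat.codes})  # plain integer codes

for name, X in [("categorical", X_cat), ("numeric codes", X_num)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = lgb.LGBMRegressor(n_estimators=200, random_state=0)
    # A pandas category column is auto-detected as a categorical feature.
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = float(np.sqrt(np.mean((pred - y_te) ** 2)))
    print(f"{name}: RMSE = {rmse:.4f}")
```

If the gap persists in this plain-LightGBM comparison, the behavior comes from LightGBM's categorical handling rather than from FLAML.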