[Bug]: MRE mappings for category dtype fields that drop categories in the pipeline #234
Closed
Labels
bug
What happened?
If you groupby a column of category dtype (pd.CategoricalDtype) and some of its declared categories do not appear in the data, pandas still generates an empty group for each of them. This is the groupby behavior with observed=False, which is the default in pandas 2.2.0.
If a category column in the data loses categories somewhere in the pipeline, for example through mappings or rare encoding, then the MRE encodings calculated from a groupby over the categorical levels will still include the lost levels, each mapped to np.nan. This should not break the code, since the lost category is mapped / rare encoded in the pipeline either way and therefore uses the MRE value of the category it gets mapped to.
If we want to reverse the MRE mappings for interpretability / plotting, however, the code will error: several categories can now share the same value (np.nan) in the dictionary, so the mapping is no longer one-to-one and its keys and values can't be inverted.
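A minimal sketch of the behavior described above, not taken from the project code. The column names (segment, target), the data, and the use of a plain group mean as a stand-in for the MRE encoding are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: "c" and "d" are declared categories that no longer
# appear in any row, as if an earlier pipeline step had mapped them away.
df = pd.DataFrame(
    {
        "segment": pd.Categorical(
            ["a", "a", "b"], categories=["a", "b", "c", "d"]
        ),
        "target": [1.0, 3.0, 5.0],
    }
)

# Stand-in for the encoding step (assumption: a per-level mean).
# observed=False keeps every declared category, including the empty ones.
encodings = df.groupby("segment", observed=False)["target"].mean().to_dict()

# Levels whose encoding is NaN because their group was empty.
nan_levels = [level for level, value in encodings.items() if pd.isna(value)]

# More than one level shares the NaN encoding, so the mapping is not
# one-to-one and has no well-defined inverse.
is_invertible = len(nan_levels) <= 1
```

Here both unused levels end up with a NaN encoding, which is the situation that prevents a clean reverse lookup.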
Below is a link to a Stack Overflow question describing the same issue, with an answer suggesting that the groupby be passed observed=True to avoid it:
https://stackoverflow.com/questions/48471648/pandas-groupby-with-categories-with-redundant-nan
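A sketch of the suggested workaround, under the same assumptions as before: hypothetical column names and data, and a plain group mean standing in for the MRE encoding. It is not the project's actual fix.

```python
import pandas as pd

# Hypothetical data with two declared but unused categories ("c", "d").
df = pd.DataFrame(
    {
        "segment": pd.Categorical(
            ["a", "a", "b"], categories=["a", "b", "c", "d"]
        ),
        "target": [1.0, 3.0, 5.0],
    }
)

# observed=True restricts the groups to levels that occur in the data,
# so no empty groups (and no NaN encodings) are produced for "c" and "d".
encodings = df.groupby("segment", observed=True)["target"].mean().to_dict()

# With the NaN entries gone, the mapping can be reversed, provided the
# remaining encodings are themselves unique.
inverse = {value: level for level, value in encodings.items()}
```

This only removes the collisions caused by unobserved levels. Two observed levels with identical encodings would still collide when the dictionary is inverted.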
Environment
This appeared in a pandas 2.2.0 environment
Minimum reproducible code
No response
Relevant error output
No response