Want to one-hot encode train and test sets that contain rare labels, while still giving importance to the top entries? No worries!
The Rare-Label-One-Hot Encoder Python package is here to help.
It is a categorical encoder, intended mainly for classical machine learning algorithms, that one-hot encodes a feature with high cardinality and rare labels in the train and test sets.
It applies a threshold (which can be user-defined) to pick the top categories/entries and treats the rest (the least significant ones) as others.
It also handles rare labels when mapping the feature from train to test and vice versa.
You can set the top-entries criterion in one of two ways: by level, which takes the top entries according to the threshold set, or by amount, which treats every entry above the threshold as a top entry and the rest as others.
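To make the idea concrete, here is a minimal pandas-only sketch of one-hot encoding with a bucket for rare labels. It is not the RLOHE implementation: the helper name `one_hot_with_rare_bucket`, its parameters, and the `others` label are assumptions chosen for illustration.

```python
import pandas as pd

def one_hot_with_rare_bucket(train, test, feature, top_n, other_label="others"):
    """Hypothetical helper (not part of RLOHE): one-hot encode `feature`,
    keeping the `top_n` most frequent training categories and bucketing
    everything else under `other_label`."""
    # The top categories are learned from the training set only.
    top = train[feature].value_counts().nlargest(top_n).index
    categories = list(top) + [other_label]

    def encode(df):
        # Rare or unseen values fall into the bucket.
        bucketed = df[feature].where(df[feature].isin(top), other_label)
        # A fixed category list gives both frames identical columns.
        fixed = pd.Series(pd.Categorical(bucketed, categories=categories), index=df.index)
        return pd.get_dummies(fixed, prefix=feature)

    return encode(train), encode(test)

# Toy data: "c" is rare in train, "d" appears only in test.
train = pd.DataFrame({"dept": ["a", "a", "a", "b", "b", "c"]})
test = pd.DataFrame({"dept": ["a", "d", "b"]})
enc_train, enc_test = one_hot_with_rare_bucket(train, test, "dept", top_n=2)
print(list(enc_train.columns))
print(list(enc_test.columns))
```

Because the category list is fixed from the training set, a label that appears only in the test set lands in the bucket column instead of creating a column the model has never seen.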
Rare-Label-One-Hot Encoder is available on PyPI as RLOHE.
Run one of the following in your terminal to install RLOHE:
1. Installing the package using pip:
pip install RLOHE
OR
pip3 install RLOHE
2. Cloning the repository:
git clone https://github.com/rahulbordoloi/Rare-Label-One-Hot-Enocder/
cd Rare-Label-One-Hot-Enocder
pip install -e .
The RLOHE package contains two functions:
- TopLabeledEntries: Reports an analysis of the top labeled entries of two given DataFrames.
- RareLabelOneHotEncoder: Returns rare-label one-hot encoded DataFrames according to the threshold set and its criterion of segregation.
It is advisable to run TopLabeledEntries first, as a sanity check, to inspect the top entries and their representation in each dataset before encoding.
1. For the TopLabeledEntries function:
Parameters | Description |
---|---|
train | The train dataset. |
test | The test dataset. |
feature_name | The feature on which encoding is to be done. |
threshold | The limit used to segregate the top entries. |
criterion | Either level or amount; determines how the top entries are picked. See the Reference for more information. |
secondary_feature | Another feature whose amount statistics are checked with respect to the primary feature. |
verbose | Controls output to the console. |
return_dataframe | Whether a DataFrame should be returned. |
2. For the RareLabelOneHotEncoder function:
Parameters | Description |
---|---|
train | The train dataset. |
test | The test dataset. |
feature_name | The feature on which encoding is to be done. |
threshold | The limit used to segregate the top entries. |
criterion | Either level or amount; determines how the top entries are picked. See the Reference for more information. |
verbose | Controls output to the console. |
prefix_name | The prefix added in front of each new encoded feature. |
Reference
* level: Takes the top threshold entries for the particular feature and labels the rest as BELOW.
* amount: Takes the entries above the threshold for the particular feature and labels the rest as BELOW.
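The difference between the two criteria can be sketched in plain pandas. This is an illustration only, not RLOHE's code: `top_entries` is a hypothetical helper, and it assumes that for amount the threshold is compared against how often each entry occurs, which this README does not state explicitly.

```python
import pandas as pd

def top_entries(values, threshold, criterion="level"):
    """Hypothetical helper mirroring the two criteria described above."""
    counts = values.value_counts()  # sorted from most to least frequent
    if criterion == "level":
        # Keep the `threshold` most frequent entries.
        return list(counts.nlargest(threshold).index)
    if criterion == "amount":
        # Assumption: keep every entry whose count exceeds `threshold`.
        return list(counts[counts > threshold].index)
    raise ValueError(f"unknown criterion: {criterion!r}")

dept = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"])
by_level = top_entries(dept, threshold=2, criterion="level")
by_amount = top_entries(dept, threshold=1, criterion="amount")
print(by_level)   # the two most frequent entries
print(by_amount)  # every entry occurring more than once
```

With level the number of top entries is fixed in advance, while with amount it depends on how the data is distributed.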
Run this script to get the top entries according to a given threshold:
# Importing Libraries
import RLOHE as encoder
import pandas as pd
# Main Method
if __name__ == '__main__':
    # Reading in Dataset
    train = pd.read_csv('https://raw.githubusercontent.com/rahulbordoloi/Rare-Label-One-Hot-Enocder/main/Data/Train_Data.csv')
    test = pd.read_csv('https://raw.githubusercontent.com/rahulbordoloi/Rare-Label-One-Hot-Enocder/main/Data/Test_Data.csv')
    # Displaying the Top Entries According to the Threshold Set
    encoder.TopLabeledEntries(train, test, feature_name = 'department_info', threshold = 10, secondary_feature = 'cost_to_pay')
Run this script to get the rare-label one-hot encoded DataFrames according to a given threshold:
# Importing Libraries
import RLOHE as encoder
import pandas as pd
# Main Method
if __name__ == '__main__':
    # Reading in Dataset
    train = pd.read_csv('https://raw.githubusercontent.com/rahulbordoloi/Rare-Label-One-Hot-Enocder/main/Data/Train_Data.csv')
    test = pd.read_csv('https://raw.githubusercontent.com/rahulbordoloi/Rare-Label-One-Hot-Enocder/main/Data/Test_Data.csv')
    # Rare Label One-Hot Encoder [Level Wise]
    encodedTrain, encodedTest = encoder.RareLabelOneHotEncoder(train, test, feature_name = 'department_info', threshold = 10,
                                                               criterion = 'level', prefix_name = 'dept')
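After encoding, it can be worth confirming that both frames ended up with the same encoded columns. The sketch below uses a hypothetical helper, `encoded_columns`, and toy frames with made-up column names standing in for `encodedTrain` and `encodedTest`. It assumes only that every new column starts with the chosen `prefix_name`.

```python
import pandas as pd

def encoded_columns(df, prefix):
    """Hypothetical helper: sorted names of the columns starting with `prefix`."""
    return sorted(col for col in df.columns if str(col).startswith(prefix))

# Toy stand-ins for encodedTrain / encodedTest; the column names are invented.
toy_train = pd.DataFrame({"dept_a": [1, 0], "dept_BELOW": [0, 1], "cost_to_pay": [10.0, 20.0]})
toy_test = pd.DataFrame({"dept_BELOW": [1], "dept_a": [0]})

# Fails loudly if the two frames disagree on the encoded columns.
assert encoded_columns(toy_train, "dept") == encoded_columns(toy_test, "dept")
```

With real data you would pass `encodedTrain`, `encodedTest`, and the same prefix you gave to `RareLabelOneHotEncoder`.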
To install RLOHE along with the tools you need to develop and run tests, execute the following in your virtualenv:
$ pip install -e .[dev]
Name : Rahul Bordoloi
Website : https://rahulbordoloi.me
Email : [email protected]