This project trains classification models to estimate the likelihood that an individual earns an annual income of $80,000 or more. The prediction draws on work experience, industry, job title, state, and several other features.
The dataset contains 17 columns and around 28,000 rows, with features such as work experience, industry, job title, state, and education degree.
- Used FuzzyWuzzy to match the different spellings of "USA", since the data was entered manually by different users.
- Grouped the education degrees into the 4 most common degree categories. Gender and Race values were grouped in the same way.
- Kept the top 10 industries and replaced all others with "Other" for convenience.
- Kept the top 500 job titles and replaced all others with "Other".
- Scaled the Bonus column using RobustScaler.
- Used frequency encoding for State, Industry, Job title, and Race due to their high cardinality.
- Used label encoding for the target variable.
- Used ordinal encoding for Education degree, Work experience, and Age.
- Used one-hot encoding for Gender.
- Removed outliers from the data.
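
Several of the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual code: `difflib` from the standard library stands in for FuzzyWuzzy, the `USA_VARIANTS` list and the `0.8` cutoff are hypothetical, and `robust_scale` follows the same median and interquartile-range idea as scikit-learn's `RobustScaler` with its default settings.

```python
from collections import Counter
from difflib import get_close_matches
from statistics import median, quantiles

# Hypothetical spellings treated as "USA"; the project's real list is not given.
USA_VARIANTS = ["usa", "us", "united states", "united states of america", "america"]


def is_usa(raw, cutoff=0.8):
    """Return True if a free-text country entry is close to a known USA spelling."""
    # Lowercase and drop punctuation so "U.S.A." becomes "usa".
    cleaned = "".join(ch for ch in raw.lower() if ch.isalpha() or ch == " ").strip()
    return bool(get_close_matches(cleaned, USA_VARIANTS, n=1, cutoff=cutoff))


def keep_top_n(values, n, other="Other"):
    """Keep the n most frequent categories and replace the rest with `other`."""
    top = {category for category, _ in Counter(values).most_common(n)}
    return [v if v in top else other for v in values]


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # fall back to 1.0 for a constant column
    med = median(values)
    return [(v - med) / iqr for v in values]


# Toy data, made up for illustration only.
industries = ["Tech", "Tech", "Tech", "Health", "Health", "Law"]
grouped = keep_top_n(industries, n=2)
encoded = frequency_encode(grouped)
scaled_bonus = robust_scale([0, 1000, 2000, 3000, 100000])
```

Because the scaling uses the median and interquartile range rather than the mean and standard deviation, a single very large bonus shifts the other scaled values far less than it would under standard scaling.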
On this dataset, Random Forest achieved the highest accuracy at 81%, followed by Decision Tree at 78%, while Naive Bayes, Logistic Regression, and KNN hovered around 76%.
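
A comparison of this kind could be set up roughly as follows with scikit-learn. This is a sketch under assumptions: synthetic data from `make_classification` stands in for the preprocessed dataset, `GaussianNB` is one possible choice of Naive Bayes variant, and the split ratio and hyperparameters are illustrative, so the accuracies it prints will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix and binary income target.
X, y = make_classification(n_samples=1000, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The five model families named in the text; settings are assumptions.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print models from most to least accurate.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Holding out a stratified test split keeps the class balance the same in both splits, which matters when accuracy is the only metric being compared.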