DATA DESCRIPTION: sensor-data.csv : (1567, 592).
- The data consists of 1567 examples each with 591 features.
- The dataset presented in this case represents a selection of such features where each example represents a single production entity with associated measured features and the labels represent a simple pass/fail yield for in house line testing.
- Target column “ –1” corresponds to a pass and “1” corresponds to a fail and the data time stamp is for that specific test point.
Steps and tasks:
- Import and explore the data.
- Data cleansing:
• Missing value treatment. • Drop attribute/s if required using relevant functional knowledge. • Make all relevant modifications on the data using both functional/logical reasoning/assumptions.
- Data analysis & visualisation:
• Perform detailed relevant statistical analysis on the data. • Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
- Data pre-processing:
• Segregate predictors vs target attributes • Check for target balancing and fix it if found imbalanced. • Perform train-test split and standardise the data or vice versa if required. • Check if the train and test data have similar statistical characteristics when compared with original data.
- Model training, testing and tuning: • Model training:
- Pick up a supervised learning model.
- Train the model.
- Use cross validation techniques.
- Apply hyper-parameter tuning techniques to get the best accuracy. Suggestion: Use all possible hyper parameter combinations to extract the best accuracies.
- Use any other technique/method which can enhance the model performance.
- Display and explain the classification report in detail.
- Design a method of your own to check if the achieved train and test accuracies might change if a different sample population can lead to new train and test accuracies.
- Apply the above steps for all possible models that you have learnt so far.
- Display and compare all the models designed with their train and test accuracies.
- Select the final best trained model along with your detailed comments for selecting this model.
- Pickle the selected model for future use.
- Import the future data file. Use the same to perform the prediction using the best chosen model from above. Display the prediction results.
- Conclusion and improvisation:
- PCA was implemented to reduce the features from 478 to 200 that essentially explained 98% of the variance in the data.
- In order to balance the samples amongst all class labels, oversampling (via SMOTE) and undersampling (via Random undersampling) was performed.
- Combined Oversampling and undersampling was performed via SMOTETomek.
- Results concluded that Oversampling via SMOTE gave the best results i.e. 80.57% test accuracy for Logistic regression model and 95.54% accuracy for Support Vector Classifier.
- GridSearch Cross Validation was also used for finding the best hyperparameters. for Logistic Regresssion.