The objective of this project is to apply classification learning models to a Sears products dataset of more than 100,000 products, in order to obtain a predictive model that identifies the categories of future products with high accuracy.
The features in the dataset include, among other things, the product's name, description, price, and image; the dataset therefore contains numeric values, string values, and images.
The project aims to categorize products into a given taxonomy according to their existing metadata (name, description, price, image), by building an ML algorithm that classifies products based on this data.
Our main challenges were:
- How do we convert string values to numeric vectors, and what is the best way to do so?
- How do we deal with a large amount of data?
- Which classifier best accomplishes the project objective?
In order to classify products that contain numeric values, string values and images, we used two different classifiers - SVM and KNN. The first step was to convert the string values to numeric vectors. As part of the conversion process, we used Bag of Words with several different sparse-matrix weightings - binary, tf, tf-idf, and sublinear tf-idf.
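As a rough illustration, the four weighting schemes map onto scikit-learn's vectorizers as follows (a sketch with toy documents; the project's actual parameters may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["red cotton shirt", "cotton shirt slim fit", "kitchen knife set"]  # toy examples

# Binary: 1 if the term occurs in the document, 0 otherwise
X_binary = CountVectorizer(binary=True).fit_transform(docs)

# Tf: raw term counts
X_tf = CountVectorizer().fit_transform(docs)

# Tf-Idf: term frequency scaled by inverse document frequency
X_tfidf = TfidfVectorizer().fit_transform(docs)

# Sublinear Tf-Idf: replaces tf with 1 + log(tf)
X_sublinear = TfidfVectorizer(sublinear_tf=True).fit_transform(docs)
```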
SVM experiments:
- Dataset weight: Tf
- Data distribution: Stratified cross validation with k folds (k = 6)
The results with different C parameters:
- Dataset weight: Tf-Idf (sublinear tf scaling, i.e. replace tf with 1 + log(tf))
- Data distribution: Stratified cross validation with k folds (k = 6)
The results with different C parameters:
- Dataset weight: Tf-Idf
- Data distribution: Stratified cross validation with k folds (k = 6)
The results with different C parameters:
- Dataset weight: Binary
- Data distribution: Stratified cross validation with k folds (k = 6)
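Across these SVM runs, only the weighting scheme and the C cost parameter change. A minimal sketch of such a C sweep with 6-fold stratified cross validation (toy texts and labels stand in for the real product data; the C values are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-ins for the real product texts (name + description) and category labels
texts = ["red cotton shirt", "blue denim jeans", "kitchen knife set",
         "steel chef knife", "slim fit shirt", "ceramic knife block"] * 6
labels = ["clothing", "clothing", "kitchen", "kitchen", "clothing", "kitchen"] * 6

X = TfidfVectorizer().fit_transform(texts)  # swap the weighting scheme as in the runs above
skf = StratifiedKFold(n_splits=6)           # stratified cross validation, k = 6
for c in [0.01, 0.1, 1.0, 10.0]:            # illustrative C values
    scores = cross_val_score(LinearSVC(C=c), X, labels, cv=skf)
    print(f"C={c}: mean accuracy {scores.mean():.3f}")
```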
KNN experiments:
- Dataset weight: HashingVectorizer
- Data distribution: random 80%-20% train-test split
- Search method: Nearest neighbors
- Distance method: Cosine
- Number of neighbors: 5
The results:
- Dataset weight: CountVectorizer
- Data distribution: random 80%-20% train-test split
- Search method: Nearest neighbors
- Distance method: Cosine
The results:
- Dataset weight: Tf-Idf, ignoring numeric values
- Data distribution: Stratified cross validation with k folds (k = 5)
- Search method: K Neighbors Classifier
- Distance method: Euclidean
The results:
- Dataset weight: Tf-Idf
- Data distribution: Stratified cross validation with k folds (k = 5)
- min_d: 1 (min_d – build the dictionary from all the words that appear in at least min_d documents)
The results:
- Dataset weight: CountVectorizer
- Data distribution: Stratified cross validation with k folds (k = 5)
- min_d: 1 (min_d – build the dictionary from all the words that appear in at least min_d documents)
The results:
- Dataset weight: HashingVectorizer
- Data distribution: Stratified cross validation with k folds (k = 5)
- min_d: 1 (min_d – build the dictionary from all the words that appear in at least min_d documents)
The results:
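The KNN runs above vary only the vectorizer, the data split, and the distance metric. A minimal scikit-learn sketch of one such setup (toy texts and labels stand in for the real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

texts = ["red cotton shirt", "blue denim jeans", "kitchen knife set",
         "steel chef knife", "slim fit shirt", "ceramic knife block"] * 10
labels = ["clothing", "clothing", "kitchen", "kitchen", "clothing", "kitchen"] * 10

# min_df=1 keeps every word that appears in at least one document (the min_d above)
X = TfidfVectorizer(min_df=1).fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

# "cosine" and "euclidean" match the distance methods listed in the runs above
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```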
The following archives contain the CSV files and must be extracted before use:
- In the main folder: 'products_clean_144K_only_name_desc_clean.rar', 'All_products_clean.rar', 'products_all_cleanest.rar'
- In the "csvfiles" folder: 'products_all_cleanest.rar', 'products_clean_144K.rar', 'products_clean_144K_price.rar'
clear_csv.py:
Use this file first. It contains the main function, which clears special characters from the csv file.
Open a terminal in the file's location and run: 'python clear_csv.py -f products.csv' (where 'products.csv' is the csv file to clean).
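A rough, hypothetical sketch of such a script (the actual character whitelist and output file naming in 'clear_csv.py' may differ):

```python
import argparse
import re

def clear_special_chars(path):
    """Remove special characters from a csv file, keeping letters, digits and basic punctuation."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    cleaned = re.sub(r"[^\w\s,.\"'-]", " ", text)  # assumed whitelist of allowed characters
    out_path = path.replace(".csv", "_clean.csv")  # assumed output naming
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(cleaned)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", required=True, help="csv file to clean")
    args = parser.parse_args()
    clear_special_chars(args.file)
```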
clean_csv.py:
Use this file after "clear_csv.py".
Change the 'file_name_load' and 'file_name_save' variables to the desired csv file names.
This file was created to perform additional cleanups of the csv file in order to improve the SVM. It contains a function that: 1. removes HTML tags; 2. removes non-letters; 3. converts to lower case and splits into individual words; 4. removes stop words; 5. stems words.
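A sketch of what this cleanup function might look like, assuming BeautifulSoup and NLTK (requires nltk.download('stopwords'); the project's actual implementation may differ in details):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def clean_text(raw):
    text = BeautifulSoup(raw, "html.parser").get_text()  # 1. remove HTML tags
    text = re.sub("[^a-zA-Z]", " ", text)                # 2. remove non-letters
    words = text.lower().split()                         # 3. lower case, split into words
    stops = set(stopwords.words("english"))
    words = [w for w in words if w not in stops]         # 4. remove stop words
    stemmer = PorterStemmer()
    return " ".join(stemmer.stem(w) for w in words)      # 5. stem words
```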
csvfixes.py:
Use this file after "clear_csv.py" and "clean_csv.py". This file was created to perform additional cleanups of the csv file, in order to improve the SVM.
Change the 'file_name_load' and 'file_name_save' variables to the desired csv file names.
The file contains four main functions (the first two are sketched below):
1. "remove_irrelevant_docs" – removes documents that belong to categories with frequency < 5.
2. "match_number_to_category_id" – adds a new column ("CategoryNumber") that sequentially assigns a number to each category id (from 1 to the number of categories).
3. "remove_unique_words" – removes words that appear only once in the entire document collection (not relevant when using "max features" in the Bag of Words function).
4. "remove_special_chars" – removes the given characters from the entire document.
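A sketch of the first two helpers, assuming a pandas DataFrame with a 'CategoryId' column (the real column names may differ):

```python
import pandas as pd

def remove_irrelevant_docs(df, min_freq=5):
    """Drop rows whose category appears fewer than min_freq times."""
    counts = df["CategoryId"].value_counts()  # 'CategoryId' is an assumed column name
    keep = counts[counts >= min_freq].index
    return df[df["CategoryId"].isin(keep)]

def match_number_to_category_id(df):
    """Add a 'CategoryNumber' column mapping each category id to 1..n_categories."""
    mapping = {cid: i + 1 for i, cid in enumerate(df["CategoryId"].unique())}
    df["CategoryNumber"] = df["CategoryId"].map(mapping)
    return df
```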
The SVM classifier file:
Use this file after using "clear_csv.py", "clean_csv.py" and "csvfixes.py".
In this file, change:
1. The 'file_name_load' variable to the desired csv file name.
2. The 'c_parameter' variable to the desired value; this is the cost parameter of the SVM.
3. The 'type_matrix' variable to the type of sparse matrix created in the 'set_cv_fit' function; this parameter is used in the names of the files created after the classification.
NOTE: in the 'set_cv_fit' function, you can change the inner 'CountVectorizer' function to 'TfidfVectorizer' (just swap the comments).
This file contains the linear SVM classification function. The classification is performed with stratified cross validation with k folds. When the classifier finishes, it prints the results to the following files (a sketch of this flow follows below):
1. A predictions file - the predictions of the SVM, with the true label on the right (comma separated).
2. A statistics file containing: the number of correct answers of the classifier, the number of incorrect answers, and the percentage of correct answers.
3. Two files listing the category numbers and the number of correct/incorrect answers per category (so that additional statistics can be derived from the results).
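A minimal sketch of this flow with scikit-learn's LinearSVC and StratifiedKFold (output file names and formats are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

def classify_and_report(X, y, c_parameter=1.0, k=6):
    """Linear SVM with stratified k-fold CV; writes predictions and statistics files."""
    y = np.asarray(y)
    correct = incorrect = 0
    with open("predictions.txt", "w") as pred_file:
        for train_idx, test_idx in StratifiedKFold(n_splits=k).split(X, y):
            clf = LinearSVC(C=c_parameter)
            clf.fit(X[train_idx], y[train_idx])
            preds = clf.predict(X[test_idx])
            for p, t in zip(preds, y[test_idx]):
                pred_file.write(f"{p},{t}\n")  # prediction, true label
            correct += int((preds == y[test_idx]).sum())
            incorrect += int((preds != y[test_idx]).sum())
    with open("statistics.txt", "w") as stat_file:
        stat_file.write(f"correct: {correct}\nincorrect: {incorrect}\n"
                        f"percent correct: {100 * correct / (correct + incorrect):.2f}%\n")
```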
The dataframe helper class:
A class of functions for working with the dataframe, mainly the name, description and category fields. Use this class for actions and dataframe manipulations such as load, remove, index, and row/column extraction. Its functions (the first two are sketched below):
- "get_features_and_labels": returns concatenated features (name + description) and labels.
- "load_and_split_data": loads a csv and splits it into an 80%-20% train-test partition.
- "reset_index": reindexes sequentially after splitting the data.
- "remove_docs_by_column": removes documents matching a given column and column value.
- "remove_rare_categories_after_split" (unused): removes documents with unique targets after splitting the data.
- "rows_index_by_column": returns the dataframe index of documents that contain given values in a given column.
- "find_rare_targets": returns a list of labels that appear fewer than 'max' times.
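A sketch of the first two helpers, assuming pandas column names 'name', 'description' and 'category' (the real names may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def get_features_and_labels(df):
    """Concatenate name and description into one text feature; return features and labels."""
    features = df["name"].fillna("") + " " + df["description"].fillna("")
    return features, df["category"]

def load_and_split_data(filename):
    """Load a csv file and split it into an 80%-20% train-test partition."""
    df = pd.read_csv(filename)
    train, test = train_test_split(df, test_size=0.2)
    return train.reset_index(drop=True), test.reset_index(drop=True)
```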
The KNN classifier file:
Use this file for KNN classification.
Class variables:
- filename: csv filename to create the DataFrame from.
- type of vectorizer: TFIDF for TfidfVectorizer, COUNT_VEC for CountVectorizer, HASH_VEC for HashingVectorizer.
- vectorizer arguments: min_d - the min_df parameter for vectorizing; max_f - max_features (for CountVectorizer and TfidfVectorizer) / n_features (for HashingVectorizer).
- k_clusters: k_clusters for the pysparnn.cluster_index.MultiClusterIndex classifier.
- nbrs: n_neighbors for KNeighborsClassifier.
- metric: the distance metric for KNeighborsClassifier.
See the PySparNN documentation:
https://github.com/facebookresearch/pysparnn
This file uses Python's scikit-learn KNN classifier (KNeighborsClassifier) and the PySparNN library for approximate nearest-neighbor search on sparse data.
Vector creation parameters: 1. train 2. test 3. min_d 4. max_f 5. selection: TFIDF / COUNT_VEC / HASH_VEC (mentioned above).
Classification:
1. KNeighborsClassifier: train (vectorized), test (vectorized), train labels, number of neighbors, distance metric.
2. PySparNN: train (vectorized), test (vectorized), train labels, k_clusters.
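A minimal PySparNN sketch following its documented MultiClusterIndex usage, storing the category labels as the records so a search returns neighbor labels directly (toy data):

```python
import pysparnn.cluster_index as ci
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["red cotton shirt", "blue denim jeans", "kitchen knife set", "steel chef knife"]
train_labels = ["clothing", "clothing", "kitchen", "kitchen"]

tv = TfidfVectorizer()
train_vec = tv.fit_transform(train_texts)

# Index the training vectors; each vector's record is its category label
index = ci.MultiClusterIndex(train_vec, train_labels)

test_vec = tv.transform(["ceramic chef knife"])
# For each test vector, return the labels of its k approximate nearest neighbors
print(index.search(test_vec, k=3, k_clusters=2, return_distance=False))
```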
The vectorizer class:
A class responsible for creating vectors and/or datasets before classification; this class is called from the 'knn' or 'svm' class.
1. Types of vectorization:
- TfidfVectorizer-related functions: create_vec_tfidf, fit_tfidf, create_vec_test.
- CountVectorizer-related functions: create_vec_cv, fit_cv, create_vec_test.
- HashingVectorizer-related functions: create_vec_hv, hash_feature.
2. SVM format conversion. Args: features - a sparse matrix of shape [n_samples, n_features]; labels - the labels corresponding to the features; filename - the file to write the vectors to. Given these arguments, "write_vec", "write_libsvm_format" and "construct_line_libsvm_format" format the dataset into an SVM-acceptable shape and write it to the file (a sketch follows below).
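A sketch of what the libsvm-format helpers might look like; scikit-learn's sklearn.datasets.dump_svmlight_file offers equivalent built-in functionality:

```python
import numpy as np
from scipy.sparse import csr_matrix

def construct_line_libsvm_format(label, row):
    """Format one sparse csr row as 'label index:value ...' (libsvm convention, 1-based indices)."""
    pairs = " ".join(f"{j + 1}:{v}" for j, v in zip(row.indices, row.data))
    return f"{label} {pairs}"

def write_libsvm_format(features, labels, filename):
    """Write a sparse feature matrix and its labels to a libsvm-format file."""
    with open(filename, "w") as f:
        for i, label in enumerate(labels):
            f.write(construct_line_libsvm_format(label, features.getrow(i)) + "\n")

# Tiny demo
X = csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]]))
write_libsvm_format(X, [1, 2], "vectors.libsvm")
```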