shiv4m/handwriting-classification-from-writer-images


handwriting-classification-from-writer-images

  1. About Project
    The aim of the project is to predict one of two classes, ‘0’ or ‘1’, given the features of writers' handwriting samples. Three different algorithms are used: Linear Regression, Logistic Regression and Neural Networks. Linear Regression is evaluated with “Error RMS” (the root-mean-square error), while Logistic Regression and Neural Networks are both evaluated with “Accuracy”.
  2. What is the Dataset about?
    The dataset consists of two sets of features observed from the handwriting images. The first is the human-observed features, which are noted manually by humans. It has 9 features for each writer image. Alongside the features there are two separate files, each listing pairs of writer ids that are being compared. The first file contains same pairs, where two images from the same writer are compared, so the target is ‘1’. The other file contains different pairs, where images from two different writers are compared, so the target is ‘0’. The second is the GSC-observed features, which are extracted by an algorithm called GSC. It has 512 features for each writer image, and it also comes with same-pair and different-pair files and their corresponding targets.
    2.1. Prepare the Dataset
      To make the data ready for processing, we need to split it into a data matrix and a target vector. We read the csv files of the same and different pairs and extract the column for each writer id. We then match those ids against the ids in the file that holds the features, which gives the features of both images in each pair. These are the features the algorithms need. There are 791 samples in the same-pair human observations and around 200,000 samples in the different-pair human observations. We take only 791 samples from each of the same and different pairs so that the data is not biased towards one class. The GSC dataset is balanced in the same way.
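
The preparation steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the tiny in-memory CSV contents and the column names (`img_id`, `img_id_A`, `img_id_B`, `target`, `f1`, `f2`) are assumptions made for the example.

```python
import csv
import io
import random

# Tiny in-memory stand-ins for the real CSV files. The contents and the
# column names are hypothetical and only illustrate the layout.
FEATURES_CSV = """img_id,f1,f2
0001a,1,0
0001b,2,1
0002a,3,1
0003a,1,2
"""
SAME_PAIRS_CSV = """img_id_A,img_id_B,target
0001a,0001b,1
"""
DIFF_PAIRS_CSV = """img_id_A,img_id_B,target
0001a,0002a,0
0001b,0003a,0
0002a,0003a,0
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


# Map each image id to its feature vector.
features = {
    row["img_id"]: [int(row["f1"]), int(row["f2"])]
    for row in read_rows(FEATURES_CSV)
}

same_pairs = read_rows(SAME_PAIRS_CSV)
diff_pairs = read_rows(DIFF_PAIRS_CSV)

# Balance the classes: keep only as many different pairs as same pairs.
rng = random.Random(0)
diff_sample = rng.sample(diff_pairs, k=len(same_pairs))

# For every pair, look up both feature vectors and record the target.
pairs = same_pairs + diff_sample
feats_a = [features[p["img_id_A"]] for p in pairs]
feats_b = [features[p["img_id_B"]] for p in pairs]
targets = [int(p["target"]) for p in pairs]
```

With the real files, the strings would be replaced by the contents of the pair and feature csv files, and the sample size would be 791 for the human-observed data.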

Each algorithm is run on two different feature settings. The first is feature subtraction: for each pair, the features of the two writer ids are taken and the absolute value of their element-wise difference is calculated. This gives 9 features for the human dataset and 512 features for the GSC dataset. The second is feature concatenation: for each pair, the features of one writer id are concatenated after those of the other, which doubles the number of features to 18 for the human dataset and 1024 for the GSC dataset.

Since the same pairs and different pairs are joined one after the other, the targets are in that same order. If the dataset were split into training, validation and testing sets as it is, the testing set would contain data unlike anything seen in training, and the resulting accuracy would be meaningless. To avoid this, we shuffle the dataset together with its targets before processing it further.
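
As a rough sketch of the two feature settings and the shuffle, assuming NumPy and toy arrays (the array names and values are made up for illustration, with 3 features per image instead of 9 or 512):

```python
import numpy as np

# Toy features for 4 pairs, 3 features per image (hypothetical values).
feats_a = np.array([[1, 0, 2], [3, 1, 1], [0, 2, 2], [1, 1, 0]])
feats_b = np.array([[1, 1, 2], [0, 1, 3], [2, 2, 0], [1, 0, 0]])
# Same pairs come first, then different pairs, as in the joined dataset.
targets = np.array([1, 1, 0, 0])

# Feature subtraction: absolute element-wise difference, shape (4, 3).
x_sub = np.abs(feats_a - feats_b)

# Feature concatenation: one image's features after the other's, shape (4, 6).
x_cat = np.hstack([feats_a, feats_b])

# Shuffle rows and targets with the same permutation so they stay aligned.
rng = np.random.default_rng(0)
perm = rng.permutation(len(targets))
x_sub, x_cat, targets = x_sub[perm], x_cat[perm], targets[perm]
```

Using a single permutation for both the data matrix and the target vector is what keeps each row paired with its own label after shuffling.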

I have taken 791 samples from each of the same and different pairs of the human dataset, giving a matrix of 1582 samples (2 × 791) with 9 features for feature subtraction and 18 features for feature concatenation. For the GSC dataset, I have taken 2000 samples from each, giving a matrix of 4000 samples with 512 features for feature subtraction and 1024 features for feature concatenation.
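
To make the shapes concrete, here is a small sketch of splitting the shuffled GSC subtraction matrix and computing the two metrics named in the About section. The 80/10/10 split, the 0.5 decision threshold and the random placeholder data are all assumptions for illustration. The README does not state the actual split or thresholds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data with the GSC feature-subtraction shape: 4000 x 512.
n_samples, n_features = 4000, 512
x = rng.random((n_samples, n_features))
t = rng.integers(0, 2, size=n_samples)

# Assumed 80/10/10 split into training, validation and testing sets.
n_train = n_samples * 8 // 10
n_val = n_samples // 10
x_train, t_train = x[:n_train], t[:n_train]
x_val, t_val = x[n_train:n_train + n_val], t[n_train:n_train + n_val]
x_test, t_test = x[n_train + n_val:], t[n_train + n_val:]


def e_rms(y, target):
    """Root-mean-square error between predictions and targets."""
    return float(np.sqrt(np.mean((y - target) ** 2)))


def accuracy(y, target, threshold=0.5):
    """Fraction of predictions that match the target after thresholding."""
    return float(np.mean((y >= threshold).astype(int) == target))


# Made-up model outputs, only to show how the metrics are called.
y_demo = np.array([0.9, 0.2, 0.6, 0.4])
t_demo = np.array([1, 0, 1, 1])
demo_rms = e_rms(y_demo, t_demo)
demo_acc = accuracy(y_demo, t_demo)
```

The same slicing applies to the human-observed matrices and to the concatenated feature sets, with only the shapes changing.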
