Spelling correction algorithms have been discussed extensively in the literature, but existing work either evaluates them with inaccurate measures and data analysis or does not take all types of OCR errors into consideration.
Kukich~\cite{kukich1992techniques} comprehensively discusses spelling correction techniques for non-word, isolated-word, and real-word spelling errors. N-gram analysis, dictionary lookup, and probabilistic techniques \cite{strohmaier2003lexical,ringlstetter2007text} are used for correcting isolated and non-word errors, while context-dependent techniques are used mostly for correcting real-word errors, including word split and join errors \cite{elmi1998spelling}.
N-gram techniques examine each n-gram in the text string and compare it against a pre-compiled table of n-gram statistics to retrieve the correct word, whereas dictionary lookup techniques directly check whether the text string appears in a dictionary using string matching algorithms. Both approaches require a dictionary or a large text corpus and take the frequency of n-grams or word occurrences into account in order to find the correct spelling.
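For concreteness, one common way to instantiate the n-gram comparison (an illustrative choice, not necessarily the measure used in the works cited above) is an overlap score such as the Dice coefficient between the character n-gram sets of the observed string $s$ and a candidate word $w$,
\[
\mathrm{sim}_n(s,w) = \frac{2\,|G_n(s) \cap G_n(w)|}{|G_n(s)| + |G_n(w)|},
\]
where $G_n(x)$ denotes the set of character n-grams of $x$; the highest-scoring dictionary words, optionally weighted by corpus frequency, are proposed as corrections.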
Probabilistic techniques use transition and confusion probabilities to estimate the likelihood of each candidate correction and thereby rank and retrieve the correct spelling.
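Schematically, and under standard noisy-channel assumptions (the notation here is illustrative rather than that of any particular cited work), such techniques score a candidate $w$ from a lexicon $V$ for an observed string $s$ as
\[
\hat{w} = \mathop{\arg\max}_{w \in V} P(w)\,P(s \mid w),
\]
where $P(w)$ is estimated from corpus frequencies and $P(s \mid w)$ from character confusion and transition probabilities.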
Context-dependent techniques, on the other hand, require contextual information and use either extensive NLP techniques or Statistical Language Modeling (SLM) for spelling correction. Bassil and Alwani~\cite{bassil2012ocr} use the Google 1--5 gram word dataset to obtain context information and determine the correct word sequence for correction.
Tong and Evans~\cite{tong1996statistical} use an SLM approach that combines information from letter n-grams, character confusion probabilities, and word bigram probabilities to perform context-sensitive spelling correction, obtaining a $60\%$ error reduction rate. These spelling correction techniques have developed over time and have been used in combination to achieve improved accuracy~\cite{brill2000improved}. Agarwal et al.~\cite{agarwal2013utilizing} use a combination of Google suggestions, the longest common subsequence (LCS), and character confusion probabilities to choose the correct spelling on a small set of historical newspaper data, achieving a recall of $51\%$ and a precision of $100\%$.
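Such SLM-based, context-sensitive correction can be summarized, again schematically, as choosing the word sequence that maximizes the product of language model and confusion probabilities, e.g.\ with a word bigram model,
\[
\hat{w}_1^m = \mathop{\arg\max}_{w_1,\dots,w_m} \prod_{i=1}^{m} P(w_i \mid w_{i-1})\,P(s_i \mid w_i),
\]
where $s_1,\dots,s_m$ is the observed OCR token sequence; the maximization can be carried out with dynamic programming (Viterbi decoding).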
The edit distance approach, suggested initially by Wagner and Fischer~\cite{wagner1974string}, is a dictionary lookup approach commonly used for OCR data correction because the large number of substitution errors in OCR data \cite{kukich1992techniques,christen2006comparison} can be corrected with this technique. Faster string edit distance variants, such as Levenshtein automata and normalized edit distance, are discussed in \cite{marzal1993computation,schulz2002fast}.
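In the Wagner--Fischer formulation, the distance $d(i,j)$ between the prefixes $s_1 \dots s_i$ of the OCR token and $w_1 \dots w_j$ of a dictionary word is computed by dynamic programming (unit edit costs shown here for simplicity),
\[
d(i,j) = \min\bigl\{\, d(i-1,j) + 1,\; d(i,j-1) + 1,\; d(i-1,j-1) + [\,s_i \neq w_j\,] \,\bigr\},
\]
with $d(i,0)=i$ and $d(0,j)=j$; dictionary words within a small distance of the OCR token are returned as correction candidates.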
Holley et al.~\cite{holley2009good} posit that it is difficult to measure the true and improved accuracy of OCR and that recorded character confidence is not equivalent to the accuracy of the OCR text. More generally, there has been little work on the automatic, word-by-word evaluation of post-OCR spelling correction on datasets containing word split and join errors: all of the above algorithms are evaluated on the percentage of spelling errors corrected or on the reduction in word error rate, and do not consider the word alignment problem that word split and join errors in the OCR text introduce.
Semi-automatic spelling correction systems \cite{taghva2001ocrspell} require user interaction to perform complete correction and system evaluation. Rice~\cite{rice1996measuring} discusses OCR errors similar to those in our dataset and evaluates edit distance based spelling correction by estimating word accuracy, using the length of the LCS between the correct and incorrect strings at the page level as the metric. This evaluation strategy works, but its definition of accuracy does not give complete coverage of the spelling correction: because no word-by-word comparison or alignment is performed during evaluation, it provides no information on the errors the spelling corrector misses.
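For illustration, a page-level word accuracy of this form can be written as
\[
\mathrm{WordAcc} = \frac{|\mathrm{LCS}(W_{\mathrm{OCR}}, W_{\mathrm{true}})|}{|W_{\mathrm{true}}|},
\]
where $W_{\mathrm{OCR}}$ and $W_{\mathrm{true}}$ are the word sequences of the corrected OCR page and the ground-truth page (the exact definition used in \cite{rice1996measuring} may differ in detail); the metric counts matched words but reveals nothing about which errors were missed or how split and join errors were aligned.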
LingPipe's spelling correction model supports exactly this kind of context-sensitive correction; a configuration sketch is given below. The model is able to split tokens as well as combine them, and it is trained on the same data that is indexed by the search engine. It assigns weights to edit operations such as deletions and insertions, performs language modeling at the character level with the character n-gram length as a parameter of the data model, and is trained with respect to a specific tokenization. Once the model and index have been written to disk, they can be read back in and used to correct queries.
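The following sketch shows how such a checker can be assembled; it follows the publicly documented LingPipe spell-checking classes, but the exact training-method signatures vary between LingPipe versions, and the weights and n-gram length shown here are placeholder values rather than our actual configuration.
\begin{verbatim}
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.spell.CompiledSpellChecker;
import com.aliasi.spell.FixedWeightEditDistance;
import com.aliasi.spell.TrainSpellChecker;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

public class SpellCheckSketch {
    public static void main(String[] args) throws Exception {
        // Character-level language model; the n-gram length is a tunable parameter.
        NGramProcessLM lm = new NGramProcessLM(5);

        // Weights for the match, delete, insert, substitute and transpose
        // operations (illustrative values on a log scale).
        FixedWeightEditDistance dist =
            new FixedWeightEditDistance(0.0, -4.0, -4.0, -4.0, -4.0);

        // Training is tied to a specific tokenization.
        TokenizerFactory tf = IndoEuropeanTokenizerFactory.INSTANCE;
        TrainSpellChecker trainer = new TrainSpellChecker(lm, dist, tf);

        // Train on the same text that the search engine indexes
        // (LingPipe 4.x handle(CharSequence); older versions differ).
        trainer.handle("text from the collection being indexed ...");

        // In practice the model is compiled to disk and read back later;
        // compile() performs that round trip in memory.
        CompiledSpellChecker checker =
            (CompiledSpellChecker) AbstractExternalizable.compile(trainer);
        checker.setTokenizerFactory(tf);

        // Edits that insert or delete spaces let the checker split and join tokens.
        System.out.println(checker.didYouMean("spellng correc tion"));
    }
}
\end{verbatim}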
A considerable body of research on OCR post-correction explores various spelling correction models and evaluation measures, but most of it does not consider word split and join errors in the data, so these measures are not directly applicable to our dataset. In particular, most evaluation schemes measure performance, such as the word error rate, over pairs of OCR words and ground-truth words; such pairs are difficult to generate when extra or missing words in the OCR text prevent a positional correspondence between the OCR and the ground-truth words. For example, \cite{reynaert2011character} considers only unigram spelling correction and does not consider gold standard words containing spaces. \cite{kolak2002ocr} suggest an algorithm for finding OCR and true word pairs, but assume that many words in each line of the OCRed document are recognized correctly and can serve as anchor points for the pairing; OCR of historical newspapers, however, produces too many errors per line for this assumption to hold.
% TODO: add the compound words evaluation paper, and give an estimate of the average number of errors per line in our OCR documents.