This repository has been archived by the owner on Jul 26, 2024. It is now read-only.
Assaf Urieli edited this page Mar 22, 2019 · 13 revisions

Jochre is an OCR package based on supervised machine learning techniques. It has been applied to several languages, including Yiddish, Occitan and Alsatian.

Working with Jochre involves four phases:

  1. Annotation - Annotating a training/evaluation corpus using JochreWeb
  2. Training - Constructing the OCR model
  3. Evaluation - Evaluating the accuracy of the OCR model
  4. Analysis - Using an existing model to analyse new scanned pages

Annotation requires the JochreWeb application.

Training and evaluation require a Jochre database constructed using JochreWeb.

Analysis requires a model constructed during training, but no longer requires the database used to construct the model.

During analysis (and evaluation), Jochre performs the following steps:

  1. Segmentation: break up the images into paragraphs, rows, groups (representing words) and shapes (representing letters). This uses ad-hoc statistical algorithms.
  2. Guessing: apply the model to guess the n most probable words for each group (this list is known as the "beam").
  3. Post-processing: use a lexicon to rerank the words in the beam, and select the most likely analysis.
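
The guessing and post-processing steps above can be illustrated with a minimal Python sketch. It is not Jochre's actual code or API: the `Candidate` class, the `top_n` and `rerank` functions, the example probabilities and the multiplicative `unknown_penalty` are all assumptions chosen only to show how a beam of candidate words might be reranked against a lexicon.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str    # candidate reading of one group (word)
    prob: float  # probability from a hypothetical letter model


def top_n(candidates, n):
    """Guessing: keep the n most probable candidates (the beam)."""
    return sorted(candidates, key=lambda c: c.prob, reverse=True)[:n]


def rerank(beam, lexicon, unknown_penalty=0.1):
    """Post-processing: down-weight words missing from the lexicon.

    The multiplicative penalty is an illustrative assumption,
    not the scoring scheme Jochre itself uses.
    """
    def score(c):
        return c.prob * (1.0 if c.word in lexicon else unknown_penalty)

    return sorted(beam, key=score, reverse=True)


# Invented model output for a single group.
candidates = [
    Candidate("tbe", 0.50),
    Candidate("the", 0.40),
    Candidate("thc", 0.07),
    Candidate("tie", 0.03),
]
lexicon = {"the", "tie"}

beam = top_n(candidates, 3)        # "tie" falls outside the beam
best = rerank(beam, lexicon)[0]    # lexicon support promotes "the"
print(best.word)
```

In this sketch the model's top guess is a non-word, and the lexicon lookup is what moves a known word to the front. A word that falls outside the beam cannot be recovered by post-processing, which is why the beam width n matters.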

See Installation for Jochre installation instructions.
